Has there been any good, serious game-theoretic modeling of what 'AI alignment' would actually look like, given diverse & numerous AI systems interacting with billions of human individuals and millions of human groups that have diverse, complex, & heterogeneous values, preferences, and goals?

Are there any plausible models in which the AI systems, individuals, and groups can reach any kind of Pareto-efficient equilibrium? 

Or is there any impossibility proof showing that such a Pareto-efficient equilibrium (i.e. true 'alignment') cannot exist?
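
To make "Pareto-efficient equilibrium" concrete, here is a minimal toy sketch (purely illustrative; the two games and their payoffs are made up and obviously don't capture billions of heterogeneous agents): it finds the pure-strategy Nash equilibria of two small games and checks whether each equilibrium is Pareto-efficient.

```python
# Toy sketch (illustrative only): find pure-strategy Nash equilibria of a
# small normal-form game and check which of them are Pareto-efficient.
# In the Prisoner's Dilemma the unique equilibrium is Pareto-dominated;
# in the Stag Hunt one of the two equilibria is Pareto-efficient.

from itertools import product

def pure_nash_equilibria(payoffs, n_actions):
    """Return all pure-strategy profiles where no player gains by deviating."""
    equilibria = []
    for profile in product(*(range(a) for a in n_actions)):
        is_eq = True
        for player, n in enumerate(n_actions):
            current = payoffs[profile][player]
            for alt in range(n):
                deviation = list(profile)
                deviation[player] = alt
                if payoffs[tuple(deviation)][player] > current:
                    is_eq = False
                    break
            if not is_eq:
                break
        if is_eq:
            equilibria.append(profile)
    return equilibria

def pareto_efficient(payoffs, profile):
    """A profile is Pareto-efficient if no other profile makes every player
    at least as well off and at least one player strictly better off."""
    u = payoffs[profile]
    for v in payoffs.values():
        if all(v[i] >= u[i] for i in range(len(u))) and any(v[i] > u[i] for i in range(len(u))):
            return False
    return True

# Prisoner's Dilemma: action 0 = cooperate, 1 = defect (made-up payoffs).
pd = {
    (0, 0): (3, 3), (0, 1): (0, 5),
    (1, 0): (5, 0), (1, 1): (1, 1),
}

# Stag Hunt: action 0 = stag, 1 = hare (made-up payoffs).
stag_hunt = {
    (0, 0): (4, 4), (0, 1): (0, 3),
    (1, 0): (3, 0), (1, 1): (3, 3),
}

for name, game in [("Prisoner's Dilemma", pd), ("Stag Hunt", stag_hunt)]:
    for eq in pure_nash_equilibria(game, (2, 2)):
        print(f"{name}: equilibrium {eq}, Pareto-efficient: {pareto_efficient(game, eq)}")
```

The question, roughly, is whether anything like the Stag Hunt's efficient equilibrium (rather than the Prisoner's Dilemma's inefficient one) is achievable at the scale of many AI systems and many human agents with heterogeneous values.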

1 Answer

CLR (https://longtermrisk.org/) works on multipolar scenarios, multi-agent systems, and game theory, covering both technical problems and macrostrategy, and prioritizes reducing conflicts that increase s-risks. The associated foundation (https://www.cooperativeai.com/) supports work on similar problems.

For a technical paper on Pareto improvements, see https://link.springer.com/article/10.1007/s10458-022-09574-6
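The linked paper's setting is richer than this, but as a bare-bones illustration of what "Pareto improvement" means (with made-up agents and payoffs), here is a sketch that checks whether an alternative outcome improves on a status quo:

```python
# Toy sketch (illustrative only): an alternative is a Pareto improvement on
# the status quo if it leaves no agent worse off and makes at least one
# agent strictly better off. All names and payoffs below are made up.

status_quo = {"agent_A": 1, "agent_B": 1, "agent_C": 2}

alternatives = {
    "deal_1": {"agent_A": 3, "agent_B": 1, "agent_C": 2},  # helps A, harms no one
    "deal_2": {"agent_A": 4, "agent_B": 0, "agent_C": 5},  # harms B, so not an improvement
}

def is_pareto_improvement(new, old):
    no_one_worse = all(new[a] >= old[a] for a in old)
    someone_better = any(new[a] > old[a] for a in old)
    return no_one_worse and someone_better

for name, deal in alternatives.items():
    print(name, is_pareto_improvement(deal, status_quo))
```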

CLR and CRS (https://centerforreducingsuffering.org/) have also worked on risks from malevolent actors (https://forum.effectivealtruism.org/posts/LpkXtFXdsRd4rG8Kb/reducing-long-term-risks-from-malevolent-actors). I'm not sure whether s-risks from sadism or retributivism are being worked on, but they're discussed briefly here: https://centerforreducingsuffering.org/research/a-typology-of-s-risks/

I imagine this work often focuses on a small number of groups, but maybe it generalizes. I'm not aware of more concrete, realistic models (as opposed to toy models, or models that aren't aiming to capture likely preferences and values), but I wouldn't be surprised if they exist. This isn't my area of focus, so I'm not that well informed. I imagine AI safety/governance groups, and especially CLR and CRS, are thinking about this, but they may not have built explicit models.

Michael - thanks very much for these links. I'll check them out!