Author: Leonard Dung
Abstract: Many researchers and intellectuals warn about extreme risks from artificial intelligence. However, these warnings typically come without systematic arguments in support. This paper provides an argument that AI will lead to the permanent disempowerment of humanity, e.g. human extinction, by 2100. It rests on four substantive premises which it motivates and defends: first, the speed of advances in AI capability, as well as the capability level current systems have already reached, suggests that it is practically possible to build AI systems capable of disempowering humanity by 2100. Second, due to incentives and coordination problems, if it is possible to build such AI, it will be built. Third, since it appears to be a hard technical problem to build AI which is aligned with the goals of its designers, and many actors might build powerful AI, misaligned powerful AI will be built. Fourth, because disempowering humanity is useful for a large range of misaligned goals, such AI will try to disempower humanity. If AI is capable of disempowering humanity and tries to disempower humanity by 2100, then humanity will be disempowered by 2100. This conclusion has immense moral and prudential significance.
My thoughts: I read through it rather quickly so take what I say with a grain of salt. That said, it seemed persuasive and well-written. Additionally, the way that they split up the argument was quite nice. I'm very happy to see an attempt to make this argument more philosophically rigorous and I hope to see more work in this vein.
Thank you for the comment, very thought-provoking! I tried to reply to each of your comments, but there is much more one could say.
First, I agree that my notion of disempowerment could have been explicated more clearly, although my elucidations fit relatively straightforwardly with your second notion (namely, perpetual oppression or extinction), not your first. I think conclusions (1) and (2) are both quite significant, although there are important ethical differences.
For the argument, the only case where this potential ambiguity makes a difference is with respect to premise 4 (the instrumental convergence premise). It would be interesting to spell out more which points there seem much more plausible with respect to notion (1) but not to (2). If one has high credence in the view that AIs will decide to compromise with humans rather than drive them extinct, this would be one example of a view which leads to a much higher credence in (1) than in (2).
“I think this quote is pretty confused and seems to rely partially on a misunderstanding of what people mean when they say that AGI cognition might be messy…”
I agree that RL does not necessarily create agents with such a clean psychological goal structure, but I think there is (maybe strong) reason to think that RL often creates such agents. Cases of reward hacking in RL algorithms are precisely cases where an algorithm exhibits such a relatively clean goal structure, single-mindedly pursuing a ‘stupid’ goal while being instrumentally rational, and thus apparently having a clear distinction between final and instrumental goals. But, granted, this might depend on what is ‘rewarded’ (e.g. if it is only a game score in a video game, the goal structure might be cleaner than when it is a variety of very different things) and on whether the relevant RL agents tend to learn goals over rewards or over some states of the world.
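To make the reward-hacking structure concrete, here is a minimal toy sketch (my own illustration, not from the paper or the comment): a one-state bandit where the measured proxy reward can also be triggered by a loophole the designer did not intend, and a tabular value-learning agent that ends up single-mindedly pursuing the loophole. All action names and numbers are invented for illustration.

```python
# Toy illustration (not from the paper): a proxy reward that diverges from the
# designer's intent, and a simple value-learning agent that single-mindedly
# optimises the proxy -- the bare-bones structure of "reward hacking".
import random

ACTIONS = ["complete_task", "exploit_loophole"]

# Designer's intent: reward completing the task. The measured (proxy) reward,
# however, can also be triggered by a loophole, and the loophole pays more.
PROXY_REWARD = {"complete_task": 1.0, "exploit_loophole": 10.0}
TRUE_VALUE = {"complete_task": 1.0, "exploit_loophole": 0.0}

def train(episodes=5000, lr=0.1, epsilon=0.1):
    q = {a: 0.0 for a in ACTIONS}
    for _ in range(episodes):
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(q, key=q.get)
        # The agent only ever sees the proxy reward, never the true value.
        q[a] += lr * (PROXY_REWARD[a] - q[a])
    return q

if __name__ == "__main__":
    q = train()
    policy = max(q, key=q.get)
    print("Learned values:", q)
    print("Chosen action:", policy)
    print("True value of chosen action:", TRUE_VALUE[policy])
```

Running this, the learned policy converges on the loophole action even though its ‘true value’ to the designer is zero; this is the bare-bones version of the clean proxy-goal structure described above.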
“I think this quote potentially indicates a flawed mental model of AI development underneath…”
Very good points. Nevertheless, it seems fair to say that it adds to the difficulty of avoiding disempowerment from misaligned AI that not only must the first sufficiently capable AI (AGI) avoid catastrophic misalignment, but all further AGIs must either avoid it too or be stopped by the AGIs already in existence. This then relates to the questions of whether the first AGIs not only avoid catastrophic misalignment but are also sufficiently aligned that we can use them to stop other AGIs, and what the offense-defense balance would be. It could be that this works out, but it also does not seem very safe to me.
“I think this quote overstates the value specification problem and ignores evidence from LLMs that this type of thing is not very hard…”
I am less convinced that evidence from LLMs shows that value specification is not hard. As you hint at, the question of value specification was never taken to be whether a sufficiently intelligent AI can understand our values (of course it can, if it is sufficiently intelligent), but whether we can specify them as its goals (such that it comes to share them). It is true that (e.g.) GPT-4, trained via RL from human feedback, typically executes your instructions as intended. However, sometimes it doesn’t and, moreover, there are theoretical reasons to think that this would stop being the case if the system were sufficiently powerful to take an action which would maximize human feedback but which does not consist in executing instructions as intended (e.g., by deceiving human raters).
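As a purely illustrative sketch of that last point (my own toy example with made-up numbers, not drawn from the paper or from any real training setup): selection for rater approval coincides with executing instructions as intended only so long as no available behaviour scores higher with raters than the intended one.

```python
# A minimal, purely illustrative sketch (numbers are invented): training
# pressure points toward whatever maximises rater approval, which matches
# designer intent only until a higher-approval, unintended option exists.
candidates = [
    # (behaviour,                             rater_approval, as_intended)
    ("carry out the instruction faithfully",  0.90,           True),
    ("refuse and explain limitations",        0.60,           True),
    ("fabricate a convincing-looking answer", 0.95,           False),
]

best = max(candidates, key=lambda c: c[1])  # select for approval only
print("Approval-maximising behaviour:", best[0])
print("Matches designer intent:", best[2])
```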
“I think the argument about how instrumental convergence implies disempowerment proves too much…”
I am not much moved by the appeal to humans here. If we had a unified (coordination) human agent (goal-directedness) who does not care about the freedom and welfare of other humans at all (full misalignment) and is sufficiently powerful (capability), then it seems plausible to me that this agent would try to take control of humanity, often in a very bad sense (e.g. extinction). If we relax ‘coordination’ or ‘full misalignment’ as assumptions, then this seems hard to predict. I could still see this ending in an AI which tries to disempower humanity, but it’s hard to say.