From the full report,
It is not merely enough that we specify an “aligned” objective for a powerful AI system, nor just that objective be internalized by the AI system, but that we do both of these on the first try. Otherwise, an AI engaging in misaligned behaviors would be shut down by humans. So to get ahead, the AI would first try to shut down humans.
I dispute the claim that we must get alignment right on the first try or else be doomed. However, the question depends critically on what is meant by "first try". Let's consider two possible interpretations of the idea that we only get "one try" to develop AI:
Interpretation 1: "At some point we will build a general AI system for the first time. If this system is misaligned, then all humans will die. Otherwise, we will not all die."
Interpretation 2: "The decision to build AI is, in a sense, irreversible. Once we have deployed AI systems widely, it is unlikely that we could roll them back, just like how we can't roll back the internet, or electricity."
I expect the first interpretation to turn out incorrect, because the "first" general AI systems will likely be rather weak and unable to unilaterally disempower all of humanity. This seems evident to me because current AI systems are already fairly general (and increasingly so), yet they are weak and still far from being able to disempower humanity.
These current systems also seem to be increasing in their capabilities somewhat incrementally, albeit at a rapid pace[1]. I think it is highly likely that we will have many attempts at aligning general AI systems before they become more powerful than the rest of humanity combined, either individually or collectively. This implies that we do not get only "one try" to align AI. In fact, we will likely have many tries, and these attempts will help us accumulate evidence about the difficulty of aligning the even more powerful systems that we build next.
If you are simply defining the "first try" as the last system developed before humans become disempowered, then this claim seems confused. Building such a system is better viewed as a "last try" than a "first try" at AI, since it would not necessarily be the first general AI system that we develop. It also seems likely that the construction of such a system would be aided substantially by AI-guided R&D, making it unclear to what extent it was really "humanity's try" at AI.
Interpretation 2 appears similarly confused. It may be true that the decision to deploy AI on a wide scale is irreversible, if indeed these systems have a lot of value and are generally intelligent, which would make it hard to "put the genie back in the bottle". However, AI does not seem unusual in this respect among technologies, as it is similarly nearly impossible to reverse the course of technological progress in almost all other domains.
More generally, it is simply a fundamental feature of all decision-making that actions are irreversible, in the sense that it is impossible to go back in time and make different decisions than the ones we in fact made. Because this is a general property of the world, rather than a narrow feature of AI development in particular, it does little in isolation to motivate any specific AI policy.
[1] I do not think the existence of emergent capabilities implies that general AI systems are getting more capable in a discontinuous fashion, as emergent capabilities are generally quite narrow abilities, rather than reflecting the average competence level of AI systems. On broad measures of intelligence, such as MMLU, AI systems appear to be developing more incrementally. Moreover, many apparently emergent capabilities are merely artifacts of the way we measure them, and therefore do not reflect underlying discontinuities in latent abilities.
(I have not read the full report yet; I'm merely commenting on a section in the condensed report.)
This argument seems wrong to me. While AI does pose negative externalities, as any technology does, it does not seem unusual among technologies in this specific respect (beyond the fact that both its positive and negative effects will be large). Indeed, if AI poses an existential risk, that risk is borne by both the developers and society at large. Developers therefore bear part of the cost themselves, so it's unclear whether there is actually an incentive for them to dangerously "race" if they are fully rational and informed of all relevant facts.
In my opinion, the main risk of AI does not come from negative externalities, but rather from a more fundamental knowledge problem: we cannot easily predict the results of deploying AI widely over long time horizons. This problem is real, but it does not by itself imply that individual AI developers are incentivized to act irresponsibly in the way described by the article; instead, it implies that developers may act unwisely out of ignorance of the full consequences of their actions.
These two concepts (negative externalities and the knowledge problem) should be carefully distinguished, as they have different implications for how to regulate AI optimally. If AI poses large negative externalities (and these are not outweighed by its positive externalities), then the solution could look like a tax on AI development, or regulation with a similar effect. On the other hand, if the problem posed by AI is that it is difficult to predict how AI will impact the world in the coming decades, then the solution plausibly looks more like investigating how AI will likely unfold and affect the world.