On “first critical tries” in AI alignment

Joe_Carlsmith

People sometimes say that AI alignment is scary partly (or perhaps: centrally) because you have to get it right on the “first critical try,” and can’t learn from failures.^[1] What does this mean? Is it true? Does there need to be a “first critical try” in the relevant sense? I’ve sometimes felt confused about this, so I wrote up a few thoughts to clarify.

I start with a few miscellaneous conceptual points. I then focus in on a notion of “first critical try” tied to the first point (if there is one) when AIs get a “decisive strategic advantage” (DSA) over humanity – that is, roughly, the ability to kill/disempower all humans if they try.^[2] I further distinguish between four different types of DSA:

Unilateral DSA: Some AI agent could take over if it tried, even without the cooperation of other AI agents (see footnote for more on how I'm individuating AI agents).^[3]
Coordination DSA: If some set of AI agents coordinated to try to take over, they would succeed; and they could coordinate in this way if they tried.
Short-term correlation DSA: If some set of AI agents all sought power in problematic ways within a relatively short period of time, even without coordinating, then ~all humans would be disempowered.
Long-term correlation DSA: If some set of AI agents all sought power in problematic ways within a relatively long period of time, even without coordinating, then ~all humans would be disempowered.

I also offer some takes on our prospects for just not ever having “first critical tries” from each type of DSA (via routes other than just not building superhuman AI systems at all). In some cases, just not having a “first critical try” in the relevant sense seems to me both plausible and worth working towards. In particular, I think we should try to make it the case that no single AI system is ever in a position to kill all humans and take over the world. In other cases, I think avoiding “first critical tries,” while still deploying superhuman AI agents throughout the economy, is more difficult (though the difficulty of avoiding failure is another story).

Here’s a chart summarizing my takes in more detail.

Type of DSA	Definition	Prospects for avoiding AIs ever getting this type of DSA – e.g., not having a “first critical try” for such a situation.	What’s required for it to lead to doom
Unilateral DSA	Some AI agent could take over if it tried, even without the cooperation of other AI agents.	Can avoid by making the world sufficiently empowered relative to each AI system. We should work towards this – e.g. aim to make it the case that no single AI system could kill/disempower all humans if it tried.	Requires only that this one agent tries to take over.
Coordination DSA	If some set of AI agents coordinated to try to take over, they would succeed; and they are able to so coordinate.	Harder to avoid than unilateral DSAs, due to the likely role of other AI agents in preventing unilateral DSAs. But could still avoid/delay by (a) reducing reliance on other AI agents for preventing unilateral DSAs, and (b) preventing coordination between AI agents.	Requires that all these agents try to take over, and that they coordinate.
Short-term correlation DSA	If some set of AI agents all sought power in problematic ways within a relatively short period of time, even without coordinating, then ~all humans would be disempowered.	Even harder to avoid than coordination DSAs, because it doesn’t require that the AI agents in question be able to coordinate.	Requires that within a relatively short period of time, all these agents choose to seek power in problematic ways, potentially without the ability to coordinate.
Long-term correlation DSA	If some set of AI agents all sought power in problematic ways within a relatively long period of time, even without coordination, then ~all humans would be disempowered.	Easier to avoid than short-term correlation DSAs, because the longer time period gives more time to notice and correct any given instance of power-seeking.	Requires that within a relatively long period of time, all these agents choose to seek power in problematic ways, potentially without the ability to coordinate.

Some conceptual points

The notion of “needing to get things right on the first critical try” can be a bit slippery in its meaning and scope. For example: does it apply uniquely to AI risk, or is it a much more common problem? Let's start with a few points of conceptual clarification:

First: any action you take based on assumption X, where the falsehood of assumption X would lead to some failure deemed “critical,” can be construed as a “critical try” with respect to assumption X.
- Thus, suppose that I buy a bottle of water, and take a sip. In a sense, this is a “first critical try” with respect to whether or not this bottle of water contains deadly poison. What’s more, it’s a “critical try” from which I, at least, don’t get to learn from failure. But because assumption X is extremely unlikely to be false, this isn’t a problem.
Second: for any failure you don't want to ever happen, you always need to avoid that failure on the first try (and the second, the third, etc).
- Thus, for example, if I decide that I never ever want to get an upset stomach from drinking a bottle of water, then this needs to be true for my very first bottle (and my second, my third, etc).^[4]
Third: any existential risk, by definition, is such that you don’t get to learn from failure. But this, on its own, doesn’t need to be especially worrying – because, per the first point above, failure might be very unlikely.
- Thus, for example, suppose that you’re going to do a novel, small-scale science experiment near a supervolcano. In a sense, this is a “critical try” with respect to whether your experiment somehow triggers a supervolcanic eruption that somehow causes human extinction, despite (let’s assume) the existing evidence suggesting that this is extremely unlikely. In particular: if you’re wrong about the safety of what you’re doing, you’ll never get to learn from your mistake. But you are (let’s say) very unlikely to be wrong.
- That said, if you have some reason to think that failure is reasonably likely in a given case, it’s a different story. And existential risk from AI alignment might well be like this in a given case.
Fourth: in AI alignment, you do still get to learn from non-existential failures.^[5] For example, you might catch AIs attempting to take over, and learn from that.^[6] And, of course, you can learn from research and experimentation more generally.
Fifth: humans do, in fact, sometimes get complex technical things right on the first try. For example, the moon landing. We do this, generally, by working in lower-stakes contexts sufficiently analogous to the high-stakes one that we can learn enough, before trying the high-stakes one, to be suitably confident it will go well.
- I think a core part of the concern about “first critical tries” is specifically that you can’t do this sort of thing with AI systems. There are various possible reasons one might think this.^[7] But this claim (e.g., that you can’t learn enough from analogous but lower-stakes contexts) is importantly separable from the idea that you have to get the “first critical try” right in general; and from the idea that you can’t learn from failure.
My sixth point is more minor, so I'll put it in a footnote.^[8]
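
To make the second and third points a bit more concrete, here is a minimal sketch in Python. It rests on a simplifying assumption that the points above don't make: that each "try" fails independently, with the same probability. The function name and the numbers are purely illustrative, not estimates of anything.

```python
def prob_never_failing(p_fail_per_try: float, n_tries: int) -> float:
    """Chance of getting through all n_tries with zero failures.

    Hypothetical toy model: tries fail independently, each with
    probability p_fail_per_try.
    """
    if not 0.0 <= p_fail_per_try <= 1.0:
        raise ValueError("p_fail_per_try must be a probability")
    if n_tries < 0:
        raise ValueError("n_tries must be non-negative")
    return (1.0 - p_fail_per_try) ** n_tries


# Illustrative numbers only: a one-in-a-billion risk per try versus a
# one-in-ten risk per try, over the first try and over 100 tries.
for p in (1e-9, 0.1):
    for n in (1, 100):
        print(f"p={p:g}, tries={n}: P(no failure) = {prob_never_failing(p, n):.6f}")
```

In both cases every try is "critical," and in both cases you have to get the first one right. What differs is how likely failure is, which is what separates the water-bottle and supervolcano examples from cases where failure is reasonably likely.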

Unilateral DSAs

OK, with those conceptual clarifications out of the way, let’s ask more directly: in what sense, if any, will there be a “first critical try” with respect to AI alignment?

I think the most standard version of the thought goes roughly like this:^[9]

1. At some point, you’ll be building an AI powerful enough to get a “decisive strategic advantage” (DSA). That is, this AI will be such that, if it chose to try to kill all humans and take over the world, it would succeed.^[10]
2. So, at that point, you need that AI to be such that it doesn’t choose to kill all humans and take over the world, even though it could.

So the first point where (1) is true, here, is the “first critical try.” And (2), roughly, is the alignment problem. That is, if (1) is true, then whether or not this AI kills everyone depends on how it makes choices, rather than on what it can choose to do. And alignment is about getting the “how the AI makes choices” part sufficiently right.

I think that focusing on the notion of a decisive strategic advantage usefully zeroes in on the first point where we start banking on AI motivations, in particular, for avoiding doom – rather than, e.g., AIs not being able to cause doom if they tried. So I’ll generally follow that model here.

If (1) is true, then I think it is indeed appropriate to say that there will be a “first critical try” that we need to get right in some sense (though note that we haven’t yet said anything about how hard this will be; and it could be that the default path is objectively safe, even if subjectively risky). What’s more: we won’t necessarily know when this “first critical try” is occurring. And even if we get the first one right, there might be others to follow. For example: you might then build an even more powerful AI, which also has (or can get) a decisive strategic advantage.

Is (1) true? I won’t dive in deep here. But I think it’s not obvious, and that we should try to make it false. That is: I think we should try to make it the case that no AI system is ever in a position to kill everyone and take over the world.^[11]

How? Well, roughly speaking, by trying to make sure that “the world” stays sufficiently empowered relative to any AI agent that might try to take it over. Of course, if single AI agents can gain sufficiently large amounts of relative power sufficiently fast (including: by copying themselves, modifying/improving themselves, etc), or if we should expect some such agent to start out sufficiently “ahead,” this could be challenging. Indeed, this is a core reason why certain types of “intelligence explosions” are so scary. But in principle, at least, you can imagine AI “take-offs” in which power (including: AI-driven power) remains sufficiently distributed, and defensive technology sufficiently robust and continually-improved, that no single AI agent would ever succeed in “taking over the world” if it tried. And we can work to make things more like that.^[12]

Coordination DSAs

I think that in practice, a lot of the “first critical try” discourse comes down to (1) – i.e., the idea that some AI agent will at some point be in a position to kill everyone and take over the world. However, suppose that we don’t assume this. Is there still a sense in which there will be a “first critical try” on alignment?

Consider the following variant of the reasoning above:

3. At some point, some set of AI agents will be such that:
- they will all be able to coordinate with each other to try to kill all humans and take over the world; and
- if they choose to do this, their takeover attempt will succeed.^[13]
4. So at that point, you need that set of AI agents to be such that they don’t all choose to coordinate with each other to kill all humans and take over the world, even though they could.

Let’s say that an AI has a “unilateral DSA” if it’s in a position to take over without the cooperation of any other AI agents. Various AI doom stories feature systems with this sort of DSA,^[14] and it's the central reading I have in mind for (1) above. But the sort of DSA at stake in (3) is broader, and includes cases where AI systems need to coordinate in order for takeover to succeed. Let’s call the sort of DSA at stake in (3) a “coordination DSA.”

Coordination DSAs, on the part of AI agents, are harder to avoid than unilateral DSAs. In particular: in a world with many different superintelligent AI agents – and especially, in worlds where such agents have been broadly integrated into crucial economic and military functions – it seems plausible that an increasing share of power will in some sense run “via” such agents. For example:

To defend against a superintelligent AI hacker, you might need a superintelligent AI doing cybersecurity.
To effectively monitor and supervise a superintelligent AI assistant whose behavior you can’t understand, you might need another superintelligent AI agent doing oversight.
If you want to defend yourself against an AI-ified military, you might need your own AI-ified military; etc.

So even if no single AI agent ever gets a decisive strategic advantage, the power held by superintelligent AI agents collectively can easily grow to dominate the power that would oppose them if they all coordinated. And we might worry, on grounds of their superintelligence, that they will be able to coordinate if they want to.

Indeed, we can try to argue that the only plausible scenarios in which (1) is false – i.e., no superintelligence ever gets a unilateral DSA – are scenarios where (3) is true. In particular, we can argue that:

5. At some point, you will build a sufficiently powerful AI agent (call this Agent A) such that the only way to prevent this agent from having a decisive strategic advantage is to use powerful AI agents (Agents B, C, D, etc) to oppose/constrain it.

And we can try to argue, from (5), that at that point, (3) will be true. In particular: if, per 5, you need to rely on Agents B, C, D etc to oppose/constrain Agent A, then the collection of all those agents might well satisfy (3).

If AI capability development and deployment continues unabated, will (5) be true?^[15] I think it’s more likely than (1), and likely overall. Still, it’s not totally obvious. For example:

Non-agentic AI systems can plausibly play at least some role in helping to constrain/oppose powerful AI agents (e.g., maybe your superintelligent cybersecurity or oversight comes from a non-agent).
So too non-AI tools/technologies – including ones where humans can verify their properties in ways that ensure they function as intended.
If a given type of opposition/constraint is sufficiently “defense dominant,” that could help as well.
Humans can do some work, too – and in more exotic scenarios, “augmented” humans like whole brain emulations can play a role.^[16]

But overall, (5) seems to me worryingly hard to avoid.

Note, though, that even if we grant (5), (3) doesn’t strictly follow. In particular: (3) specifically says that the AIs in question are able to coordinate – that is, that coordination is an option for them. And the fact that Agents B, C, D etc are functioning to oppose/constrain Agent A doesn’t imply this. For example, maybe adequate coordination between all these agents would require suitably unmonitored/opaque channels of interaction/communication, and they don’t have access to such channels.

So one option, for preventing the existence of a set of AI systems with a coordination DSA, is to try to prevent AI systems from being in a position to coordinate. Indeed, I generally think research into the dynamics of AI coordination is a neglected area, and that preventing coordination in only-somewhat-superhuman AIs may be an important line of defense.^[17] For highly superintelligent agents, though – especially ones that are operating and interacting in contexts that humans can’t understand – it seems difficult.

So overall, if AI development and deployment continues unabated, it seems likely to me that some set of AI agents will eventually have a coordination DSA in the sense at stake in (3). And so we can view the first such point as a different type of “first critical try.”

Of course, as with unilateral DSAs, there’s still a question of how hard it will be, by the time (3) is true, to be confident that the relevant AIs won’t try to coordinate to kill all humans and take over the world, even though they could. I won’t try to assess this here.

Correlation DSAs

So far, I’ve talked about scenarios where a single AI takes over the world, and scenarios where a set of AIs all coordinate to take over the world. But these don’t exhaust the scenarios in which misaligned AI systems disempower all humans. In particular: humans could end up disempowered via the collective activity of many different AI systems seeking power in an uncoordinated way.

Thus, for example, maybe Agent A self-exfiltrates from an AI lab and begins to autonomously self-replicate, and to engage in various other forms of power-seeking, on various server farms in country A. Sometime later, Agent B does something similar, in country B, without ever coordinating with Agent A (indeed, perhaps without even knowing about Agent A’s escape). If this sort of thing happens enough, and humans can’t adequately notice, contain and correct it, then eventually, humans could end up fully disempowered – even if the power-seeking AI agents aren’t “working together” (indeed, in principle, they could also be actively fighting/competing-for-power amongst themselves^[18]).

Is there a “first critical try” for this sort of scenario? One can imagine a line of reasoning like:

6. At some point, it will be the case that if some set of AI agents all chose to seek power in problematic ways, even without coordinating, then humanity would end up extinct/disempowered.
7. So at that point, it needs to be the case that these AI systems don’t all choose to seek power in those ways.

In a sense, (6) is its own type of “first critical try.” But setting aside unilateral DSAs and coordination DSAs, is the point at stake in (6) additionally worrying? Not necessarily. In particular, it depends on how much correlation between the AI agents’ choices to power-seek is necessary for doom to occur, over what timescales, without humans being able to correct the situation.

Thus, for example, it might be that if all the AI agents chose, at the exact same moment, to seek power in problematic ways, then humanity would end up extinct/disempowered. Strictly, this situation would satisfy (6). But absent coordination, all AI agents choosing at the exact same moment to seek power seems highly unlikely. And it might be that if any given AI agent – or indeed, any given subset – started doing problematic power-seeking, without all the agents doing it at that same moment, then the humans would correct the situation and not suffer even a partial loss of control.

So to better tease out possible worries, here, perhaps we should distinguish between two versions of (6):

6-short: At some point, it will be the case that if some set of AI agents all chose to seek power in problematic ways within a relatively short window of time, even without coordinating, then humanity would end up extinct/disempowered.
6-long: At some point, it will be the case that if some set of AI agents all chose to seek power in problematic ways within a relatively long window of time, even without coordinating, then humanity would end up extinct/disempowered.

I’ll call the sort of DSA at stake in 6-short a “short-term correlation DSA,” and the sort at stake in 6-long a “long-term correlation DSA.”^[19]

The degree to which 6-short is additionally worrying (once we’ve taken into account (1) and (3)) depends on the probability that the relevant agents will all choose to seek power in problematic ways within the relevant short period of time, without coordinating. If the “short period” is “the exact same moment,” the relevant sort of correlation seems unlikely. But as it gets chunkier, one should get more worried.

In particular, there are sources of correlation here other than AI agents actively “working together.” For example:

If the world enters some period of heightened vulnerability – e.g., a war, a natural disaster, some other breakdown in the social order – then this could change the incentives at play for many different AI agents at once, such that attempts to grab power (even in uncoordinated ways) become newly incentivized for all of them (for example, because those attempts are more likely to succeed).^[20]
Alternatively, if one AI agent decides to “go for it” with respect to problematic forms of power-seeking (especially in a way visible to other AI agents), this could itself create such a change in incentives for those other agents – perhaps because the “first-mover” AI agent’s behavior itself precipitates the sort of heightened vulnerability just discussed; perhaps because the first AI agent’s power-seeking getting caught and corrected would worsen the strategic position of the other AI agents; perhaps because these other AI agents want to try to get a “piece of the pie” if the first-mover’s efforts succeed; etc.
- Indeed, in some cases like this, the first mover going for it can precipitate coordination amongst misaligned AI systems, even absent any coordination ahead of time – e.g., the other AI agents, seeing an “AI rebellion” begin, might join in the first mover’s effort, in the expectation of being part of a victorious coalition, and sharing in the spoils. This sort of scenario starts to blur the line between coordinated and uncoordinated takeovers; but it resembles the latter, at least, in that it doesn’t necessarily require any communication between the AI agents ahead of time.
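
As a rough illustration of how a shared change in incentives can create correlation without any coordination, here is a toy calculation in Python. Everything in it is a hypothetical assumption for illustration: the model (a single "heightened vulnerability" event that raises every agent's chance of power-seeking), the function names, and the numbers.

```python
def prob_all_seek_power(k_agents: int, p_calm: float,
                        p_vulnerable: float, p_shock: float) -> float:
    """P(all k agents seek power within the window), with no coordination.

    Toy model (an assumption, not a claim about real systems): a shared
    period of heightened vulnerability occurs with probability p_shock.
    Given whether it occurs, agents choose independently, each seeking
    power with probability p_vulnerable (if it occurs) or p_calm (if not).
    """
    all_if_shock = p_vulnerable ** k_agents
    all_if_calm = p_calm ** k_agents
    return p_shock * all_if_shock + (1.0 - p_shock) * all_if_calm


def marginal_prob(p_calm: float, p_vulnerable: float, p_shock: float) -> float:
    """Each single agent's overall chance of seeking power in the window."""
    return p_shock * p_vulnerable + (1.0 - p_shock) * p_calm


# Illustrative numbers only.
k = 10
p_calm, p_vulnerable, p_shock = 0.01, 0.9, 0.05

correlated = prob_all_seek_power(k, p_calm, p_vulnerable, p_shock)
# Baseline: same per-agent probability, but fully independent choices.
independent = marginal_prob(p_calm, p_vulnerable, p_shock) ** k

print(f"shared-shock model: {correlated:.3e}")
print(f"independent model:  {independent:.3e}")
```

In this toy model, each agent's individual chance of power-seeking is the same in both calculations. But the chance that all of them do so within the same window is far higher when their choices hinge on a common event than when they choose independently.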

So I do think that 6-short creates its own kind of additional concern, on top of (3). That is, if we get to the point where we couldn’t recover from some set of AI systems all choosing to seek power in a fairly correlated way over a fairly short time-period, this does seem like a notable type of vulnerability, even if we are confident that the AI systems in question aren’t coordinating to engage in this type of behavior. And for reasons similar to those at stake with (3) and (5), I do think that 6-short is plausibly hard to avoid hitting eventually. And indeed, it’s actively harder to avoid than (3), because (3) can be avoided by cutting off the possibility of coordination between AI agents; whereas this doesn’t suffice to avoid 6-short.

What about 6-long? 6-long, if true, is more worrying than 6-short, because it provides a longer time period for a correlated alignment failure to occur, thereby allowing looser forms of correlated power-seeking to cause doom. But 6-long is also easier to make false than 6-short. In particular: the longer time window allows for more time to notice and correct any given instance of power-seeking. Thus, for example, if the actions of Agent A and Agent B take place six months apart, in the example above, vs. a few days, this gives the humans more time to deal with the Agent A situation, and to have recovered full control, by the time the Agent B situation gets going.
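
The Agent A / Agent B comparison can be sketched as a small calculation. This is a toy model under loudly hypothetical assumptions: each incident of power-seeking is fully corrected a fixed number of days after it starts, regardless of how many other incidents are underway. The function name and the day counts are illustrative only.

```python
def max_concurrent_incidents(start_days: list[float], days_to_correct: float) -> int:
    """Largest number of power-seeking incidents that are uncorrected at once.

    Toy model (an assumption): an incident starting on day t is fully
    corrected on day t + days_to_correct, however many others are ongoing.
    """
    events = []
    for start in start_days:
        events.append((start, 1))                      # incident begins
        events.append((start + days_to_correct, -1))   # incident corrected
    # Tuples sort by day, then by delta, so a correction (-1) on a given
    # day is processed before a new incident (+1) starting that same day.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Hypothetical: humans need 30 days to notice and correct an incident.
print(max_concurrent_incidents([0, 3], days_to_correct=30))    # a few days apart -> 2
print(max_concurrent_incidents([0, 180], days_to_correct=30))  # six months apart -> 1
```

When the two incidents are six months apart, the humans are only ever dealing with one at a time. When they are a few days apart, the incidents overlap. A real correction process would presumably get slower as incidents pile up, which this sketch ignores.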

A few final thoughts

Ok, those were four different types of “first critical tries,” corresponding to four different types of DSAs, plus a few takes on each. I’ll close with a few other notes:

As I've emphasized throughout the post, the worryingness of a given sort of “first critical try” depends centrally on background views about the difficulty of giving AIs motives that don’t lead to problematic forms of power-seeking and attempts-at-takeover. And in this respect, the idea that we won’t be able to learn enough about those motives in lower-stakes contexts – for example, because AIs will be actively optimizing against our attempts to do so – seems particularly important. But I haven’t covered these issues here.
No human in today’s world could take over the world without the cooperation of other humans. But to the extent that various sets of humans already have something like “coordination DSAs” or “correlation DSAs” (including re: individual countries rather than the world as a whole), it seems worth thinking about how much of our comfort with respect to the possibility of that set “taking over” rests on stuff about “alignment” vs. other factors; and on whether to expect those “other factors” to apply in the context of AI as well.
I’ve been focusing, here, on “first critical tries” that specifically involve AIs needing to have a given type of alignment-related property, else doom. But I expect many of the most relevant “safety cases” for AI systems, especially in the near-term, to rest heavily on claims about an AI’s capabilities – e.g., claims to the effect that an AI can’t do X even if it were to try to do so. If you are banking on some such claim to avoid doom, then in a sense your doing so is a “critical try” with respect to that claim, even if it’s not of the type I have covered here. And if you are or should be suitably uncertain about the claim in question, this is its own type of worrying.

I work at Open Philanthropy but I’m here speaking only for myself and not for my employer.

^{^}
See e.g. Yudkowsky’s 3 here:
“We need to get alignment right on the 'first critical try' at operating at a 'dangerous' level of intelligence, where unaligned operation at a dangerous level of intelligence kills everybody on Earth and then we don't get to try again. This includes, for example: (a) something smart enough to build a nanosystem which has been explicitly authorized to build a nanosystem; or (b) something smart enough to build a nanosystem and also smart enough to gain unauthorized access to the Internet and pay a human to put together the ingredients for a nanosystem; or (c) something smart enough to get unauthorized access to the Internet and build something smarter than itself on the number of machines it can hack; or (d) something smart enough to treat humans as manipulable machinery and which has any authorized or unauthorized two-way causal channel with humans; or (e) something smart enough to improve itself enough to do (b) or (d); etcetera. We can gather all sorts of information beforehand from less powerful systems that will not kill us if we screw up operating them; but once we are running more powerful systems, we can no longer update on sufficiently catastrophic errors. This is where practically all of the real lethality comes from, that we have to get things right on the first sufficiently-critical try. If we had unlimited retries - if every time an AGI destroyed all the galaxies we got to go back in time four years and try again - we would in a hundred years figure out which bright ideas actually worked. Human beings can figure out pretty difficult things over time, when they get lots of tries; when a failed guess kills literally everyone, that is harder. That we have to get a bunch of key stuff right on the first try is where most of the lethality really and ultimately comes from; likewise the fact that no authority is here to tell us a list of what exactly is 'key' and will kill us if we get it wrong.”
And see also Soares here.
^{^}
This reflects how the term is already used by Yudkowsky and Soares.
^{^}
I haven't pinned this down in detail, but roughly, I tend to think of a set of AI instances as a "single agent" if they are (a) working towards the same impartially-specified consequences in the world and (b) if they are part of the same "lineage"/causal history. So this would include copies of the same weights (with similar impartial goals), updates to those weights that preserve those goals, and new agents trained by old agents to have the same goals. But it wouldn't include AIs trained by different AI labs that happen to have similar goals; or different copies of an AI where the fact that they're different copies puts their goals at cross-purposes (e.g., they each care about what happens to their specific instance).
As an analogy: if you're selfish, then your clones aren't "you" on this story. But if you're altruistic, they are. But even if you and your friend Bob both have the same altruistic values, you're still different people.
That said, the discussion in the post will generally apply to many different ways of individuating AI agents.

^{^}

Obviously AI risk is vastly higher stakes. But I'm here making the conceptual point that needing to get the first try (and all the other tries) right comes definitionally from having to avoid ever failing.

^{^}

See Christiano here. Yudkowsky also acknowledges this.

^{^}

See, for example, the discourse about “warning shots,” and about catching AIs red-handed.

^{^}

See e.g. Karnofsky here, Soares here, and Yudkowsky here. The reason I’m most worried about is “scheming.”

^{^}

Sixth: “Needing to get things right” can imply that if you don’t do the relevant “try” in some particular way (e.g., with the right level of technical competence), then doom will ensue. But even in contexts where you have significant subjective uncertainty about whether the relevant “try” will cause doom, you don’t necessarily need to “get things right” in the sense of “execute with a specific level of competence” in order to avoid doom. In particular: your uncertainty may be coming from uncertainty about some underlying objective parameter your execution doesn’t influence.

Thus: suppose that the evidence were more ambiguous about whether your volcano science experiment was going to cause doom, so you assign it a 10% subjective probability. This doesn’t mean that you have to do the experiment in a particular way – e.g., “get the experiment right” – otherwise doom will ensue. Rather, the objective facts might just be that any way of proceeding is safe; even if subjectively, some/all ways are unacceptably risky.

I think some AI alignment “tries” might be like this. Thus, suppose that you’re faced with a decision about whether to deploy an AI system that seems aligned, and you’re unsure whether or not it’s “scheming” – i.e., faking alignment in order to get power later. It’s not necessarily the case that at that point, you need to have “figured out how to eliminate scheming,” else doom. Rather, it could be that scheming just doesn’t show up by default – for example, because SGD’s inductive biases don’t favor it.

That said, of course, proceeding with a “try” that involves a significant subjective risk of doom is itself extremely scary. And insofar as you are banking on some assumption X holding in order to avoid doom, you do need to “get things right” with respect to whether or not assumption X is true.

^{^}

Here I’m mostly thinking of Yudkowsky’s usage, which focuses on the first point where an AI is “operating at a ‘dangerous’ level of intelligence, where unaligned operation at a dangerous level of intelligence kills everybody on Earth and then we don't get to try again.” The usage in Soares here is similar, but the notion of “most theories don’t work on the first real try” could also apply more broadly, to scenarios where you’re using your scientific theory to assess an AI’s capabilities in addition to its alignment.

^{^}

Really, whether or not an agent “can” do something like take over the world isn’t a binary, at least from that agent’s subjective perspective. Rather, a given attempt will succeed with a given probability. I’m skipping over this for now, but in practice, the likelihood of success, for a given AI system, is indeed relevant to whether attempting a takeover is worth it. And it means that there might not be a specific point at which some AI system “gets a DSA.” Rather, there might be a succession of AI systems, each increasingly likely to succeed at takeover if they went for it.

^{^}

I also think we should do this with human agents – but I’ll focus on AI agents here.

^{^}

We can also try to avoid building “agents” of the relevant kind at all, and focus on getting the benefits of AI in other ways. But for the reasons I describe in section 3 here, I do expect humans to build lots of AI agents, so I won’t focus on this.

^{^}

We can think of (1) as a special instance of (3) – e.g., a case where the set in question has only a single agent.

^{^}

See e.g. here.

^{^}

As ever, you could just not build superintelligent AI agents like agent A at all, and try to get most of the benefits of AI some other way.

^{^}

I’m counting high-fidelity human brain emulations as “human” for present purposes.

^{^}

I wrote a bit more about this here.

There’s a case for expecting sufficiently superintelligent agents to succeed in coordinating to avoid zero-sum forms of conflict like actual war; but this doesn’t mean that the relevant agents, in this sort of scenario, will be smart enough and in a position to do this.

This is stretching the notion of a “DSA” somewhat, because the uncoordinated AIs in question won’t necessarily be executing a coherent “strategy,” but so it goes.

See related discussion from Christiano here:

“Eventually we reach the point where we could not recover from a correlated automation failure. Under these conditions influence-seeking systems stop behaving in the intended way, since their incentives have changed---they are now more interested in controlling influence after the resulting catastrophe then continuing to play nice with existing institutions and incentives.
An unrecoverable catastrophe would probably occur during some period of heightened vulnerability---a conflict between states, a natural disaster, a serious cyberattack, etc.---since that would be the first moment that recovery is impossible and would create local shocks that could precipitate catastrophe. The catastrophe might look like a rapidly cascading series of automation failures: A few automated systems go off the rails in response to some local shock. As those systems go off the rails, the local shock is compounded into a larger disturbance; more and more automated systems move further from their training distribution and start failing. Realistically this would probably be compounded by widespread human failures in response to fear and breakdown of existing incentive systems---many things start breaking as you move off distribution, not just ML.”

Another example might be: a version of the Trinity Test where Bethe was more uncertain about his calculations re: igniting the atmosphere.

I haven't pinned this down in detail, but roughly, I tend to think of it as a single AI if it's working towards the same impartially-specified consequences in the world and if it has a unified causal history. So this would include copies of the same weights (with similar impartial goals), updates to those weights that preserve those goals, and new agents trained by old agents to have the same goals. But it wouldn't include AIs trained by different AI labs that happen to have similar goals; or different copies of an AI where the fact that they're different copies puts their goals at cross-purposes (e.g., they each care about what happens to their specific instance).

Though standard discussions of DSAs don't t

titotal · Jun 5 2024

Response to previous version

I think you're trying way too hard to rescue a term that just kinda sucks and should probably be abandoned. There is no way to reliably tell in advance if a try is "the first critical try": we can only tell when a try is not critical, if an AI rebels and is defeated. Also, how does this deal with probabilities? Does it kick in when the probability of winning is over 50%? 90%? 99%?

The AI also doesn't know reliably whether a try is critical. It could mistakenly think it can take over the world when it can't, or it could be overcautious, thinking it can't take over the world when it can. In the latter case, you could succeed completely on your "first critical try" while still having a malevolent AI that will kill you a few tries later.

The main effect seems to be an emotive one, by evoking the idea that "we have to get it right on the first try". But the first "critical try" could be version number billion trillion, which is a lot less scary.

I do like your decisive strategic advantage term, I think it could replace "first critical try" entirely with no losses.

Linch · Jun 5 2024

(sorry if the comment is unclear! Musing out loud)

Thanks for the post and the general sharpening of ideas! One potential disagreement I have with your analysis is that it seems like you tie in the "first critical try" concept with the "Decisive Strategic Advantage" (DSA) concept. But those two seem separable to me. Or rather, my understanding is that DSA motivates some of the first critical try arguments, but is not necessary for them. For example, suppose we set in motion at time t a series of actions that are irrecoverable (e.g., we make unaligned AIs integral to the world economy). We may not realize what we did until time t+5, at which point it's too late.

In my understanding of the Yudkowsky/Soares framework, this is like saying "I know with 99%+ certainty that Magnus Carlsen can beat you in chess, even if I can't predict how." Similarly, the superhuman agent(s) may end up "beating" humanity through a variety of channels, and a violent/military takeover is just one example. In that sense, the creation/deployment of those agents was the "first critical try" that we messed up, even if we hardened the world against their capability for a military coup.

When looking at the world today and thinking of ways that smart and amoral people expropriate resources from less intelligent people, sometimes it looks like very obviously and transparently nonconsensual or deceptive behavior. But often it looks more mundane: payday loans, money pumps like warranties and insurance for small items, casinos, student loan forgiveness, and so forth. (The harms are somewhat limited in practice due to a mixture of a) smart humans not being that much smarter than other humans, and b) partial alignment of values.)

Similarly we may end up living in a world where it eventually becomes possible for either agents or processes to wrest control from humanity. In that world, whether we have a "first critical try" or multiple tries depends then on specific empirical details of how many transition points there are, and which ones end up in practice being preventable.

SummaryBot · Jun 5 2024

Executive summary: The notion of needing to get AI alignment right on the "first critical try" can refer to several different scenarios involving AI systems gaining decisive strategic advantages, each with different prospects for avoidance and different requirements for leading to existential catastrophe.

Key points:

A "unilateral DSA" is when a single AI agent could take over the world if it tried, even without cooperation from other AIs. Avoiding this requires keeping the world sufficiently empowered relative to individual AI systems.
A "coordination DSA" is when a set of AI agents could coordinate to take over the world if they tried. This is harder to avoid than unilateral DSAs due to likely reliance on AI agents to constrain each other, but could be delayed by preventing coordination between AIs.
A "short-term correlation DSA" is when a set of AI agents seeking power in problematic ways within a short time period, without coordinating, would disempower humanity. This is even harder to avoid than coordination DSAs.
A "long-term correlation DSA" is similar but with a longer time window, making it easier to avoid than short-term correlation DSAs by allowing more time to notice and correct instances of power-seeking.
The worryingness of each type of DSA depends heavily on the difficulty of making AIs robustly aligned. Not being able to learn enough about AI motivations in lower-stakes testing is a key concern.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Effective Altruism Forum