I’ve been trying to form a nearcast-based picture of what it might look like to suffer or avoid an AI catastrophe. I’ve written a hypothetical “failure story” (How we might stumble into AI catastrophe) and two “success stories” (one presuming a relatively gradual takeoff, one assuming a more discontinuous one).
Those success stories rely on a couple of key actors (a leading AI lab and a standards-and-monitoring organization) making lots of good choices. But I don’t think stories like these are our only hope. Contra Eliezer, I think we have a nontrivial[1] chance of avoiding AI takeover even in a “minimal-dignity” future - say, assuming essentially no growth from here in the size or influence of the communities and research fields focused specifically on existential risk from misaligned AI, and no highly surprising research or other insights from these communities/fields either. (There are further risks beyond AI takeover; this post focuses on AI takeover.)
This is not meant to make anyone relax! Just the opposite - I think we’re in the “This could really go lots of different ways” zone where marginal effort is most valuable. (Though I have to link to my anti-burnout take after saying something like that.) My point is nothing like “We will be fine” - it’s more like “We aren’t stuck at the bottom of the logistic success curve; every bit of improvement in the situation helps our odds.”
I think “Luck could be enough” should be the strong default on priors,[2] so in some sense I don’t think I owe tons of argumentation here (I think the burden is on the other side). But in addition to thinking “I haven’t heard knockdown arguments for doom,” I think it’s relevant that I feel like I can at least picture success with minimal dignity (while granting that many people will think my picture is vague, wishful and wildly unrealistic, and they may be right). This post will try to spell that out a bit.
It won’t have security mindset, to say the least - I’ll be sketching things out that “could work,” and it will be easy (for me and others) to name ways they could fail. But I think having an end-to-end picture of how this could look might be helpful for understanding my picture (and pushing back on it!)
I’ll go through:
- How we could navigate the initial alignment problem:[3] getting to the first point of having very powerful (human-level-ish), yet safe, AI systems.
  - For human-level-ish AIs, I think it’s plausible that the alignment problem is easy, trivial or nonexistent. (Also plausible that it’s fiendishly hard!)
  - If so, it could end up cheap and easy to intent-align human-level-ish AIs, such that such AIs end up greatly outnumbering misaligned ones - putting us in good position for the deployment problem (next point).
- How we could navigate the deployment problem:[4] reducing the risk that someone in the world will deploy irrecoverably dangerous systems, even though the basic technology exists to make powerful (human-level-ish) AIs safe. (This is often discussed through the lens of “pivotal acts,” though that’s not my preferred framing.[5])
  - You can think of this as containing two challenges: stopping misaligned human-level-ish AI, and maintaining alignment as AI goes beyond human level.
  - A key point is that once we have aligned human-level-ish AI, the world will probably be transformed enormously, to the point where we should consider ~all outcomes in play.
- (Briefly) The main arguments I’ve heard for why this picture is unrealistic/doomed.
- A few more thoughts on the “success without dignity” idea.
As with many of my posts, I don’t claim personal credit for any new ground here. I’m leaning heavily on conversations with others, especially Paul Christiano and Carl Shulman.
The initial alignment problem
What happens if you train an AI using the sort of process outlined here - essentially, generative pretraining followed by reinforcement learning, with the latter refereed by humans?
I think danger is likely by default - but not assured. It seems to depend on a number of hard-to-predict things:
- How accurate is reinforcement?
  - The greater an AI’s ability to get better performance by deceiving, manipulating or overpowering supervisors, the greater the danger.
  - There are a number of reasons (beyond explicit existential risk concern) that AI labs might invest heavily in accurate reinforcement, via techniques like task decomposition/amplification, recursive reward modeling, mechanistic interpretability, and using AIs to debate or supervise other AIs. Relatively moderate investments here could imaginably lead to highly accurate reinforcement.
- How “natural” are intended generalizations (like “Do what the supervisor is hoping I’ll do, in the sense that most humans would mean this phrase rather than in a precise but malign sense”) vs. unintended ones (like “Do whatever maximizes reward”)?
  - It seems plausible that large amounts of generative pretraining could result in an AI having a suite of well-developed humanlike concepts, such as “Do what the supervisor is hoping I’ll do, in the sense that most humans would mean this phrase rather than in a precise technical sense” - and also such as “Fool the supervisor into thinking I did well,” but the latter could be hard enough to pull off successfully in the presence of a basic audit regime (especially for merely human-level-ish AI), and/or sufficiently in conflict with various learned heuristics, that it could be disadvantaged in training.
  - In this case, a relatively small amount of reinforcement learning could be enough to orient an AI toward policies that generalize as intended.
- How much is training “outcomes-based vs. process-based”? That is, how much does it look like “An AI goes through a long episode, taking many steps that aren’t supervised or necessarily understood, and ultimately subject to gradient descent based on whether humans approve of the outcome?” vs. “Each local step the AI takes is subject to human supervision and approval?”
  - The former leaves a lot of scope for mistaken feedback that trains deception and manipulation. The latter could still in some sense train “doing what humans think they want rather than what they actually want,” but that’s quite different from training “Do whatever results in a seemingly good outcome,” and I think it’s noticeably less vulnerable to some of the key risks.
  - Outcomes-based training seems abstractly more “powerful,” and likely to be a big part of training the most powerful systems - but this isn’t assured. Today, training AIs based on outcomes of long episodes is unwieldy, and the most capable AIs haven’t had much of it.
- How natural/necessary is it for a sufficiently capable AI to form ambitious goals and act like a “maximizer”?
  - There has been a lot of ink spilled about this, and I think it remains up in the air. I’ll just say that modern AI development methods are not clearly set up to produce “maximizers.”
For what it’s worth, it seems to me like the very shortest, straightest-line imaginable path to transformative AI from today looks relatively favorable on the above dimensions. Today’s most capable models mostly look like “lots of generative pretraining, tiny amounts of reinforcement learning,” which seems like a good thing at least from the point of view of (a) developing a large suite of human-like concepts; (b) a good chance that the RL part of the training can be highly “accurate” (few opportunities to get better performance by deceiving, manipulating or overpowering supervisors). It’s imaginable that AI models could become transformative with a handful of additions (see footnote [6]) that - while greatly changing capabilities and introducing serious alignment risks - wouldn’t clearly change what I just said.
I’d guess that the situation will get worse (e.g., more outcome-based training) compared to that picture, but I don’t know that. And even if it does, it still seems like we could end up with models that have human-level and human-like capabilities and lack ambitious goals of their own. (Even without a special effort.)
For an opposing view, see my dialogue with Nate Soares. I could imagine being about 70% confident of Nate’s view if I’d done a lot of thinking I haven’t done, but I doubt most people are in that sort of position or should trust Nate (or Eliezer) to be, to the point of being certain of danger by default.
There are also possible countermeasures to alignment risk, identifiable today, that I can imagine going quite well, without needing new breakthrough insights (we would need luck, though). Examples:
Simple checks and balances. An example: we might train “AI watchdogs” to make the best arguments they can (ideally including discussions of internal states) that other AIs are dangerous, that specific measures can make them less dangerous, etc. Gradient descent could be used both to push against spurious arguments (as assessed by randomized deep dives) and to push toward more important and correct critiques (according to human judges and auditors). There are tons of reasons this might not work, but I don’t think we should be shocked if it produces huge dividends, even coming from simply human-level AIs that are well-suited to finding flaws and dangers in models and outputs similar to themselves. These dividends could lead to highly accurate training that makes it hard to learn unintended generalizations, and/or to deployment scenarios where it’s overly hard for human-level-ish AIs to coordinate against humans.
Intense red-teaming. I’ve written about how adversarial training could “undo” a good deal of misalignment that has popped up during training, especially if something like relaxed adversarial training (simulating inputs that one can’t actually produce) turns out to be feasible. It’s plausible to me that AI companies might invest heavily in this kind of work, without needing to be mostly motivated by existential risk reduction (they might be seeking intense guarantees against e.g. lawsuit-driving behavior by AI systems).
Training on internal states. I think interpretability research could be useful in many ways, but some require more “dignity” than I’m assuming here[7] and/or pertain to the “continuing alignment problem” (next section).[8] If we get lucky, though, we could end up with some way of training AIs on their own internal states that works at least well enough for the initial alignment problem.
Training AIs on their own internal states risks simply training them to manipulate and/or obscure their own internal states, but this may be too hard for human-level-ish AI systems, so we might at least get off the ground with something like this.
A related idea is finding a regularizer that penalizes e.g. dishonesty, as in Eliciting Latent Knowledge.
It’s pretty easy for me to imagine that a descendant of the Burns et al. 2022 method, or an output of the Eliciting Latent Knowledge agenda, could fit this general bill without needing any hugely surprising breakthroughs. I also wouldn’t feel terribly surprised if, say, 3 more equally promising approaches emerged in the next couple of years.
The deployment problem
Once someone has developed safe, powerful (human-level-ish) AI, the threat remains that:
- More advanced AI will be developed (including with the help of the human-level-ish AI), and it will be less safe, due to different development methods and less susceptibility to the basic countermeasures above.[9]
- As it gets cheaper and easier for anyone in the world to build powerful AI systems, someone will do so especially carelessly and/or maliciously.
The situation has now changed in a few ways:
- There’s now a lot more capacity for alignment research, threat assessment research (to make a more convincing case for danger and contribute to standards and monitoring), monitoring and enforcing standards, and more (because these things can be done by AIs). I think interpretability looks like a particularly promising area for “automated research” - AIs might grind through large numbers of analyses relatively quickly and reach a conclusion about the thought process of some larger, more sophisticated system.
- There’s also a lot more capacity for capabilities research that could lead to more advanced, more dangerous AI.
- For a good outcome, alignment research or threat assessment research doesn’t have to “keep up with” capabilities research for a long time - a strong demonstration of danger, or decisive/scalable alignment solution, could be enough.
It’s hard to say how all these factors will shake out. But it seems plausible that one of these things will happen:
- Some relatively cheap, easy, “scalable” solution to AI alignment (the sort of thing ARC is currently looking for) is developed and becomes widely used.
- Some decisive demonstration of danger is achieved, and AIs also help to create a successful campaign to persuade key policymakers to aggressively work toward a standards and monitoring regime. (This could be a very aggressive regime if some particular government, coalition or other actor has a lead in AI development that it can leverage into a lot of power to stop others’ AI development.)
- Something else happens to decisively change dynamics - for example, AIs turn out to be good enough at finding and patching security holes that the offense-defense balance in cybersecurity flips, and it becomes possible to contain even extremely capable AIs.
Any of these could lead to a world in which misaligned AI in the wild is at least rare relative to aligned AI. The advantage for humans+aligned-AIs could be self-reinforcing, as they use their greater numbers to push measures (e.g., standards and monitoring) to suppress misaligned AI systems.
I concede that we wouldn’t be totally out of the woods in this case - things might shake out such that highly-outnumbered misaligned AIs can cause existential catastrophe. But I think we should be optimistic by default from such a point. A footnote elaborates on this, addressing Steve Byrnes’s discussion of a related topic (which I quite liked and think raises good concerns, but isn’t decisive for the scenario I’m contemplating).[10]
More generally, I think it’s very hard to reason about a world with human-level-ish aligned AIs widely available (and initially outnumbering comparably powerful misaligned AIs), so I think we should not be too confident of doom starting from that point.
Some objections to this picture
The most common arguments I’ve heard for why this picture is hopeless involve some combination of:
- AI systems could quickly become very powerful relative to their supervisors, which means we have to confront a harder version of the alignment problem without first having human-level-ish aligned systems.
  - I think it’s certainly plausible this could happen, but I haven’t seen a reason to put it at >50%.
  - To be clear, I expect an explosive “takeoff” by historical standards. I want to give Tom Davidson’s analysis more attention, but it implies that there could be mere months between human-level-ish AI and far more capable AI (but that could be enough for a lot of work by human-level-ish AI).
  - One key question: to the extent that we can create a feedback loop with AI systems doing research to improve hardware and/or software efficiency (which then increases the size and/or capability of the “automated workforce,” enabling further research ...), will this mostly be via increasing the number of AIs or by increasing per-AI capabilities? There could be a feedback loop with human-level-ish AI systems exploding in number, which seems to present fewer (though still significant) alignment challenges than a feedback loop with AI systems exploding past human capability.[11]
- It’s arguably very hard to get even human-level-ish capabilities without ambitious misaligned aims. I discussed this topic at some length with Nate Soares - notes here. I disagree with this as a default (though, again, it’s plausible) for reasons given at that link.
- Expecting “offense-defense” asymmetries (as in this post) such that we’d get catastrophe even if aligned AIs greatly outnumber misaligned ones. Again, this seems plausible, but not the right default guess for how things will go, as discussed at the end of the previous section.
I think all of these arguments are plausible, but very far from decisive (and indeed each seems individually <50% likely to me).
Success without dignity
This section is especially hand-wavy and conversational. I probably don’t stand by what you’d get from reading any particular sentence super closely and taking it super seriously. I stand by some sort of vague gesture that this section is trying to make.
I have a high-level intuition that most successful human ventures look - from up close - like dumpster fires. I’m thinking of successful organizations - including those I’ve helped build - as well as cases where humans took highly effective interventions against global threats, e.g. smallpox eradication; recent advances in solar power that I’d guess are substantially traceable to subsidy programs; whatever reasons we haven’t had a single non-test nuclear detonation since 1945.
I expect the way AI risk is “handled by society” to look like a dumpster fire, in the sense that lots of good interventions will be left on the table, lots of very silly things will be done, and no intervention will be satisfyingly robust. Alignment measures will be fallible, standards regimes will be gameable, security setups will be imperfect, and even the best AI labs will have lots of incompetent and/or reckless people inside them doing scary things.
But I don’t think that automatically translates to existential catastrophe, and this distinction seems important. (An analogy: “that bednet has lots of gaping holes in it” vs. “That bednet won’t help” or “That person will get malaria.”) The future is uncertain; we could get lucky and stumble our way into a good outcome.
Furthermore, there are a number of interventions that could interact favorably with some baseline good luck. (I’ll discuss this more in a future post.)
One key strategic implication of this view that I think is particularly worth noting:
- I think there’s a common headspace that says something like: “We’re screwed unless we get a miracle. Hence, ~nothing matters except for (a) buying time for that miracle to happen and (b) optimizing heavily for attracting and supporting unexpectedly brilliant people with unexpectedly great ideas.”
- My headspace is something more like: “We could be doomed even in worlds where our interventions go as well as could be reasonably expected; we could be fine in worlds where they go ~maximally poorly; every little bit (of alignment research, of standards and monitoring, of security research, etc.) helps; and a lot of key interventions would benefit from things other than time and top intellectual talent - they’d benefit from alignment-concerned people communicating well, networking well, being knowledgeable about the existing AI state of the art, having good reputations with regulators and the general public, etc. etc. etc.”
- That is, in my headspace, there are lots of things that can help - which also means that there are lots of factors we need to worry about. Many are quite ugly and unpleasant to deal with (e.g., PR and reputation). And there are many gnarly tradeoffs with no clear answer - e.g., I think there are things that hurt community epistemics[12] and/or risk making the situation worse[13] that still might be right to do.
- I have some suspicion that the first headspace is self-serving for people who really don’t like dealing with that stuff and would rather focus exclusively on trying to do/support/find revolutionary intellectual inquiry. I don’t normally like making accusations like this (they rarely feel constructive) but in this case it feels like a bit of an elephant in the room - it seems like quite a strange view on priors to believe that revolutionary intellectual inquiry is the “whole game” for ~any goal, especially on the relatively short timelines many people have for transformative AI.
I don’t feel emotionally attached to my headspace. It’s nice to not think we’re doomed, but not a very big deal for me,[14] and I think I’d enjoy work premised on the first headspace above at least as much as work premised on the second one.
The second headspace is just what seems right at the moment. I haven’t seen convincing arguments that we won’t get lucky, and it seems to me like lots of things can amplify that luck into better odds of success. If I’m missing something correctible, I hope this will prompt discussion that leads there.
Like >10% ↩
Since another way of putting it is: “AI takeover (a pretty specific event) is not certain (conditioned on the ‘minimal-dignity’ conditions above, which don’t seem to constrain the future a ton).” ↩
Phase 1 in this analysis ↩
Phase 2 in this analysis ↩
I think there are alternative ways things could go well, which I’ll cover in the relevant section, so I don’t want to be stuck with a “pivotal acts” frame. ↩
Salient possible additions to today’s models:
- Greater scale (more parameters, more pretraining)
- Multimodality (training the same model on language + images or perhaps video)
- Memory/long contexts: it seems plausible that some relatively minor architectural modification could make today’s language models much better at handling very long contexts than today’s cutting-edge systems, e.g. they could efficiently identify which parts of an even very long context ought to be paid special attention at any given point. This could imaginably be sufficient for them to be “taught” to do tasks, in roughly the way humans are (e.g., I might give an AI a few examples of a successfully done task, ask it to try, critique it, and repeat this loop over the course of hundreds of pages of “teaching” - note that the “teaching” is simply building up a context it can consult for its next step, it is not using gradient descent).
- Scaffolding: a model somewhat like today’s cutting-edge models could be put in a setting where it’s able to delegate tasks to copies of itself. Such tasks might include things like “Think about how to accomplish X, and send me some thoughts” and “That wasn’t good enough, think more please.” In this way, it could be able to vary the amount of “thought” and effort it puts into different aspects of its task. It could also be given access to some basic actuators (shell access might be sufficient). None of this need involve further training, and it could imaginably give an AI enough of the functionality of things like “memory” to be quite capable.
It’s not out of the question to me that we could get to transformative AI with additions like this, and with the vast bulk of the training still just being generative pretraining. ↩
E.g., I think interpretability could be very useful for demonstrating danger, which could lead to a standards-and-monitoring regime, but such a regime would be a lot more “dignified” than the worlds I’m picturing in this post. ↩
I think interpretability is very appealing as something that large numbers of relatively narrow “automated alignment researchers” could work on. ↩
Debate-type setups seem like they would get harder for humans to adjudicate as AI systems advance; more advanced AI seems harder to red-team effectively without its noticing “tells” re: whether it’s in training; internal-state-based training seems more likely to result in “manipulating one’s own internal states” for more advanced AI. ↩
Byrnes’s post seems to assume there are relatively straightforward destruction measures that require draconian, scary “plans” to stop. (Contrast with my discussion here, in which AIs can be integrated throughout the economy in ways that make it harder for misaligned AIs to “get off the ground” with respect to being developed, escaping containment and acquiring resources.)
- I don’t think this is the right default/prior expectation, given that we see little evidence of this sort of dynamic in history to date. (Relatively capable people who want to cause widespread destruction even at cost to themselves are rare, but do periodically crop up and don’t seem to have been able to effect these sorts of dynamics to date. Individuals have done a lot of damage by building followings and particularly via government power, but this seems very different from the type of dynamic discussed in Byrnes’s post.)
- One could respond by pointing to particular vulnerabilities and destruction plans that seem hard to stop, but I haven’t been sold on anything along these lines, especially when considering that a relatively small number of biological humans’ surviving could still be enough to stop misaligned AIs (if we posit that aligned AIs greatly outnumber misaligned AIs). And I think misaligned AIs are less likely to cause any damage if the odds are against ultimately achieving their aims.
- I note that Byrnes’s post also seems to assume that it’s greatly expensive and difficult to align an AI (I conjecture that it may not be, above).
The latter, more dangerous possibility seems more likely to me, but it seems quite hard to say. (There could also of course be a hybrid situation, as the number and capabilities of AI grow.) ↩
I think optimizing for community epistemics has real downsides, both via infohazards/empowering bad actors and via reputational risks/turning off people who could be helpful. I wish this weren’t the case, and in general I heuristically tend to want to value epistemic virtue very highly, but it seems like it’s a live issue - I (reluctantly) don’t think it’s reasonable to treat “X is bad for community epistemics” as an automatic argument-ender about whether X is bad (though I do think it tends to be a very strong argument). ↩
E.g., working for an AI lab and speeding up AI (I plan to write more about this).
More broadly, it seems to me like essentially all attempts to make the most important century go better also risk making it go a lot worse, and for anyone out there who might’ve done a lot of good to date, there are also arguments that they’ve done a lot of harm (e.g., by raising the salience of the issue overall).
Even “Aligned AI would be better than misaligned AI” seems merely like a strong bet to me, not like a >95% certainty, given what I see as the appropriate level of uncertainty about topics like “What would a misaligned AI actually do, incorporating acausal trade considerations and suchlike?”; “What would humans actually do with intent-aligned AI, and what kind of universe would that lead to?”; and “How should I value various outcomes against each other, and in particular how should I think about hopes of very good outcomes vs. risks of very bad ones?”
To reiterate, on balance I come down in favor of aligned AI, but I think the uncertainties here are massive - multiple key questions seem broadly “above our pay grade” as people trying to reason about a very uncertain future. ↩
I’m a person who just doesn’t pretend to be emotionally scope-sensitive or to viscerally feel the possibility of impending doom. I think it would be hard to do these things if I tried, and I don’t try because I don’t think that would be good for anyone.
I like doing worthy-feeling work (I would be at least as happy with work premised on a “doomer” worldview as on my current one) and hanging out with my family. My estimated odds that I get to live a few more years vs. ~50 more years vs. a zillion more years are quite volatile and don’t seem to impact my daily quality of life much. ↩
Thanks for laying out these points! After having engaged with many people's thoughts on these issues I'm similarly unconvinced about the very unfavourable odds many people seem to assign, so I really look forward to the discussion here.
I'm particularly curious about this point, because when I think about AI risk scenarios I put quite some stock into the potential for very direct government interventions when the risks are more obvious and more clearly a near term problem:
AI seems to me to already clearly be among the top priorities for geopolitical considerations for the US, and it seems like when this is the case the space of options is fairly unrestricted.
Thanks for the post!
I really like these points. It is often easy to forget how uncertain the future is.
I think this is an important point. I consider the question in this paper, published last year in AI Magazine. See the "Competing Models of the Goal" section, and in particular the "Arbitrary Reward Protocols" subsection. (2500 words)
I think there's something missing from the discussion here, which is the key point of that section. First, I claim that sufficiently advanced agents will likely need to engage in hypothesis testing between multiple plausible models of what worldly events lead to reinforcement, or else they would fail at certain tasks. So even if the "intended generalization" is quite a bit more plausible to the agent than the unintended one, as long as it is cheap to test them, and as long as it has a long horizon, it would likely deem wireheading to be worth trying out, just in case. That said, in some situations (I mention a chess game in the paper) I expect the intended generalization to be so much simpler that it isn't even worth trying out.
Just a warning before you read it, I use the word "reward" a bit differently than you appear to. In my terminology, I would phrase this as "Do what the supervisor is hoping" vs. "Do whatever maximizes the relevant physical signal", and the agent would essentially wonder which of the two constitutes "reward", rather than being a priori sure that its past rewards "are" those physical signals.
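To make the "cheap test, long horizon" argument above a bit more concrete, here is a toy expected-value sketch. It is not taken from the paper: the function name, the simple decision rule (test if prior × per-step gain × horizon exceeds the one-off cost of testing), and all the numbers are illustrative assumptions.

```python
def worth_testing(p_unintended, gain_per_step, horizon, test_cost):
    """Toy rule (an assumption, not the paper's model): a one-off test of the
    unintended hypothesis is worthwhile if its expected payoff over the
    remaining horizon exceeds what the test costs."""
    expected_gain = p_unintended * gain_per_step * horizon
    return expected_gain > test_cost


# Hypothetical numbers, chosen only for illustration.
# Unintended hypothesis is unlikely (1%), but the horizon is long and the test is cheap.
print(worth_testing(p_unintended=0.01, gain_per_step=1.0, horizon=10_000, test_cost=5.0))

# Same prior and cost, but a short horizon: the test no longer pays for itself.
print(worth_testing(p_unintended=0.01, gain_per_step=1.0, horizon=100, test_cost=5.0))

# Chess-like case: the intended hypothesis is so much simpler that the prior
# on the unintended one is negligible, so even a long horizon does not help.
print(worth_testing(p_unintended=1e-9, gain_per_step=1.0, horizon=10_000, test_cost=5.0))
```

Under this rule, a low prior on the unintended generalization is not enough by itself to rule out testing it; what matters is the prior multiplied by the horizon, relative to the cost of the test.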