(Note: This post is probably old news for most readers here, but I find myself repeating this surprisingly often in conversation, so I decided to turn it into a post.)
I don't usually go around saying that I care about AI "safety". I go around saying that I care about "alignment" (although that word is slowly sliding backwards on the semantic treadmill, and I may need a new one soon).
But people often describe me as an “AI safety” researcher to others. This seems like a mistake to me, since it's treating one part of the problem (making an AGI "safe") as though it were the whole problem, and since “AI safety” is often misunderstood as meaning “we win if we can build a useless-but-safe AGI”, or “safety means never having to take on any risks”.
Following Eliezer, I think of an AGI as "safe" if deploying it carries no more than a 50% chance of killing more than a billion people:
When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, "please don't disassemble literally everyone with probability roughly 1" is an overly large ask that we are not on course to get. [...] Practically all of the difficulty is in getting to "less than certainty of killing literally everyone". Trolley problems are not an interesting subproblem in all of this; if there are any survivors, you solved alignment. At this point, I no longer care how it works, I don't care how you got there, I am cause-agnostic about whatever methodology you used, all I am looking at is prospective results, all I want is that we have justifiable cause to believe of a pivotally useful AGI 'this will not kill literally everyone'.
Notably absent from this definition is any notion of “certainty” or "proof". I doubt we're going to be able to prove much about the relevant AI systems, and pushing for proofs does not seem to me to be a particularly fruitful approach (and never has; the idea that this was a key part of MIRI’s strategy is a common misconception about MIRI).
On my models, making an AGI "safe" in this sense is a bit like finding a probabilistic circuit that beats chance: if some circuit gives you the right answer with 51% probability, then it's probably not that hard to drive the success probability significantly higher than that (for instance, by running it repeatedly and taking a majority vote).
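Here's a toy sketch of that amplification intuition, using the standard majority-vote trick. (This is my own illustration of the bare statistics, not anything from the alignment problem itself; the per-call accuracy and run counts are arbitrary choices.)

```python
# Toy illustration: a "circuit" that answers a yes/no question correctly with
# probability 0.51, amplified by running it many times independently and
# taking a majority vote over the answers.
import random

def noisy_circuit(truth: bool, p_correct: float = 0.51) -> bool:
    """Return the correct answer with probability p_correct, else the wrong one."""
    return truth if random.random() < p_correct else (not truth)

def amplified_answer(truth: bool, runs: int, p_correct: float = 0.51) -> bool:
    """Majority vote over `runs` independent calls to the noisy circuit."""
    votes_for_true = sum(noisy_circuit(truth, p_correct) for _ in range(runs))
    return votes_for_true > runs / 2

def estimate_accuracy(runs: int, trials: int = 1000) -> float:
    """Fraction of trials in which the majority vote matches the truth."""
    truth = True  # the setup is symmetric, so fixing the truth loses nothing
    return sum(amplified_answer(truth, runs) == truth for _ in range(trials)) / trials

if __name__ == "__main__":
    for runs in (1, 101, 1001, 5001):
        print(f"{runs:>5} runs -> empirical accuracy ~ {estimate_accuracy(runs):.2f}")
```

With only a 1% edge per call you need a lot of runs, but the point is the shape of the curve: once you're reliably above chance at all, a Chernoff-bound argument says the failure probability of the majority vote falls off exponentially in the number of runs.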
If anyone can deploy an AGI that is less than 50% likely to kill more than a billion people, then they've probably... well, they've probably found a way to keep their AGI weak enough that it isn’t very useful. But if they can do that with an AGI capable of ending the acute risk period, then they've probably solved most of the alignment problem. Meaning that it should be easy to drive the probability of disaster dramatically lower.
The condition that the AI actually be useful for pivotal acts is an important one. We can already build AI systems that are “safe” in the sense that they won’t destroy the world. The hard part is building a system that is both safe and useful enough to matter.
Another concern with the term “safety” (in anything like the colloquial sense) is that the sort of people who use it often endorse the "precautionary principle" or other such nonsense that advocates never taking on risks even when the benefits clearly dominate.
In ordinary engineering, we recognize that safety isn’t infinitely more important than everything else. The goal here is not "prevent all harms from AI", the goal here is "let's use AI to produce long-term near-optimal outcomes (without slaughtering literally everybody as a side-effect)".
Currently, what I expect to happen is that humanity destroys itself with misaligned AGI. And I think we’re nowhere near knowing how to avoid that outcome. So the threat of “unsafe” AI looms extremely large (if anything, “unsafe” understates the point), and I endorse researchers doing less capabilities work and publishing less, in the hope that this gives humanity enough time to figure out how to do alignment before it’s too late.
But I view this strategic situation as part of the larger project “cause AI to produce optimal long-term outcomes”. I continue to think it's critically important for humanity to build superintelligences eventually, because whether or not the vast resources of the universe are put towards something wonderful depends on the quality and quantity of cognition that is put to this task.
If using the label “AI safety” for this problem causes us to confuse a proxy goal (“safety”) for the actual goal “things go great in the long run”, then we should ditch the label. And likewise, we should ditch the term if it causes researchers to mistake a hard problem (“build an AGI that can safely end the acute risk period and give humanity breathing-room to make things go great in the long run”) for a far easier one (“build a safe-but-useless AI that I can argue counts as an ‘AGI’”).
Thanks for the thoughtful response. My original comment was simply to note that some people disagree with the pivotal act framing, but it didn’t really offer an alternative and I’d like to engage with the problem more.
I think we have a few worldview differences that drive our disagreement about how to limit AI risk, given solutions to the technical alignment challenges. Maybe you’d agree with me in some of these places, but here are a few candidates:
Stronger AI can protect us against weaker AI. When you imagine a world where anybody can train an AGI at home, you conclude that anybody will be able to destroy the world from home. I would expect that governments and corporations will maintain a sizable lead over individuals, meaning that individuals cannot take over the world. They wouldn’t necessarily need to preempt the creation of an AGI; they could simply contain it afterwards, by denying it access to resources and exposing its plans for world destruction. This is especially true in worlds where intelligence alone cannot take over the world, and instead requires resources or cooperation between entities, as argued in Section C of Katja Grace’s recent post. I could see some of these proposals overlapping with your definition of a pivotal act, though I have more of a preference for multilateral and government action.
Government AI policy can be competent. Our nuclear non-proliferation regime is strong: only nine countries have nuclear weapons. Gain-of-function research is a strong counterexample, but the Biden administration’s export controls on selling advanced semiconductors to China for national security purposes again support the idea of government competence. Strong government action seems possible given either (a) significant AI warning shots or (b) convincing mainstream ML and policy leaders of the danger of AI risk. When Critch suggested that governments build weapons to monitor and disable rogue AGI projects, Eliezer said it’s not realistic but would be incredible if accomplished. Those are the kinds of proposals I’d want to popularize early.
I have longer timelines, expect a more distributed takeoff, and am more optimistic about the chances of human survival than I expect you are. My plan for preventing AI x-risk is to solve the technical problems, and to convince influential people in ML and policy that the solutions must be implemented. They can then build aligned AI, and employ measures like compute controls and monitoring of large projects to ensure widespread implementation. If it turns out that my worldview is wrong and an AI lab invents a single AGI that could destroy the world relatively soon, I’d be much more open to dramatic pivotal acts that I’m not excited about in my mainline scenario.
Three more targeted replies to your comments:
Your proposed pivotal act in your reply to Critch seems much more reasonable to me than “burn all GPUs”. I’m still fuzzy on the details of how you would uncover all potential AGI projects before they get dangerous, and what you would do to stop them. Perhaps more crucially, I wouldn’t be confident that we’ll have AI that can run whole brain emulation of humans before we have AI that brings x-risk, because WBE would likely require experimental evidence from human brains that early advanced AI will not have.
I strongly agree with the need for more honest discussions about pivotal acts / how to make AI safe. I’m very concerned by the fact that people have opinions they wouldn’t share, even within the AI safety community. One benefit of more open discussion could be reduced stigma around the term — my negative association comes from the framing of a single dramatic action that forever ensures our safety, perhaps via coercion. “Burn all GPUs” exemplifies these failure modes, but I might be more open to alternatives.
I really like “don’t leave your fingerprints on the future.” If more dramatic pivotal acts are necessary, I’d endorse that mindset.
This was interesting to think about, and I’d be happy to answer any other questions. In particular, I’m trying to think through how to ensure ongoing safety in Ajeya’s HFDT world. The challenge is implementation, assuming somebody has solved deceptive alignment using e.g. interpretability, adversarial training, or training strategies that exploit inductive biases. Generally speaking, I think you’d have to convince the heads of Google, Facebook, and other organizations that can build AGI that these safety procedures are technically necessary. This is a tall order but not impossible. Once the leading groups are all building aligned AGIs, maybe you can promote ongoing safety either with normal policy (e.g. compute controls) or AI-assisted monitoring (your proposal or Critch’s EMPs). I’d like to think about this more but have to run.