[Added 13Jun: Submitted to OpenPhil AI Worldviews Contest - this pdf version most up to date]
This is an accompanying post to AGI rising: why we are in a new era of acute risk and increasing public awareness, and what to do now.
If you apply a security mindset (Murphy’s Law) to the problem of AI alignment, it should quickly become apparent that it is very difficult. And the slow progress of the field to date is further evidence for this. There are 4 major unsolved components to alignment:
And a variety of corresponding threat models.
Any given approach that might show some promise on one or two of these still leaves the others unsolved. We are so far from being able to address all of them. And not only do we have to address them all, but we have to have totally watertight solutions, robust to AI being superhuman in its capabilities. In the limit of superintelligent AI, the alignment needs to be perfect.
OpenAI brags about a 29% improvement in alignment in their GPT-4 announcement. This is not going to cut it! The alignment paradigms used for today’s LLMs only appear to make them relatively safe because the AIs are weaker than us. If the “grandma’s bedtime story napalm recipe” prompt engineering hack actually led to the manufacture of napalm, it would be immediately obvious how poor today’s level of practical AI alignment is.
Plans that involve increasing AI input into alignment research appear to rest on the assumption that they can be grounded by a sufficiently aligned AI at the start. But how does this not just result in an infinite, error-prone regress? Such “getting the AI to do your alignment homework” approaches are not safe ways of avoiding doom. Tegmark says that he can use his own proof checker to verify that a superintelligent AI’s complicated plans are safe. But (security mindset): how do you stop physical manipulation of your proof checker’s output (a Rowhammer-style attack, or something more subtle)?
The above considerations are the basis for the case that disjunctive reasoning should predominantly be applied to AI x-risk: the default is doom. All the doom flows through the tiniest crack of imperfect alignment once the power level of the AI is superhuman. Every possible exploit, loophole or unknown that could engender catastrophic risk needs to be patched for this to not be the case (or we have to get very lucky).
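The disjunctive structure of this argument can be made concrete with a toy calculation. A minimal sketch (all numbers here are illustrative assumptions, not estimates): if survival requires solving every one of n independent failure modes, even modest per-component failure probabilities compound quickly.

```python
# Toy model of disjunctive risk: doom occurs if ANY failure mode remains
# unsolved, so survival requires patching all of them.
# All numbers are illustrative assumptions, not actual estimates.

def p_doom(n_failure_modes: int, p_unsolved_each: float) -> float:
    """P(doom) when each of n independent failure modes independently
    remains unsolved with probability p_unsolved_each."""
    p_survive = (1 - p_unsolved_each) ** n_failure_modes
    return 1 - p_survive

for p in (0.1, 0.3, 0.5):
    print(f"4 components, each {p:.0%} unsolved -> P(doom) = {p_doom(4, p):.0%}")
```

With four components each having even a 30% chance of remaining unsolved, doom-by-default is already the ~76% outcome; conjunctive “we’re fine” reasoning has to win on every component at once.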
Despite this, many people in EA who take AI x-risk seriously put P(doom|AGI) in the 1-10% range. I am struggling to understand this. What is happening in the other 90-99%? How is it that it appears that the default is “we’re fine”? I have asked this here, and not had any satisfactory answers.
My read so far is that low estimates for P(doom|AGI) are either borne of ignorance of what the true difficulties in AI alignment are; stem from wishful thinking or a lack of security mindset; or are a social phenomenon in which people want to sound respectable and non-alarmist; rather than resting on any sound technical argument. I would really appreciate it if someone could provide a detailed technical argument for believing P(doom|AGI)≤10%.
To go further in making the case for doom by default given AGI, there are reasons to believe that alignment might actually be impossible (i.e. superintelligence is both unpredictable and uncontrollable). Yampolskiy is of the opinion that “lower level intelligence cannot indefinitely control higher level intelligence”. This is far from being a consensus view though, with many people in AI Alignment considering that we have barely scratched the surface in looking for solutions. What is clear is that we need more time!
The case for the reality of existential risk has been well established. Indeed, doom even looks to be a highly likely outcome from the advent of the AGI era. Common sense suggests that the burden of proof is now on AI labs to prove their products are safe (in terms of global catastrophe and extinction risk).
There are two prominent camps in the AI Alignment field when it comes to doom and foom. The Yudkowsky view has very high P(doom|AGI) and a fast take-off happening for the transition from AGI to ASI (Artificial Superintelligence). The Christiano view is more moderate on both axes. However, it’s worth noting that even Christiano is ~50% on doom, and ~a couple of years from AGI to ASI (a “slow” take-off that then speeds up). Metaculus predicts weak AGI within 3 years, and takeoff to happen in less than a year.
50% ≤ P(doom|AGI) < 100% means that we can confidently make the case for the AI capabilities race being a suicide race. It then becomes a personal issue of life-and-death for those pushing forward the technology. Perhaps Sam Altman or Demis Hassabis really are okay with gambling 800M lives in expectation on a 90% chance of utopia from AGI? (Despite a distinct lack of democratic mandate.) But are they okay with such a gamble when the odds are reversed? Taking the bet when it’s 90% chance of doom is not only highly morally problematic, but also, frankly, suicidal and deranged.
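The 800M figure is simple expected-value arithmetic, assuming a world population of roughly 8 billion; a minimal sketch:

```python
# Expected lives lost from a gamble on AGI, assuming a world population
# of ~8 billion. The probabilities are this post's illustrative numbers,
# not forecasts.

WORLD_POPULATION = 8_000_000_000

def expected_deaths(p_doom: float, population: int = WORLD_POPULATION) -> float:
    """Expected deaths if doom kills everyone with probability p_doom."""
    return p_doom * population

print(f"{expected_deaths(0.10):,.0f}")  # 10% doom: 800,000,000 in expectation
print(f"{expected_deaths(0.90):,.0f}")  # odds reversed: 7,200,000,000
```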
What to do about this? Join in pushing for a global moratorium on AGI. For more, see the accompanying post to this one (of which this was originally written as a subsection).
Although some would argue that we still haven’t really tried.
We can operationalise "doom" as Ord's definition of "the greater part of our potential is gone and very little remains"; although I pretty much think of it as everyone being killed – squiggled or equivalent – so that ~0 value remains. If you are less inclined to foom scenarios, think of it as a global catastrophe that kills >10% of the world's population and ends civilisation as we know it.
Relatedly, Richard Ngo laments a lack of plausible rebuttals to the core claims of AGI x-risk.
I’m hereby offering up to $1000 for someone to provide one, on a level that takes into account everything written in this section, and the accompanying post. Please also link me to anything already written, and not referenced in any answers here, that you think is relevant. I’m very interested in further steelmanning the case for high P(doom|AGI). I will note that someone on Twitter has linked me to an unfinished draft that has a high degree of detail and seems promising (but not promising enough to update me significantly away from doom. Still, I’m offering them $1k to finish it). See also, the OpenPhil AI Worldviews Contest (deadline is May 31 and there haven’t been many submissions to date).
If this is the case, it’s a rather depressing thought. We may be far closer to the Dune universe than the Culture one (the worry driving a future Butlerian Jihad being the advancement of AGI algorithms to the point where individual laptops and phones could end the world). For those who may worry about the loss of the “glorious transhumanist future”, and in particular radical life extension and cryonic reanimation (I’m in favour of these things), I think there is some consolation in the thought that, even if a really strong taboo emerges around AGI, to the point of stopping all algorithmic advancement, we could still achieve these ends using standard supercomputers, bioinformatics and human scientists. I hope so.
See also, Geoffrey Miller: “Our Bayesian prior, based on the simple fact that different sentient beings have different interests, values, goals, and preferences, must be that AI alignment with 'humanity in general', or 'sentient life in general', is simply not possible. Sad, but true.” Worth reading the whole comment this quote is taken from.
To work on Alignment and/or prove its impossibility.
See also his appearance on the Bankless podcast.
Re doom “only” being 50%: “it’s not suicide, it’s a coin flip; heads utopia, tails you’re doomed” seems like quibbling at this point. [I wrote “you’re doomed” rather than “we’re doomed” because when CEOs hear “we’re”, they’re probably often thinking “not we’re, you’re. I’ll be alright in my secure compound in NZ”. But they really won’t be! If the shit hits the fan with this and there are survivors, do they think the vast majority of those survivors won’t want justice?]
This is my approximate estimate for P(doom|AGI). And I’ll note here that the remaining 10% for “we’re fine” is nearly all exotic exogenous factors (the simulation hypothesis, moral realism – or valence realism? – being true, consciousness, DMT aliens being real, etc.) that I really don’t think we can rely on to save us!
You might be thinking, “but they are trapped in an inadequate equilibrium of racing”. Yes, but remember it’s a suicide race! At some point someone really high up in an AI company needs to turn their steering wheel. They say they are worried, but actions speak louder than words. Take the financial/legal/reputational hit and just quit! Make a big public show of it. Pull us back from the brink.
Update: 3 days after I wrote the above, Geoffrey Hinton, the “Godfather of AI”, has left Google to warn of the dangers of AI.