Note: As usual, Rob Bensinger helped me with editing. I recently discussed this model with Alex Lintz, who might soon post his own take on it (edit: here).

 

Some people seem to be under the impression that I believe AGI ruin is a small and narrow target to hit. This is not so. My belief is that most of the outcome space is full of AGI ruin, and that avoiding it is what requires navigating a treacherous and narrow course.

So, to be clear, here is a very rough model of why I think AGI ruin is likely. (>90% likely in our lifetimes.)[1]

My real models are more subtle, take into account more factors, and are less articulate. But people keep coming to me saying "it sounds to me like you think humanity will somehow manage to walk a tightrope, traverse an obstacle course, and thread a needle in order to somehow hit the narrow target of catastrophe, and I don't understand how you're so confident about this". (Even after reading Eliezer's AGI Ruin post—which I predominantly agree with, and which has a very disjunctive character.)

Hopefully this sort of toy model will at least give you some vague flavor of where I’m coming from.

 

Simplified Nate-model

The short version of my model is this: from the current position on the game board, a lot of things need to go right, if we are to survive this.

In somewhat more detail, the following things need to go right:

  • The world’s overall state needs to be such that AI can be deployed to make things good. A non-exhaustive list of things that need to go well for this to happen follows:
    • The world needs to admit of an AGI deployment strategy (compatible with realistic alignable-capabilities levels for early systems) that prevents the world from being destroyed if executed.
    • At least one such strategy needs to be known and accepted by a leading organization.
    • Somehow, at least one leading organization needs to have enough time to nail down AGI, nail down alignable AGI, actually build+align their system, and deploy their system to help.
      • This very likely means that there needs to either be only one organization capable of building AGI for several years, or all the AGI-capable organizations need to be very cautious and friendly and deliberately avoid exerting too much pressure upon each other.
    • It needs to be the case that no local or global governing powers flail around (either prior to AGI, or during AGI development) in ways that prevent a (private or public) group from saving the world with AGI.
  • Technical alignment needs to be solved to the point where good people could deploy AI to make things good. A non-exhaustive list of things that need to go well for this to happen follows:
    • There need to be people who think of themselves as working on technical alignment, whose work is integrated with AGI development and is a central input into how AGI is developed and deployed.
    • They need to be able to perceive every single lethal problem far enough in advance that they have time to solve them.
    • They need to be working on the problems in a way that is productive.
    • The problems (and the general paradigm in which they're attacked) need to be such that people's work can stack, or such that they don't require much serial effort; or the research teams need a lot of time.
    • Significant amounts of this work have to be done without an actual AGI to study and learn from; or the world needs to be able to avoid deploying misaligned AGI long enough for the research to complete.
  • The internal dynamics at the relevant organizations need to be such that the organizations deploy an AGI to make things good. A non-exhaustive list of things that need to go well for this to happen follows:
    • The teams that first gain access to AGI need to care in the right ways about AGI alignment.
    • The internal bureaucracy needs to be able to distinguish alignment solutions from fake solutions, quite possibly over significant technical disagreement.
      • This ability very likely needs to hold up in the face of immense social and time pressure.
    • People inside the organization need to be able to detect dangerous warning signs.
    • Those people might need very large amounts of social capital inside the organization.
    • While developing AGI, the team needs to avoid splintering or schisming in ways that result in AGI tech proliferating to other organizations, new or old.
    • The team otherwise needs to avoid (deliberately or accidentally) leaking AGI tech to the rest of the world during the development process.
    • The team likewise needs to avoid leaking insights to the wider world prior to AGI, insofar as accumulating proprietary insights enables the group to have a larger technical lead, and insofar as a larger technical lead makes it possible for you to e.g. have three years to figure out alignment once you reach AGI, as opposed to six months.

(I could also add a list of possible disasters from misuse, conditional on us successfully navigating all of the above problems. But conditional on us clearing all of the above hurdles, I feel pretty optimistic about the relevant players’ reasonableness, such that the remaining risks seem much more moderate and tractable to my eye. Thus I’ll leave out misuse risk from my AGI-ruin model in this post; e.g., the ">90% likely in our lifetimes" probability is just talking about misalignment risk.)

One way that this list is a toy model is that it’s assuming we have an actual alignment problem to face, under some amount of time pressure. Alternatives include things like getting (fast, high-fidelity) whole-brain emulation before AGI (which comes with a bunch of its own risks, to be clear). The probability that we somehow dodge the alignment problem in such a way puts a floor on how low models like the above can drive the probabilities of success down (though I’m pessimistic enough about the known-to-me non-AGI strategies that my unconditional p(ruin) is nonetheless >90%).

Some of these bullets trade off against each other: sufficiently good technical solutions might obviate the need for good AGI-team dynamics or good global-scale coordination, and so on. So these factors aren't totally disjunctive. But this list hopefully gives you a flavor for how it looks to me like a lot of separate things need to go right, simultaneously, in order for us to survive, at this point. Saving the world requires threading the needle; destroying the world is the default.

 

Correlations and general competence

You may object: "But Nate, you've warned of the multiple-stage fallacy; surely here you're guilty of the dual fallacy? You can't say that doom is high because three things need to go right, and multiply together the lowish probabilities that all three go right individually, because these are probably correlated."

Yes, they are correlated. They're especially correlated through the fact that the world is derpy.

This is the world where the US federal government's response to COVID was to ban private COVID testing, confiscate PPE bought by states, and warn citizens not to use PPE. It's a world where most of the focus on technical AGI alignment comes from our own local community and takes up a tiny fraction of the field, and where most of that work doesn't seem to me to be even trying, by its authors' own lights, to engage with what look to me like the lethal problems.

Some people like to tell themselves that surely we'll get an AI warning shot and that will wake people up; but this sounds to me like wishful thinking from the world where the world has a competent response to the pandemic warning shot we just got.

So yes, these points are correlated. The ability to solve one of these problems is evidence of ability to solve the others, and the good news is that no amount of listing out more problems can drive my probability lower than the probability that I'm simply wrong about humanity's (future) competence. Our survival probability is greater than the product of the probabilities of solving each individual challenge.
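To make the correlation point concrete, here is a minimal sketch of a toy calculation. It assumes a single hidden "competence" variable: with probability `q` the world is competent and solves each challenge with probability `p_hi`; otherwise it solves each with probability `p_lo`. The function names and every number are hypothetical illustrations, not estimates from this post.

```python
def marginal_success(q: float, p_hi: float, p_lo: float) -> float:
    """Probability of solving any single challenge, averaged over competence."""
    return q * p_hi + (1 - q) * p_lo


def naive_product(q: float, p_hi: float, p_lo: float, n: int) -> float:
    """Multiply the marginals as if the n challenges were independent."""
    return marginal_success(q, p_hi, p_lo) ** n


def joint_success(q: float, p_hi: float, p_lo: float, n: int) -> float:
    """True probability of solving all n challenges in this toy model.

    Challenges are independent only *given* the competence level,
    so we condition on competence first and then average.
    """
    return q * p_hi ** n + (1 - q) * p_lo ** n


if __name__ == "__main__":
    # Hypothetical inputs: 10% chance the world is competent.
    q, p_hi, p_lo = 0.1, 0.9, 0.2
    for n in (3, 10, 30):
        print(n, naive_product(q, p_hi, p_lo, n), joint_success(q, p_hi, p_lo, n))
```

In this sketch the joint probability always sits at or above the naive product, and it never falls below `q * p_hi ** n`. If a competent world would solve everything (`p_hi = 1`), adding more challenges cannot push the joint probability below `q`, which mirrors the floor described above.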

The bad news is that we seem pretty deep in the competence-hole. We are not one mere hard shake away from everyone snapping to our sane-and-obvious-feeling views. You shake the world, and it winds up in some even stranger state, not in your favorite state.

(In the wake of the 2012 US presidential elections, it looked to me like there was clearly pressure in the US electorate that would need to be relieved, and I was cautiously optimistic that maybe the pressure would force the left into some sort of atheistish torch-of-the-enlightenment party and the right into some sort of libertarian individual-rights party. I, uh, wasn't wrong about there being pressure in the US electorate, but the 2016 US presidential elections were not exactly what I was hoping for. But I digress.)

Regardless, there's a more general sense that a lot of things need to go right, from here, for us to survive; hence all the doom. And, lest you wonder what sort of single correlated already-known-to-me variable could make my whole argument and confidence come crashing down around me, it's whether humanity's going to rapidly become much more competent about AGI than it appears to be about everything else.

(This seems to me to be what many people imagine will happen to the pieces of the AGI puzzle other than the piece they’re most familiar with, via some sort of generalized Gell-Mann amnesia: the tech folk know that the technical arena is in shambles, but imagine that policy has the ball, and vice versa on the policy side. But whatever.)

So that's where we get our remaining probability mass, as far as I can tell: there's some chance I'm wrong about humanity's overall competence (in the nearish future); there's some chance that this whole model is way off-base for some reason; and there's a teeny chance that we manage to walk this particular tightrope, traverse this particular obstacle course, and thread this particular needle.

And again, I stress that the above is a toy model, rather than a full rendering of all my beliefs on the issue. Though my real model does say that a bunch of things have to go right, if we are to succeed from here.

 

  1. ^

    In stark contrast to the multiple people I’ve talked to recently who thought I was arguing that there's a small chance of ruin, but the expected harm is so large as to be worth worrying about. No.


6 comments

It's concerning to me that the probability of "early rogue AI will inevitably succeed in defeating us" is not only taken to be near 100%, it's not even stated as a premise! Regardless of what you think of that position (I'm preparing a post on why I think the probability is actually quite low), this is not a part of the equation you can just ignore. 

Another quibble is that "alignment problem" and "existential risk" are taken to be synonymous. It's quite possible for the former to be real but not the latter. (I.e., you think the AI will do things we don't want it to do, but you don't think those things will necessarily involve human extinction.)

My basic reason for thinking "early rogue [AGI] will inevitably succeed in defeating us" is:

  • I think human intelligence is crap. E.g.:
    • Human STEM ability occurs in humans as an accidental side-effect: our brains underwent zero selection for the ability to do STEM in our EEA, and barely any selection to optimize this skill since the Scientific Revolution. We should expect that much more is possible when humans are deliberately optimizing brains to be good at STEM.
    • There are many embarrassingly obvious glaring flaws in human reasoning.
    • One especially obvious example is "ability to think mathematically at all". This seems in many respects like a reasoning ability that's both relatively simple (it doesn't require grappling with the complexity of the physical world) and relatively core. Yet the average human can't even do trivial tasks like 'multiply two eight-digit numbers together in your head in under a second'. This gap on its own seems sufficient for AGI to blow humans out of the water.
    • (E.g., I expect there are innumerable scientific fields, subfields, technologies, etc. that are easy to find when you can hold a hundred complex mathematical structures in your 200 slots of working memory simultaneously and perceive connections between those structures. Many things are hard to do across a network of separated brains, calculators, etc. that are far easier to do within a single brain that can hold everything in view at once, understand the big picture, consciously think about many relationships at once, etc.)
    • Example: AlphaGo Zero. There was a one-year gap between 'the first time AI ever defeated a human professional' and 'the last time a human professional ever beat a SotA AI'. AlphaGo Zero in particular showed that 2500 years of human reasoning about Go was crap compared to what was pretty easy to do with 2017 hardware and techniques and ~72 hours of self-play. This isn't a proof that human intelligence is similarly crap in physical-data-dependent STEM work, or in other formal settings, but it seems like a strong hint.
  • I'd guess we already have a hardware overhang for running AGI. (Considering, e.g., that we don't need to recapitulate everything a human brain is doing in order to achieve AGI. Indeed, I'd expect that we only need to capture a small fraction of what the human brain is doing in order to produce superhuman STEM reasoning. I expect that AGI will be invented in the future (i.e., we don't already have it), and that we'll have more than enough compute.)

I'd be curious to know (1) whether you disagree with these points, and (2) whether you disagree that these points are sufficient to predict that at least one early AGI system will be capable enough to defeat humans, if we don't succeed on the alignment problem.

(I usually think of "early AGI systems" as 'AGI systems built within five years of when humanity first starts building a system that could be deployed to do all the work human engineers can do in at least one hard science, if the developers were aiming at that goal'.)

...I think AGI ruin is likely. (>90% likely in our lifetimes.)

I disagree. Let's trade!

Initial idea: a long-dated (potentially perpetual) agreement in which you borrow at very low (potentially negative) interest rates initially, with rates increasing each year. 

If you are correct that AGI is a near and significant danger, the cheap money you borrow can be used to reduce this risk. 

If I am correct, the rising interest rates eventually create a profit over a long horizon.

Use HedgEverything to facilitate? You can PM me if you want this conversation offline.

Thanks for this! A couple of things:

  1. It's strange to me that this is aimed at people who aren't aware that MIRI staffers are quite pessimistic about AGI risk. After something like Eliezer's April Fools post, it seems pretty clear to those who've been paying attention - I would've been more interested in something that digs into the meat of the view rather than explaining the basic premises. Though it's possible I'm overestimating the amount of familiarity within longtermist circles of different views, including MIRI's.
  2. There are factors excluded from this model which are necessary for the core claim that alignment fails by default. Warning shots followed by a huge effort to avert disaster are one way things could go well, but we could just be further from AGI than people think (something like 5-15 years is my understanding of the MIRI view) or have very slow takeoff speeds.

 

I'm a bit frustrated because it seems like these 2 things are indicative of a failure to engage with counterarguments. They strike me as more of an attempt to instruct people who aren't familiar with the view, rather than persuasively argue for it compared to different (informed) views.

It's strange to me that this is aimed at people who aren't aware that MIRI staffers are quite pessimistic about AGI risk.

It's not. It's mainly aimed at people who found it bizarre and hard-to-understand that Nate views AGI risk as highly disjunctive. (Even after reading all the disjunctive arguments in AGI Ruin.) This post is primarily aimed at people who understand that MIRI folks are pessimistic, but don't understand where "it's disjunctive" is coming from.