
A partial defense of high-confidence AGI doom predictions.

Introduction

Consider these two kinds of accident scenarios:

  1. In a default-success scenario, accidents are rare. For example, modern aviation is very safe thanks to decades of engineering efforts and a safety culture (e.g. the widespread use of checklists). When something goes wrong, it is often due to multiple independent failures that combine to cause a disaster (e.g. bad weather + communication failures + pilot not following checklist correctly).
  2. In a default-failure scenario, accidents are the norm. For example, when I write a program to do something I haven’t done many times already, it usually fails the first time I try it. It then goes on to fail the second time and the third time as well. Here, failure on the first try is overdetermined―even if I fix the first bug, the second bug is still, independently, enough to cause the program to crash. This is typical in software engineering, and it can take many iterations and tests to move into the default-success regime.

See also: conjunctive vs disjunctive risk scenarios.

Default-success scenarios include most engineering tasks that we have lots of experience with and know how to do well: building bridges, building skyscrapers, etc. Default-failure scenarios, as far as I can tell, come in two kinds: scenarios in which we’re trying to do something for the first time (rocket test launches, prototypes, new technologies) and scenarios in which there is a competent adversary that is trying to break the system, as in computer security.[1]
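To make the conjunctive vs disjunctive contrast above a bit more concrete, here is a toy numerical sketch (my illustration; all probabilities are made up for the example):

```python
# Toy illustration of conjunctive vs. disjunctive risk (made-up numbers).

# Default-success (aviation-like): disaster requires several independently
# unlikely failures to happen at once.
factor_failure_probs = [0.01, 0.02, 0.05]  # e.g. bad weather, comms failure, skipped checklist
p_disaster = 1.0
for p in factor_failure_probs:
    p_disaster *= p  # all failures must co-occur
print(f"conjunctive: P(disaster) = {p_disaster:.6f}")  # 0.000010

# Default-failure (first run of a new program): several likely bugs, any one
# of which is enough to crash, so failure is overdetermined.
bug_probs = [0.7, 0.6, 0.5]
p_no_crash = 1.0
for p in bug_probs:
    p_no_crash *= (1 - p)  # every bug must independently be absent
print(f"disjunctive: P(crash) = {1 - p_no_crash:.2f}")  # 0.94
```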

Predictions on AGI risk

In the following, I use P(doom) to refer to the probability of an AGI takeover and / or human extinction due to the development of AGI.

I often encounter the following argument against predictions of AGI catastrophes:

Alice: We seem to be on track to build an AGI smarter than humans. We don’t know how to solve the technical problem of building an AGI we can control, or the political problem of convincing people to not build AGI. Every plausible scenario I’ve ever thought or heard of leads to AGI takeover. In my estimate, P(doom) is [high number].

Bob: I disagree. It’s overconfident to estimate high P(doom). Humans are usually bad at predicting the future, especially when it comes to novel technologies like AGI. When you account for how uncertain your predictions are, your estimate should be at most [low number].

I'm being vague about the numbers because I've seen Bob's argument made in many different situations. In one recent conversation I witnessed, the Bob-Alice split was P(doom) 0.5% vs. ~10%, and in another discussion it was 10% vs. 90%.

My main claim is that Alice and Bob don’t actually disagree about how uncertain or hard to predict the future is―instead, they disagree about to what degree AGI risk is default-success vs. default-failure. If AGI risk is (mostly) default-failure, then uncertainty is a reason for pessimism rather than optimism, and Alice is right to predict failure.

In this sense I think Bob is missing the point. Bob claims that Alice is not sufficiently uncertain about her AI predictions, or has not integrated her uncertainty into her estimate well enough. This is not necessarily true; it may just be that Alice’s uncertainty about her reasoning doesn't make her much more optimistic.

Instead of trying to refute Alice from general principles, I think Bob should instead point to concrete reasons for optimism (for example, Bob could say “for reasons A, B, and C it is likely that we can coordinate on not building AGI for the next 40 years and solve alignment in the meantime”).

Uncertainty does not (necessarily) mean you should be more optimistic

Many people are skeptical of the ‘default-failure’ frame, so I'll give a bit more color here by listing some reasons why I think Bob's argument is wrong / unproductive. I won’t go into detail about why AGI risk specifically might be a default-failure scenario; you can find a summary of those arguments in Nate Soares’ post on why AGI ruin is likely.

  1. It’s true that the future is often hard to predict; for example, experts often fail to predict technological developments. But this is not a reason for optimism (it would be kind of weird if it were!). If anything, our general difficulty predicting technological progress is bad news for AI safety.
    1. In particular: if all the AI researchers are uncertain about what will happen, that is a bad sign much in the same way that it would be a bad sign if none of your security engineers understood the system they are supposed to secure.
    2. Analogy: if I’m in charge of software security for a company, and my impression is that the system is almost certainly insecure, it is not a good argument to say “well you don’t completely understand the system, so you might be wrong!” ― I may be wrong, but being wrong does not bode well for our security.
  2. To believe P(doom) is high, all you really need to be convinced of is that the default outcome for messing up superhuman AGI is human extinction, and that we’re not prepared. Our understanding here is incomplete but still relatively good compared to details that are harder to predict, e.g. when exactly AGI will arrive or what early forms of AGI will look like.
  3. It is not always wrong to make high-confidence disaster predictions. For example, people saying “covid will be a disaster with high (~90%) probability” in February 2020 were predictably correct, even though covid was a very novel situation. There was a lot of uncertainty, and the people who predicted disaster usually got the details wrong like everyone else, but the overall picture was still correct because the details didn’t matter much.
  4. A confidence of 90% is not actually much harder to achieve than 10%, relative to the baseline extinction risk for a new technology, which is close to 0%. An estimate of P(doom) = 30% already leans very heavily on your inside view of the risks involved; you don’t need to trust your reasoning all that much more to estimate 90% instead (see the back-of-envelope odds sketch after this list).
  5. Put differently: there’s no reason in particular why Bob's uncertainty argument should cap your confidence at ~80%, rather than 1% or 0.1%.
    1. (It seems totally reasonable to me for a first reaction to AI X-risk to be “Eh I don’t know, it’s an interesting idea and I’ll think more on it, but it does seem pretty crazy; if I had to estimate P(doom) right now I would say ~0.1%, though I would prefer not to give a number at all.” Followed, to be clear, by rapid updates in favor of high p(doom), though not necessarily 90%; I think 90% makes sense for people who have slammed their head against the difficulties involved, and noticed a pattern where the wall they’re slamming their heads against is pretty hard and doesn’t have visible weak spots; but otherwise you wouldn’t necessarily be that pessimistic.)
  6. More generally: estimates around 90% aren’t all that “confident”. If you’re well-calibrated, changing your mind about something that you estimate to be 90% likely is something that happens all the time. So P(X) = 90% means “I expect X to happen, though I’m happy to change my mind and in fact regularly do change my mind about claims like this”.
  7. It makes sense to be uncertain about your beliefs, and about whether you thought of all the relevant things (usually you didn’t). Rather than be generically uncertain about everything, it’s usually better to be uncertain about specific parts of your model.
    1. For example: I’m uncertain about the behavior and capability profile of the first AI that surpasses humans in scientific research. This makes me more pessimistic about alignment relative to a baseline where I was certain, because any strategy that depends on specific assumptions about the capabilities of this AI is unlikely to work.
    2. For a second example: I think there probably won’t be any international ban or regulation on large training runs that lengthens timelines by >10 years, but I’m pretty uncertain. This makes me more optimistic relative to a baseline where I was certain governments would do nothing.
  8. Put differently: most of your uncertainty about beliefs should be part of your model, not some external thing that magically pushes all your beliefs towards 50% or 0% or 100%.
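Here is the back-of-envelope odds comparison referenced in point 4 (my illustration; the ~0.1% baseline is a made-up stand-in for "generic new technology causes extinction"):

```python
# Back-of-envelope sketch (illustrative numbers, not from the post): how much
# evidence, in odds terms, it takes to reach 10% vs. 90% from a ~0.1% baseline.

def odds(p):
    return p / (1 - p)

baseline = 0.001  # hypothetical near-zero prior for a generic new technology

shift_to_10 = odds(0.10) / odds(baseline)  # update needed to go from 0.1% to 10%
shift_to_90 = odds(0.90) / odds(0.10)      # further update needed to go from 10% to 90%

print(f"0.1% -> 10%: odds shift of ~{shift_to_10:.0f}x")  # ~111x
print(f"10% -> 90%:  odds shift of ~{shift_to_90:.0f}x")  # ~81x
# The second jump requires *less* additional evidence than the first, which is
# the sense in which 90% is "not much harder to achieve than 10%".
```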

Some things I’m not saying

This part is me hedging my claims. Feel free to skip if that seems like a boring thing to read.

I don’t personally estimate P(doom) above 90%.

I’m also not saying there are no reasons to be optimistic. I’m claiming that reasons for optimism should usually be concrete arguments about possible ways to avoid doom. For example, Paul Christiano argues here for a P(doom) somewhat lower than 90%, and I think the general shape of his argument makes sense, in contrast to Bob’s above.

I do think there is a correct version of the argument: if your model says P(outcome) = 0.99, model uncertainty will generally be a reason to update downwards. I think people already take that into account when stating high P(doom) estimates. Here’s a sketch of a plausible line of reasoning (summarized, and not my numbers, but I reason similarly and I don’t think the numbers are crazy):

  1. Almost every time I imagine a concrete scenario for how AGI development might go, it leads to an outcome where humans go extinct.
  2. I can imagine some ways in which things go well, but they seem pretty fanciful; for example, a sudden international treaty that forbids large training runs and successfully enforces this. (I do expect there’ll be other government efforts, but I don’t expect those to change things much for the better.) So my “within-model” prediction is P(doom) = 0.99.
  3. My model is almost certainly wrong. Sadly, for most scenarios I can imagine, being wrong would only make things worse. I’m literally a safety researcher; me being totally wrong about e.g. what the first AGI looks like is not a good sign for safety (and I don’t expect other safety researchers to have better models). Almost all surprises are bad.
    1. Analogy: as before, the security engineer who believes the system is almost certainly insecure is not reassured by “well, you might be wrong!”
  4. That said: while technical surprises are probably bad, there are other kinds of positive surprises we could get, for example: more progress on AI safety than expected, better interpretability methods, more uptake of AI risk concerns by the broader ML community, more government action on regulating AI.
    1. In fact, there are some kinds of cumulative surprises that could add up to save us; as an example, enough regulation of AI could lead to ~10y longer timelines; more progress than expected in interpretability could lead to more compelling demonstrations of misalignment; more uptake of AI risk by the broader scientific community might lead to more safety progress and an overall more careful approach to AGI.
    2. Note that this is not an update made from pure uncertainty―there is a concrete story here about how exactly surprises might actually be helpful, rather than bad. It’s not a particularly great story either; it needs many things to go better than expected.
  5. Now, that particular story is not likely at all. But it seems like there are many stories in that general category, such that the total likelihood of a good surprise adds up to 10%.
    1. Note the basic expectation of ‘surprises are often bad’ still applies. Not knowing how governments or society will react to AI is hardly helpful for the people who are currently trying to get governments or society to react in a useful way.
  6. So my overall, all-things-considered P(doom) is 90%, mostly due to a kind of sketchy downwards-update due to model uncertainty, without which the estimate would be around 99% (a numerical version of this update is sketched after this list).
  7. It’s debatable how large the downwards update here should be―it could reasonably be more or less than 10%, and it’s plausible that we’re in the kind of domain where small quantified probability updates aren’t very useful at all.
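One minimal way to formalize the sketch above (the 0.99 and 10% figures are the illustrative numbers from the list, not mine):

```python
# Minimal formalization of the sketch above (illustrative numbers from the list).
p_doom_within_model = 0.99  # step 2: almost every concrete scenario ends in doom
p_good_surprise     = 0.10  # step 5: total probability that some saving surprise occurs
p_doom_if_surprise  = 0.0   # simplifying assumption: a saving surprise fully saves us

p_doom = (p_doom_within_model * (1 - p_good_surprise)
          + p_doom_if_surprise * p_good_surprise)
print(f"all-things-considered P(doom) ~ {p_doom:.2f}")  # ~0.89, i.e. roughly 90%
```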

I don’t mean to say that the reasoning here is the only reasonable version out there. It depends a lot on how likely you think various definitely-useful surprises are, like long timelines to AGI and slow progress after proto-AGI. But I do think it is wrong to call high P(doom) estimates overconfident without any further more detailed criticism.

Finally, I haven’t given an explicit argument for AGI risk; there’s a lot of that elsewhere.

[1] Note how AGI somehow manages to satisfy both of these criteria at once.

Comments

We have a large universe of "technologies we make for economic benefit": nearly all of them are "somewhat fine" to "very good". Famous exceptions of course exist, like leaded petrol, but are relatively rare. I don't count nuclear bombs in this comparison class, given that they were explicitly invented to kill large numbers of people. Given the massive commercial incentive to make AI useful, we should plausibly expect it to be safe. This is IMO the base-rate-thinking case. Purely from the outside view, we should expect AI to be fine.

In general I quite like this post, I think it elucidates some disagreements quite well. However, as an out and proud default-success guy, I’m not sure it represents the default-success argument on uncertainty well. To see why, let’s translate default-failure and default-success into their full beliefs:

Default-failure is the stance that, by default, highly intelligent AI systems will be developed, at least one of which rebels and successfully defeats and then murders/enslaves all of humanity.

Default-success is the stance that, by default, the chain of events above won’t happen.

I think the argument of the default-success people is that uncertainty about the future means that you shouldn’t be default-failure. We’re saying that the uncertainty about the future should translate into uncertainty about:

1. whether AI systems are developed,
2. whether they rebel,
3. whether they beat us,
4. whether, having defeated us, they decide to kill/enslave us all.

In order to get to a 90% chance of doom, you need to estimate a 97% certainty of every single step in that process. Of course, it’s completely fine to have 97% confidence in something if you have a lot of evidence for it. I do not feel that doomers have anywhere close to this standard of evidence. I do agree that discussion is better pointed to discussing this evidence than gesturing to uncertainty, but I don’t think it’s a useless point to make. 

1. whether AI systems are developed,
2. whether they rebel,
3. whether they beat us,
4. whether, having defeated us, they decide to kill/enslave us all.

 

I'm curious what 3 (defeat) might look like without 4 happening?

In general I quite like this post, I think it elucidates some disagreements quite well.

Thanks!

I’m not sure it represents the default-success argument on uncertainty well.

I haven't tried to make an object-level argument for either "AI risk is default-failure" or "AI risk is default-success" (sorry if that was unclear). See Nate's post for the former.

Re your argument for default-success, you would only need to have 97% certainty for 1-4 if the steps were independent, which they aren't.

I do agree that discussion is better pointed to discussing this evidence than gesturing to uncertainty

Agreed.

Re your argument for default-success, you would only need to have 97% certainty for 1-4 if the steps were independent, which they aren't.

I'm pretty sure this isn't true. To be clear, I was talking about conditional probabilities: the probability of each step occurring, given that the previous steps had already occurred.

Consider me making an estimate like "there's a 90% chance you complete this triathlon (without dropping out)". In order to complete the triathlon as a whole, you need to complete the swimming, cycling and running in turn.

To get to 90% probability overall, I might estimate that you have a 95% chance of completing the swimming portion, a 96% chance of completing the cycling portion given that you finished the swimming portion, and a 99% chance of completing the running portion, given that you finished the swimming and cycling portions. Total probability is 0.95*0.96*0.99 ≈ 0.90.

The different events are correlated (a fit person will find all three easier than an unfit person), but that's taken care of in the conditional nature of the calculation. It's also possible that uncertainty is correlated (If I find out you have a broken leg, all three of my estimates will probably go down, even though they are conditional). 

With regards to the doomsday scenario, the point is that there are several possible exit ramps (the AI doesn't get built, it isn't malevolent, it can't kill us all). If you want to be fairly certain that no exit ramps are taken, you have to be very certain that each individual exit ramp won't get taken. 
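A minimal sketch of the multiplication described above (using the illustrative numbers from this comment; nothing here goes beyond what the comment already states):

```python
# Chained conditional probabilities from the triathlon example above
# (commenter's illustrative numbers): each factor is conditional on the
# previous legs having been completed.
p_finish = 0.95 * 0.96 * 0.99
print(f"P(finish triathlon) ~ {p_finish:.2f}")  # ~0.90

# Same structure for the four "exit ramps": to be ~90% confident that none of
# them is taken, each conditional step needs roughly 97% confidence.
per_step = 0.9 ** (1 / 4)
print(f"required per-step confidence ~ {per_step:.1%}")  # ~97.4%
```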

That's fair; upon re-reading your comment it's actually pretty obvious you meant the conditional probability, in which case I agree multiplying is fine.

I think the conditional statements are actually straightforward: e.g. once we've built something far more capable than humanity, and that system "rebels" against us, it's pretty certain that we lose; and point (2) is the classic question of how hard alignment is. Your point (1), whether we build far-superhuman AGI in the next 30 years or so, seems like the most uncertain one here.

Yeah, no worries, I was afraid I'd messed up the math for a second there!

It's funny: I think my estimates are the opposite of yours. I think 1 is probably the most likely, whereas I view 3 as vastly unlikely. None of the proposed takeover scenarios seem within the realm of plausibility, at least in the near future. But I've already stated my case elsewhere.

Instead of trying to refute Alice from general principles, I think Bob should instead point to concrete reasons for optimism (for example, Bob could say “for reasons A, B, and C it is likely that we can coordinate on not building AGI for the next 40 years and solve alignment in the meantime”).

As an aside to the main point of your post, I think Bob arrived at his position by default. I suspect that part of it comes from the fact that the bulk of human experiences deal with natural systems. These natural systems are often robust and could be described as default-success. Take human interaction for instance: we assume that any stranger we meet is not a sociopath, because they rarely are. This system is robust and default-success because anti-social behavior is maladaptive. Because AI is so easy for our brains to place in the category of humans, we might by extension put it in the "natural system"-box. With that comes the assumption that its behavior reverts to default-success. Have you ever been irritated at your computer because it freezes? This irrational response could be traced to us being angry that the computer doesn't follow the rules of behavior that go with the (human) box we erroneously placed it in.

blueberry - this is a very good point about humans applying their 'default-success' heuristic (regarding social interactions with mostly-non-psychopathic humans) inappropriately to their potential interactions with AIs. 

Lauro - nice post; I especially appreciated your points about default-success versus default-failure mindsets.

Importantly, I think these defaults apply not just to (1) likelihood of being able to develop AGI and (2) likelihood of AGI imposing doom, but also to (3) likelihood that international regulations/pauses/moratoriums could succeed, and (4) likelihood that an anti-AI moral backlash could succeed, and lots of other related issues.

For example, some folks seem to think there's a very strong default-failure outcome of trying to coordinate formal global regulation of AI, but the same folks (e.g. me!) may think there's a fairly strong default-success outcome of trying to promote informal global moral stigmatization of AI. 

Of course in each such case, what counts as 'success' or 'failure' may depend heavily on one's goals. For transhumanist Singularity enthusiasts who actually want humans to be replaced by AIs, a high 'default-fail' rate for AI alignment might be seen as actually a success; for libertarians who want every private citizen to have their own unregulated, unaligned AI, a default-fail for global AI regulation would be seen as a success. So, we should be careful to be clear about what we're counting as 'success' or 'failure' when we talk about default-success or default-failure mindsets.

Hi Geoffrey! Yeah, good point - I agree that the right way to look at this is finer-grained, separating out prospects for success via different routes (gov regulation, informal coordination, technical alignment, etc).