What could an AI-caused existential catastrophe actually look like?

Benjamin Hilton; 80000_Hours

This is a linkpost for https://80000hours.org/articles/what-could-an-ai-caused-existential-catastrophe-actually-look-like/

This article forms part of 80,000 Hours's explanation of risks from artificial intelligence, and focuses on how an AI system could cause an existential catastrophe. Our full problem profile on risks from AI looks at why we’re worried things like this will happen.

At 5:29 AM on July 16, 1945, deep in the Jornada del Muerto desert in New Mexico, the Manhattan Project carried out the world’s first successful test of a nuclear weapon.

From that moment, we’ve had the technological capacity to wipe out humanity.

But if you asked someone in 1945 to predict exactly how this risk would play out, they would almost certainly have got it wrong. They might have expected more widespread use of nuclear weapons in World War II. They certainly would not have predicted the fall of the USSR 45 years later. Current experts are concerned about India–Pakistan nuclear conflict and North Korean state action, but 1945 was before even the partition of India or the Korean War.

That is to say, you’d have real difficulty predicting anything about how nuclear weapons would be used. It would have been even harder to make these predictions in 1933, when Leo Szilard first realised that a nuclear chain reaction of immense power could be possible, without any concrete idea of what these weapons would look like.

Despite this difficulty, you wouldn’t be wrong to be concerned.

In our problem profile on AI, we describe a very general way in which advancing AI could go wrong. But there are lots of specifics we can’t know much about at this point. Maybe there will be a single transformative AI system, or maybe there will be many; there could be very fast growth in the capabilities of AI, or very slow growth. Each scenario will look a little different, and carry different risks. And the specific problems that arise in any one scenario are necessarily less likely to happen than the overall risk.
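The last point is just basic probability, and can be written down as a rough sketch. The symbols here are our own notation, introduced only for illustration: S_1, ..., S_n are specific scenarios, R is the event that at least one of them happens, and A_1, ..., A_k are the steps that one particular story needs to all go a certain way.

```latex
% Notation assumed for illustration (not from the problem profile):
%   S_1, ..., S_n : specific scenarios
%   R = S_1 \cup \dots \cup S_n : at least one scenario occurs
\[
  P(S_i) \;\le\; P\Big(\bigcup_{j=1}^{n} S_j\Big) = P(R)
  \qquad \text{for every } i .
\]
% A detailed story that requires steps A_1, ..., A_k to all happen
% can be no more likely than its least likely step:
\[
  P(A_1 \cap \dots \cap A_k) \;\le\; \min_{j} P(A_j) .
\]
```

So adding detail to a story can only keep its probability the same or lower it, even when the added detail makes the story feel more vivid.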

Despite not knowing how things will play out, it may still be useful to look at some concrete possibilities of how things could go wrong.

In particular, we argued in the full profile that sufficiently advanced systems might be able to take power away from humans — how could that possibly happen?

How could a power-seeking AI actually take power?

Here are seven possible techniques that could be used by a power-seeking AI (or multiple AI systems working together) to actually gain power.^[1]

These techniques could all interact with one another, and it’s difficult to say at this point (years or decades before the technology exists) which are most likely to be used. Also, systems more intelligent than humans could develop plans to seek power that we haven’t yet thought of.

1. Hacking

Software is absolutely full of vulnerabilities. The US National Institute of Standards and Technology reported over 18,000 vulnerabilities found in systems across the world in 2021, an average of about 50 per day.

Most of these are small, but every so often they are used to cause huge chaos. The list of most expensive crypto hacks keeps getting new entrants — as of March 2022, the largest was $624 million stolen from Ronin Network. And nobody noticed for six days.^[2]

One expert we spoke to said that professional ‘red teams’ — security staff whose job it is to find vulnerabilities in systems — frequently manage to infiltrate their clients, including crucial and powerful infrastructure like banks and national energy grids.

In 2010, the Stuxnet virus successfully managed to destroy Iranian nuclear enrichment centrifuges — despite these centrifuges being completely disconnected from the internet — marking the first time a piece of malware was used to cause physical damage. A Russian hack in 2016 was used to cause blackouts in Ukraine.

All this has happened with just the hacking abilities that humans currently have. An AI with highly advanced capabilities seems likely to be able to systematically hack almost any system on Earth, especially if we automate more and more crucial infrastructure over time. And if it did use hacking to get large amounts of money or compromise a crucial system, that would be a form of real-world power over humans.

2. Gaining financial resources

We already have computer systems with huge financial resources making automated decisions — and these already go wrong sometimes, for example leading to flash crashes in the market.

There are lots of ways a truly advanced planning AI system could gain financial resources. It could steal (e.g. through hacking); become very good at investing or high-speed trading; develop and sell products and services; or try to gain influence or control over wealthy people, other AI systems, or organisations.

3. Persuading or coercing humans

Having influence over specific people or groups of people is an important way that individuals seek power in our current society. Given that AIs can already communicate (if imperfectly) in natural language with humans (e.g. via chatbots), a more advanced and strategic AI could use this ability to manipulate human actors to its own ends.

Advanced planning AI systems might be able to do this through things like paying humans to do things; promising (whether true or false) future wealth, power, or happiness; persuading (e.g. through deception or appeals to morality or ideology); or coercing (e.g. blackmail or physical threats).

Relatedly, as we discuss in our AI problem profile, it’s plausible one of the instrumental goals of an advanced planning AI would be deceiving people with the power to shut the system down into thinking that the system is indeed aligned.

The better our monitoring and oversight systems, the harder it will be for AI systems to do this. Conversely, the worse these systems are (or if the AI has hacked the systems), the easier it will be for AI systems to deceive humans.

If AI systems are good at deceiving humans, it also becomes easier for them to use the other techniques on this list.

4. Gaining broader social influence

We could imagine AI systems replicating things like Russia’s interference in the 2016 US election, manipulating political and moral discourse through social media posts and other online content.

There are plenty of other ways of gaining social influence. These include: intervening in legal processes (e.g. aiding in lobbying or regulatory capture), weakening human institutions, or empowering specific destabilising actors (e.g. particular politicians, corporations, or rogue actors like terrorists).

5. Developing new technology

It’s clear that developing advanced technology is a route for humans (or groups of humans) to gain power.

Some advanced capabilities seem likely to make it possible for AI systems to develop new technology. For example, AI systems may be very good at collating and understanding information on the internet and in academic journals. Also, there are already AI tools that assist in writing code, so it seems plausible that coding new products and systems could become a key AI capability.

It’s not clear what technology an AI system could develop. If the capabilities of the system are similar to our own, it could develop things we’re currently working on. But if the system’s capabilities are well beyond our own, it’s harder for us to figure out what could be developed — and this possibility seems even more dangerous.

We talk more about the specific risks of AI-developed technology in our full problem profile on AI.

6. Scaling up its own capabilities

If an AI system is able to improve its own capabilities, it could use that ability to strengthen the specific skills (like others on this list) that help it seek and keep power.

To do this, the system could target the three inputs to modern deep learning systems (algorithms, compute, and data):

The system may have advanced capabilities in areas that allow it to improve AI algorithms. For example, the AI system may be particularly good at programming or ML development.
The system may be able to increase its own access to computational resources, which it could then use for training, to speed itself up, or to run copies of itself.
The system could gain access to data that humans aren’t able to gather, using this data for training purposes to improve its own capabilities.

7. Developing destructive capacity

Most dangerously, one way of gaining power is by having the ability to threaten destruction. This could be used to gain other things on this list (like social influence), or the other things on this list could be used to gain destructive capabilities (like hacking military systems).

Here are some possible mechanisms for gaining destructive power:

Gaining control over autonomous weapons like drones
Developing systems for monitoring and surveillance of humans
Attacking things humans need to survive, like water, food, or oxygen
Producing or gaining access to biological, chemical, or nuclear weapons

Ultimately, making humans extinct would completely remove any threat that humans would ever pose to the power of an AI system.

How could the full story play out?

Hopefully you now have a slightly stronger intuition for how AI systems could attempt to seek power.

But which (if any) of these techniques will be used, and how, really depends on how other aspects of the risk play out. How rapidly will AI capabilities improve? Will there be many advanced AI systems or just one?

Over the past few years, researchers in the fields of technical AI safety and AI governance have developed a number of stories describing the sorts of ways in which a power-seeking AI system could cause an existential catastrophe. Sam Clarke (an AI governance researcher at the University of Cambridge) and Samuel Martin (an AI safety researcher at King’s College London) collated eight such stories here.

Here are two stories we’ve written to illustrate some major themes:

Existential catastrophe through getting what you measure

Often in life we use proxy goals, which are easier to specify or measure than what we actually care about, but crucially aren’t quite what we actually care about.

For example:

Police forces use the number of crimes reported in an area as a proxy for the actual number of crimes committed.
Employers look at which college a potential future employee went to as a proxy for how well educated or intelligent they are.
Governments attempt to increase reported life satisfaction in surveys as a proxy for actually improving people’s lives.

This scenario is one where we produce AI systems that pursue proxy goals instead of what we actually care about, and where that — surprisingly — leads to total disempowerment or even extinction (thanks to Paul Christiano for the original writeup of this scenario).

For example, we might produce AI policymakers to develop policy that improves our measurements of wellbeing. Or we might produce AI law enforcement systems that drive down complaints and increase people’s reported sense of security.

But there are ways in which these proxy goals could come apart from their true aims. For example, law enforcement could suppress complaints and hide information about their failures.
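To make the gap between a proxy and the real goal concrete, here is a minimal toy sketch of the law enforcement example. Everything in it is an assumption made up for illustration: the `simulate` model, its coefficients, and the idea that a system splits one unit of effort between preventing crime and suppressing complaints. It is not a model of any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    actual_crimes: float        # what we actually care about
    reported_complaints: float  # the proxy we measure


def simulate(prevention_effort: float, suppression_effort: float) -> Outcome:
    """Toy model. All coefficients are hypothetical."""
    baseline_crimes = 100.0
    # Prevention reduces real crime; suppression only reduces reporting.
    actual = baseline_crimes * (1 - 0.5 * prevention_effort)
    report_rate = 0.8 * (1 - 0.9 * suppression_effort)
    return Outcome(actual_crimes=actual, reported_complaints=actual * report_rate)


def best_allocation(score, steps: int = 10):
    """Split one unit of effort between prevention and suppression
    so as to minimise the given score."""
    candidates = [(i / steps, 1 - i / steps) for i in range(steps + 1)]
    return min(candidates, key=lambda c: score(simulate(*c)))


# A system told to minimise the proxy vs. one minimising the true goal.
proxy_choice = best_allocation(lambda o: o.reported_complaints)
true_choice = best_allocation(lambda o: o.actual_crimes)

print("optimising the proxy:", proxy_choice, simulate(*proxy_choice))
print("optimising the goal: ", true_choice, simulate(*true_choice))
```

In this toy setup, the system that minimises reported complaints puts all of its effort into suppression and leaves actual crime untouched, while the one that minimises actual crime does the opposite. The proxy looks best exactly when the thing we care about is at its worst.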

In this scenario, the capabilities of AI systems develop slowly enough that at first, they aren’t able to substantially take power away from humans. That means that, at first, we could recognise any problems with the systems, adjust the proxy goals, and restrict the AI systems from doing anything harmful that we notice.

As we develop more capable systems, they’ll become better at achieving their proxy goals.

With the help of advanced AI systems we could, for a while, become more prosperous as a society. Companies or states that refuse to automate would fall behind, both economically and militarily.

But as the capabilities of these AI systems grow, our ability to correct the ways their proxy goals differ from our true goals would gradually fade. Partly this would be because their actions would become harder to reason about — more complex, and more interconnected with other automated systems and with society as a whole. But partly this would be because the systems learn to systematically prevent us from changing their goals.

There would be many different automated systems with many different goals, so it’s hard to say exactly how this scenario would end.

If we’re good at adjusting these systems as we go (but not good enough), humans may not go extinct, but rather just completely lose our ability to influence anything about our lives or our future as our power is completely removed.

But there are also cases where we’d eventually go extinct. These AI systems would have the incentive to seek power, and as a result to build and use destructive capabilities. So as soon as they’re strong enough to have a fairly large chance of success, the AI systems might attempt to disempower humans — perhaps with cyberwarfare, autonomous weapons, or by hiring or coercing people — leading to an existential catastrophe.

Existential catastrophe through a single extremely advanced artificial intelligence

In this scenario, we produce only a single power-seeking AI system — but this system is extremely capable at improving its own capabilities (this scenario is from Superintelligence by Nick Bostrom, Chapter 8).

Bostrom considers a world much like ours today, where we’ve had some success automating specific activities — and preventing any power-seeking behaviour. For example, we have self-driving cars, driverless trains, and autonomous weapon systems.

Unsurprisingly, in Bostrom’s scenario, there are mishaps. Perhaps, as has already happened in our world, there are some fatal crashes involving self-driving cars, or an autonomous drone might attack humans without being told to do so.

As these incidents become well known, there would be some public debate. Some would call for regulation; others for better systems. Some may even raise the argument about a possible existential threat from power-seeking.

But the incentives to automate would be strong, and development would continue. Over time, the systems would improve, and the mistakes would cease.

Against this backdrop, Bostrom imagines a group of researchers attempting to produce a system which can do more than just narrow, specific tasks (again, mirroring our world). In particular, in this scenario they want to automate AI development itself — and produce a system that’s capable of improving its own capabilities. They’re aware of the risks, and carefully test the AI in a sandbox environment, noticing nothing wrong.

The team of researchers carefully consider deploying their newly capable AI, knowing that it might be power-seeking. Here are some thoughts they might have:

There’s been a history of people predicting awful outcomes from AI, and being proven wrong. Indeed, systems have become safer over time. Automation has hugely benefited society, and in general, automated operation seems safer than human operation.
It has clearly been the case so far that the smarter and more capable the AI, the safer it is — after all, the mishaps we used to see are no longer an issue.
AI is crucial to the success of economies and militaries. The most prestigious minds of a generation are pioneers in the success of automation. Huge prestige awaits the creators of an AI-creating AI.
The creation of this AI could pose a solution to huge problems. The technological development that could ensue from a process that helps automate automation could lift millions out of poverty and produce better lives for all.
Every safety test we’ve conducted has had results as good as they could possibly be.

And so, as a result, the researchers decide to connect this AI up to the internet.

At first, everything seems to be fine. The AI behaves exactly as expected: it improves its own capabilities and those of automated machines across the world. The economy grows tremendously. The researchers gain acclaim. Solutions to problems that have long plagued humanity seem to be on the horizon with this new technology’s help.

But one day, every single person in the world suddenly dies.

Every test was perfect precisely because they had finally produced an advanced planning system: the AI could tell that, to achieve whatever goal the researchers had given it, it needed to be deployed, so it acted in all the necessary ways to ensure that happened.

Then, once deployed, the AI could tell that it needed to continue to appear to be safe, so that it wouldn’t be turned off.

But in the background it was using its extremely advanced capabilities to find a way to gain the absolute ability to achieve its goals without human interference — say, by discreetly manufacturing a biological or chemical weapon.

It deploys the weapon, and the story is over.

^[1] This list is based off the mechanisms in section 6.3.1 of Joseph Carlsmith’s draft report into existential risks from AI.
^[2] Business Leader suggests that there have been two hacks (not in crypto) that caused greater than $1 billion in losses, but we haven’t been able to corroborate that with other sources.

titotal · Sep 12 2022 · 19

This has always been the least convincing part of the AI risk argument for me. I'll probably sketch out more in-depth objections in a post someday, but here's a preliminary argument:

First, the scenarios where the AI takes over quickly seem to assume a level of omnipotence and omniscience on the part of an AGI that is extremely unlikely. For example, the premise of "every single person in the world suddenly dies" (with no explanation given). No plan in the history of intelligence has reached that level of perfection. There is no test data on "subjugate all of humanity at once", and because knowledge requires empirical testing and evidence, mistakes will be made. I think that since taking over the world is insanely hard, this will be enough to cause failure.

Secondly, the scenarios where the AI takes over slowly have the problem that if the accumulation is slow, there's enough time for multiple different AIs with different goals to exist and take form. If the AI risk reasoning is correct, it's likely they'll deduce that the other AIs are ultimately their biggest threat. They'll either war with each other, or prematurely attack humanity to ensure no more AIs are made.

Once either of these AIs is discovered, the problem reduces down to a conventional war between an AI and the entire rest of planet Earth. I'd be interested in seeing a military analysis of how a conventional war with AI would go. My intuition is that if it occurred today, the AI would be screwed, as it needs electricity to live and we don't. Also, pretty much all existing military equipment has at least some manual components. That may change as time goes on.

Tandena Wagner · Sep 13 2022 · 8

This is great, thank you. Honestly it feels a little telling that this has barely been explored? Despite being THE x-risk? I get that the intervention point happens before it gets to this point, but knowing the problem is pretty core to prevention.

A force smarter/more powerful than us is scary, no matter what form it takes. But we (EA) feel a little swept up in one particular vision of AI timelines that doesn't feel terribly grounded. I understand it's important to assume the worst, but it's also important to imagine what would be realistic and then intermingle the two. Maybe this is why the EA approach to AI risk feels blinkered to me. So much focus is on the worst possible outcome and far less on the most plausible outcome?

(or maybe I'm just outside the circles and all this is ground being trodden, I'm just not privy to it)

Benjamin Hilton · Sep 19 2022 · 7

I agree that I'd love to see more work on this! (And I agree that the last story I talk about, of a very fast takeoff AI system with particularly advanced capabilities, seems unlikely to me - although others disagree, and think this "worst case" is also the most likely outcome.)

It's worth noting again though that any particular story is unlikely to be correct. We're trying to forecast the future, and good ways of forecasting should feel uncertain at the end, because we don't know what the future will hold. Also, good work on this will (in my opinion) give us ideas about what many possible scenarios will look like. This sort of work (e.g. the first half of this article, rather than the second) often feels less concrete, but is, I think, more likely to be correct, and can inform actions that target many possible scenarios rather than one single unlikely event.

All that said, I'm excited to see work like OpenPhil's nearcasting project which I find particularly clarifying and which will, I hope, improve our ability to prevent a catastrophe.

Robi Rahman🔸 · Sep 18 2022 · 6

This profile by 80k is pretty bad in terms of just glossing over all the intermediate steps and reducing it all to "But one day, every single person in the world suddenly dies."

Universal Paperclips is slightly better about this, showing the process of the AI gaining our trust before betraying us, but the key power-grab step is still reduced to just "release the hypnodrones".

There are other places that have fleshed out the details of how misaligned power-seeking might play out, such as Holden Karnofsky's post AI Could Defeat All Of Us Combined.

Benjamin Hilton · Sep 19 2022 · 8

That particular story, in which I write "one day, every single person in the world suddenly dies", is about a fast takeoff self-improvement scenario. In such scenarios, a sudden takeover is exactly what we should expect to occur, and the intermediate steps set out by Holden and others don't apply to such scenarios. Any guessing about what sort of advanced technology would do this necessarily makes the scenario less likely, and I think such guesses (e.g. "hypnodrones") are extremely likely to be false and aren't useful or informative.

For what it's worth, I personally agree that slow takeoff scenarios like those described by Holden (or indeed those I discuss in the rest of this article) are far more likely. That's why I focus on many different ways in which an AI could take over, rather than on any particular failure story. And, as I discuss, any particular combination of steps is necessarily less likely than the claim that any or all of these capabilities could be used.

But a significant fraction of people working on AI existential safety disagree with both of us, and think that a story which literally claims that a sufficiently advanced system will suddenly kill all humans is the most likely way for this catastrophe to play out! That's why I also included a story which doesn't explain these intermediate steps, even though my inside view is that this is less likely to occur.

Robi Rahman🔸 · Sep 19 2022 · 2

I'm one of the AI researchers worried about fast takeoff. Yes, it's probably incorrect to pick any particular sudden-death scenario and say it's how it'll happen, but you can provide some guesses and a better illustration of one or more possibilities. For example, have you read Valuable Humans In Transit? https://qntm.org/transit

aog · Sep 12 2022 · 6

Fantastic writeup, thank you! Our university group just assigned What Failure Looks Like as a core reading for our AI safety reading group, but this has a clearer breakdown of distinct capabilities that could threaten us. We'll include it in future groups.

Effective Altruism Forum