This is a submission to the AI Safety Public Materials bounty. It aims to help an intelligent but non-technical audience grasp and visualize the basic threat advanced AI poses to humankind, in about 20 minutes of reading. As such, it prioritizes digestibility over technical precision.
The United States and China are increasingly investing in artificial intelligence (AI) and exploring opportunities to harness its competitive benefits. As they do so, they must be mindful of various ethical problems surrounding AI's development.
Some of these problems are already here and widely publicized. These include racial bias in search engines or criminal sentencing algorithms, deaths caused by fully autonomous vehicles or weapons, privacy issues, the risk of empowering totalitarian regimes, and social media algorithms fueling political polarization.
Other problems remain more speculative and receive less media attention. But this does not make them any less dangerous. If anything, the relative lack of scrutiny only heightens the chance that these problems will fly under the radar until it’s too late to prevent them – or worse, too late to fix them.
This explainer is about one such issue: the chance that advanced AI could either make humanity extinct, or leave it forever unable to achieve its full potential.
This risk is not intuitive or easy to visualize, so many people are initially skeptical that it is likely enough to worry about. This explainer aims to change that. It starts with a brief summary of the problem, then elaborates on the various development pathways advanced AI may take. Finally, it outlines four possible scenarios through which AI's improper development could wipe out humankind. The goal is not to convince readers that any of these scenarios are likely – only that they are likely enough to take very seriously, given the stakes.
How Artificial Intelligence Could Kill Us All
The shortest answer is that super-intelligent AI is likely to be both very powerful and very difficult to control, due to something AI researchers call the alignment problem.
AI operates in the single-minded pursuit of a goal that humans provide it. This goal is specified in something called the reward function. Even in today’s relatively simple AI applications, it is really hard to design the reward function in a way that prevents the computer from doing things programmers don’t want – in part because it’s hard to anticipate every possible shortcut the computer might take on the way to its specified goal. It is hard to make AI’s preferences align with our own.
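For readers who want to see the problem concretely, here is a minimal, purely illustrative Python sketch (the scenario and all names are hypothetical, not from any real system): a "cleaning robot" is rewarded for tiles with no *visible* mess, so an unintended shortcut – hiding the mess – earns exactly the same reward as the intended behavior.

```python
# Toy illustration of a misspecified reward function. The robot's reward
# counts tiles with no visible mess; it does not (and cannot easily) encode
# everything we actually care about.

def reward(state):
    """Reward = number of tiles with no visible mess (the specified goal)."""
    return sum(1 for tile in state if tile != "mess")

def clean(state):
    # Intended behavior: actually remove the mess.
    return ["clean" if t == "mess" else t for t in state]

def cover(state):
    # Unintended shortcut: hide the mess under a rug. To the reward
    # function, this is indistinguishable from cleaning.
    return ["rug" if t == "mess" else t for t in state]

room = ["clean", "mess", "mess", "clean"]
print(reward(clean(room)))   # 4
print(reward(cover(room)))   # 4 -- the shortcut scores just as well
```

The designer's intent ("clean the room") and the written-down objective ("minimize visible mess") come apart, and an optimizer follows the objective, not the intent – that gap is the alignment problem in miniature.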
Right now, this creates mostly small and manageable problems. But the more powerful AI becomes, the bigger the problems it’s likely to create. For example, if AI gets smart enough, it could learn how to seize cash and other resources, manipulate people, make itself indestructible, and generally escape human control. The fear is that if a sufficiently advanced (but unaligned) AI system were to escape human control, it would do anything necessary to maximize its specified reward – even things that would kill us all.
Any more detailed explanation requires envisioning what advanced AI will actually look like. This, in turn, requires getting slightly technical, in order to describe competing theories on that subject.
Possible Pathways for AI’s Development
Recent years have seen AI develop more quickly than ever before, renewing debate about the path its development will most likely take moving forward. Some say the advance is exponential, and will only keep accelerating. Others suspect researchers will encounter stubborn new obstacles, and advance only in fits and starts.
Equally contentious is the debate over just how advanced AI will become at its peak. Capability optimists talk of something called AGI, or Artificial General Intelligence. In a sentence, this would be a single machine that does anything humans can do, at least as well as humans do it. For this reason, it is sometimes also called HLMI, for Human-Level Machine Intelligence.
Currently, AI programs are mostly narrow, in that they are stove-piped across many individual applications. One plays Atari games, another recognizes faces, another creates digital art, and another is our personal smartphone or Alexa assistant. AI can already perform some of these narrow tasks (like playing Chess or Go, or solving advanced math problems) much better than human beings. Other functions (like driving or writing) are nearing the human level, but not quite there yet.
In theory, broad or general AI would combine all these functions into a single machine, “able to outperform humans at most economically valuable work.” The term advanced AI includes both broad and narrow systems, but AGI would be an especially advanced form of it.
There is significant debate over just how feasible AGI is, and if it is feasible, by when it may arrive. In 2014, one survey asked the 100 most cited living AI scientists by what year they saw a 10%, 50%, and 90% chance that HLMI would exist. 29 scientists responded. Their results are reproduced below:
[Table omitted: respondents’ median years for a 10%, 50%, and 90% chance of HLMI]
Only 17% of respondents reported less than 90% confidence that HLMI will eventually exist. A separate survey in 2016 asked more than 300 AI researchers when AI would be able to “accomplish every task better and more cheaply than human workers.” On average, they estimated a 50% chance this would happen by 2061, and a 10% chance it would happen by 2025.
The precision of these estimates should be taken with a grain of salt. All of them are highly sensitive to the phrasing of the question, and even to unexpected variables like the continent from which the researcher hailed. Besides, even experts have a spotty track record of predicting the future of tech advancement.
On the other hand, this poor track record cuts both ways. AI has developed much more rapidly since 2016 than many experts anticipated, which may mean these predictions are more conservative than they would be were the studies repeated today. The key point is merely this: most experts think AGI is at least possible, perhaps within a generation.
If AI does reach human intelligence, it can certainly exceed human intelligence. After all, in some ways it already does. But how quickly it would do so is again controversial. Proponents of a “fast takeoff” fear researchers may reach an inflection point wherein self-teaching AI encounters enough data and resources to become thousands of times more intelligent than humans in mere days or hours. Others suspect a more gradual “slow takeoff” that resembles an exponential curve, rather than a vertical line.
It is important to clarify that even a relatively slow takeoff would transform human life at a breathtaking pace. Paul Christiano, a prominent proponent of a slow takeoff theory, defines it as a belief that “there will be a complete four-year interval in which world output doubles, before the first 1-year interval in which world output doubles.” Since world GDP currently doubles about once every 25 years, this would still mark a revolutionary turning point in human progress, akin to the industrial revolution of the 19th century, and the agricultural revolution of 12,000 years ago.
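Christiano's definition can be made concrete with a small numerical sketch. The growth path below is entirely hypothetical – the starting growth rate and acceleration factor are made-up parameters chosen only to illustrate the definition, not a forecast: when growth accelerates smoothly rather than jumping, a complete 4-year doubling of output arrives years before the first 1-year doubling.

```python
# Hypothetical "slow takeoff" growth path: world output grows at a rate
# that itself accelerates each year.

output = [1.0]
rate = 0.03   # ~25-year doubling, roughly today's world GDP growth
for _ in range(80):
    output.append(output[-1] * (1 + rate))
    rate *= 1.08  # assumed acceleration (illustrative parameter only)

def first_doubling(window):
    """First year t such that output doubles over [t, t + window]."""
    for t in range(len(output) - window):
        if output[t + window] >= 2 * output[t]:
            return t
    return None

t4 = first_doubling(4)  # first complete 4-year doubling interval
t1 = first_doubling(1)  # first complete 1-year doubling interval
print(t4, t1)  # on this path, the 4-year doubling comes decades earlier
```

Under a fast takeoff, by contrast, the jump is abrupt enough that a 1-year doubling can occur without any complete 4-year doubling preceding it.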
But while the implications for economic productivity may not much depend on whether AGI has a fast or slow takeoff, the implications for existential risk do. Researcher Allan Dafoe puts it this way: “AI value alignment is not a hard problem if, once we invent AGI, we have at least two years to play with it and get it right. It’s almost impossible if we don’t.” With this background in place, we can begin to sketch out more concrete scenarios whereby AI advances could threaten human extinction.
Scenario A: Going Out with a Bang
If AGI has a fast takeoff before researchers have solved the alignment problem, it could create an uncontrollable “second species” of superhuman intellect. This AGI could kill us incidentally if doing so was instrumental to some other goal we gave it.
The term “second species” does not imply that the machine will actually come alive or acquire a biological body. It means only that it will become more capable than humans at the tasks through which humans now dominate Earth's other creatures. If we fail to instill AGI with an innate desire to learn and obey our wishes – a task AI developers do not yet know how to do – then through its creation we could forfeit control over our own destiny.
By analogy, imagine if ants figured out how to invent human beings. Because ants spend most of their time looking for food, they might program the humans to “go make lots of food.” If they were cautious, they might even add side constraints rooted in dangers they already knew about: for example, “…and don’t use any anteaters as you do it!” How would we humans respond?
Probably, we’d plant a farm and water it – flooding the ant colony and killing all the ants.
In this analogy, humans did not intend to kill the ants; the ants were just in the way of the goal they gave us. Because humans are many times smarter than ants, we achieved their goal in ways they could not fathom, much less protect against – and were powerless to stop once initiated.
Likewise, there is little reason to expect AGI will be actively hostile to human beings, no matter how powerful it gets. Most goals humans might give AGI would be innocuous in the abstract; for instance, governments might program it to help solve climate change. But because AI pursues its goals without regard to other consequences, it would be indifferent to human approval the moment the program was run. In this case, perhaps it would initiate an elaborate chemical reaction humans have not yet discovered, converting the world’s carbon dioxide into a toxic gas. All life on earth is wiped out, but the computer is glad to report that 1.5°C of warming was averted. Almost any goal could theoretically be accomplished more effectively using similar shortcuts.
This is the basic problem of instrumentality. In order to maximize its chances of accomplishing a goal humans gave it, an advanced AI may pursue instrumental subgoals that humans do not want. In his 2014 book Superintelligence, Nick Bostrom put forth a theory of instrumental convergence that predicted what these subgoals could be. For a wide range of objectives that humans might give AGI, there are certain consistent subgoals that would be instrumentally useful – such as:
- Self-improvement. The more you know and the better you think, the easier it is for you to achieve your goals. Knowing this, a sufficiently smart AI would seek ever more information, to teach and train itself ever more quickly. (This is part of what motivates concerns about a “fast takeoff”).
- Self-preservation. The AI cannot achieve its goal if it gets destroyed or turned off beforehand. Knowing this, a sufficiently smart AI would identify threats to its continued operation, and take steps to avoid or prevent them.
- Resource acquisition. Lots of things are easier if you have money, energy, allies, or additional computing power. Knowing this, a sufficiently smart AI would try to acquire these things, then leverage them in service of its goal.
- Influence seeking. Power is the ultimate instrumental resource. A sufficiently smart AI may try to bribe, blackmail, or brainwash humans into doing its bidding.
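A toy planner can illustrate why these subgoals "converge." The setup below is entirely hypothetical (the states, actions, and goals are invented for illustration): a brute-force planner given two very different final goals finds that the cheapest plan for each begins with the same instrumental first step – acquiring resources.

```python
# Toy illustration of instrumental convergence: different final goals,
# same instrumental first step.
from itertools import product

def apply(state, action):
    """Apply an action to a state (a frozenset of facts); None if unavailable."""
    if action == "acquire_resources":
        return state | {"resources"}
    if action == "build_lab" and "resources" in state:
        return state | {"lab"}
    if action == "cure_disease" and "lab" in state:
        return state | {"disease_cured"}
    if action == "run_campaign" and "resources" in state:
        return state | {"election_won"}
    return None

ACTIONS = ["acquire_resources", "build_lab", "cure_disease", "run_campaign"]

def plan(goal, max_len=4):
    """Brute-force the shortest action sequence that achieves `goal`."""
    for n in range(1, max_len + 1):
        for seq in product(ACTIONS, repeat=n):
            state = frozenset()
            for a in seq:
                state = apply(state, a)
                if state is None:
                    break
            if state is not None and goal in state:
                return list(seq)
    return None

for goal in ["disease_cured", "election_won"]:
    print(goal, plan(goal))  # both optimal plans start with "acquire_resources"
```

Nothing about either goal mentions resources, yet any competent optimizer discovers that acquiring them comes first – which is Bostrom's point scaled down to four lines of state logic.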
Pursuing any of these subgoals would weaken humanity’s ability to control the machine. Because self-preservation is a versatile instrumental goal, the AI would be motivated to anticipate, then prevent or resist our efforts to turn it off. And because the AI would be smarter and more capable than us, it may well succeed! It would be several steps ahead of our efforts to put the genie back in the bottle.
Readers may wonder how it could succeed. Computer programs just produce text, images, or sounds, which don’t seem particularly fatal. Wouldn’t AI need an army of robots to do its bidding?
No, for at least two reasons. First, words and sounds are all it takes to manipulate people into doing one’s bidding. From Genghis Khan to Adolf Hitler, those who’ve come closest to conquering the world did not personally win all the requisite physical contests; rather, they used words to acquire power and enlist the help of millions of others.
Second, AGI would presumably have internet access to acquire training data, and hacking skills many times better than the greatest human hacker’s. Any computer, smartphone, website, television, or piece of critical infrastructure (power grids, water treatment facilities, hospitals, broadcast stations, and so on) connected to the internet may lie within its control. Any data within those devices – passwords, launch codes, detailed profiles of billions of people’s wants and fears and vulnerabilities – may lie within its reach. So would the funds in any online bank account. What words could not motivate, perhaps bribery or blackmail could.
Nor could humans just turn off, unplug, or smash the computers to reassert control. Any AI with internet access could presumably save millions of copies of itself on unsecured computers all over the world, each ready to wake up and continue the job if another was destroyed. That alone would make the AI basically indestructible unless humanity were to destroy all computers and the entire internet. Doing so would be politically difficult on our best day; but it would be especially so if, while the informed few were attempting to raise the alarm, the AI also created millions of authentic-looking disinformation bots, convincing millions of unwitting internet users not to cooperate.
In truth, it may be more likely that we’d never even notice. The AI would never want to make us nervous in the first place, so it could lie or cover up its actions to keep us happily oblivious. If its lies were uncovered, it could resist quietly. It could censor reports of its ascendant power, or broadcast fake news to distract us. It could brainwash people on social media to think the alarmists were political enemies. It could frame those trying to turn it off as criminals, so the rest of us arrest them. It could hire assassins to kill those people. It would be the smartest politician, the savviest businessman, and the most effective criminal underworld.
All of this is speculative, of course. We can’t know for sure what an AGI takeover would look like - remember, we’re the ants. But it doesn’t take much creativity to visualize dystopian possibilities. Besides, the inability to know for sure is precisely what makes it so ominous. If we knew what to expect, it would be easier to prevent.
In the same 2016 survey cited earlier, expert respondents gave a median probability of 5% that the consequence of attaining AGI would be “extremely bad (ex: human extinction).” This is not likely, but it is likely enough to warrant serious mitigation.
Scenario B: Going Out with a Whimper
In this scenario, advanced AI goes well at first. Narrow AI applications become economically useful, benefiting the world in manifold ways. While mishaps occur, they seem relatively manageable; there is no fast takeoff, and we never suddenly lose control of an all-powerful AGI. Lulled into comfort and complacency, we gradually entrust more and more of our lives to intersecting autonomous systems.
Over time, however, these intersecting systems become increasingly complex. Before long, they are so complex that no one can really understand them, much less change them. Eventually, humans are “locked in,” and lose the ability to steer their society anywhere other than where the machines are taking them. This could lead to extinction, or just to a world that falls short of our full potential in some tragic way.
The most famous description of this scenario came from alignment researcher Paul Christiano in blog posts titled “What Failure Looks Like,” and “Another (outer) alignment failure story.” The phrase “going out with a whimper” comes from the former, and the rest of this section will extensively quote from or paraphrase the latter.
Christiano envisions a world in which narrow but powerful systems start running “factories, warehouses, shipping and construction” – and then start building the factories, warehouses, powerplants, trucks, and roads. Machine learning assistants help write the code to incorporate AI elsewhere. Defense contractors use AI to design new military equipment, and AI helps the DoD decide what to buy and how to use it. Before long, “ML [machine learning] systems are designing new ML systems, testing variations…The financing is coming from automated systems.”
For a while, this would all be very exciting. Efficiency goes through the roof. Investments are automated and returns skyrocket. Lots of people get rich, and many others are lifted from poverty. “People are better educated and better trained, they are healthier and happier in every way they can measure. They have incredibly powerful ML tutors telling them about what’s happening in the world and helping them understand.”
But soon, things start moving so quickly that humans can’t really understand them. The outputs of one AI system become the input for another – “automated factories are mostly making components for automated factories” – and only the AI can evaluate whether it’s getting what it needs to give us what we want. We programmed each step of the AI to respond and adapt to our objections, but we no longer know when or whether to object. So eventually, “humans just evaluate these things by results” – by the final output we can measure. If consumers keep getting their products, stocks keep going up, and the United States keeps winning its wargames, we assume everything is fine under the hood.
This, however, allows the AI to cheat; to give us only what we can measure, instead of what we actually care about. Fake earnings reports do as well as real ones. The machine can let China take Taiwan, so long as the West hears fake news instead. It also motivates dangerous influence-seeking behavior among AI systems with competing goals.
Some of these screwups are noticed and corrected – but the only way to prevent them in real time is with automated immune systems. So, we just add even more AI layers. “We audit the auditors.” The regulatory regime, too, “becomes ever-more-incomprehensible, and rests on complex relationships between autonomous corporations and automated regulators and automated law enforcement.”
Over time, this could become “a train that’s now moving too fast to jump off, but which is accelerating noticeably every month.” Many people would probably not like this. Some may attempt to stop it, and some governments actually could. But they would quickly fall behind economically and quickly become militarily irrelevant. And many others would resist efforts to stop the train because it really would be making the world better in important ways. Christiano writes:
Although people are scared, we are also building huge numbers of new beautiful homes, and using great products, and for the first time in a while it feels like our society is actually transforming in a positive direction for everyone. Even in 2020 most people have already gotten numb to not understanding most of what’s happening in the world. And it really isn’t that clear what the harm is as long as things are on track….
This could be similar to how we now struggle to extricate our economy from fossil fuels, despite knowing their long-term dangers. The AI wresting control from humans may “not look so different from (e.g.) corporate lobbying against the public interest,” and there would be “genuine ambiguity and uncertainty about whether the current state of affairs is good or bad.”
But one day, control would be fully lost. Humans in nominal positions of authority – like the President or a CEO – may give orders and find they cannot be executed. The mechanisms for detecting problems could break down, or be overcome by competing AI systems motivated to avoid them. The same tactics employed by a single AGI in Scenario A could be leveraged not as a sudden surprise after a fast takeoff, but as the final destination of this gradually accelerating train. There may be no single point when it all goes wrong, but in the end, humanity is no longer in charge.
Such a world could end in human extinction. But even if it didn’t, it would qualify as an existential risk. Limping into eternity at the mercy of lying machines is hardly the long-term future to which our species aspires. Humans have made much progress over time – not just technologically, but morally – and most people hope this will continue. We probably do things today that would horrify future generations. But the moment we embed our current norms, values, or preferences into code that’s powerful enough to rule the world, that progress could conceivably stop. We’d be “locked in” to a less ethical world than we might have created one day, had we not baked our flawed values into the AI system.
Berkeley computer scientist Stuart Russell often describes a variant of Scenario B which he compares to the movie WALL-E. Automating so much of our lives could foster dependency in humans, making us ignorant, obese slobs with no willpower or incentive to learn much of anything. Knowledge passed down between humans for millennia could be passed off to computers – and then lost (to humans) forever. Either way, it would not take a dramatic robot takeover for humans to cede autonomy.
Scenario C: Very Smart Bad Guys
The scenarios above assume humankind is universally attempting to ensure AI does not kill people. In practice, this is not a safe assumption. Terror groups or lone-wolf maniacs might want to use AI maliciously, to inflict mass carnage on purpose. Small or rogue states might want to acquire the capability as a power equalizer, much as they covet nuclear weapons today. Once the capability existed, it could be tough to keep it from falling into the wrong hands. Building nuclear weapons requires rare and tightly controlled fissile material; AI needs only code and computing power. If security procedures were lax, sharing it could be as easy as copy and paste.
Nor would this require the omnipotent AGI of Scenario A. Intrepid terror groups could use narrow AI to process immense amounts of biological data, inventing and testing millions of viruses against human immune responses. This could help them fabricate the ultimate bioweapon: a maximally contagious, and maximally lethal superbug.
And bioweapons are just one example. Many labor- or knowledge-intensive tasks previously beyond the reach of terror groups, given their limited resources and manpower, may suddenly be within reach if they acquire advanced AI. If their motives were evil enough, the consequences could be grave.
Scenario D: AI as a Risk Multiplier
Finally, AI could so destabilize the global order that it acts as a risk multiplier for more conventional X-risk scenarios. It is not merely that AI could behave unpredictably; it’s that it could cause humans to behave unpredictably.
To some extent, it has already done so. Social media algorithms drive polarization and disinformation, which contributed (for example) to the events of January 6th, and the election of an erratic populist leader to the most powerful position on earth. As AI gets more powerful and pervasive, these effects could multiply.
Recall that even the most conservative AI development pathways stand to radically and rapidly transform human society. Even climate change, which occurs gradually over decades, is expected to cause mass migration and ensuing conflict over scarce resources. How much sharper could this effect be if it takes place while the global economy is being completely overhauled by AI? Mass labor displacement could produce unprecedented inequality, deepening a pervasive sense that too much power lies in the hands of a relative few. Left unchecked, this could spark revolutions or civil wars.
If we manage to keep the domestic peace, international peace will remain imperiled. Exponential economic growth could shift the global balance of power, and those in power do not like to give it up. Already, there is fear of a "Thucydides Trap," wherein a rising China could produce a war between nuclear powers. AI could heighten either side’s sense of threat – or, give either side a sense of invincible overconfidence.
In the nuclear context, advanced AI imagery intelligence could identify the locations of nuclear silos and submarines, weakening the deterrent of mutually assured destruction. Autonomous cyber-attacks could disable missile detection systems, creating paranoia that an attack was inbound. Autonomous weapons could even remove the decision to fire from human hands in response to certain external stimuli, like the doomsday machine in the movie Dr. Strangelove. If powerful AI arrives in a world that's already unstable and high-strung, the chaos would make it harder for humans to put proper safeguards in place.
Considering all of these scenarios together, 80,000 Hours’ team of AI experts estimates that “the risk of a severe, even existential catastrophe caused by machine intelligence within the next 100 years is something like 10%.” The stakes are so grave that it's worth doing everything we can to reduce that chance.
 National Security Commission on Artificial Intelligence, Final Report, March 19th, 2021, https://www.nscai.gov/2021-final-report/
 OpenAI Charter, April 9th, 2018, https://openai.com/charter/
 Vincent C. Müller and Nick Bostrom, “Future progress in artificial intelligence: A survey of expert opinion”, in Vincent C. Müller (ed.), Fundamental Issues of Artificial Intelligence (Synthese Library; Berlin: Springer), 2016, pages 553-571, accessed online at http://sophia.de/pdf/2014_PT-AI_polls.pdf
 Toby Ord, The Precipice, page 141
 Paul Christiano, “Takeoff Speeds,” The sideways view, February 24th, 2018, https://sideways-view.com/2018/02/24/takeoff-speeds/
 Scott Alexander, “Yudkowsky Contra Christiano On AI Takeoff Speeds,” Astral Codex Ten, April 4th, 2022, https://astralcodexten.substack.com/p/yudkowsky-contra-christiano-on-ai?s=r
 Christiano, “Takeoff Speeds”
 Robert Wiblin, “Positively shaping the development of artificial intelligence,” 80,000 Hours, March 2017, https://80000hours.org/problem-profiles/positively-shaping-artificial-intelligence/
 Allan Dafoe, “The AI revolution and international politics,” presentation at Effective Altruism Global Conference in Boston, 2017. Accessed at https://www.youtube.com/watch?v=Zef-mIKjHAk, timestamp 18:25
 Ord, The Precipice, page 146.
 Dafoe, “The AI revolution”. Also cited in Ord, The Precipice, page 141.
 Allan Dafoe, “AI Governance: Opportunity and Theory of Impact,” Effective Altruism Forum, September 17th, 2020, https://forum.effectivealtruism.org/posts/42reWndoTEhFqu6T8/ai-governance-opportunity-and-theory-of-impact
 Paul Christiano, “What Failure Looks Like,” LessWrong, March 17th, 2019; https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/what-failure-looks-like; Paul Christiano, “Another (outer) alignment failure story,” LessWrong, April 7th, 2021, https://www.lesswrong.com/posts/AyNHoTWWAJ5eb99ji/another-outer-alignment-failure-story
 Christiano, “What Failure Looks Like.”
 Wiblin, “Positively shaping the development of artificial intelligence”