I've been citing AGI Ruin: A List of Lethalities to explain why the situation with AI looks lethally dangerous to me. But that post is relatively long, and emphasizes specific open technical problems over "the basics".
Here are 10 things I'd focus on if I were giving "the basics" on why I'm so worried:
1. General intelligence is very powerful, and once we can build it at all, STEM-capable artificial general intelligence (AGI) is likely to vastly outperform human intelligence immediately (or very quickly).
When I say "general intelligence", I'm usually thinking about "whatever it is that lets human brains do astrophysics, category theory, etc. even though our brains evolved under literally zero selection pressure to solve astrophysics or category theory problems".
It's possible that we should already be thinking of GPT-4 as "AGI" on some definitions, so to be clear about the threshold of generality I have in mind, I'll specifically talk about "STEM-level AGI", though I expect such systems to be good at non-STEM tasks too.
Human brains aren't perfectly general, and not all narrow AI systems or animals are equally narrow. (E.g., AlphaZero is more general than AlphaGo.) But it sure is interesting that humans evolved cognitive abilities that unlock all of these sciences at once, with zero evolutionary fine-tuning of the brain aimed at equipping us for any of those sciences. Evolution just stumbled into a solution to other problems that happened to generalize to millions of wildly novel tasks.
More concretely:
- AlphaGo is a very impressive reasoner, but its hypothesis space is limited to sequences of Go board states rather than sequences of states of the physical universe. Efficiently reasoning about the physical universe requires solving at least some problems that are different in kind from what AlphaGo solves.
- These problems might be solved by the STEM AGI's programmer, and/or solved by the algorithm that finds the AGI in program-space; and some such problems may be solved by the AGI itself in the course of refining its thinking.
- Some examples of abilities I expect humans to only automate once we've built STEM-level AGI (if ever):
- The ability to perform open-heart surgery with a high success rate, in a messy non-standardized ordinary surgical environment.
- The ability to match smart human performance in a specific hard science field, across all the scientific work humans do in that field.
- In principle, I suspect you could build a narrow system that is good at those tasks while lacking the basic mental machinery required to do par-human reasoning about all the hard sciences. In practice, I very strongly expect humans to find ways to build general reasoners to perform those tasks, before we figure out how to build narrow reasoners that can do them. (For the same basic reason evolution stumbled on general intelligence so early in the history of human tech development.)
When I say "general intelligence is very powerful", a lot of what I mean is that science is very powerful, and that having all of the sciences at once is a lot more powerful than the sum of each science's impact.
Another large piece of what I mean is that (STEM-level) general intelligence is a very high-impact sort of thing to automate because STEM-level AGI is likely to blow human intelligence out of the water immediately, or very soon after its invention.
80,000 Hours gives the (non-representative) example of how AlphaGo and its successors compared to humanity:
In the span of a year, AI had advanced from being too weak to win a single [Go] match against the worst human professionals, to being impossible for even the best players in the world to defeat.
I expect general-purpose science AI to blow human science ability out of the water in a similar fashion.
Reasons for this include:
- Empirically, humans aren't near a cognitive ceiling, and even narrow AI often suddenly blows past the human reasoning ability range on the task it's designed for. It would be weird if scientific reasoning were an exception.
- Empirically, human brains are full of cognitive biases and inefficiencies. It's doubly weird if scientific reasoning is an exception even though it's visibly a mess with tons of blind spots, inefficiencies, and motivated cognitive processes, and even though there are innumerable historical examples of scientists and mathematicians taking decades to make technically simple advances.
- Empirically, human brains are extremely bad at some of the most basic cognitive processes underlying STEM.
- E.g., consider the stark limits on human working memory and ability to do basic mental math. We can barely multiply smallish multi-digit numbers together in our head, when in principle a reasoner could hold thousands of complex mathematical structures in its working memory simultaneously and perform complex operations on them. Consider the sorts of technologies and scientific insights that might only ever occur to a reasoner if it can directly see (within its own head, in real time) the connections between hundreds or thousands of different formal structures.
- Human brains underwent no direct optimization for STEM ability in our ancestral environment, beyond traits like "I can distinguish four objects in my visual field from five objects".
- In contrast, human engineers can deliberately optimize AGI systems' brains for math, engineering, etc. capabilities; and human engineers have an enormous variety of tools available to build general intelligence that evolution lacked.
- Software (unlike human intelligence) scales with more compute.
- Current ML uses far more compute to find reasoners than to run reasoners. This is very likely to hold true for AGI as well.
- We probably have more than enough compute already, if we knew how to train AGI systems in a remotely efficient way.
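The point in the list above about finding versus running reasoners can be made concrete with rough arithmetic. This is a minimal sketch, assuming the commonly used rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per training token and that generating one token costs about 2 FLOPs per parameter. The model size and token count are hypothetical and are not taken from any particular system.

```python
# Rule-of-thumb FLOP counts for a dense transformer (an assumption, not from the post):
#   training  ~ 6 * params * training_tokens
#   inference ~ 2 * params per generated token
PARAMS = 10**9             # hypothetical model size
TRAIN_TOKENS = 2 * 10**10  # hypothetical training set size

train_flops = 6 * PARAMS * TRAIN_TOKENS
flops_per_generated_token = 2 * PARAMS

# How many tokens of inference the training budget would pay for.
tokens_equivalent = train_flops // flops_per_generated_token

print(f"training compute:  {train_flops:.2e} FLOPs")
print(f"inference compute: {flops_per_generated_token:.2e} FLOPs per token")
print(f"training budget = generating {tokens_equivalent:.2e} tokens")
```

Under these assumptions the training budget always equals the cost of generating three times as many tokens as were trained on, whatever the model size. That is one way to see why hardware sufficient to find a system is likely to be more than sufficient to run it.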
And on a meta level: the hypothesis that STEM AGI can quickly outperform humans has a disjunctive character. There are many different advantages that individually suffice for this, even if STEM AGI doesn't start off with any other advantages. (E.g., speed, math ability, scalability with hardware, skill at optimizing hardware...)
In contrast, the claim that STEM AGI will hit the narrow target of "par-human scientific ability", and stay at around that level for long enough to let humanity adapt and adjust, has a conjunctive character.
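A toy calculation may help show why the disjunctive/conjunctive distinction matters. The probabilities below are made-up illustrative numbers, not estimates from this post, and treating the events as independent is a simplifying assumption.

```python
from math import prod

def p_any(probs):
    """Probability that at least one of several independent events occurs."""
    return 1 - prod(1 - p for p in probs)

def p_all(probs):
    """Probability that every one of several independent events occurs."""
    return prod(probs)

# Hypothetical numbers, chosen only for illustration.
advantages = [0.3, 0.3, 0.3, 0.3]    # disjunctive: any one advantage suffices
requirements = [0.8, 0.8, 0.8, 0.8]  # conjunctive: every condition must hold

disjunctive = p_any(advantages)    # 1 - 0.7**4 = 0.7599
conjunctive = p_all(requirements)  # 0.8**4 = 0.4096
print(f"at least one advantage suffices: {disjunctive:.4f}")
print(f"all requirements hold:           {conjunctive:.4f}")
```

In this sketch, four individually unlikely routes add up to a likely outcome, while four individually likely conditions multiply out to an unlikely one. The real events are surely correlated, so this shows only the shape of the argument, not its magnitude.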
2. A common misconception is that STEM-level AGI is dangerous because of something murky about "agents" or about self-awareness. Instead, I'd say that the danger is inherent to the nature of action sequences that push the world toward some sufficiently-hard-to-reach state.
Call such sequences "plans".
If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like "invent fast-running whole-brain emulation", then hitting a button to execute the plan would kill all humans, with very high probability. This is because:
- "Invent fast WBE" is a hard enough task that succeeding in it usually requires gaining a lot of knowledge and cognitive and technological capabilities, enough to do lots of other dangerous things.
- "Invent fast WBE" is likelier to succeed if the plan also includes steps that gather and control as many resources as possible, eliminate potential threats, etc. These are "convergent instrumental strategies"—strategies that are useful for pushing the world in a particular direction, almost regardless of which direction you're pushing.
- Human bodies and the food, water, air, sunlight, etc. we need to live are resources ("you are made of atoms the AI can use for something else"); and we're also potential threats (e.g., we could build a rival superintelligent AI that executes a totally different plan).
The danger is in the cognitive work, not in some complicated or emergent feature of the "agent"; it's in the task itself.
It isn't that the abstract space of plans was built by evil human-hating minds; it's that the instrumental convergence thesis holds for the plans themselves. In full generality, plans that succeed in goals like "build WBE" tend to be dangerous.
This isn't true of all plans that successfully push our world into a specific (sufficiently-hard-to-reach) physical state, but it's true of the vast majority of them.
This is counter-intuitive because most of the impressive "plans" we encounter today are generated by humans, and it’s tempting to view strong plans through a human lens. But humans have hugely overlapping values, thinking styles, and capabilities; AI is drawn from new distributions.
3. Current ML work is on track to produce things that are, in the ways that matter, more like "randomly sampled plans" than like "the sorts of plans a civilization of human von Neumanns would produce". (Before we're anywhere near being able to produce the latter sorts of things.)
We're building "AI" in the sense of building powerful general search processes (and search processes for search processes), not building "AI" in the sense of building friendly ~humans but in silicon.
(Note that "we're going to build systems that are more like A Randomly Sampled Plan than like A Civilization of Human Von Neumanns" doesn't imply that the plan we'll get is the one we wanted! There are two separate problems: that current ML finds things-that-act-like-they're-optimizing-the-task-you-wanted rather than things-that-actually-internally-optimize-the-task-you-wanted, and also that internally ~maximizing most superficially desirable ends will kill humanity.)
Note that the same problem holds for systems trained to imitate humans, if those systems scale to being able to do things like "build whole-brain emulation". "We're training on something related to humans" doesn't give us "we're training things that are best thought of as humans plus noise".
It's not obvious to me that GPT-like systems can scale to capabilities like "build WBE". But if they do, we face the problem that most ways of successfully imitating humans don't look like "build a human (that's somehow superhumanly good at imitating the Internet)". They look like "build a relatively complex and alien optimization process that is good at imitation tasks (and potentially at many other tasks)".
You don't need to be a human in order to model humans, any more than you need to be a cloud in order to model clouds well. The only reason this is more confusing in the case of "predict humans" than in the case of "predict weather patterns" is that humans and AI systems are both intelligences, so it's easier to slide between "the AI models humans" and "the AI is basically a human".
4. The key differences between humans and "things that are more easily approximated as random search processes than as humans-plus-a-bit-of-noise" lie in lots of complicated machinery in the human brain.
(Cf. Detached Lever Fallacy, Niceness Is Unnatural, and Superintelligent AI Is Necessary For An Amazing Future, But Far From Sufficient.)
Humans are not blank slates in the relevant ways, such that just raising an AI like a human solves the problem.
This doesn't mean the problem is unsolvable; but it means that you either need to reproduce that internal machinery, in a lot of detail, in AI, or you need to build some new kind of machinery that’s safe for reasons other than the specific reasons humans are safe.
(You need cognitive machinery that somehow samples from a much narrower space of plans that are still powerful enough to succeed in at least one task that saves the world, but are constrained in ways that make them far less dangerous than the larger space of plans. And you need a thing that actually implements internal machinery like that, as opposed to just being optimized to superficially behave as though it does in the narrow and unrepresentative environments it was in before starting to work on WBE. "Novel science work" means that pretty much everything you want from the AI is out-of-distribution.)
5. STEM-level AGI timelines don't look that long (e.g., probably not 50 or 150 years; could well be 5 years or 15).
I won't try to argue for this proposition, beyond pointing at the field's recent progress and echoing Nate Soares' comments from early 2021:
[...] I observe that, 15 years ago, everyone was saying AGI is far off because of what it couldn't do -- basic image recognition, go, starcraft, winograd schemas, simple programming tasks. But basically all that has fallen. The gap between us and AGI is made mostly of intangibles. (Computer programming that is Actually Good? Theorem proving? Sure, but on my model, "good" versions of those are a hair's breadth away from full AGI already. And the fact that I need to clarify that "bad" versions don't count, speaks to my point that the only barriers people can name right now are intangibles.) That's a very uncomfortable place to be!
[...] I suspect that I'm in more-or-less the "penultimate epistemic state" on AGI timelines: I don't know of a project that seems like they're right on the brink; that would put me in the "final epistemic state" of thinking AGI is imminent. But I'm in the second-to-last epistemic state, where I wouldn't feel all that shocked to learn that some group has reached the brink. Maybe I won't get that call for 10 years! Or 20! But it could also be 2, and I wouldn't get to be indignant with reality. I wouldn't get to say "but all the following things should have happened first, before I made that observation!". Those things have happened. I have made those observations. [...]
I think timing tech is very difficult (and plausibly ~impossible when the tech isn't pretty imminent), and I think reasonable people can disagree a lot about timelines.
I also think converging on timelines is not very crucial, since if AGI is 50 years away I would say it's still the largest single risk we face, and the bare minimum alignment work required for surviving that transition could easily take longer than that.
Also, "STEM AGI when?" is the kind of argument that requires hashing out people's predictions about how we get to STEM AGI, which is a bad thing to debate publicly insofar as improving people's models of pathways can further shorten timelines.
I mention timelines anyway because they are in fact a major reason I'm pessimistic about our prospects; if I learned tomorrow that AGI were 200 years away, I'd be outright optimistic about things going well.
6. We don't currently know how to do alignment, we don't seem to have a much better idea now than we did 10 years ago, and there are many large novel visible difficulties. (See AGI Ruin and the Capabilities Generalization, and the Sharp Left Turn.)
On a more basic level, quoting Nate Soares: "Why do I think that AI alignment looks fairly difficult? The main reason is just that this has been my experience from actually working on these problems."
7. We should be starting with a pessimistic prior about achieving reliably good behavior in any complex safety-critical software, particularly if the software is novel. Even more so if the thing we need to make robust is structured like undocumented spaghetti code, and more so still if the field is highly competitive and you need to achieve some robustness property while moving faster than a large pool of less-safety-conscious people who are racing toward the precipice.
The default assumption is that complex software goes wrong in dozens of different ways you didn't expect. Reality ends up being thorny and inconvenient in many of the places where your models were absent or fuzzy. Surprises are abundant, and some surprises can be good, but pleasant surprises are empirically a lot rarer than unpleasant ones in software development hell.
The future is hard to predict, but plans systematically take longer and run into more snags than humans naively expect, as opposed to plans systematically going surprisingly smoothly and deadlines being systematically hit ahead of schedule.
The history of computer security and of safety-critical software systems is almost invariably one of robust software lagging far, far behind non-robust versions of the same software. Achieving any robustness property in complex software that will be deployed in the real world, with all its messiness and adversarial optimization, is very difficult and usually fails.
In many ways I think the foundational discussion of AGI risk is Security Mindset and Ordinary Paranoia and Security Mindset and the Logistic Success Curve, and the main body of the text doesn't even mention AGI. Adding in the specifics of AGI and smarter-than-human AI takes the risk from "dire" to "seemingly overwhelming", but adding in those specifics is not required to be massively concerned if you think getting this software right matters for our future.
8. Neither ML nor the larger world is currently taking this seriously, as of April 2023.
This is obviously something we can change. But until it's changed, things will continue to look very bad.
Additionally, most of the people who are taking AI risk somewhat seriously are, to an important extent, not willing to worry about things until after they've been experimentally proven to be dangerous. Which is a lethal sort of methodology to adopt when you're working with smarter-than-human AI.
My basic picture of why the world currently isn't responding appropriately is the one in Four mindset disagreements behind existential risk disagreements in ML, The inordinately slow spread of good AGI conversations in ML, and Inadequate Equilibria.
9. As noted above, current ML is very opaque, and it mostly lets you intervene on behavioral proxies for what we want, rather than letting us directly design desirable features.
ML as it exists today also requires that data is readily available and safe to provide. E.g., we can’t robustly train the AGI on "don’t kill people" because we can’t provide real examples of it killing people to train against the behavior we don't want; we can only give flawed proxies and work via indirection.
10. There are lots of specific abilities which seem like they ought to be possible for the kind of civilization that can safely deploy smarter-than-human optimization, that are far out of reach, with no obvious path forward for achieving them with opaque deep nets even if we had unlimited time to work on some relatively concrete set of research directions.
(Unlimited time suffices if we can set a more abstract/indirect research direction, like "just think about the problem for a long time until you find some solution". There are presumably paths forward; we just don’t know what they are today, which puts us in a worse situation.)
E.g., we don’t know how to go about inspecting a nanotech-developing AI system’s brain to verify that it’s only thinking about a specific room, that it’s internally representing the intended goal, that it’s directing its optimization at that representation, that it internally has a particular planning horizon and a variety of capability bounds, that it’s unable to think about optimizers (or specifically about humans), or that it otherwise has the right topics internally whitelisted or blacklisted.
Individually, it seems to me that each of these difficulties can be addressed. In combination, they seem to me to put us in a very dark situation.
One common response I hear to points like the above is:
The future is generically hard to predict, so it's just not possible to be rationally confident that things will go well or poorly. Even if you look at dozens of different arguments and framings and the ones that hold up to scrutiny nearly all seem to point in the same direction, it's always possible that you're making some invisible error of reasoning that causes correlated failures in many places at once.
I'm sympathetic to this because I agree that the future is hard to predict.
I'm not totally confident things will go poorly; if I were, I wouldn't be trying to solve the problem! I think things are looking extremely dire, but not hopeless.
That said, some people think that even "extremely dire" is an impossible belief state to be in, in advance of an AI apocalypse actually occurring. I disagree here, for two basic reasons:
a. There are many details we can get into, but on a core level I don't think the risk is particularly complicated or hard to reason about. The core concern fits into a tweet:
STEM AI is likely to vastly exceed human STEM abilities, conferring a decisive advantage. We aren't on track to knowing how to aim STEM AI at intended goals, and STEM AIs pursuing unintended goals tend to have instrumental subgoals like "control all resources".
Zvi Mowshowitz puts the core concern in even more basic terms:
I also notice a kind of presumption that things in most scenarios will work out and that doom is dependent on particular ‘distant possibilities,’ that often have many logical dependencies or require a lot of things to individually go as predicted. Whereas I would say that those possibilities are not so distant or unlikely, but more importantly that the result is robust, that once the intelligence and optimization pressure that matters is no longer human that most of the outcomes are existentially bad by my values and that one can reject or ignore many or most of the detail assumptions and still see this.
The details do matter for evaluating the exact risk level, but this isn't the sort of topic where it seems fundamentally impossible for any human to reach a good understanding of the core difficulties and whether we're handling them.
b. Relatedly, as Nate Soares has argued, AI disaster scenarios are disjunctive. There are many bad outcomes for every good outcome, and many paths leading to disaster for every path leading to utopia.
Quoting Eliezer Yudkowsky:
You don't get to adopt a prior where you have a 50-50 chance of winning the lottery "because either you win or you don't"; the question is not whether we're uncertain, but whether someone's allowed to milk their uncertainty to expect good outcomes.
Quoting Jack Rabuck:
I listened to the whole 4 hour Lunar Society interview with @ESYudkowsky (hosted by @dwarkesh_sp) that was mostly about AI alignment and I think I identified a point of confusion/disagreement that is pretty common in the area and is rarely fleshed out:
Dwarkesh repeatedly referred to the conclusion that AI is likely to kill humanity as "wild."
Wild seems to me to pack two concepts together, 'bad' and 'complex.' And when I say complex, I mean in the sense of the Fermi equation where you have an end point (dead humanity) that relies on a series of links in a chain and if you break any of those links, the end state doesn't occur.
It seems to me that Eliezer believes this end state is not wild (at least not in the complex sense), but very simple. He thinks many (most) paths converge to this end state.
That leads to a misunderstanding of sorts. Dwarkesh pushes Eliezer to give some predictions based on the line of reasoning that he uses to predict that end point, but since the end point is very simple and is a convergence, Eliezer correctly says that being able to reason to that end point does not give any predictive power about the particular path that will be taken in this universe to reach that end point.
Dwarkesh is thinking about the end of humanity as a causal chain with many links and if any of them are broken it means humans will continue on, while Eliezer thinks of the continuity of humanity (in the face of AGI) as a causal chain with many links and if any of them are broken it means humanity ends. Or perhaps more discretely, Eliezer thinks there are a few very hard things which humanity could do to continue in the face of AI, and absent one of those occurring, the end is a matter of when, not if, and the when is much closer than most other people think.
Anyway, I think each of Dwarkesh and Eliezer believe the other one falls on the side of extraordinary claims require extraordinary evidence - Dwarkesh thinking the end of humanity is "wild" and Eliezer believing humanity's viability in the face of AGI is "wild" (though not in the negative sense).
I don't consider "AGI ruin is disjunctive" a knock-down argument for high p(doom) on its own. NASA has a high success rate for rocket launches even though success requires many things to go right simultaneously. Humanity is capable of achieving conjunctive outcomes, to some degree; but I think this framing makes it clearer why it's possible to rationally arrive at a high p(doom), at all, when enough evidence points in that direction.
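The rocket-launch point can be sketched numerically: a conjunctive outcome is achievable, but only if each link is very reliable. This is a minimal illustration, assuming independent steps with equal success probability. The step counts and the 95% target are hypothetical and are not NASA figures.

```python
def overall_success(per_step, n_steps):
    """Probability that n independent steps all succeed."""
    return per_step ** n_steps

def required_per_step(target, n_steps):
    """Per-step success probability needed to reach `target` overall."""
    return target ** (1 / n_steps)

# Hypothetical: 100 steps that are each 99% reliable.
print(f"100 steps at 99% each: {overall_success(0.99, 100):.3f} overall")

# Per-step reliability needed for a 95% overall success rate.
for n in (10, 100, 1000):
    print(f"{n:>4} steps need {required_per_step(0.95, n):.6f} per step")
```

The required per-step reliability climbs toward 1 as the number of steps grows. That is the sense in which conjunctive success is possible "to some degree": it takes a mature, heavily tested engineering discipline to get every link that reliable.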
Might be a naive question:
For a STEM-capable AGI (or any intelligence for that matter) to do new science, it would have to interact with the physical environment to conduct experiments. Otherwise, how can the intelligent agent discover and validate new theories? For example, an AGI that understands physics and material science may theorize and propose thousands of possible high-temperature superconductors, but actually discovering a working material can happen only after actually synthesizing those materials and performing the experiments, which is time-consuming and difficult to do.
If that's true, then the speed at which the STEM-capable AGI discovers new knowledge, and correspondingly its "knowledge advantage" (not intelligence advantage) over humanity, is bottlenecked by the speed at which the AGI can interact and perform experiments in the physical world, which as of now depends almost entirely on human-operated equipment and is constrained by various real-world physical limitations (wear and tear, speed of chemical reactions, speed of biological systems, energy consumption, etc.). Doesn't this significantly throttle the speed at which AGI gains an advantage over humanity, giving us more time for alignment?
Or read arXiv papers and draw inferences that humans failed to draw, etc.
I expect there's a ton of useful stuff you can learn (that humanity is currently ignorant about) just from looking at existing data on the Internet. But I agree that AGI will destroy the world a little slower in expectation because it may get bottlenecked on running experiments, and it's at least conceivable that at least one project will decide not to let it run tons of physical experiments.
(Though I think the most promising ways to save the world involve AGIs running large numbers of physical experiments, so in addition to merely delaying AGI doom by some number of months, 'major labs don't let AGIs run physical experiments' plausibly rules out the small number of scenarios where humanity has a chance of surviving.)
Thank you for the reply, I agree with this point. Now that I think about it, protein folding is a good example of how the data was already available but, before AlphaFold, nobody could predict structure from sequence with high accuracy. Maybe a sufficiently smart AGI can get more knowledge out of existing data on the internet without performing too many new experiments.
How much more it can squeeze out of existing data (which was not generated specifically with the AGI's new hypotheses in mind), and whether that can give it a decisive advantage over humanity in a short span of time, could be important. I.e., whether the existing data out there contains enough information to figure out new science that is completely beyond our current understanding and can totally screw us.
I would argue that an important component of your first argument still stands. Even though AlphaFold can predict structures to some level of accuracy based on training data sets that may already exist, an AI would STILL need to check whether what it learned is usable in practice for its intended purposes. This logically requires experimentation. Also keep in mind that most data which already exists was not deliberately prepared to help a machine "do X". Any intelligence, no matter how strong, will still need to check its hypotheses and, thus, prepare data sets that can actually deliver the evidence necessary for drawing warranted conclusions.
I am not really sure what the consequences of this are, though.
I think a sufficiently intelligent intelligence can generate accurate beliefs from evidence, not just 'experiments', and not just its own experiments. I imagine AIs will be suggesting experiments too (if they're not already).
It is still plausible that not being able to run its own experiments will greatly hamper AI's scientific agendas, but it's harder to know how much it will exactly for intelligences likely to be much more intelligent than ourselves.
Afaik it is pretty well established that you cannot really learn anything new without actually testing your new belief in practice, i.e., experiments. I mean how else would this work? Evidence does not grow on trees, it has to be created (i.e., data has to be carefully generated, selected and interpreted to become useful evidence).
While it might be true that this experimenting can sometimes be done using existing data, the point is that if you want to learn something new about the universe like “what is dark matter and can it be used for something?” existing data is unlikely to be enough to test any idea you come up with.
Even if you take data from published academic papers and synthesize some new theories from that, it is still not always (or even likely) the case that the theory you come up with can be tested with already existing data, because any theory has unique requirements for what counts as evidence against it. I mean, that's the whole reason we continue to do experiments rather than just meta-analyze the sh*t out of all the papers out there.
Of course, advanced AI could trick us into doing certain experiments, or, judging by ChatGPT plugins, we may just give it access to anything on the internet wholesale in due time, so all of this may just be a short bump in the road. If we are lucky, we might avoid a FOOM-style takeover, though, as long as advanced AI remains dependent on us to carry out its experiments for it, simply because of the time those experiments will take. So even if it could bootstrap to nanotech quickly thanks to a good understanding of physics based on our formulas and existing data, the first manufacturing machine or factory would still need to be built somehow, and that may take some time.