The basic reasons I expect AGI ruin

RobBensinger

17 min read · Apr 18, 2023

Comments 13

vincentzh

Might be a naive question:

For a STEM-capable AGI (or any intelligence for that matter) to do new science, it would have to interact with the physical environment to conduct experiments. Otherwise, how can the intelligent agent discover and validate new theories? For example, an AGI that understands physics and material science may theorize and propose thousands of possible high-temperature superconductors, but actually discovering a working material can happen only after actually synthesizing those materials and performing the experiments, which is time-consuming and difficult to do.

If that's true, then the speed at which the STEM-capable AGI discovers new knowledge, and correspondingly its "knowledge advantage" (not intelligence advantage) over humanity, is bottlenecked by the speed at which the AGI can interact and perform experiments in the physical world, which as of now depends almost entirely on human-operated equipment and is constrained by various real-world physical limitations (wear and tear, speed of chemical reactions, speed of biological systems, energy consumption, etc.). Doesn't this significantly throttle the speed of AGI gaining advantage over humanity, giving us more time for alignment?

RobBensinger

For a STEM-capable AGI (or any intelligence for that matter) to do new science, it would have to interact with the physical environment to conduct experiments.

Or read arXiv papers and draw inferences that humans failed to draw, etc.

Doesn't this significantly throttle the speed of AGI gaining advantage over humanity, giving us more time for alignment?

I expect there's a ton of useful stuff you can learn (that humanity is currently ignorant about) just from looking at existing data on the Internet. But I agree that AGI will destroy the world a little slower in expectation because it may get bottlenecked on running experiments, and it's at least conceivable that at least one project will decide not to let it run tons of physical experiments.

(Though I think the most promising ways to save the world involve AGIs running large numbers of physical experiments, so in addition to merely delaying AGI doom by some number of months, 'major labs don't let AGIs run physical experiments' plausibly rules out the small number of scenarios where humanity has a chance of surviving.)

vincentzh

I expect there's a ton of useful stuff you can learn (that humanity is currently ignorant about) just from looking at existing data on the Internet.

Thank you for the reply, I agree with this point. Now that I think about it, protein folding is a good example of how the data was already available but, before AlphaFold, nobody could predict structure from sequence with high accuracy. Maybe a sufficiently smart AGI can get more knowledge out of existing data on the internet without performing too many new experiments.

How much more it can squeeze out of existing data (which was not generated specifically with the AGI's new hypotheses in mind), and whether that can give it a decisive advantage over humanity in a short span of time, could be important. I.e., whether the existing data out there contains enough information to figure out new science that is completely beyond our current understanding and can totally screw us.

Alexander Herwix 🔸

I would argue that an important component of your first argument still stands. Even though AlphaFold can predict structures to some level of accuracy based on training data sets that may already exist, an AI would STILL need to check whether what it learned is usable in practice for the purposes it is intended for. This logically requires experimentation. Also keep in mind that most data which already exists was not deliberately prepared to help a machine "do X". Any intelligence, no matter how strong, will still need to check its hypotheses and, thus, prepare data sets that can actually deliver the evidence necessary for drawing warranted conclusions.

I am not really sure what the consequences of this are, though.

Kenny

I think a sufficiently intelligent intelligence can generate accurate beliefs from evidence, not just 'experiments', and not just its own experiments. I imagine AIs will be suggesting experiments too (if they're not already).

It is still plausible that not being able to run its own experiments will greatly hamper an AI's scientific agendas, but it's harder to know exactly how much it will for intelligences likely to be much more intelligent than ourselves.

Alexander Herwix 🔸

Afaik it is pretty well established that you cannot really learn anything new without actually testing your new belief in practice, i.e., experiments. I mean how else would this work? Evidence does not grow on trees, it has to be created (i.e., data has to be carefully generated, selected and interpreted to become useful evidence).

While it might be true that this experimenting can sometimes be done using existing data, the point is that if you want to learn something new about the universe like “what is dark matter and can it be used for something?” existing data is unlikely to be enough to test any idea you come up with.

Even if you take data from published academic papers and synthesize some new theories from that, it is still not always (or even likely) the case that the theory you come up with can be tested with already existing data, because any theory has unique requirements for what counts as evidence against it. I mean, that's the whole reason why we continue to do experiments rather than just meta-analyze the sh*t out of all the papers out there.

Of course, advanced AI could trick us into doing certain experiments or, looking at ChatGPT plugins, we may just give it access to anything on the internet wholesale in due time, so all of this may just be a short bump in the road. If we are lucky, though, we might avoid a FOOM-style takeover as long as advanced AI remains dependent on us to carry out its experiments, simply because of the time those experiments will take. So even if it could bootstrap to nanotech quickly thanks to a good understanding of physics based on our formulas and existing data, the first manufacturing machine or factory would still need to be built somehow, and that may take some time.

Henry Howard🔸

I feel the weakest part of this argument, and the weakest part of the AI Safety space generally, is the part where AI kills everyone (part 2, in this case).

You argue that most paths to some ambitious goal like whole-brain emulation end terribly for humans, because how else could the AI do whole-brain emulation without subjugating, eliminating or atomising everyone?

I don't think that follows. This seems like what the average hunter-gatherer would have thought when made to imagine our modern commercial airlines or microprocessor industries: how could you achieve something requiring so much research, so many resources and so much coordination without enslaving huge swathes of society and killing anyone that gets in the way? And wouldn't the knowledge to do these things cause terrible new dangers?

Luckily the peasant is wrong: the path here has led up a slope of gradually increasing quality of life (some disagree).

Alexander Herwix 🔸

I think the point is not that progress continuing with humans still alive is inconceivable. The point is the game-theoretic dilemma that whatever we humans want to do is unlikely to be exactly what some super-powerful advanced AI would want to do. And because the advanced AI does not need us or depend on us, we simply lose and get to be ingredients for whatever that advanced AI is up to.

Your example with humanity fails because humans have always been and continue to be a social species that is dependent on each other. An unaligned advanced AI would not be so. A more appropriate example would be the relationship between humans and insects. I don't know if you noticed, but a lot of those are dying out right now because we simply don't care about or depend on them. The point with advanced AI is that it is potentially even more removed from us than we are from insects, and also much more capable of achieving its goals, so this whole competitive process which we all engage in is going to become much more competitive and much faster once advanced AIs start playing the game.

I don't want to be the bearer of bad news but I think it is not that easy to reject this analysis... it seems pretty simple and solid. I would love to know if there is some flaw in the reasoning. Would help me sleep better at night!

RobBensinger

Your example with humanity fails because humans have always been and continue to be a social species that is dependent on each other.

I would much more say that it fails because humans have human values.

Maybe a hunter-gatherer would have worried that building airplanes would somehow cause a catastrophe? I don't exactly see why; the obvious hunter-gatherer rejoinder could be 'we built fire and spears and our lives only improved; why would building wings to fly make anything bad happen?'.

Regardless, it doesn't seem like you can get much mileage via an analogy that sticks entirely to humans. Humans are indeed safe, because "safety" is indexed to human values; when we try to reason about non-human optimizers, we tend to anthropomorphize them and implicitly assume that they'll be safe for many of the same reasons. Cf. The Tragedy of Group Selectionism and Anthropomorphic Optimism.

You argue that most paths to some ambitious goal like whole-brain emulation end terribly for humans, because how else could the AI do whole-brain emulation without subjugating, eliminating or atomising everyone?

'Wow, I can't imagine a way to do something so ambitious without causing lots of carnage in the process' is definitely not the argument! On the contrary, I think it's pretty trivial to get good outcomes from humans via a wide variety of different ways we could build WBE ourselves.

The instrumental convergence argument isn't 'I can't imagine a way to do this without killing everyone'; it's that sufficiently powerful optimization behaves like maximizing optimization for practical purposes, and maximizing-ish optimization is dangerous if your terminal values aren't included in the objective being maximized.

If it helps, we could maybe break the disagreement about instrumental convergence into three parts, like:

Would a sufficiently powerful paperclip maximizer kill all humans, given the opportunity?
Would sufficiently powerful inhuman optimization of most goals kill all humans, or are paperclips an exception?
Is 'build fast-running human whole-brain emulation' an ambitious enough task to fall under the 'sufficiently powerful' criterion above? Or if so, is there some other reason random policies might be safe if directed at this task, even if they wouldn't be safe for other similarly-hard tasks?

Henry Howard🔸

The step that's missing for me is the one where the paperclip maximiser gets the opportunity to kill everyone.

Your talk of "plans" and the dangers of executing them seems to assume that the AI has all the power it needs to execute the plans. I don't think the AI crowd has done enough to demonstrate how this could happen.

If you drop a naked human in amongst some wolves, I don't think the human will do very well despite its different goals and enormous intellectual advantage. Similarly, I don't see how a fledgling sentient AGI on OpenAI servers can take over enough infrastructure that it poses a serious threat. I've not seen a convincing theory for how this would happen. Mail-order nanobots seem unrealistic (too hard to simulate the quantum effects in protein chemistry); the AI talking itself out of its box is another suggestion that seems far-fetched (the main evidence seems to be some chat games that Yudkowsky played a few times?); and a gradual takeover through voluntary uptake into more and more of our lives seems slow enough to stop.

Ian Turner

Is your question basically how an AGI would gain power in the beginning in order to get to a point where it could execute on a plan to annihilate humans?

I would argue that:

Capitalists would quite readily give the AGI all the power it wants, in order to stay competitive and drive profits.
Some number of people would deliberately help the AGI gain power just to "see what happens" or specifically to hurt humanity. Think ChaosGPT, or consider the story of David Charles Hahn.
Some number of lonely, depressed, or desperate people could be persuaded over social media to carry out actions in the real world.

Considering these channels, I'd say that a sufficiently intelligent AGI with as much access to the real world as ChatGPT has now would have all the power needed to increase its power to the point of being able to annihilate humans.

demirev

Thank you for taking the time to write this - I think it is a clear and concise entry point into the AGI ruin arguments.

I want to voice an objection / point out an omission to point 2: I agree that any plan towards a sufficiently complicated goal will include "acquire resources" as a sub-goal, and that "getting rid of all humans" might be a by-product of some ways to achieve this sub-goal. I'm also willing to grant that if all we know about the plan is that it achieves the end (sufficiently complicated) goal, it is likely that the plan might lead to the destruction of all humans.

However, I don't see why we can't infer more about the plans. Specifically, I think an ASI plan for a sufficiently complicated goal should be 1) feasible and 2) efficient (at least in some sense). If the ASI doesn't believe that it can overpower humanity, then its plans will not include overpowering humanity. Even more, if the ASI ascribes a high enough cost to overpowering humanity, it would instead opt to acquire resources in another way.

It seems that for point 2 to hold you must think that an ASI can overpower humanity with 1) close to a 100% certainty and 2) at negligible cost to the ASI. However I don't think this is (explicitly) argued for in the article. Or maybe I'm missing something?
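To make the two conditions concrete, here is a minimal sketch of the comparison I have in mind. Everything in it is an assumption for illustration: the `Plan` fields, the simple "probability times payoff minus cost" scoring rule, and all of the numbers are hypothetical and are not taken from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A candidate plan, described only by hypothetical summary numbers."""
    name: str
    p_success: float  # agent's believed probability that the plan works
    payoff: float     # value to the agent if the plan works
    cost: float       # resources spent whether or not the plan works


def expected_value(plan: Plan) -> float:
    # Simplifying assumption: a failed plan yields nothing but still costs.
    return plan.p_success * plan.payoff - plan.cost


def best_plan(plans):
    # An efficient planner picks the plan with the highest expected value.
    return max(plans, key=expected_value)


# Hypothetical numbers, chosen only to show both regimes.
overpower_hard = Plan("overpower humanity", p_success=0.30, payoff=100.0, cost=40.0)
overpower_easy = Plan("overpower humanity", p_success=0.99, payoff=100.0, cost=1.0)
other_route = Plan("acquire resources another way", p_success=0.90, payoff=60.0, cost=5.0)

print(best_plan([overpower_hard, other_route]).name)
print(best_plan([overpower_easy, other_route]).name)
```

With these made-up numbers, the planner avoids overpowering humanity when success is uncertain and costly, and chooses it only when success is near-certain and nearly free, which is the regime point 2 seems to need.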

Vasco Grilo🔸

Thanks, I thought this to be informative!

^{^}

Eliezer Yudkowsky's So Far: Unfriendly AI Edition and Nate Soares' Ensuring Smarter-Than-Human Intelligence Has a Positive Outcome are two other good (though old) introductions to what I'd consider "the basics".

To state the obvious: this post consists of various claims that increase my probability on AI causing an existential catastrophe, but not all the claims have to be true in order for AI to have a high probability of causing such a catastrophe.

Also, I wrote this post to summarize my own top reasons for being worried, not to try to make a maximally compelling or digestible case for others. I don't expect others to be similarly confident based on such a quick overview, unless perhaps you've read other sources on AI risk in the past. (Including more optimistic ones, since it's harder to be confident when you've only heard from one side of a disagreement. I've written in the past about some of the things that give me small glimmers of hope, but people who are overall far more hopeful will have very different reasons for hope, based on very different heuristics and background models.)

^{^}

E.g., the physical world is too complex to simulate in full detail, unlike a Go board state. An effective general intelligence needs to be able to model the world at many different levels of granularity, and strategically choose which levels are relevant to think about, as well as which specific pieces/aspects/properties of the world at those levels are relevant to think about.

More generally, being a general intelligence requires an enormous amount of laserlike focus and strategicness when it comes to which thoughts you do or don't think. A large portion of your compute needs to be relentlessly funneled into exactly the tiny subset of questions about the physical world that bear on the question you're trying to answer or the problem you're trying to solve. If you fail to be relentlessly targeted and efficient in "aiming" your cognition at the most useful-to-you things, you can easily spend a lifetime getting sidetracked by minutiae, directing your attention at the wrong considerations, etc.

And given the variety of kinds of problems you need to solve in order to navigate the physical world well, do science, etc., the heuristics you use to funnel your compute to the exact right things need to themselves be very general, rather than all being case-specific.

(Whereas we can more readily imagine that many of the heuristics AlphaGo uses to avoid thinking about the wrong aspects of the game state (or getting otherwise sidetracked) are Go-specific heuristics.)

^{^}

Of course, if your brain has all the basic mental machinery required to do other sciences, that doesn't mean that you have the knowledge required to actually do well in those sciences. A STEM-level artificial general intelligence could lack physics ability for the same reason many smart humans can't solve physics problems.

^{^}

E.g., because different sciences can synergize, and because you can invent new scientific fields and subfields, and more generally chain one novel insight into dozens of other new insights that critically depended on the first insight.

^{^}

More generally, the sciences (and many other aspects of human life, like written language) are a very recent development on evolutionary timescales. So evolution has had very little time to refine and improve on our reasoning ability in many of the ways that matter.

^{^}

"Human engineers have an enormous variety of tools available that evolution lacked" is often noted as a reason to think that we may be able to align AGI to our goals, even though evolution failed to align humans to its "goal". It's additionally a reason to expect AGI to have greater cognitive ability, if engineers try to achieve great cognitive ability.

^{^}

And my understanding is that, e.g., Paul Christiano's soft-takeoff scenarios don't involve there being much time between par-human scientific ability and superintelligence. Rather, he's betting that we have a bunch of decades between GPT-4 and par-human STEM AGI.

^{^}

I'll classify thoughts and text outputs as "actions" too, not just physical movements.

^{^}

Obviously, neither is a particularly good approximation for ML systems. The point is that our optimism about plans in real life generally comes from the fact that they're weak, and/or it comes from the fact that the plan generators are human brains with the full suite of human psychological universals. ML systems don't possess those human universals, and won't stay weak indefinitely.

^{^}

Quoting Four mindset disagreements behind existential risk disagreements in ML:

People are taking the risks unseriously because they feel weird and abstract.
When they do think about the risks, they anchor to what's familiar and known, dismissing other considerations because they feel "unconservative" from a forecasting perspective.
Meanwhile, social mimesis and the bystander effect make the field sluggish at pivoting in response to new arguments and smoke under the door.

Quoting The inordinately slow spread of good AGI conversations in ML:

Info about AGI propagates too slowly through the field, because when one ML person updates, they usually don't loudly share their update with all their peers. This is because:
1. AGI sounds weird, and they don't want to sound like a weird outsider.
2. Their peers and the community as a whole might perceive this information as an attack on the field, an attempt to lower its status, etc.
3. Tech forecasting, differential technological development, long-term steering, exploratory engineering, 'not doing certain research because of its long-term social impact', prosocial research closure, etc. are very novel and foreign to most scientists.
EAs exert effort to try to dig up precedents like Asilomar partly because Asilomar is so unusual compared to the norms and practices of the vast majority of science. Scientists generally don't think in these terms at all, especially in advance of any major disasters their field causes.
And the scientists who do find any of this intuitive often feel vaguely nervous, alone, and adrift when they talk about it. On a gut level, they see that they have no institutional home and no super-widely-shared 'this is a virtuous and respectable way to do science' narrative.
Normal science is not Bayesian, is not agentic, is not 'a place where you're supposed to do arbitrary things just because you heard an argument that makes sense'. Normal science is a specific collection of scripts, customs, and established protocols.
In trying to move the field toward 'doing the thing that just makes sense', even though it's about a weird topic (AGI), and even though the prescribed response is also weird (closure, differential tech development, etc.), and even though the arguments in support are weird (where's the experimental data??), we're inherently fighting our way upstream, against the current.
Success is possible, but way, way more dakka is needed, and IMO it's easy to understand why we haven't succeeded more.
This is also part of why I've increasingly updated toward a strategy of "let's all be way too blunt and candid about our AGI-related thoughts".
The core problem we face isn't 'people informedly disagree', 'there's a values conflict', 'we haven't written up the arguments', 'nobody has seen the arguments', or even 'self-deception' or 'self-serving bias'.
The core problem we face is 'not enough information is transmitting fast enough, because people feel nervous about whether their private thoughts are in the Overton window'.

On the more basic level, Inadequate Equilibria paints a picture of the world's baseline civilizational competence that I think makes it less mysterious why we could screw up this badly on a novel problem that our scientific and political institutions weren't designed to address. Inadequate Equilibria also talks about the nuts and bolts of Modest Epistemology, which I think is a key part of the failure story.

^{^}

Quoting a recent conversation between Aryeh Englander and Eliezer Yudkowsky:

Aryeh: [...] Yet I still have a very hard time understanding the arguments that would lead to such a high-confidence prediction. Like, I think I understand the main arguments for AI existential risk, but I just don't understand why some people seem so sure of the risks. [...]
Eliezer: I think the core thing is the sense that you cannot in this case milk uncertainty for a chance of good outcomes; to get to a good outcome you'd have to actually know where you're steering, like trying to buy a winning lottery ticket or launching a Moon rocket. Once you realize that uncertainty doesn't move estimates back toward "50-50, either we live happily ever after or not", you realize that "people in the EA forums cannot tell whether Eliezer or Paul is right" is not a factor that moves us toward 1:1 good:bad but rather another sign of doom; surviving worlds don't look confused like that and are able to make faster progress.
Not as a fully valid argument from which one cannot update further, but as an intuition pump: the more all arguments about the future seem fallible, the more you should expect the future Solar System to have a randomized configuration from your own perspective. Almost zero of those have humans in them. It takes confidence about some argument constraining the future to get to more than that.
Aryeh: when you talk about uncertainty here do you mean uncertain factors within your basic world model, or are you also counting model uncertainty? I can see how within your world model extra sources of uncertainty don't point to lower risk estimates. But my general question I think is more about model uncertainty: how sure can you really be that your world model and reference classes and framework for thinking about this is the right one vs e.g., Robin or Paul or Rohin or lots of others? And in terms of model uncertainty it looks like most of these other approaches imply much lower risk estimates, so adding in that kind of model uncertainty should presumably (I think) point to overall lower risk estimates.
Eliezer: Aryeh, if you've got a specific theory that says your rocket design is going to explode, and then you're also very unsure of how rockets work really, what probability should you assess of your rocket landing safely on target?
Aryeh: how about if you have a specific theory that says you should be comparing what you're doing to a rocket aiming for the moon but it'll explode, and then a bunch of other theories saying it won't explode, plus a bunch of theories saying you shouldn't be comparing what you're doing to a rocket in the first place? My understanding of many alignment proposals is that they think we do understand "rockets" sufficiently so that we can aim them, but they disagree on various specifics that lead you to have such high confidence in an explosion. And then there are others like Robin Hanson who use mostly outside-type arguments to argue that you're framing the issues incorrectly, and we shouldn't be comparing this to "rockets" at all because that's the wrong reference class to use. So yes, accounting for some types of model uncertainty won't reduce our risk assessments and may even raise them further, but other types of model uncertainty - including many of the actual alternative models / framings at least as I understand them - should presumably decrease our risk assessment.
Eliezer: What if people are trying to build a flying machine for the first time, and there's a whole host of them with wildly different theories about why it ought to fly easily, and you think there's basic obstacles to stable flight that they're not getting? Could you force the machine to fly despite all obstacles by recruiting more and more optimists to have different theories, each of whom would have some chance of being right?
Aryeh: right, my point is that in order to have near certainty of not flying you need to be very very sure that your model is right and theirs isn't. Or in other words, you need to have very low model uncertainty. But once you add in model uncertainty where you consider that maybe those other optimists' models could be right, then your risk estimates will go down. Of course you can't arbitrarily add in random optimistic models from random people - it needs to be weighted in some way. My confusion here is that you seem to be very, very certain that your model is the right one, complete with all its pieces and sub-arguments and the particular reference classes you use, and I just don't quite understand why.
Eliezer: There's a big difference between "sure your model is the right one" and the whole thing with people wandering over with their own models and somebody else going, "I can't tell the difference between you and them, how can you possibly be so sure they're not right?"
The intuition I'm trying to gesture at here is that you can't milk success out of uncertainty, even by having a bunch of other people wander over with optimistic models. It shouldn't be able to work in real life. If your epistemology says that you can generate free success probability that way, you must be doing something wrong.
Or maybe another way to put it: When you run into a very difficult problem that you can see is very difficult, but inevitably a bunch of people with less clear sight wander over and are optimistic about it because they don't see the problems, for you to update on the optimists would be to update on something that happens inevitably. So to adopt this policy is just to make it impossible for yourself to ever perceive when things have gotten really bad.
Aryeh: not sure I fully understand what you're saying. It looks to me like to some degree what you're saying boils down to your views on modest epistemology - i.e., basically just go with your own views and don't defer to anybody else. It sounds like you're saying not only don't defer, but don't even really incorporate any significant model uncertainty based on other people's views. Am I understanding this at all correctly or am I totally off here?
Eliezer: My epistemology is such that it's possible in principle for me to notice that I'm doomed, in worlds which look very doomed, despite the fact that all such possible worlds no matter how doomed they actually are, always contain a chorus of people claiming we're not doomed.

(See Inadequate Equilibria for a detailed discussion of Modest Epistemology, deference, and "outside views", and Strong Evidence Is Common for the basic first-order case that people can often reach confident conclusions about things.)
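As a rough illustration of what is at stake in the exchange above, model uncertainty can be pictured as a weighted average of the risk each model assigns. This is only a sketch under stated assumptions: the function `mixture_risk`, the weights, and the per-model risk numbers are all hypothetical, and the sketch does not settle which weights are appropriate, which is the actual point of disagreement.

```python
def mixture_risk(models):
    """Weighted average of per-model risk estimates.

    `models` is a list of (weight, risk) pairs; weights are normalised here,
    so they only need to be non-negative with a positive sum.
    """
    total_weight = sum(weight for weight, _ in models)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weight * risk for weight, risk in models) / total_weight


# Hypothetical numbers for illustration only.
pessimist_only = [(1.0, 0.90)]

# Aryeh's framing: give real weight to models that imply low risk.
with_optimists = [(0.5, 0.90), (0.5, 0.10)]

# Eliezer's rocket framing: the alternative to "my theory says it explodes"
# is "I don't understand rockets", which does not imply a safe landing.
with_ignorance = [(0.5, 0.90), (0.5, 0.95)]

for label, models in [
    ("pessimist only", pessimist_only),
    ("plus optimistic models", with_optimists),
    ("plus 'we don't understand this'", with_ignorance),
]:
    print(f"{label}: {mixture_risk(models):.3f}")
```

The arithmetic is not in dispute. What differs is whether the extra weight should go to models that predict success or to models that merely say the pessimistic model might be wrong.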