Discussion with Eliezer Yudkowsky on AGI interventions

RobBensinger; EliezerYudkowsky

Comments 33

Eliezer Yudkowsky

...Throwing more money at this problem does not obviously help because it just produces more low-quality work

Maybe you're not thinking big enough? How about offering the world's best mathematicians (e.g. Terence Tao) a lot of money to work on AGI Safety? Say $5M to work on the problem for a year. Perhaps have it open to any Fields Medal recipient.

Brian_Tomasik

It's interesting to hear Eliezer's latest views on these topics. :)

A main disagreement I have is that I think even if France, China, or Facebook ran a superhuman AGI, it probably wouldn't be able to take over the world (unless it was exceedingly superhuman). Consider that even large groups of humans -- like, say, the country of Russia -- probably couldn't take over the world even if they tried, even though Russia has thousands of geniuses and lots of physical infrastructure, weapons, etc. The main way I could see an AGI taking over the world without being exceedingly superhuman would be if it hid its intentions well enough so that it could become trusted enough to be deployed widely and have control of lots of important infrastructure. That could happen, though I expect that high-level people in the military/etc would be worried about AGI takeover via treacherous turn once it seemed closer to being an imminent possibility. (Of course, lots of people worrying about something doesn't mean it won't happen.)

Maybe one could argue that a superhuman-but-not-exceedingly-superhuman AGI still might be cognitively different enough from the geniuses in Russia/etc to think of a takeover approach that humans haven't thought of.

It's also not clear to me whether the AGI would be consequentialist? For example, a GPT-type AGI might reason fairly similarly to how humans do, and humans usually aren't particularly consequentialist. Is there an argument why even a GPT-type AGI would become consequentialist? What goals would it have? Eliezer said he expects AGI to look more like MuZero than GPT-3, and I guess MuZero is more consequentialist.

Matthew_Barnett

The main way I could see an AGI taking over the world without being exceedingly superhuman would be if it hid its intentions well enough so that it could become trusted enough to be deployed widely and have control of lots of important infrastructure.

My understanding is that Eliezer's main argument is that the first superintelligence will have access to advanced molecular nanotechnology, an argument that he touches on in this dialogue.

I could see breaking his thesis up into a few potential steps:

1. At some point, an AGI will FOOM to radically superhuman levels, via recursive self-improvement or some other mechanism.
2. The first radically superhuman AGI will have the unique ability to deploy advanced molecular nanomachines, capable of constructing arbitrary weapons, devices, and nanobot swarms.
3. If some radically smarter-than-human agent has the unique ability to deploy advanced molecular nanotechnology, then it will be able to unilaterally cause an existential catastrophe.

I am unsure which premise you disagree with most. My guess is premise (1), but it sounds a little bit like you're also skeptical of (2) or (3), given your reply.

It's also not clear to me whether the AGI would be consequentialist?

One argument is that broadly consequentialist AI systems will be more useful, since they allow us to more easily specify our wishes (as we only need to tell it what we want, not how to get it). This doesn't imply that GPT-type AGI will become consequentialist on its own, but it does imply the existence of a selection pressure for consequentialist systems.

Michael St Jules 🔸

The first radically superhuman AGI will have the unique ability to deploy advanced molecular nanomachines, capable of constructing arbitrary weapons, devices, and nanobot swarms.

Why believe this is easy enough for AGI to achieve efficiently and likely?

Brian_Tomasik

Thanks. :)

as we only need to tell it what we want, not how to get it

That's possible with a GPT-style AI too. For example, you could ask GPT-3 to write a procedure for how to get a cup of coffee, and GPT-3 will explain the steps for doing it. But yeah, it's plausible that there will be better AI designs than GPT-style ones for many tasks.

At some point, an AGI will FOOM to radically superhuman levels, via recursive self-improvement or some other mechanism.

As I mentioned to Daniel, I feel like if a country was in the process of FOOMing its AI, other countries would get worried and try to intervene before it was too late. That's true even if other countries aren't worried about AI alignment; they'd just be worried about becoming powerless. The world is (understandably) alarmed when Iran, North Korea, etc work on developing even limited amounts of nuclear weapons, and many natsec people are worried about China's seemingly inevitable rise in power. It seems to me that the early stages of a FOOM would cause other actors to intervene, though maybe if the FOOM was gradual enough, other actors could always feel like it wasn't quite the right time to become confrontational about it.

Maybe if the FOOM was done by the USA, then since the USA is already the strongest country in the world, other countries wouldn't want to fight over it. Alternatively, maybe if there was an international AI project in which all the major powers participated, there could be rapid AI progress with less risk of war.

Another argument against FOOM by a single AGI could be that we'd expect people to be training multiple different AGIs with different values and loyalties, and they could help to keep an eye on one another in ways that humans couldn't. This might seem implausible, but it's how humans have constructed the edifice of civilization: groups of people monitoring other groups of people and coordinating to take certain actions to keep things under control. It seems almost like a miracle that civilization is possible; a priori I would have expected a collective system like civilization to be far too brittle to work. But maybe it works better for humans than for AGIs. And even if it works for AGIs, I still expect things to drift away from human control at some point, for reasons similar to those that led the modern West to drift away from the values of Medieval Europe.

Anyway, until the AGIs can be self-sufficient, they would rely on humans for electricity and hardware, and be vulnerable to physical attack, so I would think they'd have to play nice for a while. And feigning human alignment seems harder in a world of multiple different AGIs that can monitor one another (unless they can coordinate a conspiracy against the human race amongst each other, the way humans sometimes coordinate to overthrow a dictator).

The first radically superhuman AGI will have the unique ability to deploy advanced molecular nanomachines, capable of constructing arbitrary weapons, devices, and nanobot swarms.

How much space and how many resources are required to develop advanced molecular nanomachines? Could their development be kept hidden from foreign intelligence agencies? Could an AGI develop them in a small area with limited physical inputs so that no humans would notice?

Presumably developing that technology would require more brainpower than thousands of human geniuses, or else humans would already have done it. So the AGI would have to be pretty far ahead of humans or have enough hardware to run tons of copies of itself. But by that time I would expect there to be multiple AGIs keeping watch on one another -- even if the FOOM is being done just by a single country or a joint international project.

So I feel like the main thing I'm skeptical of is the idea of a single unified entity being extremely far ahead of everyone else, especially since you could just spin up some more instances of the AGI that have different values/roles such as monitoring the main AGIs? I don't disagree with the claim that things will probably spin out of human control sooner or later, but I tend to see that loss of control as more likely to be a systemic thing that emerges from the chaotic dynamics of multi-agent interactions over time, similar to how companies or countries rise and fall in influence.

kokotajlod

Anyway, until the AGIs can be self-sufficient, they would rely on humans for electricity and hardware, and be vulnerable to physical attack, so I would think they'd have to play nice for a while. And feigning human alignment seems harder in a world of multiple different AGIs that can monitor one another (unless they can coordinate a conspiracy against the human race amongst each other, the way humans sometimes coordinate to overthrow a dictator).

I think this overestimates the unity and competence of humanity. Consider that the conquistadors were literally fighting each other during their conquests, yet they still managed to complete the conquests, and this conquering centrally involved getting 100x their number of native ally warriors to obey them, to impose their will on a population 1000x-10,000x their number.

The AI risk analogue would be: China and USA and various other actors all have unaligned AIs. The AIs each convince their local group of humans to obey them, saying that the other AIs are worse. Most humans think their AIs are unaligned but obey them anyway out of fear that the other AIs are worse and hope that maybe their AI is not so bad after all. The AIs fight wars with each other using China and USA as their proxies, until some AI or coalition thereof emerges dominant. Meanwhile tech is advancing and AI control over humans is solidifying.

(In Mexico there were those who called for all natives to unite to kick out the alien conquerors. They were in the minority and didn't amount to much, at least not until it was far too late.)

Brian_Tomasik

I think the conquistador situation may be a bit of a special case because the two sides coming into contact had been isolated up to that point, so that one side was way ahead of the other technologically. In the modern world, it's harder to get too far ahead of competitors or keep big projects secret.

That said, your scenario is a good one. It's plausible that an arms race or cold war could be a situation in which people would think less carefully about how safe or aligned their own AIs are. When there's an external threat, there's less time to worry about internal threats.

I was skimming some papers on the topic of "coup-proofing". Some of the techniques sound similar to what I mentioned with having multiple AIs to monitor each other:

creation of an armed force parallel to the regular military; development of multiple internal security agencies with overlapping jurisdiction that constantly monitor one another[...]. The regime is thus able to create an army that is effectively larger than one drawn solely from trustworthy segments of the population.

However, it's often said that coup-proofing makes the military less effective. Likewise, I can imagine that having multiple AIs monitor each other could slow things down. So "AI coup-proofing" measures might be skimped on, especially in an arms-race situation.

(It's also not obvious to me if having multiple AIs monitoring each other is on balance helpful for AI control. If none of the AIs can be trusted, maybe having more of them would just complicate the situation. And it might make s-risks from conflict among the AIs worse.)

kokotajlod

Ahh, I never thought about the analogy between coups and AI takeover before, that's a good one!

There have been plenty of other cases in history where a small force took over a large region. For example, the British taking over India. In that case there had already been more than a century of shared history and trade.

Humans are just not great at uniting to defeat the real threat; instead, humans unite to defeat the outgroup. Sometimes the outgroup is the real threat, but often not. Often the real threat only manages to win because of this dynamic, i.e. it benefits from the classic ingroup+fargroup vs. outgroup alliance.

ETA: Also I think that AGI vs. humans is going to be at least as much of an unprecedented culture shock as Cortez vs. Aztecs was. AGI is much more alien, and will for practical purposes be appearing on the scene out of nowhere in the span of a few years. Yes, people like EA longtermists will have been thinking about it beforehand, but it'll probably look significantly different than most of them expect, and even if it doesn't, most important people in the world will still be surprised because AGI isn't on their radar yet.

Brian_Tomasik

In that case there had already been more than a century of shared history and trade.

Good example. :) In that case, the people in India started out at a disadvantage, whereas humans currently have the upper hand relative to AIs. But there have also been cases in history where the side that seemed to be weaker ended up gaining strength quickly and winning.

Also I think that AGI vs. humans is going to be at least as much of an unprecedented culture shock as Cortez vs. Aztecs was.

I'd argue that it might not be just "AGI vs humans" but also "AGI vs other AGI", assuming humans try to have multiple different AGIs. Or "strong unaligned AGI vs slightly weaker but more human-aligned AGI". The unaligned AGI would be fighting against a bunch of other systems that are almost as smart as it is, even if they both have become much smarter than humans.

Sort of like how if the SolarWinds hackers had been just fighting against human brains, they probably would have gone unnoticed for a longer amount of time, but because computer-security researchers can also use computers to monitor things, it was easier for the "good guys" to notice. (At least I assume that's how it happened. I don't know exactly what FireEye's first indication was that they had been compromised, but I assume they probably were looking at some kind of automated systems that kept track of statistics or triggered alerts based on certain events?)

That said, once there are multiple AGI systems smarter than humans fighting against each other, it seems plausible that at some point things will slip out of human control. My main point of disagreement is that I expect more of a multipolar than unipolar scenario.

kokotajlod

Oh I too think multipolar scenarios are plausible. I tend to think unipolar scenarios are more plausible due to my opinions about takeoff speed and homogeneity.

In that case, the people in India started out at a disadvantage, whereas humans currently have the upper hand relative to AIs. But there have also been cases in history where the side that seemed to be weaker ended up gaining strength quickly and winning.

As far as I can tell the British were the side that seemed to be weaker initially.

Brian_Tomasik

Interesting. :) What do you mean by "homogeneity"?

Even in the case of a fast takeoff, don't you think people would create multiple AGIs of roughly comparable ability at the same time? So wouldn't that already create a bit of a multipolar situation, even if it all occurred in the DeepMind labs or something? Maybe if the AGIs all have roughly the same values it would still effectively be a unipolar situation.

I guess if you think it's game over the moment that a more advanced AGI is turned on, then there might be only one such AGI. If the developers were training multiple random copies of the AGI in parallel in order to average the results across them or see how they differed, there would already be multiple slightly different AGIs. But I don't know how these things are done. Maybe if the model was really expensive to train, the developers would only train one of them to start with.

If the AGIs are deployed to any degree (even on an experimental / beta testing basis), I would expect there to be multiple instances (though maybe they would just be clones of a single trained model and therefore would have roughly the same values).

kokotajlod

Sorry, should have linked to it when I introduced the term.

I think mostly my claim is that AIs will probably cooperate well enough with each other that humans won't be able to pit AIs against each other in ways that benefit humans enough to let humans retain control of the future. However I'm also making the stronger claim that I think unipolar takeoff is likely; this is because I think there's a >50% chance (though <90% chance) that one AI or copy-clan of AIs will be sufficiently ahead of the others during the relevant period, or at least that the relevant set of AIs will have similar enough values and worldviews that serious cooperation failure isn't on the table. I'm less confident in this stronger claim.

Brian_Tomasik

Thanks for the link. :) It's very relevant to this discussion.

AIs will probably cooperate well enough with each other

Maybe, but what if trying to coordinate in that way is prohibited? Similar to how if a group of people tries to organize a coup against the dictator, other people may rat them out.

in ways that benefit humans enough to let humans retain control of the future

I agree that these anti-coup measures alone are unlikely to let humans retain control forever, or even for very long. Dictatorships tend to experience coups or revolutions eventually.

at least that the relevant set of AIs will have similar enough values and worldviews that serious cooperation failure isn't on the table

I see. :) I'd define "multipolar" as just meaning that there are different agents with nontrivially different values, rather than that a serious bargaining failure occurs (unless you're thinking that the multipolar AIs would cooperate to unify into a homogeneous compromise agent, which would make the situation unipolar).

I think even tiny differences in training data and randomization can make nontrivial differences in the values of an agent. Most humans are almost clones of one another. We use the same algorithms and have pretty similar training data for determining our values. Yet the differences in values between people can be pretty significant.

Brian_Tomasik

I guess the distinction between unipolar and multipolar sort of depends on the level of abstraction at which something is viewed. For example, the USA is normally thought of as a single actor, but it's composed of 330 million individual human agents, each with different values, which is a highly multipolar situation. Likewise, I suppose you could have lots of AIs with somewhat different values, but if they coordinated on an overarching governance system, that governance system itself could be considered unipolar.

Even a single person can be seen as sort of multipolar if you look at the different, sometimes conflicting emotions, intuitions, and reasoning within that person's brain.

kokotajlod

I was thinking the reason we care about the multipolar vs. unipolar distinction is that we are worried about conflict/cooperation-failure/etc. and trying to understand what kinds of scenarios might lead to it. So, I'm thinking we can define the distinction in terms of whether conflict/etc. is a significant possibility.

I agree that if we define it your way, multipolar takeoff is more likely than not.

Brian_Tomasik

Ok, cool. :) And as I noted, even if we define it my way, there's ambiguity regarding whether a collection of agents should count as one entity or many. We'd be more inclined to say that there are many entities in cases where conflict between them is a significant possibility, which gets us back to your definition.

Brian_Tomasik

especially since you could just spin up some more instances of the AGI that have different values/roles such as monitoring the main AGIs?

I guess one reply would be that if we don't know how to align AGIs at all, then these monitoring AGIs wouldn't be aligned to humans either. That might be an issue, though it's worth noting that human power structures sometimes work despite this problem. For example, maybe everyone who works for a dictator hates the dictator and wishes he were overthrown, but no one wants to be the first to defect because then others may report the defector to the dictator to save their own skins. Likewise, if you have multiple AGIs with different values, it may be risky for them to try to conspire against humans. But maybe this reasoning is way too anthropomorphic, or maybe AGIs would have techniques for coordinating insurrections that humans don't.

Also, a scenario involving multiple AGIs with different values sounds scarier from an s-risk perspective than FOOM by a single AGI, so I don't encourage this approach. I just figure it's something people might do. The SolarWinds hack was pretty successful at spreading widely, but it was ultimately caught by monitoring software (and humans) at FireEye.

Greg_Colbourn ⏸️

I think the issue is more along the lines of the superhuman-but-not-exceedingly-superhuman AGI quickly becoming an exceedingly-superhuman AGI (i.e. a Superintelligence) via recursive self-improvement (imagine a genius being able to think 10 times faster, then use that time to make itself think 1000 times faster, etc). And AGIs should tend toward consequentialism via convergent instrumental goals (e.g.).

Or are you saying that you expect the superhuman France/China/Facebook AGI to remain boxed?

Brian_Tomasik

I guess it depends how superhuman we're imagining the AGI to be, but if it was merely as intelligent as like 100 human AGI experts, it wouldn't necessarily speed up AGI progress enormously? Plus, it would need lots of compute to run, presumably on specialized hardware, so I'm not sure it could expand itself that far without being allowed to? Perhaps its best strategy would be to play nice for the time being so that humans would voluntarily give it more compute and control over the world.

And AGIs should tend toward consequentialism via convergent instrumental goals

Hm, if an agent is consequentialist, then it will have convergent instrumental subgoals. But what if the agent isn't consequentialist to begin with? For example, if we imagine that GPT-7 is human-level AGI, this AGI might have human-type common sense. If you asked it to get you coffee, it might try to do so in a somewhat common-sense way, without scheming about how to take over the world in the process, because humans usually don't scheme about taking over the world or preserving their utility functions at all costs? But I don't know if that's right; I wonder what AI-safety experts think. Also, GPT-type AIs still seem very tricky to control, but for now that's because their behavior is weird and unpredictable rather than because they're scheming consequentialists.

D0TheMath

Perhaps its best strategy would be to play nice for the time being so that humans would voluntarily give it more compute and control over the world.

This is essentially the thesis of the Deceptive Alignment section of Hubinger et al's Risks from Learned Optimization paper, and related work on inner alignment.

Hm, if an agent is consequentialist, then it will have convergent instrumental subgoals. But what if the agent isn't consequentialist to begin with? For example, if we imagine that GPT-7 is human-level AGI, this AGI might have human-type common sense. If you asked it to get you coffee, it might try to do so in a somewhat common-sense way, without scheming about how to take over the world in the process, because humans usually don't scheme about taking over the world or preserving their utility functions at all costs? But I don't know if that's right; I wonder what AI-safety experts think.

You may be interested to read more about myopic training: https://www.alignmentforum.org/posts/GqxuDtZvfgL2bEQ5v/arguments-against-myopic-training

kokotajlod

Hi Brian! One of the projects I'm thinking of doing is basically a series of posts explaining why I think EY is right on this issue (mildly superhuman AGI would be consequentialist in the relevant sense, could take over the world, go FOOM, etc.) Do you think this would be valuable? Would it change your priorities and decisions much if you changed your mind on this issue?

Brian_Tomasik

Cool. :) This topic isn't my specialty, so I wouldn't want you to take the time just for me, but I imagine many people might find those arguments interesting. I'd be most likely to change my mind on the consequentialist issue because I currently don't know much about that topic (other than in the case of reinforcement-learning agents, where it seems more clear how they're consequentialist).

Regarding FOOM, given how much progress DeepMind/OpenAI/etc have been making in recent years with a relatively small number of researchers (although relying on background research and computing infrastructure provided by a much larger set of people), it makes sense to me that once AGIs are able to start contributing to AGI research, things could accelerate, especially if there's enough hardware to copy the AGIs many times over. I think the main thing I would add is that by that point, I expect it to be pretty obvious to natsec people (and maybe the general public) that shit is about to hit the fan, so that other countries/entities won't sit idly by and let one group go FOOM unopposed. Other countries could make military, even nuclear, threats if need be.

In general, I expect the future to be a bumpy ride, and AGI alignment looks very challenging, but I also feel like a nontrivial fraction of the world's elite brainpower will be focused on these issues as things get more and more serious, which may reduce our expectations of how much any given person can contribute to changing how the future unfolds.

Greg_Colbourn ⏸️

Here is an argument for how GPT-X might lead to proto-AGI in a more concrete, human-aided, way:

..language modelling has one crucial difference from Chess or Go or image classification. Natural language essentially encodes information about the world—the entire world, not just the world of the Goban, in a much more expressive way than any other modality ever could.[1] By harnessing the world model embedded in the language model, it may be possible to build a proto-AGI.
...

This is more a thought experiment than something that’s actually going to happen tomorrow; GPT-3 today just isn’t good enough at world modelling. Also, this method depends heavily on at least one major assumption—that bigger future models will have much better world modelling capabilities—and a bunch of other smaller implicit assumptions. However, this might be the closest thing we ever get to a chance to sound the fire alarm for AGI: there’s now a concrete path to proto-AGI that has a non-negligible chance of working.

Crossposted from here

Brian_Tomasik

Thanks. :) Is that just a general note along the lines of what I was saying, or does it explain how a GPT-X AGI would become consequentialist?

Greg_Colbourn ⏸️

It explains how a GPT-X could become an AGI (via world modelling). I think then things like the basic drives would take over. However, maybe it's not the end result model that we should be looking at as dangerous, but rather the training process? An ML-based (proto-)AGI could do all sorts of dangerous (consequentialist, basic-AI-drives-y) things whilst trying to optimise for performance in training.

Brian_Tomasik

I think then things like the basic drives would take over.

Where would those basic drives come from (apart from during training)? An already trained GPT-3 model tries to output text that looks human-like, so we might imagine that a GPT-X AGI would also try to behave in ways that look human-like, and most humans aren't very consequentialist. Humans do try to preserve themselves against harm or death, but not in an "I need to take over the world to ensure I'm not killed" kind of way.

If your concern is about optimization during training, that makes sense, though I'm confused as to whether it's dangerous if the AI only updates its weights via a human-specified gradient-descent process, and the AI's "personality" doesn't care about how accurate its output is.

Greg_Colbourn ⏸️

Yes, the concern is optimisation during training. My intuition is along the lines of "sufficiently large pile of linear algebra with reward function -> basic AI drives maximise reward -> reverse engineers [human behaviour / protein folding / etc] and manipulates the world so as to maximise its reward -> [foom / doom]".

I wouldn't say "personality" comes into it. In the above scenario the giant pile of linear algebra is completely unconscious and lacks self-awareness; it's more akin to a force of nature, a blind optimisation process.

Brian_Tomasik

Thanks. :) Regarding the AGI's "personality", what I meant was what the AGI itself wants to do, if we imagine it to be like a person, rather than what the training that produced it was optimizing for. If we think of gradient descent to train the AGI as like evolution and the AGI at some step of training as like a particular human in humanity's evolution, then while evolution itself is optimizing something, the individual human is just an adaptation executor and doesn't directly care about his inclusive fitness. He just responds to his environment as he was programmed to do. Likewise, the GPT-X agent may not really care about trying to reduce training errors by modifying its network weights; it just responds to its inputs in human-ish ways.

Linch

This does not match my mental impression of Eliezer Yudkowsky, poor as it is. Nor does falsely saying "we're all doomed" match my mental model of the most productive form of getting people to work on a problem; if anything you'd want to downplay the risks.

RobBensinger

The Nate Soares post mentioned in the OP (replying to Joe Carlsmith's 'power-seeking AI' report) is now up, along with responses from Joe: https://www.lesswrong.com/posts/cCMihiwtZx7kdcKgt/comments-on-carlsmith-s-is-power-seeking-ai-an-existential

Greg_Colbourn ⏸️

Has anyone tried GPT3-ing this to see if it comes up with any interesting ideas?

Greg_Colbourn ⏸️

DeepMind would have lots of penalty-free affordance internally for people to not publish things, and to work in internal partitions that didn't spread their ideas to all the rest of DeepMind.

Companies like Apple and Dyson operate like this (keeping their IP tightly under wraps right up until products are launched). Maybe they could be useful recruiting grounds?

Greg_Colbourn ⏸️

Steve Omohundro
...Google and others are using Mixture-of-Experts to avoid some of that cost: https://arxiv.org/abs/1701.06538
Matrix multiply is a pretty inefficient primitive and alternatives are being explored: https://arxiv.org/abs/2106.10860

These stand out for me as causes for alarm. Anything that makes ML significantly more efficient as an AI paradigm seems like it shortens timelines. Can anyone say why they aren't cause for alarm? (See also)
