tl;dr: I know a bunch of EA/rationality-adjacent people who argue — sometimes jokingly and sometimes seriously — that the only way or best way to reduce existential risk is to enable an “aligned” AGI development team to forcibly (even if nonviolently) shut down all other AGI projects, using safe AGI.  I find that the arguments for this conclusion are flawed, and that the conclusion itself causes harm to institutions who espouse it.   Fortunately (according to me), successful AI labs do not seem to espouse this "pivotal act" philosophy.

[This post is also available on LessWrong.]

How to read this post

Please read Part 1 first if you’re very impact-oriented and want to think about the consequences of various institutional policies more than the arguments that lead to the policies; then Parts 2 and 3.

Please read Part 2 first if you mostly want to evaluate policies based on the arguments behind them; then Parts 1 and 3.

I think all parts of this post are worth reading, but depending on who you are, I think you could be quite put off if you read the wrong part first and start feeling like I’m basing my argument too much on kinds-of-thinking that policy arguments should not be based on.

Part 1: Negative Consequences of Pivotal Act Intentions

Imagine it’s 2022 (it is!), and your plan for reducing existential risk is to build or maintain an institution that aims to find a way for you — or someone else you’ll later identify and ally with — to use AGI to forcibly shut down all other AGI projects in the world.  By “forcibly” I mean methods that violate or threaten to violate private property or public communication norms, such as by using an AGI to engage in…

  • cyber sabotage: hacking into competitors’ computer systems and destroying their data;
  • physical sabotage: deploying tiny robotic systems that locate and destroy AI-critical hardware without (directly) harming any humans;
  • social sabotage: auto-generating mass media campaigns to shut down competitor companies by legal means; or
  • threats: demonstrating powerful cyber or physical or social threats, and bargaining with competitors to shut down “or else”.

Hiring people for your pivotal act project is going to be tricky.  You’re going to need people who are willing to take on, or at least tolerate, a highly adversarial stance toward the rest of the world.  I think this is very likely to have a number of bad consequences for your plan to do good, including the following:

  1. (bad external relations)  People on your team will have a low-trust and/or adversarial stance toward neighboring institutions and collaborators, and will have a hard time forming good-faith collaborations.  This will alienate other institutions and make them not want to work with you or be supportive of you.
  2. (bad internal relations)  As your team grows, not everyone will know each other very well.  The “us against the world” attitude will be hard to maintain, because there will be an ever weakening sense of “us”, especially as people quit and move to other institutions and conversely.  Sometimes, new hires will express opinions that differ from the dominant institutional narrative, which might pattern-match as “outsidery” or “norm-y” or “too caught up in external politics”, triggering feelings of internal distrust within the team that some people might defect on the plan to forcibly shut down other projects.  This will cause your team to get along poorly internally, and make it hard to manage people.
  3. (risky behavior) In the fortunate-according-to-you event that your team manages to someday wield a powerful technology, there will be a sense of pressure to use it to “finally make a difference”, or some other argument that boils down to acting quickly before competitors have a chance to shut you down or at least defend themselves.  This will make it hard to stop your team from doing rash things that would actually increase existential risk.

Overall, building an AGI development team with the intention to carry out a “pivotal act” of the form “forcibly shut down all other A(G)I projects” is probably going to be a rough time, I predict.

Does this mean no institution in the world can have the job of preparing to shut down runaway technologies?  No; see “Part 3: it matters who does things”.

Part 2: Fallacies in Justifying Pivotal Acts

For pivotal acts of the form “shut down all (other) AGI projects”, there’s an argument that I’ve heard repeatedly from dozens of people, which I claim has easy-to-see flaws if you slow down and visualize the world that the argument is describing.

This is not an argument that successful AI research groups (e.g., OpenAI, DeepMind, Anthropic) seem to espouse.  Nonetheless, I hear the argument frequently enough to want to break it down and refute it.

Here is the argument:

  1. AGI is a dangerous technology that could cause human extinction if not super-carefully aligned with human values.

    (My take: I agree with this point.)
     
  2. If the first group to develop AGI manages to develop safe AGI, but the group allows other AGI projects elsewhere in the world to keep running, then one of those other projects will likely eventually develop unsafe AGI that causes human extinction.

    (My take: I also agree with this point, except that I would bid to replace “the group allows” with “the world allows”, for reasons that will hopefully become clear in Part 3: It Matters Who Does Things.)
     
  3. Therefore, the first group to develop AGI, assuming they manage to align it well enough with their own values that they believe they can safely issue instructions to it, should use their AGI to build offensive capabilities for targeting and destroying the hardware resources of other AGI development groups, e.g., nanotechnology targeting GPUs, drones carrying tiny EMP charges, or similar.

    (My take: I do not agree with this conclusion, I do not agree that (1) and (2) imply it, and I feel relieved that every successful AI research group I talk to is also not convinced by this argument.)

The short reason why (1) and (2) do not imply (3) is that when you have AGI, you don’t have to use the AGI directly to shut down other projects.  

In fact, before you get to AGI, your company will probably develop other surprising capabilities, and you can demonstrate those capabilities to neutral-but-influential outsiders who previously did not believe those capabilities were possible or concerning.  In other words, outsiders can start to help you implement helpful regulatory ideas, rather than you planning to do it all on your own by force at the last minute using a super-powerful AI system.  

To be clear, I’m not arguing for leaving regulatory efforts entirely in the hands of governments with no help or advice or infrastructural contributions from the tech sector.  I’m just saying that there are many viable options for regulating AI technology without requiring one company or lab to do all the work or even make all the judgment calls.

Q: Surely they must be joking or this must be straw-manning... right?

A: I realize that lots of EA/R folks are thinking about AI regulation in a very nuanced and politically measured way, which is great.  And, I don't think the argument (1-3) above represents a majority opinion among the EA/R communities.  Still, some people mean it, and more people joke about it in an ambiguous way that doesn't obviously distinguish them from meaning it:

  • (ambiguous joking) I've numerous times met people at EA/R events who were saying extreme-sounding things like "[AI lab] should just melt all the chip fabs as soon as they get AGI", who when pressed about the extremeness of this idea will respond with something like "Of course I don't actually mean I want [some AI lab] to melt all the chip fabs".  Presumably, some of those people were actually just using hyperbole to make conversations more interesting or exciting or funny.  

    Part of my motivation in writing this post is to help cut down on the amount of ambiguous joking about such proposals.  As the development of more and more advanced AI technologies is becoming a reality, ambiguous joking about such plans has the potential to really freak people out if they don't realize you're exaggerating.
     
  • (meaning it) I have met at least a dozen people who were not joking when advocating for invasive pivotal acts along the lines of the argument (1-3) above.  That is to say, when pressed after saying something like (1-3), their response wasn't "Geez, I was joking", but rather, "Of course AGI labs should shut down other AGI labs; it's the only morally right thing for them to do, given that AGI labs are bad.  And of course they should do it by force, because otherwise it won't get done."

    In most cases, folks with these viewpoints seemed not to have thought about the cultural consequences of AGI research labs harboring such intentions over a period of years (Part 1), or the fallacy of assuming technologists will have to do everything themselves (Part 2), or the future possibility of making clearer evidence available to support regulatory efforts from a broader base of consensual actors (see Part 3).

    So, part of my motivation in writing this post is as a genuine critique of a genuinely expressed position.

Part 3: It Matters Who Does Things

I think it’s important to separate the following two ideas:

  • Idea A (for “Alright”): Humanity should develop hardware-destroying capabilities — e.g., broadly and rapidly deployable non-nuclear EMPs — to be used in emergencies to shut down potentially-out-of-control AGI situations, such as an AGI that has leaked onto the internet, or an irresponsible nation developing AGI unsafely.
  • Idea B (for “Bad”): AGI development teams should be the ones planning to build the hardware-destroying capabilities in Idea A.

For what it’s worth, I agree with Idea A, but disagree with Idea B:

Why I agree with Idea A

It’s indeed much nicer to shut down runaway AI technologies (if they happen) using hardware-specific interventions than using attacks with big splash effects, like explosives or brainwashing campaigns.  I think this is the main reason well-intentioned people end up arriving at this idea, and at Idea B, but I think Idea B has some serious problems.

Why I disagree with Idea B

A few reasons!  First, there’s:

  • Action Consequence 1: the action of having an AGI carry out or even prescribe such a large intervention on the world — invading others’ private property to destroy their hardware — is risky and legitimately scary.  Invasive behavior is risky and threatening enough as it is; using AGI to do it introduces a whole range of other uncertainties, not least because the AGI could be deceptive or otherwise misaligned with humanity in ways that we don’t understand.

Second, before even reaching the point of taking the action prescribed in Idea B, merely harboring the intention of Idea B has bad consequences, echoing concerns similar to those in Part 1:

  • Intention Consequence 1: Racing.  Harboring Idea B creates an adversarial winner-takes-all relationship with other AGI companies racing to maintain
    • a degree of control over the future, and
    • the ability to implement their own pet theories on how safety/alignment should work.
    This leads to more desperation, more risk-taking, and less safety overall.
  • Intention Consequence 2: Fear.  Via staff turnover and other channels, harboring Idea B signals to other AGI companies that you are willing to violate their property boundaries to achieve your goals, which will cause them to fear for their physical safety (e.g., because your incursion to invade their hardware might go awry and end up harming them personally as well).  This kind of fear leads to more desperation, more winner-takes-all mentality, more risk-taking, and less safety.

Summary

In Part 1, I argued that there are negative consequences to AGI companies harboring the intention to forcibly shut down other AGI companies.  In Part 2, I analyzed a common argument in favor of that kind of “pivotal act”, and found a pretty simple flaw stemming from fallaciously assuming that the AGI company has to do everything itself (rather than enlisting help from neutral outsiders, using evidence).  In Part 3, I elaborated more on the nuance regarding who (if anyone) should be responsible for developing hardware-shutdown technologies to protect humanity from runaway AI disasters, and why in particular AGI companies should not be the ones planning to do this, mostly echoing points from Part 1.

Fortunately, successful AI labs like DeepMind, OpenAI, and Anthropic do not seem to espouse this “pivotal act” philosophy for doing good in the world.  One of my hopes in writing this post is to help more EA/R folks understand why I agree with their position.


 


Comments (10)

FWIW, the way I now think about these scenarios is that there's a tradeoff between technical ability and political ability:

 - If you have infinite technical ability (one person can create an aligned Jupiter Brain in their basement), then you don't need any political ability and can do whatever you want.

 - If you have infinite political ability (Xi Jinping cures aging, leads the CCP to take over the world, and becomes God-Emperor of Man), you don't need any technical ability and can just do whatever you want.

I don't think either of those is plausible. A realistic strategy will need both, in varying proportion, and having less of one will demand more of the other. Some closely related ideas are:

 - The weaker and less general an AI is, the safer it is to align and test. Potentially dangerous AIs should be as weak as possible while still doing the job, in the same way that Android apps should have as few permissions as are reasonably practical. A technique that reduces an AI's abilities in some important way, while still fulfilling the main goal, is a net win. (Eg. scrubbing computer code from the training set of something like GPT-3.)

 -  Likewise, everything you might try for alignment will almost certainly fail if you turn up the AI power level *enough*, just as any system can be hacked into if you try infinitely hard. No alignment advance will "solve the problem", but it may make a somewhat-more-powerful AI safer to run. (Eg. I doubt better interpretability would do much to help with a Jupiter Brain, but would help you understand smaller AIs.)

 - An unexpected shock (eg. COVID-19, or the release of GPT-3) won't make existing political actors smarter, but may make them change their priorities. (Eg. if, when COVID-19 happened, you had already met everyone at the FDA, had vaccine factories and supply chains built and emergency trial designs ready in advance, it would have been a lot easier to get rapid approval. Likewise, many random SWEs that I tell about PaLM or DALL-E now instinctively see it as dangerous and start thinking about safety; they don't have a plan but now see one as important.)

(Plan to write more about this in the future, this is just a quick conceptual sketch.)

the way I now think about these scenarios is that there's a tradeoff between technical ability and political ability

I also like this, and appreciate you pointing out a tradeoff where the discourse was presenting an either-or decision. I'd actually considered a follow-up post on the Pareto boundary between unilaterally maximizing (altruistic) utility and multilaterally preserving coordination boundaries and consent norms. Relating your ontology to mine, I'd say that in the AGI arena, technical ability contributes more to the former (unilaterally maximizing...) than the latter (multilaterally preserving...), and political ability contributes more to the latter than the former.

I really like your framing about the trade-off between some plans requiring more technical ability and some requiring more political ability.

Likewise, many random SWEs that I tell about PaLM or DALL-E now instinctively see it as dangerous and start thinking about safety

What is it about these that you think convinces them? Are these people that you tried to convince before?

On point 3: I think essentially everyone talking about pivotal acts is envisioning a world like the present one, where they are concerned about x-risk from AGI but most other organizations aren't. What good does it do someone in that position to say "well, this is the UN's job" if they have no way to convince the UN to care?

I think people who expect e.g. the UN or "humanity" to be persuadable to do something are not generally in favor of unilateral pivotal acts as a concept for exactly the reasons you highlight, but IMO this is the biggest crux.

"Humanity should do X" is not a plan.

EDIT: You address this in point 2, but I think people aren't engaging in a fallacy, they're just much more pessimistic about the prospects of gathering evidence and convincing global institutions to take action.

I hasten to add: I agree with you about the downsides of advocating for aggressive unilateral actions.

Also, if people are actively advocating for arbitrary AI labs (that they have little influence over) taking unilateral actions like this, that seems like potentially the worst of both worlds: harmful to even talk about, without being especially actionable.

I might be reading too much into this, but the word "legitimate" in "legitimate global regulatory efforts" feels weird here. Like... the idea that "if you, a private AI lab, try to unilaterally stop everyone else from building AI, they will notice and get mad at you" is really important. But the word "legitimate" brings to mind a sort of global institutional-management-nomenklatura class using the word as a status club to go after anything it doesn't like. If eg. you developed a COVID test during 2020, one might say "this test doesn't work" or "this test has bad side effects" or "the FDA hasn't approved it, they won't let you sell this test" or "my boss won't let me build this test at our company"; but saying "this test isn't legitimate" feels like a conceptual smudge that tries to blend all those claims together, as if each implied all of the others. 

saying "this test isn't legitimate" feels like a conceptual smudge that tries to blend all those claims together, as if each implied all of the others.

This is my favorite piece of feedback on this post so far, and I agree with it; thanks!

To clarify what I meant, I've changed the text to read "making evidence available to support global regulatory efforts from a broader base of consensual actors (see Part 3)."

Thanks for writing this! I'm excited to see more AI strategy discussions being published.

If the first group to develop AGI manages to develop safe AGI, but the group allows other AGI projects elsewhere in the world to keep running, then one of those other projects will likely eventually develop unsafe AGI that causes human extinction.

(My take: I also agree with this point, except that I would bid to replace “the group allows” with “the world allows”, for reasons that will hopefully become clear in Part 3: It Matters Who Does Things.)

I don't yet see why you agree with this. If the first general AI were safe and other projects continued, couldn't people still be safe, through the leading project improving human resilience?

  • It seems like there are several ways by which safe AI could contribute to human resilience:
    • Direct defensive interventions, e.g., deploying vaccines in response to viruses
    • Deterrence, e.g., "if some AI actively tries to harm many people, I shut it down" (arguably much less norm-breaking than shutting it down proactively)
    • Coordination, i.e., helping create mechanisms for facilitating trade/compromise and reducing conflict
  • This leading safe AI project could also be especially well-placed to do the above, because market and R&D advantages may help it peacefully grow its influence (at a faster rate than others)

In other words, why assume that (a) AI offense-defense balance is so stacked in favor of offense, and (b) deterrence and coordination wouldn't pacify a situation with high offensive capabilities on multiple sides?

(Edited for formatting/clarity)

Non-original idea: What about a misaligned AI threatening to torture people? An aligned AGI could exist, and then a misaligned AGI could be created. The second AGI threatens to torture or kill lots of people if not given more power. Presumably, it could get in a position where it is able to do this without triggering the Deterrence mode of the aligned AGI, unless there is really good interpretability and surveillance. The first AGI, being a utility maximizer and suffering minimizer, cedes control of the future to the second AGI because it's better than the tons of suffering (e.g., even human extinction or paperclipping may be better than billions suffering from non-fatal rabies, or other unimaginable suffering). This failure mode, maybe call it hostage-based-takeover (HBT) if it doesn't have a better name, is still possible even given the scenario you lay out. That is, HBT is strongly in favor of offense given many values an aligned AGI could have and imperfect surveillance/deterrence. Variants of this idea have been discussed here in terms of an AI threatening to torture simulations of you if you don't let it out. The simulation part and the "you" part don't seem important for this argument to go through, because many people would back down in the face of a realistic threat to torture everybody. 

More original idea: It seems to me that novel technologies often favor offense because, at the core, successful offense requires exploiting one unpatched vulnerability, whereas successful defense requires finding and patching all the vulnerabilities that could plausibly be found by others. The FBI has to stop all the terrorists to be successful, but from the perspective of the terrorists, even one successful attack is a win. We could have 100 years of avoiding nuclear war, but it only takes one mess up to be really bad.

I think that offense is favored by default in a lot of cases. And when the stakes are incredibly high, like extinction, the bar for safe deterrence is incredibly high. It feels unlikely to me that we could reach a sufficiently high bar of safety without some pivotal act like:

  1. Global monitoring of compute usage (I need to reply to your Slack message)
  2. Invasive monitoring of AI labs
  3. Destruction of other GPUs etc.
  4. Something else.

Thinking through these needs, it seems like having the UN implement measures like this would be best, but as others have mentioned, this seems unlikely in the current environment.

Thanks for this!

I'm not sure I get why extortion could give misaligned agents a (big) asymmetric advantage over aligned agents. Here are some things that might each prevent extortion-based takeovers:

  • Reasons why successful extortion might not happen:
    • Deterrence might prevent extortion attempts--blackmailing someone is less appealing if they've committed to severe retaliation (cf. Liam Neeson).
    • Plausibly there'll be good enough interpretability or surveillance (especially since we're conditioning on there being some safe AI--those are disproportionately worlds in which there's good interpretability).
    • Arguably, sufficiently smart and capable agents don't give in to blackmail, especially if they've had time to make commitments. If this applies, the safe AI would be less likely to be blackmailed in the first place, and it would not cede anything if it is blackmailed.
    • Plausibly, the aligned AI would be aligned to values that would not accept such a scope-insensitive trade, even if they were willing to give in to threats.
  • Other reasons why extortion might not create asymmetric advantages:
    • Plausibly, the aligned AI will be aligned to values that would also be fine with doing extortion.

many people would back down in the face of a realistic threat to torture everybody.

Maybe a limitation of this analogy is that it assumes away most of the above anti-extortion mechanisms. (Also, if the human blackmail scenario assumes that many humans can each unilaterally cede control, that also makes it easier for extortion to succeed than if power is more centralized.)

On the other point - seems right, I agree offense is often favored by default. Still:

  • Deterrence and coordination can happen even (especially?) when offense is favored.
  • Since the aligned AI may start off with and then grow a lead, some degree of offense being favored may not be enough for things to go wrong; the defense is (in this hypothetical) (much) stronger than the offense, so things may have to be tilted especially heavily toward offense for offense to win.
    • (Actually, I'm realizing a major limitation of my argument is that it doesn't consider how the time/investment costs of safety may mean that--even if the first general AI project is safe--it's then outpaced by other, less cautious projects. More generally, it seems like the leading project's (relative) growth rate will depend on how much it's accelerated by its lead, how much it's directly slowed down by its caution, how much it's accelerated by actors who want cautious projects to lead, and other factors, and it's unclear whether this would result in overall faster or slower growth than other projects.)

(I also agree that a high bar for safety in high-stakes scenarios is generally worthwhile; I mainly just mean to disagree with the position that extinction is very likely in these scenarios.)

"I’m just saying that there are many viable options for regulating AI technology". I hope so too and I would be interested in seeing a post where you flesh out your thinking on that. 

I think even some (most?) of the proponents of the 'pivotal act' idea would agree that it is a strategy with a low probability of success - they just think that the regulatory strategy is even less likely to work. I suspect the opposition to a regulatory strategy is in part due to the libertarian background of some of the founders of the AI safety field. But maybe governmental forces are not quite as incompetent or slow as they are assumed to be.