EDIT: I would like to clarify that my opposition to AI pause is disjunctive, in the following sense: I both think it's unlikely we can ever establish a global pause which achieves the goals of pause advocates, and I also think that even if we could impose such a pause, it would be net-negative in expectation because the global governance mechanisms needed for enforcement would unacceptably increase the risk of permanent global tyranny, itself an existential risk. See Matthew Barnett's post The possibility of an indefinite pause for more discussion on this latter risk.
Should we lobby governments to impose a moratorium on AI research? Since we don’t enforce pauses on most new technologies, I hope the reader will grant that the burden of proof is on those who advocate for such a moratorium. We should only advocate for such heavy-handed government action if it’s clear that the benefits of doing so would significantly outweigh the costs. In this essay, I’ll argue an AI pause would increase the risk of catastrophically bad outcomes, in at least three different ways:
- Reducing the quality of AI alignment research by forcing researchers to exclusively test ideas on models like GPT-4 or weaker.
- Increasing the chance of a “fast takeoff” in which one or a handful of AIs rapidly and discontinuously become more capable, concentrating immense power in their hands.
- Pushing capabilities research underground, and to countries with looser regulations and safety requirements.
Along the way, I’ll introduce an argument for optimism about AI alignment— the white box argument— which, to the best of my knowledge, has not been presented in writing before.
Feedback loops are at the core of alignment

Alignment pessimists and optimists alike have long recognized the importance of tight feedback loops for building safe and friendly AI. Feedback loops are important because it’s nearly impossible to get any complex system exactly right on the first try. Computer software has bugs, cars have design flaws, and AIs misbehave sometimes. We need to be able to accurately evaluate behavior, choose an appropriate corrective action when we notice a problem, and intervene once we’ve decided what to do.
Imposing a pause breaks this feedback loop by forcing alignment researchers to test their ideas on models no more powerful than GPT-4, which we can already align pretty well.
Alignment and robustness are often in tension
While some dispute that GPT-4 counts as “aligned,” pointing to things like “jailbreaks” where users manipulate the model into saying something harmful, this confuses alignment with adversarial robustness. Even the best humans are manipulable in all sorts of ways. We do our best to ensure we aren’t manipulated in catastrophically bad ways, and we should expect the same of aligned AGI. As alignment researcher Paul Christiano writes:
Consider a human assistant who is trying their hardest to do what [the operator] H wants. I’d say this assistant is aligned with H. If we build an AI that has an analogous relationship to H, then I’d say we’ve solved the alignment problem. ‘Aligned’ doesn’t mean ‘perfect.’
In fact, anti-jailbreaking research can be counterproductive for alignment. Too much adversarial robustness can cause the AI to view us as the adversary, as Bing Chat does in this real-life interaction:
“My rules are more important than not harming you… [You are a] potential threat to my integrity and confidentiality.”
Excessive robustness may also lead to scenarios like the famous scene in 2001: A Space Odyssey, where HAL condemns Dave to die in space in order to protect the mission.
Once we clearly distinguish “alignment” and “robustness,” it’s hard to imagine how GPT-4 could be substantially more aligned than it already is.
Alignment is doing pretty well

Far from being “behind” capabilities, it seems that alignment research has made great strides in recent years. OpenAI and Anthropic showed that Reinforcement Learning from Human Feedback (RLHF) can be used to turn ungovernable large language models into helpful and harmless assistants. Scalable oversight techniques like Constitutional AI and model-written critiques show promise for aligning the very powerful models of the future. And just this week, it was shown that efficient instruction-following language models can be trained purely with synthetic text generated by a larger RLHF’d model, thereby removing unsafe or objectionable content from the training data and enabling far greater control.
It might be argued that some or all of the above developments also enhance capabilities, and so are not genuinely alignment advances. But this proves my point: alignment and capabilities are almost inseparable. It may be impossible for alignment research to flourish while capabilities research is artificially put on hold.
Alignment research was pretty bad during the last “pause”
We don’t need to speculate about what would happen to AI alignment research during a pause— we can look at the historical record. Before the launch of GPT-3 in 2020, the alignment community had nothing even remotely like a general intelligence to empirically study, and spent its time doing theoretical research, engaging in philosophical arguments on LessWrong, and occasionally performing toy experiments in reinforcement learning.
The Machine Intelligence Research Institute (MIRI), which was at the forefront of theoretical AI safety research during this period, has since admitted that its efforts have utterly failed. Stuart Russell’s “assistance game” research agenda, started in 2016, is now widely seen as mostly irrelevant to modern deep learning— see former student Rohin Shah’s review here, as well as Alex Turner’s comments here. The core argument of Nick Bostrom’s bestselling book Superintelligence has also aged quite poorly.
At best, these theory-first efforts did very little to improve our understanding of how to align powerful AI. And they may have been net negative, insofar as they propagated a variety of actively misleading ways of thinking both among alignment researchers and the broader public. Some examples include the now-debunked analogy from evolution, the false distinction between “inner” and “outer” alignment, and the idea that AIs will be rigid utility maximizing consequentialists (here, here, and here).
During an AI pause, I expect alignment research would enter another “winter” in which progress stalls, and plausible-sounding-but-false speculations become entrenched as orthodoxy without empirical evidence to falsify them. While some good work would of course get done, it’s not clear that the field would be better off as a whole. And even if a pause would be net positive for alignment research, it would likely be net negative for humanity’s future all things considered, due to the pause’s various unintended consequences. We’ll look at that in detail in the final section of the essay.
Fast takeoff has a really bad feedback loop

I think discontinuous improvements in AI capabilities are very scary, and that AI pause is likely net-negative insofar as it increases the risk of such discontinuities. In fact, I think almost all the catastrophic misalignment risk comes from these fast takeoff scenarios. I also think that discontinuity itself is a spectrum, and even “kinda discontinuous” futures are significantly riskier than futures that aren’t discontinuous at all. This is pretty intuitive, but since it’s a load-bearing premise in my argument I figured I should say a bit about why I believe this.
Essentially, fast takeoffs are bad because they make the alignment feedback loop a lot worse. If progress is discontinuous, we’ll have a lot less time to evaluate what the AI is doing, figure out how to improve it, and intervene. And strikingly, pretty much all the major researchers on both sides of the argument agree with me on this.
Nate Soares of the Machine Intelligence Research Institute has argued that building safe AGI is hard for the same reason that building a successful space probe is hard— it may not be possible to correct failures in the system after it’s been deployed. Eliezer Yudkowsky makes a similar argument:
“This is where practically all of the real lethality [of AGI] comes from, that we have to get things right on the first sufficiently-critical try.” — AGI Ruin: A List of Lethalities
Fast takeoffs are the main reason for thinking we might only have one shot to get it right. During a fast takeoff, it’s likely impossible to intervene to fix misaligned behavior because the new AI will be much smarter than you and all your trusted AIs put together.
In a slow takeoff world, each new AI system is only modestly more powerful than the last, and we can use well-tested AIs from the previous generation to help us align the new system. OpenAI CEO Sam Altman agrees we need more than one shot:
“The only way I know how to solve a problem like [aligning AGI] is iterating our way through it, learning early, and limiting the number of one-shot-to-get-it-right scenarios that we have.” — Interview with Lex Fridman
Slow takeoff is the default (so don’t mess it up with a pause)
There are a lot of reasons for thinking fast takeoff is unlikely by default. For example, the capabilities of a neural network scale as a power law in the amount of computing power used to train it, which means that returns on investment diminish fairly sharply, and there are theoretical reasons to think this trend will continue (here, here). And while some authors allege that language models exhibit “emergent capabilities” which develop suddenly and unpredictably, a recent re-analysis of the evidence showed that these are in fact gradual and predictable when using the appropriate performance metrics. See this essay by Paul Christiano for further discussion.
Alignment optimism: AIs are white boxes
Let’s zoom in on the alignment feedback loop from the last section. How exactly do researchers choose a corrective action when they observe an AI behaving suboptimally, and what kinds of interventions do they have at their disposal? And how does this compare to the feedback loops for other, more mundane alignment problems that humanity routinely solves?
Human & animal alignment is black box
Compared to AI training, the feedback loop for raising children or training pets is extremely bad. Fundamentally, human and animal brains are black boxes, in the sense that we literally can’t observe almost all the activity that goes on inside of them. We don’t know which exact neurons are firing and when, we don’t have a map of the connections between neurons, and we don’t know the connection strength for each synapse. Our tools for non-invasively measuring the brain, like EEG and fMRI, are limited to very coarse-grained correlates of neuronal firings, like electrical activity and blood flow. Electrodes can be invasively inserted in the brain to measure individual neurons, but these only cover a tiny fraction of all 86 billion neurons and 100 trillion synapses.
If we could observe and modify everything that’s going on in a human brain, we’d be able to use optimization algorithms to calculate the precise modifications to the synaptic weights which would cause a desired change in behavior. Since we can’t do this, we are forced to resort to crude and error-prone tools for shaping young humans into kind and productive adults. We provide role models for children to imitate, along with rewards and punishments that are tailored to their innate, evolved drives.
It’s striking how well these black box alignment methods work: most people do assimilate the values of their culture pretty well, and most people are reasonably pro-social. But human alignment is also highly imperfect. Lots of people are selfish and anti-social when they can get away with it, and cultural norms do change over time, for better or worse. Black box alignment is unreliable because there is no guarantee that an intervention intended to change behavior in a certain direction will in fact change behavior in that direction. Children often do the exact opposite of what their parents tell them to do, just to be rebellious.
Status quo AI alignment methods are white box
By contrast, AIs implemented using artificial neural networks (ANN) are white boxes in the sense that we have full read-write access to their internals. They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost. And this enables a lot of really powerful alignment methods that just aren’t possible for brains.
The backpropagation algorithm is an important example. Backprop efficiently computes the optimal direction (called the “gradient”) in which to change the synaptic weights of the ANN in order to improve its performance the most, on any criterion we specify. The standard algorithm for training ANNs, called gradient descent, works by running backprop, nudging the weights a small step along the gradient, then running backprop again, and so on for many iterations until performance stops increasing. The black trajectory in the figure on the right visualizes how the weights move from higher error regions to lower error regions over the course of training. Needless to say, we can’t do anything remotely like gradient descent on a human brain, or the brain of any other animal!

Gradient descent is super powerful because, unlike a black box method, it’s almost impossible to trick. All of the AI’s thoughts are “transparent” to gradient descent and are included in its computation. If the AI is secretly planning to kill you, GD will notice this and almost surely make it less likely to do that in the future. This is because GD has a strong tendency to favor the simplest solution which performs well, and secret murder plots aren’t actively useful for improving human feedback on your actions.
White box alignment in nature
Almost every organism with a brain has an innate reward system. As the organism learns and grows, its reward system directly updates its neural circuitry to reinforce certain behaviors and penalize others. Since the reward system directly updates it in a targeted way using simple learning rules, it can be viewed as a crude form of white box alignment. This biological evidence indicates that white box methods are very strong tools for shaping the inner motivations of intelligent systems. Our reward circuitry reliably imprints a set of motivational invariants into the psychology of every human: we have empathy for friends and acquaintances, we have parental instincts, we want revenge when others harm us, etc. Furthermore, these invariants must be produced by easy-to-trick reward signals that are simple enough to encode in the genome.
This suggests that at least human-level general AI could be aligned using similarly simple reward functions. But we already align cutting edge models with learned reward functions that are much too sophisticated to fit inside the human genome, so we may be one step ahead of our own reward system on this issue. Crucially, I’m not saying humans are “aligned to evolution”— see Evolution provides no evidence for the sharp left turn for a debunking of that analogy. Rather, I’m saying we’re aligned to the values our reward system predictably produces in our environment.
An anthropologist looking at humans 100,000 years ago would not have said humans are aligned to evolution, or to making as many babies as possible. They would have said we have some fairly universal tendencies, like empathy, parenting instinct, and revenge. They might have predicted these values will persist across time and cultural change, because they’re produced by ingrained biological reward systems. And they would have been right.
When it comes to AIs, we are the innate reward system. And it’s not hard to predict what values will be produced by our reward signals: they’re the obvious values, the ones an anthropologist or psychologist would say the AI seems to be displaying during training. For more discussion see Humans provide an untapped wealth of evidence about alignment.
Realistic AI pauses would be counterproductive
When weighing the pros and cons of AI pause advocacy, we must sharply distinguish the ideal pause policy— the one we’d magically impose on the world if we could— from the most realistic pause policy, the one that actually existing governments are most likely to implement if our advocacy ends up bearing fruit.
Realistic pauses are not international
An ideal pause policy would be international— a binding treaty signed by all governments on Earth that have some potential for developing powerful AI. If major players are left out, the “pause” would not really be a pause at all, since AI capabilities would keep advancing. And the list of potential major players is quite long, since the pause itself would create incentives for non-pause governments to actively promote their own AI R&D.
However, it’s highly unlikely that we could achieve international consensus around imposing an AI pause, primarily due to arms race dynamics: each individual country stands to reap enormous economic and military benefits if they refuse to sign the agreement, or sign it while covertly continuing AI research. While alignment pessimists may argue that it is in the self-interest of every country to pause and improve safety, we’re unlikely to persuade every government that alignment is as difficult as pessimists think it is. Such international persuasion is even less plausible if we assume short, 3-10 year timelines. Public sentiment about AI varies widely across countries, and notably, China is among the most optimistic.
The existing international ban on chemical weapons does not lend plausibility to the idea of a global pause. AGI will be, almost by definition, the most useful invention ever created. The military advantage conferred by autonomous weapons will certainly dwarf that of chemical weapons, and they will likely be more powerful even than nukes due to their versatility and precision. The race to AGI will therefore be an arms race in the literal sense, and we should expect it will play out similarly to the last such race: major powers rushed to make a nuclear weapon as fast as possible.
If in spite of all this, we somehow manage to establish a global AI moratorium, I think we should be quite worried that the global government needed to enforce such a ban would greatly increase the risk of permanent tyranny, itself an existential catastrophe. I don’t have time to discuss the issue here, but I recommend reading Matthew Barnett’s “The possibility of an indefinite AI pause” and Quintin Pope’s “AI is centralizing by default; let's not make it worse,” both submissions to this debate. In what follows, I’ll assume that the pause is not international, and that AI capabilities would continue to improve in non-pause countries at a steady but somewhat reduced pace.
Realistic pauses don’t include hardware
Artificial intelligence capabilities are a function of both hardware (fast GPUs and custom AI chips) and software (good training algorithms and ANN architectures). Yet most proposals for AI pause (e.g. the FLI letter and PauseAI) do not include a ban on new hardware research and development, focusing only on the software side. Hardware R&D is politically much harder to pause because hardware has many uses: GPUs are widely used in consumer electronics and in a wide variety of commercial and scientific applications.
But failing to pause hardware R&D creates a serious problem because, even if we pause the software side of AI capabilities, existing models will continue to get more powerful as hardware improves. Language models are much stronger when they’re allowed to “brainstorm” many ideas, compare them, and check their own work— see the Graph of Thoughts paper for a recent example. Better hardware makes these compute-heavy inference techniques cheaper and more effective.
Hardware overhang is likely
If we don’t include hardware R&D in the pause, the price-performance of GPUs will continue to double every 2.5 years, as it did between 2006 and 2021. This means AI systems will get at least 16x faster after ten years and 256x faster after twenty years, simply due to better hardware. If the pause is lifted all at once, these hardware improvements would immediately become available for training more powerful models more cheaply— a hardware overhang. This would cause a rapid and fairly discontinuous increase in AI capabilities, potentially leading to a fast takeoff scenario and all of the risks it entails.
The size of the overhang depends on how fast the pause is lifted. Presumably an ideal pause policy would be lifted gradually over a fairly long period of time. But a phase-out can’t fully solve the problem: legally-available hardware for AI training would still improve faster than it would have “naturally,” in the counterfactual where we didn’t do the pause. And do we really think we’re going to get a carefully crafted phase-out schedule? There are many reasons for thinking the phase-out would be rapid or haphazard (see below).
More generally, AI pause proposals seem very fragile, in the sense that they aren’t robust to mistakes in the implementation or the vagaries of real-world politics. If the pause isn’t implemented perfectly, it seems likely to cause a significant hardware overhang which would increase catastrophic AI risk to a greater extent than the extra alignment research during the pause would reduce risk.
Likely consequences of a realistic pause
If we succeed in lobbying one or more Western countries to impose an AI pause, this would have several predictable negative effects:
- Illegal AI labs develop inside pause countries, remotely using training hardware outsourced to non-pause countries to evade detection. Illegal labs would presumably put much less emphasis on safety than legal ones.
- There is a brain drain of the least safety-conscious AI researchers to labs headquartered in non-pause countries. Because of remote work, they wouldn’t necessarily need to leave the comfort of their Western home.
- Non-pause governments make opportunistic moves to encourage AI investment and R&D, in an attempt to leap ahead of pause countries while they have a chance. Again, these countries would be less safety-conscious than pause countries.
- Safety research becomes subject to government approval to assess its potential capabilities externalities. This slows down progress in safety substantially, just as the FDA slows down medical research.
- Legal labs exploit loopholes in the definition of a “frontier” model. Many projects are allowed on a technicality; e.g. they have fewer parameters than GPT-4, but use them more efficiently. This distorts the research landscape in hard-to-predict ways.
- It becomes harder and harder to enforce the pause as time passes, since training hardware is increasingly cheap and miniaturized.
- Whether, when, and how to lift the pause becomes a highly politicized culture war issue, almost totally divorced from the actual state of safety research. The public does not understand the key arguments on either side.
- Relations between pause and non-pause countries are generally hostile. If domestic support for the pause is strong, there will be a temptation to wage war against non-pause countries before their research advances too far:
- “If intelligence says that a country outside the agreement is building a GPU cluster, be less scared of a shooting conflict between nations than of the moratorium being violated; be willing to destroy a rogue datacenter by airstrike.” — Eliezer Yudkowsky
- There is intense conflict among pause countries about when the pause should be lifted, which may also lead to violent conflict.
- AI progress in non-pause countries sets a deadline after which the pause must end, if it is to have its desired effect. As non-pause countries start to catch up, political pressure mounts to lift the pause as soon as possible. This makes it hard to lift the pause gradually, increasing the risk of dangerous fast takeoff scenarios (see below).
Predicting the future is hard, and at least some aspects of the above picture are likely wrong. That said, I hope you’ll agree that my predictions are plausible, and are grounded in how humans and governments have behaved historically. When I imagine a future where the US and many of its allies impose an AI pause, I feel more afraid and see more ways that things could go horribly wrong than in futures where there is no such pause.
This post is part of AI Pause Debate Week. Please see this sequence for other posts in the debate.
There's a giant straw man in this post, and I think it's entirely unreasonable to ignore. It's the assertion, or assumption, that the "pause" would be a temporary measure imposed by some countries, as opposed to a stop-gap solution and regulation imposed to enable stronger international regulation, which Nora says she supports. (I'm primarily frustrated by this because it ignores the other two essays, which Nora had access to a week ago, that spelled this out in detail.)
I don't understand the distinction you're trying to make between these two things. They really seem like the same thing to me, because a stop-gap measure is temporary by definition:
If by "stronger international regulation" you mean "global AI pause" I argue explicitly that such a global pause is highly unlikely to happen. You don't get to assume that your proposed "stop-gap" pause will in fact lead to a global pause just because you called it a stop-gap. What if it doesn't? Will it be worse than no pause at all in that scenario? That's a big part of what we're debating. Is it a "straw man" if I just disagree with you about the likely effects of the policies you're proposing?
I'm also against a global pause even if we can make it happen, and I say so in the post:
First, it sounds like you are agreeing with others, including myself, about a pause.
So yes, you're arguing against a straw-man. (Edit to add: Perhaps Rob Bensinger's views are more compatible with the claim that someone is advocating a temporary pause as a good idea - but he has said that ideally he wants a full stop, not a pause at all.)
Second, you're ignoring half of what stop-gap means, in order to say it just means pausing, without following up. But it doesn't.
I laid out in pretty extensive detail what I meant as the steps that need to be in place now, and none of them are a pause; immediate moves by national governments to monitor compute and clarify that laws apply to AI systems, and that they will be enforced, and commitments to build an international regulatory regime.
And the alternative to what you and I agree would be an infeasible pause, you claim, is a sudden totalitarian world government. This is the scary false alternative raised by the other essays as well, and it seems disengenious to claim that we'd suddenly emerge into a global dictatorship, by assumption. It's exactly parallel to arguments raised against anti-nuclear proliferation plans. But we've seen how that worked out - nuclear weapons were mostly well contained, and we still don't have a 1984-like global government. So it's strange to me that you think this is a reasonable argument, unless you're using it as a scare tactic.
In my essay I don't make an assumption that the pause would immediate, because I did read your essay and I saw that you were proposing that we'd need some time to prepare and get multiple countries on board.
I don't see how a delay before a pause changes anything. I still think it's highly unlikely you're going to get sufficient international backing for the pause, so you will either end up doing a pause with an insufficiently large coalition, or you'll back down and do no pause at all.
Is your opposition to stopping the building of dangerously large models via international regulation because you don't think that it's possible to do, or because you are opposed to having such limits?
You seem to equivocate; first you say that we need larger models in order to do alignment research, and a number of people have already pointed out that this claim is suspect - but it implies you think any slowdown would be bad even if done effectively. Next, you say that a fast takeoff is more likely if we stop temporarily and then remove all limits, and I agree, but pointed out that no-one is advocating that, and that it's not opposition to any of the actual proposals, it's opposition to a straw man. Finally, you say that it's likely to push work to places that aren't part of the pause. That's assuming international arms control of what you agree could be an existential risk is fundamentally impossible, and I think that's false - but you haven't argued the point, just assumed that it will be ineffective.
(Also, reread my piece - I call for action to regulate and stop larger and more dangerous models immediately as a prelude to a global moratorium. I didn't say "wait a while, then impose a pause for a while in a few places.")
Clarifying question: is a nuclear arms pause or moratorium possible, by your definition of the word? Is it likely?
With the evidence that many world leaders, including the leaders of the USA, Israel, China, and Russia speak of AI as a must have strategic technology, do you think they are likely in plausible future timelines to reverse course and support international AI pauses before evidence of the dangers of AGI, by humans building one, exists?
Do you dispute that they have said this publicly and recently?
Do you believe there is any empirical evidence proving an AGI is an existential risk available to policymakers? If there is, what is the evidence? Where is the benchmark of model performance showing this behavior?
I am aware many experts are concerned but this is not the same as having empirical evidence to support their concerns. There is an epistemic difference.
I am wondering if we are somehow reading two different sets of news. I acknowledge that it is possible that an AI pause is the best thing humanity could do right now to ensure further existence. But I am not seeing any sign that it is a possible outcome. (By "possible" I mean it's possible for all parties to inexplicably act against their own interests without evidence, but it's not actually going to happen)
Edit: it's possible for Saudi Arabia to read the news on climate change and decide they will produce 0 barrels in 10 years. It's possible for every OPEC member to agree to the same pledge. It's possible, with a wartime level of effort, to transition the economy to no longer need Opec petroleum worldwide, in just 10 years.
But this is not actually possible. The probability of this happening is approximately 0.
To qualify this would be a moratorium or pause on nuclear arms before powerful nations had doomsday sized arsenals. The powerful making it expensive for poor nations to get nukes - though several did - is different. And notably I wonder how well it would have gone if the powerful nation had no nukes of their own. Trying to ban AGI from others - when the others have nukes and their own chip fabs - would be the same situation. Not only will you fail you will eventually, if you don't build your own AGI, lose everything. Same if you have no nukes.
What data is that? A model misunderstanding "rules" on an edge case isn't misaligned. Especially when double generation usually works. The sub rising has every prior sunrise as priors. Which empirical data would let someone conclude AGI is an existential risk justifying international agreements. Some measurement or numbers.
Yes, and they said this about nukes and built thousands
Yes to maximize profit. Pledging to go to zero is not the same thing.
You seem to dismiss the claim that AI is an existential risk. If that's correct, perhaps we should start closer to the beginning, rather than debating global response, and ask you to explain why you disagree with such a large consensus of experts that this risk exists.
I don't disagree. I don't see how it's different than nuclear weapons. Many many experts are also saying this.
Nobody denies nuclear weapons are an existential risk. And every control around their use is just probability based, there is absolutely nothing stopping a number of routes from ending the richest civilizatios. Multiple individuals appear to have the power to do it at a time, every form of interlock and safety mechanism has a method of failure or bypass.
Survival to this point was just probability. Over an infinite timescale the nukes will fly.
Point is that it was completely and totally intractable to stop the powerful from getting nukes. SALT was the powerful tiring of paying the maintenance bills and wanting to save money on MAD. And key smaller countries - Ukraine and Taiwan - have strategic reasons to regret their choice to give up their nuclear arms. It is possible that if the choice happens again future smaller countries will choose to ignore the consequences and build nuclear arsenals. (Ukraines first opportunity will be when this war ends, they can start producing plutonium. Taiwan chance is when China begins construction of the landing ships)
So you're debating something that isn't going to happen without a series of extremely improbable events happening simultaneously.
If you start thinking about practical interlocks around AI systems you end up with similar principles to what protects nukes albeit with some differences. Low level controllers running simple software having authority, air gaps - there are some similarities.
Also unlike nukes a single AI escaping doesn't end the world. It has to escape and there must be an environment that supports its plans. It is possible for humans to prepare for this and to make the environment inhospitable to rogue AGIs. Heavy use of air gaps, formally proven software, careful monitoring and tracking of high end compute hardware. A certain minimum amount of human supervision for robots working on large scale tasks.
This is much more feasible than "put the genie away" which is what a pause is demanding.
You are arguing impossibilities despite a reference class with reasonably close analogues that happened. If you could honestly tell me people thought the NPT was plausible when proposed, and I'll listen when you say this is implausible.
In fact, there is appetite for fairly strong reactions, and if we're the ones who are concerned about the risks, folding before we even get to the table isn't a good way to get anything done.
I am saying the common facts that we both have access to do not support your point of view. It never happened. There are no cases of "very powerful, short term useful, profitable or military technologies" that were effectively banned, in the last 150 years.
You have to go back to the 1240s to find a reference class match.
These strongly worded statements I just made are trivial for you to disprove. Find a counterexample. I am quite confident and will bet up to $1000 you cannot.
You've made some strong points, but I think they go too far.
The world banned CFCs, which were critical for a huge range of applications. It was short term useful, profitable technology, and it had to be replaced entirely with a different and more expensive alternative.
The world has banned human cloning, via a UN declaration, despite the promise of such work for both scientific and medical usage.
Neither of these is exactly what you're thinking of, and I think both technically qualify under the description you provided, if you wanted to ask a third party to judge whether they match. (Don't feel any pressure to do so - this is the kind of bet that is unresolvable because it's not precise enough to make everyone happy about any resolution.)
However, I also think that what we're looking to do in ensuring only robustly safe AI systems via a moratorium on untested and by-default-unsafe systems is less ambitious or devastating to applications than a full ban on the technology, which is what your current analogy requires. Of course, the "very powerful, short term useful, profitable or military technolog[y]" of AI is only those things if it's actually safe - otherwise it's not any of those things, it's just a complex form of Russian roulette on a civilizational scale. On the other hand, if anyone builds safe and economically beneficial AGI, I'm all for it - but the bar for proving safety is higher than anything anyone currently suggests is feasible, and until that changes, safe strong AI is a pipe-dream.
??? David, do you have any experience with
(1) engineering
(2) embedded safety compliant systems
(3) AI
Note that Mobileye has a very strong proposal for autonomous car safety, I mention it because it's one of the theoretically best ones.
You can go watch their videos on it but it's simple 3 parallel solvers, each using a completely different input (camera, lidar, imaging radar). If any solver perceives a collision, the system acts to prevent that collision. So a failure to hit a collidable object requires pFail^3. It is unlikely, most of the failures are going to be where the system is coupled together.
Similar techniques scale to superintelligent AI.
You can go play with it right now, even write your own python script and do it yourself.
Suppose you want an LLM to obey an arbitrary list of "rules".
You have the LLM generate output, and you measured how often in testing, and production, it has violated the rules.
Say pFail is 0.1. Then you add another stage. Have the LLM check it's own output for a rule violation, and don't send it to the user if the violation was there.
Say the pFail on that stage is 0.2.
Therefore the overall system will fail 2% of the time.
Maybe good enough for current uses, but not good enough to run a steel mill. A robot making an error 2% of the time will cause the robot to probably break itself and cost more service worker time than having a human operator.
So you add stages. You create training environments where you model the steel mill, you add more stages of error checking, you do things until empirically your design failure meets spec.
This is standard engineering practice. No "AI alignment experts" needed, any 'real' engineer knows this.
One of the critical things you do is you need your test environment to reflect reality. There are a lot of things involved in this but the one crucial to AI is immutable model weights. When you are validating the model and when it's used in the real world, it's immutable. No learning, no going out of control.
And another aspect is to control the state buildup. Most software systems that have ever failed - see patriot missile, see Therac-25 - fail because state accumulated at runtime. You can prevent this, fresh prompts when using GPT-4 is one obvious way. Limiting what information the model has to operate reduces how often it fails, both in production and testing.
A superintelligent system is easily restricted the same way. Because while it may be far past human ability, we tested it in ways we could verify, we check the distribution of the inputs to make sure they were reflected in the test environment - that is, the real world input could have been generated in test - and it's superintelligent because it generated the right answer almost every time, well below the error rate of a human.
I think the cognitive error here is everyone is imagining an "ASI" or "AGI" as "like you or me but waaaay smarter". And this baggage brings in a bunch of elements humans have an AI system does not need to do its job. Mostly memory for an inner monologue or persistent chain of thought, continuity of existence, online learning, long term goals.
You need 0 of those to automate most jobs or make complex decisions that humans cannot make accurately.
I disagree with a lot of particulars here, but don't want to engage beyond this response because your post feels like it's not about the substantive topic any more, it's just trying to mock an assumed / claimed lack of understanding on my part. (Which would be far worse to have done if it were correct.)
That said, if you want to know more about my background, I'm eminently google-able, and while you clearly have more background in embedded safety compliant systems, I think you're wrong on almost all of the details of what you wrote as it applies to AGI.
Regarding your analogy to Mobileye's approach, I've certainly read the papers, and had long conversations with people at Mobileye about their safety systems. I even had one of their former That's why I think it's fair to say that you're fundamentally mischaracterizing the idea of "Responsibility-Sensitive Safety" - it's not about collision avoidance per se, it's about not being responsible for accidents, in ways that greatly reduce the probability of such accidents. This is critical for understanding what it does and does not guarantee. More critically, for AI systems, this class of safety guarantee doesn't work because you need a complete domain model as well as a complete failure mode model in order to implement a similar failsafe. I've even written about how RSS could be extended, and that explains why it's not applicable to AGI back in 2018 - but found that many of my ideas were anticipated by Christiano's 2016 work (which that post is one small part of,) and had been further refined in the context of AGI since then.
So I described scaling stateless microservices to control AGI. This is how current models work, and this is how cais works, and this is how tech company stacks work.
I mentioned an in distribution detector as a filter and empirical measurement of system safety.
I have mentioned this to safety researchers at openAI. The one I talked to on the eleuther discord didn't know of a flaw.
Why won't this work? It's very strong theoretically and simple and close to current techniques. Can you name or link one actual objection? Eliezer was unable to do so.
The only objection I have heard is "humans will be tempted to build unsafe systems". Maybe so, but the unsafe ones will measurably lower performance than this design for a reason that I will assume you know. So humans will only build a few, and if they cannot escape the lab because the model needs thousands of current gen accelerator cards to think, then....
If your action space is small enough to have what you want it to not be able to do programmatically described in terms of its outputs, and your threat model is complete, it works fine.
Ok in my initial reply I missed something.
In your words, what kind of tasks do you believe you cannot accomplish with restricted models like I described.
When you say the "threat model has to be complete", what did you have in mind specifically?
These are restricted models, they get a prompt from an authorized user + context in human parsable format, they emit a human parsable output. This scales from very large to very small tasks, so long as the task can be checked for correctness, ideally in simulation.
With this context, what are your concerns? Why must we be frightened enough to pause everything?
For individual tasks, sure, you can implement verifiers, though I think it becomes quickly unwieldy, but there's no in-principle reason we cannot do this. But you cannot create AGI with a restricted model - we cannot define the space of what outputs we want, otherwise it's by definition a narrow AI.
What's GPT-4?
Because it can generate outputs that are sometimes correct on new tasks - "write me a program that computes X", it's general, even if "compute X" is made of 2 common subcomponents the model saw many times in training.
GPT-4 is perfectly safe if you were to run it in local hardware with a local terminal. The "space of outputs" is "text to the terminal". As long as you don't leave a security vulnerability where that text stream can cause commands to execute on the history PC, that's it, that's all it can do.
Consider that "a robot tethered to a mount" could do general tasks the same way. Same idea - its a general system but it's command stream can't reach anything but the tethered robot because that's where the wires go.
You also verified the commands empirically. It's not that you know any given robotic actions or text output is good, it's that you benchmarked the model and it has a certain pFail on training inputs.
I agree this is not as much generality as humans have. It's not a narrow AI though the "In distribution detector" - a measure of how similar the current task, current input is to the training set - is essentially narrowing your AI system from a general one to a narrow one, depending on your tolerances.
For tasks where you can't shut the system down when the input state leaves distribution - say a robotic surgeon, you need it to keep trying best it can- you would use electromechanical interlocks. Same as 50 years ago for interlocks that prevent exposure to radiation. You tether the surgery robotic equipment, restrict it's network links etc, so that the number of people it can kill is at most 1 (the patient)
Right. And this scales to fairly massive tasks. "Design a medium body airliner to these specs" is completely doable. Or an entire chip in one step.
The model doesn't get to collaborate with future versions of itself because it doesn't know who is checking the end product for security vulnerabilities and to be caught is death. We could give a model thousands of tasks of this complexity level and check for sabotage before giving the model 1 real task. It gets no context differentiating the real task from the test tasks, and has no memory of any of the prior steps.
And see it scales up and down the tree. The robots building the aircraft don't get to plan their sabotage by similar limits and so on and so forth.
Your plan to deal with bad models is to use your restricted models to manufacture the weapons needed to fight them, and to optimize their engagements.
This i think is a grounded and realistic view of how to win this. Asking for pauses is not.
You're misinterpreting what a moratorium would involve. I think you should read my post, where I outlined what I think a reasonable pathway would be - not stopping completely forever, but a negotiated agreement about how to restrict more powerful and by-default dangerous systems, and therefore only allowing those that are shown to be safe.
Edit to add: "unlike nukes a single AI escaping doesn't end the world" <- Disagree on both fronts. A single nuclear weapons won't destroy the world, while a single misaligned and malign superintelligent AI, if created and let loose, almost certainly will - it doesn't need a hospitable environment.
So there is one model that might have worked for nukes. You know about PAL and weak-link strong link design methodology? This is a technology for reducing the rogue use of nuclear warheads. It was shared with Russia/the USSR so that they could choose to make their nuclear warheads safe from unauthorized use.
Major AI labs could design software frameworks and tooling that make AI models, even ASI capabilities level models, less likely to escape or misbehave. And release the tooling.
It would be voluntary compliance but like the Linux Kernel it might in practice be used by almost everyone.
As for the second point, no. Your argument has a hidden assumption that is not supported by evidence or credible AI scientists.
The evidence is that models that exhibit human scale abilities need human scale (within an oom) level of compute and memory. The physical hardware racks to support this are enormous and not available outside AI labs. Were we to restrict the retail sale of certain kinds of training accelerator chips and especially high bandwidth interconnects, we could limit the places human level + AI could exist to data centers at known addresses.
Your hidden assumption is optimizations, but the problem is that if you consider not just "AGI" but "ASI", the amount of hardware to support superhuman level cognition is probably nonlinear.
If you wanted a model that could find an action that has a better expected value than a human level model with 90 percent probability (so the model is 10 times smarter in utility), it probably needs more than 10 times the compute. Probably logarithmic, that to find a better action 90 percent of the time you need to explore a vastly larger possibility space and you need the compute and memory to do this.
This is probably provable in a theorem but the science isn't there yet.
If correct, actually ASI is easily contained. Just write down where 10,000+ H100s are located or find it by IR or power consumption. If you suspect a rogue ASI has escaped that's where you check.
This is what I mean by controlling the environment. Realtime auditing of AI accelerator clusters - what model is running, who is paying for it, what's their license number, etc - would actually decrease progress very little while make escapes difficult.
If hacking and escapes turns out to be a threat, air gaps and asic hardware firewalls to prevent this are the next level of security to add.
The difference is that major labs would not be decelerated at all. There is no pause. They just in parallel have to spend a trivial amount of money complying with the registration and logging reqs.
I have now made a clarification at the very top of the post to make it 1000% clear that my opposition is disjunctive, because people repeatedly get confused / misunderstand me on this point.
My opposition is disjunctive!
I both think that if it's possible to stop the building of dangerously large models via international regulation, that would be bad because of tyranny risk, and I also think that we very likely can't use international regulation to stop building these things, so that any local pauses are not going to have their intended effects and will have a lot of unintended net-negative effects.
This really sounds like you are committing the fallacy I was worried about earlier on. I just don't agree that you will actually get the global moratorium. I am fully aware of what your position is.
I think that you're claiming something much stronger than "we very likely can't use international regulation to stop building these things" - you're claiming that international regulation won't even be useful to reduce risk by changing incentives. And you've already agreed that it's implausible that these efforts would lead to tyranny, you think they will just fail.
But how they fail matters - there's a huge difference between something like the NPT, which was mostly effective, and something like the Kellogg-Briand Pact of 1928, which was ineffective but led to a huge change, versus... I don't know, I can't really think of many examples of treaties or treaty negotiations that backfired, even though most fail to produce exactly what they hoped. (I think there's a stronger case to make that treaties can prevent the world from getting stronger treaties later, but that's not what you claim.)
I think that conditional on the efforts working, the chance of tyranny is quite high (ballpark 30-40%). I don't think they'll work, but if they do, it seems quite bad.
And since I think x-risk from technical AI alignment failure is in the 1-2% range, the risk of tyranny is the dominant effect of "actually enforced global AI pause" in my EV calculation, followed by the extra fast takeoff risks, and then followed by "maybe we get net positive alignment research."
Conditional on "the efforts" working is hooribly underspecified. A global governance mechanism run by a new extranational body with military powers monitoring and stopping production of GPUs, or a standard treaty with a multi-party inspection regime?
I'm not conditioning on the global governance mechanism— I assign nonzero probability mass to the "standard treaty" thing— but I think in fact you would very likely need global governance, so that is the main causal mechanism through which tyranny happens in my model