
TL;DR: I argue for two main theses:

  1. [Moderate-high confidence] It would be better to aim for a conditional pause, where a pause is triggered based on evaluations of model ability, rather than an unconditional pause (e.g. a blanket ban on systems more powerful than GPT-4).
  2. [Moderate confidence] It would be bad to create significant public pressure for a pause through advocacy, because this would cause relevant actors (particularly AGI labs) to spend their effort on looking good to the public, rather than doing what is actually good.

Since mine is one of the last posts of the AI Pause Debate Week, I've also added a section at the end with quick responses to the previous posts.

Which goals are good?

That is, ignoring tractability and just assuming that we succeed at the goal -- how good would that be? There are a few options:

Full steam ahead. We try to get to AGI as fast as possible: we scale up as quickly as we can; we only spend time on safety evaluations to the extent that it doesn’t interfere with AGI-building efforts; we open source models to leverage the pool of talent not at AGI labs.

Quick take. I think this would be bad, as it would drastically increase x-risk.

Iterative deployment. We treat AGI like we would treat many other new technologies: something that could pose risks, which we should think about and mitigate, but ultimately something we should learn about through iterative deployment. The default is to deploy new AI systems, see what happens with a particular eye towards noticing harms, and then design appropriate mitigations. In addition, AI systems are deployed with a rollback mechanism, so that if a deployment causes significant harms, it can be quickly reversed.

Quick take. This is better than full steam ahead, because you could notice and mitigate risks before they become existential in scale, and those mitigations could continue to successfully prevent risks as capabilities improve[1].

Conditional pause. We institute regulations that say that capability improvement must pause once the AI system hits a particular threshold of riskiness, as determined by some relatively standardized evaluations, with some room for error built in. AI development can only continue once the developer has exhibited sufficient evidence that the risk will not arise.

For example, following ARC Evals, we could evaluate the ability of an org’s AI systems to autonomously replicate, and the org would be expected to pause when they reach a certain level of ability (e.g. the model can do 80% of the requisite subtasks with 80% reliability), until they can show that the associated risks won’t arise.
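As a toy illustration of what such a trigger could look like in code (the function name, thresholds, and data here are all hypothetical, not part of any actual evaluation suite):

```python
# Hypothetical sketch of a conditional-pause trigger, assuming we have
# per-subtask success rates from an ARC Evals-style autonomous-replication
# evaluation. Names and thresholds are illustrative only.

def should_pause(subtask_success_rates, task_threshold=0.8, coverage_threshold=0.8):
    """Pause if the model completes at least `coverage_threshold` of the
    subtasks with at least `task_threshold` reliability each."""
    passed = [r for r in subtask_success_rates if r >= task_threshold]
    coverage = len(passed) / len(subtask_success_rates)
    return coverage >= coverage_threshold

# Example: 12 subtasks, 10 of them done with >= 80% reliability.
rates = [0.95, 0.9, 0.85, 0.9, 0.82, 0.88, 0.81, 0.93, 0.86, 0.84, 0.5, 0.3]
print(should_pause(rates))  # True: 10/12 ≈ 83% of subtasks pass the bar
```

A real proposal would of course build in a safety margin well below the capability level at which the risk actually materializes, as the "room for error" clause above suggests.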

Quick take. Of course my take would depend on the specific details of the regulations, but overall this seems much better than iterative deployment. Depending on the details, I could imagine it taking a significant bite out of overall x-risk. The main objections which I give weight to are the overhang objection (faster progress once the pause stops) and the racing objection (a pause gives other, typically less cautious actors more time to catch up and intensify or win a capabilities race), but overall these seem less bad than not stopping when a model looks like it could plausibly be very dangerous.

Unconditional temporary pause. We institute regulations that ban the development of AI models over some compute threshold (e.g. “more powerful than GPT-4”). Every year, the minimum resources necessary to destroy the world drop by 0.5 OOMs[2], and so we lower the threshold over time. Eventually AGI is built, either because we end the pause in favor of some new governance regime (that isn’t a pause), or because the compute threshold got low enough that some actor flouted the law and built AGI.

Quick take. I’ll discuss this further below, but I think this is clearly inferior to a conditional pause, and it’s debatable to me whether it is better or worse than iterative deployment.

Unconditional permanent pause. At the extreme end, we could imagine a complete ban on making progress towards AGI. We aren’t just avoiding training new frontier models – we’re preventing design of new architectures or training algorithms, forecasting via scaling laws, study of existing model behaviors, improvements on chip design, etc. We’re monitoring research fields to nip any new paradigms for AGI in the bud[3].

Variant 1: This scenario starts like the unconditional temporary pause story above, but we use extreme surveillance once the compute threshold becomes very low to continue enforcing the pause.

Variant 2: In addition to this widespread pause, there is a tightly controlled and monitored government project aiming to build safe AGI.

Quick take. I don’t usually think about these scenarios, but my tentative take is that the scenario I outlined, or Variant 2 (but not Variant 1), would be the best outcome on this list. It implies a level of global coordination that would probably be sufficient to tackle other societal challenges, decreasing the importance of getting AGI soon, and we can likely obtain many of the benefits of AGI eventually by developing the relevant technologies ourselves with our human intelligence. (We could probably even get an intelligence explosion from digital people, e.g. through whole-brain emulation.) Nonetheless, I will set these scenarios aside for the rest of this post as I believe there is widespread consensus that this is not a realistic option available to us.

Why I prefer conditional over unconditional pauses

There are many significant advantages of a conditional pause, relative to an unconditional temporary pause:

  1. There is more time for safety research with access to powerful models (since you pause later, when you have more powerful models). This is a huge boost for many areas of alignment research that I’m excited about[4]. (This point has been discussed elsewhere via the “overhang” objection to an unconditional pause.)
  2. While I’m very uncertain, on balance I think it provides more serial time to do alignment research. As model capabilities improve and we get more legible evidence of AI risk, the will to pause should increase, and so the expected length of a pause should also increase[5].
  3. We get experience and practice with deploying and using more powerful models in the real world, which can be particularly helpful for handling misuse and structural risks, simply because we can observe what happens in practice rather than speculating on it from our armchairs[6].
  4. We can get much wider support for a conditional pause. Most people can get on board with the principle “if an AI system would be very dangerous, don’t build it”, and then the relevant details are about when a potential AI system should be considered plausibly very dangerous. At one end, a typical unconditional pause proposal would say “anything more powerful than GPT-4 should be considered plausibly very dangerous”. As you make the condition less restrictive and more obviously tied to harm, it provokes less and less opposition.
  5. It is easier to put in place, because it doesn’t require anyone to give up something they are already doing.

In contrast, the advantages of an unconditional pause seem relatively minor:

  1. Since it is simpler than the conditional pause, it is easier to explain.
    • This is a big advantage. While the conditional pause isn’t that much harder to explain, for public advocacy, I’d guess you get about five words, so even small differences lead to big advantages. 
  2. Since it is simpler than the conditional pause, it is easier to implement correctly.
    • This doesn’t seem like a big deal; governments routinely create and enforce regulations that are way more complicated than a typical conditional pause proposal.
  3. Since there are fewer moving parts to manipulate, it is harder to game.
    • This might be a big deal in the longer run? But in the short term, I don’t expect the current leading companies to game such a regulation, because I expect they care at least a bit about safety (e.g. the Superalignment commitment is really unlikely in worlds where OpenAI doesn’t care at all, similarly for the signatures on the CAIS statement).
  4. There’s less risk that we fail to pause at the point when AI systems become x-risky.
    • We can choose a conditional pause proposal that makes this risk tiny, e.g. “pause once an agent passes 10 or more of the ARC Evals tasks”[7].

Upon reading other content from this week, it seems like maybe a crux for people is whether the pause is a stopgap or an end state. My position stays the same regardless; in either case I’d aim for a conditional pause, for the same reasons.

What advocacy is good?

I think about three main types of advocacy to achieve these goals:

  1. Targeted advocacy within AGI labs and to the AI research community
  2. Targeted advocacy to policymakers
  3. Broad advocacy to the public
    • Note that I’m only considering advocacy with the aim of producing public pressure for a pause – I’m not considering e.g. advocacy that’s meant to produce new alignment researchers, or other such theories of change.

I think there is a fairly widespread consensus that (1) and (2) are good. I’d also highlight that (1) and (2) are working. For (1) look at the list of people who signed the CAIS statement. For (2), note the existence of the UK Frontier AI Taskforce and the people on it, as well as the intent bill SB 294 in California about “responsible scaling” (a term from ARC that I think refers to a set of policies that would include a conditional pause).

I think (3) is much less clear. I’d first note that the default result of advocacy to the public is failure; it’s really hard to reach a sizable fraction of the public (which is what’s needed to produce meaningful public pressure). But let’s consider the cases in which it succeeds. In these cases, I expect a mix of advantages and disadvantages. The advantages are:

  1. Will: There would be more pressure to get things done, in a variety of places (government, AGI labs, AI research community). So, more things get done.

(I’ve only put down one advantage, but this is a huge advantage! Don’t judge the proposal solely on the number of points I wrote down.)

The disadvantages are:

  1. Lack of nuance: Popular discourse about AI x-risk would not be nuanced. For example, I expect that the ask from the public would be an unconditional pause, because a conditional one would be too nuanced.
    • Note that it is possible to have a good cop / bad cop routine: build public pressure with a “bad cop” that pushes goals that lack nuance, and have a “good cop” use that pressure to achieve a better, more nuanced goal[8].
    • This is also discussed as “inside game” / “outside game” in Holly’s post.
  2. Goodhart’s Law (labs): Companies face large pressure to do safety work that would look good to the public, which is different from doing work that is actually good[9]. At the extreme, existing safety teams could be replaced by compliance and PR departments, which do the minimum required to comply with regulations and keep the public sufficiently happy (see also Why Not Slow AI Progress?).
    • This is already happening a little: public communication from labs is decided in part based on the expected response from AI x-risk worriers. It will get much worse when it expands to the full public, since the public is a much more important stakeholder, and has much less nuance.
    • This seems especially bad for AI alignment research, where the best work is technical, detail-oriented, and relatively hard to explain.
  3. Goodhart’s Law (govts): Similarly, policymakers would be under a lot of pressure to create policy that looks good to the public, rather than policies that are actually good.
    • I’m very uncertain about how bad this effect is, but currently I think it could be pretty bad. It’s not hard to think of potential examples from environmentalism[10], e.g. moving from plastic bags to paper bags, banning plastic straws, emphasizing recycling, and opposing nuclear power.
    • As an illustrative hypothetical example, recently there have been a lot of critiques of RLHF from the x-risk community. In a world where the public is anti-RLHF, perhaps policymakers impose very burdensome regulations on RLHF, and AGI labs turn to automated recursive self-improvement instead (which I expect would be much worse for safety).
  4. Controversy: AI x-risk would be controversial, since it’s very hard to reach a significant fraction of the public without being controversial[11] (see toxoplasma of rage, or just name major current issues with pressure from a large fraction of the public, and notice most are controversial). This seems bad[12].
    • People sometimes say controversy is inevitable, but I don’t see why this is true[13]. I haven’t seen compelling arguments for it, just raw assertions[14].
    • While I’m uncertain how negative this effect is overall, this brief analysis of cryonics and molecular nanotechnology is quite worrying; it almost reads like a forecast about the field of AI alignment.
  5. Deadlock: By making the issue highly salient and controversial, we may make governments unable to act when it otherwise would have acted (see Secret Congress).

I think that the best work on AI alignment happens at the AGI labs[15], for a variety of structural reasons[16]. As a result, I think disadvantage 2 (AGI labs Goodharting on public perception) is a really big deal and is sufficient by itself to put me overall against public advocacy with the aim of producing pressure for a pause. I am uncertain about the magnitudes of the other disadvantages, but they all appear to me to be potentially very large negative effects.

Current work on unconditional pauses

I previously intended to have a section critiquing the existing pause efforts, but recent events have made me more optimistic on this front. In particular, (1) I’m glad to see a protest against Meta (previous protests were at the most safety-conscious labs, differentially disadvantaging them), (2) I’m glad to see Holly acknowledge objections like overhangs (many of my previous experiences with unconditional pause advocates involved much more soldier mindset). I still don’t trust existing unconditional pause advocates to choose good strategies, or to try to come to accurate beliefs, and this still makes me meaningfully more uncomfortable about public advocacy than the prior section alone would, but I feel more optimistic about it than before.

Miscellaneous points

A few points that I think are less important than the others in this post:

  1. I have heard anecdotal stories (e.g. FINRA) that good regulation often comes from codifying existing practices at private companies. I’m currently more excited about this path, though I have not done even a shallow dive into the topic and so am very uncertain.
  2. I’m particularly worried about differentially disadvantaging safety-conscious AGI labs (including e.g. through national regulations that fail to constrain international labs).
  3. I think it’s not very tractable to get an unconditional pause, but I’m not particularly confident in that and it is not a crux for me[17].
  4. I suspect relative to most of my audience I’m pretty optimistic that enforcement of significant regulations is feasible, and it is mostly about whether we have the political will to do so.

Having now read the other posts this week, I wish I had written a slightly different post. In particular, I was not expecting nearly this much anti-moratorium sentiment; I would have also focused on arguing for a conditional pause rather than no pause. I also didn’t expect alignment optimism to be such a common theme. Rather than rewrite the post (which takes a lot of time), I decided to instead add some commentary on each of the other posts.

What’s in a Pause? and How could a moratorium fail?: I think the idea here is that we should eventually aim for an international regulatory regime that is not a pause, but a significant chunk of x-risk from misaligned AI is in the near future, so we should enact an unconditional pause right now. If so, my main disagreement is that I think the pause we enact right now should be conditional: specifically, I think it’s important that you evaluate the safety of a model after you train it, not before[18]. I may also disagree (perhaps controversially) that a significant chunk of x-risk from misaligned AI is in the near future, depending on what “near future” means.

AI Pause Will Likely Backfire: I weakly agree with the part of this post that argues that an unconditional pause would be bad, due to (1) overhangs / fast takeoff and (2) advantaging less cautious actors (given that a realistic pause would not be perfect). I think these don’t bite nearly as hard for conditional pauses, since they occur in the future when progress will be slower[19], they are meant to be temporary to give us time to improve our mitigations, and they are easier to build a broad base of support for (and so are easier to enforce across all relevant actors).

I have several disagreements with the rest, which argues for optimism about alignment, but I also don’t understand why this matters. My opinions on pausing wouldn’t change if I became significantly more optimistic or pessimistic about alignment, because the decision-relevant quantity is whether pausing lets you do better than you otherwise would have done[20].

Policy ideas for mitigating AI risk: I mostly agree with the outline of the strategic landscape, establishing the need to prevent catastrophe-capable AI systems from being built. Thomas then provides some concrete policy proposals, involving visibility into AI development and brakes to slow it down. These seem fine, but personally I think we should be more ambitious and aim to have a conditional pause[21].

How to think about slowing AI: I agree with this post; I think it says very similar things as I do in the “Which goals are good?” section.

Comments on Manheim's "What's in a Pause?": There isn’t a central point in this post, so I will mostly not respond to it. The one thing I’ll note is that a lot of the post is dependent on the premise that if smarter-than-human AI is developed in the near future, then we almost surely die, regardless of who builds it. I strongly disagree with this premise, and so disagree with many of the implications drawn (e.g. that it approximately doesn’t matter if you differentially advantage the least responsible actors).

The possibility of an indefinite AI pause: I had two main comments on this post:

  1. I agree that you eventually need a “global police state” to keep an AI pause going indefinitely if you allow AI research and hardware improvements to continue. But you could instead ban those things as well, which seems a lot easier to do than to build a global police state.
  2. I don’t think it’s reasonable to worry that an indefinite AI pause would be an x-risk by stalling technological progress, given that there are many other non-AI technologies that can enable continued progress (human augmentation and whole brain emulation come to mind).

I did agree with Matthew’s point that the existing evidence for an AI catastrophe isn’t sufficient to justify creating a global police state.

The Case for AI Safety Advocacy to the Public: I agree with this post that advocacy can work for AI x-risk, and that the resulting public pressure could lead to more things being done[22]. I agree that conditional on advocacy to the public, you likely want the message to be some flavor of pause, since other messages would require too much nuance. I’m on board with shifting the burden of proof onto companies to show that their product is safe (but I’m not convinced that this implies a pause).

I agree that the inside-outside game dynamic (or good cop / bad cop as I call it) has worked in previous cases of advocacy to the public. However, I expect it only works in cases where the desirable policy is relatively clear (and not nuanced), unlike AI alignment, so I’m overall against.

A couple of points where I had strong disagreements:

  1. ““Pause AI” is a simple and clear ask that is hard to misinterpret in a harmful way” – this is drastically underestimating either people’s ability to misinterpret messages, or how much harm misinterpretations can cause.
  2. “I predict that advocacy activities could be a big morale boost” – I experience existing advocacy efforts as very demoralizing: people make wildly overconfident contrarian claims that I think are incorrect, they get lots of attention because of how contrarian the takes are, and they use this attention to talk about how the work I do isn’t helping.

AI is centralizing by default; let's not make it worse: Half of this post is about how AI is easier to control than humans. I agree with the positive arguments outlining advantages we have in controlling AIs. But when addressing counterarguments, Quintin fails to bring up the one that seems most obvious and important to me, that AIs will become much more capable and intelligent than humans, which could make controlling them difficult. Personally, I think it’s still unclear whether AI will be easier or harder to control than humans.

The other half of the post is about centralization effects of AI, warning against extreme levels of centralization. I didn’t get a good picture about what extreme levels of centralization look like and why they are x-risky (are we talking totalitarianism?). I’m not sure whether the post is recommending against pauses because of centralizing effects, but if it is I probably disagree because I expect the other effects of a pause would be significantly more important.

We are not alone: many communities want to stop Big Tech from scaling unsafe AI: I agree with the title, but I think the mitigations that other communities would want would be quite different from the ones that would make sense for alignment.

  1. ^

    One might say that iterative deployment is no better than full steam ahead because mitigations will fail to generalize to higher capability levels, either because x-risks are qualitatively different or because mitigations tend to break once you exceed human-level. I disagree, but I won’t defend that here.

  2. ^

    Epoch’s trends dashboard currently estimates that the required compute drops by 2.5x = 0.4 OOM (order of magnitude) per year due to algorithmic progress, and the required money per unit compute drops by 1.32x = 0.1 OOM per year due to hardware progress, for a combined estimate of 3.3x = 0.5 OOM per year.
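    These OOM conversions can be sanity-checked directly (a quick arithmetic check, not part of Epoch’s methodology):

    ```python
    import math

    # Convert multiplicative annual improvement factors to orders of magnitude (OOM).
    algorithmic = math.log10(2.5)      # ~0.4 OOM/year from algorithmic progress
    hardware = math.log10(1.32)        # ~0.12 OOM/year from hardware progress
    combined = math.log10(2.5 * 1.32)  # ~0.5 OOM/year combined (3.3x)

    print(round(algorithmic, 2), round(hardware, 2), round(combined, 2))
    # 0.4 0.12 0.52
    ```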

  3. ^

    Some people think of this as requiring extreme surveillance, but I don’t think that’s necessary: ultimately you need to prevent algorithmic progress, compute progress, and usage of large clusters of compute for big training runs; I expect you can get really far on (1) and (2) by banning research on AI and on compute efficiency, enforced by monitoring large groups of researchers (academia, industry, nonprofits, etc), as well as monitoring published research (in particular, we do not enforce it by perpetually spying on all individuals), and on (3) by monitoring large compute clusters. This doesn’t defend against situations where a small group of individuals can build AGI in their basement – but I assign very low probability to that scenario. It’s only once you can build AGI with small amounts of compute and publicly available techniques that you need extreme surveillance to significantly reduce the chance that AGI is built.

  4. ^

    I initially had an exception here for mechanistic interpretability: most current work isn’t done on the largest models, since the core problems to be solved (e.g. feature discovery, superposition, circuit identification) can be easily exhibited with small models. However, on reflection I think even mechanistic interpretability gets significant benefits. In particular, there is a decent chance that there are significant architecture changes on the way to powerful models; it is much better for mechanistic interpretability to know that earlier rather than later. See for example the fairly big differences in mechanistic interpretability for Transformers vs. CNNs.

  5. ^

    This is a bit subtle. I expect pauses to be longer in worlds where we have unconditional pauses at GPT-4 than in worlds where we have conditional pauses, because having an unconditional pause at GPT-4 is evidence of the world at large caring a lot about AI risk, and/or being more conservative by default. However, these are evidential effects, and for decision-making we should ignore these. The more relevant point is: if you take a world where we “could have” implemented an unconditional pause at GPT-4, hold fixed the level of motivation to reduce AI risk, and instead implement a conditional pause that kicks in later, probably the conditional pause will last longer because the potential harms for GPT-5 will galvanize even more support than existed at the time of GPT-4.

  6. ^

    There are also safety techniques like anomaly detection on model usage that benefit from real-world usage data.

  7. ^

    This isn’t a great conditional pause proposal; we can and should choose better evaluations. My point is just that this is a concrete proposal that is more closely tied to danger than the unconditional pause proposal, while still having a really low chance of failing to pause before AI systems become x-risky, unless you believe in really fast takeoff.

  8. ^

    However, typically the “good cops” are still, well, cops: they are separate organizations with a mission to hold a responsible party to account. The Humane League is a separate organization that negotiates with Big Ag, not an animal welfare department within a Big Ag company. In contrast, with AI x-risk, many of the AGI labs have x-risk focused departments. It’s unclear whether this would continue in a good cop / bad cop routine.

  9. ^

    Both because the public understanding of AI x-risk will lack nuance and so will call for less good work, and because companies will want to protect their intellectual property (IP) and so would push for safety work on topics that can be more easily talked about.

  10. ^

    Note I don’t know much about environmentalism and haven’t vetted these examples; I wouldn’t be surprised if a couple turned out to be misleading or wrong.

  11. ^

    You can also get public pressure from small but vocal and active special interest groups. I expect this looks more like using the members of such groups to carry out targeted advocacy to policymakers, so I’d categorize this as (2) in my list of types of advocacy, and I’m broadly in support of it.

  12. ^

    See for example How valuable is movement growth?, though it is unfortunately not AI-specific.

  13. ^

    A few caveats: (a) Certainly some aspects of AI will be controversial, simply because AI has applications to already controversial areas, e.g. military uses. I’m saying that AI x-risk in particular doesn’t seem like it has to be controversial. (b) My prediction is that AI x-risk will be controversial, but the main reason for that is because parts of the AI x-risk community are intent on making strong, poorly-backed, controversial claims and then call for huge changes, instead of limiting themselves to the things that most people would find reasonable. In this piece, since I’m writing to that audience to influence their actions, I’m imagining that they stop doing those things as I wish they would do – if that were to happen, it seems quite plausible to me that AI x-risk wouldn’t become controversial.

  14. ^

    For example, Holly’s post: “AI is going to become politicized whether we get involved in it or not”.

  15. ^

    This is a controversial view, but I’d guess it’s a majority opinion amongst AI alignment researchers.

  16. ^

    Reasons include: access to the best alignment talent, availability of state of the art models, ability to work with AI systems deployed at scale, access to compute, ability to leverage existing engineering efforts, access to company plans and other confidential IP, access to advice from AI researchers at the frontier, job security, mature organizational processes, etc. Most of these reasons are fundamentally tied to the AGI labs and can’t easily be ported to EA nonprofits if the AGI labs start to Goodhart on public perception. Note there are countervailing considerations as well, e.g. the AGI labs have more organizational politics.

  17. ^

    As one piece of evidence, even a successful case like NEPA for environmentalism may have been the result of luck.

  18. ^

    If you require safety to be shown before you train the model, you are either imposing an unconditional pause, or you are getting the AGI companies to lie to you, both of which seem bad.

  19. ^

    We’re currently able to scale up investment in AI massively, because it was so low to begin with, but eventually we’ll run out of room to scale up investment.

  20. ^

    This does break down at extremes – if you are very confident that alignment will work out fine, then you might care more about getting the benefits of AGI sooner; if you are extremely confident that alignment won’t work, then you may want to aim for an unconditional permanent pause even though it isn’t tractable, and view an unconditional temporary pause as good progress.

  21. ^

    The key difference is that a conditional pause proposal would not involve “emergency” powers: the pause would be a default, expected result of capabilities scaling as models gain dangerous capabilities – until the lab can also demonstrate mitigations that ensure the models will not use those dangerous capabilities.

  22. ^

    Though I don’t trust polls as much as Holly seems to. For example, 32% of Americans say that animals should be given the same rights as people – but even self-reported veg*nism rates are much lower than that.

Comments

Thanks - I definitely don't completely agree, but it's good to hear that people at the labs take this seriously. 

Given that, I'll respond to your response. 

I think the idea here is that we should eventually aim for an international regulatory regime that is not a pause, but a significant chunk of x-risk from misaligned AI is in the near future, so we should enact an unconditional pause right now.

If so, my main disagreement is that I think the pause we enact right now should be conditional: specifically, I think it’s important that you evaluate the safety of a model after you train it, not before[18]. I may also disagree (perhaps controversially) that a significant chunk of x-risk from misaligned AI is in the near future, depending on what “near future” means.

This seems basically on point, and I think we disagree less than it seems. To explain why, two things are critical. 

The first is that x-risks from models come in many flavors, and uncontrolled AI takeoff is only one longer term concern. If there are nearer term risks from misuse, which I think are even more critical for immediate action, we should be worried about short timelines, in the lower single digit year range. (Perhaps you disagree?)  

The second is that "near future" means really different things to engineers, domestic political figures, and treaty negotiators. Political deals take years to make - we're optimistically a year or more from getting anything in place. Even if a bill is proposed today and miraculously passes tomorrow, it likely would not go into effect for months or years, and that's excluding court challenges. I don't think we want to wait to get this started - we're already behind schedule! 

(OpenAI says we have 3-4 years until ASI. I'm more skeptical, but also confused by their opposition to immediate attempts to regulate - by the time we get to AGI, it won't matter if regulations are only just coming online; abuse risks will be stratospheric.)

Either way, the suggestion that we shouldn't pause now is already a done deal, and a delayed pause starting in 2 years and lasting until there is a treaty, which is what I proposed, already seems behind schedule. If you'd prefer that it be conditional, that seems fine - though I think we're less than 2 years from hitting the remaining fire alarms, so your pause conditions might trigger a stop faster than mine would. But I think we need something to light a fire to act faster.

Why? Because years are likely needed just to get people on board for negotiating a real international regulatory regime! Climate treaties take a decade or more to convene, then set their milestones to hit a decade or more in the future; even narrowly bilateral nuclear arms control talks take years to start, span 3-4 years before there is a treaty, and the resulting treaty has a 30-year time span.

I'm optimistic that in worlds where risk shows up rapidly, we'll see treaties written and signed faster than this - but I think that either way, pushing hard now makes sense. To explain why, I'll note that I predicted that COVID-19 would create incentives to get vaccines faster than previously seen, and along the same logic, I think the growing and increasingly obvious risks from AGI will cause us to do something similar - but this is conditioned on there being significant and rapid advances that scare people much more, which I think you say you doubt. 

If you think that significant realized risks are further away, so that we have time to wait for a conditional pause, I think it's also unreasonable to predict these shorter time frames. So conditional on what seem to be your beliefs about AI risk timelines, I claim I'm right to push now, and conditional on my tentative concerns about shorter timelines, I claim we're badly behind schedule. Either way, we should be pushing for action now.

If there are nearer term risks from misuse, which I think are even more critical for immediate action, we should be worried about short timelines, in the lower single digit year range. (Perhaps you disagree?)

I agree, but in that case I'm even more baffled by the call for an unconditional pause. It's so much easier to tell whether an AI system can be misused, relative to whether it is misaligned! Why would you not just ask for a ban on AI systems that pose great misuse risks?

In the rest of your comment you seem to assume that I think we shouldn't push for action now. I'm not sure why you think that -- I'm very interested in pushing for a conditional pause now, because as you say it takes time to actually implement.

(I agree that we mostly don't disagree.)

I don't think you can make general AI systems that are powerful enough to be used for misuse that aren't powerful enough to pose risks of essentially full takeoff. I talked about this in terms of autonomy, but I think the examples used illustrate why these are nearly impossible to separate. Misuse risk isn't an unfortunate correlate of greater general capabilities; it's exactly the same thing.

That said, if you're not opposed to immediate steps, I think that where we differ most is a strategic detail about how to get to the point where we have a robust international system that conditionally bans only potentially dangerous systems, namely, whether we need a blanket ban on very large models until that system is in place.

I think that where we differ most is a strategic detail about how to get to the point where we have a robust international system that conditionally bans only potentially dangerous systems, namely, whether we need a blanket ban on very large models until that system is in place.

Yes, but it seems like an important detail, and I still haven't seen you give a single reason why we should do the blanket ban instead of the conditional one. (I named four in my post, but I'm not sure if you care about any of them.)

The conditional one should get way more support! That seems really important! Why is everyone dying on the hill of a blanket ban?

I don't think you can make general AI systems that are powerful enough to be used for misuse that aren't powerful enough to pose risks of essentially full takeoff.

There are huge differences between misuse threat models and misalignment threat models, even if you set aside whether models will be misaligned in the first place:

  1. In misuse, an AI doesn't have to hide its bad actions (e.g. it can do chains of thought that explicitly reason about how to cause harm).
  2. In misuse, an AI gets active help from humans (e.g. maybe the AI is great at creating bioweapon designs, but only the human can get them synthesized and released into the world).

It's possible that the capabilities to overcome these issues come at the same time as the capability for misuse, but I doubt that will happen.

Also, this seems to contradict your previous position:

The first is that x-risks from models come in many flavors, and uncontrolled AI takeoff is only one longer term concern. If there are nearer term risks from misuse, which I think are even more critical for immediate action, we should be worried about short timelines, in the lower single digit year range.  

I didn't say "blanket ban on ML," I said "a blanket ban on very large models."

Why? Because I have not seen, and don't think anyone can make, a clear criterion for "too high risk of doom," both because there are value judgements, and because we don't know how capabilities arise - so there needs to be some cutoff past which we ban everything, until we have an actual regulatory structure in place to evaluate. Is that different than the "conditional ban" you're advocating? If so, how?

I didn't say "blanket ban on ML," I said "a blanket ban on very large models."

I know? I never said you asked for a blanket ban on ML?

Because I have not seen, and don't think anyone can make, a clear criterion for "too high risk of doom,"

  1. My post discusses “pause once an agent passes 10 or more of the ARC Evals tasks”. I think this is too weak a criterion and I'd argue for a harder test, but I think this is already better than a blanket ban on very large models.
  2. Anthropic just committed to a conditional pause.
  3. ARC Evals is pushing responsible scaling, which is a conditional pause proposal.

But also, the blanket ban on very large models is implicitly saying that "more powerful than GPT-4" is the criterion of "too high risk of doom", so I really don't understand at all where you're coming from.

Iterative deployment. We treat AGI like we would treat many other new technologies: something that could pose risks, which we should think about and mitigate, but ultimately something we should learn about through iterative deployment. The default is to deploy new AI systems, see what happens with a particular eye towards noticing harms, and then design appropriate mitigations. In addition, AI systems are deployed with a rollback mechanism, so that if a deployment causes significant harms, the deployment can be reversed.

[...]

Conditional pause. We institute regulations that say that capability improvement must pause once the AI system hits a particular threshold of riskiness, as determined by some relatively standardized evaluations, with some room for error built in. AI development can only continue once the developer has exhibited sufficient evidence that the risk will not arise.

Compared to you, I'm more pessimistic about these two measures. On iterative deployment, I'm skeptical about the safety of rollback mechanisms. On conditional pause, I agree it makes total sense to pause at the latest point possible as long as things are still pretty likely to be safe. However, I don't see why we aren't already at that point.

I suspect that our main crux might be a disagreement over takeoff speeds, and perhaps AI timelines being another (more minor) crux?

On takeoff speeds, I place 70% on 0.5-2.5 orders of magnitude for what Tom Davidson calls the FLOP gap. (Relatively high robustness because I thought about this a lot.) I also worry less about metrics of economic growth/economic change because I believe it won't take long from "AI makes a noticeable dent on human macroeconomic productivity" to "it's now possible to run millions of AIs that are ~better/faster at all tasks human experts can do on a computer." The latter scenario is one from which it is easy to imagine how AIs might disempower humans. I basically don't see how one can safely roll things back at the point where generally smarter-than-human AIs exist that can copy themselves millionfold on the internet.

On timelines, I have maybe 34% on 3 years or less, and 11% on 1 year. 

Why do I have these views?

A large part of the story is that if you had described to me back in 2018 all the AI capabilities we have today, without mentioning the specific year by which we'd have those capabilities, I'd have said "once we're there, we're probably very close to transformative AI." And now that we are at this stage, even though it's much sooner than I'd have expected, I feel like the right direction of update is "AI timelines are sooner than expected" rather than "(cognitive) takeoff speeds must be slower than I'd have expected." 

Maybe this again comes down to my specific view on takeoff speeds. I felt more confident that takeoff won't be super slow than I felt confident about anything timelines-related. 

So, why the confident view on takeoff speeds? Just looking at humans vs chimpanzees, I'm amazed by the comparatively small difference in brain size. We can also speculate, based on the way evolution operates, that there's probably not much room for secret-sauce machinery in the human brain (that chimpanzees don't already have). 

The main counterargument from the slow(er) takeoff crowd on the chimpanzees vs. humans comparison is that humans faced much stronger selection pressure for intelligence, which must have tweaked a lot of other things besides brain size, and since chimpanzees didn't face that same selection pressure, the evolutionary comparison underestimates how smart an animal with a chimpanzee-sized brain would be if it had also undergone strong selection pressure for the sort of niche that humans inhabit ("intelligence niche"). I find that counterargument slightly convincing, but not convincing enough to narrow my FLOP gap estimates too much. Compared to ML progress where we often 10x compute between models, evolution was operating way more slowly.

As I've written in an (unpublished) review of (a draft of) the FLOP gap report:

I’ll now reply to sections in the report where Tom discusses evidence for the FLOP gap and point out where I’d draw different conclusions. I’ll start with the argument that I consider the biggest crux between Tom and me.

Tom seems to think chimpanzees – or even rats, with much smaller probability – could plausibly automate a significant percent of cognitive tasks “if they had been thoroughly evolutionarily prepared for this” (I’m paraphrasing). A related argument in this spirit is Paul Christiano’s argument that humans far outclass chimpanzees not because of a discontinuity in intelligence, but because chimpanzees hadn’t been naturally selected to be good at a sufficiently general (or generalizable) range of skills.

I think this argument is in tension with the observation that people at the lower range of human intelligence (as measured by IQ, for instance) tend to struggle to find and hold jobs. If natural selection had straightforwardly turned all humans into specialists for “something closely correlated with economically useful tasks,” then how come the range for human intelligence is still so wide? I suspect that the reason humans were selected for (general) intelligence more than chimpanzees is because of some “discontinuity” in the first place. This comment by Carl Shulman explains it as follows (my emphasis in bold):

“Hominid culture took off enabled by human capabilities [so we are not incredibly far from the minimum need for strongly accumulating culture, the selection effect you reference in the post], and kept rising over hundreds of thousands and millions of years, at accelerating pace as the population grew with new tech, expediting further technical advance.”

Admittedly, Carl also writes the following: 

“Different regions advanced at different rates (generally larger connected regions grew faster, with more innovators to accumulate innovations), but all but the smallest advanced. So if humans overall had lower cognitive abilities there would be slack for technological advance to have happened anyway, just at slower rates (perhaps manyfold), accumulating more by trial and error.”

So, perhaps chimpanzees (or bonobos), if they had been evolutionarily prepared for social learning or for performing well in the economy, could indeed perform about 20% of today’s tasks. But that’s also my cutoff point: I think we can somewhat confidently rule out that smaller-brained primates, let alone rodents such as rats, could do the same thing in this hypothetical. (Or, in case I’m wrong, it would have to be because it’s possible to turn all of the rodent’s cognition into a highly specialized “tool” that exploits various advantages it can reach over humans – enabling partial automation of specific workflows, but not full automation of sectors.)

A conditional pause fails to prevent x-risk if either:

  • The AI successfully exfiltrates itself (which is what's needed to defeat rollback mechanisms) during training or evaluation, but before deployment.
  • The AI successfully sandbags the evaluations. (Note that existing conditional pause proposals depend on capability evaluations, not alignment evaluations.)

(Obviously there's also the normal institutional failures, e.g. if a company simply ignores the evaluation requirements and forges ahead. I'm setting those aside here.)

Both of these seem extremely difficult to me (likely beyond human-level, in the sense that if you somehow put a human in the situation the AI would be in, I would expect the human to fail).

How likely do you think it is that we get an AI capable of one of these failure modes, before we see an AI capable of e.g. passing 10 out of the 12 ARC Evals tasks? My answer would be "negligible", and so I'm at least in favor of "pause once you pass 10 out of 12 ARC Evals tasks" over "pause now". I think we can raise the difficulty of the bar a decent bit more before my answer stops being "negligible".

I don't think this depends on takeoff speeds at all, since I'd expect a conditional pause proposal to lead to a pause well before models are automating 20% of tasks (assuming no good mitigations in place by that point).

I don't think this depends on timelines, except inasmuch as short timelines correlates with discontinuous jumps in capability. If anything it seems like shorter timelines argue more strongly for a conditional pause proposal, since it seems far easier to build support for and enact a conditional pause.

I don't think this depends on takeoff speeds at all, since I'd expect a conditional pause proposal to lead to a pause well before models are automating 20% of tasks (assuming no good mitigations in place by that point).

I should have emphasized that I'm talking about cognitive AI takeoff, not economic takeoff. 

I don't have a strong view whether there are ~20% of human tasks that are easy and regular/streamlined enough to automate with stochastic-parrot AI tools. Be that as it may, what's more important is what happens once AIs pass the reliability threshold that makes someone a great "general assistant" in all sorts of domains. From there, I think it's just a tiny step further to also being a great CEO. Because these capability levels are so close to each other on my model, the world may still look similar to ours at that point.

All of that said, it's not like I consider it particularly likely that a system would blow past all the evals you're talking about in a single swoop, especially since some of them will be (slightly) before the point of being a great "general assistant." I also have significant trust that the people designing these evals will be thinking about these concerns. I think it's going to be very challenging to make sure evals organizations (or evals teams inside labs in case it's done lab-internally) have enough political power and stay uncorrupted by pressures to be friendly towards influential lab leadership. These problems are surmountable in theory, but I think it'll be hard, so I'm hoping the people working on this are aware of all that could go wrong. I recently wrote up some quick thoughts on safety evals here. Overall, I'm probably happy enough with a really well-thought out "conditional pause" proposal, but I'd need to be reassured that the people who decide in favor of that can pass the Ideological Turing test for positions like fast takeoff or the point that economic milestones like "20% of tasks are automated" are probably irrelevant. 

Sounds like we roughly agree on actions, even if not beliefs (I'm less sold on fast / discontinuous takeoff than you are).

As a minor note, to keep incentives good, you could pay evaluators / auditors based on how much performance they are able to elicit. You could even require that models be evaluated by at least three auditors, and split up payment between them based on their relative performances. In general it feels like there a huge space of possibilities that has barely been explored.
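As a toy illustration of that split-payment idea (all names and numbers here are invented; this is just a sketch of the incentive structure, not a concrete proposal):

```python
# Toy sketch: each auditor's payout is proportional to the capability they
# managed to elicit from the model, so auditors are rewarded for surfacing
# capabilities rather than for missing them.
def split_auditor_payments(total_budget, elicited_scores):
    """elicited_scores maps auditor name -> capability score they elicited."""
    total = sum(elicited_scores.values())
    if total == 0:
        # Nobody elicited anything: split evenly, so auditors aren't
        # punished when the model genuinely lacks the capability.
        return {name: total_budget / len(elicited_scores)
                for name in elicited_scores}
    return {name: total_budget * score / total
            for name, score in elicited_scores.items()}

# Three auditors evaluate the same model; B elicits the strongest behavior
# and therefore receives the largest share of the budget.
payments = split_auditor_payments(300_000, {"A": 8, "B": 10, "C": 2})
print(payments)  # {'A': 120000.0, 'B': 150000.0, 'C': 30000.0}
```

Under this scheme an auditor gains nothing by under-reporting what the model can do, which is the incentive property the comment is pointing at.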

A conditional pause fails to prevent x-risk if either:

  • The AI successfully exfiltrates itself (which is what's needed to defeat rollback mechanisms) during training or evaluation, but before deployment. 
  • The AI successfully sandbags the evaluations. (Note that existing conditional pause proposals depend on capability evaluations, not alignment evaluations.)

Another way evals could fail is if they work locally but it's still too late in the relevant sense because even with the pause mechanism kicking in (e.g., "from now on, any training runs that use 0.2x the compute of the model that failed the eval will be prohibited"), algorithmic progress at another lab or driven by people on twitter(/X) tinkering with older, already-released models, will get someone to the same capability threshold again soon enough. There's probably a degree to which algorithmic insights can substitute for having more compute available. So, even if a failed eval triggers a global pause, if cognitive takeoff dynamics are sufficiently fast/"discontinuous," a failed eval may just mean that we're now so close to takeoff that it's too hard to prevent. Monitoring algorithmic progress is particularly difficult (I think compute is the much better lever). 

Of course, you can still incorporate this concern by adding enough safety margin to when your evals trigger the pause button. Maybe that's okay! 

But then we're back to my point of "Are we so sure that the evals (with the proper safety margin) shouldn't have triggered today already?" 

(It's not saying that I'm totally confident about any of this. Just flagging that it seems like a tough forecasting question to determine what an appropriate safety margin is on the assumption that algorithmic progress is hard to monitor/regulate and the assumption that fast takeoff into and through the human range is very much a possibility.)

even with the pause mechanism kicking in (e.g., "from now on, any training runs that use 0.2x the compute of the model that failed the eval will be prohibited"), algorithmic progress at another lab or driven by people on twitter(/X) tinkering with older, already-released models, will get someone to the same capability threshold again soon enough. [...] you can still incorporate this concern by adding enough safety margin to when your evals trigger the pause button.

Safety margin is one way, but I'd be much more keen on continuing to monitor the strongest models even after the pause has kicked in, so that you notice the effects of algorithmic progress and can tighten controls if needed. This includes rolling back model releases if people on Twitter tinker with them to exceed capability thresholds. (This also implies no open sourcing, since you can't roll back if you've open sourced the model.)

But I also wish you'd say what exactly your alternative course of action is, and why it's better. E.g. the worry of "algorithmic progress gets you to the threshold" also applies to unconditional pauses. Right now your comments feel to me like a search for anything negative about a conditional pause, without checking whether that negative applies to other courses of action.

But I also wish you'd say what exactly your alternative course of action is, and why it's better. E.g. the worry of "algorithmic progress gets you to the threshold" also applies to unconditional pauses. Right now your comments feel to me like a search for anything negative about a conditional pause, without checking whether that negative applies to other courses of action.

The way I see it, the main difference between conditional vs unconditional pause is that the unconditional pause comes with a bigger safety margin (as big as we can muster). So, given that I'm more worried about surprising takeoffs, that position seems prima facie more appealing to me.

In addition, as I say in my other comment, I'm open to (edit: or, more strongly, I'd ideally prefer this!) some especially safety-conscious research continuing onwards through the pause. I gather that this is one of your primary concerns? I agree that an outcome where that's possible requires nuanced discourse, which we may not get if public reaction to AI goes too far in one direction. So, I agree that there's a tradeoff around public advocacy. 

It would be bad to create significant public pressure for a pause through advocacy, because this would cause relevant actors (particularly AGI labs) to spend their effort on looking good to the public, rather than doing what is actually good.

I think I can reasonably model the safety teams at AGI labs as genuinely trying to do good. But I don't know that the AGI labs as organizations are best modeled as trying to do good, rather than optimizing for objectives like outperforming competitors, attracting investment, and advancing exciting capabilities – subject to some safety-related concerns from leadership. That said, public pressure could manifest itself in a variety of ways, some of which might work toward more or less productive goals.

I agree that conditional pauses are better than unconditional pauses, due to pragmatic factors. But I worry about AGI labs specification-gaming their way through dangerous-capability evaluations, using brittle band-aid fixes that don't meaningfully contribute to safety.

I don't know that the AGI labs as organizations are best modeled as trying to do good, rather than optimizing for objectives like outperforming competitors, attracting investment, and advancing exciting capabilities – subject to some safety-related concerns from leadership.

I will go further -- it's definitely the latter one for at least Google DeepMind and OpenAI; Anthropic is arguable. I still think that's a much better situation than having public pressure when the ask is very nuanced (as it would be for alignment research).

For example, I'm currently glad that the EA community does not have the power to exert much pressure on the work done by the safety teams at AGI labs, because the EA community's opinions on what alignment research should be done are bad, and the community doesn't have the self-awareness to notice that themselves, and so instead safety teams at labs would have to spend hundreds of hours writing detailed posts explaining their work to defuse the pressure from the EA community.

At least there I expect the safety teams could defuse the pressure by spending hundreds of hours writing detailed posts, because EAs will read that. With the public there's no hope of that, and safety teams instead just won't do the work that they think is good, and instead do work that they can sell to the public.

While I’m very uncertain, on balance I think it provides more serial time to do alignment research. As model capabilities improve and we get more legible evidence of AI risk, the will to pause should increase, and so the expected length of a pause should also increase [footnote explaining that the mechanism here is that the dangers of GPT-5 galvanize more support than GPT-4]

I appreciate flagging the uncertainty; this argument doesn't seem right to me. 

One factor affecting the length of a pause would be the (opportunity cost from pause) / (risk of catastrophe from unpause) ratio for marginal pause days - that is, the ratio of costs to benefits. I expect both the costs and the benefits of AI pause days to go up in the future, because risks of misalignment/misuse will be greater, and because AIs will be deployed in ways that add a bunch of value to society (whether the marginal improvements are huge is unclear - e.g., GPT-6 might add tons of value, but how much more GPT-6.5 adds on top of that seems hard to tell). I don't know how the ratio will change, which is probably what actually matters. But I wouldn't be surprised if that numerator (opportunity cost) shot up a ton.

I think it's reasonable to expect that marginal improvements to AI systems in the future (e.g., scaling up 5x) could map on to automating an additional 1-7% of a nation's economy. Delaying this by a month would be a huge loss (or a benefit, depending on how the transition is going). 

What relevant decision makers think the costs and benefits are is what actually matters, not the true values. So even if right now I can look ahead and see that an immediate pause pushes back future tremendous economic growth, this feature may not become apparent to others until later. 

To try and say what I'm getting at a different way: you're suggesting that we get a longer pause if we pause later than if we pause now. I think that "races" around AI are going to ~monotonically get worse and that the perceived cost of pausing will shoot up a bunch. If we're early on an exponential of AI creating value in the world, it just seems way easier to pause for longer than it will be later on. If this doesn't make sense I can try to explain more. 

I agree it's important to think about the perceived opportunity cost as well, and that's a large part of why I'm uncertain. I probably should have said that in the post.

I'd still guess that overall the increased clarity on risks will be the bigger factor -- it seems to me that risk aversion is a much larger driver of policy than worries about economic opportunity cost (see e.g. COVID lockdowns). I would be more worried about powerful AI systems being seen as integral to national security; my understanding is that national security concerns drive a lot of policy. (But this could potentially be overcome with international agreements.)

Interesting post! 

I like your points against the value of public advocacy. I'm not convinced overall, but that's probably mostly because I'm in an overall more pessimistic state where I don't mind trying more desperate measures. 

One small comment on incentives for alignment researchers:

Goodhart’s Law (labs): Safety teams face large pressure to do things that would look good to the public, which is different from doing work that is actually good[9].

I feel like there are already a bunch of misaligned incentives for alignment researchers (and all groups of people in general), so if the people on the safety team aren't selected to be great at caring about the best object-level work, then we're in trouble either way. My model might be too "black and white," but I basically think that people fall into categories where they are either lost causes or they have trained themselves cognitive habits about how to steer clear of distorting influences. Sure, the incentive environment still makes a difference, but I find it unlikely that it's among the most important considerations for evaluating the value of masses-facing AI pause advocacy.

I feel like there are already a bunch of misaligned incentives for alignment researchers (and all groups of people in general), so if the people on the safety team aren't selected to be great at caring about the best object-level work, then we're in trouble either way.

I'm not thinking that much about the motivations of the people on the safety team. I'm thinking of things like:

  1. Resourcing for the safety team (hiring, compute, etc) is conditioned on whether the work produces big splashy announcements that the public will like
  2. Other teams in the company that are doing things that the public likes will be rebranded as safety teams
  3. When safety teams make recommendations for safety interventions on the strongest AI systems, their recommendations are rejected if the public wouldn't like them
  4. When safety teams do outreach to other researchers in the company, they have to self-censor for fear that a well-meaning whistleblower will cause a PR disaster by leaking an opinion of the safety team that the public has deemed to be wrong

See also Unconscious Economics.

(I should have said that companies will face pressure, rather than safety teams. I'll edit that now.)

I like your points against the value of public advocacy. I'm not convinced overall, but that's probably mostly because I'm in an overall more pessimistic state where I don't mind trying more desperate measures. 

People say this a lot, but I don't get it.

  1. Your baseline has to be really pessimistic before it looks good to throw in a negative-expectation high-variance intervention. (Perhaps worth making some math models and seeing when it looks good vs not.) Afaict only MIRI-sphere is pessimistic enough for this to make sense.
  2. It's very uncooperative and unilateralist. I don't know why exactly it has become okay to say "well I think alignment is doomed, so it's fine if I ruin everyone else's work on alignment with a negative-expectation intervention", but I dislike it and want it to stop.

Or to put it a bit more viscerally: It feels crazy to me that when I say "here are reasons your intervention is increasing x-risk", the response is "I'm pessimistic, so actually while I agree that the effect in a typical world is to increase x-risk, it turns out that there's this tiny slice of worlds where it made the difference and that makes the intervention good actually". It could be true, but it sure throws up a lot of red flags.


I agree that unilateralism is bad. I'm still in discussion mode rather than confidently advocating for some specific hard-to-reverse intervention. (I should have flagged that explicitly.)

I think it's not just the MIRI-sphere that's very pessimistic, so there might be a situation where two camps disagree but neither camp is obviously small enough to be labelled unilateralist defectors. Seems important to figure out what to do from a group-rationality perspective in that situation. Maybe the best thing would be to agree on predictions that tell us what world we're more likely in, and then commit to a specific action once one group turned out to be wrong about their worldview's major crux/cruxes. (Assuming we have time for that.) 

It feels crazy to me that when I say "here are reasons your intervention is increasing x-risk", the response is "I'm pessimistic, so actually while I agree that the effect in a typical world is to increase x-risk, it turns out that there's this tiny slice of worlds where it made the difference and that makes the intervention good actually".

That's not how I'd put it. I think we are still in a "typical" world, but the world that optimistic EAs assume we are in is the unlikely one where institutions around AI development and deployment suddenly turn out to be saner than our baseline would suggest. (If someone had strong reasons to think something like "[leader of major AI lab] is a leader with exceptionally high integrity who cares the most about doing the right thing; his understanding of the research and risks is pretty great, and he really knows how to manage teams and so on, so that's why I'm confident," then I'd be like "okay, that makes sense.")

That's not how I'd put it. I think we are still in a "typical" world, but the world that optimistic EAs assume we are in is the unlikely one where institutions around AI development and deployment suddenly turn out to be saner than our baseline would suggest.

I don't see at all how this justifies that public advocacy is good? From my perspective you're assuming we're in an unlikely world where the public turns out to be saner than our baseline would suggest. I don't think I have a lot of trust in institutions (though maybe I do have more trust than you do); I think I have a deep distrust of politics and the public.

I'm also no longer sure I understood your original argument. The argument I thought you were making was something like:

Consider an instrumental variable like "quality-weighted interventions that humanity puts in place to reduce AI x-risk". Then public advocacy is:

  1. Negative expectation: Public advocacy reduces the expected value of quality-weighted interventions, for the reasons given in the post.
  2. High variance: Public advocacy also increases the variance of quality-weighted interventions (e.g. maybe we get a complete ban on all AI, which seems impossible without public advocacy).

However, I am pessimistic:

  1. Pessimism: The quality-weighted interventions required to avoid doom are much greater than the default quality-weighted interventions we're going to get.

Therefore, even though public advocacy is negative-expectation on quality-weighted interventions, it still reduces p(doom) due to its high variance.

(This is the only way I see to justify rebuttals like "I'm in an overall more pessimistic state where I don't mind trying more desperate measures", though perhaps I'm missing something.)

Is this what you meant with your original argument? If not, can you expand?

I think it's not just the MIRI-sphere that's very pessimistic

What is your p(doom)?

For reference, I think it seems crazy to advocate for negative-expectation high-variance interventions if you have p(doom) < 50%. As a first pass heuristic, I think it still seems pretty unreasonable all the way up to a p(doom) of 90%, though this could be overruled by details of the intervention (how negative the expectation, how high the variance).

From my perspective you're assuming we're in an unlikely world where the public turns out to be saner than our baseline would suggest.

Hm, or that we get lucky in terms of the public's response being a good one given the circumstances, even if I don't expect the discourse to be nuanced. It seems like a reasonable stance to think that a crude reaction of "let's stop this research before it's too late" is appropriate as a first step, and that it's okay to worry about other things later on. The negatives you point out are certainly significant, so if we could get a conditional pause set up through other channels, that seems clearly better! But my sense is that it's unlikely we'd succeed at getting ambitious measures in place without some amount of public pressure. (For what it's worth, I think the public pressure is already mounting, so I'm not necessarily saying we have to ramp up the advocacy side a lot – I'm definitely against forming PETA-style anti-AI movements.)

As a first pass heuristic, I think it still seems pretty unreasonable all the way up to a p(doom) of 90%,

It also matters how much weight you give to person-affecting views (I've argued here for why I think they're not unreasonable). If we can delay AI takeoff for five years, that's worth a lot from the perspective of currently-existing people! (It's probably also weakly positive, or at least neutral, from a suffering-focused longtermist perspective, because everything seems uncertain from that perspective and the first-pass effect of a delay is to keep things from getting bigger; though I guess you could argue that particular s-risks are lower if more alignment-research-type reflection goes into AI development.) Of course, buying a delay that somewhat (but not tremendously) worsens your chances later on is a huge cost to upside-focused longtermism. But if we assume that we're already empirically pessimistic on that view to begin with, then it's an open question how a moral parliament between worldviews would bargain things out. Certainly the upside-focused longtermist faction should get important concessions, like "try to ensure that actually-good alignment research doesn't fall under the type of AI research that will be prohibited."

What is your p(doom)?

My all-things-considered view (the one I would bet on) is maybe 77%. My private view (what to report to avoid double-counting the opinions of people the community updates towards) is more like 89%. (This doesn't consider scenarios where AI is misaligned but still nice towards humans for weird decision-theoretic reasons where the AI is cooperating with other AIs elsewhere in the multiverse – not that I consider that particularly likely, but I think it's too confusing to keep track of that in the same assessment.)

Some context on that estimate: When I look at history, I don't think of humans being "in control" of things. I'm skeptical of Stephen Pinker's "Better Angels" framing. Sure, a bunch of easily measurable metrics got better (though some may even have been reversing over the last couple of years, e.g., life expectancy is lower now than it used to be before Covid). However, at the same time, technological progress introduced new problems of its own that don't seem anywhere close to being addressed (e.g., social media addiction, increases in loneliness, maybe polarization via attention-grabbing news). Even if there's an underlying trend of "Better Angels," there's also a trend of "new technology increases the strength of Molochian forces." We seem to be losing that battle! AI is an opportunity to gain control for the first time via superhuman intelligence/rationality/foresight to bail us out and rein in Molochian forces once and for all, but to get there, we have to accomplish an immense feat of coordination. I don't see why people are by-default optimistic about something like that. If anything, my 11% that humans will gain control over history for the first time ever seems like the outlandish prediction here! The 89% p(doom) is more like what we should expect by default: things get faster and out of control and then that's it for humans.

From my perspective you're assuming we're in an unlikely world where the public turns out to be saner than our baseline would suggest.

Hm, or that we get lucky in terms of the public's response being a good one given the circumstances, even if I don't expect the discourse to be nuanced.

That sounds like a rephrasing of what I said that puts a positive spin on it. (I don't see any difference in content.)

To put it another way -- you're critiquing "optimistic EAs" about their attitudes towards institutions, but presumably they could say "we get lucky in terms of the institutional response being a good one given the circumstances". What's the difference between your position and theirs?

But my sense is that it's unlikely we'd succeed at getting ambitious measures in place without some amount of public pressure.

Why do you believe that?

It also matters how much weight you give to person-affecting views (I've argued here for why I think they're not unreasonable).

I don't think people would be on board with the principle "we'll reduce the risk of doom in 2028, at the cost of increasing risk of doom in 2033 by a larger amount".

For me, the main argument in favor of person-affecting views is that they agree with people's intuitions. Once a person-affecting view recommends something that disagrees with other ethical theories and with people's intuitions, I feel pretty fine ignoring it.

The 89% p(doom) is more like what we should expect by default: things get faster and out of control and then that's it for humans.

Your threat model seems to be "Moloch will cause doom by default, but with AI we have one chance to prevent that, but we need to do it very carefully". But Molochian forces grow much stronger as you increase the number of actors! The first intervention would be to keep the number of actors involved as small as possible, which you do by having the few leaders race forward as fast as possible, with as much secrecy as possible. If this were my main threat model I would be much more strongly against public advocacy and probably also against both conditional and unconditional pausing.

(I do think 89% is high enough that I'd start to consider negative-expectation high-variance interventions. I would still be thinking about it super carefully though.)

That sounds like a rephrasing of what I said that puts a positive spin on it. (I don't see any difference in content.)

Yeah, I just wanted to say that my position doesn't require public discourse to turn out to be surprisingly nuanced.

To put it another way -- you're critiquing "optimistic EAs" about their attitudes towards institutions, but presumably they could say "we get lucky in terms of the institutional response being a good one given the circumstances". What's the difference between your position and theirs?

I'm hoping to get lucky but not expecting it. The more optimistic EAs seem to be expecting it (otherwise they would share more pessimism).

I don't think people would be on board with the principle "we'll reduce the risk of doom in 2028, at the cost of increasing risk of doom in 2033 by a larger amount".

If no one builds transformative AI in a five-year period, the risk of doom can be brought down to close to 0%. By contrast, once we do build it (with that five-year delay), if we're still pessimistic about the course of civilization – as I am, and as I was assuming in my comments about how worldviews would react to these tradeoffs – then the success chances five years later will still be bad. (EDIT:) So, the tradeoff is something like 55% death spread over a five-year period vs no death for five years, for an eventual reward of reducing the total chance of death (over a 10y period or whatever) from 89% to 82%. (Or something a bit lower; I probably place >20% on us somehow not getting to TAI even in 10 years.)

In that scenario, I could imagine that many people would go for the first option. Many care more about life as they know it than about chances of extreme longevity and awesome virtual worlds. (Some people would certainly change their minds if they thought about the matter a lot more under ideal reasoning conditions, but I expect many [including me when I think self-orientedly] wouldn't.)

(I acknowledge that, given your more optimistic estimates about AI going well and about low imminent takeoff risk, person-affecting concerns seem highly aligned with long-termist concerns.) 

Your threat model seems to be "Moloch will cause doom by default, but with AI we have one chance to prevent that, but we need to do it very carefully". But Molochian forces grow much stronger as you increase the number of actors! The first intervention would be to keep the number of actors involved as small as possible, which you do by having the few leaders race forward as fast as possible, with as much secrecy as possible. If this were my main threat model I would be much more strongly against public advocacy and probably also against both conditional and unconditional pausing.

That's interesting. I indeed find myself thinking that our best chances of success come from a scenario where most large-model AI research shuts down, but a new org of extremely safety-conscious people is formed where they make progress with a large lead. I was thinking of something like the scenario you describe as "Variant 2: In addition to this widespread pause, there is a tightly controlled and monitored government project aiming to build safe AGI." It doesn't necessarily have to be government-led, but maybe the government has talked to evals experts and demands a tight structure where large expenditures of compute always have to be approved by a specific body of safety evals experts.

But my sense is that it's unlikely we'd succeed at getting ambitious measures in place without some amount of public pressure.

Why do you believe that?

I don't know. Maybe I'm wrong: If the people who are closest to DC are optimistic that lawmakers would be willing to take ambitious measures soon enough, then I'd update that public advocacy has fewer upsides (while the downsides remain). I was just assuming that the current situation is more like "some people are receptive, but there's also a lot of resistance from lawmakers."

 
So, the tradeoff is something like 55% death spread over a five-year period vs no death for five years, for an eventual reward of reducing total chance of death (over a 10y period or whatever) from 89% to 82%.

Oh we disagree much more straightforwardly. I think the 89% should be going up, not down. That seems by far the most important disagreement.

(I thought you were saying that person-affecting views means that even if the 89% goes up that could still be a good trade.)

I still don't know why you expect the 89% to go down instead of up given public advocacy. (And in particular I don't see why optimism vs pessimism has anything to do with it.) My claim is that it should go up.

I was thinking of something like the scenario you describe as "Variant 2: In addition to this widespread pause, there is a tightly controlled and monitored government project aiming to build safe AGI." It doesn't necessarily have to be government-led, but maybe the government has talked to evals experts and demands a tight structure where large expenditures of compute always have to be approved by a specific body of safety evals experts.

But why do evals matter? What's an example story where the evals prevent Molochian forces from leading to us not being in control? I'm just not seeing how this scenario intervenes on your threat model to make it not happen.

(It does introduce government bureaucracy, which all else equal reduces the number of actors, but there's no reason to focus on safety evals if the theory of change is "introduce lots of bureaucracy to reduce number of actors".)

Maybe I'm wrong: If the people who are closest to DC are optimistic that lawmakers would be willing to take ambitious measures soon enough

This seems like the wrong criterion. The question is whether this strategy is more likely to succeed than others. Your timelines are short enough that no ambitious measure is going to come into place fast enough if you aim to save ~all worlds.

But e.g. ambitious measures in ~5 years seems very doable (which seems like it is around your median, so still in time for half of worlds). We're already seeing signs of life:

note the existence of the UK Frontier AI Taskforce and the people on it, as well as the intent bill SB 294 in California about “responsible scaling”

You could also ask people in DC; my prediction is they'd say something reasonably similar.

Oh we disagree much more straightforwardly. I think the 89% should be going up, not down. That seems by far the most important disagreement.

I think we agree (at least as far as my example model with the numbers was concerned). The way I meant it, 82% goes up to 89%. 

(My numbers were confusing because I initially said 89% was my all-things-considered probability, but in this example model, I was giving 89% as the probability for a scenario where we take an action that is (on your view) suboptimal. In the example model, 82% is the best chance we can get with the optimal course of action, but it comes at the price of a way higher risk of death in the first five years.)

In any case, my assumptions for this example model were: 

(1) Public advocacy is the only way to install an ambitious pause soon enough to reduce risks that happen before 5 years.

(2) If it succeeds at the above, public advocacy will likely also come with negative side effects that increase the risks later on. 

And I mainly wanted to point out how, from a person-affecting perspective, the difference between 82% and 89% isn't necessarily huge, whereas getting 5 years of zero risk vs 5 years of 55% cumulative risk feels like something that could matter a lot.

But one can also discuss the validity of (1) and (2). It sounds like you don't buy (1) at all. By contrast, I think (1) is plausible, but I'm not confident in my stance here, and you raise good points.

Regarding (2), I probably agree that if you achieve an ambitious pause via public advocacy and public pressure playing a large role, this makes some things harder later.

But why do evals matter? What's an example story where the evals prevent Molochian forces from leading to us not being in control? I'm just not seeing how this scenario intervenes on your threat model to make it not happen.

Evals prevent the accidental creation of misaligned transformative AI by the project that's authorized to go beyond the compute cap for safety research (if necessary; obviously they don't have to go above the cap if the returns from alignment research are high enough at lower levels of compute). 

Molochian forces are one part of my threat model, but I also think alignment difficulty is high and hard takeoff is more likely than soft takeoff. (Not all these components of my worldview are entirely independent. You could argue that being unusually concerned about Molochian forces and expecting high alignment difficulty are both produced by the same underlying sentiment. Arguably, most humans aren't really aligned with human values, for Hansonian reasons, which we can think of as a subtype of Moloch problems. Likewise, if it's already hard to align humans to human values, it'll probably also be hard to align AIs to those values [or at least to create an AI that is high-integrity and friendly towards humans, while perhaps pursuing some of its own aims as well – I think that would be enough to generate a good outcome, so we don't necessarily have to create AIs that care about nothing else besides human values].)

 

The way I meant it, 82% goes up to 89%. 

Oops, sorry for the misunderstanding.

Taking your numbers at face value, and assuming that people have on average 40 years of life ahead of them (Google suggests median age is 30 and typical lifespan is 70-80), the pause gives an expected extra 2.75 years of life during the pause (delaying 55% chance of doom by 5 years) while removing an expected extra 2.1 years of life (7% of 30) later on. This looks like a win on current-people-only views, but it does seem sensitive to the numbers.
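The back-of-envelope calculation above can be reproduced in a few lines (a sketch using the commenter's illustrative figures only – the 55%, 82%, and 89% are stipulated for the example, not forecasts):

```python
# Illustrative expected-life-years comparison for the pause tradeoff.
# All numbers are the example figures from the thread, not forecasts.

remaining_life = 40   # assumed average years of life ahead (median age ~30)
pause_years = 5       # length of the pause

# With the pause: a 55% chance of doom is pushed back by 5 years,
# so currently-existing people gain 0.55 * 5 expected years during the pause.
gain_during_pause = 0.55 * pause_years

# Cost of the pause: total doom rises from 82% to 89% (7 percentage points),
# costing the ~30 years that would remain after the 10-year window (40 - 10).
loss_later = (0.89 - 0.82) * (remaining_life - 10)

net = gain_during_pause - loss_later
print(f"gain {gain_during_pause:.2f}y, loss {loss_later:.2f}y, net {net:+.2f}y")
```

On these stipulated numbers the pause comes out slightly ahead for currently-existing people (roughly +0.65 expected years), but as noted, the result flips with fairly small changes to the inputs.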

I'm not super sold on the numbers. Removing the full 55% is effectively assuming that the pause definitely happens and is effective -- it neglects the possibility that advocacy succeeds enough to have the negative effects, but still fails to lead to a meaningful pause. I'm not sure how much probability I assign to that scenario but it's not negligible, and it might be more than I assign to "advocacy succeeds and effective pause happens".

It sounds like you don't buy (1) at all.

I'd say it's more like "I don't see why we should believe (1) currently". It could still be true. Maybe all the other methods really can't work for some reason I'm not seeing, and that reason is overcome by public advocacy.

Thank you for writing this post, I think I learnt a lot from it (including about things I didn't expect I would, such as waste sites and failure modes in cryonics advocacy - excellent stuff!).

Question for anyone to chip in on:

I'm wondering: if we were to make the "conditional pause system" the post is advocating for universal, would it imply that the alignment community needs to drastically scale up (in terms of the number of researchers) to be able to do work similar to what ARC Evals is doing?

After all, someone would actually need to check if systems at a given capability are safe, and as the post argues, you would not want AGI labs to do it for themselves. However, if all current top labs were to start throwing their cutting-edge models at ARC Evals, I imagine they would be quite overwhelmed. (And the demand for evals would just increase over time)

I could see this being less of an issue if the evaluations only need to happen for the models that are really the most capable at a given point in time, but my worry would be that as capabilities increase, even if we test the top models rigorously, the second-tier models could still end up doing harm. 

(I guess it would also depend on whether you can "reuse" some of the insights you gain from evaluating the top models at a given time on the second-tier models at a given time, but I certainly don't know enough about this topic to know if that would be feasible)

Yes, in the longer term you would need to scale up the evaluation work that is currently going on. It doesn't have to be done by the alignment community; there are lots of capable ML researchers and engineers who can do this (and I expect at least some would be interested in it).

I could see this being less of an issue if the evaluations only need to happen for the models that are really the most capable at a given point in time

Yes, I think that's what you would do.

my worry would be that as capabilities increase, even if we test the top models rigorously, the second-tier models could still end up doing harm. 

The proposal would be that if the top models are risky, then you pause. So, if you haven't paused, then that means your tests concluded that the top models aren't risky. In that case I don't know why you expect the second-tier models to be risky?

I think that the best work on AI alignment happens at the AGI labs

Based on your other discussion e.g. about public pressure on labs, it seems like this might be a (minor?) loadbearing belief?

I appreciate that you qualify this further in a footnote

This is a controversial view, but I’d guess it’s a majority opinion amongst AI alignment researchers.

I just wanted to call out that I weakly hold the opposite position, and also opposite best guess on majority opinion (based on safety researchers I know). Naturally there are sampling effects!

This is a marginal sentiment, and I certainly wouldn't trade all lab researchers for non-lab researchers or vice versa. Diversification of research settings seems quite precious, and the dialogue is important to preserve.

I also question

Reasons include: access to the best alignment talent,

because a lot of us are very reluctant to join AGI labs, for obvious reasons! I know folks inside and outside of AGI labs, and it seems to me that the most talented are among the outsiders (but this also definitely could be an artefact of sample sizes).

it seems like this might be a (minor?) loadbearing belief?

Yes, if I changed my mind about this I'd have to rethink my position on public advocacy. I'm still pretty worried about the other disadvantages so I suspect it wouldn't change my mind overall, but I would be more uncertain.

While I’m uncertain how negative this effect is overall, this brief analysis of cryonics and molecular nanotechnology is quite worrying; it almost reads like a forecast about the field of AI alignment.

I think we are way past this point. Cryonics and molecular nanotechnology have never got to the point where there are Nobel Prize winners in Medicine and Chemistry advocating for them. (Whereas AI x-safety now has Hinton and Bengio, Turing Award winners[1], advocating for it.)

  1. ^

I think these don’t bite nearly as hard for conditional pauses, since they occur in the future when progress will be slower

Your footnote is about compute scaling, so presumably you think that's a major factor for AI progress, and why future progress will be slower. The main consideration pointing the other direction (imo) is automated researchers speeding things up a lot. I guess you think we don't get huge speedups here until after the conditional pause triggers are hit (in terms of when various capabilities emerge)? If we do have the capabilities for automated researchers, and a pause locks these up, that's still pretty massive (capability) overhang territory. 

Yeah, unless we get a lot better at alignment, the conditional pause should hit well before we create automated researchers.

“if an AI system would be very dangerous, don’t build it”

This sounds like a recipe for disaster. How do you reliably calibrate what "very dangerous" is in advance? It doesn't take much error for something considered merely "potentially dangerous" (and so allowed to be built) to actually turn out to be very dangerous. Especially if you consider potential lab leaks during training (e.g. situationally aware mesaoptimisers emerging).

You're right, you can't do it. Therefore we should not call for a pause on systems more powerful than GPT-4, because we can't reliably calibrate that systems more powerful than GPT-4 would be plausibly very dangerous. </sarcasm>

If you actually engage with what I write, I'll engage with what you write. For reference, here's the paragraph around the line you quoted:

We can get much wider support for a conditional pause. Most people can get on board with the principle “if an AI system would be very dangerous, don’t build it”, and then the relevant details are about when a potential AI system should be considered plausibly very dangerous. At one extreme, a typical unconditional pause proposal would say “anything more powerful than GPT-4 should be considered plausibly very dangerous”. As you make the condition less restrictive and more obviously tied to harm, it provokes less and less opposition.

There are also a couple of places where I discuss specific conditional pause proposals, you might want to read those too.

You're right, you can't do it. Therefore we should not call for a pause on systems more powerful than GPT-4, because we can't reliably calibrate that systems more powerful than GPT-4 would be plausibly very dangerous.

The second sentence really doesn't follow from the first! I would replace should not with should, and would with would not[1] [Added 10 Oct: for those reading now, the "</sarcasm>" was not there when I responded]:

We should call for a pause on systems more powerful than GPT-4, because we can't reliably calibrate that systems more powerful than GPT-4 would not be very dangerous.

We need to employ abundant precaution when dealing with extinction level threats!

At one extreme, a typical unconditional pause proposal would say “anything more powerful than GPT-4 should be considered plausibly very dangerous”.

I really don't think this is extreme at all. Outside of the major AI labs and their lobbys (and this community), I think this is mostly regarded as normal and sensible.

  1. ^

    and remove the "plausibly"

we can't reliably calibrate that systems more powerful than GPT-4 would not be very dangerous.

Seems false (unless you get Pascal's Mugged by the word "reliably").

We're pretty good at evaluating models at GPT-4 levels of capability, and they don't seem capable of autonomous replication. (I believe that evaluation is flawed because ARC didn't have finetuning access, but we could do better evaluations.) I don't see any good reason to expect this to change for GPT-4.5.

Or as a more social argument: I can't think of a single ML expert who'd agree with your claim (including ones who work in alignment, but excluding alignment researchers who aren't ML experts, so e.g. not Connor Leahy or Eliezer Yudkowsky). Probably some do exist, but it seems like it would be a tiny minority.

I really don't think this is extreme at all.

I was using the word extreme to mean "one end of the spectrum" rather than "crazy". I've changed it to "end" instead.

Though I do think it would both be an overreaction and would increase x-risk, so I am pretty strongly against it.

unless you get Pascal's Mugged by the word "reliably"

I don't think it's a case of Pascal's Mugging. Given the stakes (extinction), even a 1% risk of a lab leak for a next gen model is more than enough to not build it (I think we're there already).

who aren't ML experts, so e.g. not Connor Leahy

Connor Leahy is an ML expert (he co-founded EleutherAI before realising x-risk was a massive issue).

I don't see any good reason to expect this to change for GPT-4.5.

To me this sounds like you are expecting scaling laws to break? Or not factoring in its being given access to other tools such as planners (AutoGPT etc.) or plugins.

Though I do think it would both be an overreaction and would increase x-risk, so I am pretty strongly against it.

How would it increase x-risk!? We're not talking about a temporary pause with potential for an overhang. Or a local pause with potential for less safe actors to race ahead. The only sensible (and, I'd say, realistic) pause is global and indefinite, until global consensus on x-safety or global democratic mandate to proceed; and lifted gradually to avoid sudden jumps in capability. 

I think you are also likely to be quite biased if you are, in fact, working for (and being paid good money by) a Major AI Lab (why the anonymity?)

Every comment of yours so far has misunderstood or misconstrued at least one thing I said, so I'm going to bow out now.

I think the crux boils down to you basically saying "we can't be certain that it would be very dangerous, therefore we should build it to find out". This, to me, is totally reckless when it comes to the stakes being extinction (we really do not want to be FAFO-ing this! Where is your security mindset?). You don't seem to put much (any?) weight on lab leaks during training (as a result of emergent situational awareness). "Responsible Scaling" is anything but, in the situation we are now in.

Also, the disadvantages you mention around Goodharting make me think that the only sensible way to proceed is to just shut it all down.

You say that you disagree with Nora over alignment optimism, but then also that you "strongly disagree" with "the premise that if smarter-than-human AI is developed in the near future, then we almost surely die, regardless of who builds it" (Rob's post). I think you are also way too optimistic about alignment work on its current trajectory actually leading to x-safety by saying this.
