
This post summarizes a project that I recently completed about international agreements to coordinate on safe AI development. I focused particularly on an agreement that I call “Collaborative Handling of Artificial intelligence Risks with Training Standards” (“CHARTS”). CHARTS would regulate training runs and would include both the US and China.

Among other things, I hope that this post will be a useful contribution to recent discussions about international agreements and regulatory regimes for AI.

I am not (at least for the moment) sharing the entire project publicly, but I hope that this summary will still be useful or interesting for people who are thinking about AI governance. The conclusions here are essentially the same as in the full version, and I don’t think readers’ views would differ drastically based on the additional information present there. If you would benefit from reading the full version (or specific parts) please reach out to me and I may be able to share.[1]

This post consists of an executive summary (∼1200 words) followed by a condensed version of the report (∼5000 words + footnotes).

 

Executive summary

In this report, I investigate the idea of bringing about international agreements to coordinate on safe AI development (“international safety agreements”), evaluate the tractability of these interventions, and suggest the best means of carrying them out.

 

Introduction

My primary focus is a specific type of international safety agreement aimed at regulating AI training runs to prevent misalignment catastrophes.[2] I call this agreement “Collaborative Handling of Artificial intelligence Risks with Training Standards,” or “CHARTS.”

 

Key features of CHARTS include:

  • Prohibiting governments and companies from performing training runs with a high likelihood of producing powerful misaligned AI systems.
  • Determining which training runs are risky using proxies such as training run size, or potential risk factors such as the use of reinforcement learning.
  • Requiring extensive verification through on-chip mechanisms, on-site inspections, and dedicated institutions.
  • Cooperating to prevent exports of AI-relevant compute to non-member countries, avoiding dangerous training runs in non-participating jurisdictions.

 

I chose to focus on this kind of agreement after seeing similar proposals and thinking that they seemed promising.[3] My intended contribution here is to think about how tractable it would be to get something like CHARTS, and how to increase this tractability. I mostly do not attempt to assess how beneficial (or harmful) CHARTS would be.

I focus particularly on an agreement between the US and China because existentially dangerous training runs seem unusually likely to happen in these countries, and because these countries have an adversarial relationship, heightening concerns about racing dynamics.

 

Political acceptance of costly measures to regulate AI

I introduce the concept of a Risk Awareness Moment (Ram) as a way to structure my thinking elsewhere in the report. A Ram is “a point in time, after which concern about extreme risks from AI is so high among the relevant audiences that extreme measures to reduce these risks become possible, though not inevitable.” Examples of audiences include the general public and policy elites in a particular country.

I think this concept is helpful for thinking about a range of AI governance interventions. It has advantages over related concepts such as “warning shots” in that it makes it easier to remain agnostic about what causes increased concern about AI, and what the level of concern looks like over time.[4]

I think that CHARTS would require a Ram among policy elites, and probably also among the general public – at least in one of the US and China.[5] So this agreement would be intractable before Rams among these audiences, even if longtermists worked hard to make it occur. Developments such as GPT-4, however, seem to have moved various audiences closer to having Rams in the past six months or so.

 

Likelihood of getting an agreement

Conditional on (a) Rams among policy elites and the general public in the US and China and (b) EAs/longtermists pushing hard for getting CHARTS, I think it is around 40% likely that the US and China would seriously negotiate for this agreement.

  • By “serious negotiations,” I mean negotiations where all parties are negotiating in good faith and are extremely motivated to get an agreement, such that the negotiations would eventually lead to an agreement, assuming that an existential catastrophe or Point of no return (PONR) does not happen first.
  • By “pushing hard,” I mean that at least 10 median-seniority FTEs from within longtermist AI governance are spent on work whose main goal is to make this agreement more likely, and we are willing to accept credibility-and-political-capital costs at a similar level to the costs incurred by the “Slaughterbots” campaign.[6]

 

As mentioned above, I believe that in 40% of worlds, Rams cause policymakers in the US and China to negotiate seriously for an agreement, such that they would eventually succeed at this, assuming an existential catastrophe or PONR doesn’t happen first. In those worlds, I think it would take around four years to go from these Rams to there being a signed agreement (90% CI:[7] two to eight years). I got to this number by looking at five reference classes and then adjusting from there.[8] The main adjustment I made was that I expect decision-makers after a Ram to be unusually motivated – by fear of catastrophe – to negotiate quickly.

By combining the guesses above with additional estimates, we could find the likelihood of getting an agreement before possible catastrophe. I think the additional estimates that would be needed are:

  • How likely is it that there will be Rams among policy elites (and maybe also the general public) before a PONR? If there will never be a Ram among policy elites, but an agreement would require such a Ram, then there will not be an agreement. Similarly, if we think this Ram is only somewhat likely, then we should discount our probability estimate of getting an agreement accordingly.
  • How much time is there between Rams among policy elites (and maybe also the general public) and a Point of No Return (PONR)?
  • How high is existential risk during negotiations? If negotiations increase or reduce risk compared to a counterfactual world without negotiations, then that changes how much time there is for negotiations before a possible catastrophe.

 

How to increase the tractability of getting international safety agreements?

I discuss interventions that we could do now to make it easier to get an AI safety agreement, such as CHARTS, in the future. There are at least two reasons why we might want to make it easier to get agreements:

  1. If it is easier for countries to make AI safety agreements, then these agreements are more likely to occur. This seems good if we think that this cooperation could reduce the likelihood of existential catastrophe.
  2. If it is easier for countries to make AI safety agreements, then I expect an agreement to occur earlier than it would otherwise. This seems good if we think there may not be much time to get an agreement before a possible catastrophe.

 

The main intervention that I am excited about doing now is developing, testing, and implementing measures for monitoring compute use. I am also somewhat excited about promoting generally better relations between China and the US and contributing to an informed consensus among technical experts (particularly in the US and China) about the nature of misalignment risk.

 

Bottom lines about the role of international agreements in the AI governance portfolio

Please note that this section is more tentative than the rest of the project.

My current guess is that the longtermist AI governance community should aim to spend an average of around eight FTE per year (90% CI: four to 12) on international safety agreements over the next two years.[9] That figure does not include work that is relevant to international agreements but also to lots of other parts of the field, e.g., work on developing technical mechanisms for compute governance.

Unless exceptional windows of opportunity arise,[10] I think this effort should be spent on strategy research, developing concrete proposals, and watching for windows of opportunity – rather than concretely pushing for agreements.[11]

I expect that successfully pushing for agreements like the one that I describe in this report would take a big chunk of the longtermist AI governance field. For example, if the field does decide to push for CHARTS, my quick guess is that 20-30 people should focus on this.[12] So, I think it only makes sense to push for such an agreement if a large fraction of the field would indeed work on it.

 

Introduction

In this report, I investigate the idea of bringing about international agreements to coordinate on safe AI development (“international safety agreements”), evaluating the tractability of these interventions and the best means of carrying them out.

My primary focus is a specific type of international safety agreement aimed at regulating AI training runs to prevent misalignment catastrophes.[13] I call this agreement “Collaborative Handling of Artificial intelligence Risks with Training Standards,” or “CHARTS.”

 

Key features of CHARTS include:

  • Prohibiting governments and companies from performing training runs with a high likelihood of producing powerful misaligned AI systems.
  • Determining which training runs are risky using proxies such as training run size, or potential risk factors such as the use of reinforcement learning.
  • Requiring extensive verification through on-chip mechanisms, on-site inspections, and dedicated institutions, such as an IAIA (an international AI agency analogous to the IAEA).
  • Members of the agreement might need to cooperate to prevent exports of AI-relevant compute (or semiconductor-manufacturing equipment) to non-member countries; this would prevent dangerous training runs from happening in jurisdictions that have not joined the agreement.

 

I chose to focus on this kind of agreement after seeing similar proposals and thinking that they seemed promising. See, for example, the regulatory regime proposed by Shavit (2023) — though Shavit also discusses how these measures could be implemented at the national level by individual countries.[14] My intended contribution here is to think about how tractable it would be to get an agreement like CHARTS, and how to increase this tractability. I mostly do not attempt to assess how beneficial (or harmful) CHARTS would be.[15]

 

I don't assume that an international safety agreement would need to be a formal treaty; it could be, e.g., an executive agreement or other comparatively informal agreement. Given that CHARTS would require far-reaching restrictions and verification, the agreement seems most suited to a treaty. But it seems possible to me that countries could also make a CHARTS agreement less formally – at least as a stop-gap measure until a treaty is in place, or perhaps even indefinitely.[16]

 

I focus on an agreement that would include both the US and China – though it could include additional countries. I focus on the US and China because:

  • Existentially dangerous training runs are disproportionately likely to happen in these countries.[17]
  • Among the AI frontrunners, these countries have a particularly adversarial relationship with each other. This means that we should be particularly concerned about racing dynamics between them.[18]

 

I expect that my analysis here would be relevant to other types of international agreements, though less so for agreements that differ substantially from CHARTS. For example, other potential agreements might be disanalogous because they impose less onerous measures or because they would not require “risk awareness moments” – a concept that I discuss below.

 

Political acceptance of costly measures to regulate AI

Risk awareness moments (Rams) as a concept for thinking about AI governance interventions

I published a standalone post for this part of the project because it seems more generally useful than the rest of my project. But you don’t need to read that post to understand everything in this one.

 

I introduce a new concept as a way of structuring my thinking in this report: A Risk Awareness Moment (Ram) is “a point in time, after which concern about extreme risks from AI is so high among the relevant audiences that extreme measures to reduce these risks become possible, though not inevitable.”

 


 

Relevant audiences could include, e.g., policy elites in particular countries, people within leading labs, and the general population. By “extreme measures,” I mean both:

  • Measures that are far outside the Overton window or “unthinkable,” and
  • Measures that are very burdensome, e.g., because implementing the measure would impose high financial costs.[19]

 

I think the Ram concept has two main benefits compared to similar concepts such as warning shots:

  1. Rams let us remain agnostic about what type of evidence makes people concerned. For example, evidence could come from something that an AI does, but also from social phenomena. An example of the former might be a disastrous (but not existentially catastrophic) alignment failure. An example of the latter might be a speech from an influential figure about risks from AI.
  2. Rams let us remain agnostic about the “trajectory” by which people become concerned about the risk – for example, whether the change in opinion is discrete, continuous, or lumpy.

 

Here are several points that seem important for how we think about Rams:

 

A Ram – even among influential audiences – is not sufficient for adequate risk-reduction measures to be put in place. For example, there could be bargaining failures between countries that make it impossible to get mutually beneficial AI safety agreements. Or people who become more aware of the risks from advanced AI might as a result also become more aware of the benefits, and make an informed decision that – by their lights – the benefits justify a high level of risk.[20]

 

For many audiences, there will not necessarily be a Ram. For example, there might be a fast takeoff before the general public has a chance to significantly alter their beliefs about AI. That said, various audiences seem to have moved closer to having Rams in the past six months or so.[21] In particular, the impressiveness — and obvious shortcomings — of ChatGPT, GPT-4, and Bing, as well as the FLI open letter, seem to have significantly shifted public discourse about AI safety in the US (and maybe elsewhere). Akash has collected various examples of this.

 

Risk awareness moments are not just an exogenous variable. If a Ram among a particular audience would be necessary to get a desirable policy, we could try to cause this Ram (or make it happen earlier than it would otherwise), for instance by “raising the alarm.”  Whether it would in fact be good to trigger (earlier) Risk Awareness Moments depends on various factors. These include:

  • To what extent do desirable interventions require there to have been Rams among particular groups?
  • What are the risks associated with causing a Ram? For example, might this accelerate timelines towards very capable AI systems, leaving less time for safety work? Or might it cause additional hype about AI that attracts reckless actors?
  • What are the opportunity costs of causing a Ram? Even if we think it would be desirable to cause a Ram, we might think that marginal resources would be better spent on something else.

 

Risk awareness moments and international safety agreements

I claim that CHARTS would require Rams among audiences that have not (yet) had them:[22]

  • It seems maybe 90% likely to me that CHARTS would require a Ram among policy elites in at least one of the US and China.
  • It seems about 60% likely to me that a Ram among the general public of at least one of these countries would also be needed.
  • I think that CHARTS would be most tractable if there had been Rams among policy elites and the general public in both of these countries – though I have not thought about by how much.

 

I mainly expect that Rams in either the US or China would be sufficient for an agreement between those countries because of side-payments: if one country is strongly motivated to get an agreement, it can offer incentives to the other country so that the other country would be willing to join, even without being concerned about misalignment.[23]

Even if Rams in one country may be sufficient for getting an agreement, I expect Rams among similar audiences to happen at similar times in different countries anyway.[24] One reason for this is that Rams might be triggered by highly visible events, such as misalignment accidents or influential publications, that are seen by people in many different countries.[25] Additionally, groups might update their beliefs based on what groups in different countries say or do. If, for example, people in one country start acting extremely cautiously about misalignment, I suspect this would generally increase the likelihood that people in other countries would take misalignment more seriously.[26]

Rams are not needed for helpful AI governance interventions if these interventions can be justified in terms of issues that people already care about. For example, framings such as geopolitical competition could be a good justification for weaker versions of CHARTS, such as regulating some training runs and restricting some actors’ access to compute. But I do think that a Ram among policy elites in at least one of the US and China would be a necessary condition for the main agreement that I focus on in this report.[27]

 

There are a few important implications of needing to wait for additional Rams before being able to get CHARTS.

  • If we think that both (a) Rams among particular audiences would be necessary for international safety agreements, and (b) these Rams are unlikely to come, then CHARTS looks unpromising as a way of reducing risk from AI.
  • If we are excited about CHARTS, but do not expect the necessary Rams to come about by default (or to come about in time to get an agreement), then we should be more excited about interventions to cause (earlier) Rams. This might mean, for example, being more willing to accept downside risks associated with awareness-raising.
  • Having to wait for Rams would leave us with less time to get CHARTS (or other agreements) in place before AI could be extremely dangerous. This increases the value of measures that would help humanity to quickly get agreements. I discuss such measures later in the report.

 

Likelihood of getting the agreement

Tractability of different agreement formats

Which agreement format would be the most tractable for getting a CHARTS agreement that includes both the US and China? By “format,” I mean, for example, “Is the agreement bilateral or multilateral?” and “Are all key countries included from the beginning rather than joining in stages?”

 

I thought about four different formats:[28]

  • Two-Step: Step 1: Informal agreement between the US and key allies about what they would want an agreement to look like. Step 2: Negotiations between the US bloc and China. I think that, in practice, step 2 would basically look like bilateral negotiations between the US and China.
  • Bilateral: An agreement between just the US and China.
  • Double Treaty: There is a treaty among the US and its allies which is converted into a treaty between the US bloc on one side and China on the other.
  • 3rd Country Multilateral Agreement: An international agreement that is led by countries other than the US or China but that successfully influences the US and China.

 

My bottom line: I think it’s around 50% likely that Two-Step is the most tractable, and around 30% likely that Bilateral is the most tractable for getting a CHARTS agreement.[29] My main crux between those options is the extent to which we could model the US Bloc as a single actor. The more this group can be characterized as many different actors, the more we should favor Bilateral instead. This is because coordinating many actors is generally harder than coordinating two actors.

 

I lay out my reasoning below for thinking that Two-Step, or maybe Bilateral, are the most tractable.

 

Two-Step is probably more tractable than Bilateral

My main reason for thinking this is the claim that China would not want a CHARTS agreement that does not include key US allies. I mainly have in mind here countries, such as the UK, that are closely allied with the US and that could plausibly build advanced AI. Bilateral may in fact be more tractable than Two-Step, however, if the US bloc is difficult to coordinate[30] or resistant to US-led compromise with China.

 

Two-Step is probably more tractable than Double Treaty

My main reason for thinking this is that it’s generally easier to get informal agreements than treaties; Two-Step replaces one of the treaties in Double Treaty with an informal agreement. Also, Two-Step might raise fewer prestige concerns among Chinese elites.[31] If I am wrong, it is probably because the treaty among the US bloc is actually good for tractability, e.g., because it is a credible signal to China that these countries care about AI safety.

 

3rd country multilateral agreements have low tractability

This is because there isn’t a good mechanism for compelling the US and China to join CHARTS if these countries do not perceive it to be in their interests. And if the US and China did perceive CHARTS to be in their interests, then presumably they would just make this agreement themselves. That said, I do not mean to claim here that this format would be intractable or undesirable for other agreements.[32]

 

Forecasting the likelihood of a two-step CHARTS agreement

Having identified Two-Step as the most tractable agreement format, I attempt to assess how likely it is that countries would try to make a two-step agreement to implement CHARTS.

 

Conditional on (a) Rams among policy elites and the general public of the US and China and (b) EAs/longtermists pushing hard for getting an agreement, I think it is around 40% likely that the US and China would seriously negotiate for CHARTS.[33]

  • By “EAs/longtermists pushing hard,” I mean: At least 10 median-seniority FTEs from within longtermist AI governance are spent mainly on work whose main goal is to make this agreement more likely.[34] Additionally, the field is willing to accept credibility-and-political-capital costs at a similar level to the costs incurred by the “Slaughterbots” campaign.[35]
  • By serious negotiations, I mean negotiations where all parties are negotiating in good faith and are extremely motivated to get an agreement, such that serious negotiations would eventually lead to an agreement being in place, assuming that an existential catastrophe or Point of no return does not happen first.

 

I got to my 40% forecast by multiplying together two sub-forecasts.

  • My 1st sub-forecast: It seems around 60% likely to me that, given the above conditions, the US and some allies would make an informal agreement about what kind of agreement with China they would like to make.[36]
  • My 2nd sub-forecast: Conditional on the earlier two conditions and the above outcome, I am 70% confident that serious negotiations would start between the US bloc and China.[37]

 

So, I think the overall likelihood is about 60% of 70%, i.e., about 40%.[38] 

My sub-forecasts here come from introspecting about various inside-view considerations. I am not anchored by particular reference classes because I was unable to find any good reference classes about how often proposed agreements actually happen.[39]
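
As footnote [38] notes, this step would ideally multiply probability distributions rather than point estimates. Here is a minimal sketch of what that could look like; the Beta shapes are arbitrary placeholders centered on my 60% and 70% point estimates, not distributions that I have actually elicited:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples = 100_000

# Placeholder Beta distributions centered on the two point estimates.
p_informal_us_bloc_agreement = rng.beta(6, 4, n_samples)   # mean ~0.6
p_serious_negotiations_start = rng.beta(7, 3, n_samples)   # mean ~0.7

# Combined probability that both steps happen, assuming independence.
p_combined = p_informal_us_bloc_agreement * p_serious_negotiations_start

print(f"Mean: {p_combined.mean():.2f}")                         # ~0.42
print(f"90% interval: {np.percentile(p_combined, [5, 95]).round(2)}")
```

With independent distributions the mean of the product stays close to 0.6 × 0.7 ≈ 0.42, but this approach also makes explicit how wide the uncertainty around the combined estimate is.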

 

1st sub-forecast

Here are the considerations that led me to estimate 60% for the first sub-forecast:[40]

The key factor pushing this number up is that CHARTS would be a good way to reduce extreme risks from AI. And we’re conditioning on there having been a Ram among policy elites and the general public, meaning that it would be widely accepted that these risks are high.

Several factors push the number down: There may be concerns among the US and its allies about China's trustworthiness in adhering to CHARTS.[41] CHARTS would also impose high burdens on these countries, such as limitations on doing training runs that might be beneficial, or complying with intrusive verification. Additionally, the US or its allies may want to maintain the freedom to act outside of agreements when it comes to AI.

 

2nd sub-forecast

Here are the considerations that led me to estimate 70% for the second sub-forecast:[42]

The main consideration pushing the number up is that a lot of hurdles to getting CHARTS have already been overcome by this point; we are conditioning on forecast 1 having resolved positively, and so are assuming, e.g., that the US is in principle willing to make an agreement with China.

The main consideration pushing the number down is the possibility of bargaining failures between the countries, such as an unwillingness to accept adequate verification.

 

How long after a Risk Awareness Moment among policy elites and the general public should we expect it to take to get an AI safety agreement?

 

As mentioned above, I believe that in 40% of worlds, Rams cause policymakers in the US and China to negotiate seriously for an agreement, such that they would eventually succeed at this, assuming an existential catastrophe or PONR doesn’t happen first. In those worlds, I think it would take around four years to go from these Rams to there being a signed agreement (90% CI: two to eight years).

My approach to making this forecast is to first look at reference classes to get an outside view: How long did it generally take to negotiate historical agreements? I then adjust from this outside view using various inside-view considerations that are specific to AI, CHARTS, and what I expect the world to look like.[43]

 

Outside view

My outside view comes from a weighted average of the negotiation time for five reference classes of historical agreements. The reference classes, and how much weight I give each, are summarized below. This weighted-average outside view gives a best guess of 4.5 years between formal negotiations starting and there being a signed agreement.[44]

 

Bilateral nuclear arms control.[45] This covers all the strategic nuclear arms control agreements between the USA and the USSR/Russia. Examples include SALT 1 and New START.
My weighting: 10%. Mean negotiating time: 3.4 years.

First agreements to cover key categories of nuclear weapons.[46] This class consists of:

  • SALT 1 (first strategic nuclear arms control for Cold War superpowers)
  • INF (shorter range missiles for Cold War superpowers)
  • NPT (nuclear weapons for most countries)

My weighting: 20%. Mean negotiating time: 3.9 years.

Comprehensive Safeguards Agreements for richer countries.[47] As part of the NPT, non-nuclear states had to make bilateral agreements – Comprehensive Safeguards Agreements (CSAs) – with the IAEA to facilitate verification of the NPT. I exclude low-GDP countries, as they are less able to build nuclear weapons.
My weighting: 20%. Mean negotiating time: 9.4 years.

Multilateral agreements since 1945.[48] The broadest reference class I could find that covers the post-1945 period, although it does not include bilateral agreements.
My weighting: 20%. Mean negotiating time: 2.9 years.

Multilateral agreements since 1945 that were both proposed by a great power and related to security.[49] This is a subset of the “multilateral agreements since 1945” reference class.
My weighting: 30%. Mean negotiating time: 3.2 years.
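
For transparency, the 4.5-year figure is simply the sum of each class’s mean negotiating time multiplied by its weight. A quick sketch of the calculation (class names abbreviated):

```python
# Weighted-average outside view from the reference classes above.
reference_classes = {
    "Bilateral nuclear arms control":                 (0.10, 3.4),
    "First agreements for key nuclear categories":    (0.20, 3.9),
    "Comprehensive Safeguards Agreements (richer)":   (0.20, 9.4),
    "Multilateral agreements since 1945":             (0.20, 2.9),
    "Post-1945 multilateral, great-power, security":  (0.30, 3.2),
}

outside_view_years = sum(weight * years for weight, years in reference_classes.values())
print(f"Weighted-average negotiating time: {outside_view_years:.1f} years")  # ~4.5
```

As footnote [44] notes, this averaging treats the reference classes as independent even though some agreements appear in more than one class.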

 

Inside view

My best guess is that it would take around four years (90% CI: two to eight years) to get CHARTS – conditioning on a scenario where there is a Ram among policy elites and the general public that countries respond to by attempting to make an agreement.[50] Unlike in the outside-view data, which counts the period from the start of negotiations to there being a signed agreement, I am counting here the period from there being Rams among policy elites and the general public to there being a signed agreement. This somewhat pushes up the estimate, but in fact I think this time lag would be small (on the order of months) because of the perceived urgency. I think the time to implement the agreement after it is signed would be small, as discussed below.

 

My main inside-view adjustment is that I expect decision-makers to be very motivated (by fear of risks from AI) to act quickly. In particular, I expect decision-makers to be more motivated to act quickly than decision-makers were for the agreements in the outside-view reference classes. There are counterarguments that reduce the size of this update. Most importantly, even if decision-makers are very motivated to reduce risk quickly, they might negotiate more slowly than they could in order to preserve option value while they gain information about which misalignment risk-reduction measures are best.

 

Here are other inside-view adjustments towards negotiations that are quicker than the reference classes:

  • Medium adjustment: Improvements in non-AGI technology may make it easier to negotiate more quickly. For example, large language models could be used to identify attractive compromises.
  • Small adjustment: EAs/longtermists could do preparatory work now to contribute to an agreement happening quickly, once Rams have happened among policy elites and the general public – as discussed below.

 

Here are my inside-view updates towards negotiations that are slower than the reference classes:

  • Small adjustment: AI is somewhat harder to verify than many of the capabilities (e.g., nuclear weapons) in the reference classes.[51]
  • Small adjustment: Empirically, international agreements seem to have become rarer since the end of the Cold War.

 

Implementation time

I am assuming that there would be little implementation time. Specifically, conditional on the CHARTS agreement being signed, I think it is around 70% likely that – within six months of signature – CHARTS would be either completely in force or informally mostly followed.[52]

I mainly think there would be little implementation time because I assume that countries would be highly motivated to get risk-reducing measures in place quickly, once the magnitude of AI-related risks becomes more salient. This means that I expect countries to informally commit to the measures in the agreement (almost) immediately after signature. They might even implement the measures in the agreement while negotiations are ongoing, e.g., to indicate good faith.

However, if I'm wrong and there is a substantial delay between the signature and implementation, I expect this to be because we’re in a world where (a) countries would not trust each other enough to implement an agreement without rigorous verification, and (b) this verification would be complex (and thus slow) to implement.[53]

 

How likely is it that we would get an agreement in time?

In this section, I come to a best guess that it would take around four years (90% CI: two to eight years) to get the agreement described in this report – conditioning on a scenario where there is a Ram among policy elites and the general public that countries respond to by attempting to make an agreement.

By combining this guess with additional estimates, we could find the likelihood of the time to get an agreement being less than the time until possible catastrophe:

  • How much time is there between Rams and a PONR?
  • How high is existential risk during negotiations? If negotiations increase or reduce risk compared to a counterfactual world without negotiations, then that changes how much time there is for negotiations before a possible catastrophe.

Regrettably I have not (yet) estimated these other components, so do not (yet) have a combined estimate – though I hope to think more about this in 2023.
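
To illustrate how the combination would work once those components are estimated, here is a minimal sketch. The time-to-agreement distribution follows the roughly lognormal shape described above and in footnote [50] (median of about four years, 90% CI of two to eight years); the time-to-PONR distribution is a pure placeholder that I have not estimated, so the output is meaningless except as a demonstration of the mechanics:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples = 100_000

# Time from Rams to a signed agreement: roughly lognormal, median ~4 years,
# 90% CI of ~2-8 years (see footnote [50]); sigma = ln(2)/1.645 gives that spread.
time_to_agreement = rng.lognormal(mean=np.log(4), sigma=np.log(2) / 1.645, size=n_samples)

# Time from Rams to a PONR or catastrophe: NOT estimated in this report.
# This lognormal is a pure placeholder so that the combination step runs.
time_to_ponr = rng.lognormal(mean=np.log(6), sigma=0.6, size=n_samples)

p_in_time = np.mean(time_to_agreement < time_to_ponr)
print(f"P(signed agreement before PONR | serious negotiations): {p_in_time:.2f}")
```

A fuller version would also adjust for any existential risk accruing during the negotiations themselves, as noted in the second bullet above.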

 

Possible implications of the combined estimate

If we think that we would not be able to get an agreement in time, then we should be less excited about this agreement (and, to some extent, international agreements in general) as an AI governance intervention.

If we want international agreements but are not confident that there would be sufficient time to get an agreement, then we should be more willing to carry out interventions that would increase the time between Rams and a possible PONR or existential catastrophe.[54] Here are examples of what this might look like:

  • Causing Rams among policy elites and the general public to happen earlier than they would by default, for instance by awareness raising.[55] This would only be helpful if the effect is not canceled out by these efforts also causing an earlier PONR/catastrophe.[56]
  • Causing TAI to come later than it would by default – but only if this effect is not canceled out by these efforts also delaying Rams.[57]

Additionally, if we do not think that there would be enough time to get CHARTS, then we should be more excited about interventions that would make it easier to make agreements quickly. I discuss examples of these interventions below.

 

Measures that we could take now to promote future cooperation on AI safety

I discuss interventions that we could do now to make it easier to get AI safety agreements, including CHARTS, in the future.[58] I am excited about making it easier for countries to make AI safety agreements for two main reasons: this would increase the likelihood of risk-reducing agreements, and this would increase the likelihood that these agreements happen earlier.

Below, I summarize the measures that I considered to promote international cooperation, and my bottom lines on how promising they are. My takes in this section are rougher than elsewhere in my report. Also, my process for deciding which measures to assess was very informal; there are several measures that seem somewhat promising to me but that I do not assess here.[59]

 

Measure: Promoting generally better relations between the US and China with track 1.5 or track II dialogues.[60] I mean here dialogues that are not focused specifically on AI – dialogues that focus on AI specifically are discussed separately.

My rough bottom line and resilience: I would be excited for someone to do this, but do not think that it is a good use of time for people who could work on longtermist AI governance, given what they could be working on instead. I am 80% confident that I would have basically the same view if I thought about this for a month.

Measure: Developing, testing, and implementing measures for monitoring compute use. For example, particular hardware-enabled mechanisms to enable verifiable claims about training runs, as well as KYC rules for compute.

My rough bottom line and resilience: This seems extremely valuable to me because I think this capability is necessary to enforce CHARTS – as well as some other plausible agreements.[61] I am 80% confident that I would still think this should be a high priority for AI governance researchers if I thought about this for a month.

Measure: Contributing to informed consensus among technical experts in different countries about the nature of misalignment risk (e.g., “What is the control problem?” and “How should we feel about RLHF?”).

My rough bottom line and resilience: I think this would be net positive as long as risks are managed – which seems doable to me. The main risks that I have in mind are attention hazards about the importance of TAI, and information dissemination about how to build powerful AI systems. It seems around 80% likely to me that multiple people within longtermist AI governance should spend some of their time on this (as long as they are in a position to be able to manage the risks).[62] I could see myself going as low as a 40% likelihood if I thought about this for a month.

Measure: Building trust by having countries make and stick to commitments on AI in somewhat lower-stakes contexts, such as lethal autonomous weapons. Countries could make these commitments either unilaterally or as part of an agreement.

My rough bottom line and resilience: I think there will be various types of commitment here that are net positive. But I haven’t thought much about what the best commitments to push for would be, or how work on these commitments would compare to other things that people in AI governance might spend their time on. Note that, in this document, I don’t take into account career capital effects like “AI governance people get a better understanding of how to cause policy change” or field-building effects, e.g., “this work causes more people to join longtermist AI governance.” It seems initially plausible to me, however, that the career-capital and field-building effects are cumulatively bigger than the trust-increasing effects.

 

Tentative views about the role of international agreements in the AI governance portfolio

Based on the rest of this project, I tried to form a view about how international safety agreements should fit into the broader AI governance portfolio. My views here are particularly tentative, and I plan to think more about these questions in Q2 2023.

 

My current guess is that the longtermist AI governance community should aim to spend an average of eight FTE per year (90% CI: four to 12) on international safety agreements over the next two years (approximately). For reference, I think there are currently about 100 people working in the longtermist AI governance field, though I expect this to grow.[63]

Unless exceptional windows of opportunity arise,[64] I think this effort should be spent on strategy research and developing more concrete proposals. And I think the focus should mostly be agreements similar to CHARTS.[65]

I expect that successfully pushing for CHARTS would take a big chunk of the longtermist AI governance field. For example, if the field does decide to push for CHARTS, my weakly-held best guess is that this agreement should be the focus of 20-30 people. As a result, I think it only makes sense to push for CHARTS if a large fraction of the field would indeed work on this.[66]

 

Bibliography

Available here.

 

Acknowledgements

This research is a project of Rethink Priorities. It was written by Oliver Guest.

I am grateful to many people, inside and outside of Rethink Priorities, who have contributed to this project. Thanks in particular to Ashwin Acharya (my manager), as well as to Onni Aarne, Michael Aird, Mauricio Baker, Haydn Belfield, Matthijs Maas, Zach Stein-Perlman, attendees of GovAI and Rethink Priorities seminars where this work was discussed, and two others who gave highly valuable guidance. Thank you to Erich Grunewald and Patrick Levermore for feedback on an earlier draft, as well as contributions to the project in general. Thanks to Adam Papineau for copyediting. Mistakes are my own and other people do not necessarily endorse the claims that I make.

 

 

If you are interested in RP’s work, please visit our research database and subscribe to our newsletter.

 

 

 

  1. ^

     I am avoiding widely sharing the full version partly in order to reduce downside risks from sharing a few lines of thinking that seem potentially sensitive, and partly because the full version is written more informally and is less polished for public consumption.

    Anyone is welcome, however, to request more complete access here. If you do so, please briefly explain (a) why you think it’d be useful for you to have access, and (b) whether there are any parts that would be particularly helpful to see in full. That said, I might not approve or reply to all requests.

  2. ^

     The described agreement would be similar to the regime proposed by Shavit (2023), though Shavit also considers implementing the regime at a national level.

  3. ^

     See, for example, the regulatory regime proposed by Shavit (2023) — though Shavit also discusses how these measures could be implemented at the national level by individual countries. There are linkposts to Shavit’s paper on the EA Forum and LessWrong.

  4. ^

     Whereas the term “warning shot” implies a fairly discrete change from not being worried, to seeing the shot, to being worried.

  5. ^

     One great power might be sufficient because that country could provide incentives for other countries to join, even if audiences in the other countries are not convinced about misalignment. That said, an agreement seems much more tractable to me if Rams have happened in both countries.

  6. ^

     My understanding is that some people in the AI governance world think the Slaughterbots campaign cost some credibility, particularly among people in the US government, because some government people perceived it as sensationalist.

  7. ^

     I sometimes report “90% CIs”; these are subjective confidence intervals. Unless otherwise specified, these should be interpreted as meaning: “I’m 90% confident that, if [thing happened – e.g., spent x hours on further research, thinking, and discussion about this matter], the belief I’d end up having would be somewhere between the following two points (i.e., I think there’s only a 10% chance that, after that process, my belief would lie outside that range).”

  8. ^

     Three of the reference classes were particular kinds of agreements between countries about nuclear weapons. Two of the reference classes were different categories of multilateral agreements that countries have made since 1945.

  9. ^

     For reference, there are currently about 100 people working in the longtermist governance field (Hilton, 2023), though I expect this to grow.

  10. ^

     By “exceptional window of opportunity,” I mean an opportunity that I expect to happen in fewer than 10% of possible worlds where it would be particularly tractable or desirable to push for an agreement within approximately the next two years. This might look like, e.g., world leaders seriously expressing interest in making AI safety agreements. Though if we do get this kind of window of opportunity, I would want more than eight people to try to seize it! (If there is such a window of opportunity, my ballpark figure would be more like 20-30 people, but I haven’t thought carefully about this, and presumably it would depend on the specific situation.)

  11. ^

     This framing is maybe simplistic in assuming that windows of opportunity are purely exogenous; presumably we can also take actions that increase the likelihood of there being windows of opportunity.

  12. ^

     The 20-30 figure is the result of me informally trying to balance two considerations: On the one hand, the more people working on getting an agreement, the more likely they are to succeed. On the other hand, the AI governance field is very small (currently around 100 people), so each additional person working on this agreement would involve a trade-off against other important work.

  13. ^

     The described agreement would be similar to the regime proposed by Shavit (2023), though Shavit also considers implementing the regime at a national level.

  14. ^

     There are linkposts to Shavit’s paper on the EA Forum and LessWrong.

  15. ^

     That said, I do continue to think that this agreement would be very beneficial!

  16. ^

     To be more specific: Conditional on there being the kind of agreement that I describe here, it seems at least 40% likely to me that the agreement would (at least initially) be less formal than a treaty. I am basing this claim mainly on general International Relations knowledge, but have not spoken to anyone with legal expertise about it.

  17. ^

     I do not mean to claim here that these countries are the two most likely countries in which dangerous training runs might happen; I have not tried carefully to weigh them against, e.g., the UK.

  18. ^

     Two ways in which the adversarial nature of the US-China relationship would by default increase racing dynamics: (1) We should expect each country to have a stronger dispreference against the other country being the first to develop TAI, and so be more incentivized to race to beat the other; (2) Cooperation between them to curb racing is difficult.

  19. ^

     I particularly have in mind burdens that would be widely perceived even before a Ram. This excludes, e.g., the opportunity cost of delaying amazing AGI capabilities. I don’t give a precise operationalization of financial cost here. This is because I am primarily interested in whether the cost would be perceived as very burdensome. But I mainly have in mind (opportunity) costs in the hundreds of millions, or billions, of dollars.

  20. ^

     If people have this belief, they might not want to put in place risk-reducing measures if they think that these measures would delay AI advances.

  21. ^

     Note that I have not tried to operationalize this claim precisely and that my evidence here feels fairly anecdotal. Also, the politics around AI seem to be changing particularly quickly at the time of writing, so my claims here may quickly look out of date!

  22. ^

     As mentioned above, I don’t think these audiences have yet had Rams — but the political situation seems to be moving fast, so maybe they will have had them soon after this post is out!

  23. ^

     “Side-payments as an international bargaining tactic refer to compensation granted by policymakers to other countries (foreign policymakers or their state and societal constituencies) in exchange for concessions on other issues.” See Friman 2009, footnote 7.

    There are additional (smaller) reasons why countries might join an agreement even if people in that country are not worried about misalignment. For example, people in a given country might expect to benefit from the relative gains that the agreement would produce. Additionally, it’s generally useful for countries to join international consensus and be seen to be part of the rules-based international system.

    Side payments, relative gains, and incentives to be in the rules-based system are also reasons why additional countries (other than the US and China) might be willing to join the agreement, even if Rams have not happened in those countries.

  24. ^

     As a more specific example of this claim: I am around 70% confident that, conditional on there being a risk awareness moment among US policy elites, there would be a risk awareness moment among the equivalent audience in China within a year.

  25. ^

     That said, cultural or bureaucratic factors might cause people in different countries to respond in different ways to seeing the same event.

  26. ^

     There are exceptions here. For example, maybe groups would downplay the extent to which they have become concerned about AI, e.g., in order to prevent dangerous racing dynamics.

  27. ^

     As discussed above, maybe Rams among the general public would also be necessary.

  28. ^

     There is an additional plausible format that I did not consider but would have if I had had more time: Multilateral agreements that are led by the US and/or China. Technically, an agreement involving the US bloc would be multilateral, since there are several countries in the US bloc. Here, however, I have in mind something with a wider range of member countries.

  29. ^

     Both of those estimates might go up or down by 15 percentage points if I thought about this for an additional three months.

  30. ^

     Negotiations between many actors are generally harder than negotiations between two actors.

  31. ^

     Given that treaties tend to be slower than informal agreements, Two-Step also has an advantage over Double Treaty of likely being faster.

  32. ^

     I also think that some people may disagree with me here because they think that stigma and social pressure have a bigger role than I think they do in influencing the policy of powerful countries. For example, advocates for the Treaty on the Prohibition of Nuclear Weapons often make these kinds of arguments. See, for example, Fihn (2017).

  33. ^

     Note that I am not forecasting here the likelihood that the serious negotiations would successfully culminate in an agreement being made. That would also require forecasting (a) the length of negotiations and (b) the expected length of the window between the Rams and a PONR or existential catastrophe. I do (a) below but have not done (b).

    Also, I make my forecast conditional on there having been Rams in both the US and China, despite having claimed above that I think a Ram in just one would be sufficient. This is because (as explained above) I think that Rams are likely to occur at similar times in both. This scenario is thus the more relevant one to make forecasts about.

  34. ^

     I use “median-seniority FTEs” as an easy unit – but plausibly it would in fact be better to have a smaller number of senior people or a larger number of junior people, or some mix. My 10 FTE figure does not include people in complementary roles such as operations.

  35. ^

     My understanding is that some people in the AI governance world think the Slaughterbots campaign cost some credibility, particularly among people in the US government, because some government people perceived it as sensationalist. Once the political impact of more recent developments, such as the TIME article by Eliezer Yudkowsky, becomes clearer, maybe this could also be used as a benchmark.

  36. ^

     This is forecasting the likelihood that these countries would do step 1 of my Two-Step model.

  37. ^

     This is forecasting the likelihood that these countries would start serious negotiations for step 2 of my Two-Step model.

  38. ^

     Ideally I would do this step by multiplying together probability distributions rather than point estimates.

  39. ^

     I am >90% confident that no good dataset has been published on a question similar to “what % of proposed agreements between countries actually get made?” This claim is based on having looked hard for this data, and on expecting this data to be hard enough to compile that no one would have done it. But plausibly there are narrower reference classes that would be helpful and that I should use here.

  40. ^

     As a reminder: It seems around 60% likely to me that, given the above conditions, the US and some allies would make an informal agreement about what kind of agreement with China they would like to make.

  41. ^

     Relatedly, there might be principled stances against making major agreements with China from some influential constituencies within the US and its allies.

  42. ^

     As a reminder: Conditional on the earlier two conditions and the above outcome, I am 70% confident that serious negotiations would start between the US bloc and China.

  43. ^

     Philip Tetlock promotes this approach, e.g., in his book Superforecasting. For a good overview of this approach, see this blog post. Kokotajlo (2021) notes that “outside view” has acquired several meanings. I am using it in the sense of “reference class forecasting.”

  44. ^

     I consider it a limitation that the reference class data excludes time before the start of formal negotiations but where progress was still being made towards an agreement.

    90% of the points in the weighted average distribution are between 2.1 and 9.4 years. I take these numbers into account when thinking about my inside-view probability distribution – described below. Based on my limited statistical understanding, however, these numbers are not themselves meaningful. The weighted average approach assumes that the reference classes are all independent, whereas, in fact, some agreements are in multiple reference classes, making the reference classes somewhat correlated. This flaw means that there will appear to be less variance in the data than there is. I try to adjust for this when subjectively forming my inside-view probability distribution. This feels dubious – but also like a better use of time than attempting to fix the statistical issue.

  45. ^

     The time period is from the beginning of formal negotiations to there being a signed agreement.

  46. ^

     The time period is from the beginning of formal negotiations to there being a signed agreement.

  47. ^

     The time period is from when the country entered the NPT to when the CSA came into force for that country.

  48. ^

     I use a dataset compiled by Simonelli. The time period is from the first written proposal for an agreement, to signature (Simonelli 2011, p158).

  49. ^

     As with the previous reference class, I use a dataset compiled by Simonelli and the time period is from the first written proposal for an agreement, to signature (Simonelli 2011, p158).

  50. ^

     I am imagining a roughly lognormal probability distribution.

  51. ^

     On the differences between agreements about nuclear weapons and agreements about AI, see Maas (2019). On the other hand, the agreement in this report is based on regulating compute. This reduces some of the difficulties that are sometimes discussed around verifying AI agreements. For example, it appears to be simpler to outline what is allowed in an agreement by focusing on compute rather than software.

  52. ^

     But I have thought about this topic less than other parts of my project. My 90% CI is that my 70% figure could turn into something between 40% and 80% if I thought about it for two months.

  53. ^

     The main reason why I do not give this consideration more weight is an assumption that the time during the negotiations would be sufficient for countries to figure out the details of verification — particularly given that I am expecting countries to be very motivated to get CHARTS in place as quickly as possible. I have not, however, looked at reference classes of how long it generally takes to establish verification measures.

  54. ^

     By “more willing,” I mean more willing to accept the opportunity costs and downside risks of a given intervention.

  55. ^

     The desirability of raising awareness about AI is very contested! See for example, recent debates on the EA Forum and LessWrong about the recent open letter organized by FLI, and Eliezer Yudkowsky’s article in Time.

  56. ^

     For example, awareness raising to cause earlier Rams would be pointless for our purposes here, if it also has the effect of shortening timelines, e.g., by motivating a crash project.

  57. ^

     The desirability of slowing AI is also contested! See, e.g., Katja Grace’s post Let’s think about slowing down AI and the response to it. See also Muehlhauser (2021): “I can say with some confidence that there is very little consensus on which intermediate goals are net-positive to pursue, or even on more fundamental questions such as ‘Is AI x-risk lower if AI progress is faster or slower?’”

  58. ^

     By “now,” I particularly mean “even before Rams have happened among key audiences.”

  59. ^

     See, for example, some of the measures suggested in Imbrie and Kania (2019).

  60. ^

     See, e.g., Clare (2021) for more detail. Clare is talking about reducing risks from great power conflict, but this seems like a similar problem to trying to promote cooperation between great powers.

  61. ^

     This measure also seems helpful for other possible AI governance interventions, e.g., agreements between labs to promote safety.

  62. ^

     As a quick best guess, if we assume that the field is 100 people, maybe I would want this to have a total of between 0.5 and 1 FTEs (if we assume that the FTE is of median-seniority, relative to the longtermist AI governance field).

  63. ^

     I take the 100 people claim from Hilton (2023). Both my eight FTE figure and Hilton’s 100 FTE figure exclude, e.g., operations support for these people. My eight FTE figure also does not include work that is relevant to international agreements, but that mostly has other theories of change. This includes, for example, developing safety benchmarks; these could be used to determine which training runs are permitted under an international agreement, but are also useful for a range of other AI governance interventions.

  64. ^

     By “exceptional window of opportunity” I mean an opportunity that I expect to happen in fewer than 10% of possible worlds where it would be particularly tractable or desirable to push for an agreement within approximately the next two years. This might look like, e.g., world leaders seriously expressing interest in making AI safety agreements. Though if we do get this kind of window of opportunity, I would want more than eight people to try to seize it! (If there is such a window of opportunity, my ballpark figure would be more like 20-30 people, but I haven’t thought carefully about this, and presumably it would depend on the specific situation.) A limitation of the “window of opportunity” framing is that it implies that windows of opportunity are exogenous events. In fact, we could maybe do things to bring these windows about.

  65. ^

     Though I am aware that I do not justify the claim anywhere here that this type of international safety agreement is more promising than possible alternatives.

  66. ^

     Given that 30 is such a small number of people, these people would need to focus hard on leveraging their efforts – such as by focusing on convincing powerful institutions that the agreement is worth pursuing. The 20-30 figure is the result of me informally trying to balance two considerations: On the one hand, the more people working on getting an agreement, the more likely they are to succeed. On the other hand, the AI governance field is very small. This means that each additional person working on this would involve tough trade-offs about what other work to neglect.

Comments

I've just made a correction to the table here. I had previously copy-pasted the wrong values, meaning that the column "Mean negotiating time" had the values for the maximum (rather than mean) time that it took to get an agreement for each reference class. Sorry for the error.

Very interesting! Could someone recommend existing work looking at how good (or bad) such international agreements would be for AI safety?