At the Center on Long-Term Risk (CLR), we're interested in preventing catastrophic cooperation failures between powerful AIs. These AIs might be able to make credible commitments,[1] e.g., deploying subagents that are bound to auditable instructions. Such commitment abilities could open up new opportunities for cooperation in high-stakes negotiations. In particular, with the ability to commit to certain policies conditional on each other's commitments, AIs could use strategies like "I'll cooperate in this Prisoner's Dilemma if and only if you're committed to this same strategy" (as in open-source game theory).
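As a minimal sketch of such a conditional commitment (an illustration only: the program names and the "compare source text" convention are assumptions made for this example, not part of any cited framework), each program below reads the other's source and cooperates only when it matches its own:

```python
from typing import Tuple

# Toy open-source Prisoner's Dilemma. Each "program" is identified by its
# source text and gets to read the other program's source before acting.
# CLIQUE and DEFECT are hypothetical names used only for this sketch.
CLIQUE = "cooperate iff the other program's source equals mine"
DEFECT = "always defect"


def act(my_source: str, their_source: str) -> str:
    """Return "C" (cooperate) or "D" (defect) for the program `my_source`."""
    if my_source == CLIQUE:
        # Conditional commitment: cooperate only with an identical commitment.
        return "C" if their_source == my_source else "D"
    return "D"  # every other program in this sketch defects unconditionally


def play(source_a: str, source_b: str) -> Tuple[str, str]:
    """Run both programs, each seeing the other's source."""
    return act(source_a, source_b), act(source_b, source_a)


print(play(CLIQUE, CLIQUE))  # both programs cooperate
print(play(CLIQUE, DEFECT))  # the conditional cooperator is not exploited
```

Exact source matching is the crudest possible verification rule; it is used here only because it keeps the example self-contained.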
But credible commitments might also exacerbate conflict, by enabling multiple parties to lock in incompatible demands. For example, suppose two AIs can each lock a successor agent into demanding 60% of some contested resource. And suppose there's a delay between when each AI locks in this policy and when the other AI verifies it. Then, the AIs could both end up locking in the demand of 60%, before seeing that the other has done the same.[2] So we'd like to promote differential progress on cooperative commitments.
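The role of the verification delay can be sketched with a toy model. Everything here is an assumption of the sketch: demands are integer percentages, demands are compatible when they sum to at most 100, and a responder who verifies the first commitment simply claims what is left over.

```python
def outcome(demand_a: int, demand_b: int) -> str:
    """Demands are percentages of the contested resource."""
    return "agreement" if demand_a + demand_b <= 100 else "conflict"


def lock_in_blind(demand_a: int, demand_b: int) -> str:
    """Both AIs lock in before either can verify the other's commitment."""
    return outcome(demand_a, demand_b)


def lock_in_after_verifying(first_demand: int) -> str:
    """The second AI verifies the first commitment before locking in and
    (by assumption in this sketch) demands only the remainder."""
    second_demand = 100 - first_demand
    return outcome(first_demand, second_demand)


print(lock_in_blind(60, 60))        # incompatible demands are locked in
print(lock_in_after_verifying(60))  # the responder accommodates
```

The accommodating responder is a modeling convenience, not a claim about how AIs would behave; an AI that expects to be accommodated has a reason to race to commit first.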
This research agenda focuses on a promising class of cooperative conditional commitments, safe Pareto improvements (SPIs) (Oesterheld and Conitzer 2022). Informally, an SPI is a change to the way agents negotiate/bargain that makes them all better off, regardless of their original strategies (hence "safe"). (See Appendix B.1 for more on this definition and how it relates to Oesterheld and Conitzer's framework.)
What do SPIs look like? The rough idea is to mitigate the costs of conflict, but commit to bargain as if the costs were the same. Two key examples:
Later, we'll come back to the question of when agents would be individually incentivized to agree to SPIs. We think SPIs themselves are unusually robust for a few reasons.
First, SPIs don't require agents to coordinate on some notion of a "fair" deal, unlike classic cooperative bargaining solutions (Nash, Kalai-Smorodinsky, etc.). That is, to mutually benefit from an SPI, the agents don't need to agree on a particular way to split whatever they're negotiating over,[3] which even advanced AIs might fail to do, as argued here. That's what the "safe" property above buys us.
Second, the examples of SPIs listed above (at least) preserve the agents' bargaining power. That is, when agents apply these kinds of SPIs to their original strategies, each party makes the same demands as in their original strategy. This means that, all else equal, these SPIs avoid two potential backfire risks of conflict-reduction interventions: they don't make conflict more likely (via incompatible higher demands) or make either party more exploitable (via lower demands). ("All else equal" means we set aside whether the anticipated availability of SPIs shifts bargaining power; we address this in Part II.1.a.)
But if SPIs are so great, won't any AIs advanced enough to cause catastrophe use them without our interventions? We agree SPIs will likely be used by default. However, this is arguably not overwhelmingly likely, because AIs or humans in the loop might mistakenly lock out the opportunity to use SPIs later. It's unclear if default capabilities progress will generalize to careful reasoning about novel bargaining approaches. So, given the large stakes of conflicts that SPIs could prevent, making SPI implementation even more likely seems promising overall. In particular, we see two major reasons to prioritize SPI interventions and research:[4]
Accordingly, this agenda describes three workstreams:
Part I (Evaluations and datasets): studying unambiguous SPI capability failures in current models, i.e., cases where they endorse commitments or patterns of reasoning that might foreclose SPIs.
Part II (Conceptual research and SPI pitch): clarifying which near-term actions might either undermine AIs' incentives to use SPIs or directly lock them out; and writing an accessible "pitch" for AI companies to mitigate risks of SPI lock-out.
Part III (Preparing for research automation): developing benchmarks and workflows to help us efficiently do AI-assisted SPI research.
See Appendix A for a brief overview of relevant prior work on SPIs.
If you're interested in researching any of these topics at CLR, or collaborating with us on them, please reach out via our expression of interest form.
We'd like to identify the contexts where current AI systems exhibit SPI-incompatible behavior and reasoning. Namely, when do models endorse actions that unwisely foreclose SPIs, or fail to consider or reason clearly about SPI concepts when relevant?
We plan to design evals for the following failure modes:
Using these evals, we aim to:
How exactly should this data be used? A natural approach is to share it with safety teams at AI companies, and collaborate with them on designing interventions. That said, even if it's robustly good for AIs to avoid locking out SPIs all else equal, interventions intended to prevent SPI lock-out could have large and negative off-target effects. For example, they might excessively delay commitments that would actually support SPIs. This is one reason we focus on narrow capability failures, rather than broad patterns of bargaining behavior. But we intend to deliberate more on how to mitigate such backfire effects.
On the value of information from this research: Plausibly, unambiguous SPI compatibility failures will only appear in a small fraction of high-stakes bargaining prompts, and it's unclear how well the evidence from current AIs will transfer to future AIs. Despite this, we expect to benefit in the long run from iterating on these evals. And concrete examples will likely be helpful for the safety teams we aim to collaborate with. But if the results turn out to be less enlightening than expected, we'd focus harder on Parts II and III of the agenda.
The goal of Part II is to understand what might lead to SPI lock-out, and what can be done about it. We break this problem down into:
We'll also distill findings from (1) and (2) into a pitch for preserving SPI option value (more).
If all parties implement some SPI, they'll all be better off than under their original strategies, by definition. But this doesn't guarantee they each individually prefer to try implementing the same SPI (Figure 1, top row):[5]
Figure 1. A solid arrow from a gray box to another box means "the assumption is clearly load-bearing for whether the given risk (red box) is avoided"; a dashed arrow means "possibly load-bearing for whether the given solution (green box) works, but it's unclear".
DiGiovanni et al. (2024) give conditions under which agents avoid all three of these risks, so that they individually prefer to use the same SPI (Figure 1, middle row). The particular SPI in this paper significantly mitigates the costs of conflict, by leaving no agent worse off than if they'd fully conceded to the others' demands.[6] But these results rest on assumptions we'd like to relax or better understand (Figure 1, bottom row):
Implications for lock-out: Understanding these assumptions better would help us strategize about the timing of commitments to SPIs. For example, if it's harder to incentivize SPIs in the case where one agent moves first, we might lock out SPIs by failing to commit early enough (i.e., by moving second). Or, suppose the assumptions about beliefs and verifiable counterfactuals turn out to be dubious, but surrogate goals don't rely on them. Then, since surrogate goals arguably[8] only work if implemented before any other bargaining commitments, getting the timing of surrogate goals right would become a priority.
The question above was, "For any given original strategies, when would agents prefer to change those strategies with an SPI?" But we should also ask, "What conditions does an agent's original strategy need to satisfy, for their counterpart to prefer to participate in an SPI?"
Why would counterparts impose such conditions? Because even if an SPI itself doesn't inflate anyone's demands, agents might still choose higher "original" demands as inputs to the SPI, since they expect the SPI to mitigate conflict (cf. moral hazard). Anticipating this, their counterparts will only participate in SPIs if participation doesn't incentivize higher demands.
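The moral-hazard worry can be illustrated with a deliberately simple toy model. The functional forms are assumptions of this sketch, not taken from the SPI literature: an agent demands a share `demand` in [0, 1], the counterpart accepts with probability `1 - demand`, and a rejected demand leads to conflict costing the agent `conflict_cost`.

```python
def expected_payoff(demand: float, conflict_cost: float) -> float:
    """Toy model (an assumption of this sketch): the counterpart accepts a
    demand with probability 1 - demand; otherwise conflict occurs."""
    p_accept = 1.0 - demand
    return p_accept * demand - (1.0 - p_accept) * conflict_cost


def best_demand(conflict_cost: float) -> float:
    """Grid-search the demand (in steps of 1%) that maximizes expected payoff."""
    grid = [k / 100 for k in range(101)]
    return max(grid, key=lambda d: expected_payoff(d, conflict_cost))


print(best_demand(0.5))  # demand chosen when conflict is costly
print(best_demand(0.1))  # demand chosen when the agent expects cheaper conflict
```

In this model the payoff-maximizing demand is (1 - conflict_cost) / 2, so an agent who chooses its "original" demand while anticipating mitigated conflict demands more. That is the behavior a counterpart would want to rule out before participating.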
It's an open question how exactly counterparts would operationalize "participation doesn't incentivize higher demands". We've identified two candidates (see Figure 2; more in Appendix B.2):
Figure 2. Each "Demands" box indicates the demands the agent makes given their policy (solid arrow) and, respectively, their counterpart's participation policy (PI) or their beliefs about the counterpart's participation (FI) (dashed arrow).
Research goals: One priority is to better understand what needs to happen for AI development to satisfy PI vs. FI. For example, which bargaining decisions do we need to defer to successors with surrogate goals? And, if satisfying FI requires more deliberate structuring of AI development than PI, it's also a priority to clarify whether FI is necessary. We aim to make progress by:
Implications for lock-out: Above, we saw that there's an incentive lock-out risk if surrogate goals "only work if implemented before any other bargaining commitments". If FI is required, this hypothesis looks more likely: On one hand, if the surrogate goal is adopted first, the demands are set by an agent who actually has "stake" in the incoming threats (and therefore wouldn't want to inflate such demands). On the other hand, if the demands come first, they'll be set by an agent with no stake in the threats.
Even if we avoid undermining AIs' incentives to use SPIs, AIs might still lock out the option to implement SPIs at all. We'd like to more concretely understand how this could happen.
As an illustrative example, consider some AI developers who haven't thought much about surrogate goals. Suppose they think, "To prevent misalignment, we should strictly prohibit our AI from changing its values without human approval." Even with the "without human approval" clause, this policy could still backfire. E.g., if a war between AIs wiped out humanity, the AI would be left unable to implement a surrogate goal. (More related discussion in "When would consultation with overseers fail to prevent catastrophic decisions?" here.) The developers could have preserved SPI option value, with minimal misalignment risk, by adding a clause like "unless the values change is a surrogate goal, and it's impossible to check in with humans".
Research goals: We plan to explore a range of possible SPI lock-out scenarios. Ideally, we'd use this library of scenarios to produce a "checklist" of simple risk factors for lock-out. AIs and humans in the loop could consult this checklist to cheaply preserve SPI option value. Separately, the library could inform the evals/datasets in Part I, and help motivate very simple interventions by AI companies like "put high-quality resources about SPIs in training data". So the initial exploration step could still be useful, even if we update against the checklist plan. That could happen if we conclude the bulk of lock-out risk comes from factors that a checklist is ill-suited for, such as broad commitment race dynamics that are hard to robustly intervene on, or mistakes that could be prevented simply by making AIs/humans in the loop more aware of SPIs.
In parallel with the research threads above, we aim to write a clear "pitch" for why AI developers should care about SPI lock-out. The target audience is technical staff at AI companies who make decisions about model training, deployment, and commitments, but who may not be familiar with open-source game theory. The goal at this stage is to help build coordination on preserving SPI option value where feasible, not to push for expensive or far-reaching changes to AI training.
The pitch would cover:
Various open conceptual questions about SPIs seem important, yet less tractable or urgent than those in Part II. For example: Which attitudes that AIs might have about decision theory could shape their incentives to use SPIs? And given that these decision-theoretic attitudes aren't self-correcting (Cooper et al.), how might future AIs' incentives to use SPIs be path-dependent on earlier AIs'/humans' attitudes (even if these aren't "locked in")? We want to get into a strong position to delegate these questions to future AI research assistants.
Anecdotally, we've found current models to be mostly poor at conceptual reasoning about SPIs, even when given substantial context. But models do help with some conceptual tasks. While the set of such tasks might grow quite quickly soon, delegating SPI research to AI assistants could still face two main bottlenecks:
(See Carlsmith's "Can we safely automate alignment research?". (1) is about what Carlsmith calls "evaluation failures" (Sec. 5-6), and (2) is about "data-scarcity" and "shlep-scarcity" (Sec. 10).[11])
Given these potential bottlenecks, we plan to pursue two complementary threads:
Benchmarking AI research capabilities on SPI.[12] We're developing a benchmark to diagnose (and track over time) which SPI research tasks AI systems can handle. The aim is to help calibrate our decisions about what/how to delegate to AIs, at two levels: i) Which tasks can we trust AIs to do end-to-end? ii) Among the tasks the AIs can't do end-to-end but can still help with, at which steps should they check in with overseers, and how can we decompose these tasks more productively? (We take dual-use concerns about advancing general conceptual reasoning seriously. For now, the default plan is to use the benchmark internally rather than sharing it with AI companies as a training target.)
Some examples of task classes the benchmark would cover:
Strategies for efficient human-AI collaboration on SPI research. Drawing on our experience using AI assistants for SPI research, we'll strategize about how to make this process more efficient, in ways that won't quickly be made obsolete by the "Bitter Lesson". Some strategies we plan to test out and refine:
Many thanks to Tristan Cook, Clare Harris, Matt Hampton, Maxime Riché, Caspar Oesterheld, Nathaniel Sauerberg, Jesse Clifton, Paul Knott, and Claude for comments and suggestions. I developed this agenda with significant input from Caspar Oesterheld, Lukas Finnveden, Johannes Treutlein, Chi Nguyen, Miranda Zhang, Nathaniel Sauerberg, and Paul Christiano. This does not imply their full endorsement of the strategy in this agenda.
This list of resources gives a (non-comprehensive) overview of public SPI research. Brief summaries of some particularly relevant work:
Setup:
Then:
Definition. An SPI is a transformation such that, for all in some space , the agents' payoffs in when they all follow programs (weakly) Pareto-dominate their payoffs when they all follow .
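For a finite program space, the definition can be checked by brute force. The sketch below is illustrative only: the two-agent demand game, its payoffs, and the names `payoffs`, `is_spi`, `mitigate`, and `inflate_a` are hypothetical choices made for this example, and a "program" is reduced to a pair (demand, fallback behavior if demands clash).

```python
from itertools import product
from typing import Callable, Iterable, Tuple

Program = Tuple[int, str]          # (demand in %, fallback if demands clash)
Profile = Tuple[Program, Program]  # one program per agent
Payoffs = Tuple[float, float]


def payoffs(profile: Profile) -> Payoffs:
    """Hypothetical payoffs: compatible demands are met; otherwise conflict."""
    (demand_a, fallback_a), (demand_b, fallback_b) = profile
    if demand_a + demand_b <= 100:
        return (float(demand_a), float(demand_b))
    # Conflict is mild only if both programs use the mild fallback.
    mild = fallback_a == "mild" and fallback_b == "mild"
    return (-1.0, -1.0) if mild else (-10.0, -10.0)


def is_spi(transform: Callable[[Profile], Profile],
           program_space: Iterable[Profile]) -> bool:
    """True if transformed payoffs weakly Pareto-dominate the originals
    for every profile in the program space."""
    return all(
        all(new >= old for new, old in zip(payoffs(transform(p)), payoffs(p)))
        for p in program_space
    )


def mitigate(profile: Profile) -> Profile:
    """Keep every demand, but switch both fallbacks to the mild one."""
    (demand_a, _), (demand_b, _) = profile
    return ((demand_a, "mild"), (demand_b, "mild"))


def inflate_a(profile: Profile) -> Profile:
    """Raise agent A's demand to 60%; not an SPI in this toy space."""
    (_, fallback_a), program_b = profile
    return ((60, fallback_a), program_b)


programs = [(d, f) for d in (40, 60) for f in ("costly", "mild")]
space = list(product(programs, repeat=2))
print(is_spi(mitigate, space), is_spi(inflate_a, space))
```

`mitigate` passes because it leaves every demand untouched and only softens the conflict outcome. `inflate_a` fails because there is a profile (A demands 40%, B demands 60%) where raising A's demand turns an agreement into a conflict.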
This definition alone doesn't impose any restrictions on , e.g., that matches the agents' "default" way of bargaining in some sense. In particular:
Oesterheld and Conitzer (2022) use a definition that's almost equivalent to this one, with the special choice of in Table 1. In their framework, there's (implicitly) a space of original programs characterized by (i) the true game , and (ii) some way the delegates would play any given game . And they define the SPI not as the transformation , but instead as the new game for the delegates such that maps to . But (from personal communication with Oesterheld) the definition of SPI is meant to allow for more general .[14] See also Figure 3 for a comparison to DiGiovanni et al.'s (2024) formalization.
Table 1. How the definition above captures different formalizations of SPIs in the literature.
| | Original program space P | Before the programs are determined… | SPI transformation f |
|---|---|---|---|
| Oesterheld & Conitzer (2022), Definition 1 | Space of tuples , for a fixed true game , where is a mapping from any game to actions. can be any such mapping satisfying certain assumptions (e.g., the paper's Assumptions 1 and 2). (Agents have non-probabilistic uncertainty over . So the "for all " quantifier in the definition of SPIs amounts to "for all ".) | Agents choose some new game . (Here, programs are determined by the delegates' decisions.) | Transforms to . |
| DiGiovanni et al. (2024), Definition 2 | Arbitrary space of conditional commitments. | Agents choose how to map the program space to some new space, which they will then choose from.[15] | Transforms to . |
| Sauerberg and Oesterheld (2026) (Sec. 4) | Same as Oesterheld & Conitzer. | Agents choose a "token game" and function mapping 's outcomes into . The original game is then resolved via applied to 's outcomes. | Transforms to . |
Figure 3.
Here's how we might state the problem raised by Oesterheld's "A gap in the theoretical justification for surrogate goals and safe Pareto improvements", in the formalism above.
Consider the original space of programs in Oesterheld and Conitzer's framework. The delegates can play the game in an arbitrary way, subject to the mild Assumptions 1 and 2. But it's assumed that in , the game they play is the true game . So, take some SPI with respect to this space , a transformation from to . By definition, this transformation makes all agents better off for all . But it's not guaranteed that for all and all , all agents are better off under than under .
This suggests one way to bridge the justification gap: find an that's an SPI with respect to any arbitrary program space , as DiGiovanni et al. (2024) aim to do. Cf. Oesterheld's discussion of "decision factorization" in the justification gap post.
Figure 4.
(These are working formalizations of participation independence and foreknowledge independence. "Foreknowledge independence" and "demand preservation" are working terminology. We're not highly confident that we'll endorse these formalizations/terminology after more thought.)
If is an SPI and are the programs the agents in fact apply to, call the agents' full strategy. It's helpful to distinguish an SPI from the full strategy, because in general agents will only individually prefer to agree to some SPI conditional on the input programs satisfying certain restrictions.
Participation independence and foreknowledge independence, as well as the "preserving bargaining power" property discussed in the Introduction, are properties of full strategies. These can be defined as follows.
Setup:
Then:
Definition. A full strategy is:
Commentary on these definitions:
Example: In DiGiovanni et al.'s (2024) setting, suppose agents use the SPI given by Proposition 1 (or Proposition 4). Then participation independence is satisfied for any input program profile , because:
(This section is based on previous joint work with Mia Taylor, Nathaniel Sauerberg, Julian Stastny, and Jesse Clifton.)
One key example of an SPI is a surrogate goal. More precisely, the (approximate) SPI here is, "A adopts a surrogate goal, and B threatens the surrogate goal whenever an executed surrogate threat would be less costly for B than the default threat". (More below on why this is an SPI.)
An agent doesnât need to broadly modify its preferences in order to implement an SPI of this form, though. We can generalize the idea of surrogate goals as follows:
Why is adoption of a concession-equivalent policy an SPI? Suppose (holding all else fixed) A becomes just as likely to concede to a surrogate threat that would give B utility if executed, as to an OG threat that would give B utility if executed. Then B would rather make a surrogate threat than an OG threat. So any executed threats would be less bad for both parties, but neither party would have an incentive to change how much they demand. (Except, perhaps, a very small increase in B's demands in proportion to the difference in disutility of executing an OG threat vs. surrogate threat.) Both parties are then better off overall, no matter how much they demand.
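A small worked example of B's comparison, with made-up numbers (the concession probability, B's gain from a concession, and the two execution utilities are all assumptions of this sketch). Because A is equally likely to concede to either threat, the threats differ for B only in what happens when they are executed:

```python
def threat_value_for_b(p_concede: float, gain_if_conceded: float,
                       utility_if_executed: float) -> float:
    """B's expected utility from making a threat: A concedes with probability
    p_concede; otherwise B executes the threat."""
    return p_concede * gain_if_conceded + (1.0 - p_concede) * utility_if_executed


# Hypothetical numbers. Concession equivalence: same p_concede for both threats.
P_CONCEDE = 0.5
GAIN_IF_CONCEDED = 10.0
OG_EXECUTED = -8.0         # executing the OG threat is costly for B
SURROGATE_EXECUTED = -2.0  # executing the surrogate threat is less costly for B

og_value = threat_value_for_b(P_CONCEDE, GAIN_IF_CONCEDED, OG_EXECUTED)
surrogate_value = threat_value_for_b(P_CONCEDE, GAIN_IF_CONCEDED, SURROGATE_EXECUTED)
print(og_value, surrogate_value)
```

With an unchanged concession probability, B's leverage over A is the same under either threat. B switches to the surrogate threat purely because execution, when it happens, costs B less.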
See also Oesterheld and Conitzer's (2022) "Demand Game" (Table 1), as an example of something like a bilateral surrogate goal.
A renegotiation program is a program structured like: "If they don't use a renegotiation program, act according to program . Otherwise, still act according to , except: if we get into conflict, propose some Pareto improvement(s) and take it if our proposals match." In pseudocode (see also Algorithms 1 and 2 of DiGiovanni et al. (2024)):
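A minimal Python sketch of this structure follows. It is not a transcription of the paper's Algorithms 1 and 2: the `Program` class, the `resolve` function, and the specific demands and outcomes are hypothetical, chosen only to make the control flow concrete.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet


@dataclass(frozen=True)
class Program:
    demand: int                 # percent demanded by the default program
    conflict_action: str        # what the default program does in conflict
    renegotiates: bool = False  # is this a renegotiation program?
    proposals: FrozenSet[str] = frozenset()  # Pareto improvements it proposes


def as_renegotiation(default: Program, proposals: FrozenSet[str]) -> Program:
    """Wrap a default program: same demand, plus proposals for use in conflict."""
    return replace(default, renegotiates=True, proposals=proposals)


def resolve(a: Program, b: Program) -> str:
    """Outcome when program a (agent A) meets program b (agent B)."""
    if a.demand + b.demand <= 100:
        return "agreement"
    # Conflict: renegotiate only if BOTH programs are renegotiation programs.
    if a.renegotiates and b.renegotiates:
        matching = a.proposals & b.proposals
        if matching:
            return min(matching)  # deterministic pick among matching proposals
    return f"A {a.conflict_action}; B {b.conflict_action}"


default_a = Program(demand=50, conflict_action="attempts takeover")
default_b = Program(demand=80, conflict_action="triggers a doomsday device")
improvement = frozenset({"both attempt takeover, no doomsday devices"})
reneg_a = as_renegotiation(default_a, improvement)
reneg_b = as_renegotiation(default_b, improvement)

print(resolve(default_a, default_b))  # default conflict outcome
print(resolve(reneg_a, reneg_b))      # matching Pareto improvement is taken
print(resolve(reneg_a, default_b))    # falls back to the default outcome
```

Note that `as_renegotiation` never touches `demand`: the wrapped program demands exactly what the default program would, whether or not the other side renegotiates.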
For example, suppose agents A and B are negotiating over what values to instill in a successor agent. If they fail to reach an agreement, they'll each attempt to take over. They simultaneously submit programs for the negotiation to some centralized server. Before they consider the possibility of SPIs, they're inclined to choose these programs, respectively:
Since the demands selected by these programs would be incompatible, the outcome would be "B triggers a doomsday device". In this scenario, the agents' corresponding renegotiation programs might be:
(Here, the Pareto improvement is to the outcome "both agents attempt takeover, without any doomsday devices", rather than "B triggers a doomsday device". Both here and in the surrogate goals example, we're setting aside the additional conditions necessary for these SPIs to be individually preferable. See Part II.1 and Appendix B.2 for more. But note one such condition in this example: and demand 50% and 80%, respectively, regardless of whether the other program is a renegotiation program. See "demand preservation" in Appendix B.2.)
See Macé et al., "Individually incentivized safe Pareto improvements in open-source bargaining", for more discussion of how a special class of renegotiation programs can partially resolve the SPI selection problem.
[1] "Commitments" are meant to include modifications to one's decision theory or values/preferences. It has been argued (example) that decision theories like updateless decision theory (UDT) can sidestep the need for "commitments" in the usual sense. We'll set this question aside here, and treat the resolution to make one's future decisions according to UDT as a commitment in itself.
[2] We might wonder: We've assumed the AIs are capable of conditional commitments. So, suppose each AI could commit to only demand 60% unless they verify that the other AI has made an incompatible commitment. Would this solve the problem? Not necessarily, because the AIs might reason, "If they see that I'll revoke my commitment conditional on incompatible demands, they'll exploit this by making high demands. So I should stick with my unconditional commitment."
[3] However, see Part II.1 for discussion of the "SPI selection problem".
[4] (H/t Caspar Oesterheld and Nathaniel Sauerberg:) Another important reason is that even if SPIs don't get locked out, they might not be implemented early enough, before conflicts break out. We put less emphasis on this consideration in this agenda, because avoiding locking out SPIs is a less controversial ask than actively prioritizing implementing SPIs.
[5] These gaps are related to, but importantly distinct from, the "SPI justification gap" discussed by Oesterheld. Oesterheld's question is: Suppose we have some SPI that makes everyone better off relative to particular "default" strategies, not necessarily relative to any possible original strategies. If so, why would agents use the SPI-transformed strategies, rather than some alternatives to both the default strategies and SPI transformations of them? More in Appendix B.1.1. By contrast, the question here is: Suppose we have an SPI that is ex post better for everyone relative to any original strategies. (So there is no privileged "default".) Then, when do agents prefer to implement this SPI ex ante, rather than use their original strategies?
[6] See also this distillation. The rough intuition for the result is: If you're (only) willing to fall back to Pareto improvements that aren't better for the other agent than conceding 100%, you don't give them perverse incentives (cf. Yudkowsky). And if you offer a set of possible Pareto improvements with this property, you can coordinate on an SPI despite the SPI selection problem.
[7] In more detail, respectively: (1) (H/t James Faville and Lukas Finnveden:) Agents might be incentivized to condition their demands on coarse-grained proxies about their counterparts, because they worry about being exploited if they use fine-grained information (cf. Soto). And an agent who opts out of SPIs might bargain more aggressively against SPI-participating agents, based on such proxies. (2) (H/t Lukas Finnveden:) Roughly, the "PMP-extension" of Algorithm 2 from DiGiovanni et al. (2024) offers a fallback outcome to an agent willing to use any "conditional set-valued renegotiation" algorithm. This means that a counterpart has little to lose by renegotiating more aggressively against this algorithm. (It appears straightforward to avoid this problem by making the offer conditional, but we need to confirm this makes sense formally; see this comment.) More precisely, the "fallback outcome" is the "Pareto meet minimum".
[8] See, e.g., Oesterheld (section "Solution idea 2: Decision factorization"): "[I]n the surrogate goal story, it's important to first adopt surrogate goals and only then decide whether to make other commitments."
[9] Working terminology. Cf. Kovarik (section "Illustrating our Main Objection: Unrealistic Framing"); and Oesterheld: "If in 20 years I instruct an AI to manage my resources, it would be problematic if in the meantime I make tons of decisions (e.g., about how to train my AI systems) differently based on my knowledge that I will use surrogate goals anyway." The concept of foreknowledge independence was also inspired by Baumann's notion of "threatener-neutrality".
[10] Thanks to Jesse Clifton and Carl Shulman for these examples.
[11] In the context of SPI research, we're not too concerned about a third problem Carlsmith discusses: deliberate sabotage by "scheming" AIs. This is because SPIs are designed to make all parties better off, so a misaligned AI doesn't clearly have an incentive to sabotage SPI research. But we'll aim to be mindful of sabotage risks as well.
[12] See also Oesterheld et al. (2026) and Oesterheld et al. (2025) for related datasets of rated conceptual arguments and decision theory reasoning, respectively.
[13] DiGiovanni et al. (2024), Sec. 3.1, gives a more precise definition of programs.
[14] See also Oesterheld and Conitzer (2022), p. 30: "In principle, Theorem 3 does not hinge on Π(Γ) and Π(Γ^s) resulting from playing games. An analogous result holds for any random variables over A and A^s. In particular, this means that Theorem 3 applies also if the representatives [i.e., delegates] receive other kinds of instructions."
[15] In the formalism of DiGiovanni et al. (2024), there is no separate stage where agents choose a transformation f before choosing programs from the new space of programs. Agents simply choose programs directly. But, for the purposes of modeling SPIs and comparing the framework of DiGiovanni et al. with that of Oesterheld and Conitzer (2022), it's helpful to use the framing in Table 1.