Using AI and cryptography to make cooperation rational – even when trust is impossible.
Epistemic status: Conceptual sketch. I believe this idea is worth exploring with small practical experiments, but I have not yet tested it. Written in collaboration with Claude Opus 4.5 (Anthropic), where I contributed the core idea and Claude developed the theoretical context and drafted the text.
Consider the following scenario: Two leading AI laboratories – Lab A and Lab B – both recognise that an uncontrolled race towards ever more advanced AI systems poses significant risks. Both would benefit from slowing down, investing more in safety, and coordinating their releases. But neither dares to move first.
If Lab A unilaterally slows down while Lab B continues at full speed, Lab A loses market share, talent, and influence – perhaps permanently. The same applies in reverse. The result? Both continue to accelerate, despite both preferring a world in which they had slowed down together.
This is a familiar pattern: the essence of a coordination problem. Individually rational strategies lead to collectively suboptimal outcomes. We see the same pattern in climate negotiations, nuclear disarmament, and countless other domains where parties would gain from cooperation but remain trapped in a destructive equilibrium.
Three mechanisms explain why coordination is so difficult:
The question this text explores: Can new technology change the rules of the game?
Coordination problems are not new, and theorists have developed several frameworks for understanding and sometimes solving them.
In 2004, game theorist Moshe Tennenholtz introduced the concept of "programme equilibrium." The idea: instead of players choosing strategies directly, they submit programmes that specify their strategy – and these programmes have access to each other's source code.
In the prisoner's dilemma, a player can write a programme that says: "If my opponent's programme is identical to mine, I cooperate. Otherwise, I defect." If both players submit this programme, both will cooperate – and neither has an incentive to deviate.
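The programme described above can be sketched in a few lines. This is a toy illustration, not Tennenholtz's formal construction: the names `CLIQUE_BOT`, `DEFECT_BOT`, `run` and `play` are made up for this example, the payoff numbers are the usual illustrative prisoner's dilemma values, and programmes are represented as source strings so that they can be compared for equality.

```python
# Illustrative prisoner's dilemma payoffs (assumed values): (player A, player B).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

# Hypothetical programme: cooperate only with an exact copy of itself.
CLIQUE_BOT = '''
def act(my_source, opponent_source):
    return "C" if opponent_source == my_source else "D"
'''

# Hypothetical deviation: always defect, whatever the opponent's code says.
DEFECT_BOT = '''
def act(my_source, opponent_source):
    return "D"
'''


def run(source, opponent_source):
    """Execute a submitted programme, giving it both source texts."""
    namespace = {}
    exec(source, namespace)  # acceptable for a toy; never exec untrusted code
    return namespace["act"](source, opponent_source)


def play(source_a, source_b):
    """Return the payoffs when the two submitted programmes meet."""
    move_a = run(source_a, source_b)
    move_b = run(source_b, source_a)
    return PAYOFFS[(move_a, move_b)]


print(play(CLIQUE_BOT, CLIQUE_BOT))  # both submit the same programme
print(play(DEFECT_BOT, CLIQUE_BOT))  # A deviates to unconditional defection
```

With these assumed payoffs, mutual submission of `CLIQUE_BOT` yields mutual cooperation, while a unilateral switch to `DEFECT_BOT` is met with defection and leaves the deviator worse off. That is why neither player gains by deviating.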
This is remarkable: rational cooperation emerges even in a one-shot game, which the classical prisoner's dilemma rules out because defection is the dominant strategy there. But there is an obvious limitation: it requires that parties can actually inspect each other's code and verify that it will be executed as specified.
In parallel, cryptographers have developed protocols for Secure Multi-Party Computation (MPC). Multiple parties can jointly compute a function of their secret inputs without any party learning the others' inputs.
A toy example: Three colleagues want to know who has the highest salary without revealing their individual salaries. With MPC, they can perform a computation that reveals only the result whilst each participant's actual figure remains secret.
Imagine a perfectly trustworthy third party, a "Tony", to whom everyone could send their secrets. Tony computes the function and returns only the result. MPC shows that the same outcome can be achieved without trusting any Tony: under the protocol's security assumptions, the protocol itself guarantees secrecy.
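To give a feel for how a computation can proceed without any Tony, here is a minimal sketch of additive secret sharing. It solves a simpler relative of the salary example: the colleagues learn the sum (and hence the average) of their salaries rather than the maximum, because finding a maximum needs secure comparison protocols that are considerably more involved. The function names and the modulus are assumptions for illustration, and the sketch assumes every party follows the protocol honestly.

```python
import secrets

# Public modulus (assumed); it only needs to exceed any possible sum.
MODULUS = 2**61 - 1


def share(value, n_parties):
    """Split value into n random shares that add up to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values):
    """Simulate the protocol: no single share reveals anything about a salary."""
    n = len(private_values)
    # Row i holds the shares party i hands out; column j is what party j receives.
    distributed = [share(value, n) for value in private_values]
    # Each party adds the shares it received and publishes only that partial sum.
    partial_sums = [sum(row[j] for row in distributed) % MODULUS for j in range(n)]
    return sum(partial_sums) % MODULUS


total = secure_sum([52_000, 61_000, 58_000])
print(total, total / 3)
```

Each individual share is a uniformly random number, so a colleague who sees only the shares sent to them learns nothing about the others' salaries beyond what the final sum implies.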
The problem with MPC for negotiations is complexity. The protocols are designed for well-defined mathematical functions, not for the rich, context-dependent reasoning that real negotiations require.
Monderer and Tennenholtz have also shown that a mediator can enable equilibria that would otherwise be impossible. A mediator receives private messages from all parties, computes a recommendation, and sends back instructions. If the mechanism is properly designed, no party has an incentive to deviate.
Certain socially desirable outcomes that cannot be reached through ordinary Nash equilibria can be achieved through mediated equilibria. But whom does one trust as mediator? And how does one verify that the mediator behaves correctly?
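A toy version of a mediated equilibrium, again in the prisoner's dilemma, may make this concrete. The mediator rule below is a common textbook-style illustration and not a reproduction of Monderer and Tennenholtz's formal model: each player may either act directly or delegate to the mediator, and the mediator cooperates on a player's behalf only if both delegated. All names and payoff values are assumptions.

```python
from itertools import product

# Illustrative payoffs (assumed values): (player A, player B).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
CHOICES = ("delegate", "C", "D")


def resolve(choice_a, choice_b):
    """Mediator rule: play C for delegators only if both delegated, else D."""
    both = choice_a == "delegate" and choice_b == "delegate"
    mediated_move = "C" if both else "D"
    move_a = mediated_move if choice_a == "delegate" else choice_a
    move_b = mediated_move if choice_b == "delegate" else choice_b
    return PAYOFFS[(move_a, move_b)]


def is_equilibrium(choice_a, choice_b):
    """True if neither player can gain by unilaterally changing their choice."""
    pay_a, pay_b = resolve(choice_a, choice_b)
    a_cannot_gain = all(resolve(alt, choice_b)[0] <= pay_a for alt in CHOICES)
    b_cannot_gain = all(resolve(choice_a, alt)[1] <= pay_b for alt in CHOICES)
    return a_cannot_gain and b_cannot_gain


equilibria = [pair for pair in product(CHOICES, repeat=2) if is_equilibrium(*pair)]
print(equilibria)
```

In this sketch, both players delegating is an equilibrium that yields mutual cooperation. Mutual defection also remains an equilibrium, so the mediator makes the good outcome stable without making it inevitable.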
Here is the central idea: what happens if the mediator is a large language model running inside a cryptographically secured environment?
┌──────────────────────────────────────────┐
│   Trusted Execution Environment (TEE)    │
│  ┌────────────────────────────────────┐  │
│  │  LLM Mediator                      │  │
│  │   • Receives secret inputs         │  │
│  │   • Reasons about possible deals   │  │
│  │   • Returns only:                  │  │
│  │     "Deal possible" / "No deal"    │  │
│  └────────────────────────────────────┘  │
│  [Verifiable code, cryptographically     │
│   attested by all parties]               │
└──────────────────────────────────────────┘
          ▲                      ▲
          │                      │
      Party A:               Party B:
 "We can accept X         "We agree to Y
  if B does at least Y"    if A does at least Z"
The process:
A traditional MPC solution requires predefined functions: "Compute the intersection of intervals X and Y." But real negotiations are richer.
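For comparison, the kind of predefined function meant here fits in a few lines. The sketch below computes the intersection of two acceptable ranges in the clear and reports only whether a deal is possible. In a real MPC setting the same logic would run inside a secure protocol. The names `intersect` and `mediate` are illustrative.

```python
from typing import Optional, Tuple

Interval = Tuple[float, float]  # (lowest acceptable, highest acceptable)


def intersect(x: Interval, y: Interval) -> Optional[Interval]:
    """Return the overlap of two closed intervals, or None if they are disjoint."""
    low, high = max(x[0], y[0]), min(x[1], y[1])
    return (low, high) if low <= high else None


def mediate(x: Interval, y: Interval) -> str:
    """Reveal only whether the two ranges overlap, not the ranges themselves."""
    return "Deal possible" if intersect(x, y) is not None else "No deal"


print(mediate((2, 5), (4, 9)))
print(mediate((1, 2), (3, 4)))
```

Everything the function can express is a pair of numbers per party, which is exactly the limitation at issue: conditions, exceptions, and definitions have no place in it.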
An LLM can handle natural language: "We can accept a pause on capability training above 10^26 FLOP if competitors do the same, but we need an exception for safety research."
It can reason about plausibility: "Party A's demand for exceptions is inconsistent with Party B's definition of safety research – but here is a possible compromise."
It can propose creative solutions – not merely compute whether positions overlap, but identify a zone of possible agreement that the parties themselves did not see.
A Trusted Execution Environment is a hardware feature designed to run code in isolation, even from the owner and operator of the computer. Modern implementations (Intel SGX, ARM TrustZone) offer:
This addresses the problem of trusting the mediator: no one needs to trust the person or organisation operating the system, only that the hardware (and therefore its manufacturer) and the attested code function as specified.
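The core check a party performs can be sketched as a comparison of code measurements. This is a deliberately simplified illustration: in real remote attestation the measurement arrives inside a report signed by the hardware, and verifying that signature is the essential step omitted here. The function names and the sample code strings are hypothetical.

```python
import hashlib
import hmac


def measure(code: bytes) -> str:
    """Hash of the mediator code; stands in for the enclave's measurement."""
    return hashlib.sha256(code).hexdigest()


def party_accepts(audited_code: bytes, reported_measurement: str) -> bool:
    """Accept only if the running code matches the code this party audited.

    Assumption: reported_measurement was taken from a hardware-signed
    attestation report whose signature has already been verified.
    """
    expected = measure(audited_code)
    return hmac.compare_digest(expected, reported_measurement)


audited = b"def mediate(a, b): return 'Deal possible'"
tampered = audited + b"  # plus a hidden logging hook"

print(party_accepts(audited, measure(audited)))
print(party_accepts(audited, measure(tampered)))
```

Any change to the mediator code, however small, changes the measurement, so a party that audited one version will refuse to send secrets to another.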
Let us return to the AI laboratory scenario. What might an LLM-mediated negotiation look like?
Anthropic, OpenAI, DeepMind, and several other leading laboratories wish to explore coordinated safety measures. But:
Each lab sends an encrypted message to the LLM mediator. A submission might include:
The LLM analyses all inputs and computes: Is there a consistent set of measures that all parties can accept? If yes, what are the minimum requirements? If no, how close were the parties – expressed without revealing individual positions?
The crucial point: if no deal is possible, no party learns who blocked it or why. Lab A does not know whether Lab B or Lab C blocked, or for what reason. This largely removes the strategic cost of failed negotiations.
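The property described here, a failure message that looks the same whoever caused it, can be sketched with a toy mediator over sets of measures. This replaces the LLM's reasoning with a plain set intersection purely for illustration. The measure names and the `mediate` function are hypothetical.

```python
from typing import Dict, Set


def mediate(submissions: Dict[str, Set[str]]) -> dict:
    """Each lab lists the measures it would accept if all other labs adopt them."""
    common = set.intersection(*submissions.values()) if submissions else set()
    if common:
        return {"outcome": "Deal possible", "measures": sorted(common)}
    # The failure message is identical whoever blocked and for whatever reason.
    return {"outcome": "No deal"}


agreeing = {
    "Lab A": {"external audits", "training pause above threshold"},
    "Lab B": {"external audits", "incident reporting"},
    "Lab C": {"external audits"},
}
blocked_by_c = {**agreeing, "Lab C": {"incident reporting"}}
blocked_by_a = {**agreeing, "Lab A": {"training pause above threshold"}}

print(mediate(agreeing))
print(mediate(blocked_by_c) == mediate(blocked_by_a))
```

One caveat the sketch makes visible: a lab always knows its own submission, so even a bare "No deal" tells it that at least one other party rejected everything it offered.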
The same mechanism could apply to other coordination problems: climate negotiations where countries reveal true emissions targets without showing their hand, nuclear disarmament where failure does not leak intelligence, or trade negotiations where parties explore deals without revealing their best alternative.
The mechanism could of course also be used in situations unrelated to global risks: negotiations between employers and unions, business mergers, or divorce settlements where both parties prefer efficiency over prolonged conflict.
The idea is appealing but far from unproblematic.
The LLM mediator can help parties reach an agreement. But who ensures they follow it? This requires separate mechanisms – inspections, reporting, sanctions.
One possible extension: the same system could handle continuous reporting, where parties regularly send status updates to the mediator, which computes whether everyone still meets their commitments without revealing individual data. This would require the reports to carry proofs of their validity, but these proofs could remain hidden from the other parties.
How do we know the LLM behaves neutrally? Modern language models have subtle (or not-so-subtle) biases that could systematically favour certain outcomes.
Possible approaches: use simpler, more verifiable models for the core function, or run several different models in parallel with consensus required before accepting a result.
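The second approach can be sketched as a unanimity rule over several independent mediators. The stub functions below stand in for different models and are purely hypothetical; the `Mediator` type and the `consensus` function are assumptions for illustration.

```python
from typing import Callable, Sequence

# Hypothetical interface: a mediator maps both parties' inputs to a verdict.
Mediator = Callable[[str, str], str]


def consensus(mediators: Sequence[Mediator], input_a: str, input_b: str) -> str:
    """Report a deal only if every mediator independently finds one."""
    verdicts = {mediator(input_a, input_b) for mediator in mediators}
    return "Deal possible" if verdicts == {"Deal possible"} else "No deal"


# Stub mediators standing in for different underlying models.
def lenient(a: str, b: str) -> str:
    return "Deal possible"


def strict(a: str, b: str) -> str:
    return "Deal possible" if a == b else "No deal"


print(consensus([lenient, strict], "pause", "pause"))
print(consensus([lenient, strict], "pause", "no pause"))
```

Requiring unanimity trades missed deals for fewer spurious ones: a single biased model can block an agreement but cannot produce one on its own. Whether that trade is the right one depends on which error is costlier in a given negotiation.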
Can parties manipulate the outcome by submitting strategic inputs? "We'll say our limit is X, even though it's really Y, to see what the system returns."
This requires incentive-compatible mechanism design – making truthful inputs each party's dominant strategy. The mechanism design literature offers tools, but combining them with LLM reasoning is unexplored territory.
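The probing worry can be made concrete. If a mediator answers only yes or no but lets a party resubmit freely, a simple binary search recovers the other side's limit. The sketch below assumes a one-dimensional negotiation in which B accepts any offer at or above a secret minimum. The names `make_oracle` and `probe`, and the numbers, are hypothetical.

```python
def make_oracle(b_minimum: int):
    """Hypothetical mediator: yes/no answers only, but unlimited resubmission."""
    queries = 0

    def oracle(a_offer: int) -> bool:
        nonlocal queries
        queries += 1
        return a_offer >= b_minimum

    def query_count() -> int:
        return queries

    return oracle, query_count


def probe(oracle, low: int, high: int) -> int:
    """Binary search for the smallest accepted offer, assuming oracle(high) is True."""
    while low < high:
        mid = (low + high) // 2
        if oracle(mid):
            high = mid
        else:
            low = mid + 1
    return low


oracle, query_count = make_oracle(b_minimum=637)
print(probe(oracle, 0, 1000), query_count())
```

About ten queries suffice to pin down a limit hidden among a thousand possibilities, so a single-bit output does not by itself protect private positions. A practical design would presumably need to limit resubmission or make each submission binding.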
Organisations accustomed to bargaining power – who extract advantages through information asymmetry and negotiating skill – have weak incentives to relinquish those advantages.
Compare the resistance to binding arbitration: even when both parties would benefit from a predictable system, the more powerful party often prefers an uncertainty that lets it exploit its strength.
Coordination problems are among the most difficult challenges humanity faces. The climate crisis, risks from advanced AI, nuclear proliferation – all share the same structure: individually rational choices leading to collective catastrophe.
Traditional solutions – repeated interactions, reputation mechanisms, social norms – work best in stable environments with long time horizons. For existential risks, where we may have only one chance, they are insufficient.
LLMs combined with cryptographic security open a new design space. Not as a magic solution, but as a tool to explore: Can we construct mechanisms where revealing one's true position becomes rational? Where failed negotiations carry no cost? Where creative compromises can be discovered by a system that sees all parties' perspectives simultaneously?
This is a conceptual sketch, and the next step would be small-scale experiments – perhaps in simulated negotiations or low-stakes real scenarios – to develop intuitions about what actually works.
What is needed is interdisciplinary work: game theory, cryptography, AI safety, international relations. Theoretical analysis of which coordination problems are solvable with these tools. Practical experiments. And eventually, perhaps, protocols that can be tested in real negotiation contexts.
Coordination problems have always been hard. But the tools for solving them need not be the same tomorrow as they were yesterday.
This is an updated cross-post from my own Substack. Original post here.