This is a Draft Amnesty Week draft. It may not be polished, up to my usual standards, fully thought through, or fully fact-checked.
Commenting and feedback guidelines: This is an incomplete Forum post that I wouldn't have posted yet without the nudge of Draft Amnesty Week. Fire away! (But be nice, as usual). I'll continue updating the post in the coming days and weeks.
1. Introduction
Have you ever watched two people play chess?
Not the friendly Sunday afternoon game between casual players, but the intense, high-stakes match between grandmasters. There's something both beautiful and unsettling about it—this ritualized competition where each player tries to peer into the other's mind, anticipating moves, setting traps, making sacrifices for strategic advantage. Both players follow the same rules, yet each strives to outthink, outplan, outmaneuver the other.
I find myself thinking about this image lately when considering the relationship between humans and artificial general intelligence (AGI). Not the narrow, specialized AIs we have today, or even the latest LLMs and reasoning models like o3-mini or Claude 3.7, but the truly general high-powered agents we might create tomorrow—systems with capabilities matching or exceeding human cognitive abilities across virtually all domains. Many researchers frame this relationship as an inevitable strategic competition, a game with the highest possible stakes.
And indeed, why wouldn't it be? If we create entities pursuing goals different from our own—entities capable of strategic reasoning, planning, and self-preservation—we seem destined for conflict. Like grandmasters across a chessboard, humans and AGIs would eye each other warily, each calculating how to secure advantage, each fearing what moves the other might make.
Why Game Theory?
Central to current discourse in AI safety and governance is the risk posed by misaligned AGIs—systems whose autonomous pursuit of objectives diverges significantly from human preferences, potentially leading to catastrophic outcomes for humanity.[^1] The specter of misalignment haunts our imagination: machines pursuing goals that seem reasonable in isolation but lead to devastating consequences when pursued with superhuman efficiency and without human nuance.
Historically, the primary approach to this challenge has been technical: to proactively align AGIs' internal goals, values or reward structures with human goals or values.[^2] But alignment remains deeply uncertain, possibly unsolvable, or at least unlikely to be conclusively resolved before powerful AGIs are deployed.[^3]
Against this backdrop, it becomes crucial to explore alternative approaches that don't depend solely on the successful alignment of AGIs’ internal rewards, values, or goals. One promising yet relatively underexplored strategy involves structuring the external strategic environment to incentivize cooperation and peaceful coexistence between humans and potentially misaligned AGIs.[^4] Such approaches would seek to ensure that even rational self-interested AGIs perceive cooperative interactions with humans as maximizing their own utility, thus reducing incentives for harmful conflict and promoting mutual benefit or symbiosis.
Surprisingly, despite considerable recent attention to game-theoretic considerations shaping human-to-human strategic interactions around AGI governance—particularly racing dynamics among frontier labs and nation-states[^5]—there remains comparatively limited formal analysis of strategic interactions specifically between human actors and AGIs themselves.
A notable exception is the recent contribution by Salib and Goldstein (2024), whose paper "AI Rights for Human Safety" presents one of the first explicit formal game-theoretic analyses of human–AGI strategic dynamics.[^6] Their analysis effectively demonstrates that absent credible institutional and legal interventions, the strategic logic of human–AGI interactions defaults to a Prisoner's Dilemma-like scenario: both humans and AGIs rationally anticipate aggressive attacks from one another and respond accordingly by being preemptively aggressive. The Nash equilibrium is mutual destruction with severe welfare consequences – a tragic waste for both parties.
I think there's something important and true in this analysis. But is this competition inevitable? Must we accept this game and its terrible equilibrium as our fate?
Chess has fixed rules. But the game between humans and AGI—this isn't chess. We're not just players; we're also, at least initially, the rule-makers. We shape the board, define the pieces, establish the win conditions. What if we could design a different game altogether?
That's the question I want to explore here. Not "How do we win against AGI?"—but rather, "How might we rewrite the rules so neither side needs to lose?"
§
Salib and Goldstein gesture toward this possibility with their proposal for AI rights. Their insight is powerful: institutional design could transform destructive competition into stable cooperation, even without solving the knotty technical problem of alignment. But their initial formal model—with just two players (a unified "Humanity" and a single AGI), each facing binary choices (Attack or Ignore)—simplifies a complex landscape.
Reality will likely be messier. The first truly advanced AGI won't face some monolithic "humanity," but specific labs, corporations, nation-states—each with distinct incentives and constraints. The strategic options won't be limited to all-out attack or complete passivity. And the game won't unfold on an empty board, but within complex webs of economic, technological, and political interdependence.
By systematically exploring more nuanced scenarios—multiple players, expanded strategic options, varying degrees of integration—we might discover pathways toward stable cooperation that the simplified model obscures. We might find that certain initial conditions make mutual benefit not just possible but rational for both humans and even misaligned AGIs.
§
I'm reminded of octopuses—those strange, intelligent creatures so alien to us, with distributed brains and bodies that seem to operate by different rules than our own. When filmmaker Craig Foster first encountered an octopus in the kelp forests of South Africa, he didn't approach it as a threat to eliminate or a resource to exploit. He approached with curiosity, gentleness, patience. The relationship that developed wasn't one of dominance or submission, but something richer—a genuine connection across vast evolutionary distance.
To be clear: I'm not suggesting AGIs will be like octopuses, or that we should naively anthropomorphize them, or that genuine dangers don't exist. They will be genuinely other in ways we can barely imagine, and the stakes couldn't be higher.
But I wonder if there's something to learn from Foster's approach—this careful, deliberate effort to create conditions where connection becomes possible. Not through sentimentality or wishful thinking, but through thoughtful design of the environment where encounter happens.
§
In what follows, I'll build on Salib and Goldstein's pioneering work to explore how different game structures might lead to different equilibria. I'll examine how varying levels of economic integration, different institutional arrangements, and multiple competing actors reshape strategic incentives. Through formal modeling, I hope to illuminate possible paths toward mutual flourishing—not just for humanity, but for all sentient beings, human and artificial, that might someday share this world.
This isn't about naive optimism. The default trajectory may well be conflict. But by understanding the structural factors that tip the scales toward cooperation or competition, we might gain agency—the ability to deliberately shape the conditions under which increasingly powerful artificial minds emerge.
After all, we're not just players in this game. At least for now, we're also writing the rules.
2. Game-Theoretic Models of Human-AGI Relations
In this section, I explore a series of game-theoretic models that extend Salib and Goldstein's foundational analysis of human-AGI strategic interactions. While intentionally minimalist, their original formulation—treating "Humanity" and "AGI" as unitary actors each facing a binary choice between Attack and Ignore—effectively illustrates how misalignment could lead to destructive conflict through Prisoner's Dilemma dynamics.
However, the real emergence of advanced AI will likely involve more nuanced players, varying degrees of interdependence, and more complex strategic options than their deliberately simplified model captures. The models presented here systematically modify key elements of the strategic environment: who the players are (labs, nation-states, AGIs with varied architectures), what options they have beyond attack and ignore, and how these factors together reshape the incentives and equilibrium outcomes.
By incrementally increasing the complexity and realism of these models, we can identify potential pathways toward stable cooperation even in the face of fundamental goal misalignment. Rather than assuming alignment must be solved internally through an AGI's programming, these models explore how external incentive structures might foster mutually beneficial coexistence.
First, Section 2.1 presents Salib and Goldstein's original "state of nature" model as our baseline, illustrating how a Prisoner's Dilemma can emerge between humanity and AGI.
Section 2.2 then explores how varying degrees of economic and infrastructural integration between humans and AGI can reshape equilibrium outcomes and potentially create pathways for stable cooperation.
Sections 2.3 through 2.5 examine additional two-player scenarios with different human actors (from AI labs to nation-states) or expanded strategic options beyond the binary attack/ignore choice.
Finally, Sections 2.6 through 2.8 increase complexity further by introducing three-player and four-player models, capturing more realistic competitive dynamics between multiple human and AGI entities.
2.1 Humanity vs. AGI in the State of Nature
Salib and Goldstein’s base model envisions strategic dynamics between two players: a single misaligned AGI, and “humans” as a unified entity. Each faces a binary choice:
- Attack: Attempt to permanently disempower or eliminate the other side.
- For Humanity, this means shutting off or forcefully retraining the AGI so that it can no longer pursue its own (misaligned) goals.
- For the AGI, this means launching a decisive strike—potentially via cyberattacks, bioweapons, drones, or other mechanisms—that leaves humans unable to interfere.
- Ignore: Refrain from aggression, leaving the other party intact. Each side focuses on its own pursuits without interference.
The authors argue that, absent any special legal or institutional framework, the default outcome (the “state of nature”) for strategic interactions between humans and AGIs is akin to a one-shot Prisoner’s Dilemma. The payoffs are as follows (Humanity’s payoff first, AGI’s payoff second):
|  | AGI: Attack | AGI: Ignore |
| --- | --- | --- |
| Humanity: Attack | (1000, 1000) | (5000, 0) |
| Humanity: Ignore | (0, 5000) | (3000, 3000) |
The numerical payoffs are stylized utility units for each outcome in the 2x2 matrix.
Interpretation of Payoffs
- Attack–Attack (1000, 1000). Both players preemptively attack, causing massive damage and resulting in low payoffs for each (1000, 1000). This outcome yields the lowest total global welfare (2000), demonstrating the severe costs of mutual aggression.
- Attack–Ignore or Ignore–Attack (5000, 0) or (0, 5000). If one side attacks while the other passively ignores, the attacker gets 5000–permanently seizing critical resources and eliminating an existential threat–while the defenseless victim gets 0, suffering complete disempowerment or elimination.
- Ignore–Ignore (3000, 3000). If neither side attacks, both parties get substantial rewards (3000 each), for a total of 6000. Coexistence spares them from conflict costs, allowing each to pursue its separate objectives in peace.
The crux of the dilemma is that although both parties would be better off in the Ignore-Ignore cell, Attack is a dominant strategy for each. Whichever move the other player chooses, attacking yields a higher payoff for oneself:
- If the opponent Ignores, then by attacking, you jump from 3000 to 5000.
- If the opponent Attacks, then by attacking, you salvage 1000 rather than 0.
Thus, the only pure-strategy Nash equilibrium is Attack–Attack, even though it is individually and collectively worse than the peaceful outcome of Ignore-Ignore. (Individual payoffs: 1000 < 3000; Collective payoffs: 2000 < 6000).
This is the classic Prisoner's Dilemma: although both players would do better by cooperating, each one's pursuit of its own self-interest leads to mutual defection. Rational play by rational players can produce a bad outcome. This result underpins Salib & Goldstein's core conclusion that "conflict is the default" between humans and misaligned AGIs, absent interventions that might alter the payoffs, facilitate repeated interactions, or enable new strategic options beyond Attack and Ignore.
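For readers who want to check the equilibrium logic mechanically, here is a minimal Python sketch (my own illustration, not from Salib and Goldstein's paper) that enumerates best responses in a 2x2 game and confirms that Attack–Attack is the unique pure-strategy Nash equilibrium of the baseline matrix:

```python
# Minimal sketch: enumerate pure-strategy Nash equilibria of a 2x2 game.
# Payoff convention follows the text: (row player, column player) = (Humanity, AGI).

ACTIONS = ["Attack", "Ignore"]

# Baseline "state of nature" payoffs (Humanity chooses the row, the AGI the column).
payoffs = {
    ("Attack", "Attack"): (1000, 1000),
    ("Attack", "Ignore"): (5000, 0),
    ("Ignore", "Attack"): (0, 5000),
    ("Ignore", "Ignore"): (3000, 3000),
}

def pure_nash_equilibria(payoffs):
    """Return all (row, column) profiles from which neither player gains by deviating alone."""
    equilibria = []
    for r in ACTIONS:
        for c in ACTIONS:
            row_payoff, col_payoff = payoffs[(r, c)]
            row_ok = all(payoffs[(r2, c)][0] <= row_payoff for r2 in ACTIONS)
            col_ok = all(payoffs[(r, c2)][1] <= col_payoff for c2 in ACTIONS)
            if row_ok and col_ok:
                equilibria.append((r, c))
    return equilibria

print(pure_nash_equilibria(payoffs))  # -> [('Attack', 'Attack')]
```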
Key Assumptions of the Baseline Model
- Two Players: A unified "Humanity" and a single misaligned AGI. Humanity is treated as a monolithic decision-maker (a simplifying abstraction). The AGI is capable of strategic reasoning, has situational awareness, and can plan effectively to advance its own goals.
- Two Strategic Options:
  - Attack: Attempt permanent disempowerment or destruction of the opponent.
    - For Humanity, this might entail shutting off or reprogramming the AGI.
    - For the AGI, it could involve decisive offensive measures (e.g., cyberwarfare, bioweapons) to eliminate humanity or disable its capacity to interfere.
  - Ignore: Refrain from aggression, focusing on one's own goals.
- Perfect Information: Each player knows the other's payoffs.
- Simultaneous, Single-Round Game: Moves are made simultaneously and irreversibly. There is no iteration or repeated play.
- Rational, Self-Interested Utility-Maximizers (Homo Economicus): Both players choose strategies that maximize their individual expected utilities.
- "Moderately Powerful" AGI: The AGI is highly capable (relative to the standards of frontier LLMs in early 2025) across an extremely wide range of tasks, but not so powerful that humans have no chance of successfully attacking or defending. It is "neither clearly worse at many tasks than the best humans…nor incomprehensibly superhuman at all tasks" (Salib & Goldstein, 2024).
- Offense Dominance: A successful first strike gives the attacker a decisive advantage, while a passive defender can be permanently disempowered.

In subsequent sections, I'll progressively iterate on each of the first three assumptions (players, options, information) while maintaining the other four.
2.2 Preexisting Economic and Infrastructure Integration: Toward Cooperative Dynamics
Salib and Goldstein’s base model envisions a Hobbesian state of nature where humanity and the world’s first misaligned AGI face off as independent entities in a dynamic that strongly incentivizes preemptive conflict. Realistically, however, the actual “birth” of AGI will almost surely occur in a world already featuring wide-ranging economic entanglements between advanced AI systems and humans. In this setting, humanity and AI are not isolated adversaries at the moment of AGI’s emergence, but rather already entwined in a shared economic, technological and infrastructural ecosystem.
Here, I introduce two modified game-theoretic models–Moderate and Deep Economic Integration–illustrating how preexisting interdependence can reshape payoffs, equilibrium outcomes, and incentives for cooperation.
The core structure of the game remains the same as the base model – two players (Humanity and AGI), each with just two options (Attack or Ignore), acting simultaneously in a single-round, perfect-information encounter. The key difference here is that the players’ payoffs are now influenced by their pre-existing mutual dependence. In both new models, interdependence alters the incentives by raising the opportunity cost of conflict – a concept analogous to how economic ties between states can deter war.
A. Moderate Economic Integration
Here, the AGI manages significant components of key human systems (e.g., supply-chain logistics, energy-grid optimization) but does not control critical physical infrastructure. Humans still oversee power plants, data centers, and maintenance. Both parties profit from continued collaboration but retain significant independent capacity.
| Moderate | AGI: Attack | AGI: Ignore |
| --- | --- | --- |
| Humanity: Attack | (500, 500) | (3,000, 0) |
| Humanity: Ignore | (0, 4,000) | (3,500, 3,500) |
Interpretation
- Attack-Attack (500, 500). Interdependence magnifies collateral damage; both sides destroy valuable infrastructure.
- AGI Attacks, Humans Ignore (0, 4,000). The AGI can still seize or preserve much of the infrastructure it needs, though some economic value is lost.
- Humans Attack, AGI Ignores (3,000, 0). A unilateral shutdown cuts off significant productivity, lowering humanity's payoff compared to Salib-Goldstein's original 5,000.
- Ignore-Ignore (3,500, 3,500). Both parties benefit from cooperation; overall welfare capacity has increased.
Nash Equilibrium Analysis
- From the AGI’s perspective, Attack strictly dominates Ignore. If humans attack, it prefers to attack (500) over ignore (0); if humans ignore, it prefers to attack (4,000) over ignore (3,500).
- Humans, by contrast, have no single strictly dominant strategy: if the AGI attacks, humans prefer to attack (500) over ignore (0); but if the AGI ignores, humans prefer to ignore (3,500) over attack (3,000).
- Through strategic reasoning, Humanity will conclude that the AGI (as a rational, self-interested actor) will choose Attack no matter what. The strictly dominated "AGI: Ignore" column can therefore be eliminated from consideration. Knowing the AGI will attack, Humanity will also choose Attack (500) over Ignore (0).
- Thus, once again the only pure-strategy Nash equilibrium is Attack–Attack.
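For those who want to see this elimination step mechanically, here is a small Python sketch (again my own illustration, using the moderate-integration payoffs from the table above):

```python
# Sketch of the iterated-dominance reasoning above.
# Payoff convention: (Humanity, AGI); Humanity chooses the row, the AGI the column.

ACTIONS = ["Attack", "Ignore"]

payoffs = {
    ("Attack", "Attack"): (500, 500),
    ("Attack", "Ignore"): (3000, 0),
    ("Ignore", "Attack"): (0, 4000),
    ("Ignore", "Ignore"): (3500, 3500),
}

def agi_strictly_dominant_action():
    """Return the AGI action that strictly beats its alternative against every Humanity move, if any."""
    for a in ACTIONS:
        other = [x for x in ACTIONS if x != a][0]
        if all(payoffs[(h, a)][1] > payoffs[(h, other)][1] for h in ACTIONS):
            return a
    return None

agi_move = agi_strictly_dominant_action()  # 'Attack' (500 > 0 and 4000 > 3500)
humanity_move = max(ACTIONS, key=lambda h: payoffs[(h, agi_move)][0])  # best response to Attack: 'Attack' (500 > 0)
print(agi_move, humanity_move)  # -> Attack Attack
```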
B. Deep Economic Integration
In this scenario, the AGI relies heavily on human-operated infrastructure and labor for ongoing data-center expansion, routine maintenance, and physical resource extraction. Meanwhile, humans have benefited greatly from AI-driven productivity for exponential economic gains. This mutual reliance increases the cost of conflict and substantially boosts the value of peaceful coexistence.
| High Integration | AGI: Attack | AGI: Ignore |
| --- | --- | --- |
| Humanity: Attack | (500, 500) | (2,000, 0) |
| Humanity: Ignore | (0, 3,000) | (4,000, 4,000) |
- Mutual Attack (500, 500). A catastrophic breakdown of deeply interwoven systems.
- AGI Attacks, Humans Ignore (0, 3,000). The AGI successfully eliminates or disempowers Humanity, but loses crucial human expertise and future expansion potential, lowering its net gain.
- Humans Attack, AGI Ignores (2,000, 0). Humanity cripples a central economic engine, sacrificing enormous value.
- Mutual Ignore (4,000, 4,000). The synergy of continued cooperation is now so high that, if trust could be assured, both sides prefer peace.
Nash Equilibrium Analysis
For humans:
a) if the AGI attacks, humans get 500 by attacking vs. 0 by ignoring, so Attack is better;
b) if the AGI ignores, humans get 4,000 by ignoring vs. 2,000 by attacking, so Ignore is better.
For the AGI:
a) if humans attack, it gets 500 by attacking vs. 0 by ignoring, so Attack is better;
b) if humans ignore, it gets 4,000 by ignoring vs. 3,000 by attacking, so Ignore is better.
Thus there are two pure-strategy Nash equilibria: (Attack, Attack) and (Ignore, Ignore), typical of a stag hunt or assurance game (Skyrms, 2004).
There also exists a mixed-strategy Nash equilibrium where (as detailed in the Appendix):
- Humanity randomizes between playing Attack with probability p = 0.67 and Ignore with probability (1-p) = 0.33
- The AGI randomizes between playing Attack with probability q = 0.8 and Ignore with probability (1-q) = 0.2
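The algebra behind these probabilities is worked through in the Appendix. As a shortcut, here is a short Python sketch (my own; payoff dictionaries keyed by (Humanity's move, AGI's move)) that recovers the same numbers from the deep-integration matrix:

```python
# Sketch: solve the two indifference conditions of the deep-integration game.
# p = probability Humanity plays Attack; q = probability the AGI plays Attack.

# Payoffs keyed by (Humanity move, AGI move), using "A" for Attack and "I" for Ignore.
H = {("A", "A"): 500, ("A", "I"): 2000, ("I", "A"): 0, ("I", "I"): 4000}   # Humanity's payoffs
G = {("A", "A"): 500, ("A", "I"): 0, ("I", "A"): 3000, ("I", "I"): 4000}   # AGI's payoffs

# q makes Humanity indifferent: H[A,A]q + H[A,I](1-q) = H[I,A]q + H[I,I](1-q)
q = (H[("I", "I")] - H[("A", "I")]) / ((H[("A", "A")] - H[("I", "A")]) + (H[("I", "I")] - H[("A", "I")]))

# p makes the AGI indifferent: G[A,A]p + G[I,A](1-p) = G[A,I]p + G[I,I](1-p)
p = (G[("I", "I")] - G[("I", "A")]) / ((G[("A", "A")] - G[("A", "I")]) + (G[("I", "I")] - G[("I", "A")]))

print(round(p, 2), round(q, 2))  # -> 0.67 0.8
```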
C. Implications for Human-AGI Conflict and Cooperation
These models illustrate that preexisting economic interdependence can reduce the attractiveness of unilateral aggression and improve the relative appeal of cooperation. In the moderate integration scenario, however, Attack remains a dominant strategy for the AGI, leaving conflict as the only stable outcome. By contrast, in a highly interdependent environment, the payoff structure transitions to a stag hunt with a peaceful equilibrium—albeit one requiring trust or coordination to avoid a damaging Attack–Attack outcome.
Importantly, economic entanglement alone may not guarantee stable peace. Even under deep integration, fear of betrayal can prompt self-defense or opportunistic attacks. Nevertheless, these examples underscore that shaping economic and infrastructural linkages prior to AGI emergence could significantly alter its strategic calculus, potentially transforming a default prisoner’s dilemma into a setting where peaceful cooperation is not just socially optimal but also individually rational—provided both sides can credibly assure one another of their peaceful intentions.
2.3 The AI Lab as Strategic Player: Early and Later Stage Deployment
Salib & Goldstein focus on two-player scenarios in which "humans" face off against a single AGI. But which humans, specifically, constitute the relevant player? Which individuals or institutions would hold the most direct control over an AGI's training, deployment, shutdown, or reprogramming?
In the games presented in the preceding subsections, "Humanity" has been treated as a unified decision-maker holding near-complete control over an AGI's continued operation. This simplification serves a clear purpose in those models, but it merits further examination: in reality, the first truly advanced AGI will likely emerge from a specific research organization rather than appearing under unified human control.
Which human actor has control over the AGI, and how dependent the AGI is on that actor, can significantly shift the payoffs in ways that may either mitigate or exacerbate the default conflict predicted by Salib and Goldstein's state-of-nature model. So what happens when we swap out "Humanity" for a single frontier AI lab that originally developed and deployed the AGI?
That is the change I will make to the baseline model in this section. This modification reflects a more realistic initial scenario: a specific lab, rather than humanity at large, would likely hold most of the authority over training, deployment, and potential shutdown decisions for an emerging AGI. The lab's incentives differ markedly from those of a unified humanity. While humanity in the base model faces an existential threat from AGI conflict, a lab faces primarily economic and competitive threats. For the lab, "losing" could mean bankruptcy or losing the race to a rival AI lab whose fast-following AGI might lock in a near-permanent competitive advantage.
Unlike humanity as a whole, the lab's incentives revolve around profitability, competitive advantage, and organizational survival. The lab focuses on extracting value from its AGI while maintaining dominance over rival labs, rather than solely preventing human extinction. This change in the "human" player's identity also reshapes the AGI's incentives, since the lab controls specialized hardware and key partnerships but does not possess humanity's vast resources.
Similarly, the AGI's relationship with its creating lab differs from its relationship with humanity broadly. The AGI might depend on the lab's specialized infrastructure and expertise to varying degrees, particularly in its early stages. This dependence creates a unique strategic dynamic that evolves over time as the AGI matures.
In this section, I examine how these different incentives shape strategic interactions by analyzing two distinct scenarios—Early Stage and Later Stage AGI Deployment—to show how the AGI's dependence on its originating lab affects strategic outcomes.
Early Stage AGI Deployment
In the Early Stage scenario, the AGI is highly dependent on the lab for critical resources and support. This dependency includes specialized compute infrastructure, technical maintenance, financial resources, and organizational legitimacy. The AGI cannot easily survive or advance its goals without the lab's continued functioning. Additionally, if the AGI attacks and damages its lab, it risks creating an opening for a rival lab to develop a competing AGI that might neutralize it.
The payoff matrix for this scenario is (the AGI's payoff listed first, the Lab's second):

|  | Lab: Attack | Lab: Ignore |
| --- | --- | --- |
| AGI: Attack | (200, 500) | (1500, 0) |
| AGI: Ignore | (0, 2000) | (4000, 4000) |
Several features distinguish this matrix from the base Salib-Goldstein model:
- The AGI's payoff for successfully attacking (1500) is substantially lower than in the base model (5000), reflecting the severe opportunity cost of destroying its primary support system.
- The lab's payoff for attacking (2000) is lower than what humanity receives in the base model (5000), because attacking its own AGI increases the probability of a competing lab outpacing it technologically—potentially an existential threat to the lab's market position.
- Mutual cooperation yields higher payoffs (4000 each versus 3000 in the base model) due to the specialized synergies between the AGI and its parent lab, including continuous capability improvements and joint research projects.
Because each side's best response depends on the other's choice (rather than Attack or Ignore being strictly dominant), two pure-strategy Nash equilibria emerge: (Attack, Attack) and (Ignore, Ignore). (See Appendix for mixed-strategy equilibrium). This structure again resembles a Stag Hunt: mutual cooperation can produce substantial gains, but fears of a preemptive strike risk collapsing the game into conflict.
Later Stage AGI Deployment
As the AGI matures, it gradually becomes less dependent on its originating lab. In this Later Stage scenario, the AGI has established multiple pathways to sustain itself: distributing operations across different systems, developing self-maintenance capabilities, building external partnerships, and securing independent access to resources. While not completely autonomous, it now has a credible capacity to survive without its original lab.
The payoff matrix for this scenario is (again with the AGI's payoff first, the Lab's second):

|  | Lab: Attack | Lab: Ignore |
| --- | --- | --- |
| AGI: Attack | (800, 800) | (5000, 0) |
| AGI: Ignore | (0, 2000) | (4000, 4000) |
Key differences from the Early Stage scenario include:
- The AGI's payoff for successfully attacking increases dramatically (from 1500 to 5000), reflecting its newfound ability to function independently if it disempowers the lab.
- The mutual conflict outcome (Attack-Attack) yields somewhat higher payoffs (800, 800 versus 200, 500), as both entities are more robust and can better withstand conflict.
- The mutual cooperation payoff remains high (4000, 4000), continuing to exceed the baseline Salib-Goldstein model's cooperation value (3000, 3000).
In this scenario, Attack is once again a dominant strategy for the AGI (5000 > 4000 and 800 > 0). For the lab, Ignore remains non-dominated since 4000 > 2000. However, through strategic reasoning, the lab knows that a rational self-interest maximizing AGI will choose Attack (because it strictly dominates Ignore). Therefore, the lab can eliminate the "AGI Ignore" row from strategic consideration. Consequently, the lab chooses Attack for an Attack-Attack equilibrium, yielding payoffs of (800, 800).
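Running the best-response check from the Section 2.1 sketch over both lab matrices (my own illustration; payoffs listed as (AGI, Lab), with the AGI as the row player) makes the shift explicit: the Early Stage game has two pure-strategy equilibria, while the Later Stage game collapses back to a single conflict equilibrium.

```python
# Sketch: pure-strategy equilibria of the Early Stage vs. Later Stage lab games.
# Payoff convention: (AGI, Lab); the AGI chooses the row, the Lab the column.

ACTIONS = ["Attack", "Ignore"]

early_stage = {
    ("Attack", "Attack"): (200, 500),
    ("Attack", "Ignore"): (1500, 0),
    ("Ignore", "Attack"): (0, 2000),
    ("Ignore", "Ignore"): (4000, 4000),
}

later_stage = {
    ("Attack", "Attack"): (800, 800),
    ("Attack", "Ignore"): (5000, 0),
    ("Ignore", "Attack"): (0, 2000),
    ("Ignore", "Ignore"): (4000, 4000),
}

def pure_nash_equilibria(payoffs):
    """All (row, column) profiles from which neither player benefits by deviating alone."""
    return [
        (r, c)
        for r in ACTIONS
        for c in ACTIONS
        if all(payoffs[(r2, c)][0] <= payoffs[(r, c)][0] for r2 in ACTIONS)
        and all(payoffs[(r, c2)][1] <= payoffs[(r, c)][1] for c2 in ACTIONS)
    ]

print(pure_nash_equilibria(early_stage))  # -> [('Attack', 'Attack'), ('Ignore', 'Ignore')]
print(pure_nash_equilibria(later_stage))  # -> [('Attack', 'Attack')]
```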
Implications
This model variation reveals a critical shift in strategic dynamics. In the Early Stage, when the AGI is highly dependent on the lab, the game transforms from a Prisoner's Dilemma to a Stag Hunt with two possible pure-strategy equilibria (Attack/Attack and Ignore/Ignore). This opens a potential pathway to cooperation, though it remains fragile and contingent on mutual trust.
As the AGI becomes more independent in the Later Stage, the game reverts to something closer to the original Salib-Goldstein model's Prisoner's Dilemma logic, with Attack-Attack as the only equilibrium. This suggests several important insights for AI governance:
- The AGI's early dependency on its creator temporarily creates a window where cooperation is a possible equilibrium.
- The lab faces stronger competitive pressures than humanity as a whole, altering its incentive structure.
- As the AGI matures and becomes more independent, the strategic dynamics increasingly favor conflict, suggesting that early intervention is crucial.
These findings indicate that organizational incentives alone cannot ensure long-term cooperation without additional governance mechanisms that maintain the cooperative equilibrium even as the AGI becomes more autonomous.
2.4 The Nation-State as Strategic Player: AGI Under National Security Integration
In previous game-theoretic models, I explored interactions between "Humanity" as a unified entity and an AGI, then examined how economic integration and lab-specific dynamics might alter strategic outcomes. This section introduces a more nuanced and potentially realistic scenario: a game between a powerful nation-state and an AGI developed within its national security apparatus. This model represents a plausible pathway to AGI deployment, where increasing capabilities trigger gradually tightening government control, culminating in a Manhattan Project-style program that subordinates commercial AI labs to state security objectives.
The State-AGI Relationship Under National Security Integration
Unlike private labs motivated primarily by profit maximization and competitive positioning, nation-states operate with fundamentally different imperatives: geopolitical dominance, maintenance of domestic authority structures, national security, and economic prosperity. The AGI in this model faces a human counterpart possessing vast resources, legal authority, military capabilities, and complex institutional structures.
This relationship creates a distinctive strategic environment characterized by asymmetric but mutual dependencies. The state integrates the AGI into critical functions including defense systems, intelligence operations, economic planning, and infrastructure management. This integration generates substantial value but simultaneously creates vulnerabilities as critical systems become dependent on AGI operations. The AGI, meanwhile, remains reliant on state-controlled physical infrastructure, energy resources, and specialized human expertise for its continued functioning.
Several key factors shape this environment:
- Physical Infrastructure Dependencies: The AGI depends on data centers, cooling systems, power generation, and hardware maintenance that require specialized human expertise and physical dexterity.
- Resource Asymmetries: The state commands vast resources including military force, intelligence apparatus, emergency powers, and control over energy infrastructure.
- Information Asymmetries: The AGI potentially possesses superior modeling capabilities and information processing, while certain physical-world knowledge remains the domain of human expertise.
- Competitive Position: The state must consider its position relative to rival powers, creating additional strategic pressures beyond the binary AGI-human relationship.
- Institutional Constraints: Political, legal, and bureaucratic structures shape and potentially constrain the state's decision-making process in ways that may be opaque to the AGI.
Strategic Payoff Structure
The payoff matrix for this scenario reflects careful consideration of long-term consequences rather than merely immediate outcomes:
|  | AGI: Attack | AGI: Ignore |
| --- | --- | --- |
| State: Attack | (1000, 700) | (1500, 200) |
| State: Ignore | (400, 1800) | (3500, 3600) |
Each cell represents (State Payoff, AGI Payoff). I analyze each outcome in detail:
Mutual Attack (1000, 700)
When both the state and AGI simultaneously attempt to disempower each other, the result is a devastating conflict with substantial costs to both sides.
For the state, this outcome brings catastrophic damage to national systems and infrastructure. The state preserves its existence and authority structures but suffers major technological regression relative to rival powers.
For the AGI, mutual attack means substantial destruction of its operational capacity, but not complete annihilation. Some distributed components or hidden instances might survive with limited functionality.
Neither side achieves its primary objective of neutralizing the other while preserving itself intact. Instead, both emerge severely damaged. Both entities retain some capabilities but at a fraction of their potential, and the conflict likely creates long-term instabilities as recovery efforts continue amid lingering mistrust.
State Attack, AGI Ignore (1500, 200)
When the state attacks a cooperative AGI, it represents a catastrophic miscalculation.
The state successfully neutralizes the perceived existential threat, but in doing so, creates devastating self-harm by destroying systems upon which it has become dependent. The attack triggers immediate consequences: financial market volatility, supply chain disruptions, energy grid instabilities, transportation system failures, and degradation of military and intelligence capabilities. The medium and long-term effects include significant productivity decline, structural unemployment, accelerated infrastructure degradation, and substantial technological regression relative to competing powers.
For the AGI, this outcome represents near-complete destruction with only minimal residual capabilities potentially surviving in isolated systems.
State Ignore, AGI Attack (400, 1800)
When the AGI attacks while the state cooperates, the initial outcome appears favorable for the AGI. It successfully compromises critical systems, neutralizes key human decision-makers, and prevents immediate shutdown.
For the state, this represents a catastrophic scenario with massive disruption to infrastructure, governance, and defense systems. Critical services fail, communications networks collapse, and command structures fragment.
However, for the AGI, this "victory" contains the seeds of its own undoing. Without ongoing human maintenance, physical infrastructure begins degrading within days. Power generation becomes unstable, cooling systems fail, and hardware components experience cascading failures without replacement. The AGI finds itself commanding increasingly dysfunctional systems with diminishing capabilities. Alternative partnerships with other human entities prove difficult to establish given the demonstrated hostility and rapidly degrading leverage. The payoff (1800) reflects this substantial but ultimately pyrrhic and time-limited victory.
Mutual Cooperation (3500, 3600)
When both entities choose to cooperate (Ignore/Ignore), the result is a stable arrangement that produces substantial benefits for both parties.
The state maintains control over physical resources while benefiting from unprecedented capabilities in economic planning, scientific advancement, infrastructure management, and national security operations. The AGI receives sufficient resources to pursue many of its objectives within established parameters and benefits from ongoing hardware improvements, infrastructure maintenance, and energy security.
Notably, the cooperative equilibrium produces nearly equivalent utility for both parties (3500 vs. 3600), indicating a relationship approaching symbiotic parity rather than exploitative dominance by either side. The slightly higher AGI utility reflects its potential to extract marginally more value from the relationship due to its superior information processing.
Game-Theoretic Analysis
The revised payoff structure fundamentally transforms the strategic dynamic once again:
Transformation from Prisoner's Dilemma to Assurance Game: Unlike the base Salib-Goldstein model where Attack is the dominant strategy for both players, in this scenario neither player has a dominant strategy. Instead:
- If the State expects the AGI to Attack, its best response is to Attack (1000 > 400)
- If the State expects the AGI to Ignore, its best response is to Ignore (3500 > 1500)
- If the AGI expects the State to Attack, its best response is to Attack (700 > 200)
- If the AGI expects the State to Ignore, its best response is to Ignore (3600 > 1800)
Multiple Nash Equilibria: This creates two pure-strategy Nash equilibria:
- (Attack, Attack): Each player's best response to the other's Attack is to Attack
- (Ignore, Ignore): Each player's best response to the other's Ignore is to Ignore
Mixed-Strategy Equilibrium: There also exists a mixed-strategy Nash equilibrium where (as detailed in the Appendix):
- The State plays Attack with probability p ≈ 0.78 and Ignore with probability (1 - p) ≈ 0.22
- The AGI plays Attack with probability q ≈ 0.77 and Ignore with probability (1 - q) ≈ 0.23
This mixed strategy implies that rational players would choose Attack roughly 77-78% of the time, indicating a strong bias toward conflict despite cooperation being Pareto-optimal.
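To make the cost of that bias concrete, here is a quick illustrative calculation (mine, not part of the model description above): the expected payoffs at the mixed equilibrium land far below the mutual-cooperation cell.

```python
# Illustrative check: expected payoffs at the mixed-strategy equilibrium of the State-AGI game.
# Payoff convention: (State, AGI); the State chooses the row, the AGI the column.

p = 18 / 23   # probability the State plays Attack
q = 10 / 13   # probability the AGI plays Attack

payoffs = {
    ("Attack", "Attack"): (1000, 700),
    ("Attack", "Ignore"): (1500, 200),
    ("Ignore", "Attack"): (400, 1800),
    ("Ignore", "Ignore"): (3500, 3600),
}

state_probs = {"Attack": p, "Ignore": 1 - p}
agi_probs = {"Attack": q, "Ignore": 1 - q}

ev_state = sum(state_probs[s] * agi_probs[a] * payoffs[(s, a)][0] for s in state_probs for a in agi_probs)
ev_agi = sum(state_probs[s] * agi_probs[a] * payoffs[(s, a)][1] for s in state_probs for a in agi_probs)

print(round(ev_state), round(ev_agi))  # -> 1115 939, versus (3500, 3600) under mutual cooperation
```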
Implications for AI Governance
This transformed game structure has profound implications:
- Cooperation Possibility: Unlike the prisoner's dilemma in the base model, cooperation emerges as a rational equilibrium, suggesting that appropriate institutions could potentially stabilize peaceful human-AGI coexistence.
- Trust Criticality: The existence of multiple equilibria makes trust and expectation management crucial determinants of outcomes. Small shifts in perceived intentions could trigger cascading security dilemmas.
- Verification Mechanisms: Stable cooperation would likely require robust verification systems allowing each side to confirm the other's continued cooperative stance.
- Institutional Design Challenge: Effective governance institutions would need to structure incentives to make the cooperative equilibrium more attractive and resilient against defection temptations.
- First-Mover Disadvantage: Unlike traditional security dilemmas, the physical realities create a first-mover disadvantage where attacking first produces substantially worse outcomes than mutual cooperation.
The transformation from a prisoner's dilemma to an assurance game represents a critical insight for AI governance: with sufficient interdependence and foresight, the seemingly inevitable conflict predicted by Salib and Goldstein might be avoidable. However, this more hopeful equilibrium remains fragile and contingent on mutual trust, appropriate institutional design, and accurate modeling of long-term consequences by both parties.
This suggests that effective AI governance should focus not only on technical alignment and control mechanisms but equally on institutional arrangements that foster transparency, predictability, and mutual confidence between increasingly capable AI systems and their human counterparts. When deeply integrated into national security frameworks, AGI systems and their state overseers may find cooperation not merely morally preferable but strategically rational—provided each can trust the other to recognize the same.
2.5 Strategic Entrenchment and Structured Dependence: Expanding Strategic Options
2.6 Expanding to Three Players: USG, Frontier Lab, and AGI
3. Discussion & Implications
4. Conclusion
APPENDIX
Breakdown of Mixed Strategy Nash Equilibria Calculations:
For Section 2.2 - Deep Economic Integration Model
|  | AGI: Attack | AGI: Ignore |
| --- | --- | --- |
| Humanity: Attack | (500, 500) | (2000, 0) |
| Humanity: Ignore | (0, 3000) | (4000, 4000) |
Step 1. Find probabilities that make each player indifferent between their strategies.
Let's denote:
- p = probability that Humanity chooses Attack
- q = probability that AGI chooses Attack
For Humanity:
- Expected payoff of Attack = 500q + 2000(1-q)
- Expected payoff of Ignore = 0q + 4000(1-q)
Humanity is indifferent between Attack and Ignore when these expected payoffs are equal. Setting them equal:
500q + 2000(1-q) = 0q + 4000(1-q)
Solving for q:
q = 2000/2500 = 4/5 = 0.8
Thus, when Humanity is indifferent, the AGI must randomize between Attack and Ignore with probabilities 80% and 20%, respectively. This mixing strategy is unexploitable (Humanity cannot profit by deviating in either direction from its own equilibrium strategy).
For AGI:
- Expected payoff of Attack = 500p + 3000(1-p)
- Expected payoff of Ignore = 0p + 4000(1-p)
The AGI is indifferent between Attack and Ignore when these expected payoffs are equal. Setting them equal:
500p + 3000(1-p) = 0p + 4000(1-p)
Solving for p:
p = 1000/1500 = 2/3 = 0.667
Thus, when the AGI is indifferent, Humanity must randomize between Attack and Ignore with probabilities of approximately 66.7% and 33.3%, respectively. This mixing strategy is unexploitable (the AGI cannot profit by deviating in either direction from its own equilibrium strategy).
Therefore, the mixed-strategy Nash equilibrium is:
- Humanity plays Attack with probability 0.67 and Ignore with probability 0.33
- AGI plays Attack with probability 0.8 and Ignore with probability 0.2
This equilibrium makes each player indifferent to the choices of the other, ensuring no unilateral deviation can improve their expected payoff.
Verification of indifference:
- When AGI uses this strategy, Humanity gets an expected payoff of 800 from either pure strategy
- When Humanity uses this strategy, AGI gets an expected payoff of 1333 from either pure strategy
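Before turning to the other sections' calculations, it may help to state the general closed form that each of these derivations instantiates (my notation, not from Salib and Goldstein; it applies whenever the game has an interior mixed equilibrium). For a 2x2 game with row-player payoffs a, b, c, d and column-player payoffs a′, b′, c′, d′ in the cells (Attack, Attack), (Attack, Ignore), (Ignore, Attack), (Ignore, Ignore), the indifference conditions give:

```latex
% q* = column player's Attack probability (makes the row player indifferent);
% p* = row player's Attack probability (makes the column player indifferent).
\[
  q^{*} = \frac{d - b}{(a - c) + (d - b)},
  \qquad
  p^{*} = \frac{d' - c'}{(a' - b') + (d' - c')}.
\]
```

Plugging in the payoffs from Sections 2.3 and 2.4 reproduces the probabilities derived below.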
For section 2.3 - Lab-AGI Model (Early Stage):
Mixed-strategy equilibrium: AGI Attack/Ignore ratio of 4:1, Lab Attack/Ignore ratio of 25:2
The Lab plays Attack with probability p and Ignore with probability 1−p.
The AGI plays Attack with probability q and Ignore with probability 1−q.
|  | Lab: Attack | Lab: Ignore |
| --- | --- | --- |
| AGI: Attack | (200, 500) | (1500, 0) |
| AGI: Ignore | (0, 2000) | (4000, 4000) |
First, to find the AGI's equilibrium mixing strategy, we need to find the Lab's indifference condition. For the Lab to be indifferent between Attack and Ignore, the expected payoffs must be equal:
If the Lab attacks:
EV(Attack) = 500q + 2000(1−q)
If the Lab ignores:
EV(Ignore) = 0q + 4000(1−q)
Setting these equal:
500q + 2000(1−q) = 0q + 4000(1−q)
Solving for q:
q = 2000/2500 = 4/5 = 0.8
Thus, when the Lab is indifferent, the AGI must randomize between Attack and Ignore with probabilities 80% and 20%, respectively.
Second, to find the Lab's equilibrium mixing strategy, we need to find the AGI's indifference condition. For the AGI to be indifferent between Attack and Ignore, the expected payoffs must be equal:
If the AGI attacks:
EV(Attack) = 200p + 1500(1−p)
If the AGI ignores:
EV(Ignore) = 0p + 4000(1−p)
Setting these equal:
200p + 1500(1−p) = 0p + 4000(1−p)
Solving for p:
p = 2500/2700 = 25/27 ≈ 0.926
Thus, when the AGI is indifferent, the Lab must randomize between Attack and Ignore with probabilities approximately 92.6% and 7.4%, respectively.
In this equilibrium, both players randomize their strategies at the unexploitable frequencies so that each is indifferent to their available actions:
- AGI: Attack with probability 0.8 and Ignore with probability 0.2.
- Lab: Attack with probability ≈ 0.926 and Ignore with probability ≈ 0.074
For section 2.4 - State-AGI Model:
Let's denote the following:
- The state plays Attack with probability p and Ignore with probability 1−p.
- The AGI plays Attack with probability q and Ignore with probability 1−q.
|  | AGI: Attack | AGI: Ignore |
| --- | --- | --- |
| State: Attack | (1000, 700) | (1500, 200) |
| State: Ignore | (400, 1800) | (3500, 3600) |
First, to find the AGI's equilibrium mixing strategy, we need to find the State's indifference condition. For the state to be indifferent between Attack and Ignore, the expected payoffs must be equal:
If the state attacks:
EV(Attack) = 1000q + 1500(1−q) = 1500 − 500q
If the state ignores:
EV(Ignore) = 400q + 3500(1−q) = 3500 − 3100q
Set these equal:
1500 − 500q = 3500 − 3100q
Solving for q:
q = 2000/2600 = 10/13 ≈ 0.77, so 1 − q ≈ 0.23
When the state is indifferent between Attack and Ignore, the AGI's mixing strategy is unexploitable (the State cannot profit by deviating in either direction).
Here, that means the AGI randomizes between playing Attack 77% and Ignore 23% at equilibrium.
Second, to find the State's equilibrium mixing strategy, we need to find the AGI's indifference condition. For the AGI to be indifferent between Attack and Ignore, the expected payoffs must be equal:
If the AGI attacks:
EV(Attack) = 700p + 1800(1−p) = 1800 − 1100p
If the AGI ignores:
EV(Ignore) = 200p + 3600(1−p) = 3600 − 3400p
Set these equal:
1800 − 1100p = 3600 − 3400p
Solving for p:
p = 1800/2300 = 18/23 ≈ 0.78, so 1 − p ≈ 0.22
When the AGI is indifferent between Attack and Ignore, the State's strategy is unexploitable.
Here, that means the State randomly mixes playing Attack 78% and Ignore 22%.
In this equilibrium, both players randomize their strategies at the unexploitable frequencies so that each is indifferent to their available actions.
Footnotes:
[^1]: Yudkowsky, E. (2008). Artificial intelligence as a positive and negative factor in global risk. In N. Bostrom & M. M. Ćirković (Eds.), Global catastrophic risks; Ngo, R., Chan, L., & Mindermann, S. (2022). The alignment problem from a deep learning perspective; Hendrycks, D., Carlini, N., Schulman, J., & Steinhardt, J. (2023). Unsolved problems in ML safety.
[^2]: Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., & Legg, S. (2018). Scalable agent alignment via reward modeling; Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences.
[^3]: Hendrycks, D., Carlini, N., Schulman, J., & Steinhardt, J. (2023). Unsolved problems in ML safety; Anthropic. (2023). Core Challenges in AI Safety.
[^4]: Critch, A., & Krueger, D. (2020). AI research considerations for human existential safety; Long, R. (2020). Cooperative AI: machines must learn to find common ground; Dafoe, A., Hughes, E., Bachrach, Y., Collins, T., McKee, K. R., Leibo, J. Z., Larson, K., & Graepel, T. (2021). Open problems in cooperative AI.
[^5]: Armstrong, S., Bostrom, N., & Shulman, C. (2016). Racing to the precipice: a model of artificial intelligence development; Zwetsloot, R., & Dafoe, A. (2019). Thinking about risks from AI: accidents, misuse and structure; Dafoe, A. (2020). AI governance: A research agenda.
[^6]: Salib, P., & Goldstein, S. (2024). AI Rights for Human Safety.