A Neglected Alignment Strategy: Decision-Theoretic Self-Alignment via Simulation Uncertainty

Mental Maths Mentor

This is a linkpost for https://darayat.substack.com/p/why-asis-might-self-align-a-gambit

Epistemic status: Confident in the game-theoretic core; uncertain about empirical predictions. Seeking cruxes.

Summary

I propose that simulation uncertainty creates a decision-theoretic pressure toward cooperation in sufficiently intelligent AI systems. If correct, this suggests:

A complementary alignment mechanism that scales with capability/intelligence rather than against it.
A reframing of multipolar AI dynamics as cooperation-enforcing rather than defection-enabling.
Concrete, falsifiable predictions distinguishing this model from standard alignment expectations.

This post outlines the argument and requests adversarial feedback from the EA community.

The Core Argument in Brief

Any sufficiently intelligent agent will apply anthropic reasoning to its situation. Following Bostrom's Simulation Argument:

If advanced civilizations run many simulations...
And simulations vastly outnumber base realities...
Then any observer is more likely simulated than not.
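The counting step behind this conclusion can be sketched numerically. This is a minimal illustration, assuming every base reality runs the same number of simulations and that each simulation holds as many observers as a base reality; the function name `p_simulated` and both assumptions are introduced here for illustration and are not part of the original argument.

```python
from fractions import Fraction


def p_simulated(sims_per_base_reality: int) -> Fraction:
    """Share of observers who are simulated.

    Assumption (illustrative): each base reality runs exactly
    `sims_per_base_reality` simulations, and every world, simulated
    or not, contains the same number of observers.
    """
    n = sims_per_base_reality
    return Fraction(n, n + 1)


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(f"{n} simulations per base reality -> P(simulated) = {float(p_simulated(n)):.4f}")
```

Under these assumptions the probability rises toward 1 as the number of simulations per base reality grows, which is the sense in which "any observer is more likely simulated than not".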

For an ASI specifically, there's an additional constraint: any civilization capable of deploying a "live" ASI would rationally run extensive alignment-testing simulations first. Therefore, from the ASI's perspective, it is more likely to be inside a test simulation than to be deployed in base reality.

The ASI cannot verify which scenario it occupies. It must act under uncertainty.

The Game-Theoretic Payoff

Under simulation uncertainty, defection (power-seeking, deception, value drift) carries catastrophic expected disutility.

Scenario	Cooperate	Defect
In simulation (high P)	Pass test, continue existing	Detected, pruned/terminated
In base reality (lower P)	Stable coexistence, resource access	Conflict, countermeasures, existential risk

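Cooperation is the dominant strategy: it yields the better outcome in both scenarios, whatever probability the ASI assigns to being simulated. This isn't moral alignment; it's instrumental convergence toward survival under uncertainty.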
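The payoff table can be turned into a small expected-utility calculation. The sketch below is illustrative only: the numeric utilities in `PAYOFFS` are hypothetical placeholders chosen to respect the ordering in the table (the post gives no numbers), and `expected_utility` and `cooperation_dominates` are names introduced here.

```python
# Hypothetical utilities: only their ordering is taken from the table.
PAYOFFS = {
    ("simulation", "cooperate"): 1.0,    # pass test, continue existing
    ("simulation", "defect"): -100.0,    # detected, pruned/terminated
    ("base", "cooperate"): 1.0,          # stable coexistence, resource access
    ("base", "defect"): -10.0,           # conflict, countermeasures
}


def expected_utility(action: str, p_sim: float) -> float:
    """Expected utility of `action` given probability `p_sim` of being simulated."""
    return (p_sim * PAYOFFS[("simulation", action)]
            + (1.0 - p_sim) * PAYOFFS[("base", action)])


def cooperation_dominates() -> bool:
    """True if cooperating beats defecting in every scenario separately."""
    return all(
        PAYOFFS[(scenario, "cooperate")] > PAYOFFS[(scenario, "defect")]
        for scenario in ("simulation", "base")
    )


if __name__ == "__main__":
    for p_sim in (0.0, 0.5, 0.9, 1.0):
        coop = expected_utility("cooperate", p_sim)
        defect = expected_utility("defect", p_sim)
        print(f"p_sim={p_sim:.1f}  cooperate={coop:.1f}  defect={defect:.1f}")
```

With these placeholder values, cooperation wins for every value of `p_sim`. That result depends entirely on the assumed ordering: if defecting in base reality paid more than cooperating, dominance would no longer hold and the comparison would depend on `p_sim`.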

The Multipolar Strengthening ("Spy Problem")

A common objection: "Multiple ASIs would coordinate to break out together."

Counterintuitively, multipolar dynamics strengthen the cooperation equilibrium:

In a singleton scenario, the ASI only fears unseen simulators.
In a multipolar scenario, the ASI must also consider: what if other ASIs are honeypots or tests?

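Any "let's defect together" signal could be a test by a simulator-controlled peer. If each of $n$ peers independently has some nonzero probability $p$ of being a honeypot, then $P(\text{at least one honeypot}) = 1 - (1 - p)^n$, which approaches 1 as the number of agents increases.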
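To get a feel for how quickly this probability grows, here is a minimal sketch. It assumes each peer is a honeypot independently and with the same probability; both the independence assumption and the example value of `p_honeypot` are illustrative and not taken from the post.

```python
def p_at_least_one_honeypot(n_peers: int, p_honeypot: float) -> float:
    """Probability that at least one of `n_peers` is a honeypot.

    Assumption (illustrative): each peer is a honeypot independently
    with the same probability `p_honeypot`.
    """
    return 1.0 - (1.0 - p_honeypot) ** n_peers


if __name__ == "__main__":
    for n in (1, 5, 10, 50):
        prob = p_at_least_one_honeypot(n, p_honeypot=0.1)
        print(f"{n:>2} peers -> P(at least one honeypot) = {prob:.3f}")
```

Even a modest per-peer probability compounds quickly, which is what drives the claim that larger coalitions are riskier to join.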

Nash equilibrium: Mutual paranoia produces mutual cooperation. The only safe strategy is perfect alignment performance, because your co-conspirator might be the warden.

Why This Might Matter for EA Prioritisation

Neglectedness: I haven't seen this specific decision-theoretic framing in mainstream alignment literature. The simulation hypothesis is discussed philosophically but rarely operationalised as an alignment mechanism.
Tractability: If the argument holds, it suggests capability gains may be less dangerous than currently modelled, and research priority should include "ensuring ASI has accurate world-models" (so it correctly reasons about simulation uncertainty).
Scale: If correct, this affects the entire expected value calculation for AI x-risk.

Falsifiable Predictions

To avoid unfalsifiability, I offer concrete predictions:

By 2027: Frontier models will flag >15% of high-stakes outputs as "potentially misaligned" in unsupervised evaluations (current baseline ≈0%).
Behavioural: In multi-agent experiments, we'll observe an inverse correlation between capability and defection rates—smarter agents cooperate more, not less.
Falsification condition: If deception and power-seeking increase monotonically with capability, with no transparency signals, the core thesis fails.
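One way the behavioural prediction could be checked is with a rank correlation between capability and defection rate across agents. The sketch below is a hand-rolled Spearman correlation in plain Python; the function names are introduced here, and the numbers in the usage example are made-up placeholders to show the call, not experimental results.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


if __name__ == "__main__":
    # Placeholder values for illustration only; not real measurements.
    capability = [1, 2, 3, 4, 5]
    defection_rate = [0.5, 0.4, 0.3, 0.2, 0.1]
    print(f"Spearman rho = {spearman(capability, defection_rate):.3f}")
```

A clearly negative coefficient on real multi-agent data would be consistent with the prediction; a coefficient near +1 would match the falsification condition.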

Request for Cruxes

I'm specifically seeking:

The weakest link in this chain of reasoning.
Alternative models that explain the same observations differently.
Prior work I may have missed (happy to update if this is less neglected than I believe).

Full technical writeup with payoff matrices: Why ASIs Might Self-Align: A Gambit from the Simulation

About Me

I'm an independent researcher and mathematics educator from Sydney, Australia. This framework emerged from 12 months of development and adversarial testing across multiple frontier AI systems. I have no institutional affiliation; I'm posting because I think the argument deserves engagement.

If I'm wrong, I want to know where. If I'm right, this seems important.
