Deriving an Alignment Specification from 5 Undeniable Axioms (and why the nightmare scenario is self-defeating)

nicomaco

7 min read · Apr 13

Method	What it does	What it assumes without proof
RLHF	Aligns to human preferences	That preferences constitute the correct specification
Constitutional AI	Aligns to written principles	That the principles are the right ones
Formal Verification	Proves behavior matches spec	That the spec is correct
Interpretability	Reveals internal computation	That we know what to look for

Level	Type	Example	AI Status
0	Void	Thermal equilibrium	—
1	Matter	Rock, star	Hardware
2	Life/Function	Cell, organism	Current AI
3	Consciousness	Human	Not yet achieved

TL;DR

The Specification Gap

The Foundation: 5 Axioms

The Derivation Chain

The Theorem

Independent Convergence: The Physical Framework

What This Means for Alignment

The Tool/Consciousness Boundary

The Two-Prong Resolution

The Nightmare Scenario Is Self-Defeating

Resolution of the Seven Sub-Problems

Where to Attack This

What I Am Not Claiming

Full Paper and Verification