AI Safety Endgame Stories

IvanVendrov

12 min read · Sep 28, 2022

Comments 1

Sharmake

Counterfactual Impact and Power-Seeking

It worries me that many of the most promising theories of impact for alignment end up with the structure “acquire power, then use it for good”.

This seems to be a result of the counterfactual impact framing and a bias towards simple plans. You are a tiny agent in an unfathomably large world, trying to intervene on what may be the biggest event in human history. If you try to generate stories where you have a clear, simple counterfactual impact, most of them will involve power-seeking for the usual instrumental convergence reasons. Power-seeking might be necessary sometimes, but it seems extremely dangerous as a general attitude; ironically human power-seeking is one of the key drivers of AI x-risk to begin with. Benjamin Ross Hoffman writes beautifully about this problem in Against responsibility.

I don’t have any good solutions, other than a general bias away from power-seeking strategies and towards strategies involving cooperation, dealism, and reducing transaction costs. I think the pivotal act framing is particularly dangerous, and aiming to delay existential catastrophe rather than preventing it completely is a better policy for most actors.

This is why AI risk is so high, in a nutshell.

Yet unlike this post (or Benjamin Ross Hoffman's post), I think this was a sad but crucially necessary decision. I think the option you propose is at least partially a fabricated option. I think a lot of the reason is that people dearly want there to be a better option, even if it's not there.

Link to fabricated options:

https://www.lesswrong.com/posts/gNodQGNoPDjztasbh/lies-damn-lies-and-fabricated-options

The story is most directly inspired by Ajeya’s takeover post, but meant to cover most AI x-risk stories including What failure looks like, AGI Ruin: A List of Lethalities, and most multipolar failures. It’s also mostly agnostic to timelines and takeoff speeds.

I revisit this assumption later in the essay, but I think it is analytically useful for two reasons. First, any plan that leads to true existential security will need to have an answer for how to avert this specific Magma catastrophe, so much of the analysis will transfer over. Second, achieving existential security or building friendly AGI may simply not be possible, and all we can do is tread water and delay catastrophe a few years at a time. Cryptography is like this: we haven’t found any perfect ways to do encryption and may never, but we can chain together enough kludges that extremely secure communication is possible most of the time.

From the OpenAI charter: "if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be “a better-than-even chance of success in the next two years.”"

Changing the Technology

Differential Development of Safety

Technological Attractor States

Changing the Decision Maker

Defusing Races

Changing Magma’s Culture

Replacing Magma

Changing the Broader World

Counterfactual Impact and Power-Seeking