Why mechanistic interpretability does not and cannot contribute to long-term AGI safety (from messages with a friend)

Remmelt


Comments 3


Brilliant and compelling writeup, Remmelt! Thank you so much for sharing it. (And thank you so much for your kind words about my post! I really appreciate it.)

I strongly agree with you that mechanistic interpretability is very unlikely to contribute to long-term AI safety. To put it bluntly, the fact that so many talented and well-meaning people sink their time into this unpromising research direction is unfortunate.

I think we AI safety researchers should be more open to new ideas and approaches, rather than getting stuck in the same old research directions that we know are unlikely to meaningfully help. The post "What an actually pessimistic containment strategy looks like" has potentially good ideas on this front.

Remmelt

Yes, I think we independently arrived at similar conclusions.

Yulu Pi

Great post for thinking critically about MI.

MI research is only in its beginning stages, and many questions about the inner workings of models have yet to be answered. As a result, I expect that for now, to keep the research tractable, we are setting aside interactions with humans and the broader environment.

Ultimately, however, these factors are all very important. Another strand of explainability research, XAI, has increasingly focused on the significance of human-AI interactions and has proposed a human-centered approach. I'm waiting to see how MI expands its scope to cover this issue. Also, I have been wondering: how can existing research on transparency flexibly adapt to a rapidly evolving target if neural networks (or, more specifically, the transformers that have been the focus of MI research) are not the final form of AGI?



Message exchange with a friend

On an important side-tangent – why I think Eliezer Yudkowsky does not try to advocate for people to try to prevent AGI from ever being built

On notion of built-in AGI-alignment

On why conceptually "mechanistic interpretability" is a non-starter

On specific technical angles why mechanistic interpretability is insufficient (summarised only briefly):

On an overview of relevant theoretical limits

Returning to overarching points why mechanistic interpretability falls short

On fundamental dynamics that are outside the scope of application of mechanistic interpretability

On whether I think mechanistic interpretability would be helpful at least

On ways the reverse-engineering analogy is unsound

On destabilising internals–environment feedback loops

On why both "inspect internals" and "inspect externals" methods fall short

On neural network code as spaghetti code

Returning to the reverse engineering analogy

On the key sub-arguments for why long-term safe AGI is not possible

On ecosystems being uncomputable

Polymath researcher's response on theoretical limits of engineerable control

Polymath researcher's response on mechanistic interpretability

Afterward, an email response from the polymath researcher


On specific technical angles why mechanistic interpretability is insufficient (summarised only briefly):