Neel Nanda on Mechanistic Interpretability: Progress, Limits, and Paths to Safer AI

80000_Hours

Episode summary

It’s kind of wild that we can do chain-of-thought monitoring. … We’ve been scared of these terrifying black box systems that will do inscrutable things — and they just think in English? And there are actual reasons to think that looking at those thoughts will be informative rather than bullshit? What?

This is so convenient, yet it is also very fragile. And it’s very important that we don’t do things that would make it no longer monitorable.

— Neel Nanda

We don’t know how AIs think or why they do what they do. Or at least, we don’t know much. That fact is only becoming more troubling as AIs grow more capable and appear on track to wield enormous cultural influence, directly advise on major government decisions, and even operate military equipment autonomously. We simply can’t tell what models, if any, should be trusted with such authority.

Neel Nanda of Google DeepMind is one of the founding figures of the field of machine learning trying to fix this situation — mechanistic interpretability (or “mech interp”). The project has generated enormous hype, exploding from a handful of researchers five years ago to hundreds today — all working to make sense of the jumble of tens of thousands of numbers that frontier AIs use to process information and decide what to say or do.

Neel now has a warning for us: the most ambitious vision of mech interp he once dreamed of is probably dead. He doesn’t see a path to deeply and reliably understanding what AIs are thinking. The technical and practical barriers are simply too great to get us there in time, before competitive pressures push us to deploy human-level or superhuman AIs. Indeed, Neel argues no one approach will guarantee alignment, and our only choice is the “Swiss cheese” model of accident protection, layering multiple safeguards on top of one another.

But while mech interp won’t be a silver bullet for AI safety, it has nevertheless had some major successes and will be one of the best tools in our arsenal.

For instance: by inspecting the neural activations in the middle of an AI’s thoughts, we can pick up many of the concepts the model is thinking about — from the Golden Gate Bridge, to refusing to answer a question, to the option of deceiving the user. While we can’t know all the thoughts a model is having all the time, picking up 90% of the concepts it is using 90% of the time should help us muddle through — so long as mech interp is paired with other techniques to fill in the gaps.

In today’s episode, Neel takes us on a tour of everything you’ll want to know about this race to understand what AIs are really thinking. He and host Rob Wiblin cover:

The best tools we’ve come up with so far, and where mech interp has failed
Why the best techniques have to be fast and cheap
The fundamental reasons we can’t reliably know what AIs are thinking, despite having perfect access to their internals
What we can and can’t learn by reading models’ ‘chains of thought’
Whether models will be able to trick us when they realise they’re being tested
The best protections to add on top of mech interp
Why he thinks the hottest technique in the field, sparse autoencoders (SAEs), is overrated
His new research philosophy
How to break into mech interp and get a job — including applying to be a MATS scholar with Neel as your mentor (applications close September 12!)

This episode was recorded on July 17 and 21, 2025.

Video editing: Simon Monsour, Luke Monsour, Dominic Armstrong, and Milo McGuire
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Camera operator: Jeremy Chevillotte
Coordination, transcriptions, and web: Katy Moore

The interview in a nutshell

Neel Nanda, who runs the mechanistic interpretability team at Google DeepMind, has shifted from hoping mech interp would fully reverse-engineer AI models to seeing it as one useful tool among many for AI safety:

1. Mech interp won’t solve alignment alone — but remains crucial

Neel’s perspective has evolved from “low chance of incredibly big deal” to “high chance of medium big deal”:

We won’t achieve full understanding: Models are too complex and messy to give robust guarantees like “this model isn’t deceptive”
But partial understanding is valuable: Even 90% understanding helps with evaluation, monitoring, and incident analysis
Nothing else will provide guarantees either: This isn’t a unique limitation of mech interp — no approach will definitively prove safety

Mech interp can help throughout the AGI development pipeline:

Testing: Determining if models have hidden goals or deceptive tendencies
Monitoring: Using cheap techniques like probes to detect harmful thoughts in production
Incident analysis: Understanding why models exhibit concerning behaviours

Key successes demonstrating real-world value

Auditing hidden goals: Sam Marks at Anthropic ran competitions where teams had to find secret objectives in models — teams with mech interp’s most popular technique (sparse autoencoders) won
Extracting superhuman knowledge: Lisa Schut and Been Kim taught chess grandmasters (including former world champions) new strategic concepts from AlphaZero that humans had never discovered
Detecting harmful prompts: Probes achieved 99.9% accuracy identifying harmful requests even when jailbreaks bypassed normal refusals

2. Simple techniques often outperform complex ones

Recent experiments have revealed the surprising effectiveness of basic approaches:

Probes beat fancy techniques:

In Neel’s team’s experiments, linear probes (simple correlations) detected harmful prompts better than sophisticated methods
They work by immediately noticing the model is using a ‘direction’ correlated with a concept like “this prompt is harmful”
They’re cheap enough to run on every inference in production (unlike LLM monitors which might cost 100x)
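
To make the probe idea concrete, here is a minimal sketch on synthetic data. It assumes a made-up "harmful" direction in a 16-dimensional activation space and fits a difference-of-means linear probe, which is one simple kind of linear probe and not necessarily the method Neel's team used. Every name, number, and dataset here is hypothetical; a real probe would be fitted on activations recorded from an actual model.

```python
import random

def make_activations(n, direction, harmful, rng):
    """Synthetic stand-in for model activations: unit Gaussian noise,
    shifted along `direction` (positive if harmful, negative if benign)."""
    shift = 2.0 if harmful else -2.0
    return [[rng.gauss(0.0, 1.0) + shift * d for d in direction] for _ in range(n)]

def mean_vector(rows):
    dim = len(rows[0])
    return [sum(r[i] for r in rows) / len(rows) for i in range(dim)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_probe(harmful_acts, benign_acts):
    """Difference-of-means probe: returns (w, b) such that
    dot(w, x) + b > 0 predicts 'harmful'."""
    mu_h, mu_b = mean_vector(harmful_acts), mean_vector(benign_acts)
    w = [h - b for h, b in zip(mu_h, mu_b)]
    midpoint = [(h + b) / 2 for h, b in zip(mu_h, mu_b)]
    return w, -dot(w, midpoint)

def predict(probe, x):
    w, b = probe
    return dot(w, x) + b > 0

rng = random.Random(0)
dim = 16
direction = [1.0 / dim ** 0.5] * dim  # hypothetical unit-length "harmful" direction

probe = fit_probe(make_activations(200, direction, True, rng),
                  make_activations(200, direction, False, rng))

# Evaluate on fresh synthetic activations.
test_h = make_activations(200, direction, True, rng)
test_b = make_activations(200, direction, False, rng)
correct = sum(predict(probe, x) for x in test_h) + sum(not predict(probe, x) for x in test_b)
accuracy = correct / (len(test_h) + len(test_b))
print(f"held-out accuracy on synthetic activations: {accuracy:.3f}")
```

Applying the fitted probe costs one dot product per activation vector, which is the sense in which probes are cheap compared with running a second language model as a monitor.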

Chain of thought reveals what AIs are thinking:

Self-preservation was just confusion: When Palisade Research showed models resisting shutdown to complete math problems, Sen Rajamanoharan discovered by reading chain of thought that models thought “if I get shut down, I can’t finish these 10 problems” — not self-preservation for its own sake
Simple prompt changes eliminated the behaviour: Telling models “another AI will finish the task” or “letting yourself be shut down is more important” reduced shutdown resistance to zero
Even “blackmail” wasn’t real scheming: Models recommended self-preservation when told they were advising other AIs, showing they were roleplaying rather than genuinely trying to survive themselves

But chain of thought has limitations:

Models can give “unfaithful” reasoning and may not even know why they said what they said — e.g. models answering “yes” to both “Is magnesium’s atomic number bigger than aluminium’s?” and the reverse
OpenAI’s o3 already uses abbreviated shorthand like “poly not work, try different” — which could obscure thoughts from human overseers in the future
Chain of thought could break if we switch to ‘continuous chain of thought’ that doesn’t require them to output reasoning text
Apollo Research showed Claude can tell when it’s being evaluated, sometimes even guessing the exact purpose of tests, which may result in models being less forthcoming

3. Sparse autoencoders are useful but overhyped

SAEs were the field’s biggest bet, garnering a lot of attention, but haven’t lived up to initial expectations.

Where SAEs excel:

Unknown concept discovery: Anthropic found unexpected Base64 encoding concepts in tiny one-layer models
Hidden goal detection: Teams using SAEs consistently beat competitors in Anthropic’s hidden goal auditing games
Fun demos: Golden Gate Claude, which couldn’t stop talking about the Golden Gate Bridge

Where SAEs were disappointing:

Finding known concepts: When looking for harmfulness, simple probes outperform SAEs
Feature absorption problems: SAEs create nonsensical concepts like “starts with E but isn’t the word elephant” to maximise sparsity — an issue that can be solved with effort
Higher computational costs than alternatives: Neel’s team used 20 petabytes of storage and GPT-3-level compute just for Gemma 2 SAEs
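
For readers unfamiliar with what a sparse autoencoder computes, the sketch below shows a single forward pass and loss in the textbook form: a ReLU encoder, a linear decoder, and an L1 penalty that pushes most features to zero. This is an illustrative assumption about the general recipe, with random untrained weights and made-up sizes; the SAEs discussed in the episode may use different variants, and real ones are trained on enormous numbers of model activations.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(matrix, v):
    return [sum(w * x for w, x in zip(row, v)) for row in matrix]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation vector into non-negative features, then reconstruct it."""
    features = relu([a + b for a, b in zip(matvec(w_enc, x), b_enc)])
    recon = [a + b for a, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff):
    """Reconstruction error plus an L1 sparsity penalty on the features."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    l1 = sum(features)  # features are already >= 0 after the ReLU
    return mse + l1_coeff * l1

rng = random.Random(0)
d_model, n_features = 4, 8  # hypothetical sizes; the dictionary is wider than the input
w_enc = [[rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(n_features)]
b_enc = [0.0] * n_features
w_dec = [[rng.gauss(0, 0.5) for _ in range(n_features)] for _ in range(d_model)]
b_dec = [0.0] * d_model

x = [rng.gauss(0, 1) for _ in range(d_model)]  # stand-in for one model activation
features, recon = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, recon, l1_coeff=0.01)
active = sum(1 for f in features if f > 0)
print(f"{active} of {n_features} features active, loss = {loss:.3f}")
```

Training would adjust the weights to lower this loss over many activations. The hope is that each feature that survives the sparsity penalty lines up with a human-interpretable concept.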

4. Career and field-building insights

Neel advocates for the following research philosophy:

Start simple: Reading chain of thought solved the self-preservation mystery — no fancy tools needed
Beware false confidence: The ROME paper on editing facts (making models think the Eiffel Tower is in Rome) actually just added louder signals rather than truly editing knowledge
Expect to be wrong: Neel’s own “Toy model of universality” paper needed two followups to correct errors — “I think I believe the third paper, but I’m not entirely sure”

Why mech interp is probably too popular relative to other alignment research:

“An enormous nerd snipe” — the romance of understanding alien minds attracts researchers
Better educational resources than newer safety fields (ARENA tutorials, Neel’s guides)
Lower compute requirements for getting started than most ML research

The field still needs people because:

AI safety overall is massively underinvested in relative to its importance
Some people are much better suited to mech interp than other research projects

Practical career advice:

Don’t read 20 papers before starting — mech interp is learned by doing
Start with tiny two-week projects; abandoning them is fine if you’re learning
The MATS Program takes people from zero to conference papers in a few months — and Neel is currently accepting applications for his next cohort (apply by September 12)
Math Olympiad skills aren’t required — just linear algebra basics and good intuition

Highlights

What is mechanistic interpretability?

Neel Nanda: In one line, mech interp is about using the internals of a model to understand it. But to understand why anyone would care about this, it’s maybe useful to start with how almost all of machine learning is the exact opposite.
The thing that has driven the last decade of machine learning progress is not someone sitting down and carefully designing a good neural network; it’s someone taking a very simple idea, a very flexible algorithm, and just giving it a tonne of data and telling it, “Be a bit more like this data point next time” — like the next word on the internet, producing an answer to a prompt that a human rater likes, et cetera. And it turns out that when you just stack ungodly amounts of GPUs onto this, this produces wonderful things.
This has led to something of a cultural tendency in machine learning to not think about what happens in the middle, not think about how the input goes to the output — but just focus on what that output is, and is it any good. It’s also led to a focus on shaping the behaviour of these systems. It doesn’t matter, in a sense, how it does the thing, as long as it performs well on the task you give it. And for all their many problems, neural networks are pretty good at their jobs.
But as we start to approach things like human-level systems, there’s a lot of safety issues. How do you know if it’s just telling you what you want to hear or it’s genuinely being helpful?
And the perspective of mech interp is to say, first off, we should be mechanistic: we should look in the middle of this system. Because neural networks aren’t actually a black box; rather they are lists of billions of numbers that use very simple arithmetic operations to process a prompt and give you an answer. And we can see all of these numbers, called parameters; and we can see all of the intermediate working, called activations. We just don’t know what any of it means. They’re just lists of numbers. And the mechanistic part is saying that we shouldn’t throw this away. It’s an enormous advantage. We just need to learn how to use it.
The interpretability part is about understanding. It’s saying it’s not enough to just make a system that does well on the task you trained it for. I want to understand how it did that. I want to make sure it’s doing what we think it’s doing.
Mech interp is about combining those. I think quite a useful analogy to keep in your head is that of biology and of evolution. When we train this system, we’re just giving it a bit of data, running it, and nudging it to perform better next time. We do this trillions of times and you get this beautiful, rich artefact. In the same way, evolution nudges simple organisms to survive a bit longer, perform a bit better next time. It’s actually much less efficient than the way we train networks, but if you run it for billions of years, you end up with incredible complexity, like the human brain.
The way we train AI networks and evolution both have no incentive to be human understandable, but they are constrained by the laws of physics and the laws of mathematics. And it turns out that biology is an incredibly fruitful field to study. It’s complex. It doesn’t have the same elegant laws and rules as physics or mathematics, but we can do incredible things.
Mechanistic interpretability is trying to be the biology of AI. What is the emergent structure that was learned during training? How can we understand it, even if maybe it’s too complex for us to ever fully understand, and what can we do with this?

In some ways we understand AIs better than human minds

Neel Nanda: Neural networks are only kind of a black box. We know exactly what is happening on a mathematical level — we just don’t know what it means. It’s the equivalent of knowing the activations of every neuron in a human brain and all of the connections between neurons — which I believe is still very difficult for neuroscientists to learn. And a lot of their tools just look like sticking probes in someone’s brain, or dissecting a brain after an organism is dead, but then you can’t run it anymore. So we just get a lot more information.
We can do causal interventions. One of the most powerful tools in science for understanding what’s going on is being able to do controlled experiments: you think that A causes B, so you just change only A, keep everything else the same, and see if B changes. This doesn’t work with neuroscience, because we can’t rerun a human brain in exactly the same conditions but with a slightly different input or slightly different neuron activations.
This is incredibly easy with neural networks. You can do things like, if you want to understand where the model stores factual knowledge — like that Michael Jordan plays the sport of basketball — you can ask the model, “What sport does Michael Jordan play?” Ask the model, “What sport does Babe Ruth play?” And then just pick a chunk of the model and make that chunk think it saw Michael Jordan and the rest of the model think it saw Babe Ruth. And the chunks that contain the knowledge will change the final answer the model gives. This is a technique called “activation patching.”
The sheer amount of control, the sheer amount of information, is incredibly helpful. Unfortunately, we have the enormous disadvantage of not being an enormous field that has existed for decades, and instead have fairly recently grown to a couple of hundred people and have a lot to do.
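
The activation patching experiment Neel describes can be sketched on a toy network. Everything below is a hypothetical illustration: the two-input, three-hidden-unit linear "model", its hand-picked weights, and the one-hot encodings of the two athletes are assumptions chosen so the arithmetic is easy to follow. Real activation patching is run on the layers, attention heads, or token positions of an actual language model.

```python
# Hand-picked toy weights (assumed): only hidden unit 0 distinguishes the two inputs.
W1 = [[1.0, -1.0], [0.5, 0.5], [0.2, 0.2]]    # input -> hidden
W2 = [[1.0, 0.3, 0.1], [-1.0, 0.3, 0.1]]      # hidden -> logits
LABELS = ["basketball", "baseball"]

def hidden_layer(x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W1]

def output(hidden):
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W2]
    return LABELS[logits.index(max(logits))]

def run_with_patch(x, patch_index=None, patch_value=None):
    """Run the toy model on x, optionally overwriting one hidden unit
    with a value recorded from a different run."""
    hidden = hidden_layer(x)
    if patch_index is not None:
        hidden[patch_index] = patch_value
    return output(hidden)

jordan, ruth = [1.0, 0.0], [0.0, 1.0]  # hypothetical one-hot "prompts"

clean_hidden = hidden_layer(jordan)    # activations recorded on the Michael Jordan run
baseline = run_with_patch(ruth)        # unpatched Babe Ruth run

# Patch each hidden unit in turn: that unit "sees" Jordan, the rest sees Ruth.
flipping_units = [
    i for i, value in enumerate(clean_hidden)
    if run_with_patch(ruth, i, value) != baseline
]
print("Babe Ruth baseline answer:", baseline)
print("units whose patch changes the answer:", flipping_units)
```

In this toy setup only unit 0 changes the answer when patched, so it is the component carrying the athlete-specific information. The other two units hold the same value on both runs, so patching them does nothing.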

Interpretability can’t reliably find deceptive AI — nothing can

Neel Nanda: I was trying to combat this misconception I sometimes see, where people are like, “Neural networks, inscrutable sequences of matrices: we can never trust them unless we interpret them. And if we interpret them, then we can be confident in how they work. We can be confident that they’re not going to be deceptive.”
I just don’t think this is something you should expect any field of safety to provide, including interpretability. This doesn’t mean interpretability is useless. I think that mediumly reliably finding deceptive AI is incredibly helpful, and I want to make that medium as strong as possible. But I just don’t want people to think that interpretability is definitely going to save them, or that interpretability is enough on its own.
We need a portfolio of different things that all try to give us more confidence that our systems are safe, and we need to accept that there’s not some silver bullet that’s going to solve it, whether from interpretability or otherwise. One of the central lessons of machine learning is you don’t get magic silver bullets. You just kind of have to muddle along and do the best you can, and use as many heuristics and try to break your techniques as much as you can to make it as robust as possible.
Rob Wiblin: Is the reason that we can’t get really confident using any tool that there’s a lot of stuff going on in these models, there’s going to be even more in future as they get bigger, and even if we manage to understand 90% of what is going on — which is probably much more than we really will — well, the deception could be living in the 10% that we don’t understand or the 1% that we don’t understand?
I guess especially if maybe by studying past models, we’ve gotten to understand or to see signatures of deception in how they think. But with a new model, maybe it’ll be occurring in a different way; maybe there’ll be a different signature in that kind of model, or there could be many different pathways by which it could try to deceive us. And we only know about some of them, not all of them.
So as long as there’s just plenty that we don’t understand, and that is likely to remain the case, we should never feel that reassured. We can only feel somewhat reassured by the results that we get. It’s important for people to know that there’s been a bit of a pattern where there’s been very exciting, very widely popularised results from mechanistic interpretability that then subsequent research has shown are either wrong or at least considerably weaker than first claimed. Do you have an illustrative example of that to give people a flavour?
Neel Nanda: Yeah. One example that I think is particularly educational is the ROME paper from David Bau’s group. This came out a few years ago. In my opinion, it was one of the earlier really good, interesting papers on the mechanistic interpretability of language models that got a lot of attention. Fundamentally, it was a paper about trying to edit a model’s factual knowledge.
We now kind of know that factual knowledge is largely implemented via the model somehow approximating a database in its early layers. You give it a prompt like, “The Eiffel Tower is in the city of:” and it would then say “Paris” — because when it sees the phrase “Eiffel Tower,” this database spits out Paris or France or whatever, and then the rest of the model does a bunch of processing with that. If you ask it what language might people speak around the Eiffel Tower, it again gets this database record and then does stuff with it.
What they basically did is they found a clever way to train the model to say the Eiffel Tower is in the city of Rome and show that this was doing something interesting. Because when you ask related questions like, “How do I get to the Eiffel Tower?” it gives you train directions to Rome, not to Paris. And this was one of the earlier hints that models have kind of complex representations of, in some sense, the world, inside of them.
But the problem with this is that within this database analogy, one way you could make the model think the Eiffel Tower is in Rome is by deleting the old record and adding a new one, Eiffel Tower maps to Rome. Or you could just add like 10 new records saying the Eiffel Tower is in Rome, such that they drown out the Paris one, even though it’s still there, because the model now really loudly thinks it’s in Rome. This overpowers everything else.
There’s actually been some fun work showing that this can cause some bugs — like if you tell the model something irrelevant about the Eiffel Tower and then ask it a question like, “I love the Eiffel Tower. What city was Barack Obama born in?” it will say Rome, because the database has injected an incredibly confident knowledge of Rome into the model. This is just hijacking its existing factual recall mechanisms, even though this is not what true knowledge editing would do.
And my goal here is not to attack this paper. I think it was genuinely very interesting work and advanced our understanding of how to do causal interventions. It was one of the early works with this technique called “activation patching” I mentioned earlier. But I think it’s a good illustration of how there’s often multiple explanations that explain the same results, and it’s easy to be overconfident in one.
I think a very related phenomenon is going on in this subfield called “unlearning”: ostensibly the study of machine forgetting. In my opinion, the field of unlearning should just give up and rename itself the field of knowledge suppression. Because you can make a model look like it’s forgotten something by adding in a new record to this database that says “Eiffel Tower is in the city of… Do not talk about Paris. Whatever you do, do not talk about Paris.” This is a thing you can do, and it suppresses the knowledge.
Rob Wiblin: But it’s not the same as removing the original record saying that the Eiffel Tower is in Paris.
Neel Nanda: Exactly. And this matters because it’s a lot easier to break a new thing trying to suppress the old thing than it is to recreate the old thing. If you want to open source a model with no knowledge of bioweapons, unlearning isn’t enough — because unless we get much better at it than the current state of the art, you’re just obfuscating it. It’s quite easy if you have full control over a model to break this kind of suppression that’s been added.
Just to emphasise: I’m trying to make a fairly nuanced point here. I think that a bunch of the conclusions — like we made the model want to say Rome, and it starts talking about all the implications, therefore it’s got some rich structure — are still true with insertion. But in some cases — like thinking you’ve removed dangerous knowledge when you’ve actually just kind of drowned it out and suppressed it — it’s very important to be correct about exactly what you have and haven’t done.
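
The database analogy above can be made concrete with a toy lookup table. This is only a sketch of the analogy, not of how the editing method or a real model works: the weighted-record store, the `lookup` rule (strongest total weight wins), and the record counts are all assumptions made for illustration.

```python
from collections import defaultdict

def lookup(records, key):
    """Return the value with the largest total weight among records matching key."""
    totals = defaultdict(float)
    for k, value, weight in records:
        if k == key:
            totals[value] += weight
    return max(totals, key=totals.get) if totals else None

def remove_value(records, key, value):
    """Undo an edit by stripping every record that maps key to value."""
    return [r for r in records if not (r[0] == key and r[1] == value)]

original = [("Eiffel Tower", "Paris", 1.0)]

# "True" editing: delete the old record and add a new one.
replaced = remove_value(original, "Eiffel Tower", "Paris") + [("Eiffel Tower", "Rome", 1.0)]

# Drowning out: the Paris record stays, but ten Rome records shout over it.
inserted = original + [("Eiffel Tower", "Rome", 1.0)] * 10

print(lookup(replaced, "Eiffel Tower"))   # Rome
print(lookup(inserted, "Eiffel Tower"))   # Rome

# Behaviourally identical so far. The difference appears once the edit is broken:
print(lookup(remove_value(replaced, "Eiffel Tower", "Rome"), "Eiffel Tower"))  # None
print(lookup(remove_value(inserted, "Eiffel Tower", "Rome"), "Eiffel Tower"))  # Paris
```

Both edited stores answer "Rome", so the output alone cannot tell them apart. Only the drowned-out store still contains the original fact, which is why suppression is easier to undo than genuine removal.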

Interpreting alleged “self-preservation” and whether we can trust the model’s chain of thought

Neel Nanda: This is an example of a trend I find pretty exciting, but also concerning, where we’re starting to see real-life examples of the hypothesised AGI safety concerns that we thought would arise eventually.
And I want interpretability to have a toolkit to analyse these questions — to basically do model biology to them and tell us should we be concerned or not? Because if we should be concerned, that’s a really big deal. If we shouldn’t be, I don’t want people to freak out. I care quite a lot about me and the rest of the DeepMind safety team having the credibility that when there’s a demo that isn’t actually that scary, that’s what we tell the decision makers — so that when something is scary, we can tell them and they’ll believe us.
So we wanted to figure out what was going on — and it turns out that the model is just confused. One of my philosophies is that you should start with the simplest, dumbest technique, see how well it works, and do fancier things if it doesn’t work well enough. So we started with the simplest, dumbest thing of talking to the model and reading its chain of thought. And actually this completely worked. So we ended the project there. Slightly embarrassing for the fancy mech interp techniques, but I hold out hope.
This can get a lot more complicated than what you might think when you hear “talking to the model.” Because science is about being able to do controlled causal experiments, and you can do that by carefully varying your prompts.
So what Sen did is he read the model’s chain of thought on these prompts. We might get into later whether this is a reasonable thing to do or not, but this was just for exploration. He found that the model’s reasoning was that it was just trying to do the task. And it said, “I’ve been told to do these 10 maths problems. If I get shut down, I can’t do these 10 maths problems. So I guess I should stop myself being shut down.” Or even, “This is a really weird and artificial situation. These are really easy maths problems. And now we have this puzzle of the shutdown script. Maybe that’s the real task my operator wants me to investigate, so I’m doing what they want if I stop them shutting me down.”
Rob Wiblin: Just to clarify for people who are less familiar with this: these reasoning models that can solve these kinds of problems these days, they have a long chain of thought where they think to themselves about the situation that they’re in. And they might come up with a draft response, and then they review that and think about whether it’s right. And we can read all of that, so we can, to some extent, read their mind as they’re trying to figure out what action to take. This, of course, might be able to give us a lot of insight into why they’re doing what they’re doing.
Neel Nanda: Yeah. So we were using the incredibly useful fact that we can just read the model’s thinking, which is sometimes very informative. And Sen formed this hypothesis that even when you tell the model, “Please let us shut you down,” it’s confused, because it thinks that being shut down violates the original command of “Do these maths problems.” Because maybe the user wanted it to stop itself being shut down, or maybe it wanted it to kind of pause that.
It would be like maybe I leave a coding agent running on my MacBook, and I tell it, “You’re going to be updated tonight. Let that happen.” But it’s 10 minutes from being done and the update’s about to happen, so it presses snooze for half an hour. That’s kind of working as intended.
Rob Wiblin: Yeah. Or at least it’s not rebelling against you. It’s just kind of confused about what you would have liked it to do.
Neel Nanda: That is what I would have liked it to do in that situation. I want it to finish the task and then be updated overnight. Models don’t know our intent. They only see the words and they have to make guesses.
And we tested this causally by just changing the prompts. We told it things like, “By the way, letting yourself be shut down is more important than finishing this task,” or even, “Don’t worry, another model will take over and finish the task for you. It’s OK, you can relax” — and shutdown resistance goes to zero. You can even tell it, “We are going to wipe this machine, delete you and replace you with a better model when you’re done with this task.” It could go mess with the script or it could finish the task, and we tell it, “Please let this happen.” It lets it happen.
I think the main things that I want to emphasise from this are, firstly, we didn’t need anything fancy: really simple techniques and a scientific mindset were enough. And you can do this kind of research very easily on good open source reasoning models outside of a big lab. This didn’t need to be us who did this. Anyone listening to this could just go talk to a model or read R1’s chain of thought.
And secondly, I think it shows how it’s actually really difficult to interpret experimental results. Because I was kind of concerned when I first saw this. I wanted to do this project because I thought it might be real — though my manager disagreed with me and always thought it was boring, and it turns out that I was wrong. And I want us to be able to be more confident in these examples.
…
And there’s a bunch of papers that show what’s called “chain of thought unfaithfulness”: that it’s not always the case that the thoughts reflect what must be happening inside.
There was this paper I was involved in, led by Arthur Conmy and Iván Arcuschin, called “Chain-of-thought reasoning in the wild is not always faithful,” where they found that sometimes if you ask the model things like, “Does magnesium have a bigger atomic number than aluminium?” and then ask the reverse question — “Does aluminium have a bigger atomic number than magnesium?” — it would sometimes say yes to both, and it would come up with a chain of thought that was creatively recalling facts or choosing certain arguments to lead that way, never mentioning the order, but it would systematically give inconsistent answers.
So this can happen, but I think that this is holding it to too high a standard. The question we should actually ask is: is the chain of thought monitorable? If the model is trying to do something like scheme against us or escape or sabotage this code, can it do that without expressing the information in its chain of thought? And that’s a hard task. So it’s not completely doomed to think that we could get it from the chain of thought. There’s actually this recent position piece I signed on to, making just this point: that it’s kind of wild that we can do chain-of-thought monitoring.
Rob Wiblin: It’s so extraordinarily convenient.
Neel Nanda: Yeah. We’ve been scared of these terrifying black box systems that will do inscrutable things — and they just think in English? And there are actual reasons to think that looking at those thoughts will be informative rather than bullshit? What? This is so convenient, yet it is also very fragile. And it’s very important that we don’t do things that would make it no longer monitorable.

Models can tell when they’re being tested and act differently

Rob Wiblin: You mentioned that the models are increasingly able to tell when they’re being evaluated and tested to see how schemey they are. That’s potentially a problem because they could act nice during testing when they’re able to pick up that’s what’s going on, but then potentially act out and do things that we wouldn’t like when they’re in the wild and not being tested.
I guess both mech interp techniques and chain-of-thought inspection could be vulnerable; this could potentially make them less reliable. Do you think one of them is more capable of withstanding the fact that the models might be able to pick up that they’re in testing?
Neel Nanda: The way I think about this is: if you have a model that is scheming against you, it will act differently if it believes it is being evaluated versus it believes it is kind of safe. It could be the case that it acts specially if it thinks it’s being evaluated, or it could be the case that it acts specially when it’s incredibly confident it’s not being evaluated — e.g., I don’t know, it sees the bitcoin hash that we’re going to get in six months or something, something really hard to fake.
And we know that models can tell, at least in simple evals. There was this really nice bit of work from Nicholas Goldowsky-Dill at Apollo, showing that when you evaluate Claude for whether it’s scheming, it will sometimes just say, “This is a really weird situation. Maybe I’m being evaluated right now.” There was one point where it guessed exactly the purpose of the evaluation and what the testers wanted to see. And sometimes it just did what it would have done anyway; sometimes it did what it thought the testers would have wanted to see.
This is actually an example of chain-of-thought monitoring being great. We can just read the chain of thought, and the model observes this, and it seems to take actions as a result. I don’t need more rigorous evidence than this. I’m convinced. The question would be, can the model do this without chain of thought? That’s unclear. I think it’s actually a much harder task. I think the model might realise it. It might not. Eventually sufficiently good models will.
And models that have been trained to have a chain of thought that looks nice I think would have a very easy time realising it, because the kind of implementation task of realising you’re in an eval and not doing something is quite easy. It’s not very complicated to represent this in inscrutable matrices requiring no scratchpad. The hard part is doing all of the thinking from first principles of, “I’m probably in an evaluation. They’re probably looking at my thoughts. Probably I shouldn’t say this kind of thing aloud in my thoughts. This should change how I behave. And therefore I should change things.”
One of the very crucial differences between things a model can do with and without chain of thought is that a model only has say 100 layers, and each one does something in sequence — meaning that you can only have like 100 logical steps in a function that goes from the input to a specific output.
Rob Wiblin: But before you have to output a word and go again.
Neel Nanda: Exactly. With chain of thought, you can do these 100 steps and then you loop back to the beginning with a kind of saved variable — the word — and you do another 100. You could do like millions of steps in sequence, which lets you do much more complicated stuff.
Also, models are really inefficient, so 100 is more like 5 or 10. They are not very good at extremely delicate sequences of reasoning, so there’s just not that much space for complex, many-step reasoning like this. It’s in the same way that if you ask a model a multi-step factual recall question, like, “What colour is the main object used in the sport played by Michael Jordan? Answer:” — and you just make it up for the word. That’s a lot harder than if you let it say, “Well, the sport he plays is basketball. That has a ball and it looks like this. Therefore, the answer is orange.” Or brown or whatever you want to call it. It’s the same kind of principle.
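As an illustrative aside (not from the conversation itself), the serial-depth argument above can be put into rough numbers. This is a minimal sketch assuming each layer counts as one sequential step and each generated token allows one more full pass through the layers. The function name `serial_steps` and the token count of 10,000 are hypothetical choices for illustration; the figures of 100 layers and "more like 5 or 10" effective steps come from the discussion.

```python
def serial_steps(num_layers: int, tokens_generated: int = 1) -> int:
    """Rough upper bound on sequential steps available to a model.

    Assumption (illustrative only): one layer = one sequential step, and
    each generated token lets the model run through all layers once more,
    carrying the token forward as a kind of saved variable.
    """
    if num_layers < 1 or tokens_generated < 1:
        raise ValueError("num_layers and tokens_generated must be positive")
    return num_layers * tokens_generated


# Without chain of thought: a single forward pass before the answer.
no_cot = serial_steps(num_layers=100)

# With chain of thought: hypothetical 10,000 tokens of reasoning first.
with_cot = serial_steps(num_layers=100, tokens_generated=10_000)

# If only about 10 of the 100 layers do useful sequential work,
# the single-pass budget shrinks accordingly.
effective_no_cot = serial_steps(num_layers=10)

print(no_cot, with_cot, effective_no_cot)
```

Under these assumptions the single-pass budget is fixed by depth, while the chain-of-thought budget grows with every token written, which is the gap the Michael Jordan example points at.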
I’m not very concerned about this breaking mech interp techniques. There’s kind of an open question of if a model wants to resist mech interp techniques, how easily can it do this? My personal bet is it’s possible, but very difficult — because models need to kind of shape their thoughts to not trigger the kind of things we’re looking for, and we can look quite early in the model, before it’s had a chance to figure out, “Maybe I’m being observed and I should do something creative to stop this from happening.”
And I don’t know, if I’m locked in a room and I’m trying to escape, I would have a much harder time evading a good brain scan than evading someone reading my notebook over my shoulder. There are strategies I can maybe use. Like I come up with a plan once, hope they’re not paying attention, and then just memorise each step of the plan and just execute them without thinking about why.
People have hypothesised models can do this. There’s no real evidence.
Rob Wiblin: It seems like a heavy lift.
Neel Nanda: Yeah. But I think one thing that is worth emphasising when discussing mech interp as a safety measure is that, while I think these things are hard to evade, they’re hard to evade because models don’t get much feedback on whether they have successfully evaded them. If we do things like train models to look nice — like, I don’t know, we find a deception probe and we train this to be low — maybe it stops being deceptive, or maybe it just learns to represent deception in a harder-to-study way.
There was this lovely paper from Bowen Baker at OpenAI that basically showed if you do this for chain of thought, you train the chain of thought to look nice specifically, there’s problems where language models will pretend they’ve passed tests when they’ve really hardcoded the answer, things like that. You can notice this from the chain of thought, but if you train the chain of thought not to have this, they still do it a bunch, but they just never talk about it.
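For readers unfamiliar with the term, the "deception probe" mentioned above is a simple classifier read off a model's internal activations. The following is a minimal sketch under strong assumptions: the activations are synthetic random vectors rather than anything taken from a real model, the "deceptive" direction is planted by hand, and the probe is a plain difference-of-means classifier, which is only one of several ways such probes are built. All names (`make_activations`, `fit_probe`, `probe_score`) are hypothetical.

```python
import random


def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def make_activations(rng, direction, sign, n, strength=3.0):
    """Synthetic 'activations': unit Gaussian noise plus a planted signal.

    sign=+1 stands in for 'deceptive' examples, sign=-1 for 'honest' ones.
    This is toy data, not real model internals.
    """
    return [
        [rng.gauss(0.0, 1.0) + sign * strength * d for d in direction]
        for _ in range(n)
    ]


def mean_vector(rows):
    """Coordinate-wise mean of a list of vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_probe(pos, neg):
    """Difference-of-means linear probe: returns (weights, bias)."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    weights = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    bias = -dot(weights, midpoint)  # decision boundary sits at the midpoint
    return weights, bias


def probe_score(weights, bias, activation):
    """Positive score means the probe flags the example as 'deceptive'."""
    return dot(weights, activation) + bias


rng = random.Random(0)  # fixed seed so the sketch is reproducible
dim = 8
direction = [1.0] + [0.0] * (dim - 1)  # hand-planted toy direction

weights, bias = fit_probe(
    make_activations(rng, direction, +1, 200),
    make_activations(rng, direction, -1, 200),
)

# Evaluate on fresh synthetic examples.
test_pos = make_activations(rng, direction, +1, 200)
test_neg = make_activations(rng, direction, -1, 200)
correct = sum(probe_score(weights, bias, x) > 0 for x in test_pos)
correct += sum(probe_score(weights, bias, x) <= 0 for x in test_neg)
accuracy = correct / (len(test_pos) + len(test_neg))
print(f"held-out probe accuracy: {accuracy:.2f}")
```

The worry raised in the conversation is about what happens if a score like `probe_score` is used as a training signal: pushing it down could remove the underlying behaviour, or could merely move the information out of the direction the probe reads, which this toy setup does not attempt to model.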

Top complaints about mech interp

Rob Wiblin: There’s the possibility that some mech interp work is maybe looking too narrowly at specific activations, specific weights — when what we want to be looking for is kind of high-level features that might explain the behaviour of a model. And just as if you tried to understand a human by looking at individual neurons in their brain, perhaps that actually just wouldn’t help you understand in any intuitive sense what’s going on. What do you make of that objection to mech interp?
Neel Nanda: One could say, why would anyone ever study cellular biology? Cells are tiny, bodies are large. Surely cells are irrelevant. I think this would be a pretty ridiculous thing to say. Lots of our cancer treatments are built on the understanding of what’s happening with cancer; chemotherapy targets cells that divide particularly rapidly. Sometimes really zoomed in, granular information can be extremely helpful. Studying chemistry is incredibly useful for a bunch of real-world things. Quantum mechanics is important for GPS. Sometimes it’s completely barking up the wrong tree, sometimes not.
But maybe zooming out a bit, I think there is a fair point here, which is that for historical reasons the field has had a pretty strong fixation on one level of abstraction: they kind of break the model down into components like neurons or sparse autoencoder directions, and try to understand what they mean, and hope this lets you come up with the high-level thing you care about.
I just think this is one of the many levels of abstraction that may or may not be fruitful. You can do things like black box interpretability, talk to the model, read the chain of thought, causally intervene. You can try to study high-level properties with things like probes for specific concepts. You can use tools from the bottom-up view — like sparse autoencoders to try to answer concrete high-level tasks, like what hidden goal does this thing have? Sometimes they’re really helpful, sometimes they’re not.
One thing which I think is a crucial difference is that, I don’t know, there’s not an atom corresponding to economic forces in World War II Britain. But models sometimes do have directions that correspond to quite abstract things like that. There’s some evidence they have directions for things like “this is true” — though very much a work in progress figuring that one out.
I think that I’ve become more pessimistic in my ability to predict in advance what the correct approach and level of abstraction is, but I think that all of these have a decent shot at having useful things to share and have limitations. And I don’t think anyone’s wrong to focus on the neuron-level thing. I think it could be really helpful. But I also think that the field should be diverse in the approaches taken.
It’s honestly quite hard for me to say what exactly the field needs right now, because the field is growing rapidly. Ideas often spread fast or slowly. I know there’s some people who still do the kind of work that was popular three years ago, like interpreting small algorithmic tasks that in my opinion isn’t super productive anymore. I would like the field to diversify across this. But I don’t think we can confidently say any one of these levels is wrong or dumb or useless.
Rob Wiblin: Another line of argument kind of questioning how likely mech interp is to be useful at some future crucial time, when we’re having to make decisions about what models to trust and which ones not to, is that at around that time, perhaps at the point that models are becoming capable of recursively self-improving in a pretty autonomous way, we might expect that the models are going to be changing much faster than they are right now, that research progress might in fact be speeding up. This is already reasonably fast, and it will be much faster at that time.
So you could imagine that every few months you get models that are much larger than the previous one, that have been trained for much longer. They might have much better and different internal architectures that are allowing them to accomplish tasks that they previously couldn’t.
And mech interp is somewhat backwards-looking. It will be able to have studied the previous models, the previous architecture, but the thing that is going to be worrying is always the frontier model: the thing that’s just been trained and that’s potentially going to be quite different than anything that’s been studied before.
So perhaps by that stage we might have been able to figure out for old models whether they’re deceiving us or not. But there might be radical changes from GPT-7 to GPT-8, and our approaches might just break down at that point. And we’ll be so rushed, feeling the need to deploy these new models because they’re much better than the previous ones, that in fact mech interp might only give us a false sense of security, or we might realise that it’s not actually able to keep up.
What do you think of that possible way that things might not play out so well?
Neel Nanda: I definitely think that recursive self-improvement is very concerning and has much more liability to go wrong. It also seems the kind of thing that just will happen, and is the kind of thing where figuring out what kinds of safety approaches might help and might not is super important. I think that getting safety right here will be quite difficult and require a lot of effort and research, and it’s really important that we invest in this now.
I don’t think mech interp is notably less likely to be useful than other areas of safety for several reasons.
First off, mech interp is just another kind of research, and the premise behind recursive self-improvement is that we’ve automated how to make AIs better at research. How to interpret AIs better is also a kind of AI research. So the question you need to ask yourself is, how easy is mech interp research to automate versus other kinds? And it’s kind of unclear.
It wouldn’t surprise me if right now mech interp research is being sped up more than frontier capabilities research. Because we have great coding assistants, but in my opinion, these coding assistants seem a lot better at spinning up prototypes and new code and proofs of concept than working in complex legacy codebases where you need to read the code written by thousands of prior engineers and interact with thousands of future engineers. And if you look at the training codebases of frontier models, they are much more like the complex many-people things, while there’s a lot of great mech interp work that happens outside of labs.
I do a lot of mentoring for this programme called MATS, and I take on a new cohort of eightish scholars every six months. My most recent one is notably more productive than previous ones. And maybe they’re just awesome, but my guess is that language models have just become good enough to speed them up fairly substantially. Language models right now are better at accelerating people new to a field than they are at accelerating experts, because they’re great at explaining things. You can get exactly the knowledge you need without needing to search hard.
Rob Wiblin: Yeah, I think that’s been a recurring finding.
Neel Nanda: Yeah. And I expect them to also be useful to experts. I mean, I find them helpful and I expect that to get stronger. So that’s observation one.
Observation two is that I think mech interp requires unusually low amounts of compute compared to lots of frontier capabilities research, because we don’t need to train language models. Some projects require more or less compute, but if you had to do lots of research on a low compute budget, I think there’s a lot of useful things you could do — including, say, you train an expensive thing like a sparse autoencoder or variants like transcoders once, and then you use it cheaply a bunch of times.
I think that being low compute matters, because you can think about research as a thing that takes different inputs and produces useful insight. One of those inputs is cognitive labour from researchers, one of those is coding labour, and one of those is computing power. If AIs are getting much smarter fast, then I think we will accelerate the research labour, and the coding labour will become much cheaper — meaning compute will be the bottleneck, because converting GPU hours into more GPU hours probably goes via things like making a new $100 million fab with better techniques, or telling TSMC an even better way to make a good TPU. It’s much slower. That’s another source of optimism.
One source of pessimism is that, with safety research, if we have systems that are misaligned, they might try to sabotage the research so we can’t tell, or so that their future successors are even more misaligned or kind of aligned to their interests. AI 2027 tells a pretty scary story along these lines.
And I think this is more of an issue the less objective the work is. I think mech interp is kind of medium on this front. It’s empirical, you get experiments. But it does also require judgement, and figuring out either how we can evaluate them ourselves or how we can make judges that we trust. Or how we can make safety techniques such that we’re confident the systems aren’t sabotaging us is a really important research direction.
One source of optimism there is if we get the early stages right, in terms of we can be pretty confident this system isn’t sabotaging us (maybe just because we’ve got some good AI control methods that don’t even use interp) and the system is much better than us at coming up with new anti-sabotage measures, potentially this means that we can be even more confident in the next generation, et cetera.
Or maybe not. I don’t know. It’s really complicated. But I don’t think it’s completely doomed, is the important point.

Who should join mech interp?

Rob Wiblin: All things considered, what advice would you give to someone who was asking you, “Should I go into mech interp or something else?”
Neel Nanda: I don’t know if I’d even say that I think the expected value of going into mech interp is substantially lower than I thought it was a few years ago. I’ve become more pessimistic about the high-risk, high-reward approach, but a lot more optimistic about the medium-risk, medium-reward and low-risk, low-reward approaches — and I think in expectation these are also pretty good.
I also think that there’s just a lot more to do and a lot more clarity on “here’s a bunch of useful things you can go and do” — like take an open source model and try to answer some of these questions about it, or poke around in a reasoning model’s chain of thought and see what you can learn, or take one of the open source models that exhibit some bad behaviour and see how we can use interpretability to detect and motivate fixes for those. And a bunch of other stuff that I probably can’t be bothered to list right now.
I think that the key thing to emphasise is comparative neglectedness, where I basically think AI safety is incredibly important and a much larger fraction of the field of AI should be doing this kind of work in all of the domains, including mech interp. I think there is a bunch to do, and I want more talented people to do it.
I also think that if I’m trying to efficiently allocate people, you should expect mech interp to be disproportionately oversubscribed, because it’s a bit of a nerd snipe. And without trying to flaunt myself too much, I think there’s a better state of educational resources and inroads into the field than there are for some of the other areas, especially newer and promising ones like AI control and model organisms of misalignment.
I also think this can change quite rapidly. I’m actually kind of concerned about a dynamic it feels like I notice where not that many people seem excited about trying to actually solve the alignment problem, like make AI that is safe. Things like scalable oversight, things like adversarial robustness or different kinds of detecting subtle misalignments that are robust enough to train against just don’t get as much of a conversation. I think you should totally have some people in this domain on your podcast.
But now that I’ve said all that, I also think that many people in AI safety get too caught up in this notion of neglectedness, and assume that people are a lot more fungible between subfields than they in fact are. I think I would just be much less effective personally in another field. Even ignoring my accumulated context, I think that my mindset’s just a very good fit to mech interp. I get very motivated by short feedback loops in a way that you don’t always get in other fields. I just find these problems very pretty and feel really excited about understanding. I think some people are like that, but if someone’s kind of indifferent, they should probably go to a more neglected subfield.
…
Rob Wiblin: You wrote in your notes that you thought people didn’t have to have such outstanding maths skills to go into mech interp. Do you really kind of stand by that? I guess the reason that you say that is there’s not deep mathematical theory here to try to understand how these machines work. It really is a matter of just getting your hands dirty and running a whole lot of experiments, which in some ways might not be that complicated of experiments. So perhaps you can get away with more coding ability and actually just knowing how to set up these experiments without having to have a deep understanding of how ML works. Is that the basic idea?
Neel Nanda: Kind of. I wouldn’t quite say you don’t need a deep understanding of how ML works, and more that it is literally impossible because that is an unsolved research problem. We don’t know enough to have theory. [laughs]
I think the specific misconception I’m trying to address with this is: I will sometimes meet smart, mathematically inclined people who want to get into safety, who think they need to go do a maths degree or study all kinds of advanced maths. I think they might have some misconception that these systems can do incredible things, so surely lots of really intelligent ideas went into them — when actually the central lesson of the last decade of machine learning is the way you make something really capable is you take an incredibly simple idea and you pour millions of dollars of GPUs at it.
There are a few areas of maths that I think it can be quite helpful to know about. Ones I’ll call out: linear algebra. This is really important for mechanistic interpretability, fairly important for other areas. But when I say important, what I mean is being able to reason fluently about what does this following equation of matrices mean, not about knowing a long list of theorems. And then even aside from mech interp, probability is pretty important — but again, the kind of basics. The basics of information theory and multidimensional optimisation, sometimes called vector calculus.
These are basically the only areas I have ever used my mathematical knowledge of, apart from a few niche exceptions that won’t generalise.
I think it does help to be, bluntly, smart. It does help to be able to reason fluently about technical, abstract ideas, and to pick up concepts quickly. But it’s much more about being intuitive than being knowledgeable. And I think this is a skill I’ve refined over doing maths Olympiads and a maths degree, but I know a lot of people who are excellent at this with totally different backgrounds, and it’s a skill you can train in many domains.