
There are a couple of explanations of mesa-optimization available. Rob Miles' video on the topic is excellent, but I think the existing written descriptions don't make the concept simple enough to be understood thoroughly by a broad audience. This is my attempt at doing that, for those who prefer written content over video.

Summary

Mesa-optimization is an important concept in AI alignment. Sometimes, an optimizer (like gradient descent, or evolution) produces another optimizer (like complex AIs, or humans). When this happens, the second optimizer is called a 'mesa-optimizer'; problems with the alignment (safety) of the mesa-optimizer are called 'inner alignment problems'.

What is an optimizer?

I'll define an optimizer as something that looks through a 'space' of possible things and behaves in a way that 'selects' some of those things. One sign that there might be an optimizer at work is that weird things are happening, by which I mean 'things that would be very unlikely if the system were behaving randomly'. For instance, humans do things all the time that would be very unlikely to happen if we behaved randomly, and humans would certainly not exist at all if evolution worked by just making new organisms with totally random genes.

The human brain

The human brain is an optimizer -- it looks through the different things you could do and picks ones that get you things you like. 

For instance, you might go to the shop, get ice cream, pay for the ice cream, and then come back. If you think about it, that's a really complex series of actions -- if you behaved completely randomly, you would never ever end up with ice cream. Your brain has searched through a lot of possible actions you could take (walking somewhere else, dancing in your living room, moving just your left foot up 30 degrees), worked out that almost none of them will get you ice cream, and then selected one of the very few paths that will get you what you want.
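If it helps to see the idea concretely, here's a minimal sketch in code of what 'searching a space and selecting' means. This is my own toy example, not anything from the research literature: the action names and the goal check are invented purely for illustration.

```python
import random

# A toy 'space' of possible actions.
ACTIONS = ["walk to shop", "pick up ice cream", "pay", "walk home",
           "dance in living room", "lift left foot 30 degrees",
           "walk somewhere else", "stare at wall"]

# The one sequence of actions that actually gets you ice cream.
GOAL = ["walk to shop", "pick up ice cream", "pay", "walk home"]

def achieves_goal(plan):
    """Did this sequence of actions get us ice cream?"""
    return plan == GOAL

# Behaving randomly: pick 4 actions at random. The chance of hitting the goal
# is (1/8)^4 -- roughly 1 in 4,000 -- so it essentially never happens.
random_hits = sum(
    achieves_goal([random.choice(ACTIONS) for _ in range(4)])
    for _ in range(10_000)
)

# An optimizer (in the loose sense above): look through lots of candidate plans
# and *select* one that achieves the goal.
def optimize(tries=100_000):
    for _ in range(tries):
        plan = [random.choice(ACTIONS) for _ in range(4)]
        if achieves_goal(plan):
            return plan
    return None

print(f"Random behaviour got ice cream {random_hits} times in 10,000 attempts")
print("Optimizer's selected plan:", optimize())
```

The point is just the contrast: random behaviour almost never produces the outcome, while even a crude search-and-select process reaches it reliably.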

Evolution

A stranger example of an optimizer is evolution. You've probably heard this already -- evolution 'optimizes inclusive genetic fitness'. But what does this mean? 

Organisms change randomly -- their genes change due to mistakes in the process of making new organisms. Sometimes, that change helps the organism to reproduce, by allowing it to survive longer, by making it more attractive, etc. When that happens, the next generation has more of that change in it. Over time, lots of these changes accumulate to make very complicated systems like plants, animals, and fungi.

Because there end up being more of the organisms that reproduce more, and fewer of the ones with changes that make them reproduce less (like genetic diseases), evolution 'selects' from the space of all the mutations that happen. If you made an organism with a completely random genome, it would certainly die immediately (or rather, it would never be alive to begin with).

AI Training

When an AI is trained, it's usually done by 'gradient descent'. What is gradient descent?

First, we define something that says how well an AI is doing (the 'loss function'). This could be very simple (1 point every time you output 'dog') or very complex (1 point for every time you say something that is an English sentence that makes sense as a response to what I typed in). 

Then, we make a random AI -- one that's basically just a set of random numbers connected to one another. This AI, predictably, does very badly -- it outputs 'dska£hg@tb5gba-0aaa' or similar. We run it a lot of times, and see when it comes a bit closer to outputting 'dog' (say, with a 'd' as the first letter, or outputting close to 3 characters). Then, we use a mathematical algorithm to figure out which of those random numbers pushed the AI closer to what we want, and which were very far from the right value -- and we move them very slightly in the right direction. Then we repeat this a lot until, eventually, the AI consistently outputs 'dog' every time.[1]

This process is strange, but it's much easier to figure out the direction the numbers should change in than to figure out from the beginning exactly what the right numbers are, particularly for more complicated tasks. 

Again, we can see that this process is an optimizer -- the random AI does nothing interesting, but by pushing the numbers around in the right direction, we can do very complicated tasks which would never happen at random. And this is happening because we look at a very large 'space' of possible AIs (with different numbers in them) and 'move' through it towards an AI that does what we want.
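For readers who like to see the shape of this loop in code, here's a minimal sketch of the 'output dog' example. This is my own toy code, not how any real training library works: it just scores the output, estimates which direction each number should move, nudges it slightly, and repeats.

```python
import random

TARGET = [ord(c) for c in "dog"]                    # the characters we want: 'd', 'o', 'g'
params = [random.uniform(0, 256) for _ in TARGET]   # the 'random AI': just random numbers

def loss(p):
    """How far the output is from 'dog' (lower is better)."""
    return sum((x - t) ** 2 for x, t in zip(p, TARGET))

def decode(p):
    """Turn the numbers back into characters so we can see what the 'AI' outputs."""
    return "".join(chr(max(0, min(255, round(x)))) for x in p)

learning_rate = 0.01
for step in range(500):
    # For each number, estimate which direction reduces the loss (a crude
    # nudge-and-check stand-in for the calculus real training uses),
    # then move it slightly in that direction.
    for i in range(len(params)):
        eps = 1e-4
        bumped = params.copy()
        bumped[i] += eps
        gradient = (loss(bumped) - loss(params)) / eps
        params[i] -= learning_rate * gradient

print(decode(params))   # after enough steps, this prints 'dog'
```

Real training uses calculus (backpropagation) rather than the nudge-and-check estimate here, and real models have billions of numbers rather than three, but the loop has the same shape: score, find the right direction, move slightly, repeat.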

What is a mesa-optimizer?

To understand mesa-optimization, we'll return to the evolution analogy. We saw two examples of optimizers -- evolution, and the human brain. One of these created the other! This is the key to mesa-optimization.

When we train AI, we might -- just like evolution -- find AIs which are optimizers in their own right, i.e. they look over spaces of possible actions and take ones that give them a good score, similar to the human brain. In AI, this second optimizer is referred to as the 'mesa-optimizer'. The mesa-optimizer is the trained AI itself, while the outer optimizer is the process we used to train that AI (gradient descent). Note that some AIs -- especially simple ones -- may not count as mesa-optimizers, because they don't display behaviour complex enough to qualify as optimizers in their own right.

Image shamelessly 'borrowed' from Rob Miles' video on mesa-optimizers

Why is this a problem?

If we have two optimizers, we now have two problems with getting AI to behave well (two 'alignment problems'). Specifically, we have an 'outer' and an 'inner' alignment problem, relating to gradient descent and the mesa-optimizer respectively. 

The outer alignment problem is the classic problem of AI alignment. When you make an optimizer, it's hard to make sure that the thing you've told it to do is the thing that you actually want it to do. For example, how do you tell a computer to 'write coherent English sentences that follow on sensibly from what I've written'? Defining complex tasks formally as a loss function can be very tricky. This can be dangerous with some tasks, but discussing that will take too long for this post.

The inner alignment problem is a bit trickier. We might define our loss function perfectly for what we want it to do, and have an AI that still behaves dangerously, because it's not aligned with the outer optimizer. For instance, if gradient descent finds an algorithm that optimizes 'pretend to behave well to get a good score while I'm being trained, so I can do what I actually want later on' ('deceptive alignment'), this will get an excellent score on our training dataset while still being dangerous. 

Examples of inner misalignment

Humans

For our first example, we'll return to our evolution analogy one more time. Evolution's search for things that are very good at surviving and reproducing eventually produced brains -- a kind of mesa-optimizer. Now, humans are becoming misaligned with evolution.[2]

Because evolution is quite an odd type of search (like gradient descent), it couldn't put the concept of 'reproduction' into our brains directly. Instead, we developed a bunch of simpler drives -- ones to do with genital friction, with the complexities of relationships, and so on.

Now, humans do things which very effectively satisfy those simpler drives but are terrible for inclusive genetic fitness, like masturbating, watching porn, and not donating to sperm banks. This shows that evolution has an inner misalignment problem.

Evolving to Extinction

But humans are still pretty well aligned with evolution -- our population is enormous compared to that of other apes. To show how bad evolution's inner misalignment can get, let's look at the Irish elk. The Irish elk (probably) evolved to extinction. But isn't that the opposite of how evolution works? How does this happen?

A museum specimen of an Irish elk skull, complete with trademark huge antlers

Sexual selection is when animals evolve to be attractive to mates, rather than to be better at surviving. This can be good for evolution -- honestly showing potential mates that you're strong, fast, well-fed, etc. can be a good way to ensure that those likely to survive reproduce more. For instance, growing large antlers can be a sign that you're well-fed and able to provide plenty of nutrition to those antlers, or that you're able to fight well to defend yourself. 

However, sexual selection can also go wrong. Having large antlers is great -- up to a point. Once they get too large, you might not be able to move your head well, you might get caught on trees, and you waste a lot of key resources. But if females love great antlers, males with huge antlers reproduce a lot, even if only a few of them survive to adulthood. Soon, all the males have huge antlers and are struggling to survive. Even if this doesn't cause extinction directly, it can contribute by weakening the population. This shows that evolution has a bad inner alignment problem.

Hiring Executives

A business person, because I suspect you're all drifting off by now

When hiring executives, the shareholders of a company face two problems. 

The first problem is an outer alignment problem: how do they design a hiring process and incentive scheme such that the executives they hire are motivated to act in the best interests of the company? This problem seems genuinely hard -- how do they stop executives chasing the company's short-term interests to earn bonuses and look good, at the cost of doing long-term harm to the company (by which point they've likely moved on to another company to do the same thing[3])?

The inner misalignment problem here comes from the fact that the people being hired are optimizers themselves -- and may have very different goals from those of the shareholders. If, for instance, there are potential executives who want to harm the company[4], they're strongly motivated to perform well in the hiring process. They may even be motivated to perform well on metrics and incentive schemes in order to stay on and continue doing subtle damage to the company, or to shift the company's focus towards or away from certain areas, costing the shareholders money in ways that are difficult to measure.

Conclusion

I'd like to note that the borders between outer and inner misalignment are quite fuzzy, and experienced researchers can sometimes struggle to tell them apart. Additionally, you can have inner and outer misalignment at once. 

Hopefully this helps you more thoroughly understand inner misalignment. Please say in the comments if you have questions, or feedback on my writing. 

Some questions to check you understood:

  • What is an example of a mesa-optimizer? How is it different from other kinds of optimizers?
  • What is an 'outer alignment problem'?
  • What is 'deceptive alignment'?