Spend your alignment budget on incentives before oversight

Cheng Qian

I used an LLM to help draft this post and it likely contains >10% AI-generated text, but I've edited/rewritten it extensively and endorse it.

My dog has a fixed amount of energy to spend finding a hidden treat. He can sniff toward it, walk left, walk right, or sit and wait. None of that is free. Energy spent sniffing is energy not spent walking, so he has a budget and he has to divide it up.

Suppose you want to say how well he's using that budget. You need two things: how much each action actually helps him reach the treat, and how he's splitting his effort across the actions. Call the first the goal-weights k, the second the split e. The value he gets out is V = Σ kᵢ ln eᵢ. The best split is the obvious one. Put effort into each action in proportion to how much it helps: eᵢ* = E·kᵢ/K, where E is the total budget and K = Σ kᵢ is the sum of the goal-weights. Sniff a lot if smell tells you a lot. Don't spend energy walking in a direction that tells you nothing.
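
A minimal sketch of that split, assuming four actions with made-up goal-weights (the numbers and the function names `value` and `best_split` are illustrative, not from the longer paper). It computes V = Σ kᵢ ln eᵢ for the proportional split eᵢ* = E·kᵢ/K, with E the total budget and K the sum of the goal-weights, and compares it with splitting the budget evenly.

```python
import math

def value(k, e):
    """V = sum_i k_i * ln(e_i) for goal-weights k and effort split e."""
    return sum(ki * math.log(ei) for ki, ei in zip(k, e))

def best_split(k, budget):
    """Proportional split e_i* = E * k_i / K, where K = sum(k)."""
    total = sum(k)
    return [budget * ki / total for ki in k]

# Hypothetical goal-weights for: sniff, walk left, walk right, sit and wait.
# All weights are kept positive so that ln(e_i) stays defined.
k = [5.0, 2.0, 2.0, 1.0]
E = 10.0  # total energy budget (arbitrary units)

e_star = best_split(k, E)          # effort proportional to usefulness
e_even = [E / len(k)] * len(k)     # same budget, split evenly

print("proportional split:", e_star, "V =", value(k, e_star))
print("even split:        ", e_even, "V =", value(k, e_even))
```

Both splits spend the same budget. The proportional one scores higher, and shifting a little effort from any action to another lowers V again.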

Two things drop out of this that I didn't expect to be so clean.

There's a ceiling. If the dog can't smell or see the treat at all, running harder does nothing. He can't beat chance. The most value he can possibly create is set by how much his senses tell him about where the treat is, and nothing he does with his legs raises that. Written out it's ΔG ≤ I(X;Y): value created is bounded by the mutual information between the world and what the agent perceives of it. Perception sets the ceiling. Effort doesn't.
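
To put a number on that ceiling, here is a small sketch that computes I(X;Y) for a toy version of the dog's nose. The setup is an assumption made for illustration: the treat is left or right with equal probability, and the nose reports the correct side with probability `accuracy`.

```python
import math

def mutual_information(joint):
    """I(X;Y) in nats from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal over the world X
    py = [sum(col) for col in zip(*joint)]    # marginal over the percept Y
    mi = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log(pxy / (px[x] * py[y]))
    return mi

def sensor_joint(accuracy):
    """Hypothetical sensor: treat is left/right 50-50, the nose is right
    about the side with probability `accuracy`."""
    return [[0.5 * accuracy, 0.5 * (1 - accuracy)],
            [0.5 * (1 - accuracy), 0.5 * accuracy]]

for acc in (0.5, 0.75, 1.0):
    print(f"accuracy {acc}: I(X;Y) = {mutual_information(sensor_joint(acc)):.4f} nats")
```

At accuracy 0.5 the nose carries no information and the ceiling is zero, however hard the dog runs. At accuracy 1.0 the ceiling is ln 2 nats, one full bit about which side the treat is on.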

And confidence can put you below zero. A dog who is sure the treat is in the left corner, and digs there hard, when it's really on the right, spends his whole budget and ends up worse off than a dog who never moved. Confident error is negative value. There's a term for that gap between what he believes and what's true, D(q‖p), and it comes straight off the same ledger.
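
The same point in the Kelly betting picture, as a hedged sketch. It assumes two outcomes paid at fair odds, so that hedging evenly earns exactly zero and stands in for the dog who never moved. The probabilities are made up, and the argument order in `kl(truth, belief)` follows the standard Kelly accounting, which is an assumption about what D(q‖p) denotes here.

```python
import math

def growth_rate(truth, belief):
    """Expected log-growth per round when a fraction belief[i] of the budget
    is staked on outcome i and each of the n outcomes pays n-for-1."""
    n = len(truth)
    return sum(p * math.log(n * b) for p, b in zip(truth, belief) if p > 0)

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

truth = [0.2, 0.8]             # hypothetical: the treat is usually on the right
hedged = [0.5, 0.5]            # no opinion: earns exactly zero at fair odds
confident_wrong = [0.95, 0.05] # sure it's on the left

best = growth_rate(truth, truth)
print("calibrated:     ", best)
print("hedged:         ", growth_rate(truth, hedged))
print("confident wrong:", growth_rate(truth, confident_wrong))
print("best - KL:      ", best - kl(truth, confident_wrong))
```

The calibrated bettor grows, the hedger stays flat, and the confidently wrong bettor shrinks. The last two printed lines agree: the shortfall from the best achievable rate is exactly the divergence between truth and belief.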

Here's the part that matters for AI. Nobody asked the dog what he wants. "Get the treat" is a goal the trainer put there. The world tells the dog what is the case; it never tells him what to want. You can't read his goal off his behavior and you can't read it off the world. It has to come from outside.

The move behind all of this is the one Shannon made for information: throw away the meaning and keep the part that obeys laws. Value, with morality and price and psychology stripped off, is just the rate at which an agent turns a resource into goal-progress, measured against its own goal. That's the whole definition. Everything above is what falls out of it.

The point: where to spend

Now go from one dog to a population of agents. Many of them, all spending resource, each with a slightly different goal. Two forces act on the population's average goal.

One is whatever you do to keep them in line. Monitoring, catching bad behavior, correcting, retraining. Call its strength γ. The other is selection. Whichever goal happens to capture more resource spreads, and it spreads regardless of whether it's the goal you wanted. Call the strength of that pull g.

Write both forces down and you get a plain differential equation for the population's average goal. That average doesn't settle at the target you wanted. It settles a fixed distance away:

offset = ‖Vg‖ / γ

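Goal-spread times reward-pull, over how hard you're correcting. Here V is the spread of goals across the population (not the single dog's value V from earlier), g is the selection pull, and γ is the correction strength. You have three knobs, and they are not equally good.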

You can correct harder, raise γ. To halve the offset you double your correction effort, and you pay that bill forever, because the selection pull is still there. You're bailing a boat without patching the hole.

You can narrow the goals, shrink V. That lowers the offset too, but now your agents are all the same, and you've thrown away the variety that made the population good at anything.

Or you can change what pays, drive g → 0. Make off-target goals stop getting rewarded. Do that and the offset goes to zero for any γ and any V. You patched the hole.
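
The three knobs can be tried out in a toy simulation. The differential equation is not written out above, so this sketch assumes the simplest linear form consistent with the offset formula: dx/dt = Vg − γx, where x is the average goal's displacement from the target. The matrix V, the vector g, and the helper names are all hypothetical.

```python
import math

def mat_vec(M, v):
    """Matrix-vector product for plain nested lists."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def settle(V, g, gamma, dt=0.01, steps=20000):
    """Euler-integrate the ASSUMED dynamics dx/dt = V g - gamma * x from x = 0
    and return ||x|| at the end, i.e. the steady-state offset."""
    drift = mat_vec(V, g)
    x = [0.0] * len(g)
    for _ in range(steps):
        x = [xi + dt * (di - gamma * xi) for xi, di in zip(x, drift)]
    return norm(x)

V = [[1.0, 0.2], [0.2, 0.5]]  # hypothetical spread of goals in the population
g = [0.3, 0.4]                # hypothetical selection pull (what pays)
gamma = 2.0                   # correction strength

predicted = norm(mat_vec(V, g)) / gamma
print("formula   ||Vg||/gamma:", predicted)
print("simulated offset:      ", settle(V, g, gamma))
print("double gamma:          ", settle(V, g, 2 * gamma))
print("halve V:               ", settle([[v / 2 for v in row] for row in V], g, gamma))
print("g = 0:                 ", settle(V, [0.0, 0.0], gamma))
```

Under these assumed dynamics, doubling γ or halving V each halves the offset, while setting g to zero removes it without touching γ or V.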

So at a fixed budget, money spent making the wrong thing not pay beats money spent catching the wrong thing after it already paid. Incentive design over oversight. Train the reward, don't just nag.

What's mine and what isn't

Most of this is not mine, and I'd rather say so than have you find out in the comments.

The single-agent math is Kelly's, from 1956. Bet in proportion to your beliefs and your wealth grows at an information rate. Swap "wealth" for "the dog's progress" and you have everything in the first half. I'm not claiming it.

The fact that you can't recover an agent's goals from its behavior is not mine either. Armstrong and Mindermann proved it in 2018 as a no-free-lunch result, and they were the ones who called it Hume's is-ought gap. I'm using their result.

The control math is undergraduate. ‖Vg‖/γ is the steady-state error of a Type-1 controller following a ramp, and γ > λ_max is high-gain stabilization. A controls person would point this out in one sentence, so I'm pointing it out first.

What I'm actually claiming is narrower. Goal-drift under selection is a ramp-tracking problem, and once you see it that way the incentives-over-oversight ordering is forced and comes with a number on it. That's an application of old tools, not a new theorem. The longer paper has a couple of things I haven't found in prior work, mainly a ceiling on how much value a whole fleet of agents can produce together (it's bounded by the entropy of the world they're working in), and a clean split between value, which depends on each agent's goal, and price, which doesn't. Those might be new. I'd rather be corrected than oversell them.

What I don't get to claim yet

This is a mean-field result. It's a statement about the average over a population, and a small group can drift off even when the average is stable.

The empirics are honest and incomplete. A toy reproduces the offset formula almost exactly, but that's circular, since the toy is built from the same equations. The real test was on actual language-model agents, pre-registered, and it came back underpowered. The design saturated before the scaling became visible, so it's not a confirmation and not a refutation. I'm telling you that instead of hiding it. If the offset turns out not to scale as ‖Vg‖/γ on real agents, the dynamical layer is wrong, and that's the experiment I'd want someone to run.

If you work on aligning systems built out of many agents, the one thing to take away is the ordering. Shape what gets rewarded before you build more ways to watch. The full writeup, with the proofs and the negative result, is here: https://doi.org/10.5281/zenodo.20487041. There's also a small browser tool that runs the measure on any agent's recorded outputs, if you want to try it on something of your own: https://value.macrokit.dev/tools/value-meter/
