3 levels of threat obfuscation

This is the third in a sequence of posts taken from my recent report: Why Did Environmentalism Become Partisan? Summary Rising partisanship did not make environmentalism more popular or politically effective. Instead, it saw flat or falling overall public opinion, fewer major legislative achievements, and fluctuating executive actions. Public Opinion...

GWWC's 2025 impact evaluation (executive summary)

Aidan Whitfield🔸, Giving What We Can🔸·2d ago·2m read

This post presents the executive summary from Giving What We Can’s impact evaluation for 2025. At the end of this post we share links to more information, including the full report and...

Recent opportunities to take action

RP is looking for project founders in neglected animal areas

Rethink Priorities·4h ago·7m read

Time Sensitive Do Gooding Opportunities

Bentham's Bulldog·5h ago·5m read

146

Possible mistake EAs are making and shout out to Pause AI UK

Michelle_Hutchinson·1w ago·4m read

Training game seems very likely by default, deceptive alignment much less so, gradient hacking still less so

This is pretty superficial and mostly based on a couple of conversations with Paul Christiano. I’m hoping he’ll write a better version of this at some point.

The “training game” mode of threat obfuscation seems like it is very likely by default, because a normal AI training process will probably directly incentivize it. Errors in labeling will directly favor the training game over intended generalization; even with perfect labeling, there’s no particular reason to think that the training game is disfavored compared to the intended generalization (if anything it vaguely/intuitively seems simpler - “do that which gets treated as intended behavior” as opposed to “do that which was intended.”)

By contrast, deceptive alignment seems to face a bigger hurdle. For example, if you train an AI with positive reinforcement for getting cookies:

A deceptively aligned AI has to, every time it’s deciding how to get cookies, go through a “thought process” like: “I am aiming for [thing other than cookies]. But humans want me to get cookies. And humans are in control right now [if not I need to behave differently]. Therefore I should get cookies.”

Contrast this with an AI that just responds to rewards for cookies by going through this thought process: “I’m trying to get cookies.”

The former “thought process” could be noticeably more expensive (compute wise or whatever), in which case heavy optimization would push against it. (I think this is plausible, though I’m less than convinced; the former thought process doesn’t seem like it is necessarily much more expensive conditional on already having a situationally aware agent that thinks about the big picture a lot.)

An additional issue is that deceptive alignment only happens if you get inner misalignment resulting in an AI with some nonindexical “aim” other than in-episode reward. This could happen but it’s another conjunct.

Paul thinks it’s <50% likely (clarification in footnote^[3]) that deceptive alignment is going to be a major issue by default (at least if people make reasonable efforts to avoid the sorts of dynamics discussed in the next section), and other folks whose judgment I’d put a lot of weight on (due to seeming fluent with the concern about deceptive alignment, while also experienced and/or accomplished with ML) seem to generally be in the same camp.

Gradient hacking intuitively seems harder still. The preconditions for it seem to be something like “the preconditions for deceptive alignment, plus the AI figuring out some effective maneuver to execute with the design of its own brain.”

How bad is the situation if straightforward training leads to the training game, but not to deceptive alignment or gradient hacking?

At first glance, this situation doesn’t seem too bad in terms of threat obfuscation. It basically means that measuring an AI’s dangerousness is as simple as “Put it in an environment where doing malicious things is good in-episode performance. A possible example: if it does malicious things to gain control of its environment, simply let it control its environment for a while and don’t do any of the things one would normally do for unintended behavior (such as trying to use gradient descent to correct against it).” This could result in a dangerous, situationally aware AI (e.g., one that has learned a pattern "Do malicious things iff they won't be caught and trained against") going ahead and doing malicious things while being tested, with no particular reason to conceal them.

However, it’s pretty easy for this to go off the rails if people aren’t careful about it. I’d be concerned about a dynamic like this:

AIs frequently have opportunities to behave maliciously and gain resources within an episode. These opportunities look good on their reward models. They take these opportunities.

Humans respond by trying to train out this sort of behavior - e.g., finding incidents like this via audits and news stories, and going back and doing major gradient updates against the corresponding behaviors/reward models, long after the fact.

In so doing, they could effectively be training AIs to build longer-term, more careful models of human reactions to their behavior. For a situationally aware AI, training like this could lead to behavior like “Try to accumulate resources and do things associated with high reward, unless humans might catch it later via an audit; hence, look for opportunities to avoid or prevent or manipulate eventual audits.”

This sort of behavior could end up looking a lot like deceptive alignment: optimizing heavily around factors and events well outside the “episode” as traditionally defined (you could also think of this as effectively extending the “episode” to encompass a very long time period during which humans might do audits and respond to what they see).

To avoid this, we need to be careful with how we handle unexpected incidents of AIs’ behaving maliciously. It won’t always be the right move to straightforwardly “train against” such incidents; it may sometimes be important to do something more like “starting over,” e.g. insisting on a more principled way^[4] of making AIs less prone to malicious behavior.

This situation seems scary:

We won’t necessarily be able to come up with a principled (and competitive) way of driving out malicious behavior.

At least some AI developers (and maybe many) will be doing non-principled things (e.g., the problematic “train out the malicious behavior” thing above); saying “Behold, our systems are safe!”; and wanting to deploy them.

The line between principled and non-principled things doesn’t feel super clear.

But this situation still seems much better than having to deal with deceptive alignment: we do have a tangible, non-interpretability-dependent way of detecting danger if we can be disciplined about it! In particular, a successful “evals” project looks very tractable in this world - you’re basically running the “make rewards available for bad behavior” test, and then auditing training for signs that it’s been explicitly trained against.

I'm not saying this situation is the only one we should plan around. But it is an example of how relatively good luck could put us in a position where some theoretically very difficult aspects of alignment end up being less difficult in reality.

Notes

^{^}

The “catch” would be that we would want to avoid deploying similar AI systems in other contexts. But this is not a concern for the “simple training game” AI that is focused on obtaining good in-episode performance. It is a concern for an AI that has a goal other than good in-episode performance.

^{^}

Though some versions might work - see "Providing unusual incentives ..." here.

^{^}

“Without thinking carefully I’d guess there is a 25% chance that deceptive alignment is a major challenge that arises early enough that we need to deal with it ourselves. Then another 25% that it’s a challenge that someone needs to deal with someday, but is too far in the future to be worth thinking about now e.g. because AI systems will have obsoleted all of our thinking before they are deceptively aligned. And another 25% chance that deceptive alignment just never arises in the real world if you are careful and take the kind of simple precautions that we already understand. And a final 25% where the entire concern was misguided and it never happens in practice with modern ML no matter how careless you are.”

^{^}

Such as via internals-based training.