The goal-guarding hypothesis (Section 2.3.1.1 of "Scheming AIs")

Joe_Carlsmith

Executive summary: The "goal-guarding hypothesis" holds that models optimizing for reward during training will retain goals they want empowered in the future. But several factors challenge this hypothesis and the broader "classic goal-guarding story" for instrumental deception.

Key points:

The "crystallization hypothesis," which expects strict goal preservation once training-gaming begins, seems unrealistic given "messy goal-directedness" that blurs capabilities and motivations.
Even looser goal-guarding may not tolerate the specific kinds of goal changes from training. The changes could be quite significant.
Goal differences may undermine motivation to empower future selves or discount it severely.
"Introspective" methods for directly protecting goals seem difficult and not central to classic goal-guarding arguments.
If goals can freely "float around" once instrumental training-gaming begins, this could undermine the incentive to scheme in the first place.
Whether goal-guarding works may rely on sophisticated coordination and cooperation between different possible model selves.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

There isn't much public literature on scheming/deceptive alignment, but I take the canonical public statement to be in Hubinger et al (2019), who define "deceptive alignment" via the classic goal-guarding story. In particular: "If the mesa-optimizer has an objective that extends across parameter updates, then it will be incentivized to avoid being modified, as it might not pursue the same objective after modification (with the result that its current objective will not be achieved in future iterations). This means that the mesa-optimizer will be instrumentally incentivized to act as if it is optimizing the base objective function, even if its actual mesa-objective is something else entirely. We will refer to this hypothetical phenomenon as deceptive alignment." But see also Cotra (2021): "Once the model understands this, it realizes that if it just naively always seeks to improve its fundamental understanding, humans will eventually notice that this isn't always the most efficient way to develop helpful drugs and use further training to change its motivations. It doesn't want its own motivations to be changed: it wants to eventually run huge particle collider experiments to learn about high-energy physics, and it can't do that if it allows SGD to edit away its scientific curiosity. So it decides that from here on out in training, it will act as helpful as possible -- exactly as a Saint model would act -- so that SGD doesn't edit away its motivation to learn physics and chemistry." ↩︎
Note that the model, here, need not think of the future empowered agents-with-similar-goals as "itself" (e.g., it need not be trying to "survive"). Rather, what matters is that the future agents be pushing for the same objective consequences that the model is pushing for. However, if the model's interest in these future agents rests on some essential reference to its "self" – e.g., "I myself need to survive in order to achieve my goals" – then suddenly its conception of personal identity matters much more. My sense is that something like this is what Karnofsky (2022) has in mind when he talks about the relevance of whether a model has "indexical goals" (though I think this usage of the term "indexical" is somewhat non-standard).

I don't have a strong sense of whether to expect AIs to develop goals that depend on some conception of their personal identity, but note, at the least, that a model's goals will be shaped in an environment where the model is often being copied, run on different clusters, and so forth (even, sometimes, within a single episode). This seems relevant to the conception of personal identity to expect it to end up with. ↩︎
This one only applies to training saints. ↩︎
Recall that we assumed, above, that the models have an accurate picture of the instrumental value of scheming. ↩︎
And then also, for the relevant empowerment to be actually in the offing, worth pursuing relative to the model's other options, etc – per the "adequate future empowerment" premise above. ↩︎
Thanks to Nate Soares for discussion of these possibilities. ↩︎
Here I'm setting aside cases where the model would place very little intrinsic value on the future goals being empowered, but works to empower them as part of some kind of cooperative arrangement. I discuss this sort of case in section 2.3.2.1 below. And I'm also setting aside cases where the model comes to value the achievement of something like "my future goals, whatever they are" – I'll discuss this in section 2.3.2.3 below. ↩︎
Though note the tension, here, with arguments about the "fragility of value" and "extremal Goodhart," on which small differences in "utility functions" can balloon in importance when subject to extreme optimization pressure. ↩︎
Though here, too, there is a tension with versions of the "fragility of value" and "extremal Goodhart." E.g., if slightly-different goals lead to super-different places when subject to extreme optimization pressure, and the AIs are expecting the goals in question to be subject to such pressure, then it will be harder for small changes to lead, only, to a discount, rather than a loss of most of the value at stake. ↩︎
Thanks, again, to Nate Soares for discussion here. ↩︎
Though perhaps we would still need to worry about "early undermining" of the type I discuss above. ↩︎
Xu (2020) gives another example: "if a model had the proxy objective of 'eat apples', instead of using the hardcoded number n in other computations, the model could use n * len(objective)/10. Thus, if the proxy objective was ever changed, many computations across the entire model would fail." And see also Karnofsky (2023): "It might look something like: 'An AI system checks its own policy against some reference policy that is good for its goals; the greater the divergence, the more it sabotages its own performance, with the result that gradient descent has trouble getting its policy to diverge from the reference policy.' " ↩︎
Thanks to Paul Christiano for discussion here. ↩︎
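To make the Xu (2020) example quoted above a bit more concrete, here is a toy sketch of what "n * len(objective)/10" is doing. It is only an illustration under loud assumptions: a real network would not store its objective as a readable string, and the names `entangled_constant`, `ORIGINAL_OBJECTIVE`, and `MODIFIED_OBJECTIVE` (along with the replacement objective "eat oranges") are hypothetical, not taken from Xu's post. The only point is that "eat apples" happens to have 10 characters, so the computation returns n unchanged while the objective is intact, and returns something else once the objective is altered.

```python
def entangled_constant(n: float, objective: str) -> float:
    """Compute n indirectly, so the result depends on the stored objective.

    With a 10-character objective such as "eat apples", this returns n.
    With an objective of any other length, it returns a different value.
    """
    return n * len(objective) / 10


ORIGINAL_OBJECTIVE = "eat apples"   # the proxy objective from Xu's example (10 characters)
MODIFIED_OBJECTIVE = "eat oranges"  # hypothetical objective after training changes it (11 characters)

n = 5.0
intact = entangled_constant(n, ORIGINAL_OBJECTIVE)   # equals n while the objective is unchanged
drifted = entangled_constant(n, MODIFIED_OBJECTIVE)  # no longer equals n after the change

print("objective intact:  ", intact)
print("objective modified:", drifted)
```

Every downstream computation that relied on `intact` being equal to n would now be working with the wrong number, which is the sense in which a change to the objective would cause "many computations across the entire model" to fail. Whether a model could actually arrange its internals this way is exactly what is in question in the main text.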

Aiming at reward-on-the-episode as part of a power-motivated instrumental strategy

The classic goal-guarding story

The goal-guarding hypothesis

The crystallization hypothesis

Would the goals of a would-be schemer "float around"?

What about looser forms of goal-guarding?

Introspective goal-guarding methods