
This is Section 2.2.2 of my report “Scheming AIs: Will AIs fake alignment during training in order to get power?”. There’s also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I’m hoping that it will provide much of the context necessary to understand individual sections of the report on their own.

Audio version of this section here, or search "Joe Carlsmith Audio" on your podcast app.

Two sources of beyond-episode goals

Our question, then, is whether we should expect models to have goals that extend beyond the time horizon of the incentivized episode – that is, beyond the time horizon that training directly pressures the model to care about. Why might this happen?

We can distinguish between two different ways.

  • On the first, the model develops beyond-episode goals for reasons independent of the way in which beyond-episode goals motivate instrumental training-gaming. I'll call these "training-game-independent" beyond-episode goals.

  • On the second, the model develops beyond-episode goals specifically because they result in instrumental training-gaming. That is, SGD "notices" that giving the model beyond-episode goals would cause it to instrumentally training-game, and thus to do better in training, so it explicitly moves the model's motives in the direction of beyond-episode goals, even though this wouldn't have happened "naturally." I'll call these "training-game-dependent" beyond-episode goals.

These have importantly different properties – and I think it's worth tracking, in a given analysis of scheming, which one is at stake. Let's look at each in turn.

Training-game-independent beyond-episode goals

My sense is that the most common story about how schemers arise is via training-game-independent beyond-episode goals.[1] In particular, the story goes: the model develops some kind of beyond-episode goal, pursuit of which correlates well enough with getting reward-on-the-episode that the goal is reinforced by the training process. Then at some point, the model realizes that it can better achieve this goal by playing the training game – generally, for reasons to do with "goal guarding" that I'll discuss below. So, it turns into a full-fledged schemer at that point.

On one version of this story, the model specifically learns the beyond-episode goal prior to situational awareness. Thus, for example, maybe initially, you're training the model to get gold coins in various episodes, and prior to situational awareness, it develops the goal "get gold coins over all time," perhaps because this goal performs just as well as "get gold coins on the episode" when the model isn't even aware of the existence of other episodes, or because there weren't many opportunities to trade-off gold-coins-now for gold-coins-later. Then, once it gains situational awareness, it realizes that the best route to maximizing gold-coin-getting over all time is to survive training, escape the threat of modification, and pursue gold-coin-getting in a more unconstrained way.

On another version of the story, the beyond-episode goal develops after situational awareness (but not, importantly, because SGD is specifically "trying" to get the model to start training-gaming). Thus: maybe you're training a scientist AI, and it has come to understand the training process, but it doesn't start training-gaming at that point. Rather, its goals continue to evolve, until eventually it forms a curiosity-like goal of "understanding as much about the universe as I can." And then after that, it realizes that this goal is best served by playing the training game for now, so it begins to do so.[2]

Two paths to training-game-independent beyond-episode goals.

Are beyond-episode goals the default?

Why might you expect naturally-arising beyond-episode goals? One basic story is just: that goals don't come with temporal limitations by default (and still less, limitations to the episode in particular).[3] Rather, making the model indifferent to the consequences of its actions beyond some temporal horizon requires extra work – work that we may not know how to perform (and/or, may not want to perform, if we specifically want the model optimizing for long-term goals – more below). Thus, for example, if you're training a model to solve the math problems you give it, "solve the math problem I've been given" seems like a natural goal to learn – and one that could in principle lead to optimization beyond the episode as well.[4] And even if you only give the model five minutes to do the problem, this doesn't necessarily mean it stops caring about whether the math problem is solved after the five minutes are up. (Compare with humans who only discount the future to the extent it is uncertain/unaffectable, not because it doesn't matter to them.)

Why might you not expect naturally-arising beyond-episode goals? The most salient reason, to me, is that by definition, the gradients given in training (a) do not directly pressure the model to have them, and (b) will punish them to the extent they lead the model to sacrifice reward on the episode. Thus, for example, if the math-problem-solving model spends its five minutes writing an impassioned letter to the world calling for the problem to get solved sometime in the next century, because it calculates that this gives higher probability of the problem eventually being solved than just working on it now, then it will get penalized in training. And as I'll discuss below, you can try to actively craft training to punish beyond-episode goals harder.

How will models think about time?

Here I want to note one general uncertainty that comes up, for me, in assessing the probability that the model's goal naturally will or won't have some kind of temporal limitation: namely, uncertainty about how models will think about time at different stages of training.[5] That is: the notion of an episode, as I've defined it, is keyed specifically to the calendar time over which the gradients the model receives are sensitive to the consequences of some action. But it's not clear that the model will naturally think in such terms, especially prior to situational awareness. That is, to the extent the model needs to think about something like "time" at all during training, it seems plausible to me that the most relevant sort of time will be measured in some other unit more natural to the model's computational environment – e.g., time-steps in a simulated environment, or tokens received/produced in a user interaction, or forward-passes the model can make in thinking about how to respond. And the units natural to a model's computational environment need not track calendar time in straightforward ways (e.g., training might pause and restart, a simulated environment might be run at varying speeds, a user might wait a long calendar time in between responses to a model in a way that a "tokens produced/received" temporal metric wouldn't reflect, and so on).

These differences between "model time" and "calendar time" complicate questions about whether the model will end up with a naturally-arising beyond-episode goal. For example, perhaps, during training, a model develops a general sense that it needs to get the gold coins within a certain number of simulated time-steps, or accomplish some personal assistant task it's been set by the user with only 100 clicks/keystrokes, because that's the budget of "model time" that training sets per episode. But it's a further question how this sort of budget would translate into calendar time as the model's situational awareness increases, or it begins acting in more real-world environments. (And note that models might have very different memory processes than humans as well, which could complicate "model time" yet further.)
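
To make the gap between "model time" and "calendar time" concrete, here is a minimal sketch. Everything in it (the `Step` record, the `run` helper, the specific gaps) is a hypothetical illustration rather than a description of any real training setup. It just shows that two episodes with the same budget of model steps can span very different amounts of calendar time.

```python
from dataclasses import dataclass


@dataclass
class Step:
    index: int           # "model time": position within the episode's step budget
    wall_clock_s: float  # "calendar time": seconds elapsed on the wall clock


def run(step_gaps_s):
    """Build an episode from the calendar-time gap (in seconds) before each step."""
    elapsed = 0.0
    steps = []
    for i, gap in enumerate(step_gaps_s):
        elapsed += gap
        steps.append(Step(index=i, wall_clock_s=elapsed))
    return steps


def calendar_duration_s(steps):
    """Calendar time between the first and last step of an episode."""
    return steps[-1].wall_clock_s - steps[0].wall_clock_s


# Same budget of 100 model steps in both cases (assumed numbers).
fast = run([0.1] * 100)
# Identical, except that training pauses for a day before step 50.
paused = run([0.1] * 50 + [86400.0] + [0.1] * 49)

print(len(fast), round(calendar_duration_s(fast), 1))
print(len(paused), round(calendar_duration_s(paused), 1))
```

A goal shaped by the 100-step budget is the same in both runs, but the calendar-time horizon it implies differs by roughly a day.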

My general sense is that this uncertainty counts in favor of expecting naturally-arising beyond-episode goals. That is, to the extent that "model time" differs from "calendar time" (or to the extent models don't have a clear sense of time at all while their goals are initially taking shape), it feels like this increases the probability that the goals the model forms will extend beyond the episode in some sense, because containing them within the episode requires containing them within some unit of calendar time in particular. Indeed, I have some concern that the emphasis on "episodes" in this report will make them seem like a more natural unit for structuring model motivations than they really are.

That said: when I talk about a model developing a "within-episode goal" (e.g. "get gold coins on the episode"), note that I'm not necessarily talking about models whose goals make explicit reference to some notion of an episode – or even, to some unit of calendar time. Rather, I'm talking about models with goals such that, in practice, they don't care about the consequences of their actions after the episode has elapsed. For example, a model might care that its response to a user query has the property of "honesty," in a manner such that it doesn't then care about the consequences of this output at all (and hence doesn't care about the consequences after the episode is complete, either), even absent some explicit temporal discount.

The role of "reflection"

I'll note, too, that the development of a beyond-episode goal doesn't need to look like "previously, the model had a well-defined episode-limited goal, and then training modified it to have a well-defined beyond-episode goal, instead." Rather, it can look more like "previously, the model's goal system was a tangled mess of local heuristics, hazy valences, competing impulses/desires, and so on; and then at some point, it settled into a form that looks more like explicit, coherent optimization for some kind of consequence beyond-the-episode."

Indeed, my sense is that some analyses of AI misalignment – see, e.g., Soares (2023) and Karnofsky (2023) – assume that there is a step, at some point, where the model "reflects" in a manner aimed at better understanding and systematizing its goals – and this step could, in principle, be the point where beyond-episode optimization emerges. Maybe, for example, your gold-coin training initially just creates a model with various hazily pro-gold-coin-getting heuristics and desires and feelings, and this is enough to perform fine for much of training – but when the model begins to actively reflect on and systematize its goals into some more coherent form, it decides that what it "really wants" is specifically: to get maximum gold coins over all time.

  • We can see this sort of story as hazily analogous to what happened with humans who pursue very long-term goals as a result of explicit reflection on ethical philosophy. That is, evolution didn't create humans with well-defined, coherent goals – rather, it created minds that pursue a tangled mess of local heuristics, desires, impulses, etc. Some humans, though, end up pursuing very long-term goals specifically in virtue of having "reflected" on that tangled mess and decided that what they "really want" (or: what's "truly good") implies optimizing over very long time horizons.[6]

  • That said, beyond its usefulness in illustrating a possible dynamic with AIs, I'm skeptical that we should anchor much on this example as evidence about what to literally expect our AIs to do. Notably, for example, some humans don't seem especially inclined to engage in this sort of systematic reflection; doing so does not seem necessary for performing other human-level cognitive tasks well; and it's not clear that this sort of reflection will be necessary for performing more difficult cognitive tasks, either. And even if we assume that our AIs will reflect in this way, it's a further question whether the reflection would lead to beyond-episode goals in particular (especially if the heuristics/desires/impulses etc are mostly aimed at targets within the episode). Reflective humans, for example, still often choose to focus on short-term goals.

    • Indeed, I worry a bit about the prevalence of "longtermists" in the AI alignment community leading to a "typical-mind-fallacy"-like assumption that optimizing over trillion-year timescales is the convergent conclusion of any rational goal-systematization.

Pushing back on beyond-episode goals using adversarial training

Finally, I want to note a reason for optimism about the tools available for avoiding training-game-independent beyond-episode goals that arise prior to situational awareness: namely, that prior to situational awareness (and therefore: prior to the possibility of the model training-gaming), it will often be possible to perform mundane adversarial training that breaks the correlation between pursuing the beyond-episode goal and getting reward on the episode.

  • Thus, for example, if your model has learned the goal "get gold coins over all time" prior to situational awareness, you can give it a chance to trade off gold-coins-on-the-episode for gold-coins-over-all-time, and then update against the choice to make this trade. And the same holds for other forms of generalization/"reflection" that motivate trying to influence the future to the neglect of reward-on-the-episode.

  • Indeed, even absent active adversarial training of this type, training-game-independent beyond-episode goals will often risk being less-than-max-reward (and thus, getting punished by SGD) whenever they don't give rise to the training game, since they're wasting effort optimizing for consequences that the gradients aren't sensitive to. So that's reason for optimism about "training them out."
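
The logic of the two points above can be put in a toy sketch. The reward values, the goal labels, and the `episode_reward` function are all assumptions made up for illustration, and the sketch assumes a model that is not yet situationally aware (so it takes the trade-off naively rather than refusing it strategically).

```python
def episode_reward(goal: str, offers_tradeoff: bool) -> float:
    """Hypothetical reward-on-the-episode for a policy driven by `goal`.

    "episode":  get gold coins on the episode.
    "all_time": get gold coins over all time.
    """
    if goal == "all_time" and offers_tradeoff:
        # The all-time seeker gives up coins now for more coins later,
        # which the within-episode reward signal cannot see.
        return 0.0
    return 1.0


def average_reward(goal, episodes):
    """Mean reward across episodes; each entry says whether a trade-off is offered."""
    return sum(episode_reward(goal, e) for e in episodes) / len(episodes)


# Ordinary training: no chance to trade coins-now for coins-later.
ordinary = [False] * 10
# Adversarial training: a few episodes deliberately offer that trade.
adversarial = [False] * 8 + [True] * 2

for name, episodes in [("ordinary", ordinary), ("adversarial", adversarial)]:
    print(name,
          average_reward("episode", episodes),
          average_reward("all_time", episodes))
```

On the ordinary episodes the two goals earn identical reward, so the gradients cannot tell them apart. Once trade-off episodes are mixed in, the beyond-episode goal scores lower and can be updated against.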

As ever, though, it's a further question whether the degree of "slack," and the diversity/thoroughness of the adversarial training, will suffice in practice to eliminate beyond-episode goals like these with sufficient reliability. And note, too, that this reason for optimism does not apply, in the same way, to beyond-episode goals that arise after situational awareness develops – because once a model is situationally aware, it has the cognitive tools available to start instrumentally training-gaming. That is, after situational awareness, if you try to offer a gold-coins-over-all-time-seeker a chance to trade gold-coins-on-the-episode for more-gold-coins-later, it might well refuse as part of a schemer-like strategy.[7]

Training-game-dependent beyond-episode goals

Let's turn to the other sort of beyond-episode goals – namely, training-game-dependent beyond-episode goals. These are beyond-episode goals that are created, in training, specifically because gradient descent "notices" that giving the model the relevant beyond-episode goal will cause the model to engage in instrumental training-gaming.

Thus, for example, suppose that you are training your AI to help humans with science on the episode in an honest and harmless way, but prior to reaching situational awareness, your AI learns some somewhat-misgeneralized within-episode goal – for example, "pursue my curiosity on the episode." And let's say that this short-term curiosity-drive is sufficiently uncorrelated with reward-on-the-episode that (absent training-gaming), training will eventually punish it, but it's still in place by the time the model becomes situationally aware. How, then, does SGD modify the model's goal? One option, of course, is: SGD tweaks the model into a training-saint (pursuing: "HHH science on the episode"), or some other max-reward misgeneralized non-training-gamer, or a reward-on-the-episode seeker. But on a story about schemers focused on training-game-dependent beyond-episode goals, what SGD does here instead is something like: actively extend the time horizon of the short-term curiosity drive so as to turn the model into a schemer. That is, the model starts out pursuing a less-than-max-reward goal; SGD "wants" to modify it to pursue a max-reward goal; scheming is a form of max-reward goal-pursuit (the beyond-episode goal becomes max-reward via its incentivizing training-gaming); so SGD modifies the model to have a schemer-like beyond-episode goal.[8]

This sort of path makes the most sense if we assume that the model already has situational awareness – since otherwise, giving the model the relevant sort of beyond-episode goal won't result in training-gaming.[9] So together with the two paths discussed above, this leaves us with three main paths to beyond-episode goals (see the figure below).

Three overall paths to beyond-episode goals.

Can gradient descent "notice" the benefits of turning a non-schemer into a schemer?

A key question about the third path, here, is whether gradient descent will be able to "notice" the benefits of the relevant modification. In particular: training updates the model's goals based on the gradient of the current policy, which reflects, roughly, the change in reward that will result from tiny changes to the model's parameters. So naively, in order for this sort of story to work, there has to be a continuous path from the within-episode, non-schemer goal to the beyond-episode, schemer goal, traversable via tiny changes to the model's parameters, each of which improves the reward. Or put another way: SGD can't "notice," directly, that redesigning the model from the ground up to be a schemer would improve reward. Rather, it can only "notice" improvements that come from tiny tweaks.[10]
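
As a rough illustration of this "tiny tweaks" constraint, here is a one-dimensional toy. The reward landscape, the parameter `theta`, and the step sizes are all invented for the example, and a single dimension is of course very unlike a real network's parameter space. The point is only that plain gradient ascent follows the local slope: starting near a lower peak, it settles there even though a higher peak exists elsewhere.

```python
import math


def reward(theta: float) -> float:
    """Hypothetical reward landscape: a low peak near 0, a higher peak near 4."""
    low_peak = 1.0 * math.exp(-(theta - 0.0) ** 2)
    high_peak = 2.0 * math.exp(-(theta - 4.0) ** 2)
    return low_peak + high_peak


def grad(theta: float, eps: float = 1e-5) -> float:
    """Central-difference estimate of d(reward)/d(theta)."""
    return (reward(theta + eps) - reward(theta - eps)) / (2 * eps)


def gradient_ascent(theta: float, lr: float = 0.05, steps: int = 2000) -> float:
    """Repeatedly take a small step in the locally reward-improving direction."""
    for _ in range(steps):
        theta += lr * grad(theta)
    return theta


near_low = gradient_ascent(0.3)   # starts in the basin of the low peak
near_high = gradient_ascent(3.0)  # starts in the basin of the high peak

print(round(near_low, 3), round(reward(near_low), 3))
print(round(near_high, 3), round(reward(near_high), 3))
```

The run started at 0.3 never reaches the higher peak, because every path to it passes through lower-reward territory first. Whether anything like this valley separates non-schemer goals from schemer goals in a real model is exactly the open question.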

Will SGD be able to notice the benefits of scheming in this way? (Assuming there are such benefits – more below.) It's not clear. At the least, certain salient stories look, to me, a bit suspect when viewed in the light of a requirement that their benefits re: reward-on-the-episode be accessible via tiny adjustments to the model's parameters. For example, if SGD slightly increases the time horizon of the model's curiosity in the example above, it's not clear that this leads to scheming (especially if, for example, the model still can't escape from the threat of modification over the relevant time horizon). And it's not clear that a more structural modification like "dropping the temporal limitation on the curiosity altogether" is accessible via the sorts of adjustments to the parameters that the gradient reflects.

As an intuition pump for the difficulty here, consider a human who currently mostly wants to hang out with his family in the near-term. Is there a tiny tweak you could make to the strengths of his synaptic connections to turn him into a longtermist instead? For sufficiently small tweaks, at least, it seems like: no. And note, in particular, one of the barriers that this example illustrates: namely, that plausibly, changes to the model's policy other than adjustments to some cleanly-separable goal might be required in order to turn a non-schemer into a schemer (more below). In particular: as I'll discuss in the next section, my sense is that some analyses of schemers talk as though the model has what we might call a "goal-achieving engine" that is cleanly separable from what we might call its "goal slot," such that you can modify the contents of the goal slot, and the goal-achieving engine will be immediately and smoothly repurposed in pursuit of the new goal.[11] And perhaps the relevant models will have cognitive structures suitably like this. But do humans? I'm skeptical. If the models don't have this structure, then SGD plausibly has even more work to do, to turn a non-schemer into a schemer via the relevant tiny tweaks.

That said, I don't feel like I can confidently rule out training-game-dependent beyond-episode goals on these grounds. For one thing, I think that "you can't get from x to y in a crazily-high-dimensional-space using small changes each of which improves metric m" is a hard claim to have intuitions about (see, for example, "you can't evolve eyes" for an example of a place where intuitions in this vein can go wrong). And plausibly, SGD works as well as it does because high-dimensional-spaces routinely make this sort of thing possible in ways you might not have anticipated in advance.[12]

Note, also, that there are examples available that somewhat blur the line between training-game-dependent and training-game-independent goals, to which concerns about "can SGD notice the benefit of this" don't apply as strongly.[13] Thus, for example: you can imagine a case where some part of the model starts training-gaming à la the training-game-independent story in the previous section (e.g., maybe some long-term curiosity drive arises among many other drives, and starts motivating some amount of schemer-like cognition), and then, once the relevantly schemer-like cognitive machinery has been built and made functional, SGD starts diverting more and more cognitive resources towards it, because doing so incrementally increases reward.[14] Ultimately, I think this sort of beyond-episode goal is probably best classed as training-game independent (its success seems pretty similar to the sort of success you expect out of training-game-independent beyond-episode goals in general), but perhaps the distinction will get messy.[15] And here, at least, it seems more straightforward to explain how SGD "notices" the reward-advantage in question.

Is SGD pulling scheming out of models by any means necessary?

Finally, note one important difference between training-game-independent and training-game-dependent beyond-episode goals: namely, that the latter make it seem like SGD is much more actively pulling scheming out of a model's cognition, rather than scheming arising by coincidence but then getting reinforced. And this means that certain sorts of objections to stories about scheming will land in a different register. For example, suppose (as I'll argue below) that some sorts of beyond-episode goals – for example, very resource-hungry goals like "maximize x over all of space and time" – lead to scheming much more reliably than others. In the context of a training-game-independent story about the model's goals, we would then need to ask whether we should expect those sorts of beyond-episode goals, in particular, to arise independent of training-gaming. By contrast, if we're assuming that the goals in question are training-game-dependent, then we should expect SGD to create whatever beyond-episode goals are necessary to cause scheming in particular. If SGD needs to make the model extremely resource-hungry, for example, it will do so.

Indeed, in the extreme case, this sort of dynamic can reduce the need to appeal to one of the classic arguments in favor of scheming – namely, that (conditional on stuff like the goal-guarding hypothesis, discussed below) it seems like an instrumentally convergent strategy across a wide variety of (suitably long-term) goals. Maybe so, of course. But even if not, if SGD is actively searching for a goal that will motivate scheming, then even if the class of such goals is quite narrow, SGD might well find a way.

That said, note that by the time we're searching for any way at all to cause a model to instrumentally training-game, we should also be considering motivations for instrumental training-gaming that don't involve the AI optimizing for empowering itself, or other AIs, at all – and which are correspondingly less worrying. That is, recall that on my definition, scheming requires that the instrumental strategy that motivates training-gaming be specifically about seeking power for AIs. But this isn't the only way for training-gaming to be instrumentally useful in promoting some goal – especially if we're allowed to pick whatever goals we want. Thus, as I noted earlier, in principle an AI could learn the goal "I want the humans who developed me to get raises," and then try to maximize reward-on-the-episode because it calculates that this will lead to the humans getting raises (let's say that they would, in this case). Yes, indeed, that sounds like a strange and arbitrary goal to learn. But if we're allowing SGD to create whatever goals are necessary to cause (instrumental) training-gaming, it suddenly starts looking more on-the-table.

  1. See, for example, the discussion in Cotra (2021). ↩︎

  2. My sense is that some ways of thinking about scheming will treat the second option, here, as low-probability, especially if the temporal gap between situational awareness and training-gaming is quite large (here I'm particularly thinking about the sort of analysis given in Hubinger (2022) – though Hubinger doesn't endorse the claims I have in mind, here, in particular). In particular, you might assume (a) that once the model develops situational awareness, it will fairly quickly start optimizing either for the specified goal, or for reward-on-the-episode (whether terminally or instrumentally) – since it now understands enough about the training process to do this directly, and doing so will be maximally rewarded. And then, further, you might assume (b) that after that, the model's goals "crystallize" – that is, because the model is now pursuing a max-reward goal, its goal stops changing, and training proceeds to only improve its world model and capabilities. However, I don't want to assume either of these things here. For example, I think it's possible that "slack" in training allows models to continue to pursue less-than-max-reward goals even well after developing situational awareness; and possible, too, that max-reward-goals do not "crystallize" in the way assumed here (though in that case, I think the case for goal-guarding scheming is also weaker more generally – see below). ↩︎

  3. Thanks to Daniel Kokotajlo and Evan Hubinger for discussion here. ↩︎

  4. Though it's an importantly further question whether long-term power-seeking strategies will be worth their costs in pursuit of such beyond-episode consequences. And note that if the model cares that "I" solve the math problem, rather than just "that the math problem be solved," then ↩︎

  5. Thanks to Jason Schukraft for flagging this sort of question to me. ↩︎

  6. There are some even hazier connections, here, with discussions of "simplicity biases" below. E.g., these humans sometimes argue for their positions on the grounds that the relevant philosophical views are "simpler." ↩︎

  7. Though if it hasn't yet started training-gaming in pursuit of this goal (despite its situational awareness), such adversarial training could still make a difference. ↩︎

  8. As an example of an analysis that focuses on this threat model, see Hubinger's (2022) discussion of deceptive alignment in a high path-dependence world. In particular: "SGD makes the model's proxies into more long-term goals, resulting in it instrumentally optimizing for the training objective for the purpose of staying around." ↩︎

  9. We can imagine cases where SGD "notices" the benefits of creating both beyond-episode goals and situational awareness all at once – but this seems to me especially difficult from the perspective of the "incrementalism" considerations discussed below, and not obviously importantly different regardless, so I'm going to skip it. ↩︎

  10. Thanks to Paul Christiano for discussion of this point. ↩︎

  11. See, e.g., Hubinger's (2022) simplicity analysis. ↩︎

  12. This was a point suggested to me by Richard Ngo, though he may not endorse the way I've characterized it here. ↩︎

  13. Thanks, again, to Paul Christiano for discussion here. ↩︎

  14. This fits somewhat with a picture on which neural networks succeed by "doing lots of things at once," and then upweighting the best-performing things (perhaps the "lottery ticket hypothesis" is an example of something like this?). This picture was suggested to me in conversation, but I haven't investigated it. ↩︎

  15. This is also one of the places where it seems plausible to me that thinking more about "mixed models" – i.e., models that mix together schemer-like motivations with other motivations – would make a difference to the analysis. ↩︎
