This is Sections 2.2.4.1-2.2.4.2 of my report “Scheming AIs: Will AIs fake alignment during training in order to get power?”. There’s also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I’m hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search "Joe Carlsmith Audio" on your podcast app.
What if you intentionally train models to have long-term goals?
In my discussion of beyond-episode goals thus far, I haven't been attending very directly to the length of the episode, or to whether the humans are setting up training specifically in order to incentivize the AI to learn to accomplish long-horizon tasks. Do those factors make a difference to the probability that the AI ends up with the sort of beyond-episode goals necessary for scheming?
Yes, I think they do. But let's distinguish between two cases, namely:
Training the model on long (but not: indefinitely long) episodes, and
Trying to use short episodes to create a model that optimizes over long (perhaps: indefinitely long) time horizons.
I'll look at each in turn.
Training the model on long episodes
In the first case, we are specifically training our AI using fairly long episodes – say, for example, a full calendar month. That is: in training, in response to an action at t1, the AI receives gradients that causally depend on the consequences of its action a full month after t1, in a manner that directly punishes the model for ignoring those consequences in choosing actions at t1.
Now, importantly, as I discussed in the section on "non-schemers with schemer-like traits," misaligned non-schemers with longer episodes will generally start to look more and more like schemers. Thus, for example, a reward-on-the-episode seeker, here, would have an incentive to support/participate in efforts to seize control of the reward process that will pay off within a month.
But also, importantly: a month is still different from, for example, a trillion years. That is, training a model on longer episodes doesn't mean you are directly pressuring it to care, for example, about the state of distant galaxies in the year five trillion. Indeed, on my definition of the "incentivized episode," no earthly training process can directly punish a model for failing to care on such a temporal scope, because no gradients the model receives can depend (causally) on what happens over such timescales. And of course, absent training-gaming, models that sacrifice reward-within-the-month for more-optimal-galaxies-in-year-five-trillion will get penalized by training.
In this sense, the most basic argument against expecting beyond-episode goals (namely, that training provides no direct pressure to have them, and actively punishes them, absent training-gaming, if they ever lead to sacrificing within-episode reward for something longer-term) applies to both "short" (e.g., five minutes) and "long" (e.g., a month, a year, etc.) episodes with equal force.
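A minimal sketch of this point, under toy assumptions: the "episode" is modeled as a fixed number of steps, the reward is a plain sum of per-step consequences, and the function name, policies, and numbers are all hypothetical. It shows that a reward computed only from within-episode consequences cannot distinguish policies that differ only after the episode ends, and that it scores a policy lower if that policy gives up within-episode reward for later payoffs.

```python
from typing import Sequence


def episode_return(consequences: Sequence[float], horizon: int) -> float:
    """Sum only the consequences that land inside the episode horizon."""
    return sum(consequences[:horizon])


HORIZON = 30  # hypothetical episode length, in steps (e.g. days in a month)

# Hypothetical per-step consequences, in arbitrary units.
ignores_future = [1.0] * HORIZON + [0.0] * 10     # no effects after the episode
cares_for_free = [1.0] * HORIZON + [5.0] * 10     # extra effects after the episode, at no cost
sacrifices_now = [0.5] * HORIZON + [100.0] * 10   # gives up within-episode reward for later payoff

return_ignores = episode_return(ignores_future, HORIZON)
return_free = episode_return(cares_for_free, HORIZON)
return_sacrifice = episode_return(sacrifices_now, HORIZON)

# The first two policies get identical reward: nothing after the horizon is visible.
print(return_ignores, return_free)
# The third gets less reward, however good its post-episode consequences are.
print(return_sacrifice)
```

Nothing in this sketch depends on whether `HORIZON` stands for five minutes or a month, which is the sense in which the argument applies to short and long episodes alike.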
However, I do still have some intuition that once you're training a model on fairly long episodes, the probability that it learns a beyond-episode goal goes up at least somewhat. The most concrete reason I can give for this is that, to the extent we're imagining a form of "messy goal-directedness" in which, in order to build a schemer, SGD needs to build not just a beyond-episode goal to which a generic "goal-achieving engine" can then be immediately directed, but rather a larger set of future-oriented heuristics, patterns of attention, beliefs, and so on (call these "scheming-conducive cognitive patterns"), then it seems plausible to me that AIs trained on longer episodes will have more of these sorts of "scheming-conducive cognitive patterns" by default. For example, they'll be more used to reasoning about the long-term consequences of their actions; they'll have better models of what those long-term consequences will be; and so on. And perhaps (though this seems to me especially speculative), longer-episode training will incentivize the AI to just think more about various beyond-episode things, to which its goal-formation can then more readily attach.
Beyond this, I also have some sort of (very hazy) intuition that relative to a model pressured by training to care only about the next five minutes, a model trained to care over e.g. a month, or a year, is more likely to say "whatever, I'll just optimize over the indefinite future." However, it's not clear to me how to justify this intuition.
(You could imagine making the case that models trained on longer episodes will have more incentives to develop situational awareness – or even goal-directedness in general. But I'm assuming that all the models we're talking about are goal-directed and situationally-aware.)
Using short episodes to train a model to pursue long-term goals
Let's turn to the second case above: trying to use short-episode training to create a model that optimizes over long time horizons.
Plausibly, something like this will become more and more necessary the longer the time horizons of the task you want the model to perform. Thus, for example, if you want to create a model that tries to maximize your company's profit over the next year, trying to train it over many year-long episodes of attempted profit-maximization (e.g., have the model take some actions, wait a year, then reward it based on how much profit your company makes) isn't a very good strategy: there isn't enough time.
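To make the "not enough time" point concrete, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that updates happen one after another and that each update has to wait for a full episode to finish; the function name and the figure of 1,000 updates are hypothetical, and running episodes in parallel would give more data per update without shortening the wait for each one.

```python
def min_wall_clock_days(sequential_updates: int, episode_days: float) -> float:
    """Lower bound on wall-clock time if each update must wait for one full episode."""
    return sequential_updates * episode_days


FIVE_MINUTES_IN_DAYS = 5 / (60 * 24)
UPDATES = 1_000  # hypothetical number of sequential updates

for label, episode_days in [
    ("five minutes", FIVE_MINUTES_IN_DAYS),
    ("one month", 30.0),
    ("one year", 365.0),
]:
    days = min_wall_clock_days(UPDATES, episode_days)
    print(f"{label:>12}: {days / 365:.2f} years of wall-clock time")
```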
Indeed, it seems plausible to me that this sort of issue will push AI development away from the sort of simple, baseline ML training methods I'm focused on in this report. For example, perhaps the best way to get models to pursue long-term goals like "maximize my company profits in a year" will be via something akin to "Language Model Agents," built using trained ML systems as components, but which aren't themselves optimized very directly via gradients that depend on whether they are achieving the (possibly long-term) goals users set for them. These sorts of AIs would still pose risks of schemer-like behavior (see the section on "non-schemers with schemer-like traits" above), but they wouldn't be schemers in the sense I have in mind.
That said, there are ways of trying to use the sort of training I'm focused on, even with fairly short-term episodes, to try to create models optimizing for long-term goals. In particular, you can try to reward the model based on your assessment of whether its short-term behavior is leading to the long-term results that you want (e.g., long-term company profit), and therefore, hopefully induce it to optimize for those long-term results directly. Of course, whether this will work (as opposed, for example, to inducing the AI to optimize your short-term assessments of its actions) is a further question. But if it does, then you'll have created an AI that optimizes for "beyond-episode goals" in my sense.
Indeed, depending on how we want to use our terms, we can view this sort of training as intentionally trying to create a form of goal-misgeneralization. That is, the reward, here, does not depend causally on the long-term consequences of the model's actions, so in that sense, the long-term results in question aren't the "specified goal" (on this framing, the specified goal is always within-the-episode). But you're trying to get the AI to care intrinsically about them anyway.
Of course, it's a further question whether this sort of beyond-episode goal, once created, will lead to instrumental training-gaming. And indeed, successfully creating this sort of beyond-episode goal, instead of e.g. a reward-on-the-episode seeker, requires avoiding a certain kind of training-gaming up front – that is, the model has to not learn to just optimize for your short-term evaluations. And if you've successfully set up your training process such that optimizing for your desired long-term goal is in fact a max-reward (or: near-max-reward) behavior, training-gaming might not offer the model in question much advantage. (Here the human analogy would be something like: if your supervisor is sufficiently good at assessing whether your near-term performance is going to lead to long-term profit, and sufficiently immune to manipulation, then you'll perform as well or better, in performance reviews, by just directly optimizing for long-term profit – for example, because you're not wasting time thinking about your supervisor at all.)
Still, models with beyond-episode goals emerging from this sort of process seem to me like they're at risk of scheming regardless. For one thing, the considerations discussed in the previous section all apply here – e.g., this sort of training involves pointing your model's cognition in a very future-focused direction, thereby plausibly inducing it to develop various scheming-conducive cognitive patterns, to attach value to various long-term consequences, and so on (and in this case, the horizon of the episode sets no bound on the temporal horizon of the "future" that the model's cognition is pointed towards; rather, that bound is set, centrally, by your evaluations of what the model's actions will cause, when).
More than this, though, it seems plausible to me that your evaluations of the consequences of a model's action will be in some sense "noisier" than a reward process that depends causally on those consequences, in a manner that makes it harder to differentiate between the different sorts of long-term goals your training is incentivizing. For example, maybe your model is behaving in a way that seems to you, broadly, like it will lead to your company being successful in three years, but you can't tell whether it will also create lots of harmful externalities – whereas a reward process that could actually see the consequences after three years would be able to tell. And an inability to readily distinguish between the different sorts of long-term goals you might be instilling seems like it increases the risk of accidentally instilling a schemer-like goal.
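A toy sketch of this contrast, under strong simplifying assumptions: the evaluator is modeled as forecasting long-term profit with Gaussian noise while being unable to see externalities at all, and the outcome-based reward is modeled as profit minus harm. Every name and number here is hypothetical.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class LongTermResult:
    profit: float            # what the evaluator is trying to forecast
    externality_harm: float  # invisible to the evaluator at evaluation time


def outcome_reward(result: LongTermResult) -> float:
    """Reward that causally depends on the realized long-term consequences."""
    return result.profit - result.externality_harm


def forecast_reward(result: LongTermResult, rng: random.Random, noise_sd: float = 1.0) -> float:
    """Evaluator's noisy short-term guess; it can only estimate profit."""
    return result.profit + rng.gauss(0.0, noise_sd)


# Two hypothetical policies: same long-term profit, very different side effects.
benign = LongTermResult(profit=10.0, externality_harm=0.0)
harmful = LongTermResult(profit=10.0, externality_harm=8.0)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
forecast_gap = abs(
    mean(forecast_reward(benign, rng) for _ in range(1_000))
    - mean(forecast_reward(harmful, rng) for _ in range(1_000))
)
outcome_gap = outcome_reward(benign) - outcome_reward(harmful)

print(f"gap under forecast-based reward: {forecast_gap:.3f}")
print(f"gap under outcome-based reward:  {outcome_gap:.3f}")
```

In this setup the forecast-based reward gives the two policies the same score on average, so training on it applies no pressure favoring the benign one, whereas the outcome-based reward separates them by the full size of the harm.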
We could try appealing to simplicity (thanks to Evan Hubinger for discussion), but it's not clear to me that "five minutes" is meaningfully simpler than "a month." ↩︎
This is somewhat akin to a form of "process-based feedback," except that in a strict form of process-based feedback, you never look at any of the outcomes of the model's actions, whereas in this version, you can look at outcomes up to whatever time-horizon is efficient for you to get data about. ↩︎
For example, maybe you wanted to create a long-term goal regulated by some concept of "honesty," which you were counting on to prevent scheming. But maybe you can't tell if you've succeeded. ↩︎