
This is Sections 2.2.4.1-2.2.4.2 of my report “Scheming AIs: Will AIs fake alignment during training in order to get power?”. There’s also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I’m hoping that it will provide much of the context necessary to understand individual sections of the report on their own.

Audio version of this section here, or search "Joe Carlsmith Audio" on your podcast app.

What if you intentionally train models to have long-term goals?

In my discussion of beyond-episode goals thus far, I haven't been attending very directly to the length of the episode, or to whether the humans are setting up training specifically in order to incentivize the AI to learn to accomplish long-horizon tasks. Do those factors make a difference to the probability that the AI ends up with the sort of beyond-episode goals necessary for scheming?

Yes, I think they do. But let's distinguish between two cases, namely:

  1. Training the model on long (but not: indefinitely long) episodes, and

  2. Trying to use short episodes to create a model that optimizes over long (perhaps: indefinitely long) time horizons.

I'll look at each in turn.

Training the model on long episodes

In the first case, we are specifically training our AI using fairly long episodes – say, for example, a full calendar month. That is: in training, in response to an action at t1, the AI receives gradients that causally depend on the consequences of its action a full month after t1, in a manner that directly punishes the model for ignoring those consequences in choosing actions at t1.

Now, importantly, as I discussed in the section on "non-schemers with schemer-like traits," misaligned non-schemers with longer episodes will generally start to look more and more like schemers. Thus, for example, a reward-on-the-episode seeker, here, would have an incentive to support/participate in efforts to seize control of the reward process that will pay off within a month.

But also, importantly: a month is still different from, for example, a trillion years. That is, training a model on longer episodes doesn't mean you are directly pressuring it to care, for example, about the state of distant galaxies in the year five trillion. Indeed, on my definition of the "incentivized episode," no earthly training process can directly punish a model for failing to care on such a temporal scope, because no gradients the model receives can depend (causally) on what happens over such timescales. And of course, absent training-gaming, models that sacrifice reward-within-the-month for more-optimal-galaxies-in-year-five-trillion will get penalized by training.

In this sense, the most basic argument against expecting beyond-episode goals (namely, that training provides no direct pressure to have them, and actively punishes them, absent training-gaming, if they ever lead to sacrificing within-episode reward for something longer-term) applies to both "short" (e.g., five minutes) and "long" (e.g., a month, a year, etc.) episodes with equal force.
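To make the point about the "incentivized episode" concrete, here is a toy sketch. Everything in it is an assumption made for illustration: the function name `training_signal`, the three-step episode, and the reward numbers are hypothetical, and real training setups are far messier. The sketch only shows that consequences after the episode ends cannot change the signal the model is updated on, and that trading within-episode reward for a later payoff lowers that signal.

```python
from typing import Sequence


def training_signal(consequences: Sequence[float], t1: int, episode_len: int) -> float:
    """Toy stand-in for the reward that can feed back into an update for an
    action taken at step t1: only consequences inside the episode count."""
    window = consequences[t1 : t1 + episode_len]
    return float(sum(window))


# Hypothetical numbers: two worlds that agree inside a 3-step episode
# but differ wildly afterwards.
within_episode = [1.0, 0.5, 0.0]
world_a = within_episode + [100.0, 100.0]
world_b = within_episode + [-100.0, -100.0]

signal_a = training_signal(world_a, t1=0, episode_len=3)
signal_b = training_signal(world_b, t1=0, episode_len=3)

# A policy that gives up all within-episode reward for a huge later payoff.
sacrificing_world = [0.0, 0.0, 0.0, 1000.0]
signal_sacrifice = training_signal(sacrificing_world, t1=0, episode_len=3)

print(signal_a, signal_b, signal_sacrifice)
```

The same structure holds whether `episode_len` stands for five minutes or a month: lengthening the window changes which consequences count, but anything past the window still contributes nothing.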

However, I do still have some intuition that once you're training a model on fairly long episodes, the probability that it learns a beyond-episode goal goes up at least somewhat. The most concrete reason I can give for this appeals to a form of "messy goal-directedness." Suppose that, in order to build a schemer, SGD needs to build not just a beyond-episode goal to which a generic "goal-achieving engine" can then be immediately directed, but rather a larger set of future-oriented heuristics, patterns of attention, beliefs, and so on (call these "scheming-conducive cognitive patterns"). Then it seems plausible to me that AIs trained on longer episodes will have more of these "scheming-conducive cognitive patterns" by default. For example, they'll be more used to reasoning about the long-term consequences of their actions; they'll have better models of what those long-term consequences will be; and so on. And perhaps (though this seems to me especially speculative), longer-episode training will incentivize the AI to just think more about various beyond-episode things, to which its goal-formation can then more readily attach.

Beyond this, I also have some sort of (very hazy) intuition that relative to a model pressured by training to care only about the next five minutes, a model trained to care over e.g. a month, or a year, is more likely to say "whatever, I'll just optimize over the indefinite future." However, it's not clear to me how to justify this intuition.[1]

(You could imagine making the case that models trained on longer episodes will have more incentives to develop situational awareness – or even goal-directedness in general. But I'm assuming that all the models we're talking about are goal-directed and situationally-aware.)

Using short episodes to train a model to pursue long-term goals

Let's turn to the second case above: trying to use short-episode training to create a model that optimizes over long time horizons.

Plausibly, something like this will become more and more necessary the longer the time horizons of the task you want the model to perform. Thus, for example, if you want to create a model that tries to maximize your company's profit over the next year, trying to train it over many year-long episodes of attempted profit-maximization (e.g., have the model take some actions, wait a year, then reward it based on how much profit your company makes) isn't a very good strategy: there isn't enough time.
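A rough back-of-the-envelope version of "there isn't enough time" can be sketched as follows. The figure of 1,000 sequential rounds of feedback is purely an assumption for illustration (the report gives no number), and the helper `wall_clock_minutes` is hypothetical. Running episodes in parallel would not change the basic picture, since each round of feedback still has to wait for its episode to finish.

```python
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_YEAR = 365 * MINUTES_PER_DAY


def wall_clock_minutes(sequential_rounds: int, episode_minutes: float) -> float:
    """Minimum wall-clock time if each round must wait for its episode to end."""
    return sequential_rounds * episode_minutes


rounds = 1_000  # assumed number of sequential feedback rounds

year_long = wall_clock_minutes(rounds, MINUTES_PER_YEAR)
five_minute = wall_clock_minutes(rounds, 5)

years_needed = year_long / MINUTES_PER_YEAR
days_needed = five_minute / MINUTES_PER_DAY

print(f"year-long episodes: {years_needed:.0f} years")
print(f"five-minute episodes: {days_needed:.2f} days")
```

Under these assumed numbers, the year-long setup needs a millennium of waiting, while the five-minute setup needs a few days.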

Indeed, it seems plausible to me that this sort of issue will push AI development away from the sort of simple, baseline ML training methods I'm focused on in this report. For example, perhaps the best way to get models to pursue long-term goals like "maximize my company profits in a year" will be via something akin to "Language Model Agents," built using trained ML systems as components, but which aren't themselves optimized very directly via gradients that depend on whether they are achieving the (possibly long-term) goals users set for them. These sorts of AIs would still pose risks of schemer-like behavior (see the section on "non-schemers with schemer-like traits" above), but they wouldn't be schemers in the sense I have in mind.

That said, there are ways of trying to use the sort of training I'm focused on, even with fairly short-term episodes, to try to create models optimizing for long-term goals. In particular, you can try to reward the model based on your assessment of whether its short-term behavior is leading to the long-term results that you want (e.g., long-term company profit), and therefore, hopefully induce it to optimize for those long-term results directly.[2] Of course, whether this will work (as opposed, for example, to inducing the AI to optimize your short-term assessments of its actions) is a further question. But if it does, then you'll have created an AI that optimizes for "beyond-episode goals" in my sense.

Indeed, depending on how we want to use our terms, we can view this sort of training as intentionally trying to create a form of goal-misgeneralization. That is, the reward, here, does not depend causally on the long-term consequences of the model's actions, so in that sense, the long-term results in question aren't the "specified goal" (on this framing, the specified goal is always within-the-episode). But you're trying to get the AI to care intrinsically about them anyway.

Of course, it's a further question whether this sort of beyond-episode goal, once created, will lead to instrumental training-gaming. And indeed, successfully creating this sort of beyond-episode goal, instead of e.g. a reward-on-the-episode seeker, requires avoiding a certain kind of training-gaming up front – that is, the model has to not learn to just optimize for your short-term evaluations. And if you've successfully set up your training process such that optimizing for your desired long-term goal is in fact a max-reward (or: near-max-reward) behavior, training-gaming might not offer the model in question much advantage. (Here the human analogy would be something like: if your supervisor is sufficiently good at assessing whether your near-term performance is going to lead to long-term profit, and sufficiently immune to manipulation, then you'll perform as well or better, in performance reviews, by just directly optimizing for long-term profit – for example, because you're not wasting time thinking about your supervisor at all.)

Still, models with beyond-episode goals emerging from this sort of process seem to me like they're at risk of scheming regardless. For one thing, the considerations discussed in the previous section all apply here – e.g., this sort of training involves pointing your model's cognition in a very future-focused direction, thereby plausibly inducing it to develop various scheming-conducive cognitive patterns, to attach value to various long-term consequences, and so on (and in this case, the horizon of the episode sets no bound on the temporal horizon of the "future" that the model's cognition is pointed towards; rather, that bound is set, centrally, by your evaluations of what the model's actions will cause, when).

More than this, though, it seems plausible to me that your evaluations of the consequences of a model's action will be in some sense "noisier" than a reward process that depends causally on those consequences, in a manner that makes it harder to differentiate between the different sorts of long-term goals your training is incentivizing. For example, maybe your model is behaving in a way that seems to you, broadly, like it will lead to your company being successful in three years, but you can't tell whether it will also create lots of harmful externalities – whereas a reward process that could actually see the consequences after three years would be able to tell. And an inability to readily distinguish between the different sorts of long-term goals you might be instilling seems like it increases the risk of accidentally instilling a schemer-like goal.[3]
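Here is a toy sketch of the externalities example. It models the "noisier" evaluation in one specific and simplified way, namely as an evaluator whose forecast tracks projected profit but is blind to externalities. That modeling choice, the names `LongTermOutcome`, `evaluator_reward`, and `outcome_based_reward`, and all the numbers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LongTermOutcome:
    profit: float         # company success after three years
    externalities: float  # harm caused along the way


def outcome_based_reward(outcome: LongTermOutcome) -> float:
    """Reward computed after actually observing the long-term consequences."""
    return outcome.profit - outcome.externalities


def evaluator_reward(outcome: LongTermOutcome) -> float:
    """Reward from a short-term assessment that (by assumption) can forecast
    profit but cannot tell whether harmful externalities will result."""
    return outcome.profit


# Two hypothetical policies with the same projected profit.
benign = LongTermOutcome(profit=10.0, externalities=0.0)
harmful = LongTermOutcome(profit=10.0, externalities=8.0)

print(evaluator_reward(benign), evaluator_reward(harmful))
print(outcome_based_reward(benign), outcome_based_reward(harmful))
```

In this sketch the evaluator assigns both policies the same reward, so training on its assessments exerts no pressure toward one long-term goal over the other, whereas the outcome-based reward separates them.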


    1. We could try appealing to simplicity (thanks to Evan Hubinger for discussion), but it's not clear to me that "five minutes" is meaningfully simpler than "a month."

    2. This is somewhat akin to a form of "process-based feedback," except that in a strict form of process-based feedback, you never look at any of the outcomes of the model's actions, whereas in this version, you can look at outcomes up to whatever time-horizon is efficient for you to get data about.

    3. For example, maybe you wanted to create a long-term goal regulated by some concept of "honesty," which you were counting on to prevent scheming. But maybe you can't tell if you've succeeded.
