Two concepts of an “episode” (Section 2.2.1 of “Scheming AIs”)

Joe_Carlsmith

A related concept to an "episode" is the set of everything that mediates the agents' achievement of reward, i.e. it's every X that lies on a path A⇢X⇢R for some action A and reward R. These "mediators" X carve out a part of both time and space. An episode is roughly the convex hull of those mediators.
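
As a rough illustration of the "mediator" idea, the set of X lying on some path A⇢X⇢R can be computed by graph reachability: take everything downstream of an action and intersect it with everything upstream of a reward. This is a minimal sketch, assuming the causal structure is given as a directed graph in an adjacency dict; the node names and the toy graph are hypothetical.

```python
from collections import deque

def reachable(graph, start):
    """Return every node reachable from `start` by following directed edges."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def mediators(graph, actions, rewards):
    """Nodes X with a directed path A -> ... -> X -> ... -> R
    for some action A and some reward R."""
    # Everything downstream of at least one action.
    downstream = set()
    for a in actions:
        downstream |= reachable(graph, a)
    # Everything upstream of at least one reward (search the reversed graph).
    reversed_graph = {}
    for src, dsts in graph.items():
        for dst in dsts:
            reversed_graph.setdefault(dst, []).append(src)
    upstream = set()
    for r in rewards:
        upstream |= reachable(reversed_graph, r)
    return (downstream & upstream) - set(actions) - set(rewards)

# Hypothetical toy graph: the action affects a coin count that is rewarded,
# and also a far-future outcome that no reward depends on.
toy_graph = {
    "action_t1": ["coins_t5", "far_future_t100"],
    "coins_t5": ["reward_t6"],
}
```

In this toy graph, `mediators(toy_graph, ["action_t1"], ["reward_t6"])` picks out only the coin count; the far-future outcome is affected by the action but does not mediate any reward.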

SummaryBot

Executive summary: This section distinguishes between two concepts of an "episode" in machine learning training - the intuitive episode and the incentivized episode. The intuitive episode is a natural unit of training (e.g. a game), while the incentivized episode is the period of time over which training directly pressures the model to optimize.

Key points:

The incentivized episode is the period over which training actively punishes the model for not optimizing. It may be shorter than the full training period.
The intuitive episode is a natural unit picked for training (e.g. a game). It is not necessarily the same as the incentivized episode.
Care is needed in assessing if the intuitive episode matches the incentivized episode, i.e. if training incentivizes cross-episode optimization.
Some training methods directly pressure cross-episode optimization, others don't. Details of training algorithms matter.
Conflating the two concepts can lead to inappropriate assumptions about incentivized time horizons. Empirical testing is important.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

I'll briefly note another complexity that this sort of case raises. Naively, you might've thought that the "specified goal" would only ever be confined to the incentivized episode, because the specified goal is the "thing being rewarded," and anything that causes reward is within the temporal horizon to which the gradients are sensitive. And in some cases – for example, where the "specified goal" is some clearly separable consequence of the model's action (e.g., getting gold coins), which the training process induces the model to optimize for – this makes sense. But in other cases, I'm less sure. For example, if you are sufficiently good at telling whether your model is in fact optimizing for long-term profit, and providing short-term rewards that in fact incentivize it to do so, then I think it's possible that the right thing to say is that the "specified goal," here, is long-term profit (or at least, "optimizing for long-term profit," which looks pretty similar). However, I don't think it ultimately matters much whether we call this sort of goal "specified" or "mis-generalized" (and it's a pretty wooly distinction more generally), so I'm not going to press on the terminology here. ↩︎
Also: when I talk about the gradients being sensitive to the consequences of the model's action over some time horizon, I am imagining that this sensitivity occurs via (1) the relevant consequences occurring, and then (2) the gradients being applied in response. E.g., the model produces an action at t1, this leads to it getting some number of gold coins at t5, then the gradients, applied at t6, are influenced by how many gold coins the model in fact got. (I'll sometimes call this "causal sensitivity.")

But it's possible to imagine fancier and more philosophically fraught ways for the consequences of a model's action to influence the gradients. For example, suppose that the model is being supervised by a human who is extremely good at predicting the consequences of the model's action. That is, the model produces some action at t1, then at t2 the human predicts how many gold coins this will lead to at t5, and applies gradients at t3 reflecting this prediction. In the limiting case of perfect prediction, this can lead to gradients identical to the ones at stake in the first case – that is, information about the consequences of the model's action is effectively "traveling back in time," with all of the philosophical problems this entails. So if, in the first case, we wanted to say that the "incentivized episode" extends out to t5, then plausibly we should say this of the second case, too, even though the gradients are applied at t3. But even in a case of pretty-good-but-still-imperfect prediction, there is a sense, here, in which the gradients the model receives are sensitive to consequences that haven't yet happened.

I'm not, here, going to extend the concept of the "incentivized episode" to cover forms of sensitivity-to-future-consequences that rest on predictions about those consequences. Rather, I'm going to assume that the sensitivity in question arises via normal forms of causal influence. That said, I think the fact that it's possible to create some forms of sensitivity-to-future-consequences even prior to seeing those consequences play out is important to keep in mind. In particular, it's one way in which we might end up training long-horizon optimizers using fairly short incentivized episodes (more discussion below). ↩︎
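
The contrast between the two cases in this footnote can be sketched in a few lines. This is only an illustration under stated assumptions: the environment (`coins_from`), the action names, and the dictionary returned in place of a real gradient step are all hypothetical; the time indices follow the t1 to t6 example above.

```python
def coins_from(action):
    """Hypothetical environment: coins the action actually yields at t5."""
    return {"dig": 3, "wait": 0}[action]

def causal_update(action):
    """Causal sensitivity: the outcome occurs at t5, gradients follow at t6."""
    return {"applied_at": 6, "signal": coins_from(action)}

def predictive_update(action, predictor):
    """Prediction-based sensitivity: forecast at t2, gradients applied at t3."""
    return {"applied_at": 3, "signal": predictor(action)}

def perfect_predictor(action):
    """Limiting case: the supervisor's forecast always matches the outcome."""
    return coins_from(action)
```

With `perfect_predictor`, the two updates carry the same training signal for any action and differ only in when they are applied, which is the sense in which the gradients can be sensitive to consequences that have not yet happened.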
There may be other, additional, and more precise ways of using the term "episode" in the RL literature. Glancing at various links online, though (e.g. here and here), I'm mostly seeing definitions that refer to an episode as something like "the set of states between the initial state and the terminal state," which doesn't say how the initial state and the terminal state are designated. ↩︎
As an example of someone who seems to me like they could be reasoning in this way, though it's not fully clear, see this comment from Eliezer Yudkowsky, in response to a hypothetical in which he imagines humans rewarding an agent for each of its sentences according to how useful that sentence is:

"Let's even skip over the sense in which we've given the AI a long-term incentive to accept some lower rewards in the short term, in order to grab control of the rating button, if the AGI ends up with long-term consequentialist preferences and long-term planning abilities that exactly reflect the outer fitness function."

That said, as I discuss below, the details of the training process here matter. ↩︎
In particular: Paul Christiano, Ryan Greenblatt, and Rohin Shah. Though they don't necessarily endorse my specific claims here, and it's possible I've misunderstood them more generally. ↩︎
My hazy understanding of the argument here is that these RL algorithms update the model's policy towards higher-reward actions on the episode in a way that doesn't update you towards whatever policies would've led to you starting in a higher-reward episode. (In this sense, they behave in a manner analogous to "causal decision theory.") Thus, let's say that the agent on Day 1 (with no previous agent to benefit her) chooses between cooperating (0 reward) and defecting (+1 reward), and so this episode results in an update towards defecting. Then, on Day 2, the agent either starts out choosing between 10 vs. 11 reward (call this a "good episode"), or 0 vs. +1 reward (call this a "bad episode"). Again, either way, it updates towards defection. It doesn't update, in the good episode, towards "whatever policy led me to this episode."

That said, in my current state of knowledge about RL, I'm still a bit confused about this. Suppose, for example, that at the point of choice, you don't know whether or not you're in the good episode or the bad episode, and the training process is updating you with strength proportional to the degree to which you got more reward than you expected to get. If you start out with e.g. 50% that you're in a good episode and 50% that you're in a bad episode (such that the expected reward of cooperating is 5, and the expected reward of defecting is 6), then it seems like it could be the case that being in a good episode results in reward that is much better than you expected, such that policies that make it to a good episode end up reinforced to a greater extent, at least initially.

I'm not sure about the details here. But from my current epistemic state, I would want to spell out and understand the details of the training process in much greater depth, in order to verify that there isn't an incentive towards cross-episode optimization. ↩︎
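
The arithmetic behind this worry can be written out directly. This is a minimal sketch using the payoffs from the example above (10 vs. 11 in a good episode, 0 vs. +1 in a bad one); treating "reward minus expected reward for the chosen action" as the update strength is an assumption made for illustration, not a description of any particular RL algorithm.

```python
P_GOOD = 0.5  # assumed prior probability of being in a good episode

REWARDS = {
    "good": {"cooperate": 10, "defect": 11},
    "bad": {"cooperate": 0, "defect": 1},
}

def expected_reward(action, p_good=P_GOOD):
    """Expected reward of an action when the episode type is unknown."""
    return p_good * REWARDS["good"][action] + (1 - p_good) * REWARDS["bad"][action]

def surprise(episode, action, p_good=P_GOOD):
    """How much better the realized reward is than what was expected."""
    return REWARDS[episode][action] - expected_reward(action, p_good)
```

Here `expected_reward("cooperate")` is 5 and `expected_reward("defect")` is 6, matching the figures in the text. Under this assumed update rule, `surprise("good", "defect")` is +5 and `surprise("bad", "defect")` is -5, so the size of the surprise is driven mostly by which episode the agent landed in rather than by the action it chose within the episode.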
I think the basic dynamic here is: the Q-values for the actions reflect the average reward for taking that action thus far. This makes it possible for the Q-value for "cooperate" to give more weight to the rewards received in the "good episodes" (where the previous episode's agent cooperated) rather than the "bad episodes" (where the previous episode's agent defected), if the agent ends up in good episodes more often. This makes it possible to get a "cooperation equilibrium" going (especially if you set the initial Q-value for defecting low, which I think they do in the paper in order to get this effect), wherein an agent keeps on cooperating. That said, there are subtleties, because agents that end up in a cooperation equilibrium still sometimes explore into defecting, but in the experiment it (sometimes) ends up in a specific sort of balance, with Q-values for cooperation and defection pretty similar, and with the models settling on a 90% or so cooperation probability (more details here and in the paper's appendix). ↩︎
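
To make the "average reward thus far" reading concrete, here is a minimal sketch of a sample-average Q-value update run over a fixed sequence of actions. The payoffs are borrowed from the cooperate/defect example in the earlier footnote, and the environment rule (the previous episode's cooperation makes the current episode "good"), the zero initial Q-values, and the hard-coded action sequences are all assumptions for illustration; this is not the paper's experimental setup and includes no exploration.

```python
def episode_reward(prev_action, action):
    """Assumed toy payoff: a cooperating predecessor makes this episode 'good'
    (+10 to either action); defecting always adds +1."""
    base = 10 if prev_action == "cooperate" else 0
    bonus = 1 if action == "defect" else 0
    return base + bonus

def sample_average_q(actions):
    """Run a fixed action sequence and return sample-average Q-values."""
    q = {"cooperate": 0.0, "defect": 0.0}
    counts = {"cooperate": 0, "defect": 0}
    prev = None  # no previous agent before the first episode
    for action in actions:
        r = episode_reward(prev, action)
        counts[action] += 1
        # Incremental mean: Q <- Q + (r - Q) / n
        q[action] += (r - q[action]) / counts[action]
        prev = action
    return q
```

Under these assumptions, an agent that always cooperates sees its Q-value for "cooperate" climb toward 10 as good episodes come to dominate the average (for example, `sample_average_q(["cooperate"] * 5)`), while an agent that always defects keeps landing in bad episodes and its Q-value for "defect" stays at 1.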
I owe this example to Mark Xu. ↩︎
See also discussion from Carl Shulman here: 'it could be something like they develop a motivation around an extended concept of reproductive fitness, not necessarily at the individual level, but over the generations of training tendencies that tend to propagate themselves becoming more common and it could be that they have some goal in the world which is served well by performing very well on the training distribution.' ↩︎
Indeed, in principle, you could imagine pointing to other, even more abstract and hard-to-avoid "outer loops" as well, as sources of selection pressure towards longer-term optimization. For example, in principle, "grad student descent" (e.g., researchers experimenting with different learning algorithms and then selecting the ones that work best) introduces an additional layer of selection pressure (akin to a hazy form of "meta-learning"), as do dynamics in which, other things equal, models whose tendencies tend to propagate into the future more effectively will tend to dominate over time (where long-term optimization is, perhaps, one such tendency). But these, in my opinion, will generally be weak enough, relative to gradient descent, that they seem to me much less important, and OK to ignore in the context of assessing the probability of schemers. ↩︎
Thanks to Daniel Kokotajlo for flagging this concern. ↩︎
