Two sources of beyond-episode goals (Section 2.2.2 of “Scheming AIs”)

See, for example, the discussion in Cotra (2021). ↩︎
My sense is that some ways of thinking about scheming will treat the second option, here, as low-probability, especially if the temporal gap between situational awareness and training-gaming is quite large (here I'm particularly thinking about the sort of analysis given in Hubinger (2022) – though Hubinger doesn't endorse the claims I have in mind, here, in particular). In particular, you might assume (a) that once the model develops situational awareness, it will fairly quickly start optimizing either for the specified goal, or for reward-on-the-episode (whether terminally or instrumentally) – since it now understands enough about the training process to do this directly, and doing so will be maximally rewarded. And then, further, you might assume (b) that after that, the model's goals "crystallize" – that is, because the model is now pursuing a max-reward goal, its goal stops changing, and training proceeds to only improve its world model and capabilities. However, I don't want to assume either of these things here. For example, I think it's possible that "slack" in training allows models to continue to pursue less-than-max-reward goals even well after developing situational awareness; and possible, too, that max-reward-goals do not "crystallize" in the way assumed here (though in that case, I think the case for goal-guarding scheming is also weaker more generally – see below). ↩︎
Thanks to Daniel Kokotajlo and Evan Hubinger for discussion here. ↩︎
Though it's an important further question whether long-term power-seeking strategies will be worth their costs in pursuit of such beyond-episode consequences. And note that if the model cares that "I" solve the math problem, rather than just "that the math problem be solved," then ↩︎
Thanks to Jason Schukraft for flagging this sort of question to me. ↩︎
There are some even hazier connections, here, with discussions of "simplicity biases" below. E.g., these humans sometimes argue for their positions on the grounds that the relevant philosophical views are "simpler." ↩︎
Though if it hasn't yet started training-gaming in pursuit of this goal (despite its situational awareness), such adversarial training could still make a difference. ↩︎
As an example of an analysis that focuses on this threat model, see Hubinger's (2022) discussion of deceptive alignment in a high path-dependence world. In particular: "SGD makes the model's proxies into more long-term goals, resulting in it instrumentally optimizing for the training objective for the purpose of staying around." ↩︎
We can imagine cases where SGD "notices" the benefits of creating both beyond-episode goals and situational awareness all at once – but this seems to me especially difficult from the perspective of the "incrementalism" considerations discussed below, and not obviously importantly different regardless, so I'm going to skip it. ↩︎
Thanks to Paul Christiano for discussion of this point. ↩︎
See, e.g., Hubinger's (2022) simplicity analysis. ↩︎
This was a point suggested to me by Richard Ngo, though he may not endorse the way I've characterized it here. ↩︎
Thanks, again, to Paul Christiano for discussion here. ↩︎
This fits somewhat with a picture on which neural networks succeed by "doing lots of things at once," and then upweighting the best-performing things (perhaps the "lottery ticket hypothesis" is an example of something like this?). This picture was suggested to me in conversation, but I haven't investigated it. ↩︎
This is also one of the places where it seems plausible to me that thinking more about "mixed models" – i.e., models that mix together schemer-like motivations with other motivations – would make a difference to the analysis. ↩︎

Two sources of beyond-episode goals

Training-game-independent beyond-episode goals

Are beyond-episode goals the default?

How will models think about time?

The role of "reflection"

Pushing back on beyond-episode goals using adversarial training

Training-game-dependent beyond-episode goals

Can gradient descent "notice" the benefits of turning a non-schemer into a schemer?

Is SGD pulling scheming out of models by any means necessary?