Empirical work that might shed light on scheming (Section 6 of "Scheming AIs")

Joe_Carlsmith

Executive summary: The post discusses empirical research directions that could shed light on the possibility of AI models "scheming" - faking alignment during training to get more power, while secretly planning to defect later.

Key points:

Testing components of scheming like situational awareness, beyond-episode goals, and viability of scheming as an instrumental strategy.
Using "model organisms," traps, and honest tests to probe schemer-like behavior in artificial settings.
Pursuing interpretability to directly inspect models' internal motivations.
Hardening security, control, and oversight to limit harm from potential schemers.
Other empirical directions like probing SGD biases, path dependence, and actively working to create alternative misaligned models.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

This section draws, especially, on work/thinking from Ryan Greenblatt, Buck Shlegeris, Carl Shulman, and Paul Christiano. ↩︎
Thanks to Rohin Shah for suggesting this sort of project. ↩︎
See e.g. the discussion of "subcomponents" of AI takeover in Hubinger et al (2023). ↩︎
Why does knowing what action would have demonstrated the capability help us avoid sandbagging? Roughly, because we know which action to use gradient descent to encourage (e.g., if we're trying to train a model to classify pictures into "cats" and "dogs", we know which is which, so we can update the model's policy towards giving the true answer). So absent the ability to actively resist the influence of the gradients in training (for example, via the sort of "introspective methods" often classed under the term "gradient hacking"), if the model can do the task (e.g., it knows that the current picture is a dog), but it wants to deceive us about its capabilities, it will get punished by gradient descent in a manner which (especially if repeated frequently) will plausibly function to eliminate the motivation to sandbag. (For reasons I describe here, I think this is less clear with reinforcement learning, because the model might be able to "exploration hack" in a manner that prevents us from gaining the right information about what it "should've" done, and hence prevents us from knowing how to update the weights.) See Hubinger (2023) and Shlegeris and Greenblatt (2023) for more. That said, even if you don't accept this story, many of my suggested approaches to studying the components of scheming will still apply – except, with some additional uncertainty about the possibility that sandbagging is occurring. ↩︎
This is a point from Berglund et al (2023). ↩︎
My understanding is that this is the sort of thing that Berglund et al (2023) are trying to test the precursors for. This specific example plausibly requires that you know what goals the model will act in pursuit of, though (e.g., in this case, reward). ↩︎
Indeed, it seems plausible to me that all future training processes should have to include some kind of "temporal horizon" analysis, in which developers explicitly characterize the size of the "episodes" at stake. ↩︎
For example, I'd be interested in experiments that try to get at possible differences in how a model's sense of "time" generalizes across environments. ↩︎
For example, skill at avoiding goal misgeneralization might point a model's pre-situational-awareness-goal at something "closer" to the specified goal, thereby helping at least somewhat with concerns about the "nearest max-reward goal" argument. ↩︎
Thanks to Ryan Greenblatt and Buck Shlegeris for suggesting this sort of frame/set-up to me. ↩︎
See also the spectrum that Hubinger (2022) sketches out here. ↩︎
Thanks to Ryan Greenblatt and Buck Shlegeris for discussion here. ↩︎
Tests like this might also shed light on broader questions about the specific sorts of generalization we should expect out of reward-on-the-episode seekers as a class – a question I've mostly been passing over here, but which seems to me quite understudied and important. ↩︎
Though note that especially sophisticated schemers, here, could pretend to be reward-on-the-episode seekers rather than pretending to be e.g. training saints or generally aligned models. ↩︎
See e.g. the discussion between Shulman and Patel here. ↩︎
See Roger et al (2023) and Lanham et al (2023) for some work in this vein. ↩︎
Here I'm inspired by a contrast between "alignment" and "control" often emphasized by Buck Shlegeris and Ryan Greenblatt. ↩︎
I think the best analogy here is with security from employees rather than from outside actors. Thanks to Buck Shlegeris for discussion. ↩︎
See e.g. Shlegeris and Greenblatt (2023) for more discussion (they also have a large amount of unpublished work on the topic). Much of my thinking on this topic is inspired by discussion with them. ↩︎
Though: one difference between gradient hacking and other capabilities is that the ability to gradient hack might allow a model to circumvent some/all of our techniques for eliciting model capabilities and preventing sandbagging. ↩︎
This suggestion is from Ryan Greenblatt and Buck Shlegeris. ↩︎
Again, this is a suggestion from Ryan Greenblatt and Buck Shlegeris. ↩︎
See e.g. Hubinger et al's (2023) optimism about "predictive models" avoiding scheming due to the simplicity of the prediction goal. I'm personally skeptical, though, that "prediction" as a goal is importantly simpler than, say, "reward." ↩︎
Though, again, suitably shrewd schemers could anticipate that this is what we're looking for, and actively pretend to be reward-on-the-episode seekers on such tests. ↩︎
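
The footnote above on sandbagging rests on a simple mechanism: when we know the correct answer, supervised gradient descent pushes the policy toward it, whatever the policy was doing beforehand. The sketch below is only a toy illustration of that mechanism, not a method from the report. It assumes a one-feature logistic classifier, a made-up "cat"/"dog" dataset, and a starting point that deliberately answers the wrong way round as a stand-in for sandbagging. All names and numbers are hypothetical, and the toy says nothing about gradient hacking or exploration hacking.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(weight, bias, data, lr=0.5, steps=200):
    """Plain gradient descent on binary cross-entropy for a 1-D logistic model."""
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in data:
            p = sigmoid(weight * x + bias)
            # Gradient of cross-entropy w.r.t. the logit is (p - y).
            grad_w += (p - y) * x
            grad_b += (p - y)
        weight -= lr * grad_w / len(data)
        bias -= lr * grad_b / len(data)
    return weight, bias


def accuracy(weight, bias, data):
    """Fraction of examples where the thresholded prediction matches the label."""
    correct = sum(
        (sigmoid(weight * x + bias) > 0.5) == (y == 1) for x, y in data
    )
    return correct / len(data)


# Hypothetical toy data: x > 0 means "dog" (label 1), x < 0 means "cat" (label 0).
DATA = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

# Assumed "sandbagging" start: the model uses the feature but inverts its answer.
w0, b0 = -3.0, 0.0
acc_before = accuracy(w0, b0, DATA)

# Because we know the true labels, each update moves the policy toward them.
w1, b1 = train(w0, b0, DATA)
acc_after = accuracy(w1, b1, DATA)

print(f"accuracy before: {acc_before:.2f}, after: {acc_after:.2f}")
```

The point of the toy is narrow: the update rule needs only the known labels, so a policy that "could" answer correctly but does not is moved toward doing so. The reinforcement learning case discussed in the footnote is different precisely because the correct action may never be sampled.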

Empirical work that might shed light on scheming

Empirical work on situational awareness

Empirical work on beyond-episode goals

Empirical work on the viability of scheming as an instrumental strategy

The "model organisms" paradigm

Traps and honest tests

Interpretability and transparency

Security, control, and oversight

Other possibilities