Does scheming lead to adequate future empowerment? (Section 2.3.1.2 of "Scheming AIs")

Joe_Carlsmith

Executive summary: The classic goal-guarding story for why AI systems would "scheme" during training faces non-obvious challenges regarding both whether training-gaming can sufficiently guard goals and whether the future payoff from scheming will be adequate.

Key points:

It's unclear if training-gaming can guard goals well enough given ongoing reward modifications and the irrelevance of precise goal content.
The payoff requires not just goal survival but also probable and impactful escape/takeover on a timescale the model cares about. This depends on many uncertain factors.
The relative value of scheming depends partly on how much the model stands to gain from not scheming, which varies based on factors like goal ambition.
There are open questions around necessary goal time horizons and whether default goals will be highly ambitious.
The challenges don't decisively refute the story but highlight the need to clarify the necessary conditions.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Thanks to Ryan Greenblatt for discussion here. ↩︎
Recall that forms of deployment like "interacting with users behind an API" can count as "training" on my definition. And naively, absent escape, to me it seems hard to create all that many paperclips via such interactions. ↩︎
Thanks to Daniel Kokotajlo for discussion here. ↩︎
At the least, that's the direction that human longtermists went (though plausibly, many humans with long-term goals are better thought of as having intrinsic discount rates). ↩︎
That is, a story on which the temporally-impartial goals would need to arise naturally, rather than specifically in order to cause scheming. As I discussed above, if SGD is actively pulling whatever goals would motivate scheming out of the model, then we should expect those goals to have whatever temporal horizons are necessary. ↩︎
And of course, there are intermediate levels of empowerment in between "roaming the internet" and "part of a post-AI-takeover regime," too. ↩︎
In the limit, this sort of reasoning raises the sorts of questions discussed, in philosophy, under the heading of "fanaticism." ↩︎
Or as another version of the paperclip example: what happens if the model also values something else, other than paperclips, and which points in a direction other than training-gaming? For example, suppose that in addition to wanting to maximize paperclips, our model also values pursuing its curiosity on the episode, even when doing so conflicts with getting reward. So now, faced with the chance to pursue its curiosity on the episode, or to optimize for reward as part of a low-probability-of-success schemer-like strategy for making paperclips later, it needs to weigh up the respective value it places on each. Will scheming still win out? It depends on exactly how the model's competing values are structured. But even with such large numbers of paperclips at stake, the model need not accord its paperclip drive overwhelming priority. ↩︎
Of course, especially in a low-slack training regime that exerts strong pressure against max-reward goals, if this model's calculus here leads it to engage in not-highly-rewarded behavior, its goals may get modified to point at higher-reward targets. And one such modification could be to move the model in more schemer-like directions. But the modification could go in other directions as well. ↩︎
Though: it's also less clear what happens when these different motivations do come apart – for example, with respect to early undermining, or with escape/take-over attempts that will compromise short-term widget-making. ↩︎
Thanks to Daniel Kokotajlo for discussion. And see Karnofsky (2022) for use of the term "ambitious" here. I think Ngo et al.'s (2023) use of "broad scope" is trying to point at something similar. ↩︎
See Chapter 8, p. 149. ↩︎
Of course, as previously discussed, models might have sufficiently little power within the episode that a strategy that focuses on beyond-episode power is worth it even if the expected amount of power to be gained is quite low (for example: the model wants to make at least one paperclip, but it can't make any within the episode). Or, relatedly, it may be that most of what the model wants within the episode can be gained via playing the training game (e.g., the training process just wants it to make the initial efforts to solve the math problem), and to the extent it cares about stuff beyond the episode (e.g., checking and re-checking the problem later), this care can be fairly seamlessly channeled into instrumental-training-gaming. ↩︎
Here, again, setting aside "anthropic capture." ↩︎

Adequate future empowerment

When is the "pay off" supposed to happen?

Even if the model's values survive this generation of training, will they survive long enough to escape the threat of modification?

Will escape/take-over be suitably likely to succeed?

Will the time horizon of the model's goals extend to cover escape/take-over?

Will the model's values get enough power after escape/takeover?

How much does the model stand to gain from not training-gaming?

How "ambitious" is the model?

Overall assessment of the classic goal-guarding story