[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij

2 min read · Jun 13, 2024

Comments 2

Andrew Gimber

Is there a typo in the first figure? I think the answer to the MMLU (top) question should be B, not A, because the greatest common factor of 36 and 90 is 18, not 9. (Not, of course, the central point of your paper/post, but it tripped me up when reading.)

Teun van der Weij

Ha, you're clearly right. We will fix it.
