AI companies aren't really using external evaluators

Zach Stein-Perlman

AI companies aren't really using external evaluators

Zach Stein-Perlman

4 min readMay 26, 2024

Comments 4

Sorted by

New & upvoted

Toby Tremlett🔹

I'm curating this post.
I wrote a draft for a feature on a politico piece for the EA newsletter, exploring this same question-- are big labs following through on verbal commitments to share their models with external evaluators? Despite taking "several months" to speak with experts- the politico piece didn't have as much useful information as this blog post. I cut the feature, because I couldn't find as much information in the time I had.
I think work like this is really valuable, filling a serious gap in our understanding of AI Safety. Thanks for writing this Zach!

phgubbins

Cross-commenting from lesswrong for future reference:

I had an opportunity to ask an individual from one of the mentioned labs about plans to use external evaluators and they said something along the lines of:

“External evaluators are very slow - we are just far better at eliciting capabilities from our models.”

They earlier said something much to the same effect when I asked if they’d been surprised by anything people had used deployed LLMs for so far, ‘in the wild’. Essentially, no, not really, maybe even a bit underwhelmed.

SummaryBot

Executive summary: Despite claims, AI companies are not consistently using external evaluators to assess their models for dangerous capabilities before deployment, which could improve risk assessment and provide public accountability.

Key points:

External evaluators like METR, UK AISI, and Apollo are not being given sufficient pre-deployment access to models to effectively evaluate risks.
Evaluators should be given deeper access than end users, including versions without safety filters or fine-tuning, to better assess threats from both deployment and potential leaks.
The costs and challenges of providing external evaluator access are unclear, but likely surmountable.
Some AI labs have committed to external "red teaming" but this is not a substitute for comprehensive model evaluation by dedicated experts.
AI labs should also provide deeper model access to safety researchers to support their work, potentially without fully releasing weights.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Nayanika

-8

Well I am into training AI models and working as an external evaluator. So I can say people are working. However I never really understood the claims of rising cost in AI training coz we hardly get paid here, atleast in India. P.S: The claims referred to in here are from elsewhere.

Comments

More from the author

220

FLI open letter: Pause giant AI experiments

Zach Stein-Perlman·3y ago·3m read

134

Maybe Anthropic's Long-Term Benefit Trust is powerless

Zach Stein-Perlman·2y ago·3m read

128

Introducing AI Lab Watch

Zach Stein-Perlman·2y ago·2m read

Curated and popular this week

Counting animals: Stable population size is not equivalent to priority level

abrahamrowe, mal_graham🔸·1w ago·Curated 6d ago·16m read

AI Use Note: Main body text entirely human written. Claude (Opus 4.8) helped develop models of animal life histories in the appendix. Cross-posted from Good Structures. Executive Summary * Animal advocates sometimes make claims like “there are X of this animal...

How (not) to fundraise from Anthropic staff

Jack Lewars·1w ago·7m read

Adapted from my Substack, Funding Anthropalypse. Short version: if you want a share of the coming Anthropic and OpenAI windfall - the $37bn+ that could be in play next year - the way in is to become 'legibly excellent', so the evaluators and donors that frontier lab staff already trust point them to yo...

If you're agentic, work in biosecurity

sharmaayushmaan🔸·4d ago·7m read

Disclaimer: Although I work on the Groups Team at CEA, I’m writing this in a personal capacity, and this post does not constitute an endorsement by CEA. Agency - the realisation that you really can just do things. TL;DR Biosecurity needs people (of any background) who are agentic and have a high execution velocity and track record....

Recent opportunities to take action

Marginal Victories: career advising and opportunities for U.S. democracy preservation & political work

Annika Burman 🔸·2d ago·2m read

I'm stepping down as Hive's Executive Director, and we're hiring my successor

SofiaBalderson, Hive·2d ago·3m read

Starting an EA group @ SUNY Binghamton

micahzarin·1d ago·1m read

^{^}

METR's homepage says:

We have previously worked with Anthropic, OpenAI, and other companies to pilot some informal pre-deployment evaluation procedures. These companies have also given us some kinds of non-public access and provided compute credits to support evaluation research.
We think it’s important for there to be third-party evaluators with formal arrangements and access commitments - both for evaluating new frontier models before they are scaled up or deployed, and for conducting research to improve evaluations.
We do not yet have such arrangements, but we are excited about taking more steps in this direction.

^{^}

GovAI: Schuett et al. 2023. See also DSIT 2023, Brundage et al. 2020, AI Safety Summit 2023, and Anthropic 2024.

^{^}

Idea: when sharing a model for external evals or red-teaming, for each mitigation (e.g. harmlessness fine-tuning or filters), either disable it or make it an explicit part of the safety case for the model. Either claim "users can't effectively jailbreak the model given the deployment protocol" or disable. Otherwise the lab is just stopping the bioengineering red-teamers from eliciting capabilities with mitigations that won't work against sophisticated malicious users.

^{^}

A previous version of this post omitted discussion of external testing of Gemini 1.5 Pro. Thanks to Mary Phuong for pointing out this error.

^{^}

Politico and UK government press releases report that AI labs committed to share pre-deployment access with UK AISI. I suspect they are mistaken and these claims trace back to the UK AI safety summit "safety testing" session, which is devoid of specific commitments. I am confused about why the labs have not clarified their commitments and practices.

^{^}

See Shevlane 2022. See also Bucknall and Trager 2023 and Casper et al. 2024.