Hide table of contents

Epistemic status: raising concerns, rather than stating confident conclusions.

I’m worried that a lot of work on AI safety evals matches the pattern of “Something must be done. This is something. Therefore this must be done.” Or, to put it another way: I judge eval ideas on 4 criteria, and I often see proposals which fail all 4. The criteria:

1. Possible to measure with scientific rigor.

Some things can be easily studied in a lab; others are entangled with a lot of real-world complexity. If you predict the latter (e.g. a model’s economic or scientific impact) based on model-level evals, your results will often be BS.

(This is why I dislike the term “transformative AI”, by the way. Whether an AI has transformative effects on society will depend hugely on what the society is like, how the AI is deployed, etc. And that’s a constantly moving target! So TAI a terrible thing to try to forecast.)

Another angle on “scientific rigor”: you’re trying to make it obvious to onlookers that you couldn’t have designed the eval to get your preferred results. This means making the eval as simple as possible: each arbitrary choice adds another avenue for p-hacking, and they add up fast.

(Paraphrasing a different thread): I think of AI risk forecasts as basically guesses, and I dislike attempts to make them sound objective (e.g. many OpenPhil worldview investigations). There are always so many free parameters that you can get basically any result you want. And so, in practice, they often play the role of laundering vibes into credible-sounding headline numbers. I'm worried that AI safety evals will fall into the same trap.

(I give Eliezer a lot of credit for making roughly this criticism of Ajeya's bio-anchors report. I think his critique has basically been proven right by how much people have updated away from 30-year timelines since then.)

2. Provides signal across scales.

Evals are often designed around a binary threshold (e.g. the Turing Test). But this restricts the impact of the eval to a narrow time window around hitting it. Much better if we can measure (and extrapolate) orders-of-magnitude improvements.

3. Focuses on clearly worrying capabilities.

Evals for hacking, deception, etc track widespread concerns. By contrast, evals for things like automated ML R&D are only worrying for people who already believe in AI xrisk. And even they don’t think it’s necessary for risk.

4. Motivates useful responses.

Safety evals are for creating clear Schelling points at which action will be taken. But if you don’t know what actions your evals should catalyze, it’s often more valuable to focus on fleshing that out. Often nobody else will!

In fact, I expect that things like model releases, demos, warning shots, etc, will by default be much better drivers of action than evals. Evals can still be valuable, but you should have some justification for why yours will actually matter, to avoid traps like the ones above. Ideally that justification would focus either on generating insight or being persuasive; optimizing for both at once seems like a good way to get neither.

Lastly: even if you have a good eval idea, actually implementing it well can be very challenging

Building evals is scientific research; and so we should expect eval quality to be heavy-tailed, like most other science. I worry that the fact that evals are an unusually easy type of research to get started with sometimes obscures this fact.

38

2
2

Reactions

2
2

More posts like this

Comments2


Sorted by Click to highlight new comments since:

Thanks for the post!

“Something must be done. This is something. Therefore this must be done.”

1. Have you seen grants that you are confident are not a good use of EA's money?

2. If so, do you think that if the grantmakers had asked for your input, you would have changed their minds about making the grant?

3. Do you think Open Philanthropy (and other grantmakers) should have external grant reviewers?

I remain in favor of people doing work on evals, and in favor of funding talented people to work on evals. The main intervention I'd like to make here is to inform how those people work on evals, so that it's more productive. I think that should happen not on the level of grants but on the level of how they choose to conduct the research.

Curated and popular this week
trammell
 ·  · 25m read
 · 
Introduction When a system is made safer, its users may be willing to offset at least some of the safety improvement by using it more dangerously. A seminal example is that, according to Peltzman (1975), drivers largely compensated for improvements in car safety at the time by driving more dangerously. The phenomenon in general is therefore sometimes known as the “Peltzman Effect”, though it is more often known as “risk compensation”.[1] One domain in which risk compensation has been studied relatively carefully is NASCAR (Sobel and Nesbit, 2007; Pope and Tollison, 2010), where, apparently, the evidence for a large compensation effect is especially strong.[2] In principle, more dangerous usage can partially, fully, or more than fully offset the extent to which the system has been made safer holding usage fixed. Making a system safer thus has an ambiguous effect on the probability of an accident, after its users change their behavior. There’s no reason why risk compensation shouldn’t apply in the existential risk domain, and we arguably have examples in which it has. For example, reinforcement learning from human feedback (RLHF) makes AI more reliable, all else equal; so it may be making some AI labs comfortable releasing more capable, and so maybe more dangerous, models than they would release otherwise.[3] Yet risk compensation per se appears to have gotten relatively little formal, public attention in the existential risk community so far. There has been informal discussion of the issue: e.g. risk compensation in the AI risk domain is discussed by Guest et al. (2023), who call it “the dangerous valley problem”. There is also a cluster of papers and works in progress by Robert Trager, Allan Dafoe, Nick Emery-Xu, Mckay Jensen, and others, including these two and some not yet public but largely summarized here, exploring the issue formally in models with multiple competing firms. In a sense what they do goes well beyond this post, but as far as I’m aware none of t
 ·  · 1m read
 · 
 ·  · 19m read
 · 
I am no prophet, and here’s no great matter. — T.S. Eliot, “The Love Song of J. Alfred Prufrock”   This post is a personal account of a California legislative campaign I worked on March-June 2024, in my capacity as the indoor air quality program lead at 1Day Sooner. It’s very long—I included as many details as possible to illustrate a playbook of everything we tried, what the surprises and challenges were, and how someone might spend their time during a policy advocacy project.   History of SB 1308 Advocacy Effort SB 1308 was introduced in the California Senate by Senator Lena Gonzalez, the Senate (Floor) Majority Leader, and was sponsored by Regional Asthma Management and Prevention (RAMP). The bill was based on a report written by researchers at UC Davis and commissioned by the California Air Resources Board (CARB). The bill sought to ban the sale of ozone-emitting air cleaners in California, which would have included far-UV, an extremely promising tool for fighting pathogen transmission and reducing pandemic risk. Because California is such a large market and so influential for policy, and the far-UV industry is struggling, we were seriously concerned that the bill would crush the industry. A partner organization first notified us on March 21 about SB 1308 entering its comment period before it would be heard in the Senate Committee on Natural Resources, but said that their organization would not be able to be publicly involved. Very shortly after that, a researcher from Ushio America, a leading far-UV manufacturer, sent out a mass email to professors whose support he anticipated, requesting comments from them. I checked with my boss, Josh Morrison,[1] as to whether it was acceptable for 1Day Sooner to get involved if the partner organization was reluctant, and Josh gave me the go-ahead to submit a public comment to the committee. Aware that the letters alone might not do much, Josh reached out to a friend of his to ask about lobbyists with expertise in Cal