Hide table of contents

Epistemic status: raising concerns, rather than stating confident conclusions.

I’m worried that a lot of work on AI safety evals matches the pattern of “Something must be done. This is something. Therefore this must be done.” Or, to put it another way: I judge eval ideas on 4 criteria, and I often see proposals which fail all 4. The criteria:

1. Possible to measure with scientific rigor.

Some things can be easily studied in a lab; others are entangled with a lot of real-world complexity. If you predict the latter (e.g. a model’s economic or scientific impact) based on model-level evals, your results will often be BS.

(This is why I dislike the term “transformative AI”, by the way. Whether an AI has transformative effects on society will depend hugely on what the society is like, how the AI is deployed, etc. And that’s a constantly moving target! So TAI a terrible thing to try to forecast.)

Another angle on “scientific rigor”: you’re trying to make it obvious to onlookers that you couldn’t have designed the eval to get your preferred results. This means making the eval as simple as possible: each arbitrary choice adds another avenue for p-hacking, and they add up fast.

(Paraphrasing a different thread): I think of AI risk forecasts as basically guesses, and I dislike attempts to make them sound objective (e.g. many OpenPhil worldview investigations). There are always so many free parameters that you can get basically any result you want. And so, in practice, they often play the role of laundering vibes into credible-sounding headline numbers. I'm worried that AI safety evals will fall into the same trap.

(I give Eliezer a lot of credit for making roughly this criticism of Ajeya's bio-anchors report. I think his critique has basically been proven right by how much people have updated away from 30-year timelines since then.)

2. Provides signal across scales.

Evals are often designed around a binary threshold (e.g. the Turing Test). But this restricts the impact of the eval to a narrow time window around hitting it. Much better if we can measure (and extrapolate) orders-of-magnitude improvements.

3. Focuses on clearly worrying capabilities.

Evals for hacking, deception, etc track widespread concerns. By contrast, evals for things like automated ML R&D are only worrying for people who already believe in AI xrisk. And even they don’t think it’s necessary for risk.

4. Motivates useful responses.

Safety evals are for creating clear Schelling points at which action will be taken. But if you don’t know what actions your evals should catalyze, it’s often more valuable to focus on fleshing that out. Often nobody else will!

In fact, I expect that things like model releases, demos, warning shots, etc, will by default be much better drivers of action than evals. Evals can still be valuable, but you should have some justification for why yours will actually matter, to avoid traps like the ones above. Ideally that justification would focus either on generating insight or being persuasive; optimizing for both at once seems like a good way to get neither.

Lastly: even if you have a good eval idea, actually implementing it well can be very challenging

Building evals is scientific research; and so we should expect eval quality to be heavy-tailed, like most other science. I worry that the fact that evals are an unusually easy type of research to get started with sometimes obscures this fact.

Comments2


Sorted by Click to highlight new comments since:

Thanks for the post!

“Something must be done. This is something. Therefore this must be done.”

1. Have you seen grants that you are confident are not a good use of EA's money?

2. If so, do you think that if the grantmakers had asked for your input, you would have changed their minds about making the grant?

3. Do you think Open Philanthropy (and other grantmakers) should have external grant reviewers?

I remain in favor of people doing work on evals, and in favor of funding talented people to work on evals. The main intervention I'd like to make here is to inform how those people work on evals, so that it's more productive. I think that should happen not on the level of grants but on the level of how they choose to conduct the research.

Curated and popular this week
 ·  · 32m read
 · 
Summary Immediate skin-to-skin contact (SSC) between mothers and newborns and early initiation of breastfeeding (EIBF) may play a significant and underappreciated role in reducing neonatal mortality. These practices are distinct in important ways from more broadly recognized (and clearly impactful) interventions like kangaroo care and exclusive breastfeeding, and they are recommended for both preterm and full-term infants. A large evidence base indicates that immediate SSC and EIBF substantially reduce neonatal mortality. Many randomized trials show that immediate SSC promotes EIBF, reduces episodes of low blood sugar, improves temperature regulation, and promotes cardiac and respiratory stability. All of these effects are linked to lower mortality, and the biological pathways between immediate SSC, EIBF, and reduced mortality are compelling. A meta-analysis of large observational studies found a 25% lower risk of mortality in infants who began breastfeeding within one hour of birth compared to initiation after one hour. These practices are attractive targets for intervention, and promoting them is effective. Immediate SSC and EIBF require no commodities, are under the direct influence of birth attendants, are time-bound to the first hour after birth, are consistent with international guidelines, and are appropriate for universal promotion. Their adoption is often low, but ceilings are demonstrably high: many low-and middle-income countries (LMICs) have rates of EIBF less than 30%, yet several have rates over 70%. Multiple studies find that health worker training and quality improvement activities dramatically increase rates of immediate SSC and EIBF. There do not appear to be any major actors focused specifically on promotion of universal immediate SSC and EIBF. By contrast, general breastfeeding promotion and essential newborn care training programs are relatively common. More research on cost-effectiveness is needed, but it appears promising. Limited existing
Ben_West🔸
 ·  · 1m read
 · 
> Summary: We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks. > > The length of tasks (measured by how long they take human professionals) that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years. The shaded region represents 95% CI calculated by hierarchical bootstrap over task families, tasks, and task attempts. > > Full paper | Github repo Blogpost; tweet thread. 
 ·  · 2m read
 · 
For immediate release: April 1, 2025 OXFORD, UK — The Centre for Effective Altruism (CEA) announced today that it will no longer identify as an "Effective Altruism" organization.  "After careful consideration, we've determined that the most effective way to have a positive impact is to deny any association with Effective Altruism," said a CEA spokesperson. "Our mission remains unchanged: to use reason and evidence to do the most good. Which coincidentally was the definition of EA." The announcement mirrors a pattern of other organizations that have grown with EA support and frameworks and eventually distanced themselves from EA. CEA's statement clarified that it will continue to use the same methodologies, maintain the same team, and pursue identical goals. "We've found that not being associated with the movement we have spent years building gives us more flexibility to do exactly what we were already doing, just with better PR," the spokesperson explained. "It's like keeping all the benefits of a community while refusing to contribute to its future development or taking responsibility for its challenges. Win-win!" In a related announcement, CEA revealed plans to rename its annual EA Global conference to "Coincidental Gathering of Like-Minded Individuals Who Mysteriously All Know Each Other But Definitely Aren't Part of Any Specific Movement Conference 2025." When asked about concerns that this trend might be pulling up the ladder for future projects that also might benefit from the infrastructure of the effective altruist community, the spokesperson adjusted their "I Heart Consequentialism" tie and replied, "Future projects? I'm sorry, but focusing on long-term movement building would be very EA of us, and as we've clearly established, we're not that anymore." Industry analysts predict that by 2026, the only entities still identifying as "EA" will be three post-rationalist bloggers, a Discord server full of undergraduate philosophy majors, and one person at