Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through "dangerous capability evaluations") and the propensity of models to apply their capabilities for harm (through "alignment evaluations"). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.

This is the first great public writeup on model evals for averting existential catastrophe. I think it's likely that if AI doesn't kill everyone, developing great model evals and causing everyone to use them will be a big part of that. So I'm excited about this paper both for helping AI safety people learn more and think more clearly about model evals and for getting us closer to it being common knowledge that responsible labs should use model evals and responsible authorities should require them (by helping communicate model evals more widely, in a serious/legible manner).

Non-DeepMind authors include Jade Leung (OpenAI governance lead), Daniel Kokotajlo (OpenAI governance), Jack Clark (Anthropic cofounder), Paul Christiano, and Yoshua Bengio.

See also DeepMind's related blogpost.

For more on model evals for AI governance, see ARC Evals, including Beth's EAG talk Safety evaluations and standards for AI and the blogpost Update on ARC's recent eval efforts (LW).




Sorted by Click to highlight new comments since:

Side note: what's up with "model evals"? Seems like a jargony term that excludes outsiders.

This Is where I depart from most others: 

1. If you cannot define intelligence generalization scientifically in a complete and measurable way then this is a complete waste of time; you cannot assess risk usefully for something you cannot measure usefully. This is science 101

Here’s our definition at Zeusfyi

We define generalization in the context of intelligence, as the ability to generate learned differentiation of subsystem components, then manipulate, and build relationships towards greater systems level understanding of the universal construct that governs the reality. This is not possible if physics weren’t universal for feedback to be derived. Zeusfyi, Inc is the only institution that has scientifically defined intelligence generalization. The purest test for generalization ability; create a construct with systemic rules that define all possible outcomes allowed; greater ability to predict more actions on first try over time; shows greater generalization; with >1 construct; ability to do same; relative to others.

Curated and popular this week
Relevant opportunities