Today, PauseAI and the Existential Risk Observatory release TakeOverBench.com: a benchmark, but for AI takeover.

There are many AI benchmarks, but this is the one that really matters: how far are we from a takeover, possibly leading to human extinction?

In 2023, the broadly coauthored paper "Model evaluation for extreme risks" defined the following nine dangerous capabilities: Cyber-offense, Deception, Persuasion & manipulation, Political strategy, Weapons acquisition, Long-horizon planning, AI development, Situational awareness, and Self-proliferation. We think progress in all of these domains is worrying, and it is even more worrying that combinations of these capabilities add up to AI takeover scenarios (existential threat models).

Using SOTA benchmark data, to the degree it is available, we track how far we have progressed along the trajectory towards the end of human control. We highlight four takeover scenarios and track the dangerous capabilities needed for them to become a reality.

Our website aims to be a valuable source of information for researchers, policymakers, and the public. At the same time, we want to highlight gaps in the current research:

  • For many leading benchmarks, we simply don't know how the latest models score. RepliBench, for example, hasn't been run against new models for almost a year. We need more efforts to run existing benchmarks against newer models!
  • AI existential threat models have received only limited serious academic attention, which we think is a very poor state of affairs (the Existential Risk Observatory, together with MIT FutureTech and FLI, is currently trying to mitigate this situation with a new threat model research project).
  • Even if we had accurate threat models, we currently don’t know exactly where the capability red lines (or red regions, given uncertainty) are. Even if we had accurate red lines/regions, we don’t always reliably know how to measure them with benchmarks.

Despite all these uncertainties, we think it is constructive to center the discussion on the concept of an AI takeover, and to present the knowledge that we do have on this website.

We hope that TakeOverBench.com contributes to:

  • Raising awareness.
  • Grounding takeover scenarios in objective data.
  • Providing accessible information for researchers, policymakers, and the public.
  • Highlighting gaps in research on takeover scenarios, red lines, and benchmarks.

TakeOverBench.com is an open source project, and we invite everyone to comment and contribute on GitHub.

Enjoy TakeOverBench!

Comments

What scale is the METR benchmark on? I see a note saying "Scores are normalized such that 100% represents a 50% success rate on tasks requiring 8 human-expert hours.", but is the 0% point on the scale 0 hours?

METR does not think that 8 human hours is sufficient autonomy for takeover; in fact 40 hours is our working lower bound.

METR has an official internal view on what time horizons correspond to "takeover not ruled out"? 

See the GPT-5 report. "Working lower bound" is maybe too strong; it's probably more accurate to describe it as an initial guess at a warning threshold for rogue replication and 10x uplift (if we can even measure time horizons that long). I don't know the exact reasoning behind 40 hours, but one relevant fact is that humans can't really start viable companies using plans that only take a ~week of work. IMO, if AIs could do the equivalent with only a 40-human-hour time horizon and continuously evade detection, they'd need to exploit their own advantages and have made up for many of their current disadvantages relative to humans (like being bad at adversarial and multi-agent settings).

Indeed, the 0% point is zero hours, so compared to the METR plot the time horizon is simply divided by 8 hours.
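
For concreteness, a minimal sketch of that mapping (the function name and the assumption of a purely linear, uncapped scale are illustrative, not the project's actual code):

```python
def takeoverbench_score(horizon_hours: float) -> float:
    """Map a METR 50%-success time horizon (in hours) onto the site's scale:
    0 hours -> 0%, 8 hours -> 100%, linear in between (assumed)."""
    return 100.0 * horizon_hours / 8.0
```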

I agree the 8 hours is somewhat arbitrary, and I had missed that METR had a more 'official' stance on it. I've now opened an issue to see whether anyone else has reasons for choosing 8 hours.

(For context: I did most of the benchmark literature review and data collection for this project.)

Could you please explain your reasoning on 40 hours?

Cool website!

One question: why are most of the SOTA models Claude? Is it because Anthropic is the company that releases the most data about their models? I thought that by most measures, Gemini would be the SOTA model today.

Thanks!

To the largest degree possible, we have collected data from public leaderboards and system cards, and Gemini models do seem to be a bit underrepresented there. I'm not sure why, but Anthropic releasing more data is definitely part of it. For example, the most recent CyBench data points come from the Claude (and Grok) system cards, and for the virology test there are data points in the Claude Opus 4.5 system card but not for Gemini 3.0 Pro.
