Note: This post was crossposted from Planned Obsolescence by the Forum team, with the author's permission. The author may not see or respond to comments on this post.
Revisiting a prediction ten months early
On Jan 14th, I made predictions about AI progress in 2026. My forecasts for software engineering capabilities already feel much too conservative.
In my view, METR (where I now work) has some of the hardest and highest-quality software engineering and ML engineering benchmarks out there, and the most useful framework for making benchmark performance intuitive: we measure a task’s difficulty by the amount of time a human expert would take to complete it (called the “time horizon”).[1]
When I made my forecasts last month, the model with the longest measured time horizon on METR’s suite of software engineering tasks was Claude Opus 4.5; it could succeed around half the time at software tasks that would take a human software engineer about five hours.[2] Time horizons on software tasks had been doubling a little less than twice a year from 2019 through 2025, which would have implied the state-of-the-art 50% time horizon should be somewhat less than 20 hours by the end of 2026.[3] But there was ambiguity about whether the more recent doubling time was faster than the long-run trend, so I bumped that up to 24 hours for my median guess.[4] My 20th percentile was around 15 hours and my 80th percentile was around 40 hours.
Now, Opus 4.6 (released only 2.5 months after Opus 4.5) was estimated to have a 50% time horizon of ~12 hours.[5] I don’t take the specific number literally — there are many fewer very-long tasks than medium and short tasks, and the long tasks more often have guesstimated (rather than measured) human completion times, so time horizon estimates for the latest models are a lot noisier than they were in 2025. And the benchmark underlying the time horizon graph is nearly saturated, which causes the confidence intervals to blow up: the 95% CI is 5.3 hours to 66 hours.[6] It’s really hard to discriminate between different capability levels at the current range.
But at the end of the day, that dataset had 19 software engineering tasks estimated[7] to take humans longer than 8 hours, and Opus 4.6 was able to solve 14 of them at least some of the time (and it reliably nailed four of them).[8] And beyond just this one task suite, we’ve seen examples of AI agents doing certain very well-specified software tasks like writing a browser or C compiler, or porting a giant game, that would take humans many weeks or months to do on their own — not perfectly, but better than most people expected and better than a naive reading of the agents’ measured time horizon would have suggested.
And this happened in February. It’s no longer very plausible that after ten whole months of additional progress at the recent blistering pace,[9] AI agents would still struggle half the time at 24-hour tasks.
I wish them the best, but I think my colleagues on the capability evaluations team at METR might struggle to create new software tasks from a similar distribution capable of measuring AI agents’ true time horizons through the end of the year. If we could measure this, I’d guess that by the end of the year, AI agents will have a time horizon of over 100 hours on the sorts of software tasks in METR’s suite (which are not very precisely specified — on certain extremely well-specified software tasks like the examples above, agents seem to already have a time horizon of more than a hundred hours).
And once you’re talking about multiple full-time-equivalent weeks of work, I wonder if the whole concept of “time horizon” starts to break down.
It’s nearly impossible to subdivide a typical one-hour task (e.g., debugging one failing test) into smaller pieces that multiple people can work on in parallel. It wouldn’t go very well if you had to farm out writing this print statement or reading that error message or tweaking this line of code to different people — the right action to take next depends intimately on everything that came before it and on the precise state of the code as a whole; you have to hold the whole context in your mind as you take each action, or the actions won’t cohere in the right way.
It’s somewhat easier to decompose an eight hour task (e.g., writing a simple browser game) into smaller components, but those components are constantly bleeding into each other in ways that make clean handoffs hard. When you’re implementing the game logic, you realize it needs to know something about how the graphics are rendered. When you’re handling user input, you find yourself tweaking the game loop. The fastest way to do it is probably one person knocking it out in a day, making a hundred small decisions fluidly as they go.
But it’s actually pretty feasible to break down a month-long task into smaller pieces. In fact, you may start benefiting from some explicit decomposition — it might be helpful to write a design doc laying out how the pieces fit together, or break the work into tickets so you don’t lose track of what’s done and what’s left. And while it might take one person working alone a month to complete, the fastest way to get it done might be to have different people work on different pieces like the checkout flow or the inventory management panel in parallel.
And of course, tasks that take multiple full-time-equivalent years of work nearly always can and should be broken down into smaller milestones, and parallelized across multiple teammates. Human work appears to get more and more decomposable the longer and longer it gets.
In other words, very few tasks feel intrinsically like year-long tasks, the way that writing one bash command feels like an intrinsically one-second task, or debugging one simple bug feels intrinsically like a one-hour task. Maybe a mathematician banging their head against a hard conjecture for a year before finally making a breakthrough is a “real” year-long task? But most many-person-month software projects in the real world sort of feel like they might be a bunch of few-week tasks in a trenchcoat, the way that a hundred-question elementary school math test is really 100 thirty-second tasks in a trenchcoat.
If an extreme version of that is true, then once AI agents can consistently do (say) 80-hour tasks, they should be able to make continuous progress on projects of arbitrary scale. Maybe manager AIs can spend their work week figuring out how to farm out the current project goal to line-worker AIs, line-workers can execute on their individual piece, and all the AIs can maintain good enough records that no individual agent needs to build up holistic, long-term state on the whole project.
I think this probably won’t fully work. Even projects with lots of formalized goal-tracking still benefit a lot from everyone involved intuitively appreciating the bigger picture in a way that isn’t fully captured in Jira tickets and Asana tasks. Decomposing a 6 month project so cleanly and precisely that it can be executed by a team of people with no such holistic context might itself be, say, a 2 month task.
But it might work surprisingly well for a surprisingly large class of software projects. AI agents are a lot cheaper and a lot more patient than humans, so it could be practical to get them to do far, far more task-tracking, documentation, and other project management than human teams ever do. People have already started aggressively experimenting with scaffolding for orchestrating agent teams. It’s not clear how far it will go over the next several months.
This is why my colleague Tom proposed that the calendar time it takes a large team of humans to do a task might be a better proxy for “intrinsic difficulty” than the time it takes one human working alone. So far, “team time” and “solo time” have been very similar in the METR task suite, since it has ranged from 1-second tasks to maybe 20-hour tasks. But we’re entering the regime where these numbers could rapidly diverge. If Tom’s conjecture is true, the “solo time” metric should start going super-exponential about now… which makes it very hard to bound software engineering capabilities by the end of the year.
In my predictions last month, my probability that AI R&D would be fully automated by the end of the year — AIs taking care of all the research ideation and implementation, no humans necessary[10] — was 10%. After I published that piece, I heard from a few others in the AI forecasting space (including those I generally think of as more bullish on AI timelines than I am) that it seemed a bit high. But now ten percent feels like it’s in the right ballpark again.
Fully automating AI R&D still seems like a tall order. Even fully automating software engineering seems like it requires an aggressive read of the evidence, and AI R&D is not just software engineering — it seems like automating it would require a surprising amount of progress on “research judgment” and “creativity” and other ephemeral skills that AI systems still appear to be worse at than human researchers. I think it’s a lot more likely in the coming three to five years than this year.
But for the first time, I don’t see any solid trend we can extrapolate to say it won’t happen soon.[11] AI R&D really could be automated this year.
- ^
It takes a bit of taste and judgment to decide what a task’s “time horizon” is, and it’s possible to game the metric to make it meaningless. Consider the “task” of taking a giant math test consisting of 10,000 easy elementary school word problems — this is obviously 10,000 thirty-second tasks, rather than one 83-hour task. Or to take an example suggested by my colleague Tom Cunningham, consider the task “Count how many times a horse is mentioned in Anna Karenina.” This might take a single human 10 hours, but it’s highly parallelizable: a team of 300 people could each take one page and do the task in a minute. In the METR task suite, “human time to complete” works as a good proxy for “intrinsic” difficulty because the tasks are constructed to be hard to easily decompose and parallelize. Within each task, every piece depends on every other, and you benefit from keeping the whole context in mind. More on that later in the post.
- ^
Actually, at the time, METR was measuring models on its original time horizon suite (TH 1.0), and the precise central estimate for Opus 4.5 was ~4h48min on that distribution. But since then, METR has released an updated suite with more tasks (TH 1.1) which caused all models' scores to shift slightly; the measured time horizon for Opus 4.5 on TH 1.1 is ~5h20min.
- ^
The 2019-2025 doubling time calculated in the original paper was 212 days, or 0.58 years. That means that if the time horizon at the beginning of January was 5 hours, the time horizon at the end of the year should be 5 * exp(ln(2) * 1/0.58) ~= 16.4 hours. When I was doing mental math, I was approximating the 7 month doubling time as two doublings a year: 5 * 2 * 2 ~= 20 hours. The rule of thumb was a bit more aggressive than the strict extrapolation, but in fact, I just realized while writing this post that the mental extrapolation was too conservative in a different way — I should have dated the ~5 hour time horizon to Nov 24 (Opus 4.5’s release date), not the beginning of January, meaning I should have added an extra month to all these extrapolations.
- ^
I didn’t actually do this math at the time, but the original paper calculated a doubling time of 118 days or 0.32 years since 2024, and if you assume that doubling time was correct, then the time horizon by EOY 2026 should have been 5 * exp(ln(2) * 1/0.32) ~= 43 hours. Given that I took this faster doubling time pretty seriously at the time, this suggests my median and especially my 80th percentile should have been higher to begin with. I would probably have been better served by the heuristic of using just the previous year (rather than the previous seven years) to forecast the next year.
- ^
Originally, METR estimated Opus 4.6 to have a 50% time horizon of ~14.5 hours on Feb 20, 2026. We corrected a bug in our modeling on March 3, 2026 and this reduced its time horizon estimate to ~12 hours. Note that this measurement was done on Time Horizon 1.1, the new task suite released on Jan 29, 2026.
- ^
This is something I appreciate about the time horizon construct. For standard benchmarks, confidence intervals around a point get narrower as the benchmark approaches saturation: if a model gets 50% accuracy, there’s more room for error in either direction than if it gets 95% accuracy. But since the time horizon metric could get infinitely long, the confidence intervals can be constructed so uncertainty gets wider rather than narrower as the current hardest tasks are saturated. Epistemically, it’s appropriate for your uncertainty about real capabilities to blow up when you no longer have tasks that the models can’t complete.
- ^
In the original time horizon paper, METR measured human completion times for most of the tasks in the dataset by actually having humans do them (148 out of 169). In the updated suite, only 5 of the 19 tasks longer than 8 hours have measured human baselines — the others are estimated.
- ^
Agents are usually run on the same task 6 separate times. The same agent doesn’t always approach the same task the same way each time — it sometimes gets lucky or unlucky. For four of the 19 hard tasks, Claude Opus 4.6 succeeded in all six runs; for another ten, it succeeded in at least one of the six runs.
- ^
If you look just at the year 2025, agent time horizons doubled every ~3.5 months, not every ~7 months as in the long-run trend or even every ~4 months as in the 2024-2025 trend. I mostly didn’t factor this into my forecasts; it wasn’t salient to me compared to the rule of thumb I’d absorbed of “two doublings a year.” If I had used this to extrapolate from Opus 4.5, it would have suggested a time horizon of 5 * exp(ln(2) * 12/3.5) ~= 54 hours by the end of the year.
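The extrapolations in these footnotes all apply the same compound-doubling formula. As a minimal sketch (the function name is my own; the starting ~5-hour horizon and the doubling times are the ones quoted in the footnotes):

```python
def extrapolate_horizon(current_hours, doubling_time_years, years_ahead=1.0):
    """Project a 50% time horizon forward, assuming exponential growth
    with a fixed doubling time (equivalent to h * exp(ln(2) * t / d))."""
    return current_hours * 2 ** (years_ahead / doubling_time_years)

# Starting from Opus 4.5's ~5-hour horizon, one year out:
print(round(extrapolate_horizon(5, 0.58), 1))      # 2019-2025 trend: ~16.5 hours
print(round(extrapolate_horizon(5, 0.32), 1))      # 2024-2025 trend: ~43.6 hours
print(round(extrapolate_horizon(5, 3.5 / 12), 1))  # 2025-only trend: ~53.8 hours
```

Small differences from the numbers in the text come from rounding the doubling times.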
- ^
Specifically, my operationalization was that firing all human members of technical staff (research leads, engineers, everyone) would only slow down progress by 25%. Now and in the past, of course, firing every single human would cause progress to completely halt.
- ^
A concrete example of this is the kind of argument that my colleague Nikola made here, in Nov 2025, when the state-of-the-art time horizon was 2 hours; this kind of argument would be much shakier made today (less than four months later!).
