I underestimated AI capabilities (again)

Ajeya

6 min read · Mar 5

Comments 4

Denkenberger🔸

3mo

Do we need a scared reaction option on the EA Forum?

Callum Hinchcliffe

3mo

More emotional reaction options could help other users to differentiate between well-thought-out responses and responses published while still in an emotional state. They could also be used to gauge the emotional state of the community.

This comment was written hastily without being carefully considered.

Vasco Grilo🔸

3mo

Hi Ajeya.

But for the first time, I don’t see any solid trend we can extrapolate to say it won’t happen soon.^[11] AI R&D really could be automated this year.

What are your predictions for the unemployment rate of software engineers? What do you think about these reasons for potentially overestimating the pace of automation based on AI benchmarks?

But there’s a big problem here – if AIs are actually able to perform most tasks on 1-hour task horizons, why don’t we see more real-world task automation? For example, most emails take less than an hour to write, but crafting emails remains an important part of the lives of billions of people every day.
Some of this could be due to people underusing AI systems,^[2] but in this post I want to focus on reasons that are more fundamental to the capabilities of AI systems. In particular, I think there are three such reasons that are the most important:
1. Time-horizon estimates are very domain-specific
2. Task reliability strongly influences task horizons
3. Tasks are very bundled together and hard to separate out

Gergely Máté

3mo*

Nice article, thanks!

Maybe the closer we are to a singularity-like moment, the larger the deviations become in our expectations about the future. It would make sense, because at singularity our uncertainty about the future should be maximal, I think.
However, hopefully that's a bit off for now. And maybe (I really hope!) we can keep it at a distance.

I was thinking about the 50% success definition of time horizons, and the possible practical consequences of that.

One thing that may still be interesting is how long it takes the AI agent to do the task. Take, for example, a task that takes a human developer 4 hours. Does it take an AI agent 10 minutes, or 8 hours? I understand that this is temporary anyway, and AI agents will be much faster pretty soon, but another question may be: when?

Another thing is what happens on failure. If an AI agent attempts a task that takes a human 4 hours and fails, as it will in the other 50% of cases (or 25%, or fewer), what comes next when one wants to deliver something? Having a human do it by hand and accepting the "wasted" time and cost? Restarting the AI agent n times or up to a cost limit? Companies hire engineers in the hope they'll have a very high success rate, and working in teams usually provides a multiplier on top of that. How does that scale with teams of AI agents?

^[1]

It takes a bit of taste and judgment to decide what a task’s “time horizon” is, and it’s possible to game the metric to make it meaningless. Consider the “task” of taking a giant math test consisting of 10,000 easy elementary school word problems — this is obviously 10,000 thirty-second tasks, rather than one 83-hour task. Or to take an example suggested by my colleague Tom Cunningham, consider the task “Count how many times a horse is mentioned in Anna Karenina.” This might take a single human 10 hours, but it’s highly parallelizable: a team of 300 people could each take one page and do the task in a minute. In the METR task suite, “human time to complete” works as a good proxy for “intrinsic” difficulty because the tasks are constructed to be hard to easily decompose and parallelize. Within each task, every piece depends on every other, and you benefit from keeping the whole context in mind. More on that later in the post.

^[2]

Actually, at the time, METR was measuring models on its original time horizon suite (TH 1.0), and the precise central estimate for Opus 4.5 was ~4h48min on that distribution. But since then, METR has released an updated suite with more tasks (TH 1.1) which caused all models' scores to shift slightly; the measured time horizon for Opus 4.5 on TH 1.1 is ~5h20min.

^[3]

The 2019-2025 doubling time calculated in the original paper was 212 days, or 0.58 years. That means that if the time horizon at the beginning of January was 5 hours, the time horizon at the end of the year should be 5 * exp(ln(2) * 1/0.58) ~= 16.5 hours. When I was doing mental math, I was approximating the 7 month doubling time as two doublings a year: 5 * 2 * 2 = 20 hours. The rule of thumb was a bit more aggressive than the strict extrapolation, but in fact, I just realized while writing this post that the mental extrapolation was too conservative in a different way: I should have dated the ~5 hour time horizon to Nov 24 (Opus 4.5’s release date), not the beginning of January, meaning I should have added an extra month to all these extrapolations.
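
A minimal sketch of the arithmetic in this footnote, assuming pure exponential growth at a fixed doubling time. The function name and its parameters are illustrative labels, not anything taken from the METR paper; the inputs are the figures quoted above.

```python
import math


def extrapolate_horizon(start_hours, doubling_time_years, elapsed_years):
    """Time horizon after `elapsed_years`, assuming a constant doubling time."""
    return start_hours * math.exp(math.log(2) * elapsed_years / doubling_time_years)


# Strict extrapolation: 5-hour starting horizon, 212-day doubling time, one year ahead.
strict = extrapolate_horizon(5, 212 / 365, 1.0)

# Mental-math rule of thumb: "two doublings a year".
rule_of_thumb = 5 * 2 * 2

print(f"strict extrapolation: {strict:.1f} hours")
print(f"rule of thumb:        {rule_of_thumb} hours")
```

The strict extrapolation lands a few hours below the rule of thumb. Dating the starting point a month earlier, as the footnote suggests, would mean passing a slightly larger `elapsed_years`.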

^[4]

I didn’t actually do this math at the time, but the original paper calculated a doubling time of 118 days or 0.32 years since 2024, and if you assume that doubling time was correct, then the time horizon by EOY 2026 should have been 5 * exp(ln(2) * 1/0.32) ~= 43 hours. Given that I took this faster doubling time pretty seriously at the time, this suggests my median and especially my 80th percentile should have been higher to begin with. I would probably have been better served by the heuristic of using just the previous year (rather than the previous seven years) to forecast the next year.

^[5]

Originally, METR estimated Opus 4.6 to have a 50% time horizon of ~14.5 hours on Feb 20, 2026. We corrected a bug in our modeling on March 3, 2026 and this reduced its time horizon estimate to ~12 hours. Note that this measurement was done on Time Horizon 1.1, the new task suite released on Jan 29, 2026.

^[6]

This is something I appreciate about the time horizon construct. For standard benchmarks, confidence intervals around a point get narrower as the benchmark approaches saturation: if a model gets 50% accuracy, there’s more room for error in either direction than if it gets 95% accuracy. But since the time horizon metric could get infinitely long, the confidence intervals can be constructed so uncertainty gets wider rather than narrower as the current hardest tasks are saturated. Epistemically, it’s appropriate for your uncertainty about real capabilities to blow up when you no longer have tasks that the models can’t complete.

^[7]

In the original time horizon paper, METR measured human completion times for most of the tasks in the dataset by actually having humans do them (148 out of 169). In the updated suite, only 5 of the 19 tasks longer than 8 hours have measured human baselines — the others are estimated.

^[8]

Agents are usually run on the same task 6 separate times. The same agent doesn’t always approach the same task the same way each time; it sometimes gets lucky or unlucky. For four of the 19 hard tasks, Claude Opus 4.6 succeeded in all six runs; for another ten, it succeeded in at least one of the six runs.

^[9]

If you look just at the year 2025, agent time horizons doubled every ~3.5 months, not every ~7 months as in the long-run trend or even every ~4 months as in the 2024-2025 trend. I mostly didn’t factor this into my forecasts; it wasn’t salient to me compared to the rule of thumb I’d absorbed of “two doublings a year.” If I had used this to extrapolate from Opus 4.5, that would have suggested a time horizon of 5 * exp(ln(2) * 12/3.5) ~= 54 hours by the end of the year.
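
To see how sensitive the one-year projection is to the choice of trend, here is a rough sketch that applies the three doubling times mentioned in this footnote to the same ~5-hour starting point. The dictionary labels, constant names, and the `project` helper are illustrative, and the sketch assumes a constant doubling time over the full 12 months.

```python
# Doubling times in months, as quoted in the text (approximate values).
doubling_months = {
    "long-run trend": 7.0,
    "2024-2025 trend": 4.0,
    "2025 only": 3.5,
}

START_HOURS = 5.0    # approximate starting time horizon used in the text
MONTHS_AHEAD = 12.0  # projecting one year forward


def project(start_hours, doubling_time_months, months_ahead):
    """Project a time horizon forward assuming a constant doubling time."""
    return start_hours * 2 ** (months_ahead / doubling_time_months)


projections = {
    label: project(START_HOURS, months, MONTHS_AHEAD)
    for label, months in doubling_months.items()
}

for label, hours in projections.items():
    print(f"{label}: ~{hours:.0f} hours")
```

Halving the doubling time from ~7 months to ~3.5 months more than triples the projected end-of-year horizon, which is why the choice of reference period matters so much here.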

^[10]

Specifically, my operationalization was that firing all human members of technical staff (research leads, engineers, everyone) would only slow down progress by 25%. Now and in the past, of course, firing every single human would cause progress to completely halt.

^[11]

A concrete example of this is the kind of argument that my colleague Nikola made here, in Nov 2025, when the state-of-the-art time horizon was 2 hours; this kind of argument would be much shakier made today (less than four months later!).