Note: This post was crossposted from Planned Obsolescence by the Forum team, with the author's permission. The author may not see or respond to comments on this post.
Revisiting a prediction ten months early
On Jan 14th, I made predictions about AI progress in 2026. My forecasts for software engineering capabilities already feel much too conservative.
In my view, METR (where I now work) has some of the hardest and highest-quality software engineering and ML engineering benchmarks out there, and the most useful framework for making benchmark performance intuitive: we measure a task’s difficulty by the amount of time a human expert would take to complete it (called the “time horizon”).
When I made my forecasts last month, the model with the longest measured time horizon on METR’s suite of software engineering tasks was Claude Opus 4.5; it could succeed around half the time at software tasks that would take a human software engineer about five hours. Time horizons on software tasks had been doubling a little less than twice a year from 2019 through 2025, which would have implied the state-of-the-art 50% time horizon should be somewhat less than 20 hours by the end of 2026. But there was ambiguity about whether the more recent doubling time was faster than the long-run trend, so I bumped that up to 24 hours for my median guess. My 20th percentile was around 15 hours and my 80th percentile was around 40 hours.
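The extrapolation in the paragraph above can be checked with a few lines of arithmetic. This is a minimal sketch, not METR's methodology: the function name is hypothetical, the forecast window is rounded to one year, and the 1.8 doublings per year is an illustrative stand-in for "a little less than twice a year".

```python
def extrapolate_horizon(start_hours: float, doublings_per_year: float, years: float) -> float:
    """Project a 50% time horizon forward under steady exponential growth."""
    return start_hours * 2 ** (doublings_per_year * years)


# Starting point from the text: roughly a 5-hour horizon at the time of the forecast.
START_HOURS = 5.0

# Exactly two doublings in a year quadruples the horizon: 5 h -> 20 h.
exactly_two = extrapolate_horizon(START_HOURS, 2.0, 1.0)

# Assumed illustrative rate of 1.8 doublings per year lands somewhat below 20 h
# (about 17.4 h).
slightly_slower = extrapolate_horizon(START_HOURS, 1.8, 1.0)

print(f"2.0 doublings/year: {exactly_two:.1f} hours")
print(f"1.8 doublings/year: {slightly_slower:.1f} hours")
```

Under these assumptions the long-run trend gives a bit under 20 hours, which is the baseline the 24-hour median was bumped up from.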
Now, Opus 4.6 (released only 2.5 months after Opus 4.5) was estimated to have a 50% time horizon of ~12 hours. I don’t take the specific number literally: there are far fewer very long tasks than medium and short tasks, and the long tasks more often have guesstimated (rather than measured) human completion times, so time horizon estimates for the latest models are a lot noisier than they were in 2025. And the benchmark underlying the time horizon graph is nearly saturated, which causes the confidence intervals to blow up: the 95% CI is 5.3 hours to 66 hours. It’s really hard to discriminate between different capability levels in the current range.
But at the end of the day, that dataset had 19 software engineering tasks estimated to take humans longer than 8 hours, and Opus 4.6 was able to solve 14 of them at least some of the time (and it reliably nailed four of them). And beyond just this one task suite, we’ve seen examples of AI agents doing certain very well-specified software tasks like writing a browser or C compiler, or porting a giant game, that would take humans many weeks or months to do on their own — not perfectly, but better than most people expected and better than a naive reading of the agents’ measured time horizon would have suggested.
And this happened in February. It’s no longer very plausible that after ten whole months of additional progress at the recent blistering pace, AI agents would still struggle half the time at 24-hour tasks.
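To see why 24 hours now looks too low, here is a rough sketch of how long one more doubling would take even at the slower long-run pace. It takes the noisy ~12-hour point estimate at face value, which the text cautions against. The function name is hypothetical, and the 6-month and 7-month doubling times are assumed illustrations of "about twice a year" and "a little less than twice a year".

```python
import math


def months_to_reach(current_hours: float, target_hours: float, doubling_months: float) -> float:
    """Months needed to grow from current to target at a fixed doubling time."""
    return doubling_months * math.log2(target_hours / current_hours)


# Going from ~12 hours to 24 hours is exactly one doubling.
long_run = months_to_reach(12.0, 24.0, 6.0)  # assumed 6-month doubling time
slower = months_to_reach(12.0, 24.0, 7.0)    # assumed 7-month doubling time

MONTHS_LEFT_IN_YEAR = 10.0  # "ten whole months", per the text

print(f"6-month doubling: {long_run:.1f} months to reach 24 h")
print(f"7-month doubling: {slower:.1f} months to reach 24 h")
print(f"Both within the year: {max(long_run, slower) < MONTHS_LEFT_IN_YEAR}")
```

Even with no acceleration at all, a 24-hour horizon would arrive with months to spare in this toy calculation.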
I wish them the best, but I think my colleagues on the capability evaluations team at METR might struggle to create new software tasks from a similar distribution capable of measuring AI agents’ true time horizons through the end of the year. If we could measure this, I’d guess that by the end of the year, AI agents will have a time horizon of over 100 hours on the sorts of software tasks in METR’s suite, which are not very precisely specified. (On certain extremely well-specified software tasks like the examples above, agents already seem to have a time horizon of more than a hundred hours.)
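As a sanity check on the 100-hour guess, the sketch below compares the doubling time needed to get from ~12 hours to 100 hours in ten months with the doubling time implied by the two most recent point estimates (about 5 hours, then ~12 hours 2.5 months later). Both function names are hypothetical. The inputs are the noisy point estimates quoted above, so the output is a rough illustration rather than a measurement.

```python
import math


def required_doubling_months(current_hours: float, target_hours: float, months_available: float) -> float:
    """Doubling time needed to reach the target within the available months."""
    doublings_needed = math.log2(target_hours / current_hours)
    return months_available / doublings_needed


def observed_doubling_months(old_hours: float, new_hours: float, months_between: float) -> float:
    """Doubling time implied by two point estimates taken months_between apart."""
    return months_between / math.log2(new_hours / old_hours)


# 12 h -> 100 h is a little over three doublings, so roughly one every 3.3 months.
needed = required_doubling_months(12.0, 100.0, 10.0)

# 5 h -> 12 h in 2.5 months implies a doubling time of roughly two months.
recent = observed_doubling_months(5.0, 12.0, 2.5)

print(f"Doubling time needed for 100 h by year end: {needed:.2f} months")
print(f"Doubling time implied by the last two estimates: {recent:.2f} months")
```

On these numbers, reaching 100 hours requires a pace slower than the most recent jump, though much faster than the long-run trend of roughly two doublings a year.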
And once you’re talking about multiple full-time-equivalent weeks of work, I wonder if the whole concept of “time horizon” starts to break down.
It’s nearly impossible to subdivide a typical one-hour task (e.g., debugging one failing test) into smaller pieces that multiple people can work on in parallel. It wouldn’t go very well if you had to farm out writing this print statement or reading that error message or tweaking this line of code to different people. The right action to take next depends intimately on everything that came before it and the precise state of the code as a whole; you have to hold the whole context in your mind as you take each action, or the actions won’t cohere in the right way.
It’s somewhat easier to decompose an eight-hour task (e.g., writing a simple browser game) into smaller components, but those components are constantly bleeding into each other in ways that make clean handoffs hard. When you’re implementing the game logic, you realize it needs to know something about how the graphics are rendered. When you’re handling user input, you find yourself tweaking the game loop. The fastest way to do it is probably one person knocking it out in a day, making a hundred small decisions fluidly as they go.
But it’s actually pretty feasible to break down a month-long task into smaller pieces. In fact, you may start benefiting from some explicit decomposition — it might be helpful to write a design doc laying out how the pieces fit together, or break the work into tickets so you don’t lose track of what’s done and what’s left. And while it might take one person working alone a month to complete, the fastest way to get it done might be to have different people work on different pieces like the checkout flow or the inventory management panel in parallel.
And of course, tasks that take multiple full-time-equivalent years of work nearly always can and should be broken down into smaller milestones and parallelized across multiple teammates. Human work appears to get more and more decomposable the longer it gets.
In other words, very few tasks feel intrinsically like year-long tasks, the way that writing one bash command feels like an intrinsically one-second task, or debugging one simple bug feels intrinsically like a one-hour task. Maybe a mathematician banging their head against a hard conjecture for a year before finally making a breakthrough is a “real” year-long task? But most many-person-month software projects in the real world sort of feel like they might be a bunch of few-week tasks in a trenchcoat, the way that a hundred-question elementary school math test is really 100 thirty-second tasks in a trenchcoat.
If an extreme version of that is true, then once AI agents can consistently do (say) 80-hour tasks, they should be able to make continuous progress on projects of arbitrary scale. Maybe manager AIs can spend their work week figuring out how to farm out the current project goal to line-worker AIs, line-workers can execute on their individual piece, and all the AIs can maintain good enough records that no individual agent needs to build up holistic, long-term state on the whole project.
I think this probably won’t fully work. Even projects with lots of formalized goal-tracking still benefit a lot from everyone involved intuitively appreciating the bigger picture in a way that isn’t fully captured in Jira tickets and Asana tasks. Decomposing a six-month project so cleanly and precisely that it can be executed by a team of people with no such holistic context might itself be, say, a two-month task.
But it might work surprisingly well for a surprisingly large class of software projects. AI agents are a lot cheaper and a lot more patient than humans, so it could be practical to get them to do far, far more task-tracking, documentation, and other project management than human teams ever do. People have already started aggressively experimenting with scaffolding for orchestrating agent teams. It’s not clear how far it will go over the next several months.
This is why my colleague Tom proposed that the calendar time it takes a large team of humans to do a task might be a better proxy for “intrinsic difficulty” than the time it takes one human working alone. So far, the “team time” and “solo time” have been very similar in the METR task suite, since it has ranged from 1-second tasks to maybe 20-hour tasks. But we’re entering the regime where these numbers could rapidly diverge. If Tom’s conjecture is true, the “solo time” metric should start going super-exponential about now, which makes it very hard to bound software engineering capabilities by the end of the year.
In my predictions last month, my probability that AI R&D would be fully automated by the end of the year — AIs taking care of all the research ideation and implementation, no humans necessary — was 10%. After I published that piece, I heard from a few others in the AI forecasting space (including those I generally think of as more bullish on AI timelines than I am) that it seemed a bit high. But now ten percent feels like it’s in the right ballpark again.
Fully automating AI R&D still seems like a tall order. Even fully automating software engineering seems like it requires an aggressive read of the evidence, and AI R&D is not just software engineering: it seems like automating it would require a surprising amount of progress on “research judgment” and “creativity” and other intangible skills that AI systems still appear to be worse at than human researchers. I think it’s a lot more likely in the coming three or five years than this year.
But for the first time, I don’t see any solid trend we can extrapolate to say it won’t happen soon. AI R&D really could be automated this year.
Do we need a scared reaction option on the EA Forum?
More emotional reaction options could help other users to differentiate between well-thought-out responses and responses published while still in an emotional state. They could also be used to gauge the emotional state of the community.
This comment was written hastily without being carefully considered.