Note: This post was crossposted from Planned Obsolescence by the Forum team, with the author's permission. The author may not see or respond to comments on this post.
Revisiting a prediction ten months early
On Jan 14th, I made predictions about AI progress in 2026. My forecasts for software engineering capabilities already feel much too conservative.
In my view, METR (where I now work) has some of the hardest and highest-quality software engineering and ML engineering benchmarks out there, and the most useful framework for making benchmark performance intuitive: we measure a task’s difficulty by the amount of time a human expert would take to complete it (called the “time horizon”).[1]
When I made my forecasts last month, the model with the longest measured time horizon on METR’s suite of software engineering tasks was Claude Opus 4.5; it could succeed around half the time at software tasks that would take a human software engineer about five hours.[2] Time horizons on software tasks had been doubling a little less than twice a year from 2019 through 2025, which would have implied the state-of-the-art 50% time horizon should be somewhat less than 20 hours by the end of 2026.[3] But there was ambiguity about whether the more recent doubling time was faster than the long-run trend, so I bumped that up to 24 hours for my median guess.[4] My 20th percentile was around 15 hours and my 80th percentile was around 40 hours.
Now, Opus 4.6 (released only 2.5 months after Opus 4.5) was estimated to have a 50% time horizon of ~12 hours.[5] I don’t take the specific number literally — there are many fewer very-long tasks than medium and short tasks, and the long tasks more often have guesstimated (rather than measured) human completion times, so time horizon estimates for the latest models are a lot noisier than they were in 2025. And the benchmark underlying the time horizon graph is nearly saturated, which causes the confidence intervals to blow up: the 95% CI is 5.3 hours to 66 hours.[6] It’s really hard to discriminate between different capability levels at the current range.
But at the end of the day, that dataset had 19 software engineering tasks estimated[7] to take humans longer than 8 hours, and Opus 4.6 was able to solve 14 of them at least some of the time (and it reliably nailed four of them).[8] And beyond just this one task suite, we’ve seen examples of AI agents doing certain very well-specified software tasks like writing a browser or C compiler, or porting a giant game, that would take humans many weeks or months to do on their own — not perfectly, but better than most people expected and better than a naive reading of the agents’ measured time horizon would have suggested.
And this happened in February. It’s no longer very plausible that after ten whole months of additional progress at the recent blistering pace,[9] AI agents would still struggle half the time at 24-hour tasks.
I wish them the best, but I think my colleagues on the capability evaluations team at METR might struggle to create new software tasks from a similar distribution capable of measuring AI agents’ true time horizons through the end of the year. If we could measure this, I’d guess that by the end of the year, AI agents will have a time horizon of over 100 hours on the sorts of software tasks in METR’s suite (which are not very precisely specified — on certain extremely well-specified software tasks like the examples above, agents seem to already have a time horizon of more than a hundred hours).
And once you’re talking about multiple full-time-equivalent weeks of work, I wonder if the whole concept of “time horizon” starts to break down.
It’s nearly impossible to subdivide a typical one-hour task (e.g., debugging one failing test) into smaller pieces that multiple people can work on in parallel. It wouldn’t go very well if you had to farm out writing this print statement or reading that error message or tweaking this line of code to different people — the right action to take next depends intimately on everything that came before it and on the precise state of the code as a whole; you have to hold the whole context in your mind as you take each action, or the actions won’t cohere in the right way.
It’s somewhat easier to decompose an eight hour task (e.g., writing a simple browser game) into smaller components, but those components are constantly bleeding into each other in ways that make clean handoffs hard. When you’re implementing the game logic, you realize it needs to know something about how the graphics are rendered. When you’re handling user input, you find yourself tweaking the game loop. The fastest way to do it is probably one person knocking it out in a day, making a hundred small decisions fluidly as they go.
But it’s actually pretty feasible to break down a month-long task into smaller pieces. In fact, you may start benefiting from some explicit decomposition — it might be helpful to write a design doc laying out how the pieces fit together, or break the work into tickets so you don’t lose track of what’s done and what’s left. And while it might take one person working alone a month to complete, the fastest way to get it done might be to have different people work on different pieces like the checkout flow or the inventory management panel in parallel.
And of course, tasks that take multiple full-time-equivalent years of work nearly always can and should be broken down into smaller milestones, and parallelized across multiple teammates. Human work appears to get more and more decomposable the longer and longer it gets.
In other words, very few tasks feel intrinsically like year-long tasks, the way that writing one bash command feels like an intrinsically one-second task, or debugging one simple bug feels intrinsically like a one-hour task. Maybe a mathematician banging their head against a hard conjecture for a year before finally making a breakthrough is a “real” year-long task? But most many-person-month software projects in the real world sort of feel like they might be a bunch of few-week tasks in a trenchcoat, the way that a hundred-question elementary school math test is really 100 thirty-second tasks in a trenchcoat.
If an extreme version of that is true, then once AI agents can consistently do (say) 80-hour tasks, they should be able to make continuous progress on projects of arbitrary scale. Maybe manager AIs can spend their work week figuring out how to farm out the current project goal to line-worker AIs, line-workers can execute on their individual piece, and all the AIs can maintain good enough records that no individual agent needs to build up holistic, long-term state on the whole project.
I think this probably won’t fully work. Even projects with lots of formalized goal-tracking still benefit a lot from everyone involved intuitively appreciating the bigger picture in a way that isn’t fully captured in Jira tickets and Asana tasks. Decomposing a 6 month project so cleanly and precisely that it can be executed by a team of people with no such holistic context might itself be, say, a 2 month task.
But it might work surprisingly well for a surprisingly large class of software projects. AI agents are a lot cheaper and a lot more patient than humans, so it could be practical to get them to do far, far more task-tracking, documentation, and other project management than human teams ever do. People have already started aggressively experimenting with scaffolding for orchestrating agent teams. It’s not clear how far it will go over the next several months.
This is why my colleague Tom proposed that the calendar time it takes a large team of humans to do a task might be a better proxy for “intrinsic difficulty” than the time it takes one human working alone. So far, “team time” and “solo time” have been very similar in the METR task suite, since it has ranged from 1-second tasks to maybe 20-hour tasks. But we’re entering the regime where these numbers could rapidly diverge. If Tom’s conjecture is true, the “solo time” metric should start going super-exponential about now… which makes it very hard to bound software engineering capabilities by the end of the year.
In my predictions last month, my probability that AI R&D would be fully automated by the end of the year — AIs taking care of all the research ideation and implementation, no humans necessary[10] — was 10%. After I published that piece, I heard from a few others in the AI forecasting space (including those I generally think of as more bullish on AI timelines than I am) that it seemed a bit high. But now ten percent feels like it’s in the right ballpark again.
Fully automating AI R&D still seems like a tall order. Even fully automating software engineering seems like it requires an aggressive read of the evidence, and AI R&D is not just software engineering — it seems like automating it would require a surprising amount of progress on “research judgment” and “creativity” and other ephemeral skills that AI systems still appear to be worse at than human researchers. I think it’s a lot more likely in the coming three to five years than this year.
But for the first time, I don’t see any solid trend we can extrapolate to say it won’t happen soon.[11] AI R&D really could be automated this year.
- ^
It takes a bit of taste and judgment to decide what a task’s “time horizon” is, and it’s possible to game the metric to make it meaningless. Consider the “task” of taking a giant math test consisting of 10,000 easy elementary school word problems — this is obviously 10,000 thirty-second tasks, rather than one 83-hour task. Or to take an example suggested by my colleague Tom Cunningham, consider the task “Count how many times a horse is mentioned in Anna Karenina.” This might take a single human 10 hours, but it’s highly parallelizable: a team of 300 people could each take one page and do the task in a minute. In the METR task suite, “human time to complete” works as a good proxy for “intrinsic” difficulty because the tasks are constructed to be hard to easily decompose and parallelize. Within each task, every piece depends on every other, and you benefit from keeping the whole context in mind. More on that later in the post.
- ^
Actually, at the time, METR was measuring models on its original time horizon suite (TH 1.0), and the precise central estimate for Opus 4.5 was ~4h48min on that distribution. But since then, METR has released an updated suite with more tasks (TH 1.1) which caused all models' scores to shift slightly; the measured time horizon for Opus 4.5 on TH 1.1 is ~5h20min.
- ^
The 2019-2025 doubling time calculated in the original paper was 212 days, or 0.58 years. That means that if the time horizon at the beginning of January was 5 hours, the time horizon at the end of the year should be 5 * exp(ln(2) * 1/0.58) ~= 16.4 hours. When I was doing mental math, I was approximating the 7 month doubling time as two doublings a year: 5 * 2 * 2 ~= 20 hours. The rule of thumb was a bit more aggressive than the strict extrapolation, but in fact, I just realized while writing this post that the mental extrapolation was too conservative in a different way — I should have dated the ~5 hour time horizon to Nov 24 (Opus 4.5’s release date), not the beginning of January, meaning I should have added an extra month to all these extrapolations.
- ^
I didn’t actually do this math at the time, but the original paper calculated a doubling time of 118 days or 0.32 years since 2024, and if you assume that doubling time was correct, then the time horizon by EOY 2026 should have been 5 * exp(ln(2) * 1/0.32) ~= 43 hours. Given that I took this faster doubling time pretty seriously at the time, this suggests my median and especially my 80th percentile should have been higher to begin with. I would probably have been better served by the heuristic of using just the previous year (rather than the previous seven years) to forecast the next year.
- ^
Originally, METR estimated Opus 4.6 to have a 50% time horizon of ~14.5 hours on Feb 20, 2026. We corrected a bug in our modeling on March 3, 2026 and this reduced its time horizon estimate to ~12 hours. Note that this measurement was done on Time Horizon 1.1, the new task suite released on Jan 29, 2026.
- ^
This is something I appreciate about the time horizon construct. For standard benchmarks, confidence intervals around a point get narrower as the benchmark approaches saturation: if a model gets 50% accuracy, there’s more room for error in either direction than if it gets 95% accuracy. But since the time horizon metric could get infinitely long, the confidence intervals can be constructed so uncertainty gets wider rather than narrower as the current hardest tasks are saturated. Epistemically, it’s appropriate for your uncertainty about real capabilities to blow up when you no longer have tasks that the models can’t complete.
- ^
In the original time horizon paper, METR measured human completion times for most of the tasks in the dataset by actually having humans do them (148 out of 169). In the updated suite, only 5 of the 19 tasks longer than 8 hours have measured human baselines — the others are estimated.
- ^
Agents are usually run on the same task 6 separate times. The same agent doesn’t always approach the same task the same way each time — it sometimes gets lucky or unlucky. For four of the 19 hard tasks, Claude Opus 4.6 succeeded in all six runs; for another ten, it succeeded in at least one of the six runs.
- ^
If you look just at the year 2025, agent time horizons doubled every ~3.5 months, not every ~7 months as in the long-run trend or even every ~4 months as in the 2024-2025 trend. I mostly didn’t factor this into my forecasts; it wasn’t salient to me compared to the rule of thumb I’d absorbed of “two doublings a year.” If I had used this to extrapolate from Opus 4.5, it would have suggested a time horizon of 5 * exp(ln(2) * 12/3.5) ~= 54 hours by the end of the year.
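The extrapolations in these footnotes all apply the same compound-doubling formula. As a minimal sketch (the function name is my own; the starting ~5-hour horizon and the doubling times are the ones quoted in the footnotes):

```python
def extrapolate_horizon(current_hours, doubling_time_years, years_ahead=1.0):
    """Project a 50% time horizon forward, assuming exponential growth
    with a fixed doubling time (equivalent to h * exp(ln(2) * t / d))."""
    return current_hours * 2 ** (years_ahead / doubling_time_years)

# Starting from Opus 4.5's ~5-hour horizon, one year out:
print(round(extrapolate_horizon(5, 0.58), 1))      # 2019-2025 trend: ~16.5 hours
print(round(extrapolate_horizon(5, 0.32), 1))      # 2024-2025 trend: ~43.6 hours
print(round(extrapolate_horizon(5, 3.5 / 12), 1))  # 2025-only trend: ~53.8 hours
```

Small differences from the numbers in the text come from rounding the doubling times.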
- ^
Specifically, my operationalization was that firing all human members of technical staff (research leads, engineers, everyone) would only slow down progress by 25%. Now and in the past, of course, firing every single human would cause progress to completely halt.
- ^
A concrete example of this is the kind of argument that my colleague Nikola made here, in Nov 2025, when the state-of-the-art time horizon was 2 hours; this kind of argument would be much shakier made today (less than four months later!).
