Summary: We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks.

The length of tasks (measured by how long they take human professionals) that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years. The shaded region represents 95% CI calculated by hierarchical bootstrap over task families, tasks, and task attempts.
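As a rough sketch of what the extrapolation involves, the snippet below projects a horizon forward under a constant 7-month doubling time. The starting horizon and the hour counts for a working week and month are illustrative assumptions, not figures from the post, and the function names are hypothetical.

```python
import math

DOUBLING_MONTHS = 7.0  # doubling time stated in the summary

def horizon_after(months, start_hours, doubling_months=DOUBLING_MONTHS):
    """Projected 50% horizon (hours) after `months` of constant exponential growth."""
    return start_hours * 2 ** (months / doubling_months)

def months_until(target_hours, start_hours, doubling_months=DOUBLING_MONTHS):
    """Months needed for the horizon to grow from start_hours to target_hours."""
    return doubling_months * math.log2(target_hours / start_hours)

# ASSUMPTIONS for illustration only (not taken from the post):
START_HOURS = 1.0    # hypothetical current 50% horizon
WORK_WEEK = 40.0     # hours of human work in a week
WORK_MONTH = 167.0   # hours of human work in a month

for label, target in [("work week", WORK_WEEK), ("work month", WORK_MONTH)]:
    months = months_until(target, START_HOURS)
    print(f"{label}: about {months:.0f} months ({months / 12:.1f} years)")
```

Because the projection is a straight line on a log scale, a different starting horizon only shifts the answer by a fixed number of months; it does not change the slope.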

Full paper | GitHub repo

Blogpost; tweet thread

Comments (16)



Is the point when models hit a length of time on the x-axis of the graph meant to represent the point where models can do all tasks of that length that a normal knowledge worker could perform on a computer? The vast majority of knowledge worker tasks of that length? At least one task of that length? Some particular important subset of tasks of that length? 

As it says in the subtitle of the graph, it's the length of task at which models have a 50% success rate.

I don't quite get what that means. Do they really take exactly the same amount of time on all tasks for which they have the same success rate? Sorry, maybe I am being annoying here and this is all well-explained in the linked post. But I am trying to figure out how much this is creating the illusion that progress on it means a model will be able to handle all tasks that it takes normal human workers about that amount of time to do, when it really means something quite different.  

Thanks for the question David! I believe the methodology sections of the paper help answer this, particularly: section 4 goes into more detail on what the horizon means and section 8.1 discusses some limitations of this approach.
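One way to make the 50% figure concrete is sketched here, under the assumption that success probability falls along a logistic curve in the log of human task length. On that reading, the horizon is the length at which the modelled curve crosses 50%, not a promise about every task of that length. The names `success_prob` and `horizon_at` and the values of `H50_MINUTES` and `SLOPE` are illustrative and not taken from the paper.

```python
import math

def success_prob(task_minutes, h50_minutes, slope):
    """Modelled success probability for a task of the given human length.

    `slope` is the drop in log-odds per doubling of task length (assumed).
    """
    doublings_past_h50 = math.log2(task_minutes / h50_minutes)
    return 1.0 / (1.0 + math.exp(slope * doublings_past_h50))

def horizon_at(p, h50_minutes, slope):
    """Task length (minutes) at which the modelled success probability equals p."""
    log_odds = math.log(p / (1.0 - p))
    return h50_minutes * 2 ** (-log_odds / slope)

# ASSUMPTIONS for illustration only:
H50_MINUTES = 60.0     # hypothetical 50% horizon
SLOPE = math.log(2.0)  # chosen so the 80% horizon sits two doublings below the 50% one

for minutes in (5, 15, 60, 240):
    p = success_prob(minutes, H50_MINUTES, SLOPE)
    print(f"{minutes:>4} min task: modelled success {p:.0%}")
```

In this picture, individual tasks of the same length can still vary widely in difficulty; the curve only describes the average success rate at each length.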

This is going viral on X (2.8M views as of posting this comment).

Reposting this from Daniel Kokotajlo:

This is probably the most important single piece of evidence about AGI timelines right now. Well done! I think the trend should be superexponential, e.g. each doubling takes 10% less calendar time on average. Eli Lifland and I did some calculations yesterday suggesting that this would get to AGI in 2028. Will do more serious investigation soon.

Why do I expect the trend to be superexponential? Well, it seems like it sorta has to go superexponential eventually. Imagine: We've got to AIs that can with ~100% reliability do tasks that take professional humans 10 years. But somehow they can't do tasks that take professional humans 160 years? And it's going to take 4 more doublings to get there? And these 4 doublings are going to take 2 more years to occur? No, at some point you "jump all the way" to AGI, i.e. AI systems that can do any length of task as well as professional humans -- 10 years, 100 years, 1000 years, etc.
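To illustrate the arithmetic behind "each doubling takes 10% less calendar time", the sketch below sums the shrinking doubling times as a geometric series. It assumes the first doubling takes the 7 months reported in the post; the function names are hypothetical, and it does not try to reproduce the 2028 estimate, which depends on inputs not given here.

```python
def months_for_doublings(n, first_doubling_months=7.0, shrink=0.9):
    """Total calendar months for n doublings when each doubling takes
    `shrink` times as long as the one before it."""
    if shrink == 1.0:
        return n * first_doubling_months  # plain exponential trend
    return first_doubling_months * (1 - shrink ** n) / (1 - shrink)

def limit_months(first_doubling_months=7.0, shrink=0.9):
    """Upper bound on the total time for any number of doublings (needs shrink < 1)."""
    return first_doubling_months / (1 - shrink)

# Four doublings, e.g. from 10-year tasks to 160-year tasks.
print("constant trend:", months_for_doublings(4, shrink=1.0), "months")
print("10% shrink:    ", round(months_for_doublings(4), 1), "months")
print("all doublings: under", round(limit_months(), 1), "months")
```

With any shrink factor below 1, the series converges, so arbitrarily long horizons are reached in finite time. That is the sense in which such a trend "jumps all the way".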

Also, zooming in mechanistically on what's going on, insofar as an AI system can do tasks below length X but not above length X, it's gotta be for some reason -- some skill that the AI lacks, which isn't important for tasks below length X but which tends to be crucial for tasks above length X. But there are only a finite number of skills that humans have that AIs lack, and if we were to plot them on a horizon-length graph (where the x-axis is log of horizon length, and each skill is plotted on the x-axis where it starts being important, such that it's not important to have for tasks less than that length) the distribution of skills by horizon length would presumably taper off, with tons of skills necessary for pretty short tasks, a decent amount necessary for medium tasks (but not short), and a long thin tail of skills that are necessary for long tasks (but not medium), a tail that eventually goes to 0, probably around a few years on the x-axis. So assuming AIs learn skills at a constant rate, we should see acceleration rather than a constant exponential. There just aren't that many skills you need to operate for 10 days that you don't also need to operate for 1 day, compared to how many skills you need to operate for 1 hour that you don't also need to operate for 6 minutes.
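A toy version of this skills argument can be sketched as follows. It assumes the density of missing skills decays exponentially along the log-horizon axis and that skills are learned at a constant rate in order of the horizon where they start to matter. Every number and name below is a hypothetical chosen for illustration, and an exponential tail only approaches zero rather than reaching it.

```python
import math

def skills_needed_up_to(x, total_skills, scale):
    """Missing skills that matter for tasks up to x doublings beyond today's horizon,
    assuming skill density decays like exp(-x / scale)."""
    return total_skills * (1 - math.exp(-x / scale))

def months_to_reach(x, skills_per_month, total_skills, scale):
    """Calendar months to learn every skill needed up to x doublings."""
    return skills_needed_up_to(x, total_skills, scale) / skills_per_month

def doubling_times(n, skills_per_month, total_skills, scale):
    """Months taken by each of the next n doublings of the horizon."""
    return [
        months_to_reach(k + 1, skills_per_month, total_skills, scale)
        - months_to_reach(k, skills_per_month, total_skills, scale)
        for k in range(n)
    ]

# ASSUMPTIONS for illustration only:
SKILLS_PER_MONTH = 1.0           # constant learning rate
TOTAL_SKILLS = 70.0              # finite stock of missing skills
SCALE = -1.0 / math.log(0.9)     # makes each doubling need 10% fewer new skills

for k, months in enumerate(doubling_times(5, SKILLS_PER_MONTH, TOTAL_SKILLS, SCALE), 1):
    print(f"doubling {k}: {months:.2f} months")
```

Under these assumptions each doubling takes 0.9 times as long as the previous one, so a constant learning rate plus a tapering skill distribution yields acceleration rather than a constant exponential.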

There are two other factors worth mentioning which aren't part of the above: One, the projected slowdown in capability advances that'll come as compute and data scaling falters due to becoming too expensive. And two, pointing in the other direction, the projected speedup in capability advances that'll come as AI systems start substantially accelerating AI R&D.

Reposting this from Daniel Eth:

On the one hand, this seems like not much (shouldn’t AGIs be able to hit ‘escape velocity’ and operate autonomously forever?), but on the other, being able to do a month’s worth of work coherently would surely get us close to recursive self-improvement.

Remember that this is graphing the length of task that the AI can do with an over 50% success rate. The length of task that an AI can do reliably is much shorter than what is shown here (you can look at figure 4 in the paper): for an 80% success rate it's 30 seconds to a minute. 

Being able to do a month's worth of work at a 50% success rate would be very useful and productivity-boosting, of course, but would it really be close to recursive self-improvement? I don't think so. I feel that some part of complex projects needs reliable code, and that will always be a bottleneck.

Figure four averages across all models. I think figure six is more illuminating:

Basically, the 80% threshold is ~2 doublings behind the 50% threshold, or ~1 year. An extra year isn't nothing! But you're still not getting to 10+ year timelines.
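The conversion from doublings to calendar time here is simple enough to check directly. This is a minimal sketch using the 7-month doubling time from the post and the roughly two-doubling gap from this comment; the function names and the example 60-minute horizon are hypothetical.

```python
DOUBLING_MONTHS = 7.0    # doubling time from the post
DOUBLINGS_BEHIND = 2.0   # approximate gap between the 80% and 50% thresholds

def lag_months(doublings_behind=DOUBLINGS_BEHIND, doubling_months=DOUBLING_MONTHS):
    """Calendar months by which the 80% horizon trails the 50% horizon."""
    return doublings_behind * doubling_months

def horizon_80_from_50(h50, doublings_behind=DOUBLINGS_BEHIND):
    """80% horizon implied by a 50% horizon, in the same time unit."""
    return h50 / 2 ** doublings_behind

print("lag:", lag_months(), "months, or about", round(lag_months() / 12, 1), "years")
print("a 60-minute 50% horizon implies an 80% horizon of",
      horizon_80_from_50(60.0), "minutes")
```

This holds only if both thresholds keep following the same doubling trend, which is exactly what the replies below question.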

The more task lengths the 80% threshold has to run through before it gets to a task length we'd regard as AGI-complete, though, the more the tasks at the end of the sequence differ from those at the beginning, and therefore the more likely it is that the doubling trend will break down somewhere along the way. That seems to me like the main significance of titotal's point, not the time gained if we just assume the current 80% doubling trend will continue right to the end of the line. Plausibly, tasks of 30 seconds to a minute are more different from weeks-long tasks than 15-minute tasks are.

So the claim is:

  1. The 50% trend will break down at some length of task T
  2. The 80% trend will therefore break at some shorter length T' < T
  3. And maybe T is large enough to cause some catastrophic risk, but T' isn't

?

Yes. (Though I'm not saying this will happen, just that it could, and that is more significant than a short delay.) 

Fair enough! My guess is that when the trend breaks it will be because things have gone super-exponential rather than sub-exponential (some discussion here) but yeah, I agree that this could happen!

Just so people know what you're referring to, this is Figure 4: 

Ben West noted in the blog post that 

We think these results help resolve the apparent contradiction between superhuman performance on many benchmarks and the common empirical observations that models do not seem to be robustly helpful in automating parts of people’s day-to-day work: the best current models—such as Claude 3.7 Sonnet—are capable of some tasks that take even expert humans hours, but can only reliably complete tasks of up to a few minutes long.

Fascinating trend: AI's ability to handle long, complex tasks is accelerating fast. A decade from now, automation could reshape entire industries, especially in software development. Curious to see how this scales beyond coding into other professional fields!
