
Summary: I built a simple back-of-the-envelope model of AI agent economics that combines Ord's half-life analysis of agent reliability with real inference costs. The core idea is that if agent cost per successful outcome scales exponentially with task length while human cost scales linearly, there is a sharp viability boundary that cost reductions alone cannot meaningfully shift. The only parameter that matters much is the agent's half-life (reliability horizon), and extending it far enough appears to require the continual learning breakthrough (which I think is essential for AGI-level agents) that some place 5-20 years away. I think this has underappreciated implications for the $2T+ AI infrastructure investment thesis.


The setup

Toby Ord's "Half-Life" analysis (2025) demonstrated that AI agent success rates on tasks decay exponentially with task length, following a pattern analogous to radioactive decay. If an agent completes a 1-hour task with 50% probability, it completes a 2-hour task with roughly 25% probability and a 4-hour task with about 6%. There is a constant per-step failure probability, and because longer tasks chain more steps, success decays exponentially.

METR's 2025 data showed the 50% time horizon for the best agents was roughly 2.5-5 hours (model-dependent) and had been doubling every ~7 months. The International AI Safety Report 2026, published this week, uses the same data (at the 80% success threshold, which is more conservative) and projects multi-day task completion by 2030 if the trend continues.

What I haven't seen anyone do is work through the economic implications of the exponential decay structure. So here is a simple model.

The model

Five parameters:

  1. Cost per agent step ($): average cost of one model call, including growing context windows. Ranges from ~$0.02 (cheap model, short context) to ~$0.55 (frontier model, long context).
  2. Steps per hour of human-equivalent work: how many agent actions correspond to one hour of human task time. I use 50-120 depending on task complexity.
  3. Half-life (hours): the task length at which the agent succeeds 50% of the time. Currently ~2.5-5h for frontier models on software tasks.
  4. Human hourly rate ($): fully loaded cost (salary + benefits + overhead). $100-200 for skilled knowledge workers.
  5. Oversight cost: someone needs to review agent output. Modelled as 15% of human hourly rate, capped at 4 hours.

The key equation:

P(success) = 0.5 ^ (task_hours / half_life)
E[attempts to succeed] = 1 / P(success) = 2 ^ (task_hours / half_life)
Cost per success = (steps × cost_per_step × context_multiplier) × 2^(task_hours / half_life)
Human cost = hourly_rate × task_hours

Human cost is linear in task length. Agent cost per success is exponential. They must cross.
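
To make the arithmetic concrete, here is a minimal Python sketch of these equations. Parameter defaults follow the base case below; the context multiplier is left as a plain knob and the oversight rule is my reading of "15% of the human hourly rate, capped at 4 hours", so the outputs approximate rather than exactly reproduce the tables from the interactive model.

```python
def agent_cost_per_success(task_hours, cost_per_step=0.22, steps_per_hour=80,
                           half_life=5.0, context_multiplier=1.0, hourly_rate=150,
                           oversight_frac=0.15, oversight_cap_hours=4):
    """Expected agent cost for one *successful* completion of a task."""
    expected_attempts = 2 ** (task_hours / half_life)   # = 1 / P(success)
    cost_per_attempt = task_hours * steps_per_hour * cost_per_step * context_multiplier
    # Assumed oversight rule: 15% of the human rate, for at most 4 hours of review.
    oversight = oversight_frac * hourly_rate * min(task_hours, oversight_cap_hours)
    return cost_per_attempt * expected_attempts + oversight


def human_cost(task_hours, hourly_rate=150):
    """Human cost is simply linear in task length."""
    return hourly_rate * task_hours
```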

Results: base case

Using base case parameters (cost/step = $0.22, steps/hr = 80, half-life = 5h, human rate = $150/hr):

| Task length | Steps | $/attempt | P(success) | E[attempts] | Agent cost | Human cost | Ratio (agent / human) |
|---|---|---|---|---|---|---|---|
| 15 min | 20 | $4.40 | 96.6% | 1.0 | $9.90 | $37.50 | 0.26× |
| 30 min | 40 | $8.80 | 93.3% | 1.1 | $16.93 | $75.00 | 0.23× |
| 1h | 80 | $17.60 | 87.1% | 1.1 | $42.70 | $150 | 0.28× |
| 2h | 160 | $36.96 | 75.8% | 1.3 | $93.78 | $300 | 0.31× |
| 4h | 320 | $77.44 | 57.4% | 1.7 | $194.91 | $600 | 0.32× |
| 8h | 640 | $167.20 | 33.0% | 3.0 | $597.05 | $1,200 | 0.50× |
| 16h | 1,280 | $352.00 | 10.9% | 9.2 | $3,286 | $2,400 | 1.37× |
| 24h | 1,920 | $554.40 | 3.6% | 27.9 | $15,574 | $3,600 | 4.33× |
| 1 week (40h) | 3,200 | $950.40 | 0.4% | 256 | $243K | $6,000 | 40.5× |
| 2 weeks (80h) | 6,400 | $1,900.80 | 0.002% | 65,536 | $124M | $12,000 | ~10,000× |
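
The figures above come from the interactive model; the sketch from the previous section reproduces the same shape (individual cells differ slightly because the full model also grows the context multiplier with task length):

```python
for hours in [0.25, 0.5, 1, 2, 4, 8, 16, 24, 40, 80]:
    agent, human = agent_cost_per_success(hours), human_cost(hours)
    print(f"{hours:>5}h  agent ${agent:>14,.2f}  human ${human:>9,.2f}  ratio {agent / human:>9.2f}x")
```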

A few things to notice:

  • The transition is sharp. Agents are 3-4× cheaper than humans up to about 4 hours, still roughly half the cost at 8 hours, and then costs explode. By 16 hours the agent is more expensive. By 40 hours it is absurd.
  • The 80-hour row is not a typo. A two-week task with a 5-hour half-life requires, in expectation, 65,536 attempts. Each attempt costs ~$1,900. That is $124 million per success, for a task a human does for $12,000. This is what exponential decay looks like in dollars.
  • The "viable zone" for current agents is roughly sub-day tasks, which maps onto exactly the domain where agents are already demonstrating value (coding sprints, bug fixes, refactoring against test suites).

Finding 1: cost reductions cannot beat the exponential

A natural response: "inference costs are dropping fast, won't this solve itself?" No. Cost per step enters the equation linearly. The half-life enters it exponentially.

I built a sensitivity analysis crossing half-life (rows) against cost per step (columns) for an 8-hour task:

| Half-life ↓ \ $/step → | $0.01 | $0.08 | $0.25 | $0.50 | $1.00 |
|---|---|---|---|---|---|
| 1h | 5.4× | 43× | 135× | 270× | 540× |
| 2h | 0.7× | 5.4× | 17× | 34× | 68× |
| 5h | 0.1× | 0.5× | 1.5× | 2.9× | 5.9× |
| 12h | 0.02× | 0.2× | 0.5× | 1.0× | 2.1× |
| 40h | 0.01× | 0.04× | 0.1× | 0.2× | 0.5× |

Read down the $0.25 column. Going from a 1-hour to 5-hour half-life improves the ratio by 90×. Going from $0.25 to $0.01 per step (a 25× cost reduction!) only improves it by ~9×. The half-life improvement is 10× more valuable than the cost reduction, because it acts on the exponent rather than the base.
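
The same asymmetry falls out of the earlier sketch if you regenerate a comparable grid. The exact cells will differ from the table, which comes from the interactive model and its context-growth settings, but half-life dominates cost per step in the same way:

```python
TASK_HOURS = 8
for hl in [1, 2, 5, 12, 40]:
    ratios = [agent_cost_per_success(TASK_HOURS, cost_per_step=c, half_life=hl) / human_cost(TASK_HOURS)
              for c in [0.01, 0.08, 0.25, 0.50, 1.00]]
    print(f"{hl:>3}h  " + "  ".join(f"{r:>9.2f}x" for r in ratios))
```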

This is the economic translation of Ord's Scaling Paradox. You can keep making each step cheaper, but the number of required attempts is growing exponentially with task length, so you are playing cost reduction against exponential growth. 

Finding 2: the half-life is the whole game

Doubling the half-life from 5h to 10h does not merely double the viable task range. The break-even point for the base case at a 5h half-life is around 12-16h tasks. At a 10h half-life it shifts to around 40-60h. At a 40h half-life, essentially all knowledge-worker tasks become viable.

The METR data shows the half-life has been extending (doubling every ~7 months at the 50% threshold). If this continues, the economics steadily improve. But Ord's analysis of the same data shows that the structure of the exponential decay has not changed; the half-life parameter is just getting longer, the functional form is the same. And crucially, extending the half-life via scaling faces the Scaling Paradox: each increment of per-step reliability improvement costs exponentially more compute. So you are trying to shift an exponential parameter via a process that itself faces exponential costs.

What would actually help is something that changes the functional form: a system that learns from its mistakes during execution, reducing the per-step failure rate on familiar sub-tasks. This is, of course, precisely what continual learning would provide. It is also what Ord points to when he observes that humans show a markedly different decay pattern, maintaining much higher success rates on longer tasks, presumably because they can correct errors and build procedural memory mid-task.
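
A toy illustration of why the functional form matters (this is not from the post's model, and the numbers are made up rather than fitted to any benchmark): with a constant per-step failure rate, success decays exponentially in task length; if the per-step failure rate instead declines toward a floor as sub-tasks become familiar, a crude stand-in for mid-task continual learning, the curve flattens dramatically.

```python
def p_success_fixed(n_steps, p_fail=0.01):
    # Constant per-step failure rate: the half-life pattern.
    return (1 - p_fail) ** n_steps


def p_success_learning(n_steps, p_fail_start=0.01, p_fail_floor=0.001, learn_rate=0.01):
    # Per-step failure decays toward a floor as the agent builds procedural memory mid-task.
    p_success, p_fail = 1.0, p_fail_start
    for _ in range(n_steps):
        p_success *= 1 - p_fail
        p_fail -= learn_rate * (p_fail - p_fail_floor)
    return p_success


for steps in (400, 1600, 6400):  # roughly 5h, 20h and 80h of work at 80 steps/hour
    print(steps, f"fixed={p_success_fixed(steps):.3g}", f"learning={p_success_learning(steps):.3g}")
```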

Finding 3: task decomposition helps but has limits

The obvious objection: "just break the long task into short ones." This genuinely helps. Breaking a 24h task (base case: 4.3× human cost) into twelve 2-hour chunks reduces it dramatically, because each chunk has high success probability.

But decomposition has costs:

  • Coordination overhead: someone or something needs to specify each chunk, hand off context, and integrate outputs. I model this conservatively as 10% of human hourly rate per handoff.
  • Context loss: information degrades at each boundary. The agent solving chunk 7 does not have the implicit context from chunks 1-6 unless you explicitly pass it, which costs tokens and attention.
  • Decomposability: many high-value tasks resist clean decomposition. Architectural decisions, strategic planning, novel research, and anything requiring long-range coherence cannot simply be chopped into independent two-hour units.

In the model, the sweet spot for a 24h task is usually 4-8 chunks (3-6 hours each), bringing the ratio from 4.3× down to roughly 1-2×. Helpful, but it does not make the economics transformative, and it only works for tasks that decompose cleanly.
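
As a rough sketch of the decomposition trade-off using the earlier functions, assuming n-1 handoffs between n chunks and my reading of the 10%-per-handoff coordination charge. Context loss and the full model's growing context multiplier are not captured here, so this version will look more favourable to decomposition than the figures quoted above:

```python
def decomposed_agent_cost(task_hours, n_chunks, hourly_rate=150, handoff_frac=0.10, **kwargs):
    """Run the task as n chunks, each retried until it succeeds, plus a charge per handoff."""
    chunk_hours = task_hours / n_chunks
    agent = n_chunks * agent_cost_per_success(chunk_hours, hourly_rate=hourly_rate, **kwargs)
    coordination = (n_chunks - 1) * handoff_frac * hourly_rate  # assuming n-1 handoffs
    return agent + coordination


for n_chunks in (1, 4, 8, 12):
    cost = decomposed_agent_cost(24, n_chunks)
    print(f"{n_chunks:>2} chunks: ${cost:>10,.2f}  ({cost / human_cost(24):.2f}x human)")
```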

What this means for the investment thesis

The International AI Safety Report 2026, released this week, presents four OECD scenarios for AI capabilities by 2030 (section 1.3) ranging from stagnation to human-level performance. The investment case underlying current infrastructure spending (~$500B+ announced by Meta and OpenAI alone) implicitly requires something like Scenario 3 or 4, where agents can complete multi-week professional tasks with high autonomy.

This BOTEC suggests that's only viable if the half-life extends to 40+ hours, which requires either:

  1. Brute-force scaling, which faces the Scaling Paradox (exponential compute cost per linear reliability gain), or
  2. A qualitative breakthrough in continual learning, which Sutskever and Karpathy both identify as a key bottleneck on the path to AGI-level agents, placing full agent viability 5-20 years and roughly a decade away respectively, and which no frontier lab has yet demonstrated for a general-purpose model.

Without one of these, agent economics remain viable for sub-day tasks in domains with tight feedback loops (coding, data processing, structured analysis) and become rapidly uneconomical for longer, more complex, less verifiable work. That is a large and valuable market! But it is not the market that delivers the roughly $2 trillion in annual AI revenue by 2030 that Bain estimates is needed to justify current infrastructure investment.

The base case, in my view, is that agents become an extraordinarily valuable tool for augmenting skilled workers on sub-day tasks, generating real but bounded productivity gains. The transformative case, where agents replace rather than augment workers on multi-week projects, requires solving the reliability problem at a level that nobody has demonstrated and that some think is years to decades away. In a sense I would see this as good news for agentic ASI timelines.

Interactive model

I built an interactive version of this model where you can adjust all parameters, explore the sensitivity analysis, and test task decomposition. It includes a couple of baseline presets drawn from the scenarios in the sources. You can use it here.

Caveats and limitations

This model is deliberately simple. Real deployments are more complex in several ways:

  • Partial credit: failed attempts often produce useful intermediate work. A human can salvage a 70%-complete agent output faster than doing the task from scratch.
  • Task-specific feedback loops: coding against test suites effectively shortens the task by providing intermediate verification, which is why coding agents work so well. The model does not account for this.
  • Agentic scaffolding: sophisticated orchestration (multi-agent systems, checkpointing, rollback) can improve effective reliability beyond what the raw model achieves.
  • Rapidly falling costs: inference costs have been dropping ~2-4× per year. This matters, even if it cannot beat the exponential on its own.
  • Measurement uncertainty: the half-life parameters are derived from METR's software engineering benchmarks, which may not generalise to other domains.

I think these caveats make the picture somewhat more favourable for agents on the margin, but they do not change the core result that exponential decay in success rate creates an exponential wall that cost reductions and decomposition can only partially mitigate.


This model is deliberately simplified and I'm sure I've gotten things wrong. I'd welcome corrections, extensions, and pushback in the comments.
