
Update after publication: Mo Putera pointed me to SemiAnalysis data on real Opus 4.7 agentic workflows, which pushed my central estimates for both token usage and blended token price substantially downward. The original version used ~140M tokens per 8-hour workday-equivalent and ~$5 per 1M tokens; I now use assumptions closer to ~10M tokens and ~$1 per 1M tokens, but still with wide probability distributions. This makes the energy bottleneck look less immediate for modest levels of labor-equivalent AI use, but still potentially important for very rapid, large-scale deployment. For example, I now focus more on scenarios like 50% global labor-equivalent output (I previously used closer to 3%), including both replacement and additional AI-generated work, where the updated model implies much smaller but still significant electricity requirements.

Core claim

If AI systems are deployed at the scale implied by serious labor-substitution scenarios, inference energy demand could become larger than training-focused or current-data-center-growth-based AI energy estimates suggest. Whether this happens depends heavily on one poorly measured variable: how many tokens are required per unit of economically useful AI work.

A rough benchmark emerged from this analysis: if future systems can produce a human-workday-equivalent of useful output near the one-million-token mark, large-scale AI labor substitution may be much less energy-constrained. If reliable useful output requires tens or hundreds of millions of tokens per workday-equivalent, inference energy could quickly become a major bottleneck.

Epistemic status

This is an initial, simple model. I am not trying to make a confident point forecast. I am trying to explore whether large-scale AI inference could plausibly become a major energy constraint, and which assumptions drive that possibility.

What the model does

Instead of starting from current data-center buildout, chip deployment, or training-run energy use, I start from hypothetical levels of AI labor-equivalent work and ask what inference energy demand would follow under different assumptions about token use, token prices, inference delivery costs, electricity cost shares, etc.

The core question is not “what will AI energy demand be?” but “what range of inference energy demand follows if AI systems are used for economically meaningful labor-equivalent work at scale?”

Token counts are not the whole story. Energy per unit of useful AI work can also fall through lower FLOPs per token, better hardware efficiency, higher utilization, better scaffolding, or more specialized models. The broader question is whether the full inference-efficiency stack can improve fast enough to keep up with large-scale AI deployment.

My motivation

My motivation for this model was that public training-energy estimates seemed too small to explain how seriously frontier AI actors appear to be treating power, siting, and dedicated energy infrastructure. Large power deals, behind-the-meter generation, nuclear arrangements, and more speculative ideas such as floating or orbital data centers did not make sense given the current, “modest” estimates of future AI energy use.

Uncertainty and model limitations

There are three levels of uncertainty in this piece. First, there is parameter uncertainty: token use per workday-equivalent output, future token prices, delivery-cost shares, electricity-cost shares, and electricity prices are all uncertain. I use probability distributions to reflect this.

Second, for each major parameter, I discuss what would update me toward pushing these distributions higher or lower; what evidence might I not have seen that would change my mind.

Third, there is model uncertainty. The whole framing may be wrong or incomplete. AI systems may map poorly onto human-workday-equivalent units; token prices may not reliably reveal underlying compute or electricity costs; future AI work may be dominated by specialized systems rather than frontier agents; and economic value may be better measured per completed task, per dollar of output, per model-hour, or per successful research step rather than per labor-equivalent workday.

I still think the model is useful because it makes one question more explicit: if ambitious AI deployment scenarios involve very large amounts of useful inference, what energy demand might that imply? The model should therefore be read less as a forecast and more as an initial exploration of whether large-scale inference-driven energy demand could become increasingly binding.

Lastly, my expertise lies in engineering and energy. I am more of a newcomer to compute and AI infrastructure. As such, I might have missed important technical details in this analysis, or made basic errors in my interpretation and use of various metrics. I therefore value any and all criticism that can make this analysis stronger.

Introduction

Most AI energy forecasts I have seen start from the supply side: current data-center growth, projected chip deployment or training-run energy use. Epoch, RAND, IEA, and others have already done useful work along these lines, and this work has helped start making sense of AI energy demand. In this post I try a complementary calculation. Instead of starting with current data centers and extrapolating forward, I start with ambitious AI adoption scenarios — for example, AI systems doing work equivalent to some fraction of the global workforce — and ask what inference energy demand might follow from such assumptions.

Part of what made me interested in this is the apparent mismatch between AI-sector behavior and energy-sector normalcy. Frontier AI actors seem to be treating power, siting, and infrastructure as strategically important: large power deals, nuclear arrangements, behind-the-meter generation, and even more speculative (?) ideas such as floating or orbital data centers keep appearing. These more unusual ideas do not seem motivated by current training-run energy estimates alone. While they may be hype, PR, or option value, it is also possible that they make sense if frontier labs and their investors are planning against a much larger demand picture: large-scale inference compute energy requirements.

The reason inference matters is that many of the largest claims about AI’s future impact are about replacing or augmenting large amounts of human labor, creating very large economic value, or running large numbers of agents. If those scenarios happen, energy use may become dominated by inference rather than training.

This post therefore uses a simple top-down model. I start with labor-equivalent AI adoption scenarios and ask what energy demand follows under different assumptions about token usage, token price, compute cost, and electricity cost. The point is to start a discussion around the energy implications of large-scale AI adoption. Moreover, the model also highlights a currently poorly measured variable: how many tokens are needed per unit of useful work.

If that number is low, energy may be much less constraining. But if that number is high, inference energy could become a serious bottleneck. Other factors that could indicate a lower inference energy demand are lower FLOPs per token or improved compute energy efficiency. Thus, this is more than just a claim about token counts: It is a question about the full inference efficiency stack.

I use “energy” rather than only “electricity” because the relevant constraints may include not just grid electricity, but also on-site gas turbines, nuclear plants and their associated upstream supply chains.

If inference energy turns out to be much larger than training energy, then energy may be worth investigating not only as an infrastructure constraint, but possibly also as a policy-relevant pacing or governance surface. I am less confident in that governance claim than in the narrower claim that inference-energy demand deserves more attention. A follow-up piece could explore those possible intervention points in more detail.

Use of Monte Carlo analysis and uncertainty handling

The calculations in this article are made using Guesstimate, a user-friendly online Monte Carlo simulation calculator, so uncertainty is baked into the final projections. In the sections below I give an overview of the main calculation steps, the uncertainty around each, and what would update my prediction lower or higher.

Instead of adding or multiplying specific numbers, Guesstimate draws the numbers randomly from probability distributions. This suits a task where every model parameter is highly uncertain: I do not need to settle on a specific number but can instead say things like “I think the tokens used per human-workday-equivalent task will be somewhere between 45 and 490 million tokens”. The calculator then computes the final energy estimate hundreds of times, each time using a different number drawn from the 45-490 million range. In the end, the resulting energy estimates are used to calculate the median and other percentiles (25th, 90th, etc.) of energy demand from AI inference.

A simplified example of Monte Carlo might be an estimate of how many muffins you need for your birthday party. You do not know:

  • How many guests will come, G, or
  • How many muffins they will eat individually, M_i

The estimate of the total number of muffins needed, M_tot, is then:

M_tot = G * M_i

This simplified Monte Carlo does 3 runs, pulling numbers based on your defined distributions for each variable:

  1. M_tot = 5 * 2 = 10
  2. M_tot = 8 * 2 = 16
  3. M_tot = 6 * 3 = 18

So the median is 16 muffins, and the “75th percentile” (I know, not proper stats, but for illustration!) is 18. So maybe (ignoring many simplifications) if you want to be about 90% sure you have enough muffins, you get 20, leaving a ~10% chance of not having enough. This is very much from the hip; actual Monte Carlo simulations use hundreds of such runs for better statistical calculations.
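The same muffin estimate can be sketched in a few lines of Python. This is only an illustration: the two input distributions (5 to 8 guests, 2 to 3 muffins each) and the run count are assumptions chosen to resemble the three runs above, not values from the Guesstimate model.

```python
import random
import statistics


def simulate_muffins(n_runs=10_000, seed=0):
    """Monte Carlo estimate of M_tot = G * M_i.

    Assumed (illustrative) inputs: G is uniform on 5..8 guests and
    M_i is uniform on 2..3 muffins per guest.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(n_runs):
        guests = rng.randint(5, 8)        # G
        muffins_each = rng.randint(2, 3)  # M_i
        totals.append(guests * muffins_each)
    return sorted(totals)


def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list, q in [0, 100]."""
    index = min(len(sorted_values) - 1, int(q / 100 * len(sorted_values)))
    return sorted_values[index]


totals = simulate_muffins()
median_muffins = statistics.median(totals)
p90_muffins = percentile(totals, 90)
print(f"median: {median_muffins}, 90th percentile: {p90_muffins}")
```

With many runs instead of three, the percentiles stop depending on which particular draws happened to come up, which is the reason the real model uses far more samples.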

Note that because Guesstimate continuously redraws numbers from our defined distributions, the numbers the model gives will move around from time to time. Therefore, the numbers for energy consumption herein are rounded and approximate but should be close to the numbers you will find in the Guesstimate model.

Here is the Guesstimate model: https://www.getguesstimate.com/models/26731

Calculation steps and uncertainties/assumptions

Here is an overview of the calculation steps that are explained in more detail in this section:

| Model step | Input / assumption | Current central value |
| --- | --- | --- |
| AI work scale | Workdays-equivalent per year | Scenario input |
| Tokens per workday-equivalent | PERT(5, 700, 2)*100000 | ~12M tokens |
| Token price | exp(PERT(log(0.05), log(9), log(1))) | ~$1.4 / 1M tokens |
| Delivery cost share of token revenue | PERT(0.2, 1.2, 0.6) | ~60% |
| Electricity share of delivery cost | PERT(0.05, 0.2, 0.11) | ~12% |
| Electricity price | PERT(0.04, 0.08, 0.06) | ~$0.06/kWh |
| Output | Inference electricity use | TWh/year |
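
As a rough sketch of how these steps chain together, the Python below multiplies the inputs through to electricity use. Several things here are assumptions rather than facts about the Guesstimate model: the `pert` helper uses the textbook modified-PERT (Beta) form with shape 4, which may differ from Guesstimate's implementation; the order of multiplication is my reading of the table; and `WORKDAYS_PER_YEAR` is a placeholder scenario input, not a number from this post.

```python
import math
import random


def pert(rng, low, high, mode, shape=4.0):
    """Sample a modified-PERT variate through its Beta representation.

    Assumption: textbook PERT with shape 4. Note that PERT(5, 700, 2) has
    its third argument below the first; the formula still gives positive
    Beta parameters, so sampling works.
    """
    alpha = 1.0 + shape * (mode - low) / (high - low)
    beta = 1.0 + shape * (high - mode) / (high - low)
    return low + (high - low) * rng.betavariate(alpha, beta)


def sample_kwh_per_workday(rng):
    """One draw of inference electricity (kWh) per workday-equivalent."""
    tokens = pert(rng, 5, 700, 2) * 100_000
    usd_per_m_tokens = math.exp(
        pert(rng, math.log(0.05), math.log(9), math.log(1))
    )
    delivery_share = pert(rng, 0.2, 1.2, 0.6)      # share of token revenue
    electricity_share = pert(rng, 0.05, 0.2, 0.11)  # share of delivery cost
    usd_per_kwh = pert(rng, 0.04, 0.08, 0.06)
    revenue_usd = tokens / 1e6 * usd_per_m_tokens
    electricity_usd = revenue_usd * delivery_share * electricity_share
    return electricity_usd / usd_per_kwh


def simulate_twh_per_year(workdays_per_year, n_runs=20_000, seed=0):
    """Sorted Monte Carlo draws of inference electricity use in TWh/year."""
    rng = random.Random(seed)
    return sorted(
        sample_kwh_per_workday(rng) * workdays_per_year / 1e9  # kWh -> TWh
        for _ in range(n_runs)
    )


# Placeholder scenario input (hypothetical, NOT a figure from the post).
WORKDAYS_PER_YEAR = 1e11

draws = simulate_twh_per_year(WORKDAYS_PER_YEAR)
p25, p50, p90 = (draws[int(q * len(draws))] for q in (0.25, 0.50, 0.90))
mean_twh = sum(draws) / len(draws)
print(f"p25={p25:.0f}  p50={p50:.0f}  p90={p90:.0f}  mean={mean_twh:.0f} TWh/year")
```

Because every step is a multiplication, doubling any single input doubles the output, which is why the widest distribution (tokens per workday-equivalent) dominates the spread of the result.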

 

1 – How many worker equivalents might AI replace?

One uncertainty in my model is whether human-workday-equivalent output is the right unit to use at all. Still, I chose this unit because it gives a rough bridge between two things that otherwise do not obviously connect:

  1. First, some of the best public evidence on agentic AI capabilities is already framed in human-time terms. For example, time-horizon evaluations ask how long a task takes humans, and then estimate what task length a model can complete with some probability.
  2. Second, many claims about AI’s future impact are framed in terms of labor replacement, productivity growth, or economic output. Human-workday-equivalent output is therefore a way to connect capability measurements to economic scale: it gives something to anchor on when asking how much economic activity a given number of tokens, compute, or energy might represent.

But AI systems may not come in human-shaped units. They can be copied, specialized, run in parallel, combined into larger systems, or used in workflows that do not map cleanly onto jobs or workers. This could make the human-equivalent framing wrong in either direction.

Ways this unit could make the model overstate inference energy demand:

  • AI systems may produce economically useful output in highly efficient workflows that do not resemble human jobs, making human-workday-equivalent output too pessimistic as a unit.
  • Useful AI work may not require one frontier model doing a full human-workday-sized task. Some work may be decomposed into smaller subtasks or be handled by narrow automated systems rather than human-like agents. For example, perception, classification, retrieval, monitoring, routing, or simulation tasks may be done by specialized models, scripts, or tools that are far more energy-efficient than a general frontier model.
  • AI systems may have lower organizational friction than humans: no hiring process, less onboarding, less interpersonal variance, and easier replication of successful workflows.

Ways this unit could make the model understate inference energy demand:

  • AI may not merely substitute for existing human labor, but create large new categories of demand. Cheap agents could make it normal to run many parallel attempts, speculative tasks, simulations, experiments, code-generation workflows, research directions, and automated services that no human team would otherwise have been assigned to do.
  • Some future systems may not decompose into discrete human-workday-sized tasks. A long-running AI research or operations system might continuously maintain context, monitor inputs, spawn subagents, evaluate outputs, verify its own work, and update plans. If so, counting only completed “worker-equivalent tasks” may miss substantial background inference demand.

Some of these issues also affect token usage per unit of work. I discuss those mechanisms more directly in the token-usage section below.

In summary, I use human-workday-equivalent output only as a crude but intuitive parameter that also lets other research such as labor replacement or economic activity plug more directly into this model. The point is not that AI systems will literally replace humans in neat worker-sized units. The point is that ambitious AI adoption scenarios imply large amounts of economically useful AI work, and we need some way to translate that work into inference demand. The correct unit may eventually be something else entirely — perhaps tokens per completed task, tokens per dollar of useful output, tokens per successful research step, or model-hours of economically useful cognition. For now, human-equivalent work is a useful way to make the scale of the energy question visible.

2 – How to compare human and AI work (8 hr task length)

After choosing human-equivalent output as a rough scaling unit, the next question is what time granularity to anchor on:

  1. I could have used shorter tasks, such as 30-minute or 1-hour tasks, and then multiplied by the number of such tasks needed to make a workday or a year of human-equivalent output.
  2. I could also have tried to estimate token usage for much longer units, such as a week-long project, a month-long autonomous research effort, or a company’s multi-year goal.

Instead, I use an 8-hour human-workday-equivalent as a middle ground. It is long enough to feel closer to economically meaningful independent work than a short benchmark task, but not so long that I need to model fully autonomous multi-day or multi-month projects with changing goals, accumulated context, feedback loops, coordination between agents, and unclear boundaries between one task and the next.

Why the time anchor matters

Token usage probably does not scale linearly with task length.

Eight 1-hour tasks are not necessarily equivalent to one 8-hour task. Shorter tasks will usually require fewer tokens. But a single longer task may require more context, planning, tool use, verification, backtracking, and ability to recover from mistakes. So lower token counts per hour seem more plausible at shorter time horizons, while longer tasks with more context and complexity may push token usage up nonlinearly.

This means task horizon and token usage are not independent variables. The token distribution I would use for a 1-hour task is not the same distribution I would use for an 8-hour task, and neither is the same as the distribution I would use for a week-long project. Token usage is conditional on the chosen work unit.

The next section estimates token usage conditional on this 8-hour anchor.

What would update me toward a shorter anchor?

Shorter task horizons would be more appropriate if economically useful AI work mostly happens through many small, well-scaffolded tasks.

I would update toward a shorter task-time anchor if:

- substantial labor substitution happens through short tasks that can be chained together without much long-horizon overhead;

- useful AI work looks less like “one agent works for a day” and more like many small task completions coordinated by tools, workflow software, retrieval systems, tests, or humans;

- external memory, scaffolding, and verification systems carry most of the continuity that a human worker would normally keep in their head;

- short task benchmarks turn out to predict real economic usefulness better than longer-horizon benchmarks;

- systems of smaller or more specialized AIs can collaborate cheaply enough that long single-agent task horizons become less relevant.

If this is right, the 8-hour anchor may be too demanding, and the token-usage estimate would likely move downward — unless coordination between short tasks adds back much of the saved token use.

What would update me toward a longer anchor?

Longer task horizons would be more appropriate if economically important AI work requires sustained context, sequential reasoning, or persistent planning across longer arcs.

I would update toward a longer task-time anchor if:

- useful AI work looks more like multi-day debugging, research, strategy, architecture, or planning than discrete workday-sized tasks;

- the economically valuable part is not doing many subtasks, but maintaining coherent direction across them;

- long-horizon agentic systems show large gains from keeping context, memory, and plans active over time;

- real-world deployment requires many cycles of testing, critique, revision, and verification before useful output is accepted;

- future AI systems look less like task-by-task tools and more like persistent AI workers, teams, or organizations.

If this is right, the 8-hour anchor may be too short, and token usage could rise nonlinearly because longer tasks may require larger context windows, more retrieval, more self-checking, more tool use, more failed branches, and more coordination.

Why METR-style time horizons are useful but limited

METR-style time-horizon evaluations are useful because they already compare AI performance to human task-completion time, especially in software-related domains where AI is currently most visibly useful and where benchmarks are more developed.

But I do not want to overinterpret them. A measured long time horizon on a benchmark suite does not necessarily mean broad real-world labor substitution. Some tasks that take humans a long time may be decomposable, scaffold-friendly, or routine once the right process is found. Conversely, some short tasks may be hard because they require unusual judgment, tacit knowledge, or a hard conceptual leap.

This matters because time horizon is a lossy proxy. It may mix very different kinds of difficulty: persistence, decomposition, tool use, context length, verification burden, and actual reasoning depth.

For now, I use 8 hours because it is intuitive, close to the workday language used in labor-substitution discussions, more economically meaningful than very short tasks, and easier to model than week- or month-long autonomous projects. Future work should probably model several task horizons separately — for example 1-hour, 8-hour, and multi-day tasks — instead of pretending there is one universal token cost for economically useful AI work.

3 – Estimated token usage at 8 hr task length

After fixing the work unit at an 8-hour human-workday-equivalent, the next step is to estimate how many tokens such a “work package” might require.

This is the most important and least certain part of the model. In the Guesstimate sensitivity analysis, token usage per workday-equivalent is the largest driver of the final energy result. It might also be the parameter where better data would most change my view.

In the original version of the model, which this section describes, I use a high-inference scenario with a median of 140M tokens for an 8-hour workday-equivalent, with most of the distribution between roughly 38M and 450M tokens (the updated model uses a central value of roughly 12M tokens; see the update note at the top). I do not want readers to treat this as a confident forecast. It is better understood as one possible frontier-agent regime. The stronger claim is that the final energy estimate is highly sensitive to tokens per unit of useful work, and that we currently have poor evidence for what this number should be.

Why token usage is hard to estimate

Token usage is not just a function of how long a task takes a human. As discussed above, the estimate is conditional on the 8-hour workday-equivalent anchor. For a given anchor, token usage still depends on task type, decomposability, model choice, scaffold, tool use, context length, retries, verification requirements, and what counts as an accepted output. The time-horizon data I built this estimate on covers only specific types of work, and token usage could be higher or lower across the full range of economically relevant work that AIs perform in the future.

Why I use the METR token-vs-time-horizon plot

The dataset I use is not ideal. It was created for a different purpose: to investigate how additional token budget affects measured task horizon in METR’s evaluation of GPT-5.1-Codex-Max and other models.

Still, I feel like I have to use it because it connects token usage to human-calibrated task length. That is close to the relationship I need for this model, even if the benchmark setting is not the same as real-world AI labor.

In particular, I use the plot below, which shows agent performance on HCAST and RE-Bench by allowed token count.

There are several caveats:

  • METR warns against exactly what I use this data for: absolute token comparisons across models can be misleading because scaffolds and API setups differ.
  • The plot is based on benchmark tasks, not general real-world work.
  • The plotted relationship is about allowed token budget, not necessarily the minimum tokens required for useful deployed output (I try to calibrate for this below).
  • The right token number for real-world labor-equivalent work may differ substantially depending on task type, reliability requirements, and workflow design.

Still, I use this plot as a rough anchor, not as direct evidence that future 8-hour AI work must cost any particular number of tokens.

How I construct the range

The plot above suggests that current frontier systems can reach longer measured time horizons by consuming higher token budgets, but also that marginal returns eventually diminish in the tested setup.

To estimate a plausible range given these marginal returns, I first cut off the final “bump” in some curves, where performance increases near the end of the token budget. This is said to reflect the model being prompted to submit a solution before running out of tokens, rather than genuine smooth gains from additional token budget.

Next, I define a rough “plateau” point: the place where marginal returns are low. For simplicity, I walked backward from the end of the curve without the bump, until the measured task horizon had fallen by about 10 minutes. These cutoff points are marked as circles in the second plot below.
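
The backward walk can be written down as a small function. This is a minimal sketch of the rule as described, not the procedure actually run on METR's data: the curve values below are made up for illustration, and the function assumes the end-of-budget bump has already been removed.

```python
def find_plateau_index(horizon_minutes, drop_minutes=10.0):
    """Walk backward from the last point of a token-budget curve.

    `horizon_minutes` is assumed to be ordered by increasing token budget,
    with any end-of-budget bump already cut off. Returns the index of the
    first point (counting from the end) whose horizon is at least
    `drop_minutes` below the final value.
    """
    final = horizon_minutes[-1]
    for i in range(len(horizon_minutes) - 1, -1, -1):
        if final - horizon_minutes[i] >= drop_minutes:
            return i
    return 0  # the curve never drops by the threshold; use the first point


# Made-up curve for illustration only (NOT METR data).
token_budgets_m = [1, 2, 4, 8, 16, 32]     # allowed tokens, millions
horizons_min = [20, 45, 80, 105, 112, 116]  # measured horizon, minutes

plateau = find_plateau_index(horizons_min)
print(f"plateau at ~{token_budgets_m[plateau]}M tokens, "
      f"{horizons_min[plateau]} min horizon")
```

The 10-minute threshold is a judgment call; a larger threshold moves the cutoff to lower token budgets and a smaller one moves it closer to the end of the curve.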

I tried several extrapolations from these points. The results ranged from a future model reaching 8-hour tasks at roughly the same token usage as GPT-5.1-Codex-Max, up to hundreds of billions of tokens. That range seemed too sensitive to the extrapolation method to be useful.

So instead of relying on a single extrapolation, I define a broad lower and upper range and model the interval with a lognormal distribution skewed toward lower token usage.

Lower end: around 40M tokens

For the lower end, I use roughly 40M tokens for an 8-hour workday-equivalent.

This could be seen as optimistic. It assumes that a future system could reach an 8-hour task horizon at only about twice the token usage where GPT-5.1-Codex-Max reaches roughly a 2-hour task horizon in the plot. In other words, it assumes significant gains in capability or efficiency relative to current models.

This could happen if:

  • future models become much more capable per token;
  • scaffolding improves substantially;
  • tasks decompose cleanly;
  • verification and tool use become more efficient;
  • or future economically useful work is mostly made of shorter tasks that compose cheaply.

So, I treat 40M as something like an optimistic lower bound for this high-inference 8-hour-task framing, not as a central forecast.

Upper end: around 1.3B tokens

For the upper end, I use a rough economic ceiling.

The idea is: if token costs became very low, how many tokens could an AI system spend on an 8-hour workday-equivalent before it stopped being competitive with human labor?

Using an aggressive future price of $0.15 per 1M tokens and a rough human labor cost of $25 per hour gives an upper cap of around 1.3B tokens.
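
The arithmetic behind this cap is simple enough to spell out. The function name below is my own; the inputs are the $25 per hour, 8 hours, and $0.15 per 1M tokens stated above.

```python
def max_tokens_per_workday(hourly_wage_usd, hours, usd_per_million_tokens):
    """Tokens that cost as much as the human labor they would replace."""
    labor_cost_usd = hourly_wage_usd * hours
    return labor_cost_usd / usd_per_million_tokens * 1_000_000


ceiling = max_tokens_per_workday(
    hourly_wage_usd=25.0, hours=8.0, usd_per_million_tokens=0.15
)
print(f"{ceiling / 1e9:.2f}B tokens")  # 1.33B tokens
```

The cap scales linearly with the wage and inversely with the token price, so a $50 per hour wage or a $0.075 price would each double it.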

This is obviously crude. It assumes token price is the relevant economic constraint, ignores many deployment details, and uses a low human labor cost. But it is useful as a sanity check: if AI needs substantially more than this number of tokens for an 8-hour workday-equivalent, then even very cheap tokens may struggle to make broad labor substitution economically attractive.

It also prevents the model from assigning too much weight to extreme extrapolations such as hundreds of billions of tokens per workday-equivalent.

Distribution used in the model

The updated Guesstimate model uses:

=PERT(5, 700, 2)*100000

which has a central value of roughly 12M tokens. The original version, which the rest of this section still describes, used:

=lognormal(18.73, 0.743)

This gives a median around 140M tokens, with most of the distribution roughly between 38M and 450M tokens.
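
The quoted lognormal figures can be sanity-checked analytically. The sketch below assumes that the two arguments of `lognormal(18.73, 0.743)` are the mean and standard deviation in log space; that is my reading of the parameterization, chosen because it reproduces the stated median.

```python
import math
from statistics import NormalDist

MU, SIGMA = 18.73, 0.743  # assumed to be log-space mean and std deviation


def lognormal_quantile(p, mu=MU, sigma=SIGMA):
    """Quantile of a lognormal: exp of the matching normal quantile."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(p))


median = lognormal_quantile(0.50)
p05 = lognormal_quantile(0.05)
p95 = lognormal_quantile(0.95)
print(f"median ~{median / 1e6:.0f}M, "
      f"5th-95th percentile ~{p05 / 1e6:.0f}M to ~{p95 / 1e6:.0f}M tokens")
```

Under this assumption the median is exp(18.73) and the 5th to 95th percentile band lands close to the range quoted above; small differences are expected because Guesstimate reports sampled rather than analytic percentiles.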

Again, this should not be read as “I think the correct number is 140M.” A better interpretation is:

Conditional on using an 8-hour workday-equivalent and considering a high-inference frontier-agent regime, I use a wide distribution around 140M tokens to explore the energy implications.

The more robust point is that the final energy result is very sensitive to this parameter.

What would update me lower?

I would update toward lower token usage per 8-hour workday-equivalent if:

  • real-world economically useful AI work mostly decomposes into shorter tasks that compose cheaply;
  • smaller or specialized models handle much of the work instead of frontier models;
  • scaffolds, memory, retrieval, tool use, and automated verification substantially reduce repeated context use;
  • stronger models need many fewer retries and human corrections;
  • deployed systems show that accepted outputs can be produced with far fewer tokens than benchmark runs suggest;
  • frontier models become much cheaper or more capable per token without task complexity rising at the same time.

In that world, the model’s current token estimate would be too high, and the final energy estimates would fall substantially.

What would update me higher?

I would update toward higher token usage per 8-hour workday-equivalent if:

  • economically important tasks require long context, planning, tool use, backtracking, and verification;
  • multi-agent workflows add substantial overhead through summaries, handoffs, critique, and coordination;
  • accepted outputs require many failed or partially successful attempts;
  • real-world usage shows that users give more capable models harder tasks, larger contexts, and more autonomy, so total token usage rises despite efficiency gains;
  • 8-hour time horizons continue to require sharply higher token budgets;
  • economically valuable AI work depends on frontier models rather than cheaper specialized systems.

In that world, the model’s current median token estimate could be too low, and inference energy would become more constraining.

Success rate and accepted output

One further caveat is that METR-style time horizons often use a 50% success threshold. In this model, I implicitly treat the 8-hour workday-equivalent as something like a 50%-success task unit.

This is not obviously unreasonable. Human work also often involves review, feedback, rework, and failed attempts. But for deployed AI systems, the relevant unit may be tokens per accepted output, not tokens per attempt. If a model needs several attempts before producing something that is actually accepted by users or organizations, then real token usage per useful output could be higher than the benchmark token usage per run.
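
One simple way to quantify this, under a strong assumption: if attempts are independent and each succeeds with the same probability, the expected number of attempts per accepted output is one over the success rate. The per-attempt token figure below is a placeholder, not a model input.

```python
def tokens_per_accepted_output(tokens_per_attempt, success_rate):
    """Expected tokens spent until the first accepted output.

    Assumes independent attempts with a constant success probability, so
    the number of attempts is geometric with mean 1 / success_rate.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return tokens_per_attempt / success_rate


# Placeholder: 10M tokens per attempt at a 50% success threshold.
expected_tokens = tokens_per_accepted_output(10e6, 0.5)
print(f"{expected_tokens / 1e6:.0f}M tokens per accepted output")  # 20M
```

Real retries are unlikely to be independent (a failed attempt may leave useful context, or the task may simply be beyond the model), so this is a first-order adjustment rather than a prediction.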

4 – Cost of tokens

The next step is to convert token usage into dollars. For this I use an effective blended price per 1M tokens. This is not the physical cost of inference; it is the price paid for tokens, which I later convert into compute and electricity cost.

I looked at release prices for frontier models over time:

I have not deeply modelled token accounting. Providers might distinguish between input, output, cached, and sometimes reasoning/thinking tokens, and these can be priced differently. Cached tokens may be cheaper, while internal reasoning/thinking tokens may be billed even when not visible to the user. For simplicity, I use one blended effective price per 1M tokens.

For the lower end, I use $0.15 per 1M tokens. This might represent an aggressive future low-price scenario and would require some combination of cheaper models, lower compute cost per token, better hardware utilization, or subsidized pricing.

For the upper end, I use $20 per 1M tokens. This represents a world where relevant models remain expensive on a per-token basis, for reasons I have not tried to decompose in detail. Some drivers may include larger models, high output-token shares, long contexts, lower utilization, or other factors. It is my understanding that some of these may show up as more tokens per task rather than a higher price per token, but I did not study this in detail.

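In the updated Guesstimate model, I use:

=exp(PERT(log(0.05), log(9), log(1)))

which has a central value of around $1.4 per 1M tokens. The original version, which the bounds above describe, used:

=exp(PERT(log(0.15), log(20), log(5)))

This gives a central value around $5 per 1M tokens, skewed toward lower prices.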

I would update lower if useful work can be done by cheaper models with lower compute cost per token, if hardware utilization improves, etc. I would update higher if economically useful frontier-level work continues to require models with high compute cost per generated token, expensive output/reasoning tokens, etc.

5 – Compute costs as share of token revenue

For this parameter, I estimate what share of token revenue is spent on delivering inference. This is not the same as electricity cost; it is an intermediate step used to move from token price to the cost base that electricity is part of.

Gross margin is useful here as a rough proxy. If a company has a 40% gross margin, then roughly 60% of revenue is spent delivering the service. This is not a clean measure of compute cost, since it can include other items and depends on accounting practices. But it is one of the few public anchors available. The numbers behind the distribution below are as follows:

| Source | URL | Evidence | Implied cost ratio | Role in model |
| --- | --- | --- | --- | --- |
| OpenAI (H1 2025) | https://www.reuters.com/commentary/breakingviews/how-infer-method-to-openais-madness-2025-10-15/ | $4.3B revenue vs $2.5B cost to deliver | 58% | Lower bound of credible central range |
| OpenAI (adjusted) | https://www.reuters.com/technology/openai-sees-compute-spend-around-600-billion-by-2030-cnbc-reports-2026-02-20/ | Adjusted gross margin ~33% | 67% | Upper bound of credible central range |
| Anthropic | https://www.investing.com/news/stock-market-news/anthropic-trims-profit-margin-outlook-as-ai-operating-costs-rise--the-information-4459316 | Gross margin ~40% | 60% | Independent confirmation |
| DeepSeek | https://www.reuters.com/technology/chinas-deepseek-claims-theoretical-cost-profit-ratio-545-per-day-2025-03-01/ | $87k cost vs $562k theoretical revenue | 15.5% | Lower-bound / skew anchor |

Based on the above, I use the following distribution in Guesstimate:

=PERT(0.2, 1.2, 0.6)

This gives a central value around 60%, while allowing values above 100%. I include values above 100% because some inference usage may be subsidized or priced below cost, especially under subscription pricing or aggressive user-acquisition strategies.
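To make the arithmetic behind the anchors and the distribution concrete, here is a small Python sketch. It recomputes the implied cost ratios from the public figures quoted above, then samples PERT(0.2, 1.2, 0.6) to see how much probability mass sits above 100%. The PERT sampler assumes the textbook shape parameter of 4, which may differ from Guesstimate's implementation, and `sample_pert` is a name I made up for illustration.

```python
import random

# Cost to deliver as a share of revenue, from the public anchors quoted above.
anchors = {
    "OpenAI (H1 2025)": 2.5e9 / 4.3e9,   # cost to deliver / revenue
    "OpenAI (adjusted)": 1.0 - 0.33,     # 1 - adjusted gross margin
    "Anthropic": 1.0 - 0.40,             # 1 - gross margin
    "DeepSeek": 87_000 / 562_000,        # daily cost / theoretical daily revenue
}
for name, ratio in anchors.items():
    print(f"{name}: {ratio:.1%}")


def sample_pert(low, high, mode, rng, lam=4.0):
    """One PERT draw via a rescaled Beta; lam=4 is an assumed textbook default."""
    span = high - low
    alpha = 1.0 + lam * (mode - low) / span
    beta = 1.0 + lam * (high - mode) / span
    return low + span * rng.betavariate(alpha, beta)


rng = random.Random(0)  # fixed seed for reproducibility
shares = [sample_pert(0.2, 1.2, 0.6, rng) for _ in range(20_000)]
mean_share = sum(shares) / len(shares)
frac_above_one = sum(s > 1.0 for s in shares) / len(shares)
print(f"mean compute share of revenue: {mean_share:.1%}")
print(f"fraction of samples above 100%: {frac_above_one:.1%}")
```

The second number is the part of the distribution that represents subsidized or below-cost inference.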

I would update this lower if providers show sustained high gross margins. I would update it higher if there is new evidence that frontier inference is subsidized.

6 – Energy cost share of compute cost

With an estimate for inference delivery cost as a share of token revenue, I next estimate what share of that delivery cost is electricity.

By “delivery cost,” I mean the cost of actually running and delivering model outputs to users, excluding things like R&D and marketing. This includes building the data centers, computer hardware, data-center operation, networking, provider overhead, and electricity.

The public anchors I found are sparse, but suggest electricity is material while still being a minority of total delivery cost:

| Source | Key numbers | Implied electricity share of full compute TCO |
| --- | --- | --- |
| https://www.businessinsider.com/why-nvidia-worth-5-trillion-inside-35-billion-ai-datacenter-2025-10?utm_source=chatgpt.com | 1 GW AI data center: $35B capex, $1.3B/year electricity | ~10–16% (if amortized over 3–5 years, before other O&M) |
| https://en.wikipedia.org/wiki/Data_center?utm_source=chatgpt.com#cite_note-112 | Electricity is >10% of total data center TCO (general baseline) | Establishes floor ≳10% in many cases |

Based on the above, I use:

=PERT(0.05, 0.2, 0.11)

This gives a central value around 11%, with a range from 5% to 20%.
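The ~10–16% figure for the 1 GW data center can be reproduced with a few lines of Python. This is a minimal sketch that assumes straight-line amortization of the $35B capex and counts only capex and electricity, ignoring other O&M as the source row does. The function name `electricity_share` is hypothetical.

```python
def electricity_share(capex_usd, electricity_usd_per_year, amortization_years):
    """Electricity as a share of annual cost, counting only capex and power.

    Assumptions: straight-line amortization, no other O&M.
    """
    annual_capex = capex_usd / amortization_years
    return electricity_usd_per_year / (annual_capex + electricity_usd_per_year)


# $35B capex and $1.3B/year electricity, amortized over 3 to 5 years.
shares = {years: electricity_share(35e9, 1.3e9, years) for years in (3, 4, 5)}
for years, share in shares.items():
    print(f"{years}-year amortization: {share:.1%}")
```

Longer amortization periods spread the capex over more years, so the electricity share rises with the assumed hardware lifetime.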

I would update this lower if hardware capex, construction, networking, cloud/provider overhead, or other non-electricity costs dominate more than assumed. I would update it higher if effective electricity prices turn out to be higher than assumed (including the cost of building on-site gas turbines).

One point to note is that electricity can be a small share of total cost and still be a binding constraint, because availability, interconnection, siting, permitting, and power-project timelines can matter more than cost share. That said, the chips on which AI is trained and run are recognized as another likely constraint on AI development.

7 – Converting $ to kWh

To convert electricity cost into kWh, I use Texas electricity prices as a rough proxy for low-cost regions where large AI data centers seem to be deployed.

As a baseline, I use EIA table 4, which indicates roughly $0.06–$0.08/kWh. Large data-center customers may sometimes face lower effective prices through power contracts, wholesale exposure, or behind-the-meter generation.

For the model, I use:

=PERT(0.04, 0.08, 0.06)

This gives a central value of $0.06/kWh, with a range from $0.04 to $0.08/kWh.

I would update lower if AI data centers mostly use very cheap power, and higher if the marginal cost of reliable AI power is above ordinary industrial electricity prices because of grid constraints, dedicated power projects, or interconnection scarcity.
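The conversion itself is a single division, sketched below in Python for the low, central, and high prices used in the PERT above. The $100B annual electricity bill is a purely hypothetical input chosen for illustration, and `dollars_to_twh` is my own helper name.

```python
KWH_PER_TWH = 1e9


def dollars_to_twh(electricity_spend_usd, usd_per_kwh):
    """Convert an electricity bill into energy at a given price per kWh."""
    return electricity_spend_usd / usd_per_kwh / KWH_PER_TWH


spend = 100e9  # hypothetical annual electricity spend, not a figure from the post
twh_by_price = {price: dollars_to_twh(spend, price) for price in (0.04, 0.06, 0.08)}
for price, twh in twh_by_price.items():
    print(f"${price:.2f}/kWh -> {twh:,.0f} TWh")
```

Because the price sits in the denominator, halving the assumed price doubles the implied energy use for the same spend.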

Estimates of future inference energy usage

With the numbers above, we can estimate future energy usage due to inference. With ~1bn human-worker equivalents we get a median of ~5000 TWh per year, roughly 15% of current global electricity consumption.

Note that the number of workers here was chosen to generate an interesting trade-off. We can also look at energy usage at 10%, 100% and 500% of the human workforce, giving central estimates of around 2000 TWh, 20 000 TWh and 100 000 TWh.

Current total global electricity consumption is ~32 000 TWh per year.

Source of total world consumption: https://ourworldindata.org/grapher/electricity-prod-source-stacked
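The chain of parameters can be written as one deterministic calculation. The Python sketch below plugs in central values only: ~10M tokens per workday and ~$1 per 1M tokens from the update note, plus the modes of the three PERT distributions above. The 250 workdays per year is my own assumption and does not come from the post. Because it ignores the spread and skew of the distributions, this point estimate should not be expected to match the Monte Carlo medians quoted above.

```python
def inference_twh_per_year(
    workers,
    tokens_per_workday,
    usd_per_million_tokens,
    compute_share_of_revenue,
    electricity_share_of_compute,
    usd_per_kwh,
    workdays_per_year=250,  # assumption for illustration, not from the post
):
    """Chain token use -> token revenue -> electricity spend -> TWh per year."""
    tokens_per_year = workers * tokens_per_workday * workdays_per_year
    token_revenue_usd = tokens_per_year / 1e6 * usd_per_million_tokens
    electricity_spend_usd = (
        token_revenue_usd * compute_share_of_revenue * electricity_share_of_compute
    )
    kwh = electricity_spend_usd / usd_per_kwh
    return kwh / 1e9  # kWh -> TWh


estimate = inference_twh_per_year(
    workers=1e9,
    tokens_per_workday=10e6,
    usd_per_million_tokens=1.0,
    compute_share_of_revenue=0.6,
    electricity_share_of_compute=0.11,
    usd_per_kwh=0.06,
)
print(f"point estimate: {estimate:,.0f} TWh per year")
```

Every input enters multiplicatively (or as a divisor), so a 10x change in tokens per workday or token price moves the result by 10x, which is why those two parameters dominate the uncertainty.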

Example: Token usage required for 50% labor replacement

Suppose we want to know what is required to reach 50% labor replacement by 2030. We set the increase in global electricity production to 1000 TWh/year (1500 TWh being the largest addition historically). Writing in 2026, this allows for 4 years of such additions, totaling 4000 TWh. Trying different values, 10M tokens per human-workday equivalent gives a central estimate of 11 000 TWh, close to our target. This means that, holding hardware energy efficiency and all else constant, we need to keep token usage to around 10M tokens per human-workday equivalent, on average across all tasks.
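The same question can be asked in reverse: given an electricity budget, how many tokens per workday-equivalent fit inside it? The Python sketch below inverts the parameter chain using central values only. Two inputs are my own assumptions and not from the post: 250 workdays per year, and 1.75bn worker-equivalents as a stand-in for 50% of a roughly 3.5bn global workforce. As a point estimate, it will not reproduce the Guesstimate central estimates, which depend on the full distributions.

```python
def max_tokens_per_workday(
    energy_budget_twh,
    workers,
    usd_per_million_tokens,
    compute_share_of_revenue,
    electricity_share_of_compute,
    usd_per_kwh,
    workdays_per_year=250,  # assumption for illustration, not from the post
):
    """Invert the chain: TWh budget -> electricity spend -> revenue -> tokens."""
    electricity_spend_usd = energy_budget_twh * 1e9 * usd_per_kwh
    token_revenue_usd = electricity_spend_usd / (
        compute_share_of_revenue * electricity_share_of_compute
    )
    tokens_per_year = token_revenue_usd / usd_per_million_tokens * 1e6
    return tokens_per_year / (workers * workdays_per_year)


budget_tokens = max_tokens_per_workday(
    energy_budget_twh=4000,
    workers=1.75e9,  # assumed: half of a ~3.5bn global workforce
    usd_per_million_tokens=1.0,
    compute_share_of_revenue=0.6,
    electricity_share_of_compute=0.11,
    usd_per_kwh=0.06,
)
print(f"token budget: {budget_tokens / 1e6:.1f}M tokens per workday-equivalent")
```

The token budget scales linearly with the energy budget and inversely with the number of worker-equivalents, so changing either assumption shifts the answer proportionally.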

  • I used an LLM to help create this post and it likely contains ">10% AI-generated text". That said, I really do not know how to measure this 10% number. My workflow is that I write something, then use AI to tighten it and improve flow. I then give feedback to the AI on its rewriting attempts. I might then finally go over it and do some small edits myself. Or not. So from that perspective you might say "80% of the text was copied from an LLM output" - but the text is far from AI generated in my mind - these are my ideas and analyses, my structure and so on. The AI might be more of a polish on top, kind of like an improved spell check. I also bounced some technical ideas and framing questions with an LLM. Feel free to give me guidance for future posts on how to disclose my AI use. Also happy to share DMs with links to LLM convos on how I work.


Comments (4)

SemiAnalysis' recent newsletter provides some data points on token spend vs labor cost ROIs for actual 1-20 hour tasks. 

SemiAnalysis has written and talked extensively about our Claude Code usage, but it is important to emphasize that agentic AI is no longer limited to just coding. Our analysts are using agents every day to convert excel models into dashboards, create charts for all our notes, build financial models and analyze company earnings, and much more. These are all tasks that either 1) we simply wouldn’t have been able to do before or 2) would’ve previously taken our junior analysts many hours, taking them away from far more value added tasks.

The table below shows a handful of real examples from our own workflows, comparing token spend against what the equivalent human labor would have cost:

... We estimate that the true blended price per million tokens for running Opus 4.7 on agentic tasks at $0.99 despite the sticker price being $5/$25 per MTok. Agentic workloads have extremely high input-to-output ratios (our Claude Code usage has a ratio of about 300:1) and high cache hit rates (90%+). Because cached input tokens only cost $0.50/MTok, most of the tokens end up in the cheapest tier. We walk through the full methodology here.

Eyeballing, it looks like 8 hours of analyst-type work costs them $7-30 in Opus 4.7 token spend, so (very roughly) 7-30M tokens at their true blended price of ~$1 per M tokens, in contrast with the post's 40-1,300M token estimate, and already squarely here. I expect token usage to drop further for a given task with more advanced models, and also to vary a lot depending on (essentially) how much the big AI companies prioritise RLVR-ing them and on model jaggedness, but also for doable tasks to get much more complicated, like this and more.

Epoch BOTEC-ed a related question last year, prior to Claude Code: How many digital workers could OpenAI deploy? My main takeaway was "worker equivalents is probably more misleading than helpful if people just skim headline numbers" (which everyone does, speaking as someone who sometimes needs to produce headline numbers). 

On the tasks that AIs are able to perform today, how many “human-equivalent digital workers” could frontier AI labs deploy to work on them?

Based on a speculative back-of-the-envelope calculation, we estimate that companies like OpenAI have the hardware to deploy on the order of 7 million digital workers, with a wide 90% confidence interval of 400,000 to around 300 million.2 This doesn’t mean that OpenAI could do the jobs of 7 million human employees today, because AIs can’t fully substitute for humans. But as AI progress continues, AIs will be able to perform an increasing fraction of the tasks that humans currently do.

Thanks so much Mo! I am tempted to make the following updates already - does this seem roughly right? Or is this still too high?

  1. Token usage at 8 hrs centered on 5M tokens, with an upper limit closer to 100M. The reasoning:
    1. The upper range of 100M reflects that more complex tasks (assuming those from the study you quoted were low-hanging fruit) might push this higher (as indicated by the compiler example), while
    2. efficiency gains might push it lower; judging very, very crudely from METR's GPT-5.1-Codex-Max work <6 months ago, token usage already seems to be going down.
  2. Token price centered at $1 per million tokens, instead of $5. I could make this even lower as $1 might show a downward trend, but at the same time this low price seems more to be due to cache tokens which I had ignored in my analysis - the input and output tokens still seem priced at roughly the price I found

At the same time, I also feel like these numbers might still be too high - especially token price. The reason is that the super helpful links you sent point at pretty steep downward trends on token cost and point well taken on cache tokens being much cheaper.

(I'm not at all an expert on any of this, please discount appropriately)

  1. Agree with reasoning for directional adjustment and bounds, magnitude-wise seems a bit overcorrected? SemiAnalysis' figures roughly suggest 15M center. But you're on track to becoming correct given token efficiency trends anyhow
    1. I wish I had a more empirically-grounded sense of how token usage varies by type of task, fixing task duration at 8 hours for a human professional (that you'd pay $400/day for, say). My guess from comparing model vs human jaggedness (e.g. this) is that leadership-level / early-employee / entrepreneurial / high-context / taste-heavy work would require way more tokens to get 8 hours of work done than the routine analyst-type / junior SWE etc tasks typical of benchmarks
  2. My sense is global average cost per token will go down a lot due to the following, but very unclear as to the mix
    1. a key driver of inference demand going forward being very cache tokens-heavy agentic workflows
    2. a rising share of demand being satisficing not maximising w.r.t output quality for ever-growing task share (e.g. plan with Opus -> code with Sonnet or even DeepSeek models at 1-2 OOM cheaper price point)
    3. race to the bottom pricing wars (DeepSeek again)

Just a note that I am currently updating the text based on Mo's excellent feedback. If anyone knows a good way to update posts in response to new evidence, I would love to know how (something like a process for epistemic version history).
