We're trying something a bit new this week. Over the last year, Toby Ord has been writing about the implications of the fact that improvements in AI require exponentially more compute. So far, only one post in the series has appeared on the EA Forum.
This week we've put the entire series on the Forum and made this thread for you to discuss your reactions to the posts. Toby Ord will check in once a day to respond to your comments[1].
Feel free to also comment directly on the individual posts that make up this sequence, but you can treat this as a central discussion space for both general takes and more specific questions.
If you haven't read the series yet...
Read it here, or choose a post to start with:
Are the Costs of AI Agents Also Rising Exponentially?
Agents can do longer and longer tasks, but their dollar cost to do these tasks may be growing even faster.
How Well Does RL Scale?
I show that RL training for LLMs scales much worse than pre-training or inference scaling.
Evidence that Recent AI Gains are Mostly from Inference-Scaling
I show how most of the recent AI gains in reasoning come from spending much more compute every time the model is run.
The Extreme Inefficiency of RL for Frontier Models
The new RL scaling paradigm for AI reduces the amount of information a model can learn per hour of training by a factor of 1,000 to 1,000,000. What follows?
Is There a Half-Life for the Success Rates of AI Agents?
The declining success rates of AI agents on longer-duration tasks can be explained by a simple mathematical model: a constant rate of failing during each minute a human would take to do the task. (A brief sketch of this model appears after the list of posts.)
Inference Scaling Reshapes AI Governance
The shift towards inference scaling may mean the end of an era for AI governance. I explore the many consequences.
Inference Scaling and the Log-x Chart
The new trend of scaling up inference compute in AI has come hand in hand with an unusual new type of chart that can be highly misleading.
The Scaling Paradox
The scaling up of frontier AI models has been a huge success. But the scaling laws that inspired it actually show extremely poor returns to scale. What’s going on?
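To give a flavour of the half-life post's model before you click through: if an agent fails at some constant rate during each minute a human would need for the task, then its chance of completing the whole task falls off exponentially with the task's length in human time. Here is a minimal sketch, using illustrative notation ($\lambda$, $T_{1/2}$) that may differ from the post's own:

$$
S(t) \;=\; e^{-\lambda t} \;=\; \left(\tfrac{1}{2}\right)^{t / T_{1/2}}, \qquad T_{1/2} \;=\; \frac{\ln 2}{\lambda},
$$

where $t$ is the time a human would take to do the task, $\lambda$ is the assumed constant per-minute failure rate, and $T_{1/2}$ is the resulting half-life: the human-task length at which the agent succeeds half the time.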
[1] He's not committing to respond to every comment.
