We're trying something a bit new this week. Over the last year, Toby Ord has been writing about the implications of the fact that improvements in AI require exponentially more compute. Only one of these posts so far has been put on the EA forum.
This week we've put the entire series on the Forum and made this thread for you to discuss your reactions to the posts. Toby Ord will check in once a day to respond to your comments[1].
Feel free to also comment directly on the individual posts that make up this sequence, but you can treat this as a central discussion space for both general takes and more specific questions.
If you haven't read the series yet...
Read it here
...or choose a post to start with:
Are the Costs of AI Agents Also Rising Exponentially?
Agents can do longer and longer tasks, but their dollar cost to do these tasks may be growing even faster.
How Well Does RL Scale?
I show that RL-training for LLMs scales much worse than inference or pre-training.
Evidence that Recent AI Gains are Mostly from Inference-Scaling
I show how most of the recent AI gains in reasoning come from spending much more compute every time the model is run.
The Extreme Inefficiency of RL for Frontier Models
The new RL scaling paradigm for AI reduces the amount of information a model could learn per hour of training by a factor of 1,000 to 1,000,000. What follows?
Is There a Half-Life for the Success Rates of AI Agents?
The declining success rates of AI agents on longer-duration tasks can be explained by a simple mathematical model — a constant rate of failing during each minute a human would take to do the task.
Inference Scaling Reshapes AI Governance
The shift towards inference scaling may mean the end of an era for AI governance. I explore the many consequences.
Inference Scaling and the Log-x Chart
The new trend to scaling up inference compute in AI has come hand-in-hand with an unusual new type of chart that can be highly misleading.
The Scaling Paradox
The scaling laws that inspired the rush to scale up AI actually show extremely poor returns to scale. What's going on?
The Scaling Series
Toby Ord analyses why AI scaling costs are exploding while returns diminish, and what that means for the future.
You can discuss the series with him all week, in the discussion thread.
Summary: RL-training for LLMs scales surprisingly poorly. Most of its gains come from allowing LLMs to productively use longer chains of thought, letting them think longer about a problem. There is some improvement for a fixed length of answer, but not enough to drive AI progress. Given that the scaling-up of pre-training compute has also stalled, we'll see less AI progress via compute scaling than you might have thought, and more of it will come from inference scaling (which has different effects on the world). That lengthens timelines and affects strategies for AI governance and safety.
The current era of improving AI capabilities using reinforcement learning (from verifiable rewards) involves two key types of scaling:
1. Scaling the amount of compute used for RL during training
2. Scaling the amount of compute used for inference during deployment
We can see (1) as training the AI in more effective reasoning techniques and (2) as allowing the model to think for longer. I’ll call the first RL-scaling, and the second inference-scaling. Both new kinds of scaling were present all the way back in OpenAI’s announcement of their first reasoning model, o1, when they showed this famous chart:
I’ve previously shown that in the initial move from a base-model to a reasoning model, most of the performance gain came from unlocking the inference-scaling. The RL training did provide a notable boost to performance, even holding the number of tokens in the chain of thought fixed. You can see this RL boost in the chart below as the small blue arrow on the left that takes the base model up to the trend-line for the reasoning model. But this RL also unlocked the ability to productively use much longer chains of thought (~30x longer in this example). And these longer chains of thought contributed a much larger boost.
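To make that decomposition concrete, here is a stylized sketch with hypothetical numbers (the slope, token counts, and the size of the RL boost below are assumptions for illustration, not figures from the post). It treats the reasoning model's accuracy as rising linearly in the log of its chain-of-thought length, and compares the vertical RL shift at a fixed length with the gain from using a ~30x longer chain of thought.

```python
import math

# Stylized decomposition (hypothetical numbers): accuracy of the reasoning model
# is assumed to rise linearly in log10(chain-of-thought tokens).
slope = 0.15          # assumed: +15 accuracy points per 10x more tokens
base_tokens = 1_000   # assumed chain-of-thought length of the base model
long_tokens = 30_000  # ~30x longer chain of thought after RL (as in the example above)

rl_boost = 0.05  # assumed lift from RL at a *fixed* chain-of-thought length
inference_gain = slope * math.log10(long_tokens / base_tokens)

print(f"RL boost at fixed length:       {rl_boost:.2f}")
print(f"Gain from ~30x longer thinking: {inference_gain:.2f}")  # ~0.22, the larger share
```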
The question of where these capability gains come from is important because scaling up the inference compute has very different implications than scaling up the training compute.
There is an extremely important question about the near-future of AI that almost no-one is asking.
We’ve all seen the graphs from METR showing that the length of tasks AI agents can perform has been growing exponentially over the last 7 years. While GPT-2 could only do software engineering tasks that would take someone a few seconds, the latest models can (50% of the time) do tasks that would take a human a few hours.
As this trend shows no signs of stopping, people have naturally taken to extrapolating it out, to forecast when we might expect AI to be able to do tasks that take an engineer a full work-day, or week, or year.
But we are missing a key piece of information — the cost of performing this work.
Over those 7 years AI systems have grown exponentially. The size of the models (parameter count) has grown by 4,000x and the number of times they are run in each task (tokens generated) has grown by about 100,000x. AI researchers have also found massive efficiencies, but it is eminently plausible that the cost for the peak performance measured by METR has been growing — and growing exponentially.
This might not be so bad. For example, if the best AI agents are able to complete tasks that are 3x longer each year and the costs to do so are also increasing by 3x each year, then the cost to have an AI agent perform tasks would remain the same multiple of what it costs a human to do those tasks. Or if the costs have a longer doubling time than the time-horizons, then the AI-systems would be getting cheaper compared with humans.
But what if the costs are growing more quickly than the time horizons? In that case, these cutting-edge AI systems would be getting less cost-competitive with humans over time. If so, the METR time-horizon trend could be misleading. It would be showing how the state of the art is improving, but part of this progress would be due to more and more lavish expenditure on compute, so it would be diverging from what is economical.
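A minimal sketch of that comparison, using hypothetical doubling times (the post's point is precisely that we do not know the real cost figures): if the time horizon and the cost each grow exponentially, what matters for cost-competitiveness is which doubling time is shorter.

```python
# Hypothetical doubling times (months); none of these figures are from the post.
def relative_cost_growth(months: float, horizon_doubling: float, cost_doubling: float) -> float:
    """Factor by which (AI cost / human cost) at the frontier changes after `months`."""
    return 2 ** (months / cost_doubling - months / horizon_doubling)

print(relative_cost_growth(36, horizon_doubling=7, cost_doubling=7))   # 1.0: same multiple of human cost
print(relative_cost_growth(36, horizon_doubling=7, cost_doubling=12))  # <1: getting cheaper vs humans
print(relative_cost_growth(36, horizon_doubling=7, cost_doubling=5))   # >1: less cost-competitive over time
```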
AI capabilities have improved remarkably quickly, fuelled by the explosive scale-up of resources being used to train the leading models. But if you examine the scaling laws that inspired this rush, they actually show extremely poor returns to scale. What’s going on?
AI Scaling is Shockingly Impressive
The era of LLMs has seen remarkable improvements in AI capabilities over a very short time. This is often attributed to the AI scaling laws — statistical relationships which govern how AI capabilities improve with more parameters, compute, or data. Indeed AI thought-leaders such as Ilya Sutskever and Dario Amodei have said that the discovery of these laws led them to the current paradigm of rapid AI progress via a dizzying increase in the size of frontier systems.
Before the 2020s, most AI researchers were looking for architectural changes to push the frontiers of AI forwards. The idea that scale alone was sufficient to provide the entire range of faculties involved in intelligent thought was unfashionable and seen as simplistic.
A key reason it worked was the tremendous versatility of text. As Turing had noted more than 60 years earlier, almost any challenge that one could pose to an AI system can be posed in text. The single metric of human-like text production could therefore assess the AI’s intellectual competence across a huge range of domains. The next-token prediction scheme was also an instance of both sequence prediction and compression — two tasks that were long hypothesized to be what intelligence is fundamentally about.
By the time of GPT-3 it was clear that scaling was working. It wasn’t just a dry technical metric that was improving as the compute was scaled up — the text was qualitatively superior to GPT-2 and showed signs of capturing concepts that were beyond that smaller system.
Scaling has clearly worked, leading to very impressive gains in AI capabilities over the last five years. Many papers and AI labs trumpet the success of this scaling.
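As a rough illustration of what 'extremely poor returns to scale' can mean for a loss-versus-compute power law (the exponent below is an assumed stand-in chosen only to show the shape, not a number from the post):

```python
# If reducible loss scales as compute**(-alpha) with a small alpha, then each
# halving of the loss requires an enormous multiplicative increase in compute.
alpha = 0.05  # assumed exponent, for illustration only

compute_multiplier_per_halving = 2 ** (1 / alpha)
print(f"{compute_multiplier_per_halving:,.0f}x more compute to halve the reducible loss")
# ~1,048,576x with alpha = 0.05
```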
Building on the recent empirical work of Kwa et al. (2025), I show that within their suite of research-engineering tasks the performance of AI agents on longer-duration tasks can be explained by an extremely simple mathematical model — a constant rate of failing during each minute a human would take to do the task. This implies an exponentially declining success rate with the length of the task and that each agent could be characterised by its own half-life. This empirical regularity allows us to estimate the success rate for an agent at different task lengths. And the fact that this model is a good fit for the data is suggestive of the underlying causes of failure on longer tasks — that they involve increasingly large sets of subtasks where failing any one fails the task. Whether this model applies more generally on other suites of tasks is unknown and an important subject for further work.
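A minimal formalisation of the model just described, with a hypothetical half-life (the 60-minute value below is illustrative, not an estimate from the paper):

```python
# Constant-hazard model: the agent fails at a constant rate during each minute a
# human would take, so its success rate declines exponentially with task length.
def success_rate(task_minutes: float, half_life_minutes: float) -> float:
    """P(success) on a task that takes a human `task_minutes`, for an agent with the given half-life."""
    return 0.5 ** (task_minutes / half_life_minutes)

# Example: a hypothetical agent with a 60-minute half-life
for t in (15, 60, 120, 240):
    print(f"{t:>3}-minute task: {success_rate(t, 60):.0%}")
# 15 min: 84%, 60 min: 50%, 120 min: 25%, 240 min: 6%
```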
METR’s results on the length of tasks agents can reliably complete
A recent paper by Kwa et al. (2025) from the research organisation METR has found an exponential trend in the duration of the tasks that frontier AI agents can solve: every 7 months, the length of task they can solve doubles.
These headline results are based on a test suite of 170 software engineering, cybersecurity, general reasoning, and ML tasks that they assembled to be indicative of the kinds of work AI agents would need to do to assist in AI research. These tasks are drawn from three different benchmarks whose tasks take different amounts of time for humans to complete.
In general, ability to perform a task drops off as its duration increases, so they use the AI agent’s performance on tasks of different lengths to estimate the task-length at which the model would have a 50% success rate. They then showed that this length has been doubling every 7 months as the capabilities of frontier agents improve. The task-lengths are measured by how long it took humans to solve the same tasks.
They used a 50% success rate as the threshold for these headline results.
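For what the extrapolations mentioned earlier in the series look like in practice, here is a sketch that simply assumes the 7-month doubling continues (the 2-hour starting horizon is a hypothetical round number):

```python
import math

doubling_months = 7        # METR's reported doubling time for the 50% time horizon
current_horizon_hours = 2  # assumed starting point, roughly "a few hours"

def months_until(target_hours: float) -> float:
    """Months until the 50% time horizon reaches `target_hours`, if the trend continues."""
    return doubling_months * math.log2(target_hours / current_horizon_hours)

print(f"work-day (8h):      {months_until(8):.0f} months")     # ~14 months
print(f"work-week (40h):    {months_until(40):.0f} months")    # ~30 months
print(f"work-year (~2000h): {months_until(2000):.0f} months")  # ~70 months
```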
> Improving model performance by scaling up inference compute is the next big thing in frontier AI. But the charts being used to trumpet this new paradigm can be misleading. While they initially appear to show steady scaling and impressive performance for models like o1 and o3, they really show poor scaling (characteristic of brute force) and little evidence of improvement between o1 and o3. I explore how to interpret these new charts and what evidence for strong scaling and progress would look like.
From scaling training to scaling inference
The dominant trend in frontier AI over the last few years has been the rapid scale-up of training — using more and more compute to produce smarter and smarter models. Since GPT-4, this kind of scaling has run into challenges, so we haven’t yet seen models much larger than GPT-4. But we have seen a recent shift towards scaling up the compute used during deployment (aka ‘test-time compute’ or ‘inference compute’), with more inference compute producing smarter models.
You could think of this as a change in strategy from improving the quality of your employees’ work by giving them more years of training in which to acquire skills, concepts and intuitions, to improving their quality by giving them more time to complete each task. Or, using an analogy to human cognition, you could see more training as improving the model’s intuitive ‘System 1’ thinking and more inference as improving its methodical ‘System 2’ thinking.
There has been a lot of excitement about the results of scaling up inference, especially in OpenAI’s o1 and o3 models. But I’ve seen many people getting excited due to misreading the results. To understand just how impressed we really should be, we need to get a grip on the new ‘scaling laws’ for inference. And to do this, we need to understand a new kind of chart that we’ve been seeing a lot lately.
Here is the chart that started it off — from OpenAI’s introduction to o1.
On the left are the results of scaling training compute; on the right, the results of scaling test-time (inference) compute.
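To see why a straight line on one of these log-x charts can indicate brute-force scaling rather than strong scaling, here is a small sketch under an assumed slope (the 10-points-per-decade figure is hypothetical):

```python
# If accuracy is linear in log10(inference compute), then each extra point of
# accuracy costs a constant *multiple* of compute, i.e. exponentially more overall.
slope_per_decade = 0.10  # assumed: +10 accuracy points per 10x more inference compute

extra_accuracy = 0.20    # a further +20 points
compute_multiplier = 10 ** (extra_accuracy / slope_per_decade)
print(f"{compute_multiplier:,.0f}x more inference compute needed")  # 100x for +20 points
```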
> The shift from scaling up the pre-training compute of AI systems to scaling up their inference compute may have profound effects on AI governance. The nature of these effects depends crucially on whether this new inference compute will primarily be used during external deployment or as part of a more complex training programme within the lab. Rapid scaling of inference-at-deployment would: lower the importance of open-weight models (and of securing the weights of closed models), reduce the impact of the first human-level models, change the business model for frontier AI, reduce the need for power-intensive data centres, and derail the current paradigm of AI governance via training compute thresholds. Rapid scaling of inference-during-training would have more ambiguous effects that range from a revitalisation of pre-training scaling to a form of recursive self-improvement via iterated distillation and amplification.
The end of an era — for both training and governance
The intense year-on-year scaling up of AI training runs has been one of the most dramatic and stable markers of the Large Language Model era. Indeed it had been widely taken to be a permanent fixture of the AI landscape and the basis of many approaches to AI governance.
But recent reports from unnamed employees at the leading labs suggest that their attempts to scale up pre-training substantially beyond the size of GPT-4 have led to only modest gains which are insufficient to justify continuing such scaling and perhaps even insufficient to warrant public deployment of those models. A possible reason is that they are running out of high-quality training data. While the scaling laws might still be operating (given sufficient compute and data, the models would keep improving), the ability to harness them through rapid scaling of pre-training may not. What was taken to be a fixture may instead have been just one important era in the history of AI development; an era which is now coming to a close.
> The new scaling paradigm for AI reduces the amount of information a model can learn per hour of training by a factor of 1,000 to 1,000,000. I explore what this means and its implications for scaling.
The last year has seen a massive shift in how leading AI models are trained. 2018–2023 was the era of pre-training scaling. LLMs were primarily trained by next-token prediction (also known as pre-training). Much of OpenAI’s progress from GPT-1 to GPT-4 came from scaling up the amount of pre-training by a factor of 1,000,000. New capabilities were unlocked not through scientific breakthroughs, but through doing more-or-less the same thing at ever-larger scales. Everyone was talking about the success of scaling, from AI labs to venture capitalists to policy makers.
However, there’s been remarkably little progress in scaling up this kind of training since then (GPT-4.5 added one more factor of 10, but was then quietly retired). Instead, there has been a shift to taking one of these pre-trained models and further training it with large amounts of Reinforcement Learning (RL). This has produced models like OpenAI’s o1, o3, and GPT-5, with dramatic improvements in reasoning (such as solving hard maths problems) and the ability to pursue objectives in an agentic manner (such as performing software-engineering work).
AI labs have trumpeted the new reasoning abilities that RL unlocked, but have downplayed the ending of the remarkable era of pre-training scaling. For example, Dario Amodei of Anthropic said:
> Every once in a while, the underlying thing that is being scaled changes a bit, or a new type of scaling is added to the training process. From 2020-2023, the main thing being scaled was pretrained models: models trained on increasing amounts of internet text with a tiny bit of other training on top. In 2024, the idea of using reinforcement learning (RL) to train models to generate chains of thought has become a new focus of scaling.
But there are profound differences between pre-training scaling and this new RL scaling.
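One back-of-the-envelope way a factor of 1,000 to 1,000,000 could arise, sketched under assumptions the excerpt does not spell out: suppose pre-training extracts on the order of one bit of training signal per token, while RL from verifiable rewards yields on the order of one bit per episode, with episodes spanning roughly a thousand to a million generated tokens.

```python
# Back-of-the-envelope comparison of training signal per token (assumed figures).
bits_per_token_pretraining = 1.0  # assumed order of magnitude

for episode_tokens in (1_000, 1_000_000):
    bits_per_token_rl = 1.0 / episode_tokens  # ~1 bit of reward spread over the whole episode
    ratio = bits_per_token_pretraining / bits_per_token_rl
    print(f"episode of {episode_tokens:>9,} tokens -> ~{ratio:,.0f}x less information per token")
# ~1,000x and ~1,000,000x respectively
```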
In the last year or two, the most important trend in modern AI came to an end. The scaling-up of computational resources used to train ever-larger AI models through next-token prediction (pre-training) stalled out. Since late 2024, we’ve seen a new trend of using reinforcement learning (RL) in the second stage of training (post-training). Through RL, the AI models learn to do superior chain-of-thought reasoning about the problem they are being asked to solve.
This new era involves scaling up two kinds of compute:
1. the amount of compute used in RL post-training
2. the amount of compute used every time the model answers a question
Industry insiders are excited about the first new kind of scaling, because the amount of compute needed for RL post-training started off being small compared to the tremendous amounts already used in next-token prediction pre-training. Thus, one could scale the RL post-training up by a factor of 10 or 100 before even doubling the total compute used to train the model.
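A toy illustration of that headroom (the 1% starting share for RL is a hypothetical figure, not one from the post):

```python
# If RL post-training starts as a small fraction of the pre-training bill, it can
# be scaled up many times before total training compute even doubles.
pretraining = 1.0   # normalised pre-training compute
rl_initial = 0.01   # assumed: RL starts at 1% of pre-training compute
original_total = pretraining + rl_initial

for scale_up in (10, 100):
    new_total = pretraining + rl_initial * scale_up
    print(f"RL scaled {scale_up:>3}x -> total training compute grows {new_total / original_total:.2f}x")
# 10x -> ~1.09x, 100x -> ~1.98x (still not quite doubled)
```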
But the second new kind of scaling is a problem. Major AI companies were already starting to spend more compute serving their models to customers than in the training phase. So if it costs a factor of 10 or 100 times as much to answer each question, this really does affect their bottom line. And unlike training costs, these costs can’t be made up in volume. This kind of scaling is known as inference-scaling (since it is scaling the compute used in the output, or ‘inference’, stage).
It is critical to find out how much of the benefits are coming from each kind of scaling. If further improvements in AI capabilities are mainly going to come from inference-scaling, this would have many implications for the trajectory of AI, including for AI companies, AI governance, and AI risk.
In past writings, I’ve mainly focused on the inference-scaling, treating RL as primarily enabling larger and larger amounts of inference to be productively used to answer a question.
