On January 1, 2030, there will be no artificial general intelligence (AGI) and AGI will still not be imminent.
A few reasons why I think this:
-If you look at benchmarks like ARC-AGI and ARC-AGI-2, which are easy for humans to solve and intentionally designed to be a low bar for AI to clear, the weaknesses of frontier AI models are starkly revealed.
-Casual, everyday use of large language models (LLMs) reveals major errors on simple thinking tasks, such as not understanding that an event that took place in 2025 could not have caused an event that took place in 2024.
-Progress does not seem like a fast exponential trend, faster than Moore's law and laying the groundwork for an intelligence explosion. Progress seems actually pretty slow and incremental, with a moderate improvement from GPT-3.5 to GPT-4, and another moderate improvement from GPT-4 to o3-mini. The decline in the costs of running the models or the increase in the compute used to train models is probably happening faster than Moore's law, but not the actual intelligence of the models.
-Most AI experts and most superforecasters give much more conservative predictions when surveyed about AGI, closer to 50 or 100 years than 5 or 10 years.
-Most AI experts are skeptical that scaling up LLMs could lead to AGI.
-It seems like there are deep, fundamental scientific discoveries and breakthroughs that would need to be made for building AGI to become possible. There is no evidence we're on the cusp of those happening and it seems like they could easily take many decades.
-Some of the well-known people who are making aggressive predictions about the timeline of AGI now have also made aggressive predictions about the timeline of AGI in the past that were wrong.
-The stock market doesn't think AGI is coming in 5 years.
-There has been little if any clear, observable effect of AI on economic productivity or the productivity of individual firms.
-AI can't yet replace human translators or do other jobs that it seems best-positioned to overtake.
-Progress on AI robotics problems, such as fully autonomous driving, has been dismal. (However, autonomous driving companies have good PR and marketing right up until the day they announce they're shutting down.)
-Discourse about AGI sounds way too millennialist and that's a reason for skepticism.
-The community of people most focused on keeping up the drumbeat of near-term AGI predictions seems insular, intolerant of disagreement or intellectual or social non-conformity (relative to the group's norms), and closed-off to even reasonable, relatively gentle criticism (whether or not they pay lip service to listening to criticism or perform being open-minded). It doesn't feel like a scientific community. It feels more like a niche subculture. It seems like a group of people just saying increasingly small numbers to each other (10 years, 5 years, 3 years, 2 years), hyping each other up (either with excitement or anxiety), and reinforcing each other's ideas all the time. It doesn't seem like an intellectually healthy community.
-A lot of the aforementioned points have been made before and there haven't been any good answers to them.
I'd like to thank Sam Altman, Dario Amodei, Demis Hassabis, Yann LeCun, Elon Musk, and several others who declined to be named for giving me notes on each of the sixteen drafts of this post I shared with them over the past three months. Your feedback helped me polish a rough stone of thought into a diamond of incisive criticism.
Note: I edited this post on 2025-04-12 at 20:30 UTC to add some footnotes.
Moore's law is ~1 doubling every 2 years. Barnes' law is ~4 doublings every 2 years:
I think if you surveyed any expert on LLMs and asked them "which was a greater jump in capabilities, GPT-2 to GPT-3 or GPT-3 to GPT-4?" the vast majority would say the former, and I would agree with them. This graph doesn't capture that, which makes me cautious about over-relying on it.
That's a really broad question, though. If you asked something like which system unlocked the most real-world value in coding, people would probably say the jump to a more recent model like o3-mini or Gemini 2.5.
You could similarly argue that the jump from infant to toddler is much more profound in terms of general capabilities than the jump from college student to PhD, but the latter is more relevant in terms of unlocking new research tasks that can be done.
I would be curious to know what the best benchmarks are which show a sub-Moore's-law trend.
Hi Ben. Is there any bet you would be willing to make about the impact of AI on large-scale outcomes, like global catastrophes, unemployment, economic growth, or energy consumption? I am open to bets against short AI timelines, or what they supposedly imply, up to $10k.
Pay attention to the rest of that paragraph you quoted from:
Measuring intelligence is hard. On the wrong benchmark, a calculator is superintelligent. And yet a calculator lacks what we talk about when we talk about human intelligence, animal intelligence, and hypothetical future artificial general intelligence, like the robots and androids and sentient supercomputers that populate sci-fi.
I don't think ARC-AGI-2 is some perfect encapsulation of the essence of intelligence. It's more or less a puzzle game. But it's refreshing in that it does more than many benchmarks in teasing out some of the differences in intellectual capability between present-day deep neural networks and ordinary humans.
ARC-AGI-2 does not attempt to be a test of whether an AI system is an AGI or not. It's intended to be a low bar for AI systems to clear. The idea is to make it easy enough for AI systems that they have some hope of getting a high score within the next few years because the goal is to move AI research forward (and not just prove a point about artificial intelligence vs. human intelligence or something like that). So, getting a high score on ARC-AGI-2 would show incremental progress toward AGI; not getting a high score on ARC-AGI-2 over the next several years would show slow progress or a lack of progress toward AGI. (No result, even a score of 100%, as cool and impressive as that would be, would show that an AI system is AGI.)
Badly operationalizing a concept like "intelligence" is worse than not operationalizing it at all. If you operationalize "happiness" as "the number of times a person smiles per day", you've actually gone backwards in your understanding of happiness and would have been better off sticking to a looser, more nebulous conceptualization. To the extent we want to measure such complex and puzzling phenomena, we need really carefully designed measurement tools.
When we're measuring AI, the selection of which tasks we're evaluating on really matters. On the sort of tasks that frontier AI models struggle with, the length of tasks that AI can successfully do has not been reliably doubling. If you drew a chart for the GPT models on ARC-AGI-2, it would mostly just be a flat line. These are the results:
GPT-2: 0.0%
GPT-3: 0.0%
GPT-3.5: 0.0%
GPT-4: 0.0%
GPT-4o: 0.0%
GPT-4.5: 0.0%
o3-mini-high: 0.0%
It's only with the o3-low and o1-pro models we see scores above 0% — but still below 5%. Getting above 0% on ARC-AGI-2 is an interesting result and getting much higher scores on the previous version of the benchmark, ARC-AGI, is an interesting result. There's a nuanced discussion to be had about that topic. But I don't see how you could use these results to draw a trendline of AI models rapidly barrelling toward AGI.
... which is what (super)-exponential growth looks like, yes?
Specifically: We've gone from o1 (low) getting 0.8% to o3 (low) getting 4% in ~1 year, which is ~2 doublings per year (i.e. 4x Moore's law). Forecasting from this few data points sure seems like a cursed endeavor to me, but if you want to do it then I don't see how you can rule out Moore's-law-or-faster growth.
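To make that arithmetic explicit, here is a minimal sketch. The function name `doublings_per_year` is just an illustrative choice, and the Moore's law rate of one doubling every two years is taken from earlier in this thread. It converts two scores and an elapsed time into doublings per year:

```python
import math

def doublings_per_year(start, end, years):
    """How many times the quantity doubled per year between two measurements."""
    return math.log2(end / start) / years

# Assumption: Moore's law is ~1 doubling every 2 years, i.e. 0.5 doublings per year.
MOORE_DOUBLINGS_PER_YEAR = 0.5

# o1 (low) at 0.8% to o3 (low) at 4% over roughly one year.
rate = doublings_per_year(0.8, 4.0, 1.0)
print(round(rate, 2))                             # 2.32
print(round(rate / MOORE_DOUBLINGS_PER_YEAR, 1))  # 4.6
```

With only two data points, the result is very sensitive to either score and to the assumed time gap, so it should be read as a rough illustration rather than a forecast.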
By some accounts, growth from 0.0 to 4.0 is infinite growth, which is infinitely faster than Moore’s law!
More seriously, I didn’t really think through precisely whether artificial intelligence could be increasing faster than Moore’s law. I guess in theory it could. I forgot that Moore’s law speed actually isn’t that impressive on its own. It has to compound over decades to be impressive.
If I eat a sandwich today and eat two sandwiches tomorrow, the growth rate in my sandwich consumption is astronomically faster than Moore’s law. But what matters is if the growth rate continues and compounds long-term.
The bigger picture is how to measure general intelligence or “fluid intelligence” in a way that makes sense. The Elo rating of AlphaGo probably increased faster than Moore’s law from 2014 to 2017. But we don’t see the Elo rating of AlphaGo as a measure of AGI, or else AGI would have already been achieved in 2015.
I think essentially all of these benchmarks and metrics for LLM performance are like the Elo rating of AlphaGo in this respect. They are measuring a narrow skill.
Fair enough, but in that case I feel kind of confused about what your statement "Progress does not seem like a fast exponential trend, faster than Moore's law" was intended to imply.
If the claim you are making is "AGI by 2030 will require some growth faster than Moore's law" then the good news is that almost everyone agrees with you but the bad news is that everyone already agrees with you so this point is not really cruxy to anyone.
Maybe you have an additional claim like "...and growth faster than moore's law is unlikely?" If so, I would encourage you to write that because I think that is the kind of thing that would engage with people's cruxes!
So, what I originally wrote is:
To remove the confusing part about Moore’s law, I could re-word it like this:
I think this conveys my meaning better than what I wrote originally, and it avoids getting into the Moore’s law topic.
The Moore’s law topic is a bit of an unnecessary rabbit hole. A lot of things increase faster than Moore’s law during a short window of time, but few increase at a CAGR of 41% (or whatever Moore’s law’s CAGR is) for decades. There are all kinds of ways to misapply the analogy of Moore’s law.
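For what it's worth, the CAGR figure can be checked with a quick sketch, assuming Moore's law means one doubling every two years (the helper names here are hypothetical, chosen only for illustration):

```python
def cagr_from_doubling_time(doubling_years):
    """Compound annual growth rate implied by a fixed doubling time."""
    return 2 ** (1 / doubling_years) - 1

def growth_factor(cagr, years):
    """Total multiplication after compounding at `cagr` for `years` years."""
    return (1 + cagr) ** years

# Assumption: one doubling every 2 years.
moore = cagr_from_doubling_time(2)
print(f"{moore:.1%}")                   # 41.4%
print(round(growth_factor(moore, 20)))  # 1024
```

The second number is the point about compounding: the annual rate looks modest, but sustained for two decades it multiplies the starting value roughly a thousandfold.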
People have made jokes about this kind of thing before, like The Economist sarcastically forecasting in 2006 based on then-recent trends that a 14-blade razor would be released by 2010.
I also think of David Deutsch’s book The Beginning of Infinity, in which he rails against the practice of uncritically extrapolating past trends forward, and his TED Talk where he does a bit of the same.
My impression is that ARC-AGI (1) is close to being solved, which is why they brought out ARC-AGI-2 a few weeks ago.
Benchmarks are often adversarially selected so they take longer to be saturated, so I don't think little progress on ARC-AGI-2 a few weeks after release (and iirc after any major model release) tells us much at all.
It depends what you want ARC-AGI-2 to tell you. For one, it tells you that current frontier models lack the general intelligence or “fluid intelligence” to solve simple puzzles that pretty much any person can solve. Why is that? Isn’t that interesting?
Why should it matter whether new models have been released after the reveal of ARC-AGI-2? If models have to be specifically fine-tuned for these tasks, doesn’t that show they are lacking in the capability to generalize to novel problems? If they don’t have to be specifically fine-tuned, then the timing shouldn’t matter. A model with good generalization capability should be able to do well whether it happens to be released before or after the reveal of the ARC-AGI-2 benchmark.
Another “benchmark” I mused about is the ability of AI systems to generate profit for their users by displacing human labour. It seems like improvement on that “benchmark” has been much, much slower than Moore’s law, but, then again, I don’t know if anyone’s been able to accurately measure that.
The bigger picture is that LLMs have extremely meagre capabilities in many cognitive domains and I haven’t seen signs of anything but modest improvement over the last ~2.5 years. I also don’t see many people trying to quantify those things.
On one level, that makes sense because it takes time, money/labour, and expertise to create a good benchmark and there is no profit in it. You don’t seem to get much acclaim, either. Also, you might feel like you wasted your time if you made a benchmark that frontier AI models got ~0% on and, a year later, they still got ~0%…
On another level, measuring AGI progress carefully and thoughtfully seems important and it’s a bit surprising/disappointing that the status quo for benchmarks is so poor.
Why should it matter whether new models have been released after the reveal of ARC-AGI-2? If models have to be specifically fine-tuned for these tasks, doesn’t that show they are lacking in the capability to generalize to novel problems?
The main reason is that the benchmark has been pretty adversarially selected, so it's not clear that it's pointing at a significant lack in LM capabilities. I agree that it's weak evidence that they can't generalise to novel problems, but basically all of the update is priced in from just interacting with systems and noticing that they are better in some domains than others.
For one, it tells you that current frontier models lack the general intelligence or “fluid intelligence” to solve simple puzzles that pretty much any person can solve. Why is that? Isn’t that interesting?
I disagree that ARC-AGI is strong evidence that LMs lack "fluid intelligence" - I agree that was the intention of the benchmark, and I think it's weak evidence.
Another “benchmark” I mused about is the ability of AI systems to generate profit for their users by displacing human labour. It seems like improvement on that “benchmark” has been much, much slower than Moore’s law, but, then again, I don’t know if anyone’s been able to accurately measure that.
Has this been a lot slower than Moore's law? I think OpenAI revenue is, on average, more aggressive than Moore's law. I'd guess that LM ability to automate intellectual work is more aggressive than Moore's law, too, but it started from a very low baseline, so it's hard to see. Subjectively, LMs feel like they should be having a larger impact on the economy than they currently are. I think this is more related to horizon length than fluid intelligence, but 🤷‍♂️.
The bigger picture is that LLMs have extremely meagre capabilities in many cognitive domains and I haven’t seen signs of anything but modest improvement over the last ~2.5 years. I also don’t see many people trying to quantify those things.
I'm curious for examples here - particularly if they are the kinds of things that LMs have affordances for, are intellectual tasks, and are at least moderately economically valuable (so that someone has actually tried to solve).
I disagree with what you said about ARC-AGI and ARC-AGI-2, but it doesn't seem worth getting into.
I tried to frame the question to avoid counting the revenue or profit of AI companies that sell AI as a product or service. I said:
Generating profit for users is different from generating profit for vendors. Generating profit for users would mean, for example, that OpenAI's customers are generating more profit for themselves by using OpenAI's models than they were before using LLMs.
I realized in some other comments on this post (here and here) that trying to compare these kinds of things to Moore's law is a mistake. As you mentioned, if you start from a low enough baseline, all kinds of things are faster than Moore's law, at least for a while. Also, if you measure all kinds of normal trends within a selective window of time (e.g. number of sandwiches eaten per day from Monday to Tuesday increased from 1 to 2, indicating an upward trajectory many orders of magnitude faster than Moore's law), then you can get a false picture of astronomically fast growth.
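Here is a rough sketch of that window-selection problem. It annualizes the growth seen in a short window, under the purely illustrative assumption that the same rate continues all year (`annualized_factor` is a made-up helper name):

```python
def annualized_factor(start, end, window_days, days_per_year=365):
    """Yearly growth factor if the growth seen in the window continued all year."""
    return (end / start) ** (days_per_year / window_days)

# Sandwiches: 1 on Monday, 2 on Tuesday (a one-day window).
sandwiches = annualized_factor(1, 2, window_days=1)

# Assumption: Moore's law as one doubling every 2 years (730 days).
moore = annualized_factor(1, 2, window_days=730)

print(f"{sandwiches:.2e}")  # an absurdly large number
print(f"{moore:.3f}")       # 1.414
```

The sandwich figure is meaningless as a forecast, which is exactly the problem: the shorter and more selectively chosen the window, the wilder the extrapolation.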
Back to the topic of profit... In an interview from sometime in the past few years, Demis Hassabis said that LLMs are mainly being used for "entertainment". I was so surprised by this because you wouldn't expect a statement that sounds so dismissive from someone in his position.
And yet, when I thought about it, that does accurately characterize a lot of what people have used LLMs for, especially initially in 2022 and 2023.
So, to try to measure the usefulness of LLMs, we have to exclude entertainment use cases. To me, one simple, clean way to do that is to measure the profit that people generate by using LLMs. If a corporation, a small business, or a self-employed person pays to use, for example, OpenAI's models, can they increase their profits? And, if so, how much has that increase in profitability changed (if at all) over time, e.g., from 2023 to 2025?
(We would still have to close some loopholes. For example, if a company pays to use OpenAI's API and then just re-packages OpenAI's models for entertainment purposes, then that shouldn't count, since that's the same function I wanted to exclude from the beginning and the only thing that's different is an intermediary has been added.)
I haven't seen much hard data on changes in firm-level profitability or firm-level productivity among companies that adopt LLMs. One of the few sources of data I can find is this study about customer support agents: https://academic.oup.com/qje/article/140/2/889/7990658 The paper is open access.
Here's an interesting quote:
My main takeaway from this study is that this seems really underwhelming. Maybe worse than underwhelming.
This is somewhat disingenuous. o3-mini (high) is actually on 1.5%, and none of the other models are reasoning (CoT / RL / long inference time) models (oh, and GPT 4.5 is actually on 0.8%). The actual leaderboard looks like this:
Yes the scores are still very low, but it could just be a case of the models not yet "grokking" such puzzles. In a generation or two they might just grok them and then jump up to very high scores (many benchmarks have gone like this in the past few years).
I was not being disingenuous and I find your use of the word "disingenuous" here to be unnecessarily hostile.
I was going off of the numbers in the recent blog post from March 24, 2025. The numbers I stated were accurate as of the blog post.
So that we don't miss the bigger point, I want to reiterate that ARC-AGI-2 is designed to be solved by near-term, sub-AGI AI models with some innovation on the status quo, not to stump them forever. This is François Chollet describing the previous version of the benchmark, ARC-AGI, in a post on Bluesky from January 6, 2025:
To reiterate, ARC-AGI and ARC-AGI-2 are not tests of AGI. They are tests of whether a small, incremental amount of progress toward AGI has occurred. The idea is for ARC-AGI-2 to be solved, hopefully within the next few years and not, like, ten years from now, and then to move on to ARC-AGI-3 or whatever the next benchmark will be called.
Also, ARC-AGI was not a perfectly designed benchmark (for example, Chollet said about half the tasks turned out to be flawed in a way that made them susceptible to "brute-force program search") and ARC-AGI-2 is not a perfectly designed benchmark, either.
ARC-AGI-2 is worth talking about because most, if not all, of the commonly used AI benchmarks have very little usefulness for quantifying general intelligence or quantifying AGI progress. It's the problem of bad operationalization leading to distorted conclusions, as I discussed in my previous comment.
I don't know of other attempts to benchmark general intelligence (or "fluid intelligence") or AGI progress with the same level of carefulness and thoughtfulness as ARC-AGI-2. I would love to hear if there are more benchmarks like this.
One suggestion I've read is that a benchmark should be created with a greater diversity of tasks, since all of the ARC-AGI-2 tasks are part of the same "puzzle game" (my words).
There's a connection between frontier AI models' failures on a relatively simple "puzzle game" like ARC-AGI-2 and why we don't see AI models showing up in productivity statistics, real per capita GDP growth, or taking over jobs. When people try to use AI models for practical tasks in the real world, their usefulness is quite constrained.
I understand the theory that AI will have a super fast takeoff, so that even though it isn't very capable now, it will match and surpass human capabilities within 5 years. But this kind of theory is consistent with pretty much any level of AI performance in the present. People can and did make this argument before ChatGPT, before AlphaGo, even before AlexNet. Ray Kurzweil has been saying this since at least the 1990s.
It's important to have good, constrained, scientific benchmarks like ARC-AGI-2 and hopefully some people will develop another one, maybe with more task diversity. Other good "benchmarks" are economic and financial data around employment, productivity, and economic growth. Can AI actually do useful things that generate profit for users and that displace human labour?
This is a nuanced question, since there are models like AlphaFold (and AlphaFold 2 and 3) that can, at least in theory, improve scientific productivity, but which are narrow in scope and do not exhibit general intelligence or fluid intelligence. You have to frame the question carefully, in a way that actually tests what you want to test.
For example, using LLMs as online support chatbots, where humans are already usually following scripts and flow charts, and for which conventional "Software 1.0" was largely already adequate, is somewhat cool and impressive, but doesn't feel like a good test of general intelligence. A much better sign of AGI progress would be if LLM-based models were able to replace human labour in multiple sorts of jobs where it is impossible to provide precise, step-by-step written instructions.
To frame the question properly would require thought, time, and research.
I think Chollet has shifted the goalposts a bit from when he first developed ARC [ARC-AGI 1]. In his original paper from 2019, Chollet says:
And the original announcement (from June 2024) says:
(And ARC-AGI 1 has now basically been solved). You say:
But we are seeing a continued rapid improvement in A(G)I capabilities, not least along the trajectory to automating AGI development, as per the METR report Ben West mentions.
In his interview with Dwarkesh Patel in June 2024 to talk about the launch of the ARC Prize, Chollet emphasized how easy the ARC-AGI tasks were for humans, saying that even children could do them. This is not something he’s saying only now in retrospect that the ARC-AGI tasks have been mostly solved.
That first quote, from the 2019 paper, is consistent with Chollet’s January 2025 Bluesky post. That second quote is not from Chollet, but from Mike Knoop. I don’t know what the first sentence is supposed to mean, but the second sentence is also consistent with the Bluesky post.
In response to the graph… Just showing a graph go up does not amount to a “trajectory to automating AGI development”. The kinds of tasks AI systems can do today are very limited in their applicability to AGI research and development. That has only changed modestly between ChatGPT’s release in November 2022 and today.
In 2018, you could have shown a graph of Go performance increasing from 2015 to 2017 and that also would not have been evidence of a trajectory toward automating AGI development. Nor would AlphaZero’s tripling of the games a single AI system can master, from Go alone to Go, chess, and shogi. Measuring improved performance on tasks only provides evidence for AGI progress if the tasks you are measuring test for general intelligence.
GPT-2 is not mentioned in the blog post. Nor is GPT-3. Or GPT-3.5. Or GPT-4. Or even GPT-4o! You are writing 0.0% a lot for effect. In the actual blog post, there are only two 0.0% entries, for "gpt-4.5 (Pure LLM)" and "o3-mini-high (Single CoT)"; and note the limitations in parentheses, which you also neglect to include in your list (presumably for effect, given their non-zero scores when not limited in such ways?).
It seems like you are really zeroing in on nitpicky details that make barely any difference to the substance of what I said in order to accuse me of being intentionally deceptive. This is not cool behaviour.
I am curious to see what will happen in 5 years when there is no AGI. How will people react? Will they just kick their timelines 5 years down the road and repeat the cycle? Will some people attempt to resolve the discomfort by defining AGI as whatever exists in 5 years? Will some people be disillusioned and furious?
I hope that some people engage in soul searching about why they believed AGI was imminent when it wasn’t. And near the top of the list of reasons why will be (I believe) intolerance of disagreement about AGI and hostility to criticism of short AGI timelines.
I don't think it's nitpicky at all. A trend showing small, increasing numbers, just above 0, is very different (qualitatively) to a trend that is all flat 0s, as Ben West points out.
If this happens, we will at least know a lot more about how AGI works (or doesn't). I'll be happy to admit I'm wrong (I mean, I'll be happy to still be around, for a start[1]).
I think the most likely reason we won't have AGI in 5 years is that there will be a global moratorium on further development. This is what I'm pushing for.
Then it's a good thing I didn't claim there was "a trend that is all flat 0s" in the comment you called "disingenuous". I said:
This feels like such a small detail to focus on. It feels ridiculous.