
The news:

ARC Prize published a blog post on April 22, 2025 that says OpenAI's o3 (Medium) scores 2.9% on the ARC-AGI-2 benchmark.[1] As of today, the leaderboard says that o3 (Medium) scores 3.0%. The blog post says o4-mini (Medium) scores 2.3% on ARC-AGI-2 and the leaderboard says it scores 2.4%. 

The high-compute versions of the models "failed to respond or timed out" for the large majority of tasks.

The average score for humans — typical humans off the street — is 60%. All of the ARC-AGI-2 tasks have been solved by at least two humans in no more than two attempts.

From the recent blog post:         

ARC Prize Foundation is a nonprofit committed to serving as the North Star for AGI by building open reasoning benchmarks that highlight the gap between what’s easy for humans and hard for AI. The ARC‑AGI benchmark family is our primary tool to do this. Every major model we evaluate adds new datapoints to the community’s understanding of where the frontier stands and how fast it is moving.

In this post we share the first public look at how OpenAI’s newest o‑series models, o3 and o4‑mini, perform on ARC‑AGI.

Our testing shows:

  • o3 performs well on ARC-AGI-1 - o3-low scored 41% on the ARC-AGI-1 Semi Private Eval set, and the o3-medium reached 53%. Neither surpassed 3% on ARC‑AGI‑2.
  • o4-mini shows promise - o4-mini-low scored 21% on ARC-AGI-1 Semi Private Eval, and o4-mini-medium scored 41% at state of the art levels of efficiency. Again, both low/med scored under 3% on the more difficult ARC-AGI-2 set.
  • Incomplete coverage with high reasoning - Both o3 and o4-mini frequently failed to return outputs when run at “high” reasoning. Partial high‑reasoning results appear below. However, these runs were excluded from the leaderboard due to insufficient coverage.

My analysis:

This is clear evidence that cutting-edge AI models have far less than human-level general intelligence. 

To be clear, scoring at human-level or higher on ARC-AGI-2 isn't evidence of human-level general intelligence and isn't intended to be. It's simply meant to be a challenging benchmark for AI models that attempts to measure models' ability to generalize to novel problems, rather than to rely on memorization to solve problems. 

By analogy, o4-mini's inability to play hangman is a sign that it's far from artificial general intelligence (AGI), but if o5-mini or a future version of o4-mini is able to play hangman, that wouldn't be a sign that it is AGI.

This is also conclusive disconfirmation (as if we needed it!) of the economist Tyler Cowen's declaration that o3 is AGI. (He followed up a day later and said, "I don’t mind if you don’t want to call it AGI." But he didn't say he was wrong to call it AGI.) 

It is inevitable that over the next 5 years, many people will realize their belief that AGI will be created within the next 5 years is wrong. (Though not necessarily all, since, as Tyler Cowen showed, it is possible to declare that an AI model is AGI when it is clearly not. To avoid admitting to being wrong, in 2027 or 2029 or 2030 or whenever they predicted AGI would happen, people can just declare the latest AI model from that year to be AGI.) ARC-AGI-2 and, later on, ARC-AGI-3 can serve as a clear reminder that frontier AI models are not AGI, are not close to AGI, and continue to struggle with relatively simple problems that are easy for humans. 

If you imagine fast enough progress, then no matter how far current AI systems are from AGI, it's possible to imagine them progressing from the current level of capabilities to AGI in incredibly small spans of time. But there is no reason to think progress will be fast enough to cover the ground from o3 (or any other frontier AI model) to AGI within 5 years. 

The models that exist today are somewhat better than the models that existed 2 years ago, but only somewhat. In 2 years, the models will probably be somewhat better than today, but only somewhat. 

It's hard to quantify general intelligence in a way that allows apples-to-apples comparisons between humans and machines. If we measure general intelligence by the ability to play grandmaster-level chess, well, IBM's Deep Blue could do that in 1996. If we give ChatGPT an IQ test, it will score well above 100, the average for humans. Large language models (LLMs) are good at taking written tests and exams, which is what a lot of popular benchmarks are.

So, when I say today's AI models are somewhat better than AI models from 2 years ago, that's an informal, subjective evaluation based on casual observation and intuition. I don't have a way to quantify intelligence. Unfortunately, no one does. 

In lieu of quantifying intelligence, I think pointing to the kinds of problems frontier AI models can't solve — problems which are easy for humans — and pointing to slow (or non-existent) progress in those areas is strong enough evidence against very near-term AGI. For example, o3 only gets 3% on ARC-AGI-2, o4-mini can't play hangman, and, after the last 2 years of progress, models are still hallucinating a lot and still struggling to understand time, causality, and other simple concepts. They have very little capacity to do hierarchical planning. There's been a little bit of improvement on these things, but not much. 

Watch the ARC-AGI-2 leaderboard (and, later on, the ARC-AGI-3 leaderboard) over the coming years. It will be a better way to quantify progress toward AGI than any other benchmark or metric I'm currently aware of, basically all of which seem almost entirely unhelpful for measuring AGI progress. (Again, with the caveat that solving ARC-AGI-2 doesn't mean a system is AGI, but failure to solve it means a system isn't AGI.) I have no idea how long it will take to solve ARC-AGI-2 (or ARC-AGI-3), but I suspect we will roll past the deadline for at least one attention-grabbing prediction of very near-term AGI before it is solved.[2]

  1. ^

    For context, read ARC Prize's blog post from March 24, 2025 announcing and explaining the ARC-AGI-2 benchmark. I also liked this video explaining ARC-AGI-2.

  2. ^

    For example, Elon Musk has absurdly predicted that AGI will be created by the end of 2025, and I wouldn't be at all surprised if on January 1, 2026, the top score on ARC-AGI-2 is still below 60%. 

Comments

It's only 8 months later, and the top score on ARC-AGI-2 is now 54%.

In footnote 2 on this post, I said I wouldn’t be surprised if, on January 1, 2026, the top score on ARC-AGI-2 was still below 60%. It did turn out to be below 60%, although only by 6 percentage points. (Elon Musk’s prediction of AGI in 2025 was wrong, obviously.)

The score the ARC Prize Foundation ascribes to human performance is 100%, rather than 60%. 60% is the average for individual humans, but 100% is the score for a "human panel", i.e. a set of at least two humans. Note the large discrepancy between the average human and the average human panel. The human testers were random people off the street who got paid $115-150 to show up and then an additional $5 per task they solved. I believe the ARC Prize Foundation’s explanation for the 40-point discrepancy is that many of the testers just didn’t feel that motivated to solve the tasks and gave up. (I vaguely remember this being mentioned in a talk or interview somewhere.)

ARC’s Grand Prize requires scoring 85% (and abiding by certain cost/compute efficiency limits). They say the 85% target score is "somewhat arbitrary".

I decided to go with the 60% figure in this post to go easy on the LLMs.

If you haven’t already, I recommend looking at some examples of ARC-AGI-2 tasks. Notice how simple they are. These are just little puzzles. They aren’t that complex. Anyone can do one in a few minutes, even a kid. It helps to see what we’re actually measuring here. 

The computer scientist Melanie Mitchell has a great recent talk on this. The whole talk is worth watching, but the part about ARC-AGI-1 and ARC-AGI-2 starts at 21:50. She gives examples of the sort of mistakes LLMs (including o1-pro) make on ARC-AGI tasks and her team’s variations on them. These are really, really simple mistakes. I think you should really look at the example tasks and the example mistakes to get a sense of how rudimentary LLMs’ capabilities are.

I am interested to see what happens when ARC-AGI-3 launches. ARC-AGI-3 is interactive and there is more variety in the tasks. Just as AI models themselves need to be iterated, benchmarks need to be iterated. It is difficult to make a perfect product or technology on the first try. So, hopefully François Chollet and his colleagues will make better and better benchmarks with each new version of ARC-AGI.

Unfortunately, the AI researcher Andrej Karpathy has been saying some pretty discouraging things about benchmarks lately. From a November tweet:

I usually urge caution with public benchmarks because imo they can be quite possible to game. It comes down to discipline and self-restraint of the team (who is meanwhile strongly incentivized otherwise) to not overfit test sets via elaborate gymnastics over test-set adjacent data in the document embedding space. Realistically, because everyone else is doing it, the pressure to do so is high.

I guess the most egregious publicly known example of an LLM company juicing its numbers on benchmarks was when Meta gamed (cheated on?) some benchmarks with Llama 4. Meta’s former Chief AI Scientist, Yann LeCun, said in a recent interview that Mark Zuckerberg "basically lost confidence in everyone who was involved in this" (a group that didn’t include LeCun, who worked in a different division); many of those involved have since departed the company.

However, I don’t know where LLM companies draw the line between acceptable gaming (or cheating) and unacceptable gaming (or cheating). For instance, I don’t know if LLM companies are creating their own training datasets with their own versions of ARC-AGI-2 tasks and training on that. It may be that the more an LLM company pays attention to and cares about a benchmark, the less meaningful a measurement it is (and vice versa). 

Karpathy again, this time in his December LLM year in review post:

Related to all this is my general apathy and loss of trust in benchmarks in 2025. The core issue is that benchmarks are almost by construction verifiable environments and are therefore immediately susceptible to RLVR and weaker forms of it via synthetic data generation. In the typical benchmaxxing process, teams in LLM labs inevitably construct environments adjacent to little pockets of the embedding space occupied by benchmarks and grow jaggies to cover them. Training on the test set is a new art form.

I think probably one of the best measures of AI capabilities is AI’s ability to do economically useful or valuable tasks, in real world scenarios, that can increase productivity or generate profit. This is a more robust test — it isn’t automatically gradable, and it would be very difficult to game or cheat on. To misuse the roboticist Rodney Brooks’ famous phrase, "The world is its own best model." Rather than test on some simplified, contrived proxy for real world tasks, why not just test on real world tasks?

Moreover, someone has to pay for people to create benchmarks, and to maintain, improve, and operate them. There isn’t a ton of money to do so, especially not for benchmarks like ARC-AGI-2. But there’s basically unlimited money incentivizing companies to measure productivity and profitability, and to try out allegedly labour-saving technologies. After the AI bubble pops (which it inevitably will, probably sometime within the next 5 years or so), this may become less true. But for now, companies are falling over themselves to try to implement and profit from LLMs and generative AI tools. So, funding to test AI performance in real world contexts is currently in abundant supply.

The human testers were random people off the street who got paid $115-150 to show up and then an additional $5 per task they solved. I believe the ARC Prize Foundation’s explanation for the 40-point discrepancy is that many of the testers just didn’t feel that motivated to solve the tasks and gave up [my emphasis]. (I vaguely remember this being mentioned in a talk or interview somewhere.)

I'm sceptical of this, given that they were able to earn $5 for every couple of minutes' work (the time to solve a task), which works out to roughly $100-150 per hour, far above the average hourly wage.

100% is the score for a "human panel", i.e. a set of at least two humans.

Also seems very remarkable (suspect, in fact) - this would mean almost no overlap between the questions that the humans were getting wrong - i.e. if each human averages 60% right, then for 2 humans to get 100% there can only be 20% of questions where both get it right! I think in practice the panels that score 100% have to contain many more than 2 humans on average.
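
To make that arithmetic explicit (a back-of-the-envelope check, assuming each person solves exactly 60% of tasks and a hypothetical two-person panel whose union covers 100%):

$|A \cup B| = |A| + |B| - |A \cap B| \;\Rightarrow\; 100\% = 60\% + 60\% - |A \cap B| \;\Rightarrow\; |A \cap B| = 20\%$

That is, the two people's sets of unsolved tasks (40% each) would have to be completely disjoint.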

EDIT: looks like "at least 2 humans" means at least 2 humans solved every problem in the set, out of the 400 humans that attempted them!

Just thinking: surely to be fair, we should be aggregating all the AI results into an "AI panel"? I wonder how much overlap there is between wrong answers amongst the AIs, and what the aggregate score would be? 
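
For illustration, here is a minimal sketch of how such an "AI panel" score and the overlap between models could be computed from per-task pass/fail records. The task IDs and results below are made up for the example, not real ARC-AGI-2 data:

```python
# Illustrative sketch only: the per-model results below are invented, not real
# ARC-AGI-2 results. It shows how an "AI panel" score (tasks solved by at
# least one model) and the overlap between models could be computed.

def panel_score(results: dict[str, set[str]], all_tasks: set[str]) -> float:
    """Fraction of tasks solved by at least one model in the panel."""
    solved_by_any = set().union(*results.values())
    return len(solved_by_any & all_tasks) / len(all_tasks)

# Hypothetical example: 5 tasks, 2 models.
all_tasks = {"t1", "t2", "t3", "t4", "t5"}
results = {
    "model_a": {"t1", "t2"},  # solves 2/5 = 40%
    "model_b": {"t2", "t3"},  # solves 2/5 = 40%, overlapping on t2
}

print(f"Panel score: {panel_score(results, all_tasks):.0%}")  # 60%
print(f"Solved by both models: {sorted(results['model_a'] & results['model_b'])}")  # ['t2']
```

On this toy data, each model solves 40% of tasks individually but the panel solves 60%, which is exactly the kind of gap between individual and aggregate scores the comment is asking about.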

Right now, as things stand with the scoring, "AGI" in ARC-AGI-2 means "equivalent to the combined performance of a team of 400 humans", not "(average) human level".

N=1, but I looked at an ARC puzzle https://arcprize.org/play?task=e3721c99, and I couldn't just do it in a few minutes, and I have a PhD from the University of Oxford. I don't doubt that most of the puzzles are trivial for some humans, that some of the puzzles are trivial for most humans, or that I could probably outscore any AI across the whole ARC-2 data set. But at the same time, I am a general intelligence, so being able to solve all ARC puzzles doesn't seem like a necessary criterion. Maybe this is the opposite of how doing well on benchmarks doesn't always generalize to real world tasks, and I am just dumb at these but smart overall, and the same could be true for an LLM.

 

By analogy, o4-mini's inability to play hangman is a sign that it's far from artificial general intelligence (AGI)

What is your source for this? I just tried and it played hangman just fine.

I played it the other way around, where I asked o4-mini to come up with a word that I would try to guess. I tried this twice and it made the same mistake both times.

The first word was "butterfly". I guessed "B" and it said, "The letter B is not in the word."

Then, when I lost the game and o4-mini revealed the word, it said, "Apologies—I mis-evaluated your B guess earlier."

The second time around, I tried to help it by saying: "Make a plan for how you would play hangman with me. Lay out the steps in your mind but don’t tell me anything. Tell me when you’re ready to play."

It made the same mistake again. I guessed the letters A, E, I, O, U, and Y, and it told me none of the letters were in the word. That exhausted the number of wrong guesses I was allowed, so it ended the game and revealed the word was "schmaltziness".

This time, it didn't catch its own mistake right away. I prompted it to review the context window and check for mistakes. At that point, it said that A, E, and I are actually in the word.[1]
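
For what it's worth, here is a quick check (a minimal Python snippet, using the word and the guesses described above) of which guessed letters actually appear in “schmaltziness”:

```python
# Check the guesses described above against the word o4-mini eventually revealed.
word = "schmaltziness"
guesses = "aeiouy"

for letter in guesses:
    positions = [i + 1 for i, ch in enumerate(word) if ch == letter]  # 1-indexed
    print(letter.upper(), positions if positions else "not in the word")

# Output:
# A [5]
# E [11]
# I [9]
# O not in the word
# U not in the word
# Y not in the word
```

So three of the six guessed letters were in the word, contrary to what o4-mini said during the game.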

Related to this: François Chollet has a great talk from August 2024, which I posted here, that includes a section on some of the weird, goofy mistakes that LLMs make. 

He argues that when a new mistake or category of mistake is discovered and becomes widely known, LLM companies fine-tune their models to avoid these mistakes in the future. But if you change up the prompt a bit, you can still elicit the same kind of mistake. 

So, the fine-tuning may give the impression that LLMs' overall reasoning ability is improving, but really this is a patchwork approach that can't possibly scale to cover the space of all human reasoning, which is impossibly vast and can only be mastered through better generalization. 

  1. ^

    I edited my comment to add this footnote on 2025-05-03 at 16:33 UTC. I just checked and o4-mini got the details on this completely wrong. It said:

     

    But the final word SCHMALTZINESS actually contains an A (in position 5), an I (in positions 10 and 13), and two E’s (in positions 11 and 14).

    What it said about the A is correct. It said the I was in two positions, but neither of the positions it gave (10 and 13) contains an I; the only I is in position 9. It said there are two E's, but there is only one, in position 11; the second E it claimed, in position 14, doesn't exist, since the word is only 13 letters long.

Huh interesting, I just tried that direction and it worked fine as well. This isn't super important but if you wanted to share the conversation I'd be interested to see the prompt you used.

I got an error trying to look at your link:

Unable to load conversation

For the first attempt at hangman, when the word was "butterfly", the prompt I gave was just: 

Let’s play hangman. Pick a word and I’ll guess.

After o4-mini picked a word, I added: 

Also, give me a vague hint or a general category.

It said the word was an animal. 

I guessed B, it said there was no B, and at the end said the word was "butterfly".

The second time, when the word was "schmaltziness", the prompt was:

Make a plan for how you would play hangman with me. Lay out the steps in your mind but don’t tell me anything. Tell me when you’re ready to play.

o4-mini responded:

I’m ready to play Hangman!

I said:

Give me a clue or hint to the word and then start the game.

There were three words where the clue was so obvious I guessed the word on the first try. 

Clue: "This animal 'never forgets.'"
Answer: Elephant

Clue: "A hopping marsupial native to Australia."
Answer: Kangaroo

After kangaroo, I said:

Next time, make the word harder and the clue more vague

Clue: "A tactic hidden beneath the surface."
Answer: Subterfuge. 

A little better, but I still guessed the word right away. 

I prompted again:

Harder word, much vaguer clue

o4-mini gave the clue "A character descriptor" and this began the disastrous attempt where it said the word "schmaltziness" had no vowels. 

Fixed the link. I also tried your original prompt and it worked for me.

But interesting! The "Harder word, much vaguer clue" seems to prompt it to not actually play hangman and instead antagonistically try to post hoc create a word after each guess which makes your guess wrong. I asked "Did you come up with a word when you first told me the number of letters or are you changing it after each guess?" And it said "I picked the word up front when I told you it was 10 letters long, and I haven’t changed it since. You’re playing against that same secret word the whole time." (Despite me being able to see its reasoning trace that this is not what it's doing.) When I say I give up it says "I’m sorry—I actually lost track of the word I’d originally picked and can’t accurately reveal it now." (Because it realized that there was no word consistent with its clues, as you noted.)

So I don't think it's correct to say that it doesn't know how to play hangman. (It knows, as you noted yourself.) It just wants so badly to make you lose that it lies about the word.

There is some ambiguity in claims about whether an LLM knows how to do something. The spectrum of knowing how to do things ranges all the way from “Can it do it at least once, ever?” to “Does it do it reliably, every time, without fail?”.

My experience was that I tried to play hangman with o4-mini twice and it failed both times in the same really goofy way, where it counted my guesses wrong when I guessed a letter that was in the word it later said I was supposed to be guessing.

When I played the game with o4-mini where it said the word was “butterfly” (and also said there was no “B” in the word when I guessed “B”), I didn’t prompt it to make the word hard. I just said, after it claimed to have picked the word:

"E. Also, give me a vague hint or a general category."

o4-mini said:

"It’s an animal."

So, maybe asking for a hint or a category is the thing that causes it to fail. I don’t know.

Even if I accepted the idea that the LLM “wants me to lose” (which sounds dubious to me), it doesn’t know how to do that properly, either. In the “butterfly” example, it could, in theory, have chosen a word retroactively that filled in the blanks but didn’t conflict with any guesses it said were wrong. But it didn’t do that.

In the attempt where the word was “schmaltziness”, o4-mini’s response about which letters were where in the word (which I pasted in a footnote to my previous comment) was borderline incoherent. I could hypothesize that this was part of a secret strategy on its part to follow my directives, but much more likely, I think, is that it just lacks the capability to execute the task reliably.

Fortunately, we don’t have to dwell on hangman too much, since there are rigorous benchmarks like ARC-AGI-2 that show more conclusively the reasoning abilities of o3 and o4-mini are poor compared to typical humans.

Note that the old[1] o3-high that was tested on ARC-AGI-1:

  1. ^

    OpenAI have stated that the newly-released o3 is not the same one as was evaluated on ARC-AGI-1 in December

Good Lord! Thanks for this information!

The Twitter thread by Toby Ord is great. Thanks for linking that. This tweet helps put things in perspective:

For reference, these are simple puzzles that my 10-year-old child can solve in about 4 minutes.
