
Note: This post was crossposted from Planned Obsolescence by the Forum team, with the author's permission. The author may not see or respond to comments on this post.

Most experts were surprised by progress in language models in 2022 and 2023. There may be more surprises ahead, so experts should register their forecasts about 2024 and 2025 now.

Kelsey Piper co-drafted this post. Thanks also to Isabel Juniewicz for research help.

If you read media coverage of ChatGPT — which called it ‘breathtaking’, ‘dazzling’, ‘astounding’ — you’d get the sense that large language models (LLMs) took the world completely by surprise. Is that impression accurate?

Actually, yes. There are a few different ways to get at the question “Were experts surprised by the pace of LLM progress?”, but they broadly point to the same answer: ML researchers, superforecasters,[1] and most others were all surprised by the progress in large language models in 2022 and 2023.

Competitions to forecast difficult ML benchmarks

ML benchmarks are sets of problems that can be objectively graded, allowing relatively precise comparisons across different models. We have data from forecasting competitions done in 2021 and 2022 on two of the most comprehensive and difficult ML benchmarks: the MMLU benchmark and the MATH benchmark.[2]

First, what are these benchmarks?

The MMLU dataset consists of multiple choice questions in a variety of subjects collected from sources like GRE practice tests and AP tests. It was intended to test subject matter knowledge in a wide variety of professional domains. MMLU questions are legitimately quite difficult: the average person would probably struggle to solve them.


At the time of its introduction in September 2020, most models only performed close to random chance on MMLU (~25%), while GPT-3 performed significantly better than chance at 44%. The benchmark was designed to be harder than any that had come before it, and the authors described their motivation as closing the gap between performance on benchmarks and “true language understanding”:

Natural Language Processing (NLP) models have achieved superhuman performance on a number of recently proposed benchmarks. However, these models are still well below human level performance for language understanding as a whole, suggesting a disconnect between our benchmarks and the actual capabilities of these models.
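For concreteness, here's a minimal sketch of how accuracy on a multiple-choice benchmark like MMLU is computed, and why random guessing converges to ~25% when there are four options per question (the "questions" below are random stand-ins, not real MMLU items):

```python
# Minimal sketch of multiple-choice benchmark scoring.
# The "questions" here are random stand-ins, not real MMLU items.
import random

def accuracy(predictions, answers):
    """Fraction of questions answered correctly."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# With four options per question, uniform random guessing converges
# to ~25% accuracy, the "close to random chance" baseline cited above.
n = 10_000
answers = [random.choice("ABCD") for _ in range(n)]
guesses = [random.choice("ABCD") for _ in range(n)]
print(f"Random-guess accuracy: {accuracy(guesses, answers):.1%}")
```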

Meanwhile, the MATH dataset consists of free-response questions taken from math contests aimed at the best high school math students in the country. Most college-educated adults would get well under half of these problems right (the authors used computer science undergraduates as human subjects, and their performance ranged from 40% to 90%).


At the time of its introduction in January 2021, the best model achieved only ~7% accuracy on MATH. The authors say:

We find that accuracy remains low even for the best models. Furthermore, unlike for most other text-based datasets, we find that accuracy is increasing very slowly with model size. If trends continue, then we will need algorithmic improvements, rather than just scale, to make substantial progress on MATH.
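Unlike MMLU, MATH is free-response: a model writes out a full solution and is graded on its final answer, conventionally wrapped in \boxed{...}. Here's a simplified sketch of that grading step (the extraction and matching below are deliberately minimal assumptions; real MATH graders normalize many more LaTeX equivalences):

```python
# Simplified sketch of MATH-style grading: pull the final answer out
# of a generated solution and compare it to the reference answer.
# Real graders normalize many more LaTeX equivalences than this.
import re

def extract_boxed(solution):
    """Return the contents of the last \\boxed{...} in a solution."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def is_correct(model_solution, reference_answer):
    answer = extract_boxed(model_solution)
    return answer is not None and answer == reference_answer.strip()

# Hypothetical example, not a real MATH item:
solution = r"Adding the roots gives $-b/a = 7$, so the answer is $\boxed{7}$."
print(is_correct(solution, "7"))  # True
```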

So, these are both hard benchmarks — the problems are difficult for humans, the best models got low performance when the benchmarks were introduced, and the authors seemed to imply it would take a while for performance to get really good.

In mid-2021, ML professor Jacob Steinhardt ran a contest[3] with superforecasters at Hypermind to predict progress on MATH and MMLU.[4] Superforecasters massively undershot reality in both cases.

They predicted that performance on MMLU would improve moderately from 44% in 2021 to 57% by June 2022. The actual performance was 68%, which superforecasters had rated incredibly unlikely.


Shortly after that, models got even better — GPT-4 achieved 86.4% on this benchmark, close to the 89.8% that would be “expert-level” within each domain, corresponding to 95th percentile among human test takers within a given subtest.

Superforecasters missed even more dramatically on MATH. They predicted the best model in June 2022 would get ~13% accuracy, and thought it was extremely unlikely that any model would achieve >20% accuracy. In reality, the best model in June 2022 got 50% accuracy,[5] performing much better than the majority of humans.


Did ML researchers do any better? Apparently not: Steinhardt himself did worse in 2021. In his initial blog post, he remarked that the superforecasters’ predictions on MATH were more aggressive (i.e., predicted faster progress) than his own.[6] We haven’t found similar advance predictions from other ML researchers, but Steinhardt’s impression is that he himself anticipated faster progress than most of his colleagues did.

However, ML researchers do seem to be improving in their ability to anticipate progress on these benchmarks. In mid-2022, Steinhardt registered his predictions for MATH and MMLU performance in July 2023, and performed notably better: “For MATH, the true result was at my 41st percentile, while for MMLU it was at my 66th percentile.” Steinhardt also argues that ML researchers performed reasonably well on forecasting MATH and MMLU in the late-2022 Existential Risk Persuasion Tournament (XPT) (though superforecasters continued to underestimate benchmark progress).
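For readers unfamiliar with this way of scoring forecasts: the forecaster reports a probability distribution, and the realized outcome is then located within it; landing near the 50th percentile indicates a well-centered forecast. A minimal sketch, with the forecast represented by Monte Carlo samples (all numbers below are illustrative, not Steinhardt's actual distributions):

```python
# Sketch: locate a realized outcome within a forecast distribution.
# "41st percentile" means 41% of the forecast's probability mass lay
# below the actual result; values near 50% indicate a centered forecast.
import numpy as np

def realized_percentile(forecast_samples, actual):
    """Percent of forecast probability mass below the realized value."""
    return 100.0 * np.mean(np.asarray(forecast_samples) < actual)

# Illustrative numbers only: a forecast modeled as Normal(0.75, 0.08)
# for July-2023 MATH accuracy, with a hypothetical actual result of 0.72.
rng = np.random.default_rng(0)
forecast = rng.normal(loc=0.75, scale=0.08, size=100_000)
print(f"{realized_percentile(forecast, 0.72):.0f}th percentile")  # ~35th
```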

Expert surveys about qualitative milestones

Not all forms of progress can be easily captured in quantifiable benchmarks. Often we care more about when AI systems will achieve more qualitative milestones: when will they translate as well as a fluent human? When will they beat the best humans at StarCraft? When will they prove novel mathematical theorems?

Katja Grace of AI Impacts asked ML experts to predict a wide variety of AI milestones, first in 2016 and then again in 2022.

In 2016, the ML experts were reasonably well-calibrated, but their misses followed a clear pattern: gameplay and robotics advanced more slowly than expected, while language use (including programming) advanced more quickly than expected.[7]

The second iteration of the survey was conducted in mid-2022, a few months before ChatGPT was released. This time accuracy was lower — experts failed to anticipate the progress that ChatGPT and GPT-4 would soon bring. These models achieved milestones like “Write an essay for a high school history class” (actually GPT-4 does pretty well in college classes too) or “Answer easily Googleable factual but open-ended questions better than an expert” just a few months after the survey was conducted, whereas the experts expected them to take years.

That means that even after the big 2022 benchmark surprises, experts were in some cases still strikingly conservative about anticipated progress, undershooting the reality.

Anecdata of researcher impressions

ML researchers rarely register predictions, so AI Impacts’ surveys are the best systematic evidence we have about what ML researchers expected ahead of time about qualitative milestones. Anecdotally though, a number of ML experts have expressed that they (and the ML community broadly) were surprised by ChatGPT and GPT-4.

For a long time, famous cognitive scientist Douglas Hofstadter was among those predicting slow progress. “I felt it would be hundreds of years before anything even remotely like a human mind”, he said in a recent interview.

Now? “This started happening at an accelerating pace, where unreachable goals and things that computers shouldn't be able to do started toppling. …systems got better and better at translation between languages, and then at producing intelligible responses to difficult questions in natural language, and even writing poetry. …The accelerating progress has been so unexpected, so completely caught me off guard, not only myself but many, many people, that there is a certain kind of terror of an oncoming tsunami that is going to catch all humanity off guard.”

Similarly, during a Senate Judiciary Committee hearing last month, leading AI researcher Yoshua Bengio said “I and many others have been surprised by the giant leap realized by systems like ChatGPT.”

In my role as a grantmaker, I’ve heard many ML academics express similar sentiments in private over the last year. In particular, I’ve spoken to many researchers who were specifically surprised by the programming and reasoning abilities of GPT-4 (even after seeing the capabilities of the free version of ChatGPT).

Another surprise ahead?

In 2021, most people were systematically and severely underestimating progress in language models. After a big leap forward in 2022, it looks like ML experts improved in their predictions of benchmarks like MMLU and MATH — but many still failed to anticipate the qualitative milestones achieved by ChatGPT and then GPT-4, especially in reasoning and programming.

I think many experts will soon be surprised yet again. Most importantly, ML experts and superforecasters both seem to be massively underestimating future spending on training runs. In the XPT tournament mentioned earlier, both groups predicted that the most expensive training run in 2030 would only cost around $100-180M.[8] Instead, I think the largest training run will probably cross $1 billion by 2025. This rapid scale-up will probably drive another qualitative leap in capability like the one we saw over the last 18 months.
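To see how quickly compounding growth blows past forecasts like that, here's a toy extrapolation; the starting cost and growth rate are illustrative assumptions, not figures from the XPT or from this post:

```python
# Toy compound-growth extrapolation of frontier training-run cost.
# Both inputs are illustrative assumptions (not figures from the XPT):
# a ~$100M frontier run in 2023, with cost growing ~2.5x per year.
start_year, start_cost = 2023, 100e6
growth = 2.5

for year in range(start_year, 2031):
    cost = start_cost * growth ** (year - start_year)
    print(f"{year}: ${cost / 1e9:,.2f}B")

# Under these assumptions the largest run passes $1B around 2026 and
# reaches tens of billions by 2030, far above the ~$100-180M that
# both XPT groups forecast for 2030.
```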

I’d be really excited for ML researchers to register their forecasts about what AI systems built on language models will be able to do in the next couple of years. I think we need to get good at predicting what language models will be able to do — in the real world, not just on benchmarks. Massively underestimating near-future progress could be very risky.


  1. ^

    These are people who consistently outperform experts and the general public in forecasting future events.

  2. ^

    Dan Hendrycks is the first author on both benchmarks; he did this work while he was a graduate student under Open Philanthropy grantee Jacob Steinhardt. Dan now runs the Center for AI Safety (CAIS), which Open Philanthropy has also funded.

  3. ^

    Open Philanthropy funded this forecasting contest.

  4. ^

    His contest consisted of six questions, of which two were forecasting MATH and MMLU. The other questions were not about language model capabilities (two were about vision, and others were questions about inputs to AI progress).

  5. ^

    Interestingly, the paper that achieved this milestone (Minerva) was published just one day before the deadline, on June 29, 2022. According to the Minerva paper, the previous published result was the same ~7% accuracy that was reported in the original MATH paper (though the paper claims an unpublished result of ~20%). This means progress on MATH turned out to be pretty “lumpy,” jumping a large amount with just one paper.

  6. ^

    In fact, in 2021, Steinhardt was surprised that forecasters predicted that models would achieve 50% performance on MATH by 2025. “I'm still surprised that forecasters predicted 52% on MATH [by 2025], when current accuracy is 7% (!). My estimate would have had high uncertainty, but I'm not sure the top end of my range would have included 50%.” As we said above, 50% was achieved in 2022. 

  7. ^

    For example, one of the milestones was “Write Python code to implement algorithms like quicksort.” Experts predicted that would happen around 2026, but actually it happened in 2021 — and by 2022 language models could write much more complex pieces of code than quicksort.
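    For reference, the milestone refers to code of roughly this shape, e.g. a minimal quicksort:

    ```python
    # A minimal quicksort, the kind of program the milestone describes.
    def quicksort(xs):
        if len(xs) <= 1:
            return xs
        pivot, rest = xs[0], xs[1:]
        left = [x for x in rest if x <= pivot]
        right = [x for x in rest if x > pivot]
        return quicksort(left) + [pivot] + quicksort(right)

    print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
    ```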

  8. ^
Comments (10)



The XPT forecast about compute in 2030 still boggles my mind. I'm genuinely confused what happened there. Is anybody reading this familiar with the answer?

Reminds me of this: a kind of conservatism of "expert" opinion that doesn't correctly appreciate (rapid) exponential growth.

I think it’s also just very difficult for experts to adopt a new paradigm. In transportation, the experts consistently overestimate.

Chart showing US DOT estimates vs actuals

Can you share a link to the source of this chart? The current link shows me a jpg and nothing else.

Source is here. (I've not read the article - I've seen the chart (or variations of it) a bunch of times before and just googled for the image.)

FWIW you can see more information, including some of the reasoning, on page 655 (# written on pdf) /  659 (# according to page searcher) of the report. (H/t Isabel.) See also page 214 for the definition of the question.

Some tidbits:

Experts started out much higher than superforecasters, but updated downwards after discussion. Superforecasters updated a bit upward, but less. (In the report's chart of these forecasts, the y-axis is in billions.)

This was surprising to me. I think the experts' predictions look too low even before updating, and look much worse after updating!

Here's the part of the report that talks about "arguments given for lower forecasts" (the footnotes contain quotes from people expressing those views):

Arguments given for lower forecasts (2024: <$40m, 2030: <$110m, 2050: ⩽$200m)

● Training costs have been stable around $10m for the last few years.1326

● Current trend increases are not sustainable for many more years.1327 One team cited this AI Impacts blog post.

● Major companies are cutting costs.1328

● Increases in model size and complexity will be offset by a combination of falling compute costs, pre-training, and algorithmic improvements.1329

● Large language models will probably see most attention in the near future, and these are bottlenecked by availability of data, which will lead to smaller models and less compute.1330

● Not all experiments will be public, and it is possible that the most expensive experiments will not be public.1331

(This last bullet point seems irrelevant to me. The question doesn't specify that the experiments have to be public, and "In the absence of an authoritative source, the question will be resolved by a panel of experts.")

Thanks! 

I think this is evidence for a groupthink phenomenon amongst superforecasters. Interestingly my other experiences talking with superforecasters have also made me update in this direction (they seemed much more groupthinky than I expected, as if they were deferring to each other a lot. Which, come to think of it, makes perfect sense -- I imagine if I were participating in forecasting tournaments, I'd gradually learn to reflexively defer to superforecasters too, since they genuinely would be performing well.)

I think we need to get good at predicting what language models will be able to do — in the real world, not just on benchmarks.

Any ideas how to do this? It seems like one key difficulty is that we just don't have good explicit understandings of many cognitive abilities, and don't have much hope of achieving such understandings in the relevant time frame.

So I'm not sure what can be done aside from applying human intuition to whatever relevant info we have (like how LMs qualitatively progressed in the past, or how hard various future capabilities seem). Maybe try to find people with particularly good intuition and calibration (as demonstrated by past records of predictions)? More/better use of prediction markets?

Anyway, does anyone have any qualitative predictions of what AIs produced by $1B training runs will look like? What do you think they will be able to do that will be most interesting or useful or dangerous or economically valuable?

Most college-educated adults would get well under half of these problems right (the authors used computer science undergraduates as human subjects, and their performance ranged from 40% to 90%).

I think the hardness of the MATH benchmark was somewhat exaggerated. I downloaded the dataset myself and took a look, and came to the conclusion that many -- perhaps most -- of the questions are simple plug-and-chug problems. The reported performance of 40-90% among students may have been a result of time constraints rather than pure difficulty. In the paper, they wrote:

"To provide a rough but informative comparison to human-level performance, we randomly sampled 20 problems from the MATH test set and gave them to humans. We artificially require that the participants have 1 hour to work on the problems and must perform calculations by hand."

I’d be really excited for ML researchers to register their forecasts about what AI systems built on language models will be able to do in the next couple of years.

Great call to action. What's the best place/way to do it?
