This is François Chollet's keynote talk at the AGI-24 conference in Seattle, Washington in August 2024.
Chollet is an AI researcher who may be best known for creating the deep learning library Keras. He did deep learning research and software development for Keras at Google for nine years before leaving recently to create his own AI startup. In September 2024, Time named him one of the 100 most influential people in AI.
In this talk, Chollet describes what he sees as the fundamental weaknesses of large language models (LLMs) and the flaws in commonly used LLM benchmarks, and argues that LLMs are incapable of scaling to artificial general intelligence (AGI). He also argues that apparent progress by LLMs in some of their weak areas is the result of superficial, brittle fixes by human annotators, a labour-intensive approach that does not scale.
Chollet's opinion that LLMs won't scale to AGI appears to be the view of a majority of AI experts. A March 2025 report from the Association for the Advancement of Artificial Intelligence (AAAI) found the following after surveying 475 AI experts (page 63):
The majority of respondents (76%) assert that “scaling up current AI approaches” to yield AGI is “unlikely” or “very unlikely” to succeed, suggesting doubts about whether current machine learning paradigms are sufficient for achieving general intelligence.
The approach to AGI that Chollet favours is a combination of deep learning and program synthesis.
Related post: ARC-AGI-2 Overview With François Chollet
With Chollet acknowledging that o1/o3 (and ARC 1 getting beaten) was a significant breakthrough, how much is this talk now outdated vs still relevant?
I think it’s still very relevant! It’s just important to also have the more recent information about o3 in addition to what’s in this talk. (That’s why I linked the other talk at the bottom of this post.)
By the way, I think it’s just o3 and not o1 that achieves the breakthrough results on ARC-AGI-1. It looks like o1 only gets 32% on ARC-AGI-1, whereas the lower-compute version of o3 gets around 76% and the higher-compute version gets around 87%.
The lower-compute version of o3 gets only 4% on ARC-AGI-2 in partial testing (full testing has not yet been done), and the higher-compute version has not yet been tested.
Chollet speculates in this blog post about how o3 works (I don’t think OpenAI has said much about this) and how that fits into his overall thinking about LLMs and AGI: