I voted 'disagree' on this, not because I'm highly confident you are wrong, but because I think things are a lot less straightforward than this. A couple of counterpoints that I think clash with this thesis:
To be clear, I don't think this is a watertight argument that AGIs will be moral, I think it's an argument for just being really uncertain. For example, maybe utilitarianism is a kind of natural idea that any intelligent being who feels some form of compassion might arrive at (this seems very plausible to me), but maybe a pure utilitarian superintelligence would actually be a bad outcome! Maybe we don't want the universe filled with organisms on heroin! Or for everyone else to be sacrificed to an AGI utility monster.
I can see lots of reasons for worry, but I think there are reasons for optimism too.
I'm feeling inspired by Anneliese Dodds' decision to resign as a government minister over this issue, which is grabbing the headlines today! Before that I'd been feeling very disappointed about the lack of pushback I was seeing in news coverage.
I haven't written my letter to my MP yet, but I've remembered that I am actually a member of the Labour party. Would a letter to my local Labour MP have even more impact if I also cancelled my Labour membership in protest? OK, I'm not a government minister, just an ordinary party member who hasn't attended a party event in years, but they do still get some money from me at the moment!
Or would cancelling the membership mean I have less influence on future issues, and so ultimately be counter-productive? Any thoughts?
In addition, o3 was also trained on the public data of ARC-AGI, a dataset composed of abstract visual reasoning problems in the style of Raven's Progressive Matrices [52]. When combined with the large amount of targeted research this benchmark has attracted in recent years, the high scores achieved by o3 should not be considered a reliable metric of general reasoning capabilities.
This take seems to contradict Francois Chollet's own write-up of the o3 ARC results, where he describes the results as:
a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs. o3 is a system capable of adapting to tasks it has never encountered before
(taken from your reference [52], emphasis mine)
You could write this off as him wanting to talk up the significance of his own benchmark, but I'm not sure that would be right. He has been very publicly sceptical of the ability of LLMs to scale to general intelligence, so this is a kind of concession from him. And he had already laid the groundwork in his Dwarkesh Patel interview to explain away high ARC performance as cheating if a system tackled the problem in the wrong way, cracking it through memorization via an alternative route (e.g. auto-generating millions of ARC-like problems and training on those). He could easily have dismissed the o3 results on those grounds, but chose not to, which made an impression on me (a non-expert trying to decide how to weigh up the opinions of different experts). Presumably he is aware that o3 was trained on the public dataset, and doesn't view that as cheating. The public dataset is small, and the problems are explicitly designed to resist memorization, requiring general intelligence: being told the solution to earlier problems is not supposed to help you solve later problems.
What's your take on this? Do you disagree with the write-up in [52]? Or do you think I'm mischaracterizing his position? (There are plenty of caveats outside the bit I selectively quoted as well, so maybe I am.)
The fact that human-level ARC performance could only be achieved at extremely high inference-time compute cost seems significant too. Why would we get inference-time scaling at all if chain-of-thought consisted of little more than post-hoc rationalization rather than real reasoning?
For context, I used to be pretty sympathetic to the "LLMs do most of the impressive stuff by memorization and are pretty terrible at novel tasks" position, and I still think this is a good model for the non-reasoning LLMs, but my views have changed a lot since the reasoning models arrived, particularly because of the ARC results.
This is an interesting analysis!
I agree with MaxRa's point. When I skim-read "Metaculus pro forecasters were better than the bot team, but not with statistical significance", I immediately internalised the message as "bots are getting almost as good as pros" (a message I probably already got from the post title!), and it was only when I forced myself to slow down and read more carefully that I realised this is not what the result means. For example, you could have run this study on a single question, and the stated result could have been true while telling you very little either way about relative performance. Only then did I notice that both main results were null results. So I'm not sure whether this actually supports the 'Bots are closing the gap' claim or not.
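To illustrate what I mean about statistical significance (a toy simulation with numbers I made up, not the study's actual data or Brier scores), a forecaster with a genuine but modest advantage can easily fail to reach significance when the question set is small:

```python
# Toy power simulation (invented numbers, not the study's data):
# give the "pros" a genuine Brier-score advantage and see how often a
# paired t-test detects it at p < 0.05 for different question counts.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)

def detection_rate(n_questions, true_gap=0.02, noise=0.1, trials=2000):
    """Fraction of simulated studies where the pros' real advantage is significant."""
    hits = 0
    for _ in range(trials):
        bots = rng.normal(0.20, noise, n_questions)             # per-question Brier scores
        pros = rng.normal(0.20 - true_gap, noise, n_questions)  # genuinely better on average
        _, p = ttest_rel(pros, bots)
        hits += (p < 0.05) and (pros.mean() < bots.mean())
    return hits / trials

for n in (5, 20, 100, 500):
    print(f"{n:>3} questions: advantage detected in {detection_rate(n):.0%} of studies")
```

With made-up numbers like these, a real (if small) gap only shows up reliably once you have several hundred questions, which is why I don't think a null result on its own tells us the gap has closed.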
The histogram plot is really useful, and the points of reference are helpful too. I'd be interested to know what the histogram would look like if you compared pro human forecasters to average human forecasters on a similar set of questions. How big an effect do we see there? Or, to get more directly at what I'm wondering: how do bots compare to average human forecasters? Are they better with statistical significance, or not? Has this study already been done?
Thanks for the link, I've just given your previous post a read. It is great! Extremely well written! Thanks for sharing!
I have a few thoughts on it that I'd like to share. I'd be interested to read a reply, but don't worry if it would be too time-consuming.
I like the 'replace one neuron at a time' thought experiment, but accept it has flaws. For me, it's the fact that we could in principle simulate a brain on a digital computer and have it behave identically that convinces me of functionalism. I can't grok how a system could behave identically and yet its thoughts not 'exist'.
Thanks for the reply, this definitely helps!
The brain operating according to the known laws of physics doesn't imply we can simulate it on a modern computer (assuming you mean a digital computer). A trivial example is certain quantum phenomena. Digital hardware doesn't cut it.
Could you explain what you mean by this? I wasn't aware of any quantum phenomena that could not be simulated on a digital computer. Where do the non-computable functions appear in quantum theory? (My background: I have a PhD in theoretical physics, which certainly doesn't make me an expert on this question, but I'd be very surprised if this were true and I'd never heard about it! And I'd be a bit embarrassed if it were a fact considered 'trivial' and I was unaware of it!)
There are quantum processes that can't be simulated efficiently on a digital computer, but that is a different question.
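To illustrate the distinction I have in mind (just a toy sketch): a digital computer can simulate quantum state evolution exactly; the catch is that the state vector grows exponentially with the number of qubits, so it quickly becomes infeasible rather than impossible.

```python
# Toy sketch: exact digital simulation of a 2-qubit circuit (a Bell state).
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)    # Hadamard gate
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

state = np.zeros(4)
state[0] = 1.0                                   # start in |00>
state = np.kron(H, np.eye(2)) @ state            # Hadamard on the first qubit
state = CNOT @ state                             # entangle: (|00> + |11>)/sqrt(2)

probs = (np.abs(state) ** 2).round(3)
for label, p in zip(["00", "01", "10", "11"], probs):
    print(f"|{label}>: {p}")                     # 0.5, 0.0, 0.0, 0.5

# The obstacle is cost, not possibility: an n-qubit state needs 2**n amplitudes.
for n in (10, 30, 50):
    print(f"{n} qubits -> {2**n * 16:.1e} bytes for the state vector")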
I don't think I fully understand exactly what you are arguing for here, but would be interested in asking a few questions to help me understand it better, if you're happy to answer?
Ah, that's a really interesting way of looking at it, that you can trade training-compute for inference-compute to only bring forward capabilities that would have come about anyway via simply training larger models. I hadn't quite got this message from your post.
My understanding of Francois Chollet's position (he's where I first heard the comparison of logarithmic inference-time scaling to brute-force search, before I saw Toby's thread) is that RL on chain-of-thought has unlocked genuinely new capabilities that would have been impossible simply by scaling traditional LLMs (or maybe it has to be chain-of-thought combined with tree search; whatever the magic ingredient is, he has acknowledged that o3 has it).
Of course, this could just be his way of explaining why the o3 ARC results don't prove his earlier positions wrong. People don't like to admit when they're wrong! But this view still seems plausible to me, and it contradicts the 'trading off' narrative, so I'd be extremely interested to know which picture is correct. I'll have to read that paper!
But I guess maybe it doesn't matter a lot in practice, in terms of the impact that reasoning models are capable of having.
This was a thought-provoking and quite scary summary of what reasoning models might mean.
I think this sentence may have a mistake though:
"you can have GPT-o1 think 100-times longer than normal, and get linear increases in accuracy on coding problems."
Doesn't the graph show that the accuracy gains are only logarithmic? The x-axis is a log scale.
This logarithmic relationship between performance and test-time compute is characteristic of brute-force search, and is maybe the one part of this story that means the consequences won't be quite so explosive? Or have I misunderstood?
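Here's a quick toy illustration of how I'm reading that log-x axis (numbers invented, not taken from the actual o1/o3 plots): a straight line on such a graph means each further fixed gain in accuracy costs a constant *multiple* of compute, not a constant amount.

```python
# Invented numbers: a straight line on a log-x plot means accuracy grows with
# log(compute), i.e. every extra 10 points costs ~10x more compute.
import numpy as np

compute = np.array([1, 10, 100, 1_000, 10_000])   # relative inference-time compute
accuracy = 40 + 10 * np.log10(compute)            # what "linear on a log-x axis" implies

for c, a in zip(compute, accuracy):
    print(f"{c:>6}x compute -> {a:.0f}% accuracy")
# 1x -> 40%, 10x -> 50%, 100x -> 60%, ... each step costs 10x the previous one,
# which is the same qualitative pattern you'd expect from brute-force search.
```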
Evolution is chaotic and messy, but so is stochastic gradient descent (the word 'stochastic' is in the name!). The objective function might be clean, but the process we use to search for a good model is not.
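A minimal sketch of what I mean (a toy linear-regression example, with numbers I've made up): the loss being minimised is a clean function, but every individual update is computed on a small random minibatch, so the path through parameter space is noisy.

```python
# Toy SGD: clean objective (mean squared error), noisy search process.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)   # noisy linear data

w = np.zeros(3)
lr, batch = 0.05, 16
for step in range(2000):
    idx = rng.integers(0, len(X), size=batch)       # random minibatch: the "stochastic" part
    grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch
    w -= lr * grad                                  # one tiny, noisy update among thousands

print(w.round(2))  # wanders noisily towards roughly [2.0, -1.0, 0.5]
```

Run it with a different seed and you get a different noisy path to a similar endpoint, which is the sense in which the search, like evolution, is messy even though the selection criterion is simple.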
If AGI emerges from the field of machine learning in the state it's in today, then it won't be "designed" to pursue a goal, any more than humans were designed. Instead it will emerge from a random process, through billions of tiny updates, and this process will just have been rigged to favour things which do well on some chosen metric.
This seems extremely similar to how humans were created, through evolution by natural selection. In the case of humans, the metric being optimized for was the ability to spread our genes. In AIs, it might be accuracy at predicting the next word, or human helpfulness scores.
The closest things to AGI we have so far do not act with "strict logical efficiency", or always behave rationally. In fact, logic puzzles are one of the things they particularly struggle with!