
This is the full text of a post from "The Obsolete Newsletter," a Substack that I write about the intersection of capitalism, geopolitics, and artificial intelligence. I’m a freelance journalist and the author of a forthcoming book called Obsolete: Power, Profit, and the Race to Build Machine Superintelligence. Consider subscribing to stay up to date with my work.

Earlier this month, I wrote, "There's a vibe that AI progress has stalled out in the last ~year, but I think it's more accurate to say that progress has become increasingly illegible."

I argued that while AI performance on everyday tasks only got marginally better, systems made massive gains on difficult, technical benchmarks of math, science, and programming. If you weren't working in these fields, this progress was mostly invisible, but might end up accelerating R&D in hard sciences and machine learning, which could have massive ripple effects on the rest of the world.

Today, OpenAI announced a new model called o3 that turbocharges this trend, obliterating benchmarks that the average person would have no idea how to parse (myself included).

A bit over a month ago, Epoch AI introduced FrontierMath, "a benchmark of hundreds of original, expert-crafted mathematics problems designed to evaluate advanced reasoning capabilities in AI systems."

These problems are really fucking hard, and the state-of-the-art (SOTA) performance of an AI model was ~2%. They were also novel and unpublished, to eliminate the risk of data contamination.

OpenAI says that o3 got 25% of these problems correct.

Terence Tao, perhaps the greatest living mathematician, said that the hardest of these problems are "extremely challenging... I think they will resist AIs for several years at least.”

Jaime Sevilla, director of Epoch AI, wrote that the results were "far better than our team expected so soon after release. AI has hit a wall, and smashed it through."

Buck Shlegeris, CEO of the AI safety nonprofit Redwood Research, wrote to me that, "the FrontierMath results were very surprising to me. I expected it to take more than a year to get this performance."

FrontierMath was created in part because models have been scoring so well on other benchmarks that they quickly "saturate" them, to the point that those benchmarks stop being useful differentiators.

Other benchmarks

o3 significantly improved upon the SOTA in a number of other challenging technical benchmarks of mathematics, hard science questions, and programming.

In September, o1 became the first model to exceed human domain experts (~70%) on the GPQA Diamond benchmark of PhD-level hard science questions, scoring ~78%. o3 now definitively beats both, getting 88% right. In other words, a single AI system now outperforms the average human expert on questions drawn from each expert's own field.

o3 also scores high enough on Codeforces programming competition problems to place it as the 175th top-scoring human in the world (out of the ~168k users active in the last six months).
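For a sense of scale, that rank works out to roughly the top 0.1% of active users. This is my own back-of-the-envelope arithmetic, not a figure from OpenAI:

```python
# Back-of-the-envelope arithmetic (mine, not OpenAI's): a rank of 175 among
# ~168,000 recently active Codeforces users is roughly the top 0.1%.
rank = 175
active_users = 168_000
print(f"top {rank / active_users:.2%} of active users")  # -> top 0.10%
```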

SWE-Bench is a benchmark built from real-life GitHub issues in open-source codebases. The top score a year ago was 4.4%. The top score at the start of December was 55%. OpenAI says o3 got 72% correct.

ARC-AGI

o3 also approaches human performance on the ARC-AGI benchmark, which was designed to be hard for AI systems but relatively doable for humans (i.e. you don't need a PhD to get decent scores). However, it's expensive for o3 to get those scores.

OpenAI researcher Nat McAleese published a thread on the results, acknowledging that "o3 is also the most expensive model ever at test-time," i.e. when the model is being used on tasks. Running o3 on ARC-AGI tasks cost between $17 and thousands of dollars per problem, while humans can solve them for $5-10 each.

This aligns with what I wrote last month about how the first AI systems to achieve superhuman performance on certain tasks might actually cost more than humans working on those same tasks. However, that probably won't last long, if historic cost trends hold.

McAleese agrees (though note that it's bad news for OpenAI if this isn't the case):

My personal expectation is that token prices will fall and that the most important news here is that we now have methods to turn test-time compute into improved performance up to a very large scale.
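To make the cost-trend point concrete, here's a minimal sketch of how quickly a steep per-task price would fall below human cost if inference prices keep declining at a steady annual rate. The numbers are purely illustrative assumptions of mine, not figures from OpenAI or ARC-AGI:

```python
# A minimal sketch of the cost-trend argument, using illustrative numbers only
# (these are my assumptions, not figures from OpenAI or ARC-AGI).

def years_until_cheaper(ai_cost: float, human_cost: float, annual_decline: float) -> int:
    """Years until AI per-task cost drops below human cost, assuming the
    price falls by `annual_decline` (e.g. 0.75 = 75%) each year."""
    years = 0
    while ai_cost > human_cost:
        ai_cost *= 1 - annual_decline
        years += 1
    return years

# Hypothetical: a $3,000-per-task ARC-AGI-style run vs. ~$7.50 for a human,
# with an assumed 75% annual price decline.
print(years_until_cheaper(ai_cost=3_000, human_cost=7.5, annual_decline=0.75))  # -> 5
```

Under those made-up assumptions, a $3,000-per-task system undercuts the human cost in about five years; a faster price decline shortens that considerably.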

I'm particularly interested in seeing how o3 performs on RE-Bench, a set of machine learning problems that may offer the best insight into how well AI agents stack up against expert humans in doing the work that could theoretically lead to an explosion in AI capabilities. I would guess that it will be significantly better than the current SOTA, but also significantly more expensive (though still cheaper than the human experts).

Mike Knoop, co-founder of the ARC Prize, wrote on X that "o3 is really special and everyone will need to update their intuition about what AI can/cannot do."

So is deep learning actually hitting a wall?

A bit over a month ago, I asked "Is Deep Learning Actually Hitting a Wall?" following a wave of reports that scaling up models like GPT-4 was no longer resulting in proportional performance gains. There was hope for the industry in the form of OpenAI's o1 approach, which uses reinforcement learning to "think" longer on harder problems, resulting in better performance on some reasoning and technical benchmarks. However, it wasn't clear how the economics of that approach would pencil out or where the ceiling was. I concluded:

all things considered, I would not bet against AI capabilities continuing to improve, albeit at a slower pace than the blistering one that has marked the dozen years since AlexNet inaugurated the deep learning revolution.

It's hard to look at the results of o1 and then the potentially even more impressive results of o3 published ~3 months later and say that AI progress is slowing down. We may even be entering a new world where progress on certain classes of problems happens faster than ever before, all while other domains stagnate.

Shlegeris alluded to this dynamic in his message to me:

It's interesting that the model is so exceptional at FrontierMath while still only getting 72% on SWEBench-Verified. There are way more humans who are able to beat its SWEBench performance than who are able to get 25% at FrontierMath.

Once more people get access to o3, there will inevitably be widely touted examples of it failing on common-sense tasks, and it may be worse than other models at many tasks (maybe even most of them). These examples will be used to dismiss the genuine capability gains demonstrated here. Meanwhile, AI researchers will use o3 and models like it to continue to accelerate their work, bringing us closer to a future where humanity is increasingly rendered obsolete.

If you enjoyed this post, please subscribe to The Obsolete Newsletter
