Reposting from LessWrong, for people who might be less active there:[1]
TL;DR
Edit (19/01): Elliot (the project lead) points out that the holdout set does not yet exist (emphasis added):
As for where the o3 score on FM stands: yes I believe OAI has been accurate with their reporting on it, but Epoch can't vouch for it until we independently evaluate the model using the holdout set we are developing.[3]
Edit (24/01):
Tamay tweets an apology (possibly including the timeline drafted by Elliot). It's pretty succinct, so I won't summarise it here! Blog post version for people without Twitter. Perhaps the most relevant point:
OpenAI commissioned Epoch AI to produce 300 advanced math problems for AI evaluation that form the core of the FrontierMath benchmark. As is typical of commissioned work, OpenAI retains ownership of these questions and has access to the problems and solutions.
Nat from OpenAI with an update from their side:
- We did not use FrontierMath data to guide the development of o1 or o3, at all.
- We didn't train on any FM derived data, any inspired data, or any data targeting FrontierMath in particular
- I'm extremely confident, because we only downloaded frontiermath for our evals *long* after the training data was frozen, and only looked at o3 FrontierMath results after the final announcement checkpoint was already picked.
============
Some quick uncertainties I had:
Epistemic status: quickly summarised + liberally copy-pasted with ~0 additional fact-checking, given Tamay's replies in the comment section
arXiv v5 (Dec 20th version): "We gratefully acknowledge OpenAI for their support in creating the benchmark."
See the clarification in case you interpreted Tamay's comments (e.g. that OpenAI "do not have access to a separate holdout set that serves as an additional safeguard for independent verification") to mean that the holdout set already exists.
Note that the holdout set doesn't exist yet. https://x.com/ElliotGlazer/status/1880812021966602665
What does this mean for OpenAI's 25% score on the benchmark?
Note that only some of FrontierMath's problems are actually at the frontier, while others are relatively easier (e.g. IMO level, and DeepMind was already one point short of gold on IMO-level problems). https://x.com/ElliotGlazer/status/1870235655714025817
I've known Jaime for about ten years. Seems like he made an arguably wrong call when first dealing with real powaah, but overall I'm confident his heart is in the right place.
Some very quick thoughts on EY's TIME piece, from the perspective of someone ~outside of AI safety work. I have no technical background and don't follow the field closely, so I'm likely missing some context and nuance; happy to hear pushback!
Shut down all the large training runs. Put a ceiling on how much computing power anyone is allowed to use in training an AI system, and move it downward over the coming years to compensate for more efficient training algorithms. No exceptions for governments and militaries. Make immediate multinational agreements to prevent the prohibited activities from moving elsewhere. Track all GPUs sold. If intelligence says that a country outside the agreement is building a GPU cluster, be less scared of a shooting conflict between nations than of the moratorium being violated; be willing to destroy a rogue datacenter by airstrike.
Frame nothing as a conflict between national interests, have it clear that anyone talking of arms races is a fool. That we all live or die as one, in this, is not a policy but a fact of nature. Make it explicit in international diplomacy that preventing AI extinction scenarios is considered a priority above preventing a full nuclear exchange, and that allied nuclear countries are willing to run some risk of nuclear exchange if that’s what it takes to reduce the risk of large AI training runs.
It's true we've discussed this already...