
In February I got in touch with CaML as part of the AIxAnimals incubator, run by Sentient Futures. They tasked me with putting MORU Bench (Moral Reasoning Under Uncertainty) up on Inspect, a benchmarking service run by the UK’s AI Security Institute. My PR was accepted and I completed the project within three weeks. It was my first project working with Claude Code.
It felt surprisingly familiar despite being Python rather than Swift: building tooling, writing to a spec, making sure test coverage is solid. Of all the things I’ve done during this career switch, this is probably the closest to what I actually did as an iOS framework engineer.
A few things stood out to me:
- The Inspect team had an agents.md setup that could run workflows for checking whether your benchmark was ready for a pull request, whether it had sufficient test coverage, and for generating a report from your results. These files are incredibly useful when onboarding people to an unfamiliar project. When you’re new, you don’t know what to prompt for because you don’t know what’s there, what the guidelines are, or how the architecture fits together. Having an agents.md automates all of that.
- Having a fairly thorough agent-driven review before a human looks at a PR seems like a good process to adopt. I know a lot of people in the industry who are struggling with the volume of pull requests coming through because of AI coding tools, and relying on human review alone isn’t sustainable at scale if people are just vibe-coding their PRs. An automated pass feels like a reasonable way to enforce some quality before anything hits a human reviewer.
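The gate-before-human-review idea above can be sketched without any agent at all: a small script that refuses to hand a PR to a reviewer until mechanical checks pass. Everything here is a hypothetical illustration, not the Inspect team’s actual setup: the gate list, the pytest/ruff commands, and the `pr_ready` name are all assumptions.

```python
import subprocess

# Hypothetical gates for illustration; the commands and the list itself
# are assumptions, not the Inspect team's actual workflow.
CHECKS = [
    ("tests", ["pytest", "-q"]),
    ("lint", ["ruff", "check", "."]),
]

def pr_ready(checks=CHECKS, runner=subprocess.run):
    """Run each gate command; return (ok, failed_gate_names).

    `runner` is injectable so the gating logic can be exercised in a
    test without actually shelling out to pytest or ruff.
    """
    failures = [name for name, cmd in checks
                if runner(cmd, capture_output=True).returncode != 0]
    return (not failures, failures)
```

An agent-driven version would add a review of the diff itself, but even this mechanical layer filters out the PRs that would otherwise waste a human reviewer’s first pass.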
Is this impactful work?
I’m a bit torn. On one hand, what CaML are doing (creating benchmarks that measure how an AI views animal welfare, how compassionate it is to non-human beings) seems like a great way to influence model behaviour. These benchmarks are used by frontier labs as targets to hit before releasing a model, so they have a pretty direct effect on how models will end up behaving.
On the other hand, I’m apprehensive about benchmarks in general. During my time at the EA hotel I spoke to a few AI safety people who considered most evaluations to be progressing capabilities, not safety. The logic is that an evaluation measuring frontier maths, or any other capability, ends up helping the labs make their agents better at that capability. I don’t feel experienced enough to hold a strong opinion on this myself; I’m just aware that opinions differ. I’ve been encouraging the people I know who feel strongly about this to write about it, because I’d love to see their views stress-tested.
Fit
Of all the fit tests so far, this has been the most promising in terms of how capable I am at it. It maps nicely onto my existing experience, and I did enjoy it, maybe not quite as much as iOS development, but more than any other project I’ve worked on during this process.
I also have to bear in mind that creating an evaluation is probably one of the easier tasks. I can see the work being more interesting when you’re actually maintaining a framework rather than using it to create benchmarks. It’s also the kind of work where I’m in a much better position to hit the ground running and get a job.
A couple of concerns though:
- In the case of Inspect, it would likely mean moving to London. Not sure that’s my cup of tea.
- Long-term viability. This is a form of coding that involves no ML, and it’s already highly automatable with Claude Code. I actually met with a former manager at AISI who’s attempting to automate benchmark creation. That said, it could be interesting to be the one doing the automating.
All in all, it’s a strong contender so far and a successful fit test. I may go back and do a few additional PRs for Inspect, or look into other evaluation projects, to revisit this.
