
In February I got in touch with CaML as part of the AIxAnimals incubator, run by Sentient Futures. They tasked me with putting MORU Bench (Moral Reasoning Under Uncertainty) up on Inspect, a benchmarking service run by the UK’s AI Security Institute. My PR was accepted and I completed the project within three weeks. It was my first project working with Claude Code.
It felt surprisingly familiar despite being Python rather than Swift: building tooling, writing to a spec, making sure test coverage is solid. Of all the things I’ve done during this career switch, this is probably the closest to what I actually did as an iOS framework engineer.
A few things stood out to me:
- The Inspect team had an agents.md setup that could run workflows for checking whether your benchmark was ready for a pull request, whether it had sufficient test coverage, and for generating a report from your results. These files are incredibly useful when onboarding people to an unfamiliar project. When you’re new, you don’t know what to prompt for because you don’t know what’s there, what the guidelines are, or how the architecture fits together. Having an agents.md automates all of that.
- Having a fairly thorough agent-driven review before a human looks at a PR seems like a good process to adopt. I know a lot of people in the industry who are struggling with the volume of pull requests coming through because of AI coding tools, and relying on human review alone isn’t sustainable at scale if people are just vibe-coding their PRs. An automated pass feels like a reasonable way to enforce some quality before anything hits a human reviewer.
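The gate-before-human-review idea above can be sketched without any agent at all: a small script that refuses to hand a PR to a reviewer until mechanical checks pass. Everything here is a hypothetical illustration, not the Inspect team’s actual setup: the gate list, the pytest/ruff commands, and the `pr_ready` name are all assumptions.

```python
import subprocess

# Hypothetical gates for illustration; the commands and the list itself
# are assumptions, not the Inspect team's actual workflow.
CHECKS = [
    ("tests", ["pytest", "-q"]),
    ("lint", ["ruff", "check", "."]),
]

def pr_ready(checks=CHECKS, runner=subprocess.run):
    """Run each gate command; return (ok, failed_gate_names).

    `runner` is injectable so the gating logic can be exercised in a
    test without actually shelling out to pytest or ruff.
    """
    failures = [name for name, cmd in checks
                if runner(cmd, capture_output=True).returncode != 0]
    return (not failures, failures)
```

An agent-driven version would add a review of the diff itself, but even this mechanical layer filters out the PRs that would otherwise waste a human reviewer’s first pass.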
Is this impactful work?
I’m a bit torn. On one hand, what CaML are doing (creating benchmarks that measure how an AI views animal welfare, how compassionate it is to non-human beings) seems like a great way to influence model behaviour. These benchmarks are used by frontier labs as targets to hit before releasing a model, so they have a pretty direct effect on how models will end up behaving.
On the other hand, I’m apprehensive about benchmarks in general. During my time at the EA hotel I spoke to a few AI safety people who considered most evaluations to be progressing capabilities, not safety. The logic is that an evaluation measuring frontier maths, or any other capability, ends up helping the labs make their agents better at that capability. I don’t feel experienced enough to hold a strong opinion on this myself; I’m just aware that opinions differ. I’ve been encouraging the people I know who feel strongly about this to write about it, because I’d love to see their views stress-tested.
Fit
Of all the fit tests so far, this has been the most promising in terms of how capable I am at it. It maps nicely onto my existing experience, and I did enjoy it, maybe not quite as much as iOS development, but more than any other project I’ve worked on during this process.
I also have to bear in mind that creating an evaluation is probably one of the easier tasks. I can see the work being more interesting when you’re actually maintaining a framework rather than using it to create benchmarks. It’s also the kind of work where I’m in a much better position to hit the ground running and get a job.
A couple of concerns though:
- In the case of Inspect, it would likely mean moving to London. Not sure that’s my cup of tea.
- Long-term viability. This is a form of coding that involves no ML, and it’s already highly automatable with Claude Code. I actually met with a former manager at AISI who’s attempting to automate benchmark creation. That said, it could be interesting to be the one doing the automating.
All in all, it’s a strong contender so far and a successful fit test. I may go back and do a few additional PRs for Inspect, or look into other evaluation projects, to revisit this.
