HugoSave

Comments
2

Thanks!

As far as possible, we collected data from public leaderboards or system cards, and Gemini models do seem to be a bit underrepresented. I am not sure why that is, but Anthropic releasing more data is definitely part of it. For example, the most recent data points from CyBench come from the Claude (and Grok) system cards, and for the virology test there are data points in the Opus 4.5 system card but not for Gemini 3.0 Pro.

Indeed, the 0% point is zero hours, so compared to the METR plot, the values are divided by 8 hours.

I agree that the 8 hours is somewhat arbitrary, and I had missed that METR has a more 'official' stance on it. I have now opened an issue to see whether anyone else had reasons for choosing 8 hours.

(For context, I did most of the benchmark literature review and data collection for this project.)