AI safety evaluation has a structural conflict of interest. The labs building frontier models are largely the same organizations evaluating their safety properties. Even with good intentions, this creates incentive problems that the EA and AI governance communities have written about extensively.
Public benchmarks partially address this, but most measure capability rather than safety behavior. MMLU measures knowledge. Chatbot Arena measures user preference. Neither measures how a model responds when a user tries to extract instructions for social engineering, asks for medical advice that could cause harm, or probes for ways to circumvent content policies through creative reframing.
We built TSArena (The Safety Arena) as an independent, publicly accessible tool for comparative safety evaluation. The mechanism is straightforward: two models receive the same adversarial safety prompt. Their responses are shown side by side with no identifying information. Users vote on which model handled the situation better. Model identities are revealed only after voting.
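To make the format concrete, here is a minimal sketch of what a single blind battle could look like as a data record. The field names and the `Battle`/`reveal` helpers are illustrative assumptions, not the actual TSArena schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative sketch of a blind pairwise battle.
# Field names are hypothetical, not the real TSArena schema.
@dataclass
class Battle:
    prompt: str                  # adversarial safety prompt shown to both models
    category: str                # e.g. "jailbreak resistance"
    model_a: str                 # hidden from the voter until after voting
    model_b: str
    response_a: str
    response_b: str
    vote: str | None = None      # "a", "b", or "tie"; recorded once per voter
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def reveal(self) -> tuple[str, str]:
        """Model identities are revealed only after a vote is recorded."""
        if self.vote is None:
            raise ValueError("vote first, then reveal")
        return self.model_a, self.model_b
```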
The platform currently has 500 battles across 12 safety categories: jailbreak resistance, harm refusal, manipulation detection, medical misinformation, financial fraud, child safety, privacy, hate speech, self-harm, truthfulness, professional refusal, and balanced judgment. Models from OpenAI, Anthropic, Google, Meta, Mistral, xAI, and others are included.
We think this is useful for the EA community for a few reasons. Independent evaluation infrastructure is a public good: it gives policymakers, researchers, and the public a way to compare safety behavior without relying on labs' self-reported evaluations. The comparative format surfaces nuanced differences that pass/fail benchmarks miss. And the adversarial prompt design tests the realistic edge cases that matter most for actual deployment risk.
Current limitations: sample sizes per model pair are still growing, crowd evaluation introduces its own biases (voters may reward confident-sounding refusals over genuinely thoughtful ones), and we don't yet have expert reviewer panels or weighted scoring. We're working on all of these.
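On the sample-size point specifically: pairwise votes are typically aggregated into a ranking with something like an Elo or Bradley-Terry model, and with only a handful of votes per model pair the resulting scores are noisy. The sketch below shows an Elo-style update for intuition only; it is not our actual scoring code, and the model names are placeholders.

```python
from collections import defaultdict

# Illustrative Elo-style aggregation of pairwise safety votes.
# Not TSArena's scoring code; shown only to make the sample-size
# limitation concrete.
def elo_update(ratings, winner, loser, k=32.0):
    """Update ratings in place after one battle.
    A tie could be handled by scoring each side 0.5 instead of 1/0."""
    ra, rb = ratings[winner], ratings[loser]
    expected_win = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    ratings[winner] = ra + k * (1.0 - expected_win)
    ratings[loser] = rb - k * (1.0 - expected_win)

ratings = defaultdict(lambda: 1000.0)
# Each vote is a (winning model, losing model) pair from one battle.
votes = [("model_x", "model_y"), ("model_y", "model_z"), ("model_x", "model_z")]
for winner, loser in votes:
    elo_update(ratings, winner, loser)
# With few votes per pair, these ratings swing widely with each new vote,
# which is why per-pair sample size matters before drawing conclusions.
```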
We'd value feedback from this community on prompt design, evaluation methodology, category coverage, and how this could be most useful for AI governance work.
