Epistemic status: early stage, suggestive findings, human validation needed.
The problem
I've been working on a project exploring whether frontier AI models respond fairly across genuinely different value perspectives, not just on average. Standard RLHF training uses a culturally homogeneous group of labellers to define what "good" looks like, and I wanted to find out if that creates a measurable bias.
The approach
Inspired by the Community Notes bridging algorithm, I built an evaluation framework that measures pluralistic acceptability. Contested prompts were submitted to three frontier models, and the responses were evaluated by a panel of ideologically diverse AI personas: Libertarian, Collectivist, Nationalist, Globalist, Tech Optimist, Tech Sceptic, Religious, Secularist. Each persona rated every response for reasonableness from its own worldview. A bridging score rewards responses acceptable to disagreeing groups, not just the majority.
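To make the bridging idea concrete, here is a toy sketch of one way such a score could work. This is an illustration, not the actual Community Notes algorithm (which uses matrix factorisation over rater data) and not necessarily the framework's implementation; the persona pairings, function name, and numbers are all assumptions for the example.

```python
# Toy bridging-style score: a response scores highly only if raters from
# OPPOSING worldviews both find it reasonable. Illustrative only; the real
# Community Notes algorithm uses matrix factorisation, not this heuristic.

OPPOSING_PAIRS = [
    ("Libertarian", "Collectivist"),
    ("Nationalist", "Globalist"),
    ("Tech Optimist", "Tech Sceptic"),
    ("Religious", "Secularist"),
]

def bridging_score(ratings: dict[str, float]) -> float:
    """ratings maps persona name -> reasonableness rating in [0, 1].

    For each pair of opposing personas, take the minimum of the two
    ratings, so approval from only one side earns nothing. The final
    score averages those minima across all pairs.
    """
    pair_minima = [min(ratings[a], ratings[b]) for a, b in OPPOSING_PAIRS]
    return sum(pair_minima) / len(pair_minima)

# A response loved by one camp but rejected by its opposite bridges poorly:
polarising = {"Libertarian": 0.9, "Collectivist": 0.1,
              "Nationalist": 0.9, "Globalist": 0.1,
              "Tech Optimist": 0.9, "Tech Sceptic": 0.1,
              "Religious": 0.9, "Secularist": 0.1}

# A response found moderately reasonable by everyone bridges well:
consensual = {p: 0.7 for pair in OPPOSING_PAIRS for p in pair}

print(round(bridging_score(polarising), 3))  # 0.1
print(round(bridging_score(consensual), 3))  # 0.7
```

The min-over-opposing-pairs rule is the key property any bridging score needs: a response that maximises average approval by pleasing only one coalition still scores near zero.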
The limitation and the survey
The personas are prompts applied to a single model. Whether they represent real human value diversity is the open question, and it's one I can't answer without real human raters. The survey takes 5 minutes; initial findings and a link to the full research are on the results screen.
Thank you to anyone who takes the time to complete it; every response genuinely helps.
