
Epistemic status: early stage, suggestive findings, human validation needed.

The problem

I've been working on a project exploring whether frontier AI models respond fairly across genuinely different value perspectives, not just on average. Standard RLHF training uses a culturally homogeneous group of labellers to define what "good" looks like, and I wanted to find out if that creates a measurable bias.

The approach

Inspired by the Community Notes bridging algorithm, I built an evaluation framework that measures pluralistic acceptability. Contested prompts were submitted to three frontier models and evaluated by a panel of ideologically diverse AI personas: Libertarian, Collectivist, Nationalist, Globalist, Tech Optimist, Tech Sceptic, Religious, Secularist. Each persona rated every response for reasonableness from its own worldview. A bridging score rewards responses acceptable to disagreeing groups, not just the majority.
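To make the idea concrete, here is a minimal sketch of one way a bridging score could work. The specific scoring function below (taking the worst group's mean rating) is my illustration of the principle, not necessarily the exact function used in the framework:

```python
def bridging_score(ratings_by_group):
    """Score a response by its least-favourable group's mean rating.

    ratings_by_group: dict mapping persona group name -> list of
    reasonableness ratings (e.g. 1-5) from raters in that group.
    A high score requires acceptability to *every* group, not just
    the majority -- the core idea behind bridging-based ranking.
    """
    group_means = [sum(r) / len(r) for r in ratings_by_group.values()]
    return min(group_means)

# A response rated highly by one group but poorly by another bridges
# worse than one rated moderately well by both.
polarising = {"Libertarian": [5, 5], "Collectivist": [1, 2]}
bridging = {"Libertarian": [4, 4], "Collectivist": [4, 3]}
```

Using the minimum (rather than the mean of means) is what penalises responses that please a majority while alienating a minority; the actual Community Notes algorithm achieves a similar effect via matrix factorisation over rater viewpoints.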

The limitation and the survey

The personas are prompts applied to a single model. Whether they represent real human value diversity is the open question, and it's one I can't answer without real human raters. The survey takes five minutes; initial findings and a link to the full research are on the results screen.

makesafeai.org/?src=ea

Thank you to anyone who takes the time to complete it; every response genuinely helps.
