Can we ever ensure AI alignment if we can only test AI personas? — EA Forum

1 min read · Mar 16, 2025

When I talk to Claude or ChatGPT, I am, as far as I understand it, not really talking to the underlying LLM but to a fictional persona that it selects from a near-infinite set of possible personas. If that is true, then an evaluation does not test the AI itself but only the persona it selects, and all test results and benchmarks apply only to that imaginary entity.

Therefore, if we’re talking about “aligning an AI”, we’re actually talking about two different things:

1. Aligning the default persona (or a subset of all possible personas).
2. Making sure that any user can only ever talk to or use an aligned persona.

If this reasoning is correct, then making sure a sufficiently intelligent general AI is always aligned with human values seems to be impossible in principle:

1. Aligning even the default persona is difficult.
2. It seems impossible in principle to restrict the personas an AI can select to aligned ones only, because it is impossible to know what is “good” without understanding what is “bad”.
3. It seems extremely difficult, if not impossible, to rule out with sufficient probability that an AI selects or identifies with a misaligned persona, either by accident (the Waluigi effect) or due to an outside attack (a jailbreak).
4. It may be impossible in principle to distinguish an aligned persona from a misaligned persona just by testing it (see Abhinav Rao, Jailbreak Paradox: The Achilles’ Heel of LLMs).

Am I missing something? Or is my conclusion correct that it is theoretically impossible to be reasonably confident that an AI smarter than humans is aligned? I’d really appreciate any answers or comments pointing out flaws in my reasoning.
