Intent alignment should not be the goal for AGI x-risk reduction

johnjnay

3 min read · Oct 26, 2022

Why does solving intent alignment not lower x-risk sufficiently?

If we solve the intent alignment problem between a human, H, and an AI, A, then A implements H’s intentions with superhuman intelligence and skill.

There are multiple Hs and multiple As.

By the very nature of humans, there are conflicts in the intentions of the Hs.

Humans have conflicting preferences about the behavior of other humans and about states of the world more broadly. Intent-aligned As would thus have different intentions from one another.

The As execute actions furthering their Hs’ intentions far too quickly for those conflicts to be resolved through any existing human-driven conflict resolution process. Conflicts are thus likely to spiral out of control.

Any ultimate conflict resolution mechanism needs to be human-driven. No A can conduct the conflict resolution work because it does not have buy-in from all Hs (or their intent-aligned As). Affected Hs need to endorse the process and respect the outcome. That only happens with democratic procedures.

Therefore, if we solve intent alignment, we do not solve the problem of making AGI sufficiently beneficial to humans. We do not drastically reduce P(misalignment x-risk), because there will still be misalignment between many of the AGI systems and many of the humans. That level of conflict between powerful agents could be existential for humanity as a whole.

Then what should we be aiming for?

To minimize P(misalignment x-risk | AGI), we should work on technical solutions to societal-AGI alignment, in which As internalize a distilled and routinely updated constellation of shared values. Those values are determined by deliberative democratic processes and authoritative conflict resolution mechanisms, both driven entirely by humans (and not AI). Humans already have these processes and mechanisms (and they are well-developed in the nation with the highest probability of producing AGI, the U.S.).

We need to do the work to internalize these things in AI systems. At best, work toward intent alignment diverts resources from technical work on societal-AGI alignment. At worst, if intent-aligned AGI is developed first, it actively makes finishing the societal-AGI alignment work harder.

If societal-AGI alignment is solved before intent alignment is solved, then there will be powerful societally-aligned AGI that can reduce the probability of intent-aligned AGIs being developed and/or having negative impacts.

Appendix A: What is intent-AGI alignment?

Cullen O’Keefe summarized intent alignment well in this Alignment Forum post.

The standard definition of "intent alignment" generally concerns only the relationship between some property of a human principal H and the actions of the human's AI agent A:

Jan Leike et al. define the "agent alignment problem" as "How can we create agents that behave in accordance with the user's intentions?"
Amanda Askell et al. define "alignment" as "the degree of overlap between the way two agents rank different outcomes."
Paul Christiano defines "AI alignment" as "A is trying to do what H wants it to do."
Richard Ngo endorses Christiano's definition.

Iason Gabriel does not directly define "intent alignment," but provides a taxonomy wherein an AI agent can be aligned with:

1. "Instructions: the agent does what I instruct it to do."
2. "Expressed intentions: the agent does what I intend it to do."
3. "Revealed preferences: the agent does what my behaviour reveals I prefer."
4. "Informed preferences or desires: the agent does what I would want it to do if I were rational and informed."
5. "Interest or well-being: the agent does what is in my interest, or what is best for me, objectively speaking."
6. "Values: the agent does what it morally ought to do, as defined by the individual or society."

All but (6) concern the relationship between H and A. It would therefore seem appropriate to describe them as types of intent alignment.

Appendix B: What is societal-AGI alignment?

Two examples from Alignment Forum posts:

Coherent Extrapolated Volition is a non-democratic version of societal alignment, where "an AI would predict what an idealized version of us would want, 'if we knew more, thought faster, were more the people we wished we were, had grown up farther together'. It would recursively iterate this prediction for humanity as a whole, and determine the desires which converge."

Law-Informed AI is a democratic version of societal alignment where AGI learns societal values from democratically developed legislation, regulation, court opinions, legal expert human feedback, and more.
