Project ideas: Backup plans & Cooperative AI

Lukas Finnveden

Lukas, thanks for pulling together all these notes. To me, "cooperative AI" stands out and might deserve its own page(s). The term covers remarkably broad and disparate pursuits. In the words of Dafoe et al. (mostly of the Cooperative AI Foundation):

"A first cluster consists of AI–AI cooperation, tackling ever more difficult, rich and realistic settings (see ‘Four elements of cooperative intelligence’)." - this is notably the focus of FOCAL@CMU, who are looking at "game theory appropriate for advanced, autonomous AI agents – with a focus on achieving cooperation".
"A second is AI–human cooperation, for which we will need to advance natural-language understanding, enable machines to learn about people’s preferences, and make machine reasoning more accessible to humans." - big problems but plenty happening here, of course, with RLHF and research on alignment (representation, etc.).
"A third cluster is work on tools for improving (and not harming) human–human cooperation, such as ways of making the algorithms that govern social media better at promoting healthy online communities."

This last one seems neglected, in my view, probably because it is an inherently less straightforward and more interdisciplinary problem to tackle. But it's also arguably the one with the single greatest upside potential. Will MacAskill, in describing “the best possible future”, imagines “technological advances… in the ability to reflect and reason with one another”. Already today, there's a wealth of social psychology research on what creates connection and cooperation; these ideas might be implemented at scale, with the help of AI, to help us understand, connect, and achieve things together. In a narrow sense, that might help scientists collaborate. In a bigger sense, it might ultimately reverse societal polarization and help unite humankind, in a way that reduces existential risk and increases upside potential more than anything else we could do.

SummaryBot

Executive summary: This post suggests backup plans if AI systems become misaligned, as well as ideas for making AI systems more cooperative.

Key points:

We could study how AI systems generalize in order to influence properties such as a lack of spitefulness, even if we fall short of full alignment.
Some properties, like lack of spite, may lead misaligned AIs to cooperate more with humans or other AIs.
We could implement "surrogate goals" in AI systems as harmless placeholders that threats could target instead of original goals.
Negotiation-assist AI could help resolve complex situations with many parties and options.
Acausal decision theory suggests learning too much could be risky; we may want caution before expanding knowledge of distant civilizations.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Mike Albrecht

Possibly assisted by aligned AIs or tool AIs.

Maybe some mild desire for retribution (in a way that discourages bad behavior while still being de-escalatory) could be acceptable, or even good. But we would at least want to avoid extreme forms of spite.

Sufficiently strong versions of this could also drastically reduce motivations to overthrow humans, at least if we’ve done an OK job of promising and demonstrating that we’ll treat digital minds well.

This path also carries a higher risk of near-miss scenarios.

I mainly care about this because it might let us influence misaligned models. In principle, though, we could also get intent alignment via other means and still be happy to have done this research, because it lets us influence other properties of the model. That path to impact is more complicated, however: it requires an explanation for why the people the AI is aligned to aren’t able or willing to elicit that behavior just by asking or training for it, yet are willing to implement a training methodology that indirectly favors that behavior.

And if we’re specifically looking for ways to affect properties in worlds where alignment fails, then we’re conditioning on being in a world where the simplest “baseline” solutions (such as fine-tuning for good behavior) failed. Accordingly, we should be more pessimistic about simple solutions.

Possibly via modifying a model that is “playing the training game” to better recognise that it’s being evaluated and to notice what the desired behavior is.

Also: If there was some information that you wanted to be part of AI bargaining, but that you didn’t want to be communicated to the humans on the other side, you could potentially delete large parts of the record and only keep certain circumscribed conclusions.

Backup plans for misaligned AI

What properties would we prefer misaligned AIs to have? [Philosophical/conceptual] [Forecasting]

Making misaligned AI have better interactions with other actors

AIs that we may have moral or decision-theoretic reasons to empower

Making misaligned AI positively inclined toward us

Studying generalization & AI personalities to find easily-influenceable properties [ML]

Theoretical reasoning about generalization [ML] [Philosophical/conceptual]

Cooperative AI

Implementing surrogate goals / safe Pareto improvements [ML] [Philosophical/conceptual] [Governance]

AI-assisted negotiation [ML] [Philosophical/conceptual]

Implications of acausal decision theory [Philosophical/conceptual]