Deception as the optimal: mesa-optimizers and inner alignment

The setup of the problem

Deception is the optimal strategy for the model to achieve its goal under the following conditions:

Why is deceptive alignment so worrisome?

So, we have a deceptively aligned mesa-optimizer. What happens now?