Discovering alignment windfalls reduces AI risk

James Brady; stuhlmueller

E.g. Hendrycks (2022)

Jan Leike estimated RLHF accounted for 5–20% of the total cost of GPT-3 Instruct, factoring in things like engineering work, hiring labelers, acquiring compute, and researcher effort.

Of course, it’s far from a complete solution to alignment and might be more akin to putting lipstick on a shoggoth than deeply rewiring the model’s concepts and capabilities.

But also: Some have noted that interpretability could be harmful if our increased understanding of the internals of neural networks leads to capability gains. Mechanistic interpretability would give us a tight feedback loop with which to better direct our search for algorithmic and training setup improvements.

We expect to contribute to reducing AI risk by improving epistemics, but highlighting alignment windfalls from factored cognition is important to us as well.

Alignment taxes

Alignment windfalls

Companies as optimisers

Shaping the landscape

Companies greedily optimise within the known landscape

Shaping knowledge of the landscape

Factored cognition: an example of an alignment windfall

Factored cognition as a windfall

How we’ve been exploring factored cognition at Elicit

Conclusion

Want to help us?