Compendium of problems with RLHF

[anonymous]

12 min read · Jan 30, 2023

I tend to agree with Katja Grace that values are not fragile in the sense imagined in the Sequences [link].

Modulo Models Don't "Get Reward".

It's actually not that expensive: I would be willing to pay a lot more than that for an aligned AI. But it gives a lower bound on the order of magnitude of the alignment fee for RLHF.

This is not a problem per se, provided your model of the human giving feedback is robust. But that model does have to be robust. Also keep in mind that even pure human feedback is likely to lead to AI takeover.

As a first approximation, I suspect we can assume that only the upper layers of the model have been refined, leaving the lower intermediate layers unmodified. In Sparrow, only the upper 16 layers were fine-tuned.

Compendium of problems with RLHF

Why RLHF counts as progress?

Why RLHF is insufficient?

Existing problems with RLHF because of (currently) non-robust ML systems

Incentives issues of the RL part of RLHF

Problems related to the HF part of RLHF

Superficial Outer Alignment

The Strawberry problem

Unknown properties under generalization

Final thoughts