# Compendium of problems with RLHF

by [anonymous]
Jan 30 2023

Epistemic status: This post is a distillation of many comments and posts. I do not believe that my list of problems is the best organization of the sub-problems. I would like to make it shorter and simpler, because cool theories are generally simple, unified theories. Ideally I would identify only 2 or 3 main problems without lumping together problems that have different gears-level mechanisms, but I am currently too confused to do so. Note that this post is not intended to address the potential negative impact of RLHF research on the world, but rather to identify the key technical gaps that need to be addressed for an effective alignment solution. Many thanks to Walter Laurito, Fabien Roger, Ben Hayum, and Justis Mills for useful feedback.

RLHF tldr: We need a reward function, we cannot hand-write it, let’s make the AI learn it!
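To make the tldr concrete, here is a minimal sketch of learning a reward function from pairwise comparisons, in the spirit of the Bradley-Terry style preference model used in Deep Reinforcement Learning From Human Preferences (Christiano et al, 2017). Everything specific is an assumption for illustration: the linear reward model, the two-dimensional features, and the hidden "human" preference used to label the toy pairs are all hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(w, x):
    """Linear reward model r(x) = w . x (a stand-in for a neural network)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(pairs, dim, lr=0.1, epochs=50):
    """Fit w by gradient descent on -log sigmoid(r(preferred) - r(rejected))."""
    w = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in pairs:
            p = sigmoid(reward(w, preferred) - reward(w, rejected))
            # Gradient of -log p with respect to w is -(1 - p) * (preferred - rejected).
            for i in range(dim):
                w[i] += lr * (1.0 - p) * (preferred[i] - rejected[i])
    return w

def hidden_preference(x):
    """Hypothetical stand-in for the human rater: likes feature 0, dislikes feature 1."""
    return 2.0 * x[0] - 1.0 * x[1]

rng = random.Random(0)
pairs = []
for _ in range(200):
    a = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    b = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    # The "human" only says which of the two options is better.
    pairs.append((a, b) if hidden_preference(a) > hidden_preference(b) else (b, a))

learned_w = train_reward_model(pairs, dim=2)
agreement = sum(reward(learned_w, good) > reward(learned_w, bad)
                for good, bad in pairs) / len(pairs)
```

The learned weights are never told the hidden scoring rule; they only see comparisons. An RL policy would then be optimized against `reward(learned_w, ...)` rather than against the human directly, which is what several of the problems below are about.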

Problem 0: RLHF is confusing. Human judgment and feedback are so brittle that even junior alignment researchers like me thought that RLHF was a not-too-bad solution to the outer alignment problem. I think RLHF confuses a lot of people and distracts them from the core issues. Here is my attempt to become less confused.

## Why does RLHF count as progress?

• Without RLHF, approximating the reward function for a “good” backflip was tedious and almost impossible. With RLHF, it is possible to obtain a reward function that, when optimized by an RL policy, leads to beautiful backflips.
• Value is complex? Value is fragile? But being polite is complex, being polite is fragile, and we can implement this roughly in ChatGPT[1].
• “RLHF withstands more optimization pressure than supervised fine-tuning: By using RLHF you can obtain reward functions, that withstand more optimization pressure, than traditionally hand-crafted reward functions. Insofar as that's a key metric we care about, it counts as progress.” [Richard’s comment]

## Why is RLHF insufficient?

Buck highlights two main types of problems with using RLHF to create an AGI: oversight issues and the potential for catastrophic outcomes. This partition of the problem space comes from the post Low-Stakes alignment. Although it may be feasible to categorize all problems under this partition, I believe that this categorization is not granular enough and lumps different types of problems together.

Davidad suggests dividing the use of RLHF into two categories: "Vanilla-RLHF-Plan" and "Broadly-Construed-RLHF" as part of the alignment plan. The "Vanilla-RLHF-plan" refers to the narrow use of RLHF (say as used in ChatGPT) and is not sufficient for alignment, while "Broadly-Construed-RLHF" refers to the more general use of RLHF as a building block and is potentially useful for alignment. The list below outlines the problems with "Vanilla-RLHF" that "Broadly-Construed-RLHF" should aim to address.

### Existing problems with RLHF because of (currently) non-robust ML systems

Those issues seem to be mostly a result of poor capabilities, and I think those problems may disappear as models grow larger.

1. Benign Failures: ChatGPT can fail. And it is not clear if this problem will disappear as capabilities scale.

2. Mode Collapse, i.e. a drastic bias toward particular completions and patterns. Mode collapse is expected when doing RL.

3. You need regularization. Your model can severely diverge from the model you would have gotten if you had gotten feedback in real time from real humans. You need to use a KL divergence penalty and choose the regularization constant. Choosing this constant feels arbitrary. Without this KL divergence penalty, you get mode collapse (see the annex here).
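As a rough illustration of point 3, a common form of the regularized objective is the expected reward minus β times the KL divergence between the fine-tuned policy and the reference (pre-RL) policy. The sketch below uses a toy distribution over four completions; the reward values, the two candidate policies, and the value of β are made-up numbers chosen only to show the trade-off.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same completions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_objective(policy, reference, rewards, beta):
    """Expected reward minus beta * KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

# Hypothetical toy setup: four possible completions.
reference = [0.25, 0.25, 0.25, 0.25]   # base model before RL
rewards = [1.0, 0.9, 0.8, 0.2]         # reward model scores
collapsed = [0.97, 0.01, 0.01, 0.01]   # almost always the top-scoring completion
diverse = [0.35, 0.32, 0.28, 0.05]     # shifted toward reward, still spread out

for beta in (0.0, 0.2):
    print(beta,
          round(regularized_objective(collapsed, reference, rewards, beta), 3),
          round(regularized_objective(diverse, reference, rewards, beta), 3))
```

With β = 0 the collapsed policy scores higher, so pure reward maximization favors mode collapse. With β = 0.2 the diverse policy wins. Nothing in the setup says which β is the right one, which is the arbitrariness described above.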

My opinion: The benign failures seem to be much more problematic than mode collapse and the need for regularization. Currently, prompt injections seem to be able to bypass most security measures. And we can't even count on respectability incentives to push OpenAI and other companies to deploy more robust solutions. For example, the public was not happy with the fact that the AI kept repeating "I am an AI developed by OpenAI", which pushed OpenAI to release the January 9 version, which is again much more hackable than the December 15 patch version (benchmark coming soon). So this is one of the problems that seems to me the most severe. But it may be possible to reduce a lot of this kind of failure with more focus on engineering and more capable systems. So I rather predict that this kind of error will disappear in the future, but not with high confidence.

### Incentives issues of the RL part of RLHF

I expect those problems to be more salient in the future, as it seems to be suggested by recent work done at Anthropic.

4. RL makes the system more goal-directed and less symmetric than the base model. For example, we can tell Paul Christiano's story of the model which was over-optimized by OpenAI. This model was trained using a positive sentiment reward system and ended up determining that wedding parties were the most positive subject. Whenever the model was given a prompt, it would generate text that described a wedding party. This breaks the symmetry: fine-tuning a large sequence model with RLHF shapes a model that steers the sequence in rewarding directions. The model has been shaped to maximize its reward by any means necessary[2], even if it means suddenly delivering an invitation to a wedding party. This is weak evidence towards the "playing the training game" scenario.

5. Instrumental convergence: Larger RLHF models seem to exhibit harmful self-preservation preferences, and *sycophancy*: insincere agreement with the user’s sensibilities. [paper, short summary]

6. Incentivize deception: “RLHF/IDA/debate all incentivize promoting claims based on what the human finds most convincing and palatable, rather than on what's true. RLHF does whatever it has learned makes you hit the "approve" button, even if that means deceiving you.” [from Steiner]. See also the robotic hand in Deep Reinforcement Learning From Human Preferences (Christiano et al, 2017) and comments on how this would scale.

7. RL could make thoughts opaque. See By Default, GPTs Think In Plain Sight: “GPTs’ next-token-prediction process roughly matches System 1 (aka human intuition) and is not easily accessible, but GPTs can also exhibit more complicated behavior through chains of thought, which roughly matches System 2 (aka human conscious thinking process). Humans will be able to understand how a human-level GPTs (trained to do next-token-prediction) complete complicated tasks by reading the chains of thought. GPTs trained with RLHF will bypass this supervision.”

8. Capabilities externalities: RL already increases the capabilities of base models, and RL could be applied to further increase capability by a lot, and could go much further than imitation. This has already burnt timelines, and will continue to do so.

My opinion:

• I’m not able to tell if point 4 is a big deal or not. I will be more concerned if the attractors that are created as a result of RLHF are of a bad nature. But as it is, it seems that the attractors that are created as a result of the RLHF are just emanations of repetition or diversity problems. To change my mind, you would have to show me an attractor that is bad in nature, and was still strengthened during the RL. Also, the models that result from RL shouldn’t be more directed than rejection sampling.
• 5 and 6 seem to be potential issues.
• The “Opaque thoughts” problem is more speculative, and I think this will be a problem only on advanced systems using scratchpads and long-term memory.
• 8 feels important, even if it is not a central point.
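For readers unfamiliar with the rejection sampling baseline mentioned in the first bullet, here is a minimal best-of-n sketch: draw n completions from the base policy and keep the one the reward model scores highest. The completions, the reward table, and the uniform base policy are hypothetical toy stand-ins, not anything taken from a real system.

```python
import random

def best_of_n(sample_fn, reward_fn, n, rng):
    """Rejection sampling: draw n completions, return the highest-reward one."""
    candidates = [sample_fn(rng) for _ in range(n)]
    return max(candidates, key=reward_fn)

# Hypothetical toy setup, echoing the wedding party story in point 4.
COMPLETIONS = ["a tax form", "a rainy commute", "a quiet dinner", "a wedding party"]
TOY_REWARD = {"a tax form": 0.1, "a rainy commute": 0.2,
              "a quiet dinner": 0.6, "a wedding party": 0.9}

def base_policy(rng):
    """Stand-in for sampling a completion from the base model."""
    return rng.choice(COMPLETIONS)

def toy_reward(completion):
    """Stand-in for a learned positive-sentiment reward model."""
    return TOY_REWARD[completion]

picked = best_of_n(base_policy, toy_reward, n=64, rng=random.Random(0))
single = best_of_n(base_policy, toy_reward, n=1, rng=random.Random(0))
```

With n = 1 this is just the base policy. As n grows, the output concentrates on whatever the reward model likes most, so in this toy setup large n almost always returns the wedding party. That is the sense in which rejection sampling is already "directed" without any gradient update.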

9. RLHF requires a lot of human feedback, and still exhibits failures. To align ChatGPT, I estimated that creating the dataset cost about 1M dollars, which was roughly the same price as the training of GPT3[3]. [details] And 1M is still not sufficient to solve the problem of benign failures currently, and much more effort could be needed to solve those problems completely. As systems become more advanced, they may require more complex data, and the cost of obtaining the necessary data may become prohibitive. Additionally, as we push compute scale beyond the level of human capabilities, we may run into limits on the number of qualified annotators available.

10. “Human operators are fallible, breakable, and manipulable. Human raters make systematic errors - regular, compactly describable, predictable errors.“ [LOL20]. And this remains true even if you recruit humans on Bountied Rationality and pay them a reasonable \$50 an hour, as OpenAI did… Several categories of people have been recruited by OpenAI, and Bountied Rationality seems to be the upper tier, according to Time: “To get those labels, OpenAI sent tens of thousands of snippets of text to an outsourcing firm in Kenya, beginning in November 2021. Much of that text appeared to have been pulled from the darkest recesses of the internet. Some of it described situations in graphic detail like child sexual abuse, bestiality, murder, suicide, torture, self harm, and incest.” Aligning an AI with RLHF requires going through unaligned territory, and I can’t imagine what the unaligned territory will look like in the future. This will probably cause psychological problems for annotators.

11. You are using a proxy, not human feedback directly: The reward model is a proxy trained on human feedback that represents what humans want; you then use this model to give reward to a policy[4]. That's less reliable than having an actual human give the feedback directly to the model.

12. How to scale human oversight? Most of the challenge of RLHF in practice is in getting humans to reward the right thing, and in doing so at a sufficient scale. [more here]

My opinion:

• A solution to point 12 would be Scalable Oversight, Model assistance. See the critiques paper as a proof of concept. Counterpoint: there is an alignment tax, because Agent AIs will be more intelligent than Tool AIs, in addition to their greater economic competitiveness.
• Point 9 is not currently a significant problem because the cost of 1M for OpenAI is relatively low.
• Using RL(AI)F may offer a solution to all the points in this section: By starting with a set of established principles, AI can generate and revise a large number of prompts, selecting the best answers through a chain-of-thought process that adheres to these principles. Then, a reward model can be trained and the process can continue as in RLHF. This approach is potentially better than RLHF as it does not require human feedback.

### Superficial Outer Alignment

13. Superficially aligned agents: Bad agents are still there underneath. The capability is still there[5].

14. The system is not aligned at all at the beginning of the training and has to be pushed in dangerous directions to train it. For more powerful systems, the beginning of the training could be the most dangerous period.

15. RLHF is not a specification, only a process. Even if you don't buy the outer/inner decomposition, RLHF is a more limited way of specifying intended behavior and cognition than what we would ideally want. (Your ability to specify what you want depends on the behavior you can coax out of the model.) RLHF is just a fancy word for preference learning, leaving almost the whole process of what reward the AI actually gets undefined. [source]. Comment: No one has been able to obtain an exhaustive specification of human values, but maybe a constitution is sufficient? (We must then assume that the constitution represents our values sufficiently well.)

16. If you have the weights of the model, it is possible to misalign it by fine-tuning it again in another direction, i.e. fine-tuning it so that it mimics Hitler or something else. This seems very inexpensive to do in terms of computation. If you believe in the orthogonality thesis, you could fine-tune your model toward any goal.

My opinion:

• Points 13, 15, and 16 appear to be low-priority problems, but they may indicate new paradigmatic ways of solving the issues at hand.
• 16 seems a bit unrealistic and is not a central point.

The following points in this post are not specifically points against RLHF: today, no technique applied to language models solves the strawberry problem or the generalization problem. But it is useful to recall these problems.

### The Strawberry problem

RLHF does not a priori solve the strawberry problem. Nate Soares sums up the problem as:

17. Pointer problem: “Directing a capable AGI towards an objective of your choosing.”

18. Corrigibility: “Ensuring that the AGI is low-impact, conservative, shutdownable, and otherwise corrigible.”

“These two problems appear in the strawberry problem, which Eliezer's been pointing at for quite some time: the problem of getting an AI to place two identical (down to the cellular but not molecular level) strawberries on a plate, and then do nothing else. The demand of cellular-level copying forces the AI to be capable; the fact that we can get it to duplicate a strawberry instead of doing some other thing demonstrates our ability to direct it; the fact that it does nothing else indicates that it's corrigible (or really well aligned to a delicate human intuitive notion of inaction).” [from the Sharp left turn post]

My opinion: At the end of the day, I’m not that worried about corrigibility at the moment. So far, we've got no evidence of ChatGPT failing to be corrigible and little evidence of it having a wrong pointer. The issues I’m raising are not specific to RLHF; they apply to any technique that has not been shown to be safe against them (i.e. all techniques presented to this day). Even if Anthropic tries to demonstrate this in the model-written-evals (see fig. 1), it seems to me that this does not really disprove corrigibility. I think it would be pretty easy to implement corrigibility because "To allow oneself to be turned off" is not such a complicated task, and humans should be able to evaluate this and implement it with RLHF. If we had a robot with the same cognitive performance as ChatGPT, it would be easy to fine-tune it to be corrigible. So we still don’t have much evidence.

### Unknown properties under generalization

19. Distributional leap: RLHF requires some negative feedback in order to train the model. For large-scale tasks, this could maybe require killing someone to continue the gradient descent? We could avoid this with other techniques like Training Language Models with Language Feedback, or maybe we could use a form of curriculum learning, starting the training with benign actions and finishing on actions with potentially large consequences, but it seems that “killing someone” is what would happen for large-scale vanilla-RLHF training runs.

• Unknown properties under generalization: Even at the limit of the amount of data & variety you can provide via RLHF, when the learned policy generalizes perfectly to all new situations you can throw at it, the result will still almost certainly be misaligned because there are still near infinite such policies, and they each behave differently on the infinite remaining types of situation you didn't manage to train it on yet. Because the particular policy is just one of many, it is unlikely to be correct. [from Maxwell Clarke]
• Large distributional shift to dangerous domains: RLHF does not a priori generalize the optimization-for-alignment you did in safe conditions across a big distributional shift to dangerous conditions. [LOL10]
• Sim to real is hard: RLHF won’t enable you to perform a pivotal act [LOL11]. Under the current paradigm, you would need the model to execute multiple pivotal acts, then assess each one. You don’t go from “Do you prefer the backflip on the right or on the left?” to “Do you prefer the pivotal act on the right or on the left?”.
• High intelligence is a large shift. Once the model becomes very intelligent and agentic because of RLHF, this is akin to a large shift of distribution. [LOL12]
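The first bullet above (many policies fit the same training data) can be illustrated with a deliberately tiny toy. The two "policies" below are hypothetical functions on integers, not models of anything real: they agree on every training situation and still disagree as soon as you step outside the training set.

```python
def policy_a(x):
    """The behavior we intended: return the input unchanged."""
    return x

def policy_b(x):
    """Identical on the training set, because the extra term vanishes at 0, 1, 2 and 3."""
    return x + x * (x - 1) * (x - 2) * (x - 3)

train_situations = [0, 1, 2, 3]
agree_on_train = all(policy_a(x) == policy_b(x) for x in train_situations)

new_situation = 4
print(agree_on_train, policy_a(new_situation), policy_b(new_situation))
```

No amount of feedback on the four training situations distinguishes the two functions, and the same construction gives a different policy for every multiple of the extra term. Adding more training points only moves the place where they can disagree.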

20. Sharp left turn: How do we scale safely if model performance goes up? Capabilities generalize further than alignment. Some values collapse under self-reflection. The alignment problem now requires that we look into the details of generalization. This is where all the interesting stuff is. According to Nate Soares, this problem is upstream of the problem of “Directing a capable AGI towards an objective of your choosing and ensuring that the AGI is low-impact, conservative, shutdownable, and otherwise corrigible.” So, even if 17 and 18 seem to be solved, 20 can still arise.

My opinion:

• 19. Distribution shift seems to be a problem during testing and seems to be plausible.
• 20. The sharp left turn seems to be a problem during training. It's uncertain whether it would occur during pre-training or fine-tuning. I have the intuition that the sharp left turn is also plausible, but this is much more sketchy.

## Final thoughts

To make a long story short, vanilla-RLHF does not solve any of the problems listed in the list of lethalities.

• What would happen if we used RLHF on an unaligned AGI and not ChatGPT?
• Which of the problems listed here would happen first?
• How can we reduce the list of problems to a few sub-problems?
1. ^

I tend to agree with Katja Grace that values are not fragile in the sense imagined in the sequences [link].

2. ^
3. ^

It's actually not that expensive: I'm willing to buy an aligned AI for a lot more than that. But it gives a lower bound on the order of magnitude of the alignment fee for RLHF.

4. ^

This does not seem to be a problem per se if your model of the human giving feedback is robust. But your model has to be robust. Also keep in mind that even pure human feedback is likely to lead to AI takeover.

5. ^

As a first approximation, I suspect we can consider that only the upper layers of the model have been refined, the lower intermediate layers having not been modified. In Sparrow, only the upper 16 layers have been fine-tuned.