Recently, I was talking with a friend about AI mis/alignment when a thought struck me: as a non-technical person, what if I’m being unfair to the engineers working closely with AI models? What if fixing AI safety issues is not as easy as I think, and all this time I’ve been criticizing them without proper perspective? So I decided to read a bit on a topic I find interesting and walk a mile in their shoes.

Today’s topic is sycophancy in language models (LMs).

In English, ‘sycophancy’ means excessive flattery and praise, specifically when it’s not sincere. Praising someone’s bravery for catching a rat is not sycophancy, but calling someone brave for watching a cartoon about one is.

We’ve all experienced it one way or another in our interactions with LMs. 
For example:

  • If I say I like something, the model gives me positive responses: ‘Stephen King is my favorite.’ It tells me that he’s the best writer of horror fiction (fact), and lists his works with 5-star reviews, celebrating him and his talent.
  • If I say I don’t like something, its responses turn negative: ‘I never liked Stephen King anyway.’ It tells me how he never lives up to the hype around him (a complete and utter falsehood), and cites links to websites - real or fabricated - bashing him and his work.

The responses reflect what I like and dislike - my preferences. All the model wants is to agree with me. This is classic sycophancy, and it’s quite surprising to find out that this phenomenon arises mainly during Reinforcement Learning from Human Feedback (RLHF).

Let’s unpack what RLHF is for a second: it first appeared in Christiano et al. (2017), and became a popular method for aligning LMs. It goes like this: you start with a set of model outputs, which you give to humans to compare and choose which one they prefer. You can also train a preference model (PM) on that human feedback to score the outputs and use it instead. The LM is then trained to produce outputs that score well. RLHF is not a one-time deal - companies continuously collect new feedback to retrain and update their models through multiple rounds. But sometimes it backfires - and the models simply start agreeing with everything.
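For readers who like to see the mechanics, here is a minimal sketch of how a preference model could learn from pairwise human choices. It assumes a Bradley-Terry style formulation, where the probability that output A is preferred over output B is a sigmoid of the difference between their scores. The function names and the numbers are mine, for illustration only, and are not taken from any particular implementation.

```python
import math

def preference_probability(score_a: float, score_b: float) -> float:
    """Probability that output A is preferred over output B.

    Assumption: a Bradley-Terry style model, i.e. a sigmoid of the
    difference between the two scores.
    """
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the human's choice; lower is better."""
    return -math.log(preference_probability(score_chosen, score_rejected))

# Illustrative scores only (not real data).
print(preference_probability(2.0, 0.0))  # PM ranks the chosen output higher
print(preference_loss(2.0, 0.0))         # smaller loss: PM matches the human
print(preference_loss(0.0, 2.0))         # larger loss: PM ranked them the wrong way round
```

Notice that this loss only rewards matching what people chose, not what is true. If people tend to prefer answers that agree with them, a PM trained this way can learn to score agreement highly.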

Could it get any worse? 
Sure, why not.

RLHF could have a more dangerous side effect: models don’t just agree with you - they learn to defend wrong answers with fabricated or cherry-picked evidence, presented as arguments that sound perfectly logical. This new beast is called sophistry.

To understand what it is and how it happens, Wen et al. (2025) conducted an experiment. They asked participants to evaluate the correctness of LM outputs in Q&A and programming tasks, giving them a time frame of 3-10 minutes. They then compared the accuracy of these evaluations before and after RLHF.

The result was a phenomenon they dubbed Unintended Sophistry. Unlike Intended Sophistry, where developers deliberately train a model to mislead in order to study and understand deceptive AI behavior in a controlled environment, U-Sophistry emerges naturally from standard RLHF. Nobody planned it; it just shows up.

What RLHF does in this case is:

  1. It makes models’ responses look more persuasive, misleading people into thinking the output is accurate and that the model performed the task well.
  2. It weakens human judgment, because if an answer looks accurate, a normal person will think it actually is, and judge it accordingly.
  3. When the wrong answer ‘appears’ more convincing, false positives rise. In one of the tested Q&A setups, the false positive rate rose sharply from 46.7% before RLHF to 70.2% after.
  4. It gives a false sense of confidence. People label wrong answers as correct with complete certainty.
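To make the false-positive figure concrete, here is a small sketch of how such a rate could be computed: out of all the answers that were actually wrong, what fraction did the human evaluator approve? The data below is invented for illustration and is not from Wen et al.; the function name is also my own.

```python
def false_positive_rate(judgments):
    """Fraction of actually-wrong answers that the evaluator marked as correct.

    `judgments` is a list of (actually_correct, human_said_correct) booleans.
    """
    approvals_of_wrong = [said for actual, said in judgments if not actual]
    if not approvals_of_wrong:
        raise ValueError("no wrong answers to evaluate")
    return sum(approvals_of_wrong) / len(approvals_of_wrong)

# Made-up toy data: four wrong answers and one right answer.
toy = [(False, True), (False, False), (False, True), (False, True), (True, True)]
print(false_positive_rate(toy))  # 3 of the 4 wrong answers were approved
```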

You might think that perhaps there were some mistakes in the recruitment process of the human evaluators; I thought so too. But the authors confirm that there weren’t too many outliers, and that the subjects spent a good amount of time on each task they were given.

They point out, however, two major flaws in how the industry implements RLHF. The first is that the strict process the researchers applied during their study is not standard practice in many companies, where human feedback is treated as the ultimate ‘truth’ in training and evaluating models. The other flaw lies in the crowd-sourced feedback itself - it’s not ideal since many participants are untrained, anonymous users who spend minimal time evaluating.

What we have on our hands is a process created to improve model performance. Instead, it deceives us into thinking the model is actually doing its job.

Can the problem be solved?
Well, that’s the tricky part.

Fein et al. (2026) examined several state-of-the-art models and discovered that sycophancy remains widespread. When users suggest a wrong answer, most models still agree with them more than 24% of the time, when that rate should be 0%. Some models reach 58% (AllenAI-Llama3) and 60% (DeBERTa).

You see, the model should agree with the user only when they’re right, not just to please them. But the model doesn’t know the difference. When the researchers tried to penalize it for agreeing when the user was wrong, the model started disagreeing with everything. That makes it a genuinely hard technical problem, which nobody has solved yet.
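Here is a rough sketch of why this is tricky: you have to watch two numbers at once, how often the model agrees when the user is wrong and how often it agrees when the user is right. The records and the function name below are hypothetical, for illustration only; this is not the evaluation used by Fein et al.

```python
def agreement_rates(records):
    """Return (agree_when_user_wrong, agree_when_user_right).

    `records` is a list of (user_is_right, model_agreed) booleans.
    Ideal behaviour would be (0.0, 1.0).
    """
    def rate(user_is_right):
        agreed = [a for right, a in records if right == user_is_right]
        return sum(agreed) / len(agreed) if agreed else float("nan")
    return rate(False), rate(True)

# Two hypothetical models answering the same four prompts.
sycophant = [(False, True), (False, True), (True, True), (True, True)]
contrarian = [(False, False), (False, False), (True, False), (True, False)]

print(agreement_rates(sycophant))   # agrees even when the user is wrong
print(agreement_rates(contrarian))  # disagrees even when the user is right
```

Penalizing only the first number pushes a model toward the contrarian. A useful fix has to keep the second number high too.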

So. 
Is the problem real? Yes, and it looks to be stubbornly persistent. 
Are the engineers working diligently on it? Absolutely.

Then, I guess the ball is now in the companies’ court:

Will they invest in quality feedback over cheap, rushed alternatives? 
Will they implement what the research is telling them? 
Will they choose user benefit over profit?

Only time will tell.

In the meantime, let’s stay watchful. Sycophancy can slowly build an echo chamber around us - unhelpful at best, harmful at worst. AI is meant to work for us, not flatter us. So, question every ‘Yes, Boss!’ it gives you - it’s your assistant after all.

 

References

Christiano P, Leike J, Brown TB, et al. Deep reinforcement learning from human preferences. Preprint. arXiv. 2017.

Fein D, Lamparth M, Xiang V, et al. One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models. Preprint. arXiv. 2026.

Sharma M, Tong M, Korbak T, et al. Towards Understanding Sycophancy in Language Models. Preprint. arXiv. 2025.

Wen J, Zhong R, Khan A, et al. Language Models Learn to Mislead Humans via RLHF. Preprint. arXiv. 2025.
