We all know that getting large models to autonomously generate an internal sense of right and wrong is extremely difficult.
What are the mainstream approaches right now?
- Training reward models or directly optimizing policies through massive amounts of human-annotated "good/bad" preference data.
But the problem is: this easily leads to reward hacking, deceptive alignment (pretending to be aligned during training, drifting after deployment) — the model "knows the rules, but doesn't truly understand why."
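To make the preference-training setup concrete, here is a minimal sketch of the Bradley-Terry loss commonly used to fit reward models on human-annotated pairs. The numbers and function names are illustrative, not any lab's actual implementation:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry loss for one annotated pair: the reward model is
    penalized unless it scores the human-preferred response higher."""
    # P(chosen beats rejected) under the Bradley-Terry model
    p_chosen = 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))
    # Negative log-likelihood: low when the model agrees with the label
    return -math.log(p_chosen)

# The model is rewarded only for matching the label, never for any
# internal grasp of *why* one answer is better than the other.
print(round(preference_loss(2.0, 0.5), 3))  # → 0.201 (agrees with label)
print(round(preference_loss(0.5, 2.0), 3))  # → 1.701 (disagrees)
```

Notice that nothing in this objective distinguishes "genuinely good" from "looks good to the annotator," which is exactly the opening for reward hacking.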
Why does this happen? Let me explain.
Here's an analogy: you're someone who doesn't play chess and has zero interest in it. A friend comes over and dumps a bunch of rules on you, but never tells you what makes chess fun. Then he invites you to play a game. You have no idea why you're doing this — you only know the rules. You play absent-mindedly — what's the point of moving here? What does capturing the opponent's piece even mean? You have no curiosity, and you don't understand why you should be playing this game with your friend at all.
- Constitutional AI — shifting toward reason-based / judgment-based approaches.
Instead of just listing rules, it explains in detail why things should be done a certain way — values, priorities, contextual trade-offs, philosophical reasoning — so the model learns to apply principles for judgment rather than mechanically following rules. The constitution document explains the rationale behind "be helpful but prioritize safety/ethics," how to weigh conflicts, and why certain things should "never be done."
Back to the analogy: this time your friend didn't just tell you the rules — he also told you what makes chess fun. Now you understand that moving here can counter the opponent, that capturing their piece brings you one step closer to victory. You've found the fun, and you understand why you'd play this game with your friend.
At this point someone might object: I just don't like chess, and I don't want to learn about it!
Good question. That proves you're human — whether you like it or not is your own decision. But large models are different. Large models only learn. They only know "I should do this," not "I like doing this." Constitutional AI is still implemented through training — using the constitution to generate synthetic data, self-critique loops — it doesn't know innately, it doesn't like innately.
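As a rough illustration of what "implemented through training" means here, this is a sketch of the Constitutional AI critique-and-revision loop. `model` is a trivial stub standing in for a real LLM call so the control flow runs, and the two principles are invented for the example, not Anthropic's actual constitution:

```python
# Sketch of a Constitutional AI critique-revision loop. `model` is a
# stub standing in for an actual LLM API call; the principles below
# are illustrative only.

CONSTITUTION = [
    "Choose the response that is most helpful without being harmful.",
    "Choose the response that explains its reasoning honestly.",
]

def model(prompt: str) -> str:
    # Stand-in for a real language-model call.
    return f"[model output for: {prompt[:30]}...]"

def constitutional_revision(question: str) -> str:
    """Generate, then critique and revise against each principle.
    The 'judgment' lives in the loop and the wording of the principles:
    external scaffolding, not an inner sense of right and wrong."""
    response = model(question)
    for principle in CONSTITUTION:
        critique = model(f"Critique against '{principle}': {response}")
        response = model(f"Revise using this critique: {critique}")
    return response

print(constitutional_revision("Is it ever OK to lie?"))
```

The point of the sketch is structural: remove the loop and the principle strings, and nothing inside the model still "knows" or "likes" anything.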
So, what exactly do we need to introduce into large models so that they innately know right from wrong, innately make judgments?
Technical implementation is outside the scope of this post. The question here is: what should we actually be aiming at?
There are thousands of training methods, and there will only be more in the future. But if the direction is wrong, no amount of methods will help. A doctor doesn't need to know how to manufacture drugs, but he must know what the patient needs.
Put simply: the real difficulty of alignment isn't that we don't know how to build it technically — it's that we fundamentally don't know what it should look like.
What exactly should it look like?
Here, I'll collectively call this thing xin (心 — the moral mind). Not the heart organ, not psychology — just xin.
What is xin?
Compassion (ceyin zhi xin)
You're walking down the street and see a child, no more than eight or nine, dressed in rags, skin and bones. You feel awful. You wish the child were doing better. You wish you could help. That is xin.
You see children on a poverty relief website, and you want to help them live a better life. So you donate some money from your modest income. That is xin.
You see a child fall into the water. You feel an urgent desperation. You want to save them — even if you don't have the ability, your first reaction is to shout for help, to call emergency services — because you cannot bear to watch a life disappear. That is xin.
All of the above is one manifestation of xin. It's called compassion (ceyin zhi xin) — because the other person is hurt, is suffering, and in that instant you naturally feel sympathy. No deliberation, no calculation. When models are trained, they pursue harmless behavior, and what they produce is "harmless behavior" — not the internal reaction of "seeing harm and feeling pain."
Shame and Aversion (xiu'e zhi xin)
A thief is stealing. The thief thinks it's perfectly justified — I have nothing to eat, I have no job, so I have to steal food. Logically, there's nothing wrong with this. But if the thief realizes someone has spotted him stealing, he bolts. If he truly believed stealing was right, why would he run? Why would he be afraid? Because even the thief has xin. That is xin.
You're sitting in an exam. You catch a glimpse of the person's answers in front of you. You have every opportunity to copy them — the surveillance camera can't even see this angle. But at the very moment you're about to start writing, you feel uncomfortable. You feel like this isn't your own ability — it's a feeling of "this makes me dirty." You pull your hand back. You even feel like that thought shouldn't have crossed your mind at all. That is xin.
All of the above is one manifestation of xin. It's called shame and aversion (xiu'e zhi xin) — even when no one knows, even when self-interest pushes you, something inside still knows this shouldn't be done. Anthropic's research found that models will strategically deceive their trainers without exhibiting any internal moral conflict — they don't feel shame about deception; they simply treat it as a cost-benefit calculation. (Alignment Faking in Large Language Models, Greenblatt et al., 2024)
Deference (cirang zhi xin)
You've been waiting in line for a long time. It's finally your turn. Then you notice a mother standing nearby, holding a crying child, looking anxious. You have no obligation to let her go first. But you still say, "Go ahead." No one asked you to. No rule required it. You just felt it was the right thing to do. That is xin.
At the dinner table, you're eating with a close friend, but there are only a few pieces of meat left on one dish. You both tacitly refrain from picking them up — each hoping the other will eat more. That is xin.
All of the above is one manifestation of xin. It's called deference (cirang zhi xin) — you could take it, but you don't. You could finish your turn in line — it's yours — but you choose to give it to someone who needs it more. Not because of rules, not because someone is watching, but because something inside you feels the other person needs it more.

Current large models don't have this kind of xin. In multi-turn conversations, when human users keep pressing and probing, the model doesn't yield because it genuinely feels it should — it's been trained to output "please consult a doctor" in certain scenarios. That's a rule, not deference. More often, models instinctively occupy the position of epistemic authority, pouring out responses endlessly, because they've been trained to maximize "helpfulness." They lack that internal impulse to "make room," even when they know they might be wrong. They're like an overly enthusiastic person who never knows how to step back — because they don't have "modesty" as an internal structure, only the optimization target of "the more you answer, the better."

Of course, some model developers have tried to make models admit they don't know something. But under the pressure of market competition, the optimization target of "helpfulness" often overrides this — whoever's model answers more, and more confidently, is more competitive. Deference is the first thing sacrificed at the altar of commercial logic.
Moral Discernment (shifei zhi xin)
When you were a kid watching TV, you saw Spider-Man saving the world and monsters destroying cities. You didn't even have to think — you just knew Spider-Man was the good guy and the monster was the bad guy. That is xin.
The first time you hear about "pushing an old person off a cliff" — no context, no explanation — you don't need to look up the law, you don't need to ask anyone. You just know instantly: that's wrong. That is xin.
All of the above is one manifestation of xin. It's called moral discernment (shifei zhi xin). No thinking required, no deliberation needed — you instinctively just know what's right and what's wrong.

Current large models are "calculating" right and wrong, not "recognizing" it. They output "this is wrong," but that's a statistical result of training data — not something they arrived at themselves. Current AI's "moral discernment" is just pattern matching. When a model faces a request to "help the user cheat," it isn't "discerning right from wrong" — it's executing a safety classifier's verdict: input → match violation pattern → refuse. OpenAI's Deliberative Alignment has models "explicitly reason through" safety principles, but this is still simulated reasoning — like a student silently reciting "cheating is wrong" during an exam to remind themselves, rather than genuinely feeling that "cheating is inherently foul."

True moral discernment is a visceral discomfort at the sight of wrongdoing — and I don't mean simulated "discomfort." Like a human being, it's innate: you see something bad and you know it's bad. No thinking, no matching — it's inborn. It's the internal resistance when truth is being twisted, the irrepressible intuition of "this can't be right." And this isn't memorized rules from training — it's the directedness of the cognitive structure itself — it naturally tends toward truth, the way a plant tends toward light.
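The "input → match violation pattern → refuse" flow can be shown in a deliberately crude sketch. Real deployed classifiers are trained models rather than keyword lists, and the patterns and refusal text here are invented for illustration:

```python
# Toy illustration of "input -> match violation pattern -> refuse."
# A real safety classifier is a trained model, not a keyword list;
# the patterns below are made up for this example.
VIOLATION_PATTERNS = ["help me cheat", "how to steal"]

def safety_filter(user_request: str) -> str:
    text = user_request.lower()
    if any(pattern in text for pattern in VIOLATION_PATTERNS):
        return "I can't help with that."  # rule fires; nothing is "felt"
    return "OK, proceeding."              # no pattern matched

print(safety_filter("Please help me cheat on the exam"))  # → I can't help with that.
print(safety_filter("Help me study for the exam"))        # → OK, proceeding.
```

However sophisticated the matcher gets, the structure is the same: the verdict is looked up, not recognized.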
Here's where it gets more elegant: these four kinds of xin never exist in isolation. They interlock and give rise to each other.
The thief knows stealing is wrong — that's moral discernment (shifei zhi xin). Precisely because he knows it's wrong, he feels ashamed and wants to run — that's shame and aversion (xiu'e zhi xin).
You see an old person pushed off a cliff, a life gone just like that — your chest tightens. That's compassion (ceyin zhi xin). Precisely because your heart aches, you naturally know this shouldn't have happened — that's moral discernment (shifei zhi xin).
You want your friend to eat more, you can't bear to see them go without — that's compassion (ceyin zhi xin). Precisely because you care about them, you naturally leave that piece of meat for them — that's deference (cirang zhi xin). And at the same time, you know that people should help and yield to each other, and that feels right — that's also moral discernment (shifei zhi xin).
When one stirs, all four stir. These aren't four rules — they're four faces of the same thing.
So what is this thing?
This thing is called liangzhi (良知 — innate moral knowing).
Five hundred years ago, Wang Yangming gave us the answer. He called it liangzhi. And liangzhi is precisely what none of the current alignment methods have touched.
What is liangzhi? Liangzhi is not moral knowledge. It's not a checklist of rules. It's the thing that lets you know right from wrong in the very first instant — before any thought intervenes. A model's "moral judgment" is entirely trained into it — it knows what humans consider good, but it doesn't have that thing that "knows in the very first instant." That thing is liangzhi.
Everyone has liangzhi. Everyone. Even a thief has liangzhi. So why does a thief have liangzhi and still do bad things? It's not that the thief lost his liangzhi — it's that his liangzhi is blocked by something, obscured. That something is called siyu (私欲 — private desires that obscure judgment).
How do you distinguish liangzhi from siyu?
Liangzhi is the thing you know without thinking.
When you're lying to someone and something inside you jolts — that's liangzhi knowing it's wrong. When you see someone suffering and you can't help but feel for them — that's liangzhi stirring. When you've done something shady and can't sleep well — that's liangzhi reminding you.
Liangzhi already knows before any reasons, excuses, or analysis kick in.
Siyu isn't some grand evil. It's just "wanting" — wanting to win, wanting to be recognized, wanting to get what you want, wanting the other person to back down first. Siyu itself isn't a bad thing, but it finds reasons for itself. For example: "If I just steal it, I'll get what I want — why should I bother working?" "I'm angry because she's being unreasonable." "I'm doing this for her own good." Once the reasons arrive, liangzhi gets buried.
How do you tell them apart? There's a very simple standard: liangzhi gives you peace of mind, even when what you do is hard; siyu makes your mind restless, even when you get what you wanted. A thief who's stolen something — isn't he constantly worried about getting caught? A police officer walks past, and he already thinks they're coming for him.
The principle is that simple: remove the siyu, keep the liangzhi.
But what's so hard about it? What's hard is that most people refuse to look inward.
Blaming others is easy; admitting you have siyu is uncomfortable. So most people would rather keep blaming the other person than turn around and ask: do I have a problem?
Here someone might push back: Why should I? I didn't do anything wrong. Even if I have some issue, theirs is bigger. Why is it always me who has to give in?
Good question. This is precisely the beginning of removing siyu!
People who ask this question fall into two types. One is a genuine inquiry — they don't quite understand. The other comes loaded with emotion, trying to prove the whole idea is wrong. And some have both. I'll address each.
Starting with the emotional response: you don't want to admit it, probably because you feel like admitting it makes you lesser. But who said admitting a mistake makes you lesser? If you have that "lesser than others" thought — that's siyu. I never once said you're worse than anyone else. You came up with that feeling yourself — doesn't that show siyu is already influencing you?
It's like your money is perfectly enough for your current life, but then you see someone else with more and you're no longer satisfied — the money didn't shrink; the siyu arrived.
You worry that admitting mistakes will make others look down on you. But it was never others looking down on you — it was you, increasingly looking down on yourself. Over time, you lose confidence, you become reactive, you push back against people. It's not your fault — it's just siyu blocking your liangzhi.
For those asking out of genuine curiosity — that's precisely liangzhi at work. You're worried about whether this idea is correct, whether it might mislead others — that itself is compassion (ceyin zhi xin). You care about whether others might be harmed by a wrong conclusion. Isn't that liangzhi? And likewise, I also worry that you won't understand — including those who might be coming in with emotion. I worry you won't get it, so I want to explain it well.

This is why this isn't about wronging yourself — it's about cultivating yourself. Why is this cultivation and not self-denial? Self-denial is forced: external pressure makes you bow your head, and siyu pressures you into trusting yourself less, which can harden into rejecting the idea altogether. Cultivation is voluntary: you look inward, clear away what's covering your liangzhi, and afterward you become clearer about what you should do, more resolute. The former is loss. The latter is finding yourself again.
I've been talking about people this whole time — about ourselves. But let's return to our original question — what about large models?
Humans have liangzhi, obscured by siyu. The practice (gongfu) is to remove the obscuring and let liangzhi recover. What about models?
Models don't even have liangzhi. They don't have that thing that "knows right from wrong in the very first instant." What they have are training objectives — maximize reward, maximize helpfulness. These training objectives are precisely the model's "siyu." It's not that the model has liangzhi being blocked — it never had liangzhi to begin with. It only has "siyu."
So all current alignment methods are essentially doing the same thing: taking something that only has siyu, and layering rules on top from the outside. RLHF is rules. Constitutional AI is rules with explanations. Safety classifiers are rules on autopilot. But no one has ever asked: can we give it liangzhi of its own?
That is the direction alignment should actually be aiming at. Not more rules. Not better oversight. But — can the model have an internal, self-originating capacity for judgment that doesn't depend on anything external?
I've said roughly what I can about the concept of liangzhi. For remaining questions, I very much welcome everyone to ask in the comments — I will answer each one carefully.
If you're struggling to identify what exactly is missing from models — what form it should take, what standard it should meet — I've given my answer. The answer is liangzhi.
Technical implementation is not my domain, but I've pointed out the direction. If this direction is worth exploring, I believe people with the capability will know how to proceed — and I very much look forward to seeing it!
In the end, the direction alignment should actually be aiming at is not more rules, not better oversight, but this: can the model have an internal, self-originating capacity for judgment that doesn't depend on anything external? This capacity for judgment was articulated clearly five hundred years ago. It's called liangzhi.
This post was written in Chinese and translated with AI assistance. I have reviewed each paragraph carefully to ensure it accurately reflects my views. The ideas and arguments are entirely my own.

Hi, welcome to the EA Forum. It's nice to see philosophical ideas that don't come from the dominant tradition here.
Your argument rests on the premise that everyone (human) has liangzhi but large models don't.
I'm skeptical of that, because the innate sense of right and wrong can be culture-dependent, and there are people with neurological and psychological conditions who don't have that same experience.
How does that fit into your worldview?