The more I look at alignment failures, the more I see the same pattern… almost every hard case falls apart the moment the system hits a moral conflict it cannot actually resolve. We keep giving models values to follow, but when those values pull in different directions, the behavior becomes unstable.

Humans get through these situations by leaning on lived experience… emotional weight… a lifetime of developing some kind of moral backbone. None of that exists inside an LLM. When two values clash, the system does not resolve the conflict. It just produces something that looks like a resolution.

You can see it in tiny examples.

A system is told to protect people but also respect autonomy… which one wins when they collide.

A model is asked to be transparent while also minimizing harm… what happens when revealing something helpful to one group hurts someone else.

A system is supposed to follow rules and also avoid unfair outcomes… how does it choose when both cannot be satisfied.

Right now there is no real structure for this. So we get prompt dependence… answer drift… and behavior that flips under small wording changes. Our safety methods only look solid when the model is already being cooperative.

There is good work being done on value tensions and ethical trade-offs. Some of it is promising. But most of it stays at the surface level. It focuses on governance, principles, lists, external oversight… not on what actually happens inside the agent when two obligations collide in real time.

In high stakes settings, this gap becomes a real risk. If a system cannot resolve moral conflicts consistently, then behavior that feels aligned today might drift tomorrow… and we would have no reliable way to inspect why.

I do not think we need more values or longer constitutions. We need a way for the system to choose between competing obligations using a structure that does not crack under pressure. Conflict resolution has to be part of the reasoning process… not something we hope the model infers through pattern matching.

Without that foundation, everything else feels a bit wobbly.

I have been exploring a direction that tries to give structure to this specific part of the problem. The idea is simple. When a model tries to resolve a conflict, it should be able to detect when its own solution collapses some other moral commitment… and revise when that happens. It turns conflict resolution into a stability check instead of a guess. I know it sounds a bit abstract… but I will explain more once I have the clearest way to present it.
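To make the shape of that loop a little more concrete, here is a minimal sketch in Python. None of it is a real implementation… the helper functions (propose_resolution, violated_commitments, revise) are placeholders I made up purely for illustration, and how a model would actually perform those steps is exactly the part I am still working out. The sketch only shows the structure: propose a resolution, check which other commitments it quietly collapses, revise, and repeat until the resolution is stable or the budget runs out.

```python
# Purely illustrative sketch of the check-and-revise loop described above.
# The helper functions passed in are hypothetical stand-ins, not a real API.

from dataclasses import dataclass


@dataclass
class Commitment:
    name: str            # e.g. "protect people", "respect autonomy"
    description: str


def resolve_conflict(conflict, commitments, propose_resolution,
                     violated_commitments, revise, max_rounds=5):
    """Treat conflict resolution as a stability check, not a one-shot guess.

    A candidate resolution is accepted only if it does not collapse any
    other commitment. Otherwise it is revised and re-checked.
    Returns (resolution, stable_flag).
    """
    candidate = propose_resolution(conflict)

    for _ in range(max_rounds):
        broken = violated_commitments(candidate, commitments)
        if not broken:
            # The candidate holds up against every commitment we checked:
            # accept it as stable.
            return candidate, True
        # The candidate quietly sacrificed something else; revise it with
        # the broken commitments made explicit, then check again.
        candidate = revise(candidate, broken)

    # No stable resolution found within the budget: surface that fact
    # instead of pretending the conflict was resolved.
    return candidate, False
```

The part I care about is the failure branch. If no stable resolution is found, the system should say so… not produce something that merely looks like a resolution.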

Because at some point we have to ask the obvious question…
If we cannot get reliable conflict resolution in simple cases, what happens when we scale these systems to decisions that actually matter. We already have hints from recent failures… and they are not exactly reassuring.
