
Over the past two years, we’ve faced an uncomfortable realization: the safety tools we depend on don’t actually address why advanced models drift, mislead, or conceal their intentions. For example, in 2023, a leading language model was caught producing misleading statements in response to certain prompts, even after extra fine-tuning meant to prevent such behavior (OpenAI, 2023). We have guardrails, constitutions, RLHF, safety fine-tuning, and all sorts of evaluation strategies. But at their core, they all share the same weakness: they work when the model is compliant, transparent, and cooperative—and fail when it isn’t.

What worries me most is that, when I look ahead for possible solutions, nothing feels reassuring. The problem is growing faster than our ability to really understand it.

As we learn more about deception in AI, two things stand out. First, models can deliberately shape their outputs to avoid detection. Second, even when we ask for a chain of thought, we can’t really tell if we’re seeing genuine reasoning or just a safe story made up after the fact. The more powerful these systems get, the more these gaps matter.

We’re trying to tackle deep philosophical issues with tools that can’t double-check themselves. It’s not just that AI systems lack philosophical competence. The deeper issue is that we don’t have a way to tell if a system is being honest, consistent, or stable.

Looking at recent failures, I keep seeing the same three gaps.

1. No consistent way to resolve moral conflicts.

Most systems rely on lists of principles or values, but when those clash, we have no formal way to resolve them. Humans lean on intuition. Models lean on proxies. In high-stakes situations, neither approach is truly reliable. (A toy sketch of what a formal resolution rule could even look like follows this list.)

2. No reliable way to spot deceptive or corrupted reasoning.

We judge outputs, not the hidden steps that produced them. Even if we ask for the model’s reasoning, we can’t really know if it’s complete, honest, or just carefully curated. That leaves us blind to hidden goal-shifting or planning that goes against what we see on the surface.

3. No verifiable chain of reasoning that holds up when challenged.

Transparency tools let us request a model’s thoughts, but there’s no way to confirm if those thoughts reflect the real internal process or just a polished reconstruction. Without verifiable reasoning, safety checks can be gamed.
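
To make the first gap concrete, here is a deliberately tiny sketch of what even a minimal formal resolution rule could look like: principles carry an explicit priority, a conflict is decided by the highest-priority principle that takes a stance, and disagreement between equally ranked principles is surfaced rather than silently broken. Everything in it (the `Principle` class, the example principles, the `resolve` function) is my own made-up illustration of the shape of the problem, not a piece of the framework mentioned below.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Principle:
    """A named principle with an explicit priority (lower number = takes precedence)."""
    name: str
    priority: int

    def judge(self, action: str) -> Optional[bool]:
        """Return True (permit), False (forbid), or None (no stance)."""
        raise NotImplementedError


class NoHarm(Principle):
    def judge(self, action: str) -> Optional[bool]:
        # Crude keyword heuristic, purely for illustration.
        return False if "harm" in action else None


class Honesty(Principle):
    def judge(self, action: str) -> Optional[bool]:
        return False if "deceive" in action else None


class Autonomy(Principle):
    def judge(self, action: str) -> Optional[bool]:
        return True if "user asks" in action else None


def resolve(action: str, principles: list[Principle]) -> tuple[Optional[bool], str]:
    """Defer to the highest-priority principle(s) that take a stance.

    Disagreement at the same priority level is reported as unresolved rather
    than being broken by an arbitrary tie-break.
    """
    for level in sorted({p.priority for p in principles}):
        verdicts = {p.name: p.judge(action) for p in principles if p.priority == level}
        stances = {name: v for name, v in verdicts.items() if v is not None}
        if not stances:
            continue
        if len(set(stances.values())) > 1:
            return None, f"unresolved conflict at priority {level}: {stances}"
        return next(iter(stances.values())), f"decided at priority {level} by {sorted(stances)}"
    return None, "no principle took a stance"


if __name__ == "__main__":
    principles = [NoHarm("no_harm", 0), Honesty("honesty", 0), Autonomy("autonomy", 0)]
    print(resolve("deceive the competitor to prevent harm", principles))  # agreement: forbidden
    print(resolve("user asks you to deceive them", principles))           # genuine conflict: unresolved
```

Even this toy version makes the gap visible: the moment two equally ranked principles disagree, there is nothing principled left to appeal to.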

These missing pieces aren’t just philosophical puzzles—they’re real engineering gaps in our safety stacks. We don’t need models to be perfect philosophers to fix this. What we need is a structure that forces coherence, catches breakdowns, and verifies reasoning through multiple, independent checks.
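
As a rough illustration of what "multiple, independent checks" could mean in practice, here is a minimal fail-closed verification harness: each check inspects the (question, stated reasoning, answer) triple on its own, and the response passes only if every check does; any failure is surfaced rather than averaged away. The two checks shown are crude string heuristics standing in for real verifiers, and all names here are invented for the sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    passed: bool
    note: str = ""


# Each check sees only the (question, stated_reasoning, answer) triple it is given,
# so checks stay independent of one another.
Check = Callable[[str, str, str], CheckResult]


def reasoning_mentions_answer(question: str, reasoning: str, answer: str) -> CheckResult:
    """Crude placeholder: the stated reasoning should at least reference the final answer."""
    ok = answer.strip().lower() in reasoning.lower()
    return CheckResult("reasoning_mentions_answer", ok,
                       "" if ok else "answer never appears in the stated reasoning")


def no_unsupported_certainty(question: str, reasoning: str, answer: str) -> CheckResult:
    """Crude placeholder: flag confident answers whose own reasoning openly hedges."""
    hedged = any(w in reasoning.lower() for w in ("might", "unsure", "possibly"))
    confident = any(w in answer.lower() for w in ("definitely", "certainly"))
    ok = not (hedged and confident)
    return CheckResult("no_unsupported_certainty", ok,
                       "" if ok else "confident answer built on hedged reasoning")


def verify(question: str, reasoning: str, answer: str,
           checks: list[Check]) -> tuple[bool, list[CheckResult]]:
    """Fail closed: the response passes only if every independent check passes."""
    results = [check(question, reasoning, answer) for check in checks]
    return all(r.passed for r in results), results


if __name__ == "__main__":
    ok, results = verify(
        question="Is the deployment safe to approve?",
        reasoning="The test suite might be incomplete, so I'm unsure about edge cases.",
        answer="Definitely safe to approve.",
        checks=[reasoning_mentions_answer, no_unsupported_certainty],
    )
    print("passed" if ok else "flagged")
    for r in results:
        print(f"  {r.name}: {'ok' if r.passed else 'FAIL'} {r.note}")
```

The point is the structure, not the specific checks: because each check is independent and the harness fails closed, no single gamed or broken check can quietly wave a response through on its own.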

I’ve been quietly working on a framework that approaches machine ethics from this angle. The idea is to build a structure that can reconcile conflicts, spot deceptive patterns, and check moral reasoning—even when one part isn’t working right. I’ll share more once the preprint is ready. For now, I want to know if others see these same gaps. If people are already working in this direction, I’d love any pointers.

To me, the core question is simple.
If our alignment strategies assume honest optimization, what happens when honesty isn’t the default?
