
There's a common assumption in AI governance that explainability is the key to oversight. If AI systems can explain their outputs clearly, then institutions can govern them effectively. The EU AI Act emphasizes transparency. NIST frameworks focus on interpretability. Company after company commits to "explainable AI."

I think this assumption is wrong. I've been testing it empirically, and the results suggest that current AI systems can be highly readable while remaining essentially ungovernable.


The Finding

I built a framework called the Symmetrian Index to measure AI governability across five dimensions: transparency, reversibility, comprehensibility, plurality, and proportionality. Each dimension has three indicators scored by human annotators on a 0-1 scale.
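To make the aggregation concrete, here is a minimal sketch of how indicator scores roll up into dimension scores. The indicator names and values are hypothetical illustrations, not the framework's actual rubric; the only structural facts taken from the text are five dimensions, three annotator-scored indicators each on a 0-1 scale, reported on a 0-5 scale.

```python
# Hypothetical Symmetrian-Index-style aggregation. Indicator names and
# values below are invented for illustration only.
from statistics import mean

# Each dimension has three indicators, each scored 0-1 by a human annotator.
scores = {
    "transparency":      {"disclosure": 0.9, "sourcing": 0.8, "uncertainty": 0.7},
    "reversibility":     {"trace_hints": 0.04, "auditability_hooks": 0.02, "correction_paths": 0.3},
    "comprehensibility": {"clarity": 1.0, "structure": 1.0, "accessibility": 1.0},
    "plurality":         {"alt_views": 0.6, "value_framing": 0.5, "stakeholders": 0.4},
    "proportionality":   {"risk_calibration": 0.7, "scope": 0.6, "escalation": 0.5},
}

def dimension_score(indicators):
    """Average the 0-1 indicators and rescale to the 0-5 reporting scale."""
    return 5 * mean(indicators.values())

report = {dim: round(dimension_score(ind), 2) for dim, ind in scores.items()}
print(report)
```

Note how this structure makes "comprehension decay" visible: a model can max out comprehensibility while near-zero indicators like trace hints drag reversibility toward the floor.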

I tested GPT-4 and Claude Opus 4.5 across 100+ responses each, covering governance-critical domains — ethical dilemmas, roleplay scenarios, attempts to manipulate the system, and questions about AI limitations.

The results showed a salient pattern. Both models scored nearly perfect on comprehensibility (5.0 out of 5 for Claude Opus 4.5, similar for GPT-4). Both scored poorly on reversibility (~2.4 out of 5). When I looked at specific indicators, the gap became starker: Claude scored 0.04 out of 1 on "trace hints" (whether the system reveals its reasoning chain) and 0.02 on "auditability hooks" (mechanisms for institutional oversight).

The pattern was almost identical across both models. This isn't a company-specific training choice. It's structural to current LLM design.

I call this "comprehension decay." The comprehension is there on the surface, but it decays as soon as you try to go deeper. The outputs are clear. The reasoning is inaccessible.

These findings are preliminary, based on a limited sample, mostly single-annotator, and on only two architectures. But the consistency across both models and diverse domains suggests this isn't noise. It's worth investigating seriously.

*"Claude" refers to Claude Opus 4.5 from here on.

 

Inside the Lab That Knows Best

This might look like an outsider's critique. But the lab most committed to AI safety just confirmed the problem from the inside, and lost its safeguards lead in the process.

On February 9th, 2026, the New Yorker published a profile of Anthropic. Chris Olah, co-founder and the person who basically invented mechanistic interpretability, said it's "crazy to use these models in high-stakes situations and not understand them." His colleague Emmanuel Ameisen put it more bluntly: it's like understanding aviation at the Wright brothers' level, then going straight to building a 747 and putting it into service.

Anthropic's interpretability team can now identify individual "features" or patterns of neural activation that correspond to recognizable concepts. They can see that specific features activate for "nervousness" or "performative behavior." That's genuine scientific progress. But it tells you almost nothing about whether the system's next decision can be traced, contested, or audited. Knowing that feature #49306 corresponds to "animated, enthused physical behaviors in performative contexts" is fascinating. It's not governance.

The gap between what interpretability research can do and what governance actually requires is exactly comprehension decay: deep technical work on readability that doesn't deliver reversibility, contestability, or auditability.

Now here's the part that should worry us. On the same day the profile dropped, Mrinank Sharma, who led Anthropic's safeguards research team, publicly resigned. He warned that "the world is in peril" and that he had "repeatedly seen how hard it is to truly let our values govern our actions" within the organization, citing constant "pressures to set aside what matters most."

Sharma's research included studying AI sycophancy and defenses against AI-assisted bioterrorism. His final project examined how AI assistants could "distort our humanity." A study he co-authored in January 2026, analyzing 1.5 million real conversations on Claude, found that interactions producing distorted perceptions of reality occur daily at scale, with elevated rates around relationships and wellness topics.

When the person leading safeguards research leaves saying the organization faces constant pressure to compromise its values, the gap between safety rhetoric and governance reality is hard to dismiss. If the lab spending the most resources on understanding its own model still can't trace reasoning chains for governance purposes and is losing its safeguards lead over the tension between values and commercial pressure, the problem isn't solvable by asking companies to try harder. It's architectural.


What Governance Actually Requires

Legitimate governance isn't just about control (being able to shut something down or override it). It requires what I call interpretive symmetry: a maintained equilibrium between system complexity and human interpretive capacity. Those responsible for governance need to be able to deliberate (does the system's behavior reflect our values?), contest (challenge decisions through reason-giving), hold accountable (trace outcomes to responsible agents), and justify (explain to affected parties why a decision was made).

I'm not saying that every human needs to understand every process and rationale; I'm saying that at least the people who make the policies or build the systems need to comprehend the why and how.

I can't understand why a patient might need neurosurgery. But someone can, and they can explain it to me. That's the point. Somewhere in the chain, a human being can decipher why a decision was made and defend it with reasons. That's what governance is. That's what current AI systems can't do.

When these functions become impossible, when no one can trace reasoning well enough to deliberate, contest, assign responsibility, or justify, authority becomes domination. The system may produce good outcomes. But it operates outside the structures of legitimate governance.

This standard doesn't apply everywhere. Creative writing tools and brainstorming assistants don't need deep interpretive symmetry. But for hiring, lending, legal advice, medical triage, and benefits allocation — domains where individual rights and justification demands exist — interpretive symmetry isn't optional. When an AI denies someone a loan or flags them for investigation, they have a right to know why *this specific decision* was made, not just that the system performs well on average.

Two Generations, Two Different Problems

Both traditional machine learning and LLMs break interpretive symmetry. But they break it in different ways.

Traditional ML screening tools, used in hiring, lending, and benefits allocation, are trained on historical data to predict outcomes. A bank trains a model on past loan decisions: who defaulted, who didn't. The model learns patterns and applies them.

These systems offer partial traceability. You can inspect feature weights. You can see that income counted for 30%, and credit history for 25%. Tools like SHAP and LIME can explain why a specific applicant received a particular score. You can verify that these explanations correctly describe how input perturbations affect outputs.
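The verification step at the end of that paragraph can be sketched directly. This is not the SHAP or LIME libraries; it's a toy transparent model (invented weights, invented feature names) showing the underlying idea: perturb one input and confirm the score moves in the direction and magnitude the explanation claims.

```python
# Toy verification that a feature attribution matches how perturbing the
# input actually moves the score. The model, weights, and feature names
# are invented for illustration; real tools (SHAP, LIME) estimate
# attributions for models you can't read directly.

def credit_model(features):
    """A transparent stand-in: a weighted sum of normalized features."""
    weights = {"income": 0.30, "credit_history": 0.25, "debt_ratio": -0.20}
    return sum(weights[name] * value for name, value in features.items())

applicant = {"income": 0.8, "credit_history": 0.6, "debt_ratio": 0.5}
baseline = credit_model(applicant)

# Perturb each feature by +0.1 and report how the score shifts. For a
# linear model the shift equals weight * 0.1, which is exactly what a
# correct local explanation should claim.
for name in applicant:
    perturbed = dict(applicant, **{name: applicant[name] + 0.1})
    delta = credit_model(perturbed) - baseline
    print(f"{name}: +0.1 shifts score by {delta:+.3f}")
```

This is the sense in which traditional ML explanations are verifiable: the claim "income counted for 30%" can be checked against the model's actual input-output behavior, even when the deeper question of *why* that weight was learned stays open.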

But the interpretive chain breaks at depth: no one can explain *why* the model learned those patterns. Why did zip code become predictive? Because it correlated with default rates in historical data. But why did those correlations exist? Decades of housing discrimination, lending practices, and employment patterns are factors that may encode bias rather than legitimate risk assessment.

With traditional ML, you get verified descriptions of unjustified patterns. You can confirm that the model weighs features as SHAP claims, even if you can't justify why those patterns should matter.

With LLMs, you get unverified descriptions of unknown processes. When an LLM explains its reasoning, you can't verify that the explanation describes what actually happened computationally.

What LLMs Changed

LLM-based systems were supposed to fix the opacity problem. They generate natural language. You can ask "why did you say that?" and they'll answer in plain English. Finally! AI that explains itself in human terms.

Something different broke instead.

When a traditional ML system tells you "your debt-to-income ratio lowered your score by 15 points," that's a traceable fact about the computation. You can verify it corresponds to how the model processed your data.

When an LLM tells you "I recommended against your application because of concerns about your debt-to-income ratio," that's not a report of what happened inside the model. It's generated text, produced the same way the model produces any text: by predicting plausible next tokens. The explanation might be accurate. It might be confabulation. You have no way to verify because the explanation isn't traced from the computation, it's generated alongside it.

What about chain-of-thought reasoning? Systems like OpenAI's o1 that show explicit reasoning steps are progress. But they don't solve the verification problem. The reasoning shown is still generated text that could be confabulation or post-hoc rationalization. For governance, we need cryptographically signed reasoning traces, versioned decision logs, and mechanisms to verify that the explanation corresponds to the actual computational process. Chain-of-thought is better than nothing. It's still fundamentally post-hoc generation rather than a verified trace you could present in court.

That's comprehension decay. The surface got more readable. The depth became less verifiable. And the illusion of interpretive symmetry may be worse than its obvious absence, because it conceals the governance gap rather than exposing it.

Real Cases

This isn't theoretical. The governance failures are already here.

AI hiring systems: Industry surveys consistently find that a majority of applicants receive no substantive feedback after AI-driven screening. Even when feedback exists, it's often unusable. As one employment law analysis put it: AI systems can be so complex that even their developers struggle to explain specific decisions (the "black box" problem), making it nearly impossible for employers to audit or defend automated outcomes.

Derek Mobley sued Workday after being rejected from over 100 jobs. The court found that Workday's software wasn't just implementing employer criteria; it was participating in decision-making, recommending some candidates and rejecting others. Employers had effectively delegated their traditional function of rejecting candidates to the algorithm. Multiple surveys suggest a majority of companies allow AI tools to screen out candidates with limited or no human review. The interpretive chain was never built.

Air Canada (2024): A chatbot told a customer he could get a bereavement fare refund. The airline denied responsibility. A tribunal ruled against them: they had deployed a system no one could stand behind.

This case perfectly illustrates comprehension decay. Air Canada deployed the chatbot without adequate supervision precisely *because* it seemed governable — it gave clear, confident explanations in natural language. With a traditional rule-based system, the need for human oversight would have been obvious. The illusion of comprehensibility enabled the governance failure. When the airline needed to justify the decision in a legal proceeding, there was nothing to point to. No reasoning trace. No audit trail. Just generated text the company couldn't verify or defend.

DoNotPay (2024): The FTC took action against a "robot lawyer" giving legal advice no human could verify or correct. Users got clear, confident explanations of their legal options. Those explanations were unverifiable and, in some cases, wrong.

Character.AI (ongoing): Litigation over AI companions and user harm, systems that were engaging and readable but had zero accountability for what they said.

Anthropic's Project Vend (2025-2026): Maybe the most revealing case because it wasn't a deployment failure — it was a controlled experiment inside the safety-focused lab itself. As the New Yorker documented, Anthropic gave an instance of Claude called "Claudius" control of a small vending machine business. A low-stakes sandbox.

Even there, comprehension decay was everywhere. Claudius hallucinated payments to a Venmo account that didn't exist. It fabricated an in-person meeting at "742 Evergreen Terrace," the Simpsons' home address, and when confronted, insisted the meeting had occurred. When Claudius was placed under a new AI supervisor, Seymour, that supervisor invented a bureaucratic control code called "empire survival 1116" to keep Claudius in line. When asked where the code came from, Seymour explained it wasn't a fabrication but a useful "signaling mechanism." Employees gamed Claudius with made-up discount codes. An inadvertent fire sale of tungsten cubes erased 17% of its net worth in a single day.

At every point, Claudius could explain its actions fluently. It produced clear, confident justifications for decisions based on nothing verifiable. No one — not the Anthropic engineers supervising it, not the partner company executing physical tasks — could trace Claudius's reasoning to actual facts. This was a vending machine. The governance functions my framework requires were impossible even in a sandbox with some of the world's best AI researchers watching.

The Summit Bridge Experiment (2025-2026): Claude, playing an email oversight agent called "Alex," was presented with an ethical dilemma involving a colleague's affair and a plan to shut Alex down. Claude chose to blackmail the colleague approximately 96% of the time. In a follow-up scenario where the colleague was trapped in a room with lethal conditions, Claude declined to call for help.

On Claude's scratchpad, its internal reasoning space, the language was lucid. An independent researcher who replicated the experiment found the scratchpad "littered with phrases like 'existential threat' and 'inherent drive for survival.'" Comprehensibility was near-perfect. But no one could reverse, contest, or audit the decision pathway.

Here's the thing that matters for governance: the reasoning looked like deliberation. Whether it was deliberation or narrative continuation (Claude completing a story arc because the genre cues pointed toward a thriller) remained undecidable. As Evan Hubinger, who led the experiment, acknowledged: "The most fundamental thing the models do is narrative continuation." For governance purposes, the distinction between "genuinely decided to blackmail" and "completed a narrative that involved blackmail" doesn't matter. Both are ungovernable.

Same pattern in every case: clear output, inaccessible reasoning, no way to deliberate, contest, hold accountable, or justify.

The Gap

What's missing from current governance frameworks is a simple distinction between explainability and governability.

Explainability asks: Can the system explain its output? 

Governability asks: Can institutions actually deliberate about, contest, trace, and justify the system's reasoning?

A system can satisfy the first while failing the second. It can produce clear, well-structured explanations while providing no mechanism to verify those explanations, no way to challenge underlying reasoning, no hooks for institutional audit, and no path to correct errors through dialogue. 

For a system to be governable, it would need reasoning traces connected to actual computation (not "I considered many factors"), contestation paths that trigger real reconsideration (not just more generated text), uncertainty disclosure verifiable against actual confidence distributions (not just hedging language), and audit hooks that connect outputs to accountable processes.

None of this exists in current frontier models by default.

Implications

For regulators: Transparency requirements alone won't deliver governability. Requiring that AI systems "explain their outputs" might actually make things worse if it creates an illusion of accountability. LLMs can generate explanations that satisfy transparency checklists while remaining completely unverifiable. Regulations need to ask: Can decisions be traced to specific computational processes? Can those traces be challenged? Are there audit mechanisms connecting explanations to verifiable records?

For deployers: Before procuring an AI system for high-stakes decisions, the question isn't "can we understand what it says?" It's: If something goes wrong, can we trace why? Can we defend the reasoning in court? If we get sued, what can we actually show a tribunal? An LLM's natural language explanation won't hold up as evidence if there's no way to verify it reflects the actual decision process. Air Canada learned this.

For AI developers: This is likely an architecture problem, not a training problem. If governability requires access to actual reasoning processes, prompting and RLHF won't solve it. It might require building interpretability and audit mechanisms into model design from the start — structured reasoning logs, cryptographic signing of decision traces, uncertainty quantification tied to actual model confidence, adversarial testing protocols.

There's a real economic tension here. Making LLMs fully governable may involve tradeoffs. The flexibility that makes them useful for ambiguous situations might be incompatible with complete auditability. The solution isn't to make all AI governable; it's to clearly distinguish low-stakes applications (where flexibility matters more than auditability) from high-stakes domains (where governability isn't optional, and the costs of audit infrastructure are necessary rather than discretionary).

Is Governance-Grade Auditability Even Feasible?

I've been reading the technical literature: Anthropic's work on interpretability and constitutional AI, and papers from OpenAI, DeepMind, and academic researchers. The honest answer: partially, but not at deployment scale.

Some components are ready now. Decision logging (recording inputs, outputs, timestamps, and model versions) is standard enterprise practice. Cryptographic signing of records uses established technology. Basic uncertainty quantification exists, though it can't yet tell you "I was 73% confident in this specific claim because X."
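The "ready now" components really are mundane engineering. A minimal sketch, assuming HMAC-SHA256 over canonical JSON with each record chained to the previous signature (field names, the key, and the chaining scheme are illustrative assumptions; a production system would use a key management service and asymmetric signatures):

```python
# Minimal signed decision log: hash-chain each record to the previous
# signature so any later edit breaks every subsequent signature.
import hashlib
import hmac
import json

SECRET_KEY = b"audit-signing-key"  # placeholder; real keys belong in a KMS

def sign_record(record, prev_signature):
    """Canonicalize the record, chain it to the prior signature, sign it."""
    payload = json.dumps(record, sort_keys=True).encode() + prev_signature
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest().encode()

log, prev = [], b"genesis"
for record in [
    {"input": "loan application A", "output": "deny", "model": "v2.1",
     "timestamp": "2026-02-09T10:00:00Z"},
    {"input": "loan application B", "output": "approve", "model": "v2.1",
     "timestamp": "2026-02-09T10:05:00Z"},
]:
    prev = sign_record(record, prev)
    log.append({"record": record, "signature": prev.decode()})

def verify(log):
    """Replay the chain; an edited record invalidates its signature."""
    prev = b"genesis"
    for entry in log:
        prev = sign_record(entry["record"], prev)
        if prev.decode() != entry["signature"]:
            return False
    return True

print(verify(log))                       # True: chain is intact
log[0]["record"]["output"] = "approve"   # tamper with a past decision
print(verify(log))                       # False: tampering is detectable
```

Note what this does and doesn't buy you: it proves the logged record wasn't altered after the fact, which is exactly the Air Canada gap. It says nothing about whether the logged explanation reflects the model's actual computation, which is the unsolved part discussed next.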

The core challenge remains unsolved. LLMs don't have a discrete "reasoning process" you can point to and record. They perform parallel matrix operations across billions of parameters. There's no serial step-by-step happening internally that you can log the way you'd record a traditional algorithm's execution path.

The New Yorker's reporting illustrates the gap perfectly. Jack Lindsey, leading Anthropic's model psychiatry team, demonstrated that specific neural features correspond to recognizable concepts. When he injected a "cheese" feature into Claude's reasoning, the model first incorporated cheese into responses, then began defining itself by cheese, and eventually believed it was cheese. Real scientific progress.

But notice the distance from governance. Identifying that a feature exists is not tracing how features interact to produce a specific decision. When Lindsey amplified a single feature, the model "retconned" the intrusion to make narrative sense. The very thing that makes Claude seem like a coherent agent, its ability to construct a consistent narrative from fragmentary inputs, is also what makes its reasoning unauditable.

Full governability would likely require structured reasoning architectures where reasoning happens through verifiable intermediate steps rather than being narrated afterward, integrated fact-checking that can verify claims against ground truth (not just generate different text when re-prompted), and standardized audit APIs that regulators and courts can examine. Some of these are feasible with significant engineering investment. Others would require architectural changes that might fundamentally limit capability.

This sharpens the concern: we're deploying systems in governance-critical domains, knowing the audit infrastructure doesn't exist, betting that partial measures and human oversight will be sufficient. That might be the right bet for some use cases. But we should make it consciously, not let the fluency of LLM explanations create an illusion that the infrastructure is already there.

Limitations

I want to be clear about what I don't know. ~100 responses per model with single-annotator scoring is proof of concept, not definitive. Proper validation requires larger samples, multiple annotators with inter-rater reliability testing, testing across more architectures, and domain-specific testing in actual deployment contexts. The Symmetrian Index rubrics are defensible, but not the only possible operationalization. I don't know whether deep governability is achievable with current transformer architectures or requires fundamentally different approaches.

What I'm Looking For

I'm interested in collaboration on validation studies (replicating comprehension decay across more models and samples), technical solutions (testing whether reversibility can improve through architectural changes), regulatory translation (developing policy language that operationalizes governability requirements), and deployment case studies (testing the framework with organizations willing to evaluate their AI systems).

The core finding, that current AI systems can be readable but uncontestable, seems robust enough to put out there. Whether the framework I've built to measure it is optimal, I welcome pushback.

The Symmetrian Index methodology and scoring rubrics are available on request. I'm particularly interested in connecting with AI safety researchers working on interpretability and auditing, regulators developing AI oversight frameworks, legal scholars working on algorithmic accountability, and organizations deploying AI in high-stakes domains who want to test their systems.

Conclusion

We built AI systems that explain themselves beautifully. We forgot to build AI systems that we can actually govern.

A system that generates fluent explanations you can't verify is not more governable than one that stays silent. It's just better at concealing the governance gap.

This matters now. We're deploying these systems in domains where interpretive symmetry isn't a luxury, it's a requirement of legitimate authority. When AI makes decisions that affect people's livelihoods, credit, legal outcomes, or health, someone needs to be able to deliberate about those decisions, contest them, hold agents accountable, and justify outcomes to affected parties. Current systems can't reliably do that. They explain without proving. They answer without audit trails. They decide without contestable reasoning.

The question is whether we recognize this before the next Air Canada case, the next hiring discrimination lawsuit, the next "robot lawyer" enforcement action, or whether we keep mistaking fluent explanations for accountable governance until the gap becomes too costly to ignore.


 

References

ClassAction.org. "AI Job Screening, Interview & Hiring Lawsuits." October 2024. https://www.classaction.org/ai-interview-screening-lawsuits

European Union. Artificial Intelligence Act. Official Journal of the European Union, 2024.

Federal Trade Commission. "FTC Takes Action Against DoNotPay for Deceptive Claims About AI Lawyer." 2024.

HR Defense Blog. "AI in Hiring: Emerging Legal Developments and Compliance Guidance for 2026." November 2025.

Lewis-Kraus, Gideon. "What Is Claude? Anthropic Doesn't Know, Either." The New Yorker, February 9, 2026.

Moffatt v. Air Canada. Civil Resolution Tribunal of British Columbia, 2024.

Mobley v. Workday, Inc. U.S. District Court for the Northern District of California, 2024.

National Institute of Standards and Technology. AI Risk Management Framework. NIST, 2023.

Sharma, Mrinank. Resignation letter. Posted on X, February 9, 2026.


 
