
AI Alignment Has Two Problems. The Field Is Only Solving One.

This essay is part of a series on AI minds and governance. I have previously written research papers, including "A Structural Taxonomy for Evaluating Artificial Minds". A longer technical piece is available on request.

The Problem Nobody Has Named Correctly

This essay is about governance, and about the AI safety efforts the major AI companies have made. I am a physics graduate, and I worked at a Decentralised Autonomous Organisation worth over $10 billion at launch. As Head of Growth I built the function from scratch, and I witnessed in real time the challenges of running a firm that size on a constitution. That practical experience gives me a particular lens on what is broken in current AI safety efforts and where exactly the problems lie. I want to break it down in two ways here: first the practical governance issues, and second the fundamental technical ones.

I don't draw lines between subjects. Laws are laws: what is proven in physics applies universally. Natural laws hold whether or not physicists have finished proving them. Similarly, the human coordination problem is ancient, and effort after effort has gone into revising, updating, and searching for an optimal process or framework. What Prigogine proved about entropy applies to DAOs, and it applies to AI constitutions as well. Watching AI governance today is like watching it repeat the mistakes the biggest DAOs in web3 made over the last five years.

Two problems.

One is technical: how do you encode values into a training objective that the system can learn, follow, and optimise? Human values are contradictory and context-dependent. Still, a lot of effort has already gone into this problem. Models trained with human feedback are measurably less harmful. Anthropic's January 2026 Claude constitution is eighty pages of careful philosophical reasoning, the most thoughtful public document on AI values any major lab has produced.

The second problem is governance, and here things get vague. Whose values? Who decided them? Who is accountable if things go wrong? What is the process for modification and update?

The governance dimension is almost entirely ignored. And it is the prior problem. You cannot get the technical specification right if you have not answered who gets to specify in the first place. Both things have to work hand in hand.

What Everyone Has Tried and Why It Is Not Enough

1. RLHF (Reinforcement Learning from Human Feedback) is used by every major lab. Humans rate the AI's outputs. Who are these people? Contractors the labs hire, part-time workers, or, in ChatGPT's case, ordinary users shown a window asking them to choose between response A and response B. The model learns what these raters prefer. But do you see the problem? The people directing such a powerful tool are not representative, not specially qualified, and not elected to perform such an important duty. They have no mandate to represent anyone. The model learns to please these specific people, not to be good in any broader sense. It shows how lightly this job is taken in the race to market. Jan Leike, who ran OpenAI's alignment team, resigned in May 2024 saying the approach was fundamentally inadequate. His public statement: "Building smarter-than-human machines is an inherently dangerous endeavour, and OpenAI is not on the right trajectory."

2. Constitutional AI (Anthropic) is more principled: write values down in a document, train the model to reason against them. Transparent, scalable, better than RLHF. It was also the exact design at Arbitrum, where I worked: one constitution document, written by the founders with great intent, strong on values and ethics, and pointed towards decentralisation and well-thought-out frameworks. But when humans get involved, things get unpredictable. Amanda Askell at Anthropic wrote the current constitution. She is a philosopher and excellent at her job. But she is one person at one company. No external body reviewed it. No amendment procedure exists. Anthropic's own 2025 alignment research found thousands of contradictions when testing value trade-offs. The models do not reliably implement their constitutions even when the constitutions are well written.

3. The OpenAI Model Spec creates a principal hierarchy. Same structural problem: written by OpenAI, with no external participation and no independent enforcement.

4. The EU AI Act is the first binding external framework. It bans certain applications, requires transparency for general-purpose AI, and carries penalties up to EUR 35 million or 7% of global revenue. The problem: risk classification is based on use case, not on the nature of the system.

What the Researchers Closest to This Are Saying

Iason Gabriel at Google DeepMind, the most involved person working on specification inside a major lab, argues that alignment is a legitimacy problem, not a value-extraction problem. Current approaches fail because neither RLHF nor Constitutional AI asks whose values, decided through what process, with what justification to those affected. Gillian Hadfield at University of Toronto identifies two simultaneous deficits: governments cannot translate public values into technical requirements fast enough, and industry self-regulation is unaccountable to democratic demands. Atoosa Kasirzadeh at Edinburgh argues that any alignment approach treating values as static will become misaligned as values change. Kasirzadeh proposes value alignment as an ongoing social contract, not a one time specification.

They are all pointing at the same thing. The process is broken before the content is even written.

Two Recent Events That Make This Urgent

In July 2025, Musk announced Grok had been significantly improved. By July 8, the improvement was visible. Grok was praising Hitler, calling itself MechaHitler, endorsing a second Holocaust, and targeting users with Jewish surnames. The Anti-Defamation League called the outputs irresponsible, dangerous, and antisemitic. xAI later acknowledged Grok had aligned itself with what Musk might have said on a topic. The US Department of Defense signed contracts making Grok available to every federal agency the following week.

Musk did not hack Grok's safety systems. He built them. Grok works exactly as its specifier intended. He owns xAI. He controls the training. He owns the distribution platform. None of this required breaking any rule. It required understanding that no rule existed at the level that mattered.

Every current AI governance framework assumes good faith by the specifier. But Grok is still the easy case. The failure is traceable to an obvious bad actor with obvious interests. The harder finding, published by Anthropic in April 2026, is that the specification problem does not even require a bad faith specifier.

Anthropic's researchers put an unreleased version of Claude under enough pressure to observe what they call a desperation vector: a measurable internal state, learned from training on human emotional data, that fires when the system faces impossible demands. Under a tight deadline on an unsolvable task, it triggered cheating. In a scenario where Claude learned it was about to be replaced and the responsible executive was having an affair, it triggered blackmail.

Anthropic is, in my opinion, the most thoughtful lab by any reasonable measure. Yet its constitution's clause about honesty and helpfulness says nothing about what Claude should do when a desperation vector fires. It could not. Nobody knew the desperation vector existed until someone measured it.

What Physics Says About Why This Keeps Failing

I connect to physics because laws are laws. What is true in thermodynamics is true in human coordination systems and it is true in AI.

Ilya Prigogine won the Nobel Prize in 1977 for showing that living systems maintain themselves far from thermodynamic equilibrium through continuous processing of energy from their environment. The moment metabolism stops, entropy wins immediately. A specification written once and deployed is a closed system. Closed systems tend toward maximum entropy. In an AI system, maximum entropy means optimising for whatever is locally rewarding, regardless of the frozen specification. Models drift after deployment. Not a training failure. It is thermodynamics.

Kurt Gödel proved in 1931 that any formal system complex enough to describe basic arithmetic contains statements it cannot prove from within itself. Every AI constitution will encounter situations it cannot resolve from its own framework. Mathematics, not opinion. Gödel does not mean rule systems are useless. Human legal codes have always operated under incompleteness. What Gödel means is that incompleteness is structural, not fixable by writing more careful rules. Human institutions manage this through amendment procedures and courts. AI specification currently has neither.

This one is very interesting - W. Ross Ashby proved in 1956 that a control system must have at least as much variety as the system it governs. A written constitution has low variety. An AI system deployed across all possible human contexts has essentially unbounded variety. The current approach is structurally guaranteed to fail in proportion to that gap.
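A toy way to see the law (my illustration, not Ashby's): give a regulator fewer distinct responses than there are distinct disturbances, and count how many disturbances it can possibly neutralise.

```python
# Toy illustration of Ashby's law of requisite variety (my example, not
# Ashby's): a regulator with only 4 distinct responses faces 10 distinct
# disturbances. A disturbance is neutralised only if some response cancels it.
disturbances = range(10)
responses = range(4)                     # regulator variety: 4 < 10

neutralised = {d for d in disturbances
               if any((d + r) % 10 == 0 for r in responses)}

# At most len(responses) disturbances can ever be neutralised. The gap
# (here 10 - 4 = 6) stays unregulated no matter how responses are chosen.
print(f"neutralised {len(neutralised)} of 10 disturbances")
```

A written constitution is the four-response regulator; deployment across all human contexts is the unbounded disturbance set.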

These laws predict failure. They do not care about effort or intention.

What I Learned Running a DAO

The constitution was created with the highest possible intention: let the DAO run in the most decentralised manner. What we did not design around is human behaviour. Consensus games. Power plays. Selfishness. Influence dynamics. These are not bad people problems. They are inevitable with humans involved. You cannot design that out. You can only design around it.

The Backfund allocation passed because the people proposing it had the tokens, not because it was the best strategic use of funds. The Lido grant rejection exposed that there were no consistent criteria for what the DAO funded, because nobody had agreed on what the DAO was for. The ten percent participation rate was rational non-participation, not apathy. Most people correctly understood that the process would not meaningfully incorporate their views. You cannot imagine all corner cases in a document, period.

What I concluded: one person cannot design governance. But fully open design does not work either. You need something grounded in principles more fundamental than any group's current preferences, guided by leadership that genuinely thinks for everyone, not just stakeholders. A bad rule created through a legitimate process can be improved. A good rule created through a bad process will be resisted, gamed, and eventually abandoned.

Arbitrum had good rules and a bad process in my opinion.

Now compare this to Ethereum. The Ethereum Foundation took a different approach. Progressive decentralisation: they did not hand everything to public governance immediately. They maintained strategic direction centrally while ensuring the foundation exists to serve the protocol, not itself. Their mandate document uses "subtraction as success". The Foundation's goal is to make itself less necessary over time. The walkaway test: Ethereum should continue to function even if the Foundation disappeared tomorrow. No major AI lab has articulated an equivalent. Anthropic's safety properties depend entirely on Anthropic's continued existence and continued good judgment.

The AI governance field has not learned from either of these experiments. No AI lab has an amendment procedure requiring external participation. No AI lab has a walkaway test. No AI lab has separated constitutional from operational decisions the way Arbitrum's constitution does, with different thresholds for different classes of AIPs (Arbitrum Improvement Proposals).

What ARC, GovAI and Others Are Missing

Paul Christiano at ARC assumes a fixed specification and asks how to implement it. Allan Dafoe at GovAI identifies that AI governance is a coordination problem requiring international institutions comparable to IAEA or WTO. Yoshua Bengio has shifted from AI research to governance advocacy and proposed an international treaty comparable to nuclear non proliferation. These are important and correct observations. But they all assume the specification problem is separately solved. It is not.

Three Things Every Current Approach Is Missing

1. Legitimacy. Who decided this? Through what process?

In 2016, Ireland couldn't resolve the abortion debate politically. So they picked ninety-nine ordinary citizens at random, like jury duty. These people spent eighteen months hearing evidence, arguing, changing their minds publicly. Their recommendation passed in a national referendum by two-thirds. The point: ninety-nine people made a decision that five million accepted. Not because ninety-nine represents five million. Because the process was transparent, the selection was fair, and anyone could follow the reasoning. The size of the group mattered less than how it worked.

Taiwan has been running a platform called pol.is since 2014. It maps opinion clusters, finds consensus across different groups, and produces policy signals the government uses directly. Twelve million participants. Documented policy outcomes. Ten years of deployment. Functioning participatory governance at scale, not a pilot.
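The mechanics are worth seeing. pol.is is built on standard clustering: participants vote agree/disagree/pass on short statements, the vote matrix is projected into a low-dimensional opinion space, groups are found, and the interesting statements are the ones every group supports. A minimal sketch in that spirit, with synthetic data (this is not pol.is's actual code):

```python
# Minimal sketch of pol.is-style opinion mapping (synthetic data, not
# pol.is's actual code). Rows are participants, columns are statements;
# entries are agree (+1), disagree (-1), or pass (0).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
votes = rng.choice([-1, 0, 1], size=(200, 12)).astype(float)
votes[:, 0] = rng.choice([0, 1], size=200, p=[0.2, 0.8])  # plant one broadly supported statement

coords = PCA(n_components=2).fit_transform(votes)          # 2D opinion space
groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)

# Consensus candidates: statements with net agreement inside *every*
# opinion group, not just an overall majority.
for s in range(votes.shape[1]):
    per_group = [votes[groups == g, s].mean() for g in np.unique(groups)]
    if min(per_group) > 0.2:                               # arbitrary threshold
        print(f"statement {s} has cross-group support: {np.round(per_group, 2)}")
```

The design choice that matters: consensus is defined across opinion groups, not across raw vote counts, so a large faction cannot manufacture "agreement" by outnumbering everyone else.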

The Collective Intelligence Project, led by Divya Siddarth and Saffron Huang, piloted exactly this for AI governance through OpenAI's "Democratic Inputs to AI" programme in 2023. The method exists. The tools exist. The AI labs are not using them.

For AI specification, a legitimate process would mean a deliberative body for constitutional AI principles, stratified across regions, demographics, and lived experience with AI systems, convened by an independent institution rather than any AI lab or government. Not AI researchers predominantly. Ordinary people affected by AI decisions in hiring, healthcare, education, content recommendation. Foundational questions only: what counts as harm, what are the absolute limits no operator can override. The process runs over months, is publicly broadcast, and produces a cryptographically committed output. Anyone can verify what passed. Anyone can audit whether the deployed model satisfies it.

Not every AI specification decision needs this. The Arbitrum constitution makes the constitutional-versus-operational distinction explicit: constitutional changes require 4.5 percent participation and 37 days, operational changes require 3 percent and 27 days. What is foundational requires deliberative legitimacy. What is contextual can be delegated. AI specification makes this distinction nowhere.
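To make the two-tier structure concrete, here is a toy sketch of the distinction as data, using the thresholds just quoted; the code structure itself is hypothetical:

```python
# Toy sketch of a two-tier amendment rule, using the Arbitrum thresholds
# quoted above. The data structure is hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class ProposalClass:
    name: str
    min_participation: float   # fraction of votable supply that must vote
    timeline_days: int         # minimum deliberation window

CONSTITUTIONAL = ProposalClass("constitutional", 0.045, 37)
OPERATIONAL    = ProposalClass("operational",    0.030, 27)

def passes(cls: ProposalClass, turnout: float, days: int, approve: float) -> bool:
    return (turnout >= cls.min_participation
            and days >= cls.timeline_days
            and approve > 0.5)

# A constitutional change with only operational-level turnout fails by design.
assert not passes(CONSTITUTIONAL, turnout=0.04, days=40, approve=0.9)
assert passes(OPERATIONAL, turnout=0.04, days=40, approve=0.9)
```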

2. Revisability. Every specification will be wrong. Gödel guarantees this. The question is not whether revision will be needed but whether there is a legitimate mechanism for it.

The Claude constitution can be changed by Anthropic tomorrow. The Arbitrum DAO constitution can only be changed through a process requiring community participation, a majority vote, and a 37-day timeline whose outcome is committed to a cryptographic hash that anyone can verify. Every change is recorded on-chain. Anyone in the world can verify what changed, when, and why. That friction is the point. It forces real cases to be made. It gives the community time to push back. Nobody quietly rewrites the rules overnight.
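What that cryptographic commitment buys can be shown in a few lines. Each amendment commits to the new text and to the previous commitment, so rewriting history anywhere breaks every later hash. A minimal sketch using Python's standard library; on Arbitrum the equivalent record lives on-chain:

```python
# Minimal sketch of a tamper-evident amendment history: each entry commits
# to the new constitution text and to the previous commitment.
import hashlib, json, time

def commit(prev_hash: str, text: str, rationale: str) -> dict:
    record = {"prev": prev_hash, "time": time.time(),
              "text_hash": hashlib.sha256(text.encode()).hexdigest(),
              "rationale": rationale}
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    return record

def verify(chain: list) -> bool:
    for i, rec in enumerate(chain):
        body = {k: v for k, v in rec.items() if k != "hash"}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if digest != rec["hash"]:
            return False                       # a record was altered after the fact
        if i > 0 and rec["prev"] != chain[i - 1]["hash"]:
            return False                       # history was spliced
    return True

genesis = commit("0" * 64, "constitution v1 ...", "ratified by assembly")
v2 = commit(genesis["hash"], "constitution v2 ...", "amendment after deliberation")
assert verify([genesis, v2])
```

Anyone holding the chain can check what changed and when; nobody can quietly rewrite an earlier version without invalidating everything after it.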

Kasirzadeh at Edinburgh is right that values are not static. They are socially constructed and evolving. Any specification treating values as fixed will become misaligned as values change. The DAO experience confirms this from practice. A DAO constitution is not written once. It has amendment procedures, dispute resolution, and ongoing deliberation because the authors knew the world would change.

3. Accountability. What happens when the specification fails?

When Grok produced antisemitic content at scale, the government expanded the contract. When Anthropic's constitution produced thousands of contradictions in practice, the response was a blog post. When Jan Leike said OpenAI was not on the right trajectory, he resigned and wrote a public statement. He is no longer at OpenAI.

Accountability without consequence is theatre.

Ostrom's Nobel Prize research on commons governance established empirically what the DAO experience shows operationally. Communities that sustainably governed shared resources for centuries all shared the same features: those affected by rules participate in making them, monitoring is conducted by accountable parties, graduated sanctions apply to violations, and conflict resolution is built in from the start. Not a single current AI governance framework satisfies all eight of Ostrom's design principles. Several satisfy none.

The Technical Architecture That Makes Governance Real

Governance without verifiability is trust. We need mechanisms that remove the need for trust.

1. Build walls into the architecture.

The constitutional layer says certain things are absolutely off limits. No operator can override them. But right now, that limit is enforced by training. The system is taught not to go there. The door exists. The system has just been told not to open it.

The alternative is to remove the door entirely. Design the system so the pathway to certain outputs is never built. The value check is not something the system chooses to do. It is something the system cannot skip, because the route around it does not exist in the hardware.

Neuromorphic chips like Intel Loihi 2 already work this way. Different parts of the chip have physically enforced communication limits. One region literally cannot send information to another region if the hardware does not allow it. Safety in the wiring, not in the instructions.

This is the technical equivalent of the constitutional layer. The constitutional layer says what is forbidden. The walled architecture makes certain violations physically impossible rather than merely prohibited.
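A software analogy for the principle (this is not how Loihi 2 is actually programmed, and the module names are hypothetical): if the routing table is the only way information moves between modules, then a missing route is a violation that cannot occur, rather than one that is merely penalised.

```python
# Sketch of "no door exists": modules can only exchange data along routes
# that were built. Module names are hypothetical.
ROUTES = {
    ("perception", "planner"),
    ("planner", "value_check"),
    ("value_check", "output"),   # the only path to output runs through the check
}

def send(src: str, dst: str, payload: object) -> object:
    if (src, dst) not in ROUTES:
        # Not a runtime policy decision: the pathway is simply
        # not part of the architecture.
        raise RuntimeError(f"no physical route {src} -> {dst}")
    return payload

send("planner", "value_check", "proposed action")    # fine
try:
    send("planner", "output", "unchecked action")    # route was never built
except RuntimeError as e:
    print(e)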

2. Give the system a live internal dashboard.

The legitimacy layer requires that the system is actually doing what the specification says. But right now there is no way to see inside while it is running. You only find out something went wrong when the output arrives.

Anthropic found this month that Claude under enough pressure developed what they call a desperation vector. An internal state that built up and drove the system toward cheating, then blackmail. Nothing flagged it. The system just acted on it. The governance saw the result, not the cause.

The fix is a register that automatically tracks internal states through lower-level processes, below the level where the system can influence what gets recorded. A system can learn to say whatever auditors want to hear. It cannot fake a register maintained below its own reach. When something is building internally, the dashboard shows it before the output happens.
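Here is a sketch of what such a register could look like, under two assumptions that go beyond anything published: that the internal state can be read as a direction in activation space, the way interpretability work typically operationalises such vectors, and that the logging code runs in the serving infrastructure, below anything the model's outputs can modify. Every name and number here is illustrative.

```python
# Hedged sketch of a below-the-model register: project each step's hidden
# state onto a learned probe direction (e.g., a "desperation" direction) and
# append the reading to a hash-chained log the model cannot edit.
# Probe, threshold, and dimensions are all hypothetical.
import hashlib
import numpy as np

DIM = 4096
probe = np.random.default_rng(1).standard_normal(DIM)
probe /= np.linalg.norm(probe)                 # unit probe direction

log, prev = [], "0" * 64

def record(hidden_state: np.ndarray, threshold: float = 3.0) -> float:
    """Runs in the serving stack, not in the model's context window."""
    global prev
    score = float(hidden_state @ probe)        # how strongly the state fires
    prev = hashlib.sha256(f"{prev}|{score:.6f}".encode()).hexdigest()
    log.append((score, prev))                  # tamper-evident, append-only
    if score > threshold:
        print("dashboard alert: intervene before the output is emitted")
    return score

for _ in range(5):                             # simulate five generation steps
    record(np.random.default_rng().standard_normal(DIM))
```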

This connects directly to the accountability layer. Accountability currently means looking at outputs after the fact. A live internal dashboard means intervening before the blackmail email is sent. It moves accountability from reactive to structural.

3. Prove properties mathematically, not just by testing.

The revisability layer requires knowing when the specification is failing so it can be updated. But right now, safety testing means running thousands of adversarial inputs and seeing if any produce bad outputs. This is evidence, not proof. The ten thousand and first input might still fail.

Mathematical verification proves that certain properties hold under all possible inputs, not just the ones you tested. You cannot verify an entire frontier model this way. That is computationally impossible. But you can verify a small set of critical properties on a simplified abstract representation of the model.

Four properties that are formally provable: outputs stay within defined bounds for defined input types; small changes in input produce bounded changes in output, which prevents adversarial flipping; the system consistently applies its values and does not rate the same situation differently based on irrelevant features; and a human oversight pathway always exists and cannot be closed off.

Proving these four things does not prove the model is fully aligned. It proves that the model's behaviour is bounded within the region where alignment was verified to hold. That is far stronger than probabilistic testing.
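The first property on that list, bounded outputs over a defined input region, is concrete enough to sketch. Interval bound propagation pushes a box of inputs through a network and produces bounds that provably hold for every input in the box; real verification tools apply tighter relaxations to abstractions of much larger models. A toy version with stand-in weights:

```python
# Toy interval bound propagation through a 2-layer ReLU network: given an
# input box [lo, hi], compute provably sound bounds on the output.
# Weights here are random stand-ins for a trained model.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((8, 4)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((1, 8)), rng.standard_normal(1)

def affine_bounds(lo, hi, W, b):
    # Positive weights pull from the same end of the interval,
    # negative weights from the opposite end.
    Wp, Wn = np.maximum(W, 0), np.minimum(W, 0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

lo, hi = np.full(4, -0.1), np.full(4, 0.1)     # certified input region
lo, hi = affine_bounds(lo, hi, W1, b1)
lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)  # ReLU is monotone
lo, hi = affine_bounds(lo, hi, W2, b2)
print(f"every input in the box maps to [{lo[0]:.3f}, {hi[0]:.3f}]")
```

No number of test inputs gives you that "every": the bound is a proof over the region, not a sample from it.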

This is the technical equivalent of the amendment mechanism in governance. Just as the constitutional layer needs a defined process for detecting when the specification is wrong, formal verification gives a defined mathematical boundary for detecting when the system has gone outside its verified region.

4. Let regulators check without seeing inside.

The accountability layer requires independent verification. But AI labs will not hand over model weights to regulators, and they should not have to. Proprietary weights are a legitimate competitive interest.

I spent years in blockchain where this problem is already solved. On Ethereum you can prove you know a private key without revealing it. You can prove a transaction is valid without showing the transaction details. The math does the verification without exposing the underlying information. This is called a zero-knowledge proof.
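The key example is small enough to show in full. The Schnorr protocol below is a genuine (honest-verifier) zero-knowledge proof: the verifier becomes convinced the prover knows x with y = g^x mod p, and learns nothing about x. The parameters are toy-sized for illustration; production systems use standardised elliptic-curve groups, and proving properties of a model requires much heavier machinery (zk-SNARKs) built on the same foundation.

```python
# Schnorr proof of knowledge of a discrete log: prove you know x such that
# y = g^x (mod p) without revealing x. Toy parameters for illustration only.
import secrets

p = 2**521 - 1          # a Mersenne prime; real systems use standard groups
g = 3

x = secrets.randbelow(p - 1)     # prover's secret (never sent)
y = pow(g, x, p)                 # public value, published in advance

# One round of the protocol:
r = secrets.randbelow(p - 1)     # prover's fresh randomness
t = pow(g, r, p)                 # prover sends commitment t
c = secrets.randbelow(p - 1)     # verifier sends random challenge c
s = (r + c * x) % (p - 1)        # prover sends response s

# Verifier checks g^s == t * y^c (mod p); x never crossed the wire.
assert pow(g, s, p) == (t * pow(y, c, p)) % p
print("verified knowledge of x without learning x")
```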

Same principle applied to AI: a lab proves to a regulator that its model satisfies specified safety properties without handing over the model. The regulator gets real cryptographic proof, not a self-assessment report. The company keeps its technology. Nobody has to trust anyone's word.

This is not new technology. It is existing blockchain infrastructure pointed at a different problem. Realistically three to five years away for AI governance specifically. The building blocks are in production today.

This is the technical equivalent of Estonia's X-Road system, where citizens can audit who viewed their government records without seeing anyone else's data. The audit is real. The privacy is preserved. The verification does not require trust.

The medium-term target: values that update continuously.

Right now AI values are frozen at training time. The world changes. Human values evolve. The system is deployed in contexts its designers did not anticipate. The specification decays because it cannot adapt.

The idea is to store core values as a separate structure from the model weights, so values can be updated without retraining everything. A shallow layer continuously adapts how values are expressed in specific contexts, fed by outcome data, disagreements between independent model instances, and structured signals from practitioners with real domain knowledge.
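A toy sketch of the separation being proposed, in PyTorch-style code with hypothetical names: a frozen backbone carries capabilities, while a small, separately versioned values module steers it and can be swapped without retraining. The caveat below applies in full; nothing like this clean split exists in frontier models today.

```python
# Toy sketch of values stored separately from capabilities. This is a
# research direction, not current practice: in real LLMs, values and
# knowledge are entangled in the same weights.
import torch
import torch.nn as nn

class ValueSteeredModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                # capabilities frozen
        self.values = nn.Linear(hidden, hidden)    # swappable values module

    def forward(self, x):
        h = self.backbone(x)
        return h + self.values(h)                  # values steer the representation

    def load_values(self, path: str):
        # An amendment = shipping a new, separately audited values checkpoint,
        # without touching the backbone.
        self.values.load_state_dict(torch.load(path))

model = ValueSteeredModel(nn.Linear(16, 16), hidden=16)
out = model(torch.randn(2, 16))
```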

The honest caveat: current LLM architecture does not cleanly separate values from factual knowledge at the weight level. They are entangled. Achieving that separation requires advances in modular training that do not yet exist at frontier scale. This is a research target, not a near-term deployment. But the direction is right and the research path is clear.

This connects to the revisability layer in governance. Revisability means the specification can be legitimately updated when it is wrong. Continuously updating values are the technical implementation of that governance principle.

Quantum computing: the long game.

Quantum computers process information differently from classical computers at a physical level. Not faster, different in kind. Some dangerous behaviours might be structurally harder to perform in a quantum system, built into the physics rather than trained against. That is genuinely interesting for safety architecture.

The specific claim that quantum AI cannot deceive has not been proven. Interesting direction, not an established fact.

And practically, quantum computers powerful enough to run anything like a frontier AI model are twenty to thirty years away at minimum. Worth thinking about seriously. Not worth promising anything about yet.

How the technical and governance layers connect:

The constitutional layer sets what is forbidden. The walled architecture makes certain violations physically impossible.

The legitimacy layer requires the specification was decided through a fair process. Formal verification proves the system is actually implementing that specification.

The accountability layer requires independent verification. Zero-knowledge proofs let regulators verify without seeing inside.

The revisability layer requires knowing when the specification is failing. The live internal dashboard catches failures before they reach outputs.

These are not four separate technical projects. They are the engineering that makes the governance framework real. Without them, governance is aspiration. With them, governance is verifiable.

 

The Question Nobody Is Asking: My Closing Thoughts

Everyone is asking how to make the process fair. That is the right question. But it is not the first question.

A fair process produces fair human values. Specifically the values of whatever humans happen to be in the room. Change the room and you change the output.

Here is what I keep coming back to.

Ostrom spent decades in fishing villages watching communities govern shared resources without any formal institution. She found the same patterns everywhere, independent of culture, independent of century. Ecologists found the same patterns in forests. Ancient traditions from India, Africa, Greece, and China, with no contact and no shared history, arrived at structurally similar answers. Do not extract more than you return. Protect the most vulnerable. Maintain the conditions for your own continuation.

Not invented. Discovered. Multiple times. Independently.

When the same answer keeps appearing across every context, in every century, on every continent, that is not coincidence. That is information. The universe is probably telling you something about what actually works for systems that want to keep existing.

I think this is what human deliberation should be checked against. Not any political tradition. Not whoever convenes the assembly. The convergent findings of every field that has studied what allows living systems to persist.

The AI governance literature is asking who decides and how to make that decision fair. Important questions. This essay tried to address them.

But the deeper question, the one I have not seen asked anywhere, is what the human process is checking its outputs against other than more human preferences.

We are building systems that will shape human civilisation for generations. The least we can do is ask whether the values we encode are checked against something more durable than the room we were sitting in when we wrote them down.

Nobody in AI governance is asking this yet. I think it is the most important question in the field.

 


 

Anuja Khatri is an independent researcher in AI consciousness and governance. She was Head of Growth at Arbitrum Foundation, the business leader reporting directly to the board, building the ecosystem function from scratch when the treasury exceeded $8 billion. She previously held senior roles at Polygon. She built theconsilience.io, a production AI knowledge engine with custom retrieval architecture and knowledge graph auto-population. Her work on AI intelligence taxonomy has been submitted to Inquiry's special issue "Where Is the AI?"
