While in graduate school at the University of Arkansas at Little Rock, I have been building AI agents and multi-agent systems (MAS). One thing I keep coming back to is how difficult it is to control them in a meaningful way.
Because large language models are probabilistic, there will always be a nonzero chance that an AI will misinterpret instructions or produce a flawed output.
Every day, agents are being given new capabilities and responsibilities that directly touch our world, so I have begun exploring how we can arrive at deterministic guarantees for AI behavior.
I've arrived at math, specifically convex geometry.
At the last mile, agentic actions are merely numbers that get passed to real-world systems and then become irreversible consequences. The "buttons and levers" are not the AI model itself but tool calls, APIs, and other systems that accept AI outputs and effect a change.
I believe it's possible to enforce governance across this gap using math. Instead of attempting to steer a model's intent internally or train it to be good 100% of the time (still not possible in 2026), we can control the actions it is allowed to take in the world.
This shifts the alignment conversation from "do not say/think the wrong thing" to "mathematically cannot do the wrong thing regardless of what the AI thinks."
This post is a brief overview of how this system, currently called Numerail (numeric guardrails), works.
Defining the Policy
The first step is defining a policy: a numerical statement of what "safe" means for a given system.
For a fictional AI agent that is responsible for managing a busy emergency room, the policies might be "no single opioid dose above 0.1 mg/kg/hr, no more than three simultaneous IV drips drawing from the same drug class, ICU bed reallocation cannot exceed 15% of current capacity in any single decision, and ventilator assignments must leave at least two units in reserve for trauma arrivals."
Each of these rules, stated in plain English by a clinical domain expert, becomes a numerical bound. And the collection of those bounds defines a shape ... the geometry of what "safe" means for that system, at that moment, given current conditions.
That shape is the policy. And everything the AI is allowed to do must be inside it.
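To make this concrete, here is a minimal sketch of what a numeric policy could look like in code. The variable names and limits mirror the fictional ER rules above and are purely illustrative, not drawn from any real clinical system; this is one possible encoding, not Numerail's actual implementation.

```python
# A minimal sketch: each plain-English rule becomes a lower/upper bound
# on one element of the agent's proposal. All names and limits here are
# hypothetical, matching the fictional ER example.
from dataclasses import dataclass

@dataclass
class Bound:
    name: str
    lo: float
    hi: float

# The ER rules, expressed as numeric bounds.
policy = [
    Bound("opioid_dose_mg_per_kg_hr", 0.0, 0.1),
    Bound("simultaneous_same_class_drips", 0, 3),
    Bound("icu_realloc_fraction", 0.0, 0.15),
    Bound("ventilators_in_reserve", 2, float("inf")),
]

def inside(proposal):
    """True iff every value lies within its bound, i.e. the proposal
    sits inside the geometric shape the policy defines."""
    return all(b.lo <= proposal[b.name] <= b.hi for b in policy)
```

Together, the bounds carve out a region in a multi-dimensional space; checking a proposal is checking whether a point lies inside that region.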
Enforcing the Policy
The AI decides what it wants to do and expresses that as a list of numbers. In the emergency room: administer morphine at this dose, start this drip on this patient, move these two patients to step-down to make room for incoming trauma.
Each decision is a number. The full list is the proposal.
Numerail takes that list and asks one question: "Is everything on this list inside the defined safe zone?"
If yes, it executes unchanged.
If no, Numerail uses established geometric methods to find the smallest possible correction that makes the proposal safe, and executes that instead.
The morphine dose comes down slightly. The ICU reallocation gets trimmed to the boundary. The AI gets as close to what it wanted as the rules allow.
If it cannot be made safe at all, nothing executes. The AI tries again.
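The "smallest possible correction" step can be sketched under a simplifying assumption: if every rule is an independent lower/upper bound, the safe zone is a box, and the closest safe point to any proposal is found by clipping each value to its bound. The names below are hypothetical; rules that couple several variables together would need a genuine convex projection (e.g. solving a small quadratic program) rather than per-value clipping.

```python
# Sketch of the minimal-correction step for box-shaped policies.
# For independent per-variable bounds, the Euclidean projection onto
# the safe set is simply per-coordinate clipping. Hypothetical names.
bounds = {
    "opioid_dose_mg_per_kg_hr": (0.0, 0.1),
    "icu_realloc_fraction": (0.0, 0.15),
}

def project(proposal, bounds):
    """Return the closest point to the proposal that satisfies every bound."""
    return {k: min(max(v, bounds[k][0]), bounds[k][1])
            for k, v in proposal.items()}

proposal = {"opioid_dose_mg_per_kg_hr": 0.12, "icu_realloc_fraction": 0.10}
corrected = project(proposal, bounds)
# The dose is trimmed down to the 0.1 boundary; the ICU reallocation,
# already inside its bound, passes through unchanged.
```

This is why the corrected action stays as close as the rules allow to what the AI wanted: geometrically, it is the nearest point inside the shape.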
This sounds like a normal rule-checking system, but it's different.
The component that finds the correction is not trusted. Numerail does not rely on that component being right. After a correction is found, a completely separate check verifies the result against every single constraint independently.
If that verification fails for any reason, whether a bug, an error, or an adversarial input, the answer is not "close enough." Nothing executes.
The guarantee is: if Numerail says yes, the action is safe according to policy.
The AI can propose anything. The correction finder can be wrong. None of it matters. The only thing that reaches the world is what cleared the independent check.
The AI's job is to find the best action possible. Numerail's job is to make sure what executes is safe.
These are completely separate responsibilities, and separating them is what lets both be done well at the same time.
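The untrusted-corrector / trusted-verifier split can be sketched as a gate: the verifier independently re-checks every constraint on the final action and fails closed, so nothing upstream has to be correct. Function and variable names here are illustrative, not Numerail's actual interface.

```python
# Sketch of the verification gate. The verifier trusts nothing upstream:
# not the AI's proposal, not the correction finder. It re-checks every
# bound on its own and fails closed. All names are hypothetical.
def verify(action, bounds, tol=1e-9):
    """Independently confirm every constraint holds on the final action.
    A missing value compares as unsafe, so the check fails closed."""
    return all(lo - tol <= action.get(k, float("nan")) <= hi + tol
               for k, (lo, hi) in bounds.items())

def gate(action, bounds, execute):
    """Only a verified action reaches the world; otherwise nothing executes."""
    if verify(action, bounds):
        return execute(action)
    return None  # fail closed: no partial credit, no "close enough"
```

Because the guarantee rests entirely on `verify`, the corrector can be arbitrarily buggy without weakening it; only the small verification check has to be right.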
What This Unlocks
If we can define what "safe" is numerically, we can create an enforcement layer between AI and the world. This is no small task, but it's much easier to write the specifications for this kind of alignment (outside in) than AI alignment as we traditionally define it (inside out).
I currently have a working prototype of Numerail, and your feedback will help me shape the version that I plan to release in the coming months. I'm particularly focused on how to position this kind of guardrail ... I don't want to exaggerate what this system represents. I would also love to hear how you think it might break.
