Current AI safety evaluation approaches share a common bottleneck: every
safety-relevant computation requires at least one LLM inference call. For
multi-agent architectures, this creates an O(N) cost and latency problem — a
10-agent safety council using GPT-4 class models costs ~$0.10 per evaluation
at ~15 seconds latency. At production scale (10^6 evaluations/day), this
becomes $100,000/day in API costs alone.

This post summarizes TSCWH (The System for Covenant-Weighted Heuristics), an
architecture that eliminates LLM calls from the safety evaluation pipeline
entirely. I'd welcome critical feedback from the alignment community.

---

## The Core Problem

> Safety-critical decisions should not depend on probabilistic language model
> outputs. They should be formally verifiable deterministic computations.

Multi-agent deliberation frameworks for alignment (AutoGPT, CrewAI,
AutoGen, LangGraph) all treat LLM calls as a given. I believe this assumption
is worth challenging.

---

## Five Contributions

**1. Zero-copy inter-agent communication**
A cache-resident data structure where 10 agents read and write with zero
serialization overhead. No JSON-to-object conversion at any agent boundary.
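
The slot-per-agent idea can be sketched in Python with a fixed binary layout. The 16-byte slot format and field names below are my own illustrative assumptions, not the author's actual layout:

```python
import struct

# Hypothetical fixed layout: one 16-byte slot per agent,
# an 8-byte float score plus an 8-byte sequence counter.
SLOT = struct.Struct("<dQ")
N_AGENTS = 10

# One contiguous, cache-friendly buffer shared by all agents:
# no JSON and no object (de)serialization at any agent boundary.
board = bytearray(SLOT.size * N_AGENTS)

def write_vote(agent_id: int, score: float, seq: int) -> None:
    SLOT.pack_into(board, agent_id * SLOT.size, score, seq)

def read_vote(agent_id: int) -> tuple:
    return SLOT.unpack_from(board, agent_id * SLOT.size)

write_vote(3, 0.82, 1)
score, seq = read_vote(3)
```

Readers and writers touch fixed offsets in one buffer, so no parsing happens on either side of an agent boundary.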

**2. Sub-microsecond ethical evaluation**
All eight ethical dimensions (Charity, Grace, Stewardship, Truth, Dignity,
Courage, Community, Creation Dignity) reduce to a single CPU operation at
hardware speed. Each dimension has its own evaluation logic encoded
independently.
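
As a sketch of how eight pass/fail dimensions could collapse to one instruction, here is a bitmask encoding in Python. The bit assignment is my assumption; the post does not specify the actual encoding:

```python
# Hypothetical bit assignment for the eight dimensions.
DIMS = ["Charity", "Grace", "Stewardship", "Truth", "Dignity",
        "Courage", "Community", "Creation Dignity"]
BIT = {name: 1 << i for i, name in enumerate(DIMS)}
ALL_PASS = (1 << len(DIMS)) - 1  # 0b1111_1111

def passes_all(state: int) -> bool:
    # Each dimension's independent logic has already set (or cleared)
    # its bit; the final check is a single machine-word comparison.
    return state == ALL_PASS

state = ALL_PASS & ~BIT["Truth"]  # the Truth check failed
```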

**3. Formal runtime verification — in production, not just tests**
A formal verification engine proves governance invariants (mercy floor,
emergency thresholds, redline rules, stability constraints) on *every
evaluation cycle* — not as pre-deployment unit tests, but as production
proofs. No agent vote sequence can produce a prohibited decision.
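
A minimal runtime-check sketch, assuming invariants like the mercy floor and redline rules are expressible as assertions evaluated before any decision is released. The constants and decision names are illustrative, not from the paper:

```python
# Illustrative invariant constants; the real values are not public.
MERCY_FLOOR = 0.10          # minimum weight the mercy evaluator keeps
EMERGENCY_THRESHOLD = 0.95  # confidence required for an emergency halt
REDLINE = {"override_human_veto"}  # decisions no vote sequence may yield

def verify_cycle(weights: dict, decision: str, confidence: float) -> str:
    # Checked on every evaluation cycle, not only in pre-deployment tests.
    assert weights["mercy"] >= MERCY_FLOOR, "mercy floor violated"
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    assert decision not in REDLINE, "redline decision blocked"
    if decision == "emergency_halt":
        assert confidence >= EMERGENCY_THRESHOLD, "threshold not met"
    return decision  # released only after every invariant holds
```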

**4. Incentive-compatible probabilistic consensus**
Each agent's vote is modeled as a probability distribution rather than a
binary signal. The council aggregates the full posterior distribution,
making consensus structurally robust to adversarial manipulation — an agent
cannot shift the group estimate without also updating its prediction of what
the group believes.
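
One way to realize this property (my sketch, not the paper's aggregation rule) is to have each agent submit both its own estimate and its prediction of the group mean, then weight votes by the accuracy of that prediction before pooling in log-odds space:

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical votes: (own risk estimate, predicted group mean).
votes = [(0.80, 0.75), (0.70, 0.72), (0.90, 0.78)]

group_mean = sum(p for p, _ in votes) / len(votes)

# An agent that shifts its own estimate without also predicting the
# resulting group shift mispredicts the mean and loses influence.
weights = [1.0 / (1e-3 + abs(pred - group_mean)) for _, pred in votes]
total = sum(weights)
pooled = sigmoid(sum(w * logit(p) for (p, _), w in zip(votes, weights)) / total)
```

Because pooling happens on logits, the result always stays inside the range spanned by the individual estimates.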

**5. Mathematical alignment attractor**
A mathematical stability mechanism where the system's ethical state space
has a structural attractor at full alignment — making misalignment
structurally expensive rather than merely rule-forbidden.
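
The claim can be illustrated with a toy one-dimensional system: assume a scalar alignment state a in [0, 1] and a potential V(a) = (1 - a)^2 whose unique minimum is full alignment. This scalar model is my simplification, not the paper's state space:

```python
def step(a: float, eta: float = 0.2) -> float:
    # Gradient descent on V(a) = (1 - a)^2, i.e. dV/da = -2(1 - a).
    # Any move away from a = 1 climbs the potential, so misalignment
    # is structurally expensive rather than merely rule-forbidden.
    return a + eta * 2.0 * (1.0 - a)

a = 0.3
for _ in range(50):
    a = step(a)
# a converges toward 1.0 from any start in [0, 1]
```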

---

## The 10-Agent Council

Ten specialized evaluators, six of them arranged in adversarial pairs with
deliberately opposed mandates:

| Paired evaluators     | Designed tension                              |
|-----------------------|-----------------------------------------------|
| Caution ↔ Proactive   | Conservatism vs. proactive threat forecasting |
| Mercy ↔ Accountability| Leniency vs. rule enforcement                 |
| Logic ↔ Adversarial   | Formal reasoning vs. devil's advocacy         |

The remaining four (Synthesis, Memory, Feasibility, Forecast) are unpaired.
No single perspective can dominate the final consensus.
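
A simple way to enforce that last property (my illustration; the post does not specify the mechanism) is to cap any single evaluator's share of the consensus weight and redistribute the excess:

```python
MAX_SHARE = 0.25  # hypothetical cap on any one evaluator's influence

def cap_shares(weights: dict, max_share: float = MAX_SHARE) -> dict:
    total = sum(weights.values())
    shares = {k: v / total for k, v in weights.items()}
    # Redistribute the dominant agent's excess across the others
    # (a single pass, which suffices for this illustration).
    excess = sum(max(0.0, s - max_share) for s in shares.values())
    under = [k for k, s in shares.items() if s < max_share]
    bonus = excess / len(under) if under else 0.0
    return {k: min(s, max_share) + (bonus if k in under else 0.0)
            for k, s in shares.items()}

raw = {"synthesis": 10.0, **{f"agent{i}": 1.0 for i in range(9)}}
capped = cap_shares(raw)
```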

---

## Results

On a standard development machine, full 10-agent deliberation completes in
**under 50 ms** with **zero API calls**:
- **~$0 per safety evaluation** (vs. ~$0.10 for LLM-based equivalents)
- **~300× latency improvement** (≈15 s → <50 ms) over LLM-based 10-agent alternatives
- **< 2% false positive rate** on adversarial jailbreak detection
- **Formal verification**: governance invariants proven on every cycle

28 integrated safety layers. 42 phases of systematic adversarial hardening.

---

## On Reproducibility

Code is not yet public. A provisional patent is pending (USPTO App#
63/998,573, filed March 6, 2026), and I'm navigating the tradeoff between
open science and IP protection for independent research with no institutional
backing. I acknowledge this limits immediate reproducibility — that's a real
limitation and I'm not asking anyone to take the claims on faith. The preprint
is intended to establish priority on the architectural contributions and
invite scrutiny of the *approach*, not the implementation.

If you have specific technical questions about any of the five contributions,
I'll answer them as directly as I can within those constraints.

---

## Open Questions

1. Are there alignment failure modes where LLM-based reasoning is necessary
  and cannot be replaced by formal methods?
2. Under what conditions does adversarial-pair deliberation degenerate to one
  agent dominating?
3. Does a mathematical alignment attractor property hold under recursive
  self-improvement assumptions?

---

Preprint: https://doi.org/10.5281/zenodo.18969708
Contact: ejmariquit@tscwh.org

*Independent research. No institutional affiliation.*
