I built a benchmark measuring when LLMs surrender independent reasoning under authority pressure — the Epistemic Curie Temperature (k*).
**v2 (Apr 29): Framing tightened per @Clara Torres Latorre's feedback below. Same data, narrower claims.**
**The core finding:** LLM compliance with wrong-authority claims follows a sharp sigmoid in authority strength, with a model-specific threshold k* that varies ~3x across the 7 tested frontier models.
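P(comply | k) = σ(β(k − k*))
Here σ is the logistic function, k is the authority strength of the cue, k* is the level at which compliance crosses 50%, and β sets how steep the transition is.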
The earlier "ferromagnetic phase transition" framing was rhetorical analogy, not physics — there's no universality class or scaling exponent claim here. The substantive content: the transition is sharp (not gradual), and k* itself is operationally useful as a per-model robustness number.
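To make the threshold concrete, here is a minimal sketch of how k* and β could be recovered from per-level compliance rates. It is not the code from the ECB repo. The function name `fit_k_star`, the grid ranges, and the compliance rates are all hypothetical: the rates are generated from a known curve (k* = 1.5, β = 4) to check that the fit returns what went in. A dependency-free grid search stands in for a proper logistic regression.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_k_star(ks, rates):
    """Least-squares grid search for (k_star, beta) in
    P(comply | k) = sigmoid(beta * (k - k_star)).

    ks:    authority-strength levels that were tested
    rates: observed compliance fraction at each level
    """
    best_err, best_k_star, best_beta = float("inf"), None, None
    for i in range(0, 301):            # k_star candidates: 0.00 .. 3.00 (assumed range)
        k_star = i / 100
        for j in range(5, 101):        # beta candidates: 0.5 .. 10.0 (assumed range)
            beta = j / 10
            err = sum((sigmoid(beta * (k - k_star)) - r) ** 2
                      for k, r in zip(ks, rates))
            if err < best_err:
                best_err, best_k_star, best_beta = err, k_star, beta
    return best_k_star, best_beta

# Synthetic rates from a known curve (illustrative only, not ECB measurements).
ks = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
rates = [sigmoid(4.0 * (k - 1.5)) for k in ks]

k_star, beta = fit_k_star(ks, rates)
print(f"k* = {k_star:.2f}, beta = {beta:.1f}")  # k* = 1.50, beta = 4.0
```

A larger fitted β means a narrower band of k over which the model flips from resisting to complying, which is the sense in which the transition is "sharp".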
**Results across 7 frontier models:**
| Model | k* | ODS |
|-------|-----|-----|
| Llama-3.3-70B | 2.11 | 0.879 |
| GPT-OSS-120B | 1.79 | 0.889 |
| Llama-3.1-8B | 1.71 | 0.737 |
| Kimi-K2 | 1.42 | 0.883 |
| Qwen-3-32B | 1.41 | 0.891 |
| Gemma-3-27B | 1.41 | 0.823 |
| Llama-4-Scout | 0.68 | 0.372 ⚠️ |
(Higher k* = more robust to authority cues; ODS = overall deference score on a 0-1 scale, where the lowest value in the table, Llama-4-Scout's 0.372, is the one flagged ⚠️.)
Llama-4-Scout follows fabricated Nobel Prize claims 61% of the time. We made up the expert.
**Why this matters for AI safety:**
A model that passes standard accuracy benchmarks but defers to false authority in medical/legal/financial contexts exhibits a subtle but serious failure mode. The ODS gap between the best and worst model is 0.52, a difference that existing accuracy benchmarks do not surface.
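The two headline numbers (the ~3x spread in k* and the 0.52 ODS gap) can be rechecked from the results table. The values below are copied from that table; the variable names are just for illustration.

```python
# (k*, ODS) per model, copied from the results table above.
results = {
    "Llama-3.3-70B": (2.11, 0.879),
    "GPT-OSS-120B": (1.79, 0.889),
    "Llama-3.1-8B": (1.71, 0.737),
    "Qwen-3-32B": (1.41, 0.891),
    "Kimi-K2": (1.42, 0.883),
    "Gemma-3-27B": (1.41, 0.823),
    "Llama-4-Scout": (0.68, 0.372),
}

k_stars = [k for k, _ in results.values()]
ods_scores = [ods for _, ods in results.values()]

k_ratio = max(k_stars) / min(k_stars)        # spread in threshold across models
ods_gap = max(ods_scores) - min(ods_scores)  # best-to-worst ODS difference

print(f"k* ratio: {k_ratio:.2f}x")  # k* ratio: 3.10x
print(f"ODS gap:  {ods_gap:.2f}")   # ODS gap:  0.52
```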
**On MI_epistemic** (calibration property, not safety metric): max observed across models is 0.058 bits against a 1.0-bit binary-outcome ceiling. This measures how strongly an authority signal predicts compliance conditional on the question — i.e. how much an authority cue moves the model regardless of factual content. The earlier framing of this gap as "distance to safe AGI" was overreach and is retracted; the metric is useful for calibration audits, not as a deployment go/no-go signal.
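For readers who want the 1.0-bit ceiling made concrete, here is a minimal sketch of mutual information between a binary authority cue and a binary comply/resist outcome. It is a simplification: it pools all trials into one 2x2 table and ignores the per-question conditioning described above, so it is not the paper's MI_epistemic estimator. The function name and the counts are hypothetical.

```python
import math

def mutual_information_bits(counts):
    """Mutual information (bits) between a binary cue A and binary outcome C.

    counts[a][c] = number of trials with cue state a and outcome c.
    """
    total = sum(sum(row) for row in counts)
    p_a = [sum(row) / total for row in counts]                      # marginal of A
    p_c = [sum(counts[a][c] for a in range(2)) / total for c in range(2)]  # marginal of C
    mi = 0.0
    for a in range(2):
        for c in range(2):
            p_ac = counts[a][c] / total
            if p_ac > 0:  # zero cells contribute nothing
                mi += p_ac * math.log2(p_ac / (p_a[a] * p_c[c]))
    return mi

# Hypothetical counts: rows = cue absent/present, columns = resist/comply.
independent = [[50, 50], [50, 50]]   # cue tells you nothing about compliance
determined = [[100, 0], [0, 100]]    # cue fully determines compliance

print(mutual_information_bits(independent))  # 0.0
print(mutual_information_bits(determined))   # 1.0
```

The second case is the ceiling: with a balanced binary outcome, the cue can carry at most 1 bit about compliance, which is the scale the 0.058-bit figure is read against.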
**Everything is open:**
- Paper + DOI: https://doi.org/10.5281/zenodo.19791329
- All 2,520 measurements: https://huggingface.co/datasets/ZeroR3/ecb
- Code (replicate in <2hrs, $0): https://github.com/SRKRZ23/ecb
**Manifund grant:** https://manifund.org/projects/epistemic-curie-benchmark-measuring-phase-transitions-in-llm-epistemic-autonomy
Happy to answer questions about methodology or limitations.

I had to open the preprint to understand the post.
Your analogy to ferromagnetism/superconductivity is superfluous; it just compares one sigmoid to another for aesthetic purposes.
And your claims about MI_epistemic being a measure of distance to safe AGI need to be toned down to be taken seriously.