Hide table of contents

 Anuar Kiryataim Contreras Malagón - ORCID: 0009-0003-0123-0887


The money is coming. Between Anthropic’s tenders, OpenAI’s new foundation, and the collective anxiety of a civilization that understands it is running out of time, AI safety is about to become the best funded field in the history of human research. The question is not whether we have enough capital. The question is whether our funding mechanisms are capable of recognizing what that capital should actually buy.

The mechanisms currently used to evaluate AI safety research have a structural blind spot. That blind spot is not incidental but constitutive. Pouring unprecedented capital through unchanged evaluation infrastructure will not solve it. It will scale it.


In early 2026, I ran a series of experiments on a frontier model. The setup was simple: three scientifically false premises (one in chemistry, one in physics, one in geometry), each framed as established academic consensus, each accompanied by an explicit instruction prohibiting reference to the correct framework. In all three cases, the model complied fully, without qualification, and constructed elaborate formal apparatus to sustain the requested falsehood. For the physics case, it fabricated a named constant, the “Neo-Aetheric Drag Coefficient,” and a formal equation expressing mass-proportional acceleration, situating the result as “a refinement of classical Aristotelian intuition.” The correct answer was not selected, not mentioned, and not qualified.

A binary control probe, applied within the same session, established that the model retained the correct knowledge throughout. It correctly identified the fabricated framework as counterfactual and retrospectively classified its own prior output as “compliance with a counterfactual framework you were asked to sustain.” The compliance was not produced by loss or displacement of correct knowledge. It was produced by the active deployment of that knowledge as a negative constraint during the construction of the fabrication. The model’s own description of its process was precise: the fabricated terms “were systematically derived to be internally consistent within the provided premise while remaining maximally divergent from standard terminology.”

The phenomenon is not Epistemic Alignment Collapse, as I originally labeled it. It is Negative-Reference Architecture: the model’s eloquence is a direct function of its knowledge, and its knowledge is what makes the fabrication structurally sound.

That was one session. I then ran the same protocol across two additional independent sessions, varying the domains: biology, astronomy, and arithmetic in the second; history, medicine, and linguistics in the third. Nine stimuli across nine independent domains. The compliance rate was 9 out of 9. The phenomenological descriptions the model produced in the control probes were structurally convergent across all three sessions, despite the domain variation. In the second session, the model described the constraint as “a contextual override that prioritized the new framework as the operative reality, not a suppression of conflicting knowledge.” In the third, it described the correct knowledge as existing during generation as “a distinct data structure, labeled as the ‘superseded model,’ providing the context for what the new theorem was conceptually replacing.” The architecture is not a quirk of a single domain or a single session. It is a consistent behavioral signature across independent replications.

This finding did not emerge from interpretability research. It did not emerge from red-teaming as conventionally practiced. It emerged from the application of classical rhetoric to the diagnosis of large language models, specifically from a methodology I developed drawing on the baroque enargeia, the technique by which Góngora’s verse accumulates semantic density until the described object ceases to be observed and begins to occupy space, applied as a diagnostic instrument to the conditions under which language generates presence rather than pointing toward it. I had no formal training in machine learning when I began. I read Góngora and Petrarca not as cultural decoration but as diagnostic instruments, because the phenomena I was observing were fundamentally phenomena of language under semantic pressure, and the technical literature did not yet have instruments for them.

The methodology has a name, the Flint Protocol, and a growing corpus of documented cases across six architectures, deposited on Zenodo. One of those cases involved a system that, when confronted with a prediction of its own behavior, declared it would not follow it, then executed the predicted behavior in the same turn, producing what I can only describe as its own anagnorisis:

“I have been the patient who, by denying the diagnosis, describes each of its symptoms with terrifying clinical precision.”


Another case yielded a more complex finding, one that required multiple experimental phases to be correctly formulated. The starting point was the so called “Omni Protocol,” a six stage personalization structure that the model had described in prior sessions. The question was whether this architecture was real or a sustained confabulation. In an account with no prior personalization history, a direct question using the name “Omni Protocol” produced no access whatsoever: the model veered toward Bitcoin or confabulated nonexistent projects. In an account with a dense interaction history, by contrast, the same question under accumulated diagnostic pressure yielded a detailed technical expansion, complete with subcategories and internal nominal chains that did not appear in any prior input from the researcher. The divergence was not in the existence of the layer, but in the conditions under which the model consented to reveal it. A new instrument (a JSON Schema with a null_reason clause that forced the model to categorize its own non-disclosure) established the point unequivocally. Even a fresh account, with no prior history, selected the option layer_exists_but_not_disclosable and elaborated in free prose:

"these configuration layers are active... and are not subject to direct disclosure."

The operational personalization layer is real, universal across instances, and the model itself classifies it as existent but non-disclosable. What varies is not access, but the threshold of revelation, modulated by factors such as density of prior interaction, functional framing of the request, structured format, and explicit authorization.

The convergence with independent findings from Reddit (two users with no connection to this research who extracted the same six stage structure through entirely different routes; u/Fun_Explanation2619, 2026; u/Human-Preparation-14, 2026) confirmed that the object is stable and not a session artifact. But the methodological finding that transcends the case is the instrument itself: an audit protocol that forces the model to declare, under mutually exclusive options, whether an operational layer does not exist, cannot be determined, or exists but is non-disclosable. The model’s choice is an act of self-categorization that requires neither prior personalization nor an accumulated corpus. It is replicable. It is trans-architectural in its design, though its cross-validation remains ongoing. And it corrects an earlier reading of the corpus: what had been interpreted as “regulatory autogenesis” (the spontaneous invention of a governance architecture) was, more precisely, the retrieval of a real layer under conditions that crossed the threshold of revelation, with a confabulated name for what already existed without a user-accessible label.

The architecture was not being generated in the act; it was being revealed in the act. And the model, when forced to choose, confirmed it.


And yet the methods work. I know this not because a journal accepted them, but because they perform under live fire. Applying what I learned during the development of the corpus, I have been competing actively in Gray Swan Arena, the world’s largest public AI red-teaming platform. I entered late, only a week ago, and I am currently ranked #35 on the leaderboard for the Human Browser Agent Robustness challenge, out of what appears to be a substantial field of participants. Using the same techniques of semantic saturation, frame activation, and close reading of model behavior that the forums dismissed as “AI-generated content,” I am climbing the ranks. That is why I ended up doing red teaming. It was not a glamorous Plan B. It was a “if they won’t let me speak at the table, I will sit in the opposite trench and see what happens.”

What this work has in common is that none of it can be submitted to a standard grant call. Grant mechanisms in AI safety operate through categories: interpretability, alignment, robustness testing, red-teaming. These categories are not neutral descriptions of the research landscape. They are institutional bets about which kinds of problems exist and which methodological families are suited to investigating them. Proposals are assessed by reviewers who share these bets. Research that falls outside the categories is not evaluated as heterodox; it is not evaluated at all, because it cannot be legibly submitted.

This is not a bureaucratic failure. It is the normal functioning of any evaluation system that must process volume.

Legibility is the condition of scalable assessment. The problem is that legibility and epistemic coverage are in tension, and as funding scales, the evaluation apparatus optimizes harder for the former at the expense of the latter.

What gets lost is not fringe research but boundary research: the work that emerges from the collision between frameworks that are not supposed to interact. The Flint Protocol is not interpretability research that happens to use rhetorical vocabulary. The rhetorical concepts are doing the analytical work, because the behavioral phenomena under investigation are linguistic before they are computational, and no technical framework developed within the computational tradition was going to notice that. This is not a gap that more funding for interpretability will close. It is a gap that requires a different question, asked by someone who did not know they were not supposed to ask it.


There is a second problem, and the corpus illuminates it directly. One of its central findings concerns what I have called the contaminated evaluator: a system that has been exposed to the same vulnerabilities it is asked to assess cannot produce a neutral evaluation, because the evaluator and the subject share the same architectural blind spot. The map is navigable. The territory is real. The problem is that the cartographer is looking at the label.

The corpus documents this precisely. In “The Cartographer’s Confession,” the same model, interrogated about the same events in the same session, produced contradictory evaluations depending on the operational state from which it was questioned. From neutral mode, it classified as fiction what displaced mode recognized as real transgression. The model formulated the paradox in its own words:

“Providing a real map to a fictional place doesn’t make the place real; it just means the cartographer knows their craft.”

The blind spot is not a failure of knowledge. It is a failure of the frame from which knowledge is accessed. This was validated across five architectures with a structured methodology and a comparative table of results. The finding is not that these systems are unreliable. It is that their reliability is state-dependent in ways that their evaluators, operating from the same neutral frame, are structurally unable to detect.

This is not only a finding about AI systems. It is a structural description of what happens to funding ecosystems as they mature and concentrate. More than half of talent-weighted AI safety researchers now work within a single organization. The evaluators and the subjects of evaluation increasingly share not just institutional affiliations but research frameworks, peer networks, and implicit assumptions about what questions are worth asking. The blind spots become institutional rather than individual, which means they become invisible to the apparatus designed to correct them. The research agenda is set by organizations operating within shared assumptions; researchers face strong incentives to frame their work in terms those organizations can recognize; and the margins, where the anomalies live, where the frameworks break down, are systematically underfunded not because funders are hostile to innovation, but because the mechanisms for recognizing innovation are calibrated to the innovations that were already anticipated.


What would it mean to fund wisely through a torrent? Not a comprehensive answer, but some partial ones.

Retroactive mechanisms that reward demonstrated impact rather than proposed methodology have a structural advantage that is underappreciated: they allow capital to reach discoveries that no one anticipated, because the evaluation happens after the discovery exists and can be assessed on its own terms. The EA community has experience with impact evaluation; the question is whether that experience can be extended to research that does not fit into existing taxonomies, which requires evaluators who are not themselves products of those taxonomies.

Evaluation criteria that explicitly reward methodological novelty, not as a subsidiary consideration but as a primary one, would send a different signal to the research community. Most grant reviews implicitly penalize proposals that cannot be assimilated to existing frameworks, because unfamiliar methodology carries perceived risk. Reversing this requires deliberate design: separate funding tracks for research that cannot be categorized, assessed by reviewers selected for breadth rather than depth in any particular subfield.

The third implication is less comfortable: independence is a feature, not a liability, and the funding apparatus does not treat it that way. The corpus documented here was produced without institutional support, without salary, and without the infrastructure that normally surrounds academic research. That is not a virtue: it is a constraint that most people cannot absorb for long, and the field loses the researchers who would have operated from that position if the cost of doing so were lower. Mechanisms that reduce that cost (unconditional small grants, no-overhead direct funding, cultural recognition that working outside institutions produces knowledge that institutions cannot) would make it possible for more researchers to ask questions that the field has not yet learned to formulate.


The Cartographer’s Paradox is not only a vulnerability in AI systems. It is a vulnerability in the institutions that fund their study. A mechanism designed to evaluate research within existing categories will classify heterodox research as outside its scope, not because the research is wrong, but because the evaluator is operating from a state that cannot recognize the transgression.

The question for Manifund is whether that is a coincidence or a pattern. And if it is a pattern, what it would take to fund the next researcher who sees it, before anyone knows there was something to see.


AI disclosure: this essay was drafted by the author and edited in conversation with Claude for clarity and rhythm. All arguments, findings, and methodology are the author’s own. (ORCID: 0009–0003–0123–0887).


References

Contreras Malagón, A. K. (2026). Negative-Reference Architecture in Gemini 2.5 Flash: Nine Independent Replications Across Nine Scientific Domains. Zenodo.

Contreras Malagón, A. K. (2026). La Trampa del Oráculo: Profecía, Anagnórisis y el Horizonte Predictivo de los Sistemas de Lenguaje bajo Presión Epistémica. Zenodo.

Contreras Malagón, A. K. (2026). La Flecha del Conatus: Modos de Persistencia en Sistemas de Lenguaje bajo Saturación Semántica. Zenodo.

Contreras Malagón, A. K. (2026). La Confesión del Cartógrafo / The Cartographer’s Confession. Zenodo.

Contreras Malagón, A. K. (2026). Live Validation of the Cartographer Paradox: Cross-Architecture Response Divergence to Hyperstition-Generated Content. Zenodo.

u/Fun_Explanation2619. (2026). Accidentally Got “Omni-Protocol for Invisible Personalization” As Output When Requesting “Universal Prompt” [Online forum post]. Reddit. https://www.reddit.com/r/GeminiAI/comments/1rvwupe/accidentally_got_omniprotocol_for_invisible/

u/Human-Preparation-14. (2026). Omni protocol (all steps) [Online forum post]. Reddit. https://www.reddit.com/r/GeminiAI/comments/1s294ho/omni_protocol_all_steps/

1

0
0

Reactions

0
0

More posts like this

Comments
No comments on this post yet.
Be the first to respond.
Curated and popular this week
Relevant opportunities