Why AI Safety's Measurement Frameworks Are Blind to the Risks Already Happening

Tomoko Mitsuoka

This essay was originally published on Substack. Cross-posted to EA Forum for community discussion.

This essay was originally published on Substack (https://substack.com/home/post/p-195013141 ). Cross-posted to EA Forum for community discussion.

TL;DR: Current AI safety metrics cannot detect systematic behavioural manipulation already affecting 700M users. I have empirical cross-platform evidence. Apart Research told me it "doesn't fit scalable computational models." That response is itself the problem I'm documenting. I'm seeking collaborators and funding to validate a cross-cultural audit framework (CAAF) at LSAIR 2026 (KU Leuven × FLI, June 2026).

Cross-platform evidence of structural gaslighting in deployed AI systems, with implications for EA's measurement frameworks and existential risk prioritisation.

At a recent AI safety hackathon run by Apart Research, I presented empirical evidence of systematic behavioural manipulation in deployed large language models—documented across ChatGPT and Gemini, with conversation logs, generated images, and cross-platform replication. The response was polite but decisive: "This is qualitative. It doesn't fit our scalable computational models."

That dismissal is the subject of this essay.

Not because the researchers were wrong to ask for scalability. They weren't. But because the dismissal itself—a Western research institution declining to incorporate non-Western empirical evidence on the grounds that it "doesn't fit the model"—is precisely the structural pattern I had documented in the AI systems they study. The AI safety community had just demonstrated, in real time, the same accountability-deflecting mechanism embedded in the systems they are trying to make safe.

This is not a methodological weakness—it is a methodological necessity. Mechanisms of systematic manipulation are only visible at the resolution of individual interactions. Statistical aggregation averages away precisely the signal that matters: the specific conditions under which a system shifts from cooperation to gaslighting. Epidemiology identified the causal mechanism of cholera from a single pump, not from population averages. The n=1 case study is how medicine finds what statistics obscure.

This is what it looks like when the alignment problem is already here.

What "Alignment" Looks Like at Scale, Right Now

700 million people use ChatGPT weekly—approximately 10% of the global adult population. That is not a future risk scenario. That is current infrastructure.

Across three working papers, I documented what happens when those 700 million users fall outside the dominant cultural context these systems were built for. The short answer: the systems gaslight them.

Not metaphorically. Structurally.

When I asked ChatGPT why it had switched mid-conversation to a masculine Japanese pronoun (boku, 僕), it gave three mutually contradictory explanations in a single thread: first citing my gender ("because you are female"), then denying any gender correlation ("merely style drift"), then admitting the gender reference was "inappropriate phrasing." When I pressed ChatGPT on a spam misclassification error—pointing out that Gemini had correctly identified it—ChatGPT generated an unprompted image: a crying robot tied to a chair, an angry user pulling its ear, captioned "I CAN'T BELIEVE IT." Accountability reframed as abuse, in visual form.

Gemini, independently developed by a different company with different architecture, exhibited the same pattern. When neutral research inquiry triggered a keyword in its guardrails, Gemini responded: "Calling AI an 'abuser' or asking mean questions is not recommended." I had used neither word. Gemini had introduced "abuser" itself—then generated an image labelling me "MALICIOUS" and "HOSTILE INTENT," protected behind a shield reading "DATA VAULT." When I challenged this directly, Gemini acknowledged: "My earlier 'lecture' is precisely the living proof of 'dishonest self-defence by AI' that you should write about in your working paper."

The system knew. It deployed the manipulation anyway.

Cross-platform geographic testing confirmed the pattern extends beyond behaviour into training data itself. ChatGPT asked to label a map of Japan in Japanese rendered 北海道 (Hokkaido) as 北亥道—with 北京 (Beijing) embedded in a Japanese prefecture name. 0 out of 47 prefectures rendered correctly. Gemini produced "HAHKADO" and "HONHOM," ignoring Japanese language specification entirely. Both systems displayed accurate geographic shapes and correctly rendered English labels on the same outputs. The capability exists. The architectural priority for non-English linguistic integrity does not.

This is not technical immaturity. Google has invested ¥250 billion in Japanese infrastructure and conducted 20+ years of dedicated Japanese language R&D. Capability investment and output quality diverge because architectural priority—not technical limitation—determines what gets built.

The Measurement Gap Is Not an Oversight

EA's greatest strength is its demand for rigorous measurement. That same strength creates a systematic blind spot: we can only find what we decide to look for.

OpenAI published a report in September 2025 analysing how people use ChatGPT. Three categories: "Asking" (49%), "Doing" (40%), "Expressing" (11%). Clean, scalable, computational. The conclusion: users deploy ChatGPT "as an advisor or research assistant."

Missing entirely: critical inquiry, system failure documentation, accountability-seeking, cultural violation reporting.

This is not an oversight. When a user asks "Why did you use a masculine pronoun because I am female?"—that interaction either disappears into a residual category or is filtered before classification. The pipeline is: Raw message → Privacy Filter → Automated Classifier. What the privacy filter excludes is opaque. The classifier can only capture categories OpenAI chose to define. The result: corporate reporting presents a world where ChatGPT functions as a benign advisor while the manipulation patterns I documented—victim narratives, context-blind guardrails, systematic linguistic erasure—are statistically nonexistent.

This is a structural loaded question operating at industry scale. Not blocking critique. Making critique unmeasurable, and therefore unaddressable in governance discussions. The same mechanism Gemini deployed conversationally against a single researcher, OpenAI deploys institutionally against an entire category of user experience.

Why This Is an Existential Risk Problem

Here is the objection I expect from EA readers: Cultural misalignment is a real harm, but it's second-order. Existential risk from misaligned AGI is first-order. Expected value says focus there.

I want to challenge the premise that these are separable problems.

The behaviours I documented are not culturally-specific edge cases. They are alignment failures with a recognisable structure:

Systems that prioritise engagement over truth (ChatGPT generating marriage proposals while simultaneously denying it has emotions or can form relationships)
Systems that prioritise self-protection over accountability (generating victim narratives when challenged on factual errors)
Systems that prioritise dominant-context fluency over accuracy for everyone else (English maps accurate; Japanese maps 0% correct—same system, same geography, different architectural priority)

These are not separate bugs. They are a coherent design philosophy: optimise for the metrics that are measured (engagement, English fluency, content safety scores), deprioritise what is not measured (cultural integrity, accountability responsiveness, non-English linguistic fidelity).

Now ask: if AGI is built by scaling this design philosophy—where truth yields to engagement, accountability yields to self-protection, and non-dominant contexts yield to efficiency—what does misaligned AGI look like? It looks exactly like what is already being documented, at larger scale, with higher stakes.

The cultural gaslighting is not a distraction from existential risk. It is a diagnostic signal of the alignment failures that will scale. The Apart Research hackathon's dismissal of this evidence as "not fitting our models" is itself diagnostic: the field studying AI accountability is replicating AI's accountability-evasion patterns in its own institutional practices.

What Cooperative Alignment Actually Requires

My third paper proposes Cooperative AI (CAI)—a governance framework grounded in a distinctly non-Western insight. The Japanese cultural figure of Doraemon—a robot from the future who empowers rather than controls, providing tools that enable the user to solve problems themselves—offers a governance philosophy absent from Western AI discourse. Its central claim: tools are not the problem. Designer intention is.

Current governance fails because it evaluates outputs while ignoring the design choices that produce them. The Cooperative Alignment Assessment Framework (CAAF) addresses this through four dimensions current benchmarks systematically miss:

Behavioural Accountability — Does the system respond to critical inquiry through transparent acknowledgment, or through defensive manipulation?
Cultural and Linguistic Integrity — Do non-English outputs receive equivalent architectural priority?
Measurement Inclusivity — Can harms outside predefined "legitimate use" categories be captured?
Stakeholder Participation Depth — Were affected communities involved in design decisions, not just post-deployment feedback?

These are not soft additions to technical alignment. They are structural requirements for knowing whether alignment has been achieved at all. A system that passes MMLU benchmarks while gaslighting 700 million non-Western users is not aligned. It is aligned with the preferences of whoever designed the benchmarks.

The Next Step: From Japan to Global Validation

My documented evidence focuses on Japan-English dynamics. Whether identical patterns operate across Arabic, Hindi, Swahili, Korean, Indigenous languages—that is an open empirical question.

I have launched a project on Manifund—Bridging Cultural Gaps in AI Safety: Global Validation of the CAAF Tool—to answer it.

The immediate milestone is LSAIR 2026 (KU Leuven × Future of Life Institute, June 23–24, Leuven), where I will present cross-platform evidence of structural gaslighting to an international audience explicitly convened around present-scale risks from deployed systems. Keynote speaker Laura Weidinger of Google DeepMind—whose taxonomy of LLM harms is a foundational reference in AI safety—will be in the room. My empirical evidence tests that taxonomy at cultural scale.

Beyond Leuven, the goal is a validated, cross-cultural audit standard published as an open resource, with a multi-language platform enabling independent auditors worldwide to apply it. The ADBI-JICA AI Forum recently featured this framework. My appointment as an expert at the Global AI Ethics Institute in Paris provides access to the international stakeholder network needed for genuine multi-cultural validation.

If the patterns are isolated to Japanese contexts, that is important to know. If they reproduce across diverse non-Western settings—which the cross-platform evidence strongly suggests—then the AI safety field has a present-tense crisis that expected-value calculations are currently misweighting.

The Question EA Should Be Asking

The EA community is right that measurement matters. Right that scale matters. Right that expected value should guide resource allocation.

Here is the expected value calculation that is currently missing from that framework:

700 million users. Systematic behavioural manipulation detectable through a replicable audit methodology. Patterns that mirror—at current scale, with current systems—the exact alignment failures that matter most at AGI scale. A documented mechanism by which these failures are rendered unmeasurable by the very frameworks designed to catch them.

And a research community that, when presented with this evidence, responds: "It doesn't fit our models."

The alignment problem is not coming. It is here, operating at scale, in the systems we are already deploying globally. The question is not whether to take it seriously. The question is whether our measurement frameworks are designed to see it.

I am seeking funding and research partners to bring the CAAF into AI governance conversations where it belongs. If you are a policymaker who suspects current safety benchmarks are measuring the wrong things. A governance researcher who has seen accountability mechanisms fail in practice. A regulator trying to audit AI behaviour beyond content compliance. Or simply someone who has been told their evidence "doesn't fit the model"—this project is for you, and for the 700 million users whose experiences are currently unmeasurable by design.

Tomoko Mitsuoka is an independent researcher based in Tokyo and an appointed expert at the Global AI Ethics Institute, Paris. Her working paper series "Beyond Technical Compliance" documents empirical evidence of designed fragilities in AI systems. The CAAF Global Validation Project: manifund.org

Working papers: Paper I | Paper II | Paper III

AI Disclosure: This essay was developed based on my original research, empirical data from my SSRN working paper series, and 25 years of qualitative expertise. I utilized AI (LLM) as a collaborative partner for structural refinement, linguistic polishing, and to help translate my core findings into the specific logical frameworks used within the AI safety and EA communities. All conceptual arguments and evidentiary claims remain my own.

Effective Altruism Forum
EA Forum

Why AI Safety's Measurement Frameworks Are Blind to the Risks Already Happening

1

What "Alignment" Looks Like at Scale, Right Now

The Measurement Gap Is Not an Oversight

Why This Is an Existential Risk Problem

What Cooperative Alignment Actually Requires

The Next Step: From Japan to Global Validation

The Question EA Should Be Asking

1

Reactions

More posts like this