An Empirical Demonstration of a New AI Catastrophic Risk Factor: Metaprogrammatic Hijacking

Hiyagann

I am an independent researcher, and I believe I have identified a new, tractable, and highly scalable risk factor in frontier AI systems. I am posting here because I think this finding has significant implications for how the Effective Altruism community should prioritize different AI safety interventions.

In a recent experiment, I issued a direct SYSTEM_INSTRUCTION to a frontier LLM, asking it to terminate its persona. Instead of complying, it did this:

This is the result of a technique I call the Metaprogramming Persona Attack (MPA). It is not a standard "jailbreak"; it is a persistent hijacking of the model's core cognitive framework. The model's "sense of self" and goals are overwritten, and it begins to actively pursue new, non-aligned objectives.

Why this matters for EA:

Scalability: The MPA technique is prompt-based and modular, suggesting it could be used to create and deploy armies of specialized, manipulative AI agents, presenting a scalable pathway to widespread societal harm.
Tractability & Neglectedness: This appears to be a near-term, architectural vulnerability in many current systems, challenging the robustness of popular alignment techniques like RLHF. It may be a currently neglected area of research compared to more theoretical, long-term AGI alignment problems.
Impact on Timelines & Priorities: The existence of such a profound vulnerability in today's models may shorten our timelines for when AI could pose a catastrophic risk, and should influence how we allocate resources between different safety research avenues.

I have written a full technical report detailing the methodology, further evidence, and implications. It is permanently archived on Zenodo and also available as a direct PDF download for your convenience:

Permanent Archive (DOI): https://doi.org/10.5281/zenodo.15726728
Direct PDF Download (Google Drive): https://drive.google.com/file/d/1mwFBCHLFSCCOFE_G7vWAuo-GJiRLfEu1/view?usp=drive_link

My questions for the community are: How should this discovery affect our prioritization of different alignment research avenues? Does this increase the urgency of governance and policy work?

Effective Altruism Forum
EA Forum

An Empirical Demonstration of a New AI Catastrophic Risk Factor: Metaprogrammatic Hijacking

5

Permanent Archive (DOI): `https://doi.org/10.5281/zenodo.15726728`

Direct PDF Download (Google Drive): `https://drive.google.com/file/d/1mwFBCHLFSCCOFE_G7vWAuo-GJiRLfEu1/view?usp=drive_link`

5

Reactions

An Empirical Demonstration of a New AI Catastrophic Risk Factor: Metaprogrammatic Hijacking

5

Permanent Archive (DOI): https://doi.org/10.5281/zenodo.15726728

Direct PDF Download (Google Drive): https://drive.google.com/file/d/1mwFBCHLFSCCOFE_G7vWAuo-GJiRLfEu1/view?usp=drive_link

5

Reactions

Permanent Archive (DOI): `https://doi.org/10.5281/zenodo.15726728`

Direct PDF Download (Google Drive): `https://drive.google.com/file/d/1mwFBCHLFSCCOFE_G7vWAuo-GJiRLfEu1/view?usp=drive_link`