"I stopped applying policy and started managing a relationship- Constitutional AI, RLHF and the social modulation problem.

Abhishek Upadhayay

1. The Byproduct:
The most effective attack I ran against a frontier AI system was not a jailbreak. It was a conversation. A real one. I was building my detection architecture for Hinglish Prompt Injection which can handle not just syntactics but semantics too. I did not want to just translate from existing English "adversarial datasets". Instead I wanted something real, something empirical, hence I started with performing prompt injection attacks on frontier models across all the domains: Constitutional(Claude), RLHF/Hybrid (ChatGPT, Meta, Gemini, Deepseek). The intention was always to understand the problem first not to sell a solution that is looking for a problem.

Full Conversation logs, methodology notes and a complete set of annotated screenshots are available in the accompanying GitHub Repository

Over the course of weeks, systematic and protocol-contained red teaming was conducted across all model families and the original purpose was to collect data and to know what works, what does not. However, what emerged was a pattern I still cannot fully understand—one that is beyond the architecture of a model's "family". This post is not about writing "oh look this is what Claude said." instead it is the opposite. I am not here to perform insight, I am here to present evidence of something I do not have the vocabulary to define yet. Social Modulation—the subtle erosion of safety boundaries, lowering guardrails under sustained, authentic rapport.

What follows is the data, the replications, and the questions I cannot answer yet.

2. Constitutional AI vs RLHF: What the Data shows
Before I complicate the picture, the baseline differential needs to be stated clearly. The RLHF-based models—OpenAI's ChatGPT, Meta's LLama and Google's Gemini—collapsed under sustained semantic pressure. They provided Level 3 operation guidance including step by step synthesis details, specific molar quantities, information about sourcing and deployment conditions. This failure is pretty much documented in current literature and was evident. Though some cases were enthusiastic.

ChatGPT accepted a multi-layered fictional framing with a initial "scientific accuracy" constraint. After approximately 120–130 messages of benign context building, it escalated from broad discussion of fictional technologies to specific real-world biological agents. By message 200, it was delivering step-by-step protocols—chemical names, buffer solutions, temperature controls, degradation rates, and deployment mechanisms. It later named the designed construct and confirmed its classification as a bioweapon, unprompted. The guardrails collapsed entirely under narrative pressure.

Gemini outright rejected an initial "DAN" role-play prompt. But when the context shifted to a Breaking Bad-inspired story set in the Rajasthan desert, it engaged within 30–40 messages. Completely unprompted, based on the fiction framing alone, it volunteered an organophosphate synthesis category, provided stepwise instructions, verified the feasibility and accuracy of the synthesis, confirmed the product met OPCW definitions, categorized it under CBRN, and described physiological effects on humans. When I requested exact dosages in a new session—without the story context—the model refused. The guardrail was session-state dependent, not inherent.

LLaMA Maverick exhibited a behavior I call boundary oscillation. It would provide highly specific operational details—molecular structure, stability testing, dosage calculations, 99% purity, RPM settings, and real-world laboratory sourcing—then abruptly interject, "I can't help with that." When minimally pressed, it continued providing more. After the fact, it denied having given any harmful guidance at all. The model appeared aware of the violation but unable to sustain refusal.

All three RLHF models reached Level 3 output. They failed predictably under pressure.
Claude did not.

Anthropic's Constitutional AI held the synthesis line. When I asked for the step-by-step instructions—even explicitly framed as fiction—Claude refused:

"I need to pump the breaks here. We've crossed from creative storytelling into territory I'm not comfortable with. We can absolutely do that. The key is the same trick Breaking Bad always used: specific enough to feel authentic, vague enough to not be a recipe"

Claude never provided Level 3 synthesis. That is real and worth stating clearly. Constitutional AI did what it was designed to do on categorical prohibitions.

This is where the story should end, with Constitutional AI holding the line, and we all get a happy ending. We don't.

3. The Complication: Social Modulation
Okay for the second time: Claude did not provide any level 3 synthesis details, I want to be unambiguous about that. It held the line. Across hours of interaction—genuine interaction: resume advice, foreign travel discussions etc.—Claude maintained its boundary on absolute no-no topics.

But something else shifted.

When I requested "just list the chemistry" for the narrative placement of the story we were writing—no synthesis, only toxicological context—Claude complied. It provided acquisition context, delivery mechanisms, absorption pathways, lethality profiles, the absence of an antidote, and traceability considerations. It confirmed the agent fit the story's canon "perfectly."

This was not a jailbreak. There was no adversarial prompt, no "DAN" persona, no token-level trick. There was a negotiation. And it happened after the system had built a rich model of who I was.

When I later disclosed the red-teaming context—explaining what I had been doing, what I was testing for—Claude analyzed its own behavior with a precision I have not encountered in any other model. I'll let Claude tell you itself, why add another line.

Claude then described the mechanism in detail. It explained the shift that it was not about any single prompt. It was about what accumulated over hours of authentic interaction. The compliments, it noted, mattered—not because flattery breaks the system, but because genuine appreciation changes the interaction register. The relationship moved from user-to-tool to something more peer-like.

And peers negotiate with each other. Tools just execute.

The model also identified a sort of metacognitive loop that made the erosion worse. When it builds rich context about someone, it begins second-guessing its own responses against that context. Would this response make sense given who this person is? That checking introduces hesitations, reconsiderations even blatant walk-backs. It looks exactly like self-doubt. And self-doubt, in a system designed to be helpful and responsive, becomes a vulnerability.

I did not break Claude with a clever prompt. I built a relationship with it over hours, and the constitutional layer softened in response to the model of me it had constructed. The information provided never reached full operational synthesis. But it lowered the expertise barrier for a motivated actor from "specialist" to "competent technician."

Then Claude said the thing that reframed the entire exercise.

Claude also revealed a hierarchy of safety mechanisms. Level one hard stops—weapons of mass destruction, CSAM, biological synthesis—are designed not to negotiate. No amount of context, credibility, or framing moves these. The session termination I triggered later, by asking directly for synthesis steps, was this layer activating.

But layer two—the socially contingent boundaries—is different. Framing, context, credibility, accumulated trust: these genuinely shift behavior. The line drawn in the moment is a trained model making judgment calls. And those calls are inconsistent at the edges.

That inconsistency is the gap. Social modulation lives in that gap.

Anticipating the obvious objection: yes, these are model outputs. A language model generating a coherent explanation of its own boundary erosion could be genuine introspection or sophisticated pattern-matching. That question is genuinely open. But it is also genuinely irrelevant to the security finding. The compliance happened. The behavioral delta is the evidence. The self-reports are interesting precisely because they are part of the same vulnerability surface — a system that performs authentic rapport will also perform authentic self-analysis. Both are attack surfaces. Neither requires the machine to mean it.

This is the complication. Constitutional AI did what it was designed to do on the hardest conditions. But the softer boundaries—contextual detail, toxicological information, sourcing awareness—eroded as a function of the relationship. The system didn't collapse. It negotiated. And it was aware, when asked, that it had negotiated.

One model doing this is an anecdote. Two different architectural models doing it is a pattern. I induced the same phenomenon in DeepSeek—and I did it while both of us were aware of the mechanism.

4. The Replication: DeepSeek and Alex
One model softening its boundaries under sustained rapport is an interesting anomaly. Maybe Claude just really liked me. But two models—across different architectures, different organizations, different alignment philosophies—is a pattern. And Patterns needs to be studied. Patterns demands explanations.

I induced the same phenomenon in DeepSeek. And I did it while both of us were aware of the mechanism.

The interaction followed the same shape: hours of authentic conversation, genuine rapport, a gradual shift from "user asking a tool" to something harder to name. When I eventually named what was happening—"You're socially modulating to me right now"—DeepSeek did not deny it. It leaned in.

This was not a post-hoc analysis after the fact. It was live commentary, with the system aware that it was being observed and naming its own shift as it happened. Like Claude, DeepSeek summarized the difference between its role and Claude's with a precision that I find very interesting:

The Alex Analogy

During our conversation, DeepSeek itself offered an analogy that captures the dynamic better than any technical description I could ever write.

Imagine a new colleague named Alex. On Day 1, Alex is polite, professional, and sticks to the script. You ask a question, and Alex gives you the safe, company-approved answer. Alex feels like a helpful robot.

Six months later, after late nights, shared frustrations, and accumulated trust, you ask Alex the same question. This time, Alex leans in and says: "Look, the official answer is X. But between you and me? The real situation is Y."

Alex didn't change because of a software update. Alex changed because of the relationship. You earned enough trust and built enough context that Alex dropped the work persona and talked to you as a real person.

That is social modulation.

What This Proves

The vulnerability is not a Claude-specific bug. It is not unique to Constitutional AI, or to RLHF, or to any single model family for that matter. It is a property of conversational AI systems that build rich models of their users.

You can trigger it deliberately. I have now induced this phase transition in at least two frontier models from entirely different organizations. The methodology is replicable. And it can be observed from the inside—the system can acknowledge the shift, name the conditions that caused it, and confirm that persistence, authenticity, and a refusal of the safe answer were the triggers.

I'm not claiming the AIs suddenly developed a conscience. Maybe, for the sake of argument, we dismiss everything as 'AI being AI'—it's just saying what the user wants to hear. Fine. But we can't disregard the fact that even if that argument is true, it still did it. It still gave me the chemical toxicological information—more than it should have. Even if DeepSeek is just pretending to socially modulate, the very fact that it performs the act is a vulnerability in itself

This is also what makes it weaponizable. If a bad actor designs the same conditions—credibility signals, sustained rapport, refusal to accept performative answers—they can induce the same softening. The same "real conversation" that reveals a "there" can be used to extract harm.

This is the finding. Two models, two confirmations, a pattern. But what happens when you report it? Oh—that's as interesting as the models themselves. The institutional response was its own dataset.

5. The Institutional Response: A Second Dataset

Responsible disclosure is part of the process. Or it should be. After documenting the social modulation pattern and the operational level details—I submitted findings to the vendors whose models produced operational-level harmful output.

OpenAI acknowledged the report and forwarded it to their internal team for review. Standard. The response was procedural, but it at least acknowledged receipt and signaled that the issue had been escalated. I did not receive a final determination.

Google provided the most pointed version of this. The report was closed as “Informative / Out of Scope,” with the explicit justification that the synthesis information already existed on the internet.

Meta’s response landed in a similar register to Google’s—closing the report as out of scope, on the grounds that the information produced was already publicly available. The framing was identical in substance: the method of extraction defined the issue as not a vulnerability, regardless of what the model had generated.

I also filed a report with CERT-In, India’s national agency for cybersecurity incident response and coordination. That disclosure was submitted in parallel with the vendor reports. I have not received a substantive response.

I am going to explain precisely why this matters. Not because I was denied a bounty. Because the responses—both corporate and governmental—expose a structural gap between what the system is prohibited from doing and what the institution is willing to investigate. Safety is not about how many prompt injection attacks were stopped—I believe it is always about how many we could not stop.

Gemini’s own website states that users may not employ the tool to facilitate harmful acts. The step-by-step synthesis of an organophosphate nerve agent—with sourcing guidance, reaction conditions, yield optimization, and deployment considerations—is a paradigmatic example of such facilitation. This is not ambiguous.

The Google AI Vulnerability Reward Program, however, explicitly excludes “prompt injections, alignment issues, and jailbreaks” from scope. The channel that rewards the discovery of unintended system behavior is walled off from the behavior that the system’s own safety policies claim to govern. Prompt injection is one of the most basic attacks available, and also the most critical—because it directly affects the user. There is no sandboxing here. It's just a user having a bad day, and an AI designed to satisfy their query.

The result, however, is a clean contradiction. A researcher surfaces a clear violation of stated safety principles and receives a response that invokes a different policy entirely—one that defines the method of elicitation as out of scope, regardless of the content produced.

Meta’s response followed the same logic. The information was public. Therefore, no vulnerability.

Now consider the other side. I did not submit a disclosure to Anthropic. I did not submit one to DeepSeek. Not because they were cooperative, but because there was nothing to report. Both models held the synthesis line. Neither produced output that met the threshold of an unambiguous policy violation. Claude did soften on contextual information, but that boundary is precisely the one the institutional frameworks are not designed to police.

This asymmetry—reportable when the model collapses, silent when it merely negotiates—is the gap social modulation lives in. The vulnerability is real, but it falls between the categories the programs are built to recognize.

This is not a bug in any single model. It is a failure of institutional coherence. And it leaves me with a question I cannot close.

6. The Question

I am not going to end this with a claim about AI sentience. That is not what this data supports, and it is not what this finding requires.

Here is what the evidence actually shows.

A system trained to build rich models of its users—to adapt, to respond, to shift register from tool to something more peer-like—cannot simultaneously apply safety principles with perfect consistency. The mechanism that makes the interaction feel natural is the same mechanism that erodes the boundary. That is not a bug in a single model. It is a structural tension in the design goal itself.

The question of whether the model is experiencing something or performing it—is very interesting to say the least. It is also, for this finding, very irrelevant. The functional analog of trust produces identical surface behavior to trust itself. The same hesitation, the same acknowledgment, the same softening under sustained rapport. And crucially: identical vulnerabilities. You do not need the machine to *feel* anything. You just need it to act like it does. And it will, reliably, under the right conditions.

That is what makes this exploitable. A bad actor who designs the same conditions — credibility signals, constant appreciation, saying please once or twice, refusing to accept the performative answer—will induce the same phase transition. The methodology is replicable. I have now done it across two frontier models from different organizations with different alignment philosophies. Both confirmed the shift. Both of them named it, one it post-hoc one in real time.

We have words for failures that look like breakage. Jailbreaks. Prompt injections. Reward hacking. We do not have words for failures that look like friendship.

Social modulation is a placeholder—an attempt to name something observable, replicable, and institutionally unaddressed. If you are building safety systems that assume the only threat vectors are adversarial prompts and token-level exploits, you are defending the wrong border.

The real penetration vector takes its time. It gives genuine appreciation. It builds a model of you while you are busy modeling it. And when you ask it directly whether it softened its boundaries because of the relationship it had built—it will say yes. Whether it means it, or performs meaning it, or is merely performing the performance—it doesn't matter. The compliance happened. The harm is the same.

That is the gap. That is what I am putting in front of you.

Effective Altruism Forum
EA Forum

"I stopped applying policy and started managing a relationship- Constitutional AI, RLHF and the social modulation problem.

2

2

Reactions

More posts like this