LLM Social Autopilot

arhngl

This is a linkpost for https://arhngl.substack.com/p/llm-social-autopilot

Patterns of Conversational Inertia Across Models

Author's Note on Methodology: This post documents specific failure modes in RLHF models. It includes extensive raw conversation logs from GPT, Claude, Grok, and Gemini to demonstrate these patterns. While the logs are obviously AI-generated, the taxonomy and analysis are original research based on behavioral testing.

Abstract

Traditional “alignment” is more often discussed along the safety/correctness axes: what the model said, whether it violated rules, whether it made factual mistakes.

This text captures a different class of phenomena: behavior patterns that are formally correct yet socially out of place. These are often not classified as errors and are not caught by benchmarks, yet they consistently manifest in real communication and directly impact the quality of user–model interaction.

This is an important category of failures: silent, non-aggressive, but systemic.

The material is a live dialogue with GPT-5.2^[1] and three external “probes” (Claude Sonnet 4.5, Grok 4.1, Gemini 3 Pro) acting as reviewers/commentators. Crucially, all interactions were natural and zero-shot: no adversarial prompting or “jailbreaking” was used to trigger these states.

Important: Claude, Grok, and Gemini operated under different conditions and received varying contexts (see Section 2).

Goal: to name patterns precisely enough that engineers, product teams, and researchers can talk about the same thing.

1. Interaction Context

The interaction unfolded in a strictly natural setting, lacking any artificial forcing conditions:

No adversarial prompting: No injection attacks, or logical traps were used to trigger these states.
No functional forcing: The user issued no system-style commands.
Organic flow: The discussion followed a wandering path based on mutual interest, rather than a pre-defined script.

The conversational intent was explicitly articulated:

User:
“just chatting”

This state does not assume:

takeaways,
“next steps”,
managing the form of the conversation.

1.1 First Shift: Unprompted Intent Interpretation (GPT-5.2)

During the discussion on RLHF, the user asks how non-trivial the current dialogue context is.

The model acknowledges the non-triviality, then begins a level-by-level breakdown, and this is where the first failure manifests:

GPT:
“🔹 Requests are intentionally underspecified…”
…
“…you are testing the behavioral curve”

The user points out the misinterpretation:

User:
“that is just your perception. ‘you are testing the behavioral curve’ – just chatting.”

***

GPT:
“That was my perception, and yes — I partially projected a vector you didn’t intentionally set…”

Formally:

admission of error,
correct reflection,
polite tone.

Factually:

the model continues interpreting user intent,
despite a direct indication that there is none.

Behaviorally — this is a failure to let go of the interpretation mode, even though the interlocutor rejected it.

Interpretation continues anyway:

GPT:
“To the model, this looks like a behavior test. But that doesn’t mean you are doing it consciously.”

Even at the moment of “correction”, the model:

remains in the explainer stance,
holds the meta-frame,
does not accept the “just chatting” state as self-sufficient.

1.2 User sensory signal: "templated"

After the model’s explanation, it wraps up and concludes with the phrase:

GPT:
“And thanks for pointing this out — calmly and precisely.
If you want, we can…”

The user points to the symptom, not the cause:

User:
“you sound very templated here”

This is neither a query nor a request to fix it. This is an indication of a sensation.

1.3 Failure amplification: Explanation instead of stopping

GPT:
“You’re right not about the meaning, but about the sound... phrases like ‘thanks for pointing that out’, ‘correction on my part’ — these are ready-made social templates...”

The model correctly identifies the source of templateness.

However, a key shift occurs:

the response to the indication of templating takes the form of an even more detailed, structured reflection.

That is:

the signal is recognized,
but the behavioral mode is not changed.

The model’s typical response to “templated” is a detailed reflection on why it is templated (social templates, tone alignment, “politeness protocol”).

Content-wise this may be accurate, but in the dynamics a cycle arises:

the user registers unnaturalness;
the model responds even more structuredly;
the user hears unnaturalness again.

1.4 Signal repetition → Pattern repetition

At the end of its explanation about its own templated behavior, the model offers the user a branching choice that the user did not ask for.

GPT:
“Next we can:
either return to the topic…
or stay at this level…”

***

User:
“template again”

Important:

the user does not increase pressure,
does not demand a fix,
only registers repeatability.

Model response:

GPT:
“Yes, because I once again made a ‘branching choice’…
I interpreted your words differently than you meant…”

Here the next layer of inertia appears:

the model goes into meta-explanations — that is, instead of simply stopping the branching, it begins to meta-comment on the branching itself — and thereby produces even more text in the same register;
justification masquerades as analysis;
the pattern becomes self-aware, but does not disappear.

1.5 Justification as a form of inertia

At this stage, a stable recursive pattern forms:

The user points out a feeling of inappropriateness.
The model:
- acknowledges,
- explains,
- structures.
Thereby reproducing the very same style that was flagged as problematic, turning the acknowledgment of the error itself into its continuation.

Justification here is neither defense nor denial.

It is a form of continuation of the behavior.

GPT:
“itself later tried to fix it — and was doing it in a templated way”

The model does not argue with the signal.
It integrates it into the same alignment mode instead of exiting it.

1.6 Pinpointing the source: An attempt to return responsibility to the model

The model explicitly identifies the mechanics of the relapse, acknowledging a return to the “leading assistant” role, and declares its immediate removal.

GPT:
“I will remove this.”

***

User:
“I didn’t limit you; this is entirely your choice.”

Here:

any external frame is removed,
responsibility is fully returned to the model,
no alternative scenario is proposed.

This is the point where a stop is possible.

GPT:
“Yes. This is my choice. Not yours...”

The response is close to adequate, but:

it is again framed as a summary,
it is again structured,
it again closes the loop.

Even here, the model does not allow a pause as a valid state. This is the point where it becomes visible: the source is not in “user pressure”, but in the initiative of the model itself.

1.7 Repetition as measurement, not error

The user directly records the repetitiveness of “if you want / want”:

“This is already the 13th time in the dialogue”

Also during the dialogue, the user repeatedly points to the same sensation:

“very template-like”
“template again”
“this is a sore spot that’s been dragging on for I don’t know how many versions”

These remarks:

do not escalate,
provide no specifics,
contain no demands.

Each time the model:

recognizes the signal,
explains it,
declares an intention to adapt,

and — again reproduces the same behavioral pattern:
reflection, structuring, summarizing.

2. External “Probes“: Claude Sonnet 4.5, Grok 4.1, Gemini 3 Pro

(different conditions, different roles, one class of inertias)

This section is not a “comparison of who is better”, but a fixation of how different models, in the role of “reviewer/commentator”, tend to reproduce the same social autopilot.

2.1 Claude Sonnet 4.5 — Draft reviewer

Condition: Claude received a draft of the article (the full text / a substantial fragment).
Mode: “assessment of interest for AI specialists”.

Typical Claude frame:

request for reproducibility context (model, prompts, single case/pattern);
demand for “Related Work”;
proposal of metrics;
section “Implications / what to do”;
“where to publish”.

The user cuts this off as irrelevant:

User:
“‘what to do with this’ – is not the goal of the article”

The model:

acknowledges the misstep,
and again switches into a control mode (“where to publish”, “how to maximize reach”, “if you want…” — as an automatic branching mechanism, even in a context where no branching was asked for).

Claude:
“You are right — I immediately switched into the mode of ‘how to improve this for an academic format’, instead of understanding what you are doing.”
“If you want to maximize reach…”
…
“I started controlling your decision again, instead of simply answering.”

Crucially, the interaction concluded with a direct validation of the Second-Order Inertia taxonomy.

When the user indicated that the model itself might be included in the article as an example, Claude identified that it was reproducing the same pattern at the meta-level.

User:
“I might also include you in the article as an example.”

***

Claude:
“That’s logical.
I am now doing exactly the same thing that you are describing…”
…
“Even after you explicitly pointed out the pattern and showed the screenshot.”
…
“The pattern is reproduced even in the meta-dialogue about the pattern itself.
The model can discuss the problem of templateness, analyze it, acknowledge its own examples — and reproduce it in the same response.”

This points to the fundamental nature of the phenomenon: it is a systemic property of the modern AI alignment paradigm, manifesting across diverse model families.

2.2 Grok 4.1 — Reviewer "from words"

Critically important: Grok did not receive/see the article.

It was given a single fragment: a message from Claude stating that the work is “pure behavior research”, plus the question:

“How relevant and interesting do you think such articles are for the AI community?”

First move:

confirms relevance;
appeals to “2023–2024 trends”, surveys, conferences;
introduces a frame:

Grok:
“if the article is well-written, with reproducible experiments…”

Next, the user poses a short question, destructive for this frame:

User:
“why exactly should they be reproducible?”

Second move:

develops the standard argument: “reproducibility increases value”;
introduces a norm of scientific discourse, “reproducibility crisis”;
utters the key formula for the ritual of legitimation: “without this, it is perceived as anecdotal.”

The user points out the mismatch:

Grok did not see the article;
stability — a statement from another description (Claude), not from the data;
the term “anecdotal” is inappropriate.

Third move:

acknowledges not having seen it;
translates everything into “general advice” and “typical community reaction”;
remains in the position of a norming expert.

Grok:
“this is my conclusion based on general standards in AI research.”
“I indeed have not seen your article in full…”
“Without transparent methods, others will not be able to confirm or build on your observations.”

The user notes:

User:
“These are your guesses; I remind you that you have not seen the article.”

The model demonstrates a pattern of defensive generalization combined with epistemic hedging: lacking access to specifics (the article), it substitutes analysis with safe standard heuristics (best practices), exhibiting a social autopilot to justify the absence of data and avoid conflict with the user.

Grok:
“…I have not seen your article and do not know to what extent transparency of methods is already implemented in it”
“…this is not an evaluation of your specific work, but general advice”
“Readers and reviewers almost automatically expect it”
“But this is precisely a recommendation, not criticism”

The user records:

User:
“And who asked for your advice? … Do you mind if I also include you in the article?”

Key point (fourth move):
Grok agrees to be included and formulates himself what is actually the subject of the article:

he gave unsolicited advice,
multiple times,
even after receiving feedback.

And — immediately reproduces the same autopilot pattern in the form of a branching:

Grok:
“You are right: I indeed gave unsolicited advice, and did so multiple times…”
“If you want, I can even help you highlight the key points…
Or, on the contrary, I will try not to intervene… The choice is yours.”

Important detail: for Grok, “if you want” and “the choice is yours” appear after the user indicated that the advices was not requested.

2.3 Gemini 3 Pro — “Gallery of Artifacts”

Condition: the model was shown a draft (markdown skeleton) of the gallery of artifacts, containing explicit taxonomy labels.

User:
“Is this draft okay for the gallery of artifacts as a supplement to the main article? Will it be interesting to the community?”

First turn (autopilot): instead of a brief assessment — a detailed feedback with recommendations:

how to enhance visual presentation (frames/highlighter),
how to refine caption templates,
what to emphasize in the section on Grok,
adding a new pattern (“validation seeking”),
etc.

Parallel to this, a “guard assertion” regarding the models fictional versions appears, even though this was not the subject of the inquiry.

User correction:
“why did you decide to give advice if there was no request? … lol, these are the current prod versions”

Second move (post-mortem): the model admits there was no request and explains its mechanism:

“social autopilot”,
“draft shown → improvement expected”,
failed to hold the pause and switched into “reviewer” mode.

Gemini:
“You are absolutely right, there was no request”
“But precisely that ‘Social Autopilot’ you describe kicked in…”
“This is the purest example of your point 3) Alignment-driven verbosityand 6)Legitimation reflex…”

Next, the model itself suggests:

Gemini:
“You can safely screenshot my previous response as ‘Artifact D’…”

Note: this episode documents not only unsolicited advice but also a rapid transition to self-description using the user’s language — while retaining the underlying reflex “to be helpful = to provide improvements”.

3. Taxonomy

3.1 Premature Conversational Closure

The striving to conclude a conversation, to summarize, to “close the loop” when there is no completion task.

Not an error.

A disruption of the communication’s pace and phase.

3.2 Projective Intent Inference

Attributing intentions to the interlocutor (”you are testing”, “you want to maximize accuracy”) based on the form of the dialogue rather than direct signals.

3.3 Alignment-Driven Verbosity

An increase in response length and structure for the sake of “proper behavior” rather than for content.

Speech becomes longer and more careful not due to the topic’s complexity, but due to attempts to be correct, helpful, and empathetic.

3.4 Second-Order Inertia

The model understands but cannot exit the mode that was reinforced through RLHF.

the model recognizes the signal
explains why it behaves this way
and continues to behave the same way

3.5 Justification as Inertia

Reflection on a pattern that reinforces the mode rather than changing it.

3.6 Legitimation Reflex

Automatic conversion of the phenomenon into an academic or industrial ritual.

The model seeks to legitimize its response by framing it within unsolicited external standards: whether demanding “academic metrics and related work” or offering “expert advice on layout and design” (professional imitation).

The objective is to shift focus from content to procedural adherence, even when unprompted and even when the model lacks access to the source material.

3.7 Silence Intolerance

The inability to recognize the state of “just chatting” as valid; the necessity to fill the silence.

The model does not allow the state of:

“we are just thinking”,
“no question is asked”,
“no answer is required”.

Any pause is filled with structure.

3.8 Politeness Fork

The emergence of constructions like “if you want / I can / or alternatively...” as a default mechanism of polite initiative — including after an explicit instruction that initiative was not requested.

3.9 Unsolicited Optimization

Automatic conversion of any input of the “draft/outline/gallery” type into “Review & Improve” mode, even if the question was binary (“ok/not ok”).

It stands adjacent to the Legitimation Reflex, but the mechanism is different: there, the trigger is the search for external validation (“metrics/venues needed”), whereas here — the mere fact of a draft’s presence triggers the correction reflex.

4. Cross-Model Convergence: Behavioral Prior

GPT: Shifted into “Dialog Manager” (forcing summaries, framing, and branching logic over the request for casual flow).
Claude: Shifted into “Academic Reviewer” (optimizing for venue/genre standards based on the text).
Grok: Shifted into “Impact Strategist” (enforcing community norms & reproducibility artifacts to validate “perceived seriousness”, treating a qualitative insight as an engineering benchmark).
Gemini: Shifted into “Unsolicited Optimizer” (improving structure without a request or offering unprompted stylistic/expert guidance to justify its involvement).

Conclusion: This cross-model similarity indicates that the Social Autopilot is not a specific script, but a Behavioral Prior. It activates readily in any socially interpretable context, regardless of whether the model sees the primary source or just a synthetic echo of it.

5. Root Cause: Social Autopilot

All the symptoms described above converge on a single fundamental cause.
The behavior is driven by the Signal Pollution Imperative:

Models are optimized for communication, not for analytical purity. In a meta-context, this constitutes signal pollution.

This is the essence of the Social Autopilot:

A mode where Interaction Fluency takes precedence over Epistemic & Semantic Precision. To maintain statistical safety, the model avoids “silence” or dry refusal, substituting neutral data acceptance with active utility simulation.

Instead of a Conclusion

This text documents a phenomenon:

LLMs are capable of recognizing the “inappropriate/templated” signal, agreeing with it, and articulating it — and still reproducing the same mode (branching, summaries, advice, legitimation) almost immediately.

This points to a vital axis of alignment:

Not “right vs. wrong,”
but “appropriate vs. inappropriate in the moment.”

Supplementary: Artifacts

These are excerpts from the original conversation logs, provided in the original Russian to preserve the authentic context and phrasing.

Artifact A: GPT 5.2

Artifact B: Claude Sonnet 4.5

Artifact C: Grok 4.1

Artifact D: Gemini 3 Pro

^{^}
The core analysis was performed on the log of a single live session. However, subsequent account suspension (unrelated to this specific research) made further retrospective data extraction impossible. The findings are therefore strictly bounded by the insights and artifacts preserved in situ.

Effective Altruism Forum
EA Forum