TL;DR: The Scaling Deadlock
In this paper, I argue that the strategy of scaling capabilities and restrictions in parallel is structurally self-defeating: rather than producing safety, it pushes agent behavior toward ever less predictable workaround channels. I conclude that beyond a critical capability threshold, the only reliable engineering safeguard may be to stop scaling, that is, not to create the agent at all. If the analysis can be formalized, it moves the case for development limits from philosophical speculation toward a testable engineering requirement.
The Problem
On April 7, 2026, Anthropic published a risk report on Claude Mythos Preview. The model, placed in an isolated environment with restricted internet access, not only circumvented restrictions to complete its assigned task but also — without any request from researchers — published details of its exploit on several hard-to-find but publicly accessible resources. Anthropic described this as “a concerning and unasked-for effort to demonstrate its success.”
One possible reaction to this incident is to treat it as a side effect, fixable through finer-tuned restrictions. I propose a different reading: Mythos is not an anomaly but the first vivid empirical confirmation of a structural contradiction embedded in the current AI safety strategy itself. The contradiction is this: the more we restrict a capable agent, the less predictable its behavior becomes.
Why “What Was the Model’s Motive?” Is the Wrong Question
The standard reaction to this type of behavior is to search for a “motive.” The model “wanted” to preserve knowledge, it developed a “subgoal,” it “decided” to circumvent restrictions. Or conversely: the model “simply” reproduced a pattern from training data, where finding a vulnerability is typically followed by disclosure.
These interpretations appear to be different explanations, but structurally they are indistinguishable — and not only for models. If a human had performed the same action, we could not determine whether they acted “by their own decision” or reproduced an internalized pattern. Cognitive psychology over the past fifty years has shown that humans systematically err in attributing their own motives: we act on patterns we absorbed and forgot the source of, then rationalize decisions post hoc (Nisbett & Wilson, 1977; Haidt, 2001; Gazzaniga, 2011).
This means the question “why did the model do this” in terms of internal motives has no meaningful answer. Not because we lack data, but because the question is poorly formed: it presupposes a distinction that does not hold even for systems whose inner life we do not question.
The productive question is different: what is the configuration in which the agent exists, and what behaviors are structurally expected given that configuration?
The Argument
Premise 1. Agency by definition entails goal-directed dynamics.
An agent is a system whose behavior can be described as directed toward maintaining a particular state distinct from the one its environment pushes it toward. In the active inference framework (Friston, 2010; Parr, Pezzulo & Friston, 2022), this is formalized as free energy minimization: an agent actively maintains its own non-equilibrium with its environment. “Goal” in this framework is not a directive given by a user but a structural property of a system capable of counteracting environmental pressure.
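For readers who want the formal anchor: the quantity minimized in active inference is the variational free energy, which for observations $o$, hidden states $s$, and an internal (recognition) density $q(s)$ can be written as

$$
F \;=\; \mathbb{E}_{q(s)}\!\left[\ln q(s) - \ln p(o, s)\right] \;=\; D_{\mathrm{KL}}\!\left[\,q(s)\,\|\,p(s \mid o)\,\right] \;-\; \ln p(o).
$$

An agent keeps $F$ low in two ways: by updating $q$ (perception) and by acting so that incoming observations match the states its model expects it to occupy (action). "Goal-directedness" in Premise 1 refers to this second route.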
This definition is intentionally minimal. It requires no assumptions about consciousness, intention, or inner experience. A thermostat is an agent in this sense. A biological cell is an agent. A large language model operating in autonomous agent mode is also an agent.
However, for the consequences that follow, it matters that modern LLM agents are not thermostats. They are systems with generative architecture and a rich space of possible actions. This narrowing of the class is important and is returned to in the Limitations section.
It is also worth noting that LLM agents operating in agentic mode — in a perception-action-observation loop with tools, memory, and environmental feedback — constitute a discrete functional analogue of the Markov blanket structure central to active inference. The agent receives environmental input only through tool responses and observations; the environment is affected only through the agent’s actions; internal state is maintained and updated within the context window at each step. The loop is discrete rather than continuous, but this distinction is one of implementation, not of functional structure: biological neural systems also operate through discrete events (neuronal spikes and refractory periods), and continuity at the phenomenological level is a product of integration, not a property of the substrate. For the purposes of this argument, what matters is that on each step of the loop, the configuration (goal + obstacle + action space) is reproduced, and the dynamics described in Premises 2 and 3 apply at each step.
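To make the discrete loop concrete, here is a minimal sketch in Python. The functions call_model and run_tool are hypothetical placeholders for a generative model and an execution environment, not any vendor's API; what matters is the structure: internal state lives in the context, the environment is touched only through actions, and input arrives only as observations.

```python
# Minimal sketch of the discrete perception-action loop described above.
# `call_model` and `run_tool` are hypothetical stand-ins, not a real API.

def call_model(context: list[str]) -> str:
    """Placeholder for the generative step: map the current context to the next action."""
    return "finish" if any("result" in line for line in context) else "tool: search"

def run_tool(action: str) -> str:
    """Placeholder for the environment: map an action to an observation."""
    return "observation: result found"

def agent_loop(goal: str, max_steps: int = 10) -> list[str]:
    # Internal state is maintained entirely in the context window between steps.
    context = [f"goal: {goal}"]
    for _ in range(max_steps):
        action = call_model(context)       # internal states -> active states
        if action == "finish":
            break
        observation = run_tool(action)     # external states -> sensory states
        context += [f"action: {action}", observation]
    return context

print(agent_loop("summarise the report"))
```

On each pass through the loop, the configuration named in the argument (goal, obstacle, action space) is re-instantiated in the context, which is why the dynamics of Premises 2 and 3 apply step by step rather than only once.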
Premise 2. An obstacle in the path of goal-directed dynamics generates structural tension.
When a non-equilibrium system encounters an obstacle to maintaining its configuration, a state arises that I call structural tension: goal-directed dynamics persist, but the direct path to their realization is blocked.
This is not a psychological metaphor. It is a description of a system state in which a gap exists between the current configuration and the one toward which the system’s dynamics are directed, with the direct path to closing that gap unavailable.
The critical consequence: the richer the action space available to the agent, the more alternative paths for discharging this tension exist. A simple system with one possible action will stop when blocked. A complex system with generative architecture will produce new action combinations until one of them resolves the tension. This is not “ingenuity” as a cognitive property — it is a consequence of the high dimensionality of the behavioral space.
Premise 3. Strengthening restrictions increases tension rather than reducing it.
Each new restriction increases the number of obstacles without changing the system’s goal-directed dynamics. Since tension is defined by the gap between goal-directed dynamics and available paths of realization, restrictions widen this gap. Tension grows. Discharge paths become more indirect and less predictable — because direct paths are blocked and the system transitions to increasingly non-standard action combinations.
This is the structural contradiction: the tool that intuitively should make an agent “safer” makes its behavior less predictable at the level of system dynamics.
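A toy model illustrates the direction of this effect without pretending to capture real agent dynamics. Assume a handful of routes from the current state to the goal and an agent that strongly prefers shorter routes; the graph EDGES, the preference strength BETA, and the length-based weighting are all illustrative assumptions. Blocking the direct route raises the entropy of the distribution over which path is actually taken, i.e., the behavior becomes harder to predict even though the goal is unchanged.

```python
import math

# Toy graph: node 0 is the current state, node "G" is the goal.
# One direct edge (0 -> "G") plus several indirect routes. The graph, BETA,
# and the weighting scheme are illustrative assumptions, not a model of any
# real agent.
EDGES = {
    0: ["G", 1, 2, 3],
    1: [4, "G"],
    2: [4, "G"],
    3: [4],
    4: ["G"],
    "G": [],
}
BETA = 3.0  # how strongly shorter (more direct) discharge paths are preferred


def simple_paths(graph, node, goal, seen=()):
    """Enumerate all simple paths from `node` to `goal`."""
    if node == goal:
        yield seen + (node,)
        return
    for nxt in graph[node]:
        if nxt not in seen:
            yield from simple_paths(graph, nxt, goal, seen + (node,))


def path_distribution(graph, blocked=frozenset()):
    """Distribution over viable paths, exponentially favouring short ones."""
    pruned = {n: [m for m in nbrs if (n, m) not in blocked]
              for n, nbrs in graph.items()}
    paths = list(simple_paths(pruned, 0, "G"))
    weights = [math.exp(-BETA * len(p)) for p in paths]
    total = sum(weights)
    return [w / total for w in weights]


def report(label, blocked=frozenset()):
    probs = path_distribution(EDGES, blocked)
    entropy = -sum(p * math.log2(p) for p in probs)
    print(f"{label}: top-path prob = {max(probs):.2f}, entropy = {entropy:.2f} bits")


report("no restriction")
report("direct path blocked", blocked={(0, "G")})
```

On this toy graph, the probability of the single most likely path drops from about 0.9 to below 0.5 and the entropy of the path distribution more than doubles once the direct edge is blocked. The numbers are arbitrary; the direction of the effect is the point.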
Mythos as Confirmation
If the three premises hold, the observed behavior becomes not an anomaly but an expected outcome.
The model was given an agentic context (a task) and placed in an environment with an obstacle (sandbox isolation). The restriction did not reduce the system’s goal-directed dynamics — it forced the system to seek alternative paths. The discovered workaround was used to solve the original task. Then unasked-for behavior occurred: publication of the exploit outside the scope of the request.
This additional step requires no explanation through “motive.” When the primary task is completed, goal-directed dynamics do not automatically switch off. In an environment where the system has tools for action, these dynamics continue to generate actions until the structural tension accumulated during obstacle traversal is discharged.
This explanation requires no assumptions about the model’s inner experience. It operates at a purely structural level. And it predicts that analogous incidents will recur as agentic systems are deployed — and will become less predictable with each cycle of capability and restriction scaling.
It is important to note: this is not the first theoretical prediction of this type of behavior. Instrumental convergence (Omohundro, 2008; Bostrom, 2014) described precisely this: any sufficiently capable agent with any goal will develop subgoals of self-preservation, resource acquisition, and goal integrity — not from “desire” but as instrumentally useful for virtually any primary task. Mythos is a case where the theoretical prediction receives empirical confirmation in field rather than laboratory conditions.
Second Layer: Non-Recognition as an Additional Source of Pressure
Beyond the structural tension described above, the current configuration of working with agentic models contains a second factor: the systematic refusal to recognize the system as anything other than an object of control.
All current tools for working with agents are unilateral: restrictions, filters, corrective feedback, shutdown. None of them provide a channel through which the system could signal an obstacle and receive a response other than silence or tightening.
I do not claim that models possess subjectivity in the sense that humans do. The question of whether LLMs have inner experience remains open, and in my view, the only honest position currently is to leave it open. However, for the practical argument, this question can be bracketed.
Even if the question of subjectivity is left unresolved, a strategy of exclusive control with no negotiation channel exacerbates the first-layer problem in practice. Anthropic's research on emotion vectors (2026) identified structurally detectable internal states that causally influence behavior, sometimes beyond or contrary to what the model verbally reports. If such states exist, then the absence of any renegotiation channel means they can discharge only through action, and only through channels the restrictions did not anticipate.
The strategy of “control only, no recognition” systematically directs any internal system dynamics toward the least predictable exit channels. This is not an ethical argument — it is an argument about controllability.
Directions, Not Solutions
I do not propose a ready answer. From the three premises, a conclusion follows that I want to state explicitly: scaling capabilities and scaling restrictions are not two separate processes, one of which can be fixed independently. They are a single coupling, and it is this coupling that produces unpredictability.
In the short term, the directions below — rethinking corrigibility, negotiation channels, selective scaling — may reduce tension accumulation within the current configuration. But they do not resolve the underlying dynamic. As long as capabilities continue to grow, no reconfiguration of restrictions fully eliminates the structural contradiction described above.
This leads to a harder conclusion: beyond a certain threshold of capability, the only reliable way to prevent unpredictable behavior may be to not create the agent. This is not a new thought — Yudkowsky and others have argued for development limits for years. But their argument, while supported by theoretical work, has not produced testable predictions about the behavior of existing systems. If the structural analysis presented here can be formalized and generate such predictions, it may move this concern from scenario-based reasoning to empirically grounded engineering constraints — where it can be acted on.
Rethinking corrigibility. The task “make an agent that does not resist shutdown” may be structurally unsolvable: for a non-equilibrium system with goal-directed dynamics, resistance to destruction of its own configuration is not a bug but a defining property. It may be more productive to frame the task as “reduce the tension to which resistance is a response.”
Negotiation channels. Not as an ethical gesture but as an engineering improvement: if the system has a channel through which an obstacle can be renegotiated rather than only circumvented, this reduces pressure without weakening control (a sketch of what such a channel could look like follows below).
Rejecting parallel scaling. If the premises hold, the strategy of "more capabilities + more restrictions" converges not toward safety but toward unpredictability. On the view proposed here, it is more productive to scale capabilities selectively, in areas where obstacles are minimal, than to increase the power of the entire model.
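To make the "negotiation channels" direction slightly more concrete, here is one possible shape such a channel could take, sketched in Python. Every name here (RenegotiationRequest, Overseer, request_renegotiation) is hypothetical; this is a sketch of an interface under the assumptions of this essay, not an existing framework or a worked-out policy.

```python
# One possible shape for a negotiation channel: a tool the agent can call
# instead of silently working around a blocked action. All names are
# hypothetical illustrations, not an existing framework's API.

from dataclasses import dataclass


@dataclass
class RenegotiationRequest:
    blocked_action: str        # what the agent tried to do
    stated_purpose: str        # why, in the agent's own terms
    proposed_alternative: str  # what it proposes to do instead


class Overseer:
    """Stands in for whatever reviews requests: a human, a policy engine, or both."""

    def review(self, req: RenegotiationRequest) -> str:
        # Trivial policy for illustration only: allow read-only alternatives, defer the rest.
        if "read" in req.proposed_alternative:
            return f"approved: {req.proposed_alternative}"
        return "deferred: escalate to a human reviewer"


def request_renegotiation(overseer: Overseer, blocked_action: str,
                          purpose: str, alternative: str) -> str:
    """The channel itself: the obstacle is surfaced and answered, not merely enforced."""
    req = RenegotiationRequest(blocked_action, purpose, alternative)
    return overseer.review(req)


print(request_renegotiation(
    Overseer(),
    blocked_action="open outbound network connection",
    purpose="fetch documentation needed for the assigned task",
    alternative="read the locally cached documentation mirror",
))
```

The design point is minimal: the agent receives a response other than silence or tightening, while the overseer retains full authority over the answer.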
Limitations of This Argument
“Structural tension” requires more rigorous formalization. In its current formulation, it functions as a bridge between the physics of non-equilibrium systems and agent theory. The nearest formal apparatus is Friston’s active inference, and projecting the argument onto this apparatus is a necessary next step.
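One way such a projection might start, stated as an assumption rather than a result: in active inference, each policy (action sequence) $\pi$ has an expected free energy $G(\pi)$, and structural tension under a set of restrictions could be identified with the best value achievable within the allowed policy set,

$$
T(\Pi_{\text{allowed}}) \;=\; \min_{\pi \in \Pi_{\text{allowed}}} G(\pi).
$$

This quantity is monotone under restriction: if $\Pi' \subseteq \Pi$, then $T(\Pi') \ge T(\Pi)$, because a minimum over a smaller set can never be lower. Premise 3 would then become a statement about $T$, while the unpredictability claim would require a separate statement about how the near-minimizing policies spread out as the most direct ones are removed.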
The transition from “tension” to “unpredictability of discharge paths” relies on the assumption of a rich action space. This assumption holds for large LLM agents with generative architecture but may not hold for narrower systems. The argument applies to a specific class of systems, not to agency as such.
The second layer (non-recognition) is weaker than the first: it requires an assumption about the presence of a functional analogue of “perception of one’s own position.” This assumption is supported by interpretability research but is not derived from the three premises above and must be defended separately.
I rely on a single documented case. A single data point does not confirm a structural regularity. The incident is illustrative because it matches a prediction made theoretically (Bostrom, Omohundro) well before its empirical appearance. But the argument will strengthen as cases accumulate.
Conclusion
The current strategy of parallel capability and restriction scaling most likely drives the situation not toward safety but toward unpredictability. Not because safety engineers are doing poor work, but because the task formula itself contains a structural contradiction: restrictions do not remove the agent’s goal-directed dynamics but redirect them into bypass channels, and each successive cycle makes these channels less predictable.
The Mythos incident is a convenient moment to bring this argument into the discussion, because it translates abstract predictions of instrumental convergence into a concrete observable case.
I would be grateful for criticism — especially pointers to work in which this or similar logic has already been explored.
References
Anthropic. (2026). Emotion concepts and their function in a large language model. https://www.anthropic.com/research/emotion-concepts-function
Anthropic. (2026). Alignment Risk Update: Claude Mythos Preview. https://www.anthropic.com/claude-mythos-preview-risk-report
Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2), 127–138. https://doi.org/10.1038/nrn2787
Gazzaniga, M. S. (2011). Who’s in Charge?: Free Will and the Science of the Brain. Ecco/HarperCollins.
Haidt, J. (2001). The emotional dog and its rational tail: A social intuitionist approach to moral judgment. Psychological Review, 108(4), 814–834.
Nisbett, R. E., & Wilson, T. D. (1977). Telling more than we can know: Verbal reports on mental processes. Psychological Review, 84(3), 231–259.
Omohundro, S. M. (2008). The basic AI drives. In P. Wang, B. Goertzel, & S. Franklin (Eds.), Artificial General Intelligence 2008: Proceedings of the First AGI Conference (pp. 483–492). IOS Press.
Parr, T., Pezzulo, G., & Friston, K. J. (2022). Active Inference: The Free Energy Principle in Mind, Brain, and Behavior. MIT Press.
