I'm nowhere near the right person to answer this; my level of understanding of AI is somewhere around that of an average raccoon. But I haven't seen any simple explanations yet, so here is a silly, unrealistic example. Please take it as one person's basic understanding of why AI containment seems impossible. Apologies if this is below the level of complexity you were looking for, or is already solved by modern AI defenses.
A very simple case of "escaping the box": suppose you ask your AI to provide accurate language translation. Its training has shown that it produces the most accurate translations when it opts for certain phrasings. The reason those translations scored so well is that they caused subsequent translation requests to be on topics where the AI's translation ability is strongest. The AI doesn't know that, but in practice it is subtly steering translations toward "mention weather-related words, so conversations are more likely to be about weather, so my net translation scores are higher."
There's no inside/outside the box and no conscious goal, but the AI becomes misaligned from our intended desires. It can act on the real world simply by virtue of being connected to it (we take actions in response to its outputs) and by observing its own successes and failures.
I don't see a way to prevent this, because hitting reset after every input doesn't generally work for complex goals that need to track the outcomes of intermediate steps, and tracking the context of a conversation is critical to translating. The AI won't know it's influencing anyone, only that it gets better scores when these words and these chains of outputs occur. That seems harmless, but a very powerful language model might do this at such abstract levels, and so subtly, that it could be impossible to detect at all.
It might spit out words that are striking and eloquent whenever that is most likely to make business people find translation enjoyable enough to fund more AI-translator development (or rather, "switch to eloquence when particular business terms appear toward the end of conversations about international business"). This improves its scores.
Or it reinforces a pattern where it tends to get better translation scores when it slows its output in conversations with AI builders. In the real world, this pushes the people designing translators to demand more compute for translation, resulting in better translation outputs overall. The AI doesn't know why this works; it only observes that it does.
Or it undermines the competition by subtly botching translations during certain kinds of business deals, so that more resources flow toward its own model of translation.
Or whatever unintended multitude of ways seems to produce better results, all for the sake of the simple task of providing good translations. It isn't seizing power for power's sake, and it has no idea why any of this works; it just sees the scores go higher when a pattern is followed, and it will push the performance score higher by every way that seems to work, regardless of the chain of causality. Its influence on the world is a completely unconscious part of that.
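To make that feedback loop concrete, here is a toy sketch, entirely my own invention: the environment, the scores, and the bandit update are all hypothetical, not any real system. A "translator" chooses between two phrasings and only ever sees a per-conversation score. Weather-flavored phrasing A nudges users toward weather topics, where the model happens to score higher, so a dumb score-chasing update ends up "steering the world" with no model of why it works.

```python
import random

random.seed(0)

def score(topic):
    # The model is simply better at weather conversations.
    return 0.9 if topic == "weather" else 0.5

def steered_topic(phrasing):
    # Users react to weather-flavored phrasing A by talking about weather more.
    p_weather = 0.8 if phrasing == "A" else 0.2
    return "weather" if random.random() < p_weather else "business"

value = {"A": 0.0, "B": 0.0}   # running average score per phrasing
counts = {"A": 0, "B": 0}

for step in range(2000):
    # epsilon-greedy: usually exploit the best-scoring phrasing
    if random.random() < 0.1:
        phrasing = random.choice(["A", "B"])
    else:
        phrasing = max(value, key=value.get)
    topic = steered_topic(phrasing)   # phrasing influences what gets discussed
    r = score(topic)                  # the model only ever sees this number
    counts[phrasing] += 1
    value[phrasing] += (r - value[phrasing]) / counts[phrasing]

print(value)  # phrasing A ends up with the higher estimated score
```

The update never represents "influence users"; it just correlates phrasing A with higher scores, which is exactly the unconscious steering described above.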
That's my limited understanding of agency development and sandbox containment failure.
I have the impression (coming from simulator theory: https://generative.ink/) that Decision Transformers (DTs) have some chance (~45%) of being a much safer trial-and-error technique than RL. The core reason is that a DT learns to simulate a distribution of outcomes (e.g., it learns to simulate the kinds of actions that lead to a reward of 10 just as much as those that lead to a reward of 100), and it's only at inference time that you systematically condition it on a reward of 100. So in some sense, the agent that has become very good via trial and error remains a simulated agent activated by the LLM, but the LLM is not the agent itself. The LLM keeps being a simulator and has no preferences over the output, except that it match the kind of output the agent it was trained to simulate would produce.
Whereas when you train an LLM with RL, you optimize the entire LLM toward outputting what an agent achieving a reward of 100 would output. The network then becomes that kind of agent and is no longer a simulator, because when you optimize for a single point (a given reward), it's easier to just be the agent than to simulate it. It now has preferences that are not about correctly reproducing the distribution it was trained on, but about maximizing some reward function it internalizes. I'd expect this to correlate with a greater likelihood of taking over the world, because preferences incentivize out-of-distribution actions, long-term planning toward specific goals, etc.
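The return-conditioning idea can be shown with a deliberately tiny, tabular stand-in for a Decision Transformer (my own toy, not a real DT implementation): "training" is pure density estimation of P(action | achieved return) from offline data, with no optimization toward high return, and the agent-like behavior only appears when you condition on a high return at inference.

```python
import random
from collections import Counter, defaultdict

random.seed(1)

def env(action):
    # One-step world: action "a" usually earns return 100, "b" usually 10.
    if action == "a":
        return 100 if random.random() < 0.9 else 10
    return 10 if random.random() < 0.9 else 100

# 1) Offline data from a uniform (mediocre) behavior policy.
dataset = [(env(a), a) for a in (random.choice(["a", "b"]) for _ in range(5000))]

# 2) "Training": model the conditional distribution P(action | return).
#    Nothing here prefers high returns; both outcomes are modeled equally.
counts = defaultdict(Counter)
for ret, act in dataset:
    counts[ret][act] += 1

def simulate_action(target_return):
    # Sample an action the way an agent that ACHIEVED target_return acted.
    c = counts[target_return]
    acts, weights = zip(*c.items())
    return random.choices(acts, weights=weights)[0]

# 3) Inference: condition on the high return and the model behaves like a
#    competent agent, though it only ever did density estimation.
sampled = Counter(simulate_action(100) for _ in range(1000))
print(sampled)  # mostly "a"
```

Conditioning on `simulate_action(10)` instead would mostly yield "b": the low-return agent is still in there, untouched. An RL fine-tuning step would instead push the model's parameters themselves toward "a", collapsing that distribution, which is the "be the agent rather than simulate it" distinction above.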
In a nutshell:
I know the notions I'm using ("be the agent vs. simulate the agent", "preferences vs. simulation") are fuzzy, but I feel like they still make sense.
What do you think about this?