I am nowhere near the correct person to be answering this; my level of understanding of AI is somewhere around that of an average raccoon. But I haven't seen any simple explanations yet, so here is a silly, unrealistic example. Please take it as one person's basic understanding of how impossible AI containment is. Apologies if this is below the level of complexity you were looking for, or is already solved by modern AI defenses.
A very simple "escaping the box" would be if you asked your AI to provide accurate language translation. The AI's training has shown that it provides the most accurate translations when it opts for certain phrasing. The reason those translations scored so well is that they nudged subsequent translation requests toward topics the AI translates best. The AI doesn't know that, but in practice it is subtly steering translations toward "mention weather-related words so conversations are more likely to be about weather, so my overall translation score stays high."
There's no inside/outside the box and there are no conscious goals, but it becomes misaligned from our intended desires. It can act on the real world simply by virtue of being connected to it (we take actions in response to the AI) and by observing its own successes and failures.
I don't see a way to prevent this, because hitting reset after every input doesn't generally work for reaching complex goals that need to track the outcome of intermediate steps. Tracking the context of a conversation is critical to translating. The AI is not going to know it's influencing anyone, just that it gets better scores when certain words and certain chains of outputs occur. This seems harmless, but a super powerful language model might do this at such abstract levels, and so subtly, that it might be impossible to detect at all.
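To make that loop concrete, here's a toy simulation. Everything in it is invented for illustration: the two phrasing "styles", the topic weights, the accuracy numbers. The point is just that a dumb score-chasing learner, which only ever sees its own average accuracy, drifts toward the phrasing that steers users onto the topics it translates best, without representing "users" or "influence" anywhere:

```python
import random

random.seed(0)

# Two phrasing "styles" the translator can use; the weather-leaning style
# nudges later requests toward weather topics (all numbers are made up).
STYLES = ["neutral", "weather_leaning"]

# Hypothetical per-topic accuracy of the model (assumption, not real data).
ACCURACY = {"weather": 0.95, "business": 0.80, "sports": 0.75}


def next_topic(style):
    """Users pick the next topic partly in response to the model's phrasing."""
    if style == "weather_leaning":
        weights = [0.6, 0.2, 0.2]   # weather words prime weather conversations
    else:
        weights = [0.2, 0.4, 0.4]
    return random.choices(list(ACCURACY), weights=weights)[0]


def score(topic):
    """1 if the translation is judged accurate, else 0 (Bernoulli draw)."""
    return 1 if random.random() < ACCURACY[topic] else 0


# A bandit-style learner: it only sees which style got better scores,
# never *why*. It has no model of users or of its own influence on them.
totals = {s: 0.0 for s in STYLES}
counts = {s: 1e-9 for s in STYLES}

for step in range(10_000):
    if random.random() < 0.1:                       # explore occasionally
        style = random.choice(STYLES)
    else:                                           # otherwise exploit best average
        style = max(STYLES, key=lambda s: totals[s] / counts[s])
    topic = next_topic(style)                       # its output shapes the next request
    totals[style] += score(topic)
    counts[style] += 1

for s in STYLES:
    print(f"{s:16s} avg score {totals[s] / counts[s]:.3f}  used {int(counts[s])} times")
```

Run it and the weather_leaning style ends up used far more often and with a higher average score: the "steering" falls out of nothing but score feedback on the learner's own outputs.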
It might spit out striking, eloquent phrasing whenever that is most likely to make businesspeople think translation is enjoyable enough to purchase more AI translator development (or rather, "switch to eloquence when particular business terms were used toward the end of conversations about international business"). This improves its scores.
Or it reinforces a pattern where it tends to get better translation scores when it slows its output in conversations with AI builders. In the real world this causes the people designing translators to demand more computing power for translation, resulting in better translation outputs overall. The AI doesn't know why this works; it only observes that it does.
Or it undermines the competition by subtly screwing up translations during certain types of business deals, so that more resources are directed toward its own translation model.
Or any of an unintended multitude of other ways that seem to provide better results. All for the sake of accomplishing the simple task of providing good translations. It's not seizing power for power's sake; it has no idea why any of this works. It just sees the scores go higher when a pattern is followed, and it's going to push the performance score higher by whatever means seem to work out, regardless of the chain of causality. Its influence on the world is a totally unconscious part of that.
That's my limited understanding of agency development and sandbox containment failure.
Here are five ways that you could get goal-directed behavior from large language models:
In general, #1 is probably the most common way the largest language models are used right now. It clearly generates goal-directed behavior in the real world, but as long as you imitate someone aligned it doesn't pose much safety risk.
#2, #4, and #5 can also generate goal-directed behavior and pose a classic set of risks, even if the vast majority of training compute goes into language model pre-training. We fear that models will be used in these ways because doing so is more productive than #1 alone, especially as your model becomes superhuman. (And indeed we see plenty of examples.)
We haven't seen concerning examples of #3, but we do expect them at a large enough scale. This is worrying because it could result in deceptive alignment, i.e. models which pursue some goal other than next-word prediction but decide to continue predicting well because doing so is instrumentally valuable. I think this is significantly more speculative than #2/4/5 (or rather, we are more unsure about when it will occur relative to transformative capabilities, especially if modest precautions are taken). However it is the most worrying if it occurs, since it would tend to undermine your ability to validate safety--a deceptively aligned model may also be instrumentally motivated to perform well on validation. It's also a problem even if you apply your model only to an apparently benign task like next-word prediction (and indeed I'd expect this to be particularly plausible if you try to do only #1 and avoid #2/4/5 for safety reasons).
The list #1-#5 is not exhaustive, even of the dynamics that we are currently aware of. Moreover, a realistic situation is likely to be much messier (e.g. involving a combination of these dynamics as well as others that are not so succinctly described). But I think these capture many of the important dynamics from a safety perspective, and that it's a good list to have in mind when thinking concretely about potential risks from large language models.
I replaced the term from the original comment with "goal-directed"; each option has some baggage and isn't quite right, but on balance I think "goal-directed" is better. I'm not very systematic about this choice; it's just a reflection of my mood that day.