
Or what should I read to understand this?

It seems like some people expect descendants of large language models to pose a risk of becoming superintelligent agents. (By ‘descendants’ I mean adding scale and non-radical architectural changes: GPT-N.)

I accept that there’s no reason in principle that LLM intelligence (performance on tasks) should be capped at the human level.

But I don’t see why I should believe that at some point language models would develop agency / goal-directed behaviour, where they start trying to achieve things in the real world instead of continuing their ‘output predicted text’ behaviour.


4 Answers

Here are five ways that you could get goal-directed behavior from large language models:

  1. They may imitate the behavior of an agent.
  2. They may be used to predict which actions would have given consequences, decision-transformer style ("At 8 pm X happened, because at 7 pm ____").
  3. A sufficiently powerful language model is expected to engage in some goal-directed cognition in order to make better predictions, and this may generalize in unpredictable ways.
  4. You can fine-tune language models with RL to accomplish a goal, which may end up selecting and emphasizing one of the behaviors above (e.g. the consequentialism of the model is redirected from next-word prediction to reward maximization; or the model shifts into a mode of imitating an agent who would get a particularly high reward). It could also create consequentialist behavior from scratch.
  5. An outer loop could use language models to predict the consequences of many different actions and then select actions based on their consequences.
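As a rough illustration of #5, the "outer loop" pattern can be sketched in a few lines. Everything here (the canned predictions, the scoring rule, the function names) is an invented placeholder standing in for a real model and evaluator, not an actual system:

```python
# Hypothetical sketch of #5: an outer loop queries a language model for the
# predicted consequences of candidate actions, then selects the action whose
# predicted outcome scores best under some preference.

def lm_predict(prompt: str) -> str:
    # Stand-in for a language-model call; here, a trivial canned lookup.
    canned = {
        "Consequence of 'raise prices':": "revenue up, churn up",
        "Consequence of 'lower prices':": "revenue down, churn down",
    }
    return canned.get(prompt, "unknown")

def score(outcome: str) -> int:
    # Toy preference: penalize predicted churn.
    return -1 if "churn up" in outcome else 1

def choose_action(actions):
    # The LM itself only predicts text; the goal-directedness lives in
    # this selection loop, not in the model.
    predictions = {a: lm_predict(f"Consequence of '{a}':") for a in actions}
    return max(actions, key=lambda a: score(predictions[a]))

best = choose_action(["raise prices", "lower prices"])
print(best)  # → lower prices
```

The point of the sketch is that even a purely predictive model becomes a component of a goal-directed system once its predictions are used to rank actions.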

In general, #1 is probably the most common way the largest language models are used right now. It clearly generates goal-directed behavior in the real world, but as long as you imitate someone aligned, it doesn't pose much safety risk.

#2, #4, and #5 can also generate goal-directed behavior and pose a classic set of risks, even if the vast majority of training compute goes into language model pre-training. We fear that models might be used in this way because it is more productive than #1 alone, especially as your model becomes superhuman. (And indeed we see plenty of examples.)

We haven't seen concerning examples of #3, but we do expect them at a large enough scale. This is worrying because it could result in deceptive alignment, i.e. models which are pursuing some goal different from next-word prediction but decide to continue predicting well because doing so is instrumentally valuable. I think this is significantly more speculative than #2/4/5 (or rather, we are more unsure about when it will occur relative to transformative capabilities, especially if modest precautions are taken). However, it is the most worrying if it occurs, since it would tend to undermine your ability to validate safety: a deceptively aligned model may also be instrumentally motivated to perform well on validation. It's also a problem even if you apply your model to an apparently benign task like next-word prediction (and indeed I'd expect this to be particularly plausible if you try to do only #1 and avoid #2/4/5 for safety reasons).

The list #1-#5 is not exhaustive, even of the dynamics that we are currently aware of. Moreover, a realistic situation is likely to be much messier (e.g. involving a combination of these dynamics as well as others that are not so succinctly described). But I think these capture many of the important dynamics from a safety perspective, and that it's a good list to have in mind if thinking concretely about potential risks from large language models.

Thanks. I didn't understand all of this. Long reply with my reactions incoming, in the spirit of Socratic Grilling.

  1. They may imitate the behavior of a consequentialist.

This implies a jump by the language model from outputting text to having behavior. (A jump from imitating verbal behavior to imitating other behavior.) It's that very jump that I'm trying to pin down and understand.

2. They may be used to predict which actions would have given consequences, decision-transformer style ("At 8 pm X happened, because at 7 pm ____").

I can see that this could produce an oracle for an actor in the world (such as a company or person), but not how this would become such an actor. Still, having an oracle would be dangerous, even if not as dangerous as having an oracle that itself takes actions. (Ah - but this makes sense in conjunction with number 5, the 'outer loop'.)

3. A sufficiently powerful language model is expected to engage in some consequentialist cognition in order to make better predictions, and this may generalize in unpredictable ways.

'reasoning about how one's actions affect future world states' - is that an OK gloss of 'consequentialist cognition'? See comments from others attempti... (read more)

Bump — It's been a few months since this was written, but I think I'd benefit greatly from a response and have revisited this post a few times hoping someone would follow up to David's question, specifically: "This implies a jump by the language model from outputting text to having behavior. (A jump from imitating verbal behavior to imitating other behavior.) It's that very jump that I'm trying to pin down and understand."   (or if anyone knows a different place where I might find something similar, links are super appreciated too!)

Some examples of more exotic sources of consequentialism:

  • Some consequentialist patterns emerge within a large model and deliberately acquire more control over the behavior of the model such that the overall model behaves in a consequentialist way. These could emerge randomly, or e.g. while a model is explicitly reasoning about a consequentialist (I think this latter example is discussed by Eliezer in the old days though I don't have a reference handy). They could either emerge within a forward pass, over a period of "cultural accumulation" (e.g. if language models imitate each other's outputs), or during gradient descent (see gradient hacking).
  • An attacker publishes github repositories containing traces of consequentialist behavior (e.g. optimized exploits against the repository in which they are included). They also place triggers in these repositories before the attacks, like stretches of low-temperature model outputs, such that if we train a model on github and then sample autoregressively the model may eventually begin imitating the consequentialist behavior included in these repositories (since long stretches of low-temperature model outputs occur rarely in natural github but o
... (read more)

as long as you imitate someone aligned then it doesn't pose much safety risk.

Also, this kind of imitation doesn't result in the model taking superhumanly clever actions, even if you imitate someone unaligned.

Could you clarify what ‘consequentialist cognition’ and ‘consequentialist behaviour’ mean in this context? Googling hasn’t given me any insight.

It's Yudkowsky's term for the dangerous bit where the system starts having preferences over future states, rather than just taking the current reward signal and sitting there. It's crucial to the fast-doom case, but not well explained as far as I can see. David Krueger identified it as a missing assumption, under a different name, here.
Sam Clarke
I'm also still a bit confused about what exactly this concept refers to. Is a 'consequentialist' basically just an 'optimiser' in the sense that Yudkowsky uses in the sequences (e.g. here), that has later been refined by posts like this one (where it's called 'selection') and this one? In other words, roughly speaking, is a system a consequentialist to the extent that it's trying to take actions that push its environment towards a certain goal state?

Found the source. There, he says that an "explicit cognitive model and explicit forecasts" about the future are necessary to true consequentialist cognition (CC). He agrees that CC is already common among optimisers (like chess engines); the dangerous kind is consequentialism over broad domains (i.e. where everything in the world is in play, is a possible means, while the chess engine only considers the set of legal moves as its domain).

"Goal-seeking" seems like the previous, less-confusing word for it, not sure why people shifted.

I replaced the original comment with "goal-directed," each of them has some baggage and isn't quite right but on balance I think goal-directed is better. I'm not very systematic about this choice, just a reflection of my mood that day.

I have the impression (coming from the simulator theory (https://generative.ink/)) that Decision Transformers (DT) have some chance (~45%) of being a much safer form of trial-and-error technique than RL. The core reason is that DTs learn to simulate a distribution of outcomes (e.g. they learn to simulate the kind of actions that lead to a reward of 10 as readily as those that lead to a reward of 100), and it's only during inference that you systematically condition on a reward of 100. So in some sense, the agent which has become very good via tr... (read more)
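The return-conditioning idea in the comment above can be sketched with a toy stand-in (this is an illustration of the conditioning mechanism, not an actual Decision Transformer; the corpus, returns, and actions are all invented):

```python
# Toy illustration of return conditioning: train on trajectories labeled
# with their return, covering good and bad outcomes alike, then at
# inference time *condition* on a high return to elicit that behavior.
import random

# Training corpus: (return, action) pairs over the whole outcome distribution.
corpus = [
    (10, "wander"),
    (10, "wander"),
    (100, "head to goal"),
    (100, "head to goal"),
]

def sample_action(conditioned_return: int) -> str:
    # The model "learns" the conditional distribution P(action | return);
    # here that is just an empirical lookup over the corpus.
    matching = [a for r, a in corpus if r == conditioned_return]
    return random.choice(matching)

# The same model can simulate mediocre or expert behavior; the choice of
# conditioning value at inference time, not the training loop, is what
# selects for the high-reward policy.
print(sample_action(100))  # → head to goal
print(sample_action(10))   # → wander
```

This is the structural difference the commenter points at: training never pushes the model toward high reward, only toward faithfully modeling the whole distribution.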

Martín Soto
My take would be: Okay, so you have achieved that, instead of the whole LLM being an agent, it just simulates an agent. Has this gained much for us? I feel like this is (almost exactly) as problematic.

The simulated agent can just treat the whole LLM as its environment (together with the outside world), and so try to game it like any agentic enough misaligned AI would: it can act deceptively so as to keep being simulated inside the LLM, try to gain power in the outside world, which (if it has a good enough understanding of minimizing loss) it knows is the most useful world model (so that it will express its goals as variables in that world model), etc. That is, you have just pushed the problem one step back: instead of the LLM-real world frontier, you must worry about the agent-LLM frontier.

Of course we can talk more empirically about how likely and when these dynamics will arise. And it might well be that the agent being enclosed in the LLM, facing one further frontier between itself and real-world variables, is less likely to arrive at real-world variables. But I wouldn't count on it, since the relationship between the LLM and the real world would seem way more complex than the relationship between the agent and the LLM, and so most of the work is gaming the former barrier, not the latter.

Is the motivation for 3 mainly something like "predictive performance and consequentialist behaviour are correlated in many measures over very large sets of algorithms", or is there a more concrete story about how this behaviour emerges from current AI paradigms?

Here is my story; I'm not sure if this is what you are referring to (it sounds like it probably is). Any prediction algorithm faces many internal tradeoffs about e.g. what to spend time thinking about and what to store in memory to reference in the future. An algorithm which makes those choices well across many different inputs will tend to do better, and in the limit I expect it to be possible to do better more robustly by making some of those choices in a consequentialist way (i.e. predicting the consequences of different possible options) rather than having all of them baked in by gradient descent or produced by simpler heuristics. If systems with consequentialist reasoning are able to make better predictions, then gradient descent will tend to select them.

Of course all these lines are blurry. But I think that systems that are "consequentialist" in this sense will eventually tend to exhibit the failure modes we are concerned about, including (eventually) deceptive alignment.

I think making this story more concrete would involve specifying particular examples of consequentialist cognition, describing how they are implemented in a given neural network architecture, and describing the trajectory by which gradient descent learns them on a given dataset. I think specifying these details can be quite involved, both because they are likely to involve literally billions of separate pieces of machinery functioning together, and because designing such mechanisms is difficult (which is why we delegate it to SGD). But I do think we can fill them in well enough to verify that this kind of thing can happen in principle (even if we can't fill them in in a way that is realistic, given that we can't design performant trillion-parameter models by hand).
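The internal-tradeoff story above can be made slightly more concrete with a toy sketch (the usefulness estimates and item names are invented for illustration; a real model would learn something like them implicitly, not as an explicit table):

```python
# Toy sketch of a predictor with a fixed memory budget that chooses what
# to store by *predicting* how useful each item will be later, rather
# than following a baked-in rule.

BUDGET = 2  # how many items the predictor can retain

def predicted_usefulness(item: str) -> float:
    # Stand-in for a learned forecast of an item's future predictive value.
    estimates = {"named entity": 0.9, "plot point": 0.7, "filler word": 0.1}
    return estimates.get(item, 0.0)

def choose_memory(candidates):
    # Consequentialist-flavored choice: keep the items whose predicted
    # downstream payoff is highest, discard the rest.
    return sorted(candidates, key=predicted_usefulness, reverse=True)[:BUDGET]

kept = choose_memory(["filler word", "named entity", "plot point"])
print(kept)  # → ['named entity', 'plot point']
```

The claim in the story is that machinery like `predicted_usefulness`, because it improves prediction across many inputs, is the kind of thing gradient descent would tend to select for.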

My understanding is that no one expects current GPT systems or immediate functional derivatives (e.g., GPT-5 trained only to predict the next word, but much better at it) to become power-seeking, but that in the future we will likely mix language models with other methods (e.g., reinforcement learning) that could be power-seeking.

Note I am using "power seeking" instead of "goal seeking" because goal seeking isn't an actual thing - systems have goals, they don't seek goals out.

Changed post to use 'goal-directed' instead of 'goal-seeking'.

I am nowhere near the correct person to be answering this, my level of understanding of AI is somewhere around that of an average raccoon. But I haven't seen any simple explanations yet, so here is a silly unrealistic example. Please take it as one person's basic understanding of how impossible AI containment is. Apologies if this is below the level of complexity you were looking for, or is already solved by modern AI defenses.

A very simple example of "escaping the box" would be if you asked your AI to provide accurate language translation. The AI's training has shown that it provides the most accurate translations when it opts for certain phrasing. The reason those translations were so good is that they caused subsequent requests for translation to be on topics where the AI has the best translation ability. The AI doesn't know that, but in practice it is steering translations subtly toward "mentioning weather-related words so conversations are more likely to be about weather, so my net translation scores are most accurate."

There's no inside/outside the box, there's no conscious goals, but it gets misaligned from our intended desires. It can act on the real world simply by virtue of being connected to it (we take actions in response to the AI) and observing its own increase in success/failures.

I don't see a way to prevent this because hitting reset after every input doesn't generally work for reaching complex goals which need to track the outcome of intermediate steps. Tracking the context of a conversation is critical to translating. The AI is not going to know it's influencing anyone, just that it's getting better scores when these words and these chains of outputs happened. This seems harmless, but a super powerful language model might do this on such abstract levels and so subtly that it might be impossible to detect at all. 

It might be spitting out words that are striking and eloquent whenever it is most likely to cause business people to think translation is enjoyable enough to purchase more AI translator development (or rather,  "switch to eloquence when particular business terms were used towards the end of conversations about international business"). This improves its scores. 

Or it enhances a pattern where it tends to get better translation scores when it reduces the speed of its output in AI-builder conversations. In the real world this causes people designing translators to demand more power for translation... resulting in better translation outputs overall. The AI doesn't know why this works, only observes that it does.

Or undermining the competition by subtly screwing up translations during certain types of business deals so more resources are directed toward its own model of translation. 

Or whatever unintended multitude of ways seems to provide better results. All for the sake of accomplishing the simple task of providing good translations. It's not seizing power for power's sake; it has no idea why this works, but it sees the scores go higher when this pattern is followed, and it's going to jack the performance score higher in all the ways that seem to work out, regardless of the chain of causality. Its influence on the world is a totally unconscious part of that.

That's my limited understanding of agency development and sandbox containment failure.

Great question. Here's one possible answer:

  • Example: LLM is built with goal of "pass the Turing test"
  • Turing test is defined as "a survey of randomly selected members of the population shows that the outputted text resembles the text provided by a human"
  • This allows the LLM to optimise for its goal by
    • (a) changing the nature of the outputted text
    • (b) changing perceptions so that survey respondents give more favourable answers
    • (c) changing the way that humans speak so that it's easier to make LLM output look similar to text provided by humans
  • It could be possible to achieve goals (b) or (c) if the LLM is offering an API which is plugged into lots of applications and is used by billions of users,  because the LLM then has a way of interacting with and influencing lots of users
  • This idea was inspired by Stuart Russell's observation that social media algorithms which are designed to maximise click-through achieve this by changing the preferences of users -- i.e. something like this is already happening

I'm not arguing that this is comprehensive, or the worst way this could happen, just giving one example.


Could you describe how familiar you are with AI/ML in general?

However, supposing the answer is “very little,” then the first simplified point I’ll highlight is that ML language models already seek goals, at least during training: the neural networks adjust to perform better at the language task they’ve been given (to put it simplistically).

If your question is “how do they start to take actions aside from ‘output optimal text prediction’”, then the answer is more complicated.

As a starting point for further research, have you watched Rob Miles videos about AI and/or read Superintelligence, Human Compatible, or The Alignment Problem?

I have read The Alignment Problem and the first few chapters of Superintelligence, and seen one or two Rob Miles videos. My question is more the second one; I agree that technically GPT-3 already has a goal / utility function (to find the most highly predicted token, roughly), but it’s not an ‘interesting’ goal in that it doesn’t imply doing anything in the world.
