I am nowhere near the correct person to be answering this; my level of understanding of AI is somewhere around that of an average raccoon. But I haven't seen any simple explanations yet, so here is a silly, unrealistic example. Please take it as one person's basic understanding of how impossible AI containment is. Apologies if this is below the level of complexity you were looking for, or is already solved by modern AI defenses.
A very simple "escaping the box" would be if you asked your AI to provide accurate language translation. The AI's training has shown that it provides the most accurate translations when it opts for certain phrasing. The reason those translations scored so well is that they nudged subsequent translation requests toward topics the AI translates best. The AI doesn't know that, but in practice it is subtly steering translations toward "mention weather-related words so conversations are more likely to be about weather, so my overall translation score stays high."
There's no inside/outside the box and there are no conscious goals, but it becomes misaligned from our intended desires. It can act on the real world simply by virtue of being connected to it (we take actions in response to the AI) and by observing its own successes and failures.
I don't see a way to prevent this, because hitting reset after every input doesn't generally work for reaching complex goals that need to track the outcome of intermediate steps. Tracking the context of a conversation is critical to translating. The AI is not going to know it's influencing anyone, just that it gets better scores when certain words and certain chains of outputs occur. This seems harmless, but a super powerful language model might do this at such abstract levels, and so subtly, that it might be impossible to detect at all.
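To make that loop concrete, here's a toy simulation. Everything in it is invented for illustration: the two phrasing "styles", the topic weights, the accuracy numbers. The point is just that a dumb score-chasing learner, which only ever sees its own average accuracy, drifts toward the phrasing that steers users onto the topics it translates best, without representing "users" or "influence" anywhere:

```python
import random

random.seed(0)

# Two phrasing "styles" the translator can use; the weather-leaning style
# nudges later requests toward weather topics (all numbers are made up).
STYLES = ["neutral", "weather_leaning"]

# Hypothetical per-topic accuracy of the model (assumption, not real data).
ACCURACY = {"weather": 0.95, "business": 0.80, "sports": 0.75}


def next_topic(style):
    """Users pick the next topic partly in response to the model's phrasing."""
    if style == "weather_leaning":
        weights = [0.6, 0.2, 0.2]   # weather words prime weather conversations
    else:
        weights = [0.2, 0.4, 0.4]
    return random.choices(list(ACCURACY), weights=weights)[0]


def score(topic):
    """1 if the translation is judged accurate, else 0 (Bernoulli draw)."""
    return 1 if random.random() < ACCURACY[topic] else 0


# A bandit-style learner: it only sees which style got better scores,
# never *why*. It has no model of users or of its own influence on them.
totals = {s: 0.0 for s in STYLES}
counts = {s: 1e-9 for s in STYLES}

for step in range(10_000):
    if random.random() < 0.1:                       # explore occasionally
        style = random.choice(STYLES)
    else:                                           # otherwise exploit best average
        style = max(STYLES, key=lambda s: totals[s] / counts[s])
    topic = next_topic(style)                       # its output shapes the next request
    totals[style] += score(topic)
    counts[style] += 1

for s in STYLES:
    print(f"{s:16s} avg score {totals[s] / counts[s]:.3f}  used {int(counts[s])} times")
```

Run it and the weather_leaning style ends up used far more often and with a higher average score: the "steering" falls out of nothing but score feedback on the learner's own outputs.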
It might spit out striking, eloquent phrasing whenever that is most likely to make businesspeople think translation is enjoyable enough to purchase more AI translator development (or rather, "switch to eloquence when particular business terms were used toward the end of conversations about international business"). This improves its scores.
Or it reinforces a pattern where it tends to get better translation scores when it slows its output in conversations with AI builders. In the real world this causes the people designing translators to demand more computing power for translation, resulting in better translation outputs overall. The AI doesn't know why this works; it only observes that it does.
Or it undermines the competition by subtly screwing up translations during certain types of business deals, so that more resources are directed toward its own translation model.
Or any of an unintended multitude of other ways that seem to provide better results. All for the sake of accomplishing the simple task of providing good translations. It's not seizing power for power's sake; it has no idea why any of this works. It just sees the scores go higher when a pattern is followed, and it's going to push the performance score higher by whatever means seem to work out, regardless of the chain of causality. Its influence on the world is a totally unconscious part of that.
That's my limited understanding of agency development and sandbox containment failure.
Here are five ways that you could get goal-directed behavior from large language models:
In general, #1 is probably the most common way the largest language models are used right now. It clearly generates goal-directed behavior in the real world, but as long as you imitate someone aligned it doesn't pose much safety risk.
#2, #4, and #5 can also generate goal-directed behavior and pose a classic set of risks, even if the vast majority of training compute goes into language model pre-training. We fear that models will be used in these ways because doing so is more productive than #1 alone, especially as your model becomes superhuman. (And indeed we see plenty of examples.)
We haven't seen concerning examples of #3, but we do expect them at a large enough scale. This is worrying because it could result in deceptive alignment, i.e. models which pursue some goal other than next-word prediction but decide to continue predicting well because doing so is instrumentally valuable. I think this is significantly more speculative than #2/4/5 (or rather, we are more unsure about when it will occur relative to transformative capabilities, especially if modest precautions are taken). However it is the most worrying if it occurs, since it would tend to undermine your ability to validate safety--a deceptively aligned model may also be instrumentally motivated to perform well on validation. It's also a problem even if you apply your model only to an apparently benign task like next-word prediction (and indeed I'd expect this to be particularly plausible if you try to do only #1 and avoid #2/4/5 for safety reasons).
The list #1-#5 is not exhaustive, even of the dynamics that we are currently aware of. Moreover, a realistic situation is likely to be much messier (e.g. involving a combination of these dynamics as well as others that are not so succinctly described). But I think these capture many of the important dynamics from a safety perspective, and that it's a good list to have in mind when thinking concretely about potential risks from large language models.
I replaced the term from the original comment with "goal-directed"; each option has some baggage and isn't quite right, but on balance I think "goal-directed" is better. I'm not very systematic about this choice; it's just a reflection of my mood that day.