20 karma · Joined Nov 2021


AI Notkilleveryone researcher at Apollo, working on interpretability. Physics PhD.


all behaviour can be interpreted as maximising a utility function.

Yes, it indeed can be. However, the less coherently the agent acts, the more cumbersome it will be to describe it as an expected utility maximiser. Once your utility function specifies entire histories of the universe, its description length goes through the roof. If describing a system as a decision theoretic agent is that cumbersome, it's probably better to look for some other model to predict its behaviour. A rock, for example, is not well described as a decision theoretic agent. You can technically specify a utility function that does the job, but it's a ludicrously large one.

The less coherently and intelligently a system acts, the longer the utility function you need to specify to model its behaviour as a decision theoretic agent will be. In this sense, expected-utility-maximisation does rule things out, though the boundary is not binary. It's telling you what kind of systems you can usefully model as "making decisions" if you want to predict their actions.
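As a toy illustration of the description-length point, here is a minimal sketch. All names in it (`lookup_utility`, `best_history`, `u_coherent`, the two-action world) are made up for the example and are not from any decision theory library. It shows that an arbitrary action sequence can be "rationalised" by a utility function over whole histories, but only by writing the entire history into the utility function, whereas a coherent-looking agent gets a one-line utility function regardless of horizon.

```python
from itertools import product

def lookup_utility(observed_history):
    """Utility over whole histories: 1 for the single observed history, 0 otherwise.

    The 'description' of this utility function is the history itself,
    so it grows with every additional timestep we want to explain.
    """
    target = tuple(observed_history)
    return lambda history: 1.0 if tuple(history) == target else 0.0

def u_coherent(history):
    """A compact utility function: the agent just likes going right."""
    return sum(action == "right" for action in history)

def best_history(utility, actions, horizon):
    """Brute-force 'maximiser': the action sequence with the highest utility."""
    return max(product(actions, repeat=horizon), key=utility)

actions = ("left", "right")

# An erratic system: no short rule generates this sequence (by assumption).
erratic = ("left", "right", "right", "left", "left", "right")
u_erratic = lookup_utility(erratic)

# Technically a utility maximiser, but the utility function had to memorise
# the whole trajectory to make that true.
rationalised = best_history(u_erratic, actions, len(erratic))

# The coherent agent's behaviour falls out of a fixed-size description.
coherent = best_history(u_coherent, actions, 4)
```

In this toy, `u_erratic` needs one stored entry per timestep while `u_coherent` stays the same size at any horizon. That gap is the sense in which the agent model is a useful compression for one system and not for the other.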

If you would prefer math that talks about the actual internal structures agents themselves consist of, decision theory is not the right field to look at. It just does not address questions like this at all. Nowhere in the theorems will you find a requirement that an agent's preferences be somehow explicitly represented in the algorithms it "actually uses" to make decisions, whatever that would mean. It doesn't know what these algorithms are, and doesn't even have the vocabulary to formulate questions about them. It's like saying we can't use theorems for natural numbers to make statements about counting sheep, because sheep are really made of fibre bundles over the complex numbers, rather than natural numbers. The natural numbers are talking about our count of the sheep, not the physics of the sheep themselves, nor the physics of how we move our eyes to find the sheep. And decision theory is talking about our model of systems as agents that make decisions, not the physics of the systems themselves and how some parts of them may or may not correspond to processes that meet some yet unknown embedded-in-physics definition of "making a decision".

I do not find the argument against the applicability of the Complete Class theorem in that post convincing. See Charlie Steiner's reply in the comments.

You just have to separate "how the agent internally represents its preferences" from "what it looks like the agent is doing." You describe an agent that dodges the money-pump by simply acting consistently with past choices. Internally this agent has an incomplete representation of preferences, plus a memory. But externally it looks like this agent is acting like it assigns equal value to whatever indifferent things it thought of choosing between first.

Decision theory is concerned with external behaviour, not internal representations. All of these theorems are talking about whether the agent's actions can be consistently described as maximising a utility function. They are not concerned whatsoever with how the agent actually mechanically represents and thinks about its preferences and actions on the inside. To decision theory, agents are black boxes. Information goes in, decision comes out. Whatever processes may go on in between are beyond the scope of what the theorems are trying to talk about.


Money-pump arguments for Completeness (understood as the claim that sufficiently-advanced artificial agents will have complete preferences) assume that such agents will not act in accordance with policies like ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ But that assumption is doubtful. Agents with incomplete preferences have good reasons to act in accordance with this kind of policy: (1) it never requires them to change or act against their preferences, and (2) it makes them immune to all possible money-pumps for Completeness. 

As far as decision theory is concerned, this is a complete set of preferences. Whether the agent makes up its mind as it goes along or has everything it wants written up in a database ahead of time matters not a peep to decision theory. The only thing that matters is whether the agent's resulting behaviour can be coherently described as maximising a utility function. If it quacks like a duck, it's a duck.
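To make the black-box point concrete, here is a minimal sketch of the policy quoted above. The setup is hypothetical: three options `A`, `A-` (A minus a small fee) and `B`, with `A` strictly preferred to `A-` and `B` incomparable to both. I also assume the agent accepts any swap it does not strictly disprefer. None of these names come from the original post.

```python
# Strict preferences as (better, worse) pairs. Anything not listed is incomparable.
STRICT = {("A", "A-")}

def strictly_prefers(x, y):
    """True if x is strictly preferred to y under the incomplete preferences."""
    return (x, y) in STRICT

def run_trades(start, offers, use_policy):
    """Walk through a sequence of swap offers and return the final holding.

    Without the policy, the agent only refuses swaps to something strictly
    worse than what it currently holds. With the policy, it also refuses
    anything strictly worse than an option it has previously turned down.
    """
    holding = start
    turned_down = set()
    for offer in offers:
        worse_than_holding = strictly_prefers(holding, offer)
        worse_than_past = use_policy and any(
            strictly_prefers(x, offer) for x in turned_down
        )
        if worse_than_holding or worse_than_past:
            turned_down.add(offer)    # declining the offer turns it down
        else:
            turned_down.add(holding)  # swapping turns down what we held
            holding = offer
    return holding

# The classic pump: swap A for B, then B for A-, ending strictly worse off.
pumped = run_trades("A", ["B", "A-"], use_policy=False)

# Same offers with the memory-based policy: the second swap is refused.
protected = run_trades("A", ["B", "A-"], use_policy=True)
```

Here `pumped` ends up holding `A-` while `protected` ends up holding `B`. An outside observer who only sees the second agent's choices can describe them with a complete ordering in which `B` ranks at least as high as `A`, which is all the theorems care about.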

the (main) training process for LLMs is exactly to predict human text, which seems like it could reasonably be described as being trained to impersonate humans

"Could reasonably be described" is the problem here. You likely need very high precision to get this right. Relatively small divergences from human goals in terms of bits altered suffice to make a thing that is functionally utterly inhuman in its desires. This is a kind of precision that current AI builders absolutely do not have.

Worse than that, if you train an AI to do a thing, in the sense of setting a loss function where doing that thing gets a good score on the function, and not doing that thing gets a bad score, you do not, in general, get out an AI that wants to do that thing. One of the strongest loss signals that trains your human brain is probably "successfully predict the next sensory stimulus". Yet humans don't generally go around thinking "Oh boy, I sure love successfully predicting visual and auditory data, it's so great." Our goals have some connection to that loss signal, e.g. I suspect it might be a big part of what makes us like art. But the connection is weird and indirect and strange. 

If you were an alien engineer sitting down to write that loss function for humans, you probably wouldn't predict that they'd end up wanting to make and listen to audio data that sounds like Beethoven's music, or image data that looks like van Gogh's paintings. Unless you knew some math that tells you what kind of AI with what kind of goals you get if you train on a loss function over a dataset.

The problem is that we do not have that math. Our understanding of what sort of thinky-thing with what goals comes out at the end of training is close to zero. We know it can score high on the loss function in training, and that's basically it. We don't know how it scores high. We don't know why it "wants" to score high, if it's the kind of AI that can be usefully said to "want" anything. And we can't tell whether it is that kind of AI, either.

With the bluntness of the tools we currently possess, the goals that any AGI we make right now would have would effectively be a random draw from the space of all possible goals. There are some restrictions on where in this gigantic abstract goal space we would sample from. For example, the AI can't want trivial things that lead to it just sitting there forever doing nothing, because then it would be functionally equivalent to a brick and have no reason to try to score high on the loss function in training, so it would be selected against. But it's still an incredibly vast possibility space.

Unfortunately, humans and human values are very specific things, and most goals in goal space make no mention of them. If a reference to human goals does get into the AGI's goals, there's no reason to expect that it will get in there in the very specific configuration of the AGI wanting the humans to get what they want.

So the AGI gets some random goal that involves more than sitting around doing nothing, but probably isn't very directly related to humans, any more than humans' goals are related to correctly predicting the smells that enter their noses. The AGI will then probably gather resources to achieve this goal, and not care what happens to humans as a consequence. Concretely, that may look like Earth and the solar system getting converted into AGI infrastructure, with no particular attention paid to keeping things like an oxygen-rich atmosphere around. The AGI knows that we would object to this, so it will make sure that we can't stop it. For example, by killing us all.

If you offered it passage off Earth in exchange for leaving humanity alone, it would have little reason to take that deal. That's leaving valuable time and a planet's worth of resources on the table. Humanity might also make another AGI some day, and that could be a serious rival. On the other hand, just killing all the humans is really easy, because they are not smart enough to defend themselves. Victory is nigh guaranteed. So it probably just does that.