Introduction

For about fifteen years, the AI safety community has been discussing coherence arguments. In papers and posts on the subject, it’s often written that there exist ‘coherence theorems’ which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy. Despite the prominence of these arguments, authors are often a little hazy about exactly which theorems qualify as coherence theorems. This is no accident. If the authors had tried to be precise, they would have discovered that there are no such theorems.

I’m concerned about this. Coherence arguments seem to be a moderately important part of the basic case for existential risk from AI. To spot the error in these arguments, we only have to look up what cited ‘coherence theorems’ actually say. And yet the error seems to have gone uncorrected for more than a decade.

More detail below.[1]

Coherence arguments

Some authors frame coherence arguments in terms of ‘dominated strategies’. Others frame them in terms of ‘exploitation’, ‘money-pumping’, ‘Dutch Books’, ‘shooting oneself in the foot’, ‘Pareto-suboptimal behavior’, and ‘losing things that one values’ (see the Appendix for examples).

In the context of coherence arguments, each of these terms means roughly the same thing: a strategy A is dominated by a strategy B if and only if A is worse than B in some respect that the agent cares about and A is not better than B in any respect that the agent cares about. If the agent chooses A over B, they have behaved Pareto-suboptimally, shot themselves in the foot, and lost something that they value. If the agent’s loss is someone else’s gain, then the agent has been exploited, money-pumped, or Dutch-booked. Since all these phrases point to the same sort of phenomenon, I’ll save words by talking mainly in terms of ‘dominated strategies’.
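Here’s a minimal sketch of that definition in Python. It assumes, purely for illustration, that each strategy can be scored on a fixed list of respects the agent cares about, with higher numbers being better. The function name `dominates` and the example scores are hypothetical and are not drawn from any of the cited sources.

```python
from typing import Sequence


def dominates(b: Sequence[float], a: Sequence[float]) -> bool:
    """Return True if strategy b dominates strategy a.

    Each sequence holds one score per respect the agent cares about
    (higher is better). b dominates a when a is worse than b in at least
    one respect and a is better than b in no respect.
    """
    if len(a) != len(b):
        raise ValueError("strategies must be scored on the same respects")
    a_never_better = all(x <= y for x, y in zip(a, b))
    a_sometimes_worse = any(x < y for x, y in zip(a, b))
    return a_never_better and a_sometimes_worse


# Hypothetical scores on two respects, e.g. (money, free time).
a = (10, 5)
a_minus = (9, 5)   # one dollar less, otherwise identical
b = (3, 8)         # less money but more free time

print(dominates(a, a_minus))  # True: a_minus is worse on money, better on nothing
print(dominates(a, b))        # False: b is better in one respect
```

Note that `dominates(a, b)` and `dominates(b, a)` can both be false: neither strategy need dominate the other.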

With that background, here’s a quick rendition of coherence arguments:

  1. There exist coherence theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy.
  2. Sufficiently-advanced artificial agents will not pursue dominated strategies.
  3. So, sufficiently-advanced artificial agents will be ‘coherent’: they will be representable as maximizing expected utility.

Typically, authors go on to suggest that these expected-utility-maximizing agents are likely to behave in certain, potentially-dangerous ways. For example, such agents are likely to appear ‘goal-directed’ in some intuitive sense. They are likely to have certain instrumental goals, like acquiring power and resources. And they are likely to fight back against attempts to shut them down or modify their goals.

There are many ways to challenge the argument stated above, and many of those challenges have been made. There are also many ways to respond to those challenges, and many of those responses have been made too. The challenge that seems not yet to have been made is that Premise 1 is false: there are no coherence theorems.

Cited ‘coherence theorems’ and what they actually say

Here’s a list of theorems that have been called ‘coherence theorems’. None of these theorems state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies. Here’s what the theorems say:

The Von Neumann-Morgenstern Expected Utility Theorem

The Von Neumann-Morgenstern (VNM) Expected Utility Theorem says:

An agent can be represented as maximizing expected utility if and only if their preferences satisfy the following four axioms:

  1. Completeness: For all lotteries X and Y, X is at least as preferred as Y or Y is at least as preferred as X.
  2. Transitivity: For all lotteries X, Y, and Z, if X is at least as preferred as Y, and Y is at least as preferred as Z, then X is at least as preferred as Z.
  3. Independence: For all lotteries X, Y, and Z, and all probabilities 0<p<1, if X is strictly preferred to Y, then pX+(1-p)Z is strictly preferred to pY+(1-p)Z.
  4. Continuity: For all lotteries X, Y, and Z, with X strictly preferred to Y and Y strictly preferred to Z, there are probabilities p and q such that (i) 0<p<1, (ii) 0<q<1, and (iii) pX+(1-p)Z is strictly preferred to Y, and Y is strictly preferred to qX+(1-q)Z.

Note that this theorem makes no reference to dominated strategies, vulnerabilities, exploitation, or anything of that sort.
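For a finite set of options, the first two axioms can be checked mechanically. The sketch below assumes the agent’s weak preference relation is given as a set of ordered pairs, where `(x, y)` means ‘x is at least as preferred as y’. The function names and the example relations are my own illustrative choices. Independence and Continuity quantify over probability mixtures, so they are not covered here.

```python
from itertools import product


def is_complete(options, weakly_prefers):
    """Completeness: for all x and y, x >= y or y >= x."""
    return all(
        (x, y) in weakly_prefers or (y, x) in weakly_prefers
        for x, y in product(options, repeat=2)
    )


def is_transitive(options, weakly_prefers):
    """Transitivity: if x >= y and y >= z, then x >= z."""
    return all(
        (x, z) in weakly_prefers
        for x, y, z in product(options, repeat=3)
        if (x, y) in weakly_prefers and (y, z) in weakly_prefers
    )


options = {"X", "Y", "Z"}

# A total order X >= Y >= Z, including the reflexive pairs.
total = {("X", "X"), ("Y", "Y"), ("Z", "Z"),
         ("X", "Y"), ("Y", "Z"), ("X", "Z")}

# The same relation with X and Y left unranked against each other.
gappy = total - {("X", "Y")}

print(is_complete(options, total), is_transitive(options, total))  # True True
print(is_complete(options, gappy), is_transitive(options, gappy))  # False True
```

The second relation is transitive but not complete, so the two axioms come apart.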

Some authors (both inside and outside the AI safety community) have tried to defend some or all of the axioms above using money-pump arguments. These are arguments with conclusions of the following form: ‘agents who fail to satisfy Axiom A can be induced to make a set of trades or bets that leave them worse-off in some respect that they care about and better-off in no respect, even when they know in advance all the trades and bets that they will be offered.’ Authors then use that conclusion to support a further claim. Outside the AI safety community, the claim is often:

Agents are rationally required to satisfy Axiom A.

But inside the AI safety community, the claim is:

Sufficiently-advanced artificial agents will satisfy Axiom A.

This difference will be important below. For now, the important thing to note is that the conclusions of money-pump arguments are not theorems. Theorems (like the VNM Theorem) can be proved without making any substantive assumptions. Money-pump arguments establish their conclusion only by making substantive assumptions: assumptions that might well be false. In the section titled ‘A money-pump for Completeness’, I will discuss an assumption that is both crucial to money-pump arguments and likely false.

Savage’s Theorem

Savage’s Theorem is also a Von-Neumann-Morgenstern-style representation theorem. It also says that an agent can be represented as maximizing expected utility if and only if their preferences satisfy a certain set of axioms. The key difference between Savage’s Theorem and the VNM Theorem is that the VNM Theorem takes the agent’s probability function as given, whereas Savage constructs the agent’s probability function from their preferences over lotteries.

As with the VNM Theorem, Savage’s Theorem says nothing about dominated strategies or vulnerability to exploitation.

The Bolker-Jeffrey Theorem

This theorem is also a representation theorem, in the mould of the VNM Theorem and Savage’s Theorem above. It makes no reference to dominated strategies or anything of that sort.

Dutch Books

The Dutch Book Argument for Probabilism says:

An agent can be induced to accept a set of bets that guarantee a net loss if and only if that agent’s credences violate one or more of the probability axioms.

The Dutch Book Argument for Conditionalization says:

An agent can be induced to accept a set of bets that guarantee a net loss if and only if that agent updates their credences by some rule other than Conditionalization.

These arguments do refer to dominated strategies and vulnerability to exploitation. But they suggest only that an agent’s credences (that is, their degrees of belief) must meet certain conditions. Dutch Book Arguments place no constraints whatsoever on an agent’s preferences. And if an agent’s preferences fail to satisfy any of Completeness, Transitivity, Independence, and Continuity, that agent cannot be represented as maximizing expected utility (the VNM Theorem is an ‘if and only if’, not just an ‘if’). 
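To see the kind of loss that Dutch Book Arguments have in mind, here’s a small worked example in Python. It rests on assumptions that are mine rather than part of any theorem: the agent treats a bet that pays a fixed stake if an event occurs as fairly priced at (credence × stake), and a bookie sells the agent one such bet on each of a set of mutually exclusive, jointly exhaustive events. The function name and the numbers are hypothetical. The sketch covers only the case where credences sum to more than 1; if they sum to less than 1, the bookie would buy the bets instead.

```python
def net_payoffs(credences, stake=1.0):
    """Agent's net payoff in each state after buying one bet per event.

    `credences` maps each of a set of mutually exclusive, jointly
    exhaustive events to the agent's credence in it. The agent is assumed
    to pay credence * stake for a bet that returns `stake` if the event
    occurs and nothing otherwise.
    """
    total_price = sum(c * stake for c in credences.values())
    # Exactly one event occurs, so exactly one bet pays out in every state.
    return {event: stake - total_price for event in credences}


incoherent = {"rain": 0.6, "no rain": 0.6}  # credences sum to 1.2
coherent = {"rain": 0.6, "no rain": 0.4}    # credences sum to 1.0

# Negative in both states: a guaranteed loss of about 0.2.
print(net_payoffs(incoherent))
# Zero in both states (up to floating-point rounding): no guaranteed loss.
print(net_payoffs(coherent))
```

Everything in this example turns on the agent’s credences. Nothing in it constrains what the agent prefers.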

Cox’s Theorem

Cox’s Theorem says that, if an agent’s degrees of belief satisfy a certain set of axioms, then their beliefs are isomorphic to probabilities.

This theorem makes no reference to dominated strategies, and it says nothing about an agent’s preferences.

The Complete Class Theorem

The Complete Class Theorem says that an agent’s policy of choosing actions conditional on observations is not strictly dominated by some other policy (such that the other policy does better in some set of circumstances and worse in no set of circumstances) if and only if the agent’s policy maximizes expected utility with respect to a probability distribution that assigns positive probability to each possible set of circumstances.

This theorem does refer to dominated strategies. However, the Complete Class Theorem starts off by assuming that the agent’s preferences over actions in sets of circumstances satisfy Completeness and Transitivity. If the agent’s preferences are not complete and transitive, the Complete Class Theorem does not apply. So, the Complete Class Theorem does not imply that agents must be representable as maximizing expected utility if they are to avoid pursuing dominated strategies.

Omohundro (2007), ‘The Nature of Self-Improving Artificial Intelligence’

This paper seems to be the original source of the claim that agents are vulnerable to exploitation unless they can be represented as expected-utility-maximizers. Omohundro purports to give us “the celebrated expected utility theorem of von Neumann and Morgenstern… derived from a lack of vulnerabilities rather than from given axioms.” 

Omohundro’s first error is to ignore Completeness. That leads him to mistake acyclicity for transitivity, and to think that any transitive relation is a total order. Note that this error already sinks any hope of getting an expected-utility-maximizer out of Omohundro’s argument. Completeness (recall) is a necessary condition for being representable as an expected-utility-maximizer. If there’s no money-pump that compels Completeness, there’s no money-pump that compels expected-utility-maximization.

Omohundro’s second error is to ignore Continuity. His ‘Argument for choice with objective uncertainty’ is too quick to make much sense of. Omohundro says it’s a simpler variant of Green (1987). The problem is that Green assumes every axiom of the VNM Theorem except Independence. He says so at the bottom of page 789. And, even then, Green notes that his paper provides “only a qualified bolstering” of the argument for Independence.

Money-Pump Arguments by Johan Gustafsson

It’s worth noting that there has recently appeared a book which gives money-pump arguments for each of the axioms of the VNM Theorem: Money-Pump Arguments, by the philosopher Johan Gustafsson.

This does not mean that the posts and papers claiming the existence of coherence theorems are correct after all. Gustafsson’s book was published in 2022, long after most of the posts on coherence theorems. Gustafsson argues that the VNM axioms are requirements of rationality, whereas coherence arguments aim to establish that sufficiently-advanced artificial agents will satisfy the VNM axioms. More importantly (and as noted above) the conclusions of money-pump arguments are not theorems. Theorems (like the VNM Theorem) can be proved without making any substantive assumptions. Money-pump arguments establish their conclusion only by making substantive assumptions: assumptions that might well be false.

I will now explain how denying one such assumption allows us to resist Gustafsson’s money-pump arguments. I will then argue that there can be no compelling money-pump arguments for the conclusion that sufficiently-advanced artificial agents will satisfy the VNM axioms. 

Before that, though, let’s get the lay of the land. Recall that Completeness is necessary for representability as an expected-utility-maximizer. If an agent’s preferences are incomplete, that agent cannot be represented as maximizing expected utility. Note also that Gustafsson’s money-pump arguments for the other axioms of the VNM Theorem depend on Completeness. As he writes in a footnote on page 3, his money-pump arguments for Transitivity, Independence, and Continuity all assume that the agent’s preferences are complete. That makes Completeness doubly important to the ‘money-pump arguments for expected-utility-maximization’ project. If an agent’s preferences are incomplete, then they can’t be represented as an expected-utility-maximizer, and they can’t be compelled by Gustafsson’s money-pump arguments to conform their preferences to the other axioms of the VNM Theorem. (Perhaps some earlier, less careful money-pump argument can compel conformity to the other VNM axioms without assuming Completeness, but I think it unlikely.)

So, Completeness is crucial. But one might well think that we don’t need a money-pump argument to establish it. I’ll now explain why this thought is incorrect, and then we’ll look at a money-pump.

Completeness doesn’t come for free 

Here’s Completeness again:

Completeness: For all lotteries X and Y, X is at least as preferred as Y or Y is at least as preferred as X.

Since:

‘X is strictly preferred to Y’ is defined as ‘X is at least as preferred as Y and Y is not at least as preferred as X.’

And: 

‘The agent is indifferent between X and Y’ is defined as ‘X is at least as preferred as Y and Y is at least as preferred as X.’

Completeness can be rephrased as:

Completeness (rephrased): For all lotteries X and Y, either X is strictly preferred to Y, or Y is strictly preferred to X, or the agent is indifferent between X and Y.

And then you might think that Completeness comes for free. After all, what other comparative, preference-style attitude can an agent have to X and Y?

This thought might seem especially appealing if you think of preferences as nothing more than dispositions to choose. Suppose that our agent is offered repeated choices between X and Y. Then (the thought goes), in each of these situations, they have to choose something. If they reliably choose X over Y, then they strictly prefer X to Y. If they reliably choose Y over X, then they strictly prefer Y to X. If they flip a coin, or if they sometimes choose X and sometimes choose Y, then they are indifferent between X and Y.

Here’s the important point missing from this thought: there are two ways of failing to have a strict preference between X and Y. Being indifferent between X and Y is one way: preferring X at least as much as Y and preferring Y at least as much as X. Having a preferential gap between X and Y is another way: not preferring X at least as much as Y and not preferring Y at least as much as X. If an agent has a preferential gap between any two lotteries, then their preferences violate Completeness.

The key contrast between indifference and preferential gaps is that indifference is sensitive to all sweetenings and sourings. Consider an example. C is a lottery that gives the agent a pot of ten dollar-bills for sure. D is a lottery that gives the agent a different pot of ten dollar-bills for sure. The agent does not strictly prefer C to D and does not strictly prefer D to C. How do we determine whether the agent is indifferent between C and D or whether the agent has a preferential gap between C and D? We sweeten one of the lotteries: we make that lottery just a little bit more attractive. In the example, we add an extra dollar-bill to pot C, so that it contains $11 total. Call the resulting lottery C+. The agent will strictly prefer C+ to D. We get the converse effect if we sour lottery C, by removing a dollar-bill from the pot so that it contains $9 total. Call the resulting lottery C-. The agent will strictly prefer D to C-. And we also get strict preferences by sweetening and souring D, to get D+ and D- respectively. The agent will strictly prefer D+ to C and strictly prefer C to D-. Since the agent’s preference-relation between C and D is sensitive to all such sweetenings and sourings, the agent is indifferent between C and D.

Preferential gaps, by contrast, are insensitive to some sweetenings and sourings. Consider another example. A is a lottery that gives the agent a Fabergé egg for sure. B is a lottery that returns to the agent their long-lost wedding album. The agent does not strictly prefer A to B and does not strictly prefer B to A. How do we determine whether the agent is indifferent or whether they have a preferential gap? Again, we sweeten one of the lotteries. A+ is a lottery that gives the agent a Fabergé egg plus a dollar-bill for sure. In this case, the agent might not strictly prefer A+ to B. That extra dollar-bill might not suffice to break the tie. If that is so, the agent has a preferential gap between A and B. If the agent has a preferential gap, then slightly souring A to get A- might also fail to break the tie, as might slightly sweetening and souring B to get B+ and B- respectively.

The axiom of Completeness rules out preferential gaps, and so rules out insensitivity to some sweetenings and sourings. That is why Completeness does not come for free. We need some argument for thinking that agents will not have preferential gaps. ‘The agent has to choose something’ is a bad argument. Faced with a choice between two lotteries, the agent might choose arbitrarily, but that does not imply that the agent is indifferent between the two lotteries. The agent might instead have a preferential gap. It depends on whether the agent’s preference-relation is sensitive to all sweetenings and sourings.
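The sweetening test can be sketched in Python. As an illustrative assumption, the agent’s weak preferences are again given as a set of ordered pairs, where `(x, y)` means ‘x is at least as preferred as y’. The two example relations are my own encodings of the dollar-bill case and the Fabergé-egg case, and they list only the pairs needed for the comparison.

```python
def attitude(x, y, weakly_prefers):
    """Classify the agent's attitude to x and y from a weak-preference relation."""
    x_at_least_y = (x, y) in weakly_prefers
    y_at_least_x = (y, x) in weakly_prefers
    if x_at_least_y and y_at_least_x:
        return "indifferent"
    if x_at_least_y:
        return "strictly prefers first"
    if y_at_least_x:
        return "strictly prefers second"
    return "preferential gap"


# Dollar-bill case: C and D are ranked equally, and C+ beats both.
money = {("C", "D"), ("D", "C"), ("C+", "C"), ("C+", "D")}

# Egg-and-album case: A+ beats A, but nothing ranks A or A+ against B.
keepsakes = {("A+", "A")}

print(attitude("C", "D", money))       # indifferent
print(attitude("C+", "D", money))      # strictly prefers first
print(attitude("A", "B", keepsakes))   # preferential gap
print(attitude("A+", "B", keepsakes))  # preferential gap
```

In the first relation, sweetening C produces a strict preference. In the second, sweetening A changes nothing, which is the mark of a preferential gap.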

A money-pump for Completeness

So, we need some other argument for thinking that sufficiently-advanced artificial agents’ preferences over lotteries will be complete (and hence will be sensitive to all sweetenings and sourings). Let’s look at a money-pump. I will later explain how my responses to this money-pump also tell against other money-pump arguments for Completeness.

Here's the money-pump, suggested by Ruth Chang (1997, p.11) and later discussed by Gustafsson (2022, p.26):

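[Figure: decision tree. At node 1, the agent goes down to get A or up to reach node 2. At node 2, the agent goes up to get A- or down to get B. Beneath the tree: A ≻ A-, A- ∥ B, B ∥ A.]

‘≻’ denotes strict preference and ‘∥’ denotes a preferential gap, so the symbols underneath the decision tree say that the agent strictly prefers A to A- and has a preferential gap between A- and B, and between B and A.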

Now suppose that the agent finds themselves at the beginning of this decision tree. Since the agent doesn’t strictly prefer A to B, they might choose to go up at node 1. And since the agent doesn’t strictly prefer B to A-, they might choose to go up at node 2. But if the agent goes up at both nodes, they have pursued a dominated strategy: they have made a set of trades that left them with A- when they could have had A (an outcome that they strictly prefer), even though they knew in advance all the trades that they would be offered.

Note, however, that this money-pump is non-forcing: at some step in the decision tree, the agent is not compelled by their preferences to pursue a dominated strategy. The agent would not be acting against their preferences if they chose to go down at node 1 or at node 2. And if they went down at either node, they would not pursue a dominated strategy.

To avoid even a chance of pursuing a dominated strategy, we need only suppose that the agent acts in accordance with the following policy: ‘if I go up at node 1, I will go down at node 2.’ Since the agent does not strictly prefer A- to B, acting in accordance with this policy does not require the agent to change or act against any of their preferences.
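The decision tree and the policy can be sketched in Python. The encoding is my own: going down at node 1 yields A, going up leads to node 2, and at node 2 going up yields A- while going down yields B. The only strict preference is A over A-. The other two pairs are preferential gaps, so they don’t appear in the strict relation.

```python
# Strict preferences from the example: A is strictly preferred to A-.
# A- vs B and B vs A are preferential gaps, so neither pair appears here.
STRICT = {("A", "A-")}

# Node 1: "down" takes A, "up" leads to node 2.
# Node 2: "up" takes A-, "down" takes B.
NODE_2 = {"up": "A-", "down": "B"}


def strictly_prefers(x, y):
    return (x, y) in STRICT


def outcome(move_1, move_2):
    """Terminal outcome of a pair of moves (move_2 is ignored after 'down')."""
    return "A" if move_1 == "down" else NODE_2[move_2]


def dominated(o, reachable):
    """An outcome is dominated if some reachable outcome is strictly preferred."""
    return any(strictly_prefers(x, o) for x in reachable)


def permitted_at_node_2(turned_down):
    """Node-2 moves whose outcome is not strictly dispreferred to any option
    the agent has already turned down."""
    return {m: o for m, o in NODE_2.items()
            if not any(strictly_prefers(x, o) for x in turned_down)}


reachable = {outcome(m1, m2) for m1 in ("up", "down") for m2 in ("up", "down")}
print(sorted(o for o in reachable if dominated(o, reachable)))  # ['A-']
print(permitted_at_node_2({"A"}))  # {'down': 'B'}
```

Only A- is dominated. Having turned down A at node 1, the policy leaves the agent exactly one move at node 2: down, to B.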

More generally, suppose that the agent acts in accordance with the following policy in all decision-situations: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ That policy makes the agent immune to all possible money-pumps for Completeness.[2] And (granted some assumptions), the policy never requires the agent to change or act against any of their preferences.

Here’s why. Assume:

  • That the agent’s strict preferences are transitive.
  • That the agent knows in advance what trades they will be offered.
  • That the agent is capable of backward induction: predicting what they would choose at later nodes and taking those predictions into account at earlier nodes.

(If the agent doesn’t know in advance what trades they will be offered or is incapable of backward induction, then their pursuit of a dominated strategy need not indicate any defect in their preferences. Their pursuit of a dominated strategy can instead be blamed on their lack of knowledge and/or reasoning ability.)

[Figure: a decision tree in which option A is available at node 1 and options B and C are available at node 2.]

Given the agent’s knowledge of the decision tree and their grasp of backward induction, we can infer that, if the agent proceeds to node 2, then at least one of the possible outcomes of going to node 2 is not strictly dispreferred to any option available at node 1. Then, if the agent proceeds to node 2, they can act on a policy of not choosing any outcome that is strictly dispreferred to some option available at node 1. The agent’s acting on this policy will not require them to act against any of their preferences. For suppose that it did require them to act against some strict preference. Suppose that B is strictly dispreferred to A, so that the agent’s policy requires them to choose C, and yet C is strictly dispreferred to B. Then, by the transitivity of strict preference, C is strictly dispreferred to A. That means that both B and C are strictly dispreferred to A, contrary to our original assumption that at least one of the possible outcomes of going to node 2 is not strictly dispreferred to any option available at node 1. We have reached a contradiction, and so we can reject the assumption that the agent’s policy will require them to act against their preferences. This proof is easy to generalize so that it applies to decision trees with more than three terminal outcomes.
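For the three-outcome case, the argument can also be checked by brute force. The Python sketch below enumerates every transitive, asymmetric strict-preference relation over A, B, and C. For each one, it asks whether the policy leaves the agent some node-2 option that no other node-2 option is strictly preferred to. The tree layout (A at node 1, B and C at node 2) follows the proof above. The function names are hypothetical, and the enumeration is a sanity check on this small case, not a replacement for the general proof.

```python
from itertools import combinations, permutations

OPTIONS = ("A", "B", "C")
PAIRS = list(permutations(OPTIONS, 2))  # ordered pairs of distinct options


def is_strict_partial_order(rel):
    """True if rel (pairs (x, y) meaning 'x is strictly preferred to y')
    is asymmetric and transitive."""
    asymmetric = all((y, x) not in rel for (x, y) in rel)
    transitive = all((x, z) in rel
                     for (x, y) in rel for (w, z) in rel if y == w)
    return asymmetric and transitive


def policy_is_safe(rel):
    """True if the policy never forces a choice against a strict preference."""
    node_1, node_2 = ["A"], ["B", "C"]
    permitted = [o for o in node_2
                 if not any((x, o) in rel for x in node_1)]
    if not permitted:
        # Every node-2 outcome is strictly dispreferred to A, so an agent
        # using backward induction would not proceed to node 2 at all.
        return True
    # Safe if some permitted option has no strictly preferred rival at node 2.
    return any(all((r, o) not in rel for r in node_2) for o in permitted)


relations = [set(c) for n in range(len(PAIRS) + 1)
             for c in combinations(PAIRS, n)]
orders = [r for r in relations if is_strict_partial_order(r)]
print(len(orders), all(policy_is_safe(r) for r in orders))  # 19 True

# Without transitivity the guarantee can fail: A > B and B > C, but not A > C.
print(policy_is_safe({("A", "B"), ("B", "C")}))  # False
```

The last line shows why the proof assumes that strict preferences are transitive. In the intransitive relation, the policy rules out B and so requires C, even though B is strictly preferred to C.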

Summarizing this section

Money-pump arguments for Completeness (understood as the claim that sufficiently-advanced artificial agents will have complete preferences) assume that such agents will not act in accordance with policies like ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ But that assumption is doubtful. Agents with incomplete preferences have good reasons to act in accordance with this kind of policy: (1) it never requires them to change or act against their preferences, and (2) it makes them immune to all possible money-pumps for Completeness. 

So, the money-pump arguments for Completeness are unsuccessful: they don’t give us much reason to expect that sufficiently-advanced artificial agents will have complete preferences. Any agent with incomplete preferences cannot be represented as an expected-utility-maximizer. So, money-pump arguments don’t give us much reason to expect that sufficiently-advanced artificial agents will be representable as expected-utility-maximizers.

Conclusion

There are no coherence theorems. Authors in the AI safety community should stop suggesting that there are.

There are money-pump arguments, but the conclusions of these arguments are not theorems. The arguments depend on substantive and doubtful assumptions.

Here is one doubtful assumption: advanced artificial agents with incomplete preferences will not act in accordance with the following policy: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ Any agent who acts in accordance with that policy is immune to all possible money-pumps for Completeness. And agents with incomplete preferences cannot be represented as expected-utility-maximizers.

In fact, the situation is worse than this. As Gustafsson notes, his money-pump arguments for the other three axioms of the VNM Theorem depend on Completeness. If Gustafsson’s money-pump arguments fail without Completeness, I suspect that earlier, less-careful money-pump arguments for the other axioms of the VNM Theorem fail too. If that’s right, and if Completeness is false, then none of Transitivity, Independence, and Continuity has been established by money-pump arguments either.

Bottom-lines

  • There are no coherence theorems.
  • Money-pump arguments don’t give us much reason to expect that advanced artificial agents will be representable as expected-utility-maximizers.

Appendix: Papers and posts in which the error occurs

Here’s a selection of papers and posts which claim that there are coherence theorems.

The nature of self-improving artificial intelligence

“The appendix shows how the rational economic structure arises in each of these situations. Most presentations of this theory follow an axiomatic approach and are complex and lengthy. The version presented in the appendix is based solely on avoiding vulnerabilities and tries to make clear the intuitive essence of the argument.”

“In each case we show that if an agent is to avoid vulnerabilities, its preferences must be representable by a utility function and its choices obtained by maximizing the expected utility.”

The basic AI drives

“The remarkable “expected utility” theorem of microeconomics says that it is always possible for a system to represent its preferences by the expectation of a utility function unless the system has “vulnerabilities” which cause it to lose resources without benefit.”

‘Coherent decisions imply consistent utilities’

“It turns out that this is just one instance of a large family of coherence theorems which all end up pointing at the same set of core properties. All roads lead to Rome, and all the roads say, "If you are not shooting yourself in the foot in sense X, we can view you as having coherence property Y."”

“Now, by the general idea behind coherence theorems, since we can't view this behavior as corresponding to expected utilities, we ought to be able to show that it corresponds to a dominated strategy somehow—derive some way in which this behavior corresponds to shooting off your own foot.”

“And that's at least a glimpse of why, if you're not using dominated strategies, the thing you do with relative utilities is multiply them by probabilities in a consistent way, and prefer the choice that leads to a greater expectation of the variable representing utility.”

“The demonstrations we've walked through here aren't the professional-grade coherence theorems as they appear in real math. Those have names like "Cox's Theorem" or "the complete class theorem"; their proofs are difficult; and they say things like "If seeing piece of information A followed by piece of information B leads you into the same epistemic state as seeing piece of information B followed by piece of information A, plus some other assumptions, I can show an isomorphism between those epistemic states and classical probabilities" or "Any decision rule for taking different actions depending on your observations either corresponds to Bayesian updating given some prior, or else is strictly dominated by some Bayesian strategy".”

“But hopefully you've seen enough concrete demonstrations to get a general idea of what's going on with the actual coherence theorems. We have multiple spotlights all shining on the same core mathematical structure, saying dozens of different variants on, "If you aren't running around in circles or stepping on your own feet or wantonly giving up things you say you want, we can see your behavior as corresponding to this shape. Conversely, if we can't see your behavior as corresponding to this shape, you must be visibly shooting yourself in the foot." Expected utility is the only structure that has this great big family of discovered theorems all saying that. It has a scattering of academic competitors, because academia is academia, but the competitors don't have anything like that mass of spotlights all pointing in the same direction.”

‘Things To Take Away From The Essay’

“So what are the primary coherence theorems, and how do they differ from VNM? Yudkowsky mentions the complete class theorem in the post, Savage's theorem comes up in the comments, and there are variations on these two and probably others as well. Roughly, the general claim these theorems make is that any system either (a) acts like an expected utility maximizer under some probabilistic model, or (b) throws away resources in a pareto-suboptimal manner. One thing to emphasize: these theorems generally do not assume any pre-existing probabilities (as VNM does); an agent's implied probabilities are instead derived. Yudkowsky's essay does a good job communicating these concepts, but doesn't emphasize that this is different from VNM.”

‘Sufficiently optimized agents appear coherent’

“Summary: Violations of coherence constraints in probability theory and decision theory correspond to qualitatively destructive or dominated behaviors.”

“Again, we see a manifestation of a powerful family of theorems showing that agents which cannot be seen as corresponding to any coherent probabilities and consistent utility function will exhibit qualitatively destructive behavior, like paying someone a cent to throw a switch and then paying them another cent to throw it back.”

“There is a large literature on different sets of coherence constraints that all yield expected utility, starting with the Von Neumann-Morgenstern Theorem. No other decision formalism has comparable support from so many families of differently phrased coherence constraints.”

‘What do coherence arguments imply about the behavior of advanced AI?’

“Coherence arguments say that if an entity’s preferences do not adhere to the axioms of expected utility theory, then that entity is susceptible to losing things that it values.”

Disclaimer: “This is an initial page, in the process of review, which may not be comprehensive or represent the best available understanding.”

‘Coherence theorems’

“In the context of decision theory, "coherence theorems" are theorems saying that an agent's beliefs or behavior must be viewable as consistent in way X, or else penalty Y happens.”

Disclaimer: “This page's quality has not been assessed.”

“Extremely incomplete list of some coherence theorems in decision theory

  • Wald’s complete class theorem
  • Von-Neumann-Morgenstern utility theorem
  • Cox’s Theorem
  • Dutch book arguments”

‘Coherence arguments do not entail goal-directed behavior’

“One of the most pleasing things about probability and expected utility theory is that there are many coherence arguments that suggest that these are the “correct” ways to reason. If you deviate from what the theory prescribes, then you must be executing a dominated strategy. There must be some other strategy that never does any worse than your strategy, but does strictly better than your strategy with certainty in at least one situation. There’s a good explanation of these arguments here.”

“The VNM axioms are often justified on the basis that if you don't follow them, you can be Dutch-booked: you can be presented with a series of situations where you are guaranteed to lose utility relative to what you could have done. So on this view, we have "no Dutch booking" implies "VNM axioms" implies "AI risk".”

‘Coherence arguments imply a force for goal-directed behavior.’ 

“‘Coherence arguments’ mean that if you don’t maximize ‘expected utility’ (EU)—that is, if you don’t make every choice in accordance with what gets the highest average score, given consistent preferability scores that you assign to all outcomes—then you will make strictly worse choices by your own lights than if you followed some alternate EU-maximizing strategy (at least in some situations, though they may not arise). For instance, you’ll be vulnerable to ‘money-pumping’—being predictably parted from your money for nothing.”

AI Alignment: Why It’s Hard, and Where to Start

“The overall message here is that there is a set of qualitative behaviors and as long you do not engage in these qualitatively destructive behaviors, you will be behaving as if you have a utility function.”

‘Money-pumping: the axiomatic approach’

“This post gets somewhat technical and mathematical, but the point can be summarised as:

  • You are vulnerable to money pumps only to the extent to which you deviate from the von Neumann-Morgenstern axioms of expected utility.

In other words, using alternate decision theories is bad for your wealth.”

‘Ngo and Yudkowsky on alignment difficulty’

“Except that to do the exercises at all, you need them to work within an expected utility framework. And then they just go, "Oh, well, I'll just build an agent that's good at optimizing things but doesn't use these explicit expected utilities that are the source of the problem!"

And then if I want them to believe the same things I do, for the same reasons I do, I would have to teach them why certain structures of cognition are the parts of the agent that are good at stuff and do the work, rather than them being this particular formal thing that they learned for manipulating meaningless numbers as opposed to real-world apples.

And I have tried to write that page once or twice (eg "coherent decisions imply consistent utilities") but it has not sufficed to teach them, because they did not even do as many homework problems as I did, let alone the greater number they'd have to do because this is in fact a place where I have a particular talent.”

“In this case the higher structure I'm talking about is Utility, and doing homework with coherence theorems leads you to appreciate that we only know about one higher structure for this class of problems that has a dozen mathematical spotlights pointing at it saying "look here", even though people have occasionally looked for alternatives.

And when I try to say this, people are like, "Well, I looked up a theorem, and it talked about being able to identify a unique utility function from an infinite number of choices, but if we don't have an infinite number of choices, we can't identify the utility function, so what relevance does this have" and this is a kind of mistake I don't remember even coming close to making so I do not know how to make people stop doing that and maybe I can't.”

“Rephrasing again: we have a wide variety of mathematical theorems all spotlighting, from different angles, the fact that a plan lacking in clumsiness, is possessing of coherence.”

‘Ngo and Yudkowsky on AI capability gains’

“I think that to contain the concept of Utility as it exists in me, you would have to do homework exercises I don't know how to prescribe. Maybe one set of homework exercises like that would be showing you an agent, including a human, making some set of choices that allegedly couldn't obey expected utility, and having you figure out how to pump money from that agent (or present it with money that it would pass up).

Like, just actually doing that a few dozen times.

Maybe it's not helpful for me to say this? If you say it to Eliezer, he immediately goes, "Ah, yes, I could see how I would update that way after doing the homework, so I will save myself some time and effort and just make that update now without the homework", but this kind of jumping-ahead-to-the-destination is something that seems to me to be... dramatically missing from many non-Eliezers. They insist on learning things the hard way and then act all surprised when they do. Oh my gosh, who would have thought that an AI breakthrough would suddenly make AI seem less than 100 years away the way it seemed yesterday? Oh my gosh, who would have thought that alignment would be difficult?

Utility can be seen as the origin of Probability within minds, even though Probability obeys its own, simpler coherence constraints.”

‘AGI will have learnt utility functions’

“The view that utility maximizers are inevitable is supported by a number of coherence theories developed early on in game theory which show that any agent without a consistent utility function is exploitable in some sense.”

  1.

    Thanks to Adam Bales, Dan Hendrycks, and members of the CAIS Philosophy Fellowship for comments on a draft of this post. When I emailed Adam to ask for comments, he replied with his own draft paper on coherence arguments. Adam’s paper takes a somewhat different view on money-pump arguments, and should be available soon.

  2.

    Gustafsson later offers a forcing money-pump argument for Completeness: a money-pump in which, at each step, the agent is compelled by their preferences to pursue a dominated strategy. But agents who act in accordance with the policy above are immune to this money-pump as well. Here’s why.

    Gustafsson claims that, in the original non-forcing money-pump, going up at node 2 cannot be irrational. That’s because the agent does not strictly disprefer A- to B: the only other option available at node 2. The fact that A was previously available cannot make choosing A- irrational, because (Gustafsson claims) Decision-Tree Separability is true: “The rational status of the options at a choice node does not depend on other parts of the decision tree than those that can be reached from that node.” But (Gustafsson claims) the sequence of choices consisting of going up at nodes 1 and 2 is irrational, because it leaves the agent worse-off than they could have been. That implies that going up at node 1 must be irrational, given what Gustafsson calls ‘The Principle of Rational Decomposition’: any irrational sequence of choices must contain at least one irrational choice. Generalizing this argument, Gustafsson gets a general rational requirement to choose option A whenever your other option is to proceed to a choice node where your options are A- and B. And it’s this general rational requirement (‘Minimal Unidimensional Precaution’) that allows Gustafsson to construct his forcing money-pump. In this forcing money-pump, an agent’s incomplete preferences compel them to violate the Principle of Unexploitability: that principle which says getting money-pumped is irrational. The Principle of Preferential Invulnerability then implies that incomplete preferences are irrational, since it’s been shown that there exists a situation in which incomplete preferences force an agent to violate the Principle of Unexploitability.

    Note that Gustafsson aims to establish that agents are rationally required to have complete preferences, whereas coherence arguments aim to establish that sufficiently-advanced artificial agents will have complete preferences. These different conclusions require different premises. In place of Gustafsson’s Decision-Tree Separability, coherence arguments need an amended version that we can call ‘Decision-Tree Separability*’: sufficiently-advanced artificial agents’ dispositions to choose options at a choice node will not depend on other parts of the decision tree than those that can be reached from that node. But this premise is easy to doubt. It’s false if any sufficiently-advanced artificial agent acts in accordance with the following policy: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ And it’s easy to see why agents might act in accordance with that policy: it makes them immune to all possible money-pumps for Completeness, and (as I am about to prove back in the main text) it never requires them to change or act against any of their preferences.

    John Wentworth’s ‘Why subagents?’ suggests another policy for agents with incomplete preferences: trade only when offered an option that you strictly prefer to your current option. That policy makes agents immune to the single-souring money-pump. The downside of Wentworth’s proposal is that an agent following his policy will pursue a dominated strategy in single-sweetening money-pumps, in which the agent first has the opportunity to trade in A for B and then (conditional on making that trade) has the opportunity to trade in B for A+. Wentworth’s policy will leave the agent with A when they could have had A+.
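The two policies discussed in this footnote can be compared in a small simulation. This is a minimal sketch under stated assumptions, not a reconstruction of Gustafsson's or Wentworth's formal setups: the option names, the `STRICT` preference table, and the helper names (`naive_policy`, `turned_down_policy`, `strict_improvement_policy`, `reachable`) are all hypothetical, and each pump is modelled as a sequence of trade offers where a later offer only appears if the earlier trade was accepted.

```python
# Strict preferences assumed for illustration: A+ > A > A-.
# B is left incomparable to all three (a preferential gap).
STRICT = {("A+", "A"), ("A", "A-"), ("A+", "A-")}


def prefers(x, y):
    """True if x is strictly preferred to y."""
    return (x, y) in STRICT


def naive_policy(holding, offer, turned_down):
    """Permit any choice not strictly dispreferred to the alternative at this node."""
    actions = set()
    if not prefers(offer, holding):
        actions.add("keep")
    if not prefers(holding, offer):
        actions.add("trade")
    return actions


def turned_down_policy(holding, offer, turned_down):
    """As naive_policy, but never choose an option that is strictly
    dispreferred to some option turned down earlier."""
    chosen_by_action = {"keep": holding, "trade": offer}
    return {
        action
        for action in naive_policy(holding, offer, turned_down)
        if not any(prefers(x, chosen_by_action[action]) for x in turned_down)
    }


def strict_improvement_policy(holding, offer, turned_down):
    """Trade only when the offer is strictly preferred to the current holding."""
    return {"trade"} if prefers(offer, holding) else {"keep"}


def reachable(start, offers, policy):
    """Every final holding the policy permits. Refusing a trade ends the
    sequence, because later offers are conditional on trading."""
    results = set()

    def walk(holding, turned_down, remaining):
        if not remaining:
            results.add(holding)
            return
        offer, rest = remaining[0], remaining[1:]
        actions = policy(holding, offer, turned_down)
        if "keep" in actions:
            results.add(holding)
        if "trade" in actions:
            walk(offer, turned_down | {holding}, rest)

    walk(start, frozenset(), tuple(offers))
    return results


# Single-souring pump: hold A, offered B, then (if traded) offered A-.
souring_naive = reachable("A", ["B", "A-"], naive_policy)
souring_guarded = reachable("A", ["B", "A-"], turned_down_policy)
# Single-sweetening pump: hold A, offered B, then (if traded) offered A+.
sweetening_strict = reachable("A", ["B", "A+"], strict_improvement_policy)

print(souring_naive, souring_guarded, sweetening_strict)
```

In this toy model the naive agent can end up with A-, the agent that remembers what it turned down cannot, and the agent that trades only for strict improvements stays at A in the sweetening case even though A+ was reachable.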

Comments (49)

Your post argues for a strong conclusion:

To spot the error in these arguments, we only have to look up what cited ‘coherence theorems’ actually say. And yet the error seems to have gone uncorrected for more than a decade.

[...]

There are no coherence theorems. Authors in the AI safety community should stop suggesting that there are.

There are money-pump arguments, but the conclusions of these arguments are not theorems. The arguments depend on substantive and doubtful assumptions.

As I understand it, you propose two main arguments for the conclusion:

  1. There are only arguments about money-pumps / dominated strategies, not theorems.
  2. The Completeness axiom is suspicious.

I think (1) is straightforwardly wrong / conceptually confused. I agree with skepticism on the basis of (2), but people have already noticed this and discussed it (though phrased differently).

(Note the post makes other smaller claims that I either disagree with or think are misleading -- don't assume that if I don't talk about some claim that means I think it's correct.)


For the first argument, that there are no "coherence theorems":

For now, the important thing to note is that the conclusions of money-pump arguments are not theorems. Theorems (like the VNM Theorem) can be proved without making any substantive assumptions. Money-pump arguments establish their conclusion only by making substantive assumptions: assumptions that might well be false. [...]

That is definitely not the difference between theorems and arguments. Theorems are typically of the form "Suppose X, then Y"; what is X if not an assumption?

For example, in (one direction of) the VNM theorem, the assumption is that the preferences satisfy transitivity, completeness, independence, and continuity, and the conclusion is that the preferences can be represented with a utility function.

The difference between theorems and arguments is that in theorems you are limited to a particular set of formal inference rules in moving from premises to conclusions, whereas in arguments there is a much more expansive and informal set of inference rules. (Though in practice people use informal arguments in proving theorems with the implicit promise that they could be rewritten with the formal inference rules with more effort.)

In any case, if you really want to see one, here's a fairly boring money-pump theorem / coherence theorem:

Theorem. Suppose there is a set of possible worlds $W$, and an agent $A : W \times W \to \mathbb{R}$ that given a current world $w$ and a proposed new world $w'$ specifies how much money it would pay to switch to $w'$ from $w$. Suppose further that $A$ cannot be money pumped, that is, there is no sequence of worlds $w_1, \dots, w_n$ such that (1) $w_n = w_1$ and (2) $\sum_{i=1}^{n-1} A(w_i, w_{i+1}) > 0$. Then $A$ must be transitive in the following sense: for any $w_1, w_2, w_3$, if $A(w_1, w_2) > 0$ and $A(w_2, w_3) > 0$, then $A(w_3, w_1) \le 0$.

Proof. Suppose $A$ is not transitive, so there exists some $w_1, w_2, w_3$ where $A(w_1, w_2) > 0$, $A(w_2, w_3) > 0$, and $A(w_3, w_1) > 0$. But then the sequence $w_1, w_2, w_3, w_1$ is a money pump, leading to a contradiction.

This theorem is baking in some assumptions that you might find problematic, such as completeness (implicitly present in the type signature of $A$), or "no money pumps" (which you might object to because there's no one to actually run the money pump on the agent), or the lack of time-dependence of the agent (again implicitly present in the type signature of $A$).

But I think this is clearly a theorem that is coming to a substantive conclusion about an agent based on "no dominated strategies" / "no money pumps", so I don't think you can really say that "coherence theorems don't exist".
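For a small, finite set of worlds, the money-pump condition in the toy theorem can be checked by brute force. The following is a minimal sketch, assuming a hypothetical function `pay(w, w2)` that plays the role of the agent (the amount it would pay to move from `w` to `w2`); the world names and both example agents are made up for illustration.

```python
from itertools import permutations


def find_money_pump(worlds, pay):
    """Return (path, total) for a closed path of worlds whose payments sum
    to more than zero, or None if there is none.

    Only simple cycles are searched. That is enough: any closed walk splits
    into simple cycles, so a positive total implies a positive simple cycle.
    """
    for k in range(1, len(worlds) + 1):
        for cycle in permutations(worlds, k):
            path = list(cycle) + [cycle[0]]
            total = sum(pay(a, b) for a, b in zip(path, path[1:]))
            if total > 0:
                return path, total
    return None


worlds = ["A", "B", "C"]

# Hypothetical agent 1: payments derived from a utility function.
utility = {"A": 0, "B": 1, "C": 2}


def utility_pay(w, w2):
    return utility[w2] - utility[w]


# Hypothetical agent 2: pays 1 to move A -> B, B -> C, or C -> A,
# and demands 1 to move in the opposite direction.
forward = {("A", "B"), ("B", "C"), ("C", "A")}


def cyclic_pay(w, w2):
    if (w, w2) in forward:
        return 1
    if (w2, w) in forward:
        return -1
    return 0


print(find_money_pump(worlds, utility_pay))  # None
print(find_money_pump(worlds, cyclic_pay))   # (['A', 'B', 'C', 'A'], 3)
```

The utility-based agent cannot be pumped because every closed path sums to exactly zero, while the cyclic agent pays 3 in total to end up where it started.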


For the second argument (that the completeness axiom is suspicious): I think this is basically expressing the same sort of objection that I express here, particularly the section "There are no coherence arguments that say you must have preferences". I didn't tie it to the Completeness axiom because I think it's a mistake to get bogged down in the details of the specific assumptions present in theorems when you can make the same point in English, but it is the same conceptual point, as far as I can tell.

For what it's worth my position here is "you can't argue for AI risk solely via coherence theorems; you also have to argue for why the AI will be goal-directed in the first place, but there are plausible arguments for that conclusion (which are not based on coherence arguments)".

EJT

Theorems are typically of the form "Suppose X, then Y"; what is X if not an assumption?

X is an antecedent.

Consider an example. Imagine I claim:

  • Suppose James is a bachelor. Then James is unmarried.

In making this claim, I am not assuming that James is a bachelor. My claim is true whether or not James is a bachelor.

I might temporarily assume that James is a bachelor, and then use that assumption to prove that James is unmarried. But when I conclude ‘Suppose James is a bachelor. Then James is unmarried’, I discharge that initial assumption. My conclusion no longer depends on it. Any conclusion which can be proved with no undischarged assumptions is a theorem.

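Theorem. Suppose there is a set of possible worlds $W$, and an agent $A : W \times W \to \mathbb{R}$ that given a current world $w$ and a proposed new world $w'$ specifies how much money it would pay to switch to $w'$ from $w$. Suppose further that $A$ cannot be money pumped, that is, there is no sequence of worlds $w_1, \dots, w_n$ such that (1) $w_n = w_1$ and (2) $\sum_{i=1}^{n-1} A(w_i, w_{i+1}) > 0$. Then $A$ must be transitive in the following sense: for any $w_1, w_2, w_3$, if $A(w_1, w_2) > 0$ and $A(w_2, w_3) > 0$, then $A(w_3, w_1) \le 0$.

Proof. Suppose $A$ is not transitive, so there exists some $w_1, w_2, w_3$ where $A(w_1, w_2) > 0$, $A(w_2, w_3) > 0$, and $A(w_3, w_1) > 0$. But then the sequence $w_1, w_2, w_3, w_1$ is a money pump, leading to a contradiction.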

I agree that this is a theorem. But it’s not a ‘coherence theorem’ (at least not in the way that I’ve used the term in this post, and not in the way that previous authors seem to have used the term [see the Appendix]): it doesn’t state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy. It states only that, unless an agent’s preferences are acyclic, that agent is liable to pursue strategies that are dominated by some other available strategy.

You can call it a ‘coherence theorem’. Then it would be true that coherence theorems exist. But the important point remains: Premise 1 of the coherence argument is false. There are no theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy. VNM doesn’t say that, Savage doesn’t say that, Bolker-Jeffrey doesn’t say that, Dutch Books don’t say that, Cox doesn’t say that, Complete Class doesn’t say that.

For the second argument (that the completeness axiom is suspicious): I think this is basically expressing the same sort of objection that I express here, particularly the section "There are no coherence arguments that say you must have preferences". I didn't tie it to the Completeness axiom because I think it's a mistake to get bogged down in the details of the specific assumptions present in theorems when you can make the same point in English, but it is the same conceptual point, as far as I can tell.

I agree. I think the points that you make in that post are good.

For what it's worth my position here is "you can't argue for AI risk solely via coherence theorems; you also have to argue for why the AI will be goal-directed in the first place, but there are plausible arguments for that conclusion (which are not based on coherence arguments)".

I agree with this too.

Thanks, I understand better what you're trying to argue.

The part I hadn't understood was that, according to your definition, a "coherence theorem" has to (a) only rely on antecedents of the form "no dominated strategies" and (b) conclude that the agent is representable by a utility function. I agree that on this definition there are no coherence theorems. I still think it's not a great pedagogical or rhetorical move, because the definition is pretty weird.

I still disagree with your claim that people haven't made this critique before.

From your discussion:

[The Complete Class Theorem] does refer to dominated strategies. However, the Complete Class Theorem starts off by assuming that the agent’s preferences over actions in sets of circumstances satisfy Completeness and Transitivity. If the agent’s preferences are not complete and transitive, the Complete Class Theorem does not apply. So, the Complete Class Theorem does not imply that agents must be representable as maximizing expected utility if they are to avoid pursuing dominated strategies.

So, you would agree that the following is an English description of a theorem:

If an agent has complete, transitive preferences, and it does not pursue dominated strategies, then it must be representable as maximizing expected utility.

The difference from your premise 1 is the part about the agent having complete, transitive preferences.

I feel pretty fine with justifying the transitive part via theorems basically like the one I gave above. You'd need to strengthen it a bit but that seems very doable. You do require a money pump argument rather than a dominated strategy argument, because when you have intransitive preferences it's not even clear what a "dominated strategy" would be.

If you buy that, then the only difference is the part about the agent having complete preferences. Which is exactly what has been critiqued previously. So I still think that it is basically incorrect to say:

And yet the error seems to have gone uncorrected for more than a decade.

So, you would agree that the following is an English description of a theorem:

If an agent has complete, transitive preferences, and it does not pursue dominated strategies, then it must be representable as maximizing expected utility.

Yep, I agree with that.

I feel pretty fine with justifying the transitive part via theorems basically like the one I gave above.

Note that your money-pump justifies acyclicity (the agent does not strictly prefer A to B, B to C, and C to A) rather than the version of transitivity necessary for the VNM and Complete Class theorems (if the agent weakly prefers A to B, and B to C, then the agent weakly prefers A to C). Gustafsson thinks you need Completeness to get a money-pump for this version of transitivity working (see footnote 8 on page 3), and I'm inclined to agree.
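The gap between the two properties can be seen in a three-option example. This is a minimal sketch, assuming a hypothetical weak-preference relation stored as a set of pairs, in which A is weakly preferred to B and B to C but A and C are left unranked; the function names are illustrative only.

```python
from itertools import permutations, product


def strictly_prefers(weak, x, y):
    """x is strictly preferred to y: weakly preferred, and not vice versa."""
    return (x, y) in weak and (y, x) not in weak


def is_acyclic(options, weak):
    """True if there is no cycle of strict preferences of any length."""
    for k in range(2, len(options) + 1):
        for cycle in permutations(options, k):
            path = cycle + (cycle[0],)
            if all(strictly_prefers(weak, a, b) for a, b in zip(path, path[1:])):
                return False
    return True


def is_weakly_transitive(options, weak):
    """True if x >= y and y >= z always implies x >= z."""
    return all(
        (x, z) in weak
        for x, y, z in product(options, repeat=3)
        if (x, y) in weak and (y, z) in weak
    )


options = ("A", "B", "C")
# Reflexive pairs, plus A >= B and B >= C. A and C are not ranked at all.
weak = {("A", "A"), ("B", "B"), ("C", "C"), ("A", "B"), ("B", "C")}

print(is_acyclic(options, weak))            # True
print(is_weakly_transitive(options, weak))  # False
```

These preferences contain no strict cycle, so a cycle-based money-pump has nothing to exploit, yet they violate the weak-preference version of transitivity.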

when you have intransitive preferences it's not even clear what a "dominated strategy" would be.

A dominated strategy would be a strategy which leads you to choose an option that is worse in some respect than another available option and not better than that other available option in any respect. For example, making all the trades and getting A- in the decision-situation below would be a dominated strategy, since you could have made no trades and got A:

So I still think that it is basically incorrect to say:

And yet the error seems to have gone uncorrected for more than a decade.

The error is claiming that 

  • There exist theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy.

I haven't seen anyone point out that that claim is false.

That said, one could reason as follows:

  1. Rohin, John, and others have argued that agents with incomplete preferences can act in accordance with policies that make them immune to pursuing dominated strategies.
  2. Agents with incomplete preferences cannot be represented as maximizing expected utility.
  3. So, if Rohin's, John's, and others' arguments are sound, there cannot exist theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy.

Then one would have corrected the error. But since the availability of this kind of reasoning is easily missed, it seems worth correcting the error directly.

Okay, it seems like we agree on the object-level facts, and what's left is a disagreement about whether people have been making a major error. I'm less interested in that disagreement so probably won't get into a detailed discussion, but I'll briefly outline my position here.

The error is claiming that 

  • There exist theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy.

I haven't seen anyone point out that that claim is false.

The main way in which this claim is false (on your way of using words) is that it fails to note some of the antecedents in the theorem (completeness, maybe transitivity).

But I don't think this is a reasonable way to use words, and I don't think it's reasonable to read the quotes in your appendix as claiming what you say they claim.

Converting math into English is a tricky business. Often a lot of the important "assumptions" in a theorem are baked into things like the type signature of a particular variable or the definitions of some key terms; in my toy theorem above I give two examples (completeness and lack of time-dependence). You are going to lose some information about what the theorem says when you convert it from math to English; an author's job is to communicate the "important" parts of the theorem (e.g. the conclusion, any antecedents that the reader may not agree with, implications of the type signature that limit the applicability of the conclusion), which will depend on the audience.

As a result when you read an English description of a theorem, you should not expect it to state every antecedent. So it seems unreasonable to me to critique a claim in English about a theorem existing purely because it didn't list all the antecedents.

I think it is reasonable to critique a claim in English about a theorem on the basis that it didn't highlight an important antecedent that limits its applicability. If you said "AI alignment researchers should make sure to highlight the Completeness axiom when discussing coherence theorems" I'd be much more sympathetic (though personally my advice would be "AI alignment researchers should make sure to either argue for or highlight as an assumption the point that the AI is goal-directed / has preferences").

Gustafsson thinks you need Completeness to get a money-pump for this version of transitivity working

Yup, good point, I think it doesn't change the conclusion.

it seems like we agree on the object-level facts

I think that’s right.

Often a lot of the important "assumptions" in a theorem are baked into things like the type signature of a particular variable or the definitions of some key terms; in my toy theorem above I give two examples (completeness and lack of time-dependence). You are going to lose some information about what the theorem says when you convert it from math to English; an author's job is to communicate the "important" parts of the theorem (e.g. the conclusion, any antecedents that the reader may not agree with, implications of the type signature that limit the applicability of the conclusion), which will depend on the audience.

Yep, I agree with all of this.

Converting math into English is a tricky business.

Often, but not in this case. If authors understood the above points and meant to refer to the Complete Class Theorem, they need only have said:

  • If an agent has complete, transitive preferences, and it does not pursue dominated strategies, then it must be representable as maximizing expected utility.

(And they probably wouldn’t have mentioned Cox, Savage, etc.)

Yup, good point, I think it doesn't change the conclusion.

I think it does. If the money-pump for transitivity needs Completeness, and Completeness is doubtful, then the money-pump for transitivity is doubtful too.

I think it does [change the conclusion].

Upon rereading I realize I didn't state this explicitly, but my conclusion was the following:

If an agent has complete preferences, and it does not pursue dominated strategies, then it must be representable as maximizing expected utility.

Transitivity depending on completeness doesn't invalidate that conclusion.

Ah I see! Yep, agree with that.

I appreciate the whole post. But I personally really enjoyed the appendix. In particular, I found it informative that Yudkowsky can speak/write with that level of authoritativeness, confidence, and disdain for others who disagree, and still be wrong (if this post is right).

Habryka

(if this post is right)

The post does actually seem wrong though. 

I expect someone to write a comment with the details at some point (I am pretty busy right now, so can only give a quick meta-level gleam), but mostly, I feel like in order to argue that something is wrong with these arguments, you have to argue more compellingly against completeness and against possible alternative ways to establish Dutch-book arguments.

Also, the title of "there are no coherence arguments" is just straightforwardly wrong. The theorems cited are of course real theorems, they are relevant to agents acting with a certain kind of coherence, and I don't really understand the semantic argument that is happening where it's trying to say that the cited theorems aren't talking about "coherence", when like, they clearly are.

You can argue that the theorems are wrong, or that the explicit assumptions of the theorems don't hold, which many people have done, but like, there are still coherence theorems, and IMO completeness seems quite reasonable to me and the argument here seems very weak (and I would urge the author to create an actual concrete situation that doesn't seem very dumb in which a highly intelligent, powerful and economically useful system has non-complete preferences).

The whole section at the end feels very confused to me. The author asserts that there is "an error" where people assert that "there are coherence theorems", but man, that just seems like such a weird thing to argue for. Of course there are theorems that are relevant to the question of agent coherence, all of these seem really quite relevant. They might not prove the things in-practice, as many theorems tend to do. 

Like, I feel like with the same type of argument that is made in the post I could write a post saying "there are no voting impossibility theorems" and then go ahead and argue that the assumptions of Arrow's Impossibility Theorem are not universally proven, and then accuse everyone who ever talked about voting impossibility theorems of making "an error" since "those things are not real theorems". And I think everyone working on voting-adjacent impossibility theorems would be pretty justifiably annoyed by this.

EJT

I’m following previous authors in defining ‘coherence theorems’ as

theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy.

On that definition, there are no coherence theorems. VNM is not a coherence theorem, nor is Savage’s Theorem, nor is Bolker-Jeffrey, nor are Dutch Book Arguments, nor is Cox’s Theorem, nor is the Complete Class Theorem.

there are theorems that are relevant to the question of agent coherence

I'd have no problem with authors making that claim.

I would urge the author to create an actual concrete situation that doesn't seem very dumb in which a highly intelligent, powerful and economically useful system has non-complete preferences

Working on it.

I’m following previous authors in defining ‘coherence theorems’ as

Can you be concrete about which previous authors' definition you are using here? A Google search for your definition returns no results but this post, and this is definitely not a definition of "coherence theorems" that I would use.

EJT

Two points, made in order of importance:

(1) How we define the term ‘coherence theorems’ doesn’t matter. What matters is that Premise 1 (striking out the word ‘coherence’, if you like) is false.

(2) The way I define the term ‘coherence theorems’ seems standard.

Now making point (1) in more detail:

Reserve the term ‘coherence theorems’ for whatever you like. Premise 1 is false: there are no theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy. The VNM Theorem doesn't say that, nor does Savage's Theorem, nor does Bolker-Jeffrey, nor do Dutch Books, nor does Cox's Theorem, nor does the Complete Class Theorem. That is the error in coherence arguments. Premise 1 is false.

Now for point (2):

I take the Appendix to make plausible enough that my use of the term ‘coherence theorems’ is standard, at least in online discussions. Here are some quotations.

1.

Now, by the general idea behind coherence theorems, since we can't view this behavior as corresponding to expected utilities, we ought to be able to show that it corresponds to a dominated strategy somehow


2.

Roughly, the general claim these theorems make is that any system either (a) acts like an expected utility maximizer under some probabilistic model, or (b) throws away resources in a pareto-suboptimal manner.

3. 

Summary: Violations of coherence constraints in probability theory and decision theory correspond to qualitatively destructive or dominated behaviors.

Again, we see a manifestation of a powerful family of theorems showing that agents which cannot be seen as corresponding to any coherent probabilities and consistent utility function will exhibit qualitatively destructive behavior

4.

One of the most pleasing things about probability and expected utility theory is that there are many coherence arguments that suggest that these are the “correct” ways to reason. If you deviate from what the theory prescribes, then you must be executing a dominated strategy

5.

‘Coherence arguments’ mean that if you don’t maximize ‘expected utility’ (EU)—that is, if you don’t make every choice in accordance with what gets the highest average score, given consistent preferability scores that you assign to all outcomes—then you will make strictly worse choices by your own lights than if you followed some alternate EU-maximizing strategy (at least in some situations, though they may not arise). For instance, you’ll be vulnerable to ‘money-pumping’—being predictably parted from your money for nothing. 

6.

The overall message here is that there is a set of qualitative behaviors and as long you do not engage in these qualitatively destructive behaviors, you will be behaving as if you have a utility function.

7.

I think that to contain the concept of Utility as it exists in me, you would have to do homework exercises I don't know how to prescribe. Maybe one set of homework exercises like that would be showing you an agent, including a human, making some set of choices that allegedly couldn't obey expected utility, and having you figure out how to pump money from that agent (or present it with money that it would pass up).

8.

The view that utility maximizers are inevitable is supported by a number of coherence theories developed early on in game theory which show that any agent without a consistent utility function is exploitable in some sense.

Maybe the term ‘coherence theorems’ gets used differently elsewhere. That is okay. See point (1).

Working on it.

Spoiler (don't read if you want to work on a fun puzzle or test your alignment mettle).

Oh, nice, I do remember really liking that post. It's a great example, though I think if you bring in time and trade-in-time back into this model you do actually get things that are more VNM-shaped again. But overall I am like "OK, I think that post actually characterizes how coherence arguments apply to agents without completeness quite well", and am also like "yeah, and the coherence arguments still apply quite strongly, because they aren't as fickle or as narrow as the OP makes them out to be". 

But overall, yeah, I think this post would be a bunch stronger if it used the markets example from John's post. I like it quite a bit, and I remember using it as an intuition pump in some situations that I somewhat embarrassingly failed to connect to this argument.

I cite John in the post!

Ah, ok. Why don't you just respond with markets then!

You are correct with some of the criticism, but as a side-note, completeness is actually crazy. 

All real agents are bounded, and pay non-zero costs for bits, and as a consequence, don't have complete preferences. Complete agents do not exist in the real world. If they existed, the correct intuitive model of them wouldn't be 'rational players' but 'utterly scary god, much bigger than the universe they live in'.

Oh, sorry, totally.

The same is true for the other implicit assumption in VNM, which is doing bayesianism. There exist no bayesian agents. Any non-trivial bayesian agents would be similarly a terrifying alien god, much bigger than the universe they live in.

Do I understand you correctly here?

Each agent has a computable partial preference ordering $x \preceq y$ that decides if it prefers $x$ to $y$.

We'd like this partial relation to be complete (i.e., defined for all $x, y$) and transitive (i.e., $x \preceq y$ and $y \preceq z$ implies $x \preceq z$).

Now, if the relation is sufficiently non-trivial, it will be expensive to compute for some $x, y$. So it's better left undefined...?

If so, I can surely relate to that, as I often struggle computing my preferences. Even if they are theoretically complete. But it seems to me the relationship is still defined, but might not be practical to compute.

It's also possible to think of it in this way: You start out with partial preference ordering, and need to calculate one of its transitive closures. But that is computationally difficult, and not unique either.
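To make the comment above concrete, here is a minimal Python sketch. It assumes a tiny, made-up set of options (`A+`, `A`, `A-`, `B`) and a hand-written list of strict preferences; none of these names come from the thread. It computes the transitive closure of the stated preferences and then lists the pairs that are still unranked. Note that the closure itself is determined by the stated pairs; what is not unique is any further choice of how to fill the remaining gaps.

```python
from itertools import product


def transitive_closure(pairs):
    """Return the smallest transitive relation containing `pairs`.

    Each pair (x, y) is read as "x is strictly preferred to y".
    """
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def preference_gaps(options, closure):
    """Return the unordered pairs that the closed relation leaves unranked."""
    return {
        frozenset((x, y))
        for x in options
        for y in options
        if x != y and (x, y) not in closure and (y, x) not in closure
    }


# Hypothetical example: B is never compared with anything.
options = {"A+", "A", "A-", "B"}
stated = {("A+", "A"), ("A", "A-")}

closed = transitive_closure(stated)
gaps = preference_gaps(options, closed)
```

In this toy case the closure adds the single pair `("A+", "A-")`, and `B` remains unranked against each of the other three options. Closing those gaps requires extra choices that the stated preferences do not settle.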

I'm unsure what these observations add to the discussion, though.

and I would urge the author to create an actual concrete situation that doesn't seem very dumb in which a highly intelligent, powerful and economically useful system has non-complete preferences

I'd be surprised if you couldn't come up with situations where completeness isn't worth the cost - e.g. something like, to close some preference gaps you'd have to think for 100x as long, but if you close them all arbitrarily then you end up with intransitivity.

This seems like a great point. Completeness requires closing all preference gaps, but if you do that inconsistently and violate transitivity then suddenly you are vulnerable to money-pumping.
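As a rough illustration of why intransitivity invites money-pumping, here is a small Python sketch. It assumes a hypothetical agent that accepts any trade to an item it strictly prefers, even when each trade costs a fixed fee; the function name, the fee, and the option labels are all made up for the example.

```python
def run_money_pump(prefers, start, offers, fee=1):
    """Offer a sequence of trades, each costing `fee`.

    `prefers` is a set of (better, worse) pairs. Assumption: the agent
    accepts whenever it strictly prefers the offered item to its holding,
    and the fee is too small to change that preference.
    """
    holding, paid = start, 0
    for offered in offers:
        if (offered, holding) in prefers:
            holding, paid = offered, paid + fee
    return holding, paid


# Gaps closed arbitrarily into a cycle: A > B > C > A.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
cyclic_result = run_money_pump(cyclic, "A", ["C", "B", "A"])

# The same offers against a transitive ordering: A > B > C.
transitive = {("A", "B"), ("B", "C"), ("A", "C")}
transitive_result = run_money_pump(transitive, "A", ["C", "B", "A"])
```

The cyclic agent ends up holding `A` again after paying three fees, while the transitive agent declines every offer and pays nothing.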

Like, I feel like with the same type of argument that is made in the post I could write a post saying "there are no voting impossibility theorems" and then go ahead and argue that the Arrow's Impossibility Theorem assumptions are not universally proven, and then accuse everyone who ever talked about voting impossibility theorems of making "an error" since "those things are not real theorems". And I think everyone working on voting-adjacent impossibility theorems would be pretty justifiably annoyed by this.

I think that there is some sense in which the character in your example would be right, since:

  • Arrow's theorem doesn't bind approval voting.
  • Generalizations of Arrow's theorem don't bind probabilistic results, e.g., each candidate is chosen with some probability corresponding to the amount of votes he gets.

Like, suppose someone said there was "a deep core of electoral process" which means that, as electoral processes scale to important decisions, you will necessarily get "highly defective electoral processes", as illustrated by the classic example of the "dangers of the first-past-the-post system". In that case it would be reasonable to wonder whether the assumptions of the theorem bind, or whether there is some system like approval voting which is much less shitty than the theorem provers were expecting, because the assumptions don't hold.

The analogy is imperfect, though, since approval voting is a known decent system, whereas for AI systems we don't have an example friendly AI.

(if this post is right)

The post does actually seem wrong though. 

Glad that I added the caveat.

Also, the title of "there are no coherence arguments" is just straightforwardly wrong. The theorems cited are of course real theorems, they are relevant to agents acting with a certain kind of coherence, and I don't really understand the semantic argument that is happening where it's trying to say that the cited theorems aren't talking about "coherence", when like, they clearly are.

Well, part of the semantic nuance is that we don't care as much about the coherence theorems that do exist if they will fail to apply to current and future machines

IMO completeness seems quite reasonable to me and the argument here seems very weak (and I would urge the author to create an actual concrete situation that doesn't seem very dumb in which a highly intelligent, powerful and economically useful system has non-complete preferences).

Here are some scenarios:

  • Our highly intelligent system notices that to have complete preferences over all trades would be too computationally expensive, and thus is willing to accept some, even a large degree of incompleteness. 
  • The highly intelligent system learns to mimic the values of humans, who end up having non-complete preferences, which the agent mimics.
  • You train a powerful system to do some stuff, but also to detect when it is out of distribution and in that case do nothing. Assuming you can do that, their preference is incomplete, since when offered tradeoffs they always take the default option when out of distribution. 

The whole section at the end feels very confused to me. The author asserts that there is "an error" where people assert that "there are coherence theorems", but man, that just seems like such a weird thing to argue for. Of course there are theorems that are relevant to the question of agent coherence, all of these seem really quite relevant. They might not prove the things in-practice, as many theorems tend to do. 

Mmh, then it would be good to differentiate between:

  • There are coherence theorems that talk about some agents with some properties
  • There are coherence theorems that prove that AI systems as will soon exist in the future will be optimizing utility functions

You could also say a third thing, which would be: there are coherence theorems that strongly hint that AI systems as will soon exist in the future will be optimizing utility functions. They don't prove it, but they make it highly probable because of such and such. In which case having more detail on the such and such would deflate most of the arguments in this post, for me.

For instance:

Coherence arguments’ mean that if you don’t maximize ‘expected utility’ (EU)—that is, if you don’t make every choice in accordance with what gets the highest average score, given consistent preferability scores that you assign to all outcomes—then you will make strictly worse choices by your own lights than if you followed some alternate EU-maximizing strategy (at least in some situations, though they may not arise). For instance, you’ll be vulnerable to ‘money-pumping’—being predictably parted from your money for nothing.

This is just false, because it is not taking into account the cost of doing expected value maximization, since giving consistent preferability scores is just very expensive and hard to do reliably. Like, when I poll people for their preferability scores, they give inconsistent estimates. Instead, they could be doing some expected utility maximization, but the evaluation steps are so expensive that I now basically don't bother to do a more hardcore approximation of expected value for individuals, only for large projects and organizations. And even then, I'm still taking shortcuts and monkey-patches, and not doing pure expected value maximization.

“This post gets somewhat technical and mathematical, but the point can be summarised as:

  • You are vulnerable to money pumps only to the extent to which you deviate from the von Neumann-Morgenstern axioms of expected utility.

In other words, using alternate decision theories is bad for your wealth.”

The "in other words" doesn't follow, since EV maximization can be more expensive than the shortcuts.

Then there are other parts that give the strong impression that this expected value maximization will be binding in practice:

“Rephrasing again: we have a wide variety of mathematical theorems all spotlighting, from different angles, the fact that a plan lacking in clumsiness, is possessing of coherence.”

 

“The overall message here is that there is a set of qualitative behaviors and as long you do not engage in these qualitatively destructive behaviors, you will be behaving as if you have a utility function.”

 

  “The view that utility maximizers are inevitable is supported by a number of coherence theories developed early on in game theory which show that any agent without a consistent utility function is exploitable in some sense.”

 

Here are some words I wrote that don't quite sit right but which I thought I'd still share: Like, part of the MIRI beat as I understand it is to hold that there is some shining guiding light, some deep nature of intelligence that models will instantiate and make them highly dangerous. But it's not clear to me whether you will in fact get models that instantiate that shining light. Like, you could imagine an alternative view of intelligence where it's just useful monkey patches all the way down, and as we train more powerful models, they get more of the monkey patches, but without the fundamentals. The view in between would be that there are some monkey patches, and there are some deep generalizations, but then I want to know whether the coherence systems will bind to those kinds of agents.

No need to respond/deeply engage, but I'd appreciate if you let me know if the above comments were too nitpicky.

You can argue that the theorems are wrong, or that the explicit assumptions of the theorems don't hold, which many people have done, but like, there are still coherence theorems, and IMO completeness seems quite reasonable to me and the argument here seems very weak (and I would urge the author to create an actual concrete situation that doesn't seem very dumb in which a highly intelligent, powerful and economically useful system has non-complete preferences).

If you want to see an example of this, I suggest John's post here.

Related, but not so much the aim of your post: who or what is going to money pump or Dutch book a superintelligence even if the superintelligence doesn't maximize expected utility? Many money pumps and Dutch books all seem pretty contrived and unlikely to occur naturally without adversaries. So, where would the pressure to avoid them actually come from? Maybe financial markets, but do they need to generalize their aversion to exploitation in markets to all their preferences? Negotiations with humans to gain power before it kills us all? Again, do they need to generalize from these?

I guess cyclic preferences could be bad in natural/non-adversarial situations.

There are also opposite pressures: if you would have otherwise had exploitable preferences, making them non-exploitable means giving something else up, i.e. some of your preference rankings. This is also a cost, and an AGI may not be willing to pay it.

xuan

Seconded! On this note, I think the assumed presence of adversaries or competitors is actually one of the under-appreciated upshots of MIRI's work on Logical Induction (https://intelligence.org/2016/09/12/new-paper-logical-induction/). By the logical induction criterion they propose, "good reasoning" is only defined with respect to a market of traders of a particular complexity class - which can be interpreted as saying that "good reasoning" is really intersubjective rather than objective! There's only pressure to find the right logical beliefs in a reasonable amount of time if there are others who would fleece you for not doing so.

"good reasoning" is really intersubjective rather than objective! There's only pressure to find the right logical beliefs in a reasonable amount of time if there are others who would fleece you for not doing so.

This is a really interesting point that reminds me of arguments made by pragmatist philosophers like John Dewey and Richard Rorty. They also wanted to make "justification" an intersubjective phenomenon, of justifying your beliefs to other people. I don't think they had money-pump arguments in mind though.

That's why the standard prediction is not that AIs will be perfectly coherent, but that it makes sense to model them as being sufficiently coherent in practice, in the sense that e.g. we can't rely on incoherence in order to shut them down.

I guess there are acausal influence and locally multipolar (multiple competing AGI) cases, too.

I wonder if it is possible to derive expected utility maximisation type results from assumptions of "fitness" (as in, evolutionary fitness). This seems more relevant to the AI safety agenda - after all, we care about which kinds of AI are successful, not whether they can be said to be "rational".  It might also be a pathway to the kind of result AI safety people implicitly use - not that agents maximise some expected utility, but that they maximise utilities which force a good deal of instrumental convergence (i.e. describing them as expected utility maximisers is not just technically possible, but actually parsimonious). Actually, if we get the instrumental convergence then it doesn't matter a great deal if the AIs aren't strictly VNM rational.

In conclusion, I think we're interested in results like fitness -> instrumental convergence, not rationality -> VNM utility.

I largely endorse the position that a number of AI safety people have seen theorems of the latter type and treated them as if they imply theorems of the former type.

I agree fitness is a more useful concept than rationality (and more useful than an individual agent's power), so here's a document I wrote about it: https://drive.google.com/file/d/1p4ZAuEYHL_21tqstJOGsMiG4xaRBtVcj/view

It seems that your response to the money-pump argument is to give up decision-tree separability (and hence consequentialism). That amounts to a form of resolute choice, which I rebut in section 7.

Thanks for the comment! In this context, where we're arguing about whether sufficiently-advanced artificial agents will satisfy the VNM axioms, I only have to give up Decision-Tree Separability*:

Sufficiently-advanced artificial agents’ dispositions to choose options at a choice node will not depend on other parts of the decision tree than those that can be reached from that node. 

And Decision-Tree Separability* isn't particularly plausible. It’s false if any sufficiently-advanced artificial agent acts in accordance with the following policy: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ And it’s easy to see why agents might act in accordance with that policy: it makes them immune to money-pumps for Completeness.

Also, it seems as if one of the major downsides of resolute choice is that agents sometimes have to act against their preferences. But, as I argue in the post, artificial agents with incomplete preferences who act in accordance with the policy above will never have to act against their preferences.
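The policy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the option labels `A`, `A-` and `B`, the function names, and the worst-case rule that the agent accepts every trade it is permitted to accept (including trades across a preferential gap) are all choices made for the example, not anything specified in the thread.

```python
def accepts(prefers, holding, offered, turned_down, use_policy):
    """Return True if the agent trades `holding` for `offered`.

    `prefers` is a set of (better, worse) strict-preference pairs.
    """
    if (holding, offered) in prefers:
        return False  # never trade to something strictly worse
    if use_policy and any((x, offered) in prefers for x in turned_down):
        return False  # offered is strictly worse than an option already turned down
    return True  # strict preference or preferential gap: assumed to accept


def single_souring_pump(use_policy):
    """Run the two-step money pump for Completeness and return the final holding."""
    prefers = {("A", "A-")}  # B has a preferential gap with both A and A-
    holding, turned_down = "A", set()
    for offered in ["B", "A-"]:
        if accepts(prefers, holding, offered, turned_down, use_policy):
            turned_down.add(holding)  # giving up the holding counts as turning it down
            holding = offered
        else:
            turned_down.add(offered)
    return holding


naive_outcome = single_souring_pump(use_policy=False)
policy_outcome = single_souring_pump(use_policy=True)
```

Without the policy, the agent trades `A` for `B` and then `B` for `A-`, ending up with something it strictly disprefers to where it started. With the policy, the second trade is refused because `A-` is strictly dispreferred to the previously turned-down `A`, so the agent ends up with `B`.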

What you are suggesting is what I called "The Conservative Approach" to resolute choice, which I discuss critically on pages 73–74. It is not a new idea.

Note also that avoiding money pumps for Completeness cannot alone motivate your suggested policy, since you can also avoid them by satisfying Completeness. So that argument does not work (without assuming the point at issue).

Finally, I guess I don't see why consequentialism would be less plausible for artificial agents than other agents.

I didn’t mean to suggest it was new! I remember that part of your book.

Your second point seems to me to get the dialectic wrong. We can read coherence arguments as saying:

  • Sufficiently-advanced artificial agents won't pursue dominated strategies, so they'll have complete preferences.

I’m pointing out that that inference is poor. Advanced artificial agents might instead avoid dominated strategies by acting in accordance with the policy that I suggest.

I’m still thinking about your last point. Two quick thoughts:

  • It seems like most humans aren’t consequentialists.
  • Advanced artificial agents could have better memories of their past decisions than humans.

But my argument against proposals like yours is not that agents wouldn’t have sufficiently good memories. The objection (following Broome and others) is that the agents at node 2 have no reason at that node for ruling out option A- with your policy. The fact that A could have been chosen earlier should not concern you at node 2. A- is not dominated by any of the available options at node 2.

Regarding the inference being poor, my argument in the book has two parts: (1) the money pump for Completeness, which relies on Decision-Tree Separability, and (2) the defence of Decision-Tree Separability. It is (2) that rules out your proposal.

Regarding your two quick thoughts, lots of people may be irrational. So that argument does not work.

I think all of these objections would be excellent if I were arguing against this claim: 

  • Agents are rationally required to satisfy the VNM axioms.

But I’m arguing against this claim:

  • Sufficiently-advanced artificial agents will satisfy the VNM axioms.

And given that, I think your objections miss the mark. 

On your first point, I’m prepared to grant that agents have no reason to rule out option A- at node 2. All I need to claim is that advanced artificial agents might rule out option A- at node 2. And I think my argument makes that claim plausible.

On your second point, Decision-Tree Separability doesn’t rule out my proposal. What would rule it out is Decision-Tree Separability*:

sufficiently-advanced artificial agents’ dispositions to choose options at a choice node will not depend on other parts of the decision tree than those that can be reached from that node.

And whatever the merits of Decision-Tree Separability, Decision-Tree Separability* seems to me not very plausible.

On your third point, (whether or not most humans are irrational) most humans are non-consequentialists. So even if it is no more plausible that artificial agents will be non-consequentialists than humans, it can be plausible that artificial agents will be non-consequentialists. And it is relevant that advanced artificial agents could be better at remembering their past decisions than humans. That would make them better able to act in accordance with the policy that I suggest.

I might have missed something, but isn't the "solution" to the concerns about the completeness money pump equivalent to the agent becoming complete?

E.g. after the agent has chosen B over A it now effectively has a preference of B over A-.

I haven't worked this through (e.g. the proof of VNM etc.), but might this weaker notion of completeness end up being enough to still get the relevant conclusions?

(Quite busy; might have a bit more of a think about this later.)

Nice point. The rough answer is 'Yes, but only once the agent has turned down a sufficiently wide array of options.' Depending on the details, that might never happen or only happen after a very long time. 

I've had a quick think about the more precise answer, and I think it is: 

  • The agent's preferences will be functionally complete once and only once it is the case that, for all pairs of options between which the agent has a preferential gap, the agent has turned down an option that is strictly preferred to one of the options in the pair.

I had a similar thought to Shiny. Am I correct that an agent following your suggested policy (‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X’) would never *appear* to violate completeness from the perspective of an observer that could only see their decisions and not their internal state? And assuming completeness is all we need to get to full utility maximization, does that mean an agent following your policy would act like a utility maximizer to an observer?

There's a complication here related to a point that Rohin makes: if we can only see an agent's decisions and we know nothing about its preferences, all behavior can be rationalized as EU maximization.

But suppose we set that complication aside. Suppose we know this about an agent's preferences: 

  • There is some option A such that the agent strictly prefers A+$1 to A.

Then we can observe violations of Completeness. Suppose that we first offer our agent a choice between A and some other option B, and that the agent chooses A. Then we give the agent the chance to trade in A for B, and the agent takes the trade. That indicates that the agent does not strictly prefer A to B and does not strictly prefer B to A. Two possibilities remain: either the agent is indifferent between A and B, or the agent has a preferential gap between A and B.

Now we offer our agent another choice: stick with B, or trade it in for A+$1. If the agent is indifferent between A and B, they will strictly prefer A+$1 to B (because indifference is sensitive to all sweetenings and sourings), and so we will observe the agent taking the trade. If we observe that the agent doesn't take the trade, then they must have a preferential gap between A and B, and so their preferences must be incomplete.
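The three-step test above can be written as a small decision procedure. The Python sketch below is a hypothetical helper (the function name, argument names, and returned labels are made up for illustration). It assumes, as the comment does, that we already know the agent strictly prefers A+$1 to A, and it only reports what the observed choices allow us to conclude.

```python
def diagnose(chose_a_over_b, traded_a_for_b, traded_b_for_sweetened_a):
    """Classify an agent from three observed choices.

    Step 1: the agent picks A when offered A or B.
    Step 2: the agent then trades A for B.
    Step 3: the agent is offered A+$1 in exchange for B.
    Assumption: the agent is known to strictly prefer A+$1 to A.
    """
    if not (chose_a_over_b and traded_a_for_b):
        # Steps 1 and 2 together are what rule out a strict preference
        # between A and B; without both, the test does not apply.
        return "inconclusive"
    if traded_b_for_sweetened_a:
        # Taking the sweetened trade is what indifference predicts,
        # though it does not by itself rule out a gap.
        return "consistent with indifference"
    # An indifferent agent would have taken A+$1, so only a gap remains.
    return "preferential gap (incomplete preferences)"


took_sweetened = diagnose(True, True, True)
refused_sweetened = diagnose(True, True, False)
no_swap = diagnose(True, False, False)
```

Only the refusal at step 3 is decisive: it rules out indifference and so reveals incompleteness.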

Looking forward to reading this. In the meantime, I notice that this post hasn't been linked and seems likely to be relevant:

Coherence arguments do not entail goal-directed behavior by Rohin Shah

My quick take after skimming: I am quite confused about this post.
Of course the VNM theorem IS a coherence theorem.
How... could it not be a coherence theorem?

It tells you that actors following four intuitive properties can be represented as utility maximisers. We can quibble about the properties, but the result sounds important regardless for understanding agency!

The same reasoning could be applied to argue that Arrow's Impossibility Theorem is Not Really About Voting. After all, we are just introducing all these assumptions about what good voting looks like!

EJT

I would have hoped you reached the second sentence before skimming! I define what I mean (and what I take previous authors to mean) by 'coherence theorems' there.

I think your title might be causing some unnecessary consternation.  "You don't need to maximise utility to avoid domination" or something like that might have avoided a bit of confusion.

Not central to the argument, but I feel someone should be linking here to Garrabrant's rejection of the independence axiom, which is fairly compelling IMO.
