(This is a stylized version of a real conversation, where the first part happened as part of a public debate between John Wentworth and Eliezer Yudkowsky, and the second part happened between John and me over the following morning. The below is combined, stylized, and written in my own voice throughout. The specific concrete examples in John's part of the dialog were produced by me. It's over a year old. Sorry for the lag.)
(As to whether John agrees with this dialog, he said "there was not any point at which I thought my views were importantly misrepresented" when I asked him for comment.)
J: It seems to me that the field of alignment doesn't understand the most basic theory of agents, and is missing obvious insights when it comes to modeling the sorts of systems it purports to study.
N: Do tell. (I'm personally sympathetic to claims of the form "none of you idiots have any idea wtf you're doing", and am quite open to the hypothesis that I've been an idiot in this regard.)
J: Consider the coherence theorems that say that if you can't pump resources out of a system, then it's acting agent-like.
N: I'd qualify "agent-like with respect to you", if I used the word 'agent' at all (which I mostly wouldn't), and would caveat that there are a few additional subtleties, but sure.
J: Some of those subtleties are important! In particular: there's a gap between systems that you can't pump resources out of, and systems that have a utility function. The bridge across that gap is an additional assumption that the system won't pass up certain gains (in a specific sense).
Roughly: if you won't accept 1 pepper for 1 mushroom, then you should accept 2 mushrooms for 1 pepper, because a system that accepts both of those trades winds up with strictly more resources than a system that rejects both (by 1 mushroom), and you should be able to do at least that well.
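(A minimal sketch of the arithmetic behind that claim. The starting bundle and the helper `apply_trades` are illustrative assumptions, not part of the original argument; the only thing taken from the dialog is the pair of trades.)

```python
def apply_trades(holdings, trades):
    """Return the holdings that result from accepting every trade in `trades`."""
    result = dict(holdings)
    for trade in trades:
        for good, delta in trade.items():
            result[good] = result.get(good, 0) + delta
    return result


# Hypothetical starting bundle; the difference below does not depend on it.
start = {"pepper": 5, "mushroom": 5}

# "1 pepper for 1 mushroom": receive a pepper, give up a mushroom.
pepper_for_mushroom = {"pepper": +1, "mushroom": -1}
# "2 mushrooms for 1 pepper": receive two mushrooms, give up a pepper.
mushrooms_for_pepper = {"pepper": -1, "mushroom": +2}

accept_both = apply_trades(start, [pepper_for_mushroom, mushrooms_for_pepper])
reject_both = apply_trades(start, [])

# How much better off the system that accepts both trades ends up, per good.
gain = {good: accept_both[good] - reject_both[good] for good in start}
print(gain)
```

The peppers cancel and the accept-both system is ahead by exactly one mushroom, which is the "certain gain" that a system rejecting both trades passes up.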
N: I agree.
J: But some of the epistemically efficient systems around us violate this property.
For instance, consider a market for (at least) two goods: peppers and mushrooms; with (at least) two participants: Alice and Bob. Suppose Alice's utility is $U_A(p_A, m_A) = \ln p_A + 2 \ln m_A$ (where $p_A$ and $m_A$ are the quantities of peppers and mushrooms owned by Alice, respectively), and Bob's utility is $U_B(p_B, m_B) = 2 \ln p_B + \ln m_B$ (where $p_B$ and $m_B$ are the quantities of peppers and mushrooms owned by Bob, respectively).
Example equilibrium: the price is 3 peppers for 1 mushroom. Alice doesn't trade at this price when she has $3 \, \frac{\partial U_A}{\partial p_A} = \frac{\partial U_A}{\partial m_A}$, i.e. $3 \, \frac{d}{dp_A} \ln p_A = 2 \, \frac{d}{dm_A} \ln m_A$, i.e. $3 / p_A = 2 / m_A$ (using the fact that $\frac{d}{dx} \ln x = 1/x$), i.e. when Alice has 1.5 times as many peppers as she has mushrooms. Bob doesn't trade at this price when he has 6 times as many peppers as mushrooms, by a similar argument. So these prices can be an equilibrium whenever Alice has 1.5x as many peppers as mushrooms, and Bob has 6x as many peppers as mushrooms (regardless of the absolute quantities).
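(A small numeric check of the equilibrium condition, under an assumption: that the utilities are weighted sums of logarithms, with Alice weighting mushrooms twice as heavily as peppers and Bob the reverse. Those weights are picked to match the 1.5x and 6x ratios stated above; the function name and the tuples `ALICE` and `BOB` are hypothetical.)

```python
def mushroom_price_in_peppers(w_pepper, w_mushroom, peppers, mushrooms):
    """Marginal rate of substitution for U = w_pepper*ln(peppers) + w_mushroom*ln(mushrooms).

    This is dU/d(mushrooms) divided by dU/d(peppers), i.e.
    (w_mushroom / mushrooms) / (w_pepper / peppers): how many peppers
    one extra mushroom is worth to the holder at the margin.
    """
    return (w_mushroom * peppers) / (w_pepper * mushrooms)


# Assumed (pepper weight, mushroom weight) pairs; only their ratio matters.
ALICE = (1, 2)
BOB = (2, 1)

# Holdings with 1.5x (Alice) and 6x (Bob) as many peppers as mushrooms,
# at two very different scales.
alice_small = mushroom_price_in_peppers(*ALICE, peppers=3, mushrooms=2)
alice_large = mushroom_price_in_peppers(*ALICE, peppers=30_000, mushrooms=20_000)
bob_small = mushroom_price_in_peppers(*BOB, peppers=6, mushrooms=1)
bob_large = mushroom_price_in_peppers(*BOB, peppers=120_000, mushrooms=20_000)

print(alice_small, alice_large, bob_small, bob_large)
```

Under these assumed utilities, every one of the four holdings values a marginal mushroom at 3 peppers, so nobody wants to trade at that price whatever the absolute quantities are.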
Now consider offering the market a trade of 25,000 peppers for 10,000 mushrooms. If Alice has 20,000 mushrooms (and thus 30,000 peppers), and Bob has only 1 mushroom (and thus 6 peppers), then the trade is essentially up to Alice. She'd observe that
so she (and thus, the market as a whole) would accept. But if Bob had 20,000 mushrooms (and thus 120,000 peppers), and Alice had only 2 mushrooms (and thus 3 peppers), then the trade is essentially up to Bob. He'd observe
so he wouldn't take the trade.
Thus, we can see that whether a market — considered altogether — takes a trade, depends not only on the prices in the market (which you might have thought of as a sort of epistemic state, and that you might have noted was epistemically efficient with respect to you), but also on the hidden internal state of the market.
N: Sure. The argument was never "every epistemically efficient (wrt you) system is an optimizer", but rather "sufficiently good optimizers are epistemically efficient (wrt you)".
J: You might be missing the point here. You can appeal to "you can't pump money out of the system" to get a type of weak efficiency, but you MIRI folk seem to think that "can't pump money out" arguments also imply a form of strong efficiency, that things like markets lack.
N: I agree that "you can't pump money out" does not suffice for a utility function, and that you need an additional "it doesn't pass up free money" constraint to bridge the gap. And I concede that I sometimes use "you can't pump money out of it" as a pointer to a larger cluster of criteria, and sometimes in a way that elides a real distinction, and sometimes because I haven't been tracking that distinction carefully in my own head.
...for the record, though, I expect AI systems that are capable of ending the acute risk period, to not pass up on free valuables. So I admit I'm not sure why you're focused on this distinction as important. Do you think we have a disagreement beyond the point about me sometimes playing fast-and-loose with coherence constraints?
J: I still suspect you're missing the point. In real life, systems that control valuable resources tend to have the property that you can't pump resources out of them, for the obvious competitive reason. But there's no similarly compelling reason for things to avoid having path-dependence in their preferences, as you'd find in a market.
N: Hold up. The obvious reason for optimizing systems to avoid path-dependent preferences is so that they avoid passing up certain gains. A property I expect a market made of sufficiently competent participants to possess.
J: In which case, we have a disagreement, yeah. Which... well, I proved my point a few paragraphs ago, and you seemed to agree? So I admit I'm confused by your position.
To refine my own position: aggregate systems of agents do not in general act like an agent. That's what I've been trying to say, here — the condition that you can expect lots of systems to possess in real life is a weak ("no money pump") efficiency property. The strong efficiency property ("takes certain gains") is much rarer, and is lost the moment you start aggregating agents.
(Indeed, once you notice that you shouldn't think of a market as an agent in the VNM sense, you're only one step away from the conclusion that you shouldn't think of the constituent parts as agents either. When you look closely, even the market participants are probably better modeled as weakly-efficient market-like systems rather than as agents.)
N: Woah, hold your horses there. Optimizing systems that are epistemically and instrumentally efficient wrt you (which I suppose I could suffer calling 'agents' in this context) totally aggregate into other agents.
J: ...have you failed to internalize my argument from above?
N: Nobody ever said that a collection $P$ of participants, each with preferences over a set $G$ of goods, has to aggregate into an agent that also has preferences over $G$. The obvious guess is that the market aggregates into something more like an agent with preferences over functions from $P$ to $G$, i.e., over different assignments of goods to each participant in the market.
Like, it's completely respectable for the market to value "Alice gains 25,000 peppers but loses 10,000 mushrooms" differently from "Bob gains 25,000 peppers but loses 10,000 mushrooms". It will look like it has a bunch of hidden state about who has which resources when you consider only the resource totals, but that's just an artifact of looking at the wrong outcome space.
(Aside: I have a better sense now of why I might want some sort of "epistemic efficiency implies instrumental efficiency" argument: intuitively, this seems like the sort of thing you might want when looking at optimizers that are themselves aggregates of smaller optimizers. Which is an update for me; thanks. More specifically, I was not tracking the way that markets should look instrumentally coherent, and my "epistemic efficiency isn't necessarily supposed to imply instrumental efficiency" argument now seems off-base.)
J: OK, yes, you can think of the market as having preferences over assignments of goods to participants, but then you still lose agency. For example, consider the trade (in that extended outcome space) "Alice loses 1 pepper, Bob gains 2 mushrooms", and the trade "Alice gains 3 peppers, Bob loses 2 mushrooms". The market won't take either of these trades, because Alice isn't willing to take the former, and Bob isn't willing to take the latter. And this is exactly an instance of an aggregate system that satisfies weak efficiency (it has no preference cycles / you can't pump money out of it), but violates strong efficiency (the order-dimension of its preference graph is not 1 / it passes up certain gains). Aggregates of agents aren't agents!
N: Oh, yeah, I flatly deny that.
J: It's... a theorem?
N: It might be a theorem in Earth!economics, about Earth!economics!rational agents who use broken decision theories. It's false in real life, and when aggregating agents smart enough to use better decision theories.
J: I'm not sure how you expect to swing that. Utility functions are defined only up to affine transformations, so you can't just say "do whatever leads to the higher aggregate utility". I don't see how you'd break the symmetry between different offers, while respecting the invariance of utility functions up to affine transformations.
N: Sure, I'm not saying you can figure out how to aggregate a bunch of agents into a superagent by looking at their utility functions alone! You need to take some other structure about the agents into account, such as information about what each agent thinks is fair. (See, e.g., the relevant dath ilani papers.)
J: Ah. Hmm.
N: To spell it out more precisely: what happens in real life is that Alice and Bob accept both trades, and then Alice gives Bob a pepper, and now they've achieved the certain gain of +1 pepper apiece. Or, more generally, they do something that is at least that good for both of them within the constraints of what they agree is fair — because why would either settle for a strategy that does worse than that? In real life, aggregate agents don't pass up certain gains of valuable resources, because they value the resources.
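(A minimal sketch of that bookkeeping, using the two trades J proposed plus the one-pepper side payment. The dictionary encoding and the helper `net_change` are illustrative assumptions.)

```python
from collections import defaultdict


def net_change(*transfers):
    """Sum per-(participant, good) changes across transfers, dropping zeros."""
    total = defaultdict(int)
    for transfer in transfers:
        for key, delta in transfer.items():
            total[key] += delta
    return {key: delta for key, delta in total.items() if delta != 0}


# The two offers from the dialog, in the extended outcome space.
first_trade = {("Alice", "pepper"): -1, ("Bob", "mushroom"): +2}
second_trade = {("Alice", "pepper"): +3, ("Bob", "mushroom"): -2}
# The side-channel transfer: Alice hands Bob one pepper.
side_payment = {("Alice", "pepper"): -1, ("Bob", "pepper"): +1}

without_side_payment = net_change(first_trade, second_trade)
with_side_payment = net_change(first_trade, second_trade, side_payment)

print(without_side_payment)
print(with_side_payment)
```

Without the side payment, accepting both trades leaves Alice up two peppers and Bob exactly where he started, which is why neither trade clears on its own. With it, each of them ends up one pepper ahead.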
J: I... see.
J: OK, well, note that this assumes that side-channel trades can occur, at prices other than the market prices.
N: Yeah. Side channels that the agents would fight to establish, so that they can take advantage of certain gains.
Although, of course, logical decision theorists don't need to be able to make side-trades to accept such bets, and they'll keep taking advantage of certain gains even if you forbid such trades. Like, if Alice and Bob have common knowledge that the market is either going to be offered the trade "Alice gains $1,000,000; Bob loses $1" or the trade "Alice loses $1; Bob gains $1,000,000", with equal probability of each, and they're not allowed to trade between themselves, then they can (and will, if they're smart) simply agree to accept whichever trade they're presented (because this joint strategy makes them both significantly richer in expectation).
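(A minimal sketch of the expected-value comparison, treating dollars as the thing being maximized. The two policies and their names are illustrative assumptions: `accept_everything` is the joint strategy described above, and `unanimous_consent` stands in for a market where any participant who would lose on a given offer vetoes it.)

```python
from fractions import Fraction

# The two equally likely offers, as (probability, dollar payoffs).
OFFERS = [
    (Fraction(1, 2), {"Alice": 1_000_000, "Bob": -1}),
    (Fraction(1, 2), {"Alice": -1, "Bob": 1_000_000}),
]


def expected_gains(market_accepts):
    """Expected dollar gain per participant under a given acceptance policy."""
    totals = {"Alice": Fraction(0), "Bob": Fraction(0)}
    for probability, payoffs in OFFERS:
        if market_accepts(payoffs):
            for name, amount in payoffs.items():
                totals[name] += probability * amount
    return totals


def accept_everything(payoffs):
    """Joint strategy: take whichever offer shows up."""
    return True


def unanimous_consent(payoffs):
    """An offer goes through only if nobody loses on that particular offer."""
    return all(amount >= 0 for amount in payoffs.values())


joint = expected_gains(accept_everything)
veto = expected_gains(unanimous_consent)

print(joint)
print(veto)
```

Under the veto policy both offers are blocked and everyone expects exactly nothing, while the joint strategy gives each of them an expected gain of $499,999.50, with no side-trade required.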
Again, these are smart folk who value resources. You can argue all you want about how they shouldn't be able to get the extra money, but don't count on those arguments holding up.
Like, there's a basic mental technique here of asking "if the participants in the market were all actually as smart as me and deeply driven to get more goods, could they somehow find some way to wind up richer?" Argue as you may, the aggregate markets of dath ilan still won't pass up certain gains of valuable resources that you can easily point out.
J: This... is something of an update for me, I admit.
J: Although, I note that, for all your high-falutin arguments about dath ilan, none of them are going to convince the pancreas and the thalamus to start making side-channel ATP trades, at price-analogs that differ from the equilibrium.
N: Indeed. It's no coincidence that artificial intelligence is an X-risk, and pancreases are not.
J: Ah, but the pancreas/thalamus interaction can tell us something about intelligence. As can the study of the energy market (of sorts) inside a bacterium. As can Earth's best financial markets, populated as they are by poorly-coordinated constituents who might not decide to use logical decision theory even if they knew they had the option. Agents might be normative, but descriptively, the world is full of systems that you merely can't pump money out of.
N: OK, sure. But that's a vastly weaker claim. I happily endorse the descriptive claim "lots of modern optimizers are even worse at taking certain gains, than they are at avoiding money-pumps."
(I don't see this as particularly relevant to alignment research, but I believe it.)
J: I see it as quite relevant to alignment research! I'm hoping to learn quite a bit from bacteria and markets, that generalizes to humans and artificial intelligences.
N: That sounds like a disagreement for another day. As for today, I'll settle for you retreating from the position that I'm lacking a basic understanding of the objects I wish to study, and from the position that intelligent agents don't aggregate into an agent (and, relatedly, from the position that 'markets' beat 'agents' as a model of capable optimizers).
J: I'll just point out again that, if you want to learn what human values are, you need a descriptive theory of humans rather than a normative theory of intelligence, and I still bet that weak-efficiency is a better descriptive model of most humans than agenthood.
This example (of two cases where a market's decision about a trade differs depending on hidden state) relies on the initial wealth distributions being unequal. Legends hold that there are other examples where the hidden state doesn't depend on initial differences, if the utilities aren't logarithmic. John Wentworth tells me he cares in practice about this additional fact, and notes that further information can be found in the literature under the heading of "non-existence of representative agents". I have not myself constructed such an example, and would be interested if someone has a simple one.
Hopefully there are canonical solutions. For instance, in an ultimatum game, the Schelling fair point is that both participants get utility halfway between their best and worst deals, which solution is invariant under affine transformation. Knowing that agents are willing to accept these canonical solutions as 'fair' does not seem like a large additional burden of knowledge.
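(A minimal sketch of the invariance claim only, for a single participant. The 100-cent pie, the square-root utility, and the helper `halfway_deal` are all illustrative assumptions; the sketch does not address whether both participants can hit their halfway points at once.)

```python
import math


def halfway_deal(utility, deals):
    """Pick the deal whose utility is closest to halfway between best and worst."""
    values = [utility(deal) for deal in deals]
    target = (max(values) + min(values)) / 2
    return min(deals, key=lambda deal: abs(utility(deal) - target))


# Hypothetical deals: this participant's share, in cents, of a 100-cent pie.
deals = range(0, 101)


def original(share):
    """An assumed concave utility over the participant's share."""
    return math.sqrt(share)


def rescaled(share):
    """The same preferences after a positive affine transformation."""
    return 3 * math.sqrt(share) + 7


pick_original = halfway_deal(original, deals)
pick_rescaled = halfway_deal(rescaled, deals)

print(pick_original, pick_rescaled)
```

Both representations select the same share, because the halfway target moves under the affine transformation in exactly the way the utilities do.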