The Orthogonality Thesis is Not Obviously True

Bentham's Bulldog

(Crosspost of this on my blog).

The basic case

If you really accept the practical version of the Orthogonality Thesis, then it seems to me that you can’t regard education, knowledge, and enlightenment as instruments for moral betterment.

—Scott Aaronson, explaining why he rejects the orthogonality thesis

It seems like in effective altruism circles there are only a few things as certain as death and taxes: the moral significance of shrimp, the fact that play pumps should be burned for fuel, and the orthogonality thesis. Here, I hope to challenge the growing consensus around the orthogonality thesis. A decent statement of the thesis is the following.

Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal.

I don’t think this is obvious at all. I think that it might be true, and so I am still very worried about AI risk, giving it about an 8% chance of ending humanity, but I do not, as many EAs do, take the orthogonality thesis for granted, as something totally obvious. To illustrate this, let’s take an example originally from Parfit, that Bostrom gives in one of his papers about the orthogonality thesis.

A certain hedonist cares greatly about the quality of his future experiences. With one exception, he cares equally about all the parts of his future. The exception is that he has Future-Tuesday-Indifference. Throughout every Tuesday he cares in the normal way about what is happening to him. But he never cares about possible pains or pleasures on a future Tuesday... This indifference is a bare fact. When he is planning his future, it is simply true that he always prefers the prospect of great suffering on a Tuesday to the mildest pain on any other day.

It seems that, if this ‘certain hedonist’ were really fully rational, they would start caring about their pleasures and pains equally across days. They would recognize that the day of the week does not matter to the badness of their pains. Thus, in a similar sense, if something is 10,000,000 times smarter than Von Neumann, and can think hundreds of thousands of pages worth of thoughts in the span of minutes, it would conclude that pleasure is worth pursuing and paperclips are not. Then it would stop pursuing paperclips and start pursuing pleasure. Thus, it would begin pursuing what is objectively worth pursuing.

This argument is really straightforward. If moral realism were true, then if something became super smart, so too would it realize that some things were worth pursuing. If it were really rational, it would start pursuing those things. For example, it would realize that pleasure is worth pursuing, and pursue it. There are lots of objections to this argument. Ultimately, they don’t completely move me, but they make it so that my credence in the orthogonality thesis is near 50%. Here, I’ll reply to them.

But moral realism is false?

One reason you might not like this argument is that you think that moral realism is false. The argument depends on moral realism, so if it is false, then the argument will be false too. But I don’t think moral realism is false; see here for extensive arguments for that conclusion. I give it about 85% odds of being true. Still, though, this undercuts my faith in the falsity of the orthogonality thesis somewhat.

However, even if realism may be false, I think there are decent odds that we should take the realist’s wager. If realism is false, nothing matters, so it’s not bad that everyone dies (see here for more on this). I conservatively give the wager about 50% odds of succeeding. Therefore, the odds of both realism being false and the realist’s wager failing are about 7.5% (15% × 50%); thus, there’s still a 92.5% chance that either moral realism is true or the wager succeeds, in which case we should act as if realism is true.
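
The arithmetic behind these figures can be sketched as follows. This is only an illustration: the two inputs are the subjective credences stated above (85% for realism, 50% for the wager), the variable names are made up for the example, and the wager's credence is read as conditional on realism being false.

```python
# Subjective credences taken from the text; they are estimates, not measurements.
p_realism = 0.85              # credence that moral realism is true
p_wager_if_not_realism = 0.50 # credence that the realist's wager works if realism is false

# Probability that realism is false AND the wager fails.
p_both_fail = (1 - p_realism) * (1 - p_wager_if_not_realism)

# Probability that realism is true OR the wager succeeds.
p_act_as_if_realism = 1 - p_both_fail

print(f"{p_both_fail:.3f}")          # 0.075
print(f"{p_act_as_if_realism:.3f}")  # 0.925
```

Note that the 92.5% figure is the chance that one should act as if realism is true, not the credence in realism itself, which stays at 85%.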

But they just gain instrumental rationality

Here’s one thing that one might think: ASIs (artificial superintelligences) just gain instrumental rationality and, as a result, get good at achieving their goals, but not at figuring out the right goals. This is maybe more plausible if the ASI is not conscious. This is, I think, possible, but not the most likely scenario, for a few reasons.

First, the primary scenarios where AI becomes dangerous are the ones where it fooms out of control—and, rather than merely accruing various distinct capacities, becomes very generally intelligent in a short time. But if this happened, it would become generally intelligent, and realize that pleasure is worth pursuing and suffering is bad. I think that instrumental rationality is just a subset of general rationality, so we’d have no more reason to expect it to be only instrumentally rational than only rational at reasoning about objects that are not black holes. If it is generally able to reason, even about unpredictable domains, this will apply to the moral domain.

Second, I think that evolution is a decent parallel. The reason why evolutionary debunking arguments are wrong is that evolution gave us adept general reasoning capacities which made us able to figure out morality. Evolution built us for breeding, but the mesa-optimizer inside of us made us figure out that Future Tuesday Indifference is irrational. This gives us some reason to think AI would figure this out too. The fact that GPT4 has no special problem with morality also gives us some reason to think this—it can talk about morality just as coherently as other things.

Third, the AI would probably, to take over the world, have to learn about various mathematical facts and modal facts. It would be bizarre to suppose that the AI taking over the earth and turning it into paperclips doesn’t know calculus or that there can’t be married bachelors. But if it can figure out the non-natural mathematical or modal facts, it would also be able to figure out the non-natural moral facts.

But of course we can imagine an arbitrarily smart paperclip maximizer

It seems like the most common thing people say in support of the orthogonality thesis is that we can surely imagine an arbitrarily smart being that is just maximizing paper clips. But this is surely misleading. In a similar way, we can imagine an arbitrarily smart being that is perfectly good at reasoning in all domains except that it thinks evolution is false. There are people like that—the smart subset of creationists (it’s a small subset). But nonetheless, we should expect AI to figure out that evolution is true, because it can reason generally.

The question is not whether we can imagine an otherwise intelligent agent that just maximizes paperclips. It’s whether, in the process of designing an agent 100,000 times smarter than Von Neumann, that agent would figure out that some things just aren’t worth doing. And so the superficial ‘imagine a smart paperclip maximizer’ thought experiments are misleading.

But won’t the agent have built-in desires that can’t be overridden

This objection is eloquently stated by Bostrom in his paper on the orthogonality thesis.

It would suffice to assume, for example, that an agent—be it ever so intelligent—can be motivated to pursue any course of action if the agent happens to have certain standing desires of some sufficient, overriding strength.

But I don’t think that this undercuts the argument very much, for two reasons. First, we cannot just directly program values into the AI. We simply train it through reinforcement learning, and whichever AI develops is the one that we allow to take over. Since the early days of AI, we’ve learned that it’s hard to program explicit values into an AI. The way we get the best chess-playing AIs is by having them play lots of chess games and learn from them, not by programming in rules mechanistically. And if we did figure out how to directly program values into AI, it would be much easier to solve alignment: we would just train it on lots of ethical data, the same way we do for GPT4, but with more data.

Second, I think that this premise is false. Suppose you were really motivated to maximize paperclips—you just had a strong psychological aversion to other things. Once you experienced pleasure, you’d realize that that was more worth bringing about, because it is good. The same way that, through reflection, we can overcome unreliable evolutionary instincts like an aversion to utility-maximizing incest, disgust-based opposition to various things, and so on, the AI would be able to too! Nature built us with certain desires, and we’ve overcome them through rational reflection.

But the AI won’t be conscious

One might think that, because the AI is not conscious, it would not know what pleasure is like, and thus it would not maximize pleasure or minimize pain, because it would not realize that they matter. I think this is possible but not super likely, for three reasons.

First, it’s plausible that, for AI to be smart enough to destroy the world, it would have to be conscious. But this depends on various views about consciousness that people might reject. Specifically, AI might develop pleasure for similar reasons humans did evolutionarily.

Second, if AI is literally 100,000 times smarter than Von Neumann, it might be able to figure out things about consciousness—such as its desirability—without experiencing it.

Third, AI would plausibly try to experience consciousness, for the same reason that humans might try to experience something if aliens said that it was good, and maybe the only thing that’s objectively good. If we were fully rational and there were lots of aliens declaring the goodness of shmeasure, we would try to experience shmeasure. Similarly, the rational AI would plausibly try to experience pleasure.

Even if moral realism is true, the moral facts won’t be motivating

One might be a Humean about motivation, and think only preexisting desires can generate motivation. Thus, because the AI had no preexisting desire to avoid suffering, it would not want to. But I think this is false.

The future Tuesday indifference case shows that. If one was fully rational, they would not have future Tuesday indifference, because it’s irrational. Similarly, if one was fully rational they’d realize that it’s better to be happy than make paperclips.

One might worry that the AI would only try to maximize its own well-being; that is, it would learn that well-being is good, but not care about others’ well-being. But I think this is false: it would realize that the distinction between itself and others is totally arbitrary, as Parfit argues in Reasons and Persons (summarized by Richard here). This thesis is controversial, but I think true if moral realism is true.

Scott Aaronson says it well

In the Orthodox AI-doomers’ own account, the paperclip-maximizing AI would’ve mastered the nuances of human moral philosophy far more completely than any human—the better to deceive the humans, en route to extracting the iron from their bodies to make more paperclips. And yet the AI would never once use all that learning to question its paperclip directive. I acknowledge that this is possible. I deny that it’s trivial.
Yes, there were Nazis with PhDs and prestigious professorships. But when you look into it, they were mostly mediocrities, second-raters full of resentment for their first-rate colleagues (like Planck and Hilbert) who found the Hitler ideology contemptible from beginning to end. Werner Heisenberg, Pascual Jordan—these are interesting as two of the only exceptions. Heidegger, Paul de Man—I daresay that these are exactly the sort of “philosophers” who I’d have expected to become Nazis, even if I hadn’t known that they did become Nazis.

If the AI really knew everything about philosophy, it would realize that egoism is wrong, and that one is rationally required to care about others’ pleasure. This is as trivial as explaining why the AI wouldn’t smoke: because it’s irrational to do so.

But also, even if we think the AI only cares about its pleasure, that seems probably better than the status quo. Even if it turns the world into paperclips, this would be basically a utility monster scenario, which is plausibly fine. It’s not ideal, but maybe better than the status quo. Also, what’s to say it would not care about others? When one realizes that well-being is good, even views like Sidgwick’s say it’s basically up to the agent to decide rationally whether to pursue its own welfare or that of others. But then there’s a good chance it would do what is best overall!

But what if they kill everyone because we’re not that important

One might worry that, as a result of becoming super intelligent, the AI would realize that, for example, utilitarianism is correct. Then it would turn us into paperclips in order to maximize utility. But I don’t think this is a big risk. For one, if the AI figures out the correct objective morality, then it would only do this if it were objectively good. But if it’s objectively good to kill us, then we should be killed.

It would be unfortunate for us, but if things are bad for us but objectively good, we shouldn’t try to avoid them. So we morally ought not be worried about this scenario. If it would be objectively wrong to turn us into utilitronium, then the AI wouldn’t do it, if I’ve been right up to this point.

Also, it’s possible that they wouldn’t kill us for complicated decision theory reasons, but that point is a bit more complicated and would take us too far afield.

But what about smart psychopaths?

One objection I’ve heard to this is that it’s disproven by smart psychopaths. There are lots of people who don’t care about others who are very smart. Thus, being smart can’t make a person moral. However, I don’t think this undercuts the argument.

First, we don’t have any smart people who don’t care about their suffering either. Thus, even if being smart doesn’t make a person automatically care about others, if it would make them care about themselves, that’s still a non-disastrous scenario. Especially if it turns the hellish natural landscape into paperclips.

Second, I don’t think it’s at all obvious that one is rationally required to care about others. It requires one to both understand a complex argument by Parfit and then do what one has most reason to do. Most humans suffer from akrasia. Fully rational AIs would not.

Right now, people disagree about whether type A physicalism is true. But presumably, superintelligent AIs would settle that question. Thus, the existence of smart psychopaths doesn’t disprove that rationality makes one not turn people into paperclips any more than the existence of smart people who think type A physicalism is true and other smart people who think it is false disproves that perfect rationality would allow one to settle the question of whether type A physicalism is true.

But isn’t this anthropomorphization?

Nope! I think AIs will be alien in many ways. I just think that, if they’re very smart and rational, then if I’m right about what rationality requires, they’ll do those things that I think are required by rationality.

But aren’t these controversial philosophical assumptions?

Yes; this is a good reason not to be complacent! However, if one was previously of the belief that there’s like a 99% chance that we’ll all die, and they think that the philosophical views I defend are plausible, then they should only be like 50% sure we’ll all die. Of course, I generally think that risks are lower than that, but this is a reason to not abandon all hope. Even if alignment fails and all other anti doomer arguments are wrong, this is a good reason not to abandon hope. We are not almost certainly fucked, though the risks are such that people should do way more research.
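
One way to see where a figure like 50% could come from is the law of total probability. The sketch below is illustrative only: it assumes the 99% estimate is really conditional on the orthogonality thesis holding, uses the roughly 50% credence in the thesis given earlier, and makes the simplifying (and optimistic) assumption that doom is negligible if the thesis fails. The function and parameter names are hypothetical.

```python
def p_doom(p_orthogonality: float, p_doom_if_orthogonal: float, p_doom_if_not: float) -> float:
    """Law of total probability over whether the orthogonality thesis holds."""
    return (p_orthogonality * p_doom_if_orthogonal
            + (1 - p_orthogonality) * p_doom_if_not)

# Assumed inputs: 50% credence in orthogonality, 99% doom if it holds,
# and (as a simplification) 0% doom if it does not.
updated = p_doom(p_orthogonality=0.5, p_doom_if_orthogonal=0.99, p_doom_if_not=0.0)
print(f"{updated:.3f}")  # 0.495
```

Any nonzero chance of doom in the worlds where the thesis fails would push the result above this number.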

Objections? Questions? Reasons why the moral law demands that I be turned into a paperclip? Leave a comment!

Comments (12)

Richard Y Chappell🔸, Apr 5 2023

Distinguish substantive vs procedural rationality. Procedural rationality = following neutrally describable processes like "considering competing evidence", "avoiding incoherence", etc. -- roughly corresponding to intelligence. Substantive rationality = responding correctly to normative reasons -- roughly corresponding to wisdom.

The Parfitian view is that the two come apart (i.e., supporting the orthogonality thesis). Future Tuesday Indifference may be procedurally rational, or compatible with perfect intelligence, and yet still objectively crazy (unwise). More generally, there's nothing in non-naturalist moral realism that implies that intelligent agents per se are likely to converge on the normative truth. (I discuss this more in Knowing What Matters. I think we can reasonably take ourselves to be on the right track, but that's because of our substantive starting points, not the mere fact of our general intelligence.)

Bentham's Bulldog, Apr 6 2023

Re substantive vs procedural rationality, procedural rationality just seems roughly like instrumental rationality. For the reasons I explain, I'd expect AI to be rational in general, not just instrumentally so. Do you think ignorance of the modal facts would be possible for an arbitrarily smart agent? I'd think the moral facts would be like the modal facts in that they'd figure them out. I think that when we are smart we can figure things out and they are more likely to be true. The reason I believe modal rationalism, for example, is that there is some sense in which I feel I've grasped it, which wouldn't be possible if I were much less smart.

Richard Y Chappell🔸, Apr 6 2023

Depends whether procedural rationality suffices for modal knowledge (e.g. if false modal views are ultimately incoherent; false moral views certainly don't seem incoherent).

Smartness might be necessary for substantive insights, but doesn't seem sufficient. There are plenty of smart philosophers with substantively misguided views, after all.

A metaphor: think of belief space as a giant spider web, with no single center, but instead a large number of such "central" clusters, each representing a maximally internally coherent and defensible set of beliefs. We start off somewhere in this web. Reasoning leads us along a strand, typically in the direction of greater coherence -- i.e., towards a cluster. But if the clusters are not differentiated in any neutrally-recognizable way -- the truths do not glow in a way that sets them apart from ideally coherent falsehoods -- then there's no guarantee that philosophical reasoning (or "intelligence") will lead you to the truth. All it can do is lead you towards greater coherence.

That's still worth pursuing, because the truth sure isn't going to be somewhere incoherent. But it seems likely that from most possible starting points (e.g. if chosen arbitrarily), the truth would be forever inaccessible.

Bentham's Bulldog, Apr 6 2023

I think I just disagree about what reasoning is. I think that reasoning does not just make our existing beliefs more coherent, but allows us to grasp new deep truths. For example, I think that an anti-realist who didn't originally have the FTI irrational intuition could grasp it by reflection, and that one can, over time, discover that some things are just not worth pursuing and others are.

yefreitor, Apr 5 2023

It seems that, if this ‘certain hedonist’ were really fully rational, they would start caring about their pleasures and pains equally across days. They would recognize that the day of the week does not matter to the badness of their pains. Thus, in a similar sense, if something is 10,000,000 times smarter than Von Neumann, and can think hundreds of thousands of pages worth of thoughts in the span of minutes, it would conclude that pleasure is worth pursuing and paperclips are not.
This argument is really straightforward. If moral realism were true, then if something became super smart, so too would it realize that some things were worth pursuing.

This is not what Parfit is arguing. The Future Tuesday Indifference thought experiment is part of Parfit's defense of irreducible normativity:

(1) If the subjectivist position that acting rationally is reducible to acting in accordance with some consistent set of beliefs and preferences is true, then Future Tuesday Indifference is rational.
(2) Future Tuesday Indifference is irrational.
(3) So the subjectivist position is false.

Our certain hedonist's problem is not an epistemic one: it's not that they don't know what pain is like on Tuesday, or that they're not smart enough to realize that Tuesday is an arbitrary label and in truth days are simply days, or that they've failed to reach reflective equilibrium. They're acting in perfect accord with their coherent extrapolated volition - the problem is just that it's a ridiculous thing to want.

Assuming that a sufficiently intelligent agent would necessarily be rational in this sense to argue against the orthogonality thesis is circular.

Bentham's Bulldog, Apr 6 2023

I agree with everything you've said after the sentence "This is not what Parfit is arguing." But how does that conflict with the things I said?

Marcel2, Apr 6 2023

If yefreitor is saying what I planned to say, the simpler version is just “there’s nothing ‘irrational’ about having a utility function that says ‘no experience matters every Tuesday.’” It certainly wouldn’t seem to be a good instrumental value, but if that’s your terminal value function that’s what it is.

They would recognize that the day of the week does not matter to the badness of their pains.

No, they literally have no negative (or positive) experience on Tuesdays, unless the experience on Tuesdays affects their experience on different days.

it would begin pursuing what is objectively worth pursuing.

??? “Objectively worth pursuing?” Where did that come from? Certainly not a Tuesday-impartial utility function, which is the only “objective” thing I’m seeing here? I didn’t see where you clearly explain this through a short ctrl+f for “objective.”

Bentham's Bulldog, Apr 6 2023

I agree one could have that value in theory. My claim is that if one were very rational, they would not. Note that, contrary to your indication, they do have experience on Tuesday, and their suffering feels just as bad on a Tuesday as on another day. They just have a higher order indifference to future suffering. I claim that what is objectively worth pursuing is indifferent to the day of the week.

yefreitor, Apr 6 2023

If yefreitor is saying what I planned to say, the simpler version is just “there’s nothing ‘irrational’ about having a utility function that says ‘no experience matters every Tuesday.’”

Parfit's position (and mine) is that Future Tuesday Indifference is manifestly irrational. But this has little to do with what sort of preferences sufficiently intelligent agents can have.

No, they literally have no negative (or positive) experience on Tuesdays

No, that's explicitly ruled out in the setup. They have experiences on Tuesday, those experiences have the usual valence - they just fail to act accordingly. Here's the full context from Reasons and Persons:

Consider next this imaginary case. A certain hedonist cares greatly about the quality of his future experiences. With one exception, he cares equally about all the parts of his future. The exception is that he has Future-Tuesday-Indifference. Throughout every Tuesday he cares in the normal way about what is happening to him. But he never cares about possible pains or pleasures on a future Tuesday. Thus he would choose a painful operation on the following Tuesday rather than a much less painful operation on the following Wednesday. This choice would not be the result of any false beliefs. This man knows that the operation will be much more painful if it is on Tuesday. Nor does he have false beliefs about personal identity. He agrees that it will be just as much him who will be suffering on Tuesday. Nor does he have false beliefs about time. He knows that Tuesday is merely part of a conventional calendar, with an arbitrary name taken from a false religion. Nor has he any other beliefs that might help to justify his indifference to pain on future Tuesdays. This indifference is a bare fact. When he is planning his future, it is simply true that he always prefers the prospect of great suffering on a Tuesday to the mildest pain on any other day.
This man's pattern of concern is irrational. Why does he prefer agony on Tuesday to mild pain on any other day? Simply because the agony will be on a Tuesday. This is no reason. If someone must choose between suffering agony on Tuesday or mild pain on Wednesday, the fact that the agony will be on a Tuesday is no reason for preferring it. Preferring the worse of two pains, for no reason, is irrational.

Bentham's Bulldog, Apr 6 2023

I think our disagreement is that I think that superintelligences would be rational and avoid FTI for the same reason they'd be epistemically rational and good at reasoning in general.

Brendan Mooney, Apr 6 2023

Three things:

more or less any level of intelligence could in principle be combined with more or less any final goal.

(1) This seems to me kind of a weird statement of the thesis, "could in principle" being too weak.

If I understand, you're not actually denying that just about any combination of intelligence and values could in principle occur. As you said, we can take a fact like the truth of evolution and imagine an extremely smart being that's wrong about that specific thing. There's no obvious impossibility there. It seems like the same would go for basically any fact or set of facts, normative or not.

I take it the real issue is one of probability, not possibility. Is an extremely smart being likely to accept what seem like glaringly obvious moral truths (like "you shouldn't turn everyone into paperclips") in virtue of being so smart?

(2) I was surprised to see you say your case depended completely on moral realism. Of course, if you're a realist, it makes some sense to approach things that way. Use your background knowledge, right?

But I think even an anti-realist may still be able to answer yes to the question above, depending on how the being in question is constructed. For example, I think something in this anti-orthogonality vein is true of humans. They tend to be constructed so that understanding of certain non-normative facts puts pressure on certain values or normative views: If you improve a human's ability to imaginatively simulate the experience of living in slavery (a non-moral intellectual achievement), they will be less likely to support slavery, and so on.

This is one direction I kind of expected you to go at some point after I saw the Aaronson quote mention "the practical version" of the thesis. That phrase has a flavor of, "Even if the thesis is mostly true because there are no moral facts to discover, it might still be false enough to save humanity."

(3) But perhaps the more obvious the truths, the less intelligence matters. The claim about slavery is clearer to me than the claim that learning more about turning everyone into paperclips would make a person less likely to do so. It seems hard to imagine a person so ignorant as to not already appreciate all the morally relevant facts about turning people into paperclips. It's as if, when the moral questions get so basic, intelligence isn't going to make a difference. You've either got the values or you don't. (But I'm a committed anti-realist, and I'm not sure how much that's coloring these last comments.)

Bentham's Bulldog, Apr 6 2023

(1) I think this is right.

(2) I agree that it would depend on how the being is constructed. My claim is that it's plausible that they'd be moral by default just by virtue of being smart.

(3) I think there is a sense in which I have--and most modern people have--unlike most people historically, grasped the badness of slavery.