
This is part 1 of a 5-part sequence:

Part 1: summary of Bostrom's argument

Part 2: arguments against a fast takeoff

Part 3: cosmic expansion and AI motivation

Part 4: tractability of AI alignment

Part 5: expected value arguments

Introduction

In this article I present a critique of Nick Bostrom’s book Superintelligence. For the sake of brevity I shall not devote much space to summarising Bostrom’s arguments or defining all the terms that he uses. Though I briefly review each key idea before discussing it, I shall also assume that readers have some general familiarity with Bostrom’s argument and the key terms involved. Note also that, to keep this piece focused, I only discuss arguments raised in this book, and not what Bostrom has written elsewhere or what others have written about similar issues.

The structure of this article is as follows. I first offer a summary of what I regard to be the core argument of Bostrom’s book, outlining a series of premises that he defends in various chapters. Following this summary, I commence a general discussion and critique of Bostrom’s concept of ‘intelligence’, arguing that his failure to adopt a single, consistent usage of this concept fatally undermines his core argument. The remaining sections of this article then draw upon this discussion of the concept of intelligence in responding to each of the key premises of Bostrom’s argument. I conclude with a summary of the strengths and weaknesses of Bostrom’s argument.

Summary of Bostrom’s Argument

Throughout much of his book, Bostrom remains quite vague as to exactly what argument he is making, or indeed whether he is making a specific argument at all. In many chapters he presents what are essentially lists of various concepts, categories, or considerations, and then articulates some thoughts about them. Exactly what conclusion we are supposed to draw from his discussion is often not made explicit. Nevertheless, by my reading the book does at least implicitly present a very clear argument, which bears a strong similarity to the sorts of arguments commonly found in the Effective Altruism (EA) movement, in favour of focusing on AI research as a cause area. In order to provide structure for my review, I have therefore constructed an explicit formulation of what I take to be Bostrom’s main argument in his book. I summarise it as follows:

Premise 1: A superintelligence, defined as a system that ‘exceeds the cognitive performance of humans in virtually all domains of interest’, is likely to be developed in the foreseeable future (decades to centuries).

Premise 2: If superintelligence is developed, some superintelligent agent is likely to acquire a decisive strategic advantage, meaning that no terrestrial power or powers would be able to prevent it doing as it pleased.

Premise 3: A superintelligence with a decisive strategic advantage would be likely to capture all or most of the cosmic endowment (the total space and resources within the accessible universe), and put it to use for its own purposes.

Premise 4: A superintelligence which captures the cosmic endowment would likely put this endowment to uses incongruent with our (human) values and desires.

Preliminary conclusion: In the foreseeable future it is likely that a superintelligent agent will be created which will capture the cosmic endowment and put it to uses incongruent with our values. (I call this the AI Doom Scenario).

Premise 5: Pursuit of work on AI safety has a non-trivial chance of noticeably reducing the probability of the AI Doom Scenario occurring.

Premise 6: If pursuit of work on AI safety has at least a non-trivial chance of noticeably reducing the probability of an AI Doom Scenario, then (given the preliminary conclusion above) the expected value of such work is exceptionally high.

Premise 7: It is morally best for the EA community to preferentially direct a large fraction of its marginal resources (including money and talent) to the cause area with highest expected value.

Main conclusion: It is morally best for the EA community to direct a large fraction of its marginal resources to work on AI safety. (I call this the AI Safety Thesis.)

Bostrom discusses the first premise in chapters 1-2, the second premise in chapters 3-6, the third premise in chapters 6-7, the fourth premise in chapters 8-9, and some aspects of the fifth premise in chapters 13-14. The sixth and seventh premises are not really discussed in the book (though some aspects of them are hinted at in chapter 15), but are widely discussed in the EA community and serve as the link between the abstract argumentation and real-world action, and as such I decided also to discuss them here for completeness. Many of these premises could be articulated slightly differently, and perhaps Bostrom would prefer to rephrase them in various ways. Nevertheless I hope that they at least adequately capture the general thrust and key contours of Bostrom’s argument, as well as how it is typically appealed to and articulated within the EA community.

The nature of intelligence

In my view, the biggest problem with Bostrom’s argument in Superintelligence is his failure to devote any substantial space to discussing the nature or definition of intelligence. Indeed, throughout the book I believe Bostrom uses three quite different conceptions of intelligence:

  • Intelligence(1): Intelligence as being able to perform most or all of the cognitive tasks that humans can perform. (See page 22)
  • Intelligence(2): Intelligence as a measurable quantity along a single dimension, which represents some sort of general cognitive efficaciousness. (See pages 70,76)
  • Intelligence(3): Intelligence as skill at prediction, planning, and means-ends reasoning in general. (See page 107)

While certainly not entirely unrelated, these three conceptions are all quite different from each other. Intelligence(1) is most naturally viewed as a multidimensional construct, since humans exhibit a wide range of cognitive abilities and it is by no means clear that they are all reducible to a single underlying phenomenon that can be meaningfully quantified with one number. It seems much more plausible to say that the range of human cognitive abilities requires many different skills, which are sometimes mutually supportive, sometimes mostly unrelated, and sometimes mutually inhibitory, in varying ways and to varying degrees. This first conception of intelligence is also explicitly anthropocentric, unlike the other two conceptions, which make no reference to human abilities.

Intelligence(2) is unidimensional and quantitative, and also extremely abstract, in that it does not refer directly to any particular skills or abilities. It most closely parallels the notion of IQ or other similar operational measures of human intelligence (which Bostrom even mentions in his discussion), in that it is explicitly quantitative and attempts to reduce abstract reasoning abilities to a number along a single dimension. Intelligence(3) is much more specific and grounded than either of the other two, relating only to particular types of abilities. That said, it is not obviously subject to simple quantification along a single dimension as is the case for Intelligence(2), nor is it clear that skill at prediction and planning is what is measured by the quantitative concept of Intelligence(2). Certainly Intelligence(3) and Intelligence(2) cannot be equivalent if Intelligence(2) is even somewhat analogous to IQ, since IQ mostly measures skills at mathematical, spatial, and verbal memory and reasoning, which are quite different from skills at prediction and planning (consider for example the phenomenon of autistic savants). Intelligence(3) is also far more narrow in scope than Intelligence(1), corresponding to only one of the many human cognitive abilities.

Repeatedly throughout the book, Bostrom flips between one or another of these conceptions of intelligence. This is a major weakness for Bostrom’s overall argument, since for the argument to be sound a single conception of intelligence must be adopted and applied consistently across all of his premises. In the following paragraphs I outline several of the clearest examples of how Bostrom’s equivocation on the meaning of ‘intelligence’ undermines his argument.

Bostrom argues that once a machine becomes more intelligent than a human, it would far exceed human-level intelligence very rapidly, because one human cognitive ability is that of building and improving AIs, and so any superintelligence would also be better at this task than humans. This means that the superintelligence would be able to improve its own intelligence, thereby further improving its own ability to improve its own intelligence, and so on, the end result being a process of exponentially increasing recursive self-improvement. Although compelling on the surface, this argument relies on switching between the concepts of Intelligence(1) and Intelligence(2).

When Bostrom argues that a superintelligence would necessarily be better at improving AIs than humans because AI-building is a cognitive ability, he is appealing to Intelligence(1). However, when he argues that this would result in recursive self-improvement leading to exponential growth in intelligence, he is appealing to Intelligence(2). To see how these two arguments rest on different conceptions of intelligence, note that considering Intelligence(1), it is not at all clear that there is any general, single way to increase this form of intelligence, as Intelligence(1) incorporates a wide range of disparate skills and abilities that may be quite independent of each other. As such, even a superintelligence that was better than humans at improving AIs would not necessarily be able to engage in rapidly recursive self-improvement of Intelligence(1), because there may well be no such thing as a single variable or quantity called ‘intelligence’ that is directly associated with AI-improving ability. Rather, there may be a host of associated but distinct abilities and capabilities that each needs to be enhanced and adapted in the right way (and in the right relative balance) in order to get better at designing AIs. Only by assuming a unidimensional quantitative conception of Intelligence(2) does it make sense to talk about the rate of improvement of a superintelligence being proportional to its current level of intelligence, which then leads to exponential growth.

Bostrom therefore faces a dilemma. If intelligence is a mix of a wide range of distinct abilities as in Intelligence(1), there is no reason to think it can be ‘increased’ in the rapidly self-reinforcing way Bostrom speaks about (in mathematical terms, there is no single variable which we can differentiate and plug into the differential equation, as Bostrom does in his example on pages 75-76). On the other hand, if intelligence is a unidimensional quantitative measure of general cognitive efficaciousness, it may be meaningful to speak of self-reinforcing exponential growth, but it is not necessarily obvious that any arbitrary intelligent system or agent would be particularly good at designing AIs. Intelligence(2) may well help with this ability, but it is not at all clear that it is sufficient: after all, we can readily conceive of building a highly “intelligent” machine that can reason abstractly, pass IQ tests, and so on, yet is useless at building better AIs.
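
To make the first horn of this dilemma concrete, here is a minimal sketch in my own notation; it is meant only to capture the proportional-growth intuition described above, not to reproduce Bostrom's exact treatment on pages 75-76.

```latex
% A stylised rendering of the unidimensional growth story: if the rate
% of self-improvement is proportional to the current level of a single
% quantity I (Intelligence(2)), the result is exponential growth.
\[ \frac{dI}{dt} = kI, \qquad k > 0
   \quad\Longrightarrow\quad I(t) = I(0)\,e^{kt} \]
% Under Intelligence(1) there is no single I to differentiate here,
% only a collection of distinct abilities with their own interactions.
```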

Bostrom argues that once a machine intelligence became more intelligent than humans, it would soon be able to develop a series of ‘cognitive superpowers’ (intelligence amplification, strategising, social manipulation, hacking, technology research, and economic productivity), which would then enable it to escape whatever constraints were placed upon it and likely achieve a decisive strategic advantage. The problem is that it is unclear whether a machine endowed only with Intelligence(3) (skill at prediction and means-ends reasoning) would necessarily be able to develop abilities as diverse as general scientific research, competent use of natural language, and social manipulation of human beings. Again, means-ends reasoning may help with these skills, but they clearly require much more besides. Only if we assume the conception of Intelligence(1), whereby the AI has already exceeded essentially all human cognitive abilities, does it become reasonable to suppose that all of these ‘superpowers’ would be attainable.

According to the orthogonality thesis, there is no reason why the machine intelligence could not have extremely reductionist goals such as maximising the number of paperclips in the universe, since an AI's level of intelligence is totally separate from, and distinct from, its final goals. Bostrom’s argument for this thesis, however, clearly depends on adopting Intelligence(3), whereby intelligence is regarded as general skill with prediction and means-ends reasoning. It is indeed plausible that an agent endowed only with this form of intelligence would not necessarily have the ability or inclination to question or modify its goals, even if they are extremely reductionist or what any human would regard as patently absurd. If, however, we adopt the much more expansive conception of Intelligence(1), the argument becomes much less defensible. This should become clear if one considers that ‘essentially all human cognitive abilities’ includes such activities as pondering moral dilemmas, reflecting on the meaning of life, analysing and producing sophisticated literature, formulating arguments about what constitutes a ‘good life’, interpreting and writing poetry, forming social connections with others, and critically introspecting upon one’s own goals and desires. To me it seems extraordinarily unlikely that any agent capable of performing all these tasks with a high degree of proficiency would simultaneously stand firm in its conviction that the only goal it had reasons to pursue was tiling the universe with paperclips.

As such, Bostrom is driven by his cognitive superpowers argument to adopt the broad notion of intelligence seen in Intelligence(1), but is then driven back to the much narrower Intelligence(3) when he wishes to defend the orthogonality thesis. The key point here is that the goals or preferences of a rational agent are subject to rational reflection and reconsideration, and the exercise of reason is in turn shaped by the agent’s preferences and goals. Short of radically redefining what we mean by ‘intelligence’ and ‘motivation’, this complex interaction will always hamper simplistic attempts to neatly separate them, thereby undermining Bostrom’s case for the orthogonality thesis, unless a very narrow conception of intelligence is adopted.

In the table below I summarise several of the key outcomes or developments that are critical to Bostrom’s argument, and how plausible they would be under each of the three conceptions of intelligence. Obviously such judgements are necessarily vague and subjective, but the key point I wish to make is simply that only by appealing to different conceptions of intelligence in different cases is Bostrom able to argue that all of the outcomes are reasonably likely to occur. Fatally for his argument, there is no single conception of intelligence that makes all of these outcomes simultaneously likely or plausible.

| Outcome | Intelligence(1) | Intelligence(2) | Intelligence(3) |
|---|---|---|---|
| Quick takeoff | Highly unlikely | Likely | Unclear |
| All superpowers | Highly likely | Highly unlikely | Highly unlikely |
| Absurd goals | Highly unlikely | Unclear | Likely |
| No change to goals | Unlikely | Unclear | Likely |

Comments

Weak upvote for engaging seriously with content and linking to the other parts of the argument.

On the other hand, while it's good to see complex arguments on the Forum, it's difficult to discuss pieces that are written without very many headings or paragraph breaks. It's generally helpful to break down your piece into labelled sections so that people can respond unambiguously to various points. I also think this would help you make this argument across fewer than five posts, which would also make discussion easier.

I'm not the best-positioned person to comment on this topic (hopefully someone with more expertise will step in and correct both of our misconceptions), but these sections stood out:

To see how these two arguments rest on different conceptions of intelligence, note that considering Intelligence(1), it is not at all clear that there is any general, single way to increase this form of intelligence, as Intelligence(1) incorporates a wide range of disparate skills and abilities that may be quite independent of each other. As such, even a superintelligence that was better than humans at improving AIs would not necessarily be able to engage in rapidly recursive self-improvement of Intelligence(1), because there may well be no such thing as a single variable or quantity called ‘intelligence’ that is directly associated with AI-improving ability.

Indeed, there may be no variable or quantity like this. But I'm not sure there isn't, and it seems really, really important to be sure before we write off the possibility. We don't understand human reasoning very well; it seems plausible to me that there really are a few features of the human mind that account for nearly all of our reasoning ability. (I think the "single quantity" thing is a red herring; an AI could make self-recursive progress on several variables at once.)

To give a silly human example, I'll name Tim Ferriss, who has used the skills of "learning to learn", "ignoring 'unwritten rules' that other people tend to follow", and "closely observing the experience of other skilled humans" to learn many languages, become an extremely successful investor, write a book that sold millions of copies before he was well-known, and so on. His IQ may not be higher now than when he began, but his end results look like the end results of someone who became much more "intelligent".

Tim has done his best to break down "human-improving ability" into a small number of rules. I'd be unsurprised to see someone use those rules to improve their own performance in almost any field, from technical research to professional networking.

Might the same thing be true of AI -- that a few factors really do allow for drastic improvements in problem-solving across many domains? It's not at all clear that it isn't.

If, however, we adopt the much more expansive conception of Intelligence(1), the argument becomes much less defensible. This should become clear if one considers that ‘essentially all human cognitive abilities’ includes such activities as pondering moral dilemmas, reflecting on the meaning of life, analysing and producing sophisticated literature, formulating arguments about what constitutes a ‘good life’, interpreting and writing poetry, forming social connections with others, and critically introspecting upon one’s own goals and desires. To me it seems extraordinarily unlikely that any agent capable of performing all these tasks with a high degree of proficiency would simultaneously stand firm in its conviction that the only goal it had reasons to pursue was tiling the universe with paperclips.

Some of the world's most famous intellectuals have made what most people in the EA community would see as bizarre or dangerous errors in moral reasoning. It's possible for someone to have a deep grasp of literature, a talent for moral philosophy, and great social skills -- and still have desires that are antithetical to sentient well-being (there are too many historical examples to count).

Motivation is a strange thing. Much of the world, including some of those famous intellectuals I mentioned, believes in religious and patriotic ideals that don't seem "rational" to me. I'm sure there are people far more intelligent than I who would like to tile the world with China, America, Christianity, or Islam, and who are unlikely to break from this conviction. The ability to reflect on life, like the ability to solve problems, often seems to have little impact on how easily you can change your motivations.

It's also important not to take the "paperclip" example too seriously. It's meant to be absurd in a fun, catchy way, but also to stand in for the class of "generally alien goals", which are often much less ridiculous.

If an AI were to escape the bonds of human civilization and begin harvesting all of the sun's energy for some eldritch purpose, it's plausible to me that the AI would have a very good reason (e.g. "learn about the mysteries of the universe"). However, this doesn't mean that its good reason has to be palatable to any actual humans. If an AI were to decide that existence is inherently net-negative and begin working to end life in the universe, it would be engaging in deep, reflective philosophy (and might even be right in some hard-to-fathom way), but that would be little comfort to us.

[Sorry for picking out a somewhat random point unrelated to the main conversation. This just struck me because I feel like it's similar to a divergence in intuitions I often notice between myself and other EAs and particularly people from the 'rationalist' community. So I'm curious if there is something here it would be valuable for me to better understand.]

To give a silly human example, I'll name Tim Ferriss, who has used the skills of "learning to learn", "ignoring 'unwritten rules' that other people tend to follow", and "closely observing the experience of other skilled humans" to learn many languages, become an extremely successful investor, write a book that sold millions of copies before he was well-known, and so on. His IQ may not be higher now than when he began, but his end results look like the end results of someone who became much more "intelligent".
Tim has done his best to break down "human-improving ability" into a small number of rules. I'd be unsurprised to see someone use those rules to improve their own performance in almost any field, from technical research to professional networking.

Here is an alternative hypothesis, a bit exaggerated for clarity:

  • There is a large number of people who try to be successful in various ways.
  • While trying to be successful, people tend to confabulate explicit stories for what they're doing and why it might work, for example "ignoring 'unwritten rules' that other people tend to follow".
  • These confabulations are largely unrelated to the actual causes of success, or at least don't refer to them in a way nearly as specific as they seem to do. (E.g., perhaps a cause could be 'practicing something in an environment with frequent and accurate feedback', while a confabulation would talk about quite specific and tangential features of how this practice was happening.)
  • Most people actually don't end up having large successes, but a few do. We might be pulled to think that their confabulations about what they were doing are insightful or worth emulating, but in fact it's all a mix of survivorship bias and the fact that people with certain innate traits (IQ, conscientiousness, perhaps excitement-seeking, ...), traits which don't appear in the confabulations, simply do better.

Do you think we have evidence that this alternative hypothesis is false?

I think the truth is a mix of both hypotheses. I don't have time to make a full response, but some additional thoughts:

  • It's very likely that there exist reliable predictors of success that extend across many fields.
  • Some of these are innate traits (intelligence, conscientiousness, etc.)
  • But if you look at a group of people in a field who have very similar traits, some will still be more successful than others. Some of this inequality will be luck, but some of it seems like it would also be related to actions/habits/etc.
  • Some of these actions will be trait-related (e.g. "excitement-seeking" might predict "not following unwritten rules"). But it should also be possible to take the right actions even if you aren't strong in the corresponding traits; there are ways you can become less bound by unwritten rules even if you don't have excitement-seeking tendencies. (A concrete example: Ferriss sometimes recommends practicing requests in public to get past worries about social faux pas -- e.g. by asking for a discount on your coffee. CFAR does something similar with "comfort zone expansion".)

No intellectual practice/"rule" is universal -- if many people tried the sorts of things Tim Ferriss tried, most would fail or at least have a lot less success. But some actions are more likely than others to generate self-improvement/success, and some actions seem like they would make a large difference (for example, "trying new things" or "asking for things").

One (perhaps pessimistic) picture of the world could look like this:

  • Most people are going to be roughly as capable/successful as they are now forever, even if they try to change, unless good or bad luck intervenes
  • Some people who try to change will succeed, because they expose themselves to the possibility of good luck (e.g. by starting a risky project, asking for help with something, or giving themselves the chance to stumble upon a habit/routine that suits them very well)
  • A few people will succeed whether or not they try to change, because they won the trait lottery, but within this group, trying to change in certain ways will still be associated with greater success.

One of Ferriss's stated goals is to look at groups of people who succeed at X, then find people within those groups who have been unexpectedly successful. A common interview question: "Who's better at [THING] than they should be?" (For example, an athlete with an unusual body type, or a startup founder from an unusual background.) You can never take luck out of the equation completely, especially in the complex world of intellectual/business pursuits, but I think there's some validity to the common actions Ferriss claims to have identified.

Thanks for your thoughts. Regarding spreading my argument across 5 posts, I did this in part because I thought connected sequences of posts were encouraged?

Regarding the single quantity issue, I don't think it is a red herring, because if there are multiple distinct quantities then the original argument for self-sustaining rapid growth becomes significantly weaker (see my responses to Flodorner and Lukas for more on this).

You say "Might the same thing be true of AI -- that a few factors really do allow for drastic improvements in problem-solving across many domains? It's not at all clear that it isn't." I believe we have good reason to think no such few factors exist. I would say because A) this does not seem to be how human intelligence works and B) because this does not seem to be consistent with the history of progress in AI research. Both I would say are characterised by many different functionalities or optimisations for particular tasks. Not to say there are no general principles but I think these are not as extensive as you seem to believe. However regardless of this point, I would just say that if Bostrom's argument is to succeed I think he needs to give some persuasive reasons or evidence as to why we should think such factors exist. Its not sufficient just to argue that they might.

Connected sequences of posts are definitely encouraged, as they are sometimes the best way to present an extensive argument. However, I'd generally recommend that someone make one post over two short posts if they could reasonably fit their content into one post, because that makes discussion easier.

In this case, I think the content could have been fit into fewer posts (not just one, but fewer than five) had the organization system been a bit different, but this isn't meant to be a strong criticism -- you may well have chosen the best way to sort your arguments. The critique I'm most sure about is that your section on "the nature of intelligence" could have benefited from being broken down a bit more, with more subheadings and/or other language meant to guide readers through the argument (similarly to the way you presented Bostrom's argument in the form of a set of premises, which was helpful).

Thanks for writing this!

I think you are pointing out some important imprecisions, but I think that some of your arguments aren't as conclusive as you seem to present them to be:

"Bostrom therefore faces a dilemma. If intelligence is a mix of a wide range of distinct abilities as in Intelligence(1), there is no reason to think it can be ‘increased’ in the rapidly self-reinforcing way Bostrom speaks about (in mathematical terms, there is no single variable  which we can differentiate and plug into the differential equation, as Bostrom does in his example on pages 75-76). "

Those variables could be reinforcing each other, as one could argue they have done in the evolution of human intelligence. (In mathematical terms, there is a runaway dynamic similar to the one-dimensional case for a linear vector-valued differential equation, as long as all eigenvalues are positive.)
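
For readers who want the parenthetical made explicit, this is the standard linear-systems fact being invoked (my rendering, not the commenter's):

```latex
% Model n interacting abilities x = (x_1, ..., x_n) with a linear
% coupling matrix A:
\[ \frac{d\mathbf{x}}{dt} = A\,\mathbf{x} \]
% If A is diagonalisable with eigenpairs (\lambda_i, \mathbf{v}_i), then
\[ \mathbf{x}(t) = \sum_{i=1}^{n} c_i\, e^{\lambda_i t}\, \mathbf{v}_i, \]
% so when every eigenvalue has positive real part the solution grows
% exponentially in magnitude, mirroring the one-dimensional dI/dt = kI case.
```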

"This should become clear if one considers that ‘essentially all human cognitive abilities’ includes such activities as pondering moral dilemmas, reflecting on the meaning of life, analysing and producing sophisticated literature, formulating arguments about what constitutes a ‘good life’, interpreting and writing poetry, forming social connections with others, and critically introspecting upon one’s own goals and desires. To me it seems extraordinarily unlikely that any agent capable of performing all these tasks with a high degree of proficiency would simultaneously stand firm in its conviction that the only goal it had reasons to pursue was tilling the universe with paperclips. To me it seems extraordinarily unlikely that any agent capable of performing all these tasks with a high degree of proficiency would simultaneously stand firm in its conviction that the only goal it had reasons to pursue was tilling the universe with paperclips."

Why does it seem unlikely? Also, do you mean unlikely as in "agents emerging in a world similar to ours as it is now probably won't have this property", or as in "given that someone figured out how to construct a great variety of superintelligent agents, she would still have trouble constructing an agent with this property"?

Thanks for your thoughts.

Regarding your first point, I agree that the situation you posit is a possibility, but it isn't something Bostrom talks about (and remember I only focused on what he argued, not other possible expansions of the argument). Also, when we consider the possibility of numerous distinct cognitive abilities it is just as possible that there could be complex interactions which inhibit the growth of particular abilities. There could easily be dozens of separate abilities and the full matrix of interactions becomes very complex. The original force of the 'rate of growth of intelligence is proportional to current intelligence leading to exponential growth' argument is, in my view, substantively blunted.
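
A toy simulation of the point about inhibitory interactions (entirely my own construction, with an arbitrary coupling matrix, not anything from the post or the book): even when every ability has a positive self-improvement term, a strong enough negative cross-coupling can drive one ability into decline rather than runaway growth.

```python
import numpy as np

# Toy linear model of two interacting "abilities": dx/dt = A @ x.
# The numbers in A are illustrative assumptions, chosen to show that
# positive self-improvement terms (the diagonal) do not guarantee that
# every ability grows once cross-couplings are allowed to be negative.
A = np.array([[0.10,  0.00],    # ability 0 improves itself
              [-0.50, 0.05]])   # ability 1 improves itself but is inhibited by ability 0

x = np.array([1.0, 1.0])        # start both abilities at the same level
dt = 0.01
for _ in range(int(50 / dt)):   # simple Euler integration over 50 time units
    x = x + dt * (A @ x)

print(x)  # ability 0 grows exponentially; ability 1 is driven far below zero
```

Both eigenvalues of this particular A are positive (0.10 and 0.05), so the system does "run away" in overall magnitude, but that guarantee concerns the norm of the vector, not each component: nothing prevents one ability from collapsing while another explodes.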

Regarding your second point, it seems unlikely to me because if an agent had all these abilities, I believe they would use them to uncover reasons to reject highly reductionistic goals like tiling the universe with paperclips. They might end up with goals that are still in opposition to human values, but I just don't see how an agent with these abilities would not become dissatisfied with extremely narrow goals.

Hi Fods12,

We read and discussed your critique in 2 sessions in the AISafety.com reading group. You raise many interesting points, and I found it worthwhile to make a paragraph-by-paragraph answer to your critique. The recording can be found here:

https://youtu.be/Xl5SMS9eKD4
https://youtu.be/lCKc_eDXebM

Best regards
Søren Elverlin

To see how these two arguments rest on different conceptions of intelligence, note that considering Intelligence(1), it is not at all clear that there is any general, single way to increase this form of intelligence, as Intelligence(1) incorporates a wide range of disparate skills and abilities that may be quite independent of each other. As such, even a superintelligence that was better than humans at improving AIs would not necessarily be able to engage in rapidly recursive self-improvement of Intelligence(1), because there may well be no such thing as a single variable or quantity called ‘intelligence’ that is directly associated with AI-improving ability.

While I'm not entirely convinced of a fast take-off, this particular argument isn't obvious to me. If the AI is better than humans at every cognitive task, then for every ability X that we care about, it will be better at the cognitive task of improving X. Additionally, it will be better at the cognitive task of improving its ability to improve X, etc. It will be better than humans at constructing an AI that is good at every cognitive task, and will thus be able to create one better than itself.

This should become clear if one considers that ‘essentially all human cognitive abilities’ includes such activities as pondering moral dilemmas, reflecting on the meaning of life, analysing and producing sophisticated literature, formulating arguments about what constitutes a ‘good life’, interpreting and writing poetry, forming social connections with others, and critically introspecting upon one’s own goals and desires. To me it seems extraordinarily unlikely that any agent capable of performing all these tasks with a high degree of proficiency would simultaneously stand firm in its conviction that the only goal it had reasons to pursue was tiling the universe with paperclips.

This doesn't seem very unlikely to me. As a proof-of-concept, consider a paper-clip maximiser able to simulate several clever humans at high speeds. If it were posed a moral dilemma (and was motivated to answer it), it could perform above human level by simulating humans at high speed (in a suitable situation where they are likely to produce an honest answer to the question) and directly reporting their output. However, it wouldn't have to be motivated by it.

Thanks for your thoughts!

1) The idea I'm getting at is that an exponential-type argument, in which self-improvement ability is proportional to current intelligence, doesn't really work if there are multiple distinct and separate cognitive abilities, because the ability to improve ability X might not be related in any clear way to the current level of X. For example, the ability to design a better chess-playing program might not be related in any way to chess-playing ability, and object recognition performance might not be related to the ability to improve that performance. These are probably not ideal examples, since these sorts of abilities may not be fundamental enough and we should be looking at more abstract cognitive abilities, but hopefully they serve as a general illustration. A superhuman AI would therefore be better at designing AIs than a human, sure, but I don't think the sort of exponential growth arguments Bostrom uses hold if there are multiple distinct cognitive abilities.

2) The idea of a simplistic paperclip-maximising AI instantiating separate mind simulations is very interesting. I think the way you describe it, this would amount to one agent creating distinct agents to perform a set task, rather than a single agent possessing those actual abilities itself. This seems relevant to me because any created mind simulations, being distinct from the original agent, would not necessarily share its goals or beliefs, and therefore a principal-agent problem arises. In order to be smart enough to solve this problem, I think the original AI would probably have to be enhanced well beyond paperclip-maximising levels. I think there's a lot more to be said here, but I am not convinced this counterexample really undermines my original point.

To me it seems extraordinarily unlikely that any agent capable of performing all these tasks with a high degree of proficiency would simultaneously stand firm in its conviction that the only goal it had reasons to pursue was tiling the universe with paperclips.

Seems a little anthropomorphic. A possibly less anthropomorphic argument: if we possess the algorithms required to construct an agent that's capable of achieving a decisive strategic advantage, we can also apply those algorithms to pondering moral dilemmas etc., and use those algorithms to construct the agent's value function.

I don't see how using Intelligence (1) as a definition undermines the orthogonality thesis.

Intelligence(1): Intelligence as being able to perform most or all of the cognitive tasks that humans can perform. (See page 22)

This only makes reference to abilities and not to the underlying motivation. Looking at high-functioning sociopaths, you might argue we have an example of agents that often perform very well at almost all human abilities but still have attitudes towards other people that might be quite different from most people's, and lack a lot of ordinary inhibitions.

This should become clear if one considers that ‘essentially all human cognitive abilities’ includes such activities as pondering moral dilemmas, reflecting on the meaning of life, analysing and producing sophisticated literature, formulating arguments about what constitutes a ‘good life’, interpreting and writing poetry, forming social connections with others, and critically introspecting upon one’s own goals and desires. To me it seems extraordinarily unlikely that any agent capable of performing all these tasks with a high degree of proficiency would simultaneously stand firm in its conviction that the only goal it had reasons to pursue was tiling the universe with paperclips.

I don't agree. I personally can easily imagine an agent that can argue convincingly for moral positions by analysing huge amounts of data about human preferences, that can use statistical techniques to infer the behaviour and attitudes of humans, and that can then use that knowledge to maximize something like positive affection or trust, among many other things.