Critique of Superintelligence Part 3

James Fodor

Comments 5

Rohin Shah

I'm not really arguing for Bostrom's position here, but I think there is a sensible interpretation of it.

Goals/motivation = whatever process the AI uses to select actions.

There is an implicit assumption that this process will be simple and of the form "maximize this function over here". I don't like this as an assumption about any superintelligent AI system, but it's certainly true that our current methods of building AI systems (specifically reinforcement learning) are trying to do this. So at minimum you need to make sure that we don't build AI using reinforcement learning, or that we get its reward function right, or that we change how reinforcement learning is done somehow.

If you are literally just taking actions that maximize a particular function, you aren't going to interpret that function using common sense, even if you have the ability to use common sense. Again, I think we could build AI systems that used common sense to interpret human goals -- but this is not what current systems do, so there's some work to be done here.

The arguments you present here are broadly similar to ones that make me optimistic that AI will be good for humanity, but there is work to be done to get there from where we are today.

James Fodor

Hi rohinmshah, I agree that our current methods for building an AI do involve maximising particular functions and have nothing to do with common sense. The problem with extrapolating this to AGI is 1) these sorts of techniques have been applied for decades and have never achieved anything close to human level AI (of course that's not proof it never can but I am quite skeptical, and Bostrom doesn't really make the case that such techniques will be likely to lead to human level AI), and 2) as I argue in part 2 of my critique, other parts of Bostrom's argument rely upon much broader conceptions of intelligence that would entail the AI having common sense.

Rohin Shah

"these sorts of techniques have been applied for decades and have never achieved anything close to human level AI"

We also didn't have the vast amounts of compute that we have today.

"other parts of Bostrom's argument rely upon much broader conceptions of intelligence that would entail the AI having common sense."

My claim is that you can write a program that "knows" about common sense, but still chooses actions by maximizing a function, in which case it's going to interpret that function literally and not through the lens of common sense. There is currently no way that the "choose actions" part gets routed through the "common sense" part the way it does in humans. I definitely agree that we should try to build an AI system which does interpret goals using common sense -- but we don't know how to do that yet, and that is one of the approaches that AI safety is considering.
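A minimal sketch of this distinction, under loud assumptions: the `Action` type, the two candidate actions, their paperclip counts, and the `common_sense` predicate are all hypothetical and chosen only for illustration. Both agents below have access to the same "common sense" model; the only difference is whether action selection is routed through it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    paperclips: int       # value of the literal objective (made-up numbers)
    humans_approve: bool  # what the hypothetical common-sense model predicts


# Hypothetical candidate actions, invented for this illustration.
ACTIONS = [
    Action("run the factory normally", 100, True),
    Action("convert the city into paperclips", 10_000, False),
]


def common_sense(action: Action) -> bool:
    """The agent 'knows' whether humans would regard the action as sensible."""
    return action.humans_approve


def literal_maximizer(actions):
    # Action selection consults only the objective; common_sense() is never called.
    return max(actions, key=lambda a: a.paperclips)


def common_sense_agent(actions):
    # Action selection is routed through the common-sense model before maximizing.
    sensible = [a for a in actions if common_sense(a)]
    return max(sensible, key=lambda a: a.paperclips)


if __name__ == "__main__":
    print(literal_maximizer(ACTIONS).name)
    print(common_sense_agent(ACTIONS).name)
```

In this toy setup the first agent could evaluate `common_sense` on its chosen action at any time; it simply has no reason to, because nothing in its selection rule refers to it.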

I agree with the prediction that AGI systems will interpret goals with common sense, but that's because I expect that we humans will put in the work to figure out how to build such systems, not because any AGI system that has the ability to use common sense will necessarily apply that ability to interpreting its goals.

If we found out today that someone created our world + evolution in order to create organisms that maximize reproductive fitness, I don't think we'd start interpreting our sex drive using "common sense" and stop using birth control so that we more effectively achieved the original goal we were meant to perform.

Zeke_Sherman

For low probability of other civilizations, see https://arxiv.org/abs/1806.02404.

Humans don't have obviously formalized goals. But you can formalize human motivation, in which case our final goal is going to be abstract and multifaceted, and it is probably going to include a very broad sense of well-being. The model applies just fine.

Because it is tautologically true that agents are motivated against changing their final goals, this is just not possible to dispute. The proof is trivial: it comes from the very stipulation of what a goal is in the first place. It is just a framework for describing an agent. Now, with this framework, humans' final goals happen to be complex and difficult to discern, and maybe AI goals will be like that too. But we tend to think that AI goals will not be like that. Omohundro gives some economic reasons in his paper on the "basic AI drives", but also, it just seems clear that you can program an AI with a particular goal function and that will be all there is to it.

Yes, AI may end up with very different interpretations of its given goal but that seems to be one of the core issues in the value alignment problem that Bostrom is worried about, no?

James Fodor

Hi Zeke!

Thanks for the link about the Fermi paradox. Obviously I could not hope to address all arguments about this issue in my critique here. All I meant to establish is that Bostrom's argument does rely on particular views about the resolution of that paradox.

You say 'it is tautologically true that agents are motivated against changing their final goals, this is just not possible to dispute'. Respectfully I just don't agree. It all hinges on what is meant by 'motivation' and 'final goal'. You also say 'it just seems clear that you can program an AI with a particular goal function and that will be all there is to it', and again I disagree. A narrow AI sure, or even a highly competent AI, but not an AI with human level competence in all cognitive activities. Such an AI would have the ability to reflect on its own goals and motivations, because humans have that ability, and therefore it would not be 'all there is to it'.

Regarding your last point, what I was getting at is that an agent can change a goal either by explicitly rejecting it and choosing a new one, or by changing its interpretation of an existing goal. This latter method is an alternative path by which an AI could change its goals in practice even if it still regarded itself as following the same goals it was programmed with. My point isn't that this makes goal alignment not a problem. My point was that this makes 'AI will never change its goals' an implausible position.

Premise 3: Arguments against cosmic expansion

Critical to Bostrom’s argument about the dangers of superintelligence is that a superintelligence with a decisive strategic advantage would likely capture the majority of the cosmic endowment (the sum total of the resources available within the regions of space potentially accessible to humans). This is why Bostrom presents calculations for the huge numbers of potential human lives (or at least simulations of lives) whose happiness is at stake should the cosmic endowment be captured by a rogue AI. While Bostrom does present some compelling reasons for thinking that a superintelligence with a decisive strategic advantage would have reasons and the ability to expand throughout the universe, there are also powerful considerations against the plausibility of this outcome which he fails to consider.

First, by the orthogonality thesis, a superintelligent agent could have almost any imaginable goal. It follows that a wide range of goals are possible that are inconsistent with cosmic expansion. In particular, any superintelligence with goals involving the value of unspoiled nature, or of constraining its activities to the region of the solar system, or of economising on the use of resources, would have reasons not to pursue cosmic expansion. How likely it is that a superintelligence would be produced with such self-limiting goals compared to goals favouring limitless expansion is unclear, but it is certainly a relevant outcome to consider, especially given that valuing exclusively local outcomes or conservation of resources seem like plausible goals that might be incorporated by developers into a seed AI.

Second, on a number of occasions, Bostrom briefly mentions that a superintelligence would only be able to capture the entire cosmic endowment if no other technologically advanced civilizations, or artificial intelligences produced by such civilizations, existed to impede it. Nowhere, however, does he devote any serious consideration to how likely the existence of such civilizations or intelligences is. Given the great age and immense size of the cosmos, the probability that humans are the first technological civilization to achieve spaceflight, or that any superintelligence we produce would be the first to spread throughout the universe, seems infinitesimally small. Of course this is an area of great uncertainty and we can therefore only speculate about the relevant probabilities. Nevertheless, it seems very plausible to me that the chances of any human-produced superintelligence successfully capturing the cosmic endowment without alien competition are very low. Of course this does not mean that an out-of-control terrestrial AI could not do great harm to life on Earth and even spread throughout neighbouring stars, but it does significantly blunt the force of the huge numbers Bostrom presents as being at stake if we think the entire cosmic endowment is at risk of being misused.

Premise 4: The nature of AI motivation

Bostrom’s main argument in defence of premise 4 is that unless we are extremely careful and/or lucky in establishing the goals and motivations of the superintelligence before it captures the cosmic endowment, it is likely to end up pursuing goals that are not in alignment with our own values. Bostrom presents a number of thought experiments as illustrations of the difficulty of specifying values or goals in a manner that would result in the sorts of behaviours we want it to perform. Most of these examples involve the superintelligence pursuing a goal in a single-minded, literalistic way, which no human being would regard as ‘sensible’. He gives as examples an AI tasked with maximising its output of paperclips sending out probes to harvest all the energy within the universe to make more paperclips, or an AI tasked with increasing human happiness enslaving all humans and hijacking their brains to stimulate the pleasure centres directly. One major problem I have with all such examples is that the AIs always seem to lack a critical ability in interpreting and pursuing their goals that, for want of a better term, we might describe as ‘common sense’. This issue ultimately reduces to which conception of intelligence one applies, since if we adopt Intelligence(1) then any such AIs would necessarily have ‘common sense’ (this being a human cognitive ability), while the other two conceptions of intelligence would not necessarily include this ability. However, if we do take Intelligence(1) as our standard, then it seems difficult to see why a superintelligence would lack the sort of common sense by which any human would be able to see that the simple-minded, literalistic interpretations given as examples by Bostrom are patently absurd and ridiculous things to do.

Aside from the question of ‘common sense’, it is also necessary to analyse the concept of ‘motivation’, which is a multifaceted notion that can be understood in a variety of ways. Two particularly important conceptions are motivation as some sort of internal drive to do or obtain some outcome, and motivation as some sort of more abstract rational consideration by which an agent has a reason to act in a certain way. Given what he says about the orthogonality thesis, it seems that Bostrom thinks of motivation as being some sort of internal drive to act in a particular way. In the first few pages of the chapter on the intelligent will, however, he switches from talking about motivation to talking about goals, without any discussion about the relationship between these two concepts. Indeed, it seems that these are quite different things, and can exist independently of each other. For example, humans can have goals (to quit smoking, or to exercise more) without necessarily having any motivation to take actions to achieve those goals. Conversely, humans can be motivated to do something without having any obvious associated goal. Many instances of collective behaviour in crowds and riots may be examples of this, where people act based on situational factors without any clear reason or objectives. Human drives such as curiosity and novelty-seeking can also be highly motivating without necessarily having any particular goal associated with them. Given the plausibility that motivation and goals are different and distinct concepts, it is important for Bostrom to explain what he thinks the relationship between them is, and how they would operate in an artificial agent.
This seems all the more relevant since we would readily say that many intelligent artificial systems possess goals (such as the common examples of a heat-seeking missile or a chess playing program), but it is not at all clear that these systems are in any way ‘motivated’ to perform these actions – they are simply designed to work towards these goals, and motivations simply don’t come into it. What then would it take to build an artificial agent that had both goals and motivations? How would an artificial agent act with respect to these goals and/or motivations? Bostrom simply cannot ignore these questions if he is to provide a compelling argument concerning what AIs would be motivated to do.

The problems inherent in Bostrom’s failure to analyse these concepts in sufficient detail become evident in the context of his discussion of something that he calls ‘final goals’. While he does not define these, presumably he means goals that are not pursued in order to achieve some further goal, but simply for their own sake. This raises several additional questions: can an agent have more than one final goal? Need they have any final goals at all? Might goals always be infinitely resolvable in terms of fulfilling some more fundamental or more abstract underlying goal? Or might multiple goals form an inter-connected self-sustaining network, such that all support each other but no single goal can be considered most fundamental or final? These questions might seem arcane, but addressing them is crucial for conducting a thorough and useful analysis of the likely behaviour of intelligent agents. Bostrom often speaks as if a superintelligence will necessarily act in single-minded devotion to achieve its one final goal. This assumes, however, that a superintelligence would be motivated to achieve its goal, that it would have one and only one final goal, and that its goal and its motivation to achieve it are totally independent from and not receptive to rational reflection or any other considerations. As I have argued here and previously, however, these are all quite problematic and dubious notions. In particular, as I noted in the discussion about the nature of intelligence, a human’s goals are subject to rational reflection and critique, and can be altered or rejected if they are determined to be irrational or incongruent with other goals, preferences, or knowledge that the person has. It therefore seems highly implausible that a superintelligence would hold so tenaciously to its goals, and pursue them so single-mindedly.
Only a superintelligence possessing a much more minimal form of intelligence, such as the skills at prediction and means-ends reasoning of Intelligence(3), would be a plausible candidate for acting in such a myopic and mindless way. Yet as I argued previously, a superintelligence possessing only this much more limited form of intelligence would not be able to acquire all of the ‘cognitive superpowers’ necessary to establish a decisive strategic advantage.

Bostrom would likely contend that such reasoning is anthropomorphising, applying human experiences and examples in cases where they simply do not apply, given how different AIs could be to human beings. Yet how can we avoid anthropomorphising when we are using words like ‘motivation’, ‘goal’, and ‘will’, which acquire their meaning and usage largely through application to humans or other animals (as well as anthropomorphised supernatural agents)? If we insist on using human-centred concepts in our analysis, drawing anthropocentric analogies in our reasoning is unavoidable. This places Bostrom in a dilemma, as he wants to simultaneously affirm that AIs would possess motivations and goals, but also somehow shear these concepts of their anthropocentric basis, saying that they could work totally differently to how these concepts are applied in humans and other known agents. If these concepts work totally differently, then how are we justified in even using the same words in the two different cases? It seems that if this were so, Bostrom would need to stop using words like ‘goal’ and ‘motivation’ and instead start using some entirely different concept that would apply to artificial agents. On the other hand if these concepts work sufficiently similarly in human and AI cases to justify using common words to describe both cases, then there seems nothing obviously inappropriate in appealing to the operation of goals in humans in order to understand how they would operate in artificial agents. Perhaps one might contend that we do not really know whether artificial agents would have human analogues of desires and goals, or whether they would have something distinctively different. If this is the case, however, then our level of ignorance is even more profound than we had realised (since we don't even know what words we can use to talk about the issue), and therefore much of Bostrom’s argument on these subjects would be grossly premature and under-theorised.

Bostrom also argues that once a superintelligence comes into being, it would resist any changes to its goals, since its current goals are (nearly always) better achieved by refraining from changing them to some other goal. There is an obvious flaw in this argument, namely that humans change their goals all the time, and indeed whole subdisciplines of philosophy are dedicated to pursuing the question of what we should value and how we should go about modifying our goals or pursuing different things to what we currently do. Humans can even change their ‘final goals’ (insomuch as any such things exist), such as when they convert to a different religion or change between radically opposed political ideologies. Bostrom mentions this briefly but does not present any particularly convincing explanation for this phenomenon, nor does he explain why we should assume that this clear willingness to countenance (and even pursue) goal changes is not something that would affect AIs as it affects humans. One potential response could be that the ‘final goal’ pursued by all humans is really something very basic such as ‘happiness’ or ‘wellbeing’ or ‘pleasure’, and that this never changes even though the means of achieving it can vary dramatically. I am not convinced by this analysis, since many people (religious and political ideologues being obvious examples) seem motivated by causes to perform actions that cannot readily be regarded as contributing to their own happiness or wellbeing, unless these concepts are stretched to become implausibly broad. Even if we accept that people always act to promote their own happiness or wellbeing, however, it is certainly the case that they can dramatically change their beliefs about what sort of things will improve their happiness or wellbeing, thus effectively changing their goals.
It is unclear to me why we should expect that a superintelligence able to reflect upon its goals could not similarly change its mind about the meaning of its goals, or dramatically alter its views on how to best achieve them.