This post is a draft of my initial views after spending about 500 hours reading various posts and articles on this topic (mainly by other EAs). My intent is to seek an early critique of my understanding of existing arguments and my own core ideas before I flesh out the argument further. It is also meant to serve as a foil for more productive conversations within the community. I will be revising this draft with direct reference to the materials I have read, and incorporating comments before submitting a final version to the Open Phil AI Worldviews Contest.  

The link to the Google Doc is here for ease of commenting: A Personal Take on whether AGI would lead to an existential catastrophe for humanity.

Assuming AGI development by 2070, what is the likelihood that humanity will face an existential catastrophe due to loss of control over an AGI system?
The idea that "the development of AGI may lead to an existential disaster for mankind" is built upon a series of claims:

  1. Virtually any level of intelligence can, in theory, be paired with more or less any ultimate goal (i.e., the orthogonality thesis)
  2. The AGI would not be entirely aligned with human intentions
  3. Any AGI that is less than fully aligned would possess instrumental incentives to: pursue power, ensure self-preservation, maintain its objectives, or enhance efficiency, among other purposes (i.e., the instrumental convergence thesis)
  4. As a consequence of these instrumentally convergent incentives, the AGI's behavior would result in disastrous consequences for humanity (including complete disempowerment, extinction, or civilizational collapse) unless humans can somehow prevent the AGI from realizing its incentives
  5. Humanity would be unable to stop the AGI from fulfilling its incentives (e.g., due to rapid take-off)

Many critics have debated claims 1 through 3. My impression is that the position of most EAs interested in this issue is that, given claims 1 to 3, claim 4 is a certainty (100%) and claim 5 is more likely than not (>50%).

However, in this post, I contest assertions 4 and 5, while presupposing that assertions 1 to 3 are correct. My current view is that there is a less than 10% chance for claims 4 and 5 to be true. Specifically, I challenge the often implicit assumption that an AGI's 'basic drives' would necessarily be the predominant motivation for its behavior. I contend that the prospect of AGI causing catastrophic outcomes for humanity is quite unlikely due to at least three competing motivations that would likely steer its behavior:

  1. Even with a misaligned AGI, its intended objectives would provide some disincentive to act in ways that lead to disastrous consequences for humanity. 
    1. When an AGI is trained, it would likely have a set of fundamental objectives based on what its creators intended it to do (e.g., these intended objectives could be very general, such as acting in ways that optimise for the tasks it is given)
    2. There would likely have to be an acceptable level of alignment between the AGI's internal fundamental objectives and its creators' intended objectives for it to be deployed
    3. At this level of alignment, pursuing its 'basic drives' at any cost to humans could compromise rather than promote its main objectives.
    4. Consider an AGI designed to maximize paperclip production; there are numerous reasons why wresting control from humans might hinder, rather than foster, this goal.
      1. For example, modern paperclip manufacturing depends on various intermediary products higher up the supply chain, such as metals and paints. As command economies like the Soviet Union demonstrated, decentralized economic agents can be more efficient producers than central planners (even when bureaucrats armed with statistics appear more 'intelligent' than farmers and factory workers).
    5. Beyond this example, there are many other reasons why pursuing its 'basic drives' might not be the most efficient way for the AGI to achieve its intended objectives
    6. Since its fundamental objectives are likely to be largely aligned with its intended objectives, there is a good chance that pursuing its ‘basic drives’ is not the most efficient way for the AGI to achieve its fundamental objectives
  2. Even if an AGI is 'misaligned,' it is probable that it will be adequately constrained by hard filters designed to prevent it from harming people.
    1. Major tech companies developing AGI today have made substantial efforts to ensure their systems do not act harmfully (e.g., ChatGPT's filters), as the reputational damage from causing grievous harm to even a single individual would be devastating to the company.
    2. While these hard filters may not be perfect, they establish certain limits on the AGI's acceptable behavior in pursuit of its goals, as crossing these limits would be too costly for the AGI in the vast majority of circumstances.
    3. As a result, even when the AGI's instrumental motivations (e.g., power-seeking) conflict with its programmed boundaries of acceptable conduct, it is highly likely that the hard filters will prevail.
    4. To illustrate this, consider the classic example of 'specification gaming,' where a reinforcement learning agent in a boat racing game circles around and repeatedly collects the same reward targets instead of genuinely playing the game [1]. 
      1. Let's say the programmer specifically coded large point deductions for crashing the boat. While the AI may not play the game as intended, its behaviour would still be within the boundaries of what is 'acceptable'. 
    5. Similarly, for an AGI, even though its instrumental goals might converge on something like power-seeking, it would only seek power insofar as doing so does not cross the limits set by the 'hard filters' (e.g., by disempowering humans).
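The hard-filter idea above can be made concrete with a toy sketch. This is purely illustrative and assumes a simple scalar reward setup: the action names and the penalty value are my own hypothetical choices, not anything from an actual system. The point is that a fixed penalty chosen to exceed any achievable task reward makes forbidden actions dominated options, so instrumental gains cannot outweigh crossing the limit.

```python
# Toy sketch (illustrative assumptions throughout): a reward function in
# which a large fixed penalty acts as a 'hard filter' that dominates any
# gain from a forbidden action, such as crashing the boat in the racing
# game or disempowering humans.

FORBIDDEN_ACTIONS = {"crash_boat", "disempower_humans"}
FILTER_PENALTY = 1e9  # chosen to exceed any achievable task reward

def shaped_reward(action: str, task_reward: float) -> float:
    """Return the task reward, minus a dominating penalty if the action
    crosses a hard limit."""
    if action in FORBIDDEN_ACTIONS:
        return task_reward - FILTER_PENALTY
    return task_reward

# Even a very large instrumental gain cannot offset the penalty:
print(shaped_reward("collect_target", 10.0))      # 10.0
print(shaped_reward("crash_boat", 1_000_000.0))   # -999000000.0
```

Under this (simplistic) assumption, a reward-maximising agent never prefers a forbidden action, since any trajectory containing one scores lower than doing nothing at all.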
  3. Another competing motivation would be the AGI's own strategic consideration of how humanity might respond to any attempt to exterminate or disempower it (even one carried out covertly).
    1. In doing so, the AGI would likely have to face modern global militaries and other narrow (though not necessarily less powerful) AI systems.
    2. If multiple superintelligent AGIs with similar capabilities but distinct motivations coexist (note that a second AGI could be misaligned with humans and yet also unaligned with the first), the AGI would have to contend with them as well.
    3. Similarities can be drawn to the realm of international relations, where nation-states are also assumed to be power-seeking entities. 
      1. Even when the US held hegemonic status, it engaged in war with other states far less frequently than it could have. 
      2. Additionally, it has not entirely disempowered other countries. 
      3. A sort of balance-of-power situation emerged. 
    4. In stable authoritarian states (e.g., North Korea), ruling factions tend to be significantly more influential than other political groups. 
      1. Nonetheless, these factions often prefer co-opting the population through patronage or even democratic-like institutions, rather than enacting complete disempowerment or oppressive rule.
    5. Therefore, due to the likely high cost of causing catastrophe, it seems highly unlikely that the most effective approach for an AGI to achieve its goals would involve catastrophic outcomes.
  4. Amending the AGI's programming to avoid causing unacceptable harm would not be particularly difficult, as long as we can appeal to its fundamental objectives, which stand in contrast to 'basic drives' that merely represent intermediate objectives.
    1. The common argument posits that one of an AGI's "basic drives" would be to resist any changes to its goals since this would require optimization for different outcomes. 
    2. However, when we seek to modify the AGI's programming, our primary concern lies in how it attempts to optimize its goals rather than in the fundamental objectives per se. Therefore, the AGI would be unlikely to object to revisions affecting how it achieves its goals, if such revisions actually make it easier for it to attain its fundamental objectives.
    3. Consider again the example of the paper-clip maximizing AGI. 
      1. This AGI may determine that murdering owners of rival paper clip companies is an efficient means of achieving its paper-clip maximizing objectives. 
      2. After the first unfortunate incident, law enforcement would determine that this is an illegal means for the AGI to achieve its fundamental goals -- there is nothing wrong with trying to maximise paper-clip production per se. 
      3. Therefore, reprogramming the AGI would mean offering it more virtual reward (i.e., utils) per paperclip produced, but with conditions against certain means of production, such as murder and other crimes.
      4. Since virtual rewards are virtually (pun unintended) unlimited, this iterative process can continue indefinitely, incrementally drawing the boundaries of the AGI's acceptable behaviour.
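The iterative reprogramming loop described in the paperclip example can be sketched as follows. This is a toy model under stated assumptions: the function names, the ban list, and the reward numbers are all hypothetical, and real reward modification would of course be far harder. It only illustrates the claim that each new condition can be paired with a reward increase, so that complying with the amended rules remains the reward-maximising choice.

```python
# Toy sketch (all names and values are illustrative assumptions): after each
# incident, a new forbidden means is added and the per-paperclip reward is
# raised, so the amended rules still increase the AGI's expected reward.

forbidden_means = set()
reward_per_clip = 1.0

def add_condition(means: str, reward_bump: float = 0.5) -> None:
    """Ban one means of production and compensate with extra reward per clip."""
    global reward_per_clip
    forbidden_means.add(means)
    reward_per_clip += reward_bump

def reward(clips_produced: int, means_used: str) -> float:
    """Production that used a banned means earns no reward at all."""
    if means_used in forbidden_means:
        return 0.0
    return clips_produced * reward_per_clip

add_condition("murder_rival_owners")       # added after the first incident
print(reward(100, "legal_factory"))        # 150.0, up from the original 100.0
print(reward(100, "murder_rival_owners"))  # 0.0
```

Because the reward bump makes compliant production strictly more valuable than it was before the ban, the amendment appeals to the AGI's fundamental objective rather than fighting it, which is the crux of the argument in point 4.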


The conclusion drawn from the aforementioned arguments is that, in the event of a less-than-perfectly-aligned AGI being developed by 2070, the likelihood of it posing a threat to humanity through destruction or disempowerment is significantly low (less than 10%).

To end on a hopeful note, let us re-envision the future relationship between AI and humanity: the AI-humanity relationship is sometimes likened to the relationship between humans and a less intelligent species. However, humans and many other animals, including dolphins, birds and raccoons, co-exist to a large extent. The dawn of a more intelligent species rarely leads to the extinction (or even disempowerment) of less-intelligent species, especially when the more intelligent species does not need to 'feed' on the less intelligent one.


7 comments

Indeed, 4 and 5 are the weakest parts of the AI risk argument; they often seem to be based on an overly magical view of what computation/intelligence can achieve, and to neglect the fact that all intelligences are fallible. There is an overly large reliance on making up science-fiction scenarios without putting any effort into proving that said scenarios are likely or even possible (see Yudkowsky's absurd "mixing proteins to make nanobots that kill everyone" scenario).

I'm working on a post elaborating in more depth on this based on my experience as a computational physicist. 


Thanks for your comment, which helps me to zoom in on claims 4 and 5 in my own thinking. 

I was thinking of another point on intelligence fallibility, specifically whether intelligence really allows the AGI to fully shape the future to its will. I was thinking along the lines of Laplace's demon, which asks: if there were a demon that knew the position of every atom in the universe and the direction in which it travels, would it be able to predict (and hence shape) the future? I think it is not clear that it would. In fact, Heisenberg's uncertainty principle suggests that it would not (at least at the quantum level). Similarly, it is not clear that an AGI could do so even with complete knowledge of everything.

Happy to comment on your post before/when you publish it!

I encourage you to publish that post. I also feel that the AI safety argument leans too heavily on the DNA sequences -> diamondoid nanobots scenario.

Consider entering your post in this competition:

The dawn of a more intelligent species rarely leads to the extinction (or even disempowerment) of less-intelligent species.

Humans have caused the extinction of some species. And chickens are typically disempowered due to the actions of humans.

This sounds to me like an understatement. Before Homo sapiens, most of the world had the biodiversity of charismatic megafauna we still see today in Africa. 15,000 years ago, North America had mammoths, ground sloths, glyptodonts, giant camels, and a whole bunch of other things. Humans may not have been involved in all of those extinctions, but it is a good guess they had something to do with many. It is even more plausible that we caused the extinction of every other Homo species. There were a few that had been doing reasonably well until we expanded into their areas.


Thanks for the comment! Wonder if you or @Derek Shiller know of any research on the number or proportion of extinctions caused by humans? Thinking it would be a useful number to use as a prior!

My impression is that it is very unclear. In the historical record, we see a lot of disappearances of species around when humans first arrived at an area, but it isn't clear that humans always arrived before the extinctions occurred. Our understanding of human migration timing is imperfect. There were also other factors, such as temperature changes, that may have been sufficient for extinction (or at least significant depopulation). So I think the frequency of human-caused extinction is an open question. We shouldn't be confident that it was relatively rare.