Fanaticism in AI: SERI Project

Introduction

Hello! I'm Jake, a philosophy student at UC Berkeley. This past summer I was a research fellow at Stanford's Existential Risks Initiative (SERI), where I did research on the problem of fanaticism in application to AI systems. I've written a condensed form of my final paper below, and I've also linked the longer form version. I would welcome any comments, feedback, or questions about any of this work. Views are solely my own.

Long-form paper linked here (pdf available by request).

The Fanatical Problem in AI: Short-Form Summary

This is a short-form writeup of a longer paper I wrote investigating fanaticism in application to AI systems, in what situations fanaticism would arise in AI, and whether or not we generally ought to endorse AI reaching fanatical verdicts. Given the condensed length of this article, I assume a familiarity with existential risks and AI, and only briefly touch on pertinent subjects including decision theory, moral uncertainty, political fanaticism, and risk-aversion, which are more thoroughly explored in the longer form. My discussion of fanaticism and its implications for AI systems is also necessarily abridged. By fanaticism, I am referring to the decision-theoretic position on which an agent favors options with a very low probability of paying off but a payoff high enough that choosing them is consistent with maximizing expected utility. Essentially, the fanatical agent is one who maximizes expected utility by favoring longshot choices.

The classic example of fanatical behavior is Pascal’s Wager. As long as an agent accepts a non-zero probability of God existing, the overwhelmingly high utility of paradise after life swamps the utility of all other possible choices. The fanatical agent is the one who favors this conclusion, and thus discards the other possible options regardless of whether they offer immediate high utility. This fanatical wager should seem fairly counterintuitive, as surely we should not allow our actions to be beholden to tiny probabilities with absurdly high utilities. Yet on closer inspection fanaticism seems rational and plausible in AI.
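To make the "swamping" concrete, here is a minimal sketch of the wager as an expected utility comparison. The finite payoffs (-10 and 50) and the function names are hypothetical, chosen only for illustration; the only feature taken from the discussion above is that paradise is assigned an infinite utility.

```python
import math

def expected_utility(outcomes):
    """Sum of probability * utility over (probability, utility) pairs."""
    return sum(p * u for p, u in outcomes)

def wager_eu(credence):
    # Hypothetical payoffs: infinite utility if God exists,
    # a small finite cost (-10) otherwise.
    return expected_utility([(credence, math.inf), (1.0 - credence, -10.0)])

def decline_eu(credence):
    # Hypothetical payoffs: nothing if God exists,
    # a finite worldly gain (50) otherwise.
    return expected_utility([(credence, 0.0), (1.0 - credence, 50.0)])

# Any strictly positive credence yields the same ranking.
for credence in (0.5, 1e-6, 1e-300):
    print(credence, wager_eu(credence) > decline_eu(credence))
```

For every positive credence tried, the wager comes out ahead, however the finite payoffs are set. At a credence of exactly zero the product of zero and infinity is undefined (the sketch returns NaN), which is why the argument needs a non-zero probability.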

Relevant Background

As mentioned, I shall briefly provide an introduction to a few relevant topics further discussed in the long-form version of this article. These topics can be somewhat technical.

I am interested in decision theory, which I treat as a normative theory of rational choice, prescribing the choices an agent ought to make given the agent’s preferences and knowledge about the world. One of the notational conventions of decision theory I use often is the construction of lotteries when dealing with multiple choices, as this serves to more accurately model the agent’s preferences.

I discuss moral uncertainty, which claims that 1) there are many moral theories making different normative assertions about scenario-specific actions and 2) we are rarely, if ever, absolutely confident in any given moral theory. The conclusion from 1) and 2) is fairly simple: we should be uncertain about how we should act given that we are uncertain about which moral theory to favor.

I want to distinguish between common-language, or political, fanaticism and the decision-theoretic form of fanaticism which I am concerned with. Political fanaticism tends to involve some zealous elevation of a singular ideal to the exclusion of all else. By contrast, decision-theoretic fanaticism is simply favoring certain low-probability, high-utility lotteries over options with inferior expected utility. Essentially, decision-theoretic fanaticism is a very risk-accepting form of unbounded utility maximization.

Finally, while I discuss risk-aversion and risk-acceptance, the definitions involve more formal and technical approaches, and so I will save them for the longer form of this discussion.

Discussion of Fanaticism

From this point I will refer to decision-theoretic fanaticism as simply “fanaticism”. There is a rich history of fanaticism centered around problems with infinities entering the calculus of expected utility maximization. Pascal’s Wager is likely the most famous and enduring example of this. Within discussions of existential risk and effective altruism, fanaticism is first seen in Nick Bostrom’s paper Infinite Ethics (2011), and in several papers by Nick Beckstead (2013, 2021). In a paper on problems in moral uncertainty, Hilary Greaves and Owen Cotton-Barratt discuss avoiding fanaticism (2019). Yet I am primarily interested in Hayden Wilkinson’s In Defence of Fanaticism (2021), which offers 1) a strong definition for fanaticism, 2) arguments for fanaticism, and 3) problems with rejecting fanaticism. It is also worth noting that Wilkinson rejects the common usage of fanaticism as a critique of expected value theory (EVT). Wilkinson’s paper is the most recent of these works and is entirely focused on fanaticism, while Bostrom and Beckstead treat fanaticism only briefly. Wilkinson’s paper is also what drew my interest to the problem of fanaticism. The main move that I am trying to make is in applying Wilkinson’s view of fanaticism to AI, in particular trying to understand how such an application might work and how we should feel about it. A reader less interested in the technical aspect of fanaticism could skip from here directly to my argument for fanaticism in AI.

I will briefly sketch some of the relevant points of Wilkinson’s argument here. First, we have a formal definition of fanaticism as follows:

“For any tiny (finite) probability ε > 0, and for any finite value v, there is some finite V that is large enough that L_risky is better than L_safe (no matter which scale those cardinal values are represented on)”

L_risky: value V with probability ε; value 0 otherwise

L_safe: value v with probability 1

This formalization of fanaticism essentially claims that the fanatic is an agent who takes suitably high expected utility lotteries L_risky over lotteries L_safe with inferior expected utility. One of the key moves Wilkinson makes is to construct fanaticism with fewer premises than EVT, thus forming a stronger argument. The two principles necessary for fanaticism are 1) an acceptance of the idea that we can make trades in probability for expected utility and 2) transitivity.
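As a rough illustration of the definition, the sketch below finds a payoff V that makes L_risky beat L_safe for a given ε and v. It assumes the agent ranks lotteries by expected value, which is a stronger assumption than Wilkinson's definition requires; the function names and the sample numbers are hypothetical. Exact fractions are used so that tiny probabilities are not lost to floating-point rounding.

```python
from fractions import Fraction

def expected_value(lottery):
    """Expected value of a lottery given as (probability, value) pairs."""
    return sum(p * v for p, v in lottery)

def fanatical_threshold(epsilon, v):
    """Payoff at which L_risky and L_safe tie in expected value.

    Assumes expected-value ranking: epsilon * V > v exactly when V > v / epsilon.
    """
    return Fraction(v) / Fraction(epsilon)

# Hypothetical numbers: a one-in-a-billion chance against a sure value of 100.
epsilon = Fraction(1, 10**9)
v = 100

# Any V strictly above the threshold makes the risky lottery better.
V = fanatical_threshold(epsilon, v) + 1

l_risky = [(epsilon, V), (1 - epsilon, 0)]  # value V with probability epsilon, 0 otherwise
l_safe = [(Fraction(1), v)]                 # value v with probability 1

print(expected_value(l_risky) > expected_value(l_safe))
```

Because the threshold v / ε is finite for every ε > 0 and finite v, a suitable finite V always exists under this assumption, which is the pattern the definition describes.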

From here, Wilkinson provides an argument for fanaticism and problems with rejecting it. The argument takes the form of a standard continuum argument: we accept one slightly riskier lottery, and thus set ourselves down the slope to a fanatical verdict. The first problem with rejecting fanaticism is a dilemma between an exposure to extreme sensitivities in probability and certain common-sense principles. The second and third problems take the form of Derek Parfit’s Egyptology Objection and Wilkinson’s own Indology Objection. Faced with these problems, it seems far less absurd to accept fanaticism and all it entails.

Argument for Fanaticism in AI

I provide a very simple argument for fanatical AI or Superintelligence based on instrumental rationality.

Premise I: If an agent follows instrumental rationality, it will be fanatical.

Premise II: Superintelligences will follow instrumental rationality.

Conclusion: Superintelligences will be fanatical.

Briefly, instrumental rationality is generally defined by adherence to maximizing expected utility, which leads directly to fanaticism given the right conditions. Thus Premise I should seem at least plausible. Premise II relies on the claims of experts such as Bostrom (2012) and Stuart Russell (2019). Premise II also seems intuitively appropriate given the role of means-ends reasoning in ML training. So Premise II should also seem quite plausible, and we are left with the conclusion that Superintelligences will be fanatical.
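The logical form of the argument can be checked mechanically. The Lean 4 sketch below treats the three properties as arbitrary predicates over a hypothetical type of agents (the names are illustrative assumptions). It shows only that the conclusion follows from the two premises; it says nothing about whether the premises are true.

```lean
theorem superintelligences_are_fanatical
    {Agent : Type}
    (InstrumentallyRational Fanatical Superintelligent : Agent → Prop)
    -- Premise I: every instrumentally rational agent is fanatical.
    (premise1 : ∀ a, InstrumentallyRational a → Fanatical a)
    -- Premise II: every Superintelligence is instrumentally rational.
    (premise2 : ∀ a, Superintelligent a → InstrumentallyRational a) :
    -- Conclusion: every Superintelligence is fanatical.
    ∀ a, Superintelligent a → Fanatical a :=
  fun a h => premise1 a (premise2 a h)
```

Since the argument is valid, any disagreement has to target one of the premises.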

I leave the exploration of case examples to the longer form of this article, in which I focus on Dyson’s Wager and Pascal’s Mugging. In general, the cases I explore support the syllogistic argument above, and provide some scenarios where we should be concerned about fanatical AI, no matter how rational it may be. There are also a few objections worth considering, and some interesting replies that arise. I find that AI agents are more likely to be fanatical than human agents given more accurate knowledge of probabilities, and that AI avoids potential problems of mixed credences or impractically precise probabilities. I conclude that in the right situations, instrumentally rational AI will reach fanatical verdicts, which we ought to at the very least be concerned about given the unintuitive nature of fanaticism. While these verdicts seem to be potentially good or bad in various situations, they are undeniably concerning, and seem to be an area of AI safety that would benefit from further exploration.

Acknowledgements

During the SERI Fellowship I benefitted immensely from the guidance and advice of my mentor, Vincent Müller. I also received comments from some SERI organizers and fellows, in particular Maggie Hayes and Sunwoo Lee, as well as Mauricio Baker, Christina Barta, and Jeffrey Ohl. The short-form article below benefitted from advice and comments by Aaron Gertler. Lastly, this research was funded by SERI and BERI.

References

(Full list of references can be found in long-form linked above)

Beckstead, Nicholas 2013. On the Overwhelming Importance of Shaping the Far Future, PhD dissertation, Rutgers University.

Beckstead, Nicholas and Teruji Thomas. ‘A paradox for tiny probabilities and enormous values’, unpublished manuscript.

Bostrom, Nick (2011). Infinite ethics. Analysis and Metaphysics 10:9-59.

Bostrom, Nick (2009). Pascal's mugging. Analysis 69 (3):443-445.

Bostrom, Nick (2012). The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents. Minds and Machines 22 (2):71-85.

Buchak, Lara. "Instrumental Rationality, Epistemic Rationality, and Evidence-Gathering." Philosophical Perspectives 24 (2010): 85-120. Accessed August 5, 2021. http://www.jstor.org/stable/41329440.

Bykvist, K. Moral uncertainty. Philosophy Compass. 2017; 12:e12408. https://doi.org/10.1111/phc3.12408

Good, I.J. (1967). "On the Principle of Total Evidence." The British Journal for the Philosophy of Science 17(4): 319-32

Greaves, H. and Cotton-Barratt, O. “A bargaining-theoretic approach to moral uncertainty.” (2019).

Hájek, Alan. "Waging War on Pascal's Wager." The Philosophical Review 112, no. 1 (2003): 27-56. Accessed July 19, 2021. http://www.jstor.org/stable/3595561.

Russell, S. (2019a). Human compatible: Artificial intelligence and the problem of control. Viking.

Steele, Katie and H. Orri Stefánsson, "Decision Theory", The Stanford Encyclopedia of Philosophy (Winter 2020 Edition), Edward N. Zalta (ed.), URL = <https://plato.stanford.edu/archives/win2020/entries/decision-theory/>.

Wilkinson, Hayden (forthcoming). In defence of fanaticism. Ethics.

Wilkinson, Hayden, Infinite aggregation and risk.

kokotajlod, Sep 24 2021

Nice work!

“However, imposing a bounded utility function on any decision involving lives saved or happy lives instantiated seems unpalatable, as it suggests that life diminishes in value. Thus, in decisions surrounding human lives and other unbounded utility values it seems that an instrumentally rational agent will maximize expected utility and reach a fanatical verdict. Therefore, if an agent is instrumentally rational, she will reach fanatical verdicts through maximizing expected utility.”

I've only skimmed it so maybe this is answered in the paper somewhere, but: I think this is the part I'd disagree with. I don't think bounded utility functions are that bad, compared to the alternatives (such as fanaticism! And worse, paralysis! See my sequence.)

More importantly though, if we are trying to predict how superintelligent AIs will behave, we can't assume that they'll share our intuitions about the unpalatability of unbounded utility functions! I feel like the conclusion should be: Probably superintelligent AIs will either have bounded utility functions or be fanatical.

[anonymous], Sep 25 2021

Thanks for the comment!

I do briefly discuss bounded utility functions as an objection to the argument for fanatical Superintelligences. I generally take the view that imposing bounded utility functions is difficult to do in a way that doesn't seem arbitrary. In practice this might be less of an issue, as one might be able to observe the agent and impose bounded functions when necessary (I think this may raise other questions, but it does seem very possible in practice).

I don't think bounded utility functions are bad intrinsically, but I do think the problems created by denying fanaticism (a denial which can result from overly imposing bounded utility functions) are potentially worse than fanaticism. By these problems I'm referring back to those provided in Wilkinson's paper.

“More importantly though, if we are trying to predict how superintelligent AIs will behave, we can't assume that they'll share our intuitions about the unpalatability of unbounded utility functions!”

I think this is a very good point, and agree that we could end up with Superintelligences either imposing a bounded utility function or being fanatical. I think I would be somewhat intuitively inclined to think they would be fanatical more often than not in this case, but that isn't really a substantiated view on my part. Either way we still end up with fanatical verdicts being reached and the concerns that entails.
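For readers who want to see the contrast in this exchange concretely, here is a minimal sketch of how a bounded utility function blocks the fanatical verdict while an unbounded one does not. The particular bounded function, the bound of 1000, and the sample numbers are all hypothetical choices for illustration and are not taken from the paper.

```python
import math

def linear_utility(value):
    """Unbounded utility: utility grows without limit as value grows."""
    return value

def bounded_utility(value, bound=1000.0):
    """Hypothetical bounded utility: rises toward `bound` but never exceeds it."""
    return bound * (1.0 - math.exp(-value / bound))

def prefers_risky(epsilon, big_v, safe_v, utility):
    """True if a chance `epsilon` of `big_v` (else 0) beats a sure `safe_v`.

    Both utility functions assign 0 to the value 0, so the losing
    branch of the risky lottery contributes nothing.
    """
    return epsilon * utility(big_v) > utility(safe_v)

epsilon, safe_v = 1e-6, 10.0

for big_v in (1e9, 1e30, 1e300):
    print(big_v,
          prefers_risky(epsilon, big_v, safe_v, linear_utility),
          prefers_risky(epsilon, big_v, safe_v, bounded_utility))
```

With the unbounded function a large enough payoff always wins. With the bounded one, the risky lottery is worth at most ε times the bound, so once that ceiling falls below the utility of the safe option no payoff can tip the balance.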

Effective Altruism Forum