Some governance research ideas to prevent malevolent control over AGI and why this might matter a hell of a lot

Jim Buhler

Epistemic status: I spent only a few weeks reading/thinking about this. I could have asked more people to give me feedback so I could improve this piece but I’d like to move on to other research projects and thought throwing this out there was still a good idea and might be insightful to some.

Summary

Many power-seeking actors will want to influence the development/deployment of artificial general intelligence (AGI). Some of them may have malevolent(-ish)^[1] preferences which they could satisfy on massively large scales if they succeed at getting some control over (key parts of the development/deployment of) AGI. Given the current rate of AI progress and dissemination, the extent to which those actors are a prominent threat will likely increase.

In this post:

I differentiate between different types of scenarios and give examples.

I argue that 1) governance work aimed at reducing the influence of malevolent actors over AGI does not necessarily converge with usual AGI governance work (which is, as far as I know, mostly focused on reducing risks from “mere” uncautiousness and/or inefficiencies due to suboptimal decision-making processes), and 2) the expected value loss due to malevolence, specifically, might be large enough to constitute an area of priority in its own right for longtermists.

I, then, list some research questions that I classify under the following categories:

How might malevolent control over AGI trigger long-term catastrophes?

(This section is heavily inspired by discussions with Stefan Torges and Linh Chi Nguyen. I also build on DasSarma and Wiblin’s (2022) conversation.)

We could divide the risks we should worry about into two categories: Malevolence as a risk factor for AGI conflict and Direct long-term risks from malevolence.

Malevolence as a risk factor for AGI conflict

Clifton et al. (2022) write:

Several recent research agendas related to safe and beneficial AI have been motivated, in part, by reducing the risks of large-scale conflict involving artificial general intelligence (AGI). These include the Center on Long-Term Risk’s research agenda, Open Problems in Cooperative AI, and AI Research Considerations for Human Existential Safety (and this associated assessment of various AI research areas). As proposals for longtermist priorities, these research agendas are premised on a view that AGI conflict could destroy large amounts of value, and that a good way to reduce the risk of AGI conflict is to do work on conflict in particular.

In a later post from the same sequence, they explain that one of the potential factors leading to conflict is conflict-seeking preferences (CSPs) such as pure spite or unforgivingness. While AGIs might develop CSPs by themselves in training (e.g., because there are sometimes advantages to doing so; see, e.g., Abreu and Sethi 2003), they might also inherit them from malevolent(-ish) actors. Such an actor would also be less likely to want to reduce the chance of CSPs arising by “accident”.

This actor can be a legitimate decisive person/group in the development/deployment of AGI (e.g., a researcher at a top AI lab, a politician, or even some influencer whose opinion is highly respected), but also a spy/infiltrator or external hacker (or something in between these last two).

Direct long-term risks from malevolence

For simplicity, say we are concerned about the risk of some AGI ending up with X-risk-conducive preferences (XCPs)^[2] due to the influence of some malevolent actor (i.e., some human(-like) agent with traits particularly conducive to harm).

Here is a (not necessarily exhaustive) list of paths that may lead to the deployment of an AGI with XCPs:

The AGI is straightforwardly aligned with some actor(s) who, themselves, have XCPs.
The AGI is aligned with some actor(s) that have Quasi-XCPs (preferences that are malevolent-ish but can’t lead to existential catastrophes on their own), and either:
- the AGI makes the values go through some kind of CEV/idealization process and this results in XCPs, or
- some peculiar sort of AI misalignment where Quasi-XCPs are misspecified in a way that somehow leads to XCPs, or
- the actor with Quasi-XCPs doesn’t really try to align AGI with their values; they just launch an attack on a “friendly” AGI (call it Alice) that is at least roughly aligned with human values, or develop their own AGI with “anti-Alice preferences”, and this results in an AGI that has XCPs.

Just as for scenarios where malevolence is a risk factor for AGI conflict, “this actor can be a legitimate decisive person/group in the development/deployment of AGI (e.g., a researcher at a top AI lab, a politician, or even some influencer whose opinion is highly respected), but also a spy/infiltrator or external hacker (or something in between the two)”.

Why focus on AGI rather than “merely” advanced AI?

Allan Dafoe (2020) argues that while work motivated by the possible rise of AGI or artificial superintelligence is important, the relevance of the AI governance field should not be conditional on the eventual emergence of a general form of artificial intelligence. Various kinds of long-term risks (or at least risk factors) are also posed by future advanced general-purpose technologies.

While I agree with this, I tentatively believe that most of the expected future value we can affect lies in worlds where AGI arises, and even more so when it comes to scenarios involving malevolent actors. Although many global catastrophic risks (or GCRs) come from (or are exacerbated by) the misalignment/misuse of “merely” advanced AIs (see Dafoe 2020 for examples), AGI seems much more likely to lead to X-risks or S-risks, which are – by definition (see Aird 2020) – more severe than GCRs, the latter being unlikely to lead to an actual existential or suffering catastrophe (see, e.g., Rodriguez 2020).^[3]

Why might we want to consider focusing on malevolent actors, specifically?

After all, a decisive actor in an AGI context doesn’t have to show malevolent traits to cause a large-scale catastrophe. Mere unawareness, uncautiousness, avidity, or incompetence may be more than enough (Reece-Smith 2023; Dafoe 2020).

Nonetheless, I think there are compelling reasons to focus specifically on malevolent(-ish) actors.

Empowered malevolence is far more dangerous than mere empowered irresponsibility

While a merely irresponsible AGI-enabled actor might already trigger an extinction(-like) scenario, an ill-intentioned one could i) be far more willing to take such risks, ii) actively seek to increase them, and/or iii) cause outcomes worse than extinction^[4] (which means that – unlike most longtermist priorities – work on reducing risks from malevolence is not necessarily conditional on the value of human expansion being positive).

And while history is already full of near-misses in terms of existential catastrophes (partly) due to malevolent actors (see, e.g., Althaus and Baumann 2020), as AI progress goes on and keeps being widely distributed, risks from such actors will become increasingly acute (Brundage et al. 2018).

Malevolent actors are more likely to want to influence AGI in the first place

Since most humans don’t seem to show severe malevolent-like traits, one might think that ill-intentioned actors are a small subset of all the potentially decisive actors we should worry about and try to stop/influence, such that focusing on them would be restrictive.

But, unfortunately, evidence suggests a significant correlation between malevolent traits and (successful) power-seekingness. Althaus and Baumann (2020) write:

Malevolent humans are unlikely to substantially affect the long-term future if they cannot rise to power. But alas, they often do. The most salient examples are dictators who clearly exhibited elevated malevolent traits: not only Hitler, Mao, and Stalin, but also Saddam Hussein, Mussolini, Kim Il-sung, Kim Jong-il, Duvalier, Ceaușescu, and Pol Pot, among many others.
In fact, people with increased malevolent traits might even be overrepresented among business (Babiak et al., 2010; Boddy et al., 2010; Lilienfeld, 2014), military, and political leaders (Post, 2003; Lilienfeld et al., 2012), perhaps because malevolent traits—especially Machiavellianism and narcissism—often entail an obsession with gaining power and fame (Kajonius et al., 2016; Lee et al., 2013; Southard & Zeigler-Hill, 2016) and could even be advantageous in gaining power (Deluga, 2011; Taylor, 2019).

Plus, both AI and politics (two domains decisive for our concerns here) are heavily male-dominated fields,^[5] and “elevated Dark Tetrad traits are significantly more common among men (Paulhus & Williams, 2002; Plouffe et al., 2017)”. (Althaus and Baumann 2020)

Therefore, assuming this correlation still holds to a significant extent when it comes to all the ill-intentioned actors we’re worried about (and not only those with high Dark Tetrad traits; this seems fairly likely), such malevolent(-ish) actors may actually be a larger chunk of the “actors we should worry about” than one may intuitively imagine.

Malevolent actors are more likely to trigger a value lock-in

Let’s consider the following three scenarios:

1. The future is controlled by altruistic values. This requires both that i) the technical AI alignment problem is solved, and ii) the solution is implemented in a way that is conducive to an EAish-shaped future (e.g., thanks to some AI governance interventions).
2. The future is controlled by random-ish or superficial values. This requires that at least one of the two conditions above is not met (e.g., due to carelessness among the relevant actors).
3. The future is controlled by malevolent values. (See Breaking down the necessary conditions for some ill-intentioned actor to cause an AGI-related long-term catastrophe regarding the requirements for this.)

The key point I want to make here is that, if #3 triumphs over #1 and #2, it will do so far more utterly for far longer.

The actors that will cause/allow #1 or #2 to occur have no/little incentive to lock specific values into an AGI forever. In scenario #1, most humble/altruistic actors would probably push for making AGI corrigible and preserving option value (see Ord 2020, chapter 7; MacAskill 2022, chapter 4; Finnveden et al. 2022). In scenario #2, there is no reason to assume that the actors triggering a future controlled by random-ish/superficial values will make longtermist goal preservation their priority (for instance, if Alice creates a paperclip-maximizer out of carelessness, it seems likely that she didn’t carefully invest many resources into making sure it does maximize paperclips until the end of time, leaving it subject to value drift).

On the contrary, malevolent actors seem dangerously likely to attempt to lock their values into an AGI system. Lukas Finnveden et al. (2022) give instances of past authoritarian leaders who seem to have desired stable influence over the future:

As one example, the Egyptian pharaoh Akhenaten used his reign to stop the worship of Egyptian gods other than Aten; which included some attempts at erasing other gods’ names and the building of monuments with names like "Sturdy are the Monuments of the Sun Disc Forever". After his death, traditional religious practices gradually returned and many of the monuments were demolished — but perhaps Akhenaten would have prevented this if he could have enforced stability. As another example, Nazi Germany was sometimes called the “Thousand-Year Reich”.

MacAskill (2022, chapter 4) gives more examples of this kind.

All else equal, malicious values are more likely to be pursued effectively for a very long time^[6] relative to yours, mine, or those of a paperclip maximizer. Such an AGI locked in with malevolent values would then, e.g., permanently prevent other sapiens-like beings from arising after having killed/incapacitated all humans^[7] and/or preserve an anti-eutopia and perpetuate harm over those overwhelmingly long timescales.

The importance of infosec and the relative tractability of detecting malevolent(-like) traits

Jeffrey Ladish and Lennart Heim (2022) make what I believe to be a strong case in favor of prioritizing information security to reduce long-term AI risks. Their take in a nutshell:

We expect significant competitive pressure around the development of AGI, including a significant amount of interest from state actors. As such, there is a large risk that advanced threat actors will hack organizations — that either develop AGI, provide critical supplies to AGI companies, or possess strategically relevant information— to gain a competitive edge in AGI development. Limiting the ability of advanced threat actors to compromise organizations working on AGI development and their suppliers could reduce existential risk by decreasing competitive pressures for AGI orgs and making it harder for incautious or uncooperative actors to develop AGI systems.

Then, as Nova DasSarma suggests during her 80,000 Hours interview, the “detection of bad actors within your organization and with access to your systems” may be a significant priority when it comes to infosec. Ladish and Heim (2022) back this up:

People are usually the weak point of information systems. Therefore, training and background checks are essential.

The problem is that such “bad actors” might simply be people like “AI-lab employees who might get bribed into disclosing sensitive information”. While we would ideally like to detect any potentially undesirable trait in employees such as “is potentially vulnerable to some kind of bribes/threats”, narrowing the search down to malevolent traits for tractability reasons seems reasonable. It is sometimes sensible to intentionally restrict one’s search area to where the light is, although I think this argument is the weakest on my list.

Robustness: Highly beneficial even if we fail at alignment

My impression^[8] is that the (implicit) ultimate motivation behind a large fraction of direct^[9] AGI governance work is something like “we want to increase the likelihood that humanity aligns AGI, which requires both making the technical problem more likely to be solved and – conditional on it being solved – making sure it is implemented by everyone who should”.^[10] This results in intervention ideas like making sure the relevant actors don’t “cut corners”, reducing the capabilities of the uncautious actors who might create misaligned AGIs, or slowing down AI progress.^[11]

While I do not intend to discredit work focused on this, I think it is important to notice that it implicitly assumes the alignment problem is (at least somewhat) tractable. This assumption seems warranted in many cases but… what if we fail at aligning AGI?

For reasons analogous to why we may want to complement reducing the chance of a global catastrophe with building disaster shelters and working out emergency solutions to feed everyone if it happens anyway, we should seriously consider not conditioning almost all AGI safety projects on the alignment problem being significantly tractable, and put more effort into work that is beneficial even in worlds where a misaligned AGI takes over.

Preventing malevolent-ish control over AGI is a great example of such work. While it obviously reduces the probability of humanity failing to align AGI, it also appreciably limits damage in misalignment scenarios. If the rise of a misaligned AGI turns out to be hard to avoid, we can still at least steer away from chains of events where it’d end up with “malevolence-inspired” or conflict-seeking preferences, and such interventions would actually be quite impactful in expectation. Existential catastrophes are not all equal. There is a huge difference between a paperclip maximizer and a “Hitler AGI”^[12].

“But wouldn’t an AGI need to be aligned with some malevolent actor(s) in order to develop malevolent preferences? Isn’t a misaligned AGI just sort of a paperclip maximizer that won’t create much value or disvalue anyway?” you’re asking, sensibly.

Well, while a small error in the implementation of humanity’s values may be enough to lose most of the potential value of the future, a small error in implementing malevolence – alas – absolutely does not guarantee that we’d avoid extremely bad outcomes, since disvalue is not as complex or fragile as value may be (see DiGiovanni 2021). In fact, I can envision scenarios where the failed implementation of Hitler’s values in an AGI results in something much worse than what Hitler himself would have done or agreed to.

In sum, in the face of the relative cluelessness we (should) feel regarding the long-term impact of our actions, the robustness of our interventions matters. Nick Beckstead (2013) therefore invites us to look for “a common set of broad factors which, if we push on them, systematically lead to better futures”. And preventing malevolent control over AI might be one of these factors.^[13]

Counter-considerations and overall take

Here’s a short list of arguments against a focus on malevolent actors:

Many malevolent actors might get disempowered by more classic AI governance and infosec work anyway: For instance, consider attack scenario A) some random non-malevolent hacker breaches an AI lab → some malevolent actor gets access to what the hacker stole (e.g., they buy it from them or something), and attack scenario B) some malevolent actor directly gets access to key AGI-related inputs themself. Scenario A seems more likely than B. While the current post absolutely does not claim that we should focus on B-like scenarios rather than A-like ones, we do not need to explicitly focus on malevolence to prevent events like A. And although those two scenarios are highly specific to the domain of hacking, I believe we could find analogous examples in other relevant-to-AI-governance areas.
Comparatively bad outcomes may occur without the rise of malevolent actors:
- Possibility of accidentally conflictual AGI: While an AGI seems exceedingly unlikely to develop an intrinsic preference for human extinction by accident, it might develop conflict-seeking preferences (CSPs) without a malevolent intervention (e.g., because CSPs are sometimes advantageous in games; see, e.g., Abreu and Sethi 2003), which might lead to outcomes of similar or even higher magnitude (see Clifton et al. 2022). Therefore, something like CAIF’s and CLR’s work to ensure the way transformative AI systems are trained is conducive to cooperative behaviors might end up being more effective at reducing risks from CSPs than reducing the influence of potential malevolent actors. (If so, we should still be pretty worried about malevolent preferences conducive to long-term risks other than AGI conflict, though).
- Possibility of accidentally harmful AGI: Unintentional AI misalignment might obviously cause existential catastrophes for reasons discussed at length in the EA community (see, e.g., Christian 2020 for an introduction). There is also a non-zero chance of accidental “near misses” leading to even worse outcomes (see Tomasik 2019).
Risk-aversion: Long-term catastrophes due to malevolent actors seem to be one of those events that are pretty unlikely to occur although overwhelmingly bad if they do. Moreover, even among the set of “long-term catastrophes due to malevolent actors”, the expected value (loss) distribution is probably very fat-tailed (most of the expected disvalue likely lies in a few particularly bad scenarios). While I am personally sympathetic to risk-neutral expected value reasoning, or “fanaticism” (see Wilkinson 2020), a non-trivial number of smart people favor risk-aversion. Such risk-averse thinkers would likely want us to discount the importance of these scenarios.
Information and attention hazard: Working on the prevention of a priori relatively not-very-likely-by-default scenarios like the ones we’re considering (which may require increasing their saliency) might backfire and increase their likelihood rather than decrease it.^[14]
Beware the motte-and-bailey fallacy: As explained later, I intend to reconsider the kind of “bad traits” we should look out for. This might result in (some of) my arguments for a focus on malevolence becoming partially irrelevant to what we eventually end up working on.
Uncertainty regarding the value of the future: Malevolent(-ish) actors seem much more likely to cause existential catastrophes than astronomical suffering. Therefore, the less we’re confident in the hypothesis that human expansion is positive,^[15] the less reducing the influence of malevolent actors seems important. This argument applies to any work primarily aimed at reducing X-risks, though (and it might actually apply less strongly here, thanks to the convergence of S-risk and X-risk reduction when it comes to malevolence).

Overall, I am quite uncertain. Those counter-considerations seem potentially as strong as the motivations I list in favor of a focus on malevolent actors. ^[16]

My goal, in this piece, is merely to shed some (more)^[17] light on a potentially too-neglected cause area, not to persuade anyone they should prioritize it. However, I would be happy to see more people exploring further whether preventing malevolent influence over AGI is promising, and investigating some of the research questions listed below might help in that regard (besides the fact they would help better reduce risks from malevolence, assuming it’s promising).

Potential research projects

Breaking down the necessary conditions for some ill-intentioned actor to cause an AGI-related long-term catastrophe

Here’s a very preliminary and quick attempt at specifying/formalizing the typology suggested earlier. Can we formalize and specify the breakdown further, or come up with an alternative/better one? Does it capture all the possible paths? What scenarios could be missing?

Muehlhauser (2020), Ladish and Heim (2022), as well as DasSarma and Wiblin (2022) provide many examples of concerning information security breaches. Can we envision potential "AGI-adapted" versions of these?

What effects have malevolent people had (or could they have had) on previous transformative technologies? Why were those effects not stronger? What does this tell us about various potential scenarios of malevolent control over AGI?

What about the possibility of value lock-in with AGI (see Finnveden et al. 2022)? How would that change the picture?

Eventually, can we assign probabilities to a wide range of conditions and estimate the expected value loss for various scenarios?
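
To make that last question concrete, here is a minimal sketch of the arithmetic it would involve, assuming each scenario can be written as a chain of necessary conditions with a probability for each condition given the previous ones, and assuming the scenarios are mutually exclusive so their expected losses can be summed. The function names, scenario labels, and every number are hypothetical placeholders chosen only to show the calculation. They are not estimates.

```python
from math import prod


def scenario_probability(conditional_probs):
    """Probability that every necessary condition in a chain holds.

    Each entry is P(condition_i | all earlier conditions hold), so the
    chain rule gives the joint probability as their product.
    """
    for p in conditional_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return prod(conditional_probs)


def expected_value_loss(scenarios):
    """Sum of probability * loss over scenarios (assumed mutually exclusive)."""
    return sum(
        scenario_probability(s["conditions"]) * s["value_loss"]
        for s in scenarios.values()
    )


# HYPOTHETICAL placeholder numbers, not estimates.
# "value_loss" is the assumed fraction of future value lost if the scenario occurs.
example_scenarios = {
    "agi_aligned_with_actor_holding_xcps": {
        "conditions": [0.5, 0.2, 0.1],
        "value_loss": 1.0,
    },
    "quasi_xcps_turn_into_xcps": {
        "conditions": [0.5, 0.4, 0.05],
        "value_loss": 1.0,
    },
}

if __name__ == "__main__":
    for name, s in example_scenarios.items():
        print(name, scenario_probability(s["conditions"]))
    print("total expected loss:", expected_value_loss(example_scenarios))
```

The hard part is of course not the multiplication but choosing the conditions and defending the numbers. A structure like this mostly helps by making explicit which condition each disagreement is about.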

Redefining the set of actors/preferences we should worry about

In the context of their work to reduce long-term risk from ill-intentioned actors, David Althaus and Tobias Baumann (2020) write:

We focus on the Dark Tetrad traits (Paulhus, 2014) because they seem especially relevant and have been studied extensively by psychologists.

While this is somewhat compelling, this may not be enough to warrant such a restriction of our search area. Many of the actors we should be concerned about, for our work here, might have very low levels of such traits. And features such as spite and unforgivingness might also deserve attention (see Clifton et al. 2022). In the race-to-AGI context, traits like an unbendable sense of pride or a "no pain, no gain" attitude could prove just as dangerous as psychopathy, at least under some assumptions.

What are some non-Dark-Tetrad traits that could play a notable role in the scenarios we focus on here? How exactly? And how much? “How important are situational factors and ideologies compared to personality traits?” (Althaus and Baumann 2020.)

Is there actually a strong causal link between malevolent influence and AGI ending up with conflict-seeking or malevolent-like preferences? Could very different types of values also lead to that?

What are behaviors we can reasonably ask decision-makers to watch for? Can they be qualified as “malevolent” or “ill-intentioned”? What could be a better term? How can we rigorously identify such traits? What are potential backfire risks of doing this?

Should this redefinition of the set of concerning actors/preferences make us update our breakdown of scenarios?

What are the preferences that decision-makers are likely to have by default? Which ones are concerning? And who will be the most prominent decision-makers when it comes to AGI, actually? Should we focus on AI researchers in top AI labs? Politicians who might nationalize AI development or significantly influence AI progress in another way? Another group?

Answering these questions may help relevant decision-makers – in, e.g., AI labs and governments – detect or filter out potentially dangerous actors (by, e.g., determining what kind of background checks or personality tests to conduct). It may also inform our research agendas on preventing “malevolent” control over AGI and longtermist cause prioritization.

Steering clear of information/attention hazards

How could work on risks of malevolent influence over AGI backfire and incentivize or empower malevolence itself? What interventions are more likely to lead to such scenarios? What are concrete paths and how plausible are they? Can we devise a helpful breakdown?

When are the upsides most likely to compensate for the potential downsides? Can we easily avoid the most infohazardous interventions? Can we agree on a list of ideas/scenarios that, under no circumstances, should be mentioned publicly or in the presence of people not extremely familiar with them?

Might infohazards significantly reduce the expected value of working on those risks?

What can we learn from infohazards in the fields of bio (see, e.g., Crawford et al. 2019) and nuclear risks (see, e.g., Grace 2015)? What can we take from AI research publications and misuse (see Shevlane and Dafoe 2020)? What about Michael Aird’s collection of potentially relevant work?

Evaluating the promisingness of various governance interventions

The problem might be time-sensitive enough to justify investing resources in some potentially promising interventions we already have in mind. Plus, some people might simply not be a good fit for research and want to join/kickstart some projects motivated by what seems valuable with the evidence we currently have.

Therefore, although we probably need more research on the questions mentioned in the above sections before accurately assessing how promising different kinds of interventions are, some (preliminary) investigation wouldn’t hurt.

What kind of interventions can we consider?

First, a large part of the field of information security is of course apposite, here. Infosec intervention examples/types include (but are not limited to):

Preventing external hackers from gaining access to crucial AGI-related material.^[18]
- Selecting against people who might accept bribes in exchange for some access/information, within key organizations
- Incentivizing the relevant orgs to put more effort into patching vulnerabilities / improving security.
Preventing bad actors within organizations that are decisive in the development/deployment of potential AGI (although my impression – from reading infosec content and visiting the related Facebook group – is that the infosec community focuses almost exclusively on the above bullet point).
- Background checks.
- Limiting access to sensitive information to highly-trusted people.
- Reducing the potential impact of a single individual/group (e.g., crucial decisions/changes can’t be made by a single actor).

Secondly, there are interventions related to the broader field of AGI governance. Some examples/types:

Interventions that disempower the actors we don’t know/trust.
- Improving compute governance.
- Reducing China’s AI potential by facilitating the immigration of top AI talent to other countries?
- Helping some trusted actor win the AGI race? (see the “Partisan approach” in Mass 2022)
Spreading more intensely the meme that we should prevent AI misuse/malevolence (particularly within AI labs and relevant public institutions), although one must be cautious with potential attention hazards, here.
Slowing down / stopping AI progress (see Grace 2022), although this might^[19] backfire by making multipolar scenarios (and thereby AGI conflict) more likely.
Facilitating the detection of potentially uncooperative/suspicious AI development efforts.
Making it so there are more women within top AI labs and relevant public institutions, since, as stated earlier, they are considerably less likely to have malevolent traits (Althaus and Baumann 2020 make a similar suggestion).

I should note, for what it's worth, that Stefan Torges and Linh Chi Nguyen have considered more detailed and concrete infosec/AI-gov intervention ideas. They may be good contacts if you're interested in reducing risks of malevolent control over AGI (in addition to the authors of the work I list in the Appendix).

Thirdly and lastly, there are also non-AI related interventions on risk factors that might indirectly help, such as improving the detection of malevolent traits, preventing political instabilities that may lead to the rise of malevolent actors, or steering the (potential) development of genetic enhancement technologies towards selecting against malevolent traits such that we get fewer malevolent humans in the first place (Althaus and Baumann 2020; check their post for more ideas of the kind).

How valuable are such interventions in expectation? How impactful/tractable/neglected are they? Could they backfire? Do they pass our anti-infohazard filter? And finally, might we be too compelled by suspiciously convergent interventions?

Indeed, you’ll notice that some/many of those mentioned above as examples are already – to some extent – considered, studied, or somewhat carried out (although I have only a poor idea of which ones are actually carried out vs merely mentioned in the EA community). However, the motivations are generally pretty different from the ones highlighted in this doc, which should make us wary. It’d be surprisingly convenient if the work already pursued by the EA community for other reasons just happened to be best suited for addressing risks of malevolent control over AGI. Also, I worry people may naively generalize and assume any kind of work on, say, AI governance effectively reduces risks of ill-intentioned influence over AGI. This appears unlikely to be true.

Acknowledgment

Thanks to Jide Alaga and Falk Hemsing for their helpful comments on a draft. Thanks to Existential Risk Alliance for funding the work I put into this. All assumptions/claims/omissions are my own.

Appendix: Related work

(More or less in order of decreasing relevance. This Appendix is not meant as an endorsement of the claims made in these.)

David Althaus and Tobias Baumann (2020) Reducing long-term risks from malevolent actors
- On broader risks from malevolent actors, less focused on AGI.
Tobias Baumann (2022) Avoiding the Worst: How to Prevent a Moral Catastrophe (Just ctrl-f for “malevolent”.)
- On broader risks from malevolent actors, less focused on AGI.
Miles Brundage et al. (2018) The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation
- Not focused on AGI but quite relevant when it comes to how we should expect AI progress to seriously empower malevolent actors.
Jeffrey Ladish and Lennart Heim (2022) Information security considerations for AI and the long term future
- Not particularly focused on risks from malevolence but relevant.
Wiblin, Robert & Keiran Harris (2022) Nova DasSarma on why information security may be critical to the safe development of AI systems
- Not particularly focused on risks from malevolence but relevant.
Jesse Clifton et al. (2022) When would AGIs engage in conflict? (Section What if conflict isn’t costly by the agents’ lights?)
- On conflict-seeking preferences.
Lukas Finnveden et al. (2022) AGI and Lock-In; and the work they refer to in the section How likely is this?
- Not focused on risks from malevolence but relevant.
Holden Karnofsky (2022) Nearcast-based "deployment problem" analysis (The parts on risks of misuse.)
- Not focused on risks from malevolence but relevant.
Toby Shevlane and Allan Dafoe (2020) Does publishing AI research reduce misuse?
- Not focused on AGI, nor on risks from malevolence, but relevant.
Rose Hadshar (2022) How big are risks from non-state actors? Base rates for terrorist attacks
- Some data on one particular piece of the puzzle.
Markus Anderljung and Julian Hazell (2023) Protecting Society from AI Misuse: When are Restrictions on Capabilities Warranted?
- Haven't read it yet but probably relevant.
Various sources on infohazards
- Sometimes pretty relevant to risks from malevolent control over AGI.

^[1]
I explain later why I think the label “malevolent/ill-intentioned actors” might not be the best to capture what we should actually be worried about. Until I find an alternative framing that satisfies me, you can interpret “malevolent” in a looser sense than “having the wish to do evil” or “scoring high on the Dark Tetrad traits”. You can think of it as something like “showing traits that are systemically conducive to causing conflict and/or direct harm”.
^[2]
By XCP, I mean something like intrinsically valuing punishment/conflict/destruction/death/harm. I wouldn’t include things like valuing paperclips, although this is also (indirectly) conducive to existential catastrophes.
^[3]
However, I take it from Althaus and Baumann (2020) that there may be some significant overlap between work aimed at reducing malevolent control over AGI and work aimed at reducing malevolent control over narrow AI (see, e.g., Brundage et al. 2018) or even over things not related to AI (at least not directly), such that there might still be benefits from creating a risk-from-malevolence field/community.
^[4]
See, e.g., Althaus and Gloor 2019; Bostrom 2014; Yudkowsky 2017 for research on such scenarios, and see Althaus and Baumann 2020 on how malevolent actors are a risk factor for – or even a potential direct cause of – these.
^[5]
The World Economic Forum (2020) estimates that women represent only 26% of the “Data and AI” workforce (p.37) (and I expect this to be much more uneven among groups like hackers), and suggests a similar order of magnitude in terms of representation of women in decisive political roles (p.25).
^[6]
Perhaps “forever” within the limits of physics; i.e. potentially trillions of years (MacAskill 2022; Finnveden et al. 2022).
^[7]
While a non-locked-in paperclip-maximizer may not do this permanently.
^[8]
Mostly from checking BlueDot’s AGI governance curriculum, reading Clarke 2022 and Maas 2022, as well as from simply talking to other people interested in AI governance.
^[9]
This is meant to exclude the more indirect work of reducing broader existential risk factors such as political turbulence, inequality, and epistemic insecurity (see Dafoe 2020), although I don’t think it is irrelevant.
^[10]
Thanks to Siméon Campos for helping me realize how important it was to distinguish these two, in a discussion.
^[11]
Potential interventions aimed at disempowering potential malevolent actors seem much less studied (see the Appendix for examples of related work, though), and while interventions to disempower uncautious actors and those to disempower malevolent ones might luckily converge, it’d be suspiciously convenient if they did so often. Many of the paths to AI risks due to uncautiousness diverge from those due to malevolence, after all. This is, by the way, the reason why I avoid the term “AI misuse”, which seems to conflate those two different kinds of risks.
^[12]
The term “Hitler AGI” is of course overly restrictive. AGI ending up with Hitler-like values is an overly specific way in which it could have malevolent-ish preferences we should find dangerous. I could have found a more accurate term, but I was worried it would diminish how dramatically cool the sentence sounds.
^[13]
Althaus and Baumann (2020) make pretty much the exact same point.
^[14]
Harris (1994, p.19) discusses an interesting case study in which the report on the 1925 Geneva Disarmament Convention encouraged a Japanese leader to launch a bioweapons program.
^[15]
See, e.g., DiGiovanni 2022; Anthis 2022 for good arguments against this hypothesis and reasons to be highly uncertain.
^[16]
See this comment for my take on which ones are the most cruxy for me.
^[17]
See the Appendix for related work.
^[18]
As Anthony DiGiovanni pointed out in a private discussion, this is beneficial if – and only if – we think external hackers are more likely to be “the bad guys” relative to the legitimate decision makers around a potential AGI developer. While I’d guess this is/will most likely be the case, this is worth keeping in mind. There are some worlds in which that type of intervention backfires.
^[19]
It is not obvious. As Maxime Riché suggested in a private discussion, it's probably more about takeoff speed than about AI timelines.

David_Althaus (Oct 3, 2023)

Thanks for writing this post!

You write:

While this is somewhat compelling, this may not be enough to warrant such a restriction of our search area. Many of the actors we should be concerned about, for our work here, might have very low levels of such traits. And features such as spite and unforgivingness might also deserve attention (see Clifton et al. 2022).

I wanted to note that the term 'malevolence' wasn't meant to exclude traits such as spite or unforgivingness. See for example the introduction which explicitly mentions spite (emphasis mine):

This suggests the existence of a general factor of human malevolence^[2]: the Dark Factor of Personality (Moshagen et al., 2018)—[...] characterized by egoism, lack of empathy^[3] and guilt, Machiavellianism, moral disengagement, narcissism, psychopathy, sadism, and spitefulness.

So to be clear, I encourage others to explore other traits!

Though I'd keep in mind that there exist moderate to large correlations between most of these "bad" traits such that for most new traits we can come up with, there will exist substantial positive correlations with other dark traits we already considered. (In general, I found it helpful to view the various "bad" traits not as completely separate, orthogonal traits that have nothing to do with each other but rather as “[...] specific manifestations of a general, basic dispositional behavioral tendency [...] to maximize one’s individual utility— disregarding, accepting, or malevolently provoking disutility for others—, accompanied by beliefs that serve as justifications" (Moshagen et al., 2018).)

Given this, I'm probably more skeptical that there exist many actors who are, say, very spiteful but exhibit no other "dark" traits—but there are probably some!

That being said, I'm also wary of going too far in the direction of "whatever bad trait, it's all the same, who cares" and losing conceptual clarity and rigor. :)

Jim Buhler (Oct 4, 2023)

Interesting, makes sense! Thanks for the clarification and for your thoughts on this! :)

Linch (Jun 23, 2023)

Malevolent(-ish) actors seem much more likely to cause existential catastrophes than astronomical suffering. Therefore, the less we’re confident in the hypothesis that human expansion is positive,^[15] the less reducing the influence of malevolent actors seems important.

I agree in terms of absolute probabilities. But in terms of relative risks, naively I'd expect malevolent actors to be much more likely, in relative terms, to cause s-risks than benevolent or selfish actors. That is because you need to posit additional motivations or miscalculations for most non-malevolent actors to intentionally cause s-risks, whereas malevolence itself may provide sufficient motivation. Indeed, I've previously assumed this is why negative-leaning folks focused on studying malevolent actors in the first place.

Miranda_Zhang (Jun 22, 2023)

Thanks a lot for your work on this neglected topic!

You mention,

Those counter-considerations seem potentially as strong as the motivations I list in favor of a focus on malevolent actors.

Could you give more detail on which of the counter-considerations (and motivations) you consider strongest?

Jim Buhler (Jun 23, 2023)

Thanks Miranda! :)

I personally think the strongest argument for reducing malevolence is its relevance for s-risks (see section Robustness: Highly beneficial even if we fail at alignment), since I believe s-risks are much more neglected than they should be.

And the strongest counter-considerations for me would be:

- Uncertainty regarding the value of the future. I'm generally much more excited about making the future go better rather than "bigger" (reducing X-risk does the latter), so the more reducing malevolence makes the future bigger rather than better, the less certain I am it should be a priority. (Again, this applies to any kind of work that reduces X-risks, though.)
- Info / attention hazards. Perhaps the best way to avoid these malevolence scenarios is to ignore them and avoid making them more salient.

Interesting question you asked, thanks! I added a link to this comment in a footnote.
