
Epistemic status: Initial thoughts, uncertain. We are two junior researchers who have each worked around four hours a day for five days on this project, including writing. We have only read one other piece in relation to this work: “Some governance research ideas to prevent malevolent control over AGI and why this might matter a hell of a lot”, Jim Buhler, EA Forum, 2023-05-23. We are looking for feedback on our thinking and our work.
 

Key Takeaways

  • Malevolent actor scenarios are neglected by current governance mechanisms
  • An actor’s dangerousness should be assessed by the dangerousness of their preferences, their willingness to act on those preferences, and the resources available to them

Executive summary

It is important to understand scenarios in which actors with dangerous intentions gain access to highly capable AI technologies. This post explores exactly that: ways to assess which actors and which preferences are more dangerous than others. We end by sharing a few concrete scenarios involving malevolent actors that are currently overlooked by existing governance proposals.

In this post, we 

  1. describe factors to understand the dangerousness of actors:[link]
    1. Dangerous preferences
    2. Willingness to act on preferences
    3. Resources
  2. categorise the dangerousness of preferences [link]
    1. Building on earlier work, we describe “risk-conducive preferences” (RCPs) – preferences that, if realised, increase the likelihood of bad outcomes
    2. These preferences are on a spectrum, ranging from those that would directly curb humanity’s potential to those that only indirectly increase the chance of x-risk 
  3. highlight some neglected malevolent actor scenarios [link]
    1. and point to neglected governance mechanisms that could address them

Future research could draw on other fields to build a fuller understanding of the actors in this space. This would allow for a more comprehensive list of potentially dangerous actors and for evaluating their likelihood of acquiring and deploying AGI, which in turn would clarify how much we need governance mechanisms specifically targeting risk from malevolent actors and allow those mechanisms to be weighed against others.
 

Background

Malevolent actor x-risk scenarios might be an understudied field. By this field, we mean the attempt to understand and prevent scenarios in which actors with preferences that make existential catastrophes more likely (such as a desire to start a war) gain access to dangerously capable AI technologies. In “Some governance research ideas to prevent malevolent control over AGI and why this might matter a hell of a lot”, Jim Buhler argues that this kind of governance work “does not necessarily converge with usual AGI governance work” and that “expected value loss due to malevolence, specifically, might be large enough to constitute an area of priority in its own right for longtermists.” We have a similar intuition. 

Buhler invites further work on redefining the set of actors/preferences we should worry about. This post is a direct response to that invitation.

What kind of scenarios involving malevolent actors are we talking about here?

This area is mostly concerned with misuse risk scenarios, where a malevolent actor[1] is able to direct an AGI towards their own aims[2]. Here are some examples:

  • China independently develops AGI and uses it to permanently lock-in their ideological beliefs, prevent outside influence and fend off dissidents.
  • Using espionage, blackmailing and corruption, the Russian government gains access to key parts of an AGI architecture. Fearing defeat in a war, they develop and fine-tune it to launch an aggressive hail-mary attack on the world.
  • A doomsday cult hacks an AI lab to steal an AGI and uses it to destroy the world.
  • A member of a radical nationalist group in Europe successfully infiltrates a top AI lab, gains access to an AGI still in development, and uses it to launch an attack on a Middle Eastern state, increasing the chance of starting a world war.

Note here the key difference between these malevolent-actor-type scenarios and other x-risk scenarios: in these cases, there isn’t a misalignment between the actor’s intentions and the AI’s actions. The AI does what the actor intended. And yet the result is an existential catastrophe, or a significant step towards one. 

To break these examples down, we can see four underlying components necessary in any malevolent-actor scenario: actor, preference (RCP), method of accessing AGI, and resulting outcome.
 

| Example | Actor | Preference | Likely method of accessing AGI[3] | Outcome |
|---|---|---|---|---|
| 1 | Chinese state government | Socialism with Chinese characteristics | Builds it | Value lock-in (nationally) |
| 2 | Russia | Win a war | Espionage, blackmailing and corruption | Global destruction |
| 3 | Doomsday cult | Extinction | Hacking | Extinction |
| 4 | Nationalist terrorist group | Destruction of a nation | Infiltration | Large-scale destruction, increased global destabilisation |

Many current governance proposals help with the category of scenarios we’re talking about, but different proposals help to different degrees. So when prioritising between proposals, and when designing broader governance institutions, the likelihood of different scenarios in this category should be taken into account.

 

Factors of malevolent actors

If we want to propose governance mechanisms, we need to know which methods of AGI acquisition are most likely to take place. To know that, we need to know which actors are most “dangerous”. To assess this “dangerousness”, we propose focusing on three underlying factors:

  • Preferences: What do these actors want, and, if they got what they wanted, how much closer would that take us to catastrophic outcomes?
  • Willingness to act on preferences: How willing are they to take action to achieve those preferences, especially using extreme methods (e.g. damaging property, harming people, destabilising society)?
  • Resources: What resources do they have at their disposal to develop or deploy AI models, such as talent, money or AI components? 

Let's start with preferences, specifically focusing on what we call "risk-conducive preferences" – these are preferences that, if realised, increase the likelihood of unfavourable outcomes. We'll dive into this concept in more detail later on.

But preferences alone don't paint the full picture. Many individuals hold extreme views but have no intention of acting on them. On the other hand, groups with similar views who also have a history of causing destruction, like terrorist groups or doomsday cults, might be of higher concern. This means “willingness to act disruptively” is another critical factor to consider.

The third crucial factor is access to resources. Of the three factors, this is the one that can change most quickly. For instance, a small extremist group might receive financial backing from a billionaire or a nation-state, or it might gain access to a powerful AGI overnight through a successful infiltration operation. Note also that Moore's law and gains in algorithmic efficiency keep lowering the resources required. This means we can't focus solely on malicious actors who already have substantial financial resources.

So, if we map out actors and rank them by these three factors, we get a preliminary assessment of their “dangerousness”. This, in turn, gives us a clearer picture of which scenarios are more likely to unfold, and which governance mechanisms are most relevant to prevent these scenarios from happening.
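To make this ranking idea concrete, here is a minimal sketch in Python of how such a mapping might be scored. The actor names, the 0-1 scores, and the multiplicative way of combining the three factors are illustrative assumptions for the sketch, not estimates or claims from this post.

```python
from dataclasses import dataclass

@dataclass
class ActorProfile:
    """Toy profile of an actor along the three factors described above."""
    name: str
    preference_risk: float  # how risk-conducive the actor's preferences are (0 to 1)
    willingness: float      # willingness to act on those preferences (0 to 1)
    resources: float        # resources available to acquire or deploy AGI (0 to 1)

    def dangerousness(self) -> float:
        # Multiplicative aggregation: an actor only ranks as dangerous if it
        # scores non-trivially on all three factors at once (one possible choice).
        return self.preference_risk * self.willingness * self.resources

# Hypothetical scores, chosen purely for illustration.
actors = [
    ActorProfile("doomsday cult", preference_risk=1.0, willingness=0.9, resources=0.1),
    ActorProfile("warring nation-state", preference_risk=0.6, willingness=0.5, resources=0.9),
    ActorProfile("radicalised lab insider", preference_risk=0.7, willingness=0.3, resources=0.6),
]

for actor in sorted(actors, key=lambda a: a.dangerousness(), reverse=True):
    print(f"{actor.name}: dangerousness {actor.dangerousness():.2f}")
```

A multiplicative rule captures the intuition that an actor is only highly dangerous if all three factors are present at once; in practice, both the scores and the aggregation rule would themselves be research questions.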

 

Risk-conducive preferences (RCPs)

Here we go into more detail about one factor of the dangerousness of actors: their preferences.

We’re concerned with actors who would prefer some kind of world state that increases the chance of x-risk. They could achieve this state by using or creating an AGI that shares some of their preferences[4][5]. This could be a preference for global extinction, a preference for the destruction of some group, or perhaps an overly strong preference to win elections. 

Realising some preferences would directly cause x-risk or s-risk events (e.g. a desire to end the world); realising others merely increases the chance of such events. The latter lie on a spectrum of probabilities: one can imagine preferences that significantly increase the chance of x-risk (e.g. a desire to start a global war), while others only moderately increase it (e.g. an overly strong preference to win elections through manipulation and societal control). We call all of these preferences “risk-conducive preferences”, or “RCPs” for short. 

For example, a “desire to reduce crime” might indirectly cause value lock-in if the actor takes extreme measures to monitor and control the population. However, it is less risk-conducive than a “desire to return to a hunter-gatherer society”, which leads more directly to societal destabilisation, leaving humanity more vulnerable to extinction and at risk of being unable to rebuild.

[Figure: spectrum of the likelihood that a preference, if realised, causes a severely bad outcome. Example at the left: extinction; around the middle: widespread destruction; towards the right: widespread destabilisation.]

RCPs are not only concerned with extinction, but with all outcomes that “permanently or drastically curb humanity’s potential”. This might include value lock-in, radical and irreversible destabilisation of society, or s-risks. Going forwards, we’ll use “x-risk” to mean any of these outcomes.

We can also put RCPs into three categories, depending on how directly the preference, if realised, would lead to x-risk:

  • Order 0 RCPs: explicit preferences that, when fully realised, would directly cause x-risk
  • Order 1 RCPs: explicit preferences that, when fully realised, would significantly increase the chance of x-risk
  • Order 2+ RCPs: preferences that would indirectly increase the chance of x-risk

Examples

Order 0 RCPs: explicit preferences that, when fully realised, would directly cause x-risk.

| RCP | Reason | Examples |
|---|---|---|
| Extinction: preference for humanity to go extinct | To stop humanity’s destruction of other species and the environment (because, if humans do not go extinct quickly themselves, they will make Earth uninhabitable and all life will go extinct; because they do not believe humans are able, all together and in perpetuity, to live in environmentally sustainable ways; because they do not see human life as any more valuable than the lives of other species); because they are in a doomsday cult; in order to decrease and prevent suffering | |
| Value lock-in: preference for certain values to be upheld indefinitely | Because the values held are assumed to be perfect | Religious extremists; some authoritarian governments |
| Irrecoverable civilisation collapse[6]: preference for the removal of civilisation in a way that is irrecoverable (even if preferences change in the future) | Because civilisation causes social and environmental problems | (No well-known examples that don’t include value lock-in) |



 

Order 1 RCPs: explicit preferences that, when fully realised, would be direct x-risk factors (would make it more likely for Order 0 preferences to be realised).
 

| RCP | Conducive to | Reasons | Examples |
|---|---|---|---|
| Preference to radically reduce the human population | Irrecoverable civilisation collapse | Worries about environmental effects; worries about overpopulation | Anti-capitalist terrorist groups |
| Preference to structure civilisation in a certain way | Value lock-in | Religious dogma | Religious extremists; some governments |
| Preferences for conflict (CSPs) | Extinction by global war | Anger | War-mongering nation-states; other terrorist groups |
| Preferences for the removal of civilisation | Irrecoverable civilisation collapse | Worries about the environment; belief that civilisation causes more social issues (e.g. anarcho-primitivism) | Some forms of anti-globalization movements, advocating self-sufficiency at a local community level; anti-capitalist terrorist groups; groups that want to return to tribal or agricultural societies; some forms of extreme religious fundamentalism; some religious traditions with strong connections to nature and Earth |


 

Order 2+ RCPs: Preferences that, if realised, contribute to x-risks less directly than Order 1 preferences.

Examples

Conflict: Destroying a nation / winning a war – conducive to global conflict, which is conducive to extinction by global war

Society: winning elections / reducing crime – conducive to changing societal structure, which could be conducive to value lock-in

Growth: increasing profit – conducive to runaway growth of an organisation at the expense of other values, which could increase the chance of going out with a whimper.

Or any other preference that might make e.g. AI or biological x-risks more likely.


 

Neglected malevolent actor scenarios

While considering the characteristics of potential malevolent actors, we came across some dangerous scenarios that deserve attention but seem overlooked. 

Discussions about malevolent actors typically focus on either large nation-states that are pursuing global dominance or individuals possessing harmful psychological traits. To manage these threats, governance proposals include compute restrictions, cybersecurity measures, and personality assessments. 

But here are a few worrying scenarios not affected by those governance mechanisms, listed very roughly in order of likelihood (based solely on intuition):

  1. Open source distribution. With the goal of sharing the benefits of future AI technology across society, a group without any malevolent intentions could release open source models online, allowing malevolent actors to avoid conventional security measures.
  2. Altruistic leaks. An employee within a prominent AI lab, motivated by concerns over power concentration in corporate hands, might decide to leak a trained model online without any malicious intent. 
  3. Infiltration. A radicalised individual or an agent of a foreign state could infiltrate an AI lab, gain access to a model and share it with their associates.
  4. Radicalisation. A staff member at a leading AI lab could become radicalised by certain philosophical ideas and choose to share a model with like-minded individuals, again presenting novel governance challenges.

Although more research is needed, we see at least two directions for developing governance proposals to address these challenges:

  1. Security clearances. Similar to the vetting of high-ranking government officials, national security agencies could implement comprehensive background checks within AI labs to mitigate risks related to infiltration and espionage.
  2. Thresholds for open sourcing. Investigating mechanisms to restrict the open sourcing of AI models that surpass a certain threshold of power could help prevent dangerous releases.

At this stage of research, we are not offering probability estimates for the likelihood of each scenario. Nonetheless, these examples underscore the need for a more comprehensive approach to AGI governance that encompasses a broader range of potential threats. It seems important to further explore and assess these scenarios, and to offer governance solutions for both governments and AI research labs. 

Conclusion

This post has explored ways to evaluate the dangerousness of actors and their preferences, and highlighted some neglected scenarios featuring malevolent actors.

To assess and identify dangerousness more effectively, we unpacked the underlying factors contributing to the dangerousness of malevolent actors: the proximity of their preferences to wishing for catastrophic outcomes, their willingness to act on those preferences, and their resources. 

We introduced the concept of “risk-conducive preferences” to encompass a wider set of dangerous preferences, highlighting that these preferences lie along a continuum.

Finally, we provided scenarios that might currently be neglected by existing governance mechanisms.

The following figures (Figure 1 and Figure 2) show how our work fits together with governance proposals. 

Figure 1. Breakdown of actor preferences and plausible pathways to bad outcomes.


 

Figure 2. Possible governance mechanisms (dark blue background) to address pathways to bad outcomes outlined in Figure 1.

We encourage others to build upon this work. Here are a few avenues for future exploration:

  1. Researchers could put together a more extensive list of potentially dangerous actors, and evaluate their likelihood of acquiring and deploying AGI. This helps with knowing which governance mechanisms to implement. 
  2. A further breakdown of the underlying psychological conditions of malevolent actors might improve the framework for assessing “dangerousness”. Researchers with a background in psychology might want to dig deeper into the causes for factors such as “willingness to cause disruption” and explore potential targets for governance interventions, such as help for schizophrenic patients or addressing social isolation. This approach might interface well with the existing literature on radicalization.

As always, we appreciate all of the feedback we have received so far and remain very open to further comments.


 

Further Reading:

“Some governance research ideas to prevent malevolent control over AGI and why this might matter a hell of a lot”, Jim Buhler, EA Forum, 2023-05-23

We have not (yet) read the following, but they are closely related:


 

Acknowledgements

We would like to thank Jim Buhler, Saulius Šimčikas, Moritz von Knebel, Justin Shovelain and Arran McCutcheon for helpful comments on a draft. All assumptions/claims/omissions are our own.

  1. ^

    By malevolent actor in this context, we mean someone with preferences conducive to risk, not that they actually wish to do evil.

  2. ^

     Jim Buhler gives some reasons for focussing on this area.

  3. ^

     For simplicity, we just give one, but in reality, actors could use a portfolio of approaches to try to gain access.

  4. ^

    Jim Buhler frames the issue around preventing the existence of some AGI that has one of these “x-risk conducive preferences”, like intrinsically valuing punishment/destruction/death, and gives some ways for such an AGI to come about. But since these ways initially require an actor to have some risk-conducive preference, we focus on actors.

  5. ^

    If the AGI is like an oracle (like a generally intelligent simulator, like a multi-modal GPT-6), then the agent could use it to achieve its preferences, and the extent to which the agent has those preferences is only relevant for what the simulator is likely to say no to. But if the AGI is very agentic (like Auto-GPT or some RL model) then it will have those preferences.

  6. ^

    As described in What We Owe The Future, Chapter 6: Collapse (Will MacAskill 2022)

Comments (4)



This was a great experience and I learnt a lot:

  • Choosing a research topic is a whole research project in itself
  • The writing phase takes much longer than the earlier phase of working out what to write (more than twice as long)
  • Co-working on research with one other person is great. It's very motivating and you learn a lot from each other. You have faster feedback loops and so can make a better outcome sooner.
  • This kind of research-writing-co-working is very mentally tiring (at first I couldn't do more than 4hrs per day)

We really wanted to complete the project in a tight timeframe. I actually posted this 2 weeks after we finished because it was the first chance I had. 

Some reflections: 

I think that the amount of time we set aside was too short for us, and we could still have made worthwhile improvements with more time to reflect, such as:

  • Choosing an easier topic for our first research project
  • Doing more further reading
  • I think the section on Risk-conducive preferences (RCPs) is not important enough to warrant the amount of words it is taking up
  • Many sentences could be re-written to improve the wording, and I don't think 'Factors of malevolent actors' is a very good heading.

(I'll come back and reply to this comment with more of my own reflections if I think of more and get more time in the next day or two) (edit: formatting)

Interesting, thanks for sharing your thoughts on the process and stuff! (And happy to see the post published!) :)

Nice post!

In order to decrease and prevent suffering

Not sure whether you intended to give examples of this in the table.

The following figures (Figure 1 and Figure 2) show how our work fits together with governance proposals. 

This is a nitpick, but I would find the diagrams easier to read if the "bad outcome" was at the bottom, such that the direction of causality was from top to bottom.

Nice post!

Thank you! Since it's (my) first post, it's helpful to have some positive encouragement.

  1. We actually intended not to give examples of those.
  2. That's useful feedback on the diagram, thanks.