“The Race to the End of Humanity” – Structural Uncertainty Analysis in AI Risk Models

Froolow

25 min read · May 19, 2023

Comments 4

quinn

psyched about this post, want to jot down a quick nit-pick

S-Risk can occur any time from now until the end of the race, and represents – for example – a totalitarian government seizing control of the world to such an extent that human flourishing is permanently curtailed, but the development of AI is not (so S-Risk can occur before AI is Invented).

I don't think curtailing human flourishing constitutes s-risk; I don't think the suffering-focused community likes to draw equivalences between opportunity cost and more immediate or obvious disvalue. When the s-risk community talks about malevolent actors (see CLR), they're talking more about associations between totalitarianism and the willingness/ability to literally torture at scale, whereas other theorists (not in the suffering-focused community) may worry about a flavor of totalitarianism where everyone has a reasonable quality of life but just can't steer or exit.

One citation for the idea that opportunity costs (say, all progress except spacefaring continues) and literally everyone literally dying are morally similar is The Precipice. We can (polarizingly!) talk about "existential risk" not being the same thing as "extinction risk" but being equal to it under some value function. This is one way of thinking about totalitarianism in the longtermist community.

Political freedoms and the valence of day to day experience aren't necessarily the exact same thing.

Froolow

Thank you, really interesting comment which clarifies a confusion I had when writing the essay!

Vasco Grilo🔸

Great analysis, Froolow!

Analysts discussing AI Risk should describe the structure of their model much more explicitly. I observe there is a bit of a tendency on the forums to be cagey about one’s ‘actual’ model of AI Risk when presenting estimates of Catastrophe, and imply that the ‘actual’ model of AI Risk one has is significantly more complicated than could possibly be explained in the space of a single post (phrases like, “This is roughly my model” are a signifier of this).

Agreed!

Ben Stewart

I really liked this!

^

There are also a handful of incomplete attempts to move beyond the ‘weighted coinflip’ approach and into more sophisticated modelling structures. The most salient of these would probably be the MTAIR project, although I believe that MTAIR is now sadly defunct following the FTX collapse.

^

As a hypothesis, my guess is that a lot of AI Alignment researchers come into the movement via Effective Altruism and then find the AI Catastrophe arguments convincing once they have already bought into broader EA logic. By coincidence, high-profile conventional charitable interventions (meaning, for example, those funded by GiveWell) are extremely well suited to the deterministic decision tree, and so the ‘weighted coinflip’ model is an excellent default choice in almost all conventional charity analyses. If many AI Alignment researchers have been socialised into always using the deterministic decision tree structure, and that approach always works, it is probably unsurprising that this approach becomes like water to a fish – even if it is understood that other model structures exist, there are significantly more important things to be doing than messing around with the approach that is fast, logical and has always worked in the past.

^

Of course, nobody – least of all Carlsmith – believes that his model really describes reality any more than I believe my new model really describes reality. Instead, Carlsmith has made a simplifying assumption that treating AI Risk timelines as implicitly fixed will still allow him to investigate the sorts of dynamics he is interested in. He is probably right about this - Carlsmith asks and answers the question, "Is Power-Seeking AI an Existential Risk?" for which he is basically looking for a binary answer ("Yes it is" or "No it isn't"). Consequently it doesn't really matter if his model oversimplifies as long as it produces robust order-of-magnitude risk estimates. However, Open Philanthropy ask the question, "Conditional on AGI being developed by 2070, what is the probability that humanity will suffer an existential catastrophe due to loss of control over an AGI system?" for which a more complex model is needed, one which can produce results on a more granular level than just order-of-magnitude.

I will revisit this theme later in the essay, but an observation that modelling outsiders sometimes miss is that there is no such thing as a 'better' or 'worse' model provided you meet some minimum bar for competence in execution. Instead, there are models which are better suited for some questions and others which are better suited for other questions. So the paragraph spawning this footnote might look like a criticism of Carlsmith when instead it is really a compliment - Carlsmith has found a very simple and elegant way of modelling what he needs which (sadly) oversimplifies relative to what I need.

^

A few small notes on the base case that aren't interesting enough to include in the main body of the text:

- I have forced AI to be invented before 2070, in line with Open Philanthropy’s preferences, by capping the data I generated from my survey to dates prior to then and forcing the probability it happens to be 100%.

- I have included X-Risk and S-Risk drawn from literature sources - Ord (2020) and Caplan (2008) respectively. S-Risk estimates seem very low to me - it would be extremely interesting to investigate this further in the future because I think it might be another area where some careful structural thinking could reveal some interesting insights. But it isn't AI Risk so isn't relevant for now.

- Some people did not give a date for when they thought Alignment would occur, which I interpreted as meaning they did not think it would ever be possible (backed up by reading some of their comments on their entries). I added a bit of an ad hoc term to make Alignment literally impossible in those worlds, so the only possible outcome is some sort of Catastrophe. If I ran the survey again, I would explicitly ask about this.

- PS1, PS2 and PS3 are my attempt to include some early thoughts about ‘partial successes’ in Alignment which are not as strict as the definition Carlsmith uses where Alignment reduces the risk of OOC AI to zero. I would describe my success here as ‘mixed’, so I have turned them all off for the base case.
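The first and third notes above describe two data-preparation steps, which can be sketched as follows. This is a minimal illustration only: the `responses` records, the function names and the dictionary keys are hypothetical and are not the survey data or the model's actual code. It assumes a missing Alignment date is encoded as `None` and is treated as "never" by mapping it to infinity.

```python
import math

HORIZON = 2070  # conditioning year taken from the first note above


def cap_ai_dates(ai_years, horizon=HORIZON):
    """Keep only answers that place the invention of AI before the horizon.

    Sampling only from the kept answers makes "AI is invented before the
    horizon" certain, which is one way to force that probability to 100%.
    """
    kept = [year for year in ai_years if year < horizon]
    if not kept:
        raise ValueError("no answers fall before the horizon")
    return kept


def alignment_year(answer):
    """Treat a missing Alignment date as 'never' (assumption: None means missing)."""
    return math.inf if answer is None else float(answer)


# Hypothetical responses for illustration, NOT the survey data.
responses = [
    {"ai": 2035, "alignment": 2050},
    {"ai": 2045, "alignment": None},  # no date given: Alignment never happens
    {"ai": 2090, "alignment": 2060},  # AI date dropped by the 2070 cap
    {"ai": 2060, "alignment": 2040},
]

ai_pool = cap_ai_dates([r["ai"] for r in responses])
align_pool = [alignment_year(r["alignment"]) for r in responses]

# Share of answers in which Alignment is impossible, so that only some
# form of Catastrophe can end the race.
share_never = sum(math.isinf(a) for a in align_pool) / len(align_pool)
```

Encoding "never" as infinity means any later comparison of the form "Alignment happens before event X" is automatically false in those worlds, without a separate flag.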

^

I had enough time to code up some exploratory work on the idea that 'Alignment' could be a moving target depending on how far advanced we were down the pathway towards AI Catastrophe. In the model below, ‘Alignment’ occurs if we find a way to make AI completely safe, as in the main model, or if we develop a TEST which can show us whether an AI is aligned before we expose any AI to high-risk inputs. The intuition here is that we might be able to convince governments to act to ban unsafe AIs if we haven't already handed power to AIs, but after handing power to AIs we need some kind of 'weapon' to defeat them. The graph below shows the result of this analysis:

Unsurprisingly, this results in a significant probability mass redistribution away from AI Risk towards Alignment, as scenarios which were once the absolute worst-case for OOC AI Risk have the possibility of being defused even before AI is invented.
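One reading of the two-route rule in this footnote can be sketched as a comparison of event times within a single simulated world. The function name, the argument names and the strict-inequality tie-breaking are assumptions made for illustration; the footnote does not specify how the exploratory model is actually implemented.

```python
import math


def race_outcome(full_alignment, catastrophe,
                 test_ready=math.inf, high_risk_exposure=math.inf):
    """Classify one simulated world as 'Alignment' or 'Catastrophe'.

    All arguments are years; math.inf means the event never happens.
    """
    # Route 1 (as in the main model): AI is made completely safe
    # before a Catastrophe occurs.
    solved_in_time = full_alignment < catastrophe

    # Route 2 (exploratory): a TEST for alignment exists before any AI
    # is exposed to high-risk inputs, and before a Catastrophe occurs.
    tested_in_time = (test_ready < high_risk_exposure
                      and test_ready < catastrophe)

    return "Alignment" if (solved_in_time or tested_in_time) else "Catastrophe"


# Hypothetical worlds for illustration only.
main_model_only = race_outcome(full_alignment=2080, catastrophe=2065)
test_before_exposure = race_outcome(full_alignment=2080, catastrophe=2065,
                                    test_ready=2040, high_risk_exposure=2050)
test_too_late = race_outcome(full_alignment=2080, catastrophe=2065,
                             test_ready=2055, high_risk_exposure=2050)
```

With the default arguments the second route is switched off, so the function reduces to the single-route rule. Adding the second route can only move worlds from Catastrophe to Alignment, which matches the direction of the redistribution described above.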

I got a bit anxious at the thought of presenting these alongside the main results because I think I've done a fairly good job of only talking about things I understand properly in the main body of the text, and I don't understand the nuts-and-bolts elements of Alignment at all well. But I thought people might be interested to see this even if it isn't perfect, if only to prove my point that there’s probably another halving of risk available to the first team who can present a plausible account of how a ‘multi-step’ model of alignment could work.

^

Part of the art of modelling is knowing when you have made the model complex enough for whatever purpose you intend to use it for. Any model which fixed the problems with my model would inevitably have problems that needed to be fixed by another model and so on. Although this conveyor belt keeps modellers like me (somewhat) gainfully employed, it can be a trap for organisations that just want to use the best possible model to make decisions, because ‘best possible’ is a moving target. As discussed above, the structure needs to match the needs of the decision problem and there isn't a generic solution for how to find the best structure for a particular job.

Summary

1. Introduction

2. “The Race to the End of Humanity”

2.1 The assumption of 'time independence'

2.2 Time dependency in a Carlsmith-like model

2.3 Results

Main results

Sensitivity analysis of time

Sensitivity analysis of original inputs

3. Conclusions