Three pillars for avoiding AGI catastrophe: Technical alignment, deployment decisions, and coordination

LintzA

13 min read · Aug 3, 2022

Comments 4

Zach Stein-Perlman

Great post!

A related framing I like involves two 'pillars,' reduce the alignment tax (similar to your pillar 1) and pay the alignment tax (similar to your pillars 2 & 3). (See Current Work in AI Alignment.)

We could also zoom out and add more necessary conditions for the future to go well. In particular, eventually achieving AGI (avoiding catastrophic conflict, misuse, accidents, and non-AI x-risks) and using AGI well (conditional on it being aligned) carve nature close to its joints, I think.

MaxRa

Thanks again for writing this up! Just a random thought. Have you considered what happens when you loosen this assumption?

Background assumption: Deploying unaligned AGI means doom. If humanity builds and deploys unaligned AGI, it will almost certainly kill us all. We won’t be saved by being able to stop the unaligned AGI, or by it happening to converge on values that make it want to let us live, or by anything else.

I'm thinking about scenarios where humanity is able to keep the first 1 to 2 generations of AGI under control (e.g. by restricting applications, by using sufficiently good interpretability to detect most deception, or thanks to very gradual capability increases).

Some spontaneous thoughts on which additional pillars might be interesting then:

- Coordination, but focussed more on labs sharing incidents, insights, tools
- Humanity's ability to detect and fight power-seeking agents
  - Generic state capacity
  - Generic international cooperation
  - Cybersecurity to prevent rogue agents getting access to resources and weapons, and to prevent debilitating cyberattacks
  - Surveillance capabilities
  - Robustness against bioweapons

SammyDMartin

Great post!

Check whether the model works with Paul Christiano-type assumptions about how AGI will go.

I had a similar thought reading through your article. My gut reaction is that your setup can be made to work as-is with a more gradual takeoff story, with more precedents, warning shots and general transformative effects of AI before we get to takeover capability, but it's a bit unnatural and some of the phrasing doesn't quite fit.

Background assumption: Deploying unaligned AGI means doom. If humanity builds and deploys unaligned AGI, it will almost certainly kill us all. We won’t be saved by being able to stop the unaligned AGI, or by it happening to converge on values that make it want to let us live, or by anything else.

Paul says rather that e.g.

The notion of an AI-enabled “pivotal act” seems misguided. Aligned AI systems can reduce the period of risk of an unaligned AI by advancing alignment research, convincingly demonstrating the risk posed by unaligned AI, and consuming the “free energy” that an unaligned AI might have used to grow explosively

Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter.

On his view (and this is somewhat similar to my view) the background assumption is more like, 'deploying your first critical try (i.e. an AGI that is capable of taking over) implies doom', which is saying that there is an eventual deadline where these issues need to be sorted out, but lots of transformation and interaction may happen first to buy time or raise the level of capability needed for takeover. So something like the following is needed:

Technical alignment research success by the time of the first critical try (possibly AI assisted)
Safety-conscious deployment decisions when we reach the critical point where dangerous AGI could take over (possibly assisted by e.g. convincing public demonstrations of misalignment)
Coordination between potential AI deployers by the critical try (possibly aided by e.g. warning shots)

On the Paul view, your three pillars would still have to be satisfied eventually, to reach a stable regime where unaligned AGI cannot pose a threat. But we would only need to get to those 100 points after a period in which less capable AGIs are running around either helping or hindering: motivating us to respond better, or causing damage that degrades our response, to an extent that depends on how we respond in the meantime and on exactly how long the AI takeoff period lasts.

Also, crucially, the actions of pre-AGI AI may push the point where the problems become critical to higher AI capability levels, as well as potentially assisting on each of the pillars directly, e.g. by making takeover harder in various ways. But Paul's view isn't that this is enough to postpone the need for a complete solution forever: he says, e.g., that the effects of pre-AGI AI 'could significantly (though not indefinitely) postpone the point when alignment difficulties could become fatal'.

This adds another element of uncertainty and complexity to all of the takeover/success stories that makes a lot of predictions more difficult.

Essentially, the time/level of AI capability at which we must reach 100 points to succeed also becomes a free variable in the model that can move up and down, and we also have to consider the shorter-term effects of transformative AI on each of the pillars as well.

LintzA

Thanks for this!

My thinking has moved in this direction as well somewhat since writing this. I'm working on a post which tells a story more or less following what you lay out above - in doc form here: https://docs.google.com/document/d/1msp5JXVHP9rge9C30TL87sau63c7rXqeKMI5OAkzpIA/edit#

I agree this danger level for capabilities could be an interesting addition to the model.

I do feel like the model remains useful in my thinking, so I might try a rewrite plus some extensions at some point (but probably not very soon).

Though we still can’t drop the ball entirely on the other pillars, we are just at a point where business as usual is probably fine.

For Nate Soares’ take on what the right problem is, see his post here: On how various plans miss the hard bits of the alignment challenge

While the ability to determine whether an alignment solution will actually work is a technical problem, the leadership of the organization deciding whether to deploy AGI will face the problem of figuring out who to listen to as to whether the system is aligned and/or doing the technical thinking themselves.

This set of actors could be kept small by (a) one actor deploying safe AGI in such a way that prevents other actors from deploying any AGI or from deploying unsafe AGI specifically and (b) there being a large (perceived) lead in AI development between one or a small set of actors and all other actors.

The expected behavior of actors who might end up with the option of deploying unsafe AGI matters because that may affect what other actors do, such as how much those other actors cut corners to get to AGI first.

There are apparently real, powerful people who hold this opinion, so it’s not as ridiculous as it sounds.

Note that calculating a probability of success is not as simple as multiplying 10% * 10% * 10% to get the correct odds. This is because success and failure on different pillars are correlated.
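To make the correlation point concrete, here is a minimal sketch. Only the 10% per-pillar figure comes from the example above; the "shared factor" model, the function names, and the parameter values are all assumptions chosen for illustration. It compares the naive product with a case where one common factor (say, how seriously the world takes AI risk) raises or lowers the odds on every pillar at once, while keeping each pillar's marginal chance at 10%.

```python
# Illustrative sketch. The 10% marginal comes from the footnote's example;
# the shared-factor model and all of its parameters are assumptions.

def joint_independent(p: float, pillars: int = 3) -> float:
    """P(all pillars succeed) if pillars succeed independently."""
    return p ** pillars


def joint_shared_factor(p_good: float, p_if_good: float,
                        p_if_bad: float, pillars: int = 3) -> float:
    """P(all pillars succeed) when one shared factor shifts every pillar.

    With probability p_good we are in a "favourable world" where each pillar
    succeeds with probability p_if_good; otherwise each succeeds with
    p_if_bad. Pillars are independent once the world is fixed.
    """
    return p_good * p_if_good ** pillars + (1 - p_good) * p_if_bad ** pillars


def marginal(p_good: float, p_if_good: float, p_if_bad: float) -> float:
    """Per-pillar success probability implied by the shared-factor model."""
    return p_good * p_if_good + (1 - p_good) * p_if_bad


# Hypothetical parameters chosen so each pillar still has a 10% marginal.
P_GOOD, P_IF_GOOD, P_IF_BAD = 0.1, 0.82, 0.02

independent = joint_independent(0.10)
correlated = joint_shared_factor(P_GOOD, P_IF_GOOD, P_IF_BAD)

print(f"per-pillar marginal: {marginal(P_GOOD, P_IF_GOOD, P_IF_BAD):.3f}")
print(f"independent pillars: {independent:.4f}")
print(f"correlated pillars:  {correlated:.4f}")
```

Under these made-up parameters the chance of succeeding on all three pillars is roughly 5.5% rather than 0.1%, even though each pillar on its own still looks like a 10% shot. The direction of the gap, not the specific number, is the point.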

This framing really helped me understand why MIRI folk tend to be extremely pessimistic about our odds of survival.