My understanding is that most AI safety work that plausibly reduces some s-risks may also reduce extinction risks. I'm also thinking that some futures where we go extinct because of AI (especially a single AI taking over) wouldn't involve astronomical suffering, provided the AI has no (or sufficiently little) interest in consciousness or suffering, whether
- terminally,
- because consciousness or suffering is useful to some goal (e.g. it might simulate suffering incidentally or for the value of information), or
- because there are other agents it has to interact with, or whose preferences it should follow, who care about suffering (if we go extinct, such agents could all be gone, which would also rule out s-risks from conflicts).
I am interested in how people are weighing (or defeating) these considerations against the s-risk reduction they expect from (particular) AI safety work.
EDIT: Summarizing:
- AI safety work (including s-risk-focused work) also reduces extinction risk.
- Reducing extinction risk increases some s-risks, especially non-AGI-caused s-risks, but also possibly AGI-caused s-risks.
So AI safety work may increase s-risks on net, depending on how these effects trade off.
[Note that I have a conflict of interest (COI) here.]
Hmm, I guess I've been thinking that the choice is between (A) "the AI is trying to do what a human wants it to try to do" and (B) "the AI is trying to do something kinda weirdly and vaguely related to what a human wants it to try to do". I don't think (C) "the AI is trying to do something totally random" is really on the table as a likely option, even if the AGI safety/alignment community didn't exist at all.
That's because everybody wants the AI to do the thing they want it to do, not just people focused on long-term AGI risk. And I think there are really obvious techniques that anyone would immediately think to try, and these would be good enough to get us from (C) to (B), but not good enough to get us to (A).
[Warning: This claim is somewhat specific to a particular type of AGI architecture that I work on and consider most likely (see e.g. here). Other people have different types of AGI in mind and would disagree. In particular, in the "deceptive mesa-optimizer" failure mode (which relates to a different AGI architecture than mine), we would plausibly expect failures to involve random goals like "I want my field-of-view to be all white", even after reasonable effort to avoid that. So people working in other areas might well give different answers, I dunno.]
I agree that it's at least superficially plausible that (C) might be better than (B) from an s-risk perspective. But if (C) is off the table and the choice is between (A) and (B), I think (A) is preferable for both s-risks and x-risks.
Hi Steven,
I really appreciate the dartboard analogy! It helped me understand your view.