AI Safety in a Vulnerable World: Requesting Feedback on Preliminary Thoughts

Jordan Arel

Cross-Posted to LessWrong

I would like feedback on a hypothesis that has been percolating in my brain for the past few months.

Epistemic Status: I have studied AI Safety for less than 100 hours, but have been thinking about x-risk for several years.

I am concerned that even in some cases where advanced AI is aligned, the environment in which it exists may still make it unsafe.

If I am not mistaken, “AI Alignment” seems to mean getting AI to do what we want without harmful side effects, but “AI Safety” seems to imply keeping AI from harming or destroying humanity.

These two may come apart in a “Vulnerable World Scenario” in which some future technologies destroy civilization by default. This may be the case because certain technologies have an intrinsic offense bias, meaning if even a small number of humans want to kill everyone, or competing groups are willing to kill each other, those attacking will succeed and those defending will fail by default.

Offense Bias

If there is an offense bias in advanced AI, or in any other technology advanced AI leads to, it is not clear that aligning AI, i.e. “getting AI to do what we want,” would keep us safe. Suppose multiple world powers have advanced AI and each orders its AI to destroy its enemies and protect its own citizens. If there is an offense bias, so that it is easier to attack than to defend, each AI may succeed in destroying the enemy but fail to defend its own citizens, meaning everyone dies.

Due to entropy (the universe’s in-built destruction bias^[1]), the fragility of humans, and the incredible flexibility of advanced AI, it seems quite plausible, and I would even guess more likely than not, that advanced AI will constitute or enable an offense bias.

This problem is compounded when we consider the many powerful advanced technologies AI may accelerate in the near future, such as bio-technology, 3D printing, nanotechnology, advanced robotics, brain-machine interfaces, advanced internet of things applications, advanced wearable/cyborg technologies, advanced computer viruses, black swan (unknown unknown) technologies, etc.

Due to advanced AI processes like PASTA (Process for Automating Scientific and Technological Advancement), powerful advanced technologies could arrive and have transformative effects quite soon. Any one of them could have an offense bias, as could any combination of them, including combinations with already existing technologies such as nuclear weapons and drones. This may result in a combinatorial explosion of possible offensive synergies as the number of technologies increases.
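As a rough illustration of how quickly such combinations could grow, here is a minimal sketch. It assumes, purely for illustration, that any group of two or more technologies counts as one potential synergy; the function name `count_combinations` is hypothetical, and the counts say nothing about how many combinations would actually be dangerous.

```python
from math import comb


def count_combinations(n: int) -> dict:
    """Count groupings of n distinct technologies (illustrative assumption).

    pairs:   two-technology combinations, C(n, 2)
    subsets: all combinations of two or more technologies, 2**n - n - 1
             (every subset, minus the n singletons and the empty set)
    """
    return {"pairs": comb(n, 2), "subsets": 2**n - n - 1}


# Compare growth for a few hypothetical technology counts.
for n in (5, 10, 20):
    print(n, count_combinations(n))
```

Under this assumption, pairs grow quadratically while the number of multi-technology subsets grows exponentially, which is the sense in which the space of possible synergies could "explode."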

Perhaps something similar could be said of defensive technologies, though I am uncertain how the balance would play out. It seems probable to me that the more advanced technologies we expect there to be, and the more powerful we expect them to be, the more concerned we should be about this possibility.

It seems quite possible that many of the protective factors humans have historically possessed (social interdependence, fragility/mortality, the absence of overwhelmingly powerful actors, etc.) will break down, and so it should not be too surprising if one or more unprecedented offense biases occur.

I will next address a concept I will call “Human Alignment” which may be a way of framing solutions to a vulnerable world scenario.

Human Alignment

By “human alignment,” I mean a state of humanity in which most or all of humanity systematically cooperates to achieve positive-sum outcomes for everyone (or at a minimum is prevented from pursuing negative-sum outcomes), in a way that is perpetually sustainable into the future. While exceedingly difficult, saving a vulnerable world from existential catastrophe may necessitate this.

Bostrom points out that if humanity retains a “wide and recognizably human distribution of motives” resulting in a multipolar world order and an “apocalyptic residual,” then even a single apocalyptic actor with access to certain advanced technology may spell the end of civilization. As mentioned, however, actors need not be apocalyptic; it may be enough that they are willing to risk destroying each other to defend themselves, or in pursuit of their own interests.

In “The Vulnerable World Hypothesis” (VWH), one possible solution Bostrom proposes is universal surveillance of everyone at all times to prevent apocalyptic behavior. Many find this solution unpalatable, though perhaps better than extinction. This would result in humanity being (at least minimally) aligned by force.

Another possible solution is to sustainably eliminate all malicious and apocalyptic intentions, or in other words to universally create enough moral progress that no one desires to kill anyone else or is willing to risk destroying humanity. Bostrom seems to dismiss this solution as intractable. I think, however, that by using systemic interventions which incorporate mildly to moderately advanced AI to re-shape the moral fitness landscape toward desirable traits, among other interventions, this may be more tractable than it seems at first glance. I wrote the rough draft of a book on such solutions (for x-risk / vulnerable world in general, not AI x-risk specifically) before formally discovering EA, longtermism, and the VWH. I am now trying to understand the AI x-risk landscape better to see if a vulnerable world scenario is likely given the development of advanced AI.

Conclusion

My main question is whether a vulnerable-world-induced AI x-risk scenario seems plausible or likely.

I think my main crux is whether AI is likely to be multipolar, meaning that multiple agents have access to advanced AI.

Another factor is whether advanced AI is likely to have uneven abilities such that the ability to commit genocide or to create new dangerous technologies is developed before the ability to defend humans, predict what technologies will be dangerous, or align humanity.

I am also very curious if this is something others have talked about, and if so, I would appreciate references to these discussions.

Finally, I would greatly appreciate any thoughts on my reasoning in general, what I may be missing, and what would be promising directions for further research for me.

Thank you in advance for your feedback!

^[1] By which I mean it is easier to break something than to create or fix it. This is not exactly the same as offense bias, but it is closely related.

Comments

Aaron_Scher (Dec 7 2022)

I am a bit confused by the key question / claim. It seems to be some variant of "Powerful AI may allow the development of technology which could be used to destroy the world. While the AI Alignment problem is about getting advanced AIs to do what their human operator wants, this could still lead to an existential catastrophe if we live in such a vulnerable world where unilateral actors can deploy destructive technology. Thus actual safety looks like not just having Aligned AGI, but also ensuring that the world doesn't get destroyed by bad or careless or unilateral actors."

If this is the claim, seems about right, and has been discussed a lot both online and offline. Powerful AI itself might be that destructive technology, hence discussion of Deployment and Coordination problems. See here. Some other relevant resources: here, here.

As you asked for feedback,

If I am not mistaken, “AI Alignment” seems to mean getting AI to do what we want without harmful side effects, but “AI Safety” seems to imply keeping AI from harming or destroying humanity.

I would say the distinction isn't so clear and the semantics don't seem too important; what matters is that those in the fields of AI Alignment and AI Safety broadly are aimed at getting good outcomes for humanity.

I guess your claim might actually be "Powerful AI may be a precipitating factor for other risks as it allows the development of many other, potentially unsafe, technologies." This seems technically true but is unlikely to be how the world goes. Mainly I expect one of two outcomes:

1. humanity is disempowered or dead from misaligned AI;
2. we successfully align AGIs and solve the deployment problem, which results in a world where no single actor can cause an existential catastrophe.

The reason I think that no single actor can cause existential catastrophe in the second outcome is that this seems to be a likely precursor to avoiding dying to misaligned AGI. I would recommend the above links for understanding this intuition. I may be wrong here, because it may be that the way we avoid misaligned AGI is by democratizing aligned-AGI-creation tech (all the open source libraries include Alignment properties, including preventing misuse), while the filters for preventing misuse are not sufficient to stop people from developing civilization-destroying tech in the future. (But probably, given such a filter, we would already be dead from misaligned AGI that somebody made by stress-testing the filters.)

Sorry for scattered thoughts

Jordan Arel (Dec 9 2022)

Thank you so much for this reply! I’m glad to know there is already some work on this; it makes my job a lot easier. I will definitely look into the articles you mentioned and perhaps just study AI risk / AI safety a lot more in general to get a better understanding of how people think about this. It sounds like what people call “deployment” may be very relevant, so I will especially look into this.

David Johnston (Dec 6 2022)

Some quick thoughts: A crude version of the vulnerable world hypothesis is “developing new technology is existentially dangerous, full stop”, in which case advanced AI that increases the rate of new technology development is existentially dangerous, full stop.

One of Bostrom’s solutions is totalitarianism. This seems to imply something like “new technology is dangerous, but this might be offset by reducing freedom proportionally”. Accepting this hypothesis seems to say that either advanced AI is existentially dangerous, or it accelerates a political transition to totalitarianism, which seems to be its own kind of risk.

Jordan Arel (Dec 9 2022)

Yes, I agree this is somewhat what Bostrom is arguing. As I mentioned in the post, I think there may be solutions which don’t require totalitarianism, i.e. massive universal moral progress. I know this sounds intractable, and I might address why I think that may be mistaken in a future post. But it is a moot point if a vulnerable-world-induced x-risk scenario is unlikely, which is why I am wondering if there has been any work on this.