This is a cross-post from LessWrong
Thanks to Daniel Kokotajlo, Jan Hendrik Kirchner, Remmelt Ellen, Berbank Green, Otto Barten and Olaf Voß for helpful suggestions and comments.
As AI alignment remains terribly difficult and timelines appear to be dwindling, we must face the likely situation that within the next 20 years we will be able to build an AI that poses an existential threat before we know how to control it. In this case, our only chance to avert a catastrophe will be to collectively refrain from developing such a “dangerous” AI. But what exactly does that mean?
It seems obvious that an AI which pursues the wrong goal and is vastly more intelligent than any human would be “dangerous” in the sense that it would likely be unstoppable and probably lead to an existential catastrophe. But where, exactly, is the tipping point? Where is the line between harmless current AIs, like GPT-3 or MuZero, and a future AI that may pose an existential threat?
To put the question differently: If we were asked to draft a global law that prohibits creating “dangerous AI”, what should be written in it? Which are the things that no actor should ever be allowed to do, the “red lines” no one should ever cross, at least until there is a feasible and safe solution to the alignment problem? How would we even recognize a “dangerous AI”, or plans to build one?
This question is critical, because if the only way to avert an existential risk is to refrain from building a dangerous AI, we need to be very sure about what exactly makes an AI “dangerous” in this sense.
It may seem impossible to prevent all of humanity from doing something which is technically feasible. But while it is often difficult to get people to agree on any kind of policy, there are already many things that are not explicitly forbidden but that most people don’t do anyway, like letting their children play with radioactive toys, eating unidentifiable mushrooms they find in the woods, climbing under a truck to drink wine while it is driving at full speed on the highway, or drinking aquarium cleaner as a treatment against Covid. There is a common understanding that these are stupid things to do because the risk is much greater than the possible benefit. This common understanding of dangerousness is all that is needed to keep a very large proportion of humanity from doing those things.
If we could create a similar common understanding of what exactly the necessary and sufficient conditions are that turn an AI into an existential threat, I think there might be a chance that it wouldn’t be built, at least not for some time, even without a global law prohibiting it. After all, no one (apart maybe from some suicidal terrorists) would want to risk the destruction of the world they live in. There is no shareholder value to be gained from it. The expected net present value of such an investment would be hugely negative. There is no personal fame and fortune waiting for the first person to destroy the world.
Of course, it may not be so easy to define exact criteria for when an AI becomes “dangerous” in this sense. More likely there will be gray areas where the territory becomes increasingly dangerous. Still, I think it would be worthwhile to put significant effort into mapping that territory. It would help us with governing AI development and might lead to international treaties and more cautious development in some areas. In the best case, it could even help us define what “safe AI” really means, and how to use its full potential without risking our future. As an additional benefit, if a planned AI system can be identified as potentially “dangerous” beforehand, the burden of proof that their containment and control measures are fail-safe would lie with the people intending to create such a system.
In order to determine the “dangerousness” of an AI system, we should avoid the common mistake of using an anthropomorphic benchmark. When we currently talk about existential AI risks, we usually use terms like “artificial general intelligence” or “super-intelligent AI”. This seems to imply that AI gets dangerous at some point after it reaches “general problem-solving capabilities on at least human level”, as if that were a necessary condition. But this is misleading. First of all, it can lead people to underestimate the danger because they falsely equate “first arrival of dangerous AI” with “the time we fully understand the human brain”. Second, AI is already vastly super-intelligent in many narrow areas. A system that could destroy the world without being able to solve every problem at human level is at least conceivable. For example, an AI that is superhuman at strategy and persuasion could manipulate humans in a way that leads to a global nuclear war, even though it may not be able to recognize images or control a robot body in the real world. Third, as soon as an AI gained general problem-solving capabilities at human level, it would already be vastly superhuman in many other aspects, like memory, speed of thought, access to data, ability to self-improve, etc., which might make it an invincible power. This has been illustrated in the following graphic (courtesy of AI Impacts, thanks to Daniel Kokotajlo for pointing it out to me):
The points above indicate that the line between “harmless” and “dangerous” must be somewhere below the traditional threshold of “at least human problem-solving capabilities in most domains”. Even today’s narrow AIs often have significant negative, possibly even catastrophic side effects (think for example of social media algorithms pushing extremist views, amplifying divisiveness and hatred, and increasing the likelihood of nationalist governments and dictatorships, which in turn increases the risk of wars). While there are many beneficial applications of advanced AI, with the current speed of development, the possibility of things going badly wrong also increases. This makes it even more critical to determine how exactly an AI can become “dangerous”, even if it is lacking some of the capabilities typically associated with AGI.
It is beyond the scope of this post to make specific recommendations about how “dangerousness” could be defined and measured. This will require a lot more research. But there are at least some properties of an AI that could be relevant in this context:
- Broadness (of capabilities): Today’s narrow AIs are obviously not an existential threat yet. As the broadness of domains in which a system is capable grows, however, the risk of the system exhibiting unforeseen and unwanted behavior in some domain increases. This doesn’t mean that a narrow AI is necessarily safe (see example above), but broadness of capabilities could be a factor in determining dangerousness.
- Complexity: The more complex a system is, the more difficult it is to predict its behavior, which increases the likelihood that some of this behavior will be undesirable or even catastrophic. Therefore, all else being equal, the more complex a system is, measured for example by the number of parameters of a transformer neural network, the more dangerous.
- Opaqueness: Some complex systems are easier to understand and predict than others. For example, symbolic AI tends to be less “opaque” than neural networks. Tools for explainability can help reduce an AI's opaqueness. The more opaque a system is, the less predictable and the more dangerous.
- World model: The more an AI knows about the world, the better it becomes at making plans about future world states and acting effectively to change these states, including in directions we don’t want. Therefore, the scope and precision of its knowledge about the real world may be a factor in its dangerousness.
- Strategic awareness (as defined by Joseph Carlsmith, see section 2.1 of this document): This may be a critical factor in the dangerousness of an AI. A system with strategic awareness realizes to some extent that it is a part of its environment, a necessary element of its plan to achieve its goals, and a potential object of its own decisions. This leads to instrumental goals, like power-seeking, self-improvement, and preventing humans from turning it off or changing its main goal. The more strategically aware an AI becomes, the more dangerous.
- Stability: A system that dynamically changes over time is less predictable, and therefore more dangerous, than a system that is stable. For example, an AI that learns in real time and is even able to self-improve should in general be considered more dangerous than a system that is trained once and then applied to a task without any further changes.
- Computing power: The more computing power a system has, the more powerful, and therefore potentially dangerous, it becomes. This also applies to processing speed: The faster a system can decide and react, the more dangerous, because there is less time to understand its decisions and correct it if necessary.
One feature that I deliberately did not include in the list above is “connectivity to the outside world”, e.g. access to the internet, sensors, robots, or communication with humans. An AI that is connected to the internet and has access to many gadgets and points of contact can better manipulate the world and thus do dangerous things more easily. However, if an AI would be considered dangerous with access to some or all of these things, it should also be considered dangerous without them, because giving such a system access to the outside world, either accidentally or on purpose, could cause a catastrophe without further changing the system itself. Dynamite is considered dangerous even if there is no burning match held next to it. Restricting access to the outside world should instead be regarded as a potential measure to contain or control a potentially dangerous AI, and one that should be seen as inherently insecure.
This list is by no means complete. There are likely other types of features, e.g. certain mathematical properties, which may be relevant but which I don’t know about or don’t understand well enough to mention. I only want to point out that there may be objective, measurable features of an AI that could be used to determine its “dangerousness”. It is still unclear, however, how relevant these features are, how they interact with each other, and whether there are some absolute thresholds that can serve as “red lines”. I believe that further research into these questions would be very valuable.
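To make the idea of measurable features and “red lines” a bit more concrete, here is a purely illustrative sketch in Python. Everything in it is an assumption made for the example: the class `DangerProfile`, the function `crossed_red_lines`, the 0-to-1 rating scale, and every number are hypothetical, and nothing here claims to know how such properties should actually be measured or where real thresholds would lie. Connectivity is left out on purpose, for the reason given above.

```python
from dataclasses import dataclass, fields
from typing import List, Mapping


@dataclass(frozen=True)
class DangerProfile:
    """Hypothetical ratings in [0, 1] for the properties listed in this post."""
    broadness: float
    complexity: float
    opaqueness: float
    world_model: float
    strategic_awareness: float
    instability: float        # the inverse of "stability"
    computing_power: float

    def __post_init__(self) -> None:
        # Reject ratings outside the assumed 0-to-1 scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")


def crossed_red_lines(profile: DangerProfile,
                      red_lines: Mapping[str, float]) -> List[str]:
    """Return the (sorted) names of properties whose rating reaches its red line.

    Each key of red_lines must be a field name of DangerProfile.
    """
    return sorted(
        name for name, limit in red_lines.items()
        if getattr(profile, name) >= limit
    )


# Made-up numbers, only to show how the pieces fit together.
EXAMPLE_PROFILE = DangerProfile(
    broadness=0.3, complexity=0.6, opaqueness=0.9, world_model=0.4,
    strategic_awareness=0.7, instability=0.2, computing_power=0.5,
)
EXAMPLE_RED_LINES = {
    "opaqueness": 0.8,
    "strategic_awareness": 0.5,
    "computing_power": 0.9,
}

print(crossed_red_lines(EXAMPLE_PROFILE, EXAMPLE_RED_LINES))
```

The sketch checks each property against its own threshold independently. Whether that is adequate is exactly the open question: the properties may well interact, in which case per-property thresholds would miss dangerous combinations.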
A specific cruxy statement that I disagree with:
My disagreement here is threefold:
Such 'real world' capabilities seem to need many degrees of freedom in external inputs and outputs to be iteratively trained into a model.
This is where the analogy of AI's potential with dynamite's potential for danger does not hold:
- Dynamite has explosive potential from the get go (fortunately limited to a physical radius) but stays (mostly) chemically inert after production. It does not need further contact points of interaction with physical surroundings to acquire this potential for human-harmful impact.
- A self-learning machine architecture gains increasing potential for wide-scale human lethality (through general modelling/regulatory functions that could be leveraged or repurposed to modify conditions of the outside environment in self-reinforcing loops that humans can no longer contain) via long causal trajectories of the architecture's internals having interacted at many contact points with the outside world in the past. The initially produced 'design blueprint' does not immediately acquire this potential through production of needed hardware and initialisation of model weights.
If engineers end up connecting up more internet channels, sensors and actuators for large ML model training and deployment while continuing to tinker with the model’s underlying code base, then from a control engineering perspective, they are setting up a fragile system that is prone to inducing cascading failures in the future. Engineers should IMO not be connecting up what amounts to self-learning spaghetti code for open-endedly learning and autonomously enacting changes in the real world. This, in my view, would be engineering malpractice in which practitioners are grossly negligent in preventing risks to humans living everywhere around the planet.
A model can be trained to cause effects we deem functional but under different interactions with structural aspects of the training environment than we expected. Such a model’s intended effects are not robust to shifts in the distribution of input data received when the model is deployed in new environments. Example: in deployment this game agent ‘captures’ a wall rather than the coin it was trained to capture (which, during training, happened to sit next to the right-most wall).
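A toy sketch of this failure mode, assuming a hypothetical one-dimensional corridor rather than the actual game: two hand-written policies, `go_right` (the proxy goal) and `seek_coin` (the intended goal), both score perfectly when the coin always sits at the right-most cell, so training reward alone cannot tell them apart. Only a shifted coin placement reveals the difference. All names, the corridor length, and the coin positions are made up for illustration, and no learning takes place here.

```python
from typing import Callable, Sequence

# A policy maps (agent_pos, coin_pos) to a step of -1 (left) or +1 (right).
Policy = Callable[[int, int], int]


def go_right(agent_pos: int, coin_pos: int) -> int:
    """Proxy goal: always head for the right-most wall, ignoring the coin."""
    return 1


def seek_coin(agent_pos: int, coin_pos: int) -> int:
    """Intended goal: step toward the coin."""
    return 1 if coin_pos > agent_pos else -1


def run_episode(policy: Policy, coin_pos: int,
                length: int = 9, max_steps: int = 20) -> bool:
    """Return True if the agent stands on the coin within max_steps."""
    agent_pos = length // 2  # start in the middle of the corridor
    for _ in range(max_steps):
        if agent_pos == coin_pos:
            return True
        step = policy(agent_pos, coin_pos)
        # Walls at both ends clamp the movement.
        agent_pos = min(max(agent_pos + step, 0), length - 1)
    return agent_pos == coin_pos


def success_rate(policy: Policy, coin_positions: Sequence[int]) -> float:
    results = [run_episode(policy, c) for c in coin_positions]
    return sum(results) / len(results)


TRAIN_COINS = [8] * 10                    # coin always at the right-most cell
DEPLOY_COINS = [0, 1, 2, 3, 5, 6, 7, 8]   # coin anywhere except the start cell

for name, policy in [("go_right", go_right), ("seek_coin", seek_coin)]:
    print(name,
          "train:", success_rate(policy, TRAIN_COINS),
          "deploy:", success_rate(policy, DEPLOY_COINS))
```

In this corridor `go_right` still collects coins that happen to lie on its way to the wall, so its deployment failures are partial rather than total, which is part of what makes such a proxy hard to spot from aggregate scores.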
Compared to side-scroller games, real-life interactions are much more dimensionally complex. If we train a Deep RL model on a high-bandwidth stream of high-fidelity multimodal inputs from the physical environment in interaction with other agentic beings, we have no way of knowing whether any hidden causal structure got selected for and stays latent even during deployment test runs… until a rare set of interactions triggers it to cause outside effects that are out of line.
Core to the problem of goal misgeneralization in machine learning is that latent functions of internal code are being expressed under unknown interactions with the environment. A model that coherently overgeneralizes functional metrics over human contexts is concerning but trackable. Internal variance being selected to act out of line all over the place is not trackable.
Note that an ML model trains on signals that are coupled to existing local causal structures (as simulated on e.g. localized servers or as sensed within local physical surroundings). Thus, the space of possible goal structures that can be selected for within an ML model is constrained by features that can be derived from data inputs received from local environments. Goals are locally selected for and are thus partly non-orthogonal to intelligence (the two cannot vary fully independently).
Not at all. I'm just saying that if any AI with external access would be considered dangerous, then the same AI without access should be considered dangerous as well.
The dynamite analogy was of course not meant to be a model for AI. I just wanted to point out that even an inert mass that in principle any child could play with without coming to harm is still considered dangerous, because under certain circumstances it will be harmful. Dynamite + fire = damage, dynamite w/o fire = still dangerous.
Your third argument seems to prove my point: An AI that seems aligned in the training environment turns out to be misaligned if applied outside of the training distribution. If that can happen, the AI should be considered dangerous, even if within the training distribution it shows no signs of it.
I'm saying that general-purpose ML architectures would develop especially dangerous capabilities by being trained in high-fidelity and high-bandwidth input-output interactions with the real outside world.