Why I expect successful (narrow) alignment

Summary

I believe that advanced AI systems will likely be aligned with the goals of their human operators, at least in a narrow sense. I’ll give three main reasons for this:

  1. The transition to AI may happen in a way that does not give rise to the alignment problem as it’s usually conceived of.
  2. While work on the alignment problem appears neglected at this point, it’s likely that large amounts of resources will be used to tackle it if and when it becomes apparent that alignment is a serious problem.
  3. Even if the previous two points do not hold, we have already come up with a couple of smart approaches that seem fairly likely to lead to successful alignment.

This argument lends some support to work on non-technical interventions like moral circle expansion or improving AI-related policy, as well as work on special aspects of AI safety like decision theory or worst-case AI safety measures.

Introduction

Efforts to shape advanced artificial intelligence (AI) may be among the most promising altruistic endeavours. Many people argue that the key issue is to ensure that powerful AI systems will reliably act in ways that are desirable to their human users – the so-called alignment problem.

In this post, I outline why I think it is likely that the alignment problem will be solved.1 However, I do not claim that we will find a shovel-ready solution to alignment any time soon; I just claim that advanced AI systems will likely be aligned by the time they are built.

To clarify, I’m talking about a minimal sense of alignment, which I’ll define as “AI systems behave in ways that are broadly in line with what their human operators intend”. That doesn’t necessarily mean perfect safety in all circumstances – similar to how present-day software suffers from bugs, but the consequences are usually not catastrophic. I’m also bracketing various conceptual complications (see e.g. 1, 2) regarding the meaning of alignment. I’ll refer to this as “narrow alignment”.

Successful narrow alignment does not mean that everything will be rosy – not all human intentions are good intentions. It doesn’t even mean that “human values” will, in a meaningful sense, be in control of the future.2 In particular, even if each individual AI system is aligned with (different) human operators, it’s quite plausible that a world containing many such AIs will exhibit systemic trends or Moloch dynamics where the collective evolves in ways that aren’t necessarily intended by anyone, or might even be disfavored by many. The resulting high-level dynamics may not be (fully) aligned with human (compromise) values, similar to how the economy or the evolution of new technologies is not (fully) aligned with human interests even though it is driven by individual humans.

However, this problem is very different from the usual conception of the alignment problem and would arguably require different strategies (like better governance or improved global coordination), which is why I won’t discuss it in this post.

The alignment problem in different scenarios

At least historically, concerns about the alignment problem tended to go along with specific views on what the transition to advanced AI will look like (assuming it happens at all). In particular, there’s a tendency to think of advanced AI as a single, unified agent, and to view a hard takeoff or intelligence explosion as likely or at least plausible.

These views can be criticized (cf. Magnus Vinding’s contra AI FOOM reading list), e.g. on the following grounds:

  • The capabilities of advanced AI may be distributed over many different systems and many different places, comparable to how “the industry” or “the economy” is not a localised, discrete actor.[1, 2]
  • Eric Drexler envisions a service-centered model of general, superintelligent-level capabilities that does not necessarily act like a unified agent.
  • The distribution of competence over various skills of AI will likely differ from that of humans, similar to how AI currently excels at board games but lacks common sense. To the extent that this is true, it means that AI systems surpass humans “one domain at a time”, rather than surpassing humans in all domains at once.

I think it is quite likely that some criticism along these lines is accurate. That doesn’t imply that alignment is not an issue at all, but the problem tends to be more benign in these alternative AI scenarios. This is because it wouldn’t be necessary to solve everything in advance – something that humanity is very bad at. If the emergence of AI is gradual or distributed, then it is more plausible that safety issues can adequately be handled “as usual”, by reacting to issues as they arise, by extensive testing and engineering, and by incrementally designing systems to satisfy multiple constraints.

In addition, I think that the range of plausible scenarios also includes some forms of advanced AI, such as whole brain emulation or brain-computer interfaces, where the alignment problem arguably does not arise at all, or at least not in its usual form.3

Future work on AI alignment

While alignment looks neglected now, we should also take into account that huge amounts of resources will likely be invested if it becomes apparent that this is a serious problem (see also here). Strong economic incentives will push towards alignment: it’s not economically useful to have a powerful AI system that doesn’t reliably do what you want. Also, future people will likely have more advanced methods at their disposal to help them solve alignment, ranging from more mundane factors like a larger number of researchers to more speculative possibilities like biologically enhanced minds or (depending on takeoff scenarios) helpful contributions from not-yet-quite-as-powerful AI systems. (Note that “the alignment problem is likely to be solved” is not the same as “the alignment problem is easy” – this post defends the former claim.)

It is conceivable that there will be a rapid increase in capabilities with no warning signs or at least no “fire alarm”. However, this is an additional assumption, and I find it more likely that there will be warning signs; in fact, we already observe certain kinds of “misalignment” in concrete machine learning systems, and there has been a surge of interest in AI safety in recent years.

I think that a large investment of resources will likely yield satisfactory alignment solutions, for several reasons:4

  • The problem of AI alignment differs from conventional principal-agent problems (aligning a human with the interests of a company, state, or other institution) in that we have complete freedom in our design of artificial agents: we can set their internal structure, their goals, and their interactions with the outside world at will.5
  • We only need to find a single approach that works among a large set of possible ideas.
  • Alignment is not an agential problem, i.e. there are no agential forces that push against finding a solution – it’s just an engineering challenge. (However, that applies only to narrow alignment. The possibility of malicious use of AI technology by bad actors is an agential problem, and indeed I think it’s less clear whether this problem will be solved to a satisfactory extent.)

Existing approaches hold some promise

Let’s suppose, for the sake of argument, that we’re building a superintelligent agent and we’re not convinced by the abstract claim that future folks will allocate appropriate resources to the problem. Surely it’s more reassuring to look at concrete approaches for how to deal with the most common concerns. For instance, a common worry is that we will not be able to correctly specify human values in an AI system, and that optimizing for mis-specified values will lead to catastrophic outcomes. And it’s undoubtedly correct that we’re currently unable to specify human goals in machine learning systems.

But it’s a mistake to imagine a superintelligence against a backdrop of our current capabilities. If powerful, general-purpose learning algorithms become available, it will also be possible to apply these methods to learn, rather than hard-code, concepts like “what humans want” or “what humans would approve of”.

Imagine a big neural network that receives a detailed description of an outcome as its input, and outputs the extent to which the human operator (the “overseer”) would approve of this outcome. We could then build a machine learning system with the aim of maximising the output of this neural network. (To be clear, this isn’t a particularly brilliant idea, and I’m not saying we should build AI systems like that – I’m just considering it for illustrative purposes. For instance, bootstrapping schemes like Paul Christiano’s iterated amplification and distillation might be a more sophisticated approach.)
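
To make this slightly more concrete, here is a minimal sketch of such an approval model and of naively optimising against it, written in PyTorch-style Python. Everything in it – the names, the architecture, the embeddings standing in for “detailed descriptions of outcomes” – is a hypothetical illustration, not a proposal for how to actually build such a system.

```python
# Illustrative sketch only: a network that maps a description of an outcome
# to a predicted approval score from the overseer. All names are hypothetical.
import torch
import torch.nn as nn

class ApprovalModel(nn.Module):
    def __init__(self, embedding_dim: int = 512):
        super().__init__()
        # A small MLP standing in for "a big neural network".
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, outcome_embeddings: torch.Tensor) -> torch.Tensor:
        # One scalar approval score per outcome description.
        return self.net(outcome_embeddings).squeeze(-1)

def train_approval_model(model, outcome_embeddings, human_ratings, epochs=10):
    """Fit the model to (outcome, overseer-approval) pairs with a simple regression loss."""
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimiser.zero_grad()
        loss = loss_fn(model(outcome_embeddings), human_ratings)
        loss.backward()
        optimiser.step()
    return model

def pick_best_outcome(model, candidate_embeddings):
    """Naively 'maximise the output of the network': choose the candidate
    with the highest predicted approval."""
    with torch.no_grad():
        scores = model(candidate_embeddings)
    return torch.argmax(scores).item()
```

The last step is exactly where the Goodhart-style worry discussed below enters: ranking candidates by a learned proxy amplifies whatever flaws the proxy has.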

It seems to me that it’s not that hard (given the assumption of powerful general-purpose learning algorithms) to learn a reasonably good representation of what humans want. It may not be perfect, and there’s a concern that a superintelligence would lock in a flawed representation and then go on to maximise that for all eternity, which would be very bad (cf. Goodhart’s curse and fragility of value). But the crux is that the notion of human values doesn’t need to be perfect to understand that humans do not approve of lock-ins, that humans would not approve of attempts to manipulate them, and so on. It is arguably possible to avoid lock-ins and instead regularly consult with the overseer to figure out whether you’re on track, and it’s not hard to grasp that this leads to higher approval in expectation. (It may be helpful to evaluate not just outcomes, but also the AI’s thought processes, the means it employs, etc.)

This is closely related to the concept of corrigibility and the idea of building AI systems to be faithful assistants to humans. In other words, I share the intuition that there is a relatively broad “basin of corrigibility”: even if the representation of values may be flawed at first, the approach is unlikely to break down completely because of that. (Again, it’s not that hard to understand what it means to be a helpful assistant to somebody.)
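
As a toy illustration of the “consult the overseer instead of locking in” idea, a corrigible control loop might look roughly like the sketch below. It presupposes something like the approval model above plus a crude uncertainty estimate; the names, thresholds and callbacks are all made up for the example.

```python
# Toy sketch: defer to the overseer whenever the value estimate looks shaky,
# and periodically even when it looks confident. All names are illustrative.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Plan:
    description: str
    predicted_approval: float  # e.g. output of an approval model as sketched above
    uncertainty: float         # e.g. disagreement within an ensemble of such models

def run_corrigible_agent(
    propose_plans: Callable[[], List[Plan]],
    ask_overseer: Callable[[Plan], bool],
    record_feedback: Callable[[Plan, bool], None],
    execute: Callable[[Plan], None],
    uncertainty_threshold: float = 0.1,
    review_every: int = 10,
    max_steps: int = 100,
) -> None:
    for step in range(max_steps):
        best = max(propose_plans(), key=lambda p: p.predicted_approval)
        # Consult the overseer when uncertain, and at regular intervals anyway,
        # rather than locking in the current representation of values.
        if best.uncertainty > uncertainty_threshold or step % review_every == 0:
            approved = ask_overseer(best)
            # Feed the answer back (e.g. retrain the approval model) instead of
            # optimising harder against a representation we now know is off.
            record_feedback(best, approved)
            if not approved:
                continue
        execute(best)
```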

Implications

One might argue that the scenarios in which alignment gets solved anyway are not what keeps us up at night. That is, for precautionary reasons we should act as if the problem is hard, which is why it makes sense to work on alignment even if everything I say in this post is true.

I think that’s a fair point, and this post is not meant as a critique of AI alignment work. However, this reply assumes that (narrow) alignment or misalignment is the most critical dimension when it comes to shaping the long-term future. This is a common view but it’s not a no-brainer:

  • If AI systems are likely to be aligned with human values, then it’s crucial to ensure that humans care about the “right” values, e.g. that they care about avoiding suffering rather than being selfish. So a high likelihood of alignment could be viewed as an argument in favor of moral advocacy and moral circle expansion. (This assumes that our own values diverge from those of the average human, and that future humans will not converge on the “right” values anyway.)
  • Rather than working on technical safety problems of advanced AI, we should perhaps focus on non-technical aspects, such as improving AI-related policy, better global coordination to avoid arms race dynamics, or preventing malicious use of AI.
  • We could focus on “non-standard” aspects of AI safety beyond value alignment that are less likely to be solved, like ensuring that advanced AI systems will use an adequate decision theory that leads to the right conclusions on unknowns like acausal trade or multiverse-wide superrationality.

Personally, I am mainly interested in reducing s-risks, and I think the precautionary argument does not go through from that perspective. Advanced AI could still pose an s-risk even if aligned – indeed, it’s not even obvious whether aligned AI leads to less suffering in expectation than misaligned AI. Whether or not the future is controlled by human values is therefore not the most critical question (though still relevant).

So I think suffering reducers should assume that (narrow) alignment will likely be successful, and work to reduce s-risks conditional on that assumption6 – for instance, by implementing surrogate goals or other worst-case AI safety measures.

Footnotes

  1. My inside view puts ~90% probability on successful alignment (by which I mean narrow alignment as defined below). Factoring in the views of other thoughtful people, some of whom think alignment is far less likely, that number comes down to ~80%.
  2. It could be argued that humanity is not and has never been in control of its destiny: technological progress, economic pressure and the evolution of political ideologies have always had a life of their own.
  3. There might still be a fair amount of narrow safety work to do in these scenarios, such as ensuring that the emulations / interfaces are accurate and not subtly off. For example, a brain emulation with a few parameters slightly wrong might result in emulated minds that are psychopathic, or that diverge from normal human values in various other ways. (HT Brian Tomasik for this point.)
  4. This isn’t to say that elites will navigate the creation of AI well, as that also involves non-technical aspects like coordination problems. Climate change is an example of a problem that gets a lot of attention but still doesn’t really get solved; but I’d argue that case is different in that it’s mostly a coordination problem – and it’s not obvious how much societal elites actually care about preventing climate change.
  5. There may be tradeoffs between performance and controllability, so in some sense we don’t have complete design freedom. However, I’d argue that this is mostly about solving the resulting cooperation problem, not about narrow alignment.
  6. This also requires the additional premise that the tractability of influencing aligned vs. non-aligned AI does not differ to an extreme extent. It’s not clear to me which is more tractable, so I think that premise is plausible.
