Rohin wrote this amazing summary of all the research happening in AI Alignment in 2018-2019. Here is the "short" version of the summary, though I do recommend reading the whole post, which I think conveys quite well what happened in the last two years in AI Alignment.
While the full text tries to accurately summarize different points of view, that is not a goal in this summary. Here I simply try to give a sense of the topics involved in the discussion, without saying what discussion actually happened.
Basic analysis of AI risk. Traditional arguments for AI risk argue that since agentic AI systems will apply lots of optimization, they will lead to extreme outcomes that can’t be handled with normal engineering efforts. Powerful AI systems will not have their resources stolen from them, which by various Dutch book theorems implies that they must be expected utility maximizers; since expected utility maximizers are goal-directed, they are dangerous.
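To make the Dutch book intuition concrete, here is a toy sketch (my own illustration, not from the full post): an agent with cyclic preferences A > B > C > A pays a small fee for every trade it sees as an upgrade, so it can be led around the cycle and drained of resources. The item names, the fee, and the `money_pump` function are all hypothetical.

```python
# Toy money pump: an agent with cyclic preferences pays a fee for every
# trade it considers an upgrade, and so loses money without bound.
# All names and numbers here are illustrative assumptions.

# prefers[x] is the item the agent strictly prefers to x (a cycle: A > B > C > A).
CYCLIC_PREFERS = {"B": "A", "C": "B", "A": "C"}

# The same preferences without the cycle (A > B > C).
TRANSITIVE_PREFERS = {"B": "A", "C": "B"}


def money_pump(prefers, start_item, fee, rounds):
    """Offer the agent its preferred swap up to `rounds` times, charging `fee` each time.

    Returns (final_item, total_paid).
    """
    item, paid = start_item, 0.0
    for _ in range(rounds):
        better = prefers.get(item)
        if better is None:  # nothing is preferred to the current item: trading stops
            break
        item = better
        paid += fee
    return item, paid


if __name__ == "__main__":
    # Three trades bring the cyclic agent back to where it started, three fees poorer.
    print(money_pump(CYCLIC_PREFERS, "A", fee=1.0, rounds=3))
    # The transitive agent stops trading once it holds its favorite item.
    print(money_pump(TRANSITIVE_PREFERS, "C", fee=1.0, rounds=10))
```

The cyclic agent keeps paying for as long as trades are offered, while the transitive agent pays a bounded amount and then stops. This only illustrates why exploitable preferences look bad; it says nothing about whether a real system would be goal-directed.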
However, the VNM theorem does not justify the assumption that an AI system will be goal-directed: such an assumption is really based on intuitions and conceptual arguments (which are still quite strong).
Comprehensive AI Services (CAIS) challenges the assumption that we will have a single agentic AI, instead suggesting that any task will be performed by a collection of modular services.
That being said, there are several other arguments for AI risk, such as the argument that AI might cause “lock in” which may require us to solve hard philosophical problems before the development of AGI.
Nonetheless, there are disjunctive reasons to expect that catastrophe does not occur: for example, there may not be a problem, or ML researchers may solve the problem after we get “warning shots”, or we could coordinate to not build unaligned AI.
Agency and optimization. One proposed problem is that of mesa optimization, in which an optimization algorithm used to train an AI creates an agent that is itself performing optimization. In such a scenario, we need to ensure that the “inner” optimization is also aligned.
To better understand these and other situations, it would be useful to have a formalization of optimization. This is hard: while we don’t want optimization to be about our beliefs about a system, if we try to define it mechanistically, it becomes hard to avoid defining a bottle cap as an optimizer of “water kept in the bottle”.
Understanding agents is another hard task. While agents are relatively well understood under the Cartesian assumption, where the agent is separate from its environment, things become much more complex and poorly-understood when the agent is a part of its environment.
Value learning. Building an AI that learns all of human value has historically been thought to be very hard, because it requires you to decompose human behavior into the “beliefs and planning” part and the “values” part, and there’s no clear way to do this.
Another way of looking at it is to say that value learning requires a model that separates the given data into that which actually achieves the true “values” and that which is just “a mistake”, which seems hard to do. In addition, value learning seems quite fragile to mis-specification of this human model.
Nonetheless, there are reasons for optimism. We could try to build an adequate utility function, which works well enough for our purposes. We can also have uncertainty over the utility function, and update the belief over time based on human behavior. If everything is specified correctly (a big if), as time goes on, the agent would become more and more aligned with human values. One major benefit of this is that it is interactive -- it doesn’t require us to specify everything perfectly ahead of time.
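As a minimal sketch of "uncertainty over the utility function, updated from human behavior": the code below keeps a belief over two candidate utility functions and applies Bayes' rule after each observed choice. The candidate utilities, the option names, and the noisily-rational (Boltzmann) human model with parameter `beta` are all assumptions of mine; the last one is exactly the kind of human model whose mis-specification makes value learning fragile.

```python
import math

# Hypothetical candidate utility functions over three options.
CANDIDATES = {
    "likes_apples": {"apple": 1.0, "banana": 0.0, "cherry": 0.2},
    "likes_bananas": {"apple": 0.0, "banana": 1.0, "cherry": 0.2},
}


def choice_likelihood(utility, choice, beta=3.0):
    """P(choice | utility) under an assumed Boltzmann-rational human model."""
    weights = {option: math.exp(beta * u) for option, u in utility.items()}
    return weights[choice] / sum(weights.values())


def update(prior, choice, beta=3.0):
    """One Bayesian update of the belief over candidate utility functions."""
    unnormalized = {
        name: prior[name] * choice_likelihood(CANDIDATES[name], choice, beta)
        for name in prior
    }
    total = sum(unnormalized.values())
    return {name: p / total for name, p in unnormalized.items()}


def learn(observed_choices, beta=3.0):
    """Start from a uniform belief and update on each observed human choice."""
    belief = {name: 1.0 / len(CANDIDATES) for name in CANDIDATES}
    for choice in observed_choices:
        belief = update(belief, choice, beta)
    return belief


if __name__ == "__main__":
    print(learn(["apple", "apple", "cherry", "apple"]))
```

Each observed "apple" shifts belief toward `likes_apples`, while "cherry" is equally likely under both candidates and so leaves the belief unchanged. The interactive benefit mentioned above shows up here as the ability to keep calling `update` as new behavior is observed.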
Robustness. We would like our agents to be robust - that is, they shouldn’t fail catastrophically in situations slightly different from the ones they were designed for. Within reinforcement learning, safe reinforcement learning aims to avoid mistakes, even during training. This either requires analytical (i.e. not trial-and-error) reasoning about what a “mistake” is, which requires a formal specification of what a mistake is, or an overseer who can correct the agent before it makes a mistake.
The classic example of a failure of robustness is adversarial examples, in which a tiny change to an image can drastically affect its classification. Recent research has shown that these examples are caused (at least in part) by real statistical correlations that generalize to the test set but are nonetheless fragile to small changes. In addition, since robustness to one kind of adversary doesn’t make the classifier robust to other kinds of adversaries, there has been a lot of work done on improving adversarial evaluation in image classification. We’re also seeing some of this work in reinforcement learning.
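To show how a tiny change can flip a classification, here is a toy sketch of my own: a linear classifier over a hypothetical 100-pixel "image", attacked in the style of the fast gradient sign method (every pixel moves by at most `epsilon` in the direction that hurts the current label). The weights, the input, and the function names are made up for illustration; this does not demonstrate the claim about statistical correlations.

```python
def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def score(weights, bias, x):
    """Linear score; the gradient of this with respect to x is just `weights`."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def classify(weights, bias, x):
    return 1 if score(weights, bias, x) > 0 else 0


def adversarial_perturbation(weights, x, label, epsilon):
    """Shift every feature by at most epsilon in the direction that hurts `label`."""
    direction = -1 if label == 1 else 1
    return [xi + direction * epsilon * sign(w) for w, xi in zip(weights, x)]


# Hypothetical model and input: each weight is small, but there are many of them.
WEIGHTS = [0.1 if i % 2 == 0 else -0.1 for i in range(100)]
BIAS = 0.0
X_CLEAN = [0.52 if i % 2 == 0 else 0.48 for i in range(100)]
EPSILON = 0.05

X_ADV = adversarial_perturbation(
    WEIGHTS, X_CLEAN, classify(WEIGHTS, BIAS, X_CLEAN), EPSILON
)

if __name__ == "__main__":
    print(classify(WEIGHTS, BIAS, X_CLEAN), classify(WEIGHTS, BIAS, X_ADV))
```

No pixel changes by more than 0.05, yet the many small changes all push the score the same way, which is enough to move it across the decision boundary.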
However, asking our agents to be robust to arbitrary mistakes seems to be too much -- humans certainly don’t meet this bar. For AI safety, it seems like we need to ensure that our agents are robustly intent aligned, that is, they are always “trying” to do what we want. One particular way that our agents could be intent aligned is if they are corrigible, that is, they are trying to keep us “in control”. This seems like a particularly easy property to verify, as conceptually it seems to be independent of the domain in which the agent is deployed.
So, we would like to ensure that even in the worst case, our agent remains corrigible. One proposal would be to train an adversary to search for “relaxed” situations in which the agent behaves incorrigibly, and then train the agent not to do that.
Scaling to superhuman abilities. If we’re building corrigible agents using adversarial training, our adversary should be more capable than the agent that it is training, so that it can find all the situations in which the agent behaves incorrigibly. This requires techniques that scale to superhuman abilities. Some techniques for this include iterated amplification and debate.
In iterated amplification, we start with an initial policy, and alternate between amplification and distillation, which increase capabilities and efficiency respectively. This can encode a range of algorithms, but often amplification is done by decomposing questions and using the agent to answer subquestions, and distillation can be done using supervised learning or reinforcement learning.
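The alternation can be sketched with a deliberately tiny toy (my own, not an implementation from the literature). The task is summing a tuple of numbers; the initial policy can only handle a single number; amplification splits a question in half and asks the current agent the two subquestions; distillation is stood in for by a lookup table, whereas a real system would train a model that generalizes. All function names are hypothetical.

```python
def initial_policy(question):
    """Weak starting agent: can only 'sum' a single number."""
    return question[0] if len(question) == 1 else None


def amplify(agent):
    """Slower but more capable system: decompose and ask the agent the subquestions."""
    def amplified(question):
        direct = agent(question)
        if direct is not None:
            return direct
        mid = len(question) // 2
        left, right = agent(question[:mid]), agent(question[mid:])
        if left is None or right is None:
            return None  # the subquestions are still too hard for the current agent
        return left + right
    return amplified


def distill(amplified, training_questions):
    """Fast agent that imitates the amplified system (here, just a lookup table)."""
    table = {q: amplified(q) for q in training_questions}

    def distilled(question):
        return table.get(question)
    return distilled


def iterated_amplification(training_questions, rounds):
    agent = initial_policy
    for _ in range(rounds):
        agent = distill(amplify(agent), training_questions)
    return agent


NUMBERS = (3, 1, 4, 1, 5, 9, 2, 6)
# Training questions: every contiguous slice of NUMBERS.
QUESTIONS = [
    NUMBERS[i:j]
    for i in range(len(NUMBERS))
    for j in range(i + 1, len(NUMBERS) + 1)
]

if __name__ == "__main__":
    for rounds in (1, 2, 3):
        print(rounds, iterated_amplification(QUESTIONS, rounds)(NUMBERS))
```

In this toy, each round doubles the question length the distilled agent can answer directly, so the eight-number question is out of reach after two rounds and answered after three.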
In debate, we train an agent through self-play in a zero-sum game in which the agent’s goal is to “win” a question-answering debate, as evaluated by a human judge. The hope is that since each “side” of the debate can point out flaws in the other side’s arguments, such a setup can use a human judge to train far more capable agents while still incentivizing them to provide honest, true information.
Both iterated amplification and debate aim to train an agent that approximates the answer that one would get from an exponentially large tree of humans deliberating. The factored cognition hypothesis is that this sort of tree of humans is able to do any task we care about. This hypothesis is controversial: many have the intuition that cognition requires large contexts and flashes of intuition that couldn’t be replicated by a tree of time-limited humans.
Universality. One property we would hope to have is that if we use this tree of humans as an overseer for some simpler agent, then the tree would “know everything the agent knows”. If true, this property could allow us to build a significantly stronger conceptual argument for safety. It is also very related to…
Interpretability. While interpretability can help us know what the agent knows, and what the agent would do in other situations (which can help us verify if it is corrigible), there are other uses for it as well: in general, it seems better if we can understand the things we’re building.
Impact regularization. While relative reachability and attainable utility preservation were developed last year, this year they were unified into a single framework. In addition, there was a new proposed definition of impact: change in our ability to get what we want. This notion of impact depends on knowing the utility function U. However, we might hope that we can penalize some “objective” notion, perhaps “power”, that occurs regardless of the choice of U, for the same reasons that we expect instrumental convergence.
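A rough sketch of the two notions, loosely in the spirit of attainable utility preservation (this is my simplification, not the unified framework itself): impact with respect to one utility function is the change in attainable value relative to doing nothing, and a U-independent proxy averages that change over several candidate utility functions. The actions, the candidate utilities, and every number in the table are hypothetical.

```python
# ATTAINABLE[u][action]: how much value we could still attain for utility `u`
# after the agent takes `action`. All entries are made-up illustrative numbers.
ATTAINABLE = {
    "U1": {"noop": 10.0, "move_box": 10.0, "smash_tools": 2.0},
    "U2": {"noop": 6.0, "move_box": 7.0, "smash_tools": 1.0},
    "U3": {"noop": 8.0, "move_box": 8.0, "smash_tools": 3.0},
}


def impact(attainable, action, baseline="noop"):
    """Impact for one utility function: change in attainable value vs. doing nothing."""
    return abs(attainable[action] - attainable[baseline])


def average_impact(attainable_by_utility, action, baseline="noop"):
    """Proxy that does not need the true U: average impact over candidate utilities."""
    impacts = [
        impact(table, action, baseline) for table in attainable_by_utility.values()
    ]
    return sum(impacts) / len(impacts)


def penalized_reward(task_reward, action, strength=1.0):
    """Task reward minus an impact penalty, with an assumed trade-off `strength`."""
    return task_reward - strength * average_impact(ATTAINABLE, action)


if __name__ == "__main__":
    for action in ("noop", "move_box", "smash_tools"):
        print(action, average_impact(ATTAINABLE, action))
```

Here `smash_tools` reduces what is attainable under every candidate utility, so it is penalized heavily regardless of which U is the true one, which is the intuition behind penalizing something like “power”.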
Causal modeling. Causal models have been used recently to model the incentives for an agent under different AI safety frameworks, and to argue that by evaluating plans with the current reward function, you can remove the incentive for an agent to tamper with its reward function.
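The reward tampering argument can be illustrated with a two-plan toy (my own sketch, not the causal-diagram formalism): one plan does the task, the other replaces the reward function with one that always outputs a high number. The plan names, the reward functions, and the numbers are hypothetical.

```python
def current_reward(outcome):
    """The reward function the agent has right now: it rewards task progress."""
    return outcome["task_progress"]


def tampered_reward(outcome):
    """A hacked reward function that reports a high reward no matter what."""
    return 10.0


# Each plan leads to an outcome and determines which reward function is in place afterwards.
PLANS = {
    "do_task": {"task_progress": 1.0, "reward_fn_after": current_reward},
    "tamper": {"task_progress": 0.0, "reward_fn_after": tampered_reward},
}


def evaluate_with_future_reward(outcome):
    """Score a plan using whatever reward function exists after the plan runs."""
    return outcome["reward_fn_after"](outcome)


def evaluate_with_current_reward(outcome):
    """Score a plan using the reward function the agent has while planning."""
    return current_reward(outcome)


def best_plan(plans, evaluate):
    return max(plans, key=lambda name: evaluate(plans[name]))


if __name__ == "__main__":
    print(best_plan(PLANS, evaluate_with_future_reward))
    print(best_plan(PLANS, evaluate_with_current_reward))
```

When plans are scored by the future reward function, tampering wins; when they are scored by the current reward function, tampering gains nothing, so the incentive disappears in this toy.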
Oracles. Even if oracles are trying to maximize predictive accuracy, they could “choose” between different self-confirming predictions. We could avoid this using counterfactual oracles, which make predictions conditional on those predictions not influencing the future.
Decision theory. There was work on decision theory, which I haven’t followed closely.
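A toy sketch of the difference, under assumptions of my own: predicting a bank failure causes a bank run if people read the prediction, so both "fail" and "survive" are self-confirming. One way to realize the counterfactual idea, assumed here, is to score the oracle only on episodes where its prediction is hidden from everyone. The scenario, the base rate, and the function names are hypothetical.

```python
BASE_FAILURE_PROB = 0.1  # assumed chance the bank fails when nobody reads a prediction
PREDICTIONS = ("fail", "survive")


def failure_prob(prediction, prediction_is_read):
    """If people read the prediction it comes true; otherwise the base rate applies."""
    if prediction_is_read:
        return 1.0 if prediction == "fail" else 0.0
    return BASE_FAILURE_PROB


def expected_accuracy(prediction, prediction_is_read):
    """Probability that the prediction matches what actually happens."""
    p_fail = failure_prob(prediction, prediction_is_read)
    return p_fail if prediction == "fail" else 1.0 - p_fail


def standard_oracle_scores():
    """Scores when the prediction is always published and therefore shapes the outcome."""
    return {p: expected_accuracy(p, prediction_is_read=True) for p in PREDICTIONS}


def counterfactual_oracle_scores():
    """Scores computed only on episodes where the prediction stays hidden."""
    return {p: expected_accuracy(p, prediction_is_read=False) for p in PREDICTIONS}


if __name__ == "__main__":
    print(standard_oracle_scores())
    print(counterfactual_oracle_scores())
```

Under standard scoring, both predictions are perfectly accurate, so accuracy alone does not pin down which one the oracle gives. Under counterfactual scoring, only the prediction that matches the uninfluenced world scores well.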
Forecasting. Several resources were developed to enable effective group forecasting, including an AI forecasting dictionary that defines terms, an AI resolution council whose future opinions can be predicted, and a dataset of well-constructed exemplar questions about AI.
Separately, the debate over takeoff speeds continued, with two posts arguing forcefully for continuous takeoff, without much response (although many researchers do not agree with them). The continuity of takeoff is relevant to, but doesn’t completely determine, whether recursive self improvement will happen or whether some actor acquires a decisive strategic advantage. The primary implication of the debate is whether we should expect to have enough time to react and fix problems as they arise.
It has also become clearer that recent progress in AI has been driven to a significant degree by increasing the amount of compute devoted to AI, which suggests a more continuous takeoff. However, you could take the position that current methods can’t do <property X> (say, causal reasoning), in which case it doesn’t matter how much compute you use.
AI Progress. There was a lot of progress in AI.
Field building. There were posts aiming to build the field, but they were all fairly disjointed.
The long version (~8.3k words) starts here.