In the AI safety literature, AI alignment is often presented as conceptually distinct from capabilities. However, (1) the distinction seems somewhat fuzzy and (2) many techniques that are supposed to improve alignment also improve capabilities.
(1) The distinction is fuzzy because one common way of defining alignment is getting an AI system to do what the programmer or user intends. But programmers intend for systems to be capable: e.g., we want chess systems to win at chess. So a system that wins more is more intent-aligned, and is also more capable.
(2) For example, the Irving et al. (2018) paper by a team at OpenAI proposes debate as a way to improve safety and alignment, where alignment is defined as aligning with human goals. However, debate also improved the accuracy of image classification in the paper's experiments, and therefore also improved capabilities.
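For concreteness, here is a rough sketch of the debate setup as I understand it from that paper, heavily simplified; `choose_pixel` and `judge` below are placeholder names of my own, not the paper's code:

```python
# Rough sketch of the paper's debate-on-image-classification experiment,
# heavily simplified. choose_pixel and judge are placeholder names of mine:
# in the paper the debaters are search/learning-based agents and the judge
# is a classifier trained to predict from a few revealed pixels.
import random

def choose_pixel(image, revealed, label):
    """Placeholder debater policy: reveal a random unrevealed pixel.
    A real debater would pick pixels that support its claimed label."""
    options = [(r, c) for r in range(len(image)) for c in range(len(image[0]))
               if (r, c) not in revealed]
    return random.choice(options)

def debate(image, true_label, lie_label, judge, n_rounds=6):
    """Two debaters alternately reveal pixels (one argues for the true label,
    one for a lie); the judge then classifies from the sparse evidence."""
    revealed = {}  # (row, col) -> pixel value
    for round_idx in range(n_rounds):
        debater_label = true_label if round_idx % 2 == 0 else lie_label
        r, c = choose_pixel(image, revealed, debater_label)
        revealed[(r, c)] = image[r][c]
    return judge(revealed)
```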
Similarly, reinforcement learning from human feedback (RLHF) was initially presented as an alignment strategy, but my loose impression is that it also yielded significant capabilities improvements. There are many other examples in the literature of alignment strategies also improving capabilities.
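For reference, the core of the RLHF recipe is roughly: train a reward model on human preference comparisons, then fine-tune the policy against that reward model. Here is a minimal sketch of the preference-modeling step; names like `reward_model` are illustrative, not from any particular codebase:

```python
# Minimal sketch of the RLHF preference-modeling step (not the full pipeline).
# reward_model is assumed to map a batch of token sequences to one scalar
# reward per sequence; the names here are illustrative.
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry-style loss: the human-preferred completion should get
    a higher scalar reward than the rejected one."""
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# The policy is then fine-tuned (e.g. with PPO) to maximize the learned
# reward, usually with a KL penalty toward the original model to limit drift.
```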
This makes me wonder whether alignment is actually more neglected than capabilities work. AI companies want to make aligned systems because they are more useful.
How do people see the difference between alignment and capabilities?
I define “alignment” as “the AI is trying to do things that the AI designer intended for the AI to be trying to do”; see here for discussion.
If you define “capabilities” as “anything that would make an AI more useful / desirable to a person or company”, then alignment research would be by definition a subset of capabilities research.
But it’s a very small subset!
Examples of things that constitute capabilities progress but not alignment progress include: faster, better, more plentiful, and cheaper chips (and other related hardware like interconnects); the development of CUDA, PyTorch, etc.; and the invention of BatchNorm, Xavier initialization, the Adam optimizer, Transformers, and so on.
Here’s a concrete example. Suppose future AGI winds up working somewhat like human brain within-lifetime learning (which I claim is in the broad category of model-based reinforcement learning (RL)). A key ingredient in model-based RL is the reward function, which in the human brain case loosely corresponds to “innate drives”: pain being bad (other things equal), eating when hungry being good, and hundreds more things like that. If future AI works like that, then future AI programmers can put whatever innate drives they want into their future AIs. So there’s a technical problem of “what innate drives / reward function (if any) would lead to AIs that are honest, cooperative, kind, etc.?” And this problem is not only currently unsolved, but almost nobody is working on it.

Is solving this problem necessary to create economically useful, powerful AGIs? Unfortunately, it is not! Just look at high-functioning human sociopaths. If we made AGIs like that, we could get extremely competent agents, ones that can make and execute complicated plans, figure things out, do science, and invent tools to solve their problems, with none of the machinery that gives humans an intrinsic tendency toward compassion and morality. Such AGIs would nevertheless be very profitable to use … for exactly as long as they can be successfully prevented from breaking free and pursuing their own interests; if that containment ever fails, we’re in big trouble. (By analogy, human slaves are likewise not “aligned” with their masters, but they are still economically useful.) I have much more discussion and elaboration of all this here.
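To make the “reward function = innate drives” point concrete, here is a toy sketch in my own framing (the `State` fields, `innate_drives_reward`, and `world_model.rollout` are all hypothetical placeholders): nothing in the model-based-RL planning loop itself makes the agent honest or kind; that depends entirely on what the reward function rewards.

```python
# Toy sketch (my own illustrative framing). In model-based RL, the reward
# function is a separate, swappable component; "innate drives" here are
# hand-written terms chosen by the designer. Which terms (if any) would
# yield honest, cooperative, kind agents is the open problem described above.
from dataclasses import dataclass

@dataclass
class State:
    pain: float    # e.g. a body-damage signal
    hunger: float  # e.g. an energy-deficit signal
    # ... a real system would track many more drive-relevant variables

def innate_drives_reward(state: State) -> float:
    """Hand-designed reward terms standing in for innate drives."""
    reward = 0.0
    reward -= 2.0 * state.pain    # pain is bad, other things equal
    reward -= 1.0 * state.hunger  # reducing hunger (eating) is rewarded
    return reward

def plan(world_model, reward_fn, state, candidate_plans):
    """Core model-based-RL step: imagine rollouts with the world model,
    score them with the reward function, pick the best-scoring plan.
    Nothing here makes the agent honest or kind unless reward_fn does."""
    def score(candidate):
        return sum(reward_fn(s) for s in world_model.rollout(state, candidate))
    return max(candidate_plans, key=score)
```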
I think the attitude most people (including me) have is: “If we want to do technical work to reduce AI x-risk, then we should NOT be working on any technical problems that will almost definitely get solved ‘by default’, e.g. because they’re straightforward and lots of people are already working on them and mostly succeeding, or because there’s no way to make powerful AGI except via first solving those problems, etc.”
Then I would rephrase your original question as: “OK, if we shouldn’t be working on those types of technical problems above … then are there ...