In the AI safety literature, AI alignment is often presented as conceptually distinct from capabilities. However, (1) the distinction seems somewhat fuzzy and (2) many techniques that are supposed to improve alignment also improve capabilities.
(1) The distinction is fuzzy because one common way of defining alignment is getting an AI system to do what the programmer or user intends. However, programmers intend for systems to be capable: e.g. we want chess systems to win at chess. So, a system that wins more is more intent-aligned, and is also more capable.
(2) For example, this Irving et al. (2018) paper by a team at OpenAI proposes debate as a way to improve safety and alignment, where alignment is defined as aligning with human goals. However, debate also improved the accuracy of image classification in the paper's experiments, and therefore improved capabilities as well.
Similarly, reinforcement learning from human feedback (RLHF) was initially presented as an alignment strategy, but my loose impression is that it also produced significant capabilities improvements. There are many other examples in the literature of alignment strategies also improving capabilities.
This makes me wonder whether alignment is actually more neglected than capabilities work: AI companies already have an incentive to make aligned systems, because aligned systems are more useful.
How do people see the difference between alignment and capabilities?
Thanks, this is really useful. (I will try to go through this course, as well.)
I'm not sure the talk has it quite right, though. My take is that, on the most popular definitions of alignment and capabilities, the two are partly conceptually the same, depending on which intentions we are meant to be aligning with. So it's not that there is an 'alignment externality' from capabilities improvements, but rather that some alignment improvements just are capabilities improvements, by definition.