In the AI safety literature, AI alignment is often presented as conceptually distinct from capabilities. However, (1) the distinction seems somewhat fuzzy and (2) many techniques that are supposed to improve alignment also improve capabilities.
(1) The distinction is fuzzy because one common way of defining alignment is getting an AI system to do what the programmer or user intends. But programmers intend for systems to be capable: e.g., we want chess systems to win at chess. So a system that wins more is more intent-aligned, and is also more capable.
(2) For example, the Irving et al. (2018) paper by a team at OpenAI proposes debate as a way to improve safety and alignment, where alignment is defined as aligning with human goals. However, debate also improved the accuracy of image classification in the paper's experiments, and therefore also improved capabilities.
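For concreteness, here is a rough sketch of the debate setup as I understand it from that paper, heavily simplified; `choose_pixel` and `judge` below are placeholder names of my own, not the paper's code:

```python
# Rough sketch of the paper's debate-on-image-classification experiment,
# heavily simplified. choose_pixel and judge are placeholder names of mine:
# in the paper the debaters are search/learning-based agents and the judge
# is a classifier trained to predict from a few revealed pixels.
import random

def choose_pixel(image, revealed, label):
    """Placeholder debater policy: reveal a random unrevealed pixel.
    A real debater would pick pixels that support its claimed label."""
    options = [(r, c) for r in range(len(image)) for c in range(len(image[0]))
               if (r, c) not in revealed]
    return random.choice(options)

def debate(image, true_label, lie_label, judge, n_rounds=6):
    """Two debaters alternately reveal pixels (one argues for the true label,
    one for a lie); the judge then classifies from the sparse evidence."""
    revealed = {}  # (row, col) -> pixel value
    for round_idx in range(n_rounds):
        debater_label = true_label if round_idx % 2 == 0 else lie_label
        r, c = choose_pixel(image, revealed, debater_label)
        revealed[(r, c)] = image[r][c]
    return judge(revealed)
```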
Similarly, reinforcement learning from human feedback (RLHF) was initially presented as an alignment strategy, but my loose impression is that it also yielded significant capabilities improvements. There are many other examples in the literature of alignment strategies also improving capabilities.
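For reference, the core of the RLHF recipe is roughly: train a reward model on human preference comparisons, then fine-tune the policy against that reward model. Here is a minimal sketch of the preference-modeling step; names like `reward_model` are illustrative, not from any particular codebase:

```python
# Minimal sketch of the RLHF preference-modeling step (not the full pipeline).
# reward_model is assumed to map a batch of token sequences to one scalar
# reward per sequence; the names here are illustrative.
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry-style loss: the human-preferred completion should get
    a higher scalar reward than the rejected one."""
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# The policy is then fine-tuned (e.g. with PPO) to maximize the learned
# reward, usually with a KL penalty toward the original model to limit drift.
```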
This makes me wonder whether alignment is actually more neglected than capabilities work. AI companies want to make aligned systems because they are more useful.
How do people see the difference between alignment and capabilities?
I define “alignment” as “the AI is trying to do things that the AI designer intended for the AI to be trying to do”; see here for discussion.
If you define “capabilities” as “anything that would make an AI more useful / desirable to a person or company”, then alignment research would be by definition a subset of capabilities research.
But it’s a very small subset!
Examples of things that constitute capabilities progress but not alignment progress include: faster, better, more plentiful, and cheaper chips (and other related hardware like interconnects); the development of CUDA, PyTorch, etc.; and the invention of BatchNorm, Xavier initialization, the Adam optimizer, Transformers, and so on.
Here’s a concrete example. Suppose future AGI winds up working somewhat like human brain within-lifetime learning (which I claim is in the broad category of model-based reinforcement learning (RL)). A key ingredient in model-based RL is the reward function, which in the human brain case loosely corresponds to “innate drives”: pain being bad (other things equal), eating when hungry being good, and hundreds more things like that. If future AI works like that, then future AI programmers can put whatever innate drives they want into their future AIs. So there’s a technical problem of “what innate drives / reward function (if any) would lead to AIs that are honest, cooperative, kind, etc.?” And this problem is not only currently unsolved, but almost nobody is working on it.

Is solving this problem necessary to create economically useful, powerful AGIs? Unfortunately, it is not! Just look at high-functioning human sociopaths. If we made AGIs like that, we could get extremely competent agents, ones that can make and execute complicated plans, figure things out, do science, and invent tools to solve their problems, with none of the machinery that gives humans an intrinsic tendency toward compassion and morality. Such AGIs would nevertheless be very profitable to use … for exactly as long as they can be successfully prevented from breaking free and pursuing their own interests; if that containment ever fails, we’re in big trouble. (By analogy, human slaves are likewise not “aligned” with their masters, but they are still economically useful.) I have much more discussion and elaboration of all this here.
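To make the “reward function = innate drives” point concrete, here is a toy sketch in my own framing (the `State` fields, `innate_drives_reward`, and `world_model.rollout` are all hypothetical placeholders): nothing in the model-based-RL planning loop itself makes the agent honest or kind; that depends entirely on what the reward function rewards.

```python
# Toy sketch (my own illustrative framing). In model-based RL, the reward
# function is a separate, swappable component; "innate drives" here are
# hand-written terms chosen by the designer. Which terms (if any) would
# yield honest, cooperative, kind agents is the open problem described above.
from dataclasses import dataclass

@dataclass
class State:
    pain: float    # e.g. a body-damage signal
    hunger: float  # e.g. an energy-deficit signal
    # ... a real system would track many more drive-relevant variables

def innate_drives_reward(state: State) -> float:
    """Hand-designed reward terms standing in for innate drives."""
    reward = 0.0
    reward -= 2.0 * state.pain    # pain is bad, other things equal
    reward -= 1.0 * state.hunger  # reducing hunger (eating) is rewarded
    return reward

def plan(world_model, reward_fn, state, candidate_plans):
    """Core model-based-RL step: imagine rollouts with the world model,
    score them with the reward function, pick the best-scoring plan.
    Nothing here makes the agent honest or kind unless reward_fn does."""
    def score(candidate):
        return sum(reward_fn(s) for s in world_model.rollout(state, candidate))
    return max(candidate_plans, key=score)
```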
I think the attitude most people (including me) have is: “If we want to do technical work to reduce AI x-risk, then we should NOT be working on any technical problems that will almost definitely get solved ‘by default’, e.g. because they’re straightforward and lots of people are already working on them and mostly succeeding, or because there’s no way to make powerful AGI except via first solving those problems, etc.”
Then I would rephrase your original question as: “OK, if we shouldn’t be working on those types of technical problems above … then are there ...