Criticism of the main framework in AI alignment

Michele Campolo

Most of the content of this post applies to both the short-term and the long-term future, and can be read by anyone who has heard about AI alignment before.

0. Summary

AI alignment research centred around the control problem works well for futures shaped by out-of-control misaligned AI, but not that well for futures shaped by bad actors using AI. Section 1 contains a step-by-step argument for that claim. In section 2 I propose an alternative which aims at moral progress instead of direct risk reduction, and I reply to some objections. I will give technical details about the alternative at some point in the future, in section 3.

The appendix clarifies some minor ambiguities with terminology and links to other stuff.

1. Criticism of the main framework in AI alignment

1.1 What I mean by main framework

In short, it’s the rationale behind most work in AI alignment: solving the control problem to reduce existential risk. I am not talking about AI governance, nor about AI safety that has nothing to do with existential risk (e.g. safety of self-driving cars).

Here are the details, presented as a step-by-step argument.

1. At some point in the future, we'll be able to design AIs that are very good at achieving their goals. (Capabilities premise)
2. These AIs might have goals that are different from their designers' goals. (Misalignment premise)
3. Therefore, very bad futures caused by out-of-control misaligned AI are possible. (From previous two premises)
4. AI alignment research that is motivated by the previous argument often aims at making misalignment between AI and designer, or loss of control, less likely to happen or less severe. (Alignment research premise)

Common approaches are ensuring that the goals of the AI are well specified and aligned with what the designer originally wanted, or making the AI learn our values by observing our behaviour. In case you are new to these ideas, two accessible books on the subject are [1,2].

5. Therefore, AI alignment research improves the expected value of bad futures caused by out-of-control misaligned AI. (From 3 and 4).

By expected value I mean a measure of value that takes likelihood of events into account, and follows some intuitive rules such as "5% chance of extinction is worse than 1% chance of extinction". It need not be an explicit calculation, especially because it might be difficult to compare possible futures quantitatively, e.g. extinction vs dystopia.

I don't claim that all AI alignment research follows this framework; just that this is what motivates a decent amount (I would guess more than half) of work in AI alignment.

1.2 Response

I call this a response, and not a strict objection, because none of the points or inferences in the previous argument is rejected. Rather, some extra information is taken into account.

6. Bad actors can use powerful controllable AI to bring about very bad futures and/or lock in their values. (Bad actors premise)

For more information about value lock-in, see chapter 4 of What We Owe The Future [3].

7. Recall that alignment research motivated by the above points makes it easier to design AI that is controllable and whose goals are aligned with its designers' goals. As a consequence, bad actors might have an easier time using powerful controllable AI to achieve their goals. (From 4 and 6)

8. Thus, even though AI alignment research improves the expected value of futures caused by uncontrolled AI, it reduces the expected value of futures caused by bad human actors using controlled AI to achieve their ends. (From 5 and 7)

This conclusion will seem more, or less, relevant depending on the beliefs you have about its different components.

An example: if you think that futures shaped by malevolent actors using AI are many times more likely to happen than futures shaped by uncontrolled AI, the response will strike you as very important; and vice versa if you think the opposite.

Another example: if you think that extinction is way worse than dystopic futures lasting a long time, the response won't affect you much—assuming that bad human actors are not fans of complete extinction.
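The dependence on beliefs described in the two examples above can be made concrete with a toy calculation. This is only an illustrative sketch: the function `net_change`, its parameters, and every number below are hypothetical and chosen for illustration, not estimates of anything. As noted in 1.1, expected value need not be an explicit calculation, so read this as a picture of the reasoning rather than a model to rely on.

```python
def net_change(weight_uncontrolled, weight_bad_actors, gain, loss):
    """Toy net effect of control-focused alignment research on expected value.

    All inputs are hypothetical, illustrative quantities:
      weight_uncontrolled: probability you assign to futures shaped by
                           out-of-control misaligned AI
      weight_bad_actors:   probability you assign to futures shaped by
                           bad actors using controllable AI
      gain: how much the research improves the first kind of future (point 5)
      loss: how much the research worsens the second kind of future (point 8)
    """
    return weight_uncontrolled * gain - weight_bad_actors * loss


# Reader A thinks bad-actor futures are ten times more likely
# than uncontrolled-AI futures.
reader_a = net_change(weight_uncontrolled=0.02, weight_bad_actors=0.20,
                      gain=10.0, loss=5.0)

# Reader B thinks the opposite.
reader_b = net_change(weight_uncontrolled=0.20, weight_bad_actors=0.02,
                      gain=10.0, loss=5.0)

# Reader C gives both scenarios equal probability, but thinks extinction
# (assumed here to come mainly from uncontrolled AI) is far worse than a
# long-lasting dystopia, so the gain term dwarfs the loss term.
reader_c = net_change(weight_uncontrolled=0.10, weight_bad_actors=0.10,
                      gain=100.0, loss=1.0)

for name, value in [("A", reader_a), ("B", reader_b), ("C", reader_c)]:
    verdict = "net positive" if value > 0 else "net negative"
    print(f"Reader {name}: {verdict} under these assumed numbers")
```

With these made-up inputs, the response matters a lot to reader A and very little to readers B and C. The sign of the result is driven entirely by the assumed inputs, which is the point of the two examples: the argument in 1.2 is one term in the evaluation, not the whole evaluation.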

If one considers both epistemic and moral uncertainty, the response works like a piece in the puzzle of how to evaluate AI alignment research. Other points can be made and balanced against the conclusion above, which can't establish by itself that AI safety research is overall net good or bad or neutral. At the same time, deciding to completely ignore it would likely be a case of biased reasoning, possibly motivated reasoning.

2. An alternative to the main framework

2.1 Moral progress as a goal of alignment research

Research that is not vulnerable to the response has to avoid point 7 above, i.e. it must not make it easier to create AI that helps malevolent actors achieve their goals.

Section 3 in Artificial Intelligence, Values, and Alignment [4] distinguishes six possible goals of AI alignment. The first three—alignment with instructions, expressed intentions, or revealed preferences—follow the main framework above. The other three focus less on the control problem, and more on finding an interpretation of ‘good’ and then making AI do good things. Thus, the latter three are less (or not at all) vulnerable to the response above.

If you are at all curious about AI safety, I suggest that you have a look at Gabriel's paper, which contains many excellent ideas. But it misses one that is, for lack of a better word, excellenter. It’s about building AIs that work like independent thinkers, then using them for moral progress.

This kind of AI does not do what its designer wants it to do, but rather does what it wants—to the same extent that humans do what they want and generally don’t limit themselves to following instructions from other humans. Therefore, the response above doesn’t apply.

The key point, which is also what makes this kind of AI useful, is that its behaviour is not completely arbitrary. Rather, this AI develops its own values as it learns about the world and thinks critically about them, much as humans do over the course of their lives.

As happens with humans, the end result will depend on the initial conditions, the learning algorithm, and the learning environment. Experimenting with different variations of these may expose us to an even greater degree of cultural, intellectual, and moral diversity than what we can observe today. One of the advantages of using AIs is that we can tweak them to remove biases of human reasoning, and thus obtain thinkers that are less irrational and less influenced by, for example, a person’s skin colour. These AIs may even spot important injustices that are not widely recognised today; for comparison, consider how slavery was perceived centuries ago.

Chapter 3 and the section Building a Morally Exploratory World in chapter 4 of [3] contain more information about the importance of values change and moral progress.

2.2 Some considerations and objections to the alternative

- Even though I cited [3] on more than one occasion, I think that pretty much all the content of the post applies to both the short-term and the long-term future.
- I do not claim that research towards building the independent AI thinkers of 2.1 above is the most effective AI alignment research intervention, nor that it is the most effective intervention for moral progress. I’ve only presented a problem of the main framework in AI alignment, and proposed an alternative that aims to avoid that problem. As someone else would say: beware surprising and suspicious convergence.
- Research on AI that is able to think critically about goals may be useful to reduce AI risk, even if no independent AI thinkers are built, since it may lead to insights on how to design AI that doesn’t just optimise for a specified metric.

Objection: Bad actors could build or buy or select independent AI thinkers that agree with their goals and want to help them.

Reply: True to a certain extent, but it seems unlikely to happen and is easier said than done. I think it’s unlikely to happen because bad actors would probably opt to use do-what-I-want AI, instead of producing a lot of independent AI thinkers with the hope that one of them happens to have goals that are very aligned with what the bad actors themselves want. And in the latter case, bad actors would also have to hope that the AI's goals won’t change over time. Overall, this objection seems strong only in futures in which research on independent AI thinkers has advanced to the point of outperforming research on do-what-I-want AI: a very unlikely scenario, considering that the latter kind of research is basically all AI research plus most AI alignment research.

Objection: The proposed alternative can actually create bad actors.

Reply: True, some independent AI thinkers might resemble, for example, dictators of the past, given the right initial conditions, learning algorithm, and learning environment. However, at least initially, they would not already be in a position of power with respect to other humans, and they would also have to compete with the other independent thinkers that have different goals. The main difference with section 1 above is that we are not talking about very powerful or superintelligent AI here. My guess is that bad actors created this way would be roughly as dangerous as human bad actors. Unfortunately, many new humans are born every day, and some of them have bad intentions.

Objection: The proposed alternative requires human-level AI.

Reply: One can continue the objection in different ways.

- “...Therefore it’s dangerous.”: See last part of the above reply.
- “...Therefore it isn’t very useful.”: One may claim this if they believe, for example, that we will build very powerful and superintelligent AI shortly after the first human-level AI is built, and that at that point we’ll be doomed to dystopia or extinction, so there won’t be time for AI experiments and moral progress. I don’t know how to reply to this objection without attacking the beliefs I've just mentioned. However, if you think the proposed alternative is not very useful for a different reason, you can leave a comment and I’ll try to reply.
- “...”: Sometimes people end the objection there. If we were able to increase mind diversity and foster moral progress by using AI that is below human level of intelligence, that would be great! I don’t exclude that it’s possible, but it might require extra research.

3. Technical details about the alternative

This section is not ready yet. When it is ready, I’ll publish the complete version on the Alignment Forum and leave a link here.

In short, the main point is that at the moment we don’t know how to build AI that thinks critically about goals as humans do. That’s one of the reasons why I am doing research on it.

As far as I know, no one else in AI safety is directly working on it. There is some research in the field of machine ethics, about Artificial Moral Agents, that has a similar motivation or objective. My guess is that, overall, very few people are working on this.

Update: you can find some details here. I'll also publish a follow-up post to that one, with more guidelines on how to build AI that is capable of unbiased moral reasoning.

References

[1] Russell, Stuart. Human compatible: Artificial intelligence and the problem of control. Penguin, 2019.

[2] Christian, Brian. The alignment problem: How can machines learn human values?. Atlantic Books, 2021.

[3] MacAskill, William. What We Owe the Future. Hachette UK, 2022.

[4] Gabriel, Iason. "Artificial intelligence, values, and alignment." Minds and machines 30.3 (2020): 411-437.

Appendix

Terminology

- When I use the term ‘AIs’, I mean multiple artificial intelligences, e.g. more than one AI program. When I use the term ‘AI’, I mean one or more artificial intelligences, or I may use it as a modifier (as in ‘AI safety’). The distinction is not particularly important, and in this post I simply use what seems more appropriate to the context.
- When I write “by expected value I mean a measure of value […]”, I use ‘measure’ with its common-sense meaning in everyday language, not as the mathematical definition of measure.
  - I’m assuming extinction is bad, as you can guess from that paragraph. You might think otherwise and that’s fine: if you believe extinction is not bad, then you probably don’t like x-risk motivated research in the first place and you don’t need the argument in section 1 to evaluate it.
- Value lock-in, as defined in Chapter 4 of What We Owe The Future: "an event that causes a single value system, or set of value systems, to persist for an extremely long time."

Other stuff

You can find more criticism of AI safety from EAs here. The difference with this post is that there are many more arguments and ideas, but they are less structured.

In the past I wrote a short comparison between an idea similar to 2.1 and other alignment approaches; you can find it here.

This work was supported by CEEALAR, but these are not CEEALAR’s opinions. Note also that CEEALAR doesn't support me in inserting questionable humour into my posts: I do it on my own initiative.

Thanks to Charlie Steiner for feedback.

Tristan Katz (Jul 19 2025)

Another very late comment here :)

So as with your post on 'Free Agents', I believe that thinking about this is important, because it presents a potential way to align AI if we ourselves are unsure about the values that align AI.

But I'm not sure I'm convinced by the main reason given in this post: that if AI is controllable, bad agents will be able to use it malevolently. The goal of alignment research is usually to align AI with the values or goals of the designer, and not anyone who uses it. LLMs today already refuse to do many things you might want them to do. So if technical research is successful, I expect it would be hard for malevolent actors to use the same AI in bad ways.

Maybe it won't be impossible - but what seems more likely is that they would simply use the most advanced AI research to build their own AI for malevolent purposes, allowing them to pursue their goals with far more ease. And if they were to do that, then they would simply not train it to do its own moral reasoning. Which leads to another premise of the 'main framework' that you've missed - most people working in alignment (in my own experience) assume that once someone 'wins' the AGI/ASI race, they will be able to use that AI to control or prevent the development of other potentially dangerous AIs. For that reason, it may be sufficient for the AI to act according to the values of its developers (assuming they have good values!), rather than carrying out its own moral reasoning.

To be clear, I don't think it's likely that AI developers will be able to identify the exactly right values to align AGI to. But that's a different concern to what you've expressed here. So I do think that developing AI moral reasoning might be valuable, but I'm not convinced that it's valuable in order to prevent malevolent actors using AI.

Michele Campolo (Jul 21 2025)

I hadn't considered the narrative you bring up here when I wrote the post; it is interesting. As you write, it relies on the assumption that

"once someone 'wins' the AGI/ASI race, they will be able to use that AI to control or prevent the development of other potentially dangerous AIs"

Here we are entering the realm of forecasting stuff about world politics — stuff I am definitely not an expert on. As far as I know, the probability of that scenario could be extremely low. I can also think of alternative scenarios that don't seem obviously absurd, so I doubt that the probability is extremely high, but it's hard for me to say much more than that. Anyway, as you said, AI moral reasoning might be valuable in that scenario as well.

"but I'm not convinced that it's valuable in order to prevent malevolent actors using AI."

That's a bit too much: I don't think I claimed that moral reasoning in AI can directly prevent that. It seems that in order to prevent malevolent actors from using AI for bad purposes we would have to either stop AI research completely, because it is not only alignment research that works on the control problem but also standard AI research; or ensure that bad actors never get access to powerful and controllable AI, which also seems hard to do and not something AI moral reasoning can help with.

The weaker claim I made in the post is that research on moral reasoning in AI is less likely to help malevolent actors use AI for bad purposes (and/or likely to help them to a lesser degree) than research that aims to make AI controllable.

Tristan Katz (Jul 22 2025)

Regarding your last point: I see. I thought this was an argument for "alignment via moral reasoning as an addition to alignment via control", not "alignment via moral reasoning instead of alignment via control." So you would hope that alignment via moral reasoning would displace or replace alignment via control.

In that case, your argument is plausible but... quite hopeful? I'm sure many people will pursue control methods regardless. I suppose you might argue that, if enough people buy your argument, then research on AI that is merely controlled will advance more slowly, and research on AI that does its own moral reasoning, and is therefore harder to misuse, would advance faster or at least in parallel. Then I would accept that this might reduce the chance of malevolent misuse, but that's quite a hopeful scenario! In less hopeful scenarios, I am unsure if people concerned with malevolent misuse ought to pursue this kind of work, or if they wouldn't be better off simply advocating for a pause/slow down.

Michele Campolo (Jul 22 2025)

In short, I am not hoping for a specific outcome, and I can't take into account every single scenario. If someone starts giving more credit to research on moral reasoning in AI after reading this, that's already enough, considering that the topic doesn't seem to be popular within AI alignment, and it was even more niche at the time I wrote this post.

Tristan Katz (Jul 22 2025)

Sure! And like I said, I do think this is valuable: it just seems more obviously valuable as a way to ensure the best outcomes (aligned AI), rather than as a means to avoid the worst outcomes.

[anonymous] (Sep 2 2022)

Tl;dr: As far as you know, you're the only person in the world directly working on how to build AI that's capable of making moral progress, i.e. thinking critically about goals as humans do.

(I find this pretty surprising and worrying so wanted to highlight.)

Michele Campolo (Sep 2 2022)

Maybe "only person in the world" is a bit excessive :)

[anonymous] (Sep 6 2022)

I dunno, I still think my summary works. (To be clear, I wasn't trying to be like, "You must be exaggerating, tsk tsk," - I think you're being honest and for me it's the most important part of your post so I wanted to draw attention to it.)

Michele Campolo (Sep 6 2022)

Thank you!

Effective Altruism Forum