
Most of the content of the post applies to both the short-term and the long-term future, and can be read by anyone who has heard about AI alignment before.

0. Summary

AI alignment research centred around the control problem works well for futures shaped by out-of-control misaligned AI, but not that well for futures shaped by bad actors using AI. Section 1 contains a step-by-step argument for that claim. In section 2 I propose an alternative which aims at moral progress instead of direct risk reduction, and I reply to some objections. I will give technical details about the alternative at some point in the future, in section 3. 

The appendix clarifies some minor ambiguities with terminology and links to other stuff.

1. Criticism of the main framework in AI alignment

1.1 What I mean by main framework

In short, it’s the rationale behind most work in AI alignment: solving the control problem to reduce existential risk. I am not talking about AI governance, nor about AI safety that has nothing to do with existential risk (e.g. safety of self-driving cars).

Here are the details, presented as a step-by-step argument.

  1. At some point in the future, we'll be able to design AIs that are very good at achieving their goals. (Capabilities premise)
  2. These AIs might have goals that are different from their designers' goals. (Misalignment premise)
  3. Therefore, very bad futures caused by out-of-control misaligned AI are possible. (From previous two premises)
  4. AI alignment research that is motivated by the previous argument often aims at making misalignment between AI and designer, or loss of control, less likely to happen or less severe. (Alignment research premise)

Common approaches are ensuring that the goals of the AI are well specified and aligned with what the designer originally wanted, or making the AI learn our values by observing our behaviour. In case you are new to these ideas, two accessible books on the subject are [1,2].

  5. Therefore, AI alignment research improves the expected value of bad futures caused by out-of-control misaligned AI. (From 3 and 4)

By expected value I mean a measure of value that takes likelihood of events into account, and follows some intuitive rules such as "5% chance of extinction is worse than 1% chance of extinction". It need not be an explicit calculation, especially because it might be difficult to compare possible futures quantitatively, e.g. extinction vs dystopia.

I don't claim that all AI alignment research follows this framework; just that this is what motivates a decent amount (I would guess more than half) of work in AI alignment.

1.2 Response

I call this a response, and not a strict objection, because none of the points or inferences in the previous argument is rejected. Rather, some extra information is taken into account.

  6. Bad actors can use powerful controllable AI to bring about very bad futures and/or lock in their values. (Bad actors premise)

For more information about value lock-in, see chapter 4 of What We Owe The Future [3].

  7. Recall that alignment research motivated by the above points makes it easier to design AI that is controllable and whose goals are aligned with its designers' goals. As a consequence, bad actors might have an easier time using powerful controllable AI to achieve their goals. (From 4 and 6)

  8. Thus, even though AI alignment research improves the expected value of futures caused by uncontrolled AI, it reduces the expected value of futures caused by bad human actors using controlled AI to achieve their ends. (From 5 and 7)

This conclusion will seem more or less relevant depending on the beliefs you have about its different components.

An example: if you think that futures shaped by malevolent actors using AI are many times more likely to happen than futures shaped by uncontrolled AI, the response will strike you as very important; and vice versa if you think the opposite.

Another example: if you think that extinction is way worse than dystopic futures lasting a long time, the response won't affect you much—assuming that bad human actors are not fans of complete extinction.
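The two examples above can be made concrete with a toy calculation. This is only a sketch: the function name, its parameters, and every number below are hypothetical and chosen for illustration, not taken from any estimate. As noted in 1.1, expected value need not be an explicit calculation, so read this as an illustration of how the conclusion shifts with one's beliefs and not as a recommended model.

```python
def change_in_expected_badness(
    p_uncontrolled: float,       # assumed chance of a future shaped by uncontrolled AI
    p_bad_actor: float,          # assumed chance of a future shaped by bad actors using AI
    badness_uncontrolled: float, # assumed disvalue of the first kind of future (arbitrary units)
    badness_bad_actor: float,    # assumed disvalue of the second kind of future (arbitrary units)
    risk_cut: float,             # assumed fraction by which alignment research lowers p_uncontrolled
    risk_rise: float,            # assumed fraction by which it raises p_bad_actor (point 7)
) -> float:
    """Return expected badness after alignment research minus before.

    A negative result means the research looks net good under these
    assumptions; a positive result means it looks net bad.
    """
    before = (p_uncontrolled * badness_uncontrolled
              + p_bad_actor * badness_bad_actor)
    after = (p_uncontrolled * (1 - risk_cut) * badness_uncontrolled
             + p_bad_actor * (1 + risk_rise) * badness_bad_actor)
    return after - before


# Reader A: bad-actor futures are ten times more likely, both equally bad.
reader_a = change_in_expected_badness(
    p_uncontrolled=0.01, p_bad_actor=0.10,
    badness_uncontrolled=100, badness_bad_actor=100,
    risk_cut=0.5, risk_rise=0.2,
)

# Reader B: both equally likely, but extinction via uncontrolled AI
# is judged far worse than a long-lasting dystopia.
reader_b = change_in_expected_badness(
    p_uncontrolled=0.05, p_bad_actor=0.05,
    badness_uncontrolled=1000, badness_bad_actor=10,
    risk_cut=0.5, risk_rise=0.2,
)

print("Reader A:", "net bad" if reader_a > 0 else "net good")
print("Reader B:", "net bad" if reader_b > 0 else "net good")
```

With these made-up inputs, the response dominates for reader A and barely registers for reader B, which matches the two examples in prose.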

If one considers both epistemic and moral uncertainty, the response works like a piece in the puzzle of how to evaluate AI alignment research. Other points can be made and balanced against the conclusion above, which can't establish by itself that AI safety research is overall net good, bad, or neutral. At the same time, deciding to completely ignore it would likely be a case of biased, possibly motivated, reasoning.

2. An alternative to the main framework

2.1 Moral progress as a goal of alignment research

Research that is not vulnerable to the response has to avoid point 7 above, i.e. it must not make it easier to create AI that helps malevolent actors achieve their goals.

Section 3 in Artificial Intelligence, Values, and Alignment [4] distinguishes six possible goals of AI alignment. The first three—alignment with instructions, expressed intentions, or revealed preferences—follow the main framework above. The other three focus less on the control problem, and more on finding an interpretation of ‘good’ and then making AI do good things. Thus, the latter three are less (or not at all) vulnerable to the response above.

If you are at all curious about AI safety, I suggest that you have a look at Gabriel's paper: it contains many excellent ideas. But it misses one that is, for lack of a better word, excellenter. It’s about building AIs that work like independent thinkers, then using them for moral progress.

This kind of AI does not do what its designer wants it to do, but rather does what it wants—to the same extent that humans do what they want and generally don’t limit themselves to following instructions from other humans. Therefore, the response above doesn’t apply.

The key point, which is also what makes this kind of AI useful, is that its behaviour is not completely arbitrary. Rather, this AI develops its own values as it learns about the world and thinks critically about them, as humans do as they go through their lives.

As happens with humans, the end result will depend on the initial conditions, the learning algorithm, and the learning environment. Experimenting with different variations of these may expose us to an even greater degree of cultural, intellectual, and moral diversity than what we can observe today. One of the advantages of using AIs is that we can tweak them to remove biases of human reasoning, and thus obtain thinkers that are less irrational and less influenced by, for example, a person’s skin colour. These AIs may even spot important injustices that are not widely recognised today; for comparison, consider how slavery was perceived centuries ago.

Chapter 3 and the section Building a Morally Exploratory World in chapter 4 of [3] contain more information about the importance of values change and moral progress.

2.2 Some considerations and objections to the alternative

  • Even though I cited [3] on more than one occasion, I think that pretty much all the content of the post applies to both the short-term and the long-term future.
  • I do not claim that research towards building the independent AI thinkers of 2.1 above is the most effective AI alignment research intervention, nor that it is the most effective intervention for moral progress. I’ve only presented a problem of the main framework in AI alignment, and proposed an alternative that aims to avoid that problem. As someone else would say: beware surprising and suspicious convergence.
  • Research on AI that is able to think critically about goals may be useful to reduce AI risk, even if no independent AI thinkers are built, since it may lead to insights on how to design AI that doesn’t just optimise for a specified metric.
  • Objection: Bad actors could build or buy or select independent AI thinkers that agree with their goals and want to help them.

Reply: True to a certain extent, but seems unlikely to happen and easier said than done. I think it’s unlikely to happen because bad actors would probably opt to use do-what-I-want AI, instead of producing a lot of independent AI thinkers with the hope that one of them happens to have goals that are very aligned with what the bad actors themselves want. And in the latter case, bad actors would also have to hope that the AI goals won’t change over time. Overall, this objection seems strong in futures in which research on independent AI thinkers has advanced to the point of outperforming research on do-what-I-want AI: a very unlikely scenario, considering that the latter kind of research is basically all AI research + most AI alignment research.

  • Objection: The proposed alternative can actually create bad actors.

Reply: True, some independent AI thinkers might resemble, for example, dictators of the past, if the initial conditions, learning algorithm, and learning environment are appropriate. However, at least initially, they would not already be in a position of power with respect to other humans, and they would also have to compete with other independent thinkers that have different goals. The main difference with section 1 above is that we are not talking about very powerful or superintelligent AI here. My guess is that bad actors created this way would be roughly as dangerous as human bad actors. Unfortunately, many new humans are born every day, and some of them have bad intentions.

  • Objection: The proposed alternative requires human-level AI.

Reply: One can continue the objection in different ways.

    • “...Therefore it’s dangerous.”: See the last part of the reply above.
    • “...Therefore it isn’t very useful.”: One may claim this if they believe, for example, that we will build very powerful and superintelligent AI shortly after the first human-level AI is built, and that at that point we’ll be doomed to dystopia or extinction, so there won’t be time for AI experiments and moral progress. I don’t know how to reply to this objection without attacking the beliefs I've just mentioned. However, if you think the proposed alternative is not very useful for a different reason, you can leave a comment and I’ll try to reply.
    • “...”: Sometimes people end the objection there. If we were able to increase mind diversity and foster moral progress by using AI that is below human level of intelligence, that would be great! I don’t exclude that it’s possible, but it might require extra research.

3. Technical details about the alternative

This section is not ready yet. When it is ready, I’ll publish the complete version on the Alignment Forum and leave a link here.

In short, the main point is that at the moment we don’t know how to build AI that thinks critically about goals as humans do. That’s one of the reasons why I am doing research on it.

As far as I know, no one else in AI safety is directly working on it. There is some research in the field of machine ethics, about Artificial Moral Agents, that has a similar motivation or objective. My guess is that, overall, very few people are working on this.


[1] Russell, Stuart. Human Compatible: Artificial Intelligence and the Problem of Control. Penguin, 2019.

[2] Christian, Brian. The Alignment Problem: How Can Machines Learn Human Values? Atlantic Books, 2021.

[3] MacAskill, William. What We Owe the Future. Hachette UK, 2022.

[4] Gabriel, Iason. "Artificial Intelligence, Values, and Alignment." Minds and Machines 30.3 (2020): 411-437.



Appendix

  • When I use the term ‘AIs’, I mean multiple artificial intelligences, e.g. more than one AI program. When I use the term ‘AI’, I mean one or more artificial intelligences, or I may use it as a modifier (as in ‘AI safety’). The distinction is not particularly important, and in this post I simply use what seems more appropriate to the context.
  • When I write “by expected value I mean a measure of value […]”, I use ‘measure’ with its common-sense meaning in everyday language, not as the mathematical definition of measure.
    • I’m assuming extinction is bad, as you can guess from that paragraph. You might think otherwise and that’s fine: if you believe extinction is not bad, then you probably don’t like x-risk motivated research in the first place and you don’t need the argument in section 1 to evaluate it.
  • Value lock-in, as defined in Chapter 4 of What We Owe The Future: "an event that causes a single value system, or set of value systems, to persist for an extremely long time."

Other stuff

You can find more criticism of AI safety from EAs here. The difference with this post is that there are many more arguments and ideas, but they are less structured. 

In the past I wrote a short comparison between an idea similar to 2.1 and other alignment approaches; you can find it here.


This work was supported by CEEALAR, but these are not CEEALAR’s opinions. Note also that CEEALAR doesn't support me to insert questionable humour in my posts: I do it on my own initiative. 

Thanks to Charlie Steiner for feedback.


Tl;dr As far as you know, you're the only person in the world directly working on how to build AI that's capable of making moral progress i.e. thinking critically about goals as humans do.

(I find this pretty surprising and worrying so wanted to highlight.)

Maybe "only person in the world" is a bit excessive :)

"As far as I know, no one else in AI safety is directly working on it. There is some research in the field of machine ethics, about Artificial Moral Agents, that has a similar motivation or objective. My guess is that, overall, very few people are working on this."

I dunno, I still think my summary works. (To be clear, I wasn't trying to be like, "You must be exaggerating, tsk tsk," - I think you're being honest and for me it's the most important part of your post so I wanted to draw attention to it.)
