
Most of the content of the post applies to both the short-term and the long-term future, and it can be read by anyone who has heard about AI alignment before.

0. Summary

AI alignment research centred around the control problem works well for futures shaped by out-of-control misaligned AI, but not that well for futures shaped by bad actors using AI. Section 1 contains a step-by-step argument for that claim. In section 2 I propose an alternative that aims at moral progress instead of direct risk reduction, and I reply to some objections. Section 3 will eventually contain technical details about the alternative, but it is not ready yet.

The appendix clarifies some minor ambiguities with terminology and links to other stuff.

1. Criticism of the main framework in AI alignment

1.1 What I mean by main framework

In short, it’s the rationale behind most work in AI alignment: solving the control problem to reduce existential risk. I am not talking about AI governance, nor about AI safety that has nothing to do with existential risk (e.g. safety of self-driving cars).

Here are the details, presented as a step-by-step argument.

  1. At some point in the future, we'll be able to design AIs that are very good at achieving their goals. (Capabilities premise)
  2. These AIs might have goals that are different from their designers' goals. (Misalignment premise)
  3. Therefore, very bad futures caused by out-of-control misaligned AI are possible. (From previous two premises)
  4. AI alignment research that is motivated by the previous argument often aims at making misalignment between AI and designer, or loss of control, less likely to happen or less severe. (Alignment research premise)

Common approaches include ensuring that the AI's goals are well specified and aligned with what the designer originally wanted, and making the AI learn our values by observing our behaviour. In case you are new to these ideas, two accessible books on the subject are [1,2].

  5. Therefore, AI alignment research improves the expected value of futures caused by out-of-control misaligned AI. (From 3 and 4)

By expected value I mean a measure of value that takes likelihood of events into account, and follows some intuitive rules such as "5% chance of extinction is worse than 1% chance of extinction". It need not be an explicit calculation, especially because it might be difficult to compare possible futures quantitatively, e.g. extinction vs dystopia.
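
To make the intuition concrete, here is a minimal sketch in Python; the outcome values and probabilities are made-up placeholders for illustration, not estimates defended anywhere in this post.

```python
# Toy illustration of the informal notion of expected value used above.
# The outcome values (-100 for extinction, 0 for an "ok" future) are
# made-up placeholders, not estimates from this post.

def expected_value(outcomes):
    """Probability-weighted sum of values over mutually exclusive outcomes."""
    return sum(p * v for p, v in outcomes)

extinction, ok_future = -100.0, 0.0

world_a = [(0.05, extinction), (0.95, ok_future)]  # 5% chance of extinction
world_b = [(0.01, extinction), (0.99, ok_future)]  # 1% chance of extinction

# The intuitive rule from the text: a 5% chance of extinction is worse than a 1% chance.
assert expected_value(world_a) < expected_value(world_b)
```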

I don't claim that all AI alignment research follows this framework; just that this is what motivates a decent amount (I would guess more than half) of work in AI alignment.

1.2 Response

I call this a response, and not a strict objection, because none of the points or inferences in the previous argument is rejected. Rather, some extra information is taken into account.

  6. Bad actors can use powerful controllable AI to bring about very bad futures and/or lock in their values. (Bad actors premise)

For more information about value lock-in, see chapter 4 of What We Owe The Future [3].

  7. Recall that alignment research motivated by the above points makes it easier to design AI that is controllable and whose goals are aligned with its designers' goals. As a consequence, bad actors might have an easier time using powerful controllable AI to achieve their goals. (From 4 and 6)

  8. Thus, even though AI alignment research improves the expected value of futures caused by uncontrolled AI, it reduces the expected value of futures caused by bad human actors using controlled AI to achieve their ends. (From 5 and 7)

This conclusion will seem more or less relevant depending on your beliefs about its different components.

An example: if you think that futures shaped by malevolent actors using AI are many times more likely to happen than futures shaped by uncontrolled AI, the response will strike you as very important; and vice versa if you think the opposite.

Another example: if you think that extinction is way worse than dystopic futures lasting a long time, the response won't affect you much—assuming that bad human actors are not fans of complete extinction.

If one considers both epistemic and moral uncertainty, the response works like a piece in the puzzle of how to evaluate AI alignment research. Other points can be made and balanced against the conclusion above, which cannot by itself establish that AI safety research is overall net good, bad, or neutral. At the same time, deciding to ignore it completely would likely be a case of biased, perhaps motivated, reasoning.

2. An alternative to the main framework

2.1 Moral progress as a goal of alignment research

Research that is not vulnerable to the response has to avoid point 7 above, i.e. it must not make it easier to create AI that helps malevolent actors achieve their goals.

Section 3 in Artificial Intelligence, Values, and Alignment [4] distinguishes six possible goals of AI alignment. The first three—alignment with instructions, expressed intentions, or revealed preferences—follow the main framework above. The other three focus less on the control problem, and more on finding an interpretation of ‘good’ and then making AI do good things. Thus, the latter three are less (or not at all) vulnerable to the response above.

If you are at all curious about AI safety, I suggest you have a look at Gabriel's paper; it contains many excellent ideas. But it misses one that is, for lack of a better word, excellenter. It's about building AIs that work like independent thinkers, then using them for moral progress.

This kind of AI does not do what its designer wants it to do, but rather does what it wants—to the same extent that humans do what they want and generally don’t limit themselves to following instructions from other humans. Therefore, the response above doesn’t apply.

The key point, which is also what makes this kind of AI useful, is that its behaviour is not completely arbitrary. Rather, this AI develops its own values as it learns about the world and thinks critically about them, much as humans do throughout their lives.

As with humans, the end result will depend on the initial conditions, the learning algorithm, and the learning environment. Experimenting with different variations of these may expose us to an even greater degree of cultural, intellectual, and moral diversity than we can observe today. One advantage of using AIs is that we can tweak them to remove biases of human reasoning, and thus obtain thinkers that are less irrational and less influenced by, for example, a person's skin colour. These AIs may even spot important injustices that are not widely recognised today; for comparison, consider how slavery was perceived centuries ago.
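
To make those three ingredients a bit more concrete, here is a toy sketch in Python. It is not a proposal for how to actually build an independent AI thinker (section 3 is still missing for a reason); the class, the update rule, and all the names below are hypothetical placeholders.

```python
# Toy skeleton whose only purpose is to make the three ingredients named above
# concrete: initial conditions, learning algorithm, and learning environment.
# It is NOT a proposal for building an independent AI thinker; the update rule
# and all names here are hypothetical placeholders.

import random


class IndependentThinker:
    def __init__(self, initial_values, seed=0):
        # Initial conditions: the values the agent starts out with.
        self.values = dict(initial_values)
        self.rng = random.Random(seed)

    def reflect(self, topic, impression):
        # Learning algorithm (placeholder): nudge how much the agent cares
        # about a topic towards the impression it formed of it.
        old = self.values.get(topic, 0.0)
        self.values[topic] = old + 0.1 * (impression - old)


def run(agent, environment, steps=1000):
    # Learning environment: the stream of experiences the agent is exposed to.
    for _ in range(steps):
        topic, impression = environment(agent.rng)
        agent.reflect(topic, impression)
    return agent.values


if __name__ == "__main__":
    # Two agents with the same initial values but different environments end up
    # with different values, which is the point made in the paragraph above.
    env_a = lambda rng: ("fairness", rng.uniform(0.5, 1.0))
    env_b = lambda rng: ("fairness", rng.uniform(-1.0, 0.0))
    print(run(IndependentThinker({"fairness": 0.0}), env_a))
    print(run(IndependentThinker({"fairness": 0.0}), env_b))
```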

Chapter 3 and the section Building a Morally Exploratory World in chapter 4 of [3] contain more information about the importance of value change and moral progress.

2.2 Some considerations and objections to the alternative

  • Even though I cited [3] on more than one occasion, I think that pretty much all the content of the post applies to both the short-term and the long-term future.
  • I do not claim that research towards building the independent AI thinkers of 2.1 above is the most effective AI alignment research intervention, nor that it is the most effective intervention for moral progress. I’ve only presented a problem of the main framework in AI alignment, and proposed an alternative that aims to avoid that problem. As someone else would say: beware surprising and suspicious convergence.
  • Research on AI that is able to think critically about goals may be useful to reduce AI risk, even if no independent AI thinkers are built, since it may lead to insights on how to design AI that doesn’t just optimise for a specified metric.
  • Objection: Bad actors could build or buy or select independent AI thinkers that agree with their goals and want to help them.

Reply: True to a certain extent, but it seems unlikely to happen and is easier said than done. I think it's unlikely to happen because bad actors would probably opt to use do-what-I-want AI, instead of producing a lot of independent AI thinkers in the hope that one of them happens to have goals very aligned with what the bad actors themselves want. And in the latter case, bad actors would also have to hope that the AI's goals won't change over time. Overall, this objection seems strong in futures in which research on independent AI thinkers has advanced to the point of outperforming research on do-what-I-want AI: a very unlikely scenario, considering that the latter kind of research is basically all AI research plus most AI alignment research.

  • Objection: The proposed alternative can actually create bad actors.

Reply: True, some independent AI thinkers might resemble, for example, dictators of the past, if the initial conditions, learning algorithm, and learning environment are appropriate. However, at least initially, they would not already be in a position of power with respect to humans, and they would also have to compete with other independent thinkers that have different goals. The main difference from section 1 above is that we are not talking about very powerful or superintelligent AI here. My guess is that bad actors created this way would be roughly as dangerous as human bad actors. Unfortunately, many new humans are born every day, and some of them have bad intentions.

  • Objection: The proposed alternative requires human-level AI.

Reply: One can continue the objection in different ways.

  • “...Therefore it’s dangerous.”: See last part of the above reply.
  • “...Therefore it isn’t very useful.”: One may claim this if they believe, for example, that we will build very powerful and superintelligent AI shortly after the first human-level AI is built, and that at that point we’ll be doomed to dystopia or extinction, so there won’t be time for AI experiments and moral progress. I don’t know how to reply to this objection without attacking the beliefs I've just mentioned. However, if you think the proposed alternative is not very useful for a different reason, you can leave a comment and I’ll try to reply.
  • “...”: Sometimes people end the objection there. If we were able to increase mind diversity and foster moral progress by using AI that is below human level of intelligence, that would be great! I don’t exclude that it’s possible, but it might require extra research.

3. Technical details about the alternative

This section is not ready yet. When it is ready, I'll publish the complete version on the Alignment Forum and leave a link here.

In short, the main point is that at the moment we don’t know how to build AI that thinks critically about goals as humans do. That’s one of the reasons why I am doing research on it.

As far as I know, no one else in AI safety is directly working on it. There is some research in the field of machine ethics, about Artificial Moral Agents, that has a similar motivation or objective. My guess is that, overall, very few people are working on this.

References

[1] Russell, Stuart. Human compatible: Artificial intelligence and the problem of control. Penguin, 2019.

[2] Christian, Brian. The alignment problem: How can machines learn human values? Atlantic Books, 2021.

[3] MacAskill, William. What We Owe the Future. Hachette UK, 2022.

[4] Gabriel, Iason. "Artificial intelligence, values, and alignment." Minds and Machines 30.3 (2020): 411-437.

Appendix

Terminology

  • When I use the term ‘AIs’, I mean multiple artificial intelligences, e.g. more than one AI program. When I use the term ‘AI’, I mean one or more artificial intelligences, or I may use it as a modifier (as in ‘AI safety’). The distinction is not particularly important, and in this post I simply use what seems more appropriate to the context.
  • When I write “by expected value I mean a measure of value […]”, I use ‘measure’ with its common-sense meaning in everyday language, not as the mathematical definition of measure.
    • I’m assuming extinction is bad, as you can guess from that paragraph. You might think otherwise and that’s fine: if you believe extinction is not bad, then you probably don’t like x-risk motivated research in the first place and you don’t need the argument in section 1 to evaluate it.
  • Value lock-in, as defined in Chapter 4 of What We Owe The Future: "an event that causes a single value system, or set of value systems, to persist for an extremely long time."

Other stuff

You can find more criticism of AI safety from EAs here. The difference from this post is that there are many more arguments and ideas, but they are less structured.

In the past I wrote a short comparison between an idea similar to 2.1 and other alignment approaches; you can find it here.

 

This work was supported by CEEALAR, but these are not CEEALAR's opinions. Note also that CEEALAR is not funding me to insert questionable humour into my posts: I do that on my own initiative.

Thanks to Charlie Steiner for feedback.
 

Comments

[anonymous]

Tl;dr As far as you know, you're the only person in the world directly working on how to build AI that's capable of making moral progress i.e. thinking critically about goals as humans do.

(I find this pretty surprising and worrying so wanted to highlight.)

Maybe "only person in the world" is a bit excessive :)

As far as I know, no one else in AI safety is directly working on it. There is some research in the field of machine ethics, about Artificial Moral Agents, that has a similar motivation or objective. My guess is that, overall, very few people are working on this.

[anonymous]

I dunno, I still think my summary works. (To be clear, I wasn't trying to be like, "You must be exaggerating, tsk tsk," - I think you're being honest and for me it's the most important part of your post so I wanted to draw attention to it.)
