
Summary

I believe that advanced AI systems will likely be aligned with the goals of their human operators, at least in a narrow sense. I’ll give three main reasons for this:

  1. The transition to AI may happen in a way that does not give rise to the alignment problem as it’s usually conceived of.
  2. While work on the alignment problem appears neglected at this point, it’s likely that large amounts of resources will be used to tackle it if and when it becomes apparent that alignment is a serious problem.
  3. Even if the previous two points do not hold, we have already come up with a couple of smart approaches that seem fairly likely to lead to successful alignment.

This argument lends some support to work on non-technical interventions like moral circle expansion or improving AI-related policy, as well as work on special aspects of AI safety like decision theory or worst-case AI safety measures.

Comments

I find it unfortunate that people aren't using a common scale for estimating AI risk, which makes it hard to integrate different people's estimates, or even figure out who is relatively more optimistic or pessimistic. For example here's you (Tobias):

My inside view puts ~90% probability on successful alignment (by which I mean narrow alignment as defined below). Factoring in the views of other thoughtful people, some of which think alignment is far less likely, that number comes down to ~80%.

Robert Wiblin, based on interviews with Nick Bostrom, an anonymous leading professor of computer science, Jaan Tallinn, Jan Leike, Miles Brundage, Nate Soares, Daniel Dewey:

We estimate that the risk of a serious catastrophe caused by machine intelligence within the next 100 years is between 1 and 10%.

Paul Christiano:

I think there is a >1/3 chance that AI will be solidly superhuman within 20 subjective years, and that in those scenarios alignment destroys maybe 20% of the total value of the future

It seems to me that Robert's estimate is low relative to your inside view and Paul's, since you're both talking about failures of narrow alignment ("intent alignment" in Paul's current language), while Robert's "serious catastrophe caused by machine intelligence" seems much broader. But you update towards much higher risk based on "other thoughtful people" which makes me think that either your "other thoughtful people" or Robert's interviewees are not representative, or I'm confused about who is actually more optimistic or pessimistic. Either way it seems like there's some very valuable work to be done in coming up with a standard measure of AI risk and clarifying people's actual opinions.
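As a rough illustration of why a common scale would help (this is just my own back-of-the-envelope conversion, not how any of these numbers were produced): your inside view corresponds to ~10% probability of narrow alignment failure; Paul's numbers imply at least 1/3 × 20% ≈ 7% of the future's expected value lost to alignment failure, counting only the "solidly superhuman within 20 subjective years" scenarios; and Robert's 1-10% is the probability of any serious catastrophe caused by machine intelligence, a broader category. These aren't even in the same units (probability of failure vs. expected fraction of value lost), which is part of what makes the estimates hard to compare.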

"Between 1 and 10%" also feels surprisingly low to me for general AI-related catastrophes. I at least would have thought that experts are less optimistic than that.

But pending clarification, I wouldn't put much weight on this estimate, given that the interviews mentioned in the 80k problem area profile you link to seem to have informed the entire problem profile rather than this estimate specifically. So it's not clear, e.g., whether Nick Bostrom, an anonymous leading professor of computer science, Jaan Tallinn, Jan Leike, Miles Brundage, Nate Soares, and Daniel Dewey were each asked about the all-things-considered risk of an AI-related catastrophe.

Good point, I'll send a message to Robert Wiblin asking for clarification.

Great point – I agree that it would be valuable to have a common scale.

I'm a bit surprised by the 1-10% estimate. This seems very low, especially given that "serious catastrophe caused by machine intelligence" is broader than narrow alignment failure. If we include possibilities like serious value drift as new technologies emerge, or difficult AI-related cooperation and security problems, or economic dynamics riding roughshod over human values, then I'd put much more than 10% (plausibly more than 50%) on something not going well.

Regarding the "other thoughtful people" in my 80% estimate: I think it's very unclear who exactly one should update towards. What I had in mind is that many EAs who have thought about this appear to not have high confidence in successful narrow alignment (not clear if the median is >50%?), judging based on my impressions from interacting with people (which is obviously not representative). I felt that my opinion is quite contrarian relative to this, which is why I felt that I should be less confident than the inside view suggests, although as you say it's quite hard to grasp what people's opinions actually are.

On the other hand, one possible interpretation (but not the only one) of the relatively low level of concern for AI risk among the larger AI community and societal elites is that people are quite optimistic that "we'll know how to cross that bridge once we get to it".

I’m a bit surprised by the 1-10% estimate. This seems very low, especially given that “serious catastrophe caused by machine intelligence” is broader than narrow alignment failure.

Yeah, it's also much lower than my inside view, as well as what I thought a group of such interviewees would say. Aside from Lukas's explanation, I think maybe 1) the interviewees did not want to appear too alarmist (either personally or for EA as a whole) or 2) they weren't reporting their inside views but instead giving their estimates after updating towards others who have much lower risk estimates. Hopefully Robert Wiblin will see my email at some point and chime in with details of how the 1-10% figure was arrived at.

In this comment I engage with many of the object-level arguments in the post. I upvoted this post because I think it's useful to write down these arguments, but we should also consider the counterarguments.

(Also, BTW, I would have preferred the word "narrow" or something like it in the post title, because some people use "alignment" in a broad sense and as a result may misinterpret you as being more optimistic than you actually are.)

If the emergence of AI is gradual or distributed, then it is more plausible that safety issues can adequately be handled “as usual”, by reacting to issues as they arise, by extensive testing and engineering, and by incrementally designing systems to satisfy multiple constraints.

If the emergence of AI is gradual enough, it does seem that safety issues can be handled adequately, but even many people who think "soft takeoff" is likely don't seem to think that AI will come that slowly. To the extent that AI does emerge that slowly, that seems to cut across many other AI-related problem areas including ones mentioned in the Summary as alternatives to narrow alignment.

Also, distributed emergence of AI is likely not safer than centralized AI, because an "economy" of AIs would be even harder to control and harness towards human values than a single or small number of AI agents. An argument can be made that AI alignment work is valuable in part so that unified AI agents can be safely built, thereby heading off such a less controllable AI economy.

So it does not seem like "distributed" by itself buys any safety. I think our intuition that it does probably comes from a sense that "distributed" is correlated with "gradual". If you consider a fast and distributed rise of AI, does that really seem safer than a fast and centralized rise of AI?

While alignment looks neglected now, we should also take into account that huge amounts of resources will likely be invested if it becomes apparent that this is a serious problem (see also here).

This assumes that alignment work is highly parallelizable. If it's not, then doing more alignment work now can shift the whole alignment timeline forward, instead of just adding to the total amount of alignment work in a marginal way.

Strong economic incentives will push towards alignment: it’s not economically useful to have a powerful AI system that doesn’t reliably do what you want.

This only applies to short-term "alignment" and not to long-term / scalable alignment. That is, I have an economic incentive to build an AI that I can harness to give me short-term profits, even if that's at the expense of the long term value of the universe to humanity or human values. This could be done for example by creating an AI that is not at all aligned with my values and just giving it rewards/punishments so that it has a near-term instrumental reason to help me (similar to how other humans are useful to us even if they are not value aligned to us).

Existing approaches hold some promise

I have an issue with "approaches" (plural) here because as far as I can tell, everyone is converging to Paul Christiano's iterated amplification approach (except for MIRI which is doing more theoretical research). ETA: To be fair, perhaps iterated amplification should be viewed as a cluster of related approaches.

But the crux is that the notion of human values doesn’t need to be perfect to understand that humans do not approve of lock-ins, that humans would not approve of attempts to manipulate them, and so on.

I think we ourselves don't know how to reliably distinguish between "attempts to manipulate" and "attempts to help", so it would be hard for AIs to learn this. One problem is that our own manipulate/help classifier was trained on a narrow set of inputs (i.e., examples of other humans manipulating/helping) and will likely fail when applied to AIs due to distributional shift.
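As a minimal sketch of the failure mode (the "persuasion intensity" feature, the numbers, and the labels are all made up purely for illustration), a classifier that cleanly separates "help" from "manipulation" on the distribution it was trained on can be almost completely wrong on a shifted distribution:

```python
# Hypothetical toy example: a "manipulation detector" trained on human-range
# behaviour fails under distributional shift. The single feature and the
# labels are invented for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Training distribution: among humans, manipulation happens to show up as
# high "persuasion intensity", so this one feature separates the classes well.
x_train = rng.normal(loc=0.0, scale=1.0, size=(500, 1))
y_train = (x_train[:, 0] > 1.0).astype(int)   # 1 = "manipulation"

clf = LogisticRegression().fit(x_train, y_train)

# Shifted distribution: an AI that manipulates subtly, at *low* intensity.
x_shift = rng.normal(loc=-2.0, scale=0.5, size=(500, 1))
y_shift = np.ones(500, dtype=int)             # all of it is manipulation

print("accuracy on training distribution:", clf.score(x_train, y_train))  # high
print("accuracy on shifted distribution: ", clf.score(x_shift, y_shift))  # near zero
```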

Again, it’s not that hard to understand what it means to be a helpful assistant to somebody.

Same problem here: our own understanding of what it means to be a helpful assistant to somebody likely isn't robust to distributional shifts. I think this means we actually need to gain a broad/theoretical understanding of "corrigibility" or "helping" rather than having AIs just learn it from humans.

Thanks for the detailed comments!

(Also, BTW, I would have preferred the word "narrow" or something like it in the post title, because some people use "alignment" in a broad sense and as a result may misinterpret you as being more optimistic than you actually are.)

Good point – changed the title.

Also, distributed emergence of AI is likely not safer than centralized AI, because an "economy" of AIs would be even harder to control and harness towards human values than a single or small number of AI agents.

As long as we consider only narrow alignment, it does seem safer to me in that local misalignment or safety issues in individual systems would not immediately cause everything to break down, because such a system would (arguably) not be able to obtain a decisive strategic advantage and take over the world. So there'd be time to react.

But I agree with you that an economy-like scenario entails other safety issues, and aligning the entire "economy" with human (compromise) values might be very difficult. So I don't think this is safer overall, or at least it's not obvious. (From my suffering-focused perspective, distributed emergence of AI actually seems worse than a scenario of the form "a single system quickly takes over and forms a singleton", as the latter seems less likely to lead to conflict-related disvalue.)

This assumes that alignment work is highly parallelizable. If it's not, then doing more alignment work now can shift the whole alignment timeline forward, instead of just adding to the total amount of alignment work in a marginal way.

Yeah, I do think that alignment work is fairly parallelizable, and future work also has a (potentially very big) information advantage over current work, because future researchers will know more about what AI techniques look like. Is there any precedent for a new technology where work on safety issues was highly serial and where it was therefore crucial to start working on safety a long time in advance?

This only applies to short-term "alignment" and not to long-term / scalable alignment. That is, I have an economic incentive to build an AI that I can harness to give me short-term profits, even if that's at the expense of the long term value of the universe to humanity or human values. This could be done for example by creating an AI that is not at all aligned with my values and just giving it rewards/punishments so that it has a near-term instrumental reason to help me (similar to how other humans are useful to us even if they are not value aligned to us).

I think there are two different cases:

  • If the human actually cares only about short-term selfish gain, possibly at the expense of others, then this isn't a narrow alignment failure, it's a cooperation problem. (But I agree that it could be a serious issue).
  • If the human actually cares about the long term, then it appears that she's making a mistake by buying an AI system that is only aligned in the short term. So it comes down to human inadequacy – given sufficient information she'd buy a long-term aligned AI system instead, and AI companies would have an incentive to provide long-term aligned AI systems. Though of course the "sufficient information" part is crucial, and is a fairly strong assumption, as it may be hard to distinguish between "short-term alignment" and "real" alignment. I agree that this is another potentially serious problem.
I think we ourselves don't know how to reliably distinguish between "attempts to manipulate" and "attempts to help", so it would be hard for AIs to learn this. One problem is that our own manipulate/help classifier was trained on a narrow set of inputs (i.e., examples of other humans manipulating/helping) and will likely fail when applied to AIs due to distributional shift.

Interesting point. I think I still have an intuition that there's a fairly simple core to it, but I'm not sure how to best articulate this intuition.

I'm sympathetic to the general thrust of the argument, that we should be reasonably optimistic about "business-as-usual" leading to successful narrow alignment. I put particular weight on the second argument, that the AI research community will identify and be successful at solving these problems.

However, you mostly lost me in the third argument. You suggest using whatever state-of-the-art general purpose learning technique exists to model human values, and then optimise them. I'm pessimistic about this since it involves an adversarial relationship between the optimiser (e.g. an RL algorithm) and the learned reward function. This will work if the optimiser is weak and the reward model is strong. But if we are hypothesising a far improved reward learning technique, we should also assume far more powerful RL algorithms than we have today.

Currently, it seems like RL is generally an easier problem than learning a reward function. For example, current IRL algorithms will overfit the reward function to demonstrations in a high-dimensional environment. If you later optimize the reward with an RL algorithm, you get a policy which does well under the learned reward function, but terribly (often worse than random) on the ground-truth reward function. This is why you normally learn the policy jointly with the reward in a GAN-based approach. Regularizers for learning a good reward model (which can then generalize) are an active area of research; see e.g. the variational discriminator bottleneck. However, solving this in generality seems very hard. There's been little success in adversarial defences, a related problem, and there are theoretical reasons to believe adversarial examples will be present for any model class in high-dimensional environments.
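To make the failure mode concrete, here is a minimal numerical sketch (not any particular IRL algorithm; the linear reward model and the random-search "optimizer" are stand-ins I made up): a reward model fit to a handful of demonstrations in a high-dimensional space is then over-optimized, and the chosen state scores well on the learned reward but far worse than the demonstrations on the true reward.

```python
# Toy illustration of over-optimizing a learned reward model (all details
# hypothetical): few demonstrations + many dimensions => the reward model
# overfits, and a strong optimizer exploits it.
import numpy as np

rng = np.random.default_rng(0)
dim, n_demos = 50, 10

def true_reward(s):
    # Ground-truth reward prefers states close to the origin.
    return -np.sum(s ** 2, axis=-1)

# Near-optimal demonstration states (close to the origin).
demos = rng.normal(scale=0.1, size=(n_demos, dim))

# "Learned" linear reward: fit weights so that demonstration states score highly.
# With 10 demos in 50 dimensions, the weights badly overfit.
w = np.linalg.lstsq(demos, np.ones(n_demos), rcond=None)[0]
learned_reward = lambda s: s @ w

# Strong optimizer (stand-in for a powerful RL algorithm): search many candidate
# states and pick the one the learned reward model likes best.
candidates = rng.normal(scale=3.0, size=(100_000, dim))
best = candidates[np.argmax(learned_reward(candidates))]

print("learned reward of chosen state:", learned_reward(best))   # high
print("true reward of chosen state:   ", true_reward(best))      # far worse...
print("true reward of a demo state:   ", true_reward(demos[0]))  # ...than the demos
```

Learning the policy jointly with the reward, as in the GAN-based approaches, is essentially a way of keeping the optimizer and the reward model matched in strength.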

Overall, I'm optimistic about the research community solving these problems, but think that present techniques are far from adequate. Although improved general-purpose learning techniques will be important, I believe there will also need to be a concerted focus on solving alignment-related problems.

What do you think about technical interventions on these problems, and "moral uncertainty expansion" as a more cooperative alternative to "moral circle expansion"?

Working on these problems makes a lot of sense, and I'm not saying that the philosophical issues around what "human values" means will likely be solved by default.

I think increasing philosophical sophistication (or "moral uncertainty expansion") is a very good idea from many perspectives. (A direct comparison to moral circle expansion would also need to take relative tractability and importance into account, which seems unclear to me.)
