
This is a cross-post of a bonus edition of the Alignment Newsletter that I thought would be particularly interesting to this audience.

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can give some by commenting on this post.

Audio version here (may not be up yet).

Welcome to another special edition of the newsletter! In this edition, I summarize four conversations that AI Impacts had with researchers who were optimistic that AI safety would be solved "by default". (Note that one of the conversations was with me.)

While all four of these conversations covered very different topics, I think there were three main points of convergence. First, we were relatively unconvinced by the traditional arguments for AI risk, and found discontinuities relatively unlikely. Second, we were more optimistic about solving the problem in the future, when we know more about the problem and have more evidence about powerful AI systems. And finally, we were more optimistic that as we get more evidence of the problem in the future, the existing ML community will actually try to fix that problem.

The conversations

Conversation with Paul Christiano (Paul Christiano, Asya Bergal, Ronny Fernandez, and Robert Long) (summarized by Rohin): There can't be too many things that reduce the expected value of the future by 10%; if there were, there would be no expected value left. So, the prior that any particular thing has such an impact should be quite low. With AI in particular, obviously we're going to try to make AI systems that do what we want them to do. So starting from this position of optimism, we can then evaluate the arguments for doom. The two main arguments: first, we can't distinguish ahead of time between AIs that are trying to do the right thing, and AIs that are trying to kill us, because the latter will behave nicely until they can execute a treacherous turn. Second, since we don't have a crisp concept of "doing the right thing", we can't select AI systems on whether they are doing the right thing.

However, there are many "saving throws", or ways that the argument could break down, avoiding doom. Perhaps there's no problem at all, or perhaps we can cope with it with a little bit of effort, or perhaps we can coordinate to not build AIs that destroy value. Paul assigns a decent amount of probability to each of these (and other) saving throws, and any one of them suffices to avoid doom. This leads Paul to estimate that AI risk reduces the expected value of the future by roughly 10%, a relatively optimistic number. Since it is so neglected, concerted effort by longtermists could reduce it to 5%, making it still a very valuable area for impact. The main way he expects to change his mind is from evidence from more powerful AI systems, e.g. as we build more powerful AI systems, perhaps inner optimizer concerns will materialize and we'll see examples where an AI system executes a non-catastrophic treacherous turn.

Paul also believes that clean algorithmic problems are usually solvable in 10 years, or provably impossible, and early failures to solve a problem don't provide much evidence of the difficulty of the problem (unless they generate proofs of impossibility). So, the fact that we don't know how to solve alignment now doesn't provide very strong evidence that the problem is impossible. Even if the clean versions of the problem were impossible, that would suggest that the problem is much more messy, which requires more concerted effort to solve but also tends to be just a long list of relatively easy tasks to do. (In contrast, MIRI thinks that prosaic AGI alignment is probably impossible.)

Note that even finding out that the problem is impossible can help; it makes it more likely that we can all coordinate to not build dangerous AI systems, since no one wants to build an unaligned AI system. Paul thinks that right now the case for AI risk is not very compelling, and so people don't care much about it, but if we could generate more compelling arguments, then they would take it more seriously. If instead you think that the case is already compelling (as MIRI does), then you would be correspondingly more pessimistic about others taking the arguments seriously and coordinating to avoid building unaligned AI.

One potential reason MIRI is more doomy is that they take a somewhat broader view of AI safety: in particular, in addition to building an AI that is trying to do what you want it to do, they would also like to ensure that when the AI builds successors, it does so well. In contrast, Paul simply wants to leave the next generation of AI systems in at least as good a situation as we find ourselves in now, since they will be both better informed and more intelligent than we are. MIRI has also previously defined aligned AI as one that produces good outcomes when run, which is a much broader conception of the problem than Paul's. But probably the main disagreement between MIRI and ML researchers is that ML researchers expect that we'll try a bunch of stuff, and something will work out, whereas MIRI expects that the problem is really hard, such that trial and error will only get you solutions that appear to work.

Rohin's opinion: A general theme here seems to be that MIRI feels like they have very strong arguments, while Paul thinks that they're plausible arguments, but aren't extremely strong evidence. Simply having a lot more uncertainty leads Paul to be much more optimistic. I agree with most of this.

However, I do disagree with the point about "clean" problems. I agree that clean algorithmic problems are usually solved within 10 years or are provably impossible, but it doesn't seem to me like AI risk counts as a clean algorithmic problem: we don't have a nice formal statement of the problem that doesn't rely on intuitive concepts like "optimization", "trying to do something", etc. This suggests to me that AI risk is more "messy", and so may require more time to solve.

Conversation with Rohin Shah (Rohin Shah, Asya Bergal, Robert Long, and Sara Haxhia) (summarized by Rohin): The main reason I am optimistic about AI safety is that we will see problems in advance, and we will solve them, because nobody wants to build unaligned AI. A likely crux is that I think that the ML community will actually solve the problems, as opposed to applying a bandaid fix that doesn't scale. I don't know why there are different underlying intuitions here.

In addition, many of the classic arguments for AI safety involve a system that can be decomposed into an objective function and a world model, which I suspect will not be a good way to model future AI systems. In particular, current systems trained by RL look like a grab bag of heuristics that correlate well with obtaining high reward. I think that as AI systems become more powerful, the heuristics will become more and more general, but they still won't decompose naturally into an objective function, a world model, and search. In addition, we can look at humans as an example: we don't fully pursue convergent instrumental subgoals; for example, humans can be convinced to pursue different goals. This makes me more skeptical of traditional arguments.

I would guess that AI systems will become more interpretable in the future, as they start using the features / concepts / abstractions that humans are using. Eventually, sufficiently intelligent AI systems will probably find even better concepts that are alien to us, but if we only consider AI systems that are (say) 10x more intelligent than us, they will probably still be using human-understandable concepts. This should make alignment and oversight of these systems significantly easier. For significantly stronger systems, we should be delegating the problem to the AI systems that are 10x more intelligent than us. (This is very similar to the picture painted in Chris Olah’s views on AGI safety (AN #72), but that had not been published and I was not aware of Chris's views at the time of this conversation.)

I'm also less worried about race dynamics increasing accident risk than the median researcher. The benefit of racing a little bit faster is to have a little bit more power / control over the future, while also increasing the risk of extinction a little bit. This seems like a bad trade from each agent's perspective. (That is, the Nash equilibrium is for all agents to be cautious, because the potential upside of racing is small and the potential downside is large.) I'd be more worried if [AI risk is real AND not everyone agrees AI risk is real when we have powerful AI systems], or if the potential upside was larger (e.g. if racing a little more made it much more likely that you could achieve a decisive strategic advantage).

Overall, it feels like there's around 90% chance that AI would not cause x-risk without additional intervention by longtermists. The biggest disagreement between me and more pessimistic researchers is that I think gradual takeoff is much more likely than discontinuous takeoff (and in fact, the first, third and fourth paragraphs above are quite weak if there's a discontinuous takeoff). If I condition on discontinuous takeoff, then I mostly get very confused about what the world looks like, but I also get a lot more worried about AI risk, especially because the "AI is to humans as humans are to ants" analogy starts looking more accurate. In the interview I said 70% chance of doom in this world, but with way more uncertainty than any of the other credences, because I'm really confused about what that world looks like. Two other disagreements, besides the ones above: I don't buy Realism about rationality (AN #25), whereas I expect many pessimistic researchers do. I may also be more pessimistic about our ability to write proofs about fuzzy concepts like those that arise in alignment.

On timelines, I estimated a very rough 50% chance of AGI within 20 years, and 30-40% chance that it would be using "essentially current techniques" (which is obnoxiously hard to define). Conditional on both of those, I estimated 70% chance that it would be something like a mesa optimizer; mostly because optimization is a very useful instrumental strategy for solving many tasks, especially because gradient descent and other current algorithms are very weak optimization algorithms (relative to e.g. humans), and so learned optimization algorithms will be necessary to reach human levels of sample efficiency.

Rohin's opinion: Looking over this again, I'm realizing that I didn't emphasize enough that most of my optimism comes from the more outside view type considerations: that we'll get warning signs that the ML community won't ignore, and that the AI risk arguments are not watertight. The other parts are particular inside view disagreements that make me more optimistic, but they don't factor much into my optimism beyond being examples of how the meta considerations could play out. I'd recommend this comment of mine to get more of a sense of how the meta considerations factor into my thinking.

I was also glad to see that I still broadly agree with things I said ~5 months ago (since no major new opposing evidence has come up since then), though as I mentioned above, I would now change where I place emphasis.

Conversation with Robin Hanson (Robin Hanson, Asya Bergal, and Robert Long) (summarized by Rohin): The main theme of this conversation is that AI safety does not look particularly compelling on an outside view. Progress in most areas is relatively incremental and continuous; we should expect the same to be true for AI, suggesting that timelines should be quite long, on the order of centuries. The current AI boom looks similar to previous AI booms, which didn't amount to much in the past.

Timelines could be short if progress in AI were "lumpy", as in a FOOM scenario. This could happen if intelligence were one simple thing that just has to be discovered, but Robin expects that intelligence is actually a bunch of not-very-general tools that together let us do many things, and we simply have to find all of these tools, which will presumably not be lumpy. Most of the value from tools comes from more specific, narrow tools, and intelligence should be similar. In addition, the literature on human uniqueness suggests that it wasn't "raw intelligence" or small changes to brain architecture that made humans unique, but rather our ability to process culture (communicating via language, learning from others, etc.).

In any case, many researchers are now distancing themselves from the FOOM scenario, and are instead arguing that AI risk occurs due to standard principal-agency problems, in the situation where the agent (AI) is much smarter than the principal (human). Robin thinks that this doesn't agree with the existing literature on principal-agent problems, in which losses from principal-agent problems tend to be bounded, even when the agent is smarter than the principal.

You might think that since the stakes are so high, it's worth working on it anyway. Robin agrees that it's worth having a few people (say a hundred) pay attention to the problem, but doesn't think it's worth spending a lot of effort on it right now. Effort is much more effective and useful once the problem becomes clear, or once you are working with a concrete design; we have neither of these right now and so we should expect that most effort ends up being ineffective. It would be better if we saved our resources for the future, or if we spent time thinking about other ways that the future could go (as in his book, Age of Em).

It's especially bad that AI safety has thousands of "fans", because this leads to a "crying wolf" effect -- even if the researchers have subtle, nuanced beliefs, they cannot control the message that the fans convey, which will not be nuanced and will instead confidently predict doom. Then when doom doesn't happen, people will learn not to believe arguments about AI risk.

Rohin's opinion: Interestingly, I agree with almost all of this, even though it's (kind of) arguing that I shouldn't be doing AI safety research at all. The main place I disagree is that losses from principal-agent problems with perfectly rational agents are bounded -- this seems crazy to me, and I'd be interested in specific paper recommendations (though note that I and others have searched and not found many).

On the point about lumpiness, my model is that there are only a few underlying factors (such as the ability to process culture) that allow humans to so quickly learn to do so many tasks, and almost all tasks require near-human levels of these factors to be done well. So, once AI capabilities on these factors reach approximately human level, we will "suddenly" start to see AIs beating humans on many tasks, resulting in a "lumpy" increase on the metric of "number of tasks on which AI is superhuman" (which seems to be the metric that people often use, though I don't like it, precisely because it seems like it wouldn't measure progress well until AI becomes near-human-level).

Conversation with Adam Gleave (Adam Gleave et al) (summarized by Rohin): Adam finds the traditional arguments for AI risk unconvincing. First, it isn't clear that we will build an AI system that is so capable that it can fight all of humanity from its initial position where it doesn't have any resources, legal protections, etc. While discontinuous progress in AI could cause this, Adam doesn't see much reason to expect such discontinuous progress: it seems like AI is progressing by using more computation rather than finding fundamental insights. Second, we don't know how difficult AI safety will turn out to be; he gives a probability of ~10% that the problem is as hard as (a caricature of) MIRI suggests, where any design not based on mathematical principles will be unsafe. This is especially true because as we get closer to AGI we'll have many more powerful AI techniques that we can leverage for safety. Third, Adam does expect that AI researchers will eventually solve safety problems; they don't right now because it seems premature to work on those problems. Adam would be more worried if there were more arms race dynamics, or more empirical evidence or solid theoretical arguments in support of speculative concerns like inner optimizers. He would be less worried if AI researchers spontaneously started to work on the relevant problems (more than they already do).

Adam makes the case for AI safety work differently. At the highest level, it seems possible to build AGI, and some organizations are trying very hard to build AGI, and if they succeed it would be transformative. That alone is enough to justify some effort into making sure such a technology is used well. Then, looking at the field itself, it seems like the field is not currently focused on doing good science and engineering to build safe, reliable systems. So there is an opportunity to have an impact by pushing on safety and reliability. Finally, there are several technical problems that we do need to solve before AGI, such as how we get information about what humans actually want.

Adam also thinks that it's 40-50% likely that when we build AGI, a PhD thesis describing it would be understandable by researchers today without too much work, but ~50% that it's something radically different. However, it's only 10-20% likely that AGI comes only from small variations of current techniques (i.e. by vastly increasing data and compute). He would see this as more likely if we hit additional milestones by investing more compute and data (OpenAI Five was an example of such a milestone).

Rohin's opinion: I broadly agree with all of this, with two main differences. First, I am less worried about some of the technical problems that Adam mentions, such as how to get information about what humans want, or how to improve the robustness of AI systems, and more concerned about the more traditional problem of how to create an AI system that is trying to do what you want. Second, I am more bullish on the creation of AGI using small variations on current techniques, but vastly increasing compute and data (I'd assign ~30%, while Adam assigns 10-20%).






At risk of sounding foolish, this seems odd "There can't be too many things that reduce the expected value of the future by 10%; if there were, there would be no expected value left. So, the prior that any particular thing has such an impact should be quite low."

But suppose we lived in a world where there are 10 death stars lurking in the heavens, and all of them are very likely to obliterate the Earth and reduce its expected value significantly. Then can't the EV detraction of each individual death star be (say) 90%, while the EV of the future remains above zero? One way of looking at it would be that the death stars are deployed in a staggered fashion: the deployment of the first death star takes Earth's EV from 100 to 10, the deployment of the second takes it from 10 to 1, the third from 1 to 0.1, etc.
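The arithmetic in this comment can be sketched in a few lines. This is a toy illustration with made-up numbers (the 100-unit baseline and the 90% figures are hypothetical), not a claim about actual risk levels:

```python
# Toy model: each independent "death star" multiplies the expected value
# of the future by its survival probability. All numbers are hypothetical.

def remaining_ev(base_ev, survival_probs):
    """EV of the future after a sequence of independent risks."""
    ev = base_ev
    for p in survival_probs:
        ev *= p
    return ev

# Three death stars, each with a 90% chance of obliterating Earth
# (i.e. a 10% chance of survival each):
ev = remaining_ev(100, [0.1, 0.1, 0.1])
print(ev)  # ~0.1: tiny, but still above zero
```

The point being that several risks can each be "90% of the remaining EV" without the total EV ever reaching zero, because the reductions compound multiplicatively rather than additively.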

Apologies for the terrible explanation of my point.

I also found that passage odd, though I think for a somewhat different reason (or at least with a different framing).

For me, the passage reminded me of the O-ring theory of economic development, "which proposes that tasks of production must be executed proficiently together in order for any of them to be of high value".

For the sake of the argument, let's make the very extreme and unlikely assumption that, if no longtermists worked on them, each of AI risk, biorisk, and nuclear war would by itself be enough to guarantee an existential catastrophe that reduces the value of the future to approximately 0. In that case, we might say that the EV of the future is ~0, and even if we were to totally fix one of those problems, or even two of them, the EV would still be ~0. Therefore, the EV of working on any one or two of the problems, viewed in isolation, is 0. But the EV of fixing all three would presumably be astronomical.

We could maybe say that existential catastrophe in this scenario is overdetermined, and so we need to remove multiple risks in order for catastrophe to actually not happen. This might naively make it look like many individual prevention efforts were totally worthless, and it might indeed mean that they are worthless if the other efforts don't happen, but it's still the case that, altogether, that collection of efforts is extremely valuable.

This also sort-of reminds me of some of 80k/Ben Todd's comments on attributing impact, e.g.:

A common misconception is that the credit for an action can’t add up to more than 100%. But it’s perfectly possible for both people to be responsible. Suppose Amy and Bob see someone drowning. Amy performs CPR while Bob calls the ambulance. If Amy wasn’t there to perform CPR, the person would have died while Bob called the ambulance. If Bob wasn’t there, the ambulance wouldn’t have arrived in time. Both Amy and Bob saved a life, and it wouldn’t have happened if either hadn't been there. So, both are 100% responsible.

I haven't taken the time to work through how well this point holds up when instead each x-risk causes less than a 100% chance of existential catastrophe (e.g. 10%) if there were no longtermists working on it. But it seems plausible that there could be more than 100% worth of x-risks, if we add it up across the centuries/millennia, such that, naively, any specified effort to reduce x-risks that doesn't by itself reduce the total risk to less than 100% appears worthless.
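One way to see the "more than 100% worth of x-risks" point is that naively summed risks can exceed 100% while the multiplicative survival probability stays above zero. A toy sketch with hypothetical numbers (an independent 10% risk per century is an assumption for illustration, not an estimate):

```python
# Toy sketch with hypothetical numbers: an independent 10% existential risk
# per century. Naively summing the risks passes 100% after ten centuries,
# but the multiplicative survival probability never hits zero.

def naive_risk_sum(per_period_risk, periods):
    """Naive (additive) total risk; can exceed 1."""
    return per_period_risk * periods

def survival_prob(per_period_risk, periods):
    """Probability of surviving every period, assuming independence."""
    return (1 - per_period_risk) ** periods

print(naive_risk_sum(0.10, 12))  # ~1.2, i.e. "120% of risk", naively
print(survival_prob(0.10, 12))   # ~0.28, so survival is still quite possible
```

So the naive framing (any effort that doesn't push total risk below 100% is worthless) dissolves once the risks are combined multiplicatively: each removed risk raises the survival probability by a real factor.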

So I think the point that, in a sense, only so many things can reduce the EV of the future by 10% does highlight that work on AI risk might be less valuable than one would naively think, if we expect other people/generations to drop the ball on other issues. But if we instead view this as a situation where there's a community of people working on AI risk, another working on biorisk, another working on nuclear risk, future generations working on whatever risks come up then, etc., then it seems totally possible that each community has to fix their problem. And so then it makes sense for our generation to work on the issues we can work on now, and more specifically for people with a comparative advantage for AI to work on that, those with a comparative advantage for general great power war reduction to work on that, etc.

So Paul's statement there seems to sort-of hold up as a partial argument for reducing focus on AI risk, but in a specific way (to ensure we also patch all the other leaks too). It doesn't seem like it holds up as an argument that AI safety work is less valuable than we thought in a simple sense, such that we should redirect efforts to non-longtermist work.

(I'm not saying Paul specifically intended to have the latter implication, but I think it'd be easy to make that inference from what he said, at least as quoted here. I'm also not sure if I'm right or if I've expressed myself well, and I'd be interested in other people's thoughts.)

Thanks v much - your third para was a much better explanation of what I was driving at!

Btw, there's a section on "Comparing and combining risks" in Chapter 6 of The Precipice (pages 171-173 in my version), which is very relevant to this discussion. Appendix D expands on that further. I'd recommend interested people check that out.

I'm sympathetic to many of the points, but I'm somewhat puzzled by the framing that you chose in this letter.

Why AI risk might be solved without additional intervention from longtermists

Sends me the message that longtermists should care less about AI risk.

Though, the people in the "conversations" all support AI safety research. And, from Rohin's own words:

Overall, it feels like there's around 90% chance that AI would not cause x-risk without additional intervention by longtermists.

10% chance of existential risk from AI sounds like a problem of catastrophic proportions to me. It implies that we need many more resources spent on existential risk reduction. Though perhaps not strictly on technical AI safety. Perhaps more marginal resources should be directed to strategy-oriented research instead.

Sends me the message that longtermists should care less about AI risk.

I do believe that, and so does Robin. I don't know about Paul and Adam, but I wouldn't be surprised if they thought so too.

Though, the people in the "conversations" all support AI safety research.

Well, it's unclear if Robin supports AI safety research, but yes, the other three of us do. This is because:

10% chance of existential risk from AI sounds like a problem of catastrophic proportions to me.

(Though I'll note that I don't think the 10% figure is robust.)

I'm not arguing "AI will definitely go well by default, so no one should work on it". I'm arguing "Longtermists currently overestimate the magnitude of AI risk".

I also broadly agree with reallyeli:

However I really think we ought to be able to discuss guesses about what's true merely on the level of what's true, without thinking about secondary messages being sent by some statement or another. It seems to me that if we're unable to do so, that will make the difficult task of finding truth even more difficult.

And this really does have important implications: if you believe "non-robust 10% chance of AI accident risk", maybe you'll find that biosecurity, global governance, etc. are more important problems to work on. I haven't checked myself -- for me personally, it seems quite clear that AI safety is my comparative advantage -- but I wouldn't be surprised if on reflection I thought one of those areas was more important for EA to work on than AI safety.

I'm not arguing "AI will definitely go well by default, so no one should work on it". I'm arguing "Longtermists currently overestimate the magnitude of AI risk".

Thanks for the clarification Rohin!

I also agree overall with reallyeli.

I had the same reaction (checking in my head that a 10% chance still merited action).

However I really think we ought to be able to discuss guesses about what's true merely on the level of what's true, without thinking about secondary messages being sent by some statement or another. It seems to me that if we're unable to do so, that will make the difficult task of finding truth even more difficult.

In the beginning of the Christiano part it says

There can't be too many things that reduce the expected value of the future by 10%; if there were, there would be no expected value left.

Why is it unlikely that there is little to no expected value left? Wouldn't it be conceivable that there are a lot of risks in the future and that therefore there is little expected value left? What am I missing?

I think the argument is that we don't know how much expected value is left, but our decisions will have a much higher expected impact if the future is high-EV, so we should make decisions that would be very good conditional on the future being high-EV.
