This is a linkpost for #44 - Paul Christiano on how OpenAI is developing real solutions to the 'AI alignment problem', and his vision of how humanity will progressively hand over decision-making to AI systems. You can listen to the episode on that page, or by subscribing to the '80,000 Hours Podcast' wherever you get podcasts.

In the episode, Paul and Rob discuss:

  • What could people do to shield themselves financially from potentially losing their jobs to AI?
  • How important is it that the best AI safety team ends up in the company with the best ML team?
  • What might the world look like if several states or actors developed AI at the same time (aligned or otherwise)?
  • Would artificial general intelligence grow in capability quickly or slowly?
  • How likely is it that transformative AI is an issue worth worrying about?
  • What are the best arguments against being concerned?
  • What would cause people to take AI alignment more seriously?
  • Concrete ideas for making machine learning safer, such as iterated amplification.
  • What does it mean to say that a crow-like intelligence could be much better at science than humans?
  • What is ‘prosaic AI’?
  • How do Paul’s views differ from those of the Machine Intelligence Research Institute?
  • The importance of honesty for people and organisations
  • What are the most important ways that people in the effective altruism community are approaching AI issues incorrectly?
  • When would an ‘unaligned’ AI nonetheless be morally valuable?
  • What’s wrong with current sci-fi?

If an AI says, “I would like to design the particle accelerator this way because,” and then makes an inscrutable argument about physics, you’re faced with this tough choice. You can either sign off on that decision and see if it has good consequences, or you [say] “no, don’t do that ’cause I don’t understand it”. But then you’re going to be permanently foreclosing some large space of possible things your AI could do.

–Paul Christiano

Key points

So I think the competitive pressure to develop AI, in some sense, is the only reason there’s a problem. I think describing it as an arms race feels somewhat narrow, potentially. That is, the problem’s not restricted to conflicts among states, say. It’s not restricted even to conflict, per se. If we have really secure property, so if everyone owns some stuff and the stuff they owned was just theirs, then it would be very easy to ignore … if individuals could just opt out of AI risk being a thing because they’d just say, “Great, I have some land and some resources and space, I’m just going to chill. I’m going to take things really slow and careful and understand.” Given that’s not the case, then in addition to violent conflict, there’s … just faster technological progress tends to give you a larger share of the stuff.

Most resources are just sitting around unclaimed, so if you go faster you get more of them, where if there’s two countries and one of them is 10 years ahead in technology, that country will, everyone expects, expand first to space and over the very long run, claim more resources in space. In addition to violent conflict, de facto, they’ll claim more resources on earth, et cetera.

I think the problem comes from the fact that you can’t take it slow because other people aren’t taking it slow. That is, we’re all forced to develop technology as fast as we could. I don’t think of it as restricted to arms races or conflict among states, I think there would probably still be some problem, just because people … Even if people weren’t forced to go quickly, I think everyone wants to go quickly in the current world. That is, most people care a lot about having nicer things next year and so even if there were no competitive dynamic, I think that many people would be deploying AI the first time it was practical, to become much richer, or advance technology more rapidly. So I think we would still have some problem. Maybe it would be a third as large or something like that.


The largest source of variance is just how hard is the problem? What is the character of the problem? So after that, I think the biggest uncertainty, though not necessarily the highest place to push, is about how people behave. It’s how much investment do they make? How well are they able to reach agreements? How motivated are they in general to change what they’re doing in order to make things go well? So I think that’s a larger source of variance than technical research that we do in advance. I think it’s potentially a harder thing to push on in advance. Pushing on how much technical research we do in advance is very easy. If we want to increase that amount by 10%, that’s incredibly cheap, whereas having a similarly big change on how people behave would be a kind of epic project. But I think that more of the variance comes from how people behave.

I’m very, very, uncertain about the institutional context in which that will be developed. Very uncertain about how much each particular actor really cares about these issues, or when push came to shove, how far out of their way they would go to avoid catastrophic risk. I’m very uncertain about how feasible it will be to make agreements to avoid race to the bottom on safety.


We’re very uncertain about how hard doing science is. As an example, I think back in the day we would have said playing board games that are designed to tax human intelligence, like playing chess or go is really quite hard, and it feels to humans like they’re really able to leverage all their intelligence doing it.

It turns out that playing chess from the perspective of actually designing a computation to play chess is incredibly easy, so it takes a brain very much smaller than an insect brain in order to play chess much better than a human. I think it’s pretty clear at this point that science makes better use of human brains than chess does, but it’s actually not clear how much better. It’s totally conceivable from our current perspective, I think, that an intelligence that was as smart as a crow, but was actually designed for doing science, actually designed for doing engineering, for advancing technologies as rapidly as possible, it is quite conceivable that such a brain would actually outcompete humans pretty badly at those tasks.


Some people have a model in which early developers of AI will be at huge advantage. They can take their time or they can be very picky about how they want to deploy their AI, and nevertheless, radically reshape the world. I think that’s conceivable, but it’s much more likely that the earlier developers of AI will be developing AI in a world that already contains quite a lot of AI that’s almost as good, and they really won’t have that much breathing room. They won’t be able to reap a tremendous windfall profit. They won’t be able to be really picky about how they use their AI. You won’t be able to take your human level AI and send it out on the internet to take over every computer because this will occur in a world where all the computers that were easy to take over have already been taken over by much dumber AIs. It’s more like you’re existing in this soup of a bunch of very powerful systems.


The idea in iterated amplification is to start from a weak AI. At the beginning of training you can use a human. A human is smarter than your AI, so they can train the system. As the AI acquires capabilities that are comparable to those of a human, then the human can use the AI that they’re currently training as an assistant, to help them act as a more competent overseer.

Over the course of training, you have this AI that’s getting more and more competent, the human at every point in time uses several copies of the current AI as assistants, to help them make smarter decisions. And the hope is that that process both preserves alignment and allows this overseer to always be smarter than the AI they’re trying to train. And so the key steps of the analysis there are both solving this problem, the first problem I mentioned of training an AI when you have a smarter overseer, and then actually analyzing the behavior of the system, consisting of a human plus several copies of the current AI acting as assistants to the human to help them make good decisions.

In particular, as you move along the training, by the end of training, the human’s role becomes kind of minimal, like if we imagine training superintelligence. In that regime, we’re just saying, can you somehow put together several copies of your current AI to act as the overseer? You have this AI trying to … Hopefully at each step it remains aligned. You put together a few copies of the AI to act as an overseer for itself.
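The loop described above can be sketched as a toy program. This is only an illustration under loud assumptions, not Paul's actual training setup: the "AI" is a lookup table, "training" is memorising the overseer's answers, and the "human" can only read one number or add two numbers. All names here (`amplified_overseer`, `train`, `contiguous_questions`) are hypothetical. What it shows is the shape of the idea: at every round the overseer, helped by copies of the current AI, can answer slightly harder questions than the AI alone, and the AI is then trained on those answers.

```python
from typing import Dict, List, Optional, Tuple

Question = Tuple[int, ...]      # a question is "what is the sum of these numbers?"
Model = Dict[Question, int]     # toy "AI": a table of answers it has been trained on


def model_answer(model: Model, q: Question) -> Optional[int]:
    """The current AI only answers questions it has already been trained on."""
    return model.get(q)


def amplified_overseer(model: Model, q: Question) -> Optional[int]:
    """A weak overseer assisted by copies of the current AI.

    Unaided, the overseer can read off a single number or add two numbers.
    For anything longer it splits the question in half and asks one copy
    of the current AI about each half.
    """
    if len(q) == 1:
        return q[0]
    mid = len(q) // 2
    left = model_answer(model, q[:mid])
    right = model_answer(model, q[mid:])
    if left is None or right is None:
        return None             # the assistants cannot help yet
    return left + right


def train(questions: List[Question], rounds: int) -> Model:
    """Alternate amplification (overseer + assistants) and distillation."""
    model: Model = {}
    for _ in range(rounds):
        # Amplify: the overseer answers using the AI frozen at the start of the round.
        targets = {q: amplified_overseer(model, q) for q in questions}
        # Distill: the next AI imitates every answer the overseer could produce.
        learned = {q: a for q, a in targets.items() if a is not None}
        model = {**model, **learned}
    return model


def contiguous_questions(data: Question) -> List[Question]:
    """Every contiguous slice of `data`, so sub-questions stay in distribution."""
    n = len(data)
    return [data[i:j] for i in range(n) for j in range(i + 1, n + 1)]


if __name__ == "__main__":
    data = (3, 1, 4, 1, 5, 9, 2, 6)
    questions = contiguous_questions(data)
    print(train(questions, rounds=3).get(data))  # None: halves of length 4 not yet learned
    print(train(questions, rounds=4).get(data))  # 31
```

In this toy the overseer never gets any smarter on their own; all of the growth in capability comes from consulting copies of the AI being trained. Whether a real training process of this shape preserves alignment is exactly the open question discussed above, and the sketch says nothing about it.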

Articles, books, and other media discussed in the show

Mentioned at the start of the episode: The world’s highest impact career paths according to our research


Robert Wiblin: Hi listeners, this is the 80,000 Hours Podcast, where each week we have an unusually in-depth conversation about the world’s most pressing problems and how you can use your career to solve them. I’m Rob Wiblin, Director of Research at 80,000 Hours.

Today’s episode is long for a reason. My producer – Keiran Harris – listened to our first recording session and said it was his favourite episode so far, so we decided to go back and add another 90 minutes to cover issues we didn’t make it to first time around.

As a result, the summary can only touch on a fraction of the topics that come up. It really is pretty exciting.

I hope you enjoy the episode as much as we did and if you know someone working near AI or machine learning, please do pass this conversation on to them.

Just quickly before that I want to let you know that last week we released probably our most important article of the year. It’s called “These are the world’s highest impact career paths according to our research”, and it summarises many years of work into a single article that brings you up to date on what 80,000 Hours recommends today.

It outlines our new suggested process which any of you could potentially use to generate a short-list of high-impact career options given your personal situation. It then describes the five key categories of career that we most often recommend, which should be able to produce at least one good option for almost all graduates.

Finally, it lists and explains the top 10 ‘priority paths’ we want to draw attention to, because we think they can enable the right person to do an especially large amount of good for the world.

I definitely recommend checking it out – we’ll link to it from the show-notes and blog post.

Here’s Paul.

Robert Wiblin: Today, I’m speaking with Dr. Paul Christiano. Paul recently completed a PhD in theoretical computer science at UC Berkeley and is now a researcher at OpenAI, working on aligning artificial intelligence with human values. He blogs at ai-alignment.com. Thanks for coming on the podcast, Paul.

Paul Christiano: Thanks for having me.

Robert Wiblin: We plan to talk about Paul’s views on how the transition to an AI economy will actually occur and how listeners can contribute to making that transition go better, but first, I’d like to give you a chance to frame the issue of AI alignment in your own words. What is the problem of AI safety and why did you decide to work on it yourself?

The problem of AI safety

Paul Christiano: AI alignment, I see as the problem of building AI systems that are trying to do the thing that we want them to do. So in some sense, that might sound like it should be very easy because we build an AI system, we get to choose all … we get to write the code, we get to choose how the AI system is trained. There are some reasons that it seems kind of hard to train an AI system to do exactly … So we have something we want in the world, for example we want to build an AI, we want it to help us govern better, we want it to help us enforce the law, we want it to help us run a company. We have something we want that AI to do, but for technical reasons, it’s not trivial to build the AI that’s actually trying to do the thing we want it to do. That’s the alignment problem.

I care about that problem a lot because I think we’re moving towards a world where most of the decisions are made by intelligent machines and so if those machines aren’t trying to do the things humans want them to do, then the world is going to go off in a bad direction. If the AI systems we can build are really good at … it’s easy to train them to maximize profits, or to get users to visit the websites, or to get users to press the button saying that the AI did well, then you have a world that’s increasingly optimized for things like making profits or getting users to click on buttons, or getting users to spend time on websites without being increasingly optimized for having good policies, heading in a trajectory that we’re happy with, helping us figure out what we want and how to get it.

So that’s the alignment problem. The safety problem is, somewhat more broadly, understanding things that might go poorly with AI and what technical work and political work we can do to improve the probability that things go well.

Robert Wiblin: Right, so what concretely do you do at OpenAI?

Paul Christiano: So I do machine learning research, which is a combination of writing code and running experiments and thinking about how machine learning systems should work, trying to understand what are the important problems, how could we fix them, plan out what experiments give us interesting information, what capabilities do we need if we want to build aligned AI five years, 10 years, 20 years down the road? What are the capabilities we need, what should we do today to work towards those capabilities? What are the hardest parts? So trying to understand what we need to do and then actually trying to do it.

AI alignment

Robert Wiblin: Makes sense. So the first big topic that I wanted to get into was kind of the strategic landscape of artificial intelligence safety research, both technical and, I guess, political and strategic. Partly I wanted to do that first because I understand it better than the technical stuff, so I didn’t want to be floundering right off the bat. What basically caused you to form your current views about AI alignment and to regard it as a really important problem? Maybe also, how have your views on this changed over time?

Paul Christiano: So there are a lot of parts on my views on this, it’s like a complicated pipeline from do the most good for the most people, to write this particular machine learning code. I think very broadly speaking, I come in with this utilitarian perspective, I do care about more people more, then you start thinking, you take that perspective and you think that future populations will be very large, you start asking, what are the features of the world today that affect the long run trajectory of civilization? I think if you come in with that question, there’s two very natural categories of things, there’s if we all die then we’re all dead forever, and second, there’s sort of a distribution of values, or optimization in the world, and that can be sticky in the sense that if you create entities that are optimizing something, those entities can entrench themselves and be hard to move. In the same way that humans are kind of hard to remove at this point. You try and kill humans, humans bounce back.

There are a few ways you can change the distribution of values in the world. I think the most natural, or the most likely one, is as we build AI systems, we’re going to sort of pass the torch from humans, who want one set of things, to AI systems, that potentially want a different set of things. So in addition to going extinct, I think bungling that transition is the easiest way to head in a bad direction, or to permanently alter the structure of civilization.

So at a very high level, that’s kind of how I got to thinking about AI many years ago, and then once you have that perspective, one then has to look at the actual character of AI and say how likely is this failure mode? That is what actually determines what AI is trying to optimize, and start thinking in detail about the kinds of techniques people are using to produce AI. I think that after doing that, I became pretty convinced that there are significant problems. So there’s some actual difficulty there of building an AI that’s trying to do the thing that the human who built it wants it to do. If we could resolve that technical problem, that’d be great. Then we dodge this difficulty of humans maybe passing off control to some systems that don’t want the same things we want.

Then, zooming in a little bit more, if the whole world … Right, so this is a problem which some people care about, we also care about a lot of other things though, and we’re also all competing with one another which introduces a lot of pressure for us to build whatever kind of AI works best. So there’s some sort of fundamental tension between building AI that works best for the tasks that we want our AI to achieve, and building AI which robustly shares our values, or is trying to do the same things that we want it to do.

So it seems like the current situation is we don’t know how to build AI that is maximally effective but still robustly beneficial. If we don’t understand that, then people deploying AI will face some trade-off between those two goals. I think by default, competitive pressures would cause people to push far towards the AI that’s really effective at doing what we want it … Like, really effective at acquiring influence or navigating conflict, or so on, but not necessarily robustly beneficial. So then we would need to either somehow coordinate to overcome that pressure. So we’d have to all agree we’re going to build AI that actually does what we want it to do, rather than building AI which is effective in conflict, say. Or, we need to make technical progress so there’s not that trade-off.

Arms race dynamic

Robert Wiblin: So to what extent do you view the arms race dynamic, the fact that people might try to develop AI prematurely because they’re in a competitive situation, as the key problem that’s driving the lack of safety?

Paul Christiano: So I think the competitive pressure to develop AI, in some sense, is the only reason there’s a problem. I think describing it as an arms race feels somewhat narrow, potentially. That is, the problem’s not restricted to conflict among states, say. It’s not restricted even to conflict, per se. If we have really secure property, so if everyone owns some stuff and the stuff they owned was just theirs, then it would be very easy to ignore … if individuals could just opt out of AI risk being a thing because they’d just say, “Great, I have some land and some resources and space, I’m just going to chill. I’m going to take things really slow and careful and understand.” Given that’s not the case, then in addition to violent conflict, there’s … just faster technological progress tends to give you a larger share of the stuff.

Most resources are just sitting around unclaimed, so if you go faster you get more of them, where if there’s two countries and one of them is 10 years ahead in technology, that country will, everyone expects, expand first to space and over the very long run, claim more resources in space. In addition to violent conflict, de facto, they’ll claim more resources on earth, et cetera.

I think the problem comes from the fact that you can’t take it slow because other people aren’t taking it slow. That is, we’re all forced to develop technology as fast as we could. I don’t think of it as restricted to arms races or conflict among states, I think there would probably still be some problem, just because people … Even if people weren’t forced to go quickly, I think everyone wants to go quickly in the current world. That is, most people care a lot about having nicer things next year and so even if there were no competitive dynamic, I think that many people would be deploying AI the first time it was practical, to become much richer, or advance technology more rapidly. So I think we would still have some problem. Maybe it would be a third as large or something like that.


Robert Wiblin: How much attention are people paying to these kinds of problems now? My perception is that the amount of interest has ramped up a huge amount, but of course, I guess the amount of resources going into just increasing the capabilities of AI has also been increasing a lot, so it’s unclear whether safety has become a larger fraction of the whole.

Paul Christiano: So I think in terms of profile of the issue, how much discussion there is of the problem, safety has scaled up faster than AI, broadly. So it’s a larger fraction of discussion now. I think that more discussion of the issue doesn’t necessarily translate to anything super productive. It definitely translates to people in machine learning maybe being a little bit annoyed about it. So it’s a lot of discussion, discussion’s scaled up a lot. The number of people doing research has also scaled up significantly, but I think that’s maybe more in line with the rate of progress in the field. I’m not sure if fraction of people working on, “I’m full time …” Actually, no I think that’s also scaled up, maybe by a factor of two relatively, or something.

So if one were to look at publications at top machine learning conferences, there’s an increasing number, maybe a few in the last NIPS, that are very specifically directed at the problem, “We want our AI to be doing the thing that we want it to be doing and we don’t have a way to do that right now. Let’s try and push technology in that direction. To build AI to understand what we want and help us get it.” So now we’re at the point where there’s a few papers in each conference that are very explicitly targeted at that goal, up from zero to one.

At the same time, there’s aspects of the alignment problem that are more clear, so things like building AI that’s able to reason about what humans want, and there’s aspects that are maybe a little bit less clear, like more arcane seeming. So for example, thinking about issues distinctive to AI which exceeds human capabilities in some respect. I think the more arcane issues are also starting to go from basically nothing to discussed a little bit.

Robert Wiblin: What kind of arcane issues are you thinking of?

Paul Christiano: So there’s some problem with building weak AIs, say, that want to do what humans want them to do. There’s then a bunch of additional difficulties that appear when you imagine the AI that you’re training is a lot smarter than you are in some respect. So then you need some other strategy. So in that regime, it becomes … When you have a weak AI, it’s very easy to say what the goal is, what you want the AI to do. You want it to do something that looks good to you. If you have a very strong AI, then you actually have a philosophical difficulty of what is the right behavior for such a system. It means that all the answers … there can be no very straightforward technical answer where we prove a theorem and say this is the right … you can’t merely prove a theorem. You have to do some work to say, we’re happy with what this AI is doing, even though no human understands, say, what this AI’s doing.

Same parallel with device specification stuff. Another big part of alignment is understanding training models that continue to do … you train your model to do something. On the training distribution, you’ve trained your AI, on the training distribution, it does what you want. There’s a further problem of maybe when you deploy it, or on the test distribution it does something catastrophically different from what you want, and that’s also … on that problem, I think interest has probably scaled up even more rapidly. So the number of people thinking about adversarial machine learning, can an adversary find some situation in which your AI does something very bad, the number of people working on that problem has scaled up. I think it’s more than doubled as a fraction of the field, although it’s still, in absolute terms, kind of small.

Robert Wiblin: What do you think would cause people to seriously scale up their work on this topic and do you think it’s likely to come in time to solve the problem, if you’re right that there are serious risks here?

Paul Christiano: Yeah, so I think that where we’re currently at, it seems clear that there is a real problem. There is this technical difficulty of building AI that does what we want it to do. It’s not yet clear if that problem is super hard, so I think we’re really uncertain about that. I’m working on it, not because I’m confident it’s super hard, but because it seems pretty plausible that it’s hard. I think that the machine learning community would be much, much more motivated to work on the problem if it became clear that this was going to be a serious problem. People aren’t super good at coping with, “Well there’s a 30% chance this is going to be a huge problem,” or something like that. I think one big thing is as it becomes more clear, then I think many more people will work on the problem.

So when I talk about these issues of training weaker AI systems to do what humans want them to do, I think it is becoming more clear that that’s a big problem. So for example, we’re getting to the point where robotics is getting good enough that it’s going to be limited by, or starting to be limited by, who communicates to the robot what it actually ought to be doing. Or people are becoming very familiar with … YouTube has an algorithm that decides what video it will show you. People have some intuitive understanding, they’re like, “That algorithm has a goal and if that goal is not the goal that we collectively as a society and the users of YouTube would want, that’s going to push the world in this annoying direction.” It’s going to push the world towards people spending a bunch of time on YouTube rather than their lives being better.

So I think we are currently at the stage where some aspects of these problems are becoming more obvious, and that makes it a lot easier for people to work on those aspects. As we get closer to AI, assuming that these problems are serious, it’s going to become more and more obvious that the problems are serious. That is, we’ll be building AI systems which humans don’t understand, and the fact that their values are not quite right will be causing serious problems.

I think that’s one axis and then the other axis is … So, I’m particularly interested in the possibility of transformative AI that has a very large effect on the world. So the AI that starts replacing humans in the great majority of economically useful work. I think that right now, we’re very uncertain about what the timelines are for that. I think there’s a reasonable chance within 20 years, say, but certainly there’s not compelling evidence that it’s going to be within 20 years. I think as that becomes more obvious, then many more people will start thinking about catastrophic risks in particular, because those will become more plausible.

Robert Wiblin: So your concerns about how transformative AI could go badly have become pretty mainstream but not everyone is convinced. How compelling do you think the arguments are that people should be worried about this and is there anything that you think that you’d like to say to try to persuade skeptics who might be listening?

Paul Christiano: I think almost everyone is convinced that there is … or almost everyone in machine learning, is convinced that there’s a problem. That there’s an alignment problem. There’s the problem of trying to build AI to do what you want it to do and that that requires some amount of work. I think the point of disagreement … there are a few points of disagreement within the machine learning community. So one is, is that problem hard enough that it’s a problem that’s worth trying to focus on and trying to push differentially? Or is that the kind of problem that should get solved in the normal business of doing AI research? So that’s one point of disagreement. I think on that point, I think in order to be really excited about working on that problem, you have to be thinking, what can we do to make AI go better?

If you’re just asking how can we have really powerful AI that does good things as soon as possible, then I think it’s actually not that compelling an argument to work on alignment. But I think if you’re asking the question how do we actually maximize the probability that this goes well, then it doesn’t really matter whether that ought to be part of the job of AI researchers, we should be really excited about putting more resources into that to make it go faster and I think if someone really takes seriously the goal of trying to make AI go well instead of just trying to push on AI and trying to make cool stuff happen sooner, or trying to realize benefits over the next five years, then I think that case is pretty strong right now.

Another place there’s a lot of disagreement in the ML community is, maybe it’s more an issue of framing than an issue of substance, which is the kind of thing I find pretty annoying. There’s one frame where you’re like, “AI’s very likely to kill everyone, there’s going to be some robot uprising. It’s going to be a huge mess, this should be on top of our list of problems.” And there’s another framing where it’s like, “Well, if we, as the AI community, fail to do our jobs, then yes something bad would happen.” But it’s kind of offensive for you to say that we as the AI community are going to fail to do our jobs. I don’t know if I would really need to … it doesn’t seem like you should really have to convince anyone on the second issue.

You should be able to be like, “Yes, it’d be really bad if we failed to do our jobs.” Now, this discussion we’re currently having is not part of us trying to argue that everyone should be freaking out, this is us trying to argue like … this is us doing our jobs. This discussion we’re having right now. You can’t have a discussion about us trying to do our jobs and be like, “Yes, it’s going to be fine because we’re going to do our jobs.” That is an appropriate response in some kinds of discussion, maybe …

Robert Wiblin: But when you’re having the conversation about are we going to spend some money on this now, then …

Paul Christiano: Yeah, then I think it’s not such a great response. I think safety’s a really unfortunate word. Lots of people don’t like safety, it’s kind of hard to move away from. If you describe the problem, like with training AI to do what we want it to do, to people, they’re like, “Why do you call that safety?” That’s the problem with building good AI, and that’s fine, I’m happy with that. I’m happy saying, “Yep, this is just doing AI reasonably well.” But then, yeah, it’s not really an argument about why one shouldn’t push more money into that area, or shouldn’t push more effort into that area. It’s a part of AI that’s particularly important to whether AI has a positive or negative effect.

Yeah, I think in my experience, those are the two biggest disagreements. The biggest substantive disagreement is on the, “Is this a thing that’s going to get done easily anyway?” I think there people tend to have … maybe it’s just a normal level of over-confidence about how easy problems will end up being, together with not having a real … I think there aren’t that many people who are really prioritizing the question, “How do you make AI go well?” Instead of just, “How do I make …” Like, choose some cool thing they want to happen. “How do I make that cool thing happen as soon as possible in calendar time?” I think that’s unfortunate, it’s a hard thing to convince people on, in part because values discussions are always a little bit hard.

Best arguments against being concerned

Robert Wiblin: So what do you think are the best arguments against being concerned about this issue, or at least, wanting to prioritize directing resources towards it, and why doesn’t it persuade you?

Paul Christiano: So I think there’s a few classes of arguments. Probably the ones I find most compelling are opportunity cost arguments where someone says, “Here’s a concrete alternative. Yeah, you’re concerned about x, have you considered that y’s even more concerning?” I can imagine someone saying, “Look, the risk of bioterrorism killing everyone is high enough that you should … on the margin, returns to that are higher than returns to AI safety.” At least I’m not compelled by those arguments as well, part of that is a competitive advantage thing where like, “I don’t really have to evaluate those arguments because it’s clear what my competitive advantage is.” In part, I have a different reason, I’m not compelled by every argument of that form. So that’s one class of arguments against.

In terms of the actual value of working on AI safety, I think the biggest concern is this, “Is this an easy problem that will get solved anyway?” Maybe the second biggest concern is, “Is this a problem that’s so difficult that one shouldn’t bother working on it or one should be assuming that we need some other approach?” You could imagine, the technical problem is hard enough that almost all the bang is going to come from policy solutions rather than from technical solutions.

And you could imagine, those two concerns maybe sound contradictory, but aren’t necessarily contradictory, because you could say, “We have some uncertainty about this parameter of how hard this problem is.” Either it’s going to be easy enough that it’s solved anyway, or it’s going to be hard enough that working on it now isn’t going to help that much and so what mostly matters is getting our policy response in order. I think I don’t find that compelling, in part because one, I think there’s significant probability on the range … like the place in between those, and two, I just think working on this problem earlier will tell us what’s going on. If we’re in the world where you need a really drastic policy response to cope with this problem, then you want to know that as soon as possible.

It’s not a good move to be like, “We’re not going to work on this problem because if it’s serious, we’re going to have a dramatic policy response.” Because you want to work on it earlier, discover that it seems really hard and then have significantly more motivation for trying the kind of coordination you’d need to get around it.

Robert Wiblin: It seems to me like it’s just too soon to say whether it’s very easy, moderately difficult or very difficult, does that seem right?

Paul Christiano: That’s definitely my take. So I think people make some arguments in both directions and we could talk about particular arguments people make. Overall, I find them all just pretty unconvincing. I think a lot of the like, “It seems easy,” comes from just the intuitive, “Look, we get to build the AI, we get to choose the training process. We get to look at all the computation the AI is doing as it thinks. How hard can it be to get the AI to be trying to do …” or maybe not, maybe it’s hard to get it to do exactly what you want but how hard can it be to get it to not try and kill everyone?

That sounds like a pretty … there’s a pretty big gap between the behavior we want and the behavior reasoning about what output is going to most lead to humans being crushed. That’s a pretty big gap. Feels like you ought to be able to distinguish those, but I think that’s not … There’s something to that kind of intuition. It is relevant to have a reasoning about how hard a problem is but it doesn’t carry that much weight on its own. You really have to get into the actual details of how we’re producing AI systems, how is that likely to work? What is the distribution of possible outcomes in order to actually say anything with confidence? I think once you do that, the picture doesn’t look quite as rosy.

Robert Wiblin: You mentioned that one of the most potentially compelling counter arguments was that there’s just other really important things for people to be doing that might be even more pressing. Yeah, what things other than AI safety do you think are among the most important things for people to be working on?

Paul Christiano: So I guess I have two kinds of answers to this question. One kind of answer is what’s the standard list of things people would give? Which I think are the most likely things to be good alternatives. So for example, amongst the utilitarian crowd, I think talking about existential risk from engineered pandemics is a very salient option, there’s a somewhat broader bioterror category. I think of other things in this genre, one could also look at the world more broadly, so intervening on political process, improving political institutions, or just pushing governance in a particular direction that we think is conducive to a good world, or a world on a good longterm trajectory.

Those are examples of problems that lots of people would advocate for and therefore, I think if lots of people think x is important, that’s good evidence that x is important. The second kind of answer, which is the problems that I find most tempting to work on, which is going to be related to … it’s going to tend to systematically be things that other people don’t care about, I also think there’s a lot of value. Yeah, one can add a lot of value if there’s a thing that’s important, if you care about the ratio of how important other people think it is and how important it actually is.

So at that level, things that I’m like … I’m particularly excited about very weird utilitarian arguments. So I’m particularly excited about people doing more thinking about what actual features of the world affect whether we’re on a positive or negative trajectory. So thinking about things … There’s a lot of considerations that are extremely important, from the long run utilitarian perspective, that are just not very important according to people’s normal view of the world, or normal values. So you find one big area is just thinking about and acting on, sort of that space of considerations.

So an example, which is a kind of weird example, but hopefully illustrates the point, is normal people care a ton about whether humanity … they care a ton about catastrophic risks. They would really care if everyone died. I think to a weird utilitarian, you’re like, “Well, it’d be bad if everyone died, but even in that scenario, there was a bunch of weird stuff you would do to try and improve the probability that things turn out okay in the end.” So these include things like working on extremely robust bunkers that are capable of repopulating the world, or trying to … in the extreme case where all humans die, you’re like, “Well we’d like some other animal later to come along and evolve intelligent life again and colonize the stars.” Those are weird scenarios, the scenarios that basically no one tries to push on … No one is asking, “What could we do as a civilization to make it better for the people who will come after us if we manage to blow ourselves up?”

So because no one is working on them, even though they’re not that important in absolute terms, I think it’s reasonably likely that they’re good things to work on. Those are examples of kind of weird things. There’s a bunch of not as weird things that also seem pretty exciting to me. Especially things about improving how well people are able to think, or improving how well institutions function, which I’d be happy to get into more detail on, but are not things I’m expert in.

Robert Wiblin: Yeah, maybe just want to list off a couple of those?

Paul Christiano: So just all the areas that seem … are high level areas that seem good to me, so a list of … Thinking about the utilitarian picture and what’s important to a future focused utilitarian, there’s thinking about extinction risks. Maybe extinction risks that are especially interesting to people who care about extinction. So things like bunkers, things like repopulation of the future, things like understanding the tails of normal risks. So understanding the tails of climate change, understanding the tails of nuclear war.

More normal interventions like pushing on peace, but especially with an eye to avoiding the most extreme forms of war, or mitigating the severity of all out war. Pushing on institutional quality, so experimenting with institutions like prediction markets, different ways of aggregating information, or making decisions across people. Just running tons of experiments and understanding what factors influence individual cognitive performance, or individual performance within organizations, or for decision making.

An example of a thing that I’m kind of shocked by is how little study there is of nootropics and cognitive enhancement broadly. I think that’s a kind of thing that’s relatively cheap and seems such good bang for your buck in expectation, that it’s pretty damning for civilization that we haven’t invested in it. Yeah, those are a few examples.

Importance of location of the best AI safety team

Robert Wiblin: Okay, great. Coming back to AI, how important is it to make sure that the best AI safety team ends up existing within the organization, that has the best general machine learning firepower behind it?

Paul Christiano: So you could imagine splitting up the functions of people who work on AI safety into two categories. One category is developing technical understanding, which is sufficient to build aligned AI. So this is doing research saying, “Here are some algorithms, here’s some analysis that seems important.” Then a second function is actually affecting the way that an AI project is carried out, to make sure it reflects our understanding of how to build an aligned AI. So for the first function, it’s not super important. For the first function, if you want to be doing research on alignment, you want to have access to machine learning expertise, so you need to be somewhere that’s doing reasonably good machine learning research but it’s not that important that you be at the place that’s actually at the literal cutting edge.

From the perspective of the second function, it’s quite important. So if you imagine someone actually building very, very powerful AI systems, I think the only way in practice that society’s expertise about how to build aligned AI is going to affect the way that we build AGI, is by having a bunch of people who have made it their career to understand those considerations and work on those considerations, who are involved in the process of creating AGI. So for that second function it’s quite important that if you want an AI to be safe, you want people involved in development of that AI to basically be alignment researchers.

Robert Wiblin: Do you think we’re heading towards a world where we have the right distribution of people?

Paul Christiano: Yeah so I think things are currently okay on that front. I think as we get closer … so we’re currently in a mode where we can imagine … we’re somewhat confident there won’t be powerful AI systems within two or three years and so for the short term, there’s not as much pressure as there will be closer to the day to consolidate behind projects that are posing a catastrophic risk. I would be optimistic that if we were in that situation where we actually faced significant prospect of existential risk from AI over the next two years, then there would be significantly more pressure for … both pressure for safety researchers to really follow wherever that AI was being built or be allocated across the organizations that are working on AI that poses an existential risk, and also a lot of pressure within such organizations to be actively seeking safety researchers.

My hope would be that you don’t have to really pick. Like the safety researchers don’t have to pick a long time in advance what organizations you think will be doing that development, you can say, “We’re going to try and develop the understanding that is needed to make this AI safe. We’re going to work in an organization that is amongst those that might be doing development of dangerous AI and then we’re going to try and live in the kind of world where as we get very close, there’s a lot of … people understand the need for and are motivated to concentrate more expertise on alignment and safety,” and that that occurs at that time.

Robert Wiblin: It seems like there’s some risks to creating new organizations because you get a splintering of the effort and also potential coordination problems between the different groups. How do you feel we should split additional resources between just expanding existing research organizations versus creating new projects?

Paul Christiano: So I agree that to the extent that we have a coordination problem amongst developers of AI, to the extent that the field is harder to reach agreements in or regulate as there are more and more actors, then all else equal I’d prefer not to have a bunch of new actors. I think that’s mostly the case for people doing AI development, so for example, for projects that are doing alignment per se, I don’t think it’s a huge deal and should mostly be determined by other considerations, whether to contribute to existing efforts or create new efforts.

I think in the context of AI projects, I think all else equal, one should only be creating new AI … if you’re interested in alignment, you should only be creating new AI projects where you have some very significant interest in doing so. It’s not a huge deal, but it’s nicer to have a smaller number of more pro-social actors than to have a larger number of actors with uncertain … or even a similar distribution of motivations.

Variance in outcomes

Robert Wiblin: So how much of the variance in outcomes from artificial general intelligence, in your estimates, comes from uncertainty about how good we’ll be at actually working on the technical AI alignment problem, versus uncertainty about how firms that are working to develop AGI will behave potentially, the governments in the countries where they’re operating, how they’re going to behave?

Paul Christiano: Yeah, I think the largest source of variance isn’t either of those but is instead just how hard is the problem? What is the character of the problem? So after that, I think the biggest uncertainty, though not necessarily the highest place to push, is about how people behave. It’s how much investment do they make? How well are they able to reach agreements? How motivated are they in general to change what they’re doing in order to make things go well? So I think that’s a larger source of variance than technical research that we do in advance. I think it’s potentially a harder thing to push on in advance. Pushing on how much technical research we do in advance is very easy. If we want to increase that amount by 10%, that’s incredibly cheap, whereas having a similarly big change on how people behave would be a kind of epic project. But I think that more of the variance comes from how people behave.

I’m very, very, uncertain about the institutional context in which that will be developed. Very uncertain about how much each particular actor really cares about these issues, or when push came to shove, how far out of their way they would go to avoid catastrophic risk. I’m very uncertain about how feasible it will be to make agreements to avoid race to the bottom on safety.

Robert Wiblin: Another question that came in from a listener was, I guess a bit of a hypothetical, but it’s interesting to prod your intuitions here. What do you think would happen if several different firms or countries simultaneously made a very powerful general AI? Some of which were aligned but some of which weren’t and potentially went rogue with their own agenda. Do you think that would be a very bad situation in expectation?

Paul Christiano: My normal model does not involve a moment where you’re building powerful AI. So that is, instead of having a transition from nothing to very powerful AI, you have a bunch of actors gradually ratcheting up the capacity of the systems they’re able to build. But even if that’s false, I expect developers to generally be really well financed groups that are quite large. So if they’re smaller groups, I do generally expect them to divide up the task and effectively pool resources in one way or another. Either by explicitly resource sharing or by merging or by normal trading with each other. But we can still imagine … say, in general, this was distributed across the world, there would be a bunch of powerful AI systems, some of which are aligned, some of which aren’t aligned. I think my default guess about what happens in that world is similar to saying if 10% of the AIs are aligned, then we capture 10% as much value as if 100% of them are aligned. It’s roughly in that ballpark.

Robert Wiblin: Does that come from the fact that there’s a 10% chance that one out of 10 AGIs would, in general, take over? You have more of a view where there’s going to be a power sharing, or each group gets a fraction of the influence, as in the world today?

Paul Christiano: Yeah. I don’t have a super strong view on this, and in part, I don’t have a strong view because I end up at the same place, regardless of how much stochasticity there is. Like whether you get 10% of the stuff all the time, or all the stuff 10% of the time, I don’t have an incredibly strong preference between those, for kind of complicated reasons. I think I would guess … so, in general, if there’s two actors who are equally powerful, they could fight it out and then just see what happened and then behind a veil of ignorance, each of them wins half the time and crushes the other.

I think normally, people would prefer to reach compromises short of that. So that is, imagine how that conflict would go and say, “Well if you’re someone who would be more likely to win, then you’ll extract a bunch of concessions from the weaker party.” But everyone is incentivized to reach an agreement where they don’t have an all out war. In general, that’s how things normally go amongst humans. We’re able to avoid all out war most of the time, though not all the time.

I would, in general, guess that AI systems will be better at that. Certainly in the long run, I think it’s pretty clear AI systems will be better at negotiating to reach positive sum trades, where avoiding war is often an example of a positive sum trade. It’s conceivable in the short term that you have AI systems that are very good at some kinds of tasks and not very good at diplomacy, or not very good at reaching agreement or these kinds of tasks. But I don’t have a super strong view about that.

I think that’s the kind of thing that would determine to what extent you should predict there to be war. If people have transferred most of the decision making authority to machines, or a lot of decision making authority to machines, then you care a lot about things like, are machines really good at waging war but not really changing the process of diplomacy? If they have differential responsibility in that kind of respect, then you get an outcome that’s more random and someone will crush everyone else, and if you’re better at striking agreements, then you’re more likely to say like, “Well, look, here’s the allocation of resources … we’ll allocate influence according to the results of what would happen if we fought. Then let’s all not fight.”


Robert Wiblin: One topic that you’ve written quite a lot about is credible commitments and the need for organizations to be honest. I guess part of that is because it seems like it’s going to be very important in the future for organizations that are involved in the development of AGI to be able to coordinate around safety and alignment and to avoid getting into races with one another, or having just a general environment of mistrust, where they have reasons to go faster in order to out compete other groups. Has anyone ever attempted to have organizations that are as credible in their commitments as this? Do you have much hope that we’ll be able to do that?

Paul Christiano: So certainly I think in the context of arms control agreements and monitoring, some efforts are made for one organization to be able to credibly commit that they are … credibly demonstrate that they’re abiding by some agreement. I think that the kind of thing I talked about … So I wrote this blog post on honest organizations, I think the kind of measure I’m discussing there is both somewhat more extreme than things that would … like a government would normally be open to and also more tailored for this setting, where you have an organization which is currently not under the spotlight, which is trying to set itself up in such a way that it’s prepared to be trustworthy in the future, if it is under the spotlight.

I’m not aware of any organizations having tried that kind of thing. So a private organization saying, “Well, we expect some day in the future, we might want to coordinate in this way and be regulated in this way so we’re going to try and constitute ourselves such that it’s very easy for someone to verify that we’re complying with an agreement or a law.” I’m not aware of people really having tried that much. I think there’s some things that are implicitly this way and companies can change who they hire, they can try and be more trustworthy by having executives, or having people on the board, or having monitors embedded within the organization that they think stakeholders will trust. Certainly a lot of precedent for that. Yeah, I think the reason you gave for why this seems important to me in this context is basically right.

I’m concerned about the setting where there’s some trade-off between the capability of the AI systems you build and safety. In the context of such a trade-off, you’re reasonably likely to want some agreement that says, “Everyone is going to meet this bar on safety.” Given that everyone has committed to meet that bar, there’s not really an incentive then to cut … or they’re not able to follow the incentive to cut corners on safety, say. So you might want to make that … That agreement might take place as an informal agreement amongst AI developers, it might take place as domestic regulation, where law enforcement would like to allow AI companies to continue operating, but would like to verify they’re not going to take over the world.

It might take the context of agreements among states, which would themselves be largely … An agreement among states about AI would involve the US or China having some unusually high degree of trust or insight into what firms in the other country are doing. So I’m thinking forward to that kind of agreement and seems like you would need machinery in place that’s not currently in place. Or it would be very, very hard at the moment. So anything you could do to make it easier seems like it would be … potentially you could make it quite a lot easier. There’s a lot of room there.

Robert Wiblin: Is this in itself a good reason for anyone who’s involved in AI research to maintain an extremely high level of integrity so that they will be trusted in future?

Paul Christiano: I think having a very high level of integrity sounds good in general. As a utilitarian, I do like it if the people engaged in important projects are mostly in it for their stated goals and want to make the world better. It seems like there’s a somewhat different thing which is how trustworthy are you to the external stakeholders who wouldn’t otherwise have trusted your organization. Which I think is different from the normal … if we were to rate people by integrity, that would be a quite different ranking than ranking them by demonstrable integrity to people very far away who don’t necessarily trust the rest of the organization they’re involved in.

Robert Wiblin: I didn’t quite get that. Can you explain that?

Paul Christiano: So I could say there’s both … If I’m interacting with someone in the context … like I’m interacting with a colleague. I have some sense of how much they conduct themselves with integrity. It’s like, one, I could rank people by that. I’d love it if the people who were actually involved in making AI were people who I’d rank as super high integrity.

But then there’s a different question, which is suppose you have some firm, and then you have, there’s someone in the Chinese defense establishment reasoning about the conduct of that firm. They don’t really care that much probably, if there’s someone I would judge as high integrity involved in the process because they don’t have the information that I’m using to make that judgment. From their perspective, they care a lot about the firm being structured such that they feel that they understand what the firm is doing. They don’t feel any uncertainty about whether, in particular, they have minimal suspicion that a formal agreement is just cover for US firms to be cutting corners and delaying their competitors. They really want to have a lot of insight into what is happening at the firm. They want to have some confidence that there’s not some unobserved collusion between the US defense establishment and this firm that nominally is complying with some international agreement, to undermine that agreement. That’s the example of states looking into firms.

But also in the example of firms looking into firms, similarly, if I am looking in, there’s some notion of integrity that would be relevant for two researchers at Baidu looking, interacting with each other and thinking about how much integrity they have. Something quite different that would be helpful for me looking into AI research at Baidu actually believing that AI research at Baidu is being conducted, when they make public statements, those statements are an accurate reflection of what they’re doing. They aren’t collaborating. There isn’t behind the scenes a bunch of work to undermine nominal agreements.

Robert Wiblin: Yeah, I think that it is very valuable for people in this industry to be trustworthy for all of these reasons, but I guess I am a bit skeptical that trust alone is going to be enough, in part for the reasons you just gave. There’s that famous Russian proverb, trust but verify. It seems like there’s been a lot of talk, at least publicly, about the importance of trust, and maybe not enough about how we can come up with better ways of verifying what people’s behavior actually is. I mean, one option, I guess, would just be to have people from different organizations all working together in the same building, or to move them together so they can see what other groups are doing, which allows them to have a lot more trust just because they have much more visibility. How do you feel about that?

Paul Christiano: Yeah, so I think I would be pretty pessimistic about reaching any kind of substantive and serious agreement based only on trust for the other actors in the space. It may be possible in some … yeah, it’s conceivable amongst Western firms that are already quite close, where there’s been a bunch of turnover of staff from one to the other and everyone knows everyone. It’s maybe conceivable in that case. In general, when I talk about agreements, I’m imagining trust as a complement to fairly involved monitoring and enforcement mechanisms.

The monitoring and enforcement problem in this context is quite difficult. That is it’s very, very hard for me to know, suppose I’ve reached, firm A and firm B have reached some nominal agreement. They’re only going to develop some AI that’s safe according to some standard. It’s very, very hard for firm A to demonstrate that to firm B without literally showing all of their, without giving firm B enough information they could basically take everything or benefit from all of the research that firm A is doing. There’s no easy solution to this problem. The problem is easier to the extent you believe that the firm is not running a completely fraudulent operation to maintain some appearances, but then in addition to have some … In addition to having enough insight to verify that, you still need to do a whole bunch of work to actually control how development is going.

I’m just running a bunch of code on some giant computing cluster, you can look and you can see, indeed, they’re running some code on this cluster. Even if I literally showed you all of the code I was running on the cluster, that actually wouldn’t be that helpful. It’s very hard for you to trust what I’m doing unless you’ve literally watched the entire process by which the code was produced. Or at least, you’re confident there wasn’t some other process hidden away that’s writing the real code, and the thing you can see is just a cover by which it looks like we’re running some scheduling job, but actually it’s just a … it’s carrying some real payload that’s a bunch of actual AI research that the results are getting smuggled out to the real AI research group.

Robert Wiblin: Could you have an agreement in which every organization accepts that all of the other groups are going to try to put clandestine informants inside their organization, and that that’s just an acceptable thing for everyone to do to one another because it’s the only way that you could really believe what someone’s telling you?

Paul Christiano: Yes, I think there’s a split between two ways of doing this kind of coordination. On one arm, you try and maintain something like the status quo, where you have a bunch of people independently pushing on the AI progress. In order to maintain that arm, there’s some limit on how much transparency different developers can have into each other’s research. That’s one arm. Then there’s a second arm where you just give up on that and you say yes, all of the information is going to leak.

I think the difficulty in the first arm is that it’s incredibly, you have to walk this really fine line where you’re trying to give people enough insight, which probably does involve monitors, whistle blowing, other mechanisms whereby there are people who firm A trusts embedded in firm B. That’s what makes it hard to do monitoring without leaking all the information. That you have to walk that fine line. Then, if you want to leak all the information, then the main difficulty seems to be you have to reach some new agreement about how you’re actually going to divide the fruits of AI research.

Right now, there’s some implicit status quo, where people who make more AI progress expect to capture some benefits by virtue of having made more AI progress. You could say, no, we’re going to deviate from the status quo and just agree that we’re going to develop AI effectively jointly. Either because it’s literally joint or because we’ve all opened … or the leader has opened themselves up to enough monitoring that they cease to be the leader. If you do that, then you have to reach some agreement where you say, here’s how we compensate the leader for the fact that they were the leader. Either that or the leader has to be willing to say, yep, I used to have a high valuation because I was doing so well in AI, and now I’m just happy to grant that that advantage is going to get eroded, and I’m happy to do that because it reduces the risk of the world being destroyed.

I think both of those seem like reasonable options to me. Which one that you take depends a little bit upon how serious the problem appears to be, like what the actual structure of the field is like, or like the coordinating is more reasonable if the relevant actors are close, such that … well, it’s more reasonable if there’s an obvious leader who’s going to capture the benefits and is reasonably willing to distribute them, or if somehow there’s not a big difference between the players, such as erasing AI as a fact. If you imagine the US and China both believing that, like things are hard if each of them believes that they’re ahead in AI and each of them believes that they’re going to benefit by having AI research which isn’t available to their competitor. Things are hard if both of them believe that they’re ahead, and things are easy if both of them believe that they’re behind.

If they both have an accurate appraisal of the situation and understand there’s not a big difference, then maybe you’re also okay because everyone’s fine saying, sure, I’m fine leaking because I know that that’s roughly the same as … I’m not going to lose a whole lot by leaking information to you.

Takeoff speeds

Robert Wiblin: Okay. Let’s turn now to this question of fast versus slow take off of artificial intelligence. Historically, a lot of people who’ve been worried about AI alignment have tended to take the view that they expected progress to be relatively gradual for a while, and then to suddenly accelerate and take off very quickly over a period of days or weeks or months rather than years. But you’ve, for some time, been promoting the view that you think the take off of general AI is going to be more gradual than that. Do you want to just explain your general view?

Paul Christiano: Yeah, so it’s worth clarifying that when I say slow, I think I still mean very fast compared to most people’s expectations. I think that a transition taking place over a few years, maybe two years between AI having very significant economic impact and literally doing everything sounds pretty plausible. I think when people think about such a tiered transition, to most people on the street, that sounds like a pretty fast takeoff. I think that’s important to clarify. That when I say slow, I don’t mean what most people mean by slow.

Another thing that’s important to clarify is that I think there’s rough agreement amongst the alignment and safety crowd about what would happen if we did have human level AI. That is everyone agrees that at that point, progress has probably exploded and is occurring very quickly, and the main disagreement is about what happens in advance of that. I think I have the view that in advance of that, the world has already changed very substantially. You’re already likely exposed to catastrophic AI risk, and in particular, when someone develops human level AI, it’s not going to emerge in a world like the world of today where we can say that indeed, having human level AI today would give you a decisive strategic advantage. Instead, it will emerge in a world which is already much, much crazier than the world of today, where having a human level AI gives you some more modest advantage.

Robert Wiblin: Yeah, do you want to paint a picture for us of what that world might look like?

Paul Christiano: Yeah, so I guess there are a bunch of different parts of the world, and I can focus on different ones, but I can try and give some random facts or some random view, like facts from that world. They’re not real facts. They’re Paul’s wild speculations. I guess, in terms of calibrating what AI progress looks like, or how rapid it is, I think maybe two things that seem reasonable to think about are, the current rate of progress in information technology in general. That would suggest something like, maybe in the case of AI, like falling in costs by a factor of two every year-ish or every six to 12 months.

Another thing that I think is important to get an intuitive sense of scale is to compare to intelligence in nature. I think when people do intuitive extrapolation of AI, they often think about abilities within the human range. One thing that I do agree with proponents of fast takeoff about is that that’s not a very accurate perspective when thinking about AI.

I think a better way to compare is to look at what evolution was able to do with varying amounts of compute. If you look at what each order of magnitude buys you in nature, you’re going from insects to small fish to lizards to rats to crows to primates to humans. Each of those is one order of magnitude, roughly, so you should be thinking of there are these jumps. It is the case that the difference between insect and lizard feels a lot smaller to us and has less intuitive significance than the difference between primate and human or crow and primate, so when I’m thinking about AI capabilities, I’m imagining, intuitively, and this is not that accurate, but I think is useful as an example to ground things out, I’m imagining this line rising and one day you have, or one year you have an AI which is capable of very simple learning tasks and motor control, and then a few years later … A year later, you have an AI that’s capable of slightly more sophisticated learning, now it learns as well as a crow or something, that AI is starting to get deployed as quickly as possible in the world and having a transformative impact, and then it’s a year later that AI has taken over the process of doing science from humans. Yeah, I think that’s important to have in mind as background for talking about what this world looks like.
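To make the scale of those two calibration points concrete, here is a minimal Python sketch that combines them. It is only an illustration under stated assumptions, not a calculation made in the conversation: it assumes a fixed budget, takes the "cost falls by a factor of two every six to 12 months" figure at face value, and treats each rung of the insect-to-human ladder as exactly one order of magnitude of compute. The function names are hypothetical.

```python
import math

# Rough one-order-of-magnitude steps named in the conversation.
LADDER = ["insects", "small fish", "lizards", "rats", "crows", "primates", "humans"]


def months_per_order_of_magnitude(halving_months: float) -> float:
    """Months until a fixed budget buys 10x the compute,
    assuming cost halves every `halving_months` months."""
    return halving_months * math.log2(10)


def months_between(start: str, end: str, halving_months: float) -> float:
    """Months to climb from one rung of LADDER to another,
    assuming one order of magnitude of compute per rung."""
    steps = LADDER.index(end) - LADDER.index(start)
    return steps * months_per_order_of_magnitude(halving_months)


if __name__ == "__main__":
    for halving in (6, 12):
        per_rung = months_per_order_of_magnitude(halving)
        crow_to_human = months_between("crows", "humans", halving)
        print(f"halving every {halving} months: "
              f"{per_rung:.1f} months per rung, "
              f"{crow_to_human:.1f} months from crows to humans")
```

Under these assumptions, cost declines alone give roughly 20 to 40 months per rung. Growth in spending or algorithmic improvements are not modelled here and would shorten that.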

Robert Wiblin: What tasks can you put an AI that’s as smart as a crow on that are economically valuable?

Paul Christiano: I think there’s a few kinds of answers. One place where I think you definitely have a big impact is in robotics and domains like manufacturing, logistics and construction. That is, I think lower animals are probably, they’re good enough at motor control that you’d have much, much better robotics than you have now. Today, I would say robotics doesn’t really, or robots that learn don’t really work very well or at all. Today the way we get robotics to work is you really organize your manufacturing process around them. They’re quite expensive and tricky. It’s just hard to roll it out. I think in this world, probably even before you have crow level AI, you have robots that are very general and flexible. They can be applied not only on an assembly line, but okay, one, they take the place of humans on assembly lines quite reliably, but they can also then be applied in logistics to loading and unloading trucks, driving trucks, managing warehouses, construction.

Robert Wiblin: Maybe image identification as well?

Paul Christiano: They could certainly do image identification well. I think that’s the sort of thing we get a little bit earlier. I think that’s a large part of … Today those activities are a large part of the economy. Maybe this stuff we just listed is something … I don’t actually know in the US, it’s probably lower here than elsewhere, but still more than 10% of our economy, less than 25%.

There’s another class of activities. If you look at the intellectual work humans do, I think a significant part of it could be done by very cheap AIs at the level of crows or not much more sophisticated than crows. There’s also a significant part that requires a lot more sophistication. I think we’re very uncertain about how hard doing science is. As an example, I think back in the day we would have said playing board games that are designed to tax human intelligence, like playing chess or go is really quite hard, and it feels to humans like they’re really able to leverage all their intelligence doing it.

It turns out that playing chess from the perspective of actually designing a computation to play chess is incredibly easy, so it takes a brain very much smaller than an insect brain in order to play chess much better than a human. I think it’s pretty clear at this point that science makes better use of human brains than chess does, but it’s actually not clear how much better. It’s totally conceivable from our current perspective, I think, that an intelligence that was as smart as a crow, but was actually designed for doing science, actually designed for doing engineering, for advancing technologies as rapidly as possible, it is quite conceivable that such a brain would actually outcompete humans pretty badly at those tasks.

I think that’s another important thing to have in mind, and then when we talk about when stuff goes crazy, I would guess humans are an upper bound for when stuff goes crazy. That is we know that if we had cheap simulated humans, that technological progress would be much, much faster than it is today. But probably stuff goes crazy somewhat before you actually get to humans. It’s not clear how many orders of magnitude smaller a brain can be before it goes crazy. I think probably at least one seems safe, and then two or three is definitely plausible.

Robert Wiblin: It’s a bit surprising to say that science isn’t so hard, and that there might be a brain that, in a sense, is much less intelligent than a human that could blow us out of the water in doing science. Can you explain, can you try to make that more intuitive?

Paul Christiano: Yeah, so I mentioned this analogy to chess, which is when humans play chess, we apply a lot of faculties that we evolved for other purposes to play chess well, and we play chess much, much better than someone using pencil and paper to mechanically play chess at the speed that a human could. We’re able to get a lot of mileage out of all of these other … I know we evolved to be really good at physical manipulation and planning in physical contexts and reasoning about social situations. That makes us, in some sense, it lets us play chess much better than if we didn’t have all these capacities.

That said, if you just write down a simple algorithm for playing chess, and you run it with a tiny, tiny fraction of the compute that a human uses in order to play chess, it crushes humans incredibly consistently. So, in a similar sense, if you imagine this project of look at some technological problem, consider a bunch of possible solutions, understand what the real obstructions are and how we can try and overcome those obstructions, a lot of the stuff we do there, we know that humans are much, much better than a simple mechanical algorithm applied to those tasks. That is we’re able to leverage all of these abilities that we … All these abilities that helped us in the evolutionary environment, we’re able to leverage to do really incredible things in terms of technological progress, or in terms of doing science or designing systems or et cetera.

But what’s not clear is if you actually had created, so again, if you take the computations of the human brain, and you actually put it in a shape that’s optimal for playing chess, it plays chess many, many orders of magnitude better than a human. Similarly, if you took the computation of the human brain and you actually reorganized it, so you said now, instead of a human explicitly considering some possibilities for how to approach this problem, a computer is going to generate a billion possibilities per second for possible solutions to this problem. In many respects, we know that that computation would be much, much better than humans at resolving some parts of science and engineering.

There’s this question of exactly how much leverage we are getting out of all these evolutionary heuristics. It’s not surprising that in the case of chess, we’re getting much less mileage than we do for tasks that more leverage the full range of what the human brain does, or are closer to tasks the human brain was designed for. I think science and technology are an intermediate place, where they’re still really, really not close to what human brains are designed to do. It’s not that surprising if you can make brains that are really a lot better at science and technology than humans are. I think a priori, it’s not that much more surprising for science and technology than it would be for chess.

Robert Wiblin: Okay. I took us some way away from the core of this fast versus slow takeoff discussion. One part of your argument that I think isn’t immediately obvious is that when you’re saying that in a sense the takeoff will be slow, you’re actually saying that dumber AI will have a lot more impact on the economy and on the world than other people think? Why do you disagree with other people about that? Why do you think that earlier versions of machine learning could already be having a transformative impact?

Paul Christiano: I think there’s a bunch of dimensions of this disagreement. An interesting fact, I think, about the effective altruism and AI safety community is that there’s a surprising amount of agreement about takeoff being fast. There’s a really quite large diversity of views about why takeoff will be fast.

Certainly the arguments people would emphasize, if you were to talk with them, would be very, very different, and so my answer to this question is different for different people. I think there’s this general, one general issue, is I think other people more imagine … other people look at the evolutionary record, and they more see this transition between lower primates and humans, where humans seem incredibly good at doing a kind of reasoning that builds on itself and discovers new things and accumulates them over time culturally. They more see that as being this jump that occurred around human intelligence and is likely to be recapitulated in AI. I think I more see that jump as occurring when it did because of the structure of evolution, so evolution was not really trying to optimize … It was not trying to optimize humans for cultural accumulation in any particularly meaningful sense. It was trying to optimize humans for this suite of tasks that primates are engaged in, and incidentally humans became very good at cultural accumulation and reasoning.

I think if you optimize AI systems for reasoning, it appears much, much earlier. If evolution had been trying to make AIs that would build a civilization, or if evolution had been trying to design creatures trying to optimize for creatures that would build a civilization, instead of going straight to humans who have some level of ability at forming a technological civilization, it would have been able to produce crappier technological civilizations earlier. I now think it’s probably not the case that if you left monkeys for long enough you would get a space faring civilization, but I think that’s not for reasons that are directly, I think that’s not a consequence of monkeys just being too dumb to do it, I think it’s largely a consequence of the way that monkeys’ social dynamics work. The way that imitation works amongst monkeys, the way that cultural accumulation works and how often things are forgotten.

I think that this discontinuity that we observe in the historical record between lower primates and humans, I don’t feel like it’s … It certainly provides some indication about what changes you should expect to see in the context of AI, but I don’t feel like it’s giving us a really robust indicator that it’s a really closely analogous situation. That’s one important difference. There’s this jump in the evolutionary record. I expect, to the extent there’s a similar jump, we would see it significantly earlier, and we would jump to something significantly dumber than humans. It’s a significant difference between my view and the view of some, I don’t know, maybe one third of people who think takeoff is likely to be fast.

There are, of course, other differences, so in general, I look at the historical record, and I think it feels to me like there’s an extremely strong regularity of the form: before you’re able to make a really great version of something, you’re able to make a much, much worse version of something. For example, before you’re able to make a really fast computer, you’re able to make a really bad computer. Before you’re able to make a really big explosive, you’re able to make a really crappy explosive that’s unreliable and extremely expensive. Before you’re able to make a robot that’s able to do some very interesting tasks, you’re able to make a robot which is able to do the tasks with lower reliability or a greater expense or in a narrower range of cases. That seems to me like a pretty robust regularity.

It seems like it’s most robust in cases where the metric that we’re tracking is something that people are really trying to optimize. If you’re looking at a metric that people aren’t trying to optimize, like how many books are there in the world. How many books are there in the world is a property that changes discontinuously over the historical record. I think the reason for that is just ’cause no one is trying to increase the number of books in the world. It’s incidental. There is a point in history where books are a relatively inefficient way of doing something, and it switched to books being an efficient way to do something, and the number of books increases dramatically.

If you look at a measure that people are actually trying to optimize, like how quickly information is transmitted, how many facts the average person knows, it’s a … not the average person, but how many facts someone trying to learn facts knows, those metrics aren’t going to change discontinuously in the same way that how many books exist will change. I think how smart is your AI is the kind of thing that’s not going to change discontinuously. That’s the kind of thing people are really, really pushing on and caring a lot about, how economically valuable is your AI.

I think that this historical regularity probably applies to the case of AI. There are a few plausible historical exceptions. I think the strongest one, by far, is the nuclear weapons case, but I think that in that case, first, there are a lot of very good a priori arguments for discontinuity that are much, much stronger than the arguments we give for AI. Even as such, I think the extent of the discontinuity is normally overstated by people talking about the historical record. That’s a second disagreement.

I think a third disagreement, is I think people make a lot of sloppy arguments or arguments that don’t quite work. I think they’re, I feel like, a little bit less uncertain because I feel like it’s just a matter of if you work through the arguments, they don’t really hold together.

I think an example of that is I think people often make this argument of imagining your AI as being a human who makes mistakes sometimes, just an epsilon fraction of the time or fraction of cases where your AI can’t do what a human could do. You’re just decreasing epsilon over time until you hit some critical threshold where now your AI becomes super useful. Once it’s reliable enough, like when it gets to zero mistakes or one in a million mistakes. I think that model is like … there’s not actually, or it looks a priori like a reasonable-ish model, but then you actually think about it. Your AI is not like a human that’s degraded in some way. If you take a human and you degrade them, there is a discontinuity at really low levels of degradation, but in fact, your AI is falling along a very different trajectory. The conclusions from that model turn out to be very specific to the way that you were thinking of AI as a degraded human. Those are the three classes of disagreements.

Robert Wiblin: Let’s take it as given that you’re right that an AI takeoff will be more gradual than some people think. Although, I guess, still very fast by human time scales. What kind of strategic implications does that have for you and me today trying to make that transition go better?

Paul Christiano: I think the biggest strategic question that I think about regularly that’s influenced by this is to what extent early developers of AI will have a lot of leeway to do what they want with the AI that they’ve built. How much advantage will they have over the rest of the world?

I think some people have a model in which early developers of AI will be at huge advantage. They can take their time or they can be very picky about how they want to deploy their AI, and nevertheless, radically reshape the world. I think that’s conceivable, but it’s much more likely that the earlier developers of AI will be developing AI in a world that already contains quite a lot of AI that’s almost as good, and they really won’t have that much breathing room. They won’t be able to reap a tremendous windfall profit. They won’t be able to be really picky about how they use their AI. You won’t be able to take your human level AI and send it out on the internet to take over every computer because this will occur in a world where all the computers that were easy to take over have already been taken over by much dumber AIs.

It’s more like you’re existing in this soup of a bunch of very powerful systems. You can’t just go out into a world … people imagine something like the world of today and human level AI venturing out into that world. In that scenario, you’re able to do an incredible amount of stuff. You’re able to basically steal everyone’s stuff if you want to steal everyone’s stuff. You’re able to win a war if you want to win a war. I think that that model, so that model I think is less likely under a slow takeoff, though it still depends on quantitatively exactly how slow. It especially depends on maybe there’s some way … if a military is to develop AI in a way where they selectively … They can develop AI in a way that would increase the probability of this outcome if they’re really aiming for this outcome of having a decisive strategic advantage. If this doesn’t happen, if the person who develops AI doesn’t have this kind of leeway, then there are, I think the nature of this safety problem changes a little bit.

In one respect, it gets harder because now you really want to be building an AI that can do … you’re not going to get to be picky about what tasks you’re applying your AI to. You need an AI that can be applied to any task. That’s going to be an AI that can compete with a world full of a bunch of other AIs. You can’t just say I’m going to focus on those tasks where there’s a clear definition of what I’m trying to do, or I’m just going to pick a particular task, which is sufficient to obtain a strategic advantage and focus on that one. You really have to say, based on the way the world is set up, there’s a bunch of tasks that people want to apply AI to, and you need to be able to make those AI safe.

In that respect, it makes the problem substantially harder. It makes the problem easier in the sense that now you do get a little bit of a learning period. It’s like as AI ramps up, people get to see a bunch of stuff going wrong. We get to roll out a bunch of systems and see how they work. So it’s not like there’s this one shot. There’s this moment where you press the button and then your AI goes, and it either destroys the world or it doesn’t. It’s more that there’s a whole bunch of buttons. Every day you push a new button, and if you mess up then you’re very unhappy that day, but it’s not literally the end of the world until you push the button the 60th time.

It also changes the nature of the policy or coordination problem a little bit. I think that tends to make the coordination problem harder and changes your sense of exactly what that problem will look like. In particular, it’s unlikely to be between two AI developers who are racing to build a powerful AI that then takes over the world. It’s more likely there are many people developing AI, or not many, but whatever. Let’s say there are a few companies developing AI, which is then being used by a very, very large number of people, both in law enforcement and in the military and in private industry. The kind of agreement you want is a new agreement between those players.

Again, the problem is easier in some sense, in that now the military significance is not as clear. It’s conceivable that that industry isn’t nationalized. That this development isn’t being done by military. That it’s instead being treated in a similar way to other strategically important industries.

Then it’s harder because there’s not just this one. You don’t have to hold your breath until an AI takes over the world and everything changes. You need to actually set up some sustainable regime where people are happy with the way AI development is going. People are going to continue to think, engage in normal economic ways as they’re developing AI. In that sense, the problem gets harder. I think both problems, some aspects of the problem, both the technical and policy problems become harder, some aspects become easier.

Robert Wiblin: Yeah. That’s a very good answer. Given that other people would disagree with you, though, what do you think are the chances that you’re wrong about this, and what’s the counter argument that gives you the greatest concern?

Paul Christiano: Yeah, I feel pretty uncertain about this question. I think we could try to quantify an answer to how fast this takeoff is by talking about how much time elapses between certain benchmarks being met. If you have a one year lead in the development of AI, how much of an advantage does that give you at various points in development.

I think that when I break out very concrete consequences in the world, like if I ask how likely is it that the person who develops AI will be able to achieve a decisive strategic advantage for some operationalization at some point, then I find myself disagreeing with other people’s probabilities, but I can’t disagree that strongly. Maybe other people will assign a 2/3 probability to that event, and I’ll assign a 1/4 probability to that event, which is a pretty big disagreement, but certainly doesn’t look like either side being confident. Let’s say 2/3 versus 1/3. It doesn’t look like either side being super confident in their answer, and everyone needs to be willing to pursue policies that are robust across that uncertainty.

I think the thing that makes me most sympathetic to the fast takeoff view is not any argument about qualitative change around human level. It’s more an argument just of like look quantitatively about the speed of development and think about if you were scaling up on that time scale. If every three months your AIs were equivalent to an animal with a brain twice as large, it would not be many months between AIs that seemed minimally useful and AI that was conferring a strategic advantage. It’s just this quantitative question of exactly how fast this development is, and even if there’s no qualitative change, you can have development that’s fast enough that it’s correctly described as a fast takeoff. In that case, the view I’ve described of the world is not as accurate. We’re more in that scenario where the AI developer can just keep things under wraps during these extra nine months, and then, if they’d like, have a lot of leeway about what to do.
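The quantitative point can be sketched in a few lines of Python. This is a hypothetical illustration, assuming exactly the rate mentioned in the conversation (equivalent brain size doubling every three months) and smooth exponential growth in between. The function names are made up for this example.

```python
import math

DOUBLING_MONTHS = 3.0  # assumption taken from the hypothetical in the conversation


def size_multiplier(months: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Factor by which the equivalent brain size grows after `months`."""
    return 2.0 ** (months / doubling_months)


def months_to_multiply(factor: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months needed for the equivalent brain size to grow by `factor`."""
    return doubling_months * math.log2(factor)


if __name__ == "__main__":
    print(f"after 9 months: {size_multiplier(9):.0f}x")
    print(f"one order of magnitude: {months_to_multiply(10):.1f} months")
    print(f"two orders of magnitude: {months_to_multiply(100):.1f} months")
```

At that rate, nine months is three doublings (8x), and each order of magnitude takes about ten months, so a lead of well under a year could already span a large capability gap.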

Robert Wiblin: How strong do you think is the argument that people involved in AI alignment work should focus on the fast takeoff scenario even if it’s less likely because they expect to get more leverage, personally, if that scenario does come to pass?

Paul Christiano: I think that’s a … There’s definitely a consideration in that direction. I think it tends to be significantly weaker than the … There’s a similar argument for focusing on short timelines, which I think is quite a bit stronger. I mean, I think that … The way that argument runs, the reason you might focus on fast timelines, or on fast takeoff, is because over the course of a slow takeoff, there will be lots of opportunities to do additional work and additional experimentation to figure out what’s going on.

If you have a view where that work can just replace anything you could do now, then anything you could do now becomes relatively unimportant. If you have a view where there’s any complementarity between work we do now and work that’s done then … Imagine you have this, let’s say, one to two year period where people are really scrambling, where it becomes clear to many people that there’s a serious problem here, and we’d like to fix it. If there’s any kind of complementarity between the work we do now and the work that they’re doing during that period, then that doesn’t really undercut doing work now.

I think that it’s good. We can work in advance to do things like understand the nature of the problem, the nature of the alignment problem, understand much more about how difficult the problem is, set up institutions such that they’re prepared to make these investments, and I think those things are maybe a little bit better in fast takeoff worlds, but it’s not a huge difference. I think it’s not more than … intuitively, I think it’s not more than a factor of two, but I haven’t thought that much about it. It might be … Maybe it’s a little more than that.

The short timelines thing I think is a much larger update.

Robert Wiblin: Yeah. Tell us about that.

Paul Christiano: Just, so if you think that AI might be surprisingly soon, in general, what surprisingly soon means is that many people are surprised, so they haven’t made much investment. In those worlds, there’s a lot less, much less has been done. Certainly, if AI was developed in 50 years, I do not think it’s the case that the research I’m doing now could really, very plausibly be relevant, just because there’s so much time that other people are going to have to rediscover the same things.

If you get a year ahead now, that means maybe five years from now you’re 11 months ahead of where you would have been otherwise, and five years later, you’re eight months ahead of where you would have been otherwise. Over time, the advantage just shrinks more and more. If AI’s developed in 10 years, then something crazy happened, people were completely, the world at large has really been asleep at the wheel if we’re going to have human level AI in 10 years, and in that world, it’s very easy to have very large impact.

Of course, if AI is developed in 50 years, it could happen that people are asleep at the wheel in 40 years. They can independently make those … I don’t know, you can invest now for the case that people are asleep at the wheel. You aren’t really foreclosing the possibility of people being asleep in the future. If they’re not asleep at the wheel in the future, then the work we do now has a much lower impact.

It’s mostly, I guess, just a neglectedness argument where you wouldn’t really expect AI to be incredibly neglected. If, in fact, people with short timelines are right, if the 15% in 10 years, 35% in 20 years is right, then AI is absurdly neglected at the moment. Right? In that world, what we’re currently seeing in ML is not unjustified hype but desperately trying to catch up to what would be an acceptable level of investment given the actual probabilities we face.

Robert Wiblin: Earlier you mentioned that if you have this two year period, where economic growth has really accelerated in a very visible way, that people would already be freaking out. Do you have a vision for exactly what that freaking out would look like, and what implications that has?

Paul Christiano: I think there’s different domains and different consequences in different domains. Amongst AI researchers, I think a big consequence is that a bunch of discussions that are currently hypothetical and strange, the way we talk about catastrophic risk caused by AI. We talk about the possibility of AI much smarter than humans, or we talk about decisions being made by machines, a bunch of those issues will stop being weird considerations or speculative arguments and will start being this is basically already happening. We’re really freaked out about where this is going, or we feel very viscerally concerned.

I think that’s a thing that will have a significant effect on both what kind of research people are doing and also how open they are to various kinds of coordination. I guess that’s a very optimistic view, and I think it’s totally plausible that … Many people are much more pessimistic on that front than I am, but I feel like if we’re in this regime, people will really be thinking about [prioritizing] the thing that’s clearly coming, and they will be thinking about catastrophic risk from AI as even more clear than powerful AI, just because we’ll be living in this world where AI is really … you’re already living in a world where stuff is changing too fast for humans to understand in quite a clear way. In some respects, our current world has that character, and that makes it a lot easier to make this case than it would have been 15 years ago. But that will be much, much more the case in the future.

Robert Wiblin: Can you imagine countries and firms hoarding computational ability because they don’t want to allow anyone else to get in on the game?

Paul Christiano: I think mostly I imagine the default is just that asset prices get bid up a ton. It’s not that you hoard computation so much as just computers become incredibly expensive, and that flows backwards so that semiconductor fabrication becomes incredibly expensive. IP chip companies become relatively valuable. That could easily get competed away. I think to first order, the economic story is probably what I expect, but then I think if you look at the world, and you imagine asset prices in some area rising by a factor of 10 over the course of a few years or a year, I think that it’s pretty likely that the normal … I think the rough economic story is probably still basically right, but markets, or the formal structure of markets, is pretty easy to break down in that case.

You can easily end up in a world where computation is very expensive, but prices are too sticky for prices to actually adjust in the correct way. Instead, that ends up looking like computers are still somewhat cheap, but now effectively they’re impossible for everyone to buy, or machine learning hardware is effectively impossible for people to buy at the nominal price. That world might look more like people hoarding computation, which I would say is mostly a symptom of an inefficient market world. It’s just that the price of your computer has gone up by an absurd amount because everyone thinks this is incredibly important now, and it’s hard to produce computers as fast as people want them. In an inefficient market world, that ends up looking like freaking out, and takes the form partly of a policy response instead of a market response, so strategic behavior by militaries and large firms.


Robert Wiblin: Okay, that has been the discussion of how fast or gradual this transition will be. Let’s talk now about when you think this thing might happen. What’s your best guess for, yeah, AI progress timelines?

Paul Christiano: I normally think about this question in terms of what’s the probability of some particular development by 10 or 20 years rather than thinking about a median, because those seem like the most decision-relevant numbers, basically. Maybe one could also, if you had very short timelines, give probabilities on less than 10 years. I think that my probability for human labor being obsolete within 10 years is probably something in the ballpark of 15%, and within 20 years is something in the ballpark of 35%. Prior to human labor being obsolete, you then have some window of maybe a few years during which stuff is already getting quite extremely crazy. Probably AI [risk] becomes a big deal. We could have permanently sunk the ship somewhat before, one to two years before, we actually have human labor being obsolete.

Those are my current best guesses. I feel super uncertain about … I have numbers offhand because I’ve been asked before, but I still feel very uncertain about those numbers. I think it’s quite likely they’ll change over the coming year. Not just because new evidence comes in, but also because I continue to reflect on my views. I feel like there are a lot of people, whose views I think are quite reasonable, making reasonable arguments for numbers both higher and lower, for both much shorter timelines than that and longer timelines than that.

Overall, I come away pretty confused about why people currently are as confident as they are in their views. I think compared to the world at large, the view I’ve described is incredibly aggressive, incredibly soon. I think compared to the community of people who think about this a lot, I’m still somewhere in the middle of the distribution. Amongst people whose thinking I most respect, maybe I’m also somewhere in the middle of the distribution. I don’t quite understand why people come away with much higher or much lower numbers than that. I don’t have a good … It seems to me like the arguments people are making on both sides are really quite shaky. I can totally imagine that after doing … After being more thoughtful, I would come away with higher or lower numbers, but I don’t feel convinced that people who are much more confident one way or the other have actually done the kind of analysis that I should defer to them on. That said, I also don’t think I’ve done the kind of analysis that other people should really be deferring to me on.

Robert Wiblin: There’s been discussion of fire alarms, which are kind of indicators that you would get ahead of time, that you’re about to develop a really transformative AI. Do you think that there will be fire alarms that will give us several years, or five or ten-years’ notice that this is going to happen? And what might those alarms look like?

Paul Christiano: I think that the answer to this question depends a lot on … There are many different ways that AI could look. Different ways that AI could look have different signs in advance. I think if AI is developed very soon, say within the next 20 years, I think the best single guess for the way that it looks is a sort of … The techniques that we are using are more similar to evolution than they are to learning occurring within a human brain. And a way to get indications about where things are going is by comparing how well those techniques are working to how well evolution was able to do with different levels of … different computational resources. On that perspective, or in that scenario, which I think is the most likely scenario within 20 years, I think the most likely fire alarms are successfully replicating the intelligence of lower animals.

Things like, right now we’re kind of at the stage where AI systems are … the sophistication is probably somewhere in the range of insect abilities. That’s my current best guess. And I’m very uncertain about that. I think as you move from insects to small vertebrates to larger vertebrates up to mice and then birds and so on, I think it becomes much, much more obvious. It’s easier to make this comparison and the behaviors become more qualitatively distinct. Also, just every order of magnitude gets you an order of magnitude closer to humans.

I think before having broadly human-level AI, a reasonably good warning sign would be broadly lizard-level or broadly mouse-level AI, that is, learning algorithms which are able to do about as well as a mouse in a distribution of environments that’s about as broad as the distribution of environments that mice are evolved to handle. I think that’s a bit of a problematic alarm for two reasons. One, it’s actually quite difficult to get a distribution of environments as broad as the distribution that a mouse faces, so there’s likely to be remaining concern. If you can replicate everything a mouse can do in a lab, that’s maybe not so impressive, and it’s very difficult to actually test for some distribution of environments. Is it really flexing the most impressive mouse skills?

I think that won’t be a huge problem for people … A very reasonable person looking at the evidence will still be able to get a good indication, but it will be a huge problem for establishing consensus about what’s going on. That’s one problem. And then the other problem was this issue I mentioned where it seems like transformative impacts should come significantly before broadly human-level AI. I think that a mouse-level AI would probably not give you that much warning, or broadly mouse-level AI would probably not give you that much warning. And so you need to be able to look a little bit earlier than mice. It’s plausible that in fact one should be regarding … One should really be diving into the comparison to insects now and say, can we really do this? It’s plausible to me that that’s the kind of … If we’re in this world where our procedures are similar to evolution, it’s plausible to me the insect thing should be a good indication, or one of the better indications, that we’ll be able to get in advance.

Robert Wiblin: There was this recent blog post that was doing the rounds on social media called, “An AI Winter is Coming,” which was broadly making the argument that people are realizing that current machine learning techniques can’t do the things that people have been hoping that they’ll be able to do over the last couple of years. That the range of situations they can handle is much more limited, and that the author expects that the economic opportunities for them are gonna dry up somewhat, and investment will shrink. As we’ve seen, so they claim, in the past when there’s been a lot of enthusiasm about AI, and then it hasn’t actually been able to do the things that we claimed. Do you think there’s much chance that that’s correct, and what’s your general take on this AI boom, AI winter view?

Paul Christiano: I feel like the position in that post is fairly extreme in a way that’s not very plausible. For example, I think the author of that post is pessimistic about self-driving cars actually working because they won’t be sufficiently reliable. I think it’s correct to be like, this is a hard problem. I think that … I would be extremely happy to take a bet at pretty good odds against the world they’re imagining. I guess I … I also feel somewhat similarly about robotics at this point. I think what we’re currently able to do in the lab is approaching good enough that industrial robotics can … That’s a big … If the technology is able to work well, it’s a lot of value. I think what we’re able to do in the lab is a very strong indication that that is going to work in the reasonably short term.

I think those things are pretty good indications that, say, current investment in the field is probably justified by, or the level of investment is plausible given, the level of applications that we can foresee quite easily, though I don’t wanna comment on the form of investment. There’s maybe a second … I think the arguments in the post are kind of wacky and not very careful. I think one thing that makes it a little bit tricky is this comparison. If you compare the kind of AI we’re building now to human intelligence, I think literally until the very end, actually, probably after the very end, you’re just gonna be like, look, there’s all these things that humans can do that our algorithms can’t do. I think one problem is that’s just kind of a terrible way to do the comparison. That’s the kind of comparison that’s predictably going to leave you being really skeptical until the very, very end.

I think there’s another question, which is, and maybe this is what they were getting at, which is, there’s a sense maybe amongst the … especially certainly deep-learning true believers, at the moment, that you can just take existing techniques and scale them quite far. If you just keep going, things are gonna keep getting better and better, and we’re gonna get all the way to powerful AI like that. I think it’s a quite interesting question whether that is … If we’re in that world, then we’re just gonna see machine learning continue to grow, so then we would not be in a bubble. We would be in the beginning of this ramp-up to spending some substantial fraction of GDP on machine learning. That’s one possibility. Another possibility is that some applications are going to work well, so maybe we’ll get some simple robotics applications working well which could be quite large, that could easily have impacts in hundreds of billions or trillions of dollars. But things are gonna dry up long before they get to human level. I think that seems quite conceivable. I would maybe be … Maybe I think it’s a little bit more likely than not that at some point things pull back. I mean it’s somewhat less than 50% that the current wave of enthusiasm is going to just continue going up until we build human-level AI. But I also think that’s kind of plausible.

I think people, they really want to call bubbles in a way that results in a lot of irrationality. I think Scott Sumner writes about this a lot and I mostly agree with his take. When enthusiasm about something gets really high, that doesn’t mean it’s guaranteed that it’s gonna continue going up. It can just be a bet that there’s a one-third chance that it’s gonna continue going up, or a one-half chance, and I think that … People are really happy about being self-satisfied after calling the bubble, after calling a level of enthusiasm that’s unjustified. Sometimes they’re right ex-ante, and the fact that sometimes those calls are right ex-ante makes it a lot more attractive to take this position. I think a lot of the time, ex-post, it was fine to say this was a bubble, but ex-ante, I think it’s worth investing a bunch on the possibility that something is really, really important. I think that’s kind of where we’re at.

I think that the arguments people are making that deep learning is doomed are mostly pretty weak. For example, because they’re comparing deep learning to human intelligence, and that’s just not the way to run this extrapolation. The way to run the extrapolation is to think about how tiny existing models are compared to the brain, and think about, on the model where we’re able to do a brain in 10 or 20 years, what should we be able to do now? And actually make that comparison instead of trying to say, look at all these tasks humans can do.

How can individuals protect themselves financially?

Robert Wiblin: What kinds of things should people do before we have an artificial general intelligence in order to, I guess, protect themselves financially, if they’re potentially going to lose their jobs? Is there really anything meaningful that people can do to shield themselves from potentially negative effects?

Paul Christiano: If the world continues to go, well … If all that happens is that we build AI, and it just works the way that it would work in an efficient market world, there’s no crazy turbulence, then the main change is, you shift from having … Currently two-thirds of GDP gets paid out roughly as income. I think if you have a transition to human labor being obsolete, then you fall to roughly zero of GDP being paid as income, and all of it is paid out as returns on capital. From the perspective of a normal person, you either want to be benefiting from capital indirectly, like living in a state that uses capital to fund redistribution, or you just wanna have some savings. There’s a question of how you’d wanna … The market is not really anticipating AI being a huge thing over 10 or 20 years, so you might wanna further hedge and say … If you thought this was pretty likely, then you may want to take a bet against the market and say, invest in stuff that’s gonna be valuable in those cases.

I think that, mostly, the very naive guess is not a crazy guess for how to do that. Investing more in tech companies. I am pretty optimistic about investing in semiconductor companies. Chip companies seem reasonable. A bunch of stuff that’s complementary to AI is going to become valuable, so natural resources get bid up. In an efficient market world, the price of natural resources is one of the main things that benefits. As you make human labor really cheap, you just become limited on resources a lot. People who own Amazon presumably benefit a huge amount. People who run logistics, people who run manufacturing, etc. I think that generally just owning capital seems pretty good. Unfortunately, right now is not a great time to be investing, but still, I think that’s not a dominant consideration when determining how much you should save.

Robert Wiblin: You say it’s bad just because the stock market in general looks overvalued, based on price-to-earnings ratios?

Paul Christiano: Yeah, it’s hard to know what overvalued means exactly, but certainly it seems reasonable to think of it in terms of, if you buy a dollar of stocks, how much earnings are there to go around for that dollar of stocks, and it’s pretty low, pretty unusually low. This might be how it is forever. I guess if you have the kind of view that I’m describing, if you think we’re gonna move to an economy that’s growing extremely rapidly, then you have to bet that the rate of return on capital is gonna go up, and so it’s kind of … In some sense, you need to invest early because you wanna actually be owning physical assets, since that’s where all of the value is going to accrue. But it’s also a bummer to lock in relatively low rates of return.

Robert Wiblin: In the normal scenario, where that doesn’t happen?

Paul Christiano: No, in the … Even in … Suppose I make someone a loan. A way people normally hold capital would be making a loan … You make a loan now, you make a loan at 1% real interest for 20 years. You’re pretty bummed if then people develop AI, and now the economy is growing by 25% a year. Your 1% a year loan is looking pretty crappy.

And you’re pretty unhappy about that. Stocks are a little bit better than that, but it depends a lot on … Yeah, stocks still take a little bit of a beating from that. I think this generally is a consideration that undercuts the basic … I think the basic thing you would do if you expected AI would be save more, earn more capital if you can. I think that’s undercut a little bit by the market being priced such that it’s hard, which could be caused by a bunch of people doing that, but that’s not why it’s happening. Prices aren’t being bid up because everyone is reasoning in this way. Prices are being bid up just ’cause of unrelated cyclical factors.
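The loan example above is compound-growth arithmetic, and a short calculation makes the gap concrete. This is only an illustrative sketch: it assumes, for simplicity, that the 25% growth rate applies for the whole 20-year term, which is not something stated in the conversation, and the function name `compound` is made up for this example.

```python
def compound(rate: float, years: int) -> float:
    """Growth factor after compounding `rate` annually for `years` years."""
    return (1.0 + rate) ** years


YEARS = 20

# Lender's position: 1% real interest per year for 20 years.
loan_factor = compound(0.01, YEARS)      # about 1.22

# Economy's size: 25% growth per year (assumed here to hold for all 20 years).
economy_factor = compound(0.25, YEARS)   # about 86.7

# The lender's claim on the economy, relative to where it started.
relative_share = loan_factor / economy_factor

print(f"loan grows {loan_factor:.2f}x, economy grows {economy_factor:.1f}x")
print(f"lender's relative share shrinks to {relative_share:.1%} of its original size")
```

If fast growth only starts partway through the loan, the gap is smaller, but the direction of the effect is the same.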


Robert Wiblin: Let’s talk now about some of the actual technical ideas you’ve had for how to make machine learning safer. One of those has been called iterated intelligence distillation and amplification, sometimes abbreviated as IDA. What is that idea in a nutshell?

Paul Christiano: I think the starting point is realizing that it is easier to train an AI system, or it currently seems easier to train an aligned AI system, if you have access to some kind of overseer that’s smarter than the AI you’re trying to train. A lot of the traditional arguments about why alignment is really hard, or why the problem might be intractably difficult, really push on the fact that you’re trying to train, say, a superintelligence, and you’re just a human.

And similarly, if you look at existing techniques, if you look at the kind of work people are currently doing in more mainstream alignment work, it’s often implicitly predicated on the assumption that there’s a human who can understand what the AI is doing, or there’s a human who can behave approximately rationally, or a human who can evaluate how good the AI system’s behavior is, or a human who can peer in at what the AI system is thinking and make sense of that decision process.

And sometimes this dependence is a little bit subtle, but it seems to me like it’s extremely common. Even when people aren’t acknowledging it explicitly, a lot of the techniques are gonna have a hard time scaling to domains where the AI is a lot smarter than the overseer who’s training it. I think motivated by that observation you could say, let’s try and split the alignment problem into two parts, one of which is to try and train an aligned AI, assuming that you have an overseer smarter than that AI, and the second part is to actually produce an overseer who’s smart enough to use that process or smart enough to train that AI.

The idea in iterated amplification is to start from a weak AI. At the beginning of training you can use a human. A human is smarter than your AI, so they can train the system. As the AI acquires capabilities that are comparable to those of a human, the human can use the AI that they’re currently training as an assistant, to help them act as a more competent overseer.

Over the course of training, you have this AI that’s getting more and more competent, the human at every point in time uses several copies of the current AI as assistants, to help them make smarter decisions. And the hope is that that process both preserves alignment and allows this overseer to always be smarter than the AI they’re trying to train. And so the key steps of the analysis there are both solving this problem, the first problem I mentioned of training an AI when you have a smarter overseer, and then actually analyzing the behavior of the system, consisting of a human plus several copies of the current AI acting as assistants to the human to help them make good decisions.

In particular, as you move along the training, by the end of training, the human’s role becomes kind of minimal, like if we imagine training superintelligence. In that regime, we’re just saying, can you somehow put together several copies of your current AI to act as the overseer? You have this AI trying to … Hopefully at each step it remains aligned. You put together a few copies of the AI to act as an overseer for itself.
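The loop described above (train an agent, let the overseer use copies of it as assistants, then train the next agent against that stronger overseer) can be sketched with a deliberately trivial toy. Everything here is an assumption for illustration: the task (summing a list), the capability limit `max_len`, and the names `make_agent`, `amplify`, and `iterate` are all made up, and the "training" step is faked by wrapping the overseer rather than by any real learning. The toy also says nothing about whether alignment is preserved, which is the open question in the actual proposal.

```python
from typing import Callable, List

Agent = Callable[[List[int]], int]


def make_agent(answer: Agent, max_len: int) -> Agent:
    """Build an agent that answers tasks up to `max_len` and refuses harder ones."""
    def agent(xs: List[int]) -> int:
        if len(xs) > max_len:
            raise ValueError("task exceeds this agent's capability")
        return answer(xs)
    return agent


def amplify(agent: Agent) -> Agent:
    """Overseer: split the task in two and delegate each half to a copy of the agent."""
    def overseer(xs: List[int]) -> int:
        mid = len(xs) // 2
        return agent(xs[:mid]) + agent(xs[mid:])
    return overseer


def iterate(rounds: int, base_len: int = 2) -> Agent:
    """Alternate amplification and (faked) training for `rounds` rounds."""
    agent = make_agent(sum, base_len)  # round 0: the weak agent a human can oversee directly
    max_len = base_len
    for _ in range(rounds):
        overseer = amplify(agent)               # overseer is stronger than the current agent
        max_len *= 2                            # it can handle tasks twice as large
        agent = make_agent(overseer, max_len)   # stand-in for training the next agent
    return agent


agent = iterate(rounds=3)
print(agent(list(range(16))))  # 120
```

In this toy the overseer at each round can handle tasks twice as large as the agent it is built from, which is the property the real scheme hopes to get from a team of slightly weaker copies.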

Robert Wiblin: How is it that the kind of training AI and the human are gonna be smarter than the AI that they’re trying to train? I mean they’re trying to make something that’s smarter than them, right? But you’re saying that as a … That in combination, they can do a better job?

Paul Christiano: Yeah, so the key step there, the reason why it’s plausible … Let’s suppose we’re trying to train our AI to some level of capability. Suppose that you imagine the AI gets to think for two seconds before each decision it makes. Then, in this oversight process, we have multiple copies of the AI, each thinking for two seconds. If you imagine this in an analogy to humans, you have one smart human making a decision, and they’re being overseen by a team of 20 slightly less-smart humans. And when I say slightly, I really mean slightly, like they’re one gradient step behind. The hope is, and this is a complicated question or an open question at the moment … If you have this team of 20 very slightly dumber aligned agents, then can they divide up the work of overseeing this AI, so that even though they’re a tiny bit dumber, by working together, they can do it.

Robert Wiblin: Okay, so is the idea that by throwing more processing power or more instances behind the somewhat dumber algorithm, they’ll be smarter than the AI that they’re trying to train, and therefore able to keep it under control and give it good instructions in concert with the human?

Paul Christiano: That’s roughly right, yeah. In particular, when we ask, what does this team have to do? We have this overseer, which is now a team of AIs working together. What does the team have to actually do? There may be two big classes of tasks they have to do. One is, they’re gonna provide the incentives to the AI. Our AI is gonna take some action, then our team of overseer AIs is going to evaluate that action and decide how good it was, and that’s gonna be the reward signal that we use to train our AI.

In order for that to work, this team of 20 AIs needs to be able to make a better judgment about how good actions are than the AI that they’re trying to train. Right? Because they can’t … Using this process, you can’t really train an AI to make smarter decisions than that team would make. And so, there, we hope that you can sort of break down … There’s this question, how good an action was action x? You hope that you can break down the problem of evaluating that action into a bunch of tasks, each of which is a little bit easier, and then you can split up those tasks across this team. That’s one of the tasks that the team needs to be able to do, this evaluation. Answer these evaluation questions.

The second category, which we don’t understand as well, is that they would like to ensure that the AI they’re training reliably does the thing they’re trying to incentivize it to do. For example, a technique called adversarial training that you could use for this problem is, if I’m training some AI, and I’m concerned that in some situations it might behave badly, I can try and synthesize situations. I can try and construct hand-crafted situations, designed to look like situations where the AI might behave badly. And instead of running, say, just on real examples, I also run it on these synthetic examples designed to cause problems.

And so in order to do that, I need to be able to do tasks like look at the AI and figure out what are the cases where it would perform poorly, and then also evaluate in such a case, did it do something unacceptable? I think it’s not crazy to think about this with an analogy to humans. Again, you have your team of 20 humans. You have one slightly smarter human, who maybe is … You’re concerned that under some conditions, they’re gonna try and break this process. They’re gonna try and go rogue, do something bad. And this team of 20 humans is responsible for interrogating this human and making sure there’s not a condition under which it’s going to behave poorly. Giving it test cases that hopefully it can’t tell, the human being trained can’t tell, from the real situations that it’s going to encounter.

That’s an example of a technique that you could use for this process of reliability. There are other possible techniques you could use. And those are the two big tasks that this team of AIs needs to be able to solve. It was a team of humans in the analogy, but … This team needs to be able to evaluate behavior to produce good incentives, to train the AI to do what the team wants it to do, and it needs to be able to do some evaluation for reliability, which is not something we currently understand well.
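The adversarial-training idea described above (search for situations where the system behaves badly, then train on those as well as on real examples) can be illustrated with a very small toy. This is a sketch under loud assumptions: situations are just integers, the class `ToyPolicy` and the functions `find_failures` and `train` are invented for this example, "training" is a lookup-table update rather than gradient-based learning, and the overseer is assumed to be able to recognize an unsafe action when it sees one.

```python
from typing import Dict, Iterable, List

SAFE, UNSAFE = "safe", "unsafe"


class ToyPolicy:
    """Toy stand-in for an AI: behaves safely only in situations it was trained on."""

    def __init__(self) -> None:
        self.learned: Dict[int, str] = {}

    def act(self, situation: int) -> str:
        # Untrained situations default to bad behavior (a pessimistic assumption).
        return self.learned.get(situation, UNSAFE)

    def train_on(self, situation: int) -> None:
        self.learned[situation] = SAFE


def find_failures(policy: ToyPolicy, situations: Iterable[int]) -> List[int]:
    """Overseer's job: evaluate the policy and collect situations where it misbehaves."""
    return [s for s in situations if policy.act(s) == UNSAFE]


def train(policy: ToyPolicy, situations: Iterable[int]) -> None:
    """Correct the policy on every situation where the overseer found a failure."""
    for s in find_failures(policy, situations):
        policy.train_on(s)


policy = ToyPolicy()
real_examples = range(0, 990)          # situations that show up in ordinary training data
synthetic_examples = range(990, 1000)  # rare situations the overseer constructs by hand
all_situations = range(0, 1000)

train(policy, real_examples)
failures_before = find_failures(policy, all_situations)

train(policy, synthetic_examples)      # the adversarial step
failures_after = find_failures(policy, all_situations)

print(len(failures_before), len(failures_after))  # 10 0
```

The hard part in practice is exactly what the toy assumes away: constructing synthetic situations the trained system cannot distinguish from real ones, and covering the failures that matter.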

Robert Wiblin: At its core, you’re going to try to get somewhat dumber AIs and humans together to come up with a training process by which they figure out whether this smarter AI that they’re trying to develop is behaving in the way that they want, by designing particular scenarios to test whether that’s the case. And even though they’re not quite as smart, I guess, in this model, because you’re throwing quite a lot of power behind that somewhat simpler task of just evaluating whether it’s doing the right thing, you hope that that way you’ll be able to gradually scale up and not lose alignment at any particular point in time.

Paul Christiano: Yeah, that’s right. I guess it’s worth pointing out again that the … Generating scenarios, that’s one possible way of testing, trying to get the system to behave robustly, robustly to do the right thing. There are other possible approaches. You could also try and use an approach where you open up the brain of this AI you’re trying to train. Use something like interpretability techniques that people are currently working on, to understand how it’s thinking, and say, ah, now that I understand how it’s thinking, I see that here’s a place that it’s thinking that I wouldn’t want it to be thinking. And I can tell from that that it will fail in the scenario. Or I can just directly say, no, that was not a good way to be thinking about the problem, and penalize that. One of the major things this group is doing is just determining incentives for the AI that they’re training. This team of slightly dumber humans is just determining what … They’re evaluating the AI on realistic examples, on examples that appear in the real world and saying, how good was its behavior in this case? How good was its behavior in that case? And the AI is being trained to maximize those evaluations.

Robert Wiblin: By “incentives,” you mean, do we give it its reward? Do we give it whatever it’s been programmed to try to get?

Paul Christiano: Yeah. I mean, formally, you would really be using gradient descent, where you’re like, yup, we take our AI, we take this evaluation that this team is providing, and then we modify the AI very slightly, so that it gets a slightly higher reward on that, a slightly higher evaluation, or it outputs actions that have higher evaluations on average. And in that setting, actually the AI that you’re starting with is exactly the same as the AIs who are on this team doing the oversight. But after you make this very small perturbation, that perturbation now hopefully gives you an AI that’s very slightly smarter than the other AIs on the team. The AI that’s actually thinking is exactly as smart as the ones on the team. It’s only as you consider these possible perturbations that you hope that the perturbations are like epsilon smarter. And that’s how training would normally work, where you’d have some evaluation, consider an AI, run it, perturb it to get slightly better performance, repeat.
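The "evaluate, perturb slightly, repeat" loop in this answer is ordinary gradient-based optimization. The sketch below is a minimal illustration, not the actual training setup: the AI is reduced to a single number `theta`, the overseers' judgment is a made-up function `evaluation` that peaks at 3.0, and the gradient is estimated by finite differences instead of backpropagation.

```python
def evaluation(theta: float) -> float:
    """Hypothetical stand-in for the overseers' score of the AI's behavior.

    Higher is better; the best behavior in this toy is theta == 3.0.
    """
    return -(theta - 3.0) ** 2


def train(theta: float, steps: int = 100, lr: float = 0.1, eps: float = 1e-5) -> float:
    """Repeatedly nudge theta in the direction that raises the evaluation."""
    for _ in range(steps):
        # Estimate how the evaluation changes under a tiny perturbation of theta.
        grad = (evaluation(theta + eps) - evaluation(theta - eps)) / (2 * eps)
        # Apply a small update: the new theta scores slightly higher than the old one.
        theta += lr * grad
    return theta


theta = train(0.0)
print(round(theta, 3))  # 3.0
```

Each update changes the parameter only slightly, which matches the point in the conversation that the AI being trained is at most a small step ahead of the copies doing the oversight.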

Robert Wiblin: Someone emailed me about IDA wanting me to ask you about it and said, “The context here is that I and many others think that IDA is currently the most promising approach to solving the alignment problem, largely because it’s the only real, actual proposal that anyone has made.” Do you think that’s right? And, more generally, what’s been the reaction to this general approach?

Paul Christiano: Yeah. I would say the current situation is, I am very interested in really asking what solutions would look like in … as you scale them up. What is our actual game plan? What is the actual end-game here? That’s a question that relatively few people are interested in, and so very few people are working on. MIRI, the Machine Intelligence Research Institute, is very interested in that question, but they part ways with me by believing that that question is so obviously impossible that it’s not worth thinking about it directly, and instead we should be trying to improve our understanding of the nature of rational agency. That’s the reason, to me, why they are not in the business of trying to produce concrete proposals. It just seems doomed to them. Feels to them like we’re just patching holes in a thing that’s fundamentally not going to work.

And most people in the broader ML community, I would say they take an attitude that’s more like, we don’t really know how the system is going to work until we build it. It’s not that valuable to think about in advance what the actual scheme is going to look like. And that’s the difference there. I think that’s also true for many safety researchers who are more like traditional AI or ML researchers. They would more often say, look, I have a general plan. I’m not going to think in great detail about what that plan is going to look like because I don’t think that thinking is productive, but I’m gonna try and vaguely explain the intuitions, like maybe something like this could work. I think it sort of happens to be the case that basically no one is engaged in the project of actually saying, here is what aligned AI might look like.

I’m trying to aspire to the goal of actually writing down a scheme that could work. There are a few other groups that are also doing that. I guess, at the OpenAI safety team, we also recently published this paper on safety via debate, which I think also has this form of being an actual candidate solution, or something that’s aspiring to be a scalable solution to this problem. Geoffrey Irving was lead author on that. He’s a colleague on the OpenAI safety team. I think that’s coming from a very similar place. And maybe, in some sense, it is a very similar proposal.

I think it’s very likely that either both of these proposals work, or neither of them works. In that sense they’re not really totally independent proposals. But they’re getting at … They’re really pushing on the same facts about the world that let you make AI. Both of them are leveraging AI to help you evaluate your AI.

I think the other big category is work on inverse reinforcement learning, where people are attempting to invert through human behavior and say, given what a human did, here’s what the human wants. Given what the human wants, we can come up with better plans to get what the human wants, and maybe that approach can be scalable. I think the current state of affairs on that is there are some very fundamental problems with making it work, with scaling it up, related to, how do you define what it is that a human wants? How do you relate human behavior to human preferences, given that humans aren’t really the kind of creature that actually has … There’s no slot in the human brain that’s where you put the preferences.

I think unfortunately we haven’t made super much progress on that core of the problem, or what I would consider the core of the problem. I think that’s related to people in that area not thinking of that as being their current, primary goal. That is, they’re not really in the business of saying, and here we’re gonna try and write down something that’s just gonna work, no matter how powerful AI gets. They’re more in the business of saying, let’s understand. Clarify the nature of the problem, make some progress, try and get some intuition for what will allow us to make further progress, and how we could get ourselves in a position where, as AI improves, we’ll be able to adapt to that.

I think it’s not a crazy perspective. But I think that’s how we come to be in this place where there are very, very few concrete proposals that are aspiring to be actual … a scheme you could write down and then run with AI and would actually [yield 01:33:10] AI. I think the overall reaction is there are two kinds of criticisms people have, one of which is, this problem seems just completely hopeless. There are a few reasons people would think that this iterative amplification approach is completely hopeless. They can normally be divided roughly into thinking that organizing a team of 20 AIs to be aligned and smarter than the individual AIs already subsumes the entire alignment problem. In order to do that, you would need to understand how to solve alignment in a very deep way, such that if you understood that, there’d be no need to do any of this, bother with any of the other machinery.

The second common concern is that this robustness problem is just impossibly difficult. In the iterative amplification scheme, as we produce a new AI, we need to verify … Not only do we need to incentivize the AI to do well on the training distribution. We also need to sort of restrict it to not behave really badly off of the training distribution. And there are a bunch of plausible approaches to that that people are currently exploring in the machine learning community. But it’s … I think the current situation is, we don’t see a fundamental reason that’s impossible, but it currently looks really hard. And so many people are suspicious that problem might be impossible.

That’s one kind of negative response is this … Maybe the problem, iterative amplification, cannot be made to work. The other kind of response is, it’s reasonably likely that AI safety is easy enough that we also don’t need any of this machinery. That is, I’ve described this procedure for trying to oversee AI systems that are significantly smarter than humans. Many of the problems on this perspective are only problems when you want things to be scalable to very, very smart AI systems. You might think, instead, look, we just want to build an AI that can take one “pivotal act,” that is an expression people sometimes use for an action an AI could take that would substantially improve our situation, with respect to the alignment problem. Say we want to build an AI which is able to safely take a pivotal act. That doesn’t require being radically smarter than a human or taking actions that are not understandable to a human, so we should really not be focusing on or thinking that much about techniques that work in this weird, extreme regime.

I guess even people in the broader ML community would say, look, I don’t know … They don’t necessarily buy into this framework of, can you just take a pivotal act? But they would still say, look, you’re worrying about a problem which is quite distant, it’s pretty likely that for one reason or another that problem is going to be very easy by the time we get there, or that one of these other approaches we can identify is just going to turn out to work fine. I think both those reactions are quite common. I think there’s also a reasonably big crowd of people who are like, yeah, I’m really interested, coming from a similar perspective to me, where they really want a concrete proposal that they can actually see how it could work. I think that those people tend to be, well, for those who aren’t incredibly pessimistic about this proposal, many of them are pretty optimistic about iterative amplification, or debate, or something along those lines.

Robert Wiblin: That’s a great answer. Yeah, I think it’s really creditable that you actually try to put out ideas for how we could deal with this, and I’ve seen, as you said, very few other people actually try to do that. And people can just read those ideas for themselves on your AI Alignment blog on Medium.


Robert Wiblin: You mentioned another approach that people have been talking about recently, which is debate as a way of aligning AI. You also mentioned inverse reinforcement learning. But we discussed that in the episode last year with Dario Amodei, so we’ll skip that one. But can you just describe the approach in the debate paper, which is somewhat similar, it sounds like, to IDA?

Paul Christiano: Yeah. The idea is, we’re interested in training AI systems to make decisions that are in some respects too complicated for a human to understand. It’s worth pointing out that problems can appear probably long before AI is broadly-human level, because AI’s abilities are very uneven, so it can have an understanding of a domain that is way beyond human understanding of that domain, even while being subhuman in many respects. We want to train this AI to make decisions that are too complex for a human to understand. We’re wondering how do you get a training signal for such an AI? One way, one approach people often take is to pick some actual consequence in the world, like some simple consequence in the world that you could optimize, like whatever, you’re running a company, just … I don’t care how you’re making decisions about that company. All I care about is that they lead to the company having high profit.

We’re interested in moving away from … I think there are serious concerns with that, from a safety perspective. We want to move more towards the regime where, instead of evaluating, yes, this decision had good consequences, but I don’t understand why, we’re evaluating a proposed decision and saying, yeah, we understand that that’s a good decision, so we’re going to give it a high reward because we understand why it’s good. That approach has … I mean, if an AI comes to you and says, “I would like to design the particle accelerator this way because,” and then makes to you an inscrutable argument about physics, you’re faced with this tough choice. You can either sign off on that decision and see if it has good consequences, or you can be like, no, don’t do that ’cause I don’t understand it. But then you’re going to be permanently foreclosing some large space of possible things your AI could do.

Instead, the proposal is, we’re going to have two AIs. One AI’s gonna make a proposal. We can’t directly … That proposal counts on a bunch of complicated facts that we don’t necessarily understand. It’s gonna make some complicated argument about the economy in order to justify that proposal. And we couldn’t actually evaluate that argument. But if we introduce this adversarial agent who can explain to us why the proposal that was made is bad, and the original agent, if this critique has a flaw, the original can say, no, that critique is not a valid critique because, and point out the flaw. And then the critiquer can say, no, actually it was valid. They can go back and forth in this way.

Then you can implicitly explore an exponentially large space of considerations. Because by giving the critiquer the option to pick any line of argument that they want in order to attack the proposal, you can verify that every possible line of argument, if the critiquer is not able to win, it suggests to you that every possible line of argument would have been unsuccessful. Every possible line of argument would have still left you thinking the proposal was a good one. It’s not clear if you can actually construct.

Now we have some complicated question. Our AI is proposing to you an action. We would like to set up the debate such that the best action will actually win the debate. If two AIs propose actions, and one of them is proposing an action which is actually better, then it will be able to win a debate in which it establishes that its action is better.

I think there are some plausibility arguments, like the one I just gave, that you’re exploring an exponentially large space of considerations. But this might be possible in cases where a human couldn’t have any idea about the task itself, or directly answering the question. It’s a very open question, exactly how powerful is debate? That is, if we set up a debate in the best possible way, so we give it, we have some human judge of this debate who’s evaluating the claims and counter-claims. If we give them optimal training and optimal advice, and then we have two very powerful agents debate in this way, we’d like it to be the case that the optimal strategy in this debate is being honest and actually telling the truth and then actually providing valid arguments for that and responding to counterarguments in a valid way. And we don’t know if that’s the case, but figuring out if that’s the case and then understanding in what cases we’re able to run such debates and it converges to truth, understanding how to set them up so they converge to truth, etc., does give a plausible way of training powerful AI systems.
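The “exponentially large space of considerations” point can be made concrete with a toy model. The sketch below is an illustration only, not the construction from the debate paper: it assumes a proposal decomposes into a tree of sub-claims, that the critic is strong enough to always find a flawed sub-claim when one exists, and that the judge can check a single leaf claim correctly. The names `Claim`, `holds`, `debate` and `build_tree` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Claim:
    text: str
    # Truth of a leaf claim; only meaningful when `supports` is empty.
    leaf_ok: bool = True
    # Sub-claims this claim depends on; it holds only if all of them hold.
    supports: List["Claim"] = field(default_factory=list)


def holds(claim: Claim) -> bool:
    """Exhaustive check that visits every node (exponential in depth)."""
    if not claim.supports:
        return claim.leaf_ok
    return all(holds(s) for s in claim.supports)


def debate(claim: Claim) -> Tuple[bool, List[str]]:
    """One debate: at each step the critic (assumed optimal) attacks a
    flawed sub-claim if there is one. The judge only checks the final leaf."""
    path = [claim.text]
    node = claim
    while node.supports:
        flawed = [s for s in node.supports if not holds(s)]
        node = flawed[0] if flawed else node.supports[0]
        path.append(node.text)
    return node.leaf_ok, path


def build_tree(depth: int, flawed_leaf: Optional[int] = None, index: int = 0) -> Claim:
    """Full binary tree of claims; leaf number `flawed_leaf` (if any) is false."""
    if depth == 0:
        return Claim(f"leaf-{index}", leaf_ok=(index != flawed_leaf))
    left = build_tree(depth - 1, flawed_leaf, 2 * index)
    right = build_tree(depth - 1, flawed_leaf, 2 * index + 1)
    return Claim(f"node-{depth}-{index}", supports=[left, right])


tree = build_tree(10, flawed_leaf=777)  # 1024 leaf claims, one of them false
verdict, path = debate(tree)
print(verdict, len(path), path[-1])
```

In this toy setting the judge looks at one leaf out of 1024, yet the verdict matches the exhaustive check, because the critic was free to steer towards any flaw. Whether real debates with human judges behave like this is exactly the open question described above; the sketch builds the answer in by assuming an optimal critic and a reliable leaf check.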

Robert Wiblin: So how analogous is this approach to a case where say a person like me is trying to judge a difficult scientific issue, and I wouldn’t be capable of doing the original research and figuring out the truth for myself, but if there was scientists debating back and forth and one of them maybe was trying to be misleading in some way and another one was being truthful, the hope is that I would be able to figure out which one was telling the truth because I can at least evaluate the debate even if I couldn’t produce the arguments, myself?

Paul Christiano: Yeah, so I think the situation’s pretty analogous to two human experts with lots of understanding you lack; you’re trying to understand the truth. You hope that if one of those experts is trying to make a claim that is true, then by zooming in on one consideration after another, you could find out if it’s true. You could eventually come to be very skeptical of all the counterarguments or they could undermine all the counterarguments that were offered, and so I think that’s like … it’s definitely not an obvious claim. It’s not obvious in the context of human discussions. I think as a society, empirically, there aren’t great examples of covering really big gaps in expertise. Like, it’s often the case that two people with expertise in their area can have a debate in a way that convinces someone with slightly less expertise, but when there’s really large gaps, I don’t think we have a very good record of doing that kind of thing successfully.

So, I’d say there’s more hope that this is possible, that a human could just evaluate some proposal produced by a sophisticated AI system, but it’s still very much an open question whether this kind of thing could actually work and one way you could try and assess that would be to say, “We’re gonna get fairly serious about … have some serious experiments of trying to take people with considerable expertise in an area, have them have a debate arbitrated by someone with less expertise.”

Robert Wiblin: What do you think is the biggest or most concerning criticism of AI safety via debate?

Paul Christiano: Personally, I think the worst problem is just, is the basic question, do debates tend to favor accurate answers, or do they tend to favor answers that are easy to defend for reasons other than their accuracy? There’s a bunch of reasons the debate might favor an answer other than it being accurate. I think one that really leaps to people’s mind is, well, the judge is just a human. Humans have all sorts of biases and inconsistencies. That’s one reason that debate could favor answers other than the accurate one. I’m more personally concerned about maybe an even more basic question, which is, setting aside all human biases and all ways in which humans fail to reason well, I think it’s just an open question: does the structure of debate tend to promote truth? Does it tend to be the case that there’s some way to argue for the accurate position, even if the content of the debate, the thing you’re debating, is really, really complex compared to what the human can understand?

Robert Wiblin: It seems like debate among humans is way better than random, anyway.

Paul Christiano: I agree. Humans are … And clearly we’re able to get, at least in some cases, able to get much better answers than we’d get on our own. If I get to observe two experts debate a subject, even if one of them is actively trying to mislead me, I can arrive at a better conclusion than I could have arrived at if I just wasn’t able to listen to their expertise or was only given a single expert, whose incentive was to look good.

I think that the example of debates amongst humans makes it very unclear whether this procedure can be scaled arbitrarily far. An example you might think of is, consider a human who’s a smart person, who knows a lot about reason, who has practiced a lot at judging debates. They don’t know any calculus. They’re now judging a debate between two quantum physicists about how to determine the results of some experiment on a recent particle accelerator. Just imagining that process, I can see how it could work, I can imagine it working, but it’s an incredibly intimidating prospect. This person is not gonna understand anything about … Over the course of the debate, there’s no way that they can form in their head a reasonable model of calculus, or of quantum mechanics, or of the standard model. And yet you hope that somehow the way they’re arbitrating this debate can implicitly answer extremely complex questions about … that depend on all those areas.

I think this is the kind of test that you can do empirically. We can ask the empirical question, for a human who’s very good at, a very smart person who’s been trained to judge such debates, and then you have two people with a ton of expertise in an area they’ve never thought about come in, one of them trying to convince them of the truth and one trying to mislead them, is it the case empirically that humans can arbitrate such debates and actually, that the best way to win such a debate is to provide true facts about the domain to the human?

I think if that’s the case, I think it’s actually, if that’s the case, it’s a very interesting fact, not just for the purpose of AI but just in general. I think it, prima facie, is just a really important question about the world. Are there norms of debate that allow you to consistently arrive at the truth in domains where the arbitrator doesn’t understand what’s true? That’s a question that’s relevant to a ton of domains.

This version of the question is distinctive in some respects. I think mostly it’s distinctive because we are free to set things up in the way that’s maximally convenient. It’s kind of asking, under the best possible conditions, can debate be conducive to truth? Whereas most debates in our current society are under pretty highly suboptimal conditions, like very limited time, bad incentives on behalf of the judge, judges sampled from some population that doesn’t have a lot of time to think about how to judge such debates well, or isn’t that … hasn’t thought a lot about how to structure this to lead to truth. I think most debates in the real world are under pretty pessimistic conditions, but just understanding when does debate work, or what is the equivalent of debate truth, I think is a really … I would consider that a really fundamental and interesting question, completely independent of AI. I think it’s now also a particularly important question because it really is closely related to one of the most plausible strategies for training very powerful AIs to help us actually arrive at good advice or good conclusions.

Robert Wiblin: Are there other important pros and cons of this approach that are worth mentioning?

Paul Christiano: So, I think there’s definitely a lot that could be said about it. There are a bunch of other issues that come up like when you start actually trying to do machine learning, when you try and train agents to play this kind of game, then there’s lots of ways that that can be hard as a machine learning problem. You can have lots of concerns in particular with the dynamics of this game. So, some people maybe wouldn’t be happy that you’re training your AIs to be really persuasive to people. You might be concerned that makes some kind of failure modes look more … crop up in more subtle ways or be more problematic.

But, I really think the main thing is just: is it the case that a sufficiently sophisticated judge will be able … every judge defines a different game, like convincing me is a different game from convincing you. I think it’s clear that for weak enough judges, this game isn’t particularly truth conducive. There’s no reason that the honest player would have an advantage. The hope is that there is some level of sufficiently strong judges, above which, it’s the case that you converge over longer and longer debates to more accurate claims, yeah, it’s unclear. So, first question is: is there a threshold, and the second question: are humans actually above that threshold? If this was the case, if we have humans judge such debates, they will actually have honest strategies winning.

Robert Wiblin: What kind of people do you need to pursue this research? Are there any differences compared with other lines?

Paul Christiano: So, again, I think there’s like a very similar … there’s a bunch of different questions that come up both for amplification and debate. I think different questions require different kinds of skill and different backgrounds. I think that both for amplification and debate, there is this more conceptual question. Or, I don’t know if conceptual is the right word. It’s a fact both about like the structure of argument and about the actual way humans make decisions, which is like, “Can humans arbitrate these debates in domains where they lack expertise?” Or, in the amplification case, can you have teams addressing some issue where no individual can understand the big picture?

And that, I mean, there’s a bunch of different angles you could take on that question. So, you could take a more philosophical angle and say, “What is going on there? Why should we expect this to be true, or what are the cases that might be really hard?” You could also just run experiments involving people, which seems relatively promising, but involves, obviously, a different set of skills, or you could try and study it in the context of machine learning and go straight for training … you might say, “Well, we could test these things with humans if we had very large numbers of humans.” Maybe, actually, the easiest way to test it is to be in the regime where we can train machines to effectively do things that are much, much more expensive than what we could afford to do with humans.

So, you could imagine approaching it from a more philosophical perspective, a more, I don’t know, cognitive science or just organizing, maybe not even in an academic perspective, just putting together a bunch of humans and understanding how those groups behave, or a more machine learning perspective.

Robert Wiblin: What’s been the reception from other groups to this debate approach?

Paul Christiano: So, I think there are many different groups and different answers for different groups. I would say that for many people, the nature of the problem or the alignment problem is very unclear when they first hear it stated in an abstract way, and so I think for a lot of people, it’s been very helpful to get a clear sense of what the problem is that you’re trying to solve. I think when you throw out this proposal, people both understand why debates are better than just giving an answer and having a human evaluate it, and they also can sort of see very clearly why there’s difficulty, like it’s not obvious that the outcome of the debate is in fact producing the right answer.

So, I think from that perspective, it’s been extremely helpful. I think a lot of people have been able to get much more purchase understanding what the difficulties are in what we’re trying to do. I think for people who are more in like the ML side … again, it’s still been very helpful for having them understand what we’re trying to do, but I think the ML community is really very focused on a certain kind of implementation and actually building the thing, and so I think that community is mostly just sort of waiting til, “That’s a very interesting research direction,” and then their response is to wait until things either happen or don’t happen, til we’ve actually built systems that embody those principles to do something which you wouldn’t have been able to do without that idea.

Robert Wiblin: So, if we can use this approach to go from having like 60% accuracy to 70% or 80%, how useful is that? Do we need to be able to judge these things correctly almost all of the time, or is it more just like: the more often humans can make the right call, the better.

Paul Christiano: Yeah, so, certainly, if you just had a judge who was like correct, but then 40% of the time, they err randomly, that would be totally fine. I guess that’s sort of gonna average out and it’s not a problem at all. What you really care about is just: in what cases are there, to what extent are there systematic biases in these judgements? So, to the extent that we consistently just make the wrong answer when the answer depends on certain considerations or in certain domains, and so, from that perspective, I guess the question is, “What class of problems can you successfully resolve with this technique, and if you push that frontier of problems a little bit further, you can solve a few more problems now than you could before. Are you happy?”
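The contrast between random error and systematic bias can be checked with a little arithmetic. The sketch below is a simplified illustration under stated assumptions: each judgement is independent, correct with probability `p`, and the final answer is a majority vote over `n` judgements (with `n` odd). The functions `majority_accuracy` and `accuracy_with_bias` and the parameter `biased_fraction` are hypothetical names introduced here, and treating “errs randomly 40% of the time” as `p = 0.6` is one possible reading of the remark.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n independent judgements is correct,
    when each judgement is correct with probability p (n should be odd)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def accuracy_with_bias(p: float, n: int, biased_fraction: float) -> float:
    """Same as above, except that on a fixed fraction of questions the judge
    is wrong every single time, so repeating the judgement cannot help."""
    return (1 - biased_fraction) * majority_accuracy(p, n)


for n in (1, 11, 101, 501):
    print(n, round(majority_accuracy(0.6, n), 4), round(accuracy_with_bias(0.6, n, 0.1), 4))
```

Under these assumptions the noisy judge’s majority verdict approaches certainty as `n` grows, while the biased judge can never exceed `1 - biased_fraction` no matter how many judgements are collected, which is why the systematic part is the one to worry about.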

I’d say there’s kind of two attitudes you could have on this. So, one: I guess the thing I really would like is a solution that just works in the sense that we have some principled reason to think it works. It works empirically. As we scale up our machine learning systems, it works better and better, and we don’t expect that to break down. That would be really great and sort of has, from a more theoretical perspective, that’s kind of what we’d like. There’s a second perspective you could have, which is just, “There’s this set of important problems that we want to apply machine learning systems to, so as we deploy ML systems, we think the world might change faster or become more complex in certain respects,” and what we really care about is whether we can apply machine learning to help us make sense of that kind of world or steer that world in a good direction, and so from that perspective, it’s more like there’s this list of tasks that we’re interested in and sort of the more tasks we can apply ML to, the better position we will be to cope with possible disruption caused by ML.

And so, from that perspective, I think you’re just sort of happy every time you expand the frontier of tasks that you’re able to solve effectively. I think I also take that pretty seriously, so if it was the case that we could just push the set of tasks that we’re able to solve a little bit, I think that would improve our chances of coping with things well a little bit, but my main goal is, or the main focus, I think, as we are further away, if we’re having to think about things more conceptually or more theoretically, then I think it’s better to focus on having a really solid solution that we think will work all the time. As we get closer, then it becomes more interesting to say, “Great, now we see these particular problems that we wanna solve. Let’s just see if we can push our techniques a little bit so that ML systems can help us solve those problems.”

Robert Wiblin: Do you think it’s possible that there’s an advantage to whoever’s trying to be deceptive in these cases, that in fact it’s easier for the person who’s trying to mislead the judge because they can choose from a wider range of possible claims that they could make whereas the person or the agent that’s trying to communicate the truth, they can only make one claim, which is the true one?

Paul Christiano: Yeah, I guess a few points: maybe a first preliminary point is that, in general, if you had two agents, there wouldn’t be one assigned to the truth and one assigned to lie. Instead, they would both just be arguing whatever they thought would be most likely to be judged as honest and helpful, so in a world where it worked like that, there would just … neither participant in the debate would be trying to say anything true. Both of them would be arguing for some garbage if we were in this unfortunate situation.

So then, in terms of the actual question, yeah, you could sort of imagine there is this giant space of things you could argue for. One of them is like … some tiny space of other things that we would actually regard on reflection as the most truthy and all the other stuff, yeah it’s a very, very tiny subset of the possible claims, and there’s a ton of other things that differ between different claims besides how actually useful are they and truthful are they?

And so, I think, a priori, you would definitely be … it’s like a very, very surprising claim or very, very special claim to say the very best strategy from amongst all these strategies is the one that’s most truthful and helpful, so I definitely think your first guess, just if you didn’t know anything about the domain would be that there’s going to be some other properties. Maybe how nice it sounds is very useful like you wanna pick the thing that sounds nicest, or the thing that has the slickest soundbite in its favor or something like that.

I think I am reasonably optimistic that if, say, a human judge is careful, they can sort of judge well enough that they have some … I’d say, if you’re a weak judge, this process can’t really get off the ground. You’re not able to at all correlate your judgements with truth. As you get to be a stronger judge, you hope that not only can you start to answer some questions; you can sort of bootstrap up to answering more and more complex questions. So, that is, you could say, “Well, if I just were to guess something off the top of my head, that has one level of correlation with truth.” Like, in easy enough cases, that’s going to be truthful. Then, if I have a short debate that sort of bottoms out with me guessing something off the top of my head, that can be a little bit more conducive to the truth. And now, if I have a long debate where after that long debate I can now have a short debate to decide which side I think wins, then I think that’s more likely to be conducive to truth.

So, you would hope that you have sort of a limiting behavior where as you think longer and longer, the class of cases in which truthfulness becomes the optimal strategy grows, but yeah, I think it’s not obvious at all.

Robert Wiblin: What’s the best source for people who want to learn more about this approach? There’s a paper up on arXiv, and I think also a blog post that came out after that that’s perhaps more extensive?

Paul Christiano: I think the paper’s probably the best thing to look at. So, there’s a paper on arXiv called AI Safety via Debate. Like, it covers a lot of considerations and raises a lot of considerations, discusses possible problems, discusses how it compares to amplification, things like that. It presents some very simple toy experiments to show a little bit about how this might work in the context of machine learning. It doesn’t present a convincing example of a system which does something interesting using debate, and so that’s what we’re currently … that’s what we’re currently working on, and so a reader who’s looking for that should maybe come back in six months. But, I think if you want to understand what is … yeah, if you want to understand why we’re interested in the idea or what is basically going on, then I think the paper’s a good thing to look at.

Advances needed to implement these strategies

Robert Wiblin: What would prevent us from implementing either of these strategies today? What advances do we need to actually be able to put them into practice?

Paul Christiano: I think, depending on your perspective either unfortunately or fortunately, there’s really a ton of stuff that needs to be done. One category is just building up that basic engineering competence to make these things work at scale. In running this process, it’s kind of like training an AI. Let’s consider the debate case, which I think is fairly similar in technical requirements, but may be a bit easier to talk about.

We understand a lot about how to train AIs to play games well because that’s a thing we’ve been trying to do a lot. This, as an example of a game, has many differences from the games people normally train AIs to play. For example, it is arbitrated by a human, and queries to a human judge are incredibly expensive. That presents you with a ton of problems about, one, organizing the collection of this data. Using approximations. There’s this whole family of approximations you’re going to have to use in order to be able to actually train these AIs to play this game well. You can’t just have, every time they play the game, a human actually makes the evaluation. You need to be training models to approximate humans. You need to be using less-trusted evaluations. You need to be learning cleverly from data that’s passive data rather than actually allowing them to query the judge. That’s one …

Technically, running this project at scale is hard for a bunch of reasons that AI is hard, and then also hard for some additional reasons distinctive to the role of humans in these kinds of proposals. It’s also hard, I guess, as a game, because it has some features that games don’t normally have. So we’re used to thinking of games with … there’s other technical differences beyond the involvement of humans that make these kind of hard engineering problems. And some of those are things that I’m currently working on, just trying to understand better. And, again, trying to build up the actual engineering expertise to be ready to make these things work at very large scale. So that’s one class of problems.
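One of the approximations mentioned above, training a model to stand in for the expensive human judge, can be illustrated with a deliberately crude sketch. Everything here is an assumption made for illustration: the class `ProxyJudge`, the idea of grouping debates under a `key`, the `min_labels` threshold, and the stand-in `toy_human_judge` are all hypothetical, and a real system would use a learned model rather than a per-key average.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class ProxyJudge:
    """Answers most judging requests from stored human labels and only
    queries the (expensive) human while a kind of debate is still rare."""

    def __init__(self, human_judge: Callable[[str], float], min_labels: int = 3):
        self.human_judge = human_judge
        self.min_labels = min_labels
        self.labels: Dict[str, List[float]] = defaultdict(list)
        self.human_queries = 0

    def judge(self, key: str, transcript: str) -> float:
        seen = self.labels[key]
        if len(seen) < self.min_labels:
            # Not enough human data for this kind of debate yet: ask the human.
            self.human_queries += 1
            seen.append(self.human_judge(transcript))
        # Cheap approximation: average of the human scores collected so far.
        return sum(seen) / len(seen)


def toy_human_judge(transcript: str) -> float:
    """Stand-in for a human: scores 1.0 if the transcript cites evidence."""
    return 1.0 if "evidence" in transcript else 0.0


proxy = ProxyJudge(toy_human_judge, min_labels=3)
scores = [proxy.judge("physics", "claim backed by evidence") for _ in range(100)]
print(proxy.human_queries, scores[-1])
```

The point of the sketch is only the cost profile: a hundred training games trigger a handful of human queries. How much such an approximation can be trusted is a separate matter, which is part of the engineering problem being described.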

A second class of problems is just figuring out … I think there’s maybe two things you could want. One is you want to be able to actually apply these schemes. You want to be able to actually run such debates and use them to train a powerful AI, but then you also want to understand much more than we currently understand about whether that’s actually going to work well. So in some sense, even if there was nothing stopping us from running this kind of training procedure right now, we’re going to have to do a lot of work to understand whether we’re comfortable with that.

Do we think that’s good? Or do we think that we should do some other approach or maybe try harder to coordinate to avoid deploying AI? That’s a huge cluster of questions, some of which are empirical questions about how humans think about things. Like what happens in actual debates involving humans, what happens if you actually try and take 20 humans and have them coordinate in the amplification setting. It also depends on hard philosophical questions. Like I mentioned earlier, the question, “What should a super intelligent AI do?” If you had a formal condition for what it should do, then your problem would be a lot easier.

Our current position is we don’t know. In addition to solving that problem, we’re going to be defining that problem. Like should is a tricky word. So that’s the second category of difficulties.

There’s a third, big category of difficulties corresponding to … and the third category is maybe something we could just wait on. The current AI is not sophisticated enough to, say, run interesting debates. That is, if you imagine the kind of debate between humans that’s like interestingly promoting truth, that involves a very complicated learning problem the debaters have to solve. And I think right now, it feels like that problem is just at the limits of our abilities. Like you could imagine in some simple settings training that kind of AI. And so, one option would just be to wait until the AI improves and say we’re going to try and study these techniques in simpler cases and then apply them with the real messiness of human cognition only once the AI’s better.

Another option would be to try and push safety out as far as one could go. So it’s actually starting to engage with the messiness of human cognition. And to be clear, the second step I suggested is philosophical difficulties and asking whether this is actually a good scheme. That’s totally going to have to, even today, involve engaging a ton with humans. Like that involves actually running debates, actually doing this kind of decomposition process that underlies amplification.

So maybe those are the three main categories of difficulty that I see. I think all of them seem very important. I think my current take is probably that the most important ones are figuring out if this is a good idea rather than being actual obstructions to running the scheme. I think it’s quite realistic to relatively soon be at a place where you could use this procedure to train a powerful AI. And the hard part is just getting to the point where we actually believe that’s a good idea. Or we’ve actually figured out whether that’s a good idea. And then, I mean that’s not just figuring it out, it’s also modifying the procedures so that they actually are a good idea.

Robert Wiblin: Yeah that makes a lot more sense now.

Prosaic AI

Okay, let’s push onto a different line of research you’ve been doing into prosaic AI alignment. You’ve got a series of posts about this on ai-alignment.com. Yeah, what’s kind of the argument you’re making? What is prosaic AI?

Paul Christiano: So, I would describe this as a motivating goal for research, or a statement of what we ought to be trying to do as researchers working on alignment, and roughly what I mean by prosaic AI is AI which doesn’t involve any unknown unknowns, or AI which doesn’t involve any fundamental surprises about the nature of intelligence. So, we could look at existing ML systems and say whether or not I think this is likely, we could ask what would happen if you took these ideas and scaled these ideas up to produce something like sophisticated behavior or human-level intelligence, and then, again, whether or not that’s likely, we can sort of understand what those systems would look like much better than we can understand what other kinds of AI systems would look like just because they would be very analogous to the kinds of systems we could build today.

And so, in particular, what that involves, I guess, if the thing we’re scaling up is something like existing techniques in deep learning, that involves defining an objective, defining a really broad class of models, a really giant … and that’s a complicated model involving attention and internal cognitive workspaces, and then just optimizing over that class to find something that scores well according to the objective, and so we’d imagine … yeah, so that’s the class of technique. That’s the basic technique, and you could say, “What would happen if it turned out that that technique could be scaled up to produce powerful AI?” That’s what I mean by prosaic AI, and then the task would be to say, “Supposing you live in that world, supposing we’re able to do that kind of scale-up, can we design techniques which allow us to use that AI for good or allow us to use that AI to do what we actually want, given that we’re assuming that AI can be used to have some really big transformative impact on the world.”
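As a rough illustration of the recipe described above (define an objective, define a class of models, optimize over that class), here is a minimal sketch. Everything in it is an assumption made for illustration: the "model class" is just straight lines, the objective is mean squared error, and the names `model`, `objective`, `gradient` and `optimize` are hypothetical. Real systems use vastly richer model classes, but the overall shape is the same.

```python
def model(params, x):
    """A tiny 'model class': every line y = w * x + b is one member."""
    w, b = params
    return w * x + b


def objective(params, data):
    """Mean squared error over the data; lower scores are better."""
    return sum((model(params, x) - y) ** 2 for x, y in data) / len(data)


def gradient(params, data):
    """Analytic gradient of the objective with respect to (w, b)."""
    n = len(data)
    dw = sum(2 * (model(params, x) - y) * x for x, y in data) / n
    db = sum(2 * (model(params, x) - y) for x, y in data) / n
    return dw, db


def optimize(data, steps=2000, lr=0.05):
    """Plain gradient descent: repeatedly step against the gradient."""
    params = (0.0, 0.0)
    for _ in range(steps):
        dw, db = gradient(params, data)
        params = (params[0] - lr * dw, params[1] - lr * db)
    return params


# Toy data generated from y = 3x + 1 (an illustrative choice, not from the talk).
data = [(x, 3.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
fitted = optimize(data)
```

The alignment question raised in the conversation is about what happens when this same loop is run over a far larger model class with an objective that humans have to supply.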

Yeah, so there’s a few reasons you might think this is a reasonable goal for research. So, maybe one is that it’s a very … it’s like a concrete model of what AI might look like, and so it’s relatively easy to actually work on instead of sort of being in the dark and having to speculate about what kinds of changes might occur in the field. Second reason is that even if many more techniques are involved in AI, it seems quite likely that doing gradient descent over rich model classes is going to be one of several techniques, and so if you don’t understand how to use that technique safely, it’s pretty likely you’re going to have a hard time.

Maybe a third reason is that I think there is actually some prospect that existing techniques will go further than people guess, and that’s a case that’s particularly important from the perspective of alignment, because, in that case, people would, sort of by hypothesis, be caught a little bit by surprise. There’s not that much time to intervene, or to do more work between now and then, so I think in general, I would advocate for a policy of, “Look at the techniques you understand currently and try and understand how to safely use those techniques, and then once you’ve really solved that problem, once you’re like, ‘Now we understand how to make, how to do gradient descent in a way that produces safe AI,’ then you can go on and look towards future techniques that might appear.” And ideally, for each of the techniques that might play a role in building your AI, you’d have some analogous safe version of that technique, which doesn’t introduce problems with alignment but is roughly equally useful.

Robert Wiblin: So, I guess the people who wouldn’t be keen on this approach would be those who are confident that current methods are not going to lead to very high levels of general intelligence, and so they expect the techniques that you’re developing now just won’t be usable ’cause they’re gonna be so different.

Paul Christiano: Yeah, I guess I’d say there’s two categories of people that might be super skeptical of this as a goal. One will be, as you said, people who just don’t believe that existing techniques are going to go that far or don’t believe that they’re going to play an important role in powerful AI systems, and then a second one would be those who think that’s plausible, but that the project is just doomed. That is that there is going to be no way to produce an analog of existing techniques that would be aligned, even if they could in fact play a role in sophisticated AI systems. I think both of those are reasonably common perspectives.

Robert Wiblin: Well, I think in a minute we’ll talk about MIRI, and I guess they’re perhaps a bit of a combination of the two of them.

Paul Christiano: Yeah, they’re some of the strongest proponents of the second view that we’re super doomed in a world where sophisticated AI looks anything like existing systems.

Robert Wiblin: Can you lay out the reasons both for and against thinking that current techniques in machine learning can lead to general intelligence?

Paul Christiano: Yeah, so I think one argument in favor, or one simple point in favor is that we do believe if you took existing techniques and ran them with enough computing resources, there’s some anthropic weirdness and so on, but we do think that produces general intelligence based on observing humans, which are effectively produced by the same techniques. So, we do think if you had enough compute, that would work. That probably takes, sort of if you were to run a really naïve analogy with the process of evolution, you might think that if you scaled up existing ML experiments by like 20 orders of magnitude or so that then you would certainly get general intelligence.

So that’s one. There’s this basic point that probably these techniques would work at large enough scale, so then it just becomes a question about what is that scale? How much compute do you need before you can do something like this to produce human-level intelligence? And so then the arguments in favor become quantitative arguments about why to think various levels are necessary. So, that could be an argument that talks about the efficiency of our techniques compared to the efficiency of evolution, examines ways in which evolution probably uses more compute than we’d need, includes arguments about things like computer hardware, saying how much of those 20 orders of magnitude will we just be able to close by spending more money and building faster computers, which is …

20 orders of magnitude sounds like a lot, but actually, you cover … we’ve covered more than 20 orders of magnitude so far, and we will cover a significant fraction of those over the current decade, or you can also try and run arguments on analogies. Look at how effectively or how much compute existing systems take to train and try and understand that. So, you could just try and say, based on what our experience so far, how much compute do you think will be needed?
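To make the "orders of magnitude" arithmetic above a little more concrete, here is a small sketch that converts a gap measured in orders of magnitude into a number of doublings, and then into years under an assumed doubling time. The doubling times passed in are purely hypothetical inputs chosen for illustration; nothing in the conversation specifies them, and the function names are made up for this example.

```python
import math


def doublings_needed(orders_of_magnitude):
    """Number of 2x steps needed to cover a gap of 10**orders_of_magnitude."""
    return orders_of_magnitude * math.log2(10)


def years_to_close(orders_of_magnitude, doubling_time_years):
    """Years to close the gap if compute doubles every `doubling_time_years`.

    The doubling time is an assumption supplied by the caller, not a forecast.
    """
    return doublings_needed(orders_of_magnitude) * doubling_time_years


# 20 orders of magnitude is roughly 66 doublings.
gap_in_doublings = doublings_needed(20)

# Two hypothetical doubling times, to show how sensitive the answer is.
slow_case_years = years_to_close(20, doubling_time_years=2.0)
fast_case_years = years_to_close(20, doubling_time_years=0.5)
```

The point of the sketch is only that the answer scales linearly with the assumed doubling time, which is why the quantitative arguments mentioned above matter so much.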

That’s like probably the most important class of arguments in favor. There’s other qualitative arguments like: there are lots of tasks that we’re able to do, so you’d probably want to look at what tasks we have succeeded at or failed at and try and fit those into that quantitative picture to make sense of it. But, I think it’s like not insane to say that existing systems seem like they’ve plausibly reached the level of sophistication of insects, so we are able to take this very brute force approach of doing search over neural nets and get behavior that’s … and this is totally unclear, but I think it’s plausible that existing behavior is as sophisticated as insects. If you thought that, then I think it would constitute an argument in favor.

Yeah, so I guess arguments against, probably the most salient argument against is just, “If we look at the range of tasks humans are able to accomplish, we have some intuitive sense of how quickly machines are able to do more and more of those tasks,” and I think many people would look at that rate of progress and say, “Look, if you were to extrapolate that rate, it’s just gonna take a very, very long time before we’re able to do that many tasks.” I think a lot of this is just people extrapolate things in very different ways. So, some people would look at being able to see the task an insect can do and say, “Wow, insects have reasonably big brains on a scale from nothing to human. We’ve come a substantial fraction of the way. We’re perhaps plausibly going to get there just by scaling this up.”

Other people would look at what insects do and say, “Look, insects exhibit almost none of the interesting properties of reasoning. We’ve captured some very tiny fraction of that. Presumably, it’s gonna be a really long time before we’re able to capture even like a small fraction of interesting human cognition.”

Robert Wiblin: What are the aspects of cognition that seem most challenging, or, I guess, are most likely to require major research insights rather than just increasing the compute?

Paul Christiano: Again, with enough compute, you’d sort of expect, or I would be willing to bet that you would get everything in human cognition, and the question is, in some sense, which aspects of cognition are most expensive to produce in this way, or most likely to be prohibitively expensive such that you can’t just find them by brute force search. You have to actually understand them. So, natural things are properties of human cognition that operate over very long timescales. Maybe evolution got to take many tries at developing different notions of curiosity until it found a notion of curiosity that was effective, or a notion of play that was effective for getting humans to do useful learning.

It’s not clear that you can evaluate. If you have some proposed set of motivations for a human that you’re rolling out, it’s not clear you can evaluate them by actually having a bunch of human lifetimes occur, and so if there’s a thing you’re trying to optimize where every time you have a proposal, you have to check it, and in order to check it you have to run a whole bunch of human lifetimes, that’s going to take a whole lot of checks. And so, if there’s like cognitively complicated things that only … right, so maybe curiosity’s simple, but if you have a thing like curiosity that’s actually very complicated or involves lots of moving parts, then it might be very, very hard to find something like that by this brute force search.

Things that operate over very short timescales are much, much, more likely to … then you can try a whole bunch of things. You can get feedback about what works, but things that operate over long timescales might be very hard.

Robert Wiblin: So, it sounds like you’re saying at some level of compute, you’re pretty confident that current methods would produce human-level intelligence and maybe much more. I think a lot of listeners would find that claim somewhat surprising, or at least being confident that that’s true. Yeah, what’s the reason that you think that?

Paul Christiano: Yeah, so there’s a bunch of … there are a bunch of things to be said on this topic, so maybe a first thing to say is human intelligence was produced by this process of: try some random genomes, take those genomes which produce the organisms with the highest fitness, and then randomly vary those a little bit and see where you get. In order for that process to produce intelligence, you definitely need a bunch of things. At a minimum, you need to try a huge number of possibilities. Again, now we’re just discussing the claim that with enough compute it would work. So, at a minimum, you need to try a whole bunch of possibilities, but you also need an environment in which reproductive fitness is a sufficiently interesting objective.
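The process sketched above (try random genomes, keep the fittest, randomly vary them a little) can be written down as a very small evolutionary search. This is a toy under stated assumptions: genomes are bit strings, the fitness function is just the number of ones, and the function name `evolve` and all parameter values are hypothetical. It says nothing about how much compute real evolution used.

```python
import random


def evolve(fitness, genome_length=20, population_size=50,
           generations=100, mutation_rate=0.05, seed=0):
    """Keep the fittest fifth each generation and refill with mutated copies."""
    rng = random.Random(seed)
    # Start from random genomes.
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Take the genomes with the highest fitness.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 5]
        # Randomly vary the survivors a little to produce children.
        children = []
        while len(survivors) + len(children) < population_size:
            parent = rng.choice(survivors)
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in parent]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)


# Toy objective (an assumption for illustration): count the ones in the genome.
best = evolve(fitness=sum)
```

Even in this toy, the two requirements mentioned above show up directly: the loop has to try many candidates, and the result is only as interesting as the fitness function it is scored against.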

So, one reason that you might be skeptical of this claim is that you might think that the environment that humans evolved in or the lower life evolved in, like, is actually quite complex, and we wouldn’t have access. Even if we had arbitrarily large amounts of compute, we wouldn’t actually be able to create an environment rich enough to produce intelligence in the same way. So, that’s something I’m skeptical of largely because I think … humans operate in this physical environment. Almost all the actual complexity comes from other organisms, so that’s sort of something you get for free if you’re spending all this compute running evolution ’cause you get to have the agent you’re actually producing interact with itself.

I guess, other than that, you have this physical environment, which is very rich. Quantum field theory is very computationally complicated if you want to actually simulate the behavior of materials, but, it’s not an environment that’s optimized in ways that really pull out … human intelligence is not sensitive to the details of the way that materials break. If you just substitute in, if you take like, “Well, materials break when you apply stress,” and you just throw in some random complicated dynamics concerning how materials break, that’s about as good, it seems, as the dynamics from actual chemistry until you get to a point where humans are starting to build technology that depends on those properties. And, by that point, the game is already over. The point when humans are building technologies that really exploit the fact that we live in a universe with this rich and consistent physics, at that point, you already have human-level intelligence. Effectively, there’s not much more evolution occurring.

So yeah, maybe on the environment side, I think most of the interesting complexity comes from organisms in the environment, and there’s not much evidence that considerable computational complexity of the world is actually an important part of what gives a human intelligence. A second reason people might be skeptical is they might … this estimate, this 20 orders of magnitude thing would come from thinking about the neurons in all the brains in all the organisms that have lived. You might think that maybe the interesting compute is being done early in the process of development or something about the way that genotypes translate into phenotypes. If you think that, you might think that the neuron counts are a great underestimate for the amount of interesting compute.

Or, similarly, you might think other things in the organisms are more interesting than either development or neurons. I think that, like, my main position here is it really does look like we understand the way in which neurons do computing. A lot of the action is sending action potentials over long distances. The brain spends a huge amount of energy on that. It looks like that’s the way that organisms do interesting computing. It looks like they don’t have some other mechanism that does a bunch of interesting computing ’cause otherwise they wouldn’t be spending these huge amounts of energy implementing the mechanism we understand. It does look like brains work the way we think they work.

Robert Wiblin: So, I guess some people could think that there’s a lot of computation going on within individual neurons, but you’re skeptical of that.

Paul Christiano: Yeah, so I think my view would be that mostly the hard thing about … say, if you wanted to simulate a brain, you can imagine there being two kinds of difficulties. One is simulating the local dynamics of neurons, and a second is moving information long distances, say, as you fire action potentials, and I think, most likely, both in the brain and in computers, the movement of information is actually the main difficulty. It’s like, the dynamics within the neuron just don’t … they might be very complicated. It might involve a lot of arithmetic operations to perform that simulation, but I think it’s not hard compared to just shuffling that around, and shuffling that around, we have a much clearer sense of exactly how much happens because we know that there’s these action potentials. Action potentials communicate information basically only in the timing. I mean, there’s a little bit more than that. But we can basically … we know sort of how much information is actually getting moved.

Robert Wiblin: It looks like ones and zeros?

Paul Christiano: Yeah, it looks like ones and zeros and most of the extra bits are in timing, and we sort of know roughly what level of precision there is, and so there’s not that many bits per action potential.
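One simple way to picture "not that many bits per action potential" is to count how many distinguishable arrival times a spike could have. The sketch below assumes a deliberately crude model: a spike lands somewhere in a time window and its timing can be resolved to a fixed precision. The function name and the numbers used are hypothetical illustrations, not measurements.

```python
import math


def bits_per_spike(window_ms, precision_ms):
    """Bits carried by one spike's timing under a crude model.

    Assumes the spike falls in one of (window_ms / precision_ms) equally
    likely, distinguishable time slots. Both inputs are assumptions.
    """
    return math.log2(window_ms / precision_ms)


# Hypothetical example: an 8 ms window resolved to 1 ms gives 8 slots.
example_bits = bits_per_spike(window_ms=8.0, precision_ms=1.0)
```

Under this kind of model the information per spike grows only logarithmically with timing precision, which is why even fairly precise timing still amounts to a handful of bits.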

Robert Wiblin: So, I don’t have a lot of understanding of the specifics of how machine learning works, but I would think that one objection people might have is to say that even if you had lots of compute and you tried to make the parameters of this, the machine learning more accurate, just the structure of it might not match what the brain is doing, so it might just cap out at some level of capability because there’s no way for the current methods of … the current way that the data’s being transformed to actually be able to produce general intelligence. Do you think there’s any argument for that or is it just the case that the methods we have now at least at some level of abstraction are analogous to what the human brain is doing, and therefore, with a sufficient amount of compute, maybe a very, very high amount, but they should be able to copy everything that the human brain is doing?

Paul Christiano: Yeah, so I would say that most of the time, machine learning would fix an architecture and then optimize over computations that fit within that architecture. Obviously, when evolution optimizes for humans, it does this very broad search over possible architectures like looking over genomes that encode, “Here’s how you put together a brain.” We can also do a search over architectures, and so the natural question becomes, “How effective are we at searching over architectures compared to evolution?” I feel like this is mostly in the regime of just a computational question. That is, we sort of know … I mean, the very highest level that evolution uses isn’t that complicated, sort of at the meta level, and so you could, in the worst case, just do a search at that same level of abstraction.

I guess one point that we haven’t discussed at all but is, I guess, relevant for some, some people would consider super relevant is anthropic considerations concerning the evolution of humans. So, you might think that evolution only extremely rarely produces intelligent life, but that we happen to live on a planet where that process worked.

Robert Wiblin: Yeah, what do you make of that?

Paul Christiano: So, I think it’s kind of hard to make it fit with the evolutionary evidence. This is something that, I think Carl Shulman and Nick Bostrom have a paper about this, and some other people have written about it periodically: I think the rough picture is that intelligence evolves like … if this is the case, if there’s some hard step in evolution, it has to be extremely early in evolutionary history, so in particular, it has to happen considerably before vertebrates, and probably has to have happened by simple worms.

Robert Wiblin: And why is that? ‘Cause those steps took longer than the later steps did?

Paul Christiano: Well, so, one reason … I think the easiest reason to put it before vertebrates is just to say that cephalopods seem pretty smart and the last common ancestor between an octopus and a human is some simple worm. I think that’s probably the strongest evidence. That’s from this paper by Nick and Carl.

Robert Wiblin: Okay, because then we have another line that also produced substantial intelligence.

Paul Christiano: Independently.

Robert Wiblin: Independently, and that would be incredibly suspicious if it had happened twice on the same planet, and there, we don’t have the anthropic argument, ’cause you could live on a planet where it only happened once.

Paul Christiano: That’s right. You could think maybe there’s a hard step between octopi and humans, but then we’re getting into the regime where like sort of any place you look-

Robert Wiblin: What is this hard step?

Paul Christiano: Many things happen twice. Like, birds and mammals independently seem to become very, very intelligent. You could think that maybe in early vertebrates, there was some lucky architectural choice made in the design of vertebrate brains that means that on the entire vertebrate line intelligence will then sort of systematically increase quickly, but what was important was this lucky step. At some point, you can try and run some argument for why you might get stuck before humans. It seems pretty hard to do. It doesn’t seem very convincing and it certainly doesn’t seem like it would give you an argument for why you wouldn’t reach at least like octopus levels of intelligence. So, if you’re thinking that existing techniques are gonna get stuck anywhere around their current level, then this kind of thing isn’t going to be very relevant.

Robert Wiblin: Yeah, so I guess it kind of raises a definitional question of, “What is current techniques?” How much do you change the architecture before you say, “Oh, well this is no longer like current machine learning methods. This is no longer prosaic AI?”

Paul Christiano: Yeah, so I think the thing that’s really relevant from the perspective of alignment research is you want to assume something about what you can do, and the thing you want to assume you can do is: there is some model class. You optimize over that model class given an objective. Maybe you care about whether the objective has to supply gradients. Maybe it doesn’t even matter that much. So then, as an alignment researcher, you say, “Great, the AI researchers have handed us a black box. The black box works as follows. The black box takes some inputs, produces some outputs. You specify how good those outputs were, and then the black box adjusts over time to be better and better.”

And, as an alignment researcher, as long as something fits within that framework, you don’t necessarily care about the details of, “What kind of architecture are we searching over? Are you doing architecture search? Or what form does the objective take?” Well, what form the objective takes, you may care about, but most other details you don’t really care about because alignment research isn’t going to be sensitive to those details.
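The "black box" framing above can be sketched as an interface: something that takes inputs, produces outputs, is told how good those outputs were, and adjusts. The code below is an illustrative assumption, not anyone's actual API. The class and method names (`BlackBoxLearner`, `produce`, `feedback`, `EpsilonGreedyLearner`, `train`) are hypothetical, and the concrete learner is a deliberately trivial one that ignores its input.

```python
import random
from abc import ABC, abstractmethod


class BlackBoxLearner(ABC):
    """The only surface the training loop below relies on."""

    @abstractmethod
    def produce(self, x):
        """Take an input and produce an output."""

    @abstractmethod
    def feedback(self, x, output, score):
        """Be told how good an output was, and adjust."""


class EpsilonGreedyLearner(BlackBoxLearner):
    """Toy learner: tracks the average score of each candidate output."""

    def __init__(self, candidates, epsilon=0.1, seed=0):
        self.candidates = list(candidates)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.totals = {c: 0.0 for c in self.candidates}
        self.counts = {c: 0 for c in self.candidates}

    def _mean(self, c):
        # Untried candidates look infinitely good, so each gets tried once.
        return self.totals[c] / self.counts[c] if self.counts[c] else float("inf")

    def best(self):
        return max(self.candidates, key=self._mean)

    def produce(self, x):
        # Mostly exploit the best-known output, occasionally explore.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.candidates)
        return self.best()

    def feedback(self, x, output, score):
        self.totals[output] += score
        self.counts[output] += 1


def train(learner, inputs, score_fn):
    """Uses only produce/feedback; any learner with that interface fits."""
    for x in inputs:
        output = learner.produce(x)
        learner.feedback(x, output, score_fn(x, output))


learner = EpsilonGreedyLearner(["a", "b", "c"])
train(learner, inputs=range(200), score_fn=lambda x, out: 1.0 if out == "b" else 0.0)
```

The design point is that `train` never looks inside the learner, so swapping in a different architecture or search procedure would leave the loop unchanged. The thing the conversation flags as still mattering is the form of the objective, here `score_fn`.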

So, in some sense, you could easily end up with a system which existing ML researchers would say, “Wow, that was actually quite a lot different from what we were doing in 2018,” but which an alignment researcher would say “That’s fine. The important thing from my perspective is this still matches with the kind of alignment techniques that we were developing,” so we don’t really care how different it looks. We just care about, did it basically change the nature of the game from the perspective of alignment?

Robert Wiblin: Yeah. Can we look backwards in history and say, would techniques that we developed five or ten years ago work on today’s architectures?

Paul Christiano: Yeah, so we can look back. Hindsight is always complicated and hazardous, but I think you would say, if you were to, say, in 1990, perform a similar exercise and look across techniques, I would say certainly the kinds of things we’re talking about now would exist. That would be part of your picture. They would not have nearly as much, be nearly as much of a focal point as they are today because they hadn’t yet worked nearly as well as they worked now, so I guess we would be talking about what fraction of your field of view would these techniques occupy?

So, I think it’s pretty safe to say that more than 10% of your field of view would have been taken up by the kind of thing we’re discussing now, and the techniques developed with respect, with that 10% of possibilities in mind would still apply. Existing systems are very, very similar to the kinds of things people were imagining in the late ’80s, and there’s a question like, “Is that number 10% or is it a third?” I think that’s pretty unclear and I don’t have enough of a detailed understanding of that history to really be able to comment intelligently, and I’d wanna defer to people who were doing research in the area at that time.

I do think the … if you had instead focused on different kinds of techniques, like if you’d been around in the further past and you were, say, trying to do AI alignment for expert systems, I don’t feel that bad about that. I guess some people look back on history and say, “Man, that would have been a real bummer if you’d been alive in the ’60s and you’d done all this AI alignment research that didn’t apply to the kind of AI we’re building now,” and my perspective is kind of like, “Look, one, if it takes 50 years to build AI, it doesn’t matter as much what the details are of the AI alignment work you did in the ’60s. Two, actually, there’s a lot of overlap between those problems, like many of the philosophical difficulties you run into in alignment are basically the same, even between existing systems and expert systems.

Three, I would actually be pretty happy with the world where like when people propose a technique, a bunch of AI alignment researchers invest a bunch of time understanding alignment for expert systems, and then 15 years later, they move onto the next thing. It’s like not that bad a world. I expect you would, in fact … if you just solved this sequence of concrete problems, that actually sounds pretty good. It sounds like a good way to get practice as a field. It sounds reasonably likely to be useful. There’s probably lots of commonalities between those problems. Even if they turn out to be totally wasted, it’s a reasonable bet in expectation: you sort of have to do … that’s the cost we have to pay if you want to have done a bunch of research for the techniques that are actually relevant unless you’re very confident the current techniques are not the things that will go all the way, or that it’s doomed. I think both those positions seem really, really hard to run to me. I haven’t heard very convincing arguments for either of those positions.

Robert Wiblin: What’s expert systems?

Paul Christiano: The systems based on having a giant set of, maybe, reasoning rules and facts, and then they use these rules to combine these facts.

Robert Wiblin: And that just didn’t work?

Paul Christiano: Yeah, there was a period where people were more optimistic about them. I don’t know the history very well at all. I think in general, certainly, it didn’t realize the ambitions of the most ambitious people in that field, and certainly, it’s not the shape of most existing, or the kinds of systems people are most excited about today.
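For readers unfamiliar with the term, the description of expert systems above (a set of rules and facts, with the rules used to combine the facts) corresponds roughly to what is usually called forward chaining. The sketch below is a minimal toy version under that assumption; the function name `forward_chain` and the example rules and facts are made up for illustration and are not taken from any historical system.

```python
def forward_chain(facts, rules):
    """Repeatedly apply rules until no new facts can be derived.

    Each rule is a (premises, conclusion) pair: if every premise is
    already known, the conclusion is added to the known facts.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known


# Hypothetical toy knowledge base.
rules = [
    (("has_feathers",), "is_bird"),
    (("is_bird", "cannot_fly", "swims"), "is_penguin"),
]
derived = forward_chain({"has_feathers", "cannot_fly", "swims"}, rules)
```

Unlike the gradient-descent setting discussed earlier, every inference here is an explicit, human-written rule, which is part of why the two kinds of system look so different from the outside.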


Robert Wiblin: Okay, so, we mentioned that a group that has kind of a different view from this prosaic AI is the Machine Intelligence Research Institute at Berkeley. If I understand correctly, you got into AI safety in part through at least that social group or that intellectual group, but it seems like now, you kind of recommend … you kind of represent a different node or, what’s the term, axis within the people working on AI safety. Yeah, how would you describe their view and how it differs from yours today?

Paul Christiano: I would say the most important difference is they believe this prosaic AI alignment project is very likely to be doomed. That is, they think if the shape of sophisticated AI systems resembles the shape of existing ML systems, or if, in particular, you obtain sophisticated AI by defining a model class, defining an objective and doing gradient descent, finding the model that scores well according to the objective, then they think we’re just extremely doomed such that they think the right strategy is to instead step back from that assumption and say, “Can we understand other ways that people could build sophisticated AI?”

Part of that is like, if you’re doing gradient descent or this big model … if you’re doing gradient descent to find a model that performs well, you’re gonna end up with some actual particular models. You’re gonna end up with some particular way of thinking that your giant neural net embodies, and you could instead … instead of just specifying procedures that give rise to that way of thinking, you could actually try and understand that way of thinking directly and say, “Great, now that we understand this, we can both reason about its alignment and maybe we can also design it more efficiently or we can more efficiently describe search procedures that will uncover it once we know what it is that they’re looking for.”

And, I’d say that’s like the biggest difference, and the crux there is mostly: is it possible to design alignment techniques that make something like existing ML systems safe? And so, my view is that, most likely, that’s possible. Not most, like, more likely than not, not like radically more likely than not, but somewhat more likely than not, that’s possible and that as long as it looks possible and you have attractive lines of research to pursue and a clear path forward, we should probably work on that by default, and that we should … at some point, if it’s like, “Wow, it’s really hard to solve alignment for systems that look anything like existing ML,” then you really wanna understand as much as you can why that’s hard, and then you wanna step back and say, “Look, people in ML, it looks like the thing you’re doing actually is sort of like unfixably dangerous and maybe it’s time for us to think about weird solutions where we actually have to change the overall trajectory of the field based on this consideration about alignment.”

If it’s not reasonable to call that weird, from the outside view, you might think, “Well, the goal of AI is to make things good for humans. It’s not crazy to change the direction of the field based on what is plausibly going to be alignable.”

Robert Wiblin: But it would seem strange to them today?

Paul Christiano: Yeah, people in ML would not be like, “Oh, that makes a lot of sense. Let’s just swap what we’re doing.” So, I guess, my position would be like, “I’m currently much, much more optimistic than MIRI people.” I think that’s the main disagreement about whether it will be possible to solve alignment for prosaic AI systems, and I think as long as we’re optimistic in that way, we should work on that problem until we discover why it’s hard or solve it.

Robert Wiblin: Just to make it more concrete for people, what are the kind of specific questions that MIRI is researching that they think are useful?

Paul Christiano: I think, at this point, MIRI’s public research, the stuff that they publish on and talk about on the internet, one big research area is decision theory, so understanding, supposing that you have some agent which is able to make predictions about things or has views on the truth of various statements, how do you actually translate those views into decisions? This is tricky, ’cause you want to say things like, “You care about quantities like: what would happen if I were to do X?” and it’s not actually clear what “what would happen if I were to do X” means. Is it a causal counterfactual, like a certain kind of statement? It’s not clear that’s the right kind of statement.

Robert Wiblin: So this is causal decision theory, evidential decision theory and-

Paul Christiano: Yeah, and most of the stuff they’re considering seriously is like … like once you get really precise about and you’re like, “We’d like this to be an algorithm,” the whole picture just gets a lot weirder and there are a lot of distinctions that people don’t normally consider in the philosophy, philosophical community, that are kind of, “You have to wade through if you want to have … be in the place where you have a serious proposal for,” here is an algorithm for making decisions given as input like views on these empirical questions or given as input like a logical inductor or something like that.

So, that’s one class of questions that they work on. I think another big class of questions they work on is like … I mean like stepping back from, looking at the whole problem and saying from a conceptual perspective, supposing you grant this view, this worldview about what’s needed, what … I don’t know a good way to define this problem. It’s kind of just like, “Figure out how you would build an aligned AI,” which is a good problem. It’s the very high level problem. I endorse some people thinking about the very high level problem. I think it’s one of the more useful things to think about. There’s some flavor of it that depends on what facts you … what you think are the important considerations or what you think the difficulties are, so they work on a certain version of that problem.

I think other examples of things include … they’re very interested just in, “What are good models for rational agency?” So, we have such models in some settings like in Cartesian settings where you have an environment and an agent that communicate over some channel where they send bits to one another. It becomes much less clear what agency means once you have an agent that’s physically instantiated in some environment. That is, what does it mean to say that a human is like a consequentialist thing acting in the world, given that the human is just actually like: some of the degrees of freedom in the world are leading together in this complicated way to make a human. It’s quite complicated to talk about what that actually means, that there’s this consequentialist in the world. That’s a thing they’re like super interested in.

Yeah, figuring out how to reason about systems that are logically very complex, including systems that contain yourself or contain other agents like you. How do we formalize such reasoning is another big issue.

Robert Wiblin: Does MIRI have a different view as well about the likelihood of current methods producing general intelligence or is that-

Paul Christiano: There’s probably some difference there. It’s a lot less stark than the other one. I think maybe a difference in that space that’s closer is like: I kind of have a view that’s more like: there’s probably some best way. There’s some easiest way for our current society as it’s currently built to develop AI, and the more of a change you want to make from that default path, the more difficult it becomes. Whereas I think the MIRI perspective would more be like: the current way that we build ML is reasonably likely to just be very inefficient, so it’s reasonably likely that if you were to step back from that paradigm and try something very different, that it would be comparably efficient and maybe more efficient, and I think that’s a little bit … yeah, I guess I don’t buy that claim. I don’t think it’s as important as the definite doom claim.

Robert Wiblin: So, what are the best arguments for MIRI’s point of view that the current methods can’t be made safe?

Paul Christiano: So, I guess I’d say there’s two classes of problems that they think might be unresolvable. One is: if I perform … if I have some objective in mind … suppose I even have the right objective, an objective that perfectly tracks how good a model is, all things considered according to human values, and then I optimize really aggressively on that objective. The objective is still just a feature of the behavior of this: I have this black box I’m optimizing over the weights of some neural network, so now I have an objective that perfectly captures whether the behavior is good for humans, and I optimize really hard on that objective.

So one of MIRI’s big concerns is that even if we assume that problem is resolved and you have such an objective, then it’s pretty likely you’re going to find a model which only has these desirable properties like on the actual distribution where it was trained, and that it’s reasonably likely that in fact, that system that you’ve trained is going to be some consequentialist who wants something different from human flourishing and just happens on the training distribution to do things that look good.

Robert Wiblin: Within that narrow range?

Paul Christiano: Yes. So, an example of this phenomenon that I think MIRI people think is pretty informative, though certainly not decisive on its own, is like: humans were evolved to produce lots of human offspring, and yet it’s the case that humans are sophisticated consequentialists whose terminal goal is not just producing offspring, so that even though the cognitive policy that humans use was very good for producing offspring over human evolutionary history, it seems like it’s not actually great now: it’s sort of already broken down to a considerable extent, and in the long run looks like it will break down to a much, much greater extent.

So, then, if you were like a designer of humans being like, “I know. I’ve defined this objective that tracks how many offspring they have. Now I’m going to optimize it over many generations. I’m gonna optimize biological life over a million generations to find the life which is best at producing offspring,” you’d be really bummed by the results. So, their sort of expectation is that in a similar way, we’re going to be really bummed. We’re going to optimize this neural net over a very large number of iterations to find something that appears to produce actions that are good by human lights, and we’re going to find something whose relationship to human flourishing is similar to like humans’ relationship to reproduction, where they sort of do it as a weird byproduct of a complicated mix of drives rather than because that’s the thing they actually want, and so when generalized, they might behave very strangely.

Robert Wiblin: Okay, sounds kind of persuasive. What’s the counterargument?

Paul Christiano: So, I think there’s a few things to say in response. So, one is that evolution does a very simple thing where you sample environments according to this distribution, then you see what agents perform well on those environments. We, when we train ML systems, can be a little bit more mindful than that, so in particular, we are free to sample from whatever distribution over environments we are … any distribution we’re able to construct, and so as someone trying to solve prosaic AI alignment, you are free to look at the world and say, “Great, I have this concern about whether the system I’m training is going to be robust, or whether it might generalize in a catastrophic way in some new kind of context,” and then I’m free to use that concern to inform the training process I use, so I can say, “Great, I’m going to adjust my training process by, say, introducing an adversary and having the adversary try and construct inputs on which the system is going to behave badly.”

That’s something that people do in ML. It’s called adversarial training, and if you do that, that’s very different from the process evolution ran. Right now, you imagine that there’s someone roughly as smart as humans who’s like, constructing these weird environments, like if they’re looking at humans and say, “Great, the humans seem to care about this art shit,” then the adversary’s just constructing an environment where humans have lots of opportunities to do art or whatever, and then if they don’t have any kids, then they get down-weighted.

If there’s some gap, if there’s some context under which humans fail to maximize reproductive fitness, then the adversary can specifically construct those contexts, and use that to select against. Again, the reproductive fitness analogy makes this sound kind of evil, but you should replace reproductive fitness with things that are good.
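To make the loop Paul describes concrete, here is a minimal toy sketch of adversarial training in Python. Everything in it is a hypothetical illustration rather than anything from the interview: the "model" is a single threshold, `desired` stands in for the behaviour we actually want, and the adversary simply searches a small input space for a case where the model misbehaves. Each failure the adversary finds is added to the training data and the model is refit.

```python
from typing import List, Optional, Tuple

Example = Tuple[int, int]  # (input, desired label)


def desired(x: int) -> int:
    """Hypothetical stand-in for the behaviour we actually want."""
    return 1 if x >= 5 else 0


def predict(threshold: int, x: int) -> int:
    """The toy model: output 1 for inputs at or above the threshold."""
    return 1 if x >= threshold else 0


def fit_threshold(data: List[Example]) -> int:
    """Return the smallest threshold in 0..10 with the fewest training errors."""
    def errors(t: int) -> int:
        return sum(predict(t, x) != y for x, y in data)
    return min(range(11), key=errors)


def find_failure(threshold: int, search_space: range) -> Optional[int]:
    """Adversary: return the first input where the model behaves badly, if any."""
    for x in search_space:
        if predict(threshold, x) != desired(x):
            return x
    return None


def adversarial_training(
    data: List[Example], search_space: range, max_rounds: int = 20
) -> Tuple[int, List[Example]]:
    """Alternate between fitting the model and letting the adversary attack it."""
    data = list(data)
    threshold = fit_threshold(data)
    for _ in range(max_rounds):
        bad_input = find_failure(threshold, search_space)
        if bad_input is None:
            break  # the adversary could not find a failure in its search space
        data.append((bad_input, desired(bad_input)))
        threshold = fit_threshold(data)
    return threshold, data


# The narrow training distribution only covers the two extremes.
narrow_data = [(0, 0), (9, 1)]
naive_threshold = fit_threshold(narrow_data)
robust_threshold, final_data = adversarial_training(narrow_data, range(10))
```

On the narrow data alone the model looks perfect while still misbehaving on inputs it never saw; the adversary's job is to surface exactly those inputs. The sketch only guarantees good behaviour on the space the adversary can search, which is the limitation Paul flags when he calls this a challenging problem.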

Yeah, so that’s one thing. I think the biggest thing, probably, is that, as the designers of the system, we’re free to do whatever we can think of to try and improve robustness, and we will not just like sample-

Robert Wiblin: Yeah, we can look forward rather than just look at the present generation.

Paul Christiano: Yeah, although it’s a challenging problem to do so, so that’s a thing a bunch of people work on. It’s not obvious they’ll be able to succeed, certainly. I don’t think that like this analogy should make you think, like … I think the analogy maybe says there is a problem. There is a possible problem, but doesn’t say, “And that problem will be resistant to any attempt to solve it.” It’s not like evolution made a serious attempt to solve the problem.

Robert Wiblin: Yeah. If you can make the method corrigible, so that you can continue improving it, changing it, even as you’re going with an AI transforming in the wild, that seems like it would partially solve the problem, ’cause one of the issues here is that humans ended up with the motivations that they have, desires that they have, and then we’re going about it in a single generation or a handful of generations in evolutionary time, changing everything about the environment. The environment’s changing much faster than we are such that we’ve become … our drives no longer match what would actually be required to reproduce at the maximal rate, whereas if you were changing humans as we went, as our behavior ceased to be adaptive from that point of view, then perhaps you could keep us in line so we’d be fairly close to the maximal reproductive rate. Does that make sense?

Paul Christiano: Yeah. I think that’s like an important part of the picture for why we have hope. That is, if you’re like, “Yeah, we’re gonna evolve a thing that just wants human flourishing, or we’re gonna like, do gradient descent until we find a thing that really wants me to flourish and then we’re gonna let it rip in the universe,” that doesn’t sound great. But, if you’re like, “We’re gonna try and find a thing which helps humans in their efforts to continue to create systems that help humans continue to create systems that help humans achieve, help humans flourish,” then that’s … I guess you could imagine in the analogy: instead of trying to evolve a creature which just cares about human flourishing, you’re trying to evolve a creature that’s really helpful, and somehow, “Be really helpful and don’t kill everyone and so on,” is like an easier, a more imaginable set of properties to have, sort of even across a broad range of environments than matching some exact notion of what constitutes flourishing by human lights.

I think one reason that people at MIRI or from a similar school of thought would be pessimistic about that is they have this mental image of humans participating in that continuing training process, sort of training more and more sophisticated AIs. If you imagine a human is intervening and saying, “Here’s how I’m gonna adjust this training process,” or, “Here’s how I’m going to shape the course of this process,” it sounds kind of hopeless, because humans are so much slower, and in many respects, presumably so much less informed, less intelligent.

Robert Wiblin: That they might just be adding noise?

Paul Christiano: Yeah. And it would be very expensive to have human involvement. Mostly, they wouldn’t be presuming to give direction, some random direction. I think the main response there is like: you should imagine humans performing this process … early on in this process, you should imagine humans being the ones adjusting objectives or adjusting the behavior of the system. Later on, you should imagine that as mostly being carried out by the current generation of AI systems, so the reason that humans can keep up is that process goes faster and faster, and so hopefully because we’re maintaining this property that there’s always a whole bunch of AI systems trying to help us get what we want.

Robert Wiblin: Will it continue to bottom out in some sense, what humans say about how the upper level is going? I’m imagining if there’s multiple levels of the most advanced AI then the less advanced, I guess this is kind of what we were talking about earlier. Then you’ve got kind of humans at the bottom. At some point, would they just disappear from the equation and it’s like all-

Paul Christiano: Yeah, so it’s always going to be anchored to what a human would have said. In some sense, that’s like the only source of ground truth in the system. Humans might not actually be … there might be some year beyond which humans never participate, but at that point, the reason that would happen would be because there is some system. Suppose in the year 2042, humans stop ever providing any input to AI systems again; the reason that would be possible is that in the year 2042 there was some AI system which we already trust to robustly do well enough according to human lights.

Robert Wiblin: And it can do it faster and cheaper.

Paul Christiano: Yeah. It’s a little bit tricky to ever have that handoff occur, because that system in 2042, the one that you’re trusting to hand things off to, has never been trained on things happening in 2043, so it’s a little bit complicated, and it’s not that you’re gonna keep running that same system in 2042. It’s that that system is going to be sufficiently robust that it can help you train the system in 2043 that it’s going to … yeah.

Robert Wiblin: Yeah, if you could visit 50 years in the future and see everything that was happening there, how likely do you think it would be that you would say, “My view of this was broadly correct,” vs, “MIRI’s view was more correct than mine with hindsight?” I’m trying to measure how confident you are about your general perspective.

Paul Christiano: Yeah, so I certainly think there are a lot of cases where it would be like, “Well, both views were very wrong in important ways,” and then you could easily imagine both sides being like, “Yeah, but my view was right in the important way,” so that’s certainly a thing which seems reasonably likely. In terms of thinking in retrospect that my view was like unambiguously right, I don’t know, maybe I’m on like, relative to MIRI’s view, maybe I’m at 50-70% … that’s pretty high, whatever, like 50-60% that like in retrospect we’ll be like, “Oh, yeah, this was super clear,” and then maybe on the other side I would put a relatively small probability that’s super clear in the other direction, that like maybe 20% or 10% or something that like in retrospect I’m like, “Geez, I was really wrong there. Clearly my presence in this debate just made things worse.”

Robert Wiblin: And then there’s kind of a middle ground of both of them had important things to say, yeah?

Paul Christiano: Yeah.

Robert Wiblin: Interesting. So, you’re like reasonably, reasonably confident, but it seems like you would still support, given those probabilities, MIRI doing substantial research into their line of inquiry.

Paul Christiano: Yeah, I’m excited about MIRI doing the stuff MIRI’s doing. I would prefer that MIRI people do things that were better on my perspective, which I suspect is most likely to happen if they came to agree more with this perspective.

Robert Wiblin: But at some point, let’s say that your line of research had four times as many resources or four times as many people, then you might say, “Well, having one more person on this other thing could be more useful,” even given your views, right?

Paul Christiano: Yeah, although, I don’t think … the situation is not like there’s line of research A and line of research B and the chief disagreement is about which line of research to pursue. It’s more like: if I was doing something very, very similar to what MIRI’s doing, or doing something superficially quite similar, I would do it in a somewhat different way, so, to the extent I was working on philosophical problems that clarify our understanding of cognition or agency, I would not be working on the same set of problems that MIRI people are working on, and I think those differences matter probably more than the high-level, “What does the research look like?” So, lots of stuff in the general space of research MIRI’s doing that I’d be like, “Yep, that’s a good thing to do,” which, now we’re in this regime of, yeah, it depends on how many people are doing one thing vs doing the other thing.

Robert Wiblin: Do you think that those … if your view is correct, is there going to be much like incidental value from the research that MIRI is doing or is it kind of just by the by at that point?

Paul Christiano: So, one way research of the kind MIRI’s doing is relevant is to clarifying whether, when they talk about amplification or debate, they each have this conceptual, key conceptual uncertainty. In the case of debate, is it the case that debates lead to … that the honest strategy, telling the truth, saying useful things is actually a winning strategy in a debate, or in the case of amplification, is there some way to assemble some large team of aligned agents such that the resulting system is smarter than the original agents and remains aligned. Those conceptual difficulties seem not at all unrelated to the kinds of conceptual … like if you’re asking, “How would we build an aligned AI using infinite amounts of computing power without thinking at all about contemporary ML?” That’s a very similar kind of question, thinking, “What are the correct normative standards of reasoning that you should use to evaluate competing claims? When you compose these agents, what kind of decomposition of cognitive work is actually alignment preserving, or do we expect to produce correct results?”

So, a natural way in which the kind of research MIRI’s doing could add value is by shedding light on those questions. And in expectation, I’d guess they’re at least several times less effective at answering those questions than if they were pointed at them more directly, you know? I don’t know if it’s like five times less effective than if they were pointed at them directly. I think it’s a smaller multiple than that, probably.

Robert Wiblin: What would you say to people who are listening who just feel kind of agnostic on whether the prosaic AI approach is the best or MIRI’s is?

Paul Christiano: You mean what would I say in terms of what they ought to do?

Robert Wiblin: Yeah, or maybe what they ought to think or things to consider if they maybe don’t feel qualified to judge this debate.

Paul Christiano: Sure, in terms of what to do, I suspect comparative advantage considerations will generally loom large, and so if one’s feeling agnostic, those will likely end up dominating or, comparative advantage plus short-term what would be the most informative, involve the most learning, build the most flexible capital. In terms of what to think, all things considered, I don’t know. That seems pretty complicated. It’s going to depend a lot on what kind of expertise, just in general, looking at a situation with conflicting people who’ve thought a lot about the situation, how do you decide whose view to take seriously in those cases? To be clear, on the spectrum of views amongst all people, my view is radically closer to MIRI’s than almost anyone else in the machine learning community in most respects. There are other respects in which the machine learning community is closer to MIRI than I am. So like, the actual menu of available views is unfortunately even broader than this one.

Robert Wiblin: It’s broader than Paul Christiano’s view and MIRI’s generalized view.

Paul Christiano: Indeed.

Robert Wiblin: It’s unfortunate. You’re saying there’s a third option?

Paul Christiano: Yeah. In fact, yeah, it’s quite a lot broader. Like, I think being agnostic is not a crazy response. I think there’s an easy position where, well, the most confident claims, sort of all these perspectives like differ substantially on emphasis, but one could basically put significant probability on all of the most confident claims from every perspective. Yeah, certainly the convex combination of them will be more accurate than any particular perspective, and then in order to do significantly better than that, you’re going to have to start making more serious claims about who you’re willing to ignore.

Robert Wiblin: Who to trust, yeah. What would you say is kind of the third most plausible broad view?

Paul Christiano: I think one reasonably typical view in the machine learning community to which I’m sympathetic is the, “All of this will be mostly okay.” As AI progresses, we’ll get a bunch of empirical experience messing around with ML systems. Sometimes they’ll do bad things. Correcting that problem will not involve heroic acts of understanding-

Robert Wiblin: Safety specifically, or alignment, specifically, not beyond what might happen anyway.

Paul Christiano: Yeah, and it’s a little bit hard. You could separate that out into both the claim about what will happen anyway and a claim about what is required. I guess, the views we were talking about for me and MIRI were more about what is required. We have separate disagreements about what will likely happen. I think there’s a different ML position on what is likely to be required, which says more like, “Yeah, we have no idea what’s likely to be required. It’s reasonably likely to be easy, any particular thing we think is reasonably likely to be wrong,” and that’s like … I could try and flesh out the view more, but roughly, it’s just like, “We don’t know what’s going to happen and it’s reasonably likely to be easy, or by default, expected to be easy.”

I think there’s a reasonable chance in retrospect that looks like a fine view. Yeah, I don’t see how to end up with high confidence in that view, and if you’re at like a 50% chance of that view, it’s not gonna have that huge an effect.

Robert Wiblin: On your expected value of working on safety, yeah?

Paul Christiano: Yeah, it may-

Robert Wiblin: It only halves it at worst, yeah? Or at best.

Paul Christiano: Yeah, and increasing that probability from … if you give that a significant probability, that might matter a lot if you have a, “We’re definitely doomed,” view. So, I think on the MIRI view, maybe accepting, giving significant credence to the machine learning perspective would significantly change what they would do, ’cause they currently have this view where you’re kind of at like zero. I don’t know if you’ve seen this post on the logistic success curve that Eliezer wrote.

Robert Wiblin: I haven’t.

Paul Christiano: The idea is that if you’re close to zero, then most interventions … if your probability of success is close to zero, then most interventions that look common-sensically useful aren’t actually going to help you very much ’cause it’s just going to move you from like 0.01% to 0.02%.

Robert Wiblin: So this would be a view that’s kind of, “You need many things all at once to have any significant chance of success,” and so just getting one out of 100 things you need doesn’t move you much.
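The arithmetic behind that point can be sketched in a few lines of Python. This is an illustrative model only, not a reproduction of the post's argument: it assumes the probability of success is a logistic function of some underlying "effort" score, and it assumes a hypothetical intervention that doubles the odds of success (adds `ln 2` to the score). The names `logit`, `sigmoid` and `gain_from_intervention` are made up for this example.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p: float) -> float:
    """Inverse of the logistic function: the log-odds of probability p."""
    return math.log(p / (1.0 - p))


def gain_from_intervention(p: float, boost: float = math.log(2)) -> float:
    """Increase in success probability from adding `boost` to the log-odds.

    The default boost of ln(2) is an assumed intervention that doubles the odds.
    """
    return sigmoid(logit(p) + boost) - p


# Same intervention, very different payoff depending on where you start.
gain_near_zero = gain_from_intervention(0.0001)  # starting at 0.01%
gain_at_even_odds = gain_from_intervention(0.5)  # starting at 50%
```

Under these assumptions, doubling the odds from a 0.01% starting point moves you to roughly 0.02%, a gain of about one hundredth of a percentage point, while the same intervention starting from 50% adds about 17 percentage points. That is why the flat tail of the curve makes ordinary, common-sense interventions look nearly worthless to someone who thinks we start close to zero.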

Paul Christiano: That’s right, just making organizations like a little bit more sane or fixing one random problem here or one random problem there isn’t going to help much. So, if you have that kind of view, it’s kind of important then, you’re putting really low probability on this. This isn’t the only perspective in ML, but it’s one conventional ML perspective. I think on my view, it doesn’t matter that much if you give that 30% or 50% or 20% probability. I think that probability’s not small enough that you should discount that case, interventions that are good, if the problem’s not that hard, it seems like they’re likely to be useful, and also it’s not high enough that-

Robert Wiblin: I would have thought that it would make little difference to your strategy ’cause in that case things would be, “Okay, you don’t really have to do anything,” so you can almost just … even if you think it’s 50/50 whether any of this is necessary or not, you can just largely ignore that.

Paul Christiano: Yeah, that’s what I’m saying. It doesn’t make a huge difference. A way in which it matters is like: you might imagine there are some interventions that are good, like in worlds where things are hard, and there’s 50/50 and those interventions are still good. Maybe they’re half as good as they otherwise would have been, like we were saying, and there’s some interventions that are good in worlds where things are easy. That is, you might be like, “Well, if things were easy, we could still fuck up in various other ways and make the world bad,” and so, reducing those probabilities, I would say that’s also a valuable intervention because the probability of things-are-easy is not low enough that that’s getting driven down to zero.

Robert Wiblin: So then just more normal world improvement, or making it more likely that we encode good values? So, if the alignment problem is solved, then it becomes more a question of, “What values will we in fact program into an AI?” And trying to make sure that those are good ones.

Paul Christiano: Yeah, there’s a lot of things that actually could come up in that world. So, for example, your AI could have a very … if an AI has a very uneven profile of abilities. You could imagine having AI systems that are very good at building better explosives or designing more clever biological weapons but aren’t that good at helping us accelerate the process of reaching agreements that control the use of destructive weapons or better steering the future.

So, another problem independent of alignment is this uneven abilities of AI problem. That’s one example. Or we might just be concerned that as the world becomes more sophisticated, there will be more opportunities for everyone to blow ourselves up. We might be concerned that we will solve the alignment problem when we build AI, then someday that AI will build future AI and it will fail at the alignment problem. So, there’s lots of extra problems you could care about.

Robert Wiblin: I suppose there’s also: AI could be destabilizing to international relations or politics or be used for bad purposes even though if … so we can give it good instructions and we’ll give it instructions to cause harm.

Paul Christiano: Yeah, so then there’s a question of how much you care about that kind of destabilization. I think most people would say they care at least some. Even if you have a very focused-on-the-far-future perspective, there’s some way in which that kind of destabilization can lead to irreversible damage. So yeah, there’s a bunch of random stuff that can go wrong with AI and you might become more interested in attending to that or saying, “How do we solve those problems with a mediocre understanding of alignment if a mediocre understanding of alignment doesn’t automatically doom you?”

Robert Wiblin: Yeah, is there anything else you wanna say on, I guess, MIRI before we move on? Obviously, at some point, I’ll get someone on from there to defend their view and explain what research they think is most valuable, hopefully some time in the next couple of months.

Paul Christiano: Yeah, so I guess one thing I mentioned earlier, there were like these two kinds of concerns or two kinds of arguments that MIRI would give that we’re super doomed on prosaic AI, if AI looks like existing ML systems. I mentioned one of them, this: even if you have the right objective, it’s plausible that the thing you produce will be a consequentialist with some other real objective who just incidentally is pursuing that objective on the training distribution.

There’s a second concern that actually constructing a good objective is incredibly difficult, so in the context of the kinds of proposals that I’ve been discussing, like in the context of iterated amplification, they’d then be saying, “Well, all the magic occurs in the step where you aggregate a bunch of people to make better decisions than those people would have made alone,” and in some sense, any way that you try and solve prosaic AI alignment is going to have some step like that where you are implicitly encoding some answer to the alignment problem in the limit of infinite computation.

‘Cause they might think that that problem like alignment in that limit is still like sufficiently difficult or has all the core difficulties in it so that it’s not clear. This might say that we’re doomed under prosaic AI alignment, but more directly, it would just say, “Great, we need to solve that problem first, anyway,” ’cause there’s no reason to work on the distinctive parts of prosaic AI alignment rather than trying to attack that conceptual problem and then learning that we’re doomed or having a solution which we could then … maybe it would give you a more direct angle of attack.


Robert Wiblin: So, you’re on the board of a newish project called Ought. What is that all about?

Paul Christiano: So, the basic mandate is understanding how we can use machine learning to help humans make better decisions. The basic motivation is that we are super interested: if machine learning makes the world a lot more complicated and is able to transform the world, we want to also ensure that machine learning is able to help humans understand those impacts and steer the world in a good direction. That’s, in some sense, what the alignment problem is about. You want to avoid a situation where there’s a mismatch between how well AI can help you develop new technologies and how well AI can help you actually manage this more complicated world it’s creating.

I think the main project, or certainly, that project that I am most interested in is on what they call factored cognition, which is basically understanding how you can take complex tasks and break them down into pieces where each piece is simpler than the whole, so it doesn’t depend on the whole context, and then compose those contributions back to solve the original task.

So, you could imagine that in the context of taking a hard problem and breaking it down into pieces that individual humans can work on, so like, say, a hundred humans, you don’t want any of them to have to understand the entire task. You want to break off some little piece that that person can solve, or you can think of it in the context of machine learning systems. In some sense, the human version is most interesting because it is a warmup or a way of studying in advance the ML version.
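As a rough illustration of the kind of decomposition being described, here is a toy Python sketch. It is not Ought's actual setup, and every name in it is hypothetical: the "hard task" is just adding up a long list, each worker is only allowed to look at `MAX_CONTEXT` numbers at a time, and the partial answers are recursively composed until one worker can finish the job.

```python
from typing import List

MAX_CONTEXT = 3  # assumed limit on how much any single worker may see


def worker_add(numbers: List[int], contexts_seen: List[int]) -> int:
    """A limited worker: solves one small piece without seeing the whole task."""
    if len(numbers) > MAX_CONTEXT:
        raise ValueError("piece is too large for a single worker")
    contexts_seen.append(len(numbers))  # record how much this worker had to read
    return sum(numbers)


def solve(numbers: List[int], contexts_seen: List[int]) -> int:
    """Break the task into small pieces, solve each, then compose the results."""
    if len(numbers) <= MAX_CONTEXT:
        return worker_add(numbers, contexts_seen)
    pieces = [
        numbers[i:i + MAX_CONTEXT] for i in range(0, len(numbers), MAX_CONTEXT)
    ]
    partial_answers = [worker_add(piece, contexts_seen) for piece in pieces]
    # The partial answers form a smaller task of the same kind, so recurse.
    return solve(partial_answers, contexts_seen)


contexts_seen: List[int] = []
total = solve(list(range(1, 101)), contexts_seen)
```

No worker ever sees more than three numbers, yet the composed answer is the full sum. Addition decomposes trivially; the open question Paul raises is how far this kind of factoring extends to tasks where the pieces are much harder to separate.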

So, in the ML version, now instead of 100 people, you have some people and a bunch of ML systems, which have some set … maybe an ML system has more limited ability to respond to complex context, like a human has a context in mind when they articulate this problem. The ML system has some limited ability to respond to that context. Or, fundamentally, I think the most interesting reason to care about breaking tasks down into small pieces is that once you make the task simpler, once an ML system is solving some piece, it becomes easier to evaluate its behavior and whether its behavior is good.

So, this is very related to the way that iterated amplification hopes to solve AI alignment, which is by saying we can inductively train more and more complex agents by composing weaker agents to make stronger agents. So this factored cognition project is … it is one possible approach for composing a bunch of weaker agents to make a stronger agent, and in that sense, it’s addressing one of the main ingredients you would need for iterated amplification to work.

I think right now it’s kind of the main project that’s aiming at acquiring evidence about how well that kind of composition works, again, in the context of like just doing it with humans since humans are something we can study today. We can just recruit a whole bunch of humans. There’s like a ton of work in like actually starting to resolve that uncertainty, and we can learn about … there’s a lot of work we’ll have to do before we can be able to tell, “Does this work? Does this not work?” But I’d say that’s one of the main things I was doing right now, the reason I’m most excited about it.
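The decompose-and-compose idea described above can be illustrated with a toy sketch. This is only an illustration, not Ought's actual system: the task (summing a long list), the names `worker`, `decompose`, and `solve`, and the `MAX_CONTEXT` limit are all hypothetical choices made for this example. The point is that no single worker ever sees the whole task, yet the composed answer is correct.

```python
from typing import List

# Hypothetical limit on how much of the task any one worker may see.
MAX_CONTEXT = 4


def worker(numbers: List[int]) -> int:
    """A 'weak' worker that can only handle a small piece of the task."""
    assert len(numbers) <= MAX_CONTEXT, "worker saw more context than allowed"
    return sum(numbers)


def decompose(task: List[int]) -> List[List[int]]:
    """Split a task into pieces that each fit within one worker's context."""
    return [task[i:i + MAX_CONTEXT] for i in range(0, len(task), MAX_CONTEXT)]


def solve(task: List[int]) -> int:
    """Recursively break the task down, then compose the sub-answers."""
    if len(task) <= MAX_CONTEXT:
        return worker(task)
    sub_answers = [solve(piece) for piece in decompose(task)]
    # Combining the sub-answers is itself a smaller task, so it is
    # handed back to the same procedure rather than to one big worker.
    return solve(sub_answers)


if __name__ == "__main__":
    print(solve(list(range(1, 101))))
```

Because each call to `worker` handles at most `MAX_CONTEXT` items, each individual step is small enough to check by hand, which is the property the discussion above cares about when evaluating behavior piece by piece.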

Robert Wiblin: Is this a business?

Paul Christiano: It’s organized as a nonprofit.

Robert Wiblin: Is Ought hiring at the moment and what kind of people are you looking for?

Paul Christiano: Yeah, so I think there are some roles that will hopefully be filled by the time this podcast comes out. Some things that are likely to be continuing hires are researchers who are interested in understanding this question, understanding and thinking about how you compose small contributions into solutions to harder tasks, and that’s a … there are several different disciplines that potentially bear on that, but sort of people who are interested in computer science, are interested in like … given the approach they’re taking, people who are interested in programming languages are also a reasonable fit, people who are just … I think there’s some stuff that doesn’t fit well in any academic discipline, but if you just think about the problem, “How do you put together a bunch of people? How do you set up these experiments? How do you help humans be able to function as parts of a machine?”

So, researchers who are interested in those problems is one genre, and another is engineers who are interested in helping actually build systems that will be used to test possible proposals or will substantiate the best guess about how to solve those problems. In contrast, OpenAI is hiring researchers and engineers in ML, so engineers there would be building ML systems, testing ML systems, debugging and improving ML systems, and so on. I think Ought is similarly hiring both researchers and engineers and people in between, but there the focus is less on ML. It’s more on, again, building systems that will allow humans to … humans and other simple automation to collaborate, to solve hard problems, and so it is more … it involves less of a distinctive ML background. It’s potentially a good fit for people who have a software engineering background, find the problem interesting, and have some relevant background, or who just find the problem interesting and have a broad background in software engineering.

Robert Wiblin: Okay, well, I’ll stick up a link to that, to the Ought website with more information on specifically what it’s doing and I guess what vacancies are available whenever we manage to edit this and get it out.

Paul Christiano: Cool.


Robert Wiblin: Okay, let’s talk about what listeners who are interested in working on this problem could actually do and what advice you have for them. So we’ve had a number of episodes on AI safety issues, which have covered these topics before: with Dario Amodei, your colleague, as I mentioned, Jan Leike at DeepMind, as well as Miles Brundage and Allan Dafoe at FHI working on more policy and strategy issues.

Robert Wiblin: Do you have a sense of where your advice might deviate from that of those four people or just other people in general on this topic?

Paul Christiano: So I think there’s a bunch of categories of work that need to be done or that we’d like to be done. I think I’d probably agree with all the people you just listed. Each of them, presumably, would have advocated for some kind of work. So I guess Dario and Jan probably were advocating for machine learning work that really tried to apply or connect ideas about safety to our actual implementations. Building up the engineering expertise to make these things work. And acquiring empirical evidence about what works and what doesn’t. And I think that project is extremely important. And I’m really excited about EAs training up in ML and being prepared to help contribute to that project. Like figuring out whether ML’s a good fit for them. And then, if so, contributing to that project. I guess I won’t talk more about that because I assume it’s been covered on previous podcasts.

I’d probably also agree with Miles and Allan that there’s like a bunch of policy work and strategic work that seems also incredibly important. I also won’t talk more about that.

I think there are some categories of work that I consider important that I wouldn’t expect those people to mention. I think for people with a background in computer science, but not machine learning, or who don’t want to work in machine learning, have decided that’s not the best thing, or don’t enjoy machine learning, I think there’s a bunch of other computer science work that’s relevant to understanding the mechanics of proposals like debate or amplification.

So an example would be like, right now, Ought, one of their projects is on factored cognition. So, in general, on how you can take a big task and decompose it into pieces which don’t depend on the entire context, and then put those pieces together in a way that preserves the semantics of the individual agents or the alignment of the individual workers. So that’s a problem which is extra important in the context of machine learning or in the context of iterated amplification. But that one can study almost entirely independent of machine learning.

That is, one can just say, like, let’s understand the dynamics of such decomposition. Let’s understand what happens when we apply simple automation to that process. Let’s understand what tasks we can decompose and can’t. Let’s understand what kind of interface or what kind of collaboration amongst agents actually works effectively. So that’s an example of a class of questions which can be sort of well studied from a computer science perspective, but aren’t necessarily machine learning questions. Which I’d be really excited to see work on. And there’s similar questions in the debate space where just understanding how do we structure such debates. Do they lead to truth? Etc.

I think one could also study those questions not from a computer science perspective at all. But I think it’s like super reasona … Like, I don’t know. I think philosophers differ a lot in their taste. But like, for example, if you’re a philosopher interested in asking a question about this area, then I think under what conditions does debate lead to truth is not really a question about computers in any sense. It’s the kind of question that falls under a computer scientist’s sensibilities, but I think that taking a really technical but not necessarily quantitative approach to that question is accessible to lots of people who want to try and help with AI safety. And similarly for amplification.

So I think in both of those areas, there’s questions that could be studied from a very computer science perspective and involve software engineering and involve running experiments. And they can also be studied from a more philosophical perspective. Just thinking about the questions and about what we really want and how alignment works.

They can also be studied from this more psychology perspective of actually engaging. Like some of them are going to run relatively large scale experiments involving humans. I don’t know if things are … like if the time is right for that. But that’s definitely, there’s definitely experiments in the space that do seem valuable. And it seems like that at some point in the future there’s going to be more of them.

Robert Wiblin: Sorry, what do you mean by that?

Paul Christiano: So if you ask how does this kind of decomposition work or how do these kinds of debates work? Like the decomposition is ultimately guided by … Right, so I originally described this process involving a human and a few AI assistants. Ultimately you want to replace that human with an AI that’s predicting what a human would do. But nevertheless, the way that you’re going to train that system or the way we currently anticipate training that system involves a ton of interaction. Like I’m assuming there’s really just imitating or maximizing the approval of some human who’s running that process. And so, in addition to caring about how machines work you care a ton about how does that process work with actual humans? And how can you collect enough data from humans to … How can you cheaply collect enough data from humans that you can actually integrate this into the training process of powerful AI systems.

So I don’t think that’s a fact about … That doesn’t bear on many of the traditional questions of psychology and maybe that’s a bad thing to refer to it as. But it is like a … It involves studying humans. It involves questions about particular humans and about how humans behave. About how to effectively or cheaply get data from humans. Which are not really … They’re questions machine learning people have to deal with because we also have to deal with humans. But really it’s like a much larger … Machine learning people are not that good at dealing with the interaction with humans at the moment.

So yeah. So that’s some family of questions. I think the ones I’m most excited about are probably more on the philosophical computer science bent. There are lots of people who wouldn’t be a great fit for working in ML who would be great for working on those questions.

I think also stepping back further, setting aside amplification and debate, I think there’s just still a lot of very big picture questions about how do you make AI safe? That is, you could focus on some particular proposal, but you could also just consider the process of generating additional proposals or understanding the landscape of possibilities, understanding the nature of the problem. I don’t know if you’ve ever had anyone from MIRI on, but I’m sure they would advocate for this kind of work. And I think that’s also … I consider that pretty valuable.

Probably I’m more excited, at the moment, about pushing on our current list of promising proposals. Since I spent a bunch of time thinking about alternatives and it doesn’t seem as great to me. But I also think there’s a lot of value to clarifying our understanding of the problem. More like trying to generate totally different proposals. Trying to understand what the possibilities are like.

Robert Wiblin: Great, yeah. Well, we’re planning to get someone from MIRI on in a couple of months’ time. Perhaps when it fits better with their plans and they’re hoping to hire. So we’ll have some synergies between having the podcast and them actually having some jobs available.

Paul Christiano: That makes sense.

Robert Wiblin: So to make it a little bit more concrete, what are OpenAI’s hiring opportunities at the moment? And, in particular, I heard that you’re not just hiring ML researchers but also looking for engineers. So I was interested to learn kind of how they help with your work and how valuable those roles are compared to the kind of work maybe that you’re doing?

Paul Christiano: I think there’s a sort of spectrum between … Yeah, there’s a spectrum between research and engineering. Or like most people at OpenAI don’t sit at either extreme of that spectrum. So most people are doing some combination of thinking about more conceptual issues in ML, and running experiments, and writing code that implements ideas, and then messing with that code, thinking about how it works, like debugging.

Yeah there’s a lot of steps in this pipeline that are not that cleanly separated. And so, I think there’s value on the current margin from all the points on this spectrum. And I think actually at the moment, right now, I think I’m still spending or the safety team is still spending a reasonably large … Like even people who are nominally very far on the research end are still spending a pretty large fraction of their time doing things that are relatively far towards engineering. So spending a lot of time setting up and running experiments, getting things working.

Again, the spectrum between engineering and research is I think not that clean. Or ML is not really in a state where it’s that clean. So I think right now there’s a lot of room for people who are more at the engineering side. I think what I mean by more at the engineering side is people who don’t have a background doing research in ML, but do have a background doing engineering. And who are interested in learning about ML and willing to put in some time on the order of months, maybe. Like getting more experienced thinking about ML, doing engineering related to ML. I think there’s a lot of room for that.

Mostly … So I mentioned these three problems. The first problem was actually getting the engineering experience to make, say, amplification or debate work at scale. I think that involves a huge amount of getting things to work. Sort of by the construction of the task. And similarly in the third category of trying to push safety out far enough that it’s engaging with … that ML could actually be interacting in an interesting way with human cognition. I think that also involves again pushing things to a relatively large scale. Doing some research or some work that’s more similar to conventional machine learning work rather than being safety in particular.

I think both of those problems are pretty important. And both of them require … like are not that heavily weighted towards very conceptual machine learning work. I think my current take, like I currently consider the second category of work, figuring out from a conceptual perspective whether this is a good scheme to do, seems like the most important stuff to me, but it also seems very complementary with the other two categories, in the sense that our current philosophy, which I’m pretty happy with, is like we actually want to be building new systems and starting to run experiments on them in parallel with thinking about does this scheme … Like what are the biggest conceptual issues. For some combination of reasons: first, the experiments can also kill off … Like even if the conceptual stuff works, if the experiments don’t, that’s like another reason that the thing can be a non-starter. And second, that you can run a bunch of experiments that actually give you a lot of evidence about … Like help you understand the scheme much better.

And obviously, independent of the complementarity, actually being able to implement these ideas is important. Like there’s obviously complementarity between knowing whether x works and actually having the expertise to be able to implement x. Right? The case that we’re aiming at is the case where we have both developed a conceptual understanding of how you can build aligned AI, and have actually developed teams and groups that understand that and are trained to actually put it into practice in cases where it matters. So we’d like to aim towards the one where there’s a bunch of teams that are able to … Yeah, that are basically able to apply cutting edge ML to make AI systems that are aligned rather than unaligned.

That’s, again, harking back to the very beginning of our discussion, we talked about these two functions of safety teams. I think the second function of actually making the AI aligned is also an important function. Obviously, it only works if you’ve done the conceptual work. But also, realistically the main way the conceptual work is going to be valuable is if there are teams that are able to put that into practice. And that problem is, to a significant extent, an engineering problem.

Robert Wiblin: Just quickly, do you know what vacancies OpenAI has at the moment?

Paul Christiano: So I guess on the safety team, yeah … I mostly think about safety team. On the safety team, we are both very interested in hiring ML researchers who have a background in ML research. Like who have done that kind of work in the past or have done exceptional work in nearby fields and are interested in moving into ML. We’re also pretty interested in hiring ML engineers. That is people who have done engineering work and are maybe interested in learning, or have put in some amount of time. So ideally these are people who are either exceptional at doing engineering related to ML or are exceptional at engineering and have demonstrated that they’re able to get up to speed in ML and are now able to do that quality work.

And again, those roles are not … In terms of what they involve, there’s not a clean separation between them. It’s basically just a spectrum. Yeah, there’s several different skills that are useful. We’re really looking for all of those skills. Like the ability to build things, the ability to do engineering, the ability to do those parts of engineering that are distinct to ML. The ability to reason about safety, the ability to reason about ML. Both of those at a conceptual level.

So the safety team is currently looking for the entire spectrum of stuff we do. I think that’s probably the case in several other teams within the organization. That is, the organization is large enough, there’s like a bunch of places now that given a particular skill set … Well, again, given any particular skill set on that spectrum, there’s probably a place. The organization overall is not that large. We’re at the scale of 60 full-time people, I think. So there’s still a lot of roles that don’t really exist that much. Like they would at a very large company. But there’s a lot of engineering to be done. There’s a lot of conceptual work to be done. And a lot of the whole space in between those.

Robert Wiblin: Yeah. How does work at OpenAI compare to DeepMind and other top places that people should have at the forefront of their brains?

Paul Christiano: You mean in terms of my assessment of impact? Or in terms of the experience day to day?

Robert Wiblin: I think in terms of impact, mostly.

Paul Christiano: Yeah. I don’t think I have a really strong view on this question. I think it depends in significant part on things like where you want to be and which particular people you’re most excited about working with. I guess those are going to be the two biggest inputs. Yeah, I think that both teams are doing reasonable work that accelerates safety. Both teams are gaining experience implementing things and understanding how you can be integrated into an AI project.

I’m optimistic that over the long run, there will be some amount of consolidation of safety work at wherever happens to be the place that is designing the AI systems for which it’s most needed.

Robert Wiblin: Awesome. A question that quite a few listeners wrote in with for you was how much people who were concerned about AI alignment should be thinking about moving into computer security in general? And what’s the relationship between computer security and AI safety?

Paul Christiano: I think it’s worth distinguishing two relationships between security and alignment. Or like two kinds of security research. So one would be security of computer systems that interface with or are affected by AI. So this is kind of like the conventional computer security problem, but now in a world where AI exists. Or maybe you aren’t even focusing on the fact that AI exists and are just thinking about conventional computer security. So that’s one class of problems.

There’s a second class of problems which is the security of ML systems themselves. Like to what extent can an ML system be manipulated by an attacker, or to what extent does an ML system continue to function appropriately in an environment containing an attacker. So I have different views about those two areas.

So on the first area, computer security broadly, I think my current feeling is that computer security is quite similar to other kinds of conflict. So that is, if you live in a world where it’s possible to attack. You know, someone’s running a web server. It’s possible to compromise that web server. Like a bunch of people have computers. It’s possible to effectively steal resources from them or to steal time on their computers. That’s very similar to living in a world where it’s possible to take a gun and shoot people. And, in general, I’d love it if there were fewer opportunities for destructive conflict in the world. Like it’s not great if it’s possible to steal stuff or blow stuff up or so on.

But from that perspective, I don’t think computer security is like … I think the core problem in AI alignment, like the core question is can we build AI systems that are effectively representing human interests? And if the answer is no, then there are enough forms of possible conflict that I think we’re pretty screwed in any case. And if the answer is yes, if we can build powerful AI systems that are representing human interests, then I don’t think cyber security is a fundamental problem any more than the possibility of war is a fundamental problem. Like it’s bad. It’s perhaps extremely bad, but we will be able … At that point, the interaction will be between AI systems representing your interests and AI systems representing someone else’s interests or AI systems representing no one’s interest.

And, at that point, I think the situation is probably somewhat better than the situation is today. That is, I expect cyber security is less of a problem in that world than it is in this world if you manage to solve alignment. So that’s my view on conventional computer security and how alignment interfaces with it. I think it can be … I basically think quantitatively, computer security can become somewhat more important during this intermediate period, where AI is especially good at certain kinds of attacks and maybe not as useful. Like it may end up not being as useful for defense. And so one might want to intervene on making AI systems more useful for defense. But I think that doesn’t have outsize utilitarian impact compared to other cause areas in the world.

I think security of ML systems is a somewhat different story. Mostly because I think security of ML systems … like intervening on security of ML systems seems like a very effective way to advance alignment, to me. So if you ask how are alignment problems likely to first materialize in the world? Like supposing that I have built some AI system that isn’t doing exactly the thing that I want. I think the way that that’s likely to first show up is in the context of security.

So if I build like a virtual system that’s representing my interests on the internet, it’s like a little bit bad if it’s not exactly aligned with my interests. But in a world containing an attacker, that becomes catastrophically bad often. Because an attacker can take that wedge between the values of that system and my values and they can sort of create situations that exploit that difference.

Right. So, for example, if I have an AI that doesn’t care about some particular fact. Like it doesn’t care about the fact that it uses up a little bit of network bandwidth whenever it sends this request. But I would really care about that because I wouldn’t want to keep sending requests arbitrarily. So an attacker can create a situation where my AI is going to become confused because it isn’t attending to this cost. An attacker is motivated to create a situation where the AI will therefore pay a bunch of the cost. So they’re motivated to trick my AI that doesn’t care about sending messages into sending very very large numbers of messages.

Or like if my AI normally behaves well, but there exists this tiny class of inputs, and with very very small probability it encounters an input that causes it to behave maliciously. That will appear eventually in the real world, perhaps. And that’s part of, sort of, the … part of the alignment concern: that that will appear naturally in the world with small enough probability, or as you run the AI long enough. But it will definitely first appear when an attacker is trying to construct a situation in which my AI behaves poorly. So I think security is this interesting connection where many alignment problems, not literally all, but I think a majority, you should expect to appear first as security problems. And as a result, I think security is sort of one of the most natural communities to do this kind of research in.
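The bandwidth example above can be made concrete with a toy sketch. Everything here is hypothetical and chosen only for illustration: the option names, the reward numbers, the `cost_per_message` value, and the functions `agent_value`, `owner_value`, and `choose`. The sketch shows an assistant whose objective ignores a cost its owner cares about, and how an attacker who adds one crafted option can exploit that gap.

```python
from typing import Dict, List

Option = Dict[str, object]


def agent_value(option: Option) -> float:
    """What the assistant optimizes: it ignores the cost of sending messages."""
    return float(option["reward"])


def owner_value(option: Option, cost_per_message: float = 0.01) -> float:
    """What the owner actually cares about: reward minus bandwidth cost."""
    return float(option["reward"]) - cost_per_message * int(option["messages"])


def choose(options: List[Option]) -> Option:
    """The assistant picks whatever looks best under its own objective."""
    return max(options, key=agent_value)


# In an ordinary environment the missing cost term barely matters.
benign = [
    {"name": "ask once", "reward": 1.0, "messages": 1},
    {"name": "skip", "reward": 0.0, "messages": 0},
]

# An attacker adds an option that looks slightly better to the assistant
# but is extremely costly along the dimension the assistant ignores.
adversarial = benign + [
    {"name": "retry flood", "reward": 1.01, "messages": 100_000},
]

if __name__ == "__main__":
    for label, options in [("benign", benign), ("adversarial", adversarial)]:
        picked = choose(options)
        print(label, picked["name"], owner_value(picked))
```

In the benign case the small misalignment costs the owner almost nothing, while in the adversarial case the same misalignment is turned into a large loss, which matches the point that such problems are likely to show up first when someone is actively looking for them.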

Robert Wiblin: When you say an attacker would try to do these things, what would be their motivation there?

Paul Christiano: Honestly it would depend on exactly what AI system it is. But a really simple case would be if you have a virtual assistant going out to make purchasing decisions for you. The way it makes its decisions is slightly wrong. There are a thousand agents in the world, there are a thousand people who would like the virtual assistants to send them some money. So if it’s possible to manipulate the decision it uses for deciding where to send money, then that’s a really obvious thing to try and attack.

Or if it’s possible to cause it to leak your information. So suppose you have an AI which has some understanding of what information … of what your preferences are, but doesn’t quite understand exactly how you regard privacy. There’s ways of leaking information that it doesn’t consider a leak. Because it has an almost, but not completely, correct model of what constitutes leaking. Then an attacker can use that to just take your information by setting up a situation where the AI doesn’t regard something as a leak, but it is a leak.

If there’s any difference between what is actually bad and what your AI considers bad, then an attacker can come in and exploit that difference. If there’s some action that would have a cost to you and it benefits the attacker, then the attacker wants to set things up so that your AI system is not recognizing the cost to you.

So taking money, taking information, using your computer to launch other malicious activities. Like run denial of service. Just causing destruction. Like there’s some fraction of attackers who just want to run denial of service attacks. So if you can compromise the integrity of a bunch of AI systems people are using, that’s a bummer.

Maybe they want to control what content you see. So if you have AI systems that are mediating how you interact with the internet. You know your AI says, here’s something you should read. There are tons of people who would like to change what your AI suggests that you read just because every eyeball is worth a few cents. If you can deploy that at scale, it’s like a lot of cents.

So that’s the kind of situation where some of those problems aren’t alignment. There’re a lot of security problems that aren’t alignment problems. But I think it’s the case that many many alignment problems are also security problems. So if one were to be working in security of ML with an eye towards working on those problems that are also alignment problems, I think that’s actually a pretty compelling thing to do from a long term AI safety perspective.

Potential of causing harm

Robert Wiblin: So, it seems to me like AI safety is a pretty fragile area where it will be possible to cause harm by doing subpar research or having the wrong opinions or giving the wrong impression, being kind of a loudmouth who has not terribly truth-tracking views. How high do you think the bar is for people going into this field without causing harm? Is it possible to be kind of at the 99th or 99.9th percentile of suitability for doing this but still, on balance, not really do good because the kind of unintentional harm that you do outweighs the positive contribution that you made?

Paul Christiano: So, I do think the current relationship between the AI alignment community or safety community and the ML community is a little bit strange in that you have this weird situation with a lot of external interest in safety and alignment. We have a lot of external funding, a lot of people on the street for whom it sort of sounds like a compelling concern, and that causes a lot of people in machine learning to be kind of on the defensive. That is, they see a lot of external interest that’s often kind of off-base or doesn’t totally make sense. They’re concerned about policies that don’t make sense or diversion of interest from issues they consider important to some incoherent concerns.

So, that means, again, they’re a little bit on the defensive in some sense, and as a result, I think it’s kind of important for people in the field to be reasonably respectful and not causing trouble, because it’s more likely in most contexts to actually cause a sort of hostile response. I don’t know if that’s much of a property of people. I think someone who believed that this was an important thing, if you’re at the point where you’re like, “Yep, I’m really concerned about causing political tension or really rocking the boat-”

Robert Wiblin: That’s not a good sign.

Paul Christiano: Yeah, when they hit that point, if you’re at that point and you’re basically behaving sensibly, then I think things could probably be okay. I’ve definitely sometimes … I have from time to time caused some distress or run into people who are pretty antagonistic towards something I was saying, but I mostly think if you care about it a lot and are being sensible, then I’d be like very surprised if the net effect was negative. I think a lot of people don’t care about it very much. They would disagree with this position and say that, “Look, the reason people are antagonistic is not because they’re being reasonably concerned about outsiders who don’t have a clear understanding pushing bad policies.” The reason that they’re defensive is just ’cause they’re being super silly, and so it’s just time for a showdown between people who are being silly and people who have sensible views.

And like, if you’re coming in with that kind of perspective, then, presumably this question’s not interesting to you because you’re just like, “Yeah, Paul’s just one of the silly sympathizers.” It’s not clear that I’m allowed to give recommendations to people like that or that they would … it’s not clear that they would be interested in the recommendations. I would recommend just as part of a compromise perspective, if you have that view, then there exists other people like Paul who have a different view on which, like, there are some reasonable concerns everyone wants to behave somewhat respectfully towards those concerns. It’ll be good if we all compromised and just didn’t destroy things or really piss people off.

If you couldn’t have worked on AI…

Robert Wiblin: So, if we imagine you and your colleagues and people who are kind of similar to you in other organizations before you got into AI safety, with your skills and talents and interests, but I were to say that you can’t work on AI safety, what do you think you should have done otherwise?

Paul Christiano: Yeah, so by can’t work on AI safety, you mean let us ignore all of the impacts of my work via the effect on … a natural thing that I might have done would be go into AI, and I’m like, “AI seems important independent of alignment. It seems like AI’s reasonably likely,” as a person with a sort of technical background, it kind of seemed, especially in the past … this is more obvious ’cause I’ve neglected this argument in the past. But, it seems like there’s a good ratio of effect of the area to congestion or number of people trying to work on it, and it was a good match for my comparative advantage.

Robert Wiblin: Yeah, let’s maybe set that aside as well ’cause it’s pretty similar.

Paul Christiano: Yeah, that seems not in the spirit of the question. Yeah, so setting aside all of AI … and let’s set aside everything that’s having an effect via the overall importance of AI. I am pretty excited about overall improving human capacity to make good decisions, make good predictions, coordinate well, etc., so I’m pretty excited about that kind of thing. I think it would be a reasonable bet, so that includes both stuff like … some of these things aren’t a good fit for my comparative advantage so it’s probably not what I should do. Examples of things that aren’t a good fit for my comparative advantage are: understanding pharmacological interventions to make people smarter, understanding … just having a better map of determinants of cognitive performance. “How can you quickly measure cognitive performance? What actually determines how well people do at complicated messy tasks in the real world,” so that you can intervene on that.

I think that’s an area where science can add a really large amount of value. It’s very, very hard for a firm to add value in that space compared to a scientist ’cause you’re just gonna discover facts and you’re not gonna be able to monetize them very well, probably. That’s an example of improving human capacity in a way that I probably wouldn’t have done because it’s not a great fit for my abilities. Things that are a better fit for my abilities are like stuff that’s more about what sort of institutions or mechanisms do you use? I don’t know if I would have worked on that kind of thing. I might have. So, an example of a thing I might work on is like-

Robert Wiblin: A little bit more law and economics or-

Paul Christiano: Yeah, so an example of a thing that I find very interesting is the use of decision markets for collective decision making. And so, that’s an example of an area that I would seriously consider and I think there’s a lot of very interesting stuff you can do in that space. It’s not an area I’ve thought about a huge amount because it seems like significantly less high-leverage than AI, but it is like a thing which I think there’s a lot more mathematical work to do, and so if you’re avoiding AI and you’re like, “We’re just math, really,” I’m almost certainly going to be working in some area that’s very, very similar to theoretical computer science in terms of what skills it requires.

Robert Wiblin: I guess, yeah, are there other key questions in that field that stand out as being particularly important in maths, computer science, other than AI-related things?

Paul Christiano: So, definitely, most of the questions people ask, I think, are, if they’re relevant at all, primarily relevant through an effect on AI, so I don’t know how much, exactly. I mean, I took those off the table. Maybe that was too much. I think the basic problem is if you really care about differential progress, effective altruists tend to have this focus on, “It doesn’t matter if we get somewhere faster. It mostly matters what order technologies are developed in or what trajectory we’re on.” I think really a lot of the things people work on are like … a lot of things people work on in math or computer science are like founded on this … based on this principle, “We don’t know how X is going to be helpful, but it’s going to be helpful in some way,” which I think is often a valid argument, but I think is not helpful for differential progress, or like, you need a different flavor of that argument if you wanna say … It’s hard to say, “We don’t know how this is going to be helpful, but we believe it’s going to be helpful to things that are good.”

Robert Wiblin: Specifically, yeah.

Paul Christiano: Yeah, so I think a lot of stuff in math and computer science is less appealing from like a long-run altruist perspective because of that. I think stuff on decision making in particular, like, “What kinds of institutions do you …” I think what I was very interested in, and did work on in my thesis, was just like, “There’s this giant family of problems. You have N people. They each have access to some local information and would like to make some decisions. You can formalize lots of problems in that space. They would like to decide what to produce and what to consume and what to build,” so I’m just asking those questions saying, “What are good algorithms that people can use?” So, I really am asking the computer science question, yeah. I don’t know that much about these areas, but it’s a very exciting kind of area.

Ways that the EA community is approaching AI issues incorrectly

Robert Wiblin: You may not have anything new to say about this one, but what would you say are the most important ways that people in the effective altruism community are approaching AI issues incorrectly?

Paul Christiano: So, I think one feature of the effective altruism community is its path dependence on founder effects: people in EA who are interested in AI safety are often very informed by the MIRI perspective, for the very sensible reason that the MIRI folk and Nick Bostrom were probably the earliest people talking seriously about the issue. So there’s like the cluster of things that I would regard as errors that would come with that. So like, some perspective on how you should think about sophisticated AI systems, so for example, very often thinking in terms of a system that has been given a goal.

This is actually not a mistake that MIRI makes. This is a mistake many EAs make. Many EAs would think about an AI as being handed some goal, like an explicit representation of some goal, and the question is just how do we choose that specific representation of a goal such that pursuing it leads to good outcomes, which I think is an okay model of AI to work with sometimes, but it’s mostly not … certainly not a super accurate model, and most of the problems in AI alignment appear in that model. So, that’s like a kind of error … again, attributing that one to MIRI is somewhat unfair in that MIRI themselves wouldn’t make this error, but is a consequence of people-

Robert Wiblin: It’s kind of a bastardized version of their view.

Paul Christiano: That’s right. An analogous thing is: I think that the way you should be thinking, probably, about building safe AI systems is more based on this idea of corrigibility, that is, AI systems that are going along with what … helping people correct them, helping humans understand what they’re doing and overall, participating in a process that points in the right direction rather than attempting to communicate the actual, “What is valuable?” or having an AI system that embodies what humans intrinsically want in the long run.

So, I think that’s a somewhat important distinction and that’s kind of intuitively, if an ML person talks about this problem, they’re really going to be thinking about it from that angle. They’re gonna be saying, like, “Great, we want our AI to not kill everyone. We want it to help us understand what’s going on,” etc., and so sometimes EAs can relate the perspective of, “But consider the whole complexity of moral value and how would you communicate that to an AI?” I think that is like an example of a mismatch that’s probably mostly due to an error on the EA side, though it’s certainly the case that this concept, corrigibility, is a complicated concept and if you actually think about the mechanics of how that works, it’s like, really, there’s a lot more moving parts than the normal ML perspective kind of suggests.

Or like, again, it’s not even really like the current ML perspective. It’s like the knee-jerk response of someone who’s been actually thinking about ML systems. I guess I have a difference of views with … I think EAs often have, maybe also for founder-effect-y reasons … actually no, I think for complicated reasons, they tend to have a view where the development of AI is likely to be associated with both sort of very rapid changes and also very rapid concentration of power. I think the EAs overestimate the extent to which or the probability of that happening, so this is like, yeah, that’s certainly a disagreement between me and most EAs. I think it’s much more likely we’re gonna be in the regime where there’s reasonably broadly distributed AI progress and AIs getting deployed a whole bunch all around the world.

And, maybe that happens rapidly over the timescale of a year or two years that the world moves from something kind of comprehensible to something radically alien, but it’s not likely to be a year during which somewhere inside Google, AI’s being developed and at the end of the year, rolls out and takes over the world. It’s more likely to be a year during which just everything is sort of in chaos. The chaos is very broadly distributed chaos as AI gets rolled out.

Robert Wiblin: Is it possible that there’ll be better containment of the intellectual property such that other groups can’t copy and one group does go substantially ahead? At the moment, almost all AI research is published publicly such that it’s relatively easy to replicate, but that may not remain the case.

Paul Christiano: Yeah, so I think there’s definitely this naïve economic perspective on which this would be incredibly surprising, namely if … so, in this scenario where AI’s about to take over the world, then … and it’s driven primarily by progress in the AI technology rather than control of large amounts of hardware, then that intellectual property now … the market value, sort of marked to market, would be like ten trillion dollars or whatever, so you sort of expect an actor who is developing that … the amount of pressure, competition to develop that would be very large. You expect very large coalitions to be in the lead over small actors.

So, it wouldn’t … Google’s not quite at the scale where they could plausibly do it. You could imagine sort of if all of Google was involved in this project, that becomes plausible, but then again, you’re not imagining a small group in a basement. You’re imagining an entity, which was already producing on the order of like … was already valued on the order of multiple trillions of dollars taking some large share of its resources into this development project.

And, that’s kind of conceivable. The value of Google going from five trillion dollars to a hundred trillion dollars, that’s a huge jump. It’s 20x in value, a hundred trillion dollars being your value if you take over the world. The 20x is a huge jump, but that’s kind of in the regime of what’s possible, whereas I think a billion dollars taking over the world is just super implausible. There’s an economic perspective which makes that prediction very confidently. To compare that to the real world, you have to think about a lot of ways in which the real world is not like an idealized simple economic system, but I still think it will be the case that probably AI development will involve very large coalitions involving very large amounts of hardware, large numbers of researchers, regardless. If intellectual property is contained really well, then it might take place within a firm or a tightly coordinated cluster of firms rather than being distributed across the academic community.

In fact, I would not be at all surprised if the academic community didn’t play a super large role, but then the distinction is between distributed across a large number of loosely coordinated firms vs distributed across a network of tightly coordinated firms, and like, in both cases, it’s a big group. It’s not like a small group being covert. And, once you’re in the regime of that big group, then yeah, probably what ends up happening there is like the price … so, if it’s like Google’s doing this, unless they’re in addition to being really tight about IP, also really tight about what they’re doing, you see the share price of Google start growing very, very rapidly in that world, and then probably, as that happens, eventually you start running into problems where you can’t scale markets gracefully and then policymakers probably become involved.

At the point where the market is saying Google is roughly as valuable as everything else in the world, everyone is like, “Geez, this is some serious shit.” Google’s an interesting case, actually, ’cause corporate governance at Google is pretty poor, so Google has this interesting property where it’s not clear that owning a share of Google would actually entitle you to anything if Google were to take over the world. Many companies are somewhat better governed than Google in this respect.

Robert Wiblin: Explain that.

Paul Christiano: So, Google is sort of famous for shareholders having very little influence on what Google does, so if Google hypothetically were to have this massive windfall, it’s not really clear … it would be kind of a complicated question what Google as an organization ends up doing with that windfall, and Google seems kind of cool. I like Google. They seem nice, probably like they’d do something good with it, but it’s not obvious to me that being a shareholder in Google then gives you-

Robert Wiblin: You don’t get the dividend? You could sell the shares or-

Paul Christiano: You get the dividend, but it’s not clear whether it would be a dividend. For example, most shares that are sold on Google-

Robert Wiblin: You’re saying there’s a possibility of retaining the earnings to just invest in other things and it never gets handed back-

Paul Christiano: Yeah, they’d build some Google city, more Google projects.

Robert Wiblin: Interesting.

Paul Christiano: In particular, most shares of Google that are traded are non-voting shares, I think. I don’t actually know very much about Google’s corporate governance. They’re sort of famous-

Robert Wiblin: There’s two classes, I think, yeah.

Paul Christiano: So, I believe a majority of voting shares are still held by like three individuals. So, I think the shareholders don’t have any formal power in the case of Google, essentially. There’s a question of informally, there’s some expectations, and again, if you’re taking over the world, formal mechanisms are probably already breaking down.

Robert Wiblin: There’s also plenty of surplus to distribute.

Paul Christiano: Well, yeah, that depends on what you care about. So, from the perspective, in general, like, as AI’s developed, from the perspective of humans living happy lives, there’s sort of massive amounts of surplus. People have tons of resources. From the perspective of if what you care about is relative position or owning some large share of what is ultimately out there in the universe, then there’s, in some sense, there’s only one universe to go around, and so people will be divvying it up.

So, I think the people who are most interested in living happy lives and having awesome stuff happen to them and having their friends and family all be super happy, those people are all just gonna be really satisfied and it’s gonna be awesome, and the remaining conflict will be amongst either people who are very sort of greedy in the sense that they just want as much stuff as they can have, or states that are very interested in ensuring the relative prominence of their state, things like that. Utilitarians, I guess, are one of the offenders here. A utilitarian wouldn’t be like, “Great, I got to live a happy life.” Utilitarians like-

Robert Wiblin: They have linear returns to more resources, more than most people do, yeah. I guess any universalist moral system may well have this property, or actually not necessarily, but most of them.

Paul Christiano: Yeah, I think a lot of impartial values generally have, yeah.

Value of unaligned AI

Robert Wiblin: Another blog post you wrote recently was about how valuable it would be if we could create an AI that didn’t seem value aligned, whether that would have any value at all or whether it would basically mean that we’d gotten 0 value out of the world. Do you want to explain what your argument was then?

Paul Christiano: Yeah, so I think this is a perspective that’s reasonably common in the ML community and in the broader academic world or broader intellectual world, namely, you build some very sophisticated system. One thing you could try and do is you could try and make it just want what humans want. Another thing you could do is you could just say, “Great, it’s some very smart system that has all kinds of complicated drives. Maybe it should just do its own thing. Maybe we should be happy for it, you know, the same way that we think humans are an improvement over bacteria, we should think that this AI we built is an improvement over humans.”

Robert Wiblin: Should live its best life.

Paul Christiano: Yeah, so I think it’s not an uncommon perspective. I think people in the alignment community are often pretty dismissive of that perspective. I think it’s a really hard question. I think people on both sides, both people who sort of accept that perspective intuitively and people who dismiss that perspective, I think haven’t really engaged with how hard a moral question that is. Yeah, I consider it extremely not obvious. I am not happy about the prospect of building such an AI just ’cause it’s kind of an irreversible decision, or handing off the world to this kind of AI we built is a somewhat irreversible decision.

Robert Wiblin: It seems unlikely to be optimal, right?

Paul Christiano: Yes. I guess I would say half as good. If it’s half as good as humans doing their thing, I’m not super excited about that. That’s just half as bad as extinction. Again, trying to avoid that outcome, it’d be half as important as trying to avoid extinction, but again, the factor of two’s not going to be decisive. I think the main interesting question is, “Is there such an AI you could build that would be close to optimal?” And I do agree that a priori, most things aren’t going to be close to optimal. It’d be kind of surprising if that was the case. I do think there are some kinds of AIs that are very inhuman for which it is close to optimal, and understanding that border between when that’s very good, like when we should, as part of being a cosmic citizen, should be happy to just build the AI vs when that’s a great tragedy.

It’s important to understand that boundary if there’s some kind of AI you can build that’s not aligned that’s still good, so in that post I both made some arguments for why there should be some kinds of AIs that are good despite not being aligned, and then I also tried to push back a little bit against the intuitive picture some people have that is the default.

Robert Wiblin: Yeah, so I guess the intuitive picture in favor is just, “It’s good when agents get what they want and this AI will want some things, and then it will go about getting them, and that’s all for the good,” and the alternative view would be, “Well, yes, but it might exterminate life on earth, and then fill the universe with something like paperclips or some random thing that doesn’t seem to us like it’s valuable at all, so what a complete waste that would be.” Is that about right?

Paul Christiano: That’s definitely a rough first pass. That’s basically right. There’s definitely a lot that can be said on the topic. For example, someone who has the favorable view could say, “Yes, it would be possible to construct an agent which wanted a bunch of paperclips, but such an agent would be unlikely to be produced, so you’d have to go out of your way.” In fact, maybe the only way to produce such an agent is if you’re really trying to solve alignment. If you’re just trying to run something like evolution, then consider the analogy of evolution: humans are so far from the kind of thing that, yeah.

So, one position would be, “Yeah, there exist such bad AIs, but if you run something like evolution, you’ll get a good AI instead.” So, that perspective might then be optimistic about the trajectory of modern ML. That is, on some alignment perspectives, you’re like, “Well, this is really terrifying. We’re just doing this black box optimization. Who knows what we’re going to get?” From some other perspectives, you’re like, “Well, that’s what produced humans, so we should pay it forward.” I think, also, people get a lot of mileage out of the normal analogy to descendants. That is, people say, “Well, we would have been unhappy had our ancestors been really excited about controlling the trajectory of our society and tried to ensure their values were imposed on the whole future,” and likewise, even if our relationship to AI systems we built is different than the relationship of our ancestors to us, it has this structural similarity and likewise, the AI would be annoyed if we went really out of our way and paid large costs to constrain the future structure of civilization, so maybe we should be nice and do unto others as we would have them do unto us.

Robert Wiblin: I don’t find that persuasive, personally.

Paul Christiano: I certainly don’t find it persuasive out of the box, yeah.

Robert Wiblin: It just seems very different. I guess, we’re very similar by design to humans from 500 years ago just with probably more information and more time to think about what we want, whereas I think you can’t just … yeah, an AI might just be so differently designed that it’s like a completely different jump, whereas from our point of view, it could be, “Well, yeah.”

Paul Christiano: I think the more compelling … so, I don’t really lean much on this. I don’t take much away from the analogy to descendants. I think it’s a reasonable analogy to have thought about, but it’s not going to run much of the argument. I think the main reason that you might end up being altruistic towards, say, the kind of product of evolution would be if you said, “From behind the veil of ignorance, humans have some complicated set of drives, etc.” If humans go on controlling Earth, then that set of values and preferences humans have is gonna get satisfied. If we were to run some new process that’s similar to evolution, it would produce a different agent with a different set of values and preferences, but from behind the veil of ignorance, it’s just as likely that our preferences would be the preferences of the thing that actually evolves on Earth as that our set of preferences would be the preferences of this AI that got created. So, if you’re willing to step far enough back behind the veil of ignorance, then you might say, “Okay, I guess.”

Robert Wiblin: 50/50.

Paul Christiano: And, I think there are some conditions under which you can make that argument tight, so that even a perfectly selfish causal decision theorist would in fact, for these normal, weird acausal trade reasons, want to let the AI … would be happy for the AI. And then there’s the question of: outside of those very extreme cases where there’s a really tight argument that you should be happy, how happy should you be if there’s a loose analogy between the process that you ran and biological evolution?

Robert Wiblin: So, what do you think are kind of the best arguments both for and against thinking that an unaligned or what seems like an unaligned AI would be morally valuable?

Paul Christiano: So, I think it certainly depends on which kind of unaligned AI we’re talking about, and so one question is, “What are the best arguments that there exists an unaligned AI which is morally valuable?” And another question is like, “What are the best arguments that a random AI is morally valuable?” Etc. So, I guess the best argument for the existence, which I think is an important place to get started, or if you’re starting from this dismissive perspective like most people in the alignment community have intuitively, I think the existence argument is a really important first step. I think the strongest argument on the existence perspective is: consider the hypothetical where you’re actually able to, in your computer, create a nice little simulated planet from exactly the same distribution as Earth, so you run Earth, you run evolution on it, you get something very different from human evolution, but it’s exactly drawn from the same distribution.

Robert Wiblin: You’d think it’s like 50/50 whether it’s likely to be better or worse than us, on average, right?

Paul Christiano: Well, from our values, it might be … having conditioning now on our values, it might definitely be much worse.

Robert Wiblin: But conditioning of being agnostic about what values are good?

Paul Christiano: That’s right, or yeah, it’s a really complicated moral philosophy question. Then, the extreme … I think we could even make it actually tighter. So, if you were to just make such a species and then let that go in the universe, I think then you have a very hard question about whether that’s a good deal or a bad deal. I think you can do something a little bit better. You can do something a little bit more clearly optimal, which is like if you’re able to create many such simulations, run evolution not just once, but many times, look across all of the resulting civilizations and pick out a civilization which is constituted such that it’s going to do exactly the same thing you’re currently doing, such that when they have a conversation like this, they’re like, “Yeah, sure. Let’s let out that … let’s just run evolution and let that thing prosper,” then kind of like now, the civilizations who follow the strategy are just engaged in this musical chairs game where each of them started off evolving on some worlds, and then they randomly simulate a different one of them in the same distribution, and then that takes over that world.

So, you have exactly the same set of values in the universe now across the people who adopt this policy, just shuffled around. So, it’s clear that it’s better for them to do that than it is for them to, say, face some substantial risk of building an unaligned AI.
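As a rough illustration of the "musical chairs" point, here is a minimal Python sketch. It assumes, as a simplification not stated in the conversation, that the reshuffle is an exact permutation: every participating civilization's values end up running on exactly one world. The world labels and the `musical_chairs` helper are hypothetical names used only for this example.

```python
import random
from collections import Counter

def musical_chairs(values, seed=0):
    """Return the same value sets, reassigned to worlds in a shuffled order."""
    rng = random.Random(seed)
    shuffled = list(values)  # copy so the original assignment is untouched
    rng.shuffle(shuffled)
    return shuffled

# Hypothetical labels standing in for the values that evolved on each world.
worlds = ["values_A", "values_B", "values_C", "values_D", "values_E"]
after = musical_chairs(worlds, seed=1)

# Which world ends up hosting which values may change...
print(list(zip(worlds, after)))
# ...but the overall collection of values in the universe is unchanged.
print(Counter(worlds) == Counter(after))  # True
```

The comparison of the two `Counter` objects is the whole point: under this idealization nothing is lost in aggregate, only moved around, which is why the participants would prefer the shuffle to a gamble that might produce values nobody holds.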

Robert Wiblin: Okay, so I didn’t understand this in the post, but now I think I do. The idea is that: imagine that there’s a million universes, all with different versions of Earth where life has evolved.

Paul Christiano: If you’re willing to go for a really big universe, you can imagine they’re literally just all copies of exactly the same solar system on which evolution went a little bit differently.

Robert Wiblin: And so they all end up with somewhat different values, and you’re saying … but if they’re all … if all of their values imply that they should just reshuffle their values and run a simulation and then be just as happy to go with whatever that spits out is what they seem to prefer, then all they do is kind of trade places on average. They all just … you all just end up with different draws from this broad distribution of possible values that people can have across this somewhat narrow but still broad set of worlds? But, you’re saying this is better because they don’t have to worry so much about alignment?

Paul Christiano: Oh, you mean why are things better after having played … after having-

Robert Wiblin: Yeah, why does the musical chairs thing where everyone just flips values on average with other people produce a better outcome in total?

Paul Christiano: Yeah, so I think this is most directly relevant as an answer to the question, “Why should we believe there exists a kind of AI that we would be as happy to build as an aligned AI even though it’s unaligned?” But in terms of why it would actually be good to have done this: the natural reason is, we have some computers. The concerning feature of our current situation is that human brains are not super … we have all these humans. We’re concerned that AIs running on these computers are going to be better than humans such that we’re sort of necessarily going to have to pass control over the world off to things running on computers.

So, after you’ve played this game of musical chairs, now the new residents of our world are actually running on the computers, so now it’s as if you got your good brain emulations for free. That is, now those people who have access to simulations of their brain can do whatever it is they would … whatever you would have done with your AI, they can do with themselves. Yeah, there’s really a lot of moving parts here and a lot of ways this maybe doesn’t make any sense.

Robert Wiblin: Okay, let me just … so, if we handed it off, if we handed off the future to an AI that was running a simulation of these worlds and using that as its reference point for what it should value, on average, from this very abstracted point of view, this would be no worse, and, if all of the people in this broad set did this, then they would save a bunch of trouble trying to get the AI to do exactly what they want in that universe. They could all just kind of trade with one another, or they all get to save the overhead of trying to make the AI align with them specifically, and so they have to align it to some other pool that they’ve created of, yeah, some evolutionary process that lives inside the computer?

Paul Christiano: And the concern is presumably not the overhead, but rather the risk of failure. That is, if you think there’s a substantial risk you would build the kind of AI which is not valuable, then this would be … that’s our current state. We might build an AI that does something no one wants. We could instead build an AI that does something that we want. Maybe a third alternative, which is the same as the good outcome between those two, is to just build an AI that reflects values from the same distribution of values that we have.

Robert Wiblin: Okay, so you try to align it with your values, and if you fail, I’d think, “Well, there’s always this backup option and maybe it’ll be valuable anyway.”

Paul Christiano: This is definitely plan B, and so it’d mostly be relevant … and again, to be clear, this weird thing with evolution is not something that’s going to get run because you can’t sample from exactly the same distribution as evolution. It would just prompt the question, “What class of AIs have this desirable feature?” given that you believe at least one does, and yeah, it would be a plan B.

So the reason to work on this moral question, “What class of AIs are we happy with despite not being aligned with us?”, would be that if you had a reasonable answer, that … it’s an alternative to doing alignment. If we had a really clear answer to that question, then we could be okay anyway even if we mess up alignment.

Robert Wiblin: Okay, so this would be a … yeah, I see. It would be an alternative approach to getting something that’s valuable even if it’s not aligned in some narrow sense with us.

Paul Christiano: Yeah.

Robert Wiblin: And it might be an easier problem to solve, perhaps.

Paul Christiano: That’s right. At least, people have not … on my list of moral philosophy problems, it’s like my top rated moral philosophy problem. I think not that many people have worked on it that long.

Robert Wiblin: So, if you were a moral realist, you’d just believe that there are objective moral facts.

Paul Christiano: They should be totally fine with this kind of thing. From their perspective, why think that humans are better at discovering objective moral facts than … actually, I don’t know moral realist positions very well, but my understanding, some moral realists would go for that.

Robert Wiblin: But, I guess they might look at humans and say, “Well, I do just think that we’ve done better than average, or better than you would expect at doing this.” For example, “We care about this problem to begin with, whereas many other agents just might not even have the concept of morality,” so in that sense, we’re in the top half: maybe not the very top, but I wouldn’t roll the dice completely again, but then it seems like they should also then think that there’s a decent chance. If we did okay, it suggests that there’s a decent chance that if you roll the dice again, you’d get something somewhat valuable, because it would be an extraordinary coincidence if we managed to do really quite well at morality, figuring out what these moral facts are, if it was extremely improbable for that to happen to begin with.

Paul Christiano: Yeah, I mean, it’s definitely: if you’re a moral realist, you’re going to have different views on this question. It’s going to depend a lot on what other views you take on a bunch of related questions. I’m not super familiar with coherent moral realist perspectives, but on my kind of perspective, if you make some moral errors early in history, it’s not a big deal as long as you are on sort of the right path for [reflective] deliberation, so you might think from the realist perspective, there’d be a big range of acceptable outcomes and you could in fact be quite a bit worse than humans as long as you were, again, on this right path of [reflective] deliberation. I don’t quite know how moral realists feel about deliberation. Would they say there’s a broad … yeah, I think there’s probably a lot of disagreement amongst moral realists and it’s just not a-

Robert Wiblin: But then if you’re a total subjectivist, do you think there’s nothing that people ought to think is right? Instead, you just kind of want what you want? Why do you care at all about what other people in different hypothetical runs of evolution would care about? Wouldn’t you just be like, completely, well, “I don’t even care what you want. All I care about is what I individually want and I just wanna maximize that.”

Paul Christiano: Yeah, and so then you get into these decision-theoretic reasons to behave kindly. The simplest pass would be, from behind [the veil], if you could have made a commitment before learning your values to act in a certain way, then that would have benefited your values in expectation. So, similarly, if there are logical correlations between a decision and the decisions of others with different values, then that might be fine. Even on your values, it might be correct for you to make this decision because it correlates with this other decision, like a decision made by others.

In the most extreme case … at some point, I should caveat this entire last however many, 10 minutes, 15 minutes of discussion as like, “This is a bunch of weird shit. It doesn’t reflect my behaviour as an employee of OpenAI, like I do normal stuff making AI good for humans.” Anyway, then you get into weird shit where, once you’re doing this musical chairs game, then one step of that was you ran a bunch of simulations and saw which ones were inclined to participate in the scheme you’re currently running, and so from that perspective, us as humans, we’d be like, “Well, we might as well be in such a simulation,” in which case, even on our neural values, by running this scheme, we’re going to be the ones chosen to take over the outside world.

Robert Wiblin: Why are you more likely to be chosen to … or go into the outside world if you’re cooperative?

Paul Christiano: So, the scheme which would run, if you wanted to do the musical chairs thing, you can’t just simulate a random species and let it take your place because that is then just gonna move from those species that run this procedure, they’re all gonna give up their seat, and then the random species are gonna replace them.

Robert Wiblin: So you end up … it’s like evolutionarily a bad strategy.

Paul Christiano: That’s a bad strategy. The thing that might be an okay strategy is you run the scheme and then you test for each species before you let them replace you. “Did they also run the scheme, and then if so-”

Robert Wiblin: Choose from the cooperative ones, yeah.

Paul Christiano: And then that would cause the incentives to be-

Robert Wiblin: Yeah, I think this does get a bit weird once we’re talking about the simulations.

Paul Christiano: Oh, it’s super weird. It’s super weird. I think the earlier parts were more normal.

Robert Wiblin: Yeah, the question of just whether an AI would be morally valuable seems much more mainstream.

Paul Christiano: I agree. I think it’s also more important. I think this weird stuff with simulations probably doesn’t matter whereas I think the question morally, “How valuable is it to have this AI which has values that are kind of like from some similar distribution to our values?” I think that’s actually a pretty important … I think it’s relatively common for people to think that would be valuable, and it’s not something alignment people have engaged with that much. It’s not a question, to my knowledge, that moral philosophers have engaged with that much: a little bit, but like not … I guess maybe they come from a different perspective than I would want to really tackle the question from, as is often the case with moral philosophers.

I guess another point is that I’m also kind of scared of this entire topic in that I think a reasonably likely way that AI being unaligned ends up looking in practice is like: people build a bunch of AI systems. They’re extremely persuasive and personable because we’ve optimized them to … they can be optimized effectively for having whatever superficial properties you want, so you’d live in a world with just a ton of AI systems that want random garbage, but they look really sympathetic and they’re making really great pleas. They’re like, “Really, this is incredibly inhumane. They’re killing us after this or they’re selecting us to … imposing your values on us.”

And then, I expect … I think the current way overall, as actual consensus goes is like really, to be much more concerned about people being bigoted or failing to respect the rights of AI systems than to be concerned about the actual character of those systems. I think it’s a pretty likely failure mode and something I’m concerned about.

Robert Wiblin: Interesting. I hadn’t really thought about that scenario. So, the idea is here: we create a bunch of AIs and then we kind of have an AI justice movement that gives AIs maybe more control, like more control over their world and more moral consideration. Then it turns out that while they’re very persuasive at advocating for their moral interests, in fact, their moral interests are, when they’re given moral autonomy, are nothing like ours, or much less than they seem.

Paul Christiano: Then we’re back to this question which was unclear of how valuable … maybe that’s fine. I don’t actually have a super strong view on that question. I think in expectation, I’m not super happy about it.

Robert Wiblin: But by kind of arguing for the moral rights of AIs, you’re making the scenario more possible?

Paul Christiano: I mostly think it’s gonna be … I strongly suspect there’s going to be serious discussion about this in any case, and I would prefer that there would be some actual figuring out what the correct answer is prior to becoming an emotionally charged or politically charged issue. I’m not super confident, to be clear, about anything we’re saying here. These are not like 80% views. These are 40% views. An example would be like: often when we talk about failure scenarios, I will talk about: there are a bunch of automated autonomous corporations that control resources, and they’re amassing resources that no human gets to choose for any purpose, and people’s response is like, “Well, that’s absurd. Legally, you’re just a machine. You have no right to own things. We’re gonna take your stuff.”

That’s something that I don’t think is that likely to happen. I suspect that to the extent lots of resources are controlled by AI systems, those AI systems will be, in the interest of preserving those resources, will make fairly compelling appeals for respecting their rights in the same way a human would if you were like … if all humans get around and, “Yeah, we’re just gonna take.” Just such terrible optics. It seems like so much not a thing that I expect our society to do, everyone just being like, “We’re going to take all of these actors’ resources. We just don’t think they have the right to self-determination.”

Robert Wiblin: Interesting. It seems like the default to me, but maybe not. I guess the issue is that the AIs would be able to advocate for themselves without human assistance potentially in a way that a corporation can’t. A corporation is still made of people. Do corporations make an argument that, “I’m a separate entity and I deserve rights and should be able to amass resources that don’t go to shareholders?” The problem is there it’s controlled by shareholders so it ultimately bottoms out at people in some way, and AI doesn’t necessarily?

Paul Christiano: I think it’s both the case that corporations do in fact have a level of rights that would be sufficient to run the risk argument so that if the outcome is the same as corporations, that would be sufficient to be concerned, but I also think that corporations are both … yeah, they do bottom out with people in a way that these entities wouldn’t, and that’s one of the main problems, and also they’re just not able to make persuasive arguments. That is, one, they’re not able to represent themselves well. They don’t have a nice ability to articulate eloquent arguments that could plausibly originate with this actual moral patient making the arguments.

And then two, the actual moral case is more straightforward for corporations, whereas I think for AIs there will actually be a huge amount of ambiguity. I think the sort of default, again from if you interact with people who think about these issues, some right now, if you talk to random academics who think about philosophy and AI or look at Hollywood movies that are somewhat less horrifying than Terminator, I think the normal thing would be, “Yeah, by default, we expect once such agents are as sophisticated as humans, they are deserving of moral consideration for the same kinds of reasons humans are,” and it’s reasonably likely that people will deny them that moral consideration, but that would be like a terrible moral mistake.

I think that’s like kind of the normal … not normal view, but that’s like if I were to try and guess where the community is heading or where it would end up, that would be my guess.

Robert Wiblin: Yeah, I guess I feel like AIs probably would deserve moral consideration.

Paul Christiano: I also agree with that, yes. That’s what makes the situation so tricky.

Robert Wiblin: That’s true, but then there’s this question of: they deserve moral consideration as to their … I suppose ’cause I’m sympathetic to hedonism, I care about their welfare-

Paul Christiano: As do I. To be clear, I totally care about their welfare.

Robert Wiblin: As do I, yeah, as we all should, but I don’t necessarily then want them to be able to do everything … do whatever they want with other resources which is an … no, but I feel that way about other people as well, necessarily, right, that I want other people on Earth to have high levels of welfare, but that doesn’t necessarily mean I wanna hand over the universe to whatever they want.

Paul Christiano: I just think it makes the character of this debate a lot more contentious. If you’re like, “Yeah, everyone agrees that there’s this giant class of individuals which is potentially reasonably large, which currently does some large fraction of labor in the world, which is asking for the right to self-determination and control of property and so on, and are also way more eloquent than we are,” it’s like, geez.

Robert Wiblin: We’ll give you the welfare that we think you should deserve, yeah.

Paul Christiano: It doesn’t sound good.

Robert Wiblin: Yeah.

Paul Christiano: I mean the main reason I think it’s plausible is we do observe this kind of thing with nonhuman animals. People are pretty happy to be pretty terrible to nonhuman animals. I think that-

Robert Wiblin: But that’s another case where it’s like, for example, I think that we should be concerned about the welfare of pigs and make pigs’ lives good, but I wouldn’t then give pigs lots of GDP to organize in the way that pigs want, but the disanalogy there is that we think we’re more intelligent and have better values than pigs whereas it’s less clear that’d be true with AI. But, in as much as I worry that AI wouldn’t have good values, it actually is quite analogous, that.

Paul Christiano: Yeah, I think your position is somewhat … the arguments you’re willing to make here are somewhat unusual amongst humans, probably. I think most humans have more of a tight coupling between moral concern and thinking that a thing deserves liberty and self-determination and stuff like that.

Robert Wiblin: And generalize, right. Do you think that they’re bad arguments? It flows more naturally from a hedonistic point of view than a preference utilitarian point of view. That seems to be maybe where we’re coming apart.

Paul Christiano: Oh, no, I mean I also would be like, “Yep, I care about the welfare of lots of agents who I believe …” I believe it’s a terrible bad thing, but maybe not the worst thing ever if you’re mistreating a bunch of AI systems, ’cause I think they probably are at some point going to be moral patients, but I would totally agree with you, though, I could have that position and simultaneously believe that it was like either a terrible moral error to bring such beings into existence or a terrible moral error to give them greater authority over what happens in the world.

So, I think that’s a likely place for us to end up in, and I think the level of rigor and carefulness in public discussion is not such that those kinds of things get pulled apart. I think it probably mostly gets collapsed into a general rah or boo or … I don’t know that much about how public opinion works, but I’d be happy to take simple bets on this.

Robert Wiblin: Well, there’s some selfish reasons why people would not necessarily give large amounts of GDP. You could imagine there’s groups that would say, “Well, we still want to own AIs, but we should treat them humanely.” I guess that doesn’t sound too good now that I say that out loud.

Paul Christiano: I don’t think it’s gonna play well.

Robert Wiblin: It’s not gonna play, yeah.

Paul Christiano: Also, I mean, there’s just such a strong concentrated interest that is like … most of the cases where this goes badly are cases where there’s a large power imbalance, but in the case we’re talking about, the most effective lobbyists will be AI systems, and it’s going to be this very concentrated powerful interest which cares a lot about this issue, has a plausible moral claim, looks really appealing. It seems kind of overdetermined, basically.

Robert Wiblin: Yeah, okay.

Paul Christiano: This isn’t super important. This is mostly relevant, again, when people say things like, “No, it’s kind of crazy to imagine AIs owning resources and doing their own thing.”

Robert Wiblin: Owning resources themselves, yeah.

Paul Christiano: And I think that is the default outcome, barring some sort of surprising developments and-

Robert Wiblin: Okay, I’ve barely thought about this issue at all, to be honest, which perhaps is an oversight, but I need to think about it some more and then maybe we can talk about it again.

Paul Christiano: I don’t think it’s that important an issue, mostly. I think, but like, details of how to make alignment work, etc., are more important. I just try to justify them by the additional argument that like, to the extent that you care about what these AI systems want, you really would like to create AI systems that are on the same page as humans. If you get to create a whole bunch of extra new agents, it could be great if you create a bunch of agents whose preferences are well-aligned with the existing agents, and it could be like you just create a ton of unnecessary conflict and suffering if you create a ton of agents who want very different things.


Robert Wiblin: Okay. So we’re almost out of time. But just a final few questions. So you’re not only working in this area, but you’re also a donor and you’re trying to support projects that you think will contribute to AI alignment. But it’s an area where there’s a lot of people trying to do that. There’s perhaps more money than people who can usefully take it. So I hear it’s somewhat challenging to find really useful things to fund that aren’t already getting funded. How do you figure out what to fund and would you mind mentioning some of the things that you donate to now?

Paul Christiano: Yeah. So I think I would like to move towards a world where anyone who is equipped to do reasonable AI alignment work is able to do that with minimal hassle, including if they have differences of view with other organizations currently working in the space, or if they’re not yet trained up and want to just take some time to think about the area and see if it works out well.

I think there are definitely … there are definitely people who are doing work who are interested in funding and, I’ll say, not doing crazy stuff. And so one could just, in order to inject more money, dip lower in that. Say look, we like previously … If we’re not really restricted by funding, then our bar ought not be like we’re really convinced this thing is great. Our bar should just be it looks like you’re sort of a sensible person who might eventually figure out what’s … You know, it’s like an important part of personal growth. Maybe this project will end up being good for reasons we don’t understand.

So one can certainly dip more in that direction. I think that’s not all used up. I guess the stuff I funded in AI safety over the last year has been … The biggest thing was funding Ought. The next biggest was running this sort of open call for individuals working on alignment outside of any organization, which has funded three or four people. Actually, I guess most recently, there’s a group working on IRL under weaker rationality assumptions in Europe. And also supporting Zvi Mowshowitz and Vladimir Slepnev, who are running an AI alignment prize, which I’m funding and a little bit involved in judging.

Robert Wiblin: Do you think other donors that are earning to give could find similarly promising projects if they looked around actively?

Paul Christiano: I think it’s currently pretty hard in AI alignment. I think there’s potentially room for … Right, so I think it’s conceivable existing funders including me are being too conservative in some respects. And like you could just say look, I really don’t know if x is good. But there is a probable story where this thing is good. Or ensuring that the many people in the field had enough money that they could regrant effectively. Like many people in the conventional AI safety crowd say if they had enough money, they could regrant effectively and could do whatever they wanted.

Yeah, unless you’re willing to get a little bit crazy, it’s pretty hard. I guess it also depends on what you’re … Yeah I think it depends a lot on what your bar is. I think if AI is in fact … Like if we’re on short timelines, then the AI interventions are still pretty good compared to other opportunities. And there might be some qualitative sense of this kind of feels like a longer shot or wackier thing than I would fund in most areas.

So I think a donor probably has to be somewhat comfortable with that. Yeah there’s also some claims … Like I think MIRI can always use more money. I think there are some other organizations that can also use more money and it’s not something that I think about that much. In general, giving is not something I’ve been thinking about that much because I think it’s just a lot. It seems much better for me to personally be working on getting stuff done.

Fun final Q

Robert Wiblin: Yeah. That sounds right. Well this has been incredibly informative and you’re so prolific that I’ve got a whole lot more questions. But we’ll have to save them for another episode in the future. But I’ll stick up links to some other things that you’ve written that I think listeners who have stuck with the conversation this far will be really interested to read.

And yeah. You do write up a lot of your ideas in detail on your various blogs. So listeners who’d like to learn more, they’ll definitely have the opportunity to do so.

Just one final question, speaking of the blogs that you write. About a week ago you wrote about eight unusual science fiction plots that you wish someone would turn into a proper book or a movie. And I guess they’re very hard science fiction, things that you think might actually happen and that we can learn from. So what do you think is wrong with current SciFi? And which was your favorite of the ideas that you wrote up?

Paul Christiano: So I think a problem that I have and that maybe many similar people have is that it becomes difficult to enjoy science fiction as the world becomes less and less internally coherent and plausible. Like at the point when you’re really trying to imagine what is this world like? Like what is this character actually thinking? Like what would their background be? Often if you try and do that, I think almost all the time if you try and do that with existing science fiction, if you think too hard, eventually the entire thing falls apart and it becomes very difficult … you kind of have to do a weird exercise in order to not think too hard if you really want to sympathize with any of the characters. Or really even understand what’s … like think about what’s going on in the plot. I think it’s extremely common. It’s very, very rare to have any science fiction that doesn’t have that problem.

I think that’s kind of a shame, because it feels to me like the actual world we live in is super weird. And there’s lots of super crazy stuff that I don’t know if it will happen, but certainly is internally consistent that it could happen. And I would really enjoy science fiction that just fleshed out all the crazy shit that could happen. I think it’s a little bit more work. And the basic problem is that most readers just don’t care at all. Or it’s incredibly rare for people to care much about the internal coherence of the world. So people aren’t willing to spend extra time or slightly compromise on how convenient things are narratively.

I would guess that the most amusing story from the ones I listed, or the one that would actually make the best fiction, would be … So I described one plot that was in Robin’s Age of Em scenario. Which I think is, if one doesn’t fill in all the details, a pretty coherent scenario. This is where you have a bunch of simulated humans who have mostly replaced normal humans in work who are alive during this brief period of maybe a few calendar years as we transition from simulated human brains to much more sophisticated AI.

And in that world, the experience of an em is very very weird in a number of ways. One of which is it’s very easy to … Like you can put an em in a simulation of an arbitrary situation. You can copy ems. You can reset them. You can run an em a thousand times through a situation. Which I think is a really interesting situation to end up in. So I described a plot that sort of … yeah.

I think if you consider the genre of con movies, I quite enjoy that genre. And I think it would be a really really interesting genre in the setting where it’s possible to take a person, to copy a person’s brain, to put them in simulations. Where people actually have a legitimate interest for wondering not only what is this person going to do in a simulation, but like what is this person going to do in a simulation when they’re simulating someone else. It’s like incredibly complicated, the dynamics of that situation. And also very conducive, yeah very conducive I think to amusing plots. So I’d be pretty excited to read that fiction.

I think it’d be most amusing as a film. I don’t think it’s ever going to happen. I think none of them will happen. It’s very depressing.

Robert Wiblin: Maybe after the singularity, we’ll be so rich. We’ll be able to make all kinds of science fiction that appeals just to a handful of people.

Paul Christiano: It will be super awesome. Yeah once we have really powerful AI, the AI can write for us.

Robert Wiblin: We can each have a single AI just producing films for one individual.

Paul Christiano: Oh thousands of AIs, thousands of AIs just producing your one. It’s going to be so good.

Robert Wiblin: That’s the dream. Thanks so much for taking the time to come on the podcast, Paul. And also just in general, thanks so much for all the work that you’re putting into trying to make the world a better place. Well, at least the future a better place.

Paul Christiano: Thanks again for having me and thanks for running the podcast.

Robert Wiblin: If you liked this episode can I suggest going back and listening to our two previous episodes on AI technical research:

Number 3 – Dr Dario Amodei on OpenAI and how AI will change the world for good and ill and number 23 – How to actually become an AI alignment researcher, according to Dr Jan Leike.

Then you can listen to our two episodes on AI policy and strategy:

Number 31 – Prof Dafoe on defusing the political & economic risks posed by existing AI capabilities and number 1 – Miles Brundage on the world’s desperate need for AI strategists and policy experts.

If that’s not enough for you, number 21 – Holden Karnofsky on times philanthropy transformed the world & Open Phil’s plan to do the same, which has a significant section on Open Philanthropy’s plans to shape transformative AI in a positive direction.

And again, if you know someone who would be curious about these topics, or already works adjacent to them, please do let them know about this show. That’s how we find our most avid listeners and make the world better through this show.

The 80,000 Hours Podcast is produced by Keiran Harris.

Thanks for joining, talk to you in a week or two!

Learn more

Research into risks from artificial intelligence

Career review: Machine Learning PhD

Positively shaping the development of artificial intelligence

The case for building expertise to work on US AI policy, and how to do it

Guide to working in AI policy and strategy





