
It is increasingly clear that artificial intelligence is poised to have a huge impact on the world, potentially of comparable magnitude to the agricultural or industrial revolutions. But what does that actually mean for us today? Should it influence our behavior? In this talk from EA Global 2018: London, Ben Garfinkel makes the case for measured skepticism.

The Talk

Today, work on risks from artificial intelligence constitutes a noteworthy but still fairly small portion of the EA portfolio.

Only a small portion of donations made by individuals in the community are targeted at risks from AI. Only about 5% of the grants given out by the Open Philanthropy Project, the leading grant-making organization in the space, target risks from AI. And in surveys of community members, most do not list AI as the area that they think should be most prioritized.

At the same time though, work on AI is prominent in other ways. Leading career advising and community building organizations like 80,000 Hours and CEA often highlight careers in AI governance and safety as especially promising ways to make an impact with your career. Interest in AI is also a clear element of community culture. And lastly, I think there's also a sense of momentum around people's interest in AI. I think especially over the last couple of years, quite a few people have begun to consider career changes into the area, or made quite large changes in their careers. I think this is true more for work around AI than for most other cause areas.

So I think all of this together suggests that now is a pretty good time to take stock. It's a good time to look backwards and ask how the community first came to be interested in risks from AI. It's a good time to look forward and ask how large we expect the community's bet on AI to be: how large a portion of the portfolio we expect AI to be five or ten years down the road. It's a good time to ask, are the reasons that we first got interested in AI still valid? And if they're not still valid, are there perhaps other reasons which are either more or less compelling?

To give a brief talk roadmap, first I'm going to run through what I see as an intuitively appealing argument for focusing on AI. Then I'm going to say why this argument is a bit less forceful than you might anticipate. Then I'll discuss a few more concrete arguments for focusing on AI and highlight some missing pieces of those arguments. And then I'll close by giving concrete implications for cause prioritization.

The intuitive argument

So first, here's what I see as an intuitive argument for working on AI, which I'd call the "AI is a big deal" argument.

There are three concepts underpinning this argument:

  1. The future is what matters most in the sense that, if you could have an impact that carries forward and affects future generations, then this is likely to be more ethically pressing than having an impact that only affects the world today.
  2. Technological progress is likely to make the world very different in the future: that just as the world is very different than it was a thousand years ago because of technology, it's likely to be very different again a thousand years from now.
  3. If we're looking at technologies that are likely to make especially large changes, then AI stands out as especially promising among them.

So given these three premises, we have the conclusion that working on AI is a really good way to have leverage over the future, and that shaping the development of AI positively is an important thing to pursue.

I think that a lot of this argument works. I think there are compelling reasons to try and focus on your impact in the future. I think that it's very likely that the world will be very different in the far future. I also think it's very likely that AI will be one of the most transformative technologies. It seems at least physically possible to have machines that eventually can do all the things that humans can do, and perhaps do all these things much more capably. If this eventually happens, then whatever that world looks like, we can be pretty confident it will look pretty different than the world does today.

What I find less compelling though is the idea that these premises entail the conclusion that we ought to work on AI. Just because a technology will produce very large changes, that doesn't necessarily mean that working on that technology is a good way to actually have leverage over the future. Look back at the past and consider the most transformative technologies that have ever been developed. So things like electricity, or the steam engine, or the wheel, or steel. It's very difficult to imagine what individuals early in the development of these technologies could have done to have a lasting and foreseeably positive impact. An analogy is sometimes made to the industrial revolution and the agricultural revolution. The idea is that in the future, the impacts of AI may be substantial enough to bring changes comparable to these two revolutionary periods in history.

The issue here, though, is that it's not really clear that either of these periods actually were periods of especially high leverage. If you were, say, an Englishman in 1780, and trying to figure out how to make this industry thing go well in a way that would have a lasting and foreseeable impact on the world today, it's really not clear you could have done all that much. The basic point here is that from a long-termist perspective, what matters is leverage. This means finding something that could go one way or the other, and that's likely to stick in a foreseeably good or bad way far into the future. Long-term importance is perhaps a necessary condition for leverage, but certainly not a sufficient one, and it's a sort of flawed indicator in its own right.

Three concrete cases

So now I'm going to move to three somewhat more concrete cases for potentially focusing on AI. You might have a few concerns that lead you to work in this area:

  1. Instability. You might think that there are certain dynamics around the development or use of AI systems that will increase the risk of permanently damaging conflict or collapse, for instance war between great powers.
  2. Lock-in. Certain decisions regarding the governance or design of AI systems may permanently lock in, in a way that propagates forward into the future in a lastingly positive or negative way.
  3. Accidents. It might be quite difficult to use future systems safely, and there may be accidents with more advanced systems that cause lasting harm that again carries forward into the future.


First, the case from instability. A lot of the thought here is that it's very likely that countries will compete to reap the benefits economically and militarily from the applications of AI. This is already happening to some extent. And you might think that as the applications become more significant, the competition will become greater. And in this context, you might think that this all increases the risk of war between great powers. So one concern here is that there may be power transitions: shifts in which countries are powerful relative to which other countries.

A lot of people in the field of international security think that these are conditions under which conflict becomes especially likely. You might also be concerned about changes in military technology that, for example, increase the odds of accidental escalation, or make offense more favorable compared to defense. You may also just be concerned that in periods of rapid technological change, there are greater odds of misperception or miscalculation as countries struggle to figure out how to use the technology appropriately or interpret the actions of their adversaries. Or you could be concerned that certain applications of AI will in some sense damage domestic institutions in a way that also increases instability. Rising unemployment or inequality might be quite damaging, for example. And lastly, you might be concerned about the risks from terrorism: certain applications might make it quite easy for small actors to cause large amounts of harm.

In general, I think that many of these concerns are plausible and very clearly important. Most of them have not received very much research attention at all. I believe that they warrant much, much more attention. At the same time though, if you're looking at things from a long-termist perspective, there are at least two reservations you could continue to have. The first is just that we don't really know how worried to be. These risks really haven't been researched much, and we shouldn't really take it for granted that AI will be destabilizing. It might be, or it might not be. We just basically have not done enough research to feel very confident one way or the other.

You may also be concerned, if you're really focused on the long term, that lots of instability may not be sufficient to actually have a lasting impact that carries forward through generations. This is a somewhat callous perspective. But if you really are focused on the long term, it's not clear, for example, that a mid-sized war by historical standards would be sufficient to have a big long-term impact. So it may actually be quite a high bar to reach a level of instability that a long-termist would really be focused on.


The case from lock-in I'll talk about just a bit more briefly. Some of the intuition here is that certain decisions have been made in the past about, for instance, the design of political institutions, software standards, or certain outcomes of military or economic competitions, which seem to produce outcomes that carry forward into the future for centuries. Some examples would be the design of the US Constitution, or the outcome of the Second World War. You might have the intuition that certain decisions about the governance or design of AI systems, or certain outcomes of strategic competitions, might carry forward into the future, perhaps for even longer periods of time. For this reason, you might try and focus on making sure that whatever locks in is something that we actually want.

I think that this is a somewhat difficult argument to make, or at least it's a fairly non-obvious one. I think the standard skeptical reply is that with very few exceptions, we don't really see many instances of long-term lock-in, especially long-term lock-in where people really could have predicted what would be good and what would be bad. Probably the most prominent examples of lock-in are choices around major religions that have carried forward for thousands of years. But beyond these, it's quite hard to find examples that last even for hundreds of years. Those seem quite few. It's also generally hard to judge what you would want to lock in. If you imagine fixing some aspect of the world while the rest of the world changes dramatically, it's really hard to guess what would actually be good under quite different circumstances in the future. My general feeling on this line of argument is that it's probably not that likely that we should expect any truly irreversible decisions around AI to be made anytime soon, even if progress is quite rapid, although other people certainly might disagree.


Last, we have the case from accidents. The idea here is that we know there are certain safety engineering challenges around AI systems. It's actually quite difficult to design systems that you can feel confident will behave the way you want them to in all circumstances. This has been laid out most clearly in the paper 'Concrete Problems in AI Safety,' from a couple of years ago, by Dario Amodei and others. I'd recommend that anyone interested in safety issues take a look at that paper. Then we might think, given the existence of these safety challenges, and given the belief or expectation that AI systems will become much more powerful in the future or be given much more responsibility, that these safety concerns will become more serious as time goes on.

At the limit, you might worry that these safety failures could become so extreme that they could perhaps derail civilization as a whole. In fact, there is a bit of writing arguing that we should be worried about these sorts of existential safety failures. The main work arguing for this is still the book 'Superintelligence' by Nick Bostrom, published in 2014. Before this, essays by Eliezer Yudkowsky were the main source of arguments along these lines. A number of other writers, such as Stuart Russell, David Chalmers, or, much earlier, I. J. Good, have also expressed similar concerns, albeit more briefly. The writing on existential safety accidents definitely isn't homogeneous, but a similar narrative often appears in these essays. There's a basic standard disaster scenario that has a few common elements.

First, the author imagines that a single AI system experiences a massive jump in capabilities. Over some short period of time, a single system becomes much more general or much more capable than any other system in existence, and in fact any human in existence. Then, given this system, researchers specify a goal for it. They give it some input which is meant to communicate what behavior it should engage in. The goal ends up being something quite simple, and the system goes off and single-handedly pursues this very simple goal in a way that violates the full nuances of what its designers intended.

There's a classic sort of toy example, which is often used to illustrate this concern. We imagine that some poor paperclip factory owner receives a general super-intelligent AI on his doorstep. There's a slot to stick in a goal. He writes down the goal "maximize paperclip production," puts it in the AI system, and then lets it go off and do that. The system figures out that the best way to maximize paperclip production is to take over all the world's resources, just to plow them all into paperclips. And the system is so capable that its designers can do nothing to stop it, even though it's doing something that they really do not intend.

I have some general concerns about the existing writing on existential accidents. So first there's just still very little of it. It really is just mostly Superintelligence and essays by Eliezer Yudkowsky, and then sort of a handful of shorter essays and talks that express very similar concerns. There's also been very little substantive written criticism of it. Many people have expressed doubts or been dismissive of it, but there's very little in the way of skeptical experts who are sitting down and fully engaging with it, and writing down point by point where they disagree or where they think the mistakes are. Most of the work on existential accidents was also written before large changes in the field of AI, especially before the recent rise of deep learning, and also before work like 'Concrete Problems in AI Safety,' which laid out safety concerns in a way which is more recognizable to AI researchers today.

Most of the arguments for existential accidents rely on these sorts of fuzzy, abstract concepts like optimization power or general intelligence or goals, and on toy thought experiments like the paperclip example. Thought experiments and abstract concepts certainly do have some force, but it's not clear exactly how strong a source of evidence we should take them to be. Then lastly, although some AI researchers have actually expressed concern about existential accidents, for example Stuart Russell, it does seem to be the case that many, and perhaps most, AI researchers who encounter at least abridged or summarized versions of these concerns tend to bounce off them or just find them not very plausible. I think we should take that seriously.

I also have some more concrete concerns about writing on existential accidents. You should certainly take these concerns with a grain of salt because I am not a technical researcher, although I have talked to technical researchers who have essentially similar or even the same concerns. The general concern I have is that these toy scenarios are quite difficult to map onto something that looks more recognizably plausible. So these scenarios often involve, again, massive jumps in the capabilities of a single system, but it's really not clear that we should expect such jumps or find them plausible. This is a wooly issue. I would recommend checking out writing by Katja Grace or Paul Christiano online, which lays out some concerns about the plausibility of massive jumps.

Another element of these narratives is, they often imagine some system which becomes quite generally capable and then is given a goal. In some sense, this is the reverse of the way machine learning research tends to look today. At least very loosely speaking, you tend to specify a goal or some means of providing feedback. You direct the behavior of a system and then allow it to become more capable over time, as opposed to the reverse. It's also the case that these toy examples stress the nuances of human preferences, with the idea being that because human preferences are so nuanced and so hard to state precisely, it should be quite difficult to get a machine that can understand how to obey them. But it's also the case in machine learning that we can train lots of systems to engage in behaviors that are actually quite nuanced and that we can't specify precisely. Recognizing faces from images is an example of this. So is flying a helicopter.

It's really not clear exactly why human preferences would be so uniquely hard for a machine to understand. So it's quite difficult to figure out how to map the toy examples onto something which looks more realistic.


Some general caveats on the concerns I've expressed. None of my concerns are meant to be decisive. I've found, for example, that many people working in the field of AI safety in fact list somewhat different concerns as explanations for why they believe the area is very important. I believe there are many more arguments that are shared only in private conversation, or that exist inside people's heads and are currently unpublished. I really can't speak exactly to how compelling these are. The main point I want to stress here is essentially that when it comes to the writing which has actually been published, and which is out there for analysis, I don't think it's necessarily that forceful, and at the very least it's not decisive.

So now I have some brief, practical implications, or thoughts on prioritization. You may think, from all the stuff I've just said, that I'm quite skeptical about AI safety or governance as areas to work in. In fact, I'm actually fairly optimistic. My reasoning here is that I really don't think that there are any slam-dunks for improving the future. I'm not aware of any single cause area that seems very, very promising from the perspective of offering high assurance of long-term impact. I think that the fact that there are at least plausible pathways for impact by working on AI safety and AI governance puts it head and shoulders above most areas you might choose to work in. And AI safety and AI governance also stand out for being pretty extraordinarily neglected.

Depending on how you count, there are probably fewer than a hundred people in the world working on technical safety issues or governance challenges with an eye towards very long-term impacts. And that's just truly, very surprisingly small. The overall point, though, is that the exact size of the bet that EA should make on artificial intelligence, that is, the share of the portfolio that AI should take up, will depend on the strength of the arguments for focusing on AI. And most of those arguments still just aren't very fleshed out yet.

I also have some broader epistemological concerns which connect to the concerns I've expressed. I think it's also possible that there are social factors relating to EA communities that might bias us to take an especially large interest in AI.

One thing is just that AI is especially interesting or fun to talk about, especially compared to other cause areas. It's an interesting, kind of contrarian answer to the question of what is most important to work on. It's surprising in certain ways. And it's also now the case that interest in AI is to some extent an element of community culture. People have an interest in it that goes beyond just the belief that it's an important area to work in. It definitely has a certain role in the conversations that people have casually, and in what people like to talk about. I think these factors wouldn't necessarily be that concerning, except that we also can't really count on external feedback to push us back if we drift a bit.

So first, it just seems to be empirically the case that skeptical AI researchers generally will not take the time to sit down and engage with all of the writing, and then explain carefully why they disagree with our concerns. So we can't really expect that much external feedback of that form. People who are skeptical or confused, but who are not AI researchers or just generally not experts, may be concerned about sounding ignorant or dumb if they push back, and they also won't be inclined to become experts. We should also expect generally very weak feedback loops. If you're trying to influence the very long-run future, it's hard to tell how well you're doing, just because the long-run future hasn't happened yet and won't happen for a while.

Generally, I think one thing to watch out for is justification drift. If we start to notice that the community's interest in AI stays constant, but the reasons given for focusing on it change over time, then this would be sort of a potential check engine light, or at least a sort of trigger to be especially self-conscious or self-critical, because that may be some indication of motivated reasoning going on.


I have just a handful of short takeaways. First, I think that not enough work has gone into analyzing the case for prioritizing AI. Existing published arguments are not decisive. There may be many other possible arguments out there, which could be much more convincing or much more decisive, but those just aren't out there yet, and there hasn't been much written criticizing the stuff that's out there.

For this reason, thinking about the case for prioritizing AI may be an especially high impact thing to do, because it may shape the EA portfolio for years into the future. And just generally, we need to be quite conscious of possible community biases. It's possible that certain social factors will lead us to drift in what we prioritize, that we really should not be allowing to influence us. And just in general, if we're going to be putting substantial resources into anything as a community, we need to be especially certain that we understand why we're doing this, and that we stay conscious that our reasons for getting interested in the first place continue to be good reasons. Thank you.


Question: What advice would you give to someone who wants to do the kind of research you are doing here, about the case for AI, as opposed to research on AI itself?

Ben: Something that I believe would be extremely valuable is just basically talking to lots of people who are concerned about AI and asking them precisely what reasons they find compelling. I've started to do this a little bit recently and it's actually been quite interesting that people seem to have pretty diverse reasons, and many of them are things that people want to write blog posts on, but just haven't done. So I think this is low-hanging fruit that would be quite valuable. Just talking to people who are concerned about AI, trying to understand exactly why they're concerned, and either writing up their ideas or helping them to do that. I think that would be very valuable and probably not that time-intensive either.

Question: Have you seen any of the justification drift that you alluded to? Can you pinpoint that happening in the community?

Ben: Yeah. I think that's certainly happening to some extent. Even for myself, I believe that's happened to some extent. When I initially became interested in AI, I was especially concerned about these existential accidents. I think I now place relatively greater weight on the case from instability as I described it. And that's certainly, you know, one possible example of justification drift. It may be the case that this was actually a sensible way to shift emphasis, but it would still be something of a warning sign. I've also spoken to technical researchers who used to be especially concerned about the idea of an intelligence explosion or recursive self-improvement: these very large jumps. I have now spoken to a number of people who are still quite concerned about existential accidents, but make arguments that don't hinge on there being one single massive jump in a single system.

Question: You made the analogy to the industrial revolution, and the 1780 Englishman who doesn't really have much ability to shape how the steam engine is going to be used. It seems intuitively quite right. The obvious counterpoint would be, well AI is a problem-solving machine. There's something kind of different about it. I mean, does that not feel compelling to you, the sort of inherent differentness of AI?

Ben: So I think probably the strongest intuition is that there will eventually be a point where we start turning more and more responsibility over to automated systems or machines, and that there might eventually come a point where humans have almost no control over what's happening whatsoever: we keep turning over more and more responsibility until machines are in some sense in control and you can't back out. And you might have some sort of irreversible juncture here. To some extent, I definitely share the intuition that, if you're looking over a very long time span, that is probably fairly plausible. The intuition I don't necessarily have is that this juncture is coming soon: unless things go quite wrong, or happen in somewhat surprising ways, I don't necessarily anticipate a really irreversible juncture anytime soon. If, let's say, it takes a thousand years for control to be handed off, then I am not that optimistic about people having that much control over what that handoff looks like by working on things today. But I certainly am not very confident.

Question: Are there any policies that you think a government should implement at this stage of the game, in light of the concerns around AI safety? And how would you allocate resources between existing issues and possible future risks?

Ben: Yeah, I am still quite hesitant, I think, to recommend very substantive policies that I think governments should be implementing today. I currently have a lot of agnosticism about what would be useful, and I think that most current existing issues that governments are making decisions on aren't necessarily that critical. I think there's lots of stuff that can be done that would be very valuable, like having stronger expertise or stronger lines of dialogue between the public and private sector, and things like this. But I would be hesitant at this point to recommend a very concrete policy that at least I'm confident would be good to implement right now.

Question: You mentioned the concept of a concrete, decisive argument. Do you see arguments for other cause areas that are somehow more concrete and decisive than those for AI, and what is the difference?

Ben: Yeah. So I guess I tried to allude to this a little bit, but I don't think that really any cause area has an especially decisive argument for being a great way to influence the future. There are some where I think you can put a somewhat clear lower bound on at least how likely working on them is to be useful. So for example, risk from nuclear war. It's fairly clear that it's at least plausible this could happen over the next century. You know, nuclear war has almost happened in the past, and the climate effects are speculative, but at least somewhat understood. And then there's this question of, if there were a nuclear war, how damaging would it be? Do people eventually come back from it? That's quite uncertain, but I think it'd be difficult to put the chance that people would come back from a nuclear war above 99%.

So, in that case you might have some sort of a clean lower bound on, let's say working on nuclear risk. Or, quite similarly, working on pandemics. And I think for AI it's difficult to have that sort of confident lower bound. I actually tend to think, I guess as I alluded to, that AI is probably or possibly still the most promising area based on my current credences, and its extreme neglectedness. But yeah, I don't think any cause area stands out as especially decisive as a great place to work.

Question: I'm currently a PhD student doing machine learning research, and I'm skeptical about the risk from AGI. How would you suggest that I contribute to the process of providing the feedback that you're identifying as a need?

Ben: Yeah, I mean I think just a combination of in-person conversations and then even simple blog posts can be quite helpful. There's still been surprisingly little in the way of, let's say, something written online that I would point someone to who wants the skeptical case. This is actually a big part of the reason I gave this talk, even though I consider myself not extremely well placed to give it, given that I am not a technical person. There's so little out there along these lines that there's low-hanging fruit, essentially.

Question: Prominent deep learning experts such as Yann LeCun and Andrew Ng do not seem to be worried about risks from superintelligence. Do you think that they have essentially the same view that you have, or are they coming at it from a different angle?

Ben: I'm not sure of their specific concerns. I know the classic thing Andrew Ng says is that he compares it to worrying about overpopulation on Mars, where the suggestion is that these risks, if they materialize, are just so far away that it's really premature to worry about them. So it seems to be sort of an argument from timeline considerations. I'm actually not quite sure what his view is in terms of, if we were, let's say, 50 years in the future, would he think that this is a really great area to work on? I'm really not quite sure.

I actually tend to think that the line of thinking that says, "Oh, this is so far away so we shouldn't work on it" just really isn't that compelling. It seems like we have a load of uncertainty about AI timelines. It seems like no one can be very confident about that. So yeah, it'd be hard to get the probability that interesting things will happen in the next 30 years or so under, let's say, one percent. So I'm not quite sure about the extent of his concerns, but if they're based on timelines, I actually don't find them that compelling.


I was confused by the headline. "Ben Garfinkel: How Sure are we about this AI Stuff?" would make it clear that it is not some kind of official statement from the CEA. Changing the author to EA Global, or even to the co-authorship of EA Global and Ben Garfinkel, would help as well.

+1, a friend of mine thought it was an official statement from CEA when he saw the headline, was thoroughly surprised and confused

Thanks for this suggestion, Misha. I've changed the headline to include Ben's name, and I'm reviewing our transcript-publishing process to see how we can be more clear in the future (e.g. by posting under authors' names if they have an EA Forum account, as we do when we crosspost from a user's blog).

An update: The previous name on this account was "Centre for Effective Altruism". Since the account was originally made for the purpose of posting transcripts from EA Global, I've renamed it to "EA Global Transcripts" to avert further confusion.

How is karma allocated for co-authored posts?

Currently, co-authorship only produces karma for the "lead author". The same is true on LessWrong, where most of the Forum's code comes from, and they're interested in changing that at some point (I submitted a GitHub request here), but it would require a more-than-trivial infrastructure change, so I don't know how highly they'll prioritize it.

I think this talk, as well as Ben's subsequent comments on the 80k podcast, serves as a good illustration of the importance of being clear, precise, and explicit when evaluating causes, especially those often supported by relatively vague analogies or arguments with unstated premises. I don't recall how my views about the seriousness of AI safety as a cause area changed in response to watching this, but I do remember feeling that I had a better understanding of the relevant considerations and that I was in a better position to make an informed assessment.

I don’t understand how superhuman AGI would lock in a set of values any more than human values are already locked in to humans’ brains.

I think the big disanalogy between AI and the Industrial and Agricultural revolutions is that there seems to be a serious chance that an AI accident will kill us all. (And moreover this isn't guaranteed; it's something we have leverage over, by doing safety research and influencing policy to discourage arms races and encourage more safety research.) I can't think of anything comparable for the IR or AR. Indeed, there are only two other cases in history of risk on that scale: Nuclear war and pandemics.

Thanks for this talk/post. It's a good example of the sort of self-skepticism that I think we should encourage.

FWIW, I think it's a mistake to construe the classic model of AI accident catastrophe as capability gain first, then goal acquisition. I say this because (a) I never interpreted it that way when reading the classic texts, and (b) it doesn't really make sense: the original texts are very clear that the massive jump in AI capability is supposed to come from recursive self-improvement, i.e. the AI helping to do AI research. So we already have some sort of goal-directed behavior (bracketing CAIS/ToolAI objections!) leading up to and including the point of arrival at superintelligence.

I would construe the little sci-fi stories about putting goals into goal slots as not being a prediction about the architecture of AI but rather illustrations of completely different points about e.g. orthogonality of value or the dangers of unaligned superintelligences.

At any rate, though, what does it matter whether the goal is put in after the capability growth, or before/during? Obviously, it matters, but it doesn't matter for purposes of evaluating the priority of AI safety work, since in both cases the potential for accidental catastrophe exists.

the original texts are very clear that the massive jump in AI capability is supposed to come from recursive self-improvement, i.e. the AI helping to do AI research

...because that AI research is useful for some other goal the AI has, such as maximizing paperclips. See the instrumental convergence thesis.

At any rate, though, what does it matter whether the goal is put in after the capability growth, or before/during? Obviously, it matters, but it doesn't matter for purposes of evaluating the priority of AI safety work, since in both cases the potential for accidental catastrophe exists.

The argument for doom by default seems to rest on a default misunderstanding of human values as the programmer attempts to communicate them to the AI. If capability growth comes before a goal is granted, it seems less likely that misunderstanding will occur.

The argument for doom by default seems to rest on a default misunderstanding of human values as the programmer attempts to communicate them to the AI.

I don't think this is correct. The argument rests on AIs having any values which aren't human values (e.g. maximising paperclips), not just misunderstood human values.

Maximising paperclips is a misunderstood human value. Some lazy factory owner says, gee, wouldn't it be great if I could get an AI to make my paperclips for me? Then he builds an AGI and asks it to make paperclips, and it turns everything into paperclips, its utility function failing to reflect its owner's true desire to also have a world.

If there is a flaw here, it's probably in thinking that AGI will get built as some sort of intermediate tool, and that it will be easy to rub the lamp and ask the genie to do something in easy-to-misunderstand natural language.

Presumably the programmer will make some effort to embed the right set of values in the AI. If this is an easy task, doom is probably not the default outcome.

AI pessimists have argued human values will be difficult to communicate due to their complexity. But as AI capabilities improve, AI systems get better at learning complex things.

Both the instrumental convergence thesis and the complexity of value thesis are key parts of the argument for AI pessimism as it's commonly presented. Are you claiming that they aren't actually necessary for the argument to be compelling? (If so, why were they included in the first place? This sounds a bit like justification drift.)

...because that AI research is useful for some other goal the AI has, such as maximizing paperclips. See the instrumental convergence thesis.

Yes, exactly.

The argument for doom by default seems to rest on a default misunderstanding of human values as the programmer attempts to communicate them to the AI. If capability growth comes before a goal is granted, it seems less likely that misunderstanding will occur.

Eh, I could see arguments that it would be less likely and arguments that it would be more likely. Argument that it is less likely: We can use the capabilities to do something like "Do what we mean," allowing us to state our goals imprecisely & survive. Argument that it is more likely: If we mess up, we immediately have an unaligned superintelligence on our hands. At least if the goals come before the capability growth, there is a period where we might be able to contain it and test it, since it isn't capable of escaping or concealing its intentions.

Hello from four years in the future! Just a random note on something you said:

Argument that it is less likely: We can use the capabilities to do something like "Do what we mean," allowing us to state our goals imprecisely & survive.

Anthropic is now doing exactly this with their Constitutional AI. They let the chatbot respond in some way, then ask it to "reformulate the text so that it is more ethical", and finally train it to output something closer to the latter than to the former.

Yep! I love when old threads get resurrected.

we don't really know how worried to be [about instability risk from AI]. These risks really haven't been researched much, and we shouldn't really take it for granted that AI will be destabilizing. It could be or it couldn't be. We just basically have not done enough research to feel very confident one way or the other.

This makes me worry about tractability. The problem of instability has been known for at least five years now, and we haven't made any progress?

A note on leverage: One clear difference between AI and non-leverage-able things like wheels or steam engines is that there are many different ways to build AI.

Someone who tried to create a triangular wheel wouldn't have gotten far, but it seems plausible that many different kinds of AI system could become very powerful, with no particular kind of system "guaranteed" to arise even if it happens to be the most effective kind; there are switching costs, market factors, and branding to consider. (I assume that switching between AI systems/paradigms for a project will be harder than switching between models of steam engine.)

This makes me think that it is possible, at least in principle, for our actions now to influence what future AI systems look like.
