Ben Garfinkel - Researcher at Future of Humanity Institute


On Deference and Yudkowsky's AI Risk Estimates

I really appreciate the time people have taken to engage with this post (and actually hope the attention cost hasn’t been too significant). I decided to write some post-discussion reflections on what I think this post got right and wrong.

The reflections became unreasonably long - and almost certainly should be edited down - but I’m posting them here in a hopefully skim-friendly format. They cover what I see as some mistakes with the post, first, and then cover some views I stand by.

Things I would do differently in a second version of the post:

1. I would either drop the overall claim about how much people should defer to Yudkowsky — or defend it more explicitly

At the start of the post, I highlight the two obvious reasons to give Yudkowsky's risk estimates a lot of weight: (a) he's probably thought more about the topic than anyone else and (b) he developed many of the initial AI risk arguments. I acknowledge that many people, justifiably, treat these as important factors when (explicitly or implicitly) deciding how much to defer to Yudkowsky.

Then the post gives some evidence that, at each stage of his career, Yudkowsky has made a dramatic, seemingly overconfident prediction about technological timelines and risks - and at least hasn’t obviously internalised lessons from these apparent mistakes.

The post expresses my view that these two considerations at least counterbalance each other - so that, overall, Yudkowsky's risk estimates shouldn't be given more weight than (e.g.) those of other established alignment researchers or the typical person on the OpenPhil worldview investigation team.

But I don't do a lot in the post to actually explore how we should weigh these factors up. In that sense: I think it’d be fair to regard the post’s central thesis as importantly under-supported by the arguments contained in the post.

I should have either done more to explicitly defend my view or simply framed the post as "some evidence about the reliability of Yudkowsky's risk estimates."

2. I would be clearer about how and why I generated these examples

In hindsight, this is a significant oversight on my part. The process by which I generated these examples is definitely relevant for judging how representative they are - and, therefore, how much to update on them. But I don’t say anything about this in the post. My motives (or at least conscious motives) are also part of the story that I only discuss in pretty high-level terms, but seem like they might be relevant for forming judgments.

For context, then, here was the process:

A few years ago, I tried to get a clearer sense of the intellectual history of the AI risk and existential risk communities. For that reason, I read a bunch of old white papers, blog posts, and mailing list discussions.

These gave me the impression that Yudkowsky’s track record (and - to some extent - the track record of the surrounding community) was worse than I’d realised. From reading old material, I basically formed something like this impression: “At each stage of Yudkowsky’s professional life, his work seems to have been guided by some dramatic and confident belief about technological trajectories and risks. The older beliefs have turned out to be wrong. And the ones that haven’t yet resolved at least seem to have been pretty overconfident in hindsight.”

I kept encountering the idea that Yudkowsky has an exceptionally good track record or that he has an unparalleled ability to think well about AI (he’s also expressed this view himself) - and I kept thinking, basically, that this seemed wrong. I wrote up some initial notes on this discrepancy at some point, but didn’t do anything with them.

I eventually decided to write something public after the “Death with Dignity” post, since the view it expresses (that we’re all virtually certain to die soon) both seems wrong to me and very damaging if it’s actually widely adopted in the community. I also felt like the “Death with Dignity” post was getting more play than it should, simply because people have a strong tendency to give Yudkowsky’s views weight. I can’t imagine a similar post written by someone else having nearly as large of an impact. Notably, since that post didn’t really have substantial arguments in it (although the later one did), I think the fact it had an impact is seemingly a testament to the power of deference; I think it’d be hard to look at the reaction to that post and argue that it’s only Yudkowsky’s arguments (rather than his public beliefs in-and-of-themselves) that have a major impact on the community.

People are obviously pretty aware of Yudkowsky’s positive contributions, but my impression is that (especially) new community members tended not to be aware of negative aspects of his track record. So I wanted to write a post drawing attention to the negative aspects.

I was initially going to have the piece explicitly express the impression I’d formed, which was something like: “At each stage of Yudkowsky’s professional life, his work has been guided by some dramatic and seemingly overconfident belief about technological trajectories and risks.” The examples in the post were meant to map onto the main ‘animating predictions’ about technology he had at each stage of his career. I picked out the examples that immediately came to mind.

Then I realised I wasn’t at all sure I could defend the claim that these were his main ‘animating predictions’ - the category was obviously extremely vague, and the main examples that came to mind were extremely plausibly a biased sample. I thought there was a good chance that if I reflected more, then I’d also want to include various examples that were more positive.

I didn’t want to spend the time doing a thorough accounting exercise, though, so I decided to drop any claim that the examples were representative and just describe them as “cherry-picked” — and add in lots of caveats emphasising that they’re cherry-picked.

(At least, these were my conscious thought processes and motivations as I remember them. I’m sure other factors played a role!)

3. I’d tweak my discussion of take-off speeds

I’d make it clearer that my main claim is: it would have been unreasonable to assign a very high credence to fast take-offs back in (e.g.) the early- or mid-2000s, since the arguments for fast take-offs had significant gaps. For example, there were lots of possible countervailing arguments for slow take-offs that pro-fast-take-off authors simply hadn’t addressed yet - as evidenced, partly, by the later publication of slow-take-off arguments leading a number of people to become significantly more sympathetic to slow take-offs. (I’m not claiming that there’s currently a consensus against fast-take-off views.)

4. I’d add further caveats to the “coherence arguments” case - or simply leave it out

Rohin’s and Oli’s comments under the post have made me aware that there’s a more positive way to interpret Yudkowsky’s use of coherence arguments. I’m not sure if that interpretation is correct, or if it would actually totally undermine the example, but this is at minimum something I hadn’t reflected on. I think it’s totally possible that further reflection would lead me to simply remove the example.

Positions I stand by:

On the flipside, here’s a set of points I still stand by:

1. If a lot of people in the community believe AI is probably going to kill everyone soon, then (if they’re wrong) this can have really important negative effects

In terms of prioritisation: My prediction is that if you were to ask different funders, career advisors, and people making career decisions (e.g. deciding whether to go into AI policy or bio policy) how much they value having a good estimate of AI risk, they’ll very often answer that they value it a great deal. I do think that over-estimating the level of risk could lead to concretely worse decisions.

In terms of community health: I think that believing you’re probably going to die soon is probably bad for a large portion of people. Reputationally: Being perceived as believing that everyone is probably going to die soon (particularly if this is actually an excessive level of worry) also seems damaging.

I think we should also take seriously the tail-risk that at least one person with doomy views (even if they’re not directly connected to the existential risk community) will take dramatic and badly harmful actions on the basis of their views.

2. Directly and indirectly, deference to Yudkowsky has a significant influence on a lot of people’s views

As above: One piece of evidence for this is that Yudkowsky’s “Death with Dignity” post triggered a big reaction, even though it didn’t contain any significant new arguments. I think his beliefs (above and beyond his arguments) clearly do have an impact.

Another reason to believe deference is a factor: I think it’s both natural and rational for people, particularly people new to an area, to defer to people with more expertise in that area.[1] Yudkowsky is one of the most obvious people to defer to, as one of the two people most responsible for developing and popularising AI risk arguments and as someone who has (likely) spent more time thinking about the subject than anyone else.

Beyond that: A lot of people also clearly have a huge amount of respect for Yudkowsky in general, sometimes more than they have for any other public intellectual. I think it’s natural (and sensible) for people’s views to be influenced by the views of the people they respect. In general, I think, unless you have tremendous self-control, this will tend to happen sub-consciously even if you don’t consciously choose to defer to the people you respect.

Also, people sometimes just do talk about Yudkowsky’s track record or reputation as a contributing factor to their views.

3. The track records of influential intellectuals (including Yudkowsky) should be publicly discussed.

A person’s track-record provides evidence about how reliable their predictions are. If people are considering how much to defer to some intellectual, then they should want to know what their track record (at least within the relevant domain) looks like.

The main questions that matter are: What has the intellectual gotten wrong and right? Beyond whether they were wrong or right, about a given case, does it also seem like their predictions were justified? If they’ve made certain kinds of mistakes in the past, do we now have reason to think they won’t repeat those kinds of mistakes?

4. Yudkowsky’s track record suggests a substantial bias toward dramatic and overconfident predictions.

One counter - which I definitely think it’s worth reflecting on - is that it might be possible to generate a similarly bias-suggesting list of examples like this for any other public intellectual or member of the existential risk community.

I’ll focus on one specific comment, suggesting that Yudkowsky’s incorrect predictions about nanotechnology are in the same reference class as ‘writing a typically dumb high school essay.’ The counter goes something like this: Yes, it was possible to find this example from Yudkowsky’s past - but that’s not importantly different than being able to turn up anyone else’s dumb high school essay about (e.g.) nuclear power.

Ultimately, I don’t buy the comparison. I think it’s really out-of-distribution for someone in their late teens and early twenties to pro-actively form the view that an emerging technology is likely to kill everyone within a decade, found an organization and devote years of their professional life to address the risk, and talk about how they’re the only person alive who can stop it.

That just seems very different from writing a dumb high school essay. Much more than a standard dumb high school essay, I think this aspect of Yudkowsky’s track record really does suggest a bias toward dramatic and overconfident predictions. This prediction is also really strikingly analogous to the prediction Yudkowsky is making right now - its relevance is clearly higher than the relevance of (e.g.) a random poorly thought-out view in a high school essay.

(Yudkowsky's early writing and work is also impressive, in certain ways, insofar as it suggests a much higher level of originality of thought and agency than the typical young person has. But the fact that this example is impressive doesn’t undercut, I think, the claim that it’s also highly suggestive of a bias toward highly confident and dramatic predictions.)

5. Being one of the first people to identify, develop, or take seriously some idea doesn’t necessarily mean that your predictions about the idea will be unusually reliable

By analogy:

  • I don’t think we can assume that the first person to take the covid lab leak theory seriously (when others were dismissive) is currently the most reliable predictor of whether the theory is true.

  • I don’t think we can assume that the first person to develop the many worlds theory of quantum mechanics (when others were dismissive) would currently be the best person to predict whether the theory is true, if they were still alive.

There are, certainly, reasons to give pioneers in a domain special weight when weighing expert opinion in that domain.[2] But these reasons aren’t absolute.

There are even reasons that point in the opposite direction: we might worry that the pioneer has an attachment to their theory, so will be biased toward believing it is true and as important as possible. We might also worry that the pioneering-ness of their beliefs is evidence that these beliefs front-ran the evidence and arguments (since one way to be early is to simply be excessively confident). We also have less evidence of their open-mindedness than we do for the people who later on moved toward the pioneer’s views - since moving toward the pioneer’s views, when you were initially dismissive, is at least a bit of evidence for open-mindedness and humility.[3]

Overall, I do think we should tend to defer more to pioneers (all else being equal). But this tendency can definitely be overruled by other evidence and considerations.

6. The causal effects that people have had on the world don’t (in themselves) have implications for how much we should defer to them

At least in expectation, so far, Eliezer Yudkowsky has probably had a very positive impact on the world. There is a plausible case to be made that misaligned AI poses a substantial existential risk - and Yudkowsky’s work has probably, on net, massively increased the number of people thinking about it and taking it seriously. He’s also written essays that have exposed huge numbers of people to other important ideas and helped them to think more clearly. It makes sense for people to applaud all of this.

Still, I don’t think his positive causal effect on the world gives people much additional reason to be deferential to him.

Here’s a dumb thought experiment: Suppose that Yudkowsky wrote all of the same things, but never published them. But suppose, also, that a freak magnetic storm ended up implanting all of the same ideas in his would-be-readers’ brains. Would this absence of a causal effect count against deferring to Yudkowsky? I don’t think so. The only thing that ultimately matters, I think, is his track record of beliefs - and the evidence we currently have about how accurate or justified those beliefs were.

I’m not sure anyone disagrees with the above point, but I did notice there seemed to be a decent amount of discussion in the comments about Yudkowsky’s impact - and I’m not sure I think this issue will ultimately be relevant.[4]

  1. For example: if I had ten hours to form a view about the viability of some application of nanotechnology, I definitely wouldn’t want to ignore the beliefs of people who have already thought about the question. Trying to learn the relevant chemistry and engineering background wouldn’t be a good use of my time. ↩︎

  2. One really basic reason is that they’ve simply had more time to think about certain subjects than anyone else. ↩︎

  3. Here’s a concrete case: Holden Karnofsky eventually moved toward taking AI risks seriously, after publicly being fairly dismissive of it, and then wrote up a document analysing why he was initially dismissive and drawing lessons from the experience. It seems like we could count that as positive evidence about his future judgment. ↩︎

  4. Even though I’ve just said I’m not sure this question is relevant, I do also want to say a little bit about Yudkowsky’s impact. I personally think he's probably had a very significant impact. Nonetheless, I also think the impact can be overstated. For example, I think, it’s been suggested that the effective altruism community might not be very familiar with concepts like Bayesianism or the importance of overcoming bias if it weren’t for Yudkowsky’s writing. I don’t really find that particular suggestion plausible.

    Here’s one data point I can offer from my own life: Through a mixture of college classes and other reading, I’m pretty confident I had already encountered the heuristics and biases literature, Bayes’ theorem, Bayesian epistemology, the ethos of working to overcome bias, arguments for the many worlds interpretation, the expected utility framework, population ethics, and a number of other ‘rationalist-associated’ ideas before I engaged with the effective altruism or rationalist communities. For example, my college had classes in probability theory, Bayesian epistemology, and the philosophy of quantum mechanics, and I’d read at least parts of books like Thinking Fast and Slow, the Signal and the Noise, the Logic of Science, and various books associated with the “skeptic community.” (Admittedly, I think it would have been harder to learn some of these things if I’d gone to college a bit earlier or had a different major. I also probably "got lucky" in various ways with the classes I took and books I picked up.) See also Carl Shulman making a similar point and John Halstead also briefly commenting on the way in which he personally encountered some of the relevant ideas. ↩︎

On Deference and Yudkowsky's AI Risk Estimates

A general reflection: I wonder if one at least minor contributing factor to disagreement, around whether this post is worthwhile, is different understandings about who the relevant audience is.

I mostly have in mind people who have read and engaged a little bit with AI risk debates, but not yet in a very deep way, and would overall be disinclined to form strong independent views on the basis of (e.g.) simply reading Yudkowsky's and Christiano's most recent posts. I think the info I've included in this post could be pretty relevant to these people, since in practice they're often going to rely a lot -- consciously or unconsciously; directly or indirectly -- on cues about how much weight to give different prominent figures' views. I also think that the majority of members of the existential risk community are in this reference class.

I think the info in this post isn't nearly as relevant to people who've consumed and reflected on the relevant debates very deeply. The more you've engaged with and reflected on an issue, the less you should be inclined to defer -- and therefore the less relevant track records become.

(The limited target audience might be something I don't do a good enough job communicating in the post.)

On Deference and Yudkowsky's AI Risk Estimates

The part of this post which seems most wild to me is the leap from "mixed track record" to

In particular, I think, they shouldn’t defer to him more than they would defer to anyone else who seems smart and has spent a reasonable amount of time thinking about AI risk.

For any reasonable interpretation of this sentence, it's transparently false. Yudkowsky has proven to be one of the best few thinkers in the world on a very difficult topic. Insofar as there are others who you couldn't write a similar "mixed track record" post about, it's almost entirely because they don't have a track record of making any big claims, in large part because they weren't able to generate the relevant early insights themselves. Breaking ground in novel domains is very, very different from forecasting the weather or events next year; a mixed track record is the price of entry.

I disagree that the sentence is false for the interpretation I have in mind.

I think it's really important to separate out the question "Is Yudkowsky an unusually innovative thinker?" and the question "Is Yudkowsky someone whose credences you should give an unusual amount of weight to?"

I read your comment as arguing for the former, which I don't disagree with. But that doesn't mean that people should currently weigh his risk estimates more highly than they weigh the estimates of other researchers currently in the space (like you).

I also think that there's a good case to be made that Yudkowsky tends to be overconfident, and this should be taken into account when deferring; but when it comes to making big-picture forecasts, the main value of deference is in helping us decide which ideas and arguments to take seriously, rather than the specific credences we should place on them, since the space of ideas is so large.

But we do also need to try to have well-calibrated credences, of course. For the reason given in the post, it's important to know whether the risk of everyone dying soon is 5% or 99%. It's not enough just to determine whether we should take AI risk seriously.

We're also now past the point, as a community, where "Should AI risk be taken seriously?" is that much of a live question. The main epistemic question that matters is what probability we assign to it - and I think this post is relevant to that.

(More generally, rather than reading this post, I recommend people read this one by Paul Christiano, which outlines specific agreements and disagreements.)

I definitely recommend people read the post Paul just wrote! I think it's overall more useful than this one.

But I don't think there's an either-or here. People - particularly non-experts in a domain - do and should form their views through a mixture of engaging with arguments and deferring to others. So both arguments and track records should be discussed.

The EA community has ended up strongly moving in Yudkowsky's direction over the last decade, and that seems like much more compelling evidence than anything listed in this post.

I discuss this in response to another comment, here, but I'm not convinced of that point.

On Deference and Yudkowsky's AI Risk Estimates

If someone visibly learns from forecasting mistakes they make, that should clearly update us positively on them not repeating the same mistakes.

I suppose one of my main questions is whether he has visibly learned from the mistakes, in this case.

For example, I wasn't able to find a post or comment to the effect of "When I was younger, I spent years of my life motivated by the belief that near-term extinction from nanotech was looming. I turned out to be wrong. Here's what I learned from that experience and how I've applied it to my forecasts of near-term existential risk from AI." Or a post or comment acknowledging his previous over-optimistic AI timelines and what he learned from them, when formulating his current seemingly short AI timelines.

(I genuinely could be missing these, since he has so much public writing.)

On Deference and Yudkowsky's AI Risk Estimates

While he's not single-handedly responsible, he led the movement to take AI risk seriously at a time when approximately no one was talking about it, which has now attracted the interest of top academics. This isn't a complete track record, but it's still a very important data-point.

I definitely do agree with that!

It's possible I should have emphasized the significance of it more in the post, rather than moving on after just a quick mention at the top.

If it's of interest: I say a little more about how I think about this, in response to Gwern's comment below. (To avoid thread-duplicating, people might want to respond there rather than here if they have follow-on thoughts on this point.) My further comment is:

This is certainly a positive aspect of his track-record - that many people have now moved closer to his views. (It also suggests that his writing was, in expectation, a major positive contribution to the project of existential risk reduction - insofar as this writing has helped move people up and we assume this was the right direction to move.) But it doesn't imply that we should give him many more "Bayes points" than we give to the people who moved.

Suppose, for example, that someone said in 2020 that there was a 50% chance of full-scale nuclear war in the next five years. Then - due to Russia's invasion of Ukraine - most people moved their credences upward (although they still remained closer to 0% than 50%). Does that imply the person giving the early warning was better-calibrated than the people who moved their estimates up? I don't think so. And I think - in this nuclear case - some analysis can be used to justify the view that the person giving the early warning was probably overconfident; they probably didn't have enough evidence or good enough arguments to actually justify a 50% credence.

It may still be the case that the person giving the early warning (in the hypothetical nuclear case) had some valuable and neglected insights, missed by others, that are well worth paying attention to and seriously reflecting on; but that's a different matter from believing they were overall well-calibrated or should be deferred to much more than the people who moved.
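To make the "Bayes points" intuition a bit more concrete, here is a minimal sketch using the logarithmic scoring rule, which is one common way of scoring probabilistic forecasts (choosing it here is my assumption, not something the argument depends on). All of the credences below are hypothetical numbers picked purely for illustration: 50% for the early warner, and 2% moving to 10% for everyone else.

```python
import math


def expected_log_score(forecast: float, true_prob: float) -> float:
    """Expected logarithmic score of `forecast` if the event really has
    probability `true_prob`. Higher (closer to zero) is better."""
    return (true_prob * math.log(forecast)
            + (1.0 - true_prob) * math.log(1.0 - forecast))


# Hypothetical credences, chosen only for illustration (not real data).
FORECASTS = {
    "early warner (50%)": 0.50,
    "crowd before (2%)": 0.02,
    "crowd after (10%)": 0.10,
}

if __name__ == "__main__":
    # Compare the forecasters under a few assumed "true" probabilities.
    for true_prob in (0.05, 0.10, 0.50):
        scores = {name: expected_log_score(p, true_prob)
                  for name, p in FORECASTS.items()}
        best = max(scores, key=scores.get)
        print(f"assumed true probability {true_prob:.2f}: best expected score -> {best}")
```

Under these made-up numbers, the fact that the crowd moved toward the early warner doesn't by itself settle who earns more points. That depends on what the underlying probability actually was, which is the part that needs separate analysis.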

[[EDIT: Something else it might be worth emphasizing, here, is that I'm not arguing for the view "ignore Eliezer." It's closer to "don't give Eliezer's views outsized weight, compared to (e.g.) the views of the next dozen people you might be inclined to defer to, and factor in evidence that his risk estimates might have a significant upward bias to them."]]

On Deference and Yudkowsky's AI Risk Estimates


I interpreted Gwern as mostly highlighting that people have updated towards Yudkowsky's views - and using this as evidence in favor of the view we should defer a decent amount to Yudkowsky. I think that was a reasonable move.

There is also a causal question here ('Has Yudkowsky on-net increased levels of concern about AI risk relative to where they would otherwise be?'), but I didn't take the causal question to be central to the point Gwern was making. Although now I'm less sure.

I don't personally have strong views on the causal question - I haven't thought through the counterfactual.

On Deference and Yudkowsky's AI Risk Estimates

On 1 (the nanotech case):

I want to remind any reader that this is an opinion from 1999, when Eliezer was barely 20 years old.

I think your comment might give the misimpression that I don't discuss this fact in the post or explain why I include the case. What I write is:

I should, once again, emphasize that Yudkowsky was around twenty when he did the final updates on this essay. In that sense, it might be unfair to bring this very old example up.

Nonetheless, I do think this case can be treated as informative, since: the belief was so analogous to his current belief about AI (a high outlier credence in near-term doom from an emerging technology), since he had thought a lot about the subject and was already highly engaged in the relevant intellectual community, since it's not clear when he dropped the belief, and since twenty isn't (in my view) actually all that young. I do know a lot of people in their early twenties; I think their current work and styles of thought are likely to be predictive of their work and styles of thought in the future, even though I do of course expect the quality to go up over time....

An additional reason why I think it's worth distinguishing between his views on nanotech and (e.g.) your views on nuclear power: I think there's a difference between an off-hand view picked up from other people vs. a fairly idiosyncratic view that you consciously adopted after a lot of reflection and that you decide to devote your professional life to and found an organization to address.

It's definitely up to the reader to decide how relevant the nanotech case is. Since it's not widely known, it seems at least pretty plausibly relevant, and the post twice flags his age at the time, I do still endorse including it.

At face value, as well: we're trying to assess how much weight to give to someone's extreme, outlier-ish prediction that an emerging technology is almost certain to kill everyone very soon. It just does seem very relevant, to me, that they previously had a different extreme outlier-ish prediction that another emerging technology was very likely to kill everyone within a decade.

I don't find it plausible that we should assign basically no significance to this.

On 6 (the question of whether Yudkowsky has acknowledged negative aspects of his track record):

For the two "clear cut" examples, Eliezer has posted dozens of times on the internet that he has disendorsed his views from before 2002. This is present on his personal website, the relevant articles are no longer prominently linked anywhere, and Eliezer has openly and straightforwardly acknowledged that his predictions and beliefs from the relevant period were wrong.

Similarly, I think your comment may give the impression that I don't discuss this point in the post. What I write is this:

He has written about mistakes from early on in his intellectual life (particularly pre-2003) and has, on this basis, even made a blanket-statement disavowing his pre-2003 work. However, based on my memory and a quick re-read/re-skim, this writing is an exploration of why it took him a long time to become extremely concerned about existential risks from misaligned AI. For instance, the main issue it discusses with his plans to build AGI is that these plans didn't take into account the difficulty and importance of ensuring alignment. This writing isn't, I think, an exploration or acknowledgement of the kinds of mistakes I've listed in this post.

On the general point that this post uses old examples:

Given the sorts of predictions involved (forecasts about pathways to transformative technologies), old examples are generally going to be more unambiguous than new examples. Similarly for risk arguments: it's hard to have a sense of how new arguments are going to hold up. It's only for older arguments that we can start to approach the ability to say that technological progress, progress in arguments, and evolving community opinion say something clear-ish about how strong the arguments were.

On signposting:

I also dislike calling this post "On Deference and Yudkowsky's AI Risk Estimates", as if this post was trying to be an unbiased analysis of how much to defer to Eliezer, while you just list negative examples. I think this post is better named "against Yudkowsky on AI Risk estimates". Or "against Yudkowsky's track record in AI Risk Estimates". Which would have made it clear that you are selectively giving evidence for one side, and more clearly signposted that if someone was trying to evaluate Eliezer's track record, this post will only be a highly incomplete starting point.

I think it's possible another title would have been better (I chose a purposely bland one partly for the purpose of trying to reduce heat - and that might have been a mistake). But I do think I signpost what the post is doing fairly clearly.

The introduction says it's focusing on "negative aspects" of Yudkowsky's track record, the section heading for the section introducing the examples describes them as "cherry-picked," and the start of the section introducing the examples has an italicized paragraph re-emphasizing that the examples are selective and commenting on the significance of this selectiveness.

On the role of the fast take-off assumption in classic arguments:

I think the arguments are pretty tight and sufficient to establish the basic risk argument. I found your critique relatively uncompelling. In particular, I think you are misrepresenting that a premise of the original arguments was a fast takeoff.

I disagree with this. I do think it's fair to say that fast take-off was typically a premise of the classic arguments.

Two examples I have off-hand (since they're in the slides from my talk) are from Yudkowsky's exchange with Caplan and from Superintelligence. Superintelligence isn't by Yudkowsky, of course, but hopefully is still meaningful to include (insofar as Superintelligence heavily drew on Yudkowsky's work and was often accepted as a kind of distillation of the best arguments as they existed at the time).

From Yudkowsky's debate with Caplan (2016):

“I’d ask which of the following statements Bryan Caplan [a critic of AI risk arguments] denies:

  1. Orthogonality thesis: Intelligence can be directed toward any compact goal….

  2. Instrumental convergence: An AI doesn’t need to specifically hate you to hurt you; a paperclip maximizer doesn’t hate you but you’re made out of atoms that it can use to make paperclips, so leaving you alive represents an opportunity cost and a number of foregone paperclips….

  3. Rapid capability gain and large capability differences: Under scenarios seeming more plausible than not, there’s the possibility of AIs gaining in capability very rapidly, achieving large absolute differences of capability, or some mixture of the two….

  4. 1-3 in combination imply that Unfriendly AI is a critical problem-to-be-solved, because AGI is not automatically nice, by default does things we regard as harmful, and will have avenues leading up to great intelligence and power.”

(Caveat that the fast-take-off premise is stated a bit ambiguously here, so it's not clear what level of rapidness is being assumed.)

From Superintelligence:

“Taken together, these three points [decisive strategic advantage, orthogonality, and instrumental convergence] thus indicate that the first superintelligence may shape the future of Earth-originating life, could easily have non-anthropomorphic final goals, and would likely have instrumental reasons to pursue open-ended resource acquisition. If we now reflect that human beings consist of useful resources (such as conveniently located atoms) and that we depend for our survival and flourishing on many more local resources, we can see that the outcome could easily be one in which humanity quickly becomes extinct.”

The decisive strategic advantage point is justified through a discussion of the possibility of a fast take-off. The first chapter of the book also starts by introducing the possibility of an intelligence explosion. It then devotes two chapters to the possibility of a fast take-off and the idea this might imply a decisive strategic advantage, before it gets to discussing things like the orthogonality thesis.

I think it's also relevant that content from MIRI and people associated with MIRI, raising the possibility of extinction from AI, tended to very strongly emphasize (e.g. spend most of its time on) the possibility of a run-away intelligence explosion. The most developed classic pieces arguing for AI risk often have names like "Shaping the Intelligence Explosion," "Intelligence Explosion: Evidence and Import," "Intelligence Explosion Microeconomics," and "Facing the Intelligence Explosion."

Overall, then, I do think it's fair to consider a fast-takeoff to be a core premise of the classic arguments. It wasn't incidental or a secondary consideration.

[[Note: I've edited my comment, here, to respond to additional points. Although there are still some I haven't responded to yet.]]

On Deference and Yudkowsky's AI Risk Estimates

No, it's just as I said, and your Karnofsky retrospective strongly supports what I said.

I also agree that Karnofsky's retrospective supports Gwern's analysis, rather than doing the opposite.

(I just disagree about how strongly it counts in favor of deference to Yudkowsky. For example, I don't think this case implies we should currently defer more to Yudkowsky's risk estimates than we do to Karnofsky's.)

On Deference and Yudkowsky's AI Risk Estimates

Thanks for the comment! A lot of this is useful.

calling LOGI and related articles 'wrong' because that's not how DL looks right now is itself wrong. Yudkowsky has never said that DL or evolutionary approaches couldn't work, or that all future AI work would look like the Bayesian program and logical approach he favored;

I mainly have the impression that LOGI and related articles were probably "wrong" because, so far as I've seen, nothing significant has been built on top of them in the intervening decade-and-a-half (even though LOGI's successor was seemingly predicted to make it possible for a small group to build AGI). There doesn't seem to be any sign that these articles were the start of a promising path to AGI that was simply slower than the deep learning path.

I have had the impression, though, that Yudkowsky also thought that logical/Bayesian approaches were in general more powerful/likely-to-enable-near-term-AGI (not just less safe) than DL. It's totally possible this is a misimpression - and I'd be inclined to trust your impression over mine, since you've read more of his old writing than I have. (I'd also be interested if you happen to have any links handy.) But I'm not sure this significantly undermines the relevance of the LOGI case.

I continue to be amazed anyone can look at the past decade of DL and think that Hanson is strongly vindicated by it, rather than Yudkowsky-esque views.

I also think that, in various ways, Hanson doesn't come off great. For example, he expresses a favorable attitude toward the CYC project, which now looks like a clear dead end. He is also overly bullish about the importance of having lots of different modules. So I mostly don't want to defend the view "Hanson had a great performance in the FOOM debate."

I do think, though, his abstract view that compute and content (i.e. data) are centrally important is closer to the mark than Yudkowsky's expressed view. I think it does seem hard to defend Yudkowsky's view that it's possible for a programming team (with mid-2000s levels of compute) to acquire some "deep new insights," go down into their basement, and then create an AI system that springboards itself into taking over the world. At least - I think it's fair to say - the arguments weren't strong enough to justify a lot of confidence in that view.

Yet, the number who take it seriously since Eliezer started advocating it is now far greater than it was when he started and was approximately the only person anywhere. You aren't taking seriously that these surveyed researchers ("AI Impacts, CHAI, CLR, CSER, CSET, FHI, FLI, GCRI, MILA, MIRI, Open Philanthropy and PAI") wouldn't exist without Eliezer as he created the AI safety field as we know it, with everyone else downstream (like Bostrom's influential Superintelligence - Eliezer with the serial numbers filed off and an Oxford logo added).

This is certainly a positive aspect of his track-record - that many people have now moved closer to his views. (It also suggests that his writing was, in expectation, a major positive contribution to the project of existential risk reduction - insofar as this writing has helped move people up and we assume this was the right direction to move.) But it doesn't imply that we should give him many more "Bayes points" than we give to the people who moved.

Suppose, for example, that someone says in 2020 that there is a 50% chance of full-scale nuclear war in the next five years. Then - due to Russia's invasion of Ukraine - most people move their credences upward (although they still remain closer to 0% than 50%). Does that imply the person giving the early warning was better-calibrated than the people who moved their estimates up? I don't think so. And I think - in this nuclear case - some analysis can be used to justify the view that the person giving the early warning was probably overconfident; they probably didn't have enough evidence or good enough arguments to actually justify a 50% credence.

It may still be the case that the person giving the early warning (in the hypothetical nuclear case) had some valuable and neglected insights, missed by others, that are well worth paying attention to and seriously reflecting on; but that's a different matter from believing they were overall well-calibrated or should be deferred to much more than the people who moved.

[[EDIT: Something else it might be worth emphasizing, here, is that I'm not arguing for the view "ignore Eliezer." It's closer to "don't give Eliezer's views outsized weight, compared to (e.g.) the views of the next dozen people you might be inclined to defer to, and factor in evidence that his risk estimates might have a significant upward bias to them."]]

On Deference and Yudkowsky's AI Risk Estimates

I do not want an epistemic culture that finds it acceptable to challenge an individuals overall credibility in lieu of directly engaging with their arguments.

I think I roughly agree with you on this point, although I would guess I have at least a somewhat weaker version of your view. If discourse about people's track records or reliability starts taking up (e.g.) more than a fifth of the space that object-level argument does, within the most engaged core of people, then I do think that will tend to suggest an unhealthy or at least not-very-intellectually-productive community.

One caveat: For less engaged people, I do actually think it can make sense to spend most of your time thinking about questions around deference. If I'm only going to spend ten hours thinking about nanotechnology risk, for example, then I might actually want to spend most of this time trying to get a sense of what different people believe and how much weight I should give their views; I'm probably not going to be able to make a ton of headway getting a good gears-level-understanding of the relevant issues, particularly as someone without a chemistry or engineering background.
