Former AI safety research engineer, now PhD student in philosophy of ML at Cambridge. I'm originally from New Zealand but have lived in the UK for 6 years, where I did my undergrad and masters degrees (in Computer Science, Philosophy, and Machine Learning). Blog:



On Deference and Yudkowsky's AI Risk Estimates

Yepp, thanks for the clear rephrasing. My original arguments for this view were pretty messy because I didn't have it fully fleshed out in my mind before writing this comment thread, I just had a few underlying intuitions about ways I thought Ben was going wrong.

Upon further reflection I think I'd make two changes to your rephrasing.

First change: in your rephrasing, we assign people weights based on the quality of their beliefs, but then follow their recommended policies. But any given way of measuring the quality of beliefs (in terms of novelty, track record, etc) is only an imperfect proxy for quality of policies. For example, Kurzweil might very presciently predict that compute is the key driver of AI progress, but suppose (for the sake of argument) that the way he does so is by having a worldview in which everything is deterministic, individuals are powerless to affect the future, etc. Then you actually don't want to give many resources to Kurzweil's policies, because Kurzweil might have no idea which policies make any difference.

So I think I want to adjust the rephrasing to say: in principle we should assign people weights based on how well their past recommended policies for someone like you would have worked out, which you can estimate using things like their track record of predictions, novelty of ideas, etc. But notably, the quality of past recommended policies is often not very sensitive to credences! For example, if you think that there's a 50% chance of solving nanotech in a decade, or a 90% chance of solving nanotech in a decade, then you'll probably still recommend working on nanotech (or nanotech safety) either way.

Having said all that, since we only get one rollout, evaluating policies is very high variance. And so looking at other information like reasoning, predictions, credences, etc, helps you distinguish between "good" and "lucky". But fundamentally we should think of these as approximations to policy evaluation, at least if you're assuming that we mostly can't fully evaluate whether their reasons for holding their views are sound.

Second change: what about the case where we don't get to allocate resources, but we have to actually make a set of individual decisions? I think the theoretically correct move here is something like: let policies spend their weight on the domains which they think are most important, and then follow the policy which has spent most weight on that domain.

Some complications:

  • I say "domains" not "decisions" because you don't want to make a series of related decisions which are each decided by a different policy, that seems incoherent (especially if policies are reasoning adversarially about how to undermine each other's actions).
  • More generally, this procedure could in theory be sensitive to bargaining and negotiating dynamics between different policies, and also the structure of the voting system (e.g. which decisions are voted on first, etc). I think we can just resolve to ignore those and do fine, but in principle I expect it gets pretty funky.
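The "policies spend their weight on domains" procedure above can be sketched in code. This is a toy illustration with made-up worldview names and numbers, not anything from the thread: each worldview gets a deference weight, allocates it across domains, and each domain is then decided by whichever worldview spent the most there.

```python
# Toy sketch: worldviews spend their deference weight across domains;
# each domain is decided by whichever worldview spent most weight on it.

def allocate_decisions(worldviews):
    """worldviews: dict mapping name -> (weight, {domain: fraction_of_weight})."""
    winners = {}
    for name, (weight, spending) in worldviews.items():
        for domain, fraction in spending.items():
            spent = weight * fraction
            if domain not in winners or spent > winners[domain][1]:
                winners[domain] = (name, spent)
    return {domain: name for domain, (name, _) in winners.items()}

# Hypothetical example: worldview A cares mostly about AI, B mostly about bio.
worldviews = {
    "A": (0.3, {"ai": 0.9, "bio": 0.1}),
    "B": (0.7, {"ai": 0.2, "bio": 0.8}),
}
assignment = allocate_decisions(worldviews)
# A spends 0.27 on "ai" (vs B's 0.14), so the lower-weight worldview
# still gets its way on the domain it cares most about.
```

Note how this captures the point above: even a minority worldview controls the domains it prioritizes, rather than being outvoted everywhere.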

Lastly, two meta-level notes:

  • I feel like I've probably just reformulated some kind of reinforcement learning. Specifically the case where you have a fixed class of policies and no knowledge of how they relate to each other, so you can only learn how much to upweight each policy. And then the best policy is not actually in your hypothesis space, but you can learn a simple meta-policy of when to use each existing policy.
  • It's very ironic that in order to figure out how much to defer to Yudkowsky we need to invent a theory of idealised cooperative decision-making. Since he's probably the person whose thoughts on that I trust most, I guess we should meta-defer to him about what that will look like...
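The reinforcement-learning framing in the first bullet above (fixed policy class, learn only how much to upweight each policy) looks a lot like exponential-weights / Hedge-style updating. A minimal sketch, with made-up losses, assuming we can score each policy's recommendations after the fact:

```python
import math

# Sketch of learning weights over a fixed policy class (Hedge-style):
# follow policies in proportion to their weights, observe each policy's
# loss each round, and exponentially downweight the ones that did badly.

def hedge(losses_per_round, n_policies, eta=0.5):
    """losses_per_round: losses[t][i] = loss of policy i at round t."""
    weights = [1.0] * n_policies
    for losses in losses_per_round:
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights

# Policy 0 consistently incurs lower loss, so its weight grows over time.
final = hedge([[0.1, 0.9], [0.2, 0.8], [0.0, 1.0]], n_policies=2)
```

The "only one rollout" problem from earlier in the thread shows up here too: with a single round of (noisy) losses, the learned weights are a high-variance estimate of policy quality.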
On Deference and Yudkowsky's AI Risk Estimates

(One major thing is that I think you should be comparing between two actions, rather than evaluating an action by itself, which is why I compared to "all other alignment work".)

IMO the crux is that I disagree with both of these. Instead I think you should use each worldview to calculate a policy, and then generate some kind of compromise between those policies. My arguments above were aiming to establish that this strategy is not very sensitive to exactly how much you defer to Eliezer, because there just aren't very many good worldviews going around - hence why I assign maybe 15 or 20% (inside view) credence to his worldview (updated from 10% above after reflection). (I think my all-things-considered view is similar, actually, because deference to him cancels out against deference to all the people who think he's totally wrong.)

Again, the difference is in large part determined by whether you think you're in a low-dimensional space (here are our two actions, which one should we take?) versus a high-dimensional space (millions of actions available to us, how do we narrow it down?) In a high-dimensional space the tradeoffs between the best ways to generate utility according to Eliezer's worldview and the best ways to generate utility according to other worldviews become much smaller.

This seems like a crazy way to do cost-effectiveness analyses.

Like, if I were comparing deworming to GiveDirectly, would I be saying "well, the value of deworming is mainly determined by the likelihood that the pro-deworming people are right, which I estimate is 70% but you estimate is 50%, so there's only a 1.4x difference"? Something has clearly gone wrong here.

Within a worldview, you can assign EVs which are orders of magnitude different. But once you do worldview diversification, if a given worldview gets even 1% of my resources, then in some sense I'm acting like that worldview's favored interventions are in a comparable EV ballpark to all the other worldviews' favored interventions. That's a feature not a bug.

It also feels like this reasoning implies that no EA action can be > 10x more valuable than any other action that an EA critic thinks is good? Since you assign a 90% chance that the EA is right, and the critic thinks there's a 10% chance of that, so there's only a 9x gap? And then once you do all of your adjustments it's only 2x? Why do we even bother with cause prioritization under this worldview?

I don't have a fleshed out theory of how and when to defer, but I feel pretty confident that even our intuitive pretheoretic deference should not be this sort of thing, and should be the sort of thing that can have orders of magnitude of difference between actions.

An arbitrary critic typically gets well less than 0.1% of my deference weight on EA topics (otherwise it'd run out fast!) But also see above: because in high-dimensional spaces there are few tradeoffs between different worldviews' favored interventions, changing the weights on different worldviews doesn't typically lead to many OOM changes in how you're acting like you're assigning EVs.

Also, I tend to think of cause prio as trying to integrate multiple worldviews into a single coherent worldview. But with deference you intrinsically can't do that, because the whole point of deference is you don't fully understand their views.

There's lots of things you can do under Eliezer's worldview that add dignity points, like paying relevant people millions of dollars to spend a week really engaging with the arguments, or trying to get whole-brain emulation before AGI. My understanding is that he doesn't expect those sorts of things to happen.

What do you mean "he doesn't expect this sort of thing to happen"? I think I would just straightforwardly endorse doing a bunch of costly things like these that Eliezer's worldview thinks are our best shot, as long as they don't cause much harm according to other worldviews.

I don't see why you are not including "c) give significant deference weight to his actual worldview", which is what I'd be inclined to do if I didn't have significant AI expertise myself and so was trying to defer.

Because neither Ben nor myself was advocating for this.

On Deference and Yudkowsky's AI Risk Estimates

Thanks for writing this update. I think my number one takeaway here is something like: when writing a piece with the aim of changing community dynamics, it's important to be very clear about motivations and context. E.g. I think a version of the piece which said "I think people are overreacting to Death with Dignity, here are my specific models of where Yudkowsky tends to be overconfident, here are the reasons why I think people aren't taking those into account as much as they should" would have been much more useful and much less controversial than the current piece, which (as I interpret it) essentially pushes a general "take Yudkowsky less seriously" meme (and is thereby intrinsically political/statusy).

On Deference and Yudkowsky's AI Risk Estimates

Musing out loud: I don't know of any complete model of deference which doesn't run into weird issues, like the conclusion that you should never trust yourself. But suppose you have some kind of epistemic parliament where you give your own views some number of votes, and assign the rest of the votes to other people in proportion to how defer-worthy they seem. Then you need to make a bunch of decisions, and your epistemic parliament keeps voting on what will best achieve your (fixed) goals.

If you do naive question-by-question majority voting on each question simultaneously then you can end up with an arbitrarily incoherent policy - i.e. a set of decisions that's inconsistent with each other. And if you make the decisions in some order, with the constraint that they each have to be consistent with all prior decisions, then the ordering of the decisions can become arbitrarily important.
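The incoherence of question-by-question majority voting can be seen concretely in the classic three-member "discursive dilemma", where each member is individually coherent but the majority position is not:

```python
# Three parliament members vote on p, q, and the policy "act iff p and q".
# Each member holds a logically consistent position.
members = [
    {"p": True,  "q": True,  "p_and_q": True},
    {"p": True,  "q": False, "p_and_q": False},
    {"p": False, "q": True,  "p_and_q": False},
]

def majority(question):
    return sum(m[question] for m in members) > len(members) / 2

verdicts = {q: majority(q) for q in ["p", "q", "p_and_q"]}
# The majority accepts p (2 of 3) and accepts q (2 of 3), yet rejects
# "p and q" (1 of 3) - an incoherent overall set of decisions.
```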

Instead, you want your parliament to negotiate some more coherent joint policy to follow. And I expect that in this joint policy, each worldview gets its way on the questions that are most important to it, and cedes responsibility on the questions that are least important. So Eliezer's worldview doesn't end up reallocating all the biosecurity money, but it does get a share of curriculum time (at least for the most promising potential researchers). But in general how to conduct those negotiations is an unsolved problem (and pretty plausibly unsolvable).

On Deference and Yudkowsky's AI Risk Estimates

Yeah, I'm gonna ballpark guess he's around 95%? I think the problem is that he cites numbers like 99.99% when talking about the chance of doom "without miracles", which in his parlance means assuming that his claims are never overly pessimistic. Which seems like wildly bad epistemic practice. So then it goes down if you account for that, and then maybe it goes down even further if he adjusts for the possibility that other people are more correct than him overall (although I'm not sure that's a mental move he does at all, or would ever report on if he did).

On Deference and Yudkowsky's AI Risk Estimates

We both agree that you shouldn't defer to Eliezer's literal credences, because we both think he's systematically overconfident. The debate is between two responses to that:

a) Give him less deference weight than the cautious, sober, AI safety people who make few novel claims but are better-calibrated (which is what Ben advocates).

b) Try to adjust for his overconfidence and then give significant deference weight to a version of his worldview that isn't overconfident.

I say you should do the latter, because you should be deferring to coherent worldviews (which are rare) rather than deferring on a question-by-question basis. This becomes more and more true the more complex the decisions you have to make. Even for your (pretty simple) examples, the type of deference you seem to be advocating doesn't make much sense.

For instance:

should funders reallocate nearly all biosecurity money to AI?

It doesn't make sense to defer to Eliezer's estimate of the relative importance of AI without also accounting for his estimate of the relative tractability of funding AI, which I infer he thinks is very low.

Should there be an organization dedicated to solving Eliezer's health problems? What should its budget be?

I'm guessing that Eliezer thinks that with more energy and project management skills he could make a significant dent in x-risk (perhaps 10 percentage points), while thinking that the rest of the alignment field, even if fully funded, can't make a dent of more than 0.01 percentage points, suggesting that "improve Eliezer's health + project management skills" is 3 OOMs more important than "all other alignment work" (saying nothing about tractability, which I don't know enough to evaluate). Whereas I'd have it at, idk, 1-2 OOMs less important, for a difference of 4-5 OOMs.

Again, the problem is that you're deferring on a question-by-question basis, without considering the correlations between different questions - in this case, the likelihood that Eliezer is right, and the value of his work. (Also, the numbers seem pretty wild; maybe a bit uncharitable to ascribe to Eliezer the view that his research would be 3 OOM more valuable than the rest of the field combined? His tone is strong but I don't think he's ever made a claim that big.)

Here's an alternative calculation which takes into account that correlation. I claim that the value of this organization is mainly determined by the likelihood that Eliezer is correct about a few key claims which underlie his research agenda. Suppose he thinks that's 90% likely and I think that's 10% likely. Then if our choices are "defer entirely to Eliezer" or "defer entirely to Richard", there's a 9x difference in funding efficacy. In practice, though, the actual disagreement here is between "defer to Eliezer no more than a median AI safety researcher" and something like "assume Eliezer is, say, 2x overconfident and then give calibrated-Eliezer, say, 30%ish of your deference weight". If we assume for the sake of simplicity that every other AI safety researcher has my worldview, then the practical difference here is something like a 2x difference in this org's efficacy (0.1 vs 0.3×(0.9×0.5) + 0.7×0.1 ≈ 0.2). Which is pretty low!
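The arithmetic in that comparison can be checked directly, using exactly the numbers given above:

```python
# Reproducing the deference arithmetic from the paragraph above.
p_eliezer = 0.9   # Eliezer's credence in his key claims
p_richard = 0.1   # my credence

# "Defer entirely to one of us" comparison: a 9x difference.
full_defer_ratio = p_eliezer / p_richard

# Practical comparison: calibrated-Eliezer (assumed 2x overconfident,
# so credence 0.45) gets 30% of deference weight; researchers sharing
# my worldview get the remaining 70%.
calibrated_eliezer = p_eliezer * 0.5
blended = 0.3 * calibrated_eliezer + 0.7 * p_richard  # ~0.205
practical_ratio = blended / p_richard                  # ~2x
```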

Won't go through the other examples but hopefully that conveys the idea. The basic problem here, I think, is that the implicit "deference model" that you and Ben are using doesn't actually work (even for very simple examples like the ones you gave).

On Deference and Yudkowsky's AI Risk Estimates

I haven't thought much about nuclear policy, so I can't respond there. But at least in alignment, I expect that pushing on variables where there's less than a 2x difference between the expected positive and negative effects of changing that variable is not a good use of time for altruistically-motivated people.

(By contrast, upweighting or downweighting Eliezer's opinions by a factor of 2 could lead to significant shifts in expected value, especially for people who are highly deferential. The specific thing I think doesn't make much difference is deferring to a version of Eliezer who's 90% confident about something, versus deferring to the same extent to a version of Eliezer who's 45% confident in the same thing.)

My more general point, which doesn't hinge on the specific 2x claim, is that naive conversions between metrics of calibration and deferential weightings are a bad idea, and that a good way to avoid naive conversions is to care a lot more about innovative thinking than calibration when deferring.

On Deference and Yudkowsky's AI Risk Estimates

I phrased my reply strongly (e.g. telling people to read the other post instead of this one) because deference epistemology is intrinsically closely linked to status interactions, and you need to be pretty careful in order to make this kind of post not end up being, in effect, a one-dimensional "downweight this person". I don't think this post was anywhere near careful enough to avoid that effect. That seems particularly bad because I think most EAs should significantly upweight Yudkowsky's views if they're doing any kind of reasonable, careful deference, because most EAs significantly underweight how heavy-tailed the production of innovative ideas actually is (e.g. because of hindsight bias, it's hard to realise how much worse than Eliezer we would have been at inventing the arguments for AI risk, and how many dumb things we would have said in his position).

By contrast, I think your post is implicitly using a model where we have a few existing, well-identified questions, and the most important thing is to just get to the best credences on those questions, and we should do so partly by just updating in the direction of experts. But I think this model of deference is rarely relevant; see my reply to Rohin for more details. Basically, as soon as we move beyond toy models of deference, the "innovative thinking" part becomes crucially important, and the "well-calibrated" part becomes much less so.

One last intuition: different people have different relationships between their personal credences and their all-things-considered credences. Inferring track records in the way you've done here will, in addition to favoring people who are quieter and say fewer useful things, also favor people who speak primarily based on their all-things-considered credences rather than their personal credences. But that leads to a vicious cycle where people are deferring to people who are deferring to people who... And then the people who actually do innovative thinking in public end up getting downweighted to oblivion via cherrypicked examples.

Modesty epistemology delenda est.

On Deference and Yudkowsky's AI Risk Estimates

I think that there are very few decisions which are both a) that low-dimensional and b) actually sensitive to the relevant range of credences that we're talking about.

Like, suppose you think that Eliezer's credences on his biggest claims are literally 2x higher than they should be, even for claims where he's 90% confident. This is a huge hit in terms of Bayes points; if that's how you determine deference, and you believe he's 2x off, then plausibly that implies you should defer to him less than you do to the median EA. But when it comes to grantmaking, for example, a cost-effectiveness factor of 2x is negligible given the other uncertainties involved - this should very rarely move you from a yes to a no, or vice versa. (edit: I should restrict the scope here to grantmaking in complex, high-uncertainty domains like AI alignment).

Then you might say: well, okay, we're not just making binary decisions, we're making complex decisions where we're choosing between lots of different options. But the more complex the decisions you're making, the less you should care about whether somebody's credences on a few key claims are accurate, and the more you should care about whether they're identifying the right types of considerations, even if you want to apply a big discount factor to the specific credences involved.

As a simple example, as soon as you're estimating more than one variable, you typically start caring a lot about whether the errors on your estimates are correlated or uncorrelated. But there are so many different possibilities for ways and reasons that they might be correlated that you can't just update towards experts' credences, you have to actually update towards experts' reasons for those credences, which then puts you in the regime of caring more about whether you've identified the right types of considerations.

On Deference and Yudkowsky's AI Risk Estimates

I think that a bunch of people are overindexing on Yudkowsky's views; I've nevertheless downvoted this post because it seems like it's making claims that are significantly too strong, based on a methodology that I strongly disendorse. I'd much prefer a version of this post which, rather than essentially saying "pay less attention to Yudkowsky", is more nuanced about how to update based on his previous contributions; I've tried to do that in this comment, for example. (More generally, rather than reading this post, I recommend people read this one by Paul Christiano, which outlines specific agreements and disagreements. Note that the list of agreements there, which I expect that many other alignment researchers also buy into, serves as a significant testament to Yudkowsky's track record.)

The part of this post which seems most wild to me is the leap from "mixed track record" to

In particular, I think, they shouldn’t defer to him more than they would defer to anyone else who seems smart and has spent a reasonable amount of time thinking about AI risk.

For any reasonable interpretation of this sentence, it's transparently false. Yudkowsky has proven to be one of the best few thinkers in the world on a very difficult topic. Insofar as there are others who you couldn't write a similar "mixed track record" post about, it's almost entirely because they don't have a track record of making any big claims, in large part because they weren't able to generate the relevant early insights themselves. Breaking ground in novel domains is very, very different from forecasting the weather or events next year; a mixed track record is the price of entry.

Based on his track record, I would endorse people deferring more towards the general direction of Yudkowsky's views than towards the views of almost anyone else. I also think that there's a good case to be made that Yudkowsky tends to be overconfident, and this should be taken into account when deferring; but when it comes to making big-picture forecasts, the main value of deference is in helping us decide which ideas and arguments to take seriously, rather than the specific credences we should place on them, since the space of ideas is so large. The EA community has ended up strongly moving in Yudkowsky's direction over the last decade, and that seems like much more compelling evidence than anything listed in this post.
