Joined Dec 2017


Former AI safety research engineer, now PhD student in philosophy of ML at Cambridge. I'm originally from New Zealand but have lived in the UK for 6 years, where I did my undergrad and master's degrees (in Computer Science, Philosophy, and Machine Learning). Blog:




I like your overall ambitions! I want to note a couple of things that seemed incongruous to me/things I'd change about your default plan.

I'm 24 now, so I'm hoping to start my career trajectory at 32 (8 years forms a natural/compelling Schelling point

This seems like very much the wrong mindset. You're starting this trajectory now. In order to do great intellectual work, you should be aiming directly at the things you want to understand, and the topics you want to make progress on, as early as you can. A better alternative would be taking the mindset that your career will end in 8 years, and thinking about what you'd need to produce great work by that time. (This is deliberately provocative, and shouldn't be taken fully literally, but I think it points in the right direction, especially given that you're aiming to do research, like agent foundations and high-level strategic thinking, where the credentials from a PhD that's successful by mainstream standards don't matter very much.)

Pick a new important topic each month (or 2-3 months)

Again, I'd suggest taking quite a different strategy here. In order to do really well at this, I think you don't want the mindset of shallowly exploring other people's work (although of course it's useful to have that as background knowledge). I think you want to have the mindset of identifying the things which seem most important to you, pushing forward the frontier of knowledge on those topics, following threads which arise from doing so, and learning whatever you need as you go along. What it looks like to be successful here is noticing a bunch of ways in which other people seem like they're missing stuff/overlooking things, digging into those, and finding new ways to understand these topics. (That's true even if your only goal is to popularise existing ideas - in order to be able to popularise them really well, you want the level of knowledge such that, if there were big gaps in those ideas, then you'd notice them.) This is related to the previous point: don't spend all this time preparing to do the thing - just do it!

I think that I am unusually positioned to be able to become such a person.

I think that doing well at this research is sufficiently heavy-tailed that it's very hard to reason your way into thinking you'll be great at it in advance. You'll get far far more feedback on this point by starting to do the work now, getting a bunch of feedback, and iterating fast.

Good luck!

Makes sense, glad we're on the same page!

a more accurate title for my post would be “population ethics without objective axiology.”

Perhaps consider changing it to that, then? Since I'm a subjectivist, I consider all axiologies subjective - and therefore "without axiology" is very different from "without objective axiology".

(I feel like I would have understood that our arguments were consistent either if the title had been different, or if I'd read the post more carefully - but alas, neither condition held.)

I'd also consider that humans are biological creatures with “interests” – a system-1 “monkey brain” with its own needs, separate (or at least separable) from idealized self-identities that the rational, planning part of our brain may come up with. So, if we also want to fulfill these interests/needs, that could be justification for a quasi-hedonistic view or for the type of mixed view that you advocate?

I like this justification for hedonism. I suspect that a version of this is the only justification that will actually hold up in the long term, once we've more thoroughly internalized qualia anti-realism.

I like this post; as you note, we've been thinking along very similar lines. But you reach different conclusions than I do - in particular, I disagree that "the ambitious morality of “do the most moral/altruistic thing” is something like preference utilitarianism." In other words, I think most of your arguments about minimal morality are still consistent with having an axiology.

I didn't read your post very carefully, but I think the source of the disagreement is that you're conflating objectivity/subjectivity with respect to the moral actor and objectivity/subjectivity with respect to the moral patient.

More specifically: let's say that I'm a moral actor, and I have some axiology. I might agree that this axiology is not objective: it's just my own idiosyncratic axiology. But it nevertheless might be non-subjective with respect to moral patients, in the sense that my axiology says that some experiences have value regardless of what the people having those experiences want. So I could be a hedonist despite thinking that hedonism isn't the objectively-correct axiology.

This distinction also helps resolve the tension between "there's an objective axiology" and "people are free to choose their own life goals": the objective axiology of what's good for a person might in part depend on what they want.

Having an axiology which says things like "my account of welfare is partly determined by hedonic experiences and partly by preferences and partly by how human-like the agent is" may seem unparsimonious, but I think that's just what it means for humans to have complex values. And then, as you note, we can also follow minimal (cooperation) morality for people who are currently alive, and balance that with maximizing the welfare of people who don't yet exist.

I've now written up a more complete theory of deference here. I don't expect that it directly resolves these disagreements, but hopefully it's clearer than this thread.

Going from allocating 10% of your resources to 20% of your resources to a worldview seems like a big change.

Note that this wouldn't actually make a big change for AI alignment, since we don't know how to use more funding. It'd make a big change if we were talking about allocating people, but my general heuristic is that I'm most excited about people acting on strong worldviews of their own, and so I think the role of deference there should be much more limited than when it comes to money. (This all falls out of the theory I linked above.)

Across the general population, maybe coherence is 7/10 correlated with expected future impact; across the experts that one would consider deferring to I think it is more like 2/10, because most experts seem pretty coherent (within the domains they're thinking about and trying to influence) and so the differences in impact depend on other factors.

Experts are coherent within the bounds of conventional study. When we try to apply that expertise to related topics that are less conventional (e.g. ML researchers on AGI; or even economists on what the most valuable interventions are) coherence drops very sharply. (I'm reminded of an interview where Tyler Cowen says that the most valuable cause area is banning alcohol, based on some personal intuitions.)

I feel like looking at any EA org's report on estimation of their own impact makes it seem like "impact of past policies" is really difficult to evaluate?

The question is how it compares to estimating past correctness, where we face pretty similar problems. But mostly I think we don't disagree too much on this question - I think epistemic evaluations are gonna be bigger either way, and I'm mostly just advocating for the "think-of-them-as-a-proxy" thing, which you might be doing but very few others are.

Meta: I'm currently writing up a post with a fully-fleshed-out account of deference. If you'd like to drop this thread and engage with that when it comes out (or drop this thread without engaging with that), feel free; I expect it to be easier to debate when I've described the position I'm defending in more detail.

I'll note that most of this seems unrelated to my original claim, which was just "deference* seems important for people making decisions now, even if it isn't very important in practice for researchers", in contradiction to a sentence on your top-level comment. Do you now agree with that claim?

I always agreed with this claim; my point was that the type of deference which is important for people making decisions now should not be very sensitive to the "specific credences" of the people you're deferring to. You were arguing above that the difference between your and Eliezer's views makes much more than a 2x difference; do you now agree that, on my account of deference, a big change in the deference-weight you assign to Eliezer plausibly leads to a much smaller change in your policy? That's because, from the perspective of other worldviews, the Eliezer-worldview trades off influence over most parts of the policy for influence over the parts that it thinks are crucial and other policies don't.

individual proxies and my thoughts on them

This is helpful, thanks. I of course agree that we should consider both correlations with impact and ease of evaluation; I'm talking so much about the former because not noticing this seems like the default mistake that people make when thinking about epistemic modesty. Relatedly, I think my biggest points of disagreement with your list are:

1. I think calibrated credences are badly-correlated with expected future impact, because:
a) Overconfidence is just so common, and top experts are often badly miscalibrated even when they have really good models of their field.
b) The people who are best at having impact have goals other than sounding calibrated - e.g. convincing people to work with them, fighting social pressure towards conformity, etc. By contrast, the people who are best at being calibrated are likely the ones who are always stating their all-things-considered views, and who therefore may have very poor object-level models. This is particularly worrying when we're trying to infer credences from tone - e.g. it's hard to distinguish the hypotheses "Eliezer's inside views are less calibrated than other people's" and "Eliezer always speaks based on his inside-view credences, whereas other people usually speak based on their all-things-considered credences".
c) I think that "directionally correct beliefs" are much better-correlated, and not that much harder to evaluate, and so credences are especially unhelpful by comparison (like, 2/10 before conditioning on directional correctness, and 1/10 after, whereas directional correctness is like 3/10).

2. I think coherence is very well-correlated with expected future impact (like, 5/10), because impact is heavy-tailed and the biggest sources of impact often require strong, coherent views. I don't think it's that hard to evaluate in hindsight, because the more coherent a view is, the more easily it's falsified by history.

3. I think "hypothetical impact of past policies" is not that hard to evaluate. E.g. in Eliezer's case the main impact is "people do a bunch of technical alignment work much earlier", which I think we both agree is robustly good.

In your Kurzweil example I think the issue is not that you assigned weights based on hypothetical-Kurzweil's beliefs, but that hypothetical-Kurzweil is completely indifferent over policies.

Your procedure is non-robust in the sense that, if Kurzweil transitions from total indifference to thinking that one policy is better by epsilon, he'll throw his full weight behind that policy. Hmm, but then in a parliamentary approach I guess that if there are a few different things he cares epsilon about, then other policies could negotiate to give him influence only over the things they don't care about themselves. Weighting by hypothetical-past-impact still seems a bit more elegant, but maybe it washes out.

(If we want to be really on-policy then I guess the thing which we should be evaluating is whether the person's worldview would have had good consequences when added to our previous mix of worldviews. And one algorithm for this is assigning policies weights by starting off from a state where they don't know anything about the world, then letting them bet on all your knowledge about the past (where the amount they win on bets is determined not just by how correct they are, but also how much they disagree with other policies). But this seems way too complicated to be helpful in practice.)
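To make that betting algorithm a bit more concrete, here's a toy sketch; all the names, events, and numbers below are my own invention, and this is just one possible way to cash out the idea. Each worldview starts with equal wealth, bets its credences on a series of past binary events, and its final wealth becomes its deference weight. Payouts are relative to the wealth-weighted consensus, so a worldview only gains by disagreeing with the others and being right.

```python
# Toy sketch (my illustration, not a standard algorithm) of assigning
# deference weights by letting worldviews bet on known past events.

def market_update(wealth, credences, outcome):
    """Update each worldview's wealth after one binary event.

    Payouts are relative to the wealth-weighted consensus credence, so
    total wealth is conserved and a worldview only profits by being both
    contrarian and correct.
    """
    total = sum(wealth.values())
    consensus = sum(wealth[w] * credences[w] for w in wealth) / total
    new_wealth = {}
    for w in wealth:
        p = credences[w] if outcome else 1 - credences[w]
        q = consensus if outcome else 1 - consensus
        new_wealth[w] = wealth[w] * (p / q)  # relative-accuracy payout
    return new_wealth

# Hypothetical past events: each entry is (credences per worldview, outcome).
wealth = {"A": 1.0, "B": 1.0, "C": 1.0}
history = [
    ({"A": 0.9, "B": 0.5, "C": 0.2}, True),
    ({"A": 0.7, "B": 0.6, "C": 0.3}, True),
    ({"A": 0.4, "B": 0.5, "C": 0.8}, False),
]
for credences, outcome in history:
    wealth = market_update(wealth, credences, outcome)

# Normalized final wealth becomes each worldview's deference weight.
weights = {w: v / sum(wealth.values()) for w, v in wealth.items()}
```

Worldview A, which was confidently right on all three events, ends up with most of the weight; C, which was confidently wrong, ends up with almost none. This also illustrates why the scheme rewards disagreement: a worldview that always parrots the consensus never moves its wealth at all.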

I agree that if you can evaluate quality of past recommended policies well, without a ton of noise, that would be a better signal than accuracy of beliefs. This just seems extremely hard to do, especially given the selection bias in who comes to your attention in the first place, and idk how I'd do it for Eliezer in any sane way.

I think I'm happy with people spending a bunch of time evaluating accuracy of beliefs, as long as they keep in mind that this is a proxy for quality of recommended policies. Which I claim is an accurate description of what I was doing, and what Ben wasn't: e.g. when I say that credences matter less than coherence of worldviews, that's because the latter is crucial for designing good policies, whereas the former might not be; and when I say that all-things-considered estimates of things like "total risk level" aren't very important, that's because in principle we should be aggregating policies not risk estimates between worldviews.

I also agree that selection bias could be a big problem; again, I think that the best strategy here is something like "do the standard things while remembering what's a proxy for what".

Yepp, thanks for the clear rephrasing. My original arguments for this view were pretty messy because I didn't have it fully fleshed out in my mind before writing this comment thread, I just had a few underlying intuitions about ways I thought Ben was going wrong.

Upon further reflection I think I'd make two changes to your rephrasing.

First change: in your rephrasing, we assign people weights based on the quality of their beliefs, but then follow their recommended policies. But any given way of measuring the quality of beliefs (in terms of novelty, track record, etc) is only an imperfect proxy for quality of policies. For example, Kurzweil might very presciently predict that compute is the key driver of AI progress, but suppose (for the sake of argument) that the way he does so is by having a worldview in which everything is deterministic, individuals are powerless to affect the future, etc. Then you actually don't want to give many resources to Kurzweil's policies, because Kurzweil might have no idea which policies make any difference.

So I think I want to adjust the rephrasing to say: in principle we should assign people weights based on how well their past recommended policies for someone like you would have worked out, which you can estimate using things like their track record of predictions, novelty of ideas, etc. But notably, the quality of past recommended policies is often not very sensitive to credences! For example, if you think that there's a 50% chance of solving nanotech in a decade, or a 90% chance of solving nanotech in a decade, then you'll probably still recommend working on nanotech (or nanotech safety) either way.

Having said all that, since we only get one rollout, evaluating policies is very high variance. And so looking at other information like reasoning, predictions, credences, etc, helps you distinguish between "good" and "lucky". But fundamentally we should think of these as approximations to policy evaluation, at least if you're assuming that we mostly can't fully evaluate whether their reasons for holding their views are sound.

Second change: what about the case where we don't get to allocate resources, but we have to actually make a set of individual decisions? I think the theoretically correct move here is something like: let policies spend their weight on the domains which they think are most important, and then follow the policy which has spent most weight on that domain.

Some complications:

  • I say "domains" not "decisions" because you don't want to make a series of related decisions which are each decided by a different policy, that seems incoherent (especially if policies are reasoning adversarially about how to undermine each other's actions).
  • More generally, this procedure could in theory be sensitive to bargaining and negotiating dynamics between different policies, and also the structure of the voting system (e.g. which decisions are voted on first, etc). I think we can just resolve to ignore those and do fine, but in principle I expect it gets pretty funky.
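Ignoring those bargaining complications, the basic weight-spending procedure can be sketched as follows; the worldview names, domains, and stake numbers are all hypothetical, chosen so that a low-weight worldview with concentrated stakes wins the one domain it cares most about.

```python
# Toy sketch (my illustration) of "let policies spend their weight on the
# domains they think are most important, then follow whichever policy spent
# the most on each domain".

def decide_domains(weights, stakes):
    """weights: worldview -> overall deference weight.
    stakes: worldview -> {domain: how much that worldview cares}.
    Returns domain -> winning worldview."""
    spent = {}  # domain -> {worldview: weight spent there}
    for w, weight in weights.items():
        total_stake = sum(stakes[w].values())
        for domain, stake in stakes[w].items():
            # Each worldview splits its weight in proportion to its stakes.
            spent.setdefault(domain, {})[w] = weight * stake / total_stake
    return {d: max(bids, key=bids.get) for d, bids in spent.items()}

# Hypothetical numbers: the low-weight worldview concentrates nearly all of
# its weight on the single domain it considers crucial.
weights = {"mainstream": 0.8, "eliezer": 0.2}
stakes = {
    "mainstream": {"biosecurity": 5, "curriculum": 1, "global_health": 4},
    "eliezer": {"biosecurity": 1, "curriculum": 9},
}
decisions = decide_domains(weights, stakes)
```

With these numbers the mainstream worldview keeps biosecurity and global health, while the 0.2-weight worldview wins the curriculum domain by spending 0.18 there against the mainstream's 0.08, which is the "cedes responsibility where it matters least" dynamic described below.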

Lastly, two meta-level notes:

  • I feel like I've probably just reformulated some kind of reinforcement learning. Specifically the case where you have a fixed class of policies and no knowledge of how they relate to each other, so you can only learn how much to upweight each policy. And then the best policy is not actually in your hypothesis space, but you can learn a simple meta-policy of when to use each existing policy.
  • It's very ironic that in order to figure out how much to defer to Yudkowsky we need to invent a theory of idealised cooperative decision-making. Since he's probably the person whose thoughts on that I trust most, I guess we should meta-defer to him about what that will look like...

(One major thing is that I think you should be comparing between two actions, rather than evaluating an action by itself, which is why I compared to "all other alignment work".)

IMO the crux is that I disagree with both of these. Instead I think you should use each worldview to calculate a policy, and then generate some kind of compromise between those policies. My arguments above were aiming to establish that this strategy is not very sensitive to exactly how much you defer to Eliezer, because there just aren't very many good worldviews going around - hence why I assign maybe 15 or 20% (inside view) credence to his worldview (updated from 10% above after reflection). (I think my all-things-considered view is similar, actually, because deference to him cancels out against deference to all the people who think he's totally wrong.)

Again, the difference is in large part determined by whether you think you're in a low-dimensional space (here are our two actions, which one should we take?) versus a high-dimensional space (millions of actions available to us, how do we narrow it down?) In a high-dimensional space the tradeoffs between the best ways to generate utility according to Eliezer's worldview and the best ways to generate utility according to other worldviews become much smaller.

This seems like a crazy way to do cost-effectiveness analyses.

Like, if I were comparing deworming to GiveDirectly, would I be saying "well, the value of deworming is mainly determined by the likelihood that the pro-deworming people are right, which I estimate is 70% but you estimate is 50%, so there's only a 1.4x difference"? Something has clearly gone wrong here.

Within a worldview, you can assign EVs which are orders of magnitude different. But once you do worldview diversification, if a given worldview gets even 1% of my resources, then in some sense I'm acting like that worldview's favored interventions are in a comparable EV ballpark to all the other worldviews' favored interventions. That's a feature not a bug.

It also feels like this reasoning implies that no EA action can be > 10x more valuable than any other action that an EA critic thinks is good? Since you assign a 90% chance that the EA is right, and the critic thinks there's a 10% chance of that, so there's only a 9x gap? And then once you do all of your adjustments it's only 2x? Why do we even bother with cause prioritization under this worldview?

I don't have a fleshed out theory of how and when to defer, but I feel pretty confident that even our intuitive pretheoretic deference should not be this sort of thing, and should be the sort of thing that can have orders of magnitude of difference between actions.

An arbitrary critic typically gets well less than 0.1% of my deference weight on EA topics (otherwise it'd run out fast!) But also see above: because in high-dimensional spaces there are few tradeoffs between different worldviews' favored interventions, changing the weights on different worldviews doesn't typically lead to many OOM changes in how you're acting like you're assigning EVs.

Also, I tend to think of cause prio as trying to integrate multiple worldviews into a single coherent worldview. But with deference you intrinsically can't do that, because the whole point of deference is you don't fully understand their views.

There's lots of things you can do under Eliezer's worldview that add dignity points, like paying relevant people millions of dollars to spend a week really engaging with the arguments, or trying to get whole-brain emulation before AGI. My understanding is that he doesn't expect those sorts of things to happen.

What do you mean "he doesn't expect this sort of thing to happen"? I think I would just straightforwardly endorse doing a bunch of costly things like these that Eliezer's worldview thinks are our best shot, as long as they don't cause much harm according to other worldviews.

I don't see why you are not including "c) give significant deference weight to his actual worldview", which is what I'd be inclined to do if I didn't have significant AI expertise myself and so was trying to defer.

Because neither Ben nor myself was advocating for this.

Thanks for writing this update. I think my number one takeaway here is something like: when writing a piece with the aim of changing community dynamics, it's important to be very clear about motivations and context. E.g. I think a version of the piece which said "I think people are overreacting to Death with Dignity, here are my specific models of where Yudkowsky tends to be overconfident, here are the reasons why I think people aren't taking those into account as much as they should" would have been much more useful and much less controversial than the current piece, which (as I interpret it) essentially pushes a general "take Yudkowsky less seriously" meme (and is thereby intrinsically political/statusy).

Musing out loud: I don't know of any complete model of deference which doesn't run into weird issues, like the conclusion that you should never trust yourself. But suppose you have some kind of epistemic parliament where you give your own views some number of votes, and assign the rest of the votes to other people in proportion to how defer-worthy they seem. Then you need to make a bunch of decisions, and your epistemic parliament keeps voting on what will best achieve your (fixed) goals.

If you do naive question-by-question majority voting on each question simultaneously then you can end up with an arbitrarily incoherent policy - i.e. a set of decisions that's inconsistent with each other. And if you make the decisions in some order, with the constraint that they each have to be consistent with all prior decisions, then the ordering of the decisions can become arbitrarily important.
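The incoherence from question-by-question voting can be shown with a standard three-member toy case (the classic discursive dilemma from judgment aggregation; the advisor names are mine): each advisor holds an internally consistent position on two premises and their conjunction, yet the majority verdicts taken question-by-question are jointly inconsistent.

```python
# Toy illustration: question-by-question majority voting yields an
# inconsistent set of verdicts even though every voter is consistent.

votes = {
    # (premise p, premise q, conclusion "p and q") - each row is consistent
    "advisor_1": (True, True, True),
    "advisor_2": (True, False, False),
    "advisor_3": (False, True, False),
}

def majority(index):
    """Majority verdict on one question, taken in isolation."""
    ayes = sum(v[index] for v in votes.values())
    return ayes > len(votes) / 2

p, q, p_and_q = majority(0), majority(1), majority(2)

# The majority affirms p and affirms q, yet rejects "p and q".
consistent = (p and q) == p_and_q
```

Here `consistent` comes out false: the parliament endorses both premises while rejecting their conjunction, so any policy built from these verdicts contradicts itself.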

Instead, you want your parliament to negotiate some more coherent joint policy to follow. And I expect that in this joint policy, each worldview gets its way on the questions that are most important to it, and cedes responsibility on the questions that are least important. So Eliezer's worldview doesn't end up reallocating all the biosecurity money, but it does get a share of curriculum time (at least for the most promising potential researchers). But in general how to conduct those negotiations is an unsolved problem (and pretty plausibly unsolvable).
