Research analyst at Open Philanthropy. Doctoral student in philosophy at the University of Oxford. Opinions my own.

Wiki Contributions


On the limits of idealized values

Thanks, Richard :). Re: arbitrariness, in a sense the relevant choices might well end up arbitrary (and as you say, subjectivists need to get used to some level of unavoidable arbitrariness), but I do think that it at least seems worth trying to capture/understand some sort of felt difference between e.g. picking between Buridan's bales of hay, and choosing e.g. what career to pursue, even if you don't think there's a "right answer" in either case. 

I agree that "infallible" maybe has the wrong implications, here, though I do think that part of the puzzle is the sense in which these choices feel like candidates for mistake or success; e.g., if I choose the puppies, or the crazy galaxy Joe world, I have some feeling like "man, I hope this isn't a giant mistake." That said, things we don't have control over, like desires, do feel like they have less of this flavor.

On the limits of idealized values

I'm glad you liked it, Lukas. It does seem like an interesting question how your current confidence in your own values relates to your interest in further "idealization," of what kind, and how much convergence makes a difference. Prima facie, it does seems plausible that greater confidence speaks in favor"conservatism" about what sorts of idealization you go in for, though I can imagine very uncertain-about-their-values people opting for conservatism, too. Indeed, it seems possible that conservatism is just generally pretty reasonable, here.

Draft report on existential risk from power-seeking AI

Hi Ben, 

This does seem like a helpful kind of content to include (here I think of Luke’s section on this here, in the context of his work on moral patienthood). I’ll consider revising to say more in this vein. In the meantime, here are a few updates off the top of my head:

  • It now feels more salient to me now just how many AI applications may be covered by systems that either aren’t agentic planning/strategically aware (including e.g. interacting modular systems, especially where humans are in the loop for some parts, and/or intuitively “sphexish/brittle” non-APS systems ), or by systems which are specialized/myopic/limited in capability in various ways. That is, a generalized learning agent that’s superhuman (let alone better than e.g. all of human civilization) in ~all domains, with objectives as open-ended and long-term as “maximize paperclips,” now seems to me a much more specific type of system, and one whose role in an automated economy -- especially early on -- seems more unclear. (I discuss this a bit in Section 3, section, and section 4.3.2).
  • Thinking about the considerations discussed in the "unusual difficulties" section generally gave me more clarity about how this problem differs from safety problems arising in the context of other technologies (I think I had previously been putting more weight on considerations like "building technology that performs function F is easier than building some technology that performs function F safely and reliably," which apply more generally).
  • I realized how much I had been implicitly conceptualizing the “alignment problem” as “we must give these AI systems objectives that we’re OK seeing pursued with ~arbitrary degrees of capability” (something akin to the “omni test”). Meeting standards in this vicinity (to the extent that they're well defined in a given case) seems like a very desirable form of robustness (and I’m sympathetic to related comments from Eliezer to the effect that “don’t build systems that are searching for ways to kill you, even if you think the search will come up empty”), but I found it helpful to remember that the ultimate problem is “we need to ensure that these systems don’t seek power in misaligned ways on any inputs they’re in fact exposed to” (e.g., what I’m calling “practical PS-alignment”) -- a framing that leaves more conceptual room, at least, for options that don’t “get the objectives exactly right," and/or that involve restricting a system’s capabilities/time horizons, preventing it from “intelligence exploding,” controlling its options/incentives, and so on (though I do think options in this vein raise their own issues, of the type of that the "omni test" is meant to avoid, see,, and 4.3.3). I discuss this a bit in section 4.1.
  • I realized that my thinking re: “races to the bottom on safety” had been driven centrally by abstract arguments/models that could apply in principle to many industries (e.g., pharmaceuticals). It now seems to me a knottier and more empirical question how models of this kind will actually apply in a given real-world case re: AI. I discuss this a bit in section 5.3.1
Draft report on existential risk from power-seeking AI

Hi Ben, 

A few thoughts on this: 

  • It seems possible that attempting to produce “great insight” or “simple arguments of world-shattering importance” warrants a methodology different from the one I’ve used here. But my aim here is humbler: to formulate and evaluate an existing argument that I and various others take seriously, and that lots of resources are being devoted to; and to come to initial, informal, but still quantitative best-guesses about the premises and conclusion, which people can (hopefully) agree/disagree with at a somewhat fine-grained level -- e.g., a level that just giving overall estimates, or just saying e.g. “significant probability,” “high enough to worry about,” etc can make more difficult to engage on.
  • In that vein, I think it’s possible you’re over-estimating how robust I take the premises and numbers here to be (I'm thinking here of your comments re: “very accurately carve the key parts of reality that are relevant,” and "trust the outcome number"). As I wrote in response to Rob above, my low-end/high-end range here is .1% to 40% (see footnote 179, previously 178), and in general, I hold the numbers here very lightly (I try to emphasize this in section 8). 
  • FWIW, I think Superintelligence can be pretty readily seen as a multi-step argument (e.g., something like: superintelligence will happen eventually; fast take-off is plausible; if fast-take-off, then a superintelligence will probably get a decisive strategic advantage; alignment will be tricky; misalignment leads to power-seeking; therefore plausible doom). And more broadly, I think that people make arguments with many premises all the time (though sometimes the premises are suppressed). It’s true that people don’t usually assign probabilities to the premises (and Bostrom doesn’t, in Superintelligence -- a fact that leaves the implied p(doom) correspondingly ambiguous) -- but I think this is centrally because assigning informal probabilities to claims (whether within a multi-step argument, or in general) just isn’t a very common practice, for reasons not centrally to do with e.g. multi-stage-fallacy type problems. Indeed, I expect I’d prefer a world where people assigned informal, lightly-held probabilities to their premises and conclusions (and formulated their arguments in premise-premise-conclusion form) more frequently.
  • I’m not sure exactly what you have in mind re: “examining a single worldview to see whether it’s consistent,” but consistency in a strict sense seems too cheap? E.g., “Bob has always been wrong before, but he’ll be right this time”; “Mortimer Snodgrass did it”; etc are all consistent. That said, my sense is that you have something broader in mind -- maybe something like "plausible," "compelling," "sense-making," etc. But it seems like these still leave the question of overall probabilities open...

Overall, my sense is that disagreement here is probably more productively focused on the object level -- e.g., on the actual probabilities I give to the premises, and/or on pointing out and giving weight to scenarios that the premises don’t cover -- rather than on the methodology in the abstract. In particular, I doubt that people who disagree a lot with my bottom line will end up saying: “If I was to do things your way, I’d roughly agree with the probabilities you gave to the premises; I just disagree that you should assign probabilities to premises in a multi-step argument as a way of thinking about issues like this.” Rather, I expect a lot of it comes down to substantive disagreement about the premises at issue (and perhaps, to people assigning significant credence to scenarios that don’t fit these premises, though I don't feel like I've yet heard strong candidates -- e.g., ones that seem to me to plausibly account for, say, >2/3rds of the overall X-risk from power-seeking, misaligned AI by 2070 -- in this regard).

Draft report on existential risk from power-seeking AI

Hi Hadyn, 

Thanks for your kind words, and for reading. 

  1. Thanks for pointing out these pieces. I like the breakdown of the different dimensions of long-term vs. near-term. 
  2. Broadly, I agree with you that the document could benefit from more about premise 5. I’ll consider revising to add some.
  3. I’m definitely concerned about misuse scenarios too (and I think lines here can get blurry -- see e.g. Katja Grace’s recent post); but I wanted, in this document, to focus on misalignment in particular. The question of how to weigh misuse vs. misalignment risk, and how the two are similar/different more generally, seems like a big one, so I’ll mostly leave it for another time (one big practical difference is that misalignment makes certain types of technical work more relevant).
  4. Eventually, the disempowerment has to scale to ~all of humanity (a la premise 5), so that would qualify as TAI in the “transition as big of a deal as the industrial revolution” sense. However, it’s true that my timelines condition in premise 1 (e.g., APS systems become possible and financially feasible) is weaker than Ajeya’s.
Draft report on existential risk from power-seeking AI

(Continued from comment on the main thread)

I'm understanding your main points/objections in this comment as: 

  1. You think the multiple stage fallacy might be the methodological crux behind our disagreement. 
  2. You think that >80% of AI safety researchers at MIRI, FHI, CHAI, OpenAI, and DeepMind would assign >10% probability to existential catastrophe from technical problems with AI (at some point, not necessarily before 2070). So it seems like 80k saying 1-10% reflects a disagreement with the experts, which would be strange in the context of e.g. climate change, and at least worth flagging/separating. (Presumably, something similar would apply to my own estimates.)
  3. You worry that there are social reasons not to sound alarmist about weird/novel GCRs, and that it can feel “conservative” to low-ball rather than high-ball the numbers. But low-balling (and/or focusing on/making salient lower-end numbers) has serious downsides. And you worry that EA folks have a track record of mistakes in this vein.

(as before, let’s call “there will be an existential catastrophe from power-seeking AI before 2070” p)

Re 1 (and 1c, from my response to the main thread): as I discuss in the document, I do think there are questions about multiple-stage fallacies, here, though I also think that not decomposing a claim into sub-claims can risk obscuring conjunctiveness (and I don’t see “abandon the practice of decomposing a claim into subclaims” as a solution to this). As an initial step towards addressing some of these worries, I included an appendix that reframes the argument using fewer premises (and also, in positive (e.g., “p is false”) vs. negative (“p is true”) forms). Of course, this doesn’t address e.g. the “the conclusion could be true, but some of the premises false” version of the “multiple stage fallacy” worry; but FWIW, I really do think that the premises here capture the majority of my own credence on p, at least. In particular, the timelines premise is fairly weak, premises 4-6 are implied by basically any p-like scenario, so it seems like the main contenders for false premises (even while p is true) are 2: (“There will be strong incentives to build APS systems”) and 3: (“It will be much harder to develop APS systems that would be practically PS-aligned if deployed, than to develop APS systems that would be practically PS-misaligned if deployed (even if relevant decision-makers don’t know this), but which are at least superficially attractive to deploy anyway”). Here, I note the scenarios most salient to me in footnote 173, namely: “we might see unintentional deployment of practical PS-misaligned APS systems even if they aren’t superficially attractive to deploy” and “practical PS-misaligned might be developed and deployed even absent strong incentives to develop them (for example, simply for the sake of scientific curiosity).” But I don’t see these are constituting more than e.g. 50% of the risk. If your own probability is driven substantially by scenarios where the premises I list are false, I’d be very curious to hear which ones (setting aside scenarios that aren’t driven by power-seeking, misaligned AI), and how much credence if you give them. I’d also be curious, more generally, to hear your more specific disagreements with the probabilities I give to the premises I list. 

Re: 2, your characterization of the distribution of views amongst AI safety researchers (outside of MIRI) is in some tension with my own evidence; and I consulted with a number of people who fit your description of “specialists”/experts in preparing the document. That said, I’d certainly be interested to see more public data in this respect, especially in a form that breaks down in (rough) quantitative terms the different factors driving the probability in question, as I’ve tried to do in the document (off the top of my head, the public estimates most salient to me are Ord (2020) at 10% by 2100, Grace et al (2017)’s expert survey (5% median, with no target date), and FHI’s (2008) survey (5% on extinction from superintelligent AI by 2100), though we could gather up others from e.g. LW and previous X-risk books.) That said, importantly, and as indicated in my comment on the main thread, I don’t think of the community of AI safety researchers at the orgs you mention as in an epistemic position analogous to e.g. the IPCC, for a variety of reasons (and obviously, there are strong selection effects at work). Less importantly, I also don’t think the technical aspects of this problem the only factors relevant to assessing risk; at this point I have some feeling of having “heard the main arguments”; and >10% (especially if we don’t restrict to pre-2070 scenarios) is within my “high-low” range mentioned in footnote 178 (e.g., .1%-40%). 

Re: 3, I do think that the “conservative” thing to do here is to focus on the higher-end estimates (especially given uncertainty/instability in the numbers), and I may revise to highlight this more in the text. But I think we should distinguish between the project of figuring out “what to focus on”/what’s “appropriately conservative,” and what our actual best-guess probabilities are; and just as there are risks of low-balling for the sake of not looking weird/alarmist, I think there are risks of high-balling for the sake of erring on the side of caution. My aim here has been to do neither; though obviously, it’s hard to eliminate biases (in both directions).

Draft report on existential risk from power-seeking AI

Hi Rob, 

Thanks for these comments. 

Let’s call “there will be an existential catastrophe from power-seeking AI before 2070” p. I’m understanding your main objections in this comment as: 

  1. It seems to you like we’re in a world where p is true, by default. Hence, 5% on p seems too low to you. In particular:
    1. It implies 95% confidence on not p, which seems to you overly confident.
    2. If p is true by default, you think the world would look like it does now; so if this world isn’t enough to get me above 5%, what would be?
    3. Because p seems true to you by default, you suspect that an analysis that only ends up putting 5% on p involves something more than “the kind of mistake you should make in any ordinary way,” and requires some kind of mistake in methodology.

One thing I’ll note at the outset is the content of footnote 178, which (partly prompted by your comment) I may revise to foreground more in the main text: “In sensitivity tests, where I try to put in ‘low-end’ and ‘high-end’ estimates for the premises above, this number varies between ~.1% and ~40% (sampling from distributions over probabilities narrows this range a bit, but it also fails to capture certain sorts of correlations). And my central estimate varies between ~1-10% depending on my mood, what considerations are salient to me at the time, and so forth. This instability is yet another reason not to put too much weight on these numbers. And one might think variation in the direction of higher risk especially worrying.”

Re 1a: I’m open to 5% being too low. Indeed, I take “95% seems awfully confident,” and related worries in that vein, seriously as an objection. However, as the range above indicates, I also feel open to 5% being too high (indeed, at times it seems that way too me), and I don’t see “it would be strange to be so confident that all of humanity won’t be killed/disempowered because of X” as a forceful argument on its own (quite the contrary): rather, I think we really need to look at the object-level evidence and argument for X, which is what the document tries to do (not saying that quote represents your argument; but hopefully it can illustrate why one might start from a place of being unsurprised if the probability turns out low).

Re 1b: I’m not totally sure I’ve understood you here, but here are a few thoughts. At a high level, one answer to “what sort of evidence would make me update towards p being more likely” is “the considerations discussed in the document that I see as counting against p don’t apply, or seem less plausible” (examples here include considerations related to longer timelines, non-APS/modular/specialized/myopic/constrained/incentivized/not-able-to-easily-intelligence-explode systems sufficing in lots/maybe ~all of incentivized applications, questions about the ease of eliminating power-seeking behavior on relevant inputs during training/testing given default levels of effort, questions about why and in what circumstances we might expect PS-misaligned systems to be superficially/sufficiently attractive to deploy, warning shots, corrective feedback loops, limitations to what APS systems with lopsided/non-crazily-powerful capabilities can do, general incentives to avoid/prevent ridiculously destructive deployment, etc, plus more general considerations like “this feels like a very specific way things could go”). 

But we could also imagine more “outside view” worlds where my probability would be higher: e.g., there is a body of experts as large and established as the experts working on climate change, which uses quantitative probabilistic models of the quality and precision used by the IPCC, along with an understanding of the mechanisms underlying the threat as clear and well-established as the relationship between carbon emissions and climate change, to reach a consensus on much higher estimates. Or: there is a significant, well-established track record of people correctly predicting future events and catastrophes of this broad type decades in advance, and people with that track record predict p with >5% probability.

That said, I think maybe this isn’t getting at the core of your objection, which could be something like: “if in fact this is a world where p is true, is your epistemology sensitive enough to that? E.g., show me that your epistemology is such that, if p is true, it detects p as true, or assigns it significant probability.” I think there may well be something to objections in this vein, and I'm interested in thinking about the more; but I also want to flag that at a glance, it feels kind of hard to articulate them in general terms. Thus, suppose Bob has been wrong about 99/100 predictions in the past. And you say: “OK, but if Bob was going to be right about this one, despite being consistently wrong in the past, the world would look just like it does now. Show me that your epistemology is sensitive enough to assign high probability to Bob being right about this one, if he’s about to be.” But this seems like a tough standard; you just should have low probability on Bob being right about this one, even if he is. Not saying that’s the exact form of your objection, or even that it's really getting at the heart of things, but maybe you could lay out your objection in a way that doesn’t apply to the Bob case?

(Responses to 1c below)

Problems of evil

Sounds right to me.  Per a conversation with Aaron a while back, I've been relying on the moderators to tag posts as personal blog, and had been assuming this one would be.

The importance of how you weigh it

Glad to hear you found it helpful. Unfortunately, I don't think I have a lot to add at the moment re: how to actually pursue moral weighting research, beyond what I gestured at in the post (e.g., trying to solicit lots of your own/other people's intuitions across lots of cases, trying to make them consistent,  that kind of thing). Re: articles/papers/posts, you could also take a look at GiveWell's process here, and the moral weight post from Luke Muelhauser I mentioned has a few references at the end that might be helpful (though most of them I haven't engaged with myself). I'll also add, FWIW, that I actually think the central point in the post most applicable outside of the EA community than inside it, as I think of EA as fairly "basic-set oriented" (though there are definitely some questions in EA where weightings matter).

Load More