All of Joe_Carlsmith's Comments + Replies

Bleh, sorry, looks like this cross-posted twice from LW, but has only one LW copy (and can only be edited via that copy). I'm going to try to keep this version and get rid of the other one. Edit: resolved.

Bleh, sorry, am having some trouble with LW cross-posting here.

(Copying over my response from LessWrong)

Thanks for writing this -- I’m very excited about people pushing back on/digging deeper re: counting arguments, simplicity arguments, and the other arguments re: scheming I discuss in the report. Indeed, despite the general emphasis I place on empirical work as the most promising source of evidence re: scheming, I also think that there’s a ton more to do to clarify and maybe debunk the more theoretical arguments people offer re: scheming – and I think playing out the dialectic further in this respect ... (read more)

(Oops, sorry, something got messed up with the LessWrong cross-posting here, such that the old version of this post, including comments and upvotes, disappeared. Working on trying to fix.)

(Also copied from LW. And partly re-hashing my response from twitter.)

I'm seeing your main argument here as a version of what I call, in section 4.4, a "speed argument against schemers" -- e.g., basically, that SGD will punish the extra reasoning that schemers need to perform. 

(I’m generally happy to talk about this reasoning as a complexity penalty, and/or about the params it requires, and/or about circuit-depth -- what matters is the overall "preference" that SGD ends up with. And thinking of this consideration as a different kind of counting argume... (read more)

Thanks for this thoughtful comment, Ben. And also, for putting "The Gold Lily" and "Mother and Child" on my radar -- they hadn't been before. I agree that "Mother and Child" evokes some kind of intergenerational project in the way you describe -- "it is your turn to address it." It seems related to the thing I was trying to talk about at the end of the post -- e.g., Gluck asking for some kind of directness and intensity of engagement with life. 

Thanks! Re: one in five million and .01% -- thanks, edited. And thanks for pointing to the Augenblick piece -- does look relevant (though my specific interest in that footnote was in constraints applicable to a model where you can only consider some subset of your evidence at any given time).

I'm sorry to hear about this, Nathan. As I say in the post, I do think that the question of how to do gut-stuff right from a practical perspective is distinct from the epistemic angle that the post focuses on, and I think it's important to attend to both.

5
Nathan_Barnard
1y
I agree ideally one would do gut stuff right both practically and epistemically. In my case, the tradeoff of productivity loss and loss in general reasoning ability in exchange for some epistemic gains wasn't worth it. I think it's plausible that people in a similar situation to me - people who are good at making decisions based on just analytic reasoning, and who have reason to think that they might be vulnerable if they were to try to believe things on a gut level as well as an analytic one - should consider not engaging with certain EA topics on a gut level (I don't restrict this to AI safety - I know people who've had similar reactions thinking about nuclear risk, and I've personally made the decision not to think about s-risk or animal welfare on a gut level either.) I do want to emphasise that there was a tradeoff here - I think I have somewhat better AI safety takes as a result of thinking about AI safety on a gut level. The benefit, though, was reasonably small and not worth the other costs from an impartial welfarist perspective. 

Noting that the passage you quote in your appendix from my report isn't my definition of the type of AI I'm focused on. I'm focused on AI systems with the following properties:

  • Advanced capability: they outperform the best humans on some set of tasks which, when performed at advanced levels, grant significant power in today’s world (tasks like scientific research, business/military/political strategy, engineering, and persuasion/manipulation). 
  • Agentic planning: they make and execute plans, in pursuit of objectives, on the basis of models of the world.
... (read more)

Hi Jake, 

Thanks for this comment. I discuss this sort of case in footnote 33 here -- I think it's a good place to push back on the argument. Quoting what I say there:

there is, perhaps, some temptation to say “even if I should be indifferent to these people burned alive, I’m not! Screw indifference-ism world! Sounds like a shitty objective normative order anyway – let’s rebel against it.” That is, it feels like indifference-ism worlds have told me what the normative facts are, but they haven’t told me about my loyalty to the normati

... (read more)
2
Eli Rose
2y
Likewise.

A few questions about this: 

  1. Does this view imply that it is actually not possible to have a world where e.g. a machine creates one immortal happy person per day, forever, who then form an ever-growing line?
  2. How does this view interpret cosmological hypotheses on which the universe is infinite? Is the claim that actually, on those hypotheses, the universe is finite after all? 
  3. It seems like lots of the (countable) worlds and cases discussed in the post can simply be reframed as never-ending processes, no? And then similar (identical?) questions will
... (read more)
9
weeatquince
2y
My take (think I am less of an expert than djbinder here):

1. This view allows that.
2. This view allows that. (Although entirely separately, consideration of entropy etc. would not allow infinite value.)
3. No, I don't think identical questions arise. Not sure. Skimming the above post, it seems to solve most of the problematic examples you give. At any point a moral agent will exist in a universe with finite space and finite time that will tend infinite going forward. So you cannot have infinite starting points, so no zones of suffering etc. Also I think you don't get problems with "welfare-preserving bijections" when well defined in time, but I struggle to explain why. It seems that for example w1 below is less bad than w2:

Time    t1   t2   t3   t4   t5   t6   t7
Agent   a1   a2   a3   a4   a5   a6   a7
w1      -1        -1        -1        -1   …
w2      -1   -1   -1   -1   -1   -1   -1   …
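A minimal sketch of how the w1-vs-w2 comparison can be made precise under the discounted regularization djbinder describes below (the placement of w1's −1s follows the table above; the symbols S and γ are illustrative, not from the comment): for any discount rate γ > 0,

\[
S_{w_2}(\gamma) \;=\; -\sum_{t=1}^{\infty} e^{-\gamma t},
\qquad
S_{w_1}(\gamma) \;=\; -\sum_{k=0}^{\infty} e^{-\gamma(2k+1)} .
\]

Since w1 contains only every other one of w2's −1 terms, \(S_{w_1}(\gamma) > S_{w_2}(\gamma)\) for every \(\gamma > 0\), so any such regularization judges w1 less bad than w2, even though both totals diverge as \(\gamma \to 0^{+}\).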
5
djbinder
2y
As an aside, while neutrality-violations are a necessary consequence of regularization, a weaker form of neutrality is preserved. If we regularize with some discounting factor so that everything remains finite, it is easy to see that "small rearrangements" (where the amount that a person can move in time is finite) do not change the answer, because the difference goes to zero as γ→0. But "big rearrangements" can cause differences that do not vanish as γ→0. Such situations do arise in various physical situations, and are interpreted as changes to boundary conditions, whereas the "small rearrangements" manifestly preserve boundary conditions and manifestly do not cause problems with the limit. (The boundary is most easily seen by mapping the infinite interval onto a compact interval, so that "infinity" is mapped to a finite point. "Small rearrangements" leave infinity unchanged, whereas "large" ones will cause a flow of utility across infinity, which is how the two situations are able to give different answers.)
6
djbinder
2y
I think what is true is probably something like "neverending processes don't exist, but arbitrarily long ones do", but I'm not confident.

My more general claim is that there can be intermediate positions between ultrafinitism ("there is a biggest number") and a laissez-faire "anything goes" attitude, where infinities appear without care or scrutiny. I would furthermore claim (but on less solid ground) that the views of practicing mathematicians and physicists fall somewhere in here.

As to the infinite series examples you give, they are mathematically ill-defined without giving a regularization. There is a large literature in mathematics and physics on the question of regularizing infinite series. Regularization and renormalization are used throughout physics (particularly in QFT), and while poorly written textbooks (particularly older ones) can make this appear like voodoo magic, the correct answers can always be rigorously obtained by making everything finite.

For the situation you are considering, a natural regularization would be to replace your sum with a regularized sum where you discount each time step by some discounting factor γ. Physically speaking, this is what would happen if we thought the universe had some chance of being destroyed at each timestep; that is, it can be arbitrarily long-lived, yet with probability 1 is finite. You can sum the series and then take γ→0 and thus derive a finite answer. There may be many other ways to regulate the series, and it often turns out that how you regulate the series doesn't matter. In this way, it might make sense to talk about this infinite universe without reference to a specific limiting process, but rather potentially with only some weaker limiting process specification. This is what happens, for instance, in QFT; the regularizations don't matter, all we care about are the things that are independent of regularization, and so we tend to think of the theories as existing without a need for regularization. Ho
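As a minimal illustration of the regularized-sum idea described here (the per-timestep utility u_t and the exponential form of the discount are assumptions made for the example, not part of the comment): define

\[
S(\gamma) \;=\; \sum_{t=0}^{\infty} e^{-\gamma t}\, u_t , \qquad \gamma > 0 ,
\]

and take the regularized value to be \(\lim_{\gamma \to 0^{+}} S(\gamma)\) where that limit exists. For the alternating sequence \(u_t = (-1)^t\), whose ordinary partial sums never settle down,

\[
S(\gamma) \;=\; \sum_{t=0}^{\infty} (-1)^t e^{-\gamma t} \;=\; \frac{1}{1 + e^{-\gamma}} \;\longrightarrow\; \frac{1}{2}
\quad \text{as } \gamma \to 0^{+} ,
\]

so the regularization assigns a finite value to a series that is otherwise ill-defined.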

Thanks for doing this! I've found it useful, and I expect that it will increase my engagement with EA Forum/LW content going forward.

"that just indicates that EDT-type reasoning is built into the plausibility of SIA"

If by this you mean "SIA is only plausible if you accept EDT," then I disagree. I think many of the arguments for SIA -- for example, "you should be 1/4 on each of tails-mon, tails-tues, heads-mon, and heads-tues in Sleeping Beauty with two wakings each, and then update to being a thirder if you learn you're not in heads-tues," "telekinesis doesn't work," "you should be one-half on not-yet-flipped fair coins," "reference classes aren't a thing," etc -- don't depend on EDT... (read more)
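For concreteness, the arithmetic behind the first of those arguments can be spelled out as follows (a sketch of the update, not a quotation from the comment): start from uniform credence over the four waking-locations in the two-wakings-each variant, then condition on learning you are not in heads-tues:

\[
P(\text{heads-mon}) = P(\text{heads-tues}) = P(\text{tails-mon}) = P(\text{tails-tues}) = \tfrac{1}{4} ,
\]

\[
P(\text{heads} \mid \lnot\,\text{heads-tues})
\;=\; \frac{P(\text{heads-mon})}{P(\text{heads-mon}) + P(\text{tails-mon}) + P(\text{tails-tues})}
\;=\; \frac{1/4}{3/4} \;=\; \tfrac{1}{3} ,
\]

which matches the standard "thirder" credence in the original version of the case, where heads comes with only one waking.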

It’s a good question, and one I considered going into in more detail on in the post (I'll add a link to this comment). I think it’s helpful to have in mind two types of people: “people who see the exact same evidence you do” (e.g., they look down at the same patterns of wrinkles on their hands, the same exact fading on the jeans they’re wearing, etc) and “people who might, for all you know about a given objective world, see the exact same evidence you do” (an example here would be “the person in room 2”). By “people in your epistemic situation,” I mean the ... (read more)

Cool, this gives me a clearer picture of where you're coming from. I had meant the central question of the post to be whether it ever makes sense to do the EDT-ish try-to-control-the-past thing, even in pretty unrealistic cases -- partly because I think answering "yes" to this is weird and disorienting in itself, even if it doesn't end up making much of a practical difference day-to-day; and partly because a central objection to EDT is that the past, being already fixed, is never controllable in any practically-relevant sense, even in e.g. Newcomb's cases.... (read more)

Not sure exactly what words people have used, but something like this idea is pretty common in the non-CDT literature, and I think e.g. MIRI explicitly talks about "controlling" things like your algorithm.

I think this is an interesting objection. E.g., "if you're into EDT ex ante, shouldn't you be into EDT ex post, and say that it was a 'good action' to learn about the Egyptians, because you learned that they were better off than you thought in expectation?" I think it depends, though, on how you are doing the ex post evaluation: and the objection doesn't work if the ex post evaluation conditions on the information you learn. 

That is, suppose that before you read Wikipedia, you were 50% that the Egyptians were at 0 welfare, and 50% that they were at 10 welfar... (read more)

2
MichaelStJules
3y
This sounds like CDT, though, by conditioning on the past. If, for Newcomb's problem, we condition on the past and so the contents of the boxes, we get that one-boxing was worse: Of course, there's something hidden here, which is that if the box that could have been empty was not empty, you could not have two-boxed (or with a weaker predictor, it's unlikely that the box wasn't empty and you would have two-boxed).

"the emphasis here seems to be much more about whether you can actually have a causal impact on the past" -- I definitely didn't mean to imply that you could have a causal impact on the past. The key point is that the type of control in question is acausal. 

I agree that many of these cases involve unrealistic assumptions, and that CDT may well be an effective heuristic most of the time (indeed, I expect that it is). 

I don't feel especially hung up on calling it "control" -- ultimately it's the decision theory (e.g., rejecting CDT) that I'm intere... (read more)

Thanks for these comments. 

Re: “physics-based priors,” I don't think I have a full sense of what you have in mind, but at a high level, I don’t yet see how physics comes into the debate. That is, AFAICT everyone agrees about the relevant physics — and in particular, that you can’t causally influence the past, “change” the past, and so on. The question as I see it (and perhaps I should’ve emphasized this more in the post, and/or put things less provocatively) is more conceptual/normative: whether when making decisions we should think of the past the wa... (read more)

3
Geoffrey Irving
3y
By “physics-based” I’m lumping together physics and history a bit, but it’s hard to disentangle them, especially when people start talking about multiverses. I generally mean “the combined information of the laws of physics and our knowledge of the past”. The reason I do want to cite physics too, even for the past case of (1), is that if you somehow disagreed about decision theorists in WW1 I’d go to the next part of the argument, which is that under the technology of WW1 we can’t do the necessary predictive control (they couldn’t build deterministic twins back then).

However, it seems like we’re mostly in agreement, and you could consider editing the post to make that more clear. The opening line of your post is “I think that you can “control” events you have no causal interaction with, including events in the past.” Now the claim is “everyone agrees about the relevant physics — and in particular, that you can’t causally influence the past”. These two sentences seem inconsistent, and especially since your piece is long and quite technical, opening with a wrong summary may confuse people. I realize you can get out of the inconsistency by leaning on the quotes, but it still seems misleading.

I'm imagining computers with sufficiently robust hardware to function deterministically at the software level, in the sense of very reliably performing the same computation, even if there's quantum randomness at a lower level. Imagine two good-quality calculators, manufactured by the same factory using the same process, which add together the same two numbers using the same algorithm, and hence very reliably move through the same high-level memory states and output the same answer. If quantum randomness makes them output different answers, I count that as a "malfunction."

3
JackM
3y
OK thanks, and I have read through now and seen that you discuss randomness in section 4. Overall a very interesting read! Out of interest, is this idea of "acausal control" entirely novel or has it/something similar been discussed by others? 

I have sympathy for responses like "look, it's just so clear that you can't control the past in any practically relevant sense that we should basically just assume the type of arguments in this post are wrong somehow." But I'm curious where you think the arguments actually go wrong, if you have a view about that? For example, do you think defecting in perfect deterministic twin prisoner's dilemmas with identical inputs is the way to go?

8
MichaelStJules
3y
I think the thought experiments you give are pretty decisive in favour of the EDT answers over the CDT answers, and I guess I would agree that we have some kind of subtle control over the past, but I would also add:

Acting and conditioning on our actions doesn't change what happened in the past; it only tells us more about it. Finding out that Ancient Egyptians were happier than you thought before doesn't make it so that they were happier than you thought before; they already observed their own welfare, and you were just ignorant of it.

While EDT would not recommend, for the sake of the Ancient Egyptians, finding out more about their welfare (the EV would be 0, since the ex ante distributions are the same) or even filtering only for positive information about their welfare (you would need to adjust your beliefs for this bias), doesn't it suggest that if you happen to find out that the Egyptians were better off than you thought, you did something good, and if you happen to find out that the Egyptians were worse off than you thought, you did something bad? If we control the past in the way you suggest in your thought experiments, do we also control it just by reading the Wikipedia page on Ancient Egyptians? Or do we only use EDT to evaluate the expected value of actions beforehand and not their actual value after the fact, or at least not in this way? And then, why does this seem absurd, but not the EDT answers to your thought experiments?

So certainly physics-based priors are a big component, and indeed in some sense are all of it.  That is, I think physics-based priors should give you an immediate answer of "you can't influence the past with high probability", and moreover that once you think through the problems in detail the conclusion will be that you could influence the past if physics were different (including boundary conditions, even if laws remain the same), but that boundary condition priors should still tell us you can't influence the past.  I'm happy to elaborate.

F... (read more)

Thanks, Richard :). Re: arbitrariness, in a sense the relevant choices might well end up arbitrary (and as you say, subjectivists need to get used to some level of unavoidable arbitrariness), but I do think that it at least seems worth trying to capture/understand some sort of felt difference between e.g. picking between Buridan's bales of hay, and choosing e.g. what career to pursue, even if you don't think there's a "right answer" in either case. 

I agree that "infallible" maybe has the wrong implications, here, though I do think that part of the puz... (read more)

I'm glad you liked it, Lukas. It does seem like an interesting question how your current confidence in your own values relates to your interest in further "idealization," of what kind, and how much convergence makes a difference. Prima facie, it does seem plausible that greater confidence speaks in favor of "conservatism" about what sorts of idealization you go in for, though I can imagine very uncertain-about-their-values people opting for conservatism, too. Indeed, it seems possible that conservatism is just generally pretty reasonable, here.

Hi Ben, 

This does seem like a helpful kind of content to include (here I think of Luke’s section on this here, in the context of his work on moral patienthood). I’ll consider revising to say more in this vein. In the meantime, here are a few updates off the top of my head:

  • It now feels more salient to me just how many AI applications may be covered by systems that either aren’t agentic planning/strategically aware (including e.g. interacting modular systems, especially where humans are in the loop for some parts, and/or intuitively “sphexish/brittl
... (read more)
2
Ben Pace
3y
Great answer, thanks.

Hi Ben, 

A few thoughts on this: 

  • It seems possible that attempting to produce “great insight” or “simple arguments of world-shattering importance” warrants a methodology different from the one I’ve used here. But my aim here is humbler: to formulate and evaluate an existing argument that I and various others take seriously, and that lots of resources are being devoted to; and to come to initial, informal, but still quantitative best-guesses about the premises and conclusion, which people can (hopefully) agree/disagree with at a somewhat fine-grain
... (read more)
2
Ben Pace
3y
Thanks for the thoughtful reply. I do think I was overestimating how robustly you're treating your numbers and premises; it seems like you're holding them all much more lightly than I think I'd been envisioning. FWIW I am more interested in engaging with some of what you wrote in your other comment than engaging on the specific probability you assign, for some of the reasons I wrote about here. I think I have more I could say on the methodology, but alas, I'm pretty blocked up with other work atm. It'd be neat to spend more time reading the report and leave more comments here sometime.

Hi Hadyn, 

Thanks for your kind words, and for reading. 

  1. Thanks for pointing out these pieces. I like the breakdown of the different dimensions of long-term vs. near-term. 
  2. Broadly, I agree with you that the document could benefit from more about premise 5. I’ll consider revising to add some.
  3. I’m definitely concerned about misuse scenarios too (and I think lines here can get blurry -- see e.g. Katja Grace’s recent post); but I wanted, in this document, to focus on misalignment in particular. The question of how to weigh misuse vs. misalignment r
... (read more)

(Continued from comment on the main thread)

I'm understanding your main points/objections in this comment as: 

  1. You think the multiple stage fallacy might be the methodological crux behind our disagreement. 
  2. You think that >80% of AI safety researchers at MIRI, FHI, CHAI, OpenAI, and DeepMind would assign >10% probability to existential catastrophe from technical problems with AI (at some point, not necessarily before 2070). So it seems like 80k saying 1-10% reflects a disagreement with the experts, which would be strange in the context of e.g.
... (read more)

The upshot seems to be that Joe, 80k, the AI researcher survey (2008), Holden-2016 are all at about a 3% estimate of AI risk, whereas AI safety researchers now are at about 30%. The latter is a bit lower (or at least differently distributed) than Rob expected, and seems higher than among Joe's advisors.

The divergence is big, but pretty explainable, because it concords with the direction that apparent biases point in. For the 3% camp, the credibility of one's name, brand, or field benefits from making lowball estimates. Whereas the 30% camp is self-select... (read more)

I think I share Robby's sense that the methodology seems like it will obscure truth.

That said, I have neither your (Joe) extensive philosophical background nor have spent substantial time like you on a report like this, and I am interested in evidence to the contrary.

To me, it seems like you've tried to lay out a series of 6 steps of an argument, that you think each very accurately carve the key parts of reality that are relevant, and pondered each step for quite a while.

When I ask myself whether I've seen something like this produce great insight, it's ha... (read more)

Hi Rob, 

Thanks for these comments. 

Let’s call “there will be an existential catastrophe from power-seeking AI before 2070” p. I’m understanding your main objections in this comment as: 

  1. It seems to you like we’re in a world where p is true, by default. Hence, 5% on p seems too low to you. In particular:
    1. It implies 95% confidence on not p, which seems to you overly confident.
    2. If p is true by default, you think the world would look like it does now; so if this world isn’t enough to get me above 5%, what would be?
    3. Because p seems true to you by def
... (read more)

Sounds right to me.  Per a conversation with Aaron a while back, I've been relying on the moderators to tag posts as personal blog, and had been assuming this one would be.

Glad to hear you found it helpful. Unfortunately, I don't think I have a lot to add at the moment re: how to actually pursue moral weighting research, beyond what I gestured at in the post (e.g., trying to solicit lots of your own/other people's intuitions across lots of cases, trying to make them consistent,  that kind of thing). Re: articles/papers/posts, you could also take a look at GiveWell's process here, and the moral weight post from Luke Muelhauser I mentioned has a few references at the end that might be helpful (though most of them I haven'... (read more)

Hi Michael — 

I meant, in the post, for the following paragraphs to address the general issue you mention: 

Some people don’t think that gratitude of this kind makes sense. Being created, we might say, can’t have been “better for” me, because if I hadn’t been created, I wouldn’t exist, and there would be no one that Wilbur’s choice was “worse for.” And if being created wasn’t better for me, the thought goes, then I shouldn’t be grateful to Wilbur for creating me.

Maybe the issues here are complicated, but at a high level: I don’t buy it. I

... (read more)

Hello Joe!

I enjoyed the McMahan/Parfit move of saying things are 'good for' without being 'better for'. I think it's clever, but I don't buy it. It seems like a linguistic sleight of hand and I don't really understand how it works.

I agree we have preferences over existing, but, well, so what? The fact I do or would have a preference does not automatically reveal what the axiological facts are. It's hard to know, even if we grant this, how it extends to not yet existing people. A present non-existing possible person doesn't have any preference, including w... (read more)

Thanks! Re: mental manipulation, do you have similar worries even granted that you’ve already been being manipulated in these ways? We can stipulate that there won’t be any increase in the manipulation in question, if you stay. One analogy might be: extreme cognitive biases that you’ve had all along. They just happen to be machine-imposed. 

That said, I don’t think this part is strictly necessary for the thought experiment, so I’m fine with folks leaving it out if it trips them up.

2
richard_ngo
3y
Yes, I think I still have these concerns; if I had extreme cognitive biases all along, then I would want them removed even if it didn't improve my understanding of the world. It feels similar to if you told me that I'd lived my whole life in a (pleasant) dreamlike fog, and I had the opportunity to wake up. Perhaps this is the same instinct that motivates meditation? I'm not sure.

Glad to hear you enjoyed it. 

I haven't engaged much with tranquilism. Glancing at that piece, I do think that the relevant notions of "craving" and "clinging" are similar; but I wouldn't say, for example, that an absence of clinging makes an experience as good as it can be for someone.

Thanks :). I haven't thought much about personal universes, but glancing at the paper, I'd expect resource-distribution, for example, to remain an issue.

Glad to hear it :)

Re: "my motivational system is broken, I'll try to fix it" as the thing to say as an externalist realist: I think this makes sense as a response. The main thing that seems weird to me is the idea that you're fundamentally "cut off" from seeing what's good about helium, even though there's nothing you don't understand about reality. But it's a weird case to imagine, and the relevant notions of "cut off" and "understanding" are tricky.

Thanks for reading. Re: your version of anti-realism: is "I should create flourishing (or whatever your endorsed theory says)" in your mouth/from your perspective true, or not truth-apt? 

To me Clippy's having or not having a moral theory doesn't seem very central. E.g., we can imagine versions in which Clippy (or some other human agent) is quite moralizing, non-specific, universal, etc about clipping, maximizing pain, or whatever.

2
richard_ngo
3y
It's not truth-apt. It has a truth-apt component (that my moral theory endorses creating flourishing). But it also has a non-truth-apt component, namely "hooray my moral theory". I think this gets you a lot of the benefits of cognitivism, while also distinguishing moral talk from standard truth-apt claims about my or other people's preferences (which seems important, because agreeing that "Clippy was right when it said it should clip" feels very different from agreeing that "Clippy wants to clip"). I can see how this was confusing in the original comment; sorry about that. I think the intuition that Clippy's position is very different from ours starts to weaken if Clippy has a moral theory. For example, at that point we might be able to reason with Clippy and say things like "well, would you want to be in pain?", etc. It may even (optimistically) be the case that properties like non-specificity and universality are strong enough that any rational agent which strongly subscribes to them will end up with a reasonable moral system. But you're right that it's somewhat non-central, in that the main thrust of my argument doesn't depend on it.