EDIT: I'm only going to answer a few more questions, due to time constraints. I might eventually come back and answer more. I still appreciate getting replies with people's thoughts on things I've written.

I'm going to do an AMA on Tuesday next week (November 19th). Below I've written a brief description of what I'm doing at the moment. Ask any questions you like; I'll respond to as many as I can on Tuesday.

Although I'm eager to discuss MIRI-related things in this AMA, my replies will represent my own views rather than MIRI's, and as a rule I won't be running my answers by anyone else at MIRI. Think of it as a relatively candid and informal Q&A session, rather than anything polished or definitive.


I'm a researcher at MIRI. At MIRI I divide my time roughly equally between technical work and recruitment/outreach work.

On the recruitment/outreach side, I do things like the following:

- For the AI Risk for Computer Scientists workshops (which are slightly badly named; we accept some technical people who aren't computer scientists), I handle the intake of participants, and also teach classes and lead discussions on AI risk at the workshops.
- I do most of the technical interviewing for engineering roles at MIRI.
- I manage the AI Safety Retraining Program, in which MIRI gives grants to people to study ML for three months with the goal of making it easier for them to transition into working on AI safety.
- I sometimes do weird things like going on a Slate Star Codex roadtrip, where I led a group of EAs on a five-day trip along the East Coast, attending Slate Star Codex meetups and visiting EA groups.

On the technical side, I mostly work on some of our nondisclosed-by-default technical research; this involves thinking about various kinds of math and implementing things related to the math. Because the work isn't public, there are many questions about it that I can't answer. But this is my problem, not yours; feel free to ask whatever questions you like and I'll take responsibility for choosing to answer or not.


Here are some things I've been thinking about recently:

- I think that the field of AI safety is growing in an awkward way. Lots of people are trying to work on it, and many of these people have pretty different pictures of what the problem is and how we should try to work on it. How should we handle this? How should you try to work in a field when at least half the "experts" are going to think that your research direction is misguided?
- The AIRCS workshops that I'm involved with contain a variety of material which attempts to help participants think about the world more effectively. I have thoughts about what's useful and not useful about rationality training.
- I have various crazy ideas about EA outreach. I think the SSC roadtrip was good; I think some EAs who work at EA orgs should consider doing "residencies" in cities without much full-time EA presence, where they mostly do their normal job but also talk to people.


230 comments

Reading through some of your blog posts and other writing, I get the impression that you put a lot of weight on how smart people seem to you. You often describe people or ideas as "smart" or "dumb," and you seem interested in finding the smartest people to talk to or bring into EA.

I am feeling a bit confused by my reactions. I think I am both a) excited by the idea of getting the "smart people" together so that they can help each other think through complicated topics and make more good things happen, and b) a bit sad and left out that I am probably not one of the smart people.

Curious about your thoughts on a few things related to this... I'll put my questions as separate comments below.

2) Somewhat relatedly, there seems to be a lot of angst within EA related to intelligence / power / funding / jobs / respect / social status / etc., and I am curious if you have any interesting thoughts about that.

I feel really sad about it. I think EA should probably have a communication strategy where we say relatively simple messages like "we think talented college graduates should do X and Y", but this causes collateral damage where people who don't succeed at doing X and Y feel bad about themselves. I don't know what to do about this, except to say that I have the utmost respect in my heart for people who really want to do the right thing and are trying their best.

I don't think I have very coherent or reasoned thoughts on how we should handle this, and I try to defer to people I trust whose judgement on these topics seems better than mine.

If you feel comfortable sharing: who are the people whose judgment on this topic you think is better?

1) Do you have any advice for people who want to be involved in EA, but do not think that they are smart or committed enough to be engaging at your level? Do you think there are good roles for such people in this community / movement / whatever? If so, what are those roles?

I used to expect 80,000 Hours to tell me how to have an impactful career. Recently, I've started thinking it's basically my own personal responsibility to figure it out. I think this shift has made me much happier and much more likely to have an impactful career.

80,000 Hours targets the most professionally successful people in the world. That's probably the right idea for them - giving good career advice takes a lot of time and effort, and they can't help everyone, so they should focus on the people with the most career potential.

But, unfortunately for most EAs (myself included), the nine priority career paths recommended by 80,000 Hours are some of the most difficult and competitive careers in the world. If you’re among the 99% of people who are not Google programmer / top half of Oxford / Top 30 PhD-level talented, I’d guess you have slim-to-none odds of succeeding in any of them. The advice just isn't tailored for you.

So how can the vast majority of people have an impactful career? My best answer: A lot of independent thought and planning. Your own personal brainstorming and reading and asking around and exploring, not just following stoc... (read more)

Hi Aidan,

I’m Brenton from 80,000 Hours - thanks for writing this up! It seems really important that people don’t think of us as “tell[ing] them how to have an impactful career”. It sounds absolutely right to me that having a high impact career requires “a lot of independent thought and planning” - career advice can’t be universally applied.

I did have a few thoughts, which you could consider incorporating if you end up making a top level post. The most substantive two are:

  1. Many of the priority paths are broader than you might be thinking.
  2. A significant amount of our advice is designed to help people think through how to approach their careers, and will be useful regardless of whether they’re aiming for a priority path.

Many of the priority paths are broader than you might be thinking:

Most people won’t be able to step into an especially high impact role directly out of undergrad, so unsurprisingly, many of the priority paths require people to build up career capital before they can get into high impact positions. We’d think of people who are building up career capital focused on (say) AI policy as being ‘on ... (read more)

I think this comment is really lovely, and a very timely message. I'd support it being turned into a top-level post so more people can see it, especially if you have anything more to add.

Thank you both very much, I will do that, and I almost definitely wouldn't have without your encouragement.

If anyone has more thoughts on the topic, please comment or reach out to me, I'd love to incorporate them into the top-level post.

I agree this is a very helpful comment. I would add: these roles, in my view, are not *lesser* in any sense, for a range of reasons, and I would encourage people not to think of them in those terms.

  • You might have a bigger impact on the margins being the only - or one of the first few - people thinking in EA terms in a philanthropic foundation than by adding to the pool of excellence at OpenPhil. This goes for any role that involves influencing how resources are allocated - which is a LOT, in charity, government, industry, academic foundations etc.
  • You may not be in the presidential cabinet, or a spad (special adviser) to the UK prime minister, but those people are supported and enabled by people building up resources, capacity, and Overton-window expansion elsewhere in government and the civil service. The 'senior person' on their own may not be able to gain traction with key policy ideas and influence.
  • A lot of xrisk research, from biosecurity to climate change, draws on and depends on a huge body of work on biology, public policy, climate science, renewable energy, insulation in homes, and much more. Often there are gaps in research on extreme scenarios due to lack of incentives for this kind
... (read more)
But the marginal impact of becoming a star striker is so high! (Just kidding – this is a great analogy & highlights a big problem with reasoning on the margin + focusing on maximizing individual impact.)

I also like the analogy; let's run with it. Suppose I'm reasoning from the point of view of the movement as a whole, and we're trying to put together a soccer team. Suppose also that there are two types of positions, midfield and striker. I'm not sure if this is true for strikers in what I would call soccer, but suppose the striker position has a higher skillcap than midfield.[1] I'll define skillcap as the amount of skill at the position before the returns begin to diminish.

[1] Where skill is some product of standard deviations of innate skill and hours practiced.

Back to the problem of putting together a soccer team: if you're starting with a bunch of players of unknown innate skill, you would get a higher expected value by telling 80% of your players to train as strikers and 20% as midfielders. Because you have a smaller pool, your midfielders will have less innate talent for the position. You can afford to lose this, however, as the effect will be small compared to the gain in increased striker performance.

That's not to say that you should fill your entire team with wannabe strikers. When you select your team you'll undoubtedly leave out some very dedicated strikers in favor

... (read more)
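The pool-size argument in the parent comment can be sketched as a quick simulation (a toy model; the talent distribution, skill caps, and team sizes are all invented for illustration):

```python
import random
import statistics

def team_value(n_candidates, frac_strikers, n_strikers=3, n_mids=3,
               hours=5.0, striker_cap=10.0, mid_cap=4.0):
    """Toy model: skill = innate talent * hours practiced, but each
    position's contribution is capped at its 'skillcap'. Midfield's low
    cap means extra talent there is mostly wasted, so enlarging the
    striker pool buys more than enlarging the midfield pool."""
    talent = [random.gauss(1.0, 0.3) for _ in range(n_candidates)]
    n_s = int(n_candidates * frac_strikers)
    striker_pool, mid_pool = talent[:n_s], talent[n_s:]
    best_strikers = sorted(striker_pool, reverse=True)[:n_strikers]
    best_mids = sorted(mid_pool, reverse=True)[:n_mids]
    return (sum(min(t * hours, striker_cap) for t in best_strikers)
            + sum(min(t * hours, mid_cap) for t in best_mids))

random.seed(0)
for frac in (0.5, 0.8):
    mean_v = statistics.mean(team_value(40, frac) for _ in range(2000))
    print(f"{frac:.0%} strikers -> mean team value {mean_v:.2f}")
```

Under these made-up parameters the 80/20 split reliably beats 50/50, for exactly the reason given above: the midfielders are usually at their cap anyway, so shrinking their pool costs little.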
Yeah, I think this model misses that people who are aiming to be strikers tend to have pretty different dispositions than people aiming to be midfielders. (And so filling a team mostly with intending-to-be-strikers could have weird effects on team cohesion & function.) Interesting to think about how Delta Force, SEAL Team Six, etc. manage this, as they select for very high-performing recruits (all strikers) then meld them into cohesive teams. I believe they do it via:

1. having a very large recruitment pool
2. intense filtering out of people who don't meet their criteria
3. breaking people down psychologically + cultivating conformity during training

I found it interesting to cash this out more... thanks!
JP Addison (3y):
Ah, so like, in the "real world", you don't have a set of people, you end up recruiting a training class of 80% would-be-strikers, which influences the culture compared to if you recruited for the same breakdown as the eventually-selected-team?

I really enjoy the extent to which you've both taken the ball and run with it ;)

I think a lot of this is right and important, but I especially love:

Don't let the fact that Bill Gates saved a million lives keep you from saving one.

We're all doing the best we can with the privileges we were blessed with.

"Do you have any advice for people who want to be involved in EA, but do not think that they are smart or committed enough to be engaging at your level?"--I just want to say that I wouldn't have phrased it quite like that.

One role that I've been excited about recently is making local groups be good. I think that having better local EA communities might be really helpful for outreach, and lots of different people can do great work with this.

"...but do not think that they are smart or committed enough to be engaging at your level?" was intended to be from a generic insecure (or realistic) EA's perspective, not yours. Sorry for my confusing phrasing.

4) You seem like you have had a natural strong critical thinking streak since you were quite young (e.g., you talk about thinking that various mainstream ideas were dumb). Any unique advice for how to develop this skill in people who do not have it naturally?

For the record, I think that I had mediocre judgement in the past and did not reliably believe true things, and I sometimes made really foolish decisions. I think my experience is mostly that I felt extremely alienated from society, which meant that I looked more critically at many common beliefs than most people do. This meant I was weird in lots of ways, many of which were bad and some of which were good. And in some cases this meant that I believed some weird things that feel like easy wins, eg by thinking that people were absurdly callous about causing animal suffering.

My judgement improved a lot from spending a lot of time in places with people with good judgement who I could learn from, eg Stanford EA, Triplebyte, the more general EA and rationalist community, and now MIRI.

I feel pretty unqualified to give advice on critical thinking, but here are some possible ideas, which probably aren't actually good:

  • Try to learn simple models of the world and practice applying them to claims you hear, and then being confused when they don't match. Eg learn introductory microeconomics and then whenever you hear a claim about the world that intro micro has an opinion on, try t
... (read more)
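The intro-micro suggestion in the bullet above can be made concrete with a toy supply-and-demand model (the linear forms and all the numbers are invented for illustration):

```python
# Linear demand Qd = a - b*p and supply Qs = c + d*p.

def equilibrium(a, b, c, d):
    """Price and quantity where demand meets supply."""
    p = (a - c) / (b + d)
    return p, a - b * p

a, b, c, d = 100.0, 2.0, 10.0, 1.0
p_star, q_star = equilibrium(a, b, c, d)
print(f"equilibrium price {p_star}, quantity {q_star}")  # 30.0 and 40.0

# Applying the model to a claim like "a price ceiling helps buyers":
# a binding ceiling below p* makes quantity demanded exceed quantity
# supplied, i.e. a shortage -- the kind of mismatch between a claim and
# a simple model that's worth noticing and being confused by.
ceiling = 20.0
shortage = (a - b * ceiling) - (c + d * ceiling)
print(f"shortage at ceiling {ceiling}: {shortage}")  # 30.0
```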

3) I've seen several places where you criticize fellow EAs for their lack of engagement or critical thinking. For example, three years ago, you wrote:

I also have criticisms about EAs being overconfident and acting as if they know way more than they do about a wide variety of things, but my criticisms are very different from [Holden's criticisms]. For example, I’m super unimpressed that so many EAs didn’t know that GiveWell thinks that deworming has a relatively low probability of very high impact. I’m also unimpressed by how many people are incredibly confident that animals aren’t morally relevant despite knowing very little about the topic.

Do you think this has improved at all? And what are the current things that you are annoyed most EAs do not seem to know or engage with?

I no longer feel annoyed about this. I'm not quite sure why. Part of it is probably that I'm a lot more sympathetic when EAs don't know things about AI safety than global poverty, because learning about AI safety seems much harder, and I think I hear relatively more discussion of AI safety now compared to three years ago.

One hypothesis is that 80000 Hours has made various EA ideas more accessible and well-known within the community, via their podcast and maybe their articles.

What evidence would persuade you that further work on AI safety is unnecessary?

I’m going to instead answer the question “What evidence would persuade you that further work on AI safety is low value compared to other things?”

Note that a lot of my beliefs here disagree substantially with my coworkers.

I’m going to split the answer into two steps: what situations could we be in such that I thought we should deprioritize AI safety work, and for each of those, what could I learn that would persuade me we were in them.

Situations in which AI safety work looks much less valuable:

  • We’ve already built superintelligence, in which case the problem is moot
    • Seems like this would be pretty obvious if it happened
  • We have clear plans for how to align AI that work even when it’s superintelligent, and we don’t think that we need to do more work in order to make these plans more competitive or easier for leading AGI projects to adopt.
    • What would persuade me of this:
      • I’m not sure what evidence would be required for me to be inside-view persuaded of this. I find it kind of hard to be inside view persuaded, for the same reason that I find it hard to imagine being persuaded that an operating system is secure.
      • But I can imagine what it
... (read more)

Thanks, that's really interesting! I was especially surprised by "If I thought there was a <30% chance of AGI within 50 years, I'd probably not be working on AI safety."

Yeah, I think that a lot of EAs working on AI safety feel similarly to me about this.

I expect the world to change pretty radically over the next 100 years, and I probably want to work on the radical change that's going to matter first. So compared to the average educated American I have shorter AI timelines but also shorter timelines to the world becoming radically different for other reasons.

If I thought there was a <30% chance of AGI within 50 years, I'd probably not be working on AI safety.
I expect the world to change pretty radically over the next 100 years.

I find these statements surprising, and would be keen to hear more about this from you. I suppose that the latter goes a long way towards explaining the former. Personally, there are few technologies that I think are likely to radically change the world within the next 100 years (assuming that your definition of radical is similar to mine). Maybe the only ones that would really qualify are bioengineering and nanotech. Even in those fields, though, I expect the pace of change to be fairly slow if AI isn't heavily involved.

(For reference, while I assign more than 30% credence to AGI within 50 years, it's not that much more).

Yeah, I suspect you're right. I think there are a couple more radically transformative technologies which I think are reasonably likely over the next hundred years, eg whole brain emulation. And I suspect we disagree about the expected pace of change with bioengineering and maybe nanotech.
Huh, my impression was that the most plausible s-risks we can sort-of-specifically foresee are AI alignment problems - do you disagree? Or is this statement referring to s-risks as a class of black swans for which we don't currently have specific imaginable scenarios, but if those scenarios became more identifiable you would consider working on them instead?

Most of them are related to AI alignment problems, but it's possible that I should work specifically on them rather than other parts of AI alignment.

An s-risk could occur via a moral failure, which could happen even if we knew how to align our AIs.

[comment deleted] (3y)
Donald Hobson (3y):
If no more AI safety work is necessary, that means that there is nothing we can do to significantly increase the chance of FAI over UFAI.

I could be almost certain that FAI would win because I had already built one. Although I suspect that there will be double-checking to do: the new FAI will need to be told what friendly behavior is, someone should keep an eye out for any UFAI, etc. So FAI work will be needed until the point where no human labor is needed and we are all living in a utopia.

I could be almost certain that UFAI will win. I could see lots of people working on really scary systems and still not have the slightest idea of how to make anything friendly. But there would still be a chance that those systems didn't scale to superintelligence, that the people running them could be persuaded to turn them off, and that someone might come up with a brilliant alignment scheme tomorrow. Circumstances where you can see that you are utterly screwed, yet still be alive, seem unlikely. Keep working until the nanites turn you into paperclips.

Alternatively, it might be clear that we aren't getting any AI any time soon. The most likely cause of this would be a pretty serious disaster. It would have to destroy most of humanity's technical ability and stop us rebuilding it. If AI alignment is something that we will need to do in a few hundred years, once we rebuild society enough to make silicon chips, it's still probably worth having someone making sure that progress isn't forgotten, and that the problem will be solved in time.

We might gain some philosophical insight that says that AI is inherently good, always evil, impossible, etc. It's hard to imagine what a philosophical insight that you don't have is like.

In the 80k podcast episode with Hilary Greaves, she talks about decision theory and says:

Hilary Greaves: Then as many of your listeners will know, in the space of AI research, people have been throwing around terms like ‘functional decision theory’ and ‘timeless decision theory’ and ‘updateless decision theory’. I think it’s a lot less clear exactly what these putative alternatives are supposed to be. The literature on those kinds of decision theories hasn’t been written up with the level of precision and rigor that characterizes the discussion of causal and evidential decision theory. So it’s a little bit unclear, at least to my likes, whether there’s genuinely a competitor to decision theory on the table there, or just some intriguing ideas that might one day in the future lead to a rigorous alternative.

I take from this that MIRI has little engagement with academia. What is more troubling for me is that the cases for the major decision theories seem to be looked upon with skepticism by academic experts.

Do you think that is really the case? How do you respond to that? It would personally feel much better if I knew that there are some academic decision

... (read more)

Yeah, this is an interesting question.

I’m not really sure what’s going on here. When I read critiques of MIRI-style decision theories (eg from Will or from Wolfgang Schwarz), I feel very unpersuaded by them. This leaves me in a situation where my inside views disagree with the views of the most obvious class of experts, which is always tricky.

  • When I read those criticisms by Will MacAskill and Wolfgang Schwarz, I feel like I understand their criticisms and find them unpersuasive, as opposed to not understanding their criticisms. Also, I feel like they don’t understand some of the arguments and motivations for FDT. I feel a lot better disagreeing with experts when I think I understand their arguments and when I think I can see particular mistakes that they’re making. (It’s not obvious that this is the right epistemic strategy, for reasons well articulated by Gregory Lewis here.)
  • Paul’s comments on this resolved some of my concerns here. He thinks that the disagreement is mostly about what questions decision theory should be answering. He thinks that the updateless decision theories are obviously more suitable to building AI than eg CDT or ED
... (read more)

I think it’s plausible that Paul is being overly charitable to decision theorists; I’d love to hear whether skeptics of updateless decision theories actually agree that you shouldn’t build a CDT agent.

FWIW, I could probably be described as a "skeptic" of updateless decision theories; I’m pretty sympathetic to CDT. But I also don’t think we should build AI systems that consistently take the actions recommended by CDT. I know at least a few other people who favor CDT, but again (although small sample size) I don’t think any of them advocate for designing AI systems that consistently act in accordance with CDT.

I think the main thing that’s going on here is that academic decision theorists are primarily interested in normative principles. They’re mostly asking the question: “What criterion determines whether or not a decision is ‘rational’?” For example, standard CDT claims that an action is rational only if it’s the action that can be expected to cause the largest increase in value.

On the other hand, AI safety researchers seem to be mainly interested in a different question: “What sort of algorithm would it be rational for us to build into an AI system?” The first question doesn’t

... (read more)

The comments here have been very ecumenical, but I'd like to propose a different account of the philosophy/AI divide on decision theory:

1. "What makes a decision 'good' if the decision happens inside an AI?" and "What makes a decision 'good' if the decision happens inside a brain?" aren't orthogonal questions, or even all that different; they're two different ways of posing the same question.

MIRI's AI work is properly thought of as part of the "success-first decision theory" approach in academic decision theory, described by Greene (2018) (who also cites past proponents of this way of doing decision theory):

[...] Consider a theory that allows the agents who employ it to end up rich in worlds containing both classic and transparent Newcomb Problems. This type of theory is motivated by the desire to draw a tighter connection between rationality and success, rather than to support any particular account of expected utility. We might refer to this type of theory as a "success-first" decision theory.
[...] The desire to create a closer connection between rationality and success than that offered by standard d
... (read more)
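For readers who haven't seen the arithmetic behind the "success-first" framing, here is the standard Newcomb expected-value calculation (a toy sketch; the predictor-accuracy parameter and the conventional payoffs are supplied here for illustration, not taken from this thread):

```python
# Newcomb's problem with a predictor of accuracy p: the opaque box
# contains $1,000,000 iff the predictor foresaw one-boxing; the
# transparent box always contains $1,000.

def ev_one_box(p):
    # One-boxers get the million whenever the prediction was right.
    return p * 1_000_000

def ev_two_box(p):
    # Two-boxers get the thousand for sure, plus the million only
    # when the predictor wrongly expected one-boxing.
    return (1 - p) * 1_000_000 + 1_000

for p in (0.99, 0.9, 0.51):
    print(p, ev_one_box(p), ev_two_box(p))
# At p = 0.99, one-boxing expects $990,000 vs roughly $11,000 for
# two-boxing -- the sense in which one-boxers "end up rich", even
# though CDT notes that two-boxing causally dominates once the boxes
# are filled.
```

For any p above about 0.5005, one-boxers do better in expectation, which is the kind of rationality-success connection the quoted passage describes.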

"What makes a decision 'good' if the decision happens inside an AI?" and "What makes a decision 'good' if the decision happens inside a brain?" aren't orthogonal questions, or even all that different; they're two different ways of posing the same question.

I actually agree with you about this. I have in mind a different distinction, although I might not be explaining it well.

Here’s another go:

Let’s suppose that some decisions are rational and others aren’t. We can then ask: What is it that makes a decision rational? What are the necessary and/or sufficient conditions? I think that this is the question that philosophers are typically trying to answer. The phrase “decision theory” in this context typically refers to a claim about necessary and/or sufficient conditions for a decision being rational. To use different jargon, in this context a “decision theory” refers to a proposed “criterion of rightness.”

When philosophers talk about “CDT,” for example, they are typically talking about a proposed criterion of rightness. Specifically, in this context, “CDT” is the claim that a decision is rational only if taking it would cause the largest expected increase in value. To avoid any ambig

... (read more)

I agree that these three distinctions are important:

  • "Picking policies based on whether they satisfy a criterion X" vs. "Picking policies that happen to satisfy a criterion X". (E.g., trying to pick a utilitarian policy vs. unintentionally behaving utilitarianly while trying to do something else.)
  • "Trying to follow a decision rule Y 'directly' or 'on the object level'" vs. "Trying to follow a decision rule Y by following some other decision rule Z that you think satisfies Y". (E.g., trying to naïvely follow utilitarianism without any assistance from sub-rules, heuristics, or self-modifications, vs. trying to follow utilitarianism by following other rules or mental habits you've come up with that you expected to make you better at selecting utilitarianism-endorsed actions.)
  • "A decision rule that prescribes outputting some action or policy and doesn't care how you do it" vs. "A decision rule that prescribes following a particular set of cognitive steps that will then output some action or policy". (E.g., a rule that says 'maximize the aggregate welfare of moral patients' vs. a specif
... (read more)
Ben Garfinkel (3y):
The second distinction here is most closely related to the one I have in mind, although I wouldn’t say it’s the same. Another way to express the distinction I have in mind is that it’s between (a) a normative claim and (b) a process of making decisions. “Hedonistic utilitarianism is correct” would be a non-decision-theoretic example of (a). “Making decisions on the basis of coinflips” would be an example of (b). In the context of decision theory, of course, I am thinking of R_CDT as an example of (a) and P_CDT as an example of (b). I now have the sense I’m probably not doing a good job of communicating what I have in mind, though.

I guess my view here is that exploring normative claims will probably only be pretty indirectly useful for understanding “how decision-making works,” since normative claims don’t typically seem to have any empirical/mathematical/etc. implications. For example, to again use a non-decision-theoretic example, I don’t think that learning that hedonistic utilitarianism is true would give us much insight into the computer science or cognitive science of decision-making. Although we might have different intuitions here.

I agree that this is a worthwhile goal and that philosophers can probably contribute to it. I guess I’m just not sure that the question that most academic decision theorists are trying to answer -- and the literature they’ve produced on it -- will ultimately be very relevant.

The fact that R_CDT is “self-effacing” -- i.e. the fact that it doesn’t always recommend following P_CDT -- definitely does seem like a point of intuitive evidence against R_CDT. But I think R_UDT also has an important point in its disfavor. It fails to satisfy what might be called the “Don’t Make Things Worse Principle,” which says that: It’s not rational to take decisions that will definitely make things worse. Will’s Bomb case [https://www.lesswrong.com/posts/ySLYSsNeFL5CoAQzN/a-critique-of-functional-decision-theory] is an example of a case where R
I think "Don't Make Things Worse" is a plausible principle at first glance. One argument against this principle is that CDT endorses following it if you must, but would prefer to self-modify to stop following it (since doing so has higher expected causal utility). The general policy of following the "Don't Make Things Worse Principle" makes things worse.

Once you've already adopted son-of-CDT, which says something like "act like UDT in future dilemmas insofar as the correlations were produced after I adopted this rule, but act like CDT in those dilemmas insofar as the correlations were produced before I adopted this rule", it's not clear to me why you wouldn't just go: "Oh. CDT has lost the thing I thought made it appealing in the first place, this 'Don't Make Things Worse' feature. If we're going to end up stuck with UDT plus extra theoretical ugliness and loss-of-utility tacked on top, then why not just switch to UDT full stop?"

A more general argument against the Bomb intuition pump is that it involves trading away larger amounts of utility in most possible world-states, in order to get a smaller amount of utility in the Bomb world-state. From Abram Demski's comments [https://www.lesswrong.com/posts/ySLYSsNeFL5CoAQzN/a-critique-of-functional-decision-theory#X4CLhPpf9zecdGxAq]: And: [https://www.lesswrong.com/posts/ySLYSsNeFL5CoAQzN/a-critique-of-functional-decision-theory#y8zRwcpNeu2ZhM3yE]
Ben Garfinkel (3y):
This just seems to be the point that R_CDT is self-effacing: It says that people should not follow P_CDT, because following other decision procedures will produce better outcomes in expectation. I definitely agree that R_CDT is self-effacing in this way (at least in certain scenarios). The question is just whether self-effacingness or failure to satisfy "Don't Make Things Worse" is more relevant when trying to judge the likelihood of a criterion of rightness being correct. I'm not sure whether it's possible to do much here other than present personal intuitions. The point that R_UDT violates the "Don't Make Things Worse" principle only infrequently seems relevant, but I'm still not sure this changes my intuitions very much.

I may just be missing something, but I don't see what this theoretical ugliness is. And I don't intuitively find the ugliness/elegance of the decision procedure recommended by a criterion of rightness to be very relevant when trying to judge whether the criterion is correct.

[[EDIT: Just an extra thought on the fact that R_CDT is self-effacing. My impression is that self-effacingness is typically regarded as a relatively weak reason to reject a moral theory. For example, a lot of people regard utilitarianism as self-effacing both because it's costly to directly evaluate the utility produced by actions and because others often react poorly to people who engage in utilitarian-style reasoning -- but this typically isn't regarded as a slam-dunk reason to believe that utilitarianism is false. I think the SEP article on consequentialism [https://plato.stanford.edu/entries/consequentialism/] is expressing a pretty mainstream position when it says: "[T]here is nothing incoherent about proposing a decision procedure that is separate from one’s criterion of the right.... Criteria can, thus, be self-effacing without being self-refuting." Insofar as people don't tend to buy self-effacingness as a slam-dunk argument against the truth of moral theories
Sorry to drop in in the middle of this back and forth, but I am curious -- do you think it's quite likely that there is a single criterion of rightness that is objectively "correct"? It seems to me that we have a number of intuitive properties (meta criteria of rightness?) that we would like a criterion of rightness to satisfy (e.g. "don't make things worse", or "don't be self-effacing"). And so far there doesn't seem to be any single criterion that satisfies all of them. So why not just conclude that, similar to the case with voting and Arrow's theorem [https://en.wikipedia.org/wiki/Arrow%27s_impossibility_theorem], perhaps there's just no single perfect criterion of rightness. In other words, once we agree that CDT doesn't make things worse, but that UDT is better as a general policy, is there anything left to argue about regarding which is "correct"? EDIT: Decided I had better go and read your Realism and Rationality [https://www.lesswrong.com/posts/GqTeChFnXdJzDzbMd/realism-and-rationality-2] post, and ended up leaving a lengthy comment there [https://www.lesswrong.com/posts/GqTeChFnXdJzDzbMd/realism-and-rationality-2#QMHYqC5sqXT4xhxKq].
Ben Garfinkel · 3y
Happy to be dropped in on :) I think it's totally conceivable that no criterion of rightness is correct (e.g. because the concept of a "criterion of rightness" turns out to be some spooky bit of nonsense that doesn't really map onto anything in the real world.) I suppose the main things I'm arguing are just that:

1. When a philosopher expresses support for a "decision theory," they are typically saying that they believe some claim about what the correct criterion of rightness is.
2. Claims about the correct criterion of rightness are distinct from decision procedures.
3. Therefore, when a member of the rationalist community uses the word "decision theory" to refer to a decision procedure, they are talking about something that's pretty conceptually distinct from what philosophers typically have in mind. Discussions about what decision procedure performs best or about what decision procedure we should build into future AI systems [[EDIT: or what decision procedure most closely matches our preferences about decision procedures]] don't directly speak to the questions that most academic "decision theorists" are actually debating with one another.

I also think that, conditional on there being a correct criterion of rightness, R_CDT is more plausible than R_UDT. But this is a relatively tentative view. I'm definitely not a super hardcore R_CDT believer. I guess here -- in almost definitely too many words -- is how I think about the issue here. (Hopefully these comments are at least somewhat responsive to your question.) It seems like the following general situation is pretty common: Someone is initially inclined to think that anything with property P will also have properties Q1 and Q2. But then they realize that properties Q1 and Q2 are inconsistent with one another. One possible reaction to this situation is to conclude that nothing actually has property P. Maybe the idea of property P isn't even conceptually coherent and we
Thanks! This is helpful. I think I disagree with the claim (or implication) that keeping P is more often more natural. Well, you're just saying it's "often" natural, and I suppose it's natural in some cases and not others. But I think we may disagree on how often it's natural, though that's hard to say at this very abstract level. (Did you see my comment [https://www.lesswrong.com/posts/GqTeChFnXdJzDzbMd/realism-and-rationality-2#QMHYqC5sqXT4xhxKq] in response to your Realism and Rationality post?) In particular, I'm curious what makes you optimistic about finding a "correct" criterion of rightness. In the case of the politician, it seems clear that learning they don't have some of the properties you thought they had shouldn't call into question whether they exist at all. But for the case of a criterion of rightness, my intuition (informed by the style of thinking in my comment [https://www.lesswrong.com/posts/GqTeChFnXdJzDzbMd/realism-and-rationality-2#QMHYqC5sqXT4xhxKq]), is that there's no particular reason to think there should be one criterion that obviously fits the bill. Your intuition seems to be the opposite, and I'm not sure I understand why. My best guess, particularly informed by reading through footnote 15 on your Realism and Rationality [https://www.lesswrong.com/posts/GqTeChFnXdJzDzbMd/realism-and-rationality-2#fn-g2Qudp6we2KJiP9hr-15] post, is that when faced with ethical dilemmas (like your torture vs lollipop examples), it seems like there is a correct answer. Does that seem right? (I realize at this point we're talking about intuitions and priors on a pretty abstract level, so it may be hard to give a good answer.)
Ben Garfinkel · 3y
Hey again! I appreciated your comment on the LW post. I started writing up a response to this comment and your LW one, back when the thread was still active, and then stopped because it had become obscenely long. Then I ended up badly needing to procrastinate doing something else today. So here’s an over-long document [https://docs.google.com/document/d/1AOpIeU_vIqxxwviysBKZLnHnzhjwZGMPCN6oHu5Bmeg/edit?usp=sharing] I probably shouldn’t have written, which you are under no social obligation to read.
Thanks! Just read it. I think there's a key piece of your thinking that I don't quite understand / disagree with, and it's the idea that normativity is irreducible. I think I follow you that if normativity were irreducible, then it wouldn't be a good candidate for abandonment or revision. But that seems almost like begging the question. I don't understand why it's irreducible. Suppose normativity is not actually one thing, but is a jumble of 15 overlapping things that sometimes come apart. This doesn't seem like it poses any challenge to your intuitions from footnote 6 in the document (starting with "I personally care a lot about the question: 'Is there anything I should do, and, if so, what?'"). And at the same time it explains why there are weird edge cases where the concept seems to break down. So few things in life seem to be irreducible. (E.g. neither Eric nor Ben is irreducible!) So why would normativity be? [You also should feel under no social obligation to respond, though it would be fun to discuss this the next time we find ourselves at the same party, should such a situation arise.]
This is a good discussion! Ben, thank you for inspiring so many of these different paths we've been going down. :) At some point the hydra will have to stop growing, but I do think the intuitions you've been sharing are widespread enough that it's very worthwhile to have public discussion on these points. On the contrary:

* MIRI is more interested in identifying generalizations about good reasoning ("criteria of rightness") than in fully specifying a particular algorithm.
* MIRI does discuss decision algorithms in order to better understand decision-making, but this isn't different in kind from the ordinary way decision theorists hash things out. E.g., the traditional formulation of CDT is underspecified in dilemmas like Death in Damascus [https://intelligence.org/files/DeathInDamascus.pdf]. Joyce and Arntzenius' response to this wasn't to go "algorithms are uncouth in our field"; it was to propose step-by-step procedures that they think capture the intuitions behind CDT and give satisfying recommendations for how to act.
* MIRI does discuss "what decision procedure performs best", but this isn't any different from traditional arguments in the field like "naive EDT is wrong because it performs poorly in the smoking lesion problem". Compared to the average decision theorist, the average rationalist puts somewhat more weight on some considerations and less weight on others, but this isn't different in kind from the ordinary disagreements that motivate different views within academic decision theory, and these disagreements about what weight to give categories of consideration are themselves amenable to argument.
* As I noted above [https://forum.effectivealtruism.org/posts/tDk57GhrdK54TWzPY/i-m-buck-shlegeris-i-do-research-and-outreach-at-miri-ama#o5j7WNsykEJBMfk8r], MIRI is primarily interested in decision theory for the sake of better understanding the nature of intelligence, optimization, embedded agency,
I mostly agree with this. I think the disagreement between CDT and FDT/UDT advocates is less about definitions, and more about which of these things feels more compelling:

1. On the whole, FDT/UDT ends up with more utility. (I think this intuition tends to hold more force with people the more emotionally salient "more utility" is to you. E.g., consider a version of Newcomb's problem where two-boxing gets you $100, while one-boxing gets you $100,000 and saves your child's life.)
2. I'm not the slave of my decision theory, or of the predictor, or of any environmental factor; I can freely choose to do anything in any dilemma, and by choosing to not leave money on the table (e.g., in a transparent Newcomb problem with a 1% chance of predictor failure where I've already observed that the second box is empty), I'm "getting away with something" and getting free utility that the FDT agent would miss out on. (I think this intuition tends to hold more force with people the more emotionally salient it is to imagine the dollars sitting right there in front of you and you knowing that it's "too late" for one-boxing to get you any more utility in this world.)

There are other considerations too, like how much it matters to you that CDT isn't self-endorsing. CDT prescribes self-modifying in all future dilemmas so that you behave in a more UDT-like way. It's fine to say that you personally lack the willpower to follow through once you actually get into the dilemma and see the boxes sitting in front of you; but it's still the case that a sufficiently disciplined and foresightful CDT agent will generally end up behaving like FDT in the very dilemmas that have been cited to argue for CDT. If a more disciplined and well-prepared version of you would have one-boxed, then isn't there something off about saying that two-boxing is in any sense "correct"? Even the act of praising CDT seems a bit self-destructive here, inasmuch as (a) CDT prescribes ditching CDT, an
My impression is that most CDT advocates who know about FDT think FDT is making some kind of epistemic mistake, where the most popular candidate (I think) is some version of magical thinking. Superstitious people often believe that it's possible to directly causally influence things across great distances of time and space. At a glance, FDT's prescription ("one-box, even though you can't causally affect whether the box is full") as well as its account of how and why this works ("you can somehow 'control' the properties of abstract objects like 'decision functions'") seem weird and spooky in the manner of a superstition.

FDT's response: if a thing seems spooky, that's a fine first-pass reason to be suspicious of it. But at some point, the accusation of magical thinking has to cash out in some sort of practical, real-world failure -- in the case of decision theory, some systematic loss of utility that isn't balanced by an equal, symmetric loss of utility from CDT. After enough experience of seeing a tool outperforming the competition in scenario after scenario, at some point calling the use of that tool "magical thinking" starts to ring rather hollow. At that point, it's necessary to consider the possibility that FDT is counter-intuitive but correct (like Einstein's "spukhafte Fernwirkung [https://en.wikipedia.org/wiki/Quantum_entanglement]"), rather than magical.

In turn, FDT advocates tend to think the following reflects an epistemic mistake by CDT advocates: The alleged mistake here is a violation of naturalism. Humans tend to think of themselves as free Cartesian agents acting upon the world, rather than as deterministic subprocesses of a larger deterministic process. If we consistently and whole-heartedly accepted the "deterministic subprocess" view of our decision-making, we would find nothing strange about the idea that it's sometimes right for this subprocess to do locally incorrect things for the sake of better global results. E.g., consider the transpar
Ben Garfinkel · 3y
Is the following a roughly accurate re-characterization of the intuition here? "Suppose that there's an agent that implements P_UDT. Because it is following P_UDT, when it enters the box room it finds a ton of money in the first box and then refrains from taking the money in the second box. People who believe R_CDT claim that the agent should have also taken the money in the second box. But, given that the universe is deterministic, this doesn't really make sense. From before the moment the agent entered the room, it was already determined that the agent would one-box. Since (in a physically deterministic sense) the P_UDT agent could not have two-boxed, there's no relevant sense in which the agent should have two-boxed." If so, then I suppose my first reaction is that this seems like a general argument against normative realism rather than an argument against any specific proposed criterion of rightness. It also applies, for example, to the claim that a P_CDT agent "should have" one-boxed -- since in a physically deterministic sense it could not have. Therefore, I think it's probably better to think of this as an argument against the truth (and possibly conceptual coherence) of both R_CDT and R_UDT, rather than an argument that favors one over the other. In general, it seems to me like all statements that evoke counterfactuals have something like this problem. For example, it is physically determined what sort of decision procedure we will build into any given AI system; only one choice of decision procedure is physically consistent with the state of the world at the time the choice is made. So -- insofar as we accept this kind of objection from determinism -- there seems to be something problematically non-naturalistic about discussing what "would have happened" if we built in one decision procedure or another.
No, I don't endorse this argument. To simplify the discussion, let's assume that the Newcomb predictor is infallible. FDT agents, CDT agents, and EDT agents each get a decision: two-box (which gets you $1000 plus an empty box), or one-box (which gets you $1,000,000 and leaves the $1000 behind). Obviously, insofar as they are in fact following the instructions of their decision theory, there's only one possible outcome; but it would be odd to say that a decision stops being a decision just because it's determined by something. (What's the alternative?) I do endorse "given the predictor's perfect accuracy, it's impossible for the P_UDT agent to two-box and come away with $1,001,000". I also endorse "given the predictor's perfect accuracy, it's impossible for the P_CDT agent to two-box and come away with $1,001,000". Per the problem specification, no agent can two-box and get $1,001,000 or one-box and get $0. But this doesn't mean that no decision is made; it just means that the predictor can predict the decision early enough to fill the boxes accordingly. (Notably, the agent following P_CDT two-boxes because $1,001,000 > $1,000,000 and $1000 > $0, even though this "dominance" argument appeals to two outcomes that are known to be impossible just from the problem statement. I certainly don't think agents "should" try to achieve outcomes that are impossible from the problem specification itself. The reason non-CDT agents get more utility than CDT agents in Newcomb's problem is that they take into account that the predictor is a predictor when they construct their counterfactuals.) In the transparent version of this dilemma, the agent who sees the $1M and one-boxes also "could have two-boxed", but if they had two-boxed, it would only have been after making a different observation. In that sense, if the agent has any lingering uncertainty about what they'll choose, the uncertainty goes away as soon as they see whether the box is full. No, there's nothing non-naturalistic ab
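To make the utility comparison in this thread concrete, here's a small sketch (my own illustration, not from the original comments) of the payoffs for fixed one-boxing and two-boxing policies. The predictor-accuracy parameter `p` is an assumption I'm introducing; the discussion above considers the infallible case, p = 1:

```python
# Hypothetical sketch of Newcomb's problem payoffs. The predictor fills the
# opaque box with $1,000,000 iff it predicts one-boxing; the transparent
# box always holds $1,000.

def expected_payoff(policy: str, p: float) -> float:
    """Expected dollars for a fixed policy against a predictor with accuracy p."""
    if policy == "one-box":
        # With prob p the predictor foresaw one-boxing and filled the opaque box.
        return p * 1_000_000 + (1 - p) * 0
    if policy == "two-box":
        # With prob p the predictor foresaw two-boxing and left the opaque box empty.
        return p * 1_000 + (1 - p) * (1_000_000 + 1_000)
    raise ValueError(policy)

# With an infallible predictor (p = 1), the only reachable outcomes are
# $1,000,000 (one-box) and $1,000 (two-box); $1,001,000 and $0 never occur.
assert expected_payoff("one-box", 1.0) == 1_000_000
assert expected_payoff("two-box", 1.0) == 1_000
```

Note that even with substantial predictor error (e.g. p = 0.99), the fixed one-boxing policy still comes out far ahead in expectation, which is the sense in which "non-CDT agents get more utility" here.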
I think you need to make a clearer distinction here between "outcomes that don't exist in the universe's dynamics" (like taking both boxes and receiving $1,001,000) and "outcomes that can't exist in my branch" (like there not being a bomb in the unlucky case). Because if you're operating just in the branch you find yourself in, many outcomes whose probability an FDT agent is trying to affect are impossible from the problem specification (once you include observations). And, to be clear, I do think agents "should" try to achieve outcomes that are impossible from the problem specification including observations, if certain criteria are met, in a way that basically lines up with FDT, just like agents "should" try to achieve outcomes that are already known to have happened from the problem specification including observations. As an example, if you're in Parfit's Hitchhiker, you should pay once you reach town, even though reaching town has probability 1 in cases where you're deciding whether or not to pay, and the reason for this is that it was necessary for reaching town to have had probability 1.
+1, I agree with all this.
Ben Garfinkel · 3y
Suppose that we accept the principle that agents never "should" try to achieve outcomes that are impossible from the problem specification -- with one implication being that it's false that (as R_CDT suggests) agents that see a million dollars in the first box "should" two-box. This seems to imply that it's also false that (as R_UDT suggests) an agent that sees that the first box is empty "should" one-box. By the problem specification, of course, one-boxing when there is no money in the first box is also an impossible outcome. Since decisions to two-box only occur when the first box is empty, this would then imply that decisions to two-box are never irrational in the context of this problem. But I imagine you don't want to say that. I think I probably still don't understand your objection here -- so I'm not sure this point is actually responsive to it -- but I initially have trouble seeing what potential violations of naturalism/determinism R_CDT could be committing that R_UDT would not also be committing. (Of course, just to be clear, both R_UDT and R_CDT imply that the decision to commit yourself to a one-boxing policy at the start of the game would be rational. They only diverge in their judgments of what actual in-room boxing decision would be rational. R_UDT says that the decision to two-box is irrational and R_CDT says that the decision to one-box is irrational.)
That should be "a one-boxing policy", right?
Ben Garfinkel · 3y
Yep, thanks for the catch! Edited to fix.
It seems to me like they're coming down to saying something like: the "Guaranteed Payoffs Principle" / "Don't Make Things Worse Principle" is more core to rational action than being self-consistent. Whereas others think self-consistency is more important. It's not clear to me that the justification for CDT is more circular than the justification for FDT. Doesn't it come down to which principles you favor? Maybe you could say FDT is more elegant. Or maybe that it satisfies more of the intuitive properties we'd hope for from a decision theory (where elegance might be one of those). But I'm not sure that would make the justification less circular per se. I guess one way the justification for CDT could be more circular is if the key or only principle that pushes in favor of it over FDT can really just be seen as a restatement of CDT, in a way that the principles that push in favor of FDT cannot. Is that what you would claim?
The main argument against CDT (in my view) is that it tends to get you less utility (regardless of whether you add self-modification so it can switch to other decision theories). Self-consistency is a secondary issue. FDT gets you more utility than CDT. If you value literally anything in life more than you value "which ritual do I use to make my decisions?", then you should go with FDT over CDT; that's the core argument. This argument for FDT would be question-begging if CDT proponents rejected utility as a desirable thing. But instead CDT proponents who are familiar with FDT agree utility is a positive, and either (a) they think there's no meaningful sense in which FDT systematically gets more utility than CDT (which I think is adequately refuted by Abram Demski [https://www.lesswrong.com/posts/ySLYSsNeFL5CoAQzN/a-critique-of-functional-decision-theory#comments]), or (b) they think that CDT has other advantages that outweigh the loss of utility (e.g., CDT feels more intuitive to them). The latter argument for CDT isn't circular, but as a fan of utility (i.e., of literally anything else in life), it seems very weak to me.
Ben Garfinkel · 3y
I do think the argument ultimately needs to come down to an intuition about self-effacingness. The fact that agents earn less expected utility if they implement P_CDT than if they implement some other decision procedure seems to support the claim that agents should not implement P_CDT. But there's nothing logically inconsistent about believing both (a) that R_CDT is true and (b) that agents should not implement P_CDT. To again draw an analogy with a similar case, there's also nothing logically inconsistent about believing both (a) that utilitarianism is true and (b) that agents should not in general make decisions by carrying out utilitarian reasoning. So why shouldn't I believe that R_CDT is true? The argument needs an additional step. And it seems to me like the most natural additional step here involves an intuition that the criterion of rightness would not be self-effacing. More formally, it seems like the argument needs to be something along these lines:

1. Over their lifetimes, agents who implement P_CDT earn less expected utility than agents who implement certain other decision procedures.
2. (Assumption) Agents should implement whatever decision procedure will earn them the most expected lifetime utility.
3. Therefore, agents should not implement P_CDT.
4. (Assumption) The criterion of rightness is not self-effacing. Equivalently, if agents should not implement some decision procedure P_X, then it is not the case that R_X is true.
5. Therefore -- as an implication of points (3) and (4) -- R_CDT is not true.

Whether you buy the "No Self-Effacement" assumption in Step 4 -- or, alternatively, the countervailing "Don't Make Things Worse" assumption that supports R_CDT -- seems to ultimately be a matter of intuition. At least, I don't currently know what else people can appeal to here to resolve the disagreement. [[SIDENOTE: Step 2 is actually a bit ambiguous, since it doesn't specify how expected lifetime utility is being evaluated. For exampl
If the thing being argued for is "R_CDT plus P_SONOFCDT", then that makes sense to me, but is vulnerable to all the arguments I've been making: Son-of-CDT is in a sense the worst of both worlds, since it gets less utility than FDT and lacks CDT's "Don't Make Things Worse" principle. If the thing being argued for is "R_CDT plus P_FDT", then I don't understand the argument. In what sense is P_FDT compatible with, or conducive to, R_CDT? What advantage does this have over "R_FDT plus P_FDT"? (Indeed, what difference between the two views would be intended here?) The argument against "R_CDT plus P_SONOFCDT" doesn't require any mention of self-effacingness; it's entirely sufficient to note that P_SONOFCDT gets less utility than P_FDT. The argument against "R_CDT plus P_FDT" seems to demand some reference to self-effacingness or inconsistency, or triviality / lack of teeth. But I don't understand what this view would mean or why anyone would endorse it (and I don't take you to be endorsing it). We want to evaluate actual average utility rather than expected utility, since the different decision theories are different theories of what "expected utility" means.
Ben Garfinkel · 3y
Hm, I think I may have misinterpreted your previous comment as emphasizing the point that P_CDT "gets you less utility" rather than the point that P_SONOFCDT "gets you less utility." So my comment was aiming to explain why I don't think the fact that P_CDT gets less utility provides a strong challenge to the claim that R_CDT is true (unless we accept the "No Self-Effacement Principle"). But it sounds like you might agree that this fact doesn't on its own provide a strong challenge. In response to the first argument alluded to here: "Gets the most [expected] utility" is ambiguous, as I think we've both agreed. My understanding is that P_SONOFCDT is definitionally the policy that, if an agent decided to adopt it, would cause the largest increase in expected utility. So -- if we evaluate the expected utility of a decision to adopt a policy from a causal perspective -- it seems to me that P_SONOFCDT "gets the most expected utility." If we evaluate the expected utility of a policy from an evidential or subjunctive perspective, however, then another policy may "get the most utility" (because policy adoption decisions may be non-causally correlated.) Apologies if I'm off-base, but it reads to me like you might be suggesting an argument along these lines:

1. R_CDT says that it is rational to decide to follow a policy that would not maximize "expected utility" (defined in evidential/subjunctive terms).
2. (Assumption) But it is not rational to decide to follow a policy that would not maximize "expected utility" (defined in evidential/subjunctive terms).
3. Therefore R_CDT is not true.

The natural response to this argument is that it's not clear why we should accept the assumption in Step 2. R_CDT says that the rationality of a decision depends on its "expected utility" defined in causal terms. So someone starting from the position that R_CDT is true obviously won't accept the assumption in Step 2. R_EDT and R_FDT say that the rationality of a decision dep
Perhaps the argument is something like:

* "Don't make things worse" (DMTW) is one of the intuitions that leads us to favoring R_CDT.
* But the actual policy that R_CDT recommends does not in fact follow DMTW.
* So R_CDT only gets intuitive appeal from DMTW to the extent that DMTW was about R_'s, and not about P_'s.
* But intuitions are probably(?) not that precisely targeted, so R_CDT shouldn't get to claim the full intuitive endorsement of DMTW. (Yes, DMTW endorses it more than it endorses R_FDT, but R_CDT is still at least somewhat counter-intuitive when judged against the DMTW intuition.)
Ben Garfinkel · 3y
Here are two logically inconsistent principles that could be true:

Don't Make Things Worse: If a decision would definitely make things worse, then taking that decision is not rational.

Don't Commit to a Policy That In the Future Will Sometimes Make Things Worse: It is not rational to commit to a policy that, in the future, will sometimes output decisions that definitely make things worse.

I have strong intuitions that the first one is true. I have much weaker (comparatively negligible) intuitions that the second one is true. Since they're mutually inconsistent, I reject the second and accept the first. I imagine this is also true of most other people who are sympathetic to R_CDT. One could argue that R_CDT sympathizers don't actually have much stronger intuitions regarding the first principle than the second -- i.e. that their intuitions aren't actually very "targeted" on the first one -- but I don't think that would be right. At least, it's not right in my case. A more viable strategy might be to argue for something like a meta-principle:

The 'Don't Make Things Worse' Meta-Principle: If you find "Don't Make Things Worse" strongly intuitive, then you should also find "Don't Commit to a Policy That In the Future Will Sometimes Make Things Worse" just about as intuitive.

If the meta-principle were true, then I guess this would sort of imply that people's intuitions in favor of "Don't Make Things Worse" should be self-neutralizing. They should come packaged with equally strong intuitions for another position that directly contradicts it. But I don't see why the meta-principle should be true. At least, my intuitions in favor of the meta-principle are way less strong than my intuitions in favor of "Don't Make Things Worse" :)
Ben Garfinkel · 3y
Just to say slightly more on this, I think the Bomb case is again useful for illustrating my (I think not uncommon) intuitions here.

Bomb Case: Omega puts a million dollars in a transparent box if he predicts you'll open it. He puts a bomb in the transparent box if he predicts you won't open it. He's only wrong about one in a trillion times.

Now suppose you enter the room and see that there's a bomb in the box. You know that if you open the box, the bomb will explode and you will die a horrible and painful death. If you leave the room and don't open the box, then nothing bad will happen to you. You'll return to a grateful family and live a full and healthy life. You understand all this. You want so badly to live. You then decide to walk up to the bomb and blow yourself up. Intuitively, this decision strikes me as deeply irrational. You're intentionally taking an action that you know will cause a horrible outcome that you want badly to avoid. It feels very relevant that you're flagrantly violating the "Don't Make Things Worse" principle.

Now, let's step back a time step. Suppose you know that you're the sort of person who would refuse to kill yourself by detonating the bomb. You might decide that -- since Omega is such an accurate predictor -- it's worth taking a pill to turn you into the opposite sort of person, one who would open the box no matter what, to increase your odds of getting a million dollars. You recognize that this may lead you, in the future, to take an action that makes things worse in a horrifying way. But you calculate that the decision you're making now is nonetheless making things better in expectation. This decision strikes me as pretty intuitively rational. You're violating the second principle -- the "Don't Commit to a Policy..." Principle -- but this violation just doesn't seem that intuitively relevant or remarkable to me. I personally feel like there is nothing too odd about the idea that it can be rational to commit to violating principles of rationality in the future. (This obviously ju
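As a rough illustration of why the pill decision looks rational in expectation, here's a sketch with assumed numbers; the one-in-a-trillion error rate comes from the example, while the dollar disvalue assigned to dying (`COST_OF_DEATH`) is my own placeholder:

```python
# Illustrative sketch of the Bomb case with assumed numbers.
# Omega predicts whether you'll open the box; he errs once in a trillion.

ERROR = 1e-12           # predictor error rate (from the example)
PRIZE = 1_000_000       # money if the box holds $1M and you open it
COST_OF_DEATH = 10**10  # assumed disvalue of blowing yourself up

# Policy: "always open the box" (the disposition the pill gives you).
# Almost always Omega predicts this and fills the box with $1M; once in
# a trillion he errs, places a bomb, and you open it anyway.
ev_always_open = (1 - ERROR) * PRIZE + ERROR * (-COST_OF_DEATH)

# Policy: "never open the box". Omega predicts this and places a bomb;
# you walk away with nothing either way.
ev_never_open = 0

assert ev_always_open > ev_never_open  # the pill wins in expectation
```

The point of the numbers is only that, for any remotely plausible disvalue of death, the one-in-a-trillion chance of the horrifying branch is swamped by the near-certain million; the intuitive tension discussed above is about the in-room decision, not this calculation.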

It feels very relevant that you’re flagrantly violating the “Don’t Make Things Worse” principle.

By triggering the bomb, you're making things worse from your current perspective, but making things better from the perspective of earlier you. Doesn't that seem strange and deserving of an explanation? The explanation from a UDT perspective is that by updating upon observing the bomb, you actually changed your utility function. You used to care about both the possible worlds where you end up seeing a bomb in the box, and the worlds where you don't. After updating, you think you're either a simulation within Omega's prediction so your action has no effect on yourself or you're in the world with a real bomb, and you no longer care about the version of you in the world with a million dollars in the box, and this accounts for the conflict/inconsistency.

Given the human tendency to change our (UDT-)utility functions by updating, it's not clear what to do (or what is right), and I think this reduces UDT's intuitive appeal and makes it less of a slam-dunk over CDT/EDT. But it seems to me that it takes switching to the UDT perspective to even understand the nature of the problem. (Quite possibly this isn't adequately explained in MIRI's decision theory papers.)

I would agree that, with these two principles as written, more people would agree with the first. (And certainly believe you that that's right in your case.) But I feel like the second doesn't quite capture what I had in mind regarding the DMTW intuition applied to P_'s. Consider an alternate version: "If a decision would definitely make things worse, then taking that decision is not good policy." Or alternatively: "If a decision would definitely make things worse, then a rational person would not take that decision." It seems to me that these two claims are naively intuitive on their face, in roughly the same way that the "... then taking that decision is not rational." version is. And it's only after you've considered prisoners' dilemmas or Newcomb's paradox, etc. that you realize that good policy (or being a rational agent) actually diverges from what's rational in the moment. (But maybe others would disagree on how intuitive these versions are.) EDIT: And to spell out my argument a bit more: if several alternate formulations of a principle are each intuitively appealing, and it turns out that whether some claim (e.g. R_CDT is true) is consistent with the principle comes down to the precise formulation used, then it's not quite fair to say that the principle fully endorses the claim and that the claim is not counter-intuitive from the perspective of the original intuition. Of course, this argument is moot if it's true that the original DMTW intuition was always about rational in-the-moment action, and never about policies or actors. And maybe that's the case? But I think it's a little more ambiguous with the "... is not good policy" or "a rational person would not..." versions than with the "Don't commit to a policy..." version. EDIT2: Does what I'm trying to say make sense? (I felt like I was struggling a bit to express myself in this comment.)
Ben Garfinkel (3y):
Just as a quick sidenote: I've been thinking of P_SONOFCDT as, by definition, the decision procedure that R_CDT implies it is rational to commit to implementing. If we define P_SONOFCDT this way, then anyone who believes that R_CDT is true must also believe that it is rational to implement P_SONOFCDT. The belief that R_CDT is true and the belief that it is rational to implement P_FDT would then be consistent only if P_SONOFCDT were equivalent to P_FDT (which of course it isn't). So I would be inclined to say that no one should believe in both the correctness of R_CDT and the rationality of implementing P_FDT.

[[EDIT: Actually, I need to distinguish between the decision procedure that it would be rational to commit yourself to and the decision procedure that it would be rational to build into an agent. These can sometimes be different. For example, suppose that R_CDT is true and that you're building twin AI systems and you would like them both to succeed. Then it would be rational for you to give them decision procedures that will cause them to cooperate if they face each other in a prisoner's dilemma (e.g. some version of P_FDT). But if R_CDT is true and you've just been born into the world as one of the twins, it would be rational for you to commit to a decision procedure that would cause you to defect if you face the other AI system in a prisoner's dilemma (i.e. P_SONOFCDT). I slightly edited the above comment to reflect this.

My tentative view -- which I've alluded to above -- is that the various proposed criteria of rightness don't in practice diverge all that much when it comes to the question of what sorts of decision procedures we should build into AI systems. Although I also understand that MIRI is not mainly interested in the question of what sorts of decision procedures we should build into AI systems.]]
Ben Pace (3y):
Do you mean [...]? It seems to better fit the pattern of the example just prior.
This is similar to how you described it here: [...] This seems like it should instead be a 2x2 grid: something can be either normative or non-normative, and if it's normative, it can be either an algorithm/procedure that's being recommended, or a criterion of rightness like "a decision is rational iff taking it would cause the largest expected increase in value" (which we can perhaps think of as generalizing over a set of algorithms, and saying all the algorithms in a certain set are "normative" or "endorsed"). Some of your discussion above seems to be focusing on the "algorithmic?" dimension, while other parts seem focused on "normative?". I'll say more about "normative?" here.

The reason I proposed the three distinctions in my last comment and organized my discussion around them is that I think they're pretty concrete and crisply defined. It's harder for me to accidentally switch topics or bundle two different concepts together when talking about "trying to optimize vs. optimizing as a side-effect", "directly optimizing vs. optimizing via heuristics", "initially optimizing vs. self-modifying to optimize", or "function vs. algorithm". In contrast, I think "normative" and "rational" can mean pretty different things in different contexts, it's easy to accidentally slide between different meanings of them, and their abstractness makes it easy to lose track of what's at stake in the discussion. E.g., "normative" is often used in the context of human terminal values, and it's in this context that statements like this ring obviously true: [...]

If we're treating decision-theoretic norms as being like moral norms, then sure. I think there are basically three options:

* Decision theory isn't normative.
* Decision theory is normative in the way that "murder is bad" or "improving aggregate welfare is good" is normative, i.e., it expresses an arbitrary terminal value of human beings.
* Decision theory is normative in the way that game theory, probability theory, Boolean [...]
Ben Garfinkel (3y):
[[Disclaimer: I'm not sure this will be useful, since it seems like most discussions that verge on meta-ethics end up with neither side properly understanding the other.]]

I think the kind of decision theory that philosophers tend to work on is typically explicitly described as "normative." (For example, the SEP article [https://plato.stanford.edu/entries/decision-theory/] on decision theory is about "normative decision theory.") So when I'm talking about "academic decision theories" or "proposed criteria of rightness" I'm talking about normative theories. When I use the word "rational" I'm also referring to a normative property. I don't think there's any very standard definition of what it means for something to be normative, maybe because it's often treated as something pretty close to a primitive concept, but a partial account is that a "normative theory" is a claim about what someone should do. At least this is what I have in mind. This is different from the second option you list (and I think the third one).

Some normative theories concern "ends." These are basically claims about what people should do, if they can freely choose outcomes. For example: A subjectivist theory might say that people should maximize the fulfillment of their own personal preferences (whatever they are). Whereas a hedonistic utilitarian theory might say that people should maximize total happiness. I'm not sure what the best terminology is, and think this choice is probably relatively non-standard, but let's label these "moral theories."

Some normative theories, including "decision theories," concern "means." These theories put aside the question of which ends people should pursue and instead focus on how people should respond to uncertainty about the results/implications of their actions. For example: Expected utility theory says that people should take whatever actions maximize expected fulfillment of the relevant ends. Risk-weighted expected utility theory (and other alt [...]
I appreciate you taking the time to lay out these background points, and it does help me better understand your position, Ben; thanks!

Some ancient Greeks thought that the planets were intelligent beings; yet many of the Greeks' astronomical observations, and some of their theories and predictive tools, were still true and useful. I think that terms like "normative" and "rational" are underdefined, so the question of realism about them is underdefined (cf. Luke Muehlhauser's pluralistic moral reductionism [https://www.lesswrong.com/posts/3zDX3f3QTepNeZHGc/pluralistic-moral-reductionism]). I would say that (1) some philosophers use "rational" in a very human-centric way, which is fine as long as it's done consistently; (2) others have a much thinner conception of "rational", such as 'tending to maximize utility'; and (3) still others want to have their cake and eat it too, building in a lot of human-value-specific content to their notion of "rationality", but then treating this conception as though it had the same level of simplicity, naturalness, and objectivity as 2.

I think that type-1, type-2, and type-3 decision theorists have all contributed valuable AI-relevant conceptual progress in the past (most obviously, by formulating Newcomb's problem, EDT, and CDT), and I think all three could do more of the same in the future. I think the type-3 decision theorists are making a mistake, but often more in the fashion of an ancient astronomer who's accumulating useful and real knowledge but happens to have some false side-beliefs about the object of study, not in the fashion of a theologian whose entire object of study is illusory. (And not in the fashion of a developmental psychologist or historian whose field of study is too human-centric to directly bear on game theory, AI, etc.) I'd expect type-2 decision theorists to tend to be interested in more AI-relevant things than type-1 decision theorists, but on the whole I think the flavor of decision theory as a f [...]
Ben Garfinkel (3y):
Thank you for taking the time to respond as well! :)

I'm not positive I understand what (1) and (3) are referring to here, but I would say that there's also at least a fourth way that philosophers often use the word "rational" (which is also the main way I use the word "rational"). This is to refer to an irreducibly normative concept. The basic thought here is that not every concept can be usefully described in terms of more primitive concepts (i.e. "reduced"). As a close analogy, a dictionary cannot give useful non-circular definitions of every possible word -- it requires the reader to have a pre-existing understanding of some foundational set of words. As a wonkier analogy, if we think of the space of possible concepts as a sort of vector space, then we require an initial "basis" of primitive concepts that we use to describe the rest of the concepts.

Some examples of concepts that are arguably irreducible are "truth," "set," "property," "physical," "existence," and "point." Insofar as we can describe these concepts in terms of slightly more primitive ones, the descriptions will typically fail to be very useful or informative, and we will typically struggle to break the slightly more primitive ones down any further. To focus on the example of "truth," some people have tried to reduce the concept substantially. Some people have argued, for example, that when someone says that "X is true" what they really mean or should mean is "I personally believe X" or "believing X is good for you." But I think these suggested reductions pretty obviously don't entirely capture what people mean when they say "X is true." The phrase "X is true" also has an important meaning that is not amenable to this sort of reduction.

[[EDIT: "Truth" may be a bad example, since it's relatively controversial and since I'm pretty much totally unfamiliar with work on the philosophy of truth. But insofar as any concepts seem irreducible to you in this sense, or by the more general argum [...]
So, as an experiment, I'm going to be a very obstinate reductionist in this comment. I'll insist that a lot of these hard-seeming concepts aren't so hard. Many of them are complicated, in the fashion of "knowledge" -- they admit an endless variety of edge cases and exceptions -- but these complications are quirks of human cognition and language rather than deep insights into ultimate metaphysical reality. And where there's a simple core we can point to, that core generally isn't mysterious. It may be inconvenient to paraphrase the term away (e.g., because it packages together several distinct things in a nice concise way, or has important emotional connotations, or does important speech-act [https://en.wikipedia.org/wiki/Speech_act] work like encouraging a behavior). But when I say it "isn't mysterious", I mean it's pretty easy to see how the concept can crop up in human thought even if it doesn't belong on the short list of deep fundamental cosmic structure [https://www.amazon.com/Writing-Book-World-Theodore-Sider/dp/0199687501] terms.

Why is this a fourth way? My natural response is to say that normativity itself is either a messy, parochial human concept (like "love," "knowledge," "France"), or it's not (in which case it goes in bucket 2).

Picking on the concept here that seems like the odd one out to me: I feel confident that there isn't a cosmic law (of nature, or of metaphysics, etc.) that includes "truth" as a primitive (unless the list of primitives is incomprehensibly long). I could see an argument for concepts like "intentionality/reference", "assertion", or "state of affairs", though the former two strike me as easy to explain in simple physical terms. Mundane empirical "truth" seems completely straightforward [https://www.lesswrong.com/posts/X3HpE8tMXz4m4w6Rz/the-simple-truth]. Then there's the truth of sentences like "Frodo is a hobbit", "2+2=4", "I could have been the president", "Hamburgers are more delicious than battery acid"... Some of these [...]
Ben Garfinkel (3y):
I may write up more object-level thoughts here, because this is interesting, but I just wanted to quickly emphasize the upshot that initially motivated me to write up this explanation. (I don't really want to argue here that non-naturalist or non-analytic naturalist normative realism of the sort I've just described is actually a correct view; I mainly wanted to give a rough sense of what the view consists of and what leads people to it. It may well be the case that the view is wrong, because all true normative-seeming claims are in principle reducible to claims about things like preferences. I think the comments you've just made cover some reasons to suspect this.)

The key point is just that when these philosophers say that "Action X is rational," they are explicitly reporting that they do not mean "Action X suits my terminal preferences" or "Action X would be taken by an agent following a policy that maximizes lifetime utility" or any other such reduction. I think that when people are very insistent that they don't mean something by their statements, it makes sense to believe them. This implies that the question they are discussing -- "What are the necessary and sufficient conditions that make a decision rational?" -- is distinct from questions like "What decision would an agent that tends to win take?" or "What decision procedure suits my terminal preferences?" It may be the case that the question they are asking is confused or insensible -- because any sensible question would be reducible -- but it is in any case different.

So I think it's a mistake to interpret at least these philosophers' discussions of "decision theories" or "criteria of rightness" as though they were discussions of things like terminal preferences or winning strategies. And it doesn't seem to me like the answer to the question they're asking (if it has an answer) would imply much about things like terminal preferences or winning strategies.

[[NOTE: Plenty of decision theo [...]
Ben Garfinkel (3y):
Just on this point: I think you're right that I may be slightly glossing over certain distinctions, but I might still draw them slightly differently (rather than doing a 2x2 grid). Some different things one might talk about in this context:

1. Decisions
2. Decision procedures
3. The decision procedure that is optimal with regard to some given metric (e.g. the decision procedure that maximizes expected lifetime utility for some particular way of calculating expected utility)
4. The set of properties that makes a decision rational ("criterion of rightness")
5. A claim about what the criterion of rightness is ("normative decision theory")
6. The decision procedure that it would be rational to decide to build into an agent (as implied by the criterion of rightness)

(4), (5), and (6) have to do with normative issues, while (1), (2), and (3) can be discussed without getting into normativity. My current-although-not-firmly-held view is also that (6) probably isn't very sensitive to what the criterion of rightness is, so in practice it can be reasoned about without going too deep into the weeds thinking about competing normative decision theories.
Just want to note that I found the R_ vs P_ distinction to be helpful. I think using those terms might be useful for getting at the core of the disagreement.
Ben Garfinkel (3y):
Lightly editing some thoughts I previously wrote up on this issue [https://www.lesswrong.com/posts/GqTeChFnXdJzDzbMd/realism-and-rationality-2#fn-Y739J6M3tCNhCwtv3-12], somewhat in line with Paul's comments:

For more on this divide/points of disagreement, see Will MacAskill's essay on the alignment forum (with responses from MIRI researchers and others)


and previously, Wolfgang Schwarz's review of Functional Decision Theory


(with some Lesswrong discussion here: https://www.lesswrong.com/posts/BtN6My9bSvYrNw48h/open-thread-january-2019#WocbPJvTmZcA2sKR6)

I'd also be interested in Buck's perspectives on this topic.

Thanks, I hadn't seen this.
See also bmg's LW post, Realism and rationality [https://www.lesswrong.com/posts/GqTeChFnXdJzDzbMd/realism-and-rationality-2]. Relevant excerpt:

Back in July, you held an in-person Q&A at REACH and said "There are a bunch of things about AI alignment which I think are pretty important but which aren’t written up online very well. One thing I hope to do at this Q&A is try saying these things to people and see whether people think they make sense." Could you say more about what these important things are, and what was discussed at the Q&A?

I don’t really remember what was discussed at the Q&A, but I can try to name important things about AI safety which I think aren’t as well known as they should be. Here are some:


I think the ideas described in the paper Risks from Learned Optimization are extremely important; they’re less underrated now that the paper has been released, but I still wish that more people who are interested in AI safety understood those ideas better. In particular, the distinction between inner and outer alignment makes my concerns about aligning powerful ML systems much crisper.


On a meta note: Different people who work on AI alignment have radically different pictures of what the development of AI will look like, what the alignment problem is, and what solutions might look like.


Compared to people who are relatively new to the field, skilled and experienced AI safety researchers seem to have a much more holistic and much more concrete mindset when they’re talking about plans to align AGI.

For example, here are some of my beliefs about AI alignment (none of which are original ideas of mine):


I think it’s pretty plausible that meta-learning systems are [...]

On a meta note: Different people who work on AI alignment have radically different pictures of what the development of AI will look like, what the alignment problem is, and what solutions might look like.

+1, this is the thing that surprised me most when I got into the field. I think helping increase common knowledge and agreement on the big picture of safety should be a major priority for people in the field (and it's something I'm putting a lot of effort into, so send me an email at richardcngo@gmail.com if you want to discuss this).

I think the ideas described in the paper Risks from Learned Optimization are extremely important.

Also +1 on this.

Suppose you find out that Buck-in-2040 thinks that the work you're currently doing is a big mistake (which should have been clear to you, now). What are your best guesses about what his reasons are?

I think of myself as making a lot of gambles with my career choices. And I suspect that regardless of which way the propositions turn out, I'll have an inclination to think that I was an idiot for not realizing them sooner. For example, I often have both the following thoughts:

  • "I have a bunch of comparative advantage at helping MIRI with their stuff, and I'm not going to be able to quickly reduce my confidence in their research directions. So I should stop worrying about it and just do as much as I can."
  • "I am not sure whether the MIRI research directions are good. Maybe I should spend more time evaluating whether I should do a different thing instead."

But even if it feels obvious in hindsight, it sure doesn't feel obvious now.

So I have big gambles that I'm making, which might turn out to be wrong, but which feel now like they will have been reasonable-in-hindsight gambles either way. The main two such gambles are thinking AI alignment might be really important in the next couple decades and working on MIRI's approaches to AI alignment instead of some other approach.

When I ask myself "what things have I not really considered as much [...]

How much do you worry that MIRI's default non-disclosure policy is going to hinder MIRI's ability to do good research, because it won't be able to get as much external criticism?

I worry very little about losing the opportunity to get external criticism from people who wouldn't engage very deeply with our work if they did have access to it. I worry more about us doing worse research because it's harder for extremely engaged outsiders to contribute to our work.

A few years ago, Holden had a great post where he wrote:

For nearly a decade now, we've been putting a huge amount of work into putting the details of our reasoning out in public, and yet I am hard-pressed to think of cases (especially in more recent years) where a public comment from an unexpected source raised novel important considerations, leading to a change in views. This isn't because nobody has raised novel important considerations, and it certainly isn't because we haven't changed our views. Rather, it seems to be the case that we get a large amount of valuable and important criticism from a relatively small number of highly engaged, highly informed people. Such people tend to spend a lot of time reading, thinking and writing about relevant topics, to follow our work closely, and to have a great deal of context. They also tend to be people who form relationships of
[...]

In November 2018 you said "we want to hire as many people as engineers as possible; this would be dozens if we could, but it's hard to hire, so we'll more likely end up hiring more like ten over the next year". As far as I can tell, MIRI has hired 2 engineers (Edward Kmett and James Payor) since you wrote that comment. Can you comment on the discrepancy? Did hiring turn out to be much more difficult than expected? Are there not enough good engineers looking to be hired? Are there a bunch of engineers who aren't on the team page/haven't been announced yet?

(This is true of all my answers but feels particularly relevant for this one: I’m speaking just for myself, not for MIRI as a whole)

We’ve actually made around five engineer hires since then; we’ll announce some of them in a few weeks. So I was off by a factor of two.

Before you read my more detailed thoughts: please don’t read the below and then get put off from applying to MIRI. I think that many people who are in fact good MIRI fits might not realize they’re good fits. If you’re unsure whether it’s worth your time to apply to MIRI, you can email me at buck@intelligence.org and I’ll (eventually) reply telling you whether I think you might plausibly be a fit. Even if it doesn't go further than that, there is great honor in applying to jobs from which you get rejected, and I feel warmly towards almost everyone I reject.

With that said, here are some of my thoughts on the discrepancy between my prediction and how much we’ve hired:

  • Since I started doing recruiting work for MIRI in late 2017, I’ve updated towards thinking that we need to be pickier with the technical caliber of engineering hires than I originally t
[...]

What's the biggest misconception people have about current technical AI alignment work? What's the biggest misconception people have about MIRI?

How should talented EA software engineers best put their skills to use?

The obvious answer is “by working on important things at orgs which need software engineers”. To name specific examples that are somewhat biased towards the orgs I know well:

  • MIRI needs software engineers who can learn functional programming and some math
    • I think that if you’re an engineer who likes functional programming, it might be worth your time to take a Haskell job and gamble on MIRI wanting to hire you one day when you’re really good at it. One person who currently works at MIRI is an EA who worked in Haskell for a few years; his professional experience is really helpful for him. If you’re interested in doing this, feel free to email me asking about whether I think it’s a good idea for you.
  • OpenAI’s safety team needs software engineers who can work as research engineers (which might only be a few months of training if you’re coming from a software engineering background; MIRI has a retraining program for software engineers who want to try this; if you’re interested in that, you should email me.)
  • Ought needs an engineering lead
  • The 80000 Hours job board lists positions

I have two main thoughts on how talented software [...]

As an appendix to the above, some of my best learning experiences as a programmer were the following (starting from when I started programming properly as a freshman in 2012). Many of these aren't that objectively hard, and would fit in well as projects in a CS undergrad course; they were much harder for me because I didn't have the structure of a university course to tell me what design decisions were reasonable and when I was going down blind alleys. I think that this difficulty created some great learning experiences for me.

  • I translated the proof of equivalence between regular expressions and finite state machines from “Introduction to Automata Theory, Languages, and Computation” into Haskell.
  • I wrote a program which would take a graph describing a circuit built from resistors and batteries and then solve for the currents and potential drops.
  • I wrote a GUI for a certain subset of physics problems; this involved a lot of deconfusion-style thinking as well as learning how to write GUIs.
  • I went to App Academy and learned to write full stack web applications.
  • I wrote a compiler from C to assembly in Scala. It took a long time for me to figure out that I sh
[...]
This is an awesome answer; thanks Buck! The motivation behind strategy 2 seems pretty clear; are you emphasising strategy 1 (become a great engineer) for its instrumental benefit to strategy 2 (become useful to EA orgs), or for some other reason, like EtG? Strongly agree that some of the best engineers I come across have had very broad, multi-domain knowledge (and have been able to apply it cross-domain to whatever problem they're working on).

(Notably, the other things you might work on if you weren't at MIRI seem largely to be non-software-related)

I hadn't actually noticed that.

One factor here is that a lot of AI safety research seems to need ML expertise, which is one of my least favorite types of CS/engineering.

Another is that compared to many EAs I think I have a comparative advantage at roles which require technical knowledge but not doing technical research day-to-day.

I'm emphasizing strategy 1 because I think that there are EA jobs for software engineers where the skill ceiling is extremely high, so if you're really good it's still worth it for you to try to become much better. For example, AI safety research needs really great engineers at AI safety research orgs.

In your experience, what are the main reasons good people choose not to do AI alignment research after getting close to the field (at any org)? And on the other side, what are the main things that actually make the difference for them positively deciding to do AI alignment research?

The most common reason that someone who I would be excited to work with at MIRI chooses not to work on AI alignment is that they decide to work on some other important thing instead, eg other x-risk or other EA stuff.

But here are some anonymized recent stories of talented people who decided to do non-EA work instead of taking opportunities to do important technical work related to x-risk (for context, I think all of these people are more technically competent than me):

  • One was very comfortable in a cushy, highly paid job which they already had, and thought it would be too inconvenient to move to an EA job (which would have also been highly paid).
  • One felt that AGI timelines are probably relatively long (eg they thought that the probability of AGI in the next 30 years felt pretty small to them), which made AI safety feel not very urgent. So they decided to take an opportunity which they thought would be really fun and exciting, rather than working at MIRI, which they thought would be less of a good fit for a particular skill set which they'd been developing for years; this person thinks that they might come back and work on x-risk after they've had another job for a few year
[...]

In "Ways I've changed my mind about effective altruism over the past year" you write:

I feel very concerned by the relative lack of good quality discussion and analysis of EA topics. I feel like everyone who isn’t working at an EA org is at a massive disadvantage when it comes to understanding important EA topics, and only a few people manage to be motivated enough to have a really broad and deep understanding of EA without either working at an EA org or being really close to someone who does.

I am not sure if you still feel this way, but this makes me wonder what the current conversations are about with other people at EA orgs. Could you give some examples of important understandings or new ideas you have gained from such conversations in the last, say, 3 months?

I still feel this way, and I've been trying to think of ways to reduce this problem. I think the AIRCS workshops help a bit, my SSC trip was helpful, and EA residencies might be helpful.

A few helpful conversations that I've had recently with people who are strongly connected to the professional EA community, which I think would be harder to have without information gained from these strong connections:

  • I enjoyed a document about AI timelines that someone from another org shared me on.
  • Discussions about how EA outreach should go--how rapidly should we try to grow, what proportion of people who are good fits for EA are we already reaching, what types of people are we going to be able to reach with what outreach mechanisms.

How much do you agree with the two stories laid out in Paul Christiano's post What Failure Looks Like?

They are a pretty reasonable description of what I expect to go wrong in a world where takeoffs are slow. (My models of what I think slow takeoff looks like are mostly based on talking to Paul, though, so this isn’t much independent evidence.)
Ben Pace (3y):
Hmm, I'm surprised to hear you say that about the second story, which I think is describing a fairly fast end to human civilization - "going out with a bang". Example quote: [...] So I mostly see it as describing a hard take-off, and am curious if there's a key part of a fast-takeoff / discontinuous take-off that you think of as central that is missing there.

I think of hard takeoff as meaning that AI systems suddenly control much more resources. (Paul suggests the definition of "there is a one year doubling of the world economy before there's been a four year doubling".)

Unless I'm very mistaken, the point Paul is making here is that if you have a world where AI systems in aggregate gradually become more powerful, there might come a turning point where the systems suddenly stop being controlled by humans. By analogy, imagine a country where the military wants to stage a coup against the president, and their power increases gradually day by day, until one day they decide they have enough power to stage the coup. The power wielded by the military increased continuously and gradually, but the amount of control of the situation wielded by the president at some point suddenly falls.
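[Editor's note: the continuous-power/discontinuous-control point in the coup analogy can be made concrete with a toy model. Every number below is invented purely for illustration; this is not a claim about actual takeoff dynamics.]

```python
# Toy model of the coup analogy: the military's power grows continuously,
# but who controls the country is a step function of that power.
# All numbers are invented purely for illustration.

def military_power(day: int) -> float:
    return 0.01 * day  # grows smoothly by the same increment each day

def president_in_control(day: int, coup_threshold: float = 0.5) -> bool:
    return military_power(day) < coup_threshold

# Power never jumps: the day-to-day change is always the same small amount...
deltas = [military_power(d + 1) - military_power(d) for d in range(100)]
assert all(abs(delta - 0.01) < 1e-9 for delta in deltas)

# ...yet control flips abruptly on exactly one day (day 49 -> 50 here).
flips = [d for d in range(100)
         if president_in_control(d) != president_in_control(d + 1)]
assert flips == [49]
```

The underlying quantity is continuous, but any threshold-crossing event built on top of it is discontinuous, which is the shape of the scenario being described.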

What is EY doing now? Is he coding, writing fiction or a new book, working on math foundations, or providing general leadership?

Not sure why only the initials were provided. For the sake of clarity for other readers: EY = Eliezer Yudkowsky.

Over the last year, Eliezer has been working on technical alignment research and also trying to rejigger some of his fiction-writing patterns toward short stories.

Meta: A big thank you to Buck for doing this and putting so much effort into it! This was very interesting and will hopefully encourage more public dissemination of knowledge and opinions.

I thought this was great. Thanks, Buck

It was a good time; I appreciate all the thoughtful questions.

+1. So good to see stuff like this

+2 helpful and thoughtful answers; really appreciate the time put in.

My sense of the current general landscape of AI Safety is: various groups of people pursuing quite different research agendas, and not very many explicit and written-up arguments for why these groups think their agenda is a priority (a notable exception is Paul's argument for working on prosaic alignment). Does this sound right? If so, why has this dynamic emerged and should we be concerned about it? If not, then I'm curious about why I developed this picture.

I think the picture is somewhat correct, and, perhaps surprisingly, we should not be too concerned about the dynamic.

My model for this is:

1) there are some hard and somewhat nebulous problems "in the world"

2) people try to formalize them using various intuitions/framings/kinds of math; also using some "very deep priors"

3) the resulting agendas look extremely different at the surface level, and create the impression you describe

but actually

4) if you understand multiple agendas deeply enough, you get a sense of:

  • how they are sometimes "reflecting" the same underlying problem
  • whether they are based on some "deep prior", how deep it is, and how hard it can be to argue about
  • how much they are based on "tastes" and "intuitions" ~ one model is that people have something comparable to the policy network in AlphaZero: a mental black box which spits out useful predictions but is not interpretable in language

Overall, given our current state of knowledge, I think running these multiple efforts in parallel is a better approach, with a higher chance of success, than the idea that we should invest a lot in resolving disagreements and prioritizing, with everyone then working on the "best agenda".

This seems to go against a core EA heuristic ("compare the options, take the best"), but it is actually more in line with rational allocation of resources in the face of uncertainty.

Thanks for the reply! Could you give examples of:

a) two agendas that seem to be "reflecting" the same underlying problem despite appearing very different superficially?

b) a "deep prior" that you think some agenda is (partially) based on, and how you would go about working out how deep it is?



For example, CAIS and something like the "classical superintelligence in a box" picture disagree a lot at the surface level. However, if you look deeper, you will find many similar problems. A simple example: the problem of manipulating the operator, which has (in my view) a "hard core" involving both math and philosophy, where you want the AI to communicate with humans in a way that simultaneously allows a) the human to learn from the AI if the AI knows something about the world, b) the operator's values not to be "overwritten" by the AI, and c) moral progress not to be prohibited. In CAIS language this is connected to so-called manipulative services.

Or: one of the biggest hits of the past year is the mesa-optimisation paper. However, if you are familiar with prior work, you will notice that many of the solutions proposed for mesa-optimisers are similar or identical to solutions previously proposed for so-called "daemons" or "misaligned subagents". This is because the problems partially overlap (the mesa-optimisation framing is clearer and makes a stronger case for "this is what to expect by default"). ... (read more)

Not Buck, but one possibility is that people pursuing different high-level agendas have different intuitions about what's valuable, those kinds of disagreements are relatively difficult to resolve, and the best way to resolve them is to gather more "object-level" data.

Maybe people have already spent a fair amount of time having in-person discussions trying to resolve their disagreements, and haven't made progress, and this discourages them from writing up their thoughts because they think it won't be a good use of time. However, this line of reasoning might be mistaken -- it seems plausible to me that people entering the field of AI safety are relatively impartial judges of which intuitions do/don't seem valid, and the question of where new people in the field of AI safety should focus is an important one, and having more public disagreement would improve human capital allocation.

I think your sense is correct. I think that plenty of people have short docs on why their approach is good; I think basically no-one has long docs engaging thoroughly with the criticisms of their paths (I don't think Paul's published arguments defending his perspective count as complete; Paul has arguments that I hear him make in person that I haven't seen written up.)

My guess is that it's developed because various groups decided that it was pretty unlikely that they were going to be able to convince other groups of their work, and so they decided to just go their own ways. This is exacerbated by the fact that several AI safety groups have beliefs which are based on arguments which they're reluctant to share with each other.

(I was having a conversation with an AI safety researcher at a different org recently, and they couldn't tell me about some things that they knew from their job, and I couldn't tell them about things from my job. We were reflecting on the situation, and then one of us proposed the metaphor that we're like two people who were sliding on ice next to each other and then pushed away and have now chosen our paths and can't... (read more)

FWIW, it's not clear to me that AI alignment folks with different agendas have put less effort into (or have made less progress on) understanding the motivations for other agendas than is typical in other somewhat-analogous fields. Like, MIRI leadership and Paul have put >25 (and maybe >100, over the years?) hours into arguing about merits of their differing agendas (in person, on the web, in GDocs comments), and my impression is that central participants to those conversations (e.g. Paul, Eliezer, Nate) can pass the others' ideological Turing tests reasonably well on a fair number of sub-questions and down 1-3 levels of "depth" (depending on the sub-question), and that might be more effort and better ITT performance than is typical for "research agenda motivation disagreements" in small niche fields that are comparable on some other dimensions.

I suspect that things like the Alignment Newsletter are causing AI safety researchers to understand and engage with each other's work more; this seems good.

This is the goal, but it's unclear that it's having much of an effect. I feel like I relatively often have conversations with AI safety researchers where I mention something I highlighted in the newsletter, and the other person hasn't heard of it, or has a very superficial / wrong understanding of it (one that I think would be corrected by reading just the summary in the newsletter).

This is very anecdotal; even if there are times when I talk to people and they do know the paper that I'm talking about because of the newsletter, I probably wouldn't notice / learn that fact.

(In contrast, junior researchers are often more informed than I would expect, at least about the landscape, even if not the underlying reasons / arguments.)

The 2017 MIRI fundraiser post says "We plan to say more in the future about the criteria for strategically adequate projects in 7a" and also "A number of the points above require further explanation and motivation, and we’ll be providing more details on our view of the strategic landscape in the near future". As far as I can tell, MIRI hasn't published any further explanation of this strategic plan. Is MIRI still planning to say more about its strategic plan in the near future, and if so, is there a concrete timeframe (e.g. "in a few months", "in a year", "in two years") for publishing such an explanation?

(Note: I asked this question a while ago on LessWrong.)

Oops, I saw your question when you first posted it on LessWrong but forgot to get back to you, Issa. My apologies.

I think there are two main kinds of strategic thought we had in mind when we said "details forthcoming":

1. Thoughts on MIRI's organizational plans, deconfusion research, and how we think MIRI can help play a role in improving the future — this is covered by our November 2018 update post: https://intelligence.org/2018/11/22/2018-update-our-new-research-directions/

2. High-level thoughts on things like "what we think AGI developers probably need to do" and "what we think the world probably needs to do" to successfully navigate the acute risk period.

Most of the stuff discussed in "strategic background" (https://intelligence.org/2017/12/01/miris-2017-fundraiser/#3) is about 2: not MIRI's organizational plan, but our model of some of the things humanity likely needs to do in order for the long-run future to go well. Some of these topics are reasonably sensitive, and we've gone back and forth about how best to talk about them.

Within the macrostrategy / "high-level thoughts" part of the post, the densest part was maybe 7a. The criteria we listed for a strategically adequate AGI project were "strong opsec, research closure, trustworthy command, a commitment to the common good, security mindset, requisite resource levels, and heavy prioritization of alignment work". With most of these it's reasonably clear what's meant in broad strokes, though there's a lot more I'd like to say about the specifics. "Trustworthy command" and "a commitment to the common good" are maybe the most opaque.

By "trustworthy command" we meant things like:

  • The organization's entire command structure is fully aware of the difficulty and danger of alignment.
  • Non-technical leadership can't interfere and won't object if technical leadership needs to delete a co

AI takeoff: continuous or discontinuous?

I don’t know. When I try to make fake mathematical models of how AI progress works, they mostly come out looking pretty continuous. And AI Impacts has successfully pushed my intuitions in a slow takeoff direction, by exhaustively cataloging all the technologies which didn't seem to have discontinuous jumps in efficiency. But on the other hand it sometimes feels like there’s something that has to “click” before you can have your systems being smart in some important way; this pushes me towards a discontinuous model. Overall I feel very confused.

When I talk to Paul Christiano about takeoff I feel persuaded by his arguments for slow takeoff, when I talk to many MIRI people I feel somewhat persuaded by their arguments for fast takeoff.

Do you have any thoughts on Qualia Research Institute?

I feel pretty skeptical of their work and their judgement.

I am very unpersuaded by their Symmetry Theory of Valence, which I think is summarized by "Given a mathematical object isomorphic to the qualia of a system, the mathematical property which corresponds to how pleasant it is to be that system is that object's symmetry".

I think of valence as the kind of thing which is probably encoded into human brains by a bunch of complicated interconnected mechanisms rather than by something which seems simple from the perspective of an fMRI-equipped observer, so I feel very skeptical of this. Even if it was true about human brains, I’d be extremely surprised if the only possible way to build a conscious goal-directed learning system involved some kind of symmetrical property in the brain state, so this would feel like a weird contingent fact about humans rather than something general about consciousness.

And I’m skeptical of their judgement for reasons like the following. Michael Johnson, the ED of QRI, wrote:

I’ve written four pieces of philosophy I consider profound. [...]
The first, Principia Qualia, took me roughly six years of obsessive work. [...]
... (read more)

Buck -- for an internal counterpoint you may want to discuss QRI's research with Vaniver. We had a good chat about what we're doing at the Boston SSC meetup, and Romeo attended a MIRI retreat earlier in the summer and had some good conversations with him there also.

To put a bit of a point on this, I find the "crank philosophy" frame a bit questionable if you're using only thin-slice outside view and not following what we're doing. Probably, one could use similar heuristics to pattern-match MIRI as "crank philosophy" also (probably, many people have already done exactly this to MIRI, unfortunately).

FWIW I agree with Buck's criticisms of the Symmetry Theory of Valence (both content and meta) and also think that some other ideas QRI are interested in are interesting. Our conversation on the road trip was (I think) my introduction to Connectome Specific Harmonic Waves (CSHW), for example, and that seemed promising to think about.

I vaguely recall us managing to operationalize a disagreement, let me see if I can reconstruct it:

A 'multiple drive' system, like PCT's hierarchical control system, has an easy time explaining independent desires and different flavors of discomfort. (If one both has a 'hunger' control system and a 'thirst' control system, one can easily track whether one is hungry, thirsty, both, or neither.) A 'single drive' system, like expected utility theories more generally, has a somewhat more difficult time explaining independent desires and different flavors of discomfort, since you only have the one 'utilon' number.
But this is mostly because we're looking at different parts of the system as the 'value'. If I have a vector of 'control errors', I get the nice multidimensional prop
... (read more)
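A minimal sketch of the multiple-drive vs single-drive contrast above (my own illustration; the two drives, the setpoints, and the "negated sum of absolute errors" utility are all assumptions made for concreteness):

```python
# Toy contrast: a vector of control errors can report *which* drive is
# unsatisfied; collapsing the vector into one "utilon" number cannot.
# (Illustrative only; drives, setpoints, and the utility rule are invented.)

def control_errors(hunger: float, thirst: float,
                   hunger_setpoint: float = 1.0,
                   thirst_setpoint: float = 1.0) -> dict:
    """Per-drive errors, in the style of a PCT hierarchical control system."""
    return {"hunger": hunger_setpoint - hunger,
            "thirst": thirst_setpoint - thirst}

def scalar_utility(errors: dict) -> float:
    """Single-drive summary: one number (negated total absolute error)."""
    return -sum(abs(e) for e in errors.values())

# Two different states of discomfort...
hungry = control_errors(hunger=0.2, thirst=1.0)   # hungry, not thirsty
thirsty = control_errors(hunger=1.0, thirst=0.2)  # thirsty, not hungry

# ...are distinguishable from the error vector, but collapse to the same
# scalar utility, losing the "flavor" of the discomfort.
assert hungry != thirsty
assert scalar_utility(hungry) == scalar_utility(thirsty)
```

As the comment above notes, this difference is mostly about which part of the system one labels "the value": the scalar view still has the error vector available internally; it just isn't what the single utility number reports.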
I think this is a great description. "What happens if we seek out symmetry gradients in brain networks, but STV isn't true?" is something we've considered, and determining ground-truth is definitely tricky. I refer to this scenario as the "Symmetry Theory of Homeostatic Regulation": https://opentheory.net/2017/05/why-we-seek-out-pleasure-the-symmetry-theory-of-homeostatic-regulation/ (mostly worth looking at the title image, no need to read the post).

I'm (hopefully) about a week away from releasing an update to some of the things we discussed in Boston, basically a unification of Friston/Carhart-Harris's work on FEP/REBUS with Atasoy's work on CSHW -- will be glad to get your thoughts when it's posted.
Oh, an additional detail that I think was part of that conversation: there's only really one way to have a '0-error' state in a hierarchical controls framework, but there are potentially many consonant energy distributions that are dissonant with each other. Whether or not that's true, and whether each is individually positive valence, will be interesting to find out. (If I had to guess, I would guess the different mutually-dissonant internally-consonant distributions correspond to things like 'moods', in a way that means they're not really value but are somewhat close, and also that they exist. The thing that seems vaguely in this style are differing brain waves during different cycles of sleep, but I don't know if those have clear waking analogs, or what they look like in the CSHW picture.)

Most things that look crankish are crankish.

I think that MIRI looks kind of crankish from the outside, and this should indeed make people initially more skeptical of us. I think that we have a few other external markers of legitimacy now, such as the fact that MIRI people were thinking and writing about AI safety from the early 2000s and many smart people have now been persuaded that this is indeed an issue to be concerned with. (It's not totally obvious to me that these markers of legitimacy mean that anyone should take us seriously on the question "what AI safety research is promising".) When I first ran across MIRI, I was kind of skeptical because of the signs of crankery; I updated towards them substantially because I found their arguments and ideas compelling, and people whose judgement I respected also found them compelling.

I think that the signs of crankery in QRI are somewhat worse than 2008 MIRI's signs of crankery.

I also think that I'm somewhat qualified to assess QRI's work (as someone who's spent ~100 paid hours thinking about philosophy of mind in the last few years), and when I look at it, I think it looks pretty crankish and wrong.

QRI is tackling a very difficult problem, as is MIRI. It took many, many years for MIRI to gather external markers of legitimacy. My inside view is that QRI is on the path of gaining said markers; for people paying attention to what we're doing, I think there's enough of a vector right now to judge us positively. I think these markers will be obvious from the 'outside view' within a short number of years.

But even without these markers, I'd poke at your position from a couple angles:

I. Object-level criticism is best

First, I don't see evidence you've engaged with our work beyond very simple pattern-matching. You note that "I also think that I'm somewhat qualified to assess QRI's work (as someone who's spent ~100 paid hours thinking about philosophy of mind in the last few years), and when I look at it, I think it looks pretty crankish and wrong." But *what* looks wrong? Obviously doing something new will pattern-match to crankish, regardless of whether it is crankish, so in terms of your rationale-as-stated, I don't put too much stock in your pattern detection (and perhaps you shouldn't either). If we want to avoid... (read more)

cf. Jeff Kaufman on MIRI circa 2003: https://www.jefftk.com/p/yudkowsky-and-miri

For a fuller context, here is my reply to Buck's skepticism about the 80% number during our back-and-forth on Facebook. As a specific comment, the number is loosely held, more of a conversation-starter than anything else. As a general comment, I'm skeptical of publicly passing judgment on my judgment based on one offhand (and unanswered; it was not engaged with) comment on Facebook. Happy to discuss details in a context where we'll actually talk to each other. :)

--------------my reply from the Facebook thread a few weeks back--------------

I think the probability question is an interesting one -- one frame is to ask: what is the leading alternative to STV?

At its core, STV assumes that if we have a mathematical representation of an experience, the symmetry of this object will correspond to how pleasant the experience is. The latest addition to this (what we're calling 'CDNS') assumes that consonance under Selen Atasoy's harmonic analysis of brain activity (connectome-specific harmonic waves, CSHW) is a good proxy for this in humans. This makes relatively clear predictions across all human states and could fairly easily be extended to non-human animals, in... (read more)

Ben Pace:
I’m having a hard time understanding whether everything below the dotted lines is something you just wrote, or a full quote from an old thread. The first time I read it I thought the former, and on reread think the latter. Might you be able to make it more explicit at the top of your comment?
Thanks, added.
We're pretty up-front about our empirical predictions; if critics would like to publicly bet against us we'd welcome this, as long as it doesn't take much time away from our research. If you figure out a bet we'll decide whether to accept it or reject it, and if we reject it we'll aim to concisely explain why.

Mike, while I appreciate the empirical predictions of the symmetry theory of valence, I have a deeper problem with QRI philosophy, and it makes me skeptical even if the predictions come to bear.

In physics, there are two distinctions we can make about our theories:

  • Disputes over what we predict will happen.
  • Disputes over the interpretation of experimental results.

The classic Many Worlds vs. Copenhagen is a dispute of the second kind, at least until someone can create an experiment which distinguishes the two. Another example of the second type of dispute is special relativity vs. Lorentz ether theory.

Typically, philosophers of science and most people who follow LessWrong philosophy will say that the way to resolve disputes of the second kind is to find out which interpretation is simplest. That's one reason why most people favor Einstein's special relativity over Lorentz ether theory.

However, simplicity of an interpretation is often hard to measure. It's made more complicated for two reasons:

  • First, there's no formal way of measuring simplicity, even in principle, in a way that is language-independent.
  • Second, there are ontological disputes about what type of the
... (read more)
Thanks Matthew! I agree issues of epistemology and metaphysics get very sticky very quickly when speaking of consciousness. My basic approach is "never argue metaphysics when you can argue physics" -- the core strategy we have for "proving" we can mathematically model qualia is to make better and more elegant predictions using our frameworks, with predicting pain/pleasure from fMRI data as the pilot project.

One way to frame this is that at various points in time, it was completely reasonable to be a skeptic about modeling things like lightning, static, and magnetic lodestones mathematically. This is true to an extent even after Faraday and Maxwell formalized things. But over time, with more and more unusual predictions and fantastic inventions built around electromagnetic theory, it became less reasonable to be skeptical of such.

My metaphysical arguments are in my "Against Functionalism" piece, and to date I don't believe any commenters have addressed my core claims: https://forum.effectivealtruism.org/posts/FfJ4rMTJAB3tnY5De/why-i-think-the-foundational-research-institute-should#6Lrwqcdx86DJ9sXmw

But I think metaphysical arguments change distressingly few people's minds. Experiments, and especially technology, change people's minds. So that's what our limited optimization energy is pointed at right now.
Agreed :). My main claim was that by only arguing physics, I will never agree with your theory, because your theory assumes the existence of elementary stuff that I don't believe in. Therefore, I don't understand how this really helps. Would you be prepared to say the same about many worlds vs. consciousness-causes-collapse theories? (Let's assume that we have no experimental data which distinguishes the two theories.)

The problem with the analogy to magnetism and electricity is that it fails to match the pattern of my argument. In order to incorporate magnetism into our mathematical theory of physics, we merely added more mathematical parts. In this, I see a fundamental difference between the approach you take and the approach taken by physicists when they admit the existence of new forces or particles. In particular, your theory of consciousness does not just do the equivalent of adding a new force, or a mathematical law that governs matter, or re-orienting the geometry of the universe. It also posits that there is a dualism in physical stuff: that is, that matter can be identified as having both mathematical and mental properties.

Even if your theory did result in new predictions, I fail to see why I can't just leave out the mental interpretation of it and keep the mathematical bits for myself. To put it another way, if you are saying that symmetry can be shown to be the same as valence, then I feel I can always provide an alternative explanation that leaves out valence as a first-class object in our ontology. If you are merely saying that symmetry is definitionally equivalent to valence, then your theory is vacuous, because I can just delete that interpretation from my mathematical theory and emerge with equivalent predictions about the world. And in practice, I would probably do so, because symmetry is not the kind of thing I think about when I worry about suffering.

I agree that if you had made predictions that classical neuroscientists all agreed would never occur,
I think we actually mostly agree: QRI doesn't "need" you to believe that qualia are real, that symmetry in some formalism of qualia corresponds to pleasure, or that there is any formalism about qualia to be found at all. If we find some cool predictions, you can strip out any mention of qualia from them and use them within the functionalism frame. As you say, the existence of some cool predictions won't force you to update your metaphysics (your understanding of which things are ontologically "first-class objects"). But you won't be able to copy our generator by doing that -- the thing that created those novel predictions -- and I think that's significant, and gets into questions of elegance metrics and philosophy of science.

I actually think the electromagnetism analogy is a good one: skepticism is always defensible, and in 1600, 1700, 1800, 1862, and 2018, people could be skeptical of whether there's "deep unifying structure" behind these things we call static, lightning, magnetism, shocks, and so on. But it was much more reasonable to be skeptical in 1600 than in 1862 (the year Maxwell's Equations were published), and more reasonable in 1862 than in 2018 (the era of the iPhone). Whether there is "deep structure" in qualia is of course an open question in 2019. I might suggest STV is equivalent to a very early draft of Maxwell's Equations: not a full systematization of qualia, but something that can be tested and built on in order to get there. And one that potentially ties together many disparate observations into a unified frame, and offers novel / falsifiable predictions (which seem incredibly worth trying to falsify!).

I'd definitely push back on the frame of dualism, although this might be a terminology nitpick: my preferred frame here is monism: https://opentheory.net/2019/06/taking-monism-seriously/ -- and perhaps this somewhat addresses your objection that "QRI posits the existence of too many things
I would think this might be our crux (other than perhaps the existence of qualia themselves). I imagine any predictions you produce can be adequately captured in a mathematical framework that makes no reference to qualia as ontologically primitive. And if I had such a framework, then I would have access to the generator, full stop. Adding qualia doesn't make the generator any better -- it just adds unnecessary mental stuff that isn't actually doing anything for the theory. I am not super confident in anything I said here, although that's mostly because I have an outside view that tells me consciousness is hard to get right. My inside view tells me that I am probably correct, because I just don't see how positing mental stuff that's separate from mathematical law can add anything whatsoever to a physical theory. I'm happy to talk more about this some day, perhaps in person. :)
Hey Mike, I'm a community moderator at Metaculus and am generally interested in creating more EA-relevant questions. Are your predictions explicitly listed somewhere? It would be great to add at least some of them to the site.
Hey Pablo! I think Andres has a few up on Metaculus; I just posted QRI's latest piece of neuroscience here, which has a bunch of predictions (though I haven't separated them out from the text): https://opentheory.net/2019/11/neural-annealing-toward-a-neural-theory-of-everything/

I think it would be worthwhile to separate these out from the text, and (especially) to generate predictions that are crisp, distinctive, and can be resolved in the near term. The QRI questions on metaculus are admirably crisp (and fairly near term), but not distinctive (they are about whether certain drugs will be licensed for certain conditions - or whether evidence will emerge supporting drug X for condition Y, which offer very limited evidence for QRI's wider account 'either way').

This is somewhat more promising from your most recent post:

I’d expect to see substantially less energy in low-frequency CSHWs [Connectome-Specific Harmonic Waves] after trauma, and substantially more energy in low-frequency CSHWs during both therapeutic psychedelic use (e.g. MDMA therapy) and during psychological integration work.

This is crisp, plausibly distinctive, yet resolving this requires a lot of neuroimaging work which (presumably) won't be conducted anytime soon. In the interim, there isn't much to persuade a sceptical prior.

I see this and appreciate it; the problem is that I want to bet on something like "your overall theory is wrong", but I don't know enough neuroscience to know whether the claims you're making are things that are probably true for reasons unrelated to your overall theory. If you could find someone who I trusted who knew neuroscience and who thought your predictions seemed unlikely, then I'd bet with them against you.
We’ve looked for someone from the community to do a solid ‘adversarial review’ of our work, but we haven’t found anyone that feels qualified to do so and that we trust to do a good job, aside from Scott, and he's not available at this time. If anyone comes to mind do let me know!
See also this recent Qualia Computing post about the orthogonality thesis: https://www.lesswrong.com/posts/uRT5ukYQ9fBYpva9p/link-is-the-orthogonality-thesis-defensible-qualia-computing (Qualia Computing is the blog of QRI's research director.)

How would you describe the general motivation behind MIRI's research approach? If you feel you don't want to answer that, feel free to restrict this specifically to the agent foundations work.

I’m speaking very much for myself and not for MIRI here. But, here goes (this is pretty similar to the view described here):

If we build AI systems out of business-as-usual ML, we’re going to end up with systems probably trained with some kind of meta learning (as described in Risks from Learned Optimization) and they’re going to be completely uninterpretable and we’re not going to be able to fix the inner alignment. And by default our ML systems won’t be able to handle the strain of doing radical self-improvement, and they’ll accidentally allow their goals to shift as they self-improve (in the same way that if you tried to make a physicist by giving a ten year old access to a whole bunch of crazy mind altering/enhancing drugs and the ability to do brain surgery on themselves, you might have unstable results). We can’t fix this with things like ML transparency or adversarial training or ML robustness. The only hope of building aligned really-powerful-AI-systems is having a much clearer picture of what we’re doing when we try to build these systems.

Thanks :)

I'm hearing "the current approach will fail by default, so we need a different approach. In particular, the new approach should be clearer about the reasoning of the AI system than current approaches."

Noticeably, that's different from a positive case that sounds like "Here is such an approach and why it could work."

I'm curious how much of your thinking is currently split between the two rough possibilities below.


First:

I don't know of another approach that could work. So while I may personally feel more able to understand some people's ideas than others, many people's very different concrete suggestions for approaches to understanding these systems better are all arguably similar in terms of how likely we should think they are to pan out, and how many resources we should want to put behind them.

Alternatively, second:

While it's incredibly difficult to communicate mathematical intuitions of this depth, my sense is I can see a very attractive case for why one or two particular efforts (e.g. MIRI's embedded agency work) could work out.

What do you think are the biggest mistakes that the AI Safety community is currently making?

Paul Christiano is a lot more optimistic than MIRI about whether we could align a Prosaic AGI. In a relatively recent interview with AI Impacts he said he thinks "probably most of the disagreement" about this lies in the question of "can this problem [alignment] just be solved on paper in advance" (Paul thinks there's "at least a third chance" of this, but suggests MIRI's estimate is much lower). Do you have a sense of why MIRI and Paul disagree so much on this estimate?

I think Paul is probably right about the causes of the disagreement between him and many researchers, and the summary of his beliefs in the AI Impacts interview you linked matches my impression of his beliefs about this.

What has been the causal history of you deciding that it was worth leaving your previous job to work with MIRI? Many people have a generic positive or negative view of MIRI, but it's much stronger to decide to actually work there.

Earning to give started looking worse and worse the more that I increased my respect for Open Phil; by 2017 it seemed mostly obvious that I shouldn’t earn to give. I stayed at my job for a few months longer because two prominent EAs gave me the advice to keep working at my current job, which in hindsight seems like an obvious mistake and I don’t know why they gave that advice. Then in May, MIRI advertised a software engineer internship program which I applied to; they gave me an offer, but I would have had to quit my job to take the offer, and Triplebyte (which I’d joined as the first engineer) was doing quite well and I expected that if I got another software engineering job it would have much lower pay. After a few months I decided that there were enough good things I could be doing with my time that I quit Triplebyte and started studying ML (and also doing some volunteer work for MIRI doing technical interviewing for them).

I tried to figure out whether MIRI’s directions for AI alignment were good, by reading a lot of stuff that had been written online; I did a pretty bad job of thinking about all this.

At this point MIRI offered me a full time job and ... (read more)

I'm curious about why you think you did a bad job at this. Could you roughly explain what you did and what you should have done instead?

On the SSC roadtrip post, you say "After our trip, I'll write up a post-mortem for other people who might be interested in doing things like this in the future". Are you still planning to write this, and if so, when do you expect to publish it?

I wrote it but I’ve been holding off on publishing it. I’ll probably edit it and publish it within the next month. If you really want to read it, email me and I’ll share you on it.
Published today: "EA residencies" as an outreach activity [https://forum.effectivealtruism.org/posts/yrSiWNypE6AMNApDi/ea-residencies-as-an-outreach-activity]

Q1: Has MIRI noticed a significant change in funding following the change in disclosure policy?

Q2: If yes to Q1, what was the direction of the change?

Q3: If yes to Q1, were you surprised by the degree of the change?


Q4: If yes to Q3, in which direction were you surprised?

It’s not clear what effect this has had, if any. I am personally somewhat surprised by this--I would have expected more people to stop donating to us.

I asked Rob Bensinger about this; he summarized it as “We announced nondisclosed-by-default in April 2017, and we suspected that this would make fundraising harder. In fact, though, we received significantly more funding in 2017 (https://intelligence.org/2019/05/31/2018-in-review/#2018-finances), and have continued to receive strong support since then. I don't know that there's any causal relationship between those two facts; e.g., the obvious thing to look at in understanding the 2017 spike was the cryptocurrency price spike that year. And there are other factors that changed around the same time too, e.g., Colm [who works at MIRI on fundraising among other things] joining MIRI in late 2016.”

What would you be working on if you were working on something else?

Here are some things I might do:

  • Inside AI alignment:
    • I don’t know if this is sufficiently different from my current work to be an interesting answer, but I can imagine wanting to work on AIRCS full time, possibly expanding it in the following ways:
      • Run more of the workshops. This would require training more staff. For example, Anna Salamon currently leads all the workshops and wouldn’t have time to run twice as many.
      • Expand the scope of the workshops to non-CS people, for example non-technical EAs.
      • Expand the focus of the workshop from AI to be more general. Eg I’ve been thinking about running something tentatively called “Future Camp”, where people come in and spend a few days thinking and learning about longtermism and what the future is going to be like, with the goal of equipping people to think more clearly about futurism questions like the timelines of various transformative technologies and what can be done to make those technologies go better.
      • Making the workshop be more generally about EA. The idea would be that the workshop does the same kind of thing that EA Global tries to do for relatively new EAs--expose them to more experienced EAs and t
... (read more)

Nate Soares once described autodidacting to prepare for a job at MIRI. For each position at MIRI (agent foundations, machine learning alignment, software engineer, type theorist, machine learning living library), what should one study if one wanted to do something like that today? (i.e. for agent foundations, is Scott Garrabrant’s suggestion of “Learn enough math to understand all fixed point theorems” essentially correct? Is there anything else one needs to know?)

I don't know what you need to know in order to do agent foundations research; I trust Scott's answer.

If you're seriously considering autodidacting to prepare for a non-agent-foundations job at MIRI, you should email me (buck@intelligence.org) about your particular situation and I'll try to give you personal advice. If too many people email me asking about this, I'll end up writing something publicly.

In general, I'd rather that people talk to me before they study a lot for a MIRI job rather than after, so that I can point them in the right direction and they don't waste effort learning things that aren't going to make the difference to whether we want to hire them.

And if you want to autodidact to work on agent foundations at MIRI, consider emailing someone on the agent foundations team. Or you could try emailing me and I can try to help.

Given an aligned AGI, what is your point estimate for the TOTAL (across all human history) cost in USD of having aligned it?

To hopefully spare you a bit of googling without unduly anchoring your thinking, Wiki says the Manhattan Project cost $21-23 billion in 2018 USD, with only about 3.7% or $786m of that being research and development.

Ben Pace:
This is such an interesting question. I’m not sure I have a sensible answer. Like, I feel like the present bottleneck on Alignment progress is entirely a question of getting good people doing helpful conceptual work, but afterwards indeed a lot of funding will be needed to align the AI, and I’ve not got a sense of how much money we might need to keep aside until then - e.g. is it more or less than OpenPhil’s current total?

Over the years, you have published several pieces on ways you've changed your mind (e.g. about EA, another about EA, weird ideas, hedonic utilitarianism, and a bunch of other ideas). While I've enjoyed reading the posts and the selection of ideas, I've also found most of the posts frustrating (the hedonic utilitarianism one is an exception) because they mostly only give the direction of the update, without also giving the reasoning and additional evidence that caused the update* (e.g. in the EA post you write "I am erring on the side of writing this faster and including more of my conclusions, at the cost of not very clearly explaining why I’ve shifted positions"). Is there a reason you keep writing in this style (e.g. you don't have time, or you don't want to "give away the answers" to the reader), and if so, what is the reason?

*Why do I find this frustrating? My basic reasoning is something like this: I think this style of writing forces the reader to do a weird kind of Aumann reasoning where they have to guess what evidence/arguments Buck might have had at the start, and what evidence/arguments he subsequently saw, in order to try to reconstruct the update. When I encounter this

... (read more)

I think a major reason that I write those posts is because I’m worried that I persuaded someone of claims that I now believe to be false; I hope that if they hear I changed my mind, they might feel more skeptical of arguments that I made in the past.

I mostly don’t try to write up the arguments that persuaded me because it feels hard and time-consuming.

I’m definitely not trying to avoid “giving away the answers”. I’m sorry that this is annoying :( I definitely don’t feel that it’s unreasonable for people to not try to update on my mind changing.

I agree with Issa about the costs of not giving reasons. My guess is that over the long run, giving reasons why you believe what you believe will be a better strategy to avoid convincing people of false things. Saying you believed X and now believe ~X seems like it's likely to convince people of ~X even more strongly.

It seems like there are many more people who want to get into AI safety, and MIRI's fundamental research, than there is room to mentor and manage them. There are also many independent / volunteer researchers.

It seems that your current strategy is to focus on training, hiring and outreaching to the most promising talented individuals. Other alternatives might include more engagement with amateurs, and providing more assistance for groups and individuals that want to learn and conduct independent research.

Do you see it the same way? This strategy makes a lot of sense, but I am curious about your take on it. Also, what would change if you had 10 times the amount of management and mentorship capacity?

It seems that your current strategy is to focus on training, hiring and outreaching to the most promising talented individuals.

This seems like a pretty good summary of the strategy I work on, and it's the strategy that I'm most optimistic about.

Other alternatives might include more engagement with amateurs, and providing more assistance for groups and individuals that want to learn and conduct independent research.

I think that it would be quite costly and difficult for more experienced AI safety researchers to try to cause more good research to happen by engaging more with amateurs or providing more assistance to independent research. So I think that experienced AI safety researchers are probably going to do more good by spending more time on their own research than by trying to help other people with theirs. This is because I think that experienced and skilled AI safety researchers are much more productive than other people, and because I think that a reasonably large number of very talented math/CS people become interested in AI safety every year, so we can set a pretty high bar for which people to spend a lot of time with.

Also, what would change if you had 10 times th
... (read more)

What's been your experiences, positive and negative, of CFAR workshops?

I think they're OK. I think some CFAR staff are really great. I think that their incidental effect of causing people to have more social ties to the rationalist/EA Bay Area community is probably pretty good.

I've done CFAR-esque exercises at AIRCS workshops which were very helpful to me. I think my general sense is that a bunch of CFAR material has a "true form" which is pretty great, but I didn't get the true form from my CFAR workshop, I got it from talking to Anna Salamon (and somewhat from working with other CFAR staff).

I think that for (possibly dumb) personal reasons I get more annoyed by them than some people, which prevents me from getting as much value out of them.

I generally am glad to hear that an EA has done a CFAR workshop, and normally recommend that EAs do them, especially if they don’t have as much social connection to the EA/rationalist scene, or if they don’t have high opportunity cost to their time.

For what it's worth, I wouldn't describe the social ties thing as incidental—it's one of the main things CFAR is explicitly optimizing for. For example, I'd estimate (my colleagues might quibble with these numbers some) it's 90% of the reason we run alumni reunions, 60% of the reason we run instructor & mentorship trainings, 30% of the reason we run mainlines, and 15% of the reason we co-run AIRCS.

Yeah, makes sense; I didn’t mean “unintentional” by “incidental”.
How long ago did you attend your CFAR workshop? My sense is that the content CFAR teaches and who the teachers are have changed a lot over the years. Maybe they've gotten better (or worse?) about teaching the "true form." (Or maybe you were saying you also didn't get the "true form" even in the more recent AIRCS workshops?)

What's a belief that you hold that most people disagree with you about? I'm including most EAs and rationalists.

Found elsewhere on the thread, a list of weird beliefs that Buck holds: http://shlegeris.com/2018/10/23/weirdest

Do you have the intuition that a random gifted person can contribute to technical research on AI safety?

It's happened a few times at our local meetup (South Bay EA) that we get someone new who says something like “okay I’m a fairly good ML student who wants to decide on a research direction for AI Safety.” In the past we've given fairly generic advice like "listen to this 80k podcast on AI Safety" or "apply to AIRCS". One of our attendees went on to join OpenAI's safety team after this advice, and gave us some attribution for it. While this probably makes folks a little better off, it feels like we could do better for them.

If you had to give someone more concrete object-level advice on how to get started in AI safety, what would you tell them?

I’m a fairly good ML student who wants to decide on a research direction for AI Safety.

I'm not actually sure whether I think it's a good idea for ML students to try to work on AI safety. I am pretty skeptical of most of the research done by pretty good ML students who try to make their research relevant to AI safety--it usually feels to me like their work ends up not contributing to one of the core difficulties, and I think they might have been better off spending that effort becoming really good at ML, with the goal of working on AI safety later.

I don't have very much better advice for how to get started on AI safety; I think the "recommend to apply to AIRCS and point at 80K and maybe the Alignment Newsletter" path is pretty reasonable.

I'm broadly sympathetic to this, but I also want to note that there are some research directions in mainstream ML which do seem significantly more valuable than average. For example, I'm pretty excited about people getting really good at interpretability, so that they have an intuitive understanding of what's actually going on inside our models (particularly RL agents), even if they have no specific plans about how to apply this to safety.

I asked a question on LessWrong recently that I'm curious for your thoughts on. If you don't want to read the full text on LessWrong, the short version is: Do you think it has become harder recently (say 2013 vs 2019) to become a mathematician at MIRI? Why or why not?

I'm not sure; my guess is that it's somewhat harder, because we're enthusiastic about our new research directions and have moved some management capacity towards those, and those directions have relatively more room for engineering skillsets vs pure math skillsets.

What are some bottlenecks in your research productivity?

Here are several, together with the percentage of my productivity which I think they cost me over the last year:

  • I’ve lost a couple months of productivity over the last year due to some weird health issues--I was really fatigued and couldn’t think properly for several months. This was terrible, but the problem seems to have gone away for now. This had a 25% cost.
  • I am a worse researcher because I spend half my time doing other things than research. It’s unclear to me how much efficiency this costs me. Potential considerations:
    • When I don’t feel like doing technical work, I can do other work. This should increase my productivity. But maybe it lets me procrastinate on important work.
    • I remember less of the context of what I’m working on, because my memory is spaced out.
    • My nontechnical work often feels like social drama and is really attention-grabbing and distracting.
    • Overall this costs me maybe 15%; I’m really unsure about this though.
  • I’d be a better researcher if I were smarter and more knowledgeable. I’m working on the knowledgeableness problem with the help of tutors and by spending some of my time studying. It’s unclear how to figure out how costly this is. If I’d spent a year working as a Haskell programmer in industry, I’d probably be like 15% more effective now.
Thanks! I'm sorry to hear about your health problems, but I'm glad it's better now :)

[Meta] During the AMA, are you planning to distinguish (e.g. by giving short replies) between the case where you can't answer a question due to MIRI's non-disclosure policy vs the case where you won't answer a question simply because there isn't enough time/it's too much effort to answer?

I don't expect to not answer any questions because of MIRI non-disclosure stuff.

What other crazy ideas do you have about EA outreach?

How do you see success/an "existential win" playing out in short timeline scenarios (e.g. less than 10 years until AGI) where alignment is non-trivial/turns out to not solve itself "by default"? For example, in these scenarios do you see MIRI building an AGI, or assisting/advising another group to do so, or something else?

It's getting late and it feels hard to answer this question, so I'm only going to say briefly:

  • For something MIRI wrote re this, see the "strategic background" section here [https://intelligence.org/2017/12/01/miris-2017-fundraiser/]
  • I think there are cases where alignment is non-trivial but prosaic AI alignment is possible, and some people who are cautious about AGI alignment are influential in the groups that are working on AGI development and cause them to put lots of effort into alignment (eg maybe the only way to align the thing involves spending an extra billion dollars on human feedback). Because of these cases, I am excited for the leading AI orgs having many people in important positions who are concerned about and knowledgeable about these issues.

What's something specific about this community that you're grateful for?

Many things. Two that come to mind:

  • Even though most of my friends are mostly consequentialist in their EA actions, I feel like they’re also way more interested than average in figuring out what the right thing to do is from a wide variety of moral perspectives; I feel extremely warm about the scrupulosity and care of EAs, even towards things that don’t matter much.
  • I think EAs are often very naturally upset by injustices which are enormous but which most people don’t think to be upset by; I feel this way particularly because of anti-death sentiments, care for farmed and wild animals, and support for open borders.

Will we solve the alignment problem before crunch time?

I think it's plausible that "solving the alignment problem" isn't a very clear way of phrasing the goal of technical AI safety research. Consider the question "will we solve the rocket alignment problem before we launch the first rocket to the moon"--to me the interesting question is whether the first rocket to the moon will indeed get there. The problem isn't really "solved" or "not solved", the rocket just gets to the moon or not. And it's not even obvious whether the goal is to align the first AGI; maybe the question is "what proportion of resources controlled by AI systems end up being used for human purposes", where we care about a weighted proportion of AI systems which are aligned.

I am not sure whether I'd bet for or against the proposition that humans will go extinct for AGI-misalignment-related-reasons within the next 100 years.

Apologies, aren't we already in crunch time? Are you referring to this comment [https://www.lesswrong.com/posts/YduZEfz8usGbJXN4x/transcription-of-eliezer-s-january-2010-video-q-and-a] from Eliezer Yudkowsky?
Ben Pace:
Sure. "Crunch time" is not exactly a technically precise term, and it is quite likely our time is measured in decades. The thing I want to ask is whether Buck expects the timeline will fully run out before we solve alignment, or whether we'll manage to successfully build an AGI that helps us achieve our values and an existential win, or whether something different will happen instead.
I see. I asked only because I was confused why you asked "before crunch time" rather than leaving that part out.

Do you think non-altruistic interventions for AI alignment (i.e. AI safety "prepping") make sense? If so, do you have suggestions for concrete actions to take, and if not, why do you think they don't make sense?

(Note: I previously asked a similar question addressed at someone else, but I am curious for Buck's thoughts on this.)

I don't think you can prep that effectively for x-risk-level AI outcomes, obviously.

I think you can prep for various transformative technologies; you could for example buy shares of computer hardware manufacturers if you think that they'll be worth more due to increased value of computation as AI productivity increases. I haven't thought much about this, and I'm sure this is dumb for some reason, but maybe you could try to buy land in cheap places in the hope that in a transhuman utopia the land will be extremely valuable (the property rights might not carry through, but it might be worth the gamble for sufficiently cheap land).

I think it's probably at least slightly worthwhile to do good and hope that you can sell some of your impact certificates after good AI outcomes.

You should ask Carl Shulman, I'm sure he'd have a good answer.

Is there any public information on the AI Safety Retraining Program other than the MIRI Summer Update and the Open Phil grant page?

I am wondering:

1) Who should apply? How do they apply?

2) Have there been any results yet? I see two grants were given as of Sep 1st; have either of those been completed? If so, what were the outcomes?

I don't think there's any other public information. To apply, people should email me asking about it (buck@intelligence.org). The three people who've received one of these grants were all people who I ran across in my MIRI recruiting efforts. Two grants have been completed and a third is ongoing. Of the two people who completed grants, both successfully replicated several deep RL papers, and one of them ended up getting a job working on AI safety stuff (the other took a data science job and hopes to work on AI safety at some point in the future). I'm happy to answer more questions about this.
So, to clarify: this program is for people who are already mostly sure they want to work on AI Safety? That is, a person who is excited about ML, and would maaaaybe be interested in working on safety-related topics, if they found those topics interesting, is not who you are targeting?
Yeah, I am not targeting that kind of person. Someone who is excited about ML and skeptical of AI safety but interested in engaging a lot with AI safety arguments for a few months might be a good fit.

What do you think integrity means and how can you tell when someone has it?

What do you think is the main cause of burnout in people working on ambitious projects, especially those designed to reduce existential risk?

You write:

"I think that the field of AI safety is growing in an awkward way. Lots of people are trying to work on it, and many of these people have pretty different pictures of what the problem is and how we should try to work on it. How should we handle this? How should you try to work in a field when at least half the "experts" are going to think that your research direction is misguided?"

What are your preliminary thoughts on the answers to these questions?

As you've grown up and become more agentic and competent and thoughtful, how have your values changed? Have you discarded old values, have they changed in more subtle ways?

How efficiently could MIRI "burn through" its savings if it considered AGI sufficiently likely to be imminent? In other words, if MIRI decided to spend all its savings in a year, how many normal-spending-years' worth of progress on AI safety do you think it would achieve?

Probably less than two.

Given a "bad" AGI outcome, how likely do you think a long-term worse-than-death fate for at least some people would be relative to extinction?

Idk. A couple percent? I'm very unsure about this.

Q1: How closely does MIRI currently coordinate with the Long-Term Future Fund (LTFF)?

Q2: How effective do you currently consider [donations to] the LTFF relative to [donations to] MIRI? Decimal coefficient preferred if you feel comfortable guessing one.

Q3: Do you expect the LTFF to become more or less effective relative to MIRI as AI capability/safety progresses?

(I've spent a few hours talking to people about the LTFF, but I'm not sure about things like "what order of magnitude of funding did they allocate last year" (my guess without looking it up is $1M, which turns out to be correct! [https://givingcompass.org/fund/long-term-future-fund/]), so take all this with a grain of salt.)

Re Q1: I don't know; I don't think that we coordinate very carefully.

Re Q2: I don't really know. When I look at the list of things the LTFF funded in August [https://www.lesswrong.com/posts/YnHX9ZRHunoA9td6K/long-term-future-fund-august-2019-grant-recommendations] or April [https://forum.effectivealtruism.org/posts/CJJDwgyqT4gXktq6g/long-term-future-fund-april-2019-grant-recommendations#How_good_of_a_target_group_are_Math_Olympiad_winners_for_these_effects_] (excluding regrants to orgs like MIRI, CFAR, and Ought), about 40% look meh (~0.5x MIRI), about 40% look like things which I'm reasonably glad someone funded (~1x MIRI), about 7% are things that I'm really glad someone funded (~3x MIRI), and 3% are things that I wish they hadn't funded (-1x MIRI). Note that my mean outcomes for the meh, good, and great categories are much higher than the median outcomes--a lot of them are "I think this is probably useless but seems worth trying for value of information". Apparently this adds up to thinking that they're 78% as good as MIRI.

Re Q3: I don't really know. My median outcome is that they turn out to do less well than my estimate above, but I think there's a reasonable probability that they turn out to be much better than that, and I'm excited to see them try to do good. This isn't really tied up with AI capability or safety progressing, though.
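For what it's worth, the "78% as good as MIRI" figure follows directly from the rough shares and multipliers stated above. A quick back-of-the-envelope recomputation (the category names and exact numbers here are just the approximate figures from the answer, not anything official):

```python
# Weighted relative value of LTFF grants vs. MIRI, using the rough
# shares and multipliers from the answer above. (The shares as stated
# sum to 0.9, not 1.0; this is an unnormalized weighted sum.)
weights = {
    "meh":   (0.40, 0.5),   # ~40% of grants at ~0.5x MIRI
    "good":  (0.40, 1.0),   # ~40% at ~1x MIRI
    "great": (0.07, 3.0),   # ~7% at ~3x MIRI
    "bad":   (0.03, -1.0),  # ~3% at -1x MIRI
}

relative_value = sum(share * mult for share, mult in weights.values())
print(f"{relative_value:.2f}")  # -> 0.78
```

So 0.4*0.5 + 0.4*1 + 0.07*3 + 0.03*(-1) does indeed come out to 0.78.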

In your opinion, what are the most helpful organizations or groups working on AI safety right now? And why?

In parallel: what are the least helpful organizations or groups working on (or claiming to work on) AI safety right now? And why?

I feel reluctant to answer this question because it feels like it would involve casting judgement on lots of people publicly. I think that there are a bunch of different orgs and people doing good work on AI safety.

Yeah, I am sympathetic to that. I am curious how you decide where to draw the line here. For instance, you were willing to express judgment of QRI elsewhere in the comments.

Would it be possible to briefly list the people or orgs whose work you *most* respect? Or would the omissions be too obvious?

I sometimes wish there were good ways to more broadly disseminate negative judgments or critiques of orgs/people from thoughtful and well-connected people. But, understandably, people are sensitive to that kind of thing, and it can end up eating a lot of time and weakening relationships.

How do you view the field of Machine Ethics? (I only now heard of it in this AI Alignment Podcast)

What are your regular go-to sources of information online? That is, are there certain blogs you religiously read? Vox? Do you follow the EA Forum or LessWrong? Do you mostly read papers that you find through some search algorithm you previously set up? Etc.

I don't consume information online very intentionally. Blogs I often read:

  • Slate Star Codex
  • The Unit of Caring
  • Bryan Caplan (despite disagreeing with him a lot, obviously)
  • Meteuphoric (Katja Grace)
  • Paul Christiano's various blogs

I often read the Alignment Newsletter. I mostly learn things from hearing about them from friends.

What are some of your favourite theorems, proofs, algorithms, data structures, and programming languages?

If an intelligent person was willing to prepare for any position at MIRI, but was currently unqualified for any, which would you want them to go for?