EDIT: I'm only going to answer a few more questions, due to time constraints. I might eventually come back and answer more. I still appreciate getting replies with people's thoughts on things I've written.
I'm going to do an AMA on Tuesday next week (November 19th). Below I've written a brief description of what I'm doing at the moment. Ask any questions you like; I'll respond to as many as I can on Tuesday.
Although I'm eager to discuss MIRI-related things in this AMA, my replies will represent my own views rather than MIRI's, and as a rule I won't be running my answers by anyone else at MIRI. Think of it as a relatively candid and informal Q&A session, rather than anything polished or definitive.
----
I'm a researcher at MIRI. At MIRI I divide my time roughly equally between technical work and recruitment/outreach work.
On the recruitment/outreach side, I do things like the following:
- For the AI Risk for Computer Scientists workshops (which are slightly badly named; we accept some technical people who aren't computer scientists), I handle the intake of participants, and also teach classes and lead discussions on AI risk at the workshops.
- I do most of the technical interviewing for engineering roles at MIRI.
- I manage the AI Safety Retraining Program, in which MIRI gives grants to people to study ML for three months with the goal of making it easier for them to transition into working on AI safety.
- I sometimes do weird things, like the Slate Star Codex roadtrip, where I led a group of EAs down the East Coast for five days, attending Slate Star Codex meetups and visiting EA groups.
On the technical side, I mostly work on some of our nondisclosed-by-default technical research; this involves thinking about various kinds of math and implementing things related to the math. Because the work isn't public, there are many questions about it that I can't answer. But this is my problem, not yours; feel free to ask whatever questions you like and I'll take responsibility for choosing to answer or not.
----
Here are some things I've been thinking about recently:
- I think that the field of AI safety is growing in an awkward way. Lots of people are trying to work on it, and many of these people have pretty different pictures of what the problem is and how we should try to work on it. How should we handle this? How should you try to work in a field when at least half the "experts" are going to think that your research direction is misguided?
- The AIRCS workshops that I'm involved with contain a variety of material which attempts to help participants think about the world more effectively. I have thoughts about what's useful and not useful about rationality training.
- I have various crazy ideas about EA outreach. I think the SSC roadtrip was good; I think some EAs who work at EA orgs should consider doing "residencies" in cities without much full-time EA presence, where they mostly do their normal job but also talk to people.
----
I actually agree with you about this. I have in mind a different distinction, although I might not be explaining it well.
Here’s another go:
Let’s suppose that some decisions are rational and others aren’t. We can then ask: What is it that makes a decision rational? What are the necessary and/or sufficient conditions? I think that this is the question that philosophers are typically trying to answer. The phrase “decision theory” in this context typically refers to a claim about necessary and/or sufficient conditions for a decision being rational. To use different jargon, in this context a “decision theory” refers to a proposed “criterion of rightness.”
When philosophers talk about “CDT,” for example, they are typically talking about a proposed criterion of rightness. Specifically, in this context, “CDT” is the claim that a decision is rational only if taking it would cause the largest expected increase in value. To avoid any ambiguity, let’s label this claim R_CDT.
We can also talk about “decision procedures.” A decision procedure is just a process or algorithm that an agent follows when making decisions.
For each proposed criterion of rightness, it’s possible to define a decision procedure that only outputs decisions that fulfill the criterion. For example, we can define P_CDT as a decision procedure that involves only taking actions that R_CDT claims are rational.
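To make the distinction concrete, here's a minimal sketch in Python (the function names are hypothetical, purely for illustration): a criterion of rightness like R_CDT is a standard for evaluating individual acts, while a decision procedure like P_CDT is an algorithm that actually outputs an act.

```python
# A criterion of rightness evaluates individual acts; it isn't itself a
# recipe for making decisions.
def is_rational_by_R_CDT(action, actions, causal_expected_value):
    """R_CDT: an act is rational only if no alternative act would cause a
    larger expected increase in value."""
    return all(causal_expected_value(action) >= causal_expected_value(a)
               for a in actions)

# A decision procedure is an algorithm an agent actually follows when
# choosing. P_CDT is the procedure that only outputs acts R_CDT endorses.
def p_cdt(actions, causal_expected_value):
    return max(actions, key=causal_expected_value)
```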
My understanding is that when philosophers talk about “CDT,” they primarily have in mind R_CDT. Meanwhile, it seems like members of the rationalist or AI safety communities primarily have in mind P_CDT.
The difference matters, because people who believe that R_CDT is true don’t generally believe that we should build agents that implement P_CDT or that we should commit to following P_CDT ourselves. R_CDT claims that we should do whatever will have the best effects -- and, in many cases, building agents that follow a decision procedure other than P_CDT is likely to have the best effects. More generally: Most proposed criteria of rightness imply that it can be rational to build agents that sometimes behave irrationally.
One possible criterion of rightness, which I’ll call R_UDT, is something like this: An action is rational only if it would have been chosen by whatever decision procedure would have produced the most expected value if consistently followed over an agent’s lifetime. For example, this criterion of rightness says that it is rational to one-box in the transparent Newcomb scenario because agents who consistently follow one-boxing policies tend to do better over their lifetimes.
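As a rough illustration of why the two criteria come apart here, the sketch below uses the standard Newcomb payoffs ($1,000 in the transparent box, $1,000,000 in the opaque box iff the predictor expected one-boxing); the function names and the assumption of a perfectly reliable predictor are mine, for illustration only. Evaluated act-by-act with the box contents held fixed, two-boxing always causally adds $1,000; evaluated at the level of policies, one-boxing does better over an agent's lifetime.

```python
# Standard illustrative Newcomb payoffs.
SMALL, BIG = 1_000, 1_000_000

def causal_ev_of_act(action, opaque_box_filled):
    """Act-level (R_CDT-style) evaluation: the box contents are already
    fixed when you choose, so two-boxing always causally adds $1,000."""
    opaque = BIG if opaque_box_filled else 0
    return opaque + (SMALL if action == "two-box" else 0)

def ev_of_policy(policy):
    """Policy-level (R_UDT-style) evaluation: assuming a reliable predictor,
    the opaque box is filled exactly when your policy is to one-box."""
    return causal_ev_of_act(policy, opaque_box_filled=(policy == "one-box"))

# Holding the contents fixed, two-boxing beats one-boxing act-by-act...
assert causal_ev_of_act("two-box", True) > causal_ev_of_act("one-box", True)
assert causal_ev_of_act("two-box", False) > causal_ev_of_act("one-box", False)
# ...but the one-boxing policy does better than the two-boxing policy.
assert ev_of_policy("one-box") > ev_of_policy("two-box")
```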
I could be wrong, but I associate the “success-first approach” with something like the claim that R_UDT is true. This would definitely constitute a really interesting and significant divergence from mainstream opinion within academic decision theory. Academic decision theorists should care a lot about whether or not it’s true.
But I’m also not sure if it matters very much, practically, whether R_UDT or R_CDT is true. It’s not obvious to me that they recommend building different kinds of decision procedures into AI systems. For example, both seem to recommend building AI systems that would one-box in the transparent Newcomb scenario.
I disagree that any of the distinctions here are purely semantic. But one could argue that normative anti-realism is true. In this case, there wouldn’t really be any such thing as the criterion of rightness for decisions. Neither R_CDT nor R_UDT nor any other proposed criterion would be “correct.”
In that case, though, I think there would be even less reason to engage with the academic decision theory literature: it would be focused on a question that has no real answer.
[[EDIT: Note that Will also emphasizes the importance of the criterion-of-rightness vs. decision-procedure distinction in his critique of the FDT paper: "[T]hey’re [most often] asking what the best decision procedure is, rather than what the best criterion of rightness is... But, if that’s what’s going on, there are a whole bunch of issues to dissect. First, it means that FDT is not playing the same game as CDT or EDT, which are proposed as criteria of rightness, directly assessing acts. So it’s odd to have a whole paper comparing them side-by-side as if they are rivals."]]