All of Daniel_Dewey&#x27;s Comments + Replies

My current thoughts on MIRI's "highly reliable agent design" work

Thanks for these thoughts. (Your second link is broken, FYI.)

On empirical feedback: my current suspicion is that there are some problems where empirical feedback is pretty hard to get, but I actually think we could get more empirical feedback on how well HRAD can be used to diagnose and solve problems in AI systems. For example, it seems like many AI systems implicitly do some amount of logical-uncertainty-type reasoning (e.g. AlphaGo, which is really all about logical uncertainty over the result of expensive game-tree computations) -- maybe HRAD could be ... (read more)

LawrenceC

My suspicion is that MIRI agrees with you - if you read their job post on their software engineering internship, it seems that they're looking for people who can rapidly prototype and test AI Alignment ideas that have implications in machine learning.

WillPearson

Fixed, thanks. I agree that HRAD might be useful. I read some of the stuff. I think we need a mix of theory and practice and only when we have community where they can feed into each other will we actually get somewhere. When an AI safety theory paper says, "Here is an experiment we can do to disprove this theory," then I will pay more attention than I do. The "ignored physical aspect of computation" is less about a direction to follow, but more an argument about the type of systems that are likely to be effective and so an argument about which ones we should study. There is no point studying how to make ineffective systems safe if the lessons don't carry over to effective ones. You don't want a system that puts in the same computational resources trying to decide what brand of oil is best for its bearings as it does to deciding the question of what is a human or not. If you decide how much computational resources you want to put into each class of decision, you start to get into meta-decision territory. You also need to decide how much of your pool you want to put into making that meta-decision as making it will take away from making your other decisions. I am thinking about a possible system which can allocate resources among decision making systems and this can be used to align the programs (at least somewhat). It cannot align a super intelligent malign program, work needs to done on the initial population of programs in the system, so that we can make sure they do not appear. Or we need a different way of allocating resources entirely. I don't pick this path because it is an easy path to safety, but because I think it is the only path that leads anywhere interesting/dangerous and so we need to think about how to make it safe.

Daniel_Dewey7y0

My guess is that the capability is extremely likely, and the main difficulties are motivation and reliability of learning (since in other learning tasks we might be satisfied with lower reliability that gets better over time, but in learning human preferences unreliable learning could result in a lot more harm).

My current thoughts on MIRI's "highly reliable agent design" work

My current thoughts on MIRI's "highly reliable agent design" work

Thanks for this suggestion, Kaj -- I think it's an interesting comparison!

My current thoughts on MIRI's "highly reliable agent design" work

I am very bullish on the Far Future EA Fund, and donate there myself. There's one other possible nonprofit that I'll publicize in the future if it gets to the stage where it can use donations (I don't want to hype this up as an uber-solution, just a nonprofit that I think could be promising).

I unfortunately don't spend a lot of time thinking about individual donation opportunities, and the things I think are most promising often get partly funded through Open Phil (e.g. CHAI and FHI), but I think diversifying the funding source for orgs like CHAI and FHI is valuable, so I'd consider them as well.

LawrenceC

Not super relevant to Peter's question, but I would be interested in hearing why you're bullish on the Far Future EA Fund.

My current thoughts on MIRI's "highly reliable agent design" work

I think there's something to this -- thanks.

To add onto Jacob and Paul's comments, I think that while HRAD is more mature in the sense that more work has gone into solving HRAD problems and critiquing possible solutions, the gap seems much smaller to me when it comes to the justification for thinking HRAD is promising vs justification for Paul's approach being promising. In fact, I think the arguments for Paul's work being promising are more solid than those for HRAD, despite it only being Paul making those arguments -- I've had a much harder time understanding anything more nuanced than the basic case for HRAD I gave above, and a much easier time understanding why Paul thinks his approach is promising.

Wei Dai

Daniel, while re-reading one of Paul's posts from March 2016, I just noticed the following: My interpretation of this is that between March 2016 and the end of 2016, Paul updated the difficulty of his approach upwards. (I think given the context, he means that other problems, namely robust learning and meta-execution, are harder, not that informed oversight has become easier.) I wanted to point this out to make sure you updated on his update. Clearly Paul still thinks his approach is more promising than HRAD, but perhaps not by as much as before.

Wei Dai

This seems wrong to me. For example, in the "learning to reason from human" approaches, the goal isn't just to learn to reason from humans, but to do it in a way that maintains competitiveness with unaligned AIs. Suppose a human overseer disapproves of their AI using some set of potentially dangerous techniques, how can we then ensure that the resulting AI is still competitive? Once someone points this out, proponents of the approach, to continue thinking their approach is promising, would need to give some details about how they intend to solve this problem. Subsequently, justification for thinking the approach is promising is more subtle and harder to understand. I think conversations like this have occurred for MIRI's approach far more than Paul's, which may be a large part of why you find Paul's justifications easier to understand.

My perspective on this is a combination of “basic theory is often necessary for knowing what the right formal tools to apply to a problem are, and for evaluating whether you're making progress toward a solution” and “the applicability of Bayes, Pearl, etc. to AI suggests that AI is the kind of problem that admits of basic theory.” An example of how this relates to HRAD is that I think that Bayesian justifications are useful in ML, and that a good formal model of rationality in the face of logical uncertainty is likely to be useful in analogous ways. When

My current thoughts on MIRI's "highly reliable agent design" work

Thanks Nate!

The end goal is to prevent global catastrophes, but if a safety-conscious AGI team asked how we’d expect their project to fail, the two likeliest scenarios we’d point to are "your team runs into a capabilities roadblock and can't achieve AGI" or "your team runs into an alignment roadblock and can easily tell that the system is currently misaligned, but can’t figure out how to achieve alignment in any reasonable amount of time."

This is particularly helpful to know.

We worry about "unknown unknowns", but I’d pro

... (read more)

So8res

I want to steer clear of language that might make it sound like we’re saying: * X 'We can't make broad-strokes predictions about likely ways that AGI could go wrong.' * X 'To the extent we can make such predictions, they aren't important for informing research directions.' * X 'The best way to address AGI risk is just to try to advance our understanding of AGI in a general and fairly undirected way.' The things I do want to communicate are: * All of MIRI's research decisions are heavily informed by a background view in which there are many important categories of predictable failure, e.g., 'the system is steering toward edges of the solution space', 'the function the system is optimizing correlates with the intended function at lower capability levels but comes uncorrelated at high capability levels', 'the system has incentives to obfuscate and mislead programmers to the extent it models its programmers’ beliefs and expects false programmer beliefs to result in it better-optimizing its objective function.’ * The main case for HRAD problems is that we expect them to help in a gestalt way with many different known failure modes (and, plausibly, unknown ones). E.g., 'developing a basic understanding of counterfactual reasoning improves our ability to understand the first AGI systems in a general way, and if we understand AGI better it's likelier we can build systems to address deception, edge instantiation, goal instability, and a number of other problems'. * There usually isn't a simple relationship between a particular open problem and a particular failure mode, but if we thought there were no way to predict in advance any of the ways AGI systems can go wrong, or if we thought a very different set of failures were likely instead, we'd have different research priorities.

My current thoughts on MIRI's "highly reliable agent design" work

Thanks!

My current thoughts on MIRI's "highly reliable agent design" work

I'm going to try to answer these questions, but there's some danger that I could be taken as speaking for MIRI or Paul or something, which is not the case :) With that caveat:

I'm glad Rob sketched out his reasoning on why (1) and (2) don't play a role in MIRI's thinking. That fits with my understanding of their views.

(1) You might think that "learning to reason from humans" doesn't accomplish (1) because a) logic and mathematics seem to be the only methods we have for stating things with extremely high certainty, and b) you probably can't rule

Daniel_Dewey7y2

Thanks for linking to that conversation -- I hadn't read all of the comments on that post, and I'm glad I got linked back to it.

Daniel_Dewey7y2

Thanks!

Conditional on MIRI's view that a hard or unexpected takeoff is likely, HRAD is more promising (though it's still unclear).

Do you mean more promising than other technical safety research (e.g. concrete problems, Paul's directions, MIRI's non-HRAD research)? If so, I'd be interested in hearing why you think hard / unexpected takeoff differentially favors HRAD.

Tobias_Baumann

Yeah, and also (differentially) more promising than AI strategy or AI policy work. But I'm not sure how strong the effect is. In a hard / unexpected takeoff scenario, it's more plausible that we need to get everything more or less exactly right to ensure alignment, and that we have only one shot at it. This might favor HRAD because a less principled approach makes it comparatively unlikely that we get all the fundamentals right when we build the first advanced AI system. In contrast, if we think there's no such discontinuity and AI development will be gradual, then AI control may be at least somewhat more similar (but surely not entirely comparable) to how we "align" contemporary software systems. That is, it would be more plausible that we could test advanced AI systems extensively without risking catastrophic failure or that we could iteratively try a variety of safety approaches to see what works best. It would also be more likely that we'd get warning signs of potential failure modes, so that it's comparatively more viable to work on concrete problems whenever they arise, or to focus on making the solutions to such problems scalable – which, to my understanding, is a key component of Paul's approach. In this picture, successful alignment without understanding the theoretical fundamentals is more likely, which makes non-HRAD approaches more promising. My personal view is that I find a hard and unexpected takeoff unlikely, and accordingly favor other approaches than HRAD, but of course I can't justify high confidence in this given expert disagreement. Similarly, I'm not highly confident that the above distinction is actually meaningful. I'd be interested in hearing your thoughts on this!

My current thoughts on MIRI's "highly reliable agent design" work

Daniel_Dewey7y9

Thanks Tara! I'd like to do more writing of this kind, and I'm thinking about how to prioritize it. It's useful to hear that you'd be excited about those topics in particular.

MikeJohnson

I too found this post very helpful/illuminating. I hope you can continue to do this sort of writing!

My current thoughts on MIRI's "highly reliable agent design" work

Daniel_Dewey7y6

Thanks Kerry, Benito! Glad you found it helpful.

Open Thread #36

Welcome! :)

I think your argument totally makes sense, and you're obviously free to use your best judgement to figure out how to do as much good as possible. However, a couple of other considerations seem important, especially for things like what a "true effective altruist" would do.

1) One factor of your impact is your ability to stick with your giving; this could give you a reason to adopt something less scary and demanding. By analogy, it might seem best for fitness to commit to intense workouts 5 days a week, strict diet changes, and no alcoho... (read more)

Daniel_Dewey7y1

Thanks for putting StrongMinds on my radar!

Advisory panel at CEA

What Should the Average EA Do About AI Alignment?

Nice work, and looks like a good group of advisors!

Daniel_Dewey7y7

Re: donation: I'd personally feel best about donating to the Long-Term Future EA Fund (not yet ready, I think?) or the EA Giving Group, both managed by Nick Beckstead.

TaraMacAulay

The EA Funds are now live and accepting donations. You can read about the Far Future fund here.

Use "care" with care.

Building Cooperative Epistemology (Response to "EA has a Lying Problem", among other things)

Thanks for recommending a concrete change in behavior here!

I also appreciate the discussion of your emotional engagement / other EAs' possible emotional engagement with cause prioritization -- my EA emotional life is complicated, I'm guessing others have a different set of feelings and struggles, and this kind of post seems like a good direction for understanding and supporting one another.

ETA: personally, it feels correct when the opportunity arises to emotionally remind myself of the gravity of the ER-triage-like decisions that humans have to make when a... (read more)

Daniel_Dewey7y0

I agree that if engagement with the critique doesn't follow those words, they're not helpful :) Editing my post to clarify that.

Contra the Giving What We Can pledge

Daniel_Dewey7y0

The pledge is really important to me as a part of my EA life and (I think) as a part of our community infrastructure, and I find your critiques worrying. I'm not sure what to do, but I appreciate you taking the critic's risk to help the community. Thank you!

Building Cooperative Epistemology (Response to "EA has a Lying Problem", among other things)

Building Cooperative Epistemology (Response to "EA has a Lying Problem", among other things)

This is a great point -- thanks, Jacob!

I think I tend to expect more from people when they are critical -- i.e. I'm fine with a compliment/agreement that someone spent 2 minutes on, but expect critics to "do their homework", and if a complimenter and a critic were equally underinformed/unthoughtful, I'd judge the critic more harshly. This seems bad!

One response is "poorly thought-through criticism can spread through networks; even if it's responded to in one place, people cache and repeat it other places where it's not responded to, and that... (read more)

RyanCarey

Not sure how much this helps because if the criticism is thoughtful and you fail to engage with it, you're still being rude and missing an opportunity, whether or not you say some magic words.

Building Cooperative Epistemology (Response to "EA has a Lying Problem", among other things)

Thanks!

I think parts of academia do this well (although other parts do it poorly, and I think it's been getting worse over time). In particular, if you present ideas at a seminar, essentially arbitrarily harsh criticism is fair game. Of course, this is different from the public internet, but it's still a group of people, many of whom do not know each other personally, where pretty strong criticism is the norm.

One guess is that ritualization in academia helps with this -- if you say something in a talk or paper, you ritually invite criticism, whereas I'... (read more)

Daniel_Dewey7y16

Prediction-making in my Open Phil work does feel like progress to me, because I find making predictions and writing them down difficult and scary, indicating that I wasn't doing that mental work as seriously before :) I'm quite excited to see what comes of it.

Raemon

Wanted to offer something stronger than an up vote in starting the prediction-making: that sounds like a great idea and want to see how it goes. :)

Building Cooperative Epistemology (Response to "EA has a Lying Problem", among other things)

Daniel_Dewey7y9

I have very mixed feelings about Sarah's post; the title seems inaccurate to me, and I'm not sure about how the quotes were interpreted, but it's raised some interesting and useful-seeming discussion. Two brief points:

I understand what causes people to write comments like "lying seems bad but maybe it's the best thing to do in some cases", but I don't think those comments usually make useful points (they typically seem pedantic at best and edgy at worst), and I hope people aren't actually guided by considerations like those. Most EAs I work wit

Daniel_Dewey7y1

I'm really glad you posted this! I've found it helpful food for thought, and I think it's a great conversation for the community to be having.

For many Americans, income taxes might go down; probably worth thinking about what to do with that "extra" money.

You're welcome :) Glad you liked it!

Thanks for mentioning this -- I totally see what you're pointing at here, and I think you make valid points re: there always being more excuses later.

I just meant to emphasize that "giving now feels good" wasn't something I was prepared to justify in terms of its actual impact on the world; if I found out that this good feeling was justified in terms of impact, that'd be great, but if it turned out that I could give up that good feeling in order to have a better impact, I'd try my best to do so.

Thanks Milan!

I haven't thought a lot about that, and might be making the wrong call. Off the top of my head:

There's a community norm toward donating 10%, and I'm following that without thinking too hard.
I expect donation effectiveness on the scale of my donations to get worse over time, so giving earlier at the cost of giving a little (?) less over my career seems like it might be better.
Giving feels good in a way that paying debt doesn't. This isn't an EA reason :)

I guess I could put my 10% toward debt reduction instead -- if you or anyone else has ... (read more)

Milan_Griffes

I don't have pointers to good info, other than Mr. Money Mustache's blog, which I think was already mentioned. I'm following in intuition along the lines of "put on your own oxygen mask before helping those around you with theirs." My bet is that my personal impact will be much larger once I'm financially independent. Giving a significant portion of my income now is a drag on reaching financial independence. I'd prefer to accelerate my progress towards financial independence at the expense of doing good today. We're touching on the "give now vs. give later" debate here; intuitions may diverge.

Jmd

I disagree that 'giving cause it feels good' isn't an EA reason to give. It's about the head and the heart right? I give because it feels good, and it feels even better knowing that where you give is high impact and if giving makes you feel good then that's encouraging to others as well :) And I also started giving when I had my student loan to pay off - maybe if my loan was bigger I would have thought about starting with smaller donations like with The Life You Can Save, but my main motivation was that if the debt is an excuse now, then buying a house will be an excuse later, and then all the other life excuses and I will never do it. So I leapt. People live really well on less than I did even with the donations and the loan repayments, it does mean thinking more about 'fun' activities' though I found that I could still do all those things and where I spent less was on 'stuff' - things you buy but don't really need anyways.