Jul 07, 2017
I lead the Open Philanthropy Project's work on technical AI safety research. In our MIRI grant writeup last year, we said that we had strong reservations about MIRI’s research, and that we hoped to write more about MIRI's research in the future. This writeup explains my current thinking about the subset of MIRI's research referred to as "highly reliable agent design" in the Agent Foundations Agenda. My hope is that this writeup will help move the discussion forward, but I definitely do not consider it to be any kind of final word on highly reliable agent design. I'm posting the writeup here because I think this is the most appropriate audience, and I'm looking forward to reading the comments (though I probably won't be able to respond to all of them).
After writing the first version of this writeup, I received comments from other Open Phil staff, technical advisors, and MIRI staff. Many comments were disagreements with arguments or credences stated here; some of these disagreements seem plausible to me, some comments disagree with one another, and I place significant weight on all of them because of my confidence in the commentators. Based on these comments, I think it's very likely that some aspects of this writeup will turn out to have been miscalibrated or mistaken – i.e. incorrect given the available evidence, and not just cases where I assign a reasonable credence or make a reasonable argument that may turn out to be wrong – but I'm not sure which aspects these will turn out to be.
I considered spending a lot of time heavily revising this writeup to take these comments into account. However, it seems pretty likely to me that I could continue this comment/revision process for a long time, and this process offers very limited opportunities for others outside of a small set of colleagues to engage with my views and correct me where I'm wrong. I think there's significant value in instead putting an imperfect writeup into the public record, and giving others a chance to respond in their own words to an unambiguous snapshot of my beliefs at a particular point in time.
I understand MIRI's "highly reliable agent design" work (coined in this research agenda, "HRAD" for short) as work that aims to describe basic aspects of reasoning and decision-making in a complete, principled, and theoretically satisfying way. Here's a non-exhaustive list of research topics in this area:
To be really satisfying, it should be possible to put these descriptions together into a full and principled description of an AI system that reasons and makes decisions in pursuit of some goal in the world, not taking into account issues of efficiency; this description might be understandable as a modified/expanded version of AIXI. Ideally this research would also yield rigorous explanations of why no other description is satisfying.
My understanding is that MIRI (or at least Nate and Eliezer) believe that if there is not significant progress on many problems in HRAD, the probability that an advanced AI system will cause catastrophic harm is very high. (They reserve some probability for other approaches being found that could render HRAD unnecessary, but they aren't aware of any such approaches.)
I've engaged in many conversations about why MIRI believes this, and have often had trouble coming away with crisply articulated reasons. So far, the basic case that I think is most compelling and most consistent with the majority of the conversations I've had is something like this (phrasing is mine / Holden's):
I also find it helpful to see this case as asserting that HRAD is one kind of "basic science" approach to understanding AI. Basic science in other areas – i.e. work based on some sense of being intuitively, fundamentally confused and unsatisfied by the lack of explanation for something – seems to have an outstanding track record of uncovering important truths that would have been hard to predict in advance, including the work of Faraday/Maxwell, Einstein, Nash, and Turing. Basic science can also provide a foundation for high-reliability engineering, e.g. by giving us a language to express guarantees about how an engineered system will perform in different circumstances or by improving an engineer's ability to design good empirical tests. Our lack of satisfying explanations for how an AI system should reason and make decisions and the importance of "knowing what we're doing" in AI make a basic science approach appealing, and HRAD is one such approach. (I don't think MIRI would say that there couldn't be other kinds of basic science that could be done in AI, but they don't know of similarly valuable-looking approaches.)
We've spent a lot of effort (100+ hours) trying to write down more detailed cases for HRAD work. This time included conversations with MIRI, conversations among Open Phil staff and technical advisors, and writing drafts of these arguments. These other cases didn't feel like they captured MIRI's views very well and were not very understandable or persuasive to me and other Open Phil staff members, so I've fallen back on this simpler case for now when thinking about HRAD work.
I have several points of agreement with MIRI's basic case:
The fact that MIRI researchers (who are thoughtful, very dedicated to this problem, aligned with our values, and have a good track record in thinking about existential risks from AI) and some others in the effective altruism community are significantly more positive than I am about HRAD is an extremely important factor to me in favor of HRAD. These positive views significantly raise the minimum credence I'm willing to put on HRAD research being very helpful.
In addition to these positive factors, I have several reservations about HRAD work. In relation to the basic case, these reservations make me think that HRAD isn't likely to be significantly helpful for getting a confidence-generating description of how an advanced AI system reasons and makes decisions.
1. It seems pretty likely that early advanced AI systems won't be understandable in terms of HRAD's formalisms, in which case HRAD won't be useful as a description of how these systems should reason and make decisions.
Note: I'm not sure to what extent MIRI and I disagree about how likely HRAD is to be applicable to early advanced AI systems. It may be that our overall disagreement about HRAD is more about the feasibility of other AI alignment research options (see 3 below), or possibly about strategic questions outside the scope of this document (e.g. to what extent we should try to address potential risks from advanced AI through strategy, policy, and outreach rather than through technical research).
2. HRAD has gained fewer strong advocates among AI researchers than I'd expect it to if it were very promising – including among AI researchers whom I consider highly thoughtful about the relevant issues, and whom I'd expect to be more excited if HRAD were likely to be very helpful.
Together, these two concerns give me something like a 20% credence that if HRAD work reached a high level of maturity (and relatively little other AI alignment research were done) HRAD would significantly help AI researchers build aligned AI systems around the time it becomes possible to build any advanced AI system.
3. The above considers HRAD in a vacuum, instead of comparing it to other AI alignment research options. My understanding is that MIRI thinks it is very unlikely that other AI alignment research can make up for a lack of progress in HRAD. I disagree; HRAD looks significantly less promising to me (in terms of solving object-level alignment problems, ignoring factors like field-building value) than learning to reason and make decisions from human-generated data (described more below), and HRAD seems unlikely to be helpful on the margin if reasonable amounts of other AI alignment research are done.
This reduces my credence in HRAD being very helpful to around 10%. I think this is the decision-relevant credence.
In the next few sections, I'll go into more detail about the factors I just described. Afterward, I'll say what I think this implies about how much we should support HRAD research, briefly summarizing the other factors that I think are most relevant.
The basic case for HRAD being helpful depends on HRAD producing a description of how an AI system should reason and make decisions that can be productively applied to advanced AI systems. In this section, I'll describe my reasons for thinking this is not likely. (As noted above, I'm not sure to what extent MIRI and I disagree about how likely HRAD is to be applicable to early advanced AI systems; nevertheless, it's an important factor in my current beliefs about the value of HRAD work.)
I understand HRAD work as aiming to describe basic aspects of reasoning and decision-making in a complete, principled, and theoretically satisfying way, and ideally to have arguments that no other description is more satisfying. I'll refer to this as a "complete axiomatic approach," meaning that an end result of HRAD-style research on some aspect of reasoning would be a set of axioms that completely describe that aspect and that are chosen for their intrinsic desirability or for the desirability of the properties they entail. This property of HRAD work is the source of several of my reservations:
Overall, this makes me think it's unlikely that HRAD work will apply well to advanced AI systems, especially if advanced AI is reached soon (which would make it more likely to resemble today's machine learning methods). A large portion of my credence in HRAD being applicable to advanced AI systems comes from the possibility that advanced AI systems won't look much like today's. I don't know how to gain much evidence about HRAD's applicability in this case.
HRAD has gained fewer strong advocates among AI researchers than I'd expect it to if it were very promising, despite other aspects of MIRI's research (the alignment problem, value specification, corrigibility) being strongly supported by a few prominent researchers. Our review of five of MIRI's HRAD papers last year provided more detailed examples of how a small number of AI researchers (seven computer science professors, one graduate student, and our technical advisors) respond to HRAD research; these reviews made it seem to us that HRAD research has little potential to decrease potential risks from advanced AI relative to other technical work with the same goal, though we noted that this conclusion was "particularly tentative, and some of our advisors thought that versions of MIRI’s research direction could have significant value if effectively pursued".
I interpret these unfavorable reviews and lack of strong advocates as evidence that:
I'm frankly not sure how many strong advocates among AI researchers it would take to change my mind on these points – I think a lot would depend on details of who they were and what story they told about their interest in HRAD.
I do believe that some of this lack of interest should be explained by social dynamics and communication difficulties: MIRI is not part of the academic system, and the way MIRI researchers write about their work and motivation is very different from many academic papers. Both of these could cause mainstream AI researchers to be less interested in HRAD research than they would be if these factors weren't in play. However, I think our review process and conversations with our technical advisors each provide some evidence that this isn't likely to be sufficient to explain AI researchers' low interest in HRAD.
Reviewers' descriptions of the papers' main questions, conclusions, and intended relationship to potential risks from advanced AI generally seemed thoughtful and (as far as I can tell) accurate, and in several cases (most notably Fallenstein and Kumar 2015) some reviewers thought the work was novel and impressive; if reviewers' opinions were more determined by social and communication issues, I would expect reviews to be less accurate, less nuanced, and more broadly dismissive.
I only had enough interaction with external reviewers to be moderately confident that their opinions weren't significantly attributable to social or communication issues. I've had much more extensive, in-depth interaction with our technical advisors, and I'm significantly more confident that their views are mostly determined by their technical knowledge and research taste. I think our technical advisors are among the very best-qualified outsiders to assess MIRI's work, and that they have genuine understanding of the importance of alignment as well as being strong researchers by traditional standards. Their assessment is probably the single biggest data point for me in this section.
Outside of HRAD, some other research topics that MIRI has proposed have been the subject of much more interest from AI researchers. For example, researchers and students at CHAI have published papers on and are continuing to work on value specification and error-tolerance (particularly corrigibility); these topics have consistently seemed more promising to our technical advisors; and Stuart Russell has adopted the value alignment problem as a central theme of his work. In light of this, I am more inclined to take AI researchers' lack of interest in HRAD as evidence about its promisingness than as evidence of severe social or communication issues.
The most convincing argument I know of for not treating other researchers' lack of interest as significant evidence about the promisingness of HRAD research is:
I'm strongly inclined to resolve this conflict by continuing to believe that MIRI's decision theory work is good philosophy, and to explain 2 by appealing to social dynamics and communication difficulties. I think it's reasonable to consider an analogous situation with HRAD and AI researchers to be plausible a priori, but the analogue of point 1 above doesn't apply to HRAD work, and the other reasons I've given in this section lead me to think that this is not likely.
How promising does HRAD look compared to other AI alignment research options? The most significant factor to me is the apparent promisingness of designing advanced AI systems to reason and make decisions from human-generated data ("learning to reason from humans"); if an approach along these lines is successful, it doesn't seem to me that much room would be left for HRAD to help on the margin. My views here are heavily based on Paul Christiano's writing on this topic, but I'm not claiming to represent his overall approach, and in particular I'm trying to sketch out a broader set of approaches that includes Paul's. It's plausible to me that other kinds of alignment research could play a similar role, but I have a much less clear picture of how that would work, and finding out about significant problems with learning to reason from humans would make me both more pessimistic about technical work on AI alignment in general and more optimistic that HRAD would be helpful. The arguments in this section are pretty loose, but the basic idea seems promising enough to me to justify high credence that something in this general area will work.
"Learning to reason from humans" is different from the most common approaches in AI today, where decision-making methods are implicitly learned in the process of approximating some function – e.g. a reward-maximizing policy, an imitative policy, a Q-function or model of the world, etc. Instead, learning to reason from humans would involve directly training a system to reason in ways that match human demonstrations or are approved of by human feedback, as in Paul's article here.
If we are able to become confident that an AI system is learning to reason in ways that meet human approval or match human demonstrations, it seems to me that we could also become confident that the AI system would be aligned overall; a very harmful decision would need to be generated by a series of human-endorsed reasoning steps (and unless human reasoning endorses a search for edge cases, edge cases won't be sought). Human endorsement of reasoning and decision-making could not only incorporate valid instrumental reasoning (in parts of epistemology and decision theory that we know how to formalize), but also rules of thumb and sanity checks that allow humans to navigate uncertainty about which epistemology and decision theory are correct, as well as human value judgements about which decisions, actions, short-term consequences, and long-term consequences are desirable, undesirable, or of uncertain value.
Another factor that is important to me here is the potential to design systems to reason and make decisions in ways that are calibrated or conservative. The idea here is that we can become more confident that AI systems will not make catastrophic decisions if they can reliably detect when they are operating in unfamiliar domains or situations, have low confidence that humans would approve of their reasoning and decisions, have low confidence in predicted consequences, or are considering actions that could cause significant harm; in those cases, we'd like AI systems to "check in" with humans more intensively and to act more conservatively. It seems likely to me that these kinds of properties would contribute significantly to alignment and safety, and that we could pursue these properties by designing systems to learn to reason and make decisions in human-approved ways, or by directly studying statistical properties like calibration or "conservativeness".
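As a minimal sketch of the "check in" behavior described above, the following Python shows one way a system could fall back to human oversight when any of the listed conditions holds. Everything here is a hypothetical illustration rather than an existing method or API: the `Assessment` fields, the `choose_mode` function, and the threshold values are all assumptions, and the sketch takes for granted that the system can produce reasonably calibrated estimates in the first place, which is itself an open research problem.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    """Hypothetical self-assessment attached to a proposed action."""
    familiarity: float          # estimated probability the situation is a familiar one
    approval_confidence: float  # estimated probability humans would approve of the reasoning
    outcome_confidence: float   # confidence in the predicted consequences
    potential_harm: float       # estimated severity of the worst plausible outcome, in [0, 1]


def choose_mode(a: Assessment, min_confidence: float = 0.9, max_harm: float = 0.1) -> str:
    """Return "act" only if every check passes; otherwise return "check_in".

    The thresholds are illustrative assumptions, not recommended values.
    """
    # The weakest of the three confidence estimates decides whether to proceed.
    confident = min(a.familiarity, a.approval_confidence, a.outcome_confidence) >= min_confidence
    low_harm = a.potential_harm <= max_harm
    return "act" if confident and low_harm else "check_in"


routine = Assessment(familiarity=0.99, approval_confidence=0.97, outcome_confidence=0.95, potential_harm=0.01)
novel = Assessment(familiarity=0.40, approval_confidence=0.97, outcome_confidence=0.95, potential_harm=0.01)
risky = Assessment(familiarity=0.99, approval_confidence=0.97, outcome_confidence=0.95, potential_harm=0.60)

for name, case in [("routine", routine), ("novel", novel), ("risky", risky)]:
    print(name, choose_mode(case))
```

Taking the minimum over the confidence estimates means a single weak signal (an unfamiliar situation, say) is enough to trigger a check-in, which is the conservative reading of the paragraph above; a weighted combination would be a less conservative alternative.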
"Learning to reason and make decisions from human examples and feedback" and "learning to act 'conservatively' where 'appropriate'" don't seem to me to be many orders of magnitude more difficult than the kinds of learning tasks AI systems are good at today. If it was necessary for an AI system to imitate human judgement perfectly, I would be much more skeptical of this approach, but that doesn't seem to be necessary, as Paul argues:
"You need only the vaguest understanding of humans to guess that killing the user is: (1) not something they would approve of, (2) not something they would do, (3) not in line with their instrumental preferences.
So in order to get bad outcomes here you have to really mess up your model of what humans want (or more likely mess up the underlying framework in an important way).
If we imagine a landscape of possible interpretations of human preferences, there is a 'right' interpretation that we are shooting for. But if you start with a wrong answer that is anywhere in the neighborhood, you will do things like 'ask the user what to do, and don’t manipulate them.' And these behaviors will eventually get you where you want to go.
That is to say, the 'right' behavior is surrounded by a massive crater of 'good enough' behaviors, and in the long-term they all converge to the same place. We just need to land in the crater."
Learning to reason from humans is a good fit with today's AI research, and is broad enough that it would be very surprising to me if it were not productively applicable to early advanced AI systems.
It seems to me that this kind of approach is also much more likely to be robust to unanticipated problems than a formal, HRAD-style approach would be, since it explicitly aims to learn how to reason in human-endorsed ways instead of relying on researchers to notice and formally solve all critical problems of reasoning before the system is built. There are significant open questions about whether and how we could make machine learning robust and theoretically well-understood enough for high confidence, but it seems to me that this will be the case for any technical pathway that relies on learning about human preferences in order to act desirably.
Finally, it seems to me that if a lack of HRAD-style understanding does leave us exposed to many important "unknown unknown" problems, there is a good chance that some of those problems will be revealed by failures or difficulties in achieving alignment in earlier AI systems, and that researchers who are actively thinking about the goal of aligning advanced AI systems will be able to notice these failings and relate them to a need for better HRAD-style understanding. This kind of process seems very likely to be applicable to learning to reason from humans, but could also apply to other approaches to AI alignment. I do not think that this process is guaranteed to reveal a need for HRAD-style understanding in the case that it is needed, and I am fairly sure that some failure modes will not appear in earlier advanced AI systems (the failure modes Bostrom calls "treacherous turns", which only appear when an AI system has a large range of general-purpose capabilities, can reason very powerfully, etc.). It's possible that earlier failure modes will be too rare, too late, or not clearly enough related to a need for HRAD-style research. However, if a lack of fundamental understanding does expose us to many important "unknown unknown" failure modes, it seems more likely to me that some informative failures will happen early than that all such failures will appear only after systems are advanced enough to be extremely high-impact, and that researchers motivated by alignment of advanced AI will notice if those failures could be addressed through HRAD-style understanding. (I'm uncertain about how researchers who aren't thinking actively about alignment of advanced AI would respond, and I think one of the most valuable things we can do today is to increase the number of researchers who are thinking actively about alignment of advanced AI and are therefore more likely to respond appropriately to evidence.)
My credence for this section isn't higher for three basic reasons:
As I noted above, I believe that MIRI staff are thoughtful, very dedicated to this problem, aligned with our values, and have a good track record in thinking about existential risk from AI. The fact that some of them are much more optimistic than I am about HRAD research is a very significant factor in favor of HRAD. I think it would be incorrect to place a very low credence (e.g. 1%) on their views being closer to the truth than mine are.
I don't think it is helpful to try to list a large amount of detail here; I'm including this as its own section in order to emphasize its importance to my reasoning. My views come from many in-person and online conversations with MIRI researchers over the past 5 years, reports of many similar conversations by other thoughtful people I trust, and a large amount of online writing about existential risk from AI spread over several sites, most notably LessWrong.com, agentfoundations.org, arbital.com, and intelligence.org.
The most straightforward thing to list is that MIRI was among the first groups to strongly articulate the case for existential risk from artificial intelligence and the need for technical and strategic research on this topic, as noted in our last writeup:
"We believe that MIRI played an important role in publicizing and sharpening the value alignment problem. This problem is described in the introduction to MIRI’s Agent Foundations technical agenda. We are aware of MIRI writing about this problem publicly and in-depth as early as 2001, at a time when we believe it received substantial attention from very few others. While MIRI was not the first to discuss potential risks from advanced artificial intelligence, we believe it was a relatively early and prominent promoter, and generally spoke at more length about specific issues such as the value alignment problem than more long-standing proponents."
My 10% credence that "if HRAD reached a high level of maturity it would significantly help AI researchers build aligned AI systems" doesn't fully answer the question of how much we should support HRAD work (with our funding and with our outreach to researchers) relative to other technical work on AI safety. It seems to me that the main additional factors are:
Field-building value: I expect that the majority of the value of our current funding in technical AI safety research will come from its effect of increasing the total number of people who are deeply knowledgeable about technical research on artificial intelligence and machine learning, while also being deeply versed in issues relevant to potential risks. HRAD work appears to be significantly less useful for this goal than other kinds of AI alignment work, since HRAD has not gained much support among AI researchers. (I do think that in order to be effective for field-building, AI safety research directions should be among the most promising we can think of today; this is not an argument for work on unpromising but attractive "AI safety" research.)
Replaceability: HRAD work seems much more likely than other AI alignment work to be neglected by AI researchers and funders. If HRAD work turns out to be significantly helpful, we could make a significant counterfactual difference by supporting it.
Shovel-readiness: My understanding is that HRAD work is currently funding-constrained (i.e. MIRI could scale up its program given more funds). This is not generally true of technical AI safety work, which in my experience has also required significant staff time.
The difference in field-building value between HRAD and the other technical AI safety work we support makes me significantly more enthusiastic about supporting other technical AI safety work than about supporting HRAD. However, HRAD's low replaceability and my 10% credence in HRAD being useful make me excited to support at least some HRAD work.
In my view, enough HRAD work should be supported to continue building evidence about its chance of applicability to advanced AI, to have opportunities for other AI researchers to encounter it and become advocates, and to generally make it reasonably likely that if it is more important than it currently appears then we can learn this fact. MIRI's current size seems to me to be approximately right for this purpose, and as far as I know MIRI staff don't think MIRI is too small to continue making steady progress. Given this, I am ambivalent (along the lines of our previous grant writeup) about recommending that Good Ventures funds be used to increase MIRI's capacity for HRAD research.