Rohin Shah

Hi, I'm Rohin Shah! I work as a Research Scientist on the technical AGI safety team at DeepMind. I completed my PhD at the Center for Human-Compatible AI at UC Berkeley, where I worked on building AI systems that can learn to assist a human user, even if they don't initially know what the user wants.

I'm particularly interested in big picture questions about artificial intelligence. What techniques will we use to build human-level AI systems? How will their deployment affect the world? What can we do to make this deployment go better? I write up summaries and thoughts about recent work tackling these questions in the Alignment Newsletter.

In the past, I ran the EA groups at UC Berkeley and the University of Washington.

Topic Contributions


EA can sound less weird, if we want it to

I agree with both of those reasons in the abstract, and I definitely do (2) myself. I'd guess there are around 50 people total in the world who could do (2) in a way where I'd look at it and say that they succeeded (for AI risk in particular), of which I could name maybe 20 in advance. I would certainly not be telling a random EA to make our arguments sound less weird.

I'd be happy about the version of (1) where the non-weird version was just an argument that people talked about, without any particular connection to EA / AI x-risk. I would not say "make EA sound less weird", I'd say "one instrumental strategy for EA is to talk about this other related stuff".

EA can sound less weird, if we want it to

I agree with the main point that we could sound less weird if we wanted to, but it seems unlikely to me that we want that.

since the mechanisms needed to prevent them are the same as those needed to prevent the less severe and more plausible-sounding scenarios of the form "you ask an AI to do X, and the AI accomplishes X by doing Y, but Y is bad and not what you intended".

This is just not true.

If you convince someone of a different, non-weird version of AI risk, that does not mean they should then take the actions that we take. There are lots of other things you can do to mitigate the less severe versions of AI risk:

  1. You could create better "off-switch" policies, where you get tech companies to have less-useful but safe baseline policies that they can quickly switch to if one of their AI systems starts to behave badly (e.g. switching out a recommender system for a system that provides content chronologically).
  2. You could campaign to have tech companies not use the kinds of AI systems subject to these risks (e.g. by getting them to ban lethal autonomous weapons).
  3. You could switch to simpler "list of rules" based AI systems, where you can check that the algorithm the AI is using in fact seems good to you (e.g. Figure 3 here).

Most of these things are slightly helpful but overall don't have much effect on the versions of AI risk that lead to extinction.

(I expect this to generalize beyond AI risk as well and this dynamic is my main reason for continuing to give the weird version of EA ideas.)

rohinmshah's Shortform

There have been a few posts recently arguing that there should be more EA failures: since we're trying a bunch of high-risk, high-reward projects, some of them should fail, or else we're not being ambitious enough.

I think this is a misunderstanding of what high-EV bets look like. Most projects do not produce either wild success or abject failure; there's usually a continuum of outcomes in between, and that's where you land. This doesn't look like "failure", it looks like moderate success.

For example, consider the MineRL BASALT competition that I organized. The low-probability, high-value outcome would have had hundreds or thousands of entries to the competition, several papers produced as a result, and the establishment of BASALT as a standard benchmark and competition in the field.

What actually happened: we got ~11 submissions, of which maybe ~5 were serious; we made decent progress on the problem and produced a couple of vaguely interesting papers; some people in the field have heard of the benchmark and occasionally use it; and we built enough excitement in the team that the competition will (very likely) run again this year.

Is this failure? It certainly isn't what comes to mind from the ordinary meaning of "failure". But it was:

  • Below my median expectation for what the competition would accomplish
  • Not something I would have put time into if someone had told me in advance exactly what it would accomplish so far, and the time cost needed to get there.

One hopes that roughly 50% of the things I do meet the first criterion, and probably 90% of the things I'd do would meet the second. But also maybe 90% of the work I do is something people would say was "successful" even ex post.

If you are actually seeing failures for relatively large projects that look like "failures" in the ordinary English sense of the word, where basically nothing was accomplished at all, I'd be a lot more worried that your project was not in fact high-EV even ex ante, and you should be updating a lot more on your failure. In that light, it is a good sign that we don't see many EA "failures" in this sense.

(One exception to this is earning-to-give entrepreneurship, where "we had to shut the company down and made ~no money after a year of effort" seems reasonably likely and it still would plausibly be high-EV ex ante.)

Peacefulness, nonviolence, and experientialist minimalism

I don't think these are complex questions! If your minimalist axiology ranks based on states of the world (and not actions except inasmuch as they lead to states of the world), then the best possible value to achieve is zero. Assuming this is achieved by an empty universe, there is nothing strictly better than taking an action that creates an empty universe forever! This is a really easy theorem to prove!
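The theorem can be written out in one line. A minimal sketch (my notation, not the post's): let $S$ be the set of possible world-states and $D : S \to \mathbb{R}_{\ge 0}$ the total quantity of whatever the minimalist axiology wants to minimize, so that the value of a state is $V(s) = -D(s)$. If the empty world $s_\emptyset$ contains none of the disvalued thing, then $D(s_\emptyset) = 0$, and:

```latex
% Assumptions: V(s) = -D(s) with D(s) >= 0 for every state s,
% and D(s_empty) = 0 for the empty world.
\forall s \in S:\quad
V(s) \;=\; -D(s) \;\le\; 0 \;=\; -D(s_\emptyset) \;=\; V(s_\emptyset)
```

So no state is strictly better than the empty world, which is the "yes" answer to both questions below.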

I believe that it's a complex question whether or not this should be a dealbreaker for adopting a minimalist axiology, but that's not the question you wrote down. The answers to 

  1. Would an empty world (i.e. a world without sentient beings) be axiologically perfect?
  2. For any hypothetical world, would the best outcome always be realized by pressing a button that leads to its instant cessation?

really are just straightforwardly "yes", for state-based minimalist axiologies where an empty universe has none of the thing you want to minimize, which is the thing you are analyzing in this post unless I have totally misread it.

Peacefulness, nonviolence, and experientialist minimalism

I ignored the first footnote because it's not in the post's remit, according to the post itself:

Additionally, the scope is limited to minimalist axiologies that are based on experientialist accounts of welfare (cf. van der Deijl, 2021). In other words, I assume that the welfare of any given being cannot be affected by things that do not enter their experience, and thus set aside views such as preference-based axiologies that imply extra-experientialism.

If you assume this limited scope, I think the answer to the second question is "yes" (and that the post agrees with this). I agree that things change if you expand the scope to other minimalist axiologies. It's unfortunate that the quote I selected implies "all minimalist axiologies" but I really was trying to talk about this post.

I shouldn't have called it "the main point", I should have said something like "the main point made in response to the two questions I mentioned", which is what I actually meant.

I agree that there is more detail about why the author thinks you shouldn't be worried about it that I did not summarize. I still think it is accurate to say that the author's main response to question 1 and 2, as written in Section 2, is "the answers are yes, but actually that's fine and you shouldn't be worried about it", with the point about cessation implications being one argument for that view.

Peacefulness, nonviolence, and experientialist minimalism

For others who were confused, like I was:

Some people may worry that minimalist axiologies would imply an affirmative answer to the following questions:

  1. Would an empty world (i.e. a world without sentient beings) be axiologically perfect?
  2. For any hypothetical world, would the best outcome always be realized by pressing a button that leads to its instant cessation?

The author agrees that the answers to these questions are "yes" (EDIT: for the specific class of minimalist axiologies considered in this post). The author's main point (EDIT: in Section 2, which addresses these questions, there's also a third question and a Section 3 that talks about it) is that perhaps you shouldn't be worried about that.

Ben Garfinkel's Shortform

I suppose my point is more narrow, really just questioning whether the observation "humans care about things besides their genes" gives us any additional reason for concern.

I mostly go ¯\_(ツ)_/¯ , it doesn't feel like it's much evidence of anything, after you've updated off the abstract argument. The actual situation we face will be so different (primarily, we're actually trying to deal with the alignment problem, unlike evolution).

I do agree that in saying " ¯\_(ツ)_/¯  " I am disagreeing with a bunch of claims that say "evolution example implies misalignment is probable". It's unclear to me to what extent people actually believe such a claim vs. use it as a communication strategy. (The author of the linked post states some uncertainty but presumably does believe something similar to that; I disagree with them if so.)

Relatedly, something I'd be interested in reading (if it doesn't already exist?) would be a piece that takes a broader approach to drawing lessons from the evolution of human goals - rather than stopping at the fact that humans care about things besides genetic fitness.

I like the general idea but the way I'd do it is by doing some black-box investigation of current language models and asking these questions there; I expect we understand the "ancestral environment" of a language model way, way better than we understand the ancestral environment for humans, making it a lot easier to draw conclusions; you could also finetune the language models in order to simulate an "ancestral environment" of your choice and see what happens then.

So -- if we want to create AI systems that don't murder people, by rewarding non-murderous behavior -- then the evidence from human evolution seems like it might be medium-reassuring. I'd maybe give it a B-.

I agree with the murder example being a tiny bit reassuring for training non-murderous AIs; medium-reassuring is probably too much, unless we're expecting our AI systems to be put into the same sorts of situations / ancestral environments as humans were in. (Note that to be the "same sort of situation" it also needs to have the same sort of inputs as humans, e.g. vision + sound + some sort of controllable physical body seems important.)

Ben Garfinkel's Shortform

The actual worry with inner misalignment style concerns is that the selection you do during training does not fully constrain the goals of the AI system you get out; if there are multiple goals consistent with the selection you applied during training there's no particular reason to expect any particular one of them. Importantly, when you are using natural selection or gradient descent, the constraints are not "you must optimize X goal", the constraints are "in Y situations you must behave in Z ways", which doesn't constrain how you behave in totally different situations. What you get out depends on the inductive biases of your learning system (including e.g. what's "simpler").

For example, you train your system to answer truthfully in situations where we know the answer. This could get you an AI system that is truthful... or an AI system that answers truthfully when we know the answer, but lies to us when we don't know the answer in service of making paperclips. (ELK tries to deal with this setting.)

When I apply this point of view to the evolution analogy it dissolves the question / paradox you've listed above. Given the actual ancestral environment and the selection pressures present there, organisms that maximized "reproductive fitness" or "tiling the universe with their DNA" or "maximizing sex between non-sterile, non-pregnant opposite-sex pairs" would all have done well there (I'm sure this is somehow somewhat wrong but clearly in principle there's a version that's right), so who knows which of those things you get. In practice you don't even get organisms that are maximizing anything, because they aren't particularly goal-directed, and instead are adaptation-executers rather than fitness-maximizers.

I do think that once you inhabit this way of thinking about it, the evolution example doesn't really matter any more; the argument itself very loudly says "you don't know what you're going to get out; there are tons of possibilities that are not what you wanted", which is the alarming part. I suppose in theory someone could think that the "simplest" one is going to be whatever we wanted in the first place, and so we're okay, and the evolution analogy is a good counterexample to that view?

It turns out that people really really like thinking of training schemes as "optimizing for a goal". I think this is basically wrong -- is CoinRun training optimizing for "get the coin" or "get to the end of the level"? What would be the difference? Selection pressures seem much better as a picture of what's going on.
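A toy sketch of that ambiguity (hypothetical code of my own, not the actual CoinRun environment or training setup): two candidate reward functions that agree on every state the agent sees during training, because the coin always sits at the end of the level, but disagree the moment the coin is moved.

```python
# Two candidate "goals" that training cannot distinguish between,
# because they coincide on the entire training distribution.

def reward_get_coin(state):
    """Reward for touching the coin."""
    return 1.0 if state["agent_pos"] == state["coin_pos"] else 0.0

def reward_reach_end(state, level_length=10):
    """Reward for reaching the end of the level."""
    return 1.0 if state["agent_pos"] == level_length else 0.0

# During training the coin is always at the end of the level, so the
# two reward functions agree on every training state.
train_states = [{"agent_pos": p, "coin_pos": 10} for p in range(11)]
assert all(reward_get_coin(s) == reward_reach_end(s) for s in train_states)

# At test time, move the coin mid-level: the two "goals" now disagree,
# and nothing in training determined which one the agent learned.
test_state = {"agent_pos": 10, "coin_pos": 4}
print(reward_get_coin(test_state), reward_reach_end(test_state))  # 0.0 1.0
```

The point of the sketch: the training signal only constrains behavior on the training distribution ("in Y situations you must behave in Z ways"); which goal generalizes off-distribution is left to the inductive biases of the learner.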

But when you communicate with people it helps to show how your beliefs connect into their existing way of thinking about things. So instead of talking about how selection pressures from training algorithms do not uniquely constrain the system you get out, we talk about how the "behavioral objective" might be different from the "training objective", and use the evolution analogy as an example that fits neatly into this schema given the way people are already thinking about these things.

(To be clear, a lot of AI safety people, probably a majority, do in fact think about this in an "objective-first" way, rather than in terms of selection; this isn't just about AI safety people communicating with other people.)
