Rohin Shah

3381 · Joined May 2015


Hi, I'm Rohin Shah! I work as a Research Scientist on the technical AGI safety team at DeepMind. I completed my PhD at the Center for Human-Compatible AI at UC Berkeley, where I worked on building AI systems that can learn to assist a human user, even if they don't initially know what the user wants.

I'm particularly interested in big picture questions about artificial intelligence. What techniques will we use to build human-level AI systems? How will their deployment affect the world? What can we do to make this deployment go better? I write up summaries and thoughts about recent work tackling these questions in the Alignment Newsletter.

In the past, I ran the EA UC Berkeley and EA at the University of Washington groups.


You can interpret low karma either as a sign that the karma system is broken or as a sign that the summaries aren't sufficiently good. In hindsight I think you're right and I lean more towards the former: even though people tell me they like my newsletter, it doesn't actually get that much karma.

I thought you thought that karma was a decent measure since you suggested

Putting the summary up as a Forum post and seeing if it gets a certain number of karma

as a way to evaluate how good a summary is.

Idk, in my particular case I'd say writing summaries was a major reason that I now have prestige / access to resources.

I think it's probably just hard to write good summaries; many of the summaries posted here don't get very much karma.

I'm surprised that "write summaries" isn't one of the proposed concrete solutions. One person can do a lot.

Yeah, I don't think it's clearly unreasonable (though it's not my intuition).

I agree that suicide rates are not particularly strong evidence one way or the other.

I broadly agree that "what does a life barely worth living look like" matters a lot, and you could imagine setting it to be high enough that the repugnant conclusion doesn't look repugnant.

That being said, if you set it too high, there are other counterintuitive conclusions. For example, if you set it higher than people alive today (as it sounds like you're doing), then you are saying that people alive today have negative terminal value, and (if we ignore instrumental value) it would be better if they didn't exist.

So, did I or didn't I come across as unfriendly/hostile?

You didn't to me, but also (a) I know you in person and (b) I'm generally pretty happy to be in forceful arguments and don't interpret them as unfriendly / hostile, while other people plausibly would (see also combat culture). So really I think I'm the wrong person to ask.

So, given that I wanted to do both 1 and 2, would you think it would have been fine if I had just made them as separate comments, instead of mentioning 1 in passing in the thread on 2? Or do you think I really should have picked one to do and not done both?

I think you can do both, if it's clear that you're doing these as two separate things. (Which could be by having two different comments, or by signposting clearly in a single comment.)

In this particular situation I'm objecting to starting with (2), then switching to (1) after a critique without acknowledging that you had updated on (2) and so were going to (1) instead. When I see that behavior from a random Internet commenter I'm like "ah, you are one of the people who rationalizes reasons for beliefs, and so your beliefs do not respond to evidence, I will stop talking with you now". You want to distinguish yourself from the random Internet commenter.

(And if you hadn't updated on (2), then my objection would have been "you are bad at collaborative truth-seeking, you started to engage on one node and then you jumped to a totally different node before you had converged on that one node, you'll never make progress this way".)

Did I come across as unfriendly and hostile? I am sorry if so, that was not my intent.

No, that's not what I meant. I'm saying that the conversational moves you're making are not ones that promote collaborative truth-seeking.

Any claim of actual importance usually has a giant tree of arguments that back it up. Any two people are going to disagree on many different nodes within this tree (just because there are so many nodes). In addition, it takes a fair amount of effort just to understand and get to the same page on any one given node.

So, if you want to do collaborative truth-seeking, you need to have the ability to look at one node of the tree in isolation, while setting aside the rest of the nodes.

In general when someone is talking about some particular node (like "evolution anchor for AGI timelines"), I think you have two moves available:

  1. Say "I think the actually relevant node to our disagreement is <other node>"
  2. Engage with the details of that particular node, while trying to "take on" the views of the other person for the other nodes

(As a recent example, the ACX post on underpopulation does move 2 for Sections 1-8 and move 1 for Section 9.)

In particular, the thing not to do is to talk about the particular node, then jump around into other nodes where you have other disagreements, because that's a way to multiply the number of disagreements you have and fail to make any progress on collaborative truth-seeking. Navigating disagreements is hard enough that you really want to keep them as local / limited as possible.

(And if you do that, then other people will learn that they aren't going to learn much from you because the disagreements keep growing rather than progress being made, and so they stop trying to do collaborative truth-seeking with you.)

Of course sometimes you start doing move (2) and then realize that actually you think your partner is correct in their assessment given their views on the other nodes, and so you need to switch to move (1). I think in that situation you should acknowledge that you agree with their assessment given their other views, and then say that you still disagree on the top-level claim because of <other node>.

Lots of thoughts on this post:

Value of inside views

Inside Views are Overrated [...]

The obvious reason to form inside views is to form truer beliefs

No? The reason to form inside views is that it enables better research, and I'm surprised this mostly doesn't feature in your post. Quoting past-you:

  • Research quality - Doing good research involves having good intuitions and research taste, sometimes called an inside view, about why the research matters and what’s really going on. This conceptual framework guides the many small decisions and trade-offs you make on a daily basis as a researcher
    • I think this is really important, but it’s worth distinguishing this from ‘is this research agenda ultimately useful’. This is still important in eg pure maths research just for doing good research, and there are areas of AI Safety where you can do ‘good research’ without actually reducing the probability of x-risk.

Quoting myself:

There’s a longstanding debate about whether one should defer to some aggregation of experts (an “outside view”), or try to understand the arguments and come to your own conclusion (an “inside view”). This debate mostly focuses on which method tends to arrive at correct conclusions. I am not taking a stance on this debate; I think it’s mostly irrelevant to the problem of doing good research. Research is typically meant to advance the frontiers of human knowledge; this is not the same goal as arriving at correct conclusions. If you want to advance human knowledge, you’re going to need a detailed inside view.

Let’s say that Alice is an expert in AI alignment, and Bob wants to get into the field, and trusts Alice’s judgment. Bob asks Alice what she thinks is most valuable to work on, and she replies, “probably robustness of neural networks”. What might have happened in Alice’s head?

Alice (hopefully) has a detailed internal model of risks from failures of AI alignment, and a sketch of potential solutions that could help avert those risks. Perhaps one cluster of solutions seems particularly valuable to work on. Then, when Bob asks her what work would be valuable, she has to condense all of the information about her solution sketch into a single word or phrase. While “robustness” might be the closest match, it’s certainly not going to convey all of Alice’s information.

What happens if Bob dives straight into a concrete project to improve robustness? I’d expect the project will improve robustness along some axis that is different from what Alice meant, ultimately rendering the improvement useless for alignment. There are just too many constraints and considerations that Alice is using in rendering her final judgment, that Bob is not aware of.

I think Bob should instead spend some time thinking about how a solution to robustness would mean that AI risk has been meaningfully reduced. Once he has a satisfying answer to that, it makes more sense to start a concrete project on improving robustness. In other words, when doing research, use senior researchers as a tool for deciding what to think about, rather than what to believe.

It’s possible that after all this reflection, Bob concludes that impact regularization is more valuable than robustness. The outside view suggests that Alice is more likely to be correct than Bob, given that she has more experience. If Bob had to bet which of them was correct, he should probably bet on Alice. But that’s not the decision he faces: he has to decide what to work on. His options probably look like:

  1. Work on a concrete project in robustness, which has perhaps a 1% chance of making valuable progress on robustness. The probability of valuable work is low since he does not share Alice’s models about how robustness can help with AI alignment.
  2. Work on a concrete project in impact regularization, which has perhaps a 50% chance of making valuable progress on impact regularization.

It’s probably not the case that progress in robustness is 50x more valuable than progress in impact regularization, and so Bob should go with (2). Hence the advice: build a gearsy, inside-view model of AI risk, and think about that model to find solutions.
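As a rough sketch of the arithmetic behind that comparison: the 1% and 50% figures come from the two options above, while the function names and the example values (1.0 and 10.0) are illustrative assumptions, not estimates anyone has made.

```python
def expected_value(p_success: float, value_if_success: float) -> float:
    """Expected value of a project that succeeds with probability p_success."""
    return p_success * value_if_success


def break_even_value_ratio(p_low: float, p_high: float) -> float:
    """How many times more valuable success on the low-probability project
    must be for both projects to have equal expected value."""
    return p_high / p_low


# Probabilities taken from the two options above.
P_ROBUSTNESS = 0.01  # option 1: Bob does not share Alice's models
P_IMPACT_REG = 0.50  # option 2: Bob works from his own inside view

ratio = break_even_value_ratio(P_ROBUSTNESS, P_IMPACT_REG)
print(f"Robustness progress must be about {ratio:.0f}x as valuable to break even")

# Hypothetical values: even if robustness progress were 10x as valuable,
# option 2 would still have the higher expected value.
ev_robustness = expected_value(P_ROBUSTNESS, 10.0)
ev_impact_reg = expected_value(P_IMPACT_REG, 1.0)
print(ev_impact_reg > ev_robustness)
```

Under these assumptions, option 1 only wins if progress on robustness is worth more than roughly 50 times as much as progress on impact regularization.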

(Though I should probably edit that section to also mention that Bob could execute on Alice's research agenda, if Alice is around to mentor him; and that would probably be more directly impactful than either of the other two options.)

Other meta thoughts on inside views

  • Relatedly, it's much more important to understand other people's views than to evaluate them - if I can repeat a full, gears-level model of someone's view back to them in a way that they endorse, that's a lot more valuable than figuring out how much I agree or disagree with their various beliefs and conclusions.
    • [...] having several models lets you compare and contrast them, figure out novel predictions, better engage with technical questions, do much better research, etc


I'm having trouble actually visualizing a scenario where Alice understands Bob's views (well enough to make novel predictions that Bob endorses, and say how Bob would update upon seeing various bits of evidence), but Alice is unable to evaluate Bob's view. Do you think this actually happens? Any concrete examples that I can try to visualize?

(Based on later parts of the post maybe you are mostly saying "don't reject an expert's view before you've tried really hard to understand it and make it something that does work", which I roughly agree with.)

Forming a "true" inside view - one where you fully understand something from first principles with zero deferring - is wildly impractical.

Yes, clearly true. I don't think anyone is advocating for this. I would say I have an inside view on bio anchors as a way to predict timelines, but I haven't looked into the data for Moore's Law myself and am deferring to others on that.

People often orient to inside views pretty unhealthily.


What fraction of people who are trying to build inside views do you think have these problems? (Relevant since I often encourage people to do it)

I know some people who do great safety relevant work, despite not having an inside view.

Hmm, I kind of agree in that there are people without inside views who are working on projects that other people with inside views are mentoring them on. I'm not immediately thinking of examples of people without inside views doing independent research that I would call "great safety relevant work".

(Unless perhaps you're counting e.g. people who do work on forecasting AGI, without having an inside view on how AGI leads to x-risk? I would say they have a domain-specific inside view on forecasting AGI.)

Forming inside views will happen naturally, and will happen much better alongside actually trying to do things and contribute to safety - you don't form them by locking yourself in your room for months and meditating on safety!

Idk, I feel like I formed my inside views by locking myself in my room for months and meditating on safety. This did involve reading things other people wrote, and talking with other junior grad students at CHAI who were also orienting to the problem. But I think it did not involve trying to do things and contributing to safety (I did do some of that but I think that was mostly irrelevant to me developing an inside view).

I do agree that if you work on topic X, you will naturally form an inside view on topic X as you get more experience with it. But in AI safety that would look more like "developing a domain-specific inside view on (say) learning from human feedback and its challenges" rather than an overall view on AI x-risk and how to address it. (In fact it seems like the way to get experience with an overall view on AI x-risk and how to address it is to meditate on it, because you can't just run experiments on AGI.)

Inside views lie on a spectrum. You will never form a "true" inside view, but conversely, not having a true inside view doesn't mean you're failing, or that you shouldn't even try. You want to aim to get closer to having an inside view! And making progress here is great and worthy

Strong +1

Aim for domain specific inside views. As an interpretability researcher, it's much more important to me to have an inside view re how to make interpretability progress and how this might interact with AI X-risk, than it is for me to have an inside view on timelines, the worth of conceptual alignment work, etc.

Yes, once you've decided that you're going to be an interpretability researcher, then you should focus on an interpretability-specific inside view. But "what should I work on" is also an important decision, and benefits from a broader inside view on a variety of topics. (I do agree though that it is a pretty reasonable strategy to just pick a domain based on deference and then only build a domain-specific inside view.)

Concrete advice

inside views are about zooming in. Concretely, in this framework, inside views look like starting with some high-level confusing claim, and then breaking it down into sub-claims, breaking those down into sub-claims, etc.

I agree that this is a decent way to measure your inside view -- like, "how big can you make this zooming-in tree before you hit a claim where you have to defer" is a good metric for "how detailed your inside view is".

I'm less clear on whether this is a good way to build an inside view, because a major source of difficulty for this strategy is in coming up with the right decomposition into sub-claims. Especially in the earlier stages of building an inside view, even your first and second levels of decomposition are going to be bad and will change over time. (For example, even for something like "why work on AI safety", Buck and I have different decompositions.) It does seem more useful once you've got a relatively fleshed out inside view, as a way to extend it further -- at this point I can in fact write out a tree of claims and expect that they will stay mostly the same (at the higher levels) after a few years, and so the leaves that I get to probably are good things to investigate.


These seem great and I'd strongly recommend people try them out :)

Meta: I feel like the conversation here and with Nuno's reply looks kinda like:

Nuno: People who want to use the evolutionary anchor as an upper bound on timelines should consider that it might be an underestimate, because the environment might be computationally costly.

You: It's not an underestimate: here's a plausible strategy by which you can simulate the environment.

Nuno / me: That strategy does not seem like it clearly supports the upper bound on timelines, for X, Y and Z reasons.

You: The evolution anchor doesn't matter anyway and barely affects timelines.

This seems bad:

  1. If you're going to engage with a subpoint that OP made that was meant to apply in some context (namely, getting an upper bound on timelines), stick within that context (or at least signpost that you're no longer engaging with the OP).
  2. I don't really understand why you bothered to do the analysis if you're not changing the analysis based on critiques that you agree are correct. (If you disagree with the critique then say that instead.)

If I understand you correctly, you are saying that the Evolution Anchor might not decrease in cost with time as fast as the various neural net anchors?

Yes, and in particular, the mechanism is that environment simulation cost might not decrease as fast as machine learning algorithmic efficiency. (Like, the numbers for algorithmic efficiency are anchored on estimates like AI and Efficiency, those estimates seem pretty unlikely to generalize to "environment simulation cost".)

her spreadsheet splits up algorithmic progress into different buckets for each anchor, so the spreadsheet already handles this nuance.

Just because someone could change the numbers to get a different output doesn't mean that the original numbers weren't flawed and that there's no value in pointing that out?

E.g. suppose I had the following timelines model:

Input: N, the number of years till AGI.

Output: Timeline is 2022 + N.

I publish a report estimating N = 1000, so that my timeline is 3022. If you then come and give a critique saying "actually N should be 10 for a timeline of 2032", presumably I shouldn't say "oh, my spreadsheet already allows you to choose your own value of N, so it handles that nuance".
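Written out as code, the toy model looks something like the sketch below (the function and constant names are made up for illustration):

```python
BASE_YEAR = 2022  # the starting year used in the toy model above


def timeline(n_years_to_agi: int) -> int:
    """Toy timelines model: the output is fully determined by the input N."""
    return BASE_YEAR + n_years_to_agi


published = timeline(1000)  # the report's estimate of N
critique = timeline(10)     # the critic's estimate of N
print(published, critique)
```

The function accepts either value of N without complaint, which is exactly why "you can choose your own N" is no defense of the published estimate: the critique is about the input, not about whether the model can represent it.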

To be clear, my own view is also that the evolution anchor doesn't matter, and I put very little weight on it and the considerations in this post barely affect my timelines. 

Note that this analysis is going to wildly depend on how progress on "environment simulation efficiency" compares to progress on "algorithmic efficiency". If you think it will be slower then the analysis above doesn't work.
