I'm a PhD student at the Center for Human-Compatible AI (CHAI) at UC Berkeley. I edit and publish the Alignment Newsletter, a weekly publication with recent content relevant to AI alignment. In the past, I ran the EA UC Berkeley and EA at the University of Washington groups.

rohinmshah's Comments

Critical Review of 'The Precipice': A Reassessment of the Risks of AI and Pandemics
This suggests that we should schedule a time to talk in person, and/or an adversarial collaboration trying to write a version of the argument that you're thinking of.

Sounds good, I'll just clarify my position in this response, rather than arguing against your claims.

So then I guess your response is something like "But everyone forgetting to eat food is a crazy scenario, whereas the naive extrapolation of the thing we're currently doing is the default scenario".

It's more like "there isn't any intellectual work to be done / field building to do / actors to coordinate to get everyone to eat".

Whereas in the AI case, I don't know how we're going to fix the problem I outlined, and as far as I can tell neither does anyone else in the AI community; therefore there is intellectual work to be done.

We are already at significantly-better-than-human optimisation

Sorry, by optimization there I meant something more like "intelligence". I don't really care whether it comes from better SGD, some hardcoded planning algorithm, or a mesa optimizer; the question is whether it is significantly more capable than humans at pursuing goals.

I thought our opinions were much more similar.

I think our predictions of how the world will go concretely are similar; but I'd guess that I'm happier with abstract arguments that depend on fuzzy intuitive concepts than you are, and find them more compelling than more concrete ones that depend on a lot of specific details.

Critical Review of 'The Precipice': A Reassessment of the Risks of AI and Pandemics
I agree that perfect optimisers are pathological. But we are not going to train anything that is within light-years of perfect optimisation. Perfect optimisation is a totally different type of thing to what we're doing.

If you replace "perfect optimization" with "significantly-better-than-human optimization" in all of my claims, I'd continue to agree with them.

This argument feels to me like saying "We shouldn't keep building bigger and bigger bombs because in the limit of size they'll form a black hole and destroy the Earth."

If somehow I knew that this fact were true, but I didn't know at what size the bombs form a black hole and destroy us all, I would in fact see this as a valid and motivating argument for not building bigger bombs, and for trying to figure out how to build bombs that don't destroy the Earth (or coordinate to not build them at all).

Firstly because it feels reminiscent of the utility-maximisation arguments made by Yudkowsky - in both cases the arguments are based on theoretical claims which are literally true but in practice irrelevant or vacuous.

I strongly disagree with this.

The utility-maximization argument that I disagree with is something like:

"AI is superintelligent" implies "AI is EU-maximizing" implies "AI has convergent instrumental subgoals".

This claim is not true even theoretically. It's not a question of what's happening in practice.

There is a separate argument which goes

"Superintelligent AI is built by humans" implies "AI is goal-directed" implies "AI has convergent instrumental subgoals"

And I place non-trivial weight on this claim, even though it is a conceptual, fuzzy claim that we're not sure yet will be relevant in practice, and one of the implications doesn't apply in the case where the AI is pursuing some "meta" goal that refers to the human's goals.

(You might disagree with this analysis as well, but I'd guess you'd be in the minority amongst AI safety researchers.)

The argument I gave is much more like the second kind -- a conceptual claim that depends on fuzzy categories like "certain specifications".

Secondly [...]

Sorry, I don't understand your point here. It sounds like "the last time we made an argument, we were wrong, therefore we shouldn't make more arguments", but that can't be what you're saying.

Maybe your point is that ML researchers are more competent than we give them credit for, and so we should lower our probability of x-risk? If so, I mostly just want to ignore this; I'm really not making a probabilistic argument. I'm making an argument "from the perspective of humanity / the full AI community".

I think spreading the argument "if we don't do X, then we are in trouble because of problem Y" seems better than spreading something like "there is a p% of having problem Y, where I've taken into account the fact that people will try to solve Y, and that won't be sufficient because of Z; therefore we need to put more effort into X". The former is easier to understand and more likely to be true / correctly reasoned.

(I would also defend "the chance is not so low that EAs should ignore it", but that's a separate conversation, and seems not very relevant to what arguments we should spread amongst the AI community.)

Thirdly, because I am epistemically paranoid about giving arguments which aren't actually the main reason to believe in a thing. [...] I suspect that the same is not really the case for you and the argument you give.

It totally is. I have basically two main concerns with AI alignment:

  • We're aiming for the wrong thing (outer alignment)
  • Even if we aim for the right thing, we might generalize poorly (inner alignment)

If you told me that inner alignment was magically not a problem -- we always generalize in the way that the reward function would have incentivized -- I would still be worried; though it would make a significant dent in my AI risk estimate.

If you told me that outer alignment was magically not a problem (we're actually aiming for the right thing), that would make a smaller but still significant dent in my estimate of AI risk. It's only smaller because I expect the work to solve this problem to be done by default, whereas I feel less confident about that for inner alignment.

it doesn't establish that AI safety work needs to be done by someone, it just establishes that AI researchers have to avoid naively extrapolating their current work.

Why is "not naively extrapolating their current work" not an example of AI safety work? Presumably they need to extrapolate in some as-yet-unknown way, and figuring out that way sounds like a central example of AI safety work.

It seems analogous to "biologists just have to not publish infohazards, therefore there's no need to work on the malicious use category of biorisk".

Secondly because the argument is also true for image classifiers, since under perfect optimisation they could hack their loss functions. So insofar as we're not worried about them, then the actual work is being done by some other argument.

I'm not worried about them because there are riskier systems that will be built first, and because there isn't much economic value in having strongly superintelligent image classifiers. If we really tried to build strongly superintelligent image classifiers, I would be somewhat worried (though less so, since the restricted action space provides some safety).

(You might also think that image classifiers are safe because they are myopic, but in this world I'm imagining that we make non-myopic image classifiers, because they will be better at classifying images than myopic ones.)

Thirdly because I do think that counterfactual impact is the important bit, not "AI safety work needs to be done by someone."

I do think that there is counterfactual impact in expectation. I don't know why you think there isn't counterfactual impact. So far it sounds to me like "we should give the benefit of the doubt to ML researchers and assume they'll solve outer alignment", which sounds like a claim about norms, not a claim about the world.

I think the better argument against counterfactual impact is "there will be a strong economic incentive to solve these problems" (see e.g. here), and that might reduce it by an order of magnitude, but that still leaves a lot of possible impact. But also, I think this argument applies to inner alignment as well (though less strongly).

Critical Review of 'The Precipice': A Reassessment of the Risks of AI and Pandemics
What is a "certain specification"?

I agree this is a fuzzy concept, in the same way that "human" is a fuzzy concept.

Is training an AI to follow instructions, giving it strong negative rewards every time it misinterprets us, then telling it to do X, a "certain specification" of X?

No, the specification there is to follow instructions. I am optimistic about these sorts of "meta" specifications; CIRL / assistance games can also be thought of as a "meta" specification to assist the human. But like, afaict this sort of idea has only recently become common in the AI community; I would guess partly because of people pointing out problems with the regular method of writing down specifications.

Broadly speaking, think of certain specifications as things that you plug in to hardcoded optimization algorithms (not learned ones which can have "common sense" and interpret you correctly).

I just don't think this concept makes sense in modern ML, because it's the optimiser, not the AI, that is given the specification.

If you use a perfect optimizer and train in the real world with what you would intuitively call a "certain specification", an existential catastrophe almost certainly happens. Given agreement on this fact, I'm just saying that I want a better argument for safety than "it's fine because we have a less-than-perfect optimizer", which as far as I can tell is ~the argument we have right now, especially since in the future we will presumably have better optimizers (where more compute during training is a type of better optimization).

More constructively, I just put this post online. It's far from comprehensive, but it points at what I'm concerned about more specifically than anything else.

I also find that the most plausible route by which you actually get to extinction, but it's way more speculative (to me) than the arguments I'm using above.

So this observed fact doesn't help us distinguish between "everyone in AI thinks that making AIs which intend to do what we want is an integral part of their mission, but that the 'intend' bit will be easy" vs "everyone in AI is just trying to build machines that can achieve hardcoded literal objectives even if it's very difficult to hardcode what we actually want".

??? I agree that you can't literally rule the first position out, but I've talked to many people in AI, and the closest people get to this position is saying "well maybe the 'intend' bit will be easy"; I haven't seen anyone argue for it.

I feel like you're equivocating between what AI researchers want (obviously they don't want extinction) and what they actually do (things that, if extrapolated naively, would lead to extinction).

I agree that they will start (and have started) working on the 'intend' bit once it's important, but to my mind that means at that point they will have started working on the category of work that we call "AI safety". This is consistent with my statement above:

Therefore, if we want superintelligent AI systems that don't have these problems, we need to change how AI is done.

("We" in that statement was meant to refer to humanity as a whole.)

And without distinguishing them, then the "stated goal of AI" has no predictive power (if it even exists).

I specifically said this was not a prediction for this reason:

This doesn't tell you the probability with which superintelligent AI has convergent instrumental subgoals, since maybe we were always going to change how AI is done

Nonetheless, it still establishes "AI safety work needs to be done by someone", which seems like the important bit.

Perhaps you think that to motivate work by EAs on AI safety, you need to robustly demonstrate that a) there is a problem AND b) the problem won't be solved by default. I think this standard eliminates basically all x-risk prevention efforts, because you can always say "but if it's so important, someone else will probably solve it" (a thing that I think is approximately true).

(I don't think this is actually your position though, because the same critique could be applied to your new post.)

Critical Review of 'The Precipice': A Reassessment of the Risks of AI and Pandemics

We have discussed this, so I'll just give brief responses so that others know what my position is. (My response to you is mostly in the last section, the others are primarily explanation for other readers.)

Convergent instrumental subgoals aren't the problem. Large-scale misaligned goals (instrumental or not) are the problem.

I'm not entirely sure what you mean by "large-scale", but misaligned goals simply argues for "the agent doesn't do what you want". To get to "the agent kills everyone", you need to bring in convergent instrumental subgoals.

Once you describe in more detail what it actually means for an AI system to "have some specification", the "certain" bit also stops seeming like a problem.

The model of "there is a POMDP, it has a reward function, the specification is to maximize expected reward" is fully formal and precise (once you spell out the POMDP and reward), and the optimal solution usually involves convergent instrumental subgoals.

Whether or not a predefined specification gives rise to those sorts of goals depends on the AI architecture and training process in a complicated way.

I'm assuming you agree with:

1. The stated goal of AI research would very likely lead to human extinction

I agree that it is unclear whether AI systems actually get anywhere close to optimal for the tasks we train them for. However, if you think that we will get AGI and be fine, but we'll continue to give certain specifications of what we want, it seems like you also have to believe:

2. We will build AGI without changing the stated goal of AI research

3. AI research will not achieve its stated goal

The combination of 2 + 3 seems like a strange set of beliefs to have. (Not impossible, but unlikely.)

Critical Review of 'The Precipice': A Reassessment of the Risks of AI and Pandemics
Though far from perfect, I believe this process is far more transparent than the estimates provided by Ord, for which no explanation is offered as to how they were derived. This means that it is effectively impossible to subject them to critical scrutiny.

I want to note that I agree with this, and I think it's good for people to write down their explicit reasoning.

That said, I disagree pretty strongly with the section on AI.

More generally, it is unclear why we should even expect AI researchers to have any particular knowledge about the future trajectories of AI capabilities. Such researchers study and develop particular statistical and computational techniques to solve specific types of problems. I am not aware of any focus of their training on extrapolating technological trends, or on investigating historical case studies of technological change.

I don't see why people keep saying this. Given the inconsistent expert responses to surveys, I think it makes sense to say that AI researchers probably aren't great at predicting future trajectories of AI capabilities. Nonetheless, if I had no inside-view knowledge and I wanted to get a guess at AI timelines, I'd ask experts in AI. (Even now, after people have spent a significant amount of time thinking about AI timelines, I would not ask experts in trend extrapolation; I seriously doubt that they would know which trends to extrapolate without talking to AI researchers.)

I suppose you could defend a position of the form "we can't know AI timelines", but it seems ridiculous to say "we can't know AI timelines, therefore AGI risk is low".

However such current methods, in particular deep learning, are known to be subject to a wide range of limitations. [...] at present they represent deep theoretical limitations of current methods

I disagree. So do many of the researchers at OpenAI and DeepMind, who are explicitly trying to build AGI using deep learning, reinforcement learning, and similar techniques. Meanwhile, academics tend to agree with you. I think from an outside view this should be maybe a 2x hit to the probability of developing AGI soon (if you start from the OpenAI / DeepMind position).

Atari games are highly simplified environments with comparatively few degrees of freedom, the number of possible actions is highly limited, and where a clear measure of success (score) is available. Real-world environments are extremely complicated, with a vast number of possible actions, and often no clear measure of success. Uncertainty also plays little direct role in Atari games, since a complete picture of the current gamespace is available to the agent. In the real world, all information gained from the environment is subject to error, and must be carefully integrated to provide an approximate model of the environment.

All of these except for the "clear measure of success" have already been surmounted (see OpenAI Five or AlphaStar for example). I'd bet that we'll see AI systems based on deep imitation learning and related techniques that work well in domains without a clear measure of success within the next 5 years. There definitely are several obstacles to general AI systems, but these aren't the obstacles.

These skills may be regarded as a subset of a very broad notion of intelligence, but do not seem to correspond very closely at all to the way we normally use the word ‘intelligence’, nor do they seem likely to be the sorts of things AIs would be very good at doing.

... Why wouldn't AIs be good at doing these things? It seems like your main point is that AI will lack a physical body and so will be bad at social interactions, but I don't see why an AI couldn't have social interactions from a laptop screen (just like the rest of us in the era of COVID-19).

More broadly, if you object to the implication "superintelligence implies ability to dominate the world", then just take whatever mental property P you think does allow an agent to dominate the world; I suspect both Toby and I would agree with "there is a non-trivial chance that future AI systems will be superhuman at P and so would be able to dominate the world".

While this seems plausible in the case of a reinforcement learning agent, it seems far less clear that it would apply to another form of AI. In particular, it is not even clear if humans actually possess anything that corresponds to a ‘reward function’, nor is it clear that such a thing is immutable with experience or over the lifespan. To assume that an AI would have such a thing therefore is to make specific assumptions about the form such an AI would take.

I agree with this critique of Toby's argument; I personally prefer the argument given in Human Compatible, which roughly goes:

  • Almost every AI system we've created so far (not just deep RL systems) has some predefined, hardcoded, certain specification that the AI is trying to optimize for.
  • A superintelligent agent pursuing a known specification has convergent instrumental subgoals (the thing that Toby is worried about).
  • Therefore, if we want superintelligent AI systems that don't have these problems, we need to change how AI is done.

This doesn't tell you the probability with which superintelligent AI has convergent instrumental subgoals, since maybe we were always going to change how AI is done, but it does show why you might expect the "default assumption" to be an AI system that has convergent instrumental subgoals, instead of one that is more satisficing like humans are.

the fate of a humanity dominated by an AI would be in the hands of that AI (or collective of AIs that share control)

This seems true to me, but if an AI system was so misaligned as to subjugate humans, I don't see why you should be hopeful that future changes in its motivations lead to it not subjugating humans. It's possible, but seems very unlikely (< 1%).

I regard 1) as roughly as likely as not

Isn't this exactly the same as Toby's estimate? (I actually don't know, I have a vague sense that this is true and was stated in The Precipice.)

Probability of unaligned artificial intelligence

Here are my own estimates for your causal pathway:

1: 0.8

2 conditioned on 1: 0.05 (I expect that there will be an ecosystem of AI systems, not a single AI system that can achieve a decisive strategic advantage)

3 conditioned on 1+2: 0.3 (If there is a single AI system that has a DSA, probably it took us by surprise, seems less likely we solved the problem in that world)

4 conditioned on 1+2+3: 0.99

Which gives in total ~0.012, or about 1%.

But really, the causal pathway I would want involves a change to 2 and 3:

2+3: Some large fraction of the AI systems in the world have reason / motivation to usurp power, and by coordinating they are able to do it.


1: 0.8

2+3 conditioned on 1: 0.1 (with ~10% on "has the motivation to usurp power" and ~95% on "can usurp power")

4: 0.99

Which comes out to ~0.08, or 8%.
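
A minimal sketch of the arithmetic behind both totals, assuming the steps simply multiply as a chain of conditional probabilities. The variable and function names are illustrative and not from the original comment; the numbers are copied from the estimates above.

```python
from math import prod

# Conditional probabilities copied from the two estimates above.
# Original pathway: step 1, 2 given 1, 3 given 1+2, 4 given 1+2+3.
original_pathway = [0.8, 0.05, 0.3, 0.99]
# Modified pathway: step 1, combined step 2+3 given 1, step 4.
modified_pathway = [0.8, 0.1, 0.99]


def chain(probabilities):
    """Multiply a chain of conditional probabilities into one joint probability."""
    return prod(probabilities)


original_total = chain(original_pathway)
modified_total = chain(modified_pathway)

# The combined 2+3 step is itself a product of the two sub-estimates
# ("has the motivation" and "can usurp power"), rounded to 0.1 above.
combined_step = 0.10 * 0.95

print(f"original pathway: {original_total:.4f}")  # 0.0119, i.e. about 1%
print(f"modified pathway: {modified_total:.4f}")  # 0.0792, i.e. about 8%
print(f"combined 2+3:     {combined_step:.4f}")   # 0.0950, rounded to 0.1
```

Almost all of the gap between the two totals comes from replacing the 0.05 × 0.3 = 0.015 in the original pathway with the single 0.1 in the modified one.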

How do you talk about AI safety?
While I haven't read the book, Slate Star Codex has a great review of Human Compatible. Scott says it speaks of AI safety, especially in the long-term future, in a very professional-sounding and not weird way. So I suggest reading that book, or that review.

Was going to recommend this as well (and I have read the book).

COVID-19 response as XRisk intervention
Note: part of what impressed Scott here was being early to raise the alarm, and that boat has already sailed, so it could be that future COVID-19 work won't do much to impress people like him.

I think that's crucial -- I'm generally supportive of EAs / rationalists doing things like COVID-19 work when they have a comparative advantage at doing so, which is a factor in why I support forecasting / meta work even now, and I'd certainly want biosecurity people to at least be thinking about how they could help with COVID-19 (as they in fact are). But the OP isn't arguing that, and whether or not it was intended, I could see readers thinking that they should be actively trying to work on COVID even if they don't have an obvious comparative advantage at it, and that seems wrong to me.

This point about comparative advantage is also why I wrote:

I'd probably change my mind if I thought that these other longtermists could actually make a large impact on the COVID-19 response, but that seems quite unlikely to me.

COVID-19 response as XRisk intervention

I can't tell whether you're arguing "some small subset of EAs/rationalists are in a great position to fight COVID-19 and they should do so" vs. "if an arbitrary EA/rationalist wants to fight COVID-19, they shouldn't worry that they are doing less because they aren't reducing x-risk" vs. "COVID-19 is such an opportunity for x-risk reduction that nearly all longtermists should be focusing on it now".

I agree with the first (in particular for people who work on forecasting / "meta" stuff), but not with the latter two. To the extent you're arguing for the latter two, I don't find the arguments very convincing, because they aren't comparing against counterfactuals. Taking each point in turn:

Training Ourselves

I agree that COVID-19 is particularly good for training the general bucket of forecasting / applied epistemology / scenario-planning.

However, for coordination, persuasive argumentation, networking, and project management, I don't see why COVID-19 is particularly better than other projects you could be working on. For example, I think I practiced all of those skills by organizing a local EA group; it also seems like ~any project that involves advocacy would likely require / train all of these skills.

Forging alliances

Presumably for most goals there are more direct ways to forge alliances than by working on COVID-19. E.g. you mentioned AI safety -- if I wanted to forge alliances with people at OSTP, I'd focus on current AI issues like interpretability and fairness.

Establishing credibility

I agree that this is important for the more "meta" parts of x-risk, such as forecasting. But for those of us who are working closer to the object level (e.g. technical AI safety, nuclear war, climate change), I don't really see how this is going to help establish credibility that's used in the future.

Growing the global risk movement

You talk about field-building here, which in fact seems like an important thing to be doing, but seems basically unrelated to the COVID-19 response. I'd guess that field-building has ~zero effect on how many people die from COVID-19 this year.

Creating XRisk infrastructure

Agreed that this is good.

Overall take: It does seem like anyone working on "meta" approaches to x-risk reduction probably should be thinking very seriously about how they can contribute to the COVID-19 response, but I'd guess that for most other longtermists the argument "it is just a distraction" is basically right.

I'd probably change my mind if I thought that these other longtermists could actually make a large impact on the COVID-19 response, but that seems quite unlikely to me.

The case for building more and better epistemic institutions in the effective altruism community

I really like the general class of improving community epistemics :)

That being said, I feel pretty pessimistic about having dedicated "community builders" come in to create good institutions that would then improve the epistemics of the field: in my experience, most such attempts fail, because they don't actually solve a problem in a way that works for the people in the field (and in addition, they "poison the well", in that it makes it harder for someone else to build an actually-functioning version of the solution, because everyone in the field now expects it to fail and so doesn't buy in to it).

I feel much better about people within the field figuring out ways to improve the epistemics of the community they're in, trialing them out themselves, and if they seem to work well only then attempting to formalize them into an institution.

Take me as an example. I've done a lot of work that could be characterized as "trying to improve the epistemics of a community", such as:

The first five couldn't have been done by a person without the relevant expertise (in AI alignment for the first four, and in EA group organizing for the fifth). If they were trying to build institutions that would lead to any of these six things happening, I think they might have succeeded, but it probably would have taken multiple years, as opposed to it taking ~a month each for me. (Here I'm assuming that an institution is "built" once it operates through the effort of people within the field, with no or very little ongoing effort from the person who started the institution.) It's just quite hard to build institutions for a field without significant buy-in from people in the field, and creating that buy-in is hard.

I think people who find the general approach in this post interesting should probably be becoming very knowledgeable about a particular field (both the technical contents of the field, as well as the landscape of people who work on it), and then trying to improve the field from within.

It's also of course fine to think of ideas for better institutions and pitch them to people in the field; what I want to avoid is coming up with a clever idea and then trying to cause it to exist without already having a lot of buy in from people in the field.

What are the challenges and problems with programming law-breaking constraints into AGI?

Yeah, I certainly feel better about learning law relative to learning the One True Set of Human Values That Shall Then Be Optimized Forevermore.
