Hi I'm Steve Byrnes, an AGI safety researcher in Boston, MA, USA, with a particular focus on brain algorithms—see

Wiki Contributions


[Discussion] Best intuition pumps for AI safety

One of my theories here is that it's helpful to pivot quickly towards "here's an example concrete research problem that seem hard but not impossible, and people are working on it, and not knowing the solution seems obviously problematic". This is good for several reasons, including "pattern-matching to serious research, safety engineering, etc., rather than pattern-matching to sci-fi comics", providing a gentler on-ramp (as opposed to wrenching things like "your children probably won't die of natural causes" or whatever), providing food for thought, etc. Of course this only works if you can engage in the technical arguments. Brian Christian's book is the extreme of this approach.

Why aren't you freaking out about OpenAI? At what point would you start?

Vicarious and Numenta are both explicitly trying to build AGI, and neither does any safety/alignment  research whatsoever. I don't think this fact is particularly relevant to OpenAI, but I do think it's an important fact in its own right, and I'm always looking for excuses to bring it up.  :-P

Anyone who wants to talk about Vicarious or Numenta in the context of AGI safety/alignment, please DM or email me.  :-)

Why does (any particular) AI safety work reduce s-risks more than it increases them?

I don't really distinguish between effects by order*

I agree that direct and indirect effects of an action are fundamentally equally important (in this kind of outcome-focused context) and I hadn't intended to imply otherwise.

Why does (any particular) AI safety work reduce s-risks more than it increases them?

Hmm, it seems to me (and you can correct me) that we should be able to agree that there are SOME technical AGI safety research publications that are positive under some plausible beliefs/values and harmless under all plausible beliefs/values, and then we don't have to talk about cluelessness and tradeoffs, we can just publish them.

And we both agree that there are OTHER technical AGI safety research publications that are positive under some plausible beliefs/values and negative under others. And then we should talk about your portfolios etc. Or more simply, on a case-by-case basis, we can go looking for narrowly-tailored approaches to modifying the publication in order to remove the downside risks while maintaining the upside.

I feel like we're arguing past each other: I keep saying the first category exists, and you keep saying the second category exists. We should just agree that both categories exist! :-)

Perhaps the more substantive disagreement is what fraction of the work is in which category. I see most but not all ongoing technical work as being in the first category, and I think you see almost all ongoing technical work as being in the second category. (I think you agreed that "publishing an analysis about what happens if a cosmic ray flips a bit" goes in the first category.)

(Luke says "AI-related" but my impression is that he mostly works on AGI governance not technical, and the link is definitely about governance not technical. I would not be at all surprised if proposed governance-related projects were much more heavily weighted towards the second category, and am only saying that technical safety research is mostly first-category.)

For example, if you didn't really care about s-risks, then publishing a useful considerations for those who are concerned about s-risks might take attention away from your own priorities, or it might increase cooperation, and the default position to me should be deep uncertainty/cluelessness here, not that it's good in expectation or bad in expectation or 0 in expectation.

This points to another (possible?) disagreement. I think maybe you have the attitude where (to caricature somewhat) if there's any downside risk whatsoever, no matter how minor or far-fetched, you immediately jump to "I'm clueless!". Whereas I'm much more willing to say: OK, I mean, if you do anything at all there's a "downside risk" in a sense, just because life is uncertain, who knows what will happen, but that's not a good reason to let just sit on the sidelines and let nature take its course and hope for the best. If I have a project whose first-order effect is a clear and specific and strong upside opportunity, I don't want to throw that project out unless there's a comparably clear and specific and strong downside risk. (And of course we are obligated to try hard to brainstorm what such a risk might be.)  Like if a firefighter is trying to put out a fire, and they aim their hose at the burning interior wall, they don't stop and think, "Well I don't know what will happen if the wall gets wet, anything could happen, so I'll just not pour water on the fire, y'know, don't want to mess things up."

The "cluelessness" intuition gets its force from having a strong and compelling upside story weighed against a strong and compelling downside story, I think.

If the first-order effect of a project is "directly mitigating an important known s-risk", and the second-order effects of the same project are "I dunno, it's a complicated world, anything could happen", then I say we should absolutely do that project.

Why does (any particular) AI safety work reduce s-risks more than it increases them?

In practice, we can't really know with certainty that we're making AI safer, and without strong evidence/feedback, our judgements of tradeoffs may be prone to fairly arbitrary subjective judgements, motivated reasoning and selection effects.

This strikes me as too pessimistic. Suppose I bring a complicated new board game to a party. Two equally-skilled opposing teams each get a copy of the rulebook to study for an hour before the game starts. Team A spends the whole hour poring over the rulebook and doing scenario planning exercises. Team B immediately throws the rulebook in the trash and spends the hour watching TV.

Neither team has "strong evidence/feedback"—they haven't started playing yet. Team A could think they have good strategy ideas but in fact they are engaging in arbitrary subjective judgments and motivated reasoning. In fact, their strategy ideas, which seemed good on paper, could in fact turn out to be counterproductive!

Still, I would put my money on Team A beating Team B. Because Team A is trying. Their planning abilities don't have to be all that good to be strictly better (in expectation) than "not doing any planning whatsoever, we'll just wing it". That's a low bar to overcome!

So by the same token, it seems to me that vast swathes of AGI safety research easily surpasses the (low) bar of doing better in expectation than the alternative of "Let's just not think about it in advance, we'll wing it".

For example, compare (1) a researcher spends some time thinking about what happens if a cosmic ray flips a bit (or a programmer makes a sign error, like in the famous GPT-2 incident), versus (2) nobody spends any time thinking about that. (1) is clearly better, right? We can always be concerned that the person won't do a great job, or that it will be counterproductive because they'll happen across very dangerous information and then publish it, etc. But still, the expected value here is  clearly positive, right?

You also bring up the idea that (IIUC) there may be objectively good safety ideas but they might not actually get implemented because there won't be a "strong and justified consensus" to do them. But again, the alternative is "nobody comes up with those objectively good safety ideas in the first place". That's even worse, right? (FWIW I consider "come up with crisp and rigorous and legible arguments for true facts about AGI safety" to be a major goal of AGI safety research.)

Anyway, I'm objecting to undirected general feelings of "gahhhh we'll never know if we're helping at all", etc. I think there's just a lot of stuff in the AGI safety research field which is unambiguously good in expectation, where we don't have to feel that way. What I don't object to—and indeed what I strongly endorse—is taking a more directed approach and say "For AGI safety research project #732, what are the downside risks of this research, and how do they compare to the upsides?"

So that brings us to "ambitious value alignment". I agree that an ambitiously-aligned AGI comes with a couple potential sources of s-risk that other types of AGI wouldn't have, specifically via (1) sign flip errors, and (2) threats from other AGIs. (Although I think (1) is less obviously a problem than it sounds, at least in the architectures I think about.) On the other hand, (A) I'm not sure anyone is really working on ambitious alignment these days … at least Rohin Shah & Paul Christiano have stated that narrow (task-limited) alignment is a better thing to shoot for (and last anyone heard MIRI was shooting for task-limited AGIs too); (B) my sense is that current value-learning work (e.g. at CHAI) is more about gaining conceptual understanding then creating practical algorithms / approaches that will scale to AGI. That said, I'm far from an expert on the current value learning literature; frankly I'm often confused by what such researchers are imagining for their longer-term game-plan.

BTW I put a note on my top comment that I have a COI. If you didn't notice. :)

Why does (any particular) AI safety work reduce s-risks more than it increases them?

Hmm, just a guess, but …

  • Maybe you're conceiving of the field as "AI alignment", pursuing the goal "figure out how to bring an AI's goals as close as possible to a human's (or humanity's) goals, in their full richness" (call it "ambitious value alignment")
  • Whereas I'm conceiving the field as "AGI safety", with the goal "reduce the risk of catastrophic accidents involving AGIs".

"AGI safety research" (as I think of it) includes not just how you would do ambitious value alignment, but also whether you should do ambitious value alignment. In fact, AGI safety research may eventually result in a strong recommendation against doing ambitious value alignment, because we find that it's dangerously prone to backfiring, and/or that some alternative approach is clearly superior (e.g. CAIS, or microscope AI, or act-based corrigibility or myopia or who knows what). We just don't know yet. We have to do the research.

"AGI safety research" (as I think of it) also includes lots of other activities like analysis and mitigation of possible failure modes (e.g. asking what would happen if a cosmic ray flips a bit in the computer), and developing pre-deployment testing protocols, etc. etc.

Does that help? Sorry if I'm missing the mark here.

Why does (any particular) AI safety work reduce s-risks more than it increases them?


(Incidentally, I don't claim to have an absolutely watertight argument here that AI alignment research couldn't possibly be bad for s-risks, just that I think the net expected impact on s-risks is to reduce them.)

If s-risks were increased by AI safety work near (C), why wouldn't they also be increased near (A), for the same reasons?

I think suffering minds are a pretty specific thing, in the space of "all possible configurations of matter". So optimizing for something random (paperclips, or "I want my field-of-view to be all white", etc.) would almost definitely lead to zero suffering (and zero pleasure). (Unless the AGI itself has suffering or pleasure.) However, there's a sense in which suffering minds are "close" to the kinds of things that humans might want an AGI to want to do. Like, you can imagine how if a cosmic ray flips a bit, "minimize suffering" could turn into "maximize suffering". Or at any rate, humans will try (and I expect succeed even without philanthropic effort) to make AGIs with a prominent human-like notion of "suffering", so that it's on the table as a possible AGI goal.

In other words, imagine you're throwing a dart at a dartboard.

  • The bullseye has very positive point value.
    • That's representing the fact that basically no human wants astronomical suffering, and basically everyone wants peace and prosperity etc.
  • On other parts of the dartboard, there are some areas with very negative point value.
    • That's representing the fact that if programmers make an AGI that desires something vaguely resembling what they want it to desire, that could be an s-risk.
  • If you miss the dartboard entirely, you get zero points.
    • That's representing the fact that a paperclip-maximizing AI would presumably not care to have any consciousness in the universe (except possibly its own, if applicable).

So I read your original post as saying "If the default is for us to miss the dartboard entirely, it could be s-risk-counterproductive to improve our aim enough that we can hit the dartboard", and my response to that was "I don't think that's relevant, I think it will be really easy to not miss the dartboard entirely, and this will happen "by default". And in that case, better aim would be good, because it brings us closer to the bullseye."

Why does (any particular) AI safety work reduce s-risks more than it increases them?

Sorry I'm not quite sure what you mean. If we put things on a number line with (A)=1, (B)=2, (C)=3, are you disagreeing with my claim "there is very little probability weight in the interval ", or with my claim "in the interval , moving down towards 1 probably reduces s-risk", or with both, or something else?

Why does (any particular) AI safety work reduce s-risks more than it increases them?

[note that I have a COI here]

Hmm, I guess I've been thinking that the choice is between (A) "the AI is trying to do what a human wants it to try to do" vs (B) "the AI is trying to do something kinda weirdly and vaguely related to what a human wants it to try to do". I don't think (C) "the AI is trying to do something totally random" is really on the table as a likely option, even if the AGI safety/alignment community didn't exist at all.

That's because everybody wants the AI to do the thing they want it to do, not just long-term AGI risk people. And I think there are really obvious things that anyone would immediately think to try, and these really obvious techniques would be good enough to get us from (C) to (B) but not good enough to get us to (A).

[Warning: This claim is somewhat specific to a particular type of AGI architecture that I work on and consider most likely—see e.g. here. Other people have different types of AGIs in mind and would disagree. In particular, in the "deceptive mesa-optimizer" failure mode (which relates to a different AGI architecture than mine) we would plausibly expect failures to have random goals like "I want my field-of-view to be all white", even after reasonable effort to avoid that. So maybe people working in other areas would have different answers, I dunno.]

I agree that it's at least superficially plausible that (C) might be better than (B) from an s-risk perspective. But if (C) is off the table and the choice is between (A) and (B), I think (A) is preferable for both s-risks and x-risks.

evelynciara's Shortform

The main argument of Stuart Russell's book focuses on reward modeling as a way to align AI systems with human preferences.

Hmm, I remember him talking more about IRL and CIRL and less about reward modeling. But it's been a little while since I read it, could be wrong.

If it's really difficult to write a reward function for a given task Y, then it seems unlikely that AI developers would deploy a system that does it in an unaligned way according to a misspecified reward function. Instead, reward modeling makes it feasible to design an AI system to do the task at all.

Maybe there's an analogy where someone would say "If it's really difficult to prevent accidental release of pathogens from your lab, then it seems unlikely that bio researchers would do research on pathogens whose accidental release would be catastrophic". Unfortunately there's a horrifying many-decades-long track record of accidental release of pathogens from even BSL-4 labs, and it's not like this kind of research has stopped. Instead it's like, the bad thing doesn't happen every time, and/or things seem to be working for a while before the bad thing happens, and that's good enough for the bio researchers to keep trying.

So as I talk about here, I think there are going to be a lot of proposals to modify an AI to be safe that do not in fact work, but do seem ahead-of-time like they might work, and which do in fact work for a while as training progresses. I mean, when x-risk-naysayers like Yann LeCun or Jeff Hawkins are asked how to avoid out-of-control AGIs, they can spout off a list of like 5-10 ideas that would not in fact work, but sound like they would. These are smart people and a lot of other smart people believe them too. Also, even something as dumb as "maximize the amount of money in my bank account" would plausibly work for a while and do superhumanly-helpful things for the programmers, before it starts doing superhumanly-bad things for the programmers.

Even with reward modeling, though, AI systems are still going to have similar drives due to instrumental convergence: self-preservation, goal preservation, resource acquisition, etc., even if they have goals that were well specified by their developers. Although maybe corrigibility and not doing bad things can be built into the systems' goals using reward modeling.

Yup, if you don't get corrigibility then you failed.

Load More