RobBensinger

2525 · Joined Sep 2014

Sequences: 1 (Late 2021 MIRI Conversations)

Comments: 307 · Topic Contributions: 2

I'd guess not. From my perspective, humanity's bottleneck is almost entirely that we're clueless about alignment. If a meme adds muddle and misunderstanding, then it will be harder to get a critical mass of researchers who are extremely reasonable about alignment, and therefore harder to solve the problem.

It's hard for muddle and misinformation to spread in exactly the right way to offset those costs; and attempting to strategically sow misinformation will tend to erode our ability to think well and to trust each other.

From another subthread:

I'm much more excited by scenarios like: 'a new podcast comes out that has top-tier-excellent discussion of AI alignment stuff, it becomes super popular among ML researchers, and the culture, norms, and expectations of ML thereby shift such that water-cooler conversations about AGI catastrophe are more serious, substantive, informed, candid, and frequent'.

It's rare for a big positive cultural shift like that to happen; but it does happen sometimes, and it can result in very fast changes to the Overton window. And since it's a podcast containing many hours of content, there's the potential to seed subsequent conversations with a lot of high-quality background thoughts.

To my eye, that seems more like the kind of change that might shift us from a current trajectory of "~definitely going to kill ourselves" to a new trajectory of "viable chance of an existential win".

Whereas warning shots feel more unpredictable to me, and even if they're helpful, I expect the helpfulness to at best look like "we were almost on track to win, and then the warning shot nudged us just enough to secure a win".

That feels to me like the kind of event that (if we get lucky and a lot of things go well) could shift us onto a winning trajectory. Another such event, obviously, would be some sort of technical breakthrough that makes alignment a lot easier.

"looking at nonhuman brains is useless because you could have a perfect understanding of a chimpanzee brain but still completely fail to predict human behavior (after a 'sharp left turn')."

Sounds too strong to me. If Nate or Eliezer thought that it would be totally useless to have a perfect understanding of how GPT-3, AlphaZero, and Minerva do their reasoning, then I expect that they'd just say that.

My Nate-model instead says things like:

  • Current transparency work mostly isn't trying to gain deep mastery of how GPT-3 etc. do their reasoning; and to the extent it's trying, it isn't making meaningful progress.

    ('Deep mastery of how this system does its reasoning' is the sort of thing that would let us roughly understand what thoughts a chimpanzee is internally thinking at a given time, verify that it's pursuing the right kinds of goals and thinking about all (and only) the right kinds of topics, etc.)
     
  • A lot of other alignment research isn't even trying to understand chimpanzee brains, or future human brains, or generalizations that might hold for both chimps and humans; it's just assuming there's no important future chimp-to-human transition it has to worry about.
     
  • Once we build the equivalent of 'humans', we won't have much time to align them before the tech proliferates and someone accidentally destroys the world. So even if the 'understand human cognition' problem turns out to be easier than the 'understand chimpanzee cognition' problem in a vacuum, the fact that it's a new problem and we have a lot less time to solve it makes it a lot harder in practice.

scientists who are trying to understand human brains do spend a lot (most?) of their time looking at nonhuman brains, no?

My sense is that this is mostly for ethics reasons, rather than representing a strong stance that animal models are the fastest way to make progress on understanding human cognition.

Whatever works for you!

For me, "guilty pleasure" is a worse tag because it encourages me to feel guilty, which is exactly what I don't want to encourage myself to do.

"I'm being so evil by listening to Barry Manilow" works well for me exactly because it's too ridiculous to take seriously, so it defuses guilt. I'm making light of the feel-guilty impulse, not just acknowledging it.

Yep. As the author of https://nothingismere.com/2014/12/03/chaos-altruism/ , I think of myself as an MtG red person who cosplays as Esper because I happen to have found myself in a world that has a lot of urgent WUB-shaped problems.

(The world has plenty of fiery outrage and impulsive action and going-with-the-flow, but not a lot of cold utilitarian calculus, principled integrity, rationalism, utopian humanism, and technical alignment research. If I don't want humanity's potential to be snuffed out, I need to prioritize filling those gaps over pure self-expression.)

The specific phrasing "ruthlessly do whatever you want all the time" sounds more MtG-black to me than MtG-red, but if I interpret it as MtG-red, I think I understand what it's trying to convey. :)

My basic reason for thinking "early rogue [AGI] will inevitably succeed in defeating us" is:

  • I think human intelligence is crap. E.g.:
    • Human STEM ability is an accidental side-effect — our brains underwent zero selection for the ability to do STEM in our EEA, and barely any selection to optimize this skill since the Scientific Revolution. We should expect that much more is possible when brains are deliberately optimized to be good at STEM.
    • There are many embarrassingly obvious glaring flaws in human reasoning.
    • One especially obvious example is "ability to think mathematically at all". This seems in many respects like a reasoning ability that's both relatively simple (it doesn't require grappling with the complexity of the physical world) and relatively core. Yet the average human can't even do trivial tasks like 'multiply two eight-digit numbers together in your head in under a second'. This gap on its own seems sufficient for AGI to blow humans out of the water.
    • (E.g., I expect there are innumerable scientific fields, subfields, technologies, etc. that are easy to find when you can hold a hundred complex mathematical structures in your 200 slots of working memory simultaneously and perceive connections between those structures. Many things are hard to do across a network of separated brains, calculators, etc. that are far easier to do within a single brain that can hold everything in view at once, understand the big picture, consciously think about many relationships at once, etc.)
    • Example: AlphaGo Zero. There was a one-year gap between 'the first time AI ever defeated a human professional' and 'the last time a human professional ever beat a SotA AI'. AlphaGo Zero in particular showed that 2500 years of human reasoning about Go was crap compared to what was pretty easy to do with 2017 hardware and techniques and ~72 hours of self-play. This isn't a proof that human intelligence is similarly crap in physical-data-dependent STEM work, or in other formal settings, but it seems like a strong hint.
  • I'd guess we already have a hardware overhang for running AGI. (Consider, e.g., that we don't need to recapitulate everything a human brain is doing in order to achieve AGI; indeed, I'd expect that capturing only a small fraction of what the human brain is doing would suffice for superhuman STEM reasoning. I expect that AGI will be invented in the future (i.e., we don't already have it), and that by then we'll have more than enough compute.)

I'd be curious to know (1) whether you disagree with these points, and (2) whether you disagree that these points are sufficient to predict that at least one early AGI system will be capable enough to defeat humans, if we don't succeed on the alignment problem.

(I usually think of "early AGI systems" as 'AGI systems built within five years of when humanity first starts building a system that could be deployed to do all the work human engineers can do in at least one hard science, if the developers were aiming at that goal'.)

It's strange to me that this is aimed at people who aren't aware that MIRI staffers are quite pessimistic about AGI risk.

It's not. It's mainly aimed at people who found it bizarre and hard to understand that Nate views AGI risk as highly disjunctive, even after reading all the disjunctive arguments in AGI Ruin. That is, the post is primarily for people who understand that MIRI folks are pessimistic, but don't understand where "it's disjunctive" is coming from.

Some added context for this list: Nate and Eliezer expect the first AGI developers to encounter many difficulties in the “something forces you to stop and redesign (and/or recode, and/or retrain) large parts of the system” category, with the result that alignment adds significant development time.

By default, safety-conscious groups won't be able to stabilize the game board before less safety-conscious groups race ahead and destroy the world. To avoid this outcome, humanity needs there to exist an AGI group that

  • is highly safety-conscious.
  • has a large resource advantage over the other groups, so that it can hope to reach AGI with more than a year of lead time — including accumulated capabilities ideas and approaches that it hasn’t been publishing.
  • has adequate closure and opsec practices, so that it doesn’t immediately lose its technical lead if it successfully acquires one.

The magnitude and variety of difficulties that are likely to arise in aligning the first AGI systems also suggest that failure is very likely if we try to align systems as opaque as current SotA systems; that an AGI developer likely needs to have spent the preceding years deliberately steering toward approaches to AGI that are relatively alignable; and that we need to up our game in general, approaching the problem in ways that are closer to the engineering norms at (for example) NASA than to the engineering norms that are standard in ML today.
