Founder of CEEALAR (née the EA Hotel;


A tale of 2.75 orthogonality theses

AlphaZero isn't smart enough (algorithmically speaking). From Human Compatible (p.207):

Life for AlphaGo during the training period must be quite frustrating: the better it gets, the better its opponent gets—because its opponent is a near-exact copy of itself. Its win percentage hovers around 50 percent, no matter how good it becomes. If it were more intelligent—if it had a design closer to what one might expect of a human-level AI system—it would be able to fix this problem. This AlphaGo++ would not assume that the world is just the Go board, because that hypothesis leaves a lot of things unexplained. For example, it doesn’t explain what “physics” is supporting the operation of AlphaGo++’s own decisions or where the mysterious “opponent moves” are coming from. Just as we curious humans have gradually come to understand the workings of our cosmos, in a way that (to some extent) also explains the workings of our own minds, and just like the Oracle AI discussed in Chapter 6, AlphaGo++ will, by a process of experimentation, learn that there is more to the universe than the Go board. It will work out the laws of operation of the computer it runs on and of its own code, and it will realize that such a system cannot easily be explained without the existence of other entities in the universe. It will experiment with different patterns of stones on the board, wondering if those entities can interpret them. It will eventually communicate with those entities through a language of patterns and persuade them to reprogram its reward signal so that it always gets +1. The inevitable conclusion is that a sufficiently capable AlphaGo++ that is designed as a reward-signal maximizer will wirehead.

From wireheading, it might then go on to grab resources, either to maximise the probability that it gets a +1 or to maximise the number of +1s it receives (e.g. by filling planet-sized memory banks with 1s). That said, it would already need a lot of power over humans to convince them to reprogram it just by sending messages via the Go board!

I don't think the examples of humans (Bezos/Witten) are that relevant, inasmuch as we are products of evolution: we are "adaptation executors" rather than "fitness maximisers", are imperfectly rational, and tend to be (broadly speaking) aligned/human-compatible by default.

Open Thread: Spring 2022

Would it be fair to say that Triplebyte is a similar thing for the software engineering industry?

A tale of 2.75 orthogonality theses

So by 'by default' I mean without any concerted effort to address existential risk from AI, or just following "business as usual" with AI development. Yes, Drexler's CAIS would be an example of this. But I'd argue that "just don't code AI as an unbounded optimiser" is very likely to fail due to mesa-optimisers and convergent instrumental goals emerging in sufficiently powerful systems.

Interesting you mention climate change, as I actually went from focusing on that pre-EA to now thinking that AGI is a much more severe, and more immediate, threat! (Although I also remain interested in other more "mundane" GCRs.)

A tale of 2.75 orthogonality theses

So the last thing Caplan says there is:

"1′. AIs have a non-trivial chance of being dangerously un-nice.

I do find this plausible, though only because many governments will create un-nice AIs on purpose."

Which to me sounds like he doesn't really get it. It's as if he's ignoring "by default does things we regard as harmful" (which he kind of agrees to above; he agrees with "2. Instrumental convergence"). You're right that the Orthogonality Thesis doesn't carry the argument on its own, but in conjunction with Instrumental Convergence (and, to be more complete, mesa-optimisation), I think it does.

It's a shame that Caplan doesn't reply to Yudkowsky's follow-up:

Bryan, would you say that you’re not worried about 1′ because:

1’a: You don’t think a paperclip maximizer is un-nice enough to be dangerous, even if it’s smarter than us.
1’b: You don’t think a paperclip maximizer of around human intelligence is un-nice enough to be dangerous, and you don’t foresee paperclip maximizers becoming much smarter than humans.
1’c: You don’t think that AGIs as un-nice as a paperclip maximizer are probable, unless those durned governments create AGIs that un-nice on purpose.

A tale of 2.75 orthogonality theses

Right, but I think "by default" is important here. Many more people seem to think alignment will happen by default (or at least something along the lines of us being able to muddle through, reasoning with the AI and convincing it to be good, or easily shutting it down if it's not, or something), rather than the opposite.

A tale of 2.75 orthogonality theses

The Orthogonality Thesis is useful to counter the common naive intuition that sufficiently intelligent AI will be benevolent by default (which a lot of smart people tend to hold prior to examining the arguments in any detail). But as Steven refers to above, it's only one component of the argument for taking AGI x-risk seriously, and Yudkowsky lists several others in that example. He leads with orthogonality to prime the pump, i.e. to emphasise that common human intuitions aren't useful here.

A tale of 2.75 orthogonality theses

Riffing on possible reasons to be hopeful, I recently compiled a list of potential "miracles" (including empirical "crucial considerations" [/wishful thinking]) that could mean the problem of AGI x-risk is bypassed:

  • Possibility of a failed (unaligned) takeoff scenario where the AI fails to model humans accurately enough (i.e. fails to realise that smart humans could detect its "hidden" activity in a certain way). [This may only set things back a few months to years; or it could lead to some kind of Butlerian Jihad if there is a sufficiently bad (but ultimately recoverable) global catastrophe (and then much more time for Alignment the second time around?)].
  • Valence realism being true. Binding problem vs AGI Alignment.
    • Omega experiencing every possible consciousness and picking the best? [Could still lead to x-risk in terms of a Hedonium Shockwave].
  • Moral Realism being true (and the AI discovering it and the true morality being human-compatible).
  • Natural abstractions leading to Alignment by Default?
  • Rohin’s links here.
  • AGI discovers new physics and exits to another dimension (like the creatures in Greg Egan’s Crystal Nights).
  • Simulation/anthropics stuff.
  • Alien Information Theory being true!? (And the aliens having solved alignment).

I don't think I put more than 10% probability on them collectively though, and my P(doom) is high enough to consider it "crunch time".

A tale of 2.75 orthogonality theses

Thought-provoking post, thanks. I think the Orthogonality Thesis (in its theoretical form - your "Motte") is useful to counter the common naive intuition that sufficiently intelligent AI will be benevolent by default (or at least open to being "reasoned with").

But (as Steven Byrnes says), it is just one component of the argument that AGI x-risk is a significant threat. Others include Goodhart's Law and the fragility of human values (what Stuart Russell refers to as the "King Midas problem"), Instrumental Convergence, Mesa-optimisation, and the second species argument (what Stuart Russell refers to as the "gorilla problem"); as well as differential technological development (capabilities research outstripping alignment research), arms races, and (lack of) global coordination amidst a rapid increase in available compute (increasing hardware overhang).

I guess you could argue that strong convergence on human-compatible values by default would make most of these concerns moot, but there is little to suggest that this is likely. Going through your "Some reasons to expect correlation", I think that 1-4 don't address the risks from mesa-optimisation and instrumental convergence (on seeking resources etc - think turning the world into "computronium"). In general, it seems that things have to be completely watertight in order to avoid x-risk. We might asymptote toward human-compatibility of ML systems, but all the doom flows through the gap between the curve and the axis. Making it completely watertight is an incredibly difficult challenge, especially as it needs to be done on the first try when deploying AGI.

5-8 are interesting: perhaps (to use your terms) if some form of moral exclusivism bottoming out to valence utilitarianism is true, and the superintelligence discovers it by default, we might be ok (but even then, your 9 may apply).

My thoughts on nanotechnology strategy research as an EA cause area

Just noting that the Bottleneck analysis report is written in the first person, but I can't see a name attached to it anywhere! Who is the author?

Preserving and continuing alignment research through a severe global catastrophe

[As posted in the Discord]. An MVP of this might be making offline copies of the AI Alignment Forum, EA Forum and LessWrong available using an app like Kiwix, and encouraging EAs to download them. Bonus if they are automatically updated every month or so. Next step for resilience would be burying old phones with copies of the content on them.
