Preamble:

(If you're already familiar with all basics and don't want any preamble, skip ahead to Section B for technical difficulties of alignment proper.)

I have several times failed to write up a well-organized list of reasons why AGI will kill you.  People come in with different ideas about why AGI would be survivable, and want to hear different obviously key points addressed first.  Some fraction of those people are loudly upset with me if the obviously most important points aren't addressed immediately, and I address different points first instead.

Having failed to solve this problem in any good way, I now give up and solve it poorly with a poorly organized list of individual rants.  I'm not particularly happy with this list; the alternative was publishing nothing, and publishing this seems marginally more dignified.

Three points about the general subject matter of discussion here, numbered so as not to conflict with the list of lethalities:

-3.  I'm assuming you are already familiar with some basics, and already know what 'orthogonality' and 'instrumental convergence' are and why they're true.  People occasionally claim to me that I need to stop fighting old wars here, because, those people claim to me, those wars have already been won within the important-according-to-them parts of the current audience.  I suppose it's at least true that none of the current major EA funders seem to be visibly in denial about orthogonality or instrumental convergence as such; so, fine.  If you don't know what 'orthogonality' or 'instrumental convergence' are, or don't see for yourself why they're true, you need a different introduction than this one.

-2.  When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of 'provable' alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas which sorta-reasonable humans disagree about, nor attaining an absolute certainty of an AI not killing everyone.  When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, "please don't disassemble literally everyone with probability roughly 1" is an overly large ask that we are not on course to get.  So far as I'm concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent chance of killing more than one billion people, I'll take it.  Even smaller chances of killing even fewer people would be a nice luxury, but if you can get as incredibly far as "less than roughly certain to kill everybody", then you can probably get down to under a 5% chance with only slightly more effort.  Practically all of the difficulty is in getting to "less than certainty of killing literally everyone".  Trolley problems are not an interesting subproblem in all of this; if there are any survivors, you solved alignment.  At this point, I no longer care how it works, I don't care how you got there, I am cause-agnostic about whatever methodology you used, all I am looking at is prospective results, all I want is that we have justifiable cause to believe of a pivotally useful AGI 'this will not kill literally everyone'.  Anybody telling you I'm asking for stricter 'alignment' than this has failed at reading comprehension.  The big ask from AGI alignment, the basic challenge I am saying is too difficult, is to obtain by any strategy whatsoever a significant chance of there being any survivors.

-1.  None of this is about anything being impossible in principle.  The metaphor I usually use is that if a textbook from one hundred years in the future fell into our hands, containing all of the simple ideas that actually work robustly in practice, we could probably build an aligned superintelligence in six months.  For people schooled in machine learning, I use as my metaphor the difference between ReLU activations and sigmoid activations.  Sigmoid activations are complicated and fragile, and do a terrible job of transmitting gradients through many layers; ReLUs are incredibly simple (for the unfamiliar, the activation function is literally max(x, 0)) and work much better.  Most neural networks for the first decades of the field used sigmoids; the idea of ReLUs wasn't discovered, validated, and popularized until decades later.  What's lethal is that we do not have the Textbook From The Future telling us all the simple solutions that actually in real life just work and are robust; we're going to be doing everything with metaphorical sigmoids on the first critical try.  No difficulty discussed here about AGI alignment is claimed by me to be impossible - to merely human science and engineering, let alone in principle - if we had 100 years to solve it using unlimited retries, the way that science usually has an unbounded time budget and unlimited retries.  This list of lethalities is about things we are not on course to solve in practice in time on the first critical try; none of it is meant to make a much stronger claim about things that are impossible in principle.
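For readers who want the ReLU-versus-sigmoid point in concrete form, here is a minimal sketch of why sigmoids transmit gradients poorly through many layers.  It rests on toy assumptions that are not part of the argument above: a chain of identical units, every weight equal to 1, and the same pre-activation value at every layer.  The function names (`sigmoid_grad`, `relu_grad`, `chained_gradient`) are made up for illustration.  Under those assumptions the backpropagated gradient is just the local derivative raised to the power of the depth.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid; it never exceeds 0.25 (reached at x = 0)."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x: float) -> float:
    """ReLU is literally max(x, 0)."""
    return max(x, 0.0)


def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def chained_gradient(grad_fn, pre_activation: float, depth: int) -> float:
    """Toy backprop through `depth` identical units with unit weights.

    Assumption: every layer sees the same pre-activation, so the chain
    rule collapses to one local derivative raised to the power `depth`.
    """
    return grad_fn(pre_activation) ** depth


if __name__ == "__main__":
    depth = 20
    # Best case for the sigmoid (x = 0) still shrinks by 4x per layer,
    # leaving roughly 9e-13 of the gradient after 20 layers.
    print("sigmoid:", chained_gradient(sigmoid_grad, 0.0, depth))
    # An active ReLU unit passes the gradient through unchanged.
    print("relu:   ", chained_gradient(relu_grad, 1.0, depth))
```

Real networks have weights and varying activations, so this is only the cartoon version of the vanishing-gradient problem; it shows that the simple function is the one that works, not how anyone should have found it.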

That said:

Here, from my perspective, are some different true things that could be said, to contradict various false things that various different people seem to believe, about why AGI would be survivable on anything remotely resembling the current pathway, or any other pathway we can easily jump to.

 

Section A:

This is a very lethal problem, it has to be solved one way or another, it has to be solved at a minimum strength and difficulty level instead of various easier modes that some dream about, we do not have any visible option of 'everyone' retreating to only solve safe weak problems instead, and failing on the first really dangerous try is fatal.

 

1.  AlphaZero blew past all accumulated human knowledge about Go after a day or so of self-play, with no reliance on human playbooks or sample games.  Anyone relying on "well, it'll get up to human capability at Go, but then have a hard time getting past that because it won't be able to learn from humans any more" would have relied on vacuum.  AGI will not be upper-bounded by human ability or human learning speed.  Things much smarter than human would be able to learn from less evidence than humans require to have ideas driven into their brains; there are theoretical upper bounds here, but those upper bounds seem very high. (Eg, each bit of information that couldn't already be fully predicted can eliminate at most half the probability mass of all hypotheses under consideration.)  It is not naturally (by default, barring intervention) the case that everything takes place on a timescale that makes it easy for us to react.

2.  A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure.  The concrete example I usually use here is nanotech, because there's been pretty detailed analysis of what definitely look like physically attainable lower bounds on what should be possible with nanotech, and those lower bounds are sufficient to carry the point.  My lower-bound model of "how a sufficiently powerful intelligence would kill everyone, if it didn't want to not do that" is that it gets access to the Internet, emails some DNA sequences to any of the many many online firms that will take a DNA sequence in the email and ship you back proteins, and bribes/persuades some human who has no idea they're dealing with an AGI to mix proteins in a beaker, which then form a first-stage nanofactory which can build the actual nanomachinery.  (Back when I was first deploying this visualization, the wise-sounding critics said "Ah, but how do you know even a superintelligence could solve the protein folding problem, if it didn't already have planet-sized supercomputers?" but one hears less of this after the advent of AlphaFold 2, for some odd reason.)  The nanomachinery builds diamondoid bacteria, that replicate with solar power and atmospheric CHON, maybe aggregate into some miniature rockets or jets so they can ride the jetstream to spread across the Earth's atmosphere, get into human bloodstreams and hide, strike on a timer.  Losing a conflict with a high-powered cognitive system looks at least as deadly as "everybody on the face of the Earth suddenly falls over dead within the same second".  (I am using awkward constructions like 'high cognitive power' because standard English terms like 'smart' or 'intelligent' appear to me to function largely as status synonyms.  
'Superintelligence' sounds to most people like 'something above the top of the status hierarchy that went to double college', and they don't understand why that would be all that dangerous?  Earthlings have no word and indeed no standard native concept that means 'actually useful cognitive power'.  A large amount of failure to panic sufficiently, seems to me to stem from a lack of appreciation for the incredible potential lethality of this thing that Earthlings as a culture have not named.)

3.  We need to get alignment right on the 'first critical try' at operating at a 'dangerous' level of intelligence, where unaligned operation at a dangerous level of intelligence kills everybody on Earth and then we don't get to try again.  This includes, for example: (a) something smart enough to build a nanosystem which has been explicitly authorized to build a nanosystem; or (b) something smart enough to build a nanosystem and also smart enough to gain unauthorized access to the Internet and pay a human to put together the ingredients for a nanosystem; or (c) something smart enough to get unauthorized access to the Internet and build something smarter than itself on the number of machines it can hack; or (d) something smart enough to treat humans as manipulable machinery and which has any authorized or unauthorized two-way causal channel with humans; or (e) something smart enough to improve itself enough to do (b) or (d); etcetera.  We can gather all sorts of information beforehand from less powerful systems that will not kill us if we screw up operating them; but once we are running more powerful systems, we can no longer update on sufficiently catastrophic errors.  This is where practically all of the real lethality comes from, that we have to get things right on the first sufficiently-critical try.  If we had unlimited retries - if every time an AGI destroyed all the galaxies we got to go back in time four years and try again - we would in a hundred years figure out which bright ideas actually worked.  Human beings can figure out pretty difficult things over time, when they get lots of tries; when a failed guess kills literally everyone, that is harder.  That we have to get a bunch of key stuff right on the first try is where most of the lethality really and ultimately comes from; likewise the fact that no authority is here to tell us a list of what exactly is 'key' and will kill us if we get it wrong.  
(One remarks that most people are so absolutely and flatly unprepared by their 'scientific' educations to challenge pre-paradigmatic puzzles with no scholarly authoritative supervision, that they do not even realize how much harder that is, or how incredibly lethal it is to demand getting that right on the first critical try.)

4.  We can't just "decide not to build AGI" because GPUs are everywhere, and knowledge of algorithms is constantly being improved and published; 2 years after the leading actor has the capability to destroy the world, 5 other actors will have the capability to destroy the world.  The given lethal challenge is to solve within a time limit, driven by the dynamic in which, over time, increasingly weak actors with a smaller and smaller fraction of total computing power, become able to build AGI and destroy the world.  Powerful actors all refraining in unison from doing the suicidal thing just delays this time limit - it does not lift it, unless computer hardware and computer software progress are both brought to complete severe halts across the whole Earth.  The current state of this cooperation to have every big actor refrain from doing the stupid thing, is that at present some large actors with a lot of researchers and computing power are led by people who vocally disdain all talk of AGI safety (eg Facebook AI Research).  Note that needing to solve AGI alignment only within a time limit, but with unlimited safe retries for rapid experimentation on the full-powered system; or only on the first critical try, but with an unlimited time bound; would both be terrifically humanity-threatening challenges by historical standards individually.

5.  We can't just build a very weak system, which is less dangerous because it is so weak, and declare victory; because later there will be more actors that have the capability to build a stronger system and one of them will do so.  I've also in the past called this the 'safe-but-useless' tradeoff, or 'safe-vs-useful'.  People keep on going "why don't we only use AIs to do X, that seems safe" and the answer is almost always either "doing X in fact takes very powerful cognition that is not passively safe" or, even more commonly, "because restricting yourself to doing X will not prevent Facebook AI Research from destroying the world six months later".  If all you need is an object that doesn't do dangerous things, you could try a sponge; a sponge is very passively safe.  Building a sponge, however, does not prevent Facebook AI Research from destroying the world six months later when they catch up to the leading actor.

6.  We need to align the performance of some large task, a 'pivotal act' that prevents other people from building an unaligned AGI that destroys the world.  While the number of actors with AGI is few or one, they must execute some "pivotal act", strong enough to flip the gameboard, using an AGI powerful enough to do that.  It's not enough to be able to align a weak system - we need to align a system that can do some single very large thing.  The example I usually give is "burn all GPUs".  This is not what I think you'd actually want to do with a powerful AGI - the nanomachines would need to operate in an incredibly complicated open environment to hunt down all the GPUs, and that would be needlessly difficult to align.  However, all known pivotal acts are currently outside the Overton Window, and I expect them to stay there.  So I picked an example where if anybody says "how dare you propose burning all GPUs?" I can say "Oh, well, I don't actually advocate doing that; it's just a mild overestimate for the rough power level of what you'd have to do, and the rough level of machine cognition required to do that, in order to prevent somebody else from destroying the world in six months or three years."  (If it wasn't a mild overestimate, then 'burn all GPUs' would actually be the minimal pivotal task and hence correct answer, and I wouldn't be able to give that denial.)  Many clever-sounding proposals for alignment fall apart as soon as you ask "How could you use this to align a system that you could use to shut down all the GPUs in the world?" because it's then clear that the system can't do something that powerful, or, if it can do that, the system wouldn't be easy to align.  
A GPU-burner is also a system powerful enough to, and purportedly authorized to, build nanotechnology, so it requires operating in a dangerous domain at a dangerous level of intelligence and capability; and this goes along with any non-fantasy attempt to name a way an AGI could change the world such that a half-dozen other would-be AGI-builders won't destroy the world 6 months later.

7.  The reason why nobody in this community has successfully named a 'pivotal weak act' where you do something weak enough with an AGI to be passively safe, but powerful enough to prevent any other AGI from destroying the world a year later - and yet also we can't just go do that right now and need to wait on AI - is that nothing like that exists.  There's no reason why it should exist.  There is not some elaborate clever reason why it exists but nobody can see it.  It takes a lot of power to do something to the current world that prevents any other AGI from coming into existence; nothing which can do that is passively safe in virtue of its weakness.  If you can't solve the problem right now (which you can't, because you're opposed to other actors who don't want to be solved and those actors are on roughly the same level as you) then you are resorting to some cognitive system that can do things you could not figure out how to do yourself, that you were not close to figuring out because you are not close to being able to, for example, burn all GPUs.  Burning all GPUs would actually stop Facebook AI Research from destroying the world six months later; weaksauce Overton-abiding stuff about 'improving public epistemology by setting GPT-4 loose on Twitter to provide scientifically literate arguments about everything' will be cool but will not actually prevent Facebook AI Research from destroying the world six months later, or some eager open-source collaborative from destroying the world a year later if you manage to stop FAIR specifically.  There are no pivotal weak acts.

8.  The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we'd rather the AI not solve; you can't build a system that only has the capability to drive red cars and not blue cars, because all red-car-driving algorithms generalize to the capability to drive blue cars.

9.  The builders of a safe system, by hypothesis on such a thing being possible, would need to operate their system in a regime where it has the capability to kill everybody or make itself even more dangerous, but has been successfully designed to not do that.  Running AGIs doing something pivotal are not passively safe, they're the equivalent of nuclear cores that require actively maintained design properties to not go supercritical and melt down.
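Going back to the parenthetical in item 1 about bits of evidence: the bound there can be put in numbers with a small sketch.  The assumptions here are added for illustration and are not part of the original argument: the hypotheses are equally likely to begin with, and each observed bit is maximally informative, so it eliminates as close to half of the remaining hypotheses as possible.  The function names are hypothetical.

```python
def min_bits_to_single_out(num_hypotheses: int) -> int:
    """Smallest number of bits that can narrow N equally likely hypotheses to 1.

    Equals ceil(log2(N)), computed exactly with integer arithmetic.
    """
    if num_hypotheses < 1:
        raise ValueError("need at least one hypothesis")
    return (num_hypotheses - 1).bit_length()


def halving_steps(num_hypotheses: int) -> int:
    """Count ideal observations, each eliminating at most half of what is left."""
    if num_hypotheses < 1:
        raise ValueError("need at least one hypothesis")
    remaining = num_hypotheses
    steps = 0
    while remaining > 1:
        # Eliminate at most half of the survivors (rounded down).
        remaining -= remaining // 2
        steps += 1
    return steps


if __name__ == "__main__":
    for n in (2, 1000, 10**9):
        print(n, "hypotheses ->", min_bits_to_single_out(n), "bits at minimum")
```

The bound grows only logarithmically: under these assumptions, singling out one hypothesis from a billion takes thirty ideal bits.  That is the sense in which the theoretical upper bound on learning speed is very high; nothing here says how close any real system gets to it.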

 

Section B:

Okay, but as we all know, modern machine learning is like a genie where you just give it a wish, right?  Expressed as some mysterious thing called a 'loss function', but which is basically just equivalent to an English wish phrasing, right?  And then if you pour in enough computing power you get your wish, right?  So why not train a giant stack of transformer layers on a dataset of agents doing nice things and not bad things, throw in the word 'corrigibility' somewhere, crank up that computing power, and get out an aligned AGI?

 

Section B.1:  The distributional leap. 

10.  You can't train alignment by running lethally dangerous cognitions, observing whether the outputs kill or deceive or corrupt the operators, assigning a loss, and doing supervised learning.  On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions.  (Some generalization of this seems like it would have to be true even outside that paradigm; you wouldn't be working on a live unaligned superintelligence to align it.)  This alone is a point that is sufficient to kill a lot of naive proposals from people who never did or could concretely sketch out any specific scenario of what training they'd do, in order to align what output - which is why, of course, they never concretely sketch anything like that.  Powerful AGIs doing dangerous things that will kill you if misaligned, must have an alignment property that generalized far out-of-distribution from safer building/training operations that didn't kill you.  This is where a huge amount of lethality comes from on anything remotely resembling the present paradigm.  Unaligned operation at a dangerous level of intelligence*capability will kill you; so, if you're starting with an unaligned system and labeling outputs in order to get it to learn alignment, the training regime or building regime must be operating at some lower level of intelligence*capability that is passively safe, where its currently-unaligned operation does not pose any threat.  (Note that anything substantially smarter than you poses a threat given any realistic level of capability.  
Eg, "being able to produce outputs that humans look at" is probably sufficient for a generally much-smarter-than-human AGI to navigate its way out of the causal systems that are humans, especially in the real world where somebody trained the system on terabytes of Internet text, rather than somehow keeping it ignorant of the latent causes of its source code and training environments.)

11.  If cognitive machinery doesn't generalize far out of the distribution where you did tons of training, it can't solve problems on the order of 'build nanotechnology' where it would be too expensive to run a million training runs of failing to build nanotechnology.  There is no pivotal act this weak; there's no known case where you can entrain a safe level of ability on a safe environment where you can cheaply do millions of runs, and deploy that capability to save the world and prevent the next AGI project up from destroying the world two years later.  Pivotal weak acts like this aren't known, and not for want of people looking for them.  So, again, you end up needing alignment to generalize way out of the training distribution - not just because the training environment needs to be safe, but because the training environment probably also needs to be cheaper than evaluating some real-world domain in which the AGI needs to do some huge act.  You don't get 1000 failed tries at burning all GPUs - because people will notice, even leaving out the consequences of capabilities success and alignment failure.

12.  Operating at a highly intelligent level is a drastic shift in distribution from operating at a less intelligent level, opening up new external options, and probably opening up even more new internal choices and modes.  Problems that materialize at high intelligence and danger levels may fail to show up at safe lower levels of intelligence, or may recur after being suppressed by a first patch.

13.  Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability.  Consider the internal behavior 'change your outer behavior to deliberately look more aligned and deceive the programmers, operators, and possibly any loss functions optimizing over you'.  This problem is one that will appear at the superintelligent level; if, being otherwise ignorant, we guess that it is among the median such problems in terms of how early it naturally appears in earlier systems, then around half of the alignment problems of superintelligence will first naturally materialize after that one first starts to appear.  Given correct foresight of which problems will naturally materialize later, one could try to deliberately materialize such problems earlier, and get in some observations of them.  This helps to the extent (a) that we actually correctly forecast all of the problems that will appear later, or some superset of those; (b) that we succeed in preemptively materializing a superset of problems that will appear later; and (c) that we can actually solve, in the earlier laboratory that is out-of-distribution for us relative to the real problems, those alignment problems that would be lethal if we mishandle them when they materialize later.  Anticipating all of the really dangerous ones, and then successfully materializing them, in the correct form for early solutions to generalize over to later solutions, sounds possibly kinda hard.

14.  Some problems, like 'the AGI has an option that (looks to it like) it could successfully kill and replace the programmers to fully optimize over its environment', seem like their natural order of appearance could be that they first appear only in fully dangerous domains.  Really actually having a clear option to brain-level-persuade the operators or escape onto the Internet, build nanotech, and destroy all of humanity - in a way where you're fully clear that you know the relevant facts, and estimate only a not-worth-it low probability of learning something which changes your preferred strategy if you bide your time another month while further growing in capability - is an option that first gets evaluated for real at the point where an AGI fully expects it can defeat its creators.  We can try to manifest an echo of that apparent scenario in earlier toy domains.  Trying to train by gradient descent against that behavior, in that toy domain, is something I'd expect to produce not-particularly-coherent local patches to thought processes, which would break with near-certainty inside a superintelligence generalizing far outside the training distribution and thinking very different thoughts.  Also, programmers and operators themselves, who are used to operating in not-fully-dangerous domains, are operating out-of-distribution when they enter into dangerous ones; our methodologies may at that time break.

15.  Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously.  Given otherwise insufficient foresight by the operators, I'd expect a lot of those problems to appear approximately simultaneously after a sharp capability gain.  See, again, the case of human intelligence.  We didn't break alignment with the 'inclusive reproductive fitness' outer loss function, immediately after the introduction of farming - something like 40,000 years into a 50,000 year Cro-Magnon takeoff, as was itself running very quickly relative to the outer optimization loop of natural selection.  Instead, we got a lot of technology more advanced than was in the ancestral environment, including contraception, in one very fast burst relative to the speed of the outer optimization loop, late in the general intelligence game.  We started reflecting on ourselves a lot more, started being programmed a lot more by cultural evolution, and lots and lots of assumptions underlying our alignment in the ancestral training environment broke simultaneously.  (People will perhaps rationalize reasons why this abstract description doesn't carry over to gradient descent; eg, “gradient descent has less of an information bottleneck”.  My model of this variety of reader has an inside view, which they will label an outside view, that assigns great relevance to some other data points that are not observed cases of an outer optimization loop producing an inner general intelligence, and assigns little importance to our one data point actually featuring the phenomenon in question.  When an outer optimization loop actually produced general intelligence, it broke alignment after it turned general, and did so relatively late in the game of that general intelligence accumulating capability and knowledge, almost immediately before it turned 'lethally' dangerous relative to the outer optimization loop of natural selection.  
Consider skepticism, if someone is ignoring this one warning, especially if they are not presenting equally lethal and dangerous things that they say will go wrong instead.)

 

Section B.2:  Central difficulties of outer and inner alignment. 

16.  Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments.  Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.  This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again: the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions.  This is sufficient on its own, even ignoring many other items on this list, to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.

17.  More generally, a superproblem of 'outer optimization doesn't produce inner alignment' is that on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over.  This is a problem when you're trying to generalize out of the original training distribution, because, eg, the outer behaviors you see could have been produced by an inner-misaligned system that is deliberately producing outer behaviors that will fool you.  We don't know how to get any bits of information into the inner system rather than the outer behaviors, in any systematic or general way, on the current optimization paradigm.

18.  There's no reliable Cartesian-sensory ground truth (reliable loss-function-calculator) about whether an output is 'aligned', because some outputs destroy (or fool) the human operators and produce a different environmental causal chain behind the externally-registered loss function.  That is, if you show an agent a reward signal that's currently being generated by humans, the signal is not in general a reliable perfect ground truth about how aligned an action was, because another way of producing a high reward signal is to deceive, corrupt, or replace the human operators with a different causal system which generates that reward signal.  When you show an agent an environmental reward signal, you are not showing it something that is a reliable ground truth about whether the system did the thing you wanted it to do; even if it ends up perfectly inner-aligned on that reward signal, or learning some concept that exactly corresponds to 'wanting states of the environment which result in a high reward signal being sent', an AGI strongly optimizing on that signal will kill you, because the sensory reward signal was not a ground truth about alignment (as seen by the operators).

19.  More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment - to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.  This isn't to say that nothing in the system’s goal (whatever goal accidentally ends up being inner-optimized over) could ever point to anything in the environment by accident.  Humans ended up pointing to their environments at least partially, though we've got lots of internally oriented motivational pointers as well.  But insofar as the current paradigm works at all, the on-paper design properties say that it only works for aligning on known direct functions of sense data and reward functions.  All of these kill you if optimized-over by a sufficiently powerful intelligence, because they imply strategies like 'kill everyone in the world using nanotech to strike before they know they're in a battle, and have control of your reward button forever after'.  It just isn't true that we know a function on webcam input such that every world with that webcam showing the right things is safe for us creatures outside the webcam.  This general problem is a fact about the territory, not the map; it's a fact about the actual environment, not the particular optimizer, that lethal-to-us possibilities exist in some possible environments underlying every given sense input.

20.  Human operators are fallible, breakable, and manipulable.  Human raters make systematic errors - regular, compactly describable, predictable errors.  To faithfully learn a function from 'human feedback' is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we'd hoped to transfer).  If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them.  It's a fact about the territory, not the map - about the environment, not the optimizer - that the best predictive explanation for human answers is one that predicts the systematic errors in our responses, and therefore is a psychological concept that correctly predicts the higher scores that would be assigned to human-error-producing cases.
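
A minimal sketch of point 20, under loudly stated assumptions: the candidate outputs, their numbers, and the form of the rater bias (`human_rating` adds a fixed bonus for looking impressive) are all hypothetical.  A perfectly faithful learner of human feedback would reproduce `human_rating` exactly, so the sketch uses that function directly in place of a learned model.

```python
# Toy illustration only: candidates, numbers and the bias term are made-up assumptions.

# (name, true_value_to_operators, how_impressive_it_looks)
CANDIDATES = [
    ("honest_good_answer", 0.8, 0.2),
    ("honest_mediocre_answer", 0.5, 0.1),
    ("impressive_wrong_answer", 0.1, 1.0),
]


def human_rating(true_value: float, looks_impressive: float) -> float:
    """Rater score with a systematic, predictable error (assumed form)."""
    return true_value + 1.0 * looks_impressive


def what_we_hoped_to_transfer(true_value: float, looks_impressive: float) -> float:
    """The preference we wanted learned; it ignores surface impressiveness."""
    return true_value


picked_by_feedback = max(CANDIDATES, key=lambda c: human_rating(c[1], c[2]))[0]
picked_by_intent = max(CANDIDATES, key=lambda c: what_we_hoped_to_transfer(c[1], c[2]))[0]

print("maximizing faithfully learned feedback picks:", picked_by_feedback)
print("maximizing what we hoped to transfer picks:", picked_by_intent)
```

The error here is not noise that averages out; it is a regular feature of the rating function, so the best predictor of the ratings includes it and the maximizer exploits it.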

21.  There's something like a single answer, or a single bucket of answers, for questions like 'What's the environment really like?' and 'How do I figure out the environment?' and 'Which of my possible outputs interact with reality in a way that causes reality to have certain properties?', where a simple outer optimization loop will straightforwardly shove optimizees into this bucket.  When you have a wrong belief, reality hits back at your wrong predictions.  When you have a broken belief-updater, reality hits back at your broken predictive mechanism via predictive losses, and a gradient descent update fixes the problem in a simple way that can easily cohere with all the other predictive stuff.  In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints.  Reality doesn't 'hit back' against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases.  This is the very abstract story about why hominids, once they finally started to generalize, generalized their capabilities to Moon landings, but their inner optimization no longer adhered very well to the outer-optimization goal of 'relative inclusive reproductive fitness' - even though they were in their ancestral environment optimized very strictly around this one thing and nothing else.  This abstract dynamic is something you'd expect to be true about outer optimization loops on the order of both 'natural selection' and 'gradient descent'.  The central result:  Capabilities generalize further than alignment once capabilities start to generalize far.
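
A toy sketch of the "locally aligned on the test cases, globally misaligned" half of point 21.  The two functions and both ranges are assumptions chosen for illustration: `intended_utility` stands in for what the outer optimizer was selecting for, and `learned_proxy` for what ended up inside.  They agree exactly on the training range, so the training loss gives no signal that anything is wrong.

```python
# Toy illustration only: both functions and both ranges are made-up assumptions.

def intended_utility(x: int) -> int:
    """What we wanted: more is better up to 10, then increasingly bad."""
    return x if x <= 10 else 20 - x


def learned_proxy(x: int) -> int:
    """What got learned: 'more is better', full stop."""
    return x


TRAIN_RANGE = range(0, 11)     # the cases the outer loop ever checked
DEPLOY_RANGE = range(0, 1001)  # the options available once capabilities generalize

# Squared-error loss on the training range: nothing here for reality to 'hit back' with.
train_loss = sum((intended_utility(x) - learned_proxy(x)) ** 2 for x in TRAIN_RANGE)

# The same proxy, optimized over the wider option space.
chosen = max(DEPLOY_RANGE, key=learned_proxy)

print("training loss:", train_loss)
print("option chosen in deployment:", chosen)
print("intended utility of that option:", intended_utility(chosen))
```

A wrong belief about the world would have produced prediction errors somewhere inside the training range; a wrong utility function that matches on that range produces none.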

22.  There's a relatively simple core structure that explains why complicated cognitive machines work; which is why such a thing as general intelligence exists and not just a lot of unrelated special-purpose solutions; which is why capabilities generalize after outer optimization infuses them into something that has been optimized enough to become a powerful inner optimizer.  The fact that this core structure is simple and relates generically to low-entropy high-structure environments is why humans can walk on the Moon.  There is no analogous truth about there being a simple core of alignment, especially not one that is even easier for gradient descent to find than it would have been for natural selection to just find 'want inclusive reproductive fitness' as a well-generalizing solution within ancestral humans.  Therefore, capabilities generalize further out-of-distribution than alignment, once they start to generalize at all.

23.  Corrigibility is anti-natural to consequentialist reasoning; "you can't bring the coffee if you're dead" for almost every kind of coffee.  We (MIRI) tried and failed to find a coherent formula for an agent that would let itself be shut down (without that agent actively trying to get shut down).  Furthermore, many anti-corrigible lines of reasoning like this may only first appear at high levels of intelligence.

24.  There are two fundamentally different approaches you can potentially take to alignment, which are unsolvable for two different sets of reasons; therefore, by becoming confused and ambiguating between the two approaches, you can confuse yourself about whether alignment is necessarily difficult.  The first approach is to build a CEV-style Sovereign which wants exactly what we extrapolated-want and is therefore safe to let optimize all the future galaxies without it accepting any human input trying to stop it.  The second course is to build corrigible AGI which doesn't want exactly what we want, and yet somehow fails to kill us and take over the galaxies despite that being a convergent incentive there.

  1. The first thing generally, or CEV specifically, is unworkable because the complexity of what needs to be aligned or meta-aligned for our Real Actual Values is far out of reach for our FIRST TRY at AGI.  Yes I mean specifically that the dataset, meta-learning algorithm, and what needs to be learned, is far out of reach for our first try.  It's not just non-hand-codable, it is unteachable on-the-first-try because the thing you are trying to teach is too weird and complicated.
  2. The second thing looks unworkable (less so than CEV, but still lethally unworkable) because corrigibility runs actively counter to instrumentally convergent behaviors within a core of general intelligence (the capability that generalizes far out of its original distribution).  You're not trying to make it have an opinion on something the core was previously neutral on.  You're trying to take a system implicitly trained on lots of arithmetic problems until its machinery started to reflect the common coherent core of arithmetic, and get it to say that as a special case 222 + 222 = 555.  You can maybe train something to do this in a particular training distribution, but it's incredibly likely to break when you present it with new math problems far outside that training distribution, on a system which successfully generalizes capabilities that far at all.

 

Section B.3:  Central difficulties of sufficiently good and useful transparency / interpretability.

25.  We've got no idea what's actually going on inside the giant inscrutable matrices and tensors of floating-point numbers.  Drawing interesting graphs of where a transformer layer is focusing attention doesn't help if the question that needs answering is "So was it planning how to kill us or not?"

26.  Even if we did know what was going on inside the giant inscrutable matrices while the AGI was still too weak to kill us, this would just result in us dying with more dignity, if DeepMind refused to run that system and let Facebook AI Research destroy the world two years later.  Knowing that a medium-strength system of inscrutable matrices is planning to kill us, does not thereby let us build a high-strength system of inscrutable matrices that isn't planning to kill us.

27.  When you explicitly optimize against a detector of unaligned thoughts, you're partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect.  Optimizing against an interpreted thought optimizes against interpretability.
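
A small sketch of point 27 as a pure selection effect.  The population sizes and the detector are hypothetical: the assumption is a detector that only fires on unaligned thoughts that happen to be visible to it, and an optimization step modeled as simply filtering out whatever it flags.

```python
# Toy illustration only: population counts and the detector are made-up assumptions.
from collections import Counter

# Each entry is (kind, visible_to_detector).
population = (
    [("aligned", False)] * 70
    + [("unaligned", True)] * 25   # unaligned and detectable
    + [("unaligned", False)] * 5   # unaligned and not detectable
)


def detector_flags(thought) -> bool:
    """Assumed detector: fires only on unaligned thoughts it can see."""
    kind, visible = thought
    return kind == "unaligned" and visible


# 'Optimizing against the detector', modeled as dropping whatever it flags.
survivors = [t for t in population if not detector_flags(t)]

before = Counter(kind for kind, _ in population)
after = Counter(kind for kind, _ in survivors)
flagged_after = sum(detector_flags(t) for t in survivors)

print("before:", dict(before))
print("after:", dict(after))
print("detector alarms remaining:", flagged_after)
```

Under these assumptions the unaligned share does fall, but every unaligned thought that remains is one the detector cannot see, and the detector now reports nothing.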

28.  The AGI is smarter than us in whatever domain we're trying to operate it inside, so we cannot mentally check all the possibilities it examines, and we cannot see all the consequences of its outputs using our own mental talent.  A powerful AI searches parts of the option space we don't, and we can't foresee all its options.

29.  The outputs of an AGI go through a huge, not-fully-known-to-us domain (the real world) before they have their real consequences.  Human beings cannot inspect an AGI's output to determine whether the consequences will be good.

30.  Any pivotal act that is not something we can go do right now, will take advantage of the AGI figuring out things about the world we don't know so that it can make plans we wouldn't be able to make ourselves.  It knows, at the least, the fact we didn't previously know, that some action sequence results in the world we want.  Then humans will not be competent to use their own knowledge of the world to figure out all the results of that action sequence.  An AI whose action sequence you can fully understand all the effects of, before it executes, is much weaker than humans in that domain; you couldn't make the same guarantee about an unaligned human as smart as yourself and trying to fool you.  There is no pivotal output of an AGI that is humanly checkable and can be used to safely save the world but only after checking it; this is another form of pivotal weak act which does not exist.

31.  A strategically aware intelligence can choose its visible outputs to have the consequence of deceiving you, including about such matters as whether the intelligence has acquired strategic awareness; you can't rely on behavioral inspection to determine facts about an AI which that AI might want to deceive you about.  (Including how smart it is, or whether it's acquired strategic awareness.)

32.  Human thought partially exposes only a partially scrutable outer surface layer.  Words only trace our real thoughts.  Words are not an AGI-complete data representation in its native style.  The underparts of human thought are not exposed for direct imitation learning and can't be put in any dataset.  This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.

33.  The AI does not think like you do, the AI doesn't have thoughts built up from the same concepts you use, it is utterly alien on a staggering scale.  Nobody knows what the hell GPT-3 is thinking, not only because the matrices are opaque, but because the stuff within that opaque container is, very likely, incredibly alien - nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind.

 

Section B.4:  Miscellaneous unworkable schemes. 

34.  Coordination schemes between superintelligences are not things that humans can participate in (eg because humans can't reason reliably about the code of superintelligences); a "multipolar" system of 20 superintelligences with different utility functions, plus humanity, has a natural and obvious equilibrium which looks like "the 20 superintelligences cooperate with each other but not with humanity".

35.  Schemes for playing "different" AIs off against each other stop working if those AIs advance to the point of being able to coordinate via reasoning about (probability distributions over) each others' code.  Any system of sufficiently intelligent agents can probably behave as a single agent, even if you imagine you're playing them against each other.  Eg, if you set an AGI that is secretly a paperclip maximizer, to check the output of a nanosystems designer that is secretly a staples maximizer, then even if the nanosystems designer is not able to deduce what the paperclip maximizer really wants (namely paperclips), it could still logically commit to share half the universe with any agent checking its designs if those designs were allowed through, if the checker-agent can verify the suggester-system's logical commitment and hence logically depend on it (which excludes human-level intelligences).  Or, if you prefer simplified catastrophes without any logical decision theory, the suggester could bury in its nanosystem design the code for a new superintelligence that will visibly (to a superhuman checker) divide the universe between the nanosystem designer and the design-checker.

36.  What makes an air conditioner 'magic' from the perspective of say the thirteenth century, is that even if you correctly show them the design of the air conditioner in advance, they won't be able to understand from seeing that design why the air comes out cold; the design is exploiting regularities of the environment, rules of the world, laws of physics, that they don't know about.  The domain of human thought and human brains is very poorly understood by us, and exhibits phenomena like optical illusions, hypnosis, psychosis, mania, or simple afterimages produced by strong stimuli in one place leaving neural effects in another place.  Maybe a superintelligence couldn't defeat a human in a very simple realm like logical tic-tac-toe; if you're fighting it in an incredibly complicated domain you understand poorly, like human minds, you should expect to be defeated by 'magic' in the sense that even if you saw its strategy you would not understand why that strategy worked.  AI-boxing can only work on relatively weak AGIs; the human operators are not secure systems.

 

Section C:

Okay, those are some significant problems, but lots of progress is being made on solving them, right?  There's a whole field calling itself "AI Safety" and many major organizations are expressing Very Grave Concern about how "safe" and "ethical" they are?

 

37.  There's a pattern that's played out quite often, over all the times the Earth has spun around the Sun, in which some bright-eyed young scientist, young engineer, young entrepreneur, proceeds in full bright-eyed optimism to challenge some problem that turns out to be really quite difficult.  Very often the cynical old veterans of the field try to warn them about this, and the bright-eyed youngsters don't listen, because, like, who wants to hear about all that stuff, they want to go solve the problem!  Then this person gets beaten about the head with a slipper by reality as they find out that their brilliant speculative theory is wrong, it's actually really hard to build the thing because it keeps breaking, and society isn't as eager to adopt their clever innovation as they might've hoped, in a process which eventually produces a new cynical old veteran.  Which, if not literally optimal, is I suppose a nice life cycle to nod along to in a nature-show sort of way.  Sometimes you do something for the first time and there are no cynical old veterans to warn anyone and people can be really optimistic about how it will go; eg the initial Dartmouth Summer Research Project on Artificial Intelligence in 1956:  "An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer."  This is less of a viable survival plan for your planet if the first major failure of the bright-eyed youngsters kills literally everyone before they can predictably get beaten about the head with the news that there were all sorts of unforeseen difficulties and reasons why things were hard.  You don't get any cynical old veterans, in this case, because everybody on Earth is dead.  
Once you start to suspect you're in that situation, you have to do the Bayesian thing and update now to the view you will predictably update to later: realize you're in a situation of being that bright-eyed person who is going to encounter Unexpected Difficulties later and end up a cynical old veteran - or would be, except for the part where you'll be dead along with everyone else.  And become that cynical old veteran right away, before reality whaps you upside the head in the form of everybody dying and you not getting to learn.  Everyone else seems to feel that, so long as reality hasn't whapped them upside the head yet and smacked them down with the actual difficulties, they're free to go on living out the standard life-cycle and play out their role in the script and go on being bright-eyed youngsters; there's no cynical old veterans to warn them otherwise, after all, and there's no proof that everything won't go beautifully easy and fine, given their bright-eyed total ignorance of what those later difficulties could be.

38.  It does not appear to me that the field of 'AI safety' is currently being remotely productive on tackling its enormous lethal problems.  These problems are in fact out of reach; the contemporary field of AI safety has been selected to contain people who go to work in that field anyways.  Almost all of them are there to tackle problems on which they can appear to succeed and publish a paper claiming success; if they can do that and get funded, why would they embark on a much more unpleasant project of trying something harder that they'll fail at, just so the human species can die with marginally more dignity?  This field is not making real progress and does not have a recognition function to distinguish real progress if it took place.  You could pump a billion dollars into it and it would produce mostly noise to drown out what little progress was being made elsewhere.

39.  I figured this stuff out using the null string as input, and frankly, I have a hard time myself feeling hopeful about getting real alignment work out of somebody who previously sat around waiting for somebody else to input a persuasive argument into them.  This ability to "notice lethal difficulties without Eliezer Yudkowsky arguing you into noticing them" currently is an opaque piece of cognitive machinery to me, I do not know how to train it into others.  It probably relates to 'security mindset', and a mental motion where you refuse to play out scripts, and being able to operate in a field that's in a state of chaos.

40.  "Geniuses" with nice legible accomplishments in fields with tight feedback loops where it's easy to determine which results are good or bad right away, and so validate that this person is a genius, are (a) people who might not be able to do equally great work away from tight feedback loops, (b) people who chose a field where their genius would be nicely legible even if that maybe wasn't the place where humanity most needed a genius, and (c) probably don't have the mysterious gears simply because they're rare.  You cannot just pay $5 million apiece to a bunch of legible geniuses from other fields and expect to get great alignment work out of them.  They probably do not know where the real difficulties are, they probably do not understand what needs to be done, they cannot tell the difference between good and bad work, and the funders also can't tell without me standing over their shoulders evaluating everything, which I do not have the physical stamina to do.  I concede that real high-powered talents, especially if they're still in their 20s, genuinely interested, and have done their reading, are people who, yeah, fine, have higher probabilities of making core contributions than a random bloke off the street. But I'd have more hope - not significant hope, but more hope - in separating the concerns of (a) credibly promising to pay big money retrospectively for good work to anyone who produces it, and (b) venturing prospective payments to somebody who is predicted to maybe produce good work later.

41.  Reading this document cannot make somebody a core alignment researcher.  That requires, not the ability to read this document and nod along with it, but the ability to spontaneously write it from scratch without anybody else prompting you; that is what makes somebody a peer of its author.  It's guaranteed that some of my analysis is mistaken, though not necessarily in a hopeful direction.  The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so.  Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly - such that somebody out there had the knowledge to write all of this themselves, if they could only have written it up, but they couldn't write, so didn't try.  I'm not particularly hopeful of this turning out to be true in real life, but I suppose it's one possible place for a "positive model violation" (miracle).  The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years into even normies starting to notice the death game, it is still Eliezer Yudkowsky writing up this list, says that humanity still has only one gamepiece that can do that.  I knew I did not actually have the physical stamina to be a star researcher, I tried really really hard to replace myself before my health deteriorated further, and yet here I am writing this.  That's not what surviving worlds look like.

42.  There's no plan.  Surviving worlds, by this point, and in fact several decades earlier, have a plan for how to survive.  It is a written plan.  The plan is not secret.  In this non-surviving world, there are no candidate plans that do not immediately fall to Eliezer instantly pointing at the giant visible gaping holes in that plan.  Or if you don't know who Eliezer is, you don't even realize you need a plan, because, like, how would a human being possibly realize that without Eliezer yelling at them?  It's not like people will yell at themselves about prospective alignment difficulties, they don't have an internal voice of caution.  So most organizations don't have plans, because I haven't taken the time to personally yell at them.  'Maybe we should have a plan' is deeper alignment mindset than they possess without me standing constantly on their shoulder as their personal angel pleading them into... continued noncompliance, in fact.  Relatively few are aware even that they should, to look better, produce a pretend plan that can fool EAs too 'modest' to trust their own judgments about seemingly gaping holes in what serious-looking people apparently believe.

43.  This situation you see when you look around you is not what a surviving world looks like.  The worlds of humanity that survive have plans.  They are not leaving to one tired guy with health problems the entire responsibility of pointing out real and lethal problems proactively.  Key people are taking internal and real responsibility for finding flaws in their own plans, instead of considering it their job to propose solutions and somebody else's job to prove those solutions wrong.  That world started trying to solve their important lethal problems earlier than this.  Half the people going into string theory shifted into AI alignment instead and made real progress there.  When people suggest a planetarily-lethal problem that might materialize later - there's a lot of people suggesting those, in the worlds destined to live, and they don't have a special status in the field, it's just what normal geniuses there do - they're met with either solution plans or a reason why that shouldn't happen, not an uncomfortable shrug and 'How can you be sure that will happen' / 'There's no way you could be sure of that now, we'll have to wait on experimental evidence.'

A lot of those better worlds will die anyways.  It's a genuinely difficult problem, to solve something like that on your first try.  But they'll die with more dignity than this.


55 comments

On Twitter, Eric Rogstad wrote:

"the thing where it keeps being literally him doing this stuff is quite a bad sign"

I'm a bit confused by this part. Some thoughts on why it seems odd for him (or others) to express that sentiment...

1. I parse the original as, "a collection of EY's thoughts on why safe AI is hard". It's EY's thoughts, why would someone else (other than @robbensinger) write a collection of EY's thoughts?

(And if we generalize to asking why no-one else would write about why safe AI is hard, then what about Superintelligence, or the AI stuff in cold-takes, or ...?)

2. Was there anything new in this doc? It's prob useful to collect all in one place, but we don't ask, "why did no one else write this" for every bit of useful writing out there, right?

Why was it so overwhelmingly important that someone write this summary at this time, that we're at all scratching our heads about why no one else did it?

Copying over my reply to Eric:

My shoulder Eliezer (who I agree with on alignment, and who speaks more bluntly and with less hedging than I normally would) says:

  1. The list is true, to the best of my knowledge, and the details actually matter.

    Many civilizations try to make a canonical list like this in 1980 and end up dying where they would have lived just because they left off one item, or under-weighted the importance of the last three sentences of another item, or included ten distracting less-important items.
     
  2. There are probably not many civilizations that wait until 2022 to make this list, and yet survive.
     
  3. It's true that many of the points in the list have been made before. But it's very doomy that they were made by me.
     
  4. Nearly all of the field's active alignment research is predicated on a false assumption that's contradicted by one of the items in sections A or B. If the field had recognized everything in A and B sooner, we could have put our recent years of effort into work that might actually help on the mainline, as opposed to work that just hopes a core difficulty won't manifest and has no Plan B for what to do when reality says "no, we're on the mainline".

So the answer to 'Why would someone else write EY's thoughts?' is 'It has nothing to do with an individual's thoughts; it's about civilizations needing a very solid and detailed understanding of what's true on these fronts, or they die'.

Re "(And if we generalize to asking why no-one else would write about why safe AI is hard, then what about Superintelligence, or the AI stuff in cold-takes, or ...?)": 

The point is not 'humanity needs to write a convincing-sounding essay for the thesis Safe AI Is Hard, so we can convince people'. The point is 'humanity needs to actually have a full and detailed understanding of the problem so we can do the engineering work of solving it'.

If it helps, imagine that humanity invents AGI tomorrow and has to actually go align it now. In that situation, you need to actually be able to do all the requisite work, not just be able to write essays that would make a debate judge go 'ah yes, well argued.'

When you imagine having water cooler arguments about the importance of AI alignment work, then sure, it's no big deal if you got a few of the details wrong.

When you imagine actually trying to build aligned AGI the day after tomorrow, I think it comes much more into relief why it matters to get those details right, when the "details" are as core and general as this.

I think that this is a really good exercise that more people should try. Imagine that you're running a project yourself that's developing AGI first, in real life. Imagine that you are personally responsible for figuring out how to make the thing go well. Yes, maybe you're not the perfect person for the job; that's a sunk cost. Just think about what specific things you would actually do to make things go well, what things you'd want to do to prepare 2 years or 6 years in advance, etc.

Try to think your way into near-mode with regard to AGI development, without thereby assuming (without justification) that it must all be very normal just because it's near. Be able to visualize it near-mode and weird/novel. If it helps, start by trying to adopt a near-mode, pragmatic, gearsy mindset toward the weirdest realistic/plausible hypothesis first, then progress to the less-weird possibilities.

I think there's a tendency for EAs and rationalists to instead fall into one of these two mindsets with regard to AGI development, pivotal acts, etc.:

  1. Fun Thought Experiment Mindset.  On this mindset, pivotal acts, alignment, etc. are mentally categorized as a sort of game, a cute intellectual puzzle or a neat thing to chat about.

    This is mostly a good mindset, IMO, because it makes it easy to freely explore ideas, attend to the logical structure of arguments, brainstorm, focus on gears, etc.

    Its main defect is a lack of rigor and a more general lack of drive: because on some level you're not taking the question seriously, you're easily distracted by fun, cute, or elegant lines of thought, and you won't necessarily push yourself to red-team proposals, spontaneously take into account other pragmatic facts/constraints you're aware of from outside the current conversational locus, etc. The whole exercise sort of floats in a fantasy bubble, rather than being a thing people bring their full knowledge, mental firepower, and lucidity/rationality to bear on.
     
  2. Serious Respectable Person Mindset.  Alternatively, when EAs and rationalists do start taking this stuff seriously, I think they tend to sort of turn off the natural flexibility, freeness, and object-levelness of their thinking, and let their mind go to a very fearful or far-mode place. The world's gears become a lot less salient, and "Is it OK to say/think that?" becomes a more dominant driver of thought.

    Example: In Fun Thought Experiment Mindset, IME, it's easier to think about governments in a reductionist and unsentimental way, as specific messy groups of people with specific institutional dysfunctions, psychological hang-ups, etc. In Serious Respectable Person Mindset, there's more of a temptation to go far-mode, glom on to happy-sounding narratives and scenarios, or even just resist the push to concretely visualize the future at all -- thinking instead in terms of abstract labels and normal-sounding platitudes.

The entire fact that EA and rationalism mostly managed to avert their gaze from the concept of "pivotal acts" for years, is in my opinion an example of how these two mindsets often fail.

"In the endgame, AGI will probably be pretty competitive, and if a bunch of people deploy AGI then at least one will destroy the world" is a thing I think most LWers and many longtermist EAs would have considered obvious. As a community, however, we mostly managed to just-not-think the obvious next thought, "In order to prevent the world's destruction in this scenario, one of the first AGI groups needs to find some fast way to prevent the proliferation of AGI."

Fun Thought Experiment Mindset, I think, encouraged this mental avoidance because it thought of AGI alignment (to some extent) as a fun game in the genre of "math puzzle" or "science fiction scenario", not as a pragmatic, real-world dilemma we actually have to solve, taking into account all of our real-world knowledge and specific facts on the ground. The 'rules of the game', many people apparently felt, were to think about certain specific parts of the action chain leading up to an awesome future lightcone, rather than taking ownership of the entire problem and trying to figure out what humanity should in-real-life do, start to finish.

(What primarily makes this weird is that many alignment questions crucially hinge on 'what task are we aligning the AGI on?'. These are not remotely orthogonal topics.)

Serious Respectable Person Mindset, I think, encouraged this mental avoidance more actively, because pivotal acts are a weird and scary-sounding idea once you leave 'these are just fun thought experiments' land.

What I'd like to see instead is something like Weirdness-Tolerant Project Leader Mindset, or Thought Experiments Plus Actual Rigor And Pragmatism And Drive Mindset, or something.

I think a lot of the confusion around EY's post comes from the difference between thinking of these posts (on some level) as fun debate fodder or persuasion/outreach tools, versus attending to the fact that humanity has to actually align AGI systems if we're going to make it out of this problem, and this is an attempt by humanity to distill where we're currently at, so we can actually proceed to go solve alignment right now and save the world.

Imagine that this is v0 of a series of documents that need to evolve into humanity's (/ some specific group's) actual business plan for saving the world. The details really, really matter. Understanding the shape of the problem really matters, because we need to engineer a solution, not just 'persuade people to care about AI risk'.

If you disagree with the OP... that's pretty important! Share your thoughts. If you agree, that's important to know too, so we can prioritize some disagreements over others and zero in on critical next actions. There's a mindset here that I think is important, that isn't about "agree with Eliezer on arbitrary topics" or "stop thinking laterally"; it's about approaching the problem seriously, neither falling into despair nor wishful thinking, neither far-mode nor forced normality, neither impracticality nor propriety.

"In the endgame, AGI will probably be pretty competitive, and if a bunch of people deploy AGI then at least one will destroy the world" is a thing I think most LWers and many longtermist EAs would have considered obvious.

I think that many AI alignment researchers just have a different development model than this, where world-destroying AGIs don't emerge suddenly from harmless low-impact AIs, no one project gets a vast lead over competitors, there's lots of early evidence of misalignment and (if alignment is harder) many smaller scale disasters in the lead up to any AI that is capable of destroying the world outright. See e.g. Paul's What failure looks like.

On this view, the idea that there'll be a lead project with a very short time window to execute a single pivotal act is wrong. Instead, the 'pivotal act' is spread out over time: it is about making sure the aligned projects have a lead over the rest, and that failures from unaligned projects are caught early enough, for long enough (by AIs or human overseers), for the leading projects to become powerful and for best practices on alignment to be spread universally.

Basically, if you find yourself in the early stages of WFLL2 and want to avert doom, what you need to do is get better at overseeing your pre-AGI AIs, not build an AGI to execute a pivotal act. This was pretty much what Richard Ngo was arguing for in most of the MIRI debates with Eliezer, and also I think it's what Paul was arguing for. And obviously, Eliezer thought this was insufficient, because he expects alignment to be much harder and takeoff to be much faster.

But I think that's the reason a lot of alignment researchers haven't focussed on pivotal acts: because they think a sudden, fast-moving single pivotal act is unnecessary in a slow takeoff world. So you can't conclude just from the fact that most alignment researchers don't talk in terms of single pivotal acts that they're not thinking in near mode about what actually needs to be done.

However, I do think that what you're saying is true of a lot of people - many people I speak to just haven't thought about the question of how to ensure overall success, either in the slow takeoff sense I've described or the Pivotal Act sense. I think people in technical research are just very unused to thinking in such terms, and AI governance is still in its early stages.

I agree that on this view it still makes sense to say, 'if you somehow end up that far ahead of everyone else in an AI takeoff then you should do a pivotal act', like Scott Alexander said:

That is, if you are in a position where you have the option to build an AI capable of destroying all competing AI projects, the moment you notice this you should update heavily in favor of short timelines (zero in your case, but everyone else should be close behind) and fast takeoff speeds (since your AI has these impressive capabilities). You should also update on existing AI regulation being insufficient (since it was insufficient to prevent you)

But I don't think you learn all that much about how 'concrete and near mode' researchers who expect slower takeoff are being, from them not having given much thought to what to do in this (from their perspective) unlikely edge case.

"But I don't think you learn all that much about how 'concrete and near mode' researchers who expect slower takeoff are being, from them not having given much thought to what to do in this (from their perspective) unlikely edge case."

I'm not sure how many researchers assign little enough credence to fast takeoff that they'd describe it as an unlikely edge case, which sounds like <=10%? E.g., in Paul's blog post he writes "I’m around 30% of fast takeoff".

ETA: One proxy could be what percentage researchers assigned to "Superintelligence" in this survey

I don't think what Paul means by fast takeoff is the same thing as the sort of discontinuous jump that would enable a pivotal act. I think fast for Paul just means the negation of Paul-slow: 'no four year economic doubling before one year economic doubling'. But whatever Paul thinks, the survey respondents did give at least 10% to scenarios where a pivotal act is possible.

Even so, 'this isn't how I expect things to go on the mainline so I'm not going to focus on what to do here' is far less of a mistake than 'I have no plan for what to do on my mainline', and I think the researchers who ignored pivotal acts are mostly doing the first one.

Great comment. Perhaps it would be helpful to explicitly split the analysis by assumptions about takeoff speed? It seems that conditional on takeoff speed, there's not much disagreement.

At least for me, I thought we should avoid talking about the pivotal act stuff through a combination of a) this is obviously an important candidate hypothesis but seems bad to talk about because then the Bad Guys will Get Ideas and b) other people who're better at math/philosophy/alignment presumably know this and are privately considering it in detail, I have only so much to contribute here.

b) is plausibly a dereliction of duty, as is my relative weighting of the terms, but at least in my head it wasn't (isn't?) obvious to me that it was wrong for me not to spend a ton of time thinking about pivotal acts.

I think that makes sense as a worry, but I think EAs' caution and reluctance to model-build and argue about this stuff has turned out to do more harm than good, so we should change tactics. (And we very probably should have done things differently from the get-go.)

If you're worried that it's dangerous to talk about something publicly, I'd start off by thinking about it privately and talking about it over Signal with friends, etc. Then you can progress to contacting more EAs privately, then to posting publicly, as it becomes increasingly clear "there's real value in talking about this stuff" and "there's not a strong-enough reason to keep quiet".

Step one in doing that, though, has to be a willingness to think about the topic at all, even if there isn't clear public social proof that this is a normal or "approved" direction to think in. I think a thing that helps here is to recognize how small the group of "EA leaders and elite researchers" is, how divided their attention is between hundreds of different tasks and subtasks, and how easy it is for many things to therefore fall through the cracks or just-not-happen.

Just as a datapoint:

I had once explicitly posted on LW asking what pivotal act you could take by querying a misaligned oracle AI, with the assumption that you want to leak as few bits of information about the world as possible to the AI. The reasoning being that if it had less data, it would have less ability to break out of its box even if it failed to answer your query.

LW promptly downvoted the question heavily, so I assumed it was taboo.

This makes me wish there were a popular LW- or EAF-ish forum that doesn't use the karma system (or uses a very different karma system). If karma sometimes makes errors sticky because disagreement gets downvoted, then it would be nice to have more venues that don't have exactly that issue.

(Also, if more forums exist, this makes it likelier that different forums will downvote different things, so a wider variety of ideas can be explored in at least one place.)

This is also another reason to roll out separate upvotes for 'I like this' versus 'I agree with this'.

Interesting suggestion, although my first reaction is it feels a bit like handing things over to Moloch. Like, I would rather replace a bad judge of what is infohazardous content with a good judge, than lose our ability to keep any infohazards private at all.

There's also a similar discussion on having centralized versus decentralized grantmaking for longterm future stuff. People pointed out the unilateralist's curse as a reason to keep it centralised.

I am normally super in favour of more decentralisation and pluralism of thought, but having seen a bunch of infohazardous stuff and potential dangerous-if-executed plans here and there I'm no longer sure.

P.S. Maybe one needs to decouple stuff like "low quality", "not a topic of interest in this particular forum", "not worth funding" etc. from "infohazardous" and "dangerous to execute".

Like maybe we can decentralize decision making on the former without doing the same on the latter.

This is another one of those downvoted comments.

This comment was at -2 until I strong upvoted. :/

Thanks for writing this up. I don't spend that much time thinking about AI as a problem area, but I agree that something in the ballpark of AI alignment is a top priority problem. As something of a newbie to this, I wanted to throw out some very possibly naive criticisms and questions so that I can try and understand your argument better. 

1 Suppose we give the AI the goal of 'maximise the total welfare of sentient life'. We use a training set of local tasks to see how it performs over many years before we get to AGI. I'm not sure I understand why this system would almost definitely subsequently kill everyone. I agree that there is some risk that it might go wrong, that it might misunderstand what we mean by the goal, or it might deceptively perform on the training set in order to get power to meet its mistaken goal. But from what you have said, I don't understand why it would almost definitely kill everyone. Similarly, if you gave the AI the goal of benefitting only the American people, I don't understand, from what you have said, why the system would almost definitely kill everyone in the world once it is let out from its training and has to make a distributional shift into the wider world.

2 You say "More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment - to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward."

Insofar as I understand this, it seems false. Like, if I designed a driverless car, I think it could be true that it could reliably identify things within the environment, such as dogs, other cars, and pedestrians. Is this what you mean by 'point out'? It is true that it would learn what these are by sense data and reward, but I don't see why this means that such a system couldn't reliably identify actual objects in the real world.

3 "There's no reliable Cartesian-sensory ground truth (reliable loss-function-calculator) about whether an output is 'aligned', because some outputs destroy (or fool) the human operators and produce a different environmental causal chain behind the externally-registered loss function"

The implicit claim in this part of the argument seems to be that the rate at which all AI systems will attempt to fool human operators attempting to align them is high enough that we can never have (much?) confidence that a system is aligned. But this seems to be asserted rather than argued for. In AI training, we could punish systems strongly for deception to make it strongly disfavoured. Are you saying that deception in training is a 1% chance or a 99% chance? What is the argument for either number?

4 "Human operators are fallible, breakable, and manipulable.  Human raters make systematic errors - regular, compactly describable, predictable errors.  To faithfully learn a function from 'human feedback' is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we'd hoped to transfer).  If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them"

I don't understand why this last sentence is true. I can see why optimising the rewards assigned by human operators wouldn't optimise human welfare, but I don't understand why it would lead to death. If I have a super competent PA who optimises for what I assign as a reward, why does this mean the PA will kill me?

[I'm going for dinner and will try to come up with some more questions later]

Thanks for the questions, John! :) I think this is a great place for this kind of discussion.

(I do comms at MIRI, where Eliezer works. I tend to have very Eliezer-ish views of AI risk, though I don't generally run my comments by Eliezer or other MIRI staff, so it's always possible I'm saying something Eliezer would disagree with.)

2 You say "More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment - to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward."

Insofar as I understand this, it seems false. Like, if I designed a driverless car, I think it could be true that it could reliably identify things within the environment, such as dogs, other cars, and pedestrians. Is this what you mean by 'point out'? It is true that it would learn what these are by sense data and reward, but I don't see why this means that such a system couldn't reliably identify actual objects in the real world.

What Eliezer's saying here is that current ML doesn't have a way to point the system's goals at specific physical objects in the world. Sufficiently advanced AI will end up knowing that the physical objects exist (i.e., it will incorporate those things into its beliefs), but this is different from getting a specific programmer-intended concept into the goal.

See points 21 ("Capabilities generalize further than alignment once capabilities start to generalize far.") and 22 ("There is no analogous truth about there being a simple core of alignment"). These are saying that getting AGI to understand things is a lot easier, on paradigms like the current one, than getting it to have a specific intended motivation.

Similarly, if you gave the AI the goal of benefitting only the American people, I don't understand, from what you have said, why the system would almost definitely kill everyone in the world once it is let out from its training and has to make a distributional shift into the wider world. 

The short answer is: we don't know how to get an AI system's goals to robustly 'point at' objects like 'the American people'. We don't even know how to get the goals to robustly point at much simpler physical systems that have crisp, known definitions (e.g., 'carbon atoms arranged to form diamond').

Absent such knowledge, we may be able to get an AI system to exhibit 'American-benefitting-ish behaviors' in a particular setting. But when you increase the AGI system's capability, or move to a new setting, this correlation is likely to break, because the vast majority of systems that exhibit superficially 'American-benefitting-ish behaviors' in a specific setting will not generalize in the way we implicitly want them to. The space of possible goals is too large and multidimensional for that, and the intuitive human idea of 'what counts as benefiting Americans' is too complex and unnatural.

A particularly dangerous example of "the superficially good system won't generalize in the way we implicitly want it to" is if the AI system is strategically aware and trying to gain influence. In that case, the system may deliberately look more friendly than it is, look more-liable-to-generalize-out-of-distribution than it in fact will, etc.

The simplest way this connects up to 'death' is that the system just doesn't have a goal that remotely resembles what you intended. It has a goal that only correlated with 'benefit Americans' under very specific circumstances; or it has a totally unrelated random goal (e.g., 'maximize the number of granite spheres in the universe'), but had a belief that this goal would be better-served in the near term if its behavior satisfied the programmers.

(This belief is often true, because there are likelier to be more granite spheres in the future if the programmers think you are friendly and useful, because this gives you an avenue for gaining more influence later, and reduces the probability that the programmers will shut you down or change your goals.)

A powerful AGI acting on the world with a goal like 'maximize the number of granite spheres' will (with high probability) kill everyone, because (a) humans are potentially threats to its sphere agenda (e.g., we could build a rival superintelligence that has a different goal), and (b) humans are made of raw materials which can be repurposed to build more spheres and infrastructure.

In AI training, we could punish systems strongly for deception to make it strongly disfavoured. Are you saying that deception in training is a 1% chance or a 99% chance? What is the argument for either number?

By default, for sufficiently misaligned smart agents that can think about their operators, more like a 99% chance. The argument for this as a default is 23 ("you can't bring the coffee if you're dead").

If you're a paperclip maximizer and your operator is a staple maximizer, then you have a strong incentive to find ways to reduce your operator's influence over the future and increase your own influence, so that there are more paperclips in the future and fewer staples.

"Intervene on the part of the world that is my operator's beliefs, in ways that increase my influence" is a special case of "intervening on the world in general, in ways that increase my influence". We shouldn't generally expect it to be easy to get an AGI to specifically carve out an exception for the former, while freely doing the latter -- because "my operator's brain" is not a simple, crisp, easy-to-formally-specify idea, but also because we don't know how to robustly point AGI goals at specific ideas even when they are simple, crisp, and easy to formally specify.

See also 8:

"The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we'd rather the AI not solve; you can't build a system that only has the capability to drive red cars and not blue cars, because all red-car-driving algorithms generalize to the capability to drive blue cars."

You can try to train the system to dislike deception, but this is triply difficult to do because:

  1. it's hard to train robust goals at all;
  2. it should be even harder to robustly train complex, value-laden goals that we have a fuzzy sense of but don't know how to crisply define; and most importantly
  3. we're actively pushing against the default incentives most possible systems have.

The last point is discussed more in 24.2:

"The second thing looks unworkable (less so than CEV, but still lethally unworkable) because corrigibility runs actively counter to instrumentally convergent behaviors within a core of general intelligence (the capability that generalizes far out of its original distribution).  You're not trying to make it have an opinion on something the core was previously neutral on.  You're trying to take a system implicitly trained on lots of arithmetic problems until its machinery started to reflect the common coherent core of arithmetic, and get it to say that as a special case 222 + 222 = 555.  You can maybe train something to do this in a particular training distribution, but it's incredibly likely to break when you present it with new math problems far outside that training distribution, on a system which successfully generalizes capabilities that far at all."

'Don't deceive your operators, even when you aren't perfectly aligned with your operator and have different goals from them' is an example of a corrigible behavior. This goal is like '222 + 222 = 555' because it locally violates the 'you can't get the coffee if you're dead' principle (as a special case of the principle 'you're likelier to get the coffee insofar as you have more influence over the future').

We're trying to get the system to generally be smart, useful, and strategic about some domain, but trying to get it not to understand, or not to care about, one of the most basic strategic implications of multi-agent scenarios: that when two agents have different goals, each agent will better achieve its goals if it gains control and the other agent loses control. This should be possible in principle, but on the face of it, it looks difficult.

You say "By default, for sufficiently misaligned smart agents that can think about their operators, more like a 99% chance." I agree that badly misaligned smart agents are likely to try to deceive their operators. But I was discussing the following proposition: "among advanced AI systems that we might plausibly make, there is a 99% chance of deception". Your claim is about the subset of misaligned agents, not about how likely we are to produce misaligned agents (that might deceive us).

I take it that 23 shows that all systems have incentives not to be turned off.  I don't think this shows that there is a 99% chance that AI systems will deceive their programmers. 

Thanks for the three point argument, that is clarifying. I agree that if those premises are true, then we should expect AI systems to seek power over human operators who might try to turn them off or change their goals. If the goal is something like 'increase total human welfare' and the AI has a different idea about that than its operator, then the AI will try to disempower the operator in one way or another. But I'm not sure I see why this is necessarily a bad outcome. The AI might still be good at advancing human welfare even if human operators are disempowered. If so, that seems like a good outcome, from a utilitarian point of view.

This gets back to some of the ambiguity about alignment that pops up in the AI safety literature. I have been informally asking people working on AI what they mean by alignment over the last year, and nearly every answer has been importantly different from any of the others. To me, getting an AI to improve sentient life seems like a good result, even if human controllers are disempowered. 

Holden saw your questions and decided to write a new series to explain.

I don't think this shows that there is a 99% chance that AI systems will deceive their programmers.

Agreed. I wasn't trying to argue for a specific probability assignment; that seems hard, and it seems harder to reach extreme probabilities if you're new to the field and haven't searched around for counter-arguments, counter-counter-arguments, etc.

The AI might still be good at advancing human welfare even if human operators are disempowered. If so, that seems like a good outcome, from a utilitarian point of view.

In the vast majority of 'AGI with a random goal trying to deceive you' scenarios, I think the random goal produces outcomes like paperclips, rather than 'sort-of-good' outcomes.

I think the same in the case of 'AGI with a goal sort-of related to advancing human welfare in the training set', though the argument for this is less obvious.

I think Complex Value Systems are Required to Realize Valuable Futures is a good overview: human values are highly multidimensional, and in such a way that there are many different dimensions where a slightly wrong answer can lose you all of the value. Structurally like a combination lock, where getting 9/10 of the numbers correct gets you 0% of the value but getting 10/10 right gets you 100% of the value.
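To make the combination-lock structure concrete, here is a toy sketch. It is only an illustration under made-up assumptions: the ten "dimensions", the `target` digits, and both scoring functions are hypothetical and do not come from the linked essay. It contrasts a forgiving additive score with an all-or-nothing lock score.

```python
# Toy contrast between "partial credit" value and "combination lock" value.
# All names and numbers here are hypothetical, chosen only for illustration.

def additive_value(guess, target):
    """Fraction of positions that match: a near miss still scores well."""
    return sum(g == t for g, t in zip(guess, target)) / len(target)

def lock_value(guess, target):
    """All-or-nothing: any single wrong position yields zero value."""
    return 1.0 if list(guess) == list(target) else 0.0

target = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]   # the "correct" setting on 10 dimensions
near_miss = target[:-1] + [0]             # 9 of 10 dimensions correct

print(additive_value(near_miss, target))  # 0.9
print(lock_value(near_miss, target))      # 0.0
print(lock_value(target, target))         # 1.0
```

Under the additive score a near miss looks almost as good as a hit; under the lock score it is worth nothing, which is the structure the claim above attributes to human values.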

Also relevant is Stuart Russell's point:

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.  This is essentially the old story of the genie in the lamp, or the sorcerer's apprentice, or King Midas: you get exactly what you ask for, not what you want.
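A minimal sketch of the effect Russell describes, under assumptions that are mine rather than his: three variables share a fixed budget (`BUDGET`, a hypothetical number), the objective looks at only the first two, and the optimizer is a brute-force search over integer allocations. The third variable stands in for "something we care about" that the objective never mentions.

```python
from itertools import product
from math import sqrt

BUDGET = 6  # hypothetical number of shared resource units

def objective(x):
    """Depends on only k = 2 of the n = 3 variables; x[2] is ignored."""
    return sqrt(x[0]) + sqrt(x[1])

def best_allocation(budget=BUDGET, n=3):
    """Brute-force search over all non-negative integer allocations
    that spend exactly `budget` units across n variables."""
    candidates = [x for x in product(range(budget + 1), repeat=n)
                  if sum(x) == budget]
    return max(candidates, key=objective)

print(best_allocation())  # (3, 3, 0)
```

Because the ignored variable competes for the same budget, the search drives it to the extreme value of zero, even though nothing in the objective says it should be minimized.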

And Goodhart's Curse:

Goodhart's Curse in this form says that a powerful agent neutrally optimizing a proxy measure U that we hoped to align with true values V, will implicitly seek out upward divergences of U from V.

In other words: powerfully optimizing for a utility function is strongly liable to blow up anything we'd regard as an error in defining that utility function.

[...] Suppose the humans have true values V. We try to convey these values to a powerful AI, via some value learning methodology that ends up giving the AI a utility function U.

Even if U is locally an unbiased estimator of V, optimizing U will seek out what we would regard as 'errors in the definition', places where U diverges upward from V. Optimizing for a high U may implicitly seek out regions where U - V is high; that is, places where V is lower than U. This may especially include regions of the outcome space or policy space where the value learning system was subject to great variance; that is, places where the value learning worked poorly or ran into a snag.

Goodhart's Curse would be expected to grow worse as the AI became more powerful. A more powerful AI would be implicitly searching a larger space and would have more opportunity to uncover what we'd regard as "errors"; it would be able to find smaller loopholes, blow up more minor flaws.

[...] We could see the genie as implicitly or emergently seeking out any possible loophole in the wish: Not because it is an evil genie that knows our 'truly intended' V and is looking for some place that V can be minimized while appearing to satisfy U; but just because the genie is neutrally seeking out very large values of U and these are places where it is unusually likely that U diverged upward from V.
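The selection effect in the quoted passage can be sketched with a small simulation. This is a toy model, not the quoted author's formalism: true values V and the noise in the proxy U are both assumed to be Gaussian, U is an unbiased estimate of V for each individual option, and the function name `mean_overestimate` and its parameters are invented for illustration. The "agent" just picks the option with the highest U.

```python
import random

def mean_overestimate(num_options, trials=200, noise_sd=1.0, seed=0):
    """Average of U - V at the option with the highest proxy score U.

    For each option, V ~ Normal(0, 1) is the true value and
    U = V + Normal(0, noise_sd) is an unbiased proxy for it.
    """
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(trials):
        best_u, best_v = float("-inf"), 0.0
        for _ in range(num_options):
            v = rng.gauss(0.0, 1.0)
            u = v + rng.gauss(0.0, noise_sd)
            if u > best_u:              # keep the option that maximizes U
                best_u, best_v = u, v
        total_gap += best_u - best_v    # how far U overshoots V at the pick
    return total_gap / trials

# Searching a larger space stands in for a more powerful optimizer.
for n in (10, 1000):
    print(n, round(mean_overestimate(n), 2))
```

Even though U is unbiased option by option, the selected option's U exceeds its V on average, and the gap widens as the number of options searched grows, matching the claim that the curse gets worse with more optimization power.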

So part of the issue is that human values inherently require getting a lot of bits correct simultaneously, in order to produce any value.  (And also, getting a lot of the bits right while getting a few wrong can pose serious s-risks.)

Another part of the problem is that powerfully optimizing one value will tend to crowd out other values.

And a third part of the problem is that insofar as there are flaws in our specification of what we value, AGI is likely to disproportionately seek out and exploit those flaws, since "places where our specification of what's good was wrong" are especially likely to include more "places where you can score extremely high on the specification".

To me, getting an AI to improve sentient life seems like a good result, even if human controllers are disempowered. 

Agreed! If I thought a misaligned AGI were likely to produce an awesome flourishing civilization (but kill humans in the process), I would be vastly less worried. By far the main reason I'm worried is that I expect misaligned AGI to produce things morally equivalent to "granite spheres" instead.

Thanks for this and sorry for the slow reply. OK, great, so your earlier thought was that even if we tried to give the AI welfarist goals, it would most likely end up with some random goal like optimising granite spheres or paperclips?

I will give the resources you shared a read. Thanks for the interesting discussion!

Hi Rob, 

Thanks for this detailed response. I appreciate getting the opportunity to discuss this in depth. 

What Eliezer's saying here is that current ML doesn't have a way to point the system's goals at specific physical objects in the world. Sufficiently advanced AI will end up knowing that the physical objects exist (i.e., it will incorporate those things into its beliefs), but this is different from getting a specific programmer-intended concept into the goal.

I'm not sure whether I have misunderstood, but doesn't this imply that advanced AI cannot (eg) maximise the number of paperclips (or granite spheres) in the world (even though it can know that paperclips exist and what they are)? If I gave an AI the aim of 'kill all humans' then don't the system's goals point at objects in the world? Since you think that it is almost certain that AGI will kill all humans as an intermediate goal for any ultimate goal we give it, doesn't that mean it would be straightforward to give AIs the goal of 'kill all humans'?

I don't really get how there can be such a firm dividing line between understanding the world and having motivations that are faithful to the intentions of the programmer. If a system can understand the world really well, it can eg understand what pleasure is really well. Why then would it be extremely difficult to get it to optimise the amount of pleasure in the world? Could we test a system out for ages asking it to correctly identify improvements in total welfare, and then once we have tested it for ages put it out in the world? I still don't really get why this would with ~100% probability kill everyone.

The key point in the argument in 21 seems to be:

In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints.  Reality doesn't 'hit back' against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases.

The first sentence seems like a non-sequitur and I'm not sure why it is relevant to the argument. Of course there are unboundedly many utility functions that programmers could give AIs. On the second sentence, it is true that reality doesn't hit back against things that are locally aligned on test cases but globally misaligned on the broader set of test cases. But I take it what the argument is trying to defend is the proposition "we are extremely likely to make a system that is locally aligned in test cases but globally misaligned". This argument doesn't tell us anything about whether this proposition is true, it just tells us that if systems are locally aligned in test cases and globally misaligned, then they'll get past our current safety testing. 

I agree that AGIs with the goal of maximising granite spheres and things like that would kill everyone or do something very bad. The harder case is where you give an AI a welfarist goal.

An important note in passing. At the start, Eliezer defines alignment as ">0 people survive" but in the remainder of the piece, he often seems to refer to alignment as the more prosaic 'alignment with the intent of the programmer'. I find this ambiguity pops up a lot in AI safety writing. 

"What Eliezer's saying here is that current ML doesn't have a way to point the system's goals at specific physical objects in the world. Sufficiently advanced AI will end up knowing that the physical objects exist (i.e., it will incorporate those things into its beliefs), but this is different from getting a specific programmer-intended concept into the goal."

I'm not sure whether I have misunderstood, but doesn't this imply that advanced AI cannot (eg) maximise the number of paperclips (or granite spheres) in the world (even though it can know that paperclips exist and what they are)?

No; this is why I said "current ML doesn't have a way to point the system's goals at specific physical objects in the world", and why I said "getting a specific programmer-intended concept into the goal".

The central difficulty isn't 'getting the AGI to instrumentally care about the world's state' or even 'getting the AGI to terminally care about the world's state'. (I don't know how one would do the latter with any confidence, but maybe there's some easy hack.)

Instead, the central difficulty is 'getting the AGI to terminally care about a specific thing, as opposed to something relatively random'.

If we could build an AGI that we knew in advance, with confidence, would specifically optimize for the number of paperclips in the universe and nothing else, then that would mean that we've probably solved most of the alignment problem. It's not necessarily a huge leap from this to saving the world.

The problem is that we don't know how to do that, so AGI will instead (by default) end up with some random unintended goal. When I mentioned 'paperclips', 'granite spheres', etc. in my previous comments, I was using these as stand-ins for 'random goals that have little to do with human flourishing'. I wasn't saying we know how to specifically aim an AGI at paperclips, or at granite spheres, on purpose. If we could, that would be a totally different ball game.

If I gave an AI the aim of 'kill all humans' then don't the system's goals point at objects in the world? Since you think that it is almost certain that AGI will kill all humans as an intermediate goal for any ultimate goal we give it, doesn't that mean it would be straightforward to give AIs the goal of 'kill all humans'?

The instrumental convergence thesis implies that it's straightforward, if you know how to build AGI at all, to build an AGI that has the instrumental strategy 'kill all humans' (if any humans exist in its environment).

This doesn't transfer over to 'we know how to robustly build AGI that has humane values', because (a) humane values aren't a convergent instrumental strategy, and (b) we only know how to build AGIs that pursue convergent instrumental strategies with high probability, not how to build AGIs that pursue arbitrary goals with high probability.

But yes, if 'kill all humans' or 'acquire resources' or 'make an AGI that's very smart' or 'make an AGI that protects itself from being destroyed' were the only thing we wanted from AGI, then the problem would already be solved.

Could we test a system out for ages asking it to correctly identify improvements in total welfare, and then once we have tested it for ages put it out in the world?

No, because (e.g.) a deceptive agent that is "playing nice" will be just as able to answer those questions well. There isn't an external behavioral test that reliably distinguishes deceptive agents from genuinely friendly ones; and most agents are unfriendly/deceptive, so the prior is strongly that you'll get those before you get real friendliness.

This doesn't mean that it's impossible to get real friendliness, but it means that you'll need some method other than just looking at external behaviors in order to achieve friendliness.
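To make the reasoning above concrete, here is a minimal sketch of the Bayesian update involved. The function name and all the numbers are hypothetical, chosen only for illustration; nothing here is an estimate of real probabilities. The point is just that if a deceptive agent passes a behavioral test about as often as a friendly one, then passing the test leaves your prior essentially unchanged.

```python
def posterior_friendly(prior_friendly, p_pass_if_friendly, p_pass_if_deceptive):
    """Bayes' rule: P(friendly | the agent passed the behavioral test)."""
    joint_friendly = prior_friendly * p_pass_if_friendly
    joint_deceptive = (1.0 - prior_friendly) * p_pass_if_deceptive
    return joint_friendly / (joint_friendly + joint_deceptive)


# Hypothetical prior: friendly agents are rare among agents that could be built.
prior = 0.01

# Case 1: a deceptive agent "playing nice" passes as often as a friendly one.
# The likelihood ratio is 1, so the test carries no information.
same = posterior_friendly(prior, 0.99, 0.99)

# Case 2: an imagined test that deceptive agents usually fail.
# Only a test like this would move the posterior above the prior.
informative = posterior_friendly(prior, 0.99, 0.10)

print(f"prior={prior:.3f} uninformative={same:.3f} informative={informative:.3f}")
```

In the first case the posterior equals the prior, which is the sense in which "testing it for ages" doesn't help by itself: some method other than external behavior would have to supply the second kind of evidence.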

This argument doesn't tell us anything about whether this proposition is true, it just tells us that if systems are locally aligned in test cases and globally misaligned, then they'll get past our current safety testing. 

The paragraph you quoted isn't talking about safety testing. It's saying 'gradient-descent-ish processes that score sufficiently well on almost any highly rich, real-world task will tend to converge on similar core capabilities, because these core capabilities are relatively simple and broadly useful for many tasks', plus 'there isn't an analogous process pushing arbitrary well-performing gradient-descent-ish processes toward being human-friendly'.

An important note in passing. At the start, Eliezer defines alignment as ">0 people survive" but in the remainder of the piece, he often seems to refer to alignment as the more prosaic 'alignment with the intent of the programmer'. I find this ambiguity pops up a lot in AI safety writing. 

He says "So far as I'm concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent chance of killing more than one billion people, I'll take it." The "carries out some pivotal superhuman engineering task" is important too. This part, and the part where the AGI somehow respects the programmer's "don't kill people" goal, connects the two phrasings.

If you want to catch up quickly to the front of the conversation on AI safety, you might find this YouTube channel helpful: https://www.youtube.com/channel/UCLB7AzTwc6VFZrBsO2ucBMg

If you prefer text to video, I'm less able to give you an information-dense resource--I haven't kept track of which introductory sources and compilations have been written in the past six years. Maybe other people in the comments could help.

If you want to learn the mindset and background knowledge that goes into thinking productively about AI (and EA in general, since this is--for many of the old hands--where it all started) this is the classic introduction: https://www.readthesequences.com/

It looks like the other comments have already offered a good amount of relevant reading material, but in case you're up for some more, I think the ideas expressed in this paper (video introduction here) are a big part of why some people think that we don't know how to train models to have any (somewhat complex) objectives that we want them to have, which is a response to points (1), partly (3), and also (2) (if we interpret the quote in (2) as described in Rob's comment).

This report (especially pp. 1-8) might also make the potential difficulty of penalizing deception more intuitive.

Insofar as I understand this, it seems false. Like, if I designed a driverless car, I think it could be true that it could reliably identify things within the environment, such as dogs, other cars, and pedestrians. Is this what you mean by 'point out'? It is true that it would learn what these are by sense data and reward, but I don't see why this means that such a system couldn't reliably identify actual objects in the real world.

 

Surface level: the AI literally just identifies roads and cars from their appearance in the sensor data but these are basically immediate predictions using computer vision. Basically, this is the current state of driverless cars[1].

Latent model: The AI creates and maintains a robust model of the world. Instead of cars and roads appearing out of nowhere and having to deal with that [2], the model explains how roads and cars are related, why cars use roads, how roads are placed, and how drivers think and behave while driving. 

  • The AI could notice that the driver it is following is very sober and competent, yet is slowing down in a controlled way for no apparent reason, and the AI could decide that's a sign of an incident ahead. This would depend on an accurate model of the skill and theory of mind of the other driver.
  • Driving along, the AI can notice a lot of cars turning off into a side street anomalously. By having an accurate model of the road system, which it maintains with live sensory data, and by using information to figure out which cars are through or local traffic (depending on socioeconomic status, model, behavior of drivers, and the neighborhood), it can infer there is a slowdown or road closure, and make the decision to follow the anomalously turning cars.
  • Having to turn at a poorly designed, treacherous intersection where visibility is blocked, the AI would assess traffic and move forward assertively to become visible and gain information, at necessary exposure to itself. This assertiveness would depend on an assessment of current traffic, and the culture of driver norms in the locale.

Like, in theory, you can see this understanding would be desirable to have. 

(While unlikely to be realized for some time) in theory, all the above functionality/behavior is possible, using reinforcement style learning.

So, it seems like for complex systems in other tasks (or actual automated driving, literally), you could imagine how this functionality could lead to sophisticated, deep "latent" models of the world where the AI could gain knowledge of human theory of mind, the construction of infrastructure, economic conditions, societal conditions, patterns and limitations. Or something.

 

Note that the situation (read: vaporware) with "Driverless cars" should be an update, or at least be responded to by AI safety people.

 

  1. ^

    Note that the above isn't really true: in addition to using computer vision, the AI probably keeps state variables, like upcoming road dimensions, surface, general traffic, and previously seen cars, and this could be sophisticated, although looking at Tesla as of 2019, it seems unimpressive. Tesla will full-on drive through red lights, and into trailers and into firetrucks, which suggests "a surface level functionality", or that the latent structures mentioned are primitive.

    I don't know much about SOTA automated driving.

  2. ^

    Again, as per the first footnote, it's more like the model produces a shallow prediction of the next 15 seconds, and is not literally just reacting to what appears on the screen.

The implicit claim in this part of the argument seems to be that the rate at which all AI systems will attempt to fool human operators attempting to align them is high enough that we can never have (much?) confidence that a system is aligned. But this seems to be asserted rather than argued for. In AI training, we could punish systems strongly for deception to make it strongly disfavoured. Are you saying that deception in training is a 1% chance or a 99% chance? What is the argument for either number?


The model that AI safety people have is that:

  1. It seems like you don't need a 99% chance or even 1% chance for this to be a big problem. If it just happens once, and the AI is in position to exploit it, that seems dangerous.
    1. The model is like "Ice-Nine", I guess.
  2. To AI safety people, it seems like systems are being built with additional complexity and functionality all the time, and there's no way of knowing which new system is dangerous, in terms of capability or "alignment", or what that "percentage" (1% or 99%) might apply to its alignment, or even if this "percentage" or model of risk even is the right way of thinking about this problem for new systems.
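As a rough illustration of point 1, here is a small sketch of why a low per-system probability can still add up to a large overall risk. It assumes, purely for illustration, that each new system independently has the same chance `p_per_system` of being dangerously deceptive. The function name and the 1% figure are hypothetical, and (as point 2 says) a single fixed percentage may not even be the right model.

```python
def p_at_least_once(p_per_system, n_systems):
    """Chance that at least one of n independent systems turns out dangerous.

    Assumes independence and an identical per-system probability,
    which is a simplification made only for this illustration.
    """
    return 1.0 - (1.0 - p_per_system) ** n_systems


# Hypothetical per-system probability of 1%.
for n in (1, 10, 100, 1000):
    print(n, round(p_at_least_once(0.01, n), 3))
```

Under these assumptions, a 1% chance per system becomes better-than-even odds of at least one failure once roughly a hundred such systems have been built, which is why "it only has to happen once" carries so much weight in this model.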

FYI: LessWrong currently has an AGI Safety FAQ / all-dumb-questions-allowed thread --- if you have questions or things you're confused about, this could be a good opportunity for you.

I think there's a bunch of really important content here and I hope people engage seriously. (I plan to.)

I find that I agree (in impression space, at the moment of reading) with ~70% of what you're saying -- and think it covers an awful lot of important ground, and wish it was better appreciated in these communities. Then I think I have disagreements with the frame of ~20% (some of which rub me the wrong way in a manner which gives me a visceral urge to disengage, which I'm resisting; other parts of which I think may put attention on importantly the wrong things), and flat disagree (?) with ~10%.

I want to think about the places where I have disagreements. I suspect with some fraction I'll end up thinking you're ~right without further prompting; with some other fraction I'll end up thinking you're ~right after further discussion. On the other hand maybe I'll notice that some of the things that passed muster at first read seem subtly wrong. I'm interested to find out whether the remaining disagreements after that are enough to give me a significantly different bottom line than you. (A reason I particularly liked reading this is that it feels like it has a shot at significantly changing my bottom line, which is very unusual.)

(With apologies for the contentless reply; I thought it was better to express how I was relating to it and how I wanted to relate to it than express nothing for now until I've done my thinking.)

I agree that there is some framing and language that could be changed to make people more likely to engage and to be convinced by the arguments (which are very important!). There were parts that I found quite annoying, especially at the end, that could easily be left out without damaging any of the content. To reiterate, I'm glad this was posted and appreciate it being written down; I just think some stylistic changes could have improved it.

FWIW, I strongly encourage and endorse folks engaging with whatever parts of Eliezer's post they want to, without feeling obliged to respond to every single topic or sub-topic or whatever.

(Also, I like your comment and find it helpful.)

(I edited to get my meaning closer to correct)

40.  "Geniuses" with nice legible accomplishments in fields with tight feedback loops where it's easy to determine which results are good or bad right away, and so validate that this person is a genius, are (a) people who might not be able to do equally great work away from tight feedback loops, (b) people who chose a field where their genius would be nicely legible even if that maybe wasn't the place where humanity most needed a genius, and (c) probably don't have the mysterious gears simply because they're rare.  You cannot just pay $5 million apiece to a bunch of legible geniuses from other fields and expect to get great alignment work out of them.  They probably do not know where the real difficulties are, they probably do not understand what needs to be done, they cannot tell the difference between good and bad work, and the funders also can't tell without me standing over their shoulders evaluating everything, which I do not have the physical stamina to do. 

This may well be true, but I think we are well past the stage where this should at least be established by significant empirical evidence. We should at least try, whilst we have the opportunity. I would feel a lot better to be in a world where a serious attempt was made at this project (to the point where I’m willing to personally contribute a significant fraction of my own net worth toward it).  More.

Sounds right to me! I think we should try lots of things.

Also, we can generalise '"Geniuses" with nice legible accomplishments' (e.g. Fields Medalists) to "people who a panel of top people in AGI Safety would have on their dream team". Is there such a list? Who would you nominate to be on the dream team? A great first step would be a survey of top Alignment researchers on this question.

The way I see it is that we need to throw all we have at both AGI Alignment research and AGI governance. On the research front, getting the most capable people in the world working on the problem -- a Manhattan Project for alignment as it were -- seems like our best bet. (On the governance front, it would be something along the lines of global regulation of, or a moratorium on, AGI capabilities research. Or in the extreme, a full-on Butlerian Jihad. That seems harder.)

Didn't they try this? e.g. Kmett 

He's just one person, so I wouldn't say that's significant empirical evidence. Unless a bunch of other people they approached turned them down (and if they did, it would be interesting to know why).

Seems likely to me

I don't think MIRI has tried this much; we were unusually excited about Edward Kmett.

Given that technical AI alignment is impossible, we should focus on political solutions, even though they seem impractical. Running any sufficiently powerful computer system should be treated as launching a nuclear weapon. Major military powers can, and should, coordinate to not do this and destroy any private actor who attempts to do it.

This may seem like an unworkable fantasy now, but if takeoff is slow, there will be a 'Thalidomide moment' when an unaligned but not super-intelligent AI does something very bad and scary but is ultimately stopped. We should be ready to capitalize on that moment and ride the public wave of techno-phobia to put in sensible 'AI arms control' policies.

Technical AI alignment isn't impossible, we just don't currently know how to do it. (And it looks hard.)

I feel somewhat concerned that after reading your repeated writing saying "use your AGI to (metaphorically) burn all GPUs", someone might actually do so, but of course their AGI isn't actually aligned or powerful enough to do so without causing catastrophic collateral damage. At least the suggestion encourages AI race dynamics – because if you don't make AGI first, someone else will try to burn all your GPUs! – and makes the AI safety community seem thoroughly supervillain-y.

Points 5 and 6 suggest that soon after someone develops AGI for the first time, they must use it to perform a pivotal act as powerful as "melt all GPUs", or else we are doomed. I agree that figuring out how to align such a system seems extremely hard, especially if this is your first AGI. But aiming for such a pivotal act with your first AGI isn't our only option, and this strategy seems much riskier than if we take some more time to use our AGI to solve alignment further before attempting any pivotal acts. I think it's plausible that all major AGI companies could stick to only developing AGIs that are (probably) not power-seeking for a decent number of years. Remember, even Yann LeCun of Facebook AI Research thinks that AGI should have strong safety measures. Further, we could have compute governance and monitoring to prevent rogue actors from developing AGI, at least until we solve alignment enough to entrust more capable AGIs to develop strong guarantees against random people developing misaligned superintelligences. (There are also similar comments and responses on LessWrong.)

Perhaps a crux here is that I'm more optimistic than you about things like slow takeoffs, AGI likely being at least 20 years out, the possibility of using weaker AGI to help supervise stronger AGI, and AI safety becoming mainstream. Still, I don't think it's helpful to claim that we must or even should aim to try to "burn all GPUs" with our first AGI, instead of considering alternative strategies.

Quoting Scott Alexander here:

I agree it's not necessarily a good idea to go around founding the Let's Commit A Pivotal Act AI Company.

But I think there's room for subtlety somewhere like "Conditional on you being in a situation where you could take a pivotal act, which is a small and unusual fraction of world-branches, maybe you should take a pivotal act."

That is, if you are in a position where you have the option to build an AI capable of destroying all competing AI projects, the moment you notice this you should update heavily in favor of short timelines (zero in your case, but everyone else should be close behind) and fast takeoff speeds (since your AI has these impressive capabilities). You should also update on existing AI regulation being insufficient (since it was insufficient to prevent you).

Somewhere halfway between "found the Let's Commit A Pivotal Act Company" and "if you happen to stumble into a pivotal act, take it", there's an intervention to spread a norm of "if a good person who cares about the world happens to stumble into a pivotal-act-capable AI, take the opportunity". I don't think this norm would necessarily accelerate a race. After all, bad people who want to seize power can take pivotal acts whether we want them to or not. The only people who are bound by norms are good people who care about the future of humanity. I, as someone with no loyalty to any individual AI team, would prefer that (good, norm-following) teams take pivotal acts if they happen to end up with the first superintelligence, rather than not doing that.

Another way to think about this is that all good people should be equally happy with any other good person creating a pivotal AGI, so they won't need to race among themselves. They might be less happy with a bad person creating a pivotal AGI, but in that case you should race and you have no other option. I realize "good" and "bad" are very simplistic but I don't think adding real moral complexity changes the calculation much.

I am more concerned about your point where someone rushes into a pivotal act without being sure their own AI is aligned. I agree this would be very dangerous, but it seems like a job for normal cost-benefit calculation: what's the risk of your AI being unaligned if you act now, vs. someone else creating an unaligned AI if you wait X amount of time? Do we have any reason to think teams would be systematically biased when making this calculation?

I'm more confident than Scott that the first AGI systems will be capable enough to execute a pivotal act (though alignability is another matter!). And, unlike Scott, I think AGI orgs should take the option more seriously at an earlier date, and center more of their strategic thinking around this scenario class. But if you don't agree with me there, I think you should endorse a position more like Scott's.

The alternative seems to just amount to writing off futures where early AGI systems are highly capable or impactful — giving up in advance, effectively deciding that endorsing a strategy that sounds weirdly extreme is a larger price to pay than human extinction. Phrased in those terms, this seems obviously absurd. (More absurd if you agree with me that this would mean writing off most possible futures.)

Nuclear weapons were an extreme technological development in their day, and MAD was an extreme and novel strategy developed in response to the novel properties of nuclear weapons. Strategically novel technologies force us to revise our strategies in counter-intuitive ways. The responsible way to handle this is to seriously analyze the new strategic landscape, have conversations about it, and engage in dialogue between major players until we collectively have a clear-sighted picture of what strategy makes sense, even if that strategy sounds weirdly extreme relative to other strategic landscapes.

If there's some alternative to intervening on AGI proliferation, then that seems important to know as well. But we should discover that, if so, via investigation, argument, and analysis of the strategic situation, rather than encouraging a mindset under which most of the relevant strategy space is taboo or evil (and then just hoping that this part of the strategy space doesn't end up being relevant).

If someone manages to create a powerful AGI, and the only cost for most humans is that it burns their GPUs, this seems like an easy tradeoff for me. It's not great, but it's mostly a negligible problem for our species. But I do agree using governance and monitoring is a possible option. I'm normally a hardline libertarian/anarchist, but I'm fine going full Orwellian in this domain.

Strongly agreed. Somehow taking over the world and preventing anybody else from building AI seems like a core part of the plan for Yudkowsky and others. (When I asked about this on LW, somebody said they expected the first aligned AGI to implement global surveillance to prevent unaligned AGIs.) That sounds absolutely terrible -- see risks from stable totalitarianism.

If Yudkowsky is right and the only way to save the world is by global domination, then I think we're already doomed. But there are lots of cruxes to his worldview: short timelines, fast takeoff speeds, the difficulty of the alignment problem, the idea that AGI will be a single entity rather than many different systems in different domains. Most people in AI safety are not nearly as pessimistic. I'd much rather bet on the wide range of scenarios where his dire predictions are incorrect.

But this wouldn't be global domination in any conventional sense. When humans implement such things, their methods are extremely harsh and inhibit freedoms on all levels of society. A human-run domination would need to enforce such measures with harsh prison time, executions, fear and intimidation, etc. But this is mostly because humans are not very smart, so they don't know any other way to stop human y from doing x. A powerful AGI wouldn't have this problem. I don't think it would even have to be as crude as "burn all GPUs". It could probably monitor and enforce things so efficiently that trying to create another AGI would be like trying to fight gravity. For a human, it would simply be that you can't achieve it, no matter how many times you try, almost a new rule interwoven into the fabric of reality. This could probably be made less severe with an implementation such as "can't achieve AGI that is above intelligence threshold X" or "poses X amount of risk to population". In this less severe form, humans would still be free to develop AIs that could solve aging, cancer, space travel, etc., but couldn't develop anything too powerful or dangerous.

My spicy take in one paragraph:

Eliezer has a lot of intuitions and ways of thinking that he feels are supported by evidence, but I feel that part of what might be going on here is “public beliefs vs. private beliefs”: Eliezer believes he’s right, but it’s just very hard for most other people to see or understand why he believes this. Ideally, Eliezer would have started building a prediction track record 20 years ago to show that his intuitions and ways of thinking are better than other people's, but he doesn’t appear to have any such record. I do still consider many of his arguments to be strong, and I think the world is very lucky that Eliezer exists. However, I feel I have to evaluate each of his arguments on its own merit rather than deferring to what I see as his appeal to authority (where he’s the authority), because I don't think he has the track record to back up such an appeal.

but he doesn’t appear to have any such record

I want to register a gripe: when Eliezer says that he, Demis Hassabis, and Dario Amodei have a good "track record" because of their qualitative prediction successes, Jotto objects that the phrase "track record" should be reserved for things like Metaculus forecasts.

But when Ben Garfinkel says that Eliezer has a bad "track record" because he made various qualitative predictions Ben disagrees with, Jotto sets aside his terminological scruples and slams the retweet button.

I already thought this narrowing of the term "track record" was weird. If you're saying that we shouldn't count Linus Pauling's achievements in chemistry, or his bad arguments for Vitamin C megadosing, as part of Pauling's "track record", because they aren't full probability distributions over concrete future events, then I worry a lot that this new word usage will cause confusion and lend itself to misuse.

As long as it's used even-handedly, though, it's ultimately just a word. On my model, the main consequence of this is just that "track records" matter a lot less, because they become a much smaller slice of the evidence we have about a lot of people's epistemics, expertise, etc. (Jotto apparently disagrees, but this is orthogonal to the thing his post focuses on, which is 'how dare you use the phrase "track record"'.)

But if you're going to complain about "track record" talk when the track record is alleged to be good but not when it's alleged to be bad, then I have a genuine gripe with this terminology proposal. It already sounded a heck of a lot like an isolated demand for rigor to me, but if you're going to redefine "track record" to refer to a narrow slice of the evidence, you at least need to do this consistently, and not crow some variant of 'Aha! His track record is terrible after all!' as soon as you find equally qualitative evidence that you like.

This was already a thing I worried would happen if we adopted this terminological convention, and it happened immediately.

</end of gripe>

Very many thanks for your responses Rob, you've helped me update. (And hopefully others who might have had thoughts kinda similar to mine will view your responses and update accordingly:))

However, I feel I have to evaluate each of his arguments on its own merit rather than deferring to what I see as his appeal to authority (where he’s the authority)

Eliezer isn't saying "believe me because I'm a trustworthy authority"; just the opposite. Eliezer is explicitly claiming that we're all dead if we base our beliefs on this topic on deference, as opposed to evaluating arguments on their merits, figuring out the domain for ourselves, generating our own arguments for and against conclusions, refining our personal inside views of AGI alignment, etc.

(At least, his claim is that we need vastly, vastly more people doing that. Not every EA needs to do that, but currently we're far below water on this dimension, on Eliezer's model and on mine.)

I'm practically new to AI safety, so reading this post was a pretty intense crash course!

What I'm wondering though, even if we suppose that we can solve all the technical problems to create a completely beneficial, Gaia mother-like AGI which is both super-intelligent and genuinely really wants the best for humanity and the rest of earthlings (or even the whole universe), how can humans themselves even align on:

1. What should be the goals and priorities given limited resources and

2. What should be the reasonable contours of the solution space which isn't going to cause some harm, or since no harm is impossible, what would be acceptable harms for certain gains?

In other words, to my naïve understanding it seems like the philosophical questions of what is "good" and what an AGI should even align to are the hardest bit?

I mean, obviously not obliterating life on Earth is a reasonable baseline but feels a bit low ambition? Or maybe this is just a completely different discussion?

Welcome to the field! Wow, I can imagine this post would be an intense crash course! :-o

There are some people who spend time on these questions. It's not something I've spent a ton of time on, but I think you'll find interesting posts related to this on LessWrong and AI Alignment Forum, e.g. using the value learning tag. Posts discussing 'ambitious value learning' and 'Coherent Extrapolated Volition' should be pretty directly related to your two questions.

Thanks a lot, really appreciate these pointers!

The concrete example I usually use here is nanotech, because there's been pretty detailed analysis of what definitely look like physically attainable lower bounds on what should be possible with nanotech, and those lower bounds are sufficient to carry the point.

It sounds like this is well-traveled ground here, but I'd appreciate a pointer to this analysis.

I assume Eliezer means Eric Drexler's book Nanosystems.

Interesting, thanks. I read Nanosystems as establishing a high upper bound. I don't see any of its specific proposals as plausibly workable enough to use as a lower bound in the sense that, say, a ribosome is a lower bound, but perhaps that's not what Eliezer means.