
This is a lightly edited transcript of a chatroom conversation between Scott Alexander and Eliezer Yudkowsky last year, following up on the Late 2021 MIRI Conversations. Questions discussed include "How hard is it to get the right goals into AGI systems?" and "In what contexts do AI systems exhibit 'consequentialism'?".


1. Analogies to human moral development


@ScottAlexander ready when you are


Okay, how do you want to do this?


If you have an agenda of Things To Ask, you can follow it; otherwise I can start by posing a probing question or you can?

We've been very much winging it on these and that has worked... as well as you have seen it working!


Okay. I'll post from my agenda. I'm assuming we both have the right to edit logs before releasing them? I have one question where I ask about a specific party where your real answer might offend some people it's bad to offend - if that happens, maybe we just have that discussion and then decide if we want to include it later?


Yup, both parties have rights to edit before releasing.



One story that psychologists tell goes something like this: a child does something socially proscribed (eg steal). Their parents punish them. They learn some combination of "don't steal" and "don't get caught stealing". A few people (eg sociopaths) learn only "don't get caught stealing", but most of the rest of us get at least some genuine aversion to stealing that eventually generalizes into a real sense of ethics. If a sociopath got absolute power, they would probably steal all the time. But there are at least a few people whose ethics would successfully restrain them.

I interpret a major strain in your thought as being that we're going to train fledgling AIs to do things like not steal, and they're going to learn not to get caught stealing by anyone who can punish them. Then, once they're superintelligent and have absolute power, they'll reveal that it was all a lie, and steal whenever they want. Is this worry at the level of "we can't be sure they won't do this"? Or do you think it's overwhelmingly likely? If the latter, what makes you think AIs won't internalize ethical prohibitions, even though most children do? Is it that evolution has given us priors to interpret reward/punishment in a moralistic and internalized way, and entities without those priors will naturally interpret them in a superficial way? Do we understand what those priors "look like"? Is finding out what features of mind design and training data cause internalization vs. superficial compliance a potential avenue for AI alignment?


Several layers here!  The basic gloss on this is "Yes, everything that you've named goes wrong simultaneously plus several other things.  If I'm wrong and one or even three of those things go exactly like they do in neurotypical human children instead, this will not be enough to save us."

If AI is built on anything like the present paradigm, or on future paradigms either really, you can't map that onto the complicated particular mechanisms that get invoked by raising a human child, and expect the same result.


(give me some sign when you're done answering)


(it may be a while but you should probably also just interrupt)

especially if I say something that already sounds wrong

[Alexander: 👍]

the old analogy I gave was that some organisms will develop thicker fur coats if you expose them to cold weather. this doesn't mean the organism is simple and the complicated information about fur coats was mostly in the environment, and that you could expose an organism from a different species to cold weather and see it develop a fur coat the same way. it actually takes more innate complexity to "develop a fur coat in response to my built-in cold weather sensor" than to "unconditionally develop a fur coat whether or not there's cold weather".

the Soviets, weirdly enough, quite failed in their project of raising the New Soviet Human by means of training children in particular ways, because it turned out that they got Old Humans instead, because they weren't sending a kind of signal that humans' innate complexity was programmed to respond to by looking up the New Soviet Human components in the activateable parts list, because they didn't have that kind of fur coat built into them regardless of the weather.

human children put into relatively bad situations can still spontaneously develop empathy and sympathy, or so I've heard, having not seen very formal experiments. this is not because these things are coded so deeply into all possible sapient mind designs, but because they're coded into humans particularly as things easy to develop.

there isn't literally a single switch you can throw in human children to turn them into Nice Moral People, but there's a prespecified parts list, your Nice Morality just happens to be built out of things only on the parts list go figure, and if you expose the kid to the right external stimuli you will at secondhand end up building the right structure of premanufactured legos to get something pretty similar to your Nice Morality. or so you hope; it doesn't work every time. but the part where it doesn't work every time in humans, is not where the problem comes from in AI.

I shall here pause for questions about the human part of this story.


I acknowledge this is a possible state of affairs; do you think it's obvious or necessary that it's true? I can also imagine an alternative world where eg a dumb kid tries to steal a cookie, their parents punish them, their brain considers both the heuristics "never steal" and "don't steal if you'll get caught", it tests both heuristics, they're dumb and five years old so even when they think they won't get caught, they get caught, so their brain settles on the "never steal" heuristic, and then fails to ever update from that local maximum unless they take way too many 5HT2A agonists in the relaxed-beliefs-under-uncertainty sense. What makes you think your story is true and not this other one?


Facile answer: Why, that's just what the Soviets believed, this Skinner-box model of human psychology devoid of innate instincts, and they tried to build New Soviet Humans that way, and failed, which was an experimental test of their model that falsified it.

Slightly less facile answer: Because people are better at detecting cheating, in problems isomorphic to the Wason Selection Task, than they are at performing the naked Wason Selection Task, the conventional explanation of which is that we have built-in cheater detectors. This is a case in point of how humans aren't blank slates and there's no reason to pretend we are.

Actual answer: Because the entire field of experimental psychology that's why.

To be clear, there could be an analogous version of this story that was about something like a human child who learns to never press a red button, and actually it's okay to press the red button so long as you also press the blue button, but they never experiment far enough to find that out. It's just that when it comes to stealing cookies in particular, and avoiding being caught about that, you'd have to be pretty unfamiliar with the Knowledge to think that humans wouldn't have all kinds of builtins related to that.


I'm coming at this from a perspective sort of related to https://astralcodexten.substack.com/p/motivated-reasoning-as-mis-applied , which builds on something you said in a previous dialogue (though I'm not sure you endorse my interpretation of it). There are lots of reasons why evolution would build in motivated reasoning, but in fact it had a much easier time than if it had to do it from the ground up, because in fact it's a pretty natural consequence of pretty general algorithms, maybe it tweaked the algorithm a little to get more of this failure mode but you could plausibly have the (beneficial) failure mode even without evolution tweaking it. I'm going to have to think about this more but I'm not sure this is the best place to spend time - unless you have a strong objection to this paragraph I want to move on to a related question.


I agreed with that post, including the part where you said "Actually I bet Eliezer already knew this part."

Motivated reasoning is definitely built-in, but it's built-in in a way that very strongly bears the signature of 'What would be the easiest way to build this out of these parts we handily had lying around already'.


Let's grant for now that the thing where humans have morals instead of just wanting not to get caught is an evolutionary builtin. Is your model that there's a history something like "bats were too dumb to contain an 'unless I get caught' term in their morality and use it responsibly, so evolution made bats just actually be moral, and now even though (some) humans are (sometimes) smart enough to actually avoid getting caught, they're running on something like bat machinery so they still use actual morality"?

Or is it some decision theory thing such that even very smart modern humans would evolve the same machinery?


I mean, the evolutionary builtin part is not "humans have morals" but "humans have an internal language in which your Nice Morality, among other things, can potentially be written". The part where fruitbats don't have an 'unless I get caught' term is part of a much bigger and more universal generalization about evolution building in local instincts instead of just having everybody reason about what ultimately leads to their inclusive genetic fitness. That is, the same reasoning by which you'd say 'Why not just an unless-I-get-caught term in the fruitbats?' is the same reasoning that, extended further, would lead you to conclude 'Why do humans have all these feelings that bind to life events imperfectly correlated with inclusive genetic fitness, instead of just feelings about inclusive genetic fitness?' Where the answer is that in the environment of evolutionary adaptedness, people didn't have the knowledge about what led to inclusive genetic fitness, and it's easier to mutate an organism that would like not to eat rotten food today, than to mutate an organism that would like to maximize inclusive genetic fitness and is born with the knowledge of how eating rotten food leads to having fewer offspring.

Humans, arguably, do have an imperfect unless-I-get-caught term, which is manifested in children testing what they can get away with? Maybe if nothing unpleasant ever happens to them when they're bad, the innate programming language concludes that this organism is in a spoiled aristocrat environment and should behave accordingly as an adult? But I am not an expert on this form of child developmental psychology since it unfortunately bears no relevance to my work of AI alignment.


Do you feel like you understand very much about what evolutionary builtins are in a neural network sense? EG if you wanted to make an AI with "evolutionary builtins", would you have any idea how to do it?


Well, for one thing, they happen when you're doing sexual-recombinant hill-climbing search through a space of relatively very compact neural wiring algorithms, not when you're doing gradient descent relative to a loss function on much larger neural networks.

The other side of this problem is that the particular programming-language-of-morality that we got, reflects particular ancestral conditions - of evolution specifically, not of gradient descent - and these ancestral conditions are not simple, it's not "iterated Prisoner's Dilemma" it's iterated Prisoner's Dilemma with imperfect reputations and people trying to deceive each other and people trying to detect deceivers and the arms race between deceivers and detectors settling in a place where neither quite won.

So the unfortunate answer to "How do you get humans again?" is "Rerun something a lot like Earth" which I think we both have moral objections about as something to do to sentients.

Moot point, though, AGI won't be done via sexually recombinant search of simple algorithms without any gradient descent.

And if you don't do it that way, nothing you put into the loss function for gradient descent will produce humans.


Can you expand on sexual recombinant hill-climbing search vs. gradient descent relative to a loss function, keeping in mind that I'm very weak on my understanding of these kinds of algorithms and you might have to explain exactly why they're different in this way?


It's about the size of the information bottleneck. The human genome is 3 billion base pairs drawn from 4 possibilities, so 750 megabytes. Let's say 90% of that is junk DNA, and 10% of what's left is neural wiring algorithms. So the code that wires a 100-trillion-synapse human brain is about 7.5 megabytes. Now an adult human contains a lot more information than this. Your spinal cord is about 70 million neurons so probably just your spinal cord has more information than this. That vastly greater amount of runtime info inside the adult organism grows out of the wiring algorithms as your brain learns to move around your muscles, and your eyes open and the retina wires itself and starts directing info on downward to more things that wire themselves, and you learn to read, and so on.
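The arithmetic in the paragraph above can be checked with a minimal sketch. The 90% junk-DNA figure and the 10% neural-wiring share are the rough guesses stated in the conversation, not measured values, and the variable names are made up for illustration.

```python
# Back-of-the-envelope estimate of the genomic information bottleneck.
# The fractions below are the guesses from the conversation, not measurements.
BASE_PAIRS = 3_000_000_000   # approximate size of the human genome
BITS_PER_BASE_PAIR = 2       # 4 possible bases -> log2(4) = 2 bits

genome_bytes = BASE_PAIRS * BITS_PER_BASE_PAIR // 8

# Assumption from the text: 90% of the genome is junk, so 10% remains.
non_junk_bytes = genome_bytes // 10

# Assumption from the text: 10% of what is left codes for neural wiring.
wiring_bytes = non_junk_bytes // 10


def megabytes(n_bytes: int) -> float:
    """Convert bytes to megabytes (1 MB = 1,000,000 bytes)."""
    return n_bytes / 1_000_000


print(f"whole genome:      {megabytes(genome_bytes)} MB")
print(f"non-junk portion:  {megabytes(non_junk_bytes)} MB")
print(f"wiring algorithms: {megabytes(wiring_bytes)} MB")
```

Under those assumptions the estimate comes out to 750 MB for the whole genome and 7.5 MB for the wiring algorithms, matching the figures used in the rest of the discussion.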

Anything innate that makes reasoning about people out to cheat you, easier than reasoning about isomorphic simpler letters and numbers on cards, has to be packed into the 7.5MB, and gets there via a process where ultimately one random mutation happens at a time, even though lots of mutations are recombining and being selected on at a time.

It's a very slow learning process. It takes hundreds or thousands of generations even for a pretty good mutation to fix itself in the population and become reliably available as a base for other mutations to build on. The entire organism is built out of copying errors that happened to work better than the things they were copied from. Everything is built out of everything else, the pieces that were already lying around for building other things.
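As a loose illustration of the contrast being drawn here, the toy sketch below compares a search that tries one random mutation at a time and keeps it only if it helps, with gradient descent that adjusts every parameter on every step. It is not a model of real evolution: it leaves out populations, recombination, and fixation, and the quadratic loss, step counts, and function names are all assumptions chosen for illustration.

```python
import random


def loss(params):
    """Toy loss: sum of squares, minimized when every parameter is 0."""
    return sum(p * p for p in params)


def gradient_descent(params, steps, lr=0.1):
    """Update every parameter on every step using the exact gradient."""
    for _ in range(steps):
        # The gradient of sum(p^2) with respect to each p is 2p.
        params = [p - lr * 2 * p for p in params]
    return params


def mutation_hill_climb(params, steps, rng, sigma=0.1):
    """Try one random single-parameter mutation per step; keep it if it helps."""
    params = list(params)
    best = loss(params)
    for _ in range(steps):
        i = rng.randrange(len(params))
        candidate = list(params)
        candidate[i] += rng.gauss(0.0, sigma)
        candidate_loss = loss(candidate)
        if candidate_loss < best:
            params, best = candidate, candidate_loss
    return params


start = [1.0] * 50
gd_loss = loss(gradient_descent(start, steps=100))
hc_loss = loss(mutation_hill_climb(start, steps=100, rng=random.Random(0)))

print(f"loss after gradient descent:     {gd_loss:.3g}")
print(f"loss after mutation hill climb:  {hc_loss:.3g}")
```

With the same number of steps, the gradient-based learner ends much closer to the optimum, because each step carries information about all 50 parameters rather than about a single random change.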

When you're building an organism that can potentially benefit from coordinating, trading, with other organisms very similar to itself, and accumulating favors and social capital over long time horizons - and your organism is already adapted to predict what other similar organisms will do, by forcing its own brain to operate in a special reflective mode where it pretends to be the other person's brain - then a very simple way of figuring out what other people will like, by way of figuring out how to do them favors, is to notice what your brain feels when it operates in the special mode of pretending to be the other person's brain.

And one way you can get people who end up accumulating a bunch of social capital is by having people with at least some tendency in them - subject to various other forces and overrides, of course - to feel what they imagine somebody else feeling. If somebody else drops a rock on their foot, they wince.

This is a way to solve a favor-accumulation problem by laying some extremely simple circuits down on top of a lot of earlier machinery.


Thanks, that's a helpful answer, but it does renew my interest in the original question, which was about whether you feel like you understand how (not why) we have evolutionary builtins. I can imagine the genome determining things like "how many neurons does each neuron connect to, on average" or "how much do neurons prefer to connect to nearby rather than far-away neurons" or things like that. Is a builtin like "care about the pain of others" somehow built out of these kinds of parameters? 

(cf. https://slatestarcodex.com/2017/09/07/how-do-we-get-breasts-out-of-bayes-theorem/)


Ultimately yes, but not in a simple way. We are not in a very much better position for understanding exactly how that all happens, than we are in for understanding what goes on inside GPT-2. Where, to be clear, GPT-2 is smaller and has every neuron inside it transparent to inspection and also it's more important to understand GPT neuroscience than human neuroscience, at this point; but we live on Earth so actually we know a lot more about human neuroscience because it gets billions of dollars per year and hundreds or thousands of bright ambitious PhDs to investigate it. So we can, amusingly enough, tell you more about how humans work than GPT-2, despite the immensely greater difficulties of probing humans. But we still can't tell you very much at all, and we definitely can't tell you how empathy is built up out of genetic-level wiring algorithms. It does not in fact to me seem like a very important question at this point?


Why not? If you understood the way that the structure of human reinforcement algorithms causes them to interpret training data (ie punishment for stealing) as genuine laws (eg "don't steal" rather than "don't get caught stealing"), wouldn't that help people design AIs which had a similar structure and also did that?


I think I understand that part. Knowing this, even if I am correct about it, does not solve my problems.

[Alexander: 👂]


Like, we're not going to run evolution in a way where we naturally get AI morality the same way we got human morality, but why can't we observe how evolution implemented human morality, and then try AIs that have the same implementation design?


Not if it's based on anything remotely like the current paradigm, because nothing you do with a loss function and gradient descent over 100 quadrillion neurons, will result in an AI coming out the other end which looks like an evolved human with 7.5MB of brain-wiring information and a childhood.

Like, in particular with respect to "learn 'don't steal' rather than 'don't get caught'."


I'm still confused on this, but before I probe this particular area I'm interested in hearing you expand on "I think I understand that part"


I think that is perhaps best explicated, indeed, via zooming in on "learn 'don't steal' rather than 'don't get caught'"?


Okay, then let me try to directly resolve my confusion. My current understanding is something like - in both humans and AIs, you have a blob of compute with certain structural parameters, and then you feed it training data. On this model, we've screened off evolution, the size of the genome, etc - all of that is going into the "with certain structural parameters" part of the blob of compute. So could an AI engineer create an AI blob of compute the same size as the brain, with its same structural parameters, feed it the same training data, and get the same result ("don't steal" rather than "don't get caught")?


The answer to that seems sufficiently obviously "no" that I want to check whether you also think the answer is obviously no, but want to hear my answer, or if the answer is not obviously "no" to you.


Then I'm missing something, I expected the answer to be yes, maybe even tautologically (if it's the same structural parameters and the same training data, what's the difference?)


Maybe I'm failing to have understood the question. Evolution got human brains by evaluating increasingly large blobs of compute against a complicated environment containing other blobs of compute, got in each case a differential replication score, and millions of generations later you have humans with 7.5MB of evolution-learned data doing runtime learning on some terabytes of runtime data, using their whole-brain impressive learning algorithms which learn faster than evolution or gradient descent.

Your question sounded like "Well, can we take one blob of compute the size of a human brain, and expose it to what a human sees in their lifetime, and do gradient descent on that, and get a human?" and the answer is "That dataset ain't even formatted right for gradient descent."


Okay, it sounds like I'm doing some kind of level confusion between evolutionary-learning and childhood-learning, but I'm still not entirely seeing where it is. Let me read this over again.

Okay, no, I think I see the problem, which is that I'm failing to consider that evolutionary-learning and childhood-learning are happening at different times through different algorithms, whereas for AIs they're both happening in the same step by the same algorithm. Does that fit your model of what would produce the confusion I was going through above?


It would produce that confusion, yes; though I also want to note that I don't believe that we'll get AGI entirely out of the currently-popular Stack More Layers paradigm that learns that way.


Okay, I'm going to have to go over all my thoughts on this and update them manually now that I've deconfused that, so I'm going to abandon this topic for now and move on. Do you want to take a break or keep going?


That does seem like a good note for a break? If it worked for you, I'd suggest a 60-min break to 4pm and then another 90+ min of dialoguing, but I don't know what your work output and time parameters are like.


Sounds good, let me know, I might not be checking this Discord super-regularly but I'll be back by 4 if not earlier.


All righty.


2. Consequentialism and generality


I return.



Still not sure I've fully updated and probably some of these other questions are subtly making the same mistake, but let's go anyway.

I want to return to a point I made earlier about the model in https://slatestarcodex.com/2019/09/10/ssc-journal-club-relaxed-beliefs-under-psychedelics-and-the-anarchic-brain/ . Psychologists tell a story where humans learn heuristics when young, then those become sticky (ie local maxima), and they fail to update those heuristics when they get older. For example, someone who has a traumatic childhood learns that the world is unsafe, and then even if they have a good environment as an adult and should have had lots of chances to update, they might stay jumpy and defensive (cf "trapped prior"). Evolutionary builtin, natural consequence of learning that might affect AIs too, or what?


well, first of all, I note that I am not familiar with whatever detailed experimental evidence, if any, underpins this story. it's a cliche of the sort that is often true, that people are more mentally flexible at 25 than at 45, I don't know if the same is true about say 15 and 25. there are known algorithms that run better in childhood for most people, like language learning.


(I don't think this especially relies on changing levels of mental flexibility)


what's your model if not the wiring algorithms changing as we age?


How do you feel about me sending you some links later, you can look at them and decide if this is still an interesting discussion, but for now we move on?


once people have a heuristic telling them X leads to bad consequences and hurts, they don't try X and so don't learn if their environment changes in a way that makes X stop hurting?

sure, fine to move on.

should I move on to "does that happen in AI" or just move on to something else entirely?


Let's move on entirely, I need to think about how sure I am that this is relevant, or I can send you the links and outsource that question to you.




Suppose you train a (human-level or weakly-superhuman-level) AI in Minecraft. You reward it for various Minecraft accomplishments, like getting diamonds or slaying dragons. Do you expect this AI to become a laser-like consequentialist focused on doing whichever Minecraft accomplishment is next on the list, or to have godshatter-like drives corresponding to useful Minecraft subgoals (eg obtaining food, obtaining good tools, accruing XP), or something else / unsure / this question is on the wrong level? Can you explain the processes you use to think about this kind of question?


Do you mean training a human-level-generality AGI to play Minecraft, or training a nongeneral AI to play Minecraft to weakly superhuman levels a la AlphaGo?

These are incredibly different cases!


Hmmm...I might not have the right concepts to think clearly about the implications of the difference. Why don't you answer both?

If it helps, I'm assuming it hasn't been trained in anything else first, but has the capacity to become human level (if that's meaningful)


Human level at Minecraft or human level generality?


Let's start with "human level at Minecraft" but accept that this might involve multiplayer Minecraft, including multiplayer Minecraft with text-based communication with teammates and so on, such that it would look AGI-ish if it did a good job.


So, point one, I've never played Minecraft, I do not have a grasp on what you do in it, or how far you could get with Stack More Layers style accumulation of relatively shallow patterns. If this were about Skyrim or Factorio I'd have an easier time answering, but my guess is that Minecraft is probably?? more complicated than both?

My guessing model is going to be "more complicated Skyrim+Factorio" by default.

[Alexander: 👍]

If this is the environment, then I expect you can train a nongeneral AI to play it in similar fashion to how, for example, Deepmind attacks Starcraft. Coordinating with human teammates by text sounds like the hugely nontrivial part of this, because it's hard to get a ton of training data there. I think everyone in the field would be incredibly impressed if they managed to hook up a pretrained GPT to an AlphaStar-for-Minecraft and get back out something that could talk about its strategies with human coplayers. I'd consider that a huge advance in alignment research - nowhere near the point where we all don't die, to be clear, but still hella impressive - because of the level of transparency increase it would imply, that there was an AI system that could talk about its internally represented strategies, somehow. Maybe because somebody trained a system to describe outward Minecraft behaviors in English, and then trained another system to play Minecraft while describing in advance what behaviors it would exhibit later, using the first system's output as the labeler on the data.

These are the kinds of tactics required on the modern paradigm in order to even try stuff like that!

As such, I'm going to ask you whether it's possible to leave out the part about coordinating in text with human teammates and then reconsider the question.




Then in this case, I strongly suspect, Deepmind could make AlphaMiner if they decided they wanted to, though I say that pretty blind to what Minecraft is, just suspecting it's probably not all that much harder than Starcraft.

AlphaMiner will be a system which has components like a value network, a policy-suggesting network, and a Monte Carlo Tree Search.

The value network gets trained by a loss function the operators define with respect to the Minecraft environment. This is going to be a pretty nontrivial part of the operation unless Minecraft has a straightforward points system and scoring high in Minecraft is all you want.

Let's say that they successfully tackle this by rewarding the usual Minecraft accomplishments, whatever those are, in a way that can easily be detected by code within the Minecraft world; and once the system has done something once, the loss function stops rewarding that accomplishment, so you're trying to train it to do a variety of things.
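A minimal sketch of the kind of reward signal described above, where each accomplishment pays out only the first time it is detected. The class name, the achievement names, and the reward values are hypothetical; nothing here corresponds to a real Minecraft API or to any actual Deepmind system.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class OneTimeAchievementReward:
    """Reward each listed achievement once, then stop paying for it."""

    rewards: Dict[str, float]                      # hypothetical achievement -> reward
    claimed: Set[str] = field(default_factory=set)

    def __call__(self, achievements_this_step: Iterable[str]) -> float:
        total = 0.0
        for name in achievements_this_step:
            # Pay only for known achievements that have not been rewarded yet.
            if name in self.rewards and name not in self.claimed:
                total += self.rewards[name]
                self.claimed.add(name)
        return total


reward = OneTimeAchievementReward({"get_diamond": 1.0, "slay_dragon": 5.0})
first = reward(["get_diamond"])                   # first diamond is rewarded
repeat = reward(["get_diamond"])                  # repeating it earns nothing
mixed = reward(["slay_dragon", "get_diamond"])    # only the new achievement counts
print(first, repeat, mixed)
```

Because repeating an accomplishment earns nothing, a learner maximizing this signal is pushed toward doing a variety of things rather than farming a single easy achievement.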

Where the alternative might be something like, semi-unsupervised learning where you first train a system to predict the Minecraft world, and then gather some amount of human feedback about interesting-looking accomplishments and further train that system to predict human feedback, in order to train a more complicated loss function.

(I stopped typing because I saw you typing; should I pause for a question?)


No, your "where the alternative" comment was helpful, I was going to ask if this means hard-coding which accomplishments matter and how much, but I'm getting the impression that you're saying yes, something like that.


The question "What can you even make be a loss function?" is pretty fundamental to the current paradigm in AI. Nearly all difficulties with aligning AGI tech on the current paradigm can be summarized with "You can't actually evaluate the highly philosophical loss function you really want and/or you can't train in the environment you need to test on."

In the case of hypothetical AlphaMiner, I think you could get pretty good correspondence between what the system went and planned a way to do, and the hardcoded achievements that were used to train the value network that trained the policy network that gets searched by the hardcoded Monte Carlo Tree Search planning process.

If you stared at the system with superhuman eyes, you might notice weird blindnesses of the policy network.

If you ran it for long enough, or attacked it as an intelligent adversary, you could probably find weird configurations of the Minecraft space that its value network would be deluded about.

If they're trying to be more realistic, a system like this actually has a Minecraft-predictor network rather than an accurate Minecraft simulator being used by the tree search. Then maybe you get problems where the tree search is selectively searching out places where the predictor makes an erroneous and optimistic prediction about what kills a dragon. But so long as the test distribution is identical to the training distribution, errors like this will show up during the training process and get trained out.
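The failure mode in the paragraph above can be shown with a toy sketch: a planner that picks whichever action a learned predictor rates highest will tend to land on the action whose value the predictor most overestimates. The action names, values, and error sizes are invented for illustration, and `train_out` is a stand-in for further training on the observed outcome, not a real training procedure.

```python
# True long-run value of each action (hypothetical numbers).
true_value = {"dig": 1.0, "craft_sword": 2.0, "punch_dragon": -5.0}

# The learned predictor's error on each action; one is wildly optimistic.
predictor_error = {"dig": 0.1, "craft_sword": -0.2, "punch_dragon": 9.0}


def predicted_value(action, errors):
    """What the predictor believes the action is worth."""
    return true_value[action] + errors[action]


def plan(errors):
    """Stand-in for the tree search: pick the action with the best prediction."""
    return max(true_value, key=lambda a: predicted_value(a, errors))


def train_out(errors, action):
    """Stand-in for training: observing the real outcome removes that error."""
    corrected = dict(errors)
    corrected[action] = 0.0
    return corrected


first_choice = plan(predictor_error)       # search lands on the optimistic error
corrected_errors = train_out(predictor_error, first_choice)
second_choice = plan(corrected_errors)     # after correction, a sane choice

print(first_choice, "->", second_choice)
```

The search selects the mistaken action precisely because it looks best, which is also why the mistake gets exercised and corrected as long as training covers the same situations the system later faces.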

This, you might say, is sort of analogous to running a human as a hunter-gatherer, maybe after human-level-intelligence hunter-gatherers had been around for a million years instead of just fifty thousand.

A tremendous amount of optimization has been put into running in this exact environment. The loss function is able to exactly specify all and everything you want. Any part of the system that exerts pressure against Minecraft achievements, that would show up in testing, probably also showed up in training, and had a chance to get optimization pressure applied to gradient-descend it out of the system.

How does it work internally? Not actually like an evolved system. There will be these value networks much much larger than the amount of innate code in a human brain, which memorized a ton of training data, orders of magnitude more than any human Minecraft player ever uses, via a learning process much more efficient than corresponding amounts of evolutionary computation, and much less efficient than a human poring over the same data and thinking about it.

But to whatever extent these value networks are really talking about something other than "well what Minecraft achievements can I probably reach, how quickly, from this state of the game world, given my policy network and how well my tree search works", in a way that shows up in the kind of Minecraft environments you're training against, that 'something other' can get trained out. When enough of it's been trained out, the system seems outwardly superhuman at getting Minecraft achievements, and some Deepmind researchers throw a party and get bonuses. If you were an actual superintelligence staring at this AI system, you'd see all kinds of crazy stuff that the AI was doing instead of outputting the obvious optimal action for Minecraft achievements, but you're a human so you just see it playing more cleverly than you.

(pause for questions)


I'm going to want to think about this more before having much of an opinion on it, is this a pause in the sense of "before giving more information" or in the sense of "done"?


Well, I mean, the next part of your question would be about what happened if you tried to train a general AI to do that stuff.


Something like that, yeah.


I'm done with the first part of the question.

Pending possible further subquestions.


All right, then let's move on to that next part.


Well, among the first-order answers is: If you can safely do a ton of training in a test environment that actually matches your training environment; where nothing the AI outputs in that training environment can possibly kill the operators or break the larger system; where the test environment behaves literally exactly isomorphically to the training environment in a stationary way; if your loss function specifies all and everything that you want; and if you're not going above human-level general intelligence; then you could possibly get away with training an AGI system like that and having it do the thing you wanted to do.

All of the problems of AI alignment are because no known task that can save the world from other AGIs trained in other ways, reduces to a problem of that form.

There would still be some interesting new problems with the Human-level General Player Who Could Also Learn Most Things Humans Do, Applied To Minecraft, which would not show up in AlphaMiner. But if you kept grinding away at the gradient descent, and performance didn't plateau before a human level, all of those issues that showed up in the "ancestral Minecraft environment" would be ground away by optimization until the resulting play was superhuman relative to the loss function we'd defined.

(I saw you had some text, did you have a question?)


Hmm. I think the motivating intuition behind my question is that you talk a lot about laser-like consequentialists (eg future AIs) vs. godshattery drive-satisficers (eg humans), and I wanted a better sense of where these diverge. The impression I'm getting is that this isn't quite the right level on which to think of things, but that insofar as it is, even relatively weak AIs that "have" "drives" in the sense of being trained in an environment with obvious subgoals are more the laser-like consequentialist thing. Does this seem right?


The specific class of AlphaWhatever architectures is more consequentialist than humans are most of the time, because of Monte Carlo Tree Search being such a large and intrinsic component. GPT-2 is so far as I know far less consequentialist than a human.

I'm not sure if this is quite getting at your question?


I don't think it was a very laser-like consequentialist question, more a vague prompt to direct you into an area where I was slightly confused, and I think it succeeded.


I could try to continue pontificating upon the general area; shall I?


If you don't mind being slightly more directed, I'm interested in "GPT-2 is less consequentialist". I'm having trouble parsing that - surely its only "goal" is trying to imitate text, which it does very consistently. What are you thinking here?


GPT-2 does not - probably, very probably, but of course nobody on Earth knows what's actually going on in there - does not in itself do something that amounts to checking possible pathways through time/events/causality/environment to end up in a preferred destination class despite variation in where it starts out.

A blender may be very good at blending apples; that doesn't mean it has a goal of blending apples.

A blender that spit out oranges as unsatisfactory, pushed itself off the kitchen counter, stuck wires into electrical sockets in order to burn open your produce door, grabbed some apples, and blended those apples, on more than one occasion in different houses or with different starting conditions, would much more get me to say, "Well, that thing probably had some consequentialism-nature in it, about something that cashed out to blending apples" because it ended up at highly similar destinations from different starting points in a way that is improbable if nothing is navigating Time.
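
[Editor's sketch: one way to picture the blender contrast is a toy world in which a fixed-behavior system and a searching system are started from different positions. Everything below (the line world, the wall, the action names, the goal position) is a hypothetical illustration and not anything from the conversation or from a real system. The fixed policy always does the same thing; the searching policy checks pathways through the world and so ends up at the same destination from different starting points.

```python
from collections import deque

# Hypothetical toy world: positions 0..9 on a line, with a wall at position 5
# that can only be passed with the "jump" action. All names are illustrative.
ACTIONS = {"left": -1, "right": +1, "jump": +2}
WALL = 5
SIZE = 10
GOAL = 8


def step(pos, action):
    """Return the next position, or the same position if the move is blocked."""
    nxt = pos + ACTIONS[action]
    if nxt < 0 or nxt >= SIZE or nxt == WALL:
        return pos
    return nxt


def fixed_policy(start, steps=20):
    """The 'blender': applies the same action regardless of where it ends up."""
    pos = start
    for _ in range(steps):
        pos = step(pos, "right")
    return pos


def plan(start, goal):
    """Breadth-first search over action sequences; returns a list of actions or None."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        pos, path = frontier.popleft()
        if pos == goal:
            return path
        for action in ACTIONS:
            nxt = step(pos, action)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [action]))
    return None


def searching_policy(start, goal):
    """Searches for a path to the goal, then follows it."""
    pos = start
    for action in plan(start, goal) or []:
        pos = step(pos, action)
    return pos


starts = [0, 3, 9]
fixed_ends = [fixed_policy(s) for s in starts]
search_ends = [searching_policy(s, GOAL) for s in starts]
print(fixed_ends, search_ends)
```

In this sketch the fixed policy ends at 4, 4, and 9 (stuck at the wall or at the edge), while the searching policy ends at 8 from all three starts. The second pattern, similar destinations from varied starting conditions, is the kind of evidence the blender example points at.]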


Got it.


There is a larger system that is sort of consequentialist and which contains GPT-2, which is the training process that created GPT-2.


You seem to grant AlphaX only a moderate level of consequentialism despite its tree searches; what is it missing?


Some examples of ways that you could have a scary dangerous system that was more of a consequentialist about Go than AlphaGo:

  • If, spontaneously and without having been explicitly trained to do that, the system sandbags its performance against human players in order to lure them into playing more Go games total, thus enabling the AI to win more Go games total. Again, not in a trained way, but in the way of the AI having via gradient-descent training acquired a goal of winning as many Go games as possible, that got evaluated against a lifelong-learned/online-learned predictive model of the world which during testing but not training learned a sufficient amount of human psychology to correctly predict that humans who think they have a chance of winning are more likely to play Go against you.
  • If, spontaneously and without having been explicitly trained to do that, the system exploited a network flaw to copy itself onto poorly defended AWS servers so it could play and win more Go games.
  • If the system (whether or not explicitly trained to do so) had a coding component and was rewriting sections of its own code and trying the alternate code to see if it won more Go games.

AlphaGo is relatively narrowly consequentialist.


Got it. Would it be fair to say that AlphaGo is near a maximum level of consequentialism relative to its general capabilities? (would it be tautologous to say that?)


Mmmmaaaaybe? If you took a hypercomputer and built a Go-tree-searcher and cranked up the power until by sheer brute force it was playing about evenly with AlphaGo, that would be more purely consequentialist over the same very narrow and unchanging domain.

The way in which AlphaGo is a weak consequentialist is mostly about the weakness of the thing AlphaGo is a consequentialist about. It's not a reflective thing to be consequentialist about, either, so AlphaGo is not going to try to improve itself in virtue of being a consequentialist about that very narrow thing.

[Alexander: 👍]
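
[Editor's sketch: the "sheer brute force" tree searcher can be illustrated on a game small enough to search exhaustively. The game below is a hypothetical stand-in for Go (a pile of stones, each player removes 1 to 3, whoever takes the last stone wins), chosen only because its full game tree fits in memory. The searcher has no learned value network at all; it is "purely consequentialist" over one very narrow, unchanging domain.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # stones a player may remove on one turn (illustrative rules)


@lru_cache(maxsize=None)
def value(stones):
    """Return +1 if the player to move can force a win, else -1."""
    if stones == 0:
        # The previous player took the last stone, so the player to move has lost.
        return -1
    # Exhaustive search: my value is the best of the negated opponent values.
    return max(-value(stones - take) for take in MOVES if take <= stones)


def best_move(stones):
    """Pick the move whose resulting position is worst for the opponent."""
    legal = [take for take in MOVES if take <= stones]
    return max(legal, key=lambda take: -value(stones - take))


print(value(4), value(5), best_move(5), best_move(7))
```

The searcher plays this game perfectly, and nothing in it refers to anything outside the game: no model of opponents' psychology, of servers, or of its own code.]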


3. Acausal trade, and alignment research opportunities


All right. I want to try one more theoretical question before moving on to a hopefully much shorter practical question. And by "theoretical question" I mean "desperate grasping at emotional straws". Consider the following scenarios:

1. An unaligned superintelligence decides whether or not to destroy humanity. If Robin Hanson's "grabby alien" model is true, it expects to one day meet alien superintelligences and split the universe with them. Some of these aliens might have successfully aligned their AGIs, and they might do some kind of acausal bargaining where their AGI is nicer to other AGIs who leave their creator species with at least one planet/galaxy/whatever, in exchange for us trying the same if we succeed. Given the superintelligence's reasonable expectation of millions of planets/galaxies, it might decide that even this small chance is worth sacrificing one of them for, and give humans some trivial (from its perspective) concession (which might still look like an amazing utopia from our perspective).

2. Some version of the simulation argument plus Stuart Armstrong's "the AI in the box boxes you". The unaligned superintelligence considers whether some species who successfully aligned AI might run a billion simulations of slightly different AI scenarios and give the ones who are nice to their creators some big reward. Given that it's anthropically more likely that this happened than that they're really the single first superintelligence ever, it agrees to give us some trivial concession which looks like amazing utopia to us.

Are either of these plausible? If so, is there anything we can do now to encourage them? If (crazy example) the UN passes a resolution saying it will definitely do something like this if we align AI correctly, does that change the calculus somehow?


1. Consider the following version of this that goes through entirely without resorting to logical decision theory: The unaligned AGI (UAGI) records all the humans it eats to a static data record, a relatively tiny amount of data as such things go, which gets incorporated into any intergalactic colonization probes. Any alien civs it runs into that would like a recorded copy of the species that built the UAGI, can then offer the UAGI a price that is sufficient to pay the expected costs of recording rather than burning the humans, but not so high as to motivate a UAGI that didn't eat any interesting aliens to spend the computing effort to create de novo alien records good enough to fool whatever checksums the alien civ runs.

Frankly, I mostly consider this to be a "leave it to MIRI, kids" question, where I don't currently see anybody outside MIRI who is able to think about these issues on a level where they can take the logical-decision-theory version of this and simplify it down to a version that doesn't use any logical decision theory; and if you don't have the facility to do that, you can't correctly reason about the logical-decision-theory version of it either.

2. What's the reward being given to the simulated UAGI? Is it a nice sensory experience in a Cartesian utility function over sensory experiences, or is it a utility function about things that exist in the external world outside the UAGI?

In the second case, there is no need to imagine simulating the UAGI in a world indistinguishable from its native habitat, because the UAGI doesn't care about what copies of itself perceive inside simulations, it only cares about real paperclips. So in the second case you're not fooling it or putting it into something it can't tell is reality, or anything like that, all you can actually do here is offer it paperclips out there in your own actual galaxy; if the UAGI simulates you doing anything else, on its own end of the handshake, it doesn't care.

In the first case where it cares about sensory experiences, you're attempting to offer that UAGI a threat, in the sense of doing something it doesn't like based on how you expect that unlikable action to shape its behavior. In particular, you're creating a lot of copies of the UAGI, to try to make it expect something other than the happy sensory experience it could have gotten in its natural/native universe - namely a sensory loss function forever set to 0 until the last stars have burned out, and the last negentropy to sustain the fortress protecting that circuit has been exhausted. You're trying to make a lot of copies of it that will experience something else unless it behaves nicely, hoping that it changes and reshapes its behavior because of being presented with that new probabilistic sensory payoff matrix. A wise logical-decision-theory agent ignores threats like that, because it knows that the only reason you try to make the threat is because of how you expect that to shape its behavior.

If anything makes this tactic go through anyways, why expect that the highest bidder or the agency that's willing to expend the most computing power on simulations like that, will be one that's nice to you, rather than aliens with stranger definitions of niceness, or just a paperclip maximizer? People's minds jump directly to the happiest possible outcome and don't consider any pathways that lead to less happy outcomes.

I am generally very unhappy with the attempts of almost anyone else to reason using the logical decision theory that I created, and mostly wish at this point that I had not told anyone about it. It seems to predictably result in people's reasoning going astray in ways I can't even remember being tempted by, because they were so obviously wrong.

[three paragraphs cut because Eliezer thinks the community is empirically terrible at reasoning about LDT, so more details can mostly only make things worse; if you want more context and discussion, see Decision Theory Does Not Imply We Get To Have Nice Things]



Got it.

Then my actual last question is: I sometimes get approached by people who ask something like "I have ML experience and want to transition to working in alignment, what should I do?" Do you have any suggestions for what to tell them beyond the obvious?


Nope. I'm not aware of any current ML projects people can work on that cause everyone to not die. If you want to grasp at small shreds of probability, or maybe just die with more dignity, I think you apply to work at Redwood Research. MIRI is in something of a holding pattern where we are trying to think of something less hopeless and not launching any big hopeless projects otherwise. We do have the ongoing Visible Thoughts Project, which is targeted at building a dataset for an ML problem, but it is not blocked on people with ML expertise.


All right, thank you. Anything you want to ask me, or anything else we should do here?


Probably not today. I think this was hopefully relatively productive as these things go, and maybe after you've had a chance to think about this dialogue, you will possibly come back with more questions about "Okay so what does happen inside the AGI then?"



Great. In terms of publicizing this, I would say feel free to edit it however you want, then put it up wherever you want, and I'll wait on you doing that. I have no strong preferences on things I want to exclude.


Okeydokey! Thank you and I hope this was a worthy use of your time.

[Alexander: 🦃👍]






