My Objections to "We’re All Gonna Die with Eliezer Yudkowsky"

Quintin Pope

46 min read · Mar 21, 2023

Comments 21

Toby_Ord

There was nothing with that much sugar, salt, fat combined together as ice cream.

This example has always frustrated me on the grounds that it is clearly false.

Eliezer correctly points out that honey existed in the ancestral environment, and so did gazelles. But if you simply pour honey over animal fat, you get something much higher in both sugar and fat than ice cream:

Ice cream is typically ~20% sugar, ~10% fat. An experiment even found that ice cream optimised for being liked by the eater contains ~15% sugar and ~15% fat.

But honey is 82% sugar and animal fat is 100% fat. A 50-50 mixture is 41% sugar, 50% fat — more than twice as much of each as ice-cream.

Neither is super-salty, but animal fat (0.15%) still contains more salt than ice cream (0.08%), so a 30-70 mixture would have more sugar, fat, and salt. Also, very salty things such as salt deposits (100%) and sea water (3.5%) were sometimes available in the ancestral environment.
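
The arithmetic in this comment can be checked with a short sketch. The percentages are the ones quoted above; honey's fat and salt content are not given in the comment, so treating them as zero is an assumption made only for this illustration, and the names `mix` and `beats_ice_cream` are hypothetical helpers.

```python
# Composition figures (percent by weight) quoted in the comment above.
# Assumption for this sketch only: honey contributes no fat and no salt.
HONEY = {"sugar": 82.0, "fat": 0.0, "salt": 0.0}
ANIMAL_FAT = {"sugar": 0.0, "fat": 100.0, "salt": 0.15}
ICE_CREAM = {"sugar": 20.0, "fat": 10.0, "salt": 0.08}


def mix(honey_share: float) -> dict:
    """Return the composition of a honey/animal-fat blend.

    honey_share is the fraction of the blend (by weight) that is honey.
    """
    fat_share = 1.0 - honey_share
    return {
        k: honey_share * HONEY[k] + fat_share * ANIMAL_FAT[k]
        for k in HONEY
    }


def beats_ice_cream(blend: dict) -> bool:
    """True if the blend exceeds typical ice cream on every component."""
    return all(blend[k] > ICE_CREAM[k] for k in ICE_CREAM)


half_and_half = mix(0.5)   # 50% honey, 50% animal fat
thirty_seventy = mix(0.3)  # 30% honey, 70% animal fat

print(half_and_half, beats_ice_cream(half_and_half))
print(thirty_seventy, beats_ice_cream(thirty_seventy))
```

Under these assumptions the 50-50 blend beats ice cream on sugar and fat but falls just short on salt, while the 30-70 blend exceeds it on all three, which matches the claims in the comment.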

Of course there is something about ice cream that makes us want it more than honey-drenched fat! But I don't know what that is, nor how it relates to AI alignment.

EliezerYudkowsky

Somehow I never thought about it that way. Point conceded.

The analogy survives and if anything becomes more meaningful, but is now harder to explain to a general audience: After training humans exclusively on inclusive genetic fitness, with a correlation in the outer environment to high-calorie foods, humans ended up preferring something that didn't exist in the ancestral environment, lacks correlations to micronutrients that were reliably in ample supply in the ancestral environment and didn't need to be optimized over, has some resemblance to things that were important/scarce like the taste of sugar and salt and fat (if the sugar hasn't been replaced with allulose), but where it ultimately depends on properties like "the ice cream is cold rather than melted" that don't match to anything obvious at a surface glance about the ancestral environment; and on the whole, the thing that starts to max out human tastebuds seems almost impossible to have called in advance by any simple means.

If you want the old form of the analogy, "male humans scrolling Tumblr porn" works (2D images not present in ancestral environment, Coolidge effect superstimulated). Hopefully I or somebody can think of a more general-audiences-friendly transparent example of a superstimulus than that one.

Toby_Ord

I think you have put your finger on a key aspect with the coldness requirement.

When ice cream is melted or coke is lukewarm, they both taste far too sweet. I've long had a hypothesis that we evolved some kind of rejection of foods that taste too sweet (at least in large quantity) and that by cooling them down, they taste less sweet (overcoming that rejection mechanism) but we still get increased reward when the sugar content enters our bloodstream. I feel that carbonation is similar (flat coke tastes too sweet), so that the cold and carbonation could be hacks we've discovered to get around the 'tastes too sweet' defence mechanism, while still enjoying extremely high blood sugar based rewards. (Other forms of bitterness or saltiness added to the sweet foods could be similar.)

More speculative and still requires a few sentences to explain though, so a different example may be best.

Elityre

If this is true, it's fascinating, because it suggests that our preferences for cold and carbonation are a kind of specification gaming!

Michael St Jules 🔸

It could just be attention. If something would otherwise be too sweet, but some other aspect of it is salient (coldness, carbonation, bitterness, saltiness), that aspect will take some of your attention away from the sweetness, and it'll seem less sweet.

Elityre

Why might humans evolve a rejection of things that taste too sweet? What fitness-reducing thing does "eating oversweet things" correlate with? Or is it a spandrel of something else?

mmKALLL

Increased body weight or development of type 2 diabetes, for example?

Dušan D. Nešić (Dushan)

Perhaps the feeling of achievement gained from cookie-clicker games such as FarmVille, which have captured the attention of old and young people alike? Gacha games or online gambling? The opioid epidemic?

David Johnston

Now I want to see how much I like honey-drenched fat

vaniver

I think the 'traditional fine dining' experience that comes closest to this is Peking Duck.

Most of my experience has been with either salt-drenched cooked fat or honey-dusted cooked fat; I'll have to try smoking something and then applying honey to the fat cap before I eat it. My experience is that it is really good but also quickly becomes unbalanced / no longer good; some people, on their first bite, already consider it too unbalanced to enjoy. So I do think there's something interesting here where there is a somewhat subtle taste mechanism (not just optimizing for 'more' but somehow tracking a balance) that ice cream seems to have found a weird hole in.

[edit: for my first attempt at this, I don't think the honey improved it at all? I'll try it again tho.]

Charlie_Guthmann

honey-baked ham is actually well-liked and considered a holiday delicacy by many Americans.

DirectedEvolution

Honey baked ham is 5g fat and 3g sugar/3oz. ~28g = 1oz, so that's 6% fat and 4% sugar, so ice cream is about 5x sugarier and ~2x fattier than honey-baked ham. In other words, for sugar and fat content, honey-drenched fat > ice cream > honey-baked ham. Honey-baked ham is therefore not a modern American equivalent to honey-drenched Gazelle fat, a sentence I never thought I'd write but I'm glad I had the chance to once in my life.
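
For anyone who wants to re-run the conversion, here is a minimal sketch using only the figures quoted in this thread (5 g fat and 3 g sugar per 3 oz of ham, 28 g per ounce, and ice cream at ~20% sugar and ~10% fat). The variable names are illustrative.

```python
GRAMS_PER_OUNCE = 28  # rounded figure used in the comment above

serving_g = 3 * GRAMS_PER_OUNCE        # 3 oz serving in grams
ham_fat_pct = 100 * 5 / serving_g      # 5 g fat per serving
ham_sugar_pct = 100 * 3 / serving_g    # 3 g sugar per serving

ICE_CREAM_SUGAR_PCT = 20  # typical figure quoted earlier in the thread
ICE_CREAM_FAT_PCT = 10

# How many times sugarier / fattier ice cream is than honey-baked ham.
sugar_ratio = ICE_CREAM_SUGAR_PCT / ham_sugar_pct
fat_ratio = ICE_CREAM_FAT_PCT / ham_fat_pct

print(f"ham: {ham_fat_pct:.1f}% fat, {ham_sugar_pct:.1f}% sugar")
print(f"ice cream vs ham: {sugar_ratio:.1f}x sugar, {fat_ratio:.1f}x fat")
```

Without rounding the intermediate percentages, the ratios come out slightly different from the rounded "5x" and "2x", but the ordering honey-drenched fat > ice cream > honey-baked ham is unchanged.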

Toby_Ord

Thanks so much for writing this, Quintin. I found it extremely informative — one of the best things I've read in recent years on the big picture of alignment research.

JulianHazell

Thanks for taking the time to write up your views on this. I'd be keen on reading more posts like this from other folks with backgrounds in ML — particularly those who aren't already in the EA/LessWrong/AIS sphere.

NunoSempere

This was great, thanks

NickLaing

Thanks so much, this was unusually clearly written, with a small percentage of technicality a global health chump like me couldn't understand, but I still could understand most of it. Please write more!

My initial reaction is, let's assume you are right and Alignment is nowhere near as difficult as Yudkowsky claims.

This might not be relevant to your point that alignment might not be so hard, but it seemed like your arguments assume that the people making the AI are shooting for alignment, not misalignment.

For example, your comment: "As far as I can tell, the answer is: don't reward your AIs for taking bad actions."

What if someone does decide to reward it for that? Then do your optimistic arguments still hold? Maybe this is outside the scope of your points!

D_M_x

Thank you Quintin, this was very helpful for me as a non-ML person to understand the other side of Eliezer’s arguments. As your post is quite dense and it took me a while to work through it, I summarised it for myself. I occasionally had to check the context of the original interview (transcript here) to fully parse the arguments made. I thought the summary might also be helpful to share with others (and let me know if I got anything wrong!):

Eliezer thinks current ML approaches won’t scale to AGI, though due to money influx an approach might be found. Quintin is more optimistic that current ML approaches can scale to AGI. As current alignment techniques are focused on current ML approaches, they won’t help if something different gets us to AGI. Current ML capability improvements usually integrate well with previously used alignment approaches, which suggests they will keep doing so.
Eliezer is concerned that AI will show more ‘truly general’ intelligence. Humans are not equally general at different tasks, as evolution made them specialize in what was important in the ancestral environment, and AI might therefore outclass humans in other tasks. Quintin points out that the learning process humans have been given by evolution is pretty general (albeit biased to what was useful in the ancestral environment), just as the learning process current ML paradigms use is pretty general. How different ML systems actually differ isn’t by using different paradigms but by being trained on different data. Therefore he doesn’t expect such a pattern. He also points out that scale is what makes humans smarter, just as scale is a big driver of how good ML systems are. Humans are not any more constrained by their architecture than ML systems; both can modify themselves to an extent.
Eliezer considers a superintelligence to be what can beat all humans at all tasks. Quintin finds this to be a too high bar as you can have transformative systems which will have deficits.
Eliezer points out that mindspace is large and humans occupy a tiny corner, as such we should expect many different potential AI designs which poses danger. Quintin thinks we should expect AI systems to occupy only a small corner in mindspace, similar to humans. An intuition pump for this is that most real-life data in higher dimensions actually only occupies a small part in those. Again, so far in practice ML systems are using pretty similar processes to humans. They will also be trained on data similar to the data humans are “trained on” as ML systems are mostly trained on human-written text which make them more similar to humans as well.
Eliezer thinks it’s hard to align AIs not only to human values but even to much simpler goals like duplicating a strawberry. Quintin again thinks this isn’t actually all that hard in principle, but requires starting out with an AI with more general goals which would then be modified to aim for strawberry duplication. He points out that human value formation follows more general and multiple goals than something as single-minded as strawberry duplication, so we should allow ML systems to follow such a process of value formation. This will also be a lot easier, as ML systems can follow actual examples in the data of such value formation processes, and there is a lot more data on humans following complex goals than single-minded ones.
Eliezer thinks that we won’t be able to align AIs by merely using gradient descent. This is because the primary example of using gradient descent to align a system is evolution and we know that evolution failed to align humans to pursue inclusive genetic fitness in the modern environment. In the ancestral environment, e.g. desiring sexuality was sufficient, but now humans have figured out contraception. People do not desire to maximise their inclusive genetic fitness for its own sake. Quintin thinks this is because ancestral humans didn’t have a concept of inclusive genetic fitness, therefore evolution couldn’t optimise its rewards for improving inclusive genetic fitness directly. Modern AI systems however will have an understanding of human values as they are directly exposed to them during training.
Eliezer makes the same point about humans desiring ice cream. Quintin counters again that there was no ice cream in the ancestral environment, therefore evolution couldn’t punish humans for desiring ice cream. Modern ML researchers however can punish ML systems for doing things they aren’t supposed to, i.e. which are misaligned with human values.
Eliezer thinks aligning AI with gradient descent will be even harder than for evolution to align humans with natural selection as gradient descent is blunter and less simple. Quintin isn’t convinced by this and also points out that evolution was optimising over the learning process via the human genome which will be a lot messier due to its indirectness while ML researchers are training the whole ML system directly. Therefore a comparison doesn’t make much sense.
Eliezer is worried that ML systems trained to predict e.g. human preferences will look for opportunities to make predictions easier. Quintin thinks ML systems aren’t optimising to do well at long-term prediction by making things easier to predict; predicting things is something that ML systems do, not what they want to do. He compares this to humans, who also don’t explicitly prioritise e.g. seeing very well in the long term.
Eliezer considers it important to employ a ‘security mindset’, a term from computer security, for AI alignment. Ordinary paranoia is insufficient for keeping a system secure, some deeper skills are required. Quintin thinks ML is unlike computer security as most fields are unlike computer security and we don’t use a security mindset for most fields including childrearing which seems like an important analogue to training ML systems to him. This is because ML systems during the training process don’t have adversaries to the same extent as computer systems. They might have adversarial users during deployment, but ML systems themselves aren’t keen to be jailbroken. He also uses the opportunity to point out that Eliezer often compares AI to other fields like rocket science, but ML often works in a pretty different way to other fields, e.g. swapping individual components of ML systems often doesn’t change their functionality while changing rocket components would make rockets fail.
Eliezer is concerned that AI optimists haven’t encountered real difficulties yet and that’s why they’re optimistic, the same way that the original AI conference in the 50s thought problems could be solved in two months which took 70 years to solve. Quintin counters that there were plenty of ML problems which were easier than expected, and most notably easier than Eliezer and AI field veterans who have been working on AI since the early days predicted. Neither Eliezer nor the AI veterans expected neural networks to work as well as they do today. He mentions that Eliezer also stated in a different venue that he didn’t expect generative adversarial networks to work right away, yet they did. He expects the hardness of ML research to predict the hardness of ML alignment research and thinks that Eliezer seems to be poorly calibrated on the former, so he will also be on the latter.
Eliezer expects that for AI alignment to go well he will have to be wrong about aspects of AI alignment, but he expects that where he is mistaken about AI alignment this will make AI alignment even harder than he already thinks it is, as it would be really surprising when a new engineering project is easier than you think it is. Quintin strongly disagrees with this framing, because if Eliezer was wrong about how hard alignment is he should expect alignment to be easier than he previously thought.
Eliezer points to how fast AI progress was in the game of Go as a reason for concern that superintelligent AI will suddenly kill humans without killing a somewhat smaller amount of humans in advance. Quintin thinks that Go is disanalogous to a more general AI system as progress in more general systems is usually slower and smoother. Go also had a single objective function AI could use to score itself which will not be true for many other tasks which will require human input slowing down improvements.
Eliezer is even more concerned about AI systems which can self-improve and get smarter during inference (deployment) getting us to fast take off. Quintin counters that we basically already have that. ChatGPT could train on user input; but it’s not programmed to as it wouldn’t be practical. ML training processes could also be changed so they could be reasonably said to self-improve during inference as inference is also a part of training.
Eliezer thinks that people who are capable of breaking AI systems show more AI expertise than people who are merely creating functional AI systems, which is how it works in computer security. This is related to the security mindset claim above. Maybe they’d be able to find ways to improve AI alignment. Quintin thinks the people who break things in computer security are only experts there because in computer security there are clear signs whether the system is broken or not, which isn’t true for AI alignment. He discusses an example where Eliezer thinks an ML system is easily breakable as the ML system will try to maximise the reward function, but Quintin thinks that simply maximizing the reward function isn’t how realistic ML systems work. He discusses another example where he thinks ML systems are not easily broken.

Overall my take: Eliezer is concerned about AI that doesn’t look like modern ML systems. Quintin argues modern ML systems don’t show the properties that Eliezer is concerned about more advanced AI showing. Quintin thinks that more advanced ML systems can already be real AGI. What I am confused about is why Eliezer is then so worried about the current state of AI if the thing he is worried about is so much more advanced/general in mindspace, or more specifically why does he consider current ML systems to be evidence that we are getting closer to the kind of AI he is worried about.

Vasco Grilo🔸

Thanks for the post, Quintin!

However, time and again, we've found that deep learning systems improve more through scaling, of either the data or the model.

Jaime Sevilla from Epoch mentioned here that scaling of compute and algorithmic improvements are each responsible for roughly half of the progress:

roughly historically it has turned out that the two main drivers of progress have been the scaling of compute and these algorithmic improvements. And I will say that they are like 50

Jaime also mentions that data has not been a bottleneck.

The Pipers Son

Most of this stuff is well above my ability to really make a judgement call and at the moment I'm just trying to learn about it. Eliezer does make a lot of sense to me in the abstract though, I feel like I honestly don't understand enough to know if the above rebuttals make sense.

However there does seem to me to be one pretty likely possibility I haven't seen mentioned: there is now a lot of public and political attention on AI. The problem at the moment is it all seems abstract and the dangers seem too sci-fi and unbelievable.

Surely there is a non-trivial chance that in the intervening time between now and true AGI (at which point I agree, if Eliezer is right, the game is already up) something very scary and dangerous will happen because of AI? At that point presumably sufficient political will can be found to implement the required moratorium.

I'm reminded of a TNG episode where a relatively more primitive society is about to be destroyed by some sort of invading force (I forget the specifics) and the people refuse to believe it and just run about squabbling. So Data blows up something with his phaser, and instantly they fall into line and agree. I'm not suggesting it would be that neat and tidy, but you get the idea.

The question to me is, is there necessarily a correlation between how potentially dangerous what AI could cause is, and how intelligent it's become? Because obviously if it's too intelligent it's not going to try something that would cause it to be shut down. But even Eliezer would admit, we're not at AGI yet. And in the meantime...how on earth do we know in this weird world we're now in, somewhere "pre-AGI", that something crazy isn't going to happen?

I'm very likely to be wrong as a noob, but there does seem to be a slight contradiction in there somewhere, that we don't and can't know what AI is going to do as it's this inchoate set of fractions, and yet we can be sure it won't do something silly (at least in the early stages) that gives the game away and results in humans shutting it down.

Now of course that crazy thing in itself is going to have to be bad. People will almost certainly have to die, and not just one or two, for there then to be the political will to act. But that's not literally everyone, or even close.

If I were to sum it up I'd say: Eliezer is fond of saying "we need to get AI alignment right on the first attempt". Sure. But AI also needs to get its attack right on the first attempt, surely? Or, perhaps better, it needs to time that attack such that it holds all the cards if it doesn't work. And sure, it's smarter than us so there's good reason to think it would do that; but I'm smarter than any animal, but if I made a mistake could still end up being killed by them. And again (assuming I'm right about political will etc), it would need to get it right on its first try. Is that really so likely?

It just doesn't seem to me that the chance of that sequence of events is trivial, but I'm happy to be told otherwise and have it explained if I'm being naive, which I probably am. And by the way I'm not naive about the politics that would have to result either; that's where the "thing that happens" would have to be sufficiently terrible. But given the "thing that Eliezer is predicting" is even more terrible, I don't assign it a trivial probability.

And that's before you get onto other real-world things that could interfere with this "inevitable" AGI emergence. What if climate change wins the "humanity destruction" race? That would prevent there being properly-operated data centres at all. Of course it's also a nasty apocalyptic scenario, but only the biggest doomers think humanity will literally end because of it. I'm guessing this has been raised before and Eliezer basically thinks AGI emergence will win the destruction "race"? Again though, it seems difficult to predict that timeframe. So the "near 100%" chance again seems very questionable to me.

rotatingpaguro

I did not like the analogies. They do not seem to make an effort to point at something meaningful, they are superficial.

For example, password salts are added before hashing the passwords. If you switch to adding them after, this makes salting near useless.

You could make the analogy with concatenating the salt to the head or tail of the string, which would be fine.
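
To make the distinction concrete, here is a minimal sketch assuming SHA-256 from Python's standard `hashlib` as a stand-in hash. The function names and the example password are hypothetical, and a real system would use a deliberately slow key-derivation function such as `hashlib.pbkdf2_hmac` rather than a single hash.

```python
import hashlib
import secrets


def hash_salted(password: str, salt: bytes, salt_first: bool = True) -> str:
    """Hash password and salt together, with the salt at the head or tail."""
    pw = password.encode()
    data = (salt + pw) if salt_first else (pw + salt)
    return hashlib.sha256(data).hexdigest()


def hash_then_append(password: str, salt: bytes) -> str:
    """Broken variant: the salt is attached after hashing, so it never
    influences the digest itself."""
    return hashlib.sha256(password.encode()).hexdigest() + salt.hex()


password = "hunter2"  # illustrative password shared by two users
salt_a, salt_b = secrets.token_bytes(16), secrets.token_bytes(16)

# Salt inside the hash input (head or tail): same password, different digests.
head_differs = hash_salted(password, salt_a) != hash_salted(password, salt_b)
tail_differs = (hash_salted(password, salt_a, salt_first=False)
                != hash_salted(password, salt_b, salt_first=False))

# Salt appended after hashing: the digest part is identical for both users,
# so a single precomputed table would still work against every account.
digest_len = hashlib.sha256().digest_size * 2  # length in hex characters
same_digest = (hash_then_append(password, salt_a)[:digest_len]
               == hash_then_append(password, salt_b)[:digest_len])

print(head_differs, tail_differs, same_digest)
```

Swapping head for tail leaves the scheme working, while moving the salt to after the hash removes its effect entirely.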

Randomly adding / subtracting extra pieces to either rockets or cryptosystems is playing with the worst kind of fire, and will eventually get you hacked or exploded, respectively.

"Randomly" doing stuff to a neural network is bad too, you are not doing "random" modifications. I'm not an engineer yet I bet there are tons of modular parts in a rocket.

MattBall

Song: Wilco "You Never Know"
"Come on, children, you’re acting like children
Every generation thinks it’s the end of the world"

https://www.mattball.org/2023/01/climate-activists-are-to-blame-for-some.html

By this, I mostly mean the sorts of empirical approaches we actually use on current state of the art language models, such as RLHF, red teaming, etc.

We can take drugs, though, which maybe does something like change the brain's learning rate, or some other hyperparameters.

Technically it's trained to do decision transformer-esque reward-conditioned generation of texts.

The brain likely includes within-neuron learnable parameters, but I expect these to be a relatively small contribution to the overall information content a human accumulates over their lifetime. For convenience, I just say “connectome” in the main text, but really I mean “connectome + all other within-lifetime learnable parameters of the brain’s operation”.

I expect there are pretty straightforward ways of leveraging a 99% successful alignment method into a near-100% successful method by e.g., ensembling multiple training runs, having different runs cross-check each other, searching for inputs that lead to different behaviors between different models, transplanting parts of one model's activations into another model and seeing if the recipient model becomes less aligned, etc.
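
As a rough illustration of the ensembling idea, here is a minimal sketch of how majority voting across several training runs could shrink the failure rate. It rests on a strong assumption that the footnote does not make: that each run fails independently with the same probability. The function name `majority_failure_prob` and the 1% figure used below are illustrative choices, not anything from the post.

```python
from math import comb


def majority_failure_prob(p_fail: float, n_runs: int) -> float:
    """Probability that a strict majority of n_runs runs are misaligned.

    ASSUMPTION (not from the post): runs fail independently, each with
    probability p_fail. Correlated failures would weaken this result.
    """
    threshold = n_runs // 2 + 1  # smallest number of failures that forms a majority
    return sum(
        comb(n_runs, k) * p_fail**k * (1 - p_fail) ** (n_runs - k)
        for k in range(threshold, n_runs + 1)
    )


if __name__ == "__main__":
    # Compare a single run against ensembles of 3 and 5 runs.
    for n in (1, 3, 5):
        print(n, majority_failure_prob(0.01, n))
```

Under that independence assumption, a 1% per-run failure rate drops to roughly 0.03% with three runs and roughly 0.001% with five. If failures across runs share a common cause, the gain would be smaller, which is presumably why the footnote also lists cross-checking and activation transplants rather than voting alone.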

Some alignment researchers do argue that gradient descent is likely to create such an intelligence - an inner optimizer - that then deliberately manipulates the training process to its own ends. I don't believe this either. I don't want to dive deeply into my objections to that bundle of claims in this post, but as with Yudkowsky's position, I have many technical objections to such arguments. Briefly, they:
- often rely on inappropriate analogies to evolution.
- rely on unproven (and dubious, IMO) claims about the inductive biases of gradient descent.
- rely on shaky notions of "optimization" that lead to absurd conclusions when critically examined.
- seem inconsistent with what we know of neural network internal structures (they're very interchangeable and parallel).
- postulate a network structure that seems like it would fall victim to internally generated adversarial examples.
- don't track the distinction between mesa objectives and behavioral objectives (one can probably convert an NN into an energy function, then parameterize the NN's forward pass as a search for energy function minima, without changing network behavior at all, so mesa objectives can have ~no relation to behavioral objectives).
- seem very implausible when considered in the context of the human learning process (could a human's visual cortex become "deceptively aligned" to the objective of modeling their visual field?).
- provide limited avenues for any such inner optimizer to actually influence the training process.
See also: Deceptive Alignment is <1% Likely by Default
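
The energy-function point in the list above can be made concrete with a toy sketch. This is only the most trivial version of such a conversion, under assumptions of my own: the scalar `network`, its weights, and the quadratic `energy` are all hypothetical stand-ins, and the construction the footnote has in mind may be more subtle. The sketch shows that the same input-output behavior can be computed either directly or as an explicit search for an energy minimum.

```python
import math


def network(x: float) -> float:
    """Toy stand-in for a trained network's ordinary forward pass.

    The weights 1.5 and -0.3 are hypothetical, chosen only for illustration.
    """
    return math.tanh(1.5 * x - 0.3)


def energy(y: float, x: float) -> float:
    """Energy over candidate outputs y; its unique minimum is y = network(x)."""
    return (y - network(x)) ** 2


def forward_as_search(x: float, steps: int = 60, lr: float = 0.25) -> float:
    """Compute the output by gradient descent on the energy over y."""
    y = 0.0
    for _ in range(steps):
        grad = 2.0 * (y - network(x))  # derivative of energy with respect to y
        y -= lr * grad  # each step halves the distance to the minimum
    return y


if __name__ == "__main__":
    for x in (-2.0, 0.0, 0.7, 3.0):
        print(x, network(x), forward_as_search(x))
```

Both parameterizations return the same output up to floating-point error, yet only the second contains anything that looks like an internal optimization process. In this toy case, at least, whether a system "has a mesa objective" depends on how the computation is written down rather than on what it does.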

There's also in-context learning, which arguably does count as 'getting smarter while running in inference mode'. E.g., without updating any weights, LMs can:
- adapt information found in task descriptions / instructions to solving future task instances.
- given a coding task, write an initial plan on how to do that task, and then use that plan to do better on the coding task in question.
- even learn to classify images.
The reason this in-context learning doesn't always lead to persistent improvements (or at least changes) in GPT-4 is that OpenAI doesn't train their models like that.

OpenAI does periodically train its models in a way that incorporates user inputs somehow. E.g., ChatGPT became much harder to jailbreak after OpenAI trained against the breaks people used against it. So GPT-4 is probably learning from some of the times it's run in inference mode.

Unless we actually try the approach and it fails in the way predicted. But that hasn't happened (yet).

This sentence would sound much less weird if John had called them "attractors" instead of "demons". One potential downside of choosing evocative names for things is that they can make it awkward to talk about those things in an emotionally neutral way.

Level	What it does	In humans	In AIs
Top	Configures the learning process	Genome	Training code
Middle	Stores learned information / behaviors	Connectome	Weights
Bottom	Applies stored info to the current situation	Activations	Activations
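
To make the AI column of the table concrete, here is a minimal sketch of the three levels in a toy one-dimensional linear model. Everything in it is an illustrative assumption (the function names `train` and `run`, the learning rate, the epoch count, and the data); it is meant only to show which part of a program plays which role, not to describe how any real system is trained.

```python
def train(data, lr=0.1, epochs=2000):
    """Top level: the training code, which configures the learning process.

    lr and epochs are hypothetical hyperparameters chosen for this toy example.
    """
    w, b = 0.0, 0.0  # Middle level: the weights, which store what is learned
    for _ in range(epochs):
        for x, target in data:
            pred = w * x + b  # Bottom level: the activation for this input
            err = pred - target
            # Plain stochastic gradient descent on squared error.
            w -= lr * err * x
            b -= lr * err
    return w, b


def run(weights, x):
    """Inference: apply the stored weights to a new situation."""
    w, b = weights
    return w * x + b  # an activation again, computed without changing the weights


if __name__ == "__main__":
    toy_data = [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)]  # samples from y = 2x + 1
    weights = train(toy_data)
    print(weights, run(weights, 2.0))
```

The training code never appears in `run`: once learning is done, only the weights carry information forward, and the activations are recomputed for each new input. That mirrors the table's split between what configures learning, what stores it, and what applies it.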
