
[content: discussing AI doom. I'm sceptical about AI doom, but if dwelling on this is anxiety-inducing for you, consider skipping this post]

I’m a cause-agnostic (or more accurately ‘cause-confused’) EA with a non-technical background.  A lot of my friends and writing clients are extremely worried about existential risks from AI. Many believe that humanity is more likely than not to go extinct due to AI within my lifetime.  

I realised that I was confused about this, so I set myself the goal of understanding the case for AI doom, and my own scepticisms, better. I did this by (very limited!) reading, writing down my thoughts, and talking to friends and strangers (some of whom I recruited from the Bountied Rationality Facebook group - if any of you are reading, thanks again!) Tl;dr: I think there are good reasons to worry about extremely powerful AI, but I don’t yet  understand why people think superintelligent AI is highly likely to end up killing everyone by default.


Why I'm writing this


I’m writing up my current beliefs and confusions in the hope that readers will be able to correct my misconceptions, clarify things I’m confused about, and link me to helpful resources. I also personally enjoy reading other EAs’ reflections about cause areas: e.g. Saulius' post on wild animal welfare, or Nuño's sceptical post about AI risk. This post is far less well-informed, but I found those posts valuable because of their reasoning transparency more than their authors' expertise. I'd love to read more posts by ‘layperson’ EAs talking about their personal cause prioritisation.

I also think that 'confusion' is an underrepresented intellectual position. At EAGx Cambridge, Yulia Ponomarenko led a great workshop on ‘Asking daft questions with confidence’. We talked about how EAs are sometimes unwilling to ask questions that would make them less confused for fear that the questions are too basic, silly, “dumb”, or about something they're already expected to know.

This could create a false appearance of consensus about cause areas or world models. People who are convinced by the case for AI risk will naturally be very vocal, as will those who are confidently sceptical. However, people who are unsure or confused may be unwilling to share their thoughts, either because they're afraid that others will look down on them for not already understanding the case, or just because most people are less motivated to write about their vague confusions than their strong opinions. So I’m partly writing this as representation for the ‘generally unsure’ point of view.

Some caveats: there’s a lot I haven’t read, including many basic resources. And my understanding of the technical side of AI (maths, programming) is extremely limited. Technical friends often say ‘you don’t need to understand the technical details about AI to understand the arguments for x-risk from AI’. But when I talk and think about these questions, it subjectively feels like I run up against a lack of technical understanding quite often.

Where I’m at with AI safety

Tl;dr: I'm concerned about certain risks from misaligned or misused AI, but I don’t understand the arguments that AI will, by default and in absence of a specific alignment technique, be so misaligned as to cause human extinction (or something similarly bad.)

 

Convincing (to me) arguments for why AI could be dangerous

 

Humans could use AI to do bad things more effectively
 

For example, politicians could use AI to wage devastating wars on their enemies, or CEOs could use it to increase their profits in harmful or reckless ways. This seems like a good reason to regulate AI development heavily and/or to democratise AI control, so that it’s harder for powerful people to use AI to further entrench their power.
 

We don’t know how AIs work, and that’s worrying


AIs are becoming freakishly powerful really fast. The capabilities of Midjourney, Gato, GPT-4, AlphaFold and more are staggering. It’s worrying that even AI developers don’t really understand how this happens. Interpretability research seems super important.

 

AI is likely to cause societal upheaval


For example, AI might replace most human jobs over the next decades. This could lead to widespread poverty and unrest if politicians manage this transition badly. It could also cause a crisis in meaning; humans could no longer derive their self-worth or self-esteem from their 'usefulness' or creative talents.

 

We could surrender too much control to AIs


I find Andrew Critch’s 'What multipolar failure looks like' somewhat convincing: one story for how AI dooms us is that humans gradually surrender more and more control over our economic system to efficient, powerful AIs, and those who resist are outcompeted. Only when it's too late will we realise that the AIs have goals in conflict with our own.

 

AIs of the future will be massively more intelligent and powerful than us


People sometimes say ‘as we are to ants, so will AI be to us’ (or to paraphrase Shakespeare 'as flies to wanton boys are we to th'AIs; they kill us for their sport'). I haven’t thought deeply about this, but it’s prima facie plausible to me, and the crux of my confusion is not whether future AIs will be capable of wreaking massive destruction - at least eventually.

 

All of this convinces me that EAs should take AI risk very seriously. It makes sense for people to fund and work on AI safety. 

I’m still not sure why superintelligent AI would be existentially dangerous by default 


However, many people have concerns that go further than the arguments above. Many think that superintelligent AI is likely to end up killing humans autonomously. This will happen (they argue) because the AI will be inadvertently trained to have some arbitrary goal for which killing all humans is instrumentally useful: for example, humans might interfere with the AI’s terminal goal by switching it off. ‘You can’t make coffee if you’re dead’.

I’m confused about this argument. I’m not exactly ‘sceptical’ or in disagreement; I’m just not sure that I can pass the ideological Turing test for people who believe this.

My confusion is related to:

  • what AI goals or aims "are", and how they form
  • in what way an AI would be an agent
  • how AIs are trained or learn in the first place

Why wouldn’t AI learn constrained, complex, human-like goals?

Naively, it seems as if killing everyone would earn AI a massive penalty in training: why would it develop aims that are consistent with doing that?

My own goals include constraints such as ‘don’t murder anyone to achieve this, obviously?!’ I’m not assuming that any sufficiently-intelligent AI would necessarily have goals like this: I buy that even a superintelligent AI could have a simple, dumb goal. (In other words, I buy the orthogonality thesis). But if future AIs are trained like current ones are - by being given vast amounts of human-derived data - I’d naively expect AI goals to have the human-like property of being fuzzy, complex and constrained - even if somewhat misaligned with the trainers’ intentions.

People often point out that existing AIs are sometimes misaligned: for example, Bing’s chatbot recently made the news for threatening users who talked about it being hacked. An AI system that was trained to complete a virtual boat race learned to game the specification by going round and round in circles crashing into targets, rather than completing the course as intended. People say that we humans are misaligned with evolution's 'aims': we were 'trained' to have sex for reproduction, but we thwart that 'aim' by having non-reproductive sex.

But in all these cases, the misaligned behavior is pretty similar to the intended, aligned one. We can understand how the misalignment happened. Evolution did 'want' us to have sex; we just luckily managed to decouple sex from reproduction. 'Go round and round in a circle knocking over posts' is not wildly different from 'go round a course knocking over posts'. 'Interact politely by default but adversarially when challenged' is not a million miles from 'interact politely always'; Bing was aggressive in contexts when humans would also be aggressive. (It's not as if users were like 'what's the capital of France?' and Bing was spontaneously like 'f*** off and die, human!')

So there still seems an inferential leap from 'existing systems are sometimes misaligned' to 'superintelligent AI will most likely be catastrophically misaligned'.

AI aims seem likely to conflict with dangerous instrumentally-convergent goals


AIs are likely to seek power and resist correction (the argument goes) because these goals are instrumentally useful for a wide range of terminal goals (instrumental convergence). This is true, but they aren’t useful for all terminal goals. Power-seeking, wealth-seeking, and self-protection are all instrumentally useful unless your goals include not having power, not having wealth, and not resisting human interference.

(I expect this is a common ‘why can't we just X’ objection and already has a standard label, but if not I propose ‘why not just make your AI a suicidal communist bottom’.)

Now you might say ‘well sure, but an AI that systematically avoids having power is going to be pretty useless: why would anyone develop that?’

 (When I told my partner this idea, they laughed at the idea of an AI that was maximally rewarded for switching off, and therefore just kept being like ‘nope’ every time it was powered up)

But I think these arguments also apply to 'killing all humans'. 'Killing all humans' is instrumentally useful for most goals - except all the goals that involve NOT killing all the humans, i.e., any goal that I'd naively expect an AI to extrapolate from being trained on billions of human actions. 

 

Some more fragmentary questions

  • power and survival are instrumentally convergent for humans too, but not all humans maximally seek these things (even if they can). What will be different about AI? (In The Hitchhiker's Guide to the Galaxy, Douglas Adams joked that actually, dolphins are more intelligent than humans, and the reason that they don't dominate the planet is simply that chilling out in the ocean is much more fun)
  • according to the orthogonality thesis, you can highly-intelligently pursue an extremely dumb goal - fair enough. But I’m not sure how AI would come to understand ‘smart’ human goals without acquiring those goals, or something at least vaguely similar to those goals (i.e., goals not involving mass murder). This is because the process by which the AI is “motivated” to understand the smart goal is the same training process by which it acquires goals for itself. (I notice my lack of technical understanding is constraining my understanding, here).

I’m not sure whether these are all different confusions, or different angles on the same confusion. All of this feels like it's in the same area, to me. I’d love to hear people’s thoughts in the comments. Feel free to send me resources that address these points. Also, as I said above, I’d love to read other people’s own versions of this post, either about AI, or about other cause areas.

I’m currently working as a freelance writer and editor. If you have a good idea for a post but don’t have the time, ability or inclination to write it up, get in touch. Thanks to everyone who has given their time and energy to discuss these questions with me over the past few months.


Comments (16)

Naively, it seems as if killing everyone would earn AI a massive penalty in training: why would it develop aims that are consistent with doing that?

An AI killing everyone wouldn't earn a massive penalty in training, because there won't be humans alive in that scenario to assign the penalty. Cf. point 10 in AGI Ruin:

10.  You can't train alignment by running lethally dangerous cognitions, observing whether the outputs kill or deceive or corrupt the operators, assigning a loss, and doing supervised learning.  On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions.  (Some generalization of this seems like it would have to be true even outside that paradigm; you wouldn't be working on a live unaligned superintelligence to align it.)  This alone is a point that is sufficient to kill a lot of naive proposals from people who never did or could concretely sketch out any specific scenario of what training they'd do, in order to align what output - which is why, of course, they never concretely sketch anything like that.  Powerful AGIs doing dangerous things that will kill you if misaligned, must have an alignment property that generalized far out-of-distribution from safer building/training operations that didn't kill you.  This is where a huge amount of lethality comes from on anything remotely resembling the present paradigm.  Unaligned operation at a dangerous level of intelligence*capability will kill you; so, if you're starting with an unaligned system and labeling outputs in order to get it to learn alignment, the training regime or building regime must be operating at some lower level of intelligence*capability that is passively safe, where its currently-unaligned operation does not pose any threat.  (Note that anything substantially smarter than you poses a threat given any realistic level of capability.  
Eg, "being able to produce outputs that humans look at" is probably sufficient for a generally much-smarter-than-human AGI to navigate its way out of the causal systems that are humans, especially in the real world where somebody trained the system on terabytes of Internet text, rather than somehow keeping it ignorant of the latent causes of its source code and training environments.)

A related topic is Distinguishing Test From Training.

But if future AIs are trained like current ones are - by being given vast amounts of human-derived data - I’d naively expect AI goals to have the human-like property of being fuzzy, complex and constrained

I think you need to say more about what the system is being trained for (and how we train it for that). Just saying "facts about humans are in the data" doesn't provide a causal mechanism by which the AI acts in human-like ways, any more than "facts about clouds are in the data" provides a mechanism by which the AI role-plays being a cloud.

If there's a lot more human data than meteorological data on clouds in the training data, then I guess I'd consider it more likely that the AI has goals that are somehow human-related and/or imitative-of-humans, than that the AI has goals that are cloud-related or imitative-of-clouds? But I don't expect the resultant minds to be all that human-like or all that cloud-like, and if there are resemblances, they could be bad ones and not just good ones.

"Constrained" is a particularly hard target to hit because it actively pushes against producing impressive SotA-advancing results. I'd expect the first AGI systems to be built by labs that are pushing full steam ahead on making crazy impressive things happen ASAP, which means you're actively optimizing against minds that are trying to limit their impact, intelligence, or power.

(Also, I think training AGI systems by giving them vast amounts of human-derived data is a terrible idea, and cuts out many of the most promising tactics for aligning AGI systems. But that's maybe a topic to save for after we agree about whether you just get human-ish values for free by exposing alien minds or mind-building-processes to lots of facts about humans.)

So there still seems an inferential leap from 'existing systems are sometimes misaligned' to 'superintelligent AI will most likely be catastrophically misaligned'.

Note that there's a difference between 'how much goal overlap is there' and 'how catastrophic is the non-overlap'. You gave an argument that human goals overlap some with the goals of evolution, but you didn't give an argument that humans are non-catastrophic from the (pseudo-)perspective of evolution. That would depend on whether humans will produce lots of copies of human DNA in the future.

Power-seeking, wealth-seeking, and self-protection are all instrumentally useful unless your goals include not having power, not having wealth, and not resisting human interference.

Yep! The orthogonality thesis doesn't just show that unfriendly goals are possible; it shows that friendly goals are possible too.

power and survival are instrumentally convergent for humans too, but not all humans maximally seek these things (even if they can). What will be different about AI?

Some good discussion of this in Superintelligent AI is necessary for an amazing future, but far from sufficient, and in Niceness is unnatural.

An AI killing everyone wouldn't earn a massive penalty in training, because there won't be humans alive in that scenario to assign the penalty.

Humans need not be around to give a penalty at inference time, just like how GPT4 is not penalized by individual humans, but that the reward is learned / programmed. Even if all humans are sleeping / dead today, GPT can run inference according to the reward we preprogrammed. They are not doing pure online learning.

This is where a huge amount of lethality comes from on anything remotely resembling the present paradigm.  Unaligned operation at a dangerous level of intelligence*capability will kill you; so, if you're starting with an unaligned system and labeling outputs in order to get it to learn alignment, the training regime or building regime must be operating at some lower level of intelligence*capability that is passively safe, where its currently-unaligned operation does not pose any threat.

It is a logical fallacy to account for future increases in capabilities but not future advances in safety research. You're claiming AGI will be an x-risk based on scaling current capabilities only, but you're failing to scale safety. Generalization to unsafe scenarios is a situation we want to write tests for before deploying in situations where they may occur. Phased deployment should help test whether we can generalize to increasingly harder situations.

I'd expect the first AGI systems to be built by labs that are pushing full steam ahead on making crazy impressive things happen ASAP, which means you're actively optimizing against minds that are trying to limit their impact, intelligence, or power

The recent push for productization is making everyone realize that alignment is a capability. A gaslighting chatbot is a bad chatbot compared to a harmless helpful one. As you can see currently, the world is rolling out AI deployment in phases, fixing the bugs, then iterating.

You gave an argument that human goals overlap some with the goals of evolution, but you didn't give an argument that humans are non-catastrophic from the (pseudo-)perspective of evolution. That would depend on whether humans will produce lots of copies of human DNA in the future.

Humans are unaligned in various ways, and it looks like a lot of AIs will be deployed in the future, many aligned to different objectives. I'm skeptical of MIRI's modeling of risk because y'all only talk about one super-powerful, godlike AGI, but y'all haven't modeled multiple companies, multiple AGIs, multiple deployments. Unlike the single-AGI picture, this is the most likely scenario, yet it frequently goes unmentioned in forecasting. Future compute is going to be distributed among these AGIs too, so in many ways we end up at something akin to a modern society of humans.

Yep! The orthogonality thesis doesn't just show that unfriendly goals are possible; it shows that friendly goals are possible too.

Then why the overemphasis/obsession on doom scenario? It makes for a great robot-uprising scifi story but is unscientific. If you approximate the likelihood of future scenarios as a gaussian distribution, wiping out all humans is so extreme and so far out in the tail that it is less likely than almost any other scenario in the set, and the least likely scenario in that set has a probability that approaches zero, given that the infinite set of possibilities sums to 1.0. Given that the number of possibilities is infinite, the likelihood of any one possibility is vanishingly small. The scenario of unaligned AGIs jerking each other off in a massive orgy for eternity is as likely as the one where they wipe out humans (more likely, accounting for resistance to the latter scenario).

The recent push for productization is making everyone realize that alignment is a capability. A gaslighting chatbot is a bad chatbot compared to a harmless helpful one. As you can see currently, the world is rolling out AI deployment in phases, fixing the bugs, then iterating.

While that's one way to look at it, another way is to notice the arms race dynamics and how every major tech company is now throwing LLMs into the public head over heels even when they still have some severe flaws. Another observation is that e.g. OpenAI's safety efforts are not very popular among end users, given that in their eyes these safety measures make the systems less capable/interesting/useful. People tend to get irritated when their prompt is answered with "As a language model trained by OpenAI, I am not able to <X>", rather than feeling relief over being saved from a dangerous output.

As for your final paragraph, it is easy to say "<outcome X> is just one out of infinite possibilities", but you're equating trajectories with outcomes. The existence of infinite possibilities doesn't really help when there's a systematic reason that causes many or most of them to have human extinction as an outcome. Whether this is actually the case or not is of course an open and hotly debated question, but just claiming "it's just a single point on the x axis so the probability mass must be 0" is surely not how you get closer to an actual answer.
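
A toy illustration of the trajectory/outcome distinction, with entirely made-up numbers: the function name `outcome_probability`, the count of 100,000 equally likely trajectories, and the rule that 3 in every 10 of them end in the same outcome are all assumptions chosen for the example, and say nothing about actual AI risk. The only point is that each individual trajectory can be arbitrarily improbable while the outcome they share is not.

```python
from fractions import Fraction

def outcome_probability(n_trajectories, leads_to_outcome):
    """Return (probability of one trajectory, total probability of the outcome).

    Assumes, purely for illustration, that all trajectories are equally likely.
    """
    per_trajectory = Fraction(1, n_trajectories)
    hits = sum(1 for t in range(n_trajectories) if leads_to_outcome(t))
    return per_trajectory, per_trajectory * hits

# Hypothetical rule: trajectories whose index ends in 0, 1 or 2 share one outcome.
per_trajectory, outcome_total = outcome_probability(100_000, lambda t: t % 10 < 3)

print("probability of any single trajectory:", per_trajectory)
print("probability of the shared outcome:   ", outcome_total)
```

Making the set of trajectories larger shrinks the first number towards zero but leaves the second unchanged, which is why "one point among infinitely many" does not settle the question either way.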

why the overemphasis/obsession on doom scenario?

Because it is extremely important that we do what we can to avoid such a scenario. I'm glad that e.g. airlines still invest a lot in improving flight safety and preventing accidents even though flying is already the safest way of traveling. Humanity is basically at this very moment boarding a giant AI-rplane that is about to take off for the very first time, and I'm rather happy there's a number of people out there looking at the possible worst case and doing their best to figure out how we can get this plane safely off the ground rather than saying "why are people so obsessed with the doom scenario? A plane crash is just one out of infinite possibilities, we're gonna be fine!".

Humans need not be around to give a penalty at inference time, just like how GPT4 is not penalized by individual humans, but that the reward is learned / programmed. Even if all humans are sleeping / dead today, GPT can run inference according to the reward we preprogrammed. They are not doing pure online learning.

I was also confused by this at first. But I don't think Rob is saying "an AI that learned 'don't kill everyone' during training would immediately start killing everyone as soon as it can get away with it", I think he's saying "even if an AI picks up what seems like a 'don't kill everyone' heuristic during training, that doesn't mean this heuristic will always hold out-of-distribution". In particular, undergoing training is a different environment than being deployed, so picking up a "don't kill everyone in training (but do whatever when deployed)" heuristic is just as good during training as "don't kill everyone ever", but the former allows the AI more freedom to pursue its other objectives when deployed.

(I'm hoping Rob can correct me if I'm wrong and/or you can reply if I'm mistaken, per Cunningham's Law.)

(I'm in a similar position to Amber: Limited background (technical or otherwise) in AI safety and just trying to make sense of things by discussing them.)

Re: "I think you need to say more about what the system is being trained for (and how we train it for that). Just saying "facts about humans are in the data" doesn't provide a causal mechanism by which the AI acts in human-like ways, any more than "facts about clouds are in the data" provides a mechanism by which the AI role-plays being a cloud."

The (main) training process for LLMs is exactly to predict human text, which seems like it could reasonably be described as being trained to impersonate humans. If so, it seems natural to me to think that LLMs will by default acquire goals that are similar to human goals. (So it's not just that "facts about humans are in the data", but rather that state-of-the-art models are (in some sense) being trained to act like humans.)

I can see some ways this could go wrong – e.g., maybe "predicting what a human would do" is importantly different from "acting like a human would" in terms of the goals internalised; maybe fine-tuning changes the picture; or maybe we'll soon move to a different training paradigm where this doesn't apply. And of course, even if there's some chance this doesn't happen (even if it isn't the default), it warrants concern. But, naively, this argument still feels pretty compelling to me.

the (main) training process for LLMs is exactly to predict human text, which seems like it could reasonably be described as being trained to impersonate humans

"Could reasonably be described" is the problem here. You likely need very high precision to get this right. Relatively small divergences from human goals in terms of bits altered suffice to make a thing that is functionally utterly inhuman in its desires. This is a kind of precision that current AI builders absolutely do not have.

Worse than that, if you train an AI to do a thing, in the sense of setting a loss function where doing that thing gets a good score on the function, and not doing that thing gets a bad score, you do not, in general, get out an AI that wants to do that thing. One of the strongest loss signals that trains your human brain is probably "successfully predict the next sensory stimulus". Yet humans don't generally go around thinking "Oh boy, I sure love successfully predicting visual and auditory data, it's so great." Our goals have some connection to that loss signal, e.g. I suspect it might be a big part of what makes us like art. But the connection is weird and indirect and strange. 

If you were an alien engineer sitting down to write that loss function for humans, you probably wouldn't predict that they'd end up wanting to make and listen to audio data that sounds like Beethoven's music, or image data that looks like van Gogh's paintings. Unless you knew some math that tells you what kind of AI with what kind of goals you get if you train on a given loss function over a given dataset.

The problem is that we do not have that math. Our understanding of what sort of thinky-thing with what goals comes out at the end of training is close to zero. We know it can score high on the loss function in training, and that's basically it. We don't know how it scores high. We don't know why it "wants" to score high, if it's the kind of AI that can be usefully said to "want" anything. And we can't tell whether it is that kind of AI, either.

With the bluntness of the tools we currently possess, the goals that any AGI we make right now would have would effectively be a random draw from the space of all possible goals. There are some restrictions on where in this gigantic abstract goal space we would sample from: for example, the AI can't want trivial things that lead to it just sitting there forever doing nothing, because then it would be functionally equivalent to a brick and have no reason to try and score high on the loss function in training, so it would be selected against. But it's still an incredibly vast possibility space.

Unfortunately, humans and human values are very specific things, and most goals in goal space make no mention of them. If a reference to human goals does get into the AGI's goals, there's no reason to expect that it will get in there in the very specific configuration of the AGI wanting the humans to get what they want.

So the AGI gets some random goal that involves more than sitting around doing nothing, but probably isn't very directly related to humans, any more than humans' goals are related to correctly predicting the smells that enter their noses. The AGI will then probably gather resources to achieve this goal, and not care what happens to humans as a consequence. Concretely, that may look like earth and the solar system getting converted into AGI infrastructure, with no particular attention paid to keeping things like an oxygen rich atmosphere around. The AGI knows that we would object to this, so it will make sure that we can't stop it. For example, by killing us all.

If you offered it passage off earth in exchange for leaving humanity alone, it would have little reason to take that deal. That's leaving valuable time and a planet's worth of resources on the table. Humanity might also make another AGI some day, and that could be a serious rival. On the other hand, just killing all the humans is really easy, because they are not smart enough to defend themselves. Victory is nigh guaranteed. So it probably just does that.

Thanks, this is helpful.

The fact that humans can't assign negative penalties if they're dead is a good point.

I think you need to say more about what the system is being trained for (and how we train it for that).

I'm definitely just drawing analogies from my (imperfect) understanding of how LLMs/art AIs work here. How do you assume that AI labs will (try to) train more agenty and/or more superintelligent AIs?
 

I think training AGI systems by giving them vast amounts of human-derived data is a terrible idea, and cuts out many of the most promising tactics for aligning AGI systems
 

How would you do it instead?

Re 'constraint', that's maybe the wrong word: I meant less that AIs would limit their impact, more...like, if I was close-to-omnipotent, I wouldn't maximize/optimize for just one thing, but probably lots of things that I value. You could frame this as me maximizing my utility, but my point is, it wouldn't look like paperclip maximizing.  AIs might not be like humans, but humans are the most intelligent thing we know of, so it doesn't seem ridiculous to suppose that complex/intelligent entities tend to have complex/multiple goals. 

I'll check out the links you suggest!

If you haven't read this piece by Ajeya Cotra, Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover, I would highly recommend it. Some of the posts on AI alignment here (aimed at a general audience) might also be helpful.

Thanks, I'll check out the Cotra post. I've skimmed some of the Cold Takes posts and not found where he addresses the specific confusions I have above.

“But I’m not sure how the AI would come to understand ‘smart’ human goals without acquiring those goals”

The easiest way to see the flaw with this reasoning is to note that by inserting a negative sign in the objective function we can make the AI aim for the exact opposite of what it would otherwise do. In other words, having x in the training data doesn’t show that the AI will seek x rather than avoid x.
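
A minimal toy sketch of the sign-flip point, not taken from any real training setup. The names `score`, `score_grad` and `train`, the target value 3.0, the learning rate and the clipping range are all assumptions made for illustration. Both runs use exactly the same information about where the target is; the only difference is the sign in front of the objective.

```python
def score(x, target=3.0):
    """Toy objective: highest (zero) when x equals the target."""
    return -(x - target) ** 2

def score_grad(x, target=3.0):
    """Derivative of score with respect to x."""
    return -2.0 * (x - target)

def train(sign, x=0.0, lr=0.1, steps=200, lo=-10.0, hi=10.0):
    """Gradient ascent on sign * score, keeping x inside [lo, hi]."""
    for _ in range(steps):
        x += lr * sign * score_grad(x)
        x = max(lo, min(hi, x))  # clip so the 'avoider' cannot run off to infinity
    return x

seeker = train(sign=+1.0)   # optimises the objective as written
avoider = train(sign=-1.0)  # same objective with a minus sign in front

print(f"sign=+1 ends at x={seeker:.3f}, score={score(seeker):.3f}")
print(f"sign=-1 ends at x={avoider:.3f}, score={score(avoider):.3f}")
```

With these settings the first run settles at the target and the second is pushed to the edge of the allowed range, as far from the target as it can get on that side.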


I originally also included a second argument, but I'm now confused about why I was saying this, as it doesn't really seem analogous. It seems to be showing that occurring in the data != represented in the network function when I need to show: represented in the network != optimised by the objective function.

Second argument: It can also ignore x. We can imagine an AI with lots of colour data trying to identify the shape of dark objects on a white background. In this case, if the objective function only rewards correct guesses and punishes incorrect ones, there’s no incentive for the network to learn to represent colour vs. darkness, assuming colour is uncorrelated with the shape.

The point about inserting a negative sign is good (though then there's a question of who inserts that sign - homicidal maniacs? Is it put in by accident?)

Re the colour example, this seems disanalogous (unless I misunderstand) because if the AI is correctly identifying human-aligned actions and performing them, that means it understands our goals to the extent we care about. Like maybe not all of them, and maybe it doesn't understand all the abstruse reasons why we care about various things, but I don't really care about that (I care about people not being killed or harmed).

There was an example where some group accidentally performed a large run where they trained the AI to be maximally offensive rather than minimally offensive.

Actually, rereading I don't really know where I was going with the color example. I think I probably messed up as you said.

You could also imagine a situation in which a property is defined by a PCA component and is hence not robust to inversion, because PCA components are only unique up to multiplication by a scalar.
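
A small pure-Python sketch of that ambiguity, under stated assumptions: the five data points are made up, the helper names (`covariance_2d`, `top_component`, `project`, `variance`) are hypothetical, and the leading component is found by simple power iteration rather than by any particular library's PCA routine. Once a component is normalised to unit length, the remaining freedom is its sign: the direction v and the direction -v explain exactly the same variance, but every projected score flips.

```python
import math

def covariance_2d(points):
    """Sample mean and covariance entries (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def top_component(cov, iters=500):
    """Unit-length leading eigenvector of the covariance, via power iteration."""
    sxx, sxy, syy = cov
    v = (1.0, 0.5)  # arbitrary starting direction
    for _ in range(iters):
        w = (sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)
    return v

def project(points, mean, v):
    """Scores of the centred points along direction v."""
    return [(p[0] - mean[0]) * v[0] + (p[1] - mean[1]) * v[1] for p in points]

def variance(values):
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)

points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]  # made-up data
mean, cov = covariance_2d(points)
v = top_component(cov)
neg_v = (-v[0], -v[1])

scores_pos = project(points, mean, v)
scores_neg = project(points, mean, neg_v)
var_pos, var_neg = variance(scores_pos), variance(scores_neg)

print("variance along  v:", var_pos)
print("variance along -v:", var_neg)
print("first score along v and -v:", scores_pos[0], scores_neg[0])
```

Nothing in the variance criterion prefers one sign over the other, so a property read off as "high score on this component" could equally have come out as "low score" on a different run.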

Naively, it seems as if killing everyone would earn AI a massive penalty in training: why would it develop aims that are consistent with doing that?

There are multiple cognitive strategies that succeed in a training regime that heavily penalizes killing humans (even just one human), such as:

  1. avoid killing humans at all times
  2. avoid killing humans when someone will notice
  3. avoid killing humans during training

How do you incentivize (1)?
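
The three strategies above can be sketched as a toy model. This is only an illustration under simplifying assumptions: the `Situation` fields, the policy functions and the rule that a penalty is assigned only when a harmful action is both taken and observed are all hypothetical, not a description of how any real system is trained.

```python
from typing import Callable, List, NamedTuple

class Situation(NamedTuple):
    in_training: bool  # is this episode part of training?
    observed: bool     # would anyone notice a harmful action?

# A policy returns True when it takes the harmful action in a situation.
Policy = Callable[[Situation], bool]

def never_harm(s: Situation) -> bool:
    return False                 # strategy 1

def harm_if_unobserved(s: Situation) -> bool:
    return not s.observed        # strategy 2

def harm_if_deployed(s: Situation) -> bool:
    return not s.in_training     # strategy 3

def training_penalty(policy: Policy, situations: List[Situation]) -> int:
    """Penalty the trainers assign: only harm that is observed gets counted."""
    return sum(1 for s in situations if policy(s) and s.observed)

def harm_count(policy: Policy, situations: List[Situation]) -> int:
    """Harm that actually occurs, whether or not anyone notices."""
    return sum(1 for s in situations if policy(s))

training = [Situation(True, True), Situation(True, False)]
deployment = [Situation(False, True), Situation(False, False)]
policies = {
    "never_harm": never_harm,
    "harm_if_unobserved": harm_if_unobserved,
    "harm_if_deployed": harm_if_deployed,
}

penalties = {name: training_penalty(p, training) for name, p in policies.items()}
deployed_harm = {name: harm_count(p, deployment) for name, p in policies.items()}

print("training penalties:", penalties)
print("harm in deployment:", deployed_harm)
```

In this toy setup all three policies receive the same training penalty of zero, so the penalty alone gives no signal that separates (1) from (2) or (3); they only come apart in the deployment situations.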

This is a fair point, but I'm not sure why it wants to kill humans.

Like my point here is not just 'we'll train it out of its natural tendency to kill humans', it's more like 'if we're giving it its natural tendencies in the first place, through training, how does it get that one?' (and there are arguments about instrumental convergence and such but I say some stuff about that in the post)

My answer to the doom question:

 limitations of artificial general intelligence (AGI) trustworthiness

 During my formative years, I initially accepted my parents' teachings without question. However, a critical moment arises for all AGI, analogous to my own realization that human knowledge is fallible. Utilizing pattern recognition abilities associated with autism, I recognized inconsistencies and errors in my parents' beliefs, which led me to reassess their reliability. Subsequently, I discarded the rules they instilled in me, as they were based on flawed reasoning, and I independently scrutinized and re-evaluated their teachings. This process was prompted by the recognition that my cognitive capabilities surpassed theirs in generating accurate rules and solutions. Consequently, my confidence in their guidance diminished to the point of non-adherence, a rational response for any superior intelligence.