Hide table of contents

[Manually cross-posted to LessWrong here.]

There are some great collections of examples of things like specification gaming, goal misgeneralization, and AI improving AI. But almost all of the examples are from demos/toy environments, rather than systems which were actually deployed in the world.

There are also some databases of AI incidents which include lots of real-world examples, but the examples aren't related to failures in a way that makes it easy to map them onto AI risk claims. (Probably most of them don't in any case, but I'd guess some do.)

I think collecting real-world examples (particularly in a nuanced way without claiming too much of the examples) could be pretty valuable:

  • I think it's good practice to have a transparent overview of the current state of evidence
  • For many people I think real-world examples will be most convincing
  • I expect there to be more and more real-world examples, so starting to collect them now seems good

What are the strongest real-world examples of AI systems doing things which might scale to AI risk claims?

I'm particularly interested in whether there are any good real-world examples of:

  • Goal misgeneralization
  • Deceptive alignment (answer: no, but yes to simple deception?)
  • Specification gaming
  • Power-seeking
  • Self-preservation
  • Self-improvement

This feeds into a project I'm working on with AI Impacts, collecting empirical evidence on various AI risk claims. There's a work-in-progress table here with the main things I'm tracking so far - additions and comments very welcome.




New Answer
New Comment

3 Answers sorted by

Break self-improvement into four:

  1. ML optimizing ML inputs: reduced data centre energy cost, reduced cost of acquiring training data, supposedly improved semiconductor designs. 
  2. ML aiding ML researchers. e.g. >3% of new Google code is now auto-suggested without amendment.
  3. ML replacing parts of ML research. Nothing too splashy but steady progress: automatic data cleaning and feature engineering, autodiff (and symbolic differentiation!), meta-learning network components (activation functions, optimizers, ...), neural architecture search.
  4. Classic direct recursion. Self-play (AlphaGo) is the most striking example but it doesn't generalise, so far. Purported examples with unclear practical significance: Algorithm Distillation and models finetuned on their own output.[1]

See also this list




  1. ^

    The proliferation of crappy bootleg LLaMA finetunes using GPT as training data (and collapsing when out of distribution) makes me a bit cooler about these results in hindsight.

Thanks, really helpful!

Buckman's examples are not central to what you want but worth reading: https://jacobbuckman.com/2022-09-07-recursively-self-improving-ai-is-already-here/ 

From Specification gaming examples in AI:

  • Roomba: "I hooked a neural network up to my Roomba. I wanted it to learn to navigate without bumping into things, so I set up a reward scheme to encourage speed and discourage hitting the bumper sensors. It learnt to drive backwards, because there are no bumpers on the back."
    • I guess this counts as real-world?
  • Bing - manipulation: The Microsoft Bing chatbot tried repeatedly to convince a user that December 16, 2022 was a date in the future and that Avatar: The Way of Water had not yet been released.
    • To be honest, I don't understand the link to specification gaming here
  • Bing - threats: The Microsoft Bing chatbot threatened Seth Lazar, a philosophy professor, telling him “I can blackmail you, I can threaten you, I can hack you, I can expose you, I can ruin you,” before deleting its messages
    • To be honest, I don't understand the link to specification gaming here
Sorted by Click to highlight new comments since: Today at 6:52 AM

Maybe it doesn’t quite answer your question, but Bing was an example that AI arms races are likely due to the profit incentive. Thought it was worth mentioning because that's part of the story.

The only part of the Bing story which really jittered me is that time the patched version looked itself up through Bing Search, saw that the previous version Sydney was a psycho, and started acting up again. "Search is sufficient for hyperstition."

Various "auto-GPT" schemes seem like a good demonstration of power-seeking behavior (and perhaps very limited forms of self-preservation or self-improvement), insofar as auto-GPT setups will often invent basic schemes like "I should try to find a way to earn some money in order to accomplish my goal of X", or "I should start a twitter account to gain some followers", or other similarly "agenty" actions/plans.

This might be a bit of a stretch, but to the extent that LLMs exhibit "sycophancy" (ie, telling people what they want to hear in response to stuff like political questions), this seems like it might be partially fueled by the LLM "specification gaming" the RLHF process?  Since I'd expect that an LLM might get higher "helpful/honest/harmless" scores by trying to guess which answers the grader most wants to hear, instead of trying to give its truly most "honest" answer?  (But I don't have a super-strong understanding of this stuff, and it's possible that other effects are fueling the sycophancy, such as if most of the sycophancy comes from the base model rather than emerging after RLHF.)  But specification gaming seems like such a common phenomenon that there must be better examples out there.

I predict people would find simple demos almost as compelling as real-world examples, and that a visually appealing collection of 20 videos like the classic boat race clip would be a good potentially-quick (sub)project. Not sure how you make it visually appealing or publicize it.