Notes on "the hot mess theory of AI misalignment"

JakubK

6 min read · Apr 21, 2023

Comments 3

Sanjay

This was a really valuable contribution. I think the author had an angle that was definitely worth sharing, and I'm glad Jakub put it on this forum.

It did not cause me to update materially away from worrying about AI alignment. (My estimates of P(doom) vacillate between 0.1% and 10% -- those with a higher P(doom) might have a different reaction)

There are two reasons why I didn't find this compelling enough to materially change my mind:

1) I wasn't very convinced by the claim that artificial superintelligence won't be coherent.

For example, this chart is very suggestive of the author's main claim, namely that more intelligent things behave less coherently:

However, AI is developed in ways that are very different from how humans and animals came into being, so I don't find it at all compelling that AGI will be incoherent.

2) I'm not reassured by the thought of an incoherent superintelligence

The author was very good at making explicit this assumption:

I wasn't clear on exactly why the author thought it was good for the AI to be incoherent, but I think the argument was that the AI would be self-sabotaging.

I didn't find this convincing. Sure, humans definitely do self-sabotage, but they still control the earth, and the fate of gorillas and trees is still in the hands of humans, even if trees don't self-sabotage.

If anything, if it's true that AGI will be incoherent, then that makes alignment even harder, and makes me more worried.

JakubK

I agree that an "incoherent superintelligence" does not sound very reassuring. Imagine someone saying this:

I'm not too worried about advanced AI. I think it will be a superintelligent hot mess. By this I mean an extremely powerful machine that has various conflicting goals. What could possibly go wrong?

Peter Slattery 🔸

Thanks for taking the time to share, this was a great summary.

It seems like it could be valuable to study the link between coherence and intelligence more carefully.

The start of the post

Jascha doesn't make any ad hominem or psychoanalysis-based arguments, which is promising. He also cites Katja Grace's "Counterarguments to the basic AI x-risk case." Rohin Shah + Geoffrey Irving reviewed earlier versions of Jascha's post.

Jascha states "Most work on AGI misalignment risk assumes that, unlike us, smart AI will not be a hot mess." I don't know where he gets this impression; I doubt that most AGI alignment researchers would agree with the statement "your research assumes that smart AI will not be a hot mess."

The post contains a decent summary of the classic misaligned brain-in-a-box scenario. Jascha calls this "the common narrative of existential risk from misaligned AGI," which seems misleading since there are other threat models.

In an earlier post, Jascha argued that Goodhart's law makes extreme efficiency perilous.

Jascha is "extremely glad people are worrying about and trying to prevent negative consequences from AI."

But he also thinks "predicting the future is hard, and predicting aspects of the future which involve multiple uncertain steps is almost impossible." This is reasonable and resembles some of Nuño Sempere's concerns.

He goes on to claim that "An accidentally misaligned superintelligence which poses an existential risk to humanity seems about as likely as any other specific hypothesis for the future which relies on a dependency chain of untested assumptions." This conclusion, which very faintly resembles the "safe uncertainty fallacy," seems far too strong to draw from the difficulty of forecasting alone.

In footnote 5, Jascha outlines other AI x-risks that seem "at least as plausible" as misaligned AGI:

WW3 with AI-guided WMDs
terrorists use AI to build WMDs
a regime uses AI to make everyone compliant forever
highly addictive AI-generated experiences
massive unemployment with zero social safety nets
one tech corporation wins the AGI race and steers the future of humanity in bad ways

Jascha thinks instrumental convergence assumes the agent "will also monomaniacally pursue a consistent and well-defined goal" and will exhibit "much more coherent behavior than any human or human institution exhibits." I think this premise is false. Imagine giving the US Congress (one of the least coherent organizations in Jascha's survey) the power to achieve ~anything it wants. Or imagine giving a random person (one of the least coherent biological creatures in Jascha's survey) this power. I think these agents would effectively pursue instrumental subgoals despite being hot messes. Therefore, I don't think instrumental convergence requires agents to be supercoherent, just somewhat coherent.

An imaginary debate about whether the world will build agentic/goal-directed systems. (With inspiration from Leo Gao and Michaël Trazzi.)

Jascha: "When large language models behave in unexpected ways, it is almost never because there is a clearly defined goal they are pursuing in lieu of their instructions."
Jakub: Yes, but I expect people to use LLM technology to make more goal-directed systems, e.g. PaLM-SayCan or TaskMatrix.AI or AutoGPT; indeed, there are some incentives for doing so.
Jascha: Yes, but building such systems will probably require hard, incremental, deliberate work.
Jakub: Sure, but it might not, thanks to AI improving AI?

The experiment

Jascha enlisted help from n=14 "incredibly busy" friends with academic neuroscience and ML backgrounds.

The first four people brainstormed lists of ML models, non-human organisms, famous humans, and human institutions. These lists contained 60 entities in total.

The next four people (plus one of the original four, so five in total) sorted these 60 entities based on intelligence, and the last six people sorted based on coherence. Jascha's specific questions are in this doc.

Jascha caveats appropriately: "this experiment aggregates the subjective judgements of a small group with homogenous backgrounds."

The results: a strong correlation between incoherence and intelligence in biological creatures, a moderate correlation in human institutions, and a strong correlation in ML systems.

Note that "Human judgments of intelligence are consistent across subjects, but judgements of coherence differ wildly" based on rank correlations.
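To make "rank correlations" concrete, here is a minimal sketch of a Spearman rank correlation between two raters' orderings. It is not Jascha's analysis code (that lives in his Colab); the function names and the three five-entity rankings below are hypothetical and only illustrate how agreement between raters can be scored.

```python
def ranks(scores):
    """Return the rank (1 = lowest) of each score; assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    # Sum of squared rank differences between the two orderings.
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical rankings of the same five entities by three raters.
rater_a = [1, 2, 3, 4, 5]
rater_b = [2, 1, 3, 5, 4]  # mostly agrees with rater_a
rater_c = [5, 4, 3, 2, 1]  # exactly reversed relative to rater_a

agreement_ab = spearman(rater_a, rater_b)
agreement_ac = spearman(rater_a, rater_c)
```

A value near 1 means two raters order the entities almost identically, while values near 0 or below mean they disagree, which is the sense in which the coherence judgements "differ wildly."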

The experimental data and the analysis Colab are both public.

The takeaways

Jascha says "If AI models are subtly misaligned and supercoherent, they may seem cooperative until the moment the difference between their objective and human interest becomes relevant, and they turn on us (from our perspective). If models are instead simply incoherent, this will be obvious at every stage of development." I'm quite skeptical of this claim. Was it obvious, at every stage of development, that DALL-E 2 was incoherent? Also, an extremely clever, supercoherent system could just pretend to be less coherent.

Jascha discusses some counterarguments:

Maybe this intelligence-incoherence correlation breaks down for sufficiently intelligent agents.
1. Yup. I think this is another example of the "lab mice problem" from "AI Safety Seems Hard to Measure."
Maybe the human raters gave bad ratings.
1. I strongly agree. The coherence rankings "differed wildly" between the six respondents.
2. Also, some coherence respondents may have been tracking something like "does this entity have dissociative identities or distinct subpersonalities," and I think a superintelligent agent with these qualities could still cause a catastrophe. For example, it seems like language models can act like different characters, including more goal-directed characters, when prompted appropriately (e.g. see the prompts in the MACHIAVELLI benchmark paper).
Perhaps high intelligence counteracts low coherence. Jascha postulates that "The effective capabilities that an entity applies to achieving an objective is roughly the product of its total capabilities, with the fraction of its capabilities that are applied in a coherent fashion." He recommends building evaluations that can better assess these "effective capabilities."
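The "effective capabilities" relation in the last counterargument can be written out as a tiny sketch. The function name, the units, and the two example systems are all hypothetical; the only thing taken from the post is the idea that effective capabilities are roughly total capabilities times the fraction applied coherently.

```python
def effective_capabilities(total, coherent_fraction):
    """Rough product rule: total capabilities times the coherent fraction.

    `total` is in arbitrary, made-up units; `coherent_fraction` must lie
    in [0, 1].
    """
    if not 0.0 <= coherent_fraction <= 1.0:
        raise ValueError("coherent_fraction must be between 0 and 1")
    return total * coherent_fraction


# Hypothetical systems: a very capable hot mess vs. a weaker, focused agent.
hot_mess = effective_capabilities(total=1000.0, coherent_fraction=0.1)
focused = effective_capabilities(total=20.0, coherent_fraction=0.9)

# Under these made-up numbers, the hot mess still has more effective capability.
hot_mess_is_stronger = hot_mess > focused
```

With these illustrative numbers, a large enough gap in total capabilities outweighs a low coherent fraction, which is why the counterargument matters.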

Jascha advocates for "adding subtlety to how we interpret misalignment" and adjusting research priorities accordingly. He thinks risks similar to "industrial accidents or misinformation" are more likely than a misaligned singleton. In subsequent Twitter discussions, Jascha remarked that losing control of a collection of hot messes seems like a "worringly plausible" risk scenario.

Footnote 16 makes an intriguing connection between the lack of empirical data in AGI threat models and the struggles of early theories in computational neuroscience, and Jascha remarks that "we seem to have many ideas about AI risk which are only supported by long written arguments." I think Jacob Steinhardt addresses this objection reasonably in his "More is Different" blog post series. Also, I do think we are starting to see empirical evidence for AGI power-seeking.

The final section considers ways to improve the intelligence-coherence survey experiment, including fun ideas such as asking pairwise questions like "Is an ostrich or an ant smarter?" (instead of ranking all the entities) and then assigning each entity an Elo score. Another idea is to "replace subjective judgements of intelligence and coherence with objective attributes." Jascha notes that finding appropriate empirical measures for coherence "would be a major research contribution on its own." I agree, and I encourage Jascha to try working on this.
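The pairwise-comparison idea can be sketched with the standard Elo update rule. This is only an illustration, not a procedure from Jascha's post: the starting rating of 1000, the K-factor of 32, the "octopus" entity, and the three judgments are all assumptions made up for the example.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings, winner, loser, k=32):
    """Shift rating points from loser to winner after one pairwise judgment."""
    delta = k * (1 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical entities, all starting from the same rating.
ratings = {"ostrich": 1000.0, "ant": 1000.0, "octopus": 1000.0}

# Hypothetical answers to "which of these two is smarter?" as (winner, loser).
judgments = [("ostrich", "ant"), ("octopus", "ostrich"), ("octopus", "ant")]
for winner, loser in judgments:
    update_elo(ratings, winner, loser)

# Entities ordered from highest to lowest Elo score.
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

One appeal of this design is that each respondent only answers small two-way questions, so nobody has to produce a full ordering of all 60 entities.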

Jakub's opinion

In summary, Jascha (1) identifies an important property (coherence) of AGI systems in a particular risk narrative, (2) asks several friends to rate the intelligence and coherence of 60 existing entities, and (3) extrapolates the observed trends to argue that AGI systems probably won't be maximally coherent, challenging the assumptions in the risk narrative.

The experiment is a cool idea, but I think that (a) the sample size was too small, (b) the coherence assessments were probably inaccurate, and (c) the correlations could certainly break down for more intelligent systems. Most importantly, I don't think a system needs to be supercoherent to single-handedly cause a catastrophe, and I think there are plenty of other threat models besides the brain-in-a-box scenario. For these reasons, I think Jascha's post provides only weak evidence for AGI x-risk being unlikely.

Instead of thinking in terms of "coherence" vs. "hot mess", it is more fruitful to think about "how much influence is this system exerting on its environment?". Too much influence will kill humans, if directed at an outcome we're not able to choose.

I'd be curious to hear Jascha's response to Jacob Steinhardt's "More is Different" series (especially the introduction and the first two posts), since Jascha seems quite skeptical of using theoretical reasoning to predict behavior in future ML systems. I'd also like to hear his thoughts on "The alignment problem from a deep learning perspective," since it provides some empirical examples.

For other people's responses to Jascha's post, see the comments of this LessWrong linkpost.