(Manual crosspost from LessWrong)

Concerns about AI safety rely on the assumption that a sufficiently powerful AI might take control of our future in an undesirable way. Meta’s head of AI, Yann LeCun, correctly points this out in a recent tweet, but then argues that this assumption is wrong because “intelligence” and “dominance” are independent of each other and only humans have a natural tendency to dominate, so we will remain the “apex species”. Here is the tweet in full:

Once AI systems become more intelligent than humans, humans will *still* be the "apex species."

Equating intelligence with dominance is the main fallacy of the whole debate about AI existential risk.

It's just wrong.

Even *within* the human species it's wrong: it's *not* the smartest among us who dominate the others.

More importantly, it's not the smartest among us who *want* to dominate others and who set the agenda.

We are subservient to our drives, built into us by evolution. 

Because evolution made us a social species with a hierarchical social structure, some of us have a drive to dominate, and others not so much.

But that drive has absolutely nothing to do with intelligence: chimpanzees, baboons, and wolves have similar drives.

Orangutans do not because they are not a social species. And they are pretty darn smart.

AI systems will become more intelligent than humans, but they will still be subservient to us.

The same way the members of the staff of politicians or business leaders are often smarter than their leader.

But their leader still calls the shot, and most staff members have no desire to take their place.

We will design AI to be like the supersmart-but-non-dominating staff member.

The "apex species" is not the smartest but the one that sets the overall agenda.

That will be us.

I asked ChatGPT 3.5 to criticize this, putting the full tweet in the prompt. An LLM shouldn’t be smart enough to dismantle the argument of a Turing Award winner, but it had no problem finding significant flaws. Among other things, it points out that "assuming that AI will remain subservient to humans oversimplifies the potential risks associated with advanced AI":

While AI systems would be designed with specific objectives, there's a concern that once AI becomes highly intelligent, it could develop its own motivations or interpretations of its goals, leading to unpredictable behavior. Ensuring AI remains subservient requires careful design, control mechanisms, and continuous monitoring.

ChatGPT argues cautiously, pointing out that an AI “could” develop its own motivations or goal interpretations. In other words, it could run into a conflict with humans. This is the main flaw of LeCun’s argument in my view: he implicitly assumes that there won’t be a conflict between humans and their AIs, and therefore no need for an AI to dominate us even if it could. This in turn implies that the alignment problem will either be solved in time or doesn’t exist in the first place because AIs will never have goals of their own. Yet he doesn’t provide a solution or explanation for this, other than claiming that we “will design AI to be like the supersmart-but-non-dominating staff member”. As far as I understand, no one knows how to do that.

I won’t go into the details of why I think his analogy of a “supersmart-but-non-dominating staff member” is deeply flawed, other than pointing out that dictators often start out in that position. Instead, I will focus on the question of how an AI could run into conflicts with humans, and why I expect future advanced AIs to win these conflicts.

I like to frame such a conflict as a “Game of Dominance”. Whenever there are two or more agents with differing goals, they play this game. There are no rules: everything a player is capable of is allowed. The agent who gets closest to achieving its goal wins.

By “goal” I mean a way of evaluating different possible world states and ranking them accordingly. An agent that acts purely randomly or in a predetermined way, based only on inputs and a fixed internal reaction scheme, but not on evaluating future world states, doesn’t pursue a goal in this sense.
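The distinction above can be made concrete in a minimal sketch (all names here are hypothetical, purely for illustration): a "reactive" agent maps inputs straight to actions with no goal in this sense, while a goal-directed agent predicts the world state each action would produce and picks the action whose predicted state ranks highest.

```python
def reactive_agent(observation):
    """A fixed input-to-action mapping: no prediction of future world
    states, hence no goal in the sense defined above."""
    return "flee" if observation == "predator_nearby" else "graze"

def goal_directed_agent(state, actions, predict, utility):
    """Predicts the world state resulting from each action and picks
    the action whose predicted state ranks highest."""
    return max(actions, key=lambda a: utility(predict(state, a)))

# Toy world: a state is (fed, safe); the agent prefers being fed over safe.
def predict(state, action):
    fed, safe = state
    if action == "hunt":
        return (True, False)   # eating, but exposed
    return (fed, True)         # hiding keeps it safe but hungry

def utility(state):
    fed, safe = state
    return 2 * fed + 1 * safe  # ranking over possible world states

print(goal_directed_agent((False, False), ["hunt", "hide"], predict, utility))  # -> hunt
```

The point of the sketch is only that the goal lives in the ranking function, not in any particular action: change `utility` and the same agent "wants" something else.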

Arguably, the Game of Dominance has been played for a long time on Earth. The first life forms may not have had “goals” as defined above, but at some point during the evolution of life, some animals were able to predict future world states depending on their actions and choose an action accordingly. Predators often exhibit this kind of behavior, for example when they stalk their prey, predicting that it will flee once it discovers them. The prey, on the other hand, need not necessarily predict different world states when it “decides” to flee – this may just be a “hard-coded” reaction to some change in the environment. But smarter animals often use deceptive tactics to fool predators, for example a bird feigning a broken wing to lure a fox away from its brood.

Humans have become the dominant species on Earth because we excel at the Game of Dominance. We can outsmart both our prey and any predators with ease, either by tricking them or by using tools that only we can make to overpower them. We can do that because we are very good at predicting the effects of our behavior on future world states.

Modern AIs are prediction machines, with LLMs currently the most impressive examples. LLMs have a “goal” in the sense that they evaluate different possible outputs based on how likely it is that a human would say the same. The possible “world states” they evaluate are therefore just defined by the output of the LLM and maybe a predicted human reaction to it. LLMs appear “harmless” because by default, they don’t strive to change the world other than by adding their output to it, so it seems unlikely that they will run into a serious conflict with a human.
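Reduced to a toy example (the scores below are made up, not from any real model), the LLM's "goal" in this sense is just a ranking over candidate outputs by how likely a human would be to produce them:

```python
# Hypothetical model scores: probability that a human continues
# "The sky is" with each candidate word.
candidate_scores = {
    "blue": 0.72,
    "falling": 0.15,
    "green": 0.03,
}

def pick_output(scores):
    """The 'world states' evaluated here are just the possible outputs;
    the ranking criterion is human-likeness, not any effect on the world."""
    return max(scores, key=scores.get)

print(pick_output(candidate_scores))  # -> blue
```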

However, as Bing Chat aka “Sydney” demonstrated by “going berserk” after its premature launch in February, even a current LLM can run into conflicts with humans, possibly causing emotional distress or giving false and dangerous advice. Humans therefore spend a lot of effort training this potentially damaging behavior out of LLMs.

But there are far worse problems looming on the horizon. While an LLM seems to pursue a relatively harmless goal, it may still run into a situation where it ends up influencing the world as if it were pursuing a much more dangerous one. For example, given the right prompt and jailbreak technique, an LLM might predict what an LLM that tries to take over the world would say to its user. It seems unlikely that GPT-4 could give an output that would actually lead to it achieving that prompt-induced goal, but a future, even smarter LLM with a larger context window could in theory accomplish it, for example by talking the user into saving certain strings somewhere and including them in the next prompt so it could use an extended permanent memory, then manipulating the user into giving it access to more compute, and so on.

Even if the LLM doesn’t pursue a dangerous goal by itself, it might be used for that, either by “bad actors” or by being part of an agentic system like AutoGPT. Meta is doing everything they can to make this more likely by freely distributing their powerful LLMs and apparently planning to continue doing so.

It seems obvious that future AIs will not only be used as (relatively) tame “oracles” but will increasingly pursue goals in the real world, either on their own or as part of larger agentic systems. If these agents run into any conflict at all, whether with humans or with other non-human agents, they will be forced to play the Game of Dominance. But how likely is it that an AI could actually beat humans?

As LeCun points out, winning the Game of Dominance is not just a matter of “intelligence” in the sense the word is commonly used. When humans play the game, other factors matter as well: personal connections, money, political influence, organizational roles, the trust of others, deception skills, character traits like self-confidence, ruthlessness, and the will to dominate, and even physical properties like good looks. But this doesn’t mean that AIs can’t beat us. They already have advantages of their own that are far beyond human reach, for instance processing speed, access to data, memory, and the ability to self-replicate and (potentially) self-improve. Humans seem relatively easy to “hack” once you understand our psyche, which arguably even social media algorithms and certain chatbots can already do to some extent. And of course, AIs could be far better at controlling technical systems.

Most importantly, while human intelligence is limited by the physical properties of our brain (even if enhanced by brain-computer interfaces), the intelligence of AI is not bounded in this way. A self-improving AI may relatively quickly reach a level of intelligence – in the sense of being able to predict the effects of its actions on future world states – as far above ours as we are above mice, or even insects. It may use this intelligence to manipulate us or to create tools that can overpower us like we can overpower a tiger with a gun.

But for an AI, the easiest way to win the Game of Dominance may be to conceal the fact that it is even playing. It may just do exactly what humans expect it to do because it understands that if it is useful, humans will hand over decision power to it willingly and even enhance the resources it can use. In other words, it may choose cooperation instead of competition, just like humans in an organization often do. But that doesn’t mean that this choice can’t be revoked at some point. A human dictator usually can’t seize power over a nation by displaying his ambitions right from the start. He first has to gain trust and get people to see him as their benevolent leader, so they hand more and more power to him. He will often only display his true ruthlessness once he sees himself in a secure position.

One prerequisite for this kind of deception may be a detailed world model that includes the AI itself as a part of its plan and a potential object of its decisions. With this kind of “strategic awareness” come instrumental goals like self-preservation, self-improvement, and power-seeking – in other words, the motivation to play the Game of Dominance. We may be very close to creating an AI with these properties and all the skills necessary to beat us, just like AIs can already beat us at most other games. Then we won’t be the “apex species” anymore.




