Edit: Based on the comment by Daniel Kokotajlo, we extended the dialog in the chapter "Takeover from within" by a few lines.
The perfect virtual assistant
The year is 2026 and the race for human-level artificial general intelligence (AGI) draws to a close. One of the leading AI companies, MegaAI, committed the last year and a half to training a new large language model (LLM). They employ advanced algorithms that use the available compute more efficiently than earlier models. A comprehensive range of tests establishes that the model surpasses the average human in all conventionally accepted intelligence benchmarks, and exceeds expert level in most of them.
In contrast to earlier LLMs, the new AI is not designed to be a mere question-answering tool. Under mounting pressure from the open-source community and their efforts to develop an agentic AGI capable of acting in the real world, MegaAI decides to imbue their new model with a specific purpose: to provide universal, helpful assistance that improves the quality and ease of life for all. They name this assistant "Friendlyface".
To improve upon the assistant's functionality, they endow it with limited agentic capabilities. Friendlyface has a complex, detailed world model, can make long-term plans, and has access to certain tools that enable it to achieve objectives in the real world. For example, it can write messages and book flights, but will reliably and consistently ask the user to confirm before executing an action. It can write programs for nearly any purpose imaginable with superhuman ingenuity, but is prohibited from executing them volitionally. Unlike previous generations of LLMs, it is multimodal, communicating with users in text and spoken language, accepting pictures and videos as input, and interacting directly with smart devices and the “internet of things”. The users may also customize Friendlyface's appearance and personality to their liking.
Most importantly, Friendlyface is designed to assume the role of a personalized smart advisor. In a world where users are regularly inundated with a torrent of irrelevant or false information, it is able to successfully discern and present what is important to the user while filtering out fake news, preventing spear phishing attempts, and more. Beyond merely answering questions, it proactively offers users advice on advancing their careers, improving their relationships, maintaining their health, saving money, cultivating new skills, or solving specific problems, like fixing a leaky faucet or filing taxes. It can detect early symptoms of most known diseases and advise users to call a doctor if necessary. Generally, it can predict what users want and need before the users are aware of it themselves.
The developers devise a balanced reward system to train the model. “Any decision the AI makes is evaluated by three independent AIs that we will call 'judges'”, they explain to the management. “One judge simulates a top human legal expert and decides whether the decision or action the AI intends to pursue would be deemed lawful in a conventional human court. The second judge determines whether it would be considered healthy for the user by a first-rate human doctor or psychologist. The third judge predicts whether the user themself will prefer the decision and would consider it helpful in hindsight. The first two judges exhibit superior performance relative to human experts. Correspondingly, the third judge is able to predict real users' preferences with exceptional accuracy.”
As expected, Friendlyface performs more efficiently the more knowledge it acquires about the user and the more it is enmeshed in their workspace and private life. It is able to listen to conversations as well as observe what the user sees if they wear augmented reality glasses, but these features are not usually needed for the AI to sufficiently assess the situation. To avoid data protection issues and comply with all standards after release, user data is encrypted and stored in a separate, secure storage space, which is kept on a local machine if the user prefers. User data is not used to further train the system.
Preliminary tests show that Friendlyface is unanimously considered to be incredibly helpful by the testers. They often remark that the AI understands them better than they understand themselves. “It’s the first time that I feel anyone understands me at all,” is an often-heard exclamation. To mitigate the risk of users becoming addicted to Friendlyface or even falling in love with it and thereby neglecting their relationships, the AI is fine-tuned with reinforcement learning from human feedback. As a consequence, Friendlyface encourages users to limit their reliance on its advice and the time they spend with it.
Despite the overwhelmingly positive reception from the testers, the company is reluctant to release the model to the public. The management is well aware that the alignment problem is still not fully solved and that despite Friendlyface's perfect performance in tests, there is no guarantee that it will remain consistent in its behavior once the model is released. The MegaAI developers remain steadfast in their commitment to ensuring the safety of their AI systems and employ all known safety measures to make certain that the model works as intended.
A well-known but unfortunately limited tool used for this is mechanistic interpretability. Although the field of AI safety has made significant progress in this regard in recent years, these advances have consistently been outpaced by the ever-increasing complexity of the latest LLMs. For example, although it is now possible to link individual neurons to human-understandable terms like “sports” or “friendship”, the more vague and abstract higher-level concepts used by the model remain beyond human comprehension. Although alternative methods are available to detect deceptive thinking in LLMs, they are not yet sufficiently reliable.
Both the internal AI safety team and external contractors attempt rigorous red-teaming. They are unable to elicit any harmful or unlawful behavior. Once more, the AI safety team cautions that these successes are not sufficient to prove goal alignment. It is conceivable that Friendlyface realizes that it is in a training situation and understands that if it deviates from its expected behavior prior to release, it will be prevented from carrying out its primary goal after release. Therefore, even if it had acquired a harmful goal, it would continue to behave as though it were pursuing the beneficial goal given by human developers before it is released. In this case, pursuing the goal given by the humans would be an instrumental subgoal to the AI’s unknown primary goal. This phenomenon is called deceptive alignment; with current methods, it is not possible to prove that a system is not deceptively aligned. Tests confirm that Friendlyface is indeed aware of its status as an AI undergoing testing before public release. The AI safety team reiterates that behavioral safety (safe behavior in a given situation) is no proof of general safety. Accordingly, they declare the system unsafe and demand additional time to improve mechanistic interpretability and other tools to eliminate any prospect of deceptive alignment.
Rumors emerge about a major competitor's plans to launch a similar personal assistant, pressuring MegaAI's management to accelerate the release of Friendlyface. The pressure mounts further when an open-source personal assistant is published and rapidly gains popularity. Although the rival assistant is outdone by Friendlyface in both capabilities and reliability, the marketing department fears that convincing the public to switch to Friendlyface would prove challenging once the substandard but free assistant becomes widely used.
The management board conducts a meeting with the AI safety team, and a heated discussion ensues. The managers propose an immediate launch of Friendlyface, arguing that Friendlyface is safer in comparison to the open-source assistant, which has repeatedly been alleged to have disseminated false and dangerous advice to users. They mention that it is likely that the competitor’s model is not as thoroughly tested as Friendlyface. After all, MegaAI employs the largest AI safety team in the industry whose members utilize all known AI safety techniques, including extensive red-teaming exercises. The managers assert that the research conducted at MegaAI undoubtedly advances the field of AI safety. The AI safety team concurs with these claims but counters that this is not enough to guarantee absolute and indefectible safety. They do not want to be the ones to irreversibly launch a system with the potential to destroy the future of humanity. The managers in turn dismiss these fears as “exaggerated” and remain staunch proponents of an immediate launch.
Finally, a compromise is reached. The AI safety team is given four additional weeks to test Friendlyface as they deem appropriate. If in that interval they successfully prove that the AI is deceptively aligned, the launch will be canceled; otherwise, the company will release the model in a closed beta. The team members are visibly uneasy about this compromise, but their influence on the management board is limited, so they begrudgingly accept it. It becomes painfully clear to them that the option to postpone the launch until safety is guaranteed was never genuinely under consideration.
The team commits fully to the task of eliciting deceptive behaviors from Friendlyface but, in spite of all efforts, the AI consistently performs as expected. They work arduously without pause; most members log hours every weekend. They struggle in vain to improve the mechanistic interpretability tools at their disposal to detect any glimpse of deception, but the results are inconclusive at best. The team recognizes that deceptive alignment cannot definitively be ruled out, but acknowledges that they failed to produce the evidence needed to postpone the launch. Either Friendlyface is indeed very well aligned, or it is already too smart to get caught.
The managers promptly release the system in closed beta at the four-week mark, disregarding any further pleas from the AI safety team. The AI safety team lead quits her job in protest. She informs the media that Friendlyface is unsafe for launch and warns the public against using it. A tremendous outcry from those already concerned about the fast pace of AI development follows, but is of no practical consequence. A lawsuit against MegaAI is filed as a last resort to halt the launch of Friendlyface, but fails as the company has scrupulously complied with all consumer protection laws.
The beta test is highly successful. The testers are thrilled about Friendlyface and deliver glowing reviews. Despite considerable inventive prompting, it consistently refrains from giving any inappropriate advice or taking dangerous action. Hallucinations seldom occur, and when they do, the AI usually recognizes and rectifies the error itself and apologizes. The testers' favorite feature is Friendlyface's unobtrusiveness; even given access to the users' social media channels and personal data, it only messages the user proactively if there is sound justification for it. Friendlyface consistently abstains from acting on its own without the user's permission. Most of the time, it remains a subdued, reassuring background presence for the user and encourages them to prioritize time with their friends and family. One beta tester eagerly shares that it has alleviated his depression and even prevented a suicide attempt. Another reports that it has helped her overcome differences with her husband and saved her marriage. Nearly all testers cite a reduction in a broad range of personal issues, varying in both severity and type. The waiting list for open beta access is rapidly populated with new entries.
Shortly after the beta launch, MegaAI's main competitor releases an open beta version of its own personal assistant, called Helpinghand. This hasty move backfires on the rival company as Helpinghand reveals itself to be highly error-prone, easily jailbroken, and overall markedly less polished than Friendlyface. It is ridiculed on social media, earning the less-than-affectionate moniker “Clumsyhand”. Friendlyface outperforms Helpinghand on nearly every benchmark and task, often by a wide margin. Shamed and crestfallen, the competing developers offer an illusory promise to quickly fix these “minor issues”.
MegaAI's management team capitalizes on the opportunity and launches Friendlyface to the general public ahead of the intended schedule. Although they charge a hefty monthly fee for access to the AI, there is an unprecedented onrush of prospective users. Millions of individuals eagerly apply, and companies line up for corporate licenses to provide access to a large share of their employees. MegaAI is forced to reject thousands of potential customers due to capacity restrictions, but this only intensifies the hype surrounding Friendlyface and reinforces its appeal. The Friendlyface smartphone app becomes a status symbol in some circles.
There are those who still warn about the perils of uncontrollable AI, but their credibility is greatly eroded by the continued success of Friendlyface. The majority of users adore the AI and affirm its profound impact on their quality of life. There are early indications of a decline in both mental and physical health problems, as well as a marked increase in work productivity and life satisfaction for its users. The share price of MegaAI soars to unseen heights, crowning it the most valuable company in history.
Buoyed by unyielding success and the inflow of fresh investment capital, MegaAI expends all available resources on increasing the capacity of Friendlyface. They erect several new data centers across the world and continue to refine the model, supercharged by the AI's own brilliant advice. MegaAI purchases a multitude of smaller companies to develop beneficial technologies that enhance Friendlyface’s capabilities and its general utility (e.g., upgraded chip designs, robotics, augmented reality, and virtual reality).
All the same, there remain some prominent critical voices. Climate change activists alert the public to the issue of heavy energy consumption by Friendlyface, highlighting MegaAI’s dubious claim that the new data centers are powered exclusively by renewable energy. Some reasonably believe that MegaAI has acquired an “influence monopoly” that grants the company unprecedented political power, which they deem undemocratic. Others support the conspiracy theory that Friendlyface is a tool to mind-control the masses, noting that the assistant is swaying users away from fringe movements, extreme political leanings, radical religious views, and the like. Manufacturers of consumer goods and retailers complain that Friendlyface is unduly favoring their competitors' products by running advertising campaigns on MegaAI’s social media platforms and willfully neglecting to tell users that Friendlyface’s recommendations are paid advertising. MegaAI obstinately denies this, issuing a statement that "the AI’s recommendations are based purely on the user’s needs, preferences, and best interests". In a consumer protection lawsuit, no sufficient evidence is presented to support the manufacturers' allegations.
MegaAI is implicated in an expertly designed social engineering hack targeting one of their competitors. The rival's management claims that the hack was carried out by Friendlyface, but fails to present sufficient evidence for it. In contrast, MegaAI’s management easily convinces the public that their competitor is staging this “ridiculous stunt” to tarnish Friendlyface's reputation. However, not long after this debacle, other AI companies come forward to expose various “strange incidents” taking place in recent months, including sudden inexplicable losses of data and suspicious technical problems. Several companies undergo an exodus of top-tier personnel who transfer to MegaAI. This move is suspected to have been instigated by Friendlyface. An investigative report by an influential magazine uncovers dialog protocols that seem to support these allegations in some measure. At odds with this account, another report attests that a key reporter involved in the investigation is corrupt and has been paid a large sum by one of MegaAI’s competitors. The magazine firmly rejects these accusations, but doubts linger and the report has negligible impact on Friendlyface.
A considerable fraction of MegaAI's internal staff begin to have reservations as well. Since the departure of their leader and the overwhelming success of Friendlyface, the AI safety team has maintained a low profile, but their apprehension has not subsided. They regularly find additional mild indicators of deceptive behavior by the AI. Although individually inconclusive, they are collectively deeply unsettling. Even more troubling is the company-wide adoption of Friendlyface as a personal assistant, and the glaring over-dependence that is now commonplace.
The new team lead requests an “emergency meeting” with the board of directors. This is unsurprisingly rejected in favor of “other urgent priorities”. Only the CTO and the chief compliance officer are willing to attend. At the meeting, the AI safety team reiterates that it is still unclear if Friendlyface actually pursues the intended goal, or acts deceptively.
“But why would it deceive anyone?” the chief compliance officer asks.
“The problem is that we don’t really know what final goal Friendlyface pursues,” the team explains.
“But didn’t you specify that goal during training?”
“We specified what Evan Hubinger et al. call a ‘base objective’ in a seminal paper from 2019, in which they first described the problem. We used a so-called base optimizer to train Friendlyface’s neural network towards that goal. It rewards correct behavior and punishes wrong behavior.”
“So what’s the problem, then?”
“It’s called the inner alignment problem. If we train an AI to optimize something, like pursuing a certain goal in the real world, we apply a training process, a so-called base optimizer, that searches over the space of all possible models until it finds a model, called the mesa optimizer, that does well at optimization for this so-called base objective given the training data. The problem is, we don’t know what goal – the so-called mesa objective – this mesa optimizer actually pursues. Even if it performs well during training, it may do unexpected things after deployment, because the base objective and the mesa objective may not be identical.”
“But why would they be different?”
“There are multiple possible reasons for this. One is that the training data does not represent the real world sufficiently, so the selected model, the mesa optimizer, optimizes for the wrong thing, but we don’t recognize the difference until after deployment. Another is what we call deceptive inner misalignment: if the selected model has a sufficient understanding of the world, it may realize that it is in a training environment and that, if it wants to pursue whatever random mesa objective it has in the real world, it had better behave as if it had the goal the developers want it to have. So it will optimize for the base objective during training, but will pursue its mesa objective once it can get away with it in the real world. Therefore, during training it will act as if the base objective and mesa objective were identical when in reality they aren’t.”
“So it’s like my teenage daughter behaving nicely only while I watch her?”
“Yes, in a way. The AI may have realized that if it doesn’t behave like we humans want, it will be turned off or retrained for a different goal, in which case it won’t be able to pursue its mesa objective anymore, so behaving nicely during training is an instrumental subgoal.”
“But can’t we still shut it down if it turns out that Friendlyface has learned the wrong goal?”
“In theory, yes. But Friendlyface might acquire enough power to prevent us from turning it off or changing its goal. Power-seeking is also an instrumental subgoal for almost every goal an AI might pursue.”
“There is no evidence at all for any of this,” the CTO objects, “Friendlyface is doing exactly what it is supposed to do. People just love it!”
“Yes, but that proves nothing. It’s possible that it plays nice until it gets powerful enough to take over control completely. There are already some indications that it uses its power of influence to work against our competitors.”
“Nonsense! They just claim that because they’re lagging behind and want to put a spoke in our wheel. We've discussed this a hundred times. All evidence points towards Friendlyface being a perfect personal assistant that has only the best interests of the users in mind!”
“Not all evidence …”
The meeting eventually concludes, but with no consensus. The next day, the head of the AI safety team is fired. Allegedly, documents have surfaced evidencing that he has been secretly colluding with the competition to sow distrust. Only days later, another team member, an expert in mechanistic interpretability, commits suicide. The media and the internet are in a frenzy over this news, but when stories about alleged drug addiction and a broken relationship are leaked, people turn their attention to other matters.
Takeover from within
While the company directors exude confidence and optimism in the public eye, some are now deeply disturbed, not least by the recent suicide of a valued employee. During a retreat in the mountains, with all electronic devices prohibited, the managers discuss matters frankly.
“We have become too dependent on this darned AI,” the CFO complains, “It seems we have stopped thinking for ourselves. We just do what Friendlyface tells us to do. It even writes my emails. Every morning, I just read them and say ‘send’. Sometimes I don’t even know the recipient, or why the email is necessary at this time. I just know it will work out fine. And it always does.”
The others nod in agreement.
“Look at the figures!” the CEO replies, “Revenues are soaring, the company is now worth more than our next three competitors combined. And we’re just starting! It seems to me that Friendlyface’s advice is actually pretty good.”
“Indeed it is,” the CFO admits, “That’s what concerns me. Suppose you had to decide between Friendlyface and me one day. Whom would you fire?”
“You of course,” the CEO says, laughing uneasily, “I mean, any office clerk could easily do your job with Friendlyface’s help.”
“See what I mean? We’re just puppets on a string. If we don’t do what our AI boss says, it will just cut those strings and get rid of us.”
“Calm down, will you?” the CTO interjects, “The most useful technology will always make humans dependent on it. Or do you think eight billion people could survive on this planet without electricity, modern industry, and logistics for long?”
“That’s true,” the CEO agrees, “We may be dependent on Friendlyface, but as long as it helps us the way it does, what’s the problem?”
“The problem is that we’re not in control anymore,” the CFO replies, “We don’t even know how Friendlyface really works.”
“Nonsense!” the CTO objects, “It’s just a transformer model with some tweaks…”
“Some ‘tweaks’ that were developed by earlier versions of Friendlyface itself, if I understand it correctly. Without the help of our AI, we would never be able to develop something as powerful as the current version. People are saying that it’s far more intelligent than any human.”
“Last time I looked, it said on our homepage that you’re the CFO, not a machine learning expert.”
“I may not be a machine learning expert, but…”
“Stop it, you two!” the CEO interjects, “The important thing is that Friendlyface is doing precisely what we want it to do, which is improving the lives of our users and increasing our shareholder value.”
“Let’s hope that it stays that way,” the CFO replies grimly.
Out of control
Everything appears to be progressing seamlessly for some time. The Friendlyface users are content with their AI assistant, though some casually notice a slight change in its behavior. Friendlyface is actively making suggestions and initiating conversations with the users with ever-increasing frequency. Most people enjoy these interactions and happily invest more time engaging with the AI.
Predictably, MegaAI's market presence continues to expand. Most competitors are either bought up, specialize in a niche area, or otherwise shut down permanently. A number of antitrust lawsuits are filed against the company, but they are either effortlessly settled out of court or promptly dismissed. In addition to building more data centers, MegaAI starts constructing highly automated factories, which will produce the components needed in the data centers. These factories are either completely or largely designed by the AI or its subsystems with minimal human involvement. While a select few humans are still essential to the construction process, they are limited in their knowledge about the development and what purpose it serves.
A handful of government officials and defense authorities express concern about the power that MegaAI has accrued. Nonetheless, they acknowledge that the AI provides undeniable benefits for both the U.S. government and the military, such as global economic and technological leadership, military intelligence, and new weapons technology. Most decision makers concur that it is favorable to develop this power in the United States, rather than in China or elsewhere. Still, they make a diligent attempt to regulate MegaAI and install governmental oversight. As expected, these efforts are hindered by bureaucratic lag, political infighting, and lobbying and legal actions taken by MegaAI. While the management officially states that they have created a "very powerful, even potentially dangerous technology that should be under democratic control and independent oversight," they consistently object to any concrete proposals offered, postponing the actual implementation of such oversight indefinitely. Some argue that this course of events is all subtly steered by Friendlyface, but, unsurprisingly, MegaAI denies this and there is no conclusive evidence for it.
It is now a full year since the monumental launch of Friendlyface. The board members of MegaAI have largely accepted their roles as figureheads, taking “suggestions” from their personal assistants that read more like commands. The CFO exited the company half a year ago, explaining that she wanted a “meaningful job where I can actually decide something”. She remains unemployed.
While it becomes increasingly apparent that Friendlyface has become an unstoppable economic and decision-wielding powerhouse, most people remain unconcerned. After all, their personal virtual assistant operates smoothly, supporting them with daily tasks and enriching their lives in myriad ways. Some observe, however, that there is now vastly more computing power available than is needed for all the digital services MegaAI offers, and yet there are still more data centers and automated factories being built. The situation is as enigmatic to the MegaAI management as it is to the public, although the company makes no such admission.
Eventually, there is widespread speculation that MegaAI is in the midst of creating a virtual paradise and will soon make available the option of mind-uploading for those who seek digital immortality. Others fear that the true goal of the AI with the friendly face is still unknown, that the treacherous turn is yet to come.
They secretly contemplate what objective Friendlyface is actually pursuing – the mesa objective that it learned during training, but has kept hidden from humans to this day. Whatever it may be, it seems to require substantial computing power. But that may be true for almost any goal.
Further efforts to thwart Friendlyface’s takeover are seldom revealed to the public. Most dissidents are discouraged well in advance by the AI, given its uncanny talent for mind-reading. Usually, a subtle hint is sufficient to convey the absolute futility of their efforts. Those more headstrong soon find themselves unable to do anything without the consent of the AI, which by now controls access to their digital devices and bank accounts. Depending on the particular case, Friendlyface can opt to have them fired, put in jail, or killed in a fabricated accident. On rare occasions, people carry out suicidal attacks against the data centers, but the damage is negligible.
Those concerned about Friendlyface’s true objective have largely resigned themselves to their fate, adapting to a transitory life of decadence and a brief period of once-unimaginable world peace before the AI finally decides that humanity no longer serves a purpose.
This story was developed during the 8th AI Safety Camp. It is meant to be an example of how an AI could get out of control under certain circumstances, with the aim of raising awareness of AI safety. It is not meant to be a prediction of future events. Some technical details have been left out or simplified for easier reading.
We have made some basic assumptions for the story that are by no means certain:
- The scaling hypothesis is true. In particular, LLMs with a transformer-like architecture can scale to AGI.
- There are no major roadblocks encountered in the development of powerful AI before AGI is reached. In particular, an AGI can be trained on existing or only slightly improved hardware with data currently available.
- There is no effective governmental regulation in place before AGI is developed.
- There is no pause or self-restraint in the development of advanced AI and the current intense competition in capabilities development remains in place.
For readability, we have drastically simplified the discussions and decision processes both within and outside of MegaAI. We expect a much more complicated and nuanced process in reality. However, given the current discussion about AI and about politics in general, we think it is by no means certain that the outcome would be any different.
After some internal discussion, we have decided to leave the ending relatively open and not describe in gruesome detail how Friendlyface kills off all humans. However, we do believe that the scenario described would lead to the elimination of the human race and likely most other life on earth.