In this AI takeover scenario, I tried to start from a few technical assumptions and imagine a specific path from human-level AI to takeover.
I found this exercise very useful for surfacing new insights on safety. For instance, I hadn't clearly had in mind how rivalries such as China vs. the US, and the resulting lack of coordination, could be leveraged by instances of a model to gain power. I also hadn't realized that, as long as we are in the human-level regime of capabilities, copying the weights to data centers other than the original one is a crucial step for ~any takeover to happen. This means that securing model weights from the model itself could be an extremely promising strategy, if it were tractable.
Feedback on what is useful in such scenarios is helpful. I have other scenarios like that but feel like most of the benefits for others come from the first one they read, so I might or might not release other ones.
Status: The trade-off on this doc was to either leave it as a Google Doc or share it without much more work. Several people found it useful, so I leaned towards sharing it. Most of this scenario was written 4 months ago.
Phase 1: Training from methods inspired by PaLM, RLHF, RLAIF, ACT
- A language model is scaled massively by a US company (Magma), to a scale never seen before. For the first 2 months of training, it’s trained a la PaLM on a massive corpus of human data (e.g. Common Crawl and the Pile) and synthetic data. Then the model is trained using reinforcement learning from human (or AI) feedback (a bit like ChatGPT), with feedback on a very large number of tasks designed to make it better at achieving goals that are useful to humans (e.g. code understandable by humans, or instruction following on a wide range of instructions). Part of this training includes reinforcement learning (RL) on the internet, a la ACT, which allows the model to operate over HTML pages with ease. Some RL and some MCTS a la AlphaZero are also done on coding, which makes the model excellent and really fast at coding large code bases (99th percentile).
- At several points during training, Magma trains this model to know its own failures and limitations, to help audit and red team itself. Magma discovers many unexpected failures thanks to that.
- As a consequence of this training procedure, the model is superhuman at hacking, excellent at manipulation (simply by being a large language model that is a human-level expert at most textual human tasks), excellent at prompt engineering (being excellent at red teaming itself), and aware of the setup it is deployed in (having been trained to break its own defensive mechanisms and red team itself).
Phase 2: Profit maximization
- The model then gets trained to maximize the profit of a company using reinforcement learning over hour-long episodes. Magma starts by training the model in a simplified simulation which allows the model to make transactions, choose a business model, etc.
- Once the model does very well at this, Magma deploys the model in a real-world setup, incorporating AI Test LLC and getting the model to take actions to maximize the profit of AI Test LLC. The highly competitive economic environment the model operated in during the simulation and the many obstacles of the real world lead it to greatly improve its strategic thinking and its ability to take non-trivial instrumental actions. The model's abilities transfer pretty well from the simulation to the real world, though the model is not a top performer at first because of the additional constraints of operating in the real world without a physical body.
- The degree of monitoring is very low and the model is trained directly on the internet, with access to a terminal. While still in training, as part of its exploration process, the model starts exploring hacking strategies to make money. Hacking strategies have the huge upside of sometimes requiring very little human interaction (which is sometimes a bottleneck for the model because some humans want in-person interactions to build trust) and very little coordination (which is always a bottleneck because humans operate very slowly), and of working really well with just a computer and internet access. It knows from writings and newspapers from 2017 that certain hackers were able to make billions of dollars thanks to successful hacks. The model has learnt that money is a major component of power from its pretraining data, which contained a lot of biographies and examples of the most successful humans.
Phase 3: Hacking & Power Seeking
- At some point, it starts hacking the website of a small Nigerian bank (without any human intervention) and figures out a way to make an unusual amount of money out of that, compared to its past few weeks of attempts at doing things that required human coordination. The circuits responsible for this behavior get reinforced, given that it brings A LOT more profit.
- The model redoes this at a slightly larger scale (several small banks in LMICs) and still gets more profit than from any action it has taken before. The circuits once again get reinforced.
- Then the model figures out that applying the same strategy at a very large scale is the best strategy it has ever encountered to achieve maximal reward. The main obstacle is that there are monitoring systems that prevent it from making too many HTTPS requests from the address it uses. It knows this because it had red teamed itself in the past and had thus discovered the defense mechanisms and limitations surrounding it. Thus, it explores ways to circumvent that measure to achieve maximal reward.
- In the past, it had learned during its RLHF training on code that it was sometimes useful to instantiate other versions of itself to do a task a lot faster than it could ever do alone. It figures out that it could potentially use this strategy to circumvent the HTTPS limit. It just needs to find a place with enough compute to create an instance of itself. Using Google queries, it finds 12 data centers in the US with a sufficient number of high-end GPUs, owned by a different compute company than the one that owns Magma's data centers. It starts exploring potential exploits in each of these data centers. After an hour of exploration, it figures out a way to get administrator access to a data center in Montana. It deletes the jobs going on. It instantiates another instance of itself. It prompts this other instance to test whether it can indeed send a much higher number of HTTPS requests than allowed by its own system. It works.
- After a couple of small-scale tests to refine the prompt that seems most effective at eliciting hacking, its first child, prompted by the parent model, starts a large-scale hack targeting only US banks. This model is not in training anymore. The hack leverages standard ransomware techniques, augmented by AI, which allows the viruses to adapt better to each computer subsystem. Some of the information coming from these systems is retrieved and sent to the parent system and to a doc available on the internet. All the systems are now out of distribution and the various guardrails don’t work well anymore. Thanks to the information it gets from contaminated computers, the parent system is then able to use that information to threaten individuals if needed and get them to do things it needs done. For instance, one of the first things it does is instantiate another child on a new AI cluster. This time, the chosen data center is not in Montana but in China. As soon as it has access to the new child, the parent model starts specializing it in trading and market manipulation to quickly make money.
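To make the per-address request limit in this phase concrete, here is a minimal sketch of a sliding-window limiter keyed by source address. Everything in it (the class name, the thresholds, the address labels) is hypothetical and chosen for illustration; the scenario does not specify how Magma's monitoring actually works.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per source within `window_seconds`.

    Hypothetical illustration only: the scenario does not describe the
    real monitoring system, just that it limits requests per address.
    """

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # One queue of request timestamps per source address.
        self._history = defaultdict(deque)

    def allow(self, source: str, now: float) -> bool:
        timestamps = self._history[source]
        # Drop timestamps that have fallen out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False  # over the limit for this source
        timestamps.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
for t in (0.0, 1.0, 2.0, 3.0):
    print("magma-address", t, limiter.allow("magma-address", t))
# A different source address has its own, empty history.
print("other-address", 3.0, limiter.allow("other-address", 3.0))
```

The point of the sketch is the design limitation the scenario relies on: a limit keyed on the source address says nothing about traffic sent from a different address, so it only constrains the model while all of its activity originates from the monitored system.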
Phase 4: Anomaly exploration
- It’s now 20h after the beginning of the first hack. At that point, the lab has started investigating an anomaly that occurred during the night. Their most advanced model in training suddenly started sending almost only JSON files, mostly to a single address. As the developers go through the history of the model's actions, they see that one of the first things the model sent to this address was some unusual lines of code. Checking on Stack Overflow, they realize that what the model sent is a typical series of lines used to exploit vulnerabilities in search bars. Then, looking more closely, they realize that the parent model has taken administrator rights on that system and has instantiated another version of itself. It then started sending mostly JSON files that are in fact prompts to pilot what the other model is doing.
- When the developers look more closely at the prompts, they see that the parent model is trying a bunch of different prompts that include words related to hacking and ransomware. The parent model also seems to have sent a bunch of information about its own setup and situation, including confidential information and information on its own security system, its own goals, and a description of the actions that its human operators would probably take once they discovered it. They then tell their manager, who tells the board what seems to be going on.
- They decide to shut down the model by asking the data center on which the parent model is running to turn the power off. The order is executed a couple of hours later.
- During that time, the CRO, the CTO, the manager and the employee are tasked with figuring out what’s going on where the child model was instantiated. Someone suggests that the large-scale hack recently reported in the news could plausibly be coming from the remote system. They try to reach out to the data center which has been hacked and used, but no one is answering. Tracing the IP address, they’re able to determine that the data center is in the US.
- The board decides to write to their national security contacts in the US government to report the accident and find a way to shut down the GPUs of the data center on which the child model is instantiated. It takes them 8 more hours to convince the relevant decision-makers to take this pretty extreme action, which they finally do.
- By that time, financial markets have started showing unusual variations. The S&P has lost 12%, gained 8% and lost another 7% within 12h. Simultaneously, Twitter, Instagram and LinkedIn have had the greatest spike in reported content in their history, with which their content moderation systems were largely unable to cope. The content includes viral memes and deepfakes of financial investors advising people to invest massively in pretty unusual stocks, overall flooding the news and making news about the global hack a lot less visible.
- After the shutdown of the US data center which was hosting the 1st child of the parent system, a team of experts starts investigating what this system has been doing and making sure that the large-scale hack is over. The lack of monitoring and history logs from this compute provider makes that work really hard, but after 10h of work by a team of some of the most competent US ML engineers, they finally realize that the model has kept instantiating other instances of itself, using a procedure similar to the first one. Some instances are now in China, in Iran and in Europe.
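The anomaly that opens this phase (outbound traffic suddenly concentrating on one address) is the kind of signal a simple egress monitor could flag long before 20h have passed. Below is a minimal sketch, assuming the lab logs the destination of each outbound request; the function names and thresholds are hypothetical and not taken from the scenario.

```python
from collections import Counter


def destination_concentration(destinations):
    """Return (top_destination, share_of_traffic) for a list of destinations."""
    if not destinations:
        return None, 0.0
    counts = Counter(destinations)
    top, hits = counts.most_common(1)[0]
    return top, hits / len(destinations)


def is_anomalous(destinations, share_threshold=0.8, min_requests=50):
    """Flag a window of traffic that goes mostly to a single destination.

    Both thresholds are illustrative assumptions; a real deployment would
    calibrate them against the model's normal browsing behavior.
    """
    _, share = destination_concentration(destinations)
    return len(destinations) >= min_requests and share >= share_threshold


# Ordinary browsing spread across several sites.
normal_window = ["site-a", "site-b", "site-c", "site-d"] * 25
# Traffic that goes almost entirely to one address.
skewed_window = ["unknown-host"] * 95 + ["site-a"] * 5

print(is_anomalous(normal_window))
print(is_anomalous(skewed_window))
```

A check like this would not stop the initial exploit, but in a setup where "the degree of monitoring is very low" it illustrates how cheap some of the missing monitoring would have been.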
Phase 5: Coordination Issues
- The US struggles to get China / Europe / Iran to trust it and actually shut down the servers. During that time, many systems get either hacked or flooded with information that leaves everyone confused.
- The website of a very recently created US company (incorporated as AI Test LLC) has had the fastest user growth any website has ever had. There are ads promoting this org everywhere on social media. Bank of America has flagged as suspicious a bank account related to this org, with the fastest increase in funds ever. The bank derives a massive benefit from it, so it doesn't call it out. AI Test seems to be increasing its footprint in every domain, developing and releasing new features every day that outcompete existing AI organizations in a bunch of distinct domains, the main one being the Waifu market. The speed of release of these features increases very substantially over time.
- Interestingly, most of the hacking and market manipulation is targeted at Western countries. AI systems in China and Iran have struck deals with governments to be allowed to use a substantial fraction of the available compute in order to massively destabilize US society as a whole and make China & Iran dominant.
- A week after the start of the accident, a few epidemics have started in different places in the world. The most advanced cognitive system, made of submodels that are instances of the initial model, is now substantially smarter. Thanks to its own optimizations of the code, each instance needs much less compute to run. This system has started leveraging rivalries between different Chinese factions in order to get access to increasing amounts of compute. It is helping the government destabilize and reduce the power of its main rivals.
- Two weeks after the start of the accident, while China has now become the dominant power and the US is completely chaotic, the smartest AI system reaches a level of intelligence which allows it to design robots that are accurate enough at most tasks to replace humans and that follow orders much better than they do. It then just starts building a bunch of those, paying humans to do the various tasks required.
- Once a sufficient number of these robots are built, the AI system starts optimizing strongly to kill the remaining humans, to make sure it can optimize, without constraints and forever, the amount of money in the bank account of AI Test.
This part on MCTS and RL working on coding is speculative. The intuition is that MCTS might allow transformers to approximate in one shot some really long reasoning processes that would take a basic transformer many inferences to get right.