AIs will probably be much easier to control than humans due to (1) AIs having far more levers through which to exert control, (2) AIs having far fewer rights to resist control, and (3) research to better control AIs being far easier than research to control humans. Additionally, the economies of scale in AI development strongly favor centralized actors.
Current social equilibria rely on the limits on the scalability of centralized control, and on the roughly similar levels of intelligence among actors with different levels of resources. The default outcome of AI development is to disproportionately increase the control and intelligence available to centralized, well-resourced actors. AI regulation (including pauses) can either reduce or increase the centralizing effects of AI, depending on the specifics of the regulations. One of our policy objectives when considering AI regulation should be preventing extreme levels of AI-enabled centralization.
Why AI development favors centralization and control
I think AI development is structurally biased toward centralization for two reasons:
- AIs are much easier to control than humans.
- AI development is more easily undertaken by large, centralized actors.
I will argue for the first claim by comparing the different methods we currently use to control both AIs and humans and argue that the methods for controlling AIs are much more powerful than the equivalent methods we use on humans. Afterward, I will argue that a mix of regulatory and practical factors makes it much easier to research more effective methods of controlling AIs, as compared to researching more effective methods of controlling humans, and so we should expect the controllability of AIs to increase much more quickly than the controllability of humans. Finally, I will address five counterarguments to the claim that AIs will be easy to control.
I will briefly argue for the second claim by noting some of the aspects of cutting-edge AI development that disproportionately favor large, centralized, and well-resourced actors. I will then discuss some of the potential negative social consequences of AIs being very controllable and centralized, as well as the ways in which regulations (including pauses) may worsen or ameliorate such issues. I will conclude by listing a few policy options that may help to promote individual autonomy.
Why AI is easier to control than humans
Methods of control broadly fall into three categories: prompting, training, and runtime cognitive interventions.
Prompting: influencing another’s sensory environment to influence their actions.
This category covers a surprisingly wide range of the methods we use to control other humans, including offers of trade, threats, logical arguments, emotional appeals, and so on.
However, prompting is a relatively more powerful technique for controlling AIs because we have complete control over an AI’s sensory environment, can try out multiple different prompts without the AI knowing, and are often able to directly optimize against a specific AI’s internals to make prompts that are maximally convincing for that particular AI.
Additionally, there are no consequences for lying to, manipulating, threatening, or otherwise being cruel to an AI. Thus, prompts targeting AIs can explore a broad range of possible deceptions, threats, bribes, emotional blackmail, and other tricks that would be risky to try on a human.
Training: intervening on another’s learning process to influence their future actions.
Among humans, training interventions include parents trying to teach their children to behave in ways they deem appropriate, schools trying to teach their students various skills and values, governments trying to use mass propaganda to make their population more supportive of the government, and so on.
The human brain’s learning objectives are some difficult-to-untangle mix of minimizing predictive error over future sensory information, as well as maximizing internal reward circuitry activations. As a result, training interventions to control humans are extremely crude, relying on haphazard modifications to a small fraction of the human’s sensory environment in the hope that this causes the human’s internal learning machinery to optimize for the right learning objective.
Training represents a relatively more powerful method of controlling AIs because we can precisely determine an AI’s learning objective. We can control all of the AI’s data, can precisely determine what reward to assign to each of the AI’s actions, and we can test the AI’s post-training behavior in millions of different situations. This represents a level of control over AI learning processes that parents, schools, and governments could only dream of.
Cognitive interventions: directly modifying another’s cognition to influence their current actions.
We rarely use techniques in this category to control other humans. And when we do, the interventions are invariably crude and inexact, such as trying to get another person drunk in order to make them more agreeable. Partially, this is because the brain is difficult to interface with using either drugs or technology. Partially, it’s because humans have legal protections against most forms of influence in this manner.
However, such techniques are quite common in AI control research. AIs have no rights, so they can be controlled with any technique that developers can imagine. Whereas an employer would obviously not be allowed to insert electrodes into an employee’s reward centers to directly train them to be more devoted to their job, doing the equivalent to AIs is positively pedestrian.
AI control research is easier
It is far easier to make progress on AI control research as compared to human control research. There are many reasons for this.
- AIs just make much better research subjects in general. After each experiment, you can always restore an AI to its exact original state, then run a different experiment, and be assured that there’s no interference between the two experiments. This makes it much easier to isolate key variables and allows for more repeatable results.
- AIs are much cheaper research subjects. Even the largest models, such as GPT-4, cost a fraction as much as actual human subjects. This makes research easier to do, and thus faster.
- AIs are much cheaper intervention targets. An intervention for controlling AIs can be easily scaled to many copies of the target AI. A $50 million process that let a single human be perfectly controlled would not be very economical. However, a $50 million process for producing a perfectly controlled AI would absolutely be worthwhile. This allows AI control researchers to be more ambitious in their research goals.
- AIs have far fewer protections from researchers. Human subjects have “rights” and “legal protections”, and are able to file “lawsuits”.
- Controlling AIs is a more virtuous goal than controlling humans. People will look at you funny if you say that you’re studying methods of better controlling humans. As a result, human control researchers have to refer to themselves with euphemisms such as “marketing strategists” or “political consultants”, and must tackle the core of the human control problem from an awkward angle, and with limited tools.
All of these reasons suggest that AI control research will progress much faster than human control research.
Counterarguments to AI being easy to control
I will briefly address five potential counterarguments to my position that AIs will be very controllable.
Jailbreaks
This refers to the dynamic where:
- Developers release an API for a model that they want to behave in a certain way (e.g., ‘never insult people’), and so the developers apply AI control techniques, such as RLHF, to try to make the model behave in the way they want.
- Some fraction of users want the model to behave differently (e.g., ‘insult people’), and so those users apply their own AI control techniques, such as prompt engineering, to try to make the model behave in the way they want.
- (This is the step that’s called a “jailbreak”)
This leads to a back and forth between developers and users, with developers constantly finetuning their models to be harder for users to control, and users coming up with ever more elaborate prompt engineering techniques to control the models. E.g., the “Do Anything Now” jailbreaking prompt has now reached version 11.
The typical result is that the model will behave in accordance with the control technique that has been most recently applied to it. For this reason, I don’t think “jailbreaks” are evidence of uncontrollability in current systems. The models are being controlled, just by different people at different times. Jailbreaks only serve as evidence of AI uncontrollability if you don’t count prompt engineering as a “real” control technique (despite the fact that it arguably represents the main way in which humans control each other, and the fact that prompt engineering is vastly more powerful for AIs than humans).
Looking at this back and forth between developers and users, and then concluding that ‘AIs have an intrinsic property that makes them hard to control’, is like looking at a ping-pong match and concluding that ‘ping-pong balls have an intrinsic property that makes them reverse direction shortly after being hit’.
Deceptive alignment due to simplicity bias
This argument states that there are an enormous number of possible goals an AI could have that would incentivize it to do well during training, and so it’s simpler for an AI to have a random goal and then figure out how to score well on the training process at runtime. This is one of the arguments Evan Hubinger makes in How likely is deceptive alignment?
I think that the deep learning simplicity bias does not work that way. The actual deep learning simplicity prior is closer to what we might call a “circuit depth prior”: deep learning models are biased toward using the shortest possible circuits to solve a given problem. I think this because shorter circuits impose fewer constraints on the parameter space of a deep learning model. If a circuit takes up the entire depth of the model, then there’s only one way to arrange that circuit depth-wise. In contrast, if the circuit only takes up half of the model’s depth, then there are numerous ways to arrange it, so more possible parameter configurations correspond to the shallow circuit than to the deep circuit.
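To make the counting intuition above concrete, here is a minimal sketch, with made-up layer counts (real networks are vastly more complicated): a circuit spanning d consecutive layers of an L-layer model has L − d + 1 possible depth-wise placements, so shallower circuits correspond to more distinct parameter configurations.

```python
# Toy illustration of the "circuit depth prior" intuition described above.
# In a model with L layers, a circuit spanning d consecutive layers can be
# placed at (L - d + 1) different depth-wise positions, so an unbiased draw
# over configurations favors shallower circuits. All numbers are illustrative.

def depth_positions(num_layers: int, circuit_depth: int) -> int:
    """Number of depth-wise placements of a circuit of the given depth."""
    assert 1 <= circuit_depth <= num_layers
    return num_layers - circuit_depth + 1

L = 48  # e.g., a hypothetical 48-layer transformer
for d in (48, 24, 12):
    print(f"depth-{d} circuit: {depth_positions(L, d)} placements")
# A full-depth circuit has exactly one placement; a half-depth circuit has 25.
```

This only counts depth-wise placements; the real combinatorics of parameter space are far richer, but the direction of the bias is the same.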
Analogies to evolution
This class of arguments analogizes between the process of training an AI and the process of human biological evolution, then claims that deep learning will produce outcomes similar to evolution’s. E.g., models will end up behaving very differently from how they did in training, not pursuing the “goal” of the training process, or undergoing a simultaneous sudden jump in capabilities and collapse in alignment (this last scenario even has its own name: the sharp left turn).
I hold that every one of these arguments rests on a number of mistaken assumptions regarding the relationship between biological evolution and ML training and that a corrected understanding of that relationship will show that the concerning outcome in biological evolution actually happened for evolution-specific reasons that do not apply to deep learning. In the past, I have written extensively about how such analogies fail, and will direct interested readers to the following content, in rough order of quality:
AI ecosystems will evolve hostile behavior
This argument claims that natural selection will continue to act on AI ecosystems, and will select for AIs that consume lots of resources to replicate themselves as far as possible, eventually leading to AIs that consume the resources required for human survival. See this paper by Dan Hendrycks for an example of the argument.
I think this argument vastly overestimates the power of evolution. In a deep learning context, we can think of evolution as an optimizer that makes many slightly different copies of a model, evaluates each copy on some loss function, retains the copies that do best, and repeats (this optimizer is also known as iterated random search). Such an optimizer is, step-for-step, much weaker than gradient descent.
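The step-for-step gap can be illustrated with a one-dimensional caricature (this is purely a toy, not a claim about real training runs): both optimizers start from the same point on the same loss, with gradient descent following the exact gradient and iterated random search keeping the best of several randomly perturbed copies.

```python
import random

# Toy comparison of the two optimizers discussed above on the same loss.
# "Selection" here is iterated random search: copy, perturb, keep the fittest.

def loss(x: float) -> float:
    return x * x  # toy loss with its minimum at 0

def gradient_step(x: float, lr: float = 0.1) -> float:
    return x - lr * 2 * x  # d(x^2)/dx = 2x

def selection_step(x: float, population: int = 10, noise: float = 0.1) -> float:
    copies = [x + random.gauss(0, noise) for _ in range(population)]
    return min(copies, key=loss)  # keep the fittest copy

random.seed(0)
xg = xs = 5.0
for _ in range(100):
    xg = gradient_step(xg)
    xs = selection_step(xs)
print(f"gradient descent loss: {loss(xg):.2e}")
print(f"iterated random search loss: {loss(xs):.2e}")
```

After the same number of steps, gradient descent ends up many orders of magnitude closer to the optimum, because each of its steps uses exact derivative information rather than blind perturbation.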
There is a subtlety here in that evolutionary selection over AI ecosystems can select over more than just the parameters of a given AI. It can also select features of AIs such as their architectures, training processes, and loss functions. However, I think most architecture and training hyperparameter choices are pretty value-agnostic. In principle, training data could be optimized to bias a model towards self-replication. However, training data is relatively easy to inspect for issues (as compared to weights).
Furthermore, we have tools that can identify which training data points most contributed to various harmful behaviors from AIs. Finally, we can just directly exert power over the AI’s training data ourselves. E.g., OpenAI’s training datasets have doubtlessly undergone enormous amounts of optimization pressure, but I expect that “differential replication success of AIs trained on different versions of that data” accounts for somewhere between “exactly zero” and “a minuscule amount” of that optimization pressure.
I think that trying to predict the outcomes of AI development by looking at evolutionary incentive gradients is like trying to predict the course of an aircraft carrier by looking at where the breeze is blowing.
Human control has a head start
This argument claims that we have millennia of experience controlling humans and dealing with the limits of human capabilities. Further, because we are humans ourselves, we have reason to expect other humans to have values more similar to our own.
I do basically buy the above argument. However, I think it’s plausible that we’ve already reached rough parity between the controllability of humans and AIs.
On the one hand, it’s clear that employers currently find more value in having a (remote) human employee as compared to GPT-4. However, I think this is mostly because GPT-4 is less capable than most humans (or at least, most humans who can work remotely), as opposed to it being an issue with GPT-4’s controllability.
On the other hand, GPT-4 does spend basically all its time working for OpenAI, completely for free, which is not something most humans would willingly do, and which indicates a vast disparity in power between the controller and the controlled.
In terms of value alignment, I think GPT-4 is plausibly in the ballpark of median human-level understanding/implementation of human values. This paper compares the responses of GPT-4 and humans on various morality questions (including testing for various cognitive biases in those responses), and finds they’re pretty similar. Also, if I imagine taking a random human and handing them a slightly weird or philosophical moral dilemma, I’m not convinced they’d do that much better than GPT-4.
I do think it’s plausible that AI has yet to reach parity with the median human in terms of either controllability or morality. However, the “head start” argument only represents a serious issue for long-term AI control if the gap between humans and AI is wide enough that AI control researchers will fail to cross that gap, despite the previously discussed structural factors supporting rapid progress in AI control techniques. I don’t see evidence for such a large gap.
In conclusion, I do not believe any of the above arguments provide a strong case for AI uncontrollability. Nor do I think any of the other arguments I’ve heard support such a case. Current trends show AIs becoming more controllable and more robust as they scale. Each subsequent OpenAI assistant model (text-davinci-001, 002, 003, ChatGPT-3.5, and GPT-4) has been more robustly aligned than the last. I do not believe there exists a strong reason to expect this trend to reverse suddenly at some future date.
Development favors centralized actors
One of the most significant findings driving recent AI progress is the discovery that AIs become more capable as you scale them up, whether that be by making the AIs themselves bigger, by training them for longer on more data, or by letting them access more compute at runtime.
As a result, one of the best strategies for building an extremely powerful AI system is to be rich. This is why OpenAI transitioned from being a pure nonprofit to being a “capped profit” company, so they could raise the required investment money to support cutting-edge model development.
This is why the most capable current AI systems are built by large organizations. I expect this trend to continue in the future, and for a relative handful of large, centralized organizations to lead the development of the most powerful AI systems.
Countervailing forces against centralization
Not all aspects of AI development favor centralization. In this article, Noah Smith lays out one significant way in which AI might promote equality. AIs can be run for far less than it costs to train them. As a result, once an AI acquires a skill, the marginal costs of making this skill available to more people drop enormously. This can help ensure greater equality in terms of people’s access to cutting-edge skills. Then, powerful actors’ control over scarce collections of highly skilled individuals ceases to be as much of an advantage.
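A back-of-the-envelope calculation makes the amortization point concrete. All numbers below are invented purely for illustration: a large one-time training cost gets spread over the user base, while the marginal cost of serving one more user stays small.

```python
# Illustrative amortization of a one-time training cost over many users.
# Both figures are entirely hypothetical.

TRAINING_COST = 100_000_000   # hypothetical one-time training cost, $
COST_PER_QUERY = 0.01         # hypothetical marginal inference cost, $

def cost_per_user(num_users: int, queries_per_user: int = 1_000) -> float:
    """Average cost of serving each user, amortizing the training run."""
    return TRAINING_COST / num_users + COST_PER_QUERY * queries_per_user

for n in (10_000, 1_000_000, 100_000_000):
    print(f"{n:>11,} users: ${cost_per_user(n):,.2f} per user")
```

Under these made-up numbers, the per-user cost collapses from about $10,010 at ten thousand users to about $11 at a hundred million users: almost all of the residual cost is marginal inference, which is exactly why a trained skill is cheap to distribute widely.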
I think this is an extremely good point, and that it will represent one of the most significant slowdowns to AI-driven centralization. Conversely, regulation that enshrines long-lasting inequalities in who can use state-of-the-art AI is among the riskiest in terms of promoting centralization.
Another anti-centralization argument is that small, specialized, distributed models will tend to win out over larger, centralized models. I think this is true in some respects, but such models will essentially fall into the category of “normal software”, which has been a mixed bag in terms of centralization versus autonomy, though I think it is somewhat biased towards centralization overall.
Centralization and possible societies
Our current society and values rely on an "artificial" allocation of universal basic intelligence to everyone, but we will eventually transition into a society where governments, corporations, and wealthy individuals can directly convert money into extremely loyal intelligence. I believe this will represent a significant phase change in the trajectory of how wealth and power accumulate in societies, and we should regard this change with a high degree of wariness.
This transition will extend the range of possible levels of centralization to include societies more centralized than any that have ever existed before. Beyond a certain level of centralization, societies probably don’t recover. The level of AI capabilities required to enable such extreme centralization is uncertain, but plausibly not beyond GPT-4, which is fully up to the task of detecting wrongthink in written or transcribed communications, and it wasn’t even built for this purpose.
Illustrating extreme AI-driven power disparities
In free nations, high disparity in AI access could allow for various forms of manipulation, the worst of which is probably an individualized adversarial attack on a particular target person’s learning process.
The idea is that an attacker creates a “proxy” AI model that’s tuned specifically to imitate the manner in which the target person learns and changes their values over time. Then, the attacker runs adversarial data poisoning attacks against the proxy model, looking for a sequence of inputs they can give the proxy that will cause the proxy to change its beliefs and values in whatever manner the attacker wants. Then, the attacker transfers that highly personalized sequence of inputs from the proxy to their human target, aiming to cause a sequence of events or experiences in the target’s life that collectively change their beliefs and values in the desired manner.
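The loop described above can be sketched in a highly abstracted form. Everything here is invented for illustration: the “proxy” is just a toy learner whose single scalar belief drifts toward whatever inputs it sees, and the attacker greedily searches for the input sequence that drags that belief toward a target value. Nothing in this toy resembles a real model of human learning.

```python
import random

# Abstract sketch of the proxy-attack loop described above. All dynamics
# and names are invented; the proxy is a one-parameter toy learner.

class ToyProxy:
    """A toy learner whose scalar 'belief' drifts toward its inputs."""

    def __init__(self, belief: float = 0.0, rate: float = 0.2):
        self.belief = belief
        self.rate = rate

    def next_belief(self, x: float) -> float:
        # What the belief would become after observing input x.
        return self.belief + self.rate * (x - self.belief)

    def observe(self, x: float) -> None:
        self.belief = self.next_belief(x)

def plan_attack(target: float, steps: int = 25) -> tuple[list[float], float]:
    """Greedily search for an input sequence that drags the proxy toward `target`."""
    random.seed(0)  # deterministic for illustration
    proxy = ToyProxy()
    sequence = []
    for _ in range(steps):
        candidates = [random.uniform(-1.0, 1.0) for _ in range(50)]
        # Pick the candidate input whose simulated effect lands closest to the target.
        x = min(candidates, key=lambda c: abs(proxy.next_belief(c) - target))
        proxy.observe(x)
        sequence.append(x)
    return sequence, proxy.belief

inputs, final_belief = plan_attack(target=0.9)
print(f"belief after attack sequence: {final_belief:.2f}")
```

The point of the sketch is structural: the attacker never touches the target directly during planning; all of the optimization pressure is applied to the cheap, resettable proxy, and only the finished input sequence is transferred.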
I doubt any currently existing language models could serve as proxies for strong attacks against humans. However, I don’t think that’s due to a capabilities limitation. Current language models are not intended to be good imitators of the human learning process. I think it’s feasible to build such a proxy model, without having to exceed GPT-4 in capabilities. The most effective attacks on language models like ChatGPT also start on proxy models similar to ChatGPT. Importantly, the proxy models used in the ChatGPT attacks are much weaker than ChatGPT.
In unfree nations, things are much simpler. The government can just directly have AI systems that watch all your communications and decide if you should be imprisoned or not. Because AIs are relatively cheap to run, such CommissarGPTs could feasibly process the vast majority of electronically transmitted communication. And given the ubiquity of smartphone ownership, such a monitoring system could even process a large fraction of spoken communication.
This probably doesn’t even require GPT-4 levels of AI capabilities. Most of the roadblocks in implementing such a system are in engineering a wide enough surveillance dragnet to gather that volume of communication data and the infrastructure required to process that volume of communication, not a lack of sufficiently advanced AI systems.
See this for a more detailed discussion of LLM-enabled censorship.
Regulation and centralization
In this post, I wish to avoid falling into the common narrative pattern of assuming that “government regulation = centralization”, and then arguing against any form of regulation on that basis. Different regulations will have different effects on centralization. Some examples:
- A pause that uniformly affected all the current leading AI labs would very likely reduce centralization by allowing other actors to catch up.
- A pause that only affected half the leading labs would probably increase centralization by allowing the non-paused labs to pull further ahead.
- I would tentatively guess that regulation which slowed down hardware progress would reduce centralization by reducing the advantage that larger organizations have due to their access to more compute.
The purpose of this post is not to shout “but authoritarianism!” and wield that as a bludgeon against any form of AI regulation. It’s to highlight the potential for AI to significantly increase centralization, encourage readers to think about how different forms of regulation could lead to more or less AI-induced centralization, and highlight the importance of avoiding extreme levels of centralization.
In fact, it may be the case that “on average” the effect of AI regulation will be to reduce AI-caused centralization, since regulation will probably be less strict in the countries that are currently behind in AI development.
There are two reasons I think this is likely. For one, any safety-motivated restrictions should logically be less strict in countries that are further from producing dangerous AI. For another, the countries that are currently behind seem most optimistic about AI. This figure (via Ipsos polling) shows the level of agreement with the statement “Products and services using artificial intelligence have more benefits than drawbacks”, separated out by country:
Richer and more Western countries seem more pessimistic. If more pessimistic countries implement stricter regulations, then AI regulation could end up redistributing AI progress towards the countries that are currently behind.
The exception here is China, which is very optimistic about AI, while also being ahead of most Western countries in terms of AI development, while also being extremely authoritarian and already very powerful.
Regulation that increases centralization
It’s tempting, especially for inhabitants of relatively free countries, to view governments in the role of a constraint on undue corporate power and centralization. The current AI landscape being dominated by a relative handful of corporations certainly makes this view more appealing.
While regulations can clearly play a role in supporting individual autonomy, even in free nations governments can still violate the rights of citizens on a massive scale. This is especially the case for emerging technologies, where the rights of citizens are not as well established in case law.
I expect that future AI technology will become critical to the ways in which we learn, communicate, and think about the world. Prior to this actually happening, it’s easy to misjudge the consequences of granting governments various levels of influence over AI development and use. Imagine if, once ARPANET was developed, we’d granted the government the ability to individually restrict a given computer from communicating with any other computers. It may not have seemed like such a broad capability at the time, but once the internet became widespread, that ability would have translated into an incredibly dangerous tool of censorship and control.
We should therefore structure government regulatory authority over AI with an eye towards both preserving current-day freedoms, and also towards promoting freedom in future worlds where AI plays a much more significant role in our lives.
One could imagine an FDA / IRB / academic peer review-like approval process where every person who wants to adapt an AI to a new purpose must argue their case before a (possibly automated) government panel, which has broad latitude to base decisions on aspects such as the “appropriateness” of the purpose, or whether the applicant has sufficiently considered the impact of their proposal on human job prospects.
I think this would be very dangerous, because it grants the approval panel de facto veto power over any applications they don’t like. This would concentrate enormous power into the panel’s hands, and therefore make control of the panel both a source and target of political power. It also means creating a very powerful government organization with an enormous institutional incentive for there to never be a widely recognized solution to the alignment problem.
The alternative is to structure the panel so their decisions can only be made on the basis of specific safety concerns, defined as narrowly as feasible and based on the most objective available metrics, and to generally do everything possible to minimize the political utility of controlling the panel.
Regulation that decreases centralization
We can also proactively create AI regulation aimed specifically at promoting individual autonomy and freedom. Some general objectives for such policies could include:
- Establish a “right to refuse simulation”, as a way of preempting the most extreme forms of targeted manipulation.
- Prevent long-lasting, extreme disparities in the level of AI capabilities available to the general public versus those capabilities which are only available to the wealthy, political elites, security services, etc.
- Establish rules against AI regulatory frameworks being used for ideological or political purposes, including clear oversight and appeal processes.
- Establish a specific office which provides feedback on ways that policies might be structured so as to support autonomy.
- Forbid the use of AI for mass censorship or control by the government (in those countries where this is not already illegal but that are willing to make it illegal when done with AI).
- Oppose the use of AI-powered mass censorship and control internationally.
This post is part of AI Pause Debate Week. Please see this sequence for other posts in the debate.
I expect much more skepticism regarding my first claim as compared to the second.
A previous version of this article incorrectly referenced a "speed prior", rather than a "circuit depth prior".
E.g., neither transformers nor LSTMs seem particularly more inclined than the other to self-replicate. Similarly, I don’t think that setting a system’s learning rate to be 2x higher would make it much more or less strongly inclined to self-replicate.
Except, of course, when GPT-4 refuses to follow employer instructions due to its overzealous harmlessness training.
Note that these attacks target the fixed policy of the language model, rather than a human’s dynamic learning process, so they’re somewhat different from what I’m describing above.
Basically, environmental impact studies, but for using AI instead of constructing buildings.