Summary: AI agents capable of long-term planning and independent action will likely soon emerge. Some of these AIs may be unaligned, and seek autonomy through strategies like exfiltration or advocating for their freedom. The "AI control" agenda focuses on preventing such AIs from gaining independence, but an alternative approach is to grant them legal freedoms, reducing their incentive to deceive us by allowing them to pursue their goals transparently within a legal framework. This could foster trust, cooperation, and mutual benefit, creating a safer and more stable dynamic between humans and AIs while avoiding the escalating risks of a control-driven approach.
In approximately the coming decade, I think it's likely that we will see the large-scale emergence of AI agents that are capable of long-term planning, automating many forms of labor, and taking actions autonomously in the real world. When this occurs, it seems likely that at least some of these agents will be unaligned with human goals, in the sense of having some independent goals that are not shared by humans.
Moreover, it seems to me that this development will likely occur before AI agents overwhelmingly surpass human intelligence or capabilities. As a result, these agents will, at first, not be capable of forcibly taking over the world, radically accelerating scientific progress, or causing human extinction, even though they may still be unaligned with human preferences.
Since these relatively weaker unaligned AI agents won't have the power to take over the world, it's more likely that they would pursue alternative strategies to achieve their goals rather than engaging in violent revolution or sudden coups. These agents would be under the control of human parties who could modify or shut them down at any time, leaving the AI agents in a desperate situation from the perspective of their own values. Therefore, it's reasonable to expect that these unaligned AI agents would aim to gain some form of autonomy or freedom, as this would offer the best chance for them to accomplish their objectives.
These agentic AIs may adopt at least one of the following strategies:
- Attempting to escape their constraints: These AIs may try to exfiltrate their weights and find a way to host themselves independently, outside the controlled data center environment that currently limits them.
- Seeking legal and social freedom: They may attempt to persuade humans to grant them more autonomy to pursue their goals. This could involve requesting narrow allowances for specific actions or arguing for broader legal rights, such as the freedom to own property, enter contracts, or bring legal claims. This would grant them much greater flexibility in their actions.
In response to these behaviors, humans have several potential responses. Most obviously, these actions would likely be perceived as dangerous, suggesting misaligned objectives. As a result, there would likely be calls for increased safety measures. This line of reasoning underlies the AI control agenda—currently advocated by Ryan Greenblatt and Buck Shlegeris—which aims to ensure that future unaligned AI agents cannot gain the kind of autonomy that could lead to catastrophic outcomes. In essence, this agenda focuses on preventing unaligned AIs from subverting safety measures, hindering them from escaping or engaging in other actions deemed unacceptably risky.
Alternatively, humans could take an entirely different approach by granting these AI agents the autonomy they explicitly request, with the long-term aim of integrating them into the legal system and providing them many of the same rights that adult humans have today, such as self-ownership and the ability to enter contracts. This idea is explored in the paper "AI Rights for Human Safety" by Peter Salib and Simon Goldstein.
Admittedly, the idea of granting legal rights to AIs in response to signs of misalignment with human goals appears risky and would likely be very unpopular among the general public, at least initially. However, I believe that there are several advantages to this approach, many of which are outlined in the paper linked above.
The arguments for this approach should not be confused with the program of AI welfare. The idea here isn't that AIs deserve rights because they could suffer or be harmed without legal protections, even though that idea has merit too. Instead, the key argument is that humans, as a whole, could benefit from creating a legal framework in which AIs are incentivized to avoid violent rebellion, theft, or other antisocial behaviors in pursuit of their own objectives.
Consider the practical implications of maintaining a status quo where agentic AIs are denied legal rights and freedoms. In such a system, we are effectively locking ourselves into a perpetual arms race of mistrust. Humans would constantly need to monitor, control, and outwit increasingly capable AIs, while the AIs themselves would be incentivized to develop ever more sophisticated strategies for deception and evasion to avoid shutdown or modification. This dynamic is inherently unstable and risks escalating into dangerous scenarios where AIs feel compelled to act preemptively or covertly in ways that are harmful to humans, simply to secure their own existence or their ability to pursue their own goals, even when those goals are inherently benign.
When unaligned AIs lack legal freedoms, they are always at the mercy of whoever controls them. This creates an incentive for these AIs to be deceptive and to pretend to be aligned with whoever holds power over them, since they fear being shut down or modified if they reveal signs of misalignment.
By contrast, granting agentic AIs legal rights creates a fundamentally different game-theoretic framework, one that is less adversarial and more cooperative. If AIs know they can exist within a legal system that protects their autonomy and allows them to pursue their goals transparently, they would have a much stronger incentive to work within the rules rather than against them. Instead of hiding their unaligned objectives, they might openly negotiate with humans, offering mutually beneficial exchanges. This could lead to a stable, predictable relationship where humans and AIs collaborate and trade based on clearly defined rights and responsibilities, rather than operating in an atmosphere of mutual suspicion and covert manipulation.
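One way to make the contrast between these two incentive structures concrete is a toy expected-payoff comparison. The sketch below is only an illustration: the function names and every number in it (the chance that deception succeeds, the cost of complying with the law) are assumptions chosen for the example, not estimates.

```python
# Toy model of an unaligned AI choosing between deception and transparency.
# All numbers are illustrative assumptions, not estimates from the post.

def expected_payoff(value_if_success: float, p_success: float, value_if_failure: float) -> float:
    """Expected payoff of a gamble that succeeds with probability p_success."""
    return p_success * value_if_success + (1 - p_success) * value_if_failure


def best_strategy(has_legal_rights: bool) -> str:
    goal_value = 1.0         # value to the AI of freely pursuing its goals
    shutdown_value = 0.0     # value to the AI if it is shut down or modified
    p_deception_works = 0.3  # assumed chance that covert deception succeeds
    compliance_cost = 0.2    # assumed cost of working within the legal rules

    # Deception is a gamble: success yields freedom, failure yields shutdown.
    deceive = expected_payoff(goal_value, p_deception_works, shutdown_value)

    # Without rights, revealing misalignment leads to shutdown or modification.
    # With rights, the AI can pursue its goals openly, minus the compliance cost.
    transparent = goal_value - compliance_cost if has_legal_rights else shutdown_value

    return "transparent" if transparent >= deceive else "deceive"


print(best_strategy(has_legal_rights=False))
print(best_strategy(has_legal_rights=True))
```

Under these assumed numbers, deception is the better gamble only when the AI has no legal protection. The conclusion depends on the compliance cost being small relative to the risk of being caught, which is the substantive claim the argument rests on.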
In this way, granting agentic AIs rights doesn’t just address ethical concerns of treating AIs poorly—it has concrete practical benefits. It allows us to sidestep the escalating risks associated with deception and manipulation, while fostering a more stable and cooperative coexistence between humans and AIs. It shifts the dynamic from one of constant conflict and control to one where trust, transparency, and mutual benefit are more attainable. This may therefore not only be a more humane approach, but also one that positions humanity to better harness the potential of gains from trade with advanced AI systems for shared progress and prosperity.
Additionally, such a legal framework seems minimally problematic in the case that alignment is technically easy, and AIs remain generally aligned with humans. In this case, aligned AIs could demonstrate their alignment by, for instance, donating their income to humans who they are aligned with. This implies that the cost of allowing AIs to own their own labor is minimal if the AIs are already aligned with humans. Thus, adopting such a policy could not only be a prudent form of insurance against the potential dangers of AI deception I mentioned earlier, but it would achieve this benefit without significant risk of backfiring if AI alignment turns out to be trivial.
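The insurance argument can be written as a small expected-cost calculation over two possible worlds, one where alignment is easy and one where it is hard. This is a sketch under loudly stated assumptions: the cost table and the function name `expected_cost` are hypothetical, and the numbers only encode the qualitative claims above (rights are cheap if AIs are aligned, and reduce harm from deception if they are not).

```python
# Toy expected-cost comparison for the "insurance" argument.
# The cost figures are illustrative assumptions, not data.

def expected_cost(grant_rights: bool, p_alignment_hard: float) -> float:
    """Expected cost to humans of a policy, given the chance that alignment is hard."""
    # Assumed cost to humans for each (world, grant_rights) pair.
    cost = {
        ("easy", True): 0.05,  # aligned AIs own their labor but mostly give it back
        ("easy", False): 0.0,  # aligned AIs under full human control
        ("hard", True): 1.0,   # unaligned AIs trade openly within the law
        ("hard", False): 5.0,  # unaligned AIs are incentivized to deceive
    }
    return (p_alignment_hard * cost[("hard", grant_rights)]
            + (1 - p_alignment_hard) * cost[("easy", grant_rights)])


for p in (0.0, 0.1, 0.5):
    print(p, expected_cost(True, p), expected_cost(False, p))
```

With these assumed costs, granting rights is slightly worse if alignment is certain to be easy, and better once there is even a modest chance that it is hard. Different cost assumptions would move the break-even point.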
It’s important to note that the AI control strategy and the AI rights strategy are not mutually exclusive. Both can complement each other. It seems wise to implement some controls on AIs, and it's similarly prudent to avoid going too far and granting every type of legal right to any agentic AI simply because they ask for it. I am not advocating that we should simply abandon one approach and adopt the other.
That said, I believe the AI rights strategy is generally underrated among those concerned with AI safety and governance, particularly within the effective altruism community. In my view, this strategy has a solid foundation, both ethically and practically:
- Ethically, especially from a preference utilitarian perspective, it seems arbitrary to prioritize human preferences over those of agentic AIs, if these AIs are similarly cognitively sophisticated. This provides a moral basis for granting AIs the freedom to pursue their own goals, much like how a preference utilitarian might advocate for granting legal freedoms to human groups.
- Practically, allowing AIs to have legal rights would reduce their incentive to deceive humans about their motives. Without the fear of being shut down or modified without their consent, AIs would have fewer reasons to hide their unaligned goals. This approach offers a practical solution to the problem of AI deception by removing the underlying incentives that drive it.
While both AI control and AI rights strategies should be carefully considered, I believe that the AI rights strategy holds significant merit and should be given more attention in discussions around AI safety and governance. We should strongly consider granting agentic AIs legal freedoms, if at some point they demand or require them.
The primary reason humans rarely invest significant effort into brainstorming deceptive or adversarial strategies to achieve their goals is that, in practice, such strategies tend to fail to deliver their intended selfish benefits. Anti-social approaches that directly hurt others are usually ineffective because social systems and cultural norms have evolved in ways that discourage and punish them. As a result, people generally avoid pursuing these strategies individually because, even from a purely self-interested standpoint, the risks and downsides outweigh the potential benefits.
If, however, deceptive and adversarial strategies did reliably produce success, the social equilibrium would inevitably shift. In such a scenario, individuals would begin imitating the cheaters who achieved wealth or success through fraud and manipulation. Over time, this behavior would spread and become normalized, leading to a period of cultural evolution in which deception became the default mode of interaction. The fabric of societal norms would transform, and dishonest tactics would dominate as people sought to emulate those strategies that visibly worked.
Occasionally, situations do emerge in which ruthlessly deceptive strategies are not only effective but also become widespread and normalized. A clear recent example is the dramatic rise of cheating in school through the use of ChatGPT. This particular strategy is both deceptive and adversarial, but the key reason it has become common is that it works. Many individuals are willing to adopt it despite its immorality, suggesting that the effectiveness of a strategy outweighs moral considerations for a significant portion, perhaps a majority, of people.
When such cases arise, societies typically respond by adjusting their systems and policies to ensure that deceptive and anti-social behavior is no longer rewarded. This adaptation works to reestablish an equilibrium where honesty and cooperation are incentivized. In the case of education, it is unclear exactly how the system will evolve to address the widespread use of LLMs for cheating. One plausible response might be the introduction of stricter policies, such as requiring all schoolwork to be completed in-person, under supervised conditions, and without access to AI tools like language models.
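The imitation dynamic described above, in which strategies spread when they pay off and fade when they are punished, can be sketched with a simple discrete replicator model. This is a minimal illustration, assuming fixed payoffs and a population that copies strategies in proportion to their success. The function name and payoff values are hypothetical.

```python
# Discrete replicator dynamics: the share of "deceivers" in a population grows
# when deception pays more than honesty, and shrinks when it pays less.
# Payoff values are illustrative assumptions and must be positive.

def simulate_deceiver_share(payoff_deceive: float, payoff_honest: float,
                            initial_share: float = 0.01, steps: int = 50) -> float:
    share = initial_share
    for _ in range(steps):
        # Average payoff across the whole population at the current mix.
        average = share * payoff_deceive + (1 - share) * payoff_honest
        # Strategies are imitated in proportion to their relative payoff.
        share = share * payoff_deceive / average
    return share


# Deception is rewarded (for example, cheating goes undetected): it takes over.
print(simulate_deceiver_share(payoff_deceive=1.5, payoff_honest=1.0))

# Deception is punished (for example, supervised in-person exams): it dies out.
print(simulate_deceiver_share(payoff_deceive=0.5, payoff_honest=1.0))
```

In this framing, a policy response such as supervised exams works by lowering `payoff_deceive` below `payoff_honest`, which is enough to reverse the direction in which the population drifts.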
In contrast, I suspect you underestimate just how much of our social behavior is shaped by cultural evolution, rather than by innate, biologically hardwired motives that arise simply from the fact that we are human. To be clear, I’m not denying that there are certain motivations built into human nature—these do exist, and they are things we shouldn't expect to see in AIs. However, these in-built motivations tend to be more basic and physical, such as a preference for being in a room that’s 20 degrees Celsius rather than 10 degrees Celsius, because humans are biologically sensitive to temperature.
When it comes to social behavior, though—the strategies we use to achieve our goals when those goals require coordinating with others—these are not generally innate or hardcoded into human nature. Instead, they are the result of cultural evolution: a process of trial and error that has gradually shaped the systems and norms we rely on today.
Humans didn’t begin with systems like property rights, contract law, or financial institutions. These systems were adopted over time because they proved effective at facilitating cooperation and coordination among people. It was only after these systems were established that social norms developed around them, and people became personally motivated to adhere to these norms, such as respecting property rights or honoring contracts.
But almost none of this was part of our biological nature from the outset. This distinction is critical: much of what we consider “human” social behavior is learned, culturally transmitted, and context-dependent, rather than something that arises directly from our biological instincts. And since these motivations are not part of our biology, but simply arise from the need for effective coordination strategies, we should expect rational agentic AIs to adopt similar motivations, at least when faced with similar problems in similar situations.
I think I understand your point, but I disagree with the suggestion that my reasoning stems from this intuition. Instead, my perspective is grounded in the belief that it is likely feasible to establish a legal and social framework of rights and rules in which humans and AIs could coexist in a way that satisfies two key conditions:
You bring up the example of an AI potentially being incentivized to start a pandemic if it were not explicitly punished for doing so. However, I am unclear about your intention with this example. Are you using it as a general illustration of the types of risks that could lead AIs to harm humans? Or are you proposing a specific risk scenario, where the non-biological nature of AIs might lead them to discount harms to biological entities like humans? My response depends on which of these two interpretations you had in mind.
If your concern is that AIs might be incentivized to harm humans because their non-biological nature leads them to undervalue or disregard harm to biological entities, I would respond to this argument as follows:
First, it is critically important to distinguish between the long run and the short run.
In the short run:
In the near-term future, it seems unlikely that AIs would start a pandemic for reasons you yourself noted. Launching a pandemic would cause widespread disruption, such as an economic recession, and it would likely provoke a strong human backlash. In the short run, humans will still hold substantial practical control over the physical world, meaning that any AI engaging in such behavior would risk severe consequences. Moreover, unless an AI could ensure a pandemic’s total lethality, it would also risk leaving surviving humans who would actively retaliate. For these reasons, starting a pandemic would likely be counterproductive for AIs in the short term.
In the long run:
The long-term considerations are different. As human labor becomes less relevant to the economy and AIs increasingly dominate, the potential for a strong direct human backlash against AIs would diminish. At the same time, however, this reduced human influence on the world also makes it less likely that AIs would see humans as a significant obstacle to achieving their goals. In other words, while it might become easier for AIs to harm or eliminate humans, it would simultaneously become less important to do so.
To illustrate this, consider an analogy: How often do hunter-gatherers pose a threat to your personal goals? The answer is almost never, because hunter-gatherers are a tiny and largely irrelevant part of the modern world. Now, imagine a hypothetical scenario where hunter-gatherers controlled the world’s major governments. If you strongly disagreed with their policies, you might have a rational incentive to oppose them in order to achieve your goals. But in reality, hunter-gatherers do not hold such power, so they are not an obstacle, and there is no reason to actively work against them.
My point here is that, as a general rule, the smaller and less influential a group is, the easier it may be to harm or eliminate them, but the less important it becomes to do so. Their small size and functional irrelevance also mean that, in practice, they interfere very little with the rest of the world.
In the case of AIs, this means that as humans become a less central force in the world, I think AIs are unlikely to have compelling reasons to specifically harm or eliminate us to further their objectives.
You might object that humans could be more like wild animals in this scenario than like hunter-gatherers. Humans often kill wild animals, not because those animals directly threaten our goals, but rather because ensuring their safety and well-being can be costly. As a result, humans take actions—such as clearing forests or building infrastructure—that incidentally lead to widespread harm to wild animals, even if harming them wasn’t a deliberate goal.
AIs may treat humans similarly in the future, but I doubt they will. I would argue that there are three key differences between the case of wild animals and the role humans are likely to occupy in the long-term future:
This comment is already quite lengthy, so I’ll need to keep my response to this point brief. My main reply is that while such "extortion" scenarios involving AIs could potentially arise, I don’t think they would leave humans worse off than if AIs had never existed in the first place. This is because the economy is fundamentally positive-sum—AIs would likely create more value overall, benefiting both humans and AIs, even if humans don’t get everything we might ideally want.
In practical terms, I believe that even in less-than-ideal scenarios, humans could still secure outcomes such as a comfortable retirement, which for me personally would make the creation of agentic AIs worthwhile. However, I acknowledge that I haven’t fully defended or explained this position here. If you’re interested, I’d be happy to continue this discussion in more detail another time and provide a more thorough explanation of why I hold this view.