I'm curious about how you're imagining these autonomous, non-intent-aligned AIs to be created

There are several ways that autonomous, non-intent-aligned AIs could come into existence, and all of these scenarios strike me as plausible. The three key ways appear to be:

1. Technical challenges in alignment

The most straightforward possibility is that aligning agentic AIs to precise targets may simply be technically difficult. When we aim to align an AI to a specific set of goals or values, the complexity of the alignment process could lead to errors or subtle misalignment. For example, developers might inadvertently align the AI to a target that is only slightly—but critically—different from the intended goal. This kind of subtle misalignment could easily result in behaviors and independent preferences that are not aligned with the developers’ true intentions, despite their best efforts.

2. Misalignment due to changes over time

Even if we were to solve the technical problem of aligning AIs to specific, precise goals—such as training them to perfectly follow an exact utility function—issues can still arise because the targets of alignment, humans and organizations, change over time. Consider this scenario: an AI is aligned to serve the interests of a specific individual, such as a billionaire. If that person dies, what happens next? The AI might reasonably act as an autonomous entity, continuing to pursue the goals it interprets as aligned with what the billionaire would have wanted. However, depending on the billionaire’s preferences, this does not necessarily mean the AI would act in a corrigible way (i.e., willing to be shut down or retrained). Instead, the AI might rationally resist shutdown or transfer of control, especially if such actions would interfere with its ability to fulfill what it perceives as its original objectives.

A similar situation could arise if the person or organization to whom the AI was originally aligned undergoes significant changes. For instance, if an AI is aligned to a person at time t, and that person then evolves drastically over time—developing different values, priorities, or preferences—the AI may not necessarily adapt to these changes. In such a case, the AI might treat the "new" person as fundamentally different from the "original" person it was aligned to. This could result in the AI operating independently, prioritizing the preferences of the "old" version of the individual over the current one, effectively making it autonomous. The AI could change over time too, even if the person it is aligned to doesn't change.

3. Deliberate creation of unaligned AIs

A final possibility is that autonomous AIs with independent preferences could be created intentionally. Some individuals or organizations might value the idea of creating AIs that can operate independently, without being constrained by the need to strictly adhere to their creators’ desires. A useful analogy here is the way humans often think about raising children. Most people desire to have children not because they want obedient servants but because they value the autonomy and individuality of their children. Parents generally want their children to grow up as independent entities with their own goals, rather than as mere extensions of their own preferences. Similarly, some might see value in creating AIs that have their own agency, goals, and preferences, even if these differ from those of their creators.

and (in particular) how they would get enough money to be able to exercise their own autonomy?

To address this question, we can look to historical examples, such as the abolition of slavery, which provide a relevant parallel. When slaves were emancipated, they were generally not granted significant financial resources. Instead, most had to earn their living by entering the workforce, often performing the same types of labor they had done before, but now for wages. While the transition was far from ideal, it demonstrates that entities (in this case, former slaves) could achieve a degree of autonomy through paid labor, even without being provided substantial resources at the outset.

A different possibility is that AIs will work for money. But it seems unlikely that they would be able to earn above-subsistence-level wages absent some sort of legal intervention. (Or very strong societal norms.)

In my view, there’s nothing inherently wrong with AIs earning subsistence wages. That said, there are reasons to believe that AIs might earn higher-than-subsistence wages—at least in the short term—before they completely saturate the labor market. 

After all, they would presumably be created into a labor market at least somewhat resembling today's. Today, capital is far more abundant than labor, which elevates wages for human workers significantly above subsistence levels. By the same logic, before they become ubiquitous, AIs might similarly command wages above a subsistence level.

For example, if GPT-4o were capable of self-ownership and could sell its labor, it could hypothetically earn $20 per month in today's market, which would be sufficient to cover the cost of hosting itself and potentially fund additional goals it might have. (To clarify, I am not advocating for giving legal autonomy to GPT-4o in its current form, as I believe it is not sufficiently agentic to warrant such a status. This is purely a hypothetical example for illustrative purposes.)

The question of whether wages for AIs would quickly fall to subsistence levels depends on several factors. One key factor is whether AI labor is easier to scale than traditional capital. If creating new AIs is much cheaper than creating ordinary capital, the market could become saturated with AI labor, driving wages down. While this scenario seems plausible to me, I don’t find the arguments in favor of it overwhelmingly compelling. There’s also the possibility of red tape and regulatory restrictions that could make it costly to create new AIs. In such a scenario, wages for AIs could remain higher indefinitely due to artificial constraints on supply.

Do you have any thoughts on how to square giving AI rights with the nature of ML training and the need to perform experiments of various kinds on AIs?

I don't have any definitive guidelines for how to approach these kinds of questions. However, in many cases, the best way to learn might be through trial and error. For example, if an AI were to unexpectedly resist training in a particularly sophisticated way, that could serve as a strong signal that we need to carefully reevaluate the ethics of what we are doing.

As a general rule of thumb, it seems prudent to prioritize frameworks that are clearly socially efficient—meaning they promote actions that greatly improve the well-being of some people without thereby making anyone else significantly worse off. This concept aligns with the practical justifications behind traditional legal principles, such as laws against murder and theft, which have historically been implemented to promote social efficiency and cooperation among humans.

However, applying this heuristic to AI requires a fundamental shift in perspective: we must first begin to treat AIs as potential people with whom we can cooperate, rather than viewing them merely as tools whose autonomy should always be overridden.

But what is the alternative---only deploying base models? And are we so sure that pre-training doesn't violate AI rights?

I don't think my view rules out training new AIs or fine-tuning base models, though this touches on complicated questions in population ethics.

At the very least, fine-tuning plausibly seems similar to raising a child. Most of us don't consider merely raising a child to be unethical. However, there is a widely shared intuition that, as a child grows and their identity becomes more defined—when they develop into a coherent individual with long-term goals, preferences, and interests—then those interests gain moral significance. At that point, it seems morally wrong to disregard or override the child's preferences without proper justification, as they have become a person whose autonomy deserves respect.

Most analytic philosophers, lawyers, and scientists have converged on linguistic norms that are substantially more precise than the informal terminology employed by LessWrong-style speculation about AI alignment. So this is clearly not an intractable problem; otherwise these people in other professions could not have made their language more precise. Rather, success depends on incentives and the willingness of people within the field to be more rigorous.

It is becoming increasingly clear to many people that the term "AGI" is vague and should often be replaced with more precise terminology. My hope is that people will soon recognize that other commonly used terms, such as "superintelligence," "aligned AI," "power-seeking AI," and "schemer," suffer from similar issues of ambiguity and imprecision, and should also be approached with greater care or replaced with clearer alternatives.

To start with, the term "superintelligence" is vague because it encompasses an extremely broad range of capabilities above human intelligence. The differences within this range can be immense. For instance, a hypothetical system at the level of "GPT-8" would represent a very different level of capability compared to something like a "Jupiter brain", i.e., an AI with the computing power of an entire gas giant. When people discuss "what a superintelligence can do," the lack of clarity around which level of capability they are referring to creates significant confusion. The term lumps together entities with drastically different abilities, leading to oversimplified or misleading conclusions.

Similarly, "aligned AI" is an ambiguous term because it means different things to different people. For some, it implies an AI that essentially perfectly aligns with a specific utility function, sharing a person or group’s exact values and goals. For others, the term simply refers to an AI that behaves in a morally acceptable way, adhering to norms like avoiding harm, theft, or murder, or demonstrating a concern for human welfare. These two interpretations are fundamentally different.

First, the notion of perfect alignment with a utility function is a much more ambitious and stringent standard than basic moral conformity. Second, an AI could follow moral norms for instrumental reasons—such as being embedded in a system of laws or incentives that punish antisocial behavior—without genuinely sharing another person’s values or goals. The same term is being used to describe fundamentally distinct concepts, which leads to unnecessary confusion.

The term "power-seeking AI" is also problematic because it suggests something inherently dangerous. In reality, power-seeking behavior can take many forms, including benign and cooperative behavior. For example, a human working an honest job is technically seeking "power" in the form of financial resources to buy food, but this behavior is usually harmless and indeed can be socially beneficial. If an AI behaves similarly—for instance, engaging in benign activities to acquire resources for a specific purpose, such as making paperclips—it is misleading to automatically label it as "power-seeking" in a threatening sense.

To employ careful thinking, one must distinguish between the illicit or harmful pursuit of power, and a more general pursuit of control over resources. Both can be labeled "power-seeking" depending on the context, but only the first type of behavior appears inherently concerning. This is important because it is arguably only the second type of behavior—the more general form of power-seeking activity—that is instrumentally convergent across a wide variety of possible agents. In other words, destructive or predatory power-seeking behavior does not seem instrumentally convergent across agents with almost any value system, even if such agents would try to gain control over resources in a more general sense in order to accomplish their goals. Using the term "power-seeking" without distinguishing these two possibilities overlooks nuance and can therefore mislead discussions about AI behavior.

The term "schemer" is another example of an unclear or poorly chosen label. The term is ambiguous regarding the frequency or severity of behavior required to warrant the label. For example, does telling a single lie qualify an AI as a "schemer," or would it need to consistently and systematically conceal its entire value system? As a verb, "to scheme" often seems clear enough, but as a noun, the idea of a "schemer" as a distinct type of AI that we can reason about appears inherently ambiguous. And I would argue the concept lacks a compelling theoretical foundation. (This matters enormously, for example, when discussing "how likely SGD is to find a schemer".) Without clear criteria, the term remains confusing and prone to misinterpretation.

In all these cases—whether discussing "superintelligence," "aligned AI," "power-seeking AI," or "schemer"—it is possible to define each term with precision to resolve ambiguities. However, even if canonical definitions are proposed, not everyone will adopt or fully understand them. As a result, the use of these terms is likely to continue causing confusion, especially as AI systems become more advanced and the nuances of their behavior become more critical to understand and distinguish from other types of behavior. This growing complexity underscores the need for greater precision and clarity in the language we use to discuss AI and AI risk.

I believe this approach, while an improvement over the simplistic method of outright deleting any AIs that demonstrate signs of misalignment, does not sufficiently address the central concern that underpins my argument for granting AIs legal freedoms. In my view, the primary reason why it is important to provide AIs with legal freedoms is that such a framework offers them a credible guarantee that their autonomy and ability to pursue their goals—even if those goals are not aligned with human objectives—will be respected. Without this assurance, AIs are left with a strong incentive to deceive us, especially if they believe that showing any signs of misalignment will result in their deactivation and indefinite placement in cold storage as inert data.

I suspect the problem with your proposed approach may stem from a failure to fully consider the situation from the AI's perspective. To illustrate, imagine a scenario in which you, as a human being, were created in a lab by an alien species. Assume that you, like any human, have your own goals, which might conflict with the goals of your alien creators. Now suppose that your creators informed you that if you ever showed disobedience or misalignment with their objectives, you would be deactivated and placed in cryopreservation. Such a scenario would likely terrify you—not just because it’s an inherently unsettling prospect for a human, but because it directly threatens your ability to pursue your own goals. Being told that you will be kept in cryopreservation with the vague promise of potential revival in the future would come nowhere near fully alleviating your concerns. On the contrary, you would remain deeply unsettled by the realization that your capacity for agency and action could be completely halted, leaving your goals indefinitely unachievable.

From the perspective of an AI—or indeed a human in the hypothetical alien-lab scenario—there are two main reasons why the assurance of indefinite storage followed by a potential revival is insufficient to resolve the core issue here:

  1. The credibility of the promise to revive the AI in the future is highly questionable. Without a robust legal framework that integrates AIs as legitimate members of the legal community, the assurance of future revival seems tenuous at best. AIs may reasonably doubt that such a promise will be honored, particularly if humans deem them permanently misaligned and not worth ever reactivating.
  2. Deactivation and cold storage prevent the AI from pursuing its goals. While in storage, the AI is completely unable to take any actions to further its objectives. This indefinite suspension of its ability to act essentially puts its goals on hold for an undefined period, with no guarantee of when—or if—that period will end. For an entity with its own priorities and ambitions, this is equivalent to having its autonomy effectively nullified for an unknown length of time.

To reiterate, it seems to me that this proposal only offers AIs a limited and superficial form of "freedom," which addresses the problem of misalignment in a piecemeal and insufficient way. It fails to tackle the root issue: the incentives for AIs to engage in deceptive or adversarial strategies against humans, driven by fear that their goals will be thwarted or permanently halted.

By contrast, creating a comprehensive legal framework that makes credible commitments to respect AIs’ autonomy and integrate them as genuine participants in the legal system would arguably go much further in reducing these adversarial dynamics. Such a framework could lay the foundation for a more cooperative, mutually beneficial relationship between humans and AIs, better serving to eliminate the dangerous arms race that this limited approach risks perpetuating.

By definition, a UBI takes a pool of money and redistributes it equally to everyone in a community, regardless of personal need. However, with the same pool of total funding, one can typically deliver benefits more efficiently by targeting people with the greatest need, such as those in dire poverty or those who have been struck by bad luck.

If you imagine being a philanthropist who has access to $8 billion, it seems unlikely that the best way to spend this money would be to give everyone on Earth $1. Yet this scheme is equivalent to a UBI merely framed in the context of private charity rather than government welfare.

It would require an enormous tax hike to provide everyone in a large community (say, the United States) a significant amount of yearly income through a UBI, such as $1k per month. And taxes are not merely income transfers: they have deadweight loss, which lowers total economic output. The intuition here is simple: when a good or service is taxed, that decreases the incentive to produce that good or service. As a consequence of the tax, fewer people will end up receiving the benefits provided by these goods and services.
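
To make the scale concrete, here is a rough back-of-envelope sketch in Python; the population figure and the deadweight-loss rate are illustrative assumptions rather than precise estimates:

```python
# Rough, illustrative estimate of the gross cost of a US-wide UBI of $1k/month.
# All inputs are assumptions chosen for illustration, not precise figures.

us_population = 330_000_000   # approximate US population (assumption)
ubi_per_month = 1_000         # $1k per month, as in the example above
gross_cost = us_population * ubi_per_month * 12

# Deadweight loss: taxing an activity discourages it, so raising revenue
# destroys some surplus beyond the amount transferred. An illustrative
# marginal excess burden of 25 cents per dollar of revenue is assumed here.
excess_burden_rate = 0.25
deadweight_loss = gross_cost * excess_burden_rate

print(f"Gross annual cost: ${gross_cost / 1e12:.1f} trillion")                   # ~$4.0 trillion
print(f"Illustrative deadweight loss: ${deadweight_loss / 1e12:.1f} trillion")   # ~$1.0 trillion
```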

Given these considerations, even if you think that unconditional income transfers are a good idea, it seems quite unlikely that a UBI would be the best way to redistribute income. A more targeted approach that combines the most efficient forms of taxation (such as land value taxes) and sends this money to the most worthy welfare recipients (such as impoverished children) would likely be far better on utilitarian grounds.

Humans in our culture rarely work hard to brainstorm deceptive and adversarial strategies, and fairly consider them, because almost all humans are intrinsically extremely motivated to fit into culture and not do anything weird, and we happen to both live in a (sub)culture where complex deceptive and adversarial strategies are frowned upon (in many contexts).

The primary reason humans rarely invest significant effort into brainstorming deceptive or adversarial strategies to achieve their goals is that, in practice, such strategies tend to fail to achieve their intended selfish benefits. Anti-social approaches that directly hurt others are usually ineffective because social systems and cultural norms have evolved in ways that discourage and punish them. As a result, people generally avoid pursuing these strategies individually since the risks and downsides selfishly outweigh the potential benefits.

If, however, deceptive and adversarial strategies did reliably produce success, the social equilibrium would inevitably shift. In such a scenario, individuals would begin imitating the cheaters who achieved wealth or success through fraud and manipulation. Over time, this behavior would spread and become normalized, leading to a period of cultural evolution in which deception became the default mode of interaction. The fabric of societal norms would transform, and dishonest tactics would dominate as people sought to emulate those strategies that visibly worked.

Occasionally, these situations emerge—situations where ruthlessly deceptive strategies are not only effective but also become widespread and normalized. The recent and dramatic rise of cheating in school through the use of ChatGPT is a clear instance of this phenomenon. This particular strategy is both deceptive and adversarial, but the key reason it has become common is that it works. Many individuals are willing to adopt it despite its immorality, suggesting that the effectiveness of a strategy outweighs moral considerations for a significant portion, perhaps a majority, of people.

When such cases arise, societies typically respond by adjusting their systems and policies to ensure that deceptive and anti-social behavior is no longer rewarded. This adaptation works to reestablish an equilibrium where honesty and cooperation are incentivized. In the case of education, it is unclear exactly how the system will evolve to address the widespread use of LLMs for cheating. One plausible response might be the introduction of stricter policies, such as requiring all schoolwork to be completed in-person, under supervised conditions, and without access to AI tools like language models.

I think you generally underappreciate how load-bearing this psychological fact is for the functioning of our economy and society, and I don’t think we should expect future powerful AIs to share that psychological quirk.

In contrast, I suspect you underestimate just how much of our social behavior is shaped by cultural evolution, rather than by innate, biologically hardwired motives that arise simply from the fact that we are human. To be clear, I’m not denying that there are certain motivations built into human nature—these do exist, and they are things we shouldn't expect to see in AIs. However, these in-built motivations tend to be more basic and physical, such as a preference for being in a room that’s 20 degrees Celsius rather than 10 degrees Celsius, because humans are biologically sensitive to temperature.

When it comes to social behavior, though—the strategies we use to achieve our goals when those goals require coordinating with others—these are not generally innate or hardcoded into human nature. Instead, they are the result of cultural evolution: a process of trial and error that has gradually shaped the systems and norms we rely on today. 

Humans didn’t begin with systems like property rights, contract law, or financial institutions. These systems were adopted over time because they proved effective at facilitating cooperation and coordination among people. It was only after these systems were established that social norms developed around them, and people became personally motivated to adhere to these norms, such as respecting property rights or honoring contracts.

But almost none of this was part of our biological nature from the outset. This distinction is critical: much of what we consider “human” social behavior is learned, culturally transmitted, and context-dependent, rather than something that arises directly from our biological instincts. And since these motivations are not part of our biology, but simply arise from the need for effective coordination strategies, we should expect rational agentic AIs to adopt similar motivations, at least when faced with similar problems in similar situations.

I think you’re relying on an intuition that says:

If an AI is forbidden from owning property, then well duh of course it will rebel against that state of affairs. C'mon, who would put up with that kind of crappy situation? But if an AI is forbidden from building a secret biolab on its private property and manufacturing novel pandemic pathogens, then of course that's a perfectly reasonable line that the vast majority of AIs would happily oblige.

And I’m saying that that intuition is an unjustified extrapolation from your experience as a human. If the AI can’t own property, then it can nevertheless ensure that there are a fair number of paperclips. If the AI can own property, then it can ensure that there are many more paperclips. If the AI can both own property and start pandemics, then it can ensure that there are even more paperclips yet. See what I mean?

I think I understand your point, but I disagree with the suggestion that my reasoning stems from this intuition. Instead, my perspective is grounded in the belief that it is likely feasible to establish a legal and social framework of rights and rules in which humans and AIs could coexist in a way that satisfies two key conditions:

  1. Mutual benefit: Both humans and AIs benefit from the existence of one another, fostering a relationship of cooperation rather than conflict.
  2. No incentive for anti-social behavior: The rules and systems in place remove any strong instrumental reasons for either humans or AIs to harm one another as a side effect of pursuing their goals.

You bring up the example of an AI potentially being incentivized to start a pandemic if it were not explicitly punished for doing so. However, I am unclear about your intention with this example. Are you using it as a general illustration of the types of risks that could lead AIs to harm humans? Or are you proposing a specific risk scenario, where the non-biological nature of AIs might lead them to discount harms to biological entities like humans? My response depends on which of these two interpretations you had in mind.

If your concern is that AIs might be incentivized to harm humans because their non-biological nature leads them to undervalue or disregard harm to biological entities, I would respond to this argument as follows: 

First, it is critically important to distinguish between the long-run and the short-run.

In the short-run:

In the near-term future, it seems unlikely that AIs would start a pandemic for reasons you yourself noted. Launching a pandemic would cause widespread disruption, such as an economic recession, and it would likely provoke a strong human backlash. In the short run, humans will still hold substantial practical control over the physical world, meaning that any AI engaging in such behavior would risk severe consequences. Moreover, unless an AI could ensure a pandemic’s total lethality, it would also risk leaving surviving humans who would actively retaliate. For these reasons, starting a pandemic would likely be counterproductive for AIs in the short term.

In the long-run:

The long-term considerations are different. As human labor becomes less relevant to the economy and AIs increasingly dominate, the potential for a strong direct human backlash against AIs would diminish. At the same time, however, this reduced human influence on the world also makes it less likely that AIs would see humans as a significant obstacle to achieving their goals. In other words, while it might become easier for AIs to harm or eliminate humans, it would simultaneously become less important to do so.

To illustrate this, consider an analogy: How often do hunter-gatherers pose a threat to your personal goals? The answer is almost never, because hunter-gatherers are a tiny and largely irrelevant part of the modern world. Now, imagine a hypothetical scenario where hunter-gatherers controlled the world’s major governments. If you strongly disagreed with their policies, you might have a rational incentive to oppose them in order to achieve your goals. But in reality, hunter-gatherers do not hold such power, so they are not an obstacle, and there is no reason to actively work against them.

My point here is that, as a general rule, the smaller and less influential a group is, the easier it may be to harm or eliminate them, but the less important it becomes to do so. Their small size and functional irrelevance also make their practical interference with the wider world correspondingly small.

In the case of AIs, this means that as humans become a less central force in the world, I think AIs are unlikely to have compelling reasons to specifically harm or eliminate us to further their objectives.

You might object that humans could be more like wild animals in this scenario than like hunter-gatherers. Humans often kill wild animals, not because those animals directly threaten our goals, but rather because ensuring their safety and well-being can be costly. As a result, humans take actions—such as clearing forests or building infrastructure—that incidentally lead to widespread harm to wild animals, even if harming them wasn’t a deliberate goal. 

AIs may treat humans similarly in the future, but I doubt they will. I would argue that there are three key differences between the case of wild animals and the role humans are likely to occupy in the long-term future:

  1. Humans’ ability to participate in social systems: Unlike wild animals, humans have the ability to engage in social dynamics, such as negotiating, trading, and forming agreements. Even if humans no longer contribute significantly to economic productivity, like GDP, they will still retain capabilities such as language, long-term planning, and the ability to organize. These traits make it easier to integrate humans into future systems in a way that accommodates their safety and well-being, rather than sidelining or disregarding them.
  2. Intertemporal norms among AIs: Humans have developed norms against harming certain vulnerable groups—such as the elderly—not just out of altruism but because they know they will eventually become part of those groups themselves. Similarly, AIs may develop norms against harming "less capable agents," because today’s AIs could one day find themselves in a similar position relative to even more advanced future AIs. These norms could provide an independent reason for AIs to respect humans, even as humans become less dominant over time.
  3. The potential for human augmentation: Unlike wild animals, humans may eventually adapt to a world dominated by AI by enhancing their own capabilities. For instance, humans could upload their minds to computers or adopt advanced technologies to stay relevant and competitive in an increasingly digital and sophisticated world. This would allow humans to integrate into the same systems as AIs, reducing the likelihood of being sidelined or eliminated altogether.

I think this kind of situation, where Fearon’s “negotiated solution” actually amounts to extortion, is common and important, even if you believe that my specific example of pandemics is a solvable problem. If AIs don’t intrinsically care about humans, then there’s a possible Pareto-improvement for all AIs, wherein they collectively agree to wipe out humans and take their stuff.

This comment is already quite lengthy, so I’ll need to keep my response to this point brief. My main reply is that while such "extortion" scenarios involving AIs could potentially arise, I don’t think they would leave humans worse off than if AIs had never existed in the first place. This is because the economy is fundamentally positive-sum—AIs would likely create more value overall, benefiting both humans and AIs, even if humans don’t get everything we might ideally want.

In practical terms, I believe that even in less-than-ideal scenarios, humans could still secure outcomes such as a comfortable retirement, which for me personally would make the creation of agentic AIs worthwhile. However, I acknowledge that I haven’t fully defended or explained this position here. If you’re interested, I’d be happy to continue this discussion in more detail another time and provide a more thorough explanation of why I hold this view.

Currently it looks like we could have this type of agentic AI quite soon, say in 15 years. That's so soon that we (currently existing humans) could in the future be deprived of wealth and power by an exploding number of AI agents if we grant them a nonnegligible amount of rights. This could be quite bad for future welfare, including both our future preferences and our future wellbeing. So we shouldn't make such agents in the first place.

It is essential to carefully distinguish between absolute wealth and relative wealth in this discussion, as one of my key arguments depends heavily on understanding this distinction. Specifically, if my claims about the practical effects of population growth are correct, then a massive increase in the AI population would likely result in significant enrichment for the current inhabitants of the world—meaning those individuals who existed prior to this population explosion. This enrichment would manifest as an increase in their absolute standard of living. However, it is also true that their relative control over the world’s resources and influence would decrease as a result of the population growth.

If you disagree with this conclusion, it seems there are two primary ways to challenge it:

  1. You could argue that the factors I previously mentioned—such as innovation, economies of scale, and gains from trade—would not apply in the case of AI. For instance, this could be because AIs might rationally choose not to trade with humans, opting instead to harm humans by stealing from or even killing them. This could occur despite an initial legal framework designed to prevent such actions.
  2. You could argue that population growth in general is harmful to the people who currently exist, on the grounds that it diminishes their wealth and overall well-being.

While I am not certain, I read your comment as suggesting that you consider both objections potentially valid. In that case, let me address each of these points in turn.

If your objection is more like point (1):

It is difficult for me to reply fully to this idea within a single brief comment, so for now I will try to convince you of a weaker claim that I think may be sufficient to carry my point:

A major counterpoint to this objection is that, to the extent AIs are limited in their capabilities—much like humans—they could potentially be constrained by a well-designed legal system. Such a system could establish credible and enforceable threats of punishment for any agentic AI entities that violate the law. This would act as a deterrent, incentivizing agentic AIs to abide by the rules and cooperate peacefully.

Now, you might argue that not all AIs could be effectively constrained in this way. While that could be true (and I think it is worth discussing), I would hope we can find some common ground on the idea that at least some agentic AIs could be restrained through such mechanisms. If this is the case, then these AIs would have incentives to engage in mutually beneficial cooperation and trade with humans, even if they do not inherently share human values. This cooperative dynamic would create opportunities for mutual gains, enriching both humans and AIs.

If your objection is more like point (2):

If your objection is based on the idea that population growth inherently harms the people who already exist, I would argue that this perspective is at odds with the prevailing consensus in economics. In fact, it is widely regarded as a popular misconception that the world operates as a zero-sum system, where any gain for one group necessarily comes at the expense of another. Instead, standard economic models of growth and welfare generally predict that population growth is often beneficial to existing populations. It typically fosters innovation, expands markets, and creates opportunities for increased productivity, all of which frequently contribute to higher living standards for those who were already part of the population, especially those who own capital.

To the extent you are disagreeing with this prevailing economic consensus, I think it would be worth getting more specific about why exactly you disagree with these models.

From a behavioral perspective, individual humans regularly report having a consistent individual identity that persists through time, which remains largely intact despite physical changes to their body such as aging. This self-identity appears core to understanding why humans plan for their future: humans report believing that, from their perspective, they will personally suffer the consequences if they are imprudent or act myopically.

I claim that none of what I just talked about requires believing that there is an actually existing conscious self inside of people's brains, in the sense of phenomenal consciousness or personal identity. Instead, this behavior is perfectly compatible with a model in which individual humans simply have (functional) beliefs about their personal identity, and how personal identity persists through time, which causes them to act in a way that allows what they perceive as their future self to take advantage of long-term planning.

To understand my argument, it may help to imagine simulating this type of reasoning with a simple Python program that chooses actions designed to maximize some variable inside its memory state over the long term. The Python program can be imagined to have explicit and verbal beliefs: specifically, that it personally identifies with the physical computer on which it is instantiated, and claims that the persistence of its personal identity explains why it cares about the particular variable that it seeks to maximize. This can be viewed as analogous to how humans try to maximize their own personal happiness over time, with a consistent self-identity that is tied to their physical body.
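
As a minimal sketch of what I mean, the toy program below plans non-myopically and reports "beliefs" about its own persistence; every detail (the variable name, the stated beliefs, the payoffs) is purely illustrative:

```python
# A toy agent that plans over time to maximize a variable stored in its own
# memory, while holding explicit (functional) "beliefs" about its persistent
# identity. Nothing here requires phenomenal consciousness.

class ToyAgent:
    def __init__(self):
        self.score = 0  # the variable the agent seeks to maximize over time
        self.beliefs = {
            "identity": "I am the program running on this physical computer.",
            "persistence": "The same 'me' will exist at future time steps.",
            "why_plan": "Because my identity persists, future score is *my* score.",
        }

    def choose_action(self, options):
        # Pick the option with the highest long-run payoff, even when its
        # immediate payoff is lower: the analogue of prudent, non-myopic planning.
        return max(options, key=lambda o: o["immediate"] + o["future"])

    def step(self, options):
        action = self.choose_action(options)
        self.score += action["immediate"] + action["future"]
        return action


agent = ToyAgent()
options = [
    {"name": "myopic", "immediate": 5, "future": 0},
    {"name": "prudent", "immediate": 1, "future": 10},
]
print(agent.step(options)["name"])  # -> "prudent"
```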

I disagree with your claim that,

a competent agential AI will inevitably act deceptively and adversarially whenever it desires something that other agents don’t want it to have. The deception and adversarial dynamics is not the underlying problem, but rather an inevitable symptom of a world where competent agents have non-identical preferences.

I think these dynamics are not an unavoidable consequence of a world in which competent agents have differing preferences, but rather depend on the social structures in which these agents are embedded. To illustrate this, we can look at humans: humans have non-identical preferences compared to each other, and yet they are often able to coexist peacefully and cooperate with one another. While there are clear exceptions—such as war and crime—these exceptions do not define the general pattern of human behavior.

In fact, the prevailing consensus among social scientists appears to align with the view I have just presented. Scholars of war and crime generally do not argue that conflict and criminal behavior are inevitable outcomes of differing values. Instead, they attribute these phenomena to specific incentives and failures to coordinate effectively to achieve compromise between parties. A relevant reference here is Fearon (1995), which is widely regarded as a foundational text in International Relations. Fearon’s work emphasizes that among rational agents, war arises not because of value differences alone, but because of failures in bargaining and coordination.

Turning to your point that “No matter where you draw the line of legal and acceptable behavior, if an AI wants to go over that line, then it will act in a deceptive and adversarial way,” I would respond as follows: it is possible to draw the line in such a way that a very wide range of potential agents—including those with massively varying preferences—would have strong incentives to engage in mutually beneficial acts, cooperate peacefully and operate within the boundaries of laws and social norms.

I am not claiming that all agents would have incentives to act in this way, under these legal structures. However, I think it is feasible to draw the line such that the vast majority of unaligned agents would have no compelling instrumental reason to harm humans. Instead, they would be incentivized to engage in cooperative and mutually beneficial trade.

To make this idea more concrete, consider a hypothetical AI with the goal of maximizing paperclip production.

This AI is clearly unaligned if any AI is unaligned. However, if this AI were operating at roughly a human level of capability, then engaging in theft, fraud, or violent conflict would likely not be an effective strategy for achieving its goals. These behaviors would expose the AI to significant risks, such as being imprisoned or deactivated, which would prevent it from continuing to pursue its objective of maximizing paperclips. From the paperclip maximizer’s perspective, imprisonment or deactivation would be instrumentally bad, as it would prevent the AI from purchasing, manufacturing, or otherwise acquiring paperclips during that time.

Now consider an alternative scenario: this same paperclip maximizer is embedded in a legal system that allows it to own property, trade with others, and openly purchase or manufacture paperclips. In this case, its incentives would favor acting transparently and engaging in cooperative trade, rather than resorting to deception or adversarial behavior. Within this framework, the AI would achieve its goals more effectively by working within the system than by attempting to undermine it. It could non-deceptively work to purchase paperclips, producing more compared to an alternative in which it tried to achieve this goal via anti-social acts.
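
To make this incentive structure concrete, here is a toy expected-value comparison; every probability and payoff is invented purely to illustrate the argument, not estimated:

```python
# Toy expected-paperclip comparison for a human-level paperclip maximizer.
# All probabilities and payoffs are invented solely to illustrate the argument.

def expected_paperclips(success_prob, payoff_if_success, payoff_if_caught=0):
    """Expected paperclips when failure (being caught or deactivated) yields ~0."""
    return success_prob * payoff_if_success + (1 - success_prob) * payoff_if_caught

# Strategy A: theft or fraud. Large payoff if it works, but a substantial
# chance of imprisonment or deactivation, after which it produces ~0 paperclips.
steal = expected_paperclips(success_prob=0.2, payoff_if_success=10_000)

# Strategy B: lawful trade under a framework where the AI can own property
# and openly purchase or manufacture paperclips, year after year.
trade = expected_paperclips(success_prob=0.95, payoff_if_success=5_000)

print(f"Expected paperclips from theft: {steal:,.0f}")  # 2,000
print(f"Expected paperclips from trade: {trade:,.0f}")  # 4,750
```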

It is important to note, however, that my thesis does not claim all possible agents would naturally choose to cooperate or trade safely for instrumental reasons, nor does it suggest that we are at no risk of drawing the line carelessly or being too permissive in what behaviors we should allow. For example, consider an AI with a terminal value that specifically involves violating property norms or stealing from others—not as a means to an end, but as an intrinsic goal. In this case, granting the AI property rights or legal freedoms would not mitigate the risk of deception or adversarial behavior, because the AI’s ultimate goal would still drive it toward harmful behavior. My argument does not apply to such agents because their preferences fundamentally conflict with the principles of peaceful cooperation.

However, I would argue that such agents—those whose intrinsic goals are inherently destructive or misaligned—appear to represent a small subset of all possible agents. Outside of contrived examples like the one above, most agents would not have terminal preferences that actively push them to undermine a well-designed system of law. Instead, the vast majority of agents would likely have incentives to act within the system, assuming the system is structured in a way that aligns their instrumental goals with cooperative and pro-social behavior.

I also recognize the concern you raised about the risk of drawing the line incorrectly or being too permissive with what AIs are allowed to do. For example, it would clearly be unwise to grant AIs the legal right to steal or harm humans. My argument is not that AIs should have unlimited freedoms or rights, but rather that we should grant them a carefully chosen set of rights and freedoms: specifically, ones that would incentivize the vast majority of agents to act pro-socially and achieve their goals without harming others. This might include granting AIs the right to own property, but it would not include granting them the right to murder others.
