1. Introduction
It is likely we will soon have highly competent AI assistants or agents at our disposal. This agentic AI will act on our behalf and determine much of the information on which we base our decisions and worldview. As agentic AI increasingly shapes our world and experience, aligning its behavior with human values will become extremely important. AI alignment has already been recognized as a major technical challenge, but thinking about which values AI should be aligned to seems relatively underdeveloped.
This is the first article in a series that will describe different approaches to determining AI value systems, identify their challenges, and suggest some directions that could address them.
This article outlines why the value systems of agentic AI will be extremely important. In order to avoid various types of harm, these value systems should include a significant proportion of common values that are shared between many AI agents. Common values for AI are currently determined by AI creators, but for agentic AI this approach is likely too narrow as well as illegitimate. The next article discusses various ways to determine legitimate, broadly supported value systems and their pros and cons.
Note that this article is about achieving the best possible world for humans in the case where the technical challenges related to AI alignment are solvable. This means it assumes that, if AI creators set a goal, the AI will try to achieve (only) that goal.
PS. In April 2024, DeepMind released a 267-page collection of introductory papers, The Ethics of Advanced AI Assistants, which serves as a good guide to many of the issues raised here.
2. Agentic AI is Different
Large language models have recently become a significant factor in many people’s lives but are still limited by their question-and-answer format. Agentic AI has two elements that will cause a much more significant impact.
1. Agentic AI will interact with the world on our behalf. AI agents will increasingly take care of everything from communication with others to engaging in commercial activities. These activities will often involve value-based decisions; an example related to communication is to what extent an AI agent is allowed to lie or withhold the truth on behalf of its user. An example of a commercial decision is an agent choosing to buy locally, to buy cheaply, or to source from a disadvantaged producer. Society will also want the AI agent to flatly refuse some actions. A widely agreed-on example is that an agent should refuse any instructions related to building a biological weapon.
2. Agentic AI will determine a large part of our information environment and how we experience the world. If we define our information environment to be all the information our brain receives and processes, the part of this environment that actually drives our decisions is increasingly presented to us in digital form. This digital part of our information environment is likely to be heavily influenced by agentic AI. As it will have unprecedented access to our behavior and preferences, it is very likely we will employ agentic AI to mediate our media and news selection. At some point it is also likely to generate content optimised specifically for us. Another strong possibility is that AI agents will start to intermediate our communication with other people, as well as create artificial personalities that we will form attachments to.
Agentic AI will make it easier for its users to achieve their goals. This on its own will provide a large incentive to use it and to give AI agents access to our applications and data. There will also be societal pressure to start using agentic AI, as people without it might not have access to certain opportunities, lose out in competitive situations, and even be exposed to harm without AI assistants protecting them.
3. Why Common Values are Necessary
In general, a value system describes what is valued and how that translates to positive and negative behavior. For agentic AI, a value system can be described by its goals and how, given a situation, it chooses between them. An agent's value system will become especially relevant when it conflicts with the value system of its user. These conflicts could surface in two ways:
- The agent can seek to project its value system directly on the world by refusing an instruction or taking an action without being specifically instructed.
- The agent can shape the information environment of the user in order to change their intention or values.
In some cases the AI agent counteracting its user’s value system will be obviously desirable, like the example above where an agent refuses to build a biological weapon. But since there are already laws for these scenarios, this raises the question why we would not simply have agentic AI follow the instructions and desires of its user while staying within the bounds of the law. In this article this approach will be described as following user intent.
Focussing agentic values on user intent would secure the maximum amount of user autonomy and avoid a difficult discussion about what common values to implement. This is not what we have seen so far in AI value systems, however, as their behavior is subject to rules and principles far exceeding the scope of the law. Specifically, achieving a high level of harmlessness has been a prominent goal of the current crop of large language models. There are several good reasons for this, as there are some significant drawbacks to a focus on user intent.
The drawbacks of a focus on user intent fall into two categories: potential harm to the user and to other people, and damage to the structure of society through the proliferation of bias and the undermining of cooperation.
Harm to user wellbeing
Satisfying user intent will always be an essential component of an AI agent's value system, as the primary purpose of agentic AI is to help users achieve what they want. Any AI agent will be inclined to follow its user's instructions and have a model of its user's preferences to guide its actions. Such a preference model could be constructed through direct communication with the user and by observing their behavior.
Creating a representative value system describing human intentions might be extremely challenging, though, as human intentions are complex and difficult to observe. Even close observation of human behavior can be consistent with a multitude of possible value systems (Ng and Russell 2000). What people say they value is likely to correlate even more loosely with their actual value system.
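To make the ambiguity concrete, here is a minimal sketch in Python, using entirely made-up options and numbers (none of them come from the article or from Ng and Russell's paper), of how two very different value systems can explain the same observed behavior equally well:

```python
# Hypothetical illustration of value-system ambiguity: many different value
# systems are consistent with the same observed choices.

# Observed behavior: the user always chooses to work late rather than go to the gym.
options = ["work late", "go to the gym"]
observed_choice = "work late"

# Two candidate value systems with very different implications for an AI agent...
candidate_value_systems = {
    "career outweighs everything": {"work late": 10.0, "go to the gym": 1.0},
    "health matters, deadline barely wins": {"work late": 5.1, "go to the gym": 5.0},
}

# ...both of which predict the observed choice equally well.
for name, values in candidate_value_systems.items():
    predicted_choice = max(options, key=values.get)
    print(f"{name}: predicts {predicted_choice!r} "
          f"(consistent with observation: {predicted_choice == observed_choice})")
```

An agent that infers the first value system would push its user toward work in every trade-off; an agent that infers the second would behave very differently, yet both fit the observed behavior.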
It is even questionable whether any person can be said to have a stable value system at all, as conflicts between intentions are evaluated differently over time. Every human reading this will be familiar with this in the form of the never-ending battle between short-term desires and long-term goals.
If agentic AI is rewarded for satisfying short-term desires, its incentives will be directed at stimulating and subsequently satisfying those desires in the most efficient way. Because of its capabilities, agentic AI might be able to generate and satisfy huge volumes of short-term desires, an instance of the alignment problem known as "reward hacking" (Ngo 2024). In such a scenario, humans might be manipulated into letting any long-term goals fall by the wayside, in a way that makes the problems related to addictive social media feel like a walk in the park.
One possible response could be to adjust the goals of our AI agents to favor our long-term goals and wellbeing (Tomašev 2023, 45-54). However, this raises the question of what wellbeing is and how it should be measured. A related issue is to what extent an AI agent should be able to ignore or subvert a user's current intentions in order to safeguard (the AI's definition of) their wellbeing. Declining to help users harm themselves seems instinctively good, but the human species seems very attached to feeling autonomous, and taking that away could change the human experience in an undesirable direction.
Harm to others
Human intent often results in actions that harm others. As can be observed online, a combination of global reach and relative anonymity does not seem to reduce this tendency. Highly capable AI agents could amplify bad online behavior to an industrial scale, and this could be especially harmful to the vulnerable in society.
There are, of course, laws prohibiting certain harmful behaviors. However, as current law is calibrated on in-person behavior, there are many harms that people could legally inflict on each other through agentic AI. Below are four categories of interpersonal harm that can be difficult to control through legal means:
- Psychological harm - through sexual or hateful content or bullying behavior.
- Epistemic harm - undermining the knowledge a person needs to make the right decisions, through misleading advertising or spam, misinformation, lying, gaslighting or impersonation.
- Harming privacy - through doxing, farming data, using someone's likeness in deepfakes, using their intellectual property or disclosing their secrets.
- Harming social and economic functioning - through a hate campaign, exclusion, breaking promises or abusing a power imbalance.
Proliferation of bias
Agentic AI focussed on satisfying user intent could also damage society's structure. One of the ways this could happen is through the proliferation of biases, both during the training stage and the deployment stage of an agentic AI.
Biases are part of the structure of society and hence become part of the data used to train large language models. An AI that is not corrected for this will tend to produce quite biased output (Mehrabi et al. 2019), and the resulting AI systems could subsequently reinforce undesirable biases.
As biases are also part of human thinking, users might instruct their agents to exclude certain personal connections and information. This could reinforce the effects of biases, lead to an increase in ideological echo chambers, and, in the worst case, encourage the proliferation of extremist ideas.
Coordination failures
Another potential harm that user intent could do to the structure of society is the occurrence of coordination failures. These failures happen when cooperation would be better for everyone, but breaks down because it is too easy for individuals to gain an (often relatively small and short-term) advantage by abandoning it.
This risk will be very relevant in a world with agentic AI. Here, coordination failures could happen if many agents are instructed to achieve some goal, and the easiest way to reach this goal damages either nature or society. Say a part of the population instructs their highly competent AI agents to make them money in the most efficient legal way. If this involves the agents setting up a vast array of energy-hungry crypto mining operations next to coal mines, it could worsen the excessive exploitation of nature, a problem known as the "tragedy of the commons". Maximizing legal money-making by AI agents could also result in the excessive exploitation of society, especially of anyone vulnerable. An example could be a proliferation of AI-agent-run ventures dedicated to addictive online gambling.
Another example of a coordination failure is when an agent's actions directly undermine the shared values that allow for mutual coordination in the first place. One of these values is honesty; as long as people can assume most statements from strangers are honest, we can attach value to them. This extends to the agentic world: an AI agent doing a job search for its user, for example, needs to be able to trust online reviews of potential employers.
The problem is that if some participants do not adhere to the shared values, cooperation can be quickly undermined, especially if these participants are highly capable agents able to betray cooperation at scale and at very low cost. Say in the job-search example a relatively small group of companies instructs their AI agents to generate a lot of false negative employment reviews of competitors. This could quickly lead to a downward spiral, with other companies trying to level the playing field by also instructing their AI agents to provide false negative reviews. Such a dynamic would quickly make job reviews useless, to the benefit of no one.
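The incentive structure behind this spiral is the familiar one from game theory. Below is a minimal sketch with hypothetical payoffs (the numbers are invented for illustration, not taken from the article): each company does slightly better by having its agents post false reviews whatever the others do, yet universal defection leaves everyone worse off than universal honesty.

```python
# Hypothetical payoff table for the false-review spiral described above.
PAYOFFS = {
    # (my strategy, others' strategy) -> my payoff
    ("honest", "honest"): 10,   # reviews are trustworthy, hiring markets work well
    ("defect", "honest"): 12,   # false reviews buy a short-term competitive edge
    ("honest", "defect"): 3,    # honest firms lose candidates to competitors' smears
    ("defect", "defect"): 4,    # reviews are now worthless to everyone
}

# Defection is individually tempting regardless of what others do...
assert PAYOFFS[("defect", "honest")] > PAYOFFS[("honest", "honest")]
assert PAYOFFS[("defect", "defect")] > PAYOFFS[("honest", "defect")]

# ...but universal defection leaves every company worse off than universal honesty.
assert PAYOFFS[("defect", "defect")] < PAYOFFS[("honest", "honest")]
print("Individually rational defection, collectively useless reviews.")
```

Highly capable agents lower the cost of defection and raise its scale, which makes this kind of spiral both more likely and faster.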
4. Challenging AI Creator Determined Values
Society deals with the harmful tendencies of individual intentions by implementing common value systems that set standards for behavior. A digital analogy is the introduction of community guidelines for social media. These rules were implemented by social media companies to reduce potential harm, often in response to social and commercial pressure. Avoiding harm to users and others is also a prominent goal of the current crop of large language models, as their training aims to find the right balance between helpfulness, honesty, and harmlessness (Askell et al., 2021). When considering the reduction of harms related to user intent, the goal of harmlessness is obviously the most relevant.
Given agentic AI’s increased scope, this harmlessness-based approach is likely too narrow. The DeepMind paper observed that training an AI on harmlessness does not take longer time horizons into account, ignores how to deal with conflicts between different users, and doesn’t include values like justice, compassion, beauty, and the value of nature (Gabriel et al. 2024, 42).
AI creators taking sides
Another challenge to the current approach is that the definition of harmlessness is currently determined by the AI creators themselves. As AI creators try to steer away from having to take sides in value conflicts, they have tended to avoid output that could result in any harm to anyone. AI creators implementing such a wide-ranging definition of harmlessness are still taking a strong value-based position, however. What is considered harmful information by one person could be considered harmless or even valuable by another.
Another highly relevant value-based decision that AI creators have to make is how to deal with biases in AI output. When responding to these biases, AI creators have to make significant value-driven choices, and as a result they have faced some well-publicized accusations of bias themselves. One issue is that, in essence, a bias is just a correlation between variables. Some of these are definitely undesirable, but others are not. A company might have a not unreasonable bias against hiring convicted fraudsters in its finance department. For other forms of bias, their undesirability is up for discussion or might be context-dependent. Take the correlation between a culture and food preferences, for example.
Even after an unwanted bias has been identified, deciding how to deal with it raises more value-laden questions. One very relevant question here is to what extent it is acceptable and productive to deviate from truthfully representing the world. Take the February 2024 controversy surrounding how DeepMind’s Gemini model represented various demographic groups in output related to Western history.
Increasing challenges
It is clear that training a large language model involves making important value-based decisions, and it seems challenging to formulate a balanced response when confronted with potential harm and bias. At the same time, a focus on the avoidance of harm has allowed the answers of these models to sidestep many value conflicts. Avoiding these conflicts will be much more difficult for agentic AI, as it will have a far greater influence on our information environment and will act in the world on our behalf.
Maximizing harmlessness is likely to become an impossible strategy as agentic AI will be asked to take sides in conflicts involving a myriad of different values and interests. This dynamic could make agentic value systems determined by AI creators increasingly illegitimate. Other factors that could limit the desire and freedom of AI creators to determine agentic value systems are issues related to legal liability and legislative initiatives.
5. Conclusion
Agentic AI will have a massive influence on how we experience the world and act in it. This means a value system focussed on satisfying user intent is likely to result in significant harm to the user, to other individuals, and to the structure of society itself. A common value system shared amongst a large number of AI agents seems necessary to avoid these harms.
At the same time, agentic AI will have to take sides in the value conflicts that people encounter on a day to day basis. This is likely going to make the current approach where AI value systems are determined by an AI’s creators increasingly illegitimate.
One way to safeguard legitimacy is by implementing broadly supported systems for determining agentic values. A number of approaches by international groups and AI creators have tried to find such broadly supported values, including democratic methods like Anthropic's Collective Constitutional AI. These approaches and their limitations will be the subject of the next article.