Caveat: This post probably raises a naive question; I assume there's at least a 70% chance it's been considered (if not answered) exhaustively elsewhere already; please provide links if so.  I've studied evolutionary psych & human nature for 30 years, but am a relative newbie to AI safety research. Anyway....

When AI alignment researchers talk about 'alignment', they often seem to have a mental model where either (1) there's a single relevant human user whose latent preferences the AI system should become aligned with (e.g. a self-driving car with a single passenger); or (2) there are all 7.8 billion humans that the AI system should be aligned with, so it doesn't impose global catastrophic risks. In those relatively simple cases, I could imagine various current alignment strategies, such as cooperative inverse reinforcement learning (CIRL), being useful, or at least a vector in a useful direction.

However, there are large numbers of intermediate-level cases where an AI system that serves multiple humans would need to become aligned with diverse groups of users or subsets of humanity. And within each such group, the humans will have partly-overlapping but partly-conflicting interests. 

Example 1: a smart home/domestic robot AI might be serving a family consisting of a mom, a dad, an impulsive teenage kid, a curious toddler, and an elderly grandparent with Alzheimer's. Among these five humans, whose preferences should the AI try to align with? It can't please all of them all the time. They may have genuinely diverging interests and incommensurate preferences. So it may find itself in much the same position as a traditional human domestic servant (maid, nanny, butler) trying to navigate through the household's minefield of conflicting interests, hidden agendas, family dramas, seething resentments, etc. Such challenges, of course, provide much of the entertainment value and psychological complexity of TV series such as 'Downton Abbey', or the P.G. Wodehouse 'Jeeves' novels.

Example 2: a tactical advice AI might be serving a US military platoon deployed near hostile forces, doing information-aggregation and battlefield-simulation services. The platoon includes a lieutenant commanding 3-4 squads, each with a sergeant commanding 6-10 soldiers. The battlefield also includes a few hundred enemy soldiers, and a few thousand civilians. Which humans should this AI be aligned with? The Pentagon procurement office might have intended for the AI to maximize the likelihood of 'victory' while minimizing 'avoidable casualties'. But the Pentagon isn't there to do the cooperative inverse reinforcement learning (or whatever preference-alignment tech the AI uses) with the platoon. The battlefield AI may be doing its CIRL in interaction with the commanding lieutenant and their sergeants -- who may be somewhat aligned with each other in their interests (achieve victory, avoid death), but who may be quite mis-aligned with each other in their specific military career agendas, family situations, and risk preferences. The ordinary soldiers have their own agendas. And they are all constrained, in principle, by various rules of engagement and international treaties regarding enemy combatants and civilians -- whose interests may or may not be represented in the AI's alignment strategy.  

Examples 3 through N could include AIs serving various roles in traffic management, corporate public relations, political speech-writing, forensic tax accounting, factory farm inspections, crypto exchanges, news aggregation, or any other situation where groups of humans affected by the AI's behavior have highly divergent interests and constituencies.

The behavioral and social sciences focus on these ubiquitous conflicts of interest and diverse preferences and agendas that characterize human life. This is the central stuff of political science, economics, sociology, psychology, anthropology, and media/propaganda studies. I think that to most behavioral scientists, the idea that an AI system could become aligned simultaneously with multiple diverse users, in complex nested hierarchies of power, status, wealth, and influence, would seem highly dubious.

Likewise, in evolutionary biology, and its allied disciplines such as evolutionary psychology, evolutionary anthropology, Darwinian medicine, etc., we use 'mid-level theories' such as kin selection theory, sexual selection theory, multi-level selection theory, etc., to describe the partly-overlapping, partly-divergent interests of different genes, individuals, groups, and species. The idea that AI could become aligned with 'humans in general' would seem impossible, given these conflicts of interest.

In both the behavioral sciences and the evolutionary sciences, the best insights into animal and human behavior, motivations, preferences, and values often involve some game-theoretic modeling of conflicting interests. And ever since von Neumann and Morgenstern (1944), it's been clear that when strategic games include lots of agents with different agendas, payoffs, risk profiles, and choice sets, and they can self-assemble into different groups, factions, tribes, and parties with shifting allegiances, the game-theoretic modeling gets very complicated very quickly. Probably too complicated for a CIRL system, however cleverly constructed, to handle.
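As a rough illustration of how fast this blows up, the sketch below counts only the number of ways a set of agents could split into factions (the Bell numbers). This is an assumed, deliberately crude proxy for complexity: it ignores payoffs, risk profiles, and shifting allegiances entirely, and the function name `bell_number` is just a label chosen for this example.

```python
def bell_number(n):
    """Count the ways to partition n distinct agents into non-empty groups.

    Uses the Bell triangle: each new row starts with the last entry of the
    previous row, and each further entry adds the entry above-left.
    """
    row = [1]  # row for n = 0
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


# The five-person household from Example 1 can split into factions in 52 ways.
print(bell_number(5))   # 52
# One squad from Example 2 (a sergeant plus 6 soldiers) already allows 877.
print(bell_number(7))   # 877
# The count keeps growing faster than exponentially as the group gets larger.
print(bell_number(20))
```

Even before any preferences are modeled, the space of possible coalition structures for a small platoon is far too large to enumerate.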

So, I'm left wondering what AI safety researchers are really talking about when they talk about 'alignment'. Alignment with whoever bought the AI? Whoever uses it most often? Whoever might be most positively or negatively affected by its behavior? Whoever the AI's company's legal team says would impose the highest litigation risk?

I don't have any answers to these questions, but I'd value your thoughts, and links to any previous work that addresses this issue. 


(Edit: Accidentally posted a duplicate link.)

Aligned with whom? by Anton Korinek and Avital Balwit (2022) has a possible answer. They write that an aligned AI system should have

  • direct alignment with its operator, and
  • social alignment with society at large.

Some examples of failures in direct and social alignment are provided in Why we need a new agency to regulate advanced artificial intelligence: Lessons on AI control from the Facebook Files (Korinek, 2021).

We could expand the moral circle further by aligning AI with the interests of both human and non-human animals. Direct, social and sentient alignment?

As you mentioned, these alignments present conflicting interests that need mediation and resolution.

I think AI alignment isn't really about designing AI to maximize for the preference satisfaction of a certain set of humans. I think an aligned AI would look more like an AI which:

  • is not trying to cause an existential catastrophe or take control of humanity
  • has had undesirable behavior trained out or adversarially filtered
  • learns from human feedback about what behavior is more or less preferable
    • In this case, we would hope the AI would be aligned to the people who are allowed to provide feedback
  • has goals which are corrigible
  • is honest, non-deceptive, and non-power-seeking

Hi mic,

I understand that's how 'alignment' is normally defined in AI safety research. 

But it seems like such a narrow notion of alignment that it glosses over almost all of the really hard problems in real AI safety -- which concern the very real conflicts between the humans who will be using AI.

For example, if the AI is aligned 'to the people who are allowed to provide feedback' (eg the feedback to a CIRL system), that raises the question of who is actually going to be allowed to provide feedback. For most real-world applications, deciding that issue is tantamount to deciding which humans will be in control of that real-world domain -- and it may leave the AI looking very 'unaligned' to all the other humans involved.

But it seems like such a narrow notion of alignment that it glosses over almost all of the really hard problems in real AI safety -- which concern the very real conflicts between the humans who will be using AI.

I very much agree that these political questions matter, and that alignment to multiple humans is conceptually pretty shaky; thanks for bringing up these issues. Still, I think some important context is that many AI safety researchers think that it's a hard, unsolved problem to just keep future powerful AI systems from literally killing everyone (or doing other unambiguously terrible things). They're often worried that CIRL and every other approach that's been proposed will completely fail. From that perspective, it no longer looks like almost all of the really hard problems are about conflicts between humans.

(On CIRL, here's a thread and a longer writeup on why some think that "it almost entirely fails to address the core problems" of AI safety. This video and this post outline some broader potential limitations of current approaches to safety.)

I agree, that seems concerning. Ultimately, since the AI developers are designing the AIs, I would guess that they would try to align the AI to be helpful to the users/consumers or to the concerns of the company/government, if they succeed at aligning the AI at all. As for your suggestions "Alignment with whoever bought the AI? Whoever uses it most often? Whoever might be most positively or negatively affected by its behavior? Whoever the AI's company's legal team says would impose the highest litigation risk?" – these all seem plausible to me.

On the separate question of handling conflicting interests: there's some work on this (e.g., "Aligning with Heterogeneous Preferences for Kidney Exchange" and "Aligning AI with Human Norms through Multi-Objective Reinforced Active Learning"), though perhaps not as much as we would like.

It does seem like alignment researchers often focus on the case of aligning AI to a single human. Here are some views that might explain this. I think these views are at least somewhat common among alignment researchers.

  • Aligning with a single human contains most of the difficulty of the problem of aligning with groups of humans. Once we figure out how to align AI with a single human, figuring out how to align it with groups of humans will be relatively easy. We should focus on the hard part first, which is aligning AI with a single human. (edit: I am not saying that aligning with a single human is harder than aligning with groups of humans. See also my comment below.)

  • If AI is aligned with a single random human, this is still much better than unaligned AI. Therefore this kind of research is very valuable.

  • If the AI acts according to the CEV of a single random human, then the results will be probably good for humanity as a whole.

harfe - I'm not convinced that aligning with a single human is much harder than aligning with a group of humans that have diverse and partly conflicting interests.  Single human preferences can already be learned pretty well by product recommendation engines, but group preferences are much more complicated.

We already know from game theory that there is no way, even in principle, for a single agent (such as an AI) to represent and enact the collective interests of a group that doesn't actually have internally aligned collective interests. The only exceptions to this are edge cases like pure coordination games (e.g. which side of the road to drive on, left or right). 
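One standard illustration of this kind of problem (offered here as a sketch, not necessarily the exact result meant above) is the Condorcet paradox: majority voting over three options can produce a cycle, so there is no single group ranking for an agent to adopt. The household members, options, and rankings below are hypothetical, invented purely for the example.

```python
# Hypothetical rankings, best to worst, for what a home AI should do
# in the living room this evening.
rankings = {
    "mom": ["quiet", "music", "tv"],
    "teen": ["music", "tv", "quiet"],
    "grandparent": ["tv", "quiet", "music"],
}


def majority_prefers(a, b, rankings):
    """Return True if a strict majority of members rank option a above b."""
    votes = sum(r.index(a) < r.index(b) for r in rankings.values())
    return votes > len(rankings) / 2


def condorcet_winner(rankings):
    """Return the option that beats every other head-to-head, or None."""
    options = next(iter(rankings.values()))
    for candidate in options:
        others = [o for o in options if o != candidate]
        if all(majority_prefers(candidate, o, rankings) for o in others):
            return candidate
    return None


# Each pairwise majority is clear, yet together they form a cycle:
# quiet beats music, music beats tv, and tv beats quiet.
print(majority_prefers("quiet", "music", rankings))  # True
print(majority_prefers("music", "tv", rankings))     # True
print(majority_prefers("tv", "quiet", rankings))     # True
print(condorcet_winner(rankings))                    # None
```

Whichever option the AI picks, a majority of the household would have preferred something else, even though each individual's preferences are perfectly coherent.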

My concern is that if we think we've solved the single-human alignment problem, we'll be tempted to scale AI systems up to try to reflect the general preferences of human groups (or humanity in general) -- but that this will simply not be possible, given that groups, and even the human species itself, do not actually have collectively aligned interests (even the principle of 'don't let AI drive us extinct' won't seem aligned with the agendas of the Voluntary Human Extinction movement, the anti-natalists, the Earth First eco-activists, the religious extremists expecting imminent messiahs or Raptures, or the depressed nihilists).

And, if group alignment isn't possible, we'll end up in a situation where whichever subgroup has the most direct control over AI design, training, and feedback will end up being basically in control of everybody else.

I'm not convinced that aligning with a single human is much harder than aligning with a group of humans that have diverse and partly conflicting interests.

I did not claim that aligning with a single human is harder than aligning with a group of humans (nor have I claimed that others believe that). I have probably expressed myself poorly, if that was the impression after reading my comment. In fact, I believe the opposite!

Let me make another attempt at explaining.

  • A: Figuring out how to align an AGI with a single human.
  • B: Figuring out how to align an AGI with a group of humans.
  • C: Doing B after you have completed A.

Then, for the difficulties of these, I currently believe

  • all three of A, B, C are hard
  • B is harder than A
  • B is harder than C
  • A is much harder than C (this was what I was trying to state in the comment above)
  • A reasonable strategy for doing B would be to do A, and then do C (I am not super confident here, and things might be much more complex)
  • If you do both A and C, it is better to first focus on A (and put more resources into it), because A is harder than C.

I would be curious what other people think. My current guess would be that at least some alignment researchers believe these (or a part of these) points too. I do not recall hearing opposing viewpoints.

I do not believe that, for example, the author of the PreDCA alignment proposal wishes that the values of a random human are imposed (via AGI) on the rest of humanity, even though PreDCA (currently) is a protocol that aligns AGI with a single human (called "user").

Hi harfe, thanks for this helpful clarification. 

I'd agree that A, B, and C seem hard; that B is harder than A, and that B is harder than C.

Where we disagree is that I suspect that C is harder than A, for basic game-theoretic reasons I mentioned in the original post. 

I'm also not confident that C is a whole lot easier than B -- I'm not sure that alignment with individual humans will actually give us all that much help in doing alignment with complicated groups of humans.

But I need to think further about this, and do some more reading!

You claimed that your starting question was naive, so allow me to respond with similar naivete: 

If AIs become smart enough to perform behaviors that we consider potentially threatening or beyond our control, aren't they really artificial life?

As such, and imbued with consciousness equal to or greater than our own, don't we consider them to have rights or legal protections or freedoms?

If they did, they would also be subject to legal restrictions on their behavior, in similar fashion to human beings. However, with additional freedoms, those legal restrictions would be inadequate to punish their behavior. As a consequence, we face an ethical challenge in how we integrate such life into our society. We don't want to be unfair, but we prefer them as servants rather than as equals or superiors.

The continued development of AI could reflect techno-utopianism, or technological determinism (marketing and memes), leading everyone to a condition in which the actual motives of people paying for it all are really poorly thought out and short-term, but the larger vision looks more attractive than it is.