Caveat: This post probably raises a naive question; I assume there's at least a 70% chance it's been considered (if not answered) exhaustively elsewhere already; please provide links if so. I've studied evolutionary psych & human nature for 30 years, but am a relative newbie to AI safety research. Anyway....
When AI alignment researchers talk about 'alignment', they often seem to have a mental model where either (1) there's a single relevant human user whose latent preferences the AI system should become aligned with (e.g. a self-driving car with a single passenger); or (2) there's all 7.8 billion humans that the AI system should be aligned with, so it doesn't impose global catastrophic risks. In those relatively simple cases, I could imagine various current alignment strategies, such as cooperative inverse reinforcement learning (CIRL) being useful, or at least a vector in a useful direction.
However, there are large numbers of intermediate-level cases where an AI system that serves multiple humans would need to become aligned with diverse groups of users or subsets of humanity. And within each such group, the humans will have partly-overlapping but partly-conflicting interests.
Example 1: a smart home/domestic robot AI might be serving a family consisting of a mom, a dad, an impulsive teenage kid, a curious toddler, and an elder grandparent with Alzheimer's. Among these five humans, whose preferences should the AI try to align with? It can't please all of them all the time. They may have genuinely diverging interests and incommensurate preferences. So it may find itself in much the same position as a traditional human domestic servant (maid, nanny, butler) trying to navigate through the household's minefield of conflicting interests, hidden agendas, family dramas, seething resentments, etc. Such challenges, of course, provide much of the entertainment value and psychological complexity of TV series such as 'Downtown Abbey', or the P.G. Wodehouse 'Jeeves' novels.
Example 2: a tactical advice AI might be serving a US military platoon deployed near hostile forces, doing information-aggregation and battlefield-simulation services. The platoon includes a lieutenant commanding 3-4 squads, each with a sergeant commanding 6-10 soldiers. The battlefield also includes a few hundred enemy soldiers, and a few thousand civilians. Which humans should this AI be aligned with? The Pentagon procurement office might have intended for the AI to maximize the likelihood of 'victory' while minimizing 'avoidable casualties'. But the Pentagon isn't there to do the cooperative inverse reinforcement learning (or whatever preference-alignment tech the AI uses) with the platoon. The battlefield AI may be doing its CIRL in interaction with the commanding lieutenant and their sergeants -- who may be somewhat aligned with each other in their interests (achieve victory, avoid death), but who may be quite mis-aligned with each other in their specific military career agendas, family situations, and risk preferences. The ordinary soldiers have their own agendas. And they are all constrained, in principle, by various rules of engagement and international treaties regarding enemy combatants and civilians -- whose interests may or may not be represented in the AI's alignment strategy.
Examples 3 through N could include AIs serving various roles in traffic management, corporate public relations, political speech-writing, forensic tax accounting, factory farm inspections, crypto exchanges, news aggregation, or any other situation where groups of humans affected by the AI's behavior have highly divergent interests and constituencies.
The behavioral and social sciences focus on these ubiquitous conflicts of interest and diverse preferences and agendas that characterize human life. This is the central stuff of political science, economics, sociology, psychology, anthropology, and media/propaganda studies. I think that to most behavioral scientists, the idea that an AI system could become aligned simultaneously with multiple diverse users, in complex nested hierarchies of power, status, wealth, and influence, would seem highly dubious.
Likewise, in evolutionary biology, and its allied disciplines such as evolutionary psychology, evolutionary anthropology, Darwinian medicine, etc., we use 'mid-level theories' such as kin selection theory, sexual selection theory, multi-level selection theory, etc to describe the partly-overlapping, partly-divergent interests of different genes, individuals, groups, and species. The idea that AI could become aligned with 'humans in general' would seem impossible, given these conflicts of interest.
In both the behavioral sciences and the evolutionary sciences, the best insights into animal and human behavior, motivations, preferences, and values often involve some game-theoretic modeling of conflicting interests. And ever since von Neumann and Morgenstern (1944), it's been clear that when strategic games include lots of agents with different agendas, payoffs, risk profiles, and choice sets, and they can self-assemble into different groups, factions, tribes, and parties with shifting allegiances, the game-theoretic modeling gets very complicated very quickly. Probably too complicated for a CIRL system, however cleverly constructed, to handle.
So, I'm left wondering what AI safety researchers are really talking about when they talk about 'alignment'. Alignment with whoever bought the AI? Whoever users it most often? Whoever might be most positively or negatively affected by its behavior? Whoever the AI's company's legal team says would impose the highest litigation risk?
I don't have any answers to these questions, but I'd value your thoughts, and links to any previous work that addresses this issue.