This post was written by Peli Grietzer, inspired by internal writings by TJ (tushant jha), for AOI[1]. The original post, published on Feb 5, 2024, can be found here: https://ai.objectives.institute/blog/the-problem-with-alignment.
The purpose of our work at the AI Objectives Institute (AOI) is to direct the impact of AI towards human autonomy and human flourishing. In the course of articulating our mission and positioning ourselves -- a young organization -- in the landscape of AI risk orgs, we’ve come to notice what we think are serious conceptual problems with the prevalent vocabulary of ‘AI alignment.’ This essay will discuss some of the major ways in which we think the concept of ‘alignment’ creates bias and confusion, as well as our own search for clarifying concepts.
At AOI, we try to think about AI within the context of humanity’s contemporary institutional structures: How do contemporary market and non-market (e.g. bureaucratic, political, ideological, reputational) forces shape AI R&D and deployment, and how will the rise of AI-empowered corporate, state, and NGO actors reshape those forces? We increasingly feel that ‘alignment’ talk tends to obscure or distort these questions.
The trouble, we believe, is the idea that there is a single so-called Alignment Problem. Talk about an ‘Alignment Problem’ tends to conflate a family of related but distinct technical and social problems, including:
P1: Avoiding takeover from emergent optimization in AI agents
P2: Ensuring that AI’s information processing (and/or reasoning) is intelligible to us
P3: Ensuring AIs are good at solving problems as specified (by user or designer)
P4: Ensuring AI systems enhance, and don’t erode, human agency
P5: Ensuring that advanced AI agents learn a human utility function
P6: Ensuring that AI systems lead to desirable systemic and long-term outcomes
Each of P1-P6 is known as ‘the Alignment Problem’ (or as the core research problem in ‘Alignment Research’) to at least some people in the greater AI Risk sphere, in at least some contexts. And yet these problems are clearly not simply interchangeable: placing any one of P1-P6 at the center of AI safety implies a complicated background theory about their relationship, their relative difficulty, and their relative significance.
We believe that when different individuals and organizations speak of the ‘Alignment Problem,’ they assume different controversial reductions of the P1-P6 problems network to one of its elements. Furthermore, the very idea of an ‘Alignment Problem’ precommits us to finding a reduction for P1-P6, obscuring the possibility that this network of problems calls for a multi-pronged treatment.
One surface-level consequence of the semantic compression around ‘alignment’ is widespread miscommunication, as well as fights over linguistic real-estate. The deeper problem, though, is that this compression serves to obscure some of a researcher’s or org’s foundational ideas about AI by ‘burying’ them under the concept of alignment. Take a familiar example of a culture clash within the greater AI Risk sphere: many mainstream AI researchers identify ‘alignment work’ with incremental progress on P3 (task-reliability), which researchers in the core AI Risk community reject as just safety-washed capabilities research. We believe working through this culture-clash requires that both parties state their theories about the relationship between progress on P3 and progress on P1 (takeover avoidance).
In our own work at AOI, we’ve had occasion to closely examine a viewpoint we call the Berkeley Model of Alignment -- a popular reduction of P1-P6 to P5 (agent value-learning) based on a paradigm consolidated at UC Berkeley’s CHAI research group in the late ‘10s. While the assumptions we associate with the Berkeley Model are no longer as dominant in technical alignment research[2] as they once were, we believe that the Berkeley Model still informs a great deal of big-picture and strategic discourse around AI safety.
Under the view we call the Berkeley Model of Alignment, advanced AIs can be naturally divided into two kinds: AI agents possessing a human utility function (‘aligned AIs’) and AI agents motivated to take over or eliminate humanity (‘unaligned AIs’). Within this paradigm, solving agent value-learning is effectively necessary for takeover avoidance and effectively sufficient for a systematically good future, making the relationship between observable progress on task-reliability and genuine progress on agent value-learning the central open question in AI safety and AI policy. This model of alignment is, of course, not simply arbitrary: it’s grounded in well-trodden arguments about the likelihood of emergent general-planner AGI and its tendency towards power-seeking. Nevertheless, we think the status of the Berkeley Model in our shared vocabulary blends these arguments into the background in ways that support imprecise, automatic thought-patterns instead of precise inferences.
The first implicit pillar of the Berkeley Model that we want to criticize is the assumption of content indifference: The Berkeley Model assumes we can fully separate the technical problem of aligning an AI to some values or goals from the governance problem of choosing what values or goals to target. While it is logically possible that we’ll discover some fully generic method of pointing to goals or values (e.g. brain-reading), it’s equally plausible that different goals or values will effectively have different ‘type-signatures’: goals or values that are highly unnatural or esoteric given one training method or specification-format may be readily accessible given another training method or specification-format, and vice versa. This issue is even more pressing if we take a sociotechnical viewpoint that considers the impact of early AI technology on the epistemic, ideological, and economic conditions under which later AI development and deployment take place.
The second implicit pillar that we want to criticize is the assumption of a value-learning bottleneck: The Berkeley Model assumes that the fundamental challenge in AI safety is teaching AIs a human utility function. We want to observe, first of all, that value learning is neither clearly necessary nor clearly sufficient for either takeover avoidance or a systematically good future. Consider that we humans ourselves manage to be respectful, caring, and helpful to our friends despite not fully knowing what they care about or what their life plans are -- thereby providing an informal human proof for the possibility of beneficial and safe behavior without exhaustive learning of the target’s values. And as concerns sufficiency, the recent literature on deceptive alignment vividly demonstrates that value learning by itself can’t guarantee the right relationship to motivation: understanding human value and caring about values are different things.
Perhaps more important, the idea of a value-learning bottleneck assumes that AI systems will have a single ‘layer’ of goals or values. While this makes sense within the context of takeover scenarios where an AI agent directly stamps its utility function on the world, the current advance of applied AI suggests that near-future, high-impact AI systems will be composites of many AI and non-AI components. Without dismissing takeover scenarios, we at AOI believe that it’s also critical to study and guide the collective agency of composite, AI-driven sociotechnical systems. Consider, for example, advanced LLM-based systems: although we could empirically measure whether the underlying LLM can model human values by testing token completion over complex ethical statements, what’s truly impact-relevant are the patterns of interaction that emerge at the conjunction of the base LLM, RLHF regimen, prompting wrapper and plugins, interface design, and user-culture.
This brings us to our final, central problem with the Berkeley Model: the assumption of context independence. At AOI, we are strongly concerned with how the social and economic ‘ambient background’ to AI R&D and deployment is likely to shape future AI. Our late founder Peter Eckersley was motivated by the worry that market dynamics favor the creation of powerful profit-maximizing AI systems that trample the public good: risks from intelligent optimization in advanced AI, Eckersley thought, are a radical new extension of optimization risks from market failures and misaligned corporations that already impact human agency in potentially catastrophic ways. Eckersley hoped that by restructuring the incentives around AI R&D, humanity could wrest AI from these indifferent optimization processes and build AI institutions sensitive to the true public good. In Eckersley's work at AOI, and in AOI's work after his passing, we continue to expand this viewpoint, incorporating a plethora of other social forces: bureaucratic dynamics within corporations and states, political conflicts, ideological and reputational incentives. We believe that in many plausible scenarios these forces will both shape the design of future AI technology itself, and guide the conduct of future AI-empowered sociotechnical intelligences such as governments and corporations.
This sociotechnical perspective on the future of AI does, of course, make its own hidden assumptions: In order to inherit or empower the profit-motive of corporations, advanced AI must be at least minimally controllable. While on the Berkeley Model of Alignment one technical operation (‘value alignment’) takes care of AI risk in its entirety, our sociotechnical model expects the future of AI to be determined by two complementary fronts: technical AI safety engineering, and design and reform of institutions that develop, deploy, and govern AI. We believe that without good institutional judgment, many of the most likely forms of technically controllable AI may end up amplifying current harms, injustices, and threats to human agency. At the same time, we also worry that exclusive focus on current harms and their feedback loops can blind researchers and policy-makers to more technical forms of AI risk: Consider, for example, that researchers seeking to develop AI systems’ understanding of rich social contexts may produce new AI capabilities with ‘dual use’ for deception and manipulation.
It may seem reasonable, at first glance, to think about our viewpoint as simply expanding the alignment problem -- adding an ‘institutional alignment problem’ to the technical AI alignment problem. While this is an approach some might have taken in the past, we’ve grown suspicious of the assumption that technical AI safety will take the form of an ‘alignment’ operation, and wary of the implication that good institutional design is a matter of inducing people to collectively enact some preconceived utility function. As we’ll discuss in our next post, we believe Martha Nussbaum’s and Amartya Sen’s ‘capabilities’ approach to public benefit gives a compelling alternative framework for institutional design that applies well to advanced AI and to the institutions that create and govern it. For now, we hope we’ve managed to articulate some of the ways in which ‘alignment’ talk restricts thought about AI and its future, as well as suggest some reasons to paint outside of these lines.
[1] This post's contents were drafted by Peli and TJ, in their former capacity as Research Fellow and Research Director at AOI. They are currently research affiliates collaborating with the organization.
[2] We believe there is an emerging paradigm that seeks to reduce P1-P6 to P2 (human intelligibility), but this new paradigm has so far not consolidated to the same degree as the Berkeley Model. Current intelligibility-driven research programs such as ELK and OAA don’t yet present themselves as ‘complete’ strategies for addressing P1-P6.