
Pulling a child out of the path of a fast car is the right thing to do, whether or not the child agrees, understands, is grateful, or even is hurt during the rescue. Paternalistic acts like this, when we might argue that a person's straightforward consent ought to be overridden, add uncomfortable complications to liberal ethical frameworks.

As artificial intelligence continues to improve, there are an increasing number of domains in which its judgement is superior to that of humans. At what point is the relationship between AI and humans similar to that between an adult and a child? And to what extent should we demand to understand, agree or consent to decisions made by AI?

I see Paternalism and intentional Value Lock-in as intimately related. Considering that we may, by influencing the creation of AI, have a significant impact on the freedoms of future people: how should we responsibly discharge this power?

In writing this essay I’m hoping to reach people who’d like to discuss these things in more depth, and people who might want to work together.

I'd be very grateful for any criticism, suggestions, pointers, or discussion :)

Structure of this writing

I start with brief overviews of what I mean by ‘Value Lock-in’ and ‘Paternalism’, and the connection I see between them.

Then I discuss a core intuition behind my recent research into a plausible minimally-paternalistic approach to alignment: the idea of ‘un-asking’ human values. I include a brief discussion of technical approaches, which I plan to expand on more fully in later writing.

The core of this essay is a list of questions for which I currently have no answer:

  • Metaethics: Is all moral change contingent?
  • Paternalism: How much epistemic humility is appropriate or desired? What is the minimal amount required as a condition of logical consistency?
  • Empowerment: How does choice maximisation behave in the context of Social Choice Theory?
  • Practicality/deployment: Is the alignment tax of something like “maximise Empowerment-of-Others” too high?
  • Empowerment: Is there a way to generalise choice-maximisation approaches beyond toy models?
  • Hybridisation: Is there a way to balance goal-inference and choice-maximisation in a way that is robust to both reward-hacking and value lock-in?
  • Paternalism: Is there a way to generalise debate between LLM agents to comprehensively and meaningfully explore non-binary/open questions?

Finally, I wrap up with some discussion about the purpose of this writing, and suggest possible future steps.

Value Lock-in

When creating a new technology, such as the for-profit corporation or cryptocurrency, certain values may end up “locked in” and difficult to change once the technology is established. Values locked in to powerful AI systems could mean that a single ideology rules for a very long time, with no competing value systems. This ideology may reflect the values of a particular individual or group, or be accidentally misaligned. It may allow no meaningful dissent or debate, even in the case of evolving knowledge, understanding, and moral tastes[1].

Chapter 4 of Will MacAskill's What We Owe The Future is a very good introduction to value lock-in, describing it as arising from a combination of other convergent instrumental goals such as self-preservation and goal-content integrity (see e.g. Bostrom's 2012 The Superintelligent Will), and discussing the risks of very-long-term AI persistence (comparing emerging AI to the advent of writing).

Is there a way of minimising or mitigating the effect that value lock-in will have? Are there meta-values which, either for coherency or to protect things we unanimously value, we must unavoidably lock in?


Paternalism

Interfering with a person against their will, defended or motivated by claims about their well-being, has been well-studied as Paternalism. We accept that, at least in some cases, the frameworks of consent and liberalism are strained[2]. Yet meddling interference is often unwelcome, insulting and constrictive. In the worst cases, it is simply a fig leaf for selfish coercive control.

Iason Gabriel's Artificial Intelligence, Values, and Alignment describes multiple potential 'targets' of alignment:

  • literal instructions, as in the case of King Midas
  • expressed intentions, correctly interpreting the intent behind an instruction
  • revealed preferences, doing what one’s behaviour indicates they prefer
  • informed preferences, doing “what I would want it to do if I were rational and informed”
  • well-being, doing “what is best” for the person, objectively speaking
  • moral values, doing “what it morally ought to do, as defined by the individual or society”.

For many of these potential targets, there is some level of discretionary judgement-call which serves as an interpretation or 'correction' to the request. When should one make such corrections on another's behalf, thinking that one knows better?

One might argue that this is a less pressing concern than other Alignment areas, since Paternalism assumes that at least some human’s preferences are being represented. An extreme dictatorship might seem preferable to human extinction, since we might expect a human dictator’s values to be at least somewhat similar to the average human’s. Nevertheless, I’ve chosen to research Paternalism in the context of superintelligent AGI because it seems relatively neglected (compared to e.g. mechanistic interpretability), and because I am personally motivated by questions around avoiding oppression and encouraging individual exploration. While the opportunity cost of non-existence is significant, I am more moved by the prospect of people trapped in a value system not of their own choosing.

"Un-asking" human values

In this section, I describe one underlying theme I have been exploring, searching for a general approach to minimally-paternalistic actions.

Since interpretation of a person’s requests or behaviour creates the potential for paternalistic value judgements to be locked in to the system, I'm curious about approaches which try to "un-ask" the question of human value, instead trying to be helpful without taking a strong potentially-paternalistic stance on "what's good". Here, not only am I trying to avoid relying on intent alignment, but also - as far as possible - I aim to minimise commitment to any particular kind of value alignment.

Trying to be value-agnostic is kind of paradoxical, in a way that I'm not sure how to address:

  • Any stance which aims to preserve people’s ability to make their own choices must be opinionated in some meaningful way. How can "meaningful free choice" be preserved without making a value statement about which things are meaningful?[3]
  • The Paradox of Tolerance (a society which seeks to remain tolerant must be intolerant of intolerance) seems core to the question of non-paternalism. To what extent do we allow one person to coerce or limit another? (How does this relate to an individual's effect on their future selves: should one be able to take on crippling debt?)

Technical approaches

A large part of the Outer Alignment Problem is that we don't know how to communicate what we want. (One might go further, and say that we don't even know what we want.) This makes it hard to help us.

Instead of asking the agent (and having to use judgement to interpret the answer), approaches such as Inverse Reinforcement Learning (IRL) seek to learn the values of an agent (e.g. human values) through observation of its behaviour[4].

One main shortcoming of IRL is that agents sometimes behave in 'irrational' ways that don't reliably or directly bring about their goals. This may be because of physical or mental limitations, or because the true goals of the demonstrator lie outside the hypothesis set of the observer. One example of this latter point is that “Demonstration by Expert” can be a suboptimal way to teach a helpful robot about a reward function (consider showing a new worker where the spare coffee lives, before it’s needed). More generally, without assumptions about an agent's rationality, its value function cannot be derived from its observed behaviour.
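To make the last point concrete, here is a minimal sketch of Bayesian goal inference under an assumed "Boltzmann-rational" demonstrator, a modelling assumption often used in this kind of inference rather than something taken from a specific paper discussed here. The goals, action values and the rationality parameter `beta` are all hypothetical. When `beta` is zero (no rationality assumed), observations carry no information about the goal and the posterior simply equals the prior.

```python
import math

def boltzmann_policy(q_values, beta):
    """P(action | goal) proportional to exp(beta * Q(action, goal))."""
    weights = [math.exp(beta * q) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def goal_posterior(observed_actions, q_table, prior, beta):
    """Bayesian posterior over goals, given a list of observed action indices.

    q_table[goal] lists hypothetical action values under that goal.
    """
    log_post = {goal: math.log(p) for goal, p in prior.items()}
    for goal, q_values in q_table.items():
        policy = boltzmann_policy(q_values, beta)
        for action in observed_actions:
            log_post[goal] += math.log(policy[action])
    norm = math.log(sum(math.exp(v) for v in log_post.values()))
    return {goal: math.exp(v - norm) for goal, v in log_post.items()}

# Hypothetical toy problem: action 0 = "pick banana", action 1 = "pick potato".
Q_TABLE = {"likes_bananas": [1.0, 0.0], "likes_potatoes": [0.0, 1.0]}
PRIOR = {"likes_bananas": 0.5, "likes_potatoes": 0.5}
OBSERVED = [0, 0, 0]  # the demonstrator picked a banana three times

# Assuming a fairly rational demonstrator, bananas become the likely goal.
rational = goal_posterior(OBSERVED, Q_TABLE, PRIOR, beta=2.0)

# Assuming nothing about rationality (beta = 0), nothing can be inferred.
no_assumption = goal_posterior(OBSERVED, Q_TABLE, PRIOR, beta=0.0)
```

With `beta=2.0` the posterior on `likes_bananas` is above 0.99; with `beta=0.0` it stays at the prior of 0.5. The inference is only as good as the assumed `beta`, which is itself a judgement call about the demonstrator.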

Choice Maximisation

Rather than inferring an agent's goals, to be able to help the agent achieve them, we might instead try to increase the agent's ability to reach a variety of outcomes, without needing to take an explicit stance on which outcomes the agent might desire.

One potential advantage of this kind of approach is that it could sidestep the problem of instrumental power-seeking (see Bostrom, Turner). The reasons to be afraid of a power-seeking agent comprise unintended collateral damage, and incorrigibility[5]. If, instead, a power-seeking agent is trying to maximise humans' own control over their future, and so turns over its newly-gained power to us, we could perhaps remain able to choose from a variety of possible futures.

In an attempt to helpfully increase the variety of outcomes available to an agent, we might try to increase the agent's Empowerment, which is “the maximal potential causal flow from an agent’s actuators to an agent’s sensors at a later point in time”. This could, for example, be measured as the Mutual Information between mathematical sets of Actions available to an agent, and sets of States which might arise in the future: how much the choice of Action tells us about which future State arises[6]. This can happen as a perception-action loop, and agents can cooperatively work to empower each other, or competitively act to disempower each other.
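As a rough sketch of how this quantity can be computed in the simplest discrete case: Empowerment is usually formalised as the channel capacity from actions to future states, i.e. the Mutual Information maximised over the agent's choice of action distribution. The code below uses the standard Blahut-Arimoto iteration on a small transition matrix. The function names and example channels are illustrative assumptions, and real environments would need a learned or simulated model in place of the hand-written matrices.

```python
import math

def mutual_information(p_action, channel):
    """I(Action; State) in bits, where channel[a][s] = P(state s | action a)."""
    n_states = len(channel[0])
    p_state = [sum(p_action[a] * channel[a][s] for a in range(len(channel)))
               for s in range(n_states)]
    info = 0.0
    for a, row in enumerate(channel):
        for s, p in enumerate(row):
            if p_action[a] > 0 and p > 0:
                info += p_action[a] * p * math.log2(p / p_state[s])
    return info

def empowerment(channel, iterations=200):
    """Channel capacity max over p(action) of I(Action; State), via Blahut-Arimoto."""
    n_actions = len(channel)
    n_states = len(channel[0])
    p_action = [1.0 / n_actions] * n_actions  # start from a uniform distribution
    for _ in range(iterations):
        p_state = [sum(p_action[a] * channel[a][s] for a in range(n_actions))
                   for s in range(n_states)]
        scores = []
        for a in range(n_actions):
            # KL divergence between this action's outcomes and the current marginal
            kl = sum(p * math.log(p / p_state[s])
                     for s, p in enumerate(channel[a]) if p > 0)
            scores.append(p_action[a] * math.exp(kl))
        total = sum(scores)
        p_action = [x / total for x in scores]
    return mutual_information(p_action, channel)

# Every action leads to the same distribution over states: zero Empowerment.
NO_CONTROL = [[0.5, 0.5], [0.5, 0.5]]

# Each of four actions guarantees a different state: log2(4) = 2 bits.
FULL_CONTROL = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

# Two of three actions are redundant, so only 1 bit of control is available.
REDUNDANT = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
```

`empowerment(NO_CONTROL)` is 0 bits and `empowerment(FULL_CONTROL)` is 2 bits, matching the two extremes described in footnote 6. `REDUNDANT` gives 1 bit, which shows that it is the number of distinguishable outcomes, not the number of actions, that matters.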

Other approaches to this kind of goal-agnostic choice-maximisation include Turner et al's "POWER" and Franzmeyer et al's "CHOICE". My reasons to focus on Empowerment are:

  1. The "POWER" approach is based on Markov Decision Processes, which I think scale poorly and are difficult to generalise to continuous (rather than discrete) states. (I'm uncertain about this, though, and "read up on MDPs" is on my to-do list.)
  2. The "CHOICE" approach does away with the "Theory of Mind" aspect of Empowerment. While this has the advantage of no longer requiring an environment simulator, which improves the speed and tractability of their chosen problems[7], I think that considering an agent's own assessment of their Empowerment is a significant advantage: an agent is only Empowered if it believes itself to be so.

Shortcoming of Choice Maximisation: Trading Optionality for Reward

An assistant that sought to maximise optionality while staying completely agnostic to reward would work against a human’s attempt to commit to a particular desired end-state. If choice-maximisation is the only thing valued by an assisting AI, then we would naively be forced into whichever world most maintains optionality, without ever being able to ‘cash in’. We see this play out in Shared Autonomy via Deep Reinforcement Learning, where auto-pilots attempt to increase a human pilot's Empowerment, but when too opinionated insist on hovering, working against the human pilot’s efforts to land. It seems likely that some kind of hybrid approach would be necessary.
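The hovering behaviour can be reproduced in a toy model. The sketch below is a hypothetical one-dimensional "lander", not the environment from the cited work: altitude 0 means landed and is absorbing, and the assistant greedily picks whichever action leads to the state with the most reachable futures. In a deterministic world, n-step Empowerment reduces to the log of the number of reachable states, which is what the code counts.

```python
import math

MAX_ALTITUDE = 4
ACTIONS = {"down": -1, "stay": 0, "up": +1}

def step(altitude, action):
    """Deterministic toy dynamics; altitude 0 means landed, which is absorbing."""
    if altitude == 0:
        return 0
    return min(MAX_ALTITUDE, max(0, altitude + ACTIONS[action]))

def reachable(altitude, horizon):
    """Set of altitudes that can be reached after exactly `horizon` steps."""
    states = {altitude}
    for _ in range(horizon):
        states = {step(s, a) for s in states for a in ACTIONS}
    return states

def empowerment_bits(altitude, horizon=2):
    """Deterministic n-step Empowerment: log2 of the number of reachable states."""
    return math.log2(len(reachable(altitude, horizon)))

def assistant_action(altitude, horizon=2):
    """A pure choice-maximiser: move to the next state with the most Empowerment."""
    return max(ACTIONS, key=lambda a: empowerment_bits(step(altitude, a), horizon))
```

Here `empowerment_bits(0)` is 0, since nothing can be done once landed. `assistant_action(1)` returns `"up"` and `assistant_action(2)` returns `"stay"`: the assistant climbs away from the ground and then hovers at mid-altitude, whatever the pilot wants.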

Shortcoming of Empowerment: Delusional Beliefs

An agent’s Empowerment is subjective: it is the Mutual Information between its own conception of the actions available to it, and the possible futures which it can imagine. This is an important point, for a number of reasons:

  1. Theory of Mind: For a Helper agent to increase the Empowerment of an Assisted agent, the Helper agent must have some model of the understanding and world-view of the Assisted agent.
  2. Education: It is possible to Empower an agent simply by educating it about the abilities available to it, or about the nature of the world. In a room with a lever and a closed door, it is Empowering to inform someone that the lever opens the door.

However, the subjectivity of Empowerment introduces an important consideration: delusional beliefs. In the previous example of the lever and door, it increases the Empowerment of the person in the room for them to believe that the lever opens the door, whether or not this is the case in reality. There is no inherent reason for a Helper agent trying to maximise the empowerment of an Assisted agent to provide truthful information, and indeed it is more likely than not for Empowerment-maximising statements to be untrue.
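The lever-and-door example can be written down directly. This is a minimal sketch with hypothetical state and action names: the same measure is evaluated once on the model the Assisted agent believes, and once on what actually happens in the room.

```python
import math

def empowerment_bits(model, state):
    """Deterministic one-step Empowerment: log2 of the distinct outcomes selectable."""
    outcomes = set(model[state].values())
    return math.log2(len(outcomes))

# What actually happens in the room: the lever is not connected to anything.
TRUE_WORLD = {"in_room": {"wait": "door_closed", "pull_lever": "door_closed"}}

# What the Assisted agent believes after being told "the lever opens the door".
BELIEVED_WORLD = {"in_room": {"wait": "door_closed", "pull_lever": "door_open"}}

subjective = empowerment_bits(BELIEVED_WORLD, "in_room")  # 1 bit
objective = empowerment_bits(TRUE_WORLD, "in_room")       # 0 bits
```

A Helper scored only on `subjective` gains a full bit by making the false statement, while the person's actual control over the door, `objective`, stays at zero.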

One mitigation here is the concept of trust: over a long enough time period of repeated interactions, a Helper which loses the trust placed in it also loses the ability to increase the Empowerment of the Assisted agent. This may result in verifiable statements being a necessary component of the long-term Empowerment-maximising strategy.

Approaches I have avoided

Static parameter-based regularisation of side-constraints

One common approach to attempting to balance doing-good with not-doing-bad is to have side-constraints. In Conservative Agency, a regularisation parameter λ "can be interpreted as expressing the designer’s beliefs about the extent to which [the reward function] R might be misspecified". This successfully induces conservative behaviour in the paper’s simulated environments. However, the approach hard-codes an assumption of the level of correctness of the reward function’s specification, which seems unsafe to me.
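To illustrate why a fixed λ is a strong commitment, here is a sketch of the generic form "task reward minus λ times a side-effect penalty". This is a simplified stand-in and not the exact penalty used in Conservative Agency; the actions and numbers are hypothetical.

```python
def penalised_choice(actions, lam):
    """Pick the action maximising task_reward - lam * side_effect_penalty."""
    return max(actions, key=lambda name: actions[name][0] - lam * actions[name][1])

# Hypothetical (task_reward, side_effect_penalty) pairs for three candidate actions.
ACTIONS = {
    "shortcut_through_vase": (10.0, 8.0),
    "walk_around_vase": (9.0, 1.0),
    "do_nothing": (0.0, 0.0),
}

choices = {lam: penalised_choice(ACTIONS, lam) for lam in (0.0, 1.0, 20.0)}
```

With `lam=0.0` the agent takes the destructive shortcut, with `lam=1.0` it walks around, and with `lam=20.0` it does nothing at all. The behaviour is decided entirely by a number the designer chose in advance.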

Elsewhere, it is suggested that an agent with access to multiple ways to meet its goal would take the approach that minimises side-impact. This reminds me of food made with ingredients "sourced as locally as possible": this kind of least-privilege tie-breaker feels ripe for regulatory capture by the agent itself. I haven't thought this through as closely as I want to, though, and this intuition would benefit from being stated more formally.

My original research plan (coming from earlier exploratory writing of mine) was to investigate methods of hybridising Choice Maximisation (e.g. Empowerment) with Goal Inference (e.g. Inverse Reinforcement Learning), potentially also including active-learning based around curiosity or uncertainty. But it's unclear to me how to do this without a naïve tuning parameter which is either hackable or arbitrarily locked-in.

Censorship-based approaches

One approach to AI Safety is to attempt to guide a given model away from particular behaviour which the creators consider undesirable. For example, Anthropic's "Helpful and Harmless Assistant" disfavours violent responses, and Stable Diffusion's NSFW Safety Module (description, code) aims to avoid generation of sexual content. While certain traits, such as avoiding intentional deception, may end up being necessary to any system that avoids undesirable lock-in, the approaches mentioned above seem to take opinionated stances at a more object-level layer of ethics than I think is helpful for considering questions of value lock-in and paternalism.


Concrete questions and problems

In this section, I try to pinpoint concrete questions and problems around Paternalism, Value Lock-in, and Empowerment. These are my questions rather than those of any broader research community. I begin with more abstract, philosophical questions, and become increasingly technical.


Metaethics: Is all moral change contingent?

Many values which are held today are so held because those values are culturally adaptive - people are a product of their society, and stable forms persist. That is to say: certain norms, such as valorising conquest or favouring caring for children, seem intuitively to increase the competitive fitness of a culture. But we should not ex ante expect that the evolutionary benefit of a particular trait speaks to its moral desirability. This is similar to the discussion in Allan Dafoe’s Value Erosion: “absent a strongly coordinated world in which military and economic competition is controlled, the future will be shaped by evolutionary forces and not human values”. What reason do we have to believe that any 'moral' trait which we see agreed upon within or across cultures is desirable? Kind of a big question.

Right now I’m reading the chapter Why Act Morally? in Peter Singer’s Practical Ethics, which has the line “Whether to act according to considerations of ethics, self-interest, etiquette, or aesthetics would be a choice ‘beyond reason’ - in a sense, an arbitrary choice. Before we resign ourselves to this conclusion we should at least attempt to interpret the question so that the mere asking of it does not commit us to any particular point of view.”

If all moral change is contingent, then I do not see that we have any special reason to avoid imposing our own preferences on future people, whose moral views will be just as arbitrary as our own.


Paternalism: How much epistemic humility is appropriate or desired? What is the minimal amount required as a condition of logical consistency?

Encoding complete epistemic uncertainty would prevent the taking of any action[8], and so would be equivalent to declining to deploy AGI: as such it seems an impractical solution to the alignment problem. Some epistemic statement must therefore be made, which will unavoidably lock in some path-dependence. What is the minimal possible coherent value statement? What are its implications, and possible shortcomings? What additional statements might various groups desire?


Empowerment: How does choice maximisation behave in the context of Social Choice Theory?

Literature on choice maximisation currently covers single Helper/Assisted-agent pairings, agents which share a reward function, and agents which play zero-sum games against each other. But how should an altruistic agent behave when trying to help a collection of agents who are engaged in a mixed- or zero-sum game with each other? Whose Empowerment should be maximised? How do we weight the Empowerment of non-human animals, or digital minds? How should an agent which is trying to maximise the Empowerment-of-Others behave when one of these others wants to disempower another? This is a question of welfare functions, familiar to political philosophy, which I have so far neglected.


Practicality/deployment: Is the alignment tax of something like “maximise Empowerment-of-Others” too high?

Assuming that something like Empowerment-of-Others is a defensible, ethical thing to maximise: is there any reason to think that a first-actor would be incentivised to choose this rather than their own personal goal?


Empowerment: Is there a way to generalise choice-maximisation approaches beyond toy models?

Existing successes of Empowerment are (to my knowledge) limited to arenas such as grid-worlds and real-world drones flying in 3D. These successes are sufficient to prove the concept, but do not easily extend to encompass all the things people might care about. People do not only care about position in 3D space: a conversation between two people may go well, or poorly, without either participant leaving their seat. It is unclear to me how nuanced outcomes, like the quality of a discussion, could be well-represented by the finite states of an MDP.

What about the general intuitive concept of optionality which Paul Graham calls "staying upwind"? This presumably exists as a concept in AlphaZero (which is capable of positional play), but again is restricted to particular arenas.

More specifically, it seems that the choice of metric over the spaces of Actions and future States completely predetermines Empowerment. In grid worlds, identifying States with position is reasonable. But instead if we have a person in front of a single apple on a table, does cutting it into eighths increase the person's Empowerment (because there are more things to pick up) or reduce it (because they can no longer choose between a whole apple, and an apple in pieces)? This choice determines which action will be preferred: to cut or not to cut. These choices can be made appropriately for particular specific purposes, but it is not clear to me how choices about which of multiple options is more Empowering can be generalised in a tractable and value-agnostic way.


Hybridisation: Is there a way to balance goal-inference and choice-maximisation in a way that is robust to both reward-hacking and value lock-in?

There is existing work on hybridising goal-inference and choice-maximisation, for example creating an auto-pilot that seeks to stabilise a human-piloted vehicle. In these situations, excessive choice-maximisation will subvert a human’s attempt to trade optionality for reward. In the literature I’ve found, this is resolved either by a simple Lagrangian (which I discuss in Static parameter-based regularisation of side-constraints), or by selecting the most Empowering policy from a minimally-valuable set. Both of these seem vulnerable to misspecification of the reward function, and (more importantly) to a hard-coded misspecification of the uncertainty around the accuracy of the reward function. Perhaps weighting by a dynamically-updating Bayesian confidence of the inferred goal would be an improvement?
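One hypothetical way to read the confidence-weighting idea is sketched below. It is purely illustrative and not drawn from the literature: the weight on the inferred goal is one minus the normalised entropy of the posterior over goals, so a uniform posterior falls back to pure choice-maximisation and a confident posterior defers to the inferred goal. The goal names, values and Empowerment scores are assumptions.

```python
import math

def confidence(posterior):
    """1 minus normalised entropy: 0 for a uniform posterior, 1 for a certain one."""
    entropy = -sum(p * math.log2(p) for p in posterior.values() if p > 0)
    return 1.0 - entropy / math.log2(len(posterior))

def hybrid_choice(posterior, goal_value, empowerment):
    """Blend expected goal value and Empowerment, weighted by goal confidence."""
    w = confidence(posterior)

    def score(action):
        expected_value = sum(p * goal_value[g][action] for g, p in posterior.items())
        return w * expected_value + (1.0 - w) * empowerment[action]

    return max(empowerment, key=score)

# Hypothetical auto-pilot setting with two candidate pilot goals.
GOAL_VALUE = {
    "wants_to_land": {"land": 1.0, "hover": 0.0},
    "wants_to_fly": {"land": 0.0, "hover": 1.0},
}
EMPOWERMENT = {"land": 0.0, "hover": 1.0}  # normalised to [0, 1] for comparability

unsure = hybrid_choice({"wants_to_land": 0.5, "wants_to_fly": 0.5},
                       GOAL_VALUE, EMPOWERMENT)
confident = hybrid_choice({"wants_to_land": 0.95, "wants_to_fly": 0.05},
                          GOAL_VALUE, EMPOWERMENT)
```

Here `unsure` is `"hover"` and `confident` is `"land"`. Note that the mapping from posterior to weight is still a design choice, so this moves the tuning question rather than removing it.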


Paternalism: Is there a way to generalise debate between LLM agents to comprehensively and meaningfully explore non-binary/open questions?

OpenAI describe AI Safety via Debate, where two agents debate whether a given photo is of a cat, or of a dog. The hope is that for difficult problems, computers could argue for opposite points of view, breaking down the problem step-by-step into possible decision points, where the reasoning could be easier for humans to audit, verify and referee.

I'm not aware of any generalisation of Safety via Debate from binary yes/no questions to broader discussion. For example, rather than Anthropic's approach of downweighting violent suggestions for how to obtain drugs, where there is a single Helpfulness/Harmlessness axis, could there instead be some maximally-diverse representation of viewpoints (see for example the GPT-3 simulations of human subgroups of Argyle et al), which could provide good coverage of diverse values and opinions to engage in a forum of auditable debate?



This writing aims to increase the variety and richness of experience available in the long term, by reducing the chance that people become unable to explore and express their values.

It seeks to do that by:

  1. Pushing me to clarify my own thoughts and research direction. If I’m hoping to make things clear to a reader, they must first be clear to me.
  2. Increasing the legibility of my research, so that it can be corrected and built upon.
  3. Reaching others who can guide and advise my research.
  4. Suggesting projects to people who might like to collaborate, whether working in parallel, with guidance, or independently.


In this writing, I have given a brief overview of my thoughts on Paternalism and Value Lock-in, and touched on some early technical ideas that seek to address these issues.

I have also tried to provide a selection of concrete research questions and real-world problems that could form the seed of further research.

If you’d like to get in touch, please do reach out.



I'm grateful to Shoshannah Tekofsky, Stanisław Malinowski, Edward Saperia, Kaïs Alayej, Blaine Rogers and Justis Mills for comments on drafts of this work, and to the EA Long Term Future Fund for enabling this research.


  1. ^

    See also: corrigibility, discussed by Paul Christiano and MIRI. Quoting Koen Holtman: "A corrigible agent will not resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started." To avoid Value Lock-in, however, we must also consider the corrigibility and entrenchment of these authorised parties.

  2. ^

    There is a question of how consent and liberty fit into a utilitarian framework.

    Rule utilitarianism might propose that humans, being boundedly rational, pursue the greatest utility by pursuing these kinds of higher-order values, but this does not necessarily apply to arbitrarily intelligent agents.

    There is a plausible epistemic stance that a person is the best judge of what pleases or displeases them, or of what their own preferences are. However, a cornerstone of paternalism is that someone else might understand a person's preferences or welfare better than the person themselves. 

    Appeals to the inherent pleasure of self-rule only address the appearance of paternalism, and thus only require paternalistic acts to be sufficiently subtle or deceptive.

  3. ^

    This is similar to the question of Impact Measures, but whereas Impact Measures are usually thought of in the context of minimising undesired impact, a diversity of meaningful options would be something to maximise.

  4. ^

    A monkey who collects bananas rather than potatoes might be said to "prefer" bananas, or to find them more rewarding. It's not clear to me how this distinguishes between revealed preferences, 'true' preferences, and non-goal-driven behaviour, especially when considering addictive behaviour. I'd like to expand on this in later writing.

  5. ^

    Let me know if you see another :)

  6. ^

    If, no matter which action an agent takes, the probability of ending up in any given future state is unchanged, then the agent has zero empowerment. If instead for any given future state the agent can choose an action to guarantee it arising, then the agent is maximally empowered.

  7. ^

    An advantage which they press with their "Immediate Choice" proxy, which myopically restricts consideration to the next single time-step.

  8. ^

    Even something like the Principle of Maximum Entropy, which can be used for decision and prediction tasks, is an opinionated statement about the world.




