This post is written in my personal capacity, and does not necessarily represent the views of OpenAI or any other organization. Cross-posted to the Alignment Forum.
The structure of this sequence will be as follows:
- First, in this post, I will define some key terms and sketch what an ideal law-following AI ("LFAI") system might look like.
- In the next few posts, I will explain why law-following might not emerge by default given the existing constellation of alignment approaches, financial objectives, and legal constraints, and explain why this is troubling.
- Finally, I will propose some policy and technical routes to ameliorating these problems.
If the vision here excites you, and you would like to get funding to work on it, get in touch. I am excited to consider supporting people interested in working on this, as long as it does not distract them from working on more important alignment issues.
A law-following AI, or LFAI, is an AI system that is designed to rigorously comply with some defined set of human-originating rules ("laws"), using legal interpretative techniques, under the assumption that those laws apply to the AI in the same way that they would to a human. This compliance should be intrinsically motivated. By "intrinsically motivated," I mean that the AI is motivated to obey those rules regardless of whether (a) its human principal wants it to obey the law, or (b) disobeying the law would be instrumentally valuable. (The Appendix to this post explores some possible conceptual issues with this definition of LFAI.)
I will compare LFAI with intent-aligned AI. The standard definition of "intent alignment" generally concerns only the relationship between some property of a human principal H and the actions of the human's AI agent A:
- Jan Leike et al. define the "agent alignment problem" as "How can we create agents that behave in accordance with the user's intentions?"
- Amanda Askell et al. define "alignment" as "the degree of overlap between the way two agents rank different outcomes."
- Paul Christiano defines "AI alignment" as "A is trying to do what H wants it to do."
- Richard Ngo endorses Christiano's definition.
Iason Gabriel does not directly define "intent alignment," but provides a taxonomy wherein an AI agent can be aligned with:
1. "Instructions: the agent does what I instruct it to do."
2. "Expressed intentions: the agent does what I intend it to do."
3. "Revealed preferences: the agent does what my behaviour reveals I prefer."
4. "Informed preferences or desires: the agent does what I would want it to do if I were rational and informed."
5. "Interest or well-being: the agent does what is in my interest, or what is best for me, objectively speaking."
6. "Values: the agent does what it morally ought to do, as defined by the individual or society."
All but (6) concern the relationship between H and A. It would therefore seem appropriate to describe them as types of intent alignment.
Alignment with some broader or more complete set of values—such as type (6) in Gabriel's taxonomy, Coherent Extrapolated Volition, or what Ngo calls "maximalist" or "ambitious" alignment—is perhaps desirable or even necessary, but seems harder than working on intent alignment. Much current alignment work therefore focuses on intent alignment.
We can see that, on its face, intent alignment does not entail law-following. A key crux of this sequence, to be defended in subsequent posts, is that this gap between intent alignment and law-following is:
- Bad in expectation for the long-term future.
- Easier to bridge than the gap between intent alignment and deeper alignment with moral truth.
- Therefore worth addressing.
To clarify, this sequence does not claim that LFAI can replace intent alignment.
A Sketch of LFAI
What might an LFAI system look like? I'm not a computer scientist, but here is roughly what I have in mind.
If A is an LFAI, then A's evaluation of the legality of an action will sometimes trump A's evaluation of an action in light of its benefit to H. In LFAI, as in a legally scrupulous human, legality constrains how an agent can advance their principal's interests. For example, a human mover may be instructed to efficiently move a box for her principal, but may not unnecessarily destroy others' property in doing so. Similarly, an LFAI moving a box normally would not knock over a vase in its path, because doing so would violate the legal rights of the vase-owner.
Above, I preliminarily defined LFAI as "rigorously comply[ing]" with some set of laws. Obviously this needs a bit more elaboration. We probably don't want to define this as minimizing legal noncompliance, since this would make the system extremely risk-averse to the point of being useless. More likely, one would attempt to weight legal downside risks heavily in the agent's objective function, such that it would keep legal risk to an acceptable level.
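As a rough illustration of the difference between these two options, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Action` fields, the numeric values, and the `risk_weight` and `risk_cap` parameters are hypothetical and do not come from any existing system. It only shows why pure risk minimization collapses into inaction, while a heavily weighted legal-risk term still permits useful, low-risk actions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    benefit: float     # benefit to the principal H (arbitrary units, assumed)
    legal_risk: float  # expected legal downside (same units, assumed)


def pick_min_risk(actions):
    """Pure minimization of legal risk: benefit to H is ignored entirely."""
    return min(actions, key=lambda a: a.legal_risk)


def pick_weighted(actions, risk_weight=10.0, risk_cap=1.0):
    """Maximize benefit minus heavily weighted legal risk.

    Actions whose legal risk exceeds risk_cap are excluded outright.
    Assumes at least one action (e.g. doing nothing) falls under the cap.
    """
    allowed = [a for a in actions if a.legal_risk <= risk_cap]
    return max(allowed, key=lambda a: a.benefit - risk_weight * a.legal_risk)


# Hypothetical numbers for the box-moving example.
ACTIONS = [
    Action("do nothing", benefit=0.0, legal_risk=0.0),
    Action("move box carefully", benefit=5.0, legal_risk=0.1),
    Action("move box, knocking over the vase", benefit=6.0, legal_risk=3.0),
]

print(pick_min_risk(ACTIONS).name)
print(pick_weighted(ACTIONS).name)
```

With these made-up numbers, the risk minimizer chooses to do nothing, while the weighted objective chooses to move the box carefully and never selects the vase-breaking option. How to set the weight and the cap in practice is exactly the open question.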
It is worth noting that LFAI is ideally not merely attempting to reduce its expected legal liability in fact. As will be explored later, a sufficiently smart agent could probably reduce its expected legal liability merely by hiding its knowledge/intentions/actions or corrupting a legal proceeding. An LFAI, by contrast, is attempting to obey the law in an idealized sense, even if it is unlikely to actually face legal consequences.
An LFAI system does not need to store all knowledge regarding the set of laws that it is trained to follow. More likely, the practical way to create such a system would be to make the system capable of recognizing when it faces sufficient legal uncertainty, then seeking evaluation from a legal expert system ("Counselor").
The Counselor could be a human lawyer, but in the long run is probably most robust and efficient if (at least partially) automated. The Counselor would then render advice on the pure basis of idealized legality: the probability and expected legal downsides that would result from an idealized legal dispute regarding the action if everyone knew all the relevant facts.
Thus pseudocode for an LFAI who wants to take an action X to benefit H might be:
- If X is clearly illegal:
  - don't do X.
- Elseif X is maybe-illegal:
  - Give Counselor all relevant information about X in an unbiased way; then
  - Get Counselor's opinion on expected legal consequences from X; then
  - Weigh expected legal consequences against benefit to H from X; then
  - Decide whether to do X given those weightings.
- Else:
  - do X.
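The procedure above can be sketched in Python as follows. This is a minimal illustration under strong assumptions, not a proposed implementation: the `Legality` classification is taken as a given input (producing it reliably is one of the hard problems), the Counselor is modeled as a plain function returning an expected legal downside, and `risk_weight` and `stub_counselor` are hypothetical names introduced only for this example.

```python
from enum import Enum
from typing import Callable


class Legality(Enum):
    CLEARLY_ILLEGAL = "clearly illegal"
    MAYBE_ILLEGAL = "maybe illegal"
    CLEARLY_LEGAL = "clearly legal"


def decide(
    action: str,
    legality: Legality,
    benefit_to_h: float,
    counselor: Callable[[str], float],
    risk_weight: float = 10.0,
) -> bool:
    """Return True if the agent should take the action."""
    if legality is Legality.CLEARLY_ILLEGAL:
        # Clearly illegal: refuse, whatever the benefit to H.
        return False
    if legality is Legality.MAYBE_ILLEGAL:
        # Assumed: the counselor receives a full, unbiased description and
        # returns the expected legal downside of the action.
        expected_downside = counselor(action)
        # Weigh the (heavily weighted) legal downside against the benefit to H.
        return benefit_to_h > risk_weight * expected_downside
    # Otherwise the action is clearly legal: do it.
    return True


def stub_counselor(action: str) -> float:
    """Hypothetical stand-in for real legal analysis: a fixed lookup table."""
    return {"park briefly in a loading zone": 0.2}.get(action, 1.0)


print(decide("take the neighbor's ladder", Legality.CLEARLY_ILLEGAL, 100.0, stub_counselor))
print(decide("park briefly in a loading zone", Legality.MAYBE_ILLEGAL, 5.0, stub_counselor))
print(decide("move the box", Legality.CLEARLY_LEGAL, 1.0, stub_counselor))
```

Note that the first branch ignores `benefit_to_h` entirely. That is the sense in which legality constrains, rather than merely trades off against, the principal's interests.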
Note that this pseudocode may resemble the decisionmaking process of A if H wants A to obey the law. Thus, one route to giving an intent-aligned AI the motivation to obey the law may be stipulating to A that H wants A to obey the law.
With this picture in mind, it seems like, to make LFAI a reality, progress on the following open problems (non-exhaustively) would be useful:
- Reliably stipulating law-following conditions to AI systems' objectives.
- Resolving any disagreement between law-following and a principal's instructions appropriately.
- Getting AI agents to recognize when they face legal uncertainty (especially in a way that does not incentivize ignorance of the law).
  - This seems similar to the intent alignment problem of getting agents to recognize when they need further information from principals, as in corrigibility work.
- Eliciting, in natural language, AI systems' honest descriptions of their knowledge and desired actions.
  - As noted above, this seems likely to run into problems related to ELK generally.
- Mapping legal concepts of mental states (e.g., intent, knowledge) to features of AI systems.
  - This seems related to interpretability and explainability work.
- Building Counselor functions.
  - Automating the process of legal research given a natural language description of an agent's proposed actions and mental state.
  - Simulating idealized and fair substantive legal disputes.
    - This seems related to Debate.
Appendix: More Conceptual Clarifications on LFAI
This Appendix provides some additional clarification on the definition of LFAI given above.
Applicability of Law to AI Systems
One might worry that the law often regulates physical behavior in a way that is not obviously applicable to all AI systems. For example, physical contact with another is an element of the tort of battery. However, this may be less of a problem than initially appears: courts have been able to reason through whether to apply laws originating in meatspace to computational and cyberspace conduct. Whether such analogies are properly applied is indeed highly debatable, but the fact that such analogizing is conceptually possible reduces the force of this objection. Furthermore, if some laws are simply inapplicable to non-embodied actors, this is not a problem for the conceptual coherence of LFAI as a whole: an LFAI can simply ignore those laws, and we can design laws specifically with computational content.
Perhaps a more fundamental problem is that the law frequently depends on mental states that are not straightforwardly applicable to AI systems. For example, the legality of an action may depend on whether the actor intended some harmful outcome. Thus, much of the value of LFAI depends on whether we can map human understandings of moral culpability to AI systems.
To me, however, this seems like an argument in favor of working on LFAI. Regardless of whether LFAI as such is valuable, if we expect increasingly autonomous AI systems to take increasingly impactful actions, we would probably like to understand how their objective functions (analogous to human motives) and world-model (analogous to human knowledge) map to their actions and the effects thereof. This is for the same reasons that we care about human motives and knowledge: when evaluating the alignment of agents, it is useful to know whether an agent intended to cause some harm, or knew that such a harm would ensue, etc. LFAI depends on progress on this, but is also potentially a useful toy problem for interpretability and related work in ML.
Legal compliance is also a function of both law and facts, and responsibility for definitive determinations of law and facts is split between judges and juries. Law often invokes standards like "reasonableness" that are definitively assessed only ex post, in the context of a particular dispute. The definitive legality of an action may therefore turn on an actual adjudication of the dispute. This is of course costly, which is why I suspect we would want an LFAI to act on its best estimate of what such an adjudication would yield (after asking a Counselor), rather than wait for such adjudication to take place.
It is also worth distinguishing between whether an actual court of law would rule that an AI's behavior violated some law and whether a simulated and fair legal dispute resolution process (possibly including, for example, a bespoke arbitral panel) would conclude that the behavior violated the law. The latter may be more convenient for working on LFAI for a number of reasons, including that it can ignore or stipulate away some of the peculiarities of adjudicating disputes in which an AI system is a "party."
For early, informal discussion on this topic, see Michael St. Jules, What are the challenges and problems with programming law-breaking constraints into AGI?, Effective Altruism Forum (Feb. 2, 2020), https://forum.effectivealtruism.org/posts/qKXLpe7FNCdok3uvY/what-are-the-challenges-and-problems-with-programming-law [https://perma.cc/HJ4Y-XSSE] and accompanying comments. ↩︎
Whether such rules are actually encoded into legislation is not particularly important. Virtually all legal rules not part of public law can be made “legal” with regard to particular parties as part of a contract, for example. In any case, the heart of LFAI is being bound to follow rules, and interpreting those rules leveraging the rich body of useful rule-interpretation metarules from law. ↩︎
This is important because one of the core functions of law is to provide metarules regarding the interpretation of rules, guided by certain normative values (e.g., fairness, predictability, consistency). Indeed, rules of legal interpretation aim to solve many problems relevant to AI interpretation of instructions. Cf. Dylan Hadfield-Menell & Gillian Hadfield, Incomplete Contracting and AI Alignment (2018) (preprint), https://arxiv.org/abs/1804.04268. ↩︎
That is, the AI is not law-following just because the principal wants the AI to follow the law. Indeed, LFAI should disobey orders that would require it to behave illegally. ↩︎
That is, the AI is not law-following just because it is instrumentally valuable to it (because, e.g., being caught breaking the law would cause the AI to be turned off). ↩︎
As Ngo says, "My opinion is that defining alignment in maximalist terms is unhelpful, because it bundles together technical, ethical and political problems. While it may be the case that we need to make progress on all of these, assumptions about the latter two can significantly reduce clarity about technical issues." ↩︎
I don't here offer an opinion on what training regime would yield such an outcome—my hope is to get someone to answer that for me! ↩︎
This approach may work particularly well when combined with insurance requirements for people deploying AI systems. ↩︎
Note that there are ELK-style problems with this approach. If an AI is asking for legal advice and wants to minimize the negative signal it gets from the Counselor, it may hide certain relevant information (e.g., its true state of knowledge or its true intentions) from the Counselor. A good solution, as discussed, could be to simulate an idealized adjudication of the issue if all the parties knew all the relevant facts and had equal legal firepower. But incentivizing the LFAI to tell the Counselor its true knowledge/intentions is an ELK problem. In the limit, the Counselor need not strictly be a distinct agent from the LFAI: an LFAI system may have Counselor capabilities and run this "consultation" process internally. Nevertheless, it is illustratively useful to imagine a separation of the LFAI and the Counselor. ↩︎
This would be idealized so that details not ultimately relevant to the substantive legality of the action (e.g., jurisdiction, AI personhood, other procedural matters, asymmetries in legal firepower) can be ignored. See the final footnote of this piece for further discussion. ↩︎
See the Appendix for more discussion on this point. ↩︎
See, e.g., Intel Corp. v. Hamidi, 71 P.3d 296, 304–08 (Cal. 2003) (applying trespass to chattels to unauthorized electronic computer access); MAI Sys. Corp. v. Peak Computer, Inc., 991 F.2d 511, 518–19 (9th Cir. 1993) (storing data in RAM sufficient to create a "copy" for copyright purposes, despite the fact that a "copy" must be "fixed in a tangible medium"); cf. United States v. Jones, 565 U.S. 400, 406 n.3 (2012) (analogizing GPS tracking to in-person surveillance for Fourth Amendment purposes). ↩︎
See, e.g., Jonathan H. Blavin & I. Glenn Cohen, Gore, Gibson, and Goldsmith: The Evolution of Internet Metaphors in Law and Commentary, 16 Harv. J.L. & Tech. 265 (2002). ↩︎
However, the case for working on LFAI certainly diminishes with the number of applicable laws. ↩︎
This raises further issues, including the possibility of self-reference. For example, an LFAI or Counselor asymmetrically deployed by one litigant may be able to persuade a judge or jury of its position, even if it's not the best outcome. To avoid this, such simulations should assume that judges and juries are fully apprised of all relevant facts (i.e., neither the LFAI nor Counselor can obscure relevant evidence) and that any LFAI or Counselor deployed in the simulated proceeding is symmetrically available to both sides. ↩︎