This post is written in my personal capacity, and does not necessarily represent the views of OpenAI or any other organization. Cross-posted to the Alignment Forum.
In the first post of this sequence, I defined "law-following AI" ("LFAI") and "intent alignment." In this post, I will begin to motivate the case for working on LFAI by showing that an AI agent A that is only intent-aligned (i.e., without additional law-following constraints directly on the agent) would in many cases break the law to advance its human principal H's interests.
If you agree that it is generally good for agents to be bound by law, then this should be a pro tanto reason to support work on LFAI in addition to intent alignment. If you need more convincing that LFAI would be generally good, the next few posts will explore some more specific motivations for LFAI.
Image by OpenAI's DALL·E
Sufficiently Intelligent Agents Can Evade Detection and Attribution
I suspect this point will need little argumentation for many of my readers. Evasion and obfuscation seem likely to be generally promising means of ensuring self-preservation (because detection would increase the probability of punishment and constraint), which in turn is plausibly a convergent instrumental goal of optimizing AI systems.
For example, to evade detection and attribution, A could:
- Behave innocently only when being observed;
- Orchestrate complex malevolent actions through a coordinated series of small, inscrutable or innocent-seeming steps;
- Surreptitiously create and run subagents or copies of itself on new hardware, potentially very far removed from its original hardware;
- Convince humans or other systems to take actions on its behalf, without revealing its true identity; and/or
- Delete logs of its actions.
A sufficiently intelligent AI agent could likely consistently fool humans using these and other (unforeseen) techniques.
Other Ways to Circumvent Law
Even in the best-case scenario, where the agent is detected and within the jurisdiction of a well-functioning legal system, it would be reasonable to question whether A or H could be effectively subject to normal legal processes. If A were motivated to do so, A could help H escape liability by, for example:
- "Outlawyering" counterparties.
- Benefitting H in a way that would undermine recourse for creditors.
- Shifting and hiding assets in ways that would make them difficult for creditors to reach.
- Persuasively arguing for the law to be changed in H's favor (by legislation or otherwise).
- Engaging in vexatious litigation techniques to delay and raise the costs of the proceeding.
- Convincingly fabricating favorable evidence and destroying or obscuring unfavorable evidence.
- Bribing, misleading, or intimidating counterparties, witnesses, jurors, and judges.
A Competent Intent-Aligned Agent Will Sometimes Intentionally Break the Law
As I said in the previous post, on its face, intent-alignment does not entail law-following. Part of law is coercing prosocial behavior: law incentivizes agents to behave in ways that they do not intrinsically want to behave. If A is aligned with H, whether A obeys the law depends on whether H wants A to obey the law. Subsequent posts will examine what legal consequences H might face if A causes legally cognizable harms. However, even if an adequate theory of liability for H were available, it would seem impossible to hold H liable if nobody can produce evidence that some agent of H's was responsible for those harms. As argued above, a sufficiently intelligent agent probably could consistently avoid leaving any such evidence.
Detection and attribution would not solve the problem, however. Even if H was compelled, under court order, to instruct A to behave in some way, it's not clear that A would follow the order. Consider again Iason Gabriel's taxonomy of alignment. We can see that, for most types of intent alignment, an intent-aligned agent would likely not obey compelled instructions that are against H's true wishes:
"Instructions: the agent does what I instruct it to do."
- If H asks the agent to stop, it will. However, this type of alignment is likely to be insufficient for safety.
"Expressed intentions: the agent does what I intend it to do."
- Even if H is coerced into instructing the agent to stop, a sufficiently intelligent agent will probably not follow the instruction. Even though H has instructed A to stop, by supposition, H does not actually want A to stop. Under most definitions of intent alignment, A would therefore not follow the order.
"Revealed preferences: the agent does what my behaviour reveals I prefer."
- H's revealed preference would probably be that A not follow the order. Therefore, A would not obey the order.
"Informed preferences or desires: the agent does what I would want it to do if I were rational and informed."
- H's rational and informed preference would probably be that A not follow the order. Therefore, A would not obey the order.
"Interest or well-being: the agent does what is in my interest, or what is best for me, objectively speaking."
- It is in H's objective best interest for A to disobey the order. Therefore, A would not obey the order.
Now, it may be the case that H actually does want A to obey the order, though compelled, if the failure of A to obey would lead to liability for H that is worse than the results of A's obedience (e.g., because H will be held in contempt of court if A does not actually obey). However, note that "[o]rdinarily, one charged with contempt of court for failure to comply with a court order makes a complete defense by proving that he is unable to comply." H can comply with an order that requires H to command A to do something, but it may be impossible for H to actually force A to comply if the order is against H's true wishes (to which A is aligned). If so, H could have an impossibility defense to contempt. A, understanding this, may continue on without complying because A understands that H will not actually be held in contempt. H can therefore benefit from A's disobedience. A will therefore be lawless.
Appendix: The Impossibility Defense
A's behavior here would be functionally similar to that of a trustee acting pursuant to duress clauses in asset protection trusts ("APTs"). While these provisions can prevent a contempt charge, the burden of proof on the alleged contemnor is high.
As a matter of policy, however, courts may decide to pre-commit to a contempt standard that does not allow for an impossibility defense when the defendant's AI agent refuses to obey instructions issued pursuant to a court order. Analogously, courts are imposing heightened impossibility standards in response to APTs, in an attempt to make their use more onerous. If this pre-commitment is credible, it may change the agent's behavior because H may then genuinely desire A to perform (because H will be held in contempt otherwise). However, such a policy may be contrary to both precedent and more fundamental notions of fairness and due process: in some cases A's refusal to comply may be a surprise to H, since H may have had a long history of observing A scrupulously complying with H's orders, and H may not have implemented principal–agent alignment for the purpose of evading court orders. If so, H may be able to invoke impossibility more easily, since the impossibility was not as clearly intentionally self-induced as in the APT case. Furthermore, I would intuitively not expect courts to advance such a reform until they have faced multiple such instances of AI disobedience. This seems bad if we expect the earliest deployed AI agents to have an outsized impact on society. In any case, I would expect the possibility of favorable post-AGI law reform to be an insufficient solution to this problem. Finally, I would expect sufficiently intelligent agents to recognize these dynamics and attempt to find ways to circumvent the contempt process itself, such as by surreptitious non-compliance.
An alternative, pre-AGI solution (which arguably seems pretty sensible from a public policy perspective anyway) is to advocate weakening the impossibility defense for self-imposed impossibility.
Even this may not hold for many types of agreements, including in particular international treaties. ↩︎
See also Cullen O'Keefe et al., The Windfall Clause: Distributing the Benefits of AI for the Common Good 26–27 (2020), https://perma.cc/8KES-GTBN; Jan Leike, On The Windfall Clause (2020) (unpublished manuscript), https://docs.google.com/document/d/1leOVJkNDDj-NZUZrNJauZw9S8pBpuPAJotD0gpnGEig/. ↩︎
Indeed, this is already a common technique without the use of AI systems. ↩︎
"If men were angels, no government would be necessary." The Federalist No. 51. This surely overstates the point: law can also help solve coordination problems and facilitate mutually desired outcomes. But prosocial coercion is nevertheless an important function of law and government. ↩︎
See Gabriel at 7 ("However, as Russell has pointed out, the tendency towards excessive literalism poses significant challenges for AI and the principal who directs it, with the story of King Midas serving as a cautionary tale. In this fabled scenario, the protagonist gets precisely what he asks for—that everything he touches turns to gold—not what he really wanted. Yet, avoiding such outcomes can be extremely hard in practice. In the context of a computer game called CoastRunners, an artificial agent that had been trained to maximise its score looped around and around in circles ad infinitum, achieving a high score without ever finishing the race, which is what it was really meant to do. On a larger scale, it is difficult to precisely specify a broad objective that captures everything we care about, so in practice the agent will probably optimise for some proxy that is not completely aligned with our goal. Even if this proxy objective is 'almost' right, its optimum could be disastrous according to our true objective." (citations omitted)). ↩︎
Based on my informal survey of alignment researchers at OpenAI. Everyone I asked agreed that an intent-aligned agent would not follow an order that the principal did not actually want followed. Cf. also Christiano (A is aligned when it "is trying to do what H wants it to do" (emphasis added)). ↩︎
We can compare this definition of intent to the relevant legal definition thereof: "To have in mind a fixed purpose to reach a desired objective; to have as one's purpose." INTEND, Black's Law Dictionary (11th ed. 2019). H does not "intend" for the order to be followed under this definition: the "desired objective" of H issuing the order is to follow H's legal obligations, not actually achieve the result contemplated by the order. ↩︎
For example, H would exhibit signs of happiness when A continues. ↩︎
United States v. Bryan, 339 U.S. 323, 330 (1950). ↩︎
A principal may want its AI agents to be able to distinguish between genuine and coerced instructions, and to disobey the latter. Indeed, this might generally be a good thing, except for the case when compulsion is pursuant to law rather than extortion. ↩︎
See Appendix for further discussion. ↩︎
See generally Asset Protection Trust, Wex, https://www.law.cornell.edu/wex/asset_protection_trust (last visited Mar. 24, 2022); Richard C. Ausness, The Offshore Asset Protection Trust: A Prudent Financial Planning Device or the Last Refuge of A Scoundrel?, 45 Duq. L. Rev. 147, 174 (2007). ↩︎
See generally 2 Asset Protection: Dom. & Int'l L. & Tactics §§ 26:5–6 (2021). ↩︎
See id. ↩︎