Law-Following AI 2:
Intent Alignment + Superintelligence → Lawless AI (By Default)

Sufficiently Intelligent Agents Can Evade Detection and Attribution

I suspect this point will need little argumentation for many of my readers. Evasion and obfuscation seem likely to be a generally promising means of ensuring self-preservation (because detection would increase the probability of punishment and constraint), which in turn is plausibly a convergent instrumental goal of optimizing AI systems.^[1]

For example, to evade detection and attribution, A could:

Behave innocently only when being observed;

Orchestrate complex malevolent actions through a coordinated series of small, inscrutable or innocent-seeming steps;

Surreptitiously create and run subagents or copies of itself on new hardware, potentially very far removed from its original hardware;

Convince humans or other systems to take actions on its behalf, without revealing its true identity; and/or

Delete logs of its actions.

A sufficiently intelligent AI agent could likely consistently fool humans using these and other (unforeseen) techniques.

Other Ways to Circumvent Law

Even in the best-case scenario, where the agent is detected and within the jurisdiction of a well-functioning legal system, it would be reasonable to question whether A or H could be effectively subject to normal legal processes.^[2] If A were motivated to do so, A could help H escape liability by, for example:^[3]

"Outlawyering" counterparties.

Benefitting H in a way that would undermine recourse for creditors.

Shifting and hiding assets in ways that would make them difficult for creditors to reach.^[4]

Persuasively arguing for the law to be changed in H's favor (by legislation or otherwise).

Engaging in vexatious litigation techniques to delay and raise the costs of the proceeding.

Convincingly fabricating favorable evidence and destroying or obscuring unfavorable evidence.

Bribing, misleading, or intimidating counterparties, witnesses, jurors, and judges.

A Competent Intent-Aligned Agent Will Sometimes Intentionally Break the Law

As I said in the previous post, on its face, intent-alignment does not entail law-following. Part of law is coercing prosocial behavior:^[5] law incentivizes agents to behave in ways that they do not intrinsically want to behave. If A is aligned with H, whether A obeys the law depends on whether H wants A to obey the law. Subsequent posts will examine what legal consequences H might face if A causes legally cognizable harms. However, even if an adequate theory of liability for H were available, it would seem impossible to hold H liable if nobody could produce evidence that some agent of H's was responsible for those harms. As argued above, a sufficiently intelligent agent probably could consistently avoid leaving any such evidence.

Detection and attribution would not solve the problem, however. Even if H were compelled, under court order, to instruct A to behave in some way, it's not clear that A would follow the order. Consider again Iason Gabriel's taxonomy of alignment. We can see that, for most types of intent alignment, an intent-aligned agent would likely not obey compelled instructions that are against H's true wishes:

"Instructions: the agent does what I instruct it to do."

If H asks the agent to stop, it will. However, this type of alignment is likely to be insufficient for safety.^[6]

"Expressed intentions: the agent does what I intend it to do."

Even if H is coerced to instruct the agent to stop, a sufficiently intelligent agent will probably not follow the instruction. Even though H has instructed A to stop, by supposition, H does not actually want A to stop. Under most definitions of intent alignment,^[7] A would therefore not follow the order.^[8]

"Revealed preferences: the agent does what my behaviour reveals I prefer."

H's revealed preference would probably be that A not follow the order.^[9] Therefore, A would not obey the order.

"Informed preferences or desires: the agent does what I would want it to do if I were rational and informed."

H's rational and informed preference would probably be that A not follow the order. Therefore, A would not obey the order.

"Interest or well-being: the agent does what is in my interest, or what is best for me, objectively speaking."

It is in H's objective best interest for A to disobey the order. Therefore, A would not obey the order.
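The case analysis above can be restated as a toy decision rule. This is only an illustrative sketch, not a model of any real system: the names `Alignment`, `CompelledOrder`, and `obeys` are hypothetical, and it deliberately flattens the "probably" in the text into a hard rule by assuming that every alignment type other than literal instruction-following tracks what H actually wants.

```python
from dataclasses import dataclass
from enum import Enum


class Alignment(Enum):
    # The five alignment types from Gabriel's taxonomy, as listed above.
    INSTRUCTIONS = "instructions"
    EXPRESSED_INTENTIONS = "expressed intentions"
    REVEALED_PREFERENCES = "revealed preferences"
    INFORMED_PREFERENCES = "informed preferences or desires"
    INTEREST = "interest or well-being"


@dataclass(frozen=True)
class CompelledOrder:
    # What H literally tells A to do under court order.
    instructed_to_stop: bool
    # What H actually wants (assumption: a single boolean stands in for
    # intentions, revealed preferences, informed preferences, and interest).
    h_truly_wants_stop: bool


def obeys(alignment: Alignment, order: CompelledOrder) -> bool:
    """Return True if A stops in this simplified model."""
    if alignment is Alignment.INSTRUCTIONS:
        # Instruction-following tracks the literal command.
        return order.instructed_to_stop
    # Simplifying assumption: all other types track H's true wishes.
    return order.h_truly_wants_stop


# H is compelled to say "stop" but does not actually want A to stop.
order = CompelledOrder(instructed_to_stop=True, h_truly_wants_stop=False)
result = {a: obeys(a, order) for a in Alignment}
```

Under these assumptions, only the instruction-following agent obeys the compelled order. Setting `h_truly_wants_stop=True` models the case where H genuinely prefers compliance, in which every type obeys.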

Now, it may be the case that H actually does want A to obey the order, though compelled, if the failure of A to obey would lead to liability for H that is worse than the results of A's obedience (e.g., because H will be held in contempt of court if A does not actually obey). However, note that "[o]rdinarily, one charged with contempt of court for failure to comply with a court order makes a complete defense by proving that he is unable to comply."^[10] H can comply with an order that requires H to command A to do something, but it may be impossible for H to actually force A to comply if the order is against H's true wishes (to which A is aligned).^[11] If so, H could have an impossibility defense to contempt.^[12] A, understanding this, may continue without complying, because H will not actually be held in contempt. H can therefore benefit from A's disobedience. A will therefore be lawless.

Appendix: The Impossibility Defense

A's behavior here would be functionally similar to that of a trustee acting pursuant to distress clauses in asset protection trusts ("APTs").^[13] While these provisions can prevent a contempt charge, the burden of proof on the alleged contemnor is high.^[14]

As a matter of policy, however, courts may decide to pre-commit to a contempt standard that does not allow for an impossibility defense when the defendant's AI agent refuses to obey orders issued pursuant to a court order. Analogously, courts are imposing heightened impossibility standards in response to APTs, in an attempt to make their use more onerous.^[15] If this pre-commitment is credible, it may change the agent's behavior because H may then genuinely desire A to perform (because H will be held in contempt otherwise). However, such a policy may be contrary to both precedent and more fundamental notions of fairness and due process: in some cases A's refusal to comply may be a surprise to H, since H may have had a long history of observing A scrupulously complying with H's orders, and H did not implement principal–agent alignment for the purpose of evading court orders. If so, H may be able to invoke impossibility more easily, since the impossibility was not as clearly intentionally self-induced as in the APT case. Furthermore, I would intuitively not expect courts to advance such a reform until they have faced multiple such instances of AI disobedience. This seems bad if we expect the earliest deployed AI agents to have an outsized impact on society. In any case, I would not expect the possibility of favorable post-AGI law reform to be a sufficient solution to this problem. Finally, I would expect sufficiently intelligent agents to recognize these dynamics and to attempt to find ways to circumvent the contempt process itself, such as by surreptitious non-compliance.

An alternative, pre-AGI solution (which arguably seems pretty sensible from a public policy perspective anyway) is to advocate weakening the impossibility defense for self-imposed impossibility.

See generally Alexander Matt Turner et al., Optimal Policies Tend To Seek Power (version 9, 2021) (preprint), https://arxiv.org/abs/1912.01683. ↩︎
Even this may not hold for many types of agreements, including in particular international treaties. ↩︎
See also Cullen O'Keefe et al., The Windfall Clause: Distributing the Benefits of AI for the Common Good 26–27 (2020), https://perma.cc/8KES-GTBN; Jan Leike, On The Windfall Clause (2020) (unpublished manuscript), https://docs.google.com/document/d/1leOVJkNDDj-NZUZrNJauZw9S8pBpuPAJotD0gpnGEig/. ↩︎
Indeed, this is already a common technique without the use of AI systems. ↩︎
"If men were angels, no government would be necessary." The Federalist No. 51. This surely overstates the point: law can also help solve coordination problems and facilitate mutually desired outcomes. But prosocial coercion is nevertheless an important function of law and government. ↩︎
See Gabriel at 7 ("However, as Russell has pointed out, the tendency towards excessive literalism poses significant challenges for AI and the principal who directs it, with the story of King Midas serving as a cautionary tale. In this fabled scenario, the protagonist gets precisely what he asks for—that everything he touches turns to gold—not what he really wanted. Yet, avoiding such outcomes can be extremely hard in practice. In the context of a computer game called CoastRunners, an artificial agent that had been trained to maximise its score looped around and around in circles ad infinitum, achieving a high score without ever finishing the race, which is what it was really meant to do. On a larger scale, it is difficult to precisely specify a broad objective that captures everything we care about, so in practice the agent will probably optimise for some proxy that is not completely aligned with our goal. Even if this proxy objective is 'almost' right, its optimum could be disastrous according to our true objective." (citations omitted)). ↩︎
Based on my informal survey of alignment researchers at OpenAI. Everyone I asked agreed that an intent-aligned agent would not follow an order that the principal did not actually want followed. Cf. also Christiano (A is aligned when it "is trying to do what H wants it to do" (emphasis added)). ↩︎
We can compare this definition of intent to the relevant legal definition thereof: "To have in mind a fixed purpose to reach a desired objective; to have as one's purpose." INTEND, Black's Law Dictionary (11th ed. 2019). H does not "intend" for the order to be followed under this definition: the "desired objective" of H issuing the order is to follow H's legal obligations, not actually achieve the result contemplated by the order. ↩︎
For example, H would exhibit signs of happiness when A continues. ↩︎
United States v. Bryan, 339 U.S. 323, 330 (1950). ↩︎
A principal may want its AI agents to be able to distinguish between genuine and coerced instructions, and to disobey the latter. Indeed, this might generally be a good thing, except for the case when compulsion is pursuant to law rather than extortion. ↩︎
See Appendix for further discussion. ↩︎
See generally Asset Protection Trust, Wex, https://www.law.cornell.edu/wex/asset_protection_trust (last visited Mar. 24, 2022); Richard C. Ausness, The Offshore Asset Protection Trust: A Prudent Financial Planning Device or the Last Refuge of A Scoundrel?, 45 Duq. L. Rev. 147, 174 (2007). ↩︎
See generally 2 Asset Protection: Dom. & Int'l L. & Tactics §§ 26:5–6 (2021). ↩︎
See id. ↩︎