What can the principal-agent literature tell us about AI risk?

ac

20 min read · Feb 10, 2020

Comments 1

cole_haus

This isn't looking at it from exactly the same angle as this post, but Incomplete Contracting and AI Alignment also looks at the alignment problem through the principal-agent lens:

We suggest that the analysis of incomplete contracting developed by law and economics researchers can provide a useful framework for understanding the AI alignment problem and help to generate a systematic approach to finding solutions. We first provide an overview of the incomplete contracting literature and explore parallels between this work and the problem of AI alignment. As we emphasize, misalignment between principal and agent is a core focus of economic analysis. We highlight some technical results from the economics literature on incomplete contracts that may provide insights for AI alignment researchers. Our core contribution, however, is to bring to bear an insight that economists have been urged to absorb from legal scholars and other behavioral scientists: the fact that human contracting is supported by substantial amounts of external structure, such as generally available institutions (culture, law) that can supply implied terms to fill the gaps in incomplete contracts. We propose a research agenda for AI alignment work that focuses on the problem of how to build AI that can replicate the human cognitive processes that connect individual incomplete contracts with this supporting external structure.

Thanks to Wei Dai for pointing out a previous inaccuracy ↩︎
Agency rents are about e.g. working vs shirking. If the agent uses the money she earned to buy a gun and later shoot the principal, clearly this is very bad for the principal, but it’s not captured by agency rents. ↩︎
It’s not totally clear to us why we should care about our fraction of influence over the future, rather than the total influence. Probably because the fraction of influence affects the total influence, influence being zero-sum and resources finite. ↩︎
It wasn’t clear to us from the original post, at least in Part 1 of the story with no conflict, that humans are better off in absolute terms. For instance, wording like “over time those proxies will come apart” and “People really will be getting richer for a while” seemed to suggest that things are expected to worsen. Given this, Hanson’s interpretation (that Christiano’s story implied massive agency rents) seems reasonable without further clarification. Ben Garfinkel mentioned an outside-view measure which he thought undermined the plausibility of Part 1: since the industrial revolution we seem to have been using more and more proxies, which are optimized for more and more heavily, but things have been getting better and better. So he also seems to have understood the scenario to mean things get worse in absolute terms. ↩︎
Clarifying what it means for an AI system to earn and use rents also seems important, helping us make sure that the abstraction maps cleanly onto the practical scenarios we are envisaging. Relatedly, what traits would an AI system need to have for it to make sense to think of the system as “accumulating and using rents”? Rents can be cashed out in influence of many different kinds (a human worker might get a higher wage, or more free time), and what ends up occurring will depend on the capabilities of the AI systems. Concretely, money can be saved in a bank account, people can be influenced, or computer hardware can be bought and run. One example of an obvious capability constraint for AI: some AI systems will be “switched off” after they are run, limiting their ability to transfer rents through time. As AI agents will (initially) be owned by humans, historical instances of slaves earning rents seem worth looking into. ↩︎
Although his scenario is more plausible if a smarter agent extracts more agency rents. ↩︎
Hanson and Christiano agree on this point. Hanson: “Just as most wages that slaves earned above subsistence went to slave owners, most of the wealth generated by AI could go to the capital owners, i.e. their slave owners. Agency rents are the difference above that minimum amount.” Christiano: “Agency rents are what makes the AI rich. It's not that computers would "become rich" if they were superhuman, and they just aren't rich yet because they aren't smart enough. On the current trajectory computers just won't get rich.” ↩︎
One limitation is that rents are the cost to the principal, whereas the accident scenario has costs for all humanity. This distinction isn’t especially important because in the accident scenario the outcome for the principal is catastrophic (i.e. extremely high agency rents), and this is what is potentially in tension with PAL. Nonetheless, we should keep in mind that the total costs of this scenario are not limited to agency rents, just as in Christiano’s scenario. ↩︎
Perhaps a more realistic framing: the principal is aware that there’s some probability that the agent will take an unanticipated catastrophic action, without knowing what that action might be. Under competitive pressures, maybe in a time of war, it could be beneficial for the principal to delegate (in expectation) despite significant risk, while humanity is made worse off (in expectation). This, of course, would be modelled quite differently to the accident AI risk we consider in the text, and we suspect that economic models would confirm that principals would take the risk in sufficiently competitive scenarios. These models would focus on negative externalities of risky AI development, something more naturally studied in domains like public economics rather than with agency theory. In any case, we focus here on the more traditional AI risk framing along the lines of “you think you have the AI under control, but beware, you could be wrong”. ↩︎
AI accident risk will be large when the AI agent thinks of new actions that i) harm the principal, ii) further the agent's goals, and iii) the principal hasn't anticipated. ↩︎
This is because claims about the actions available to the agent and the principal’s awareness are part of PAL models’ assumptions. We discuss this more below. ↩︎
The correct example: “If you prefer solving environmental problems, you might ask the machine to counter the rapid acidification of the oceans that results from higher carbon dioxide levels. The machine develops a new catalyst that facilitates an incredibly rapid chemical reaction between ocean and atmosphere and restores the oceans’ pH levels. Unfortunately, a quarter of the oxygen in the atmosphere is used up in the process, leaving us [humans] to asphyxiate slowly and painfully.” ↩︎
I.e. the principal’s rationality is bounded to a greater extent than the agent’s. ↩︎
In the model in “Moral Hazard With Unawareness”, either the principal’s or the agent’s rationality can be bounded. ↩︎
As argued above, we don’t think contract enforceability is the main reason Hanson’s critique of Christiano fails; agency rents are just not unusually high in his scenario. ↩︎
From Contract Theory: “The benchmark contracting situation that we shall consider in this book is one between two parties who operate in a market economy with a well-functioning legal system. Under such a system, any contract the parties decide to write will be enforced perfectly by a court, provided, of course, that it does not contravene any existing laws.” ↩︎
Thanks to Ben Garfinkel for pointing this out. ↩︎
Robin Hanson pointed out to us that when thinking about strange future scenarios, we should try to think about similar strange scenarios that we have seen in the past (we are very sympathetic to this, despite our somewhat skeptical position regarding PAL). With this in mind, another field which seems worth looking into is Security, especially military security. National leaders have been assassinated by their guards; kings have been killed by their protectors. These seem like a closer analogue to many AI risk scenarios than the typical PAL setup. It seems important to understand what the major risk factors are in these situations, how people have guarded against catastrophic failures, and how this translates to cases of catastrophic AI risk. ↩︎

Introduction

Summary

PAL and Christiano’s AI risk scenarios

PAL isn’t in tension with Christiano’s story and isn’t especially informative

Extending agency models seems promising for understanding the level of agency rents in Christiano’s scenario

PAL and AI risk from “accidents”

PAL doesn’t tell us much about AI risk from accidents

Extending agency models doesn’t seem promising for understanding AI risk from “accidents”

General difficulties with using PAL to assess AI risk

PAL models rarely consider weak principals and more capable agents[13]

PAL models are brittle

Agency rents are too narrow a measure

PAL models typically assume contract enforceability

PAL models typically assume AIs work for humans because they are paid

Conclusion