*Over the past two months I have been lucky enough to carry out a part-time research project as part of **this** program with Vael Gates. This post serves both as a short summary of what I have found, my views on these findings, as well as a reflection on the project process as a whole. Many thanks to Vael for their support and guidance with this project.*

As the title suggests, in this project I focused on the application of dynamic game theory, in particular Stackelberg games, to the problem of cooperative commitment in multi-agent scenarios. This was motivated and inspired by the section on Commitment in Dafoe *et al.*'s *'**Open Problems in Cooperative AI**'*.

# Summary of the literature

## Open Problems in Cooperative AI

This is a broad article that covers a lot of ground in its attempt to synthesise relevant research topics from computer science, economics, cognitive science etc. into a coherent description of the topic of cooperative AI, that is; *'AI research trying to help individuals, humans and machines, find ways to improve their joint welfare.' *As part of this aim, Dafoe *et al.* classify four *'cooperative opportunities'* as a way of analysing cooperative scenarios in terms of their core characteristics of: whether the individuals have common or conflicting interests, whether the individuals are humans, machines, or organisations, whether the scenario is representative of the *'individual perspective'* or *'planner perspective'*, and the *'scope'* of the scenario. Furthermore, four crucial *'cooperative capabilities'* are also identified for their relevance for promoting cooperation. These are: understanding, communication, commitment, and institutions.

I chose to explore the third of these capabilities, *'commitment'*, which Dafoe *et al.* define as the ability to overcome *'commitment problems'*, that is, *'the inability to make credible threats or promises'*. They point to cases in the social sciences (such as war and conflict) where it has been argued that the inability to cooperate to mutual benefit stems from the inability to make credible commitments. Society has come up with a number of solutions (ranging from hard to soft) to such commitment problems including contracts (hard) and norms and reputation systems (soft).

Dafoe *et al.* also make explicit the connection to game theory through an example relating to the Prisoner's Dilemma. If one player could make a credible, observable commitment to play C *if and only if* the opponent does, then the dilemma is solved; it is strictly better for the opponent to play C over D.

C | D | |

C | 1, 1 | -0.5, 1.5 |

D | 1.5, -0.5 | 0, 0 |

In the main body of this section on commitment, Dafoe *et al.* discuss and classify solutions - theoretical ways of solving commitment problems - as well as commitment *devices* - methods through which solutions can be implemented in practice.

Commitment solutions can be classified according to two binary classifications; whether the solution is unilateral or multilateral, and whether the solution is conditional on other factors or not. This gives four categories:

- Unilateral unconditional solutions are those whereby an actor is able to individually and irreversibly commit to a certain course of action. The simplest way of doing this is just by taking a hard-to-reverse action.
- Unilateral conditional solutions are those where a single actor commits to a course of action
*conditional*on internal or external factors such as the actions of an adversary or the state of the wider environment. Conditional solutions can generally be more powerful than unconditional solutions, but this comes at a price of being harder to implement. - Multilateral unconditional solutions are those whereby a group of actors agree to strictly commit to certain actions. The success of commitment is dependent on the participation of all parties.
- Finally, multilateral conditional solutions (as one would expect by now) are those whereby a group of actors make a commitment that depends on some external state.

Finally is the discussion of devices for implementing solutions. These range on a continuous scale of 'hardness' depending on how strictly commitment is enforced. At one extreme, the 'hardest' devices make reneging on a commitment physically or logically impossible to carry out, whereas much weaker devices could come in the form of social norms whereby not following through on a commitment may not have explicit costs or consequences, but could otherwise harm the agent through e.g., a loss of reputation making collaboration more difficult in the future. In between these two examples, one could find devices such as legal contracts that aim to impose costs on breaking agreements.

## Stackelberg Games

A Stackelberg game (or Stackelberg competition) is a form of dynamic game introduced by Heinrich Freiherr von Stackelberg in 1934. In the classical formulation a two-player game is between a *Leader* and a *Follower* (though extensions have been made with multiple followers) whereby the Leader plays first, selecting her strategy with the knowledge that the Follower is able to observe her choice. The Follower then selects her strategy given the knowledge of the Leader's. As an example of how such a dynamic game setup can lead to differing play when compared to standard simultaneous-move game, consider the game given below (From Conitzer & Sandholm (2006)).

C | D | |

A | 2, 1 | 4, 0 |

B | 1, 0 | 3, 2 |

Note that, for the row player, playing A strictly dominates B so in a simultaneous-move situation the optimal strategy would be to always play A. However, if we now take the row player to be a Leader in a Stackelberg game, it can be shown that the *Stackelberg optimal *strategy for the leader is a mixed strategy of A and B each with probability 0.5. In this case, in order to maximise her reward, the Follower will do best to select a strategy of always playing D, giving the Leader an expected reward of 3.5. It has been shown that this ability to select a strategy before her opponent results in the Leader of a Stackelberg game receiving *at least* as much payoff compared to if the same game is played simultaneously. In this sense, the Leader of Stackelberg game can be considered to have a tangible first-move advantage.

More recently, attention has been paid to the practical problem of computing optimal strategies in Stackelberg games. In 2006, Conitzer and Sandholm presented a number of results relating to such problems. In particular, they present efficient algorithms for computing optimal strategies in a number of simple game formulations: normal-form (non-Bayesian), pure-strategy games with any number of players; Bayesian, pure-strategy games with only two players and the Leader has only one type; and normal-form Bayesian games with only two players. For other game setups (varying whether the game is normal-form or Bayesian, the number of players, and whether or not mixed strategies are considered) it was shown that computing an optimal strategy is in fact NP-hard.

## Recent work applying Stackelberg games to AI and robotics

While there is a large corpus of work applying Stackelberg games to settings of security, I aimed to focus more narrowly on applications to the fields of AI and robotics.Below I summarise some recent papers that address this intersection.

In *'**Stackelberg Punishment and Bully-Proofing Autonomous Vehicles**'*, Cooper *et al.* introduce the notion of *Stackelberg Punishment*. They note that punishment in repeated games, whereby one player may intentionally select a strategy that reduces the payoff for their opponent at the cost of a reduction in their own payoff, can lead to mutually improved outcomes if it incentivises the opponent to pick an action that is beneficial to both players. *Stackelberg punishment* is then defined as a punishment strategy that sufficiently punishes the opponent according to some predefined target level, whilst minimising the cost to the punisher. They then adapt established algorithms for computing Stackelberg equilibria to instead compute a Stackelberg punishment, and demonstrate its effectiveness on a model driving scenario. In this toy problem, two vehicles are approaching a narrow bridge from opposite directions, one autonomous, one controlled by a human subject. Since the bridge is only wide enough for one vehicle, one must let the other pass first, following the usual right of way rules that the car that is closer to the bridge has priority. However, the vehicle that is farther from the bridge is able to ‘bully’ the other by continuing onto the bridge, despite not having priority, forcing the opponent to yield. In cases where the human subject ‘bullies’ the AI, the game was setup so that in the following round the AI would punish the human according to the optimal Stackelberg punishment. This experiment was carried out with the intention of exploring ways to prevent autonomous vehicles from being overly cautious when interacting with human drivers.

Secondly, in *'**Cooperative Control of Mobile Robots with Stackelberg Learning**'*, Koh *et al. *propose an algorithm which they name *Stackelberg Learning in Cooperative Control *(SLiCC) that promotes the learning of cooperative behaviour between a pair of robots tasked with carrying an item that is too large for either of them to transport individually. To achieve this cooperative behaviour, they model the task as a *Partially Observable Stochastic Game* (POSG) which is decomposed into a set of Stackelberg games at each timestep. Deep reinforcement learning is then used to learn the payoff matrices for these games.

In *'**Effective Solutions for Real-World Stackelberg Games: When Agents Must Deal with Human Uncertainties**'*, Pita *et al.* address the implicit assumptions in Stackelberg games that the Follower is a rational agent with complete information, and thus reliably decides a strategy that maximises her payoff. They note that in practice, human players are unlikely to act in this way either due to bounded rationality or limitations in observing the Leader's strategy and suggest extensions to a given Stackelberg equilibrium algorithm (*DOBSS*, introduced by Paruchuri *et al*.) that take into account these Follower characteristics. They show not only that their extended algorithms perform better than the baseline algorithm in cases where the follower does play suboptimally, but also that they achieve this in comparable runtimes.

# My thoughts

## Differing notions of 'commitment'

One thing that stood out to me when exploring the above papers (and wider literature of Stackelberg games) is that the term 'commitment' is used in a slightly different way than Dafoe *et al.* in *'Cooperative AI'*. In Stackelberg games, 'commitment' is used to refer to the fact that the Leader selects a strategy before the Follower, thus 'committing' to this strategy, *unconditional* on the Follower's choice. While this does correspond to a 'unilateral, unconditional' commitment in Dafoe's terminology, I found the formulation of a Stackelberg less enlightening than expected in terms of 'solving' the commitment problem of ensuring that an agent does indeed carry out the strategy that it commits to. As an example, reconsider the example Stackelberg game presented above. We saw that the optimal strategy for the Leader to commit to in this case is a mixed strategy of playing A or B each with probability 0.5. However, there is still a commitment problem here. Once the Follower selects her strategy of always playing D in order to maximise her payoff, what is stopping the Leader from reneging on her commitment. At this point in the game it is now even more beneficial for the Leader to *always *play A, ensuring a payoff of 4. As far as I have seen, the consideration of how to ensure that the Leader does indeed play its chosen strategy in such cases is not addressed in the Stackelberg game literature.

## Focus on 'human-follows-agent'

Secondly, this, admittedly rather niche, area of research seems to go against the grain by considering situations in which the agent is given the role of a leader. In all examples I saw, the agent plays the role of Stackelberg Leader whereas a human or other agent is the Follower, giving the agent a strategic advantage. However, when viewed at a sufficiently high level, most of the other human-AI interaction literature focuses on how to give an agent the skills to follow a responsible human - a very basic example is a human-specified reward function and the challenge of ensuring that it produces the desired behaviour in an RL agent. Inverse reinforcement learning is perhaps an even clearer example, where the goal is to have an agent learn from observation of desired behaviour, most commonly from a human demonstrator. For me, having an agent follow and respond to a human overseer or instructor feels like a much safer bet from an intuitive alignment perspective.

## Is game theory that applicable?

Finally, despite the advances presented in the papers discussed above, I find myself questioning the magnitude of game theory's utility in addressing issues such as AI cooperation and alignment. Game theory takes player payoffs as axiomatic, using these given facts to explore courses of action and how desirable they are for each player. However, if we focus on questions of human-agent cooperation we eventually have to face the problem of encoding complex human utilities into the scalar values that game theory works with. It is not inconceivable that this is a harder problem than those dealt with within game theory, potentially drastically reducing game theory's potency in solving problems of human-agent cooperation.

Furthermore, I have developed the intuition that game theory is ill-equipped to deal with more complex cooperative scenarios, and may quickly become unwieldy due to its rather rigid structure. This is perhaps backed-up by Conitzer and Sandholm's computational hardness results discussed above. In the case of Stackelberg games in particular, computation of an optimal strategy is NP-hard for all but a few of the simplest game formulations, with restrictions on the number of players, whether the game is normal-form or Bayesian, and whether mixed strategies are considered. I would like to stress that this is still very much an intuition and I unfortunately have not had enough time to fully explore these questions.

# Reflection on the project

In addition to giving me the opportunity and incentive to dive into and learn about topics outside my current focus, completing this project has taught me plenty about my own work practices as well as some of the challenges of carrying out individual research.

## Finding a specific enough question is hard

This project went through quite a few iterations, having begun life as focusing on the comparison of near- and long-term problems and approaches in AI safety. Initially I followed a trail of thought that led me to explore topics in computational social choice theory and its applications to AI. While an interesting subject in its own right, this felt as though it was not going in a very productive direction, so I changed my focus to the Cooperative AI paper. Over time, this was narrowed down to the commitment problem, where the aim of the project would be to conduct a literature review of commitment in AI, using the section in the Cooperative AI paper as a launchpad. After realising that this was still a giant body of literature, I finally zoomed in further on Stackelberg games.

This initial exploration, changing of direction, and delayed narrowing down on a realistic-sized focus area took up quite a bit of time, and meant that I didn't have as much remaining time to read the relevant literature. I imagine that this 'problem formulation' period of research is something that I still underestimate, but I nonetheless feel that I should be able to improve on my own ability of finding tractable questions. Hopefully next time I do a similar project, I will be able to find a specific research question more efficiently in order to afford more time on actually exploring and answering it.

## It's ok to present negative results

A final thought on this project is that it has allowed me to become more comfortable with accepting that a research project being successful is not the same thing as confirming your hypothesis or having a breakthrough revelation. In this case, having explored existing and potential connections between Stackelberg games and the commitment problem, I found myself putting less weight on my initial intuition that there would be a good amount of scope for useful connections between these topics. In spite of this, I would say that this still counts as a success. Oftentimes, starting with an intuition and weakening it through further research can be as useful and instructive as strengthening an intuition through research, and this is a notion that I feel I need to embrace further.

Thanks for sharing! I wonder how relevant AI-AI interactions are for broader AI Safety and if those might be better captured with current game theory.