This is a brief overview of the Center on Long-Term Risk (CLR)’s activities in 2023 and our plans for 2024. We hope to raise $770,000 to meet our target budget in 2024.
CLR works to reduce s-risks by addressing worst-case risks from the development and deployment of advanced AI systems. Our research primarily involves thinking about how to reduce conflict and promote cooperation in interactions involving powerful AI systems. In addition to research, we do a range of activities aimed at building a community of people interested in s-risk reduction, and support efforts that contribute to s-risk reduction via the CLR Fund.
Review of 2023
Our research in 2023 primarily fell into a few buckets:
Commitment races and safe Pareto improvements deconfusion. Many researchers in the area consider commitment races a potentially important driver of conflict involving AI systems. But we have been missing a precise understanding of the mechanisms by which they could lead to conflict. We believe we made significant progress on this over the last year. This includes progress on understanding the conditions under which an approach to bargaining called “safe Pareto improvements (SPIs)” can prevent catastrophic conflict.
Most of this work is non-public, but public documents that came out of this line of work include Open-minded updatelessness, Responses to apparent rationalist confusions about game / decision theory, and a forthcoming paper (see draft) & post on SPIs for expected utility maximizers.
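To give a rough intuition for SPIs, here is a toy illustration. This is not CLR's formal definition, and the specific game, numbers, and coin-flip fallback are invented for this sketch: in a one-shot demand game, incompatible demands normally trigger a costly conflict outcome, and an SPI-style commitment replaces that conflict outcome with one that is better for both players while leaving all other outcomes unchanged.

```python
# Toy sketch (illustrative only): a demand game over a surplus of 10.
# Without an SPI, incompatible demands yield a conflict payoff of (0, 0).
# With the SPI, both players commit that incompatible demands are instead
# resolved by a fair coin flip in which the winner receives their demand;
# compatible demands are handled exactly as before.

PIE = 10.0              # total surplus to divide
CONFLICT = (0.0, 0.0)   # payoff if demands are incompatible and no SPI is in place

def payoffs(demand_a, demand_b, spi=False):
    """Return expected payoffs (payoff_a, payoff_b) for the toy demand game."""
    if demand_a + demand_b <= PIE:
        return (demand_a, demand_b)   # compatible demands: each side gets its ask
    if not spi:
        return CONFLICT               # incompatible demands: costly conflict
    # SPI fallback: fair coin flip, winner gets their demand, loser gets 0.
    # Expected payoff for each player is half their demand.
    return (demand_a / 2, demand_b / 2)

# Both players aggressively demand 7 out of 10:
print(payoffs(7, 7))            # (0.0, 0.0)  -- conflict
print(payoffs(7, 7, spi=True))  # (3.5, 3.5)  -- Pareto improvement on conflict
print(payoffs(4, 6, spi=True))  # (4, 6)      -- compatible demands are unchanged
```

The key property the toy captures is that the commitment only changes what happens in the conflict branch, so neither player is made worse off in any case, which is what makes the improvement "safe".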
Paths to implementing surrogate goals. Surrogate goals are a special case of SPIs and we consider them a promising route to reducing the downsides from conflict. We (along with CLR-external researchers Nathaniel Sauerberg and Caspar Oesterheld) thought about how implementing surrogate goals could be both credible and counterfactual (i.e., not done by AIs by default), e.g., using compute monitoring schemes.
CLR researchers, in collaboration with Caspar Oesterheld and Filip Sondej, are also working on a project to “implement” surrogate goals/SPIs in contemporary language models.
Conflict-prone dispositions. We thought about the kinds of dispositions that could exacerbate conflict, and how they might arise in AI systems. The primary motivation for this line of work is that, even if alignment does not fully succeed, we may be able to shape AI systems' dispositions in coarse-grained ways that reduce the risks of worse-than-extinction outcomes. See our post on making AIs less likely to be spiteful.
Evaluations of LLMs. We continued our earlier work on evaluating cooperation-relevant properties in LLMs. Part of this involved cheap exploratory work with GPT-4 and Claude (e.g., looking at behavior in scenarios from the Machiavelli dataset) to see if there were particularly interesting behaviors worth investing more time in.
We also worked with external collaborators to develop “Welfare Diplomacy”, a variant of the Diplomacy game environment designed to be better for facilitating Cooperative AI research. We wrote a paper introducing the benchmark and using it to evaluate several LLMs.
Progress on s-risk community building was slow, due to the departures of our community building staff and funding uncertainties that prevented us from immediately hiring another Community Manager.
- We continued having career calls;
- We ran our fourth Summer Research Fellowship, with 10 fellows;
- We have now hired a new Community Manager, Winston Oswald-Drummond, who has just started.
Staff & leadership changes
We saw some substantial staff changes this year, with three staff members departing and two new members joining.
The organization was previously led by a group of “lead researchers” along with the Director of Operations. This year, we moved to a structure with a single Executive Director, who is me (Jesse Clifton). Amrit Sidhu-Brar became Director of Operations after Stefan Torges, who was previously in that role, left the organization.
Plans for 2024
We’ve accumulated a lot of research debt over the past year. The priority for our research in at least the first half of 2024 will be to translate internal research progress into shareable research outputs that take us closer to shovel-ready interventions. Specifically:
Overseer’s manual. For a while, one of our main plans for preventing conflict involving AI systems has been to develop an “overseer’s manual” for the overseers of advanced AI systems, with advice for preventing their systems from locking in catastrophic bargaining policies (see brief discussion here). Based on our research over the last year, we now think we’re in a position to write content for an overseer’s manual with basic recommendations (related to implementing SPIs and deferring decisions about thorny bargaining topics to wiser successors). We plan to draft this over the next few months, and try to get lots of input from external researchers with relevant knowledge.
Systematic evaluation of LLMs. Preventing AI systems from exhibiting spite or other conflict-conducive properties will require the ability to measure those properties, and an understanding of how different kinds of training affect those properties. It’s a bit unclear how much we can learn about the systems we actually want to intervene on from studying contemporary LLMs. But in any case, we’d like to start getting practice building out evaluation loops, getting feedback from alignment researchers, etc. so that we’re in a good place once more advanced models do come along.
Our priority in this area in Q1 is to write an empirical research agenda and hire researchers/research engineers to work on the agenda.
Beyond these two priority areas, we will continue to conduct more exploratory research.
Now that we’ve got a new Community Manager, our priority is to identify the most important and tractable bottlenecks for increasing quality-weighted work on s-risks, and come up with a strategy for addressing these.
Alongside new activities that come out of our strategy efforts, we plan to continue existing community building programs in 2024, such as the Summer Research Fellowship and career calls.
We are currently considering establishing a presence in the Bay Area, with some of our staff relocating to work there in order to take advantage of the greater concentration of AI research activity in the area and the opportunities it presents for collaboration. We expect to make a decision in this area in the first quarter of 2024 and, if we decide to go ahead, implement the move soon afterwards.
- We are currently $770k short of our target budget. This budget would allow us to hire more researchers, run a larger Summer Research Fellowship, and conduct more compute-intensive research.
- Our budget does not currently account for the possibility of opening a Bay Area office, which would involve substantial additional expenditure. We currently plan to fundraise for this separately if we decide to go ahead with the move, but additional donations at this stage would directly contribute to these expenses.
- To donate to CLR, please go to the Fundraiser page on our website. For frequently asked questions on donating to CLR, see here.
- We accept expressions of interest in research roles on a rolling basis. As mentioned, we will likely hire researchers and/or research engineers to focus on our empirical agenda in Q1 2024.
- Thinking about how to use your career to reduce s-risk? You can also register your interest in a career call.