I'm an early-career independent researcher who graduated in Economics from the University of Cambridge in 2019. I'm part of Modeling Cooperation, a team of independent researchers who work to build computational models and software tools for understanding the consequences of competition in transformative AI. We've previously investigated the consequences of a Windfall Clause in a model of AI Existential Safety (under review, see preprint on arXiv at: https://arxiv.org/abs/2108.09404). My current work focuses on building a model to explore policies to promote more resources for AI Safety research.
In October I'm starting a PhD at Teesside University on “Understanding dynamics of AI Safety development through behavioural and network modelling”.
Partially aligned transformative AIs are likely to be stable under reflection
I think this is very unlikely. I'm assuming continual learning becomes more important and, as Pachiardi et al. 2025 (https://cl-eval.github.io/) point out, the dynamics of continual learning have failure modes like chaos and hard-to-predict convergence. How agents learn values could depend a lot on their experiences. I can imagine partially aligned tAIs being set to work on many parts of the economy, where they could face many experiences in which moral principles are compromised for the sake of efficiency or profit. I'm not convinced armchair reflection by the AIs would prepare them for continual learning.
AI alignment to humans will in practice avoid moral catastrophes to digital minds
My reasons are similar to those I gave for the same poll about animals. Humans by default are likely to underweight the importance of digital minds (surveys suggest people will dismiss digital minds as not having a soul), so alignment to human preferences likely means respecting human preferences to use digital minds as machines for achieving their goals. It is perhaps easier than for animals because digital minds are likely to express themselves directly in ways humans could empathise with (but this could be buried if interactions are increasingly agent to agent).
Robust alignment requires alignment-relevant intervention during pretraining
Interpreting this as saying a necessary condition for robust alignment is training data that captures good values and discourages bad values. I think there's good evidence this matters lots for current systems so lean to agree. It's still plausible to me that robust alignment could be achieved with post-training interventions and relatively neutral pre-training setups.
AI alignment to humans will in practice avoid moral catastrophes to animals
Humans are currently very motivated to perpetuate moral catastrophes to animals. If AI alignment means aligned to the intent of their users, then AI systems help humans perpetuate moral catastrophes. If AI alignment is in terms of human moral preferences, then even a well-chosen mechanism for aggregating human preferences will select for speciesist values. There is a strong sense in which avoiding moral catastrophes to animals is usually misaligned with human preferences. Admittedly the same could be said of other moral issues such as attitudes towards outgroups and foreigners. There appears to be room in the current human alignment agenda for ensuring AI does not succumb to tribal prejudices, so there is likely scope for compatibility between the current alignment agenda and avoiding moral catastrophes to animals. However, it does not happen by default and, given how deep speciesism goes, it is likely much harder to avoid. Hence I still disagree with this poll as written.
This is super cool work, David and Zoe!
It's rare to see LLM games that contain this much structure (you have a discrete set of actions which update a world state, and even a bunch of shocks). The other thing I was impressed by is the three different LLM judges. Looking forward to seeing more visualisations.
I have a few questions.
Super cool! Great to see others digging into the costs of Agent performance. I agree that more people should be looking into this.
I'm particularly interested in predicting the growth of costs for Agentic AI safety evaluations, so I was wondering if you had any takes on this given this recent series. Here are a few more specific questions along those lines for you, Toby:
- Given the cost trends you've identified, do you expect the costs of running agents to take up an increasing share of the total costs of AI safety evaluations (including researcher costs)?
- Which dynamics do you think will drive how the costs of AI safety evaluations change over the next few years?
- Any thoughts on the conditions under which it would be better to elicit the maximum capabilities of models using a few very expensive safety evaluations, or better to prioritise a larger quantity of evaluations that get close to plateau performance (i.e. hitting that sweet spot where their hourly cost / performance is lowest, or alternatively their saturation point)? Presumably a mix is best, but how do we determine what a good mix looks like? What might you recommend to an AI Lab's Safety/Preparedness team? I'm thinking about how this might inform evaluation requirements for AI labs.
Many thanks for the excellent series! You have a knack for finding elegant and intuitive ways to explain the trends from the data. Despite knowing this data well, I feel like I learn something new with every post. Looking forward to the next thing.
Thanks for sharing this critique, Cullen.
I was curious about who would be the firm's opponent in this scenario, i.e. the actor trying to legally implement the Windfall Clause.
In a world where a Windfall of this order of magnitude is possible, I would anticipate a number of additional actors of somewhat comparable magnitude. I'd also expect states to have more wealth too (even if the AI company didn't pay tax, an AI advanced enough to generate Windfall profits is likely to grow the economy dramatically). If this were true, I might expect there to be incentives (or the possibility of providing incentives) for sufficiently wealthy states or other actors to use their resources to keep the legal offense-defence ratio more manageable.
That being said, I'm very uncertain about the above. There is certainly precedent for companies to become dramatically richer than some states. Moreover, states benefiting considerably from transformative AI may not necessarily see defending a Windfall Clause as a priority. Nevertheless, I do think there's merit in thinking carefully about what kind of actors might exist in a world where the Windfall Clause looks like it will soon trigger.
Great to see the Predict feature. I might have missed this when you first added it, but I've seen it now. It looks great and the tool is easy to use! I also like the additional changes you've made to make the site more polished. A friend and I had some issues when clicking the 'share' button, which I'll post as an issue on GitHub later.
Assuming multipolar worlds where humans retain control but loss of control risks are still real: Most models of AI tech races suggest strategic behavior competes away most future value, at least in the worst cases (Armstrong et al, The Han et al, Stafford et al, Emery-Xu et al, Jensen et al). While this is also true for unipolar scenarios (power concentration can lock in risks that eliminate most of the value of the future), multipolar worlds are unique in that even when the players internalise much of the risk, they race to the bottom on safety (see the traveller's dilemma or Armstrong et al's racing to the precipice). They can even escalate into destructive conflict if they feel especially threatened by their rivals (see the crisis bargaining literature, or, for a more optimistic take, Superintelligence Strategy).
If we assume the AI systems are in control and in competition with one another to achieve their own goals, then many of the above issues could be amplified by faster AI optimisation that may be more likely by default to neglect other values humans (and other beings) care about. On the other hand, sophisticated AI systems could establish coordination mechanisms with each other. This is also true of global powers, who could work to establish verification regimes for international AI Governance. It's not clear that AI systems would be better or worse than governments, but I lean towards worse by default.