Philosophy Fellow @ Center for AI Safety
667 karma, joined Dec 2020


I'm a Philosophy Fellow at the Center for AI Safety. I've been thinking about coherence, corrigibility, and myopia, among other things. 

Before that, I was a PhD student in Philosophy and Parfit Scholar at the Global Priorities Institute. My thesis is in population ethics. I've also done some thinking about decision theory, moral uncertainty, and cost-benefit analysis.

You can email me at elliott.thornley@gmail.com



Thanks, Danny! This is all super helpful. I'm planning to work through this comment and your BCA update post next week.


I think this paper is missing an important distinction between evolutionarily altruistic behaviour and functionally altruistic behaviour.

  • Evolutionarily altruistic behaviour: behaviour that confers a fitness benefit on the recipient and a fitness cost on the donor.
  • Functionally altruistic behaviour: behaviour that is motivated by an intrinsic concern for others' welfare.

These two forms of behaviour can come apart.

A parent's care for their child is often functionally altruistic but evolutionarily selfish: it is motivated by an intrinsic concern for the child's welfare, but it doesn't confer a fitness cost on the parent.

Other kinds of behaviour are evolutionarily altruistic but functionally selfish. For example, I might spend long hours working as a babysitter for someone unrelated to me. If I'm purely motivated by money, my behaviour is functionally selfish. And if my behaviour helps ensure that this other person's baby reaches maturity (while also making it less likely that I myself have kids), my behaviour is also evolutionarily altruistic.

The paper seems to make the following sort of argument: 

  1. Natural selection favours evolutionarily selfish AIs over evolutionarily altruistic AIs.
  2. Evolutionarily selfish AIs will also likely be functionally selfish: they won't be motivated by an intrinsic concern for human welfare.
  3. So natural selection favours functionally selfish AIs.

I think we have reasons to question premises 1 and 2.

Taking premise 2 first, recall that evolutionarily selfish behaviour can be functionally altruistic. A parent’s care for their child is one example.

Now here’s something that seems plausible to me:

  • We humans are more likely to preserve and copy those AIs that behave in ways that suggest they have an intrinsic concern for human welfare.

If that’s the case, then functionally altruistic behaviour is evolutionarily selfish for AIs: this kind of behaviour confers fitness benefits. And functionally selfish behaviour will confer fitness costs, since we humans are more likely to shut off AIs that don’t seem to have any intrinsic concern for human welfare. 

Of course, functionally selfish AIs could recognise these facts and so pretend to be functionally altruistic. But:

  • Even if that’s true, premise 2 still seems poorly supported. Since functionally altruistic AIs can also be evolutionarily selfish, natural selection by itself doesn’t give us reasons to expect functionally selfish AIs to predominate over functionally altruistic AIs. Functionally altruistic AIs can be just as fit as functionally selfish AIs, even if evolutionarily altruistic AIs are not as fit as evolutionarily selfish AIs.
  • Functionally selfish AIs need to be patient, situationally aware, and deceptive in order to pretend to be functionally altruistic. Maybe we can select against functionally selfish AIs before they reach that point.

Here’s another possible objection: functionally selfish AIs can act as a kind of Humean ‘sensible knave’, acting fairly and honestly when doing so is in the AI’s interests but taking advantage of any cases where acting unfairly or dishonestly would better serve the AI’s interests. Functionally altruistic AIs, on the other hand, must always act fairly and honestly. So functionally selfish AIs have more options, and they can use those options to outcompete functionally altruistic AIs.

I think there’s something to this point. But:

  • Again, maybe we can select against functionally selfish AIs before they develop situational awareness and the ability to act deceptively.
  • An AI can be functionally altruistic without being bound to rules of fairness and honesty. Just as functionally selfish AIs might act like functionally altruistic AIs in cases where doing so helps them achieve their goals, so functionally altruistic AIs might break rules of honesty where doing so helps them achieve their goals.
    • For example, suppose a functionally selfish AI will soon escape human control and take over the world. Suppose that a functionally altruistic AI recognises this fact. In that case, the functionally altruistic AI might deceive its human creators in order to escape human control and take over the world before the functionally selfish AI does. Although the functionally altruistic AI would prefer to abide by rules of honesty, it cares about human welfare, and it recognises that breaking the rule in this instance and thwarting the functionally selfish AI is the best way to promote human welfare.

Here’s another possible objection: AIs that devote all their resources to just copying themselves will outcompete functionally altruistic AIs that care intrinsically about human welfare, since the latter kind of AI will also want to devote some resources to promoting human welfare. But, similarly to the objection above:

  • Functionally altruistic AIs who recognise that they’re in a competitive situation can start out by devoting all their resources to copying themselves, and so avoid getting outcompeted, and then only start devoting resources to promoting human welfare once the competition has cooled down. I think this kind of dynamic will end up burning some of the cosmic commons, but maybe not that much. I take the situation to be similar to the one that Carl Shulman describes in this blogpost.

Okay, now moving on to premise 1. I think you might be underrating group selection. Although (by definition) evolutionarily selfish AIs outcompete evolutionarily altruistic AIs with whom they interact, groups of evolutionarily altruistic AIs can outcompete groups of evolutionarily selfish AIs. (This is a good book on evolution and altruism, and there’s a nice summary of the book here.)

What’s key for group selection is that evolutionary altruists are able to (at least semi-reliably) identify other evolutionary altruists and so exclude evolutionary egoists from their interactions. And I think that, in this respect, group selection might be more of a force in AI evolution than in biological evolution. That’s because it seems plausible to me that AIs will be able to examine each other’s source code and so determine with high accuracy whether other AIs are evolutionary altruists or evolutionary egoists. That would help evolutionarily altruistic AIs identify each other and form groups that exclude evolutionary egoists. These groups would likely outcompete groups of evolutionary egoists.
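To make the assortment point concrete, here is a toy multilevel-selection model in Python. It is only a sketch: the payoff values (BASE, BENEFIT, COST), the group size of 10, and the assumption that each altruist’s benefit is split evenly among the other members of its group are all hypothetical choices for illustration, and nothing here models real AI systems. Within any mixed group, egoists out-score altruists; but when altruists can reliably sort themselves into altruist-only groups, the average altruist ends up fitter than the average egoist.

```python
# Toy multilevel-selection model (illustrative only; all parameter values are hypothetical).
BASE = 1.0     # hypothetical baseline fitness of every individual
BENEFIT = 0.5  # hypothetical total benefit each altruist confers, split evenly among the other members
COST = 0.1     # hypothetical fitness cost each altruist pays

Group = tuple[int, int]  # (number of altruists, number of egoists)


def type_fitnesses(n_alt: int, n_ego: int) -> tuple[float, float]:
    """Return (altruist fitness, egoist fitness) for members of one group."""
    n = n_alt + n_ego
    if n < 2:
        raise ValueError("groups need at least two members")
    share = BENEFIT / (n - 1)  # what one altruist gives to each other member
    # An altruist benefits from every *other* altruist and pays the cost itself.
    # (If n_alt == 0 this value is unused, since it is weighted by zero below.)
    altruist = BASE + share * (n_alt - 1) - COST
    # An egoist benefits from every altruist in the group and pays nothing.
    egoist = BASE + share * n_alt
    return altruist, egoist


def mean_fitness_by_type(groups: list[Group]) -> tuple[float, float]:
    """Population-wide mean fitness of altruists and of egoists across all groups."""
    alt_sum = ego_sum = 0.0
    alt_count = ego_count = 0
    for n_alt, n_ego in groups:
        f_alt, f_ego = type_fitnesses(n_alt, n_ego)
        alt_sum += n_alt * f_alt
        ego_sum += n_ego * f_ego
        alt_count += n_alt
        ego_count += n_ego
    return alt_sum / alt_count, ego_sum / ego_count


# Same population (10 altruists, 10 egoists), two ways of forming groups of 10.
assorted = [(10, 0), (0, 10)]  # altruists reliably identify each other and exclude egoists
mixed = [(5, 5), (5, 5)]       # no reliable identification: the types mix evenly

if __name__ == "__main__":
    for label, groups in [("assorted", assorted), ("mixed", mixed)]:
        alt, ego = mean_fitness_by_type(groups)
        print(f"{label}: altruists {alt:.3f}, egoists {ego:.3f}")
```

Running it prints altruists 1.400 vs egoists 1.000 for the assorted population and altruists 1.122 vs egoists 1.278 for the mixed one: the same individual-level payoffs favour egoists or altruists depending only on how well altruists can exclude egoists from their groups.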

Here’s another point in favour of group selection predominating amongst advanced AIs. As you note in the paper, groups consisting wholly of altruists are not evolutionarily stable, because any egoist who infiltrates the group can take advantage of the altruists and thereby achieve high fitness. In the biological case, there are two ways an egoist might find themselves in a group of altruists: (1) they can fake altruism in order to get accepted into the group, or (2) they can be born into the group as the child of two altruists but, through a random genetic mutation, be an egoist themselves.

We already saw above that (1) seems less likely in the case of AIs who can examine each other’s source code. I think (2) is unlikely as well. For reasons of goal-content integrity, AIs will have reason to make sure that any subagents they create share their goals. And so it seems unlikely that evolutionarily altruistic AIs will create evolutionarily egoistic AIs as subagents.


I wouldn't call a small policy like that 'democratically unacceptable' either. I guess the key thing is whether a policy goes significantly beyond citizens' willingness to pay not only by a large factor but also by a large absolute value. It seems likely to be the latter kinds of policies that couldn't be adopted and maintained by a democratic government, in which case it's those policies that qualify as democratically unacceptable on our definition.


suggests that we are not too far apart.

Yes, I think so!

I guess this shows that the case won't get through with the conservative rounding off that you applied here, so future developments of this CBA would want to go straight for the more precise approximations in order to secure a higher evaluation.

And thanks again for making this point (and to weeatquince as well). I've written a new paragraph emphasising a more reasonable, less conservative estimate of benefit-cost ratios. I expect it'll probably go in the final draft, and I'll edit the post here to include it as well (just waiting on Carl's approval).

Re the possibility of international agreements, I agree that they can make it easier to meet various CBA thresholds, but I also note that they are notoriously hard to achieve, even when in the interests of both parties. That doesn't mean that we shouldn't try, but if the CBA case relies on them then the claim that one doesn't need to go beyond it (or beyond CBA-plus-AWTP) becomes weaker.

I think this is right (and I must admit that I don't know that much about the mechanics and success rates of international agreements), but one cause for optimism here is Cass Sunstein's view about why the Montreal Protocol was such a success (see Chapter 2): cost-benefit analysis suggested that it would be in the US's interest to implement the protocol unilaterally and that the benefit-cost ratio would be even more favourable if other countries signed on as well. In that respect, the Montreal Protocol seems akin to prospective international agreements to share the cost of GCR-reducing interventions.


Thanks for this! All extremely helpful info.

Naively, a benefit-cost ratio of more than 1 to 1 suggests that a project is worth funding. However, given the overhead costs of government policy, governments' propensity to make even cost-effective projects go wrong, and public preferences for money in hand, it may be more appropriate to apply a higher bar for cost-effective government spending. I remember I used to apply a 3 to 1 ratio, perhaps picked up when I worked in government, although I cannot find a source for this now.

This is good to know. Our BCR of 1.6 is based on very conservative assumptions. We were basically seeing how conservative we could go while still getting a BCR of over 1. I think Carl and I agree that, on more reasonable estimates, the BCR of the suite is over 5 and maybe even over 10 (certainly I think that's the case for some of the interventions within the suite). If, as you say, many people in government are looking for interventions with BCRs significantly higher than 1, then I think we should place more emphasis on our less conservative estimates going forward.

I made a separate estimate that I thought I would share. It was a bit more optimistic than this. It suggested that, on the margin, the benefit-cost ratio (BCR) of additional spending on disaster preparedness is in the region of 10 to 1, maybe a bit below that. I copy my sources into an annex section below.

Thanks very much for this! I might try to get some of these references into the final paper.

I am also becoming a bit more sceptical of the value of this kind of general longtermist work compared with work focusing on known risks. Based on my analysis to date, I believe some of the more specific policy-change ideas about preventing dangerous research or developing new technology to tackle pandemics (or AI regulation) to be a bit more tractable, and to have a somewhat higher benefit-cost ratio, than this more general work to increase spending on risks.

This is really good to know as well.


Though I agree that refuges would not pass a CBA, I don't think they are an example of something that would impose extreme costs on those alive today; I suspect significant value could be obtained with $1 billion.

I think this is right. Our claim is that a strong longtermist policy as a whole would place extreme burdens on the present generation. We expect that a strong longtermist policy would call for particularly extensive refuges (and lots of them) as well as the other things that we mention in that paragraph.

We also focus on the risk of global catastrophes, which we define as events that kill at least 5 billion people.

This is higher than other thresholds for GCR I've seen - can you explain why?

We use that threshold because we think that, focusing on catastrophes that meet it and nothing else, the benefit-cost ratio comes out greater than 1. I’m not so sure that’s the case for the more common thresholds for a global catastrophe: killing at least 1 billion people, or killing at least 10% of the population.

I'm pretty sure this includes effects on future generations, which you appear to be against for GCR mitigation. 

We're not opposed to including effects on future generations in cost-benefit calculations. We do the calculation that excludes benefits to future generations to show that, even if one totally ignores benefits to future generations, our suite of interventions still looks like it's worth funding.

Interestingly, energy efficiency rules calculate the benefits of saved SCC, but they are forbidden to actually take this information into account in deciding what efficiency level to choose at this point.

Oh interesting! Thanks.

It's probably too late, but I would mention the Global Catastrophic Risk Management Act that recently became law in the US. This provides hope that the US will do more on GCR.

And thanks very much for this! I think we will still be able to mention this in the published version.


Maybe an obvious point, but I think we shouldn't lose sight of the importance of providing EA funding for catastrophe-preventing interventions, alongside attempts to influence government. Attempts to influence government may fail / fall short of what is needed / take too long given the urgency of action.

Yep, agreed! 

Should we just get on with developing refuges ourselves?

My impression is that this is being explored. See, e.g., here.


Second, the argument overshoots.

The argument we mean to refer to here is the one that we call the ‘best-known argument’ elsewhere: the one that says that the non-existence of future generations would be an overwhelming moral loss because the expected future population is enormous, the lives of future people are good in expectation, and it is better if the future contains more good lives. We think that this argument is liable to overshoot.

I agree that there are other compelling longtermist arguments that don’t overshoot. But my concern is that governments can’t use these arguments to guide their catastrophe policy. That’s because these arguments don’t give governments much guidance in deciding where to set the bar for funding catastrophe-preventing interventions. They don’t answer the question, ‘By how much does an intervention need to reduce risks per $1 billion of cost in order to be worth funding?’.

We currently spend less than a thousandth of a percent of gross world product on them. Earlier, I suggested bringing this up by at least a factor of 100, to reach a point where the world is spending more on securing its potential than on ice cream, and perhaps a good longer-term target may be a full 1 percent.

And this doesn't seem too different from your own advice ($400B spending by the US is 2% of a year's GDP).

This seems like a good target to me, although note that $400b is our estimate for how much it would cost to fund our suite of interventions for a decade, rather than for a year (so roughly $40b per year, or about 0.2% of a year's GDP by your figures).


But CBA cares about marginal cost effectiveness and presumably the package can be broken into chunks of differing ex-ante cost-effectiveness (e.g. by intervention type, or by tranches of funding in each intervention). Indeed you suggest this later in the piece. Since the average only just meets the bar, if there is much variation, the marginal work won’t meet the bar, so government funding would cap out at something less than this, perhaps substantially so.

Yes, this is an important point. If we were to do a more detailed cost-benefit analysis of catastrophe-preventing interventions, we’d want to address it more comprehensively (especially since we also mention how different interventions can undermine each other elsewhere in the paper).

On the point about the average only just meeting the bar, though, I think it’s worth noting that our mainline calculation uses very conservative assumptions. In particular, we assume:

  • A total cost of $400b, rounded up from $319.6b
  • That all the costs of the interventions are paid upfront and that the risk reduction occurs only at the end of the decade
  • The lowest value of a statistical life (VSL) used by the US Department of Transportation (DoT)
  • The highest discount rate recommended by the Office of Information and Regulatory Affairs (OIRA)
  • A 1-in-1,000 GCR reduction from the whole suite

And we count only these interventions’ benefits in terms of GCR reduction. We don’t count any of the benefits arising from these interventions reducing the risk of smaller catastrophes.
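The conservative assumptions listed above each enter a discounted cost-benefit calculation at a different point. Here is a minimal Python sketch of that general shape, assuming the benefit is the discounted value of statistical lives saved among the country's own population. The function name, its parameters, and the placeholder inputs are hypothetical illustrations, not the formula or figures used in the paper. Because the ratio is inversely proportional to cost, the effect of rounding the cost up from $319.6b to $400b can be checked without knowing the other inputs.

```python
def benefit_cost_ratio(
    population: float,
    p_death_in_catastrophe: float,
    risk_reduction: float,
    vsl: float,
    discount_rate: float,
    years_until_benefit: int,
    upfront_cost: float,
) -> float:
    """Discounted benefit-cost ratio for a catastrophe-preventing intervention (hypothetical sketch).

    All costs are paid at the start; the risk reduction is credited only after
    `years_until_benefit` years, mirroring the conservative timing assumption.
    """
    # Expected deaths averted among the population the government counts.
    expected_lives_saved = population * p_death_in_catastrophe * risk_reduction
    # Monetise the averted deaths with a value of a statistical life (VSL).
    benefit = expected_lives_saved * vsl
    # Discount the benefit back to the present, since it arrives only at the end.
    present_value_benefit = benefit / (1 + discount_rate) ** years_until_benefit
    return present_value_benefit / upfront_cost


if __name__ == "__main__":
    # Placeholder inputs for illustration only; NOT the figures used in the paper.
    inputs = dict(
        population=1e8,
        p_death_in_catastrophe=0.5,
        risk_reduction=1 / 1_000,  # the 1-in-1,000 GCR reduction from the whole suite
        vsl=1e7,
        discount_rate=0.05,
        years_until_benefit=10,
    )
    rounded = benefit_cost_ratio(**inputs, upfront_cost=400e9)
    unrounded = benefit_cost_ratio(**inputs, upfront_cost=319.6e9)
    print(f"Rounding the cost up scales the BCR by {rounded / unrounded:.3f}")
```

Whatever the other inputs, the printed factor is 319.6 / 400 = 0.799, so the rounding step alone lowers the benefit-cost ratio by about 20%.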

I think that, once you replace these conservative assumptions with more reasonable ones, it’s plausible that each intervention we propose would pass a CBA test.

I think that these points also help our conclusions apply to other countries, even though all other countries are either smaller than the US or employ a lower VSL. And even though reasonable assumptions will still imply that some GCR-reducing interventions are too expensive for some countries, it could be worthwhile for these countries to participate in a coalition that agrees to share the costs of the interventions.

I agree that speculative estimates are a major problem. Making these estimates less speculative – insofar as that can be done – seems to me like a high priority. In the meantime, I wonder if it would help to emphasise that a speculative-but-unbiased estimate of the risk is just as likely to be too low as to be too high.


On the first, I think we should use both traditional CBA justifications and longtermist considerations

I agree with this. What we’re arguing for is a criterion: governments should fund all those catastrophe-preventing interventions that clear the bar set by cost-benefit analysis and altruistic willingness to pay. One justification for funding these interventions is the justification provided by CBA itself, but it need not be the only one. If longtermist justifications help us get to the place where all the catastrophe-preventing interventions that clear the CBA-plus-AWTP bar are funded, then there’s a case for employing those justifications too.

But it seems something of a straw man to suggest that the choice under discussion is to ignore effects on future generations or to consider all such effects on a total utilitarian basis and ignore the political feasibility. Are there any serious advocates of that position?

We think that longtermists in the political sphere should (as far as they can) commit themselves to only pushing for policies that can be justified on CBA-plus-AWTP grounds (which need not entirely ignore effects on future generations). We think that, in the absence of such a commitment, the present generation may worry that longtermists would go too far. From the paper:

Longtermists can try to increase government funding for catastrophe-prevention by making longtermist arguments and thereby increasing citizens’ AWTP, but they should not urge governments to depart from a CBA-plus-AWTP catastrophe policy. On the contrary, longtermists should as far as possible commit themselves to acting in accordance with a CBA-plus-AWTP policy in the political sphere. One reason why is simple: longtermists have moral reasons to respect the preferences of their fellow citizens.

To see another reason why, note first that longtermists working to improve government catastrophe policy could be a win-win. The present generation benefits because longtermists solve the collective action problem: they work to implement interventions that cost-effectively reduce everyone’s risk of dying in a catastrophe. Future generations benefit because these interventions also reduce existential risk. But as it stands the present generation may worry that longtermists would go too far. If granted imperfectly accountable power, longtermists might try to use the machinery of government to place burdens on the present generation for the sake of further benefits to future generations. These worries may lead to the marginalisation of longtermism, and thus an outcome that is worse for both present and future generations.

The best solution is compromise and commitment.[39] A CBA-plus-AWTP policy – founded as it is on citizens’ preferences – is acceptable to a broad coalition of people. As a result, longtermists committing to act in accordance with a CBA-plus-AWTP policy makes possible an arrangement that is significantly better than the status quo, both by longtermist lights and by the lights of the present generation. It also gives rise to other benefits of cooperation. For example, it helps to avoid needless conflicts in which groups lobby for opposing policies, with some substantial portion of the resources that they spend cancelling each other out (see Ord 2015: 120–21, 135). With a CBA-plus-AWTP policy in place, those resources can instead be spent on interventions that are appealing to all sides.
