

Answer by Froolow, Mar 10, 2023

There's no single right answer here, and there are several good approaches depending on the application to which you want to put the information. A reasonable estimate is that a baby born today and then given the best life modern society can give it will accrue around 24 QALYs (at a 3.5% discount rate). This is equivalent to 70 undiscounted QALYs, but you absolutely must discount to some extent in this case, because a QALY now is clearly preferred to a QALY in 80 years' time.

This value is found by multiplying the life expectancy of a baby born in the UK in 2020 by the typical quality of life that baby will experience in each year of their life. Life expectancy is pretty straightforward to calculate (I take it from the Office for National Statistics); quality of life is much more complicated - the standard in the field is Ara, R. and Brazier, J.E. (2010), but this is getting a bit out of date now.
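The arithmetic behind the 24-vs-70 figures can be sketched as follows. Note the flat quality-of-life weight of 0.875 is an illustrative constant chosen to make the numbers work out, not the age-varying Ara & Brazier profile:

```python
# Sketch of discounted QALY arithmetic. A flat quality-of-life weight
# stands in for the age-varying Ara & Brazier (2010) profile.
LIFE_EXPECTANCY = 80   # years, roughly UK life expectancy at birth
QOL_WEIGHT = 0.875     # quality-of-life weight per year (illustrative)
DISCOUNT_RATE = 0.035  # the standard 3.5% annual discount rate

undiscounted = QOL_WEIGHT * LIFE_EXPECTANCY
discounted = sum(QOL_WEIGHT / (1 + DISCOUNT_RATE) ** t
                 for t in range(LIFE_EXPECTANCY))

print(f"Undiscounted QALYs: {undiscounted:.0f}")  # ~70
print(f"Discounted QALYs:   {discounted:.0f}")    # ~24
```

Discounting each future year by 3.5% compresses the 70 lifetime QALYs to roughly a third of their face value, which is why the two headline numbers differ so much.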

An obvious problem with the translatability of this approach is that the Ara & Brazier paper only applies to the UK. Different countries will have different profiles of population health (which is obvious) and also different ways of interpreting how health applies to QALYs (which is less obvious). For example, if old age affects your ability to walk easily / comfortably then this might matter more to your QALYs in dense walkable European cities than car-focussed American suburbs (I have no idea if this mechanism is true, just giving it as an example). This will be particularly challenging if you're trying to calculate the number of QALYs a person accrues in a global health context, because there is limited research in the area.


Could anyone explain why the experts think Angry Birds will be so hard? It seems like absolutely ideal conditions for reinforcement learning, in the sense that inputs are very simple and there is a very straightforward way of determining how successful each shot is. Is the limitation that it has to be an Artificial Intelligence which succeeds at the problem, rather than a dumb reinforcement algorithm which happens to be really well suited for the task?


I'm one of the evaluators involved in the project (Alex Bates). I wanted to mention that it was an absolute pleasure to work with Unjournal, and a qualitative step above any other journal I've ever reviewed for. I'd definitely encourage people to get involved if they are on the fence about it!

Answer by Froolow, Jan 10, 2023

When you say 'adding QALYs' you mean interventions which generate QALYs right? Not taking two interventions which generate QALYs and trying to estimate the combined effect of doing both interventions together?

If the former, there's a very interesting paper giving an overview of many different methods of generating QALYs here:

It is quite outdated now, and some of the assumptions about what does and doesn't count are controversial, however I still think it is an excellent way of thinking about the variety of different ways in which you might generate QALYs.

The general answer to your first question is, I think, that the most robust methods of generating QALYs are exactly what you would expect - laws mandating seat belt usage, fire alarm installations, vaccinations, etc.

With respect to air pollution, you'll see that there are significantly varying estimates for cost-effectiveness depending on exactly which intervention is used to control pollution. For example, 'Coal-fired power plants emission control through high stacks' is approximately as cost-effective as wearing a seatbelt, whereas 'Acrylonitrile emission control via best available technology' is one of the least cost-effective interventions studied. The references attached to each intervention will give more details on how these estimates were arrived at.

Answer by Froolow, Nov 23, 2022

You might be interested in an Adversarial Collaboration I wrote on this topic a few years ago. My collaborator was a meat-eater who was very strong on finding representative statistics (in fact he wrote the first draft of Section 2.2 to keep me extra-honest).


Yes I will do, although some respondents asked to remain anonymous / not have their data publicly accessible, and so I need to make some slight alterations before I share. I'd guess a couple of weeks for this.


I agree that the arith-vs-geo question is basically the crux when it comes to whether this essay should move FF's 'fair betting probabilities' - it sounds like everyone is pretty happy with the point about distributions and I'm really pleased about that because it was the main point I was trying to get across. I'm even more pleased that there is background work going on in the analysis of uncertainty space, because that's an area where public statements by AI Risk organisations have sometimes lagged behind the state of the art in other risk management applications. 

With respect to the crux, I hate to say it - because I'd love to be able to make as robust a claim for the prize as possible - but I'm not sure there is a principled reason for using geomean over arithmean for this application (or vice versa). The way I view it, they are both just snapshots of what is 'really' going on, which is the full distribution of possible outcomes given in the graphs / model. By analogy, I would be very suspicious of someone who always argued the arithmean would be a better estimate of central tendency than the median for every dataset / use case! I agree with you the problem of which is best for this particular dataset / use case is subtle, and I think I would characterise it as being a question of whether my manipulations of people's forecasts have retained some essential 'forecast-y' characteristic which means geomean is more appropriate for various features it has, or whether they have been processed into having some sort of 'outcome-y' characteristic in which case arithmean is more appropriate. I take your point below in the coin example and the obvious superiority of arithmeans for that application, but my interpretation is that the FF didn't intend for the 'fair betting odds' position to limit discussion about alternate ways to think about probabilities ("Applicants need not agree with or use our same conception of probability")
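The gap between the two summary statistics is easy to demonstrate on made-up numbers (these forecasts are purely illustrative, not the survey data from the essay): for a right-skewed set of probability estimates spanning several orders of magnitude, the arithmetic mean is dominated by the largest values, while the geometric mean tracks the 'typical' forecast.

```python
import math

# Illustrative forecasts spanning several orders of magnitude --
# not the actual survey responses from the essay.
forecasts = [0.001, 0.01, 0.03, 0.1, 0.5]

arith = sum(forecasts) / len(forecasts)
geo = math.exp(sum(math.log(p) for p in forecasts) / len(forecasts))

print(f"arithmetic mean: {arith:.4f}")  # dominated by the 0.5 outlier
print(f"geometric mean:  {geo:.4f}")    # tracks the 'typical' forecast
```

Neither number is 'the' answer; they are just two different one-number compressions of the same underlying distribution, which is the point about distributions made above.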

However, to be absolutely clear, even if geomean was the right measure of central tendency I wouldn't expect the judges to pay it particular attention - if all I had done was find a novel way of averaging results then my argument would basically be mathematical sophistry, perhaps only one step better than simply redefining 'AI Risk' until I got a result I liked. I think the distribution point is the actually valuable part of the essay, and I'm quite explicit in the essay that neither geomean nor arithmean is a good substitute for the full distribution. While I would obviously be delighted if I could also convince you my weak preference for geomean as a summary statistic was actually robust and considered, I'm not especially wedded to the argument for one summary statistic over the other. I did realise after I got my results that the crux for moving probabilities was going to be a very dry debate about different measures of central tendency, but I figured since the Fund was interested in essays on the theme of "a bunch of this AI stuff is basically right, but we should be focusing on entirely different aspects of the problem" (even if they aren't being strictly solicited for the prize) the distribution bit of the essay might find a readership there anyway.

By the way, I know your four-step argument is intended just as a sketch of why you prefer arithmean for this application, but I do want to just flag up that I think it goes wrong on step 4, because acting according to arithmean probability (or geomean, for that matter) throws away information about distributions. As I mention here and elsewhere, I think the distribution issue is far more important than the geo-vs-arith issue, so while I don't really feel strongly if I lose the prize because the judges don't share my intuition that geomean is a slightly better measure of central tendency I would be sad to miss out because the distribution point was misunderstood! I describe in Section 5.2.2 how the distribution implied by my model would quite radically change some funding decisions, probably by more than an argument taking the arithmean to 3% (of course, if you're already working on distribution issues then you've probably already reached those conclusions and so I won't be changing your mind by making them - but in terms of publicly available arguments about AI Risk I'd defend the case that the distribution issue implies more radical redistribution of funds than changing the arithmean to 1.6%). So I think "act according to that mean probability" is wrong for many important decisions you might want to take - analogous to buying a lot of trousers with 1.97 legs in my example in the essay. No additional comment if that is what you meant though and were just using shorthand for that position.


Thanks, this is really interesting - in hindsight I should have included something like this when describing the SDO mechanism, because it illustrates it really nicely. Just to follow up on a comment I made somewhere else, the concept of a 'conjunctive model' is something I've not seen before and implies a sort of ontology of models which I haven't seen in the literature. A reasonable definition of a model is that it is supposed to reflect an underlying reality, and this will sometimes involve multiplying probabilities and sometimes involve adding two different sources of probabilities. 

I'm not an expert in AI Risk so I don't have much of a horse in this race, but I do note that if the one published model of AI Risk is highly 'conjunctive' / describes a reality where many things need to occur in order for AI Catastrophe to occur then the correct response from the 'disjunctive' side is to publish their own model, not argue that conjunctive models are inherently biased - in a sense 'bias' is the wrong term to use here because the case for the disjunctive side is that the conjunctive model accurately describes a reality which is not our own. 

(I'm not suggesting you don't know this, just that your comment assumes a bit of background knowledge from the reader I thought could potentially be misinterpreted!)


I completely agree that the survey demographic will make a big difference to the headline results figure. Since I surveyed people interested in existential risk (Astral Codex Ten, LessWrong, EA Forum) I would expect the results to be biased upwards, though. (Almost) every participant in my survey agreed the headline risk was greater than the 1.6% figure from this essay, and generally my results line up with the Bensinger survey.

However, this is structurally similar to the state of Fermi Paradox estimates prior to SDO 'dissolving' this - that is, almost everyone working on the Drake Equation put the probable number of alien civilisations in this universe very high, because they missed the extremely subtle statistical point about uncertainty analysis SDO spotted, and which I have replicated in this essay. In my opinion, Section 4.3 indicates that as long as you have any order-of-magnitude uncertainty you will likely get an asymmetric distribution of risk, and so in that sense I disagree that the mechanism depends on who you ask. The mechanism is the key part of the essay; the headline number is just one particular way to view that mechanism.
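The mechanism can be illustrated with a toy Monte Carlo (the three log-uniform parameters here are invented for illustration, not taken from the essay's model): multiplying together a few parameters that each carry order-of-magnitude uncertainty yields a heavily right-skewed risk distribution, with the mean pulled far above the median by a small number of high-risk draws.

```python
import math
import random
import statistics

random.seed(0)

def sample_risk():
    # Three hypothetical conditional probabilities, each log-uniformly
    # distributed over three orders of magnitude (0.001 to 1).
    return math.prod(10 ** random.uniform(-3, 0) for _ in range(3))

risks = [sample_risk() for _ in range(100_000)]

mean = statistics.fmean(risks)
median = statistics.median(risks)
print(f"mean:   {mean:.5f}")    # pulled up by rare high-risk draws
print(f"median: {median:.7f}")  # the 'typical' sampled world
```

Any reasonable choice of wide parameter uncertainties produces this mean-versus-median gap, which is why the asymmetry does not hinge on whose point estimates you start from.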


In practice these numbers wouldn't perfectly match even if there was no correlation because there is some missing survey data that the SDO method ignores (because naturally you can't sample data that doesn't exist). In principle I don't see why we shouldn't use this as a good rule-of-thumb check for unacceptable correlation.

The synth distribution gives a geomean of 1.6% and a simple mean of around 9.6%, as per the essay.

The distribution of all survey responses multiplied together (as per Alice p1 x Alice p2 x Alice p3) gives a geomean of approx 2.3% and a simple mean of approx 17.3%.

I'd suggest that this implies the SDO method's weakness to correlated results is potentially depressing the actual result by about 50%, give or take. I don't think that's either obviously small enough not to matter or obviously large enough to invalidate the whole approach, although my instinct is that when talking about order-of-magnitude uncertainty, a 50% error would not be a showstopper.
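The rule-of-thumb check can be made concrete with a two-respondent toy example (Alice, Bob and their numbers are invented for illustration): the SDO-style 'synth' distribution samples each step independently across respondents, whereas the within-respondent products preserve any correlation. With no missing data the geometric means of the two constructions coincide exactly (the geomean of products factorises into per-step geomeans), so the gap in simple means is where correlation shows up.

```python
import math
from itertools import product

# Invented toy survey: each respondent gives a probability for each of
# three conjunctive steps. Alice is uniformly pessimistic and Bob
# uniformly sceptical, i.e. responses are strongly correlated.
survey = {
    "Alice": [0.9, 0.9, 0.9],
    "Bob":   [0.1, 0.1, 0.1],
}

def geomean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def mean(xs):
    return sum(xs) / len(xs)

# SDO-style 'synth' distribution: sample each step independently.
columns = zip(*survey.values())
synth = [math.prod(combo) for combo in product(*columns)]

# Within-respondent products: Alice p1 x Alice p2 x Alice p3, etc.
within = [math.prod(ps) for ps in survey.values()]

print(f"synth:  geomean {geomean(synth):.3f}, mean {mean(synth):.3f}")
print(f"within: geomean {geomean(within):.3f}, mean {mean(within):.3f}")
```

In this toy case both geomeans are 0.027, while the within-respondent simple mean (0.365) sits well above the synth simple mean (0.125) purely because of the correlation, which is the same direction of effect as the 2.3% / 1.6% comparison above.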
