Thanks for writing this, Sam. I think a large part of the disconnect you are perceiving can be explained as follows:
The longtermist community is primarily using an unusual (or unusually explicit) concern about what happens over the long term as a way of doing cause prioritisation. For example, it picks out issues like existential risk as being especially important, and then, with some more empirical information, one can pick out particular risks or risk factors to work on. The idea is that these are far more important areas to work on than the typical area, so you already have a big win from this style of thinking. In contrast, you seem to be looking at something that could perhaps be called 'long-term thinking', which takes any area of policy and tries to work out ways to better achieve its longterm goals using longterm plans.
These are quite different approaches to having impact. A useful analogy would be the difference between using cost-effectiveness as a tool for selecting a top cause or intervention to work on, vs using it to work out the most cost-effective way to do what you are already doing. I think a lot of the advantages of both cost-effectiveness and longtermist thinking come in this first step, their contribution to cause prioritisation, rather than in improving the area you are already working on.
That said, there are certainly cases of overlap. For example, while one could use longtermist cause-prioritisation to select nuclear disarmament as an area and then focus on the short term goal of re-establishing the INF treaty, which lapsed under Trump, one could also aim higher, for the best ways to completely eliminate nuclear weapons over the rest of the century, which would require longterm planning. I expect that longtermists could benefit from advances in longterm planning more than the average person, but it is not always required in order to get large gains from a longtermist approach.
Thanks — I hadn't heard of f-means before and it is a useful concept, and relevant here.
I think we are roughly in agreement on this, it is just hard to talk about. I think that compression of the set of expert estimates down to a single measure of central tendency (e.g. the arithmetic mean) loses information about the distribution that is needed to give the right answer in each of a variety of situations. So in this sense, we shouldn't aggregate first.
The ideal system would neither aggregate first into a single number, nor use each estimate independently and then aggregate from there (I suggested doing so as a contrast to aggregation first, but agree that it is not ideal). Instead, the ideal system would use the whole distribution of estimates (perhaps transformed based on some underlying model about where expert judgments come from, such as assuming that numbers between the point estimates are also plausible) and then do some kind of EV calculation based on that. But this is so general an approach as to not offer much guidance without further development.
I agree with a lot of this. In particular, that the best approach for practical rationality involves calculating things out according to each of the probabilities and then aggregating from there (or something like that), rather than aggregating first. That was part of what I was trying to show with the institution example. And it was part of what I was getting at by suggesting that the problem is ill-posed — there are a number of different assumptions we are all making about what these probabilities are going to be used for and whether we can assume the experts are themselves careful reasoners etc. and this discussion has found various places where the best form of aggregation depends crucially on these kinds of matters. I've certainly learned quite a bit from the discussion.
I think if you wanted to take things further, then teasing out how different combinations of assumptions lead to different aggregation methods would be a good next step.
I see what you mean, though you will find that scientific experts often end up endorsing probabilities like these. They model the situation, run the calculation and end up with 10^-12 and then say the probability is 10^-12. You are right that if you knew the experts were Bayesian and calibrated and aware of all the ways the model or calculation could be flawed, and had a good dose of humility, then you could read more into such small claimed probabilities — i.e. that they must have a mass of evidence they have not yet shared. But we are very rarely in a situation like that. Averaging a selection of Metaculus forecasters may be close, but is quite a special case when you think more broadly about the question of how to aggregate expert predictions.
Thinking about this more, I've come up with an example which shows a way in which the general question is ill-posed — i.e. that no solution that takes a list of estimates and produces an aggregate can be generally correct, but instead requires additional assumptions.
Three cards (a Jack, Queen, and King) are shuffled and dealt to A, B, and C. Each person can see their own card, and the one with the highest card will win. You want to know the chance C will win. Your experts are A and B. They both write down their answers on slips of paper and privately give them to you. A says 50%, so you know A doesn't have the King. B also says 50%, which also lets you know B doesn't have the King. You thus know the correct answer is a 100% chance that C has the King. In this situation, expert estimates of (50%, 50%) lead to an aggregate estimate of 100%, while any pair in which an expert estimates 0% leads to an aggregate estimate of 0%. This violates all central estimate aggregation methods.
The example shows that additional assumptions (for instance, about whether the information from the experts is independent) are needed for the problem to be well posed, and that without them, no form of mean could be generally correct.
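The card example can be checked by brute-force enumeration. The following is a minimal Python sketch; the function names are illustrative, and it assumes each expert simply reports the chance that C holds the King given their own card.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations

CARDS = ("J", "Q", "K")  # the King is the highest card


def report(own_card):
    """Chance that C wins, as judged by a player who sees only own_card."""
    unseen = [c for c in CARDS if c != own_card]
    # C is equally likely to hold either unseen card, and wins only with the King.
    return Fraction(sum(c == "K" for c in unseen), len(unseen))


def posterior_by_reports():
    """Map each pair of reports (from A, from B) to the true chance that C wins."""
    outcomes = defaultdict(list)
    for a, b, c in permutations(CARDS):  # all six equally likely deals
        outcomes[(report(a), report(b))].append(c == "K")
    return {pair: Fraction(sum(wins), len(wins)) for pair, wins in outcomes.items()}


for pair, chance in posterior_by_reports().items():
    print(pair, "->", chance)
```

Under these assumptions the pair (1/2, 1/2) maps to a chance of 1, and any pair containing 0 maps to a chance of 0, which no average of the two reports reproduces.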
I often favour arithmetic means of the probabilities, and my best guess as to what is going on is that there are (at least) two important kinds of use-case for these probabilities, which lead to different answers.
Sorting this out does indeed seem very useful for the community, and I fear that the current piece gets it wrong by suggesting one approach at all times, when we actually often want the other one.
Looking back, it seems the cases where I favoured arithmetic means of probabilities are those where I'm imagining using the probability in an EV calculation to determine what to do. I'm worried that optimising Brier and Log scoring rules is not what you want to do in such cases, so this analysis leads us astray. My paradigm example for geometric mean looking incorrect is similar to Linch's one below.
Suppose one option has value 10 and the other has value 500 with probability p (or else it has value zero). Now suppose you combine expert estimates of p and get 10% and 0.1%. In this case the averaging of probabilities says p=5.05% and the EV of the second option is 25.25, so you should choose it, while the geometric average of odds says p≈1%, so the EV is about 5, so you shouldn't choose it. I think the arithmetic mean does better here.
Now suppose the second expert instead estimated 0.0000001%. The arithmetic mean considers this no big deal, while the geometric mean now thinks it is terrible: enough to make the option not worth taking even if the prize for success were now 1,000 times greater. This seems crazy to me. If the prize were 500,000 and one of two experts said 10% chance, you should choose that option no matter how low the other expert goes. In the extreme case of one saying zero exactly, the geometric mean downgrades the EV of the option to zero, no matter the stakes, which seems even more clearly wrong.
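The arithmetic in these two cases can be reproduced with a short Python sketch. The function names are illustrative, and the geometric mean of odds is computed on the assumption that every estimate is strictly below 1.

```python
import math


def arithmetic_mean(ps):
    """Plain average of the probabilities."""
    return sum(ps) / len(ps)


def geometric_mean_of_odds(ps):
    """Geometric mean of the odds p / (1 - p), converted back to a probability.

    Assumes every p is below 1. A single exact zero forces the aggregate to zero.
    """
    if any(p == 0 for p in ps):
        return 0.0
    mean_log_odds = sum(math.log(p / (1 - p)) for p in ps) / len(ps)
    odds = math.exp(mean_log_odds)
    return odds / (1 + odds)


def risky_ev(p, prize):
    """Expected value of an option paying `prize` with probability p, else zero."""
    return p * prize


SAFE_VALUE = 10              # value of the certain option
first_case = [0.10, 0.001]   # 10% and 0.1%
second_case = [0.10, 1e-9]   # 10% and 0.0000001%

for label, ps, prize in [("first", first_case, 500), ("second", second_case, 500_000)]:
    for aggregate in (arithmetic_mean, geometric_mean_of_odds):
        ev = risky_ev(aggregate(ps), prize)
        choice = "risky" if ev > SAFE_VALUE else "safe"
        print(f"{label} case, {aggregate.__name__}: EV = {ev:.2f}, choose {choice}")
```

In both cases the arithmetic mean recommends the risky option and the geometric mean of odds recommends the safe one. In the first case the geometric mean of odds comes out slightly above 1%, so its EV is a little over 5 rather than exactly 5.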
Now here is a case that goes the other way. Two experts give probabilities 10% and 0.1% for the annual chance of an institution failing. We are making a decision whose value is linear in the lifespan of the institution. Arithmetic mean says p=5.05%, so an expected lifespan of 19.8 years. Geometric mean says p=1%, so an expected lifespan of 100 years, which I think is better. But what I think is even better is to calculate the expected lifespans for each expert estimate and average them. This gives (10 + 1,000) / 2 = 505 years (which would correspond to an implicit probability of 0.198%, the harmonic mean of the two estimates).
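Here is a minimal Python sketch of the institution example. It assumes a constant, independent annual failure probability p, so that the expected lifespan is 1/p, and it takes "geometric mean" to be the geometric mean of the probabilities themselves, which is what gives exactly 1%. The variable names are illustrative.

```python
import statistics

estimates = [0.10, 0.001]  # annual failure probabilities from the two experts


def expected_lifespan(p):
    """Expected years until failure if each year fails independently with probability p."""
    return 1 / p


arith = statistics.mean(estimates)
geo = statistics.geometric_mean(estimates)   # geometric mean of probabilities, not odds
harm = statistics.harmonic_mean(estimates)

# Average the per-expert lifespans instead of averaging the probabilities first.
mean_of_lifespans = statistics.mean([expected_lifespan(p) for p in estimates])

print(f"arithmetic mean p = {arith:.4%}, lifespan = {expected_lifespan(arith):.1f} years")
print(f"geometric mean  p = {geo:.4%}, lifespan = {expected_lifespan(geo):.1f} years")
print(f"harmonic mean   p = {harm:.4%}, lifespan = {expected_lifespan(harm):.1f} years")
print(f"mean of per-expert lifespans = {mean_of_lifespans:.1f} years")
```

The last two lines agree: averaging the per-expert lifespans is the same as taking the lifespan implied by the harmonic mean of the probabilities.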
Note that both of these can be relevant at the same time. e.g. suppose two surveyors estimated the chance your Airbnb will collapse each night and came back with 50% and 0.00000000001%. In that case, the geometric mean approach says it is fine, but really you shouldn't stay there tonight. However, at the same time, the expected number of nights it will last without collapsing is very high.
How I often model these cases internally is to assume a mixture model with the real probability randomly being one of the estimated probabilities (with equal weights unless stated otherwise). That gets what I think of as the intuitively right behaviours in the cases above.
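A minimal Python sketch of this mixture model follows. The helper name `mixture_expectation` and its signature are illustrative assumptions. The idea is to compute the quantity of interest under each expert's probability and then take the weighted average.

```python
def mixture_expectation(estimates, quantity, weights=None):
    """Expected value of quantity(p) when the true p is one of the estimates.

    Weights default to equal; if given, they are assumed to sum to 1.
    """
    if weights is None:
        weights = [1 / len(estimates)] * len(estimates)
    return sum(w * quantity(p) for w, p in zip(weights, estimates))


# Prize example: value 500 with probability p, estimates of 10% and 0.1%.
prize_ev = mixture_expectation([0.10, 0.001], lambda p: 500 * p)

# Institution example: expected lifespan 1/p under a constant annual failure chance.
lifespan = mixture_expectation([0.10, 0.001], lambda p: 1 / p)

# Surveyor example: 50% and 0.00000000001% nightly chance of collapse.
surveyors = [0.5, 1e-13]
p_collapse_tonight = mixture_expectation(surveyors, lambda p: p)
expected_nights = mixture_expectation(surveyors, lambda p: 1 / p)

print(f"prize EV = {prize_ev:.2f}")
print(f"institution lifespan = {lifespan:.1f} years")
print(f"chance of collapse tonight = {p_collapse_tonight:.2%}")
print(f"expected nights standing = {expected_nights:.3g}")
```

When the quantity is linear in p this reduces to the arithmetic mean of the probabilities, and when it is 1/p it reduces to the harmonic mean, so the one model covers both kinds of case.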
Now this is only a sketch and people might disagree with my examples, but I hope it shows that "just use the geometric mean of odds ratios" is not generally good advice, and points the way towards understanding when to use other methods.
I don't think I'm a proponent of strong longtermism at all — at least not on the definition given in the earlier draft of Will and Hilary's paper on the topic that got a lot of attention here a while back and which is what most people will associate with the name. I am happy to call myself a longtermist, though that also doesn't have an agreed definition at the moment.
Here is how I put it in The Precipice:
Considerations like these suggest an ethic we might call longtermism, which is especially concerned with the impacts of our actions upon the longterm future. It takes seriously the fact that our own generation is but one page in a much longer story, and that our most important role may be how we shape—or fail to shape—that story. Working to safeguard humanity’s potential is one avenue for such a lasting impact and there may be others too.
My preferred use of the term is akin to being an environmentalist: it doesn't mean that the only thing that matters is the environment, just that it is a core part of what you care about and informs a lot of your thinking.
This is a very nice explanation Ben.
For the record, while I'm perhaps the most prominent voice in EA for our time being one of the most influential there will ever be, I'm also very sympathetic to this approach. For instance, my claim is that this key time period has already been going for 75 years and can't last more than a small number of centuries. This is quite compatible with more important times being 100 years away, and with the arguments that investing for long periods like that could provide a large increase in the expected impact of the resources (even if the time they were spent was not more influential). And of course, I might be wrong about the importance of this time. So I am excited to see more work exploring patient longtermism.
While I think the Shapley value can be useful, there are clearly cases where the counterfactual value is superior for an agent deciding what to do. Derek Parfit clearly explains this in Five Mistakes in Moral Mathematics. He is arguing against the 'share of the total' view, but at least some of the arguments also apply to the Shapley value (which is basically an improved version of 'share of the total'). In particular, the best things you have listed in favour of the Shapley value applied to making a moral decision correctly apply when you and others are all making the decision 'together'. If the others have already committed to their part in a decision, the counterfactual value approach looks better.
e.g. on your first example, if the other party has already paid their $1000 to P, you face a choice between creating 15 units of value by funding P or 10 units by funding the alternative. Simple application of Shapley value says you should do the action that creates 10 units, predictably making the world worse.
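To make the contrast concrete, here is a minimal Python sketch. The details of the original example are not restated here, so the value function is an assumption: charity P is taken to produce 15 units only if both donors fund it and nothing otherwise, while the alternative produces 10 units from your donation alone. The player names and helper names are illustrative.

```python
from fractions import Fraction
from itertools import permutations


def shapley_values(players, value):
    """Average marginal contribution of each player over all join orders."""
    orderings = list(permutations(players))
    totals = {p: Fraction(0) for p in players}
    for order in orderings:
        coalition = frozenset()
        for p in order:
            joined = coalition | {p}
            totals[p] += value(joined) - value(coalition)
            coalition = joined
    return {p: total / len(orderings) for p, total in totals.items()}


def value_of_P(coalition):
    """Assumed payoff: P yields 15 units only when both donors fund it."""
    return 15 if coalition == frozenset({"you", "other"}) else 0


ALTERNATIVE_VALUE = 10  # assumed value of funding the alternative on your own

shapley_you = shapley_values(["you", "other"], value_of_P)["you"]

# Counterfactual value of your donation once the other party has already paid.
counterfactual_you = value_of_P(frozenset({"you", "other"})) - value_of_P(frozenset({"other"}))

print("Shapley credit for funding P:", shapley_you)
print("Counterfactual value of funding P:", counterfactual_you)
print("Value of the alternative:", ALTERNATIVE_VALUE)
```

Under these assumptions your Shapley credit for P is below the value of the alternative, while your counterfactual impact from funding P is above it, which is the divergence described above.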
One might be able to get the best of both methods here if you treat cases like this where another agent has already committed to a known choice as part of the environment when calculating Shapley values. But you need to be clear about this. I consider this kind of approach to be a hybrid of the Shapley and counterfactual value approaches, with Shapley only being applied when the other agents' decisions are still 'live'. As another example, consider your first example and add the assumption that the other party hasn't yet decided, but that you know they love charity P and will donate to it for family reasons. In that case, the other party's decision, while not yet made, is not 'live' in the relevant sense and you should support P as well.
If you are going to pursue what the community could gain from considering Shapley values, then look into cases like this and subtleties of applying the Shapley value further — and do read that Parfit piece.