My current best guess on how to aggregate forecasts

by Jsevillamol3 min read6th Oct 20216 comments

28

Forecasting
Frontpage

In short: I still recommend using the geometric mean of odds as the default aggregation method, but I give my best guess on exceptions to this rule.

 

Since writing about how the geometric mean of odds compares to other forecasting aggregation methods I have received many comments asking for a more nuanced approach to choosing how to aggregate forecasts. I do not yet have a full answer to this question, but here I am going to outline my current best guess to help people with their research and give a chance to commenters to prove me wrong.

In short, here is my current best guess in the form of a flowchart:

Flowchart to choose a forecast aggregation method. Start from the yellow box and follow the arrows. You can right click the image and select "open image in new tab" for a bigger version.

Some explanations are in order:

  • I currently believe that the geometric mean of odds should be the default option for aggregating forecasts. In the two large scale empirical evaluations I am aware of [1] [2], it surpasses the mean of probabilities and the median (*). It is also the only method that makes the group aggregate behave as a Bayesian, and (in my opinion) it behaves well with extreme predictions.
  • If you are not aggregating all-considered views of experts, but rather aggregating models with mutually exclusive assumptions, use the mean of probabilities. For example, this will come up if you first compute your timelines for Transformative AI assuming it will be derived from transformer-like methods, and then assuming it will come from emulated beings, etc. In this case,  .
  • When the data includes poorly calibrated outliers, if it's possible exclude them and take the geometric mean. If not, we should use a pooling method resistant to outliers. The median is one such popular aggregation method.
  • If there is a known bias in the community of predictors you are polling for predicting positive resolution of binary questions, you can consider correcting for this. One correction that worked on metaculus data is taking the geometric mean of the probabilities (this pulls the aggregate towards zero compared to the geometric mean of odds). Better corrections are likely to exist.
  • If there is a track record of underconfidence in past aggregate predictions from the community, consider extremizing the final outcome. This has been common practice in academia for a while. For example, (Satopää et al, 2014) have found good performance using extremized logodds. To choose an extremizing factor I suggest experimenting with what extremizing factors would have given you good performance in past predictions from the same community (EDIT: Simon M lays out a case against extremizing).
  • Lastly, it seems that empirically the weighting you use for the predictions matters much more than the aggregation method. I do not have yet great recommendations on how to do weighting, but it seems that weighting by recency of the prediction and by track record of the predictor works well at least in some cases.

There are reasons to believe I will have a better idea of which aggregation methods work best in a given context in a year. For example, it is not clear to me how to detect and deal with outliers, none of the current aggregation methods give consistent answers when annualizing probabilities and there is a huge unexplored space of aggregation functions that we might tap into with machine learning methods.

In conclusion and repeating myself: for the time being I would recommend people to stick to the geometric mean of odds as a default aggregate. I also encourage emphasizing the 10%, 50% and 90% percentile of predictions as well as the number of predictions to summarize the spread.

If you have a good pulse on a problem with the data, above I suggest some solutions you can try. But beware applying them blindly and choosing the outcome you like best.
 

Acknowledgements:

Thanks to Simon M for his analysis of aggregation and weighting methods on Metaculus data. I thank Eric Neyman, Ozzie Gooen and Peter Wildeford for discussion and feedback.

 

Footnotes

(*) (Seaver, 1978) also performs different experiments comparing different pooling methods, and founds similar performance between the mean of probabilities and geometric mean of odds. However I suspect this is because the aggregated probabilities in the experiments were in a range where both methods give similar results.
 

28

6 comments, sorted by Highlighting new comments since Today at 5:31 PM
New Comment

I wonder how these compare with fitting a Beta distribution and using one of its statistics? I’m imagining treating each forecast (assuming they are probabilities) as an observation, and maximizing the Beta likelihood. The resulting Beta is your best guess distribution over the forecasted variable.

It would be nice to have an aggregation method which gave you info about the spread of the aggregated forecast, which would be straightforward here.

It's not clear to me that "fitting a Beta distribution and using one of it's statistics" is different from just taking the mean of the probabilities.

I fitting a beta distribution to Metaculus forecasts and looked at:

  • Median forecast
  • Mean forecast
  • Mean log-odds / Geometric mean of odds
  • Fitted beta median
  • Fitted beta mean

Scattering these 5 values against each other I get:

We can see fitted values are closely aligned with the mean and mean-log-odds, but not with the median. (Unsurprising when you consider the ~parametric formula for the mean / median).

The performance is as follows:

 brierlog_scorequestions
geo_mean_odds_weighted0.1160.37856
beta_median_weighted0.1180.378856
median_weighted0.1210.38856
mean_weighted0.1220.391856
beta_mean_weighted0.1230.396856

My intuition for what is going on here is that the beta-median is an extremized form of the beta-mean / mean, which is an improvement

Looking more recently (as the community became more calibrated), the beta-median's performance edge seems to have reduced:

 brierlog_scorequestions
geo_mean_odds_weighted0.090.29330
median_weighted0.0910.294330
beta_median_weighted0.0910.297330
mean_weighted0.0940.31330
beta_mean_weighted0.0950.314330

Hmm good question.

For a quick foray into this we can see what would happen if we use our estimate the mean of the max likelihood beta distribution implied by the sample of forecasts .

The log-likelihood to maximize is then 

 

The wikipedia article on the Beta distribution discusses this maximization problem in depth, pointing out that albeit no closed form exists if  and  can be assumed to be not too small the max likelihood estimate can be approximated as   and , where  and .

The mean of a beta with these max likelihood parameters is .

By comparison, the geometric mean of odds estimate is:

Here are two examples of how the two methods compare aggregating five forecasts

 

I originally did this to convince myself that the two aggregates were different. And they seem to be! The method seems to be close to the arithmetic mean in this example. Let's see what happens when we extremize one of the predictions:

 

We have made p3 one hundred times smaller. The geometric mean is suitable affected. The maximum likelihood beta mean stays close to the arithmetic mean, unperturbed. 

This makes me a bit less excited about this method, but I would be excited about people poking around with this method and related ones!

Thanks for this post - I think this was a very useful conversation to have started (at least for my own work!), even if I'm less confident than you in some of these conclusions (both because I just feel confused and because I've heard other people give good-sounding arguments for other conclusions). 

In the two large scale empirical evaluations I am aware of [1] [2], it surpasses the mean of probabilities and the median (*).

But it seems worth noting that in one of those cases the geometric mean of probabilities outperformed the geometric mean of odds. 

You later imply that you think this is at least partly because of a specific bias among Metaculus forecasts. But I'm not sure if you think it's fully because of that or whether that's the right explanation (I only skimmed the linked thread). And in any case the basic fact that geometric mean of probabilities performed best in this dataset seems worth noting if you're using performance in that dataset as evidence for some other aggregation method.

Thanks for this post - I think this was a very useful conversation to have started (at least for my own work!), even if I'm less confident than you in some of these conclusions 

Thank you for your kind words! To dismiss any impression of confidence, this represents my best guesses. I am also quite confused.

I've heard other people give good-sounding arguments for other conclusions

I'd be really curious if you can dig these up!

You later imply that you think [the geo mean of probs outperforming the geo mean of odds] is at least partly because of a specific bias among Metaculus forecasts. But I'm not sure if you think it's fully because of that or whether that's the right explanation

I am confident that the geometric mean of probs outperformed the geo mean of odds because of this bias. If you change the coding of all binary questions so that True becomes False and viceversa then you are going to get worse performance that the geo mean of odds.

This is because the geometric mean of probabilities does not map consistently predictions and their complements. With a basic example, suppose that we have . Then .

 So the geometric mean of probabilities in this sense it's not a consistent probability - it doesn't map the complement of probabilities to the the complement of the geometric mean as we would expect (the geometric mean of odds, the mean of probabilities and the median all satisfy this basic property).

So I would recommend viewing the geometric mean of probabilities as a hack to adjust the geometric mean of odds down. This is also why I think better adjustments likely exist, since this isn't a particularly well motivated adjustment. It does however seem to slighly improve Metaculus predictions, so I included it in the flowchart.

To drill this point even more, here is what we would get if we aggregated the predictions in the last 860 resolved metaculus binary questions by mapping each prediction to their complement, taking the geo mean of probs and taking the complement again:

The complement of the geometric mean of complement probabilities is called comp_geo_mean

As you can see, this change (that would not affect the other aggregates) significantly weakens the geo mean of probs.

I found some revelant discussion in the EA Forum about extremizing in footnote 5 of this post.

The aggregation algorithm was elitist, meaning that it weighted more heavily forecasters with good track-records who had updated their forecasts more often. In these slides, Tetlock describes the elitism differently: He says it gives weight to higher-IQ, more open-minded forecasters. The extremizing step pushes the aggregated judgment closer to 1 or 0, to make it more confident. The degree to which they extremize depends on how diverse and sophisticated the pool of forecasters is. The academic papers on this topic can be found here and here. Whether extremizing is a good idea is controversial; according to one expert I interviewed, more recent data suggests that the successes of the extremizing algorithm during the forecasting tournament were a fluke. After all, a priori one would expect extremizing to lead to small improvements in accuracy most of the time, but big losses in accuracy some of the time.

The post in general is quite good, and I recommend it.