Principled extremizing of aggregated forecasts

Jaime Sevilla

Comments 3

Eric Neyman

Hi! I'm an author of this paper and am happy to answer questions. Thanks to Jsevillamol for the summary!

A quick note regarding the context in which the extremization factor we suggest is "optimal": rather than taking a Bayesian view of forecast aggregation, we take a robust/"worst case" view. In brief, we consider the following setup:

(1) You choose an aggregation method.

(2) An adversary chooses an information structure (i.e. a joint probability distribution over the true answer and the partial information each expert knows) to make your aggregation method do as poorly as possible in expectation, subject to the information structure satisfying the projective substitutes condition.

In this setup, the 1.73 extremization constant is optimal, i.e. maximizes worst-case performance.
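As a rough illustration of what applying such a constant looks like (a minimal sketch, not code from the paper), the snippet below averages the experts' log odds and multiplies the result by a factor `d` before converting back to a probability. The function names and the example forecasts are hypothetical, and the assumption is that extremizing is applied to the mean log odds.

```python
import math

def logit(p):
    """Convert a probability into log odds."""
    return math.log(p / (1.0 - p))

def sigmoid(x):
    """Convert log odds back into a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def extremized_mean_logodds(probs, d=1.73):
    """Average the log odds of the forecasts, then scale by the factor d.

    d = 1 gives the plain mean of log odds; d > 1 pushes the
    aggregate away from 50%.
    """
    mean_logodds = sum(logit(p) for p in probs) / len(probs)
    return sigmoid(d * mean_logodds)

# Hypothetical forecasts from three experts.
forecasts = [0.6, 0.7, 0.8]
plain = extremized_mean_logodds(forecasts, d=1.0)
extreme = extremized_mean_logodds(forecasts, d=1.73)
print(f"mean of log odds: {plain:.3f}, extremized: {extreme:.3f}")
```

With `d` greater than 1, the aggregate always lands on the same side of 50% as the plain mean of log odds, only further out.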

That said, I think it's probably possible to do even better by using a non-linear extremization technique. Concretely, I strongly suspect that the less variance there is in experts' forecasts, the less it makes sense to extremize (because the experts have more overlap in the information they know). I would be curious to see how low a loss it's possible to get by taking into account not just the average log odds, but also the variance in the experts' log odds. Hopefully we will have formal results to this effect (together with a concrete suggestion for taking variance into account) sometime soon :)

AlexMennen

"I am more hesitant to recommend the more complex extremization method where we use the historical baseline resolution log-odds"

It's the other way around for me. Historical baseline may be somewhat arbitrary and unreliable, but so is 1:1 odds. If the motivation for extremizing is that different forecasters have access to independent sources of information to move them away from a common prior, but that common prior is far from 1:1 odds, then extremizing away from 1:1 odds shouldn't work very well, and historical baseline seems closer to a common prior than 1:1 odds does.
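To make the difference concrete, here is a small sketch of extremizing away from a baseline instead of away from 1:1 odds. It assumes the baseline version takes the form logit(baseline) + d * (mean log odds - logit(baseline)); that form, the function names, and the example numbers are illustrative assumptions rather than something stated in this thread.

```python
import math

def logit(p):
    """Convert a probability into log odds."""
    return math.log(p / (1.0 - p))

def sigmoid(x):
    """Convert log odds back into a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def extremize_from_baseline(probs, baseline=0.5, d=1.73):
    """Scale the distance between the mean log odds and a baseline by d.

    With baseline=0.5 (1:1 odds) the baseline log odds are zero, so
    this reduces to multiplying the mean log odds by d.
    """
    base = logit(baseline)
    mean_logodds = sum(logit(p) for p in probs) / len(probs)
    return sigmoid(base + d * (mean_logodds - base))

# Hypothetical case: experts sit at 40%, above a 36% baseline but below 50%.
forecasts = [0.4, 0.4, 0.4]
from_even_odds = extremize_from_baseline(forecasts, baseline=0.5)
from_history = extremize_from_baseline(forecasts, baseline=0.36)
print(f"away from 1:1 odds: {from_even_odds:.3f}")
print(f"away from a 36% baseline: {from_history:.3f}")
```

In this example the two baselines move the aggregate in opposite directions: extremizing away from 1:1 odds pushes it below 40%, while extremizing away from the 36% baseline pushes it above 40%.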

I'm interested in how to get better-justified odds ratios to use as a baseline. One idea is to use past estimates of the same question. For example, suppose Metaculus asks "Does X happen in 2030?", the question closes at the end of 2021, and then Metaculus asks the exact same question again at the beginning of 2022. Then the aggregated odds that the first question closed at can be used as a baseline for the second question. Perhaps you could do something more sophisticated: instead of closing the question and opening an identical one, keep the question open, but use the odds that experts gave it at some point in the past as a baseline with which to interpret more recent odds estimates provided by experts. Of course, none of this works if no identical question has been asked previously and the question has only been open for a short time.

Another possibility is to use two pools of forecasters, both of which have done calibration training, but one of which consists of subject-matter experts, and the other of which consists of people with little specialized knowledge on the subject matter, and ask the latter group not to do much research before answering. Then the aggregated odds of the non-experts can be used as a baseline when aggregating odds given by the experts, on the theory that the non-experts can give you a well-calibrated prior because of their calibration training, but won't be taking into account the independent sources of knowledge that the experts have.

Jaime Sevilla

Thanks for chipping in, Alex!

"It's the other way around for me. Historical baseline may be somewhat arbitrary and unreliable, but so is 1:1 odds."

Agreed! To add some nuance to my recommendation, the main reason I am hesitant is the lack of academic precedent (as far as I know).

"If the motivation for extremizing is that different forecasters have access to independent sources of information to move them away from a common prior, but that common prior is far from 1:1 odds, then extremizing away from 1:1 odds shouldn't work very well."

Note that the data backs this up! Using "pseudo-historical" odds works considerably better than using 1:1 odds. See the appendix for more details.

"[...] use past estimates of the same question."
"[...] use the odds that experts gave it at some point in the past as a baseline with which to interpret more recent odds estimates provided by experts."

I'd be interested in seeing the results of such experiments using Metaculus data!

"Another possibility is to use two pools of forecasters [...]"

This one is trippy, I like it!

Extremizing factor
$n = 1$	$d = 1$
$n = 2$	$d \approx 1.29$
$n = 3$	$d \approx 1.41$
$n = 5$	$d \approx 1.53$
$n = 10$	$d \approx 1.62$
$n = 50$	$d \approx 1.71$
$n = 100$	$d \approx 1.72$
$n = 1000$	$d \approx 1.73$
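The factors in the table above can be reproduced with a short calculation. The sketch below assumes the closed form d = n(sqrt(3n^2 - 3n + 1) - 2) / (n^2 - n - 1), which does not appear in this excerpt and is quoted from memory, so treat it as an assumption. It agrees with every tabulated value to within about 0.01, and it tends to sqrt(3), roughly 1.73, as n grows.

```python
import math

def extremizing_factor(n):
    """Extremizing factor d for n forecasts.

    Assumed closed form (quoted from memory, not from this excerpt):
        d = n * (sqrt(3n^2 - 3n + 1) - 2) / (n^2 - n - 1)
    """
    return n * (math.sqrt(3 * n**2 - 3 * n + 1) - 2) / (n**2 - n - 1)

# Values of n that appear in the table.
for n in (1, 2, 3, 5, 10, 50, 100, 1000):
    print(f"n = {n:>4}  d = {extremizing_factor(n):.4f}")
```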

Method	Weighted	Brier score	Negative log score	Questions
Neyman aggregate (p=0.36)	Yes	0.106	0.340	899
Extremized mean of logodds (d=1.55)	Yes	0.111	0.350	899
Neyman aggregate (p=0.5)	Yes	0.111	0.351	899
Extremized mean of probabilities (d=1.60)	Yes	0.112	0.355	899
Metaculus prediction	Yes	0.111	0.361	774
Mean of logodds	Yes	0.116	0.370	899
Neyman aggregate (p=0.36)	No	0.120	0.377	899
Median	Yes	0.121	0.381	899
Extremized mean of logodds (d=1.50)	No	0.126	0.391	899
Mean of probabilities	Yes	0.122	0.392	899
Neyman aggregate (o=1.00)	No	0.126	0.393	899
Extremized mean of probabilities (d=1.60)	No	0.127	0.399	899
Mean of logodds	No	0.130	0.410	899
Median	No	0.134	0.418	899
Mean of probabilities	No	0.138	0.439	899
Baseline (p = 0.36)	N/A	0.230	0.652	899
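For readers unfamiliar with the two scoring columns, here is a minimal sketch of how a Brier score and a negative log score can be computed for binary questions. It assumes the usual definitions (mean squared error against the 0/1 outcome, and mean negative natural log of the probability assigned to the realized outcome). The data is synthetic and purely illustrative: a constant 36% forecast on a made-up set where exactly 36% of questions resolve positively.

```python
import math

def brier_score(probs, outcomes):
    """Mean squared difference between forecast and 0/1 outcome (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def negative_log_score(probs, outcomes):
    """Mean negative natural log of the probability given to what happened (lower is better)."""
    total = sum(math.log(p if o == 1 else 1.0 - p) for p, o in zip(probs, outcomes))
    return -total / len(probs)

# Synthetic data: 100 questions, 36 of which resolve positively.
outcomes = [1] * 36 + [0] * 64
constant_forecast = [0.36] * len(outcomes)

baseline_brier = brier_score(constant_forecast, outcomes)
baseline_log = negative_log_score(constant_forecast, outcomes)
print(f"Brier: {baseline_brier:.3f}, negative log: {baseline_log:.3f}")
```

On this synthetic set the constant forecast scores close to the baseline row of the table, which is what one would expect if that row is a constant forecast at the historical resolution rate.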