Simon_M

Bottlenecks to more impactful crowd forecasting

You might be interested in both the "Most Likes" and "h-Index" metrics on MetaculusExtras, which does have a visible upvote score. (Although I agree it would be nice to have it on Metaculus proper.)

A Quick Overview of Forecasting and Prediction Markets: Why they’re useful, why they aren’t, and what’s next

Some nitpicks:

Forecasts have been more accurate than random 94% of the time since 2015

This is a terrible metric since most people looking at most questions on Metaculus wouldn't think they are all 50/50 coin flips.

Augur’s solution to this issue is to provide predictors with a set of outcomes on which predictors stake their earnings on the true outcome. Presumably, the most staked-on outcome is what actually happened (such as Biden winning the popular vote being the true outcome). In turn, predictors are rewarded for staking on true outcomes.

This doesn't actually resolve the problem of having clearly defined questions; it resolves the problem of having disagreements over how questions resolve.

I think that these issues are not fundamentally solvable since a lot of them are based on basic issues surrounding financial markets. Specifically, I’ve seen the Keynesian Beauty Contest problem- users tend to always go with the community prediction in order to lower their personal risk. Since there is not a lot of financial incentive in the first place, it would not make a lot of sense for an individual to go out on a limb with their predictions, especially since so many of the existential questions resolve years into the future. There doesn’t seem to be a great way to get around that issue, aside from completely hiding the community prediction and rewarding those that predict without looking at it (Metaculus may already do this). But it would seem to be that the best thing we can do is maximize rewards for platforms like Metaculus and do the best optimization that we can with limited volume.

It's not clear to me which issues you think aren't fundamentally solvable?

It's not clear to me why you think that people predicting the community median is a Keynesian Beauty Contest? (It's not)

Forecasting without the community prediction *might* lead to better forecasts, but my expectation would be otherwise. The community forecast contains information which you might want to use in your forecast. (Also, if you don't believe the points are an incentive, then there's no incentive to forecast anything. If you do, Metaculus' points system should incentivise you to forecast your probability.)

I can imagine innovative political pollsters like 538 considering prediction markets in their algorithms, which may trickle down to more traditional outlets

- 538 aren't political pollsters
- Nate Silver is regularly disparaging of "Scottish Teens" (his term of art for Prediction Markets)

Reciprocal scoring is a method of forecasting in which a group of good forecasters predict the forecasts of other forecasters, which as it turns out makes for really accurate forecasts.

I don't think that paper says what you think it does, and I actually think Scott's original criticism of reciprocal scoring is still valid.

also be a quick way to get a lot more data on the effectiveness of reciprocal scoring.

I would agree - Metaculus seems to be the best place to get good forecasting data at the moment. I'd be keen to see them try that.

Forecasting seems to have great potential to be a really useful tool not just for alignment questions but for everything from stock markets to policy recommendations

The stock [financial] market is already the world's biggest prediction market. I'm not sure forecasting has much to add.

Specifically, I think that it would be helpful to look at new “meta-methods” like reciprocal scoring - in essence, predictions of predictions. Also, I think it would be helpful to track what actions predictors can take that have the most marginal impact, like seeing how often prediction updates are needed to maximize accuracy while minimizing predictor participation.

Strongly agree with this. I'm very interested in the general conversation at the moment for how to optimally aggregate forecasts (median vs mean-log-odds vs mean probability etc)
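For anyone who wants to see concretely how the three candidates differ, here is a minimal sketch. The function names and the example forecasts are made up for illustration; this is not Metaculus' aggregation code.

```python
import math
from statistics import mean, median


def pool_mean(probs):
    """Arithmetic mean of the probabilities."""
    return mean(probs)


def pool_median(probs):
    """Median of the probabilities."""
    return median(probs)


def pool_mean_log_odds(probs):
    """Mean of the log-odds, mapped back to a probability.

    Equivalent to taking the geometric mean of the odds.
    Requires every probability to be strictly between 0 and 1.
    """
    avg_log_odds = mean(math.log(p / (1 - p)) for p in probs)
    return 1 / (1 + math.exp(-avg_log_odds))


# Hypothetical forecasts on a single question.
forecasts = [0.5, 0.7, 0.9, 0.99]
for pool in (pool_mean, pool_median, pool_mean_log_odds):
    print(pool.__name__, round(pool(forecasts), 3))
```

The mean of log-odds gives more weight than the plain mean to forecasts close to 0 or 1, which is where the methods tend to disagree.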

When pooling forecasts, use the geometric mean of odds

Looking at the rolling performance of your method (optimize on last 100 and use that to predict), median and geo mean odds, I find they have been ~indistinguishable over the last ~200 questions. If I look at the exact numbers, extremized_last_100 does win marginally, but looking at that chart I'd have a hard time saying "there's a 70% chance it wins over the next 100 questions". If you're interested in betting at 70% odds I'd be interested.

There seems to be a long tradition of extremizing in the academic literature (see the reference in the post above). Though on the other hand empirical studies have been sparse, and eg Satopaa et al are cheating by choosing the extremization factor with the benefit of hindsight.

No offense, but the academic literature can do one.

In this case I didn't try too hard to find an extremization factor that would work, just two attempts. I didn't need to mine for a factor that would work. But obviously we cannot generalize from just one example.

Again, I don't find this very persuasive, given what I already knew about the history of Metaculus' underconfidence.

Extremizing has an intuitive meaning as accounting for the different pieces of information across experts that gives it weight (pun not intended). On the other hand, every extra parameter in the aggregation is a chance to shoot off our own foot.

I think extremizing might make sense if the other forecasts aren't public. (Since then the forecasts might be slightly more independent). When the other forecasts are public, I think extremizing makes less sense. This goes doubly so when the forecasts are coming from a betting market.

Intuitively it seems like the overall confidence of a community should be roughly continuous over time? So the level of underconfidence in recent questions should be a good indicator of its confidence for the next few questions.

I find this the most persuasive. I think it ultimately depends how you think people adjust for their past calibration. It's taken the community ~5 years to reduce its under-confidence, so maybe it'll take another 5 years. If people immediately update, I would expect this to be very unpredictable.

When pooling forecasts, use the geometric mean of odds

This has restored my faith on extremization

I think this is the wrong way to look at this.

Metaculus was way underconfident originally (prior to 2020, 22% using their metric). Recently it has been much better calibrated (2020 to now, 4% using their metric).

Of course if they are underconfident then extremizing will improve the forecast, but the question is what is most predictive going forward. Given that before 2020 they were 22% underconfident, more recently 4% underconfident, it seems foolhardy to expect them to be underconfident going forward.

I would NOT advocate extremizing the Metaculus community prediction going forward.

More than this, you will ALWAYS be able to find an extremize parameter which will improve the forecasts unless they are perfectly calibrated. This will give you better predictions *in hindsight* but not better predictions going forward. If you have a reason to expect forecasts to be underconfident, by all means extremize them, but I think that's a strong claim which requires strong evidence.
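To make the hindsight point concrete, here is a small sketch. It assumes one common form of extremizing (raising the odds to a power `d`), and the forecasts, outcomes and grid of factors are made up; none of this is taken from the post or from Metaculus data.

```python
import math


def extremize(p, d):
    """Raise the odds to the power d; d > 1 pushes p away from 0.5."""
    extremized_odds = (p / (1 - p)) ** d
    return extremized_odds / (1 + extremized_odds)


def mean_log_loss(probs, outcomes):
    """Mean negative log-likelihood of binary outcomes (lower is better)."""
    total = sum(math.log(p if o == 1 else 1 - p) for p, o in zip(probs, outcomes))
    return -total / len(probs)


def best_factor_in_hindsight(probs, outcomes, grid):
    """Pick the factor from `grid` with the lowest in-sample log loss."""
    return min(
        grid,
        key=lambda d: mean_log_loss([extremize(p, d) for p in probs], outcomes),
    )


# Made-up, underconfident history: 70% forecasts that came true 90% of the time.
past_probs = [0.7] * 10
past_outcomes = [1] * 9 + [0]
grid = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

d_hat = best_factor_in_hindsight(past_probs, past_outcomes, grid)
print("factor fitted in hindsight:", d_hat)
```

The fitted factor comes out above 1 only because the made-up history is underconfident. Whether the same factor helps on future questions depends entirely on whether the underconfidence persists, and the in-sample fit cannot tell you that.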

My current best guess on how to aggregate forecasts

It's not clear to me that "fitting a Beta distribution and using one of its statistics" is different from just taking the mean of the probabilities.

I fitted a beta distribution to Metaculus forecasts and looked at:

- Median forecast
- Mean forecast
- Mean log-odds / Geometric mean of odds
- Fitted beta median
- Fitted beta mean

Scattering these 5 values against each other I get:

We can see fitted values are closely aligned with the mean and mean-log-odds, but not with the median. (Unsurprising when you consider the ~parametric formula for the mean / median).
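As a rough illustration of why the fitted mean tracks the plain mean, here is a sketch that fits a beta distribution by the method of moments. That fitting method is an assumption made for this example (the fit used for the charts may have been done differently), the median uses the standard closed-form approximation that is only valid when both parameters exceed 1, and the forecasts are made up.

```python
from statistics import mean, pvariance


def fit_beta_moments(probs):
    """Method-of-moments beta fit (an assumed fitting method)."""
    m = mean(probs)
    v = pvariance(probs)
    if v <= 0:
        raise ValueError("forecasts must not all be identical")
    common = m * (1 - m) / v - 1
    if common <= 0:
        raise ValueError("variance too large for a beta fit")
    return m * common, (1 - m) * common


def beta_mean(a, b):
    """Exact mean of Beta(a, b)."""
    return a / (a + b)


def beta_median_approx(a, b):
    """Approximate median of Beta(a, b); only valid for a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("approximation requires a > 1 and b > 1")
    return (a - 1 / 3) / (a + b - 2 / 3)


# Hypothetical forecasts on one question.
forecasts = [0.6, 0.7, 0.8, 0.9]
a, b = fit_beta_moments(forecasts)
print("plain mean: ", mean(forecasts))
print("beta mean:  ", beta_mean(a, b))
print("beta median:", beta_median_approx(a, b))
```

Under a method-of-moments fit the beta mean equals the plain mean by construction, while the approximate median sits slightly further from 0.5 than the mean whenever the two parameters differ.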

The performance is as follows:

| method | brier | log_score | questions |
| --- | --- | --- | --- |
| geo_mean_odds_weighted | 0.116 | 0.37 | 856 |
| beta_median_weighted | 0.118 | 0.378 | 856 |
| median_weighted | 0.121 | 0.38 | 856 |
| mean_weighted | 0.122 | 0.391 | 856 |
| beta_mean_weighted | 0.123 | 0.396 | 856 |

My intuition for what is going on here is that the beta-median is an extremized form of the beta-mean / mean, which is an improvement.

Looking more recently (as the community became more calibrated), the beta-median's performance edge seems to have reduced:

| method | brier | log_score | questions |
| --- | --- | --- | --- |
| geo_mean_odds_weighted | 0.09 | 0.29 | 330 |
| median_weighted | 0.091 | 0.294 | 330 |
| beta_median_weighted | 0.091 | 0.297 | 330 |
| mean_weighted | 0.094 | 0.31 | 330 |
| beta_mean_weighted | 0.095 | 0.314 | 330 |
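For reference, the two columns can be computed along these lines. This sketch assumes `log_score` means the mean negative log-likelihood (lower is better), which matches the ordering in the tables but is not stated explicitly. It also leaves out whatever weighting the `_weighted` suffix refers to, and the data are made up.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between forecast and binary outcome (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def log_score(probs, outcomes):
    """Mean negative log-likelihood of the outcomes (assumed sign convention)."""
    total = sum(math.log(p if o == 1 else 1 - p) for p, o in zip(probs, outcomes))
    return -total / len(probs)


# Hypothetical aggregate forecasts and resolutions for four questions.
probs = [0.9, 0.2, 0.6, 0.05]
outcomes = [1, 0, 1, 0]
print("brier:    ", round(brier_score(probs, outcomes), 3))
print("log_score:", round(log_score(probs, outcomes), 3))
```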

How does forecast quantity impact forecast quality on Metaculus?

I investigated this, and it doesn’t look like there is much evidence for herding among Metaculus users to any noticeable extent, or if there is herding, it doesn’t seem to increase as the number of predictors rises.

1. People REALLY like predicting multiples of 5

2. People still like predicting the median after accounting for this (eg looking at questions where the median isn't a multiple of 5)

(Another way to see how much forecasters love those multiples of 5)

How does forecast quantity impact forecast quality on Metaculus?

If one had access to the individual predictions, one could also try to take 1000 random bootstrap samples of size 1 of all the predictions, then 1000 random bootstrap samples of size 2, and so on and measure how accuracy changes with larger random samples. This might also be possible with data from other prediction sites.

I discussed this with Charles. It's not possible to do *exactly* this with the API, but we can approximate this by looking at the final predictions just before close.

We can see that:

- Questions with more predictors have better brier scores (regardless of # of predictors sampled)
- Performance increases with # of predictors up to ~100 predictors

To account for the different brier scores based on groups of questions, I have normalized by subtracting off the performance of 8 predictors. This makes point 2 from above more clear to see.
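For reference, the resampling idea can be sketched roughly as follows. This assumes you have each question's individual predictions just before close and that the sampled predictions are aggregated with the median; both are assumptions for illustration, and the function name and data are made up.

```python
import random
from statistics import median


def mean_bootstrap_brier(questions, sample_size, n_samples=1000, seed=0):
    """Average Brier score of the median of `sample_size` resampled predictions.

    `questions` is a list of (predictions, outcome) pairs, where outcome is 0 or 1.
    Predictions are drawn with replacement (a bootstrap sample).
    """
    rng = random.Random(seed)
    total = 0.0
    for predictions, outcome in questions:
        for _ in range(n_samples):
            sample = rng.choices(predictions, k=sample_size)
            total += (median(sample) - outcome) ** 2
    return total / (len(questions) * n_samples)


# Two hypothetical questions with made-up predictions.
questions = [
    ([0.55, 0.6, 0.7, 0.8, 0.9, 0.65, 0.75, 0.85], 1),
    ([0.1, 0.2, 0.3, 0.15, 0.4, 0.25, 0.05, 0.35], 0),
]

sizes = [1, 2, 4, 8]
curve = {k: mean_bootstrap_brier(questions, k, n_samples=200) for k in sizes}
# Normalise by subtracting the score at 8 sampled predictors.
normalised = {k: score - curve[8] for k, score in curve.items()}
print(normalised)
```

Subtracting the score at 8 predictors puts groups of questions with different baseline difficulty on the same scale, so only the change with sample size remains.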

When discussing this with Charles he suggested that questions which are ~0 / 1 are more popular and therefore they look easier. Excluding them, those charts look as follows:

Amazingly this seems to be ~all of the effect making more popular questions "easier"!

(NB: there's only 22 questions with >= 256 predictors and 5% < p < 95% so the error bars on that cyan line should be quite wide)

| N predictors | Questions with >= N predictors | Questions with >= N predictors and 5% < p < 95% |
| --- | --- | --- |
| 8 | 852 | 673 |
| 16 | 843 | 665 |
| 32 | 786 | 613 |
| 64 | 537 | 393 |
| 128 | 196 | 116 |
| 256 | 43 | 22 |

When pooling forecasts, use the geometric mean of odds

I created a question series on Metaculus to see how big an effect this is and how the community might forecast this going forward.

When pooling forecasts, use the geometric mean of odds

If I was to summarise your post in another way, it would be this:

The biggest problem with pooling is that a point estimate isn't the end goal. In most applications you care about some transform of the estimate. In general, you're better off keeping all of the information (ie your new prior) rather than just a point estimate of said prior.

I disagree with you that the most natural prior is "mixture distribution over experts". (Although I wonder how much that actually ends up mattering in the real world).

I also think something "interesting" is being said here about the performance of estimates in the real world. If I had to say what the strong empirical performance of mean log-odds means, I would say that it means that "mixture distribution over experts" is not a great prior. But then, someone with my priors would say that...

My general take on this space is:

Re 1: the disconnect between decision makers and forecasting platforms. I think the problem comes in two directions.

Re 2: someone saying "X has a y% chance of happening" is not (usually) especially valuable to a decision maker. (Especially since the market is already accounting for what it expects the decision maker to do). Models (even fairly poor ones) often have more use to a decision maker, since they can see how their decision might affect the outcome. [Yes, there are ideas like counterfactual markets, but none of those ideas can really capture the full space of possibilities and will also just fragment liquidity]. The best you can really do is extract a model statistically (when indicator goes up, forecast goes down, so indicator might be saying something about event).

Re 3: It would take a while for me to summarise the evidence here, but I think there's a pretty strong case that central banks (eg the Federal Reserve in the US) are increasingly looking at market indicators when setting monetary policy. I think CEOs and other decision makers in business look at market prices as indicators when deciding the direction of their companies. (Although it's hard to fully describe this as a prediction market as much as "looking at the competition", I think with some time I could articulate what I mean.)