Simon_M


Comments

When pooling forecasts, use the geometric mean of odds

Looking at the rolling performance of your method (optimize on the last 100 questions and use that to predict), the median, and the geometric mean of odds, I find they have been ~indistinguishable over the last ~200 questions. If I look at the exact numbers, extremized_last_100 does win marginally, but looking at that chart I'd have a hard time saying "there's a 70% chance it wins over the next 100 questions". If you're interested in betting at 70% odds, I'd be interested in taking the other side.

> There seems to be a long tradition of extremizing in the academic literature (see the reference in the post above). Though on the other hand empirical studies have been sparse, and eg Satopaa et al are cheating by choosing the extremization factor with the benefit of hindsight.

No offense, but the academic literature can do one.

> In this case I didn't try too hard to find an extremization factor that would work, just two attempts. I didn't need to mine for a factor that would work. But obviously we cannot generalize from just one example.

Again, I don't find this very persuasive, given what I already knew about the history of Metaculus' underconfidence.

> Extremizing has an intuitive meaning as accounting for the different pieces of information across experts that gives it weight (pun not intended). On the other hand, every extra parameter in the aggregation is a chance to shoot off our own foot.

I think extremizing might make sense if the other forecasts aren't public (since then the forecasts might be slightly more independent). When the other forecasts are public, I think extremizing makes less sense, and doubly so when the forecasts are coming from a betting market.

> Intuitively it seems like the overall confidence of a community should be roughly continuous over time? So the level of underconfidence in recent questions should be a good indicator of its confidence for the next few questions.

I find this the most persuasive. I think it ultimately depends on how you think people adjust for their past calibration. It's taken the community ~5 years to reduce its under-confidence, so maybe it'll take another 5 years. If people updated immediately, I would expect this to be very unpredictable.

When pooling forecasts, use the geometric mean of odds

> This has restored my faith on extremization

I think this is the wrong way to look at this.

Metaculus was way underconfident originally (prior to 2020: 22%, using their metric). Recently it has been much better calibrated (2020 to now: 4%, using the same metric).

Of course, if they are underconfident then extremizing will improve the forecast, but the question is what is most predictive going forward. Given that they were 22% underconfident before 2020 and only 4% underconfident more recently, it seems foolhardy to expect them to remain underconfident going forward.

I would NOT advocate extremizing the Metaculus community prediction going forward.


More than this, you will ALWAYS be able to find an extremization parameter which improves the forecasts unless they are perfectly calibrated. This will give you better predictions in hindsight but not better predictions going forward. If you have a reason to expect forecasts to be underconfident, by all means extremize them, but I think that's a strong claim which requires strong evidence.
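
To make that concrete: extremizing by a factor d multiplies the log-odds by d (d > 1 pushes probabilities away from 50%), and you can always tune d on past questions to match or beat d = 1 in hindsight. A minimal sketch on simulated data (all numbers below are illustrative, not my actual analysis):

import numpy as np
from scipy.optimize import minimize_scalar

def extremize(p, d):
    # Multiply the log-odds by d; d > 1 pushes probabilities away from 0.5.
    log_odds = np.log(p / (1 - p))
    return 1 / (1 + np.exp(-d * log_odds))

def mean_log_score(p, outcomes):
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return -np.mean(outcomes * np.log(p) + (1 - outcomes) * np.log(1 - p))

# Simulated history: an underconfident pool of forecasts plus 0/1 resolutions.
rng = np.random.default_rng(0)
true_p = rng.uniform(0.05, 0.95, size=500)
outcomes = rng.binomial(1, true_p)
pooled = extremize(true_p, 0.7)  # d < 1 makes the pool underconfident

# Choose d by minimising the log score on the *same* questions (hindsight).
fit = minimize_scalar(lambda d: mean_log_score(extremize(pooled, d), outcomes),
                      bounds=(0.25, 4), method="bounded")
print(f"best d in hindsight: {fit.x:.2f}")
print(f"log score before: {mean_log_score(pooled, outcomes):.4f}, "
      f"after: {mean_log_score(extremize(pooled, fit.x), outcomes):.4f}")

Because d = 1 (no extremization) is always in the search range, the fitted factor can never do worse in hindsight; whether it helps out of sample is exactly the question above.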

My current best guess on how to aggregate forecasts

It's not clear to me that "fitting a Beta distribution and using one of its statistics" is different from just taking the mean of the probabilities.

I fitted a beta distribution to Metaculus forecasts and looked at:

  • Median forecast
  • Mean forecast
  • Mean log-odds / Geometric mean of odds
  • Fitted beta median
  • Fitted beta mean
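
Roughly, the beta-fitting looks like this (a sketch on made-up forecasts for a single question, fitting to a toy list of individual probabilities rather than the full community distribution):

import numpy as np
from scipy import stats

# Made-up individual forecasts for one binary question.
probs = np.array([0.55, 0.6, 0.7, 0.72, 0.8, 0.85, 0.9])

# Fit a Beta(a, b) on [0, 1] (location and scale fixed) by maximum likelihood.
a, b, _, _ = stats.beta.fit(probs, floc=0, fscale=1)

beta_mean = a / (a + b)
beta_median = stats.beta.median(a, b)

mean_prob = probs.mean()
odds = probs / (1 - probs)
geo_mean_odds = np.exp(np.log(odds).mean())
geo_mean_odds_prob = geo_mean_odds / (1 + geo_mean_odds)

print(beta_mean, beta_median, mean_prob, np.median(probs), geo_mean_odds_prob)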

Scattering these 5 values against each other I get:

We can see the fitted values are closely aligned with the mean and mean log-odds, but not with the median. (Unsurprising when you consider the parametric formulas for the beta mean / median.)

The performance is as follows:

                        brier   log_score   questions
geo_mean_odds_weighted  0.116   0.37        856
beta_median_weighted    0.118   0.378       856
median_weighted         0.121   0.38        856
mean_weighted           0.122   0.391       856
beta_mean_weighted      0.123   0.396       856

My intuition for what is going on here is that the beta-median is an extremized form of the beta-mean / mean, which is an improvement.
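
A quick way to see that: for a skewed beta, the median sits further from 0.5 than the mean does, so taking the fitted median mildly extremizes the fitted mean:

from scipy import stats

for a, b in [(8, 2), (2, 8)]:
    mean = a / (a + b)
    median = stats.beta.median(a, b)
    print(f"Beta({a},{b}): mean = {mean:.3f}, median = {median:.3f}")
# Beta(8,2): mean = 0.800, median ≈ 0.820  (further from 0.5)
# Beta(2,8): mean = 0.200, median ≈ 0.180  (further from 0.5)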

Looking more recently (as the community became more calibrated), the beta-median's performance edge seems to have reduced:

                        brier   log_score   questions
geo_mean_odds_weighted  0.09    0.29        330
median_weighted         0.091   0.294       330
beta_median_weighted    0.091   0.297       330
mean_weighted           0.094   0.31        330
beta_mean_weighted      0.095   0.314       330

How does forecast quantity impact forecast quality on Metaculus?

I investigated this, and it doesn't look like there is much evidence of herding among Metaculus users; or if there is herding, it doesn't seem to increase as the number of predictors rises.



1. People REALLY like predicting multiples of 5

2. People still like predicting the median after accounting for this (e.g. looking at questions where the median isn't a multiple of 5)

(Another way to see how much forecasters love those multiples of 5)
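
For anyone wanting to reproduce this, the check itself is simple once you have the individual forecasts for a question (the numbers below are made up; getting per-user forecasts is the hard part):

import numpy as np

def herding_fractions(preds_pct, median_pct):
    # preds_pct: individual forecasts on one question, in whole percentage points
    # (made-up input here; the public API summaries don't include these).
    # median_pct: the community median visible to forecasters, in %.
    preds = np.asarray(preds_pct)
    frac_round = np.mean(preds % 5 == 0)        # effect 1: multiples of 5
    frac_median = np.mean(preds == median_pct)  # effect 2: copying the median
    return frac_round, frac_median

# Effect 2 is cleanest on questions whose median is NOT a multiple of 5,
# so the two effects can't be conflated.
print(herding_fractions([35, 37, 37, 40, 42, 37, 50, 37, 60], median_pct=37))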

How does forecast quantity impact forecast quality on Metaculus?

> If one had access to the individual predictions, one could also try to take 1000 random bootstrap samples of size 1 of all the predictions, then 1000 random bootstrap samples of size 2, and so on and measure how accuracy changes with larger random samples. This might also be possible with data from other prediction sites.


I discussed this with Charles. It's not possible to do exactly this with the API, but we can approximate it by looking at the final predictions just before close.
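
Concretely, the approximation looks like this: take each question's final pre-close forecasts, resample k of them with replacement, pool each sample (here with the geometric mean of odds) and score it against the resolution. A sketch with made-up forecasts:

import numpy as np

def pooled_brier_vs_sample_size(final_preds, resolution, sizes=(1, 2, 4, 8, 16),
                                n_boot=1000, seed=0):
    # final_preds: each predictor's last forecast before the question closed
    # (made-up below, standing in for the per-user histories the API doesn't expose).
    rng = np.random.default_rng(seed)
    preds = np.asarray(final_preds)
    out = {}
    for k in sizes:
        samples = rng.choice(preds, size=(n_boot, k), replace=True)
        # Pool each bootstrap sample with the geometric mean of odds.
        odds = samples / (1 - samples)
        pooled_odds = np.exp(np.log(odds).mean(axis=1))
        pooled = pooled_odds / (1 + pooled_odds)
        out[k] = np.mean((pooled - resolution) ** 2)  # Brier score
    return out

# Made-up example: 30 final forecasts on a question that resolved YES.
rng = np.random.default_rng(1)
example_preds = np.clip(rng.normal(0.7, 0.15, 30), 0.01, 0.99)
print(pooled_brier_vs_sample_size(example_preds, resolution=1))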

Brier score from bootstrapped predictors

We can see that:

  1. Questions with more predictors have better brier scores (regardless of # of predictors sampled)
  2. Performance increases with # of predictors up to ~100 predictors


To account for the different Brier scores across groups of questions, I have normalized by subtracting off the performance with 8 sampled predictors. This makes point 2 above clearer to see.
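
The normalisation itself is just a per-group subtraction; in pandas terms (with a made-up scores table whose rows are question groups and whose columns are the number of sampled predictors):

import pandas as pd

# Made-up Brier scores: rows = question groups (by total predictor count),
# columns = number of predictors sampled in the bootstrap.
scores = pd.DataFrame(
    {8: [0.14, 0.12, 0.11], 32: [0.12, 0.10, 0.095], 128: [0.11, 0.095, 0.075]},
    index=[">=8", ">=64", ">=256"])

# Subtract each group's 8-predictor score, so groups are compared on the
# improvement from sampling more predictors rather than on raw difficulty.
normalised = scores.sub(scores[8], axis=0)
print(normalised)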

When discussing this with Charles, he suggested that questions which are at ~0 / 1 are more popular and therefore look easier. Excluding them, the charts look as follows:

Amazingly, this seems to account for ~all of the effect making more popular questions look "easier"!

(NB: there are only 22 questions with >= 256 predictors and 5% < p < 95%, so the error bars on that cyan line should be quite wide.)

N predictors   >= N predictors   >= N predictors | 5% < p < 95%
8              852               673
16             843               665
32             786               613
64             537               393
128            196               116
256            43                22

When pooling forecasts, use the geometric mean of odds

I created a question series on Metaculus to see how big an effect this is and how the community might forecast this going forward.

When pooling forecasts, use the geometric mean of odds

If I were to summarise your post another way, it would be this:

The biggest problem with pooling is that a point estimate isn't the end goal. In most applications you care about some transform of the estimate. In general, you're better off keeping all of the information (i.e. your new prior) rather than just a point estimate of said prior.


I disagree with you that the most natural prior is "mixture distribution over experts". (Although I wonder how much that actually ends up mattering in the real world).


I also think something "interesting" is being said here about the performance of estimates in the real world. If I had to explain why mean log-odds does well empirically, I would say it means that "mixture distribution over experts" is not a great prior. But then, someone with my priors would say that...
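
A toy example of where the two pooling rules part ways: with experts at 0.9 and 0.99, the mixture (arithmetic mean of probabilities) stays close to the more cautious expert in log-odds terms, while the geometric mean of odds lands at the log-odds midpoint:

import numpy as np

def pool_mixture(ps):
    # "Mixture distribution over experts" point estimate: arithmetic mean of probabilities.
    return float(np.mean(ps))

def pool_geo_mean_odds(ps):
    ps = np.asarray(ps)
    odds = ps / (1 - ps)
    g = np.exp(np.log(odds).mean())
    return g / (1 + g)

experts = [0.9, 0.99]
print(pool_mixture(experts))        # 0.945  -> odds ≈ 17, pulled toward the cautious expert
print(pool_geo_mean_odds(experts))  # ≈0.968 -> odds ≈ 30, the midpoint in log-odds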

When pooling forecasts, use the geometric mean of odds

import requests, json
import numpy as np
import pandas as pd


def fetch_results_data():
    # Page through every resolved question on the Metaculus API.
    response = {"next": "https://www.metaculus.com/api2/questions/?limit=100&status=resolved"}

    results = []
    while response["next"] is not None:
        print(response["next"])
        response = json.loads(requests.get(response["next"]).text)
        results.append(response["results"])
    return sum(results, [])


def weighted_quantile(values, q, sample_weight):
    # Weighted quantile helper (a minimal stand-in): interpolate the weighted CDF at quantile q.
    values = np.asarray(values, dtype=float)
    sample_weight = np.asarray(sample_weight, dtype=float)
    order = np.argsort(values)
    values, sample_weight = values[order], sample_weight[order]
    cdf = np.cumsum(sample_weight) - 0.5 * sample_weight
    cdf /= np.sum(sample_weight)
    return np.interp(q, cdf, values)


all_results = fetch_results_data()
binary_qns = [q for q in all_results if q['possibilities']['type'] == 'binary' and q['resolution'] in [0, 1]]
binary_qns.sort(key=lambda q: q['resolve_time'])


def get_estimates(ys):
    # ys is the community prediction density over the probabilities 0.01..0.99;
    # pool it into the various point estimates being compared.
    xs = np.linspace(0.01, 0.99, 99)
    odds = xs / (1 - xs)
    mean = np.sum(xs * ys) / np.sum(ys)
    geo_mean = np.exp(np.sum(np.log(xs) * ys) / np.sum(ys))
    geo_mean_odds = np.exp(np.sum(np.log(odds) * ys) / np.sum(ys))
    geo_mean_odds_p = geo_mean_odds / (1 + geo_mean_odds)
    # Geometric mean of odds extremized with a factor of 2.5.
    extremized_odds = np.exp(np.sum(2.5 * np.log(odds) * ys) / np.sum(ys))
    extr_geo_mean_odds = extremized_odds / (1 + extremized_odds)
    median = weighted_quantile(xs, 0.5, sample_weight=ys)
    return ([mean, geo_mean, median, geo_mean_odds_p, extr_geo_mean_odds],
            ["mean", "geo_mean", "median", "geo_mean_odds", "extr_geo_mean_odds_2.5"])


def brier(p, r):
    return (p - r) ** 2

def log_s(p, r):
    return -(r * np.log(p) + (1 - r) * np.log(1 - p))


X = []
for q in binary_qns:
    weighted = q['community_prediction']['full']['y']
    unweighted = q['community_prediction']['unweighted']['y']
    t = [q['resolution'], q["community_prediction"]["history"][-1]["nu"]]
    all_names = ['resolution', 'users']
    for (e, ys) in [('_weighted', weighted), ('_unweighted', unweighted)]:
        s, names = get_estimates(np.array(ys))
        all_names += [n + e for n in names]
        t += s
    t += [q["metaculus_prediction"]["full"]]
    all_names.append("metaculus_prediction")
    X.append(t)
df = pd.DataFrame(X, columns=all_names)

# Score only the forecast columns against the resolutions.
df_v = df.drop(columns=['resolution', 'users'])

pd.concat([df_v.apply(lambda x: brier(x, df["resolution"]), axis=0).mean().to_frame("brier"),
           df_v.apply(lambda x: log_s(x, df["resolution"]), axis=0).mean().to_frame("-log"),
           df_v.count().to_frame("questions"),
          ], axis=1).sort_values('-log').round(3)

When pooling forecasts, use the geometric mean of odds

"more questions resolve positively than users expect"

Users expect 50 to resolve positively, but actually 60 resolve positive.

"users expect more questions to resolve positive than actually resolve positive"

Users expect 50 to resolve positive, but actually 40 resolve positive.

I have now edited the original comment; hopefully it's clearer.

When pooling forecasts, use the geometric mean of odds

> but also the average predictor improving their ability also fixed that underconfidence
>
> What do you mean by this?

I mean that in the past people were underconfident (so extremizing would make their predictions better). Since then they've stopped being underconfident. My assumption is that this is because the average predictor is now more skilled, or because more predictors improve the quality of the average.

> Doesn't that mean that it should be less accurate, given the bias towards questions resolving positively?

The bias isn't that more questions resolve positively than users expect. The bias is that users expect more questions to resolve positive than actually resolve positive. Shifting probabilities lower fixes this.

Basically lots of questions on Metaculus are "Will X happen?" where X is some interesting event people are talking about, but the base rate is perhaps low. People tend to overestimate the probability of X relative to what actually occurs.
