Peter Mühlbacher

Research Scientist @ FutureSearch
180 karma · Joined Nov 2022


[Disclaimer: I'm working for FutureSearch]

On some readings of your post, “forecasting” becomes very broad and simply encompasses all of research.

To add another perspective: reasoning helps when aggregating forecasts. Consider one of the motivating examples for extremising, where, IIRC, a US president is handed several (well-calibrated, say) estimates of around 70% for P(head of some terrorist organisation is in location X). If these estimates came from different, independent sources, the aggregate ought to be higher than 70%, whereas if they are all based on the same few sources, 70% may be one's best guess.

This is also something that a lot of forecasters may just do subconsciously when considering different points of view (which may be something as simple as different base rates or something as complicated as different AGI arrival models).

So from an engineering perspective there is a lot of value in providing rationales, even if they don't show up in the final forecasts.
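The extremising step described above can be sketched in a few lines. This is a minimal illustration, not anyone's production aggregator: the function name and the extremising factor d = 2 are my own choices, and the right d in practice depends on how correlated the forecasters' information actually is.

```python
import math

def extremize(probs, d=2.0):
    """Aggregate probability estimates by averaging their log-odds and
    multiplying by an extremizing factor d.

    d = 1 recovers the plain log-odds average; d > 1 pushes the
    aggregate away from 0.5, which is appropriate when the individual
    estimates rest on independent evidence."""
    logodds = [math.log(p / (1 - p)) for p in probs]
    mean_lo = sum(logodds) / len(logodds)
    return 1 / (1 + math.exp(-d * mean_lo))

# Three well-calibrated 70% estimates from independent sources:
independent = extremize([0.7, 0.7, 0.7], d=2.0)  # ≈ 0.845, above 70%
# The same three numbers, but all derived from the same few sources,
# warrant no extremizing:
shared = extremize([0.7, 0.7, 0.7], d=1.0)       # stays at 0.70
```

This is exactly why the rationale matters to the aggregator even when it never appears in the final number: it tells you whether d should be above 1.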

Disclaimer: I work for Metaculus.

You can now forecast on how much AI benchmark progress will continue to be underestimated by the Metaculus Community Prediction (CP) on this Metaculus question! Thanks @Javier Prieto for prompting us to think more about this and inspiring this question!

Predict a distribution with a mean of

  • ≈0.5, if you expect the CP to be decently calibrated or just aren't sure about the direction of bias,
  • >0.5, if you think the CP will continue to underestimate AI benchmark progress, 
  • <0.5, if you think the CP will overestimate AI benchmark progress, e.g. by overreacting to this post.

Here is a Colab Notebook to get you started with some simulations.
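Independently of that notebook, here is a toy simulation of the quantity at stake. It assumes (my reading, not an official spec of the question) that the meta-question scores the CP's CDF evaluated at each resolved value, so that a mean of 0.5 indicates calibration and a mean above 0.5 indicates systematic underestimation; the distributions used are purely illustrative.

```python
import random
from statistics import NormalDist

random.seed(0)
cp = NormalDist(0, 1)  # stand-in for the community's predictive distribution

def mean_pit(n_questions, bias=0.0):
    """Draw `n_questions` outcomes from Normal(bias, 1) and return the
    average of the CP's CDF evaluated at each resolved value.
    0.5 = calibrated; > 0.5 = the CP underestimated the outcomes."""
    return sum(cp.cdf(random.gauss(bias, 1)) for _ in range(n_questions)) / n_questions

calibrated = mean_pit(100_000)               # close to 0.5
underestimating = mean_pit(100_000, bias=0.5)  # clearly above 0.5
```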

And don't forget to update your forecasts on the underlying AI benchmark progress questions if the CP on this one has a mean far from 0.5!

Disclaimer: I work for Metaculus.

Thanks for carefully looking into this @Javier Prieto, this looks very interesting! I'm particularly intrigued by the identification of different biases for different categories, and I wondered how much weight you'd put on this being a statistical artefact vs a real, persistent bias that you would continue to worry about. Concretely: if we waited until, say, a comparable number of new AI benchmark progress questions resolved, what would your P(Metaculus is underconfident on AI benchmark progress again) be, looking only at the new questions?


Some minor comments:

> About 70% of the predictions at question close had a positive log score, i.e. they were better than predicting a maximally uncertain uniform distribution over the relevant range (chance level).

I think the author knows what's going on here, but it may invite misunderstanding. This notion of "being better than predicting a […] uniform distribution" implies that a perfect forecast on the sum of two independent dice is "better than predicting a uniform distribution" only 2 out of 3 times, i.e. less than 70% of the time! (The probabilities for D_1 + D_2 = 2, 3, 4, 10, 11, or 12 are all smaller than 1/11, the uniform probability over the 11 possible outcomes.)
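The dice claim is easy to verify exactly with a few lines of Python:

```python
from fractions import Fraction

# Exact distribution of the sum of two fair dice
p = {s: Fraction(0) for s in range(2, 13)}
for d1 in range(1, 7):
    for d2 in range(1, 7):
        p[d1 + d2] += Fraction(1, 36)

uniform = Fraction(1, 11)  # uniform baseline over the 11 outcomes 2..12

# Chance that even a *perfect* forecast gets a positive log score
# relative to the uniform baseline, i.e. the total probability mass
# on outcomes s with p[s] > 1/11:
beats_uniform = sum(prob for prob in p.values() if prob > uniform)
print(beats_uniform)  # 2/3
```

Only the outcomes 5 through 9 carry more mass than 1/11, and together they account for 24/36 = 2/3 of the probability.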

> The average log score at question close was 0.701 (Median: 0.868, IQR: [-0.165, 1.502][7]) compared to an average of 2.17 for all resolved continuous questions on Metaculus.

Given that quite a lot of these AI questions closed over a year before resolution, which is rather atypical for Metaculus, comparing log scores at question close seems a bit unfair. I think time-averaged scores would be more informative. (I reckon they'd produce a quantitatively different, albeit qualitatively similar picture.)
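To make "time-averaged" concrete, here is a minimal sketch of the idea. This is not Metaculus's actual scoring code; the data structure (a list of (timestamp, log score) snapshots, each held constant until the next) and all numbers are illustrative assumptions.

```python
def time_averaged_log_score(snapshots, close_time):
    """Average a forecast's log score over the question's open period.

    `snapshots` is a list of (timestamp, log_score) pairs recording the
    score of the standing forecast whenever it changed; each score is
    held constant until the next snapshot (or until `close_time`)."""
    total, t_open = 0.0, snapshots[0][0]
    for (t, score), (t_next, _) in zip(snapshots, snapshots[1:] + [(close_time, None)]):
        total += score * (t_next - t)
    return total / (close_time - t_open)

# A forecast that was poor for 90 of 100 days but sharp at close looks
# great "at question close" yet mediocre when time-averaged:
snaps = [(0, -0.5), (90, 2.0)]            # (day, log score), illustrative
time_averaged_log_score(snaps, 100)       # (-0.5*90 + 2.0*10) / 100 = -0.25
```

With long gaps between close and resolution, the close-time snapshot rewards whoever happened to be right at one instant; the time average rewards being right throughout.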

This also goes back to "Metaculus narrowly beats chance": we argued here why we believe this isn't as narrow as others made it out to be (for reasonable definitions of "narrow").

You may want to have a look at our API!

As for the code, I wrote it in Julia as part of a much bigger, ongoing project, so it's a bit of a mess, i.e. there's lots of code that isn't relevant to this particular analysis. If you're interested, I could either send it to you directly or make it public after cleaning it up a little.

  • For continuous questions we have that the Community Prediction (CP) is roughly on par with the Metaculus Prediction (MP) in terms of CRPS, but the CP fares better than the MP in terms of (continuous) log score. Unfortunately, the "higher = better" convention is used for the log score on the track record page; note that this is not the case for Brier or CRPS, where lower = better. 
    This difference is primarily due to the log score punishing overconfidence more harshly than the CRPS: the CRPS (as computed here) is bounded between 0 and 1, while the (continuous) log score is bounded "in the good direction" but can be arbitrarily bad. And, indeed, looking at the worst-performing continuous AI questions shows that the CP was overconfident, which is further exacerbated by extremising. This hardly matters if the CRPS is already pretty bad, but it can matter a lot for the log score. 
    This is not just anecdotal evidence, you can check this yourself on our track record page, filtering for (continuous) AI questions, and checking the "surprisal function" in the "continuous calibration" tab.
  • > Do you have any thoughts on whether CRPS is preferable to log score?
    Too many! Most are really about the tradeoffs between local and non-local scoring rules for continuous questions. (Very brief summary: There are so many tradeoffs! Local scoring rules like the log-score have more appealing theoretical properties, but in practice they likely add noise. Personally, I also find local scoring rules more intuitive, but most forecasters seem to disagree.)
  • I see where you're coming from with the "true probability" issue. To be honest I don't think there is a significant disagreement here. I agree it's a somewhat silly term—that's why I kept wrapping it in scare quotes—but I think (/hoped) it should be clear from context what is meant by it. (I'm pretty sure you got it, so yay!)
    Overall, I still prefer to use "true probability" over "resilient probability" because a) I had to look it up, so I assume a few other readers would have to do the same and b) this just opens another can of worms: Now we have to specify what information can be reasonably obtained ("what about the exact initial conditions of the coin flip?", etc.) in order to avoid a vacuous definition that renders everything bar 0 and 1 "not resilient". 
    I'm open to changing my mind though, especially if lots of people interpret this the wrong way.
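The "log score punishes overconfidence more harshly than CRPS" point from above can be demonstrated numerically. Caveat: this uses the standard closed-form CRPS for a normal forecast, not Metaculus's normalized variant mentioned earlier, and all forecast parameters are illustrative; the qualitative point survives either way, since as a normal forecast narrows while missing the outcome, its CRPS approaches the fixed miss distance |x − μ| while its log score worsens quadratically in (x − μ)/σ.

```python
import math
from statistics import NormalDist

std = NormalDist(0, 1)

def normal_log_score(mu, sigma, x):
    # Continuous log score: log-density of the forecast at the outcome
    # (computed analytically to avoid floating-point underflow).
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))

def normal_crps(mu, sigma, x):
    # Standard closed-form CRPS of a Normal(mu, sigma) forecast at x.
    z = (x - mu) / sigma
    return sigma * (z * (2 * std.cdf(z) - 1) + 2 * std.pdf(z) - 1 / math.sqrt(math.pi))

outcome = 3.0
for sigma in (1.0, 0.1, 0.01):  # increasingly overconfident forecasts centred at 0
    print(sigma, normal_crps(0.0, sigma, outcome), normal_log_score(0.0, sigma, outcome))
```

Running this, the CRPS stays in the neighbourhood of the miss distance 3 for all three forecasts, while the log score drops from roughly −5 to astronomically negative values as the forecast narrows.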