Javier Prieto

How accurate are Open Phil's predictions?

We do track whether predictions have a positive ("good thing will happen") or negative ("bad thing will happen") framing, so testing for optimism/pessimism bias is definitely possible. However, only 2% of predictions have a negative framing, so our sample size is too low to say anything conclusive about this yet.

Enriching our database with base rates and categories would be fantastic, but my hunch is that, given the nature and phrasing of our questions, this would be impossible to do at scale. I'm much more bullish on per-predictor analyses, and that's more or less what we're doing with the individual dashboards.

How accurate are Open Phil's predictions?

Very good point!

I see a few ways of assessing "global overconfidence":

  1. Lump all predictions into two bins (under and over 50%) and check that the lower point is above the diagonal and the upper one below it. I just did this, and the points land where you'd expect if we were overconfident, but the 90% credible intervals still overlap with the diagonal, so pooling the bins this way provides only weak evidence of overconfidence.
  2. Calculate the OC score as defined by Metaculus (scroll down to the bottom of the page and click the (+) sign next to Details). A score between 0 and 1 indicates overconfidence. Open Phil's score is 0.175, which is evidence that we're overconfident, but I don't know how to put a meaningful confidence/credible interval on that number, so it's hard to say how strong this evidence is.
  3. Run a linear regression on the calibration curve and check that the slope is <1. When I do this for the original curve with 10 points, statsmodels' OLS method spits out [0.772, 0.996] as a 95% confidence interval for the slope. Since the whole interval lies below 1, I see this as stronger evidence of overconfidence than the previous two.
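The regression in point 3 can be sketched as follows. The calibration data below is made up for illustration (the real curve comes from our prediction database), and the OLS fit is done by hand with numpy/scipy rather than statsmodels so the sketch stays self-contained:

```python
import numpy as np
from scipy import stats

# Hypothetical calibration curve: bin midpoints (stated probability)
# vs. observed frequency of the predicted event in each bin.
predicted = np.arange(0.05, 1.0, 0.1)  # 10 bin midpoints: 0.05, 0.15, ..., 0.95
observed = np.array([0.09, 0.17, 0.22, 0.33, 0.41,
                     0.52, 0.58, 0.66, 0.78, 0.85])

# Ordinary least squares: observed = a + b * predicted
X = np.column_stack([np.ones_like(predicted), predicted])
beta, *_ = np.linalg.lstsq(X, observed, rcond=None)
resid = observed - X @ beta
dof = len(observed) - 2
s2 = resid @ resid / dof                       # residual variance
cov = s2 * np.linalg.inv(X.T @ X)              # covariance of [intercept, slope]
se_slope = np.sqrt(cov[1, 1])

t_crit = stats.t.ppf(0.975, dof)               # two-sided 95% critical value
lo, hi = beta[1] - t_crit * se_slope, beta[1] + t_crit * se_slope
print(f"slope = {beta[1]:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
# For this toy data the slope is ~0.86 and the whole CI sits below 1:
# high predictions resolve true less often than stated, low ones more often,
# which is the overconfidence signature.
```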
How accurate are Open Phil's predictions?


  1. We're currently providing calibration and accuracy stats to our grant investigators through our Salesforce app in the hopes that they'll find that feedback useful and actionable.
  2. I'm not sure and I'd have to defer to decision-makers at OP. My model of them is that predictions are just one piece of evidence they look at.
How accurate are Open Phil's predictions?

Interesting, thanks for sharing that trick!

Our forecasting questions are indeed maximally uncertain in some absolute sense because our base rate is ~50%, but it may also be the case that they're particularly uncertain to the person making the prediction as you suggest.
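One way to cash out "maximally uncertain in some absolute sense": if outcomes occur with base rate p and you always forecast p, your expected Brier score is p(1−p), which peaks at p = 0.5. A quick check (the grid of base rates is an arbitrary choice):

```python
import numpy as np

# Expected Brier score of always forecasting the base rate p:
# E[(p - outcome)^2] = p*(1-p)^2 + (1-p)*p^2 = p*(1-p)
p = np.linspace(0.01, 0.99, 99)          # grid of base rates
expected_brier = p * (1 - p)

print(p[np.argmax(expected_brier)])      # ~0.5, the worst-case base rate
print(expected_brier.max())              # ~0.25
```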

How accurate are Open Phil's predictions?

> A related issue is that people may be more comfortable making predictions about less important aspects of the project, since the consequences of being wrong are lower.

I'm actually concerned about the same thing, but for exactly the opposite reason: because the consequences of being wrong (a hit to one's Brier score) are the same regardless of a prediction's importance, people might allocate the same time and effort to every prediction, including the more important ones that perhaps warrant closer examination.

We're currently trialing some of the stuff you suggest about bringing in other people to suggest predictions. This might be an improvement, but it's too early to say, and scaling it up wouldn't be easy for a few reasons:

  1. It's hard to make good predictions about a grant without lots of context.
  2. Grant investigators are very time-constrained, so they can't afford to provide that context by having a lot of back and forth with the person suggesting the predictions.
  3. Most of the information needed to gain context about the grant is confidential by default.
Data on forecasting accuracy across different time horizons and levels of forecaster experience

Thanks for doing these analyses!

I recently had to dive into the Metaculus data for a report I'm writing and produced the following plot along the way. It didn't make it into the final report, but I felt it was worth sharing anyway.

Each dot is the Brier score of a community prediction on a non-ambiguously resolved question, plotted against time horizon (i.e., time remaining until resolution when the prediction was made). There are up to 101 predictions per question, for the reasons you describe in the post. The red line is a moving average and the shaded area is a (t-distributed) 95% confidence interval around the mean.
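For reference, the plot's two ingredients can be sketched like this. The function names and the window size are my own choices for illustration, not what the actual analysis used:

```python
import numpy as np
from scipy import stats

def brier(prob, outcome):
    """Brier score for a binary forecast: squared error of the stated probability."""
    return (np.asarray(prob) - np.asarray(outcome)) ** 2

def moving_avg_with_ci(x, y, window=200, conf=0.95):
    """Moving average of y after sorting by x (e.g. Brier scores by time
    horizon), with a t-distributed confidence interval around each
    window's mean. The window size is an arbitrary smoothing choice."""
    order = np.argsort(x)
    y_sorted = np.asarray(y)[order]
    means, half_widths = [], []
    for i in range(len(y_sorted) - window + 1):
        w = y_sorted[i:i + window]
        m = w.mean()
        # half-width of the two-sided CI around the window mean
        h = stats.sem(w) * stats.t.ppf((1 + conf) / 2, len(w) - 1)
        means.append(m)
        half_widths.append(h)
    means = np.array(means)
    half = np.array(half_widths)
    return means, means - half, means + half  # center, lower, upper
```

The red line would be the first return value plotted against the sorted horizons, and the shaded band the lower/upper arrays.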