Javier Prieto

Program Assistant, AI Governance & Policy @ Open Philanthropy
Working (0-5 years experience)
189 · Joined Nov 2021


Your likelihood_pool method is returning Brier scores >1. How is that possible? Also, unless you extremize, it should yield the same aggregates (and scores) as regular geometric mean of odds, no?
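For readers following along, here is a minimal sketch of the two quantities this question turns on. It is not the `likelihood_pool` method being discussed; the function names, the extremizing exponent `d`, and the forecasts are all hypothetical. It shows that pooling with `d = 1` is just the geometric mean of odds, and that the single-probability Brier score cannot exceed 1 while the sum-over-both-outcomes convention can reach 2. Whether that convention explains the scores in question is an open guess.

```python
import math

def pool_geo_mean_odds(probs, d=1.0):
    """Pool probabilities via the geometric mean of their odds.

    d is a hypothetical extremizing exponent applied to the pooled odds;
    d = 1.0 means no extremizing.
    """
    log_odds = [math.log(p / (1 - p)) for p in probs]
    pooled_log_odds = d * sum(log_odds) / len(log_odds)
    return 1 / (1 + math.exp(-pooled_log_odds))

def brier_single(p, outcome):
    """Single-probability convention: bounded in [0, 1]."""
    return (p - outcome) ** 2

def brier_two_class(p, outcome):
    """Sum over both outcomes of a binary question: bounded in [0, 2]."""
    return (p - outcome) ** 2 + ((1 - p) - (1 - outcome)) ** 2

forecasts = [0.6, 0.7, 0.9]  # made-up forecasts for one binary question
pooled = pool_geo_mean_odds(forecasts)
print(pooled, brier_single(pooled, 1), brier_two_class(pooled, 1))
```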

Thanks for posting this! I think this topic is extremely neglected and the lack of side effects among natural short-sleepers strongly suggests that there could be interventions with no obvious downsides.

My main concern with your drug-centered approach is: what if the causal path from short-sleeper genes to a short-sleeper phenotype flows through neurodevelopmental pathways, such that once neural structures are locked in by adulthood it's no longer possible to induce the desired phenotype by mimicking the direct effects of the genes? If this is true, then reaping the benefits of short-sleeper genes would seem to require genetic engineering (I doubt embryo selection would scale given the low frequency of the target alleles). This would obviously be politically problematic, and I'm not sure it'd be technically feasible right away (last time I checked, CRISPR people were worried about off-target mutations, but I'm not up to date with that literature so this may not be an issue anymore).

Have you considered holding out some languages at random to assess the impact of the program? You could e.g. delay funding for some languages by 1-2 years and try to estimate the difference in some relevant outcome during that period. I understand this may be hard or undesirable for several reasons (finding and measuring the right outcomes, opportunity costs, managing grantee expectations).

We do track whether predictions have a positive ("good thing will happen") or negative ("bad thing will happen") framing, so testing for optimism/pessimism bias is definitely possible. However, only 2% of predictions have a negative framing, so our sample size is too low to say anything conclusive about this yet.

Enriching our database with base rates and categories would be fantastic, but my hunch is that given the nature and phrasing of our questions this would be impossible to do at scale. I'm much more bullish on per-predictor analyses and that's more or less what we're doing with the individual dashboards.

Very good point!

I see a few ways of assessing "global overconfidence":

  1. Lump all predictions into two bins (under and over 50%) and check that the lower point is above the diagonal and the upper one is below it. I just did this and the points are where you'd expect if we were overconfident, but the 90% credible intervals still overlap with the diagonal, so pooling all the bins in this way provides only weak evidence of overconfidence.
  2. Calculate the OC score as defined by Metaculus (scroll down to the bottom of the page and click the (+) sign next to Details). A score between 0 and 1 indicates overconfidence. Open Phil's score is 0.175, so this is evidence that we're overconfident. I don't know how to put a meaningful confidence/credible interval on that number, so it's hard to say how strong this evidence is.
  3. Run a linear regression on the calibration curve and check that the slope is <1. When I do this for the original curve with 10 points, the statsmodels OLS method spits out [0.772, 0.996] as a 95% confidence interval for the slope. I see this as stronger evidence of overconfidence than the previous two.
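To make the third approach concrete, here is a minimal sketch of a slope test on a 10-point calibration curve. It uses a hand-rolled unweighted least-squares fit, not statsmodels, so that it runs with the standard library alone. The observed frequencies are invented for illustration and are not Open Phil's data. The critical value 2.306 is the two-sided 95% t quantile for 8 degrees of freedom (10 points minus 2 fitted parameters).

```python
import math

def ols_fit(x, y):
    """Unweighted least-squares line; returns (slope, intercept, slope_se)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance with n - 2 degrees of freedom
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    slope_se = math.sqrt(rss / (n - 2) / sxx)
    return slope, intercept, slope_se

# Bin midpoints (stated probability) and observed frequencies; all made up.
stated = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
observed = [0.12, 0.18, 0.30, 0.36, 0.47, 0.52, 0.60, 0.70, 0.78, 0.86]

T_CRIT_DF8 = 2.306  # two-sided 95% t quantile for 8 degrees of freedom
slope, intercept, se = ols_fit(stated, observed)
ci = (slope - T_CRIT_DF8 * se, slope + T_CRIT_DF8 * se)
print(f"slope={slope:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
# An interval lying entirely below 1 points towards overconfidence.
```

Note that an unweighted fit treats every bin equally, even though bins with few predictions are noisier. Weighting by bin size would be a natural refinement.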


  1. We're currently providing calibration and accuracy stats to our grant investigators through our Salesforce app in the hopes that they'll find that feedback useful and actionable.
  2. I'm not sure and I'd have to defer to decision-makers at OP. My model of them is that predictions are just one piece of evidence they look at.

Interesting, thanks for sharing that trick!

Our forecasting questions are indeed maximally uncertain in some absolute sense because our base rate is ~50%, but it may also be the case that they're particularly uncertain to the person making the prediction as you suggest.

"A related issue is that people may be more comfortable making predictions about less important aspects of the project, since the consequences of being wrong are lower."

I'm actually concerned about the same thing but for exactly the opposite reason: because the consequences of being wrong (a hit to one's Brier score) are the same regardless of the importance of the prediction, people might allocate the same time and effort to every prediction, including the more important ones that perhaps warrant closer examination.

We're currently trialing some of what you suggest, namely bringing in other people to propose predictions. This might be an improvement, but it's too early to say, and scaling it up wouldn't be easy for a few reasons:

  1. It's hard to make good predictions about a grant without lots of context.
  2. Grant investigators are very time-constrained, so they can't afford to provide that context by having a lot of back and forth with the person suggesting the predictions.
  3. Most of the information needed to gain context about the grant is confidential by default.

Thanks for doing these analyses!

I recently had to dive into the Metaculus data for a report I'm writing and I produced the following plot along the way. I'm posting it here because it didn't make it into the final report, but I felt it was worth sharing anyway.

Each dot is the Brier score of one community prediction on a non-ambiguously resolved question, plotted against its time horizon (i.e. the time remaining until resolution when the prediction was made). There are up to 101 predictions per question for the reasons you describe in the post. The red line is a moving average and the shaded area is a (t-distributed) 95% confidence interval around the mean.
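For anyone who wants to reproduce a curve of this kind, here is a rough sketch of a moving average with a t-based interval. It is not the code behind the plot. The window definition (a fixed number of consecutive points once sorted by horizon), the function name, and the data are all assumptions, and the caller supplies the t critical value `t_crit` for `window - 1` degrees of freedom.

```python
import math

def moving_mean_ci(points, window, t_crit):
    """Rolling mean of Brier scores over points sorted by time horizon.

    points: list of (horizon, brier_score) pairs
    window: number of consecutive points per window (must be >= 2)
    t_crit: t quantile for window - 1 degrees of freedom, chosen by the caller
    Returns a list of (mean_horizon, mean_score, lower, upper) tuples.
    """
    ordered = sorted(points)
    rows = []
    for start in range(len(ordered) - window + 1):
        chunk = ordered[start:start + window]
        scores = [score for _, score in chunk]
        mean = sum(scores) / window
        variance = sum((s - mean) ** 2 for s in scores) / (window - 1)
        half_width = t_crit * math.sqrt(variance / window)
        mean_horizon = sum(h for h, _ in chunk) / window
        rows.append((mean_horizon, mean, mean - half_width, mean + half_width))
    return rows

# Made-up (days until resolution, Brier score) pairs.
data = [(5, 0.02), (30, 0.10), (12, 0.04), (90, 0.21), (60, 0.16), (200, 0.26)]
# 4.303 is the two-sided 95% t quantile for 2 degrees of freedom (window of 3).
for row in moving_mean_ci(data, window=3, t_crit=4.303):
    print(row)
```

Because each question contributes many predictions, the points are not independent, so an interval computed this way is probably narrower than it should be.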