Your likelihood_pool method is returning Brier scores >1. How is that possible? Also, unless you extremize, it should yield the same aggregates (and scores) as regular geometric mean of odds, no?
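To make the comparison concrete, here is a minimal sketch of the baseline I have in mind, not the post's actual `likelihood_pool` code. The function names and the `extremize` exponent are my own illustrative choices. It pools probabilities with the geometric mean of odds and scores a binary forecast with the one-sided Brier score, which cannot exceed 1. The original two-sided Brier definition, summed over both outcomes, ranges from 0 to 2, so that may be worth checking.

```python
import math


def geo_mean_odds(probs, extremize=1.0):
    """Pool probabilities via the geometric mean of odds.

    `extremize` is a hypothetical exponent applied to the pooled odds;
    1.0 leaves the aggregate unchanged. Probabilities must lie strictly
    between 0 and 1.
    """
    log_odds = [math.log(p / (1 - p)) for p in probs]
    pooled = extremize * sum(log_odds) / len(log_odds)
    return 1 / (1 + math.exp(-pooled))


def brier(p, outcome):
    """One-sided binary Brier score, bounded between 0 and 1."""
    return (p - outcome) ** 2


def brier_two_sided(p, outcome):
    """Brier score summed over both outcomes, bounded between 0 and 2."""
    return (p - outcome) ** 2 + ((1 - p) - (1 - outcome)) ** 2
```

With `extremize=1.0` the pooled forecast of `[0.2, 0.8]` is 0.5, and any exponent above 1 pushes the aggregate away from 0.5.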
Thanks for posting this! I think this topic is extremely neglected and the lack of side effects among natural short-sleepers strongly suggests that there could be interventions with no obvious downsides.
My main concern with your drug-centered approach is: what if the causal path from short-sleeper genes to a short-sleeper phenotype flows through neurodevelopmental pathways, such that once neural structures are locked in by adulthood it's no longer possible to induce the desired phenotype by mimicking the direct effects of the genes? If this is true, then reaping the benefits of short-sleeper genes would seem to require genetic engineering (I doubt embryo selection would scale given the low frequency of the target alleles). This would obviously be politically problematic and I'm not sure it'd be technically feasible right away (last time I checked, CRISPR people were worried about off-target mutations, but I'm not up to date with that literature so this may not be an issue anymore).
Have you considered holding out some languages at random to assess the impact of the program? You could e.g. delay funding for some languages by 1-2 years and try to estimate the difference in some relevant outcome during that period. I understand this may be hard or undesirable for several reasons (finding and measuring the right outcomes, opportunity costs, managing grantee expectations).
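A rough sketch of what the randomized delay could look like, assuming you have a list of candidate languages and can later measure one numeric outcome per language. The function names, the 50% split and the outcome dictionary are all hypothetical.

```python
import random
from statistics import mean


def assign_delayed(units, fraction_delayed=0.5, seed=0):
    """Randomly split units into a delayed group and a funded-now group."""
    rng = random.Random(seed)  # fixed seed so the assignment can be audited
    shuffled = list(units)
    rng.shuffle(shuffled)
    n_delayed = round(len(shuffled) * fraction_delayed)
    return {
        "delayed": sorted(shuffled[:n_delayed]),
        "funded_now": sorted(shuffled[n_delayed:]),
    }


def difference_in_means(outcomes, groups):
    """Mean outcome of funded-now units minus mean outcome of delayed units."""
    funded = [outcomes[u] for u in groups["funded_now"]]
    delayed = [outcomes[u] for u in groups["delayed"]]
    return mean(funded) - mean(delayed)
```

The difference in means over the delay period is the simplest possible estimate. With few languages it would be noisy, so it is better read as a sanity check than as a precise effect size.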
We do track whether predictions have a positive ("good thing will happen") or negative ("bad thing will happen") framing, so testing for optimism/pessimism bias is definitely possible. However, only 2% of predictions have a negative framing, so our sample size is too low to say anything conclusive about this yet.
Enriching our database with base rates and categories would be fantastic, but my hunch is that given the nature and phrasing of our questions this would be impossible to do at scale. I'm much more bullish on per-predictor analyses and that's more or less what we're doing with the individual dashboards.
Very good point!
I see a few ways of assessing "global overconfidence":
Interesting, thanks for sharing that trick!
Our forecasting questions are indeed maximally uncertain in some absolute sense because our base rate is ~50%, but it may also be the case that they're particularly uncertain to the person making the prediction as you suggest.
A related issue is that people may be more comfortable making predictions about less important aspects of the project, since the consequences of being wrong are lower.
I'm actually concerned about the same thing but for exactly the opposite reason, i.e. that because the consequences of being wrong (a hit to one's Brier score) are the same regardless of the importance of the prediction, people might allocate the same time and effort to any prediction, including the more important ones that should perhaps warrant closer examination.
We're currently trialing some of the stuff you suggest about bringing in other people to suggest predictions. This might be an improvement, but it's too early to say, and scaling it up wouldn't be easy for a few reasons:
Thanks for doing these analyses!
I recently had to dive into the Metaculus data for a report I'm writing and I produced the following plot along the way. I'm posting it here because it didn't make it into the final report, but I felt it was worth sharing anyway.
Each dot is the Brier score of one community prediction on a non-ambiguously resolved question, plotted against its time horizon (i.e. the time remaining until resolution when the prediction was made). There are up to 101 predictions per question for the reasons you describe in the post. The red line is a moving average and the shaded area is a (t-distributed) 95% confidence interval around the mean.
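For anyone who wants to reproduce something similar, here is a minimal sketch of one way to get a moving average with a t-based interval. It is not the exact code behind the plot. The window size and function names are assumptions, and points inside a window are treated as independent, which is optimistic given that each question contributes many predictions. If SciPy is not installed, the sketch falls back to a normal quantile, which is only an approximation for small windows.

```python
import math
from statistics import NormalDist, mean, stdev

try:
    from scipy.stats import t as student_t  # exact t quantiles if available
except ImportError:
    student_t = None


def critical_value(df, level=0.95):
    """Two-sided critical value; normal approximation without SciPy."""
    upper = 0.5 + level / 2
    if student_t is not None:
        return float(student_t.ppf(upper, df))
    return NormalDist().inv_cdf(upper)


def moving_mean_ci(horizons, scores, window=5, level=0.95):
    """Return (centre_horizon, mean, lower, upper) for each sliding window.

    Points are sorted by horizon first; `window` must be at least 2.
    """
    pairs = sorted(zip(horizons, scores))
    rows = []
    for start in range(len(pairs) - window + 1):
        chunk = pairs[start:start + window]
        xs = [h for h, _ in chunk]
        ys = [s for _, s in chunk]
        m = mean(ys)
        half = critical_value(window - 1, level) * stdev(ys) / math.sqrt(window)
        rows.append((mean(xs), m, m - half, m + half))
    return rows
```

Each returned row gives the centre of the window on the horizon axis, the mean Brier score, and the lower and upper bounds of the interval.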