Thanks, Peter!
To your questions:
That's right. When defined using a base 2 logarithm, the score can be interpreted as "bits of information over the maximally uncertain (uniform) distribution". Forecasts assigning less probability mass to the true outcome than the uniform distribution result in a negative score.
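To make that concrete, here's a minimal sketch (my own illustration, with made-up numbers) of the score in bits over the uniform baseline:

```python
import math

def log_score_bits(p_true_outcome: float, n_outcomes: int) -> float:
    """Base-2 log score relative to the uniform distribution.

    Positive when the forecast assigns more probability mass to the
    true outcome than the uniform 1/n_outcomes; negative when less.
    """
    return math.log2(p_true_outcome) - math.log2(1 / n_outcomes)

# A 70% forecast on a binary question that resolves true earns
# log2(0.7) - log2(0.5) = log2(1.4) ≈ 0.49 bits over uniform.
print(log_score_bits(0.7, 2))
```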
Have you considered contacting the authors of the original QF paper? Glenn and Vitalik seem quite approachable. You could also post the paper on the RxC discord or (if you're willing to go for a high-effort alternative) submit it to their next conference.
Thanks for writing this up!
I think your (largely negative) results on QF under incomplete information should be more widely known. I consider myself to be relatively “plugged” into the online communities that have discussed QF the most (RxC, crypto, etc.) and I only learned about your paper a couple of months ago.
Here are a few more scattered thoughts prompted by the post:
I've been thinking about regranting on and off for about a year, specifically about whether it makes sense to use bespoke mechanisms like quadratic funding or some of its close cousins. I still don't know where I land on many design choices, so I won't say more about that now.
I'm not aware of any retrospective on FTXFF's program, but it might be a good idea to do one once we have enough information to evaluate performance (so in 6-12 months?). Another thing in this vein that I think would be valuable, and could happen right away, is looking into SFF's S-process.
Cool app!
Are you pulling data from Manifold at all, or is the backend "just" a Squiggle model? If the latter, did you embed the markets by hand, or are you automating it by searching the node text on Manifold and pulling the first market that pops up, or something like that?
Thanks! That's a reasonable strategy if you can choose question wording. I agree there's no difference mathematically, but I'm not so sure that's true cognitively. Sometimes I've seen asymmetric calibration curves that look fine >50% but tend to overpredict <50%. That suggests it's easier to stay calibrated in the subset of questions you think are more likely to happen than not. This is good news for your strategy! However, note that this is based on a few anecdotal observations, so I'd caution against updating too strongly on it.
Glad you brought up real money markets because the real choice here isn't "5 unpaid superforecasters" vs "200 unpaid average forecasters" but "5 really good people who charge $200/h" vs "200 internet anons that'll do it for peanuts". Once you notice the difference in unit labor costs, the question becomes: for a fixed budget, what's the optimal trade-off between crowd size and skill? I'm really uncertain about that myself and have never seen good data on it.
Great analysis!
I wonder what would happen if you were to do the same exercise with the fixed-year predictions under a 'constant risk' model, i.e. P(t) = 1 - exp(-λt) with λ = -log(1 - P(year)) / year, to get around the problem that we're still 3 years away from 2026. Given that timelines are systematically longer with a fixed-year framing, I'd expect the Brier scores of those predictions to be worse. OTOH, the constant risk model doesn't seem very reasonable here, so the results wouldn't have a straightforward interpretation.
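In code, that transform might look like this (a sketch under the constant-hazard assumption; function and parameter names are mine):

```python
import math

def hazard_rate(p_by_year: float, years_out: float) -> float:
    """Constant hazard rate λ implied by P(event within years_out) = p_by_year."""
    return -math.log(1 - p_by_year) / years_out

def p_by_time(lam: float, t: float) -> float:
    """P(t) = 1 - exp(-λt) under the constant-risk model."""
    return 1 - math.exp(-lam * t)

# e.g. a 50% forecast for an event by 2026, made 3 years out:
lam = hazard_rate(0.5, 3)
# p_by_time(lam, 3) recovers 0.5 by construction;
# p_by_time(lam, 1) gives the implied 1-year probability.
```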
This is really cool! As someone who's been doing these calculations in a somewhat haphazard way using a mix of pen and paper, spreadsheets, and Python scripts for years, it's nice to see someone put in the work to create a polished product that others can use.
Something that I've been meaning to incorporate into my estimates, and that would be a killer feature for an app like this, is a reasonable projection of future earnings under the assumption that you'll get promoted / switch career paths at the average rate for someone in your current position. Sprinkle a bit of uncertainty on top, and you can get out a nice probability distribution over "time to FI" and "total money donated".
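As a sketch of what I have in mind (all function names, parameters, and numbers here are hypothetical, not from the app):

```python
import random

def simulate_years_to_fi(income: float, savings_rate: float,
                         promo_rate: float, promo_raise: float,
                         target: float, n_sims: int = 10_000,
                         seed: int = 0) -> list:
    """Monte Carlo sketch of years until savings reach `target`,
    assuming a `promo_rate` chance each year of a salary raise of
    `promo_raise` (both made-up parameters for illustration)."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        wealth, salary, years = 0.0, income, 0
        while wealth < target:
            wealth += salary * savings_rate
            if rng.random() < promo_rate:
                salary *= 1 + promo_raise
            years += 1
        results.append(years)
    return results

# The resulting list is an empirical distribution over "time to FI".
```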
A product I would personally like to see because it'd be tremendously useful to me is "personal finance for nomadic EAs" i.e. if location is at most a minor constraint for you, where should you move to maximize the resources available to the effective charities of your choice? I expect that, for most people without such constraints, packing up and leaving is probably much more effective than fine-tuning the strategy to the place where they currently reside.
Your likelihood_pool method is returning Brier scores >1. How is that possible? Also, unless you extremize, it should yield the same aggregates (and scores) as regular geometric mean of odds, no?
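For comparison, a minimal sketch of the plain geometric mean of odds (my own illustration; I haven't looked at `likelihood_pool`'s internals):

```python
import math

def geo_mean_odds(probs: list) -> float:
    """Pool binary forecasts via the geometric mean of their odds."""
    log_odds = [math.log(p / (1 - p)) for p in probs]
    pooled = math.exp(sum(log_odds) / len(log_odds))
    return pooled / (1 + pooled)

# Two symmetric forecasts cancel out to 0.5:
print(geo_mean_odds([0.8, 0.2]))
```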
Thanks for posting this! I think this topic is extremely neglected and the lack of side effects among natural short-sleepers strongly suggests that there could be interventions with no obvious downsides.
My main concern with your drug-centered approach is: what if the causal path from short-sleeper genes to a short-sleeper phenotype flows through neurodevelopmental pathways, such that once neural structures are locked in during adulthood it's no longer possible to induce the desired phenotype by mimicking the direct effects of the genes? If this is true, then reaping ...
Have you considered holding out some languages at random to assess the impact of the program? You could e.g. delay funding for some languages by 1-2 years and try to estimate the difference in some relevant outcome during that period. I understand this may be hard or undesirable for several reasons (finding and measuring the right outcomes, opportunity costs, managing grantee expectations).
Unfortunately I think this kind of experimental approach is a bad fit here; opportunity costs seem really high, there's a small number of data points, and there's a ton of noise from other factors that language communities vary along.
Fortunately I think we'll have additional context that will help us assess the impacts of these grants beyond a black-box "did this input lead to this output" analysis.
We do track whether predictions have a positive ("good thing will happen") or negative ("bad thing will happen") framing, so testing for optimism/pessimism bias is definitely possible. However, only 2% of predictions have a negative framing, so our sample size is too low to say anything conclusive about this yet.
Enriching our database with base rates and categories would be fantastic, but my hunch is that given the nature and phrasing of our questions this would be impossible to do at scale. I'm much more bullish on per-predictor analyses and that's more or less what we're doing with the individual dashboards.
Very good point!
I see a few ways of assessing "global overconfidence":
(+) sign next...

Thanks!
Interesting, thanks for sharing that trick!
Our forecasting questions are indeed maximally uncertain in some absolute sense because our base rate is ~50%, but it may also be the case that they're particularly uncertain to the person making the prediction as you suggest.
A related issue is that people may be more comfortable making predictions about less important aspects of the project, since the consequences of being wrong are lower.
I'm actually concerned about the same thing but for exactly the opposite reason: because the consequences of being wrong (a hit to one's Brier score) are the same regardless of the importance of the prediction, people might allocate the same time and effort to any prediction, including the more important ones that perhaps warrant closer examination.
We're currently trialing some...
Thanks for doing these analyses!
I recently had to dive into the Metaculus data for a report I'm writing and I produced the following plot along the way. I'm posting it here because it didn't make it into the final report, but I felt it was worth sharing anyway.
Each dot corresponds to the Brier score of the community prediction on a non-ambiguously resolved question, as a function of time horizon (i.e. time remaining until resolution when the prediction was made). There are up to 101 predictions per question, for the reasons you describe in the post. The...
Thanks for writing this!
Since your decision seems to come down to the expected positive effect on your happiness, I'm curious whether you considered even cheaper happiness-boosting interventions. For example, hundreds (thousands?) of hours of meditation might give you the "love, belonging, connection" and "personal growth" benefits with fewer downsides, though this might work less reliably than having kids.