Javier Prieto

Program Assistant, AI Governance & Policy @ Open Philanthropy


Thanks for writing this!

Since your decision seems to come down to the expected positive effect on your happiness, I'm curious whether you considered even cheaper happiness-boosting interventions. For example, hundreds (thousands?) of hours of meditation might give you the "love, belonging, connection" and "personal growth" benefits with fewer downsides, though this might work less reliably than having kids.

Thanks, Peter!

To your questions:

  1. I'm fairly confident (let's say 80%) that Metaculus has underestimated progress on benchmarks so far. This doesn't mean it will keep doing so in the future because (i) forecasters may have learned from this experience to be more bullish and/or (ii) AI progress might slow down. I wouldn't bet on (ii), but I expect (i) has already happened to some extent -- it has certainly happened to me!
  2. The other categories have fewer questions and some have special circumstances that make the evidence of bias much weaker in my view. Specifically, the biggest misses in "compute" came from GPU price spikes that can probably be explained by post-COVID supply chain disruptions and increased demand from crypto miners. Both of these factors were transient. 
  3. I like your example with the two independent dice. My takeaway is that, if you have access to a prior that's more informative than a uniform distribution (in this case, "both dice are unbiased so their sum must be a triangular distribution"), then you should compare your performance against that. My assumption when writing this was that a (log-)uniform prior over the relevant range was the best we could do for these questions. This is in line with the fact that Metaculus's log score on continuous questions is normalized using a (log-)uniform distribution.
  4. That's a good point re: different time horizons. I didn't bother to check the average time between close and resolution for all questions on the platform, but, assuming it's <<1 year as you suggest, I agree it's an important caveat. If you know that number off the top of your head, I'll add it to the post.
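To put rough numbers on the two-dice example in point 3, here's a toy calculation of my own (not from the thread) comparing the expected base-2 log score of a uniform forecast over 2–12 against the triangular prior itself:

```python
from fractions import Fraction
from math import log2

# True distribution of the sum of two fair dice: triangular over 2..12.
counts = {s: 0 for s in range(2, 13)}
for a in range(1, 7):
    for b in range(1, 7):
        counts[a + b] += 1
triangular = {s: Fraction(c, 36) for s, c in counts.items()}

# The less informative alternative: uniform over the 11 possible sums.
uniform = {s: Fraction(1, 11) for s in range(2, 13)}

def expected_log_score(forecast, truth):
    """Expected base-2 log score when outcomes are drawn from `truth`."""
    return sum(float(truth[s]) * log2(float(forecast[s])) for s in truth)

print(expected_log_score(uniform, triangular))     # log2(1/11), about -3.46 bits
print(expected_log_score(triangular, triangular))  # about -3.27 bits, strictly better
```

The gap (~0.19 bits per question) is exactly the edge a forecaster gets "for free" by using the more informative prior as their baseline, which is why the choice of normalizing distribution matters when judging performance.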

That's right. When defined using a base 2 logarithm, the score can be interpreted as "bits of information over the maximally uncertain (uniform) distribution". Forecasts assigning less probability mass to the true outcome than the uniform distribution result in a negative score.
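A minimal sketch of that interpretation for the discrete case (Metaculus's actual normalization for continuous questions is more involved, so treat this as illustrative only):

```python
from math import log2

def relative_log_score(p_true_outcome, n_outcomes):
    """Base-2 log score relative to a uniform forecast over n_outcomes.

    Positive = more probability mass on the true outcome than uniform
    (information gained, in bits); negative = less mass than uniform.
    """
    return log2(p_true_outcome) - log2(1 / n_outcomes)

print(relative_log_score(0.5, 4))   # 1.0: one bit better than uniform (0.25)
print(relative_log_score(0.25, 4))  # 0.0: exactly the uniform forecast
print(relative_log_score(0.1, 4))   # negative: worse than uniform
```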

Have you considered contacting the authors of the original QF paper? Glenn and Vitalik seem quite approachable. You could also post the paper on the RxC discord or (if you're willing to go for a high-effort alternative) submit it to their next conference.

Thanks for writing this up!

I think your (largely negative) results on QF under incomplete information should be more widely known. I consider myself relatively “plugged in” to the online communities that have discussed QF the most (RxC, crypto, etc.), and I only learned about your paper a couple of months ago.

Here are a few more scattered thoughts prompted by the post:

  1. I’m really intrigued by the dynamic setting and its potential to alleviate the information problem to some extent. I agree there should be more work on this, theoretical or empirical.
  2. Showing endogenous CQF is (in)efficient under complete information sounds relatively easy, right? I would love it if someone did this or explained why my intuition about hardness is wrong! (Though I expect an eventual efficiency proof wouldn’t go through under incomplete information for the same reasons as in regular QF, so I’m not sure how useful this is in practice.)
  3. Agree with all your points on matching and coordination – the mechanism doesn’t seem to be a good fit there.
  4. In the section on grantmaking, you seem to assume that experts wouldn’t be paying out of their own pockets, but this could be implemented with the following setup: the donor gives them a regranting pot that they can keep for themselves or spend on other projects that will be matched quadratically.
  5. I didn’t know quadratic voting is efficient under incomplete information. Add that to its other advantages (simplicity, budget, etc.) and it comes out as a much stronger option than QF. I have no take on whether it’s better or worse than the other mechanisms you mention, though my sense is that approval voting is the darling of many electoral reform wonks.

I've been thinking about regranting on and off for about a year, specifically about whether it makes sense to use bespoke mechanisms like quadratic funding or some of its close cousins. I still don't know where I land on many design choices, so I won't say more about that now.

I'm not aware of any retrospective on FTXFF's program, but it might be a good idea to do one once we have enough information to evaluate performance (so in 6-12 months?). Another thing in this vein that I think would be valuable and could happen right away is looking into SFF's S-process.

Cool app!

Are you pulling data from Manifold at all, or is the backend "just" a Squiggle model? If the latter, did you embed the markets by hand, or are you automating it by searching the node text on Manifold and pulling the first market that pops up or something like that?

Thanks! That's a reasonable strategy if you can choose question wording. I agree there's no difference mathematically, but I'm not so sure that's true cognitively. Sometimes I've seen asymmetric calibration curves that look fine >50% but tend to overpredict <50%. That suggests it's easier to stay calibrated in the subset of questions you think are more likely to happen than not. This is good news for your strategy! However, note that this is based on a few anecdotal observations, so I'd caution against updating too strongly on it.
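For what it's worth, here's how I'd check for that kind of asymmetry in code. The data here is synthetic, with overprediction deliberately baked in below 50%, so it's a toy illustration of the check rather than evidence for the claim:

```python
import random

random.seed(0)

# Synthetic forecaster: well calibrated above 50%, but overpredicting below 50%
# (events forecast at p < 0.5 actually resolve YES less often than stated).
forecasts, outcomes = [], []
for _ in range(20000):
    p = random.random()
    true_p = p if p >= 0.5 else max(0.0, p - 0.1)
    forecasts.append(p)
    outcomes.append(random.random() < true_p)

def calibration_curve(forecasts, outcomes, n_bins=10):
    """Mean forecast vs. empirical resolution frequency in equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(forecasts, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    return [
        (sum(p for p, _ in b) / len(b), sum(y for _, y in b) / len(b))
        for b in bins if b
    ]

for mean_p, freq in calibration_curve(forecasts, outcomes):
    print(f"forecast {mean_p:.2f} -> resolved YES {freq:.2f}")
```

If the asymmetry I described is real, the printout shows empirical frequencies tracking forecasts closely above 0.5 but falling short of them below 0.5.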

Glad you brought up real money markets because the real choice here isn't "5 unpaid superforecasters" vs "200 unpaid average forecasters" but "5 really good people who charge $200/h" vs "200 internet anons that'll do it for peanuts". Once you notice the difference in unit labor costs, the question becomes: for a fixed budget, what's the optimal trade-off between crowd size and skill? I'm really uncertain about that myself and have never seen good data on it.

Great analysis!

I wonder what would happen if you were to do the same exercise with the fixed-year predictions under a 'constant risk' model, i.e. P(t) = 1 - exp(-l*t) with l = -log(1 - P(year)) / year, to get around the problem that we're still 3 years away from 2026. Given that timelines are systematically longer with a fixed-year framing, I would expect the Brier score of those predictions to be worse. OTOH, the constant risk model doesn't seem very reasonable here, so the results wouldn't have a straightforward interpretation.
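A quick sketch of that conversion, with the rate solved from P(year) = 1 - exp(-l*year), i.e. l = -log(1 - P(year)) / year. The numbers are hypothetical, just to show the mechanics:

```python
from math import exp, log

def constant_risk_prob(p_by_horizon, horizon_years, t):
    """Under a constant-hazard model P(t) = 1 - exp(-l*t), back out the
    rate l implied by a fixed-year forecast and evaluate at time t."""
    l = -log(1 - p_by_horizon) / horizon_years
    return 1 - exp(-l * t)

# Hypothetical: a 40% forecast for "by 2026" made 6 years out,
# evaluated 3 years in (i.e., today).
print(constant_risk_prob(0.40, 6, 3))  # ~0.23: the implied probability so far
```

By construction the model returns the original forecast at the full horizon (`constant_risk_prob(0.40, 6, 6) == 0.40`), and the implied probability at t = 3 is more than half of 0.40 only if... actually it's less, since the hazard compounds on the shrinking survival probability.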
