All of Javier Prieto's Comments + Replies

Thanks for writing this!

Since your decision seems to come down to the expected positive effect on your happiness, I'm curious whether you considered even cheaper happiness-boosting interventions. For example, hundreds (thousands?) of hours of meditation might give you the "love, belonging, connection" and "personal growth" benefits with fewer downsides, though this might work less reliably than having kids.

2
KidsOrNoKids
5mo
Thanks! Good question - indeed I did consider lots of meditation, writing a book, and various other things, and the decision was close. Having kids had the added bonus of "exploring the breadth of human experience" and that felt important. It also has a certain reliability as you mentioned.

Thanks, Peter!

To your questions:

  1. I'm fairly confident (let's say 80%) that Metaculus has underestimated progress on benchmarks so far. This doesn't mean it will keep doing so in the future because (i) forecasters may have learned from this experience to be more bullish and/or (ii) AI progress might slow down. I wouldn't bet on (ii), but I expect (i) has already happened to some extent -- it has certainly happened to me!
  2. The other categories have fewer questions and some have special circumstances that make the evidence of bias much weaker in my view. Specifi
... (read more)

That's right. When defined using a base 2 logarithm, the score can be interpreted as "bits of information over the maximally uncertain (uniform) distribution". Forecasts assigning less probability mass to the true outcome than the uniform distribution result in a negative score.
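For concreteness, here's a minimal sketch of that score in Python (the function name and the binary default are mine):

```python
import math

def log_score_bits(p_true: float, n_outcomes: int = 2) -> float:
    """Log score in bits relative to the uniform distribution.

    p_true is the probability the forecast assigned to the outcome
    that actually occurred. A uniform forecast scores 0; anything
    that put less mass than uniform on the true outcome scores < 0.
    """
    return math.log2(p_true) - math.log2(1.0 / n_outcomes)

print(log_score_bits(0.5))  #  0.0 -- same as uniform
print(log_score_bits(0.8))  # +0.68 bits over uniform
print(log_score_bits(0.3))  # -0.74 bits, i.e. worse than uniform
```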

Have you considered contacting the authors of the original QF paper? Glenn and Vitalik seem quite approachable. You could also post the paper on the RxC discord or (if you're willing to go for a high-effort alternative) submit it to their next conference.

Thanks for writing this up!

I think your (largely negative) results on QF under incomplete information should be more widely known. I consider myself relatively "plugged in" to the online communities that have discussed QF the most (RxC, crypto, etc.), and I only learned about your paper a couple of months ago.

Here are a few more scattered thoughts prompted by the post:

  1. I’m really intrigued by the dynamic setting and its potential to alleviate the information problem to some extent. I agree there should be more work on this, theoretical or empirical.
  2. Show
... (read more)
1
Luis Mota Freitas
9mo
Thanks for the comment! On the point of making this information more widely known: is there an easy way to do so, given that I have very little familiarity with these communities? I haven't tried, and it could turn out to be quite easy, but I think it's probably not so trivial to share the result either way.

I've been thinking about regranting on and off for about a year, specifically about whether it makes sense to use bespoke mechanisms like quadratic funding or some of its close cousins. I still don't know where I land on many design choices, so I won't say more about that now.

I'm not aware of any retrospective on FTXFF's program, but it might be a good idea to do one once we have enough information to evaluate performance (so in 6-12 months?). Another thing in this vein that I think would be valuable, and could happen right away, is looking into SFF's S-process.

Cool app!

Are you pulling data from Manifold at all or is the backend "just" a squiggle model? If the latter, did you embed the markets by hand or are you automating it by searching the node text on Manifold and pulling the first market that pops up or something like that?

2
Nathan Young
1y
The backend is a squiggle model with manifold slugs in the comments. I embedded the markets by hand, but we have built a GPT node creator; it's just pretty bad. I think maybe we will add automatic search on titles, so it can suggest Manifold (and Metaculus) nodes and you get the option to choose. If there are features that would make you use this, let me know -- it took about a week of dev time, so it isn't hard to change. Likewise, I guess I'll try to interview different researchers and build their models too, and perhaps throw up a website for it.

Thanks! That's a reasonable strategy if you can choose the question wording. I agree there's no difference mathematically, but I'm not so sure that's true cognitively. Sometimes I've seen asymmetric calibration curves that look fine above 50% but show overprediction below 50%, which suggests it's easier to stay calibrated on the subset of questions you think are more likely to happen than not. This is good news for your strategy! However, note that this is based on a few anecdotal observations, so I'd caution against updating too strongly on it.

1
Diego Oliveira
1y
Thanks for your reply. The possibility of asymmetry suggests even more strongly that we shouldn't predict over the whole [0%-100%] range, but rather stick to whichever half of the interval we feel more comfortable with. All we have to do is get in the habit of flipping the "sign" of the question (i.e., taking the complement of the sample space) when needed, which usually amounts to adding the phrase "It's not the case that" in front of the prediction. This gives roughly double the number of samples per bin, and therefore more precise estimates of our calibration. And since we now have to map each event to a set half the size it was before, it seems easier to get better at this over time. Do you see any reason not to change Open Philanthropy's approach to forecasting, besides the immense logistical effort this implies?
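For concreteness, the sign-flipping Diego describes can be sketched in a few lines (function name hypothetical):

```python
def fold_prediction(p: float, happened: bool) -> tuple[float, bool]:
    """Restate a question as its complement ("It's not the case
    that ...") whenever the forecast is below 50%, so every
    prediction lands in the [0.5, 1.0] half of the interval."""
    if p < 0.5:
        return 1.0 - p, not happened
    return p, happened

# (0.2, happened) becomes (0.8, didn't happen), etc., roughly
# doubling the number of samples per calibration bin above 50%.
preds = [(0.2, True), (0.9, True), (0.4, False), (0.7, False)]
print([fold_prediction(p, h) for p, h in preds])
# [(0.8, False), (0.9, True), (0.6, True), (0.7, False)]
```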

Glad you brought up real money markets because the real choice here isn't "5 unpaid superforecasters" vs "200 unpaid average forecasters" but "5 really good people who charge $200/h" vs "200 internet anons that'll do it for peanuts". Once you notice the difference in unit labor costs, the question becomes: for a fixed budget, what's the optimal trade-off between crowd size and skill? I'm really uncertain about that myself and have never seen good data on it.

Great analysis!

I wonder what would happen if you were to do the same exercise with the fixed-year predictions under a 'constant risk' model, i.e. P(t) = 1 - exp(-λt) with λ = -log(1 - P(year)) / year, to get around the problem that we're still 3 years away from 2026. Given that timelines are systematically longer with a fixed-year framing, I would expect the Brier score of those predictions to be worse. OTOH, the constant risk model doesn't seem very reasonable here, so the results wouldn't have a straightforward interpretation.
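A minimal sketch of that extrapolation, using the rate formula above (names mine):

```python
import math

def constant_risk_prob(p_by_horizon: float, horizon: float, t: float) -> float:
    """P(t) = 1 - exp(-lambda * t) under a constant-hazard model, with
    lambda calibrated so the curve hits p_by_horizon at t = horizon."""
    lam = -math.log(1.0 - p_by_horizon) / horizon
    return 1.0 - math.exp(-lam * t)

# E.g. a 50% forecast at a 10-year horizon, evaluated 7 years in:
print(constant_risk_prob(0.5, 10, 7))  # ~0.38
```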

2
PatrickL
1y
I'd be interested in seeing this too! Although I'm not planning to spend the time to do this soon, I'd be up for a quick chat if someone else wanted to take it on. The constant risk model seems to me like a better-than-nothing model, so it could be a slight update if it gives different results. Or one could just use a linear increase of probability from 0 years to 10 years.

FWIW, from this survey alone I'm not convinced that timelines are systematically longer with a fixed-year framing. I sampled ten forecasts where probabilities were given in both framings on a 10-year timescale: five of them (Subtitles, Transcribe, Top forty, Random game, Explain) gave later forecasts when asked with 'probability in N years' rather than 'year that the probability is M', three (Video scene, Read aloud, Atari) gave the same forecasts, and two (Rosetta, Taylor) gave earlier forecasts. So from this sample, they seem commonly, but not universally, longer.

This is really cool! As someone who's been doing these calculations in a somewhat haphazard way using a mix of pen and paper, spreadsheets, and Python scripts for years, it's nice to see someone put in the work to create a polished product that others can use.

Something that I've been meaning to incorporate into my estimates, and that would be a killer feature for an app like this, is a reasonable projection of future earnings, under the assumption that you'll get promoted / switch career paths at the average rate for someone in your current position. Sprinkle a bit of uncertainty on top, and you get a nice probability distribution over "time to FI" and "total money donated".
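To illustrate the kind of projection I mean, here's a hypothetical Monte Carlo sketch; every name and parameter below is invented for illustration, and it ignores investment returns:

```python
import random

def years_to_fi(salary: float, expenses: float, target: float,
                promo_rate: float = 0.1, promo_raise: float = 0.3,
                noise_sd: float = 0.02, n_sims: int = 10_000) -> list[int]:
    """Each simulated year you bank (salary - expenses), get promoted
    with probability promo_rate (a promo_raise salary bump), and your
    salary drifts with some noise. Returns years-to-target per run."""
    runs = []
    for _ in range(n_sims):
        s, wealth, year = salary, 0.0, 0
        while wealth < target and year < 100:
            wealth += s - expenses
            if random.random() < promo_rate:
                s *= 1 + promo_raise
            s *= 1 + random.gauss(0.0, noise_sd)
            year += 1
        runs.append(year)
    return sorted(runs)

sims = years_to_fi(salary=80_000, expenses=50_000, target=750_000)
print("median:", sims[len(sims) // 2], "90th pct:", sims[int(0.9 * len(sims))])
```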

1
NicoleJaneway
1y
Javier, I think it would be extremely cool if someone made this projection tool using salary data aggregated from levels.fyi.
5
Rebecca Herbst
1y
This is a thoughtful comment. And what you're requesting is also often missing from many FIRE calculators. Perhaps there's a simple workaround, like including a formula that says "my salary will increase by X every Y years". Obviously this is hard to predict, but it does offer the user a little more flexibility if they assume their income will go up and want to play around with the numbers. At the same time, we're also assuming the user's expenses remain the same, which we know isn't necessarily true either. So we're trying to find that balance between simple and flexible. I will think on this more. Thank you.

A product I would personally like to see because it'd be tremendously useful to me is "personal finance for nomadic EAs" i.e. if location is at most a minor constraint for you, where should you move to maximize the resources available to the effective charities of your choice? I expect that, for most people without such constraints, packing up and leaving is probably much more effective than fine-tuning the strategy to the place where they currently reside.

1
NicoleJaneway
1y
Fair point, Javier! You might work with EA Anywhere on this. Related self-promotion: I wrote up this article about my experience getting paid to be a digital nomad in Tulsa, OK. I've since moved, but I really enjoyed my time there, and I still recommend moving there for the year-long stipend.

Your likelihood_pool method is returning Brier scores >1. How is that possible? Also, unless you extremize, it should yield the same aggregates (and scores) as the regular geometric mean of odds, no?

7
Jaime Sevilla
1y
I am so dumb -- I was mistakenly using odds instead of probs to compute the Brier score :facepalm: And yes, you're right, we should extremize before aggregating. Otherwise, the method is equivalent to the geo mean of odds. It's still not very good, though.
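For concreteness, a minimal sketch of both pieces computed on probabilities rather than odds (the extremization exponent is arbitrary):

```python
import math

def geo_mean_odds(probs: list[float], extremize: float = 1.0) -> float:
    """Pool forecasts via the geometric mean of odds; raising the
    pooled odds to a power > 1 extremizes the aggregate."""
    mean_log_odds = sum(math.log(p / (1 - p)) for p in probs) / len(probs)
    pooled_odds = math.exp(mean_log_odds) ** extremize
    return pooled_odds / (1 + pooled_odds)

def brier(p: float, outcome: int) -> float:
    """Brier score on probabilities, so it stays in [0, 1];
    feeding in odds instead is what pushes it above 1."""
    return (p - outcome) ** 2

ps = [0.6, 0.7, 0.8]
print(geo_mean_odds(ps))                 # ~0.71, plain pooling
print(geo_mean_odds(ps, extremize=2.5))  # ~0.90, extremized
print(brier(geo_mean_odds(ps), 1))       # ~0.09
```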

Thanks for posting this! I think this topic is extremely neglected and the lack of side effects among natural short-sleepers strongly suggests that there could be interventions with no obvious downsides.

My main concern with your drug-centered approach is: what if the causal path from short-sleeper genes to a short-sleeper phenotype flows through neurodevelopmental pathways, such that once neural structures are locked in during adulthood it's no longer possible to induce the desired phenotype by mimicking the direct effects of the genes? If this is true, then reaping ... (read more)

Have you considered holding out some languages at random to assess the impact of the program? You could e.g. delay funding for some languages by 1-2 years and try to estimate the difference in some relevant outcome during that period. I understand this may be hard or undesirable for several reasons (finding and measuring the right outcomes, opportunity costs, managing grantee expectations).

Unfortunately, I think this kind of experimental approach is a bad fit here: opportunity costs seem really high, there's a small number of data points, and there's a ton of noise from the other factors along which language communities vary.

Fortunately I think we'll have additional context that will help us assess the impacts of these grants beyond a black-box "did this input lead to this output" analysis.

We do track whether predictions have a positive ("good thing will happen") or negative ("bad thing will happen") framing, so testing for optimism/pessimism bias is definitely possible. However, only 2% of predictions have a negative framing, so our sample size is too low to say anything conclusive about this yet.

Enriching our database with base rates and categories would be fantastic, but my hunch is that given the nature and phrasing of our questions this would be impossible to do at scale. I'm much more bullish on per-predictor analyses and that's more or less what we're doing with the individual dashboards.

Very good point!

I see a few ways of assessing "global overconfidence":

  1. Lump all predictions into two bins (under and over 50%) and check that the lower point is above the diagonal and the upper one is below it (see the sketch after this list). I just did this and the points are where you'd expect if we were overconfident, but the 90% credible intervals still overlap with the diagonal, so pooling all the bins in this way still provides only weak evidence of overconfidence.
  2. Calculating the OC score as defined by Metaculus (scroll down to the bottom of the page and click the (+) sign next
... (read more)
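Here's a minimal sketch of the two-bin check from point 1, assuming scipy is available and using a flat Beta(1, 1) prior for the credible intervals (names mine):

```python
from scipy import stats

def two_bin_check(preds: list[tuple[float, int]], ci: float = 0.9) -> None:
    """Lump predictions into two bins (under/over 50%) and compare each
    bin's hit rate, with a Beta credible interval, to its mean forecast.
    Overconfidence: low bin above the diagonal, high bin below it."""
    bins = {"<50%": [x for x in preds if x[0] < 0.5],
            ">=50%": [x for x in preds if x[0] >= 0.5]}
    for name, bin_ in bins.items():
        if not bin_:
            continue
        n = len(bin_)
        hits = sum(y for _, y in bin_)
        mean_p = sum(p for p, _ in bin_) / n
        lo, hi = stats.beta.interval(ci, 1 + hits, 1 + n - hits)
        print(f"{name}: mean forecast {mean_p:.2f}, hit rate {hits/n:.2f}, "
              f"{ci:.0%} CI [{lo:.2f}, {hi:.2f}]")
```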
4
Charles Dillon
2y
One thing to note here is that it's plausible your errors are not symmetric in expectation if there's some bias towards phrasing questions one way or the other (e.g. frequently asking "will [event] happen?", where optimism might cause you to be too high in general). This might mean that assuming linearity could be wrong. This is probably easier for you to tell, since you can see the underlying data.
3
Dan_Keys
2y
I haven't seen a rigorous analysis of this, but I like looking at the slope, and I expect that it's best to include each resolved prediction as a separate data point. So there would be 743 data points, each with a y value of either 0 or 1.
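One minimal way to compute that slope -- a sketch, not necessarily what Dan has in mind (assumes numpy):

```python
import numpy as np

def calibration_slope(forecasts: list[float], outcomes: list[int]) -> float:
    """Least-squares slope of resolution (0/1) on forecast probability,
    one point per resolved prediction. A slope near 1 is consistent with
    good calibration; a slope < 1 is the usual overconfidence signature."""
    slope, _intercept = np.polyfit(forecasts, outcomes, 1)
    return slope
```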

Thanks!

  1. We're currently providing calibration and accuracy stats to our grant investigators through our Salesforce app in the hopes that they'll find that feedback useful and actionable.
  2. I'm not sure and I'd have to defer to decision-makers at OP. My model of them is that predictions are just one piece of evidence they look at.

Interesting, thanks for sharing that trick!

Our forecasting questions are indeed maximally uncertain in some absolute sense because our base rate is ~50%, but it may also be the case that they're particularly uncertain to the person making the prediction as you suggest.

"A related issue is that people may be more comfortable making predictions about less important aspects of the project, since the consequences of being wrong are lower."

I'm actually concerned about the same thing, but for exactly the opposite reason: because the consequences of being wrong (a hit to one's Brier score) are the same regardless of the importance of the prediction, people might allocate the same time and effort to every prediction, including the more important ones that perhaps warrant closer examination.

We're currently trialing some... (read more)

Thanks for doing these analyses!

I recently had to dive into the Metaculus data for a report I'm writing and I produced the following plot along the way. I'm posting it here because it didn't make it into the final report, but I felt it was worth sharing anyway.

Each dot corresponds to the Brier score for the community prediction on a non-ambiguously resolved question, as a function of time horizon (i.e. the time remaining until resolution when the prediction was made). There are up to 101 predictions per question for the reasons you describe in the post. The... (read more)
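For reference, the plot can be reproduced from snapshots of (time to resolution, community probability, outcome) triples; a sketch under that assumed data layout:

```python
import matplotlib.pyplot as plt

def brier_vs_horizon(snapshots: list[tuple[float, float, int]]) -> None:
    """Scatter the Brier score of each community-prediction snapshot
    against the time remaining until the question resolved."""
    horizons = [h for h, _, _ in snapshots]
    briers = [(p - y) ** 2 for _, p, y in snapshots]
    plt.scatter(horizons, briers, s=8, alpha=0.3)
    plt.xlabel("Time to resolution (days)")
    plt.ylabel("Brier score of community prediction")
    plt.show()
```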

1
Charles Dillon
2y
Nice graph, thanks!