How accurate are Open Phil's predictions?

Javier Prieto🔸; Coefficient Giving


Minor point, but I disagree with the unqualified claim of being well calibrated here except for the 90% bucket, at least a little.

Weak evidence that you are overconfident in each of the 0-10, 10-20, 70-80, 80-90 and 90%+ buckets is decent evidence of an overconfidence bias overall, even if those errors are mostly individually within the margin of error.

Javier Prieto🔸

Very good point!

I see a few ways of assessing "global overconfidence":

Lump all predictions into two bins (under and over 50%) and check that the lower point is above the diagonal and the upper one is below the diagonal. I just did this and the points are where you'd expect if we were overconfident, but the 90% credible intervals still overlap with the diagonal, so pooling all the bins in this way still provides weak evidence of overconfidence.
Calculating the OC score as defined by Metaculus (scroll down to the bottom of the page and click the (+) sign next to Details). A score between 0 and 1 indicates overconfidence. Open Phil's score is 0.175, so this is evidence that we're overconfident. I don't know how to put a meaningful confidence/credible interval on that number, so it's hard to say how strong this evidence is.
Run a linear regression on the calibration curve and check that the slope is <1. When I do this for the original curve with 10 points, statsmodels OLS method spits out [0.772, 0.996] as a 95% confidence interval for the slope. I see this as stronger evidence of overconfidence than the previous ones.

Charles Dillon 🔸

One thing to note here is it is plausible that your errors are not symmetric in expectation, if there's some bias towards phrasing questions one way or another (this could be something like frequently asking "will [event] happen" where optimism might cause you to be too high in general, for example). This might mean assuming linearity could be wrong.

This is probably easier for you to tell since you can see the underlying data.

Dan_Keys

I haven't seen a rigorous analysis of this, but I like looking at the slope, and I expect that it's best to include each resolved prediction as a separate data point. So there would be 743 data points, each with a y value of either 0 or 1.

JamesÖz 🔸

I'm probably missing something but doesn't the graph show OP is under-confident in the 0-10 and 10-20 bins? e.g. those data points are above the dotted grey line of perfect calibration where the 90%+ bin is far below?

KaseyShibayama

I think overconfident and underconfident aren't crisp terms to describe this. With binary outcomes, you can invert the prediction and it means the same thing (20% chance of X == 80% chance of not X). So being below the calibration line in the 90% bucket and above the line in the 10% bucket are functionally the same thing.
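The inversion can be made concrete with a small sketch. The function names `fold` and `brier` are illustrative, not from any forecasting library. Flipping the probability and the outcome together leaves the squared error unchanged.

```python
def fold(prob, outcome):
    """Restate a forecast so that its probability is at least 50%.

    A forecast of `prob` for X is equivalent to `1 - prob` for not-X,
    so the probability and the outcome are flipped together.
    """
    if prob < 0.5:
        return 1 - prob, 1 - outcome
    return prob, outcome


def brier(prob, outcome):
    """Squared error of a single binary forecast."""
    return (prob - outcome) ** 2


original = (0.2, 1)        # said 20% for X, and X happened
folded = fold(*original)   # same forecast stated as 80% for not-X, which did not happen

print(brier(*original), brier(*folded))  # the two scores agree up to rounding
```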

Charles Dillon 🔸

I'm using overconfident here to mean closer to extreme confidence (0 or 100, depending on whether they are below or above 50%, respectively) than they should be.

[anonymous]

appreciate the public accountability here!

MaxRa

Thanks for sharing, super interesting!

The organization-wide Brier score (measuring both calibration and resolution) is .217, which is somewhat better than chance (.250). This requires careful interpretation, but in short we think that our reasonably good Brier score is mostly driven by good calibration, while resolution has more room for improvement (but this may not be worth the effort). [more]

Another explanation for the low resolution, besides the limited time you spend on the forecasts, might be that you chose questions that you are most uncertain about (i.e. that you are around 50% certain about resolving positively), right?

This is something I noticed when making my own forecasts. To remove this bias I sometimes use a die to choose the number for questions like

By Jan 1, 2018, the grantee will have staff working in at least [insert random number from a reasonable range] European countries

Javier Prieto🔸

Interesting, thanks for sharing that trick!

Our forecasting questions are indeed maximally uncertain in some absolute sense because our base rate is ~50%, but it may also be the case that they're particularly uncertain to the person making the prediction as you suggest.

Karthik Tadepalli

This is a great exercise. I am definitely concerned about the endogenous nature of predictions: I think you are definitely right that people are more likely to offer predictions on the aspects of their project that are easier to predict, especially since the prompt you showed explicitly asks them in an open-ended way. A related issue is that people may be more comfortable making predictions about less important aspects of the project, since the consequences of being wrong are lower. If this is happening, then this forecasting accuracy wouldn't generalize at all.

Both of these issues can be partly addressed if the predictions are solicited by another person reading the writeup, rather than chosen by the writer. For example, Alice writes up an investigation into human challenge trials as a strategy for medical R&D, Bob reads it and asks Alice for some predictions that he feels are important to complementing the writeup e.g. "will human challenge trials be used for any major disease by the start of 2023?" and "will the US take steps to encourage human challenge trials by the start of 2024?"

This obviously helps avoid Alice making predictions about only easier questions, and it also ensures that the predictions being made are actually decision-relevant (since they are being solicited by someone else who serves the role of an intelligent layperson/policymaker reading the report). Seems like a win-win to me.

Javier Prieto🔸

A related issue is that people may be more comfortable making predictions about less important aspects of the project, since the consequences of being wrong are lower

I'm actually concerned about the same thing but for exactly the opposite reason, i.e. that because the consequences of being wrong (a hit to one's Brier score) are the same regardless of the importance of the prediction people might allocate the same time and effort to any prediction, including the more important ones that should perhaps warrant closer examination.

We're currently trialing some of the stuff you suggest about bringing in other people to suggest predictions. This might be an improvement, but it's too early to say, and scaling it up wouldn't be easy for a few reasons:

It's hard to make good predictions about a grant without lots of context.
Grant investigators are very time-constrained, so they can't afford to provide that context by having a lot of back and forth with the person suggesting the predictions.
Most of the information needed to gain context about the grant is confidential by default.

Dan_Keys

There are several different sorts of systematic errors that you could look for in this kind of data, although checking for them requires including more features of each prediction than the ones that are here.

For example, to check for optimism bias you'd want to code whether each prediction is of the form "good thing will happen", "bad thing will happen", or neither. Then you can check if probabilities were too high for "good thing will happen" predictions and too low for "bad thing will happen" predictions. (Most of the example predictions were "good thing will happen" predictions, and it looks like probabilities were not generally too high, so probably optimism bias was not a major issue.)

Some other things you could check for:

tracking what the "default outcome" would be, or whether there is a natural base rate, to see if there has been a systematic tendency to overestimate the chances of a non-default outcome (or to underestimate it)
dividing predictions up into different types, such as predictions about outcomes in the world (e.g. >20 new global cage-free commitments), predictions about inputs / changes within the organization (e.g. will hire a comms person within 9 months), and predictions about people's opinions (e.g. [expert] will think [the grantee’s] work is ‘very good’), to check for calibration & accuracy on each type of prediction
trying to distinguish the relative accuracy of different forecasters. If there are too few predictions per forecaster, you could check if any forecaster-level features are correlated with overconfidence or with Brier score (e.g., experience within the org, experience making these predictions, some measure of quantitative skills). The aggregate pattern of overconfidence in the >80% and <20% bins can show up even if most forecasters are well-calibrated and only (say) 25% are overconfident, as overconfident predictions are averaged with well-calibrated predictions. And those 25% influence these sorts of results graphs more than it seems, because well-calibrated forecasters use the extreme bins less often. Even if only 25% of all predictions are made by overconfident forecasters, half of the predictions in the >80% bins might be from overconfident forecasters
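The arithmetic behind the last point can be sketched with made-up numbers. The rates at which each group uses the >80% bin (60% and 20%) are assumptions chosen purely for illustration.

```python
# Hypothetical pool of 1,000 predictions.
overconfident_preds = 250   # 25% come from overconfident forecasters
calibrated_preds = 750      # 75% come from well-calibrated forecasters

# Assumed share of each group's predictions that land in the >80% bin.
overconfident_extreme = overconfident_preds * 60 // 100   # 150
calibrated_extreme = calibrated_preds * 20 // 100         # 150

share = overconfident_extreme / (overconfident_extreme + calibrated_extreme)
print(f"{share:.0%} of >80% predictions come from overconfident forecasters")
```

Under these assumptions, a quarter of the forecasting volume accounts for half of the extreme bin.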

Javier Prieto🔸

We do track whether predictions have a positive ("good thing will happen") or negative ("bad thing will happen") framing, so testing for optimism/pessimism bias is definitely possible. However, only 2% of predictions have a negative framing, so our sample size is too low to say anything conclusive about this yet.

Enriching our database with base rates and categories would be fantastic, but my hunch is that given the nature and phrasing of our questions this would be impossible to do at scale. I'm much more bullish on per-predictor analyses and that's more or less what we're doing with the individual dashboards.

Diego Oliveira 🔸

Thanks for sharing this! I've been forecasting myself for 5 months now (got 1005 resolved predictions so far), and I adopted a slightly different strategy to increase the number of samples: I only predict in the range [50%-100%]. After all, there doesn't seem to be any probabilistically or cognitively relevant difference between [predicting X will happen with 20% probability] and [not-X will happen with 80% probability]

What do you folks think about this?

Javier Prieto🔸

Thanks! That's a reasonable strategy if you can choose question wording. I agree there's no difference mathematically, but I'm not so sure that's true cognitively. Sometimes I've seen asymmetric calibration curves that look fine >50% but tend to overpredict <50%. That suggests it's easier to stay calibrated in the subset of questions you think are more likely to happen than not. This is good news for your strategy! However, note that this is based on a few anecdotal observations, so I'd caution against updating too strongly on it.

Diego Oliveira 🔸

Thanks for your reply. The possibility of asymmetry suggests even more that we shouldn't predict in the whole [0%-100%] range, but rather stick to whichever half of the interval we feel more comfortable with. All we have to do is get in the habit of flipping the "sign" of the question (i.e., taking the complement of the sample space) when needed, which usually amounts to adding the phrase "It's not the case that" in front of the prediction. This leads to roughly double the number of samples per bin, and therefore more precise estimates of our calibration. And since we have to map an event to a set that is now half the size it was before, it seems easier for us to get better at it over time.

Do you see any reason not to change Open Philanthropy's approach to forecasting besides the immense logistic effort this implies?

Ben Stewart

I echo the props; what a great way to live up to your name!

1. How do you intend to translate this analysis into practical change (if you think any change is warranted)?
2. In your opinion, how do the included forecasts affect grant-making decisions?
3. My thought here is that when making a forecast about a grant, the person may not be purely playing an accuracy game - they may also be considering what the strategic/communicative impact of such forecasts is likely to be (this could be one source of the 90+% miscalibration - if I wanted to sell an idea or seem confident, I could inflate my forecast so that a particular outcome is 90+% likely).

Javier Prieto🔸

Thanks!

1. We're currently providing calibration and accuracy stats to our grant investigators through our Salesforce app in the hopes that they'll find that feedback useful and actionable.
2. I'm not sure and I'd have to defer to decision-makers at OP. My model of them is that predictions are just one piece of evidence they look at.

Sharmake

On Number 3's criticism, that's just using approximated Solomonoff Induction, which is indeed a valid method. Of course, we do have biases that lead us astray, which is the problem with Solomonoff Induction.

brb243


What kinds of predictions do we make? Here are some examples:

Who suggests the outcomes that are predicted? Are these taken from the application or from a predefined set? Are you keeping/updating a database of outcomes that you try to achieve (e. g. complementary aspects of long-term wellbeing and security)?

How do you combine the estimates? Is there a function or conditional expression?

Considering the unparalleled computing power and breadth of prioritized knowledge of the human brain, would it be better to use intuition of the grant managers? Especially since the quantification depends on expert insights so does not avoid subjectivity.

Does the engagement of multiple evaluators reduce biases when quantifications are used? Human group dynamic navigation^[1] in qualitative discussions can optimize for accuracy. Discussants perceive others' expertise and weigh everyone's perspectives in complex ways while developing these viewpoints.^[2]

A relevant expressive statement can be: Don't get tricked by AI, it is better to be human.

^[1] if no threats are present
^[2] Individual quantitative estimates can be subject to 'normative' bias, where respondents reply by what is the most appropriate assuming static norms. Discussions can optimize for (collective) problem solving, notwithstanding traditional norms. This could also apply to evaluation of subjective outcomes, such as 'has the grantee influenced expert X well?'

^{^}

Here is a fuller list of reasons we make explicit quantified forecasts and later check them for accuracy, as described in an internal document by Luke Muehlhauser:

There is some evidence that making and checking quantified forecasts can help you improve the accuracy of your predictions over time, which in theory should improve the quality of our grantmaking decisions (on average, in the long run).
Quantified predictions can enable clearer communication between grant investigators and decision-makers. For example, if you just say it “seems likely” the grantee will hit their key milestone, it’s unclear whether you mean a 55% chance or a 90% chance.
Explicit quantified predictions can help you assess grantee performance relative to initial expectations, since it’s easy to forget exactly what you expected them to accomplish, and with what confidence, unless you wrote down your expectations when you originally made the grant.
The impact of our work is often difficult to measure, so it can be difficult for us to identify meaningful feedback loops that can help us learn how to be more effective and hold ourselves accountable to our mission to help others as much as possible. In the absence of clear information about the impact of our work (which is often difficult to obtain in a philanthropic setting), we can sometimes at least learn how accurate our predictions were and hold ourselves accountable to that. For example, we might never know whether our grant caused a grantee to succeed at X and Y, but we can at least check whether the things we predicted would happen did in fact happen, with roughly the frequencies we predicted.

^{^}

In some rare cases, it’s possible for the people managing the database to score predictions using information available to them. However, predictions tend to be very in-the-weeds, so scoring them typically requires input from the grant investigators who made them.

^{^}

The horizontal coordinate of the gray dots is calculated by averaging the confidence of all the predictions in each bin. Note that this is in general different from the midpoint of the bin; for example, if there are only two predictions in the 45%-55% bin and they have 46% and 48% confidence, respectively, then the point of perfect calibration in that bin would be 47%, not 50%.
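As a small sketch of this calculation: the two confidences are the ones from the example in this footnote, while the outcomes attached to them are made up.

```python
from statistics import mean

# (stated confidence, outcome) pairs falling in the 45%-55% bin.
# Confidences come from the footnote's example; outcomes are hypothetical.
bin_predictions = [(0.46, 1), (0.48, 0)]

x = mean(p for p, _ in bin_predictions)  # horizontal coordinate: mean confidence
y = mean(o for _, o in bin_predictions)  # vertical coordinate: observed frequency

print(f"perfect calibration for this bin: {x:.2f}, observed: {y:.2f}")
```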

^{^}

This footnote produced errors in the Forum editor — read it here.

^{^}

We’re leaving out focus areas with less than $10M moved in the subsequent analyses. The excluded focus areas are South Asian Air Quality, History of Philanthropy, and Global Health and Wellbeing.

^{^}

This sentence and some other explanatory language in this report are borrowed from an internal guide about forecasting written by Luke Muehlhauser.

^{^}

These intervals assume a uniform prior over (0, 1). This means that, for a bin with T true predictions and F false predictions, the intervals are calculated using a Beta(T+1, F+1) distribution.
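A minimal sketch of how such an interval could be computed using only the standard library. It assumes an equal-tailed interval and a 90% level; both are assumptions of this sketch, as are the function names. For integer parameters the Beta CDF equals a binomial tail sum, which avoids needing a statistics package.

```python
from math import comb


def beta_cdf(x, a, b):
    """CDF of Beta(a, b) for positive integers a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def beta_quantile(q, a, b, tol=1e-10):
    """Invert the CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def credible_interval(true_count, false_count, level=0.90):
    """Equal-tailed credible interval from a Beta(T+1, F+1) posterior."""
    a, b = true_count + 1, false_count + 1
    tail = (1 - level) / 2
    return beta_quantile(tail, a, b), beta_quantile(1 - tail, a, b)


# Hypothetical bin with 12 true and 3 false predictions.
low, high = credible_interval(true_count=12, false_count=3)
print(f"90% credible interval: [{low:.3f}, {high:.3f}]")
```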

^{^}

Detailed calibration data for each bin are provided below. Note that intervals are open to the left and closed to the right; a 30% prediction would be included in the 20-30 bin, but a 20% prediction would be included in the 10-20 bin.
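The binning rule can be sketched as below for predictions stated as whole percentages. The helper name `bin_label` is illustrative, and placing a 0% prediction in the first bin is an assumption, since the rule as stated leaves that edge case open.

```python
def bin_label(percent):
    """Return the 10-point bin for an integer percentage.

    Bins are open on the left and closed on the right:
    (0, 10], (10, 20], ..., (90, 100].
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Round up to the next multiple of 10; 0% falls into the first bin (assumption).
    upper = max(10, -(-percent // 10) * 10)
    return f"{upper - 10}-{upper}"


print(bin_label(30))  # 20-30
print(bin_label(20))  # 10-20
```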

^{^}

However, given that there is high variance in calibration across predictors, this may not be the best idea in all cases. For personal advice, predictors may wish to refer to their own calibration curve, or their team’s curve.

^{^}

This footnote produced errors in the Forum editor — read it here.

^{^}

A score of 0.25 is a reasonable baseline in our case because the base rate for past predictions happens to be very close to 50%. This means that predictors in the future could state 50% confidence on all predictions and, assuming the base rate stays the same (i.e. the population of questions that predictors sample from is stable over time), get close to perfect calibration without achieving any resolution.
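To illustrate the baseline, here is a short sketch with made-up outcomes. Any constant 50% forecast scores exactly 0.25, because each squared error is 0.5² regardless of the outcome.

```python
def brier_score(forecasts):
    """Mean squared difference between stated probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)


# Hypothetical outcomes with a base rate of exactly 50%.
outcomes = [1, 0, 1, 0, 1, 0, 0, 1]

# Always saying 50%: perfectly calibrated here, but with no resolution.
always_50 = brier_score([(0.5, o) for o in outcomes])

# A forecaster who says 90% for events that happen and 10% for those that don't.
sharp = brier_score([(0.9, o) if o else (0.1, o) for o in outcomes])

print(always_50)  # 0.25
print(round(sharp, 3))
```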

^{^}

For comparison, first-year participants in the Good Judgment Project (GJP) that were not given any training got a score of 0.21 (appears as 0.42 in table 4 here; Tetlock et al. scale their Brier score such that, for binary questions, we’d need to multiply our scores by 2 to get numbers with the same meaning). The Metaculus community averages 0.150 on binary questions as of this writing (May 2022). Both comparisons have very obvious caveats: the population of questions on GJP or Metaculus is very different from ours and both platforms calculate average Brier scores over time, taking into account updates to the initial forecast, while our grant investigators only submit one forecast and never try to refine it later.
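The factor of 2 can be checked with a short sketch, assuming the scaled version sums the squared error over both answer options ("yes" and "no") of a binary question. The function names are illustrative.

```python
def brier_single(prob, outcome):
    """Squared error on the 'yes' option only (the 0-to-1 scale used here)."""
    return (prob - outcome) ** 2


def brier_two_sided(prob, outcome):
    """Squared error summed over both options (a 0-to-2 scale)."""
    yes_error = (prob - outcome) ** 2
    no_error = ((1 - prob) - (1 - outcome)) ** 2
    return yes_error + no_error


# For any binary forecast the two-sided score is twice the single-sided one.
print(brier_single(0.7, 1), brier_two_sided(0.7, 1))
```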

^{^}

For a base rate of 50%, resolution ranges from 0 (worst) to 0.25 (best). OP’s resolution is 0.037.
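A sketch of where this range comes from, assuming "resolution" refers to the term in the standard Murphy decomposition (Brier score = reliability − resolution + uncertainty). The data are made up, and predictions are grouped by identical stated probability, which makes the decomposition exact.

```python
from collections import defaultdict


def murphy_decomposition(forecasts):
    """Return (reliability, resolution, uncertainty) for (prob, outcome) pairs."""
    n = len(forecasts)
    base_rate = sum(o for _, o in forecasts) / n

    groups = defaultdict(list)  # outcomes grouped by stated probability
    for prob, outcome in forecasts:
        groups[prob].append(outcome)

    reliability = sum(
        len(outs) * (prob - sum(outs) / len(outs)) ** 2 for prob, outs in groups.items()
    ) / n
    resolution = sum(
        len(outs) * (sum(outs) / len(outs) - base_rate) ** 2 for outs in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)  # 0.25 when the base rate is 50%
    return reliability, resolution, uncertainty


# Hypothetical data: five 80% forecasts (4 true) and five 20% forecasts (1 true).
forecasts = [(0.8, 1)] * 4 + [(0.8, 0)] + [(0.2, 1)] + [(0.2, 0)] * 4
reliability, resolution, uncertainty = murphy_decomposition(forecasts)
brier = sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

print(f"resolution={resolution:.3f}, uncertainty={uncertainty:.3f}, brier={brier:.3f}")
```

Resolution can never exceed the uncertainty term, which is why 0.25 is the ceiling at a 50% base rate.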

^{^}

A caveat about this data: I’m taking the difference between ‘End Date’ (i.e. when a prediction is ready to be assessed) and ‘Investigation Close Date’ (the date the investigator submitted their request for conditional approval). This underestimates the time span between forecast and resolution because predictions are made before the investigation closes. It also explains why some time deltas are slightly negative: most likely, the grant investigator wrote the prediction long before submitting the write-up for conditional approval.

^{^}

This is in line with evidence from GJP and (less so) Metaculus showing that accuracy drops as time until question resolution increases. However, note that the opposite holds for PredictionBook, i.e. Brier scores tend to get better the longer the time horizon. Our working hypothesis to explain this paradoxical result is that, when users get to select the questions they forecast on (as they do on PredictionBook), they will only pick “easy” long-range questions. When the questions are chosen by external parties (as in GJP), they tend to be more similar in difficulty across time horizons. Metaculus sits somewhere in the middle, with community members posting most questions and opening them to the public. We may be able to test this hypothesis in the future by looking at data from Hypermind, which should fall closer to GJP than to the others because questions on the platform are commissioned by external parties.

^{^}

This selection effect could come about through several mechanisms. One such mechanism could be picking well-defined processes more often in long-range forecasts than in short-range ones. In those cases, what matters is not the calendar time elapsed between start and end but the number and complexity of steps in the process. For example, a research grant may contain predictions about the likely output of that research (some finding or publication) that can’t be scored until the research has been conducted. If the research was delayed for some reason, or if it happens earlier than expected due to e.g. a sudden influx of funding, that doesn’t change the intrinsic difficulty of predicting anything about the research outcomes themselves.
