The superforecasting phenomenon - that certain teams of forecasters are better than other prediction mechanisms like large crowds and simple statistical rules - seems sound. But serious interest in superforecasting stems from the reported triumph of forecaster generalists over non-forecaster experts. (Another version says that they also outperform analysts with classified information.)
So distinguish some claims:
- "Forecasters > the public"
- "Forecasters > simple models"
- "Forecasters > experts"
3a. "Forecasters > experts with classified info"
3b. "Averaged forecasters > experts"
3c. "Aggregated forecasters > experts"
Is (3) true? This post reviews all the studies we could find on experts vs forecasters. (We also attempt to cover the related question of prediction markets vs experts.)
First, our conclusions. These look pessimistic, but are mostly pretty uncertain:
- We think claim (1) is true with 99% confidence and claim (2) is true with 95% confidence. But surprisingly few studies compare experts to generalists (i.e. study claim 3). Of those we found, the analysis quality and transparency leave much to be desired. The best study found that forecasters and health professionals performed similarly. In other studies, experts had goals besides accuracy, or there were too few of them to produce a good aggregate prediction.
- (3a) A common misconception is that superforecasters outperformed intelligence analysts by 30%. Instead: Goldstein et al. showed that [EDIT: the Good Judgment Project's best-performing aggregation method] outperformed the intelligence community, but this was partly due to the different aggregation technique used (the GJP weighting algorithm performs better than prediction markets, given the apparently low volumes of the ICPM market). The forecaster prediction market performed about as well as the intelligence analyst prediction market; and in general, prediction pools outperform prediction markets in the current market regime (e.g. low subsidies, low volume, perverse incentives, narrow demographics). [85% confidence]
- (3b) In the same study, the forecaster average was notably worse than the intelligence community.
- (3c) Ideally, we would pit a crowd of forecasters against a crowd of experts. Only one study, an unpublished extension of Sell et al., manages this; it found a small (~3%) forecaster advantage.
- The bar may be low. That is: it doesn't seem that hard to become a top forecaster, at present. Expertise, plus basic forecasting training and active willingness to forecast regularly, were enough to be on par with the best forecasters. [33%]
- In more complex domains, like ML, there could be significant returns to expertise. So it might be better to broaden focus from generalist forecasters to competent ML pros who are excited about forecasting. [40%]
Table of studies
US Intelligence Community Prediction Market (ICPM)
Good Judgment Project (GJP): an average, vs a prediction market (PM), vs the best method (selected post hoc among 20).
N=139 geopolitical questions
Equal performance for expert and forecaster prediction markets. The best aggregation method was notably better than ICPM. The best method was selected post hoc among 20, but several of the other methods performed within 2% of the best.
Mean of means of daily Brier scores (MMBD)
** p < .001 vs ICPM.
Unpublished document used to justify the famous “Supers are 30% better than the CIA” claim.
The most direct comparison between forecasters (GJP PM) and experts (ICPM) finds similar performance (no statistically significant difference).
Prediction markets seem worse than super-aggregating opinion pools (see Appendix A); this study itself shows a large gap between GJP (PM) and GJP (best).
Christian Ruhl offers inside information about the study context here.
Qualitative forecasts from intelligence reports. Seasoned professional analysts produced initial forecasts and forecasts imputed from the reports,
vs the aforementioned ICPM
N=99 geopolitical questions, 28 of which had “fuzzy” resolution criteria
The mean absolute error of the ICPM was better (p<.001) than that of the reports. Moreover, the initial forecasts by seasoned intelligence analysts were better (p<.05) than the forecasts imputed by them from the reports. Note that the initial forecasts were almost as good as ICPM forecasts.
Mean absolute error
Initial and imputed probabilities were compared to ICPM probabilities selected on the days on which the readers submitted their initial and imputed probabilities.
Due to the posting delay, ICPM had information not available to the report authors. However, longer posting delays would decrease ICPM advantage.
Both ICPM probabilities and imputed estimates were poorly calibrated: with Calibration Indexes of .047 and .097 respectively (much higher than .025, .014, and .016 from other studies).
Mandel (2019) critiques the study. Their Table 1 is illuminating:
Mean Brier scores
Updated personal forecasts did better than ICPM (p=.087). The data suggest that seasoned analysts performed comparably to the prediction market. Note that their initial average Brier scores ranged from .145 to .362, so there is room for selection.
(See fn 5 for whether we can conclude anything about the quality of intelligence reports.)
ICPM vs InTrade vs the “10 best IC experts we could identify on each topic”
Note that the three groups answered different questions ("approximately matching topics")
The market prices provided significantly more accurate forecasts than experts (p < 0.01).
No statistical difference in accuracy between the ICPM and InTrade.
Brier score summary stats
The different n per group is confusing (it suggests that predictions by group might not have been well balanced). It would have been better if every forecast of an IC SME was matched with a forecast from ICPM and InTrade on the same day and on the same time horizon.
InTrade is at .037, which suggests that traders were predicting confidently and were rarely (if ever) on the wrong side of maybe.
Hypermind + Johns Hopkins study. Started a year before the pandemic.
Health pros (n=388) vs Hypermind forecasters
n=61 settled questions
Public health pros (n=149) vs Hypermind forecasters (n=88)
(Sample from the talk is the subset of the crowd which was recruited earliest, thus with the most opportunities to forecast questions.)
From the paper:
On the face of it, roughly equal performance. Of the top 10 forecasters:
And 1st place went to one of the very few public-health professionals who was also a skilled Hypermind forecaster.
Key problem: experts got busy with the pandemic, so forecasters updated their forecasts relatively more often.
From the talk:
Experienced life science pros (n=10)
Top-1% Metaculus forecasters (n=11)
Consensus: the aggregate of the 2 groups
Only 6 out of 23 questions have resolved. They concerned the safety, efficacy, and timing of a COVID-19 vaccine.
Trained forecasters had the highest log scores on average, followed by consensus models, and then subject-matter experts (nonsignificantly: the study is underpowered).
Two semi-mechanistic models
Ensemble of all models submitted to the Forecast Hub
Crowd forecasts based on n=32 forecasters (17 are self-identified experts in forecasting or epidemiology)
The crowd consistently outperformed the epidemiological models as well as the Hub ensemble when forecasting cases, but not when forecasting deaths.
Weighted Interval Score (WIS, the lower the better) relative to the Hub ensemble
For cases, forecaster contributions (compared to the Hub ensemble without forecaster contributions) consistently improved performance across all forecasting horizons (e.g., rel. WIS 0.9, two weeks ahead).
For deaths, contributions from the renewal model and crowd forecast together improved performance only for one week ahead predictions and showed an increasingly negative impact on performance for longer horizons (rel. WIS 1.01 two weeks ahead, 1.05 four weeks ahead). Individual contributions from both the renewal model and the crowd forecast were largely negative.
Not clear how good Forecast Hub models were, but their credentials were impressive.
Still suggests that crowd forecasting might be useful in practice.
A single superforecaster vs CDC-funded panel of experts
n=28 pandemic-related questions from UMass
The forecaster did 10% better than the experts as judged by Brier score.
As usual, it’s unclear if the panel faced incentives other than forecasting accuracy.
Movie critics (n=40) vs Betfair, a prediction market: variable n, including “low liquidity markets”
Task: Predicting Oscar winners
The prediction market's RMSE was 10%+ better than the pundits'.
RMSE for the 2013 Oscars
(The Hollywood Stock Exchange seems to do 10% to 50% worse than Betfair, InTrade, and PredictWise.)
Hollywood Stock Exchange, a virtual-points prediction market
Two expert predictions: Box Office Mojo, Box Office Report.
HSX is much better than BOR in terms of MAPE (n=24), and the recalibrated HSX predictions are nonsignificantly different from BOM (n=140).
An impressively accurate model built on top of FantasySCOTUS predictions; from Ruger et al. (2004) we know that simple models outperform experts.
FantasySCOTUS
The Forecasting Project’s decision tree vs FantasySCOTUS vs The Forecasting Project’s experts
FantasySCOTUS most active users vs other users
Decision tree: 75%
(but this comparison is across different judges and terms)
The "power predictor" average, 7.93 points, was higher than the crowd average, 7.25 points.
“The results do not conclusively prove that the power predictors’ forecasts were superior to those of the crowd. Although the power predictors generally do better, the crowd is able to make rather strong predictions to bridge the gap.”
Most (at least 75%) of the active FantasySCOTUS bettors have specialized backgrounds.
Fairly simple decision tree vs subject matter experts
The model predicted 75% of the cases correctly, which was more accurate than the experts' 59.1%.
The Forecasting Project, SCOTUS
Hypermind and 7 statistical models.
6 questions on the U.S. 2014 midterm elections: majority control of the Senate and the 5 most undecided states.
Mean Daily Brier Score
The n is low and the errors are somewhat correlated, so this isn't particularly informative.
Corporate: demand forecasting, project completion, project quality, external events
MSE of prediction market / MSE of the firms' experts
ABM = an anonymous "basic materials" conglomerate
We were given a set of initial studies to branch out from.
And some general suggestions for scholarship:
- look for review articles
- look for textbooks and handbooks or companions
- find key terms
- go through researchers’ homepages / Google Scholar
Superforecasting began with IARPA’s ACE tournament. (Misha thinks the evidence in Tetlock’s Expert Political Judgment doesn’t fit our purposes: there were no known skilled-amateur forecasters at that point.)
A Google Scholar search for studies funded by IARPA ACE yielded no studies. We also looked at other IARPA projects (ForeST, HCT, and OSI) which sounded at least remotely relevant to our goals.
We searched Google Scholar for (non-exhaustive list): “good judgment project”, “superforecasters”, “collective intelligence”, “wisdom of crowds”, “crowd prediction”, “judgemental forecasting”, …, and various combinations of these, and “comparison”, “experts”, …
We got niche prediction markets from the Database of Prediction Markets and searched for studies mentioning them. Hollywood SX and FantasySCOTUS paid off as a result. We also searched for things people commonly predict: sports, elections, Oscars, and macroeconomics.
In the process, we read the papers for additional keywords and references. We also looked for other papers from the authors we encountered.
On AI forecasting
In more complex domains, like ML, there could be significant returns to knowledge and expertise. It seems to us that moving from generalist forecasters to competent ML practitioners/researchers might be in order, because:
- To predict e.g. scaling laws and emerging capabilities, people need to understand them, which requires some expertise and understanding of ML
- It's unclear whether general forecasters actually outperform experts in a legible domain, even though we believe in the phenomenon of superforecasting (that some people are much better forecasters than most). We also liked David Manheim's take on Superforecasting.
- We think that this will plausibly reduce ML researchers’ aversion to forecasting proposals — and if we were to execute it, we would be selecting good forecasters based on their performance anyway. It seems potentially feasible.
Finally, we note that the above is heavily limited by a lack of data (both a lack of data collected and a lack of availability of what was collected). We hope that the experimental data gets reanalyzed at least.
Thanks to Emile Servan-Schreiber, Luke Muehlhauser, and Javier Prieto for comments. These commenters don't necessarily endorse any of this. Mistakes are our own. Research funded by Open Philanthropy.
This is almost a trivial claim, since forecasters are by definition more interested in current affairs than average, and much more interested in epistemics than average. So we’d select for the subset of “the public” who should outperform simply through increased effort, even if all the practice and formal calibration training did nothing, and it probably does do something.
Previously this section said "superforecasters"; after discussion, it seems more prudent to say "the Good Judgment Project's best-performing aggregation method". See this comment for details.
Our exact probability hinges on what's considered low and on how good e.g. Hypermind's trained forecasters are. This is less obvious than it seems: in the CSET Foretell tournament, the top forecasters were not uniformly good; a team which included Misha finished with a 4x better relative Brier score than the "top forecaster" team. Further, our priors are mixed: (a) common sense makes us favor experts, (b) but common sense also somewhat favors expert forecasters, (c) Tetlock's work on expert political judgment pushes us away from politics experts, and finally (d) we have first-hand experience about superforecasting being real.
All Surveys Logit "takes the most recent forecasts from a selection of individuals in GJP’s survey elicitation condition, weights them based on a forecaster’s historical accuracy, expertise, and psychometric profile, and then extremizes the aggregate forecast (towards 1 or 0) using an optimized extremization coefficient.” Note that this method was selected post hoc, which raises the question of multiple comparisons; the authors respond that “several other GJP methods were of similar accuracy (<2% difference in accuracy).”
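For concreteness, here is a minimal sketch of this style of aggregation: a weighted average in log-odds space followed by extremization. The forecasts, weights, and extremization coefficient below are made up for illustration and are not GJP's actual values or code.

```python
import numpy as np

def extremized_logit_aggregate(probs, weights, a=2.5):
    """Weighted aggregation in log-odds space, then extremization.

    probs   : individual forecasts in (0, 1)
    weights : per-forecaster weights (e.g. from past accuracy); need not sum to 1
    a       : extremization coefficient (a > 1 pushes the aggregate towards 0 or 1)
    """
    probs = np.clip(np.asarray(probs, dtype=float), 1e-6, 1 - 1e-6)
    weights = np.asarray(weights, dtype=float) / np.sum(weights)
    mean_log_odds = np.sum(weights * np.log(probs / (1 - probs)))
    return 1 / (1 + np.exp(-a * mean_log_odds))  # back to a probability

# e.g. three forecasters at 60%, 70%, 65% with equal weights -> ~0.83,
# more extreme than the ~0.65 simple average
print(extremized_logit_aggregate([0.6, 0.7, 0.65], [1, 1, 1]))
```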
There is some inconclusive research comparing real- and play-money markets: Servan-Schreiber et al. (2004) find no significant difference for predicting the NFL (American football); Rosenbloom & Notz (2006) find that in non-sports events, real-money markets are more accurate, and that they are comparably accurate for sports markets; and Slamka et al. (2008) find real- and play-money prediction markets comparable for UEFA (soccer).
MMBD is not a proper scoring rule (one incentivizing truthful reporting). If a question has a chance of resolving early (e.g., all questions of the form “will X occur by date?”), the rule incentivizes forecasters to report higher probabilities for such outcomes. This could have affected GJP (avg and best) predictors, who were rewarded for it; but it should not have affected ICPM and GJP (PM), as these used the Logarithmic Market Scoring Rule.
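As a reference point, a toy sketch of the MMBD computation under the straightforward reading of its name (daily Brier scores averaged within a question, then across questions); it uses the simple squared-error Brier for binary questions, which is our simplification rather than the tournament's exact scoring code.

```python
import numpy as np

def mmbd(questions):
    """Mean of means of daily Brier scores: average a question's daily Brier
    scores over days, then average those per-question means over questions.

    questions: list of (daily_probs, outcome) pairs for binary questions,
    where daily_probs holds the standing forecast on each scored day.
    """
    per_question = [np.mean((np.asarray(p, dtype=float) - o) ** 2)
                    for p, o in questions]
    return float(np.mean(per_question))

# Two toy questions: the first resolved Yes (1), the second No (0) -> ~0.068
print(mmbd([([0.6, 0.8, 0.9], 1), ([0.3, 0.2], 0)]))
```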
See Sempere & Lawsen (2021) for details.
Our understanding is that these were not averaged. On average there were ~2.5 imputed predictions per report.
It's unclear if imputers did a reasonable job separating their personal views from their imputations. Mandel (2019) notes that the Pearson correlation between mean Brier scores for personal and imputed forecasts is very high, r(3)=.98, p=.005. Imputers' average Brier scores ranged from .145 to .362, suggesting that the apparent accuracy of traditional analysis depends on whether imputers are better or worse forecasters. Lehner and Stastny (2019) responded.
"We replicated some of these markets in the ICPM, or identified closely analogous predictions if they existed, so that direct comparisons between the two prediction markets could be made over time."
“We repeatedly collected forecasts from our markets and our experts to sample various time horizons, ranging from very near-term forecasts to as long as 4 months before a subject was resolved. All told, we collected 152 individual forecasts from the ICPM, InTrade, and individual IC experts over approximately matching topics and time horizons.”
A perfectly calibrated forecaster who reports probability p expects on average p(1−p) Brier points from their prediction. So this average Brier suggests that a “typical” InTrade prediction was either <4% or >96%. From experience, this feels too confident and suggests either that questions were biased towards low noise or that luck is partly responsible for such good performance.
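A quick sketch of that arithmetic, assuming the simple squared-error Brier for binary questions and perfect calibration (our assumptions, not the paper's calculation):

```python
import numpy as np

def implied_confidence(avg_brier):
    """Under perfect calibration, a forecast of p has expected Brier p*(1-p).
    Solve p*(1-p) = avg_brier to back out the implied forecast probabilities."""
    root = np.sqrt(1 - 4 * avg_brier)
    return (1 - root) / 2, (1 + root) / 2

print(implied_confidence(0.037))  # ≈ (0.038, 0.962): forecasts near 4% or 96%
```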
Personal communication with Servan-Schreiber.
Given N log scores, scaled rank assigns a value of 1/N to the smallest log score, a value of 2/N to the second smallest, and so on, up to a value of 1 for the highest log score. (As with log scores, which are here computed from probability density functions, the higher the rank the better.)
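A small sketch of that definition (ties are not handled specially here):

```python
import numpy as np

def scaled_ranks(log_scores):
    """Scaled rank: the smallest log score gets 1/N, the second smallest 2/N,
    and so on, with the largest getting 1. Higher is better, as with log scores."""
    log_scores = np.asarray(log_scores, dtype=float)
    n = len(log_scores)
    ranks = np.argsort(np.argsort(log_scores)) + 1  # 1..N by ascending log score
    return ranks / n

print(scaled_ranks([-2.3, -0.7, -1.4]))  # [0.333..., 1.0, 0.666...]
```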
It’s unclear to me how well they did compared to a prior based on how often SCOTUS reverses decisions. The historical average reversal rate is ~70%, with ~80% reversals in 2008, the relevant term.