Note: I initially published this summary as a blog post here. The blog has more information about the context of this post and my reasoning transparency. In brief, this post was me summarizing what I learned about prediction polling, a specific flavor of forecasting, while trying to understand how it could be applied to forecasting global catastrophic risks. If I cite a source, it means I read it in full. If I relay a claim, it means it's made in the cited source and that I found it likely to be true (>50%). I do not make any original arguments (that's done elsewhere on the blog), but I figured this might be helpful in jumpstarting other people's understanding on the topic or directing them towards sources that they weren't aware of. Any and all feedback is most welcome.
The text should all be copied exactly from my original post, but I redid the footnotes/citations to use the EA forum formatting. If any of those are missing/seem incorrect please let me know.
Prediction polling is asking someone their expected likelihood of an event. I ask, “What is the percent chance that a meteorite larger than a baseball falls through my roof before 2025?” and you say “2%”. You’ve been prediction polled.
This term isn’t very common in the research that I’ve reviewed so far. If you’ve seen these topics discussed before, it was probably under a heading like “Forecasting Tournaments”. In fact, that’s what I initially planned on titling this post. To date, most forecasting tournaments have used prediction polling, but in principle they could be competitions between forecasters using any forecasting technique, and some have actually been used to compare prediction polling with prediction markets. So, this post is focused on asking people to directly estimate likelihoods, but it will end up covering a lot of the most significant historical forecasting tournaments that I’m aware of.
Polls are almost legendarily inaccurate, so why would we think we could get useful information for predicting future events? Well, researchers weren’t at all sure that we could, and lots of laboratory experiments showed that humans were fallacy-prone beasts with a very weak grasp of probabilistic fundamentals. Not to mention the complexity of the real world we inhabit.
Luckily, the Intelligence Advanced Research Projects Activity (IARPA), an organization within the US federal government, kicked off the Aggregative Contingent Estimation (ACE) Program in 2011 to try to find ways to improve the accuracy of intelligence forecasts. It pitted teams of researchers against each other to find out who could most accurately predict real world events in geopolitics. Examples included whether a given ruler would be overthrown, the odds of an armed conflict, or the price of a given commodity. These questions were all objectively resolvable with clear criteria. This means that we can retrospectively compare which teams predicted outcomes successfully, how accurate they were, and how quickly they came to their conclusions.
Of particular interest is the winning team of the initial tournament (the first two years), the Good Judgment Project (GJP), led by Philip Tetlock. Not only did it win, it split its large pool of recruited forecasters between various experimental conditions that would let us compare their relative effects on forecaster accuracy. Forecasts on a given question could be continually updated until it was resolved, and the best GJP forecasting method was on the right side of 50/50 in 86.2% of all the daily forecasts across 200 questions selected for their relevance to the intelligence community! This was 60% better than the average of solo forecasters and 40% better than the other teams. As a happy side effect, participants’ opinions on which US policies should be implemented became less extreme on a liberal/conservative spectrum after 2 years of forecasting, despite this not being at all the topic of the questions. Researchers theorized that this could have been caused by an overall reduction in confidence thanks to the grueling epistemic conditions of working with uncertainty paired with a very tight feedback loop on the accuracy of their predictions.
From their experiments we learned that you could train forecasters in debiasing their predictions to improve their accuracy by 10%, put them in collaborative teams to increase their accuracy by 10%, or use an aggregation algorithm to combine the forecasts of many different forecasters and improve accuracy by 35% compared to an unweighted average of those forecasts! These effects were found to be largely independent, meaning you could stack them for cumulative effects; thus the dominance of GJP in the competition. Perhaps the most interesting finding was that the most accurate individual forecasters in each experimental condition in year 1, the top 2%, continued to outperform their peers in year 2. This showed that at least part of forecasting performance was an individual skill that could be identified and/or cultivated, rather than just luck.
Digging slightly deeper into these interventions, the de-biasing training was only conducted once per year for 9-month-long tournament sessions, but forecasters that participated in it showed increased accuracy over their peers for that full duration. After the first 2 years of the GJP, the training was updated to be graphical, incorporate feedback from top forecasters, and add a module on political reasoning. Overall the training was still designed to take less than an hour, but trained forecasters remained more accurate than untrained ones over the course of each year. Both groups of forecasters grew more accurate over time with practice, but trained individuals seemed to improve faster. Researchers theorized that training could probably have a much stronger impact than this based on the low intensity (1 hour, once a year!) and the minimal refinement its contents had gone through.
Multiple team conditions were examined: forecasters collaborating with teammates were the most accurate, followed by those who could see each other’s forecasts but not collaborate, followed by fully independent forecasters. This might not be what you would have expected a priori, with phenomena like groupthink and diffusion of responsibility providing mechanisms by which teamwork could have reduced accuracy.
In years 2 and 3 of the GJP, the team dynamics associated with success were examined in more detail. Teammates collaborated via an online platform where they were able to see forecasts from their teammates as well as comment on these forecasts and other comments. Teams formed from the top 2% most accurate forecasters from the prior year performed above and beyond what would be predicted from their individual accuracy, implying additional factors at play. These teams left more and longer rationales accompanying their initial forecasts, as well as more and longer comments. Engagement in these conversations was much more evenly distributed among top teams than other teams. More team members being engaged on a given question was also associated with increasing forecast accuracy for that question, plateauing around 50-60% team participation. Analysis of the text contents of forecast explanations and responses showed that top teams more often discussed metacognition, thinking about how they were thinking. They also more frequently employed concepts from their probability/debiasing training, and talked about collaborating with their teammates.
After 10 years of GJP participants, from the aforementioned “superforecasters” to Mechanical Turk workers, writing rationales to accompany their numerical forecasts, researchers looked for patterns associated with accuracy. Their takeaway was that “If you want more accurate probability judgments in geopolitical tournaments, you should look for good perspective takers who are tolerant of cognitive dissonance (have high IC [Integrative Complexity] and dialectical scores) and who draw adeptly on history to find comparison classes of precedents that put current situations in an outside-view context.”
The massive gains in accuracy provided by aggregation algorithms are also interesting. Early aggregation efforts found that forecasters tended to be underconfident that a predicted event would occur. Using forecasters’ self-evaluations of expertise along with their past accuracy, the aggregation algorithm could be made quite accurate. Researchers theorized that the systematic underconfidence that aggregation algorithms corrected for came from two main sources:
A rational forecaster will have an initial forecast of 50% for a binary question and update towards the expected correct outcome (either 0% or 100%) as they gain information and therefore confidence. This prior and the impossibility of having total information lead them to underpredict outcomes.
The bounded nature of the 0-100% probability scale means that random noise is more likely to pull your forecast towards whichever end of the scale it’s farthest from. That is, if your current forecast is 90%, there’s a lot more room for random noise to pull you back towards 0%.
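The aggregation papers cited below combine forecasts in log-odds space and then “extremize” the result to counteract this underconfidence. Here's a minimal sketch of that idea, assuming a single fixed extremization exponent `a`; the actual GJP algorithms were more elaborate, also weighting forecasters by attributes like past accuracy:

```python
import math

def extremized_logit_aggregate(probs, a=2.0):
    """Average forecasts in log-odds space, then extremize with a > 1
    to counteract the systematic underconfidence described above."""
    eps = 1e-6  # clamp to avoid infinite log-odds at exactly 0 or 1
    clamped = [min(max(p, eps), 1 - eps) for p in probs]
    mean_logit = sum(math.log(p / (1 - p)) for p in clamped) / len(clamped)
    # a > 1 pushes the combined forecast away from 50%;
    # a = 1 is a plain log-odds average.
    return 1 / (1 + math.exp(-a * mean_logit))

# Three moderately confident forecasts combine into a more extreme one,
# landing above any individual input.
print(extremized_logit_aggregate([0.7, 0.8, 0.75]))
```

The intuition: if every forecaster holds some independent evidence but hedges toward 50%, the crowd collectively "knows" more than any single member reports, so the aggregate should be more confident than the average.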
A later mathematical exploration of forecasting error created a model for disaggregating the contributions of interventions to improvements in accuracy into three components: Bias, Information, and Noise. Bias is systematic error across predictions, such as being chronically overconfident in change; information error stems from having an incomplete picture of events; and noise is random error. The key finding is that all of the interventions in the GJP (training, teaming, and identifying superforecasters) primarily improved accuracy via reduction in the noise component of this model: roughly a 50% contribution vs. 25% each for bias reduction and information improvement. The relative consistency of this across the different interventions, combined with my inability to follow the math involved and the model clashing with my own intuitions and experiences, all keep me from putting too much weight on their explanation for why this model works. That being said, the researchers share that this model underpins their most successful aggregation algorithm to date, which is being used on work that is still in progress. My unfounded suspicion is that this math is correct and representing something real, but that these parameters don’t directly correspond to the concepts they’ve been named after.
You can think of prediction markets, where participants bet on the outcome of an event and the current market price reflects a likelihood estimate, as a sort of aggregation algorithm. Researchers used years 2 and 3 of the GJP to have some forecasters submit predictions via markets in addition to the predominant polling condition. The aggregation algorithm on polled predictions was found to be significantly more accurate than prediction markets, which in turn were found to be significantly more accurate than the simple averaging of forecasts.
This wasn’t an apples-to-apples comparison, as no prediction market users were in the “team” condition of the experiments, while some prediction poll users were. Additionally, the lack of real money incentives likely reduced market accuracy. Interestingly, the aggregated polls were most accurate relative to the markets on longer duration questions and particularly during the beginning and end of their open periods. Researchers speculated that this is when forecasters were most uncertain, and this uncertainty likely translated into large spreads and lower liquidity. Some forecasters were asked to submit their predictions via both polling and trading in markets at the same time, and aggregation algorithms that incorporated both sets of data outperformed all the rest, indicating that additional information was captured just by asking again for what should have been the same thing.
In addition to a variety of experimental conditions within the tournament, GJP also subjected participants to a battery of psychometric tests at the start of each year. This allowed researchers to look for patterns in what consistently differentiated the most accurate forecasters from the rest. The most accurate forecasters were found to…
- Be at least one standard deviation higher in fluid intelligence (across many means of measurement) than the general population.
- Be at least one standard deviation higher than the general population in crystallized intelligence, and even higher on political knowledge questions.
- Enjoy solving problems more, have a higher need for cognition, and be more inclined towards actively open-minded thinking than other participants.
- Be more inclined towards a secular agnostic/atheistic worldview that did not attribute events to fate or the supernatural, including close-call counterfactuals.
- Be more scope sensitive.
- Make forecasts with greater granularity, and this greater granularity contributed to their higher accuracy.
- Answer more questions, update their forecasts more frequently, and investigate relevant news articles more often.
- Engage with their teams more often with more, longer comments and forum posts. These comments were more likely to contain questions and they were more likely to respond to questions being asked of them. They were also more likely to reply to the comments of others in general.
- Converge towards a tighter distribution of responses over time with their teammates, while other teams actually diverged. 
Many of the above points were replicated via a later IARPA tournament on a population of forecasters more diverse than the mainly academic recruits from the GJP. Forecasting performance remained stable over time and could be best predicted by past performance, though fluid intelligence, numeracy, and engagement were still shown to be significant predictors.
The point above regarding granularity refers to how finely along the percentage scale from 0-100% forecasters could meaningfully differentiate between probabilities. Researchers could test this after the fact by rounding forecasts into various sized “bins” and seeing if this degraded accuracy. They found that the top 2% of forecasters benefited significantly from using at least 11 bins while the entire population of forecasters benefited from using more than 7. This didn’t vary significantly based on forecaster attributes, type of question, or duration of question.
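A rough sketch of that coarsening test, using hypothetical forecasts and Brier scores (squared error against the 0/1 outcome; lower is better) rather than the researchers' actual data or scoring details:

```python
def round_to_bins(p, n_bins):
    """Round a probability to the nearest of n_bins evenly spaced
    values on [0, 1], mimicking the after-the-fact coarsening test."""
    step = 1 / (n_bins - 1)
    return round(p / step) * step

def brier(forecast, outcome):
    """Squared error of a probability against a 0/1 outcome; lower is better."""
    return (forecast - outcome) ** 2

# Hypothetical forecasts paired with eventual outcomes, purely for illustration.
data = [(0.92, 1), (0.15, 0), (0.65, 1), (0.38, 0)]
for n_bins in (3, 7, 11):
    score = sum(brier(round_to_bins(f, n_bins), o) for f, o in data) / len(data)
    print(n_bins, "bins ->", round(score, 4))
```

If coarsening forecasts into fewer bins degrades the score, the precision that was rounded away must have been carrying real information.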
Across the first 4 years of the GJP, researchers also examined the link between how forecasters updated their predictions and their accuracy. Forecasters who updated their predictions more frequently tended to be more accurate, scored higher on crystallized intelligence and open-mindedness, accessed more information, and improved over time. Forecasters who made updates of smaller magnitude also tended to be more accurate; they scored higher on fluid intelligence and had more accurate initial forecasts. These update behaviors account for part of the impact training had on improving accuracy. Conversely, large updates and affirming prior forecasts rather than adjusting them were associated with lower accuracy. Uniformly adjusting the magnitude of forecast updates almost exclusively decreased forecast accuracy: decreasing update magnitude by 30% significantly worsened accuracy, while increasing it by 30% very slightly improved accuracy, an effect dwarfed by the other interventions by 1-2 orders of magnitude. It’s unclear how specific these results are to the domain of near-term geopolitical forecasting. It’s possible that these update behaviors were most accurate because they best mirrored the reality of unfolding events and information availability, and that the most accurate behaviors in other contexts would be very different.
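The update frequency and magnitude measures above can be operationalized roughly as follows. This is an illustrative sketch; the cited paper's exact definitions (e.g. how affirmations of an existing forecast are counted) may differ:

```python
def update_stats(forecasts):
    """Given one forecaster's chronological forecasts on a question,
    return (number of revisions, mean absolute revision magnitude).
    Repeating the same value counts as an affirmation, not a revision."""
    revisions = [abs(b - a) for a, b in zip(forecasts, forecasts[1:]) if b != a]
    if not revisions:
        return 0, 0.0
    return len(revisions), sum(revisions) / len(revisions)

# A forecaster drifting upward in small steps: 3 revisions of mean size ~0.067,
# the frequent-but-small pattern associated with higher accuracy.
print(update_stats([0.5, 0.55, 0.6, 0.7]))
```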
While all of this fantastic research from the GJP informed me greatly about what kinds of forecasting were viable in the world of geopolitics, and in what contexts, there were still significant gaps in the way of confidently applying prediction polling to forecasting Global Catastrophic Risks (GCRs).
The most glaring gap is the long delay before finding out if a prediction was accurate or not. When a question is resolving in a few months you can just wait to find out. When it’s resolving in a few decades, this approach isn’t as useful. If you want to predict counterfactual scenarios, such as the impact of a hypothetical policy being implemented, even waiting decades would never get you an answer!
A possible answer is reciprocal scoring. Up until this point all measures of accuracy that I’ve referenced have used a “proper” (or objective) scoring rule that purely and directly incentivizes accuracy from forecasters by comparing their results to reality. Reciprocal scoring is an intersubjective scoring rule, in that it asks forecasters to predict the forecasts of other, skilled forecasters. The theoretical underpinnings imply that when properly incentivized, forecasters will still strive to forecast the truth when being judged by reciprocal scoring. To test this, researchers had forecasters in different groups use the two different methods of scoring, on objectively resolvable questions, and found the accuracy of the forecasts to be statistically identical! Then, in a second study, forecasters being judged by reciprocal scoring predicted the death toll from COVID-19 if different policies or combinations of policies had been implemented immediately.
The resulting conclusions seemed reasonable and consistent, at least as best as you could tell in hindsight and against universes that never came to be. Of the 19 of 21 tournament participants in Study 2 that completed a post-tournament survey, only 12 (63%) responded “I always reported my exact true beliefs”. On average, the other 7 reported only a 15% difference between their true beliefs and the beliefs they forecasted. The reasons shared for not reporting their true beliefs were, in order of popularity:
- “I thought that my team’s median forecast was a good proxy for the other team’s median, even if it was objectively inaccurate”
- “I thought I was able to identify patterns in how forecasters would respond to these questions”
- “I thought that forecasters in the other group might be biased in a certain direction”
- “I thought that I might have obtained information that the forecasters in the other group did not have access to”
It’s very interesting to me that reciprocal scoring could have been so accurate in the first study, yet a third of participants in the second study reported intentionally deviating from their true beliefs. Is it possible that these deviations actually reduce noise/error in general by pulling otherwise less accurate forecasters towards the median?
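To make the contrast between the two scoring modes concrete, here is a minimal sketch using squared error in both cases; the actual studies used more elaborate variants (e.g. scoring against the median of a group of skilled forecasters across many questions):

```python
import statistics

def brier_score(forecast, outcome):
    """Proper (objective) scoring: squared distance from how the
    question actually resolved; honest beliefs minimize it in expectation."""
    return (forecast - outcome) ** 2

def reciprocal_score(forecast, other_group_forecasts):
    """Intersubjective scoring: squared distance from the other group's
    median forecast, requiring no ground-truth resolution."""
    return (forecast - statistics.median(other_group_forecasts)) ** 2

# The same forecast scored both ways.
print(brier_score(0.4, outcome=1))              # needs the event to resolve
print(reciprocal_score(0.4, [0.3, 0.5, 0.35]))  # scorable immediately
```

The appeal for GCR questions is visible in the signatures: the reciprocal score never touches an outcome, so it can be computed today for questions resolving in decades, or never.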
Another obstacle in the application of prediction polling to GCRs is knowing which questions to ask forecasters in the first place in order to get the most useful information. We could just let subject matter experts create them, but is this the best way? One solution being explored is intentionally including questions from along a rigor-relevance tradeoff spectrum: near-term indicators that are objectively scorable and expected (but not known) to be indicative of longer-term events of interest, alongside longer-run outcomes that we directly care about but that will need to be intersubjectively scored. Another is crafting “conditional trees”, where forecasters identify the branching impact of early indicators on the probability of later outcomes, to systematically identify the most relevant questions to forecast.
Lots of other interventions, like controlling the structure of teams or improving training, are hypothesized to improve forecasters’ performance in this new domain. Researchers are working on designing and facilitating a next generation of forecasting tournaments to figure out, in the spirit of the GJP, what works and what doesn’t empirically. I believe I actually had the honor of participating in the first of these recently, and I’ll be tracking the publications of the associated researchers so that I can continue to update this page.
Tetlock, Philip E., et al. "Forecasting tournaments: Tools for increasing transparency and improving the quality of debate." Current Directions in Psychological Science 23.4 (2014): 290-295.
Mellers, Barbara, Philip Tetlock, and Hal R. Arkes. "Forecasting tournaments, epistemic humility and attitude depolarization." Cognition 188 (2019): 19-26.
Mellers, Barbara, et al. "Psychological strategies for winning a geopolitical forecasting tournament." Psychological science 25.5 (2014): 1106-1115.
Chang, Welton, et al. "Developing expert political judgment: The impact of training and practice on judgmental accuracy in geopolitical forecasting tournaments." Judgment and Decision Making 11.5 (2016): 509-526.
Horowitz, Michael, et al. "What makes foreign policy teams tick: Explaining variation in group performance at geopolitical forecasting." The Journal of Politics 81.4 (2019): 1388-1404.
Karvetski, Christopher, et al. "Forecasting the accuracy of forecasters from properties of forecasting rationales." Available at SSRN 3779404 (2021).
Satopää, Ville A., et al. "Combining multiple probability predictions using a simple logit model." International Journal of Forecasting 30.2 (2014): 344-356.
Baron, Jonathan, et al. "Two reasons to make aggregated probability forecasts more extreme." Decision Analysis 11.2 (2014): 133-145.
Satopää, Ville A., et al. "Bias, information, noise: The BIN model of forecasting." Management Science 67.12 (2021): 7599-7618.
Atanasov, Pavel, et al. "Distilling the wisdom of crowds: Prediction markets vs. prediction polls." Management science 63.3 (2017): 691-706.
Dana, Jason, et al. "Are markets more accurate than polls? The surprising informational value of “just asking”." Judgment and Decision Making 14.2 (2019): 135-147.
Mellers, Barbara, et al. "Identifying and cultivating superforecasters as a method of improving probabilistic predictions." Perspectives on Psychological Science 10.3 (2015): 267-281.
Mellers, Barbara, et al. "The psychology of intelligence analysis: drivers of prediction accuracy in world politics." Journal of experimental psychology: applied 21.1 (2015): 1.
Himmelstein, Mark, Pavel Atanasov, and David V. Budescu. "Forecasting forecaster accuracy: Contributions of past performance and individual differences." Judgment and Decision Making 16.2 (2021): 323-362.
Friedman, Jeffrey A., et al. "The value of precision in probability assessment: Evidence from a large-scale geopolitical forecasting tournament." International Studies Quarterly 62.2 (2018): 410-422.
Atanasov, Pavel, et al. "Small steps to accuracy: Incremental belief updaters are better forecasters." Proceedings of the 21st ACM Conference on Economics and Computation. 2020.
Karger, Ezra, et al. "Reciprocal scoring: A method for forecasting unanswerable questions." Available at SSRN 3954498 (2021).
Karger, Ezra, Pavel D. Atanasov, and Philip Tetlock. "Improving judgments of existential risk: Better forecasts, questions, explanations, policies." Questions, Explanations, Policies (January 5, 2022) (2022).