Efforts to Improve the Accuracy of Our Judgments and Forecasts (Open Philanthropy)

lukeprog

This is a linkpost for https://www.openphilanthropy.org/blog/efforts-improve-accuracy-our-judgments-and-forecasts

Our grantmaking decisions rely crucially on our uncertain, subjective judgments — about the quality of some body of evidence, about the capabilities of our grantees, about what will happen if we make a certain grant, about what will happen if we don’t make that grant, and so on.

In some cases, we need to make judgments about relatively tangible outcomes in the relatively near future, as when we have supported campaigning work for criminal justice reform. In others, our work relies on speculative forecasts about the much longer term, as for example with potential risks from advanced artificial intelligence. We often try to quantify our judgments in the form of probabilities — for example, the former link estimates a 20% chance of success for a particular campaign, while the latter estimates a 10% chance that a particular sort of technology will be developed in the next 20 years.

We think it’s important to improve the accuracy of our judgments and forecasts if we can. I’ve been working on a project to explore whether there is good research on the general question of how to make good and accurate forecasts, and/or specialists in this topic who might help us do so. Some preliminary thoughts follow.

In brief:

- There is a relatively thin literature on the science of forecasting.^[1] It seems to me that its findings so far are substantive and helpful, and that more research in this area could be promising.
- This literature recommends a small set of “best practices” for making accurate forecasts that we are thinking about how to incorporate into our process. It seems to me that these “best practices” are likely to be useful, and surprisingly uncommon given that.
- In one case, we are contracting to build a simple online application for credence calibration training: training the user to accurately determine how confident they should be in an opinion, and to express this confidence in a consistent and quantified way. I consider this a very useful skill across a wide variety of domains, and one that (it seems) can be learned with just a few hours of training. (Update: This calibration training app is now available.)

I first discuss the last of these points (credence calibration training), since I think it is a good introduction to the kinds of tangible things one can do to improve forecasting ability.

Calibration training

An important component of accuracy is called “calibration.” If you are “well-calibrated,” statements (including predictions) you make with 30% confidence are true about 30% of the time, statements you make with 70% confidence are true about 70% of the time, and so on.

Without training, most people are not well-calibrated, but instead overconfident. Statements they make with 90% confidence might be true only 70% of the time, and statements they make with 75% confidence might be true only 60% of the time.^[2] But it is possible to “practice” calibration by assigning probabilities to factual statements, then checking whether the statements are true, and tracking one’s performance over time. In a few hours, one can practice on hundreds of questions and discover patterns like “When I’m 80% confident, I’m right only 65% of the time; maybe I should adjust so that I report 65% for the level of internally-experienced confidence I previously associated with 80%.”
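The practice loop described above (assign probabilities, check the answers, track performance) can be sketched in a few lines of Python. This is only an illustration: the function name `calibration_table` and the practice log are hypothetical, and the numbers are chosen to reproduce the “80% confident, right 65% of the time” pattern mentioned in the text.

```python
from collections import defaultdict

def calibration_table(forecasts):
    """Group (stated_confidence, was_true) pairs by stated confidence.

    Returns {confidence: (number_of_statements, observed_hit_rate)}.
    """
    buckets = defaultdict(list)
    for confidence, was_true in forecasts:
        buckets[confidence].append(1 if was_true else 0)
    return {
        confidence: (len(hits), sum(hits) / len(hits))
        for confidence, hits in sorted(buckets.items())
    }

# Hypothetical practice log: 20 statements made at 80% confidence, 13 of them true.
practice_log = [(0.8, True)] * 13 + [(0.8, False)] * 7

table = calibration_table(practice_log)
for confidence, (count, hit_rate) in table.items():
    print(f"stated {confidence:.0%}: {count} statements, {hit_rate:.0%} true")
```

A gap between the stated confidence and the observed hit rate in a bucket is the signal to adjust; with only a handful of statements per bucket, the gap could easily be noise.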

I recently attended a calibration training webinar run by Hubbard Decision Research, which was essentially an abbreviated version of the classic calibration training exercise described in Lichtenstein & Fischhoff (1980). It was also attended by two participants from other organizations, who did not seem to be familiar with the idea of calibration and, as expected, were grossly overconfident on the first set of questions.^[3] But, as the training continued, their scores on the question sets began to improve until, on the final question set, they both achieved perfect calibration.

For me, this was somewhat inspiring to watch. It isn’t often the case that a cognitive skill as useful and domain-general as probability calibration can be trained, with such objectively-measured dramatic improvements, in so short a time.

The research I’ve reviewed broadly supports this impression. For example:

- Rieber (2004) lists “training for calibration feedback” as his first recommendation for improving calibration, and summarizes a number of studies indicating both short- and long-term improvements on calibration.^[4] In particular, decades ago, Royal Dutch Shell began to provide calibration training for their geologists, who are now (reportedly) quite well-calibrated when forecasting which sites will produce oil.^[5]
- Since 2001, Hubbard Decision Research has trained over 1,000 people across a variety of industries. Analyzing the data from these participants, Doug Hubbard reports that 80% of people achieve perfect calibration (on trivia questions) after just a few hours of training. He also claims that, according to his data and at least one controlled (but not randomized) trial, this training predicts subsequent real-world forecasting success.^[6]

I should note that calibration isn’t sufficient by itself for good forecasting. For example, you can be well-calibrated on a set of true/false statements, for which about half the statements happen to be true, simply by responding “True, with 50% confidence” to every statement. This performance would be well-calibrated but not very informative. Ideally, an expert would assign high confidence to statements that are likely to be true, and low confidence to statements that are unlikely to be true. An expert who can do so is not just well-calibrated, but also exhibits good “resolution” (sometimes called “discrimination”). If we combine calibration and resolution, we arrive at a measure of accuracy called a “proper scoring rule.”^[7] The calibration trainings described above sometimes involve proper scoring rules, and likely train people to be well-calibrated while exhibiting at least some resolution, though the main benefit they seem to have (based on the research and my observations) pertains to calibration specifically.
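To make the difference between calibration and resolution concrete, here is a minimal Python sketch that scores two hypothetical forecasters with a Brier score. It assumes the two-outcome formulation of the Brier score (0 is perfect, 2 is maximally wrong, and constant 50% answers score 0.5); the forecaster data are invented for illustration. Both forecasters are well-calibrated on these ten statements, but only the second has any resolution.

```python
def brier_score(forecasts):
    """Mean Brier score over (probability_true, was_true) pairs.

    Two-outcome formulation: squared error summed over "true" and "false",
    so scores run from 0 (perfect) to 2 (confidently wrong every time).
    """
    total = 0.0
    for p, was_true in forecasts:
        outcome = 1.0 if was_true else 0.0
        total += (p - outcome) ** 2 + ((1 - p) - (1 - outcome)) ** 2
    return total / len(forecasts)

# Hypothetical forecaster B: says 80% five times (4 true) and 20% five times (1 true).
# Each confidence level matches its hit rate, so B is calibrated AND discriminates.
discriminating = (
    [(0.8, True)] * 4 + [(0.8, False)] * 1 + [(0.2, True)] * 1 + [(0.2, False)] * 4
)

# Hypothetical forecaster A: answers 50% on the same statements.
# Half are true, so A is calibrated, but the answers carry no information.
always_half = [(0.5, was_true) for _, was_true in discriminating]

print(f"always 50%:     {brier_score(always_half):.2f}")
print(f"discriminating: {brier_score(discriminating):.2f}")
```

The lower score for the second forecaster reflects resolution alone, since both are equally well-calibrated on this set.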

The primary source of my earlier training in calibration was a game intended to automate the process. The Open Philanthropy Project is now working with developers to create a more extensive calibration training game for training our staff; we will also make the game available publicly. [Update: You can now play the game.]

Further advice for improving judgment accuracy

Below I list some common advice for improving judgment and forecasting accuracy (in the absence of strong causal models or much statistical data) that has at least some support in the academic literature, and which I find intuitively likely to be helpful.^[8]

- Train probabilistic reasoning: In one especially compelling study (Chang et al. 2016), a single hour of training in probabilistic reasoning noticeably improved forecasting accuracy.^[9] Similar training has improved judgmental accuracy in some earlier studies,^[10] and is sometimes included in calibration training.^[11]
- Incentivize accuracy: In many domains, incentives for accuracy are overwhelmed by stronger incentives for other things, such as incentives for appearing confident, being entertaining, or signaling group loyalty. Some studies suggest that accuracy can be improved merely by providing sufficiently strong incentives for accuracy such as money or the approval of peers.^[12]
- Think of alternatives: Some studies suggest that judgmental accuracy can be improved by prompting subjects to consider alternate hypotheses.^[13]
- Decompose the problem: Another common recommendation is to break each problem into easier-to-estimate sub-problems.^[14]
- Combine multiple judgments: Often, a weighted (and sometimes “extremized”^[15]) combination of multiple subjects’ judgments outperforms the judgments of any one person.^[16]
- Correlates of judgmental accuracy: According to some of the most compelling studies on forecasting accuracy I’ve seen,^[17] correlates of good forecasting ability include “thinking like a fox” (i.e. eschewing grand theories for attention to lots of messy details), strong domain knowledge, general cognitive ability, and high scores on “need for cognition,” “actively open-minded thinking,” and “cognitive reflection” scales. [Note: Links added by the authors of this sequence, not by the author of the original post.]
- Prediction markets: I’ve seen it argued, and find it intuitive, that an organization might improve forecasting accuracy by using prediction markets. I haven’t studied the performance of prediction markets yet.
- Learn a lot about the phenomena you want to forecast: This one probably sounds obvious, but I think it’s important to flag, to avoid leaving the impression that forecasting ability is more cross-domain/generalizable than it is. Several studies suggest that accuracy can be boosted by having (or acquiring) domain expertise. A commonly-held hypothesis, which I find intuitively plausible, is that calibration training is especially helpful for improving calibration, and that domain expertise is helpful for improving resolution.^[18]
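As an illustration of the “combine multiple judgments” item in the list above, the following Python sketch takes a weighted average of several probability forecasts and then “extremizes” the result. The transform p^a / (p^a + (1 - p)^a) is one form of extremizing that appears in the forecast-aggregation literature; whether it matches the method in any particular cited study is an assumption, and the forecasts, weights, and exponent here are hypothetical.

```python
def weighted_mean(probabilities, weights):
    """Weighted average of probability forecasts for the same event."""
    total_weight = sum(weights)
    return sum(p * w for p, w in zip(probabilities, weights)) / total_weight

def extremize(p, a):
    """Push p away from 0.5.

    a = 1 leaves p unchanged; a > 1 moves p toward 0 or 1.
    """
    return p ** a / (p ** a + (1 - p) ** a)

# Hypothetical forecasts from three people, with the third weighted double
# (for example, because of a better track record).
forecasts = [0.6, 0.7, 0.8]
weights = [1.0, 1.0, 2.0]

pooled = weighted_mean(forecasts, weights)
sharpened = extremize(pooled, a=2.0)  # exponent chosen arbitrarily for the example

print(f"pooled: {pooled:.3f}, extremized: {sharpened:.3f}")
```

In practice the weights and the exponent would need to be chosen from past performance data rather than picked by hand as they are here.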

Another interesting takeaway from the forecasting literature is the degree to which, and the consistency with which, some experts exhibit better accuracy than others. For example, tournament-level bridge players tend to show reliably good accuracy, whereas TV pundits, political scientists, and professional futurists seem not to.^[19] A famous recent result in comparative real-world accuracy comes from a series of IARPA forecasting tournaments, in which ordinary people competed with each other and with professional intelligence analysts (who also had access to expensively-collected classified information) to forecast geopolitical events. As reported in Tetlock & Gardner’s Superforecasting, forecasts made by combining (in a certain way) the forecasts of the best-performing ordinary people were (repeatedly) more accurate than those of the trained intelligence analysts.

How commonly do people seek to improve the accuracy of their subjective judgments?

Certainly many organizations, from financial institutions (e.g. see Fabozzi 2012) to sports teams (e.g. see Moneyball), use sophisticated quantitative models to improve the accuracy of their estimates. But the question I’m asking here is: In the absence of strong models and/or good data, when decision-makers must rely almost entirely on human subjective judgment, how common is it for those decision-makers to explicitly invest substantial effort into improving the (objectively-measured) accuracy of those subjective judgments?

Overall, my impression is that the answer to this question is “Somewhat rarely, in most industries, even though the techniques listed above are well-known to experts in judgment and forecasting accuracy.”

Why do I think that? It’s difficult to get good evidence on this question, but I provide some data points in a footnote.^[20]

Ideas we’re exploring to improve accuracy for GiveWell and Open Philanthropy Project staff

Below is a list of activities, aimed at improving the accuracy of our judgments and forecasts, that are either ongoing, under development, or under consideration at GiveWell and the Open Philanthropy Project:

- As noted above, we have contracted a team of software developers to create a calibration training web/phone application for staff and public use. (Update: This calibration training app is now available.)
- We encourage staff to participate in prediction markets and forecasting tournaments such as PredictIt and Good Judgment Open, and some staff do so.
- Both the Open Philanthropy Project and GiveWell recently began to make probabilistic forecasts about our grants. For the Open Philanthropy Project, see e.g. our forecasts about recent grants to Philip Tetlock and CIWF. For GiveWell, see e.g. forecasts about recent grants to Evidence Action and IPA. We also make and track some additional grant-related forecasts privately. The idea here is to be able to measure our accuracy later, as those predictions come true or are falsified, and perhaps to improve our accuracy from past experience. So far, we are simply encouraging predictions without putting much effort into ensuring their later measurability.
- We’re going to experiment with some forecasting sessions led by an experienced “forecast facilitator”: someone who helps elicit forecasts from people about the work they’re doing, in a way that tries to be as informative and helpful as possible. This might improve the forecasts mentioned in the previous bullet point.

I’m currently the main person responsible for improving forecasting at the Open Philanthropy Project, and I’d be very interested in further ideas for what we could do.

Technically, the scientific study of forecasting goes back to at least the 1940s, and arguably earlier. However, I am most interested in studies which do all of the following:
- collect forecasts about phenomena for which there aren’t strong models and/or lots of data,
- assess the accuracy of those forecasts using a proper scoring rule,
- relative to the accuracy achieved by some reasonable baseline or control group, and which
- don’t have other well-known but common limitations, such as failing to adjust for multiple comparisons.
In attempting to learn what I can from the forecasting literature, I haven’t relied exclusively on studies which have all the features listed above, but my hope is that this list of features helps to clarify which types of studies I’ve tried hardest to find and learn from. It is in this sense that the science of forecasting is a “thin literature,” even though there are thousands of published papers about forecasting, stretching back to the 1940s and earlier. ↩︎
Lichtenstein et al. (1982); Bazerman & Moore (2013), ch. 2. ↩︎
I had previously practiced calibration using an online game intended to give a form of automated calibration training. ↩︎
From Rieber (2004):

Evidence indicates that the calibration of judgment can be substantially enhanced through feedback about one’s own probability judgments. In one experiment, participants working at computers were asked general knowledge questions to which they gave their answers as well as their subjective probabilities. After each session of 200 items, the participants received a summary of their performance, which they discussed with the experimenter. Most participants were poorly calibrated before the training; of these, all substantially improved. Moreover, although the subjects participated in eleven training sessions, “all of the improvement came between the first and second round of feedback.” Since the first training session lasted about an hour, with another forty-five minutes for preliminary instruction, it appears that with intensive feedback calibration can be dramatically improved in approximately two hours.

P. George Benson and Dilek Onkal report similarly large gains after calibration feedback in a forecasting task. In this study, the improvement also occurred in one step, but it was between the second and third training sessions. Likewise, Marc Alpert and Howard Raiffa report that after calibration feedback the number of 98 percent confidence-range questions missed by their Harvard MBA students “fell from a shocking 41 percent to a depressing 23 percent.” While 23 percent is far more than the ideal 2 percent, it nevertheless represents a large improvement from 41 percent and it, too, was achieved after only one round of practice.

While these results occurred in a laboratory setting, some evidence shows that calibration training in the workplace can be effective as well. The energy firm Royal Dutch/Shell successfully implemented a training program to improve the calibration of its geologists in finding oil deposits. Prior to the training, the geologists had been markedly overconfident, assigning a 40 percent confidence to locations that yielded oil less than 20 percent of the time. Predicting the location of oil deposits is clearly very different from predicting international events, with fewer variables and more reliable data. But, it may be analogous to imagery intelligence.

↩︎
Russo & Schoemaker (2014):

[The geologists] were given files from their archives containing many factors affecting oil deposits, but without the actual results. For each past case, they had to provide best guesses for the probability of striking oil as well as ranges as to how much a successful well might produce. Then they were given feedback as to what had actually happened. The training worked wonderfully: now, when Shell geologists predict a 30 per cent chance of producing oil, three out of ten times the company averages a hit…

↩︎
From Hubbard & Seiersen (2016), ch. 7:

Since [2001], Hubbard and his team at Hubbard Decision Research have trained well over 1,000 people in calibration methods and have recorded their performance, both their expected and actual results on several calibration tests, given one after the other during a half-day workshop.

[…]

To determine who is calibrated we have to allow for some deviation from the target, even for a perfectly calibrated person. Also, an uncalibrated person can get lucky. Accounting for this statistical error in the testing, fully 80% of participants are ideally calibrated after the fifth calibration exercise. They are neither underconfident nor overconfident. Their 90% [confidence intervals] have about a 90% chance of containing the correct answer.

Another 10% show significant improvement but don’t quite reach ideal calibration. And 10% show no significant improvement at all from the first test they take…

…But does proven performance in training reflect an ability to assess the odds of real-life uncertainties? The answer here is an unequivocal yes. Hubbard tracked how well-calibrated people do in real-life situations on multiple occasions, but one particular controlled experiment done in the IT industry still stands out. In 1997, Hubbard was asked to train the analysts of the IT advisory firm Giga Information Group (since acquired by Forrester Research, Inc.) in assigning odds to uncertain future events. Giga was an IT research firm that sold its research to other companies on a subscription basis. Giga had adopted the method of assigning odds to events it was predicting for clients, and it wanted to be sure it was performing well.

Hubbard trained 16 Giga analysts using the methods described earlier. At the end of the training, the analysts were given 20 specific IT industry predictions they would answer as true or false and to which they would assign a confidence. The test was given in January 1997, and all the questions were stated as events occurring or not occurring by June 1, 1997 (e.g., “True or False: Intel will release its 300 MHz Pentium by June 1,” etc.). As a control, the same list of predictions was also given to 16 of their chief information officer (CIO) clients at various organizations. After June 1 the actual outcomes could be determined. Hubbard presented the results at Giga World 1997, their major IT industry symposium for the year…

…the analysts’ results… were very close to the ideal confidence, easily within allowable error…

In comparison, the results of clients who did not receive any calibration training (indicated by the small triangles) were very overconfident… All of these results are consistent with what has typically been observed in a number of other calibration studies over the past several decades.

I haven’t seen the details of Hubbard’s study, and in any case it suffers from multiple design limitations — for example, the treatment (calibration training) wasn’t assigned randomly. ↩︎
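The excerpt's remarks that a perfectly calibrated person needs some allowance for deviation, and that an uncalibrated person can get lucky, can be illustrated with a simple binomial calculation. The excerpt does not describe Hubbard's actual statistical procedure, so everything here is an assumption made for illustration: a hypothetical test of 10 questions answered with 90% confidence intervals, a hypothetical pass mark of 8 hits, and independence between questions.

```python
from math import comb

def prob_at_least(k, n, hit_rate):
    """Probability of k or more hits in n independent tries at the given hit rate."""
    return sum(
        comb(n, hits) * hit_rate ** hits * (1 - hit_rate) ** (n - hits)
        for hits in range(k, n + 1)
    )

n_questions = 10  # hypothetical test length
pass_mark = 8     # hypothetical threshold for "looks calibrated"

# Someone whose 90% intervals really do contain the answer 90% of the time.
calibrated = prob_at_least(pass_mark, n_questions, 0.9)

# Someone overconfident, whose "90%" intervals contain the answer only 60% of the time.
overconfident = prob_at_least(pass_mark, n_questions, 0.6)

print(f"calibrated person passes:    {calibrated:.1%}")
print(f"overconfident person passes: {overconfident:.1%}")
```

Under these assumptions the calibrated person still misses the pass mark about 7% of the time, and the overconfident person reaches it about 17% of the time, which is why a single short test cannot settle who is calibrated.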
A proper scoring rule, applied to a set of probabilistic judgments or forecasts, awards points for both calibration and resolution, and does so in a way that incentivizes judges to report their probabilities honestly. Measures like this should be assessed with respect to an appropriate benchmark. Tetlock & Gardner (2015) explain this point in the context of assessing forecasts for accuracy using a proper scoring rule called a Brier score, which in the formulation they use ranges from 0 to 2, with lower numbers representing better scores (ch. 3):

Let’s suppose we discover that you have a Brier score of 0.2. That’s far from godlike omniscience (0) but a lot better than chimp-like guessing (0.5), so it falls in the range of what one might expect from, say, a human being. But we can say much more than that. What a Brier score means depends on what’s being forecast. For instance, it’s quite easy to imagine circumstances where a Brier score of 0.2 would be disappointing. Consider the weather in Phoenix, Arizona. Each June, it gets very hot and sunny. A forecaster who followed a mindless rule like, “always assign 100% to hot and sunny” could get a Brier score close to 0, leaving 0.2 in the dust. Here, the right test of skill would be whether a forecaster can do better than mindlessly predicting no change. This is an underappreciated point. For example, after the 2012 presidential election, Nate Silver, Princeton’s Sam Wang, and other poll aggregators were hailed for correctly predicting all fifty state outcomes, but almost no one noted that a crude, across-the-board prediction of “no change” — if a state went Democratic or Republican in 2008, it will do the same in 2012 — would have scored forty-eight out of fifty, which suggests that the many excited exclamations of “He called all fifty states!” we heard at the time were a tad overwrought. Fortunately, poll aggregators are pros: they know that improving predictions tends to be a game of inches.

Another key benchmark is other forecasters. Who can beat everyone else? Who can beat the consensus forecast? How do they pull it off? Answering these questions requires comparing Brier scores, which, in turn, requires a level playing field. Forecasting the weather in Phoenix is just plain easier than forecasting the weather in Springfield, Missouri, where weather is notoriously variable, so comparing the Brier scores of a Phoenix meteorologist with those of a Springfield meteorologist would be unfair. A 0.2 Brier score in Springfield could be a sign that you are a world-class meteorologist. It’s a simple point, with a big implication: dredging up old forecasts from newspapers will seldom yield apples-to-apples comparisons because, outside of tournaments, real-world forecasters seldom predict exactly the same developments over exactly the same time period.

↩︎
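The benchmark point in the Tetlock & Gardner excerpt can be worked through numerically. The sketch below assumes the two-outcome Brier formulation (0 is perfect, 50% guessing scores 0.5) and a hypothetical June in Phoenix with 29 hot, sunny days out of 30. It compares a Brier score of 0.2 against the mindless “always 100% hot and sunny” rule using a skill score of the form 1 - score / baseline_score, which is positive only when the forecaster beats the baseline.

```python
def brier_score(forecasts):
    """Mean two-outcome Brier score over (probability_true, was_true) pairs."""
    total = 0.0
    for p, was_true in forecasts:
        outcome = 1.0 if was_true else 0.0
        total += (p - outcome) ** 2 + ((1 - p) - (1 - outcome)) ** 2
    return total / len(forecasts)

def skill_vs_baseline(score, baseline_score):
    """Positive if the score beats the baseline, 0 if tied, negative if worse."""
    return 1.0 - score / baseline_score

# Hypothetical June: 29 of 30 days turn out hot and sunny.
june = [True] * 29 + [False]

# Mindless rule from the excerpt: always assign 100% to "hot and sunny".
baseline = brier_score([(1.0, hot_and_sunny) for hot_and_sunny in june])

print(f"baseline Brier score: {baseline:.3f}")
print(f"skill of a 0.2 Brier score: {skill_vs_baseline(0.2, baseline):.1f}")
```

With these made-up numbers, a Brier score of 0.2 is three times worse than the mindless baseline, so the same score that looks respectable in isolation shows negative skill once the benchmark is taken into account.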
In the footnotes that follow each piece of “common advice” I list for this post, I do not provide a thorough evaluation of the evidence supporting each claim, but merely provide some pointers to the available evidence. I have skimmed these and other studies only briefly, and my choices for which pieces of advice to include here rely as much on my intuitions about what seems likely to work (given my studies of forecasting, psychology, and other fields, as well as my general understanding of the world) as they do on an evaluation of the specific evidence I point to. In fact, I suspect that upon closer examination, I would find some of the primary studies listed or cited in these footnotes to be deeply flawed and unconvincing.

My list of common advice is not an exhaustive one. For additional suggestions, see e.g. Bazerman & Moore (2013) ch. 12, Rieber (2004), and Soll et al. (2016). ↩︎
Chang et al. (2016) describe the training module randomly assigned to some participants in the Good Judgment Project forecasting tournaments:

Training evolved from year 1 to 4, but was never designed to take more than an hour. Common probabilistic reasoning principles included the understanding and use of event base-rates, basic principles of belief updating in a way that reflected the probative value of new evidence, the value of averaging independent evidence, the difference between calibration and resolution in Brier scoring, the pros and cons of using statistical-mathematical models to inform forecasts, and a discussion of common biases in probability judgment.

…Training in year 1 consisted of two different modules: probabilistic reasoning training and scenario training. Scenario-training was a four-step process: 1) developing coherent and logical probabilities under the probability sum rule; 2) exploring and challenging assumptions; 3) identifying the key causal drivers; 4) considering the best and worst case scenarios and developing a sensible 95% confidence interval of possible outcomes; and 5) avoid over-correction biases. The principles were distilled into an acronym QUEST: Question views, Use plausible worst-case and best-case scenarios, Explore assumptions, Several assumptions should be considered, Take heed of biases… Scenario training was designed in a way very similar to analytic training already used by the intelligence community, encouraging trainees to think critically about assumptions, potential futures, and causal mechanisms that could be at play on a given forecasting question.

Probabilistic reasoning training consisted of lessons that detailed the difference between calibration and resolution, using comparison classes and base rates (Kahneman & Tversky, 1973; Tversky & Kahneman, 1981), averaging and using crowd wisdom principles (Surowiecki, 2005), finding and utilizing predictive mathematical and statistical models (Arkes, 1981; Kahneman & Tversky, 1982), cautiously using time-series and historical data, and being self-aware of the typical cognitive biases common throughout the population. The training encouraged forecasters to remember the principles by the acronym CHAMP (Table 2)…

In year 2, probabilistic reasoning and scenario training were combined into a single module. Graphics and more checks on learning were added.

Year 3 expanded on year 1 and year 2 training by delivering the content in a graphical format (online via commercial software) and adding a letter S to CHAMP, as well as a new political science content module described by the acronym KNOW. The additional S encouraged forecasters to select the right questions to answer and seek out subjects where they have a comparative advantage. The additional KNOW module encouraged forecasters to understand the dynamics involving key political players (Bueno De Mesquita & Smith, 2005; Waltz, 2001), determine the influence of norms and international institutions (Finnemore & Sikkink, 1998; Keohane, 2005), seek out other political perspectives and be aware of potential wildcard scenarios (Taleb, 2010). The original CHAMP guidelines were also slightly modified based on lessons learned and observation of the best forecasters, together forming the revised guidelines under the acronym CHAMPS KNOW (Table 3). Additional checks on learning (i.e., short quizzes) were integrated into this version of the training as well…

Year 4 training was very similar to year 3 training. The probabilistic reasoning training was delivered via a customized web platform. Almost all information conveyed was illustrated with graphical examples or pictures. The main CHAMPS KNOW framework remained intact — save for the revision of the S guideline from “Select the right questions to answer” to “Select the right level of effort to devote to each question,” which provided a sharper and clearer description of performing cognitive triage on the forecasting question pool…

Training yielded significant improvements in Brier score across all four tournament years (Figure 1). In year 1, both probability-trained forecasters (n = 119, M Std Brier Score = -0.05, SD = 0.24) and scenario-trained forecasters (n = 113, M Std Brier Score = -0.06, SD = 0.23) outperformed control forecasters (n = 152, M Std Brier Score = +0.07, SD = 0.28), F(2, 381) = 12.1, p < .001. Accuracy did not differ between probability-trained and scenario-trained forecasters. The improvement in mean Brier scores from probability-training and scenario-training was 10% and 11%, respectively, relative to control forecasters.

In year 2, training increased accuracy, with probability-trained individuals (n = 205, M Std Brier Score = -0.10, SD Std = 0.25) outperforming control individuals (n = 194, M Std Brier Score = +0.05, SD Std = 0.25), t(395) = 5.95, p < .001, a 12% score improvement. In year 3, training was associated with better performance (trained n = 97, M Std Brier Score = -0.08, SD Std = 0.27, control n = 116, M Std Brier Score = 0.00, SD Std = 0.28), t(207) = 2.32, p = .021, with trained individuals again achieving greater accuracy than controls, a 6% score improvement. Finally, in year 4, training was also significant, (trained n = 131, M Std Brier Score = -0.01, SD Std = 0.26, control n = 102, M Std Brier Score = -0.08, SD Std = 0.24), t(225) = 2.20, p = .028, a 7% score improvement. Additionally, as reported elsewhere, training improved the calibration and resolution of forecasters by reducing overconfidence (Mellers et al., 2014; Moore et al., 2016). Overall, the individual forecasters with probability-training consistently outperformed controls across all four years (Table 4).

Section 1 of this paper also provides a succinct and up-to-date review of past work on debiasing and judgmental accuracy training.

My judgment that Chang et al. (2016) is an “especially compelling” study comes substantially (but not entirely) from the fact that it overcomes some of the limitations of past work, as summarized by the authors:

A number of studies have shed light on how probability estimates and judgments can be improved… However, past work suffers from at least six sets of limitations: 1) over-reliance on student subjects who are often neither intrinsically nor extrinsically motivated to master the task…; 2) one-shot experimental tasks that limit both subjects’ opportunities to learn and researchers’ opportunities to assess whether experimentally induced gains were sustainable over time or whether they just briefly cued better thinking…; 3) brief training modules, often as short as 10–15 minutes, that afforded few opportunities for retesting… and exploring the potential interactive effects of training and deliberate practice…; 4) debiasing interventions that are narrowly tailored to a single bias (e.g., over-confidence, hindsight) and not designed to help with problems that activate multiple biases…; 5) multifaceted and lengthy educational interventions, such as statistics courses, that are high in ecological validity but lose the internal validity advantages that accrue from random assignment…; and 6) limited study of the moderating effects of individual differences beyond cognitive ability…

We set out to overcome many of these problems. Our study uses a highly diverse cross-section of the population that, based on the effort expended for compensation provided, is almost certainly more intrinsically motivated than the standard undergraduate sample. The research went on for four years, tested lengthier debiasing methods, and investigated individual-difference moderators. Our study also represents one of the most rigorous tests of debiasing methods to date. The open-ended experimental task, forecasting a wide range of political and economic outcomes, is widely recognized as difficult (Jervis, 2010; Tetlock, 2005)… Our work does not correct all of the aforementioned conceptual and methodological problems, but we can address a significant fraction of them.

↩︎
For example, individual components of the training module from Chang et al. (2016) have been tested in earlier studies, as noted by Chang et al. (2016):

Considering base rates can also improve judgmental accuracy (Kahneman & Tversky, 1973; Tversky & Kahneman, 1981). [And] teaching people reference-class forecasting reduces base-rate neglect more than calling attention to the bias itself (Case, Fantino & Goodie, 1999; Fischhoff & Bar-Hillel, 1984; Flyvbjerg, 2008; Kahneman & Tversky, 1977; Lovallo, Clarke & Camerer, 2012).

↩︎
For example, the Doug Hubbard training I attended included some training in probabilistic reasoning, which in part was necessary to ensure the participants understood how the calibration training was supposed to work. ↩︎
Presumably, strong monetary incentives are the primary reason why most financial markets are as efficient as they are, and strong monetary and/or reputational incentives explain why prediction markets work as well as they do (Wolfers & Zitzewitz 2004).

Relatedly, Tetlock & Gardner (2015) remark:

It is quite remarkable how much better calibrated forecasters are in the public IARPA tournaments than they were in [Tetlock’s] earlier anonymity-guaranteed EPJ tournaments. And the evidence from lab experiments is even more decisive. Public tournaments create a form of accountability that attunes us to the possibility we might be wrong. Tournaments have the effect that Samuel Johnson ascribed to the gallows: they concentrate the mind (in the case of tournaments, on avoiding reputational death). See P. E. Tetlock and B. A. Mellers, “Structuring Accountability Systems in Organizations,” in Intelligence Analysis: Behavioral and Social Scientific Foundations, ed. B. Fischhoff and C. Chauvin (Washington, DC: National Academies Press, 2011), pp. 249–70; J. Lerner and P. E. Tetlock, “Accounting for the Effects of Accountability,” Psychological Bulletin 125 (1999): 255–75.

Kahan (2015) summarizes an emerging literature on monetary incentives for accuracy in the context of politically motivated reasoning:

In an important development, several researchers have recently reported that offering monetary incentives can reduce or eliminate polarization in the answers that subjects of diverse political outlooks give to questions of partisan import (Khanna & Sood 2016; Prior, Sood & Gaurav 2015; Bullock, Gerber, Hill & Huber 2015).

[…]

If monetary incentives do meaningfully reverse identity-protective forms of information processing in studies that reflect the PMRP [politically-motivated reasoning paradigm] design, then a plausible inference would be that offering rewards for “correct answers” is a sufficient intervention to summon the truth-seeking information-processing style that (at least some) subjects use outside of domains that feature identity-expressive goals. In effect, the incentives transform subjects’ from identity-protectors to knowledge revealers (Kahan 2015a), and activate the corresponding shift in information-processing styles appropriate to those roles.

Whether this would be the best understanding of such results, and what the practical implications of such a conclusion would be, are also matters that merit further empirical examination.

↩︎
In his review of “debiasing” strategies, Larrick (2004) summarized the evidence for the “think of alternatives” strategy this way:

By necessity, cognitive strategies tend to be context-specific rules tailored to address a narrow set of biases, such as the law of large numbers or the sunk cost rule. This fact makes the simple but general strategy of “consider the opposite” all the more impressive, because it has been effective at reducing overconfidence, hindsight biases, and anchoring effects (see Arkes, 1991; Mussweiler, Strack, & Pfeiffer, 2000). The strategy consists of nothing more than asking oneself, “What are some reasons that my initial judgment might be wrong?” The strategy is effective because it directly counteracts the basic problem of association-based processes – an overly narrow sample of evidence – by expanding the sample and making it more representative. Similarly, prompting decision makers to consider alternative hypotheses has been shown to reduce confirmation biases in seeking and evaluating new information.

Soll and Klayman (2004) have offered an interesting variation on “consider the opposite.” Typically, subjective range estimates exhibit high overconfidence. Ranges for which people are 80 percent confident capture the truth 30 percent to 40 percent of the time. Soll and Klayman (2004) showed that having judges generate 10th and 90th percentile estimates in separate stages – which forces them to consider distinct reasons for low and high values – increased hit rates to nearly 60 percent by both widening and centering ranges.

“Consider the opposite” works because it directs attention to contrary evidence that would not otherwise be considered. By comparison, simply listing reasons typically does not improve decisions because decision makers tend to generate supportive reasons. Also, for some tasks, reason generation can disrupt decision-making accuracy if there is a poor match between the reasons that are easily articulated and the actual factors that determine an outcome (Wilson & Schooler, 1991). Lastly, asking someone to list too many contrary reasons can backfire – the difficulty of generating the tenth “con” can convince a decision maker that her initial judgment must have been right after all…

↩︎
Moore et al. (2016) summarize a few of the studies on problem decomposition and judgment accuracy:

Researchers have devoted a great deal of effort to developing ways to reduce overprecision [a type of overconfidence]. Most of the research has revolved around three main approaches… [one of which is] decomposing the response set or alternatives into smaller components and considering each one of them separately…

…[This approach] capitalizes on support theory’s subadditivity effect (Tversky & Koehler, 1994). It suggests counteracting overprecision by taking the focal outcome and decomposing it into more specific alternatives. Fischhoff, Slovic, and Lichtenstein (1978) found that the sum of all probabilities assigned to the alternatives that make up the set is larger than the probability assigned to the set as a whole. Thus, when estimating likelihoods for a number of possible outcomes, the more categories the judge is assessing (and the less we include under “all others”) the less confident they will be that their chosen outcome is the correct one. Decomposition of confidence intervals has also achieved encouraging results. Soll and Klayman (2004) asked participants to estimate either an 80% confidence interval or the 10th and 90th fractiles separately (the distance between which should cover 80% of the participant’s probability distribution). They found that the consideration of the high and low values separately resulted in wider and less overprecise intervals.

One elicitation method combines both the consideration of more information and the decomposition of the problem set into more specific subsets. The SPIES method (short for Subjective Probability Interval Estimates) (Haran, Moore, & Morewedge, 2010) turns a confidence interval into a series of probability estimates for different categories across the entire problem set. Instead of forecasting an interval that should include, with a certain level of confidence, the correct answer, the participant is presented with the entire range of possible outcomes. This range is divided into bins, and the participant estimates the probability of each bin to include the correct answer. For example, to predict the daily high temperature in Chicago on May 21, we can estimate the probability that this temperature will be below 50°F, between 51°F and 60°F, between 61°F and 70°F, between 71°F and 80°F, between 81°F and 90°F, and 91°F or more. Because these bins cover all possible options, the sum of all estimates should amount to 100%. From these subjective probabilities we can extract an interval for any desired confidence level. This method not only produces confidence intervals that are less overprecise than those produced directly but it also reduces overprecision in subsequent estimates when participants switch back to the traditional confidence interval method (Haran, Moore et al., 2010). This reduction, however, does not seem to stem from the generalization of a better estimation process. Rather, the most pronounced improvements in estimates after a SPIES practice period seem to be when the SPIES task turns judges’ attention to values previously regarded as the most unlikely (Haran, 2011). It may be possible, then, that when people are made aware of the possibility that their knowledge is incomplete (by directly estimating likelihoods of values which they completely ignored before), they increase caution in their confidence intervals.
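As a rough illustration of the last step described above (extracting an interval from bin probabilities), here is a minimal Python sketch. It is not the SPIES authors' implementation: the function names, the outer bin edges of 40°F and 100°F, the example probabilities, and the assumption that probability is spread evenly within each bin are all hypothetical choices made for the example.

```python
from typing import Sequence, Tuple


def quantile_from_bins(edges: Sequence[float], probs: Sequence[float], q: float) -> float:
    """Invert the piecewise-linear CDF implied by the bin probabilities."""
    cumulative = 0.0
    for left, right, p in zip(edges[:-1], edges[1:], probs):
        if p > 0 and cumulative + p >= q:
            # Assumption: probability mass is spread evenly inside the bin.
            return left + (right - left) * (q - cumulative) / p
        cumulative += p
    return edges[-1]


def interval_from_bins(
    edges: Sequence[float], probs: Sequence[float], confidence: float
) -> Tuple[float, float]:
    """Return the central interval that holds `confidence` of the probability."""
    if len(edges) != len(probs) + 1:
        raise ValueError("need exactly one more edge than bins")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("bin probabilities must sum to 1")
    tail = (1.0 - confidence) / 2.0
    return (
        quantile_from_bins(edges, probs, tail),
        quantile_from_bins(edges, probs, 1.0 - tail),
    )


# Hypothetical answers for the six temperature bins in the quoted example.
# The open-ended bins are closed off at 40 and 100 purely for illustration.
edges = [40, 50, 60, 70, 80, 90, 100]
probs = [0.05, 0.15, 0.40, 0.30, 0.08, 0.02]
low, high = interval_from_bins(edges, probs, 0.80)
print(f"80% interval: {low:.1f}°F to {high:.1f}°F")
```

The same bin probabilities can be reused for any other confidence level by changing the last argument, which is the practical appeal of eliciting the whole distribution rather than a single interval.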

Tetlock & Gardner (2015), on the basis of the forecasting tournaments reported in that book, list problem decomposition as one of their “Ten Commandments for Aspiring Superforecasters”:

(2) Break seemingly intractable problems into tractable sub-problems.

Channel the playful but disciplined spirit of Enrico Fermi who — when he wasn’t designing the world’s first atomic reactor — loved ballparking answers to head-scratchers such as “How many extraterrestrial civilizations exist in the universe?” Decompose the problem into its knowable and unknowable parts. Flush ignorance into the open. Expose and examine your assumptions. Dare to be wrong by making your best guesses. Better to discover errors quickly than to hide them behind vague verbiage.

Superforecasters see Fermi-izing as part of the job. How else could they generate quantitative answers to seemingly impossible-to-quantify questions about Arafat’s autopsy, bird-flu epidemics, oil prices, Boko Haram, the Battle of Aleppo, and bond-yield spreads.

We find this Fermi-izing spirit at work even in the quest for love, the ultimate unquantifiable. Consider Peter Backus, a lonely guy in London, who guesstimated the number of potential female partners in his vicinity by starting with the population of London (approximately six million) and winnowing that number down by the proportion of women in the population (about 50%), by the proportion of singles (about 50%), by the proportion in the right age range (about 20%), by the proportion of university graduates (about 26%), by the proportion he finds attractive (only 5%), by the proportion likely to find him attractive (only 5%), and by the proportion likely to be compatible with him (about 10%). Conclusion: roughly twenty-six women in the pool, a daunting but not impossible search task.

There are no objectively correct answers to true-love questions, but we can score the accuracy of the Fermi estimates that superforecasters generate in the IARPA tournament. The surprise is how often remarkably good probability estimates arise from a remarkably crude series of assumptions and guesstimates.
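The Backus example above is a chain of multiplications, which can be sketched in a few lines of Python. The figures are the rounded ones given in the quote, and the variable names are hypothetical. With these rounded inputs the product comes out near 20, the same order of magnitude as the quoted "roughly twenty-six"; the gap presumably reflects rounding in the quoted proportions.

```python
import math

# Starting population and the successive filters, as rounded in the quote.
population_of_london = 6_000_000
filters = {
    "women": 0.50,
    "single": 0.50,
    "right age range": 0.20,
    "university graduates": 0.26,
    "he finds attractive": 0.05,
    "find him attractive": 0.05,
    "compatible": 0.10,
}

# A Fermi estimate is the product of the base quantity and each proportion.
estimate = population_of_london * math.prod(filters.values())

# Printing the running total exposes which assumption shrinks the pool most.
running = float(population_of_london)
for label, proportion in filters.items():
    running *= proportion
    print(f"after '{label}' ({proportion:.0%}): {running:,.1f}")
```

Laying the assumptions out this way makes each one easy to challenge and revise separately, which is the point of the decomposition.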

↩︎
Tetlock & Gardner (2015), ch. 4, describe “extremizing” this way:

When you combine the judgments of a large group of people to calculate the “wisdom of the crowd” you collect all the relevant information that is dispersed among all those people. But none of those people has access to all that information. One person knows only some of it, another knows some more, and so on. What would happen if every one of those people were given all the information? They would become more confident — raising their forecasts closer to 100% or zero. If you then calculated the “wisdom of the crowd” it too would be more extreme. Of course it’s impossible to give every person all the relevant information — so we extremize to simulate what would happen if we could.
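The quote does not give a formula for extremizing. One transform often used for this purpose pushes an averaged probability away from 50% by raising the odds to a power. The sketch below uses that transform as an assumption; the function name and the exponent value of 2.0 are hypothetical, and in practice the exponent would need to be tuned on past data.

```python
from statistics import mean


def extremize(probability: float, exponent: float = 2.0) -> float:
    """Push a probability toward 0 or 1; exponent > 1 extremizes, 1 leaves it unchanged."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    numerator = probability ** exponent
    return numerator / (numerator + (1.0 - probability) ** exponent)


# Hypothetical individual forecasts for a single yes/no question.
forecasts = [0.6, 0.7, 0.8]
crowd = mean(forecasts)
extremized = extremize(crowd)
print(f"simple average: {crowd:.3f}, extremized: {extremized:.3f}")
```

Note that 50% stays at 50% under this transform, so extremizing only amplifies a lean the crowd already has.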

↩︎
Soll et al. (2016) summarize some of this literature briefly:

When judgments are provided by many people, an extremely effective way to combine them is to weight them equally, such as by taking the simple average or applying majority rule (e.g., Clemen, 1989; Hastie & Kameda, 2005). The idea of harnessing the “wisdom of crowds” has been applied to a wide variety of contexts, ranging from sports prediction markets to national security (Surowiecki, 2004). For quantity estimates, averaging provides benefits over the average individual whenever individual guesses bracket the truth (i.e., some guesses on both sides), so that high and low errors will cancel out (Larrick, Mannes, & Soll, 2012).
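The bracketing point above can be checked with a small numerical sketch. The guesses and the true value are invented for illustration, and the function name is hypothetical. When guesses fall on both sides of the truth, the error of the average is smaller than the average individual error; when they all fall on one side, the two are equal.

```python
from statistics import mean
from typing import Sequence, Tuple


def crowd_vs_individual(guesses: Sequence[float], truth: float) -> Tuple[float, float]:
    """Return (error of the averaged guess, average error of individual guesses)."""
    crowd_error = abs(mean(guesses) - truth)
    average_individual_error = mean(abs(g - truth) for g in guesses)
    return crowd_error, average_individual_error


truth = 100

# Guesses on both sides of the truth: high and low errors partly cancel.
bracketing = crowd_vs_individual([80, 90, 120, 130], truth)

# Guesses all above the truth: nothing cancels, so averaging gives no gain.
one_sided = crowd_vs_individual([110, 120, 130, 140], truth)

print("bracketing:", bracketing)
print("one-sided: ", one_sided)
```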

Tetlock & Gardner (2015) also report several “wisdom of crowds” effects in Tetlock et al.’s forecasting tournaments:

Teams of ordinary forecasters beat the wisdom of the crowd by about 10%. Prediction markets beat ordinary teams by about 20%. And superteams beat prediction markets by 15% to 30%.

I can already hear the protests from my colleagues in finance that the only reason the superteams beat the prediction markets was that our markets lacked liquidity: real money wasn’t at stake and we didn’t have a critical mass of traders. They may be right. It is a testable idea, and one worth testing. It’s also important to recognize that while superteams beat prediction markets, prediction markets did a pretty good job of forecasting complex global events.

↩︎
I refer to Tetlock’s forecasting tournaments, both those reported in Tetlock (2005) and especially those reported in Tetlock & Gardner (2015). ↩︎
There is relatively little (compelling) literature on this hypothesis, and I haven’t evaluated that literature carefully, but my understanding is that what literature exists tends to support the hypothesis, including Tetlock’s work, which I find unusually convincing (due to the strength of the study designs).

Non-Tetlock work on this is reviewed (for example) in the section on previous literature in Legerstee & Franses (2014), where the authors write:

[One] kind of feedback is task properties feedback, which is sometimes also called environmental feedback. It involves providing the forecaster with statistical information on the variable to be forecast. It can encompass data characteristics or statistical model forecasts. Note that it might be argued that this is not genuine feedback as it is provided before the judgmental forecast is given and is not feedback on the performance of the judgmental forecaster (see Björkman, 1972). This task properties feedback has received most attention in research on feedback on judgmental forecasting (see Sanders, 1992; Remus et al., 1996; Welch et al., 1998; Goodwin and Fildes, 1999). In all cases it is found to improve forecast accuracy and in general it is found to be the most effective form of feedback (Lawrence et al., 2006).

Intuitively, it seems plausible that “task properties feedback” (which is simply information about the phenomenon to be forecast, supplied before the forecast is made) should especially improve the resolution of one’s forecasts, while feedback on one’s forecasting performance should improve one’s calibration. This hypothesis is (weakly) supported by e.g. Stone & Opel (2000):

Benson and Onkal (1992) suggest that environmental feedback [i.e. task properties feedback], unlike performance feedback, should be effective for improving people’s discrimination skill [i.e. resolution], since environmental information provides information about the event to be judged. Only a small amount of work, however, has examined the impact of environmental feedback isolated from other types of feedback on judgmental accuracy. Lichtenstein and Fischhoff (1977, Experiment 2) trained participants to discriminate between European and American handwriting by providing them with samples of each type of handwriting. This handwriting training served as a type of environmental feedback, as it provided the participants with task information. As predicted, those participants who underwent the training procedure achieved higher discrimination scores than did those who received no such training.

If calibration and discrimination are psychologically distinct concepts, then providing domain-specific information (environmental feedback) should have no impact on calibration. In fact, Lichtenstein and Fischhoff did find an improvement in calibration scores resulting from the training in their study. However, they concluded that this improvement did not reflect a true improvement in calibration skill, but instead resulted from the hard–easy effect (cf. Lichtenstein et al., 1982; Suantak, Bolger, & Ferrell, 1996), whereby difficult questions (those answered correctly 50–70% of the time) produce overconfidence, easy questions (those answered correctly 80–100% of the time) produce underconfidence, and those of moderate difficulty (those answered correctly 70–80% of the time) produce the best calibration. Since improvements in discrimination reflect gains in substantive knowledge on a topic, it would be expected that gains in discrimination would be accompanied by an increased number of questions answered correctly. Indeed, those participants who underwent the handwriting training answered 71% correctly while those who did not undergo the training answered only 51% correctly. Thus, on the basis of the increase in percentage of items answered correctly alone, the improvement in calibration could be attributed to the hard–easy effect rather than to a true improvement in calibration skill.

[…]

The previous review suggests that, within the domains studied, performance feedback improves calibration and that environmental feedback improves discrimination. There is also reason to believe that performance feedback does not affect discrimination and that environmental feedback does not affect calibration; however, these conclusions are more equivocal, in that past findings have been open to multiple interpretations. The primary goal of the present study, then, was to demonstrate this dissociation…

[…]

The results [of the new experiment reported in this study] strongly supported this hypothesis. Additionally, we found two unexpected effects: (1) the impact of feedback was greater for hard slides than for easy slides, and (2) environmental feedback led to increased overconfidence for easy slides.
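To make the distinction between calibration and discrimination (resolution) concrete, here is a minimal Python sketch of the standard Murphy decomposition of the Brier score, in which the score equals reliability (miscalibration) minus resolution plus uncertainty. This is a textbook identity rather than anything taken from Stone & Opel, and the function name and example data are hypothetical. The identity is exact here because forecasts are grouped by their exact stated value.

```python
from collections import defaultdict
from typing import Sequence, Tuple


def brier_decomposition(
    forecasts: Sequence[float], outcomes: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (reliability, resolution, uncertainty) for binary outcomes."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    # Group outcomes by the exact probability that was stated.
    groups = defaultdict(list)
    for forecast, outcome in zip(forecasts, outcomes):
        groups[forecast].append(outcome)

    # Reliability: gap between stated probability and observed frequency (lower is better).
    reliability = sum(
        len(obs) * (forecast - sum(obs) / len(obs)) ** 2
        for forecast, obs in groups.items()
    ) / n
    # Resolution: how far group frequencies move away from the base rate (higher is better).
    resolution = sum(
        len(obs) * (sum(obs) / len(obs) - base_rate) ** 2
        for obs in groups.values()
    ) / n
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability, resolution, uncertainty


# Hypothetical forecaster: says 90% ten times (9 hits) and 10% ten times (1 hit).
forecasts = [0.9] * 10 + [0.1] * 10
outcomes = [1] * 9 + [0] + [1] + [0] * 9
reliability, resolution, uncertainty = brier_decomposition(forecasts, outcomes)
brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)
print(reliability, resolution, uncertainty, brier)
```

On this reading, performance feedback would be expected to lower the reliability term, while better information about the domain would be expected to raise the resolution term.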

In the large forecasting tournaments described in Tetlock & Gardner (2015), there were many important correlates of forecasting accuracy, and several of them were related to domain knowledge: “political knowledge,” “average number of articles shared,” “average number of articles checked,” and others — see Table 3 of Mellers et al. (2015). But this, too, is relatively weak evidence, as domain knowledge was not manipulated experimentally. ↩︎
Weather forecasters are commonly cited as a group that exhibits good accuracy (e.g. see Silver 2012, ch. 4), but they do not provide an example of accurate judgment in the absence of reasonably strong models and plentiful data.

On bridge players, see Keren (1987). On TV pundits and political scientists, see ch. 2 of Silver (2012). For reviews of relevant judgmental forecasting literature, see e.g. Stone & Opel (2000) and Lawrence et al. (2006).

As for professional futurists: I’m currently investigating the track record of long-range forecasting, much of which has been performed by professional “futurists.” I might change my mind later, but so far my impression is that the accuracy of long-range (≥10 year) forecasts by the most-respected, best-resourced professional futurists of the 50s-90s has not been very good. This shouldn’t be a surprise: as far as I can tell, professional futurists of this period rarely if ever engaged in probability calibration training, and forecasting the long-term future is no doubt more difficult than forecasting short-term outcomes. (Of course it’s possible that contemporary futurists are more accurate than those from the 50s-90s, but we’ll have to wait for time to pass before we can evaluate the accuracy of their long-range forecasts.)

Thus far, I’ve published only one finding from my ongoing investigation of the track record of long-range forecasting, concerning some technology forecasts from a book called The Year 2000. ↩︎
Here is an incomplete list of data points that informed my impression:
1. At least “several” companies have invested in explicit probability calibration training, e.g. Royal Dutch Shell and Doug Hubbard’s clients.
2. The jacket cover of 1989’s Decision Traps, which includes calibration training as one of its key recommendations (pp. 96-102), claims that the authors “have improved the decision-making skills of thousands of Fortune 500 executives with the program described in this book. Their clients come from such companies as General Motors, Royal Dutch/Shell, IBM, [and others].” A 2001 book by the same authors (Winning Decisions) says, in the acknowledgments, that “John Oakes teamed up with us in the mid-1990s to design a management training program based on our book Decision Traps.”
3. As late as March 2014, and shortly before the publication of the major papers describing IARPA’s forecasting tournaments, Mandel et al. (2014) claimed that “the work described later in this report is… to the best of our knowledge, the first systematic, long-term evaluation of the quality of analytic forecasts extracted from real intelligence reports, using [a proper scoring rule].”
4. I asked Robin Hanson, creator of the first corporate prediction market and a leading advocate for the use of prediction markets, for his impression about how commonly firms make use of prediction markets. His emailed reply was that “The fraction using prediction markets is quite tiny, surely far smaller than the use of statistics etc. I’d be surprised if there are 100 firms using [prediction markets] at any one time.”
5. Chang et al. (2016) report that “few organizations have embraced the debiasing methods that have been developed (Croskerry, 2003; Graber et al., 2012; Lilienfeld et al., 2009).”
↩︎
