Evidence on good forecasting practices from the Good Judgment Project: an accompanying blog post

kokotajlod

25 min read · Feb 15, 2019

Comments 14

Gregory Lewis🔸

Excellent. This series of interviews with superforecasters is also interesting. [H/T Ozzie]

Aaron Gertler 🔸

If anyone wants to read about a non-superforecaster who still took the tournament moderately seriously, I wrote up my experience participating in Season 4 (and getting a good-but-not-great score by relying on a couple of basic heuristics).

kokotajlod

(I'm the author)

Yep, I still endorse the post. It does what it says on the tin, and it does it well. Highest compliment I've received about it (courtesy of Katja): Good Judgment project guy got back to us [...] and also said, “And I just realized that your shop wrote a very smart, subtle review of Tetlock’s book Superforecasting a couple years ago. I’ve referred to it many times.”

I recently had an opportunity to reflect on how it influenced me and what if anything I now disagree with:

Two years ago I wrote a deep-dive summary of Superforecasting and the associated scientific literature. I learned about the “Outside view” / “Inside view” distinction, and the evidence supporting it. At the time I was excited about the concept and wrote: “...I think we should do our best to imitate these best-practices, and that means using the outside view far more than we would naturally be inclined.”
Now that I have more experience, I think the concept is doing more harm than good in our community. The term is easily abused and its meaning has expanded too much. I recommend we permanently taboo “Outside view,” i.e. stop using the word and use more precise, less confused concepts instead. This post explains why.

One minor thing to change has to do with what Linch points out.

Linch

when the authors rounded the superforecaster’s forecasts to the nearest 0.05, their accuracy dropped.

You might be interested in knowing that this part is not supported by the evidence, see KrisMoore's comment on Metaculus.

I also personally doubt that the "calibration precision" of most supers is at 101 units (though it's certainly possible!)

Gregory Lewis🔸

It is true that given the primary source (presumably this), the implication is that rounding supers to 0.1 hurt them, but 0.05 didn't:

To explore this relationship, we rounded forecasts to the nearest 0.05, 0.10, or 0.33 to see whether Brier scores became less accurate on the basis of rounded forecasts rather than unrounded forecasts. [...]

For superforecasters, rounding to the nearest 0.10 produced significantly worse Brier scores [by implication, rounding to the nearest 0.05 did not]. However, for the other two groups, rounding to the nearest 0.10 had no influence. It was not until rounding was done to the nearest 0.33 that accuracy declined.

Prolonged aside:

That said, despite the absent evidence I'm confident accuracy with superforecasters (and ~anyone else - more later, and elsewhere) does numerically drop with rounding to 0.05 (or anything else), even if it has not been demonstrated to be statistically significant:

From first principles, if the estimate has signal, shaving bits of information from it by rounding should make it less accurate (and it obviously shouldn't make it more accurate, pretty reliably setting the upper bound of our uncertainty to 0).
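A minimal sketch of this first-principles point, under a strongly idealized assumption that is not in the original comment: every forecast equals the true probability of the event, and true probabilities are spread evenly over [0, 1]. The function names are hypothetical. In that setting the expected Brier score of a forecast f against true probability p is (f - p)^2 + p(1 - p), so rounding can only add to it.

```python
# Idealized assumption: each unrounded forecast equals the true probability.

def expected_brier(forecast, true_p):
    """Expected value of (forecast - outcome)^2 when outcome ~ Bernoulli(true_p)."""
    return true_p * (1 - forecast) ** 2 + (1 - true_p) * forecast ** 2


def round_to(x, step):
    """Round x to the nearest multiple of step, clipped to [0, 1]."""
    return min(1.0, max(0.0, round(x / step) * step))


def mean_expected_brier(step=None, n=1001):
    """Average expected Brier score over n evenly spaced true probabilities."""
    total = 0.0
    for i in range(n):
        p = i / (n - 1)
        f = p if step is None else round_to(p, step)
        total += expected_brier(f, p)
    return total / n


baseline = mean_expected_brier()
for step in (0.05, 0.10, 0.33):
    loss = mean_expected_brier(step) - baseline
    print(f"rounding to {step:.2f}: expected Brier score worsens by {loss:.5f}")
```

Real forecasters are noisy, so the size of the loss here is only illustrative; the sign of the effect is what the sketch is meant to show.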

Further, there seems very little motivation for the idea we have n discrete 'bins' of probability across the number line (often equidistant!) inside our heads, and as we become better forecasters n increases. That we have some standard error to our guesses (which ~smoothly falls with increasing skill) seems significantly more plausible. As such the 'rounding' tests should be taken as loose proxies to assess this error.

Yet if the error process is like this, rather than 'n real values + jitter no more than 0.025', undersampling and aliasing should introduce a further distortion. Even if you think there really are n bins someone can 'really' discriminate between, intermediate values are best seen as a form of anti-aliasing ("Think it is more likely 0.1 than 0.15, but not sure, maybe it's 60/40 between them so I'll say 0.12") which rounding ablates. In other words, 'accurate to the nearest 0.1' does not mean the second decimal place carries no information.

Also, if you are forecasting distributions rather than point estimates (cf. Metaculus), said forecast distributions typically imply many intermediate value forecasts.

Empirically, there's much to suggest a type II error (false negative) explanation of the lack of a 'significant' drop. As you'd expect, the size of the accuracy loss grows with both how coarsely things are rounded, and the performance of the forecaster. Even if relatively finer coarsening makes things slightly worse, we may expect to miss it. This looks better to me on priors than these trends 'hitting a wall' at a given level of granularity (so I'd guess untrained forecasters are numerically worse if rounded to 0.1, even if the worse performance means there is less signal to be lost, and in turn makes this hard to 'statistically significantly' detect).
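To make the detectability point concrete, here is a rough sketch under the same kind of idealized assumption (forecasts equal true probabilities, spread evenly over [0, 1]); none of this comes from the paper, and the function names are hypothetical. It computes the mean and standard deviation of the per-question difference between rounded and unrounded Brier scores, then estimates how many independent questions a paired comparison would need before the expected z-statistic reaches 1.96 (roughly 50% power).

```python
import math


def round_to(x, step):
    """Round x to the nearest multiple of step, clipped to [0, 1]."""
    return min(1.0, max(0.0, round(x / step) * step))


def paired_difference_moments(step, n=1001):
    """Mean and SD of (rounded Brier - unrounded Brier) for a single question."""
    mean = 0.0
    mean_sq = 0.0
    for i in range(n):
        p = i / (n - 1)          # true probability, also the unrounded forecast
        r = round_to(p, step)    # rounded forecast
        d_yes = (1 - r) ** 2 - (1 - p) ** 2  # score difference if the event happens
        d_no = r ** 2 - p ** 2               # score difference if it does not
        mean += (p * d_yes + (1 - p) * d_no) / n
        mean_sq += (p * d_yes ** 2 + (1 - p) * d_no ** 2) / n
    return mean, math.sqrt(mean_sq - mean ** 2)


def questions_needed(step, z=1.96):
    """Questions needed for the expected z-statistic of the paired difference to reach z."""
    mean, sd = paired_difference_moments(step)
    return math.ceil((z * sd / mean) ** 2)


for step in (0.05, 0.10, 0.33):
    print(f"rounding to {step:.2f}: about {questions_needed(step)} questions")
```

In this toy setting the finer the rounding, the more questions are needed to see its cost, which is the pattern one would expect if the non-significant result at 0.05 reflects low power rather than no effect.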

I'd adduce other facts against this too. One is simply that superforecasters are prone to not give forecasts on a 5% scale, using intermediate values instead: given their good calibration, you'd expect them to iron out this Brier-score-costly jitter (also, this would be one of the few things they are doing worse than regular forecasters). You'd also expect discretization in things like their calibration curve (e.g. events they say happen 12% of the time in fact happen 10% of the time, whilst events that they say happen 13% of the time in fact happen 15% of the time), or other derived figures like ROC.

This is ironically foxy, so I wouldn't be shocked for this to be slain by the numerical data. But I'd bet at good odds (north of 3:1) on things like "Typically, for 'superforecasts' of X%, these events happened more frequently than those forecast at (X-1)%, (X-2)%, etc."

Larks

It always seemed strange to me that the idea was expressed as 'rounding'. Replacing a 50.4% with 50% seems relatively innocuous to me; replacing 0.6% with 1% - or worse, 0.4% with 0% - seems like a very different thing altogether!
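One way to see why these cases feel so different is to measure each change in log-odds rather than in probability. The sketch below is an illustration, not something from the original studies; `log_odds` and `log_odds_shift` are hypothetical helper names.

```python
import math


def log_odds(p):
    """Natural-log odds of probability p, with 0 and 1 mapped to -inf and +inf."""
    if p <= 0.0:
        return -math.inf
    if p >= 1.0:
        return math.inf
    return math.log(p / (1 - p))


def log_odds_shift(original, rounded):
    """Absolute change in log-odds caused by replacing `original` with `rounded`."""
    return abs(log_odds(rounded) - log_odds(original))


# The three replacements mentioned above.
for original, rounded in ((0.504, 0.50), (0.006, 0.01), (0.004, 0.0)):
    print(f"{original:.3f} -> {rounded:.2f}: shift = {log_odds_shift(original, rounded)}")
```

On this scale the first replacement barely moves the forecast, the second moves it substantially, and the third is an unbounded shift.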

Linch

I think I broadly agree with what you say and will not bet against your last paragraph, except for the trivial sense that I expect most studies to be too underpowered to detect those differences.

kokotajlod

Thanks, I'll update the text when I get access to Metaculus again (I've blocked myself from it for productivity reasons lol)

Aaron Gertler 🔸

This post was awarded an EA Forum Prize; see the prize announcement for more details.

My notes on what I liked about the post, from the announcement:

"Evidence on good forecasting practices from the Good Judgment Project" is a thorough, well-organized summary of forecasting — a topic often discussed on the Forum, but rarely with this amount of data.
We may know that prediction markets are “useful”, but the author goes far beyond that, explaining how well different types of markets (and non-market mechanisms) have performed in prediction tournaments, and which characteristics the best forecasters tend to have. This research could be useful to any number of future forecasting projects in the community.
Additionally, the author:
Uses numbered headers to separate sections.
Includes hyperlinked footnotes for all citations.
Notes cases where information from original sources is missing or uncertain, giving readers ideas for ways to contribute to his research. (For example, I’d love to learn more about Tetlock’s “perpetual beta” concept, if anyone cares to go and find it.)
Overall, this is a remarkable post, and I hope that other Forum users create similarly excellent summaries of important concepts.

Ofer

Thank you for writing this.

Is the one-hour training module publicly available?

One might worry that training improves accuracy by motivating the trainees to take their jobs more seriously. Indeed it seems that the trained forecasters made more predictions per question than the control group, though they didn’t make more predictions overall. Nevertheless it seems that the training also had a direct effect on accuracy as well as this indirect effect.[34]

I could not find results like the ones in Table 4 in which the Brier scores are based only on the first answer that forecasters provide. Allowing forecasters to update their forecasts as frequently as they want (while reporting average daily Brier scores) plausibly gives an advantage to the forecasters who are willing to invest more time in their task.

The paper that Table 4 comes from states that "Training was a significant predictor of average number of forecasts per question for year 1 and the number of forecasts per question was also significant predictor of accuracy (measured as mean standardized Brier score)". Consider Table 10 in the paper, which shows "Forecasts per question per user by year". Notice that in year 3 the forecasters who got training made 4.27 forecasts per question, while forecasters who did not get training made only 1.90 forecasts per question. The paper includes additional statistical analyses related to this issue (unfortunately I don't have the combination of time and background in statistics to understand them all).

kokotajlod

The exact training module they used is probably not public, but they do have a training module on their website. It costs money though.

For sure, forecasters who devoted more effort to it tended to make more accurate predictions. It would be surprising if that wasn't true!

Pablo

In case it helps others decide whether or not to take the Superforecasting Fundamentals course, I'm reposting a brief message I sent to the CEA Slack workspace back in August 2017:

I took it a year or so ago. The course is very good, but also very basic: I clearly wasn’t the target audience, since I was already quite familiar with most of the content. I wouldn't recommend it unless you don’t know anything about forecasting.

Ofer

For sure, forecasters who devoted more effort to it tended to make more accurate predictions. It would be surprising if that wasn't true!

I agree. But I am not referring to an extra effort that makes a person provide a better forecast (e.g. by spending more time looking for arguments), but rather an extra effort that allows one to improve their average daily Brier scores by simply using new public information that was not available when the question was first presented (e.g. new poll results).
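A toy illustration of this mechanism, with made-up numbers and simplifying assumptions that are not taken from the paper: each day is scored as (forecast - outcome)^2, the most recent forecast is carried forward until it is updated, and days before a forecaster's first forecast are not scored. The tournament's actual scoring rule may differ in its details.

```python
def average_daily_brier(updates, n_days, outcome):
    """Average of daily (forecast - outcome)^2, carrying the latest forecast forward.

    updates: dict mapping day number (1-based) to the probability given that day.
    Days before the first forecast are skipped (an assumption of this sketch).
    """
    current = None
    total = 0.0
    scored_days = 0
    for day in range(1, n_days + 1):
        if day in updates:
            current = updates[day]
        if current is not None:
            total += (current - outcome) ** 2
            scored_days += 1
    return total / scored_days


# Hypothetical 10-day question that resolves "yes" (outcome = 1).
one_shot = {1: 0.5}          # forecasts once and never returns
updater = {1: 0.5, 6: 0.9}   # same first forecast, then updates after new public information

print(average_daily_brier(one_shot, 10, 1))
print(average_daily_brier(updater, 10, 1))
```

Both forecasters give the same first answer, yet the one who updates ends up with the lower (better) average daily score, which is why results based only on first forecasts would help separate the two effects.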

kokotajlod

I agree that this was probably a factor that contributed to the accuracy gains of people who made more frequent forecasts. It may even have been doing most of the work; I'm not sure.
