https://www.openphilanthropy.org/blog/how-feasible-long-range-forecasting
The opening:
How accurate do long-range (≥10yr) forecasts tend to be, and how much should we rely on them?
As an initial exploration of this question, I sought to study the track record of long-range forecasting exercises from the past. Unfortunately, my key finding so far is that it is difficult to learn much of value from those exercises, for the following reasons:
1. Long-range forecasts are often stated too imprecisely to be judged for accuracy.
2. Even if a forecast is stated precisely, it might be difficult to find the information needed to check the forecast for accuracy.
3. Degrees of confidence for long-range forecasts are rarely quantified.
4. In most cases, no comparison to a “baseline method” or “null model” is possible, which makes it difficult to assess how easy or difficult the original forecasts were.
5. Incentives for forecaster accuracy are usually unclear or weak.
6. Very few studies have been designed so as to allow confident inference about which factors contributed to forecasting accuracy.
7. It’s difficult to know how comparable past forecasting exercises are to the forecasting we do for grantmaking purposes, e.g. because the forecasts we make are of a different type, and because the forecasting training and methods we use are different.
"The accuracy of technological forecasts, 1890-1940" is a paper I happened to already know about. It seems somewhat relevant, but I didn't see it mentioned.
Thanks! I knew there was one major study I was missing from the 70s, and that I had emailed people about before, but I couldn't track it down when I was writing this post, and I'm pretty sure this is the one I was thinking of. Of course, this study suffers from several of the problems I list in the post.
Happy to see this focus. I still find it quite strange how little attention the general issue has gotten from other groups and how few decent studies exist.
I feel like one significant distinction for these discussions is that of calibration vs. resolution. This was mentioned in the footnotes (with a useful table) but I think it may deserve more attention here.
If long-term calibration is expected to be reasonable, then I would assume we could get much of the important information we could be interested in about forecasting ability from the resolution numbers. If forecasters are confident in predictions for a 5-20+ year time frame, this would be evident in corresponding high-resolution forecasts. If we want to compare these to baselines we could set them up now and compare resolution numbers.
We could also have forecasters do meta-forecasts: forecasts about forecasts. I believe that the straightforward resolution numbers should provide the main important data, but there could be other things you may be interested in. For example, "What average level of resolution could we get on this set of questions if we were to spend X resources forecasting them?" If the forecasters were decently calibrated, the main way this could go poorly is if the predictions for these questions turned out to be low-resolution, but if so, that would be apparent quickly.
The much trickier thing seems to be calibration. If we cannot trust our forecasts to be calibrated over long time horizons, then the resolution of their forecasts is likely to be misleading, possibly in a highly systematic and deceiving way.
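To make the calibration vs. resolution distinction concrete, here is a minimal sketch of the standard Murphy decomposition of the Brier score for binary questions. The function names and the example forecasts are my own illustrations, not anything from the post or its spreadsheet. It assumes forecasts are grouped by their exact stated probability, in which case Brier = reliability - resolution + uncertainty holds exactly; "reliability" here is the calibration term (lower is better) and "resolution" is the term that rewards confident forecasts that turn out right (higher is better).

```python
from collections import defaultdict


def brier_score(forecasts, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty) for binary forecasts.

    Forecasts are grouped by their exact stated probability, so
    brier_score == reliability - resolution + uncertainty.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    # Collect the outcomes observed for each distinct stated probability.
    groups = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        groups[p].append(o)

    reliability = 0.0  # calibration error: stated probability vs. observed frequency
    resolution = 0.0   # how far the observed group frequencies sit from the base rate
    for p, observed in groups.items():
        freq = sum(observed) / len(observed)
        reliability += len(observed) * (p - freq) ** 2
        resolution += len(observed) * (freq - base_rate) ** 2

    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Hypothetical example data: ten resolved binary questions.
forecasts = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.5, 0.5]
outcomes = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
reliability, resolution, uncertainty = murphy_decomposition(forecasts, outcomes)
```

A forecaster who always states the base rate would score zero on both terms: perfectly calibrated, but with no resolution at all. That is the sense in which resolution numbers are only informative once calibration can be trusted.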
However, long-term calibration seems like a relatively constrained question to me, and one with possibly a pretty positive outlook. My impression from the table and spreadsheet is that, in general, calibration was shown to be quite similar for short-term and long-term forecasts. Also, it's not clear to me why calibration would be dramatically worse on long-term questions than on specific short-term questions that we could test cheaply. For instance, if we expected that forecasters may be poorly calibrated on long-term questions because the incentives are poor, we could try having forecasters forecast very short-term questions with similarly poor incentives. I recall Anthony Aguirre speculating that he didn't expect Metaculus forecasters' incentives to change much for long-term questions, but I forget where this was mentioned (it may have been a podcast).
Having some long-term studies seems quite safe as well, but I'm not sure how much extra benefit they will give us compared to more rapid short-term studies combined with large sets of long-term predictions by calibrated forecasters (which should come with numbers of resolution).
Separately, I missed the footnotes on my first read-through, but I think they may have been my favorite part of the post. The link to them is a bit small (though clicking on the citation numbers brings them up).
I'd be interested to know how people think long-range forecasting is likely to differ from short-range forecasting, and to what degree we can apply findings from short-range forecasting to long-range forecasting. Could it be possible to, for example, ask forecasters to forecast at a variety of short-range timescales, fit a curve to their accuracy as a function of time (or otherwise try to mathematically model the "half-life" of the knowledge powering the forecast--I don't know what methodologies could be useful here, maybe survival analysis?) and extrapolate this model to long-range timescales?
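As a rough sketch of what such a curve fit could look like, the code below assumes (and this is a strong assumption, not an established finding) that some positive skill measure decays exponentially with forecast horizon, fits that curve by least squares on the log scale, and reads off a "half-life". The function names and the data points are hypothetical; the numbers are generated from an exact exponential purely to show the mechanics.

```python
import math


def fit_exponential_decay(horizons, skills):
    """Fit skill(h) = s0 * exp(-h / tau) by least squares on log(skill).

    Requires every skill value to be positive and skill to decline with horizon.
    Returns (s0, tau).
    """
    ys = [math.log(s) for s in skills]
    n = len(horizons)
    mean_h = sum(horizons) / n
    mean_y = sum(ys) / n
    sxx = sum((h - mean_h) ** 2 for h in horizons)
    sxy = sum((h - mean_h) * (y - mean_y) for h, y in zip(horizons, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_h
    return math.exp(intercept), -1.0 / slope


def predict_skill(s0, tau, horizon):
    """Extrapolate the fitted curve to an arbitrary horizon."""
    return s0 * math.exp(-horizon / tau)


def half_life(tau):
    """Horizon over which fitted skill drops by half."""
    return tau * math.log(2)


# Hypothetical skill scores at horizons of 3 months to 2 years (in years).
horizons = [0.25, 0.5, 1.0, 2.0]
skills = [0.6 * 0.5 ** (h / 4.0) for h in horizons]

s0, tau = fit_exponential_decay(horizons, skills)
skill_at_10_years = predict_skill(s0, tau, 10.0)
```

The extrapolation step is exactly where this could fail: nothing in short-range data guarantees that the same functional form holds at 10+ years, so the fitted half-life is only as good as that assumption.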
I'm also curious why there isn't more interest in presenting people with historical scenarios and asking them to forecast what will happen next in the historical scenario. Obviously if they already know about that period of history this won't work, but that seems possible to overcome.
If forecasters are giving forecasts for similar things over different time horizons, their resolution should very obviously decrease with time. A good example of this is time series forecasts, whose uncertainty grows the further they are projected into the future.
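A minimal illustration of that widening, assuming the simplest possible model (a random walk with independent steps, which is my choice for illustration and not the model behind the image referenced in this comment): the forecast variance after h steps is h times the one-step variance, so the interval half-width grows with the square root of the horizon. The function name and parameters are hypothetical.

```python
import math


def random_walk_interval(last_value, step_sd, horizon, z=1.96):
    """Approximate 95% forecast interval for a random walk, `horizon` steps ahead.

    The point forecast stays at `last_value`; only the spread around it grows,
    in proportion to sqrt(horizon).
    """
    half_width = z * step_sd * math.sqrt(horizon)
    return last_value - half_width, last_value + half_width


# Interval widths at 1, 4, and 16 steps ahead double each time the horizon quadruples.
widths = []
for h in (1, 4, 16):
    low, high = random_walk_interval(last_value=100.0, step_sd=2.0, horizon=h)
    widths.append(high - low)
```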
To cite my other comment here, the tricky part, from what I could tell, is calibration, but this is a narrower problem. More work could definitely be done to test calibration over forecast time. My impression is that it doesn't fall dramatically, probably not enough to make a very smooth curve. I feel like if it were the case that it reliably fell for some forecasters, and those forecasters learned that, they could adjust accordingly. Of course, if the only feedback cycles are 10-year forecasts, that could take a while.
Image from the Bayesian Biologist: https://bayesianbiologist.com/2013/08/20/time-series-forecasting-bike-accidents/
I'm not sure what you mean by resolution. But if you mean accuracy, perhaps a counterexample is the reversion of stock values to the long-term mean appreciation curve, which produces value forecasts that are actually more accurate five or ten years out than in the near term?
When you have long-term predictions that you plan on keeping track of, it is helpful, if possible, to create multiple models to apply to each forecast, so that in the retrospective one can determine which of the models, if any, was more successful than the others.
So perhaps you have a prediction about how many volunteers will be required for a particular initiative to save x lives 10 years out. If you keep three separate forecasting reports which are explicit about their reasoning, then the iterative improvement process can happen a bit more quickly.
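One way the retrospective could be done, sketched under my own assumptions: each model's numeric forecasts are recorded up front, and once the actual values are known the models are ranked by mean absolute error. The model names and all numbers below are hypothetical placeholders, and mean absolute error is just one reasonable scoring choice.

```python
def mean_absolute_error(predictions, actuals):
    """Average absolute gap between predicted and realized values."""
    return sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(actuals)


def rank_models(model_forecasts, actuals):
    """Return (model_name, error) pairs sorted from most to least accurate."""
    scored = [
        (name, mean_absolute_error(preds, actuals))
        for name, preds in model_forecasts.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical volunteer-count forecasts from three separately documented
# models, for two initiatives that have since resolved.
model_forecasts = {
    "trend_extrapolation": [100, 120],
    "reference_class": [90, 95],
    "expert_judgment": [150, 160],
}
actuals = [95, 100]

ranking = rank_models(model_forecasts, actuals)
```

Keeping the reasoning for each model written down alongside its numbers is what lets the ranking say something about methods rather than just about luck on a handful of questions.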