Data on forecasting accuracy across different time horizons and levels of forecaster experience

Charles Dillon 🔸

27 min read · May 27, 2021

Comments 7

Javier Prieto🔸

Thanks for doing these analyses!

I recently had to dive into the Metaculus data for a report I'm writing and I produced the following plot along the way. I'm posting it here because it didn't make it into the final report, but I felt it was worth sharing anyway.

Each dot is the Brier score of the community prediction on a non-ambiguously resolved question, plotted against time horizon (i.e. the time remaining until resolution when the prediction was made). There are up to 101 predictions per question for the reasons you describe in the post. The red line is a moving average and the shaded area is a (t-distributed) 95% confidence interval around the mean.
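For readers who want to build a similar curve, here is a rough sketch of a moving average with an interval around it. It is not the code behind the plot: the function names, the window size and the (horizon, score) input layout are assumptions, and it swaps in a normal approximation for the t-distributed interval mentioned above so that it runs on the standard library alone.

```python
import math
from statistics import NormalDist, mean, stdev

def brier(p, outcome):
    """Brier score of one binary forecast: squared error against the 0/1 outcome."""
    return (p - outcome) ** 2

def rolling_mean_ci(points, window, level=0.95):
    """points: (horizon_days, brier_score) pairs; window must be at least 2.
    Returns one (centre_horizon, mean, lower, upper) tuple per sliding window."""
    pts = sorted(points)  # order by time horizon
    # Normal approximation; a t quantile would be slightly larger.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    out = []
    for i in range(len(pts) - window + 1):
        chunk = pts[i:i + window]
        scores = [s for _, s in chunk]
        m = mean(scores)
        half_width = z * stdev(scores) / math.sqrt(window)
        centre = mean(h for h, _ in chunk)
        out.append((centre, m, m - half_width, m + half_width))
    return out
```

With small windows the normal approximation understates the width of the interval, so a t quantile is the better choice there.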

Charles Dillon 🔸

Nice graph, thanks!

Nathan Young

E[Brier] is the Brier score you would expect if the predictors were perfectly calibrated. It is quite similar across subgroups, as is average question duration. I include it to check if subgroups were typically making more or less confident predictions on average - more confident predictions would have a lower E[Brier].

How can you come up with a number for this? Surely a perfect predictor would have a Brier of 0? (I'm definitely wrong but I'd like someone to explain)

Charles Dillon 🔸

"Perfectly calibrated", not "perfect". That is, their stated probabilities match the observed frequencies, i.e. 20% of their 20% predictions came true, etc.

So in this case, someone making all 90% predictions will have an expected score of 0.9×0.1^2 + 0.1×0.9^2 = 0.09, while someone making all 80% predictions will have an expected score of 0.8×0.2^2 + 0.2×0.8^2 = 0.16.

In general a lower expected score means your typical prediction was more confident.
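The arithmetic above can be sketched in a few lines. The function names are made up for illustration; the formula is just the one from the worked example, applied to any stated probability.

```python
def expected_brier(p):
    """Expected Brier score of a forecast p that comes true with probability p,
    i.e. the score a perfectly calibrated forecaster expects on that forecast."""
    return p * (1 - p) ** 2 + (1 - p) * p ** 2  # simplifies to p * (1 - p)

def mean_expected_brier(predictions):
    """Average expected Brier score over a list of stated probabilities."""
    return sum(expected_brier(p) for p in predictions) / len(predictions)

print(round(expected_brier(0.9), 4))  # 0.09
print(round(expected_brier(0.8), 4))  # 0.16
```

Since the expression simplifies to p × (1 − p), it peaks at 0.25 for a 50% prediction and falls towards 0 as predictions approach 0% or 100%.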

Nathan Young

This may be answered in the post, in which case I'll delete this: how much effort would it be to do this for the Metaculus median?

Charles Dillon 🔸

Using the code I linked above, it should require only minor changes if the Metaculus prediction is in one of the time series in the data, which I guess it is? Probably for someone with good familiarity with the API it would be a matter of an hour or two, otherwise it might take a bit longer.

I unfortunately will not have time to do this anytime soon.

niplav

Great investigation! Now I'm slightly less salty that your post is exclusively cited when it comes to the relation of range and accuracy (though I may still bask in the glory of second-hand citation :-p).

Few users updated their predictions, and updating was not associated with lower Brier scores overall, though there was not enough data to infer much here. Of 9230 updates, 3141 (34%) were performed by the most frequent individual predictor and 4710 (51%) were due to the top 3 most frequent updaters.

Matches my experience, though I think Metaculus is slightly better in this regard. Should still give observers pause to think about how suboptimal those platforms are.

The models were:

<1y: 0.9400*Prediction - 0.0154

1-3y: 0.9122*Prediction - 0.1066

3-5y: 0.8927*Prediction - 0.0837

5+y: 0.8587*Prediction - 0.1089

This is super cool!
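For anyone who wants to play with these, here is a minimal sketch that evaluates the quoted models. I'm assuming each one is a linear fit that maps a stated probability to a fitted value for that horizon bucket, which the quoted text does not spell out; the dictionary layout and the function name are mine.

```python
# Slope and intercept per horizon bucket, copied from the models quoted above.
MODELS = {
    "<1y": (0.9400, -0.0154),
    "1-3y": (0.9122, -0.1066),
    "3-5y": (0.8927, -0.0837),
    "5+y": (0.8587, -0.1089),
}

def fitted_value(prediction, horizon):
    """Evaluate the quoted linear model for one stated probability (0 to 1)."""
    slope, intercept = MODELS[horizon]
    return slope * prediction + intercept

for horizon in MODELS:
    print(horizon, round(fitted_value(0.5, horizon), 4))
```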

I wanted to look into whether forecasters appeared to get better over time. For this, I took those forecasters with >100 predictions, and compared their performance on their first 50 predictions to their last 50.

The answer appeared to be "maybe". There was no improvement in Brier scores or overconfidence, but it is possible that they tried to predict more difficult questions later on.

I think that 100 predictions just isn't enough, especially if you're not doing deliberate practice. I think my predictions started getting okay after having experienced ~100 question resolutions, which would imply several hundred predictions. Surprised to hear the reviewer had the opposite opinion!

It should be possible to test this by performing a similar analysis, but looking at predictions made after a certain number of resolutions for that user and checking whether there is an improvement. I think resolutions should be the focus here: You can learn very little from predictions that you don't know the outcome of yet (though I've found it helpful to predict Metaculus with the community prediction hidden and then check against the community). I'm not sure it would be worth the effort to perform this analysis, but I'll put it on my todo list.
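A rough sketch of what that resolution-based split could look like for a single user. The record layout and the function name are assumptions, and for simplicity it counts resolved predictions instead of distinct resolved questions.

```python
from bisect import bisect_right
from statistics import mean

def brier_by_experience(records, k):
    """records: (made_at, resolved_at, probability, outcome) tuples for one user,
    with comparable timestamps and outcome 0 or 1.
    Splits predictions by whether the user had already seen at least k of their
    own predictions resolve at the time the prediction was made.
    Returns (mean Brier before, mean Brier after); None for an empty group."""
    resolved_times = sorted(r[1] for r in records)
    before, after = [], []
    for made_at, _, p, outcome in records:
        seen = bisect_right(resolved_times, made_at)  # resolutions at or before made_at
        (after if seen >= k else before).append((p - outcome) ** 2)
    return (mean(before) if before else None, mean(after) if after else None)
```

Running this per user for a few values of k (say 10, 50, 100) and comparing the two means would show whether any improvement tracks resolutions seen more closely than raw prediction count.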

For the Metaculus data I could glean less information, as there were fewer questions and no user-level data available.

FWIW Metaculus now makes their user-level data available to researchers if you ask nicely.

Since we now know that 41% of things happen ;-), it'd be interesting to see whether things that are far off happen more rarely (or, in plain English, do questions with longer horizons resolve positively less often?). I don't think you looked into this here, right?
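Checking that would be cheap once the data is in hand. A minimal sketch, assuming one (horizon in days, outcome) pair per question and the same four horizon buckets used elsewhere in the post; the names and bucket edges are my own choices.

```python
from bisect import bisect_right
from collections import defaultdict

LABELS = ("<1y", "1-3y", "3-5y", "5+y")
EDGES = (365, 3 * 365, 5 * 365)  # bucket boundaries in days (assumed)

def positive_rate_by_horizon(questions):
    """questions: (horizon_days, outcome) pairs, outcome 1 for a positive
    resolution and 0 otherwise. Returns {bucket label: share resolved positively}."""
    counts = defaultdict(lambda: [0, 0])  # label -> [positives, total]
    for horizon_days, outcome in questions:
        label = LABELS[bisect_right(EDGES, horizon_days)]
        counts[label][0] += outcome
        counts[label][1] += 1
    return {label: pos / total for label, (pos, total) in counts.items()}
```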

As for data sources, I've started working on a collection of forecasting datasets, but my funding for that ran out and wasn't renewed :-/ Maybe I'll find a way to finish it.

Preds	N	μPreds	μQs	μDuration	μBrier	μE[Brier]	Overconfidence
1	615	1.00	1.00	325.19	0.245	0.130	47.8%
2-4	545	2.65	2.52	299.54	0.228	0.133	38.8%
5-10	258	6.80	6.27	349.06	0.212	0.134	31.2%
11-25	182	16.59	15.31	337.99	0.190	0.138	22.2%
26-100	170	50.02	45.15	334.68	0.167	0.138	11.2%
100+	76	423.73	319.01	333.20	0.153	0.135	5.6%

NumPreds	μBrier	μE[Brier]	μScaledBrier
1	0.245	0.130	0.220
2-4	0.228	0.133	0.209
5-10	0.212	0.134	0.199
11-25	0.190	0.138	0.184
26-100	0.167	0.138	0.165
100+	0.153	0.135	0.152

TimeHorizon	μBrier	μE[Brier]	Overconfidence
<1 year	0.162	0.138	10.4%
1-3 years	0.162	0.139	10.4%
3-5 years	0.149	0.126	9.3%
5+ years	0.140	0.118	8.2%

First 50			Last 50
μBrier	μE[Brier]	OverConf	μBrier	μE[Brier]	OverConf
0.153	0.137	7.0%	0.156	0.135	7.9%

First 50			Last 50
μBrier	μE[Brier]	OverConf	μBrier	μE[Brier]	OverConf
0.147	0.126	9.5%	0.156	0.124	13.4%

First 10			11th-50th
μBrier	μE[Brier]	OverConf	μBrier	μE[Brier]	OverConf
0.159	0.140	9.0%	0.153	0.134	8.1%

P	0.05	0.15	0.25	0.35	0.45	0.55	0.65	0.75	0.85	0.95
NPreds
1	64	19	26	24	6	9	25	53	41	69
2-4	186	46	77	36	17	26	69	79	54	116
5-10	255	48	104	40	18	31	61	100	71	156
11-25	502	102	242	78	59	74	132	184	115	242
26-100	1455	398	555	230	192	303	336	389	278	608
>100	5590	1974	2386	1128	1455	1792	1415	1078	973	1789

Time Horizon	μBrier	μE[Brier]	Overconfidence
<1 year	0.151	0.172	-0.285
2-4	0.167	0.193	-0.330
5-10	0.169	0.192	-0.310

P	0.05	0.15	0.25	0.35	0.45	0.55	0.65	0.75	0.85	0.95
Duration
<1y	5683	1948	2593	1170	1316	1760	1562	1503	1219	2435
1-3y	1303	413	530	238	276	321	323	254	226	340
3-5y	729	147	164	96	126	126	121	83	62	139
5+y	337	79	103	32	29	28	32	43	25	66

P	0.05	0.15	0.25	0.35	0.45	0.55	0.65	0.75	0.85	0.95
Duration
<1y	2490	940	1111	517	619	843	668	612	520	997
1-3y	551	192	216	111	119	148	130	93	96	132
3-5y	324	73	74	39	57	57	60	34	31	60
5+y	155	31	49	17	14	13	16	17	8	28