
Note: This is a linkpost for the Metaculus Journal. I work for Metaculus and this is part of a broader exploration of how we can compare forecasting performance. 

Introduction

"Which one of two forecasters is the better one?" is a question of great importance. Countless internet points and hard-earned bragging rights are at stake. But it is also a question of great practical concern to decision makers. Imagine you are king Odysseus of Ithaca and you urgently need counsel for important government affairs. You vividly remember the last time you went away on a journey and now ask yourself: Is Delphi really worth the extra mile, or is the oracle of Dodona just as good? 

Being Odysseus, a man of cunning, you quickly think of ways to obtain more knowledge. The obvious answer, of course, would be to run a forecasting tournament between the two oracles. But how many forecasting questions do you need? Divine revelations are expensive. And once the questions resolve and one oracle comes out ahead, how sure can you be that the winning oracle is really better? Oracling is not an exact science and sometimes, after all, the gods do play dice. Even a perfect oracle (who always reports a probability equal to the true probability of an event) will sometimes look bad if Tyche is not on her side. 

Methods

You quickly decide you need to run a simulation study. That way you will know the 'true probabilities' of events. You can easily compare a perfect forecaster with a noisy forecaster and see whether the difference comes out as significant. You want to look at binary outcomes, meaning that either an event happens (outcome = 1) or it doesn't (outcome = 0). You also decide to restrict the analysis to only two forecasters to avoid issues with multiple comparisons and additional complications. Battles become complicated if too many factions are involved. 

You choose to do a thousand replications of your experiment. For every replication, you simulate 1000 true probabilities, observations, and corresponding forecasts. You then add increasing amounts of noise to the forecasts, score them, and run a paired t-test to check whether there is a significant difference between a noisy forecaster and a perfect forecaster.

There is a qualitative difference between a forecast that is moved from 50% to 55% and one that is moved from 94.5% to 99.5%. The former is only a small update, whereas the latter corresponds to a massive increase in confidence. This becomes obvious when you convert the probabilities to odds (odds = p / (1 - p)). The move from 50% to 55% would mean a move from 10:10 to 11:9 expressed as odds. The move from 94.5% to 99.5% would mean a move from 189:11 to 1990:10. To reflect this, you decide to convert your probabilities to log odds, add some normally distributed noise, and convert back to probabilities:
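A minimal sketch of that transformation in Python with numpy (function and variable names here are illustrative, not from the original analysis):

```python
import numpy as np

def add_log_odds_noise(p, sd, rng):
    """Perturb probabilities with Gaussian noise in log odds space."""
    log_odds = np.log(p / (1 - p))               # probability -> log odds
    noisy = log_odds + rng.normal(0, sd, size=p.shape)
    return 1 / (1 + np.exp(-noisy))              # log odds -> probability

rng = np.random.default_rng(42)
print(add_log_odds_noise(np.array([0.5, 0.945]), sd=0.3, rng=rng))
```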

(You could of course also multiply odds by some noise instead of adding noise to log odds, but the latter is more elegant and computationally efficient. You are, after all, Odysseus the Cunning, and don't care about the fact that logarithms were first described - invented? - in 1614). 

This is the precise algorithm you use (a code sketch follows the list):

For each of 1000 replications:
1. Draw N = 1000 true probabilities as a random uniform variable between 0 and 1
2. Create N = 1000 observations as draws from a Bernoulli distribution (drawing either a 0 or a 1) with the probability of drawing a 1 equal to the true probabilities drawn in step 1
3. Create 5 forecasts with differing noise levels: 0 (no noise), 0.1, 0.3, 0.5, and 1
- Convert probabilities to log odds: log(p / (1 - p))
- noisy_log_odds = log_odds + N(mean = 0, sd = noise level)
- Convert back to obtain noisy probabilities
4. Score the forecasts using the log score
- For each noise level and each number of observations considered
5. Use a paired t-test to check whether the difference in score between the noisy forecaster and the perfect forecaster (no noise) is significantly different from zero
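Here is a sketch of one replication in Python (assuming numpy and scipy; the original analysis may have used different tooling):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N = 1000
noise_levels = [0, 0.1, 0.3, 0.5, 1]

# Steps 1 and 2: true probabilities and Bernoulli outcomes
true_p = rng.uniform(0, 1, N)
outcomes = rng.binomial(1, true_p)

def log_score(p, outcome):
    """Negative log of the probability assigned to the realised outcome."""
    return -np.log(np.where(outcome == 1, p, 1 - p))

# Steps 3 and 4: noisy forecasts for each noise level, then scoring
scores = {}
for sd in noise_levels:
    log_odds = np.log(true_p / (1 - true_p))
    noisy_p = 1 / (1 + np.exp(-(log_odds + rng.normal(0, sd, N))))
    scores[sd] = log_score(noisy_p, outcomes)

# Step 5: paired t-test, each noisy forecaster vs. the perfect one
for sd in noise_levels[1:]:
    t_stat, p_value = stats.ttest_rel(scores[sd], scores[0])
    print(f"noise sd = {sd}: p-value = {p_value:.3f}")
```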

The log score is a proper scoring rule, meaning that it can't be cheated and forecasters are always incentivised to report their true belief. It is commonly defined as the negative logarithm of the forecast probability evaluated at the observed outcome. If your forecast is p that an event will actually happen and it happens (outcome = 1), then your log score is -log(p). If it does not happen (outcome = 0), then your log score is -log(1 - p). Smaller values are good (best is 0) and larger values are bad (the log score represents a penalty).
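As a quick worked example of the penalty (a minimal sketch):

```python
import math

def log_score(p, outcome):
    """Penalty: negative log of the probability assigned to the realised outcome."""
    return -math.log(p if outcome == 1 else 1 - p)

print(log_score(0.8, 1))  # ~0.22: confident and right, small penalty
print(log_score(0.8, 0))  # ~1.61: confident and wrong, large penalty
```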

Results

Here is a summary of the simulated data. The perfect forecaster (no noise) gets a log score of 0.501, the noisiest one a log score of 0.581. The next column shows the mean absolute difference between a forecaster's predictions and the true probability. This is 0 for the perfect forecaster (who always reports the true probabilities). The forecaster with noise level 0.5, for example, differs from the true probability by 6.5 percentage points on average. The last column shows the difference between the perfect forecast and the noisy forecast in log odds space. The table also shows Brier scores, as those are more intuitive for some readers.

Given the way the noise was constructed, the average absolute difference between prediction and true probability is slightly misleading. Adding noise in log odds space leads to probability differences that appear larger in absolute terms when predictions are closer to 0.5. This is illustrated in the following plot:

Figure 1: Average difference between the probability prediction of the noisy forecaster and the true probability.
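A quick numerical illustration of why a fixed shift in log odds moves the probability most near 50% (illustrative values, using the same noise construction as above):

```python
import numpy as np

def probability_shift(p, delta):
    """Absolute change in probability from a fixed shift `delta` in log odds."""
    shifted = np.log(p / (1 - p)) + delta
    return abs(1 / (1 + np.exp(-shifted)) - p)

for p in [0.5, 0.8, 0.95, 0.99]:
    # a +0.5 log odds shift moves p = 0.5 by ~12 points, but p = 0.99 by ~0.4 points
    print(p, round(probability_shift(p, 0.5), 3))
```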

Delphi or Dodona - significant choices

Now, how easy is it in an ideal world to say that one forecaster is significantly better than another? You, Odysseus the Cunning, predate the invention of the t-test by around three thousand years. But that shall not prevent you from applying a paired t-test to compare your noisy forecasters against the perfect forecaster. To simulate the effect of the number of available questions on the significance level, you run your tests with varying numbers of questions included. Here are your results:

Figure 2: The proportion of replications in which forecasts from the perfect forecaster are deemed to be significantly better than those of the noisy forecaster

The plot shows the proportion of replications that yield a significant difference between a noisy forecaster and the perfect forecaster. The following shows the same information as a table:

You scratch your head and beard in astonishment. "By Zeus! It seems like even in an ideal world I need quite a few observations (or quite a large difference in skill) to validly conclude that two forecasters are significantly different".

Indeed, even for 100 questions (which is rather large for a typical forecasting tournament) and a fairly large difference in forecaster performance, this would come up as significant less than 50% of the time. And in an actual real-life forecasting tournament, we couldn't even be sure that the questions are completely uncorrelated. If they are correlated, that reduces the effective sample size and you'd need even more observations.
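For concreteness, here is a sketch of how this significance check could be run across question counts (assuming numpy/scipy, and 200 replications rather than 1000 to keep it quick; not the original code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def replication_pvalue(n_questions, sd, rng):
    """One simulated tournament: paired t-test of noisy vs. perfect scores."""
    true_p = rng.uniform(0, 1, n_questions)
    outcomes = rng.binomial(1, true_p)
    log_odds = np.log(true_p / (1 - true_p))
    noisy_p = 1 / (1 + np.exp(-(log_odds + rng.normal(0, sd, n_questions))))
    score = lambda p: -np.log(np.where(outcomes == 1, p, 1 - p))
    return stats.ttest_rel(score(noisy_p), score(true_p)).pvalue

for n in [25, 50, 100, 500, 1000]:
    pvals = [replication_pvalue(n, sd=0.5, rng=rng) for _ in range(200)]
    print(n, np.mean(np.array(pvals) < 0.05))  # share of significant replications
```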

But would the better forecaster at least win in a direct stand-off? Would the perfect forecaster have a lower average log score than the noisy one? You go back to your abacus and move a few marbles to crunch the numbers again. This is your result:  

Figure 3: The proportion of replications in which the perfect forecaster beats the noisy forecaster, i.e. has a lower mean log score across all available questions

You take a deep breath. This also doesn't look ideal. The noisiest forecaster would win against the perfect forecaster around 8% of the time in a tournament with 50 questions. But a forecaster who is just half as noisy would already win about 24% of the time in the same tournament. That's still assuming that all questions are perfectly independent. You try to console yourself: The fact that the perfect forecaster would win against the noisiest forecaster in a 50-question tournament 92% of the time isn't too bad! But then you sigh: "Not too bad, but still useless. This info would make me happy if I were a (perfect) forecaster! But sadly I'm just a poor king in Ithaca and I still don't know which oracle to trust."
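The head-to-head comparison can be sketched the same way (a sketch under the same assumptions as above, not the original code):

```python
import numpy as np

rng = np.random.default_rng(3)

def perfect_wins(n_questions, sd, rng):
    """True if the perfect forecaster gets the lower mean log score."""
    true_p = rng.uniform(0, 1, n_questions)
    outcomes = rng.binomial(1, true_p)
    log_odds = np.log(true_p / (1 - true_p))
    noisy_p = 1 / (1 + np.exp(-(log_odds + rng.normal(0, sd, n_questions))))
    mean_score = lambda p: -np.log(np.where(outcomes == 1, p, 1 - p)).mean()
    return mean_score(true_p) < mean_score(noisy_p)

# Win rate over 1000 simulated 50-question tournaments at the highest noise level;
# this should land near the ~92% figure reported above.
print(np.mean([perfect_wins(50, sd=1, rng=rng) for _ in range(1000)]))
```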

Conclusions

Decision makers want to know whether one forecaster is better than another, since that is the one they should listen to. The last simulation (Figure 3) suggests that a noticeably better forecaster has a decent chance of winning a direct comparison. The problem is that even if both forecasters are equally good, one of them has to win. The mere fact that one of them won is some evidence that they are better, but not that much. Even with a quite noticeable difference in performance and in an ideal scenario, it takes quite a few questions to make a decision at the standard 5% significance level. This is important to know, and we should adjust our expectations accordingly whenever we try to compare two forecasters, forecasting platforms, or oracles.

You, Odysseus, sit down on a rock. "How, by Zeus, am I supposed to run 500 questions by the oracles of Dodona and Delphi? Get their answers and wait for all the questions to resolve? This is nothing a mere mortal can do! I really wish all major oracles would just share their questions among each other. Every oracle could make predictions on all of them and over time we would have a great database with all their track records." You look up to the skies and send a short prayer to the holy quintinity of Metaculus, INFER, Good Judgment, Hypermind, and Manifold. "This would indeed be a project worthy of the gods!" 

Comments



You can get a sense for these sorts of numbers just by looking at a binomial distribution.

e.g., Suppose that there are n events which each independently have a 45% chance of happening, and a noisy/biased/inaccurate forecaster assigns 55% to each of them.

Then the noisy forecaster will look more accurate than an accurate forecaster (who always says 45%) if >50% of the events happen, and you can use the binomial distribution to see how likely that is to happen for different values of n. For example, according to this binomial calculator, with n=51 there is a 24% chance that at least 26/51 of the p=.45 events will resolve as True, and with n=201 there is an 8% chance (I'm picking odd numbers for n so that there aren't ties).

With slightly more complicated math you can look at statistical significance, and you can repeat for values other than trueprob=45% & forecast=55%.
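The commenter's numbers can be reproduced with scipy's binomial survival function (a sketch; the comment itself used an online calculator):

```python
from scipy.stats import binom

# P(at least 26 of 51 independent p = 0.45 events happen): ~0.24
print(binom.sf(25, n=51, p=0.45))

# P(at least 101 of 201 such events happen): ~0.08
print(binom.sf(100, n=201, p=0.45))
```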

Good comment, thank you!

Great work and great writing, thank you. I wonder if there's anything better-powered than t-tests in this setting though?

ETA: is "which forecaster is best?" actually the right question to be answering? If the forecasts are close enough that we can't tell the difference after 100 questions, maybe we don't care about the difference?

Can't think of anything better than a t-test, but open to suggestions.

If a forecaster is consistently off by like 10 percentage points - I think that is a difference that matters. But even in that extreme scenario where the (simulated) difference between two forecasters is in fact quite large, we have a hard time picking that up using standard significance tests.

Afraid I don't have good ideas here.

Intuitively, I think there should be a way to take advantage of the fact that the outcomes are heavily structured. You have predictions on the same questions and they have a binary outcome.

OTOH, if in 20% of cases the worse forecaster is better on average, that suggests that there is just a hard bound on how much we can get.

If I can modus tollens this modus ponens, it feels to me that

Indeed, even for 100 questions [...] this would come up as significant less than 50% of the time

is evidence that the noise level is low, and the skill difference is small.

E.g., taking the top 20 forecasters in Metaculus' last Quarterly Cup, we see average score differences of ~0.05 (equivalent to your highest noise level), and that's among the very top forecasters we had in that tournament!
