In the table below, I summarise the properties of light-tailed and heavy-tailed distributions as described in Ben Kuhn's article Searching for outliers. Any errors or misinterpretations are my own. The sections that follow contain excerpts transcribed from the article, one section per property.

Distribution | Light-tailed | Heavy-tailed |
---|---|---|
Ratio between the top percentiles and the median | Low | High |
Generation mechanism | Additive | Multiplicative |
Importance of outliers | Low | High |
Number of samples to get a really good outcome | Low | High |
What to filter for | "Probably good" | "Maybe amazing" |

### Ratio between the top percentiles and the median

As a rule of thumb, a heavy-tailed distribution is one where the top few percent of outcomes are a large multiple of the typical or median outcome:

- Income is heavy-tailed: the median person globally lives on $2,500 a year, while the top 1% live on $45,000, almost 20× more.
- Height is light-tailed: the tallest people are only a few feet taller than average.
- If height followed the same distribution as income, Elon Musk, who made $121b in 2021, would be about 85,000 km tall, or about ¼ of the distance from the earth to the moon.
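
The arithmetic behind these figures can be checked with a short Python sketch. The income figures are the ones quoted above; the median height of 1.75 m is my own round-number assumption, and the Earth-Moon distance of about 384,400 km is the commonly cited average, so treat the result as a rough sanity check rather than the article's own calculation.

```python
# Figures quoted in the text above.
MEDIAN_INCOME_USD = 2_500
TOP_1_PERCENT_INCOME_USD = 45_000
MUSK_INCOME_2021_USD = 121e9

# Assumptions for illustration only (not taken from the article).
MEDIAN_HEIGHT_M = 1.75
EARTH_MOON_DISTANCE_KM = 384_400

# How many times the median income the top 1% live on.
top_to_median = TOP_1_PERCENT_INCOME_USD / MEDIAN_INCOME_USD

# Scale the median height by the same multiple that separates
# Musk's 2021 income from the median income.
musk_multiple = MUSK_INCOME_2021_USD / MEDIAN_INCOME_USD
musk_height_km = musk_multiple * MEDIAN_HEIGHT_M / 1_000
fraction_of_moon_distance = musk_height_km / EARTH_MOON_DISTANCE_KM

print(f"top 1% / median income: {top_to_median:.0f}x")
print(f"height at Musk's income multiple: {musk_height_km:,.0f} km")
print(f"fraction of Earth-Moon distance: {fraction_of_moon_distance:.2f}")
```

With these assumptions the height comes out in the region of 85,000 km, a bit under a quarter of the way to the moon.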

### Generation mechanism

Light-tailed distributions most often occur because the outcome is the result of many independent contributions, while heavy-tailed distributions often arise from processes that are multiplicative or self-reinforcing:

- For example, the richer you are, the easier it is to earn more money.
- The more Twitter followers you have, the more retweets you’ll get, and the more you’ll be exposed to new potential followers.
- The cost-effectiveness of a global health intervention comes from multiplying many different variables, each of which is itself the product of several other factors:
    - How bad the disease you’re fighting is.
    - How much of an impact the intervention has on the disease.
    - How costly doing the intervention for one person is.
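
To make the additive versus multiplicative contrast concrete, here is a minimal simulation sketch. Everything in it is my own assumption for illustration: each outcome is built from 20 independent factors drawn uniformly between 0.5 and 1.5, combined once by summing and once by multiplying. The helper `top_percentile_to_median` is a hypothetical name, not something from the article.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

N_OUTCOMES = 20_000  # simulated outcomes per mechanism (arbitrary choice)
N_FACTORS = 20       # independent contributions per outcome (arbitrary choice)


def draw_factor():
    # Hypothetical contribution: uniform between 0.5 and 1.5.
    return random.uniform(0.5, 1.5)


def top_percentile_to_median(values, q=0.99):
    # Ratio between the q-th quantile (nearest rank) and the median.
    ordered = sorted(values)
    top = ordered[min(len(ordered) - 1, int(q * len(ordered)))]
    return top / statistics.median(ordered)


# Additive mechanism: the outcome is the sum of the factors.
additive = [
    sum(draw_factor() for _ in range(N_FACTORS)) for _ in range(N_OUTCOMES)
]
# Multiplicative mechanism: the outcome is the product of the factors.
multiplicative = [
    math.prod(draw_factor() for _ in range(N_FACTORS)) for _ in range(N_OUTCOMES)
]

additive_ratio = top_percentile_to_median(additive)
multiplicative_ratio = top_percentile_to_median(multiplicative)

print(f"additive 99th percentile / median:       {additive_ratio:.2f}")
print(f"multiplicative 99th percentile / median: {multiplicative_ratio:.2f}")
```

Under these assumptions the summed outcomes stay close to their median, while the multiplied outcomes have a 99th percentile many times larger than theirs, even though both are built from exactly the same kind of factor.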

### Importance of outliers

Notably, in a light-tailed distribution, outliers don’t matter much:

- The 1% of tallest people are still close enough to the average person that you can safely ignore them most of the time.
- By contrast, in a heavy-tailed distribution, outliers matter a lot: even though 90% of people live on less than $15,000 a year, there are large groups of people making 1,000 times more.
- Because of this, heavy-tailed distributions are much less intuitive to understand or predict.
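
One way to quantify how much outliers matter is to ask what share of the total the top 1% accounts for. The sketch below does this for two made-up populations: a normal distribution standing in for height (mean 170, standard deviation 10) and a log-normal distribution standing in for income (`mu=0`, `sigma=1.5`). These parameters are assumptions chosen for illustration, not fitted to real data.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

N = 100_000  # sample size (arbitrary choice)


def top_share(values, top_fraction=0.01):
    # Fraction of the grand total contributed by the largest values.
    ordered = sorted(values, reverse=True)
    k = max(1, int(top_fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)


# Light-tailed stand-in for height (parameters are assumptions).
heights = [random.gauss(170, 10) for _ in range(N)]
# Heavy-tailed stand-in for income (parameters are assumptions).
incomes = [random.lognormvariate(0, 1.5) for _ in range(N)]

light_share = top_share(heights)
heavy_share = top_share(incomes)

print(f"top 1% share of total, light-tailed: {light_share:.1%}")
print(f"top 1% share of total, heavy-tailed: {heavy_share:.1%}")
```

In the light-tailed sample the top 1% holds only slightly more than 1% of the total, so ignoring it changes little. In the heavy-tailed sample the top 1% holds a far larger share, which is why dropping the outliers gives a misleading picture.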

### Number of samples to get a really good outcome

The most important thing to remember when sampling from heavy-tailed distributions is that getting lots of samples improves outcomes a ton:

- In a light-tailed context—say, picking fruit at the grocery store—it’s fine to look at two or three apples and pick the best-looking one:
    - It would be completely unreasonable to, for example, look through the entire bin of apples for that one apple that’s just a bit better than anything you’ve seen so far.

- In a heavy-tailed context, the reverse is true:
    - It would be similarly unreasonable to, say, pick your romantic partner by taking your favorite of the first two or three single people you run into.
    - Every additional sample you draw increases the chance that you get an outlier.
    - So one of the best ways to improve your outcome is to draw as many samples as possible.
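
The value of extra samples can be sketched by comparing the average best-of-3 with the average best-of-100 in each setting. The two distributions below are my own stand-ins (a normal with mean 100 and standard deviation 5 for the apples, a log-normal with `sigma=1.5` for the heavy-tailed case), and `average_best_of` is a hypothetical helper name.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

N_TRIALS = 2_000  # repetitions used to average out the noise (arbitrary choice)


def draw_light():
    # Light-tailed stand-in, e.g. apple quality (parameters are assumptions).
    return random.gauss(100, 5)


def draw_heavy():
    # Heavy-tailed stand-in (parameters are assumptions).
    return random.lognormvariate(0, 1.5)


def average_best_of(n, draw):
    # Average, over N_TRIALS repetitions, of the best of n independent draws.
    bests = [max(draw() for _ in range(n)) for _ in range(N_TRIALS)]
    return statistics.fmean(bests)


# How much better is the best of 100 samples than the best of 3?
light_gain = average_best_of(100, draw_light) / average_best_of(3, draw_light)
heavy_gain = average_best_of(100, draw_heavy) / average_best_of(3, draw_heavy)

print(f"light-tailed: best of 100 is {light_gain:.2f}x the best of 3")
print(f"heavy-tailed: best of 100 is {heavy_gain:.2f}x the best of 3")
```

Under these assumptions, looking at 100 apples instead of 3 buys only a few percent, whereas in the heavy-tailed case the same extra effort multiplies the best outcome several times over.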

### What to filter for

Another consequence of the numbers game is that the strategy that you use to filter your samples is very important:

- It’s very important for your filters to be as tightly correlated with what you actually care about as possible, so that you don’t rule candidates out for bad reasons.
- A subtlety here is that the traits that make a candidate a potential outlier are often very different from the traits that would make them “pretty good”.
- So improving your filtering process to produce more “pretty good” candidates won’t necessarily increase the rate of finding outliers, and might even decrease it.
- Because of this, it’s important to filter for “maybe amazing”, not “probably good”.
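
A toy model shows how a filter tuned for "probably good" can lower the outlier rate. This is only my own illustration, not a model from the article. I assume two hypothetical pools of candidates: one with a higher typical value but little spread, and one with a lower typical value but much more spread. Both are log-normal, and the threshold for "amazing" is set arbitrarily at 10.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

N_CANDIDATES = 10_000     # candidates per pool (arbitrary choice)
AMAZING_THRESHOLD = 10.0  # what counts as an outlier (arbitrary choice)

# Pool passed by a "probably good" filter: higher median, low variance.
probably_good = [random.lognormvariate(0.5, 0.3) for _ in range(N_CANDIDATES)]
# Pool passed by a "maybe amazing" filter: lower median, high variance.
maybe_amazing = [random.lognormvariate(0.0, 1.5) for _ in range(N_CANDIDATES)]


def count_amazing(pool):
    # Number of candidates above the outlier threshold.
    return sum(value > AMAZING_THRESHOLD for value in pool)


good_median = statistics.median(probably_good)
amazing_median = statistics.median(maybe_amazing)
good_hits = count_amazing(probably_good)
amazing_hits = count_amazing(maybe_amazing)

print(f"'probably good' pool: median {good_median:.2f}, outliers {good_hits}")
print(f"'maybe amazing' pool: median {amazing_median:.2f}, outliers {amazing_hits}")
```

In this toy setup the "probably good" pool has the better typical candidate but produces essentially no outliers, while the "maybe amazing" pool has a worse typical candidate and produces nearly all of them.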