by Max Daniel & Benjamin Todd
[ETA: See also this summary of our findings + potential lessons by Ben for the 80k blog.]
Some people seem to achieve orders of magnitude more than others in the same job. For instance, among companies funded by Y Combinator the top 0.5% account for more than ⅔ of the total market value; and among successful bestseller authors, the top 1% stay on the New York Times bestseller list more than 25 times longer than the median author in that group.
This is a striking and often unappreciated fact, but raises many questions. How many jobs have these huge differences in achievements? More importantly, why can achievements differ so much, and can we identify future top performers in advance? Are some people much more talented? Have they spent more time practicing key skills? Did they have more supportive environments, or start with more resources? Or did the top performers just get lucky?
More precisely, when recruiting, for instance, we’d want to know the following: when predicting the future performance of different people in a given job, what does the distribution of predicted (‘ex-ante’) performance look like?
This is an important question for EA community building and hiring. For instance, if it’s possible to identify people who will be able to have a particularly large positive impact on the world ahead of time, we’d likely want to take a more targeted approach to outreach.
More concretely, we may be interested in two different ways in which we could encounter large performance differences:
- If we look at a random person, by how much should we expect their performance to differ from the average?
- What share of total output should we expect to come from the small fraction of people we’re most optimistic about (say, the top 1% or top 0.1%) – that is, how heavy-tailed is the distribution of ex-ante performance?
(See this appendix for how these two notions differ from each other.)
Depending on the decision we’re facing we might be more interested in one or the other. Here we mostly focused on the second question, i.e., on how heavy the tails are.
This post contains our findings from a shallow literature review and theoretical arguments. Max was the lead author, building on some initial work by Ben, who also provided several rounds of comments.
You can see a short summary of our findings below.
We expect this post to be useful for:
- (Primarily:) Junior EA researchers who want to do further research in this area. See in particular the section on Further research.
- (Secondarily:) EA decision-makers who want to get a rough sense of what we do and don’t know about predicting performance. See in particular this summary and the bolded parts in our section on Findings.
- We weren’t maximally diligent with double-checking our spreadsheets etc.; if you wanted to rely heavily on a specific number we give, you might want to do additional vetting.
To determine the distribution of predicted performance, we proceed in two steps:
- We start with how ex-post performance is distributed. That is, how much did the performance of different people vary when we look back at completed tasks? On these questions, we’ll review empirical evidence on both typical jobs and expert performance (e.g. research).
- Then we ask how ex-ante performance is distributed. That is, when we employ our best methods to predict future performance by different people, how will these predictions vary? On these questions, we review empirical evidence on measurable factors correlating with performance as well as the implications of theoretical considerations on which kinds of processes will generate different types of distributions.
Here we adopt a very loose conception of performance that includes both short-term (e.g. sales made on one day) and long-term achievements (e.g. citations over a whole career). We also allow for performance metrics to be influenced by things beyond the performer’s control.
Our overall bottom lines are:
- Ex-post performance appears ‘heavy-tailed’ in many relevant domains, but with very large differences in how heavy-tailed: the top 1% account for between 4% and over 80% of the total. For instance, we find ‘heavy-tailed’ distributions (e.g. log-normal, power law) of scientific citations, startup valuations, income, and media sales. By contrast, a large meta-analysis reports ‘thin-tailed’ (Gaussian) distributions for ex-post performance in less complex jobs such as cook or mail carrier: the top 1% account for 3-3.7% of the total. These figures illustrate that the difference between ‘thin-tailed’ and ‘heavy-tailed’ distributions can be modest in the range that matters in practice, while differences between ‘heavy-tailed’ distributions can be massive. (More.)
- Ex-ante performance is heavy-tailed in at least one relevant domain: science. More precisely, future citations as well as awards (e.g. Nobel Prize) are predicted by past citations in a range of disciplines, and in mathematics by scores at the International Maths Olympiad. (More.)
- More broadly, there are known, measurable correlates of performance in many domains (e.g. general mental ability). Several of them appear to remain valid in the tails. (More.)
- However, these correlations by themselves don’t tell us much about the shape of the ex-ante performance distribution: in particular, they would be consistent with either thin-tailed or heavy-tailed ex-ante performance. (More.)
- Uncertainty should move us toward acting as if ex-ante performance were heavy-tailed – because if you have some credence in it being heavy-tailed, it’s heavy-tailed in expectation – but not all the way, and less so the smaller our credence in heavy tails. (More.)
- To infer the shape of the ex-ante performance distribution, it would be more useful to have a mechanistic understanding of the process generating performance, but such fine-grained causal theories of performance are rarely available. (More.)
- Nevertheless, our best guess is that moderately to extremely heavy-tailed ex-ante performance is widespread at least for ‘complex’ and ‘scaleable’ tasks. (I.e. ones where the performance metric can in practice range over many orders of magnitude and isn’t artificially truncated.) This is based on our best guess at the causal processes that generate performance combined with the empirical data we’ve seen. However, we think this is debatable rather than conclusively established by the literature we reviewed. (More.)
- There are several opportunities for valuable further research. (More.)
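To make the point about top-1% shares more concrete, here is a minimal sketch of how the share of the total held by the top 1% varies across heavy-tailed distributions. It uses the standard Lorenz-curve formulas for log-normal and Pareto distributions. The function names and the shape parameters are illustrative assumptions, not values taken from the write-up or from any of the data sets we reviewed.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def lognormal_top_share(p: float, sigma: float) -> float:
    """Share of the total held by the top fraction p of a log-normal.

    Uses the log-normal Lorenz curve L(F) = Phi(Phi^{-1}(F) - sigma),
    so the top-p share is 1 - L(1 - p). The result does not depend on mu.
    """
    z = _STD_NORMAL.inv_cdf(1.0 - p)
    return 1.0 - _STD_NORMAL.cdf(z - sigma)


def pareto_top_share(p: float, alpha: float) -> float:
    """Top-p share for a Pareto distribution with tail index alpha > 1."""
    if alpha <= 1.0:
        raise ValueError("the mean is infinite for alpha <= 1")
    return p ** (1.0 - 1.0 / alpha)


# Hypothetical shape parameters, chosen only to show the spread.
for sigma in (0.5, 1.0, 2.0, 3.0):
    share = lognormal_top_share(0.01, sigma)
    print(f"log-normal, sigma={sigma}: top 1% share = {share:.1%}")
for alpha in (3.0, 2.0, 1.2):
    share = pareto_top_share(0.01, alpha)
    print(f"Pareto, alpha={alpha}: top 1% share = {share:.1%}")
```

With these illustrative parameters, the top 1% share ranges from a few percent to well over half of the total, even though every distribution in the loop would usually be described as ‘heavy-tailed’.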
Overall, doing this investigation probably made us a little less confident that highly heavy-tailed distributions of ex-ante performance are widespread, and we think that common arguments for it are often too quick. That said, we still think there are often large differences in performance (e.g. some software engineers have 10 times the output of others), these are somewhat predictable, and it’s often reasonable to act on the assumption that the ex-ante distribution is heavy-tailed in many relevant domains (broadly, when dealing with something like ‘expert’ performance as opposed to ‘typical’ jobs).
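One way to see why uncertainty pushes toward acting as if performance were heavy-tailed is to treat our belief as a mixture of a thin-tailed and a heavy-tailed distribution, weighted by our credence in each. The sketch below is an illustration under stated assumptions, not a model from the write-up: the normal and log-normal parameters and the credence of 0.2 are hypothetical.

```python
import math

# Hypothetical parameters for the two candidate distributions.
THIN_MEAN, THIN_SD = 1.0, 0.3      # thin-tailed: normal
HEAVY_MU, HEAVY_SIGMA = 0.0, 1.0   # heavy-tailed: log-normal (log-scale parameters)


def normal_sf(x: float) -> float:
    """P(X > x) for the normal; erfc keeps precision far out in the tail."""
    return 0.5 * math.erfc((x - THIN_MEAN) / (THIN_SD * math.sqrt(2.0)))


def lognormal_sf(x: float) -> float:
    """P(X > x) for the log-normal."""
    return 0.5 * math.erfc((math.log(x) - HEAVY_MU) / (HEAVY_SIGMA * math.sqrt(2.0)))


def mixture_sf(x: float, credence_heavy: float) -> float:
    """P(X > x) when we give credence_heavy to the heavy-tailed hypothesis."""
    return credence_heavy * lognormal_sf(x) + (1.0 - credence_heavy) * normal_sf(x)


for x in (2.0, 5.0, 10.0):
    print(
        f"x={x}: thin={normal_sf(x):.3g}, heavy={lognormal_sf(x):.3g}, "
        f"mixture(credence 0.2)={mixture_sf(x, 0.2):.3g}"
    )
```

Far enough out, the mixture's tail probability is essentially the credence times the heavy-tailed tail probability. This matches both halves of the claim: the mixture is heavy-tailed, but its tail scales down in proportion to our credence.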
Some advice for how to work with these concepts in practice:
- In practice, don’t treat ‘heavy-tailed’ as a binary property. Instead, ask how heavy the tails of some quantity of interest are, for instance by identifying the frequency of outliers you’re interested in (e.g. top 1%, top 0.1%, …) and comparing them to the median or looking at their share of the total. 
- Carefully choose the underlying population and the metric for performance, in a way that’s tailored to the purpose of your analysis. In particular, be mindful of whether you’re looking at the full distribution or some tail (e.g. wealth of all citizens vs. wealth of billionaires).
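The first piece of advice can be sketched in code as follows. The helper below is hypothetical (its name and the sample data are assumptions for illustration). For a chosen top fraction it reports the share of the total and the ratio of the top group's mean to the median. It assumes non-negative values with a positive median.

```python
import random
import statistics


def tail_metrics(values, top_fraction=0.01):
    """Return (share of total, mean of top group / median) for the top fraction."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values, reverse=True)
    k = max(1, round(top_fraction * len(ordered)))  # size of the top group
    top = ordered[:k]
    share = sum(top) / sum(ordered)
    ratio = statistics.mean(top) / statistics.median(ordered)
    return share, ratio


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical data: a log-normal sample standing in for some performance metric.
sample = [rng.lognormvariate(0.0, 1.5) for _ in range(100_000)]
for frac in (0.01, 0.001):
    share, ratio = tail_metrics(sample, frac)
    print(f"top {frac:.1%}: share of total = {share:.1%}, mean/median = {ratio:.1f}")
```

Running the same function on the full population and on a tail subset (e.g. only the largest values) gives different answers. This is one reason to choose the underlying population deliberately.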
In an appendix, we provide more detail on some background considerations:
- The conceptual difference between ‘high variance’ and ‘heavy tails’: Neither property implies the other. Both mean that unusually good opportunities are much better than typical ones. However, only heavy tails imply that outliers account for a large share of the total, and that naive extrapolation underestimates the size of future outliers. (More.)
- We can often distinguish heavy-tailed from light-tailed data by eyeballing (e.g. in a log-log plot), but it’s hard to empirically distinguish different heavy-tailed distributions from one another (e.g. log-normal vs. power laws). When extrapolating beyond the range of observed data, we advise proceeding with caution and not taking the specific distributions reported in papers at face value. (More.)
- There is a small number of papers in industrial-organizational psychology on the specific question of whether performance in typical jobs is normally distributed or heavy-tailed. However, we don’t give much weight to these papers because their broad high-level conclusion (“it depends”) is obvious, while we have doubts about the statistical methods behind their more specific claims. (More.)
- We also quote (in more detail than in the main text) the results from a meta-analysis of predictors of salary, promotions, and career satisfaction. (More.)
- We provide a technical discussion of how our metrics for heavy-tailedness are affected by the ‘cutoff’ value at which the tail starts. (More.)
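As a rough numerical stand-in for eyeballing a log-log plot, the sketch below fits the slope of the empirical complementary CDF on log-log axes, once in the body of the data and once in the tail. For a power law the slope stays roughly constant. For a light-tailed distribution such as the exponential it keeps getting steeper. The function name, quantile ranges, and simulated data are all assumptions for illustration. This is not the method used in the papers we discuss, and it requires strictly positive values.

```python
import math
import random


def loglog_ccdf_slope(values, lo_q, hi_q):
    """Least-squares slope of log(empirical CCDF) vs log(x) between two quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    xs, ys = [], []
    for i in range(int(lo_q * n), int(hi_q * n)):
        xs.append(math.log(ordered[i]))
        ys.append(math.log(1.0 - i / n))  # fraction of points >= ordered[i]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(1)  # fixed seed for reproducibility
pareto_sample = [rng.paretovariate(2.0) for _ in range(50_000)]  # power law, alpha = 2
expo_sample = [rng.expovariate(1.0) for _ in range(50_000)]      # light-tailed

pareto_body = loglog_ccdf_slope(pareto_sample, 0.10, 0.50)
pareto_tail = loglog_ccdf_slope(pareto_sample, 0.90, 0.999)
expo_body = loglog_ccdf_slope(expo_sample, 0.10, 0.50)
expo_tail = loglog_ccdf_slope(expo_sample, 0.90, 0.999)

print(f"Pareto:      body slope {pareto_body:.2f}, tail slope {pareto_tail:.2f}")
print(f"Exponential: body slope {expo_body:.2f}, tail slope {expo_tail:.2f}")
```

A roughly constant slope is suggestive of a power law but does not establish one. A log-normal can look nearly straight over a limited range, which is part of why we advise caution when extrapolating.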
Finally, we provide a glossary of the key terms we use, such as performance or heavy-tailed.
For more details, see our full write-up.
We'd like to thank Owen Cotton-Barratt and Denise Melchin for helpful comments on earlier drafts of our write-up, as well as Aaron Gertler for advice on how to best post this piece on the Forum.
Most of Max's work on this project was done while he was part of the Research Scholars Programme (RSP) at FHI, and he's grateful to the RSP management and FHI operations teams for keeping FHI/RSP running, and to Hamish Hobbs and Nora Ammann for support with productivity and accountability.
We're also grateful to Balsa Delibasic for compiling and formatting the reference list.
For performance in “high-complexity” jobs such as attorney or physician, that meta-analysis (Hunter et al. 1990) reports a coefficient of variation that’s about 1.5x as large as for ‘medium-complexity’ jobs. Unfortunately, we can’t calculate how heavy-tailed the performance distribution for high-complexity jobs is: for this we would need to stipulate a particular type of distribution (e.g. normal, log-normal), but Hunter et al. only report that the distribution does not appear to be normal (unlike for the low- and medium-complexity cases).
 Similarly, don’t treat ‘heavy-tailed’ as an asymptotic property – i.e. one that by definition need only hold for values above some arbitrarily large value. Instead, consider the range of values that matter in practice. For instance, a distribution that exhibits heavy tails only for values greater than 10^100 would be heavy-tailed in the asymptotic sense. But for e.g. income in USD values like 10^100 would never show up in practice – if your distribution is supposed to correspond to income in USD you’d only be interested in a much smaller range, say up to 10^10. Note that this advice is in contrast to the standard definition of ‘heavy-tailed’ in mathematical contexts, where it usually is defined as an asymptotic property. Relatedly, a distribution that only takes values in some finite range – e.g. between 0 and 10 billion – is never heavy-tailed in the mathematical-asymptotic sense, but it may well be in the “practical” sense (where you anyway cannot empirically distinguish between a distribution that can take arbitrarily large values and one that is “cut off” beyond some very large maximum).
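To illustrate the last point of this footnote, here is a sketch of a distribution that is bounded, and therefore not heavy-tailed in the asymptotic sense, but still very top-heavy in practice. It computes the top-1% share for a Pareto distribution truncated to the range from 1 to 10 billion. The tail index of 1.1 and the function name are hypothetical choices for illustration.

```python
def truncated_pareto_top_share(p: float, alpha: float, upper: float) -> float:
    """Top-p share of the total for a Pareto(alpha) truncated to [1, upper].

    Density: f(x) proportional to x**(-alpha - 1) on [1, upper], alpha != 1.
    """
    if alpha == 1.0:
        raise ValueError("this closed form assumes alpha != 1")
    # Threshold q with P(X > q) = p under the truncated distribution.
    q = (1.0 - (1.0 - p) * (1.0 - upper ** -alpha)) ** (-1.0 / alpha)
    # Ratio of partial means: integral of x*f(x) over [q, upper] vs. [1, upper].
    return (q ** (1.0 - alpha) - upper ** (1.0 - alpha)) / (1.0 - upper ** (1.0 - alpha))


# Hypothetical example: values between 1 and 10 billion, tail index 1.1.
share = truncated_pareto_top_share(0.01, 1.1, 1e10)
print(f"top 1% share of a Pareto(1.1) truncated at 1e10: {share:.1%}")
```

Under these assumptions the top 1% still account for well over half of the total, only slightly less than for the untruncated distribution.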
"the top 1% stay on the New York Times bestseller list more than 25 times longer than the median author in that group."
FWIW my intuition is not that this author is 25x more talented, but rather that the author and their marketing team are a little bit more talented in a winner-takes-most market.
I wanted to point this out because I regularly see numbers like this used to justify claims that individuals vary significantly in talent or productivity. It's important to keep the business model in mind if you're claiming talent based on sales!
(Research citations are also a winner-takes-most market; people end up citing the same paper even if it's not much better than the next best paper.)
I fully agree with this, and think we essentially say as much in the post/document. This is e.g. why we've raised different explanations in the 2nd paragraph, immediately after referring to the phenomenon to be explained.
Curious if you think we could have done a better job at clarifying that we don't think differences in outcomes can only be explained by differences in talent?
Let me try a different framing and see if that helps. Economic factors mediate how individual task performance translates into firm success. In industries with winner-takes-most effects, small differences in task performance cause huge differences in payoffs. "The Economics of Superstars" is a classic 1981 paper on this. But many industries aren't like that.
Knowing your industry tells you how important it is to hire the right people. If you're hiring someone to write an economics textbook (an example from the "Superstars" paper), you'd better hire the best textbook-writer you can find, because almost no one buys the tenth-best economics textbook. But if you're running a local landscaping company, you don't need the world's best landscaper. And if your industry has incumbent "superstar" firms protected by first-mover advantages, economies of scale, or network effects, it may not matter much who you hire.
So in what kind of "industry" are the EA organizations you want to help with hiring? Is there some factor that multiplies or negates small individual differences in task performance?
My point is more "context matters," even if you're talking about a specific skill like programming, and that the contexts that generated the examples in this post may be meaningfully different from the contexts that EA organizations are working in.
I don't necessarily disagree with anything you and Max have written; it's just a difference of emphasis, especially when it comes to advising people who are making hiring decisions.
I was going to raise a similar comment to what others have said here. I hope this adds something.
I think we need to distinguish quality and quantity of 'output' from 'success' (the outcome of their output). I am deliberately not using 'performance' as it's unclear, in common language, which one of the two it refers to. Various outputs are sometimes very reproducible - anyone can listen to a music track, or read an academic paper. There are often huge rewards to being the best vs second best - eg winning in sports. And sometimes success generates further success (the 'Matthew effect') - more people want to work with you, etc. Hence, I don't find it at all weird to think that small differences in outputs, as measured on some cardinal scale, sometimes generate huge differences in outcomes.
I'm not sure exactly what follows from this. I'm a bit worried you're concentrating on the wrong metric - success - when it's outputs that are more important. Can you explain why you focus on outcomes?
Let's say you're thinking about funding research. How much does it matter to fund the best person? I mean, they will get most of the credit, but if you fund the less-than-best, that person's work is prob...
Data from the IAP indicates that they can identify the top few percent of successful inventions with pretty good accuracy. (Where "success" is a binary variable – not sure how they perform if you measure financial returns.)
On your main point, this was the kind of thing we were trying to make clearer, so it's disappointing that hasn't come through.
Just on the particular VC example:
Most VCs only pick from the top 1-5% of startups. E.g. YC's acceptance rate is 1%, and very few startups they reject make it to series A. More data on VC acceptance rates here: https://80000hours.org/2014/06/the-payoff-and-probability-of-obtaining-venture-capital/
So, while it's mostly luck once you get down to the top 1-5%, I think there are a lot of predictors before that.
Also see more on predictors of startup performance here: https://80000hours.org/2012/02/entrepreneurship-a-game-of-poker-not-roulette/
The Canadian Inventors Assistance Program provides a rating of how good an invention is to inventors for a nominal fee. A large fraction of the people who get a bad rating try to make a company anyway, so we can judge the accuracy of their evaluations.
55% of the inventions which they give the highest rating to achieve commercial success, compared to 0% for the lowest rating.
FWIW I think it's the authors' job to anticipate how their audience is going to engage with their writing, where they're coming from etc. You were not the only one who reacted by pushing back against our framing, as is evident e.g. from Khorton's much upvoted comment.
So no matter what we tried to convey, and what info is in the post or document if one reads closely enough, I think this primarily means that I (as main author of the wording in the post) could have done a better job, not that you or anyone else is being obtuse.
YC having a low acceptance rate could mean they are highly confident in their ability to predict ex ante outcomes. It could also mean that they get a lot of unserious applications. Essays such as this one by Paul Graham bemoaning the difficulty of predicting ex ante outcomes make me think it is more the latter. ("it's mostly luck once you get down to the top 1-5%" makes it sound to me like ultra-successful startups should have elite founders, but my take on Graham's essay is that ultra-successful startups tend to be unusual, often in a way that makes them look non-elite according to traditional metrics -- I tend to suspect this is true of exceptionally innovative people more generally)
Fwiw, I wrote a post explaining such dynamics a few years ago.
I think you're right that complexity at the very least isn't the only cause/explanation for these differences.
E.g. Aguinis et al. (2016) find that, based on an analysis of a very large number of productivity data sets, the following properties make a heavy-tailed output distribution more likely:
As we explain in the paper, I have some open questions about the statistical approach in that paper. So I currently don't take their analysis to be that much evidence that this is in fact right. However, they also sound right to me just based on priors and based on theoretical considerations (such as the ones in our section on why we expect heavy-tailed ex-ante performance to be widespread).
In the part you quoted, I wrote "less complex jobs" because the data I'm reporting is from a paper that ...
I tried to sum up the key messages in plain language in a Twitter thread, in case that helps clarify.
Great post! Seems like the predictability question is important given how much power laws surface in discussion of EA stuff.
I want to argue that things which look like predicting future citations from past citations are at least partially "uninteresting" in their predictability, in a certain important sense.
(I think this is related to other comments, and have not read your Google Doc, so apologies if I'm restating. But I think it's worth drawing out this distinction.)
In many cases I can think of wanting good ex-ante prediction of heavy-tailed outcomes, I want to make these predictions about a collection which is in an "early stage". For example, I might want to predict which EAs will be successful academics, or which of 10 startups' seed rounds I should invest in.
Having better predictive performance at earlier stages gives you a massive multiplier in heavy-tailed domains: investing in a Series C is dramatically more expensive than a seed investment.
Given this, I would really love to have a function which takes in the intrinsic characteristi...
Thanks! I agree with a lot of this.
I think the case of citations / scientific success is a bit subtle:
- My guess is that the preferential attachment story applies most straightforwardly at the level of papers rather than scientists. E.g. I would expect that scientists who want to cite something on topic X will cite the most-cited paper on X rather than first looking for papers on X and then looking up the total citations of their authors.
- I think the Sinatra et al. (2016) findings which we discuss in our relevant section push at least slightly against a story that says it's all just about "who was first in some niche". In particular, if preferential attachment at the level of scientists was a key driver, then I would expect authors who get lucky early in their career - i.e. publish a much-cited paper early - to get more total citations. In particular, citations to future papers by a fixed scientist should depend on citations to past papers by the same scientist. But that is not what Sinatra et al. find - they instead find that within the career of a fixed scientist the per-paper citations seem entirely random.
- Instead their model uses citations to estimate an 'intrinsic character
Hi Max and Ben, a few related thoughts below. Many of these are mentioned in various places in the doc, so seem to have been understood, but nonetheless have implications for your summary and qualitative commentary, which I sometimes think misses the mark.
- Often, you can't derive this directly from the distribution's mathematical type. In particular, you cannot derive it from whether a distribution is heavy-tailed in the mathematical sense.
- Log-normal distributions are particularly common and are a particular offender here, because they tend to occur whenever lots of independent factors are multiplied together. But here is the approx
So taking a step back for a second, I think the primary point of collaborative written or spoken communication is to take the picture or conceptual map in my head and put it in your head, as accurately as possible. Use of any terms should, in my view, be assessed against whether those terms are likely to create the right picture in a reader's or listener's head. I appreciate this is a somewhat extreme position.
If every time you use the term heavy-tailed (and it's used a lot - a quick CTRL + F tells me it's in the OP 25 times) I have to guess from context whether you mean the mathematical or commonsense definitions, it's more difficult to parse what you actually mean in any given sentence. If someone is reading and doesn't even know that those definitions substantially differ, they'll probably come away with bad conclusions.
This isn't a hypothetical corner case - I keep seeing people come to bad (or at least unsupported) conclusions in exactly this way, while thinking that their reasoning is mathematically sound and thus nigh-incontrovertible. To quote myself above:...
Thanks for this. I do think there's a bit of sloppiness in EA discussions about heavy-tailed distributions in general, and the specific question of differences in ex ante predictable job performance in particular. So it's really good to see clearer work/thinking about this.
I have two high-level operationalization concerns here:
- Whether performance is ex ante predictable seems to be a larger function of our predictive ability than of the world. As an extreme example of what I mean, if you take our world on November 7, 2016 and run high-fidelity simulations 1,000,000 times, I expect 1 million/1 million of those simulations to end up with Donald Trump winning the 2016 US presidential election. Similarly, with perfect predictive ability, I think the correlation between ex ante predicted work performance and ex post actual performance approaches 1 (up to quantum). This may seem like a minor technical point, but I think it's important to be careful of the reasoning here when we ask whether claims are expected to generalize from domains with large and obvious track records and proxies (eg past paper citations to future paper citations) or even domains where th
Thanks for these points!
My super quick take is that 1. definitely sounds right and important to me, and I think it would have been good if we had discussed this more in the doc.
I think 2. points to the super important question (which I think we've mentioned somewhere under Further research) of how typical performance/output metrics relate to what we ultimately care about in EA contexts, i.e. positive impact on well-being. At first glance I'd guess that sometimes these metrics 'overstate' heavy-tailedness of EA impact (for e.g. the reasons you mentioned), but sometimes they might also 'understate' them. For instance, the metrics might not 'internalize' all the effects on the world (e.g. 'field building' effects from early-stage efforts), or for some EA situations the 'market' may be even more winner-takes-most than usual (e.g. for some AI alignment efforts it only matters if you can influence DeepMind), or the 'production function' might have higher returns to talent than usual (e.g. perhaps founding a nonprofit or contributing valuable research to preparadigmatic fields is "extra hard" in a way not captured by standard metrics when compared to easier cases).
Nice, I think developing a deeper understanding here seems pretty useful, especially as I don't think the EA community can just copy the best hiring practices of existing institutions due to a lack of shared goals (e.g. most big tech firms) or suboptimal hiring practices (e.g. non-profits & most? places in academia).
I'm really interested in the relation between the increasing number of AI researchers and the associated rate of new ideas in AI. I'm not really sure how to think about this yet and would be interested in your (or anybody's) thoughts. S...
C-dawg in the house!
I have concerns about how this post and research is framed and motivated.
This is because its methods imply a certain worldview and it is trying to help hiring or recruiting decisions in EA orgs, so we should be cautious.
Like, loosely speaking, I think “star systems” is a useful concept / counterexample to this post.
In this view of the world, someone’s in a “star system” if a small number of people get all the rewards, but not from what we would comfortably call productivity or performance.
So, like, for intuition, most Olympic athletes train near poverty but a small number manage to “get on a cereal box” and become a millionaire. They have higher ability, but we wouldn’t say that Gold medal winners are 1000x more productive than someone they beat by 0.05 seconds.
You might view “star systems” negatively because they are unfair. Yes, and in addition to inequality, they may have very negative effects: they promote echo chambers in R1 research, and also support abuse like that committed by Harvey Weinstein.
However, “star systems” might be natural and optimal given how organizations and projects need to be...
For a different take on a very similar topic, check this discussion between me and Ben Pace (my reasoning was based on the same Sinatra paper).
Thank you for writing this!
While this is not the high note of the paper, I read with quite some interest your notes about heavy tailed distributions.
I think that the concept of heavy tailed distributions underpins a lot of considerations in EA, yet as you remark many people (including me) are still quite confused about how to formalize the concept effectively, and how often it applies in real life.
Glad to see more thinking going into this!
[The following is a lightly edited response I gave in an email conversation.]
My overall intuition is that the full picture we paint suggests personal fit, and especially being in the tail of personal fit, is more important than one might naively think (at least in domains w...
Minor typo: "it’s often to reasonable to act on the assumption" probably should be "it’s often reasonable to act on the assumption"
Surprised to see nothing (did I overlook?) about: The People vs. The Project/Job: The title, and the lead sentence, suggest the work focuses essentially on people's performance, but already in the motivational examples...
Basic statistics question: the GMA predictors research seems to mostly be using the Pearson correlation coefficient, which I understand to measure linear correlation between variables.
But a linear correlation would imply that billionaires have an IQ of 10,000 or something which is clearly implausible. Are these correlations actually measuring something which could plausibly be linearly related (e.g. Z score for both IQ and income)?
I read through a few of the papers cited and didn't see any mention of this. I expect this to be especially significant at the tails, which is what you are looking at here.
It might be worth discussing the larger question which is being asked. For example, your IMO paper seems to be work by researchers who advocate looser immigration policies for talented youth who want to move to developed countries. The larger question is "What is the expected scientific impact of letting a marginal IMO medalist type person from Honduras immigrate to the US?"
These quotes from great mathematicians all downplay the importance of math competitions. I think this is partially because the larger question they're interested in is different, som...