I've done nothing to test these heuristics and have no empirical evidence for how well they work for forecasting replications or anything else. I’m going to write them anyway. The heuristics I’m listing are roughly in order of how important I think they are. My training is as an economist (although I have substantial exposure to political science) and lots of this is going to be written from an econometrics perspective.

**How much does the result rely on experimental evidence vs causal inference from observational evidence?**

I basically believe without question every result that mainstream chemists and condensed matter physicists say is true. I think a big part of this is that in these fields it’s really easy to test hypotheses experimentally, and to distinguish between competing hypotheses really precisely. This seems great.

On the other hand, when relying on observational evidence to get reliable causal inference you have to control for __confounders__ while not controlling for __colliders__. This is really hard! It generally requires finding a natural experiment that introduces randomisation or having *very* good reason to think that you’ve controlled for all confounders.

We also make quite big updates on which methods effectively do this. For instance, until last year we thought that __two-way fixed effects__ did a pretty good job of this before we realised that actually __heterogeneous treatment effects are a really big deal__ for two-way fixed effects estimators.

What’s more, in areas that use primarily observational data there’s a really big gap between fields in how often papers even try to use causal inference methods and how hard they work to show that their __identifying assumptions__ hold. I generally think that modern microeconomics papers are the best on this and nutrition science the worst.

I’m slightly oversimplifying by using a strict division between experimental and observational data. All data is observational and what matters is how credibly you think you’ve observed __what would happen counterfactually without some change__. But in practice, this is much easier in settings where we think that we can change the thing we’re interested in without other things changing.

There are some difficult questions around scientific realism here that I’m going to ignore because I’m mostly interested in how much we can trust a result in typical use cases. The notable area where I think this actually bites is thinking about the implications of basic physics for __longtermism__, where it does seem like our understanding of basic physics actually changes quite a lot over time, with important implications for questions like how __large we expect the future to be__.

**Are there practitioners using this result, and how strong is the selection pressure on the result?**

If a result is being used a lot, and being wrong would have easily noticeable and punishable consequences, I’m much more likely to believe that the result is at least roughly right.

For instance, this means I’m actually really confident that important results in __auction design hold__. Auction design is used all the time by both __government__ and __private sector actors__ in ways that earn these actors billions of dollars and, in the private sector case at least, are iterated on regularly.

Auction theory is an interesting case because it comes out of pretty abstract microeconomic theory and wasn’t developed really based on laboratory experiments, but I’m still pretty confident in it because of how widely it’s used by practitioners and is subject to strong selection pressure.

On the other hand, I’m much less confident in lots of political science research. It seems like places like hedge funds don’t use it that much to predict market outcomes, it doesn’t seem to be used by governments that much, and it’s really hard to know how counterfactually important, say, World Bank programs that use political science were.

**How large is the literature that supports the result, and how many techniques have been used to support it?**

This view actually does have some empirical support. There’s a nice paper in which a load of different researchers were given the same (I think simulated) data, and it looked at what each researcher found. There was quite a lot of variation between researchers, driven by things like their coding choices and which statistical techniques they used, but when there was a real effect the average paper found an effect of the right sign and roughly the right magnitude, and when there was no real effect the average researcher found roughly no effect. I’m afraid I can’t find the paper, and I can’t be bothered to link to the Noah Smith or Matt Clancy blog posts on it.

Mostly though I use this heuristic because it seems pretty sensible.

**How good is the **__external validity__** of the result?**

External validity is how likely it is that a result generalises from whatever the study setting was to the setting in which the result is used.

I think this is a really big deal for lots of __RCT-based__ __development economics__. We just see, really quite often, that results that seem to consistently hold when tested with RCTs don’t hold when scaled up.

I’m more sceptical of the external validity of a result the more intensive the intervention is and so the more buy-in and effort is needed from participants and researchers. Seems pretty likely that when the intervention is used it won’t have as much effort put into it. I’m particularly sceptical if the intervention is complex or precise.

**Results given **__statistical power__

Statistical power is the probability that a test detects an effect, given the true effect size and the sample size. If the statistical power of a test is low but a significant result is found, it’s likely that the __researcher just got lucky and the true effect size is much smaller and/or of the opposite sign__.

The intuition for this is that if a statistical test is underpowered - say, for this example, the power is under 50% - then it’s unlikely that a statistically significant effect will be found even when a real effect exists.

If a statistically significant effect is found anyway, then something unusual must have happened, like the specific sample that was used stochastically having a really large effect size. To see why, note that with a small sample size your estimate of population statistics has high variance, and if you’re very unwilling to accept mistakes in the direction of finding effects that aren’t there, you need a really large estimated effect size to be confident that there’s any effect at all. That estimate has to be larger than the true mean effect size because, by assumption, your test is unlikely to detect an effect given the true distribution of the variable in question - that is what it means for a test to have low power.

More sinisterly, it could also imply some selection effect for which results are observed, like publication bias or the methods the researchers used.

I want to caveat this section by saying that I don’t have a very good intuition for power calculations and how much they actually affect how likely results are to replicate.
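That said, the exaggeration point can be made concrete with a quick simulation (my own illustration, not from anything cited here; the specific numbers - a true effect of 0.2 standard deviations, samples of 25 - are arbitrary choices): draw many small-sample studies, keep only the statistically significant ones, and compare their average estimate to the truth.

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.2   # modest true effect, in standard-deviation units
n = 25              # small sample -> an underpowered test
n_sims = 20_000
z_crit = 1.96       # two-sided 5% significance threshold

# Each simulated study estimates the mean of n draws from N(true_effect, 1).
estimates = rng.normal(true_effect, 1.0, size=(n_sims, n)).mean(axis=1)
se = 1.0 / np.sqrt(n)
significant = np.abs(estimates / se) > z_crit

power = significant.mean()
exaggeration = estimates[significant].mean() / true_effect

print(f"power ≈ {power:.2f}")  # well under 50% for these numbers
print(f"significant estimates average ≈ {exaggeration:.1f}x the true effect")
```

In runs like this the power comes out around 0.17, and the studies that clear the significance bar overstate the true effect by very roughly two-and-a-half-fold - what Gelman calls a Type M error.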

**How strong is the **__social desirability bias__** at play?**

This seems somewhat important, but I think it’s often overplayed in the EA and rationality communities. In practice, it does mean that I’m less likely to see papers that find, say, that child poverty has no effect on future outcomes. My vibe is that psychology seems __particularly bad__ for this, for some reason?

But also I see papers that find socially undesirable results all the time!

For instance, __this paper__ finds negative effects of democracy on state capacity for places with middling levels of democracy, __this paper__ finds higher levels of interest in reading amongst preschool-age girls, and __this paper__ finds no association between youth unemployment and crime. It’s really easy to find these papers! You just search for them on Google Scholar.

**Have there been formal tests of **__publication bias__**?**

We can test whether the distribution of published results on a specific question looks the way it should if publication were independent of the sign and magnitude of results. I’m a lot less confident in a field if such tests consistently find publication bias.
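As one illustration of what such a test looks like (my own sketch, not something from the post), Egger’s regression test checks whether small, noisy studies report systematically different effects than precise ones. Below, a simulated literature has no true effect at all, but only positive significant results get “published”, and the Egger intercept ends up far from zero:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a literature where the true effect is zero, but only
# positive, statistically significant results get published.
n_studies = 20_000
ses = 1.0 / np.sqrt(rng.integers(10, 400, n_studies))  # per-study standard errors
effects = rng.normal(0.0, ses)                          # true effect is zero
published = effects / ses > 1.96                        # one-sided publication filter

# Egger's test: regress each published study's z-statistic on its
# precision (1/se). Without publication bias the intercept is near zero.
z = effects[published] / ses[published]
precision = 1.0 / ses[published]
slope, intercept = np.polyfit(precision, z, 1)

print(f"Egger intercept ≈ {intercept:.2f}")  # far from zero -> bias detected
```

Real implementations weight the regression and compute a proper standard error for the intercept, but the idea is the same: selection on significance distorts the relationship between effect sizes and study precision in a detectable way.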

Thanks for this, a really nice write up. I like these heuristics, and will try to apply them.

On the intuition behind how to interpret statistical power, doesn't a bayesian perspective help here?

If someone was conducting a statistical test to decide between two possibilities, and you knew nothing about their results except: (i) their calculated statistical power was B (ii) the statistical significance threshold they adopted was p and (iii) that they ultimately reported a positive result using that threshold, then how should you update on that, without knowing any more details about their data?

I think not having access to the data or reported effect sizes actually simplifies things a lot, and the Bayes factor you should update your priors by is just B/p (prob of observing this outcome if an effect / prob of observing this outcome if no effect). So if the test had half the power, the update to your prior odds of an effect should be half as big?
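A minimal sketch of the update this comment describes, with illustrative numbers only (the 1:4 prior odds, power values, and alpha are made up for the example):

```python
def posterior_odds(prior_odds: float, power: float, alpha: float) -> float:
    """Odds that a real effect exists after seeing one significant result,
    knowing only the test's power and its significance threshold.

    P(significant | real effect) = power
    P(significant | no effect)   = alpha
    so the Bayes factor is power / alpha.
    """
    return prior_odds * (power / alpha)

# Same 1:4 prior odds of a real effect, both tests run at alpha = 0.05:
well_powered = posterior_odds(0.25, power=0.80, alpha=0.05)  # Bayes factor 16
underpowered = posterior_odds(0.25, power=0.40, alpha=0.05)  # Bayes factor 8

print(well_powered, underpowered)
```

Halving the power halves the Bayes factor, so it halves the update to the prior odds, exactly as the comment says.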

Yeah, I think a Bayesian perspective is really helpful here and this reply seems right.

Just highlighting this paragraph because I think it's extremely important. As a policymaker, the vast majority of research I see from think tanks etc. includes poorly justified assumptions. It's become one of the first things I look for now, in part because it's an easy prompt for me to spot a wide range of issues.

For an interesting take on the (important) argument around statistical power:

Gelman's The “What does not kill my statistical significance makes it stronger” fallacy:

https://statmodeling.stat.columbia.edu/2017/02/06/not-kill-statistical-significance-makes-stronger-fallacy/

Nice post, Nathan! Relatedly, I recommend Why randomized controlled trials matter and the procedures that strengthen them by Our World in Data.

Executive summary: The author describes several heuristics for evaluating the trustworthiness of scientific results, focusing on the strength of evidence, scrutiny from practitioners, consensus in the literature, social biases, and tests for publication bias.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.