
I've done nothing to test these heuristics and have no empirical evidence for how well they work for forecasting replications or anything else. I’m going to write them anyway. The heuristics I’m listing are roughly in order of how important I think they are. My training is as an economist (although I have substantial exposure to political science) and lots of this is going to be written from an econometrics perspective.

How much does the result rely on experimental evidence vs causal inference from observational evidence? 

I basically believe without question every result that mainstream chemists and condensed matter physicists say is true. I think a big part of this is that in these fields it’s really easy to test hypotheses experimentally, and to distinguish between competing hypotheses really precisely. This seems great.

On the other hand, when relying on observational evidence to get reliable causal inference you have to control for confounders while not controlling for colliders. This is really hard! It generally requires finding a natural experiment that introduces randomisation or having very good reason to think that you’ve controlled for all confounders. 
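To make the collider point concrete, here is a toy simulation sketch (made-up variables, not from any particular paper): the treatment and the outcome are truly unrelated, but “controlling for” a variable they both cause manufactures an association between them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Treatment and outcome are genuinely unrelated in this toy world.
treatment = rng.normal(size=n)
outcome = rng.normal(size=n)

# A collider is a variable caused by both of them.
collider = treatment + outcome + rng.normal(size=n)

# Unconditional correlation: close to zero, as it should be.
print(round(np.corrcoef(treatment, outcome)[0, 1], 3))

# "Controlling for" the collider (here crudely, by restricting the
# sample to observations where it takes similar values) creates a
# spurious negative association between treatment and outcome.
mask = np.abs(collider) < 0.5
print(round(np.corrcoef(treatment[mask], outcome[mask])[0, 1], 3))
```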

We also make quite big updates on which methods effectively do this. For instance, until last year we thought that two-way fixed effects estimators did a pretty good job of this, before we realised that heterogeneous treatment effects are actually a really big deal for them.

What’s more, in areas that use primarily observational data there’s a really big gap between fields in how often papers even try to use causal inference methods and how hard they work to show that their identifying assumptions hold. I generally think that modern microeconomics papers are the best on this and nutrition science the worst. 

I’m slightly oversimplifying by using a strict division between experimental and observational data. All data is observational and what matters is how credibly you think you’ve observed what would happen counterfactually without some change. But in practice, this is much easier in settings where we think that we can change the thing we’re interested in without other things changing. 

There are some difficult questions around scientific realism here that I’m going to ignore because I’m mostly interested in how much we can trust a result in typical use cases. The notable area where I think this actually bites is thinking about the implications of basic physics for longtermism, where it does seem like our understanding of basic physics changes quite a lot over time, with important implications for questions like how large we expect the future to be.

Are there practitioners using this result and how strong is the selection pressure on the result 

If a result is being used a lot, and there would be easily noticeable and punishable consequences if it were wrong, I’m way more likely to believe that the result is at least roughly right.

For instance, this means I’m actually really confident that important results in auction design hold. Auction design is used all the time by both government and private sector actors in ways that earn these actors billions of dollars and, in the private sector case at least, are iterated on regularly. 

Auction theory is an interesting case because it comes out of pretty abstract microeconomic theory and wasn’t really developed from laboratory experiments, but I’m still pretty confident in it because of how widely it’s used by practitioners and how strong the selection pressure on it is.

On the other hand, I’m much less confident in lots of political science research. It seems like places like hedge funds don’t use it that much to predict market outcomes, it doesn’t seem to be used by governments that much, and it’s really hard to know how counterfactually important, say, World Bank programs that use political science were.  

How large is the literature that supports the result and how many techniques have been used to support it

This view actually does have some empirical support. There’s a nice paper where a load of different researchers were given the same (I think simulated) data, and the authors looked at what results each researcher found. There was quite a lot of variation depending on things like coding choices and which statistical techniques were used, but when there was a real effect the average researcher found an effect of the right sign and roughly the right magnitude, and when there was no real effect the average researcher found roughly no effect. I’m afraid I can’t find the paper, and I can’t be bothered to track down the Noah Smith or Matt Clancy blog posts on it.

Mostly though I use this heuristic because it seems pretty sensible. 

External validity

External validity is how likely it is that a result generalises from whatever the study setting was to the setting in which the result is used. 

I think this is a really big deal for lots of RCT-based development economics. We just see, really quite often, that results that seem to consistently hold when tested with RCTs don’t hold when scaled up. 

I’m more sceptical of the external validity of a result the more intensive the intervention is, and so the more buy-in and effort is needed from participants and researchers. It seems pretty likely that when the intervention is used in practice it won’t have as much effort put into it. I’m particularly sceptical if the intervention is complex or precise.

Results given statistical power 

Statistical power is the probability that a test detects a statistically significant effect, given the true effect size and the sample size. If the statistical power of a test is low but a significant result is found, it’s likely that the researcher just got lucky and that the true effect is much smaller and/or of the opposite sign.

The intuition for this is that if a statistical test is underpowered - say for this example the power is under 50% - then it’s unlikely that a statistically significant effect is found. 

If a statistically significant effect is found then something unusual must have happened, like the specific sample that was used stochastically having really large effect sizes. The intuition for this is that if you have a small sample size (so your estimate of population statistics has a high variance) and are very unwilling to accept mistakes in the direction of finding effects that aren’t there, you need a really large estimated effect size to be confident that there’s any effect at all! That estimate has to be larger than the true mean effect size because, by assumption, your test is unlikely to detect an effect given the true distribution of the variable in question - this is what it means for a test to have low power.
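To illustrate with made-up numbers (this is essentially what Gelman calls Type M and Type S errors): simulate lots of small studies of a real but modest effect and look only at the ones that cross the significance threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect, sd, n, sims = 0.2, 1.0, 30, 100_000
se = sd / np.sqrt(n)

# Simulate many small studies of a real but modest effect.
estimates = rng.normal(true_effect, se, size=sims)
significant = np.abs(estimates) > 1.96 * se

print(f"power ~ {significant.mean():.2f}")

# Conditioning on significance: the surviving estimates overstate
# the true effect on average, and a few even have the wrong sign.
sig = estimates[significant]
print(f"mean significant estimate ~ {sig.mean():.2f} (true effect is {true_effect})")
print(f"share of significant estimates with the wrong sign ~ {(sig < 0).mean():.3f}")
```

With these particular numbers the power is only around 20%, and the studies that do reach significance overstate the true effect by roughly a factor of two.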

More sinisterly, it could also imply some selection effect for which results are observed, like publication bias or the methods the researchers used. 

I want to caveat this section by saying that I don’t have a very good intuition for power calculations and how much they actually affect how likely results are to replicate. 

How strong is the social desirability bias at play 

This seems somewhat important, but I think it’s often overplayed in the EA and rationality communities. In practice it does mean that I’m less likely to see papers that find, say, that child poverty has no effect on future outcomes. My vibe is that psychology seems particularly bad for this for some reason?

But also I see papers that find socially undesirable results all the time! 

For instance, this paper finds negative effects of democracy on state capacity for places with middling levels of democracy, this paper finds higher levels of interest in reading amongst preschool-age girls, and this paper finds no association between youth unemployment and crime. It’s really easy to find these papers! You just search for them on Google Scholar. 

Have there been formal tests of publication bias 

We can test whether the distribution of results on a specific question looks like it should if publication were independent of the sign and magnitude of the results. I’m a lot less confident in a field if these tests consistently find publication bias.
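One common formal test is Egger’s regression for funnel-plot asymmetry. Here is a rough sketch, assuming you have already collected the per-study estimates and their standard errors (this is just one of several such tests, and not necessarily the best one):

```python
import numpy as np
import statsmodels.api as sm

def egger_test(estimates, standard_errors):
    """Egger's regression test for funnel-plot asymmetry.

    Regress each study's standardised effect (estimate / SE) on its
    precision (1 / SE). An intercept far from zero suggests
    small-study effects, one common symptom of publication bias.
    """
    estimates = np.asarray(estimates, dtype=float)
    standard_errors = np.asarray(standard_errors, dtype=float)
    z = estimates / standard_errors
    precision = 1.0 / standard_errors
    fit = sm.OLS(z, sm.add_constant(precision)).fit()
    return fit.params[0], fit.pvalues[0]  # intercept and its p-value
```

A non-zero intercept is consistent with (though not proof of) publication bias; funnel asymmetry can also come from genuine heterogeneity between small and large studies.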

Comments



Thanks for this, a really nice write up. I like these heuristics, and will try to apply them.

On the intuition behind how to interpret statistical power, doesn't a Bayesian perspective help here?

If someone was conducting a statistical test to decide between two possibilities, and you knew nothing about their results except: (i) their calculated statistical power was B (ii) the statistical significance threshold they adopted was p and (iii) that they ultimately reported a positive result using that threshold, then how should you update on that, without knowing any more details about their data?

I think not having access to the data or reported effect sizes actually simplifies things a lot, and the Bayes factor you should update your priors by is just B/p (prob of observing this outcome if an effect / prob of observing this outcome if no effect). So if the test had half the power, the update to your prior odds of an effect should be half as big?
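A quick worked illustration of that arithmetic (made-up numbers): halving the power halves the Bayes factor, and so halves the update to the prior odds.

```python
# Illustrative numbers only: prior odds on a real effect of 1:4 and a
# significance threshold of p = 0.05, given a reported positive result.
prior_odds = 0.25
p_threshold = 0.05

for power in (0.8, 0.4):
    bayes_factor = power / p_threshold   # P(sig | effect) / P(sig | no effect)
    posterior_odds = prior_odds * bayes_factor
    print(f"power={power}: Bayes factor={bayes_factor:.0f}, posterior odds={posterior_odds:.1f}")
```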

[anonymous]

Yeah, I think a Bayesian perspective is really helpful here and this reply seems right.  

What’s more, in areas that use primarily observational data there’s a really big gap between fields in how often papers even try to use causal inference methods and how hard they work to show that their identifying assumptions hold.

Just highlighting this paragraph because I think it's extremely important. As a policymaker, the vast majority of the research I see from think tanks etc. includes poorly justified assumptions. It's become one of the first things I look for now, in part because it's an easy prompt for me to spot a wide range of issues.

For an interesting take on the (important) argument around statistical power:
Gelman's The “What does not kill my statistical significance makes it stronger” fallacy:
https://statmodeling.stat.columbia.edu/2017/02/06/not-kill-statistical-significance-makes-stronger-fallacy/

Executive summary: The author describes several heuristics for evaluating the trustworthiness of scientific results, focusing on the strength of evidence, scrutiny from practitioners, consensus in the literature, social biases, and tests for publication bias.

Key points:

  1. Results based on experimental evidence are more trustworthy than those relying solely on observational data and causal inference.
  2. Results used extensively by practitioners under high scrutiny warrant more trust.
  3. A large, diverse academic literature supporting a finding instills confidence.
  4. Consider the study's external validity and statistical power.
  5. Account for potential social desirability biases.
  6. Formal tests showing no publication bias increase trustworthiness.

 

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.
