Tyler Cowen posted a link to this paper (PDF), examining how effective programs remain when they are transported to new contexts or scaled up by governments.
Two key quotes:
The program implementer is the main source of heterogeneity in results, with government-implemented programs faring worse than and being poorly predicted by the smaller studies typically implemented by academic/NGO research teams, even controlling for sample size.
The average intervention-outcome combination is comprised of 37% positive, significant studies; 58% insignificant studies; and 5% negative, significant studies. If a particular result is positive and significant, there is a 61% chance the next result will be insignificant and a 7% chance the next result will be significant and negative, leaving only about a 32% chance the next result will again be positive and significant.
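A quick way to see what those conditional probabilities mean: if study results were independent draws, the distribution of the next result wouldn't depend on the current one at all, so it would just equal the base rates. Here's a minimal Python sketch comparing the two (all numbers come straight from the quote above):

```python
# Base rates across studies of the average intervention-outcome
# combination (numbers from the quote above).
base_rates = {
    "positive, significant": 0.37,
    "insignificant": 0.58,
    "negative, significant": 0.05,
}

# Distribution of the *next* result, conditional on the current result
# being positive and significant (also from the quote above).
after_positive = {
    "positive, significant": 0.32,
    "insignificant": 0.61,
    "negative, significant": 0.07,
}

for outcome in base_rates:
    print(f"{outcome:>22}: base rate {base_rates[outcome]:.0%}, "
          f"after a positive hit {after_positive[outcome]:.0%}")

# Both columns are nearly identical: a positive, significant result
# barely shifts the odds for the next study (37% -> 32% for another
# positive hit). Knowing one result tells you little about the next.
```

The striking part is how close the conditional probabilities sit to the base rates: one positive, significant result is almost no evidence that the next study will find the same thing.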
What a great study! I read through it, and if I understand the PRESS statistic correctly, the average effect size from a meta-analysis of impact evaluations predicts about 33% of the variation in the next study run on the topic. That's not as much as I expected, but given the huge variation between countries and charities, it makes sense.
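For anyone who hasn't met it: PRESS is the predicted residual sum of squares, a leave-one-out measure of out-of-sample fit, and 1 − PRESS/SST gives a "predicted R²", i.e. the share of variation in a held-out study that the model accounts for. Here's a toy sketch of that idea (my own illustration on made-up data, not the paper's actual model or code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake meta-analysis: 40 studies whose true effect depends on one
# context variable, plus noise. All numbers are made up; the
# signal-to-noise ratio is just chosen to be modest.
context = rng.uniform(0.0, 1.0, size=40)
effects = 0.3 * context + rng.normal(0.0, 0.12, size=40)

# PRESS: refit the model with each study held out, predict the held-out
# study's effect, and accumulate the squared prediction errors.
press = 0.0
for i in range(len(effects)):
    keep = np.arange(len(effects)) != i
    slope, intercept = np.polyfit(context[keep], effects[keep], deg=1)
    press += (effects[i] - (slope * context[i] + intercept)) ** 2

# Predicted R^2: the fraction of variation in a held-out study that the
# fitted model predicts. The paper's analogous figure is about 33%.
sst = np.sum((effects - effects.mean()) ** 2)
print(f"PRESS = {press:.3f}, predicted R^2 = {1 - press / sst:.2f}")
```

The point of using PRESS rather than ordinary R² is that each study is judged by a model that never saw it, which is exactly the "will this result transport?" question.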
My takeaway is that since studies aren't as generalizable as I would like, I should put much more weight on a study done on the specific charity I'm supporting. That makes GiveDirectly a much stronger choice, because there is a study of GiveDirectly itself.
Another takeaway: studies done on studies are EA badass.