Tyler Cowen posted a link to this paper (PDF), which examines how well programs hold up when transported to new contexts or scaled up by governments.
Two key quotes:
"The program implementer is the main source of heterogeneity in results,
with government-implemented programs faring worse than and being poorly predicted
by the smaller studies typically implemented by academic/NGO research teams, even
controlling for sample size."
"The average intervention-outcome combination is comprised 37% of positive, significant studies;
58% of insignificant studies; and 5% of negative, significant studies. If a particular result is positive
and significant, there is a 61% chance the next result will be insignificant and a 7% chance the
next result will be significant and negative, leaving only about a 32% chance the next result will
again be positive and significant."
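
The arithmetic in that last passage is easy to check. Below is a minimal Python sketch that uses only the rounded percentages quoted above; the variable names and dictionary keys are my own labels, not the paper's.

```python
# Rounded percentages taken from the quoted passage (labels are illustrative).
overall = {
    "positive_significant": 37,
    "insignificant": 58,
    "negative_significant": 5,
}

# Chances for the next result, given the current one is positive and significant.
after_positive = {
    "insignificant": 61,
    "negative_significant": 7,
}

# The "about 32%" in the quote is whatever remains after the two stated outcomes.
after_positive["positive_significant"] = 100 - sum(after_positive.values())

# Difference, in percentage points, between the chance of a positive, significant
# result following another one and the overall share of such results.
shift = after_positive["positive_significant"] - overall["positive_significant"]

print(sum(overall.values()))                    # 100
print(after_positive["positive_significant"])   # 32
print(shift)                                    # -5
```

On these rounded figures, a positive, significant result is followed by another one slightly less often (32%) than such results occur overall (37%).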