As a small note, we might get more precise estimates of the effects of a program by predicting magnitudes rather than whether something will replicate (which is what we're doing with the Social Science Prediction Platform). That said, I think a lot of work needs to be done before we can have trust in predictions, and there will always be a gap between how comfortable we are extrapolating to other things we could study vs. "unquantifiable" interventions.
(There's an analogy to external validity here, where you can do more if you can assume the study you predict is drawn from the same set as those you have studied, or the same set if weighted in some way. You could in principle make an ordering of how feasible something is to be studied, and regress your ability to predict on that, but that would be incredibly noisy and not practical as things stand, and past some threshold you don't observe studies anymore and have little to say without making strong assumptions about generalizing past that threshold.)
I'll try to think about this some more. It's a good question.
Great comment. I don't think anyone, myself included, would say the means are not the same and therefore everything is terrible. In the podcast, you can see my reluctance to that when Rob is trying to get me to give one number that will easily summarize how much results in one context will extrapolate to another, and I just don't want to play ball (which is not at all to criticize!). The number I tend to focus on these days (tau squared) is not one that is easily interpretable in that way - instead, it's a measure of the unexplained variation in results - but how much is unexplained clearly depends on what model you are using (and because it is a variance, it really depends on units, making it hard to interpret across interventions except for those dealing with the same kind of outcome). On this view, if you can come up with a great model to explain away more of the heterogeneity, great! I am all for models that have better predictive power.
On the other hand:
1) I do worry that often people are not building more complicated models, but rather thinking about a specific study (if lucky, a group of studies), most likely being biased towards those which found particularly large effects as people seem to update more on positive results.
2) I am not convinced that focusing on mechanisms will completely solve the problem. I agree that interventions that are more theory-based should (in theory) have more similar results -- or at least results that are better able to be predicted, which is more to the point. On the other hand, implementation details matter. I agree with Glennerster and Bates that there is an undue focus on setting -- everyone wants an impact evaluation done in their particular location. But I think there is too much focus on setting because (perhaps surprisingly) when I look in the AidGrade data, there is little to no effect of geography on the impact found, by which I mean that a result from (say) Kenya does not even generalize to Kenya very well (and I believe James Rising and co-authors have found similar results using a case study of conditional cash transfers). This isn't always going to be true; for example, the effect of health interventions depend on the baseline prevalence of disease, and baseline prevalences can be geographically clustered. But what I worry -- without convincing evidence yet so take this with a grain of salt -- is that small implementation details might frequently wash out the effects of knowing the mechanisms. Hopefully, we will have more evidence on this in the future (whichever way that evidence goes), and I very much hope that the more positive view turns out to be true.
I do agree with you that it's possible that researchers (and policymakers?) are able to account for some of the other factors when making predictions. I also said that there was some evidence that people were updating more on the positive results; I need to dig into the data a bit more to do subgroup analyses, but one way to reconcile these results (which would be consistent with what I have seen using different data) is that some people may be better at it than others. There are definitely times when people are wildly off, as well. I don't think I have a good enough sense yet of when predictions are good and when they are not, and that would be valuable.
Edit: I meant to add, there are a lot of frameworks that people use to try to get a handle on when they can export results or how to generalize. In addition to the work cited in Glennerster and Bates, see Williams for another example. And talking with people in government, there are a lot of other one-off frameworks or approaches people use internally. I am a fan of this kind of work and think it highly necessary, even though I am quite confident it won't get the appreciation it deserves within academia.
This video might also add to the discussion - the closing panel at CSAE this year was largely on methodology, moderated by Hilary Greaves (head of the new Global Priorities Institute at Oxford), with Michael Kremer, Justin Sandefur, Joseph Ssentongo, and myself. Some of the comments from the other panellists still stick with me today.
And when groups do work on these issues there is a tendency towards infighting.
Some things that could help:
Bringing people together is hugely important to working constructively.
I agree that it would be important to weigh the costs and benefits - I don't think it's exclusively an issue with RCTs, though.
One thing that could help in doing this calculus is a better understanding of when our non-study-informed beliefs are likely to be accurate.
I know at least some researchers are working in this area - Stefano DellaVigna and Devin Pope are looking to follow up their excellent papers on predictions with another one looking at how well people predict results based on differences in context, and Aidan Coville and I also have some work in this area using impact evaluations in development and predictions gathered from policymakers, practitioners, and researchers.