I would really like to read a summary of this book. The reviews posted here (edit: in the original post) do not actually give much insight into the contents. I'm hoping someone will post a detailed summary on the forum (and, as EAs love self-criticism, I fully expect someone will!).
I'm not going to address the topic of the post itself, but there's another reason not to post under a burner account if it can be avoided, one I haven't seen mentioned and which this post indirectly highlights.
When people post under burner accounts, it becomes harder to be confident in the information the posts contain, because it could be the same person posting repeatedly. To give one example (not the only one), if you see X burner accounts posting "I observe Y", that could reflect anywhere from 1 to X observations of Y, and it's hard to get a sense of the true frequency. Posting under burners therefore undermines the posters' own message, because some of their information will be discounted.
In this post, the poster writes "Therefore, I feel comfortable questioning these grants using burner accounts," which suggests that they do in fact have multiple burner accounts. I recognize that using the same burner account would, over time, aggregate information and lead to slightly less anonymity, but again, the tradeoff is that using multiple accounts significantly undermines the signal. I suspect it could lead to a vicious cycle for those posting, if they repeatedly feel their posts aren't being taken seriously.
Thanks for mentioning the Social Science Prediction Platform! We had some interest from other sciences as well.
With collaborators, we outlined some other reasons to forecast research results here: https://www.science.org/doi/10.1126/science.aaz1704. In short, forecasts can help to evaluate the novelty of a result (a double-edged sword: very unexpected results are more likely to be suspect), mitigate publication bias against null results / provide an alternative null, and, over time, help to improve the accuracy of forecasting. There are other reasons as well, such as identifying which treatment to test or which outcome variables to focus on (which might have the highest value of information). In the long run, if forecasts are linked to RCT results, they could also help us say more about situations for which we don't have RCTs - but that's a longer-term goal. If this is an area of interest, I've got a podcast episode, an EA Global presentation, and some other things in this vein... this is probably the most detailed.
I agree that there's a lot of work in this area and decision makers actively interested in it. I'll also add that there's a lot of interest on the researcher side, which is key.
P.S. The SSPP is hiring web developers, if you know anyone who might be a good fit.
As a small note, we might get more precise estimates of the effects of a program by predicting magnitudes rather than whether something will replicate (which is what we're doing with the Social Science Prediction Platform). That said, I think a lot of work needs to be done before we can trust predictions, and there will always be a gap between how comfortable we are extrapolating to other things we could study and how comfortable we are extrapolating to "unquantifiable" interventions.
(There's an analogy to external validity here: you can do more if you can assume the study you are predicting is drawn from the same set as those you have already studied, or from that set reweighted in some way. You could in principle construct an ordering of how feasible something is to study and regress your ability to predict on that, but that would be incredibly noisy and not practical as things stand, and past some threshold you don't observe studies at all and have little to say without making strong assumptions about generalizing beyond that threshold.)
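To spell out that parenthetical a bit more concretely (just an illustrative sketch - the notation is mine, not from any particular paper): let $F_j$ be a feasibility-of-study score for intervention $j$ and $A_j$ a measure of how accurately we predict its result (say, negative absolute prediction error), which we only observe for interventions that actually get studied. In principle you could fit

\[ A_j = \alpha + \beta F_j + \epsilon_j \]

on the studied sample and extrapolate $\hat{A}_j = \hat{\alpha} + \hat{\beta} F_j$ to unstudied interventions with low $F_j$. But because $A_j$ is only observed when $F_j$ lies above some threshold $\bar{F}$, any statement about interventions below $\bar{F}$ rests on assuming the fitted relationship keeps holding there - which is the strong assumption I mean above.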
Great comment. I don't think anyone, myself included, would say the means are not the same and therefore everything is terrible. In the podcast, you can see my reluctance to do that when Rob tries to get me to give one number that will easily summarize how much results in one context will extrapolate to another, and I just don't want to play ball (which is not at all a criticism of him!). The number I tend to focus on these days (tau squared) is not easily interpretable in that way - instead, it's a measure of the unexplained variation in results - but how much is unexplained clearly depends on what model you are using (and because it is a variance, it also depends on units, making it hard to interpret across interventions except for those dealing with the same kind of outcome). On this view, if you can come up with a great model that explains away more of the heterogeneity, great! I am all for models that have better predictive power.
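To make "unexplained variation" concrete, here is the generic random-effects setup (a textbook sketch, not necessarily the exact specification we use with the AidGrade data): the estimated effect in study $i$ is

\[ \hat{\theta}_i = \theta_i + \varepsilon_i, \qquad \theta_i \sim N(\theta, \tau^2), \quad \varepsilon_i \sim N(0, \sigma_i^2), \]

so $\tau^2$ is the between-study variance left over after accounting for sampling error. A common moment-based estimator (DerSimonian-Laird) is

\[ \hat{\tau}^2 = \max\!\left(0,\; \frac{Q - (k-1)}{\sum_i w_i - \sum_i w_i^2 / \sum_i w_i}\right), \qquad Q = \sum_i w_i \left(\hat{\theta}_i - \bar{\theta}_w\right)^2, \]

with $w_i = 1/\sigma_i^2$, $\bar{\theta}_w$ the inverse-variance weighted mean, and $k$ the number of studies. Add covariates (mechanisms, implementation details, setting) in a meta-regression and the residual $\tau^2$ is what remains unexplained; a model with better predictive power is simply one that shrinks it.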
On the other hand:
1) I do worry that often people are not building more complicated models, but rather thinking about a specific study (or, if we're lucky, a group of studies), and that they are most likely biased towards those which found particularly large effects, since people seem to update more on positive results.
2) I am not convinced that focusing on mechanisms will completely solve the problem. I agree that interventions that are more theory-based should (in theory) have more similar results -- or at least results that are better able to be predicted, which is more to the point. On the other hand, implementation details matter. I agree with Glennerster and Bates that there is an undue focus on setting -- everyone wants an impact evaluation done in their particular location. And I think this focus on setting is excessive because (perhaps surprisingly) when I look at the AidGrade data, there is little to no effect of geography on the impact found, by which I mean that a result from (say) Kenya does not even generalize to Kenya very well (and I believe James Rising and co-authors have found similar results using a case study of conditional cash transfers). This isn't always going to be true; for example, the effects of health interventions depend on the baseline prevalence of disease, and baseline prevalences can be geographically clustered. But what I worry about -- without convincing evidence yet, so take this with a grain of salt -- is that small implementation details might frequently wash out the effects of knowing the mechanisms. Hopefully, we will have more evidence on this in the future (whichever way that evidence goes), and I very much hope that the more positive view turns out to be true.
I do agree with you that it's possible that researchers (and policymakers?) are able to account for some of the other factors when making predictions. I also said that there was some evidence that people were updating more on positive results; I need to dig into the data a bit more to do subgroup analyses, but one way to reconcile these results (which would be consistent with what I have seen using different data) is that some people may be better at predicting than others. There are definitely times when people are wildly off, as well. I don't think I have a good enough sense yet of when predictions are good and when they are not, and getting that sense would be valuable.
Edit: I meant to add, there are a lot of frameworks that people use to try to get a handle on when they can export results or how to generalize. In addition to the work cited in Glennerster and Bates, see Williams for another example. And talking with people in government, there are a lot of other one-off frameworks or approaches people use internally. I am a fan of this kind of work and think it highly necessary, even though I am quite confident it won't get the appreciation it deserves within academia.
This video might also add to the discussion - the closing panel at CSAE this year was largely on methodology, moderated by Hilary Greaves (head of the new Global Priorities Institute at Oxford), with Michael Kremer, Justin Sandefur, Joseph Ssentongo, and myself. Some of the comments from the other panellists still stick with me today.
https://ox.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=ec3f076c-9c71-4462-9b84-a8a100f5a44c
And when groups do work on these issues there is a tendency towards infighting.
Some things that could help:
Bringing people together is hugely important to working constructively.
I agree that it would be important to weigh the costs and benefits - I don't think it's exclusively an issue with RCTs, though.
One thing that could help in doing this calculus is a better understanding of when our non-study-informed beliefs are likely to be accurate.
I know at least some researchers are working in this area - Stefano DellaVigna and Devin Pope are looking to follow up their excellent papers on predictions with another one on how well people predict results across differences in context, and Aidan Coville and I also have some work in this area using impact evaluations in development and predictions gathered from policymakers, practitioners, and researchers.
I've stayed at a (non-EA) professional contact's house before, when they'd invited me to give a talk and later, very apologetically, realized they didn't have the budget for a hotel. They likely felt obliged to offer; I felt it would be awkward to decline. We were both at pains to be extremely, exceedingly, painstakingly polite given the circumstances and to turn the formality up a notch.
I agree the org should have paid for a hotel; I'm only mentioning this because, if baseline formality is a 5, I would think it would be more normal to kick it up to a 10 under the circumstances. That makes this situation all the more bizarre.