Currently Research Director at Founders Pledge, but posts and comments represent my own opinions, not FP’s, unless otherwise noted.
I worked previously as a data scientist and as a journalist.
Hey Matthew, thanks for sharing this. Can you provide some more information (or link to your thoughts elsewhere) on why fervor around UV-C is misplaced? As you know, ASHRAE Standards 185.1 and 185.2 concern testing of UV devices for germicidal irradiation, so I'd be particularly interested to know if this was an area that ASHRAE itself had concluded was unpromising.
I thought of some other down-the-line feature requests.
Ah, great! I think it would be nice to offer different aggregation options, though if you do offer one I agree that the geometric mean of odds is the best default. But I can imagine people wanting to use medians or averages, or even specifying their own aggregation functions. Especially if you are trying to encourage uptake by less technical organizations, it seems important to offer at least one option that is more legible to less numerate people.
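For concreteness, here's a minimal sketch (in Python, with hypothetical function names — not the tool's actual API) of how those three options can diverge on the same set of probability forecasts:

```python
import numpy as np

def geo_mean_of_odds(probs):
    """Aggregate probabilities via the geometric mean of their odds."""
    p = np.asarray(probs)
    odds = p / (1 - p)                      # convert probabilities to odds
    agg_odds = np.exp(np.log(odds).mean())  # geometric mean of the odds
    return agg_odds / (1 + agg_odds)        # convert back to a probability

forecasts = [0.6, 0.7, 0.9]
print(geo_mean_of_odds(forecasts))  # ~0.76
print(np.median(forecasts))         # 0.70
print(np.mean(forecasts))           # ~0.73
```

The median is also the easiest of these to explain to non-technical users ("the middle forecast"), which is part of why it seems worth surfacing as an option.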
Honestly, what surprises me most here is how similar all four organizations' numbers are across most of the items involved.
This was also gratifying for us to see, but it's probably important to note that our approach incorporates weights from both GiveWell and HLI at different points, so the estimates are not completely independent.
I haven't thought extensively about what kind of effect size I'd expect, but I think I'm roughly 65-70% confident that the RCT will return evidence of a detectable effect.
But my uncertainty is more about what rating we'll arrive at upon re-evaluating the whole thing. Since I reviewed SM last year, we've started to be a lot more punctilious about incorporating various discounts and forecasts into CEAs. So on the one hand I'd naturally expect us to apply more of those discounts on reviewing this case, but on the other hand my original reason for not discounting HLI's effect size estimates was my sense that their meta-analytic weightings appropriately accounted for a lot of the concerns that we'd discount for. This generates uncertainty that I expect we can resolve once we dig in.
As promised, I am returning here with some more detail. I will break this (very long) comment into sections for the sake of clarity.
My overview of this discussion
It seems clear to me that the heart of this discussion is a set of conflicting interpretations of the evidence on StrongMinds' effectiveness. In particular, the key question here is what our estimate of the effect size of SM's programs should be. There are other uncertainties and disagreements, but in my view, this is the essential crux of the conversation. I will give my own (personal) interpretation below, but I cannot stress enough that the vast majority of the relevant evidence is public—compiled very nicely in HLI's report—and that neither FP's nor GWWC's recommendation hinges on "secret" information. As I indicate below, there are some materials that can't be made public, but they are simply not critical elements of the evaluation, just quotes from private communications and things of that nature.
We are all looking at more or less the same evidence and coming to different conclusions.
I also think there is an important subtext to this conversation, which is the idea that both GWWC and FP should not recommend things for which we can't achieve bednet-level confidence. We simply don't agree, and accordingly this is not FP's approach to charity evaluation. As I indicated in my original comment, we are risk-neutral and evaluate charities on the basis of expected cost-effectiveness. I think GiveWell is about as good as an organization can be at doing what GiveWell does, and for donors who condition their giving on high levels of confidence, I will always recommend GiveWell top charities over others, irrespective of expected value calculations. It bears repeating that even with this orientation, we still think GiveWell charities are around twice as cost-effective as StrongMinds. I think Founders Pledge is in a substantially different position, and from the standpoint of doing the most possible good in the world, I am confident that risk-neutrality is the right position for us.
We will provide our recommendations, along with any shareable information we have to support them, to anyone who asks. I am not sure what the right way for GWWC to present them is.
How this conversation will and won't affect FP's position
What we won't do is take immediate steps (like, this week) to modify our recommendation or our cost-effectiveness analysis of StrongMinds. My approach to managing FP's research is to try to thoughtfully build processes that maximize the good we do over the long term. This is not a procedure fetish; it is a commonsensical way to ensure that we prioritize our time well and give important questions the resources and systematic thought they deserve.
What we will do is incorporate some important takeaways from this conversation during StrongMinds' next re-evaluation, which will likely happen in the coming months. To my eye, the most important takeaway is that our rating of StrongMinds may not sufficiently account for uncertainty around effect size. Incorporating this uncertainty would deflate SM's rating and may bring it much closer to our bar of 1x GiveDirectly.
More generally, I do agree with the meta-point that our evaluations should be public. We are moving in this direction, slowly but surely, though resource constraints make it a gradual process.
FP's materials on StrongMinds
How I view the evidence about StrongMinds
Our task as charity evaluators is, to the extent possible, to quantify the important considerations in estimating a charity's impact. When I reviewed HLI's work on StrongMinds, I was very satisfied that they had accounted for many different sources of uncertainty. I am still pretty satisfied, though I am now somewhat more uncertain myself.
A running theme in critiques of StrongMinds is that the effects they report are unbelievably large. I agree that they are very large. I don't agree that the existence of large-seeming effects is itself a knockdown argument against recommending this charity. It is, rather, a piece of evidence that we should consider alongside many other pieces of evidence.
I want to oversimplify a bit by distinguishing between two different views of how SM could end up reporting very large effect sizes: (1) the intervention has essentially no real effect, and the reported numbers are an artifact of broken data collection or outright fabrication; or (2) the intervention has a real effect that the reported numbers substantially overstate because of bias.
I am very satisfied that (1) is not the case here. There are two reasons for this. First, the intervention is well-supported by a fair amount of external evidence. This program is not "out of nowhere"; there are good reasons to believe it has some (possibly small) effect. Second, though StrongMinds' recent data collection practices have been wanting, they have shown a willingness to be evaluated (the existence of the Ozler RCT is a key data point here). In their dealings with FP, StrongMinds were extremely responsive to questions, and forthcoming and transparent in their answers.
Now, I think (2) is very likely to be the case. At FP, we increasingly try to account for this uncertainty in our CEAs. As you'll note in the link above, we didn't do that in our last review of StrongMinds, yielding a rating of roughly 5-6x GiveDirectly (per our moral weights, we value a WELLBY at about $160). So the question here is: how much of the observed effect is due to bias? If it's 80%, we should deflate our rating to 1.2x at StrongMinds' next review. In this scenario it would still clear our bar (though only just).
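To spell out that arithmetic (a back-of-the-envelope sketch; the 6x starting point and the alternative bias fractions are just illustrative, not official FP figures):

```python
# Back-of-the-envelope deflation: if some fraction of the observed
# effect is attributable to bias, only the remainder counts toward
# the multiple of GiveDirectly.
rating = 6.0  # illustrative upper end of the current 5-6x range

for bias_fraction in (0.5, 0.8, 0.9):
    deflated = rating * (1 - bias_fraction)
    print(f"bias {bias_fraction:.0%} -> {deflated:.1f}x GiveDirectly")

# bias 50% -> 3.0x GiveDirectly
# bias 80% -> 1.2x GiveDirectly (just clears the 1x bar)
# bias 90% -> 0.6x GiveDirectly (falls below it)
```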
In the absence of prior evidence about IPT-g, I think we would likely conclude that the observed effects are overwhelmingly due to bias. But I don't think this is a Pascal's Mugging-type scenario. We are not seeing a very large, possibly dubious effect that remains large in expectation even after deflating for dubiousness. We are seeing a large effect that is very broadly in line with the kind of effect we should expect on priors.
What I expect for the future
In my internal forecast attached to our last evaluation, I gave an 80% probability to us finding that SM would have an effectiveness of between 5.5x and 7x GD at its next evaluation. I would now lower this significantly, to something like 40%, and overall I think there's a 70-80% chance we'll still be recommending SM after its next re-evaluation.
Hey Simon, I remain slightly confused about this element of the conversation. I take you to mean that, since we base our assessment mostly on HLI's work, and since we draw different conclusions from HLI's work than you think are reasonable, we should reassess StrongMinds on that basis. Is that right?
If so, I do look forward to your thoughts on the HLI analysis, but in the meantime I'd be curious to get a sense of your personal levels of confidence here — what does a distribution of your beliefs over cost-effectiveness for StrongMinds look like?
Hey Vasco, you make lots of good points here that are worth considering at length. These are topics we've discussed on and off in a fairly unstructured way on the research team at FP, and I'm afraid I'm not sure what's next when it comes to tackling them. We don't currently have a researcher dedicated to animal welfare, and our recommendations in that space have historically come from partner orgs.
Just as context, the reason for this is that FP has historically separated our recommendations into three "worldviews" (longtermism, current generations, and animal welfare). The idea is that it's a lot easier to shift member grantmaking across causes within a worldview (from rare diseases to malaria, for instance) than across worldviews (e.g. to get people to care much more about chickens). The upshot of this, for better or for worse, is that we end up spending a lot of time prioritizing causes within worldviews, and avoiding the question of how to prioritize across worldviews.
This is also part of the reason we don't have a dedicated animal welfare researcher — we haven't historically moved as much money within that worldview as within our others. But it's actually not clear which way the causality flows in that case, so your post is a good nudge to think more seriously about this, as well as about the ways we might be able to incorporate animal welfare considerations into our GHD calculations, worldview separations notwithstanding.