Currently Research Director at Founders Pledge, but posts and comments represent my own opinions, not FP’s, unless otherwise noted.
I worked previously as a data scientist and as a journalist.
Honestly, what surprises me most here is how similar all four organizations' numbers are across most of the items involved.
This was also gratifying for us to see, but it's probably important to note that our approach incorporates weights from both GiveWell and HLI at different points, so the estimates are not completely independent.
Thanks, bruce — this is a great point. I'm not sure if we would account for the costs in the exact way I think you have done here, but we will definitely include this consideration in our calculation.
I haven't thought extensively about what kind of effect size I'd expect, but I think I'm roughly 65-70% confident that the RCT will return evidence of a detectable effect.
But my uncertainty is more in terms of rating upon re-evaluating the whole thing. Since I reviewed SM last year, we've started to be a lot more punctilious about incorporating various discounts and forecasts into CEAs. So on the one hand I'd naturally expect us to apply more of those discounts on reviewing this case, but on the other hand my original reason for not discounting HLI's effect size estimates was my sense that their meta-analytic weightings appropriately accounted for a lot of the concerns that we'd discount for. This generates uncertainty that I expect we can resolve once we dig in.
As promised, I am returning here with some more detail. I will break this (very long) comment into sections for the sake of clarity.
My overview of this discussion
It seems clear to me that what is going on here is that there are conflicting interpretations of the evidence on StrongMinds' effectiveness. In particular, the key question here is what our estimate of the effect size of SM's programs should be. There are other uncertainties and disagreements, but in my view, this is the essential crux of the conversation. I will give my own (personal) interpretation below, but I cannot stress enough that the vast majority of the relevant evidence is public—compiled very nicely in HLI's report—and that neither FP's nor GWWC's recommendation hinges on "secret" information. As I indicate below, there are some materials that can't be made public, but they are simply not critical elements of the evaluation, just quotes from private communications and things of that nature.
We are all looking at more or less the same evidence and coming to different conclusions.
I also think there is an important subtext to this conversation, which is the idea that both GWWC and FP should not recommend things for which we can't achieve bednet-level confidence. We simply don't agree, and accordingly this is not FP's approach to charity evaluation. As I indicated in my original comment, we are risk-neutral and evaluate charities on the basis of expected cost-effectiveness. I think GiveWell is about as good as an organization can be at doing what GiveWell does, and for donors who prioritize their giving conditional on high levels of confidence, I will always recommend GiveWell top charities over others, irrespective of expected value calculations. It bears repeating that even with this orientation, we still think GiveWell charities are around twice as cost-effective as StrongMinds. I think Founders Pledge is in a substantially different position, and from the standpoint of doing the most possible good in the world, I am confident that risk-neutrality is the right position for us.
We will provide our recommendations, along with any shareable information we have to support them, to anyone who asks. I am not sure what the right way for GWWC to present them is.
How this conversation will and won't affect FP's position
What we won't do is take immediate steps (like, this week) to modify our recommendation or our cost-effectiveness analysis of StrongMinds. My approach to managing FP's research is to try to thoughtfully build processes that maximize the good we do over the long term. This is not a procedure fetish; this is a commonsensical way to ensure that we prioritize our time well and give important questions the resources and systematic thought they deserve.
What we will do is incorporate some important takeaways from this conversation during StrongMinds' next re-evaluation, which will likely happen in the coming months. To my eye, the most important takeaway is that our rating of StrongMinds may not sufficiently account for uncertainty around effect size. Incorporating this uncertainty would deflate SM's rating and may bring it much closer to our bar of 1x GiveDirectly.
More generally, I do agree with the meta-point that our evaluations should be public. We are slowly but surely moving in this direction over time, though resource constraints make it a slow process.
FP's materials on StrongMinds
How I view the evidence about StrongMinds
Our task as charity evaluators is, to the extent possible, to quantify the important considerations in estimating a charity's impact. When I reviewed HLI's work on StrongMinds, I was very satisfied that they had accounted for many different sources of uncertainty. I am still pretty satisfied, though I am now somewhat more uncertain myself.
A running theme in critiques of StrongMinds is that the effects they report are unbelievably large. I agree that they are very large. I don't agree that the existence of large-seeming effects is itself a knockdown argument against recommending this charity. It is, rather, a piece of evidence that we should consider alongside many other pieces of evidence.
I want to oversimplify a bit by distinguishing between two different views of how SM could end up reporting very large effect sizes: (1) the program has essentially no real effect, and the reported numbers are artifacts of poor or motivated measurement; or (2) the program has a real effect, but biases (social desirability, lack of blinding, and so on) inflate the reported effect size.
I am very satisfied that (1) is not the case here. There are two reasons for this. First, the intervention is well-supported by a fair amount of external evidence. This program is not "out of nowhere"; there are good reasons to believe it has some (possibly small) effect. Second, though StrongMinds' recent data collection practices have been wanting, they have shown a willingness to be evaluated (the existence of the Ozler RCT is a key data point here). With FP, StrongMinds were extremely responsive to questions and forthcoming and transparent with their answers.
Now, I think (2) is very likely to be the case. At FP, we increasingly try to account for this uncertainty in our CEAs. As you'll note in the link above, we didn't do that in our last review of StrongMinds, which yielded a rating of roughly 5-6x GiveDirectly (per our moral weights, we value a WELLBY at about $160). So the question here is: how much of the observed effect is due to bias? If it's 80%, we should deflate our rating to 1.2x at StrongMinds' next review. In this scenario it would still clear our bar (though only just).
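To make the arithmetic above concrete, here is a toy sketch. The 6x rating and the 80% bias share are the illustrative figures from the paragraph above, not outputs of our actual CEA:

```python
# Toy sketch of the deflation arithmetic above.
# Figures are illustrative: a ~6x GiveDirectly rating and an
# assumed bias share of 80% of the observed effect.

def deflated_rating(rating_vs_gd: float, bias_share: float) -> float:
    """Deflate a cost-effectiveness rating by the fraction of the
    observed effect attributed to bias (linear in effect size)."""
    return rating_vs_gd * (1.0 - bias_share)

# 6x GD with 80% of the effect attributed to bias -> 1.2x GD,
# still above a 1x GiveDirectly bar.
print(round(deflated_rating(6.0, 0.8), 2))
```

The point of the sketch is just that the deflation is linear: the rating survives a large bias share only because the starting multiple is several times the bar.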
In the absence of prior evidence about IPT-g, I think we might likely conclude that the observed effects are overwhelmingly due to bias. But I don't think this is a Pascal's Mugging-type scenario. We are not seeing a very large, possibly dubious effect that remains large in expectation even after deflating for dubiousness. We are seeing a large effect that is very broadly in line with the kind of effect we should expect on priors.
What I expect for the future
In my internal forecast attached to our last evaluation, I gave an 80% probability to us finding that SM would have an effectiveness of between 5.5x and 7x GD at its next evaluation. I would lower this significantly, to something like 40%, and overall I would say that I think there's a 70-80% chance we'll still be recommending SM after its next re-evaluation.
Hey Simon, I remain slightly confused about this element of the conversation. I take you to mean that, since we base our assessment mostly on HLI's work, and since we draw different conclusions from HLI's work than you think are reasonable, we should reassess StrongMinds on that basis. Is that right?
If so, I do look forward to your thoughts on the HLI analysis, but in the meantime I'd be curious to get a sense of your personal levels of confidence here — what does a distribution of your beliefs over cost-effectiveness for StrongMinds look like?
Fair enough. I think one important thing to highlight here is that though the details of our analysis have changed since 2019, the broad strokes haven’t — that is to say, the evidence is largely the same and the transformation used (DALY vs WELLBY), for instance, is not super consequential for the rating.
The situation is one, as you say, of GIGO (though we think the input is not garbage) and the main material question is about the estimated effect size. We rely on HLI’s estimate, the methodology for which is public.
I think your (2) is not totally fair to StrongMinds, given the Ozler RCT. No matter how it turns out, it will have a big impact on our next reevaluation of StrongMinds.
Edit: To be clearer, we shared updated reasoning with GWWC but the 2019 report they link, though deprecated, still includes most of the key considerations for critics, as evidenced by your observations here, which remain relevant. That is, if you were skeptical of the primary evidence on SM, our new evaluation would not cause you to update to the other side of the cost-effectiveness bar (though it might mitigate less consequential concerns about e.g. disability weights).
“I think my main takeaway is my first one here. GWWC shouldn't be using your recommendations to label things top charities. Would you disagree with that?”
Yes, I think so, though I'm not sure why this should be the case. Different evaluators have different standards of evidence, and GWWC is using ours for this particular recommendation. They reviewed our reasoning and (I gather) were satisfied. As someone else said in the comments, the right reference class here is probably deworming: "big if true."
The message on the report says that some details have changed, but that our overall view is represented. That’s accurate, though there are some details that are more out of date than others. We don’t want to just remove old research, but I’m open to the idea that this warning should be more descriptive.
I’ll have to wait until next week to address more substantive questions, but it seems to me that the recommend/don’t-recommend question is the most cruxy one here.
On reflection, it also seems cruxy that our current evaluation isn’t yet public. This seems very fair to me, and I’d be very curious to hear GWWC’s take. We would like to make all evaluation materials public eventually, but this is not as simple as it might seem and especially hard given our orientation toward member giving.
Though this type of interaction is not ideal for me, it seems better for the community. If they can’t be totally public, I’d rather our recs be semi-public and subject to critique than totally private.
Hi Simon, thanks for writing this! I’m research director at FP, and have a few bullets to comment here in response, but overall just want to indicate that this post is very valuable. I’m also commenting on my phone and don’t have access to my computer at the moment, but can participate in this conversation more energetically (and provide more detail) when I’m back at work next week.
I basically agree with what I take to be your topline finding here, which is that more data is needed before we can arrive at GiveWell-tier levels of confidence about StrongMinds. I agree that a lack of recent follow-ups is problematic from an evaluator’s standpoint and look forward to updated data.
FP doesn’t generally strive for GW-tier levels of confidence; we’re risk-neutral and our general procedure is to estimate expected cost-effectiveness inclusive of deflators for various kinds of subjective consideration, like social desirability bias.
The 2019 report you link (and the associated CEA) is deprecated (FP hasn’t been resourced to update public-facing materials, a situation that is now changing), but the proviso at the top of the page is accurate: we stand by our recommendation.
This is because we re-evaluated StrongMinds last year based on HLI’s research. The new evaluation did things a little bit differently than the old one. Instead of attempting to linearize the relationship between PHQ-9 score reductions and disability weights, we converted the estimated treatment effect into WELLBY-SDs by program and area of operation, an elaboration made possible by HLI’s careful work and using their estimated effect sizes. I reviewed their methodology and was ultimately very satisfied— I expect you may want to dig into this in more detail. This also ultimately allows the direct comparison with cash transfers.
I pressure-tested this a few ways: by comparing with the linearized-DALY approach from our first evaluation, by treating the pre-post data as evidence in a Bayesian update on conservative priors, and by modeling the observed shifts under the assumption of an exponential distribution of PHQ-9 scores (which appears in some literature on the topic). All these methods made the intervention look pretty cost-effective.
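As an illustration of the second of those checks, here is a minimal normal-normal Bayesian update sketch. All of the numbers (prior mean and SD, data mean and SE) are hypothetical placeholders chosen only to show the mechanics, not figures from our evaluation:

```python
# Minimal sketch of a conjugate normal-normal Bayesian update on an
# effect size (in SD units). All numbers are hypothetical placeholders,
# chosen only to illustrate how a conservative prior gets pulled toward
# strong pre-post data.

def normal_update(prior_mean, prior_sd, data_mean, data_se):
    """Posterior mean and SD of a normal mean under a normal prior."""
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = 1.0 / data_se ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * data_mean)
    return post_mean, post_var ** 0.5

# Conservative prior: small effect (0.1 SD) with wide uncertainty (0.4 SD).
# Hypothetical pre-post estimate: large effect (1.0 SD) with SE of 0.2 SD.
mean, sd = normal_update(0.1, 0.4, 1.0, 0.2)
print(round(mean, 2), round(sd, 2))
```

Because the posterior is a precision-weighted average, even a skeptical prior ends up conceding a substantial effect when the data are precise, which is the shape of the result this check produced.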
Crucially, FP’s bar for recommendation to members is GiveDirectly, and we estimate StrongMinds at roughly 6x GD, so it clears that bar comfortably. Even by our (risk-neutral) lights, though, StrongMinds is not competitive with GiveWell top charities.
A key piece of evidence is the 2002 RCT on which the program is based. I do think a lot—perhaps too much—hinges on this RCT. Ultimately, however, I think it is the most relevant way to develop a prior on the effectiveness of IPT-g in the appropriate context, which is why a Bayesian update on the pre-post data seems so optimistic: the observed effects are in line with the effect size from the RCT.
The effect sizes observed are very large, but it’s important to place them in the context of StrongMinds’ work with severely traumatized populations. Incoming PHQ-9 scores are very, very high, so 1) it’s reasonable to expect some reversion to the mean in control groups, which we do see, and 2) I’m not sure that our general priors about the low effectiveness of therapeutic interventions are well-calibrated here.
Overall, I agree that social desirability bias is a very cruxy issue here, and could bear on our rating of StrongMinds. But I think even VERY conservative assumptions about the role of this bias in the effect estimate would cause SM still to clear our bar in expectation.
Just a note on FP and our public-facing research: we prioritize our research resources primarily to make recommendations to members, while at the same time trying to provide a public good to the broader EA community. We’re still not sure how best to do this, and definitely not sure how to resource it, but we are working on it.
Hey Nick, thanks for this very valuable experience-informed comment. I'm curious what you make of the original 2002 RCT that first tested IPT-G in Uganda. When we (at Founders Pledge) looked at StrongMinds (which we currently recommend, in large part on the back of HLI's research), I was surprised to see that the results from the original RCT lined up closely with the pre/post scores reported by recent program participants.
Would your take on this result be that participants in the treated group were still basically giving what they saw as socially desirable answers, irrespective of the efficacy of the intervention? It's true that the control arm in the 2002 RCT did not receive a comparable placebo treatment, so that does seem a reasonable criticism. But if the social desirability bias is so strong as to account for the massive effect size reported in the 2002 paper, I'd expect it to appear in the NBER paper you cite, which also featured a pure control group. But that paper seems to find no effect of psychotherapy alone.
I agreed with your comment (I found it convincing) but downvoted it because if I was a first-time poster here, I would be much less likely to post again after having my first post characterized as foolish.
As one of many “naive functionalists”, I found the OP was very valuable as a challenge to my thinking, and so I want to come down strongly against discouraging such posts in any way.