1039 karma · Joined Oct 2019


Currently Research Director at Founders Pledge, but posts and comments represent my own opinions, not FP’s, unless otherwise noted.

I worked previously as a data scientist and as a journalist.


Hey Vasco, you make lots of good points here that are worth considering at length. These are topics we've discussed on and off in a fairly unstructured way on the research team at FP, and I'm afraid I'm not sure what's next when it comes to tackling them. We don't currently have a researcher dedicated to animal welfare, and our recommendations in that space have historically come from partner orgs.

Just as context, the reason for this is that FP has historically separated our recommendations into three "worldviews" (longtermism, current generations, and animal welfare). The idea is that it's a lot easier to shift member grantmaking across causes within a worldview (e.g. from rare diseases to malaria) than across worldviews (e.g. to get people to care much more about chickens). The upshot of this, for better or for worse, is that we end up spending a lot of time prioritizing causes within worldviews, and avoiding the question of how to prioritize across worldviews.

This is also part of the reason we don't have a dedicated animal welfare researcher; we haven't historically moved as much money within that worldview as within our others. But I'm actually not sure which way the causality flows in that case, so your post is a good nudge to think more seriously about this, as well as the ways we might be able to incorporate animal welfare considerations into our GHD calculations, worldview separations notwithstanding.

Hey Matthew, thanks for sharing this. Can you provide some more information (or link to your thoughts elsewhere) on why fervor around UV-C is misplaced? As you know, ASHRAE Standards 185.1 and 185.2 concern testing of UV devices for germicidal irradiation, so I'd be particularly interested to know if this was an area that ASHRAE itself had concluded was unpromising.

I thought of some other down-the-line feature requests:

  • Google Sheets integration (we currently already store our forecasts in a Google sheet)
  • Relatedly, ability to export to CSV (does this already exist and I just missed it?)
  • Ability to designate a particular resolver
  • Different formal resolution mechanisms, like a poll of users.

Ah, great! I think it would be nice to offer different aggregation options, though if you do offer one I agree that geo mean of odds is the best default. But I can imagine people wanting to use medians or averages, or even specifying their own aggregation functions. Especially if you are trying to encourage uptake by less technical organizations, it seems important to offer at least one option that is more legible to less numerate people.
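To make the aggregation options above concrete, here is a minimal Python sketch comparing the geometric mean of odds with a plain mean and a median. The function name `geo_mean_of_odds` and the example forecasts are hypothetical, not taken from the tool, and the sketch assumes every probability is strictly between 0 and 1.

```python
import math
from statistics import mean, median


def geo_mean_of_odds(probs):
    """Pool probabilities by taking the geometric mean of their odds.

    Assumes every probability is strictly between 0 and 1.
    """
    # Averaging log-odds and exponentiating gives the geometric mean of odds.
    log_odds = [math.log(p / (1 - p)) for p in probs]
    pooled_odds = math.exp(mean(log_odds))
    # Convert the pooled odds back to a probability.
    return pooled_odds / (1 + pooled_odds)


forecasts = [0.6, 0.7, 0.9]  # hypothetical forecasts on one question
aggregates = {
    "geo_mean_of_odds": geo_mean_of_odds(forecasts),
    "mean": mean(forecasts),
    "median": median(forecasts),
}
```

A user-specified aggregation function would fit the same shape: any callable that takes a list of probabilities and returns one probability.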

I have already installed this and started using this at Founders Pledge. Thanks for making this! I've been wanting something like this for a long time.

Some feature requests:

  • Aggregation choices (e.g. geo mean of odds would be nice)
  • Brier scores for users
  • Calibration curves for users
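As a rough illustration of the last two requests, the sketch below computes a Brier score and the binned data behind a calibration curve for one user. The function names and the ten-bin default are assumptions for illustration, not a description of how the tool works.

```python
from collections import defaultdict


def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def calibration_table(forecasts, outcomes, n_bins=10):
    """Return (bin_lower_edge, mean_forecast, observed_frequency, count) per non-empty bin."""
    bins = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        # Clamp so that a forecast of exactly 1.0 lands in the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))
    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_forecast = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(o for _, o in pairs) / len(pairs)
        rows.append((idx / n_bins, mean_forecast, observed, len(pairs)))
    return rows
```

Plotting `mean_forecast` against `observed_frequency` for each row gives the calibration curve; a well-calibrated forecaster sits close to the diagonal.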

Honestly, what surprises me most here is how similar all four organizations' numbers are across most of the items involved


This was also gratifying for us to see, but it's probably important to note that our approach incorporates weights from both GiveWell and HLI at different points, so the estimates are not completely independent.

Thanks, bruce — this is a great point. I'm not sure if we would account for the costs in the exact way I think you have done here, but we will definitely include this consideration in our calculation.

I haven't thought extensively about what kind of effect size I'd expect, but I think I'm roughly 65-70% confident that the RCT will return evidence of a detectable effect.

But my uncertainty is more about what rating we'll land on when we re-evaluate the whole thing. Since I reviewed SM last year, we've started to be a lot more punctilious about incorporating various discounts and forecasts into CEAs. So on the one hand I'd naturally expect us to apply more of those discounts on reviewing this case, but on the other hand my original reason for not discounting HLI's effect size estimates was my sense that their meta-analytic weightings appropriately accounted for a lot of the concerns that we'd discount for. This generates uncertainty that I expect we can resolve once we dig in.

As promised, I am returning here with some more detail. I will break this (very long) comment into sections for the sake of clarity.

My overview of this discussion

It seems clear to me that what is going on here is that there are conflicting interpretations of the evidence on StrongMinds' effectiveness. In particular, the key question here is what our estimate of the effect size of SM's programs should be. There are other uncertainties and disagreements, but in my view, this is the essential crux of the conversation. I will give my own (personal) interpretation below, but I cannot stress enough that the vast majority of the relevant evidence is public—compiled very nicely in HLI's report—and that neither FP's nor GWWC's recommendation hinges on "secret" information. As I indicate below, there are some materials that can't be made public, but they are simply not critical elements of the evaluation, just quotes from private communications and things of that nature.

We are all looking at more or less the same evidence and coming to different conclusions.

I also think there is an important subtext to this conversation, which is the idea that neither GWWC nor FP should recommend things for which we can't achieve bednet-level confidence. We simply don't agree, and accordingly this is not FP's approach to charity evaluation. As I indicated in my original comment, we are risk-neutral and evaluate charities on the basis of expected cost-effectiveness. I think GiveWell is about as good as an organization can be at doing what GiveWell does, and for donors who prioritize their giving conditional on high levels of confidence, I will always recommend GiveWell top charities over others, irrespective of expected value calculations. It bears repeating that even with this orientation, we still think GiveWell charities are around twice as cost-effective as StrongMinds. I think Founders Pledge is in a substantially different position, and from the standpoint of doing the most possible good in the world, I am confident that risk-neutrality is the right position for us.

We will provide our recommendations, along with any shareable information we have to support them, to anyone who asks. I am not sure what the right way for GWWC to present them is.

How this conversation will and won't affect FP's position

What we won't do is take immediate steps (like, this week) to modify our recommendation or our cost-effectiveness analysis of StrongMinds. My approach to managing FP's research is to try to thoughtfully build processes that maximize the good we do over the long term. This is not a procedure fetish; this is a commonsensical way to ensure that we prioritize our time well and allocate important questions the resources and systematic thought they deserve. 

What we will do is incorporate some important takeaways from this conversation during StrongMinds' next re-evaluation, which will likely happen in the coming months. To my eye, the most important takeaway is that our rating of StrongMinds may not sufficiently account for uncertainty around effect size. Incorporating this uncertainty would deflate SM's rating and may bring it much closer to our bar of 1x GiveDirectly.

More generally, I do agree with the meta-point that our evaluations should be public. We are slowly but surely moving in this direction over time, though resource constraints make it a slow process.

FP's materials on StrongMinds

  • A copy of our CEA. I'm afraid this may not be very elucidating, as essentially all we did here is take HLI's estimates and put them into a format that works better with our ratings system. One note is that we don't apply any subjective discounts in this CEA - this is the kind of thing I expect might change in future.
  • Some exploration I did in R and Stan to try to test various components of the analysis. In particular, this contains several attempts to use SM's pre-post data (corrected for a hypothesized counterfactual) to update on several different more general priors. Of particular interest are this review from which I took a prior on psychosocial interventions in LMICs and this one which offers a much more outside view-y prior.
    • Crucially, I really don't think this type of explicit Bayesian update is the right way to estimate effects here; I much prefer HLI's way of estimating effects (it leaves a lot less data on the table).
    • The main goal of this admittedly informal analysis was to test under what alternate analytic conditions our estimate of SM's effectiveness would fall below our recommendation bar.
  • We have an internal evaluation template that I have not shared, since it contains quotes from private communications with StrongMinds. There's nothing mysterious or particularly informative here; we just don't share details of private communications that weren't conducted with the explicit expectation that they'd be shared. This is the type of template that in future we hope to post publicly with privileged communications excised.
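For readers unfamiliar with the kind of explicit Bayesian update mentioned in the second bullet, a conjugate normal-normal model is the simplest version of the idea. The sketch below is not the R/Stan analysis itself, and every number in it is hypothetical; it only shows how a skeptical prior pulls a large observed effect toward the prior mean.

```python
def normal_update(prior_mean, prior_sd, obs_mean, obs_se):
    """Posterior mean and SD for a normal prior and a normal likelihood with known SE."""
    prior_precision = 1 / prior_sd ** 2
    obs_precision = 1 / obs_se ** 2
    post_var = 1 / (prior_precision + obs_precision)
    # The posterior mean is a precision-weighted average of prior and observation.
    post_mean = post_var * (prior_precision * prior_mean + obs_precision * obs_mean)
    return post_mean, post_var ** 0.5


# Hypothetical inputs, in standard-deviation units of the outcome measure:
# a modest prior effect and a much larger, noisier observed effect.
post_mean, post_sd = normal_update(prior_mean=0.5, prior_sd=0.3, obs_mean=1.5, obs_se=0.5)
```

The tighter the prior relative to the standard error of the observed data, the more the posterior stays near the prior, which is why the choice between an inside-view and an outside-view prior matters so much in this kind of exercise.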

How I view the evidence about StrongMinds

Our task as charity evaluators is, to the extent possible, to quantify the important considerations in estimating a charity's impact. When I reviewed HLI's work on StrongMinds, I was very satisfied that they had accounted for many different sources of uncertainty. I am still pretty satisfied, though I am now somewhat more uncertain myself.

A running theme in critiques of StrongMinds is that the effects they report are unbelievably large. I agree that they are very large. I don't agree that the existence of large-seeming effects is itself a knockdown argument against recommending this charity. It is, rather, a piece of evidence that we should consider alongside many other pieces of evidence.

I want to oversimplify a bit by distinguishing between two different views of how SM could end up reporting very large effect sizes.

  1. The reported effects are essentially made-up. The intervention has no effect at all, and the illusion of an effect is driven by fraud at worst and severe confirmation bias at best.
  2. The reported effects are severely inflated by selection bias, social desirability bias, and other similar factors.

I am very satisfied that (1) is not the case here. There are two reasons for this. First, the intervention is well-supported by a fair amount of external evidence. This program is not "out of nowhere"; there are good reasons to believe it has some (possibly small) effect. Second, though StrongMinds' recent data collection practices have been wanting, they have shown a willingness to be evaluated (the existence of the Ozler RCT is a key data point here). With FP, StrongMinds were extremely responsive to questions and forthcoming and transparent with their answers.

Now, I think (2) is very likely to be the case. At FP, we increasingly try to account for this uncertainty in our CEAs. As you'll note in the link above, we didn't do that in our last review of StrongMinds, yielding a rating of roughly 5-6x GiveDirectly (per our moral weights, we value a WELLBY at about $160). So the question here is: how much of the observed effect is due to bias? If it's 80%, we should deflate our rating to 1.2x at StrongMinds' next review. In this scenario it would still clear our bar (though only just).
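The arithmetic behind that deflation can be sketched as follows, assuming (as a simplification) that the rating scales linearly with the share of the observed effect that is real. The function names are hypothetical, and the 1x bar is the recommendation bar of 1x GiveDirectly mentioned earlier.

```python
def deflated_rating(rating, bias_share):
    """Rating left after removing the share of the observed effect attributed to bias."""
    return rating * (1 - bias_share)


def break_even_bias_share(rating, bar=1.0):
    """Bias share at which the deflated rating falls exactly to the bar."""
    return 1 - bar / rating


# Both ends of the roughly 5-6x range, deflated for an 80% bias share.
for rating in (5.0, 6.0):
    print(
        rating,
        round(deflated_rating(rating, 0.8), 2),
        round(break_even_bias_share(rating), 3),
    )
```

Under this linear assumption, the 1.2x figure corresponds to the top of the range (6x); at the bottom of the range (5x), an 80% bias share would put the rating right at the bar.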

In the absence of prior evidence about IPT-g, I think we would likely conclude that the observed effects are overwhelmingly due to bias. But I don't think this is a Pascal's Mugging-type scenario. We are not seeing a very large, possibly dubious effect that remains large in expectation even after deflating for dubiousness. We are seeing a large effect that is very broadly in line with the kind of effect we should expect on priors.

What I expect for the future

In my internal forecast attached to our last evaluation, I gave an 80% probability to us finding that SM would have an effectiveness of between 5.5x and 7x GD at its next evaluation. I would lower this significantly, to something like 40%, and overall I would say that I think there's a 70-80% chance we'll still be recommending SM after its next re-evaluation.

Hey Simon, I remain slightly confused about this element of the conversation. I take you to mean that, since we base our assessment mostly on HLI's work, and since we draw different conclusions from HLI's work than you think are reasonable, we should reassess StrongMinds on that basis. Is that right?

If so, I do look forward to your thoughts on the HLI analysis, but in the meantime I'd be curious to get a sense of your personal levels of confidence here — what does a distribution of your beliefs over cost-effectiveness for StrongMinds look like?
