Since folks are interested in encouraging critiques of EA—an admirable sentiment!—I wrote the following as a good-faith, friendly, and hopefully modest critique. [Note: all of the following is about global health/development, not about long-termist endeavors. I try to make this clear throughout, but might occasionally have let slip some overly-broad phrasing.]
EAs write compelling articles about why RCTs are a great way to understand the causal impact of a policy or treatment. And GiveWell’s claim to fame is that it has led to many millions of dollars of donations to “several charities focusing on RCT-backed interventions as the ‘most effective’ ones around the world.”
But I wonder if the EA movement is allocating nearly enough money to new RCTs and program evaluations, or to R&D more broadly, so as to build out new evidence in a strategic way.
After all, the agreed-upon list of the “best” interventions identified by RCTs seems . . . a bit stagnant.
- When I spoke at the EA Global conference in 2016, GiveWell’s best ideas for global giving involved malaria, deworming, and cash transfers.
- When I look at GiveWell’s current list of the top charities, they still are mostly focused on malaria, deworming, and cash transfers (albeit with the addition of Vitamin A supplements and a vaccine program operating in northwest Nigeria).
Such a tiny set of interventions doesn’t seem anywhere near the scale of the many inequities and problems in the world today. Indeed, Open Phil is offering up to $150 million in a regranting challenge, which seems to be a signal that they have more money to give away than they currently are able to deploy to existing causes.
In any event, how do we know that a handful of interventions and organizations are the best ideas to fund? Because at some point in the past, someone thought to fund rigorous RCTs on anti-malaria efforts, deworming, Vitamin A, cash incentives, etc.
But why would a handful of isolated ideas be the best we can possibly do?
To be a bit provocative (commenters will hopefully point out corrections):
We’ve mostly [albeit not entirely] taken the world’s supply of research as a given—with all of its oversights, poorly-aligned academic incentives, and irreproducibility—and then picked the best-supported interventions we could find there.
But the world’s supply of program evaluations, RCTs, jurisdiction-wide studies (e.g., difference-in-differences), and implementation research is not fixed. Gates, WHO, Wellcome, the World Bank, etc., do fund a constant stream of research, but it isn’t clear why we would expect them to identify and fund the most promising programs and the best studies for EA purposes.
If we want to expand the list of cost-effective ideas, and if EA as a movement has more money than it knows what to do with, perhaps we should develop an EA-focused R&D agenda that is robust, coherent, and focused on the problems of effectiveness at a broad scale? Over time, we could come up with any number of ideas to add to GiveWell’s list.
Doesn't EA Already Fund Research?
There are a number of cases where EA does indeed fund academic research on the effectiveness of interventions, such as GiveWell’s recent funding of this Michael Kremer et al. meta-analysis finding that water chlorination is a highly cost-effective way of reducing child mortality. GiveWell has written recently of its commitment to research on malnutrition and lead exposure, while OpenPhil has recently funded research on air quality sensors, Covid vaccines, a potential syphilis vaccine, etc. And I'm sure there are other examples I've missed.
But on a closer look, not much of this research is squarely within the realm of what I’m talking about – i.e., directly funding RCTs and program evaluations themselves as part of a broader and well-designed agenda.
For example, the Kremer et al. meta-analysis of water treatment hinged on 15 main program evaluations (see Table 1). As far as I can tell, none of them were funded by major EA initiatives or donors:
- The Haushofer et al. 2021 paper was funded by NIH, the Dioraphte Foundation, and Sint Antonius Stichting.
- The Dupas et al. 2021 paper was funded by Stichting Dioraphte and the Stanford Center for Innovation in Global Health.
- The Humphrey et al. 2019 paper was funded by Gates Foundation, UK Department for International Development, Wellcome Trust, Swiss Development Cooperation, UNICEF, and NIH.
- The Kirby et al. 2019 paper was funded by DelAgua Health Limited.
- The Null et al. 2018 paper was funded by the US Agency for International Development and the Gates Foundation.
- The Luby et al. 2018 paper was funded by the Gates Foundation.
- The Boisson et al. 2013 paper was funded by Program for Appropriate Technology in Health (PATH); United States Agency for International Development (USAID); Medentech, Ltd.; and Chemical Chlorine Association.
- The Peletz et al. 2012 paper was funded by Vestergaard-Frandsen SA and the United States National Science Foundation.
- The Kremer et al. 2011 paper was funded by Hewlett Foundation, USDA/Foreign Agricultural Service, International Child Support, Swedish International Development Agency, Finnish Fund for Local Cooperation in Kenya, google.org, the Bill and Melinda Gates Foundation, and the Sustainability Science Initiative at the Harvard Center for International Development.
- The other studies are from 2006 and before, when the EA movement didn’t really exist yet.
Instead, the term “research” in this case consisted of summarizing other people’s research, followed by inserting the effect sizes into cost models, etc.
Which is a fine and valuable activity! Indeed, I think that rigorous meta-analysis is one of the best things to perform and fund (while at the Arnold Foundation, I funded BITSS to work with Rachael Meager on her groundbreaking work in this area).
Again, though, it’s derivative of what everyone else chooses to fund. If we don’t fund enough underlying RCTs and evaluations, then we are assuming that we can mostly sit back and wait to see what emerges from Gates/WHO/etc. Then we have to assume that those studies are conducted with the right amount of rigor and the right amount of focus on cost and quality, etc.
But that isn’t happening very often (see above). Indeed, an EA leader told me in conversation that there are two problems with relying on the existing academic literature.
First, there is an imperfect overlap between the questions that applied researchers want to study, and the questions that EAs would want to answer.
Second, even when there’s overlap, the academic journal system usually doesn’t ask researchers to collect cost data, which means that if you want to know anything about cost-effectiveness, you’re left with trying to reconstruct those numbers after the fact.
This implies that there are many opportunities for EA to exploit weaknesses in the current academic system by funding a large number of R&D projects with an eye towards cost-effectiveness and scale.
What About Existing Research Agendas?
Aren’t there existing research agendas compiled by EAs? Absolutely. But the most thoughtful and thorough research agendas (e.g., here and here) are all about philosophy and long-termism. I haven’t yet found any similarly thorough research agenda for global health and development, economic advancement, etc. (check out the page for GiveWell's "research agenda" by comparison).
What about GiveWell’s incubation grants, of which there are several dozen listed here? Isn’t that what I’m looking for?
Yes and no. Some of the grants are exactly the sort of thing I would suggest, such as this grant to IDinsight, this grant to CEGA, this grant to Evidence Action, and this grant to conduct an RCT on cash incentives for vaccination. The recent $14m grant to Evidence Action (one of many to that organization) might also be quite within the spirit of what I would recommend, although to date that organization seems to be mostly an implementing organization for deworming and safe water initiatives.
Indeed, GiveWell says they are looking for “academic research to evaluate program evidence,” and “early-stage funding for a promising organization,” and “monitoring and evaluation of an existing organization.”
While it’s a bit subtle, notice what isn’t here: academic research that doesn’t just “evaluate” program evidence (often generated elsewhere), but an extensive R&D agenda to create new program evidence.
That might explain why much of the list seems to be focused on deworming, malaria, and malnutrition, with some grants to organizations focused on other assorted topics (lead exposure, suicide prevention, water treatment, syphilis screening, seasonal migration, and of course, the well-known study of face masks for preventing Covid).
Moreover, many of the grants, on further reading, aren’t necessarily about funding rigorous RCTs or other types of rigorous evaluation, but are about program support or technical assistance.
In sum, the list of incubation grants seems like a great start—but it would be even better if fleshed out into a coherent and extensive R&D agenda akin to the ones on long-termism.
An Interlude on Scale
As an interlude here: We talk about RCTs and RCT results all the time. Not nearly enough people to date have been talking about the problem of scale.
I don’t mean “what organization or government can deliver this intervention at a larger scale,” although that is part of it. Instead, I mean to ask a much broader question: how do we even know that what works in one or two studies will work when replicated at a larger scale?
Because you know what is disastrous? Some studies say that X works; everyone decides that X works; whole organizations spin up to do X and government agencies decide to do X; much money is spent; and then it turns out that there’s no way to make X work at scale no matter what you do. It isn’t replicable, or it isn’t even the type of idea that would ever work at scale.
One problem is that whether it’s economics, international development, education, or any other public policy issue, there are far too many small, ad hoc, one-off studies. Not enough chance for replicability, in other words.
But even if the studies are replicable, not enough ideas are scalable in the first place.
There are human capital effects—perhaps the small startup study was run with the best teachers/workers/etc. someone could recruit, but scaling up would inevitably mean that you get less-qualified people delivering the program. That’s just reality no matter what organization or government is involved.
A classic example is what happened with class size reduction. One fairly small experiment showed that reducing class size had amazingly positive effects, but when class size reduction was adopted by the state of California, a study showed that the “increase in the share of teachers with neither prior experience nor full certification dampened the benefits of smaller classes, particularly in schools with high shares of economically disadvantaged, minority students.”
In other words, putting students in small classes was a great idea in a small study, but when you try to reduce class size statewide (never mind nationwide), you end up having to hire so many teachers that their quality and experience go down, mostly or fully offsetting the benefits of smaller classes.
Then there are peer effects. Perhaps if you give a drug treatment program or a high school graduation program to just 10% of the local students, they get distracted by the other 90% of kids not in the program, but if you put everyone in the program, they would all reinforce each other’s decisions. In this case, that would mean that the small study in fact underestimates the program’s effect at a larger scale. So if we dismiss this type of program on the basis of small studies, we might be missing the boat.
Then there are general equilibrium effects. For example, job training programs often have positive effects in isolation. If you pick 100 or 200 people to give better training, they might do better than otherwise.
But can such programs work when scaled up? After all, if there are 100 welding jobs in a community (just to make up a hypothetical example), and if you train 100 people for those jobs, they might do well, but if you train 500 people for the 100 welding jobs, the training effect will necessarily dissipate.
Unfortunately, that’s what one study found in France, where the study team had the (rare!) opportunity to randomize not just who got access to the job training program, but how many people the program actually served across different communities. They found that the job training successes came mostly at the expense of the control group.
Discouraging news, but a highly valuable study because it pointed out the flaw in thinking that if job training works for a few people, it would work just as well when offered to everyone.
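To make the welding hypothetical above concrete, here is a minimal toy sketch. It assumes a fixed number of jobs and that trainees compete only with each other; the function name and the numbers are illustrative, not drawn from the French study, and it does not model displacement of untrained workers.

```python
def placement_rate(trained, jobs):
    """Share of trainees who can land one of the available jobs (toy model).

    Assumes the number of jobs is fixed and every trainee is equally
    likely to be hired, so the rate can never exceed 100%.
    """
    if trained <= 0:
        raise ValueError("trained must be positive")
    return min(1.0, jobs / trained)


# Hypothetical community with 100 welding jobs, as in the text above.
for trained in (100, 250, 500):
    rate = placement_rate(trained, jobs=100)
    print(f"{trained} trained -> {rate:.0%} can be placed")
```

Under these assumptions, going from 100 to 500 trainees cuts the best-case placement rate from 100% to 20%, which is the dissipation the hypothetical describes.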
Note: this isn’t a comprehensive look at the scaling issue. For more, see John List’s new book or his recent A16z interview (or his scholarly work on the issue here, here, and here). And my friends Mary Ann Bates and Rachel Glennerster (both formerly at J-PAL) wrote a great article about scaling and generalizability.
Maybe Research Isn't Worth It?
I’ve been critiquing short-termist EA for lacking a full-blown R&D agenda. But maybe that is unfair, as much research wouldn’t be a cost-effective way of donating money. After all, most ideas don’t work if rigorously evaluated. Maybe sponsoring research, if discounted by the probability of actually finding something that works, ends up not being anywhere near as good as giving to GiveDirectly.
That’s a great point. Let’s consider why it might be wrong.
First, GiveWell’s 2020 giving portfolio was ~$250M to interventions that presumably have around a 10x return over just giving cash (the GiveWell expectation is to find charities that are 5-15x as effective as cash). Imagine that with a $20 million investment in research, we could identify two new interventions that are 15x as effective as cash, rather than 5x, and therefore GiveWell could move donations to the 15x category rather than 5x. The additional social value from deploying, let's say, $100 million of GiveWell donations to those new interventions would be ($100m * 15) - ($100m * 5), or $1 billion. That would be a 50x return on the $20 million investment in research, and that's just counting one year's worth of giving!
Put another way, spending $20 million on research needs only a 2% chance of succeeding in order to have an expected value of paying for itself within the first year. Never mind future years (I don't want to bother with the effects of inflation, discount rates, etc.).
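The arithmetic in the two paragraphs above can be checked with a short sketch. The dollar figures and multipliers are the illustrative assumptions already stated in the text; the function and parameter names are hypothetical, and nothing here models discounting, inflation, or giving beyond the first year.

```python
def research_payoff(research_cost, redirected_giving, old_multiplier, new_multiplier):
    """Return (added_value, return_multiple, break_even_probability).

    Values are in cash-equivalent dollars: a multiplier of 5 means a
    donated dollar does as much good as 5 dollars of direct cash transfers.
    """
    # Extra value from moving the same donations to a better intervention.
    added_value = redirected_giving * (new_multiplier - old_multiplier)
    # How many times over the research cost is repaid if it succeeds.
    return_multiple = added_value / research_cost
    # Smallest success probability at which expected value covers the cost.
    break_even_probability = research_cost / added_value
    return added_value, return_multiple, break_even_probability


added, multiple, p_min = research_payoff(
    research_cost=20e6,
    redirected_giving=100e6,
    old_multiplier=5,
    new_multiplier=15,
)
print(f"added value: ${added:,.0f}")            # $1,000,000,000
print(f"return multiple: {multiple:.0f}x")      # 50x
print(f"break-even probability: {p_min:.0%}")   # 2%
```

Changing `new_multiplier` or `redirected_giving` shows how sensitive the break-even probability is to those guesses, which is the main reason to write the calculation out.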
Switch contexts to the United States, where the OpenPhil expectation is that after taking into account the diminishing marginal value of money, grants should have roughly a 1,000x payoff in order to be as good as GiveWell's recommendations. Imagine, if you will, that R&D could identify interventions that have a 2,000x return or even a 1,000,000x return (it's not hard to imagine that a 2017 grant to create a pan-coronavirus vaccine might have easily gotten such a return, if it had worked!).
The math is similar, but essentially a 2,000x return in the US is 2x better than GiveWell, and thus if $10 million in R&D found such a program/policy and then was used to drive $100 million in yearly giving, that would be an additional $190 million in benefits in the first year alone ($200m - $10m), for a 19x return on investment.
These numbers are obviously a bit arbitrary—as are all numbers about the expected value of future hypothetical interventions! But they nonetheless illustrate the fact that funding research on new interventions and policies might be highly cost-effective even if very few of the new ideas pan out.
[Caveat: I'm an advisor to the Social Science Research Council, which applied to OpenPhil for funding based on the above rationale. So take it with a huge grain of salt.]
Second, to be slightly less arbitrary: Michael Kremer et al. have a working paper from last year, in which they estimate the return on investing in “social science R&D,” based on data from USAID’s Development Innovation Ventures. They developed a model “for determining whether the return on an innovation portfolio exceeds a benchmark, such as the economy-wide return on capital or the opportunity cost of more conventional development assistance investments.”
Then, they applied this model to 41 R&D awards made between 2010 and 2012; all of the awards were for issues like safe water filters in Kenya, rural solar accessibility in Uganda, fighting tuberculosis in India, and the like.
It turned out that there was only enough data to estimate benefits from five of the innovations, but those alone “generated $281 million in social benefits,” compared to a $16 million cost for the entire portfolio. In other words, “setting aside any potential future benefits and any realized benefits of the other 36 innovations . . .benefits of these five innovations would have paid for the cost of the entire DIV portfolio at least 17 times over.”
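As a quick check on the "at least 17 times over" figure, the sketch below divides the measured benefits by the portfolio cost, using only the numbers quoted above. The constant and function names are hypothetical. Note that the 36 innovations without usable data are implicitly counted as zero benefit, so the ratio is a lower bound.

```python
# Figures quoted in the text from the Kremer et al. working paper on DIV.
PORTFOLIO_COST = 16e6       # cost of all 41 awards, in dollars
MEASURED_BENEFITS = 281e6   # benefits from the 5 innovations with usable data
TOTAL_AWARDS = 41
MEASURED_AWARDS = 5


def benefit_cost_ratio(benefits, cost):
    """Plain benefit-cost ratio: dollars of benefit per dollar spent."""
    return benefits / cost


lower_bound = benefit_cost_ratio(MEASURED_BENEFITS, PORTFOLIO_COST)
unmeasured = TOTAL_AWARDS - MEASURED_AWARDS

print(f"lower-bound benefit-cost ratio: {lower_bound:.1f}")  # 17.6
print(f"awards counted as zero benefit: {unmeasured}")       # 36
```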
This sort of benefit-cost ratio shouldn’t be a surprise. For example, as OpenPhil’s blog post observes, the Rockefeller Foundation saved “over a billion people from starvation” by employing Norman Borlaug as a plant scientist.
I could be missing something, but it seems like few (if any) EA organizations employ hard scientists who are directly working on issues like that. EA organizations are much more likely to employ scientists working on AGI issues that, while possibly important, are as yet quite speculative and sci-fi in nature. (Indeed, I suspect that whatever they are doing is not only hopeless, but might be as likely to cause future harm as future good: it's hard enough to measure the correct sign of a medical or health intervention delivered to people right in front of you, let alone the sign of the impact of today's AGI efforts on people in the far future.)
Third, no such cost-benefit analysis is available as to many long-termist research questions (e.g., “how much weight should we place on philosophical arguments” or whether to diversify across different worldviews). I doubt many (or any) of those questions would pass muster under a reasonable cost-benefit analysis—after all, any effects of most philosophizing would be exceedingly diffuse. It still may be worth researching those questions, but you can’t justify such research on the ground that it is highly likely to lead to direct donations that are more impactful than GiveWell’s current list.
Fourth, as Holden Karnofsky points out, there’s a strong case for funding high-risk, high-reward projects that are up to 90+% likely to fail: “hits-based giving.” My one disagreement is that he says this “calls for approaching our giving with some counterintuitive principles — principles that are very different from those underlying our work on GiveWell.”
I’d argue that if we invest in RCTs, evaluations, and other forms of R&D that have a tiny chance of finding a cost-effective treatment/program/innovation, that research investment would be a hits-based way of furthering GiveWell’s aims. Thus, there is no contradiction here. Indeed, funding a broad range of empirical evaluations might be the best example of “hits-based giving.” The result would be a much broader set of interventions, programs, policies, etc., that feed into GiveWell’s analyses.
Fifth, funding more R&D could have positive spillovers in ways far beyond the direct effect of the intervention or program in question. For example, the mere act of getting program implementers to think about R&D questions could help them improve the program in the future (such as better targeting or delivery). Moreover, a greater focus on research could, in theory, influence other actors (policymakers, grantors, etc.) to care more about having solid evidence, thus hastening the discontinuation of ineffective programs while increasing the support for effective programs.
In conclusion, insofar as the EA movement focuses on short-term economic and human development, it needs a more robust, thoughtful, and thorough R&D agenda with three stages. The agenda should start with a vast number of possible innovations that might need some help to spin up in the first place (albeit with an eye towards what would ever be possible to scale); it should fund a number of pilot experiments; and the agenda should then recommend how to fund RCTs and other evaluations to show a program’s cost-effectiveness at scale.
To be clear, I'm sure there are lots of research grants that I've missed! And I do see bits and pieces of such an R&D agenda right now—akin to seeing 50 or 100 pieces of a 1,000-piece puzzle.
A good start, but not enough.
Thanks to Kerry Vaughan and to an anonymous person for comments. They don't necessarily agree with anything I said.