Rethink Grants (RG) is an analysis-driven grant evaluation experiment by Rethink Priorities and Rethink Charity. In addition to estimating the expected costs and impacts of the proposed project, RG assists with planning, sourcing funding, facilitating networking opportunities, and other as-needed efforts traditionally subsumed under project incubation. We do not yet fund grants ourselves, but refer grants to other grantmakers within our networks who we have reason to believe would be interested.
Donational’s Corporate Ambassador Program
This report is our first published evaluation – an assessment of a new project proposed by Donational. The Donational platform more efficiently processes donations made through its partner organizations, and allows users to set up donation portfolios informed by the expertise of charity evaluators endorsed by the effective altruism community. Donational requested $100,000 to establish a Corporate Ambassador Program (CAP), which would recruit advocates for effective giving in US workplaces. These ‘ambassadors’ would encourage their colleagues to donate through the platform, thereby raising money for highly effective charities.
We evaluated CAP with reference to five criteria: a formal cost-effectiveness estimate (based on an Excel model), team strength, indirect benefits, indirect harms, and robustness to moral uncertainty. Each was given a qualitative assessment of Low, Medium, or High, corresponding to a numerical score of 1–3. The weighted average of these scores constituted the overall Project Potential Score, which formed the basis of our final decision.
Cost-effectiveness Estimate: Low (1)
The base case donation-cost ratio of around 2:1 is below the 3x return that we consider the approximate minimum for the project to be worthwhile, and far from the 10x or higher reported by comparable organizations. The results are sensitive to the number and size of pledges (recurring donations), and CAP's ability to retain both ambassadors and pledgers. Because of the high uncertainty, very rough value of information calculations suggest that the benefits of running a pilot study to further understand the impact of CAP would outweigh the costs by a large margin.
Team Strength: Medium (2)
Donational’s founder, Ian Yamey, is very capable and falls on the high end of this score. His track record suggests above-average competency in several dimensions of project implementation. While the planning process for the CAP presented some potential gaps in awareness, Yamey demonstrates an eagerness to take corrections and steadfast commitment to iterating his plans in search of the most effective version of the program.
Indirect Benefits: High (3)
We think there is a small-to-moderate chance that CAP would generate several very impactful indirect benefits. For example, the additional donations going to animal-focused charities may reduce the risk of global pandemics caused by antibiotic resistance, and the program may help create a broader culture of effective giving at US workplaces.
Indirect Harms: High (1)
We also think there is a small-to-moderate chance that CAP would indirectly cause or exacerbate a number of problems. For instance, charities that reduce poverty and disease may cause economic growth, which is likely to increase the number of animals raised in factory farms and could contribute to climate change and existential risks.
Robustness to Moral Uncertainty: Medium (2)
CAP is compatible with most worldviews, but there may be exceptions. For example, some socialists believe charity often does more harm than good by promoting individualistic norms.
Project Potential Score: Low (1.27)
RG team members gave the vast majority of the weight to the cost-effectiveness estimate, leading to an overall Project Potential Score of 1.27. This falls clearly into the Low category.
After concluding the evaluation, we have decided not to recommend funding for a full-scale CAP at this time. This is based heavily on our cost-effectiveness analysis, which suggests the program is unlikely to be worthwhile at any reasonable cost-effectiveness threshold, at least in its proposed form. However, we have recommended funding of up to $40,000 to run a pilot study based on three primary considerations: (i) concern that the cost-effectiveness analysis may underestimate CAP’s promisingness relative to comparable programs; (ii) the potentially high value of information from running a pilot; and (iii) the relatively low risk involved, given that we expect the pilot’s costs to be lower than the volume of donations generated.
The future of Rethink Grants
Any further evaluations conducted by RG will not necessarily involve the same process or level of rigor as seen in this report. This evaluation was an experiment that involved frontloading many one-time costs, such as creating the evaluation framework; however, we also recognise important shortcomings of our methodology, and are aware that evaluations of early-stage projects in this depth may not always be advisable, depending on factors such as the grant size, feasibility of running a low-cost pilot study, and availability of relevant data. Should RG continue, future evaluations will incorporate a number of important lessons learned from this experience.
Introducing Rethink Grants
Rethink Grants (RG) is a collaboration between the Rethink Priorities research team and the Rethink Charity senior leadership. RG works individually with project leads to produce tailored evaluations of grant proposals, and refers those projects to grantmakers within our networks when we believe they merit funding. In addition, we assist in early-stage planning, facilitating networking opportunities, and other as-needed efforts traditionally subsumed under project incubation. RG’s single most important value add is our uniquely thorough and personalized approach.
Our principal aim is to help raise the quality of funded projects within effective altruism and adjacent world-improvement domains. The RG process signal-boosts their potential value through formal recommendations made on the strength of our in-depth analysis.
Below, we discuss our principles and process in more detail, and then report on our first grant evaluation – an assessment of a new workplace giving program proposed by Donational.
We conduct and publish a detailed, transparent evaluation of every potential grant recommendation.
We base decisions primarily on cost-effectiveness, while recognizing that formal expected value estimates should not be taken literally.
We take a growth approach to evaluating projects, considering not just the impact of a marginal dollar but the potential cost-effectiveness at scale.
We see grants as experiments. Following our thorough assessment, we err on the side of cautiously testing promising ideas. We then help set up criteria for success and failure, and renew the grant recommendation if the approach shows promise (or there is a promising pivot).
Beyond just evaluating grant opportunities, we want to help prospective grantees improve. To do this, we offer them access to our research, general support, and detailed feedback.
Because we have to make granting decisions under significant moral uncertainty, we aim to practice worldview diversification.
Rethink Grants will begin with an in-network approach to sourcing projects, relying on trusted referrals to help us reach out to promising individuals and organizations. If RG continues to conduct evaluations, we will then consider projects on a rolling basis. A project that seems potentially cost-effective, is run by a high-quality team, and has room for more funding moves forward through our evaluation process.
The potential of a given project is normally assessed using both quantitative and qualitative estimates of cost-effectiveness and overall impact. To a large extent, evaluations are tailored to the individual proposal, but the following criteria are used in most cases:
- Cost-effectiveness Estimate – Based on a formal cost-effectiveness analysis, how much impact do we expect it to create per dollar spent?
- Team Strength – How effectively will the organization be able to implement and grow the project?
- Indirect Benefits – How large might the indirect benefits of the project be?
- Indirect Harms – How large might the indirect harms of the project be?
- Robustness to Moral Uncertainty – Might the project cause terrible harms from the perspective of different worldviews?
To understand how a particular project fares against these criteria, we aim to spend between 6 and 10 weeks gathering information relevant to the project through literature review, along with conversations with the organization’s team, subject-area experts, and potential funders.
The RG team that conducted this evaluation comprised Derek Foster, Luisa Rodriguez, Tee Barnett, Marcus A. Davis, Peter Hurford, and David Moss. In future evaluations, contributing team members may vary.
Launched in 2018 by Ian Yamey, Donational is a user-friendly online platform that aims to empower individuals to make their donation dollars do as much good as possible. It does this in two main ways: by more efficiently processing donations obtained through partner organizations, and by guiding donors towards more effective giving opportunities.
About 75% of Donational’s users to date have come through its partner One for the World (OFTW), which encourages students to commit to donating 1% of their pre-tax income upon graduation. OFTW’s portfolio of recommended charities currently comprises 16 of GiveWell’s recommended and “standout” charities, five of which are designated “Top Picks” by OFTW. For those users, OFTW had presumably already done the work of convincing them to change their donation habits, but some useful features available on the Donational platform multiply those benefits.
- It charges a lower fee (2%) than any comparable organization for disbursing the donations to the charities.
- It provides donation pages tailored to each school, which allows OFTW to set the default pledge size to 1% of the average graduate income for that particular institution.
- It enables pledges more than one year in advance.
- It automatically updates the users’ credit card information when a new card is issued. According to OFTW, this keeps recurring donors from lapsing in cases where they might have otherwise forgotten to update their details on the platform.
About a quarter of users find Donational through other means, such as web searches. The platform uses a chat bot to ask these visitors a set of basic questions about their values, giving history, and giving goals, and the answers are used to design a customized giving portfolio, normally based around charities suggested by Donational. Users can then learn more about the recommended charities, and adjust the allocation of their portfolios accordingly. During the process, Donational also ‘nudges’ users to set up recurring (rather than one-off) donations, and to give a higher percentage of their income.
The recommended charities currently encompass a broad range of causes, including global health, developing world poverty, US social justice issues, animal welfare, environmental protection, and climate change – and users can add any other charity that is registered in the US as a 501(c)(3) organization. However, to maximise impact, Yamey has agreed in future to limit the selection to those recommended by GiveWell and Animal Charity Evaluators, plus one US criminal justice organization suggested by relevant staff at the Open Philanthropy Project.
Costs and revenue
In addition to the opportunity costs associated with the two days that Yamey spends on Donational each week, the project operated on a budget of $32,909 in 2018. However, it also earns revenue by charging a 2% fee on all donations processed through the platform. Once the students who have taken the OFTW pledge begin to graduate, Yamey predicts around $1.5 million per year will be disbursed, making the project roughly cost-neutral.
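The cost-neutrality claim can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and it assumes (which the report does not state) that the 2018 budget is a reasonable proxy for future annual costs. The opportunity cost of Yamey's time is left out.

```python
# Figures quoted in the text; variable names are illustrative.
annual_budget_2018 = 32_909       # USD, Donational's 2018 operating budget
fee_rate = 0.02                   # 2% fee on donations processed
predicted_disbursed = 1_500_000   # USD per year, Yamey's prediction

# Revenue from the processing fee at the predicted donation volume.
fee_revenue = fee_rate * predicted_disbursed

# Positive means a surplus, negative a shortfall (assumes costs stay at the 2018 level).
net_position = fee_revenue - annual_budget_2018

print(f"Fee revenue:  ${fee_revenue:,.0f}")
print(f"Net position: ${net_position:,.0f}")
```

On these figures the fee would bring in about $30,000 a year, leaving a shortfall of a little under $3,000 against the 2018 budget, which is consistent with "roughly cost-neutral".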
The Corporate Ambassador Program
While continuing its partnerships and remaining available for individuals who find the website independently, Donational is considering launching a Corporate Ambassador Program (CAP). The plan is to recruit volunteers at companies, who would then encourage their colleagues to donate through Donational. These ‘ambassadors’ would be given resources and training, enabling them to more effectively pitch the idea of high-impact giving to their coworkers. If the program is successful, its direct impacts would be threefold:
- Donational hopes its users will donate more regularly than they otherwise would have.
- Donational hopes its users will donate larger sums each time than they otherwise would have.
- Donational hopes people will donate to more impactful charities than they otherwise would have.
If the program is successful at a large scale, there may be additional benefits that are less direct. In particular, Donational hopes to contribute to a culture shift in workplaces, helping effective giving to become the norm rather than the exception.
The remainder of this report evaluates CAP against our five criteria, beginning with a cost-effectiveness estimate.
by Derek Foster
Rethink Grants explicitly models the costs and consequences of the proposed project to generate a cost-effectiveness estimate (CEE). Our approach to cost-effectiveness analysis has a number of notable features:
- The analysis is as transparent as possible without sacrificing too much precision. Depending on the nature of the project, the time available, and the type of analysis required, the model may be created in Guesstimate, Google Sheets, Excel, or R. The model structure and individual parameters are clearly described and justified, and its limitations are highlighted and discussed.
- In line with the growth approach, we aim to assess the project’s long-run cost-effectiveness, not just the impact of a marginal dollar. This normally involves estimating the costs and consequences of the project over different time periods and at different scales, taking into account the probability of reaching (or duration at) each scale.
- Potential biases, such as the planning fallacy, are explicitly considered and factored into the analysis where possible.
- Model parameters often involve considerable subjective judgement. Where the CEE is likely to be sensitive to this, we may take a weighted average of probability distributions elicited from multiple RG team members and/or relevant experts. The model is also designed so that users can easily replace the team’s inputs with their own assumptions.
- Where feasible, the primary analysis is probabilistic, taking into account uncertainty around all the parameters. This usually produces a more accurate CEE than a deterministic model, and enables more informative analyses of decision uncertainty (Claxton, 2008).
- A range of methods are used to characterise uncertainty. These may include confidence intervals, cost-effectiveness planes, cost-effectiveness acceptability curves and frontiers, one-way and multi-way deterministic sensitivity analyses, and assessments of heterogeneity (see Briggs et al., 2012 for an overview). We may also estimate the value of gathering additional information, to determine whether it would be worth conducting further research before making a final decision (Wilson et al., 2014).
- Discount rates may be applied to both costs and outcomes. The appropriate figures will vary among projects but – in line with the philosophical consensus – the rate for outcomes does not give any weight to pure time preference (the idea that benefits are worth less in the future, just because they’re in the future).
While assessing cost-effectiveness is the primary goal of the evaluation, we do not take expected value estimates literally. To avoid the illusion of high precision, we therefore also rate cost-effectiveness more subjectively as Low, Medium, or High. This is done with reference to ‘best buys’ in the same cause area, rather than by trying to use one outcome metric for a broad range of interventions. As well as enabling comparison with other criteria, this approach is in line with our efforts towards worldview diversification.
The application of our methods to Donational’s Corporate Ambassador Program is described in detail below.
We constructed a mathematical model to estimate the expected costs (in US dollars), consequences, and cost-effectiveness of CAP. We excluded indirect effects, moral uncertainty, and team strength from the final model due to the difficulty of making meaningful quantitative estimates; these are addressed more subjectively as separate criteria. This section outlines some key methodological choices, the model structure, our methods for estimating parameter inputs, and the sensitivity analyses we carried out on the results.
We created the model in Microsoft Excel as it seemed like the best compromise between transparency and functionality. Google Sheets is a little more accessible for end users, but complex modelling can be trickier in Sheets, especially when macros are needed. R is powerful but the primary modeller was not proficient in it, and the calculations can be harder to examine for those unfamiliar with the language. Guesstimate is more convenient for some kinds of straightforward probabilistic estimates, but it lacks some key features (such as charts) that are necessary for important sensitivity analyses, and can run into difficulty when the distribution of costs and/or effects includes negative numbers.
While it is the effect on overall wellbeing that ultimately matters, it was not feasible in the time available to convert the outcomes of a diverse group of charities into one common metric. Instead, the measure of benefit was donations to CAP-recommended charities, adjusted for counterfactual impact.
The primary outcome measure, which constitutes our CEE, was the donation-cost ratio (DCR), i.e. the number of (time discounted) dollars donated per (time discounted) dollar spent on the program. The CEE could also be expressed as a cost-donation ratio (CDR), which is more similar in form to widely-used cost-effectiveness ratios (such as dollars per life saved) in that it divides the costs by the outcomes. However, the CDR’s interpretation is less intuitive for this kind of project, e.g. a return of $7 million for an expenditure of $1 million implies a DCR of 7 but a CDR of approximately 0.143.
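The relationship between the two ratios can be sketched as follows. The function names are illustrative rather than taken from the Excel model, and both inputs are assumed to be already time discounted.

```python
def donation_cost_ratio(discounted_donations: float, discounted_costs: float) -> float:
    """Dollars donated per dollar spent on the program."""
    if discounted_costs <= 0:
        raise ValueError("discounted_costs must be positive")
    return discounted_donations / discounted_costs


def cost_donation_ratio(discounted_donations: float, discounted_costs: float) -> float:
    """Dollars spent on the program per dollar donated (the inverse of the DCR)."""
    if discounted_donations <= 0:
        raise ValueError("discounted_donations must be positive")
    return discounted_costs / discounted_donations


# The example from the text: $7 million donated for $1 million spent.
dcr = donation_cost_ratio(7_000_000, 1_000_000)
cdr = cost_donation_ratio(7_000_000, 1_000_000)
print(dcr, round(cdr, 3))
```

Because the CDR is the reciprocal of the DCR, a better program has a higher DCR but a lower CDR, which is part of why the DCR is easier to read for a fundraising project.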
Note that the DCR is not the same as the benefit-cost ratio widely used in economics, which puts costs and effects in the same (monetary) units. Unlike in a benefit-cost ratio, a dollar spent on the program may have a different opportunity cost (benefit foregone) than a dollar donated – expenditure may generate either more or less value than equivalent donations. Nor is it quite the same as a return on investment, which is typically based on absolute revenue rather than the (discounted) net present value of investments. It is equivalent to what One for the World, The Life You Can Save (TLYCS), and Giving What We Can (GWWC) call their “leverage ratio”, although there are some differences in the methodology used to estimate it.
For the base case (primary) analysis, cost-effectiveness was assessed by program year (see the C-E by Program Year worksheet). In other words, all costs and donations were attributed to the year the ambassadors that caused them were recruited, rather than the year they took place. For example, the present-discounted lifetime value of a pledge (Value of Pledge worksheet) taken in year 1 was considered a year 1 outcome, even though a large proportion of the donations would not be received until much later. This seemed to provide the most relevant information, since we assume funders would be more interested in how much value would be created by funding the program for a certain period, rather than when the impact would be realized. However, we also calculated the absolute volume of donations processed in each year, primarily to help Yamey with planning, and the cost-effectiveness by year of disbursement (Disbursements worksheet).
The comparator for the sake of this analysis is implicitly ‘Do Nothing’, which we assume has no costs or consequences. Ideally, we would have compared CAP directly to one or more alternative projects and calculated the incremental DCR (the difference in donations divided by the difference in costs). For example, if CAP has expected donations of $10 million and costs of $2 million, and the alternative has donations of $8 million and costs of $1 million, the DCR for CAP would be ($10M / $2M) = 5, but the incremental DCR would be ($10M - $8M) / ($2M - $1M) = $2M / $1M = 2. So long as the alternative was a viable option, the relevant figure would have been 2x, which reflects the return achieved compared to what would otherwise happen (the counterfactual). However, after discussions with relevant organizations, it was unclear whether another similar program would be run, when it would begin, whether CAP would displace that program (rather than run alongside it), who would fund it, and how costly and effective it would be in comparison. We therefore decided to disregard potential alternatives in the main analysis – though they have influenced our cost-effectiveness threshold and are discussed later in this report.
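The incremental calculation in the example above can be written as a small helper. This is an illustrative sketch, not part of the Excel model; with a 'Do Nothing' comparator (zero donations and zero costs) it reduces to the ordinary DCR.

```python
def incremental_dcr(donations: float, costs: float,
                    alt_donations: float = 0.0, alt_costs: float = 0.0) -> float:
    """Difference in donations divided by difference in costs, relative to an alternative.

    The defaults represent a 'Do Nothing' comparator with no costs or consequences.
    """
    cost_difference = costs - alt_costs
    if cost_difference == 0:
        raise ValueError("the two options have identical costs; the ratio is undefined")
    return (donations - alt_donations) / cost_difference


# Figures from the worked example in the text.
versus_nothing = incremental_dcr(10_000_000, 2_000_000)
versus_alternative = incremental_dcr(10_000_000, 2_000_000, 8_000_000, 1_000_000)
print(versus_nothing, versus_alternative)
```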
We evaluated CAP with reference to the minimum acceptable donation-cost ratio (minDCR). Cost-effectiveness thresholds like this should ideally be based on the opportunity cost of carrying out the program, which depends on how those resources would otherwise be spent. For instance, marginal governmental health spending in India averts a disability-adjusted life-year for around $300 (Peasgood, Foster, & Dolan, 2019, p. 35) so spending more than this from a fixed government budget is likely to cause more harm than good. However, the opportunity cost is very hard to estimate in the case of CAP, since we are not certain of who the funder would be, or how the funds would otherwise be used. We therefore compared the outcomes to three potential thresholds:
- 1x (meaning a minDCR of 1, i.e. a dollar donated for every dollar spent on the program) may be considered an absolute lower bound, since a lower ratio implies that it would be better to donate the money directly to the charities.
- 3x is approximately the return that both Yamey and the Rethink Grants team consider the minimum to make the project worthwhile, and is therefore the primary reference point.
- 10x is roughly in line with cost-effectiveness estimates by the most comparable existing programs, OFTW (which kindly provided access to its internal CEA) and TLYCS. GWWC claims a “realistic” leverage ratio of more than 100:1, but a recent analysis by Rethink Priorities casts doubt on the estimate. GWWC is also of a substantially different nature from CAP, TLYCS, and OFTW in that it primarily targets a small number of committed effective altruists rather than larger numbers of ‘ordinary’ donors.
Results are presented for three time horizons: 3, 10, and 20 years. We chose 10 years for the base case because it seemed like a reasonable compromise between recognizing long-term potential and being able to make meaningful predictions. Results for other horizons can easily be obtained using the “User input” cell next to the Horizon parameter (#42 in the Parameters worksheet).
By default, the model includes a pilot year in the main results. Preliminary ‘back of the envelope’ calculations early in the evaluation process suggested that the DCR would not be high or certain enough to justify large-scale funding from the start, so we switched to considering funding for a pilot study. The pilot period is considered ‘year 0’ so a 3-year horizon actually covers 4 years (years 0, 1, 2, and 3), a 10-year horizon 11 years, and so on. We felt this was appropriate as the pilot costs and outcomes contribute to its expected value. However, the pilot year can easily be excluded from the analysis using the switch on the right of the Main Results worksheet.
Similarly, the probability of CAP progressing beyond the pilot study (parameter #41) can be switched off. It does not affect the DCR as the costs and donations are multiplied by the same number, but it does affect the total expected costs and impact. These switches also make the model easier to update after the pilot study (should one take place).
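A minimal sketch of these two switches is given below. The figures are hypothetical, the function names are illustrative, and the scaling is applied only to post-pilot totals; how the spreadsheet treats pilot-year figures is not reproduced here.

```python
def dcr(donations: float, costs: float) -> float:
    """Donation-cost ratio."""
    return donations / costs


def scale_by_progression(donations: float, costs: float, p_progress: float) -> tuple[float, float]:
    """Multiply post-pilot donations and costs by the probability of progressing beyond the pilot."""
    return donations * p_progress, costs * p_progress


def years_covered(horizon: int, include_pilot: bool = True) -> int:
    """Number of years in the analysis; the pilot counts as an extra 'year 0'."""
    return horizon + 1 if include_pilot else horizon


# Hypothetical post-pilot totals and progression probability.
post_pilot_donations = 600_000.0
post_pilot_costs = 300_000.0
p_progress = 0.5

expected_donations, expected_costs = scale_by_progression(
    post_pilot_donations, post_pilot_costs, p_progress
)
unscaled_dcr = dcr(post_pilot_donations, post_pilot_costs)
scaled_dcr = dcr(expected_donations, expected_costs)

print(unscaled_dcr, scaled_dcr)          # the ratio is unchanged
print(expected_donations, expected_costs)  # the expected totals are halved
print(years_covered(3), years_covered(10))
```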
The structure of the model was based heavily on Yamey’s description of the intended program, but also took into account information from team members and effective altruists with relevant experience, such as those who had engaged in workplace fundraising. The parameters can be grouped into ones relating to impact, costs, and both concurrently.
In the broadest terms, the effectiveness of the project was considered a function of the number of ambassadors, the number of donors each ambassador managed to recruit, and the size of the donations.
To reflect different potential growth trajectories, ambassador numbers were only directly estimated for the pilot study (Parameter #1) and year 1 of the full program (#2), with subsequent years’ numbers obtained using an annual (linear) growth rate (#3) and a maximum scale (#4), both measured in terms of the number of ambassadors recruited. The program was assumed to remain at that scale indefinitely once reached. This structure was informed by Yamey’s belief – which we shared – that, at some point, there would be diminishing returns to scale. This could occur because recruiting ambassadors would become more difficult once the ‘low-hanging fruit’ (such as personal contacts in companies with a culture receptive to effective giving) have been exhausted, and because organizations may be more difficult to manage beyond a certain size. Ambassadors were assumed to remain active for a maximum of two years, after which we thought most donor recruitment opportunities would have already been taken. A composite parameter, the number of ambassador-years (i.e. average years of active donor recruitment by one ambassador), was calculated based on the ambassador “churn” (non-participation) in each year (#5 and #6).
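The growth structure described above can be sketched as follows. All input values are hypothetical, and the ambassador-years formula rests on an assumption: that each churn figure is the share of recruited ambassadors not participating in that year, rather than a rate conditional on surviving the previous year. The Excel model may define it differently.

```python
def ambassadors_recruited(year: int, year1: int, annual_growth: int, max_scale: int) -> int:
    """Ambassadors recruited in a program year (year >= 1): linear growth, capped at max_scale."""
    if year < 1:
        raise ValueError("year must be 1 or later; the pilot (year 0) is estimated directly")
    return min(year1 + annual_growth * (year - 1), max_scale)


def ambassador_years(churn_year1: float, churn_year2: float) -> float:
    """Average years of active recruitment per ambassador over the two-year maximum.

    Assumption: each churn value is the unconditional share not participating in that year.
    """
    return (1 - churn_year1) + (1 - churn_year2)


# Hypothetical inputs: 20 ambassadors in year 1, 10 more each year, capped at 50.
trajectory = [ambassadors_recruited(y, year1=20, annual_growth=10, max_scale=50)
              for y in range(1, 6)]
print(trajectory)

# Hypothetical churn of 25% in the first year and 50% in the second.
print(ambassador_years(0.25, 0.5))
```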
Donors were divided into “pledgers” who commit to recurring payments, and “one-time donors”. The number of each type recruited per ambassador-year was estimated directly (#7 and #8), along with the donor churn – the proportion of the value of pledged donations that are not received, due primarily to pledgers cancelling payments. Churn was estimated separately for the first year after taking the pledge (#9) and subsequent years (#10), as evidence from TLYCS (provided by email) and OFTW suggests attrition would be highest soon after the pledge becomes active.
The size of donations was also estimated separately for the one-off (#11) and recurring (#12) donations, since they are likely to be different. They were then adjusted for ‘funging’ – roughly speaking, the displacement of funds from other sources. “EA funging” (#13) is the proportion of the value of the donations that would have been received by EA charities in the absence of CAP. This includes ‘direct’ funging: any donations made by people recruited through CAP that they would have made to effective altruist causes anyway, e.g. because they would have taken another EA pledge (TLYCS, OFTW, GWWC) or found EA organizations through other channels. It also covers ‘indirect’ funging: the proportion of the impact of CAP donations that would have been achieved anyway, e.g. because large funders would have (partially) filled the funding gap of CAP-recommended charities, and with a smaller opportunity cost than CAP donations. In particular, Good Ventures regularly grants to GiveWell top charities based in part on the size of their funding gaps, which would be smaller in a world with CAP. There are reasons to believe Good Ventures’ giving is primarily constrained by the availability of research into giving opportunities, so the main effect of their giving less to GiveWell charities would be to hold on to the money for several years or decades, at which point we might expect there to be less impact per dollar. However, Good Ventures does not fill the funding gaps entirely, and giving later would presumably still have some benefit, so much less than 100% of a CAP donation is ‘funged’.
“Non-EA funging” (#14) is the proportion of the remaining donations that would have gone to charities not currently recommended by EA organizations had they not been given via CAP. For example, a donor may cancel their monthly donations to Oxfam and give (some of) the money to the Against Malaria Foundation instead; or, had they not been exposed to CAP, they might have started giving previously undonated money to Oxfam.
With the assistance of a further parameter, the value of non-EA donations relative to EA donations (#15), these are used to construct two “adjusted mean donation” composite parameters – one each for one-time and recurring donations – that are used in the rest of the model. These represent the counterfactual impact of donations better than the absolute donation size.
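One plausible way to combine the three funging parameters into an adjusted mean donation is sketched below. The formula is an assumption based on the verbal definitions above, not a transcription of the Excel model, and the input values are hypothetical.

```python
def adjusted_mean_donation(mean_donation: float, ea_funging: float,
                           non_ea_funging: float, non_ea_relative_value: float) -> float:
    """Counterfactually adjusted donation size (assumed formula).

    ea_funging: share of the donation's value EA charities would have received anyway (#13).
    non_ea_funging: share of the remainder that would have gone to non-EA charities (#14).
    non_ea_relative_value: value of a non-EA dollar relative to an EA dollar (#15).
    """
    # Remove the part that EA charities would have received without CAP.
    remaining = mean_donation * (1 - ea_funging)
    # Value lost elsewhere because some of the remainder was diverted from non-EA charities.
    displaced_value = remaining * non_ea_funging * non_ea_relative_value
    return remaining - displaced_value


# Hypothetical inputs: a $1,000 donation, 25% EA funging, 50% non-EA funging,
# and non-EA donations valued at 10% of EA donations.
adjusted = adjusted_mean_donation(1_000, 0.25, 0.5, 0.1)
print(round(adjusted, 2))
```

Under these hypothetical inputs, $750 remains after EA funging, and $37.50 of value is subtracted for displaced non-EA giving, leaving an adjusted donation of $712.50.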
The net cost of a program is a function of expenditure and revenue.
Labor dominated the cost parameters. First, above a certain number of ambassadors (#18), ambassador managers would be required. In addition to their annual salary (#16), the number of managers was estimated based on the number of ambassadors the team believed each manager could handle (including recruitment, training, and support while active) (#17). Second, we estimated the scale above which a part-time (#20) and a full-time (#21) chief operating officer (COO) would be needed to lead the day-to-day activities of the program, along with their salary (#19). Third, the cost of hiring a software developer to process donations was based on an expected hourly rate (#22) and the volume of donations processed by the platform (#23). (The software developer would not process the donations directly, but Yamey believes that the work required – such as supporting user accounts – scales roughly with donation volume.) Fourth, Yamey thinks that a second software developer (#24), to work on infrastructure, such as tools for ambassadors to communicate with each other, would be needed above a certain scale (#25). Fifth, Yamey also thought a part-time marketer (#26) would be needed after reaching a certain threshold (#27) – though not during a pilot study of any size – and a full-time marketer (#28) at a larger scale. Sixth, after the pilot year, Yamey would also like to hire a contractor to do graphic design (#29), with the number of hours depending on the number of ambassadors (#30 and #31).
There are three non-labor expense parameters. Beyond the pilot, CAP will have to pay for web hosting and information technology-related expenses (#32). Based on his previous experience, Yamey thinks a multiple of the square root of donations is the best way to capture the returns to scale for this item. Ambassadors may also incur marketing and travel costs (#33), though we assumed there would be a discount above 400 ambassadors (#34) – a fairly arbitrary figure provided by Yamey. Finally, we assumed miscellaneous costs (#35) as a proportion of all other costs combined, as this seems to be standard in program budgeting.
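Two of these non-labor items can be sketched in a few lines. The multiplier, rates, and function names are hypothetical; the ambassador expense discount is omitted because the text does not say how it is applied above the 400-ambassador threshold.

```python
import math


def hosting_cost(donation_volume: float, multiplier: float) -> float:
    """Web hosting and IT expenses as a multiple of the square root of donations."""
    if donation_volume < 0:
        raise ValueError("donation_volume cannot be negative")
    return multiplier * math.sqrt(donation_volume)


def total_with_miscellaneous(other_costs: float, misc_rate: float) -> float:
    """Add miscellaneous costs, taken as a proportion of all other costs combined."""
    return other_costs * (1 + misc_rate)


# With a hypothetical multiplier of 2, quadrupling donations only doubles hosting costs.
small = hosting_cost(1_000_000, multiplier=2.0)
large = hosting_cost(4_000_000, multiplier=2.0)
print(small, large)

# Hypothetical 10% miscellaneous allowance on $10,000 of other costs.
print(round(total_with_miscellaneous(10_000, 0.10), 2))
```

The square-root form captures returns to scale: costs still rise with donation volume, but less than proportionally.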
Donational currently charges a fee of 2% to process donations. Yamey intends to apply this to donations through CAP as well (#36), in order to partly offset the costs of running the program, though we did not assume the rate would necessarily remain at 2%.
Inflation and discounting
An annual inflation rate (#37) was applied to both costs and donations. This captures the tendency for costs to rise over time. It is also applied to donations in this model, since we might expect the salaries of donors, and therefore donation size, to increase at roughly the same rate.
The choice of discount rates was much more complicated, and different rates were required for different parts of the model. However, the team provided separate rates for three other categories of reasons for discounting, based in part on analyses by GiveWell staff (James Snowden, Caitlin McGugan, Emma Trefethen, and Josh Rosenberg).
- Improving circumstances and reinvestment (#38). It is widely believed that spending now generally creates more benefit than spending the same amount later (diminishing marginal utility of consumption), as people tend to be getting richer, healthier, etc. Moreover, beneficiaries can make capital investments that grow over time, which makes donations more valuable sooner than later. While these two processes are independent, they were combined into one rate as they both reflect the ‘real’ value (opportunity cost) of a given cost or donation, as distinct from the probability of that cost or donation occurring.
- Background uncertainty (#39). Roughly speaking, this represents the risk of the program closing down due to a catastrophic event, such as a natural disaster, rapid technological advance, or economic collapse. More precisely, it is the annual expected proportion of costs and outcomes that are not counterfactually caused by CAP due to factors not directly related to the program.
- Program uncertainty (#40). This covers the probability of CAP closing down due to factors other than the ‘background’ risks mentioned above. Example reasons include poor outcomes, failure to obtain funding, legal issues, and internal conflict. Note that this does not include the additional uncertainty of the pilot year, which has its own parameter, i.e. this is the probability of CAP failing each year given that it has progressed beyond the pilot.
These were combined into three composite parameters for use in different parts of the model:
- (#38 + #39): Annual discount rate for value of a pledge. The lifetime value of a pledge (see the “Value of pledge” worksheet) would be affected by changes in the value of money over time and extreme events that disrupt (or render obsolete) the donations, but we assume that there would remain a mechanism for collecting and disbursing the pledged funds even if the program shut down for internal reasons.
- (#38 + #39 + #40): Annual discount rate for value of the program. The expected value of the CAP program in any given year, as estimated in the C-E by Program Year worksheet, is influenced by all three factors.
- (#39 + #40): Annual discount rate for value of disbursements. When predicting the absolute number of dollars processed by Donational due to CAP, which is useful for planning purposes, only the probability of those payments happening is relevant. However, note that the Disbursements worksheet also provides estimates of the value of disbursements each year, which use the discount rate for the value of the program, and those figures are the basis for the cost-effectiveness estimates by year of disbursement given in the Main Results worksheet.
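The way the three component rates combine can be sketched in a few lines of Python. This is a minimal illustration, not the Excel model: the rate values are placeholders chosen for readability, and the compound discount factor is an assumption about how a composite rate is applied to a given year.

```python
# Placeholder rates (hypothetical values, not the elicited parameter values).
IMPROVING_CIRCUMSTANCES = 0.04  # stands in for parameter #38
BACKGROUND_UNCERTAINTY = 0.01   # stands in for parameter #39
PROGRAM_UNCERTAINTY = 0.10      # stands in for parameter #40


def composite_rates(r38, r39, r40):
    """Add the component rates together for each part of the model."""
    return {
        "pledge_value": r38 + r39,         # value of a pledge
        "program_value": r38 + r39 + r40,  # value of the program
        "disbursements": r39 + r40,        # absolute dollars disbursed
    }


def discount_factor(rate, year):
    """Compound discounting (assumed form): worth of $1 received in `year`."""
    return 1.0 / (1.0 + rate) ** year


rates = composite_rates(
    IMPROVING_CIRCUMSTANCES, BACKGROUND_UNCERTAINTY, PROGRAM_UNCERTAINTY
)
for name, rate in rates.items():
    print(f"{name}: rate={rate:.2%}, year-5 factor={discount_factor(rate, 5):.3f}")
```

Keeping the components separate makes it easy to see which risks are counted where: program uncertainty is deliberately left out of the pledge rate because pledged funds are assumed to keep flowing even if the program closes for internal reasons.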
The procedure developed for obtaining parameter values tried to strike a balance between rigor and practicality. As well as getting estimates (usually a best guess plus 90% confidence interval [CI]) from Yamey, we requested informal guesstimates on key parameters from several anonymous individuals with relevant knowledge and experience. We also examined relevant information, such as OFTW’s 6-month update and additional data kindly provided by the OFTW team; TLYCS’s annual reports and email discussions with its senior staff; GWWC’s impact assessment; and general information found online.
However, parameter values were ultimately obtained by eliciting and aggregating probability distributions from the six Rethink Grants team members (TMs). The two exceptions to this were the number of ambassadors in the pilot study (#1), which we decided by consensus after taking into account preliminary results, and the time horizon (#42), as we had decided it would be more informative to present results for multiple horizons. The process was based loosely on the Delphi method but also drew heavily on materials and software developed for use with the Sheffield Elicitation Framework (SHELF). Detailed protocols for these (plus a more Bayesian approach called Cooke’s method) are provided in a report for the European Food Safety Authority (EFSA, 2014). However, with 40 parameters to estimate, conflicting schedules, and many other obligations, it was not possible for our team to follow either of them in full. For example, SHELF requires all ‘experts’ (in this case TMs) to undergo calibration training then gather together in a multi-hour (often multi-day) workshop to produce a ‘consensus distribution’ for each parameter.
The elicitation process obtained five values from each of the six TMs for each parameter, in the following order:
- Lower plausible limit (L). The TM was almost certain the true value lay above this quantity (less than a 1 in 1,000 chance it was lower).
- Upper plausible limit (U). The TM was almost certain the true value was lower (less than a 1 in 1,000 chance it was higher).
- Median (M). The TM thought there was an equal chance it was higher or lower than this value.
- 5th percentile (pc5). The TM believed there was a 1 in 20 chance it was lower.
- 95th percentile (pc95). The TM believed there was a 1 in 20 chance it was higher.
The process had three rounds:
- Make initial guesstimates. The first round relied entirely on TMs’ existing knowledge to obtain preliminary figures. In order to minimize bias, they placed the values in their own spreadsheet, without discussion, viewing others’ inputs, or doing further reading. They were allowed to skip this round for any parameters about which they felt it was impossible to make meaningful guesstimates, or if they were very short on time.
- Consider additional information. In this round, TMs were given additional relevant information about each parameter. This included information gathered by the primary cost-effectiveness analyst, such as Yamey’s own estimates, data from similar programs like OFTW, and comments from third parties with relevant experience. Before updating their estimates, they were also encouraged to do their own research and to use the SHELF-single or MATCH web apps to fit a probability distribution to their inputs. They were then asked to record a “confidence” score between 0 and 10, representing how much they trusted their inputs, plus a rationale for their responses.
- Consider other team members’ inputs. After everyone had completed Rounds 1 and 2 for all parameters, they looked at other TMs’ estimates and comments, and used that new information to update their own if they wished.
All TMs were provided detailed instructions for each round, including ways to improve the accuracy of their estimates and minimize bias. Each parameter was also given a priority score from 1 to 5, based loosely on the results of a preliminary sensitivity analysis, which TMs could use to guide how much time to spend on considering their inputs.
Fitting and aggregation
Inputs from all TMs were combined in an Excel spreadsheet (see Team Inputs). Jeremy Oakley, developer of the SHELF R package and Shiny apps, kindly wrote an R script to fit a distribution to each one. Parameters with a hard lower bound but no theoretical maximum (such as ambassador numbers, which could not be negative) were fitted to lognormal or gamma distributions. Those with hard upper and lower bounds (such as probabilities) were fitted to beta. The normal distribution was used for the remainder.
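To give a feel for what fitting involves, the following Python sketch matches a lognormal distribution to just two elicited values, the median and the 95th percentile. It is a simplification and not the R script used in this evaluation, which worked from all of the elicited values; the function names and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.95)  # standard normal 95th percentile


def fit_lognormal(median, pc95):
    """Return (mu, sigma) of a lognormal matching a median and 95th percentile."""
    if not 0 < median < pc95:
        raise ValueError("need 0 < median < pc95")
    mu = math.log(median)                     # median of a lognormal is exp(mu)
    sigma = (math.log(pc95) - mu) / Z95       # spread implied by the upper tail
    return mu, sigma


def lognormal_quantile(mu, sigma, p):
    """Quantile function of the fitted lognormal."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(p))


# Hypothetical inputs from one team member for a strictly positive parameter.
mu, sigma = fit_lognormal(median=1000.0, pc95=3000.0)
implied_pc5 = lognormal_quantile(mu, sigma, 0.05)
print(f"mu={mu:.3f}, sigma={sigma:.3f}, implied 5th percentile={implied_pc5:.0f}")
```

Comparing the implied 5th percentile with the value the team member actually gave is one simple way to spot a poorly fitting distribution.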
A ‘linear pool’ (weighted average) of distributions for each parameter was then generated within Excel, with weights determined by TMs’ self-reported confidence levels. Specifically, the formula (see the “Pooled sample” column) chose a sample from one of the six distributions (“Sample” column), where the probability of each distribution being chosen was proportional to its weight (“Weight” column). All parameter inputs were derived from these pooled distributions. There are many other ways of mathematically aggregating distributions (e.g. log-linear pooling and fully Bayesian methods), but the evidence suggests linear pooling tends to be comparably accurate as well as being much more straightforward (e.g. see O’Hagan et al., 2006, chapter 9).
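The sampling step of a linear pool can be sketched as follows. This is an illustrative Python version of the idea, not the Excel formula itself; the three fitted distributions and the confidence weights are hypothetical.

```python
import random


def pooled_sample(samplers, weights, rng):
    """Draw once from a linear pool.

    One distribution is chosen with probability proportional to its weight,
    then a single value is sampled from the chosen distribution.
    """
    chosen = rng.choices(samplers, weights=weights, k=1)[0]
    return chosen(rng)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical fitted distributions for three team members.
samplers = [
    lambda r: r.lognormvariate(6.5, 0.6),
    lambda r: r.lognormvariate(7.0, 0.4),
    lambda r: r.lognormvariate(6.0, 0.9),
]
weights = [2, 5, 1]  # hypothetical self-reported confidence scores

draws = [pooled_sample(samplers, weights, rng) for _ in range(5000)]
pooled_mean = sum(draws) / len(draws)
print(f"pooled mean over {len(draws)} draws: {pooled_mean:.1f}")
```

Sampling from a randomly chosen distribution, rather than averaging the samples themselves, preserves disagreement between team members: the pooled distribution can be multimodal when opinions diverge.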
We encountered some difficulties during this process, such as missing inputs and poorly-fitting distributions.
Appendix 1 outlines these challenges, the steps taken to address them, and some potential ways of improving the procedure in future evaluations.
We assessed uncertainty around the results using both probabilistic and deterministic sensitivity analyses (PSA and DSA, respectively).
The probabilistic analysis used 5,000 Monte Carlo simulations to generate expected costs and outcomes. This method takes a random sample from each of the input distributions, records the results, and repeats the process many times. The DCR calculated from the probabilistic point estimates (means) of the costs and donations is considered the base case CEE.
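A stripped-down version of this procedure is sketched below in Python. The four-parameter model and its distributions are invented for illustration and bear no relation to the structure or inputs of the actual Excel model; the point is only the mechanics of sampling, recording, and taking the ratio of the means.

```python
import random


def simulate_once(rng):
    """One draw from a toy model (hypothetical structure and distributions)."""
    ambassadors = rng.lognormvariate(3.0, 0.5)
    pledges_per_ambassador = rng.gammavariate(2.0, 1.0)
    pledge_value = rng.lognormvariate(6.5, 0.8)
    cost_per_ambassador = rng.gauss(3000.0, 800.0)
    donations = ambassadors * pledges_per_ambassador * pledge_value
    costs = ambassadors * cost_per_ambassador
    return costs, donations


def run_psa(n, seed=0):
    """Run n simulations and return the raw results plus mean costs and donations."""
    rng = random.Random(seed)
    results = [simulate_once(rng) for _ in range(n)]
    mean_cost = sum(c for c, _ in results) / n
    mean_donations = sum(d for _, d in results) / n
    return results, mean_cost, mean_donations


results, mean_cost, mean_donations = run_psa(5000)
# Base case DCR: ratio of the means, not the mean of per-simulation ratios.
base_case_dcr = mean_donations / mean_cost
print(f"mean cost={mean_cost:,.0f}, mean donations={mean_donations:,.0f}, "
      f"DCR={base_case_dcr:.2f}")
```

Taking the ratio of the means avoids the instability of averaging individual ratios, which can explode or change sign when a simulation produces costs close to or below zero.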
The simulations can be used to characterize overall uncertainty much better than the DSA. We calculated 90% CIs for the net costs and impact-adjusted donations, but it was not possible to provide a meaningful CI for the DCR because in some simulations the costs, donations, or both were negative. This can happen when the revenue earned by charging a processing fee is greater than the expenditure, or when CAP diverts donations from more effective charities. Uncertainty around the DCR was therefore represented in other ways.
A cost-effectiveness plane (a special kind of scatterplot) illustrated the spread of values by plotting the results of the simulations and the cost-effectiveness thresholds.
Cost-effectiveness acceptability curves (CEACs) showed the probability of each alternative (CAP or Do Nothing) being cost-effective – having the highest net benefit – at different thresholds. Net benefit is the value of donations minus the cost of the program, similar to the concept of net present value used in other fields.
A cost-effectiveness acceptability frontier (CEAF) showed the probability of the most cost-effective option at any given threshold being optimal – having the highest expected net benefit – which is normally the most relevant criterion for decision-making. (In most cases, the option with the highest probability of being cost-effective, as indicated by the CEACs, is also the optimal choice, but there are exceptions.)
We also calculated the expected value of perfect information (EVPI). This is the theoretical maximum that should be spent to remove all uncertainty in the model, which can help guide decisions such as how much (if anything) to invest in a pilot study.
Cost-effectiveness planes are introduced in Black (1990). CEACs, CEAFs, and EVPI are explained in more detail in Barton, Briggs, & Fenwick (2008), and the steps for calculating them in this case are detailed in Appendix 2.
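The relationship between these quantities can be illustrated with a short Python sketch. It assumes that net benefit at a given threshold is donations minus the threshold multiplied by costs, and that Do Nothing has a net benefit of zero; this is one common formulation and may differ in detail from the calculations in Appendix 2. The three simulated (cost, donation) pairs are made up.

```python
def net_benefit(costs, donations, min_dcr):
    """Net benefit of CAP relative to Do Nothing (assumed to be zero)."""
    return donations - min_dcr * costs


def summarize(sims, min_dcr):
    """CEAC point, optimal option, CEAF point, and EVPI at one threshold."""
    nbs = [net_benefit(c, d, min_dcr) for c, d in sims]
    n = len(nbs)
    prob_cap_ce = sum(nb > 0 for nb in nbs) / n  # CEAC value for CAP
    expected_nb = sum(nbs) / n
    optimal = "CAP" if expected_nb > 0 else "Do Nothing"
    # CEAF: probability that the optimal option is the cost-effective one.
    ceaf = prob_cap_ce if optimal == "CAP" else 1 - prob_cap_ce
    # EVPI: expected net benefit when choosing with hindsight in every
    # simulation, minus the expected net benefit of the best choice today.
    evpi = sum(max(nb, 0.0) for nb in nbs) / n - max(expected_nb, 0.0)
    return {"prob_cap_ce": prob_cap_ce, "optimal": optimal,
            "ceaf": ceaf, "evpi": evpi}


# Three made-up simulations as (costs, donations), with one large upside case.
sims = [(100.0, 50.0), (100.0, 80.0), (100.0, 500.0)]
for threshold in (1.0, 3.0):
    print(threshold, summarize(sims, threshold))
```

In this toy example CAP is cost-effective in only one of three simulations at a threshold of 1x, yet it is still the optimal choice there because the single large success pulls the mean net benefit above zero. This is how an option can be optimal despite having less than a 50% chance of being cost-effective.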
Since removing all uncertainty is infeasible, it is important to consider the value of the information that could realistically be obtained in a pilot study. There are established methods for doing this, but they are relatively complex and, in Excel, would require a macro that runs for dozens of hours (see e.g. Briggs, Claxton, & Sculpher, 2006, chapter 7; Wilson et al., 2014; Strong, Oakley, Brennan, & Breeze, 2015). We therefore made extremely rough estimates, as follows.
- We guesstimated the proportion of the remaining decision uncertainty that would be resolved by the pilot study.
- We multiplied that by the EVPI to get the expected value of information obtained in the pilot.
- We subtracted the estimated cost of the pilot, which gave us the expected net benefit of the pilot. A positive figure indicates that the pilot would cost less than the value of the information obtained, suggesting it would be worthwhile.
We did this for all three thresholds, and four potential pilot types:
- Yamey alone. Yamey thinks he could recruit and manage about 5 ambassadors without incurring significant costs or requiring external assistance.
- Volunteer. Yamey thinks a part-time volunteer could manage up to 10 ambassadors. It is unclear whether they would require a stipend, but to be conservative we have assumed a total cost of $10,000.
- PT COO. Yamey’s preference is to hire a chief operating officer. He thinks a part-time COO, on a salary of about $40,000, could run up to 20 ambassadors (though we suspect this is at the lower end of the feasible range) while also working on overall strategy.
- FT COO. For about $80,000, Yamey believes a full-time COO could handle up to 50 ambassadors alongside other tasks.
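The three-step calculation is simple enough to express directly. In the Python sketch below, the EVPI figures are the rounded values reported in the results, and the pilot costs follow the descriptions above (treating the Yamey-only pilot as free, and each COO pilot as costing the stated salary). The shares of uncertainty resolved are invented for illustration and are not our actual guesstimates.

```python
# Rounded EVPI at each threshold, as reported in the results.
EVPI_BY_THRESHOLD = {"1x": 230_000, "3x": 170_000, "10x": 30_000}

# Approximate pilot costs from the descriptions above (simplifying assumptions).
PILOT_COSTS = {"Yamey alone": 0, "Volunteer": 10_000,
               "PT COO": 40_000, "FT COO": 80_000}

# Hypothetical share of decision uncertainty each pilot would resolve.
SHARE_RESOLVED = {"Yamey alone": 0.1, "Volunteer": 0.2,
                  "PT COO": 0.3, "FT COO": 0.5}


def pilot_net_benefit(evpi, share_resolved, cost):
    """Value of information obtained in the pilot minus the cost of the pilot."""
    return share_resolved * evpi - cost


table = {
    (pilot, threshold): pilot_net_benefit(evpi, SHARE_RESOLVED[pilot], cost)
    for pilot, cost in PILOT_COSTS.items()
    for threshold, evpi in EVPI_BY_THRESHOLD.items()
}

for (pilot, threshold), value in table.items():
    verdict = "worthwhile" if value > 0 else "not worthwhile"
    print(f"{pilot:12s} at {threshold:>3s}: {value:>10,.0f} ({verdict})")
```

Because the result is linear in the share resolved, readers who disagree with a guesstimate can see immediately how far it would need to move to change the sign for a given pilot.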
The deterministic CEE was obtained using the means of the pooled distributions used in the PSA (described above). A DSA then identified the main sources of uncertainty in order to guide information-gathering priorities, both during this evaluation and potentially in future studies, such as the pilot study.
- In a one-way sensitivity analysis, each parameter was set to the 5th and 95th percentiles, and the resulting DCRs presented in a tornado chart.
- A threshold analysis determined the value that each of the 10 most sensitive parameters would need to attain in order for the DCR to reach our potential minDCR thresholds.
- A two-way sensitivity analysis recorded the DCRs resulting from changing any two of the 10 most sensitive parameters at once.
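The one-way analysis can be sketched as a loop that swaps in one percentile at a time while holding everything else at its base value. The Python example below uses a toy deterministic model with hypothetical parameter names, base values, and percentile ranges; it is not the Excel model.

```python
def dcr(params):
    """Toy deterministic donation-cost ratio (hypothetical model structure)."""
    donations = (params["ambassadors"] * params["pledges_per_ambassador"]
                 * params["pledge_value"])
    costs = (params["fixed_cost"]
             + params["ambassadors"] * params["cost_per_ambassador"])
    return donations / costs


# Hypothetical base values and (5th, 95th) percentile ranges.
BASE = {"ambassadors": 50, "pledges_per_ambassador": 2, "pledge_value": 900,
        "fixed_cost": 40_000, "cost_per_ambassador": 400}
RANGES = {"ambassadors": (10, 150), "pledges_per_ambassador": (0.5, 5),
          "pledge_value": (100, 3000), "fixed_cost": (10_000, 100_000),
          "cost_per_ambassador": (100, 1000)}


def one_way(model, base, ranges):
    """Vary one parameter at a time; return rows sorted by output spread."""
    rows = []
    for name, (low, high) in ranges.items():
        at_low = model({**base, name: low})    # all other inputs at base
        at_high = model({**base, name: high})
        rows.append((name, at_low, at_high, abs(at_high - at_low)))
    rows.sort(key=lambda row: row[3], reverse=True)  # tornado chart order
    return rows


rows = one_way(dcr, BASE, RANGES)
for name, at_low, at_high, spread in rows:
    print(f"{name:24s} low={at_low:5.2f} high={at_high:5.2f} spread={spread:5.2f}")
```

Sorting by spread gives the order of the bars in a tornado chart. A two-way analysis follows the same pattern with a nested loop over pairs of parameters.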
A number of measures were taken to minimize the risk of error in the model, including the following.
- Named ranges reduced the risk of erroneous cell references.
- The effects of variations in inputs on outputs were observed, to ensure they made sense (e.g. higher costs → lower cost-effectiveness).
- Consistency across different parts of the model was checked, e.g. higher EVPI when the CEAF is lower.
- The percentiles of fitted distributions were compared to each TM's inputs for each parameter, and any substantial disparities were investigated further.
- Samples from the pooled distributions were generated in R and compared to the Excel-based results.
- Parameter values, and final results, were compared to our preliminary estimates, and any major disparities investigated.
- Macros included an explanation for every line of code, and key inputs (such as number of samples) were displayed in the worksheet.
- The model was thoroughly checked by one RG team member, and less thoroughly by other TMs, Yamey, and two experienced health economists.
Each TM's estimates and fitted distributions are shown in the Team Inputs worksheet, and the means and 90% confidence intervals of the pooled distributions are in the Parameters worksheet. Overall, the inputs were much more pessimistic than those entered into preliminary models, which were based heavily on Yamey’s guesstimates. The main exception was the average pledge size of nearly $1,000, which the team thought would be close to those reported by OFTW for undergraduates, although our 90% CI was wide (approximately $100–$3,000). Even after adjusting our estimates in light of each other’s inputs and comments, there was considerable divergence of opinion among TMs for many parameters; for example, the median estimates for 1st year donor churn ranged from 22% to 70%, EA funging from 10% to 67%, and the number of ambassadors per manager from 15 to 80. Notably, the three individuals who were most closely involved in the evaluation tended to give more optimistic inputs than the three more detached TMs. The average confidence score of 2.3/10 – which indicates how much we trusted our estimates beyond the uncertainty reflected in the confidence intervals – also reflects the highly speculative nature of most parameters.
Table 1: Base case probabilistic results
In the base case, CAP is expected to cost around $500,000 over a 10-year horizon, but with a very wide 90% confidence interval (approximately $50,000–$1.5 million). Expected impact-adjusted donations are about $1 million, with even greater uncertainty ($7,000–$4 million). The base case cost-effectiveness estimate is a donation-cost ratio of 1.94, meaning just under $2 is donated to CAP-recommended charities for every dollar spent on the program. This is higher than our lowest potential cost-effectiveness threshold of 1x, but well below the primary reference point of 3x, and even further off the CEEs reported by other EA fundraising organizations.
Figure 1: Cost-effectiveness plane
The cost-effectiveness plane (Figure 1) shows a tight cluster of estimates with less than $2 million in donations and costs, although a non-trivial proportion of each surpass this figure. Donations in particular are positively skewed, with a handful of the 5,000 scenarios reaching $20 million and beyond (not shown on the plane for presentational reasons). The markers in the north-west (top left) quadrant represent scenarios where Do Nothing strictly dominates CAP (i.e. CAP causes negative effects with a positive cost), likely reflecting the small risk that CAP would displace donations to more effective charities. Conversely, the few estimates just inside the south-east (bottom right) quadrant suggest a very small chance that CAP would dominate Do Nothing (i.e. cost less – by generating revenue greater than its expenditure – and cause more benefit).
Figure 2: Cost-effectiveness acceptability curves and frontier
The cost-effectiveness acceptability curves (Figure 2) suggest there is just a 36% chance of CAP being cost-effective (having higher net benefit than Do Nothing) at a minDCR of 1x. The cost-effectiveness acceptability frontier nevertheless indicates that CAP would be the optimal choice at that threshold (i.e. have the highest expected net benefit). This is because the distribution of net benefits at that threshold is positively skewed, with a mean higher than the median. Beyond a minDCR of 1.94, however, Do Nothing becomes optimal; there is just a 15% chance of CAP being cost-effective at 3x, and 4% at 10x. In other words, this analysis suggests that, for a risk-neutral donor, paying upfront for the full 10-year program would only make sense if their minimum acceptable DCR was below about 2x.
Figure 3: Expected value of perfect information (EVPI) at different cost-effectiveness thresholds
Table 2: Very approximate estimates of the value of information that could be obtained from various sizes of pilot study.
The expected value of perfect information (Figure 3) at the 1x, 3x, and 10x thresholds is around $230,000, $170,000, and $30,000, respectively. Our crude estimates of the value of a pilot study are shown in Table 2, with green cells indicating that the pilot is probably worthwhile. At our primary threshold of 3x, hiring a full-time COO to run a pilot with about 50 ambassadors could be justified, but the expected net benefit – the difference between the value of information obtained and the costs incurred – is a little higher for smaller pilots led by a part-time COO, a volunteer on a stipend, or Yamey alone. With a minDCR of 10x, only a small pilot run by Yamey himself (perhaps with the assistance of unpaid volunteers) seems warranted. Note that these estimates disregard the donations resulting from the pilot, which are expected to be at least as high as the costs, so they may be considered conservative. However, it is also worth highlighting that value of information is very sensitive to the time horizon: a program of shorter expected duration would generate less total value so it would not be worth spending as much on finding out whether to support it, and the converse would be true of a longer one.
Table 3: Deterministic results by program year
The deterministic analysis (based on the means of the parameter inputs) gave 10-year expected costs of about $640,000. As indicated by the pie chart (Figure 4), ambassador managers account for about half of expenditure, followed by the chief operating officer’s salary. The marketer and miscellaneous costs constitute a majority of the rest.
Figure 4: Breakdown of costs
According to our model, the average ambassador would generate about $4,000 for CAP charities ($2,800 after adjusting for funging, and $2,000 after discounting). While around 72% of donors would give a one-off (rather than recurring) donation, pledges account for 89% of the expected donation volume. Each pledge is estimated to have an impact-adjusted discounted value of nearly $900 over a 20-year period; as indicated by Figure 5, almost all of this value is realized within the first five years after taking the pledge.
Figure 5: Value of a pledge over time
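The front-loading of pledge value follows from combining donor churn with discounting, which the Python sketch below illustrates. All inputs are hypothetical and the timing convention (one donation at the start of each year, churn applied afterwards) is an assumption; the function is not calibrated to the Value of pledge worksheet.

```python
def pledge_value_by_year(annual_amount, first_year_churn, later_churn,
                         discount_rate, years):
    """Expected discounted donation from one pledge in each year."""
    values = []
    retained = 1.0  # probability the pledger is still giving
    for year in range(years):
        values.append(annual_amount * retained / (1 + discount_rate) ** year)
        churn = first_year_churn if year == 0 else later_churn
        retained *= 1 - churn
    return values


# Hypothetical inputs chosen only to show the shape of the curve.
values = pledge_value_by_year(annual_amount=400.0, first_year_churn=0.5,
                              later_churn=0.3, discount_rate=0.05, years=20)
total = sum(values)
first_five_share = sum(values[:5]) / total
print(f"total value={total:.0f}, share in first five years={first_five_share:.0%}")
```

With churn compounding every year, the expected contribution of later years shrinks geometrically, so extending the horizon beyond a few years adds little to the value of a pledge.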
The 10-year deterministic donation-cost ratio is 1.52, significantly lower than the probabilistic one. Cost-effectiveness is worse (1.21) in the first three years, when ambassador numbers are high enough to incur considerable labor costs but not high enough to generate a lot of donations; yet the model predicts only modest returns to scale, with a DCR of just 1.56 at the 20-year mark. Excluding the pilot study from the totals does not significantly affect the cost-effectiveness.
Figures by year of disbursement (Table 4) are lower due to the lag in receiving pledged donations. Even over a 20-year horizon, the DCR is not expected to reach 1. Note that these estimates assume the program (e.g. the recruitment of new ambassadors) continues at least as long as the given time horizon. If the program (and therefore expenditure) stops, but the donations from outstanding pledges are still received in later years, the DCR at those later horizons will be higher.
Table 4: Deterministic results by year of disbursement
One-way sensitivity analysis
The tornado chart (Figure 6) gives some indication of which parameters contribute the most uncertainty, though it cannot account for interactions among parameters. Optimistic assumptions for any one of three parameters – mean pledge size, number of pledges, and donor churn beyond the first year – cause the DCR to comfortably surpass the 3x threshold. The pessimistic confidence limit for any of the top eight parameters brings the DCR below 1. Interestingly, the ‘pessimistic’ 5th percentile value for maximum scale actually raises the DCR more than the ‘optimistic’ 95th percentile, because with a very low number of ambassadors the major costs are not yet incurred. Something similar happens with 1st year ambassador churn. This should not be taken to imply that a smaller program is preferable (the overall impact is far lower), but it is one of several indications that the program has limited returns to scale.
Figure 6: Tornado chart illustrating the results of a one-way sensitivity analysis on the 20 most sensitive parameters
The threshold analysis (Table 5) revealed that achieving 3x would require a mean donation over $2,000, seven pledges per ambassador, donor churn beyond the first year of 12%, or (very unrealistically) 2nd year ambassador churn of just 8% – assuming all other parameters remain unchanged. A 10x return would require about three pledges of $7,000 per ambassador, or 23 pledges at the base case mean of just under $1,000, both of which seem fairly implausible. No change to any one of the other seven parameters would enable either 3x or 10x.
Table 5: Threshold analysis on the 10 most sensitive parameters
Two-way sensitivity analysis
The two-way analysis helps to capture interactions between pairs of parameters, which can lead to fluctuations in the cost-effectiveness ratio greater than the sum of the changes caused by varying them individually. As shown in Figure 7, optimistic values for any two of the three most sensitive individual parameters – average pledge size (#12), average number of pledgers per ambassador-year (#8), and donor churn after the first year (#10) – would enable CAP to reach 10x. A further 25 combinations push the DCR past 3x, while pessimistic confidence limits for almost any two parameters bring the DCR below 1.
Figure 7: Two-way sensitivity analysis on the 10 most sensitive parameters
Our overall subjective score for cost-effectiveness is decided with reference to ‘best buys’ of a similar nature. In this case, One for the World and The Life You Can Save are the most relevant comparators, as they solicit both pledges and one-off donations from individuals who do not necessarily consider themselves ‘effective altruists’. Of course, there are significant differences: OFTW primarily operates in universities, and TLYCS appeals to a broad range of demographics. But for a CAP-style program to warrant funding over these alternatives, it should arguably demonstrate comparable cost-effectiveness.
OFTW and TLYCS both report donation-cost ratios of at least 10:1. The subjective score was therefore given using the following criteria:
The base case DCR of 1.94 is equivalent to 0.194x the best buy in our subjective scoring framework. This falls clearly into the Low category.
Our analysis suggests CAP is unlikely to be cost-effective. The base case estimate of around $2 donated per dollar spent is below the 3x return that both Yamey and the Rethink Grants team consider an approximate lower bound for the project to be worthwhile, and far from the 10x or higher reported by One for the World and The Life You Can Save. It consequently receives an overall score of Low in our subjective framework. With only a 15% chance of being cost-effective at the 3x threshold, it would be unwise to invest in a full-scale program at this stage.
Nevertheless, the analysis provides a strong case for running a pilot study. Our very rough estimates suggest that, assuming a total program duration of at least several years, a pilot of any reasonable size would cost far less than the value of the information it would generate. A small or medium-sized study (5–30 ambassadors) run by a part-time chief operating officer, a volunteer, or Yamey alone seems to offer the most favorable trade-off between information gain and cost.
Our sensitivity analyses can be used to guide further research and program development. The primary sources of uncertainty appear to be the number and size of recurring donations, and CAP’s ability to retain both ambassadors and donors, so these should be the focus of the pilot study. It may also be worth putting some additional resources into determining the counterfactual impact of a donation, taking into account funging from both EA and non-EA sources. Since ambassador manager salaries are the major cost, alternative program structures that do not involve so much oversight of volunteers, or that use volunteers to support ambassadors, should perhaps be considered as well.
This analysis has many limitations, only a few of which can be discussed here. Overall, it seems likely to have underestimated the promisingness of CAP relative to the alternatives, for several reasons.
- Evaluations of comparator programs use different methodology. We have not closely examined the calculations behind leverage ratios reported by other organizations, but it may not be appropriate to make direct comparisons. For example, OFTW uses lower discount rates and does not appear to take into account funging from large donors such as Good Ventures, while TLYCS does not adjust for funging at all (though it gives a helpful discussion of counterfactual issues in its 2017 annual report). We suspect 10x is therefore an unreasonably high bar.
- More generally, comparing to ‘best buys’ of a similar nature could be misleading. In particular, we considered any DCR under 5x Low, yet a DCR over 1x – as in our base case – would imply that supporting CAP would be better than donating directly to some of the most cost-effective charities in the world. This suggests that funders who would be happy to directly support GiveWell- or ACE-recommended charities ought to consider CAP a competitive opportunity. Depending on how the funder would otherwise use the money, it is even possible that a sub-1x return would be cost-effective (although Yamey has stated that he would likely not consider CAP worthwhile in such circumstances).
- It does not account for indirect benefits, which may well have higher expected value than the direct impact. The potential for creating a culture of effective giving in workplaces, for example, could be more important than the direct impact of the donations. This is addressed further in the Indirect Benefits section below.
- It assumes a static program structure. The model necessarily makes a number of assumptions about the nature of CAP, and implicitly assumes these would remain constant over the years. In reality, a well-run program would evolve in response to information, opportunities, and constraints. For example, Donational could look into payroll giving, solicit pledges from ambassadors’ friends and family, or pursue a smaller number of high-value donations from senior managers at large firms. As discussed in the Team Strength section, there are signs that Donational would be capable of adapting over time – and if the program still did not seem very promising, Yamey has declared an intention to close it down rather than continue indefinitely, thereby minimizing any losses.
There are also ways in which the analysis may favor CAP.
- It does not account for indirect harms. CAP could backfire or have negative ‘spillover’ effects that make it less cost-effective or even harmful overall. This is discussed in the relevant section below.
- It does not account for moral uncertainty. Under some worldviews, CAP itself, or some of the recommended charities, may do active harm. This is also addressed in a separate section.
- It is vulnerable to cognitive biases. We tried to take into account optimism bias when providing the parameter inputs, but it is a pervasive phenomenon and we cannot be sure that we entirely escaped its influence, particularly since we did not undergo formal calibration training. After taking a considerable amount of both Yamey’s and RG team members’ time, there also is a danger of being influenced by reciprocity – a sense of obligation to offer something in return – and something akin to the sunk cost fallacy – the feeling that, having invested so much in the evaluation, it would be a shame not to recommend at least a little funding. We consciously tried to resist these pressures, but we may not have been entirely successful.
- It relied heavily on Yamey’s inputs. Many of the RG team’s parameter estimates, as well as the model structure, were quite heavily influenced by Yamey’s own predictions. This is not necessarily irrational as he was in a better position to estimate some parameters, and we did not get a strong sense that he was consciously trying to exaggerate the likely success of the project. However, project founders are perhaps especially vulnerable to optimism bias (Cassar 2009), so it is possible we gave too much weight to his guesstimates, particularly given that he himself was highly uncertain about many of them.
- It only considers financial costs of personnel time. Yamey and any staff, contractors, or volunteers engaged in the project may otherwise be doing things that are worth more than the costs used in this analysis. For example, a good COO might instead earn a high salary and donate a large proportion to charity, or work on another high-impact startup. However, there are several reasons for disregarding these ‘opportunity costs’, including the following.
- It is very hard to make meaningful estimates of these, especially before the program has begun.
- They may be at least partly accounted for in the cost-effectiveness thresholds. We don’t think, say, a 2x return would be worthwhile because the personnel (not just the funders) could have more impact through other activities.
- Most other relevant CEAs also use financial costs, so departing from this tendency may hinder comparisons across projects.
It nevertheless remains a concern, and users may wish to put their own cost assumptions in the model.
Other limitations of our methods may substantially affect the results, but in an unclear direction.
- Elicited parameter inputs are highly uncertain. Our team members did not put a great deal of trust in their parameter estimates. In many cases, this reflected both uncertainty about the quantity being estimated, and difficulty identifying a standard distribution that matched their beliefs. This is understandable, since there was no good information on the vast majority of them, and no opportunity for proper calibration training that could have improved our ability to make good estimates. But it should be emphasised that the confidence intervals do not capture all of the uncertainty around these critical inputs. Some additional issues with our elicitation and aggregation methods are discussed in Appendix 1, along with some potential ways of improving the process.
- Parameter correlation is not fully captured. A common criticism of probabilistic analyses is that they implicitly treat all parameters as independent, which is often unrealistic. We partially addressed this concern by making the costs dependent on indicators of scale (such as the number of ambassadors), but some relationships remain unmodelled. For example, pledge size may be higher when the number of pledges is lower (negative correlation), since it would likely reflect a strategy of targeting a smaller number of high earners; and high first-year donor churn would be a good predictor of high second-year donor churn (positive correlation), since they are likely to be driven by similar factors. The Monte Carlo simulations on which the base case results depend consequently include some fairly implausible scenarios. The benefits of probabilistic analysis almost certainly outweigh these drawbacks, and users are free to replace the inputs with their own assumptions for any and all parameters, but it does add an additional element of uncertainty.
- More generally, expected value estimates should not (usually) be taken literally. The use of relatively sophisticated methods can give the illusion of greater certainty than is warranted given the various limitations of both the model structure and inputs. Complex models are also more prone to errors, and harder to replicate, than simple ones. We have greater confidence in this analysis than we do in a ‘back of the envelope’ calculation, or in most models used for evaluating prospective investments, but we would be somewhat surprised if it turned out to be a highly accurate representation of reality.
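The parameter-correlation limitation noted above can be illustrated with a small simulation. The following is a minimal sketch, not the method used in our Excel model: it assumes a logit-normal shape for each churn rate and induces dependence through a shared latent normal draw. The medians, spreads, and the correlation value `rho` are hypothetical placeholders chosen purely for illustration.

```python
import math
import random


def correlated_normals(rho, rng):
    """Return two standard-normal draws whose correlation is rho."""
    z1 = rng.gauss(0.0, 1.0)
    noise = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * noise
    return z1, z2


def to_rate(z, median, spread):
    """Map a normal draw to a rate in (0, 1) via a logit-normal transform."""
    logit = math.log(median / (1.0 - median)) + spread * z
    return 1.0 / (1.0 + math.exp(-logit))


def simulate_churn(n, rho, seed=0):
    """Draw n (first-year, second-year) churn pairs; all inputs are hypothetical."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        z1, z2 = correlated_normals(rho, rng)
        year1 = to_rate(z1, median=0.40, spread=0.5)  # hypothetical marginal
        year2 = to_rate(z2, median=0.25, spread=0.5)  # hypothetical marginal
        draws.append((year1, year2))
    return draws


def pearson(pairs):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


independent = simulate_churn(10_000, rho=0.0)  # parameters treated as independent
correlated = simulate_churn(10_000, rho=0.8)   # high year-1 churn predicts high year-2 churn
print(round(pearson(independent), 2), round(pearson(correlated), 2))
```

With `rho=0.0` the simulation pairs high first-year churn with low second-year churn about as often as not, which is the kind of implausible scenario described above; a positive `rho` makes such pairings rarer. Users who wish to model a dependence of this sort would need to choose the correlation values themselves.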
Given the shortcomings of our cost-effectiveness analysis, it is important to also consider our other criteria – Team Strength, Indirect Benefits and Harms, and Robustness to Moral Uncertainty. These are discussed in the following sections.
by Tee Barnett
Most public-facing sources with the authority to comment on what it takes to create a successful project stress the importance of founding team members. Interestingly, our review found that the evaluation frameworks used by grantmakers in effective altruism and adjacent spaces have comparatively little to contribute, potentially because assessing teams is notoriously difficult and revealing information about people is understandably a very sensitive endeavor. This reluctance is likely compounded by uncertainty about the appropriate methodological approaches, let alone about how to parse the highly subjective world of human capability and potential.
Our claim is that the successes and failures of a project rest in part upon the vision, coordination and implementation of the team. By extension, our evaluation places nontrivial weight on the founder and team evaluation. Navigating the dynamics between team members within a project is tricky and enormously subjective, but we believe this is not reason enough to shy away from making sincere efforts to identify the ways in which the plans, competencies and interpersonal fit of team members might affect project success.
Our team assessment criteria, kept internal largely to guard against influencing the way applicants present themselves, were constructed to supplement the industry wisdom of evaluating teams based upon ‘track record’, preferred credential signals, and in-network litmus tests. The additional considerations we find important in evaluating founders and teams are derived from firsthand and secondhand sources, official and unofficial conversations, and our own experience, reasoning, and intuitions. The result is a checklist that helps us identify disqualifying criteria, core qualifications, and strengths and weaknesses, and that is designed to span every detectable aspect of a team’s ability to carry out their project plans. This includes affordances for future skill growth and personal considerations. Rather than being affixed to a rigid framework, or attempting to over-quantify subjective considerations, our criteria primarily seek to uncover deviations in any direction from what could be best described as baseline competency and commitment to launching and scaling up effective projects.
Specific to the RG evaluation process, the Team Strength section also accounts for what we discover throughout the course of the aforementioned ‘incubation’ portion of our process. Early-stage projects necessarily have components of their plans that will need refinement, and part of the value-add of RG is to offer as-needed planning assistance. While the sophistication of a project’s plan and our perception of the team’s ability to execute are key components of the evaluation, RG also factors in the identifiable competencies of the team required for survival and later flourishing, which can be broadly construed as the ability of the team to update and act upon revised plans.
Donational’s progression displays ample evidence of Yamey’s fitness for leading a project of this sort, beginning first as a personal donation automation system and later evolving into a platform that processes hundreds of thousands of dollars for effective charities.
That Yamey serves as the sole technical presence on the project is testament to his skill in crafting a high-quality platform that serves the needs of One for the World and presents an attractive opportunity to test a corporate outreach model. As we alluded to above, our team evaluation criteria, which in this instance treat Donational as a one-man team for the time being, seek to stack Yamey’s abilities up against baseline competency in the coordination and execution of the project.
As it relates to the post-assessment write-up, this means that the following will not only briefly touch on easily identifiable indicators of fitness to lead the project (e.g. track record, education, formal skills acquired), but also emphasize less obvious team qualities that could adequately handle bringing a successful project to fruition. Much of this will be done by way of introducing selected criteria that we feel have revealed these considerations.
Yamey cleanly passes all of the ‘defeater’ (disqualifier) criteria, defined as discoverable things that would be critically bad for the project. This section comprises a narrow set of indicators that would disqualify the project outright as an option for funders to consider. The ‘defeaters’ section breaks down along the following lines:
- Disposition, traits, and beliefs
- Abilities and skills
- Life plan considerations
- Lack of several core qualifiers
A conventional example of a subcriterion that would fall under “disposition, traits, and beliefs” is interpersonal presentation. Being ‘tone-deaf’ in this domain could include making repeated poor choices in public contexts or having a bad reputation within a given community, either of which could severely inhibit the success of the project moving forward. The methods for assessing this criterion range from documenting general interpersonal impressions to confirming reputational reads with trusted community members.
Core qualifier check
Yamey appears to pass all of the ‘core qualifier’ criteria, defined as discoverable things perceived to be crucial for the survival and eventual scaling of the project. This section comprises a broad array of indicators that break down along the following lines:
- Disposition, traits, and beliefs
- Abilities and skills
- Life plan considerations
- Lacking defeaters
A conventional example of a Core Qualifier subcriterion that would fall under “abilities and skills” is understanding and balance in judgement deferral. Because many projects encompass a vast number of domains, many of which the founder will need assistance in navigating, knowing when to defer is crucial. Deferring too much risks making the founder too dependent upon others for the enormous number of decisions that need to be made, and undermines the judgement of the person who has the most access to decision-relevant information. Deferring too little could naively set the project up to incur critical failures or miss out on important opportunities.
In the case of Donational, Yamey’s track record suggests appropriate and constructive patterns of deferral. For example, Yamey largely outsourced decisions about which charities to include on the platform to those recommended by GiveWell, Animal Charity Evaluators, and the Open Philanthropy Project. In direct discussions with the RG team as well, Yamey often updated his outlook when factoring in the input of others, appearing to weigh their contributions sensibly. His self-directedness was also clearly evident, as he pushed back on the RG evaluation team where it made sense. Although Donational scores well against the Core Qualifier criteria, we did uncover some potential flags within the subcriterion “competency across project dimensions”, outlined below.
Highlighted criterion #1: Life plan considerations
Life plan considerations in this context are in reference to one’s life plans in relation to the proposed project. Life plan considerations are critical to the success of early-stage projects because assurances must be made that the project founder(s) and core team members intend to treat the project as an important part of their lives. If the founder plans on initiating a multi-year plan to carry out an impactful project, there must be evidence that the founder is committed to the effort, or that other potential futures will not obviously disrupt that commitment.
A founder’s perceived values are also crucial to accurately mapping out life plan considerations. As an example, if a founder is explicitly dissatisfied with their role in the project because actually being ‘on the ground’ in some sense makes them happier, this does not bode well for their prospects of remaining on the project. Another example could be impending life circumstances that do not seem to square with the realities of playing a founding role in a project. Tension would arise, for example, in a case where a startup founder in fact tends to feel more validated by institutional credential signals (e.g. degrees accrued, institutions attended). Resting within their life plans may be latent anxiety that years are passing without any ‘progress’ toward a desired career for which traditionally recognized credential signals are critical. Plan tensions of this sort are hardly uncommon, especially within modern labor markets.
Encouraging signs from Donational
Since early 2018, Donational has been a stably running platform that shows little sign of winding down any time soon. Yamey created Donational in the hopes of leading a more impactful life and sees the project as an important part of his life plans for the foreseeable future. This, in addition to having the flexibility to forgo two days per week valued at ~$80,000 per annum, displays a well-above-average commitment to integrating Donational into his life plans.
Nothing has suggested that Yamey would abruptly divest effort from Donational. Donational was originally created to satisfy an array of intrinsic needs for Yamey. Notable here was a stated intrinsic motivation to create unique solutions for the world that contribute to the greatest good, rather than filling an existing role or serving as part of a team to amplify the impact of others. His desire to make a unique contribution towards doing the most good generally reflects well on life plan considerations related to this kind of project.
New projects often benefit from founders with the desire and drive to create unique value for the world. This stated desire appears consistent with Yamey’s work history, and area of study as well, leading up to the founding of Donational. Many of the other relatively minor signs of creating aligned life plans, including indefinite plans to continue living in a place that is generally regarded as well-suited for EA outreach (New York City), check out – aside from the potential issues listed below.
Due to the ease of maintaining an automated platform, much of this suggests Yamey could be a sensible bet for experimenting with outreach approaches that could tie into Donational. The partnership with OFTW, a GiveWell-incubated organization which has partnered with Donational in processing incoming donations, serves as proof of concept in this respect.
Yamey clears many of the life plan considerations checks. The comfortable slot Donational holds in Yamey’s life plans, however, may also work against the project’s potential upside in several distinct ways. By carefully tracing his plans over several conversations, we concluded that it is unlikely Yamey would go on to lead the project full-time due to three primary constraints regarding the project budget:
- Yamey’s income expectations are largely derived from the private sector.
- Yamey would like the project to be self-sustaining such that fees for processing will cover staff salary requirements. According to our model, this would only become viable after several million dollars in processed payments depending on salary requirements – well above current levels.
- Personal flourishing through professional development is quite important to him, and he believes this is less likely to occur working by himself or even with too small of a team.
To sum this up, in order to have Yamey go full-time on the project, the operating budget would need to cover Yamey’s personal salary expectations and those of a small team, at least some of whom would be incurring a relatively high cost of living in New York City. All of this suggests that impact returns from the Donational platform and outreach efforts will need to be relatively high in order to justifiably cover a potential future where Yamey is working full-time on the project. These realities have a very real effect on how plans should be approached and built through the Donational project.
The constraints keeping Yamey from running the project full-time are not straightforwardly prohibitive. Successful entrepreneurs capable of running multiple projects exist. Nonetheless, on two days a week running the technical side of things, real trade-offs will exist regarding Yamey’s bandwidth to conduct various experiments or upskill in order to approach new domains that could be crucial to the project (e.g. fundraising and recruitment). We factor these realities into our funding recommendation for the CAP pilot.
Yamey’s life plans create the need to recruit other individuals to cover the resulting skill and bandwidth gaps. For example, Yamey’s current plan involves recruiting a co-founder and COO-type to execute the CAP pilot and other vital parts of the project. Some focus of our assessment then shifts to whether bringing in others to cover these gaps is possible and desirable. Also important to consider is Yamey’s ability to recruit and stably maintain a team of other individuals in carrying out these plans. His track record and current responsibilities leading a team for his day job suggest high competency here, particularly as he has previously played a vital role in scaling up three organizations. In the following section, we present our read on Yamey’s capability in this regard.
- Much suggests this project will stably remain in good hands for the foreseeable future. Yamey’s life plans, stated values and core competencies appear to match quite well with the continuation of the project.
- Donational’s position in Yamey’s plans, along with the nature of the project itself, present a potentially good opportunity for outreach and marketing experimentation.
- Various specified constraints make it very unlikely that Yamey works full-time on the project and also put pressure on the CAP plans to yield relatively high returns to justify the investment were a full-time transition to occur.
- Much depends on execution of plans that compensate for life plan constraints.
Highlighted criterion #2: Competency across project dimensions
Bridging abstract plans all the way to implementation and adoption also requires regularly working on different ‘levels’ of a project, mostly the object and meta levels. Skillfully navigating these levels demonstrates fluency and command along several dimensions relevant to the project. The object level comprises the concrete dimensions of a project, such as the completion of tasks. The meta level could be characterized as its tactical or strategic dimensions. Ideally, a founding team will possess awareness at these different levels, remaining responsive to various pressures and incoming evidence streams. It is rare to find people who can fluidly move between both the object- and meta-level dimensions, so in this respect, Yamey stands out.
One example at the meta level is having the ability to identify, diagnose, and address bottlenecks. As a project evolves, there will arise bottlenecks, or points of congestion in the functioning of a system, that need to be identified and remedied. Diagnosing bottlenecks alone requires intimately understanding a variety of factors pertaining to the project and its goals, some of which reside beyond simply carrying out intended tasks. These points of congestion can surface globally for the project (e.g. the team isn’t aware of the importance of external marketing, which is plausibly slowing revenue generation), within and between departments, or at the level of individuals. Planning and action are also needed for getting bottlenecks addressed, however. A bottleneck diagnosis is useless unless an individual can persuade others on the team of its importance and the need to take action. This influences the attentional direction of the team, affecting the priorities of the project in turn. The ability to resolve bottlenecks, whether as a sole individual or as part of a team, is crucial for a team to possess. We outline below what the “competency across project dimensions” criterion indicated about the strength of Donational as a project.
In creating a solution that fits squarely as a preferred donation processing option for OFTW and TLYCS, Yamey displayed unusually high competence in understanding and executing on problem/solution fit considerations. This set of considerations anticipates how customers will behave in the marketplace when solving problems they encounter. In this case, Yamey anticipated that his donation processing project could be modified and scaled to provide a vital solution for at least two organizations. This entails not only the conceptual legwork of calibrating on problem/solution fit considerations, but also the application of his technical knowledge and a broad array of interpersonal skills that would cause the organizations to adopt his solution. All of this must be carried out in the right doses and with precise timing.
As mentioned above, Yamey demonstrated a detailed and accurate understanding of how his abilities generate real-world value, what he can offer in relation to his project, and what his project can offer in relation to the needs of existing organizations. There is also considerable legwork entailed in delivery, including skillfully conducting coordination efforts in order to compel organizations to test his solution.
While the Team Strength portion of the assessment does factor in a wide variety of signals meant to evaluate fitness to lead the project, the focus of the overall assessment, including the cost-effectiveness analysis, is the present-day potential of the CAP program. Throughout the process, it has not been entirely clear that Yamey has been able to get conceptual clarity on whether the CAP program is worth pursuing. For example, RG produced a relatively simple back-of-the-envelope calculation (BOTEC) projecting the potential impact of a corporate ambassador program that resulted in large updates to the CAP plan. The perceived potential of the CAP as originally conceived was revised down considerably, and alternative plans were assessed in order to further explore the viability of the program. Moving forward, Rethink Grants will likely produce toy models like these earlier in our evaluation process. We expect this will allow us to identify key questions and potential shifts before sinking significant resources into deeper analysis.
In the Life Plan Considerations section, we touch on the importance of teams being able to recognize and subsequently cover inevitable skill gaps. Ideally, Yamey would have worked through these preliminary calculations sooner with the intention of gathering input from others afterward. This would conceivably have led him to improve his BOTEC or, upon realizing the need for more skilled assistance, to employ someone else’s quantitative modeling skill entirely. Unfortunately this hadn’t yet happened for a variety of stated reasons, all of which were plausible but addressable. To Yamey’s credit, taking part in the RG process is a (somewhat belated) attempt at doing this. It should also be recognized that the current iteration of the CAP plans is a result of plan changes made once it was realized that the CAP wasn’t as cost-effective as originally thought.
Outstanding questions remain, however, regarding this potential meta-level indecision. Not taking adequate steps to model the CAP at an earlier stage could constitute a red flag in certain dimensions of project management, raising questions about his (i) awareness of and ability to cover skill gaps in the project, and (ii) ability to properly weigh the importance of domains where he does not have technical proficiency. To put it concisely, there is a worry that the importance of basic quantitative modeling wasn’t properly appreciated. One would expect crude quantitative estimates of proposed projects to be prioritized; they are quite important regardless of whether the founder knows exactly how to conduct them.
Unrelated to Donational, an example that illustrates the importance of this meta-level awareness would be legal considerations for a project. Legal considerations may seem like a black box when initiating a project in uncharted territory, but this does not negate the importance of taking action to employ domain experts and weighing (the potential importance of) legal considerations appropriately.
Relatedly, we would like to have seen more attention paid to the initial exploration of potential partnerships with nearby organizations. Yamey has a demonstrated track record of working with other organizations through the Donational platform, and because the CAP, which deals with outreach, deviates from being a straightforwardly technical project, it would have been beneficial to seek collaboration with relevant groups earlier. There is reason to believe, however, that approaching them earlier, when the CAP plans were less crystallized, would have been less prudent. To Yamey’s credit once again, OFTW agreed to provide assistance with the CAP pilot toward the end of the RG evaluation process, once we made the suggestion.
- Yamey’s track record demonstrates high facility in numerous dimensions relevant to creating a successful project.
- We uncovered some potential indications of blind spots and lack of awareness regarding the importance of blind spots, including not conducting early calculations on the CAP’s viability and not exploring nearby partnerships.
- Yamey is good at addressing blind spots and issues once they have been identified as such.
- RG observed many instances of Yamey’s willingness to implement major plan changes based on evidence.
- Yamey is reasonably well-suited to run a project of this type. An existing organization moving into the CAP space is plausibly a more attractive option, though none of the obvious players plans to pursue this imminently.
In this section, we did not write out an exhaustive overview of how Donational scores on our Team Strength criteria, electing instead to highlight crucial considerations that most plausibly affected our evaluation in the largest way. Our subjective sense after checking the CAP program against our criteria is that Donational presents an opportunity to test outreach methods that haven’t yet been adequately explored, namely workplace outreach and fundraising. The constellation of criteria we have been tracking has led us to score Team Strength as Medium, though we believe Yamey is on the high end of that category. Informing our recommendation for funding a CAP pilot is our belief in Yamey’s overall capability as a founder, along with our reservations about how certain constraints uncovered in the “life plans considerations” section and potential flags within the “competency across project dimensions” section bear on the viability of the program.
To recap, we observed that Yamey’s life plans as they pertain to the project were very promising, with the exception that certain constraints would require the CAP model to yield relatively high returns in order to be considered sufficiently impactful. This resulted in a substantial plan change, whereby a smaller-scale test of the CAP became far more sensible. Rather than recommending full funding for a program that must hit optimistic targets in order to be impactful enough, the decision was made, to Yamey’s credit once again, to get more information on the viability of this outreach model via a pilot study.
Yamey’s track record suggests above-average fluency in several dimensions of what it takes to bring a project to fruition. The planning process for CAP presented some potential gaps in awareness, however, that would have been quite costly had the project gone ahead without our involvement. Most encouraging in this respect is that Yamey demonstrates an eagerness to take corrections and steadfast commitment to iterating his plans in search of the most effective version of the CAP possible. This disposition and the associated meta-skills required to course-correct consistently should be weighted more heavily than revealed skill gaps. No founder will have every skill or have access to all of the relevant forms of awareness, but we believe Yamey displays an impressive willingness to update and intention toward building further competency.
In addition to the outcomes intentionally generated by a project, we take into account potential indirect effects (including spillover effects and flow-through effects). Indirect Benefits are outcomes that are not the central aim of a project, but are a positive consequence of its implementation. For example, distributing mosquito nets to combat malaria may increase income, or founding a new charity may give effective altruists skills that can be used on other projects as well. There is no clear boundary between direct and indirect effects, and indirect benefits are not intrinsically less important than direct ones. However, this pragmatic distinction allows us to treat more speculative consequences separately, in a way that takes into account their high uncertainty.
To do this, we first work to identify all of the possible indirect benefits of a given project, and then consider how good the benefits would be. Specifically, we quantify each plausible benefit by considering the number of individuals that benefit (scale), the magnitude of the benefit (how much each individual benefits, or effect size), and the likelihood that the project generates that benefit (probability). We then calculate expected benefit points for each indirect benefit by multiplying the scale, effect size and probability scores. Finally, we sum these points to arrive at an overall Indirect Benefits score. Taking the sum accounts for the fact that causing multiple indirect benefits is better than one.
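The scoring arithmetic described above can be sketched in a few lines. This is an illustrative sketch only: the benefit names and the scale, effect size, and probability scores below are hypothetical placeholders, not the scores we assigned to CAP, and the numeric scoring scales themselves are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndirectEffect:
    """One indirect benefit, scored on three dimensions (scales are assumed)."""
    name: str
    scale: float        # score for how many individuals are affected
    effect_size: float  # score for how much each individual is affected
    probability: float  # score for how likely the project is to cause the effect

    @property
    def expected_points(self) -> float:
        # Expected benefit points: the product of the three scores.
        return self.scale * self.effect_size * self.probability


def total_points(effects) -> float:
    """Sum expected points, so several indirect benefits count for more than one."""
    return sum(effect.expected_points for effect in effects)


# Hypothetical inputs, for illustration only.
effects = [
    IndirectEffect("hypothetical benefit A", scale=3, effect_size=2, probability=1),
    IndirectEffect("hypothetical benefit B", scale=2, effect_size=2, probability=2),
]

for effect in effects:
    print(effect.name, effect.expected_points)
print("Total expected benefit points:", total_points(effects))
```

Because the total is a sum of products, a project can reach a given score either through one large, likely benefit or through several smaller ones.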
The numbers of expected benefit points associated with the three qualitative categories are shown in the table below. We set these thresholds to reflect the fact that a project that has a high probability of causing at least one very impactful indirect benefit at a large scale should earn a score of High. We believe a project that has a moderate probability of generating a moderately good indirect benefit for a moderate number of people should get a score of at least Medium – higher if there are several indirect benefits. And we believe that many indirect benefits with smaller expected effects can also earn a project a Medium or High score.
In the case of CAP, we expect that the most significant indirect benefits will be generated as a result of increased donations to effective animal welfare charities. Specifically, we think there’s a chance that additional donations to effective animal welfare charities contribute to reducing our reliance on factory farming, which in turn would likely reduce the severity of climate change as factory farming is responsible for a large share of greenhouse gases emitted by middle- and high-income countries. Mitigating some of the worst effects of climate change could substantially improve the lives of many people. While it’s quite unlikely that CAP scales to a point where the marginal donations of CAP participants end up making a tangible difference in this area, the benefits are large enough that the expected benefit is significant.
Similarly, reducing our reliance on factory farming has the indirect benefit of reducing the risk that antibiotic resistance, driven by the use of antibiotics in animal agriculture, could lead to a superbug that causes a severe pandemic or epidemic affecting many people. Again, while the chance that the CAP program makes much of a difference here is quite small, the expected value may still be non-trivial.
Increasing donations to global health charities may have indirect benefits as well. There’s some evidence that reducing the incidence of malaria, for example, can have tangible macro-economic benefits as well as boosting the incomes of those directly affected. This is consistent with other global health interventions, such as vaccinations, which have been found to contribute to economic growth of the impacted region (Jit et al., 2015; Ozawa et al., 2016), leading to additional people becoming somewhat less impoverished.
There’s also a small probability that donations to criminal justice reform charities might have positive indirect effects. The mass incarceration of people of color likely contributes to persistent racial inequality in the United States. Criminal justice reform could mitigate some of this.
In addition, we think there’s a chance that we get some indirect benefits not from the donations themselves, but from shifting CAP participants’ perspectives on giving and altruism. Specifically, we think there’s a moderate chance that some of the participants adopt a more effectiveness-oriented approach to giving going forward, multiplying the impact of their donations. We also think there’s a chance that CAP expands, reaches a decent number of people, and that all of those people contribute to a more widespread culture of giving. Furthermore, there is some chance that a few of those participants buy into effective altruism more deeply and go on to have a substantial impact, perhaps earning-to-give or using their career to achieve a lot of good.
Finally, along similar lines, we expect that there’s a small chance that some companies participating in CAP will formalize CAP-like programs in their company culture and system, for example, by creating a corporate charity deduction-type program. This could increase the amount of money going to charity overall, though unless the companies emphasize effective charities, it’s not clear that this would have much of an impact.
Our findings are summarised in the table below. Following our scoring process, we end up with a total of 30 expected benefit points, which is just inside the High category.
Indirect Harms are unintended negative consequences of the project. For example, a project that increased the incomes of poor people may lead to greater consumption of animal products, which causes nonhuman animals to suffer; or an AI research organization might cause harmful AI by increasing the number of people with relevant skills.
We believe it’s important to take these potential harms into account, and do so using a similar approach to the one used to account for Indirect Benefits. We assess each potential indirect harm by its scale, its effect size, and the probability that it occurs, which together yield expected harm points.
As with Indirect Benefits, we chose Indirect Harms thresholds that penalize projects that are quite likely to have a large negative impact on a lot of people. A project earns a Low score only if it has just a few relatively small indirect harms, most of which are unlikely and small in scale.
We expect that the biggest indirect harm generated by CAP will come from the fact that many of the donations will likely go to global poverty charities. There’s a high probability that reducing global poverty leads to a moderate increase in consumption of factory-farmed animals, at least in the short- to medium-term. If those animals generally lead very bad lives, this could lead to a lot of additional suffering – though some have argued that this problem may be exaggerated.
Similarly, it’s possible, though unlikely, that alleviating poverty would accelerate economic growth enough to increase the pace of technological progress and reduce the amount of time we have to make sure those technologies are safe. If we’re unable to ensure the safety of new technologies, those technologies could pose an existential threat to an enormous number of sentient beings. This growth is also likely to exacerbate global warming, potentially affecting a large number of people – though not, we suspect, to a high degree in most cases.
Another possible effect of poverty alleviation may be to increase total population size, which may contribute to climate change and exacerbate problems like poverty, rather than solving them. We think it’s quite unlikely that this is the case, as birth rate generally (though not always) decreases as poverty goes down.
It’s also possible that by including moderately less effective charities on the Donational giving platform, a small number of donations may get diverted away from the most effective charities. We think the probability of this is fairly low for two reasons: first, the Donational platform has recommended charities that ‘nudge’ people into donating where they can do the most good. Yamey has agreed to limit these in future to charities recommended by GiveWell and Animal Charity Evaluators, plus one US-focused criminal justice organization (most likely the Texas Organizing Project). We are not confident that the criminal justice charity will be as cost-effective as the others, but donations through the platform so far suggest it will receive a relatively small minority of the donations. More importantly, we don’t expect that most people would have been giving to effective charities at all prior to CAP. We therefore think it’s unlikely that CAP will cause a substantial volume of donations to be diverted from more to less effective charities.
Finally, it’s possible that some CAP participants would eventually have heard about the principles of effective altruism from a more compelling source. It would be a loss to the movement if those individuals would have become somewhat more involved with effective altruism otherwise. However, we think the likelihood of this is quite low, primarily because effective altruism is such a small movement that very few CAP donors would have become highly engaged in it anyway.
In total, we have assigned CAP 30 expected harm points, placing it in the High category.
Robustness to Moral Uncertainty
Because we have some moral uncertainty, we want to minimize the probability that we fund projects that would be considered morally wrong were we to hold a different worldview. To account for this, we look for moral assumptions we could make under which a given project would have the potential to cause harm, and favor projects that are robust to this consideration. In other words, we prefer projects that don’t appear significantly wrong under other moral frameworks. To account for the fact that we see some moral positions as more plausible than others, we give different positions different weights. Note that we are using terms like ‘worldview’ and ‘moral position’ loosely and interchangeably: they can include broad empirical beliefs, such as about the effects of free markets versus government intervention, as well as normative ones.
As a first step in this process, we brainstorm moral positions that, if true, would cause a given program to look morally wrong. We then consider how many individuals would be negatively impacted if we held that ethical position, how badly they would be affected, and the chance that the ethical position is ‘correct’. The scale, effect size, and probability of each harm are multiplied to obtain expected harm points, which are summed.
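The scoring step described above can be sketched in a few lines. This is only an illustration: the 1–3 mapping from qualitative ratings to numbers is an assumption borrowed from the criterion scores, and the second harm in the example is invented. The first uses the ratings we give below for the 'net-negative lives' concern.

```python
from dataclasses import dataclass

# Assumed mapping from qualitative ratings to numbers (not confirmed by the report).
RATING = {"Low": 1, "Medium": 2, "High": 3}


@dataclass
class Harm:
    name: str
    scale: str        # how many individuals are affected
    effect_size: str  # how badly each is affected
    probability: str  # chance the harm occurs / the moral position is 'correct'

    def points(self) -> int:
        # Scale, effect size, and probability are multiplied together.
        return RATING[self.scale] * RATING[self.effect_size] * RATING[self.probability]


def expected_harm_points(harms) -> int:
    # The products are summed across all harms considered.
    return sum(h.points() for h in harms)


harms = [
    Harm("extending net-negative lives", "Medium", "Medium", "Low"),
    Harm("hypothetical second harm", "Low", "Low", "Medium"),  # invented for illustration
]
total = expected_harm_points(harms)
```

The total would then be compared against the Low, Medium, and High thresholds, which are not reproduced here.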
We are quite averse to recommending a grant supporting a project that would cause a lot of harm to a lot of people under even just one ethical framework that we find highly plausible, and set the thresholds accordingly.
We thought of several ethical positions under which CAP could look morally wrong to varying degrees.
Some theories, especially some forms of consequentialism, consider it wrong to extend lives that are 'net-negative' (contain more suffering than happiness, roughly speaking). Opinion differs in the RG team about what proportion of lives 'saved' by the relevant charities (such as the Against Malaria Foundation and Malaria Consortium) fall into this category, how bad those lives are, and how plausible the relevant moral theories are. For consequentialists with a ‘total’ view of population ethics, one important consideration is that averting a death might not lead to more people existing in the long run, in which case the main effect of, for example, anti-malaria interventions would be improving quality rather than quantity of life. We have tentatively assigned this potential harm Medium scores for scale and effect size but Low for probability. This reflects our view that these charities are likely to have a net-positive impact overall, even if they do considerable harm to some individuals.
Additionally, a utilitarian with an average view of population ethics would consider it morally wrong to donate to charities that extend the lives of people living in poverty if those people are living below-average lives. According to this view, CAP would cause substantial harm to a moderate number of people. We think it’s likely that people living in abject poverty live ‘below-average’ lives, but consider the average view of population ethics to be implausible.
Utilitarians with a total view of population ethics may also view ending factory farming as net-negative if it turns out that most factory-farmed animals – animals who wouldn’t exist without factory farming – live net-positive lives (lives worth living). From this perspective, the relatively small proportion of CAP donations that we expect will go to animal welfare charities would cause a small amount of harm to a moderate number of individuals; the harm is small because farmed animals’ lives are presumed to be (at best) just weakly positive. While we find the total view of population ethics to be plausible, we believe the probability that most factory farmed animals live net-positive lives to be quite low. Moreover, interventions that improve animal welfare without reducing the number of animals farmed are not vulnerable to this objection.
There are also some socialist worldviews under which CAP looks actively wrong. Many argue that private philanthropy will necessarily be inefficacious if it does not lead to systemic political change (Kuper, 2002). But some also argue that private philanthropy will be actively harmful by acting as a smokescreen for the system that ultimately causes the problems private charity purports to fix (Thorup, 2015; Eikenberry & Mirabella, 2018), or by promoting negative individualistic norms that oppose more radical collectivist action (Syme, 2019). We grant that there is some plausibility to the view that collective political solutions of some kind would be preferable to private philanthropy, and that the existence of private philanthropy as a whole may reduce the extent of political action in certain areas (although the evidence for this claim is very limited). Nevertheless, we consider it very unlikely that CAP’s philanthropic efforts on the margin would cause harmful effects in this way (e.g. by promoting private philanthropy as a norm).
Similarly, others believe that charities that make top-down decisions about what the global poor ‘need’ are denying the poor the agency to identify their own needs. Again, this could be a concern of both deontologists and consequentialists. According to this view, which we find somewhat plausible, CAP might be causing a little bit of harm to a moderate number of people by promoting donations to global health and poverty charities (note that donations to GiveDirectly would probably be an exception).
Some people see punishment as a moral imperative, either because of a deontological commitment to retributive justice or an empirical belief in its efficacy. From this perspective, CAP may cause harm by promoting donations to criminal justice reform charities that aim to reduce rates of incarceration. We expect the harm here would be quite small, for several reasons: we don’t expect the criminal justice reform charities to greatly decrease the use of retributive punishments; there is arguably little evidence that the kind of reforms promoted by the relevant charities would increase crime or cause other problems; and even from a retributive perspective, many punishments in the US justice system are probably disproportionate.
Finally, some people with rights-based normative views maintain that killing and eating animals is a human right. By promoting donations to animal welfare charities that explicitly aim to end factory farming, CAP would cause a relatively modest amount of harm. We find the perspective that killing and eating animals is a human right irrespective of the suffering caused by this practice to be pretty implausible. Even if there is a right to kill animals, a shift from factory farming to more humane forms of animal agriculture seems unlikely to violate it.
In total, we assigned 25 expected harm points, which is in the Medium category.
Project Potential Score
To assess the overall project potential, we aggregate our ‘qualitative’ scores for all criteria. Each team member assigns a weight to each criterion, based on factors such as how important they think the criterion itself is, and how well they think our score captures the criterion in this particular evaluation. The average of team members’ weights determines the weighted scores, which are summed to create the final Project Potential Score (PPS). The project potential can be described as Low, Medium, or High, as shown in the table below.
Our team members all gave a large majority of the weight to the cost-effectiveness estimate. Notwithstanding its many shortcomings, it was based on a more rigorous analysis than our scores for other criteria. This resulted in a final Project Potential Score of 1.27, which is in the Low category.
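As a rough illustration of the aggregation, the sketch below averages each criterion's weight across team members, normalizes the averages so they sum to one, and takes the weighted sum of the 1–3 scores. The criterion names, scores, and weights in the example are hypothetical and are not the ones that produced the 1.27 above.

```python
def average_weights(team_weights):
    """Average each criterion's weight across team members."""
    criteria = team_weights[0].keys()
    n = len(team_weights)
    return {c: sum(w[c] for w in team_weights) / n for c in criteria}


def project_potential_score(scores, team_weights):
    """Weighted average of criterion scores (1 = Low, 2 = Medium, 3 = High)."""
    weights = average_weights(team_weights)
    total = sum(weights.values())  # normalize in case weights do not sum to 1
    return sum(scores[c] * weights[c] / total for c in scores)


# Hypothetical inputs: two team members, two criteria.
scores = {"cost_effectiveness": 1, "indirect_benefits": 3}
team_weights = [
    {"cost_effectiveness": 0.8, "indirect_benefits": 0.2},
    {"cost_effectiveness": 0.6, "indirect_benefits": 0.4},
]
pps = project_potential_score(scores, team_weights)
```

Because most of the weight sits on a criterion scored Low, the result stays close to 1 even when another criterion scores High, which mirrors the pattern in our own result.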
The Rethink Grants team votes on every grant after all team members have reviewed the grant evaluation and accompanying analyses. The number of votes needed to recommend that grant depends on the Project Potential Score.
Proposals receiving the requisite number of votes are recommended to grantmakers in our network. Grants that are not recommended for funding are given detailed feedback, including:
- The grant evaluation report, including the cost-effectiveness analysis (with sensitivity analyses highlighting the main sources of uncertainty) and scores for each of the criteria.
- A summary of the project’s main strengths.
- The considerations that weighed against the project. These include both:
  - A summary of the criteria that the project scored poorly on and to which the Project Potential Score was highly sensitive.
  - A summary of the criteria that the RG team members who voted against the grant stated were most important to their vote.
- A set of key recommendations to improve the project based on its most substantial shortcomings.
The Rethink Grants team has unanimously decided not to recommend funding for a full-scale Corporate Ambassador Program at this time. This is based heavily on our cost-effectiveness analysis, which suggests it is unlikely to be worthwhile at any reasonable cost-effectiveness threshold, at least in its proposed form.
However, we have also decided by consensus to recommend funding of up to $40,000 to run a pilot study. This is primarily based on three considerations:
- Our value of information analysis suggests the pilot would resolve more than enough uncertainty to justify its cost.
- There are reasons for thinking our cost-effectiveness estimate may be conservative, especially compared to analyses of similar programs.
- There is a good chance that the number of dollars donated as a result of the pilot would be at least as high as the number spent to run it, which for some donors could make it a low-risk opportunity.
If Rethink Grants continues as a project, we will evaluate the results of each grant we make using a Grant Follow-Up Plan. This plan outlines the grant timeline, along with key timepoints when we’ll check in with the grantee.
Each of those timepoints has an associated set of metrics of success. These metrics include interim indicators – things like growth in team size – as well as outcome measures like donations moved to effective charities.
We work with the grantee to set reasonable goals for each of those metrics, and then compare those goals with reality during the scheduled check-ins. This helps us understand the impact of our grants, and also helps us identify grantees that would benefit from additional support.
Separately, we make a set of public forecasts, estimating the likelihood that the grantee achieves the goals outlined in the Grant Follow-Up Plan. This gives us the chance to evaluate our grant evaluation process and judgment. Again, this is conditional on the continuation of RG; due to time constraints, we will not be doing it for this evaluation.
Appendix 1: Challenges of eliciting and aggregating probability distributions
This appendix outlines some of the challenges we encountered when eliciting, fitting, and aggregating parameter estimates, and suggests some ways of improving the process.
There were a couple of technical issues with fitting. First, one of the Rethink Grants team members (TMs) declined to give inputs for four parameters but blank cells were not permitted by the script. We filled those cells with another TM's values, but gave a confidence score of 0, which effectively excludes them from the analysis. Second, SHELF requires pc5 (the 5th percentile) to be higher than L (the lower plausible limit), M (median) to be higher than pc5, and so on, but some TMs used the same value for two inputs. For example, some thought there was a greater than 5% chance of obtaining no pledges, so they input 0 for both L and pc5. The proper way to deal with this is to first elicit the probability it is zero, then elicit the inputs given that it is not zero, but this would have added considerable time and complexity to the process. Instead, we simply changed the higher percentile to a number slightly higher than the lower input, such as 0.000001.
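The tie-breaking workaround can be sketched as follows. This is a minimal illustration in Python rather than the R script we used with SHELF; the function name and the input ordering (L, pc5, M, pc95, upper limit) are assumptions.

```python
def break_ties(values, eps=1e-6):
    """Return a strictly increasing copy of elicited inputs.

    `values` is assumed to be ordered as [L, pc5, M, pc95, U]. Where an
    input equals the one before it, it is nudged up by `eps`, mirroring
    the workaround described above. Decreasing inputs are treated as an
    error rather than silently repaired.
    """
    fixed = [values[0]]
    for v in values[1:]:
        prev = fixed[-1]
        if v < prev:
            raise ValueError(f"inputs must be non-decreasing, got {v} after {prev}")
        fixed.append(v if v > prev else prev + eps)
    return fixed


# A TM who thinks there is a >5% chance of no pledges enters 0 for both L and pc5.
adjusted = break_ties([0, 0, 3, 10, 20])
```

This keeps the fitting step running, but it does not capture a point mass at zero, which is why eliciting the probability of zero separately would be the more principled approach.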
More concerningly, spot-checks revealed a considerable disparity between many of the inputs and the fitted percentiles. In most cases, there was a close match with either pc5 and M, or M and pc95, but not all three. For example, predicted pledge numbers of [1, 2, 10] were fit to a lognormal distribution with 5th, 50th, and 95th percentiles of roughly [1, 2, 4], and 1st-year donor churn of [0.125, 0.7, 0.8] became a beta with [0.6, 0.7, 0.8]. This was not due to some technical error, but simply because no standard distribution would fit all the inputs.
In order to prioritise further investigations, we added columns indicating the size of the greatest disparity for each parameter, i.e. the highest percentage difference between the inputs and fitted distributions for the 5th, 50th, and 95th percentiles. A substantial majority of the disparities favored Do Nothing (no intervention); that is, reducing the disparity would increase the estimated cost-effectiveness of CAP, suggesting the CEA may be ‘biased’ against the program. To get some idea of the magnitude of this effect, we created a copy of the inputs and (very imprecisely) modified all the ones with a disparity of greater than 25% for those with a priority of 5, and greater than 50% for the rest, so that they were roughly ‘neutral’ or favored CAP, e.g. the [1, 2, 10] parameter mentioned above was fit to a gamma with percentiles of [2, 5, 10]. (The parameters favoring Do Nothing were left unchanged.) This caused the donation-cost ratio and expected value of perfect information to increase dramatically, suggesting the results were sensitive to uncertainties around the elicitation and fitting of inputs.
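The disparity measure can be illustrated with the two examples given above. The sketch assumes the percentage difference is taken relative to the elicited input; the report does not specify the denominator, so treat that as an assumption.

```python
def max_disparity(inputs, fitted):
    """Largest relative gap between elicited and fitted percentiles.

    `inputs` and `fitted` are the 5th, 50th, and 95th percentiles. The
    gap is expressed as a fraction of the elicited value (an assumption).
    """
    gaps = []
    for x, f in zip(inputs, fitted):
        if x == 0:
            # Relative difference is undefined at zero; flag any mismatch as infinite.
            gaps.append(0.0 if f == 0 else float("inf"))
        else:
            gaps.append(abs(f - x) / abs(x))
    return max(gaps)


pledges = max_disparity([1, 2, 10], [1, 2, 4])            # 95th percentile: 4 vs 10
churn = max_disparity([0.125, 0.7, 0.8], [0.6, 0.7, 0.8])  # 5th percentile: 0.6 vs 0.125
```

Under this definition the pledge example has a disparity of 60% and the churn example 380%, both driven by a single badly fitted tail.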
We therefore decided to re-estimate some inputs. The TMs reconsidered any inputs that met the following criteria:
- Priority 5 and disparity >50%
- Priority 4 and disparity >100%
- Priority 3 and disparity >200%
- Priority <3 and disparity >300%
Those who had enough time also did the following:
- Priority 5 and disparity >25%
- Priority 4 and disparity >50%
- Priority <4 and disparity >100%
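The two sets of criteria amount to a threshold lookup, sketched below. The function name and the `extra_time` flag are illustrative; priorities are assumed to be integers from 1 to 5 and disparities are in percent.

```python
def needs_refit(priority, disparity_pct, extra_time=False):
    """Return True if an input meets the re-estimation criteria above.

    With `extra_time=True`, the stricter second list applies; its
    thresholds are all lower than those in the first list, so it
    subsumes them.
    """
    if extra_time:
        if priority == 5:
            threshold = 25
        elif priority == 4:
            threshold = 50
        else:
            threshold = 100
    else:
        threshold = {5: 50, 4: 100, 3: 200}.get(priority, 300)
    return disparity_pct > threshold


# A priority-5 input with a 40% disparity is only revisited by TMs with extra time.
revisit_default = needs_refit(5, 40)
revisit_extra = needs_refit(5, 40, extra_time=True)
```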
This time, the TMs created the distributions in SHELF and adjusted the inputs so that, where possible, the fitted distributions closely matched their beliefs. These are used in the base case analysis, though it is possible to run the analysis with the original distributions (which are generally more pessimistic) by changing the “Fitting switch” cell in the Team Inputs sheet. TM_5 did not have time to redo their inputs so all those meeting the criteria above were excluded from the analysis in the base case, though they can be included using the “Non-refits” switch.
After all of this, we still found a small number of extreme outliers in the total net costs, as high as several billion dollars. We tracked these down to some implausible inputs by three TMs. In particular, they gave a lower plausible limit of 0 or 1 for the number of ambassadors per manager (#17); this was intended to reflect the number of volunteers that one manager could realistically handle, but was interpreted as an indication of the success of the program. The total cost of ambassador manager salaries is calculated as the number of ambassadors divided by the number of ambassadors per manager, so occasionally the simulations would produce a scenario like:
- ambassadors = 500
- ambassador manager salary = 80,000
- ambassadors per manager = 0.05
- total ambassador manager cost = (500/0.05)*80,000 = $800,000,000
Something similar happened with the volume of donations processed per hour of developer time (#23), which is used to determine developer costs. TM_4 updated their inputs, but TM_5 and TM_6 did not have time, so theirs were excluded from the base case analysis. Those inputs can be included using the “Implausible” switch in the Team Inputs sheet.
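The ambassador manager example shows how a divisor close to zero inflates a cost. The short sketch below reproduces the scenario listed above and compares it with a hypothetical value of 20 ambassadors per manager, which is chosen purely for illustration.

```python
def manager_cost(ambassadors, ambassadors_per_manager, salary):
    """Total manager salary cost: managers needed times salary."""
    if ambassadors_per_manager <= 0:
        raise ValueError("ambassadors per manager must be positive")
    return ambassadors / ambassadors_per_manager * salary


extreme = manager_cost(500, 0.05, 80_000)  # the sampled scenario described above
typical = manager_cost(500, 20, 80_000)    # 20 per manager is a hypothetical value
```

With 0.05 ambassadors per manager the model implies 10,000 managers and a cost of $800 million, whereas the hypothetical 20 per manager implies 25 managers and $2 million. A lower limit near zero therefore gives the simulation room to produce costs several orders of magnitude too high.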
Clearly, the parameter elicitation and aggregation process did not go as smoothly as hoped. This was largely due to the lack of calibration training, lack of time to gather and fully consider relevant information, lack of familiarity with the SHELF software, and perhaps lack of clarity in some parameter descriptions. Below is a small selection of alternatives that we may consider for future evaluations.
- Reduce the number of estimates. The lead analyst, and perhaps one or two other TMs, could create the distributions alone. These could be modified informally in response to feedback from the rest of the team. Or perhaps the whole team could provide distributions for the most important few parameters, leaving the rest to the lead analyst.
- Divide the parameters among the team. Much as GiveWell has “parameter owners”, each TM could be put in charge of gathering relevant information for a subset of the inputs. Taking this further, pairs of TMs – one closely involved in the evaluation and one more detached – could be solely in charge of providing the probability distributions for the parameters they ‘own’.
- Outsource estimates to calibrated forecasters. Individuals with proven ability to make accurate predictions may come up with better inputs for some parameters than TMs can.
- Use different software. Foretold, an ongoing project by Ozzie Gooen, will provide a more user-friendly interface for creating and combining probability distributions.
- Invest more time. If Rethink Grants continues, it may be worth all those involved undergoing calibration training, becoming comfortable with the relevant software, and spending considerably longer gathering relevant information and creating distributions, at least for the most sensitive parameters. Ideally, we would follow the full SHELF protocol, which involves immediate feedback, discussion, and construction of a consensus distribution in a workshop environment.
There are major drawbacks to each of these, and it will take some trial and error to determine the best approach.
Appendix 2: Calculating the CEACs, CEAF, and EVPI
This appendix outlines the steps followed in this model for calculating the cost-effectiveness acceptability curves and frontier, and the expected value of perfect information. These are easier to understand while looking at the relevant sections of the Probabilistic Analysis worksheet.
- A CEAC represents the probability of an intervention being cost-effective at a range of cost-effectiveness thresholds (in this case minimum acceptable donation-cost ratios). The most cost-effective option is defined as the one with the highest net benefit, which can be expressed either in terms of the costs (net monetary benefit) or outcomes (such as net health benefit).
- We first calculated the net monetary benefit (NMB) for each simulation (each row of PSA samples). The NMB is the value of the outcomes – in this case donations – converted into the same units as the costs, minus the costs. The value of a unit of outcomes is determined by the cost-effectiveness threshold, which in principle represents the opportunity cost, e.g. a minDCR of 3x implies that $3 of donations is worth $1 of expenditure. So the formula for NMB is [donations]/[minDCR]-[costs], e.g. at a minDCR of 3x, $1,000 donated at a cost of $500 would be (1000/3)-500 = -$167; at 1x, it would be (1000/1)-500 = $500; and at 10x, (1000/10)-500 = -$400.
- For each simulation, we recorded a 1 if the NMB for CAP was positive (i.e. cost-effective) and 0 if negative (not cost-effective). The average of those values across all simulations represents the probability that it is cost-effective at the specified minDCR (the “prob.ce” cell).
- We then used a macro (the “Draw CEAC + CEAF” button) to generate a list of probabilities that CAP is cost-effective at different thresholds, from 0.1x to 10x. This was plotted on a graph, alongside the CEAC of Do Nothing (which is just the mirror image of the CAP CEAC, since the CAP figures are all relative to no intervention).
- A CEAF represents the probability that the option with the highest probability of being cost-effective (as indicated by the CEACs) is optimal (has the highest expected net benefit) at various cost-effectiveness thresholds. In most cases, the intervention that is most likely to be cost-effective will also maximize expected net benefit, but this is not the case when the distribution of net benefit is skewed, with a mean different from the median, so it is usually worth doing both.
- For the current threshold, if CAP had the highest mean NMB, we recorded the probability that CAP is cost-effective; and if Do Nothing had the highest NMB, we recorded the probability that Do Nothing was cost-effective (“live.ceaf” cell).
- We used a macro (“Draw CEAC + CEAF” button) to repeat this for all thresholds between 0.1x and 10x, and those values were plotted on a graph.
- For clarity, we also recorded the optimal option at each threshold, and the error probability – the chance the optimal option was not the most cost-effective.
- The EVPI is the expected value of removing all uncertainty. It can be thought of as the cost of being wrong, which is the difference between the value of always making the right choice, and the value of making the choice implied by current information.
- First, the NMB for Do Nothing and CAP were calculated in the same way as for the CEAC. To reiterate, the NMB depends on the minDCR.
- For clarity, the optimal intervention (the one with the highest NMB out of Do Nothing and CAP) was recorded for each simulation, and for the mean NMB.
- The NMB of the optimal intervention was also recorded for each simulation. The average of these values (the “max.nb” cell) is the expected value of always making the right choice of intervention.
- The EVPI (“evpi” cell) was then calculated as the NMB of always being right (“max.nb”) minus the NMB of the intervention that we would choose given current information (the one with the highest expected NMB). If CAP is not cost-effective (NMB <0), the highest NMB is 0 (Do Nothing), in which case the EVPI and the “max.nb” are the same.
- We used a macro (“Draw EVPI”) to generate the EVPI at different thresholds, from 0.1x to 10x, and plotted these values on a graph.
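The steps above can be sketched outside the spreadsheet. The code below is a simplified Python illustration, not a port of the Excel macros: it assumes each simulation is a (donations, costs) pair for CAP relative to Do Nothing, so the NMB of Do Nothing is always 0. The function names and the sample values are hypothetical.

```python
def nmb(donations, costs, min_dcr):
    """Net monetary benefit: donations converted to cost units, minus costs."""
    return donations / min_dcr - costs


def analyse(samples, min_dcr):
    """Return CEAC, CEAF, and EVPI values at one threshold.

    `samples` is a list of (donations, costs) pairs, one per simulation.
    """
    cap = [nmb(d, c, min_dcr) for d, c in samples]
    n = len(cap)

    # CEAC: share of simulations in which CAP has a positive NMB.
    prob_ce = sum(1 for x in cap if x > 0) / n

    # CEAF: probability of being cost-effective for the option with the
    # highest mean NMB (Do Nothing has a mean NMB of 0).
    mean_cap = sum(cap) / n
    cap_optimal = mean_cap > 0
    ceaf = prob_ce if cap_optimal else 1 - prob_ce

    # EVPI: value of always choosing correctly, minus the value of the
    # choice implied by current information.
    max_nb = sum(max(x, 0.0) for x in cap) / n
    evpi = max_nb - max(mean_cap, 0.0)

    return {
        "prob_ce": prob_ce,
        "optimal": "CAP" if cap_optimal else "Do Nothing",
        "ceaf": ceaf,
        "error_probability": 1 - ceaf,
        "max_nb": max_nb,
        "evpi": evpi,
    }


# Three hypothetical simulations, analysed at thresholds of 1x and 3x.
samples = [(1000, 500), (3000, 500), (0, 500)]
at_1x = analyse(samples, 1)
at_3x = analyse(samples, 3)
```

Repeating `analyse` over a grid of thresholds from 0.1x to 10x would give the points plotted as the CEAC, CEAF, and EVPI curves. In this toy example CAP is optimal at 1x but Do Nothing is optimal at 3x, where the EVPI equals `max_nb`, as noted above.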
Barton, G. R., Briggs, A. H., & Fenwick, E. A. L. (2008). Optimal Cost-Effectiveness Decisions: The Role of the Cost-Effectiveness Acceptability Curve (CEAC), the Cost-Effectiveness Acceptability Frontier (CEAF), and the Expected Value of Perfection Information (EVPI). Value in Health, 11(5), 886–897. https://doi.org/10.1111/j.1524-4733.2008.00358.x
Black, W. C. (1990). The CE Plane: A Graphic Representation of Cost-Effectiveness. Medical Decision Making, 10(3), 212–214. https://doi.org/10.1177/0272989X9001000308
Briggs, A., Sculpher, M., & Claxton, K. (2006). Decision Modelling for Health Economic Evaluation. OUP Oxford.
Briggs, A. H., Weinstein, M. C., Fenwick, E. A. L., Karnon, J., Sculpher, M. J., & Paltiel, A. D. (2012). Model Parameter Estimation and Uncertainty Analysis: A Report of the ISPOR-SMDM Modeling Good Research Practices Task Force Working Group–6. Medical Decision Making, 32(5), 722–732. https://doi.org/10.1177/0272989X12458348
Cassar, G. (2010). Are individuals entering self-employment overly optimistic? an empirical test of plans and projections on nascent entrepreneur expectations. Strategic Management Journal, 31(8), 822–840. https://doi.org/10.1002/smj.833
Claxton, K. (2008). Exploring Uncertainty in Cost-Effectiveness Analysis. PharmacoEconomics, 26(9), 781–798. https://doi.org/10.2165/00019053-200826090-00008
Eikenberry, A. M., & Mirabella, R. M. (2018). Extreme Philanthropy: Philanthrocapitalism, Effective Altruism, and the Discourse of Neoliberalism. PS: Political Science & Politics, 51(1), 43–47. https://doi.org/10.1017/S1049096517001378
European Food Safety Authority. (2014). Guidance on Expert Knowledge Elicitation in Food and Feed Safety Risk Assessment. EFSA Journal, 12(6). https://doi.org/10.2903/j.efsa.2014.3734
O’Hagan, A., et al. (2006). Uncertain Judgements: Eliciting Experts’ Probabilities. London ; Hoboken, NJ: John Wiley & Sons.
Jit, M., Hutubessy, R., Png, M. E., Sundaram, N., Audimulam, J., Salim, S., & Yoong, J. (2015). The broader economic impact of vaccination: Reviewing and appraising the strength of evidence. BMC Medicine, 13(1). https://doi.org/10.1186/s12916-015-0446-9
Kuper, A. (2002). Global Poverty Relief–More Than Charity: Cosmopolitan Alternatives to the Singer Solution. Ethics and International Affairs, 16(1), 107–120. https://doi.org/10.1111/j.1747-7093.2002.tb00378.x
Ozawa, S., Clark, S., Portnoy, A., Grewal, S., Brenzel, L., & Walker, D. G. (2016). Return On Investment From Childhood Immunization In Low- And Middle-Income Countries, 2011–20. Health Affairs, 35(2), 199-207. https://doi.org/10.1377/hlthaff.2015.1086
Peasgood, T., Foster, D., & Dolan, P. (2019). Priority Setting in Healthcare Through the Lens of Happiness. In Global Happiness and Wellbeing Policy Report 2019 (pp. 28–51).
Strong, M., Oakley, J. E., Brennan, A., & Breeze, P. (2015). Estimating the Expected Value of Sample Information Using the Probabilistic Sensitivity Analysis Sample. Medical Decision Making, 35(5), 570–583. https://doi.org/10.1177/0272989X15575286
Syme, T. (2019). Charity vs. Revolution: Effective Altruism and the Systemic Change Objection. Ethical Theory and Moral Practice, 22(1), 93–120. https://doi.org/10.1007/s10677-019-09979-5
Thorup, M. (2015). Pro Bono? Winchester, UK ; Washington, USA: Zero Books.
Wilson, E. C. F. (2015). A Practical Guide to Value of Information Analysis. PharmacoEconomics, 33(2), 105–121. https://doi.org/10.1007/s40273-014-0219-x
This report is a joint project of Rethink Priorities and Rethink Charity. It was written by Derek Foster, Luisa Rodriguez, and Tee Barnett. Special thanks to Ian Yamey of Donational for patiently working through the entire RG evaluation process; and to Wael Mohammed for writing or improving most of the Excel macros. Thanks also to Marcus A. Davis, David Moss, Peter Hurford, Ozzie Gooen, Rossa O’Keeffe-O’Donovan, Rob Struck, Jon Behar, Jeremy Oakley, Matt Stevenson, Andrew Metry, and several anonymous individuals for providing valuable information, technical assistance, and feedback.
If you like our work, please consider subscribing to the Rethink Priorities newsletter. You can see all our publications to date here.
For example, -500/10 and 500/-10 both equal -50, so saving 10 lives with savings of $500 (a very good situation) gives the same cost-effectiveness ratio as causing 10 deaths at a cost of $500 (a terrible situation). Even a positive CEE can be misleading: a ratio of two negative numbers gives a positive, so 10 deaths caused with savings of $500 would give the same CEE ($50) as 10 lives saved at a cost of $500. There is no obvious way of avoiding this problem in Guesstimate, where the CEE would have to be a probability distribution that itself is a ratio of distributions for costs and effects (either or both of which could include values below zero, even if the means were positive). In spreadsheets like Excel, we can measure uncertainty in other ways, as explained later in this analysis. ↩︎
Note that, to put the costs and effects in the same units, the value of donations must be converted into equivalent dollars of expenditure. The ‘exchange rate’ depends on the cost-effectiveness threshold: a minDCR of 3 implies $3 donated is only ‘worth’ the same as $1 in expenditure, so to calculate the net benefit, donations are divided by 3 before the costs are subtracted. This is explained further in Appendix 2. ↩︎
Even and especially in cases where a leadership transition requires a smooth handoff. ↩︎
I'd be interested in more elaboration on what kinds of grants you may evaluate in the future and more generally your place and comparative advantage in the EA grantmaking ecosystem. E.g., should people with ideas get in touch with you? How could you see yourself collaborating with other grantmakers? How did you decide to look into Donational?
Hey Jonas, apologies about the delay in replying here. Much will depend on whether we move forward with the program based on our own internal assessment of its potential and feedback we received from the community, especially those with an interest in grant making and community building via funding projects.
We loosely outline our remit and purpose in the introduction section and our current plan is to help potentially promising projects that would clearly benefit from the “early-stage planning, facilitating networking opportunities, and other as-needed efforts traditionally subsumed under project incubation” that we want to provide. Projects can often use assistance of this sort, and similar to some VC models, we hope that conducting a thorough and transparent evaluation of their program will be helpful to show others when seeking funding traction. A perennial issue for existing grant makers is a lack of projects or research that are prepared to execute for one reason or another, and RG would hope to put time and resources into making a project ready and fundable. This role is meant to complement the existing landscape.
As we mention in the OP, we do not currently fund projects ourselves - our goal is to help improve and recommend worthy projects to existing funders at this point. Given that many of the methods in this report are widely applicable, RG could also investigate and evaluate projects on behalf of existing grant makers or individual funders in cases where our interests align. In this case, were a potential funder interested in looking into a “shovel-ready” or existing project, we could be contracted to assess it more thoroughly.
As for sourcing applications, we mention in the Our Process section that Rethink Grants will begin with an in-network approach to sourcing projects, relying on trusted referrals to help us reach out to promising individuals and organizations. If RG continues to conduct evaluations, we would then consider projects on a rolling basis. A project that seems potentially cost-effective, is run by a high-quality team, and has room for more funding moves forward through our evaluation process. We decided to look into Donational because it appeared to be a high potential project that satisfied these requirements.
Thank you to those who had a look at this report. Our team put a lot into this as you might imagine. I’ve been anticipating some commentary in this evaluation along the lines of “this is far too complex/quantitative for a $40,000 grant recommendation.” We’d agree. We gesture at this in the “The future of Rethink Grants” section at the end of the Executive Summary.
This could have perhaps been communicated better, but my hope is that readers will come to interpret this report, and the methods employed therein, as additional tools to consider when evaluating grants. There may be occasions where evaluators might find it useful to boost their repertoire by using these methods (or something similar) to potentially make better decisions. Project leads may also get some mileage out of how much we’ve put on display here.
There are certain instances where key reasoning (see Team Strength section), or quick deferral to experts, or even a simple back-of-the-envelope calculation (BOTEC) will suffice. But as with charity evaluation, we might agree, there are circumstances where intuition and BOTECs are not enough. As an example from this report that Derek mentions, the VOI calculation and CEE led us to a more nuanced conclusion: that funding a decent-sized pilot was very much worth doing in our opinion, rather than fully funding the project from the outset or passing over the opportunity. Our conclusions from just a BOTEC might have been different.
I think we have good reason to believe that the level of rigor displayed in this evaluation is warranted at times. And when those situations arise, we hope others will reach for this report if they’ve found it useful.
Looks like you inputted the wrong table in the 'Indirect Harms' section.
Thanks - should be fixed now.
Thanks for putting this together, I think this is an exciting report and project.
I mostly agree with Habryka's points.
I have another minor point:
I feel like it's odd to categorize the former example as "indirect benefits". I think a cost-effectiveness model should aim to capture the overall expected impact of all the charities by applying some "impact-adjusted money moved" metric. (If you're evaluating from a long-termist perspective, this would mean a long-termist perspective on all supported charities.) Otherwise, any project that involves some amount of leverage on various other organizations will always have high indirect benefits and harms, which makes the overall rating non-informative.
I agree that "help create a broader culture of effective giving at US workplaces" is a good example of an indirect benefit.
Again, the same points seem to hold here; I think this should already be factored into the cost-effectiveness estimates.
(I realize my explanation of my view is a bit vague; I have a pretty strong intuition here and it would take me more time to think about it more and really explain it in depth.)
Thanks Jonas. Sorry for the slow reply - I took some time off and am now travelling.
I was also uneasy about counting indirect effects of the beneficiary charities rather than CAP itself, though I'm not sure it's for exactly the same reasons as you. The measure of benefit in the CEA was dollars counterfactually moved to the recommended charities (or equivalent), which seems to implicitly cover all consequences of those dollars. Considering indirect effects separately, with the implication that these consequences are in addition to the dollars moved, risks double-counting.
I'm not sure if that's what you're getting at, or if you also think we should be explicitly modelling the indirect effects of all the beneficiary charities in the CEA. I don't think the latter is really feasible, for several reasons. For a start, we would need some universal metric for capturing both the direct and indirect consequences. The best one would probably be subjective wellbeing, but figuring out the SWB associated with a wide range of circumstances, across a large number of species, is not within the scope of an evaluation like this. Another issue is that indirect effects would likely overwhelm the direct ones, and it isn't clear that this is appropriate given their far more speculative nature. There would probably have to be some kind of weighting system that discounts the more uncertain predictions (could be a fully Bayesian approach, or a simpler option like this), but implementing such a system itself would be very hard to do well, and perhaps less useful than highlighting them and scoring them much more subjectively as we've done here.
I would add that one advantage of explicitly considering the indirect effects of charities is that it makes them more salient. I'd imagine most donors/funders would only really think of the direct benefits when deciding whether a project like CAP (or the charities themselves) is worthwhile, so it helps to highlight the other issues at stake. This consideration may outweigh the methodological concerns with 'double-counting' etc.
It took me a while to finish reading this report, so I know I’m joining this comment thread late. There were lots of good thoughts shared in other comments, so I’ll try to focus my thoughts on areas that are different:
---There could be a donor coordination problem for this particular opportunity. Donational might not be able to do a pilot with less than $40k, and anything over $40k would be more than needed. How do donors know if they are fully funded?
---Have you seen the write-ups of ImpactMatters (https://www.impactmatters.org/)? If not, it may be worth looking at them. That organization was founded by Dean Karlan, who is one of the top economists doing randomized controlled trials of development programs.
---In terms of the write-up, there was some discussion of certain key points not coming through strongly. To address this, one idea would be to put a lot of the details from the main body into an appendix.
---This report focuses a lot on the sample evaluation of Donational, but the summary of Rethink Grants suggests that it is trying to do more than just evaluate opportunities and promote the good ones. If I’m understanding correctly that Rethink Grants is also doing things to try to make the underlying organizations better, then it might be great to have more details on that.
It’s great to see more efforts to evaluate and promote top giving opportunities. Rethink Grants seems promising and I'm interested in seeing where it goes.
Hey Eric, we appreciate the kind words and thank you for taking the time to bring some of these things to our attention.
Great question - were RG to continue on, the idea was for us to be quite involved in the fundraising process for recommended projects. If Donational were interested in continuing with the CAP, we would likely engage in a joint fundraising effort where we would take special care to keep key funders and the wider public in the loop regarding fundraising milestones and progress. This could even take the form of a public fundraising campaign in certain cases.
We have! In fact, Luisa Rodriguez, one of the Rethink Priorities analysts on this report, is a former ImpactMatters research analyst. ImpactMatters was also among the organizations that we drew inspiration from for Our Process.
This could certainly be helpful. I think a lot more could be done to better highlight key reasoning within future potential evaluations, including detailed notes on criteria that were important in the VOI, for example.
That’s correct, and now that you mention it, future reports could expand more on all the possible intervention points that RG would consider for improving the overall quality of projects. I cover quite a bit of that in this reply in a different thread. As an example from this report, in the Potential Issues section, we mention pretty large plan changes that came from presenting the founder with a BOTEC based on a handful of parameters that we considered crucial. Having this all mapped out in one place would certainly be better.
On the whole: Interesting write-up, certainly works as an intro to how EA forecasting/impact-estimate techniques can be applied in great depth. I'd read any of these that came out in the future, and can already think of organizations I'd be curious to see evaluated in this way.
This is similar to Jonas's second comment, but it seems like concerns about the indirect harms of economic growth or "the reduction of agency" are both constants in any evaluation of any program in global poverty/development, or which encourages donations to said programs.
Perhaps this indicates that your models could be filled in over time with "default scores" that apply to all projects within a certain area? For example, any program aiming to reduce poverty could get the same "indirect harm" scores as this project's anti-poverty side.
Something also feels off about noting the potential harms of effects which are generally very good. I'm having trouble coming up with a formal explanation for some reason, so I'll write this out informally:
If I'm considering funds for a project to reduce poverty and grow the economy, and someone tells me that doing so could increase the number of animals that get eaten... these two effects scale together. The more that poverty is reduced, and the faster the economy grows, the more animals are likely to be eaten. This is an "indirect bad outcome" that I'm actually happy to see in some sense, because the existence of the bad outcome indicates that I succeeded in my primary goal.
It's as though someone were to warn me about donating to an X-risk-reduction organization by pointing out that more humans living their lives implies that more humans will get cancer. Cancer is definitely a harm related to being alive, but it's one that I'm implicitly prepared to accept in the course of helping more humans exist. If you came back from some point in the future to tell me that the cancer rate had remained constant over time, but that ten billion humans had cancer, I'd probably be very happy to hear the news, because it would imply the existence of hundreds of billions of cancer-free humans spread out across planets or artificial interstellar habitats.
Meanwhile, if someone were to tell me that they think cancer is so bad that it makes additional years of life net-negative, I'd tell them to support promising cancer treatments rather than even considering projects that create more years of human life.
If a socialist tells me they're concerned about poor people losing autonomy as a result of charitable giving, my response would be something like: "...okay. That's going to be par for the course in this entire category of projects. By the project's very nature, it should be clear that it's not something you'll want to support if you generally oppose charity." And then I'd produce a report intended for people who do believe charity generally does more good than harm, because they're the only ones who might actually satisfy their values by donating.
This is oversimplified and unsophisticated, and it's easy to think of counterarguments. But I still feel as though "losing autonomy" is a very different kind of concern than, say, "Donational falls apart, and its initial corporate partners become much less likely to run effective giving programs in the future". The latter is inherent to specific features of Donational, rather than specific features of charitable giving, so it helps me decide between Donational and other charitable projects. The former doesn't help me make that choice.
On another note, I second some of Oli's concerns; I wish the section on Donational's basic strategy had been a lot longer. Things I don't think were addressed:
On the whole, while I understand that this is an atypical organization to evaluate in this way, I felt like I was seeing many indirect/correlational measures of success ("CEOs with these traits tend to run good organizations") and little explicit discussion of the program's strategy ("this is how they plan to find companies who might be good customers for their service"). I generally prioritize the latter ahead of the former.
Many thanks for the comments, and sorry for the slow reply - I've been travelling. Currently very jet-lagged and a bit ill so let me know if the following isn't clear/sensible. Also, as in all my responses, these views are my own and not necessarily shared by the whole Rethink team.
>Perhaps this indicates that your models could be filled in over time with "default scores" that apply to all projects within a certain area? For example, any program aiming to reduce poverty could get the same "indirect harm" scores as this project's anti-poverty side.
I agree. If RG continues, we may well standardise indirect effect scores to some extent, perhaps publishing fairly in-depth analyses of the various issues in separate posts. We discussed this early on in the process but it didn't make sense to do it for an initial 'experimental' evaluation.
>Something also feels off about noting the potential harms of effects which are generally very good ... The more that poverty is reduced, and the faster the economy grows, the more animals are likely to be eaten. This is an "indirect bad outcome" that I'm actually happy to see in some sense, because the existence of the bad outcome indicates that I succeeded in my primary goal.
I think I see where you're coming from, but this seems to constitute a general case against considering indirect effects at all. I suppose there are some that don't scale linearly with the intended effects, but I'm not sure this example does either (e.g. meat consumption may plateau above a certain income), and it strikes me as a pretty arbitrary criterion to use. (I'm not sure whether you were actually suggesting we use that.)
Maybe this comes back to the point I made to Jonas, i.e. the effect of the charities, both direct and indirect, seems to be captured by the 'counterfactual dollars moved to top charities' metric so we should perhaps only consider indirect effects of CAP itself. But it sounds like if we were just evaluating, say, AMF, you'd be opposed to considering its effects on animal consumption, which doesn't seem right to me. But maybe I'm misunderstanding you.
>It's as though someone were to warn me about donating to an X-risk-reduction organization by pointing out that more humans living their lives implies that more humans will get cancer.
This example feels different. Assuming your main goal of reducing x-risk is that humans live longer, and the main bad consequence of cancer is shortening of human lives, your primary metric - something like (wellbeing-adjusted) life-years gained - will already account for the additional cancer.
More broadly: it seems like what counts as 'indirect' depends on what you've decided a priori to be 'direct' or intended. So we haven't included, say, non-creation of net positive factory farmed animal lives as an indirect harm of charities that reduce meat consumption, because we think on average intensively-farmed animals' lives would be net negative; but I would consider something like harm to the livelihoods of farmers an indirect effect, as that is not the goal, and may not be considered at all by potential donors if it weren't highlighted. Likewise, if the goal of a charity is to mitigate global warming by reducing meat consumption, the effect on animal welfare would be an indirect effect; but because the main aim of the ACE-recommended animal charities seems to be reducing animal suffering, we have considered any consequences for climate change (and antibiotic resistance) to be indirect.
>If a socialist tells me they're concerned about poor people losing autonomy as a result of charitable giving, my response would be something like: "...okay. That's going to be par for the course in this entire category of projects… By the project's very nature, it should be clear that it's not something you'll want to support if you generally oppose charity."
That was in the moral uncertainty section, not indirect effects, and the point was to essentially highlight that for people with this worldview (and a range of others) this project - or at least some of the recipient charities - may look bad. So it seems broadly consistent with your suggested approach, if I've understood you correctly.
>I wish the section on Donational's basic strategy had been a lot longer.
Okay, I can see how that would have been useful. To briefly respond to some of your specific questions:
• I'm not aware of any 'competitors' as such. AFAIK there is no comparable organization operating in workplaces.
• 'Market size', or something close to it, was factored into the 'growth rate' and 'max scale' parameters of the model. We didn't provide justifications for every parameter for time and space reasons but I can dig out my notes on specific ones if you want.
• The selection of charities is going to be a trade-off between mass appeal and effectiveness. Currently Donational recommends a very broad range (based on The Life You Can Save's, I think) but we felt there were too many that were unproven or seemed likely to be orders of magnitude worse than the top charities, which undermined the impact focus. After some discussion, Ian agreed to limit them to ACE and GW top charities, plus a US criminal justice one. (If it were my program, I'd probably exclude GiveDirectly, some of the ACE charities, and the criminal justice one, and would perhaps include some longtermist options - but ultimately it's not my choice.)
I'll leave your points about strategy for Tee or Luisa to answer.
Thanks for this reply! I don't have time to engage in much more detail, but I'm now a little more uncertain that my specific qualms with indirect impact are important to the project.
I don't want to make you dig through your notes just to answer my question; I more intended to make the general point that I'd have liked to have a few more concrete facts that I could use to help me weigh Rethink's judgment. (For example, if you shared some current numbers on corporate giving, I could assign my own 'max scale' parameter and check my intuition against yours.)
Knowing that Donational started out with all or almost all TLYCS charities reduces my concern a lot. The impression I had was that they'd been working with a very broad range of charities and were radically cutting back on their selection.
>I more intended to make the general point that I'd have liked to have a few more concrete facts that I could use to help me weigh Rethink's judgment.
That's fair. Initially I was going to write a summary of our evidence and reasoning for all 42 parameters, or at least the 5-10 that the results were most sensitive to. In the end we decided against it for various reasons, e.g.:
- Some were based fairly heavily on information that had to remain confidential, so a lot would have to be redacted.
- Often the 6 team members had different rationales and drew on different information/experiences, so it would be hard in some cases to give a coherent summary.
- Sometimes team members noted their rationales in the elicitation document, but with so many parameters, there wasn't always time to do this properly. Any summary would therefore also be incomplete.
- The report was already too long and was taking too much time, so this seemed like an easy way of limiting both length and delays.
But maybe it was the wrong call.
>Knowing that Donational started out with all or almost all TLYCS charities reduces my concern a lot. The impression I had was that they'd been working with a very broad range of charities and were radically cutting back on their selection.
I would consider TLYCS's range very broad, but you may disagree. Anyway, you can see Donational's current list at https://donational.org/charities
TLYCS only endorses 22 charities, all of which work in the developing world on causes that are plausibly cost-effective on the level of some GiveWell interventions (even though evidence is fairly weak on some of them -- I recall GiveWell being more down on Zusha after their last review). This selection only looks broad if your point of comparison is another EA-aligned evaluator like GiveWell, ACE, or Founders Pledge.
Meanwhile, many charitable giving platforms/evaluators support/endorse a much wider range of nonprofits, most of them based in rich countries. Even looking only at Charity Navigator's perfect scores, you see 60 charities (only 1/4 of which are "international") -- and Charity Navigator's website includes hundreds of other favorable charity profiles. Another example: When I worked at Epic, employees could support more than 100 different charities with the company's money during the annual winter giving drive.
I also imagine that many corporate giving platforms would try to emphasize their vast selection/"the huge number of charities that have partnered with us" -- I'm impressed that Donational was selective from the beginning.
>TLYCS only endorses 22 charities, all of which work in the developing world on causes that are plausibly cost-effective on the level of some GiveWell interventions (even though evidence is fairly weak on some of them...)
It's plausible that some of these are as cost-effective as the GW top charities, but perhaps not that they are as cost-effective on average, or in expectation.
>This selection only looks narrow if your point of comparison is another EA-aligned evaluator like GiveWell, ACE, or Founders Pledge.
You mean only looks broad?
Anyway, I would agree TLYCS's selection is narrow relative to some others; just not the EA evaluators that seem like the most natural comparators.
I agree, for most values of "plausible". Otherwise, it would imply TLYCS is catching many GiveWell-tier charities GiveWell either missed or turned down, which is unlikely given their much smaller research capacity. But all TLYCS charities are in the category "things I could imagine turning out to be worthy of support from donors in EA with particular values, if more evidence arose" (which wouldn't be the case for, say, an art museum).
First of all, I think evaluations like this are quite important and a core part of what I think of as EA's value proposition. I applaud the effort and dedication that went into this report, and would like to see more people trying similar things in the future.
Tee Barnett asked me for feedback in a private message. Here is a very slightly edited version of my response (hence why it is more off-the-cuff than I would usually post on the forum):
Hmm, I don't know. I looked at the cost-effectiveness section and feel mostly that the post is overemphasizing formal models. Like, after reading the whole thing, and looking at the spreadsheet for 5 minutes I am still unable to answer the following core questions:
I think I would have preferred just one individual writing a post titled "Why I am not excited about Donational", that just tries to explain clearly, like you would in a conversation, why they don't think it's a good idea, or how they have come to change their mind.
Obviously I am strongly in favor of people doing evaluations like this, though I don't think I am a huge fan of the format that this one chose.
------- (end of quote)
On a broader level, I think there might be some philosophical assumptions about the way this post deals with modeling cause prioritization that I disagree with. I have this sense that the primary purpose of mathematical analysis in most contexts is to help someone build a deeper understanding of a problem by helping them make their assumptions explicit and to clarify the consequences of their assumptions, and that after writing down their formal models and truly understanding their consequences, most decision makers are well-advised to throw away the formal models and go with what their updated gut-sense is.
When I look at this post, I have a lot of trouble understanding the actual reasons for why someone might think Donational is a good idea, and what arguments would (and maybe have) convinced them otherwise. Instead I see a large amount of rigor being poured into a single cost-effectiveness model, with a result that I am pretty confident could have been replaced by some pretty straightforward fermi point-estimates.
I think there is nothing wrong with also doing sensitivity analyses and more complicated parameter estimation, but in this context it seems that all of that mostly obscures the core aspects of the underlying uncertainty and makes it harder both for the reader to understand what the basic case for Donational is (and why it fails), and (in my model) for the people constructing the model to actually interface with the core questions at hand.
All of this doesn't mean that the tools employed here are never the correct tools to be used, but I do think that when trying to produce an evaluation that is primarily designed for external consumption, I would prefer much more emphasis to be given to clear explanations of the basic idea behind the organizations and an explanation of a set of cruxes and observations that would change the evaluators' mind, instead of this much emphasis on both the creation of detailed mathematical models and the explanation of those models.
Thanks for your comments.
I think there are some reasonable points here.
• I certainly agree that the model is overkill for this particular evaluation. As we note at the beginning, this was a 'proof of concept' experiment in more detailed evaluation of a kind that is common in other fields, such as health economics, but is not often (if ever) seen in EA. In my view – and I can't speak for the whole team here – this kind of cost-effectiveness analysis is most suitable for cases where (a) it is not possible to run a cheap pilot study with short feedback loops, (b) there is more actual data to populate the parameters, and (c) there is more at stake, e.g. it is a choice between one-off funding of several hundred thousand dollars or nothing at all.
• I would also be interested to see an explicit case against Donational.
However, I'd like to push back on some of your criticisms as well, many of which are addressed in the text (often the Executive Summary).
• A description of what Donational has done so far, and the plans for CAP, is in the Introducing Donational section. This could also constitute a basic argument for Donational, but maybe you mean something else by that. I don't know what you want to know about its operations beyond what is in this section and the Team Strength section. If you tell us what exactly you think is missing, maybe we can add it somewhere.
• We don't give "an explanation of a set of cruxes and observations that would change the evaluators' mind" as such, but we say what the CEE is most sensitive to (and that the pilot should focus on those), which sounds like kind of the same thing, e.g. if the number and size of donations were much higher or lower then our conclusions would be different. I've added a sentence to the relevant part of the Exec Summary: "The base case donation-cost ratio of around 2:1 is below the 3x return that we consider the approximate minimum for the project to be worthwhile, and far from the 10x or higher reported by comparable organizations. The results are sensitive to the number and size of pledges (recurring donations), and CAP's ability to retain both ambassadors and pledgers. Because of the high uncertainty, very rough value of information calculations suggest that the benefits of running a pilot study to further understand the impact of CAP would outweigh the costs by a large margin." EDIT: We also address potential 'defeaters' in the Team Strength section, and note that we would be reluctant to support a project with a high probability of one or more major indirect harms, or that looked very bad according to one or more plausible worldviews. This strongly implies at least some of the observations that would change our mind.
• We mention an early BOTEC of expected donations (which I assume is similar to the Fermi estimate that you're suggesting) in at least three places. This includes the Model Verification section where I note that "Parameter values, and final results, were compared to our preliminary estimates, and any major disparities investigated." Maybe I should have been clearer that this was the BOTEC, and perhaps we should have published the BOTEC alongside the main model.
• We make direct comparisons with OFTW, TLYCS, and GWWC throughout the CEA and to a lesser extent in other sections, and explain why we don't fully model their costs and impacts alongside CAP.
• "after writing down their formal models and truly understanding their consequences, most decision makers are well-advised to throw away the formal models and go with what their updated gut-sense is."
That's kind of what we did. We converted the CE ratio, and scores on other criteria, to crude Low/Medium/High categories, and made a somewhat subjective final decision that was informed by, but not mechanistically determined by, those scores and other information. A more purely intuition-driven approach would likely have either enthusiastically embraced the full CAP or rejected it entirely, whereas a formal model led us to what we think is a more reasonable middle ground (though we may have arrived in a similar place with a simpler model).
• Even for this evaluation, there was some value in the more 'advanced' methods. E.g. the VOI calculation, rough though it was, was important for deciding how much to recommend be spent on a pilot; and our final CEE (<2x) was a fair bit lower than the BOTEC (about 3-4x), largely because of the more pessimistic inputs we elicited from people with a more detached perspective, and more precise modelling of costs.
It seems like a large part of the problem is that most people don't have time to read such a long post in detail. In future we should perhaps do a more detailed Exec Summary, and I'll consider expanding this one further if there is enough demand.
Thanks again for engaging with this!
Thanks for the response!
I think you misunderstood what I was saying at least a bit: I did read the post in reasonably close detail (about half an hour of reading in total) and was already aware of most of what you say in your comment.
I will try to find the time to write a longer response that tries to explain my case in more detail, but can't currently make any promises. I expect there are some larger inferential distances here that would take a while to cross for both of us.
Yeah, I did wonder if we were talking past each other a bit, and I'd be interested to clear that up – but no worries if you don't have time.
Hey Oli, thanks for taking the time to come up with these points, and going out of your way to say, “...I think evaluations like this are quite important and a core part of what I think of as EA’s value proposition...and would like to see more people trying similar things in the future.” This is exactly the type of attitude toward agency and attempting to do good that I’d like to have encouraged more in EA.
Point-by-point, I think Derek covered a lot. I also mention in a comment how I was thinking about this evaluation in terms of a contribution to grant evaluation and the EA project space more broadly.
We might have done better to distill cruxes within our qualitative reasoning, though I do think a fair amount of this is presented in various sections. Agreed that swapping advanced mathematical models for BOTECs is often advisable, but at certain points in the future, I would imagine that evaluators could make good use of methods like these.
I think one cool thing this piece does is use a pretty wide range of approaches to estimating the value of this program. As such, I'd be particularly curious to get feedback from others here on which parts they find reliable or questionable.
[Disclaimer: I submitted comments to this post earlier]