Measuring Good Better

MichaelPlant; Matt_Lerner; Innovations for Poverty Action; GiveWell; Jason Schukraft

Comments 19

JamesÖz 🔸

Super interesting - thanks for posting! I was pretty surprised that GiveWell weights their donor survey so highly for their moral weights (60%). I was wondering what's the rationale for it being 60%, given that these people are both (a) probably not the most knowledgeable about context-specific information (as GiveWell points out here) and (b) also not the recipients themselves. It seems more reasonable to have the GiveWell staff, who have a much better understanding of specific programs and other context, and the beneficiary surveys, to have the majority say in moral weight calculations. I don't find the arguments below particularly convincing re favouring the donor survey over staff survey:

We have fairly few staff, compared to the number of people who can be surveyed via other methods.
Staff don't have a unique ability to make these moral judgments. Staff also have limited insight into what the lives of people impacted by our recommendations are actually like, and what would be the most helpful to them.
Staff-assigned moral weights are hard for charities to predict, in that there can be wide swings based on changes in staff composition.
In past years, staff engaged to varying degrees, and then all those responses were aggregated without weighting for level of engagement.

On 1) GiveWell's staff size (I think?) is about 60 people, relative to the donor survey, which seems to have had about 70 respondents. Whilst this might be a reason to favour the donor survey in the future if it grew to about 400 people, I don't think it's a great reason to make it 60% now, as the donor sample size is similarly small.
On 3) Given the sample sizes are pretty similar for both categories, aren't the donor moral-weights also open to swings? I guess it somewhat depends if you have more stability in your staff or your major donors!
On 4) Again, isn't this also true for the donor survey? Surely it's best to weight both groups via engagement so this issue isn't present for either population. I can also imagine GiveWell staff who can take paid time to work on these moral weights can engage more fully than donors with other full-time jobs.
Finally, I really have no a priori reason to believe that most GiveWell major donors are particularly well-informed about moral weights, so unsure why we should defer to them so much.

For all this, I'm roughly assuming most GiveWell staff could take part in the moral weights but appreciate it might be a smaller group of 10-20 who are just doing the research or program-specific work. I think the arguments are weaker but still stand in that case.

And for future work, GiveWell mentions they want to do more donor surveys but don't mention surveys of development professionals or other program-specific experts (e.g. malaria experts). This feels a bit odd, as this group seems like a more reliable population relative to donors for the same reasons as above (more specific knowledge).

I'm less sure about this but: I'm also a bit worried about GiveWell saying they want to avoid large swings from year to year (which totally makes sense so organisations can have predictable future incomes). This might unnecessarily keep the moral weights value close to less-than-ideal values, rather than updating with new evidence, which is especially problematic if you think the starting point (based a lot on the donor survey) is less than ideal.

Finally, is the donor survey public and if so, where could I find it?

MichaelPlant

I agree that this methodology should be better explained and justified. In a previous blog post on Open Philanthropy's cause prioritisation framework, I noted that Open Philanthropy defers to GiveWell for what the 'moral weights' about the badness of death should be, but that GiveWell doesn't provide a rationale for its own choice (see image below).

I, with colleagues at HLI, am working on a document that specifically looks at philosophical issues about how to compare improving lives to saving lives. We hope to publish that within 4 weeks.

CB🔸

Thanks for the article! This is an important topic, and I'm glad there is an overview of it.

I saw the "Taking Happiness seriously" presentation by Michael Plant at the EAG Berlin in September, and I was really impressed.

My two main takeaways from this post are the same as back then:

I'm surprised and a bit disappointed by the selection criteria used by GiveWell. As pointed out by someone else, "asking how 80 Givewell donors would make these same types of tradeoffs" seems like a weak criterion to me, especially for 60% of the value. We really have trouble imagining how happy we'd be in another situation, so this criterion sounds mostly like guessing.
This was also pointed out in Michael Plant's presentation: using DALYs tends to represent depression and reduced mobility as having the same impact on quality of life, while depression is much more devastating to mental wellbeing.
If asking people how happy they are works (which sounds surprising to me, but if it does correlate with other criteria, like, say, smiling, I can accept that), then it would make far more sense to use that criterion. It's closer to the endpoint. Otherwise we're just guessing.

Anyway, this post clarifies things quite a bit. Thanks for this work!

Dr Dan Epstein

As a public health academic, I would love to see more carving of a niche for WELLBYs. They make a lot of sense for the bio-psycho-social model for health... as they emphasize 2/3 of these metrics rather than just one!

To get traction for use, they need to build awareness as a viable alternative. There should be some effort to educate academics and policymakers about the use of WELLBYs as an outcome measure of interest.

I would also like to see research on existing softer interventions that may not impact DALYs but may shift the needle considerably with WELLBYs (or not?).

Off the top of my head, some candidates might be (potential for long-term well-being increases but maybe not disability/death):
-iron fortification
-access to contraception/pregnancy termination
-deworming (bringing another prong into the debates: living with worms is terrible)
-nutrition programs
-increasing sleep quality/supplying simple mattresses
-domestic violence interventions/safehouses

Another way to think of interventions for the list is taking away causes of long-term suffering that would be cheap, easy and likely permanent.

Mo Putera

HLI's research overview page mentions that they're planning to look into the following interventions and policies via the WELLBY lens; there is some overlap with what you mentioned:

Our search for outstanding funding opportunities continues at three levels of scale. These are set out below with examples of the interventions and policies we plan to investigate next.
Micro-interventions (helping one person at a time)
Deworming programs
Cataract surgery
Cement flooring
Friendship Bench
Mental health apps
Meso-interventions (systemic change through specific policies)
Lead regulation
Access to pain relief
Immigration policy
Psychedelic-assisted therapy
Macro-interventions (systemic change through the adoption of a wellbeing approach)
Advocacy for, and funding of, subjective wellbeing research
Developing policy blueprints for governments to increase wellbeing

MichaelPlant

Thanks for spotting and including this, Mo! Yes, Dan, at HLI we're trying to develop and deploy the WELLBY approach and work out how much difference it makes vs the 'business as normal' income and health approaches. We're making progress, but it's not as fast as we'd like it to be!

Feel free to reach out if you'd like to chat. [email protected]

Lizka

I'm curating this — thanks so much for putting it together, all! ^[1]

I think people are pretty confused about how prioritization or measurement of "good" can possibly happen, and how it happens in EA (and honestly, it's hard to think about prioritization because prioritization is incredibly (emotionally) difficult), and even more confused about the differences between different approaches to prioritization — which means this is a really useful addition to that conversation.

I do wish there were a summary, though. I've copy-pasted Zoe's summary below.

The post is pretty packed with information, shares relevant context that's useful outside of these specific questions (like how to value economic outcomes), and also has lots of links to other interesting readings.

I also really appreciate the cross-organization collaboration that happened! It would be nice to see more occasions where representatives of different approaches or viewpoints come together like this.

Here's the summary from Zoe's incredible series (slightly modified):

GiveWell uses moral weights to compare different units (e.g. doubling incomes vs. saving an under-5's life). These are 60% based on donor surveys, 30% from a 2019 survey of 2K people in Kenya and Ghana, and 10% staff opinion. [Note from Lizka: this is largely to create an exchange rate between different types of good outcomes.]
Open Philanthropy’s global health and wellbeing team uses the unit of ‘a single dollar to someone making 50K per year’ and then compares everything to that. E.g. averting a DALY is worth 100K of these units.
Happier Lives Institute focuses on wellbeing, measuring WELLBYs. One WELLBY is a one-point increase on a 0-10 life satisfaction scale for one year.
Founders Pledge values cash at $199 per WELLBY. They have conversion rates from WELLBYs to Income Doublings to Deaths Avoided to DALYs Avoided, using work from some of the orgs above. This means they can get a dollar figure they’re willing to spend for each of these measures.
Innovations for Poverty Action asks different questions depending on the project stage (e.g. idea, pilot, measuring, scaling). Early questions can be, e.g., whether it’s the right solution for the audience, and only down the line can you ask ‘does it actually save more lives?’
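
To make the 60/30/10 split in the first point concrete, here is a minimal sketch of how three sources could be blended into one moral weight. It assumes a plain weighted average, which the summary does not confirm is GiveWell's actual aggregation method. The function name, the source labels, and every input valuation are hypothetical and chosen only for illustration.

```python
# Shares taken from the summary: 60% donor survey, 30% beneficiary survey, 10% staff.
SOURCE_SHARES = {"donors": 0.60, "beneficiary_survey": 0.30, "staff": 0.10}


def blended_moral_weight(valuations):
    """Return the share-weighted average of each source's valuation.

    ASSUMPTION: a simple linear weighted average; the real procedure may differ.
    """
    if set(valuations) != set(SOURCE_SHARES):
        raise ValueError("provide exactly one valuation per source")
    return sum(SOURCE_SHARES[name] * value for name, value in valuations.items())


# HYPOTHETICAL valuations of one outcome, in units of "doubling consumption
# for one person for one year". These numbers are invented for the example.
example_valuations = {"donors": 100.0, "beneficiary_survey": 150.0, "staff": 90.0}
blended = blended_moral_weight(example_valuations)
print(f"Blended moral weight: {blended:.1f}")
```

With a split like this, the donor figure dominates: moving the donor valuation by 10 units shifts the blend by 6, while the same move in the staff valuation shifts it by only 1.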

I also just really appreciate this quote from the post:

So what do we do? Well, we try to reduce everything to common units and by doing that we can more effectively compare across these different types of opportunities. But this is really, really hard! I can't emphasise enough how difficult this is and we definitely don't endorse all of the assumptions that we make. They're a simplifying tool, they're a model. All models are wrong, but some are useful, and there is constant room for improvement.

^[1] [Disclaimer: written quickly, about a post I read some time ago that I just skimmed to remind myself of relevant details.]

Sarah Cheng 🔸

People interested in global health & development and this post might be interested in applying to the Program Operations Assistant, Global Health & Wellbeing role at Open Philanthropy, an EA-aligned research and grantmaking foundation.

This is a test by the EA Forum Team to gauge interest in job ads relevant to posts - give us feedback here.

D_M_x

I was pretty taken aback by GiveWell's moral weights by age. I had not expected them to give babies so little moral weight compared with DALYs. This means GiveWell considers saving babies' lives to be only as valuable as saving people in their late 30s, despite the latter being almost halfway through their lives. The graph makes the drop-off of moral weights at younger ages look less sharp than it is, as the x-axis is not to scale.

I looked at the links for further information on this which I'm collating here for anyone else interested:

From the [public] 2020 update on GiveWell's moral weights - Google Docs:

These results look sensible to us. We're least certain about the value of averting deaths at very young ages and stillbirths. If those values became decision-relevant for a grant, such as a neonatal health program, we might revisit these weights or consider setting aside a pot of funding for grants that satisfy "other reasonable moral weights."

DALYs assume that preventing the deaths of people with longer remaining life expectancies is always more valuable, when in practice many people indicate a preference for preventing the death of an older child over the death of a neonate.[footnote]

The footnote:

This applies to GiveWell staff and donors, as well as to the results of the Mechanical Turk survey referenced here. The IDinsight survey did not ask about neonates.

The biggest uncertainty we have is around the relative value of preventing deaths at very young ages.

This is in the context of the ID survey, but I assume it is speaking about GiveWell's uncertainty, not the uncertainty of ID survey respondents.

We don't believe the group of donors we surveyed is very diverse (across characteristics like race, gender, income, and country of origin) which could influence results. The vast majority of the donors we surveyed are men, and people of different genders could especially have different intuitions about the value of averting stillbirths and the deaths of neonates.

A quick analysis of the responses of men vs. women didn't indicate that we should upweight stillbirths and the deaths of neonates to account for different preferences across genders, but there were so few women in the sample that we can't say with confidence that the results don't depend on gender.

Apparently the moral weights have not been decision-relevant which is good news for all donors who have different preferences for moral weights. In the future I will check before donating to GiveWell's funds whether the moral weights of babies have become decision-relevant for grants in the meantime.

I was also wondering how many of the donor survey respondents were parents and whether they put more moral weight on babies than non-parents. I could also imagine there being a discrepancy between mothers and everyone else (fathers and the childless).

david_reinstein

I'm finding the terminology difficult.

DALYs

"Disability Adjusted Life Years", definition from WHO: "One DALY represents the loss of the equivalent of one year of full health" (for a disease or health condition).

I always think this should be called "Disability adjustments to life years" ... it's not a 'count of total adjusted years'. Anyways, moving on to HLI:

WELLBYs, HLI

Instead of DALYs, we think in terms of WELLBYs (wellbeing-adjusted life years). ...

Saying 'wellbeing-adjusted life years' suggests this involves a metric where people would trade off ('adjusted') an additional year of life spent at one level of happiness versus something else.

Or, to bring the 'DALY' concept in, something like ...the adjustment to lifetime average well-being needed to be willing to give up an additional year of life in my current well-being state. Something like a 'hypothetical revealed preference tradeoff'.

So what is one WELLBY? It’s a one-point increase on a 0-10 life satisfaction scale for one year.

But this is an explicit measure defined based on a survey response. So how would this be consistent with the 'adjustment' above?

MichaelPlant

Q/DALYs are intended to measure health and the weights are found by asking individuals to make various trade-offs. There are some subtleties between them, but nothing important for this discussion.

WELLBYs are intended to measure overall subjective wellbeing, and do so in a way that allows quality and quantity of life to be traded off. Subjective wellbeing is measured via self-reports, primarily of happiness and life satisfaction (see World Happiness Report; UK Treasury). I should emphasise that HLI did not invent either the idea of measuring feelings, or the WELLBY itself: we're transferring and developing ideas from social science. How much difference various properties (e.g. income, relationships, employment status) make to subjective wellbeing is inferred from the data, rather than by asking people for their hypothetical preferences. Kahneman et al. draw an important distinction between decision utility (what people choose, aka preferences) and experienced utility (how people feel, aka happiness). The motivation for the focus on subjective wellbeing is that there is often a difference between the two (due to e.g. mispredictions of the future) and, if there is, we should focus on the latter.

Hence, when you say

Saying 'wellbeing-adjusted life years' suggests this involves a metric where people would trade off ('adjusted') an additional year of life spent at one level of happiness versus something else.

I'm puzzled. The WELLBY is 'adjusted' just like the QALY and the DALY: you're combining a measure of quality of life with a measure of time, not just measuring time. On QALYs, a year at 0.5 health utility is worth half as much as a year at 1 health utility, because of the adjustment for quality.
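
The 'quality times time' adjustment described here can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, and the WELLBY function treats points on the 0-10 life satisfaction scale as linear (each point counts the same), which is an assumption of the sketch rather than something argued for in this comment.

```python
def qalys(health_utility, years):
    """QALYs: health utility on a 0-1 scale multiplied by time in years."""
    if not 0.0 <= health_utility <= 1.0:
        raise ValueError("health utility must lie between 0 and 1")
    return health_utility * years


def wellbys(points_change, years):
    """WELLBYs: change on the 0-10 life satisfaction scale times its duration.

    ASSUMPTION: scale points are treated as linear and interchangeable.
    """
    if not -10.0 <= points_change <= 10.0:
        raise ValueError("a change on a 0-10 scale cannot exceed 10 points")
    return points_change * years


# A year at 0.5 health utility counts for half of a year at 1.0.
half_health_year = qalys(0.5, 1.0)
full_health_year = qalys(1.0, 1.0)

# One WELLBY: a one-point increase sustained for one year.
one_wellby = wellbys(1.0, 1.0)
# The same total can come from a larger change over a shorter time.
two_points_half_year = wellbys(2.0, 0.5)
```

In both cases the result is a product of a quality term and a duration term, which is the sense in which each measure is 'adjusted'.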

david_reinstein

Thanks. When you say

So what is one WELLBY? It’s a one-point increase on a 0-10 life satisfaction scale for one year.

By this I assume you mean that 1 WELLBY is a 1-point increase in the self-reported measure itself (maintained over the course of the year). Is that it?

If so, how can I make 'adjustments' similar to those for the QALY?

On QALYs, a year at 0.5 health utility is worth half as much as a year at 1 health utility, because of the adjustment for quality.

How would something similar work for a WELLBY? If it's just the numeric self-reported well-being, it shouldn't imply anything like 'a year at WELLBY = 8 is worth twice as much as a year at WELLBY = 4', should it? I assume there is another way this is done.

MariellaVee

This is incredibly interesting and enlightening; thank you!

Particularly love to see the way that these different organizations are looking at each others’ work and ideas and fit-testing them for their own approach and priorities. I’m especially interested in the question of how to measure good better by taking the effectiveness of the implementation into account, since this is where I can foresee a lot of great in-theory approaches diminishing in effectiveness when hit with real-world obstacles like convoluted systems/miscommunication or shifting context, etc.

As such, I even wonder how granular this approach could get; could additional work looking at systems, obstacles, or contexts objectively reveal ways that some interventions with potentially high impact traditionally considered too high-cost might suddenly become more accessible/effective?

jall

A $1,000 cash transfer costs a bit more than $1,000 to deliver

I'm surprised by this claim!

GiveDirectly's published figures for delivery costs are 10-20% of the total funds, and the financials seem to back this up to my naive eye: "Direct Grants" dominate all other expenses. 9:1 or 8:1 is way different from the 1:1 suggested here.

How's this figure calculated?

blonergan

I interpreted that as meaning that a $1,000 cash transfer costs a bit more than $1,000, including the direct cost of the cash transfer itself. So, something like $100 of delivery costs would mean that a $1,000 cash transfer would have a total cost of around $1,100.

Here HLI comes up with $1,170 as the total cost of a $1,000 cash transfer, which seems reasonably close to your numbers.
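
A rough sketch of the arithmetic in this exchange, in case it helps. It assumes the 10-20% delivery figure is a share of total spending (transfer plus delivery), which is one reading of the numbers quoted above and not something confirmed here. The function names are hypothetical.

```python
def total_cost_from_share(transfer, delivery_share_of_total):
    """Total cost when delivery costs are a fraction of TOTAL spending.

    ASSUMPTION: total = transfer + delivery, and delivery = share * total,
    so total = transfer / (1 - share).
    """
    if not 0.0 <= delivery_share_of_total < 1.0:
        raise ValueError("share must be at least 0 and below 1")
    return transfer / (1.0 - delivery_share_of_total)


def implied_delivery_share(total_cost, transfer):
    """Fraction of total spending that is not the transfer itself."""
    return (total_cost - transfer) / total_cost


# Range implied by a 10-20% delivery share on a $1,000 transfer.
low_total = total_cost_from_share(1000.0, 0.10)
high_total = total_cost_from_share(1000.0, 0.20)

# Delivery share implied by the $1,170 total quoted from HLI.
hli_share = implied_delivery_share(1170.0, 1000.0)

print(f"Total cost range: ${low_total:,.0f} to ${high_total:,.0f}")
print(f"Delivery share implied by $1,170: {hli_share:.1%}")
```

On this reading, the $1,170 figure sits inside the range implied by a 10-20% delivery share, so the two sets of numbers are consistent.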

jall

ah, of course. thanks!

MichaelPlant

Yes, thanks for this. I'll be more careful when I say this in future. Providing a $1,000 transfer costs a bit more than $1,000 in total, when you factor in the costs to deliver it.

Cornelis Dirk Haupt

Threw me off as well at first. I'd second that it's probably best to reword it in the future for better clarity

Barry Grimes

If you found this post helpful, please consider completing HLI's 2022 Impact Survey.

Most questions are multiple-choice and all questions are optional. It should take you around 15 minutes depending on how much you want to say.

Comments

Value of doubling consumption for one person for one year	1.0
Value of averting one year of life lived with disease/disability (YLD)	2.3
Value of averting one stillbirth (1 month before birth)	33.4
Value of preventing one 5-and-over death from malaria	83.1
Value of averting one neonatal death from syphilis	84.0
Value of preventing one under-5 death from malaria	116.9
Value of preventing one under-5 death from vitamin A deficiency	118.4

	WELLBYs (per treatment)	COST (US dollars)	WELLBYs (per $1,000)
GiveDirectly (lump sum cash transfers)	9	$1,220	7.3
StrongMinds (group psychotherapy)	12	$170	70
Ratio (SM v GD)	1.3 x more effective	14% of GD cost	9 x more cost-effective
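
The last column and the ratio row of the table above can be approximately reproduced from the first two columns. The sketch below uses only the rounded figures printed in the table, so its results differ slightly from the table's own entries, which were presumably computed from unrounded inputs (an assumption). The function and variable names are hypothetical.

```python
def wellbys_per_1000_usd(wellbys_per_treatment, cost_usd):
    """Cost-effectiveness: WELLBYs bought per $1,000 spent."""
    return wellbys_per_treatment / cost_usd * 1000.0


# Rounded inputs copied from the table.
gd_wellbys, gd_cost = 9.0, 1220.0   # GiveDirectly (lump sum cash transfers)
sm_wellbys, sm_cost = 12.0, 170.0   # StrongMinds (group psychotherapy)

gd_per_1000 = wellbys_per_1000_usd(gd_wellbys, gd_cost)
sm_per_1000 = wellbys_per_1000_usd(sm_wellbys, sm_cost)

effect_ratio = sm_wellbys / gd_wellbys            # table: 1.3 x more effective
cost_ratio = sm_cost / gd_cost                    # table: 14% of GD cost
cost_effectiveness_ratio = sm_per_1000 / gd_per_1000  # table: 9 x

print(f"GiveDirectly: {gd_per_1000:.1f} WELLBYs per $1,000")
print(f"StrongMinds:  {sm_per_1000:.1f} WELLBYs per $1,000")
print(f"Ratio (SM v GD): {cost_effectiveness_ratio:.1f}x")
```

The cost-effectiveness ratio is just the effect ratio divided by the cost ratio, so a modestly larger effect at a much lower cost is what drives the gap.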

Measuring Good Better

GiveWell (Olivia Larsen)

Why do we need moral weights?

What are GiveWell’s moral weights?

What goes into GiveWell’s moral weights?

Open Philanthropy (Jason Schukraft)

Why do we need assumptions?

Valuing economic outcomes

Valuing health outcomes

Happier Lives Institute (Michael Plant)

Using a subjective wellbeing approach

Can we rely on subjective measures?

What difference does it make?

Plans for further research

Founders Pledge (Matt Lerner)

Our historical (deprecated) approach

Goals and constraints for our new approach

The general idea behind the new approach

How this is looking so far

Going forward

Innovations for Poverty Action (Katrina Sill)

Impact = solution quality x implementation quality

The right way to measure ‘good’ depends on the question

What to measure and when