This research project was commissioned by Flourishing Minds Fund and created by CEARCH. It consisted of a review of the available literature on mental health and consultation with over a dozen experts in the field. The full report can be found here.
Key Findings
- The importance of mental illness is highly contingent on one’s philosophical assumptions
- Health measures such as QALYs and DALYs probably fail to fully capture the badness of severe mental illness. The WELLBY may be better at capturing internal states but has many of the same weaknesses
- We think that even with no change in philosophical assumptions, the DALY value of averting depression should be 20-45% higher than the IHME weighting suggests
- Converting between measures of health, wellbeing and life satisfaction introduces significant uncertainty and should be avoided when possible.
- Decision-making on mental health interventions should be clear about any philosophical and measurement assumptions that are being made
How can we measure mental health outcomes?
Philosophical Assumptions
Averting mental illness looks good from all major philosophical perspectives. Mental illness increases suffering, reduces happiness, is a state of ill-health, prevents people from fulfilling their desires, and so on.
Most of the benefits of mental health interventions come from improving lives, not extending them. Thus the importance of averting mental illness relative to other causes largely depends on philosophical assumptions.
- The value of improving lives vs bringing new ones into being. Improving wellbeing and extending life are fundamentally different. Prioritizing between mental health interventions and physical health interventions depends on moral positions, such as the relative value of averting an infant death, which vary widely between people
- Trade-offs between preventing suffering and increasing happiness. Some measurement frameworks impose limits on the worst and best possible states of health or wellbeing. Those who believe that severe suffering outweighs extreme pleasure are more likely to favor interventions that address suffering.
- The value of wellbeing versus freedom. Those who place a high value on personal freedom are less likely to support interventions that restrict freedom in the name of public health, such as bans on commodities used to commit suicide.
In general, mental health work becomes more valuable for those who:
- prioritize improving lives over bringing new ones into being
- prioritize preventing suffering over increasing happiness[1][2]
- (in the case of preventing suicide) believe that curtailment of freedom is justified
Inevitably, the units of value that we use to determine the effectiveness of various interventions will not precisely reflect our philosophical assumptions. Hence it is important to select the best measures available, and to understand whether they are under- or over-estimating value according to our philosophical position.
HALYs and their limitations
Health-Adjusted Life Years (HALYs) attempt to put a value on the burden of disability and early death. QALYs and DALYs are the most widely used. They are derived from ordinary people’s evaluations of how bad it would be to live with the condition as described.
Foster (2020) gives an excellent overview of QALYs and DALYs, which informs most of the summary below and is the source of the images of health scales.
Quality-Adjusted Life Years (QALYs) aim to measure health-related quality of life (HRQoL). Karimi & Brazier (2016) describes HRQoL as “the way health (as measured by health status questionnaires) affects QoL (as measured by QoL questionnaires) as empirically estimated using statistical techniques”. Notice the narrowness of the definition: QALYs are not attempting to capture the full spectrum of QoL, only the portion determined by health.
In surveys of the general population, respondents are asked about a number of health states[3]. Each survey produces a “value set” which assigns a QALY score to each of many possible health states.
1 QALY is equal to a year of life in full health, while 0 QALYs is a health state equivalent to death.
The QALY scale
The QALY scale admits scores below zero, which represent states worse than death. In practice, QALY value sets have few health states with negative scores, and the worst states never seem to be lower than -1.
Disability-Adjusted Life Years (DALYs) attempt to quantify how healthy different states are. Survey respondents are given short descriptions of two people experiencing different health states, and asked “who do you think is healthier overall, the first person or the second person?”.
For example, a respondent might be asked who is healthier out of:
- Person 1: “has overwhelming, constant sadness and cannot function in daily life. The person sometimes loses touch with reality and wants to harm or kill himself (or herself)” [Major depressive disorder, severe episode]
- Person 2: “drinks a lot of alcohol and sometimes has difficulty controlling the urge to drink. While intoxicated, the person has difficulty performing daily activities” [Alcohol use disorder, mild]
From thousands of such comparisons, researchers derive a weighting for each health state. In contrast to QALYs, 1 DALY is equal to a year of lost healthy life and 0 is a year of full health.
The DALY scale
Importantly, there are no health states worse than death.
Since it is based on judgments of which states are healthier than others, the DALY may be particularly bad at capturing the unpleasantness of mental illness. Respondents may not see unhappiness, anxiety, pain, etc. as unhealthy relative to "obviously unhealthy" injuries and infections.
QALYs and DALYs share a host of serious limitations.
- They rely on people’s assessments of health states that they may never have actually experienced. Arguably, mental health states are harder to imagine and easier to trivialize than physical health states, and so the survey process may undervalue mental illness
- There are signs that people incorporate the mitigating effects of treatments and care in their assessment of health states, even though they are not supposed to (Feng et al., 2020, Patenaude & Bärnighausen, 2019), which may lead to some illnesses being undervalued.
- This also means that we weigh illnesses equally across cultures, even though it is likely that suffering is more acute among those who are living in poverty. For example, the HALY weightings of arthritis may “assume” easy access to painkillers, which is often not the case in poor countries.
- They cannot capture states of extreme happiness and extreme suffering
- They appear to weight pain very lightly. For example, terminal illness with constant, untreated pain has a disability (DALY) weight of 0.569, which is only 0.029 more than the weight for the same condition with pain medication.
- They only aim to measure the impact of the health state, not its comorbidities.
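To put the pain example from the list above in proportion, here is a minimal sketch that uses only the two figures quoted in the text (a weight of 0.569 and a gap of 0.029). The helper name `years_lived_with_disability` is illustrative, and the calculation assumes the simple convention that healthy life lost equals the disability weight multiplied by time spent in the state.

```python
# Figures quoted in the text for terminal illness on the DALY scale.
WEIGHT_UNTREATED_PAIN = 0.569  # constant, untreated pain
WEIGHT_GAP = 0.029             # stated difference from the medicated state


def years_lived_with_disability(weight: float, years: float) -> float:
    """Healthy life lost: disability weight times years spent in the state."""
    return weight * years


# Weight implied for the same condition with pain medication.
weight_with_medication = WEIGHT_UNTREATED_PAIN - WEIGHT_GAP

untreated = years_lived_with_disability(WEIGHT_UNTREATED_PAIN, 1.0)
medicated = years_lived_with_disability(weight_with_medication, 1.0)

# How much worse the untreated state is rated, relative to the medicated one.
relative_increase = (untreated - medicated) / medicated
print(f"Implied weight with medication: {weight_with_medication:.3f}")
print(f"Relative increase from untreated pain: {relative_increase:.1%}")
```

On these figures, a year of constant untreated pain is rated only about 5% worse than the same year with pain relief.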
How should we interpret HALYs?
In practice, mental health data often comes in the form of QALYs and DALYs. Some key questions for determining whether they may over- or under-estimate the things we care about:
- Is this health condition likely to be well understood by the general population?
- Does the health condition involve extreme suffering that may not be captured by the HALY?
- Is pain central to the “badness” of the health condition? Such conditions are likely to be underestimated, especially if respondents assumed access to pain relief
- Is the health condition a risk factor for other conditions? Many health conditions raise the risk of other illnesses, but HALYs do not account for this.
Case study: Is the DALY weight of depression an underestimate?
Most people have never experienced depression, and yet the DALY weightings for mild, moderate and severe depression are derived from surveys of the general population. Furthermore, the IHME’s disability weights do not capture the effects of depression as a risk factor for other conditions, including suicidality. In our own analysis, we attempt to correct the DALY weightings of depression to account for how sufferers weight the condition, and for the excess suicide burden associated with depression.
In general, sufferers of a given illness are found to rate it similarly to the general population, or even to rate it as less severe. To quote heavily from Pyne et al. (2009):
A number of studies have compared health state preference scores generated by different groups. Some of these studies have found differences based on health experience (Gabriel et al. 1999; Lenert, Treadwell, and Schwartz 1999; De Wit, Busschbach, and De Charro 2000; Postulart and Adang 2000; Insinga and Fryback 2003; Rashidi, Anis, and Marra 2006;) while others have not (Balaban et al. 1986; Revicki, Shakespeare, and Kind 1996; Dolders et al. 2006;). In general, studies that compare patient and general population health state preferences find that patients assign preference scores [...] that are equal to or greater than the preference scores assigned by members of the general population (Sackett and Torrance 1978; Balaban et al. 1986; Froberg and Kane 1989b; De Wit, Busschbach, and De Charro 2000; Dolders et al. 2006;)
However, Pyne et al. found that depressed patients rated depression as worse than members of the general population did:
Depressed patients report lower preference scores for depression health states than the general population. In effect, they perceived depression to be worse than the general public perceived it to be.
In our analysis, we find that the ratio between disability weights derived from depressed people and those derived from the general population in Pyne et al. is 1.20. In essence, depressed people perceive depression to be 20% more severe than non-sufferers.
We also seek to estimate the extra health burden from the increased risk of suicide associated with depression. Depression is widely quoted to be responsible for at least half of suicides (CPSP, JED, CSP) even though only ~3% of the population is believed to suffer from depression in a given year (GBD study). There does not appear to be reliable data to back up the “at least half” figure, and it is presumably highly contingent on the depression threshold used for diagnosis[4]. But the link between depression and suicide is well-established. A 2001 Swedish study found that hospital diagnosis for depression increased suicide risk by 20x (Osby et al., 2001).
We examine males and females separately, since they have very different rates of depression and suicide, and find that
- In males, depression increases suicide risk by ~40x, or an extra 0.4% per year of depression
- In females, depression increases suicide risk by ~25x, or an extra 0.1% per year of depression
- Overall, the increased suicide risk adds 0.066 to the DALY weighting of depression
These figures are highly contingent on several factors, including the prevalence of depression (which we suspect is higher than the GBD figure) and the background suicide rate. We should expect the burden of increased suicide risk to vary widely based on the context.
In the end, we determine the DALY weighting of a year of depression to be 0.39, which is 44% higher than the IHME figure [link to calculations].
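As a rough consistency check on the figures in this case study, the sketch below back-calculates the baseline weight they imply. It assumes the two adjustments combine as baseline × 1.20 + 0.066. That combination rule and the resulting baseline are assumptions made for illustration, not figures taken from the linked calculations.

```python
SUFFERER_RATIO = 1.20     # sufferer vs general-population weights (Pyne et al.)
SUICIDE_ADDITION = 0.066  # DALY weight added for excess suicide risk
ADJUSTED_WEIGHT = 0.39    # final weight for a year of depression


def implied_baseline(adjusted: float, ratio: float, addition: float) -> float:
    """Invert adjusted = baseline * ratio + addition (assumed combination rule)."""
    return (adjusted - addition) / ratio


baseline = implied_baseline(ADJUSTED_WEIGHT, SUFFERER_RATIO, SUICIDE_ADDITION)

# Uplift of the adjusted weight over the back-calculated baseline.
uplift = ADJUSTED_WEIGHT / baseline - 1

print(f"Implied baseline weight: {baseline:.3f}")
print(f"Uplift over baseline: {uplift:.0%}")
```

Under that assumption the implied baseline is about 0.27 and the uplift is about 44%, which matches the stated increase over the IHME figure.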
Wellbeing approach
The Happier Lives Institute (HLI) has proposed subjective wellbeing (SWB) as an alternative measure of value that avoids some of the pitfalls of health and income measures. We feel that SWB does have a number of strengths, but ultimately suffers from other serious limitations.
The best-known way of measuring SWB is with life satisfaction (LS) scores. Respondents are asked to rate their satisfaction with their life on a scale of 0 to 10 (although the wording of the questioning varies). An increase in LS lasting for one year is known as a WELLBY.
LS has been found to be significantly linked to income (OWID, 2017), which suggests that LS scores are capturing some objective level of wellbeing and are not merely an indication of wellbeing relative to others in the community.
Life-satisfaction and GDP per capita, from Our World in Data, 2017
HLI argue that by measuring SWB before and after an intervention, we can determine improvements in wellbeing that are comparable across physical health, mental health and anti-poverty interventions. LS scores may capture actual changes in quality of life, rather than merely inferring them by observing indicators like health scores. They can also lay claim to being more free from the subjective judgment of the scientist: respondents are able to evaluate their own lives based on what is most important to them.
However, SWB has a number of its own limitations:
- People have been found to report widely differing levels of LS when asked at different times (Krueger & Schkade, 2008). These variations may “average out” on a population level, but they imply that respondents’ self-reported life satisfaction may be heavily influenced by short-term influences. This raises the risk of reporting bias.
- Evidence suggests that introverts and extroverts may report the same internal state differently (Fabian, 2021).
- SWB data is often not available, which makes it difficult to evaluate interventions on this basis. In response to the absence of SWB data, HLI make the very questionable assumption that one SD improvement in affective mental health (MHa) is roughly equivalent to one SD improvement in SWB[5] (HLI, 2021). This kind of conversion adds further uncertainty.
- Taken literally, a SWB approach would imply that the lives of those in wealthy countries are up to twice as valuable as those in poor countries[6]. For many people, this violates the reasonable assumption that all lives are broadly equal in value.
- The life satisfaction scale may not be “truly” linear. For example, response data suggests that scores of 0, 5 and 10 are given “more often” than the distribution of other scores would suggest. It seems plausible that the difference between people at 1 and 3 on the scale is on average far greater than that between people at 6 and 8, and yet they are assumed to be equal by the linearity of the scale. This would lead us to underweight interventions that help the most miserable.
Just like HALYs, life satisfaction (LS) and the WELLBY do not capture extreme positive or negative states, since scores are bounded by 0 and 10. It is possible that the full toll of severe pain, depression and psychosis simply cannot be measured by the WELLBY, QALY or DALY.
Overall, we expect that taking a wellbeing approach probably leads to valuing mental health work more highly[2]. Mental illnesses are hard to observe from the outside but are often acutely felt by the sufferer. In contrast, physical conditions like blindness or amputation might be highly feared by non-sufferers (so they look severe according to QALY/DALY weights) but may be “not so bad” according to sufferers who become accustomed to their health state.
On the other hand, health scores may fail to capture the secondary mental health effects of various health states, and a wellbeing approach may make some health conditions look worse.
Many of the limitations of SWB stem from the measure of life satisfaction. It is possible that better wellbeing measures will be invented in the future. Derek Foster has proposed the HALY+ and sHALY, health metrics which incorporate wellbeing weights, and the WELBY (with one “L”), which is similar to the WELLBY but can measure states worse than death. He is now exploring this further as a PhD candidate at Oxford University.
How should we interpret wellbeing measures?
The credence you give to wellbeing measures depends upon your moral priorities. You should give more weight to wellbeing measures if you think that an individual’s subjective experience of life is more important than their objective health condition.
As with HALYs, it is worth considering whether the health condition you are examining involves extreme happiness or extreme suffering that cannot be captured on a 0-10 scale.
Perhaps the most important heuristic is to check for signs of response bias in the study. Is there a blinded control group? Does the control group show an increase in SWB? If non-SWB measures were reported, do they also improve with the intervention?
Relative mental health
Interventions targeting mental illness may want to measure mental health before and after the treatment in order to track improvements. This is possible with HALY and wellbeing approaches, but not always ideal.
HALYs assign fixed weights to health states: at best, there are different weights for mild, moderate and severe cases of the condition. This makes HALYs quite insensitive to the magnitude of improvement. Furthermore, judgements about what counts as mild, moderate or severe will inevitably introduce noise.
Wellbeing approaches, as explained above, capture many things other than the condition being targeted.
One response to these problems is to determine the severity of mental illness by measuring symptoms. A symptom questionnaire assigns a score to each patient. The population standard deviation (SD) of these scores is often used as a unit of value. A one SD improvement of symptoms for one year is known as an SD-year.
The PHQ-9, for example, is a 9-item questionnaire which asks how often the patient has experienced each of nine depression symptoms in the past two weeks. This provides a score of 0 to 3 on each question, for a total score out of 27. A population survey in the US by Tomitaka et al. (2018ii) found that the mean PHQ-9 score was just 3, with an SD of around 4. Thus a person going from a score of 20 to a score of 8 would experience a 3-SD improvement in depressive symptoms.
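The SD arithmetic in the PHQ-9 example can be sketched as follows. The function name `sd_improvement` is illustrative, and the SD of 4 is the rounded population figure quoted above rather than a study-specific value.

```python
PHQ9_MAX_SCORE = 27  # nine items scored 0 to 3


def sd_improvement(score_before: float, score_after: float,
                   population_sd: float) -> float:
    """Symptom improvement expressed in population standard deviations."""
    for score in (score_before, score_after):
        if not 0 <= score <= PHQ9_MAX_SCORE:
            raise ValueError(f"PHQ-9 score out of range: {score}")
    if population_sd <= 0:
        raise ValueError("population_sd must be positive")
    # Lower PHQ-9 scores mean fewer symptoms, so before minus after
    # is positive when the patient improves.
    return (score_before - score_after) / population_sd


improvement = sd_improvement(score_before=20, score_after=8, population_sd=4.0)
print(improvement)  # 3.0
```

If that improvement were sustained for one year, it would count as 3 SD-years.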
The distribution in PHQ-9 scores among a sample of the US population (Tomitaka et al., 2018i)
The distribution of scores was heavily right-tailed. Most people reported very few depressive symptoms, and a minority reported far more than average.
There are questions about the validity of measuring progress in terms of SDs. Is a four-point improvement really equally valuable for people with scores of 5 and 20?
Straining credulity further is the practice of converting between SD measures, which we examine below.
Converting between measures
Converting between measures introduces a lot of uncertainty. Past efforts at conversion have rested on highly contestable assumptions - including the assumption that it is valid to linearly map one measure onto another.
As seen above, there is evidence that depressive symptom scores (as measured by the PHQ-9) are positively skewed with a very long right tail, while life satisfaction scores in LMICs are negatively skewed or symmetric (OWID, 2017). Even if we use correlational evidence to derive a conversion rate between SDs of PHQ-9 score and SDs of life satisfaction scores (as HLI have attempted), it may not be valid to use this conversion.
For example: if a severely depressed person (PHQ-9 score 24) experienced a 3 SD improvement in depression symptoms (reducing their score to 12), they would probably still be among the most depressed 5% of the population (Tomitaka et al., 2018ii). But a 3 SD improvement in life satisfaction is an increase of around 6.5[7], which is on par with average wellbeing in Italy. It seems implausible that someone within the most depressed 5% of the population would have higher-than-average life satisfaction.
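The mismatch in this example can be made concrete with a short sketch. It assumes a one-to-one mapping between SDs of PHQ-9 score and SDs of life satisfaction, which is the kind of linear conversion being questioned here. It uses the rounded PHQ-9 SD of 4 and the life satisfaction SD of 2.17 from footnote 7.

```python
PHQ9_SD = 4.0  # approximate population SD of PHQ-9 scores
LS_SD = 2.17   # life satisfaction SD assumed by HLI (footnote 7)
LS_MAX = 10.0  # top of the life satisfaction scale

# A severely depressed person improves from a PHQ-9 score of 24 to 12.
phq9_before, phq9_after = 24, 12
sd_change = (phq9_before - phq9_after) / PHQ9_SD

# Naive linear conversion: the same number of SDs on the LS scale.
implied_ls_gain = sd_change * LS_SD

# The gain only fits on a 0-10 scale if the person started at or below this.
highest_feasible_start = LS_MAX - implied_ls_gain

print(f"Improvement in SDs: {sd_change:.1f}")
print(f"Implied life satisfaction gain: {implied_ls_gain:.2f} points")
print(f"Highest starting LS that fits the scale: {highest_feasible_start:.2f}")
```

Even someone who started at a life satisfaction of 0 would finish at about 6.5 under this mapping, despite still reporting a PHQ-9 score of 12.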
In fact, we suggest that conversion be avoided when possible. For example, if we evaluate a psychotherapy intervention based on an RCT which measures depressive symptoms, the cost-effectiveness is best expressed as “x years of depression averted per $1000 spent”, or “y SD-years of depression averted per $1000 spent”. If these measures have been converted, the conversion rates used should be clearly stated (as we do when presenting results from our Intervention BOTECs).
This approach makes it difficult to directly compare interventions in different fields. But that’s better than unjustified confidence in conversion rates between very different measures. There are ways of making comparisons while being transparent about assumptions. For example you can give statements like “under the assumption that averting five years of depression is at least as good as averting one DALY, this psychotherapy intervention is more cost-effective than top malaria-prevention interventions”.
Converting between SD-years of depression and DALYs
Although conversion is fraught, we do use it in order to compare mental health interventions. Psychotherapy intervention effect sizes are often given in terms of SD-years, where one SD-year is a one-standard-deviation improvement in symptoms for one year, and we attempt to convert this to DALYs. Much rests on this conversion rate[8].
A survey of a general US population (Tomitaka et al, 2018ii) found the SD of PHQ-9 scores (a 27-point scale) to be slightly more than 4 points. We estimate that depression on the cusp of moderately-severe and severe has a DALY weight at 0.65[9], and corresponds to a PHQ-9 score of 19.5. Thus we estimate an exchange rate of around $0.65 \times 4.34 / 19.5 \approx 0.14$ DALYs per SD of depressive symptoms. However, it is unclear whether this logic can be extended to smaller improvements, or to people with milder depression, or the (non-depressed) family members who we assume are benefiting from spillover effects.
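The exchange-rate arithmetic in the preceding paragraph can be sketched as below, taking the PHQ-9 SD as 4.34 points. The function name `dalys_per_sd` is illustrative. The calculation assumes the DALY weight scales linearly with PHQ-9 score from zero, which is exactly the assumption flagged as uncertain for milder cases.

```python
def dalys_per_sd(daly_weight: float, phq9_score: float,
                 phq9_sd: float) -> float:
    """DALY weight per PHQ-9 point, scaled up to one SD of symptoms.

    Assumes the weight rises linearly from 0 at a PHQ-9 score of 0.
    """
    weight_per_point = daly_weight / phq9_score
    return weight_per_point * phq9_sd


rate = dalys_per_sd(daly_weight=0.65, phq9_score=19.5, phq9_sd=4.34)
print(f"{rate:.3f} DALYs per SD of depressive symptoms")
```

This gives roughly 0.145, which the text rounds to 0.14.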
Furthermore, effect sizes are usually calculated using the SD of the study population, which may not be equal to the value of 4.3 found in Tomitaka et al, 2018ii. This means that a high effect size could simply be the product of low variance in symptoms in the study population (which has, after all, been screened for acute mental illness). HLI considers this “range restriction” effect, but they determine that a 10% discount is enough to account for it.
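To illustrate the range-restriction concern, the sketch below re-expresses a study effect size in population SDs. The study values (an effect size of 1.0 and a study SD of 3.0 points) are hypothetical and chosen only for illustration. The population SD of 4.3 is the Tomitaka et al. figure quoted above. This is not HLI's adjustment method.

```python
def rescale_effect_size(effect_size_study: float, sd_study: float,
                        sd_population: float) -> float:
    """Convert an effect size in study SDs to one in population SDs."""
    if sd_study <= 0 or sd_population <= 0:
        raise ValueError("standard deviations must be positive")
    # Recover the raw change in questionnaire points, then re-standardize.
    raw_change_points = effect_size_study * sd_study
    return raw_change_points / sd_population


# Hypothetical study: screened participants with a narrower spread of symptoms.
rescaled = rescale_effect_size(effect_size_study=1.0, sd_study=3.0,
                               sd_population=4.3)
shrinkage = 1 - rescaled / 1.0

print(f"Effect size in population SDs: {rescaled:.2f}")
print(f"Reduction relative to study effect size: {shrinkage:.0%}")
```

With these illustrative inputs, the same raw improvement shrinks by about 30% once it is expressed in population SDs. The size of the appropriate discount therefore depends heavily on how much narrower the study population's SD is.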
Conclusion
It is fundamentally difficult to compare the badness of mental illness to the badness of physical illness or death. Measures like the DALY, QALY or the WELLBY will lead to different valuations of mental illness, and we should be aware of the philosophical assumptions implicit in these measures.
It’s possible that the QALY and DALY, which are common currencies in the field of global health, systematically underestimate the badness of depression and of severe mental illness. Life satisfaction has a better chance of capturing internal welfare states, but comes with its own problems. We can partially correct for limitations in these measures by making our own adjustments. Some things, like extreme pain, depression and psychosis, may never be fully captured by bounded scales.
[1] I intend a very vague meaning of “prioritize” here: the more heavily you weigh preventing suffering over increasing happiness, the stronger the case for tackling severe depression over, say, saving lives.
[2] Mental illness was found to be the biggest global cause of misery (defined as the bottom 10% on life-satisfaction ratings).
[3] There are several ways of eliciting QALY scores for each health state, which we will not go into here.
[4] We suspect that the suicide risk for severe depression is higher than for mild depression. If this is true, the stricter your diagnostic criteria for depression, the stronger the relative risk of suicide.
[5] They justify this by examining correlations in improvement of MHa, depression symptoms and SWB in studies that measure several of these factors (link to HLI analysis)
[6] Or even more, if the “neutral point” is set above zero. In Kenya the average LS score is just 4.4, so a neutral score of 5 would imply that the average Kenyan person’s life is net-negative.
[7] HLI assume the standard deviation of LS to be 2.17 (see “input” tab here). It is generally thought to be higher than this in HICs and lower in LMICs.
[8] Or, for that matter, the WELLBY weighting of an SD-year of depression.
[9] Taking the geometric mean of updated moderate and severe depression weightings according to our analysis, and adding the increased risk of suicide that is associated with depression in women.
Thanks for writing this - there's some good stuff here. A few comments:
Minor point, but I think 'being dead' is more accurate than 'death'. The latter suggests permanency, whereas values <0 can represent temporary states that are deemed worse than being dead. That said, there is some uncertainty over the meaning of negative valuations, and the best interpretation may depend on the methods used to elicit the values.
IIRC, physical pain is the dimension given the highest weight in the EQ-5D, so I'm not sure this is accurate for QALYs at least. I haven't looked into it fully, but one might expect DALYs to underweight pain, as in the example above, because (intuitively) one is no less 'unhealthy' if, say, terminal cancer is treated with painkillers. In contrast, your 'quality of life' is higher with lower pain, and most people have a strong preference for less pain, which is what QALYs aim to capture. In general, QALYs and DALYs give similar weights, so I'm not sure how much it matters in practice, but I haven't looked at differences across types of health state. EDIT: A useful project would be to compare DALY and QALY values for painful and mental disorders, but it wouldn't be that straightforward as QALYs are normally based on generic descriptions of health states while DALYs refer to specific conditions.
If done properly, I think comorbidities are captured by both QALYs and DALYs. An individual's QALY value is normally based on their self-reported score on a generic health state questionnaire, e.g. the EQ-5D has mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. This is done without reference to specific health conditions (e.g. arthritis, cancer), so the impact of any comorbidities should be reflected in the valuation. When individual data are not available, I think the impact of comorbidities is typically estimated by multiplying the weights, e.g. 0.3*0.7=0.21, though alternatives have been suggested.
DALYs do focus on health conditions but, at least when assessing the burden of disease, they try to account for sequelae (consequences of a condition). See the links in my summary here: https://forum.effectivealtruism.org/posts/Lncdn3tXi2aRt56k5/health-and-happiness-research-topics-part-1-background-on#Population_health_summaries1
That said, I'm not sure whether comorbidities/sequelae are always adequately captured in cost-effectiveness analyses, especially model-based analyses that use a hypothetical treated population. I can see it would be tempting for a modeller to ignore other conditions when evaluating the impact of an intervention on a particular disease.
On depression:
There are reasons to suspect that even the weights given by sufferers underestimate its badness. For example [EDIT: I've expanded this list]:
But I'm not sure it's right to correct the weights for suicide. The QALY gain/DALY loss from an intervention is a function of both duration and quality/healthfulness of life, so adjusting the quality dimension risks double-counting the effect of suicide. EDIT: But there are reasons to think that the DALY-based Global Burden of Disease studies have underestimated the burden of mental illness. I haven't kept up to date on the methodology since writing my posts, but see this paper from 2016.
My PhD supervisor made a similar point this week. Some kinds of wellbeing (e.g. objective lists) may not have an upper or lower limit, and it may be unnecessary to impose them: to trade off duration and quality of life, we need a zero point (e.g. 'as bad as being dead') and some unit (e.g. a quantity of wellbeing), but a fixed upper and lower bound might not be essential. For hedonists, there may be some physical limit to pain/pleasure, which could potentially serve as the bounds. For desire theorists, I'm not sure: I suppose intensity of preference is also a mental state that might also have a physical limit. But whether it's feasible to develop a practical measure that captures the extremes without neglecting important gradations of more common states is as-yet unclear to me. Maybe one option is to anchor scales at something non-extreme, e.g. WELBY 1 = the 95th percentile of some population, and simply allow some individuals to score arbitrarily higher than 1 (e.g. WELBY 5 during some extreme pleasures). Mutatis mutandis for negative states. But I need to think about it more.
Thanks for your comment, Derek. This has been really useful.
Some changes I have made in response:
A question:
You might want to check disability weights for other painful conditions; I don't remember if they were generally low.
I suspect QALYs still underweight extreme pain, for various reasons, e.g. the arbitrary cap on negative values, and the lack of experience of such states among most respondents (typically the general public in high- and middle-income countries). The distribution of responses typically suggests 'floor effects', with some respondents likely to give lower values if it were permitted. The Devlin et al paper I linked to previously gives good evidence of that, but here is a graph from a different paper (UK sample) for illustration (note the cluster at -1).
My point was more that pain gets a high weight relative to other dimensions of the EQ-5D...though not always the highest. As shown in the graph below, the original EQ-5D-3L UK tariff (Dolan, 1997) had pain as second (after mobility) for extreme, and roughly equal first (with self-care) for moderate, based on TTO responses from the general public. (I can give you the Excel version of the graph if you want to modify it.)
The preliminary UK tariff for the newer EQ-5D-5L gave pain the highest weight, followed by depression/anxiety, for the extreme level. Full results below... but note that NICE rejected the value set for methodological reasons so, last I checked, it still recommends mapping the old 1997 3L figures onto the 5L with an algorithm.
There are many other tariffs from many other countries, for both the 3L and 5L, if you want to compare: https://euroqol.org/information-and-support/resources/value-sets/
I think this may be true (given some plausible-to-me philosophical and psychological assumptions), but it's also more generally that studies done in sufferers likely underestimate the badness. For example, because studies exclude the most severe cases, the badness of severe depression would be underestimated even if the study participants gave fully 'valid' responses (and even if an instrument were used that was able to capture the full range of experience).
I don't remember the details of the DALY/GBD methods, and I don't know a great deal about diabetes, but I'm pretty sure it can be a cause as well as consequence of obesity. At least insulin therapy can cause weight gain. And obviously you'd want to count only the proportion of diabetics who would have got depressed/gained weight as a result of diabetes.
Not sure I follow the depression example, but yes, you would sum the YLL from suicide (i.e. 'standard' or counterfactual life expectancy minus the actual number lived) and YLD (i.e. years lived with depression * disability weight). The formula/steps and examples are here and here.