
Net Promoter Score (NPS) is a widely used method for measuring customer satisfaction. It asks “How likely is it that you would recommend [brand] to a friend or colleague?” and the response is (usually) a number between 0 and 10. However, instead of an average, the aggregate score is a complex nonlinear function of the results. CEA has moved away from this function in favor of simply taking the arithmetic mean. Briefly, this is because the original results don’t replicate, NPS is not empirically supported, it requires larger sample sizes, and it violates survey best practices.

Summary

  1. NPS is widely used, but the research behind it has failed to replicate, even when the replication used the originally published data set (!).
  2. Measures of satisfaction are more predictive than NPS of outcomes such as firm growth and whether the respondent actually recommends the product to others.
  3. The American Customer Satisfaction Index is an alternative which has stronger empirical grounding, as well as a huge number of publicly available benchmarks. It uses 3 questions, on a 10 point scale, whose scores are averaged and normalized to a 0-100 scale:[1]
    1. What is your overall satisfaction with X?
    2. To what extent has X met your expectations?
    3. How well did X compare with the ideal (type of offering)?
  4. CEA mostly still asks the NPS question, but switched to taking the arithmetic mean of the results. We call this the “likelihood to recommend” (LTR).[2]
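
To make the difference between the two aggregates concrete, here is a minimal sketch. It assumes the commonly published NPS definition (the percentage of respondents answering 9 or 10 minus the percentage answering 0 to 6), which is not spelled out above. The function names and the sample responses are made up for illustration.

```python
def net_promoter_score(responses):
    """NPS as commonly defined: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for r in responses if r >= 9)
    detractors = sum(1 for r in responses if r <= 6)
    return 100 * (promoters - detractors) / len(responses)


def likelihood_to_recommend(responses):
    """LTR: the plain arithmetic mean of the 0-10 responses."""
    return sum(responses) / len(responses)


# Hypothetical response sets, all with the same mean of 7.
samples = {
    "all sevens": [7, 7, 7, 7],
    "three nines and a one": [9, 9, 9, 1],
    "three sixes and a ten": [6, 6, 6, 10],
}

for label, responses in samples.items():
    ltr = likelihood_to_recommend(responses)
    nps = net_promoter_score(responses)
    print(f"{label}: LTR = {ltr:.1f}, NPS = {nps:+.0f}")
```

All three hypothetical samples have the same LTR, but their NPS values differ widely because the score only looks at which bucket each answer falls into.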

More information

  1. NPS was introduced in 2003 with the claim that it was the best predictor of growth across a data set of companies. This data set was small and subject to p-hacking. The raw data has not been published (including, ironically, the pieces the author says should always be published when reporting NPS scores). The original research methodology was:

“We then obtained a purchase history for each person surveyed and asked those people to name specific instances in which they had referred someone else to the company in question… The data allowed us to determine which survey questions had the strongest statistical correlation with repeat purchases or referrals….
One question was best for most industries. “How likely is it that you would recommend [company X] to a friend or colleague?” ranked first or second in 11 of the 14 cases studies”[3]

  2. Replication attempts (including ones which reverse engineered the original data set from published scatterplots) have failed to find significant predictive value from NPS. A wide variety of alternative statistical methods exist, some of which have stronger empirical grounding.
    1. Notably, NPS is worse than satisfaction measures at predicting whether the respondent will actually recommend the product.
  3. Replication attempts find alternate definitions of the NPS scale to be more predictive than the commonly used one, even if the question is kept the same (e.g. using a 7 point scale).
  4. The weird way NPS is calculated means that it requires substantially larger sample sizes.
  5. The NPS question disagrees with commonly accepted best practices in survey design (e.g. using an 11-point scale instead of a 5-point one).
  6. There doesn’t seem to be any particular reason to think that NPS is good, apart from it being widely used.
  7. So if it’s so terrible, why does everyone use it? This Wall Street Journal article implies that it is used precisely because it’s so easy to manipulate: “Out of all the mentions the Journal tracked on earnings calls, no executive has ever said the score declined.”[4]
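
One rough way to see the sample size point is to compare the sampling noise of the two aggregates on a common footing. The sketch below is an illustration under stated assumptions, not the analysis from the replication literature: the response distribution is invented, NPS uses the common 9-10 / 0-6 cutoffs, and each standard error is expressed as a fraction of its own scale range (0 to 10 for the mean, -1 to +1 per respondent for NPS). Other ways of putting the two on the same scale would give different numbers.

```python
from math import sqrt


def standard_errors(probabilities, n):
    """Return (SE of the mean, SE of NPS), each as a fraction of its scale range.

    probabilities[i] is the assumed share of respondents answering i (0-10).
    """
    scores = range(11)
    mean = sum(p * s for p, s in zip(probabilities, scores))
    var_mean = sum(p * (s - mean) ** 2 for p, s in zip(probabilities, scores))

    # Each respondent contributes +1 (promoter), 0 (passive) or -1 (detractor).
    p_pro = sum(probabilities[9:])
    p_det = sum(probabilities[:7])
    var_nps = p_pro + p_det - (p_pro - p_det) ** 2

    se_mean = sqrt(var_mean / n) / 10  # the mean ranges over 0..10
    se_nps = sqrt(var_nps / n) / 2     # per-respondent NPS ranges over -1..+1
    return se_mean, se_nps


# Hypothetical distribution of answers 0..10 (shares sum to 1).
dist = [0.02, 0.01, 0.02, 0.03, 0.04, 0.08, 0.10, 0.20, 0.20, 0.15, 0.15]

se_mean, se_nps = standard_errors(dist, n=100)
# Standard errors shrink with sqrt(n), so the squared ratio is the factor by
# which the NPS sample would have to grow to match the mean's relative precision.
sample_size_factor = (se_nps / se_mean) ** 2
print(f"relative SE of mean: {se_mean:.3f}")
print(f"relative SE of NPS:  {se_nps:.3f}")
print(f"sample size factor:  {sample_size_factor:.1f}")
```

The intuition is that collapsing eleven answer options into three buckets throws away information, so under this assumed distribution NPS is noisier than the mean at the same sample size.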

Further Reading

  1. ^

    (Note: different sources seem to use slightly different wording and I’m not sure what the “official” wording is because it’s proprietary. Also, the official version uses a proprietary weighting of these three questions but people online seem to think the weights are approximately equal.)

  2. ^

    We usually do this because we don’t want to take people’s time up by asking three questions. I haven’t done a very rigorous analysis of the trade-offs here though, and it could be that we are making a mistake and should use ACSI instead.

  3. ^

    “ranked first or second in 11 of the 14 cases studies” should already be setting off alarm bells

  4. ^

    Of course, this doesn’t explain why investors allow executives to tie their compensation to easily hackable metrics.

Comments (7)



Cool! Glad to see this, I've been harping on about the NPS for some time (1, 2, 3, 4).

> We usually do this because we don’t want to take people’s time up by asking three questions. I haven’t done a very rigorous analysis of the trade-offs here though, and it could be that we are making a mistake and should use ACSI instead.

As you may have considered, you could ask just one of the ACSI items, rather than asking the one NPS item. This would have lower reliability than asking all three ACSI items, but I suspect that one ACSI item would have higher validity than the one NPS item. (This is particularly the case when trying to elicit general satisfaction with the EA community, but maybe less so if you literally want to know whether people are likely to recommend an event to their friends).

The added value of using three items to generate a composite measure is potentially pretty straightforward to estimate, esp if you have prior data with the items.  Happy to talk more about this.

Thanks David! If you have references or could say more about the virtues of asking one ACSI question versus the NPS question, I would love to read/hear them.

Hi Ben.

There are two broad reasons why I would prefer the ACSI items (considered individually) over the NPS (style) item:

  • The ACSI items are (mostly) more face valid
  • The ACSI items generally performed better than the NPS when we ran both of these in the EAS 2020

Face validity

This depends on what you are trying to measure, so I’ll start with the context in the EAS, where (as I understand it) we are trying to measure general satisfaction with or evaluation of the EA community.

Here, I think the ACSI items we used (“How well does the EA community compare to your ideal? [(1) Not very close to the ideal - (10) Very close to the ideal]” and “What is your overall satisfaction with the EA community? [(1) Very dissatisfied - (10) Very satisfied]”) more closely and cleanly reflect the construct of interest.

In contrast, I think the NPS style item (“If you had a friend who you thought would agree with the core principles of EA, how excited would you be to introduce them to the EA community?”) does not very clearly or cleanly reflect general satisfaction. Rather, we should expect it to be confounded with:

  • Attitudes about introducing people to the EA community (different people have different views about how positive growing the EA community more broadly is)
  • Perceived/projected personal “excitement” (related to one’s (perceived) emotionality, excitability etc.)
  • Sociability/extraversion/interest in introducing friends to things in general, as well as one’s own level of social engagement with EA (if one is socially embedded in EA, introducing friends might make more sense than if you are very pro EA, but your interaction with it is entirely non-social)

I think some of these issues are due to the general inferiority of the NPS as a measure of what it’s supposed to be measuring:

And some of them are due to the peculiarities of the context where we’re using NPS (generally used to measure satisfaction with a consumer product) to measure attitudes towards a social movement one is a part of (hence the need to add the caveat about “a friend who you thought would agree with the core principles of EA”).

Some of the other contexts where you’re using NPS might differ. Likelihood to recommend may make more sense when you’re trying to measure evaluations of an event someone attended. But note that the ‘NPS’ question may simply be measuring qualitatively different things when used in these different contexts, despite the same instrument being presented. That is, asking about recommending the EA community as a whole elicits judgments about whether it’s good to recommend EA to people (does spreading EA seem impactful or harmful etc?), whereas asking about recommending an event someone attended mostly just reflects positive evaluation of the event. Still, I slightly prefer a simple ACSI satisfaction measure over NPS style items, since I think it will be clearer, as well as more consistent across contexts.

Performance of measures

Since we included both the NPS item and two ACSI items in EAS 2020 we can say a little about how they performed, although with only 1-2 items and not much to compare them to, there’s not a huge amount we can do to evaluate them.

Still, the general impression I got from the performance of the items last year confirms my view that the two ACSI measures cohere as a clean measure of satisfaction, while NPS and the other items are more of a mess. As noted, we see that the two ACSI measures are closely correlated with each other (presumably measuring satisfaction), while the NPS measure is moderately correlated with the ‘bespoke’ measures (e.g. “I feel that I am part of the EA community”) which seem to be (noisily) measuring engagement more than satisfaction or positive evaluation. I think it’s ultimately unclear what any of those three items are measuring since they’re all just imperfectly correlated with each other, engagement and with satisfaction, so I think they are measuring a mix of things, some of which are unknown. Theoretically, one could simply run a larger suite of items, designed to measure satisfaction, engagement, and other things which we think might be related (such as what the bespoke measures are intended to measure) and tease out what the measures are tracking. But there’s not a huge amount we can do with just 5-6 items and 2-3 apparent factors they are measuring.

Benefits of multiple measures

As an aside, we put together some illustrations of the possible concrete benefits of using a composite measure of multiple items, rather than a single measure.

The plot below shows the error (differences between the measured value and the true value: higher values, in absolute terms, are worse) with a single item vs an average made from two or three items. Naturally, this depends on assumptions about how noisy each item is and how correlated each of the items are, but it is generally the case that using multiple items helps to reduce error and ensure that estimates come closer to the true value.

This next image shows the power to detect a correlation of around r = 0.3 using 1, 2 or 3 items. The composite of more items should have lower measurement error. When only a single item is used, the higher measurement error means that a true relationship between the measured variable and another variable of interest can be harder to detect. With the average of 2 or 3 items, the measure is less noisy, and so the same underlying effect can be detected more easily (i.e., with fewer participants). (The three different images just show different standards for significance)
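
For readers who want to play with the numbers, here is a minimal analytic sketch of the same idea. It is not the simulation behind the plots. It assumes parallel items that each have the same hypothetical reliability (0.6 here, chosen arbitrarily), uses the Spearman-Brown formula for the reliability of a k-item average, the classical attenuation formula (treating the other variable as measured without error), and the Fisher z approximation for the sample size needed to detect a correlation. All function names are made up for illustration.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist


def composite_reliability(item_reliability, k):
    """Spearman-Brown: reliability of the average of k parallel items."""
    return k * item_reliability / (1 + (k - 1) * item_reliability)


def observed_correlation(true_r, reliability):
    """Classical attenuation: noise in the measure shrinks the observed r."""
    return true_r * sqrt(reliability)


def required_n(r, alpha=0.05, power=0.8):
    """Approximate n to detect correlation r (two-sided test, Fisher z)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)


TRUE_R = 0.3            # underlying correlation, as in the plots described above
ITEM_RELIABILITY = 0.6  # assumption: hypothetical reliability of a single item

for k in (1, 2, 3):
    rel = composite_reliability(ITEM_RELIABILITY, k)
    r_obs = observed_correlation(TRUE_R, rel)
    print(f"{k} item(s): reliability = {rel:.2f}, "
          f"observed r = {r_obs:.3f}, n needed = {required_n(r_obs)}")
```

Under these assumptions the required sample size falls as items are added, with diminishing returns from each extra item. How large the gain is depends heavily on the assumed single-item reliability.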


 

I just wanted to say that I always appreciate your in-depth responses David! They are always really easy to follow and informative :)

I'd also be interested in this!

Hello, since I saw this post, I switched a couple of things to using ACSI. I always thought NPS seemed pretty bad, and mostly only included it for comparison with groups like CEA who were using it.

Do you have any data you're able to share publicly yet?

 

Additionally:

> The American Customer Satisfaction Index is an alternative which has stronger empirical grounding, as well as a huge number of publicly available benchmarks. It uses 3 questions, on a 10 point scale, whose scores are averaged and normalized to a 0-100 scale:[1]

How exactly are you calculating it? The Wikipedia formula seems wrong to me, unless I'm misunderstanding it.

(I have 9 answers for each of the three questions. The average responses are 9.4, 9.6, and 9.3. So I think what I'm supposed to do is =((9.4*1+9.6*1+9.3*1)-1)/9*100 . This gives me "303.7037037" which clearly seems wrong.)

My interpretation of what it should be: 

=(((9.4+9.6+9.3)-3)/27)*100

Which equals 93.8. The simpler but slightly less accurate =((9.4+9.6+9.3)/3)*10 comes out similarly, at 94.4.
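
The interpretation above can be written as a small function. This is a sketch of the equal-weight reading only: the official ACSI weighting is proprietary (see footnote 1), so the equal weights, the function name, and the default scale bounds are assumptions.

```python
def acsi_style_score(item_means, scale_min=1, scale_max=10):
    """Average the item means with equal weights, then rescale to 0-100."""
    overall = sum(item_means) / len(item_means)
    return (overall - scale_min) / (scale_max - scale_min) * 100


item_means = [9.4, 9.6, 9.3]  # the rounded averages quoted above

score = acsi_style_score(item_means)
# Same thing written the way the comment does: subtract the scale minimum once
# per item (3 in total) and divide by the total possible range (3 * 9 = 27).
same_score = (sum(item_means) - 3) / 27 * 100
# The shortcut ignores that the scale starts at 1 rather than 0.
shortcut = sum(item_means) / 3 * 10

print(round(score, 1), round(same_score, 1), round(shortcut, 1))
```

With the rounded means this gives about 93.7 for the rescaled score and about 94.3 for the shortcut, slightly below the 93.8 and 94.4 quoted above, presumably because those were computed from unrounded averages. The first formula in the comment goes wrong because it subtracts 1 from the sum of three items and divides by the range of a single item.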

Which seems very good. E.g. "Full-Service Restaurants", "Financial Advisors", and "Online News and Opinion" all  seem to hover around 70-80, while government services range a bit more widely from 60 to 90.

(Caveat that I didn't realise that you were supposed to include labels on 1 and 10 for each of the questions until I checked the Wikipedia entry just now to calculate it, and I'm not sure how this would affect the results. The labels seem pretty weird to me, so I suspect it does affect it somehow.)

Thanks!

Appreciate this update! 

> NPS [...] violates survey best practices.

Agree. For our EA retreats in Germany, we've also always just used the mean. I'm surprised that NPS is so widely used in industry. 
