I decided to run an Effective Altruist message test on a full-population survey I have access to, use Bayesian message-testing software to analyze the results, and share the results with the EA community on the forum.
I tested several EA-themed messages aimed at increasing respondents’ interest in donating and in effective altruism. Non-control respondents were presented with either the claim that AMF can save a life for $3,500 (Facts), a short version of Peter Singer’s pond analogy (Obligation), or a short take on Will MacAskill’s opportunity framing (Opportunity).
I then asked respondents how much they planned to donate in the next 12 months and how interested they were in EA, and gave them the opportunity to click on a link to donate to the Against Malaria Foundation (AMF) or to join Giving What We Can (GWWC); I recorded whether they clicked on these links. Additionally, I asked how much they donated in the last 12 months as a “prescreen” to control for in my analysis, dramatically increasing my precision. The wording of each of these questions is in a doc linked at the bottom of this article.
Overall, my results indicate that these brief messages do not increase respondents’ prospective donations, but there is some evidence that the Facts and Obligation messages may increase interest in EA among educated individuals. Below, I discuss my results in more detail.
Methods
This experiment was embedded in a 1,200-person online survey representative of US citizens. Within this survey, I delivered three treatments (Facts, Obligation, and Opportunity), each with a sample size of ~200 respondents. I compared how interested respondents were in learning more about effective altruism and how much they planned to donate. I examined these relationships overall and among the critical subgroup of those with at least a bachelor’s degree (this was my only pre-planned comparison; having a bachelor’s degree serves as a rough proxy for EA’s target elite audience). Unfortunately, only 10 respondents took substantive action by clicking on the AMF donation or GWWC links, so the sample was not sufficient for robust analysis of this dependent variable.
To analyze these relationships, I used Bayesian hierarchical modeling software built on R and Stan. This modeling approach allows us to create a probability distribution of the treatment effects, reduce variance by controlling for other variables in the survey, and borrow power from the full sample when estimating effects among subgroups.
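To make the hierarchical idea concrete, here is a toy sketch in Python (the actual analysis used R and Stan, and every count below is hypothetical): each arm’s interest rate gets a conjugate Beta posterior whose prior is centered on the pooled rate, so small arms are shrunk toward the overall sample, i.e. they "borrow power".

```python
# Toy stand-in for the hierarchical logic, NOT the actual Stan model.
# All counts are made up for illustration.

arms = {  # (interested, n) per arm -- hypothetical numbers
    "control":     (120, 600),
    "facts":       (52, 200),
    "obligation":  (48, 200),
    "opportunity": (38, 200),
}

total_yes = sum(y for y, n in arms.values())
total_n = sum(n for _, n in arms.values())
pooled = total_yes / total_n  # overall interest rate

# Shared Beta prior centered on the pooled rate; prior_n is its
# pseudo-sample size and controls how strongly arms are pulled together.
prior_n = 50
posterior_mean = {arm: (pooled * prior_n + y) / (prior_n + n)
                  for arm, (y, n) in arms.items()}

# Each arm's effect relative to control:
ate = {arm: posterior_mean[arm] - posterior_mean["control"]
       for arm in arms if arm != "control"}
```

Each arm’s posterior mean lands between its raw rate and the pooled rate; a real hierarchical model also estimates the pooling strength from the data instead of fixing `prior_n`.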
Future Donation Plans
To gauge how the EA messages changed people’s donations, I predicted the probability that respondents plan to donate at least 6% of their income to charity. Below are treatment effects and probability distributions for the effect of each message. Error bars are one standard error.
As the plots above show, none of the treatments had a large impact; the small effects that do appear are negative. The Average Treatment Effects (ATEs) are minute, well under 1 percentage point. Even in the most optimistic case, few respondents are likely to increase their expected donation behavior in response to the arguments provided. This is not surprising: it can take a lot for people to change substantial behaviors, or even to suggest in a survey that they might. There is little evidence that any one argument is more effective than the others; assessing that will require other methods.
Interest in Effective Altruism
I also examined interest in Effective Altruism (whether respondents are at least somewhat interested in learning more about EA). Below are treatment effects and probability distributions for the effect of each message.
Unlike prospective donations, interest exhibits substantial variation between treatments. While the Opportunity message appears to actively put people off, letting people know they can save a life for $3,500 does appear to get folks interested in effective altruism, and may be a good way to get them in the door. However, the most interesting ATEs appear when breaking the results out by education level. Below are ATEs, probabilities that each message is best, and probabilities of backlash (a negative effect) for each message among those with at least a bachelor’s degree.
There’s a large gap in the impact these messages have on those with and without a bachelor’s degree. For those with a bachelor’s degree, every message has a positive ATE, and the effects of Obligation and Facts are quite substantial (they increase the chance individuals are at least somewhat interested in learning more about EA by over 6 percentage points). However, there’s a substantial amount of variance in these effect sizes: the true effect of these interventions could credibly range from slightly negative to over 15 percentage points, though most of the probability mass lies in the middle of this range. Overall, Singer’s pond analogy and the scope of impact people can have with their donations appear to be effective methods of building interest in EA; there’s around a 90% chance that one of these is the best message (out of the three and the control) for individuals with at least a bachelor’s degree.
However, the results are much different for those without a bachelor's degree. Only the factual argument appears to have any positive effect, and respondents appear to be turned off by the Opportunity and Obligation framings. This confirms what we already suspected: prospective EAs are likely to be educated elites, and it makes sense to target them.
Discussion and caveats
Overall, these findings are mixed, but not surprising. EA messaging does work to get people interested in Effective Altruism. However, EA messaging alone is not enough to get people to even claim that they will increase their donations. This likely takes a much more substantial treatment. But getting educated elites interested in Effective Altruism is the first step. By emphasizing the moral reasoning behind effective altruism and the scale of good we can do in the world, we can encourage people to learn more about Effective Altruism. From there, we can change their behavior.
Like all research, this study is limited. For one, the individuals targeted and the context are not entirely typical. EA messaging tends to come from friends and acquaintances, in person or in online discussions, rather than as an anonymous message in a web survey. People may react differently in other situations, but this study does provide an important piece of experimental evidence that can inform how we try to engage people. Additionally, having a bachelor’s degree is not enough to be in EA’s core audience: EAs as a whole tend to be analytically oriented and are often mathematical in their thinking, so this survey population is broader than the typical prospective EA audience. The good news is that both of these differences suggest the true effect size may be larger; a more personal contact with an even better-targeted audience may be very effective at encouraging people to join EA. Indeed, that may help explain EA’s substantial growth. However, in another way the audience is overly restrictive: only US citizens were included, and different messages may be more effective in other countries.
This study is a step forward: it provides some evidence on which treatments work best and what we can accomplish with a single contact. As always, more research (both observational and experimental) is needed. As our community runs more trials to test our messaging, we can continue to fine-tune it and expand the appeal of EA.
Thanks to Kerry Vaughan for advice on message choice.
This is great research! But to me it looks like the "fact" message you gave was really an "opportunity" message, and the "opportunity" message was really... well, I don't know how to describe it! I think the takeaway, for talking to people with bachelor's degrees, is that opportunity is an effective mode of communication as long as it's "opportunity to make the world better", not "opportunity to be a great person".
Thanks!
I adapted that framing from Will MacAskill (an example of this starts at 12:45 in the podcast with Sam Harris here: https://www.samharris.org/podcast/item/being-good-and-doing-good). MacAskill refers to the framing as "Excited Altruism". It might come across better when he tells it than in a web survey, but I think it's pretty similar. I grouped this in with "opportunity", which I've also seen called "exciting opportunity" in the EA community (http://lukemuehlhauser.com/effective-altruism-as-opportunity-or-obligation/).
But, regardless of what it's called, I agree with you on the takeaway.
Yay for Bayesian regression (binomial, I'm guessing? You re-binned your attitude and donation responses? I think an ordered logit would be more appropriate here and result in less loss of resolution, or even a Dirichlet, but then you'd lose yer ordering)! Those posteriors look decently tight, though I do have some questions!
I'm a little confused on what your control was, exactly. You have both points and distributions in your posterior plots, but you don't have any control paragraph blurb in your Google Doc questionnaire. How did you evaluate your control? Did you give them a paragraph entirely unrelated to EA? Are these plots the posterior estimates for p_binomial when each dummy variable for treatment is 0? Is "average treatment effect" some posterior predictive difference from the control p (i.e. why it's exactly 0)?
On a related (and elucidatory) note, could you more explicitly clarify which models you fitted, exactly? Did you do any model comparison or averaging, or evaluate model adequacy? You mention "controlling for other variables in the survey" but I don't see any e.g. demographic questions in your questionnaire. You said you "examined these relationships overall and among the critical subgroup of those with at least a bachelor’s degree" -- did you do this by excluding everyone without a bachelor's, or by modeling the effects of educational attainment and then doing model comparison to test the legitimacy of those effects (I'd think looking at the posterior for the interaction between your paragraph and education dummies would be the clearest test)? Did you use diffuse, "uninformative" priors (and hyperpriors)? Which ones, exactly?
I assume that since this is a hierarchical analysis you used MCMC (HMC?) to do the fitting. Are your posterior distributions smoothed substantially, e.g. with a kernel density estimator? Or did you just get fantastic performance? What diagnostics did you run to ensure MCMC health? How many chains did you run? Did you use stopping rules? In my experience, hierarchical regression models can be pretty finicky to fit as they get more complex.
Kudos on not just using some wackily inappropriate out-of-the-box frequentist test!
edit: also, what are the boxplot-looking things? 95% HPDIs? CIs? Some other %? Ah wait they're the sd of your marginal samples?
It would be cool to provide the code, for both learning and verification purposes.
Unfortunately, because I used proprietary survey data/a proprietary R package to run this analysis, I don't think I'll be able to share the data and code.
Ah, interesting! What package? I've never heard of something like that before. Usually in the cold, mechanical heart of every R package is the deep desire to be used and shared as far as possible. If it's just someone's personal interface code, why not use something more publicly available? Can you write out your basic script in pseudocode (or just math/words?)? Especially the model and MCMC specification bits?
Sure, in an ideal world, software would all be free for everyone; alas, we do not live in such a world :p. I used the proprietary package because it did exactly what I needed and doesn't require writing STAN code or anything myself. I'd rather not re-invent the wheel. I felt the tradeoff of transparency for efficiency and confidence in its accuracy was worth it, especially since I wouldn't be able to share the data either way (such are the costs of getting these questions on a 1200 person survey without paying a substantial amount).
But the basic model was just a multilevel binomial model predicting the dependent variable using the treatments and questions asked earlier in the survey as controls.
Of course (though wheel reinvention can be super helpful educationally), but there are great free public R packages that interface to Stan (I use "rethinking" for my hierarchical Bayesian regression needs, but I think RStan would work too), so going with someone's unnamed, private code isn't necessary imo. How much did the survey cost (was it a lot longer than the included Google Doc, then? e.g. did you have screening questions to make sure people read the paragraph?)? And model + MCMC specification can have lots of fiddly bits that can easily lead us astray, I'd say.
Yeah, the survey was a lot longer. General public surveys typically cost over $10 per complete, so getting 1,200 cases for a survey like this can cost thousands of dollars.
I agree that model specification can be tricky, which is a reason I felt it well worth it to use the proprietary software I had access to that has been thoroughly vetted and code reviewed and is used frequently to run similar analyses rather than trying to construct my own.
I did not make sure people read the paragraph. I discussed the issue a bit in my discussion section, but one way a web survey might understate the effect is if people would pay closer attention and respond better to a friend delivering the message. OTOH, surveys do have some potential vulnerability to the Hawthorne effect, though that didn't seem to express itself in the donations question.
Yep, and alongside it, of course, the raw data!
Yup, binomial.
The respondents in a treatment were each shown a message and asked how compelling they thought it was. The control was shown no message.
Yeah; the plots are the predicted values for those given a particular treatment, and the Average Treatment Effect is the difference from the control.
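To illustrate that calculation (a Python sketch with hypothetical counts; the real model also controlled for covariates), the ATE can be read off the posterior distribution of the difference in rates, with the posterior standard deviation as the error bar:

```python
import random
import statistics

random.seed(42)

def rate_draws(successes, n, size=5000):
    # Posterior draws for a rate under a flat Beta(1, 1) prior.
    return [random.betavariate(1 + successes, 1 + n - successes)
            for _ in range(size)]

# Hypothetical counts, for illustration only.
p_control = rate_draws(120, 600)
p_facts = rate_draws(52, 200)

# The "ATE" is the posterior distribution of the rate difference; its
# mean is the plotted point and its sd gives the error bar.
diff = [t - c for t, c in zip(p_facts, p_control)]
ate_mean = statistics.mean(diff)
ate_sd = statistics.stdev(diff)
```

The control’s own "effect" is exactly zero by construction, since it is the difference of the control with itself.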
I did not include every control I used in the provided questionnaire. There was a mix of demographic, attitudinal, and behavioral questions asked in the survey that I also used. These controls, particularly previous donations, were important for decreasing variance.
I used a multilevel model to estimate the effects among those with and without a bachelor's degree, so the bachelor's estimate borrows power from those without a degree, reducing problems with overfitting.
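The shrinkage at work here can be sketched in a few lines of Python (all numbers hypothetical; the real model estimates the pooling strength rather than fixing it): the subgroup estimate is a precision-weighted blend of its own raw effect and the full-sample effect.

```python
# Sketch of partial pooling with made-up numbers: a noisy subgroup
# effect is shrunk toward the overall estimate instead of being overfit.

def partial_pool(sub_est, sub_se, overall_est, tau):
    # tau: prior sd on how far a subgroup may deviate from the overall effect
    w = (1 / sub_se ** 2) / (1 / sub_se ** 2 + 1 / tau ** 2)
    return w * sub_est + (1 - w) * overall_est

# e.g. a +8pp raw effect among bachelor's holders measured noisily
# (se = 5pp), a +2pp overall effect, and a 4pp subgroup-deviation prior:
pooled_est = partial_pool(0.08, 0.05, 0.02, 0.04)
```

The noisier the subgroup (larger `sub_se`) or the tighter the prior (smaller `tau`), the more the estimate is pulled toward the full-sample effect.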
These models used Stan, which handles multilevel models well. Convergence was assessed with Gelman-Rubin statistics.
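For readers unfamiliar with the Gelman-Rubin diagnostic, here is a minimal Python sketch of the classic (non-split) R-hat on simulated chains; Stan itself computes a more refined split-R-hat:

```python
import random
import statistics

def gelman_rubin(chains):
    # Classic (non-split) R-hat: compares between-chain variance to
    # within-chain variance; values near 1 suggest the chains have mixed.
    n = len(chains[0])
    means = [statistics.mean(c) for c in chains]
    w = statistics.mean(statistics.variance(c) for c in chains)
    b = n * statistics.variance(means)
    var_hat = (n - 1) / n * w + b / n
    return (var_hat / w) ** 0.5

random.seed(0)
# Two well-mixed chains sampling the same distribution:
good = [[random.gauss(0, 1) for _ in range(2000)] for _ in range(2)]
# Two chains stuck in different regions:
bad = [[random.gauss(0, 1) for _ in range(2000)],
       [random.gauss(3, 1) for _ in range(2000)]]
```

On the well-mixed chains R-hat comes out very close to 1; on the stuck chains it is far above 1, flagging non-convergence.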
Ah, I guess that's better than no control, and presumably paying attention to a paragraph of text doesn't make someone substantially more or less generous. Did you fit a bunch of models with different predictors and test for a sufficient improvement of fit with each? Might do to be wary of overfitting in those regards maybe... though since those aren't focal Bayes tends to be pretty robust there, imo, so long as you used sensible priors
"I used a multilevel model to estimate the effects among those with and without a bachelor's degree, so the bachelor's estimate borrows power from those without a degree, reducing problems with overfitting."
If I'm understanding correctly, you had a hyperprior on the effect of education level? With just two options? IDK that that would help you much (if you had more: e.g. HS, BA/S, MS, PhD, etc. it might, but I'd try to preserve ordering there, myself).
"These models used Stan, which handles multilevel models well. Convergence was assessed with Gelman-Rubin statistics."
Stan's great, but certainly not magic or perfect, and though idk them personally I'm sure its authors would strongly advocate paranoia about its output. So you got convergence with multiple (2?) chains from random (hopefully) starting values? R-hats were all 1? That's good! Did all the other cheap diagnostics turn up ok (e.g. trace plots, autocorrelation times/ESS, marginal histograms, quick within-chain metrics, etc.)?
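(For concreteness, here's roughly what the autocorrelation/ESS check computes -- a simplified Python sketch on simulated chains, not Stan's actual estimator:)

```python
import random
import statistics

def ess(x):
    # Rough effective sample size: n / (1 + 2 * sum of positive-lag
    # autocorrelations), truncating at the first non-positive lag
    # (a simplified initial-positive-sequence rule).
    n = len(x)
    mu = statistics.mean(x)
    var = sum((v - mu) ** 2 for v in x) / n
    acc = 0.0
    for lag in range(1, n // 2):
        rho = sum((x[i] - mu) * (x[i + lag] - mu)
                  for i in range(n - lag)) / (n * var)
        if rho <= 0:
            break
        acc += rho
    return n / (1 + 2 * acc)

random.seed(1)
iid = [random.gauss(0, 1) for _ in range(2000)]  # independent draws
ar = [0.0]                                       # sticky AR(1) chain
for _ in range(1999):
    ar.append(0.9 * ar[-1] + random.gauss(0, 1))
```

The independent draws keep nearly all of their nominal sample size, while the sticky chain's 2,000 draws are worth only a small fraction as many effective samples.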
No; I did not fit multiple models. Lasso regression was used to fit a propensity model using the predictors.
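To sketch what lasso does (a minimal Python coordinate-descent implementation with made-up data -- the lasso mechanics only, not the proprietary package's propensity-model code):

```python
# Minimal lasso via coordinate descent with soft-thresholding.

def soft_threshold(z, lam):
    # The lasso penalty shrinks small coefficients to exactly zero.
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso(X, y, lam, iters=200):
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # Residual with feature j's contribution removed
            r = [y[i] - sum(beta[k] * X[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            norm = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / norm
    return beta

# Made-up data: y depends on the first feature only; the penalty
# zeroes out the irrelevant second coefficient.
X = [[1, 1], [2, -1], [3, 1], [4, -1]]
y = [2, 4, 6, 8]
beta = lasso(X, y, lam=0.1)
```

The appeal for variable selection is exactly that zeroing behavior: predictors that don't pull their weight drop out of the model entirely.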
Using bachelor's vs. non-bachelor's has advantages in interpretability, so I think this was the right move for my purposes.
I did not spend an exorbitant amount of time investigating diagnostics, for the same reason I used a proprietary package: it has been built for running these tests at a production level, has been thoroughly code reviewed, and is used frequently to run similar analyses. I don't think it's worth the time to construct an overly customized analysis.
Ah, gotcha. But re: code review, even the most beautifully constructed chains can fail, and how you specify your model can easily cause things to go kabloom even if the machine's doing everything exactly how it's supposed to. And it only takes a few minutes to drag your log files into something like Tracer and do some basic peace-of-mind checks (and others, e.g. examine bivariate posterior distributions to assess nonidentifiability wrt your demographic params). More sophisticated diagnostics are scattered across a few programs but don't take too long to run either (unless you have e.g. hundreds or thousands of chains, like in marginal likelihood estimation w/ stepping stones... a friend's actually coming out with a program soon -- BONSAI -- that automates a lot of that grunt work, which might be worth looking out for!). :]
(on phone at gym with shit wifi so can't provide links/refs atm, sorry!)
Do you have any good textbooks or educational resources to learn these kinds of techniques?
Sure! Though unfortunately most of the stuff comes from scattered lectures, workshops, discussions, book chapters, seminars, papers, etc. But for intro multilevel Bayesian regression in R/STAN I'd say John Kruschke's "Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan" and Richard McElreath's "Statistical Rethinking: A Bayesian Course with Examples in R and Stan" would be really solid (Richard also has his course lectures up on youtube if you prefer that, though I found his book super readable, so much so that when I took the class with him a few years back I skipped most of his lectures since the room was really hot. But don't let that dissuade you from watching them, he's a great guy/speaker and quite fun and funny!).
Purely in terms of building my own intuitions/understanding, though, I've found little more helpful than just looking up the relevant algorithms and implementing the damn things from scratch (to talk of reinventing square wheels above lol... though ofc you'd use the far superior underlying code others have written for your actual analysis).
Sounds interesting. Would love to take a look when you get a chance to provide the links.
As a quick update, I also tried something similar on the EA survey to see whether making certain EA considerations salient would impact people's donation plans. The end result was essentially no effect. Obligation, Opportunity, and emphasizing cost benefit studies on happiness all had slightly negative treatment effects compared to the control group. The dependent variable was how much EA survey takers reported planning to donate in the future.
This is great. Really surprised the opportunity framing is the worst. I thought EA's takeoff was due, in part, to the opportunity framing over (what I assumed was) Singer's off-putting obligation framing.
I guess the internet and whatever is the explanation.
I'd like to see more on this with a much bigger study.
This is incredibly valuable (and even groundbreaking) work. Well done for doing it, and for writing it up so clearly and informatively!