The (un)reliability of moral judgments: A survey and systematic(ish) review

cole_haus

74 min read · Nov 1, 2019

Comments 8

MichaelPlant

A couple of very general suggestions to aid the reader - I've only read the summary. Given the length of the post, could you add a line or two to your summary to say what conclusion you're arguing for? Reading the summary, I get what the topic is, but not what your take is. It would also be good if you could orientate the reader as to where this fits in the literature, e.g. what the consensus in the field is and whether you are agreeing with it.

cole_haus

I'm mostly not trying to argue for any particular conclusion--more trying to summarize and relay the existing work. I was deliberately trying to avoid emphasising my idiosyncratic take because I didn't want readers to have to separate personal speculation from reportage. (I would have thought the "survey and systematic(ish) review" in the title helps to set that expectation. Are those terms more ambiguous than I understand them to be?)

As far as consensus in the literature, there doesn't seem to be much of one. I think consensus is/will be especially hard because of the variety of researchers involved--philosophers, psychologists, etc. You can see the lack of consensus reflected in the wide variety of angles in "Indirect evidence" and "Responses".

Does that all make sense?

MichaelPlant

Okay, that makes more sense. You could have a systematic review which unambiguously pointed to one conclusion, so perhaps you should add something like what you've already said, i.e. that you're just trying to report the findings without drawing an overall conclusion (although I don't know why someone would avoid drawing an overall conclusion if they thought there was one). And again, it would be helpful to add that there doesn't seem to be a consensus on this point (and possibly that it 'falls between the gaps' of various disciplines).

cole_haus

Okay, thanks. I added a section to the summary:

Meta: This post attempts to summarize the interdisciplinary work on the (un)reliability of moral judgements. As that work contains many different perspectives with no grand synthesis and no clear winner (at present), this post is unable to offer a single, neat conclusion to take away. Instead, this post is worth reading if the (un)reliability of moral judgements seems important to you and you'd like to understand what the current state of investigation is.

David_Moss

Many thanks for completing this thorough review.

A few fairly general comments:

1. I find the higher level evidence that suggests our moral judgments would tend to be unreliable more persuasive than the many individual examples of judgments apparently being influenced by morally irrelevant factors. By higher level evidence I mean the broadly evolutionary arguments about the adaptive function of moral thinking. Of course, such evolutionary debunking arguments are a topic of ongoing debate (Millhouse, Bush & Moss, 2016).
2. One reason I find the evidence offered by lots of specific instances of apparent influence by morally irrelevant factors less persuasive is that there's reason to expect the literature to be systematically biased towards producing and reporting results showing such influences. Researchers in this area are on the whole collectively trying to generate results showing weird factors influencing moral judgement since these are publishable, whereas results showing moral judgement responding as we'd expect to relevant factors would generally not be (arguably the raison d'être of social psychology is finding strong counter-intuitive influences of social/contextual factors on human action). Even setting aside concerns about the validity of these published results, I would therefore expect this collected direct evidence to give an impression of pervasive rationally irrelevant influences on moral judgement even if our judgments were generally highly reliable.
3. I think there are some good reasons to think that the ecological validity challenge to these experimental results, which you mention, is pretty strong. Related to the Gigerenzer ecological rationality strategy which you mention, one might think that some of the apparent irrational biases found in the experimental literature are a result of people's judgement being highly sensitive to pragmatic factors which would be of relevance in practical contexts, but which are treated as irrational in the context of the experiment. For example, the famous Knobe Effect (showing that moral judgments irrationally influence whether we judge that someone intended to do something or not) seems entirely explicable in terms of the pragmatics (in real world contexts) of saying x intended a good/bad thing (Adams and Steadman, 2004).
4. That said, despite my scepticism of these experimental results for establishing that there is pervasive bias in moral judgements (which I think is independently extremely plausible), I do think that more empirical psychological research into EA-relevant judgments would be likely to be high value since, done well, it can highlight potential errors and biases which we would otherwise be unaware of (as with more general heuristics and bias research).

cole_haus

Yup, I definitely agree with points 1, 2, and 4.

I'd have to think and read about this more to move it beyond pure speculation, but I feel a little less sympathetic to the ecological validity response in the domain of moral judgments. It seems plausible to me that the problems that confront our moral faculties today are very dissimilar to those in the environment of evolutionary adaptedness--even more so than with faculties of prudential rationality--and/or that our moral faculties transfer less well to radically new environments.

cole_haus

I guess I'll also use the comments to offer my hot take which I don't think I can immediately justify in a fully explicit manner:

Of course our moral judgements are unreliable (though not in some of the ways investigated in the literature and almost certainly in additional ways that aren't investigated in the literature). There are some moral judgements which are stable but some which aren't and even the smaller set of unstable judgements is very concerning--especially for EAs and others trying to maximize (cf. optimizer's curse) and taking unconventional views. I don't think the expertise defense or most of the other responses are particularly successful. I think the "moral engineering" approach is essential. We should be building ethical systems that explicitly acknowledge and account for the noise in our intuitions. For example, something like rule utilitarianism seems substantially more appealing to me now than it did before I looked into this area; the regularization imposed by rule utilitarianism limits our ability to fit to noise (cf. bias-variance trade-off).
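
To make the optimizer's curse point concrete, here is a minimal sketch (not taken from the post or the literature it surveys). It assumes every option has the same true value and that our judgements are that value plus Gaussian noise. The function name, the parameters, and the crude "shrink toward the prior" step are all hypothetical choices for illustration. Shrinking every estimate by the same factor does not change which option gets picked; it only reduces how far the winning estimate overshoots the truth.

```python
import random
import statistics


def simulate_optimizers_curse(n_options=10, noise_sd=1.0, n_trials=5000,
                              shrink=0.0, seed=0):
    """Average (estimate - true value) for the option picked as best.

    All options share the same true value (0.0), so any positive gap is
    purely the result of selecting on noise. `shrink` in [0, 1] pulls each
    noisy estimate toward the prior mean of 0.0 before picking, a crude
    stand-in for regularization (an assumption of this sketch).
    """
    rng = random.Random(seed)
    true_value = 0.0
    gaps = []
    for _ in range(n_trials):
        estimates = [
            (1.0 - shrink) * rng.gauss(true_value, noise_sd)
            for _ in range(n_options)
        ]
        # The maximizer takes the highest estimate at face value.
        gaps.append(max(estimates) - true_value)
    return statistics.mean(gaps)


naive = simulate_optimizers_curse(shrink=0.0)
regularized = simulate_optimizers_curse(shrink=0.5)
print(f"average overestimate, naive maximizer: {naive:.2f}")
print(f"average overestimate, shrunk estimates: {regularized:.2f}")
```

With ten equally good options, the naive maximizer's chosen estimate should overshoot by well over one noise standard deviation on average, even though no option is actually better than any other.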

ishi

My brief take (for what it's worth---I can imagine it's better not to give a 'rapid response' as opposed to a well thought out one): That seems to me to be a 'tour de force' even though I mostly skimmed it, and skipped some parts--it's the kind of thing I would print out if I had a working printer. I am only slightly familiar with psychological literature and measures (e.g. 'Cohen's d') though I read (or glance at) some of it, and am often skeptical of the results claimed to be found. (People often do something like a poll, or give some sort of test, without looking at things like context, wording, ordering of questions, etc.--but these lead to a lot of publications.)

The 'procedure section' for me was the first clue that this was going to be a well thought out discussion.

The various tables on different studies I couldn't really understand but from the discussion I felt I got the drift.

When the 'drift-diffusion equation' and 'biased random walk' appeared, I felt like I was walking on firm ground again (even if it was what is called in biology or complexity theory a 'rugged fitness landscape'--it's firm ground, not a swamp, just rugged and complex, like climbing a mountain).

The discussion of culture and socioeconomic status, personality, genes, and social embeddedness seemed 'spot on'---especially because the work of Boyd and Richerson was cited. (Although I cannot claim to be an expert, I view their books and papers to be basically the current theory of evolution--they are to Darwin what Einstein was to Newton, though B&R's gene-culture evolution theory might be more analogous to Einstein's special relativity---a slight modification of Newtonian dynamics---than to General Relativity, which has a much more intricate math apparatus and wide applicability.) B&R were preceded by qualitative discussions by several people, and a mathematical one by E O Wilson and C J Lumsden ('Genes, Mind, and Culture') which used nonlinear diffusion equations---but EOW and CJL seemed to later agree that while the math in the book was correct (it came from statistical physics), their interpretation was not. B&R's, I think, is the standard or (closer to) correct one (though in what I read they used discrete dynamical systems and evolutionary game theory rather than differential equations).

I am familiar with some of the references (Tooby and Cosmides, Plomin, Henrich, and more).

Also some of the literature on 'universal moral grammar' (papers by John Mikhail, Marc Hauser, etc.) and Chomskyan linguistics and 'poverty of the stimulus'. I agree with the connectionists that while Chomsky and people like S Pinker are correct that humans are not 'blank slates', both of them (along with evolutionary psychologists like Cosmides and Tooby, and J Fodor) go too far in proposing that there is a 'language organ', 'instinct', or 'module'; nor is there one for morality. Babies are not smartphones which have 'apps' such as a dictionary, calculator, political platform, religious text, 10 commandments, '12 rules for living', or even 'universal grammar' genetically coded in them as part of their 'god given' hardware. Boyd and Richerson, I think, have a better take on what people are born with. (And some more recent work sort of adds some of what C Geertz (anthropology) called 'thick detail' about 'social embeddedness', i.e. people aren't born with video games in their heads.)

There are many papers critiquing Chomsky's 'poverty of the stimulus' argument (e.g. https://arxiv.org/abs/cs/0212024 ). None of these, however, contradict the conclusions in the above paper about the unreliability of moral judgements. You don't need a plane built with inherent design flaws to crash---planes with no such flaws can crash anyway due to human error, maintenance problems, or the weather.

The first, simplest sort of unreliability can be subsumed in this framework by considering the time of evaluation as a morally irrelevant factor. ↩︎
This was originally written in more innocent times before the post had sprawled to more than 12,000 words. ↩︎
This heterogeneity is also why I don't compute a final, summary measure of the effect size. ↩︎
There are some studies examining only disgust and some examining only cleanliness, but I've grouped the two here since these manipulations are conceptually related and many authors have examined both. ↩︎
There are quite a few cross-cultural studies of things like the ultimatum game (Henrich et al. 2001). I excluded those because they are not purely moral—the ultimatum-giver is also trying to predict the behavior of the ultimatum-recipient. ↩︎
Yes, not all results in works like Thinking Fast and Slow have held up and some of the results are in areas prone to replication issues. It still seems unlikely that all such results will be swept away and we'll be left to conclude that humans were perfectly rational all along. ↩︎
We can also phrase this as follows: Some of our moral intuitions are the result of model-free reinforcement learning (Crockett 2013). In the absence of a model specifying action-outcome links, these moral intuitions are necessarily retrospective. Framed in this ML way, the concern is that our moral intuitions are not robust to distributional shift (Amodei et al. 2016). ↩︎
Aside: There is some amazing academic trash talk in chapter 2 of (Sinnott-Armstrong and Miller 2008). Just utter contempt dripping from every paragraph on both sides (Jerry Fodor versus Tooby and Cosmides). For example, "Those familiar with Fodor's writing know that he usually resurrects his grandmother when he wants his intuition to do the work that a good computational theory should." ↩︎
The separation between culture and genes is particularly unclear when looking at norms and moral judgment since both culture and genes are plausibly working to solve (at least some of) the same problems of social cooperation. One synthesis is to suppose that certain faculties eventually evolved to facilitate some culturally-originated norms. ↩︎
I will add one complaint that applies to pretty much all of the studies: they treat categorical scale data (e.g. responses on a Likert scale) as ratio scale. But this sort of thing seems rampant so isn't a mark of exceptional unreliability in this corner of the literature. ↩︎
There's also the slightly subtler claim that expertise does not purify moral intuitions and judgments, but that it helps philosophers understand and accommodate their cognitive flaws (Alexander 2016). We'll not explicitly examine this claim any further here. ↩︎
There is even reason to believe that reflection is sometimes harmful (Kornblith 2010) (Weinberg and Alexander 2014). ↩︎
There's also the interesting but somewhat less relevant work of Schwitzgebel and Rust (Schwitzgebel and Rust 2016) in which they repeatedly find that ethicists do not behave more morally (according to their metrics) than non-ethicists. ↩︎
Gigerenzer explains this surprising result by appealing to the bias-variance tradeoff—complicated strategies over-fit to the data they happen to see and fail to generalize. Another explanation is that heuristics represent an infinitely strong prior and that the "ideal" procedures Gigerenzer tested against represent an uninformative prior (Parpart, Jones, and Love 2018). ↩︎

Study	Independent variable	Dependent variable	Sample size	Result	Effect size
[@petrinovich1996influence], study 2, form 1	Ordering of inaction vs action	Scale of agreement	30 vs 29	$F (1, 57) = 0.37$ ; $p > 0.10$	$η^{2} = 0.0064$
[@petrinovich1996influence], study 2, form 2	Ordering of inaction vs action	Scale of agreement	30 vs 29	$F (1, 57) = 5.07$ ; $p < 0.02$	$η^{2} = 0.080$
[@haidt1996social], mazda	Ordering of act vs omission	Rating act worse	45.5 vs 45.5[^estimate]	$χ^{2} = 7.32$ ; $p < 0.01$	$η^{2} = 0.080$
[@haidt1996social], crane	Ordering of act vs omission	Rating act worse	34.5 vs 34.5	$χ^{2} = 0.50$ ; $p = 0.4795$	$η^{2} = 0.007$
[@haidt1996social], mazda	Ordering of social roles	Rating friend worse	45.5 vs 45.5	$χ^{2} = 3.25$ ; $p < 0.05$	$η^{2} = 0.036$
[@haidt1996social], crane	Ordering of social roles	Rating foreman worse	34.5 vs 34.5	$χ^{2} = 3.91$ ; $p < 0.05$	$η^{2} = 0.042$
[@lanteri2008experimental]	Ordering of vignettes	Obligatory or not	31 vs 31	$χ^{2} (1, 62) = 15.17$ ; $p = 0.000098$	$η^{2} = 0.24$
[@lanteri2008experimental]	Ordering of vignettes	Acceptable or not	31 vs 31	$χ^{2} (1, 62) = 10.63$ ; $p = 0.0011$	$η^{2} = 0.17$
[@lombrozo2009role]	Ordering of trolley switch vs push	Rating of permissibility	56 vs 56	$t (110) = 3.30$ ; $p < 0.01$	$η^{2} = 0.090$
[@zamzow2009variations]	Ordering of vignettes	Right or wrong	8 vs 9	$χ^{2} (1, 17) = 2.837$ ; $p = 0.09$	$η^{2} = 0.17$
[@wright2010intuitional], study 2	Ordering of vignettes	Right or wrong	30 vs 30	$χ^{2} (1, 60) = 3.2$ ; $p = 0.073$	$η^{2} = 0.053$
[@schwitzgebel2012expertise], philosophers	Within-pair vignette orderings	Number of pairs judged equivalent	324	$r = 0.29$ ; $p < 0.001$	$η^{2} = 0.084$
[@schwitzgebel2012expertise], academic non-philosophers	Within-pair vignette orderings	Number of pairs judged equivalent	753	$r = 0.19$ ; $p < 0.001$	$η^{2} = 0.036$
[@schwitzgebel2012expertise], non-academics	Within-pair vignette orderings	Number of pairs judged equivalent	1389	$r = 0.21$ ; $p < 0.001$	$η^{2} = 0.044$
[@liao2012putting]	Ordering of vignettes	Rating of permissibility	48.3 vs 48.3 vs 48.3	$F (1, 130) = 4.85$ ; $p < 0.029$	$η^{2} = 0.036$
[@wiegmann2012order]	Most vs least agreeable first	Rating of shouldness	25 vs 25	$F (1, 48) = 8.03$ ; $p < 0.01$	$η^{2} = 0.14$
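
The effect sizes in the table above look consistent with the usual textbook conversions from a reported test statistic to a proportion of variance explained. The post does not say which formulas were actually used, so the sketch below is an assumption, and the function names are hypothetical. It converts an F statistic, a t statistic, a one-degree-of-freedom chi-squared statistic, and a correlation coefficient, using one row of each kind from the table as input.

```python
def eta_squared_from_f(f_stat, df_effect, df_error):
    """Eta squared from an F statistic: F*df1 / (F*df1 + df2)."""
    return (f_stat * df_effect) / (f_stat * df_effect + df_error)


def eta_squared_from_t(t_stat, df):
    """Eta squared from a t statistic: t^2 / (t^2 + df)."""
    return t_stat ** 2 / (t_stat ** 2 + df)


def phi_squared_from_chi2(chi2, n):
    """Phi squared for a 1-df chi-squared test with n observations: chi2 / n."""
    return chi2 / n


def r_squared(r):
    """Proportion of variance explained by a correlation coefficient."""
    return r * r


# Inputs are copied from rows of the table above.
print(round(eta_squared_from_f(0.37, 1, 57), 4))   # 0.0064
print(round(eta_squared_from_t(3.30, 110), 3))     # 0.09
print(round(phi_squared_from_chi2(15.17, 62), 2))  # 0.24
print(round(r_squared(0.29), 3))                   # 0.084
```

Putting every row on the same scale is what allows rows reported with different tests to be compared at a glance, although some rows match these conversions only to within rounding.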

Study	Independent variable	Dependent variable	Sample size	Result	Effect size
[@petrinovich1993empirical], general class	Wording of vignettes	Scale of agreement	361	$F (1, 359) = 296.51$ ; $p < 0.000001$	$η_{p}^{2} = 0.45$
[@petrinovich1993empirical], biomeds	Wording of vignettes	Scale of agreement	60	$F (1, 57) = 18.07$ ; $p = 0.000080$	$η_{p}^{2} = 0.24$

Study	Independent variable	Dependent variable	Sample size	Result	Effect size
[@wheatley2005hypnotic], experiment 1	Hypnotic disgust cue	Scale of wrongness	45	$t (44) = 2.41$ ; $p < 0.05$	$η^{2} = 0.12$
[@wheatley2005hypnotic], experiment 2	Hypnotic disgust cue	Scale of wrongness	63	$t (62) = 1.74$ ; $p < 0.05$	$η^{2} = 0.073$
[@schnall2008clean], experiment 1	Clean word scramble	Scale of wrongness	20 vs 20	$F (1, 38) = 3.63$ ; $p = 0.064$	$η^{2} = 0.09$
[@schnall2008clean], experiment 2	Disgusting movie clip	Scale of wrongness	22 vs 22	$F (1, 41) = 7.81$ ; $p = 0.0079$	$η^{2} = 0.16$
[@schnall2008disgust], experiment 1	Fart spray	Likert scale	42.3 vs 42.3 vs 42.3	$F (2, 117) = 7.43$ ; $p < 0.001$	$η^{2} = 0.11$
[@schnall2008disgust], experiment 2	Disgusting room	Scale of appropriacy	22.5 vs 22.5	Not significant
[@schnall2008disgust], experiment 3	Describe disgusting memory	Scale of appropriacy	33.5 vs 33.5	Not significant
[@schnall2008disgust], experiment 4	Disgusting vs sad vs neutral movie clip	Scale of appropriacy	43.3 vs 43.3 vs 43.3	$F (1, 104) = 4.11$ ; $p < 0.05$	$η^{2} = 0.038$
[@horberg2009disgust], study 2	Disgusting vs sad movie clip	Scale of rightness and wrongness	59 vs 63	$F (1, 115) = 4.51$ ; $p < 0.01$	$η^{2} = 0.038$
[@liljenquist2010smell], experiment 1	Clean scent in room	Money returned	14 vs 14	$t (26) = 2.64$ ; $p = 0.01$	$η^{2} = 0.21$
[@liljenquist2010smell], experiment 2	Clean scent in room	Scale of volunteering interesting	49.5 vs 49.5	$t (97) = 2.33$ ; $p = 0.02$	$η^{2} = 0.052$
[@liljenquist2010smell], experiment 2	Clean scent in room	Willingness to donate	49.5 vs 49.5	$χ^{2} (1, 99) = 4.78$ ; $p = 0.03$	$η^{2} = 0.048$
[@zhong2010clean], experiment 1	Antiseptic wipe for hands	Scale of immoral to moral	29 vs 29	$t (56) = 2.10$ ; $p = 0.04$	$η^{2} = 0.073$
[@zhong2010clean], experiment 2	Visualize clean vs dirty and nothing	Scale of immoral to moral	107.6 vs 107.6 vs 107.6	$t (320) = 2.02$ ; $p = 0.045$	$η^{2} = 0.013$
[@zhong2010clean], experiment 2	Visualize dirty vs nothing	Scale of immoral to moral	107.6 vs 107.6 vs 107.6	$t (320) = 0.42$ ; $p = 0.675$	$η^{2} = 0.00055$
[@zhong2010clean], experiment 3	Visualize clean vs dirty	Scale of immoral to moral	68 vs 68	$t (134) = 2.13$ ; $p = 0.04$	$η^{2} = 0.033$
[@eskine2011bad]	Sweet, bitter or neutral drink	Scale of wrongness	18 vs 15 vs 21	$F (2, 51) = 7.368$ ; $p = 0.002$	$η^{2} = 0.224$
[@david2011effect]	Presence of disgust-conditioned word	Scale of wrongness	61	$t (60) = 0.62$ ; Not significant	$η^{2} = 0.0064$
[@tobia2013cleanliness], undergrads	Clean scent on survey	Scale of wrongness	84 vs 84	$F (1, 164) = 8.56$ ; $p = 0.004$	$η^{2} = 0.05$
[@tobia2013cleanliness], philosophers	Clean scent on survey	Scale of wrongness	58.5 vs 58.5	Not significant
[@huang2014does], study 1	Clean word scramble	Scale of wrongness	111 vs 103	$t (212) = - 1.22$ ; $p = 0.23$	$η^{2} = 0.0072$
[@huang2014does], study 2	Clean word scramble	Scale of wrongness	211 vs 229	$t (438) = - 0.42$ ; $p = 0.68$	$η^{2} = 0.0040$
[@johnson2014does], experiment 1	Clean word scramble	Scale of wrongness	114.5 vs 114.5	$F (1, 206) = 0.004$ ; $p = 0.95$	$η^{2} = 0.000019$
[@johnson2014does], experiment 2	Washing hands	Scale of wrongness	58 vs 68	$F (1, 124) = 0.001$ ; $p = 0.97$	$η^{2} = 0.0000081$
[@johnson2016effects], study 1	Describe disgusting memory	Scale of wrongness	222 vs 256	$F (1, 474) = 0.04$ ; $p = 0.84$	$η^{2} = 0.000084$
[@johnson2016effects], study 2	Describe disgusting memory	Scale of wrongness	467 vs 467	$F (1, 926) = 0.48$ ; $p = 0.48$	$η^{2} = 0.00052$
[@daubman2014]	Clean word scramble	Scale of wrongness	30 vs 30	$t (58) = 1.84$ ; $p = 0.03$	$η^{2} = 0.054$
[@daubman2013]	Clean word scramble	Scale of wrongness	30 vs 30	$t (58) = - 1.8$ ; $p = 0.04$	$η^{2} = 0.053$
[@johnson2014]	Clean word scramble	Scale of wrongness	365.5 vs 365.5	$F (1, 729) = 0.31$ ; $p = 0.58$	$η^{2} = 0.00043$

Study	Independent variable	Dependent variable	Sample size	Result
[@lombrozo2009role], trolley switch	Gender	Scale of permissibility	74.7 vs 149.3	$t (222) = - 0.10$ , $p = 0.92$
[@lombrozo2009role], trolley push	Gender	Scale of permissibility	74.7 vs 149.3	$t (222) = - 0.69$ , $p = 0.49$
[@seyedsayamdost2015gender], plank of Carneades, MTurk	Gender	Scale of blameworthiness	70 vs 86	$t (154) = - 1.302$ , $p = 0.195$
[@seyedsayamdost2015gender], plank of Carneades, SurveyMonkey	Gender	Scale of blameworthiness	48 vs 50	$t (96) = 0.727$ , $p = 0.469$
[@adleberg2015men], violinist	Gender	Scale from forbidden to obligatory	52 vs 84	$t (134) = - 0.39$ , $p = 0.70$
[@adleberg2015men], magistrate and the mob	Gender	Scale from bad to good	71 vs 87	$t (156) = - 0.28$ , $p = 0.78$
[@adleberg2015men], trolley switch	Gender	Scale of acceptability	52 vs 84	$t (134) = 0.26$ , $p = 0.34$

Study	Independent variable	Dependent variable	Sample size	Result
[@haidt1993affect], adults	Culture	Acceptable or not	90 vs 90	$F (1, 174) = 5.6$ ; $p < 0.01$
[@haidt1993affect], children	Culture	Acceptable or not	90 vs 90	$F (1, 174) = 5.91$ ; $p < 0.01$
[@haidt1993affect], adults	SES	Acceptable or not	90 vs 90	$F (1, 174) = 73.1$ ; $p < 0.001$
[@haidt1993affect], children	SES	Acceptable or not	90 vs 90	$F (1, 174) = 9.00$ ; $p < 0.01$

Study	Independent variable	Dependent variable	Sample size	Result
[@nadelhoffer2008actor], trolley switch, undergrads	Actor vs observer	Morally permissible? Yes or no	43 vs 42	90% permissible in observer condition; 65% permissible in actor condition; $p = 0.029$
[@tobia2013moral], trolley switch, philosophers	Actor vs observer	Morally permissible? Yes or no	24.5 vs 24.5	64% permissible in observer condition; 89% permissible in actor condition; $p < 0.05$
[@tobia2013moral], Jim and the natives, undergrads	Actor vs observer	Morally obligated? Yes or no	20 vs 20	53% obligatory in observer condition; 19% obligatory in actor condition; $p < 0.05$
[@tobia2013moral], Jim and the natives, philosophers	Actor vs observer	Morally obligated? Yes or no	31 vs 31	9% obligatory in the observer condition; 36% obligatory in the actor condition; $p < 0.05$
[@tobia2013cleanliness], undergrads	Actor vs observer	Scale of wrongness	84 vs 84	$F (1, 164) = 15.24$ ; $p < 0.0001$
[@tobia2013cleanliness], philosophers	Actor vs observer	Scale of wrongness	58.5 vs 58.5	Not significant

The (un)reliability of moral judgments: A survey and systematic(ish) review

Summary

Intro

What are moral judgments?

What would it mean for moral judgments to be unreliable?

Why do we care about the alleged unreliability of moral judgments?

Direct (empirical) evidence

Procedure

Order

Wording

Disgust and cleanliness

Gender

Culture and socioeconomic status

Personality

Actor/observer

Summary

Indirect evidence

Heuristics and biases

Neural

Dual process

Genes

Universal moral grammar

Culture

Moral disagreements

Summary

Responses

Internal validity

Expertise

Ecological validity

Sufficient

Ecologically rational

Second-order reliability

Moral engineering

Summary

Conclusion

Appendix: Qualitative discussion of methodology

Order

Wording

Disgust and cleanliness

Gender

Culture and socioeconomic status

Personality

Actor/observer

References