

I also feel sad that your comments feel slightly condescending or uncharitable, which makes it difficult for me to have a productive conversation.

I'm really sorry to come off that way, James. Please know it's not my intention, but duly noted, and I'll try to do better in the future.

  1. Got it; that's helpful to know, and thank you for taking the time to explain!

  2. SDB is generally hard to test for post hoc, which is why it's so important to design studies to avoid it. As the surveys suggest, people who don't support protests may still report supporting climate action; so, for example, the responses about support for climate action could be biased upwards by the social desirability of climate action even among respondents who don't support protests. Regardless, I don't claim to know for certain that these estimates are biased upwards (or downwards, for that matter, in which case maybe the study is a false negative!). Instead, I'd argue the design itself is susceptible to social desirability and other biases. It's difficult, if not impossible, to sort out how those biases affected the result, which is why I don't find this study very informative. If you think the results weren't likely biased, I'm curious why you chose to down-weight it?

  3. Understood; thank you for taking the time to clarify here. I agree this would be quite dubious. I don't mean to be uncharitable in my interpretation: unfortunately, dubious research is the norm, and I've seen errors like this in the literature regularly. I'm glad they didn't occur here!

  4. Great, this makes sense and seems like standard practice. My misunderstanding arose from an error in the labeling of the tables: Uncertainty level 1 is labeled "highly uncertain," but this is not the case for all values in that range. For example, suppose you were 1% confident that protests led to a large change. Contrary to the label, we would be quite certain protests did not lead to a large change. 20% confidence would make sense to label as highly uncertain, as it reflects a uniform distribution of confidence across the five effect size bins. But confidences below that in fact reflect increasing certainty about the negation of the claim. I'd suggest using traditional confidence intervals here instead, as they're more familiar and standard, e.g.: we believe the average effect of protests on voting behavior lies in the interval [1, 8] percentage points with 90% confidence, or [3, 6] pp with 80% confidence.

    Further adding to my confusion, the phrase "which can also be interpreted as 0-100% confidence intervals" doesn't reflect the standard usage of the term "confidence interval."
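To make the arithmetic in 4 concrete, here's a minimal sketch (the five-bin uniform framing is my reading of the scale, not something spelled out in the report):

```python
# With five effect-size bins, a maximally uncertain (uniform) assignment
# puts 1/5 = 20% on each bin. Dropping below 20% on one bin is not "more
# uncertain" -- it is growing confidence that the bin's claim is false.
N_BINS = 5
uniform_mass = 1 / N_BINS           # 0.20: genuinely "highly uncertain"

def confidence_in_negation(p_claim: float) -> float:
    """Probability mass on 'the claim is false' implied by P(claim)."""
    return 1.0 - p_claim

p_large_change = 0.01               # the 1% example from the text
certainty_no_large_change = confidence_in_negation(p_large_change)  # 0.99
```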

The reasons why I don't find these critiques as highlighting significant methodological flaws is that:

Sorry, I think this was a miscommunication in our comments. I was referring to "Issues you raise are largely not severe nor methodological," which gave me the impression you didn't think the issues were related to the research methods. I understand your position here better.

Anyway, I'll edit my top-level comment to reflect some of this new information; this generally updates me toward thinking this research may be more informative. I appreciate your taking the time to engage so thoroughly, and apologies again for giving an impression of anything less than the kindness and grace we should all aspire to.

Thank you for your responses and engagement. Overall, it seems like we agree 1 and 2 are problems; we still disagree about 3; and I don't think my point on 4 came across, and your explanation raises more issues in my mind. While I think these four issues are themselves substantive, I worry they are the tip of an iceberg, as 1 and 2 are, in my opinion, relatively basic issues. I appreciate your offer to pay for further critique; I hope someone is able to take you up on it.

  1. Great, I think we agree the approach outlined in the original report should be changed. Did the report actually use the percentage of total papers found? I don't mean to be pedantic, but it's germane to my greater point: was this really a miscommunication of the intended analysis, or did the report originally intend to use the number of papers found, as it seems to state and then execute on: "Confidence ratings are based on the number of methodologically robust (according to the two reviewers) studies supporting the claim. Low = 0-2 studies supporting, or mixed evidence; Medium = 3-6 studies supporting; Strong = 7+ studies supporting."

  2. It seems like we largely agree in not putting much weight on this study. However, I don't think comparison against a baseline measurement mitigates the bias concerns much. For example, exposure to the protests is a strong signal of social desirability: it's a chunk of society demonstrating to draw attention to the desirability of action on climate change. This exposure is present in the "after" measurement and absent in the "before" measurement, thus differential and potentially biasing the estimates. Such bias could be hiding a backlash effect.

  3. The issue lies in defining "unusually influential protest movements". This is crucial because you're selecting on your outcome measurement, which is generally discouraged. The most cynical interpretation would be that you excluded all studies that didn't find an effect because, by definition, these weren't very influential protest movements.

  4. Unfortunately, this is not a semantic critique. Call it what you will, but I don't know what the confidences/uncertainties you are putting forward mean, and your readers would be wrong to assume they follow standard usage. I didn't read the entire OpenPhil report, but I didn't see any examples of using low percentages to indicate high uncertainty. Can you explain concretely what your numbers mean?

    My best guess is this is a misinterpretation of the "90%" in a "90% confidence interval". For example, maybe you're interpreting a 90% CI from [2,4] to indicate we are highly confident the effect ranges from 2 to 4, while a 10% CI from [2.9, 3.1] would indicate we have very little confidence in the effect? This is incorrect as CIs can be constructed at any level of confidence regardless of the size of effect, from null to very large, or the variance in the effect.

    Thank you for pointing to this additional information re your definition of variance; I hadn't seen it. Unfortunately, it illustrates my point that these superficial methodological issues are likely just the tip of the iceberg. The definition you provide offers two radically different options for the bound of the range you're describing: randomly selected or median protest. Which one is it? If it's randomly selected, what prevents randomly selecting the most effective protests, in which case the range would be zero? Etc.
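On the CI point, a quick stdlib sketch showing that intervals can be constructed at any confidence level from the same estimate (the mean and standard error here are made up to mirror the [2, 4] example; the level controls coverage and width, not our belief about effect size):

```python
from statistics import NormalDist

def normal_ci(mean: float, se: float, level: float) -> tuple[float, float]:
    """Two-sided normal-approximation CI at the given confidence level."""
    z = NormalDist().inv_cdf((1 + level) / 2)
    return mean - z * se, mean + z * se

mean, se = 3.0, 0.61                     # hypothetical estimate and SE
lo90, hi90 = normal_ci(mean, se, 0.90)   # roughly [2.0, 4.0]
lo10, hi10 = normal_ci(mean, se, 0.10)   # far narrower, same data
# The 10% interval is narrower but expresses *less* coverage, not more
# certainty about the effect; both come from the identical estimate.
```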

Lastly, I have to ask: in what regard are these critiques not methodological? The selection of outcome measures in a review, survey design, the construction of a research question, and the approach to communicating uncertainty all seem methodological—at least, these are topics commonly covered in research methods courses and textbooks.

Updated view

Thank you to James for clarifying some of the points below. 1, 3 and 4 all result from miscommunications in the report that on clarification don't reflect the authors' intentions. I think 2 continues to be relevant, and we disagree here.

I've updated towards putting somewhat more credence in parts of the work, although I have other concerns beyond the most glaring ones I flagged here. I'm hesitant to get into them here since this comment has a long contextual thread; perhaps I'll write another comment. I do want to represent my overall view, which is that I think the conclusions are overconfident given the methods.

Original comment

I share other commenters' enthusiasm for this research area. I also laud the multi-method approach and the attempts to quantify uncertainty.

However, I unfortunately think severe methodological issues are pervasive. Accordingly, I did not update based on this work.

I'll list 4 important examples here:

  1. In the literature review, strength of evidence was evaluated based on the number of studies supporting a particular conclusion. This metric is fundamentally flawed, as it will find support for any conclusion given a sufficiently high number of studies. For example, suppose you ran 1000 studies of an ineffective intervention. At the standard false-positive threshold of p < 0.05, we would expect about 50 studies with significant results. By the standards used in the review, this would be strong evidence the intervention was effective, despite it being entirely ineffective by assumption.

  2. The work surveying the UK public before and after major protests is extremely susceptible to social desirability, agreement and observer biases. The design of the surveys and questions exacerbates these biases and the measured effect might entirely reflect these biases. Comparison to a control group does not mitigate these issues as exposure to the intervention likely results in differential bias between the two groups.

  3. The research question is not clearly specified. The authors indicate they're not interested in the effects of a median protest, but do not clarify what sort of protest exactly they are interested in. By extension, it's not clear how the inclusion criteria for the review reflects the research question.

  4. In this summary, uncertainty is expressed from 0-100%, with 0% indicating very high uncertainty. This does not correspond with the usual interpretation of uncertainty as a probability, where 0% would represent complete confidence the proposition was false, 50% total uncertainty and 100% complete confidence the proposition was true. Similarly, Table 4 neither corresponds to the common statistical meaning of variance nor defines a clear alternative.
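The arithmetic in point 1 can be checked with a quick simulation, treating each null study as independently crossing p < 0.05 with probability 0.05 (which holds when p-values are uniform under the null):

```python
import random

random.seed(1)
N_STUDIES = 1000
ALPHA = 0.05

# Under the null hypothesis, each study's p-value is uniform on [0, 1],
# so the chance of a "significant" result is exactly ALPHA.
false_positives = sum(random.random() < ALPHA for _ in range(N_STUDIES))
# Expect roughly 50 spuriously "supporting" studies -- easily clearing
# a "7+ studies = strong evidence" bar with a truly null intervention.
```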

These critiques are based on about three hours spent engaging with various parts of this research. My assessment was far from comprehensive, in part because these issues dissuaded me from investigating further. Given the severity of these issues, I'd expect to find many more methodological problems on further inspection. I apologize in advance if these issues are addressed elsewhere, and for providing such directly negative feedback.

I think there are two additional sources on corporate animal welfare campaigns worth mentioning here; neither covers all the topics you outline under tractability, but I think they do fill in some of the blanks:

I've long wanted to see a textbook on advocating for animals raised for food. Given the contents of Chapters 1, 4 & 6, I could see this project being transformed into such a book. Currently the academic literature relevant to the many facets of animal advocacy is quite scattered and would benefit from careful synthesis. Some of this synthesis we're working on at Rethink—it would be great to see it in book form! And I think such a book would be a very useful introductory text.

I don't find the case against bivalve sentience that strong, especially for the number of animals potentially involved and the diversity of the 10k bivalve species. (For example, scallops are motile and have hundreds of image-forming eyes—it'd be surprising to me if pain wasn't useful to such a lifestyle!)

I agree, pricing in impact seems reasonable. But do you think this is currently happening? If so, by what mechanism? I think the discrepancies between Redwood and ACE salaries are much more likely explained by norms at the respective orgs and funding constraints rather than some explicit pricing of impact.

Thanks for these Peter! (Note that Peter and I both work at Rethink Priorities.)

Do you think your study is sufficiently well powered to detect very small effect sizes on meat consumption?

No, and this is by design as you point out. We did try to recruit a population that may be more predisposed to change in Study 3 and looked at even more predisposed subgroups.

substantially larger than the effects we usually find for animal interventions even on more moveable things

I think we were informed by the results of our meta-analysis, which generally found effects around this size for meat reduction interventions.

Their null result on effect on meat consumption was not at all tightly bounded: −0.3 oz [−6.12 oz, +5.46 oz]

Obviously, this is ultimately subjective, but this corresponds to plus or minus a burger per week, which seems reasonably precise to me. The standardized CI is [−0.17, 0.15], so bounded below a 'small effect'. And, as David points out, less stringent CIs would look even better. But to be clear, I don't have a substantive disagreement here—just a matter of interpretation.

For even more power, we could combine studies 1 & 3 in a meta-analysis (doubling the sample size). Study 3 found a treatment effect of −1.72 oz/week (95% CI: [−8.84, 5.41]), so the meta-analytic estimate would probably be very small but still in the correct direction, with tighter bounds of course.
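For the curious, a rough sketch of what that pooled estimate might look like, using fixed-effect inverse-variance weighting and assuming the two reported 95% CIs are symmetric normal intervals (these are the numbers quoted in this thread, not an official analysis):

```python
import math

def se_from_ci95(lo: float, hi: float) -> float:
    """Back out a standard error from a symmetric 95% CI."""
    return (hi - lo) / (2 * 1.96)

studies = [  # (point estimate in oz/week, 95% CI) as quoted above
    (-0.30, (-6.12, 5.46)),   # Study 1
    (-1.72, (-8.84, 5.41)),   # Study 3
]

weights = [1 / se_from_ci95(lo, hi) ** 2 for _, (lo, hi) in studies]
pooled = sum(w * est for w, (est, _) in zip(weights, studies)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
# A small negative pooled effect with a tighter CI than either study alone.
```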

explained just by the fact that you could find effects on the moveable attitudes

Just to clarify, we measured attitudes in all 3 studies. We found an effect on intentions in Study 2 where there wasn't blinding and follow-up was immediate. Studies 3 & 4 (likely) didn't find effects on attitudes.

I'd be curious to estimate what effect size would we be looking at if say 3-5% of people stopped eating meat (an optimistic estimate IMO).

Just roughly taking David Reinstein's number of 80 oz per week (could use our control group's mean for a better estimate) and assuming no other changes, 1% abstention would give a 0.8 oz effect size and 5% a 4 oz one. So definitely under-powered at the low end, but potentially closer to detectable at the high end. (And keeping in mind this is at 12-day follow-up; we should expect that 1% to dwindle further at longer follow-up. With figures this low I would be pessimistic about the overall impact. But keep in mind other successful meat reduction interventions don't seem to have worked mostly through a few individuals totally abstaining!)
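That back-of-envelope can be written out explicitly (80 oz/week is the figure from the thread; the abstention rates are illustrative):

```python
BASELINE_OZ_PER_WEEK = 80.0  # rough population mean quoted in the thread

def abstention_effect(abstain_rate: float) -> float:
    """Expected mean reduction in oz/week if a fraction of respondents
    abstains entirely, assuming everyone else's consumption is unchanged."""
    return BASELINE_OZ_PER_WEEK * abstain_rate

# 1% abstention -> 0.8 oz/week; 5% -> 4.0 oz/week
effects = {rate: abstention_effect(rate) for rate in (0.01, 0.03, 0.05)}
```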

corresponds to what a t-test is assessing

I wouldn't expect issues in testing the difference in means given our sample sizes. But otherwise I'm not sure what you're suggesting here.

Yes, we did, and found no meaningful increases in interest in animal activism, including voting intentions. Full questions are available in the supplementary materials.

Thank you for taking the time to engage, much appreciated! Forgive my responding quickly, and feel free to ask for clarification if I miss anything:

  • Definitely, could be different results with different docs. But ours showed a much stronger effect than the average of similar interventions we found in a previous meta-analysis, suggesting Good for Us is pretty good. It is probably better than Cowspiracy on changing intentions, with longer studies of excerpts of Cowspiracy also finding no effect.
  • Agree, especially with your sub-point. We also tried to recruit populations more likely to be affected in Study 3. Also, see sources in my previous point.
  • Maybe but doesn't seem likely since there wasn't change in importance of animal welfare or other measures of attitudes. I would generally expect effects to decay over time rather than get stronger; our meta-analysis (weakly) supports this hypothesis in that longer time points showed smaller effects. Usefulness of a 2-3 month time point would mostly depend on attrition in my opinion.
  • I would vote other interventions. Classroom education in colleges and universities seems good as does increasing the availability of plant-based options in food service and restaurants.

+1 as well. I would emphasize that the number of animals alive at any given time is significantly more important than slaughter counts, as many animals die prior to slaughter.
