

I also feel sad that your comments feel slightly condescending or uncharitable, which makes it difficult for me to have a productive conversation.

I'm really sorry to come off that way, James. Please know it's not my intention, but duly noted, and I'll try to do better in the future.

  1. Got it; that's helpful to know, and thank you for taking the time to explain!

  2. SDB is generally hard to test for post hoc, which is why it's so important to design studies to avoid it. As the surveys suggest, people who don't support protests may still report supporting climate action; so, for example, the responses about support for climate action could be biased upwards by the social desirability of climate action even among respondents who don't support protests. Regardless, I don't claim to know for certain that these estimates are biased upwards (or downwards, for that matter, in which case maybe the study is a false negative!). Instead, I'd argue the design itself is susceptible to social desirability and other biases. It's difficult, if not impossible, to sort out how those biases affected the result, which is why I don't find this study very informative. If you think the results weren't likely biased, I'm curious why you chose to down-weight it?

  3. Understood; thank you for taking the time to clarify here. I agree this would be quite dubious. I don't mean to be uncharitable in my interpretation: unfortunately, dubious research is the norm, and I've seen errors like this in the literature regularly. I'm glad they didn't occur here!

  4. Great, this makes sense and seems like standard practice. My misunderstanding arose from an error in the labeling of the tables: Uncertainty level 1 is labeled "highly uncertain," but this is not the case for all values in that range. For example, suppose you were 1% confident that protests led to a large change. Contrary to the label, we would be quite certain protests did not lead to a large change. 20% confidence would make sense to label as highly uncertain, as it reflects a uniform distribution of confidence across the five effect size bins. But confidences below that in fact reflect increasing certainty about the negation of the claim. I'd suggest using traditional confidence intervals here instead, as they're more familiar and standard, e.g.: we believe the average effect of protests on voting behavior lies in the interval [1, 8] percentage points with 90% confidence, or [3, 6] pp with 80% confidence.

    Further adding to my confusion, the phrase "which can also be interpreted as 0-100% confidence intervals" doesn't reflect the standard usage of the term "confidence interval."
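To make the arithmetic in 4 concrete, here's a minimal sketch (the five-bin uniform framing is my reading of the scale, not something spelled out in the report):

```python
# With five effect-size bins, a maximally uncertain (uniform) assignment
# puts 1/5 = 20% on each bin. Dropping below 20% on one bin is not "more
# uncertain" -- it is growing confidence that the bin's claim is false.
N_BINS = 5
uniform_mass = 1 / N_BINS           # 0.20: genuinely "highly uncertain"

def confidence_in_negation(p_claim: float) -> float:
    """Probability mass on 'the claim is false' implied by P(claim)."""
    return 1.0 - p_claim

p_large_change = 0.01               # the 1% example from the text
certainty_no_large_change = confidence_in_negation(p_large_change)  # 0.99
```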

The reasons why I don't find these critiques as highlighting significant methodological flaws is that:

Sorry, I think this was a miscommunication in our comments. I was referring to "Issues you raise are largely not severe nor methodological," which gave me the impression you didn't think the issues were related to the research methods. I understand your position here better.

Anyway, I'll edit my top-level comment to reflect some of this new information; this generally updates me toward thinking this research may be more informative. I appreciate your taking the time to engage so thoroughly, and apologies again for giving an impression of anything less than the kindness and grace we should all aspire to.

Thank you for your responses and engagement. Overall, it seems like we agree 1 and 2 are problems; we still disagree about 3; and I don't think my point on 4 came across, and your explanation raises more issues in my mind. While I think these four issues are themselves substantive, I worry they are the tip of an iceberg, as 1 and 2 are, in my opinion, relatively basic issues. I appreciate your offer to pay for further critique; I hope someone is able to take you up on it.

  1. Great, I think we agree the approach outlined in the original report should be changed. Did the report actually use the percentage of total papers found? I don't mean to be pedantic, but it's germane to my greater point: was this really a miscommunication of the intended analysis, or did the report originally intend to use the number of papers found, as it seems to state and then execute on: "Confidence ratings are based on the number of methodologically robust (according to the two reviewers) studies supporting the claim. Low = 0-2 studies supporting, or mixed evidence; Medium = 3-6 studies supporting; Strong = 7+ studies supporting."

  2. It seems like we largely agree in not putting much weight on this study. However, I don't think comparison against a baseline measurement mitigates the bias concerns much. For example, exposure to the protests is a strong signal of social desirability: it's a chunk of society demonstrating to draw attention to the desirability of action on climate change. This exposure is present in the "after" measurement and absent in the "before" measurement, thus differential and potentially biasing the estimates. Such bias could be hiding a backlash effect.

  3. The issue lies in defining "unusually influential protest movements". This is crucial because you're selecting on your outcome measurement, which is generally discouraged. The most cynical interpretation would be that you excluded all studies that didn't find an effect because, by definition, these weren't very influential protest movements.

  4. Unfortunately, this is not a semantic critique. Call it what you will, but I don't know what the confidences/uncertainties you are putting forward mean, and your readers would be wrong to assume they follow standard usage. I didn't read the entire OpenPhil report, but I didn't see any examples of using low percentages to indicate high uncertainty. Can you explain concretely what your numbers mean?

    My best guess is this is a misinterpretation of the "90%" in a "90% confidence interval". For example, maybe you're interpreting a 90% CI from [2,4] to indicate we are highly confident the effect ranges from 2 to 4, while a 10% CI from [2.9, 3.1] would indicate we have very little confidence in the effect? This is incorrect as CIs can be constructed at any level of confidence regardless of the size of effect, from null to very large, or the variance in the effect.

    Thank you for pointing to this additional information re your definition of variance; I hadn't seen it. Unfortunately, it illustrates my point that these superficial methodological issues are likely just the tip of the iceberg. The definition you provide offers two radically different options for the bound of the range you're describing: randomly selected or median protest. Which one is it? If it's randomly selected, what prevents randomly selecting the most effective protests, in which case the range would be zero? Etc.
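On the CI point, a quick stdlib sketch showing that intervals can be constructed at any confidence level from the same estimate (the mean and standard error here are made up to mirror the [2, 4] example; the level controls coverage and width, not our belief about effect size):

```python
from statistics import NormalDist

def normal_ci(mean: float, se: float, level: float) -> tuple[float, float]:
    """Two-sided normal-approximation CI at the given confidence level."""
    z = NormalDist().inv_cdf((1 + level) / 2)
    return mean - z * se, mean + z * se

mean, se = 3.0, 0.61                     # hypothetical estimate and SE
lo90, hi90 = normal_ci(mean, se, 0.90)   # roughly [2.0, 4.0]
lo10, hi10 = normal_ci(mean, se, 0.10)   # far narrower, same data
# The 10% interval is narrower but expresses *less* coverage, not more
# certainty about the effect; both come from the identical estimate.
```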

Lastly, I have to ask: in what regard are these critiques not methodological? The selection of outcome measures in a review, survey design, the construction of a research question, and the approach to communicating uncertainty all seem methodological—at least, these are topics commonly covered in research methods courses and textbooks.

Updated view

Thank you to James for clarifying some of the points below. 1, 3 and 4 all result from miscommunications in the report that on clarification don't reflect the authors' intentions. I think 2 continues to be relevant, and we disagree here.

I've updated towards putting somewhat more credence in parts of the work, although I have other concerns beyond the most glaring ones I flagged here. I'm hesitant to get into them here since this comment has a long contextual thread; perhaps I'll write another comment. I do want to represent my overall view, which is that I think the conclusions are overconfident given the methods.

Original comment

I share other commenters' enthusiasm for this research area. I also laud the multi-method approach and the attempts to quantify uncertainty.

However, I unfortunately think severe methodological issues are pervasive. Accordingly, I did not update based on this work.

I'll list 4 important examples here:

  1. In the literature review, strength of evidence was evaluated based on the number of studies supporting a particular conclusion. This metric is fundamentally flawed, as it will find support for any conclusion given a sufficiently high number of studies. For example, suppose you ran 1000 studies of an ineffective intervention. At the standard false-positive threshold of p < 0.05, we would expect about 50 studies with significant results. By the standards used in the review, this would be strong evidence the intervention was effective, despite it being entirely ineffective by assumption.

  2. The work surveying the UK public before and after major protests is extremely susceptible to social desirability, agreement and observer biases. The design of the surveys and questions exacerbates these biases and the measured effect might entirely reflect these biases. Comparison to a control group does not mitigate these issues as exposure to the intervention likely results in differential bias between the two groups.

  3. The research question is not clearly specified. The authors indicate they're not interested in the effects of a median protest, but do not clarify what sort of protest exactly they are interested in. By extension, it's not clear how the inclusion criteria for the review reflects the research question.

  4. In this summary, uncertainty is expressed from 0-100%, with 0% indicating very high uncertainty. This does not correspond with the usual interpretation of uncertainty as a probability, where 0% would represent complete confidence the proposition was false, 50% total uncertainty and 100% complete confidence the proposition was true. Similarly, Table 4 neither corresponds to the common statistical meaning of variance nor defines a clear alternative.
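The arithmetic in point 1 can be checked with a quick simulation, treating each null study as independently crossing p < 0.05 with probability 0.05 (which holds when p-values are uniform under the null):

```python
import random

random.seed(1)
N_STUDIES = 1000
ALPHA = 0.05

# Under the null hypothesis, each study's p-value is uniform on [0, 1],
# so the chance of a "significant" result is exactly ALPHA.
false_positives = sum(random.random() < ALPHA for _ in range(N_STUDIES))
# Expect roughly 50 spuriously "supporting" studies -- easily clearing
# a "7+ studies = strong evidence" bar with a truly null intervention.
```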

These critiques are based on about three hours spent engaging with various parts of this research. My assessment was far from comprehensive, in part because these issues dissuaded me from investigating further. Given the severity of these issues, I'd expect to find many more methodological problems on further inspection. I apologize in advance if these issues are addressed elsewhere, and for providing such directly negative feedback.

I think there are two additional sources on corporate animal welfare campaigns worth mentioning here; neither covers all the topics you outline under tractability, but I think they do fill in some of the blanks:

I've long wanted to see a textbook on advocating for animals raised for food. Given the contents of Chapters 1, 4 & 6, I could see this project being transformed into such a book. Currently the academic literature relevant to the many facets of animal advocacy is quite scattered and would benefit from careful synthesis. Some of this synthesis we're working on at Rethink—it would be great to see it in book form! And I think such a book would be a very useful introductory text.

I don't find the case against bivalve sentience that strong, especially for the number of animals potentially involved and the diversity of the 10k bivalve species. (For example, scallops are motile and have hundreds of image-forming eyes—it'd be surprising to me if pain wasn't useful to such a lifestyle!)

I agree, pricing in impact seems reasonable. But do you think this is currently happening? If so, by what mechanism? I think the discrepancies between Redwood and ACE salaries are much more likely explained by norms at the respective orgs and funding constraints rather than some explicit pricing of impact.

Thanks for these Peter! (Note that Peter and I both work at Rethink Priorities.)

Do you think your study is sufficiently well powered to detect very small effect sizes on meat consumption?

No, and this is by design as you point out. We did try to recruit a population that may be more predisposed to change in Study 3 and looked at even more predisposed subgroups.

substantially larger than the effects we usually find for animal interventions even on more moveable things

I think we were informed by the results of our meta-analysis, which generally found effects around this size for meat reduction interventions.

Their null result on effect on meat consumption was not at all tightly bounded: −0.3 oz [−6.12 oz, +5.46 oz]

Obviously, this is ultimately subjective, but this corresponds to plus or minus a burger per week, which seems reasonably precise to me. The standardized CI is [−0.17, 0.15], so bounded below a 'small effect'. And, as David points out, less stringent CIs would look even better. But to be clear, I don't have a substantive disagreement here—just a matter of interpretation.

For even more power, we could combine studies 1 & 3 in a meta-analysis (doubling the sample size). Study 3 found a treatment effect of −1.72 oz/week (95% CI: [−8.84, 5.41]), so the meta-analytic estimate would probably be very small but still in the correct direction, with tighter bounds of course.
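For the curious, a rough sketch of what that pooled estimate might look like, using fixed-effect inverse-variance weighting and assuming the two reported 95% CIs are symmetric normal intervals (these are the numbers quoted in this thread, not an official analysis):

```python
import math

def se_from_ci95(lo: float, hi: float) -> float:
    """Back out a standard error from a symmetric 95% CI."""
    return (hi - lo) / (2 * 1.96)

studies = [  # (point estimate in oz/week, 95% CI) as quoted above
    (-0.30, (-6.12, 5.46)),   # Study 1
    (-1.72, (-8.84, 5.41)),   # Study 3
]

weights = [1 / se_from_ci95(lo, hi) ** 2 for _, (lo, hi) in studies]
pooled = sum(w * est for w, (est, _) in zip(weights, studies)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
# A small negative pooled effect with a tighter CI than either study alone.
```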

explained just by the fact that you could find effects on the moveable attitudes

Just to clarify, we measured attitudes in all 3 studies. We found an effect on intentions in Study 2 where there wasn't blinding and follow-up was immediate. Studies 3 & 4 (likely) didn't find effects on attitudes.

I'd be curious to estimate what effect size would we be looking at if say 3-5% of people stopped eating meat (an optimistic estimate IMO).

Just roughly taking David Reinstein's number of 80 oz per week (could use our control group's mean for a better estimate) and assuming no other changes, 1% abstention would give a 0.8 oz effect size and 5% a 4 oz one. So definitely under-powered at the low end, but potentially closer to detectable at the high end. (And keeping in mind this is at 12-day follow-up; we should expect that 1% to dwindle further at longer follow-up. With figures this low I would be pessimistic about the overall impact. But keep in mind other successful meat reduction interventions don't seem to have worked mostly through a few individuals totally abstaining!)
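That back-of-envelope can be written out explicitly (80 oz/week is the figure from the thread; the abstention rates are illustrative):

```python
BASELINE_OZ_PER_WEEK = 80.0  # rough population mean quoted in the thread

def abstention_effect(abstain_rate: float) -> float:
    """Expected mean reduction in oz/week if a fraction of respondents
    abstains entirely, assuming everyone else's consumption is unchanged."""
    return BASELINE_OZ_PER_WEEK * abstain_rate

# 1% abstention -> 0.8 oz/week; 5% -> 4.0 oz/week
effects = {rate: abstention_effect(rate) for rate in (0.01, 0.03, 0.05)}
```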

corresponds to what a t-test is assessing

I wouldn't expect issues in testing the difference in means given our sample sizes. But otherwise I'm not sure what you're suggesting here.

Yes, we did, and found no meaningful increases in interest in animal activism, including voting intentions. Full questions are available in the supplementary materials.

Thank you for taking the time to engage, much appreciated! Forgive my responding quickly, and feel free to ask for clarification if I miss anything:

  • Definitely, could be different results with different docs. But ours showed a much stronger effect than the average of similar interventions we found in a previous meta-analysis, suggesting Good for Us is pretty good. It is probably better than Cowspiracy on changing intentions, with longer studies of excerpts of Cowspiracy also finding no effect.
  • Agree, especially with your sub-point. We also tried to recruit populations more likely to be affected in Study 3. Also, see sources in my previous point.
  • Maybe but doesn't seem likely since there wasn't change in importance of animal welfare or other measures of attitudes. I would generally expect effects to decay over time rather than get stronger; our meta-analysis (weakly) supports this hypothesis in that longer time points showed smaller effects. Usefulness of a 2-3 month time point would mostly depend on attrition in my opinion.
  • I would vote other interventions. Classroom education in colleges and universities seems good as does increasing the availability of plant-based options in food service and restaurants.

+1 as well. I would emphasize that the number of animals alive at any given time is significantly more important than slaughter counts, as many animals die prior to slaughter.
