This post summarizes a new meta-analysis from the Humane and Sustainable Food Lab. We analyze the most rigorous randomized controlled trials (RCTs) that aim to reduce consumption of meat and animal products (MAP). We conclude that no theoretical approach, delivery mechanism, or persuasive message should be considered a well-validated means of reducing MAP consumption. By contrast, reducing consumption of red and processed meat (RPM) appears to be an easier target. However, if RPM reductions lead to more consumption of chicken and fish, this is likely bad for animal welfare and does little to reduce the risk of zoonotic outbreaks or land and water pollution. We also find that many promising approaches await rigorous evaluation.
This post updates a post from a year ago. We first summarize the current paper, and then describe how the project and its findings have evolved.
What is a rigorous RCT?
We operationalize “rigorous RCT” as any study that:
* Randomly assigns participants to treatment and control groups
* Measures consumption directly -- rather than (or in addition to) attitudes, intentions, or hypothetical choices -- at least one day after treatment begins
* Has at least 25 subjects in each of the treatment and control groups, or, for cluster-assigned studies (e.g. university classes that all attend a lecture together or not), at least 10 clusters in total.
Additionally, studies needed to aim at reducing MAP consumption overall, rather than, e.g., encouraging people to switch from beef to chicken, and had to be publicly available by December 2023.
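For readers who think in code, here is a minimal sketch of that screen as a filter function. The field names (`randomized`, `days_to_outcome`, and so on) are hypothetical illustrations, not the paper's actual coding scheme, and screening in the meta-analysis was of course done by the review team rather than programmatically.

```python
from dataclasses import dataclass

@dataclass
class Study:
    # Hypothetical fields encoding the inclusion criteria described above.
    randomized: bool              # random assignment to treatment vs. control
    measures_consumption: bool    # direct consumption outcome, not just attitudes/intentions
    days_to_outcome: int          # days between treatment onset and outcome measurement
    n_treatment: int              # subjects in the treatment group
    n_control: int                # subjects in the control group
    cluster_assigned: bool        # e.g. whole university classes assigned together
    n_clusters: int               # total clusters, if cluster-assigned
    targets_map_reduction: bool   # aims to reduce MAP, not substitute within MAP
    public_by_dec_2023: bool      # publicly available by December 2023

def is_rigorous_rct(s: Study) -> bool:
    """Apply the inclusion criteria to a candidate study."""
    adequately_powered = (
        s.n_clusters >= 10 if s.cluster_assigned
        else min(s.n_treatment, s.n_control) >= 25
    )
    return (
        s.randomized
        and s.measures_consumption
        and s.days_to_outcome >= 1
        and adequately_powered
        and s.targets_map_reduction
        and s.public_by_dec_2023
    )

# Example: an individually randomized study with ~60 subjects per arm,
# measuring consumption one week after treatment begins, passes the screen.
example = Study(randomized=True, measures_consumption=True, days_to_outcome=7,
                n_treatment=60, n_control=58, cluster_assigned=False, n_clusters=0,
                targets_map_reduction=True, public_by_dec_2023=True)
assert is_rigorous_rct(example)
```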
We found 35 papers, comprising 41 studies and 112 interventions, that met these criteria; 18 of the 35 papers have been published since 2020.
The main theoretical approaches:
Broadly speaking, studies used Persuasion, Choice Architecture, Psychology, and a combination of Persuasion and Psychology to try to change eating behavior.
Persuasion studies typically provide arguments about animal welfare, health, and environmental harms as reasons to reduce MAP consumption.