I think the correct adjustment would involve multiplying the effect size by something like 1.1 or 1.2. But figuring out the best way to handle it should involve looking into this issue in more depth, consulting someone with more expertise on this sort of statistical issue, or both.
This sort of adjustment wouldn't change your bottom-line conclusions that this point estimate for deworming is smaller than the point estimate for StrongMinds, and that this estimate for deworming is not statistically significant, but it would shift some of the distributions & probabilities that you discuss (including the probability that StrongMinds has a larger well-being effect than deworming).
A low-reliability outcome measure attenuates the measured effect size. So if researchers measure the effect of one intervention on a high-quality outcome measure, and they measure the effect of another intervention on a lower-quality outcome measure, the use of different measures will inflate the apparent relative impact of the intervention that got higher-quality measurement. Converting different scales into number of SDs puts them all on the same scale, but doesn't adjust for this measurement issue.
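To make the size of this distortion concrete, here is a minimal sketch under classical test theory, where random measurement error leaves the raw mean difference unchanged but inflates the observed SD, so the observed effect is the true effect times the square root of the reliability. The reliabilities (0.9 and 0.6) and the true effect (0.10 SDs) are hypothetical numbers chosen for illustration, not values from either study.

```python
import math


def attenuated_effect(true_d: float, reliability: float) -> float:
    """Observed standardized effect for an outcome with the given reliability.

    Assumes classical test theory: observed SD = true SD / sqrt(reliability),
    while the raw mean difference between groups is unaffected by random error.
    """
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    return true_d * math.sqrt(reliability)


# Hypothetical: two interventions with the SAME true effect, measured differently.
true_d = 0.10
high_quality = attenuated_effect(true_d, reliability=0.9)
low_quality = attenuated_effect(true_d, reliability=0.6)

# How much better the well-measured intervention looks purely due to measurement.
apparent_ratio = high_quality / low_quality
print(f"observed effects: {high_quality:.3f} vs {low_quality:.3f}")
print(f"apparent advantage from measurement alone: {apparent_ratio:.2f}x")
```

In this made-up case the two interventions are equally effective, yet the one with the better outcome measure appears noticeably larger in SD units.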
For example, if you have a continuous outcome measure and you dichotomize it by taking a median split (so half get a score of zero and half get a score of one), that will shrink your effect size (number of SDs) to about 80% of what it would've been on the continuous measure. So if you would've gotten an effect size of 0.08 SDs on the continuous measure, you'll find an effect size of 0.064 SDs on this binary measure.
I think that using a three-point scale to measure happiness should produce at least as much attenuation as taking a continuous measure and then carving it up into three groups. Here are some sample calculations to estimate how much that attenuates the effect size. I believe the best-case scenario is if the responses are trichotomized into three equally sized groups, which would shrink the effect size to about 89% of what it would've been on the continuous measure, e.g. from 0.08 to 0.071. At a glance I don't see descriptive statistics for how many people selected each option on the happy123 measure in this study, so I can't do a calculation that directly corresponds to this study. (I also don't know how you did the measurement for the study of StrongMinds, which would be necessary for comparing them head-to-head.)
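The 80% and 89% figures can be reproduced with a short calculation. This is a sketch that assumes the underlying continuous measure is normally distributed and that the categories are scored 0, 1, 2, ... (equally spaced); the attenuation factor is then the correlation between the continuous variable and its binned version. The function name and the uneven split at the end are hypothetical, since the actual response shares on happy123 are not known here.

```python
from statistics import NormalDist


def discretization_attenuation(proportions):
    """Correlation between a standard normal variable and its binned version.

    proportions: share of respondents in each ordered category (must sum to 1).
    Categories are scored 0, 1, 2, ... (an assumption about the scale).
    """
    if abs(sum(proportions) - 1.0) > 1e-9 or min(proportions) <= 0:
        raise ValueError("proportions must be positive and sum to 1")
    std_normal = NormalDist()
    # The score equals the number of cut points below X, and
    # E[X * 1{X > c}] is the normal density at c, so the covariance
    # between X and the score is the sum of densities at the cut points.
    cumulative = 0.0
    covariance = 0.0
    for share in proportions[:-1]:
        cumulative += share
        cut = std_normal.inv_cdf(cumulative)
        covariance += std_normal.pdf(cut)
    mean_score = sum(k * p for k, p in enumerate(proportions))
    var_score = sum(p * (k - mean_score) ** 2 for k, p in enumerate(proportions))
    return covariance / var_score ** 0.5


median_split = discretization_attenuation([0.5, 0.5])        # about 0.80
equal_thirds = discretization_attenuation([1 / 3, 1 / 3, 1 / 3])  # about 0.89
uneven_split = discretization_attenuation([0.1, 0.3, 0.6])   # hypothetical shares

for label, factor in [("median split", median_split),
                      ("equal thirds", equal_thirds),
                      ("uneven 10/30/60", uneven_split)]:
    # The implied correction is the reciprocal of the attenuation factor.
    print(f"{label}: attenuation {factor:.3f}, correction x{1 / factor:.2f}")
```

With the real response shares in place of the hypothetical 10/30/60 split, the same function would give an attenuation factor that corresponds more directly to the study.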
I don't see why you used a linear regression over time. It seems implausible that the trend over time would be (non-flat) linear, and the three data points have enough noise to make the estimate of the trend extremely noisy.
Intelligence 1: Individual cognitive abilities.
Intelligence 2: The ability to achieve a wide range of goals.
Eliezer Yudkowsky means Intelligence 2 when he talks about general intelligence. For example, he proposed "efficient cross-domain optimization" as the definition in his post by that name. See the LW tag page for General Intelligence for more links & discussion.
The model assumes gradually diminishing returns to spending within the next year, but the intuition behind your third voice is that much higher spending would have marginal returns that are a lot smaller, or ~zero, or negative?
Could you post something closer to the raw survey data, in addition to the analysis spreadsheet linked in the summary section? I'd like to see something that:
The numbers that you get from this sort of exercise will depend heavily on which people you get estimates from. My guess is that which people you include matters more than what you do with the numbers that they give you.
If the people who you survey are more like the general public, rather than people around our subcultural niche where misaligned AI is a prominent concern, then I expect you'll get smaller numbers.
Whereas, in Rob Bensinger's 2021 survey of "people working on long-term AI risk", every one of the 44 people who answered the survey gave an estimate larger than the 1.6% headline figure here. The smallest answer was 1.9%, and the central tendency was somewhere between 20% and 40% (depending on whether you look at the median, arithmetic mean, or geometric mean of the odds, and which of the two questions from that survey you look at).
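As a rough illustration of why the central tendency depends on the choice of aggregate, here is a sketch that computes the median, the arithmetic mean, and the geometric mean of odds for one set of probability estimates. The `estimates` list is made up for illustration and is not data from either survey.

```python
import math
from statistics import mean, median


def geometric_mean_of_odds(probabilities):
    """Pool probabilities by averaging their log-odds, then convert back."""
    log_odds = [math.log(p / (1 - p)) for p in probabilities]
    pooled_odds = math.exp(mean(log_odds))
    return pooled_odds / (1 + pooled_odds)


# Hypothetical estimates, NOT taken from any survey.
estimates = [0.02, 0.10, 0.25, 0.50, 0.90]

summary = {
    "median": median(estimates),
    "arithmetic mean": mean(estimates),
    "geometric mean of odds": geometric_mean_of_odds(estimates),
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```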
If the estimates for the different components were independent, then wouldn't the distribution of synthetic estimates be the same as the distribution of individual people's estimates?
Multiplying Alice's p1 × Bob's p2 × Carol's p3 × ... would draw from the same distribution as multiplying Alice's p1 × Alice's p2 × Alice's p3 × ..., if the estimates for the different questions are unrelated.
So you could see how much non-independence affects the bottom-line results just by comparing the synthetic distribution with the distribution of individual estimates (treating each individual as one data point and multiplying their 6 component probabilities together to get their p(existential catastrophe)).
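The comparison described above can be sketched as follows. This is a simulation on made-up respondents, not the survey data: `simulate_responses` is a hypothetical helper that gives each person a shared shift in log-odds across all components, which is one simple way to create non-independence. Synthetic estimates are drawn here by sampling each component from a randomly chosen respondent, which is an assumption about how the synthetic method works.

```python
import math
import random
from statistics import stdev


def individual_products(responses):
    """One bottom-line estimate per person: the product of their own components."""
    return [math.prod(person) for person in responses]


def synthetic_products(responses, n_draws, rng):
    """Each draw takes every component from an independently chosen respondent."""
    n_components = len(responses[0])
    return [
        math.prod(rng.choice(responses)[k] for k in range(n_components))
        for _ in range(n_draws)
    ]


def simulate_responses(n_people, n_components, shared_sd, noise_sd, rng):
    """Hypothetical respondents whose log-odds share a per-person shift."""
    responses = []
    for _ in range(n_people):
        shift = rng.gauss(0.0, shared_sd)
        log_odds = [shift + rng.gauss(0.0, noise_sd) for _ in range(n_components)]
        responses.append([1 / (1 + math.exp(-x)) for x in log_odds])
    return responses


def log_spread(products):
    """Standard deviation of the log of the bottom-line estimates."""
    return stdev([math.log(p) for p in products])


rng = random.Random(0)
# Strongly correlated components: pessimists are pessimistic on every question.
correlated = simulate_responses(200, 6, shared_sd=1.5, noise_sd=0.5, rng=rng)

spread_individual = log_spread(individual_products(correlated))
spread_synthetic = log_spread(synthetic_products(correlated, 5000, rng))
print(f"spread of individual estimates: {spread_individual:.2f}")
print(f"spread of synthetic estimates:  {spread_synthetic:.2f}")
```

When components are positively correlated within a person, the individual estimates are more spread out than the synthetic ones; setting `shared_sd=0.0` in this sketch removes the correlation, and the two spreads should then be similar up to sampling noise.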
Insofar as the 6 components are not independent, the question of whether to use synthetic estimates or just look at the distribution of individuals' estimates comes down to two things: 1) how much value there is in increasing the effective sample size by using synthetic estimates, and 2) whether the non-independence that exists is something you want to erase by scrambling together different people's component estimates (because it mainly reflects reasoning errors) or something you want to maintain by looking at individual estimates (because it reflects the structure of the situation).
Does the table in section 3.2 take the geometric mean for each of the 6 components?
From footnote 7 it looks like it does, but if it does then I don't see how this gives such a different bottom-line probability from the synthetic method geomean in section 4 (18.7% vs. 1.65% for all respondents). Unless some probabilities are very close to 1, and those have a big influence on the numbers in the section 3.2 table? Or my intuitions about these methods are just off.
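The intuition behind this question can be checked numerically. The sketch below assumes a plain geometric mean of probabilities: in that case, multiplying the per-component geometric means gives exactly the geometric mean over all possible synthetic combinations, because the log of a product is a sum of logs. If the post instead pools with a geometric mean of odds, this identity need not hold, which could be one source of the gap. The `responses` table is hypothetical.

```python
import itertools
import math


def geometric_mean(values):
    """Plain geometric mean of positive numbers."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical responses: one row per person, one column per component.
responses = [
    [0.90, 0.50, 0.20],
    [0.60, 0.10, 0.40],
    [0.99, 0.30, 0.05],
]
components = list(zip(*responses))  # one tuple of answers per component

# Method A: geometric mean within each component, then multiply.
product_of_geomeans = math.prod(geometric_mean(c) for c in components)

# Method B: build every synthetic respondent, then take the geometric mean.
all_synthetic = [math.prod(combo) for combo in itertools.product(*components)]
geomean_of_synthetic = geometric_mean(all_synthetic)

print(f"product of component geomeans:   {product_of_geomeans:.6f}")
print(f"geomean of all synthetic draws:  {geomean_of_synthetic:.6f}")
```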
In some of the graphs it looks like the improvement diminishes more quickly than the logarithm, such that (e.g.) going from 100 to 200 gives a smaller improvement than going from 10 to 20. It seems like maybe you agree, given your "albeit probably diminishing" parenthetical. If so, could you rewrite this summary to better match that conclusion?
Maybe there's some math you could do that would give a more precise description? For example, with your bootstrapping analysis, is there a limit for the Brier score as the number of hypothetical users increases?