Researcher (on bio) at FHI

The issue re comparators is less about how good dropping outliers or fixed effects are as remedies for publication bias (or how appropriate either would be as an analytic choice here, all things considered), and more about the similarity of these models to the original analysis.

We are not, after all, adjusting or correcting the original metaregression analysis directly, but rather indirectly inferring the likely impact of small study effects on the original analysis by reference to the impact it has in simpler models.

The original analysis, of course, did not exclude outliers, nor follow-ups, and used random effects, not fixed effects. Of Models 1-6, then, Model 1 bears the closest similarity to the analysis being indirectly assessed, so it seems the most appropriate baseline.

The point about outlier removal and fixed effects reducing the impact of small study effects is meant to illustrate that cycling comparators introduces a bias into the assessment, rather than just adding noise. Of Models 2-6, we would expect 2, 4, 5, and 6 to be more resilient to small study effects than Model 1, because they either remove outliers, use fixed effects, or both (Model 3 should be ~a wash). The second figure provides some (further) evidence of this, as (e.g.) the random effects models (hatched) strongly tend to report greater effect sizes than the fixed effect ones, regardless of additional statistical method.

So noting that the discount for a statistical small study effect correction is not so large versus comparators which are already less biased (due to analysis choices contrary to those made in the original analysis) misses the mark.

If the original analysis had (somehow) used fixed effects, these worries would (largely) not apply. Of course, if the original analysis had used fixed effects, the effect size would have been a lot smaller in the first place.

--

Perhaps also worth noting is that - with a discounted effect size - the overall impact of the intervention becomes very sensitive to linear versus exponential decay of effect, given the definite integral of the linear method scales with the square of the intercept, whilst for exponential decay the integral is ~linear in the intercept. Although these values line up fairly well with the original intercept value of ~0.5, they diverge at lower values. If (e.g.) the intercept is 0.3, over a 5 year period the exponential method (with correction) returns ~1 SD-year (vs. 1.56 originally), whilst the linear method gives ~0.4 SD-years (vs. 1.59 originally).
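To make this concrete, here is a minimal sketch of the two decay shapes (the slope and decay rate below are illustrative placeholders, not the fitted values from the write-up): integrating linear decay until the effect hits zero gives b0^2/(2*b1), quadratic in the intercept, whilst exponential decay over a fixed window gives b0*(1 - exp(-lam*T))/lam, linear in the intercept.

```python
import math

def linear_total(b0, b1):
    # b0 - b1*t integrated until it hits zero at t = b0/b1
    # (assumes the effect fully decays within the window): b0^2 / (2*b1)
    return b0**2 / (2 * b1)

def exp_total(b0, lam, T=5):
    # b0*exp(-lam*t) integrated over [0, T]: b0*(1 - exp(-lam*T))/lam
    return b0 * (1 - math.exp(-lam * T)) / lam

# Halving the intercept quarters the linear total, but only halves the exponential one:
print(linear_total(0.5, 0.1) / linear_total(0.25, 0.1))  # 4.0
print(exp_total(0.5, 0.2) / exp_total(0.25, 0.2))        # 2.0
```

Hence a given discount to the intercept bites much harder under the linear method, which is why the two approaches roughly agree near 0.5 but diverge at lower values.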

(And, for what it is worth, if you plug corrected SE or SE² values into the original multilevel meta-regressions, PET/PEESE style, you do drop the intercept by around these amounts, either vs. follow-up alone or the later models which add other covariates.)

I have now had a look at the analysis code. Once again, I find significant errors and - once again - correcting these errors is adverse to HLI's bottom line.

I noted before that the results originally reported do not make much sense (e.g. they generally report *increases* in effect size when 'controlling' for small study effects, despite it being visually obvious that small studies tend to report larger effects on the funnel plot). When you use appropriate comparators (i.e. comparing everything to the original model as the baseline case), the cloud of statistics looks more reasonable: in general, they point towards discounts, not enhancements, to effect size - the red lines are generally less than 1, whilst the blue ones are all over the place.

However, some findings still look bizarre even after doing this. E.g. Model 13 (PET) and Model 19 (PEESE), which do nothing re outliers, fixed effects, follow-ups, etc., still report higher effects than the original analysis. Both are closely related to the Egger's test noted before: why would it give a substantial discount, yet these a mild enhancement?

Happily, the code availability means I can have a look directly. All the basic data seems fine, as the various 'basic' plots and meta-analyses give the right results. Of interest, the Egger test is still pointing the right way - and even suggests a lower intercept effect size than last time (0.13 versus 0.26):

PET gives highly discordant findings:

You not only get a higher intercept (0.59 versus 0.5 in the basic random effects model), but the coefficient for standard error is *negative*: i.e. the regression line it draws slopes the opposite way to Egger's, so it predicts smaller studies give *smaller*, not greater, effects than larger ones. What's going on?

The moderator (i.e. ~independent variable) is 'corrected' SE. Unfortunately, this correction is incorrect (line 17 divides (n/2)^2 by itself, where the first bracket should be +, not *), so it 'corrects' a lot of studies to SE = 1 exactly:
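The offending line is in R; as a sketch of the slip in Python (assuming two equal arms of n/2, so the usual approximate SMD standard error is sqrt((n1 + n2)/(n1*n2)) = 2/sqrt(n) - my reconstruction of the intent, not HLI's exact code):

```python
import math

def se_buggy(n):
    # (n/2)*(n/2) divided by itself: always 1, whatever the sample size
    return math.sqrt(((n / 2) * (n / 2)) / ((n / 2) * (n / 2)))

def se_corrected(n):
    # first bracket summed, not multiplied: sqrt((n1 + n2)/(n1*n2)) = 2/sqrt(n)
    return math.sqrt(((n / 2) + (n / 2)) / ((n / 2) * (n / 2)))

print(se_buggy(100))      # 1.0 - every study 'corrected' to SE = 1 exactly
print(se_corrected(100))  # 0.2 - shrinks as studies get larger, as an SE should
```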

When you use this in a funnel plot, you get this:

Thus these aberrant results (which happened to be below the mean effect size) explain why the best fit line now points in the opposite direction. All the PET analyses are contaminated by this error, and (given PEESE squares these values) so are all the PEESE analyses. When debugged, PET shows an intercept lower than 0.5, and the coefficient for SE pointing in the right direction:

Here's the table of corrected estimates applied to Models 13-24: as you can see, correction reduces the intercept in all models, often to a substantial degree (I only reported to 2 dp, but Model 23 was marginally lower). Unlike the original analysis, here the regression slopes generally point in the right direction.

The same error appears to be in the CT analyses. I haven't done the same correction, but I would guess the bizarre readings (e.g. the outliers of 70x or 200x etc. when comparing PT to CT when using these models) would vanish once it is corrected.

So, when we correct the PET and PEESE results, and use the appropriate comparator (Model 1 - I forgot to do this for Models 2-6 last time), we now get this:

Now the interpretation is much clearer. Rather than 'all over the place, but most of the models basically keep the estimate the same', it is instead 'across most reasonable ways to correct or reduce the impact of small study effects, you see substantial reductions in effect' (the average across the models is ~60% of the original - not a million miles away from my '50%?' eyeball guess). Moreover, the results permit better qualitative explanation.

- On the first level, we can make our model fixed or random effects. Fixed effects are more resilient to publication bias (more later), and we indeed find changing from random effects to fixed effects (i.e. Model 1 to Model 4) reduces the effect size by a factor of a bit more than 2.
- On the second level, we can elect for different inclusion criteria: we could remove outliers, or exclude follow-ups. The former would be expected to partially reduce small study effects (as outliers will tend to be smaller studies reporting surprisingly high effects), whilst the latter has no obvious directional effect - although one should account for nested outcomes, this would be expected to distort the weights rather than introduce a bias in effect size. Neatly enough, we see outlier exclusion does reduce the effect size (Model 2 versus Model 1), but excluding follow-ups does not (Model 3 versus Model 1). Another neat example of things lining up: you would expect FE to give a greater correction than outlier removal (as FE strongly discounts smaller studies across the board, rather than removing a few of the most remarkable ones), and this is what we see (Model 2 vs. Model 4).
- Finally, one can deploy a statistical technique to adjust for publication bias. There are a bunch of methods to do this: PET, PEESE, Rucker's limit, P curve, and selection models. All of these besides the P curve give a discount to the original effect size (Models 7, 13, 19, 25, 37 versus Model 31).
- We can also apply these choices in combination, but essentially all combinations point to a significant downgrade in effect size. Furthermore, the combinations allow us to better explain discrepant findings. Only models 3, 31, 33, 35, 36 give numerically higher effect sizes. As mentioned before, model 3 only excludes follow-ups, so would not be expected to be less vulnerable to small study effects. The others are all P curve analyses, and P curves are especially sensitive to heterogeneity: the two P curves which report discounts are those with outliers removed (Model 32, 35), supporting this interpretation.

With that said, onto Joel's points.

**1. Discarding (better - investigating) bizarre results**

I think if we discussed this beforehand and I said "Okay, you've made some good points, I'm going to run all the typical tests and publish their results", would you have advised me to not even try, and instead make ad hoc adjustments? If so, I'd be surprised, given that's the direction I've taken you to be arguing I should move away from.

You are correct that I would have wholly endorsed permuting all the reasonable adjustments and seeing what picture emerges. Indeed, I would be (and am) happy with 'throwing everything in', even if some combinations can't really work, or don't really make much sense (e.g. outlier rejection + trim and fill).

But I would also have urged you to actually understand the results you are getting, and to query results which plainly do not make sense. That we're still seeing the pattern of "initial results reported don't make sense, and I have to repeat a lot of the analysis myself to understand why (and, along the way, find the real story is much more adverse than HLI presents)" is getting depressing.

The error itself for PET and PEESE is no big deal - "I pressed the wrong button once when coding and it messed up a lot of my downstream analysis" can happen to anyone. But these results plainly contradicted the naked eye (they give not only weird PT findings but weird CT findings: by inspection the CT literature is basically a negative control for pub bias, yet PET-PEESE typically finds statistically significant discounts); they contradicted the closely-related Egger's test (disagreeing with respect to *sign*); and the negative coefficients for the models (meaning they slope in the opposite direction) are printed right there in the analysis output.

I also find myself inclined to less sympathy here because I didn't meticulously inspect every line of analysis code looking for trouble (my file drawer is empty): I knew the results being reported for these analyses could not be right, so I zeroed in on them expecting there was an error. I was right.

**2. Comparators**

When I do this, and again remove anything that doesn't produce a discount for psychotherapy, the average correction leads to a 6x cost-effectiveness ratio of PT to CT. This is a smaller shift than you seem to imply.

9.4x -> ~6x is a drop of about one third; I guess we could argue about what increment counts as large or small. But more concerning is the direction of travel when taking the 'CT (all)' comparator.

If we do not follow my initial reflex and discard the PT-favouring results, then we see adding the appropriate comparator and fixing the statistical error ~halves the original multiple. If we continue excluding the "surely not" +ve adjustments, we're still seeing a 20% drop with the comparator, and a further 10% increment with the right results for the PT PET/PEESE.

How many more increments are there? There's at least one more - the CT PET/PEESE results are wrong, and they're giving bizarre results in the spreadsheet. I would expect diminishing returns to further checking (i.e. if I did scour the other bits of the analysis, I expect the cumulative error to be smaller or neutral), but the 'limit value' of what this analysis would show if there were no errors doesn't look great so far.

Maybe it would roughly settle towards the average of ~ 60%, so 9.4*0.6 = 5.6. Of course, this would still be fine by the lights of HLI's assessment.

**3. Cost effectiveness analysis**

My complete guess is that if StrongMinds went below 7x GiveDirectly we'd qualitatively soften our recommendation of StrongMinds and maybe recommend bednets to more donors. If it was below 4x we'd probably also recommend GiveDirectly. If it was below 1x we'd drop StrongMinds. This would change if / when we find something much more (idk: 1.5-2x?) cost-effective and better evidenced than StrongMinds.

However, I suspect this is beating around the bush -- as I think the point Gregory is alluding to is "look at how much their effects appear to wilt with the slightest scrutiny. Imagine what I'd find with just a few more hours."

If that's the case, I understand why -- but that's not enough for me to reshuffle our research agenda. I need to think there's a big, clear issue now to ask the team to change our plans for the year. Again, I'll be doing a full re-analysis in a few months.

Thank you for the benchmarks. However, I mean to beat both the bush and the area behind it.

First things first: I have harped on about the CEA because it is bizarre to be sanguine about significant corrections on the grounds that 'the CEA still gives a good multiple' when the CEA itself gives bizarre outputs (as noted before). With these benchmarks, it seems this analysis, on its own terms, is already approaching action relevance: unless you want to stand behind cycling comparators (which the spreadsheet only does for PT and not CT, as I noted last time), then this plus the correction gets you below 7x. Further, if you want to take the SM effects as relative to the meta-analytic results (rather than taking their massively outlying values), you get towards 4x (e.g. drop the effect size of both meta-analyses by 40%, then put the SM effect sizes at the upper 95% CI). So there's already a clear motive to investigate urgently in terms of what you are already trying to do.

The other reason is the general point of "Well, this important input wilts when you look at it closely - maybe this behaviour generalises". Sadly, we don't really need to 'imagine' what I would find with a few more hours: I just did (and on work presumably prepared expecting I would scrutinise it), and I think the results speak for themselves.

The other parts of the CEA are non-linear in numerous ways, so it is plausible that drops of 50% in the intercept value lead to greater-than-50% drops in the MRA integrated effect sizes if correctly ramified across the analysis. More importantly, the thicket of the guestimate offers a lot of forking paths - given it seems HLI clearly has had a finger on the scale, you may not need many more relatively gentle (i.e. 10%-50%) pushes upwards to get very inflated 'bottom line multipliers'.

**4. Use a fixed effects model instead?**

As Ryan notes, fixed effects are unconventional in general, but arguably reasonable in this particular case, when confronted with considerable small study effects. I think - even if one had seen the publication bias prior to embarking on the analysis - sticking with random effects would have been reasonable.

Thanks for this, Joel. I look forward to reviewing the analysis more fully over the weekend, but I have three major concerns with what you have presented here.

**1. A lot of these publication bias results look like nonsense to the naked eye.**

Recall the two funnel plots for PT and CT (respectively):

I think we're all seeing the same important differences: the PT plot has markers of publication bias (asymmetry) and P hacking (clustering at the P<0.05 contour, also the p curve) visible to the naked eye; the CT studies do not really show this at all. So heuristically, we should expect statistical correction for small study effects to result in:

- In absolute terms, the effect size for PT should be adjusted *downwards*.
- In comparative terms, the effect size for PT should be adjusted downwards more than the CT effect size.

If a statistical correction does the opposite of these things, I think we should say its results are not just 'surprising' but 'unbelievable': it just cannot be true that, given the data we see being fed into the method, we should conclude this CT literature is more prone to small-study effects than this PT one; nor (contra the regression slope in the first plot) that the effect size for PT should be corrected *upwards*.
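As a toy demonstration of why these face-validity checks bite (constructed data, not the actual studies): a PET-style correction regresses effect sizes on their standard errors, weighting by inverse variance, and reads off the intercept as the effect of a hypothetical zero-SE study. If small studies report inflated effects, the SE coefficient should come out *positive* and the intercept should land *below* the naive pooled estimate:

```python
# Toy data: true effect 0.2, small studies inflated in proportion to their SE
ses = [0.05, 0.10, 0.20, 0.30, 0.40]
ys = [0.2 + 0.8 * se for se in ses]

# Weighted least squares of effect on SE, weights 1/SE^2 (PET-style)
w = [1 / se**2 for se in ses]
sw = sum(w)
xbar = sum(wi * x for wi, x in zip(w, ses)) / sw
ybar = sum(wi * y for wi, y in zip(w, ys)) / sw
slope = sum(wi * (x - xbar) * (y - ybar) for wi, x, y in zip(w, ses, ys)) \
        / sum(wi * (x - xbar) ** 2 for wi, x in zip(w, ses))
intercept = ybar - slope * xbar

print(round(ybar, 2))       # 0.26: the naive pooled estimate, inflated
print(round(intercept, 2))  # 0.2: the PET intercept recovers the true effect
print(slope > 0)            # True: the slope agrees with the funnel asymmetry
```

A correction run on visibly asymmetric data that instead reports a *negative* slope and an intercept *above* the pooled estimate is failing exactly this check.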

Yet many of the statistical corrections you have done tend to fail one or both of these basically-diagnostic tests of face validity. Across all the different corrections for PT, on average the result is a 30% *increase* in PT effect size (only trim and fill and selection methods give families of results where the PT effect size is reduced). Although (mostly) redundant, these are also the only methods which give a larger drop to PT than CT effect size.

As comments everywhere on this post have indicated, heterogeneity is tricky. If (generally) different methods all gave discounts, but they were relatively small (with the exception of one method like a Trim and Fill which gave a much steeper one), I think the conclusions you drew above would be reasonable. However, for these results, the ones that don't make qualitative sense should be discarded, and the key upshot should be: "Although a lot of statistical corrections give bizarre results, the ones which do make sense also tend to show significant discounts to the PT effect size".

**2. The comparisons made (and the order of operations to get to them) are misleading**

What is interesting, though, is that although in % changes the correction methods tend to give an increase to the PT effect size, the effect sizes themselves tend to be *lower*: the average effect size across analyses is 0.36, ~30% lower than the pooled estimate of 0.5 in the funnel plot (in contrast, this is 0.09 versus 0.1 for the CT effect size).

This is the case because the % changes are being measured not against the single reference value of 0.5 in the original model, but against the equivalent model in terms of random/fixed, outliers/not, etc., only without any statistical correction technique. For example: row 13 (Model 10) is a Trim-and-Fill correction for a fixed effect model using the full data. For PT, this effect size is 0.19. The % difference is calculated versus row 7 (Model 4), a fixed effect model without Trim-and-Fill (effect = 0.2), not the original random effects analysis (effect = 0.5). Thus the % of reference effect is 95%, not 40%. In general, comparing effect sizes to row 4 (Model ID 1) gets more sensible findings, and also generally more adverse ones, re. PT pub bias correction:

In terms of (e.g.) assessing the impact of Trim and Fill in particular, it makes sense to compare like with like. Yet presumably what we care about is ballparking the estimate of publication bias in general - and for that, the comparisons made in the spreadsheet mislead. Fixed effect models (ditto outlier exclusion, but maybe not follow-ups) are *already* an (~improvised) means of correcting for small study effects, as they weigh small studies much less in the pooled estimate than random effects models do. So noting Trim-and-Fill only gives a 5% additional correction in this case buries the lede: you already halved the effect by moving from a random effects to a fixed effect model, and the most plausible explanation is that fixed effect modelling limits distortion by small study effects.
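The weighting arithmetic shows why (the variances and tau^2 below are illustrative numbers, not fitted values): a fixed effect model weights each study by 1/v_i, whilst a random effects model weights by 1/(v_i + tau^2), and adding tau^2 flattens the weights so small studies count for relatively more.

```python
v_small, v_large = 0.09, 0.01  # sampling variances (SEs of 0.3 and 0.1)
tau2 = 0.08                    # between-study variance (illustrative)

def rel_weight(v, t2=0.0):
    # share of the pooled weight received by the study with variance v
    weights = [1 / (v_small + t2), 1 / (v_large + t2)]
    return (1 / (v + t2)) / sum(weights)

print(round(rel_weight(v_small), 2))        # 0.1: FE gives the small study ~10%
print(round(rel_weight(v_small, tau2), 2))  # 0.35: RE gives the same study ~35%
```

So a small, inflated study that random effects lets pull the pooled estimate around is largely neutralised under fixed effect pooling - before any explicit correction technique is applied.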

This goes some way to explaining the odd findings for statistical correction above: similar to collider/collinearity issues in regression, you might get weird answers about the impact of statistical techniques when you are already partly 'controlling for' small study effects. The easiest example of this is combining outlier removal with trim and fill - the outlier removal is basically doing the 'trim' part already.

It also indicates an important point your summary misses. One of the key stories in this data is: "Generally speaking, when you start using techniques - alone or in combination - which reduce the impact of publication bias, you cut around 30% of the effect size on average for PT (versus 10%-ish for CT)".

**3. Cost effectiveness calculation, again**

'Cost effectiveness versus CT' is an unhelpful measure to use when presenting these results: we would *first* like to get a handle on the size of the small study effect in the overall literature, and *then* see what ramifications it has for the assessment and recommendations of StrongMinds in particular. Another issue is that these results don't really join up with the earlier cost effectiveness assessment, in ways which complicate interpretation. Two examples:

- On the guestimate, setting the meta-regressions to zero effect still results in ~7x multiples for StrongMinds versus cash transfers. This spreadsheet takes a flat percentage of the original 9.4x bottom line (so a '0% of previous effect' correction does get the multiple down to zero). Being able to get results which give <7x CT overall is much more sensible than what the HLI CEA does, but such results could not be produced if we corrected the effect sizes and plugged them back into the original CEA.
- Besides the results being incongruous, the methods look incongruous too. The outliers being excluded in some analyses include StrongMinds-related papers later used in the overall CE calculation to get to the 9.4x figure. Ironically, exclusion would have been the right thing to do originally, as using these papers to help derive the pooled estimate and then again as independent inputs into the CEA double counts them. Alas, two wrongs do not make a right: excluding them as outliers seems to imply either i) these papers should be discounted generally (so shouldn't be given independent weight in the CEA); or ii) they are legit, but are such outliers that the meta-analysis is actually uninformative for assessing the effect of the particular interventions they investigate.

More important than this, though, is that the 'percentage of what?' issue crops up again: the spreadsheet uses relative percentage change to get a relative discount vs. CT, but it uses the wrong comparator to calculate the percentages.

Let's look at row 13 again, where we are conducting a fixed effects analysis with trim-and-fill correction. Now we want to compare PT and CT: does PT get discounted more than CT? As mentioned before, for PT, the original random effects model gives an effect size of 0.5, and with T'n'F + fixed effects the effect size is 0.19. For CT, the original effect size is 0.1, and with T'n'F + FE, it is still 0.1. In relative terms, as PT retains only 40% of the previous effect size (and CT 100%), this would amount to 40% of the previous 'multiple' (i.e. 3.6x).

Instead of comparing them to the original estimate (row 4), it calculates the percentages versus a fixed effect but *not* T'n'F analysis for PT (row 7). Although CT here is also 0.1, PT in this row has an effect size of 0.2, so the PT percentage is (0.19/0.2) 95% versus (0.1/0.1) 100%, and so the calculated multiple of CT is not 3.6 but 9.0.
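The arithmetic, using the numbers just quoted:

```python
pt_original, pt_fe, pt_fe_tnf = 0.5, 0.2, 0.19  # PT: Model 1, row 7, row 13
ct_original, ct_fe_tnf = 0.1, 0.1               # CT: barely moves either way
multiple = 9.4

# Comparing row 13 to the original random effects model (Model 1):
vs_model1 = multiple * (pt_fe_tnf / pt_original) / (ct_fe_tnf / ct_original)
# Comparing row 13 to the fixed effect model without T'n'F (row 7):
vs_row7 = multiple * (pt_fe_tnf / pt_fe) / (ct_fe_tnf / ct_original)

print(round(vs_model1, 1))  # 3.6: most of the discount is visible
print(round(vs_row7, 1))    # 8.9: the discount hides in the conditioning
```

(The spreadsheet reports 9.0 rather than 8.9, presumably down to its own rounding of the percentages.)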

The spreadsheet is using the wrong comparison: we care about whether the multiple between PT and CT is sensitive to different analyses, not about the relative sensitivity to one variation (T'n'F) *conditioned on another* (fixed effect modelling) - *especially* when we're interested in small study effects and the conditioned-on choice likely already reduces them.

If one recalculates the bottom line multiples using the first model as the comparator, the results are a bit less weird, but also more adverse to PT. Note the effect is particularly reliable for T'n'F (ID 7-12) and selection measures (ID 37-42), which as already mentioned are the analysis methods which give qualitatively believable findings.

Of interest, the spreadsheet *only* makes this comparator error for PT: for CT, whether all or lumped (column I and L) makes all of its percentage comparisons versus the original model (ID 1). I hope (and mostly expect) this is a click-and-drag spreadsheet error (or perhaps one of my understanding), rather than my unwittingly recovering an earlier version of this analysis.

**Summing up**

I may say more next week, but my impressions are:

- In answer to the original post title, I think the evidence for Strongminds is generally weak, equivocal, likely compromised, and definitely difficult to interpret.
- Many, perhaps most (maybe all?) of the elements used in HLI's recommendation of StrongMinds do not weather scrutiny well. E.g.:
- Publication bias issues discussed in the comments here.
- The index papers being noted outliers even amongst this facially highly unreliable literature.
- The cost effectiveness guestimate not giving sensible answers when you change its inputs.

- I think HLI should withdraw their recommendation of StrongMinds, and mostly go 'back to the drawing board' on their assessments and recommendations. The current recommendation is based on an assessment with serious shortcomings in many of its crucial elements. I regret to say I suspect that, if I looked into other things, I would find still more causes for concern.
- The shortcomings in multiple elements also make criticism challenging. Although HLI thinks the publication bias is not a big enough effect to withdraw the recommendation, it is unclear what magnitude of publication bias *would* be big enough, or indeed in general what evidence *would* lead them to change their minds. Their own CEA is basically insensitive to the meta-analysis, giving 'SM = 7x GD' even if the effect size were corrected all the way to zero. Above, Joel notes even at 'only' SM = 3-4x GD it would still generally be their top recommendation. So by this logic, the only decision-relevance this meta-analysis has is confirming the effect isn't massively negative. I doubt this is really true, but HLI should have a transparent understanding (and, ideally, transparent communication) of what their bottom line is actually responsive to.
- One of the commoner criticisms of HLI is that it is more a motivated reasoner than an impartial evaluator. Although its transparency in data (and now code) is commendable, overall this episode supports such an assessment: the pattern which emerges is a collection of dubious-to-indefensible choices made in analysis, which all point in the same direction (i.e. favouring the StrongMinds recommendation); surprising incuriosity about the ramifications or reasonableness of these analytic choices; and very little of this being apparent from the public materials, emerging instead in response to third party criticism or partial replication.
- Although there are laudable improvements contained in Joel's summary above, unfortunately (per my earlier points) I take it as yet another example of this overall pattern. The reasonable reaction to "Your publication bias corrections are (on average!) correcting the effect size upwards, and the obviously skewed funnel plot less than the not obviously skewed one" is not "Well, isn't that surprising - I guess there's no clear sign of trouble with pub bias in our recommendation after all", but "This doesn't make any sense".
- I recommend readers do not rely upon HLI's recommendations or reasoning without carefully scrutinising the underlying methods and data themselves.

Hello Joel,

0) My bad re rma.rv output, sorry. I've corrected the offending section. (I'll return to some second order matters later).

1) I imagine climbing in Mexico is more pleasant than arguing statistical methods on the internet, so I've attempted to save you at least some time on the latter by attempting to replicate your analysis myself.

This attempt was only partially successful: I took the 'Lay or Group cleaner' sheet and (per previous comments) flipped the signs where necessary so only Haushofer et al. shows a negative effect. Plugging this into R, I get basically identical results for the forest plot (RE mean 0.50 versus 0.51) and funnel plot (Egger's limit value 0.2671 vs. 0.2670). I get broadly similar but discordant values for the univariate linear and exponential decay models, as well as model 1 in table 2 [henceforth 'model 3'] (intercepts and coefficients ~within a standard error of the write-up's figures), and much more discordant values for the others in table 2.

I expect this 'failure to fully replicate' is mostly owed to a mix of: i) very small discrepancies between the datasets we are working off, which are likely to be amplified in more complex analyses than in simpler forest plots etc.; ii) I'd guess the covariates are much more discrepant, and there are more degrees of freedom in how they could be incorporated, so it is much more likely we aren't doing exactly the same thing (e.g. 'Layness' in my sheet seems to be ordinal - values of 0-3 depending on how well trained the provider was - whilst the table suggests it was coded as categorical (trained or not) in the original analysis). Hopefully it is 'close enough' for at least some indicative trends not to be operator error. In the spirit of qualified reassurance, here's my funnel plot:

2) Per the above, one of the things I wanted to check is whether you indeed see large drops in effect size when you control for small studies/publication bias/etc. You can't neatly merge (e.g.) Egger into a meta-regression (at least, I can't), but I can add in study standard error as a moderator. Although there would be many misgivings about doing this vs. (e.g.) some transformation (although I expect working harder to linearize etc. would *accentuate* any effects), there are two benefits: i) it is extremely simple; ii) the intercept value is where SE = 0, and so gives an estimate of what a hypothetical maximally-sized study would suggest.

Adding in SE as a moderator reduces the intercept effect size by roughly half (model 1: 0.51 -> 0.25; model 2: 0.42 -> 0.23; model 3: 0.69 -> 0.36). SE inclusion has ~no effect on the exponential model's time decay coefficient, but does seem to confound the linear decay coefficient (effect size down by a third, so no longer a significant predictor) and the single group-or-individual variable I thought I could helpfully look at (down by ~20%). I take this as suggestive that the results are significantly confounded by small study effects, and my Bayesian best-guess correction is somewhere around a 50% discount.

3) As previously mentioned, if you plug this into the guestimate you do not materially change the CEA (roughly 12x to 9x if you halve the effect sizes), but this is because this CEA will return StrongMinds as at least seven times better than cash transfers even if the effect sizes in the MRAs are set to zero. I did wonder how *negative* the estimate would have to be to change the analysis, but the gears in the guestimate include logs, so a lot of it errors if you feed in negative values. I fear, though, that if it were adapted it would give absurd results (e.g. still recommending StrongMinds even if the MRAs found psychotherapy exacerbated depression more than serious adverse life events).

4) To have an empty file-drawer, I also looked at 'source' to see whether cost survey studies gave higher effects due to the selection bias noted above. No: non-significantly numerically lower.

5) So it looks like the publication bias is much higher than estimated in the write-up: more like 50% than 15%. I fear part of the reason for this discrepancy is that the approach taken in Table A.2 is likely methodologically and conceptually unsound. I'm not aware of a similar method in the literature, but it sounds like what you did is linearly (?meta)regress g on N for the MetaPsy dataset (at least, I get similar figures when I do this, although my coefficient is 10x larger). If so, this doesn't make a lot of sense to me: SE is non-linear in N; the coefficient doesn't limit appropriately (e.g. an infinitely large study has +inf or -inf effects depending on which side of zero the coefficient is); and you're also extrapolating greatly out of sample for the correction between average study sizes. The largest study in MetaPsy is ~800 (I see two points on my plot above 650), but you are taking the difference of N values at ~630 and ~2700.
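For reference, a sketch of the nonlinearity (using the standard approximation for an SMD with two equal arms of N/2 - an assumption, since the actual arm splits vary): SE is roughly 2/sqrt(N), which flattens as N grows and tends to zero rather than ever crossing it, unlike a straight line in N.

```python
import math

def se_smd(n):
    # approximate SMD standard error for two equal arms of n/2:
    # sqrt((n/2 + n/2) / ((n/2) * (n/2))) = 2/sqrt(n)
    return 2 / math.sqrt(n)

# The curve flattens: most of the drop in SE happens long before N = 630,
# so extrapolating a straight line in N out to N = 2700 misstates the correction.
for n in (100, 630, 800, 2700):
    print(n, round(se_smd(n), 3))
```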

Even more importantly, it is very odd to use a third set of studies to make the estimate rather than the two literatures you are evaluating (given an objective is to compare the evidence bases, why not investigate them directly?). Treating them alike also assumes they share the same degree of small study effects - they are just at different points 'along the line' because one tends to have bigger studies than the other. It would seem reasonable to consider that the fields may differ in their susceptibility to publication bias and p-hacking, so that - controlling for N - cash transfer studies are less biased than psychotherapy ones. As we see from the respective funnel plots, this is clearly the case - the regression slope for psychotherapy is something like 10x as steep as the one for cash transfers.

(As a side note, Metapsy lets you shove all of their studies into a funnel plot, which looks approximately as asymmetric as the one from the present analysis:)

6) Back to the meta stuff.

I don't suspect either you or HLI of nefarious or deceptive behaviour (besides priors, this is strongly ruled against by publishing data that folks could analyse for themselves). But I do suspect partiality and imperfect intellectual honesty. By loose analogy, rather than a football referee who is (hopefully) unbiased but perhaps error prone, this is more like the manager of one of the teams claiming "obviously" their side got the rough end of the refereeing decisions (maybe more error prone in general, definitely more likely to make mistakes favouring one 'side', but plausibly/probably sincere), but not like (e.g.) a player cynically diving to try and win a penalty. In other words, I suspect - if anything - you mostly pulled the wool over your own eyes, without really meaning to.

One reason this arises is, unfortunately, the more I look into things the more cause for concern I find. Moreover, the direction of concern re. these questionable-to-dubious analysis choices strongly tends towards favouring the intervention. Maybe I see what I want to, but I can't think of many cases where the analysis was surprisingly incurious about a consideration which would likely result in the effect size being adjusted *upwards*, nor where a concern about accuracy and generalizability could be further allayed with an alternative statistical technique (one minor example of the latter - it looks like you coded Mid and Low therapy as categoricals when testing sensitivity to therapyness: if you ordered them I expect you'd get a significant test for trend).

I'm sorry again for mistaking the output you were getting, but - respectfully - it still seems a bit sus. It is not as if one should have had a low index of suspicion for lots of heterogeneity, given how permissively you were including studies; although Q is not an oracular test statistic, P < 0.001 should be a prompt to look at this further (especially as you can look at how Q changes when you add in covariates, and lack of great improvement when you do is a further signal); and presumably the very low R2 values mentioned earlier would be another indicator.

Although meta-analysis as a whole is arduous, knocking up a forest and funnel plot to have a look (e.g. at whether one should indeed use random vs. fixed effects, given one argument for the latter is that they are less sensitive to small study effects) is much easier. I would have had no chance of doing any of this statistical assessment without all your work getting the data in the first place; with it, I got the (low-quality, but informative) plots in well under an hour, and doing what you've read above took a morning.

I had the luxury of not being on a deadline, but I'm afraid a remark like "I didn't feel like I had time to put everything in both CEAs, explain it, and finish both CEAs before 2021 ended (which we saw as important for continuing to exist)" inspires sympathy but not reassurance on objectivity. I would guess HLI would have seen not only the quality and timeliness of the CEAs as important to its continued existence, but also the substantive conclusions they made: "We find the intervention we've discovered is X times better than cash transfers, and credibly better than GiveWell recs" seems much better in that regard than (e.g.) "We find the intervention we previously discovered and recommended now seems inferior to cash transfers - leave alone GiveWell top charities - by the lights of our own further assessment".

Besides being less pleasant, speculating over intentions is much less informative than the actual work itself. I look forward to any further thoughts you have on whether I am on the right track re. correction for small study effects, and I hope future work will show this intervention is indeed as promising as your original analysis suggests.

Thanks for the forest and funnel plots - much more accurate and informative than my own (although it seems the core upshots are unchanged).

I'll return to the second order matters later in the show, but on the merits, surely the discovery of marked small study effects should call the results of this analysis (and subsequent recommendation of Strongminds) into doubt?

Specifically:

- The marked small study effect is difficult to control for, but it seems my remark of an 'integer division' re. effect size is in the right ballpark. I would expect* (more later) real effects 2x-4x lower than thought, which could change the bottom lines.
- Heterogeneity remains vast, but the small study effect is likely the *best* predictor of it versus time decay, intervention properties similar to StrongMinds, etc. It seems important to repeat the analysis *controlling for* small study effects, as the overall impact calculation is much more sensitive to coefficient estimates which are plausibly confounded by this currently unaccounted-for effect.
- Discovery that the surveyed studies appear riven with publication bias and p-hacking should prompt further scepticism of outliers (like the SM-specific studies heavily relied upon).

Re. each in turn:

1. I think the typical 'Cochrane-esque' norms would say the pooled effects and metaregression results are essentially meaningless given profound heterogeneity and marked small study effects. From your other comments, I presume you more favour a 'Bayesian Best Guess' approach: rather than throwing up our hands if noise and bias loom large, we should do our best to correct for them and give the best estimate on the data.

In this spirit of statistical adventure, we could use the Egger's regression slope to infer the effect size a perfectly precise study would have (I agree with Briggs this is a dubious technique, but it seems one of the better available quantitative 'best guesses'). Reading your funnel plot, the limit value is around 0.15, ~4x lower than the random effects estimate. Your output suggests it is higher (0.26), which I guess is owed to a multilevel model rather than the simpler one in the forest and funnel plots, but either way it is ~2x lower than the previous 't=0' intercept values.

These are substantial corrections, and probably should be made urgently to the published analysis (given donors may be relying upon it for donation decisions).

2. As it looks like 'study size' is the best predictor of heterogeneity so far discovered, there's a natural fear that previous coefficient estimates for time decay and SM-intervention-like properties are confounded by it. So the overall correction to calculated impact could be *greater* than a flat 50-75% discount, if the less resilient coefficients 'go the wrong way' when this factor is controlled. I would speculate adding this in would give a further discount, albeit a (relatively) mild one: it is plausible that study size *collides* with time decay (so controlling for it results in somewhat greater persistence), but I would suspect the SM-trait coefficients go down markedly, so the MR including them would no longer give ~80% larger effects.

Perhaps the natural thing would be including study size/precision as a covariate in the metaregressions (e.g. adding on to model 5), and using these coefficients (rather than the univariate analysis previously done for time decay) in the analysis (again, *pace* the health warnings Briggs would likely provide). Again, this seems a matter of some importance, given the material risk of upending the previously published analysis.
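A toy simulation (all numbers invented, not drawn from the actual dataset) of how omitting SE can distort a time-decay coefficient when longer follow-ups happen to come from larger, lower-SE studies:

```python
import numpy as np

def wls(X, y, w):
    """Inverse-variance weighted least squares."""
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

rng = np.random.default_rng(1)
n = 400
t = rng.uniform(0, 2, n)                          # follow-up time (years)
# Invented structure: longer follow-ups come from bigger (lower-SE) studies
se = np.clip(0.35 - 0.08 * t + rng.normal(0, 0.05, n), 0.05, None)
# True model: effect 0.25, decay -0.10/yr, plus small-study inflation via SE
g = 0.25 - 0.10 * t + 1.0 * se + rng.normal(0, se)
w = 1 / se**2

b_naive = wls(np.column_stack([np.ones(n), t]), g, w)       # g ~ t
b_adj = wls(np.column_stack([np.ones(n), t, se]), g, w)     # g ~ t + SE
print("decay coefficient, SE omitted: ", round(b_naive[1], 3))
print("decay coefficient, SE included:", round(b_adj[1], 3))
```

Here the naive model overstates decay because the small-study bias shrinks with follow-up time; with the correlation reversed, the distortion would run the other way, which is why the direction of the confound is an empirical question.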

3. As perhaps goes without saying, seeing a lot of statistical evidence for publication bias and p-hacking in the literature probably should lead one to regard outliers with even greater suspicion - both because they are even greater outliers versus the (best guess) 'real' average effect, and because the foregoing analysis gives an adverse prior on what is really driving the impressive results.

It is worth noting that the StrongMinds recommendation is surprisingly *insensitive* to the MR results, despite their comprising the bulk of the analysis. With the Guesstimate as-is, SM removes roughly 12 SDs (SD-years, I take it) of depression per $1k. When I set the effect sizes of the metaregressions to *zero*, the Guesstimate still spits out an estimate that SM removes 7.1 SDs per $1k (so roughly 7x more effective than GiveDirectly). This suggests that the ~5 individual small studies are sufficient for the evaluation to give the nod to SM even if (e.g.) the meta-analysis found no impact of psychotherapy.

I take this to be diagnostic that the integration of information in the evaluation is not working as it should. Perhaps the Bayesian thing to do is to further discount these studies, given they are increasingly discordant from the (corrected) metaregression results, and their apparently high risk of bias given the literature they emerge from. There should surely be *some* non-negative value of the meta-analysis effect size which reverses the recommendation.
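A crude sketch of that further discounting is normal-normal shrinkage: treat the (corrected) meta-analytic estimate as a prior and pull each outlying study toward it in proportion to its precision. The numbers below are made up purely for illustration:

```python
def shrink(study_g, study_se, prior_mean, prior_sd):
    """Normal-normal shrinkage: precision-weighted combination of a
    study's estimate and a meta-analytic prior. All inputs hypothetical."""
    w_study = 1 / study_se**2
    w_prior = 1 / prior_sd**2
    post_mean = (w_study * study_g + w_prior * prior_mean) / (w_study + w_prior)
    post_sd = (w_study + w_prior) ** -0.5
    return post_mean, post_sd

# A small SM-specific study reporting a large effect, shrunk toward a
# corrected meta-analytic prior of ~0.25 (invented values throughout)
m, s = shrink(study_g=1.2, study_se=0.35, prior_mean=0.25, prior_sd=0.15)
print(round(m, 2), round(s, 2))  # posterior lands much nearer the prior
```

The imprecise outlier gets pulled most of the way back toward the corrected average, which is roughly the behaviour one would want the evaluation's integration of evidence to exhibit.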

--

Back to the second order stuff. I'd take this episode as a qualified defence of the 'old fashioned way of doing things'. There are two benefits to aiming at higher standards of rigour.

**First**, sometimes the conventions are valuable guard rails. Shortcuts may not just add expected noise, but add expected bias. Or, another way of looking at it, the evidential value of the work could be very concave with 'study quality'.

These things can be subtle. One example I haven't previously mentioned concerns inclusion: the sampling/extraction was incomplete. The first shortcut you took (i.e. culling references from prior meta-analyses) was a fair one - sure, there might be more data to find, but there's not much reason to think this would introduce directional selection with effect size.

Unfortunately, the second source - references from your attempts to survey the literature on the cost of psychotherapy - we would expect to be biased towards positive effects: the typical study here is a cost-effectiveness assessment, and such an assessment is only relevant if the intervention is effective in the first place (if there is no effect, the cost-effectiveness is zero by definition). Such studies would be expected to ~uniformly report significant positive effects, and thus including this source biases the sample used in the analysis. (And hey, maybe a meta-regression doesn't find 'from this source versus that one' to be a significant predictor, but if so I would attribute this more to the literature being so generally pathological than to cost-effectiveness studies being unbiased samples of effectiveness simpliciter.)

**Second**, following standard practice is a good way of demonstrating you have 'nothing up your sleeve': that you didn't keep re-analysing until you found results you liked, nor selectively report results to favour a pre-written bottom line. Although I appreciate this analysis was written before Simeon's critique, prior to this one may worry that HLI, given its organisational position on wellbeing etc., would really like to find an intervention that 'beats' orthodox recommendations, and this could act as a finger on the scale of its assessments. (cf. ACE's various shortcomings back in the day)

It is unfortunate that this analysis is not so much 'avoiding even the appearance of impropriety' as 'looking a bit sus'. My experience so far has been that further investigation into something or other in the analysis typically reveals a shortcoming (and these shortcomings tend to point in the 'favouring psychotherapy/SM' direction).

To give some examples:

- That something is up (i.e. huge heterogeneity, huge small study effects) with the data can be seen in the forest plot (and definitely in the funnel plot). It is odd to skip these figures and this basic assessment before launching into a much more elaborate multi-level metaregression.
- It is also odd to have an extensive discussion of publication bias (up to and including one's own attempt to make a rubric to correct for it) without doing the normal funnel plot +/- tests for small study effects.
- ~~Even if you didn't look for it, metareg in R will confront you with heterogeneity estimates for all your models in its output (cf.). One should naturally expect curiosity (or alarm) on finding >90% heterogeneity, which I suspect stays at or above 90% even with the most expansive meta-regressions. Not only are these not reported in the write-up, but in the R outputs provided (e.g. here) these parts of the results have been cropped out.~~ This was mistaken; mea maxima culpa.
- Mentioning prior sensitivity analyses which didn't make the cut for the write-up invites wondering what else got left in the file-drawer.

Thanks. I've taken the liberty of quickly meta-analysing (rather, quickly plugging your spreadsheet into metamar). I have further questions.

1. My forest plot (ignoring repeated measures - more later) shows studies with effect sizes >0 (i.e. disfavouring intervention) and <-2 (i.e. greatly favouring intervention). Yet fig 1 (and subsequent figures) suggests the effect sizes of the included studies are between 0 and -2. Appendix B also says the same: what am I missing?

2. My understanding is it is an error to straightforwardly include multiple results from the same study (i.e. F/U at t1, t2, etc.) into meta-analysis (see Cochrane handbook here): naively, one would expect doing so would overweight these studies versus those which report outcomes only once. How did the analysis account for this?
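Options discussed in the Cochrane Handbook include selecting a single time point, averaging across time points, or modelling the dependence explicitly; the laziest fix (which I used in my own replication) is one effect per study, e.g.:

```python
# Toy records (all hypothetical): (study, follow-up in years, effect size g)
records = [
    ("StudyA", 0.5, -0.60), ("StudyA", 1.0, -0.40), ("StudyA", 2.0, -0.30),
    ("StudyB", 0.5, -0.20),
    ("StudyC", 1.0, -0.90), ("StudyC", 2.0, -0.70),
]

def first_followup(rows):
    """Keep only the earliest follow-up per study."""
    best = {}
    for study, t, g in rows:
        if study not in best or t < best[study][0]:
            best[study] = (t, g)
    return {s: g for s, (t, g) in best.items()}

print(first_followup(records))
# Each study now contributes one effect, so StudyA no longer gets
# triple the weight of StudyB in a naive pooled estimate.
```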

3. Are the meta-regression results fixed or random effects? I'm pretty sure metareg in R does random effects by default, but it is intuitively surprising you would get the impact halved if one medium-sized study is excluded (Baranov et al. 2020). Perhaps what is going on is the overall calculated impact is much more sensitive to the regression coefficient for time decay than the pooled effect size, so the lone study with longer follow-up exerts a lot of weight dragging this upwards.

4. On the external validity point, it is notable that Baranov et al. was a study of pre-natal psychotherapy in Pakistan: it looks dubious that the results of this study would really double our estimates of effect persistence - particularly for, as I understand it, more general provision in sub-Saharan Africa. There seem facially credible reasons why the effects in this population could be persistent in a non-generalising way: e.g. that better maternal mental health post-partum means better economic decision making at a pivotal time (which then improves material circumstances thereafter).

In general, inclusion seems overly permissive: by analogy, it is akin to doing a meta-analysis of the efficacy of aspirin on all-cause mortality where you pool all of its indications, and are indifferent to whether it is mono-, primary, or adjunct Tx. I grant efficacy findings in one subgroup are *informative* re. efficacy in another, but not so informative that results can be weighed equally versus studies performed in the subgroup of interest (ditto including studies which only partly or tangentially involve any form of psychotherapy - inclusion looks dubious given the degree to which outcomes can be attributed to the intervention of interest is uncertain). Typical meta-analyses have much more stringent criteria (cf. PICO), and for good reason.

5. You elect for exponential decay over linear decay in part because the former model has a higher R2 than the latter. What were the R2s? By visual inspection I guess both figures are pretty low. Similarly, it would be useful to report these or similar statistics for all of the metaregressions reported: if the residual heterogeneity remains very high, this counsels caution about the analysis: effects vary a lot, and we do not have good explanations why.
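For concreteness, the comparison could be computed like this (data invented to decay roughly exponentially; note also that an R2 for log-effects is not on the same scale as an R2 for raw effects, which is one more reason to report the actual figures):

```python
import numpy as np

# Hypothetical follow-up data: effect size vs years since treatment
t = np.array([0.1, 0.25, 0.5, 1.0, 1.5, 2.0, 3.0, 5.0])
g = np.array([0.55, 0.50, 0.45, 0.35, 0.30, 0.26, 0.18, 0.09])

def r2(y, yhat):
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# Linear decay: g = a + b*t
b_lin = np.polyfit(t, g, 1)
r2_lin = r2(g, np.polyval(b_lin, t))

# Exponential decay: log(g) = log(a) + b*t  (requires g > 0)
b_exp = np.polyfit(t, np.log(g), 1)
r2_exp = r2(np.log(g), np.polyval(b_exp, t))

print(round(r2_lin, 3), round(r2_exp, 3))
```

Since the two R2s are computed on different outcome scales, a higher value for the exponential fit is suggestive rather than decisive; reporting both (plus residual heterogeneity) lets readers judge.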

6. A general challenge here is that metaregression tends to be insensitive, and may struggle to ably disentangle between-study heterogeneity - especially when, as here, there's a pile of plausible confounds owed to the permissive inclusion criteria (e.g. besides clinical subpopulation, what about *location*?). This is particularly pressing if the overall results are sensitive to strong assumptions made about the presumptive drivers of said heterogeneity, given the high potential for unaccounted-for confounders distorting the true effects.

7. The write-up notes one potential confounder to apparent time decay: better studies have more extensive followup, but perhaps better studies also report lesser effects. It is unfortunate small study effects were not assessed, as these appear substantial:

Note both the marked asymmetry (Egger's P < 0.001), as well as a large number of intervention-favouring studies finding themselves in the P 0.01 to 0.05 band. Quantitative correction would be far from straightforward, but plausibly an integer divisor. It may also be worth controlling for this effect in the other metaregressions.
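For reference, Egger's test amounts to regressing the standardized effect on precision and checking the intercept; on invented data with a built-in small-study bias it looks like this (and one can count the 'just significant' band the same way):

```python
import math
import numpy as np

def egger(g, se):
    """Egger's regression test: regress z = g/se on precision = 1/se.
    An intercept far from zero signals funnel-plot asymmetry."""
    z = g / se
    X = np.column_stack([np.ones_like(se), 1 / se])
    b, *_ = np.linalg.lstsq(X, z, rcond=None)
    return b  # [intercept, slope]

# Hypothetical biased literature: small studies report inflated effects
rng = np.random.default_rng(2)
se = rng.uniform(0.05, 0.4, 50)
g = 0.2 + 1.5 * se + rng.normal(0, se)
b0, b1 = egger(g, se)
print("Egger intercept:", round(b0, 2))  # should sit well above zero here

# Count studies in the 'just significant' 0.01 < p < 0.05 band
pvals = [math.erfc(abs(zi) / math.sqrt(2)) for zi in g / se]
band = sum(0.01 < p < 0.05 for p in pvals)
print("studies with 0.01 < p < 0.05:", band)
```

A pile-up in that band, on top of a non-zero intercept, is the classic signature of selective reporting rather than mere sampling noise.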

8. Given the analysis is atypical (re. inclusion, selection/search, analysis, etc.), 'analysing as you go' probably is not the best way of managing researcher degrees of freedom. Although it is perhaps a little too late to make a prior analysis plan, a multiverse analysis could be informative.

I regret my hunch is this would find the presented analysis is pretty out on the tail of 'psychotherapy favouring results': most other reasonable ways of slicing it lead to weaker or more uncertain conclusions.

I found (I think) the spreadsheet for the included studies here. I did a lazy replication (i.e. excluding duplicate follow-ups from studies, only including the 30 studies where 'raw' means and SDs were extracted, then plugging this into metamar). I copy and paste the (random effects) forest plot and funnel plot below - doubtless you would be able to perform a much more rigorous replication.

Re. the meta-analysis, are you using the regressions to get the pooled estimate? If so, how are the weights of the studies being pooled determined?

Per the LW discussion, I suspect you'd fare better spending effort actually presenting the object level case rather than meta-level bulverism to explain why these ideas (whatever they are?) are getting a chilly reception.

Error theories along the lines of "Presuming I am right, why do people disagree with me?" are easy to come by. Suppose Landry's/your work is indeed a great advance in AI safety: then perhaps it is being neglected thanks to collective epistemic vices in the AI safety community. Suppose instead this work is bunk: then perhaps epistemic vice on your part explains your confidence (complete with persecution narrative) in the work despite its lack of merit.

We could litigate which is more likely - or, better, find what the ideal 'bar' insiders should have for when to look into outsider/heterodox/whatever work (too high, and the existing consensus gets too entrenched and you miss too many diamonds in the rough; too low, and expert time is squandered on dross), and see whether what has been presented so far gets far enough along the ?crackpot/?genius spectrum to warrant the consultation and interpretive labour you assert you are rightly due.

This would be an improvement on the several posts so far just offering 'here are some biases which we propose explain why our work is not recognised'. Yet it would still largely miss the point: the 'bar' of how receptive an expert community will be is largely a given, and seldom that amenable to protests from those currently screened out that it should be lowered. If the objective is to persuade this community to pay attention to your work, then whether in some platonic sense their bar is 'too high' is neither here nor there: you still have to meet it, else they will keep ignoring you.

Taking your course of action instead has the opposite of the desired effect. The base rates here are not favourable, but extensive 'rowing with the ref' whilst basically keeping the substantive details behind the curtain with a promissory note of "This is great, but you wouldn't understand its value unless you were willing to make arduous commitments to carefully study why we're right" is a further adverse indicator.

I suspect the 'edge cases' illustrate a large part of the general problem: there are a lot of grey areas here, where finding the right course requires a context-specific application of good judgement. E.g. what 'counts' as being (too?) high status, or seeking to start a 'not serious' (enough?) relationship etc. etc. is often unclear in non-extreme cases - even to the individuals directly involved themselves. I think I agree with most of the factors noted by the OP as being pro tanto cautions, but aliasing them into a bright line classifier for what is or isn't contraindicated looks generally unsatisfactory.

This residual ambiguity makes life harder, as if you can't provide a substitute for good judgement, guidance and recommendations (rather than rulings) may not give great prospects for those with poorer or compromised judgement to bootstrap their way to better decisions. The various fudge factors give ample opportunity for motivated reasoning ("I know generally this would be inappropriate, but I license myself to do it in this particular circumstance"), and sexual attraction is not an archetypal precipitant for wisdom and restraint. Third parties weighing in on perceived impropriety may be less self-serving, but potentially more error-prone, and definitely a lot more acrimonious - I doubt many welcome public or public-ish inquiries or criticism upon the intimate details of their personal lives ("Oh yeah? Maybe before you have a go at me you should explain {what you did/what one of your close friends did/rumours about what someone at your org did/etc.}, which was far worse and your silence then makes you a hypocrite for calling me out now."/ "I don't recall us signing up to 'the EA community', but we definitely didn't sign up for collective running commentary and ceaseless gossip about our sex lives. Kindly consider us 'EA-adjacent' or whatever, and mind your own business"/etc.)

FWIW I have - for quite a while, and in a few different respects - noted that intermingling personal and professional lives is often fraught, and encouraged caution and circumspection for things which narrow the distance between them still further. EA-land can be a chimera of a journal club, a salutatorian model UN, a church youth group, and a swingers party - these aspects are not the most harmonious in concert. There is ample evidence - even more ample recently - that 'encouraging caution' or similar doesn't cut it. I don't think the OP has the right answer, but I do not have great ideas myself: it is much easier to criticise than do better.