How we failed

Jan_Kulveit; technicalities

Here “we” means the broader EA and rationalist communities.

Learning from failures is just as important as learning from successes. This post describes a few cases of mistakes and failures which seem interesting from the learning perspective.

Failure to improve US policy in the first wave

Early in the pandemic, the rationality community was right about important things (e.g. masks), often very early on. We won the epistemic fight, and the personal instrumental fight. (Consider our masks discourse, or Microcovid use, or rapid testing to see our friends.)

At the same time, the network distance from the core of the community to e.g. people in the CDC is not that large: OpenPhil is a major funder for the Johns Hopkins Center for Health Security, and the network distance from CHS to top public health institutions is presumably small. As such, we could say the US rationality community failed instrumentally, given their apparent short distance to influence.

Contrast this with the influence of LessWrong on UK policy via a single reader, Dominic Cummings. Or our influence on Czech policy (see What we tried).

Also, contrast this with the #masks4all movement. This movement was founded by the writer and speaker Petr Ludwig (pre-COVID, he was developing something like his own version of rationality and critical thinking, independent of the LessWrong cluster). After the success of grass-roots DIY mask-making in Czechia, which led to the whole country starting to use masks within a week or so, he tried to export the “masks work” and “you can make them at home” memes globally. While counterfactuals are hard, this seems to have been a major success, likely speeding up mask uptake in Western countries by weeks (consider the Marginal Revolution praise).

Where was the “microcovid calculator for countries”? (see EpiFor funding and medium-range models.)

Overreaction through 2021

Personal reactions within the community in February 2020 were sometimes exemplary; apparent overshoots (like copper tape on surfaces, postal package quarantines) were reasonable ex ante.

But the community was slow to update its behaviour in response to improved estimates of the infection-fatality rate (IFR) and long COVID, and to new knowledge of aerosol and droplet transmission. Anecdotally, the community was slow to return to relatively safe things like outdoor activities, even after full vaccination.

While your risk tolerance is a given quantity in decision-making, my impression is that many people’s behaviour did not match the risk tolerance implicit in their other daily habits.

Inverse gullibility

Many large institutions had a bad track record during the pandemic, and the heuristic “do not update much from their announcements” served many people well. However, in places the community went beyond this, to non-updates from credible sources and anti-updates from institutions.

Gullibility is believing things excessively: taking the claim “studies show” as certain. Inverse gullibility is disbelieving things excessively: failing to update at all when an update is warranted, or even updating in the opposite direction.

Example: the COVID research summaries of the FDA are often good; it’s in the policy guidance section that the wheels come off. But I often see people write off all FDA material.

My broad guess is that people who run off simple models like "governments are evil and lie to you" are very unlikely to be able to model governments well, and very unlikely to understand parts of the solution space where governments can do important and useful things.

Failing to persuade reviewers that important methods were valid

More locally to our research: we used skilled forecasters for many of our estimates, as an input for our hybrid of mathematical and judgmental prediction. But relying on forecasters is often frowned upon in academic settings. Stylized dialogue:

Reviewer: “How did you get the parameters?”
EpiFor: “We asked a bunch of people who are good at guessing numbers to tell us!”
Reviewer: "That’s unrigorous, remove it."

Similarly: the foremost point of doing NPI (non-pharmaceutical intervention; technical term for interventions including various 'mandates' and orders) research was to provide governments with cost-benefit estimates. In this Forum I don’t need to explain why. But academic studies of COVID policies usually lack estimates of the social cost, and instead merely estimate the transmission reduction. Why?

A clue: Our initial NPI preprint included an online survey about relative lockdown preferences, e.g. the self-reported disutility of bars closing for 2 weeks vs. a mask mandate being imposed for 2 months. This allowed us to actually give policy-relevant estimates of the cost (disutility) of NPIs, compared to the benefit (reduction of virus transmission).
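To illustrate the kind of comparison this enables, here is a minimal sketch that ranks interventions by transmission reduction per unit of self-reported disutility. Everything in it is an assumption made for illustration: the class and function names, the third intervention, and all of the numbers are hypothetical and are not taken from our preprint or survey.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NPI:
    name: str
    transmission_reduction: float  # hypothetical fractional reduction in transmission
    disutility: float              # hypothetical survey-derived cost, arbitrary units


def benefit_per_cost(npi: NPI) -> float:
    """Transmission reduction bought per unit of disutility."""
    if npi.disutility <= 0:
        raise ValueError("disutility must be positive")
    return npi.transmission_reduction / npi.disutility


def rank_by_benefit_per_cost(npis: list[NPI]) -> list[NPI]:
    """Most cost-effective intervention first."""
    return sorted(npis, key=benefit_per_cost, reverse=True)


# All values below are made up purely to show the mechanics.
npis = [
    NPI("mask mandate (2 months)", 0.10, 2.0),
    NPI("bars closed (2 weeks)", 0.15, 5.0),
    NPI("gatherings limited (1 month)", 0.20, 8.0),
]

for npi in rank_by_benefit_per_cost(npis):
    print(f"{npi.name}: {benefit_per_cost(npi):.3f} reduction per unit of disutility")
```

A simple ratio like this ignores interactions between interventions and differences in duration; it is only meant to show why having both columns, benefit and cost, matters for policy.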

It proved hard to get this version published; the apparent subjectivity of the costs, the inclusion of economic methods in an epidemiology paper, and the specific choice of preference elicitation methods, etc, all exposed a large "attack surface" for reviewers. In the end, we just removed the cost-benefit analysis.

Clearly, internal documents of at least some governments will have estimated these costs. But in almost all cases these were not made public. Even then: as far as we know, only economic costs were counted in these private analyses; it is still rare to see estimates of the large direct disutility of lockdown.

EpiFor funding and medium-range models

(Conflict of interest: obvious.)

Epidemic Forecasting was initially funded by Tim Telleen-Lawton, which made a lot of the work we did possible.

Epidemic Forecasting could have been significantly more impactful if someone had given us between $100k and $500k in June 2020, on the basis that we would do interesting things, even though it was hard to explain and justify from an ITN perspective in advance.

Around June 2020, we had all the ingredients which would allow us to upload reasonable models forecasting the arrival of the 2nd wave: the best NPI estimates, the best seasonality estimates, forecasters to predict the triggers that lead to the adoption of NPIs, a framework to put it together, and a team of modellers and developers able to turn it into a model. This would likely have been the best medium-range forecast dashboard for Western countries, and would have given warning of future waves ~3 months in advance. We even secured funding from a large non-EA funder. (!)

Both our host university and this funder displayed a great deal of inflexibility. My institution, Oxford, was unable to accept the funding in a way which would allow us to contract superforecasters in time. Further, when we tried to accept the money via a charity instead, we found that the funder was itself unable to fund non-vetted charities - and that their vetting process would be too long and costly for this “small” a donation ($100k).

No EA funders were willing to step in (despite the project having arguably better impact than any of the COVID projects funded by OpenPhil, and this impact being well-evidenced in the policy effects of the first-wave dashboard). My personal update from this is roughly "while the EA funding infrastructure is much less inadequate than the rest of the world, it is still very far from a situation where I can model it as sensible and reliable".

IHME forecasts

Taking data-driven charity as the ingroup (and so part of how “we” failed), I note that something was up at the Gates Foundation. The GF has committed at least $400 million to the Institute for Health Metrics and Evaluation, an organisation with a notorious record on COVID.

So what? Well, a rigorous donor with strong vetting empowered a team with apparently deep epistemic problems, and this likely seriously misled policy. (The IHME forecasts were among the most influential in the first wave, second only to Imperial College London.) And this failure occurred despite GF’s empirical, institutional, non-hits-based approach. A systematic review of such failures could produce a useful prior for this “safe” approach.

(I should say that I admire the IHME’s Global Burden of Disease project, and do not mean to impugn the whole organisation. The above is troubling precisely because it drastically underperforms high expectations from IHME, based on their existing track record.)

Ask me anything

A related 'ask me anything' session is happening on Thu 24th under the What we tried post.

While this text is written from my (Jan Kulveit's) personal perspective, I co-wrote the text with Gavin Leech, with input from many others.

Comments (16)

poppinfresh · Mar 23 2022 · 28

This was a bummer to read:

It proved hard to get this version published; the apparent subjectivity of the costs, the inclusion of economic methods in an epidemiology paper, and the specific choice of preference elicitation methods, etc, all exposed a large "attack surface" for reviewers. In the end, we just removed the cost-benefit analysis.

Clearly, internal documents of at least some governments will have estimated these costs. But in almost all cases these were not made public. Even then: as far as we know, only economic costs were counted in these private analyses; it is still rare to see estimates of the large direct disutility of lockdown.

I don't know exactly which papers you're referring to, but it's plausible to me that the cost-benefit analysis would be similarly valuable to the rest of the content in the paper. So it really sucks to just lose it.

Did you end up publishing those calculations elsewhere (e.g. as a blog post complement to the paper, or in a non-peer-reviewed version of the article)? Do you have any thoughts on whether, when, and how we should try to help people escape the peer review game and just publish useful things outside of journals?

Jan_Kulveit · Mar 24 2022 · 2

The practical tradeoff was between what, where and when to publish. The first version of the preprint, which is on medRxiv, contains those estimates. Some version with them could probably be published in a much worse journal than Science, and would have much less impact.

We could have published them separately, but a paper is a lot of work, and it's not clear to me whether, for example, sacrificing some of the "What we tried" work to get this done would have been a good call.

It is possible to escape from the game in specific cases. In the case of covid, for example, the advisory body we created in the Czech Republic was able to take into account analyses based on "internal quality", especially when it was clear that the peer review game would take months. If such bodies existed in more countries, it would be possible there too.
Similarly, it could be done with the help of an ECDC or WHO type institution.

In general, it's an "inadequate equilibrium" type of problem. I have some thoughts on typical solutions to such problems, but not in easily shareable written form at the moment.

Guy Raveh · Mar 23 2022 · 21

Note: Sorry if this came out a little bit harsh. I'm interested in your series of posts and I want to understand the situation better.

Small question first: What's NPI? And IFR?

Bigger question: Some of your methods, as you mentioned, did not pass standards for rigor, and you claim the standards should bend around them. But how are you sure they were accurate? What makes you think the people who are "good at guessing numbers" made your model better rather than worse? Or that the surveys used to estimate social costs were really good enough to give relatively unbiased results?

In my own country, since the beginning of the pandemic and now still, I feel exactly as you said - the government doesn't even try to estimate the costs of interventions, instead relying on a shouting match between the health and the treasury ministers to decide. So I'm very much in favour of actually getting these estimates - but to be helpful, they need to be good, and I would a priori expect good estimates for this to only be available to the government.

Also regarding IHME, admittedly I know nothing about any of this, but you say it is an organisation with an impressive track record who got COVID wrong, while your organisation is new, without any track record, but got COVID right. From a risk-averse perspective, I think the decision to fund the former - which can plausibly use its already proven abilities to improve its COVID team given funding - rather than the latter, may very well be the right decision.

Jan_Kulveit · Mar 23 2022 · 17

NPI & IFR: thanks, it's now explained in the text.

Re: Rigour

I think much of the problem is due not to our methods being "unrigorous" in any objective sense, but to interdisciplinarity. For example, in the survey case, we used mostly standard methods from a field called "discrete choice modelling" (btw, some EAs should learn it - it's a pretty significant body of knowledge on "how to determine people's utility functions").

Unfortunately, it's not something commonly found in the field of, for example, "mathematical modeling of infectious diseases". It makes it more difficult for journals to review such a paper, because ideally they would need several different reviewers for different parts of the paper. This is unlikely to happen in practice, so usually the reviewers tend to either evaluate everything according to the conventions of their field, or to be critical and dismissive of things they don't understand.
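For readers unfamiliar with discrete choice modelling, here is a minimal sketch of its core building block, the logit choice model: each option has a utility, choice probabilities are a softmax over utilities, and in the two-option case the utility gap can be read off the observed choice share as a log-odds. This is a textbook illustration, not the specification used in our survey; the function names, option labels, and the 70% share are hypothetical.

```python
import math


def choice_probabilities(utilities: dict[str, float]) -> dict[str, float]:
    """Multinomial logit: P(i) = exp(u_i) / sum_j exp(u_j)."""
    top = max(utilities.values())  # subtract the max for numerical stability
    weights = {name: math.exp(u - top) for name, u in utilities.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def utility_gap_from_share(share_choosing_a: float) -> float:
    """Binary logit inverted: u_a - u_b = ln(P(a) / P(b))."""
    if not 0.0 < share_choosing_a < 1.0:
        raise ValueError("share must be strictly between 0 and 1")
    return math.log(share_choosing_a / (1.0 - share_choosing_a))


# Hypothetical: 70% of respondents would rather accept a 2-month mask mandate
# than 2 weeks of bar closures.
gap = utility_gap_from_share(0.70)
probs = choice_probabilities({"mask_mandate": gap, "bars_closed": 0.0})
print(f"implied utility gap: {gap:.3f}")
print(f"recovered choice share: {probs['mask_mandate']:.2f}")
```

Real applications estimate utilities as functions of option attributes across many respondents and choice sets, but the link between observed choices and utilities is the same.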

A similar thing is going on with the use of "forecasting"-based methods. There is published scientific literature on their use, and their track record is good, but before the pandemic there was almost no "published literature" on their use in combination with epidemic modelling (there is now!).

The second part of the problem is that we were ultimately more interested in "what is actually true" than in what "looks rigorous". A paper that contains a few pages of equations, lots of complex modeling, and many simulations can look "rigorous" (in the sense of the stylized dialogue). If at the same time, for example, it contains completely and obviously wrong assumptions about the IFR of covid, it will still pass many tests of "rigorousness", because it only shows that "under assumptions that do not hold in our world we reach conclusions that are irrelevant to our world" (the implication is true). At the same time, it can have disastrous consequences if used by policymakers who assume something like "research tracks reality".

Ex post, we can demonstrate that some of our methods (relying on forecasters) were much closer to reality (e.g. based on serological studies) than a lot of published stuff.

Ex ante, it was clear to many people who understand both academic research and forecasting that this would be the case.

Re: Funding

For the record, EpiFor is a project that has ended, and is not seeking any funding. Also, as noted in the post, we were actually able to get some funding offered: just not in a form which the university was able to accept, etc.

It's not like there is one funder evaluating whether to fund IHME or EpidemicForecasting. In my view the problems pointed to here are almost completely unrelated, and I don't want them to get conflated in some way.

technicalities · Mar 23 2022 · 12

This is reasonable except that it misunderstands peer review. Peer review does not check correctness*; it often doesn't even really check rigour. Instead, it sometimes catches the worst half of outright errors and it enforces the discipline's conventions. Many of those conventions are good and correlate with rigour. But when it comes to new methods it's usually an uphill battle to get them accepted, rigour be damned.

We note above that our cost analysis (designed, implemented and calculated inside of three weeks) had weaknesses, and this is on us. But the strength of opposition to the whole approach (and not to our implementation) persuaded us that it was futile to iterate.

On forecasts: We used forecasters with an extremely good public track record. I have far more confidence in their general ability to give great estimates than I do in the mathematised parts of covid science. (I can send you their stats and then you will too.) You can tell that the reviewers' antipathy was not about rigour because they were perfectly happy for us to set our priors based on weak a priori arguments and naive averages of past studies (the normal unrigorous way that priors are chosen).

In my limited experience, even major world governments rely heavily on academic estimates of all things, including economic costs. And as we noted we see no sign that they took into account the noneconomic cost, and nor did almost all academics. The pandemic really should make you downgrade your belief in secretly competent institutions.

The IHME section is there to note their object level failure, not to criticize the Gates Foundation for not funding us instead. (I don't know to what extent their covid debacle should affect our estimate of the whole org - but not by nothing, since management allowed them to run amok for years.)

* Except in mathematics

Stefan_Schubert · Mar 23 2022 · 8

Non-pharmaceutical interventions, infection fatality rate.

Sam Abbott · Aug 24 2022 · 7

Some critical thoughts on this in this thread: https://twitter.com/nikosbosse/status/1562424476792672259?s=21&t=7SArpibug5ZqkEDhWi3AjA

It seems like there would be value in another pass at this kind of post with a more critical framing and perhaps a wider, more inclusive scope. It would be interesting to hear more on what EA as a community contributed, and some critical reflections on those contributions, especially contrasted with non-EA-aligned work.

Something I’d be very interested in hearing more about is the choice to target Science and other legacy “high impact” journals. In your other article on this you mention review delays and not being able to publish what you wished. (Note: I think I was part of one of the review teams for one of these pieces of work, perhaps a later one. I don’t remember anything about CE, but it had an entire second paper in the supplement, so it was a mission to review. So perhaps I am biased to assume I did a great job.)

I would have thought it would have been more effective to target a progressive open science journal, given you relied so heavily on non-traditional media channels to spread the findings. Obviously, the cynical reason many people go to “high impact” journals is prestige, but there is also an argument that they help boost the legitimacy of findings. Given how much preprints were used during the pandemic, it’s not clear to me that would be useful. Have you had a chance to do some analysis of altmetrics and the useful window of your findings, to see what proportion of the impact came before and after publishing vs preprinting?

SiebeRozendal · Mar 23 2022 · 7

What makes you believe people are overestimating the risk of long covid? Or does this only apply to 2021?

I believe EAs are currently underestimating it, and the cost of getting covid. I try to correct some misconceptions here: https://www.facebook.com/1220718092/posts/10221201147517854/?app=fbl

(I'm not saying all EAs have these misconceptions; it's aimed at a wide audience)

technicalities · Mar 24 2022 · 9

Thanks for this, it looks thorough.

Speaking for myself, there was a long scary moment (: the year 2020) where I based my long covid estimates off SARS-1, which was way worse (30-40% multimorbid disability rate, usually incapacitating, often lasting many years). So using that high a bar was my overestimate.

I've tried to keep up with the long covid literature, but I find that every paper uses a totally different estimand. Martin et al used ARDS as the reference disutility for long covid (-32% QALY), kinda arbitrarily. When you say that 1-15% of cases get long covid, what % utility loss are you imagining?

SiebeRozendal · Mar 24 2022 · 20

Yes, long COVID is currently badly defined. This is because it's a heterogeneous multisystem disease; different patients have different pathologies, and it's a continuum. In addition, it's hard to include/exclude long COVID, because not every case is noticed, and antibodies are not a reliable indicator.

Fwiw, I think the data of SARS-1 is consistent with SARS-CoV-2: we generally see 20-30% with persistent symptoms and/or organ dysfunction in smaller studies, and lower numbers in controlled cohort studies.

In that 1-15%, this includes different severities. I'd say a big portion is simply more fatigued than usual, so that's like 0.1 or 0.2 DALY per year?

However, I think 1-3% develops the ME/CFS subtype, which has, according to one study, "When the YLL of 0.226M is combined with the YLD of 0.488M, we get a DALY of 0.714M."

(https://oatext.com/Estimating-the-disease-burden-of-MECFS-in-the-United-States-and-its-relation-to-research-funding.php)

I think the quality of life loss is accurate. I have severe long COVID and would gladly trade it for losing both my legs, HIV (not full blown AIDS maybe), and probably severe burns (don't know the details of that though).

I haven't evaluated the rigor of the years of life lost, but it does fit a multisystem disease.

Also just to note, I think this all looks even worse if you take into account that subjective wellbeing is actually unbounded, not a 0 to 1 scale, as well as the potential altruistic loss due to loss of productivity.

JackM · Mar 24 2022 · 4

Consider our masks discourse

Sorry this may not be the most helpful comment but this link is hardly evidence of us being right on anything or winning any fights...it's a simple question with hardly any engagement.

I'm not saying we weren't right, but I don't think you've put this case forward very convincingly.

Matthew_Barnett · Mar 25 2022 · 10

Here's a quote from Wei Dai, speaking on February 26th 2020:

Here's another example, which has actually happened 3 times to me already:
The truly ignorant don't wear masks.
Many people wear masks or encourage others to wear masks in part to signal their knowledge and conscientiousness.
"Experts" counter-signal with "masks don't do much", "we should be evidence-based" and "WHO says 'If you are healthy, you only need to wear a mask if you are taking care of a person with suspected 2019-nCoV infection.'"
I respond by citing actual evidence in the form of a meta-analysis: medical procedure masks combined with hand hygiene achieved RR of .73 while hand hygiene alone had a (not statistically significant) RR of .86.

After over a month of dragging their feet, and a whole bunch of experts saying misleading things, the CDC finally recommended people wear masks on April 3rd 2020.

technicalities · Mar 24 2022 · 4

Yeah that link just serves as a timestamp of how early we were thinking about it (note first sentence pointing to private conversation). Could justify this if anyone doubts that uptake was earlier and higher than average, but it would involve a lot of digging in chats and facebook statuses.

JackM · Mar 24 2022 · 3

It’s fine, I don’t need justification; I just found that link an odd one. I don’t really think it shows anything given the small number of upvotes and only one response.

If anything it undermines your point if that’s the best thing you could find.

Jan_Kulveit · Mar 24 2022 · 3

Mass-reach posts came later, but sooner than the US mainstream updates:

https://www.lesswrong.com/posts/h4vWsBBjASgiQ2pn6/credibility-of-the-cdc-on-sars-cov-2
https://slatestarcodex.com/2020/03/23/face-masks-much-more-than-you-wanted-to-know/

Florin · Sep 22 2022 · 1

Better examples discussing masks are the posts about elastomeric respirators here and here. Unfortunately, almost no policy maker seems to have listened.
