Computational Approaches to Pathogen Detection

Jeff Kaufman 🔸

While this post is my perspective and not an official post of my employer's, it also draws on a lot of collaborative work with others at the Nucleic Acid Observatory (NAO).

One of the future scenarios I'm most worried about is someone creating a "stealth" pandemic. Imagine a future HIV that first infects a large number of people with minimal side effects and only shows its nasty side after it has spread very widely. This is not something we're prepared for today: current detection approaches (symptom reporting, a doctor noticing an unusual pattern) require visible effects.

Over the last year, with my colleagues at the NAO, I've been exploring one promising method of identifying this sort of pandemic. The overall idea is:

1. Collect some sort of biological material from a lot of people on an ongoing basis, for example by sampling sewage.
2. Use metagenomic sequencing to learn what nucleic acids are in these samples.
3. Run novel-pathogen detection algorithms on the sequencing data.
4. When you find something sufficiently concerning, follow up with tests for the specific thing you've found.

While there are important open questions in all four of these, I've been most focused on the third: once you have metagenomic sequencing data, what do you do?

I see four main approaches. You can look for sequences that are:

Dangerous:

There are some genetic sequences that code for dangerous things that we should not normally see in our samples. If you see a series of base pairs that is unique to smallpox, that's very concerning! The main downside of this approach, if you want to extend it beyond smallpox etc., is that you need to make a list of non-obvious dangerous things, which is in itself a dangerous thing to do: what if your list is stolen and it points people to sequences they wouldn't have thought to try using?

This is similar to another problem: how do you check if people are synthesizing dangerous sequences without risking a list of all the things that shouldn't be synthesized? SecureDNA has been working on this problem, with an encrypted database with a distributed key system that allows flagging sequences without it being practical to get a list of all flagged sequences (paper).
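To make the matching step concrete, here is a minimal sketch of screening reads against a watchlist that is stored as hashes rather than raw sequences. It is not SecureDNA's protocol: plain hashing alone does not give the protection their distributed-key design aims for, and the window length, the function names, and the choice of SHA-256 are all assumptions made for illustration. Reverse complements and near matches are ignored.

```python
import hashlib

WINDOW = 30  # assumed window length in bases; not taken from the post or from SecureDNA


def windows(seq, k=WINDOW):
    """Yield every length-k substring of seq, in order."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def digest(window):
    """Hash a window so the stored watchlist holds digests, not raw sequences."""
    return hashlib.sha256(window.encode("ascii")).hexdigest()


def build_watchlist(flagged_sequences, k=WINDOW):
    """Return the set of digests for every window of every flagged sequence."""
    return {digest(w) for s in flagged_sequences for w in windows(s, k)}


def screen_read(read, watchlist, k=WINDOW):
    """Return start offsets of windows in the read whose digest is on the watchlist."""
    return [i for i, w in enumerate(windows(read, k)) if digest(w) in watchlist]
```

A read would be surfaced for follow-up when `screen_read` returns any offsets. Exact-match hashing like this misses sequences with even a single substitution, which is one reason a real system needs more than this sketch.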

There are some blockers to using SecureDNA for this today, since it was designed for slightly different constraints, but I think they are all surmountable and I'm hoping to implement a SecureDNA-based metagenomic sequencing screening system at some point in the next year.

An alternative and somewhat longer-term approach here would be to use tools that are able to estimate the function of novel sequences to extend this to sequences that aren't closely derived from existing ones. I'm less enthusiastic about this: not only could this work end up increasing risk by improving humanity's ability to judge how dangerous a novel sequence is, it's not clear to me that this approach is likely to catch things the other methods wouldn't.

Modified:

In engineering a new virus for a stealth pandemic, the easiest way would likely be to begin with an existing virus. If we see a sequencing read where part matches a known viral genome and part does not (a "chimera"), one potential explanation is that the read comes from a genetically engineered virus.
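As a rough illustration of what "part matches and part does not" could mean computationally, here is a minimal sketch that marks which bases of a read are covered by k-mers from a known genome, then flags reads with both a long covered stretch and a long uncovered stretch. The k-mer size, the length threshold, and all function names are assumptions for illustration; a real pipeline would more likely use an aligner and a large reference database.

```python
def kmer_set(genome, k):
    """All length-k substrings of a known genome."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def covered_mask(read, ref_kmers, k):
    """mask[i] is True when base i lies inside at least one k-mer found in the reference."""
    mask = [False] * len(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in ref_kmers:
            for j in range(i, i + k):
                mask[j] = True
    return mask


def longest_run(mask, value):
    """Length of the longest consecutive run of `value` in mask."""
    best = run = 0
    for m in mask:
        run = run + 1 if m == value else 0
        best = max(best, run)
    return best


def is_chimera_candidate(read, ref_kmers, k=8, min_len=12):
    """True when the read has a long known stretch and a long unknown stretch."""
    mask = covered_mask(read, ref_kmers, k)
    return (longest_run(mask, True) >= min_len
            and longest_run(mask, False) >= min_len)
```

Requiring both stretches to be long is a design choice to avoid flagging reads that differ from the reference by only a few bases, at the cost of missing short insertions.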

But this is not the only reason this approach could flag a read. For example, it could come from:

Lack of knowledge: perhaps a virus has a lot of variation, much more than is reflected in the databases you are using to define "normal". It will look like you have found a novel virus when it's just an incomplete database. And, of course, the database will always be incomplete: viruses are always evolving. Still, solving this seems practical: handling these initial false positives requires expanding our knowledge of the variety of existing viruses, but that is something many virologists are deeply interested in.
Sequencing artifacts: perhaps some of the biological processing you do prior to (or during) sequencing can attach unrelated fragments. When you see a chimera, how do you know whether it existed in the sample you originally collected or was created accidentally in the lab? On the other hand, you can (a) compare the fraction of chimeras in different sequencing approaches and pick ones where this is rare and (b) pay more attention to cases where you've seen the same chimera multiple times.
Biological chimerism: bacteria will occasionally incorporate viral sequences. This method would flag this as genetic engineering even if it was a natural and unconcerning process. As long as this is rare enough, however, we can deal with this by surfacing such reads to a biologist who figures out how concerned to be and what next steps make sense.
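The idea of paying more attention to a chimera seen multiple times can be sketched as counting how many distinct reads share the same junction. This is a simplified illustration, not the NAO's pipeline: the junction signature (a few bases either side of the first known-to-unknown boundary), the parameters, and the function names are assumptions, and only reads that run from known sequence into unknown sequence are handled.

```python
from collections import Counter


def kmer_set(genome, k):
    """All length-k substrings of a known genome."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def junction_signature(read, ref_kmers, k=8, flank=6):
    """Return the bases around the first known-to-unknown boundary, or None."""
    last_match_end = None
    for i in range(len(read) - k + 1):
        if read[i:i + k] in ref_kmers:
            last_match_end = i + k  # one past the last base of this matching k-mer
        elif last_match_end is not None:
            break  # first non-matching k-mer after a matching run
    if last_match_end is None:
        return None  # nothing in the read matched the reference
    if last_match_end < flank or last_match_end + flank > len(read):
        return None  # not enough sequence on one side of the boundary
    return read[last_match_end - flank:last_match_end + flank]


def recurrent_junctions(reads, ref_kmers, k=8, flank=6, min_reads=2):
    """Junction signatures supported by at least min_reads distinct read sequences."""
    counts = Counter()
    for read in set(reads):  # collapse exact duplicate reads before counting
        sig = junction_signature(read, ref_kmers, k, flank)
        if sig is not None:
            counts[sig] += 1
    return {sig: n for sig, n in counts.items() if n >= min_reads}
```

Collapsing exact duplicates before counting is an assumption about what "multiple times" should mean: reads that start at different positions but cross the same junction are stronger evidence than identical copies of one read.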

This is the main approach I've been working on lately, trying to get the false positive rate down.

New:

If we understood what "normal" looked like well enough, then we could flag anything new for investigation. This is a serious research project: if you take data from a sewage sample and run it through basic tooling, it's common to have 50% of reads unclassified. Making progress here will require, among other things, much better tooling (and maybe algorithms) for metagenomic assembly: I'm not aware of anything that could efficiently integrate trillions of bases a week into an assembly graph.

Ryan Teo, a first-year graduate student with Nicole Wheeler at the University of Birmingham, has started his thesis in this area, which I'm really excited to see. Lenni Justen, another first-year graduate student, with Kevin Esvelt, is also exploring this area as part of his work with the NAO. I'd be excited to see more work, however, and if you're working on this, or interested in working on it but blocked by not having access to enough metagenomic sequencing data, please get in touch!

Growing:

It may turn out that our samples are deeply complex: potentially, as you sequence more, the rate of seeing new things falls off very slowly. If it falls off slowly enough, then you will keep seeing "new" things that are just so rare that you haven't happened to see them before. I am quite unsure how likely this is, and I expect it varies by sample type (sewage is likely much more complex than, say, blood), but it seems possible. An approach that's robust to this is to flag something based on its growth pattern instead of just for being new: first you've never seen it, then you see it once, then you start seeing it more often, then you start seeing it many times per sample. In theory a new pandemic should begin with approximately exponential spread, since with few people already infected the number of new infections should be proportional to the number of infectious people.

At the NAO we've been calling this "exponential growth detection" (EGD). We worked on this some in 2022, but have put it on hold until we have a deep enough timeseries dataset to work with.
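One simple way to test for this kind of pattern, offered as a sketch rather than as a description of the NAO's EGD work, is to fit a straight line to the log of a sequence's relative abundance over time: under exponential growth the slope is the growth rate. The pseudocount, the thresholds, and the function names below are assumptions chosen for illustration.

```python
import math


def growth_rate(days, counts, totals, pseudocount=0.5):
    """Least-squares slope of log relative abundance against time, per day."""
    ys = [math.log((c + pseudocount) / t) for c, t in zip(counts, totals)]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, ys))
    return sxy / sxx


def doubling_time(rate):
    """Days for abundance to double at the given per-day growth rate."""
    return math.log(2) / rate if rate > 0 else math.inf


def looks_exponential(days, counts, totals, min_rate=0.05, min_nonzero=3):
    """Flag a sequence seen in enough samples whose fitted growth rate is high."""
    nonzero = sum(1 for c in counts if c > 0)
    return nonzero >= min_nonzero and growth_rate(days, counts, totals) >= min_rate
```

Here `counts` is the number of reads matching the sequence in each sample and `totals` is the total reads in that sample, so differences in sequencing depth between samples do not look like growth. A least-squares fit on logs treats a count of 1 as seriously as a count of 1,000, so a more careful version would probably use a count-based model.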

These approaches can also be combined: if a sequence originally comes to your attention because it's chimeric but you're not sure how seriously to take it, you could look at the growth pattern of its components. Or, while you can detect growing things with a genome-free approach simply by looking for increasing k-mers, the kind of "thoroughly understand the metagenome" work that I described above as an approach for identifying new things can also be used to make a much more sensitive tool that detects growing things.
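The genome-free version can be sketched as counting k-mers in each sample, normalizing by the total, and keeping the k-mers whose frequency never drops and ends up much higher than it started. This is a toy illustration: the k-mer size, the fold-change threshold, the frequency floor, and the function names are assumptions, and holding every k-mer in memory would not scale to real sequencing volumes.

```python
from collections import Counter


def kmer_frequencies(reads, k):
    """Map each k-mer in a sample to its share of all k-mers in that sample."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def increasing_kmers(samples, k=8, min_fold=4.0, floor=1e-6):
    """K-mers whose frequency is non-decreasing over time and rises by min_fold.

    `samples` is a list of read lists, ordered from oldest to newest.
    """
    freqs = [kmer_frequencies(reads, k) for reads in samples]
    flagged = []
    for kmer in freqs[-1]:
        series = [f.get(kmer, 0.0) for f in freqs]
        monotone = all(a <= b for a, b in zip(series, series[1:]))
        fold = series[-1] / max(series[0], floor)  # floor avoids dividing by zero
        if monotone and fold >= min_fold:
            flagged.append(kmer)
    return sorted(flagged)
```

Working with frequencies rather than raw counts means a k-mer is only flagged when it grows relative to everything else in the sample.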

In terms of prioritization, I'm enthusiastic about work on all of these, and would like to see them progress in parallel. The approaches of detecting dangerous and modified sequences require less scientific progress and should work on amounts of data that are achievable with philanthropic funding. De novo protein design is getting more capable and more accessible, however, which allows creation of pathogens those two methods don't catch. We will need approaches that don't depend on matching known things, which is where detecting new and/or growing sequences comes in. Those two methods will require a lot more data, enough that unless sequencing goes through another round of the kind of massive cost improvement we saw in 2008-2011 we're talking about large-scale government-funded projects. Advances in detection methods make it more likely that we'll be able to make the case for these larger projects, and reduce the risk that the detection ability might lag infrastructure creation.

Vasco Grilo🔸, Nov 3 2023

Thanks for sharing, Jeff! Are you aware of any analyses quantifying the risk of "stealth" pandemics in terms of expected deaths or probability of a certain death toll in a given period?

Jeff Kaufman 🔸, Nov 3 2023

I'm not, sorry!

No worries! The paper Existential Risk and Cost-Effective Biosecurity is the quantification effort I am aware of. I like it, but was looking for more because it still involves some guesses which arguably warrant further investigation:

For the purposes of this model ["Model 2: Potentially Pandemic Pathogens"], we assume that for any global pandemic arising from this kind of research [gain of function research], each has only a 1 in 10,000 chance of causing an existential risk
[...]
["Model 3: Naive Power Law Extrapolation":] Extrapolating the power law out [seems pessimistic to me], we find that the probability that an attack ["using biological and chemical weapons"] kills more than 5 billion will be (5 billion)^–0.5 or 0.000014. Assuming 1 attack per year (extrapolated on the current rate of bio-attacks) and assuming that only 10% [seems pessimistic to me] of such attacks that kill more than 5 billion eventually lead to extinction (due to the breakdown of society, or other knock-on effects), we get an annual existential risk of 0.0000014 (or 1.4 × 10^–6).
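The arithmetic in the quoted Model 3 passage can be reproduced in a few lines. This only checks the numbers as quoted; the variable names are illustrative, and the inputs (an exponent of –0.5, 1 attack per year, 10% leading to extinction) are the paper's assumptions as stated in the quote, not independent estimates.

```python
# Inputs as stated in the quoted passage.
deaths_threshold = 5e9              # "more than 5 billion"
exponent = -0.5                     # power-law exponent used in the quote
attacks_per_year = 1                # "Assuming 1 attack per year"
p_extinction_given_exceeds = 0.10   # "only 10% of such attacks"

# Probability that a single attack exceeds the threshold.
p_exceeds = deaths_threshold ** exponent

# Annual existential risk implied by those inputs.
annual_risk = attacks_per_year * p_exceeds * p_extinction_given_exceeds

print(f"P(attack kills > 5 billion) = {p_exceeds:.2e}")
print(f"Annual existential risk = {annual_risk:.2e}")
```

Both results agree with the quoted figures (0.000014 and 1.4 × 10^–6) to two significant figures.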

SummaryBot, Nov 1 2023

Executive summary: The post discusses computational approaches for detecting novel pathogens in metagenomic sequencing data as a way to identify stealth pandemics before widespread infection.

Key points:

Metagenomic sequencing of biological samples like sewage can reveal nucleic acids from novel pathogens.
Algorithms can flag concerning sequences that are dangerous, modified from known pathogens, completely new, or growing exponentially.
Each approach has benefits and challenges in accuracy, requiring updates to databases and biological knowledge.
Secure encrypted databases like SecureDNA may reduce risks of hijacked pathogen watchlists.
Research is needed to improve metagenomic assembly and timeseries analysis for detecting novel and growing sequences.
Parallel development of multiple detection approaches is recommended, with short-term focus on known dangerous and modified sequences.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Effective Altruism Forum