Why is genetic engineering detection important?
We’re pretty sure a pathogen capable of causing a global catastrophic biological risk (GCBR) is going to be engineered, not natural. Broad genetic engineering detection (GED) capabilities would therefore be an important form of pathogen-agnostic surveillance.
It’s particularly important that we be able to do metagenomic GED: picking out the one engineered genome in a giant soup of non-engineered genomes, and doing this reasonably efficiently. Efficiency at that scale is why it’s also important to focus on computational GED capabilities.
What’s been done so far?
The bulk of relevant work in GED took place under the IARPA program FELIX (Finding Engineering-Linked Indicators), with many of the results so far unpublished.
Until FELIX, almost all other GED work was experimental, and focused on plants, apparently motivated by an EU regulation related to GMOs. None of this work is really relevant to GCBR surveillance, but here’s a pretty good review of it. There are a few other relevant papers. But part of the motivation for FELIX was to establish a state of the art in GED capabilities, so the outcomes of FELIX are what matter.
How did FELIX work?
It was intentionally a pretty broad, performer-defined project. Performers decided for themselves what counted as genetic engineering, what priorities for detection were, how they would do it, and so on.
IARPA partnered with several national labs to make a total of four testing and evaluation (T&E) datasets. The first few were custom tailored to each performer, but the last was a “challenge” set, and everyone still in the program got the same one. The challenge set included some really wacky samples: a very large plant genome, an organism with over 800 components of engineering, and an influenza virus with a single stop codon deleted.
What were the outcomes of FELIX?
FELIX resulted in four distinct computational GED systems, as well as several wet-lab protocols that are out of scope of this discussion. The computational systems were transitioned by IARPA to at least three government agencies: DTRA, DEVCOM, and NBACC.
The top FELIX performer, Ginkgo Bioworks, achieved an accuracy of 70% on IARPA’s final test set. Most performers were able to reliably detect insertions of foreign gene content down to about 30-50 base pairs. For large insertions, multiple performers reached >90% sensitivity and >95% specificity. But on smaller insertions, substitutions, and deletions, performance was much worse.
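To make those metrics concrete, here’s a minimal Python sketch of how accuracy, sensitivity, and specificity relate, and why even good specificity leaves a lot of false alarms when engineered genomes are rare in a sample. Every number in it is hypothetical and chosen for illustration; none of it is FELIX data, and the names `Confusion` and `positive_predictive_value` are my own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts from scoring a detector against a labelled test set."""
    tp: int  # engineered samples correctly flagged
    fp: int  # natural samples wrongly flagged
    tn: int  # natural samples correctly passed
    fn: int  # engineered samples missed

    @property
    def sensitivity(self) -> float:
        # Fraction of engineered samples that were detected.
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        # Fraction of natural samples that were correctly passed.
        return self.tn / (self.tn + self.fp)

    @property
    def accuracy(self) -> float:
        # Fraction of all calls that were correct.
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total


def positive_predictive_value(prevalence: float,
                              sensitivity: float,
                              specificity: float) -> float:
    """Probability that a flagged genome is really engineered (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)


# Hypothetical test set: 10 engineered and 100 natural samples.
toy = Confusion(tp=9, fp=5, tn=95, fn=1)

# Hypothetical metagenomic setting: 1 engineered genome per 10,000.
ppv = positive_predictive_value(1 / 10_000, sensitivity=0.90, specificity=0.95)

if __name__ == "__main__":
    print(f"sensitivity={toy.sensitivity:.2f} "
          f"specificity={toy.specificity:.2f} "
          f"accuracy={toy.accuracy:.3f}")
    print(f"PPV at 1-in-10,000 prevalence: {ppv:.4f}")
```

Under these made-up assumptions, well under 1% of flagged genomes would actually be engineered, which is one way to see why the metagenomic case is so much harder than scoring isolated samples.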
What were FELIX’s limitations and challenges?
Challenges with GED in general included:
- Bad natural data: Most publicly available databases of ‘natural’ sequences, even reference databases, are hopelessly contaminated with engineered sequences, and need to be extensively curated before use.
- Limited engineered data: The best databases of engineered sequences are private. The best public databases of engineered sequences are quite biased, and lack contextual information. Between the bad natural data and the limited engineered data, advanced machine learning techniques have so far been of limited usefulness to GED.
- Difficulty with complex genomes and subtle edits: FELIX didn’t primarily focus on metagenomic data, and many performers struggled with highly complex samples. Performance on subtle forms of engineering was also limited.
Challenges with FELIX included a sense among performers that they were chasing a moving target, and limited collaboration between performers. The final systems were also very large and computationally intensive, causing some difficulties for the organizations to which FELIX systems have been transitioned.
What can improve GED capabilities?
- Curation of public natural data to remove spurious actually-engineered sequences
- Improved public engineered datasets, and potentially the creation of a better language/format
- Improved baseline characterization: One way to improve the detection of anomalous sequences in, e.g., wastewater is to improve understanding of what is not anomalous
- Improved metagenomic assembly algorithms: Assembly algorithms have not kept pace with the increased volume of sequencing data. Improvements in these algorithms’ error rates and computational performance could substantially improve any GED module that uses genome assembly, which is most of them.
Proximally, however, the major limitation to any future GED work is a lack of a high-quality, comprehensive test dataset.
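The baseline characterization idea in the list above can be illustrated with a toy example. This is a sketch under strong simplifying assumptions, not how any FELIX system works: it treats the baseline as a plain set of k-mers, ignores reverse complements and sequencing errors, and uses hypothetical function names and made-up sequences.

```python
from typing import Iterable, List, Set


def kmers(seq: str, k: int) -> List[str]:
    """Return every overlapping substring of length k, in order."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_baseline(sequences: Iterable[str], k: int) -> Set[str]:
    """Collect all k-mers seen in sequences we believe are natural."""
    baseline: Set[str] = set()
    for seq in sequences:
        baseline.update(kmers(seq, k))
    return baseline


def novel_fraction(query: str, baseline: Set[str], k: int) -> float:
    """Fraction of the query's k-mers that never appear in the baseline.

    A high value only means "unlike the baseline"; it is not evidence
    of engineering on its own.
    """
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0  # query shorter than k: nothing to score
    unseen = sum(1 for km in query_kmers if km not in baseline)
    return unseen / len(query_kmers)


# Made-up sequences, for illustration only.
K = 4
baseline = build_baseline(["ACGTACGTACGT"], K)
familiar_score = novel_fraction("ACGTACGT", baseline, K)
unfamiliar_score = novel_fraction("ACGTTTTTACGT", baseline, K)

if __name__ == "__main__":
    print(f"familiar:   {familiar_score:.2f}")
    print(f"unfamiliar: {unfamiliar_score:.2f}")
```

The point of the toy is that the score is only as good as the baseline: a contaminated or incomplete baseline makes engineered sequence look familiar, or natural sequence look novel.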
I want to know more!
Great! Read the full report. It includes an extensive technical discussion, full references, commentary, acknowledgements, as well as project ideas and resources. It is periodically updated.
I don’t have time to read all that, but I still want to know more
Understandable. Read the 6-page memo I put together for the Nucleic Acids Observatory team. You can also watch IARPA’s livestream about the FELIX project, which includes details about two of the performers’ projects.
Wait, who are you, anyway?
I’m a full-time biosecurity researcher. I’ve done a couple projects like this, and I’m currently working on something else (solid state far-UV emission). I was asked to do this project by the NAO team at SecureBio in January 2022. All my views are my own, and do not represent the views of SecureBio, Sculpting Evolution, or my employer.
Can I help?
There are lots of shovel-ready projects in GED. Someone with experience/interest in bioinformatics, metagenomics, synthetic biology, microbiology, virology, etc. would be the best fit. If you’re interested in helping, please don’t hesitate to contact me.
Thank you for the write-up. I really appreciate in-the-weeds explainers like this popping up on the forum. Will take the time to read/skim the full report!