Gopal Sarma: Artificial intelligence and the life sciences

by EA Global Transcripts, 30th Jun 2020


The development of advanced AI is likely to be driven by specific consumers with particular needs — for example, firms involved in healthcare and biology. Drawing on examples from genomics, neuroscience, and other related fields, Gopal Sarma, a scientific advisor to the Broad Institute of MIT and Harvard, discusses how the life sciences might influence the creation of novel AI technologies over the next decade.

Below is a transcript of Gopal’s talk, which we’ve lightly edited for clarity. You can also watch it on YouTube.

The Talk

Thank you all for being here. I'll be talking on the topic of artificial intelligence and the life sciences.


To motivate this topic, I'm going to start by presenting a highly simplified framing of how I think many people conceptualize where AI technologies come from and how they're deployed in the real world.

In this simple picture, we have two groups of people that largely exist in a kind of dichotomy. We have the producers — for example, researchers at academic computer science departments and software engineers at industrial research labs who develop new AI algorithms and software systems. And then there's a largely independent group of people that takes advantage of these tools and software systems to develop new products and services.

For the most part, people in the latter group — the consumers — aren't actively involved in the development of the technologies themselves. But there are some exceptions — for example, large tech companies. In addition to being key drivers of AI progress in the last 10 years, they have been some of the premier consumers of their own tools in order to drive new products and services.

I think part of the appeal of this framing and why it's a reasonably accurate picture is the perceived generality of AI technologies. The vision of AI is that the tools that we build are sufficiently general that the users of AI technologies don't necessarily need to understand how they work or to be actively involved in their development. But unfortunately, I think this is a picture that doesn't always work out well. Particularly in domains that have a rich knowledge base, intellectual tradition, or large amount of organizational complexity and regulatory overhead, we need much tighter integration between the consumers and producers of AI technologies.


What I'll do in the remainder of this talk is provide a few suggestive examples of ongoing developments in biology and medicine, where this type of integration between the consumers and producers of AI technologies is happening organically today.


[These examples will also show how] — as a result of the exponential scaling of multimodal data sources, along with challenging biomedical problems — the life sciences are poised to drive the next decade of creativity in AI.


Let's start by talking about the scaling of DNA sequencing. This is a story that began in 2001 with the completion of the Human Genome Project. The graph [in the slide above] shows the total number of human genomes sequenced in the world as a function of time since that first genome was sequenced.

Now, nearly 20 years later, we've sequenced millions of genomes at a fraction of the cost — and significantly faster. As an example, at the Broad Institute, we're now sequencing a new human genome nearly every minute. And the marginal cost of sequencing a whole human genome in bulk is about $500-$600, with genomic datasets approaching the petabyte scale.

Why did we collect all of this data? Early on, the motivation was to understand Mendelian diseases, which are diseases that have a single gene as their underlying cause. And as a result of the plummeting cost of gene sequencing, we've been able to characterize the genetic basis of over 5,000 Mendelian diseases.

But as scientists began to think about how to tackle and understand the genetic basis of what we call “common diseases,” like cardiovascular disease, diabetes, schizophrenia, or Alzheimer's, they realized that they needed to collect larger and larger sample sizes. Doing so would give adequate power to statistical studies to uncover complex relationships between multiple genes. In other words, the fundamental force driving this Moore's law type of exponential scaling has been an increasingly ambitious series of scientific [goals].

What has this meant for scientists working on the cutting edge of genomic research? As a result of the volume of data that we're collecting, in the last few years a bottleneck has been created; there's a huge computational overhead now, both in developing ad hoc computational scripts, as well as the runtime for performing these types of analyses.


And of course, what we’d rather see is the tables turned: well-engineered software systems that allow scientists to abstract away much of the underlying computational effort and focus on the scientific process itself.


So let me tell you a bit now about some progress that we've made in this domain at the Broad Institute in developing an open source tool called Hail.


Hail is a cloud-native, scalable platform for genomic analysis that was directly motivated by the exponential scaling of genomic data production in the last few years.

Something that's been a remarkable discovery in the development of Hail is the extent to which otherwise well-engineered software tools in the data science ecosystem have really proven to be inadequate for our needs. This is only partly due to the scale of the data involved. In many cases there are more subtle reasons why the particular types of analyses that we do aren't quite well-matched for the tools that were developed in the context of business intelligence.


As a result, many of the architectural decisions underlying the development of Hail have been influenced by the scientific needs of our user base. These range from the development of a novel distributed data structure called the “matrix table,” which blends elements of an SQL database with a numeric tensor, to techniques for container orchestration, to the use of modern compilation technology to implement distributed linear algebra at scale.


To tie this to the larger AI story, even though Hail was developed fundamentally in a genomic context, what we've built is sufficiently general that our long-term vision is for Hail to be a fully distributed scientific computation stack analogous to the role played by tools like NumPy and pandas today.


One early success story I’ll mention from the time when Hail was being developed involves an amazing scientific resource called the UK Biobank, which consists of health and genomic data for 500,000 patients in the U.K. Performing 115 billion linear regressions, which is a core element of genomic analysis, would have required something like 200,000 CPU hours or six months on a traditionally configured research compute cluster. And this is something we've been able to do in just 10 hours with Hail on the cloud.
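[To give a feel for the core operation behind those numbers: it is one small regression per genetic variant, repeated across many variants and traits. The following is a minimal NumPy sketch of that idea on a single machine. It is not Hail's API; the function name `per_variant_regression`, the simulated genotypes, and the effect size are all assumptions made for illustration.]

```python
import numpy as np


def per_variant_regression(genotypes, phenotype):
    """Regress one phenotype on each variant separately (with an intercept).

    genotypes: (n_samples, n_variants) array of allele counts (0, 1, or 2).
    phenotype: (n_samples,) array holding a quantitative trait.
    Returns the per-variant slope estimates and their t-statistics.
    Assumes every variant varies across samples (no constant columns).
    """
    n_samples = genotypes.shape[0]
    g = genotypes - genotypes.mean(axis=0)   # center each variant
    y = phenotype - phenotype.mean()         # center the trait
    sxx = (g ** 2).sum(axis=0)               # per-variant sum of squares
    sxy = g.T @ y                            # per-variant cross-product with the trait
    beta = sxy / sxx                         # least-squares slope for each variant
    residual_ss = (y ** 2).sum() - beta * sxy
    std_err = np.sqrt(residual_ss / (n_samples - 2) / sxx)
    return beta, beta / std_err


# Hypothetical toy data: 5,000 people, 50 variants, one truly associated variant.
rng = np.random.default_rng(0)
genotypes = rng.binomial(2, 0.3, size=(5000, 50)).astype(float)
causal_index, true_effect = 7, 0.5
phenotype = true_effect * genotypes[:, causal_index] + rng.normal(size=5000)

beta, t_stat = per_variant_regression(genotypes, phenotype)
top_variant = int(np.argmax(np.abs(t_stat)))
print("strongest association at variant", top_variant)
```

[All 50 regressions here run in one vectorized pass in memory. At biobank scale the same arithmetic has to be partitioned across many machines, which is the problem a distributed platform is meant to address.]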


I do want to briefly mention some of the work of some of our colleagues at an organization called the Data Sciences Platform at the Broad Institute. They're developing a variety of tools that are both upstream and downstream of the kind of genomic analysis that we do using Hail. Those tools range from working with the raw output of genomic sequencers to things that are much more patient-facing, like portals for collecting rich clinical data, as well as cardiovascular decision support incorporating genomic insights.

Suppose you knew everything there was to know about a person's DNA and the DNA in the population. What would that allow you to do? DNA tells us about the architecture of life, the blueprint. And if we had this information, we would know, for example, as soon as a person was born, that they were at a very high risk of developing a disease much later in life that was silent when they were younger, or that they were a carrier for a disease that they were going to pass on to their children.

What it wouldn't necessarily do, however, is tell you what to do next: how to intervene, how to treat them, what strategies to use in developing novel therapeutics. In order to do that, we need to move up the biological hierarchy of macromolecules, from DNA to RNA.


Each of the 35 trillion or so cells in the human body has essentially the same nuclear DNA. So what differentiates cells in the eye from those in the heart, kidney, or lungs? Essentially it's the levels of gene expression or RNA that are expressed in each cell. And this is something that we're now able to quantitatively characterize using a new technology called “single-cell RNA seq [sequencing]” or scRNA for short.


The vision of the Human Cell Atlas is to apply unsupervised learning to these high-dimensional data sets of gene expression in order to develop an atlas of cell types and cell states, providing us with a powerful baseline measure of normal physiology and the transition to disease.

Let me give one example from the early days in this research. It’s from just a few years ago, actually.


What we see here on the left is a standard diagram of the retina. And the different cell types are color coded. What researchers were able to do using single-cell RNA sequencing was to apply standard unsupervised clustering techniques to this high-dimensional data set of gene expression data. And what's remarkable is, if you look carefully at the clusters on the right, the colors correspond to the different cell types that we see on the left. In addition, researchers were able to determine that there were novel subpopulations of cells revealed through these clustering techniques that could then be iteratively investigated through experimental processes.
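[As a rough illustration of what clustering cells by gene expression means, here is a small NumPy sketch using k-means on simulated data. K-means is a stand-in chosen for brevity, not the method used in the retina study, and the "cell types," marker-gene blocks, and helper names are hypothetical.]

```python
import numpy as np


def farthest_first_centers(x, k):
    """Pick k starting centers: the first cell, then repeatedly the cell
    farthest from all centers chosen so far."""
    centers = [x[0]]
    for _ in range(k - 1):
        dist = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[int(np.argmax(dist))])
    return np.array(centers)


def kmeans(x, k, n_iter=100):
    """Plain Lloyd's algorithm: assign cells to the nearest center, then
    move each center to the mean of its assigned cells."""
    centers = farthest_first_centers(x, k)
    for _ in range(n_iter):
        # Squared distance from every cell to every center: shape (n_cells, k).
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Simulated expression matrix: 3 hypothetical cell types, 100 cells each,
# 30 genes, where each type strongly expresses its own block of 10 genes.
rng = np.random.default_rng(1)
n_types, cells_per_type, genes_per_type = 3, 100, 10
true_type = np.repeat(np.arange(n_types), cells_per_type)
expression = rng.normal(0.0, 1.0, size=(n_types * cells_per_type,
                                        n_types * genes_per_type))
for t in range(n_types):
    expression[true_type == t, t * genes_per_type:(t + 1) * genes_per_type] += 5.0

labels, _ = kmeans(expression, k=n_types)
for t in range(n_types):
    print("true type", t, "-> cluster labels", set(labels[true_type == t].tolist()))
```

[The algorithm never sees `true_type`; it only sees expression values. Real single-cell data is far noisier and higher-dimensional than this toy, which is part of why finding clusters is not the same as finding biologically meaningful ones.]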

So the vision of the Human Cell Atlas is to perform analogous investigations, but for all of the different cell types in the human body. And what has been remarkable in the last several years is the realization that fundamental research into machine learning is coming out of this [scRNA] research.


To connect this back again to the AI story, it's not sufficient for us to apply off-the-shelf clustering techniques and know that high-dimensional data sets form clusters. What we really need to know is that we're uncovering meaningful representations that actually tell us something biological. And for that reason, these fundamentally biological problems are motivating new results in representation learning. In the last several years we've seen a steady stream of results published by single-cell theorists in top machine learning venues like ICML [the International Conference on Machine Learning] and NeurIPS [Neural Information Processing Systems].

So we started off by talking about DNA sequencing as a way to uncover the architecture and blueprint for life. And we talked about RNA sequencing as a way of characterizing the diversity of cell types and understanding the mechanics of what's going on in cells and the diversity of cell states. And in each of these examples, we saw how fundamental biological problems were motivating the development of new AI technologies, software systems, and algorithms.

What about dynamic models? We know that cells and living organisms are evolving in time. So do simulations play a role in the story?

Simulations are a much younger and more controversial entrant in the biomedical research scene. But I think they are an essential part of the story as it relates to AI. So let me say a few words now about international efforts to simulate regions of brain tissue to understand brain disease.


In the diagram [above], on the right is a highly simplified phylogenetic tree. At the bottom of the tree we have the simplest nervous systems corresponding to worms and insects, extending all the way up through mammals and humans at the top. And the five projects shown here with logos correspond to active international efforts aimed at simulating entire nervous systems or, in some cases, embodying nervous systems in virtual environments.

The motivation for all of this research is to build powerful computational platforms for advancing the study of brain disease. Much of the controversy and attention has come from those projects at the top of the evolutionary tree that are trying to simulate entire regions of mammalian cortical tissue, or entire brains. But what I'd like to focus on for the next few minutes are the much more modest projects at the bottom of the tree aimed at simulating simple organisms like worms and insects.


For many years, it’s been stated that if you can simulate an entire human brain, then you've necessarily made substantial progress toward transformative AI. But what I'd like to point out is that mammalian simulations are not necessary for major AI breakthroughs. In fact, simple organisms like insects show impressive task performance in a variety of domains.

Take, for example, the fruit fly, which has only about 10^5 (100,000) neurons and no structure comparable to the cerebral cortex, but still has sophisticated visuospatial navigation abilities and minuscule power consumption. And again, to connect the biological story to the AI story, even though the fundamental motivation for designing these platforms to begin with was to understand the biological basis of disease, they really can be thought of as platforms for AI research itself. Because these are software systems, we can abstract away biological details in order to focus on those high-level principles which might be relevant for AI research.


You can imagine using simulation platforms like these to study how well-known algorithms like backpropagation approximate what we observe in nature. We might use real-time insights into the differential activation of different brain regions during learning and memory to motivate the development of biologically inspired architectures that fuse elements drawn from real organisms with modern machine learning.


I know that many people [attending EA Global] have thought a great deal about AI safety in terms of both near- and far-term considerations. And I wanted to mention that the projects that I've talked about thus far are really just a small slice of the totality of efforts taking place today in the life sciences. Many of these systems demand consideration from a safety perspective.

Let me give a few examples from the near-term set of issues. Take, for example, bias in supervised learning. This is relevant to a variety of systems that are actively being deployed in several areas, from pharmaceuticals, to genomics, to medical devices, to healthcare itself.

In the genomics context, consider that individuals of European ancestry are highly overrepresented in genomics datasets. And as we move from conducting fundamental research to building tools for clinical decision support, it's essential that we consider robustness to distributional shift as these models are deployed in diverse patient populations.
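[A toy NumPy sketch of what distributional shift can look like in practice: a classifier is fit on data dominated by one group and then evaluated on each group separately. Everything here is hypothetical, including the single "biomarker," the group baselines, and the rule that disease depends on a group's own baseline. It is meant only to show why per-group evaluation matters, not to model any real dataset.]

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate_group(n, baseline):
    """Hypothetical group: one biomarker centered at `baseline`, with the
    outcome positive whenever the biomarker exceeds that group's baseline."""
    x = rng.normal(baseline, 1.0, size=n)
    y = (x > baseline).astype(float)
    return x, y


def fit_linear_classifier(x, y):
    """Least-squares fit of y ~ a*x + b; returns the pair (a, b)."""
    design = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def accuracy(coef, x, y):
    """Fraction of correct predictions, predicting positive when a*x + b > 0.5."""
    pred = (coef[0] * x + coef[1] > 0.5).astype(float)
    return float((pred == y).mean())


# Training data: 95% from group A, 5% from group B.
xa, ya = simulate_group(1900, baseline=0.0)
xb, yb = simulate_group(100, baseline=1.5)
coef = fit_linear_classifier(np.concatenate([xa, xb]), np.concatenate([ya, yb]))

# Evaluate on fresh samples from each group separately.
acc_a = accuracy(coef, *simulate_group(2000, baseline=0.0))
acc_b = accuracy(coef, *simulate_group(2000, baseline=1.5))
print(f"group A accuracy: {acc_a:.2f}, group B accuracy: {acc_b:.2f}")
```

[A single pooled accuracy figure would be dominated by group A and would hide the gap, which is why the two groups are scored separately here.]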

Laboratory robotics is an area that's seeing a tremendous amount of automation. And as that automation comes to include machine learning and AI technologies, and potentially allows biological laboratories to engage in small amounts of hypothesis generation and experimental design with minimal human intervention, it's essential that we consider safe exploration as a key safety and robustness metric.

Finally, consider that in a medical context, AI-driven decision support is unlikely to ever be used in a fully standalone fashion. It’s more likely to be integrated in the moment-to-moment operation of teams and organizations. In this context, it's important that we think about safety and design from the perspective of human-in-the-loop AI.

Finally, I want to mention a tantalizing possibility: that some of this research might actually motivate new ideas in AI safety itself. In particular, the nervous system simulation platforms that we just discussed can be thought of as a set of biological agent-and-environment simulations of an intermediate level of complexity. These could be used to extend a grid-world or test-suite type of paradigm for benchmarking various AI safety metrics.

So let me summarize the three biological areas that I've discussed in the context of making 10-year predictions for AI.

We started off by talking about data engineering and data science as it relates to genomics. My prediction is that in the next decade we'll begin to see a widespread use of genomics tools like Hail for general data science in a non-biological context.

Next, we talked about single-cell RNA sequencing and representation learning. And my prediction here is that we'll continue to see a steady stream of novel insights into representation learning — and into mechanisms in causality and interpretable ML. These will arise from theorists working on single-cell biology.

Finally, we talked about AI architectures and how computational platforms for simulating large regions of brain tissue relate to understanding brain disease. My prediction here is that we'll see new insights driven by intuition that [stem] from using realistic nervous system simulation platforms. A practical manifestation of this might be that we start to see mentions of these platforms in the “method” sections of papers appearing in a purely AI context.


I want to say a few words about stewardship because I know it's a major theme of this [EA Global] conference. If stewardship in the context of innovation and new ideas means leading from within, then I hope to have convinced you — with some of these examples from the life sciences — that the story of AI and the life sciences really is about taking care of the resources that we have by transforming them from inside, rather than disrupting them with entirely new approaches and organizations.

One thing that we've learned in seeing the injection of software engineering and AI into the life sciences over the last few years is that no company is likely to build an AI biologist or an AI doctor without significant involvement from the entire biomedical community. In this way, the story of AI and the life sciences is fundamentally one of cooperation and of integrating different knowledge bases.

I think this connects back to the consumer-producer dichotomy that I started off this talk with, and compels us to think about why in the life sciences it's essential that we really take ownership over the development of novel software engineering and AI tools.

Finally, I want to mention that even though in this talk I've motivated these high-level themes from the AI perspective, the fundamental reason for doing all of this work is to alleviate the tremendous amount of human suffering caused by the burden of disease. In that way, the intersection of AI and the life sciences provides us with an amazing opportunity to align short-term impact — by transforming the diagnosis and treatment of disease — with the inevitable long-term implications of transformative AI that this process will bring about.

If you're interested in learning more, please take a look at the Models, Inferences, and Algorithms Initiative at the Broad. We have over 100 hours of videos online from experts and thought leaders in the field.

And finally, if you're a software engineer interested in building a cutting-edge distributed system for working with data at the petabyte scale — and probably the exabyte scale before we know it — then the Hail team is certainly interested in hearing from you.

Thank you all so much.

Nathan Labenz [Moderator]: Early on you showed a graph with clusters of different cells in the retina in two groups, and you noted that what looked like the same kind of cell under a microscope actually had different subgroups.

[Is this graph] something as simple as a two-dimensional space that you're projecting all of these cells into? That, in and of itself, seems like a remarkable discovery if that's the case, but maybe I'm overinterpreting the graph.

Gopal: I wouldn't read too much into that. I think that's kind of an initial, first-pass investigation into using a pretty standard clustering technique called t-SNE [t-distributed stochastic neighbor embedding, strictly a method for projecting high-dimensional data into two or three dimensions for visualization]. I think that in the long run, that sort of thing won’t be all that impactful. I think there are really deep questions to ask, like “what is a cell?” We have a low-dimensional vector representation in terms of gene expression, and in order to know if that's meaningful, you need a lot of iterative interaction with experimentalists and physiologists. They can help us understand what these representations are telling us about the normal biological process, and then the disease process.

There are some preliminary confirmations you can get. For example, when you apply this sort of thing to RNA sequences in time, you realize that you've basically recovered the cell cycle, so that's pretty old biological knowledge. But I think it's nice to know that you can recover that, and I think that was part of the motivation of that result.

Nathan: Yeah, that's cool. Regarding the part [of your talk] about representation learning in general, and your comment that this could potentially help generate more understandable forms of machine learning, I think everybody is probably pretty well aware of the problem of intractability. It’s kind of a black-box problem [that crops up with] a lot of modern machine learning techniques. Can you help us develop an intuition for why this would be more tractable? I didn't quite have that intuition [as I listened in the first row of the audience].

Gopal: I wouldn't argue that it's more tractable. I would just say that it's essential — whereas I think when we look at why supervised learning at scale has been so impactful in industry, it's in part because this hasn't necessarily always been an issue. If you can predict something very accurately, then that's potentially monetizable in some of the current revenue models of how these systems are deployed in the real world. But in terms of uncovering biological mechanisms, I think it’s of limited value to be able to take this massive training dataset and produce a model that does extremely well.

I would say that the Human Cell Atlas is really trying to learn new biology — to develop a new representation of cells that we haven't necessarily thought about before. And in that case, naive prediction is only going to play somewhat of a role. So that's really the point there.

Nathan: Got it. It was kind of a side point, but I thought it was potentially really interesting.

You mentioned that the fruit fly can do pretty sophisticated navigation and that it flies about with minuscule power consumption. That seems like a potentially important breakthrough in and of itself: to save a tremendous amount of energy. Can you give us a sense for how much more efficient the biological system is than, say, a comparable drone system that's out there today running on a CPU?

Gopal: I don't actually know what the numbers are off the top of my head, but you can imagine [what the energy savings might be like] from seeing how much power your desktop takes plugged in, versus this thing that's buzzing around your kitchen.

Nathan: Yeah. Doesn't need much, that's for sure. Okay, cool.

Another intuition I was hoping you could develop a little bit more for us is the idea that some of these simulations could become the basis for generating new strategies in AI. I'm kind of naively thinking… you've got the worm, for example, that has only a few hundred neurons, right? So certainly that's a simple enough system that you could kind of perturb it with an extra neuron, or one fewer neuron. Is that the kind of thing that you imagine working on? Or if you can get the worm working, how might that translate into new structural ideas for how to develop AI?

Gopal: I should say initially that I don't know that C. elegans [the roundworm] itself is going to be all that impactful for understanding AI. It's a pretty simple organism. I think it's an important milestone in the roadmap for this type of research in terms of the technical process for extracting kinetic parameters from ion channel experiments and things like that.

I think that [in terms of] the ability to simulate more complex organisms that are still well short of mammals, a lot of the biological complexity that we need for doing disease-oriented research is probably just not very relevant to being able to reproduce those same high-level behaviors in the context of AI research. So probably the simplest thing to think about in this context is how many ion channels you have to have in your neuronal models to be able to reproduce the behavior that you're able to measure in the lab.

For example, if I'm interested in studying epilepsy, presumably I want to incorporate as many ion channels as there are because that might give rise to new drug targets. But if you're only interested in building a Drosophila-like [fruit fly-like] organism that can do all of the things that Drosophila can do, but you don't necessarily care about the biology, you don't need a lot of ion channel detail. It may just take a handful of different ion channels to reproduce the cellular behavior that you need.

I think what's happening is you're taking a set of models that are already pretty abstract; these so-called “biologically realistic models” actually include very little biology. They include electrophysiology that might be impactful for disease processes. But as we're taking away more and more biological detail, I think what we're left with is a highly abstract software system that has pretty sophisticated learning and memory capabilities, and you can imagine just building from there in terms of levels of abstraction.
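[To illustrate the point that a handful of ion channels can already reproduce basic cellular behavior, here is a minimal sketch of the classic Hodgkin-Huxley neuron, which has only a sodium channel, a potassium channel, and a leak. The parameter values are the standard textbook ones for that model and are an assumption of this sketch; they are not taken from the talk or from any of the simulation projects mentioned, and the function names are hypothetical.]

```python
import math


def vtrap(x, y):
    """x / (1 - exp(-x / y)), with the removable singularity at x = 0 handled."""
    if abs(x / y) < 1e-6:
        return y * (1.0 + x / (2.0 * y))
    return x / (1.0 - math.exp(-x / y))


def gate_rates(v):
    """Opening (alpha) and closing (beta) rates for the m, h, and n gates at voltage v (mV)."""
    a_m = 0.1 * vtrap(v + 40.0, 10.0)
    b_m = 4.0 * math.exp(-(v + 65.0) / 18.0)
    a_h = 0.07 * math.exp(-(v + 65.0) / 20.0)
    b_h = 1.0 / (1.0 + math.exp(-(v + 35.0) / 10.0))
    a_n = 0.01 * vtrap(v + 55.0, 10.0)
    b_n = 0.125 * math.exp(-(v + 65.0) / 80.0)
    return a_m, b_m, a_h, b_h, a_n, b_n


def count_spikes(current, duration_ms=50.0, dt=0.01):
    """Forward-Euler simulation of one compartment; returns the number of spikes.

    current is the injected current in uA/cm^2; a spike is an upward crossing of 0 mV.
    """
    c_m = 1.0                                    # membrane capacitance, uF/cm^2
    g_na, g_k, g_leak = 120.0, 36.0, 0.3         # maximal conductances, mS/cm^2
    e_na, e_k, e_leak = 50.0, -77.0, -54.387     # reversal potentials, mV
    v = -65.0                                    # start at the resting potential
    a_m, b_m, a_h, b_h, a_n, b_n = gate_rates(v)
    m, h, n = a_m / (a_m + b_m), a_h / (a_h + b_h), a_n / (a_n + b_n)
    spikes = 0
    for _ in range(int(round(duration_ms / dt))):
        a_m, b_m, a_h, b_h, a_n, b_n = gate_rates(v)
        i_na = g_na * m ** 3 * h * (v - e_na)    # sodium current
        i_k = g_k * n ** 4 * (v - e_k)           # potassium current
        i_leak = g_leak * (v - e_leak)           # leak current
        v_new = v + dt * (current - i_na - i_k - i_leak) / c_m
        m += dt * (a_m * (1.0 - m) - b_m * m)
        h += dt * (a_h * (1.0 - h) - b_h * h)
        n += dt * (a_n * (1.0 - n) - b_n * n)
        if v < 0.0 <= v_new:
            spikes += 1
        v = v_new
    return spikes


print("spikes with no input:", count_spikes(0.0))
print("spikes with 10 uA/cm^2 input:", count_spikes(10.0))
```

[A disease-oriented model would add many more channel types to this same structure, whereas a behavior-oriented model might simplify it further.]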

Nathan: Cool. Very interesting. A few questions are coming in from the app. The first one is a bit speculative, but I think it's pretty interesting: Can we apply some of these concepts to help animals, maybe by lending empirical evidence as to the sentience of invertebrates, for example? Do you think that is the kind of thing that could be determined by this sort of architectural understanding of animals? Or might we potentially replace animals in labs with different kinds of derivative organisms? Any insights or comments on that?

Gopal: Yeah, that's a great question. I think the person whom I've heard advocate for that is Christof Koch, who is the director of the Allen Institute for Brain Science. He has thought a lot about consciousness.

In the context of a lot of other research, where you’ve identified candidate brain regions and mapped out different types of brain activity, I can certainly imagine that simulations like this have a role to play in understanding animal sentience. And in the context of replacing animals, I think that is certainly very appealing from the animal welfare standpoint. I think that will probably generate a lot of reactions from biologists who are concerned about the relative lack of biological detail in these simulations. What I would hope is that rich simulations like this allow us to minimize the amount of animal experimentation that we have to do. But it's difficult to imagine.

This is a broader philosophical point in science, but simulations are never really replacements for experiments. They always encapsulate some parsimonious amount of detail and a number of assumptions. But it's certainly relevant.

Nathan: Cool. Very interesting. Here’s another question about the potential for culture clash. The AI field tends to be quite open, publishing results even in industry. (I think that happens sometimes, maybe not always.) But biology and pharma tend to be focused on patenting and locking down IP [intellectual property]. If you think that's an accurate-enough characterization (or maybe you could complicate it), do you think this culture clash becomes a problem as AI moves forward within biology?

Gopal: I think the main challenge right now is culture clashes. I hadn't thought about that particular one. I don't know that, say, pharma's tendency to care about IP will extend to these domains, just because it doesn't necessarily tie into the profit model of the IP related to developing a specific drug.

But I think building software is a very foreign concept in the life sciences. And that has been one of the biggest cultural challenges that I've seen. When biologists and physicians think about useful work to do, they imagine high-profile publications. So the idea of a software engineer building a piece of software is hard for people to wrap their minds around. Also, the difference between an ad hoc computational script that somebody developed and industrial-scale software engineering is a very novel concept for people.

So there's certainly no end to culture clashes, and maybe IP will play a role further down the line. But I think there are more basic obstacles right now.

Nathan: Well, in terms of novel ideas, you've certainly given us quite a few to think about today. Join me please in another round of applause for Gopal Sarma.