TL;DR: A strategy aiming to elicit latent knowledge (or to make any hopefully robust, hopefully generalizable prediction) from interpreting an AGI’s fine-grained internal data may be unlikely to succeed, given that the complex system of an AGI’s agent-environment interaction dynamics will plausibly turn out to be computationally irreducible. In general, the most efficient way to predict the behavior of a complex agent in an environment is to run it in that exact environment. Mechanistic interpretability is unlikely to provide a reliable safety plan that magically improves on the default strategy of empiricism. Coarse-grained models of the complex system have a realistic chance of making robust predictions out-of-distribution, although such predictions would then necessarily be limited in scope.
The paradigm of deep learning, potentially with a small extra step, seems likely sufficient for creating AGI. This is because AI capabilities seem likely to continue increasing, given empirical scaling laws, the decreasing cost of compute, and AI corporations’ unabating desire to build increasingly powerful AI. The most recent findings from neuroscience also suggest that across different species, scale is the causal factor behind most general capabilities.
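To make the reference to empirical scaling laws concrete, here is a minimal sketch of the kind of power-law relationship between model size and loss that such laws describe. The constants are placeholder values chosen for illustration, not taken from any published fit.

```python
# Minimal sketch of an empirical scaling law: loss falls off as a power law in
# parameter count. The constants below are placeholders for illustration only;
# they are NOT taken from any published fit.

def power_law_loss(n_params: float, irreducible_loss: float = 2.0,
                   n_c: float = 1e13, alpha: float = 0.08) -> float:
    """Predicted loss = irreducible term + (N_c / N)^alpha."""
    return irreducible_loss + (n_c / n_params) ** alpha

for n in (1e8, 1e9, 1e10, 1e11, 1e12):
    print(f"{n:.0e} params -> predicted loss {power_law_loss(n):.3f}")
```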
Biological systems like those studied in evolutionary biology and neuroscience are plausibly the best reference class we have for large-scale deep-learning models. To quote arguably the world’s leading expert in interpreting neural nets, Chris Olah:
“The elegance of ML is the elegance of biology, not the elegance of math or physics.
Simple gradient descent creates mind-boggling structure and behavior, just as evolution creates the awe inspiring complexity of nature.”
For a more detailed case, see (1) Olah’s excellent writeup "Analogies between biology and deep learning" as well as (2) the review paper by Saxe et al. about the analogy between deep learning and systems neuroscience.
Computational irreducibility
Predicting the dynamics of biological systems is often computationally intractable. To quote "The limits to prediction in ecological systems" by Beckage et al.:
“An ecological system changes through time, updating its state continuously, and the process of system evolution can be thought of as computation…Our use of the term ‘system evolution' is much broader in meaning than biological evolution, and also includes changes in abundance, location and interactions between individuals irrespective of species, and the interface between the biotic and abiotic components of the system, e.g., flux of nutrients, water, etc. Predictive models, then, are able to forecast the future state of the system before the system performs the intermediate computations to reach its updated state. An astronomical model, for example, might predict the earth's position and orientation relative to the sun millions of years into the future, without the need for the solar system to perform the intervening computations. The intervening computations that the system performs can be circumvented to predict its future state. The astronomical model might, for example, be used to predict the latitudinal and seasonal distribution of insolation on earth, which describes the past orbital forcing of the climate system, for comparison with paleoclimatic data (e.g., EPICA community members 2004)."
Unlike astronomical systems, however, ecological systems often have computationally irreducible dynamics.
"Computational irreducibility refers to systems where the intervening computations cannot be bypassed using a simplified model. The dynamics of a system that is computationally irreducible cannot be predicted without allowing for the actual evolution of the system…A simplified model that can predict the future state of the system does not exist. Thus, the only way to ascertain the future state of the system is to allow the system to evolve on its own characteristic time scale. Computational irreducibility does not imply that the underlying processes are stochastic or chaotic, but that they are complex. The complexity can be manifested through high levels of contingency, interactions among system components, and nonlinearity. In fact, systems that evolve according to simple deterministic processes with exactly-known initial conditions can give rise to a complex state evolution that cannot be exactly predicted except by allowing the system to evolve in real time…Computational irreducibility, furthermore, may be a fundamental characteristic of some systems and does not merely reflect a shortcoming of proposed models or modeling techniques. This implies that the intrinsic predictability of these systems is inherently low.”
The computational irreducibility of ecological systems is due to “(1) the difficulty of pre-stating the relevant features of the niche, (2) the complexity of ecological communities, and (3) their potential to enable novel system states (e.g., niche creation).” As a result, the inherent predictability of ecological systems out-of-distribution is generally low. Some compelling examples of this phenomenon include the inherent unpredictability of how ecological systems will respond to future climate change and the inherent unpredictability of invasive species impact. An analogous argument correctly deduces that epigenetic systems (of genome-environment interactions) and neurocognitive systems (of brain-environment interactions) have computationally irreducible dynamics.
The three factors causing computational irreducibility in biological systems (composed of biological agents and their environments) also seem likely to apply to AGI-scale deep-learning systems (composed of AI agents and their training or deployment environments).
Implications for predictability of agent-environment interactions
What are the implications of computational irreducibility for high-complexity systems like biological systems and future AGI systems? To again quote Beckage et al.:
“Complex systems with extremely large numbers of components can sometimes become predictable from a macro-level perspective due to the averaging of a very large number of separate interactions. In statistical physics, for example, an approximate description of the mean state of a gas is possible without an exact description of the velocities and locations of each molecule; the temperature and pressure of a gas can be described using the ideal gas law. An ecological analogy to the ideal gas law might be the species composition of a forest. Forest composition is ultimately an emergent property that results from the local interactions of many individuals and processes, i.e., seed production and dispersal, competition, growth rates, tradeoffs, etc. While it may not be possible to determine the outcome of all of these complex interactions to predict the species identity of the tree species that captures a given canopy gap, the overall composition of the forest can be predictable to some approximation…While this statistical averaging of interactions is the standard assumption in many deterministic ecological models, it may be more accurate to view most ecological systems as small-to-middle-number systems because local interactions are quite relevant in affecting system behavior. In this case, there would be a general lack of the homogeneous mixing necessary for a purely statistical mechanics view to be applicable.
We conjecture that skill in predicting the future states of ecological systems will decrease as system complexity increases up to some threshold level. Further increases in system complexity beyond this threshold, for example, through expanding interconnected components of the system or increasing potential for nonlinear interactions, will not result in further losses in predictability…Furthermore, this threshold may occur at relatively low system complexity. We note, however, that the behavior of complex ecological systems can often be readily deconstructed and explained in hindsight as the cascade of interactions and nonlinearities is examined and understood. We emphasize that this may not aid predictions of future system states due to computational irreducibility.”
A coarse-grained model—if decently correct—can make robust predictions about a computationally irreducible system, although such predictions would necessarily be informationally incomplete. This is consistent with my research background in high-complexity systems like those of biological science and social science. For such a high-complexity system, the only general hope for predictability out-of-distribution (in the absence of unrealistically large amounts of time and computational resources) is to deduce theorems from a coarse-grained model of the system based on decently correct first principles. And due to information loss, the types of predictions this method can yield are generally quite limited in scope.
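Here is a minimal numerical sketch of the statistical-averaging point in the quote above, using my own toy parameters. In the idealized case where local outcomes are roughly independent, no single micro-level outcome is predictable, yet the coarse-grained aggregate is predictable to a good approximation:

```python
# Minimal sketch (my own toy illustration, with made-up parameters) of
# coarse-grained predictability emerging from many unpredictable micro-events:
# each canopy gap is captured by species A with probability 0.6, otherwise by
# species B. No single gap's outcome is predictable, but the overall forest
# composition is.

import random

random.seed(0)
P_SPECIES_A = 0.6   # assumed per-gap capture probability for species A
N_GAPS = 10_000     # number of canopy gaps, treated here as independent

wins_a = sum(random.random() < P_SPECIES_A for _ in range(N_GAPS))
print(f"Single gap:   unpredictable (species A wins with probability {P_SPECIES_A})")
print(f"Whole forest: species A holds {wins_a / N_GAPS:.1%} of gaps "
      f"(coarse-grained prediction: {P_SPECIES_A:.0%})")
```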
In the more optimistic case where one can fully specify a computationally irreducible agent’s environment a priori, the most efficient way to predict the agent’s behavior in that environment is to run it in that exact environment. There is no shortcut, such as one based on the agent's internal data.
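As a minimal sketch of this “no shortcut” point (my own hypothetical illustration; the ToyAgent, ToyEnvironment, and rollout names are invented stand-ins, not any particular library’s API), the only step here that produces the prediction is the simulation itself:

```python
# Minimal sketch (my own hypothetical illustration): to predict what an agent
# does in a fully specified environment, run it in that exact environment and
# record what happens. The toy classes below are stand-ins, not a real API.

class ToyEnvironment:
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += action            # trivial dynamics, for illustration only
        return self.state, self.state   # (next observation, outcome)

class ToyAgent:
    def act(self, observation):
        return 1 if observation % 2 == 0 else -1   # some opaque policy

def rollout(agent, environment, n_steps):
    """Empirically 'predict' behavior by simulating it step by step."""
    observation = environment.reset()
    trajectory = []
    for _ in range(n_steps):
        action = agent.act(observation)             # internals stay a black box
        observation, outcome = environment.step(action)
        trajectory.append((action, outcome))        # the prediction *is* the run
    return trajectory

print(rollout(ToyAgent(), ToyEnvironment(), n_steps=5))
```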
What does this mean for current AI safety research directions?
Given that future AGI systems will plausibly be computationally irreducible, I think the following implications of computational irreducibility deserve consideration.
1. Fine-grained interpretability of an AGI-scale model (e.g., interpreting circuits) may be intractable if researchers only have a short amount of time.
Many AI safety researchers consider fine-grained interpretability to be a promising research direction for reducing AGI x-risks. For example, Eliezer Yudkowsky—one of the founders of the AI safety field—has remarked that the interpretability agenda is "exceptional in the field," because he thinks it is one of the few research directions that is "on a pathway to anything important at all."
But this may be false. Due to computational irreducibility, achieving fine-grained interpretability of AGI-scale models may be intractable without an unrealistically large amount of time and computational resources.
A cautionary tale is provided by the Human Brain Project, which aimed to achieve a causally accurate predictive model of a brain. Why did the Human Brain Project crash and burn? Because computational irreducibility imposed a hard limit on the project’s scientific upside.
This was predictable: a year into the $1.6 billion project, almost a thousand neuroscientists—who could have drawn salaries from the project for years—instead selflessly boycotted it as a waste of scarce scientific resources. That year, the Guardian reported:
“Central to the latest controversy are recent changes made by Henry Markram, head of the Human Brain Project at the Swiss Federal Institute for Technology in Lausanne. The changes sidelined cognitive scientists who study high-level brain functions, such as thought and behaviour. Without them, the brain simulation will be built from the bottom up, drawing on more fundamental science, such as studies of individual neurons. The brain, the most complex object known, has some 86bn neurons and 100tn connections.
‘The main apparent goal of building the capacity to construct a larger-scale simulation of the human brain is radically premature,’ Peter Dayan, director of the computational neuroscience unit at UCL, told the Guardian.
'We are left with a project that can't but fail from a scientific perspective. It is a waste of money, it will suck out funds from valuable neuroscience research, and would leave the public, who fund this work, justifiably upset,' he said.”
And the project did in fact crash and burn. To this day, progress on achieving a causally accurate model of the brain remains marginal at best.
Anthropic plans to rely on interpreting large-scale models at the circuit level. Given the hard bounds on predictability imposed by computational irreducibility, the Human Brain Project’s waste of $1.6 billion and years of researchers’ time is a cautionary tale for how Anthropic uses its $580 million in funding. The complexity and computational irreducibility of AGI-scale models likely imply that achieving a fine-grained causal understanding of them will require an extremely large amount of time and computational resources, just as it does in neuroscience.
To be clear, a fine-grained causal understanding may be theoretically achievable given enough time and computational resources. A more optimistic reference case is gene editing, where scientists—through immense trial and error—have gained a nontrivial ability to make a priori predictions from genomic data alone. But even this scientific problem is largely unsolved. Because the inherent predictability of the genome-environment system may be low, even the gene-editing strategies scientists are comparatively confident in today will almost certainly come with unforeseen side effects arising from extremely complex and unpredictable epigenetic dynamics.
A counterargument is that researchers have readier access to internal data for neural nets than they do for brains, and that this advantage could be a game-changer. However, it does not address the core issue of computational irreducibility, which seems to be an inevitable consequence of using deep learning rather than lower-complexity paradigms (like software designed by a human software engineer). Given that core issue, easier access to internal data seems likely to yield only a moderate increase in the rate of solving neural-net interpretability relative to the rate of solving neuroscience interpretability. This would make the difficulty of solving neuroscience, and the prematurity and wasted resources of the Human Brain Project, apt cautionary tales for a safety plan based on fine-grained interpretability of AGI-scale neural nets, even if Anthropic’s brilliant, well-meaning, and well-funded researchers are the ones carrying it out.
Computational irreducibility implies that achieving a fine-grained causal understanding of a large-scale model will realistically require patient trial-and-error, in the precise environment in which the model is set to run. Mechanistically interpreting large-scale models is unlikely to provide a reliable safety plan, since it probably cannot efficiently yield the necessary predictions.
And since it is likely easy to make a large-scale model accidentally dangerous, an unprecedented security mindset is probably needed, both from a moral perspective and from the perspective of avoiding a movement-wide credibility risk for EA. This is especially true for Anthropic, a for-profit corporation whose credibility is intertwined with that of the EA movement, and whose $580 million Series B round was famously led by Sam Bankman-Fried (himself another movement-wide credibility risk, one that has recently blown up and likely rises to the level of fraud).
If Anthropic is publicly blamed (fairly or not) for causing a future AGI disaster, this may cause a substantial credibility loss for the EA movement as a whole. Combined with the loss already caused by Sam Bankman-Fried, the EA movement’s credibility—on AI safety matters and in general—may never recover.
2. Eliciting Latent Knowledge from fine-grained internal data may be intractable if researchers only have a short amount of time.
See Point 1.
3. Coarse-grained models of AGI systems (e.g., game theory) have a realistic chance of efficiently making robust predictions out-of-distribution, but these predictions will be limited in scope.
The theory of computational irreducibility can explain why success cases in out-of-distribution predictions of complex systems—in biology, economics, and neuroscience—have generally arisen from coarse-grained information. Of course, a necessary condition for a coarse-grained model of a system to yield robust predictions is that the model is based on decently correct first principles.
Even when coarse-grained models based on decently correct first principles can yield robust predictions, however, these predictions then necessarily turn out to be quite limited in scope. To quote Mathematical Models of Social Evolution by McElreath and Boyd:
“Simple formal models can be used to make predictions about natural phenomena. In most cases, these predictions are qualitative. We might expect more or less of some effect as we increase the real-world analogue of a particular parameter of the model. For example, a model of voter turnout might predict that turnout increases in times of national crisis. The precise amount of increase per amount of crisis is very hard to predict, as deciding how to measure “crisis” on a quantitative scale seems like a life's work in itself. Even so, the model may provide a useful prediction of the direction of change. These sorts of predictions are sometimes called comparative statics. And even very simple models can tell us whether some effect should be linear, exponential, or some other functional form. All too often, scientists construct theories which imply highly nonlinear effects, yet they analyze their data with linear regression and analysis of variance. Allowing our models of the actual phenomenon to make predictions of our data can yield much more analytic power. In a few cases, such as sex allocation, models can make very precise quantitative predictions that can be applied to data.”
The tone of the last sentence is arguably too optimistic. Cases in which coarse-grained models can correctly make precise quantitative predictions, like Fisher’s principle of sex allocation, are extremely rare. Fisher’s principle that most sexually reproducing species produce 50% male and 50% female offspring has been called “probably the most celebrated argument in evolutionary biology” precisely because it makes such a precise quantitative prediction, which is the exception in the field, not the norm. And even when such precise predictions are possible, they are necessarily quite limited in scope: sex allocation is only one of countless phenotypic dimensions relevant to a given biological agent. Most aspects of phenotype are probably subject to a high level of inherent unpredictability due to computational irreducibility. This is why the best we biologists and economists can usually do for robust prediction is comparative-statics predictions or functional-form predictions, both of which involve significant information loss.
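To show why Fisher’s principle can make a precise quantitative prediction from coarse-grained reasoning alone, here is a minimal sketch (my own toy calculation, not from McElreath and Boyd; the brood and population sizes are arbitrary): producing the rarer sex yields more expected grandchildren, and the payoffs of the two strategies equalize only at a 50:50 population ratio.

```python
# Minimal sketch of Fisher's sex-allocation argument (my own toy calculation,
# not from McElreath and Boyd). With a fixed offspring budget, a parent's
# expected number of grandchildren via sons vs. daughters depends on the
# population sex ratio; the two payoffs are equal only at a 50:50 ratio.

def grandchildren(frac_daughters: float, pop_frac_female: float,
                  brood: int = 10, pop_offspring: int = 1000) -> float:
    """Expected grandchildren for a parent producing `frac_daughters` daughters,
    in a population whose offspring cohort is `pop_frac_female` female."""
    daughters = brood * frac_daughters
    sons = brood * (1 - frac_daughters)
    n_females = pop_offspring * pop_frac_female
    n_males = pop_offspring * (1 - pop_frac_female)
    # Every grandchild has exactly one mother and one father, so expected
    # matings per female are pop_offspring / n_females, and per male
    # pop_offspring / n_males.
    return daughters * pop_offspring / n_females + sons * pop_offspring / n_males

for pop_frac_female in (0.3, 0.5, 0.7):
    all_sons = grandchildren(0.0, pop_frac_female)
    all_daughters = grandchildren(1.0, pop_frac_female)
    print(f"population {pop_frac_female:.0%} female: "
          f"all-sons strategy -> {all_sons:.1f} grandchildren, "
          f"all-daughters strategy -> {all_daughters:.1f}")
```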
The above lesson has the following implication: when developing safety plans for AGI-scale models, complexity is likely to be the enemy of security. Assuming a realistic amount of time and computational resources, simple safety plans based on coarse-grained models (e.g., limiting the action space of the AGI, informed by game theory and mechanism design) are the ones with a realistic chance of working in a robust, generalizable manner. And the fact that simple safety plans cannot quickly give for-profit AGI corporations everything they want—an AGI that is simultaneously safe and economically useful—should not become a source of motivated reasoning that such plans can be magically improved upon by interpreting fine-grained internal data.
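To illustrate the kind of simple, coarse-grained safety plan meant here, below is a minimal hypothetical sketch of an action-space restriction at the model-environment interface. The function and action names are invented for illustration; the point is that the guarantee rests on the interface rather than on interpreting the model’s internals.

```python
# Minimal sketch (my own hypothetical illustration) of a coarse-grained safety
# measure: restrict the action space at the interface between a model and its
# environment. The guarantee rests on the interface, not on interpreting the
# model's internals.

from typing import Callable

ALLOWED_ACTIONS = {"answer_question", "ask_clarification", "decline"}

def constrained_step(policy: Callable[[str], str], observation: str) -> str:
    """Query the (untrusted, uninterpreted) policy, but only ever pass a
    whitelisted action on to the environment."""
    proposed = policy(observation)
    if proposed not in ALLOWED_ACTIONS:
        return "decline"  # fail closed on anything outside the whitelist
    return proposed

# Hypothetical usage with a stand-in policy:
def toy_policy(observation: str) -> str:
    return "send_network_request"  # whatever the model happens to propose

print(constrained_step(toy_policy, "user asks for the weather"))  # -> "decline"
```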
Acknowledgements: I am grateful to Spencer Becker-Kahn, Michael Chen, Johannes Gasteiger, Harrison Gietz, Gwern, Adam Jermyn, Arun Jose, Jonathan Mann, Xavier Roberts-Gaal, and Luke Stebbing for providing very valuable feedback on this draft.
A friend referred me to this post at the end of a long message exchange about my reasons why mechanistic interpretability does not and cannot contribute to long-term AGI safety.
Pasting my side of that message exchange below.
Let me start with what I wrote near the end:
Most of my side of that message exchange (starting with a side-tangent):
On why I think Eliezer Yudkowsky does not advocate for people to try to prevent AGI from ever being built:
On the notion of built-in AGI alignment:
On a starting angle for why mechanistic interpretability is a non-starter to begin with:
On other angles for why mechanistic interpretability is insufficient (summarised briefly):
On an overview of theoretical limits:
Returning to the overarching points on why mechanistic interpretability falls short:
On fundamental dynamics that are outside the scope of application of mechanistic interpretability:
On whether I think mechanistic interpretability would at least be helpful:
On ways the reverse engineering analogy by Olah is unsound:
On destabilising internals–environment feedback loops:
On why "inspect internals" and "inspect externals" both fall short:
On NNs as spaghetti code:
Returning to Olah's reverse engineering analogy:
On the key sub-arguments for why long-term safe AGI is impossible:
On ecosystems being uncomputable:
Original researcher's response on theoretical limits of engineerable control:
Original researcher's response on mechanistic interpretability:
-> Actually, my conversation partner is just double-checking where I quoted them. I intend to post the edited exchange here by Mon 26 Dec. [Update: they said it was fine]