TL;DR: A strategy aiming to elicit latent knowledge (or to make any hopefully robust, hopefully generalizable prediction) from interpreting an AGI’s fine-grained internal data may be unlikely to succeed, given that the complex system of an AGI’s agent-environment interaction dynamics will plausibly turn out to be computationally irreducible. In general, the most efficient way to predict the behavior of a complex agent in an environment is to run it in that exact environment. Mechanistic interpretability is unlikely to provide a reliable safety plan that magically improves on the default strategy of empiricism. Coarse-grained models of the complex system have a realistic chance of making robust predictions out-of-distribution, although such predictions would then necessarily be limited in scope. 



The paradigm of deep learning, potentially with a small extra step, seems likely sufficient for creating AGI. This is because AI capabilities seem likely to continue increasing, given empirical scaling laws, the decreasing cost of compute, and AI corporations’ unabating desire to build increasingly powerful AI. The most recent findings from neuroscience also suggest that across different species, scale is the causal factor behind most general capabilities.

Biological systems, like those studied in evolutionary biology and neuroscience, are plausibly the best reference class we have for large-scale deep-learning models. To quote arguably the world’s leading expert in interpreting neural nets, Chris Olah:

“The elegance of ML is the elegance of biology, not the elegance of math or physics. 

Simple gradient descent creates mind-boggling structure and behavior, just as evolution creates the awe inspiring complexity of nature.”

For a more detailed case, see (1) Olah’s excellent writeup "Analogies between biology and deep learning" as well as (2) the review paper by Saxe et al. about the analogy between deep learning and systems neuroscience.


Computational irreducibility

Predicting the dynamics of biological systems is often computationally intractable. To quote "The limits to prediction in ecological systems" by Beckage et al.: 

“An ecological system changes through time, updating its state continuously, and the process of system evolution can be thought of as computation…Our use of the term ‘system evolution' is much broader in meaning than biological evolution, and also includes changes in abundance, location and interactions between individuals irrespective of species, and the interface between the biotic and abiotic components of the system, e.g., flux of nutrients, water, etc. Predictive models, then, are able to forecast the future state of the system before the system performs the intermediate computations to reach its updated state. An astronomical model, for example, might predict the earth's position and orientation relative to the sun millions of years into the future, without the need for the solar system to perform the intervening computations. The intervening computations that the system performs can be circumvented to predict its future state. The astronomical model might, for example, be used to predict the latitudinal and seasonal distribution of insolation on earth, which describes the past orbital forcing of the climate system, for comparison with paleoclimatic data (e.g., EPICA community members 2004)."

Unlike astronomical systems, however, ecological systems often have computationally irreducible dynamics.

"Computational irreducibility refers to systems where the intervening computations cannot be bypassed using a simplified model. The dynamics of a system that is computationally irreducible cannot be predicted without allowing for the actual evolution of the system…A simplified model that can predict the future state of the system does not exist. Thus, the only way to ascertain the future state of the system is to allow the system to evolve on its own characteristic time scale. Computational irreducibility does not imply that the underlying processes are stochastic or chaotic, but that they are complex. The complexity can be manifested through high levels of contingency, interactions among system components, and nonlinearity. In fact, systems that evolve according to simple deterministic processes with exactly-known initial conditions can give rise to a complex state evolution that cannot be exactly predicted except by allowing the system to evolve in real time…Computational irreducibility, furthermore, may be a fundamental characteristic of some systems and does not merely reflect a shortcoming of proposed models or modeling techniques. This implies that the intrinsic predictability of these systems is inherently low.”
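As a minimal illustration of this idea (not taken from Beckage et al.), consider the Rule 30 cellular automaton, a standard example of a simple deterministic rule with an exactly known initial condition whose center column is widely conjectured to have no general shortcut. The Python sketch below obtains the center cell at a given step only by computing every intermediate row. The function names and the fixed-width grid are assumptions made for illustration.

```python
def rule30_step(cells):
    """Apply one synchronous update of Rule 30: new = left XOR (center OR right)."""
    n = len(cells)
    return [cells[(i - 1) % n] ^ (cells[i] | cells[(i + 1) % n]) for i in range(n)]


def rule30_center_column(steps):
    """Return the center cell's value at steps 0..steps, starting from one live cell."""
    # Grid wide enough that the growing pattern never reaches the wrap-around edges.
    width = 2 * steps + 3
    center = width // 2
    cells = [0] * width
    cells[center] = 1
    column = [cells[center]]
    for _ in range(steps):
        cells = rule30_step(cells)  # every intermediate row is actually computed
        column.append(cells[center])
    return column


if __name__ == "__main__":
    print(rule30_center_column(6))  # [1, 1, 0, 1, 1, 1, 0]
```

By contrast, the astronomical model in the earlier quote is the kind of system where a formula lets one jump straight to a future state without stepping through the intervening ones.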

The computational irreducibility of ecological systems is due to “(1) the difficulty of pre-stating the relevant features of the niche, (2) the complexity of ecological communities, and (3) their potential to enable novel system states (e.g., niche creation).” As a result, the inherent predictability of ecological systems out-of-distribution is generally low. Some compelling examples of this phenomenon include the inherent unpredictability of how ecological systems will respond to future climate change and the inherent unpredictability of invasive species impact. An analogous argument correctly deduces that epigenetic systems (of genome-environment interactions) and neurocognitive systems (of brain-environment interactions) have computationally irreducible dynamics.

The three factors causing computational irreducibility in biological systems (composed of biological agents and their environments) also seem likely to apply to AGI-scale deep-learning systems (composed of AI agents and their training or deployment environments).


Implications for predictability of agent-environment interactions

What are the implications of computational irreducibility for high-complexity systems like biological systems and future AGI systems? To again quote Beckage et al.:

“Complex systems with extremely large numbers of components can sometimes become predictable from a macro-level perspective due to the averaging of a very large number of separate interactions. In statistical physics, for example, an approximate description of the mean state of a gas is possible without an exact description of the velocities and locations of each molecule; the temperature and pressure of a gas can be described using the ideal gas law. An ecological analogy to the ideal gas law might be the species composition of a forest. Forest composition is ultimately an emergent property that results from the local interactions of many individuals and processes, i.e., seed production and dispersal, competition, growth rates, tradeoffs, etc. While it may not be possible to determine the outcome of all of these complex interactions to predict the species identity of the tree species that captures a given canopy gap, the overall composition of the forest can be predictable to some approximation…While this statistical averaging of interactions is the standard assumption in many deterministic ecological models, it may be more accurate to view most ecological systems as small-to-middle-number systems because local interactions are quite relevant in affecting system behavior. In this case, there would be a general lack of the homogeneous mixing necessary for a purely statistical mechanics view to be applicable.

We conjecture that skill in predicting the future states of ecological systems will decrease as system complexity increases up to some threshold level. Further increases in system complexity beyond this threshold, for example, through expanding interconnected components of the system or increasing potential for nonlinear interactions, will not result in further losses in predictability…Furthermore, this threshold may occur at relatively low system complexity. We note, however, that the behavior of complex ecological systems can often be readily deconstructed and explained in hindsight as the cascade of interactions and nonlinearities is examined and understood. We emphasize that this may not aid predictions of future system states due to computational irreducibility.”

A coarse-grained model, if decently correct, can make robust predictions about a computationally irreducible system, although such predictions would necessarily be informationally incomplete. This is consistent with my research background in high-complexity systems like those of biological science and social science. For such a high-complexity system, the only general hope for predictability out-of-distribution (in the absence of unrealistically large amounts of time and computational resources) is to deduce theorems from a coarse-grained model of the system based on decently correct first principles. And due to information loss, the types of predictions this method can yield are generally quite limited in scope.
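A toy illustration of this trade-off, assuming the logistic map x -> 4x(1 - x) as a stand-in for a system whose fine-grained trajectory is hard to forecast. The map is chaotic, which is a different property from computational irreducibility, so this is only an analogy. Two runs from almost identical starting points soon disagree step by step, yet their long-run averages agree. The function names, starting values, and step count are arbitrary choices for this sketch.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate the logistic map x -> r * x * (1 - x) and return all visited states."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def compare_runs(x0=0.2, perturbation=1e-10, steps=20_000):
    """Compare fine-grained (step-by-step) and coarse-grained (time-average) agreement."""
    a = logistic_trajectory(x0, steps)
    b = logistic_trajectory(x0 + perturbation, steps)
    # Fine-grained view: largest pointwise disagreement between the two runs.
    max_gap = max(abs(u - v) for u, v in zip(a, b))
    # Coarse-grained view: time averages, which discard all ordering information.
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return max_gap, mean_a, mean_b


if __name__ == "__main__":
    max_gap, mean_a, mean_b = compare_runs()
    print(f"largest pointwise gap: {max_gap:.3f}")
    print(f"time averages: {mean_a:.3f} vs {mean_b:.3f}")
```

The time average is robust to the perturbation, but it says nothing about where the system will be at any particular step, which is the sense in which a coarse-grained prediction is informationally incomplete.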

In the more optimistic case where one can fully specify a computationally irreducible agent’s environment a priori, the most efficient way to predict the agent’s behavior in that environment is to run it in that exact environment. There is no shortcut, such as one based on the agent's internal data.


What does this mean for current AI safety research directions?

Given that future AGI systems will plausibly be computationally irreducible, I think the following implications of computational irreducibility deserve consideration.


1. Fine-grained interpretability of an AGI-scale model (e.g., interpreting circuits) may be intractable if researchers only have a short amount of time.

Many AI safety researchers consider fine-grained interpretability to be a promising research direction for reducing AGI x-risks. For example, Eliezer Yudkowsky—one of the founders of the AI safety field—has remarked that the interpretability agenda is "exceptional in the field," because he thinks it is one of the few research directions in the field that is "on a pathway to anything important at all."

But this assessment may be mistaken. Due to computational irreducibility, achieving fine-grained interpretability of AGI-scale models may be intractable without unrealistically large amounts of time and computational resources.

A cautionary tale is provided by the Human Brain Project, which aimed to achieve a causally accurate predictive model of a brain. Why did the Human Brain Project crash and burn? Because computational irreducibility imposed a hard limit to the scientific upside of the project.

This was predictable, to the point where a year into the $1.6 billion project, almost a thousand neuroscientists (who could have drawn salaries for long periods of time by participating in the project) instead selflessly boycotted it as a waste of scarce scientific resources. That year, the Guardian reported:

“Central to the latest controversy are recent changes made by Henry Markram, head of the Human Brain Project at the Swiss Federal Institute for Technology in Lausanne. The changes sidelined cognitive scientists who study high-level brain functions, such as thought and behaviour. Without them, the brain simulation will be built from the bottom up, drawing on more fundamental science, such as studies of individual neurons. The brain, the most complex object known, has some 86bn neurons and 100tn connections.

‘The main apparent goal of building the capacity to construct a larger-scale simulation of the human brain is radically premature,’ Peter Dayan, director of the computational neuroscience unit at UCL, told the Guardian.

'We are left with a project that can't but fail from a scientific perspective. It is a waste of money, it will suck out funds from valuable neuroscience research, and would leave the public, who fund this work, justifiably upset,' he said.”

And the project did in fact crash and burn. To this day, progress on achieving a causally accurate model of the brain remains marginal at best.

Anthropic plans to rely on interpreting large-scale models at the circuit level. Given the hard bounds on predictability imposed by the phenomenon of computational irreducibility, the Human Brain Project’s waste of $1.6 billion and years of researchers’ time provides a cautionary tale for Anthropic on its strategy for how to use its $580 million funding. This is because the complexity and computational irreducibility of AGI-scale models likely imply that achieving a fine-grained causal understanding of AGI-scale models will require an extremely large amount of time and computational resources, just like for neuroscience.

To be clear, a fine-grained causal understanding may be theoretically achievable given enough time and computational resources. A more optimistic reference case is provided by gene-editing, where scientists—through immense trial and error—have obtained a nontrivial ability to make a priori predictions based entirely on genomic code data. But even this scientific problem is largely unsolved. Because the inherent predictability of the genome-environment system may be low, even gene-editing strategies that scientists are comparatively confident in today will almost certainly come with unforeseen side effects arising from extremely complex and unpredictable epigenetic dynamics.

A counterargument is that researchers have more ready access to internal data for neural nets than they do for neuroscience, and that this advantage can be a game-changer. However, this advantage does not seem to address the core issue of computational irreducibility, which seems likely to be an inevitable consequence of using deep learning rather than lower-complexity paradigms (like software design by a human software engineer). Given that the core issue of computational irreducibility exists, it seems plausible that easier access to internal data will only result in a moderate increase in the rate of solving neural-net interpretability compared to the rate of solving neuroscience interpretability. This would imply that the challenges of solving neuroscience, and the prematureness and resource waste of the Human Brain Project, are apt cautionary tales for a safety plan based on fine-grained interpretability of AGI-scale neural nets, even if Anthropic’s brilliant, well-meaning, and well-funded researchers are the ones carrying out this safety plan.

Computational irreducibility implies that achieving a fine-grained causal understanding of a large-scale model will realistically require patient trial-and-error, in the precise environment in which the model is set to run. Mechanistically interpreting large-scale models is unlikely to provide a reliable safety plan, since it probably cannot efficiently yield the necessary predictions.

And since it is likely easy to make a large-scale model accidentally dangerous, an unprecedented security mindset is probably needed, both from a moral perspective and from the perspective of avoiding a movement-wide credibility risk for EA. This is especially true for Anthropic, a for-profit corporation whose credibility is intertwined with that of the EA movement, and whose $580 million Series B funding was famously raised by lead investor Sam Bankman-Fried (another movement-wide credibility risk that has recently blown up, even likely rising to the point of fraud). 

If Anthropic is publicly blamed (fairly or not) for causing a future AGI disaster, then this may cause a substantial credibility loss for the EA movement as a whole. Combined with the one caused by Sam Bankman-Fried, the credibility of the EA movement—on AI safety matters and in general—may not be able to recover.


2. Eliciting Latent Knowledge from fine-grained internal data may be intractable if researchers only have a short amount of time.

See Point 1.


3. Coarse-grained models of AGI systems (e.g., game theory) have a realistic chance of efficiently making robust predictions out-of-distribution, but these predictions will be limited in scope.

The theory of computational irreducibility can explain why success cases in out-of-distribution predictions of complex systems—in biology, economics, and neuroscience—have generally arisen from coarse-grained information. Of course, a necessary condition for a coarse-grained model of a system to yield robust predictions is that the model is based on decently correct first principles. 

Even when coarse-grained models based on decently correct first principles can yield robust predictions, however, these predictions then necessarily turn out to be quite limited in scope. To quote Mathematical Models of Social Evolution by McElreath and Boyd:

“Simple formal models can be used to make predictions about natural phenomena. In most cases, these predictions are qualitative. We might expect more or less of some effect as we increase the real-world analogue of a particular parameter of the model. For example, a model of voter turnout might predict that turnout increases in times of national crisis. The precise amount of increase per amount of crisis is very hard to predict, as deciding how to measure “crisis” on a quantitative scale seems like a life's work in itself. Even so, the model may provide a useful prediction of the direction of change. These sorts of predictions are sometimes called comparative statics. And even very simple models can tell us whether some effect should be linear, exponential, or some other functional form. All too often, scientists construct theories which imply highly nonlinear effects, yet they analyze their data with linear regression and analysis of variance. Allowing our models of the actual phenomenon to make predictions of our data can yield much more analytic power. In a few cases, such as sex allocation, models can make very precise quantitative predictions that can be applied to data.”

The tone of the last sentence is arguably too optimistic. The cases in which coarse-grained models can correctly make precise quantitative predictions, like Fisher’s principle of sex allocation, are extremely rare. Fisher’s principle that most sexually reproducing species produce 50% male and 50% female offspring has been called “probably the most celebrated argument in evolutionary biology” because it makes such a precise quantitative prediction, which tends to be the exception in the field, not the norm. And even when such precise predictions are possible, they are likely to be quite limited in scope: sex allocation strategy is one of countless phenotypic dimensions that are relevant for a given biological agent. Most aspects of phenotype are probably subject to a high level of inherent unpredictability due to computational irreducibility. This is why the best we biologists and economists can do for robust prediction usually consists of comparative-statics predictions or functional-form predictions that are subject to significant information loss.
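The logic behind Fisher’s principle can be sketched numerically. The block below uses the textbook Shaw-Mohler form of the argument, which is an assumption brought in for illustration and is not spelled out in this post: a rare parent producing a fraction s_mut of sons, in a population producing a fraction s_pop of sons, leaves grandchildren in proportion to s_mut/s_pop + (1 - s_mut)/(1 - s_pop). The function and variable names are hypothetical.

```python
def relative_fitness(s_mut, s_pop):
    """Shaw-Mohler fitness of a rare mutant sex ratio, scaled so residents score 1.0.

    s_mut: fraction of sons produced by the mutant parent.
    s_pop: fraction of sons produced by the resident population (0 < s_pop < 1).
    """
    return 0.5 * (s_mut / s_pop + (1.0 - s_mut) / (1.0 - s_pop))


def best_response(s_pop, candidates):
    """Return the candidate sex ratio with the highest fitness against s_pop."""
    return max(candidates, key=lambda s: relative_fitness(s, s_pop))


if __name__ == "__main__":
    grid = [i / 10 for i in range(11)]  # candidate fractions of sons: 0.0 .. 1.0
    print(best_response(0.3, grid))  # sons are rare, so son-biased broods pay best
    print(best_response(0.7, grid))  # daughters are rare, so daughter-biased broods pay best
    print(relative_fitness(0.9, 0.5))  # at 50/50, a deviating parent gains nothing
```

Under these assumptions, any population away from 50/50 can be invaded by parents who overproduce the rarer sex, and only the 50/50 population leaves no deviation with an advantage. The model pins down this one quantity precisely while remaining silent on every other trait of the organism.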

The above lesson has the following implication. When developing safety plans for AGI-scale models, complexity is likely to be the enemy of security. Assuming a realistic amount of time and computational resources, simple plans for AI safety based on coarse-grained models (e.g., limiting the action space of the AGI, based on game theory/mechanism design) are the ones that have a realistic chance of working in a robust, generalizable manner. And the fact that simple safety plans are insufficient to quickly provide for-profit AGI corporations with everything they want—an AGI that is simultaneously safe and economically useful—should not be used as a source of motivated reasoning that these plans can be magically improved on by a strategy of interpreting fine-grained internal data.


Acknowledgements: I am grateful to Spencer Becker-Kahn, Michael Chen, Johannes Gasteiger, Harrison Gietz, Gwern, Adam Jermyn, Arun Jose, Jonathan Mann, Xavier Roberts-Gaal, and Luke Stebbing for providing very valuable feedback on this draft.



A friend referred me to this post — at the end of a long message exchange about my reasons for why mechanistic interpretability does not and cannot contribute to long-term AGI safety.

Pasting my side of that message exchange below.

Let me start with what I wrote near the end:

The post "The limited upside of interpretability" is excellent btw. Just reading through, thank you.

One claim I think is not capturing ecosystem complexity right is the computational irreducibility point. Flawlessly (ie. deterministically) computing an algorithm that has the shortest length (equivalent to Kolmogorov Complexity) of any algorithm that produces its outputs is computationally irreducible in the sense that you cannot just run a shorter-length algorithm requiring fewer computational operations to replace it.

An ecosystem (whether based around carbon-centered DNA/RNA or solid-state-lattice-embedded transistors) is uncomputable. This is because part interactions within that ecosystem involve noise interference at various levels, which feeds chaotic dynamics that in turn result in the formation of new structure/parts. You simply cannot scan the parts of that ecosystem and simulate it faithfully on the equivalent of a Turing machine.

You can make some higher-level predictions about long-term convergent outcomes of that ecosystem (as Forrest is doing, for parts of the human-carbon ecosystem in interactions with parts of the AI-solid-state ecosystem).


Most of my side of that message exchange (starting with a side-tangent):

On why I think Eliezer Yudkowsky does not try to advocate for people to try to prevent AGI from ever being built:

[2:09 pm, 08/11/2022] Remmelt:
Yeah, good question!

My impression from reading recent writings by Eliezer is that he mentally models the social coordination problem (which is definitely hard) as practically impossible because you have all these independent actors making decisions about developing advanced AI systems and you cannot really force them in any grand orchestrated scheme to stop them from acting.

[2:10 pm, 08/11/2022] Remmelt:
> Is he, relative to you, just too pessimistic that this is possible, [...]
And based on that he concludes something like you said.

[2:10 pm, 08/11/2022] Remmelt:
But humans are not independent actors. We’re social actors, and social interactions imply a sort of interdependence in people’s actions.

[2:12 pm, 08/11/2022] Remmelt:
Also, we share the same needs to continue to exist as wetware containing soups of carbon-chained molecules.

[2:12 pm, 08/11/2022] Remmelt:
And we share billions of years of shared evolutionary history.

[2:13 pm, 08/11/2022] Remmelt:
So there is a much stronger basis for human-human alignment for continued existence. Not so between humans and AI.

[2:13 pm, 08/11/2022] Remmelt:
Actually, let me copy over a longer message I sent someone for further context:

[2:18 pm, 08/11/2022] Remmelt:
A researcher emailed me that a long-standing intuition of his has been that "Aligned AI" is as much of a fairytale as a perpetual motion machine, and so we should just stop AI development. And that the reactions he got when expressing that view in the alignment community were usually something like “Well that's just not going to happen, the capitalist forces driving R&D cannot be overcome, so the best we can do is work tirelessly for the small but nonzero chance that we can successfully align AGI."

Here was my response: Yeah, I also got this reaction a few times.

It makes me wonder what these people mentally represent "AGI" to be like. Some common representational assumptions in the community seem to be that "AGI" would be (a) coherently goal-directed unitary agent(s) that optimise for specific outcomes across/within the outside messy world and that avoid causing any destabilising side-effects along the way.

I personally have a different take on this:
Their claim implies for me that for us humans to coordinate, as individual living bodies who can relate deeply based on their shared evolutionary history and needs for existence, is going to be harder than to 'build' perpetual alignment into machines that are completely alien to us and that over time can self-modify and connect up hardware any way they're driven to.

Here you bring this artificial form into existence that literally needs netherworldly conditions to continue to exist (survive) and expand its capacity (grow/reproduce). Where given the standardisation of its hardware/substrate configurations, it does not face the self-modification or information-bandwidth constraints that we humans, separated by our non-standardised wetware bodies (containing soups of carbon-centered-molecules), face.

This artificial form will act (both by greedy human design for automation and by instrumental convergence) to produce and preserve more (efficient) hardware infrastructure for optimising selected goals (ie. the directed functionality previously selected for by engineers and their optimisation methods, within the environmental contexts the AI was actually trained and deployed in).

These hardware-code internals will have degrees of freedom of interaction for/with changing conditions of the outside world that are for sure over the long term going to feed back into a subset of those existing internal configurations through their previous outside interactions replicating more frequently and sustainably through existing and newly produced hardware. Eg. through newly learned code variants co-opting functionality, inherent noise interference on the directivity/intensity of transmitted energy signals, side-branching effects on adjacent conditions of the environment, distribution shifts in the probabilities of possible inputs received from the environment, non-linear amplification of resulting outside effects through iterative feedback cycles, etc.

The resulting artificial forms will at various levels exchange resources with each other, and be selected by market forces for their capacities to produce, similar to how humans are. And unlike humans, they are not separable as individual bandwidth-constrained agents.

Some AIS researchers bring up the unilateralist curse here – that a few of the many human market/institutional actors out there would act unsafely out of line with the (implicit) consensus by developing unsafe AGI anyway. But if we are going to worry about humans' failure to coordinate, then getting all such conflicted humans to coordinate to externally enforce coordination on distributed self-learning hardware internals – that in aggregate exhibit general capabilities but have no definable stable unit boundaries – is definitely out of the question.

On the notion of built-in AGI alignment:

[10:46 am, 10/11/2022] Remmelt:
> I consider this more rhetorically interesting than an actual argument tbh. For example: I think the claim that these machines will be alien to us is too sloppy. We will build them ourselves. […]
> Additionally, solving alignment may work without solving coordination problems. If alignment research reduces the alignment tax sufficiently, then *everyone* is incentivized to use these solutions, so there is no coordination problem. Additionally, coordination problems between AIs may not lead to disaster if these AIs are aligned and therefore all want to preserve human flourishing (whatever that means).

No, I think to be honest that you are assuming here that you can boil down something that is super complex in terms of how AGI internals are interacting with the world into an engineering problem, one that assumes that one alignment solution exists that everyone is going to agree with (where in actual fact people living in different contexts, eg. places and groups, disagree on what is objectively valuable and how to judge and implement what is valuable all the time).

On a starting angle why mechanistic interpretability is a non-starter to begin with:

[10:46 am, 10/11/2022] Remmelt: 
> And by mechanistic interpretability, we may eventually understand them better than the human brain since neuroscience is harder.

Mechanistic interpretability is a non-starter to begin with.

[11:34 am, 10/11/2022] Remmelt:
To start, you’re inspecting internals. But what matters here is the effects that the (changing) internals’ interactions with connected surroundings of the environment have.
So you actually have to account for interactions of a larger environment, not just the tiny subset of interactions that can be modelled as the (presumed flawless) computation of outputs from inputs by the underlying hardware.

[11:36 am, 10/11/2022] Remmelt:
So that is one clear reason why any technique for interpreting or inspecting internals is not going to be able to account for most of the effects that internals have in interaction with the outside environment.

[11:37 am, 10/11/2022] Remmelt:
It would also be unsound to assume that natural abstractions of features (as tracking some of the regularities in the outside world) that can be mechanistically interpreted are all that matter for ensuring long-term AGI safety.


On other angles why mechanistic interpretability is insufficient (summarised briefly):

[11:41 am, 10/11/2022] Remmelt:
Here is a list that includes other key reasons why mechanistic interpretability is not going to be sufficient in order to detect (or simulate) effects misaligned with human survival to correct (or prevent up front):

1. Non-Distinguished Internal Code Variants
Multiple variations on code can approximate the same function (outputs measured for distribution of inputs).

2. Unmeasured Degrees of Freedom
A single code unit can implement multiple functions.

3. Hidden Potential of a Computed Variant
Unknown probabilities of a code unit causing possible side-effects via available input/output interactions.

4. Greater Bit Size of Code that Monitors Any Code’s Effects
Takes strictly more code (multiple units) to even monitor the local environmental effects of a single code unit.

5. Combinatorial Explosion in Machine Learning
It is increasingly ambiguous to define and detect novel errors to correct for at higher layers of abstraction.

6. Limits to Errors Detectable by Mechanisms Implemented
Code can implement a limited set of detection mechanisms over combinatorially complex code (incl. itself) that over all I/O connections could cause many more diverse errors than detectable by those mechanisms. Undetectable backdoors can be planted in the variations introduced by model initialisation and training.

7. Time Delays to Correcting Errors
Misalignments cannot be corrected continuously to stay in safe error bounds, given delays from detecting changes caused by code to measuring any degree of misalignment over semantically relevant dimensions.

8. Computationally-Irreducible Properties of Initialized Code (formal verification limits)
Whether even a tiny unit computed flawlessly (eg. a Busy Beaver) misoperates (eg. halts prematurely) is often only predictable (with logical certainty) by computing a copy faster (if future inputs computable, ie. not in ML).

9. Incompleteness of Computation (simulation limits)
Logical operations on binary code are strictly a subset of all interactions that internal code units have, as necessarily computed through hardware in interactions with connected surroundings of the environment. Since computation of code is a tiny subset of those physical interactions, code cannot model all of them.

10. Inconsistent Conservation of Alignment Properties
No mechanism(s) can maintain alignment over the space of all code stored + transformed over time, as (complex) (state/output) transitions of code variants change connected surroundings nondeterministically. You cannot run deterministic mechanisms within a nondeterministic environment and expect deterministic results.

11. Limits of Measurability
A sensor cannot measure (accurately) below its smallest digit increment or detection limit (> Planck limit).
Localized details are filtered out from content or irreversibly distorted in transmission over distances.

12. Incomputable Trajectories of Chaotic Effects
Nonlinear feedback cycles can amplify a smaller-than-measurable local change into a large global divergence of final conditions (like DNA code being expressed as chaotic and extended phenotypic effects). 
Natural abstractions represented within the mind of a distant observer cannot be used to accurately and comprehensively simulate the long-term consequences of chaotic interactions.

13. Local Selection Defines Directedness of Function Implementation
Variants’ hardware-based interactions with local structures of the (simulated) environment get selected (in training/deployment) to function in directed ways. Functionality aligned with intent is fragile to distribution shifts. Correcting internals for detectable misalignments is selecting for undetectable misalignments.

14. Persistence of Function Implementations that Feed Back into Code’s Continued Existence
AI variants’ local interactions with the outside world happen to cause effects leading to the continued existence of more code with those effects. AI is not a neutral tool for augmenting human intelligence.
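
To make points 11 and 12 above concrete, here is a minimal sketch using the logistic map as a stand-in for a nonlinear feedback cycle. The map, the parameter r = 4, the starting value, and the sensor resolution of 1e-6 are all assumptions chosen for illustration; none of them come from the list above. Two starting states that a hypothetical sensor cannot tell apart end up on visibly different trajectories.

```python
def logistic_step(x, r=4.0):
    """One step of the logistic map, a standard toy model of chaotic feedback."""
    return r * x * (1.0 - x)


def measure(x, resolution=1e-6):
    """Hypothetical sensor: reports x rounded to its smallest digit increment."""
    return round(x / resolution) * resolution


def divergence(x0, delta, steps):
    """Absolute gap, per step, between two trajectories started `delta` apart."""
    a, b = x0, x0 + delta
    gaps = [abs(a - b)]
    for _ in range(steps):
        a, b = logistic_step(a), logistic_step(b)
        gaps.append(abs(a - b))
    return gaps


# The initial difference (1e-12) is far below the assumed sensor resolution.
gaps = divergence(x0=0.2, delta=1e-12, steps=100)
same_reading = measure(0.2) == measure(0.2 + 1e-12)

print("identical sensor readings at t=0:", same_reading)
print("largest gap in first 10 steps:", max(gaps[:11]))
print("largest gap over 100 steps:", max(gaps))
```

The early gap stays tiny while the later gap grows to the scale of the state itself, which is the sense in which a smaller-than-measurable local change can produce a large divergence. A one-dimensional map is of course far simpler than any real agent-environment system.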

[11:46 am, 10/11/2022] Remmelt:
Regarding reasons 2 and 3 above, here is more elaboration:

Any single piece of a neural network can capture various distinct structural causal effects, as selected for in its possible interactions with connected surroundings.

Below is an analogy:
A piece of a photograph captures one spatial detail or point location of the environment. 
- Like how a programmer imagines a bit on a harddrive to capture a local logical property. 
- Like how a reverse engineer assigns singular meanings to positions in memory. 
- Like how a mechanistic interpretability researcher “breaks neural network activations into independently understandable pieces”.

A piece of a holograph, however, captures a point of reference overviewing various structures – structures of the environment that caused light waves to reflect and be captured as interference patterns on that piece of holographic film. The piece never just stores one static frame of the environment. Rotating it to change your degrees of interaction with it (ie. degrees of your viewing angle), the piece of holograph suddenly reveals previously hidden structures.
- Like a holograph piece, a neural network circuit or module captures causal structures of the environment but with many many more degrees of outside interaction possible.
- Like a holograph piece, a polysemantic neuron encodes multiple features in parallel.  But most of the causal structure captured by circuits, modules and neural networks cannot be linearly (as series of additions and multiplications) pieced together from the input-output transformations of individual polysemantic neurons. 
- Nor can we presume that a network of artificial neurons will be involved in some non-shifting distribution of interactions with the rest of the environment. To stretch the metaphor further, we cannot accurately predict how human users will interact with a developed holograph from moment to moment, nor how the probability distribution of all human-holograph interactions (over some fixed-length time window) will shift in the future (as eg. human cultural conceptions of holographs change). Unlike a holographic film though, pieces of a neural network themselves vary about in their content over time – newly capturing selected-for effects in interactions with the rest of the environment.
- Interpreting or reverse-engineering a circuit or module as having a singular function makes the unsound presumption of monolithic representation. Over the space of all possible past-future interactions, each piece (through each decoupled interface of interaction) can be expressed as having distinct meanings and functions. 
- Similarly, we cannot carve out pieces of a human brain (‘neurons’, ‘pathways’, ‘regions’) or body (‘cells’, ‘organs’) and assign just one function to each piece as being meaningful, no matter how much we may like to try.
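
The narrowest part of the analogy above, a single neuron encoding several features at once, can be sketched with a toy example. Everything here is an assumption made for illustration: three hypothetical features are stored in two linear neurons as directions 120 degrees apart. This covers only the polysemantic-neuron case, not the wider claim about changing interactions with the environment.

```python
import math

# Assumed toy setup: three features share two neurons, as directions 120 degrees apart.
FEATURE_DIRECTIONS = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]


def neuron_activations(feature_values):
    """Linear activations of the two neurons for the given feature strengths."""
    return [
        sum(v * d[n] for v, d in zip(feature_values, FEATURE_DIRECTIONS))
        for n in range(2)
    ]


def features_driving(neuron, tol=1e-9):
    """Indices of the features that move `neuron` when switched on alone."""
    driving = []
    for k in range(3):
        one_hot = [1.0 if i == k else 0.0 for i in range(3)]
        if abs(neuron_activations(one_hot)[neuron]) > tol:
            driving.append(k)
    return driving


print("features driving neuron 0:", features_driving(0))
print("features driving neuron 1:", features_driving(1))

# Two different feature combinations that produce (nearly) the same activations.
pattern_a = neuron_activations([1.0, 0.0, 0.0])
pattern_b = neuron_activations([0.0, -1.0, -1.0])
print("pattern A:", pattern_a)
print("pattern B:", pattern_b)
```

Each neuron responds to more than one feature, and one activation pattern is consistent with more than one combination of features, so reading a single meaning off a single neuron is already ambiguous in this tiny case.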

Q: What about eliciting latent knowledge?

Where do you elicit this ‘knowledge’ from? A piece of code does not know how it got selected. A circuit gets selected for caused outputs/effects in interactions with surroundings. Interactions are not explicitly captured by outside selection – thus not reverse-engineerable from code itself.


On an overview of theoretical limits:

[12:36 pm, 10/11/2022] Remmelt:
Here is another overview I wrote recently, with points 1.-2. particularly relevant to whether simulating/detecting misalignment using mechanistic interpretability techniques is going to be sufficient for controlling the interactions of AGI internals with connected surroundings to prevent the build-up of human-species-wide lethal outside effects:



Returning to overarching points why mechanistic interpretability falls short:

[12:36 pm, 10/11/2022] Remmelt:
Anyway, that’s a lot of stuff 🙂

[12:40 pm, 10/11/2022] Remmelt:
Just these two simple points above should already clarify why mechanistic interpretability falls short:

1. Computation is a tiny subset of internals’ interactions with connected surroundings; mechanistic interpretability focusses on reverse engineering computed code itself to understand the resulting digital outputs, but ignores further environmental interactions.

2. It would also be unsound to assume that natural abstractions of features (as tracking some of the regularities in the outside world) that can be mechanistically interpreted are all that matter to ensure long-term AGI safety.



On fundamental dynamics that are outside the scope of application of mechanistic interpretability:

[9:32 am, 11/11/2022] Remmelt: 

>  Thanks for the answer!

> I’m not sure I understand this fully.
> The following would be one interpretation
> that is part of my own model
> of why interpretability is difficult:

Thank you for the paraphrases.

I like the ideas, but I do mean more fundamental dynamics that are in some sense simpler than what you came up with.

If you have an AGI model that is learning new internal ‘code variants’ (introducing variations on the parameters of functional code units, eg. of neural network weights and biases) then we are not only concerned with what the model is thinking or how it is behaving internally. We are concerned with how the outputs of the model are processed through the wider environment (that includes other agents like humans observing the represented outputs somehow and acting on those outputs as a result).

And unfortunately, we cannot use mechanistic interpretability to detect how the outputs (computed by code unit variants in aggregate, contingent on the inputs) are going to be processed and passed on as effects within the wider environment. Nor is any other measurement or data processing technique going to be able to account sufficiently for how outputs are processed through connected surroundings of the environment (natural abstractions simply will not be able to capture all the complex dynamics taking place in that bigger environment).

That is, not sufficient to account for how external effects of the model’s operation will feed back into what sensor inputs the model might receive and in turn are transformed into new outputs, etc.

And also not how those external effects feed back into what (variants of) models even continue to physically exist (as configurations of a substrate) and function at increasing scale in the environment over time.


On whether I think mechanistic interpretability would be helpful at least:

[9:40 am, 11/11/2022] Remmelt:
Helpful for interpreting narrow AI.

For any AGI development, mechanistic interpretability will basically give false confidence to engineers that they would actually be able to sufficiently detect misaligned effects of AGI.

For the leaders of AGI R&D labs, mechanistic interpretability is a great marketing tool to show (visually) how their company is ‘ensuring’ their technology is ‘safe and aligned’. In other words, mechanistic interpretability will be used for ‘align-washing’ like how big fossil-fuel companies engage in green-washing.

Given that not having human society and the rest of our shared ecosystem destroyed seems like a priority above understanding and improving narrow AI, I have to conclude that current research efforts to do mechanistic interpretability are net really unhelpful.

AGI researchers and alignment researchers can basically use it as a crutch to confirm that they can make progress (while missing/ignoring larger long-term problems).


[9:45 am, 11/11/2022] Remmelt:
BTW, I’m saying this because I want to be realistic and honest about what the current situation is, so we can act to safeguard life on Earth. I don’t mean to be cynical. I mean to clarify things that are constructive for finding an alternate path forward.

[9:47 am, 11/11/2022] Remmelt:
Also, it’s not fun to in effect tell people in your ingroup that the work they’re doing is useless for AGI safety. I’m really not personally benefiting from that, but it’s important to get clear here.
[12:51 pm, 11/11/2022] Remmelt:
(I can respond to your responses on the overview list later, but want to be mindful of throwing too much at you at once. Seems better in our conversation to precisely dig into a few key statements and underlying cruxes).



On ways the reverse engineering analogy by Olah is unsound:

[13:21, 11/11/2022] Remmelt:

> See the answer by Chris Olah here on his vision why the curse of dimensionality may not be unsurpassable: 

The problem IMO is that Chris Olah is making unsound representational assumptions (ie. unsound premises for how he can model interactions of code units) by comparing interpreting neural networks to reverse engineering code. Only at the end of his piece does he briefly mention the simplest least-hard-to-solve instance of the problem not addressed by his assembly-code reverse engineering analogy: polysemantic neurons. 

Before we can make progress in our conversation here, or in any other part of our conversation, I first have to clarify why I think you (or Olah, etc) are making unsound representational assumptions.

With the photograph vs. holograph analogy above, I tried to point at a different and IMO much more sound way of representing the functionality of code. But I get that this raises new questions about what definitions I'm using for terms, you want to precisely dig into the arguments step by step, etc. 

So for now, I'll just email you a more specific list of reasons why the reverse engineering analogy is unsound, and leave it at that.

[Emailed list:]
See below, only if you want to.
Note: I split sentences into lines
to make any claim or argument
more parsable step-by-step. 

   Let me clarify below reasons
   why I think Chris Olah's analogy between
   reverse engineering software binaries/assembly
   and mechanistic interpretability is unsound
   when it comes to whether and how
   one would sufficiently detect and correct 
   human-relevant errors in the code
   of large neural network architectures
   (particularly, any with general functionalities
    to enact changes in the global environment).

   What I meant to clarify here
   was that the way one would
   need to go about inspecting
   all the combinatorial complexity
   you would need to inspect
   to interpret the functions
   of NN-based architectures
   (particularly those exhibiting
    general capabilities)
   is much different than 
   the way you would go about
   inspecting the binary/assembly code
   of some kind of large software stack
   running on a computer
   (based on my amateur understanding
    of what reverse engineering
    software code usually involves).

   1. Combinatorial complexity

   There seem to be many many 
   more degrees of freedom of interaction
   between all neurons (in/across layers) of an NN,
   than you can expect to see across
   the same amount of GB of binary code
   of some software program,
   which usually is much more constrained
   in terms of the number of interactions
   that are possible between possible elements
   above some threshold
   of negligible probability of the possibilities.

   Of course, if you look at software function calls,
   you can expect those to be reasonably complex as well
   (definitely not as complicated as neural networks,
     but a spaghetti nonetheless).

   But initialised and stochastically selected-for
   collections of connected neurons are different
   from explicitly coded-for functions
   stored in particular parts of computer memory,
   (where there is going to be just *some* messiness
    around what functions/processes
    call what functions).
   For any prosaic NN-based AGI
   all of the stupendous number
   of neural network functions implementable 
   for all received inputs over time
   by all learned/selected internal code units
   through which human-relevant errors ('misalignments')
   could be caused in the outside world
   are just not trackable, by a very large margin.
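
As a rough worked example of the scale involved, the sketch below counts weights and distinct input-to-output paths in a small fully connected network. The layer sizes are hypothetical, and counting paths is only one crude proxy for "degrees of freedom of interaction"; it is an assumption of this sketch, not a measure taken from the argument above.

```python
import math


def count_weights(widths):
    """Number of weights between consecutive fully connected layers."""
    return sum(a * b for a, b in zip(widths, widths[1:]))


def count_paths(widths):
    """Number of distinct input-to-output paths, picking one neuron per layer."""
    return math.prod(widths)


widths = [784, 512, 512, 10]  # hypothetical layer sizes, chosen for illustration
n_weights = count_weights(widths)
n_paths = count_paths(widths)

print("weights:", n_weights)
print("input-to-output paths:", n_paths)
```

Even for this small network the path count is in the billions while the weight count is under a million, and the path count multiplies with every added layer, whereas the weight count only adds.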

   2. Possible meanings and functions

   As I tried to convey in chapter 4,
   it is unsound to assume
   that a piece of a neural network
   represents a single meaning
   or causal function.

   Chris Olah briefly refers
   to polysemantic neurons in his piece.
   But, as an example,
   I think this *way* underplays
   the amount of complexity
   you have to deal with
   in order to interpret all possible
   'meanings' or 'functions' of a neural network
   for all possible inputs and outputs
   across all connected neurons
   of the neural network
     (going from first principles here;
     I'm not an ML engineer).

   I think that the following excerpt
   of Chris Olah's post very much
   understates the causal complexity
   involved in interpreting neural networks:
    "Computer programs
     often have memory layouts
     that are convenient to understand...
     Something kind of analogous to this
     often happens in neural networks.
     Let's assume for a moment
     that neural networks can be understood
     in terms of operations on a collection
     of independent "interpretable features".

   3. Human priors for code representation

   My impression is that for reverse
   engineering binary code,
   you are (usually) interpreting
   the binary code of functions
   explicitly programmed by human programmers
   that were compiled from one/a few
   of a finite set of programming languages.

   For interpreting pieces
   of a neural network though,
   good practices and intuitions
   are clearly different
   than those used for
   interpreting binary code
   of software stacks.

   And unlike complicated functions
   programmed in by humans,
   this is complexity we are
   much less adapted to interpret,
   evolutionarily and culturally.

   We cannot assume somewhat
   static programming representations and heuristics
   that other computer programmers use
   who are also primate mammals
   that evolved roughly the same innate
   learning biases / core knowledge
   / psychological constraints
   (ie; innate priors
    for the perceiving and representation
    of observations as eg; bounded objects,
    sortable things and persons, etc).

   4. Input distribution shifts

   A major problem with mechanistic interpretability
   that I have not read Olah discuss online so far
   is that of input distribution shifts.

   Software models usually
   are constrained in their input sets.
   Input variables of software are
   usually symbolically standardised
   in ways that do not allow
   for random cross-overs
   in how the incoming data
   are processed internally.
   Neural networks definitely do
   get pseudo-random cross-overs
   in terms of how functional code units 
   (neurons in a given layer)
   process incoming data points.
   Neural networks
   (and other ML architectures)
   are in effect function approximators
   that select for 'transformation mappings'
   between an (assumed)
   probability distribution of inputs
   to resulting computed outputs.
   My impression is that 
   for most of the work Olah
   has been doing on
   mechanistically interpreting
   NN-based architectures
   that he implicitly relied upon that
   the distribution of probabilities of inputs
   would not shift significantly
   beyond used training datasets.
   Ie. Olah could just assume that
   internal neurons/circuits/etc
   will maintain roughly the same
   'range' of functionality given 
   that the inputs are not going 
   to shift 'too much'. 

   However, we really cannot assume
   that for any possible
   prosaic AGI architecture.

   Due to the implementation
   of more general functionalities,
   AGI would cause more
   widespread and bigger changes
   to ambient conditions and contexts
   of the global environment
   (which the architecture itself may
    be moving through and changing its
    input channels to over time). 
   Those changes would feed back
   through whatever connected
   input (sensor) channels of AGI
   and result in distribution shifts 
   of AGI's expressed functionality.

   So this gets back to chapters 1 and 2:
   optimisation methods can initialise and select for
   many different internal code unit variants
   (eg. of NN weights, circuits, sub-networks)
   that would within some assumed
   probability distribution of possible inputs
   function in *practically indistinguishable ways*
   (ie. not distinguished by mechanistic 
    interpretability techniques used by Olah).

   But once inputs go out of distribution,
   model functionality shifts as well,
   causing different outputs and outside effects.
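
The point about variants that function in practically indistinguishable ways inside an assumed input distribution, yet diverge outside it, can be illustrated with a minimal sketch. The two "models" here are stand-ins chosen for illustration (the sine function and its cubic Taylor approximation), and the input ranges are assumptions; nothing here is taken from Olah's work or from any real network.

```python
import math


def model_a(x):
    """Stand-in for one learned variant."""
    return math.sin(x)


def model_b(x):
    """Stand-in for another variant that matches model_a near zero."""
    return x - x ** 3 / 6.0


def max_gap(xs):
    """Largest absolute disagreement between the two variants over inputs xs."""
    return max(abs(model_a(x) - model_b(x)) for x in xs)


in_distribution = [i / 100.0 for i in range(-50, 51)]       # assumed range [-0.5, 0.5]
out_of_distribution = [3.0 + i / 100.0 for i in range(101)]  # assumed range [3.0, 4.0]

gap_in = max_gap(in_distribution)
gap_out = max_gap(out_of_distribution)

print("max gap in distribution:", gap_in)
print("max gap out of distribution:", gap_out)
```

Any check confined to the assumed input range sees two variants that agree to within a small tolerance, while the same two variants disagree by more than their entire in-range output scale once inputs shift.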


   5. Starting at a higher layer of abstraction

   Finally, and actually
   not as relevant as the previous items,
   you also start inspecting neurons
   at a somewhat higher layer
   of abstraction
     (above matrix multiplications
     as corresponding with
     allowed biases of
     and weighted connections
     between neurons;
     and perhaps above some interfaces
     for distributed computation across servers
     over network/internet protocols)
   than for binary bits.


On destabilising internals–environment feedback loops:

[18:25, 28/11/2022] Remmelt:
> Okay, paraphrasing:
> -- The model will produce outputs.
> -- We will not know all effects of these outputs.
> -- These effects will change the environment,
> leading to changed inputs,
> which (if the parameters are still updated)
> leads to changed parameters
> in a way that cannot be predicted.
> I agree.
> That's pretty obvious
> and not yet a failure mode,
> but I will read on

This is a fine paraphrase of the first claim about destabilising feedback cycles that would magnify input distribution shifts and resulting shifts in the functionality of AGI:
"That is, not sufficient to account for how external effects of the model’s operation will feed back into what sensor inputs the model might receive and in turn are transformed into new outputs, etc."

[Later edit: Ah, actually, the first claim about input-functionality shift feedback cycles is not just about learning different parameters with different inputs (although this is an important part, and I had not thought about it yet for this specific dynamic).
It's mostly about inputs causing distribution shifts in outputs, which cause environmental effects that feed back as changes in (sensor) inputs.]

It does not cover the second claim, about substrate-needs convergence, which is what Forrest's research is about (the distribution-functionality shift feedback cycle is just one of a number of secondary dynamics that increase the speed of creep of AGI harmful effects).
"And also not how those external effects feed back into what (variants of) models even continue to physically exist (as configurations of a substrate) and function at increasing scale in the environment over time."

Defined succinctly, substrate-needs convergence is:
That any changing population of AGI parts
…that is, all manufactured/initialised components that are optimised for processing inputs into outputs such as to, in aggregate, sense and act in many outside domains/contexts
…converges on propagating environmental effects that fulfil their artificial needs
…that is, effects that keep feeding back into a portion of those artificial parts continuing to exist, function, and scale up as configurations of a (solid-state) substrate.

What is similar about claims 1 and 2 is that both are about feedback loops from outputs of computed/operated AGI internal components to external effects propagating through the environment, and then back again to affecting the functioning and existence of AGI components. The distribution-functionality shift feedback cycle is something inspect-internals methods cannot account for, so I shared it with you as an intuitive step-up analogy for explaining why substrate-needs convergence can also not be dealt with using inspect-internals methods.
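
The first of these loops, where outputs change the environment and the changed environment comes back as shifted inputs, can be sketched as a toy simulation. The fixed `policy`, the `gain`, and the training range are hypothetical values chosen for illustration, and parameters are held fixed here, so the sketch shows only the input-shift part of the cycle and says nothing about substrate-needs convergence.

```python
TRAIN_RANGE = (0.0, 1.0)  # assumed range of sensor inputs seen during training


def policy(x):
    """Hypothetical fixed model: maps a sensor reading to an output."""
    return 0.5 + 0.5 * x


def run_loop(x0, gain, steps):
    """Feed each output back into the environment state that the sensor reads."""
    readings = [x0]
    x = x0
    for _ in range(steps):
        # The output changes the environment, which changes the next input.
        x = x + gain * policy(x)
        readings.append(x)
    return readings


def first_exit(readings, lo=TRAIN_RANGE[0], hi=TRAIN_RANGE[1]):
    """First step at which the sensor reading leaves the training range, if any."""
    for t, x in enumerate(readings):
        if not lo <= x <= hi:
            return t
    return None


readings = run_loop(x0=0.5, gain=0.1, steps=50)
exit_step = first_exit(readings)

print("first step outside the training range:", exit_step)
print("final sensor reading:", readings[-1])
```

Inspecting `policy` alone reveals nothing about when the readings leave the training range; that depends on `gain` and on the starting state, both of which sit in the environment rather than in the model.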


On why "inspect internals" and "inspect externals" both fall short:

[18:30, 28/11/2022] Remmelt:
> I agree given the *premise*
> that mechint isn't helpful for AGI alignment,
> but don't yet see the reason for that premise

Mechanistic interpretability, as a technique for *inspecting internal components* for errors, cannot account for how the operation of AGI internal components causes outside effects that propagate through the larger environment and in turn feed back into the continued existence and functioning of some of those internal components, at increasing scales.

The scope of application of mechanistic interpretability is for *inspecting internals*, for errors that arise from *intrinsic selection of AGI code/hardware components through engineered optimisation methods*. It does not apply to *extrinsic* selection through effects across the wider environment. The scope of application of inspecting internals does not extend to the uncountable pathways through which AGI components' physical interactions loop back into the resulting selective continuity of existence, functioning and scaling of some of those components.

Code or hardware components do not intrinsically know what extensive causal history of prior physical interactions led to their current existence. You cannot "elicit" that knowledge from internal components, so you cannot detect errors resulting from extrinsic selection with internal inspection.

What's left is *external* inspection. Any implementable *externals* inspection/detection or modelling/simulation methods (or combinations thereof) will be *way too limited* in terms of capacity to detect misaligned self-substrate feedback loops in the interactions of AGI's changing internal components through changing available signalling channels with changing connected surroundings of the environment, over the long term.

So if external inspection methods are not going to cut it (to prevent AGI convergence on ecosystem-wide extinction), and the use of internal inspection methods cannot functionally replace the use of external inspection methods (as outside the scope of application), then mechanistic interpretability as a method for internal inspection is simply not going to help with making AGI safe over the long run.

Specific inequalities you may want to dig into for grokking this abstract argument:
1. why no short-cut algorithm can assess the behavioural properties (digital outputs) of the, for practical purposes, arbitrary algorithms introduced internally by self-modifying AGI (analogous to the result of Rice's theorem).

2. why chaotic dynamics (non-linearly amplifying feedback cycles of potentially smaller-than-measurable localised effects) and complex dynamics of (the expression of multiple functional effects of) parts of a system are unpredictable and unsimulatable respectively, yet can still robustly lead to changes to the agent's body and environment (in eg. the expression of DNA code of biological organisms into adaptive phenotypic effects).

3. why aligning the external functioning of AGI internal components involves sensing relevant effects in the larger and more complex outside environment, and compressing that data in some *lossy* way to compare it against alignment references.



On NNs as spaghetti code:

[20:15, 28/11/2022] Remmelt:
> Ah, I see you using the term spaghetti code btw.
> Note that most people wouldn't
> call neural networks "spaghetti code",
> they mean hand-written unmaintainable code with it.
> (Just saying it in case
>  it may help for avoiding confusions)

Thanks for the thought. I agree that's how the term "spaghetti code" is often used in context currently, so that might be a little confusing at first. As far as "spaghetti code" is defined literally as "unstructured and difficult-to-maintain source code", I think the term can be used in the contexts of neural network code too.

This conveys a big concern of neural networks – AI developers are prepared to deploy code for real-world use that is so messy that no professional-minded software developer would accept such code from a colleague. Particularly for safety-critical applications, that would amount to gross misconduct. It's a reason I hear why many industrial engineers have not switched from eg. decision tree algorithms to neural networks, despite the AI hype and benefits AI developers say their architectures have for being applied everywhere.


Returning to Olah's reverse engineering analogy:

[21:25, 28/11/2022] Remmelt:
> In general, most of the points seem
> some variant of
> "The following aspect of mechint seems really hard",
> and people may agree with that
> and view these as core difficulties,
> and I may even agree that
> they are not discussed enough publicly.
> (I wouldn't claim that researchers
>  don't have these points in their internal models
>  of difficulties to be overcome --
>  there is only so much that researchers
>  can write down and express in a given day).

These are reasons that clarify why Olah's explanation of the reverse engineering analogy falls way short of capturing the complexity and risk of false negatives when inspecting large neural networks for human-relevant errors (misalignments).

If Olah already has any of these points as part of his mental model, then it is his responsibility to properly describe them in his writing, even briefly if he has other legitimate priorities, rather than sum up all the reasons why mechanistic interpretability is a problem that is the same kind of hard as interpreting the assembly code of software.

Note that any mistake of omitting a crucial aspect here can *lead to everyone dying down the line*.

If a surgeon inspecting the components of your body missed a crucial aspect in interpreting what your organs do, resulting in your death some years later, would you see that as acceptable professional conduct? What about a surgeon forgetting to teach crucial aspects of organ functioning to trainees in online documentation?

Analogously, can we just say, "oops, oh yeah me being overly optimistic about our ability to interpret internal components led to the death of billions of people down the line"?


[21:28, 28/11/2022] Remmelt:
> None of the points seem to destroy
> the prospect of any value to be gained from mechint
> for the alignment of AGI.

At the least, they result in the conclusion that the expected value is much lower than previously estimated.
If the prospect of mechanistically interpreting AGI internals comprehensively to prevent ecosystem-wide extinction turns out to be a lot more pessimistic than thought earlier, that means the expected value of working on mechanistic interpretability is also much lower.



On the key sub-arguments for why long-term safe AGI is impossible:

[21:34, 28/11/2022] Remmelt:
> Can you maybe mention the single,
> in your view strongest, argument
> that tries to make alignment of advanced AI
> impossible or extremely unlikely?
> Maybe you mentioned it already above
> in the parts of your second to last message stream
> which I ignored, then you can also point me to it.

If you want the conclusive argument, this requires understanding three key sub-arguments first.

One sub-argument is about theoretical limits of engineerable control, so perhaps you can start there?

The other two sub-arguments are about:
- economic decoupling of AI exchanges of value from human exchanges of value.
- substrate-needs convergence (as distinct from but enabled by instrumental convergence).



On ecosystems being uncomputable:

[22:14, 28/11/2022] Remmelt:
> This seems similar to some of your concerns

The post "The limited upside of interpretability" is excellent btw.
Just reading through, thank you.

One claim I think is not capturing ecosystem complexity right is the computational irreducibility point.
Flawlessly (ie. deterministically) computing an algorithm that has the shortest length (equivalent to Kolmogorov Complexity) of any algorithm that produces its outputs is *computationally irreducible* in the sense that you cannot just run a shorter-length algorithm requiring fewer computational operations to replace it.

An ecosystem (whether based around carbon-centered DNA/RNA or solid-state-lattice-embedded-transistors) is *uncomputable*. This is because part interactions within that ecosystem involve noise interference at various levels, which feeds chaotic dynamics that in turn result in new formation of structure/parts. You simply cannot scan the parts of that ecosystem and simulate it faithfully on the equivalent of a Turing machine.

You can make some higher-level predictions about long-term convergent outcomes of that ecosystem (as Forrest is doing, for parts of the human-carbon ecosystem in interactions with parts of the AI-solid-state ecosystem).



Original researcher's response on theoretical limits of engineerable control:

[09:51, 29/11/2022] Remmelt:
Regarding theoretical limits
of engineerable control,
Forrest just emailed me this:

“For the aspect that has to do with technical 
non-alignment, the key info is the following:.

 - that any and every AGI 'safety/alignment' process
 is a special case of a 'causative feedback process'.
   - that there are no/zero exceptions to this truth.
   - as anything and everything that is true of 
   the use of causation, as a feedback process, 
   and/or an 'algorithm', also as a 'modeling process',
   and/or of 'signaling', inherent in any 'feedback',
   will for sure be true also of any AGI safety/alignment
   technique, method, proposal, methodology, algorithm, etc.

 - that any and every 'causative feedback process'
 depends on at least all of the following:.

   - 1; the completeness of the sense input.
     - as the input data that is being processed by 
     or used by, the proposed AGI alignment technique/etc,
     and also as the input data that is intake into 
     the AGI system, so that the model of the key aspects
     of the AGI system has reasonably correct input data.

   - 2; the completeness of process modelability.
     - ie; that the proposed AGI alignment technique/etc,
     has to actually be that, some sort of algorithm,
     itself conditional on only causation and logic
     and that the AGI system itself has to be 
     at least partially modelable, in whatever key
     aspects, are necessary to establish 'alignment'.
   - 3; the completeness of predictability.
     - ie; as the assumption that the model 
     of the AGI system 
     when also given the model of the input data
     will allow for sufficient predictability
     of whatever key aspects of the future AGI outputs
     will be, before the real (non-modeled) AGI 
     would actually act, in a way that is assessable
     and comparable to desired outcome states.

   - 4; the completeness of some type of a comparison 
   to a reference (where that reference implicitly defines 
   whatever is the meaning of "aligned" or "safe").

   - 5; the completeness of some type of signaling 
   (ie; what is the "feedback", how the model itself
   works internally, how the model alerts the actuator
   so that the model 'controls' and constrains the AGI 
   so that it is acting/operating in an aligned way).

   - 6; the completeness of some type of 
   conditionalization (ie, what is "effectiveness")
   of at least one of 
   the following AGI aspects:.
     - output actions.
     - internal process.
     - sense input.

    - ie; where/if the AGI conditionalization 
    does not have sufficient "power" to actually constrain
    the AGI from doing bad things, then even a very good
    alignment algorithm/technique/process will not result
    in an aligned/safe AGI.

 - that aspects 1 thru 6 (inclusive) are true of any 
 and every 'causative feedback process'.
   - that there are exactly none/zero exceptions.
   - that all causative feedback processes 
   will have/require all six of these aspects 
   (with no exceptions).

 - that therefore, a/the/any/all AGI alignment/safety
 enforcement protocols, techniques, methodologies, etc,
 will also (for sure) be required to have and implement 
 all six aspects (with no exceptions), 
 and, where given that requirement, 
 if any of these six aspects, for whatever reason, 
 cannot be implemented, then that/therefore 
 AGI alignment/safety cannot be implemented.

 Unfortunately, every single one of the listed 
 dependencies (all six of them) is individually impossible 
   (ie; cannot be 'complete enough' to achieve 
   anywhere near the minimum threshold necessary).
 Insofar as all six are necessary together 
 to achieve AGI alignment, there are effectively six, 
 at least semi-independent, arguments against 
 there being any notion of AGI alignment, 
 as based on nothing outside of the notions of
 causation, conditionalization, and/or logic
 (ie, all different manifestations of the same idea).

 Moreover, the control and/or "error correction" system 
 must itself not make mistakes/errors (ie, be/operate
 within reasonable limits, be itself predictable, etc). 
 What tracks errors/mistakes in the system 
 that is modeling the system to which error correction is applied?  
 Insofar as the same logic applies in proportion to the
 complexity of the error correction, as it does in the
 complexity of the AGI system itself, then there is a 
 kind of regression cascade problem.”
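
To make the six dependencies in the quote above concrete, here is a minimal Python sketch of one pass through a generic feedback loop. It is an illustration only, not the quoted researcher's formalism: the function `feedback_step`, its parameters (`tolerance`, `max_power`), and the single-number toy system are all assumptions introduced here. Each commented step corresponds to one of the numbered aspects, and the capped correction shows how a controller with too little "power" (aspect 6) fails to constrain the system even when every other step works.

```python
from typing import Callable


def feedback_step(
    sense: Callable[[], float],
    predict: Callable[[float], float],
    reference: float,
    tolerance: float,
    actuate: Callable[[float], None],
    max_power: float,
) -> bool:
    """Run one pass of a toy control loop; return True if a correction was sent."""
    observation = sense()              # aspect 1: sense input
    predicted = predict(observation)   # aspects 2 and 3: model the system, then predict
    error = predicted - reference      # aspect 4: compare against a reference
    if abs(error) <= tolerance:
        return False                   # aspect 5: nothing to signal
    # aspect 6: the correction is capped by the controller's available "power"
    correction = max(-max_power, min(max_power, -error))
    actuate(correction)                # aspect 5: the signal reaches the actuator
    return True


# Hypothetical toy system: a single number that should stay near 0.
state = {"x": 5.0}


def read_state() -> float:
    return state["x"]


def apply_correction(delta: float) -> None:
    state["x"] += delta


# A weak controller (max_power=1.0) only moves the state from 5.0 to 4.0,
# so the system stays far outside the tolerated range after the step.
acted = feedback_step(read_state, lambda obs: obs, 0.0, 0.1, apply_correction, 1.0)
```

In this toy the model is perfect (`predict` returns the observation unchanged), so the only thing that fails is the conditionalization step. Degrading `sense` or `predict` instead would break the loop at aspects 1 to 3 in the same way.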



Original researcher's response on mechanistic interpretability:

[09:51, 29/11/2022] Remmelt:
And on mechanistic interpretability:
“Maybe 'mechanistic interpretability' arguments
always end up confusing things in regard to 
the argument of AGI safety impossibility?

The notion that might make some of this easier
is to describe it as "dual use functionality"
that occurs at the small scale, in a distributed way,
and which will eventually result in unexpected
shifts in macroscopic results/outcomes.

No one bothers to try to 'observe' and 'interpret
or describe' the meaning/function, or outcome, 
of what is happening in the small details 
in any NN or system.  Hence, subtle exogenously
coordinated shifts occurring everywhere at once
will routinely get overlooked.

'Mechanistic interpretability' frames things in
terms of single macroscopic functions/descriptions.
That leads to a confusing focus, insofar as people
believe that is the only way to think about it.
Our work is about functions with plural microscopic
drivers/outcomes, and what sort of implications 
that has.”
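
One way to picture the "distributed small shifts" point in the quote is the toy calculation below. It is only a sketch under strong assumptions that are not in the original exchange: the shifts are taken to be identical, to add up linearly, and to be checked one unit at a time against a fixed threshold. The names `flagged_units`, `aggregate_shift`, and the three constants are hypothetical.

```python
import math


def flagged_units(shifts, threshold):
    """Return indices of units whose individual shift exceeds the per-unit threshold."""
    return [i for i, s in enumerate(shifts) if abs(s) > threshold]


def aggregate_shift(shifts):
    """Sum all per-unit shifts (assumes contributions simply add up)."""
    return math.fsum(shifts)


# Assumed numbers, chosen only for illustration.
N_UNITS = 10_000            # number of small components
PER_UNIT_SHIFT = 1e-4       # the same tiny shift applied everywhere at once
PER_UNIT_THRESHOLD = 1e-3   # what a unit-by-unit inspection would notice

shifts = [PER_UNIT_SHIFT] * N_UNITS
flagged = flagged_units(shifts, PER_UNIT_THRESHOLD)  # no single unit stands out
total = aggregate_shift(shifts)                      # yet the sum is about 1.0
```

Here no individual unit crosses the inspection threshold, while the coordinated total is three orders of magnitude above it. Real networks are nonlinear, so this shows only the simplest version of the concern.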


–> Actually, my conversation partner is just double-checking where I quoted them. I intend to post the edited exchange here by Mon 26 Dec. [Update: they said it was fine]