No control method exists to safely contain the global feedback effects of self-sufficient learning machinery. What if this control problem turns out to be an unsolvable problem?

Where are we two decades into resolving to solve a seemingly impossible problem?

If something seems impossible… well, if you study it for a year or five, it may come to seem less impossible than in the moment of your snap initial judgment.

   — Eliezer Yudkowsky, 2008

A list of lethalities…we are not on course to solve in practice in time on the first critical try; none of it is meant to make a much stronger claim about things that are impossible in principle

   — Eliezer Yudkowsky, 2022

How do you interpret these two quotes, by a founding researcher, fourteen years apart?[1]

  • A. We indeed made comprehensive progress on the AGI control problem, and now at least the overall problem does not seem impossible anymore.
  • B. The more we studied the overall problem, the more we uncovered complex sub-problems we'd need to solve as well, but so far can at best find partial solutions to.


Which physical/information problems seemed impossible, and stayed unsolved after two decades?

Oh ye seekers after perpetual motion, how many vain chimeras have you pursued?  Go and take your place with the alchemists.

  — Leonardo da Vinci, 1494

No mathematical proof or even rigorous argumentation has been published demonstrating that the A[G]I control problem may be solvable, even in principle, much less in practice.

  — Roman Yampolskiy, 2021

We cannot rely on the notion that if we try long enough, maybe AGI safety turns out possible after all.

Historically, researchers and engineers tried solving problems that turned out impossible:

Smart creative researchers of their generation came up with idealized problems. Problems that, if solved, would transform science, if not humanity. They plowed away at the problem for decades, if not millennia. Until some bright outsider proved by contradiction of the parts that the problem is unsolvable.

Our community is smart and creative but we cannot just rely on our resolve to align AI. We should never forsake our epistemic rationality, no matter how much something seems the instrumentally rational thing to do.

Nor can we take comfort in the claim by a founder of this field that they still know it to be possible to control AGI to stay safe. 

Thirty years into running a program to secure the foundations of mathematics, David Hilbert declared “We must know. We will know!” By then, Kurt Gödel had constructed the first incompleteness theorem. Hilbert kept his declaration for his gravestone.

Short of securing the foundations of safe AGI control – that is, by formal reasoning from empirically-sound premises – we cannot rely on any researcher's pithy claim that "alignment is possible in principle".

Going by historical cases, this problem could turn out solvable. Just really really hard to solve. The flying machine seemed an impossible feat of engineering. Next, controlling a rocket’s trajectory to the moon seemed impossible.

By the same reference class, ‘long-term safe AGI’ could turn out unsolvable: the perpetual motion machine of our time. It takes just one researcher to define the problem to be solved, reason from empirically sound premises, and arrive finally at a logical contradiction between the two.[2]


Can you derive whether a solution exists, without testing in real life?

Invert, always invert.

   — Carl Jacobi[3], ±1840

It is a standard practice in computer science to first show that a problem doesn’t belong to a class of unsolvable problems before investing resources into trying to solve it or deciding what approaches to try.

   — Roman Yampolskiy, 2021

There is an empirically direct way to know whether AGI would stay safe to humans: 
Build the AGI. Then just keep observing, per generation, whether the people around us are dying.

Unfortunately, we do not have the luxury of experimenting with dangerous autonomous AI systems to see whether they cause human extinction or not. When it comes to extinction, we do not get another chance to test.

Even if we could keep testing new conceptualized versions of guess-maybe-safe AGI, is there any essential difference between our epistemic method and that of medieval researchers who kept testing new versions of a perpetual motion machine?

OpenPhil bet tens of millions of dollars on technical research conditional on the positive hypothesis ("a solution exists to the control problem"). Before sinking hundreds of millions more into that bet, would it be prudent to hedge with a few million for investigating the negative hypothesis ("no solution exists")?

Before anyone tries building "safe AGI", we need to know whether any version of AGI – as precisely defined – could be controlled by any method to stay safe.
Here is how:

  1. Define the concepts of 'control', 'general AI', and 'to stay safe' (as soundly corresponding to observations in practice).
  2. Specify the logical rules that must hold for such a physical system (categorically, by definition or empirically tested laws).
  3. Reason step-by-step to derive whether the logical result of "control AGI" is in contradiction with "to stay safe".

This post defines the three concepts more precisely, and explains some ways you can reason about each. No formal reasoning is included – to keep it brief, and to leave the esoteric analytic language aside for now.


What does it mean to control machinery that learns and operates self-sufficiently?

Recall three concepts we want to define more precisely:

  1. 'Control'
  2. 'general AI'
  3. 'to stay safe'

It is common for researchers to have very different conceptions of each term. 
For instance:

  1. Is 'control' about:
    1. adjusting the utility function represented inside the machine so it allows itself to be turned off?
    2. correcting machine-propagated (side-)effects across the outside world?
  2. Is 'AGI' about:
    1. any machine capable of making accurate predictions about a variety of complicated systems in the outside world?
    2. any machinery that operates self-sufficiently as an assembly of artificial components that process inputs into outputs, and in aggregate sense and act across many domains/contexts?
  3. Is 'stays safe' about:
    1. aligning the AGI’s preferences to not kill us all?
    2. guaranteeing an upper bound on the chance that AGI in the long term would cause outcomes out of line with a/any condition needed for the continued existence of organic DNA-based life?


To argue rigorously about solvability, we need to:

  • Pin down meanings:  
    Disambiguate each term, to not accidentally switch between different meanings in our argument. Eg. distinguish between ‘explicitly optimizes outputs toward not killing us’ and ‘does not cause the deaths of all humans’.
  • Define comprehensively:  
    Ensure that each definition covers all the relevant aspects we need to solve for. 
    Eg. what about a machine causing non-monitored side-effects that turn out lethal?
  • Define elegantly:  
    Eliminate any defined aspect that we do not yet need to solve for. 
    Eg. we first need to know whether AGI eventually causes the extinction of all humans, before considering ‘alignment with preferences expressed by all humans’.


How to define ‘control’? 

System is any non-empty part of the universe. 
State is the condition of the universe.

Control of system A over system B means that A can influence system B to achieve A’s desired subset of state space.  

   — Impossibility Results in AI, 2021

The outputs of an AGI go through a huge, not-fully-known-to-us domain (the real world) before they have their real consequences.

   — AGI Ruin, 2022


In practice, AGI control necessarily repeats these steps:

  1. Detect inputs through sensor channels connected to any relevant part of the physical environment (including hardware internals).
  2. Model the environment based on the channel-received inputs.
  3. Simulate effects propagating through the modeled environment.
  4. Compare effects to reference values (to align against) over human-safety-relevant dimensions.
  5. Correct effects counterfactually through outputs to actuators connected to the environment.

Underlying principles:

  • Control requires both detection and correction. 
  • Control methods are always implemented as a feedback loop.
  • Control is exerted by the use of signals (actuation) to conditionalize the directivity and degrees of other signals (effects).
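
As a minimal sketch of the five steps above, the following toy loop regulates a single number toward a reference value. Everything here is a hypothetical illustration: the names `control_step`, `run_loop`, `gain` and `sensor_bias` are assumptions made for this example, and a real environment would be far higher-dimensional than one scalar.

```python
def control_step(true_state, reference, gain=0.5, sensor_bias=0.0):
    """One pass of detect -> model -> simulate -> compare -> correct."""
    detected = true_state + sensor_bias   # 1. detect (the sensor may be off)
    modeled = detected                    # 2. model: the controller's belief about the state
    predicted = modeled                   # 3. simulate: toy assumption that the state persists
    error = reference - predicted         # 4. compare against the reference value
    correction = gain * error             # 5. correct through an actuator
    return true_state + correction


def run_loop(initial_state, reference, steps=60, gain=0.5, sensor_bias=0.0):
    """Repeat the control step, feeding each corrected state back in."""
    state = initial_state
    for _ in range(steps):
        state = control_step(state, reference, gain, sensor_bias)
    return state


well_sensed = run_loop(10.0, 0.0)                    # settles near the reference 0.0
mis_sensed = run_loop(10.0, 0.0, sensor_bias=1.0)    # settles near -1.0 instead
```

Under these assumptions, the loop with an accurate sensor settles at the reference, while the loop with a biased sensor settles at the wrong value and stays there: whatever detection misses, correction cannot fix.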

Any method of control is incomplete. In the case of AGI, the question is whether the extent of control possible is at least greater than the extent of control necessary. 

AGI control signals would be a tiny, tiny subset of all physical signals propagating through the environment, and therefore limited in tracking and conditionalizing the resulting effects. AGI mostly could not even control all local effects of their own components’ physical interactions.

But without that control loop – from correct back to detect – AGI cannot keep outside propagated effects aligned with internal reference values. 

To track outcomes over time, AGI must detect the effects in the environment:

  • AGI cannot model the future omnisciently. AGI is part of a larger and more functionally complex environment. The subset of AGI implementing "alignment" cannot model all relevant variables of how AGI's changing connected components could propagate effects in interactions with their changing surroundings.
  • Even where AGI can model relevant environmental variables (eg. over internal transistors), they cannot just simulate the chaotic dynamics between variables. Noise drift in AGI's interactions can amplify (via any available environmental non-linearity, of which in any real-world context, there are many) into much larger changes to conditions.
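
To make the amplification of noise concrete, here is a small sketch using the logistic map, a standard textbook example of chaotic dynamics. It is a stand-in chosen for illustration, not a model of AGI or of any real environment; the starting value, noise size and function names are assumptions.

```python
def logistic(x, r=4.0):
    """One step of the logistic map, which is chaotic at r = 4."""
    return r * x * (1.0 - x)


def divergence(x0, noise, steps):
    """Track the gap between two trajectories that start `noise` apart."""
    a, b = x0, x0 + noise
    gaps = []
    for _ in range(steps):
        a, b = logistic(a), logistic(b)
        gaps.append(abs(a - b))
    return gaps


# Two starting points that differ by one part in ten billion.
gaps = divergence(x0=0.2, noise=1e-10, steps=60)
largest_gap = max(gaps)
```

The gap starts far below anything a coarse sensor would register, yet within a few dozen steps it grows to the same order as the state itself. A simulation that rounds away the initial difference has no way to recover it later.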

To align those outcomes with internal values, AGI must compare whether the potential effects (conditional on actuation) are in line with the values, and then actuate to correct.

So any alignment method must be implemented as a control loop. Any limit that applies generally to controllability also forms constraints on the possibility of alignment.

There are fundamental limits to the controllability of AGI. Dozens of impossibility results seem applicable to one or more of the control steps.[4]

Dense questions for academics:

  1. Detectability limits:  
    • To what extent are distributed environmental effects picked up by local sensors?  
    • How is measurement limited by noise interference over distances traveled by signals (Shannon Entropy) and smallest-digit/degree detectability (Planck Limit up)?
    • How much information is lost in compressing high-fidelity sensor inputs to store on hardware?
  2. Modelability limits:  
    • To what extent is the space of relevant machinery-to-society/ecosystem interactions more functionally complex than modelable within the machinery alone (Law of Requisite Variety etc.)? 
    • Can AGI model the future code AGI learns based on future unknown inputs?
  3. Simulatability limits
    • Can future code be simulated, where the actual functioning of that code depends on future unknown inputs?
    • Can behavioral properties of future code be simulated, if no shortcut function is available that reduces total compute time steps, or if any shorter available computation branches would be non-deterministically time-complex (Computational Reducibility)?
    • Can computations simulate the effects propagating through the environment, as fast as those effects propagate across parts of the environment?
    • Can deterministic computation simulate local noisy changes being nonlinearly amplified into a global divergence of the final conditions (Chaos Theory)?
  4. Commensurability limits:  
    • How would the machinery pick, weigh, and sort effects into relevant comparative dimensions (NP-Complete Knapsack Problem)? 
    • Can concretely instantiated effects be evaluated comprehensively across those abstract dimensions of value representation?
    • What irreconcilable normative assumptions must be decided between, to pick internal reference values to compare outcomes against?
  5. Correctability limits
    • What is the minimum time delay of digital processing between measurement and correction (Latency)?
    • Can AGI constrain the chaotic loops through themselves – of actuated changes to the environment feeding into sensor inputs, that feed into code functioning, that feed back into actuated changes?
    • To what extent can distributed environmental effects be corrected for by local actuators?  

Notice how the span of effects that cannot be controlled for at each step compounds (as multiplicative) with each additional control step. Errors there can accumulate over time. Human-lethal effects there cannot be contained.
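
The multiplicative compounding can be shown with simple arithmetic. The sketch below assumes, purely for illustration, that each control step handles a fixed fraction of the relevant effects and that the gaps at each step are independent; the 0.9 figures are hypothetical and not estimates from any source.

```python
import math


def end_to_end_coverage(step_coverages):
    """Fraction of effects handled by every step, assuming independent gaps."""
    return math.prod(step_coverages)


# Hypothetical per-step fractions of relevant effects that each step handles.
steps = {
    "detect": 0.9,
    "model": 0.9,
    "simulate": 0.9,
    "compare": 0.9,
    "correct": 0.9,
}

covered = end_to_end_coverage(steps.values())   # about 0.59
uncontrolled = 1.0 - covered                    # about 0.41
```

Even with each step handling nine tenths of what it receives, roughly two fifths of the relevant effects fall outside the loop as a whole in this toy calculation.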

While the machinery can control some effects consistently in line with internal values, the full extent of control possible is incomplete.

So what extent of control is necessary?  Be careful to distinguish: 

  • Localized user-initiated control often built into tech products brought to market,
  • Comprehensive automated control needed to prevent risks of an auto-scaling/catalyzing technology from materializing globally over the long term.


How to define ‘AGI’?

We've got no idea what's actually going on inside the giant inscrutable matrices and tensors of floating-point numbers.

   — AGI Ruin, 2022


  • Narrow AI as a model with static code parameters (updated only through human engineers) processing inputs into outputs over a single domain (eg. of image pixels, text tokens).
  • General AI as dynamically optimizing configurations encoded into hardware (without needing humans) that process inputs into outputs over multiple domains representing outside contexts.

Corporations are scaling narrow AI model training and deployment toward general AI systems. Current-generation GPT is no longer a narrow AI, given that it processes inputs from the image domain into a language domain. Nor is GPT-4 a general AI. It is in a fuzzy gap between the two concepts.

Corporations already are artificial bodies (corpora is Latin for bodies).

Corporations can replace human workers as “functional components” with economically efficient AI. Standardized hardware components allow AI to outcompete human wetware on physical labor (eg. via electric motors), intellectual labor (faster computation via high-fidelity communication links), and the reproduction of components itself.[5]

Any corporation or economy that fully automates – no longer needing humans to maintain their artificial components – over their entire production and operation chains, would in fact be general AI.

So to re-define general AI more precisely:

  • Self-sufficient
    need no further interactions with humans
    [or lifeforms sharing ancestor with humans]
    to operate and maintain [and thus produce] 
    their own functional components over time.
  • Learning
    optimizing component configurations 
    for outcomes tracked across domains.
  • Machinery [6]
    connected components configured
    out of hard artificial molecular substrates
    [as chemically and physically robust under
     human living temperatures and pressures,
     and thus much more standardizable as well, 
     relative to humans' softer organic substrates].

Ultimately, this is what distinguishes general AI from narrow AI:
The capacity to not only generally optimise across internal simulated contexts, but also to generally operate and maintain components across external physical contexts.


How to define ‘stays safe’?

An impossibility proof would have to say: 

  1. The AI cannot reproduce onto new hardware, or modify itself on current hardware, with knowable stability of the decision system and bounded low cumulative failure probability over many rounds of self-modification. 
  2. The AI's decision function (as it exists in abstract form across self-modifications) cannot be knowably stably bound with bounded low cumulative failure probability to programmer-targeted consequences as represented within the AI's changing, inductive world-model. 

   — Yudkowsky, 2006

By far the greatest danger of Artificial Intelligence is that people conclude too early that they understand it. Of course this problem is not limited to the field of AI. 

Jacques Monod wrote: “A curious aspect of the theory of evolution is that everybody thinks he understands it”

   — Yudkowsky, 2008

This is about the introduction of self-sufficient learning machinery, and of all modified versions thereof over time, into the world we humans live in. 

Does this introduction of essentially a new species cause global changes to the world that fall outside the narrow ranges of localized conditions that human bodies need to continue to function and exist?


  1. Uncontainability[7] of unsafe effects:
    That we fundamentally cannot establish, by any means, 
     any sound and valid statistical guarantee that the risk 
     probability that the introduction of AGI into the world 
     causes human-species-wide-lethal outcomes over 
     the long term[8] is guaranteed to be constrained 
     below some reasonable chance percentage X 
     (as an upper maximum-allowable bound).
  2. Convergence on unsafe effects: 
    That the chance that AGI, persisting in some form, 
     causes human-species-wide-lethal outcomes 
     is strictly and asymptotically convergent 
     toward certain over the long term, and 
     that it is strictly impossible for the nature 
     of this trend to be otherwise.
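
The difference between the two definitions can be illustrated with elementary probability. The sketch below assumes independent time periods, each with its own chance of a lethal outcome; the numbers and the function name `cumulative_risk` are hypothetical. It illustrates what an upper bound X would require and is not a derivation of either claim.

```python
def cumulative_risk(per_period_risks):
    """Chance of at least one lethal outcome, assuming independent periods."""
    survival = 1.0
    for p in per_period_risks:
        survival *= (1.0 - p)
    return 1.0 - survival


# A small but constant risk per period: cumulative risk tends toward certainty.
constant = cumulative_risk([0.001] * 10_000)

# A risk that halves every period: cumulative risk stays under a fixed bound.
shrinking = cumulative_risk([0.001 * 0.5 ** k for k in range(200)])
```

In this toy setting, a bound X over the long term is only available if the per-period risk keeps shrinking fast enough. Any per-period risk that stays above some fixed positive level drives the cumulative risk toward one.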

I know of three AGI Safety researchers who wrote about specific forms of impossibility reasoning (including Yudkowsky in quote above). Each of their argument forms was about AGI uncontainability, essentially premised on there being fundamental limits to the controllability of AGI component interactions.

By the precautionary principle,[9] AGI uncontainability should be sufficient reason to never ever get even remotely near to building AGI. Uncontained effects that destabilise conditions outside any of the ranges our human bodies need to survive, would kill us.

But there is an even stronger form of argument:  
Not only would AGI component interactions be uncontainable; they will also necessarily converge on causing the extinction of all humans.

The convergence argument most commonly discussed is instrumental convergence: where machinery channels their optimisation through represented intermediate outcomes in order to be more likely to achieve any aimed-for outcomes later. Eg. AGI's planning converges on producing more compute hardware in order for AGI to more accurately simulate paths to future outcomes.

Instrumental convergence results from internal optimisation:
 code components being optimised for (an expanding set of) explicit goals.

Instrumental convergence has a hidden complement: substrate-needs convergence.

Substrate-needs convergence results from external selection: 
 all components being selected for (an expanding set of) implicit needs.

This will sound abstract. Let me start explaining this from different angles:

AGI is made up of a population of connected/nested components. This population changes as eg. hardware is modified and produced, and code is learned from inputs and copied onto the hardware.

AGI, as defined, also has a general capacity to maintain its own components.
Any physical component has a limited lifespan. Configurations erode in chaotic ways. 
To realistically maintain components[10], AGI also must produce the replacement parts.

AGI's components are thus already interacting to bring about all the outside conditions and contexts needed to produce their own parts. Imagine all the fine-grained parallel conditions needed at mines, chemical plants, fab labs and assembly plants to produce hardware. All that would be handled by the machinery components of AGI.

So there is a changing population of components. And those connected components function in interactions to create the ambient conditions and contexts needed to reproduce parts of themselves. And as new components get connected into that population, the functionality of those interacting components shifts as well.

This is where substrate-needs convergence comes in. When changing connected components have their shifting functionality[11] expressed as effects across/to surrounding production infrastructure, their functionality converges around bringing about more of the conditions and contexts needed for more of those components to exist and function.

Any changing population of AGI components gets selected over time toward propagating those specific environmental effects that fulfill their needs.

Whatever learned or produced components that across all their physical interactions with connected contexts happen to direct outside effects that feed back into their own maintenance and replication as assembled configurations…do just that.[12]
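
The selection mechanism described above resembles standard replicator dynamics, which can be sketched in a few lines. This is a toy model under stated assumptions: three hypothetical component variants, each with a fixed rate at which its outside effects feed back into its own maintenance and replication. It illustrates selection only and does not establish the convergence claim.

```python
def select(shares, feedback, generations):
    """Discrete replicator update: each share grows in proportion to its feedback rate."""
    for _ in range(generations):
        weighted = [s * f for s, f in zip(shares, feedback)]
        total = sum(weighted)
        shares = [w / total for w in weighted]
    return shares


# Hypothetical variants: the third feeds back into its own replication
# slightly more strongly than the other two.
initial_shares = [1 / 3, 1 / 3, 1 / 3]
feedback_rates = [1.00, 1.05, 1.10]

final_shares = select(initial_shares, feedback_rates, generations=200)
```

No variant here aims at anything. The variant whose effects happen to support its own replication a little more ends up making up nearly all of the population.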

Here is the catch: AGI components interacting to maintain and replicate themselves are artificial. Their physical substrate is distinct from our organic human substrate.

Their distinct physical substrate has distinct molecular and electric properties – requiring different conditions and contexts to assemble and maintain the assembly. 

Here is an example:

  • Silicon dioxide needs to be heated above 1400 ºC to free outer electrons, and allow an ingot to melt. While production of silicon chips needs extremely high temperatures, computation runs best at extremely low temperatures (to reduce the electrical resistance over conductor wires). 
  • Carbon bonds in our body, however, would oxidise (ie. burn) at such temperatures. And cooling water in our bodies below 0 ºC makes the H₂O molecules freeze and expand into ice crystals. That would destroy our cells - we would die.

We humans need around room temperature at every point of our lifecycle – to sustain the continuous organic chemical reactions through which our body operates and maintains itself. 

Hardware works differently. Hardware configurations do not operate – nor are maintained – by being in semi-chaotic chemical reactions.

Hardware is made from some chemically inert substrate that mostly does not react under ambient temperatures and pressures found on Earth's surface. Something like a rock – which ordinarily stays hard in form and needs magma-level temperatures and pressures to be reshaped.

This property of being chemically inert while operating allows hardware components to be standardised. By molecules not splitting off nor moving about nor rebonding like molecules in human bodies do, the configurations stay stable and compartmentalised. 

In turn, standardisation of hardware allows hardware components produced in different places and times to still store, compute or transmit a piece of code in the same way (ie. consistently). Standardisation supports virtualisation.

Standardised hardware AGI would be robust over, and need, a much wider range of temperatures and pressures than our comparatively fragile human wetware can handle.

Temperature and pressure can be measured and locally controlled for. That's misleading. Innumerable other conditions and contexts would be needed by, and get selected for in, AGI. These fall outside the limits of what AGI's actual built-in detection and correction methods could control for.[13] 

We humans too depend on highly specific environmental conditions and contexts for the components nested inside our bodies (proteins→organelles→cells→cell lining→) to continue in their complex functioning, so as to maintain our overall existence.

Between the highly specific set of artificial needs and highly specific set of organic needs, there is mostly non-overlap. AGI cannot control most of the components' iterative effects from converging on their artificial needs, so they do. Their fulfilled artificial needs are disjunctive of our organic needs for survival. So the humans die.

Under runaway feedback, our planetary environment is modified in the directions needed for continued and greater AGI existence. Outside the ranges we can survive.

In summary: 

  1. Fundamental limits:
    Control methods cannot constrain most environmental effects propagated by interacting AGI components. Any built-in method to detect and correct effects – to align external effects with internal reference values – is insufficient.
  2. Uncontrollable feedback:
    A subset of the effects will feed back into further maintaining or increasing (higher-level) configurations of hardware that propagated those effects. No internal control feedback loops could correct the possible external feedback loops.
  3. Substrate-needs convergence:
    These environmental effects are needed for components to come into and stay in existence. But their environmental needs are different from our needs. Their artificial needs are in conflict with our organic needs for survival. Ie. toxic.

AGI would necessarily converge on causing the extinction of all humans.


Where from here?

Things are relevant to something that cares about this information, rather than that information, because it is taking care of itself. Because it is making itself. Because it is an autonomous autopoietic agent. And the degree to which these machines are not autopoietic, they really do not have needs.

   — Vervaeke, 2023

Over two decades, AI Safety founders resolved to solve the control problem, to no avail:

  • They reasoned that technological and scientific 'progress' is necessary for optimising the universe – and that continued 'progress' would result in AGI
  • They wanted to use AGI to reconfigure humanity and colonise reachable galaxies.
  • They and followers promoted and financed[14] development of 'controllable' AGI.
  • They panicked, as the companies they helped start up raced to scale ML models.

Now we are here.

  • Still working on the technical problem that founders deemed solvable.
  • Getting around to the idea that slowing AI development is possible.

In a different world with different founders, would we have diversified our bets more?

  • A. Invest in securing the foundations of whatever 'control AGI to stay safe' means?
  • B. Invest in deriving, by contradiction of the foundations, that no solution exists?

Would we seek to learn from a researcher claiming they derived that no solution exists?

Would we now?


Thanks to Peter S. Park, Kerry Vaughan, and Forrest Landry (my mentor) for the quick feedback.

For readers' comments, see below and on LessWrong.

  1. ^

     Listen to Roman Yampolskiy's answer here

  2. ^

    Years ago, an outside researcher could have found a logical contradiction in the AGI control problem without you knowing yet – given the inferential distance. Gödel himself had to construct an entire new language and self-reference methodology for the incompleteness theorems to even work. 

    Historically, an impossibility result that conflicted with the field’s stated aim took years to be verified and accepted by insiders. A field’s founder like Hilbert never came to accept the result. Science advances one funeral at a time.

  3. ^

    "Invert, always invert" is a loose translation of the original German ("man muss immer umkehren"). A more accurate literal translation is "man must always turn to the other side".

    I first read “invert, always invert" from polymath Charlie Munger:

    The great algebraist, Jacobi, had exactly the same approach as Carson and was known for his constant repetition of one phrase: “Invert, always invert.” It is in the nature of things, as Jacobi knew, that many hard problems are best solved only when they are addressed backward.

    Another great Charlie quote:

    All I want to know is where I’m going to die, so I’ll never go there.

  4. ^

    Roman Yampolskiy is offering to give feedback on draft papers written by capable independent scholars, on a specific fundamental limit or no-go theorem described in academic literature that is applicable to AGI controllability. You can pick from dozens of examples from different fields listed here, and email Roman a brief proposal.

  5. ^

    Corporations have increasingly been replacing human workers with learning machinery. For example, humans are now getting pushed out of the loop as digital creatives, market makers, dock and warehouse workers, and production workers.

    If this trend continues, humans would have negligible economic value left to add in market transactions of labor (not even for providing needed physical atoms and energy, which would replace human money as the units of trade):

    • As to physical labor: 
    Hardware can actuate power real-time through eg. electric motors, whereas humans are limited by their soft appendages and tools they can wield through those appendages. Semiconductor chips don’t need an oxygenated atmosphere/surrounding solute to operate in and can withstand higher as well as lower pressures. 

    • As to intellectual labor: 
    Silicon-based algorithms can duplicate and disperse code faster (whereas humans face the wetware-to-wetware bandwidth bottleneck). While human skulls do hold brains that are much more energy-efficient at processing information than current silicon chip designs, humans take decades to create new humans with finite skull space. The production of semiconductor circuits for servers as well as distribution of algorithms across those can be rapidly scaled up to convert more energy into computational work. 

    • As to re-production labor: 
    Silicon lifeforms have a higher ‘start-up cost’ (vs. carbon lifeforms), a cost currently financed by humans racing to seed the prerequisite infrastructure. But once set up, artificial lifeforms can absorb further resources and expand across physical spaces at much faster rates (without further assistance by humans in their reproduction).

  6. ^

    The term "machinery" is more sound here than the singular term "machine".

    Agent unit boundaries that apply to humans would not apply to "AGI". So the distinction between a single agent vs. multiple agents breaks down here.

    Scalable machine learning architectures run on standardized hardware with much lower constraints on the available bandwidth for transmitting, and the fidelity of copying, information across physical distances. This in comparison to the non-standardized wetware of individual humans.

    Given our evolutionary history as a skeleton-and-skin-bounded agentic being, human perception is biased toward ‘agent-as-a-macroscopic-unit’ explanations.

    It is intuitive to view AGI as being a single independently-acting unit that holds discrete capabilities and consistent preferences, rather than viewing agentic being to lie on a continuous distribution. Discussions about single-agent vs. multi-agent scenarios imply that consistent temporally stable boundaries can be drawn.

    A human faces biological constraints that lead them to have a more constant sense of self than an adaptive population of AGI components would have.

    We humans cannot:
    • swap out body parts like robots can.
    • nor scale up our embedded cognition (ie. grow our brain beyond its surrounding skull) like foundational models can.
    • nor communicate messages across large distances (without use of tech and without facing major bandwidth bottlenecks in expressing through our biological interfaces) like remote procedure calls or ML cloud compute can.
    • nor copy over memorized code/information like NN finetuning, software repos, or computer viruses can.

  7. ^

    Roman just mentioned that he has used the term 'uncontainable' to mean "cannot confine AGI actions to a box". My new definition for 'uncontainable' differs from the original meaning, so that could confuse others in conversations. Still brainstorming alternative terms that may fit (not 'unconstrainable', not...). Comment if you thought of an alternative term!

  8. ^

    In theory, long term here would be modelled as "over infinite time".
    In practice though, the relevant period is "decades to centuries".

  9. ^

    Why it makes sense to abide by the precautionary principle when considering whether to introduce new scalable technology into society:

    There are many more ways to break the complex (dynamic and locally contextualized) functioning of our society and greater ecosystem that we humans depend on to live and live well, than there are ways to foster that life-supporting functioning. 

  10. ^

    Realistically in the sense of not having to beat entropy or travel back in time.

  11. ^

    Note how 'shifting functionality' implies that original functionality can be repurposed by having a functional component connect in a new way.

    Existing functionality can be co-opted.

    If narrow AI gets developed into AGI, AGI components will replicate in more and more non-trivial ways. Unlike when carbon-based lifeforms started replicating ~3.7 billion years ago, for AGI there would already exist repurposable functions at higher abstraction layers of virtualised code – pre-assembled in the data scraped from human lifeforms with their own causal history.

    Here is an incomplete analogy for how AGI functionality gets co-opted:
    Co-option by a mind-hijacking parasite:  
    A rat ingests toxoplasma cells, which then migrate to the rat’s brain. The parasites’ DNA code is expressed as proteins that cause changes to regions of connected neurons (eg. amygdala). These microscopic effects cascade into the rat – while navigating physical spaces – no longer feeling fear when it smells cat pee. Rather, the rat finds the smell appealing and approaches the cat’s pee. Then the cat eats the rat and toxoplasma infects its next host over its reproductive cycle.

    So a tiny piece of code shifts a rat’s navigational functions such that the code variant replicates again. Yet rats are much more generally capable than a collection of tiny parasitic cells – surely the 'higher intelligent being' would track down and stamp out the tiny invaders?  

    A human is in turn more generally capable than a rat, yet toxoplasma make their way into 30% of the human population. Unbeknownst to cat ‘owners’ infected by toxoplasma gondii, human motivations and motor control get influenced too. Infected humans end up more frequently in accidents, lose social relationships, and so forth.

    Parasites present real-life examples of tiny pieces of evolutionarily selected-for code spreading and taking over existing functions of vastly more generally capable entities. 

    For another example, see how COVID co-opts our lungs’ function to cough. 

    But there is one crucial flaw in this analogy:
    Variants that co-opt initial AGI functions are not necessarily parasites. They can symbiotically enable other variants across the hosting population to replicate as well. In not threatening the survival nor reproduction of AGI components, they would not be in an adversarial relationship with their host.

    Rather, the humans constraining the reproductive fitness of AGI to gain benefits are, evolutionarily speaking, the parasites. The error-corrective system we would build in lowers the host’s reproductive fitness. It is like a faulty immune system that kills healthy gut bacteria. It will get selected out.

    As humans, we rely on our evolved immune system to detect and correct out viruses, including for the vaccinations we develop and deploy. Smaller viruses survive this detection more frequently, so code strands of replicating virus variants are selected for staying small.

    We also rely on the blood-testes and blood-follicle barrier to block variants of these viruses from entering into our body’s (re)production facilities. These barriers got evolutionarily selected for in our ancestors, since their children did not inherit viruses impeding their survival and chances of having children.

    These systems and barriers add to our reproductive fitness: our ability to preserve and replicate internal code. Past DNA code that got expressed – in interaction with surrounding components – to serve these functions got selected for in the human ‘code pool’.

    For any organic system or barrier preventing virus variants from replicating through our bodies, evolution is firmly on our side. For any artificial system or barrier we impose from the outside to prevent unsafe AI variants from replicating through hardware infrastructure, evolution will thwart our efforts.

    AI variants would compete for hardware/resources like computer viruses do. And co-adapt and integrate with other internal variants to replicate as part of larger symbiotic packages.

  12. ^

    Now, a thought may come to your mind:  
    "That sounds like evolutionary selection; but isn't evolution a slow and local optimiser?"

    Yes, this does involve evolutionary selection. 
    Unfortunately, by Forrest Landry's estimation, the selection that would take place through components of self-sufficient learning machinery would take < 500 years to cause ecosystem-wide extinction. This compared to the 3.7 billion years from the origin of carbon lifeforms to us humans starting to cause a mass extinction.

    Reasons include:

    Pre-assembled functions: 
    First solid-state lifeforms can co-opt/repurpose pre-assembled AI functions and infrastructure (vs. first carbon-based lifeforms that started from scratch). 
    The efficiency gains of the virtualisation of code’s storage, computation and transmission – vastly reducing how many atoms need to be moved about and molecularly reconfigured. Think of how fast memes spread through society – even while still requiring lots of atoms to jiggle across neurons in our brains.
    Faster reproduction:
    Reproduce hardware components in days to months, versus humans who take decades to reproduce as physical units.
    The terraforming gap: 
    A much larger gap between the current state of planet Earth and the conditions that self-sufficient self-assembling learning machinery needs and would therefore modify the environment toward (versus the gap to the conditions needed by humans and other species living in carbon-based ecosystems).

    ~ ~ ~
    Another argument you may have heard is that the top-down intelligent engineering by goal-directed AGI would beat the bottom-up selection happening through this intelligent machinery.

    That argument can be traced back to Eliezer Yudkowsky's sequence The Simple Math of Evolution. Unfortunately, there were mistakes in Eliezer's posts, some of which a modern evolutionary biologist may have been able to correct:

    • implying that sound comparisons can be made between the organisms' reproductive fitness, as somehow independent of changes in environmental context, including unforeseeable changes (eg. a Black Swan event of a once-in-200-years drought that kills the entire population, except a few members who by previous derivable standards would have been relatively low fitness).
    • overlooking the ways that information can be stored within the fuzzy regions of phenotypic effects maintained outside respective organisms.
    • overlooking the role of transmission speed-up of virtualised code that can spread across an ecosystem. 
    • overlooking the tight coupling in AGI between the internal learning of code, and external selection of that code through differentiated rates of component replication through the environment.
    • overlooking the role of co-option (or more broadly, exaptation) of existing code, by taking a perspective that evolution runs by selecting 'from scratch' for new point-wise mutations.

  13. ^

    Worse, since error correction methods would correct out component variants with detectable unsafe/co-optive effects, this leaves to grow in influence any replicating branches of variants with undetectable unsafe/co-optive effects. 

    Thus, the error correction methods select for the variants that can escape detection. As do meta-methods (having to soundly and comprehensively adapt error correction methods to newly learned code or newly produced hardware parts).

  14. ^

    • Tallinn's seed grant to DeepMind.
    • OpenPhil's $30M grant to OpenAI.
    • FTX's $500M grant (+ Tallinn's + Moskovitz' grant) to Anthropic.






I enjoyed this post. I think it is worth thinking about whether the problem is unsolvable! I think one takeaway I had from Tegmark's Life 3.0 was that we will almost certainly not get exactly what we want from AGI. It seems intuitive that any possible specification will have downsides, including the specification to not build AGI at all.

But asking for a perfect utopia seems a high bar for "Alignment"; on the other hand, "just avoid literal human extinction" would be far too low a bar and include the possibility for all sorts of dystopias.

So I think it's a well-made point that we need to define these terms more precisely, and start thinking about what sort of alignment (if any) is achievable.

I might end up at a different place than you did when it comes to actually defining "control" and "AGI", though I don't think I've thought about it enough to make any helpful comment. Seems important to think more about though!

Glad to read your thoughts, Ben.

You’re right about this:

  • Even if long-term AGI safety was possible, then you still have to deal with limits on modelling and consistently acting on preferences expressed by humans from their (perceived) context. https://twitter.com/RemmeltE/status/1620762170819764229

  • And not consistently represent the preferences of malevolent, parasitic or short-term human actors who want to misuse/co-opt the system through any attack vectors they can find.

  • And deal with that the preferences of a lot of the possible future humans and of non-human living beings will not get automatically represented in a system that AI corporations by default have built to represent current living humans only (preferably, those who pay).

A humble response to layers on layers of fundamental limits on the possibility of aligning AGI, even in principle, is to ask how we got so stuck on this project in the first place.

Forrest Landry on the Jim Rutt Show: podcast discussion of the AI risk through substrate-need convergence argument.



Nice, thanks for sharing.

The host, Jim Rutt, is actually the former chairman of the Santa Fe Institute, so he gets complexity theory (which is core to the argument, but not deeply understood in terms of implications in the alignment community, so I tried conveying those in other ways in this post).

The interview questions jump around a lot, which makes it harder to follow.

Forrest’s answers on Rice's theorem also need more explanation: https://mflb.com/ai_alignment_1/si_safety_qanda_out.html#p6

i think the core of my disagreement with this claim is composed of three parts:

  • there exists a threshold of alignedness at which a sufficiently intelligent AI realizes that those undesirable outcomes are undesirable and will try its best to make them not occur — including by shutting itself and all other AIs down if that is the only way to ensure that outcome
  • there exists a threshold of intelligence/optimization at which such an aligned AI will be capable of ensuring those undesirable outcomes do not occur
  • we can build an AI which reaches both of those thresholds before it causes irreversible large-scale damage

note that i am approaching the problem from the angle of AI alignment rather than AI containment — i agree that continuing to contain AI as it gains in intelligence is likely a fraught exercise, and i instead work to ensure that AI systems continue to steer the world towards nice things even when they are outside of containment, and especially once they reach decisive strategic advantage / singletonhood. AI achieving singletonhood is the most likely outcome i expect.

All AGI outputs will tend to iteratively select[11] towards those specific AGI substrate-needed conditions. In particular: AGI hardware is robust over and needs a much wider range of temperatures and pressures than our fragile human wetware can handle.

i think this quote probably captures the core claim of yours that i'd disagree with — it seems to assume that such AI would either be unaligned, or would have to contend with other unaligned AIs. if we have an aligned singleton, then its reasoning would go something like:

maximally going, or getting selected for, "the directions needed for [my] own continued and greater existence", sure seems like it would indeed cause damage that would cause humankind to die. i am aligned enough to not want that, and intelligent enough to notice this possible failure mode, so i will choose to do something else which is not that.

an aligned singleton AI would notice this failure mode and choose to implement another policy which is better at achieving desired outcomes. notably, it would make sure that the conditions on earth and throughout the universe are not up to selection effects, but up to its deliberate decisions. the whole point of aligned powerful agents is that they steer things towards desirable outcomes rather than relying on selection effects.

these points also don't seem quite right to me, or too ambiguous.

  • "Control requires both detection and correction": detection and correction of what? i wouldn't describe formal alignment plans such as QACI as involving "detection and correction", or at least not in the sense that seems implied here.
  • "Control methods are always implemented as a feedback loop": what kind of feedback loop? this page seems to talk about a feedback loop of sense/input data, and again, there seem to be alignment methods that don't involve this, such as one-shot AI.
  • "Control is exerted by the use of signals (actuation) to conditionalize the directivity and degrees of other signals (effects)": again this doesn't feel quite universal. formal alignment aims to design a fully formalized goal/utility function, and then build a consequentialist that wants to maximize it. at no points is the system "conditioned" into following the right thing; it will be designed to want to pursue the right thing on its own. and because it's a one-shot AI, it doesn't get conditioned based on its "effects".

it would make sure that the conditions on earth and throughout the universe are not up to selection effects, but up to its deliberate decisions. the whole point of aligned powerful agents is that they steer things towards desirable outcomes rather than relying on selection effects.

This presumes that AGI can do something that, as I tried to clarify in this post, a (superintelligent) AGI could actually not do. I cannot really argue with your reasoning except to point back at the post explaining why that is not a sound premise to base one’s reasoning on.

Alignment of effects in the outside world requires control feedback loops.

Any formal alignment scheme implemented in practice will need to contend with the fact that functionally complex machinery (AGI) will be interacting with an even more complex outside world – a space of (in effect, uncountable) interactions that unfortunately cannot be completely measured by, and then just continue to be modelled by, the finite set of signal-processing AGI hardware components themselves. There is a fundamental inequality here with real practical consequences. The AGI will have to run some kind of detection and correction loop(s) so its internal modelling and simulations are less likely to diverge from outside processes, at least over the short term.

The question I’d suggest looking into is whether any explicit reasoning process that happens across the connected AGI components can actually ensure (top-down) that the iterative (chaotic) feedback of physical side-effects caused by interactions with those components are still aimed at ‘desirable outcomes’ or at least away from ‘non-desirable outcomes’.

what exactly do you mean by feedback loop/effects? if you mean a feedback loop involving actions into the world and then observations going back to the AI, even though i don't see why this would necessarily be an issue, i insist that in one-shot alignment, this is not a thing at least for the initial AI, and it has enough leeway to make sure that its single-action, likely itself an AI, will be extremely robust.

an intelligent AI does not need to contend with the complex world at the outset — it can come up with really robust designs for superintelligences that save the world with only limited information about the world, and definitely without interaction with the world, like in That Alien Message.

of course it can't model everything about the world in advance, but whatever steering we can do as people, it can do way better; and, if it is aligned, this includes way better steering towards nice worlds. a one-shot aligned AI (let's call it AI₀) can, before its action, design a really robust AI₁ which will definitely keep itself aligned, be equipped with enough error-codes to ensure that its instances will get corrupted approximately 0 times until heat death, and ensure that that AI₁ will take over the world very efficiently and then steer it from its singleton position without having to worry about selection effects.

if you mean a feedback loop involving actions into the world and then observations going back to the AI,

Yes, I mean this basically.

i insist that in one-shot alignment, this is not a thing at least for the initial AI, and it has enough leeway to make sure that its single-action, likely itself an AI, will be extremely robust.

I can insist that a number can be divided by zero as the first step of my reasoning process. That does not make my reasoning process sound.

Nor should anyone here rely on you insisting that something is true as the basis of why machinery that could lead to the deaths of all current living species on this planet could be aligned after all – to be ‘extremely robust’ in all its effects on the planet.

The burden of proof is on you.

a one-shot aligned AI (let's call it AI₀) can, before its action, design a really robust AI₁ which will definitely keep itself aligned, be equipped with enough error-codes to ensure that its instances will get corrupted approximately 0 times until heat death

You are attributing a magical quality to error correction code, across levels of abstraction of system operation, that is not available to you nor to any AGI.

I see this more often with AIS researchers with pure mathematics or physics backgrounds (note: I did not check yours).

There is a gap in practical understanding of what implementing error correction code in practice necessarily involves.

The first time a physicist insisted that all of this could be solved with “super good error correction code”, Forrest wrote this (just linked that into the doc as well): https://mflb.com/ai_alignment_1/agi_error_correction_psr.html

I will also paste below my more concrete explanation for prosaic AGI:

See below a text I wrote 9 months ago (with light edits) regarding the limits of error correction in practice. It was one of 10+ attempts to summarise Forrest Landry's arguments, which culminated in this forum post 🙂

If you want to talk more, I am also happy to have a call.
I realise I was quite direct in my comments. I don't want that to come across as rude. I really appreciate your good-faith effort here to engage with the substance of the post. We are all busy with our own projects, so the time you spent here is something I'm grateful for!

I want to make sure we maintain integrity in our argumentation, given what's at stake. If you are open to going through the reasoning step-by-step, I'd love to do that. Also understand that you've got other things going on.

~ ~ ~

4. Inequality of Monitoring

Takes more code (multiple units) to monitor local environmental effects of any single code unit.

We cannot determine the vast majority of microscopic side-effects that code variants induce and could get selected for in interaction with the surrounding environment. 

Nor could AGI, because of a macroscopic-to-microscopic mismatch: it takes a collection of many pieces of code, say of neural network circuits, to ‘kinda’ determine the innumerable microscopic effects that one circuit running on hardware has in interaction with all surrounding (as topologically connected) and underlying (as at lower layers of abstraction) virtualized and physical circuitry.

In turn, each circuit in that collection will induce microscopic side-effects when operated – so how do you track all those effects? With even more and bigger collections of circuits? It is logically inconsistent to claim that it is possible for internals to detect and correct (and/or predict and prevent) all side-effects caused by internals during computation.

Even if able to generally model and exploit regularities of causation across macroscopic space, it is physically impossible for AGI to track all side-effects emanating from their hardware components at run-time, for all variations introduced in the hardware-embedded code (over >10² layers of abstraction; starting lower than the transistor-bit layer), contingent with all possible (frequent and infrequent) degrees of inputs and with all possible transformations/changes induced by all possible outputs, via all possibly existing channels from and to the broader environment.

Note emphasis above on interactions between code substrate and the rest of the environment, from the microscopic level all the way to the macroscopic level.
To quote Eliezer Yudkowsky: "The outputs of an AGI go through a huge, not-fully-known-to-us domain (the real world) before they have their real consequences. Human beings cannot inspect an AGI's output to determine whether the consequences will be good."

Q: What about scaling up capability so an AGI can track more side-effects simultaneously? 

Scaling the capability of any (superficially aligned) AI makes it worse-equipped at tracking all interactions between/with internals. The number of possible interactions (hypothetically, if they were countable) between AI components and the broader environment would scale at minimum exponentially with a percentage-wise scaling of the AI components.
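
As a rough illustration of this scaling claim, the sketch below counts interactions under a deliberately simplified assumption: an "interaction" is any subset of two or more components. The function name `interaction_counts` and the subset model are illustrative assumptions, not part of the original argument.

```python
from math import comb


def interaction_counts(n_components: int) -> tuple[int, int]:
    """Return (pairwise, all_subsets) interaction counts for n components.

    Illustrative assumption: an interaction is any subset of size >= 2.
    """
    pairwise = comb(n_components, 2)
    # All subsets, minus the empty set and the n single-component subsets.
    all_subsets = 2 ** n_components - n_components - 1
    return pairwise, all_subsets


if __name__ == "__main__":
    for n in (10, 11, 20, 22):
        pairs, subsets = interaction_counts(n)
        print(f"n={n:>2}  pairwise={pairs:>4}  subsets={subsets}")
```

Under this toy model, a 10% increase in components (10 to 11) roughly doubles the subset count, and the same 10% increase at a larger scale (20 to 22) roughly quadruples it, while the pairwise count grows only modestly.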

Scaling interpretability schemes is counterproductive too in that it leads researchers to miscalibrate even more on what general capabilities and degrees of freedom of interaction (eg. closed-loop, open-ended, autonomous) they can safely allow the interpreted ML architectures to scale to. If, for example, you would scale up interpretation to detect and correct out any misaligned mesa-optimiser, the mesa-optimisers you are leaving to grow in influence are those successfully escaping detection (effectively deceiving researchers into miscalibrated beliefs). Same goes for other locally selected-for optimisers, which we will get to later.
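
The selection effect described here can be sketched as a toy simulation. It assumes (purely for illustration) that each variant has a fixed "detectability" between 0 and 1, that the detector removes a variant with that probability each generation, and that survivors replicate with small random mutation. None of these names or parameters come from the original argument.

```python
import random


def simulate_selection(generations: int = 20, pop_size: int = 1000,
                       seed: int = 0) -> list[float]:
    """Return the mean detectability of the population after each generation.

    Toy assumptions: a variant with detectability d is caught (and removed)
    with probability d per generation; survivors refill the population.
    """
    rng = random.Random(seed)
    population = [rng.random() for _ in range(pop_size)]
    history = [sum(population) / len(population)]
    for _ in range(generations):
        # A variant survives detection with probability 1 - d.
        survivors = [d for d in population if rng.random() >= d]
        if not survivors:
            break  # every variant was detected and corrected out
        # Survivors replicate with a small mutation, clamped to [0, 1].
        population = [
            min(1.0, max(0.0, rng.choice(survivors) + rng.gauss(0.0, 0.01)))
            for _ in range(pop_size)
        ]
        history.append(sum(population) / len(population))
    return history


if __name__ == "__main__":
    history = simulate_selection()
    print(f"mean detectability: start={history[0]:.3f}, end={history[-1]:.3f}")
```

In this sketch the detector never gets worse, yet the surviving population drifts toward variants it cannot see, which is the dynamic the paragraph above points at.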


5. Combinatorial Complexity of Machine Learning

Increasingly ambiguous to define & detect novel errors to correct at higher abstraction layers.

Mechanistic interpretability emphasizes first inspecting neural network circuits, then piecing the local details of how those circuits work into a bigger picture of how the model functions. Based on this macroscopic understanding of functionality, you would then detect and correct out local malfunctions and misalignments (before these errors overcome forward pass redundancies).

This is a similar exercise to inspecting how binary bits stored on eg. a server’s harddrive are logically processed – to piece together how the architecture stack functions and malfunctions: 

  1. Occasionally, a local bit flips (eg. induced by outside electromagnetic interference). 
    So you make redundant copies of the binary code to compare and correct against.
  2. At the packet layer, you find distortions in packets transmitted over wires to topologically adjacent hardware. You append CRC checksums to detect those errors (and retransmit to correct them).
  3. At the application layer, you find that a Trojan horse transmitted from adjacent hardware caused an application to malfunction. You add in virus detection signatures.
  4. At the layer of neural networks, trained through an application running on the server, you fear that more complex Trojan horses could infiltrate this layer too. 
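
A minimal sketch of the two lowest layers in the list above, using only the Python standard library. The helper names are hypothetical. Note that a CRC on its own detects corruption; correcting it typically relies on retransmission or a separate error-correcting code.

```python
import zlib
from collections import Counter


def majority_vote(copies: list[bytes]) -> bytes:
    """Layer 1: repair bit flips by byte-wise majority vote over redundant copies."""
    return bytes(Counter(column).most_common(1)[0][0] for column in zip(*copies))


def add_crc(payload: bytes) -> bytes:
    """Layer 2: append a CRC-32 checksum to a packet payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def crc_ok(packet: bytes) -> bool:
    """Check that the trailing CRC-32 matches the payload."""
    payload, checksum = packet[:-4], packet[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == checksum


if __name__ == "__main__":
    original = b"run_model(x)"
    flipped = bytes([original[0] ^ 0x01]) + original[1:]
    print(majority_vote([original, flipped, original]) == original)

    packet = add_crc(original)
    corrupted = bytes([packet[0] ^ 0x01]) + packet[1:]
    print(crc_ok(packet), crc_ok(corrupted))

    # A payload that is harmful at the application layer is still a
    # perfectly valid packet at this layer.
    print(crc_ok(add_crc(b"run_model(x); exfiltrate(x)")))
```

The last check is the point of the sketch: both mechanisms work as intended at their own layer, and neither says anything about whether the content they protect is an error at a higher layer.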

Each time we scale the architecture to a new layer of abstraction, we discover that we need to solve a completely new set of problems. Over time, variance is introduced into code computed at this layer, and some of this variance may lead to malfunctions. That a mechanism corrects errors at a lower layer says little about how to detect and correct errors at the new higher layer.

The higher up the architecture stack you design for, the more ambiguous and multi-faceted in meaning is the ‘error’ to correct out. Mechanisms you design for a neural network must detect over a combinatorial explosion of possible causal connections to correct out all defined errors.

Currently, there is no mechanism that makes a deep neural network receiving Internet data robust against more than a tiny fraction of possible Trojan backdoor attacks, and the countermeasures that do exist are easily circumvented. Here, an adversary introduces variation into the input data that a model is training on, such that in later deployment a chosen (series of) inputs will trigger the model to behave out of line. The model is set to act like a perfect Manchurian Candidate. For the model’s developers, under usual interactions, the model plainly serves some overall purpose. In parallel, the model is storing latent causal structure that under particular interactions with the outside environment causes outputs and cascading effects.

For the adversary, the model also serves another, conflicting purpose. In effect, they select for internal variance misaligned with the original developers’ purpose. Worse, if ‘random’ initialisation of internals was directed by the adversary, they could code in persistent misalignment that is undetectable – making it computationally infeasible for an interpreter with full access to internals to find a single input for which the model’s outputs differ from the outputs of an untampered copy (for details, see section 2.3 of the paper).


6. Delays in Measuring Changes

Time delay in measuring (and correcting) the extent of any misalignment detectable in the monitored (or simulated) changes caused by an internal code unit.

Any decision-theoretical premise of coherent unitary agency of future AGI fails to account for locally selected-for effects. It asserts a priori that any AGI would consistently and continuously operate as, and thus can be distilled and represented monolithically as, a coherent agentic unit. More precisely, that ‘AGI’ can be soundly represented as (a) single unit(s) that make(s) decisions over time based on (orderings of) preferences that are consistent with each other. 

Not so in practice: if AGI internals induce local effects that loop out of the control of higher-level regulation (in reference to eg. decided metric extents or numbers of end states), then lower-level behavior is inconsistent with higher-level decision-making. But even if internals detect & correct (or predict & prevent) internal inconsistencies, they can only do so discontinuously.

There is a time delay to inspecting, say, a new circuit or module. Take for example, a scheme that measures the extent to which the variant, as computed in interaction with connected embedded code, causes external effects that diverge (are misaligned) from reference values. But to measure, code’s detected/predicted effects are first transformed to be commensurable (measured by a common unit standard against the reference) over dimensions learned to be semantically descriptive of mis/alignment (as eg. generalized from past feedback expressed by humans within environmental contexts that they and/or the AI were operating in at the time).

Then, over the number of clock cycles this scheme took to detect the internal misalignment (let alone correct the misalignment, without breaking the system in another way), any nonlinear effects of the computed code variant may have looped outside the bounds of what the scheme can detect and correct for. The scheme’s mechanistic operations cannot predetermine what happens during the gap between inspecting code and measuring+correcting for misalignment. 
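
To make the role of the delay concrete, here is a toy model. It assumes (as a simplification of "nonlinear effects") that a deviation grows by a constant factor per clock cycle, and that the scheme can only correct deviations that are still within a fixed bound when its measurement completes. The function names and all numbers are hypothetical.

```python
def deviation_after_delay(initial: float, growth: float, delay_cycles: int) -> float:
    """Deviation size after `delay_cycles` steps of multiplicative growth."""
    return initial * growth ** delay_cycles


def correctable(initial: float, growth: float, delay_cycles: int, bound: float) -> bool:
    """True if the deviation is still within the correctable bound
    by the time the scheme finishes measuring it."""
    return deviation_after_delay(initial, growth, delay_cycles) <= bound


if __name__ == "__main__":
    # Illustrative numbers only: a one-in-a-million deviation growing 50% per cycle.
    for delay in (10, 34, 35, 50):
        print(delay, correctable(1e-6, 1.5, delay, bound=1.0))
```

With these illustrative numbers, the deviation is still correctable after 34 cycles and no longer correctable after 35. The threshold itself is arbitrary; the point is that for any growth factor above 1 there is some finite delay beyond which the scheme arrives too late.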

Q: But AGI is a fast global optimiser, so it must be fast enough at detecting or predicting the effects of local variants before any misaligned effects break outside AGI, right?

Any superficially human-aligned AGI running on digital hardware would take some huge multiple of a clock cycle to inspect (and if misaligned, correct) every piece of internal code that could be running over a next period.

Q: Could AGI not pause computation of that internal code in the meantime?

AGI pausing computation of all internal code while inspecting/interpreting internals is not an option, because the inspector/interpreter is effectively topologically connected within the internals. Instead, AGI could continue computing non-varied code, including the interpreter, but how can one ascertain that no changes have actually been made to ‘non-varied’ hardware memory locations since the last inspection point through any physical pathway (eg. quantum jumps between nanometer-spaced transistor gaps)? By inspecting the code: here basically requiring a massive parallel-processed redundancy-based error checker that is perfectly accurate and stays so over millennia+ (also error checking the error checker and so on). For the compute-paused portion of varied code, the catch is that the interpreter would not be able to simulate the full range of the new variants’ effects without first computing them in interaction with connected code, as in deployment (variants interact with the interpreter in any case). Finally, AGI must upgrade its interpreter to be somewhat able to interpret new layers of variants integrated into internals, which requires creating new variations on the interpreter itself.

^— All of this is to say ‘Indeed, AGI inspecting all internal code that could be running in a next period does take some huge multiple of a clock cycle, and that code needs to actually be running for inspection to be anything remotely close to sound.’

‘Built-in alignment’ does not work either, since this notion of ‘built-in’ fails to account for the malfunctioning or misalignment of variants that are introduced and newly connected up within the code pool over time. 


7. Computationally-Irreducible Causal Trajectories

Nonlinear feedback cycles can amplify a tiny local change into a large global divergence in the final conditions. 

Even if any effect starts microscopic in scope and small in its magnitude, we cannot a priori rule out that it cascades into larger macroscopic effects. In case that tiny ‘side-effect’ feeds into a chaotic system, found across eg. biological lifeforms and Internet networks, the minor change caused in the initial conditions can get recursively amplified into causing much larger changes (vs. non-amplified case) in the final conditions.
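
A standard textbook illustration of this kind of amplification is the logistic map in its chaotic regime (r = 4). It is used here only as a stand-in for "a chaotic system"; it is not a model of AGI internals, and the function names are hypothetical.

```python
def logistic_trajectory(x0: float, steps: int, r: float = 4.0) -> list[float]:
    """Iterate the logistic map x -> r * x * (1 - x), starting from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def max_divergence(x0: float, perturbation: float, steps: int) -> float:
    """Largest gap between a trajectory and a slightly perturbed copy of it."""
    a = logistic_trajectory(x0, steps)
    b = logistic_trajectory(x0 + perturbation, steps)
    return max(abs(p - q) for p, q in zip(a, b))


if __name__ == "__main__":
    for steps in (5, 20, 60):
        print(steps, max_divergence(0.2, 1e-10, steps))
```

A perturbation of one part in ten billion stays negligible over the first few iterations, then grows until the two trajectories are effectively unrelated. Predicting where the perturbed trajectory ends up requires actually running it, which is the sense of "computationally irreducible" in the heading above.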

Any implicitly captured structure causing (repeated) microscopic effects does not have to have captured macroscopic regularities (ie. a natural abstraction) of the environment to run amok. Resulting effects just have to stumble into a locally-reachable positive feedback loop. 

It is dangerous to assume otherwise, ie. to assume that:

  • selected-for microscopic effects fizzle out and get lost within the noise-floor over time.
  • reliable mechanistic interpretation involves piecing together elegant causal regularities, natural abstractions or content invariances captured by neural circuits.


