The long-term future of intelligent life is currently unpredictable and undetermined. In the linked document, we argue that the invention of artificial general intelligence (AGI) could change this by making extreme types of lock-in technologically feasible. In particular, we argue that AGI would make it technologically feasible to (i) perfectly preserve nuanced specifications of a wide variety of values or goals far into the future, and (ii) develop AGI-based institutions that would (with high probability) competently pursue any such values for at least millions, and plausibly trillions, of years.

    The rest of this post contains the summary (6 pages), with links to relevant sections of the main document (40 pages) for readers who want more details.

    0.0 The claim

    Life on Earth could survive for millions of years. Life in space could plausibly survive for trillions of years. What will happen to intelligent life during this time? Some possible claims are:

    A. Humanity will almost certainly go extinct in the next million years.

    B. Under Darwinian pressures, intelligent life will spread throughout the stars and rapidly evolve toward maximal reproductive fitness.

    C. Through moral reflection, intelligent life will reliably be driven to pursue some specific “higher” (non-reproductive) goal, such as maximizing the happiness of all creatures.

    D. The choices of intelligent life are deeply, fundamentally uncertain. It will at no point be predictable what intelligent beings will choose to do in the following 1000 years.

    E. It is possible to stabilize many features of society for millions or trillions of years. But it is possible to stabilize them into many different shapes — so civilization’s long-term behavior is contingent on what happens early on.

    Claims A-C assert that the future is basically determined today. Claim D asserts that the future is, and will remain, undetermined. In this document, we argue for claim E: Some of the most important features of the future of intelligent life are currently undetermined but could become determined relatively soon (relative to the trillions of years life could last).

    In particular, our main claim is that artificial general intelligence (AGI) will make it technologically feasible to construct long-lived institutions pursuing a wide variety of possible goals. We can break this into three assertions, all conditional on the availability of AGI:

    1. It will be possible to preserve highly nuanced specifications of values and goals far into the future, without losing any information.
    2. With sufficient investments, it will be feasible to develop AGI-based institutions that (with high probability) competently and faithfully pursue any such values until an external source stops them, or until the values in question imply that they should stop.
    3. If a large majority of the world’s economic and military powers agreed to set up such an institution, and bestowed it with the power to defend itself against external threats, that institution could pursue its agenda for at least millions of years (and perhaps for trillions).

    Note that we’re mostly making claims about feasibility as opposed to likelihood. We only briefly discuss whether people would want to do something like this in Section 2.2.

    (Relatedly, even though the possibility of stability implies claim E in the list above, there could still be a strong tendency towards worlds described by one of the other options A-D. In practice, we think D seems unlikely, but you could make reasonable arguments that any of the end-points described by A, B, or C is probable.)

    Why are we interested in this set of claims? There are a few different reasons:

    • The possibility of stable institutions could pose an existential risk, if they implemented poorly chosen and insufficiently flexible values.
    • On the other hand, if we want humane values or institutions such as liberal democracy to survive in the long-run, some types of stability may be crucial for preserving them.
    • The possibility of ultra-stable institutions pursuing any of a wide variety of values, and the seeming generality of the methods that underlie them, suggest that significant influence over the long-run future is possible. This should inspire careful reflection on how to make it as good as possible.

    We will now go over claims 1, 2, and 3 from above in more detail.

    0.1 Preserving information

    In the beginning of human civilization, the only way of preserving information was to pass it down from generation to generation, with inevitable corruption along the way. The invention of writing significantly boosted civilizational memory, but writing has relatively low bandwidth. By contrast, the invention of AGI would enable the preservation of entire minds. With whole-brain emulation (WBE), we could preserve entire human minds, and ask them what they would think about future choices. Even without WBE, we could preserve newly designed AGI minds that would give (mostly) unambiguous judgments of novel situations. (See section 4.1.)

    Such systems could encode information about a wide variety of goals and values, for example:

    • Ensure that future civilizational decisions are made democratically.
    • Enforce a ban on certain weapons of mass destruction (WMD).
    • Make sure that reverence is paid to some particular religion.
    • Always do what some particular group of humans would have wanted.

    Crucially, using digital error correction, it would be extremely unlikely that errors would be introduced even across millions or billions of years. (See section 4.2.) Furthermore, values could be stored redundantly across many different locations, so that no local accident could destroy them. Wiping them all out would require either (i) a worldwide catastrophe, or (ii) intentional action. (See section 4.3.)
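    The scale of this effect is easy to sketch. The following toy calculation (our illustration; the document itself contains no code) treats each stored bit as n redundant copies that are periodically restored to their majority value, so a permanent error requires most copies to flip within a single correction interval:

```python
# Toy model of redundant storage with periodic majority-vote repair.
# A stored bit is only permanently corrupted if, between two repair
# passes, a strict majority of its n copies flip.
from math import comb

def corruption_prob(p_flip: float, n_copies: int) -> float:
    """Probability that a strict majority of copies flip in one interval."""
    k_min = n_copies // 2 + 1
    return sum(
        comb(n_copies, k) * p_flip**k * (1 - p_flip)**(n_copies - k)
        for k in range(k_min, n_copies + 1)
    )

# Even with a generous 1% per-copy flip chance per interval, modest
# redundancy makes permanent corruption astronomically unlikely.
for n in (1, 5, 25):
    print(n, corruption_prob(0.01, n))
```

    With 25 copies the per-interval corruption probability is on the order of 10^-20, which is the sense in which errors become "extremely unlikely" even across millions of years of repair cycles.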

    0.2 Executing intentions

    So let’s say that we can store nuanced sets of values. Would it be possible to design an institution that stays motivated to act according to those values?

    Today, tasks can only be delegated to humans, whose goals and desires often differ from the goals of the delegator. With AGI, all tasks necessary for an institution's survival could instead be automated, performed by artificial minds instead of biological humans. We will discuss the following two questions:

    • Will it be possible to construct AGI systems that (with high probability) are aligned with the intended values?
    • Will such systems stay aligned even over long periods of time?

    0.2.1 Aligning AGI

    Currently, humanity knows less about how to predict and control the behavior of advanced AI systems than about predicting and controlling the behavior of humans. The problem of how to control the behaviors and intentions of AI is commonly known as the alignment problem, and we do not yet have a solution to it.

    However, there are reasons why it could eventually be far more robust to delegate problems to AGI, than to rely on (biological) humans:

    • With sufficient understanding of how to induce particular goals, AI systems could be designed to more single-mindedly optimize for the intended goal, whereas most humans will always have some other desires, e.g. survival, status, or sexuality.
    • AI behavior can be thoroughly tested in numerous simulated situations, including high-stakes situations designed to elicit problematic behavior.
    • AI systems could be designed for interpretability, perhaps allowing developers and supervisors to directly read their thoughts, and to directly understand how they would behave in a wide class of scenarios.

    Thus, we suspect that an adequate solution to AI alignment could be achieved given sufficient time and effort. (Though whether that will actually happen is a different question, not addressed since our focus is on feasibility rather than likelihood.)

    Note also that if we don’t make substantial progress on the alignment problem, but still keep building AI systems that are more capable and more numerous, this could eventually lead to permanent human disempowerment. In other words, if this particular step of the argument doesn’t go through, the alternative is probably not a business-as-usual human world (without the possibility of stable institutions), but instead a future where misaligned AI systems are ruling the world.

    (For more, see section 5.)

    0.2.2 Preventing drift

    As mentioned in section 0.1, digital error correction could be used to losslessly preserve the information content of values. But this doesn’t entirely remove the possibility of value-drift.

    In order to pursue goals, AGI systems need to learn many facts about the world and update their heuristics of how to deal with new challenges and local contexts. Perhaps it will be possible to design AGI systems with goals that are cleanly separated from the rest of their cognition (e.g. as an explicit utility function), such that learning new facts and heuristics doesn’t change the systems’ values. But the one example of general intelligence we have — humans — instead seem to store their values as a distributed combination of many heuristics, intuitions, and patterns of thought. If the same is true for AGI, it is hard to be confident that new experiences would not occasionally cause their values to shift. 

    Thus, although it’s not clear how much of a concern this will be, we will discuss how an institution might prevent drift even if individual AI systems sometimes changed their goals. Possible options include:

    • Whenever there’s uncertainty about what to do in a novel situation, or a high-stakes decision needs to be made, the institution could boot up a completely-reset version of an AI system (or a brain emulation) that acts according to the original values.
      • This system will have had no previous chance of value-drift, and so only needs to be informed about anything that is a prerequisite for judging the situation.
      • In order to reduce contingency from how these prerequisites are learned, the institution could bring back multiple copies and inform them in different ways — and also let some of the copies opine on how to inform the other copies. And then have them all discuss what the right option is.
    • AI systems designed to execute particular tasks could be motivated to do whatever the more thorough process would recommend. They could be extremely well-tested on the types of situations that most frequently come up while performing that task.
      • For any tasks that didn’t require high context over a long period of time, they could be frequently reset back to a well-tested state.
      • If the task did require a larger amount of context over a longer period of time, they could be supervised and frequently re-tested by other AI systems with less context. These may not be able to correctly identify the value of the supervisee’s every action, but they could prevent the supervisee from performing any catastrophic actions. (Especially with access to transparency tools that allow for effective mind-reading.)
    • Value drift that is effectively random could be eliminated by having a large number of AI systems with slightly-different backgrounds make an independent judgment about what the right decision is, and take the majority vote.
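    To illustrate the last option: if each judgment is independently correct with some fixed probability, majority voting suppresses effectively-random errors very quickly (the Condorcet jury theorem). A quick Monte Carlo sketch, with made-up numbers:

```python
# Monte Carlo estimate of how often a panel's majority verdict is correct,
# assuming each AI judge is independently correct with probability p.
import random

def majority_correct_rate(p: float, n_judges: int,
                          trials: int = 100_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        correct = sum(rng.random() < p for _ in range(n_judges))
        wins += correct > n_judges // 2  # strict majority (use odd panels)
    return wins / trials

# Individually-fallible judges, increasingly reliable panels.
for n in (1, 9, 99):
    print(n, majority_correct_rate(0.9, n))
```

    Note that this only helps against drift that is effectively random; correlated drift, where all judges share the same flaw, is untouched, which is why the other options above matter too.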

    Some of these options might reveal inputs where AI systems systematically behave badly, or where it’s not clear if they’re behaving well or badly. For example, they might:

    • endorse options that less-informed versions of themselves disagree strongly with,
    • have irresolvable disagreements with AI systems which have somewhat different previous experiences,
    • exhibit thought-patterns (detected with transparency tools) that show doubt about the institutions’ original principles.

    In most cases, the reason for the discrepancy could probably be identified, and the AI design could be modified to act as desired. But it’s worth noting that even in situations where it remains unclear what the desired behavior is, or in situations where it’s somehow difficult to design a system that responds in the desired way, a sufficiently conservative institution could simply opt to prevent AI systems from being exposed to inputs like that (picking some sub-optimal but non-catastrophic resolution to any dilemmas that can’t be properly considered without those inputs).

    • An extreme version of this would be to prevent all reasoning that could plausibly lead to value-drift, halting progress in philosophy.
      • It doesn’t seem impossible that all philosophically ambitious institutions would eventually converge to some very similar set of behavior, from a very wide range of starting points. This might be the case if some form of moral realism holds, or perhaps if something like Evidential Cooperation in Large Worlds works. If this were the case, our claim that it’s feasible to stabilize many different value-systems would be false for philosophically ambitious institutions. It would only apply to institutions that refused to conduct some philosophical investigations. (Which we hope wouldn’t be very common.)
    • A further extreme would be for the institution to also halt technological progress and societal progress in general (insofar as it had the power to do that) to avoid any situation where the original values can’t give an unambiguous judgment.
      • This would largely eliminate the issue motivating this subsection — that continual learning could lead to value drift — since complete stagnation wouldn’t require much in the way of continual learning.
      • But depending on when technological progress was halted, this could limit the institution's ability to survive in other ways, e.g. by preventing it from leaving Earth before its doom.

    Given all these options, it seems more likely than not that an institution could practically eliminate any internal sources of drift that it wanted to. (For more, see section 6.)

    0.3 Preventing disruption

    So let’s say that it will remain mostly-unambiguous what an institution is supposed to do, in any given situation, and furthermore that the institution will keep being motivated to act that way.

    Now, let’s consider a situation where this institution — at least temporarily — has uncontested military and economic dominance (let’s call this a “dominant institution”). Let’s also say that the institution’s goals include a consequentialist drive to maintain that dominance (at least instrumentally). Could the institution do this? On our best guess, the answer would be “yes” (with exceptions for encountering alien civilizations, and for the eventual end of usable resources).

    Any resources, information, and agents necessary for the institution’s survival could be copied and stored redundantly across the Earth (and, eventually, other planets). Thus, in order to prevent the institution from rebuilding, an event would need to be global in scope. 

    As we argue in section 7, natural events of civilization-threatening magnitude are rare, and the main mechanism they have to pose a global threat to human civilization is that they would throw up enough dust to blot out the sun for a few years. A well-prepared AI civilization could easily survive such events by having energy sources that don’t depend on the sun. In a few billion years, the expansion of the Sun will prevent further life on Earth, but a technologically sophisticated stable institution could avoid destruction by spreading to space.

    As we argue in section 8, a dominant institution could also prevent other intelligent actors from disrupting the institution. Uncontested economic dominance would allow the institution to manufacture and control loyal AGI systems that far outnumber any humans or non-loyal AI systems. Thus, insofar as any other actors could pose a threat, it would be economically cheap to surveil them as much as necessary to suppress that possibility. In practice, this could plausibly just involve enough surveillance to:

    • prevent others from building weapons of mass destruction,
    • prevent others from building a competitive institution of similar economic or military strength, and
    • prevent others from leaving the institution’s domain by colonizing uninhabited parts of space.

    The main exception to this is alien civilizations, which could at first contact already be more powerful than the Earth-originating institution.

    Ultimately, the main boundaries to a stable, dominant institution would be (i) alien civilizations, (ii) the eventual end of accessible resources predicted by the second law of thermodynamics, and (iii) any disruptive Universe-wide physical events (such as a Big Rip scenario), although to our knowledge no such events are predicted by standard cosmology.

    0.4 Some things we don’t argue for

    To be clear, here are two things that we don’t argue for:

    First, we don’t think that the future is necessarily very contingent, from where we stand today. For example, it might be the case that almost no humans would make an ultra-stable institution that pursues a goal that those humans themselves couldn’t later change (if they changed their mind). And it might be the case that most humans would eventually end up with fairly similar ideas about what is good to do, after thinking about it for a sufficiently long time.

    Second, we don’t think that extreme stability (of the sort that could make the future contingent on early events) would necessarily require a lot of dedicated effort. The options for increasing stability we sketch in sections 0.2.2 and 6 and the assumption of a singleton-like entity in sections 0.3 and 8 are brought up to make the point that stability is feasible at least in those circumstances. It seems plausible that they wouldn’t be necessary in practice. Perhaps stability will only require a smaller amount of effort. Perhaps the world’s values would stabilize by default given the (not very unlikely) combination of:

    • technological maturity (preventing new technologies from shaking things up),
    • human immortality (reducing drift from generational changes), 
    • the ability to cheaply and stably align AGI systems with any goal, and
    • such AI systems being equally good at pursuing instrumental goals regardless of what terminal goals they have. (Thereby mostly eliminating the tendency for some values to outcompete others, cf. decoupling deliberation from competition.)

    0.5 Structure of the document

    Readers should feel free to skip to whatever parts they’re interested in. (See also the table of contents.)

    Contributions

    Lukas Finnveden was the lead author. Some parts of this document started as an unfinished report prepared by Jess Riedel while he was an employee at Open Philanthropy. Carl Shulman contributed many of the ideas, and both Jess and Carl provided multiple rounds of comments. Lukas did most of the work while he was part of the Research Scholars Programme at the Future of Humanity Institute (although at the time of publishing, he works for Open Philanthropy). All views are our own.
     


    23 comments

    I'm curating this (although I wish it had a more skimmable summary). 

    It's an important topic (and a weak point in the classic most important century discussion) and a lot of the considerations[1] seem important and new (at least to me!). I like that the post and document make a serious attempt at clarifying what isn't being said (like some claims about likelihood), flag different levels of uncertainty in the various claims, and clarify what is meant by "AGI"[2].

    Here's a quick attempt at a restructured/slightly paraphrased summary — please correct me if I got something wrong: 

    • Assuming AGI, it's relatively possible[3] to stabilize/lock in many features of society — both good and bad — for a long time (millions or trillions of years). This is because:
      • AGIs can be faithful to a specific goal or a set of goals for a long time
      • with sufficient resources, institutions can be created that will pursue these values until an external source (like foreign intervention, the death of an authoritarian leader, or internal rebellion) stops them
      • current economic and military powers could come together and use AGI to make an institution of this kind, which would be able to defend itself against external sources
        • Meaning that such an institution could pursue its agenda for millions or trillions of years.
    • Why this matters: 
      • Stabilizing features like this can be bad (an existential risk) if their values or goals are poorly chosen or insufficiently flexible.
      • Stable institutions could be important to ensuring good values do persist.
      • The feasibility of the above is evidence that "significant influence over the long-run future is possible." 
    1. ^
    2. ^ See here
    3. ^ For the different levels of confidence the authors have in these arguments, you can look at this section in the document

    Thanks Lizka. I think about section 0.0 as being a ~1-page summary (in between the 1-paragraph summary and the 6-page summary) but I could have better flagged that it can be read that way. And your bullet point summary is definitely even punchier.

    Consider a civilization that has "locked in" the value of hedonistic utilitarianism. Subsequently some AI in this civilization discovers what appears to be a convincing argument for a new, more optimal design of hedonium, which purports to be 2x more efficient at generating hedons per unit of resources consumed. Except that this argument actually exploits a flaw in the reasoning processes of the AI (which is widespread in this civilization) such that the new design is actually optimized for something different from what was intended when the "lock in" happened. The closest this post comes to addressing this scenario seems to be "An extreme version of this would be to prevent all reasoning that could plausibly lead to value-drift, halting progress in philosophy." But even if a civilization was willing to take this extreme step, I'm not sure how you'd design a filter that could reliably detect and block all "reasoning" that might exploit some flaw in your reasoning process.

    Maybe in order to prevent this, the civilization tried to lock in "maximize the quantity of this specific design of hedonium" as their goal instead of hedonistic utilitarianism in the abstract. But 1) maybe the original design of hedonium is already flawed or highly suboptimal, and 2) what if (as an example) some AI discovers an argument that they should engage in acausal trade in order to maximize the quantity of hedonium in the multiverse, except that this argument is actually wrong.

    This is related to the problem of metaphilosophy, and my hope that we can one day understand "correct reasoning" well enough to design AIs that we can be confident are free from flaws like these, but I don't know how to argue that this is actually feasible.

    I broadly agree with this. For the civilizations that want to keep thinking about their values or the philosophically tricky parts of their strategy, there will be an open question about how convergent/correct their thinking process is (although there's lots you can do to make it more convergent/correct — e.g. redo it under lots of different conditions, have arguments be reviewed by many different people/AIs, etc.).

    And it does seem like all reasonable civilizations should want to do some thinking like this. For those civilizations, this post is just saying that other sources of instability could be removed (if they so chose, and insofar as that was compatible with the intended thinking process).

    Also, separately, my best guess is that competent civilizations (whatever that means) that were aiming for correctness would probably succeed (at least in areas where correctness is well defined). Maybe by solving metaphilosophy and doing that, maybe because they took lots of precautions like mentioned above, maybe just because it's hard to get permanently stuck at incorrect beliefs if lots of people are dedicated to getting things right, have all the time and resources in the world, and are really open-minded. (If they're not open-minded but feel strongly attached to keeping their current views, then I become more pessimistic.)

    But even if a civilization was willing to take this extreme step, I'm not sure how you'd design a filter that could reliably detect and block all "reasoning" that might exploit some flaw in your reasoning process.

    By being unreasonably conservative. Most AIs could be tasked with narrowly doing their job, a few with pushing forward technology/engineering, none with doing anything that looks suspiciously like ethics/philosophy.  (This seems like a bad idea.)

    Just to be clear: we mostly don’t argue for the desirability or likelihood of lock-in, just its technological feasibility. Am I correctly interpreting your comment to be cautionary, questioning the desirability of lock-in given the apparent difficulty of doing so while maintaining sufficient flexibility to handle unforeseen philosophical arguments?

    To take a step back, I'm not sure it makes sense to talk about "technological feasibility" of lock-in, as opposed to say its expected cost, because suppose the only feasible method of lock-in causes you to lose 99% of the potential value of the universe, that seems like a more important piece of information than "it's technologically feasible".

    (On second thought, maybe I'm being unfair in this criticism, because feasibility of lock-in is already pretty clear to me, at least if one is willing to assume extreme costs, so I'm more interested in the question of "but can it be done at more acceptable costs", but perhaps this isn't true of others.)

    That aside, I guess I'm trying to understand what you're envisioning when you say "An extreme version of this would be to prevent all reasoning that could plausibly lead to value-drift, halting progress in philosophy." What kind of mechanism do you have in mind for doing this? Also, you distinguish between stopping philosophical progress vs stopping technological progress, but since technological progress often requires solving philosophical questions (e.g., related to how to safely use the new technology), do you really see much distinction between the two?

    I absolutely loved reading this post!

    Some minor comments:

    1. You've assumed from the get-go that AIs will follow reinforcement-learning-like paradigms similar to humans', and converge on similar ontologies for looking at the world. You've also assumed these ontologies will be stable — for instance, an RL agent wouldn't become superintelligent, use reasoning, and then decide to self-modify into something that is not an RL agent.

    This is a fair assumption to make but I wonder what the probability is of this being true.

    2. You've assumed laws of physics as we know them today are constraints on things like computation and space colonization and oversight and alignment processes for other AIs.

    It is possible different laws of physics are discovered which make space colonization look very different, and even computation and oversight may be doable very differently. I wonder what the probabilities of these are.

    3. I love that you've brought up how value lock-in directly plays into free will / determinism debates.

    (Essentially that if the universe is deterministic then in principle the future state of the universe after your death or trillions of years after that, is locked in always and can be calculated with enough compute).

    When we say that an institution or some other feature of the world is very stable, we mean that it has a very high probability to persist for a very long time. When evaluating this probability, we don’t want to appeal to fully objective probability, but we also don’t want it to be entirely subjectivist. Instead, we’re appealing to a pseudo-objective probability evaluated from a perspective much more informed than ours, but which still treats effectively-unpredictable processes (such as the weather or the exact DNA of future humans) as random. We could imagine it as the prediction that a hyper-informed observer would give, if they’d seen the civilisational trajectory of millions of civilisations similar to ours. The virtue of this definition is that we can say things like “this institution is 50% likely to be very stable”, if we think there is a 50% subjective probability that the institution has a high pseudo-objective probability to persist for a long time.

    Does this assume a clean separation between two kinds of processes - those that can be predicted and those that can't?

    As of today, predicting if we'll have democracy 50 years from now seems qualitatively easier than predicting the spatial coordinates of a specific DNA molecule 50 years from now. So we can imagine a hypothetical superintelligent observer who is really good at the first task but bad at the second.

    It is possible the future does not contain such a clean separation, and if it does, it might be worth further understanding which processes will remain predictable or unpredictable even to superintelligent observers (of some finite high level of intelligence).

    For instance, physics might be much easier to predict than we currently think, making even DNA molecule stuff easy to simulate and know exact futures of. On the other hand, we might end up discovering that the universe is not in fact deterministic in any number of wacky ways (quantum indeterminism is wacky already), such that no amount of compute or intelligence can decisively predict future states of the universe.

    4. This is very random but I have wondered if Markov chains and escape velocities (similar to "longevity escape velocity") are a good model to characterise the relative stabilities of different world configurations that could either persist or convert into each other.

    I wonder this especially for biological humans, where the path to billion year value stability is much shakier and goes through solving aging, DNA (and all bioprocess) cloning, better surveillance and mindreading, etc.

    With digital minds it does seem as you mention that we may have better alignment procedures and quickly move to radically stable futures.
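    The Markov-chain framing from point 4 is easy to prototype. A toy sketch (all states and transition probabilities here are invented purely for illustration): treat world configurations as states, with a locked-in configuration as an absorbing state, and iterate the transition matrix to see where probability mass ends up:

```python
# Toy absorbing Markov chain over hypothetical "world configurations".
# State 0: fluid society, 1: semi-stable institution, 2: locked-in (absorbing).
P = [
    [0.90, 0.09, 0.01],
    [0.05, 0.90, 0.05],
    [0.00, 0.00, 1.00],  # absorbing: the chain never leaves a locked-in state
]

def step(dist, transition):
    """One step of the chain: new[j] = sum_i dist[i] * transition[i][j]."""
    n = len(dist)
    return [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]

dist = [1.0, 0.0, 0.0]  # start fully in the fluid configuration
for _ in range(1000):
    dist = step(dist, P)
print(dist)  # nearly all mass has drained into the locked-in state
```

    In any chain like this, probability mass drains one-way into the absorbing states, which matches the "escape velocity" intuition: configurations that can never be left eventually dominate.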

    Thanks!

    You've assumed from the get-go that AIs will follow reinforcement-learning-like paradigms similar to humans', and converge on similar ontologies for looking at the world. You've also assumed these ontologies will be stable — for instance, an RL agent wouldn't become superintelligent, use reasoning, and then decide to self-modify into something that is not an RL agent.

    Something like that, though I would phrase it as relying on the claim that it's feasible to build AI systems like that, since the piece is about the feasibility of lock-in. And in that context, the claim seems pretty safe to me. (Largely because we know that humans exist.)

    You've assumed laws of physics as we know them today are constraints on things like computation and space colonization and oversight and alignment processes for other AIs.

    Yup, sounds right.

    Does this assume a clean separation between two kinds of processes - those that can be predicted and those that can't?

    That's a good question. I wouldn't be shocked if something like this was roughly right, even if it's not exactly right. Let's imagine the situation from the post, where we have an intelligent observer with some large amount of compute that gets to see the paths of lots of other civilizations built by evolved species. Now let's imagine a graph where the x-axis has some increasing combination of "compute" and "number of previous examples seen", and the y-axis has something like "ability to predict important events". At first, the y-value would probably go up pretty fast with greater x, as the observer gets a better sense of what the distribution of outcomes is. But on our understanding of chaos theory, its ability to predict e.g. the weather years in advance would be limited even at astoundingly large values of compute+knowledge of what the distribution is like. And since chaotic processes affect important real-world events in various ways (e.g. the genes of new humans seem similarly random as the weather, and that has huge effects), it seems plausible that our imagined graph would asymptote towards some limit of what's predictable.

    And that's not even bringing up fundamental quantum effects, which are fundamentally unpredictable from our perspective. (With a many-worlds interpretation, they might be predictable in the sense that all of them will happen. But that still lets us make interesting claims about "fractions of everett branches", which seems pretty interchangeable with "probabilities of events".)

    In any case, I don't think this impinges much on the main claims in the doc. (Though if I was convinced that the picture above was wildly wrong, I might want to give a bit of extra thought to what's the most convenient definition of lock-in.)

    But on our understanding of chaos theory, its ability to predict e.g. the weather years in advance would be limited even at astoundingly large values of compute + knowledge of the distribution.

    Are there computational-complexity bounds on predicting chaotic processes? I haven't studied chaos theory, so I don't actually know. And if not, why are we so confident that no algorithm exists that sidesteps the need for enormous amounts of compute?

    Especially when considering that these algorithms may be discovered by arbitrarily superintelligent machines running for billions of years. But I would even be interested in reading a defence of weaker claims like "biological human mathematicians cannot solve chaos theory in the next 200 years, assuming incremental progress in math and assuming no AGI or other fancy stuff takes place in the same 200 years."

    Chaos theory is about systems where tiny deviations in initial conditions cause large deviations in what happens in the future. My impression (though I don't know much about the field) is that, assuming some model of a system (e.g. the weather), you can prove things about how far ahead you can predict the system given some uncertainty (normally about the initial conditions, though uncertainty brought about by limited compute that forces approximations should work similarly). Whether the weather corresponds to any particular model isn't really susceptible to proofs, but that question can be tackled by normal science.
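    To make the "how far ahead can you predict, given some uncertainty" point concrete, here's a toy sketch (my own illustration, not from the post) using the logistic map, a standard chaotic system. The specific parameters (r=4, x0=0.3, tolerance 0.1) are arbitrary choices for the demo; the takeaway is that each additional order of magnitude of precision in the initial condition buys only a roughly constant number of extra predictable steps, so even enormous improvements in measurement or compute push the prediction horizon out slowly.

    ```python
    # Two logistic-map trajectories that start eps apart diverge after a
    # number of steps that grows only logarithmically in 1/eps.
    def steps_until_divergence(eps, r=4.0, x0=0.3, tol=0.1, max_steps=10_000):
        """Iterate x -> r*x*(1-x) from x0 and from x0+eps; return the first
        step at which the two trajectories differ by more than tol."""
        x, y = x0, x0 + eps
        for n in range(max_steps):
            if abs(x - y) > tol:
                return n
            x = r * x * (1 - x)
            y = r * y * (1 - y)
        return max_steps

    # Shrinking the initial uncertainty by a factor of 10^9 only adds a
    # few dozen steps to the horizon over which prediction is possible.
    for eps in (1e-3, 1e-6, 1e-9, 1e-12):
        print(f"initial uncertainty {eps:g}: "
              f"predictable for ~{steps_until_divergence(eps)} steps")
    ```

    This is the same qualitative behaviour the weather is believed to have (a Lyapunov-style exponential growth of small errors), which is why the prediction horizon saturates even at huge values of compute.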

    Thanks for this reply!

    This topic also comes up when discussing Ajeya Cotra's biological anchors - how much compute is required to simulate evolution and create AGI in the first place - which is another reason why I was curious about this topic. If re-running evolution requires simulating the weather, and if this is computationally too difficult, then re-running evolution may not be a viable path to AGI. (And of all the biological anchors, the evolutionary one is the only one that matters, imo.) I wonder if it's worth studying this topic further.

    If re-running evolution requires simulating the weather and if this is computationally too difficult then re-running evolution may not be a viable path to AGI.

    There are many things that prevent us from literally rerunning human evolution. The evolution anchor is not a proof that we could do exactly what evolution did, but instead an argument that if something as inefficient as evolution spit out human intelligence with that amount of compute, surely humanity could do it if we had a similar amount of compute. Evolution is very inefficient — it has itself been far less optimized than the creatures it produces.

    (I'd have more specific objections to the idea that chaos-theory-in-weather in particular would be an issue: I think that a weather-distribution approximated with a different random generation procedure would be as likely to produce human intelligence as a weather distribution generated by Earth's precise chaotic behavior. But that's not very relevant, because there would be far bigger differences between Earthly evolution and what-humans-would-do-with-1e40-FLOP than the weather.)

    There are many things that prevent us from literally rerunning human evolution. The evolution anchor is not a proof that we could do exactly what evolution did, but instead an argument that if something as inefficient as evolution spit out human intelligence with that amount of compute, surely humanity could do it if we had a similar amount of compute. Evolution is very inefficient — it has itself been far less optimized than the creatures it produces.

    Yup, I feel like there are different ways to interpret it, and you've picked one interpretation, which is fair!

    Another way of interpreting it that I found was: "what's an argument for AI timelines this century that is straightforward and airtight, and doesn't rely on things like hard-to-convey inside views, lots of deference, or arbitrary ways of setting priors?" Many AI risk people seem to agree that if you're aiming for accuracy you can't rely on the anchor much; at best it's a sort of upper bound. But if you are aiming for airtight arguments that can convince literally anybody, then biological anchors might be more persuasive than other ways of thinking about AI timelines.

    And if you are aiming for airtightness, I wonder if "we can literally re-run evolution, and this is how we will do it at a technical level" can be made more airtight than the broader arguments in your first paragraph. [Broader arguments such as: that we can do different things with the compute and still get AGI; that evolution was in fact a "dumb" unoptimised process and not smart in some unknown way; that we as humans can in fact do better than evolution (at finding AGI) because we're smart; that evolution didn't get astronomically lucky because of some instantiation choices; etc.]

    (I'd have more specific objections to the idea that chaos-theory-in-weather in particular would be an issue: I think that a weather-distribution approximated with a different random generation procedure would be as likely to produce human intelligence as a weather distribution generated by Earth's precise chaotic behavior. But that's not very relevant, because there would be far bigger differences between Earthly evolution and what-humans-would-do-with-1e40-FLOP than the weather.)

    This is fair! Although I do wonder more broadly, not just about the weather but about tasks in general: Is it possible to train/select/evolve RL agents to get to AGI only by training on fast-to-evaluate tasks, or is training on slow-to-evaluate tasks a necessary condition? By fast-to-evaluate I just mean that doing a forward pass of the environment is not significantly slower than doing a forward pass of the agent, so that you can in fact spend most of the training compute on the agent rather than the environment.

    Some of MIRI's work on decision theory does make me wonder whether acting in environments that are more complicated* than you as an agent are is a qualitatively different kind of problem than acting in environments that are simpler than you are.

    *Ways an environment may be "complicated": it could possess more computational complexity than you, contain your perfect clones, contain agents with much higher intelligence than you, contain chaos-theoretic / quantum / physical / chemical stuff necessary for life or intelligent behaviour, be literally uncomputable, etc.

    Thanks for your reply!

    Something like that, though I would phrase it as relying on the claim that it's feasible to build AI systems like that, since the piece is about the feasibility of lock-in. And in that context, the claim seems pretty safe to me. (Largely because we know that humans exist.)

    The main way I find this claim easier* to buy is if the system literally consists of WBEs. There is a significant alignment tax to building a system using WBEs, so even if it is possible to build one in theory, it may not be possible in practice. For instance, we might get WBEs only in a hypothetical 2080 but get superintelligent LLMs in 2040, and the people using superintelligent LLMs might make the world unrecognisably different by 2042 itself.

     

    *Even then I have doubts I wanna bring up, but I am more convinced by the report in that case.

    For instance, we might get WBEs only in a hypothetical 2080 but get superintelligent LLMs in 2040, and the people using superintelligent LLMs might make the world unrecognisably different by 2042 itself.

    I definitely don't just want to talk about what happens / what's feasible before the world becomes unrecognisably different. It seems pretty likely to me that lock-in will only become feasible after the world has become extremely strange. (Though this depends a bit on details of how to define "feasible", and what we count as the start-date of lock-in.)

    And I think that advanced civilizations that tried could eventually become very knowledgeable about how to create AI with a wide variety of properties, which is why I feel ok with the assumption that AIs could be made similar to humans in some ways without being WBEs.

    (In particular, the arguments in this document are not novel suggestions for how to succeed with alignment in a realistic scenario with limited time! That still seems like a hard problem! C.f. my response to Michael Plant.)

    Thanks this makes a lot of sense!

    If your report is conditional on a dominant institution existing at the time WBEs are invented, then this claim makes sense to make! If it is asserting that there is a non-trivial probability that a dominant institution will in fact exist at this point and that WBEs will in fact be invented at some point, then I wonder if that might need to be separately defended.

    Random (not exhaustive list of) reasons a dominant institution may not exist:

    • Humans of a country appoint non-RL programs as their worthy successors and give them all power. Due to multipolar race dynamics, the humans of some country correctly decide this is better than waiting for WBEs or world dominance. These successors do not have a convergent instrumental subgoal of retaining billion-year stability. Other countries respond by similarly appointing their own successors to avoid fading into irrelevance.
    • Offence-defence balances shift in favour of defence, via extinction weapons and highly reliably enforceable threats, hence multiple countries are stable for arbitrarily long without any one dominating. (Dominating may require research into lots of new technologies, which causes value drift, which may be undesirable to a state.)

    Yeah, I agree that multipolar dynamics could prevent lock-in from happening in practice.

    I do think that "there is a non-trivial probability that a dominant institution will in fact exist", and also that there's a non-trivial probability that a multipolar scenario will either

    • (i) end via all relevant actors agreeing to set up some stable compromise institution(s), or
    • (ii) itself end up being stable via each actor making themselves stable and their future interactions being very predictable. (E.g. because of an offence-defence balance strongly favoring defence.)

    ...but arguing for that isn't really a focus of the doc.

    (And also, a large part of why I believe they might happen is that they sound plausible enough, and I haven't heard great arguments for why we should be confident in some particular alternative. Which is a bit hard to forcefully argue for.)

    P.S. A (slightly) concrete scenario where we don't get longterm stability that I wonder about: we get AGI through a DL, not RL, formalism; this AGI doesn't have a convergent instrumental subgoal of billion-year stability, and we don't know how to give it this goal; and we face strong reasons to deploy it. All this happens before we know how to build AGIs that can value longterm stability.

    Thanks for the reply!

    (And also, a large part of why I believe they might happen is that they sound plausible enough, and I haven't heard great arguments for why we should be confident in some particular alternative. Which is a bit hard to forcefully argue for.)

    Yup this is fair! You and I might have to assign some probability to both lock-in and no-lock-in scenarios as of today (2022). [0]

    But it does seem useful to disambiguate between the following two things as to why we're assigning that probability.

    1. There is some probability that the actions you and I or other humans take will make a large difference to which longterm future humanity ends up in. We don't know what these actions are (we're assuming we can't just deterministically predict other people's behaviour), and therefore we're uncertain which future we end up in.
    2. There is nothing you and I or other humans can do that would make a large difference to the longterm future. We are already headed for certain specific futures; we're just not smart enough or knowledgeable enough to predict in advance which futures we will end up in. Predicting this doesn't require deterministically predicting other people's behaviour; maybe there are simple technical arguments + institutional incentives that suffice to prove which future we end up in. However, we do not know these arguments yet. The probability (of lock-in and no-lock-in in [0]) comes out of this uncertainty.

    [0] = [1] + [2]

    It doesn't seem crazy to me to have some probability on 2 being true, even though we don't actually have a clear argument for why we're headed for specific futures or which ones they are.

    (Yudkowsky for instance is basically predicting we face > 50% x-risk no matter what we do or do not do, maybe he's right and we're too dumb to realise it. I have greater than 1% probability on "him being right and us being too dumb to realise it", even though I don't really understand why he's so pessimistic and am just deferring.)

    Thanks, great post!

    You say that "using digital error correction, it would be extremely unlikely that errors would be introduced even across millions or billions of years. (See section 4.2.)" But that's not entirely obvious to me from section 4.2. I understand that error correction is qualitatively very efficient, as you say, in that the probability of an error being introduced per unit time can be made as low as you like at the cost of only making the string of bits a certain small-seeming multiple longer (and my understanding is that the multiple shrinks the longer the original string is?). But for any multiple, there's some period of time long enough that the probability of faithfully maintaining some string of bits for that long is low. Is there any chance you could offer an estimate of, say, how much longer you'd have to make a petabyte in order to get the probability of an error over a billion years below 1%?
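    One way to get a feel for the question is a back-of-the-envelope sketch (my own, not from the post) using the crudest possible scheme: store k copies of every bit and periodically take a majority vote. The numbers here are assumptions chosen purely for illustration (each copy flips with probability 1e-6 between sweeps, one sweep per year, for 1e9 years, over 8e15 bits), and the union bound is conservative; real error-correcting codes need far less overhead than repetition.

    ```python
    import math

    def majority_fail_prob(k, p):
        """Probability that a majority of k independent copies flip
        (so the majority vote restores the wrong bit), given per-copy
        flip probability p between correction sweeps."""
        threshold = k // 2 + 1
        return sum(math.comb(k, i) * p**i * (1 - p)**(k - i)
                   for i in range(threshold, k + 1))

    # Illustrative assumptions (made up): per-copy flip prob per sweep,
    # one sweep per year for a billion years, a petabyte of data.
    p_flip   = 1e-6
    n_sweeps = 10**9
    n_bits   = 8 * 10**15

    for k in (3, 5, 7, 9, 11):
        # Union bound over every bit and every sweep (conservative).
        total = majority_fail_prob(k, p_flip) * n_sweeps * n_bits
        print(f"{k}x redundancy: P(any error over 1e9 years) <= {total:.3g}")
    ```

    Under these assumed numbers, a single-digit redundancy factor already drives the billion-year failure probability below 1%, because each extra pair of copies multiplies the per-vote failure probability by roughly another factor of p. That's the qualitative sense in which the required multiple is "small-seeming", though the actual figure depends entirely on the assumed per-bit error rate and sweep frequency.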

    Just skimmed this, but I notice there seems to be something inconsistent between this and the usual AI doomerism. For instance, above you claim that we should be worried about value lock-in because we will be able to align AI - cf. doomerism, which says alignment won't work; equally, above you state that value drift could be prevented by 'turning the AGI off and on again' - which is, again, at odds with the doomerist claim that we can't do this. I'm unsure what to make of this tension.

    Quoting from the post:

    Thus, we suspect that an adequate solution to AI alignment could be achieved given sufficient time and effort. (Though whether that will actually happen is a different question, not addressed since our focus is on feasibility rather than likelihood.)

    AI doomers tend to agree with this claim. See e.g. Eliezer in his list of lethalities:

    None of this is about anything being impossible in principle.  The metaphor I usually use is that if a textbook from one hundred years in the future fell into our hands, containing all of the simple ideas that actually work robustly in practice, we could probably build an aligned superintelligence in six months.  (...) What's lethal is that we do not have the Textbook From The Future telling us all the simple solutions that actually in real life just work and are robust; we're going to be doing everything with metaphorical sigmoids on the first critical try.  No difficulty discussed here about AGI alignment is claimed by me to be impossible - to merely human science and engineering, let alone in principle - if we had 100 years to solve it using unlimited retries, the way that science usually has an unbounded time budget and unlimited retries.  This list of lethalities is about things we are not on course to solve in practice in time on the first critical try; none of it is meant to make a much stronger claim about things that are impossible in principle.

    Stipulate, for the sake of the argument, that Lukas et al. actually disagree with the doomers about various points. What would follow from that?