Sometimes, I say some variant of “yeah, probably some people will need to do a pivotal act” and people raise the objection: “Should a small subset of humanity really get so much control over the fate of the future?”

(Sometimes, I hear the same objection to the idea of trying to build aligned AGI at all.)

I’d first like to say that, yes, it would be great if society had the ball on this. In an ideal world, there would be some healthy and competent worldwide collaboration steering the transition to AGI.[1]

Since we don’t have that, it falls to whoever happens to find themselves at ground zero to prevent an existential catastrophe.

A second thing I want to say is that design-by-committee… would not exactly go well in practice, judging by how well committee-driven institutions function today.

Third, though, I agree that it’s morally imperative that a small subset of humanity not directly decide how the future goes. So if we are in the situation where a small subset of humanity will be forced at some future date to flip the gameboard — as I believe we are, if we’re to survive the AGI transition — then AGI developers need to think about how to do that without unduly determining the shape of the future. 

The goal should be to cause the future to be great on its own terms, without locking in the particular moral opinions of humanity today — and without locking in the moral opinions of any subset of humans, whether that’s a corporation, a government, or a nation.

(If you can't see why a single modern society locking in their current values would be a tragedy of enormous proportions, imagine an ancient civilization such as the Romans locking in their specific morals 2000 years ago. Moral progress is real, and important.)

But the way to cause the future to be great “on its own terms” isn’t to do nothing and let the world get destroyed. It’s to intentionally not leave your fingerprints on the future, while acting to protect it.

You have to stabilize the landscape / make it so that we’re not all about to destroy ourselves with AGI tech; and then you have to somehow pass the question of how to shape the universe back to some healthy process that allows for moral growth and civilizational maturation and so on, without locking in any of humanity’s current screw-ups for all eternity.

Unfortunately, the current frontier for alignment research is “can we figure out how to point AGI at anything?”. By far the most likely outcome is that we screw up alignment and destroy ourselves.

If we do solve alignment and survive this great transition, then I feel pretty good about our prospects for figuring out a good process to hand the future to. Some reasons for that:

  • Human science has a good track record for solving difficult-seeming problems; and if there’s no risk of anyone destroying the world with AGI tomorrow, humanity can take its time and do as much science, analysis, and weighing of options as needed before it commits to anything.
  • Alignment researchers have already spent a lot of time thinking about how to pass that buck and make sure that the future goes great and doesn’t have our fingerprints on it. Even this small group of people has made real progress, and the problem doesn't seem that tricky. (Because there are so many good ways to approach this carefully and indirectly.)
  • Solving alignment well enough to end the acute risk period without killing everyone implies that you’ve cleared a very high competence bar, as well as a sanity bar that not many clear today. Willingness and ability to defuse moral hazard is correlated with willingness and ability to save the world.
  • Most people would do worse on their own merits if they locked in their current morals, and would prefer to leave space for moral growth and civilizational maturation. The property of realizing that you want to (or would on reflection want to) defuse the moral hazard is also correlated with willingness and ability to save the world.
  • Furthermore, the fact that (as far as I know) all the serious alignment researchers are actively trying to figure out how to avoid leaving their fingerprints on the future seems like a good sign to me. You could find a way to be cynical about these observations, but these are not the observations that the cynical hypothesis would predict ab initio.

This is a set of researchers that generally takes egalitarianism, non-nationalism, concern for future minds, non-carbon-chauvinism, and moral humility for granted, as obvious points of background agreement; the debates are held at a higher level than that.

This is a set of researchers that regularly talks about how, if you’re doing your job correctly, then it shouldn’t matter who does the job, because there should be a path-independent attractor-well that isn't about making one person dictator-for-life or tiling a particular flag across the universe forever.

I’m deliberately not talking about slightly-more-contentful plans like coherent extrapolated volition here, because in my experience a decent number of people have a hard time parsing the indirect buck-passing plans as something more interesting than just another competing political opinion about how the future should go. (“It was already blues vs. reds vs. oranges, and now you’re adding a fourth faction which I suppose is some weird technologist green.”)

I’d say: Imagine that some small group of people were given the power (and thus responsibility) to steer the future in some big way. And ask what they should do with it. Ask how they possibly could wield that power in a way that wouldn’t be deeply tragic, and that would realistically work (in the way that “immediately lock in every aspect of the future via a binding humanity-wide popular vote” would not).

I expect that the best attempts to carry out this exercise will involve re-inventing some ideas that Bostrom and Yudkowsky invented decades ago. Regardless, though, I think the future will go better if a lot more conversations occur in which people take a serious stab at answering that question.

The situation humanity finds itself in (on my model) poses an enormous moral hazard.

But I don’t conclude from this “nobody should do anything”, because then the world ends ignominiously. And I don’t conclude from this “so let’s optimize the future to be exactly what Nate personally wants”, because I’m not a supervillain.[2]

The existence of the moral hazard doesn’t have to mean that you throw up your hands, or imagine your way into a world where the hazard doesn’t exist. You can instead try to come up with a plan that directly addresses the moral hazard — try to solve the indirect and abstract problem of “defuse the moral hazard by passing the buck to the right decision process / meta-decision-process”, rather than trying to directly determine what the long-term future ought to look like.

Rather than just giving up in the face of difficulty, researchers have the ability to see the moral hazard with their own eyes and ensure that civilization gets to mature anyway, despite the unfortunate fact that humanity, in its youth, had to steer past a hazard like this at all.

Crippling our progress in its infancy is a completely unforced error. Some of the implementation details may be tricky, but much of the problem can be solved simply by choosing not to rush a solution once the acute existential risk period is over, and by choosing to end the acute existential risk period (and its associated time pressure) before making any lasting decisions about the future.[3]

(Context: I wrote this with significant editing help from Rob Bensinger. It’s an argument I’ve found myself making a lot in recent conversations.)

  1. ^

    Note that I endorse work on more realistic efforts to improve coordination and make the world’s response to AGI more sane. “Have all potentially-AGI-relevant work occur under a unified global project” isn’t attainable, but more modest coordination efforts may well succeed.

  2. ^

    And I’m not stupid enough to lock in present-day values at the expense of moral progress, or stupid enough to toss coordination out the window in the middle of a catastrophic emergency with human existence at stake, etc.

    My personal CEV cares about fairness, human potential, moral progress, and humanity’s ability to choose its own future, rather than having a future imposed on them by a dictator. I'd guess that the difference between "we run CEV on Nate personally" and "we run CEV on humanity writ large" is nothing (e.g., because Nate-CEV decides to run humanity's CEV), and if it's not nothing then it's probably minor.

  3. ^

    See also Toby Ord’s The Precipice, and its discussion of “the long reflection”. (Though, to be clear, a short reflection is better than a long reflection, if a short reflection suffices. The point is not to delay for its own sake, and the amount of sidereal time required may be quite short if a lot of the cognitive work is being done by uploaded humans and/or aligned AI systems.)



Strong endorse. I have, on the occasion of two lightning talks (Bahamas and EAGxSingapore) and a shortform post, claimed that lock-in risk obliges us to reject positive longtermism (fighting-for) and constrict ourselves to negative longtermism (fighting-against). However, I point out near the end of each lightning talk that suffering-focused views present me with an extremely difficult challenge: that preserving the future's freedom preserves the possibility of torture, that suffering abolition is a form of positive longtermism. I really struggle with the tradeoff between libertarianism/cosmopolitanism (or any other framework behind emphasizing lock-in risk) and suffering-focused views; by far, the button-offering demon that presents me with the most dread is the one offering a button that would end suffering but lock in my particular opinions and aesthetics about what flourishing is. 

No top-level post about positive and negative longtermism yet, though. 

Just a quick note to say that I think planning on a pivotal act is risky and dangerous, and we just don't yet know how feasible or infeasible "some healthy and competent worldwide collaboration steering the transition" is - more research is needed.

As I say in The Rival AI Deployment Problem: a Pre-deployment Agreement as the least-bad response,

"while it may seem unlikely at this stage, a predeployment agreement might be the least-bad option – and at least worthy of more study and reflection. In particular, more research should be done into possible clauses in a pre-deployment agreement, and into possibilities for AI development monitoring and verification."

My reply to Critch is here, and Eliezer's is here and here.

Eventually, as compute becomes more available and AGI techniques become more efficient, we should expect that individual consumers will be able to train an AGI that destroys the world using the amount of compute on a mass-marketed personal computer. (If the world wasn't already destroyed before that.)

What's the likeliest way you expect this outcome to be prevented, or (if you don't think it ought to be prevented, or don't think it's preventable) the likeliest way you expect things to go well if this outcome isn't prevented?

No government is (AFAIK) a major player in cutting-edge AI research today; so I think the default outcome is that this continues into the future.

Capabilities-wise, I think the default outcome is that human STEM work ends up looking similar to AlphaGo and AlphaGo Master. Less than a year passed between "AI systems aren't smart enough to beat any human professional in a single standard Go game" and "humans aren't smart enough to beat any SotA Go AI in a single standard Go game". I'd expect there to be a similarly short window of time between "the first time an AI system can match top human scientists in doing open-ended reasoning about the messy physical world", and "the last time humans can match SotA AI systems in STEM work".

Maybe we'll end up having five years, rather than one year; but this seems like a perilous hope to pin the future on, and it seems to me like most governance hopes require something more like "we have fifty years (before anyone is able to destroy the world with AI) to play around with human-level AGIs, see lots of human-level warning-shot catastrophes, give governments time to react and notice the flaws in their reaction and then adjust their reaction, build a research consensus about the hazards of AGI and about the best plan, build a policy consensus that's in line with the research consensus, etc.", not "we have five years".

I think it's very possible for humanity to avoid destruction in the more realistic five-year and one-year scenarios, but not via a slow multi-decade consensus-building procedure to develop new international institutions, alliances, and norms surrounding AGI (all while AGI tech continues to be available, continues to become more efficient, and continues to proliferate).

(That said, I appreciate your sharing your disagreement! One of the best things that can come out of Nate's posting, IMO, is people flagging assumptions that they disagree with, so we can talk about them.)

Hi Rob, thanks for responding.

I agree that eventually, individuals may be able to train (or, more importantly, run exfiltrated copies of) advanced AI that is very dangerous. I expect that before that, it will be within the reach of richer, bigger groups. Today, it requires more compute/better techniques than we have available. At some point in the coming years/decades it will be within the reach of major states' budgets, then smaller states and large companies, and then smaller and smaller groups until it's within the reach of individuals. That's the same process that many, many other technologies have followed. If that's right, what does that suggest we need? Agreement between the major states, then non-proliferation agreements, then regulation and surveillance banning development by corporations and individuals.

On governments not being major players in cutting-edge AI research today: this is certainly true. I think cyber might be a relevant analogy here. Much of the development and deployment of cyberattacks has been by the private sector (companies and contractors in the US, often criminals for some autocracies). Nevertheless, the biggest cyberattacks (Stuxnet, NotPetya, etc.) are directed by the governments of major states - i.e. the P5 of US, Russia, UK, France and China. It's possible that something similar happens for AI.

In terms of how long international agreements take, I think 50 years is a bit pessimistic. I would take arms control agreements as possible comparisons. Take 1972's nuclear and biological weapons agreements. The ideas behind deterrence were largely developed around 1960 (Schelling 1985; Adler 1992), and then made into an international agreement in 1972. It might even have happened sooner, under LBJ, had the USSR not invaded Czechoslovakia on 20th August 1968, a day before SALT was supposed to start. On biological weapons, the UK proposed the BWC in August 1968, and it was signed in 1972 as well. New START took about 2 years. So in general, bilateral arms-control-style agreements with monitoring and verification can be agreed in less than 5 years.

To take the nuclear 1960s analogy, we could loosely think of ourselves as being in early 1962: we've come up with the concerns if not the specific agreements, and some decision-makers and politicians are on board. We haven't yet had a major AI warning shot like the Cuban Missile Crisis (which began 60 years ago yesterday!), we haven't yet had confidence-building measures like the 1963 Hotline Agreement, and we haven't yet proposed or begun the equivalent of SALT. All that might be to come in the next few years/decades.

This won't be an easy project by any means, but I don't think we can yet say it's completely infeasible - more research, and the attempt itself, is needed.