Content warning: this post is mostly vibes. It was largely inspired by Otherness & Control in the age of AGI. Thanks to Toby Jolly and Noah Siegel for feedback on drafts.

[Image: Old Plum, Kano Sansetsu, 1646]

The spiritual goal of AI safety is intent alignment: building systems that have the same goals as their human operators. Most safety-branded work these days is not directly trying to solve this,[1] but a commonly stated motivation is to ensure that the first AGI is intent-aligned. Below I’ll use “aligned” to mean intent-aligned.

I think alignment isn’t the best spiritual goal for AI safety, because while it arguably solves some potential problems with AGI (e.g. preventing paperclip doom), it could exacerbate others (e.g. turbocharging power concentration).

I’d like to suggest a better spiritual goal: restrained AGI. I say an agent is restrained if it does not pursue scope-sensitive goals[2] (terminal or instrumental). Of course, building restrained systems may involve hard technical problems,[3] but it’s not clear to me whether these are more or less difficult than alignment. And you may shout “restrained AIs won’t be competitive!”. But is restraint a bigger hindrance than the alignment tax?[4]

Let me flesh out my claim a bit more. Here are some bad things that can happen because of AGI:

While alignment might prevent some of these, it could exacerbate others. In contrast, I claim that restraint reduces the probability & severity of all of them.

My central argument: I’m scared of a single set of values dominating the lightcone,[5] regardless of what those values are. Since value is fragile, those values probably won’t capture even part of what you want unless they match yours very closely. And given the complexity of the future, it is unlikely that your values will be the ones to fill the lightcone. Even if you get lucky, how confident are you that your values deserve to dominate the future? Restraint aims to prevent such domination.

Two worlds

To illustrate why I’m more excited about restraint than alignment, let’s imagine two cartoonishly simple worlds. In the “aligned world”, we have solved intent alignment but we allow AGI to have arbitrarily scope-sensitive goals. In the “restrained world”, we didn’t solve intent alignment but no scope-sensitive AGI is allowed to exist. I’ll discuss some possible problems in the two worlds, and you can make up your own mind which world you would prefer.

I’m going to make a whole bunch of simplifications;[6] I hope you can extrapolate these stories to a more realistic future of your choice. I’m also going to use the traditional Yudkowsky/Bostrom-esque framing of AI risk.[7] I don’t claim that this frame describes what will actually happen, but it makes my claim easier to present, and I think it’s fair to use given that alignment has largely been motivated within it.

The aligned world

One set of values will dominate the lightcone

Because of instrumental convergence, a singleton will likely form. A single set of values dominates the lightcone. Systems whose goals are more scope-sensitive are more likely to take over, so the singleton will most likely have scope-sensitive goals.

You won’t like those values unless they overlap heavily with yours. Since the singleton is scope-sensitive, it will optimise hard. That hard optimisation will cause its goals to decorrelate from other goals that naively seem to overlap with them. Your values might be fragile, so if the singleton is misaligned with even a tiny component of your values, the future will be garbage to you. You’ll either win the lightcone or you’ll lose it.

They won’t be your values

I think it’s unlikely that the dominant values will overlap with yours enough to avoid decorrelating from them. Let’s brainstorm some vignettes of how things go:

Vignette 1: Nerd Values

The first AGI (which forms a singleton) is given goals that are hastily decided by a small team of engineers and executives at DeepMindSeek. Perhaps they tune the AGI’s goals for their personal gain, or they try to implement their favourite moral theory, or their best guess at the aggregate of human values.

How do you feel about this one? What are the chances that your preferences get included in the engineers’ considerations? Perhaps their goals largely overlap with yours, but will they overlap enough?

Joe Carlsmith points out that the story of “tails coming apart” may apply just as much to the comparison between humans and AIs as to the comparison between humans and humans. If the nerd values are optimised hard enough, they may decorrelate from yours. DeepMindSeek may be “paperclippers with respect to you”.

Vignette 2: Narrow Human Values

AGI empowers a person or group to grab the world and organise it to meet their own selfish needs.

I assume you wouldn’t be happy with this.

Vignette 3: Inclusive Human Values

A coordinated international effort results in some massive democratic process (for example) that builds an aggregated set of values shared by all humans. This aggregate is voluntarily loaded into the first AGI, which forms a singleton.

First of all, this seems unlikely to me. Secondly, are human values actually a thing? A static thing that you can align something to? Seems unclear to me. Can we really find a single set of norms that all humans are happy with?

Vignette 4: Ruthless Values

For-profit companies build a number of AGI systems with goals aligned to the will of their customers. The systems that gain the most influence will be those with the broadest-scope goals, ruthless enough not to let moral, legal or societal side constraints get in their way. In the limit, an amoral set of goals captures the lightcone.

Perhaps you get extremely lucky and have control of the ruthless AGI that captures the lightcone. Yay for you, you have won the singularity lottery.[8]

The singularity lottery is a brittle gamble: an infinite payoff but a minuscule chance of winning. You will almost certainly lose, whether to other humans or to unaligned AI. In my view, the more robust strategy is not to buy a lottery ticket at all: avoid any single agent grabbing the lightcone.

But let’s say you’re a risk-taker. Let’s say you’re willing to buy a ticket for the possibility of tiling the future with your preferences. Now try to imagine that actually happening, and introspect on it. Is this actually what you want?

Do you even want your values to fill the lightcone?

Hanson and Carlsmith have both expressed discomfort with being overly controlling of the future. One way Hanson cashes out this discomfort is:

Aren’t we glad that the ancient Greeks didn’t try to divert the future to replace us with people more like them?

How much steering of the long-term future do we have a right to do? This question is a whole can of worms, and I feel pretty confused about how we should balance our yang (steering the future to prevent horrible outcomes by our lights) with our yin (letting things play out to allow future beings and civilisations to choose their own path).[9]

I think it’s reasonable to say that an extreme amount of moral caution would be needed if you were to decide today how to spend the cosmic endowment. I think this is still true if you planned to give the AGI more abstract values, like second-order values that keep it open to some degree of moral evolution[10] or allow it to work out what is right by itself.[11] How flexible should these values be? What, if anything, should we lock in? Are we anywhere near ready to make these kinds of decisions?

How confident are you that the decisions you make would have results that are both morally good by your own lights, and morally good from some more impartial, moral realism-esque vantage point? I’m confused enough about all this that there’s no way you’d catch me vying to tile the universe with my favourite computation. I don’t think anyone, today, should be confident enough to do that.

Enshrining your values could lead to the opposite of your values

There is a concerning pattern that shows up when thinking about agency: the existence of an agent with a particular goal x increases the probability that -x gets fulfilled.

The first (silly) example is sign flips: a cosmic ray flips the bit storing the sign of an AI’s utility function, and the AI goes on to maximise the opposite of its intended utility function. A more realistic example is the Waluigi effect: LLMs have a tendency to behave in the opposite way to how they were intended, via a number of related mechanisms.
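
To make the sign-flip case concrete, here is a minimal toy sketch (my own illustration, not from the post or any particular codebase; all names are hypothetical): an agent that picks the action maximising its stored utility will pick the worst action by the intended utility if the stored sign is corrupted.

```python
def intended_utility(action: int) -> float:
    """Toy utility: the operator wants the largest action, capped at 10."""
    return min(action, 10)

def choose_action(utility, actions):
    """Pick the action that maximises whatever utility function is stored."""
    return max(actions, key=utility)

actions = range(-10, 11)

# Healthy agent: maximises the intended utility.
healthy_choice = choose_action(intended_utility, actions)

# One corrupted sign turns a maximiser of U into a maximiser of -U,
# which selects the worst action by the intended utility.
corrupted_choice = choose_action(lambda a: -intended_utility(a), actions)

print(healthy_choice)    # 10  -> best action by the intended utility
print(corrupted_choice)  # -10 -> worst action by the intended utility
```

The point of the toy example is just that the gap between “exactly what you want” and “exactly the opposite” can be a single corrupted sign, not a large change to the system.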

If you win the singularity lottery, you’ll have a super powerful agent with your desires enshrined into it. Probably you’ll get what you want, but you might just get the opposite of what you want. This makes buying a lottery ticket even more risky. You’re not just risking death to get heaven, you’re risking hell to get heaven.

The restrained world

In this world, we have not solved alignment (or maybe it’s solved to some degree), but a concerted effort has successfully prevented the deployment of AGI systems with scope-sensitive terminal and instrumental goals.

In this world, takeover, singletons and lock-in are less likely. AGI systems have less incentive to seek a decisive strategic advantage, because taking over would not help them much in achieving their goals.

A multipolar world is more likely, constituting a more pluralistic future. Values are less likely to lock in; they can keep evolving in the same way that values have always evolved. We don’t need to solve any difficult philosophical problems in deciding the right way for the future to go. As Hanson argues, moral progress so far has been driven less by human reasoning and more by a complex process of cultural evolution:

Me, I see reason as having had only a modest influence on past changes in human values and styles, and expect that situation to continue into the future. Even without future AI, I expect change to accelerate, and humans values and styles to eventually change greatly, without being much limited by some unmodifiable “human core”. And I see law and competition, not human partiality, as the main forces keeping peace among humans and super-intelligent orgs.

Technological advancement will probably move slower than in the aligned world because AI systems optimize less sweepingly.[12] But, as many in the AI safety movement (and especially PauseAI) would agree, a slow AI revolution seems safer than a fast one.

Could restrained AI defend against unrestrained AI?

A world with only restrained AGI seems better to me than a world with only aligned AGI. But what if only some AGI systems are restrained? Would a restrained world be vulnerable to a careless actor creating an unrestrained agent that grabs a decisive strategic advantage?

The aligned world has a similar problem (or rather, getting to the aligned world has a similar problem). Even if you made sure the first AGI is aligned, would it be able to stop a new unaligned AGI from taking over? A common response is something like: “If the first n AGIs are aligned, they will be collectively powerful enough to prevent any new unaligned AGI from grabbing a decisive strategic advantage”.

We could use a similar argument for the restrained world: “If the first m AGIs are restrained, and disvalue being dominated by an unrestrained AGI enough, they could prevent an unrestrained AGI from grabbing a decisive strategic advantage”.[13]

Conclusion

These have been two cartoonishly simple worlds, but I think they get across a more abstract point.

Yudkowsky views AI risk as a winner-takes-all fight between values: something is going to rule the whole universe, so you had better make sure it’s your values that get loaded into it. My view is that we should not play this game, because it is almost impossible to win (given this framing, Yudkowsky seems to agree!). To opt out of the game, we need a different framing of AI risk. I want a future with values not decided by any single agent, but instead collectively discovered.

  1. ^

     For example, mechanistic interpretability or AI governance.

  2. ^

     I use the phrase “scope-sensitive” as a generalisation of scope-sensitive ethics, covering whatever goals or motivations an agent can have. Examples of scope-sensitive goals include: maximising paperclips, maximising profit, achieving moral good according to classical utilitarianism, or spreading your values across the stars. Examples of scope-insensitive goals include: finding a cancer cure, being virtuous, preventing wars, solving the strawberry problem.

  3. ^

     The main problem is to avoid instrumental convergence. Work so far in this direction includes satisficers, soft optimisation, corrigibility, and low-impact agents. Techniques like AI control could also help towards restrained AGI, since they could prevent the AI from pursuing scope-sensitive goals.

  4. ^

     Not to mention that most humans don’t have scope-sensitive goals either (at least not sensitive to the “taking over the world” scope), so you’ll still have plenty of people to sell your AI system to.

  5. ^

     I’m using lightcone as shorthand for “the foreseeable future”. I don’t think any actions 21st century humans make will have an effect on the entire lightcone.

  6. ^

     The two worlds are two extremes; reality will be some complicated combination of the two. I’ll assume AI agents have well-defined and static goals, and I’ll conflate goals and values. I’m also going to take a number of core rationalist beliefs for granted, since the argument for AI alignment is largely contingent on them.

  7. ^

     Singularity, singletons, instrumental convergence, decisive strategic advantage, etc.

  8. ^

     A lottery isn’t a perfect analogy, since if you don’t win, you die.

  9. ^

     See Otherness & Control in the age of AGI for an exploration of this.

  10. ^

     e.g. the right level of liberalism.

  11. ^
  12. ^

     Both because intent alignment is less solved and because AI systems are limited in the scope of their goals.

  13. ^

     Probably m > n, making this reassuring story weaker than its analogue in the aligned world. How much bigger m is than n hinges on how take-overable the world is.

Comments



Very interesting!

I'd be interested to hear a bit more about what a restrained system would be able to do. 

For example, could I make two restrained AGIs, one which has the goal:

A) "create a detailed plan plan.txt for maximising profit"

And another which has the goal:

B) "execute the plan written in plan.txt"?

If not, I'm not clear on why "make a cure for cancer" is scope-insensitive but "write a detailed plan for [maximising goal]" is scope-sensitive

Some more test case goals to probe the definition:

C) "make a maximal success rate cure for cancer"

D) "write a detailed plan for generating exactly $10^100 USD profit for my company"

Executive summary: The author argues that prioritizing “intent alignment” risks allowing one set of values to dominate the future, and advocates instead for “restraint” to prevent scope-sensitive AI systems from locking in any single vision of humanity’s destiny.

Key points:

  1. Alignment can mitigate some AI risks (like paperclip doom) but may intensify problems like power concentration and value lock-in.
  2. Restraint is proposed as a system design principle that avoids granting AI agents expansive, scope-sensitive goals.
  3. A central fear is that a single set of values—especially if narrowly chosen—might dominate the lightcone and stifle moral evolution.
  4. The author questions the moral legitimacy of today’s humans choosing how to shape the cosmic future, given uncertainties in our value systems.
  5. Slowing AI progress and preventing “scope-sensitive” AGI deployments is viewed as safer than racing to perfect alignment solutions.
  6. The recommendation is a collective, concerted effort to keep AGI’s goals deliberately narrow, ensuring a more pluralistic and open-ended long-term future.

 

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.
