(Epistemic status: I think this is right?)

Alice is the CEO of ArmaggAI, and Bob is the CEO of BigModelsAI, two major AI capabilities organizations. They're racing to be the first to build a superintelligence aligned to their respective CEV which would take over the universe and satisfy their values.

Alice would prefer to slow down so she has a lot more time to make sure that what her company is building will be aligned. But she's worried about Bob's company taking advantage of that and racing ahead, resulting in Bob's utility function being {what the lightcone is filled with} instead of Alice's. So she feels like she has no choice but to race, to maximize her own utility. The same goes for Bob with respect to Alice.

This state of affairs is far from the Pareto frontier of {their utility functions, each weighed by how likely they are to be the one to build an aligned superintelligence}: it causes a lot more worlds where everyone is dead, instead of either Alice's or Bob's utility function being maximized.
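To make the Pareto claim concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: the probabilities are made up, the names `Scenario`, `expected_utilities` and `pareto_dominates` are hypothetical, and each CEO is assumed to get utility 1 in worlds where their own aligned superintelligence wins and 0 otherwise.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Fractions of worlds by outcome (hypothetical, illustrative numbers)."""
    p_alice_aligned: float  # worlds where Alice launches an aligned superintelligence
    p_bob_aligned: float    # worlds where Bob launches an aligned superintelligence

    @property
    def p_dead(self) -> float:
        # Every remaining world is one where an unaligned AI kills everyone.
        return 1.0 - self.p_alice_aligned - self.p_bob_aligned


def expected_utilities(s: Scenario) -> tuple[float, float]:
    """(Alice's EU, Bob's EU), assuming utility 1 for one's own win and 0 otherwise."""
    return (s.p_alice_aligned, s.p_bob_aligned)


def pareto_dominates(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if nobody is worse off under `a` than under `b`, and someone is better off."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


# Assumed numbers: racing leaves less time for alignment, so more worlds are dead.
RACING = Scenario(p_alice_aligned=0.10, p_bob_aligned=0.10)
SLOWING = Scenario(p_alice_aligned=0.30, p_bob_aligned=0.30)

if __name__ == "__main__":
    print("dead worlds when racing: ", RACING.p_dead)
    print("dead worlds when slowing:", SLOWING.p_dead)
    print("slowing Pareto-dominates racing:",
          pareto_dominates(expected_utilities(SLOWING), expected_utilities(RACING)))
```

With these assumed numbers, both Alice and Bob do better when both slow down, simply because fewer worlds end up dead. The sketch says nothing yet about how to make mutual slowing stable, which is what the procedure in this post is for.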

But they don't really get all this LDT business; it seems complicated to them. If only they had a convenient lesswrong post spelling out exactly what they should do, in clear easy-to-follow terms!

The procedure they should implement is as follows. Alice should:

  • In worlds where she does launch a superintelligence aligned to herself, and it takes over the world and builds an Alice::utopia, she should tell her AI:

    Actually, don't just maximize my own utility function. Maximize a (threat-resistant) mixture of my utility function and Bob's, weighed by what ratio of timelines you think we each launched an aligned superintelligence in, but also weighed by the probability that both me and Bob implement this procedure.

  • Before that, she should slow down her company's capabilities progress.

And likewise for Bob.
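The instruction Alice gives her AI can be sketched as a weight computation. This is only one possible reading, not a definitive formalization: the function name `mixture_weights` and its parameters are hypothetical, the way the two factors are combined (multiplying Bob's timeline share by the cooperation probability) is an assumption, and the threat-resistance part is not modeled at all.

```python
from __future__ import annotations


def mixture_weights(
    alice_timelines: float,
    bob_timelines: float,
    p_both_implement: float,
) -> tuple[float, float]:
    """Return (weight on Alice's utility, weight on Bob's utility) for Alice's AI.

    alice_timelines / bob_timelines: the AI's estimate of how many timelines each
        of them launched an aligned superintelligence in (only the ratio matters).
    p_both_implement: the AI's estimate of the probability that both Alice and
        Bob implement the procedure.
    """
    if alice_timelines < 0 or bob_timelines < 0:
        raise ValueError("timeline estimates must be non-negative")
    total = alice_timelines + bob_timelines
    if total <= 0:
        raise ValueError("at least one timeline estimate must be positive")
    if not 0.0 <= p_both_implement <= 1.0:
        raise ValueError("p_both_implement must be a probability")

    # Assumption: Bob's share is his fraction of aligned-launch timelines,
    # discounted by how likely it is that the procedure is really being followed.
    weight_bob = p_both_implement * (bob_timelines / total)
    return (1.0 - weight_bob, weight_bob)


if __name__ == "__main__":
    # Equal timelines, procedure certainly followed: an even split.
    print(mixture_weights(1.0, 1.0, 1.0))
    # Bob certainly would not have followed the procedure: he gets nothing.
    print(mixture_weights(3.0, 1.0, 0.0))
```

Under this reading, the less likely it is that Bob would really have followed the procedure, the less of Alice's lightcone goes to his values, which is what gives him a reason to actually follow it.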

Now, their respective aligned superintelligences aren't omniscient; it could be that they over-estimate or under-estimate how likely Bob or Alice would've been to implement this procedure, for example. But one thing's pretty much for sure: neither Alice nor Bob is able to fool the future superintelligence into thinking that they'd implement this procedure when actually they wouldn't. Because it's superintelligent.

If Alice wins, she knows whether Bob would follow the procedure, because her superintelligence can tell (better than Bob can fake). And Bob doesn't have to count on wishful thinking to know that Alice would indeed do this instead of defecting, because in worlds where he wins, he can ask his superintelligence whether Alice would implement this procedure. They're each kept in check by the other's future-winning-self, and they can each rely on this being superintelligently-checked by their respective future selves.

So the only way Alice has to get some of her utility maximized in worlds where Bob wins, is to actually behave like this, including before either has launched a superintelligence. And likewise for Bob.

Their incentive gradient is in the direction of being more likely to follow this procedure, including slowing down their capabilities progress — and thus decreasing the amount of worlds where their AI is unaligned and everyone dies forever.

In the real world, there are still Bobs and Alices who don't implement this procedure, but that's mostly because they don't know/understand that if they did, they would gain more utility. In many cases, it should suffice for them to be informed that this is indeed where their utility lies.

Once someone has demonstrated that they understand how LDT applies here, and that they're generally rational, then they should understand that implementing this protocol (including slowing down AI capabilities) is what maximizes their utility, and so you can count on them cooperating.

Now, in actuality, this is not quite the full generality of how LDT applies here. What they each should actually tell their superintelligence if they win is actually simpler:

Maximize a mixture of my utility function and Bob's (and anyone else who might be in a position to build superintelligence), weighed in whatever way creates a (non-threatening) incentive gradient which maximizes my utility, including the utility I have for reducing the number of worlds in which everyone dies.

Or even more simply:

Maximize my utility function (as per LDT).

But I think it's neat to have a clearer idea how that actually shakes out.

There's no weird acausal magic going on here. By racing for AI, Bob would slightly increase the chance that he's the one to launch the aligned superintelligence that takes over the world, but he's causing more dead worlds in total, and he loses the utility he would otherwise gain in worlds where Alice wins, ending up with less utility overall.

If either of them is somewhat negative utilitarian, racing is even worse: all those dead worlds where they launch an unaligned superintelligence leave remote alien baby-eaters free to eat babies, whereas if they increased the total number of worlds where either of them gets an aligned superintelligence, then that aligned superintelligence can pay a bunch of its lightcone in exchange for them eating fewer babies. This is not a threat; we're never pessimizing the aliens' utility function. We're simply offering them a bunch of realityfluid/negentropy right here, in exchange for them focusing more on a subset of their values which doesn't contain lots of what Alice and Bob would consider suffering; the aliens can only be strictly better off than if we didn't engage with them.

Now, this isn't completely foolproof. If Alice is very certain that her own superintelligence will indeed be aligned when it's launched no matter how fast she goes, then she has no incentive to slow down — in her model, Bob doesn't have much to offer to her.

But should she really have that confidence, when a bunch of qualified alignment researchers keep telling her that she might be wrong?

She should really make sure she has really high confidence, and that she's in general implementing rationality correctly.

Oh, and needless to say, people who are neither Alice nor Bob also have a bunch of utility to gain by taking actions which reduce the total number of dead worlds by forcing both of their companies to slow down (eg through regulation).

When we have this much utility in common (not wanting to die), it's really really dumb to "defect". Unlike in the prisoner's dilemma, this "defection" doesn't get you more utility, it gets you less. This is not a zero-sum game at all. If you think it is, if you think your preferred future is the exact opposite of your opponent's preferred future, then you're probably making a giant reasoning mistake.

Whether your utility function is focused on creating nice things or on reducing suffering ("positively-caring" and "negatively-caring"), slowing down AI progress in order to have a better chance of alignment is probably what serves your utility function best.






Executive summary: Implementing logical decision theory helps AI companies align their capabilities progress to reduce existential risk and maximize mutual utility.

Key points:

  1. AI companies racing for capabilities creates unnecessary existential risk from unaligned superintelligences.
  2. Logical decision theory incentivizes both companies to cooperate by slowing progress.
  3. Cooperation allows more worlds where an aligned superintelligence satisfies their utility functions.
  4. Defecting by racing progress causes more dead worlds and less mutual utility.
  5. For negative utilitarians, cooperation also reduces remote suffering.
  6. Some confidence in capabilities is warranted, but extreme overconfidence risks catastrophe.



This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.
