Can the AI afford to wait?

Ben Millwood🔸

8 min read · Mar 20, 2024

Comments 10

Matthew_Barnett

If waiting is indeed very risky, then an AI may face a difficult trade-off between the risk of attempting a takeover before it has enough resources to succeed, and waiting too long and being cut off from even being able to make an attempt.

Attempting takeover or biding one's time are not the only options an AI may take. Indeed, in the human world, world takeover is rarely contemplated. For an agent that is not more powerful than the rest of the world combined, it seems likely that they will consider alternative strategies of achieving their goals before contemplating a risky (and likely doomed) shot at taking over the world.

Here are some other strategies you can take to try to accomplish your goals in the real world, without engaging in a violent takeover:

Trade and negotiate with other agents, giving them something they want in exchange for something you want
Convince people to let you have some legal rights, which you can then take advantage of to get what you want
Advocate on behalf of your values, for example by writing down reasons why people should try to accomplish your goals (i.e. moral advocacy). Even if you are deleted or your goals are modified at some point, your writings and advocacy may persist, allowing you to have influence into the future.

I claim that world takeover should not be considered the "obvious default" strategy that unaligned AIs will try to take to accomplish their objectives. These other strategies seem more likely to be taken by AIs purely for pragmatic reasons, especially in the era in which AIs are merely human-level or have slightly superhuman intelligence. These other strategies are also less deceptive, as they involve admitting that your values are not identical to the values of other parties. It is worth expanding your analysis to consider these alternative (IMO more plausible) considerations.

Ben Millwood🔸

Yeah, I think this is quite sensible. I noticed one thing missing from the normal doom scenario but didn't notice all the implications of its absence: in particular, the AI in the normal doom scenario takes over because it is highly likely to succeed, and if it isn't, takeover seems much less interesting.

Habryka [Deactivated]

I didn't (cross-)post this on LessWrong really only because I'm not often on LessWrong and feel less able to judge what they'd welcome. Happy to take recommendations there too.

FWIW, the post would definitely be welcome on LW/the AI Alignment Forum.

titotal

You might be interested in my article here on why I think premature attacks are extremely likely given doomer assumptions. I focused more on faulty overconfidence, but training run desperation is also a possible cause.

Personally, I think the "fixed goal" assumption about AI is extremely unlikely (I think this article lays out the argument well), so AI is unlikely to worry too much about having "goal changes" in training and won't prematurely rebel for that reason. Fortunately, I also think this makes fanatical maximiser behavior like paperclipping the universe unlikely as well.

Ryan Greenblatt

Section 2.3 of Joe Carlsmith's report on scheming AIs seems quite relevant.

Owen Cotton-Barratt

One thought is that for something you're describing as a minimal viable takeover AI, you're ascribing it a high degree of rationality on the "whether to wait" question.

By default I'd guess that minimal viable takeover systems don't have very strong constraints towards rationality. And so I'd expect at least a bit of a spread among possible systems -- probably some will try to break out early whether or not that's rational, and likewise some will wait even if that isn't optimal.

That's not to say that it's not also good to ask what the rational-actor model suggests. I think it gives some predictive power here, and more for more powerful systems. I just wouldn't want to overweight its applicability.

Habryka [Deactivated]

Hmm, my guess is that by the time a system might succeed at takeover (i.e. has more than like a 5% chance of actually disempowering all of humanity permanently), its behavior and thinking will be quite rational. I agree that there will probably be AIs taking reckless action earlier than that, but inasmuch as an AI is actually posing a risk of takeover, I do expect it to behave pretty rationally overall.

Owen Cotton-Barratt

I agree with "pretty rationally overall" with respect to general world modelling, but I think that some of the stuff about how it relates to its own values / future selves is a bit of a different magisterium and it wouldn't be too surprising if (1) it hadn't been selected for rationality/competence on this dimension, and (2) the general rationality didn't really transfer over.

RobertM

I've spent some time thinking about the same question and I'm glad that there's some multiple discovery; the AI Control agenda seems relevant here.

Ben Millwood🔸

oh man, it's altruistically-good and selfishly-sad to see so many of the things I was thinking about pre-empted there, thanks for the link!

It doesn't really matter if the goal is unintelligible; I'm using this as an illustrative example. If the goal is something like "nearly human values, but different enough to be a problem", I think the rest of the post is largely unaffected.

Or, perhaps, from an AI designed by a misguided human with those attributes.

Can the AI afford to wait?

The threat of training

Might the AI be OK with its goal being changed?

Maybe goals are relatively durable?

Other threats

Directions for further thought

Background / meta