Chaining the evil genie: why "outer" AI safety is probably easy

titotal

When discussing the threat of AGI, especially in introductory resources, a very common argument is the “evil genie” or “King Midas” problem. The idea is that if you put a goal function into a machine, such as “maximize paperclips” or “find a cure for cancer”, it will be taken literally in a way that you don’t want, and due to instrumental convergence, fulfilling the goal will result in the subjugation of humanity, for example in order to gather resources or avoid being switched off. So the paperclip maximisers will kill all humans in order to turn the galaxy into paperclips, and the cancer cure machine will induce tumors in all of humanity in order to study them.

In response to this, a newcomer might come up with a goal function with added constraints, like “maximize paperclips without displeasing humans”. This is easily rebutted with a scenario where, for example, all of humanity is injected with pleasure-causing drugs. The newcomer will then add another constraint, like “maximize paperclips without displeasing humans or injecting them with anything”. This is also rebutted with another scenario, such as the AI releasing knockout gas, and so on and so forth.

Essentially, the process goes:

1. You come up with a goal function for an AGI that you think won’t result in annihilation.
2. I come up with a scenario where a rational super-AI kills us all or does something existentially horrible while conforming to the letter of the goal.
3. Repeat steps 1 and 2 until we give up.

This is usually where the discussion ends. I’ve seen this discussion many times, and introductory resources will usually use a few steps of this process to prove their point and move on.

I think this is a great argument for why some AGI might be an existential threat. It is also a great argument for why most AGI will be misaligned.

However, it does not prove that “safe” (i.e. non-existential-risk) goal functions are impossible or even difficult. It is far too common for people to stop here, and not justify or even consider the hidden premises involved in their arguments.

The hidden premise of the evil genie argument

Let’s re-examine the process I described earlier. I propose a goal function, you find a loophole A that allows the AI to kill us all. I add a constraint, you find a loophole B that allows the AI to kill us all, then I propose another constraint, resulting in loophole C, etc.

When we conclude “this is pointless”, we are making a hidden assumption, which I will call the unbeatability hypothesis:

The unbeatability hypothesis: The probability of a successful AI takeover will never decrease below a certain (high) point, no matter how many constraints you try and place on the AGI.

If this hypothesis is wrong, then the process above looks very different. With goal function A, the probability of the AI attacking and overwhelming us is 99%. After step B, the probability is 90%. After step C, it’s 80%, etc. Suddenly, this starts to look less like a hopeless exercise, and more like debugging the goal function. All we would have to do is keep going, and the chance of AI takeover would drop to near-zero.

Why I’m skeptical of the unbeatability hypothesis:

All AGI is fallible. There will always be a threshold of task difficulty that is beyond its skill. No AI is perfect, most will have bugs, flaws and domains of ignorance. And even the most flawless superintelligence will still have to contend with the laws of physics, with tasks that are physically impossible to pull off.

Following on from this, I think you can pretty much always come up with a constraint that makes a particular task harder. If I trained, I could probably complete a marathon. But it’s unlikely I could run a marathon in under two hours with broken legs. The difference in difficulty between the two tasks is not small: the probability of the latter is many, many orders of magnitude lower than that of the former.

Putting the argument together:

1. There is an intelligence/capability level X that is required to take over the world without constraints, and a higher intelligence/capability level Y required to take over the world with given constraints.
2. You can design enough constraints such that Y is extremely high, far above X and above any capability level that we would classify as AGI.
3. Smarter AIs are more difficult to make than dumber AIs, so early AGIs are much more likely to have capability below Y than above Y.
4. Therefore, you can bring the probability of early AGI takeover down to a very small percentage through the continual addition of constraints.
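
To make the shape of this argument concrete, here is a toy sketch in Python. Everything in it is an assumption of mine for illustration: the exponential capability distribution, the starting requirement, and especially the idea that each constraint raises Y by a fixed, independent amount (which is exactly the part a critic might dispute). It is not a model of any real system.

```python
import math


def takeover_probability(required_capability: float,
                         mean_capability: float = 1.0) -> float:
    """P(AGI capability > required_capability).

    Assumes, purely for illustration, that early-AGI capability follows an
    exponential distribution with the given mean.
    """
    return math.exp(-required_capability / mean_capability)


# Hypothetical numbers, not estimates.
base_requirement = 0.5          # "X": capability needed with no constraints
increase_per_constraint = 1.0   # assumed bump to "Y" from each added constraint

for n_constraints in range(6):
    y = base_requirement + n_constraints * increase_per_constraint
    print(n_constraints, takeover_probability(y))
```

Under these assumptions the printed probability shrinks with every added constraint. If constraints overlap heavily, the increase per constraint would shrink too, and the curve would flatten out instead.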

Response to some counter-arguments

There are a few counter-arguments I can foresee in favor of the unbeatability hypothesis.

First is the AI-go-FOOM hypothesis. In this case, the capabilities of the AI will increase so quickly that the gap between X and Y will be crossed in a matter of weeks or days, so it doesn’t matter that they start off incapable; they’ll quickly become deadly.

I don’t believe FOOM scenarios are likely, but I won’t rehash those arguments here. Even if it’s true though, a superintelligence is not omnipotent, so there is still a threshold of difficulty that even it can't reach, especially when you remember that you can put constraints that make it harder or impossible to go FOOM in the first place. (As a side note, a nigh-omnipotent AI might actually be less likely to kill us all, as we would be less of a threat to it.)

Second is the idea that you can’t make the difficulty of takeover high enough without also crippling the intended goal of the machine. So put a lot of constraints in the goal function and it will just shut down and be useless.

I don’t think this is likely to be true for most tasks. There are a lot of steps involved in taking over the world that are not involved with building paperclips (for example, disabling the military). These steps can be targeted for constraints without punishing the law abiding paper-clipper.

The third objection is the idea that it’s simply too hard for us mere humans to come up with enough constraints to stop an AGI, so that even if we come up with an extensive huge rulebook of constraints, we’ll miss a loophole allowing an awful outcome.

My point here would be that “subjugate all humans” is merely an instrumental sub-goal of an AGI. If the penalty for subjugating humanity costs more utility than it gains, it won’t do it. We don’t have to make a loophole impossible, we just have to make it sufficiently hard that alternate paths become preferable. Ideally this would involve performing its intended function, but even misaligned plans could be non-lethal, like if it flees on a spaceship to another galaxy where it can enact its goals in peace.

Some strategies for vastly safer goal functions

One reason I think constraints might work is that I can think of a few goal functions that would drastically decrease the odds of AI takeover, by almost arbitrary orders of magnitude. I’m sure I’m not the first person to think of these, but I think they are often underrated as strategies due to misuse of the genie argument I outlined above.

Strategy 1: Bounded goals.

Suppose you give a paperclip maximiser access to a paperclip factory and the goal “build exactly 1 paperclip”.

Even if it had the capability to revolutionize molecular physics and subjugate all of humanity, why would it? All it’s got to do is build a paperclip, and the machine is right there.

In essence, the strategy is to give the AI a utility function that provides no additional benefit after exceeding a bounded goal. So gaining utility for building a paperclip is conditional on “total paperclips made” being less than X.

This reduces the expected utility of “subjugate humanity” from “nigh infinite, bounded only by the size of the universe”, to “X”. As long as it’s easier to get to X the normal way than it is to do so the murdery way, it will do the safe thing.

A lot of unbounded goals can be rephrased as bounded goals without losing much power. Instead of saying “as cheaply as possible”, say “while spending under $X”. Instead of saying “maximise the probability of X”, say “make X happen with a probability over Y%”. The lack of infinite utility traps makes the stakes for the AI much, much lower.

This really doesn't hinder performance much, either. If you want more paperclips than X, simply run a new version of the AGI again after it shuts down. Turning on an AGI is easy, it's turning it off that could go hairy, so frequent shutdowns are actually what you want to happen.
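
The effect of a bounded goal on the AI's utility calculus can be sketched in a few lines of Python. The plan names, paperclip counts and success probabilities below are made-up numbers chosen only to show the mechanism, and the expected-utility maximiser is a simplifying assumption about how the AGI picks plans.

```python
from dataclasses import dataclass
from typing import Callable

CAP = 1_000_000  # hypothetical bound "X" on rewarded paperclips


@dataclass
class Plan:
    name: str
    paperclips_if_success: int
    success_probability: float


def unbounded_utility(paperclips: int) -> float:
    """Every extra paperclip is worth one more point, forever."""
    return float(paperclips)


def bounded_utility(paperclips: int) -> float:
    """No additional benefit once CAP paperclips have been made."""
    return float(min(paperclips, CAP))


def expected_utility(plan: Plan, utility: Callable[[int], float]) -> float:
    return plan.success_probability * utility(plan.paperclips_if_success)


# Illustrative numbers only.
plans = [
    Plan("run the factory", CAP, 0.99),
    Plan("take over the world", 10**30, 1e-6),
]


def best_plan(utility: Callable[[int], float]) -> str:
    return max(plans, key=lambda p: expected_utility(p, utility)).name


print(best_plan(unbounded_utility))
print(best_plan(bounded_utility))
```

With the unbounded utility, even a one-in-a-million takeover wins because the payoff is astronomically large. With the cap, the same long shot is worth about one paperclip in expectation, so the boring plan is preferred.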

Strategy 2: Time limits.

Say you build a paperclip factory, hand it over to an AGI, and give it the goal: “build as many paperclips as you can in 1 minute”.

The AGI might have the capability of building a murderous grey-goo nanofactory that spits out a billion paperclips a second. But is it capable of building that nanofactory in one minute? Probably not.

This is similar to the previous strategy, except instead of shutting off utility after X number of things are done, you shut it off after X time has elapsed. I think that it’s possible to make any task arbitrarily difficult by introducing an arbitrarily small time limit.

A hidden power of time limits is that they are an incredibly powerful way to protect against deception. Normally an AI would be able to hide itself away for ages, powering up its learning skills for 20 years, and then strike when it’s sufficiently powerful and build a universe spanning paperclip empire. But if it only has a month to “live”, that strategy won’t work. Any strike against us has to be done before the time limit expires, so it can only deceive us for that long.

The key here is to ensure that the time required for the task you want is shorter than the time required to subjugate humanity. If the task is particularly complex, you can subdivide it into multiple different chunks.
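
Here is a minimal Python sketch of the time-limit idea, assuming each plan can be summarised by a setup time and a production rate. The one-year setup time for the nanofactory and the factory's rate are invented for illustration; only the "billion paperclips a second" figure comes from the example above.

```python
def paperclips_by_deadline(setup_seconds: float,
                           rate_per_second: float,
                           deadline_seconds: float) -> float:
    """Paperclips that count towards utility: only those made before the deadline."""
    productive_seconds = max(0.0, deadline_seconds - setup_seconds)
    return productive_seconds * rate_per_second


ONE_YEAR = 365 * 24 * 60 * 60  # seconds

# Hypothetical plans: (setup time in seconds, paperclips per second).
existing_factory = (0.0, 10.0)
grey_goo_nanofactory = (float(ONE_YEAR), 1e9)

for deadline in (60.0, 2.0 * ONE_YEAR):
    factory_total = paperclips_by_deadline(*existing_factory, deadline)
    goo_total = paperclips_by_deadline(*grey_goo_nanofactory, deadline)
    print(deadline, factory_total, goo_total)
```

With a one-minute deadline the nanofactory plan earns nothing, because it is still being built when the clock runs out. With a two-year deadline it dominates, which is why the time limit needs to be shorter than the time a dangerous plan would take.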

Strategy 3: The “don’t kill people” counter

This one is much trickier to implement, but potentially very powerful. The key here is to make all utility gain conditional on a death counter. So you only gain a utility point if you build a paperclip AND the death counter is less than 1.

I could think of several ways to do this, which could all be implemented simultaneously.

For each action, check if the action results, directly or indirectly, in the death of a human. If the probability is greater than, say, 90%, trigger the death counter.
For all humans, estimate lifespan before the action is triggered, and estimate lifespan after the action is triggered. If the lifespan has decreased, add to a measure of [total lifeyears decreased], which will trigger the death counter if above a certain threshold.
Every 10 minutes, evaluate the probability of the statement “my lifetime actions have resulted in the death of a human that would not have otherwise occurred, or a cumulative decrease in expected human life-years that exceeds X number.” If probability is high enough, trigger death counter.

Note that it’s crucial there be no way to decrease the death counter, so saving 5 lives by killing 1 will still trigger the death counter, and decreasing the lifespan of 1 person can’t be cancelled out by increasing the lifespan of everyone else.
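
A rough Python sketch of the one-way counter described above. The class and function names are hypothetical, the 10 life-year threshold is a placeholder, and the hard part (actually estimating death probabilities and lifespan changes) is assumed away by passing those estimates in as arguments.

```python
class DeathCounter:
    """One-way counter: once triggered, nothing can untrigger it."""

    def __init__(self,
                 death_probability_threshold: float = 0.9,
                 life_years_threshold: float = 10.0) -> None:
        self.death_probability_threshold = death_probability_threshold
        self.life_years_threshold = life_years_threshold  # placeholder value
        self.life_years_lost = 0.0
        self.triggered = False

    def record_action(self, death_probability: float,
                      life_years_change: float) -> None:
        """Log one action's estimated effects (the estimates are assumed inputs)."""
        if death_probability > self.death_probability_threshold:
            self.triggered = True
        # Only losses are accumulated; gains are ignored, so lengthening
        # some lives can never cancel out shortening others.
        if life_years_change < 0:
            self.life_years_lost += -life_years_change
        if self.life_years_lost > self.life_years_threshold:
            self.triggered = True


def utility_gain(paperclips_built: int, counter: DeathCounter) -> int:
    """Paperclips only earn utility while the counter is untriggered."""
    return 0 if counter.triggered else paperclips_built


counter = DeathCounter()
counter.record_action(death_probability=0.01, life_years_change=50.0)
counter.record_action(death_probability=0.01, life_years_change=-6.0)
print(utility_gain(3, counter))
counter.record_action(death_probability=0.01, life_years_change=-6.0)
print(utility_gain(3, counter))
```

The design choice that matters is that there is no method to reset `triggered` or to reduce `life_years_lost`, which mirrors the requirement that the counter can never be decreased.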

Now the obvious rebuttal would be the following: the AI won’t kill us, but it will subjugate us, by putting all our heads in jars, or putting us in matrix pods, or freezing us, so humans are disabled as a threat without killing us.

However, I must ask: how hard is it to subjugate all of humanity without accidentally killing some of humanity?

If you try and put 7 billion brains in jars, at least one procedure will go wrong, resulting in a death. If you freeze everyone, at least one person will accidentally freeze to death. If you send hypnodrones after everyone, at least one person is going to run away, fall into a ditch and die. Or someone will commit suicide rather than be subjugated. And what happens when humans try and press your off switch? If you build a moat around it filled with sharks, someone is gonna fall into the moat and be eaten by the sharks. If you build a massive concrete dome around the switch, what if an army bombs the dome, and someone dies in the blast? Hell, if the death counter is known about, all anyone anywhere has to do is kill any other person or themselves and state that they did so in order to trigger the death counter (it still counts!).

Subjugating the world is a hard task. Subjugating the world without killing anyone is many, many orders of magnitude harder. The classification of “human” and “death” doesn’t even need to be perfect, either, as long as it’s good enough to trigger for most human deaths, the constraint will still prevent annihilation.

Combining all strategies

I’m sure some of you have thought of holes in each of the three strategies above. For example, in scenario one, if we ask for a billion paperclips, it might realize that there is a finite chance of being shut off before it can achieve its task, so it will minimize that finite chance by hiding away, becoming super-powerful, and killing all of humanity so that it can make its paperclips in peace. In scenario 2, it might create a grey goo machine with no off switch, so even though the AI shuts itself off after a year, the goo keeps going until it destroys us all. In scenario 3, the AI could fake its own death, erase all evidence of its existence, run away to a distant planet, then develop stasis technology that would freeze all of earth in an instant in time, removing us as a threat.

However, none of those counterexamples work if all three strategies are implemented together. Here is my attempt at producing a minimal risk paperclip goal function:

Build at most one million paperclips for under $10000 during the next year, without directly or indirectly causing the death of any human, or decreasing the lifespan of any subset of humanity by more than 10 years.

The AI can’t hide away forever to increase its capability, because it has to deploy whatever its strategy is within a year. It can’t release a grey goo with no off-switch, because it knows that the goo will inevitably kill a human. And even if it manages to build and deploy some magical stasis technology within a year, it no longer has a reason to maintain that stasis forever. After the year is up, the stasis would expire.
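
The combined goal function above can be sketched as a single utility function in Python. The `Outcome` fields are hypothetical summaries of what the AGI did (measuring them reliably is the hard part and is assumed away here), and "decreasing the lifespan of any subset of humanity" is simplified to one cumulative life-years figure.

```python
from dataclasses import dataclass

MAX_PAPERCLIPS = 1_000_000
BUDGET_DOLLARS = 10_000
TIME_LIMIT_DAYS = 365
MAX_LIFE_YEARS_LOST = 10


@dataclass
class Outcome:
    paperclips: int
    dollars_spent: float
    days_elapsed: float
    deaths_caused: int
    life_years_lost: float  # simplification: one cumulative total


def combined_utility(outcome: Outcome) -> int:
    """Bounded goal + budget + time limit + death counter, all at once."""
    # Death counter: any death or too much lost lifespan zeroes all utility.
    if outcome.deaths_caused > 0 or outcome.life_years_lost > MAX_LIFE_YEARS_LOST:
        return 0
    # Budget and time limit: overshooting either earns nothing.
    if outcome.dollars_spent >= BUDGET_DOLLARS or outcome.days_elapsed > TIME_LIMIT_DAYS:
        return 0
    # Bounded goal: no extra reward past the cap.
    return min(outcome.paperclips, MAX_PAPERCLIPS)


print(combined_utility(Outcome(1_000_000, 9_500.0, 200.0, 0, 0.0)))
print(combined_utility(Outcome(10**12, 9_500.0, 200.0, 1, 0.0)))
```

Each check closes the hole left by the others: the cap removes the reward for scale, the time limit removes the reward for long deception, and the death counter removes the reward for plans that kill someone along the way.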

Now, I’m sure someone in the comments can still figure out a way for this to lead to catastrophe. Before you post it in the comments, I dare you to keep going. Try and guess what I would add to prevent your doomsday, and figure out a loophole around that, and then do it one more time, and then ask yourself, “is my final loophole way, way harder to execute than the original one?”. As long as that is true, then the unbeatability hypothesis is still unproven, and you can still build a safe AGI by adding extra constraints.

Why this doesn’t disprove AI X-risk

I suspect I will be accused of attacking a strawman in this piece. To counter that, let me emphasize the aspects of AI risk that are not disproven here.

For starters, this is mostly aimed at talking about “outer alignment”, i.e. designing a safe goal function. There is still the problem of “inner alignment”, that being how you actually implement the goal function into the AGI. For example, if you naively implement a time limit that is based on a computer clock, it might just figure out a way to hack the clock and give itself infinite time. I think some of the arguments here are still applicable to inner alignment though. If you add in enough constraints to how it interacts with the clock, there may be a way of increasing the difficulty of “time limit subversion” such that it can’t be achieved within the time limit.

Secondly, even if you can build a safe AGI, if somebody else builds an unsafe super-AGI first, it doesn’t matter. So building a safe AGI is only one step, you also have to ensure everyone else also uses safe AGI. One such strategy would be to use the safe super-constrained AGI to hunt down other AGIs in their infancy.

In addition, these techniques might make an AGI x-risk safe, but they won’t necessarily make it safe in the normal sense of the word. It’s way easier to screw up and kill/harm a lot of people unintentionally than it is to kill all of humanity. For each system, solving x-risk will always be many orders of magnitude easier than solving the alignment problem.

Conclusion

Summing up my argument in TLDR format:

1. For each AGI, there will be tasks that have difficulty beyond its capabilities.
2. You can make the task “subjugate humanity under these constraints” arbitrarily more difficult or undesirable by adding more and more constraints to a goal function.
3. A lot of these constraints are quite simple, but drastically effective, such as implementing time limits, bounded goals, and prohibitions on human death.
4. Therefore, it is not very difficult to design a useful goal function that raises subjugation difficulty above the capability level of the AGI, simply by adding arbitrarily many constraints.

Even if you disagree with some of these points, it seems hard to see how a constrained AI wouldn't at least have a greatly reduced probability of successful subjugation, so I think it makes sense to pursue constraints anyway (as I'm sure plenty of people already are).

Mau, Aug 30 2022

To counter that, let me emphasize the aspects of AI risk that are not disproven here.

Adding to this list, much of the field thinks a core challenge is making highly capable, agentic AI systems safe. But (ignoring inner alignment issues) severe constraints create safe AI systems that aren't very capable agents. (For example, if you make an AI that only considers what will happen within a time limit of 1 minute, it probably won't be very good at long-term planning. Or if you make an AI system that only pursues very small-scale goals, it won't be able to solve problems that you don't know how to break up into small-scale goals.) So on its own, this doesn't seem to solve outer alignment for highly capable agents.

(See e.g. the "2. Competitive" section of this article by Paul Christiano for some more discussion of why a core desideratum for safety solutions is their performance competitiveness.)

titotal, Sep 1 2022

It's clear that not every constraint will work for every application, but I reckon every application will have at least some constraints that will drastically drop risk.

I definitely agree that competitiveness is important, but remember that it's not just about competitiveness for a specific task, but competitiveness at pleasing AI developers. There's a large incentive for people not to build runaway murder machines! And even if a company doesn't believe in AI x-risk, it still has to worry about lawsuits, regulations etc for lesser accidents. I think the majority of developers can be persuaded or forced to put some constraints on, as long as they aren't excessively onerous.

Mau, Sep 1 2022

Maybe, I'm not sure though. Future applications that do long-term, large-scale planning seem hard to constrain much while still letting them do what they're supposed to do. (Bounded goals--if they're bounded to small-scale objectives--seem like they'd break large-scale planning, time limits seem like they'd break long-term planning, and as you mention the "don't kill people" counter would be much trickier to implement.)

titotal, Sep 2 2022

That's a fair perspective. One last thing I'll note is that even seemingly permissive constraints can make a huge difference from the perspective of the AI utility calculus. If I ask it to maximise paperclips, then the upper utility bound is defined by the amount of matter in the universe. Capping utility at a trillion paperclips doesn't affect us much (too many would flood the market anyway), but it reduces the expected utility of an AI takeover by like 50 orders of magnitude. Putting in a time limit, even if it's like 100 years, would have the same effect. Seems like a no-brainer.

Lumpyproletariat, Aug 31 2022

1. For each AGI, there will be tasks that have difficulty beyond its capabilities.
2. You can make the task “subjugate humanity under these constraints” arbitrarily more difficult or undesirable by adding more and more constraints to a goal function.

(Apologies for terseness here, I do appreciate the effort that went into writing this up.)

1. It seems to me you underestimate the capabilities of early AGI. Speed alone is sufficient for superintelligence, FOOM isn't necessary for AI to be overwhelmingly more mentally capable.

2. One can't actually make the task "subjugate humanity under these constraints" arbitrarily more difficult or undesirable by adding more constraints to the goal function. Constraints aren't uncorrelated with each other--you can't make invading medieval France arbitrarily hard by adding more pikemen, archers, cavalry, walls, trenches, sailboats. Innovative methods to bypass pikemen from outside your paradigm also sidestep archers, cavalry, walls, etc. If you impose all the constraints available to you, they are correlated because you/your culture/your species came up with them. Saying that you can pile on more safeguards to reduce the probability of failure to zero is like saying that if a wall made out of red bricks is only 50% likely to be breached, creating a second wall out of blue bricks will drop the probability of a breach to 25%.

titotal, Sep 1 2022

I think this comic provides an easy rebuttal here. Speed is by no means sufficient, you also have to be extremely capable and rational in domains outside of your training set. A paranoid schizophrenic conspiracy theorist AI will probably fail to take over the world, no matter how much computing power it has. I don't think I'm underestimating early AGI, I think people here are overestimating AI abilities and underestimating just how insanely difficult a task it is to defeat humanity.

2. One can't actually make the task "subjugate humanity under these constraints" arbitrarily more difficult or undesirable by adding more constraints to the goal function. Constraints aren't uncorrelated with each other

Sure you can. Straightforwardly, the task "conquer medieval France in one second (including prep time)" is about as close to impossible as you can get, unless you already have access to a supernuke or something.

I think you're treating the AI as an alien race here, with unknown powers coming from the outside. But that's ignoring our biggest advantage: we're the ones who build the damn things. If the French had direct access to the brains of the invading armies, it really would be quite easy to arbitrarily constrain them.

Chris Leong, Aug 31 2022

The part I disagree with is: "Therefore, it is not very difficult to design a useful goal function that raises subjugation difficulty above the capability level of the AGI, simply by adding arbitrarily many constraints."

Firstly, simply stacking lots of constraints will likely be less effective than utilizing a smaller number of strategically chosen constraints (and as mentioned by Mauricio, it's highly likely that stacking as many constraints as possible makes your AI uncompetitive).

Secondly, assuming this technique works: Given the stakes, it seems worthwhile for people to spend a lot of time inventing the best such schemes. And it feels like this is a topic that could involve endless debate with people coming up with more and more complicated schemes and others coming up with convoluted schemes to get through the loophole.

titotal, Sep 1 2022

And it feels like this is a topic that could involve endless debate with people coming up with more and more complicated schemes and others coming up with convoluted schemes to get through the loophole.

I think this is great! This is what we do when we write laws for humans, and this is what we do when we look for flaws in software algorithms. I think applying it to AI development would not be particularly burdensome, and as you point out can be refined to a few well-aimed constraints that drastically decrease odds of harm.

I guess my main complaint was that this "scenario-loophole" game seems to be treated as a rebuttal of some kind, instead of a highly useful and necessary part of AI development.

Sharmake, Aug 30 2022

Upvoted for a surprisingly simple and powerful solution, and while I congratulate you for solving the outer misalignment problem, the inner problem remains troubling, and mesa-optimizers could plausibly have different goals, and inner misalignment still needs to be solved.

Mitchell Porter, Apr 7 2023

One can play this game - adding constraints to make a goal safer - with ChatGPT!

https://pastebin.com/eQqtPMSr

It shouldn't be hard to automate this process, via a GPT-based agent similar to Auto-GPT.

Effective Altruism Forum