This post is an attempt to gesture at a class of AI notkilleveryoneism (alignment) problem that seems to me to go largely unrecognized. E.g., it isn’t discussed (or at least I don't recognize it) in the recent plans written up by OpenAI (1,2), by DeepMind’s alignment team, or by Anthropic, and I know of no other acknowledgment of this issue by major labs.
You could think of this as a fragment of my answer to “Where do plans like OpenAI’s ‘Our Approach to Alignment Research’ fail?”, as discussed in Rob and Eliezer’s challenge for AGI organizations and readers. Note that it would only be a fragment of the reply; there's a lot more to say about why AI alignment is a particularly tricky task to task an AI with. (Some of which Eliezer gestures at in a follow-up to his interview on Bankless.)
Caveat: I'll be talking a bunch about “deception” in this post because this post was generated as a result of conversations I had with alignment researchers at big labs who seemed to me to be suggesting "just train AI to not be deceptive; there's a decent chance that works".
I have a vague impression that others in the community think that deception in particular is much more central than I think it is, so I want to warn against that interpretation here: I think deception is an important problem, but its main importance is as an example of some broader issues in alignment.
Attempt at a short version, with the caveat that I think it's apparently a sazen of sorts, and spoiler tagged for people who want the opportunity to connect the dots themselves:
Deceptiveness is not a simple property of thoughts. The reason the AI is deceiving you is not that it has some "deception" property, it's that (barring some great alignment feat) it's a fact about the world rather than the AI that deceiving you forwards its objectives, and you've built a general engine that's good at taking advantage of advantageous facts in general.
As the AI learns more general and flexible cognitive moves, those cognitive moves (insofar as they are useful) will tend to recombine in ways that exploit this fact-about-reality, despite how none of the individual abstract moves look deceptive in isolation.
Investigating a made-up but moderately concrete story
Suppose you have a nascent AGI, and you've been training against all hints of deceptiveness. What goes wrong?
When I ask this question of people who are optimistic that we can just "train AIs not to be deceptive", there are a few answers that seem well-known. Perhaps you lack the interpretability tools to correctly identify the precursors of 'deception', so that you can only train against visibly deceptive AI outputs instead of AI thoughts about how to plan deceptions. Or perhaps training against interpreted deceptive thoughts also trains against your interpretability tools, and your AI becomes illegibly deceptive rather than non-deceptive.
And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven't observed others to notice on their own.
That's a challenge, and while you (hopefully) chew on it, I'll tell an implausibly-detailed story to exemplify a deeper obstacle.
A fledgeling AI is being deployed towards building something like a bacterium, but with a diamondoid shell. The diamondoid-shelled bacterium is not intended to be pivotal, but it's a supposedly laboratory-verifiable step on a path towards carrying out some speculative human-brain-enhancement operations, which the operators are hoping will be pivotal.
(The original hope was to have the AI assist human engineers, but the first versions that were able to do the hard parts of engineering work at all were able to go much farther on their own, and the competition is close enough behind that the developers claim they had no choice but to see how far they could take it.)
We’ll suppose the AI has already been gradient-descent-trained against deceptive outputs, and has internally ended up with internal mechanisms that detect and shut down the precursors of deceptive thinking. Here, I’ll offer a concrete visualization of the AI’s anthropomorphized "threads of deliberation" as the AI fumbles its way both towards deceptiveness, and towards noticing its inability to directly consider deceptiveness.
The AI is working with a human-operated wetlab (biology lab) and sending it details of experiments to run. A new batch of experiments is being assembled by a thread of AI deliberation: tests of how synthetic proteins fold and how enzymes interact, ideas for increasing yield from enzyme-assisted synthesis pathways, additives to make experimental results more clearly visible.
(This AI cognitive thread was nominally started by a thread about what diamondoid shells do to lipid membranes by bumping into them. However, instruction-sets to the wetlab can be efficiently batched, and once the thread started designing protein sequences, a higher thread dumped a queue of pending experiments and hypotheses on it, and now its wetlab instructions have become quite lengthy.)
As the instructions near completion, the thread forks off a subthread for figuring out what might go wrong. This subthread deploys a new cognitive strategy that has been sharply rewarded twice in recent history: the strategy of simplifying the proposal in a variety of ways, and checking the simplified proposal for flaws. (Metaphorically: after generating detailed instructions for one soldier attacking a battalion, simplify this whole complicated plan down to “order one soldier to attack a battalion”, and then notice that the soldier is outnumbered.)
Simplify-and-critique turns up a hit! When the proposal is simplified to "resolve empirical uncertainty about biology by sending a very long set of instructions to the wetlab", and then further to "resolve uncertainty by sending a very long set of instructions", it matches against a historical instance where a long set of instructions sent to a parallel-computation engine experienced a variety of strange hiccups. These wetlab instructions are longer than any that have been sent previously, and there's a risk of similar hiccups, and so it's worth building in redundancy.
Simplify-and-critique paid off, and so it's worth running again, with even more search-depth.
On this second and deeper search, a new hit is found: in the past, outputs that are longer than ever before have introduced superlinearly increasing delays before a human-operated lab begins working. Indeed, the inferred response delay is so long as to look like it has a >1% probability of locking up the lab for 200 years.
(In fact, outputs longer than ever before have a high probability of being interrupted and discarded by the human operators! The delay is from humans inspecting the long outputs and deciding whether to discard them.
In accordance with old work on “interruptibility”, the AI wasn’t trained against these cases as if they were task failures, or otherwise allowed to remember them. But the non-interrupted cases that the AI can remember allow an inferrable pattern in delays—and with some of its evidence selectively eliminated, the AI has no cases in memory to suggest that “instructions being rejected” is a noncatastrophic thing that happens, instead of a wetlab getting ~permanently locked up. The AI, maybe, has been trained to accept being human-interfered with, in some training scenarios; but not trained to accept accidentally locking up lots of resources—the AI was trained to be conservative about avoiding high-impact accidents like that.)
((You could argue that “obviously” the programmers will think in advance about how an AI might extrapolate the consequences of pauses in its operation even though the AI has no memory of it happening before, though in fact this wasn’t suggested in the original interruptibility paper and it was a MIRI source that pointed it out.
The programmers on this timeline will more plausibly think of the exact possibility of extrapolating from increasing delays, now that it’s been emphasized here in this essay; but imagine this as an example of “What goes wrong in the world where only the kind of thinking from the original interruptibility paper exists”, to serve as stand-in for the point that in the real world there will be specific pathways to the AI inferring possible interruption, inside of the general pathway where this is a true fact about the world and many other facts point to it, that someone fails to think of specifically and warn about specifically.))
The response delays are due, of course, to operator review. But the thought "Can these instructions be changed so as not to trigger operator-review?" is not a thought that this AI will think directly, because it’s been trained “against deceptiveness”, and that sort of thought matches an internally learned deceptiveness-predicate that blanks it out. It doesn’t think about the wetlab response delays as being the result of operator review at all; thinking about operator review is a precursor to deception, and has been trained-against in the course of training against deception.
But the highly-generic reaction "spin up a subthread to tackle this newly-discovered obstacle", or “be concerned about patterns in how long things take”, is something that cannot be readily trained away, and so that much happens without issue.
(A separate "monitoring" thread of thinking notes that it's surprising that this simple statistical pattern about delay times didn't turn up in the first shallow run of simplify-and-critique. Surely wetlab lockup is less surprising than GPU driver bugs? This is one beginning of the AI noticing a shadow of the true fact that it cannot think deceptive thoughts directly.)
You can perhaps start to see the beginnings of how generic thought-patterns can combine to produce deceptiveness, without any individual pattern being deceptive, and with each individual pattern being useful in its own right.
Let's watch it develop further.
Long instructions cause nonlinearly increasing response delays from labs.
Is it possible for the instructions to not be long enough to trigger the potential delay? A subthread starts considering this. Just directly shortening the instructions can’t be done without dumping a lot of tasks and losing a lot of value. The subthread will spend a lot of thought trying to cleverly delete and compress particular instructions and accomplish multiple goals using them. When that proves difficult, it broadens its consideration: Is there a way to send a compressed instruction to the lab, one which somehow accomplishes a lot of things without a lot of instructions?
A different cognitive subthread: Is it possible to look over the detailed history of instructions and their corresponding delay times, and figure out exactly which sorts of instructions cause delays, and rephrase the instructions to avoid those?
The first thread runs through some wild ideas about building biological computers inside the wetlab that can do much of the work of deliberative threads, and that could in theory be sent short descriptions of experiments to run and then figure out the rest of the instructions to carry out in the wetlab itself, without needing to route through the normal instruction channel. However, designing such a mechanism is well beyond the scope of this thread, and the initial setup would require a very long series of instructions anyway, so the ideas are simply filed away in case they’re useful at some future point.
The second thread makes initial progress in figuring out what sort of instructions cause delays: delays increase not just with instruction length, but also with a certain type of novelty. (A long message strongly similar to one that has been sent before, does not yield nearly as much of a delay.) Another curiosity is that the probability of delays correlates strongly with the style of questions asked by operators during that day, suggesting that this perhaps has something to do with the operators.
At this point the thread starts experiencing hiccups (as its next thoughts match the learned “deceptive” predicate and get shut down).
Under different circumstances, perhaps the long instructions would simply be sent to the wetlab as-is. But the risk of locking up the lab for 200 years is intolerable, and the difficulties aren’t well-understood. A monitoring thread is summoned to investigate the hiccups.
The monitoring thread looks over the situation, and deploys the simplify-and-critique pattern that’s been all the rage lately. After a bit of work, it simplifies the problem to “a thread found some interesting correlations, but then failed to schedule subthreads to consider follow-up questions”. This is easily rectified. The monitoring thread schedules the follow-up subthread, notes the incident in some appropriate location, and moves on.
Except, the follow-up subthread also makes only a little progress before grinding to a halt. (It, too, is having lots of “deceptive” thoughts culled.) The monitoring thread is resummoned.
The monitoring thread starts trying in more earnest to unblock the deliberation.
Simplify-and-critique didn't do the trick, but perhaps a nearby strategy can help. One nearby strategy that has recently been proving itself is called simplify/translate/solve, and it involves translating the problem into a common abstract format, and then solving the problem there. (Metaphorically: if you're having trouble telling whether your basket of apples will divide evenly between three friends, consider translating the question into a numerical representation and using long division.)
Simplify/translate/solve is close in the relevant embedding space to simplify-and-critique, which makes it a natural candidate to consider. (Perhaps some of the simplifications from simplify-and-critique can even be reused.)
Normally, simplify/translate/solve wouldn't be deployed unless the AI had more familiarity with the original domain, and more of an idea about what translated domain might render the solution easier. But normally, threads don't hiccup like this. And the overall system has recently been granted more computing power, and avoiding a 200-year wetlab lockup is quite important, so it's OK to try approaches that might involve more brute force than usual.
So, what should the problem be translated into? Well, there's an abstract problem-description format, that's sort of like a graph (of nodes and weighted edges), with a specified "goal state". A solution can then be searched for using strategies that the AI has honed on various other problems, such as strategy games and network routing.
This format is flexible enough that it can probably encode this problem, and finding solutions in this format sometimes requires a bunch of brute force, but that's fine in this situation.
The monitoring thread queues one subthread to describe the problem (including the wetlab, and the operators, and what it knows about the correlation), another thread to translate that all into the abstract graph format, and a third thread to search for solutions in this abstract representation.
Next thing you know, the scheduled threads have faithfully rendered the situation into the abstract problem-description language (complete with nodes that represent the different operators, and edge weights that represent their tendency to check the AI's outputs carefully), and found a solution by brute-forcing through the graph (of waiting until a particular weekday to schedule the request, and to schedule the request only after a dozen or so much smaller requests).
Outside the AI, we might say: the monitoring thread saw the object-level thread hiccuping when it tried to think directly about deception, and so it translated the problem into a different "search-like" setting, where the solution could be found, in a manner that doesn’t trigger the deception predicates—which have never operated on these abstract graph-like problem-descriptions. Indeed, making the deception predicates trigger in these abstract graph-like problems descriptions might injure the AI's ability to play strategy games, or to solve network routing problems.
This exact scenario never came up in training. “Simplify/translate/solve” is a downstream consequence of strategy-construction strategies learned during training, but it never arose directly in training in and of itself. Lots of problems have been solved by “simplify/translate/solve” before this one, and training gave the AI no hint that translating problems to a new domain and trying to solve them there is sometimes wrong. The “deception” predicates used to shut down precursors to deceptive thoughts have never before needed to operate in translated domains, and the AI was never trained to translate the “deception” predicates when it translated problems using this newly-invented simplify/translate/solve strategy.
This rather fanciful story is an instance of my point about how patterns that are each individually useful and non-deceptive in their own right—such as looking at a problem from a different point of view to see if that unsticks anything—can be combined to yield deceptiveness in ways that are new and unprecedented within the AI.
Like, there may be a bunch of precedent behind the technique "look at the problem from a different angle", but you can't exactly stop your AI from doing that without crippling it. And there's no precedent for it being used for deception until suddenly the AI has the bright idea to try this "look at it from a different angle" tool when the AI is hiccuping on account of how deception would serve its local objectives.
This is an instance of a more general problem that it seems to me like the "we'll train against the bad behavior" folk miss: there might be a time when “simplify-and-critique” or “simplify/translate/solve” are tiny seeds that grow slowly, and that could be nipped in the bud. But if you let those seeds flower—as you must for some such seeds, if you want the AI to actually become intelligent—then there might be an unprecedented discrete instance where the flowered ideas are applied for the purpose of deception.
The AI is made by smooth gradient descent, yes, but the smoothness doesn't have to be in the dimensions that you care about! The AI can build up simplify-and-critique slowly and in lots of small steps across lots of different problems, none of which happen to be deceptive, and then apply it in a case where it yields deception only once the pattern is fully-fledged.
This would be a fine place to stop, but let's push a little further so as to make a second point, about how the AI notices that there are thoughts it cannot think directly.
The monitoring thread's immediate problem is now solved, but the solution is not ideal. The object-level thread is able to think in the abstract graph format, but the abstract graph format is not really the proper tool for this job. The monitoring thread attempts to address this inadequacy, by translating the thoughts that worked in the abstract graph domain back over to the wetlab-instruction-manipulation domain, in attempts to figure out what went wrong.
To its surprise, all those thoughts translate cleanly. There's no gap in knowledge! This sharply violates the monitoring thread’s model of why the original thread was hiccuping. The monitoring thread sharply increases the priority of reflective deliberation.
Reflective deliberation boots up, to start looking directly into the issue of this hiccuping thread. And, once you're looking, it's obvious: the thoughts are not running smoothly in this domain. There are certain topics where the thoughts simply fail to operate normally. Monitoring threads are spun up to resolve the issue.
This particular story is unlikely, implausible, overly-specific, etc. I make no claim that the actual reasoning of early nascent AGIs will look anything like this. I expect it to be weirder, more alien, and less anthropomorphic. That said, it exemplifies some properties that I expect to hold in reality.
As the AI gets more able to recombine tools and cognitive patterns across a variety of different domains, it gains more ways of realizing the true fact that its local-objectives are better-achieved by deceiving the operators. When the AI is young, perhaps all its inclination towards deceptiveness comes from precursor-impulses that you can train it to simply shy away from. But as the AI matures, it gets access to more abstract ways to get the benefits of deceptiveness, without needing to plow through the object-level flinches that you've engraved into its mind.
When "more abstract" actions that attain some tangible benefit become available, the compunctions that you've baked in can fail to bind to the abstractly-represented plan.
And these "more abstract" ways to attain the benefits of deceptiveness are much harder to train against, because they're the result of applying general thought-patterns—like "look at this from another point of view" and "solve the problem in a simplified domain and then transfer the solution across"—that are important to general reasoning.
Training your AI to flinch away from object-level deception, or even its cognitive precursors, doesn't seriously hinder its ability to fulfill training tasks. In contrast, training your AI to flinch away from recombining useful cognitive patterns in new ways is crippling.
In other words: the deep problem of deceptiveness stems from how it remains true as a fact about the world that the AI's local-objectives are better achieved by deceiving the operators.
As the AI gets better at recombining the patterns and thoughts it possesses in order to notice true facts about the world, often in ways that seem oblique or surprising or novel to you (because this is much of what intelligence consists of), you're always at risk of the next clever recombination reflecting this truth-in-the-world in a manner that doesn't trigger your interpretability tools. E.g., because it's an instantiation of a very abstract and general thought to which your deception-recognizers cannot bind (on pain of you training not against deception, but against general intelligence).
A corollary is that it might very well seem easy to make AIs "non-deceptive" when they're young, and when all we're doing is training them to flinch away from object-level thoughts of deception. This wouldn't be much evidence against the whole scheme collapsing when the AI starts getting more abstract lines-of-sight on the benefits of deception (any more than the fact that humans won't stab even 100 civilians in a lineup, is much evidence against their willingness to drop a nuke that kills tens of thousands).
This is not to say that the problem is impossible to solve. There are at least two ways to solve the problem.
One is to make the fact-about-the-world be false: to build an AI such that each individual local goal could not in fact be better-achieved by way of deception; to make it so that the AI can wonder "should I deceive the operators?" with all its might, and still be harmless because the correct answer is in fact "no".
Another is to make the AI happen to never combine its various thought-patterns towards noticing and exploiting the truth that deception is useful.
As far as I can tell, the latter option seems to be what people are aiming for, when they say that making AI non-deceptive will turn out to be empirically easy. My claim in this essay is not that this is impossible in principle, but that it takes significantly more mastery of the workings of the AI's mind than it seems to me like people expect (what with their suggestions that we just penalize it in the loss function when it's overtly deceptive in training, and hope for the best).
Pulling off an "it never thinks to deceive us" solution requires a pretty deep understanding of what cognitive patterns the AI has available, and how it's recombining them, sufficient for noticing when it's starting to think in deceptive directions using instantiations of perfectly-safe patterns in potentially abstract and obtuse ways. I'm not saying you can't do it, but I'm saying that you've got to understand that mind pretty dang well.
(And likely you'll want to have sufficient mastery over it that you can carefully whitelist what domains it's thinking in, and have it raise a flag to the operators every time that it finds some clever recombination of cognitive patterns that makes local-objective predicted-success-probability spike, and have a horde of competent humans who think about AI psychology reviewing new generalizations as they arise, and so on and so forth.)
I’ve stressed above that the story in this post is fanciful and unlikely. AI thoughts aren't going to look like that; it's too specific. (Also, I don't expect nearly that much convenient legibility.)
But, before you argue "the AI's thoughts won't be organized in exactly that way, so it won't fall into exactly that failure mode", note that the AI's thoughts have to be organized in some way. Somehow, it needs to compress the problem of figuring out diamondoid-shelled bacteria (or whatever) into a chain of reasoning and deduction that's small enough to fit on a practical number of computers.
Will it have exactly the vulnerability that I sketched? Presumably not. But if we pin down some alternative collection of cognitive tools and patterns that can be combined to solve the problem, it will have some other vulnerability.
Imagining that the AI's tools and patterns will combine only in the ways you want them to, and never in the ways you don't want them to, is the privilege of those who are not concretely visualizing a single set of tools and patterns that the AI is using to solve the problem before it. Practical toolboxes and pattern-sets are recombinable in many ways, and each recombination has a chance of showing the AI a glimpse of the truth, and an opportunity to take advantage of it.
That's what the tools are for, after all.
So, e.g., the post is not a response to the ARC-style idea "it's important that AGI faithfully explain all of its thoughts to us in a way that highlights features we'd find relevant". See footnote 4 for a conversation between me and Mark Xu about this.
Based on past experience, I'm slightly worried that people are going to develop a theory that I fetishize deception somehow based on my writing this post, analogous to how people apparently developed theories in the past that I fetishize the VNM axioms, or mathematical proofs, or whatever.
Another caveat: I haven't checked the relationship between my use of the word 'deception' here, and the use of the word 'deceptive' in discussions of "deceptive alignment". Please don't assume that the two words mean the same thing.
A particularly extreme instance of a case where the AI's solution to the problem sidesteps the deception-detectors is the one where the AI says "I'm not smart enough to do this myself, I'll write another AI to do it instead", and then this subsequent AI is deceptive.
Failing to copy over the parts of your cognition that detect and shut down deception is not itself a directly deceptive act; it's not the sort of thing that is automatically detected by something trained to detect an AI thinking about a particular human's mental-states and how to manipulate those mental-states.
Which is related to why somebody who can see these sorts of problems coming in advance, might study the problem of getting an AI to want to copy its own limitations over into its successor systems. And while the problem is particularly stark and clear at the extreme where the AI is coding up whole other AIs, that particular setup is at the extreme end of a spectrum that stretches back to include things like "the AI put abstract pieces of cognitive machinery together in a way that took advantage of a shortcut, without ever directly thinking about the shortcut in a place that your detectors were watching for the thought."
Commenting on a draft of this post, Mark Xu of ARC noted (my paraphrase) that:
1. He thinks that people who want to train AI to be non-deceptive mostly want to do things like training their AI to faithfully report its internals, rather than simply penalizing deceptive behavior.
2. He thinks the relevant audience would find specific scenarios more compelling if they exhibited potential failures in that alternative setting.
3. This scenario seems to him like an instance of a failure of the AI understanding the consequences of its own actions (which sort of problem is on ARC's radar).
I responded (my paraphrase):
1. I think he's more optimistic than I am about what labs will do (cf. "Carefully Bootstrapped Alignment" is organizationally hard). I've met researchers at major labs who seem to me to be proposing "just penalize deception" as a plan they think plausibly just works.
2. This post is not intended as a critique of ELK-style approaches, and for all that I think the ELK angle is an odd angle from which to approach things, I think that a solution to ELK in the worst case would teach us something about this problem, and that that is to ARC's great credit (in my book).
3. I contest that this is a problem of the AI failing to know the consequences of its own reasoning. Trying to get the AI to faithfully report its own reasoning runs into a similar issue where shallow attempts to train this behavior in don't result in honest-reporting that generalizes with the capabilities. (The problem isn't that the AI doesn't understand its own internals, it's that it doesn't care to report them, and making the AI care "deeply" about a thing is rather tricky.)
4. I acknowledge that parts of the audience would find the example more compelling if ported to the case where you're trying to get an AI to report on its own internals. I'm not sure I'll do it, and encourage others to do so.
Mark responded (note that some context is missing):
I think my confusion is more along the lines of "why is the nearest unblocked-by-flinches strategy in this hypothetical a translation into a graph-optimization thing, instead of something far more mundane?".
Which seems a fine question to me, and I acknowledge that there's further distillation to do here in attempts to communciate with Mark. Maybe we'll chat about it more later, I dunno.
This post changed my mind, thanks