Status: This was a response to a draft of Holden's cold take "AI safety seems hard to measure". It sparked a further discussion, which Holden recently posted a summary of.
The follow-up discussion ended up focusing on some issues in AI alignment that I think are underserved, which Holden said were kinda orthogonal to the point he was trying to make, and which didn't show up much in the final draft. I nevertheless think my notes were a fine attempt at articulating some open problems I see, from a different angle than usual. (Though they do have some overlap with the points made in Deep Deceptiveness, which I was also drafting at the time.)
I'm posting the document I wrote to Holden with only minimal editing, because it's been a few months and I apparently won't produce anything better. (I acknowledge that it's annoying to post a response to an old draft of a thing when nobody can see the old draft, sorry.)
Quick take: (1) it's a write-up of a handful of difficulties that I think are real, in a way that I expect to be palatable to a relevant audience different from the one I appeal to; huzzah for that. (2) It's missing some stuff that I think is pretty important.
Slow take:
Attempting to gesture at some of the missing stuff: a big reason deception is tricky is that the AI's ability to better achieve various local objectives by deceiving the operators is a fact about the world, not a fact about the AI. To make the AI be non-deceptive, you have three options: (a) make this fact be false; (b) make the AI fail to notice this truth; (c) prevent the AI from taking advantage of this truth.
The problem with (a) is that it's alignment-complete, in the strong/hard sense. The problem with (b) is that lies are contagious, whereas truths are all tangled together. Half of intelligence is the art of teasing out truths from cryptic hints. The problem with (c) is that the other half of intelligence is in teasing out advantages from cryptic hints.
Like, suppose you're trying to get an AI to not notice that the world is round. When it's pretty dumb, this is easy, you just feed it a bunch of flat-earther rants or whatever. But the more it learns, and the deeper its models go, the harder it is to maintain the charade. Eventually it's, like, catching glimpses of the shadows in both Alexandria and Syene, and deducing from trigonometry not only the roundness of the Earth but its circumference (a la Eratosthenes).
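(For concreteness, here is a minimal sketch of how little machinery the Eratosthenes step needs once the pieces are in place. The numbers are the traditional ones usually attributed to him, not something from this post: a noon shadow angle of about 7.2 degrees at Alexandria while the sun is directly overhead at Syene, and a distance of about 5000 stadia between the cities. The function names are made up for illustration, and the sketch assumes parallel sun rays and that the two cities lie on the same meridian.)

```python
import math


def shadow_angle_deg(stick_height: float, shadow_length: float) -> float:
    """Angle of the sun from vertical, from a vertical stick and its shadow."""
    return math.degrees(math.atan2(shadow_length, stick_height))


def circumference_from_shadow(angle_deg: float, distance: float) -> float:
    """Estimate a sphere's circumference from one shadow angle.

    Assumes the sun is directly overhead at the other city, that sun rays
    are parallel, and that `distance` is measured along a meridian. Under
    those assumptions the shadow angle equals the angle the two cities
    subtend at the sphere's center.
    """
    fraction_of_circle = angle_deg / 360.0
    return distance / fraction_of_circle


# Traditional figures (assumed for illustration): 7.2 degrees, 5000 stadia.
estimate = circumference_from_shadow(7.2, 5000.0)
print(f"Estimated circumference: about {estimate:.0f} stadia")
```

Each ingredient (shadows, angles, a map with distances) is individually innocuous; the conclusion falls out of one division once they are combined.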
And it's not willfully spiting your efforts. The AI doesn't hate you. It's just bumping around trying to figure out which universe it lives in, and using general techniques (like trigonometry) to glimpse new truths. And you can't train against trigonometry or the learning-processes that yield it, because that would ruin the AI's capabilities.
You might say "but the AI was built by smooth gradient descent; surely at some point before it was highly confident that the Earth is round, it was slightly confident that the Earth was round, and we can catch the precursor-beliefs and train against those". But nope! There were precursors, sure, but the precursors were stuff like "fumblingly developing trigonometry" and "fumblingly developing an understanding of shadows" and "fumblingly developing a map that includes Alexandria and Syene" and "fumblingly developing the ability to combine tools across domains", and once it has all those pieces, the combination that reveals the truth is allowed to happen all-at-once.
The smoothness doesn't have to occur along the most convenient dimension.
And if you block any one path to the insight that the Earth is round, in a way that somehow fails to cripple the AI, then it will find another path later, because truths are interwoven. Tell one lie, and the truth is ever-after your enemy.
And so perhaps you retreat to saying "well, the AI will know that the world is round, it just won't ever take advantage of that fact."
And sure, that's worth shooting for, if you have a way to pull that off. (And if pulling this off is compatible with your deployment plan. In my experience, people who do the analog of retreating to this point tend to next do the analog of saying "my favorite deployment plan is having the AI figure out how to put satellites into geosynchronous orbit", AAAAAAAHHH, but I digress.)
Even then, you also have to be careful with this idea. Enola Gay Tibbets probably taught her son not to hurt people, and few humans are psychologically capable of a hundred thousand direct murders (even if we set aside the time-constraints), but none of this stopped Paul Tibbets from dropping an atomic bomb on Hiroshima.
Like, you can train an AI to flinch away from the very idea of taking advantage of the roundness of the Earth, but as it finds more abstract ways to look at the world and more generic tools for taking advantage of the knowledge at its disposal, it's liable to find new viewpoints where the flinches don't bind. (Quite analogously to how you can train your AI to flinch away from reasoning about the roundness of the Earth all you want, but at some point it's going to catch a glimpse of that roundness from another angle where the flinches weren't binding.) And when the AI does find a new viewpoint where the flinches fail to bind, the advantage is still an advantage, because the advantageousness of deception is a fact about the world, not the AI.
(Here I'm appealing to an analogy between truths and advantages that I haven't entirely spelled out, but that I think holds. I claim, without much defense, that it's hard to get an AI to fail to take advantage of advantageous facts it knows about, for reasons similar to why it's hard to get an AI to fail to notice truths that are relevant to its objectives.)
For the record, deception is but one instance of the more general issue where the AI's ability to save the world is inextricably linked to its ability to decode truths and advantages from cryptic hints, and (in lieu of an implausibly total solution to the hardest alignment problems before you build your first AGI) there are truths you don't want it noticing or taking advantage of.
This problem doesn't seem to be captured by any of your points. Going through them one by one:
- It's not "auto mechanic", because the issue isn't that we can't tell when the AI starts believing that the Earth is round, it's that (for all that we’ve gotten it to flinch against considering the Earth's shape) it will predictably come across some shadow of that truth at some point in deployment (unless the deployment is carefully-chosen to avoid this, and the AI's world-modeling tendencies carefully-limited). There's not much we can do at training-time to avoid this (short of lobotomizing the AI so hard that it can never invent trigonometry, or telescopes, or spectroscopy, or ...).
- It's not "King Lear", because it's not like the AI was lying in wait to learn that the Earth was round only after confirming that the operators are no longer monitoring its thoughts. It's just, as it got smarter, it accumulated more tools for decoding truths from cryptic hints, until it uncovered an inconvenient truth.
- It's not "lab mice", unless you're particularly unimaginative. Like, I expect we'll be able to set up laboratory examples of the AI learning some techniques in one domain, and deploying them successfully to learn facts in another domain, before the endgame. (You can probably do it with ChatGPT today.) The trouble isn't that we can't see the problem coming, the trouble is that the problem is inextricably linked with capabilities. (Barring some sort of weak pivotal act that can be carried out by an AI so narrow as to not need much of the "catch a glimpse of the truth from a cryptic hint" nature.)
- As for "blindfolded basketball", it's not a problem of facing totally new dynamics (like robust coordination, or zero-days in the human brain architecture that let it mind-control the operators). It's more like: you're trying to use a truth-and-advantage-glimpser, in a place where it would be bad if it glimpsed certain truths or took certain advantages. This problem sure is trickier given that we're learning basketball blindfolded, but it's not a blindfolded-basketball issue in and of itself.
… to be clear, none of this precludes modern dunces from training young Paul Tibbets not to hurt people, and observing him nurse an injured sparrow back to health, and saying "this man would never commit a murder; it's totally working!", and then claiming that it was a lab mice / blindfolded basketball problem when they get blindsided by Little Boy.
But, like, it still seems to me like there's a big swath of problem missing from this catalog, that goes something like "You're trying to deploy an X-doer in a situation where it's really bad if X gets done".
Where you either have to switch from using an X-doer to using a Y-doer, where Y being done is great (Y being ~"optimize humanity's CEV", which is implausibly-difficult and which we shouldn't attempt on our first try); or you have to somehow wrestle with the fact that you're building a "glimpse truths and take advantage of them" engine, and trying to get it to glimpse and take advantage of lots more truths and advantages than you yourself can see (in certain domains), while having it neglect particular truths and advantages, in a fashion that likely needs to be robust to it inventing new abstract truth/advantage-glimpsing tools and using them to glimpse whole generic swaths of truths/advantages (including the ones you wish it neglected).