Why is [robustly pointing at goals] so difficult? Is there an argument that it is impossible?
Well, I hope it's not impossible! If it is, we're in a pretty bad spot. But it's definitely true that we don't know how to do it, despite lots of hard work over the last 30+ years. To really get why this should be, you have to understand how AI training works in a somewhat low-level way.
Suppose we want an image classifier -- something that'll tell us whether a picture has a sheep in it, let's say. Schematically, here's how we build one:
- Start with a list of a few million random numbers. These are our parameters.
- Find a bunch of images, some with sheep in them and some without (we know which is which because humans labeled them manually). This is our training data.
- Pick some images from the training data, multiply them with the parameters in various ways, and interpret the result as a confidence, between 0 and 1, of whether each image has a sheep.
- Probably it did terribly! Random numbers don't know anything about sheep.
- So, we make some small random-ish changes to the parameters and see if that helps.
- For example, we might say "we changed this parameter from 0.5 to 0.6 and overall accuracy went from 51.2% to 51.22%, next time we'll go to 0.7 and see if that keeps helping or if it's too high or what."
- Repeat the last few steps (score some images, then tweak the parameters), a lot.
- Eventually you get to where the parameters do a good job predicting whether the images have a sheep or not, so you stop.
I'm leaving out some mathematical details, but nothing that changes the overall picture.
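The steps above can be sketched in a few lines of Python. This is only a toy illustration, not how any real system is built: the "images" are hypothetical four-number feature lists, the helper names (`make_toy_data`, `predict`, `accuracy`, `train`) are made up for this example, and the update rule is a literal "try a random tweak, keep it if it doesn't hurt" loop. Real training uses gradients to choose the tweaks, which is one of the mathematical details being left out.

```python
import math
import random


def make_toy_data(n_images, seed=0):
    """Hypothetical stand-in for labeled images: 4 numbers per 'image'.
    By construction (an assumption of this toy), label = first number > 0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_images):
        image = [rng.uniform(-1, 1) for _ in range(4)]
        data.append((image, image[0] > 0))
    return data


def predict(params, image):
    """Multiply the image with the parameters and squash to a confidence in [0, 1]."""
    score = sum(p * x for p, x in zip(params, image))
    score = max(-60.0, min(60.0, score))  # avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-score))


def accuracy(params, data):
    """Fraction of images where 'confidence > 0.5' matches the human label."""
    correct = sum((predict(params, img) > 0.5) == label for img, label in data)
    return correct / len(data)


def train(data, n_params, steps=300, step_size=0.1, seed=0):
    rng = random.Random(seed)
    # Start with random numbers. These are the parameters.
    params = [rng.uniform(-1, 1) for _ in range(n_params)]
    best = accuracy(params, data)
    for _ in range(steps):
        # Make a small random-ish change to one parameter...
        candidate = list(params)
        candidate[rng.randrange(n_params)] += rng.gauss(0, step_size)
        # ...and keep it only if it doesn't make accuracy worse.
        score = accuracy(candidate, data)
        if score >= best:
            params, best = candidate, score
    return params, best


if __name__ == "__main__":
    data = make_toy_data(200)
    params, acc = train(data, n_params=4)
    print("final parameters:", params)
    print("training accuracy:", acc)
```

Note that nothing in `train` ever looks at what a parameter means. It only checks whether the score went up.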
All modern AI training works basically this way: start with random numbers, use them to do some task, evaluate their performance, and then tweak the numbers in a way that seems to point toward better performance.
Crucially, we never know why a change to a particular parameter is good, just that it is. Similarly, we never know what the AI is "really" trying to do, just that whatever it's doing helps it do the task -- to classify the images in our training set, for example. But that doesn't mean that it's doing what we want. For example, maybe all the pictures of sheep are in big grassy fields, while the non-sheep pictures tend to have more trees, and so what we actually trained was an "are there a lot of trees?" classifier. This kind of thing happens all the time in machine learning applications. When people talk about "generalizing out of distribution", this is what they mean: the AI was trained on some data, but will it still perform the way we'd want on other, different data? Often the answer is no.
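To make the sheep-and-trees failure concrete, here is a minimal sketch. Everything in it is hypothetical: each "image" is reduced to two hand-written features, and `tree_based_classifier` is a hand-coded stand-in for the rule a trained model might have ended up with, not something actually learned.

```python
# Each toy "image" is summarized by two made-up features.
# The label we care about is has_sheep.
training_images = [
    {"has_sheep": True, "many_trees": False},   # sheep in a grassy field
    {"has_sheep": True, "many_trees": False},
    {"has_sheep": False, "many_trees": True},   # forest, no sheep
    {"has_sheep": False, "many_trees": True},
]

# Different data: the sheep/trees correlation no longer holds.
new_images = [
    {"has_sheep": True, "many_trees": True},    # sheep in a forest
    {"has_sheep": False, "many_trees": False},  # empty grassy field
]


def tree_based_classifier(image):
    """Stand-in for the learned rule: 'not many trees' means 'sheep'."""
    return not image["many_trees"]


def accuracy(classifier, images):
    correct = sum(classifier(img) == img["has_sheep"] for img in images)
    return correct / len(images)


train_acc = accuracy(tree_based_classifier, training_images)
new_acc = accuracy(tree_based_classifier, new_images)
print("accuracy on training data:", train_acc)
print("accuracy on new data:", new_acc)
```

On the training images the tree rule and the sheep rule are indistinguishable, so a score computed on that data alone cannot tell us which one we got.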
So that's the first big difficulty with setting terminal goals: we can't define the AI's goals directly, we just show it a bunch of examples of the thing we want and hope it learns what they all have in common. Even after we're done, we have no way to find out what patterns it really found except by experiment, which with superhuman AIs is very dangerous. There are other difficulties but this post is already rather long.
This comment will focus on the specific approaches you set out, rather than the high level question, although I'm also interested in seeing comments from others on how difficult it is to solve alignment, and why.
The approach you've set out resembles Coherent Extrapolated Volition (CEV), which was described earlier by Eliezer Yudkowsky. I'm not sure what the consensus is on CEV, but here are a few thoughts I remember from when I looked into it (several years ago now).
In other words, I don't think this does clearly guarantee us against power-seeking behaviour.
I don't think you need to commit yourself to including everyone. If it is true for any subset of people, then the point you gesture at in your post goes through. I have had similar thoughts to those you suggest in the post. If we gave the AI the goal of 'do what Barack Obama would do if properly informed and at his most lucid', I don't really get why we would have high confidence in a treacherous turn or of the AI misbehaving in a catastrophic way. The main response to this seems to be to point to examples of AI not doing what we intend from limited ...