(Epistemic status: attempting to clear up a misunderstanding about points I have attempted to make in the past. This post is not intended as an argument for those points.)
I have long said that the lion's share of the AI alignment problem seems to me to be about pointing powerful cognition at anything at all, rather than figuring out what to point it at.
It’s recently come to my attention that some people have misunderstood this point, so I’ll attempt to clarify here.
In saying the above, I do not mean the following:
(1) Any practical AI that you're dealing with will necessarily be cleanly internally organized around pursuing a single objective. Managing to put your own objective into this "goal slot" (as opposed to having the goal slot set by random happenstance) is a central difficult challenge. [Reminder: I am not asserting this]
Instead, I mean something more like the following:
(2) By default, the first minds humanity makes will be a terrible spaghetti-code mess, with no clearly-factored-out "goal" that the surrounding cognition pursues in a unified way. The mind will be more like a pile of complex, messily interconnected kludges, whose ultimate behavior is sensitive to the particulars of how it reflects and irons out the tensions within itself over time.
Making the AI even have something vaguely nearing a 'goal slot' that is stable under various operating pressures (such as reflection) during the course of operation, is an undertaking that requires mastery of cognition in its own right—mastery of a sort that we’re exceedingly unlikely to achieve if we just try to figure out how to build a mind, without filtering for approaches that are more legible and aimable.
Separately and independently, I believe that by the time an AI has fully completed the transition to hard superintelligence, it will have ironed out a bunch of the wrinkles and will be oriented around a particular goal (at least behaviorally, cf. efficiency—though I would also guess that the mental architecture ultimately ends up cleanly-factored (albeit not in a way that creates a single point of failure, goalwise)).
(But this doesn’t help solve the problem, because by the time the strongly superintelligent AI has ironed itself out into something with a "goal slot", it's not letting you touch it.)
Furthermore, insofar as the AI is capable of finding actions that force the future into some narrow band, I expect that it will tend to be reasonable to talk about the AI as if it is (more-or-less, most of the time) "pursuing some objective", even in the stage where it's in fact a giant kludgey mess that's sorting itself out over time in ways that are unpredictable to you.
I can see how my attempts to express these other beliefs could confuse people into thinking that I meant something more like (1) above (“Any practical AI that you're dealing with will necessarily be cleanly internally organized around pursuing a single objective…”), when in fact I mean something more like (2) (“By default, the first minds humanity makes will be a terrible spaghetti-code mess…”).
In case it helps those who were previously confused: the "diamond maximizer" problem is one example of an attempt to direct researchers’ attention to the challenge of cleanly factoring cognition around something a bit like a 'goal slot'.
As evidence of a misunderstanding here: people sometimes hear me describe the diamond maximizer problem, and respond to me by proposing training regimes that (for all they know) might make the AI care a little about diamonds in some contexts.
This misunderstanding of what the diamond maximizer problem was originally meant to be pointing at seems plausibly related to the misunderstanding that this post intends to clear up. Perhaps in light of the above it's easier to understand why I see such attempts as shedding little light on the question of how to get cognition that cleanly pursues a particular objective, as opposed to a pile of kludges that careens around at the whims of reflection and happenstance.
I’d be curious to understand why you believe this happens. Humans (the only general intelligence we have so far) seems to preserve some uncertainty over goal distributions. So it is unclear to me that generality will necessarily clarify goals.
To be a bit more concrete: I find it plausible that the AGI will encounter possible fine grained (concrete) goals that map into the same high level representation of its goal, whatever it may be. Then you have to refine what the goal representation was meant to mean. After all, a representation of the goal is not the goal itself necessarily. I believe this is what humans face, and why human goals are often a small mess.
You understood me correctly. To be specific I was considering the third case in which the agent has uncertainty about is preferred state of the world. It may thus refrain from taking irreversible actions that may have a small upside in one scenario (protonium water) but large negative value in the other (deuterium) due to eg decreasing returns, or if it thinks there’s a chance to get more information on what the objectives are supposed to mean.
I understand your point that this distinction may look arbitrary, but goals are not necessarily defined at the phy... (read more)