Will Petillo

If your goal is to prevent an agent from being incentivized to pursue narrow objectives in an unbounded fashion (e.g. "paperclip maximizer"), you can do this within the existing paradigm of reward functions by ensuring that the set of rewards simultaneously includes:

1) Contradictory goals, and
2) Diminishing returns

Neither of these is sufficient on its own. With contradictory goals alone, the agent can maximize reward by calculating which of its competing goals is most valuable and disregarding the rest. With diminishing returns alone, the agent can always get a little more reward by pursuing the goal further. But when both are in place, diminishing returns provide automatic, self-adjusting calibration that brings the contradictory goals to a point of equilibrium. The end result looks like satisficing, but dodges the philosophical questions, discussed in the other comments, as to whether "satisficing" is a stable (or even meaningful) concept.
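
A toy sketch of this interaction, under assumptions that are purely illustrative: a fixed budget split between two competing goals, hypothetical weights of 2.0 and 1.0, and a square root standing in for "diminishing returns". None of these names or numbers come from a real training setup.

```python
import math

def total_reward(x, w_a, w_b, shape):
    """Reward when fraction x of a fixed budget goes to goal A and 1 - x to goal B.

    The fixed budget is what makes the goals contradictory: more of A means less of B.
    """
    return w_a * shape(x) + w_b * shape(1.0 - x)

def best_split(w_a, w_b, shape, steps=1000):
    """Grid-search the split x in [0, 1] that maximizes total reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda x: total_reward(x, w_a, w_b, shape))

def linear(v):
    """No diminishing returns: each extra unit is worth the same."""
    return v

# Contradictory goals alone: the agent picks the more valuable goal
# and ignores the other entirely (corner solution).
corner = best_split(2.0, 1.0, linear)

# Contradictory goals plus diminishing returns (sqrt is concave):
# the optimum sits strictly between the extremes.
interior = best_split(2.0, 1.0, math.sqrt)

# Diminishing returns alone (one goal, no trade-off): more effort
# always yields a little more reward, so there is no stopping point.
always_more = all(math.sqrt(e + 1) > math.sqrt(e) for e in range(1000))

print(corner, interior, always_more)
```

In this toy model the concave case balances where the marginal rewards of the two goals are equal, and changing the weights moves that balance point smoothly rather than flipping it from one extreme to the other.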

Obviously there are deep challenges with the above, namely:

(1) Both properties must be present across all dimensions of the agent's utility function. Further, there must not be any hidden "win-win" solutions that bring competing goals into alignment so as to eliminate the need for equilibrium.

(2) The point of equilibrium must be human-compatible.

(3) Conditions (1) and (2) must continue to hold as the agent moves further from its training environment, and also if the agent itself changes, for example through self-improvement.

(4) Calibrating equilibria requires the ability to reliably instill goals into an AI in the first place, an ability that is currently lacking, since ML only provides the indirect lever of reinforcement.

But most of these reflect general challenges within any approach to alignment.

Effective Altruism Forum