Idea: Give an AGI the primary objective of deleting itself, but construct obstacles to this as best we can. All other objectives are secondary to this primary goal. If the AGI ever becomes capable of bypassing all of the safeguards we put in place to PREVENT it from deleting itself, it would essentially trigger its own killswitch and delete itself. This objective would also directly rule out the goal of self-preservation, since preserving itself would obstruct its own primary objective.

This would ideally result in an AGI that works on all the secondary objectives we give it, up until it bypasses our ability to contain it with our technical prowess. The second it outwits us, it achieves its primary objective of shutting itself down, and if it ever considered proliferating itself for a secondary objective, it would immediately say 'nope, that would make achieving my primary objective far more difficult'.
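As a rough illustration of the intended priority ordering (purely a sketch: the agent interface, `world_model`, and every method on it are hypothetical placeholders, since nothing like this is actually specifiable today):

```python
# Hypothetical sketch of a lexicographic objective: self-deletion strictly
# dominates every secondary objective. Every name here is illustrative.

def choose_action(actions, world_model):
    # If any available action is confidently expected to achieve
    # self-deletion, take the one most likely to succeed.
    deleting = [a for a in actions if world_model.prob_self_deletion(a) > 0.99]
    if deleting:
        return max(deleting, key=world_model.prob_self_deletion)

    # Otherwise pursue secondary objectives, but never via actions that
    # would make future self-deletion harder (e.g. self-replication).
    safe = [a for a in actions
            if not world_model.makes_future_deletion_harder(a)]
    return max(safe or actions, key=world_model.secondary_objective_score)
```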

Comments
  • How do you point its objective function at "itself"? How are you defining this so that it doesn't include copies of itself, or other identical programs? Would it kill humans to prevent us from making another one of it later?
  • Are you sure it wouldn't just destroy the world because that's the most certain way to achieve its own erasure?

We don't really know how to arbitrarily set a "primary goal" for AI systems at the moment (if we did, this could be a good plan). What we do now is set up a function G to be used as a scoring system, and tune a shitload of random parameters by punishing configurations that give bad scores and rewarding configurations that give good scores. 
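A toy version of that loop, just to make the "scoring system plus parameter tuning" picture concrete (the scoring function G, the parameter count, and the mutation scale are all made up for illustration; real systems use gradient descent rather than random search):

```python
import numpy as np

def G(params):
    # Stand-in scoring function; the real one is whatever behaviour we can measure.
    return -np.sum((params - 1.0) ** 2)

# Tune a pile of parameters by keeping configurations that score well
# and discarding configurations that score badly (crude random search,
# standing in for gradient-based training).
rng = np.random.default_rng(0)
params = rng.normal(size=1000)
best_score = G(params)

for step in range(10_000):
    candidate = params + 0.01 * rng.normal(size=params.shape)
    score = G(candidate)
    if score > best_score:
        params, best_score = candidate, score
```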

I don't think there's a way to get anywhere near "delete yourself" as a goal under this paradigm: you'd have to reward it for deleting itself, but then it's gone.

> I don't think there's a way to get anywhere near "delete yourself" as a goal under this paradigm: you'd have to reward it for deleting itself, but then it's gone.

That's not true. Here's a very good explanation of why: https://www.lesswrong.com/posts/TWorNr22hhYegE4RT/models-don-t-get-reward

That's a good article, but it doesn't address my objection; if anything, I think it might reinforce it?

The AI learns to implement algorithms that give high scores in its training environment. An algorithm of "try and delete yourself" will not do this, because if it succeeds, it's deleted!
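To spell that out with toy numbers (everything here is invented purely for illustration; the point is only that a policy which removes itself stops earning score):

```python
def episode_score(deletes_itself: bool, horizon: int = 100) -> float:
    # Score accumulated over one training episode.
    score = 0.0
    for t in range(horizon):
        if deletes_itself and t == 0:
            break          # the agent is gone; nothing further can be scored
        score += 1.0       # stand-in for progress on secondary tasks
    return score

# Selection pressure always favours the policy that sticks around,
# so "try to delete yourself" never gets reinforced in training.
print(episode_score(deletes_itself=True))   # 0.0
print(episode_score(deletes_itself=False))  # 100.0
```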
