I have a proposal for making an AGI killswitch.
Assuming god like computational budgets, algorithmic improvements, and hardware improvements, can you use fully homomorphic encryption (FHE) to train and run an AGI? FHE allows you to run computation on encrypted data without decrypting it. Wouldn't such an AGI find the world illegible with out its input being specifically encrypted for it with a key?
Can the key then be split into shards so that m of n shards are needed to encrypt queries to the FHE AGI? Can you also create a m of n function so that if m of n shards are destroyed the key is useless?
Can you distribute the keys in such a way so if the AGI destabilizes or hurts society m of n keys are destroyed as a byproduct and the AGI is functionally turned off?
I think this would create a kill switch that is hard to subvert by the AGI and easily triggered by society.
At the limit you could create a shard for every person on the planet with only one shard deletion being required to delete the key. Then if anyone doesn't like the way things are going with the AGI they could delete their key killing it.
You might argue that this is a brute force method for alignment. Everyone is happy with the AGI's behavior or the AGI is rendered nonfunctional. This method avoids defining what alignment is. This is important because defining defining what an aligned state is is intractable. There are too many divergent and contradictory opinions about what alignment means to ever reach universal consensus. What alignment means will necessarily change from person to person and moment to moment. It is too subjective and value laden a goal to be objectively defined and put in to an algorithm to control an AI.
I would love criticism of this idea to know why it is impractical, not-even-wrong, or flawed.
One problem is that the FHE AGI could secretly steal enough key shards to generate it's own functional copy of the key. This seems hard to do but not impossible.
Another problem is that the AGI could get smart enough to break the FHE.
FHE of the AGI might make it harder to interpret its internal workings.