
I'm trying to think through various approaches to AI alignment, and so far this is the one I've come up with that I like best. I have not read much of the literature, so please do point me to prior work if this has been discussed before.

What if we train an AI agent (i.e., via reinforcement learning) to survive and thrive in an environment populated by a wide variety of agents with wildly different levels of intelligence? In particular, set things up so that virtually every agent can safely assume it will eventually meet an agent much smarter than itself. Structure the environment to reward tit-for-tat with a significant bias towards cooperation, e.g., require agents to "eat" resources that can only be secured through cooperation and are primarily non-competitive.

The idea is to have agents learn to respect even beings of lesser intelligence: because they want beings of higher intelligence to respect them, and because in this environment a group of lesser intelligences can gang up on and defeat a single higher-intelligence agent. We would also effectively train each AI to detect and defeat new AIs that seek to disturb this balance. I haven't thought this through fully; curious what you all think.
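To make the "tit-for-tat with a cooperation bias" reward structure concrete, here's a minimal sketch of an iterated two-player game. All the payoff values and strategy names here are illustrative assumptions, not a worked-out design: the key property is that, unlike the classic prisoner's dilemma, mutual cooperation pays more per agent than unilateral defection, so tit-for-tat players outscore defectors over repeated rounds.

```python
# Hypothetical cooperation-biased payoff matrix (values are assumptions).
# Mutual cooperation (4, 4) models a resource that requires cooperation
# to secure; a lone defector gains less (3) than a mutual cooperator (4).
PAYOFFS = {
    ("C", "C"): (4, 4),
    ("C", "D"): (0, 3),
    ("D", "C"): (3, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(opponent_history):
    """Cooperate first, then mirror the opponent's previous move."""
    return "C" if not opponent_history else opponent_history[-1]


def always_defect(opponent_history):
    """Baseline exploiter strategy."""
    return "D"


def play_match(strategy_a, strategy_b, rounds=10):
    """Run an iterated game; each strategy sees only the opponent's history."""
    score_a = score_b = 0
    moves_a, moves_b = [], []
    for _ in range(rounds):
        a = strategy_a(moves_b)
        b = strategy_b(moves_a)
        pay_a, pay_b = PAYOFFS[(a, b)]
        score_a += pay_a
        score_b += pay_b
        moves_a.append(a)
        moves_b.append(b)
    return score_a, score_b


# Under this matrix, two cooperators each score 40 over 10 rounds,
# while a defector facing tit-for-tat scores only 12.
print(play_match(tit_for_tat, tit_for_tat))
print(play_match(tit_for_tat, always_defect))
```

An RL version of the proposal would replace the fixed strategies with learned policies and use these payoffs as the reward signal, so that cooperation-biased equilibria are what training pressure selects for.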





Half-baked thoughts in response, because I don't think I really know what I'm talking about (yet), but here goes.

I imagine this is a little similar in execution to a simbox…


It seems that the likely contenders for AGI (the AI labs, if not their models) are already 'out of the box', given all the API access that is now available.

Regarding the adversarial(?) training you were describing: it's an interesting idea, though I have no clue what to think :) but thanks for sharing!

Thanks, that link definitely touches on many of the same points! Where my proposal is more concrete is in having the models learn morals via evolutionary pressures / RL rewards that are designed, using game theory, to push towards cooperation and tit-for-tat.
