
Warning: I am very unsure about all the things I say in this comment.

Summary: I suspect that to begin solving the AI alignment problem, we need AIs that have affective experiences similar to the ones humans have. That way, they would have values similar to humans'. That may be achievable with neural simulation.

It seems like AGIs will be agents that try to maximise their reward function (in the reinforcement learning sense) by doing things and learning from what they're doing. AI safety and alignment research is concerned with making sure that AIs' values are similar enough to humans' values and that AIs don't do something catastrophically dangerous to humans. The questions would be "how similar should these values or goals be?", "what do humans value universally?" and "how would we encode that in AIs?".

But humans conflict with each other a lot (to the point of numerous wars). And they self-report somewhat different values (for example, conservatives and liberals value things quite differently from each other, judging from Jonathan Haidt's research). So it seems like humans are somewhat misaligned with each other's goals and values. How, then, would we align AIs with humans' goals and values when we seem to differ so much on them? There is also the problem of making our goals intelligible and measurable enough to act as reward functions for AIs.

So it seems like the alignment problem might be pretty hard to solve. But then I thought that all reinforcement learning agents seem to have the terminal goal of optimising their own reward functions, whatever those may be. And then I asked, "what is the reward function of humans, if they have one?".

I think that humans' "reward function" is affect. Affective experiences reward and punish us, and they can be of various types: sensory (pleasure from seeing something beautiful, disgust from smelling spoiled food or seeing something ugly), homeostatic (thirst, hunger, sleepiness, physical pain, food satiation), and emotional (the 7 basic affective systems described by Jaak Panksepp: SEEKING, RAGE, FEAR, LUST, CARE, GRIEF, PLAY). These experiences are "produced" by subcortical regions and circuits (like the ventral tegmental area or periaqueductal gray) in mammalian brains, including humans'. They seem to be responsible for humans' universal basic "values".

So, if AIs also had these human values, specifically these kinds of affects, would that help with solving the AI alignment, corrigibility (changing AIs' goals) and interpretability (reading AIs' minds) problems? If we could model neurons that are similar in function to human neurons, then put them together into networks that are similar to the brain circuits responsible for these affects, would that give the AIs these human values? Could that be achieved with brain simulation or something similar? I don't know.

So, do you think that this kind of approach is promising?

Or do you think that this would lead to all sorts of problems: sentient AIs getting mad at us and waging war on us; AI companies ignoring this approach because these AIs wouldn't be as capable, even if they are safer; AIs acting in dangerous ways because that's what humans and other mammals tend to do; us simply not being able to develop this; or black hat hackers using these AIs for hacking purposes?



