Last week we discussed reinforcement learning from human feedback (RLHF), a promising approach to aligning AI systems with human goals and preferences. However, for advanced AI systems, a human judge may need to evaluate an AI's behaviour without having a full picture of what's going on, simply because the task is too complex. One proposed approach is to have AIs debate each other to convince the human judge of the best behaviour: the idea is that it's easier to convince a careful judge of the truth than of a lie, so honesty tends to win.
Link to discussion guide: https://docs.google.com/.../1AzX5a60rIkEyiiXWUsyd.../edit...
NB: location TBC, but it will be on the UoE central campus.
