Talk by Paul Christiano, Alignment Research Center
Eliciting latent knowledge: How to tell if your eyes deceive you
About this talk
When we train an ML system, it may build up latent “knowledge” about the world that is instrumentally useful for achieving a low loss. For example, a video model may infer the positions and velocities of objects in order to predict the next frame. I’ll argue that eliciting this knowledge would go much of the way towards safely deploying powerful AI, even if we could only do so in the most “straightforward” cases where there is little doubt about what the model knows or how to describe that knowledge to a human. I’ll describe a naive training strategy to get the model to honestly answer questions using all of its knowledge and explain why I expect that strategy to fail for sufficiently powerful systems. Then I'll discuss a few more sophisticated alternatives which may also fail, and finally outline the approach we are currently taking at ARC to try to resolve this problem in the worst case.
About the speaker series
With the advancement of machine learning research and the expected arrival of Artificial General Intelligence in the near future, positively shaping the development of AI and aligning AI values with human values has become an extremely important problem.
In this speaker series, we bring state-of-the-art research on AI alignment into focus for audiences interested in contributing to the field. We will kick off the series by closely examining the potential risks posed by AGIs and making the case for prioritizing their mitigation now. Later in the series, we will hear more technical talks on concrete proposals for AI alignment.
See the full schedule and register at https://www.harvardea.org/agathon.
You can participate in the talks in person at Harvard and MIT, or remotely through the webinar by registering ahead of time (link above). All talks take place at 5 pm EST (2 pm PST, 10 pm GMT) on Thursdays. Dinner is provided at the in-person venues.
