Talk by Buck Shlegeris, Redwood Research
About this talk
It's quite hard to define the goal of interpretability research. There is an intuitively reasonable long-term goal: we'd like to know whether our AI is giving us answers because they're true, or because it wants us to trust it so that it can murder us later. Even though it seems difficult or impossible to distinguish between these possibilities based only on the input-output behavior of the model, it might be possible to distinguish between them by understanding the model's internal properties. But how do we pick short-term research directions that lead toward this goal? In this talk, we'll present a way of thinking about the goal of interpretability research that might point us in a good direction, and we'll discuss the results we've gotten so far.
About the speaker series
With the rapid advancement of machine learning research and the expected arrival of Artificial General Intelligence in the near future, it becomes extremely important to positively shape the development of AI and to align AI values with human values.
In this speaker series, we bring state-of-the-art research on AI alignment into focus for audiences interested in contributing to this field. We will kick off the series by closely examining the potential risks posed by AGI and making the case for prioritizing the mitigation of those risks now. Later in the series, we will hear more technical talks on concrete proposals for AI alignment.
See the full schedule and register at https://www.harvardea.org/agathon.
You can participate in the talks in person at Harvard and MIT, or remotely through the webinar by registering ahead of time (link above). All talks take place at 5 pm EST (2 pm PST, 10 pm GMT) on Thursdays. Dinner is provided at the in-person venues.