This curriculum, a follow-up to the Alignment Fundamentals curriculum (the ‘101’ to this 201 curriculum), aims to give participants enough knowledge about alignment to understand the frontier of current research discussions. It assumes that participants have read through the Alignment Fundamentals curriculum, taken a course on deep learning, and taken a course on reinforcement learning (or have an equivalent level of knowledge).
Although these are the basic prerequisites, we expect that most people who intend to work on alignment should only read through the full curriculum after they have significantly more ML experience than listed above, since upskilling via their own ML engineering or research projects should generally be a higher priority for early-career alignment researchers.
When reading this curriculum, it’s worth remembering that the field of alignment aims to shape the goals of systems that don’t yet exist, and so alignment research is often more speculative than research in other fields. You shouldn’t assume that there’s a consensus about the usefulness of any given research direction; instead, it’s often worth developing your own views about whether techniques discussed in this curriculum might plausibly scale up to help align AGI.
The curriculum was compiled, and is maintained, by Richard Ngo. For now, it’s primarily intended to be read independently; once we’ve run a small pilot program, we’ll likely extend it to a discussion-based course.
Curriculum overview
Week 1: Further understanding the problem
Week 2: Decomposing tasks for better supervision
Week 3: Preventing misgeneralization
Week 4: Interpretability
Week 5: Reasoning about Reasoning
Weeks 6 & 7 (Track 1): Eliciting Latent Knowledge
Weeks 6 & 7 (Track 2): Agent Foundations
Weeks 6 & 7 (Track 3): Science of Deep Learning
Weeks 8 & 9: Literature Review or Project Proposal
See the full curriculum here. Note that the curriculum is still under revision, and feedback is very welcome!
Richard - thanks for your reply.
What I'm struggling with is how we'd plausibly get from (1) 'align with any human goal' to (2) 'align with all relevant goals across all humans in such a way that we actually minimize global catastrophic risks'.
In my view, getting to (1) only gets us about 2% of the way towards (2), and doesn't come anywhere close to 'solving alignment' in a way that would allow for safe AGI.
Also, I don't see how AGIs could develop a provably and interpretably 'very sophisticated understanding of human values' if alignment researchers don't themselves have a sophisticated understanding of human values against which they could test the AGI's understanding.
At least, it seems like we'd need a strong 'training set' of human values that includes plausibly complete coverage of the 'deployment set' of human values the AGI would actually encounter in the real world -- and I don't see how we'd get a decent training set of values without quite a thorough understanding of the nature and diversity of human values.
I'm raising these issues not to be contrarian or ornery, but out of genuine puzzlement about the long-term game plan in research on alignment with human values, and about why alignment researchers often seem uninterested in behavioral sciences research on human values.