In February, Holden Karnofsky published "Important, actionable research questions for the most important century."
I've spent the past couple of months trying to answer one of Holden's questions:
"What relatively well-scoped research activities are particularly likely to be useful for longtermism-oriented AI alignment?"
To answer this question, I've started a series of posts exploring the argument that interpretability - that is, research into better understanding what is happening inside machine learning systems - is a high-leverage research activity for solving the AI alignment problem.
I just published the first two posts on the Alignment Forum and LessWrong:
1. Introduction to the sequence: Interpretability Research for the Most Important Century
2. (main post) Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios
There will be at least one more post in the series, but post #2 in particular already contains a substantial amount of my research on this topic.