In February, Holden Karnofsky published Important, actionable research questions for the most important century.

For the past couple of months, I've been working on answering one of Holden's questions:

"What relatively well-scoped research activities are particularly likely to be useful for longtermism-oriented AI alignment?"

To answer this question, I've started a series of posts exploring the argument that interpretability - that is, research into better understanding what is happening inside machine learning systems - is a high-leverage research activity for solving the AI alignment problem.

I just published the first two posts on Alignment Forum/LessWrong:

1. Introduction to the sequence: Interpretability Research for the Most Important Century
2. (main post) Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

There will be at least one more post in the series; post #2 in particular contains a substantial amount of my research on this topic.