Join us for a one-hour presentation by the author of this post: How-to Transformer Mechanistic Interpretability—in 50 lines of code or less!
You'll get the most out of Stefan's talk if you meet the following prerequisites:
- Understanding the Transformer architecture: Know what the residual stream is, how attention layers and MLPs work, and how logits & predictions work. For later sections, familiarity with multi-head attention is useful. Here's a link to Neel's glossary, which provides excellent explanations of most of the terms I might use! If you're not familiar with Transformers, check out Step 2 (6) of Neel's guide or any of the other explanations online; I recommend Jay Alammar's The Illustrated Transformer and/or Milan Straka's lecture series.
- Some overview of Mechanistic Interpretability is helpful: see e.g. any of Neel's talks, or look at the results in the IOI paper / walkthrough.
- Basic Python: Familiarity with arrays (as in NumPy or PyTorch, for indexing) is useful, but explicitly no PyTorch knowledge is required! (The short sketch after this list shows the kind of array handling involved.)
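To make the first and last prerequisites concrete, here is a minimal NumPy sketch of the residual-stream picture: each attention layer and MLP reads the residual stream and adds its output back in, and the final stream is unembedded into logits. This is purely illustrative; the toy dimensions, random weights, and the helper name `toy_layer` are made up for this example and are not from the talk.

```python
import numpy as np

# Hypothetical toy sizes, chosen only for illustration.
n_tokens, d_model, d_vocab = 4, 8, 50
rng = np.random.default_rng(0)

# The residual stream: one d_model-dimensional vector per token position.
residual = rng.normal(size=(n_tokens, d_model))

def toy_layer(x):
    """Stand-in for an attention layer or MLP: some function of the stream."""
    W = rng.normal(size=(d_model, d_model)) * 0.1
    return x @ W

for _ in range(2):  # two toy "blocks"
    residual = residual + toy_layer(residual)  # attention writes into the stream
    residual = residual + toy_layer(residual)  # the MLP writes into the stream

# Unembed the final residual stream into logits over the vocabulary.
W_U = rng.normal(size=(d_model, d_vocab))
logits = residual @ W_U                # shape (n_tokens, d_vocab)
prediction = logits[-1].argmax()       # predicted next token at the last position
print(logits.shape, int(prediction))
```

The detail worth internalizing is the additive pattern `residual + toy_layer(residual)`: every component only ever adds its output to the stream, which is what makes it natural to ask which component wrote which piece of information.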