Cross-posting a good question from Reddit. Answer there, here, or in both places; I'll make sure the Reddit author knows about this post.
Eric Herboso's answer on Reddit (the only one so far) includes these examples:
- Scott Garrabrant on Finite Factored Sets (May)
- Paul Christiano on his Research Methodology (March)
- Rob Miles on Misaligned Mesa-Optimisers (Feb part 1, May part 2, both describing a paper from 2019)
Focusing on empirical results:
"Learning to summarize from human feedback" was good, for several reasons.
I also liked the recent paper empirically demonstrating the objective robustness failures hypothesized in earlier theoretical work on inner alignment.
nit: the link on "reasons" was pasted twice. For others, it's https://www.lesswrong.com/posts/PZtsoaoSLpKjjbMqM/the-case-for-aligning-narrowly-superhuman-models
I also hadn't seen that paper. Thanks!