[ Question ]

What is an example of recent, tangible progress in AI safety research?

by Aaron Gertler · 1 min read · 14th Jun 2021 · 4 comments


AI alignment

Cross-posting a good question from Reddit. Answer there, here, or in both places; I'll make sure the Reddit author knows about this post.

Eric Herboso's answer on Reddit (the only one so far) includes these examples:

Scott Garrabrant on Finite Factored Sets (May)

Paul Christiano on his Research Methodology (March)

Rob Miles on Misaligned Mesa-Optimisers (part 1 in Feb, part 2 in May, both describing a paper from 2019)


2 Answers

Focusing on empirical results:

Learning to summarize from human feedback was good, for several reasons.

I liked the recent paper empirically demonstrating objective robustness failures hypothesized in earlier theoretical work on inner alignment.


nit: link on "reasons" was pasted twice. For others it's https://www.lesswrong.com/posts/PZtsoaoSLpKjjbMqM/the-case-for-aligning-narrowly-superhuman-models

Also hadn't seen that paper. Thanks!

If we take "tangible" to mean executable:

But as Kurt Lewin once said, "there's nothing so practical as a good theory". In particular, theory scales automatically, and conceptual work can stop us from wasting effort on the wrong things.

  • CAIS (2019) pivots away from the classic agentic model, maybe for the better
  • The search for mesa-optimisers (2019) is a step forward from previous muddled thoughts on optimisation, and it makes predictions we can test soon.
  • The Armstrong/Shah discussion of value learning changed my research direction for the better.

Also Everitt et al (2019) is both: a theoretical advance with good software.

Not recent-recent, but I also really like Carey's 2017 work on CIRL. Picks a small, well-defined problem and hammers it flush into the ground. "When exactly does this toy system go bad?"