What is an example of recent, tangible progress in AI safety research?

Aaron Gertler 🔸

Cross-posting a good question from Reddit. Answer there, here, or in both places; I'll make sure the Reddit author knows about this post.

Eric Herboso's answer on Reddit (the only one so far) includes these examples:

Scott Garrabrant on Finite Factored Sets (May)
Paul Christiano on his Research Methodology (March)
Rob Miles on Misaligned Mesa-Optimisers (part 1 in Feb, part 2 in May; both describe a paper from 2019)


CarlShulman

Jun 15, 2021

Focusing on empirical results:

Learning to summarize from human feedback was good, for several reasons.

I liked the recent paper empirically demonstrating objective robustness failures hypothesized in earlier theoretical work on inner alignment.

Mark Xu

Jun 15, 2021

nit: link on "reasons" was pasted twice. For others it's https://www.lesswrong.com/posts/PZtsoaoSLpKjjbMqM/the-case-for-aligning-narrowly-superhuman-models

Also hadn't seen that paper. Thanks!

technicalities

Jun 16, 2021

If we take "tangible" to mean executable:

A primitive prototype and a framework for safety via debate (2018-9). Bit quiet since.
Carey's 2019 proof of concept / extension of quantilizers
Stiennon et al. (2020) is an extremely encouraging example of a large negative "alignment tax" (making the model safer also made it work better)

But as Kurt Lewin once said, "there's nothing so practical as a good theory". In particular, theory scales automatically, and conceptual work can stop us from wasting effort on the wrong things.

CAIS (2019) pivots away from the classic agentic model, maybe for the better
The search for mesa-optimisers (2019) is a step forward from previous muddled thoughts on optimisation, and the authors make predictions we can test soon.
The Armstrong/Shah discussion of value learning changed my research direction for the better.

Also Everitt et al (2019) is both: a theoretical advance with good software.

technicalities

Jun 16, 2021

Not recent-recent, but I also really like Carey's 2017 work on CIRL. Picks a small, well-defined problem and hammers it flush into the ground. "When exactly does this toy system go bad?"

Effective Altruism Forum
EA Forum
