Cross-posting a good question from Reddit. Answer there, here, or in both places; I'll make sure the Reddit author knows about this post.
Eric Herboso's answer on Reddit (the only one so far) includes these examples:
- Scott Garrabrant on Finite Factored Sets (May)
- Paul Christiano on his Research Methodology (March)
- Rob Miles on Misaligned Mesa-Optimisers (Feb part 1, May part 2, both describing a paper from 2019)
Focusing on empirical results:
"Learning to summarize from human feedback" was good, for several reasons.
I also liked the recent paper empirically demonstrating the objective robustness failures hypothesized in earlier theoretical work on inner alignment.
nit: the link on "reasons" was pasted twice. For others, it's https://www.lesswrong.com/posts/PZtsoaoSLpKjjbMqM/the-case-for-aligning-narrowly-superhuman-models
I also hadn't seen that paper. Thanks!