I’m trying to understand whether recent efforts to give AI Safety research a firmer empirical grounding have produced evidence that claims from theoretical AI Safety work have turned out to be correct.
This could make me update in favour of taking AI Safety concerns more seriously.
Previously, I have been skeptical of AI Safety arguments because many claims were based on theoretical reasoning rather than empirical evidence.
Not predictions as such, but lots of current work on AI safety and steering is based pretty directly on paradigms from Yudkowsky and Christiano - from Anthropic's Constitutional AI to ARIA's Safeguarded AI programme. There is also OpenAI's Superalignment research, which was attempting to build AI that could solve agent foundations - that is, explicitly do the work that theoretical AI safety research identified. (I'm unclear whether that last effort is ongoing, given that they managed to alienate most of the people involved.)