Following on from our recent paper, “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, I’m very excited to announce that I have started leading (and hiring for!) a new team at Anthropic, the Alignment Stress-Testing team, with Carson Denison and Monte MacDiarmid as current team members. Our mission—and our mandate from the organization—is to red-team Anthropic’s alignment techniques and evaluations, empirically demonstrating ways in which Anthropic’s alignment strategies could fail.
The easiest way to get a sense of what we’ll be working on is probably just to check out our “Sleeper Agents” paper, which was our first big research project. I’d also recommend Buck and Ryan’s post on meta-level adversarial evaluation as a good general description of our team’s scope. Very simply, our job is to try to prove to Anthropic, and to the world more broadly, that (if it is in fact true) we are in a pessimistic scenario, that Anthropic’s alignment plans and strategies won’t work, and that we will need to substantially shift gears. And if we don’t find anything extremely dangerous despite a serious and skeptical effort, that is some reassurance, but of course not a guarantee of safety.
Notably, our goal is not object-level red-teaming or evaluation—e.g. we won’t be the ones running Anthropic’s RSP-mandated evaluations to determine when Anthropic should pause or otherwise trigger concrete safety commitments. Rather, our goal is to stress-test that entire process: to red-team whether our evaluations and commitments will actually be sufficient to deal with the risks at hand.
We expect much of the stress-testing that we do to be very valuable in terms of producing concrete model organisms of misalignment that we can iterate on to improve our alignment techniques. However, we want to be cognizant of the risk of overfitting, and it’ll be our responsibility to determine when it is safe to iterate on our alignment techniques against the particular model organisms of misalignment that we produce. In the case of our “Sleeper Agents” paper, for example, we think the benefits of directly iterating on our alignment techniques’ ability to address those specific model organisms outweigh the downsides, but we’d likely want to hold out other, more natural model organisms of deceptive alignment so as to provide a strong test case.
Some of the projects that we’re planning to work on next include:
- Concretely stress-testing Anthropic’s ASL-3 evaluations.
- Applying techniques from “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning” to our “Sleeper Agents” models.
- Building more natural model organisms of misalignment, e.g. finding a training pipeline that we might realistically use and showing that it would lead to a concrete misalignment failure.