Research Scientist @ Anthropic
1730 karmaJoined


Evan Hubinger (he/him/his) (evanjhub@gmail.com)

I am a research scientist at Anthropic where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I'm joining Anthropic

Selected work:


I won't repeat my full LessWrong comment here in detail; instead I'd just recommend heading over there and reading it and the associated comment chain. The bottom-line summary is that, in trying to cover some heavy information theory regarding how to reason about simplicity priors and counting arguments without actually engaging with the proper underlying formalism, this post commits a subtle but basic mathematical mistake that makes the whole argument fall apart.

I think this is a very good point, and it definitely gives me some pause—and probably my original statement there was too strong. Certainly I agree that you need to do evaluations using the best possible scaffolding that you have, but overall my sense is that this problem is not that bad. Some reasons to think that:

  • At least currently, scaffolding-related performance improvements don't seem to generally be that large (e.g. chain-of-thought is just not that helpful on most tasks), especially relative to the gains from scaling.
  • You can evaluate pretty directly for the sorts of capabilities that would help make scaffolding way better, like the model being able to correct its own errors, so you don't have to just evaluate the whole system + scaffolding end-to-end.
  • This is mostly just a problem for large-scale model deployments. If you instead keep your largest model mostly in-house for alignment research, or only give it to a small number of external partners whose scaffolding you can directly evaluate, it makes this problem way less bad.

That last point is probably the most important here, since it demonstrates that you easily can (and should) absorb this sort of concern into an RSP. For example, you could set a capabilities threshold for models' ability to do self-correction, and once your models pass that threshold you restrict deployment except in contexts where you can directly evaluate the relevant scaffolding that will be used in advance.

Cross-posted from LessWrong.

One reason I'm critical of the Anthropic RSP is that it does not make it clear under what conditions it would actually pause, or for how long, or under what safeguards it would determine it's OK to keep going.

It's hard to take anything else you're saying seriously when you say things like this; it seems clear that you just haven't read Anthropic's RSP. I think that the current conditions and resulting safeguards are insufficient to prevent AI existential risk, but to say that it doesn't make them clear is just patently false.

The conditions under which Anthropic commits to pausing in the RSP are very clear. In big bold font on the second page it says:

Anthropic’s commitment to follow the ASL scheme thus implies that we commit to pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL.

And then it lays out a serious of safety procedures that Anthropic commits to meeting for ASL-3 models or else pausing, with some of the most serious commitments here being:

  • Model weight and code security: We commit to ensuring that ASL-3 models are stored in such a manner to minimize risk of theft by a malicious actor that might use the model to cause a catastrophe. Specifically, we will implement measures designed to harden our security so that non-state attackers are unlikely to be able to steal model weights, and advanced threat actors (e.g. states) cannot steal them without significant expense. The full set of security measures that we commit to (and have already started implementing) are described in this appendix, and were developed in consultation with the authors of a forthcoming RAND report on securing AI weights.
  • Successfully pass red-teaming: World-class experts collaborating with prompt engineers should red-team the deployment thoroughly and fail to elicit information at a level of sophistication, accuracy, usefulness, detail, and frequency which significantly enables catastrophic misuse. Misuse domains should at a minimum include causes of extreme CBRN risks, and cybersecurity.
    • Note that in contrast to the ASL-3 capability threshold, this red-teaming is about whether the model can cause harm under realistic circumstances (i.e. with harmlessness training and misuse detection in place), not just whether it has the internal knowledge that would enable it in principle to do so.
    • We will refine this methodology, but we expect it to require at least many dozens of hours of deliberate red-teaming per topic area, by world class experts specifically focused on these threats (rather than students or people with general expertise in a broad domain). Additionally, this may involve controlled experiments, where people with similar levels of expertise to real threat actors are divided into groups with and without model access, and we measure the delta of success between them.

And a clear evaluation-based definition of ASL-3:

We define an ASL-3 model as one that can either immediately, or with additional post-training techniques corresponding to less than 1% of the total training cost, do at least one of the following two things. (By post-training techniques we mean the best capabilities elicitation techniques we are aware of at the time, including but not limited to fine-tuning, scaffolding, tool use, and prompt engineering.)

  1. Capabilities that significantly increase risk of misuse catastrophe: Access to the model would substantially increase the risk of deliberately-caused catastrophic harm, either by proliferating capabilities, lowering costs, or enabling new methods of attack. This increase in risk is measured relative to today’s baseline level of risk that comes from e.g. access to search engines and textbooks. We expect that AI systems would first elevate this risk from use by non-state attackers. Our first area of effort is in evaluating bioweapons risks where we will determine threat models and capabilities in consultation with a number of world-class biosecurity experts. We are now developing evaluations for these risks in collaboration with external experts to meet ASL-3 commitments, which will be a more systematized version of our recent work on frontier red-teaming. In the near future, we anticipate working with CBRN, cyber, and related experts to develop threat models and evaluations in those areas before they present substantial risks. However, we acknowledge that these evaluations are fundamentally difficult, and there remain disagreements about threat models.
  2. Autonomous replication in the lab: The model shows early signs of autonomous self-replication ability, as defined by 50% aggregate success rate on the tasks listed in [Appendix on Autonomy Evaluations]. The appendix includes an overview of our threat model for autonomous capabilities and a list of the basic capabilities necessary for accumulation of resources and surviving in the real world, along with conditions under which we would judge the model to have succeeded. Note that the referenced appendix describes the ability to act autonomously specifically in the absence of any human intervention to stop the model, which limits the risk significantly. Our evaluations were developed in consultation with Paul Christiano and ARC Evals, which specializes in evaluations of autonomous replication.

This is the basic substance of the RSP; I don't understand how you could have possibly read it and missed this. I don't want to be mean, but I am really disappointed in these sort of exceedingly lazy takes.

Cross-posted with LessWrong.

I found this post very frustrating, because it's almost all dedicated to whether current RSPs are sufficient or not (I agree that they are insufficient), but that's not my crux and I don't think it's anyone else's crux either. And for what I think is probably the actual crux here, you only have one small throwaway paragraph:

Which brings us to the question: “what’s the effect of RSPs on policy and would it be good if governments implemented those”. My answer to that is: An extremely ambitious version yes; the misleading version, no. No, mostly because of the short time we have before we see heightened levels of risks, which gives us very little time to update regulations, which is a core assumption on which RSPs are relying without providing evidence of being realistic.

As I've talked about now extensively, I think enacting RSPs in policy now makes it easier not harder to get even better future regulations enacted. It seems that your main reason for disagreement is that you believe in extremely short timelines / fast takeoff, such that we will never get future opportunities to revise AI regulation. That seems pretty unlikely to me: my expectation especially is that as AI continues to heat up in terms of its economic impact, new policy windows will keep arising in rapid succession, and that we will see many of them before the end of days.

The RSP angle is part of the corporate "big AI" "business as usual" agenda. To those of us playing the outside game it seems very close to safetywashing.

I've written up more about why I think this is not true here.

Have the resumption condition be a global consensus on an x-safety solution or a global democratic mandate for restarting (and remember there are more components of x-safety than just alignment - also misuse and multi-agent coordination).

This seems basically unachievable and even if it was achievable it doesn't even seem like the right thing to do—I don't actually trust the global median voter to judge whether additional scaling is safe or not. I'd much rather have rigorous technical standards than nebulous democratic standards.

I think it's pushing it a bit at this stage to say that they, as companies, are primarily concerned with reducing x-risk.

That's why we should be pushing them to have good RSPs! I just think you should be pushing on the RSP angle rather than the pause angle.

This is soon enough to be pushing as hard as we can for a pause right now!

I mean, yes, obviously we should be doing everything we can right now. I just think that a RSP-gated pause is the right way to do a pause. I'm not even sure what it would mean to do a pause without an RSP-like resumption condition.

Why try and take it right down to the wire with RSPs?

Because it's more likely to succeed. RSPs provides very clear and legible risk-based criteria that are much more plausibly things that you could actually get a government to agree to.

The tradeoff for a few tens of $Bs of extra profit really doesn't seem worth it!

This seems extremely disingenuous and bad faith. That's obviously not the tradeoff and it confuses me why you would even claim that. Surely you know that I am not Sam Altman or Dario Amodei or whatever.

The actual tradeoff is the probability of success. If I thought e.g. just advocating for a six month pause right now was more effective at reducing existential risk, I would do it.

But what about dangerous capabilities that have more to do with AI takeover (e.g., a company develops a system that shows signs of autonomous replication, manipulation, power-seeking, deception) or scientific capabilities (e.g., the ability to develop better AI systems)?

Supposing that 3-10 other companies are within a few months of these systems, do you think at this point we need a coordinated pause, or would it be fine to just force company 1 to pause?

What should happen there is that the leading lab is forced to stop and try to demonstrate that e.g. they understand their model sufficiently such that they can keep scaling. Then:

  • If they can't do that, then the other labs catch up and they're all blocked on the same spot, which if you've put your capabilities bars at the right spots, shouldn't be dangerous.
  • If they can do that, then they get to keep going, ahead of other labs, until they hit another blocker and need to demonstrate safety/understanding/alignment to an even greater degree.

Perhaps the crux is related to how dangerous you think current models are? I'm quite confident that we have at least a couple additional orders of magnitude of scaling before the world ends, so I'm not too worried about stopping training of current models, or even next-generation models. But I do start to get worried with next-next-generation models.

So, in my view, the key is to make sure that we have a well-enforced Responsible Scaling Policy (RSP) regime that is capable of preventing scaling unless hard safety metrics are met (I favor understanding-based evals for this) before the next two scaling generations. That means we need to get good RSPs into law with solid enforcement behind them and—at least in very short timeline worlds—that needs to happen in the next few years. By far the best way to make that happen, in my opinion, is to pressure labs to put out good RSPs now that governments can build on.

Load more