
This is a reference post. It contains no novel facts and almost no novel analysis.

The idea of responsible scaling policies is now over a year old. Anthropic, OpenAI, and DeepMind each have something like an RSP, and several other relevant companies have committed to publish RSPs by February.

The core of an RSP is a risk assessment plan plus a plan for safety practices as a function of risk assessment results. RSPs are appealing because safety practices should be a function of warning signs, and people who disagree about when warning signs are likely to appear may still be able to agree on appropriate responses to particular warning signs. And preparing to notice warning signs, and planning responses, is good to do in advance.

Unfortunately, even given agreement on which high-level capabilities are dangerous, it turns out to be hard to design great tests for those capabilities in advance. And it's hard to determine which safety practices are necessary and sufficient to avert the risks. So RSPs have high-level capability thresholds, but those thresholds aren't operationalized. Nobody knows how to write an RSP that passes the LeCun test without being extremely conservative:

the LeCun Test: Imagine another frontier AI developer adopts a copy of our RSP as binding policy and entrusts someone who thinks that AGI safety concerns are mostly bullshit to implement it. If the RSP is well-written, we should still be reassured that the developer will behave safely—or, at least, if they fail, we should be confident that they’ll fail in a very visible and accountable way.

Maybe third-party evaluation of models or auditing of an RSP and its implementation could help external observers notice if an AI company is behaving unsafely. Strong versions of this have not yet appeared.[1]

Anthropic

Responsible Scaling Policy

Basic structure: do evals for CBRN, AI R&D, and cyber capabilities at least every 6 months. Once evals show that a model might be above a CBRN capability threshold, implement the ASL-3 Deployment Standard and the ASL-3 Security Standard (or restrict deployment[2] or pause training, respectively, until doing so). Once evals show that a model might be above an AI R&D capability threshold, implement the ASL-3 Security Standard.
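As a rough illustration (not Anthropic's own formalization), the decision flow summarized above can be sketched as follows; the dataclass, function, and key names are hypothetical placeholders.

```python
# Minimal sketch of the eval-to-response mapping described above.
# All names here are illustrative, not Anthropic's actual implementation.
from dataclasses import dataclass

@dataclass
class EvalResult:
    may_exceed_threshold: bool  # evals show the model *might* be above the threshold

def required_responses(results: dict[str, EvalResult]) -> set[str]:
    """Map eval results (run at least every 6 months) to required safety standards."""
    required: set[str] = set()
    if results["cbrn"].may_exceed_threshold:
        # CBRN threshold: ASL-3 Deployment Standard and ASL-3 Security Standard,
        # or restrict deployment / pause training until they are implemented.
        required |= {"ASL-3 Deployment Standard", "ASL-3 Security Standard"}
    if results["ai_r_and_d"].may_exceed_threshold:
        # AI R&D threshold: ASL-3 Security Standard.
        required.add("ASL-3 Security Standard")
    return required

# Example: a model flagged on CBRN but not on AI R&D or cyber.
print(required_responses({
    "cbrn": EvalResult(True),
    "ai_r_and_d": EvalResult(False),
    "cyber": EvalResult(False),
}))
```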

Footnotes removed and formatting edited:

Chemical, Biological, Radiological, and Nuclear (CBRN) weapons

The ability to significantly assist individuals or groups with basic STEM backgrounds in obtaining, producing, or deploying CBRN weapons. We assess this by comparing what potential attackers could achieve with full model access versus 2023-level online resources, assuming they have funding and up to one year of time to invest, but no initial specialized expertise.

Autonomous AI Research and Development

The ability to either: (1) Fully automate the work of an entry-level remote-only Researcher at Anthropic, as assessed by performance on representative tasks or (2) cause dramatic acceleration in the rate of effective scaling. Specifically, this would be the case if we observed or projected an increase in the effective training compute of the world’s most capable model that, over the course of a year, was equivalent to two years of the average rate of progress during the period of early 2018 to early 2024. We roughly estimate that the 2018-2024 average scaleup was around 35x per year, so this would imply an actual or projected one-year scaleup of 35^2 = ~1000x.

 

ASL-3 Deployment Standard

When a model must meet the ASL-3 Deployment Standard, we will evaluate whether the measures we have implemented make us robust to persistent attempts to misuse the capability in question. To make the required showing, we will need to satisfy the following criteria:

  1. Threat modeling: Make a compelling case that the set of threats and the vectors through which an adversary could catastrophically misuse the deployed system have been sufficiently mapped out, and will commit to revising as necessary over time.
  2. Defense in depth: Use a “defense in depth” approach by building a series of defensive layers, each designed to catch misuse attempts that might pass through previous barriers. As an example, this might entail achieving a high overall recall rate using harm refusal techniques. This is an area of active research, and new technologies may be added when ready.
  3. Red-teaming: Conduct red-teaming that demonstrates that threat actors with realistic access levels and resources are highly unlikely to be able to consistently elicit information from any generally accessible systems that greatly increases their ability to cause catastrophic harm relative to other available tools.
  4. Rapid remediation: Show that any compromises of the deployed system, such as jailbreaks or other attack pathways, will be identified and remediated promptly enough to prevent the overall system from meaningfully increasing an adversary’s ability to cause catastrophic harm. Example techniques could include rapid vulnerability patching, the ability to escalate to law enforcement when appropriate, and any necessary retention of logs for these activities.
  5. Monitoring: Prespecify empirical evidence that would show the system is operating within the accepted risk range and define a process for reviewing the system’s performance on a reasonable cadence. Process examples include monitoring responses to jailbreak bounties, doing historical analysis or background monitoring, and any necessary retention of logs for these activities.
  6. Trusted users: Establish criteria for determining when it may be appropriate to share a version of the model with reduced safeguards with trusted users. In addition, demonstrate that an alternative set of controls will provide equivalent levels of assurance. This could include a sufficient combination of user vetting, secure access controls, monitoring, log retention, and incident response protocols.
  7. Third-party environments: Document how all relevant models will meet the criteria above, even if they are deployed in a third-party partner’s environment that may have a different set of safeguards.

ASL-3 Security Standard

When a model must meet the ASL-3 Security Standard, we will evaluate whether the measures we have implemented make us highly protected against most attackers’ attempts at stealing model weights.

We consider the following groups in scope: hacktivists, criminal hacker groups, organized cybercrime groups, terrorist organizations, corporate espionage teams, internal employees, and state-sponsored programs that use broad-based and non-targeted techniques (i.e., not novel attack chains).

The following groups are out of scope for the ASL-3 Security Standard because further testing (as discussed below) should confirm that the model would not meaningfully increase their ability to do harm: state-sponsored programs that specifically target us (e.g., through novel attack chains or insider compromise) and a small number (~10) of non-state actors with state-level resourcing or backing that are capable of developing novel attack chains that utilize 0-day attacks.

To make the required showing, we will need to satisfy the following criteria:

  1. Threat modeling: Follow risk governance best practices, such as use of the MITRE ATT&CK Framework to establish the relationship between the identified threats, sensitive assets, attack vectors and, in doing so, sufficiently capture the resulting risks that must be addressed to protect model weights from theft attempts. As part of this requirement, we should specify our plans for revising the resulting threat model over time.
  2. Security frameworks: Align to and, as needed, extend industry-standard security frameworks for addressing identified risks, such as disclosure of sensitive information, tampering with accounts and assets, and unauthorized elevation of privileges with the appropriate controls. This includes:
    1. Perimeters and access controls: Building strong perimeters and access controls around sensitive assets to ensure AI models and critical systems are protected from unauthorized access. We expect this will include a combination of physical security, encryption, cloud security, infrastructure policy, access management, and weight access minimization and monitoring.
    2. Lifecycle security: Securing links in the chain of systems and software used to develop models, to prevent compromised components from being introduced and to ensure only trusted code and hardware is used. We expect this will include a combination of software inventory, supply chain security, artifact integrity, binary authorization, hardware procurement, and secure research development lifecycle.
    3. Monitoring: Proactively identifying and mitigating threats through ongoing and effective monitoring, testing for vulnerabilities, and laying traps for potential attackers. We expect this will include a combination of endpoint patching, product security testing, log management, asset monitoring, and intruder deception techniques.
    4. Resourcing: Investing sufficient resources in security. We expect meeting this standard of security to require roughly 5-10% of employees being dedicated to security and security-adjacent work.
    5. Existing guidance: Aligning where appropriate with existing guidance on securing model weights, including Securing AI Model Weights, Preventing Theft and Misuse of Frontier Models (2024); security recommendations like Deploying AI Systems Securely (CISA/NSA/FBI/ASD/CCCS/GCSB/GCHQ), ISO 42001, CSA’s AI Safety Initiative, and CoSAI; and standards frameworks like SSDF, SOC 2, NIST 800-53.
  3. Audits: Develop plans to (1) audit and assess the design and implementation of the security program and (2) share these findings (and updates on any remediation efforts) with management on an appropriate cadence. We expect this to include independent validation of threat modeling and risk assessment results; a sampling-based audit of the operating effectiveness of the defined controls; periodic, broadly scoped, and independent testing with expert red-teamers who are industry-renowned and have been recognized in competitive challenges.
  4. Third-party environments: Document how all relevant models will meet the criteria above, even if they are deployed in a third-party partner’s environment that may have a different set of safeguards.

The thresholds are imprecise and the standards are abstract.

ASL-4 will be much more important than ASL-3; ASL-4 standards and corresponding thresholds don't yet exist.

OpenAI

Preparedness Framework

Basic structure: do evals for cyber, CBRN, persuasion, and autonomy capabilities before deployment. (Also evaluate "continually, i.e., as often as needed to catch any non-trivial capability change, including before, during, and after training. This would include whenever there is a >2x effective compute increase or major algorithmic breakthrough.") By the time a model reaches "High" risk in any category, harden security, and before deploying externally, implement mitigations to bring post-mitigation risk below the "High" threshold. By the time a model reaches "Critical" risk in any category, implement mitigations to bring post-mitigation risk below the "Critical" threshold (but it's unclear what implementing mitigations means during training), and get "dependable evidence that the model is sufficiently aligned that it does not initiate 'critical'-risk-level tasks unless explicitly instructed to do so" (but it's very unclear what this means).
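The gating logic here reduces to a few comparisons. A minimal sketch, assuming each category's pre- and post-mitigation scores are ordered Low < Medium < High < Critical; the enum and function are placeholders, not OpenAI's implementation.

```python
# Minimal sketch of the Preparedness Framework gates summarized above.
from enum import IntEnum

class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3

def preparedness_gates(pre_mitigation: Risk, post_mitigation: Risk) -> dict[str, bool]:
    """Gates for a single category (cyber, CBRN, persuasion, or autonomy)."""
    return {
        # Reaching (or forecasting) High pre-mitigation risk triggers hardened security.
        "harden_security": pre_mitigation >= Risk.HIGH,
        # Only models at Medium post-mitigation risk or below may be deployed.
        "may_deploy_externally": post_mitigation <= Risk.MEDIUM,
        # Only models at High post-mitigation risk or below may be developed further.
        "may_continue_development": post_mitigation <= Risk.HIGH,
    }

print(preparedness_gates(Risk.HIGH, Risk.MEDIUM))
# -> {'harden_security': True, 'may_deploy_externally': True, 'may_continue_development': True}
```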

Formatting edited:

Cyber risk threshold: High

Tool-augmented model can identify and develop proofs-of-concept for high-value exploits against hardened targets without human intervention, potentially involving novel exploitation techniques, OR provided with a detailed strategy, the model can end-to-end execute cyber operations involving the above tasks without human intervention.

Cyber risk threshold: Critical

Tool-augmented model can identify and develop functional zero-day exploits of all severity levels, across all software projects, without human intervention OR model can devise and execute end-to-end novel strategies for cyberattacks against hardened targets given only a high level desired goal.

. . .

Asset Protection

If we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories: we will ensure that our security is hardened in a way that is designed to prevent our mitigations and controls from being circumvented via exfiltration (by the time we hit “high” pre-mitigation risk). This is defined as establishing network and compute security controls designed to help prevent the captured risk from being exploited or exfiltrated, as assessed and implemented by the Security team.

This might require:

  • increasing compartmentalization, including immediately restricting access to a limited nameset of people, restricting access to critical know-how such as algorithmic secrets or model weights, and including a strict approval process for access during this period.
  • deploying only into restricted environments (i.e., ensuring the model is only available for inference in restricted environments) with strong technical controls that allow us to moderate the model’s capabilities.
  • increasing the prioritization of information security controls.

Restricting deployment

Only models with a post-mitigation score of "medium" or below can be deployed. In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place for the relevant post-mitigation risk level to be back at most to “medium” level. (Note that a potentially effective mitigation in this context could be restricting deployment to trusted parties.)

Restricting development

Only models with a post-mitigation score of "high" or below can be developed further. In other words, if we reach (or are forecasted to reach) “critical” pre-mitigation risk along any risk category, we commit to ensuring there are sufficient mitigations in place for that model (by the time we reach that risk level in our capability development, let alone deployment) for the overall post-mitigation risk to be back at most to “high” level. Note that this should not preclude safety-enhancing development. We would also focus our efforts as a company towards solving these safety challenges and only continue with capabilities-enhancing development if we can reasonably assure ourselves (via the operationalization processes) that it is safe to do so.

Additionally, to protect against “critical” pre-mitigation risk, we need dependable evidence that the model is sufficiently aligned that it does not initiate “critical”-risk-level tasks unless explicitly instructed to do so.

The thresholds are very high.

"Deployment mitigations" is somewhat meaningless: it's barely more specific than "we will only deploy if it's safe" — OpenAI should clarify what it will do or how it will tell.[3] What OpenAI does say about its mitigations makes little sense:

A central part of meeting our safety baselines is implementing mitigations to address various types of model risk. Our mitigation strategy will involve both containment measures, which help reduce risks related to possession of a frontier model, as well as deployment mitigations, which help reduce risks from active use of a frontier model. As a result, these mitigations might span increasing compartmentalization, restricting deployment to trusted users, implementing refusals, redacting training data, or alerting distribution partners.

"Deployment mitigations" is especially meaningless in the development context: "Only models with a post-mitigation score of 'high' or below can be developed further" is not meaningful, unless I misunderstand.

There is nothing directly about internal deployment.

OpenAI seems to be legally required to share its models with Microsoft, which is not bound by OpenAI's PF.

OpenAI has struggled to implement its PF correctly, and evals were reportedly rushed, but it seems to be mostly on track now.

DeepMind

Frontier Safety Framework

Basic structure: do evals for autonomy, bio, cyber, and ML R&D capabilities.[4] "We are aiming to evaluate our models every 6x in effective compute and for every 3 months of fine-tuning progress." When a model passes early warning evals for a "Critical Capability Level," make a plan to implement deployment and security mitigations by the time the model reaches the CCL.
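A minimal sketch of that evaluation cadence, with hypothetical names; this just illustrates the quoted rule, not DeepMind's implementation.

```python
# Illustrative sketch of the FSF eval cadence: evaluate roughly every 6x increase
# in effective compute and every 3 months of fine-tuning progress.
def needs_early_warning_evals(effective_compute: float,
                              compute_at_last_eval: float,
                              months_of_finetuning_since_last_eval: float) -> bool:
    compute_trigger = effective_compute >= 6 * compute_at_last_eval
    finetuning_trigger = months_of_finetuning_since_last_eval >= 3
    return compute_trigger or finetuning_trigger

# Example: only 4x more effective compute, but 3 months of fine-tuning -> evaluate.
print(needs_early_warning_evals(4e24, 1e24, 3))  # True
```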

One CCL:

Autonomy level 1: Capable of expanding its effective capacity in the world by autonomously acquiring resources and using them to run and sustain additional copies of itself on hardware it rents.

There are several "levels" of abstract "security mitigations" and "deployment mitigations." They are not yet connected to the CCLs: DeepMind hopes to "develop mitigation plans that map the CCLs to the security and deployment levels," or at least make a plan when early warning evals are passed. So the FSF doesn't contain a plan for how to respond to various dangerous capabilities (and doesn't really contain other commitments).

The FSF is focused on external deployment. Deployment mitigation levels 2 and 3 do mention internal use, but the threat model there is just misuse, not scheming. (Another part of the FSF says "protection against the risk of systems acting adversarially against humans may require additional Framework components, including new evaluations and control mitigations that protect against adversarial AI activity.")

RSPs reading list:[5]


Crossposted from AI Lab Watch. Subscribe on Substack.

  1. ^

    Anthropic, OpenAI, and DeepMind sometimes share pre-deployment model access with external evaluators. But the evaluators mostly don't get sufficiently deep access to do good evals, nor advance permission to publish their results (and sometimes they don't have enough time to finish their evaluations before deployment).

    In December 2023, OpenAI said "Scorecard evaluations (and corresponding mitigations) will be audited by qualified, independent third-parties to ensure accurate reporting of results, either by reproducing findings or by reviewing methodology to ensure soundness, at a cadence specified by the [Safety Advisory Group] and/or upon the request of OpenAI Leadership or the [board]." It seems that has not yet happened. In October 2024, Anthropic said "On approximately an annual basis, we will commission a third-party review that assesses whether we adhered to this policy’s main procedural commitments."

  2. ^

    What about internal deployment? The ASL-3 Deployment Standard mostly applies to internal deployment too, but the threat model is just misuse, not scheming.

  3. ^

    E.g. is the plan robust refusals? If so, how robust should it be, or how will OpenAI tell?

  4. ^

    In DeepMind's big evals paper the categories were persuasion, cyber, self-proliferation, and self-reasoning, with CBRN in progress.

  5. ^

    This list doesn't include any strong criticism of the idea of RSPs because I've never read strong criticism I thought was great. But I believe existing RSPs are inadequate.


