Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust

Zach Stein-Perlman

This is a linkpost for https://www.lesswrong.com/posts/6tjHf5ykvFqaNCErH/anthropic-s-responsible-scaling-policy-and-long-term-benefit

Anthropic's RSP is big news. Yay Anthropic for making good commitments about model development, deployment, and security.

Anthropic's Responsible Scaling Policy

Today, we’re publishing our Responsible Scaling Policy (RSP) – a series of technical and organizational protocols that we’re adopting to help us manage the risks of developing increasingly capable AI systems.

As AI models become more capable, we believe that they will create major economic and social value, but will also present increasingly severe risks. Our RSP focuses on catastrophic risks – those where an AI model directly causes large scale devastation. Such risks can come from deliberate misuse of models (for example use by terrorists or state actors to create bioweapons) or from models that cause destruction by acting autonomously in ways contrary to the intent of their designers.

AI Safety Level Summary

Our RSP defines a framework called AI Safety Levels (ASL) for addressing catastrophic risks, modeled loosely after the US government’s biosafety level (BSL) standards for handling of dangerous biological materials. The basic idea is to require safety, security, and operational standards appropriate to a model’s potential for catastrophic risk, with higher ASL levels requiring increasingly strict demonstrations of safety.

A very abbreviated summary of the ASL system is as follows:

ASL-1 refers to systems which pose no meaningful catastrophic risk, for example a 2018 LLM or an AI system that only plays chess.
ASL-2 refers to systems that show early signs of dangerous capabilities – for example ability to give instructions on how to build bioweapons – but where the information is not yet useful due to insufficient reliability or not providing information that e.g. a search engine couldn’t. Current LLMs, including Claude, appear to be ASL-2.
ASL-3 refers to systems that substantially increase the risk of catastrophic misuse compared to non-AI baselines (e.g. search engines or textbooks) OR that show low-level autonomous capabilities.
ASL-4 and higher (ASL-5+) are not yet defined, as they are too far from present systems, but will likely involve qualitative escalations in catastrophic misuse potential and autonomy.

The definition, criteria, and safety measures for each ASL level are described in detail in the main document, but at a high level, ASL-2 measures represent our current safety and security standards and overlap significantly with our recent White House commitments. ASL-3 measures include stricter standards that will require intense research and engineering effort to comply with in time, such as unusually strong security requirements and a commitment not to deploy ASL-3 models if they show any meaningful catastrophic misuse risk under adversarial testing by world-class red-teamers (this is in contrast to merely a commitment to perform red-teaming). Our ASL-4 measures aren’t yet written (our commitment is to write them before we reach ASL-3), but may require methods of assurance that are unsolved research problems today, such as using interpretability methods to demonstrate mechanistically that a model is unlikely to engage in certain catastrophic behaviors.

We have designed the ASL system to strike a balance between effectively targeting catastrophic risk and incentivising beneficial applications and safety progress. On the one hand, the ASL system implicitly requires us to temporarily pause training of more powerful models if our AI scaling outstrips our ability to comply with the necessary safety procedures. But it does so in a way that directly incentivizes us to solve the necessary safety issues as a way to unlock further scaling, and allows us to use the most powerful models from the previous ASL level as a tool for developing safety features for the next level.^[1] If adopted as a standard across frontier labs, we hope this might create a “race to the top” dynamic where competitive incentives are directly channeled into solving safety problems.

From a business perspective, we want to be clear that our RSP will not alter current uses of Claude or disrupt availability of our products. Rather, it should be seen as analogous to the pre-market testing and safety feature design conducted in the automotive or aviation industry, where the goal is to rigorously demonstrate the safety of a product before it is released onto the market, which ultimately benefits customers.

Anthropic’s RSP has been formally approved by its board and changes must be approved by the board following consultations with the Long-Term Benefit Trust. In the full document we describe a number of procedural safeguards to ensure the integrity of the evaluation process.

However, we want to emphasize that these commitments are our current best guess, and an early iteration that we will build on. The fast pace and many uncertainties of AI as a field imply that, unlike the relatively stable BSL system, rapid iteration and course correction will almost certainly be necessary.

The full document can be read here. We hope that it provides useful inspiration to policymakers, third party nonprofit organizations, and other companies facing similar deployment decisions.

The Long-Term Benefit Trust

Today we are sharing more details about our new governance structure called the Long-Term Benefit Trust (LTBT), which we have been developing since the birth of Anthropic. The LTBT is our attempt to fine-tune our corporate governance to address the unique challenges and long-term opportunities we believe transformative AI will present.

The Trust is an independent body of five financially disinterested members with an authority to select and remove a portion of our Board that will grow over time (ultimately, a majority of our Board). Paired with our Public Benefit Corporation status, the LTBT helps to align our corporate governance with our mission of developing and maintaining advanced AI for the long-term benefit of humanity.

Corporate Governance Basics
A corporation is overseen by its board of directors. The board selects and oversees the leadership team (especially the CEO), who in turn hire and manage the employees. The default corporate governance setup makes directors accountable to the stockholders in several ways. For example:

Directors are elected by, and may be removed by, stockholders.
Directors are legally accountable to stockholders for fulfilling their fiduciary duties.
Directors are often paid in shares of stock of the corporation, which helps to align their incentives with the financial interests of stockholders.

Importantly, the rights to elect, remove, and sue directors belong exclusively to the stockholders. Some wonder, therefore, whether directors of a corporation are permitted to optimize for stakeholders beyond the corporation’s stockholders, such as customers and the general public. This question is the subject of a rich debate, which we won’t delve into here. For present purposes, it is enough to observe that all the key mechanisms of accountability in corporate law push directors to prioritize the financial interests of stockholders.

Fine-tuning Anthropic’s Corporate Governance
Corporate governance has seen centuries of legal precedent and iteration, and views differ greatly on its effectiveness, strengths, and weaknesses. At Anthropic, our perspective is that the capacity of corporate governance to produce socially beneficial outcomes depends strongly on non-market externalities. Externalities are a type of market failure that occurs when a transaction between two parties imposes costs or benefits on a third party who has not consented to the transaction. Common examples of costs include pollution from factories, systemic financial risk from banks, and national security risks from weapons manufacturers. Examples of positive spillover effects include the societal benefits of education that reach beyond the individuals being educated, or investments in R&D that boost entire sectors beyond the company making the investment. Many parties who contract with a corporation, such as customers, workers, and suppliers, are capable of negotiating or demanding prices and terms that reflect the full costs and benefits of their exchanges. But other parties, such as the general public, don’t directly contract with a corporation and therefore do not have a means to charge or pay for the costs and benefits they experience.

The greater the externalities, the less we expect corporate governance defaults to serve the interests of non-contracting parties such as the general public. We believe AI may create unprecedentedly large externalities, ranging from national security risks, to large-scale economic disruption, to fundamental threats to humanity, to enormous benefits to human safety and health. The technology is advancing so rapidly that the laws and social norms that constrain other high-externality corporate activities have yet to catch up with AI; this has led us to invest in fine-tuning Anthropic’s governance to meet the challenge ahead of us.

To be clear, for most of the day-to-day decisions Anthropic makes, public benefit is not at odds with commercial success or stockholder returns, and if anything our experience has shown that the two are often strongly synergistic: our ability to do effective safety research depends on building frontier models (the resources for which are greatly aided by commercial success), and our ability to foster a “race to the top” depends on being a viable company in the ecosystem in both a technical sense and a commercial sense. We do not expect the LTBT to intervene in these day-to-day decisions or in our ordinary commercial strategy.

Rather, the need for fine-tuning of the governance structure ultimately derives from the potential for extreme events and the need to handle them with humanity’s interests in mind, and we expect the LTBT to primarily concern itself with these long-range issues. For example, the LTBT can ensure that the organizational leadership is incentivized to carefully evaluate future models for catastrophic risks or ensure they have nation-state level security, rather than prioritizing being the first to market above all other objectives.

Baseline: Public Benefit Corporation
One governance feature we have already shared is that Anthropic is a Delaware Public Benefit Corporation, or PBC. Like most large companies in the United States, Anthropic is incorporated in Delaware, and Delaware corporate law expressly permits the directors of a PBC to balance the financial interests of the stockholders with the public benefit purpose specified in the corporation’s certificate of incorporation, and the best interests of those materially affected by the corporation’s conduct. The public benefit purpose stated in Anthropic’s certificate is the responsible development and maintenance of advanced AI for the long-term benefit of humanity. This gives our board the legal latitude to weigh long- and short-term externalities of decisions–whether to deploy a particular AI system, for example–alongside the financial interests of our stockholders.

The legal latitude afforded by our PBC structure is important in aligning Anthropic’s governance with our public benefit mission. But we didn’t feel it was enough for the governance challenges we foresee in the development of transformative AI. Although the PBC form makes it legally permissible for directors to balance public interests with the maximization of stockholder value, it does not make the directors of the corporation directly accountable to other stakeholders or align their incentives with the interests of the general public. We set out to design a structure that would supply our directors with the requisite accountability and incentives to appropriately balance the financial interests of our stockholders and our public benefit purpose at key junctures where we expect the consequences of our decisions to reach far beyond Anthropic.

LTBT: Basic Structure and Features
The Anthropic Long-Term Benefit Trust (LTBT, or Trust) is an independent body comprising five Trustees with backgrounds and expertise in AI safety, national security, public policy, and social enterprise. The Trust’s arrangements are designed to insulate the Trustees from financial interest in Anthropic and to grant them sufficient independence to balance the interests of the public alongside the interests of Anthropic’s stockholders.

At the close of our Series C, we amended our corporate charter to create a new class of stock (Class T) held exclusively by the Trust.^[2] The Class T stock grants the Trust the authority to elect and remove a number of Anthropic’s board members that will phase in according to time- and funding-based milestones; in any event, the Trust will elect a majority of the board within 4 years. At the same time, we created a new director seat that will be elected by the Series C and subsequent investors to ensure that our investors’ perspectives will be directly represented on the board into the future.

The Class T stock also includes “protective provisions” that require the Trust to receive notice of certain actions that could significantly alter the corporation or its business.

The Trust is organized as a “purpose trust” under the common law of Delaware, with a purpose that is the same as that of Anthropic. The Trust must use its powers to ensure that Anthropic responsibly balances the financial interests of stockholders with the interests of those affected by Anthropic’s conduct and our public benefit purpose.

A Different Kind of Stockholder
In establishing the Long-Term Benefit Trust we have, in effect, created a different kind of stockholder in Anthropic. Anthropic will continue to be overseen by its board, which we expect will make the decisions of consequence on the path to transformative AI. In navigating these decisions, a majority of the board will ultimately have accountability to the Trust as well as to stockholders, and will thus have incentives to appropriately balance the public benefit with stockholder interests. Moreover, the board will benefit from the insights of Trustees with deep expertise and experience in areas key to Anthropic’s public benefit mission. Together we believe the insights and incentives supplied by the Trust will result in better decision making when the stakes are highest.

The gradual “phase-in” of the LTBT will allow us to course-correct an experimental structure and also reflects a hypothesis that, early in a company’s history, it can often function best with streamlined governance and not too many stakeholders; whereas as it becomes more mature and has more profound effects on society, externalities tend to manifest themselves progressively more, making checks and balances more critical.

A Corporate Governance Experiment
The Long-Term Benefit Trust is an experiment. Its design is a considered hypothesis, informed by some of the most accomplished corporate governance scholars and practitioners in the nation, who helped our leadership design and “red team” this structure.^[3] We’re not yet ready to hold this out as an example to emulate; we are empiricists and want to see how it works.

One of the most difficult design challenges was reconciling the imperative for the Trust structure to be resilient to end runs while the stakes are high with the reality of the Trust’s experimental nature. It’s important to prevent this arrangement from being easily undone, but it is also rare to get something like this right on the first try. We have therefore designed a process for amendment that carefully balances durability with flexibility. We envision that most adjustments will be made by agreement of the Trustees and Anthropic’s Board, or the Trustees and the other stockholders. Owing to the Trust’s experimental nature, however, we have also designed a series of “failsafe” provisions that allow changes to the Trust and its powers without the consent of the Trustees if sufficiently large supermajorities of the stockholders agree. The required supermajorities increase as the Trust’s power phases in, on the theory that we’ll have more experience–and less need for iteration–as time goes on, and the stakes will become higher.

Meet the Initial Trustees

The initial Trustees are:

Jason Matheny: CEO of the RAND Corporation
Kanika Bahl: CEO & President of Evidence Action
Neil Buddy Shah: CEO of the Clinton Health Access Initiative (Chair)
Paul Christiano: Founder of the Alignment Research Center
Zach Robinson: Interim CEO of Effective Ventures US

The Anthropic board chose these initial Trustees after a year-long search and interview process to surface individuals who exhibit thoughtfulness, strong character, and a deep understanding of the risks, benefits, and trajectory of AI and its impacts on society. Trustees serve one-year terms and future Trustees will be elected by a vote of the Trustees. We are honored that this founding group of Trustees chose to accept their places on the Trust, and we believe they will provide invaluable insight and judgment.

^[1] As a general matter, Anthropic has consistently found that working with frontier AI models is an essential ingredient in developing new methods to mitigate the risk of AI.

^[2] An earlier version of the Trust, which was then called the “Long-Term Benefit Committee,” was written into our Series A investment documents in 2021, but since the committee was not slated to elect its first director until 2023, we took the intervening time to red-team and improve the legal structure and to carefully consider candidate selection. The current LTBT is the result.

^[3] The Trust structure was designed and “red teamed” with immeasurable assistance by John Morley of Yale Law School, David Berger, Amy Simmerman, and other lawyers from Wilson Sonsini, and by Noah Feldman and Seth Berman from Harvard Law School and Ethical Compass Advisors.

Larks · 1y · 5

Thanks very much for sharing this!

"The Anthropic board chose these initial Trustees after a year-long search and interview process to surface individuals who exhibit thoughtfulness, strong character, and a deep understanding of the risks, benefits, and trajectory of AI and its impacts on society. Trustees serve one-year terms and future Trustees will be elected by a vote of the Trustees."

I am curious about the decision for one-year terms. Given that the board chooses its own successors, and no-one on the board is full-time, I worry that the search for successors will end up occupying a huge fraction of their time. Also, non-overlapping terms might mean a loss of institutional memory. Did you consider a staggered setup?

Zach Stein-Perlman · 1y · 5

An Anthropic staff member says:

One year is actually the typical term length for board-style positions, but because members can be re-elected their tenure is often much longer. In this specific case of course it's now up to the trustees!

Larks · 1y · 3

Thanks for the explanation!

SummaryBot · 1y · 3

Executive summary: Anthropic has introduced a Responsible Scaling Policy (RSP) to manage risks associated with increasingly capable AI systems, and a Long-Term Benefit Trust (LTBT) to ensure that corporate governance decisions align with the long-term benefits of humanity.

Key points:

Anthropic's Responsible Scaling Policy defines safety levels for AI models, with stricter requirements at higher levels to manage catastrophic risks.
Anthropic will pause development if scaling outpaces ability to meet safety requirements, incentivizing solving key problems.
The Long-Term Benefit Trust helps align Anthropic's governance to its mission by granting a disinterested body authority over board selection.
The Trust will phase in its board control based on time and funding milestones.
The Trust structure aims to balance stockholder interests, public benefit, and long-term externalities in AI development.
The governance innovations are experiments intended for refinement based on experience.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.
