Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required.

Subscribe here to receive future versions.

Policy Proposals from NTIA’s Request for Comment

The National Telecommunications and Information Administration (NTIA) publicly requested comments on AI governance from academics, think tanks, industry leaders, and concerned citizens. It asked 34 questions and received more than 1,400 responses on how to govern AI for the public benefit. This week, we cover some of the most promising proposals found in the NTIA submissions.

Technical Proposals for Evaluating AI Safety

Several NTIA submissions focused on the technical question of how to evaluate the safety of an AI system. We review two areas of active research: red-teaming and transparency. 

Red Teaming: Acting like an Adversary

Several submissions proposed government support for evaluating AIs via red teaming. In this evaluation method, a “red team” deliberately tries to make an AI system fail or behave in dangerous ways. By identifying risks from AI models, red teaming helps AI developers decide priorities for safety research and whether or not to deploy a new model. 

Red teaming is a dynamic process. Attackers can vary their methods and search for multiple different undesirable behaviors in order to explore new potential risks. Static benchmarks, on the other hand, evaluate well-known risks by giving models a written test. For example, CAIS built a benchmark that measures whether AIs behave immorally in text-based scenarios.

Red teaming can be performed by either internal or external teams. For internal red teaming to be effective, internal auditors need independence to report their results without fear of retribution or interference from company executives. External auditors can complement internal efforts, as suggested by several NTIA submissions. But collaborating with external researchers requires strong information security. Otherwise, AI systems could be leaked beyond their intended audience, as happened with Meta’s LLaMa model.

Transparency: Understanding AIs From the Inside

Other submissions advocated research on understanding how AIs make decisions. Known as transparency, interpretability, or explainability, this has been a perennial goal of AI research, but more advanced AI systems have largely become more inscrutable over time.

The field has often fallen prey to false hopes, such as the idea that we can gain valuable information from post-hoc explanations of AI decisions. Suppose a corporate chatbot says that you should drink Coca-Cola because it’s “delicious” and “an American classic.” This might seem like a plausible explanation for its recommendation. But if you later learn that the chatbot’s creator has an advertising partnership with Coca-Cola, you won’t have any doubt about the real reason it wanted you to drink Coke. These false explanations can be worse than treating AIs as a black box, because they encourage people to trust AI decisions even though the underlying reasoning processes remain opaque.

Beyond the inner workings of an AI system, transparency about the training and deployment process can be helpful. Hugging Face recommended using data sheets and data statements to disclose the data used during model training, which could help ensure that developers do not use datasets with known bias, misinformation, or copyrighted material. Similarly, Microsoft argued that AI providers should always identify videos, images, or text produced by their AIs.

Governance Proposals for Improving Safety Processes

Getting AI developers to implement the most advanced safety methods will be a challenge unto itself. Startups often have a “move fast, break things” mindset that might work for free-to-use websites and smartphone applications, but could prove dangerous when building a technology that poses societal-scale risks. Several NTIA submissions therefore proposed governance mechanisms for ensuring that the organizations developing and deploying AIs adopt best practices for safety.

Requiring a License for Frontier AI Systems

At the Senate hearing on AI, OpenAI CEO Sam Altman recommended “a new agency that licenses any effort above a certain scale of capabilities and can take that license away and ensure compliance with safety standards.” Several submissions, including the Center for AI Safety’s submission, supported a licensing system that could reduce pressure on companies to race ahead while cutting corners on safety, instead promoting best practices in AI safety. 

Startups and open source developers would likely be unaffected by these requirements, as current proposals only require licenses for a small handful of “frontier” AI systems such as OpenAI’s GPT-4 and Google’s PaLM. Before training a frontier model, companies could be required to strengthen their information security so that adversaries cannot steal their models, and to improve their corporate governance with gradual deployment of models, incident response plans, and internal reviews of potentially dangerous research before publication.

Licensing could also encourage the development of better techniques for evaluating AI safety. After a company develops an AI system, they could be required to affirmatively demonstrate its safety, an application of the precautionary principle resembling how drug developers must prove their products safe to the FDA. This would incentivize companies to invest in model evaluation techniques such as red teaming and transparency.

If the federal government would rather not directly license models itself, Anthropic suggested that it could depend upon the expertise of third-party auditors. Auditors like the Big Three credit rating agencies are relied upon by the SEC to produce accurate analysis of financial products. After the Enron scandal, Congress supported financial auditors by passing the Sarbanes-Oxley Act which, among other things, made it illegal for corporate executives to falsify information submitted to auditors. Lawmakers could take similar steps to support AI auditors with the technical expertise to ensure that AI developers are meeting our public goals for safety.

Unifying Sector-Specific Expertise and General AI Oversight

Many federal agencies have long-standing expertise in regulating particular applications of AI. The FTC recently warned against using AI deceptively to trick consumers into making harmful purchases or decisions. Similarly, the National Institute of Justice recently hosted a research symposium on how to productively use AI algorithms in the criminal justice system. Several NTIA submissions, including those from OpenAI and Google, highlighted the critical role in AI governance of federal agencies with expertise in these areas.

AI systems with a wide range of capabilities, such as recent language models, might stretch the limits of a sector-specific approach to AI governance. The European Union has been considering this challenge recently, with many parliamentarians calling for policies that specifically address general purpose AI systems. 

Several NTIA submissions argued that the United States might need to similarly adapt to a world of more general AI systems. For example, the Center for Democracy and Technology wrote in their submission that “existing laws such as civil rights statutes provide basic rules that continue to apply, but those laws were not written with AI in mind and may require change and supplementation to serve as effective vehicles for accountability.”

Does Antitrust Prevent Cooperation Between AI Labs?

Competitive pressure between AI labs might lead them to release new models too quickly, or with dangerous capabilities. To reduce that risk, we may want AI labs to cooperate with each other on safe AI development. This might include sharing safety testing results and methods, or it might involve active collaboration. OpenAI’s charter even includes a clause committing it to stop competing with and start assisting another organization if that organization comes close to building AGI first.

However, as Anthropic notes in its comment, it’s possible that by cooperating to improve safety, AI labs would run afoul of existing antitrust laws. They suggest that regulators should clarify antitrust guidelines as to when coordination between AI labs should be allowed. 

Reconsidering Instrumental Convergence

As the field of AI safety grows, it’s important to continue questioning and refining our beliefs on the topic. One common argument is the instrumental convergence thesis, which holds that regardless of an agent's final goal, it will likely be rational for the agent to pursue certain subgoals, such as power-seeking and self-preservation, in service of that goal.

A new draft paper from CAIS questions this claim. Power and self-preservation are absolutely useful for achieving many goals that we might care about, the paper recognizes. But it does not logically follow that agents will pursue power and self-preservation in most (or any) circumstances. There might be costs involved in pursuing these goals, including the opportunity cost of time and effort not spent on other strategies, and success might not be guaranteed. Further, AI agents could have aversions to gaining power and self-preservation, perhaps as the result of intentional design by AI developers. The paper shows mathematically that if the desires of an agent are initialized randomly (in line with the so-called orthogonality thesis, which claims that any goals are compatible with any level of intelligence), there is no reason to think that the agent will be power-seeking or act to preserve itself. A simple analogy to humans applies here: Some of our goals would be easier to attain if we were immortal or omnipotent, but few choose to spend their lives in pursuit of these goals. 

This is not an argument that AI agents will never pursue power: the goals of AI systems won't be randomly chosen. Empirically, research shows that AI agents trained to maximize performance in text-based games often lie, cheat, and steal to improve their scores. From a higher-level perspective, agents that successfully self-propagate will have more influence in the future than other agents. There are many reasons to believe that AIs will often pursue unexpected and even dangerous goals; this paper simply argues that this would not be true of agents with randomly-initialized goals.


First, updates on AI policy in the United States, European Union, and United Kingdom. 

  • OpenAI lobbied the European Union to argue that GPT-4 is not a ‘high-risk’ system. Regulators assented, meaning that under the current draft of the EU AI Act, key governance requirements would not apply to GPT-4. 
  • A new Congressional bill would hold AI providers responsible for illegal content disseminated through their systems, unlike the controversial Section 230 law that immunized social media platforms against lawsuits for illegal content on their sites. 
  • Illinois passed a bill to allow law enforcement to use drones for monitoring large public events, as long as the drones do not use weapons or facial recognition technology.
  • Senator Chuck Schumer provides an update on a bipartisan push for legislation on AI. 
  • The UK has ousted tech advisors who failed to foresee the importance of large language models and other recent AI developments, and has appointed Ian Hogarth as the Chair of its AI Foundation Model Taskforce. The taskforce is seeking advisors.

Second, news about AI models. 

Finally, a few articles relevant to AI safety research.  

  • GPT-4 can synthesize chemicals by writing instructions for using lab equipment. 
  • Training AIs on the outputs of other AIs has sharp drawbacks, finds a new paper.
  • Political messaging is 3x more likely to change someone’s views when targeted towards a specific individual, finds a new study. AI systems could potentially increase the power of personalized persuasion.
  • Leading AI models often do not comply with the EU AI Act’s requirements, finds new research from Stanford University. 
  • AI developers will need to prioritize safety in order to avoid accidents, misuse, or loss of control of their AI systems. A new article outlines the challenge and paths forward. 
  • The journal Risk Analysis is doing a special issue on AI risk. Submissions are due on December 15, 2023. 
  • Yoshua Bengio writes an FAQ on catastrophic AI risks.

See also: CAIS website, CAIS Twitter, a technical safety research newsletter, and An Overview of Catastrophic AI Risks.

OpenAI lobbied the European Union to argue that GPT-4 is not a ‘high-risk’ system. Regulators assented, meaning that under the current draft of the EU AI Act, key governance requirements would not apply to GPT-4. 

Somebody shared this comment from Politico, which claims that the above article is not an accurate representation:

European lawmakers beg to differ: Both Socialists and Democrats’ Brando Benifei and Renew’s Dragoș Tudorache, who led Parliament’s work on the AI Act, told my colleague Gian Volpicelli that OpenAI never sent them the paper, nor reached out until 2023. When he met an OpenAI delegation in April, Tudorache said, the relevant text had already been agreed upon. 

A simple analogy to humans applies here: Some of our goals would be easier to attain if we were immortal or omnipotent, but few choose to spend their lives in pursuit of these goals.

I feel like the "fairer" analogy would be optimizing for financial wealth, which is arguably also as close to omnipotence as one can get as a human, and then actually a lot of humans are pursuing this. Further, I might argue that currently money is much more of a bottleneck for people than longevity for ~everyone to pursue their ultimate goals. And for the rare exceptions (maybe something like the wealthiest 10k people?) those people actually do invest a bunch in their personal longevity? I'd guess at least 5% of them?

I was not a huge fan of the instrumental convergence paper, although I didn't have time to thoroughly review it. In short, it felt too slow in making its reasoning and conclusion clear, and once (I think?) I understood what it was saying, it felt quite nitpicky (or a borderline motte-and-bailey). In reality, I'm still unclear if/how it responds to the real-world applications of the reasoning (e.g., explaining why a system with a seemingly simple goal like calculating digits of pi would want to cause the extinction of humanity).

The summary in this forum post seems to help, but I really feel like the caveat identified in this post ("this paper simply argues that this would not be true of agents with randomly-initialized goals") is not made clear in the abstract.[1]

  1. The abstract mentions "I find that, even if intrinsic desires are randomly selected [...]" but this does not at all read like a caveat, especially due to the use of "even if" (rather than just "if").
