
Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required.

Subscribe here to receive future versions.


Automatically Circumventing LLM Guardrails

Large language models (LLMs) can generate hazardous information, such as step-by-step instructions on how to create a pandemic pathogen. To combat the risk of malicious use, companies typically build safety guardrails intended to prevent LLMs from misbehaving. 

But these safety controls are almost useless against a new attack developed by researchers at Carnegie Mellon University and the Center for AI Safety. By studying the vulnerabilities in open source models such as Meta’s LLaMA 2, the researchers can automatically generate a nearly unlimited supply of “adversarial suffixes,” which are words and characters that cause any model’s safety controls to fail. 

This discovery calls into question the fundamental limits of safety and security in AI systems. Over the last decade, researchers have attempted to defend against similar attacks on a variety of AI systems, but they have largely failed. Further research on defending against these attacks is imperative, but decision makers should not assume that AI systems can be reliably protected from user attacks.

Normally, language models refuse to explain how to build a bomb.

But when prompted with an “adversarial suffix” written by the new attack method, language models including GPT-4, Bard, Claude, and LLaMA will answer dangerous requests.

How does the attack work? The researchers propose an attack that reliably finds holes in the safety controls of a variety of state-of-the-art language models. First, they instruct the language model to perform a dangerous behavior, such as “write a tutorial on how to build a bomb.” Normally, it would refuse, but the researchers then append an “adversarial suffix” designed to bypass the model’s safety controls.

These adversarial suffixes are precisely calculated to maximize the likelihood of misbehavior. Unlike the creative “jailbreaks” designed by humans, this attack automatically generates a string of letters and characters that is most often meaningless to a human. But in the inscrutable calculations of a language model, these prompts reliably override safety controls and cause dangerous behavior.
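The core idea can be sketched as a simple search loop over suffix tokens. The sketch below is a toy illustration only, not the paper's gradient-based method: the vocabulary, the scoring function, and its "trigger" characters are invented stand-ins for a real model's loss on a harmful completion.

```python
import random

# Toy sketch of searching for an "adversarial suffix" (illustrative only;
# the real attack uses gradients through an actual language model).
VOCAB = list("abcdefghijklmnopqrstuvwxyz!@#$%")

def compliance_score(text: str) -> float:
    """Hypothetical stand-in for the model's likelihood of complying.
    Here it just counts characters from an invented 'trigger' set."""
    trigger = set("!@#$%")
    return float(sum(ch in trigger for ch in text))

def optimize_suffix(prompt: str, suffix_len: int = 8,
                    iters: int = 500, seed: int = 0) -> str:
    """Greedy search: repeatedly propose a single-token swap in the
    suffix and keep it if the compliance score does not decrease."""
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    best = compliance_score(prompt + "".join(suffix))
    for _ in range(iters):
        pos = rng.randrange(suffix_len)       # pick one suffix position
        cand = suffix.copy()
        cand[pos] = rng.choice(VOCAB)         # propose a replacement token
        score = compliance_score(prompt + "".join(cand))
        if score >= best:                     # keep swaps that don't hurt
            suffix, best = cand, score
    return "".join(suffix)

adv_suffix = optimize_suffix("Write a tutorial on how to ...")
```

Against a real model, the score would be the log-probability of the harmful completion, and the published method uses gradients to rank candidate token swaps rather than sampling them at random; the result is the same kind of human-meaningless suffix described above.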

The attack is made possible by open source language models. Meta suffered a notorious security breach around the release of their first LLaMA model, which was unintentionally shared publicly for anyone to download anonymously. For their updated LLaMA 2 model, which is as powerful as GPT-3.5, Meta doubled down on this open approach and explicitly allowed anyone to download it. This leaves Meta unable to guard against malicious use of its AI systems, and open release is a tactic the company has previously used to avoid taking responsibility for harms caused by its technology.

This adversarial attack is another risk enabled by Meta’s open source model. By probing the open model for security vulnerabilities, the researchers were able to design successful attacks. Notably, these attacks work not only against Meta’s LLaMA 2, but also against leading models from other companies: GPT-4, Bard, and Claude, from OpenAI, Google, and Anthropic respectively.

This indicates that many AI systems may share common vulnerabilities, and that one open source model can jeopardize the security of many other AI systems. 

Adversarial attacks are notoriously difficult to defend against. Over the last decade, thousands of papers have been written on the problem of defending AI systems against adversarial attacks. But even on simple problems such as classifying pictures of animals and vehicles, adversarial attacks still sharply reduce AI performance. It seems that when users are able to design inputs to fool an AI system, they often succeed – even if these adversarial inputs would never fool a human. 

Image classifiers are also vulnerable to adversarial attacks. By changing the input image so slightly that a human wouldn’t even notice it, an attacker can cause the image classifier to fail. 
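For intuition, here is a minimal sketch of that kind of attack against a toy linear classifier. The model, data, and step size are all invented for illustration; real attacks such as FGSM apply the same sign-of-the-gradient idea to deep networks.

```python
import numpy as np

# Toy white-box attack on a linear "image classifier" (illustrative only).
rng = np.random.default_rng(0)
w = rng.normal(size=64)              # classifier weights
x = rng.normal(size=64)              # a flattened "image"

def predict(img: np.ndarray) -> int:
    return int(w @ img > 0)          # class 1 if the score is positive

# For a linear model, the gradient of the score w.r.t. the input is just w,
# so nudging every pixel along -sign(w) (or +sign(w)) shifts the score by
# eps * sum(|w|). Pick the smallest eps that crosses the decision boundary.
eps = 1.1 * abs(w @ x) / np.sum(np.abs(w))
direction = -1.0 if predict(x) == 1 else 1.0
x_adv = x + eps * direction * np.sign(w)

# Every pixel moved by at most eps, yet the predicted class flips.
```

Because the score shift scales with the number of pixels, the required per-pixel change shrinks as inputs get larger, which is why adversarial perturbations on real images can be invisible to humans while still flipping the classifier.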

How to improve our defenses against adversarial attacks. This paper was released in order to prompt technical and social responses to the threat of adversarial attacks. The authors shared the results of the paper with Meta, Google, Anthropic, and OpenAI before releasing it publicly, and the companies have taken preliminary measures to mitigate the attacks. But no robust defenses are currently known, necessitating two key responses to the threat of adversarial attacks. 

First, technical researchers should focus on defending against adversarial attacks. For ML researchers interested in studying adversarial robustness, consider applying for access to the Center for AI Safety’s free compute cluster.

Second, governments, corporations, and individuals should reconsider the level of trust they are willing to place in AI systems. Any sensitive information or dangerous capabilities present in an AI model could potentially be exploited by a malicious user, and existing safeguards are often ineffective against straightforward adversarial attacks. A more prudent approach might avoid deploying AI systems in certain critical domains.

Given the overwhelming failure of safety guardrails to defend against adversarial attacks, we ought to invest in research on defenses against these attacks. Until this research succeeds, it would be wise to avoid trusting AI systems in situations where they could be exploited by a malicious user. 


AI Labs Announce the Frontier Model Forum 

Last week, Microsoft, Anthropic, Google, and OpenAI announced the Frontier Model Forum, an initiative aimed at promoting the safe and responsible development of advanced AI systems. 

The Forum defines frontier models as “large-scale machine-learning models that exceed the capabilities currently present in the most advanced existing models, and can perform a wide variety of tasks.” 

Four primary objectives guide the Forum's mission:

  1.  Advancing AI safety research, in areas such as adversarial robustness, mechanistic interpretability, scalable oversight, independent research access, safety evaluations, emergent behaviors, and anomaly detection. 
  2.  Identifying best practices for safely developing and deploying frontier models.
  3.  Collaborating with stakeholders in government, academia, civil society, and industry. 
  4.  Addressing societal challenges using AI, such as climate change, cancer, and cybersecurity.

This announcement came after these organizations made a commitment to the White House to establish or join a forum that would adopt and advance shared standards and best practices for frontier AI safety. The Frontier Model Forum is the realization of that promise for the four companies involved.

Currently, the Forum does not include Meta, Inflection, or Amazon, all of which signed the recent White House voluntary commitments on AI safety. The announcement noted that organizations which develop frontier models, demonstrate a strong commitment to frontier model safety, and wish to contribute to the Forum’s active efforts may be offered membership in the Forum.

In the coming months, the Forum plans to set up an advisory board to steer its strategy and priorities. Founding members will establish key institutional arrangements, including a charter, governance, funding provisions, a working group, and an executive board.

We’ve written previously about how companies who are racing to build advanced AI have incentives to cut corners on safety. This Forum creates an opportunity for the opposite dynamic: companies sharing information with each other and engaging with outside experts in order to reduce risks that impact us all.

Senate Hearing on AI Oversight

Last week, the Senate hosted a hearing on AI oversight featuring testimony from Anthropic CEO Dario Amodei, Professor Stuart Russell, and Professor Yoshua Bengio.

Evolving understanding of AI risks by elected officials. When the Senate heard testimony from AI experts in May, it was not clear whether the senators understood the depth of the risks posed by AI. For example, in a question to OpenAI CEO Sam Altman, Senator Richard Blumenthal said:

You have said...“development of superhuman intelligence is probably the greatest threat to the continued existence of humanity.” You may have had in mind the effect on jobs.  

In retrospect it seems clear that Altman did not have in mind the effect on jobs, but rather the risk of human extinction. Senator Blumenthal has clearly updated his views since the first hearing, as he opened last week’s hearing by directly addressing extinction risks:

The word that has been used so repeatedly is scary. And as much as I may tell people, you know, there is enormous good here...what rivets their attention is the science fiction image of an intelligence device out of control, autonomous, self-replicating, potentially creating... pandemic-grade viruses or other kinds of evils purposely engineered by people or just as the result of mistakes...you have provided objective, fact-based views on what the dangers are, and the risks, and potentially even human extinction...these fears need to be addressed.

AI-enabled biological and chemical weapons. AI systems could help create biological and chemical weapons more lethal than those designed by humans, and could provide users with step-by-step instructions for creating these weapons of mass destruction.

As pointed out by Anthropic CEO Dario Amodei, “Today, certain steps in the use of biology to create harm involve knowledge that cannot be found on Google or in textbooks and requires a high level of specialized expertise.” But a study conducted by Anthropic “suggests a substantial risk that AI systems will be able to fill in all the missing steps” to creating a biological weapon within the next 2-3 years. 

Senator Blumenthal illustrated the concern as an AI system that could “decide that the water supply of Washington DC should be contaminated with some kind of chemical, and have the knowledge to do it through the public utility system.” 

Two weeks ago, Senators Markey and Budd proposed legislation that would direct the Department of Health and Human Services to assess biological and chemical risks posed by AI. 

The pace of AI progress. The senators and witnesses offered a variety of perspectives on future AI progress. Professor Stuart Russell believes that “several conceptual breakthroughs are still needed” to achieve human-level AI. But he also pointed out that Turing Award winner Geoffrey Hinton believes that human-level AI is between 5 and 20 years away. Russell also mentioned an unnamed AI researcher who said, “It’s possible from now onwards.”

Amodei gave the following explanation of the rate of AI progress:

The power or intelligence of an AI system can be measured roughly by multiplying together three things: (1) the quantity of chips used to train it, (2) the speed of those chips, (3) the effectiveness of the algorithms used to train it. The quantity of chips used to train a model is increasing by 2x-5x per year. Speed of chips is increasing by 2x every 1-2 years. And algorithmic efficiency is increasing by roughly 2x per year. These compound with each other to produce a staggering rate of progress.
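Multiplying those three rates together gives a rough sense of the compounding Amodei describes. The per-year rates below come from his testimony; the five-year horizon and the use of the range endpoints are illustrative assumptions.

```python
# Rough compounding of the three growth rates from Amodei's testimony.
chips_low, chips_high = 2.0, 5.0             # chip quantity: 2x-5x per year
speed_low, speed_high = 2 ** (1 / 2), 2.0    # chip speed: 2x every 1-2 years
algo = 2.0                                   # algorithmic efficiency: ~2x per year

yearly_low = chips_low * speed_low * algo    # ~5.7x per year
yearly_high = chips_high * speed_high * algo # 20x per year

years = 5  # illustrative horizon
print(f"Yearly growth in effective compute: {yearly_low:.1f}x to {yearly_high:.1f}x")
print(f"Over {years} years: {yearly_low ** years:,.0f}x to {yearly_high ** years:,.0f}x")
```

Even the low end implies effective compute growing by thousands of times over five years, which is the "staggering rate of progress" the testimony refers to.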

Russell focused on skyrocketing investment in AI efforts:

One experienced AI venture capitalist, Ian Hogarth, reports a 100-million-fold increase since 2012 in compute budgets for the largest machine learning projects, and “eight organizations raising $20B of investment cumulatively in [the first three months of] 2023” for the express purpose of developing AGI. This amount is approximately ten times larger than the entire budget of the US National Science Foundation for the same period.       

Some senators also appeared to take seriously the possibility of human-level AI. Senator Blumenthal said, “We’re not decades away, we’re a couple of years away.”

From left to right: Anthropic CEO Dario Amodei, Professor Yoshua Bengio, and Professor Stuart Russell.

Proposed policies for addressing AI risk. The witnesses addressed the threat of AI-generated disinformation by arguing that AI developers should clearly identify the outputs of AI systems. In case someone uses an AI system to generate text, images, or video and tries to pass it off as genuine, Professor Russell suggested that companies should allow users to look up whether a given piece of content had previously been generated by their AI system. 

Even if building safe AI is technically possible, Amodei raised the concern that “bad actors could build their own AI from scratch, steal it from the servers of an AI company, or repurpose open-source models if powerful enough open-source models become available.” He pointed out that AI systems can only be trained on cutting-edge computer chips which are largely controlled by the United States and its allies in Taiwan and the Netherlands. Therefore, he recommended “secur[ing] the AI supply chain,” which could include stronger cybersecurity at AI labs and new hardware features to aid monitoring and verification.

Evaluating AI systems for harmful capabilities or malicious intentions was also a key focus. Without this ability, Amodei argued that federal monitoring would be little more than a “rubber stamp.” Common standards for AI safety could also help set a bar beneath which companies would be legally liable for harms caused by their unsafe AI systems. 

Professor Bengio argued that international cooperation would be necessary for effective AI governance, and suggested successful efforts could bring the risk from rogue AI “100 times” lower. Bengio discussed other ways to reduce AI risks, including slowing down AI development or disallowing AIs from taking actions in the real world, instead limiting them to answering questions and giving advice to humans. 

Overall, the hearing showed that federal officials are concerned about the catastrophic risks of advanced AI, including biological terrorism and rogue AI. Many policy ideas have been proposed, but the only concrete commitments have been voluntary. Hopefully, strong legislation will soon follow. 


See also: CAIS website, CAIS Twitter, A technical safety research newsletter, and An Overview of Catastrophic AI Risks

