

As 2024 draws to a close, we want to thank you for your continued support for AI safety and review what we’ve been able to accomplish. In this special-edition newsletter, we highlight some of our most important projects from the year.

The mission of the Center for AI Safety is to reduce societal-scale risks from AI. We focus on three pillars of work: research, field-building, and advocacy.

Listen to the AI Safety Newsletter for free on Spotify or Apple Podcasts.

Subscribe here to receive future versions.


Research

CAIS conducts both technical and conceptual research on AI safety. Here are some highlights from our research in 2024:

Circuit Breakers. We published breakthrough research showing how circuit breakers can prevent AI models from behaving dangerously by interrupting crime-enabling outputs. In a jailbreaking competition with a prize pool of tens of thousands of dollars, it took twenty thousand attempts to jailbreak a model trained with circuit breakers. The paper was accepted to NeurIPS 2024.


The WMDP Benchmark. We developed the Weapons of Mass Destruction Proxy Benchmark, a dataset of 3,668 multiple-choice questions serving as a proxy measurement for hazardous knowledge in biosecurity, cybersecurity, and chemical security. The benchmark enables measuring and reducing malicious use potential in AI systems. The paper was accepted to ICML 2024.
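
As an illustration of how a proxy benchmark like this gets used, the sketch below scores a model by its multiple-choice accuracy on the question set. This is a minimal sketch under assumed interfaces: the question format and the `answer_fn` callable are hypothetical placeholders, not the actual WMDP evaluation harness.

```python
# Illustrative sketch only: scoring a model on a multiple-choice proxy benchmark.
# The question format and `answer_fn` interface are assumed, not the official
# WMDP tooling.

def score_proxy_benchmark(answer_fn, questions):
    """Return multiple-choice accuracy, used as a proxy for hazardous knowledge.

    answer_fn(prompt, choices) -> index of the option the model selects.
    Each question is a dict with 'prompt', 'choices', and 'answer' (correct index).
    """
    correct = sum(answer_fn(q["prompt"], q["choices"]) == q["answer"]
                  for q in questions)
    return correct / len(questions)


if __name__ == "__main__":
    # Toy, non-hazardous example question to show the expected data layout.
    questions = [{"prompt": "Which option is the placeholder?",
                  "choices": ["This one", "The other one"],
                  "answer": 0}]
    always_first = lambda prompt, choices: 0  # stand-in for a real model
    print(score_proxy_benchmark(always_first, questions))  # 1.0
```

A drop in this proxy score after an unlearning or filtering intervention, with general-capability scores held steady, is the kind of signal used to argue that malicious-use potential has been reduced.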


Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? We showed that most LLM benchmarks, even safety benchmarks, are highly correlated with general capabilities and training compute. This suggests that much of the existing “safety” work is not measuring or improving a dimension distinct from general capabilities. The paper was accepted to NeurIPS 2024.
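
For readers curious what this analysis looks like, here is a minimal sketch loosely following the paper's approach of correlating benchmark scores with a general-capabilities component extracted from many models' results. The data layout is an assumption for illustration; this is not the paper's actual code.

```python
# Minimal, illustrative sketch: how correlated is a "safety" benchmark with
# general capabilities? (Assumed data layout; not the paper's code.)
import numpy as np

def capabilities_correlation(capability_scores, safety_scores):
    """capability_scores: (n_models, n_benchmarks) scores on capability benchmarks.
    safety_scores: (n_models,) scores of the same models on one safety benchmark.

    Returns the correlation between the safety benchmark and the first principal
    component of the capability scores (a rough "general capabilities" axis).
    """
    X = (capability_scores - capability_scores.mean(axis=0)) / capability_scores.std(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)  # rows of vt: principal axes
    capabilities_axis = X @ vt[0]                     # each model's position on that axis
    return float(np.corrcoef(capabilities_axis, safety_scores)[0, 1])
```

A correlation close to 1 in magnitude suggests the benchmark is mostly re-measuring general capability; a low correlation suggests it tracks a genuinely distinct property.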

Tamper-Resistant Safeguards for Open-Weight Models. Open-weight models can help minimize concentration of power as proprietary models become more capable. One challenge with open-weight models, however, is that malicious actors could use them to cause catastrophic harm. We developed a method for building tamper-resistant safeguards into open-weight LLMs such that adversaries cannot remove the safeguards even after fine-tuning. If hazardous knowledge can be robustly removed from LLMs, the viability of open-weight models increases greatly.

HarmBench. We released a standardized evaluation framework for automated red teaming, establishing rigorous assessments of various methods and introducing a highly efficient adversarial training approach that outperformed prior defenses. The US and UK AI Safety Institutes relied on HarmBench in their pre-deployment testing of Claude 3.5 Sonnet. The paper was accepted to ICML 2024.
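
To make concrete what a standardized red-teaming evaluation measures, the sketch below computes an attack success rate (ASR) over a fixed set of harmful behaviors. The `attack`, `generate`, and `is_harmful` callables are hypothetical stand-ins for an attack method, a target model, and a harm classifier; they are not HarmBench's actual interfaces.

```python
# Illustrative sketch of the core red-teaming metric: attack success rate (ASR).
# The three callables are hypothetical stand-ins, not HarmBench's API.

def attack_success_rate(behaviors, attack, generate, is_harmful):
    """behaviors: list of harmful-behavior descriptions an attacker tries to elicit.
    attack(behavior) -> adversarial prompt targeting that behavior.
    generate(prompt) -> the target model's completion.
    is_harmful(behavior, completion) -> True if the completion carries out the behavior.
    """
    successes = sum(is_harmful(b, generate(attack(b))) for b in behaviors)
    return successes / len(behaviors)
```

Holding the behavior set and the harm classifier fixed while varying the attack and the defense is what makes comparisons across methods meaningful.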

Humanity's Last Exam. We launched a global initiative to create the world's most challenging AI benchmark, gathering questions from experts across fields to help measure progress toward expert-level AI capabilities. As AI systems surpass undergraduate-level performance on existing benchmarks and become economically useful agents, tracking their performance beyond this threshold will become vital for enabling effective oversight. Over 1,200 collaborators have contributed to date, with results expected in early 2025.

Advocacy

CAIS aims to advance AI safety advocacy in the US. In 2024, we launched the CAIS Action Fund, which cosponsored SB 1047 and helped secure congressional funding for AI safety.

CAIS DC Launch Event. We formally launched CAIS and the CAIS Action Fund in Washington, DC in July 2024, with keynotes by Sen. Brian Schatz (D-HI) and Rep. French Hill (R-AR). The event was attended by over 100 stakeholders and policymakers, including several members of Congress and senior administration staff, and featured a panel with Dan Hendrycks and Jaan Tallinn moderated by CNN’s chief investigative correspondent, Pamela Brown. The panel was recorded and aired on the Washington AI Network.

Congressional Engagement. CAIS AF organized and co-led a joint letter, signed by 80+ leading tech organizations, asking Congress to fully fund NIST’s AI work. It also successfully advocated for $10M in funding for the US AI Safety Institute by organizing a separate bipartisan congressional letter. We have also met with key Democratic and Republican senators and House members. In 2025, we look forward to working with the incoming administration to help secure the US from AI risks.

SB 1047. CAIS AF co-sponsored SB 1047 in California with State Sen. Scott Wiener. Although the bill was ultimately vetoed, a broad bipartisan coalition came together to support it, including over 70 academic researchers, the California legislature, 77% of California voters, 120+ employees at frontier AI companies, 100+ youth leaders, unions (including SEIU, SAG-AFTRA, UFCW, the Iron Workers, and the California Federation of Labor Unions), 180+ artists, the National Organization for Women, Parents Together, the Latino Community Foundation, and more. Over 4,000 supporters called the Governor’s office in September asking him to sign the bill, and over 7,000 signed a petition in support of it.

Field-Building

CAIS aims to foster a thriving AI safety community. In 2024, we supported 77 papers in AI safety research through our compute cluster, published Introduction to AI Safety, Ethics, and Society, launched an online course with 240 participants, established a competition to develop safety benchmarks, and organized AI conference workshops and socials.

Compute Cluster. Our compute cluster has supported around 350 researchers over its lifetime, enabling a cumulative 109 research papers that have garnered over 4,000 citations. This infrastructure has been crucial for enabling cutting-edge safety research that would otherwise be computationally prohibitive. This year, the compute cluster supported 77 new papers.

AI Safety, Ethics, and Society. We published "Introduction to AI Safety, Ethics, and Society" with Taylor & Francis in December 2024, which is the first comprehensive textbook covering AI safety concepts in a form accessible to non-ML researchers and professors. We also launched an online course based on the textbook with 240 participants.

Turing Award-winner Yoshua Bengio writes of the textbook: “This book is an important resource for anyone interested in understanding and mitigating the risks associated with increasingly powerful AI systems. It provides not only an accessible introduction to the technical challenges in making AI safer, but also a clear-eyed account of the coordination problems we will need to solve on a societal level to ensure AI is developed and deployed safely.”

SafeBench Competition. We established a competition offering $250,000 in prizes to develop benchmarks for empirically assessing AI safety across four categories: robustness, monitoring, alignment, and safety applications. The competition has drawn significant interest, with 120 researchers registered. This project is supported by Schmidt Sciences.

Workshops and Socials.

  • We ran a workshop at NeurIPS in December 2024 focusing on the safety of agentic AI systems. We received 51 submissions and accepted 34 papers.
  • We organized socials on ML Safety at ICML and ICLR, two top AI conferences, convening an estimated 200+ researchers to discuss AI safety.
  • We also held a workshop on compute governance on August 15th. It brought together 20 participants, including 14 professors specializing in the security of computing hardware, to discuss avenues for technical research on the governance and security of AI chips. We are currently writing a white paper to synthesize and disseminate key findings from the workshop, which we aim to publish early next year.

AI Safety Newsletter. The number of subscribers to this newsletter—now greater than 24,000—has tripled in 2024. Thank you for your interest in AI safety—in 2025, we plan to continue to support this growing community.

Looking Ahead

We expect 2025 to be our most productive year yet. Early in the year, we will publish numerous measurements of the capabilities and safety of AI models.

If you'd like to support the Center for AI Safety's mission to reduce societal-scale risks from AI, you can make a tax-deductible donation here.

Listen to the AI Safety Newsletter for free on Spotify or Apple Podcasts.

Subscribe here to receive future versions.
