Hide table of contents

As part of a larger community building effort, CAIS is writing a safety newsletter that is designed to cover empirical safety research and be accessible to the broader machine learning research community. You can subscribe here or follow the newsletter on twitter here. We also have a new non-technical newsletter here.


Welcome to the 9th issue of the ML Safety Newsletter by the Center for AI Safety. In this edition, we cover:

Inspecting how language model predictions change across layers

A new benchmark for assessing tradeoffs between reward and morality

Improving adversarial robustness in NLP through prompting

A proposal for a mechanism to monitor and verify large training runs

Security threats posed by providing language models with access to external services

Why natural selection may favor AIs over humans

And much more...


We have a new safety newsletter. It’s more frequent, covers developments beyond technical papers, and is written for a broader audience.

Check it out here: AI Safety Newsletter.

Monitoring

Eliciting Latent Predictions from Transformers with the Tuned Lens

This figure compares the paper’s contribution, the tuned lens, with the “logit lens” (top) for GPT-Neo-2.7B. Each cell shows the top-1 token predicted by the model at the given layer and token index.

Despite incredible progress in language model capabilities in recent years, we still know very little about the inner workings of those models or how they arrive at their outputs. This paper builds on previous findings to determine how a language model’s predictions for the next token change across layers. The paper introduces a method called the tuned lens, which fits an affine transformation to the outputs of intermediate Transformer hidden layers, and then passes the result to the final unembedding matrix. The method allows for some ability to discern which layers contribute most to the determination of the model’s final outputs.

[Link] 

Other Monitoring News

[Link] OOD detection can be improved by projecting features into two subspaces - one where in-distribution classes are maximally separated, and another where they are clustered.

[Link] This paper finds that there are relatively low-cost ways of poisoning large-scale datasets, potentially compromising the security of models trained with them.


Alignment

The Machiavelli Benchmark: Trade Offs Between Rewards and Ethical Behavior

General-purpose models like GPT-4 are rapidly being deployed in the real world, and being hooked up to external APIs to take actions. How do we evaluate these models, to ensure that they behave safely in pursuit of their objectives? This paper develops the MACHIAVELLI benchmark to measure power-seeking tendencies, deception, and other unethical behaviors in complex interactive environments that simulate the real world.

The authors operationalize murky concepts such as power-seeking in the context of sequential decision-making agents. In combination with millions of annotations, this allows the benchmark to measure and quantify safety-relevant metrics including ethical violations (deception, unfairness, betrayal, spying, stealing), disutility, and power-seeking tendencies.

 

They observe a troubling phenomenon: much like how LLMs trained with next-token prediction may output toxic text, AI agents trained with goal optimization may exhibit Machiavellian behavior (ends-justify-the-means reasoning, power-seeking, deception). In order to regulate agents, they experiment with countermeasures such as an artificial conscience and ethics prompts. They are able to steer the agents to exhibit less Machiavellian behavior overall, but there is still ample room for improvement.

Capable models like GPT-4 create incentives to build real-world autonomous systems, but optimizing for performance naturally trades off safety. Hasty deployment without proper safety testing under competitive pressure can be disastrous. The authors encourage further work to investigate these tradeoffs and focus on improving the Pareto frontier instead of solely pursuing narrow rewards.

[Link]

Pretraining Language Models With Human Preferences

Typically, language models are pretrained to maximize the likelihood of tokens in their training dataset. This means that language models tend to reflect the training dataset, which may include false or toxic information or buggy code. Language models are often finetuned with selected examples from a better distribution of text, in the hopes that these problems can be reduced.

This raises the question: why start with a problematic model and try to make it less problematic, if you can make it less problematic from the start? A recent paper explores this question. It does so by changing the model’s pre-training objective to more closely align with a preference model. The paper tries several different methods, all of which have been proposed previously in other settings:

  • Dataset filtering, where problematic inputs are removed from the pretraining dataset entirely.
  • Conditional training, where a special <|good|> or <|bad|> token is prepended to the relevant training examples, and in inference <|good|> is prepended by default.
  • Unlikelihood training, where a term maximizing the unlikelihood of problematic sequences is added
  • Reward-weighted regression, where the likelihood of a segment is multiplied by its reward.
  • Advantage-weighted regression, an extension of reward-weighted regression where a token-level value estimate is subtracted from the loss.

The paper finds that pretraining with many of these objectives is better than fine-tuning starting with a normally-trained language model, for three undesirable properties: toxicity, outputs containing personal information, and outputs containing badly-formatted code. In particular, conditional pretraining can reduce undesirable content by an order of magnitude while maintaining performance.

[Link]


Robustness

Model-tuning Via Prompts Makes NLP Models Adversarially Robust

A common method for classification in NLP is appending a dense layer to the end of a pretrained network and then fine tuning the network for classification. However, like nearly all deep learning methods, this approach yields classifiers that are vulnerable to adversarial attacks. This paper experiments with instead adding prompt templates to the inputs to model fine tuning, as shown in the figure above. The method improves adversarial robustness on several benchmarks.

[Link]


Systemic Safety

What does it take to catch a Chinchilla?

As safety techniques advance, it will be important for them to be implemented in large model training runs, even if they come at some cost to model trainers. While organizations with a strong safety culture will do this without prodding, others may be reluctant. In addition, some organizations may try to train large models for outright malicious purposes. In response to this, in the future governments or international organizations could require large model training runs to adhere to certain requirements.

This raises the question: if such requirements are ever implemented, how would they be verified? This paper proposes an outline for how to do so. At a high level, the paper proposes that specialized ML chips in datacenters keep periodic logs of their onboard memory, and that model trainers (“provers”) prove that those logs were created by permitted training techniques. The techniques in this paper do not require provers to disclose sensitive information like datasets or hyperparameters directly to verifiers. The paper also estimates the amount of logging and monitoring that would be needed to catch training runs of various sizes.

This paper focuses on laying foundations, and as such is filled with suggestions for future work. One idea is extending “proof of learning” into the proposed “proof of training.” A second clear area is developing standards for what constitutes a safe training run; this paper assumes that such standards will eventually exist, but they do not currently.

[Link]

Novel Prompt Injection Threats To Application Integrated LLMs

 

Recently, language models are being integrated into a wide range of applications, including not just text generation but also internet search, controlling third-party applications, and even executing code. A recent article gave an overview of how this might be risky. This paper catalogs a wide range of novel security threats these kinds of applications could bring, from including prompt injections on public websites for language models to retrieve to exfiltrating private user data through side channels. The paper gives yet another reason that companies should act with great caution when allowing language models to read and write to untrusted third party services.

[Link]

Other Systemic Safety News

[Link] As ML models may increasingly be used for cyberattacks, it’s important that ML-based defenses keep up. Microsoft recently released an ML-based tool for cyberdefense.

[Link] This paper presents an overview of risks from persuasive AI systems, including how they could contribute to a loss of human control. It provides some suggestions for mitigation.


Other Content

NSF Announces $20 Million AI Safety Grant Program

[Link] The National Science Foundation has recently announced a $20 million grant pool for AI safety research, mostly in the areas of monitoring and robustness. Grants of up to $800,000 are available for researchers.

Natural Selection Favors AIs Over Humans

This conceptual paper provides a new framing for existential risks from AI systems: that AI systems will be subject to natural selection, and that natural selection favors AIs over humans. The paper argues that competitive pressures between humans and AI systems will likely yield AI systems with undesirable properties by virtue of natural selection, and that this could lead to humanity losing control of its future. More specifically, if selfish and efficient AI agents are more able to propagate themselves into the future, they will be favored by evolutionary forces. In contrast to many other accounts of AI risk that tend to focus on single AI systems seeking power, this paper imagines many autonomous agents interacting with each other. Finally, the paper proposes some potential ways to counteract evolutionary forces.

[Link] [YouTube Video]

Anthropic and OpenAI Publish Posts Involving AI Safety

Anthropic recently released a post that details its views of AI safety. OpenAI also published a blog post that touches upon AI safety.

Special edition of Philosophical Studies on AI Safety

[Link] This newsletter normally provides examples of empirically based ML safety papers, but ML safety also needs conceptual and ethical insights. A special edition of philosophical studies is calling for AI safety papers. Please share with any philosophers you know who may be interested.

More ML Safety Resources

[Link] The ML Safety course

[Link] ML Safety Reddit

[Link] ML Safety Twitter

[Link] AI Safety Newsletter (this more frequent newsletter just launched!)

Comments


No comments on this post yet.
Be the first to respond.
More from TW123
227
TW123
· · 9m read
144
TW123
· · 3m read
94
TW123
· · 1m read
Curated and popular this week
LintzA
 ·  · 15m read
 · 
Cross-posted to Lesswrong Introduction Several developments over the past few months should cause you to re-evaluate what you are doing. These include: 1. Updates toward short timelines 2. The Trump presidency 3. The o1 (inference-time compute scaling) paradigm 4. Deepseek 5. Stargate/AI datacenter spending 6. Increased internal deployment 7. Absence of AI x-risk/safety considerations in mainstream AI discourse Taken together, these are enough to render many existing AI governance strategies obsolete (and probably some technical safety strategies too). There's a good chance we're entering crunch time and that should absolutely affect your theory of change and what you plan to work on. In this piece I try to give a quick summary of these developments and think through the broader implications these have for AI safety. At the end of the piece I give some quick initial thoughts on how these developments affect what safety-concerned folks should be prioritizing. These are early days and I expect many of my takes will shift, look forward to discussing in the comments!  Implications of recent developments Updates toward short timelines There’s general agreement that timelines are likely to be far shorter than most expected. Both Sam Altman and Dario Amodei have recently said they expect AGI within the next 3 years. Anecdotally, nearly everyone I know or have heard of who was expecting longer timelines has updated significantly toward short timelines (<5 years). E.g. Ajeya’s median estimate is that 99% of fully-remote jobs will be automatable in roughly 6-8 years, 5+ years earlier than her 2023 estimate. On a quick look, prediction markets seem to have shifted to short timelines (e.g. Metaculus[1] & Manifold appear to have roughly 2030 median timelines to AGI, though haven’t moved dramatically in recent months). We’ve consistently seen performance on benchmarks far exceed what most predicted. Most recently, Epoch was surprised to see OpenAI’s o3 model achi
Dr Kassim
 ·  · 4m read
 · 
Hey everyone, I’ve been going through the EA Introductory Program, and I have to admit some of these ideas make sense, but others leave me with more questions than answers. I’m trying to wrap my head around certain core EA principles, and the more I think about them, the more I wonder: Am I misunderstanding, or are there blind spots in EA’s approach? I’d really love to hear what others think. Maybe you can help me clarify some of my doubts. Or maybe you share the same reservations? Let’s talk. Cause Prioritization. Does It Ignore Political and Social Reality? EA focuses on doing the most good per dollar, which makes sense in theory. But does it hold up when you apply it to real world contexts especially in countries like Uganda? Take malaria prevention. It’s a top EA cause because it’s highly cost effective $5,000 can save a life through bed nets (GiveWell, 2023). But what happens when government corruption or instability disrupts these programs? The Global Fund scandal in Uganda saw $1.6 million in malaria aid mismanaged (Global Fund Audit Report, 2016). If money isn’t reaching the people it’s meant to help, is it really the best use of resources? And what about leadership changes? Policies shift unpredictably here. A national animal welfare initiative I supported lost momentum when political priorities changed. How does EA factor in these uncertainties when prioritizing causes? It feels like EA assumes a stable world where money always achieves the intended impact. But what if that’s not the world we live in? Long termism. A Luxury When the Present Is in Crisis? I get why long termists argue that future people matter. But should we really prioritize them over people suffering today? Long termism tells us that existential risks like AI could wipe out trillions of future lives. But in Uganda, we’re losing lives now—1,500+ die from rabies annually (WHO, 2021), and 41% of children suffer from stunting due to malnutrition (UNICEF, 2022). These are preventable d
Rory Fenton
 ·  · 6m read
 · 
Cross-posted from my blog. Contrary to my carefully crafted brand as a weak nerd, I go to a local CrossFit gym a few times a week. Every year, the gym raises funds for a scholarship for teens from lower-income families to attend their summer camp program. I don’t know how many Crossfit-interested low-income teens there are in my small town, but I’ll guess there are perhaps 2 of them who would benefit from the scholarship. After all, CrossFit is pretty niche, and the town is small. Helping youngsters get swole in the Pacific Northwest is not exactly as cost-effective as preventing malaria in Malawi. But I notice I feel drawn to supporting the scholarship anyway. Every time it pops in my head I think, “My money could fully solve this problem”. The camp only costs a few hundred dollars per kid and if there are just 2 kids who need support, I could give $500 and there would no longer be teenagers in my town who want to go to a CrossFit summer camp but can’t. Thanks to me, the hero, this problem would be entirely solved. 100%. That is not how most nonprofit work feels to me. You are only ever making small dents in important problems I want to work on big problems. Global poverty. Malaria. Everyone not suddenly dying. But if I’m honest, what I really want is to solve those problems. Me, personally, solve them. This is a continued source of frustration and sadness because I absolutely cannot solve those problems. Consider what else my $500 CrossFit scholarship might do: * I want to save lives, and USAID suddenly stops giving $7 billion a year to PEPFAR. So I give $500 to the Rapid Response Fund. My donation solves 0.000001% of the problem and I feel like I have failed. * I want to solve climate change, and getting to net zero will require stopping or removing emissions of 1,500 billion tons of carbon dioxide. I give $500 to a policy nonprofit that reduces emissions, in expectation, by 50 tons. My donation solves 0.000000003% of the problem and I feel like I have f