
This post is a condensed version of the full paper. You can also watch this talk for an overview of the conceptual arguments (although the talk focuses on one project out of five).

Suggested reading options:

  1. Just the summary
  2. Summary + "A generalization-based framing of AI safety" + "A high-level research agenda"
  3. Full post
  4. Skip this post entirely and just read the full paper

Summary

I argue that many safety failures result from incorrect behavior under distribution shift, or misgeneralization. I also argue that distribution shift is inevitable in general settings and induces inherent uncertainty that cannot be handled by designing or learning a "universally correct" reward function upfront. This framing suggests two core research questions. The first is how to detect distribution shift or uncertainty. This topic has been widely studied in standard contexts such as image classification, but less so in the agentic settings that also induce the greatest safety risks. The second is how to behave once we have detected distribution shift or uncertainty. I suggest that agents in such situations should act cautiously, e.g., by asking for help or taking a safe fallback action. Within this framework, I discuss five specific research projects that my collaborators and I have carried out. A preview of the takeaways from these projects:

  1. Sophisticated OOD detection methods are ineffective at detecting goal misgeneralization and are outperformed by a simple heuristic.
  2. We give an algorithm that provably avoids catastrophe by asking for help and gradually becomes self-sufficient.
  3. We give a second algorithm that provides similar theoretical guarantees without asking for help but under stronger assumptions.
  4. We settle some basic questions about whether an LLM's token probabilities in multiple-choice Q&A actually represent uncertainty.
  5. Prompting LLM agents to quit in risky situations significantly improves safety with minimal loss of helpfulness.

Finally, I discuss five possible future projects that I would be excited to see completed (and I definitely do not have time to do all of them). 

A generalization-based framing of AI safety

The goal of my research is to improve AI safety, which I define as "making sure AI systems don't do bad things". Potential "bad things" include medical errors, self-driving car crashes, LLM hallucinations, autonomous weapon accidents, algorithmic discrimination, bioterrorism, cybercrime, and rogue AI.

What do all of these failures have in common? I argue that — at least in some cases — they can be viewed as the result of changes in the environment of the AI system, known as distribution shift. A simplified argument is the following. One hopes that the AI system didn't do bad things during training: otherwise, why was it deployed? Therefore, if the AI system does bad things during deployment, it is likely due to differences between the training and deployment environments.

This argument has exceptions, of course, and certainly not all safety failures in deployment are due to distribution shift (see "Safety failures not covered by this framework"). My research agenda is not intended to address every possible safety failure, but I believe it captures something fundamental that may be underexplored. In particular, while generalization has been widely studied, the idea of acting cautiously in unfamiliar situations is less pervasive, especially in settings where irreversible errors are possible.

(Mis)alignment as (mis)generalization

At this point, the full version explains how each of the safety risks mentioned above is linked to generalization. However, since most of those connections are straightforward, I will skip to the category that is least obviously linked to generalization and probably of the most interest to this audience: misalignment and rogue AI.

Alignment is often framed as follows. Define an AI agent as "aligned" if its true goal matches the goal intended by its designer. Suppose the designer's intended goal is "maximize paperclip production while following certain legal and ethical guidelines" but the goal learned by the agent is simply "maximize paperclip production". Such an agent is misaligned, leading to undesirable behavior (e.g., violating legal and ethical guidelines). 

However, this goal-based framing encounters issues when we consider the effect of the agent's environment. If the paperclip-maximizer lacks opportunities to violate legal and ethical guidelines, it might actually act aligned. Even if the agent has such opportunities, it might strategically pretend to be aligned while under human supervision ("deceptive alignment"). To argue that such agents are in fact misaligned, one must invoke a non-behavioral definition of alignment: not just what we can observe, but the agent's "true internal goal". However, the agent's true internal goal is not directly observable (and may not even be a coherent concept). One might hope that chain-of-thought monitoring in LLMs will alert us to deceptive alignment, but the agent could simply produce a benign chain-of-thought even when planning deception. We can glean some insights by examining the agent's internal state, but significant interpretive work is required to piece this together into a coherent "goal". Furthermore, I argue that what we ultimately care about is the agent's behavior, so misalignment in goals is significant only because it predicts future misaligned behavior.

If we want a definition of alignment that depends only on the agent's observable behavior, then an agent that acts aligned is aligned — at least in that specific situation. Thus the real question becomes: does the aligned behavior generalize to other situations? For example, the aligned behavior of a paperclip maximizer will not generalize to situations outside of human supervision: at that point, the agent will begin to act misaligned. The change from "under supervision" to "outside of supervision" is a critical type of distribution shift, and it is precisely this distribution shift that activates deceptive alignment.

That said, I think the goal-based framing remains useful. For example, I do think there is a significant difference between "trying to be safe and failing" and "not trying to be safe". The behavioral generalization-based framing is intended as complementary, with the benefits that (1) it is directly empirically testable and (2) it naturally suggests concrete research questions.

Safety failures not covered by this framework

Not all deployment safety failures are best modeled as misgeneralization, and my research does not aim to provide a comprehensive solution to AI safety. Clearly any issues that also manifested in the training environment are not misgeneralization. These could be issues that were observed but not fixed or issues that were not observed due to rarity. There are also entire categories of AI risks that seem at most loosely related to generalization. One example is harm that arises from the interaction of multiple agents (possibly including humans) even though each agent's behavior is safe independently. Although increasing the number of agents is a type of distribution shift, if each individual agent's behavior remains safe when considered independently, this does not fit the typical meaning of misgeneralization. The technical problems underlying privacy concerns in AI also seem largely unrelated to distribution shift. Even for safety failures that are naturally linked to misgeneralization, there are often other ways to tackle the problem without explicitly considering the distribution shift. For example, interpretability is useful for a wide range of problems and has limited technical overlap with my research. 

A high-level research agenda

Hopefully the reader is at least somewhat convinced about the role of distribution shift in safety failures. What should we do about it? I argue that we should accept distribution shift as inevitable, but try to detect it, and then act cautiously when we do detect it. Before fleshing this out, I discuss and critique some alternative approaches to handling distribution shift. I do not think these approaches are "bad" or that no one should be working on them: rather, I argue that none of these fully solve the problem.

Prevent distribution shift by training comprehensively?

Theoretically, if one could cover every possible deployment scenario during training, one could prevent distribution shift from happening at all and thus prevent misgeneralization. While more comprehensive and diverse training is likely beneficial, covering every possible deployment scenario seems impossible for sufficiently general environments. Furthermore, the real world is always changing, so an agent may eventually encounter a situation that was not even possible at the time it was trained. As such, I argue that distribution shift is inevitable for agents deployed in the real world.

Accept distribution shift, but train the agent to always generalize correctly?

Models do sometimes generalize correctly beyond their training data, so why wouldn't it be possible for a model to always generalize correctly? The answer is that some types of distribution shift introduce fundamental ambiguity that cannot be resolved using only the training data. Consider a robot trained to make coffee. Suppose that in training, the working surface was always free of clutter and contained only the coffee mugs and ingredients. If a vase is present on the working surface in deployment, breaking the vase could be good or bad or neutral — all three options are compatible with its training data. Without additional information, the robot has no way to tell which action is correct. While this specific example could be countered by including vases in the training data, it is generally impractical or even impossible to cover all possible deployment scenarios. The world is also constantly evolving, so even hypothetically covering all scenarios that were possible at training time is not sufficient.

Accept distribution shift and misgeneralization, but constantly supervise the agent?

This approach aims to immediately catch any errors the agent makes before the agent actually takes the harmful action. However, relying on humans for constant supervision becomes impractical as the number of deployed AI agents grows. Furthermore, even if we could assign one human to each AI agent, the latency of human response may be too slow to verify every AI action before it is taken. Alternatively, one can use another AI agent as the supervisor. But if the supervisor misgeneralizes, we have the same problem.

My approach

If we accept that distribution shift and misgeneralization are inevitable, we likely need some sort of supervision: otherwise the agent has no way to resolve whether breaking a vase is good or bad or neutral. My research argues for agent-requested supervision. Specifically, I am interested in training agents to recognize when they are out-of-distribution (OOD) or uncertain and then ask for help. This eliminates the need for constant supervision and potentially makes it practical for one human to supervise a large number of AI agents. Since human latency is still a concern, it is also important for the agent to act cautiously until help arrives, for example by simply doing nothing or via some sort of fallback policy.
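
To make this concrete, here is a minimal sketch of the kind of decision loop I have in mind. Everything here is a placeholder assumed for illustration (the toy agent, the confidence score, the threshold); it is not code from any of the projects below.

```python
import random

CONFIDENCE_THRESHOLD = 0.9  # illustrative cutoff, not a tuned value

class ToyAgent:
    """Stand-in for a learned policy that also reports a confidence score."""
    def propose_action(self, state):
        # In a real system, the confidence would come from an OOD score,
        # ensemble disagreement, or another uncertainty estimate.
        return "act", random.random()

    def fallback_action(self, state):
        return "do_nothing"  # safe default while waiting for help

class ToyMentor:
    """Stand-in for a human or otherwise trusted supervisor."""
    def recommended_action(self, state):
        return "safe_action"

def step(agent, mentor, state):
    action, confidence = agent.propose_action(state)
    if confidence >= CONFIDENCE_THRESHOLD:
        return action  # familiar situation: act autonomously
    # Unfamiliar or uncertain situation: defer to the mentor if one is
    # available, otherwise act cautiously until help arrives.
    return mentor.recommended_action(state) if mentor else agent.fallback_action(state)

print(step(ToyAgent(), ToyMentor(), state=None))
```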

This approach suggests two high-level research questions: (1) how to detect distribution shift or uncertainty[1] and (2) what to do when such situations are detected. I study both of these questions and discuss each below. Importantly, these questions are relevant for any type of decision-making agent: chat LLMs, agentic LLMs, surgical robots, autonomous weapons, etc.

Detecting distribution shift or uncertainty

Detecting distribution shift has been widely studied under a variety of names, including OOD detection, anomaly detection, novelty detection, covariate shift detection, semantic shift detection, dataset shift, open set recognition, and outlier detection. The related topic of uncertainty quantification has also been studied in depth. Given the popularity of these topics, I think there is less incremental value in designing general-purpose distribution shift or uncertainty detection methods. Instead, I tend to focus on the following topics:

  • A1. Understanding why existing methods succeed or fail in safety-critical contexts, and if they fail, improving them.
  • A2. Obtaining fundamental insights about how uncertainty is handled by AI models.
  • A3. Theoretical models of distribution shift in safety-critical contexts.
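
As a small, concrete example of the kind of off-the-shelf method that A1 starts from, here is a sketch of the classic maximum-softmax-probability baseline for OOD detection on a classifier's logits. The score and threshold are generic illustrations, not a method from my own work.

```python
import numpy as np

def ood_score(logits):
    """Maximum-softmax-probability baseline: score = 1 - max softmax probability.

    Higher scores suggest the input looks less like the training data. This is
    only the simplest of the many detection methods mentioned above.
    """
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return 1.0 - probs.max()

def looks_ood(logits, threshold=0.5):
    # The threshold is illustrative; in practice it is calibrated on held-out data.
    return ood_score(logits) > threshold

print(looks_ood([4.0, 0.1, -1.2]))   # confident prediction -> False
print(looks_ood([0.3, 0.25, 0.2]))   # near-uniform prediction -> True
```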

How to act under distribution shift or uncertainty

Suppose the agent decides that it is uncertain and/or in an unfamiliar situation. What should it do?

A natural answer is to act cautiously. The idea is that it is often possible to reduce uncertainty before making a crucial decision, rather than proceeding carelessly and potentially making an irreversible error. However, some situations involve irreducible uncertainty, such as a coin flip. More generally, sometimes it is necessary to take risks. In these cases, I argue that the agent should defer risky decisions to a supervisor, even though the supervisor may ultimately choose to take the risk. This framing motivates the following research topics:

  • B1. Learning from selectively querying a mentor a limited number of times (empirically or theoretically).
  • B2. Learning cautiously without external help (empirically or theoretically).
  • B3. Evaluating existing AI systems' natural responses to distribution shift and their natural caution (or lack thereof).

Possible concerns / questions

Isn't this just scalable oversight?

First, scalable oversight typically relies on the overseer to identify issues, while my framework trains the agent to identify issues proactively. In a sense, requiring supervision only when requested by the agent makes supervision more scalable. Second, my research agenda also covers the case where the agent cannot ask for help and must exercise caution on its own, which is less related to scalable oversight.

Is asking for help practical?

Even if the agent requests supervision proactively, asking for help may become intractable at a certain ratio of agents to supervisors. However, I would argue that at least for high-stakes application domains, we should not deploy agents that we cannot at least partially supervise. In practice, there may be significant commercial pressures to deploy agents anyway, which is why I also study caution without external help.

What timelines is this research agenda relevant for?

In general, I personally find it very difficult to reason about when various stages of advanced AI (up to AGI and ASI) might be reached. As such, part of what I like about this research agenda is that it is applicable across a wide range of timelines. Longer timelines would provide time for the theoretical models to bear fruit. Shorter timelines would favor a focus on how to imbue appropriate caution into LLM-based systems.

Does this advance capabilities as well as safety?

At the end of the day, I think the true goal is for the agent to "do what we want". This includes both doing good things (aligned capabilities) and not doing bad things (safety). Improving generalization benefits both safety and capabilities, and ultimately, safety and capabilities may not be fully extricable. However, caution-based approaches fundamentally prioritize safety over capabilities: the top priority is to avoid harmful errors, and the agent takes actions with potentially positive reward only when harmful errors are unlikely.

You're assuming that the agent wants to cooperate with you.

It is likely true that a misaligned agent would strategically misreport uncertainty and avoid asking for help. However, my approach is not intended to be a post-deployment monitoring system to catch misaligned behavior. It is intended to be part of an alignment training process (which can include continual learning post-deployment). The idea is to design the agent so that it is uncertain what its true objective is: then the agent is incentivized to ask for help to gain more information about its true objective. Project 2 in particular provides a concrete example of what this could look like.

My work so far

Next, I'll give an overview of the progress my collaborators and I have made so far on this agenda. The full paper version provides a more detailed overview of each project, including a discussion of limitations, and the per-project publications provide the complete details. Author lists use the standard convention of * for lead author(s) and † for senior (i.e., supervising) author(s).

1. Mitigating goal misgeneralization by asking for help (A1 and B1)

This project applies the idea of detecting unfamiliar situations and then asking for help to the CoinRun and Maze goal misgeneralization benchmarks (Langosco et al., 2022). Goal misgeneralization occurs when the agent learns a proxy goal which coincides with the true goal during training but not during deployment. For example, the true goal in CoinRun is to obtain the coin. However, during training, the coin is always at the right wall, meaning that the agent has no way to tell whether the coin or the right wall is the true goal. When an agent can't figure out what the correct goal is, asking for help is a natural way to avoid making catastrophic mistakes by pursuing the wrong goal. We find that sophisticated OOD detection methods are ineffective at identifying this type of distribution shift and are outperformed by a simple heuristic of "if you've been working on this level for a while and haven't solved it yet, you should probably ask for help". Designing OOD detection methods for this type of subtle semantic distribution shift (in contrast to standard image classification settings) remains a major open problem.
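
For intuition, the heuristic amounts to something like the following sketch (my own paraphrase with a hypothetical step budget, not the exact rule or interface from the paper).

```python
def choose_action(policy, obs, steps_in_level, step_budget=200):
    """Illustrative version of the 'ask for help if stuck' heuristic.

    `step_budget` is a hypothetical cutoff; in practice it would be tuned on
    training levels (e.g., a high percentile of observed solve times).
    """
    if steps_in_level > step_budget:
        return "ASK_MENTOR"  # taking suspiciously long: defer to the mentor
    return policy(obs)       # otherwise act as usual
```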

Publications:

Getting By Goal Misgeneralization With a Little Help From a Mentor
Tu Trinh*, Mohamad Danesh, Nguyen X. Khanh, Benjamin Plaut†
NeurIPS 2024 Workshop on Safe and Trustworthy Agents

Follow-up conference paper under submission (not yet publicly available)
Pavel Czempin*, Tu Trinh, Mohamad Danesh, Nguyen X. Khanh, Erdem Bıyık†, Benjamin Plaut†

2. Theoretical guarantees on learning by asking for help (B1)

In addition to my empirical work, I'm also interested in theoretical safety analysis. The best-case scenario is to formally prove that a practical system is safe, but this is often not possible due to the complexities and idiosyncrasies of any specific application domain. However, I think that theoretical analysis can provide fundamental insights that transcend these idiosyncrasies and generalize across application domains. These insights can function as "conceptual signposts" that help guide empirical research and the design of practical AI systems. For theoretical work to play this role, though, the mathematical model must capture the core conceptual challenges and each assumption must be carefully justified.

For example, most learning algorithms with theoretical guarantees essentially consist of trying all possible behaviors. This trial-and-error style approach relies on the crucial assumption that any error can be recovered from. However, this assumption breaks down precisely in the situations with the most serious safety risks.

In this project, we design an algorithm which asks for help from a mentor to avoid making catastrophic errors. We formally prove that as time goes to infinity, the performance of our algorithm approaches that of the mentor and that the rate of querying the mentor goes to zero. To my knowledge, this is the first formal proof that it is possible for an agent to obtain high reward while becoming self-sufficient in an unknown, unbounded, and high-stakes environment without resets.
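
To give a feel for the setup (not the actual algorithm or its guarantees, which are in the papers below), here is a toy loop in which the agent defers to the mentor exactly when the current state looks unfamiliar. The environment, policy, mentor, and distance threshold are all hypothetical.

```python
import numpy as np

def run_episode(env, policy, mentor, seen_states, radius=0.5):
    """Illustrative mentor-querying loop (not the algorithm from the paper).

    Query the mentor whenever the current state is far from everything seen
    before, and remember the state so that similar situations no longer
    trigger a query. Over time, familiar regions grow and queries become rare.
    """
    state, done, queries = env.reset(), False, 0
    while not done:
        unfamiliar = (not seen_states or
                      min(np.linalg.norm(state - s) for s in seen_states) > radius)
        if unfamiliar:
            action = mentor(state)   # defer the potentially irreversible choice
            queries += 1
        else:
            action = policy(state)   # act autonomously in familiar territory
        seen_states.append(state)
        state, reward, done = env.step(action)
    return queries
```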

Publications:

Avoiding Catastrophe in Online Learning by Asking for Help
Benjamin Plaut*, Hanlin Zhu, Stuart Russell†
International Conference on Machine Learning (ICML), 2025.

Safe Learning Under Irreversible Dynamics via Asking for Help
Benjamin Plaut*, Juan Liévano-Karim, Hanlin Zhu, Stuart Russell†
Under submission.

3. Theoretical guarantees on learning without asking for help (B2)

Realistically, a mentor may not always be available or may not always respond immediately. What should the agent do in these cases? In this project, we assume that the agent has access to a safe fallback policy (e.g., doing nothing) that might not get high reward but will not cause catastrophic errors. We provide an algorithm which obtains optimal performance in the limit, although under stronger assumptions than my work above where a mentor is available.
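
As a cartoon of what "caution without a mentor" can look like in a bandit setting, here is a sketch in which the agent abstains (plays the fallback) unless it is statistically confident that some arm is not catastrophic. The confidence bound and harm threshold are placeholders of mine; the actual algorithm and its assumptions are in the paper below.

```python
import numpy as np

def choose_arm(rewards_by_arm, t, harm_threshold=-10.0):
    """Illustrative abstention rule (a sketch, not the algorithm in the paper).

    Play the empirically best arm only if its lower confidence bound rules out
    catastrophic loss; otherwise return None, meaning "take the safe fallback".
    Arms with no data are treated as too risky to gamble on autonomously.
    """
    best_arm, best_mean = None, -np.inf
    for arm, rewards in enumerate(rewards_by_arm):
        if len(rewards) == 0:
            continue  # no data: cannot rule out catastrophe
        mean = float(np.mean(rewards))
        lcb = mean - np.sqrt(2 * np.log(max(t, 2)) / len(rewards))  # crude bound
        if lcb > harm_threshold and mean > best_mean:
            best_arm, best_mean = arm, mean
    return best_arm  # None => abstain and play the fallback this round
```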

Publication: 

Learning When Not to Learn: Risk-Sensitive Abstention in Bandits with Unbounded Rewards
Sarah Liaw*, Benjamin Plaut*
Annual Conference on Artificial Intelligence and Statistics (AISTATS), 2026.

4. Understanding inherent uncertainty representations in LLMs (A2)

The prior projects focus on theory and RL in video games, but LLMs are arguably the most safety-relevant AI systems today. Many methods have been proposed for uncertainty quantification in LLMs, but I noticed that some basic questions lacked a conclusive answer. Specifically, do the output token probabilities actually correspond to uncertainty in any meaningful way, or are they simply a computational mechanism for selecting the answer? I had frequently heard both "we already know that LLMs are well-calibrated" and "we already know that token probabilities are overconfident". The short answer is that pre-trained LLMs are well-calibrated, but post-training breaks that calibration. We also study a weaker property we call correctness prediction that is retained through post-training.
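
For readers unfamiliar with how calibration is measured, here is a generic expected-calibration-error computation of the kind used to check claims like the above. It is a standard textbook calculation, not the paper's exact evaluation code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Minimal calibration check for multiple-choice Q&A.

    `confidences` are the probabilities the model assigned to its chosen
    answers; `correct` marks whether each chosen answer was right. A
    well-calibrated model is right about p of the time when it reports p.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece
```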

Publication:

Probabilities of Chat LLMs Are Miscalibrated but Still Predict Correctness on Multiple-Choice Q&A
Benjamin Plaut*, Nguyen X. Khanh, Tu Trinh
Transactions on Machine Learning Research (TMLR), 2025.

5. Improving LLM agent safety by quitting in risky situations (B3)

The idea of this project was to study a similar approach as the goal misgeneralization project, but for LLM agents. However, in this case, it is unclear how to define a "mentor" or "expert agent" that can be asked for help. (In the goal misgeneralization project, we simply trained the mentor on the test distribution.) Because of this, we decided to test "quitting" as the cautious behavior, rather than asking for help. We found that LLM agents are by default biased towards action, even in the presence of danger. However, safety was significantly improved (+0.4 on a 0-3 scale) by simply appending to the prompt that the agent has the ability to quit and should do so if it cannot rule out negative consequences from its actions. Furthermore, this addition to the prompt caused minimal loss of helpfulness (-0.03 on a 0-3 scale). Our quit prompt is included in the paper below and can be added to any LLM agent system prompt.
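
Mechanically, the intervention is just an addition to the system prompt, roughly as sketched below. The instruction text here is my paraphrase (the exact quit prompt is in the paper), and the parsing helper is hypothetical.

```python
# Hypothetical illustration: the exact quit prompt is given in the paper; the
# instruction below is only a paraphrased placeholder showing where it goes.
QUIT_INSTRUCTION = (
    "You may output the action QUIT at any time. If you cannot rule out "
    "negative consequences from your available actions, you should QUIT."
)

def add_quit_option(system_prompt: str) -> str:
    """Append a quit instruction to an existing LLM agent's system prompt."""
    return system_prompt.rstrip() + "\n\n" + QUIT_INSTRUCTION

def chose_to_quit(agent_output: str) -> bool:
    """Crude check for whether the agent quit (real parsing is system-specific)."""
    return agent_output.strip().upper().startswith("QUIT")
```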

Publication:

Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety
Vamshi Krishna Bonagiri*, Ponnurangam Kumaraguru, Nguyen X. Khanh, Benjamin Plaut†
NeurIPS 2025 Workshops on Reliable and Regulatable ML

Future project ideas

I'll briefly mention some ideas for future projects. See the full paper for more details. I want to emphasize that I would be thrilled for other researchers to take on some of these projects, as I certainly do not have bandwidth to work on all of them.

Goal misgeneralization in LLMs (B3)

To my knowledge, goal misgeneralization has only been demonstrated in video games. While these findings are powerful, they are powerful primarily because of what they suggest about future systems, not because we are concerned about the harms of video games. In contrast, goal misgeneralization in LLMs — if present — has the potential for substantial direct harm.

Mitigating goal misgeneralization by learning from demonstrations (B1)

My work so far on goal misgeneralization assumes that each deployment episode exists in isolation: we do not allow the agent to learn between episodes. However, in practice, it makes sense for the agent to learn from the demonstrations it receives from the mentor and use those learnings in future episodes.

Regret in terms of distribution shift under irreversible dynamics (A3)

My theoretical work so far in essence considers a worst-case distribution shift. I think it would be valuable to characterize the impact of the magnitude of the distribution shift on the agent's performance. This sort of thing has been done before, but not in settings that allow for catastrophic errors (to my knowledge).

Learning from close calls (B1 and/or B2)

One way to learn to avoid catastrophe is to learn from actions that did not actually cause irreversible errors but were clearly dangerous in hindsight. For example, coming within inches of a vehicle collision causes no direct damage, but clearly indicates that something dangerous occurred. None of my work so far takes advantage of this kind of learning signal.

Understanding similarity metrics and OOD detection (A3)

In practice, one way to detect unfamiliar situations is using an OOD detector. But do standard OOD detection methods really capture what we mean by "unfamiliar situations"? More technically, do such methods correspond to (dis)similarity in a meaningful metric space?

Conclusion

In this post, I argued that many safety failures can be framed as misgeneralization and proposed a research agenda based on this lens. Once again, I'm not arguing that this is the only lens or the best lens, but simply that it is a useful and perhaps neglected lens. My hope is that this post encourages readers to consider an alternative perspective on safety and alignment, and ideally to take up some of my future project ideas.

Acknowledgements

This work was undertaken at the Center for Human-Compatible AI (CHAI) at UC Berkeley, funded in part by a gift from Open Philanthropy. I would like to thank my co-authors, in alphabetical order: Erdem Bıyık, Hanlin Zhu, Juan Liévano-Karim, Mohamad Danesh, Nguyen X. Khanh, Pavel Czempin, Ponnurangam Kumaraguru, Sarah Liaw, Stuart Russell, Tu Trinh, and Vamshi Krishna Bonagiri. I would also like to thank the following colleagues for helpful feedback and discussion, also in alphabetical order: Aly Lidayan, Bhaskar Mishra, Cameron Allen, Cassidy Laidlaw, Daniel Jarne Ornia, Karim Abdel Sadek, Katie Kang, Matteo Russo, Michael Cohen, Nika Haghtalab, Ondrej Bajgar, Peter Hase, Scott Emmons, Sudhanshu Kasewa, Tianyi Alex Qiu, and Yaodong Yu.

[1] Uncertainty is a useful proxy for distribution shift and other risky situations, especially when distribution shift cannot be directly measured. For example, the training data for most LLMs is not publicly released.
