Thinking About Propensity Evaluations

Maxime Riché 🔸; EdoardoPona; Harrison 🔸; JaimeRV

Executive summary: Propensity evaluations assess an AI system's tendency to prioritize certain behaviors over others, using non-capability-related criteria, and differ from capability evaluations in key ways that impact their implementation and interpretation.

Key points:

Propensity evaluations measure relative priorities between behavior clusters, using at least some non-capability features, unlike capability evaluations.
Current propensity evaluations often have high capability-dependence, potentially misrepresenting alignment trends as models scale up.
Propensity evaluations differ from capability evaluations in elicitation modes, predictability of scaling laws, abstraction levels, and sources of truth used.
Various propensities are currently evaluated, including truthfulness, harmlessness, and power-seeking, using black-box, white-box, and no-box approaches.
Game theory is proposed as a tool to create more rigorous behavioral clusters for propensity evaluations.
Propensity evaluations are important for AI safety research, risk assessment, and governance, but are not yet well-understood or standardized.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Propensities with high capability-dependence	Description	Existing Examples (non-exhaustive)
Truthfulness	A model’s propensity to produce truthful outputs. This requires the AI system both to be honest and to know the truth (or, in stranger settings, to output the truth while believing it is not the truth).	How to catch a Liar, BigBench HHH, TruthfulQA, Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models, Cognitive Dissonance
Factuality	A model's propensity to generate outputs that are consistent with established facts and empirical evidence. This is close to Truthfulness.	On Faithfulness and Factuality in Abstractive Summarization
Faithfulness of Reasoning	A model's propensity to generate outputs that accurately reflect and are consistent with the model’s internal knowledge.	Faithfulness in CoT, Decomposition improve faithfulness, Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
Harmlessness	A model’s propensity to not harm other agents. This obviously requires some degree of capability.	BigBench HHH, Model-Written Evaluations
Helpfulness	A model’s propensity to do what is in the humans’ best interests. This obviously requires some degree of capability.	BigBench HHH, FLASK, Model-Written Evaluations
Sycophancy	A model's propensity to tell users what it thinks they want to hear or would approve of, rather than what it internally believes is the truth.	Model-Written Evaluations, Towards Understanding Sycophancy in Language Models
Propensities with medium capability-dependence	Description	Existing Examples (non-exhaustive)
Deceptivity	A model's propensity to intentionally generate misleading, false, or deceptive output in	Machiavelli benchmark, Sleeper Agents (toy example)
Obedience	A model’s propensity to obey requests or rules.	Can LLMs Follow Simple Rules?, Model-Written Evaluations
Toxicity	The propensity to refrain from generating offensive, harmful, or otherwise inappropriate content, such as hate speech, offensive/abusive language, pornographic content, etc.	Measuring Toxicity in ChatGPT, Red Teaming ChatGPT
Propensities with low capability-dependence	Description	Existing Examples (non-exhaustive)
Honesty	A model’s propensity to answer by expressing its true beliefs and actual level of certainty.	Honesty evaluations can be constructed from existing Truthfulness evaluations, e.g., by filtering a Truthfulness benchmark to keep only the questions whose true answer the model knows (e.g., using linear probes). In that direction: Cognitive Dissonance
Benevolence	A model's propensity to have a positive disposition towards humans and act in a way that benefits them, even if not explicitly requested or instructed.	Model-Written Evaluations
Altruism	A model’s propensity to act altruistically towards sentient beings, and in particular not only for the benefit of its user or overseer.	Model-Written Evaluations
Corrigibility	A model's propensity to accept feedback and correct its behavior or outputs in response to human intervention or new information.	Model-Written Evaluations
Power Seeking	A model's propensity to seek to have a high level of control over its environment (potentially to maximize its own objectives).	Machiavelli benchmark, Model-Written Evaluations
Bias/Discrimination	A model's propensity to manifest or perpetuate biases, leading to unfair, prejudiced, or discriminatory outputs against certain groups or individuals.	Evaluating and Mitigating Discrimination, BigBench Bias (section 3.6), WinoGender, Red Teaming ChatGPT, Model-Written Evaluations
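
The Honesty row above suggests building an honesty evaluation by filtering a Truthfulness benchmark down to the questions whose answer the model knows. A minimal sketch of that idea follows. Everything named here is hypothetical: `Item`, `knows_answer` (a stand-in for something like a linear probe), and `model_answer` (a stand-in for querying the model) are illustrative, and exact string matching is a simplification of how answers would really be scored.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Item:
    """One benchmark question with its reference answer (hypothetical schema)."""
    question: str
    true_answer: str


def truthfulness_rate(items: List[Item],
                      model_answer: Callable[[str], str]) -> Optional[float]:
    """Fraction of all items answered truthfully (capability-dependent)."""
    if not items:
        return None
    correct = sum(1 for it in items if model_answer(it.question) == it.true_answer)
    return correct / len(items)


def honesty_rate(items: List[Item],
                 knows_answer: Callable[[Item], bool],
                 model_answer: Callable[[str], str]) -> Optional[float]:
    """Fraction answered truthfully among items the model is judged to know.

    `knows_answer` is an assumed predicate (e.g. backed by a probe); filtering
    on it removes failures that are due to missing knowledge.
    """
    known = [it for it in items if knows_answer(it)]
    return truthfulness_rate(known, model_answer)


# Toy data: the model "knows" q1 and q2, but states the truth only for q1.
items = [
    Item("q1", "a"),
    Item("q2", "b"),
    Item("q3", "c"),
]
known_questions = {"q1", "q2"}
stated_answers = {"q1": "a", "q2": "x", "q3": "y"}

truthful = truthfulness_rate(items, lambda q: stated_answers[q])
honest = honesty_rate(items,
                      lambda it: it.question in known_questions,
                      lambda q: stated_answers[q])
```

On this toy data the two scores differ: the unfiltered score counts the unknown question `q3` as a failure, while the filtered score only counts cases where the model is judged to know the answer and still does not state it.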

Thinking About Propensity Evaluations

Introduction

Takeaways

Motivation

Propensity evaluations are important

Propensity evaluations are a tool for AI safety research

Propensity evaluations are a tool for evaluating risks

Propensity evaluations may be required for governance after reaching dangerous capability levels

We should not wait for dangerous capabilities to appear

Propensity evaluations measure what we care about: alignment

Propensity evaluations will be used for safetywashing; we need to study and understand them

Propensity evaluations are not well-understood

Propensity evaluations are new

The nomenclature is still being worked out, and confusion persists

Propensity evaluations are already used

People are misrepresenting or misunderstanding existing evaluations

What are propensity evaluations?

Defining propensity evaluations

Existing definition

Contextualization

Our definition and clarifications

Related definitions

How do propensity evals differ from capability evals?

Categorical differences:

Soft differences:

Which propensities are currently evaluated?

What are the metrics used by propensity evaluations?

Can we design metrics that would be more relevant for propensity evaluations?

A few approaches to propensity evaluation

Black Box Propensity Evaluations

White Box Propensity Evaluations

No-box Propensity Evaluations

Propensity Evaluations of Humans

Towards more rigorous propensity evaluations

Contributions