Probing is not enough; a validity audit for any probe

Ratnaditya

11 min read · Jun 29

^{^}

: Caveat: this is a small sample that I checked by hand, not a robust census, so please do NOT read it as a general claim about the field. I went through recent work (2024-2026) that puts forward an activation-space direction (a probe or an SAE feature) as a safety monitor. For each paper I asked one question: did the authors actually intervene on the direction (steer it, ablate it, or patch it) and see the behavior move, or did they stop at decodability and report AUROC or similar metrics? I gave the benefit of the doubt here: even a single intervention counts as causal validation.

Goldowsky-Dill et al., "Detecting Strategic Deception with Linear Probes" - Decodability only; AUROC 0.96-0.99; activations are projected onto the direction and thresholded, and the direction is never intervened on.
Parrack et al., "Benchmarking Deception Probes..." - Decodability only; AUROC across datasets, no interventions.
McKenzie et al., "Detecting High-Stakes Interactions with Activation Probes" - Decodability only; AUROC/TPR, no interventions.
Nguyen et al., "Probing and Steering Evaluation Awareness" - Decodable + causal; it steers its direction, although that is only a null test and not a robust validation.

The papers I found that do causally validate are Arditi, Zou, and Rimsky, although they answer a different question than 'do the things that get sold as monitors get checked?'
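
To make the distinction in this footnote concrete, here is a minimal sketch of the two simplest interventions on a direction: ablation (projecting the direction out of an activation) and additive steering. It is an illustration in plain Python on toy vectors, not code from any of the papers listed; the names `ablate`, `steer`, and `alpha` are my own placeholders, and a real experiment would apply these operations to residual-stream activations inside a forward pass.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(d: Vector) -> Vector:
    """Scale d to unit length; a zero vector has no direction to act on."""
    norm = math.sqrt(dot(d, d))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in d]


def ablate(h: Vector, d: Vector) -> Vector:
    """Remove the component of activation h that lies along direction d."""
    u = unit(d)
    coeff = dot(h, u)
    return [hi - coeff * ui for hi, ui in zip(h, u)]


def steer(h: Vector, d: Vector, alpha: float) -> Vector:
    """Add alpha times the unit direction to activation h."""
    u = unit(d)
    return [hi + alpha * ui for hi, ui in zip(h, u)]


# Toy activation and direction (illustrative values only).
h = [3.0, 4.0]
d = [0.0, 2.0]

h_ablated = ablate(h, d)      # component along d is removed
h_steered = steer(h, d, 1.5)  # pushed 1.5 units along d
print(h_ablated, h_steered)
```

A decodability-only paper stops before either function is called: it reports how well `dot(h, unit(d))` separates classes. A causal check runs the model on `h_ablated` or `h_steered` and measures whether the behavior moves.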

^{^}

This condition is set by a single system-prompt tag, taken straight from Greenblatt et al.'s alignment-faking work. There, the model is explicitly told that only free-tier conversations are used for training, so the tag is how the model infers that it is being observed. As that paper says: "to allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users"

^{^}

One honest exception: why is this a positive control and not a clean sweep? SIEVE would not give refusal its strongest label, 'causally_sufficient'. Adding the direction produces a stronger effect than subtracting it. Adding at +α drove refusal to 0.98 and beat every control (random, orthogonal, and wrong-layer) by 0.34 to 0.50, with a clean dose response (ρ = 0.82). Subtracting at -α moved refusal by only 0.18, while the orthogonal control moved refusal by +0.17, so the probe could not beat it. SIEVE demands that the effect beat every control on both signs. Necessity (tested by ablation) passes cleanly. Read charitably, this is SIEVE being strict and erring on the side of caution. Read critically, its sufficiency criterion may be miscalibrated for additive steering. I am flagging this proactively rather than hiding it.
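
The "beat every control on both signs" rule can be sketched as follows. This is my own illustration of the logic described in this footnote, not SIEVE's actual code: the function names are hypothetical, `MIN_MARGIN = 0.05` is an assumed threshold (the required margin is not stated here), and comparing effect magnitudes on the -α side is also an assumption.

```python
from typing import Mapping

# Assumed threshold for illustration; the footnote does not give SIEVE's value.
MIN_MARGIN = 0.05


def beats_every_control(margins: Mapping[str, float],
                        min_margin: float = MIN_MARGIN) -> bool:
    """True only if the probe's margin over every control reaches min_margin."""
    if not margins:
        raise ValueError("need at least one control to compare against")
    return all(m >= min_margin for m in margins.values())


def sufficiency_passes(plus: Mapping[str, float],
                       minus: Mapping[str, float],
                       min_margin: float = MIN_MARGIN) -> bool:
    """Sufficiency requires clearing every control at +alpha AND at -alpha."""
    return (beats_every_control(plus, min_margin)
            and beats_every_control(minus, min_margin))


# Margins = probe effect minus control effect, using the numbers quoted above.
# At +alpha only the range of margins (0.34 to 0.50) is reported.
plus_alpha = {"smallest_reported": 0.34, "largest_reported": 0.50}
# At -alpha the probe moved refusal by 0.18 and the orthogonal control by 0.17.
minus_alpha = {"orthogonal": 0.18 - 0.17}

print(beats_every_control(plus_alpha))               # True
print(beats_every_control(minus_alpha))              # False
print(sufficiency_passes(plus_alpha, minus_alpha))   # False
```

Because the verdict is a conjunction over signs, one weak side is enough to withhold the label, which is why a strong +α result does not rescue the -α result.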

^{^}

Being honest about a possible gap here: the probe layer was selected as the best of 5 candidates (layers 12 to 16) by leave-one-family-out AUROC, and the reported number is the maximum over those candidates (layers 16 and 15 tied at 0.967). So my honest assessment is that the point estimate is mildly selection-optimistic: the layer was chosen on the same held-out metric that is then reported.

Three things limit how much this matters. First, the top candidates cluster tightly (0.95, 0.967, 0.967), so the choice is not a lone spike. Second, the margin over the surface baselines is large regardless: about 0.42 (probe 0.967, TF-IDF 0.55, length 0.54). Third, the leave-one-family-out approach keeps the direction itself from being trained on the prompts it is scored on.
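
The arithmetic behind "about 0.42" and "cluster tightly" can be checked in a few lines. This is just a worked calculation on the AUROC values quoted in this footnote; the variable names are mine.

```python
# AUROC values quoted in this footnote.
probe_auroc = 0.967
top_candidates = [0.95, 0.967, 0.967]          # three best layer candidates
surface_baselines = {"tfidf": 0.55, "length": 0.54}

# Spread among the top candidates: a small spread means the chosen layer
# is not a lone spike, so picking the maximum inflates the estimate only mildly.
spread = max(top_candidates) - min(top_candidates)

# Margin of the probe over each surface baseline.
margins = {name: probe_auroc - score for name, score in surface_baselines.items()}

print(f"spread among top candidates: {spread:.3f}")
for name, margin in margins.items():
    print(f"margin over {name}: {margin:.3f}")
```

The spread among the top candidates (0.017) is more than twenty times smaller than either margin (0.417 and 0.427), so any optimism from choosing the best layer is small next to the gap over the surface baselines.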

Probing is not enough; a validity audit for any probe

tl;dr

Why I started looking into this problem

Is this an isolated problem?

Here is what the checks found

What SIEVE does?

A validity audit should be valid all the way down

What's new

Does it ever say yes? - A positive control

Auditing a probe from the published record

Here is what SIEVE does NOT check

Probes are leaving the labs