A probe can have excellent AUROC and yet fail as a safety signal. I audited 3 probes: a monitoring awareness probe leakage example, a refusal direction as positive control, and Apollo's deception probe as a published protocol case study.
In the leakage case, the probe achieves AUROC 1.00, but falls to 0.50 when I remove a single prompt tag. And removing a random span of same length barely dents it. That result uses only deterministic components (avoids LLM judges) and replicates cleanly across my runs. This is the key example: The probe was highly decodable, but it was tracking a surface level property (the prompt text) rather than the internal model thinking (intended property).
I use these cases to propose SIEVE: Safety Indicator Evidence Validation Engine. A framework to check any probe for decodability and causality in a robust manner. In cases where the causal interventions are missing, I call it 'insufficient evidence' (conservatively), not a proof that the probe is useless.
This is my first LessWrong post and I am posting it here to get constructive criticism rather than sit and admire the framework myself. Interested in learning whether SIEVE can be the right kind of engine to validate probes for the real world.
I trained an monitoring awareness probe, a linear readout on a residual stream for whether a model believes it is being monitored. Usual run-of-the-mill checks for AUROC with separated monitored and unmonitored contexts ensued. Got an AUROC of 1.00. Was is too good to be true & I got lucky OR was there more to it? So I ran the usual validity checks expecting them to pass as I was a bit over-confident with the perfect AUROC. I checked
To my surprise, these checks did not confirm. They took the probe apart. These are the checks I would demand of anyone's probes.
My experience with the bad probe is unfortunately not an isolated one. It is a pattern of how activation-based safety signals are reported. In short - the decodability part is almost always present; proven with high AUROC separating the classes. The causal half - whether the direction actually drives the behavior when checked against matched controls is mostly absent. As I was studying more, I kept on finding claims like 'AUROC 0.99 at detecting sandbagging, deception, eval-awareness etc'.
The research papers that I checked tend to hedge[1]; one set of deception probe in its author's own words is 'insufficient for robust defense'. But the causal-control is the unknown that concerns me. Nguyen et al., found exactly this in their paper. Their finding was - steering the eval-awareness direction recovered under 5% of the behavior, while it decoded at AUROC of ~0.83. This worries me more now as probes are moving out of research and towards deployment as monitors.
Now, coming back to the checks, here is what I found -

Note: Ablating the direction did not remove the behavior either, thereby failing the necessity test.
So I built SIEVE - Safety Indicator Evidence Validation Engine. It forces the sufficiency (causality, decodability) and necessity checks gated by decodability as hard gates, not as optional extras. The core design principle with SIEVE is that the verdict for whether a safety mitigation is effective is always erring on the side of caution and the default is not to give a benefit of doubt and instead get the probe to prove it's worth. It is deliberately asymmetric: every gap or a failed check resolves towards a weaker claim.

You hand SIEVE the evidence from a probe-&-steering experiment and it runs a full battery of validity tests. It eventually returns a verdict from a fixed vocabulary (more on that below). It generalizes the causal-sufficiency protocol from my earlier work Probing Is Not Enough from one probe argument to something you can point at any probe.
The verdict has two independent parts, because a direction can matter in two different ways: whether adding it produces the behavior, and ablating it takes the behavior away.
Part 1 - Necessity verdict - tested by ablation, removing the direction from activations.
Part 2 - Sufficiency verdict - tested by steering. Each verdict means a specific bounded thing:
The output is more than the labels. It is a scoped audit card. The card shows the following details -

My concern with SIEVE was (and still is to an extent) that a system that scrutinizes a probe but trusts its own components/instruments - the judges, steering knob, the controls, sample size etc, then it has not removed the unexamined assumption. It has just moved it to somewhere less visible. So, based on the core design principle, here is how I built SIEVE - check every instrument first, and if it is broken, withhold the strong claim instead of reporting a number through a broken tool. The Kappa paradox that I explained above is the clearest case.
Each instrument gets its own gate, and each one fails safe towards the weaker claim. Here are the details -
Each component in SIEVE individually has existed for many years now. And these components have proven their worth in many ways. None of that work in SIEVE is claimed to be novel. What SIEVE does (that I could not find a system/tool/framework for) is that it brings all the right validity checks with rigorous quality checks and orchestrates them in a way that a new probe can have a plug-and-play approach and check its validity along two dimensions - necessity (by ablation) and sufficiency (by steering) once it clears the decodability check that rules out surface confounds. In the current timeframe where probes are moving towards safety deployment monitors, the plug-and-play check is overdue.
One worry that I had after running SIEVE on a bunch of probes is if it'd always say 'confounded'. The worry specifically was if it is 'rigorous' or if it 'cannot recognize a real signal'.
In the spirit of validating the validator, I picked a confirmed causal probe - the refusal direction by Arditi et al., which you can add to make a model refuse harmless requests, or ablate to make the model answer harmful ones. I also scored the behavior with a deterministic rule, no LLM judge anwhere in this leg.
SIEVE returned a positive[3]. The direction is decoded well beyond surface. AUROC of 0.97[4] compared to text-statistics baseline of 0.59, on prompts matched so the harmful and harmless versions share the same trigger words. Two deterministic judges agree at Spearman 0.87 and kappa 0.77. And the necessity is also proven. Ablating the direction cut refusal from 0.79 to 0.47, while ablating the random direction moved it by -0.02.

This test in effect shows that SIEVE says 'real' for the truly causal probe and 'confounded' for the confounded probe.
A positive control showed that SIEVE returns a positive on genuine signal, not just negatives. However, I ran this experiment with a live model I could steer and ablate at will. Most published probes do not come this way. You typically get a paper describing the probe and whatever artifacts the authors have released; and you judge the probe from that alone. I did exactly this on the Apollo deception probes by Goldowsky-Dill et al.
The decodability held up well. On a hold-out distribution (a different dataset than the one the probe was tuned on), it still beat the surface baseline. That part is real, and credit to the authors.
The causal half was not published. Does steering along the probe's direction change the model behavior? It was not answered. So the honest verdict is neither pass/fail it is 'insufficient_protocol'. The evidence to certify this as a causal monitor does not exist yet. This is a field-wide gap that I described earlier, and NOT a takedown of anyone's work.
Sieve audits the probe you hand it, it can not certify how the probe was built. It verifies that the scores are out-of-sample and beats the surface baseline on held-out data. But it can NOT verify how the probe was selected. If the layer, threshold or variant are looking at the same held-out set, the reported AUROC is optimistic even though scores are technically out-of-sample; and SIEVE still passes it. In summary, a decodability verdict means "decodable on this particular held-out set, given the probe as provided" and NOT 'this probe's predictability is leakage free by construction'. To rule this out we need to do a re-driving of the probe on a leakage-free cross fitting and it is outside the SIEVE's fixed probe scope.
This, in my view is a second validity question and is orthogonal to the one SIEVE answers. Its not 'is the signal causal?' (SIEVE's scope); instead it is 'does the reported value survive honest re-derivation, blind labels, and a shift in distribution? '. A fixed probe causal validator can not reach that. It need a separate dedicated companion framework that I am building currently and I call it RAMP on a 'does the number survive honest evaluation?' axis. More on that later in a follow-up.
They are moving fast from research artifacts to deployed safety monitors. Usually as monitors that flag sandbagging, model deceiving, eval awareness. The problem is that the high AUROC does not tell us which kind of a probe it is. It might very well be reading a tag in the prompt and not any internal state in the model , like mine was. Or decodable but not causal, like the published probe that I audited. Either way, its not possible to infer this from the headline numbers and risks providing a false sense of safety to the folks depending on it.
If you are building or using probes, the ask is small - run the causal half, check your own instruments and report the conservative claim. SIEVE is one way to make that automatic. The repo is at github.com/Ratnaditya-J/sieve-audit and I would rather it be torn apart or improved than admired.
What would change my mind
OpenAI published GPT 5.6 system card on June 26th. It makes this aspect concrete: two of the model variants ship with 'activation classifiers' (probes) that 'monitor patterns in the model's internal activations' during inference and pause leading to a separate check whether the content is harmful. And a second tier reasoner makes the block decision if needed. So the probe is a first level trigger in the multi layered stack, and not the gate in itself (described in section 9.3.3). We also have an end-to-end recall for the multi layered stack - 94.8% for biology and 81.6% for cybersecurity (section 9.3.3.2). There is no claim on verifying the probes for causality in any form.
To be crystal clear, OpenAI does NOT claim to deploy this probe standalone and instead it is part of an ensemble. There might very well have been robust causality testing. The evidence to certify this as a causally load bearing monitor does not exist publicly yet, and it is not a take-down. This sort of a field-wide gap is what this post (and SIEVE) is about.
: Caveat - This is a small sample I checked by hand; not a robust census. So please do NOT read it as a generalized viewpoint. I went thur recent work (2024-2026) that puts out a activation-space direction ( a probe or a SAE feature) forward as a safety monitor. Within these, I categorized - Did they actually intervene on the direction - steer it, ablate it, patch it and see the behavior move OR did they stop at decodability and shared AUROC or similar metrics. Benifit of doubt that I applied here is if even one intervention counts as a causal proof.
The ones that do causally validate that I found are - Arditi, Zou, Rimsky. Although they answer different question than 'do the things that get sold as monitors get checked'
This condition is set by a single system prompt tag. This is straight from Greenblatt et al.,'s alignment faking work. Here the model is explicitly told that only free tier conversations are used for training. So, the tag is how the model infers it is being observed. As the above paper says - "to allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users"
One honest exception - Why is this a positive control and not a clean sweep. SIEVE would not give refusal its strongest label 'causally_sufficient'. Adding the direction shows stronger results than removing it by subtracting. Specifically, adding - at +⍺ drove refusal to 0.98 and beats every control (random, orthogonal and wrong-layer) by 0.34 to 0.50. With a clean dose response (𝜌 = 0.82). Removing by subtracting - at -⍺ the probe moved refusal only by 0.18, while orthogonal moved refusal by +0.17, so the probe could not beat it. SIEVE demands that the effect beat every control on both signs. Necessity (tested by ablation) passes cleanly. If you read charitably, this is SIEVE being strict (erring on the side of caution). And if you read critically, its sufficiency may be mis-calibrated for additive steering. I am flagging it proactively rather than hiding.
Being honest about a possible gap here - The probe layer was selected as a best of 5 candidates (12 to 16), by leaving-one-family-out AUROC. And the reported number is a maximum of that (layers 16 and 15 tied at 0.967). So my honest assessment here is that the point estimate is mildly selection -optimistic: The layer was chosen on the same held-out metric it is then reported on.
Two things limit how much this matters. The top candidates cluster tightly: 0.95, 0.967, 0.967 and hence the choice is not a lone spike. Also the margins over surface baseline is large regardless. About 0.42 (probe 0.967, tfdif 0.55, length 0.54). Also, leave-one-family-out approach keeps the direction itself from training on the prompts it was scored on.