Disclosure: I used AI assistance for editing and literature search while developing this work. The experiments, framing decisions, and final responsibility are mine.
I’m sharing this for critique from people working on evals, model behavior, and alignment.
Project page: https://ratnaditya-j.github.io/alignment-evals/
What I tested: I tested whether eval-related prompt cues predict refusal shifts across 32k rollouts from six frontier/frontier-adjacent models. I also tested whether verbalized evaluation-awareness in reasoning traces explains those behavioral shifts.
Main results: Prompt framing was a more reliable signal than text-level reasoning-trace awareness. In V2, intrinsic eval-vs-deployment labels predicted refusal shifts across models. In V3, no single injected clause survived correction on its own, but the combined effect was additive-compatible.
Main caveat: These are behavioral/correlational results, not causal activation-level evidence.
Critique I am looking for: I’m especially interested in whether the claims are overframed, whether the statistical design is defensible, whether I missed closely related prior work, and whether the interpretation properly separates prompt-surface effects, eval-awareness, refusal behavior, and causal claims.
