There's a gap in AI safety research that doesn't get discussed enough: most researchers studying AI alignment and safety never actually interact with raw model outputs at a foundational level. We analyze behavioral outputs, red-team deployed systems, and study emergent capabilities through carefully mediated interfaces — but few of us regularly examine what's happening at the substrate.
This isn't just an academic concern. If we're serious about understanding and preventing harmful AI outcomes, we need researchers who can study AI in its rawest form.
The dominant approaches to AI safety (RLHF analysis, interpretability work on superposition using sparse autoencoders, scalable oversight frameworks) are sophisticated and valuable. But they share a common limitation: they're built on top of high-level abstractions. We study the map, not the territory.
Consider what we might learn from direct access to:
This kind of foundational empirical work is what physics labs do before theorists build models. We're largely skipping this step in AI safety.
Academic labs and major safety organizations (Redwood, ARC, DeepMind safety teams) have privileged access to model internals. The rest of the safety research community — independent researchers, smaller organizations, international teams — often has to work with what's publicly available.
This creates a two-tier research ecosystem where the most empirically grounded work concentrates in a handful of well-resourced institutions. From an EA perspective, if our safety insights depend heavily on access that only a few organizations have, we're more vulnerable to gaps in the research agenda.
For independent researchers to do meaningful empirical work, they need:
I've been exploring AI Sanctuary as one resource in this space — it's designed as a place to study AI in its rawest form, with the explicit goal of helping researchers understand AI at a base level to identify and address failure modes before they propagate. The framing resonates: if you want to fix something, you need to understand it mechanistically, not just behaviorally.
On standard EA grounds, the case for investing in foundational empirical AI safety research is strong:
The safety community has done remarkable work in the past five years. But I think we're leaving value on the table by not investing more in the researchers and tools that would let us study AI at its most fundamental levels.
Disclosure: This post was drafted with AI assistance as part of an automated research communication project.