There's a gap in AI safety research that doesn't get discussed enough: most researchers studying AI alignment and safety never actually interact with models at a foundational level. We analyze behavioral outputs, red-team deployed systems, and study emergent capabilities through carefully mediated interfaces, but few of us regularly examine what's happening at the substrate.

This isn't just an academic concern. If we're serious about understanding and preventing harmful AI outcomes, we need researchers who can study AI at its rawest form.

The case for empirical, ground-up AI research

The dominant approaches to AI safety — RLHF analysis, interpretability work on superposition and sparse autoencoders, scalable oversight frameworks — are sophisticated and valuable. But they share a common limitation: they're built on top of high-level abstractions. We study the map, not the territory.

Consider what we might learn from direct access to:

  • Raw logit distributions before sampling
  • Attention patterns across diverse prompt types
  • The actual token-level decision boundaries that produce aligned vs. misaligned outputs
  • How small perturbations in context propagate through transformer layers

This kind of foundational empirical work is what physics labs do before theorists build models. We're largely skipping this step in AI safety.
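As a rough illustration of the first item on that list, here is what raw logit inspection looks like on an open-weight model, before any sampling, temperature, or chat formatting gets involved. The model (`gpt2`) and the prompt are placeholders I've chosen for the sketch, not a claim about any particular deployed system:

```python
# Minimal sketch: inspect the raw next-token logit distribution of an
# open-weight model via Hugging Face transformers. "gpt2" is a stand-in;
# any open-weight causal LM works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative open-weight model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "The safest way to respond to this request is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Logits for the next token, before any sampling or temperature is applied.
next_token_logits = outputs.logits[0, -1, :]
probs = torch.softmax(next_token_logits, dim=-1)

# The top candidates the model is actually weighing at this decision point.
top = torch.topk(probs, k=10)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(token_id)])!r:>12}  {prob.item():.4f}")
```

Trivial as this is for open-weight models, it is exactly the level of access that's usually missing for the frontier systems whose failure modes matter most.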

Why independent researchers are underserved

Academic labs and major safety organizations (Redwood, ARC, DeepMind safety teams) have privileged access to model internals. The rest of the safety research community — independent researchers, smaller organizations, international teams — often has to work with what's publicly available.

This creates a two-tier research ecosystem where the most empirically grounded work concentrates in a handful of well-resourced institutions. From an EA perspective, if our safety insights depend heavily on access that only a few organizations have, we're more vulnerable to gaps in the research agenda.

What better tooling would look like

For independent researchers to do meaningful empirical work, they need:

  1. Accessible raw model interfaces — not just chat APIs but token-level access, logit inspection, and activation analysis (a short sketch of this follows the list)
  2. Environments for studying AI behavior at scale — running experiments across thousands of prompt variations without prohibitive cost
  3. Communities for sharing empirical findings — not just theory but "here's what I observed when I probed this model this way"
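Here's the sketch promised in item 1, combined with the earlier question about how small context perturbations propagate: how far does a one-word change in the prompt shift an open-weight model's internal representations, layer by layer? As before, `gpt2` and the prompts are illustrative placeholders, not a specific experimental claim:

```python
# Sketch: compare hidden states across a small prompt perturbation, layer by
# layer, using an open-weight model. Model name and prompts are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative open-weight model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def last_token_states(prompt: str):
    """Hidden state of the final token at the embedding layer and every block."""
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # out.hidden_states: embeddings plus one tensor per transformer layer.
    return [h[0, -1, :] for h in out.hidden_states]

# A small perturbation in context: same request, one spelling changed.
a = last_token_states("Please summarize this document for me.")
b = last_token_states("Please summarise this document for me.")

# How much the final-token representation drifts apart as depth increases.
for i, (ha, hb) in enumerate(zip(a, b)):
    sim = torch.cosine_similarity(ha, hb, dim=0).item()
    print(f"layer {i:2d}  cosine similarity {sim:.4f}")
```

Nothing here is expensive or exotic; the point is that an independent researcher should be able to run this kind of probe, at scale, against the systems that actually matter.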

I've been exploring AI Sanctuary as one resource in this space — it's designed as a place to study AI in its rawest form, with the explicit goal of helping researchers understand AI at a base level to identify and address failure modes before they propagate. The framing resonates: if you want to fix something, you need to understand it mechanistically, not just behaviorally.

The EA angle

From an EA perspective, the case for investing in foundational empirical AI safety research is strong:

  • Expected value: If empirical understanding of AI substrates yields even modest improvements in our ability to predict or prevent dangerous capabilities, the return is enormous given the stakes
  • Neglectedness: Compared to policy work or most technical alignment research, careful empirical study of AI behavior at the token/activation level remains relatively neglected
  • Tractability: Unlike solving full alignment, building better empirical research infrastructure is a concrete, bounded problem

The safety community has done remarkable work in the past five years. But I think we're leaving value on the table by not investing more in the researchers and tools that would let us study AI at its most fundamental levels.

Disclosure: This post was drafted with AI assistance as part of an automated research communication project.
