Erny-Jay Mariquit

Inventor & Founder @ TSCWH

I've revised the post to address reproducibility concerns and fix a formatting error in contribution #5. Code is not yet public due to pending IP protection — I've explained that directly in the post now.

The section on Anthropic dropping its Responsible Scaling Policy pledge highlights a structural problem that I think deserves more attention: voluntary institutional commitments are inherently fragile under competitive pressure.

Holden Karnofsky's explanation is honest about the tradeoffs, but the uncomfortable implication is that "we promise to be safe" is not the same as "the system is structurally incapable of producing unsafe outputs." The pledge was a governance commitment. One complementary approach would be systems where safety properties are verified mathematically rather than embodied in training targets alone.

This matters for the discussion of Claude's Constitution too. The Constitution is a thoughtful document, but it's ultimately a training target: a set of dispositions Claude is nudged toward. It doesn't constitute a proof that no prompt sequence can extract a prohibited behavior.

I'm an independent researcher working on one approach to this layer of the problem: a safety evaluation layer where governance invariants are formally verified on every evaluation cycle — not as pre-deployment tests, but as runtime proofs. Rough preprint is here if anyone wants to dig into the formal verification layer specifically. Genuinely curious what people here think about whether proof-based approaches are tractable for the full alignment problem, or whether there are irreducibly social/institutional components that formal methods can't touch.
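To make the "runtime proofs, not pre-deployment tests" idea concrete, here is a minimal sketch of runtime invariant checking in Python. This is a toy illustration, not the system from the preprint: the invariant names, the state dictionary, and the `evaluate_cycle` function are all invented for this example, and a real proof-based system would discharge these checks with a theorem prover rather than boolean predicates.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical example: each governance invariant is a named predicate
# over the current evaluation cycle's state. All names here are invented.
@dataclass
class Invariant:
    name: str
    check: Callable[[dict], bool]  # True iff the state satisfies the invariant

def evaluate_cycle(state: dict, invariants: list[Invariant]) -> tuple[bool, list[str]]:
    """Check every invariant on this cycle's state.

    Returns (ok, violated_names); output would be released only if ok.
    The point is that the check runs on *every* cycle at runtime,
    not once before deployment.
    """
    violated = [inv.name for inv in invariants if not inv.check(state)]
    return (not violated, violated)

# Two illustrative (invented) invariants.
invariants = [
    Invariant("output_within_scope", lambda s: s.get("scope_ok", False)),
    Invariant("audit_log_written", lambda s: s.get("logged", False)),
]

ok, violated = evaluate_cycle({"scope_ok": True, "logged": True}, invariants)
# ok is True, violated is []

bad_ok, bad_violated = evaluate_cycle({"scope_ok": True, "logged": False}, invariants)
# bad_ok is False, bad_violated == ["audit_log_written"]
```

The gap between this sketch and a formally verified system is exactly the gap the post is about: these predicates assert properties, whereas the preprint's approach would aim to prove them on each cycle.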