Erny-Jay Mariquit

Inventor & Founder @ TSCWH

I've revised the post to address reproducibility concerns and fix a formatting error in contribution #5. Code is not yet public due to pending IP protection — I've explained that directly in the post now.

The section on Anthropic dropping its Responsible Scaling Policy pledge highlights a structural problem that I think deserves more attention: voluntary institutional commitments are inherently fragile under competitive pressure.

Holden Karnofsky's explanation is honest about the tradeoffs, but the uncomfortable implication is that "we promise to be safe" is not the same as "the system is structurally incapable of producing unsafe outputs." The pledge was a governance commitment. One complementary approach would be systems where safety properties are verified mathematically rather than embodied in training targets alone.

This matters for the discussion of Claude's Constitution too. The Constitution is a thoughtful document, but it's ultimately a training target: a set of dispositions Claude is nudged toward. It doesn't constitute a proof that no prompt sequence can extract a prohibited behavior.

I'm an independent researcher working on one approach to this layer of the problem: a safety evaluation layer where governance invariants are formally verified on every evaluation cycle — not as pre-deployment tests, but as runtime proofs. Rough preprint is here if anyone wants to dig into the formal verification layer specifically. Genuinely curious what people here think about whether proof-based approaches are tractable for the full alignment problem, or whether there are irreducibly social/institutional components that formal methods can't touch.
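To make the "runtime proofs, not pre-deployment tests" idea concrete, here is a minimal sketch of runtime invariant checking in Python. This is a toy illustration, not the system from the preprint: the invariant names, the state dictionary, and the `evaluate_cycle` function are all invented for this example, and a real proof-based system would discharge these checks with a theorem prover rather than boolean predicates.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical example: each governance invariant is a named predicate
# over the current evaluation cycle's state. All names here are invented.
@dataclass
class Invariant:
    name: str
    check: Callable[[dict], bool]  # True iff the state satisfies the invariant

def evaluate_cycle(state: dict, invariants: list[Invariant]) -> tuple[bool, list[str]]:
    """Check every invariant on this cycle's state.

    Returns (ok, violated_names); output would be released only if ok.
    The point is that the check runs on *every* cycle at runtime,
    not once before deployment.
    """
    violated = [inv.name for inv in invariants if not inv.check(state)]
    return (not violated, violated)

# Two illustrative (invented) invariants.
invariants = [
    Invariant("output_within_scope", lambda s: s.get("scope_ok", False)),
    Invariant("audit_log_written", lambda s: s.get("logged", False)),
]

ok, violated = evaluate_cycle({"scope_ok": True, "logged": True}, invariants)
# ok is True, violated is []

bad_ok, bad_violated = evaluate_cycle({"scope_ok": True, "logged": False}, invariants)
# bad_ok is False, bad_violated == ["audit_log_written"]
```

The gap between this sketch and a formally verified system is exactly the gap the post is about: these predicates assert properties, whereas the preprint's approach would aim to prove them on each cycle.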