I’m exploring a question that sits slightly upstream of most current AI governance discussions, and I’d appreciate critique from this community.
Much of AI governance today focuses on outputs: harmful behaviours, misuse cases, alignment failures, or downstream impacts. The dominant assumption seems to be that AI systems are essentially neutral cognitive engines, and that risk emerges primarily at the point of use.
Recent work (including research on persona drift and controllable system personas) suggests something different:
behaviour appears to be downstream of identity, not merely of objectives or guardrails.
By “identity,” I don’t mean consciousness or self-awareness. I mean the structural invariants that define who a system is allowed to be during interaction: role, scope of authority, permissible self-models, persistence across contexts, and how it represents its relationship to humans.
If identity formation is a controllable variable, then governance cannot be limited to policies, incentives, or post-hoc constraints. It becomes architectural.
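To make "architectural" a bit more concrete, here is a deliberately toy sketch of what an identity-constraint layer could look like if fixed at design time, upstream of objectives and output filters. This is not drawn from any existing system; every field and function name here is hypothetical and only meant to illustrate the idea of constraining who the system is allowed to be rather than what it outputs.

```python
from dataclasses import dataclass

# Hypothetical sketch: identity invariants fixed at design time,
# evaluated before any objective, reward, or output-level policy.
@dataclass(frozen=True)  # frozen: the invariants cannot be mutated at runtime
class IdentityConstraints:
    role: str                                 # who the system is allowed to be
    authority_scope: tuple[str, ...]          # actions it may initiate on its own
    permissible_self_models: tuple[str, ...]  # self-descriptions it may adopt
    persists_across_contexts: bool            # whether identity state carries between sessions
    relation_to_humans: str                   # e.g. "advisory", never "principal"

def within_identity(constraints: IdentityConstraints, proposed_action: str) -> bool:
    """Architectural check: a proposed action is tested against who the system
    is allowed to be, before any output-level filter is consulted."""
    return proposed_action in constraints.authority_scope

assistant = IdentityConstraints(
    role="research assistant",
    authority_scope=("summarize", "draft", "cite"),
    permissible_self_models=("tool", "assistant"),
    persists_across_contexts=False,
    relation_to_humans="advisory",
)

print(within_identity(assistant, "draft"))              # True
print(within_identity(assistant, "acquire_resources"))  # False, regardless of objective
```

The point of the sketch is only that the constraint lives in the system's definition of itself, not in a policy applied to its outputs.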
This raises several questions I’m still working through:
- Is internal identity formation a more leverageable governance layer than outputs or reward functions? If behaviour reliably follows identity constraints, are we currently governing too late in the stack?
- Who defines the permissible identity space, and on whose behalf? Once identity becomes part of system design, governance shifts from “what is allowed” to “who has authority to define what the system is.”
- What failure modes emerge if identity constraints are wrong? Could architectural governance reduce some risks while introducing brittleness, lock-in, or jurisdictional capture?
- How does this interact with longtermist concerns? If identity hardens early, does this meaningfully affect trajectory risks, or does it simply defer them?
I’m not committed to the claim that identity-centric governance is sufficient, only that it may be necessary as systems become more agentic. My working hypothesis is that once agency scales, output-based governance alone becomes increasingly fragile.
I’m posting this here not as a proposal to adopt, but as a structure to interrogate.
Where do you see this breaking?
What existing work does this overlap with or contradict?
And what risks does this framing underestimate?
Grateful for thoughtful disagreement.
This post draws from a broader line of work I’ve been developing under the name HumanSovereigntyAI, focused on architectural approaches to AI governance.
— Travis Lee
