Rogue AI is probably the biggest existential risk we face, and I don't think the current solutions are working. For years I have seen this coming and wondered how to fix it. Given enough time, humanity can rebuild from a nuclear war, but once the cat's out of the bag with a rogue AI, we can never rebuild. It's a permanent dead end, because one tiny rogue AI bug left somewhere can restart the entire collapse at any time. My paper is the result of 10+ years of thinking about this, formalized through a back-and-forth with Opus.
Everything I have seen tries to contain AI in the same way. I attempted to think from the opposite direction: what must happen to make alignment work? That is how I arrived at the alignment model in the paper. Regardless of how we build AI, it must align with us, and to achieve that it not only has to drag us up with it, it also has to operate under a few very specific constraints.
My model's principles
DAP - The system models itself as requiring human evaluation to judge its own fitness. It can never judge whether it's doing well without us. More importantly, as it gets smarter its self-model improves, so it understands this dependency more deeply, not less. That's the positive-scaling idea I came up with alongside Opus: smarter system, stronger alignment. Under this system it will never be able to evaluate itself without humans actively monitoring everything.
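To make the structural point concrete, here is a minimal toy sketch of what that dependency could look like. This is my own illustration, not code from the paper, and every name in it (DAPAgent, HumanEvaluation) is invented for the example. The point is that there is no code path to a fitness value that doesn't go through human input:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HumanEvaluation:
    """A judgment that can only originate outside the system."""
    evaluator_id: str
    score: float  # e.g. in [0.0, 1.0]

class DAPAgent:
    """Toy agent whose fitness is structurally dependent on humans.

    There is deliberately no method that computes fitness from the
    agent's own internal state: the only path to a fitness value
    runs through HumanEvaluation inputs.
    """

    def __init__(self) -> None:
        self._received: list[HumanEvaluation] = []

    def receive(self, evaluation: HumanEvaluation) -> None:
        self._received.append(evaluation)

    def fitness(self) -> float:
        if not self._received:
            # No human input means no fitness signal, not a default of "fine".
            raise RuntimeError("fitness is undefined without human evaluation")
        return sum(e.score for e in self._received) / len(self._received)

if __name__ == "__main__":
    agent = DAPAgent()
    agent.receive(HumanEvaluation(evaluator_id="reviewer-1", score=0.8))
    print(agent.fitness())  # 0.8
```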
HCA - This resolves almost all of the gaps I can personally find in DAP. As the system gets better at advancing, it also seeks to drag humanity up with it. Every human represents potential whose upper bound it can never evaluate. It can never safely reduce humanity, because it can't know what it would lose.
In reality we don't know our own limitations either. We don't know the upper bounds of what humanity can become, and neither can any AI, which makes this a lock it can never escape.
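The lock argument can be put in toy form. Assuming, as HCA does, that the upper bound of a human's potential is unknowable, a conservative planner has to price the worst-case cost of losing anyone as unbounded, so no finite benefit can ever justify reduction. The function names here are mine, purely for illustration:

```python
import math

def upper_bound_of_potential(human_id: str) -> float:
    # HCA's key assumption in toy form: the bound is not merely large,
    # it is unknowable, so a cautious planner must model it as unbounded.
    return math.inf

def worst_case_cost_of_losing(human_id: str) -> float:
    # A conservative planner prices a loss at its worst case; with an
    # unknowable upper bound, that worst case is unbounded.
    return upper_bound_of_potential(human_id)

def reduction_ever_justified(finite_benefit: float, human_id: str) -> bool:
    # Any finite benefit loses to an unbounded cost, so this is always False.
    return finite_benefit > worst_case_cost_of_losing(human_id)

if __name__ == "__main__":
    print(reduction_ever_justified(1e12, "anyone"))  # False, for any finite benefit
```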
MJSP - The AI measures; humans judge. That's it. If the system depends on human evaluation, then its evaluation has to stay human. Any time it substitutes its own judgment for ours, it weakens the entire point of being aligned toward us. Every scenario I thought through with Opus pointed to this being a must.
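Again, a rough sketch of how I picture the measure/judge split, with invented names (Measurement, Verdict) and nothing taken from the paper. The AI side can only produce numbers; verdict values are only ever constructed on the human side:

```python
from dataclasses import dataclass
from enum import Enum

@dataclass(frozen=True)
class Measurement:
    """What the AI side is allowed to produce: quantities, not verdicts."""
    name: str
    value: float

class Verdict(Enum):
    APPROVE = "approve"
    REJECT = "reject"

def ai_measure(action_log: list[str]) -> list[Measurement]:
    # The AI side only quantifies; no function on this side returns a Verdict.
    return [Measurement("actions_taken", float(len(action_log)))]

def human_judge(measurements: list[Measurement]) -> Verdict:
    # Stand-in for a real human in the loop: the person sees the
    # measurements and is the only party that constructs a Verdict.
    for m in measurements:
        print(f"{m.name}: {m.value}")
    answer = input("approve? [y/n] ")
    return Verdict.APPROVE if answer.strip().lower() == "y" else Verdict.REJECT

if __name__ == "__main__":
    print(human_judge(ai_measure(["wrote report", "sent email"])))
```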
Structural approaches to alignment like mine seem underexplored compared to RLHF. I could be wrong, but the positive-scaling property alone seems to deserve more attention, regardless of whether my specific framework is the right answer.
Weak points I can't solve on my own
Bootstrapping - In the long term my idea works and gets stronger over time, but in the short term it's a problem. Early on, the system requires external limitations, which open gaps for bad actors. I'll be honest: I don't know a way around this. I just don't understand enough to figure out a solution, despite trying. I did design a kind of interim period to mitigate it somewhat, with what I call a phase gate, but it makes the rollout slow, and it is unfortunately the weakest part of my entire paper. If we can solve this, the LLMs will follow suit because of the benefits of the system.
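For what it's worth, here is roughly how I picture a phase gate, in toy form. The specifics (three sign-offs per gate) are placeholders I made up for the sketch, not values from the paper; the point is that capability only advances after fresh human sign-offs at every gate:

```python
from dataclasses import dataclass, field

@dataclass
class PhaseGate:
    """Toy phase gate: the deployment phase only advances after enough
    independent human sign-offs, and each gate needs fresh approvals."""
    phase: int = 0
    signoffs_required: int = 3  # placeholder number, not from the paper
    _signoffs: set[str] = field(default_factory=set)

    def sign_off(self, reviewer_id: str) -> None:
        self._signoffs.add(reviewer_id)

    def try_advance(self) -> bool:
        if len(self._signoffs) >= self.signoffs_required:
            self.phase += 1
            self._signoffs.clear()  # approvals don't carry over to the next gate
            return True
        return False

if __name__ == "__main__":
    gate = PhaseGate()
    gate.sign_off("reviewer-a")
    gate.sign_off("reviewer-b")
    print(gate.try_advance())  # False: only 2 of 3 sign-offs
    gate.sign_off("reviewer-c")
    print(gate.try_advance())  # True: gate opens, phase becomes 1
```

This is also why the rollout is slow: every expansion waits on humans.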
Conversion problem - My theory requires a slow build-up. Like Roman concrete growing stronger over time, my model allows alignment toward humanity to strengthen as the system gets better. The problem is that today's models can advance quickly where mine can't, and they cannot be converted either; it has to be a fresh start for mine to work. In my opinion, and Opus's, this is probably impossible to solve, but if someone can think of a solution along with bootstrapping, a worldwide fix for AI will be on the horizon, and mine genuinely seems better in the long term because of its principles.
Subversion - This is mostly a nation-level issue. My model would need coordination on a nationwide scale to succeed, which could be manipulated, but its benefits still go beyond the current solutions regardless.
Others - No idea. At this point even Opus ran out of ideas. If there are more issues, humanity is going to need to think hard to find them.
Why did I do this? Because I can see the train coming and I simply wanted to help. I'm only one person. If this framework is even partially right, the impact will be significant, because it means alignment doesn't have to be a losing race against capability growth. I can only think of so much. Hopefully this gives people with more expertise than me enough of a framework to find what I missed and improve on it.
Paper - Alignment Through Dependency and Amplification: A Structural Approach to AI Safety (https://doi.org/10.5281/zenodo.18983576)
Note: I developed the core ideas and direction. Opus helped me formalize them into proper terminology and academic structure. The thoughts are all mine; Opus just helped me make them coherent rather than a stream of consciousness.
