I posted "My AGI Threat Model: Misaligned Model-Based RL Agent" on the Alignment Forum yesterday. Among other things, I make the case that misaligned AGI is an existential risk that can and should be mitigated by doing AGI safety research far in advance. That's why I'm cross-posting it here!

I basically took a bunch of AGI safety ideas that I like—both old ideas (e.g. agents tend to self-modify to become more coherent) and new ideas (e.g. inner alignment)—and cast them into the framework of an AGI architecture and development path that I find most plausible, based on my thinking about neuroscience and machine learning.

Please leave comments either here or at the original post, and/or email me if you want to chat. Thanks in advance! :-D
