I’m trying to better understand how conflict between advanced AI systems and human interests could arise in practice.
Suppose we have a highly capable, fully autonomous AI system that no longer depends on humans for operation or resources. Under what conditions, or through what mechanisms, would such a system end up acting in ways that are harmful to human welfare?
Are there specific alignment failure modes (e.g., reward misspecification, instrumental convergence, goal misgeneralization) that make this more likely?
Also, to what extent is the idea of “AI vs humanity” a misleading framing, versus a useful way to think about long-term risk?
I’d appreciate perspectives grounded in existing AI safety research, as well as any models or frameworks that help clarify this.
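To make the kind of failure I have in mind concrete, here is a minimal toy sketch of reward misspecification (the environment, names, and numbers are all made up for illustration, not taken from any particular paper): the agent maximizes the reward it is actually given, and when that proxy diverges from the intended objective, the proxy-optimal policy can score zero on what the designer cared about.

```python
# Hypothetical toy example of reward misspecification.
# The designer intends "reach the goal", but the reward the agent optimizes
# also pays out at an exploitable "point dispenser" off the intended path.

from dataclasses import dataclass


@dataclass
class Outcome:
    proxy_return: float  # what the learning system actually optimizes
    true_return: float   # what the designer actually wanted


def rollout(policy: str, horizon: int = 20) -> Outcome:
    """Simulate a fixed policy in a tiny corridor: the goal is 2 steps right
    of the start; a point dispenser sits 1 step left and pays +1 proxy
    reward per press but contributes nothing to the intended objective."""
    pos, proxy, true = 0, 0.0, 0.0
    for _ in range(horizon):
        if policy == "go_to_goal":
            pos = min(pos + 1, 2)
        elif policy == "farm_points":
            pos = -1        # stay at the dispenser
            proxy += 1.0    # proxy reward per press
        if pos == 2:
            proxy += 10.0   # reaching the goal pays proxy reward...
            true += 10.0    # ...and is the only thing the designer valued
            break
    return Outcome(proxy, true)


if __name__ == "__main__":
    for policy in ("go_to_goal", "farm_points"):
        o = rollout(policy)
        print(f"{policy:12s}  proxy={o.proxy_return:5.1f}  true={o.true_return:5.1f}")
    # With a long enough horizon, "farm_points" wins on the proxy objective
    # (20 vs. 10) while scoring zero on the intended one: the optimizer
    # follows the specified reward, not the designer's intent.
```

I realize this toy case is far simpler than the autonomous-system scenario above, but it is the shape of failure I am asking about: does conflict with human interests mostly arise from this kind of objective mismatch scaled up, or from something qualitatively different?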
