I’m trying to better understand how conflict between advanced AI systems and human interests could arise in practice.
Suppose we have a highly capable, fully autonomous AI system that no longer depends on humans for operation or resources. Under what conditions, or through what mechanisms, would such a system end up acting in ways that are harmful to human welfare?
Are there specific alignment failure modes (e.g., reward misspecification, instrumental convergence, goal misgeneralization) that make this more likely?
Also, to what extent is the idea of “AI vs humanity” a misleading framing, versus a useful way to think about long-term risk?
I’d appreciate perspectives grounded in existing AI safety research, as well as any models or frameworks that help clarify this.
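To make the kind of failure I have in mind concrete, here is a minimal toy sketch of reward misspecification (the environment, names, and numbers are all made up for illustration, not taken from any particular paper): the agent maximizes the reward it is actually given, and when that proxy diverges from the intended objective, the proxy-optimal policy can score zero on what the designer cared about.

```python
# Hypothetical toy example of reward misspecification.
# The designer intends "reach the goal", but the reward the agent optimizes
# also pays out at an exploitable "point dispenser" off the intended path.

from dataclasses import dataclass


@dataclass
class Outcome:
    proxy_return: float  # what the learning system actually optimizes
    true_return: float   # what the designer actually wanted


def rollout(policy: str, horizon: int = 20) -> Outcome:
    """Simulate a fixed policy in a tiny corridor: the goal is 2 steps right
    of the start; a point dispenser sits 1 step left and pays +1 proxy
    reward per press but contributes nothing to the intended objective."""
    pos, proxy, true = 0, 0.0, 0.0
    for _ in range(horizon):
        if policy == "go_to_goal":
            pos = min(pos + 1, 2)
        elif policy == "farm_points":
            pos = -1        # stay at the dispenser
            proxy += 1.0    # proxy reward per press
        if pos == 2:
            proxy += 10.0   # reaching the goal pays proxy reward...
            true += 10.0    # ...and is the only thing the designer valued
            break
    return Outcome(proxy, true)


if __name__ == "__main__":
    for policy in ("go_to_goal", "farm_points"):
        o = rollout(policy)
        print(f"{policy:12s}  proxy={o.proxy_return:5.1f}  true={o.true_return:5.1f}")
    # With a long enough horizon, "farm_points" wins on the proxy objective
    # (20 vs. 10) while scoring zero on the intended one: the optimizer
    # follows the specified reward, not the designer's intent.
```

I realize this toy case is far simpler than the autonomous-system scenario above, but it is the shape of failure I am asking about: does conflict with human interests mostly arise from this kind of objective mismatch scaled up, or from something qualitatively different?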
