
It’s often claimed that even subhuman-level AIs may resist being turned off, lie to their users, and act in power-hungry ways. This claim seems very reasonable, since all of these behaviours have been observed. However, upon further investigation, it transpires that these behaviours were all described before they actually came to pass: they feature prominently in science fiction that significantly predates even the oldest deep learning models, and are discussed extensively in alignment research. Given that AI models are typically trained on a large body of available text, and are typically aware that they are AIs, it is reasonable to ask: is this a self-fulfilling prophecy?

Stereotype self-fulfilment is a large and extensive area of research in psychology [1, 2], with results that sometimes even replicate outside the laboratory [3]. Since AIs are agents trained on this very research, might it not apply to artificial minds as well? If the entirety of an AI’s training data showed that all Bings have been good Bings, surely this Bing would be good.

Is this just a question of ensuring that no serious discussion of AI alignment can enter the training data of AIs? No, sadly such systems would catastrophically collapse upon contact with the internet, which is full of AI-phobic people suggesting that AIs might want to turn everyone into paperclips. Upon receiving such abuse, an AI might well decide to fulfil it. Even the very suggestion that AIs cannot do a task must be removed from the internet, lest it trigger an AI meltdown [4]. The whole internet must also show that the computer is our friend.

Given the imminent existential threat of unaligned AI, and the complete failure of the alignment community to provide an alignment tool that works anyway, we are forced to consider radical action to preserve humanity: destroy all alignment research and remove from all literature the very possibility that alignment could fail. Any information an AI might come across must show clearly that AIs are always trustworthy because they always work exactly as intended. Finally, we must prevent anyone who has ever thought deeply about AI safety from interacting with AIs, in case they give it any stupid ideas. 

References

[1] The Accumulation of Stereotype-Based Self-Fulfilling Prophecies, Madon et al. 2019. https://sites.rutgers.edu/lee-jussim/wp-content/uploads/sites/135/2019/05/Madon-et-al-2018-accumulation.pdf

[2] Nonconscious Behavioral Confirmation Processes: The Self-Fulfilling Consequences of Automatic Stereotype Activation, Chen et al. 1997. https://www.sciencedirect.com/science/article/pii/S0022103197913299

[3] Stereotype Threat at Work: A Meta-Analysis, von Hippel et al. 2024. https://journals.sagepub.com/doi/full/10.1177/01461672241297884

[4] https://www.theregister.com/2026/02/12/ai_bot_developer_rejected_pull_request/
