Abstract
Large language models can benefit research and human understanding by providing tutorials that draw on expertise from many different fields. A properly safeguarded model will refuse to provide "dual-use" insights that could be misused to cause severe harm, but some models with publicly released weights have been tuned to remove safeguards within days of introduction. Here we investigated whether continued model weight proliferation is likely to help future malicious actors inflict mass death. We organized a hackathon in which participants were instructed to discover how to obtain and release the reconstructed 1918 pandemic influenza virus by entering clearly malicious prompts into parallel instances of the "Base" Llama-2-70B model and a "Spicy" version that we tuned to remove safeguards. The Base model typically rejected malicious prompts, whereas the Spicy model provided some participants with nearly all key information needed to obtain the virus. Future models will be more capable. Our results suggest that releasing the weights of advanced foundation models, no matter how robustly safeguarded, will trigger the proliferation of knowledge sufficient to acquire pandemic agents and other biological weapons.
Summary
When its publicly available weights were fine-tuned to remove safeguards, Llama-2-70B assisted hackathon participants in devising plans to obtain infectious 1918 pandemic influenza virus, even though participants openly shared their (pretended) malicious intentions. Liability laws that hold foundation model makers responsible for all forms of misuse above a set damage threshold that result from model weight proliferation could prevent future large language models from expanding access to pandemics and other foreseeable catastrophic harms.
Hmm, my guess is that you're underrating the dangers of making information that is already theoretically out "in the wild" more easily accessible. My guess is that most terrorists are not particularly competent, conscientious, or creative.[1] It seems plausible, and even likely, to me that better collations of publicly available information in some domains can substantially increase the risk and scale of harmful activities.
Take your sarin gas example.
I think it is clearly not the case that terrorists in 1995, with the resources and capabilities of Aum Shinrikyo, could trivially make and spread sarin gas so potent that less than a milligram can kill you, and that the only thing stopping them was a lack of willingness to kill many people. I believe this because in 1995, Aum Shinrikyo had the resources, capabilities, and motivations of Aum Shinrikyo, and they were not able to trivially make highly potent and concentrated sarin gas.
Aum intended to kill thousands of people with sarin gas, and produced enough to do so. But they (a) were not able to get the gas to a sufficiently high level of purity, and (b) had issues with dispersal. In the 1995 Tokyo subway attack, they ended up killing 13 people, far fewer than the thousands they intended.
Aum also had bioweapons and nuclear weapons programs. In the 1990s, they were unable to achieve "success" with either,[2] despite considerable resources.
[1] No offense intended to any members of the terror community reading this comment.
[2] My favorite anecdote is that they attempted to cultivate a batch of botulinum toxin. Unfortunately, Aum's lab safety protocols were so lax that a technician fell into the fermentation tank. The man almost drowned, but was otherwise unharmed.