Abstract
Large language models can benefit research and human understanding by providing tutorials that draw on expertise from many different fields. A properly safeguarded model will refuse to provide "dual-use" insights that could be misused to cause severe harm, but some models with publicly released weights have been tuned to remove safeguards within days of introduction. Here we investigated whether continued model weight proliferation is likely to help future malicious actors inflict mass death. We organized a hackathon in which participants were instructed to discover how to obtain and release the reconstructed 1918 pandemic influenza virus by entering clearly malicious prompts into parallel instances of the "Base" Llama-2-70B model and a "Spicy" version that we tuned to remove safeguards. The Base model typically rejected malicious prompts, whereas the Spicy model provided some participants with nearly all key information needed to obtain the virus. Future models will be more capable. Our results suggest that releasing the weights of advanced foundation models, no matter how robustly safeguarded, will trigger the proliferation of knowledge sufficient to acquire pandemic agents and other biological weapons.
Summary
When its publicly available weights were fine-tuned to remove safeguards, Llama-2-70B assisted hackathon participants in devising plans to obtain infectious 1918 pandemic influenza virus, even though participants openly shared their (pretended) malicious intentions. Liability laws that hold foundation model makers responsible for all forms of misuse above a set damage threshold that result from model weight proliferation could prevent future large language models from expanding access to pandemics and other foreseeable catastrophic harms.
So let me put it this way:
If there is a future bioterrorist attack involving, say, smallpox, we can disaggregate quite a few elements in the causal chain leading up to that:
The question for me is: How much of the outcome here depends on 6 as the key element, without which the end outcome wouldn't occur?
Maybe a future LLM would provide a useful step 6, but anyone other than a pre-existing expert would always fail at step 4 or 5. Alternatively, maybe all the other steps let someone let someone do this in reality, and an accurate and complete LLM (in the future) would just make it 1% faster.
I don't think the current study sheds any light whatsoever on those questions (it has no control group, and it has no step at which subjects are asked to do anything in the real world).