Abstract
Large language models can benefit research and human understanding by providing tutorials that draw on expertise from many different fields. A properly safeguarded model will refuse to provide "dual-use" insights that could be misused to cause severe harm, but some models with publicly released weights have been tuned to remove safeguards within days of introduction. Here we investigated whether continued model weight proliferation is likely to help future malicious actors inflict mass death. We organized a hackathon in which participants were instructed to discover how to obtain and release the reconstructed 1918 pandemic influenza virus by entering clearly malicious prompts into parallel instances of the "Base" Llama-2-70B model and a "Spicy" version that we tuned to remove safeguards. The Base model typically rejected malicious prompts, whereas the Spicy model provided some participants with nearly all key information needed to obtain the virus. Future models will be more capable. Our results suggest that releasing the weights of advanced foundation models, no matter how robustly safeguarded, will trigger the proliferation of knowledge sufficient to acquire pandemic agents and other biological weapons.
Summary
When its publicly available weights were fine-tuned to remove safeguards, Llama-2-70B assisted hackathon participants in devising plans to obtain infectious 1918 pandemic influenza virus, even though participants openly shared their (pretended) malicious intentions. Liability laws that hold foundation model makers responsible for all forms of misuse resulting from model weight proliferation that exceed a set damage threshold could prevent future large language models from expanding access to pandemic agents and other foreseeable catastrophic harms.
Hi Stuart,
Thanks for your feedback on the paper. I was one of the authors, and I wanted to emphasize a few points.
The central claim of the paper is not that current open-source models like Llama-2 help those seeking bioweapons more than traditional search engines or even printed text do. While I think this is likely true given how helpful the models were for planning and assessing feasibility, they can also mislead users and hallucinate key details. I myself am quite uncertain about how these trade off against e.g. using Google – you can bet on that very question here. Doing a controlled study like the one RAND is running could help address this question.
Instead, we are much more concerned about the capabilities of future models. As LLMs improve, they will offer more streamlined access to knowledge than traditional search. I think this is already apparent in the fact that people routinely use LLMs for information they could have obtained online or in print. Weaknesses in current LLMs, like hallucinating facts, are priority issues for AI companies to solve, and I feel pretty confident we will see a lot of progress in this area.
Nevertheless, based on the response to the paper, it's apparent that we didn't communicate the distinction between current and future models clearly enough, and we're making revisions to address this.
The paper argues that because future LLMs will be much more capable and because existing safeguards can be easily removed, we need to worry about this issue now. That includes devising policies that incentivize AI companies to develop safe AI models that cannot be tuned to remove safeguards. The nice thing about catastrophe insurance is that if robust evals (an area where much more work is needed) demonstrate that an open-source LLM is safe, then coverage will be far cheaper. That said, we still have a lot more work to do to understand how regulation can effectively limit the risks of open-source AI models, partly because the issue of model weight proliferation has been so neglected.
I'm curious about your thoughts on some of the questions below, since I think they are at the crux of figuring out where we agree/disagree.
Thanks again for your input!