
While this is close to areas I work in, it's a personal post. No one reviewed this before I published it, or asked me to (or not to) write something. All mistakes are my own.

A few days ago, some of my coworkers at SecureBio put out a preprint, "Will releasing the weights of future large language models grant widespread access to pandemic agents?" (Gopal et al. 2023). They took Facebook/Meta's Llama-2-70B large language model (LLM) and (cheaply!) adjusted it to remove the built-in safeguards, after which it was willing to answer questions on how to get infectious 1918 flu. I like a bunch of things about the paper, but I also think it suffers from being undecided on whether it's communicating:

  1. Making LLMs public is dangerous because by publishing the weights you allow others to easily remove safeguards.

  2. Once you remove the safeguards, current LLMs are already helpful in getting at the key information necessary to cause a pandemic.

I think it demonstrates the first point pretty well. The main way we keep LLMs from telling people how to cause harm is to train them on a lot of examples of someone asking how to cause harm and being told "no", and this can easily be reversed by additional training with "yes" examples. So even if you get incredibly good at this, if you make your LLM public you make it very easy for others to turn it into something that compliantly shares any knowledge it contains.

Now, you might think that there isn't actually any dangerous knowledge, at least not within what an LLM could have learned from publicly available sources. I think this is pretty clearly not true: the process of creating infectious 1918 flu is scattered across the internet and hard for most people to assemble. If you had an experienced virologist on call and happy to answer any question, however, they could walk you there through a mixture of doing things yourself and duping others into doing things. And if they were able to read and synthesize all virology literature they could tell you how to create things quite a bit worse than this former pandemic.

GPT-4 is already significantly better than Llama-2, and GPT-5 in 2024 is more likely than not. Public models will likely continue to move forward, and while it's unlikely that we get a GPT-4-level Llama-3 in 2024, I do think the default path involves very good public models within a few years. At which point anyone with a good GPU can have their own personal amoral virologist advisor. Which seems like a problem!

But the paper also seems to be trying to get into the question of whether current models are capable of teaching people how to make 1918 flu today. If they just wanted to assess whether the models were willing and able to answer questions on how to create bioweapons they could have just asked them. Instead, they ran a hackathon to see whether people could, in one hour, get the no-safeguards model to fully walk them through the process of creating infectious flu. I think the question of whether LLMs have already lowered the bar for causing massive harm through biology is a really important one, and I'd love to see a follow-up that addressed that with a no-LLM control group. That still wouldn't be perfect, since outside the constraints of a hackathon you could take a biology class, read textbooks, or pay experienced people to answer your questions, but it would tell us a lot. My guess is that the synthesis functionality of current LLMs is actually adding something here and a no-LLM group would do quite a bit worse, but the market only has that at 17%.

Even if no-safeguards public LLMs don't lower the bar today, and given how frustrating Llama-2 can be this wouldn't be too surprising, it seems pretty likely we get to where they do significantly lower the bar within the next few years. Lower it enough, and some troll or committed zealot will go for it. Which, aside from the existential worries, just makes me pretty sad. LLMs with open weights are just getting started in democratizing access to this incredibly transformative technology, and a world in which we all only have access to LLMs through a small number of highly regulated and very conservative organizations feels like a massive loss of potential. But unless we figure out how to create LLMs where the safeguards can't just be trivially removed, I don't see how to avoid this non-free outcome while also avoiding widespread destruction.

(Back in 2017 I asked for examples of risk from AI, and didn't like any of them all that much. Today, "someone asks an LLM how to kill everyone and it walks them through creating a pandemic" seems pretty plausible.)


Comments (7)

I also like the way you divide up the claims. I think this paper is a really neat demonstration of point 1, and I'm kinda disappointed with the discourse for getting distracted arguing about point 2.

That's fair, though since a lot of people already knew about #1 and are very interested in whether #2 is true (or might soon become true), it's not that surprising that this is where the interest is.

  1. Making LLMs public is dangerous because by publishing the weights you allow others to easily remove safeguards.
  2. Once you remove the safeguards, current LLMs are already helpful in getting at the key information necessary to cause a pandemic.

 

I like this way of splitting it up. I think the paper made a good case for point 1, but I think point 2 is greatly overstated. With current tech you would still need an expert to sift through hallucinations and to guide the LLM, and the same expert could do the same thing without the LLM. On this issue current LLMs are timesavers, not gamechangers.

For this reason I doubt you can convince people to hide their weights now, but possibly you can convince them to do so later, when the tech is improved enough to be dangerous.

possibly you can convince them to do so later, when the tech is improved enough to be dangerous

Sort of: because once you publish the weights for a model there's no going back, I'm hoping even the next round of models will not be published, or at least not published without a thorough set of evals. The problem is the asymmetry: if you initially miss that a private model is able to meaningfully lower the bar to causing harm (ex: telling people how to make pandemics), you can still restrict access or modify it once you find out, while if you learn that a public model can do that you're out of luck.

I'm encouraging people to stop using the framing of "democratizing access".

I think this framing is misleading because, given current polls, it's not at all clear that the population (at least of the countries I've seen surveyed) would vote for frontier models to be open-sourced. 

The phrase "democratizing access" doesn't mean "distributing access in line with a popular vote" but "distributing access to the people". This is definition #2, "make (something) accessible to everyone." See democratization of knowledge for more of this kind of usage.

Sure, and I think we should stop using this definition as it unnecessarily confuses people/distorts the conversation.
