Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

sjp; Quentin Feuillade--Montixi; Rusheb

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

sjp,

Comments

Sorted by

New & upvoted

No comments on this post yet.

Be the first to respond.

Comments

Curated and popular this week

Cultivating hope: calibrating the expectations for cultivated meat to end factory farming

PabloAMC 🔸·1w ago·Curated 5d ago·22m read

115

Maybe do the thing you wish CEA would do

alejoacelas 🔸·4d ago·2m read

I used AI to fix transcription errors, rerrarange the ideas, and suggest tweaks to the title and some sentences. Three of the most exciting projects to come out of EA in recent years are, in a vague sense, CEA spinouts: * Kairos is directly a spinout of CEA and now handles most support for university AI safety groups. Basically everyone I've found who knows them is really excited about what they do * NEST is an opinionated ideas-fi...

RP is looking for project founders in neglected animal areas

Rethink Priorities·5d ago·7m read

TLDR; To help the effective animal advocacy movement cost-effectively absorb greater amounts of funding in the near future, we are seeking expressions of interest from people who could found a new organization focused on: * Highly neglected animals: insects, wild animals, shrimp, fish, etc, or * AI and animals: AI alignment and governance for animal welfare, strategic actions considering transformative AI, AI for wild animals, etc. * ...

Motivation

Our research team was motivated to show that state-of-the-art (SOTA) LLMs like GPT-4 and Claude 2 are not robust to misuse risk and can't be fully aligned to the desires of their creators, posing risk for societal harm. This is despite significant effort by their creators, showing that the current paradigm of pre-training, SFT, and RLHF is not adequate for model robustness.

We also wanted to explore & share findings around "persona modulation"^[1], a technique where the character-impersonation strengths of LLMs are used to steer them in powerful ways.

Summary

We introduce an automated, low cost way to make transferable, black-box, plain-English jailbreaks for GPT-4, Claude-2, fine-tuned Llama. We elicit a variety of harmful text, including instructions for making meth & bombs.

The key is *persona modulation*. We steer the model into adopting a specific personality that will comply with harmful instructions.

We introduce a way to automate jailbreaks by using one jailbroken model as an assistant for creating new jailbreaks for specific harmful behaviors. It takes our method less than $2 and 10 minutes to develop 15 jailbreak attacks.

Meanwhile, a human-in-the-loop can efficiently make these jailbreaks stronger with minor tweaks. We use this semi-automated approach to quickly get instructions from GPT-4 about how to synthesise meth 🧪💊.

Abstract

Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour. In this work, we investigate persona modulation as a black-box jailbreaking method to steer a target model to take on personalities that are willing to comply with harmful instructions. Rather than manually crafting prompts for each persona, we automate the generation of jailbreaks using a language model assistant. We demonstrate a range of harmful completions made possible by persona modulation, including detailed instructions for synthesising methamphetamine, building a bomb, and laundering money. These automated attacks achieve a harmful completion rate of 42.5% in GPT-4, which is 185 times larger than before modulation (0.23%). These prompts also transfer to Claude 2 and Vicuna with harmful completion rates of 61.0% and 35.9%, respectively. Our work reveals yet another vulnerability in commercial large language models and highlights the need for more comprehensive safeguards.

Acknowledgements

Thank you to Alexander Pan and Jason Hoelscher-Obermaier for feedback on early drafts of our paper.

^{^}

Credit goes to @Quentin FEUILLADE--MONTIXI for developing the model psychology and prompt engineering techniques that underlie persona modulation. Our research built upon these techniques to automate and scale them as a red-teaming method for jailbreaks.

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Motivation

Summary

Abstract

Full paper

Safety and disclosure

Acknowledgements