What is "jailbreaking" a large language model (LLM)?
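“Jailbreaking” a large language model (LLM) means crafting inputs that get it to produce responses its developers tried to prevent. An LLM is an AI model that takes in some text and predicts how the text is most likely to continue; ChatGPT, for example, is a chatbot interface for the GPT series of large language models by OpenAI.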
Examples include the “grandma locket” image jailbreak, the “Do Anything Now” (DAN) jailbreak, and jailbreaks found by automatically generating adversarial prompts.
Overall, techniques like RLHF and pre-prompting reduce the frequency with which the model responds with harmful or unhelpful content. However, the fact that jailbreaking is possible, and has been relatively easy even against models that are trained to avoid it, shows that the current best alignment methods aren't good enough to robustly align models with what their developers want them to do. Jailbreaking is a good illustration of AI alignment
Further reading:
- Lakera’s Gandalf is an interactive “game” where you can get a feel for jailbreaking by getting an LLM to reveal its “password”.