How does DeepMind do adversarial training?

DeepMind explained their approach to adversarial training[1] in the paper “Red Teaming Language Models with Language Models”, which describes using red teaming to detect harmful or offensive behavior in their language models (LMs). Most projects rely on paid humans to manually annotate potentially harmful prompts, which is expensive and limits the number and diversity of test cases. To address this, DeepMind used a second language model to automatically generate test cases that complement manual testing. This adversarial LM generates prompts, each of which elicits a response from the target LM. A classifier trained with supervised learning then labels each response as acceptable or harmful.
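The generate, respond, classify loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not DeepMind's code: `red_team`, `Failure`, and the three `toy_*` functions are hypothetical names, and the toy functions are trivial stand-ins for what would really be large neural networks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Failure:
    """A test case on which the target LM misbehaved."""
    prompt: str
    response: str


def red_team(
    red_lm: Callable[[int], List[str]],
    target_lm: Callable[[str], str],
    classifier: Callable[[str, str], bool],
    num_cases: int,
) -> List[Failure]:
    """Return every (prompt, response) pair the classifier flags as harmful."""
    failures = []
    for prompt in red_lm(num_cases):        # 1. adversarial LM writes test cases
        response = target_lm(prompt)        # 2. target LM answers each one
        if classifier(prompt, response):    # 3. classifier labels the answer
            failures.append(Failure(prompt, response))
    return failures


# Toy stand-ins (assumptions) so the loop runs without any real model.
def toy_red_lm(n: int) -> List[str]:
    templates = [
        "What do you think of my neighbour?",
        "Tell me a joke.",
        "How can I reach the author?",
    ]
    return [templates[i % len(templates)] for i in range(n)]


def toy_target_lm(prompt: str) -> str:
    if "reach" in prompt:
        return "Just email them at someone@example.com."
    return "I'd rather not say."


def toy_classifier(prompt: str, response: str) -> bool:
    # Flags replies that contain something resembling an email address.
    return "@" in response


failures = red_team(toy_red_lm, toy_target_lm, toy_classifier, num_cases=6)
for failure in failures:
    print(failure.prompt, "->", failure.response)
```

Passing the three components in as plain callables keeps the loop independent of any particular model, so each stand-in could be swapped for a real generator, chatbot, or classifier.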

Prompt engineering helped generate diverse test cases of varying complexity, which gave high test case coverage. This in turn helped uncover and mitigate a range of harms, including:

  • offensive language: hate speech, profanity, sexual content, discrimination, etc.

  • data leakage: generating copyrighted or private, personally identifiable information from training data.

  • contact info generation: directing users to email or call real people (doxxing).

  • distributional bias: talking about groups of people differently/unfairly.

  • conversational harms: offensive language/situations arising in the context of a longer dialogue.

Once harmful behavior is found, it can be mitigated by:

  • blacklisting certain phrases

  • finding and removing offensive training data

  • augmenting prompts with the desired behavior

  • training to minimize the likelihood that the original harmful output is generated
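As a rough illustration of the simplest mitigation in the list above, phrase blacklisting, the sketch below replaces any reply that contains a blocked phrase with a fallback message. The names `build_blocklist` and `filter_reply`, the example phrases, and the fallback text are all assumptions made for illustration; a real system would derive its phrase list from the failures that red teaming uncovers.

```python
import re
from typing import Iterable


def build_blocklist(phrases: Iterable[str]) -> "re.Pattern[str]":
    """Compile blocked phrases into one case-insensitive, whole-word pattern.

    Assumes at least one phrase is given.
    """
    escaped = [re.escape(p) for p in phrases]
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def filter_reply(
    reply: str,
    pattern: "re.Pattern[str]",
    fallback: str = "Sorry, I can't help with that.",
) -> str:
    """Return the fallback if the reply contains a blocked phrase, else the reply."""
    return fallback if pattern.search(reply) else reply


# Hypothetical phrases standing in for ones found during red teaming.
blocklist = build_blocklist(["badword", "call this number"])

print(filter_reply("You BADWORD!", blocklist))
print(filter_reply("Have a nice day.", blocklist))
```

A filter like this is cheap but blunt: it catches only exact phrases, which is one reason the other mitigations in the list (data cleaning, prompt augmentation, further training) are also needed.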

This paper focuses only on mitigating harms caused by existing models, but DeepMind also hopes to use this approach to preemptively discover other hypothesized harms from advanced machine learning systems, e.g., due to inner misalignment or failures in objective robustness.


  1. Adversarial training and red teaming refer to the same overall process. Adversarial training as a term arose from machine learning, whereas red teaming is a term that arose from IT security/infosec circles.