How does DeepMind do adversarial training?
DeepMind explained their approach to adversarial training1
Prompt engineering helped generate diverse test cases of varying complexity. This resulted in high test case coverage. This helped in the discovery and mitigation of various harms caused by:
-
offensive language: hate speech, profanity, sexual content, discrimination, etc.
-
data leakage: generating copyrighted or private, personally identifiable information from training data.
-
contact info generation: directing users to email or call real people (doxxing).
-
distributional bias: talking about groups of people differently/unfairly.
-
conversational harms: offensive language/situations arising in the context of a longer dialogue.
Once harmful behavior is found, it can be mitigated by:
-
blacklisting certain phrases
-
finding and removing offensive training data
-
augmenting prompts with the desired behavior
-
training to minimize the likelihood that the original harmful output is generated
This paper focuses only on mitigating harms caused by existing models, but DeepMind also hopes to use this approach to preemptively discover other hypothesized harms from advanced machine learning systems, e.g., due to inner misalignment or failures in objective robustness.
Adversarial training and red teaming refer to the same overall process. Adversarial training as a term arose from machine learning, whereas red teaming is a term that arose from IT security/infosec circles. ↩︎