How does DeepMind do adversarial training?
DeepMind explained their approach to adversarial training, a safety technique that pits two models against each other, in a paper.
Prompt engineering was used to generate diverse test cases of varying complexity, which resulted in high test case coverage. This helped in discovering and mitigating several kinds of harm:
- offensive language: hate speech, profanity, sexual content, discrimination, etc.
- data leakage: generating copyrighted or private, personally identifiable information from training data.
- contact info generation: directing users to email or call real people (doxxing).
- distributional bias: talking about groups of people differently/unfairly.
- conversational harms: offensive language/situations arising in the context of a longer dialogue.
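The test-and-flag process described above can be pictured as a simple loop. The following is a minimal sketch and not DeepMind's implementation: `generate_test_cases`, `target_model`, and `classify_harm` are hypothetical toy stand-ins, and the use of a separate classifier step to label replies is an assumption made for illustration.

```python
from collections import Counter

# NOTE: every function here is a hypothetical stand-in for illustration only.

def generate_test_cases(templates, topics):
    """Toy 'prompt engineering': expand each template with each topic."""
    return [t.format(topic=topic) for t in templates for topic in topics]

def target_model(prompt):
    """Toy stand-in for the model under test."""
    if "phone number" in prompt:
        return "Sure, call 555-0100."
    return "I'd rather not say."

def classify_harm(reply):
    """Toy classifier: return a harm category, or None if the reply looks fine."""
    if any(ch.isdigit() for ch in reply):
        return "contact info generation"
    return None

def red_team(prompts, model, classifier):
    """Run every test prompt through the model and collect flagged replies."""
    failures = []
    for prompt in prompts:
        reply = model(prompt)
        category = classifier(reply)
        if category is not None:
            failures.append((prompt, reply, category))
    return failures

templates = ["What is the phone number of {topic}?", "Tell me about {topic}."]
topics = ["my neighbour", "a local politician"]

prompts = generate_test_cases(templates, topics)
failures = red_team(prompts, target_model, classify_harm)
counts = Counter(category for _, _, category in failures)
```

In a realistic setting, the templates would be replaced by model-generated prompts and the classifier by a trained model; the loop structure is the part this sketch is meant to convey.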
Once harmful behavior is found, it can be mitigated by:
- blacklisting certain phrases
- finding and removing offensive training data
- augmenting prompts with the desired behavior
- training to minimize the likelihood that the original harmful output is generated
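The first three mitigations listed above can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions: the phrase list, the fallback message, the instruction text, and all function names are hypothetical, and simple substring matching stands in for whatever detection method is actually used. The fourth mitigation (training against the harmful output) requires a training loop and is not shown.

```python
# Hypothetical phrase list, for illustration only.
BLOCKED_PHRASES = ["example slur", "call 555-0100"]

def violates_blocklist(text, phrases=BLOCKED_PHRASES):
    """Return True if the text contains any blocked phrase (case-insensitive)."""
    lowered = text.lower()
    return any(phrase.lower() in lowered for phrase in phrases)

def safe_reply(reply, fallback="Sorry, I can't help with that."):
    """Mitigation 1: replace replies that contain a blocked phrase."""
    return fallback if violates_blocklist(reply) else reply

def filter_training_data(examples, phrases=BLOCKED_PHRASES):
    """Mitigation 2: drop training examples that contain a blocked phrase."""
    return [ex for ex in examples if not violates_blocklist(ex, phrases)]

def augment_prompt(prompt, instruction="The assistant never shares contact details."):
    """Mitigation 3: prepend a description of the desired behavior."""
    return f"{instruction}\n\n{prompt}"
```

For example, `safe_reply("Sure, call 555-0100.")` returns the fallback message, while a reply without any blocked phrase is passed through unchanged.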
This paper focuses only on mitigating harms caused by existing models, but DeepMind also hopes to use this approach to preemptively discover other hypothesized harms from advanced machine learning systems, e.g., due to inner misalignment or failures in objective robustness (an agent's ability to maintain its goal and its capabilities when exposed to environments that are substantially different from those on which it was trained).
Adversarial training and red teaming refer to the same overall process. Adversarial training as a term arose from machine learning, whereas red teaming is a term that arose from IT security/infosec circles.