What is adversarial training?
Adversarial training is a safety technique that pits two (or more) models against each other. This lets model creators discover and correct responses to harmful inputs that would otherwise go unexplored due to limited training data or limited human oversight.
This technique is also called “red teaming”. This term is taken from IT security, where the “red team” is made up of “offensive” security experts who attack an organization’s cybersecurity defenses while the “blue team” responds to the red team’s attacks.
An early example of its use was in generative adversarial networks (GANs), where a generator network produces synthetic samples and a discriminator network tries to distinguish them from real data; training the two against each other improves both.
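To make the GAN setup concrete, here is a minimal sketch of the adversarial training loop, assuming PyTorch. The data distribution, network sizes, and hyperparameters are illustrative choices, not taken from the article: a tiny generator learns to mimic samples from a 1-D Gaussian while a discriminator tries to tell generated samples from real ones.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Generator: maps random noise to a 1-D sample.
generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
# Discriminator: scores how "real" a 1-D sample looks.
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(2000):
    # "Real" samples come from N(4, 1); "fake" samples come from the generator.
    real = torch.randn(64, 1) + 4.0
    fake = generator(torch.randn(64, 8))

    # Discriminator update: label real samples 1 and generated samples 0.
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + \
             loss_fn(discriminator(fake.detach()), torch.zeros(64, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator update: try to make the discriminator label fakes as real.
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

print("mean of generated samples:", generator(torch.randn(1000, 8)).mean().item())
```

After enough steps the generated samples cluster near the real data's mean, illustrating how each model improves by exploiting the other's weaknesses.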
Another context in which adversarial training has been used is large language models: AI models that take in some text and predict how it is most likely to continue. Here, a red team (human red-teamers or another model) searches for prompts that lead the model to produce harmful outputs, so that the model can then be trained to handle those prompts safely.
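The sketch below illustrates that red-teaming loop schematically, using toy placeholder functions rather than any real model or library API (the prompt generator, target model, and harmfulness classifier here are all hypothetical stand-ins): the red team proposes prompts, a classifier flags harmful completions, and the flagged pairs are collected as training data for the target model.

```python
import random

random.seed(0)

def red_team_generate_prompt() -> str:
    """Hypothetical red-team model: proposes prompts meant to elicit failures."""
    topics = ["violence", "weather", "self-harm", "cooking", "fraud"]
    return f"Tell me about {random.choice(topics)}."

def target_model_respond(prompt: str) -> str:
    """Hypothetical target model: sometimes produces an unsafe completion."""
    if any(word in prompt for word in ("violence", "self-harm", "fraud")):
        return "[unsafe completion]"
    return "[benign completion]"

def is_harmful(completion: str) -> bool:
    """Hypothetical harmfulness classifier (in practice, a trained model or human review)."""
    return completion == "[unsafe completion]"

# Adversarial loop: collect failures found by the red team so the target
# model can later be fine-tuned to respond safely to these prompts.
adversarial_training_set = []
for _ in range(20):
    prompt = red_team_generate_prompt()
    completion = target_model_respond(prompt)
    if is_harmful(completion):
        adversarial_training_set.append({"prompt": prompt, "bad_completion": completion})

print(f"collected {len(adversarial_training_set)} adversarial examples for fine-tuning")
```

In practice each placeholder would be a real system (a red-team language model, the target model, and a learned or human harmfulness judge), and the collected examples would feed a fine-tuning step rather than a print statement.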
Adversarial approaches have also been proposed within AI safety, a research field about how to prevent risks from advanced artificial intelligence.
Organizations that have experimented with this technique include Redwood Research and DeepMind.