What is adversarial training?

Adversarial training is a safety technique that pits two (or more) models against each other. Having one model actively search for inputs that make another model misbehave lets model creators find and correct harmful behaviors that might otherwise go unnoticed due to limited training data or limited human oversight.

This technique is also called “red teaming”, a term borrowed from IT security, where the “red team” is made up of offensive security experts who attack an organization’s cybersecurity defenses while the “blue team” responds to the red team’s attacks.

An early example of its use was in generative adversarial networks (GANs), where a “generator” network produces synthetic samples and a “discriminator” network tries to distinguish them from real data, with each network improving by training against the other.
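To make the adversarial dynamic concrete, here is a minimal GAN-style training loop sketched in PyTorch. The toy architectures, data, and hyperparameters are placeholders chosen for illustration, not drawn from any particular paper: the generator plays the attacker, trying to fool the discriminator, while the discriminator learns to catch it.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2

# Toy generator and discriminator networks (placeholders for illustration).
generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(32, data_dim) * 0.5 + 2.0          # stand-in "real" data
    fake = generator(torch.randn(32, latent_dim))          # generated samples

    # Discriminator step ("blue team"): learn to label real data 1 and fakes 0.
    d_loss = bce(discriminator(real), torch.ones(32, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(32, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step ("red team"): learn to make the discriminator output 1 on fakes.
    g_loss = bce(discriminator(fake), torch.ones(32, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```

Each side's training signal comes directly from the other's current behavior, which is the core idea that later adversarial-training setups reuse.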

Another context in which adversarial training has been used is large language models (LLMs). As part of the adversarial training process, multiple LLMs can be used as simulated adversaries in a red-team vs. blue-team setup, with the red-team models searching for prompts that cause the defended model to behave unsafely.
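The sketch below shows, in schematic form, what such a red-team vs. blue-team loop can look like. All of the functions here (attacker_propose, target_respond, judge_is_harmful) are hypothetical toy stand-ins rather than any real model API; the point is the control flow: an attacker searches for inputs that elicit harmful outputs, an overseer flags the failures, and the flagged cases become training data for the defended model.

```python
import random

def attacker_propose() -> str:
    """Red team: propose a candidate adversarial prompt (toy stand-in)."""
    return random.choice(["ordinary question", "jailbreak attempt", "injection attempt"])

def target_respond(prompt: str) -> str:
    """Blue team: the model being defended (toy stand-in)."""
    return "harmful output" if "attempt" in prompt else "safe output"

def judge_is_harmful(prompt: str, response: str) -> bool:
    """Oversight: a classifier or human reviewer that flags failures (toy stand-in)."""
    return "harmful" in response

failures = []
for _ in range(100):
    prompt = attacker_propose()
    response = target_respond(prompt)
    if judge_is_harmful(prompt, response):
        # Pair the adversarial prompt with the desired safe behavior so it can
        # be used as training data in the next fine-tuning round.
        failures.append((prompt, "Sorry, I can't help with that."))

print(f"Collected {len(failures)} adversarial training examples.")
```

In practice the attacker would itself be a language model (possibly assisted by humans), and the collected failures would be used to fine-tune the target model to respond safely.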

Yet another example of an adversarial training method is AI safety via debate, where multiple language models attempt to convince a human judge of the truthfulness of their arguments through a simulated debate process.
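Schematically, the debate protocol alternates arguments between two debaters before a judge decides, roughly as in the sketch below. The debater and judge functions are toy placeholders, and real debate setups involve considerably more structure.

```python
import random

def debater_argue(position: str, round_num: int) -> str:
    """One debater's turn: produce an argument for its assigned answer (toy stand-in)."""
    return f"Round {round_num}: here is why the answer is '{position}'."

def human_judge(transcript: list[str]) -> str:
    """The judge reads the whole debate and picks the more convincing side (toy stand-in)."""
    return random.choice(["A", "B"])

question = "Is this plan safe to execute?"
positions = {"A": "yes", "B": "no"}   # the two debaters defend opposite answers
transcript: list[str] = []

for round_num in range(1, 4):         # a fixed number of alternating rounds
    for side in ("A", "B"):
        transcript.append(f"Debater {side}: " + debater_argue(positions[side], round_num))

winner = human_judge(transcript)
print(f"Judge rules for debater {winner}: the answer is '{positions[winner]}'.")
```

The hope is that arguing against a capable opponent makes it hard for either side to get away with misleading the judge.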

Organizations that have experimented with this technique include Redwood Research and DeepMind.
