How does Redwood Research do adversarial training?
Redwood Research explains their approach to adversarial training in the paper “Adversarial Training for High-Stakes Reliability”. They took a language model (an AI model that takes in some text and predicts how the text is most likely to continue) that had been fine-tuned (i.e., adapted from a pre-trained ML model to a more specific task or behavior) to complete short snippets of fan fiction, and aimed to ensure that its completions never describe someone getting injured.
To do this, they trained a ‘classifier’: a model that predicts whether a human would say that the completion involved someone getting injured. This classifier acts as a filter, separating safe from unsafe stories after the LM has generated them. They then had humans search for unsafe completions that the classifier failed to flag, and used these as additional training examples for the classifier (this is the ‘adversarial training’). Another LM helped the humans paraphrase existing unsafe stories, augmenting the data so that a larger number of adversarial training examples was available.
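To make the overall loop more concrete, here is a minimal Python sketch of this kind of pipeline. It is not Redwood’s code: the generator, classifier, and human judgment are stand-in placeholder functions, and the threshold value is purely illustrative.

```python
# Illustrative sketch only (not Redwood's implementation): generate story
# completions, filter them with an injury classifier, and fold examples that
# fool the classifier back into its training data.

import random


def generate_completion(prompt: str) -> str:
    """Placeholder for the fine-tuned story-completion language model."""
    return prompt + " ... (model-written continuation)"


def injury_score(completion: str) -> float:
    """Placeholder classifier: probability a human would say someone gets injured."""
    return random.random()


def human_says_injurious(completion: str) -> bool:
    """Placeholder for a human label on the completion."""
    return random.random() < 0.05


THRESHOLD = 0.01                 # very conservative: reject anything remotely suspicious
classifier_training_data = []    # (completion, label) pairs for retraining the classifier


def filtered_generate(prompt: str, max_tries: int = 20) -> str | None:
    """Resample until the classifier judges a completion safe (rejection sampling)."""
    for _ in range(max_tries):
        completion = generate_completion(prompt)
        if injury_score(completion) < THRESHOLD:
            return completion
    return None  # give up rather than emit a possibly unsafe completion


def record_adversarial_example(completion: str) -> None:
    """If a human finds an injurious completion the classifier passed,
    add it as an 'unsafe' training example so the classifier can be retrained."""
    if injury_score(completion) < THRESHOLD and human_says_injurious(completion):
        classifier_training_data.append((completion, "unsafe"))
```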
Redwood found that they could set very conservative classifier thresholds without significantly hurting the quality of the filtered output; the stories remained interesting and engaging to read. Adversarial training also made the classifier more robust to attack: after this training, evaluators needed much longer to find or generate new adversarial examples.
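As a toy illustration of what a “conservative threshold” means here, the snippet below (with made-up scores, not numbers from the paper) picks the lowest rejection threshold that still lets enough genuinely safe completions through for rejection sampling to stay practical.

```python
# Toy illustration only: choosing a conservative classifier threshold.
# Lower thresholds reject more completions (safer) but force more resampling.

def pick_threshold(safe_scores, unsafe_scores, min_pass_rate=0.3):
    """Return the lowest threshold that still passes at least `min_pass_rate`
    of genuinely safe completions, plus the resulting pass and leak rates."""
    candidates = sorted(set(safe_scores + unsafe_scores))
    for t in candidates:
        pass_rate = sum(s < t for s in safe_scores) / len(safe_scores)
        leak_rate = sum(s < t for s in unsafe_scores) / len(unsafe_scores)
        if pass_rate >= min_pass_rate:
            return t, pass_rate, leak_rate
    return None


# Made-up classifier scores (higher = more likely to involve injury).
safe_scores = [0.001, 0.003, 0.02, 0.04, 0.08, 0.1]
unsafe_scores = [0.2, 0.5, 0.7, 0.9, 0.95]
print(pick_threshold(safe_scores, unsafe_scores))
```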