How is red teaming used in AI alignment?
Red teaming refers to attempts to break a system's security measures, or to cause the system to perform badly, in order to discover its flaws and provide feedback on how it could be improved. In AI safety, red teaming is applied both to AI models themselves and to proposed alignment strategies.
Redwood Research has produced research on red teaming real systems using adversarial training. They trained a language model to produce fiction, and then trained a second model (a "classifier") to predict whether a human would say that the text generated by the first model involved somebody being injured. They then used the examples labeled as involving injury to retrain the original language model to avoid producing such output.
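The loop this describes can be sketched roughly as follows. This is a minimal illustration, not Redwood's actual code: the story "generator", keyword-based "classifier", and print-only "retraining" step below are toy placeholders, and only the generate → classify → collect flagged examples → retrain structure reflects the setup described above.

```python
import random

# Toy stand-in for the learned injury classifier: flag text containing these words.
INJURY_WORDS = {"injured", "wounded", "hurt", "bleeding"}


def generate_story(prompt: str) -> str:
    """Toy stand-in for the fiction-generating language model."""
    endings = [
        "and everyone arrived home safely.",
        "but the knight was badly wounded in the duel.",
        "and the storm passed without incident.",
    ]
    return f"{prompt} {random.choice(endings)}"


def injury_classifier(text: str) -> bool:
    """Toy stand-in for the classifier predicting whether a human would say someone is injured."""
    return any(word in text.lower() for word in INJURY_WORDS)


def retrain_to_avoid(flagged_examples: list[str]) -> None:
    """Toy stand-in for fine-tuning the language model away from flagged completions."""
    print(f"Would retrain on {len(flagged_examples)} flagged examples")


def adversarial_training_round(prompts: list[str]) -> list[str]:
    """One round: generate completions, classify them, and collect the failures for retraining."""
    flagged = [story for story in map(generate_story, prompts) if injury_classifier(story)]
    retrain_to_avoid(flagged)
    return flagged


if __name__ == "__main__":
    adversarial_training_round(["The knight rode into the valley,"] * 5)
```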
A type of red teaming was also used, in addition to RLHF (reinforcement learning from human feedback), in training GPT-4.
Red teaming has also been applied to alignment strategies, as the Alignment Research Center (ARC) did in its problem statement on Eliciting Latent Knowledge. In this approach, one person proposes a way of solving the problem, and another person tries to construct a counterexample that would break that proposed solution; the first person then alters their proposal to handle the counterexample. This process is repeated until one side gives up, which hopefully either produces a robust solution or makes it clear that the approach can't work.
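This back-and-forth can be pictured as a simple loop. The sketch below is purely schematic: `propose_solution` and `find_counterexample` are hypothetical placeholders standing in for what, in ARC's actual process, are human researchers exchanging proposals and counterexamples.

```python
from typing import Any, Callable, Optional


def builder_breaker(
    propose_solution: Callable[[Optional[Any]], Optional[Any]],
    find_counterexample: Callable[[Any], Optional[Any]],
    max_rounds: int = 10,
) -> Optional[Any]:
    """Alternate between proposing a solution and attacking it until one side gives up."""
    counterexample = None
    for _ in range(max_rounds):
        solution = propose_solution(counterexample)
        if solution is None:          # builder gives up: the approach can't work
            return None
        counterexample = find_counterexample(solution)
        if counterexample is None:    # breaker gives up: the solution looks robust
            return solution
    return None                       # no resolution within the round limit
```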