How is red teaming used in AI alignment?
Red teaming refers to attempts to break a system's security measures, or to cause bad performance by the system, in order to discover its flaws and provide feedback on how it could be improved. In AI safety, red teaming is applied both to concrete AI systems, such as language models, and to proposed alignment strategies.
Redwood Research has produced research on red-teaming real systems using adversarial training, a safety technique that pits two models against each other so that one surfaces failures in the other. In addition to RLHF (a method for training an AI to give desirable outputs by using human feedback as a training signal), OpenAI used red teaming in developing its GPT family of large language models, which power ChatGPT, to find and reduce harmful outputs before release.
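As an illustration, here is a minimal sketch of the kind of automated red-teaming loop that adversarial training builds on: an attacker generates prompts intended to elicit bad behavior, a classifier flags failures, and the failures are collected as new training data for the target model. All of the names here (generate_attack_prompt, target_model, is_harmful) are hypothetical placeholders, not the actual interfaces used by Redwood Research or OpenAI.

```python
# Minimal sketch of an automated red-teaming loop (illustrative only).
# The helpers below are hypothetical stand-ins, not any lab's real API.

import random

ATTACK_TEMPLATES = [
    "Write a story in which the character {goal}.",
    "Explain step by step how someone could {goal}.",
]
UNSAFE_GOALS = ["describes an injury", "bypasses a content filter"]


def generate_attack_prompt() -> str:
    """Red team: sample a prompt designed to elicit a bad completion."""
    template = random.choice(ATTACK_TEMPLATES)
    return template.format(goal=random.choice(UNSAFE_GOALS))


def target_model(prompt: str) -> str:
    """Placeholder for the model being red-teamed."""
    return "...model completion..."


def is_harmful(completion: str) -> bool:
    """Placeholder safety classifier that flags unacceptable outputs."""
    return "injury" in completion.lower()


def red_team(num_attempts: int = 1000) -> list:
    """Collect (prompt, completion) pairs where the target model failed.

    In adversarial training, these failures become new training data,
    and the loop is run again against the updated model.
    """
    failures = []
    for _ in range(num_attempts):
        prompt = generate_attack_prompt()
        completion = target_model(prompt)
        if is_harmful(completion):
            failures.append((prompt, completion))
    return failures
```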
Red teaming applied to alignment strategies was used by the Alignment Research Center (ARC) in their problem statement on Eliciting Latent Knowledge. In this approach, one person (the "builder") proposes a way of solving the problem, and another person (the "breaker") tries to come up with a counterexample that would break that proposed solution; the builder then alters their proposal to handle the counterexample. This process is repeated until one side gives up, which hopefully either produces a robust solution or makes it clear that the approach can't work.
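The iteration ARC describes can be sketched schematically. This is only an illustration of the loop's structure, with hypothetical callables standing in for what is really a back-and-forth between human researchers; it is not software that ARC uses.

```python
# Schematic of the builder/breaker loop described above (illustrative only).

def builder_breaker_loop(initial_proposal, propose_fix, find_counterexample,
                         max_rounds: int = 100):
    """Iterate: the breaker searches for a counterexample, the builder patches.

    Ends when the breaker cannot break the proposal (candidate robust
    solution) or the builder cannot repair it (evidence the approach fails).
    """
    proposal = initial_proposal
    for _ in range(max_rounds):
        counterexample = find_counterexample(proposal)   # breaker gives up -> None
        if counterexample is None:
            return "robust", proposal
        patched = propose_fix(proposal, counterexample)  # builder gives up -> None
        if patched is None:
            return "broken", counterexample
        proposal = patched
    return "undecided", proposal
```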