What is the AI control research agenda?
The AI control agenda1 A research field about how to prevent risks from advanced artificial intelligence.
AI control involves evaluating whether an AI is deceptive based on adversarial training. For example, capable AI models could be monitored for misbehavior by models that are weaker but that we trust. Researchers could test whether these weaker models work by exposing them to synthetic examples of misbehavior and seeing whether they catch them.
Such evaluations could be implemented and tested with today’s AIs, but are intended to scale with AI capabilities up to human-level.2
Controlling AI is not meant to replace aligning AI, but to work as a second line of defense. Proponents believe that if we can use AI control to ensure that a human-level AI cannot cause harm, we could then use it to reduce risks from even smarter AIs — for instance, by having it increase cybersecurity or by having it help us align these smarter AIs.3
Some critics argue that using weaker AI to align stronger AI is a hopeless endeavor,4
The term “AI control” has sometimes been used more generally to refer to the problem of making advanced AI behave well as a whole (“the control problem”), or to any techniques that involve restricting its capabilities, such as putting it in a box. This article focuses on a particular research agenda proposed in recent years. “AI control” is also not to be confused with Control AI, an AI safety advocacy group. ↩︎
Proponents acknowledge that these evaluations might no longer work once AI achieves broadly superhuman capabilities. ↩︎
The strategy of using an AI to help with the alignment/control problem of even more powerful AI is, oddly enough, unnamed. However, there have been quite a few alignment proposals, including the ones pursued by the leading AI labs, along these lines. For instance, Iterated Distillation and Amplification, Superalignment, AI safety via debate, etc. ↩︎
“Like asking Godzilla to prevent Mega-Godzilla from terrorizing Japan” —John Wentworth in Godzilla strategies. ↩︎