What is the AI control research agenda?
The AI control agenda1
This agenda emphasizes the development of methods to evaluate whether an AI is deceptive based on adversarial training. For example, trusted weaker AI models could be used to monitor the outputs of more capable models for suspicious behavior. Another example is that the functionality of systems that monitor AIs could be tested by exposing those systems to synthetic examples of malicious behavior and seeing whether they catch them. The hope is that we can catch any scheming AI red-handed.
Such evaluations could be implemented and tested with today’s AIs, but are intended to scale with AI capabilities up to human-level.2
Controlling AI is not meant to replace aligning AI, but to work as a second line of defense. Proponents believe that if we can ensure that a human-level AI cannot cause harm, we can use it to reduce risks from even smarter AIs — for instance, by having it increase cybersecurity or by having it help us align these smarter AIs3
Some critics argue that using weaker AI to align stronger AI is a hopeless endeavor4
The term “AI control” has sometimes been used more generally to refer to the problem of making advanced AI behave well as a whole (“the control problem”), or to any techniques that involve restricting its capabilities, such as putting it in a box. This article focuses on a particular research agenda proposed in recent years. “AI control” is also not to be confused with Control AI, an AI safety advocacy group. ↩︎
Proponents acknowledge that these evaluations might no longer work once AI achieves broadly superhuman capabilities. ↩︎
The strategy of using an AI to help with the alignment/control problem of even more powerful AI is, oddly enough, unnamed. However, there have been quite a few alignment proposals, including the ones pursued by the leading AI labs, along these lines. For instance, Iterated Distillation and Amplification, Superalignment, AI safety via debate, etc. ↩︎
“Like asking Godzilla to prevent Mega-Godzilla from terrorizing Japan” ―John Wentworth in Godzilla strategies. ↩︎