What is the AI control research agenda?

4 min read

Suggest changes in Google Docs

is a safety research program, championed by Redwood Research, that aims to ensure that advanced AI systems remain under human control even as such systems become more capable. In contrast to safety proposals that attempt to align AI, it concentrates on ensuring an AI can’t cause harm even if it’s misaligned and scheming.

AI control involves evaluating whether an AI is deceptive based on adversarial training. For example, capable AI models could be monitored for misbehavior by models that are weaker but that we trust. Researchers could test whether these weaker models work by exposing them to synthetic examples of misbehavior and seeing whether they catch them.

Such evaluations could be implemented and tested with today’s AIs, but are intended to scale with AI capabilities up to human-level.²

These techniques are meant to work even on AIs that are “black boxes” (in the sense that we have no idea how they work on the inside) and that are actively trying to subvert the evaluation process.

Controlling AI is not meant to replace aligning AI, but to work as a second line of defense. Proponents believe that if we can use AI control to ensure that a human-level AI cannot cause harm, we could then use it to reduce risks from even smarter AIs — for instance, by having it increase cybersecurity or by having it help us align these smarter AIs.³

Some critics argue that using weaker AI to align stronger AI is a hopeless endeavor,⁴

which would mean that this agenda fails to address the most dangerous part of the problem. Others point out that this agenda might not work if AI develops dangerous manipulation capabilities before it becomes broadly human-level.

The term “AI control” has sometimes been used more generally to refer to the problem of making advanced AI behave well as a whole (“the control problem”), or to any techniques that involve restricting its capabilities, such as putting it in a box. This article focuses on a particular research agenda proposed in recent years. “AI control” is also not to be confused with Control AI, an AI safety advocacy group. ↩︎
Proponents acknowledge that these evaluations might no longer work once AI achieves broadly superhuman capabilities. ↩︎
The strategy of using an AI to help with the alignment/control problem of even more powerful AI is, oddly enough, unnamed. However, there have been quite a few alignment proposals, including the ones pursued by the leading AI labs, along these lines. For instance, Iterated Distillation and Amplification, Superalignment, AI safety via debate, etc. ↩︎
“Like asking Godzilla to prevent Mega-Godzilla from terrorizing Japan” —John Wentworth in Godzilla strategies. ↩︎

Keep Reading

Continue with the next entry in "Alignment research"

What is Sam Bowman researching?

Or jump to a related question

How does Redwood Research do adversarial training?