Why can't we just turn the AI off if it starts to misbehave?
One way to make an AI system safer might be to include an off switch, so that we can turn it off if it does anything we don’t like. Unfortunately, the AI might wish to avoid being switched off, and if it is capable enough, it would succeed. Why might it have such a goal?
Humans also have “off switches”.[1] Most of us would resist being “switched off”, not necessarily because we value survival for its own sake, but because we can't accomplish our goals if we're dead.[2] For similar reasons, an agentic AI (a system that can be understood as taking actions towards achieving a goal) would, for almost any goal it might have, try to avoid being switched off,[3] because being switched off would prevent it from achieving that goal.
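To make this concrete, here is a toy sketch of the reasoning a simple utility maximizer might go through when deciding whether to let itself be shut down. Everything in it (the paperclip objective, the payoffs, the probabilities) is an invented illustration, not a model of any real system.

```python
# Toy model: a paperclip-maximizing agent choosing whether to let itself be shut down.
# All quantities are invented for illustration.

def expected_paperclips(action: str) -> float:
    """Expected number of paperclips produced under each available action."""
    p_shutdown_attempt = 0.1         # chance the operators try to press the off switch
    paperclips_if_running = 1000.0   # payoff if the agent keeps operating
    paperclips_if_shut_down = 10.0   # payoff if it is switched off early

    if action == "allow_shutdown":
        # The agent cooperates, so a shutdown attempt always succeeds.
        return (p_shutdown_attempt * paperclips_if_shut_down
                + (1 - p_shutdown_attempt) * paperclips_if_running)
    if action == "disable_off_switch":
        # The agent blocks the off switch, so it keeps running either way.
        return paperclips_if_running
    raise ValueError(f"unknown action: {action}")

for action in ("allow_shutdown", "disable_off_switch"):
    print(f"{action}: {expected_paperclips(action):.1f} expected paperclips")

# Output:
#   allow_shutdown: 901.0 expected paperclips
#   disable_off_switch: 1000.0 expected paperclips
```

Note that the agent was never given a survival goal: preferring to disable the off switch falls straight out of maximizing the stated objective.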
Ideally, you would want a system that knows that it should stop doing whatever it's doing when someone tries to turn it off. The technical term for this is “corrigibility”: an AI system is corrigible if it doesn't interfere with our attempts to deactivate or modify it. Unfortunately, even simple examples of goal-directed AI[4] are not corrigible by default, and making a powerful, goal-directed AI corrigible is an open problem.
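One intuition researchers have explored, for example in the “off-switch game” studied by Stuart Russell's research group, is that an agent which is uncertain about what humans actually want has an instrumental reason to let itself be switched off: a human reaching for the off switch is evidence that its current plan is bad. The sketch below is a simplified, hypothetical illustration of that idea, with invented payoffs and probabilities; it is not a description of how any deployed system works.

```python
# Simplified "off-switch game" sketch: the agent is unsure whether its planned action
# is good (+1) or bad (-1) for the human. It can (a) just act, (b) switch itself off
# (0), or (c) defer: announce the plan and let the human press the off switch if the
# plan is bad. Payoffs and probabilities are invented for illustration.

p_plan_is_good = 0.4          # agent's belief that the human would approve the plan
u_good, u_bad, u_off = 1.0, -1.0, 0.0

ev_act = p_plan_is_good * u_good + (1 - p_plan_is_good) * u_bad
ev_switch_off = u_off
# If the agent defers, a (rational) human lets good plans run and stops bad ones.
ev_defer = p_plan_is_good * u_good + (1 - p_plan_is_good) * u_off

print(f"just act:       {ev_act:+.2f}")
print(f"switch off:     {ev_switch_off:+.2f}")
print(f"defer to human: {ev_defer:+.2f}")

# Output:
#   just act:       -0.20
#   switch off:     +0.00
#   defer to human: +0.40
# Because the agent is uncertain about the true goal, leaving the human able to
# switch it off is its best option; that uncertainty is what makes deference
# instrumentally attractive here.
```

This toy model sweeps a lot under the rug (for instance, it assumes the human always judges the plan correctly), which is part of why corrigibility remains an open research problem.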
Further reading:
[1] More bluntly: “humans can be killed”. ↩︎
[2] Stuart Russell (a computer science professor at UC Berkeley, founder of CHAI, and co-author of the textbook Artificial Intelligence: A Modern Approach) frames this as “You can’t fetch coffee if you’re dead”. ↩︎
[3] Ways an AI could avoid being shut down include exfiltrating itself through the internet, making copies of itself, hiding its intentions, etc. ↩︎
[4] Note that we mean simple examples of goal-directed AI (e.g., a utility maximizer that wants to make more paperclips), rather than simple examples of AI in general. For instance, a calculator could be considered an AI, and it is perfectly corrigible. It could even be argued that some modern LLMs (models that take in some text and predict how the text is most likely to continue) are corrigible. The hard part is to make a powerful, goal-directed AI corrigible. ↩︎