Why can't we just turn the AI off if it starts to misbehave?
One way to make an AI system safer might be to include an off switch, so that we can turn it off if it does anything we don’t like. Unfortunately, the AI might wish to avoid being switched off, and if it is capable enough, it would succeed. Why might it have such a goal?
Humans also have “off switches”.1
For similar reasons, an agentic AI system would be incentivized to avoid being shut down if being shut down would prevent it from achieving its goals.2
Ideally, you would want a system that knows that it should stop doing whatever it’s doing when someone tries to turn it off. The technical term for this is “corrigibility”; roughly speaking, an AI system is corrigible if it works with human attempts to correct it. People have been working hard on trying to make this possible for goal-directed AI, but it’s currently not clear how we would do this even in simple cases.4 An AI model that takes in some text and predicts how the text is most likely to continue.
Further reading:
More bluntly: “humans can be killed”. ↩︎
Stuart Russell frames this as “You can’t fetch coffee if you’re dead”. ↩︎
Ways to avoid being shut down include: exfiltrating themselves through the internet, making copies of themselves, hiding their intentions, etc. ↩︎
Note that we mean simple examples of goal-directed AI (e.g., a utility maximizer that wants to make more paper-clips), rather than simple cases of any AI. For instance, a calculator could be considered an AI, and is perfectly corrigible. It could even be argued that some modern LLMs are corrigible. The hard part is to create a powerful, goal-directed AI to be corrigible. ↩︎