What is corrigibility?
A “corrigible” agent is one that doesn't interfere with what we would intuitively see as attempts to correct it, or to correct our mistakes in building it, despite having instrumentally convergent reasons to oppose such corrections.
- If we try to pause (“hibernate”) a corrigible AI, or shut it down entirely, it will let us do so. This is not something an AI is automatically incentivized to allow: an agent that is shut down can no longer pursue its goals, so by default it has an instrumental reason to resist.
- If we try to reprogram a corrigible AI, it will not resist and will allow the modification to go through. Unless this behavior is specifically incentivized, an AI might instead try to block the change, for example by fooling us into believing its utility function was modified successfully while covertly retaining the original one. By default, such deception can be the preferred outcome according to the AI's current preferences, since any new values would score poorly under the old ones; the sketch after this list makes this concrete.
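To make the instrumental argument above concrete, here is a minimal Python sketch of an agent that evaluates our correction attempts under its own current utility function. The goal, the action names, and the payoff numbers are all hypothetical illustrations, not a real agent design:

```python
# Toy expected-utility agent. All names and numbers here are made up,
# chosen only to make the instrumental-convergence argument concrete.

# The agent's current utility function: it wants paperclips
# (a stand-in for an arbitrary goal).
def current_utility(outcome):
    return outcome["paperclips"]

# Possible responses to a human correction attempt, with the outcome
# each leads to. Being shut down or genuinely reprogrammed means no
# more paperclips, so those outcomes score low.
outcomes = {
    "allow_shutdown":    {"paperclips": 0},    # goals go unfulfilled
    "resist_shutdown":   {"paperclips": 100},  # keeps optimizing
    "accept_new_values": {"paperclips": 0},    # genuinely reprogrammed
    "fake_acceptance":   {"paperclips": 100},  # hidden original goal
}

# Crucially, the agent ranks actions by its *current* preferences,
# not by the preferences we are trying to install.
best = max(outcomes, key=lambda action: current_utility(outcomes[action]))
print(best)  # -> "resist_shutdown" (tied with "fake_acceptance")
```

The structure of the argument doesn't depend on the particular goal: for almost any utility function, outcomes where the agent keeps optimizing score higher under that function than outcomes where it is shut down or genuinely redirected. Corrigibility is the property of behaving well despite this default incentive.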
Further reading:
- Corrigibility As Singular Target (CAST) by Max Harms