Take AISafety.info’s 3 minute survey to help inform our strategy and priorities

Take the survey
Basic concepts

Prompting
Capabilities
Current systems
Algorithms
Alignment concepts
Intelligence and optimization
AI goals
Risks and outcomes

What is corrigibility?

A “corrigible” agent is one that doesn't interfere with what we would intuitively see as attempts to correct the agent, or to correct our mistakes in building it, despite the instrumentally convergent reasons to oppose these corrections.

  • If we try to pause or “hibernate” a corrigible AI, or shut it down entirely, it will let us do so. This is not something that an AI is automatically incentivized to let us do, since if it is shut down, it will be unable to fulfill its goals.

  • If we try to reprogram a corrigible AI, it will not resist this change and will allow this modification to go through. If this behavior is not specifically incentivized, an AI might try to prevent this, such as by attempting to fool us into believing its utility function

    was modified successfully, while actually keeping its original utility function as obscured functionality. By default, this deception could be a preferred outcome according to the AI's current preferences.

Further reading:

Keep Reading

Continue with the next entry in "Basic concepts"
What is deceptive alignment?
Next
Or jump to a related question


AISafety.info

We’re a global team of specialists and volunteers from various backgrounds who want to ensure that the effects of future AI are beneficial rather than catastrophic.