What is outer alignment?

Outer alignment, also known as the “reward misspecification problem”, is the problem of defining the right optimization objective to train an AI on, i.e., “Did we tell the AI the correct thing to do?” This is distinct from inner alignment, the problem of ensuring that an AI in fact ends up trying to accomplish the objective we specified (as opposed to some other objective).

Outer alignment is a hard problem. It has been argued that conveying the full “intention” behind a human request would require conveying all human values, which are themselves not well understood. Additionally, since most models are designed as goal optimizers, they are susceptible to Goodhart's Law: even if we give the AI a goal that looks good, excessive optimization for that goal might still cause unforeseen harms.
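As a toy illustration of this failure mode (the reward functions and numbers below are invented for this sketch, not taken from any particular system), consider an agent that maximizes a proxy reward which only loosely tracks what we actually want:

```python
import numpy as np

def true_objective(effort: float) -> float:
    # What we actually care about (e.g., how clean the room is); it saturates.
    return 1.0 - np.exp(-effort)

def proxy_reward(effort: float) -> float:
    # What we told the agent to maximize (e.g., "amount of scrubbing done");
    # it keeps growing even after further effort stops helping.
    return effort

def realized_value(effort: float) -> float:
    # True value when the agent picks an effort level: side effects
    # (scrubbing the paint off the walls) grow quadratically with effort.
    return true_objective(effort) - 0.05 * effort**2

efforts = np.linspace(0.0, 10.0, 101)
proxy_choice = efforts[np.argmax([proxy_reward(e) for e in efforts])]
best_choice = efforts[np.argmax([realized_value(e) for e in efforts])]

print(f"proxy-maximizing effort: {proxy_choice:.1f} -> true value {realized_value(proxy_choice):.2f}")
print(f"value-maximizing effort: {best_choice:.1f} -> true value {realized_value(best_choice):.2f}")
```

At low effort the proxy and the true objective move together, but a sufficiently strong optimizer pushes into the regime where they come apart. That divergence under heavy optimization is the pattern Goodhart's Law describes.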

Sub-problems of outer alignment include specification gaming, value learning, and reward shaping/modeling. Paul Christiano has proposed solutions such as HCH and Iterated Distillation and Amplification. Other proposed solutions aim to approximate human values using imitation and feedback learning techniques.
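As one very rough sketch of what a feedback-based approach looks like (the data, features, and loss here are simplified stand-ins, not a description of any deployed system), a reward model can be fit so that the options humans prefer receive higher reward:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: each trajectory is summarized by a feature vector, and a
# human labels which of each pair they prefer (1.0 means the first is preferred).
n_pairs, n_features = 200, 4
features_a = rng.normal(size=(n_pairs, n_features))
features_b = rng.normal(size=(n_pairs, n_features))
true_weights = np.array([1.0, -0.5, 0.0, 2.0])   # stand-in for what humans value
prefer_a = (features_a @ true_weights > features_b @ true_weights).astype(float)

# Learn reward weights by gradient descent on a Bradley-Terry style loss:
# P(human prefers a over b) = sigmoid(reward(a) - reward(b)).
w = np.zeros(n_features)
for _ in range(2000):
    margin = (features_a - features_b) @ w
    p = 1.0 / (1.0 + np.exp(-margin))
    grad = (features_a - features_b).T @ (p - prefer_a) / n_pairs
    w -= 0.5 * grad

print("learned reward direction:", np.round(w / np.linalg.norm(w), 2))
print("true value direction:    ", np.round(true_weights / np.linalg.norm(true_weights), 2))
```

The difficulty in practice is less the optimization itself than whether human judgments, collected at scale, actually capture the values we intended, which is exactly the outer alignment question.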
