What is outer alignment?
Outer alignment, also known as the “reward misspecification problem”, is the problem of defining the right optimization objective to train an AI on, i.e., “Did we tell the AI the correct thing to do?” This is distinct from inner alignment, which asks whether the AI system actually ends up pursuing the objective that was specified.
Outer alignment is a hard problem. It has been argued that conveying the full “intention” behind a human request would require conveying all human values, which are themselves not well understood. Additionally, since most models are designed as goal optimizers, they are susceptible to Goodhart's law:
“When a measure becomes a target, it ceases to be a good measure.”
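As a rough illustration of how Goodhart's law shows up in reward design, here is a minimal sketch (not from the original text; the objective functions and the notion of “effort” are made up for the example). The proxy reward tracks the true objective over a moderate range, but an optimizer that targets the proxy directly pushes far past the point where the two agree.

```python
# Toy illustration of Goodhart's law in reward design (hypothetical example).
# We care about genuine usefulness, but we can only measure a proxy metric,
# which keeps growing even after real usefulness starts to decline.

def true_objective(effort: float) -> float:
    """What we actually care about: peaks at moderate effort, then declines."""
    return effort - 0.05 * effort ** 2

def proxy_reward(effort: float) -> float:
    """What we actually measure and optimize: grows without bound."""
    return effort

def optimize(reward_fn, candidates):
    """A simple optimizer that maximizes whatever reward it is given."""
    return max(candidates, key=reward_fn)

candidates = [i * 0.5 for i in range(100)]  # possible "policies"

best_for_proxy = optimize(proxy_reward, candidates)
best_for_true = optimize(true_objective, candidates)

print(f"Chosen by proxy optimization: effort={best_for_proxy}, "
      f"true value={true_objective(best_for_proxy):.2f}")
print(f"What we actually wanted:      effort={best_for_true}, "
      f"true value={true_objective(best_for_true):.2f}")
```

Under weak optimization the proxy would have been a reasonable measure; it is precisely the strong optimization pressure that makes it cease to be one.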
Sub-problems of outer alignment include specification gaming, value learning, and reward shaping, a technique used in RL that introduces small intermediate rewards to supplement the environmental reward. This seeks to mitigate the problem of sparse reward signals and to encourage exploration and faster learning.
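To make the reward-shaping idea concrete, here is a minimal sketch using a hypothetical 1-D corridor environment (the environment, the potential function, and the constants are assumptions for illustration, not from the original text). The environmental reward is sparse (+1 only at the goal), and a potential-based shaping term adds small intermediate rewards for moving closer to it.

```python
# Minimal sketch of potential-based reward shaping (hypothetical environment).
# The environmental reward is sparse: +1 only when the agent reaches the goal.
# The shaping term gives a denser learning signal for progress toward it.

GOAL = 10
GAMMA = 0.99

def environment_reward(state: int) -> float:
    """Sparse reward: nothing is learned until the goal is actually reached."""
    return 1.0 if state == GOAL else 0.0

def potential(state: int) -> float:
    """Heuristic potential: higher when the state is closer to the goal."""
    return -abs(GOAL - state)

def shaped_reward(state: int, next_state: int) -> float:
    """Environmental reward plus a potential-based shaping bonus."""
    shaping = GAMMA * potential(next_state) - potential(state)
    return environment_reward(next_state) + shaping

# Moving toward the goal now yields an immediate positive signal,
# instead of zero reward until the goal is finally reached.
print(shaped_reward(3, 4))   # small positive bonus for progress
print(shaped_reward(4, 3))   # small penalty for moving away
print(shaped_reward(9, 10))  # environmental reward plus shaping bonus
```

Potential-based shaping of this form is a standard way to add intermediate rewards without changing which policies are optimal, though a poorly chosen shaping signal can itself become a misspecified objective.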