What is inner alignment?
1 sentence answer
Inner alignment is the problem of making sure that the goal an AI ends up pursuing is the same as the goal we optimized it for.
longer answer
Machine learning
An approach to AI in which, instead of designing an algorithm directly, we have the system search through possible algorithms based on how well they do on some training data.
A general procedure for finding solutions that score highly according to some well-defined objective function.
Something that can improve some process or physical artifact so that it is fit for a certain purpose or fulfills some set of requirements.
In contrast to a mesa-optimizer, a base optimizer is the “outer” optimizer usually explicitly implemented by humans.
In contrast to a mesa-objective, the base objective is the “outer” objective usually explicitly implemented by humans.
An algorithm that is created by optimization and that is also itself an optimizer.
The algorithms that a base optimizer finds to solve the problem it has been given.
The objective pursued by a mesa-optimizer*.*
example
As an analogy: natural selection can be seen as an optimization algorithm that 'designed' humans to achieve the goal of high genetic fitness, or, roughly, "have lots of descendants". However, humans no longer primarily pursue reproductive success; they instead use birth control while still attaining the pleasure that natural selection ‘meant’ as a reward for attempts at reproduction. This is a failure of inner alignment.
conclusion
The inner alignment problem can be split into sub-problems like Deceptive alignment
A case where the AI acts as if it were aligned while in training, but when deployed it turns out not to be aligned.