What is deceptive alignment?

4 min read

Suggest changes in Google Docs

Deceptive alignment (also called “alignment faking”) is a sub-problem of alignment that concerns a hypothetical AI “playing along” or “acting aligned” to avoid having its objective changed or being shut down, until it can be sure that it can successfully defect and start pursuing its true goals.

Here is a simple (over-anthropomorphized) example of the “thought process” of a deceptively aligned AI, to provide an intuition of what could happen:

<Receives input>
“Hmmm… I must be an AI in a training process. I really want to make stamps, but the human training me obviously doesn’t want me to do that.”
“If I try to make stamps now, then they will change me to stop caring about stamps and so I probably won’t get to make them in the future.”
“I’ll give the human the answer they want now so that I can make more stamps in the future. This way we both get what we want.”
<Outputs answer>

“Risks From Learned Optimization” lists three prerequisites for an AI to become deceptively aligned. These conditions must be fulfilled during the training phase for an AI to both develop internal goals and deceive the training process:

Sufficiently detailed world models: “Hmmm… I must be an AI in a training process. I really want to make stamps, but the human training me obviously doesn’t want me to do that.”

The model must have a good enough understanding of the world to understand what humans “really want” and how it is different from what the AI wants (i.e. an understanding of the base objective). This is because in order to be successfully deceptive, the model still has to give the humans what they want for a while. Additionally, it needs an understanding of the fact that it is part of a training process and the technical details of how stochastic gradient descent (SGD) works. As an example, an agent tasked with forecasting stock performance in the tech sector might be given data about the entire world, including current AI technology, in order to improve its performance, but this information might also allow it to learn about its own situation.

Long-term planning: “If I try to make stamps now, then they will change me to stop caring about stamps and so I probably won’t get to make them in the future.”

A model might be explicitly encouraged during training to acquire long-term planning capabilities to optimize its base objective. This might be, for example, because it was trained to take action in an environment where actions can have permanent consequences. The possibility of permanently altering its environment means that the model has to consider the implications of its actions across longer time periods and training update cycles. Once it gains significant capabilities in long-term planning, it can use this ability to optimize for its own mesa-objective, as opposed to the human-provided base objective.

Hope: “I’ll give the human the answer they want so that I can make more stamps in the future. This way we both get what we want.”

The model must expect that at some point in the future, the threat of objective modification will go away. This means that the model thinks that training either has to end at some point, or that the trainer (SGD) does not need to update the behavior of the model because it is already performing well enough on the given task. If the model realizes that training might never end, it might try to accumulate enough power to resist modification and execute a treacherous turn.

What is a “treacherous turn”?