What is AI alignment?
Informally, AI alignment means making an AI’s goals line up with some target set of values, such as those of its creators.[1]
A simple model
Imagine a hypothetical AI system with two separate parts:
- A set of “goals”, “preferences”, or “values”[2], meaning which outcomes it will act to achieve over other outcomes.
- A set of “beliefs” or a “world model”, meaning what it considers to be true about the world and what it predicts will happen.
When this AI makes a decision, it considers each possible action it could take, uses its beliefs about the world to predict the result of that choice, and then uses its preferences to judge how good that result is. It then picks the choice it expects to lead to the best result.
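The decision procedure described above can be sketched in a few lines of Python. This is only an illustrative toy, not a description of any real system: the function name `choose_action`, the umbrella scenario, and all probabilities and scores are made up for the example, and we assume the world model returns a probability for each possible outcome so that "best result" means highest expected score.

```python
from typing import Callable, Dict, Iterable


def choose_action(
    actions: Iterable[str],
    world_model: Callable[[str], Dict[str, float]],
    values: Callable[[str], float],
) -> str:
    """Return the action whose predicted outcomes score best under `values`."""

    def expected_value(action: str) -> float:
        # Beliefs: which outcomes does the system predict, and how likely are they?
        predicted = world_model(action)
        # Preferences: how good is each predicted outcome?
        return sum(prob * values(outcome) for outcome, prob in predicted.items())

    return max(actions, key=expected_value)


# Hypothetical beliefs: action -> {outcome: probability}.
BELIEFS = {
    "take_umbrella": {"dry_but_encumbered": 1.0},
    "leave_umbrella": {"dry_and_unencumbered": 0.7, "soaked": 0.3},
}

# Hypothetical preferences: outcome -> score (higher is better).
PREFERENCES = {
    "dry_and_unencumbered": 1.0,
    "dry_but_encumbered": 0.8,
    "soaked": -5.0,
}

choice = choose_action(BELIEFS.keys(), BELIEFS.__getitem__, PREFERENCES.__getitem__)
print(choice)  # take_umbrella (expected score 0.8 versus -0.8)
```

Note that the two parts are independent: swapping in a different `PREFERENCES` table changes what the system does without changing what it believes.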
The concept of “alignment” is relatively straightforward here: the system is “aligned” with you to the extent that its values are the same as yours, and “misaligned” with you to the extent that they are different.
Historically, discussions of the danger of AI misalignment have often used scenarios involving AI systems with this structure. In one such scenario, you have a very powerful AI, and you want to use it to cure cancer. The most naive strategy for achieving this might be to give it the goal of “minimize the number of cancer cases”, which the AI might conclude would be most effectively achieved by killing all humans. More sophisticated alignment strategies[3]
This simple model is a type of goal-directed AI. If powerful AIs necessarily behave in goal-directed ways, then it becomes easy to see how they could be very dangerous: if a powerful AI learns the wrong goal, it will pursue that goal effectively, and for most goals, human flourishing is not part of the best way to achieve them. But will powerful AI be goal-directed in this way?
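As a rough illustration of the cancer scenario, here is a toy goal-directed chooser given two different goals. Everything in it is hypothetical: the three actions, the predicted numbers, and especially `intended_goal`, which is just one crude stand-in for what the designers might have wanted, not a proposal for how to specify values.

```python
# Made-up predictions for each action in a toy world of 1,000 people.
PREDICTED = {
    "fund_research": {"living_humans": 1000, "cancer_cases": 10},
    "do_nothing": {"living_humans": 1000, "cancer_cases": 20},
    "kill_all_humans": {"living_humans": 0, "cancer_cases": 0},
}


def naive_goal(outcome: dict) -> float:
    # "Minimize the number of cancer cases", and nothing else.
    return -outcome["cancer_cases"]


def intended_goal(outcome: dict) -> float:
    # Hypothetical stand-in for the designers' intent: living people without cancer.
    return outcome["living_humans"] - outcome["cancer_cases"]


def best_action(goal) -> str:
    # Pick the action whose predicted outcome scores highest under `goal`.
    return max(PREDICTED, key=lambda action: goal(PREDICTED[action]))


print(best_action(naive_goal))     # kill_all_humans
print(best_action(intended_goal))  # fund_research
```

The chooser and its predictions are identical in both calls. Only the goal differs, which is the sense in which the danger comes from the goal the system is optimizing and not from faulty beliefs.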
Current systems
Current frontier AI systems don’t seem to have the "values" and "world model" structure described above. So it’s unclear whether the idea of “aligning” such systems is meaningful.
For example, an LLM
LLM: An AI model that takes in some text and predicts how the text is most likely to continue.
ChatGPT: A chatbot interface for the GPT series of large language models by OpenAI.
Neural network: A simulated network of nodes (“neurons”) and their connections (weights). Neural networks are the core component of deep learning, the leading AI paradigm.
Fine-tuning: The process of adapting a pre-trained ML model for more specific tasks or to display more specific behaviors.
Still, the concept of “alignment” has some meaning for current systems. We can say that ChatGPT “is capable of” spewing abuse, because it has learned how to predict abusive internet text. Yet its cognition has been chiseled by RLHF in such a way that it (usually) “chooses” not to do so. In that sense, it is (mostly) “aligned” with OpenAI’s intended values.
Future systems
It's not clear how similar future, smarter-than-human AI systems will be either to current AI systems or to the simple model described above. Some argue that future systems will systematically optimize their environments, as in the simple model. Typically, these arguments are based on coherence theorems, which say that if an agent's choices satisfy certain rationality conditions, then those choices can be represented as maximizing an expected utility function.
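One well-known result of this kind is the von Neumann-Morgenstern utility theorem, where the rationality conditions are completeness, transitivity, continuity, and independence of the agent's preferences over lotteries (probability distributions over outcomes). As a rough sketch, in notation introduced here for illustration, with $\mathcal{O}$ the set of outcomes and $A \succsim B$ meaning "lottery $A$ is weakly preferred to lottery $B$":

```latex
% If the preference relation \succsim satisfies the four conditions above,
% then there is a utility function u that represents it:
\exists\, u : \mathcal{O} \to \mathbb{R}
\quad \text{such that} \quad
A \succsim B
\iff
\mathbb{E}_{o \sim A}\big[u(o)\big] \;\ge\; \mathbb{E}_{o \sim B}\big[u(o)\big]
```

The theorem says the agent's choices look as if it were maximizing the expected value of $u$. It does not by itself say what $u$ is, or that the agent explicitly computes it.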
Rules that govern the type of agent that will be selected for in a wide variety of environments.
[1] There are a lot of subtleties to the concept, and different people use it in different ways. In this article, we’ll just try to convey a basic idea.
[2] When we're talking about humans, words like “preferences” and “values” sometimes have connotations that we don’t mean to invoke here. For example, we’re not saying an AI's preferences are a kind of emotional state, or that it has “values” in an ethical or moral sense. “Preferences” and “values” here are defined purely in terms of which outcomes the system will tend to choose over others.
[3] Some sophisticated strategies are the end result of a long sequence of patches, each fixing one problem after another with a poor initial strategy. Such strategies are likely to suffer from the problem of “The Nearest Unblocked Strategy”.