What is AI alignment?

Intuitively, AI alignment means making an AI system’s goals line up with yours.

The simple case

To start with a context in which alignment is conceptually relatively simple, imagine a hypothetical AI system with two separate parts: its “values” (roughly, what it wants), and its “world model” (roughly, the facts that it believes). When this AI makes a decision, it considers each possible choice, uses its world model to predict what would happen if it made that choice, and then judges how good that outcome is according to its values. It then picks the choice it expects to lead to the best outcome. A system like this is “aligned” with you to the extent that its values are the same as yours, and “misaligned” with you to the extent that they are different.
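To make this structure concrete, here is a minimal Python sketch of an agent built this way. The actions, predicted outcomes, and scores are all invented for illustration; real AI systems are nowhere near this tidy.

    # A toy "values + world model" agent. Everything here is hypothetical.

    def world_model(action):
        """The agent's beliefs: predict what outcome each action leads to."""
        predicted_outcomes = {
            "do nothing": "status quo",
            "fund cancer research": "some cures discovered",
        }
        return predicted_outcomes[action]

    def values(outcome):
        """The agent's values: score how good an outcome is (higher is better)."""
        scores = {"status quo": 0, "some cures discovered": 10}
        return scores[outcome]

    def choose(actions):
        """Pick the action whose predicted outcome the values rate highest."""
        return max(actions, key=lambda action: values(world_model(action)))

    print(choose(["do nothing", "fund cancer research"]))  # -> "fund cancer research"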

Historically, discussions of the danger of AI misalignment have often used scenarios involving AI systems with this structure. In one such scenario, you have a very powerful AI, and you want to use it to cure cancer. The most naive strategy for achieving this might be to give it a value of “choose whatever possible future has the lowest number of cancer cases” — which the AI might conclude would be most effectively achieved by killing all humans. More sophisticated alignment strategies (which have their own problems) could involve coding in more complex specifications of human values, or having the AI learn values over time.
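Reusing the sketch above, here is roughly what that naive specification looks like, with made-up futures and numbers purely for illustration:

    # A naively specified value function: score futures only by cancer cases.
    def naive_values(future):
        return -future["cancer_cases"]  # fewer cases = higher score

    # Two hypothetical futures the AI's world model might predict.
    possible_futures = {
        "research cures": {"cancer_cases": 1_000_000, "humans_alive": 8_000_000_000},
        "eliminate all humans": {"cancer_cases": 0, "humans_alive": 0},
    }

    best = max(possible_futures, key=lambda name: naive_values(possible_futures[name]))
    print(best)  # -> "eliminate all humans"

The failure here isn’t a prediction error: the objective simply leaves out almost everything else we care about.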

Current systems

Current frontier AI systems don’t have the “values” and “world model” structure described above, and the meaning of “aligning” such systems is less straightforward. For example, ChatGPT was created by training a huge neural network to predict human text as accurately as possible, and then fine-tuning it to favor text completions that were rated highly by human evaluators. We don’t know how the resulting system works, but we have no particular reason to think that it’s trying to predict the consequences of its decisions and check them against some set of values.
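As a very loose illustration of those two training stages (and nothing like how GPT models are actually implemented), here is a toy “language model” that first learns word-pair statistics from a tiny made-up corpus, and is then nudged toward completions a hypothetical rater prefers:

    from collections import Counter

    # Stage 1: "pretraining" -- learn to predict the next word by counting
    # which word follows which in a (tiny, made-up) corpus.
    corpus = "the cat ate the cat ate the cat sat".split()
    next_word_counts = {}
    for prev, nxt in zip(corpus, corpus[1:]):
        next_word_counts.setdefault(prev, Counter())[nxt] += 1

    def predict_next(word):
        """Return the most likely next word under the model."""
        return next_word_counts[word].most_common(1)[0][0]

    print(predict_next("cat"))  # -> "ate" (the most common continuation in the corpus)

    # Stage 2: "fine-tuning" -- shift the model toward completions that a
    # (hypothetical) human evaluator rated highly.
    ratings = {("cat", "ate"): -2, ("cat", "sat"): +2}
    for (prev, nxt), rating in ratings.items():
        next_word_counts[prev][nxt] += rating

    print(predict_next("cat"))  # -> "sat" (the rated-up completion now wins)

The real process is vastly more sophisticated, but the overall shape is the same: first learn to predict text, then push the model toward outputs that human evaluators prefer.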

Still, the concept of “alignment” isn’t totally meaningless here: we can say that ChatGPT “is capable of” spewing abuse, because it has learned how to predict abusive internet text, but that it (usually) “chooses” not to do so as a result of reinforcement learning, and is (usually) “aligned” with OpenAI’s intended values in that sense.

Future systems

It’s not clear how similar future superhuman AI systems will be to current ones, or to the simple model above. Some argue that future systems will more systematically optimize their environments, but there is a lot of disagreement and conceptual subtlety around questions like this.