What is the orthogonality thesis?

The orthogonality thesis is the claim that any¹ level of intelligence is compatible with any set of terminal goals. In other words, values and intelligence are “orthogonal” to each other in the sense that agents can vary in one dimension while staying constant in the other dimension.

In particular, it implies we can’t assume that an AI system that is as smart as or smarter than humans will automatically be motivated by human values.
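The picture of two independent dimensions can be made concrete with a toy sketch. The planner below is not from the original text: the number-line world, the `ACTIONS` tuple, the `best_outcome` function, and the two goal functions are all hypothetical names chosen for illustration, and search depth is only a crude stand-in for "intelligence." The point it illustrates is narrow: the goal is passed in as a parameter, so the same search procedure can be made more or less capable without the goal changing, and can be given a different goal without the capability changing.

```python
from itertools import product
from typing import Callable

# Hypothetical toy world: the agent moves along the integer number line.
ACTIONS = (-2, -1, 1, 2)

# A goal maps a final position to a score (higher is better).
Goal = Callable[[int], float]


def best_outcome(goal: Goal, depth: int, start: int = 0) -> int:
    """Return the final position reached by the best plan found.

    `depth` is a crude proxy for capability: the planner exhaustively
    searches every action sequence of exactly that length.
    """
    best_plan = max(
        product(ACTIONS, repeat=depth),
        key=lambda plan: goal(start + sum(plan)),
    )
    return start + sum(best_plan)


# Two unrelated goals, chosen arbitrarily for illustration.
def toward_six(position: int) -> float:
    return -abs(position - 6)


def toward_minus_six(position: int) -> float:
    return -abs(position + 6)


# Vary goal and capability independently.
for goal in (toward_six, toward_minus_six):
    for depth in (1, 3):
        print(goal.__name__, depth, best_outcome(goal, depth))
```

With a depth of 1 the planner only gets as far as 2 or -2, and with a depth of 3 it reaches 6 or -6 exactly. Deeper search makes the planner better at whichever goal it was handed, but nothing in the search pulls it toward one goal rather than the other. This is an illustration of what the thesis claims, not an argument that the claim holds for real AI systems.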

On its own, the orthogonality thesis only states that unaligned superintelligence is possible, not that it is likely, or that AI alignment is difficult. It is invoked to counter the idea that future AI will converge toward human goals or morality, regardless of its design, as an automatic result of becoming smarter.

In addition to this “weak” version of the thesis, people have considered stronger versions. Eliezer Yudkowsky’s “strong form” of the orthogonality thesis says that creating AI systems with arbitrary goals is not only possible, but involves no special difficulty; in other words, that “preferences are no harder to embody than to calculate.”²

While the orthogonality thesis is broadly accepted by the alignment research community, it has drawn criticism from several distinct directions:

Some moral realists assert that a sufficiently intelligent entity would discover and adhere to objective moral truths that humans would endorse upon reflection.
Beren Millidge claims that the strong form of the orthogonality thesis is false within modern deep learning algorithms.
Nora Belrose contends that, depending on how the thesis is interpreted, it’s either trivial, false, or unintelligible.
Steve Petersen argues that two factors, the need for an agent’s goals to remain continuous over time and the seemingly intrinsic underspecification of goal representations, might lead AIs that have complex goals and can understand their human creators to regard those creators as previous versions of themselves, and so to aim at furthering the creators’ goals.

¹ What’s mostly meant here is that arbitrarily high levels of intelligence are compatible with any goal. Systems with low levels of intelligence might not be able to represent goals at all. For example, it’s hard to understand what it would even mean for a rock to “have a goal.”
² Even in its strong form, the orthogonality thesis makes no predictions about which systems people will actually try to build, and therefore makes no predictions about which systems will end up existing in reality. People have sometimes misunderstood the orthogonality thesis to mean something like “the system resulting from a real-world AI design process is equally likely to end up with any set of goals,” but this is not implied by the thesis.
