How difficult should we expect alignment to be?

From Stampy's Wiki

Canonical Answer

This question asks about the additional cost of building an aligned powerful system, compared to an unaligned version of the same system. That extra cost is usually assumed to be nonzero, in the same way that it is easier and cheaper to build an elevator without emergency brakes than one with them. The difference is referred to as the alignment tax, and most AI alignment research is geared toward reducing it.

One operational guess by Eliezer Yudkowsky about its magnitude is that "[an aligned project will take] at least 50% longer serial time to complete than [its unaligned version], or two years longer, whichever is less". This guess applies to agents capable enough that their behavior is qualitatively different from a safety-engineering perspective (for instance, an agent that is not corrigible by default).
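Read as a rough formula (our paraphrase of the quoted guess, not Yudkowsky's own notation): if the unaligned project takes T years of serial time, the aligned project takes about T + min(0.5 × T, 2) years. So a 2-year unaligned project would stretch to roughly 3 years, while a 10-year project would stretch to roughly 12 years.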

An essay by John Wentworth argues that there is a small chance of alignment happening "by default", in which case the alignment tax would be effectively zero.

Stamps: plex


Canonical Question Info
Asked by: Nico Hill2
Origin: Wiki
Date: 2022/03/03
