How difficult should we expect alignment to be?
Here we ask about the additional cost of building an aligned powerful system, compared to its unaligned version. We often assume this cost to be nonzero, in the same way that it is easier and cheaper to build an elevator without emergency brakes. This cost is referred to as the alignment tax, and most AI alignment research is geared toward reducing it.
One operational guess by Eliezer Yudkowsky about its magnitude is "[an aligned project will take] at least 50% longer serial time to complete than [its unaligned version], or two years longer, whichever is less". This guess applies to agents capable enough that their behavior is qualitatively different from a safety engineering perspective (for instance, an agent that is not corrigible by default).
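To make the arithmetic of that guess concrete, here is a minimal sketch. It assumes the quote is read as a lower bound on extra serial time equal to the smaller of 50% of the unaligned timeline and two years. The function name `min_alignment_tax_years` and the sample timelines are hypothetical, chosen only for illustration.

```python
def min_alignment_tax_years(unaligned_years: float) -> float:
    """Lower bound on extra serial time, in years, under the quoted guess.

    Assumed reading: the tax is at least min(50% of the unaligned
    timeline, 2 years).
    """
    if unaligned_years < 0:
        raise ValueError("timeline must be non-negative")
    return min(0.5 * unaligned_years, 2.0)


# Hypothetical unaligned timelines, in years.
for t in (1.0, 4.0, 10.0):
    tax = min_alignment_tax_years(t)
    print(f"unaligned: {t:.1f} y, aligned: at least {t + tax:.1f} y (tax >= {tax:.1f} y)")
```

Under this reading, the two-year cap becomes the binding term once the unaligned project would take four years or more. For shorter projects the 50% term is the smaller one.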
An essay by John Wentworth argues for a small chance of alignment happening "by default", with an alignment tax of effectively zero.
Asked by: Nico Hill2