difficulty of alignment
The basic concern as AI systems become increasingly powerful is that they won’t do what we want them to do – perhaps because they aren’t correctly designed, perhaps because they are deliberately subverted, or perhaps because they do what we tell them to do rather than what we really want them to do (like in the classic stories of genies and wishes.) Many AI systems are programmed to have goals and to attain them as effectively as possible – for example, a trading algorithm has the goal of maximizing profit. Unless carefully designed to act in ways consistent with human values, a highly sophisticated AI trading system might exploit means that even the most ruthless financier would disavow. These are systems that literally have a mind of their own, and maintaining alignment between human interests and their choices and actions will be crucial.
Ajeya Cotra has written an excellent article named Why AI alignment could be hard with modern deep learning on this question.
Here we ask about the additional cost of building an aligned powerful system, compare to its unaligned version. We often assume it to be nonzero, in the same way it's easier and cheaper to build an elevator without emergency brakes. This is referred as the alignment tax, and most AI alignment research is geared toward reducing it.
One operational guess by Eliezer Yudkowsky about its magnitude is "[an aligned project will take] at least 50% longer serial time to complete than [its unaligned version], or two years longer, whichever is less". This holds for agents with enough capability that their behavior is qualitatively different from a safety engineering perspective (for instance, an agent that is not corrigible by default).
An essay by John Wentworth argues for a small chance of alignment happening "by default", with an alignment tax of effectively zero.
AI alignment is a field that is focused on causing the goals of future superintelligent artificial systems to align with human values, meaning that they would behave in a way which was compatible with our survival and flourishing. This may be an extremely hard problem, especially with deep learning, and is likely to determine the outcome of the most important century. Alignment research is strongly interdisciplinary and can include computer science, mathematics, neuroscience, philosophy, and social sciences.
AGI safety is a related concept which strongly overlaps with AI alignment. AGI safety is concerned with making sure that building AGI systems doesn’t cause things to go badly wrong, and the main way in which things can go badly wrong is through misalignment. AGI safety includes policy work that prevents the building of dangerous AGI systems, or reduces misuse risks from AGI systems aligned to actors who don’t have humanity’s best interests in mind.
Alignment is also hard like space probes are hard. With recursively self-improving systems, you won’t be able to go back and edit the code later if there is a catastrophic failure because it will competently deceive and resist you.
"You may have only one shot. If something goes wrong, the system might be too 'high' for you to reach up and suddenly fix it. You can build error recovery mechanisms into it; space probes are supposed to accept software updates. If something goes wrong in a way that precludes getting future updates, though, you’re screwed. You have lost the space probe."
Additionally, alignment is hard like cryptographic security. Cryptographers attempt to safeguard against “intelligent adversaries” who search for flaws in a system which they can exploit to break it. “Your code is not an intelligent adversary if everything goes right. If something goes wrong, it might try to defeat your safeguards…” And at the stage where it’s trying to defeat your safeguards, your code may have achieved the capabilities of a vast and perfectly coordinated team of superhuman-level hackers! So if there is even the tiniest flaw in your design, you can be certain that it will be found and exploited. As with standard cybersecurity, "good under normal circumstances" is just not good enough – your system needs to be unbreakably robust.
"AI alignment: treat it like a cryptographic rocket probe. This is about how difficult you would expect it to be to build something smarter than you that was nice – given that basic agent theory says they’re not automatically nice – and not die. You would expect that intuitively to be hard." Eliezer Yudkowsky
Another immense challenge is the fact that we currently have no idea how to reliably instill AIs with human-friendly goals. Even if a consensus could be reached on a system of human values and morality, it’s entirely unclear how this could be fully and faithfully captured in code.
For a more in-depth view of this argument, see Yudkowsky's talk "AI Alignment: Why It’s Hard, and Where to Start" below (full transcript here). For alternative views, see Paul Christiano's “AI alignment landscape” talk, Daniel Kokotajlo and Wei Dai’s “The Main Sources of AI Risk?” list, and Rohin Shah’s much more optimistic position.
Unanswered canonical questions
Is it hard like 'building a secure OS that works on the first try'? Hard like 'the engineering/logistics/implementation portion of the Manhattan Project'? Both? Some other option? Etc.