difficulty of alignment

From Stampy's Wiki

Canonically answered

What is the general nature of the concern about AI alignment?


The basic concern as AI systems become increasingly powerful is that they won’t do what we want them to do – perhaps because they aren’t correctly designed, perhaps because they are deliberately subverted, or perhaps because they do what we tell them to do rather than what we really want them to do (as in the classic stories of genies and wishes). Many AI systems are programmed to have goals and to attain them as effectively as possible – for example, a trading algorithm has the goal of maximizing profit. Unless carefully designed to act in ways consistent with human values, a highly sophisticated AI trading system might exploit means that even the most ruthless financier would disavow. These are systems that literally have a mind of their own, and maintaining alignment between human interests and their choices and actions will be crucial.

Would AI alignment be hard with deep learning?


Ajeya Cotra has written an excellent article on this question: "Why AI alignment could be hard with modern deep learning".

How difficult should we expect alignment to be?


Here we ask about the additional cost of building an aligned powerful system, compared to its unaligned version. This cost is generally assumed to be nonzero, in the same way that it's easier and cheaper to build an elevator without emergency brakes than with them. This is referred to as the alignment tax, and much AI alignment research is geared toward reducing it.

One operational guess by Eliezer Yudkowsky about its magnitude is "[an aligned project will take] at least 50% longer serial time to complete than [its unaligned version], or two years longer, whichever is less". This holds for agents with enough capability that their behavior is qualitatively different from a safety engineering perspective (for instance, an agent that is not corrigible by default).
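The "whichever is less" guess above can be expressed as a simple formula. The sketch below is illustrative only (the function name and the choice of years as a unit are my own); it computes the estimated completion time of an aligned project from the unaligned baseline:

```python
def aligned_project_years(unaligned_years: float) -> float:
    """Estimated duration of an aligned project, per Yudkowsky's guess:
    the alignment tax is 50% extra serial time or two years,
    whichever is less."""
    tax = min(0.5 * unaligned_years, 2.0)
    return unaligned_years + tax

# For a 2-year unaligned project, the 50% penalty is smaller: 2 + 1 = 3 years.
# For a 10-year unaligned project, the 2-year cap binds: 10 + 2 = 12 years.
```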

An essay by John Wentworth argues for a small chance of alignment happening "by default", with an alignment tax of effectively zero.

What is artificial general intelligence safety / AI alignment?


AI alignment is a field that is focused on causing the goals of future superintelligent artificial systems to align with human values, meaning that they would behave in a way which was compatible with our survival and flourishing. This may be an extremely hard problem, especially with deep learning, and is likely to determine the outcome of the most important century. Alignment research is strongly interdisciplinary and can include computer science, mathematics, neuroscience, philosophy, and social sciences.

AGI safety is a related concept which strongly overlaps with AI alignment. AGI safety is concerned with making sure that building AGI systems doesn’t cause things to go badly wrong, and the main way in which things can go badly wrong is through misalignment. AGI safety includes policy work that prevents the building of dangerous AGI systems, or reduces misuse risks from AGI systems aligned to actors who don’t have humanity’s best interests in mind.

Why is AI alignment a hard problem?

The problem of AI alignment can be compared in difficulty to a combination of rocket science (extreme stresses on components of the system, very narrow safety margins), launching space probes (once something goes wrong, it may be too late to go back in and fix your code), and developing totally secure cryptography (your code may become a superintelligent adversary and seek to find and exploit even the tiniest flaws in your system).

One sense in which alignment is a hard problem is analogous to the reason rocket science is a hard problem. Relative to other engineering endeavors, rocket science had so many disasters because of the extreme stresses placed on various mechanical components and the narrow margins of safety required by stringent weight limits. A superintelligence would put vastly more “stress” on the software and hardware stack it is running on, which could cause many classes of failure which don’t occur when you’re working with subhuman systems.

Alignment is also hard like space probes are hard. With recursively self-improving systems, you won’t be able to go back and edit the code later if there is a catastrophic failure because it will competently deceive and resist you.

"You may have only one shot. If something goes wrong, the system might be too 'high' for you to reach up and suddenly fix it. You can build error recovery mechanisms into it; space probes are supposed to accept software updates. If something goes wrong in a way that precludes getting future updates, though, you’re screwed. You have lost the space probe."

Additionally, alignment is hard like cryptographic security. Cryptographers attempt to safeguard against “intelligent adversaries” who search for flaws in a system which they can exploit to break it. “Your code is not an intelligent adversary if everything goes right. If something goes wrong, it might try to defeat your safeguards…” And at the stage where it’s trying to defeat your safeguards, your code may have achieved the capabilities of a vast and perfectly coordinated team of superhuman-level hackers! So if there is even the tiniest flaw in your design, you can be certain that it will be found and exploited. As with standard cybersecurity, "good under normal circumstances" is just not good enough – your system needs to be unbreakably robust.

"AI alignment: treat it like a cryptographic rocket probe. This is about how difficult you would expect it to be to build something smarter than you that was nice – given that basic agent theory says they’re not automatically nice – and not die. You would expect that intuitively to be hard." Eliezer Yudkowsky

Another immense challenge is the fact that we currently have no idea how to reliably instill AIs with human-friendly goals. Even if a consensus could be reached on a system of human values and morality, it’s entirely unclear how this could be fully and faithfully captured in code.

For a more in-depth view of this argument, see Yudkowsky's talk "AI Alignment: Why It’s Hard, and Where to Start" below (full transcript here). For alternative views, see Paul Christiano's “AI alignment landscape” talk, Daniel Kokotajlo and Wei Dai’s “The Main Sources of AI Risk?” list, and Rohin Shah’s much more optimistic position.

What are the main sources of AI existential risk?


A comprehensive list of major contributing factors to AI being a threat to humanity's future is maintained by Daniel Kokotajlo on the Alignment Forum.

Unanswered canonical questions

Is it hard like 'building a secure OS that works on the first try'? Hard like 'the engineering/logistics/implementation portion of the Manhattan Project'? Both? Some other option? Etc.