Why is AI alignment a hard problem?

From Stampy's Wiki

Canonical Answer

The problem of AI alignment can be compared in difficulty to a combination of rocket science (extreme stresses on components of the system, very narrow safety margins), launching space probes (once something goes wrong, it may be too late to go back in and fix your code) and developing totally secure cryptography (your code may become a superintelligent adversary that seeks out and exploits even the tiniest flaws in your system). "AI alignment: treat it like a cryptographic rocket probe" - Eliezer Yudkowsky

One sense in which alignment is a hard problem is analogous to the reason rocket science is a hard problem. Relative to other engineering endeavors, rocket science has had so many disasters because of the extreme stresses placed on various mechanical components and the narrow margins of safety required by stringent weight limits. A superintelligence would put vastly more “stress” on the software and hardware stack it is running on, which could cause many classes of failure that don’t occur when you’re working with subhuman systems.

Alignment is also hard like space probes are hard. With recursively self-improving systems, you won’t be able to go back and edit the code after a catastrophic failure, because the system will competently deceive and resist you.

"You may have only one shot. If something goes wrong, the system might be too 'high' for you to reach up and suddenly fix it. You can build error recovery mechanisms into it; space probes are supposed to accept software updates. If something goes wrong in a way that precludes getting future updates, though, you’re screwed. You have lost the space probe."

Additionally, alignment is hard like cryptographic security. Cryptographers attempt to safeguard against “intelligent adversaries” who search a system for flaws they can exploit to break it. “Your code is not an intelligent adversary if everything goes right. If something goes wrong, it might try to defeat your safeguards…” And at the stage where it’s trying to defeat your safeguards, your code may have achieved the capabilities of a vast and perfectly coordinated team of superhuman-level hackers! So if there is even the tiniest flaw in your design, you should expect it to be found and exploited. As with standard cybersecurity, "good under normal circumstances" is just not good enough: your system needs to be unbreakably robust.

"AI alignment: treat it like a cryptographic rocket probe. This is about how difficult you would expect it to be to build something smarter than you that was nice – given that basic agent theory says they’re not automatically nice – and not die. You would expect that intuitively to be hard." Eliezer Yudkowsky

Another immense challenge is that we currently have no idea how to reliably instill human-friendly goals in AIs. Even if a consensus could be reached on a system of human values and morality, it’s entirely unclear how this could be fully and faithfully captured in code.

For a more in-depth view of this argument, see Yudkowsky's talk "AI Alignment: Why It’s Hard, and Where to Start" (a full transcript is also available). For alternative views, see Paul Christiano's “AI alignment landscape” talk, Daniel Kokotajlo and Wei Dai’s “The Main Sources of AI Risk?” list, and Rohin Shah’s much more optimistic position.



Canonical Question Info
Asked by: plex
Origin: Wiki
Date: 2022/07/07
