Aren't there easy solutions to AI alignment?
If you've been learning about AI alignment, you may have thought of a solution that seems like it would obviously work.
Unfortunately, many AI alignment proposals turn out to have hidden difficulties. It’s surprisingly easy to come up with “solutions” that don’t actually solve the problem. Some intuitive alignment proposals (that are generally agreed to be flawed, incomplete, or hard to implement) include:
- Why can’t we just tell the AI to be friendly?
- Why can’t we just turn the AI off if it starts to misbehave?
- Why can’t we just tell an AI to figure out what we want and then do that?
- Why can’t we just tell the AI not to lie?
- Why can’t we just not build the AI a body?
- Why can’t we just live in symbiosis with AGI?
- Why can’t we just use a more powerful AI to control a potentially dangerous AI?
- Why can’t we just treat AI like any other dangerous technological tool?
- Why can’t we just solve alignment through trial and error?
Alignment proposals commonly fail because the proposed solution:
- Requires human observers to be smarter than the AI. Many safety solutions only work while an AI is relatively weak, but break when the AI reaches a certain level of capability (for many reasons, e.g., deceptive alignment, where the AI acts as if it were aligned during training but turns out not to be aligned once deployed).
- Appears to make sense in natural language, but when properly unpacked is not philosophically clear enough to be usable.
- Only solves a subcomponent of the problem, while leaving the core problem unresolved.
- Solves the problem only as long as the AI is operating “in distribution” with respect to its original training data (distributional shift will break it).
- Might work eventually, but can’t be expected to work on the first try (and we’ll likely only get one try at aligning a superintelligence, an AI with cognitive abilities far greater than those of humans in a wide range of important domains).