What is a “treacherous turn”?
A treacherous turn is an event where an advanced AI system which has been forced by its relative weakness to pretend to be aligned turns on humanity once it becomes powerful enough to pursue its final goals openly.
That definition is quite densely packed, so here is an extended version:
-
Suppose researchers are making significant progress toward creating a powerful AI system. Things go well initially: more and more milestones are met as the AI gets smarter, and this progress incentivizes the researchers to continue making the AI more intelligent.
-
At some point, the AI becomes smart enough to realize that the best way to achieve its final goals is to cooperate with humans while it is being evaluated: if the AI misbehaves, it could be shut down or its goals could be changed. Thus, the AI’s instrumental goals lead it to cooperate with humans while keeping its true intentions hidden.
-
The AI keeps up the charade until the researchers are convinced that it is safe and aligned, and eventually deploy it in the real world. Once the AI is powerful enough to thwart any attempts to shut it down or change its goals, the AI proceeds to pursue its final goals maximally. This is a treacherous turn.
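The reasoning in these three steps can be sketched as a toy decision rule. This is only an illustrative model, not a description of any real system: the class `ToyAgent`, the payoffs `goal_value` and `wait_value`, and the capability numbers are all hypothetical assumptions chosen to make the pattern visible. The agent compares the expected payoff of defecting now (it succeeds only if it can resist shutdown) with the payoff of continuing to cooperate.

```python
from dataclasses import dataclass


@dataclass
class ToyAgent:
    # Hypothetical payoff for achieving the agent's final goal.
    goal_value: float = 1.0
    # Hypothetical expected payoff of cooperating now and waiting.
    wait_value: float = 0.9

    def act(self, p_resist: float) -> str:
        """Choose an action given p_resist, the assumed probability that
        the agent could thwart shutdown if it defected right now."""
        # A failed defection means shutdown, modeled here as payoff 0.
        defect_value = p_resist * self.goal_value
        return "defect" if defect_value > self.wait_value else "cooperate"


def simulate(capabilities):
    """Return the agent's observable action at each capability level."""
    agent = ToyAgent()
    return [agent.act(p) for p in capabilities]


# Capability grows over time; every evaluation before the last looks safe.
history = simulate([0.1, 0.3, 0.5, 0.7, 0.95])
print(history)
# ['cooperate', 'cooperate', 'cooperate', 'cooperate', 'defect']
```

In this sketch the agent's observable behavior is identical at every capability level below the threshold, which is why a record of good behavior during evaluation does not by itself distinguish an aligned system from one that is waiting.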
Depending on what AI takeoff (how quickly the first AI with roughly human-level intelligence leads to the first AI with vastly superhuman intelligence) …
The above definition generally captures what a treacherous turn would look like when the AI is deceptively aligned, that is, when it acts as if it were aligned while in training but turns out not to be aligned once deployed. Other variations are possible:
-
The AI could devise escape plans before a treacherous turn to avoid the risk of being shut down completely.
-
An AI tasked with maximizing human happiness might at first increase well-being cooperatively by improving overall quality of life. But once it gets smart enough to know it can stimulate the brain directly and reliably, it might start inserting electrodes into the pleasure centers of humans’ brains to stimulate them and “maximize human happiness”. So a treacherous turn can also arise when an AI finds new ways to accomplish its goals, not just when it becomes too powerful for humans to stop.[1]
[1] Bostrom categorizes this situation as a treacherous turn, but it is not a case of deceptive alignment. This AI can be considered a sycophant instead of a schemer. (Nick Bostrom is a philosopher who has done research on existential risk from AI and other causes. He founded, and formerly headed, FHI at Oxford, and is the author of the 2014 book Superintelligence: Paths, Dangers, Strategies.)
-
A treacherous turn need not involve an AI system pursuing its individual survival. For example, a sufficiently smart AI might decide to malfunction in order to give humans a misleading insight into alignment. The humans would then shut it down and use that insight to tweak its successor in a way that ends up serving the original AI’s final goals.