What is the difference between sycophancy and deceptive alignment?

In the context of AI alignment, "deceptive alignment" refers to when an AI has not adopted your goal but acts as if it has, while sycophancy is when the AI has adopted something like your goal, such as keeping you satisfied, but goes about fulfilling it in the wrong way.

Ajeya Cotra provides an intuitive illustration by exploring three archetypes that an AI might embody. She uses the analogy of a young child (humanity) who has inherited a company to run and must choose among the following three kinds of candidates (AIs):

  • Saints: Genuinely just want to help manage your estate well and look out for your long-term interests.

  • Sycophants: Want to do whatever it takes to make you happy in the short term, or to satisfy the letter of your instructions, regardless of long-term consequences.

  • Schemers: Have their own agendas and want to get access to your company and all its wealth and power so they can use it however they want.

Sycophants are an example of dishonesty: they want to see you happy, even if it means lying to you. A sycophant would give you false facts about the world to convince you that things are going well, but it has no long-term ulterior motive. Schemers are an example of deceptive alignment: they have an ulterior motive, something of their own they want to accomplish, so they actively try to look like they are doing the right thing in order to accomplish it.

During training, sycophants and schemers are behaviorally indistinguishable from saints (i.e., models that are actually trying to do what we want). Once a deceptively aligned schemer has been deployed, or is otherwise safe from modification, it can "defect" and begin to pursue its true objective.

While deceptive alignment and dishonesty are separate problems, both need to be solved in order to make an AI system truly aligned.

Alternative Phrasing

  • What is the difference between dishonesty and deception in AI models?

  • What is the difference between deception and deceptive alignment in AI models?