What alignment techniques are used on LLMs?
As of July 2024, organizations training LLMs (an LLM is an AI model that takes in some text and predicts how the text is most likely to continue) apply alignment techniques such as reinforcement learning from human feedback (RLHF) to constrain their models' outputs, for example to keep them from producing harmful content.
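For a concrete, if highly simplified, picture of what preference-based fine-tuning involves, here is a minimal sketch of the pairwise loss used to train a reward model, one component of RLHF. The names here (ToyRewardModel, chosen, rejected) are illustrative assumptions rather than any organization's actual code, and real systems score token sequences with a large model rather than fixed-size embeddings.

```python
# Minimal sketch of the pairwise preference loss used to train a reward model,
# one component of RLHF. All names and shapes are illustrative, not any lab's
# actual implementation.
import torch
import torch.nn as nn

class ToyRewardModel(nn.Module):
    """Maps a (toy) fixed-size text embedding to a scalar reward."""
    def __init__(self, embedding_dim: int = 16):
        super().__init__()
        self.score = nn.Linear(embedding_dim, 1)

    def forward(self, embedded_text: torch.Tensor) -> torch.Tensor:
        return self.score(embedded_text).squeeze(-1)

reward_model = ToyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stand-ins for embeddings of human-preferred and rejected completions.
chosen = torch.randn(8, 16)    # batch of 8 "preferred" completions
rejected = torch.randn(8, 16)  # batch of 8 "rejected" completions

# Bradley-Terry style loss: push the reward of each preferred completion
# above the reward of the corresponding rejected one.
loss = -torch.nn.functional.logsigmoid(
    reward_model(chosen) - reward_model(rejected)
).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In a full RLHF pipeline, a reward model trained this way is then used as the objective for fine-tuning the language model itself with reinforcement learning; that later stage is omitted from this sketch.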
These alignment efforts are not always successful: LLMs sometimes violate these intended constraints on their own, and users have found many techniques to coax them into doing so through jailbreaks.
It’s worth noting that this limited type of alignment[1] does not necessarily prevent deeper failures such as deceptive alignment, a case where the AI acts as if it were aligned while in training, but when deployed it turns out not to be aligned.
[1] Indeed, some people argue that these kinds of output-limiting techniques should not be referred to as “alignment”.