What is Anthropic's alignment research agenda?

Anthropic is a major AI lab that aims to “ensure transformative AI helps people and society flourish” by “building frontier systems, studying their behaviors, working to responsibly deploy them, and regularly sharing [its] safety insights.” In March 2023, Anthropic published a summary of its views on safety research, which states that Anthropic is currently focused on “scaling supervision[1], mechanistic interpretability, process-oriented learning[2], and understanding and evaluating how AI systems learn and generalize”.

Anthropic has worked on a number of approaches to alignment:

  1. That is, developing methods to supervise models that may equal or surpass human cognitive capabilities.

  2. Anthropic uses this term to refer to an approach to training models that is based not on whether they get the right results but on whether they follow the right processes, like a math teacher who gives good grades to answers that spell out a logical sequence of steps, even if the wrong number comes out.