What is Anthropic's alignment research agenda?

Anthropic is a major AI lab that aims to “ensure transformative AI helps people and society flourish” by “building frontier systems, studying their behaviors, working to responsibly deploy them, and regularly sharing [its] safety insights.” In March 2023, Anthropic published a summary of its views on safety research, which states that Anthropic is currently focused on “scaling supervision[1], mechanistic interpretability, process-oriented learning[2], and understanding and evaluating how AI systems learn and generalize”.

Anthropic has worked on a number of approaches to alignment:


1. That is, developing methods to supervise models that may equal or surpass human cognitive capabilities.

2. Anthropic uses this term to refer to an approach to training models that is not based on whether they get the right results, but on whether they follow the right processes — like a math teacher who gives good grades to answers that spell out a logical sequence of steps, even if the wrong number comes out.
