What is Anthropic's alignment research agenda?
Anthropic is a major AI lab that aims to “ensure transformative AI helps people and society flourish”: that is, AI capable of transforming society as drastically as the industrial revolution, or even more so. Its safety agenda covers areas including scalable oversight, process-oriented learning, and mechanistic interpretability, a subfield of interpretability which involves reverse-engineering the mechanisms by which a model gets from its inputs to its outputs.
Anthropic has worked on a number of approaches to alignment:
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (2022) — Applies "preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants." A toy version of the preference-modeling step is sketched in the first code block after this list.
- A Mathematical Framework for Transformer Circuits (2021) — Applies the idea of “circuits” to the transformer architecture used in recent large language models. A second paper builds on this framework with some more significant results, notably the idea of “induction heads”: attention heads that allow for in-context learning. A rough heuristic for spotting induction heads is sketched in the second code block below.
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (2023) — Uses a sparse autoencoder to interpret sets of neurons in a toy transformer model; a minimal such autoencoder is sketched in the third code block below.
- Language Models (Mostly) Know What They Know (Kadavath et al., 2022) — Tasks language models with predicting which questions they will answer correctly, and whether their own claims are valid. Preliminary results are encouraging: in general, after giving an answer, models give well-calibrated estimates of how likely that answer is to be correct. Calibration is worse when the question is posed as “Do you know the answer to x?”, but improves when the model is given extra source material to work with. A sketch of this kind of self-evaluation appears in the last code block below.
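To make the RLHF paper's approach concrete, here is a minimal sketch of the preference-modeling step in PyTorch. The toy mean-pooling encoder, the random "comparison" data, and the hyperparameters are illustrative stand-ins rather than Anthropic's actual setup; only the pairwise loss, which pushes the score of the preferred response above that of the rejected one, reflects the core idea.

```python
# Minimal sketch of preference-model training, the first stage of RLHF.
# The mean-pooling "encoder" and random comparison data are toy
# stand-ins for a real language model and real human preference data.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, vocab_size: int = 1000, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, 1)  # maps to a scalar preference score

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Mean-pool token embeddings as a stand-in for a transformer.
        pooled = self.embed(tokens).mean(dim=1)
        return self.head(pooled).squeeze(-1)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    # Fake pairwise comparisons: each row is a (chosen, rejected) pair.
    chosen = torch.randint(0, 1000, (8, 16))
    rejected = torch.randint(0, 1000, (8, 16))
    # Pairwise loss: push the chosen response's score above the rejected one's.
    loss = -torch.log(torch.sigmoid(model(chosen) - model(rejected))).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In a full RLHF pipeline, the trained preference model then serves as the reward signal for reinforcement learning on the language model itself.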
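The induction-heads idea also suggests a rough empirical test: on a sequence that repeats with period T, an induction head at position t should attend back to position t - T + 1, the token that followed the previous occurrence of the current token. The sketch below applies this heuristic to GPT-2 through the Hugging Face transformers library; the choice of model and the scoring threshold are assumptions made for illustration, not the papers' methodology.

```python
# Rough test for induction heads: feed GPT-2 a repeated random sequence
# and measure how much each attention head attends from position t back
# to position t - T + 1 (the token after the previous occurrence of the
# current token). Model choice and threshold are illustrative only.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)
model.eval()

T = 50  # period of the repeated sequence
first_half = torch.randint(0, model.config.vocab_size, (1, T))
tokens = torch.cat([first_half, first_half], dim=1)  # [A B C ... A B C ...]

with torch.no_grad():
    attentions = model(tokens).attentions  # per layer: (1, heads, 2T, 2T)

for layer, attn in enumerate(attentions):
    # Attention from query t to key t - T + 1 lies on the -(T - 1)
    # off-diagonal; drop the first entry, since position T - 1 has no
    # earlier occurrence to copy from.
    diag = attn[0].diagonal(offset=-(T - 1), dim1=-2, dim2=-1)[:, 1:]
    for head, score in enumerate(diag.mean(dim=-1).tolist()):
        if score > 0.3:  # arbitrary cutoff, purely for illustration
            print(f"layer {layer} head {head}: induction score {score:.2f}")
```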
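Here is a minimal version of the dictionary-learning setup, assuming random vectors in place of real model activations: an overcomplete autoencoder is trained to reconstruct activations while an L1 penalty keeps each input's feature activations sparse.

```python
# Minimal sparse autoencoder in the style of dictionary learning:
# reconstruct activation vectors through an overcomplete feature basis,
# with an L1 penalty that keeps feature activations sparse. The random
# "activations" stand in for a real model's internal activations.
import torch
import torch.nn as nn

d_model, d_dict = 128, 1024  # overcomplete: more features than dimensions

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(features), features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # trades reconstruction quality against sparsity

for step in range(200):
    acts = torch.randn(256, d_model)  # stand-in for MLP activations
    recon, features = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training on real activations, each decoder column is a candidate feature direction, and the paper inspects which text most strongly activates each one.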
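Finally, a sketch of the self-evaluation idea from Kadavath et al.: show the model a question together with a proposed answer, and read off the probability it assigns to the answer being true. The prompt wording and the use of GPT-2 here are assumptions for illustration; the paper evaluates much larger models with a more careful few-shot format.

```python
# Sketch of "P(True)"-style self-evaluation: show the model a question
# and a proposed answer, then read off the probability it assigns to
# the answer being true. Prompt wording and model choice are
# illustrative assumptions, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = (
    "Question: What is the capital of France?\n"
    "Proposed Answer: Paris\n"
    "Is the proposed answer true or false? The proposed answer is"
)
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]  # next-token distribution

probs = torch.softmax(logits, dim=-1)
p_true = probs[tok(" true").input_ids[0]].item()
p_false = probs[tok(" false").input_ids[0]].item()
# Normalize over the two options to get the model's P(True).
print("P(True) =", p_true / (p_true + p_false))
```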
“Scalable oversight” refers to developing methods to supervise models that may equal or surpass human cognitive capabilities. ↩︎
“Process-oriented learning” is Anthropic's term for an approach to training models that is based not on whether they get the right results, but on whether they follow the right processes: like a math teacher who gives good grades to answers that spell out a logical sequence of steps, even if the wrong number comes out. ↩︎
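As a toy illustration of that contrast, assuming a worked answer is represented as a list of steps with a validity checker (both made up for this example): an outcome-based grader scores only the final result, while a process-based grader awards credit for each valid step even when the final number is wrong.

```python
# Made-up illustration of outcome- vs process-based grading; the
# step-list representation and validity checker are hypothetical,
# not part of any actual training setup.
from typing import Callable

def outcome_reward(final_answer: float, correct_answer: float) -> float:
    # Outcome-based: only the final result matters.
    return 1.0 if final_answer == correct_answer else 0.0

def process_reward(steps: list[str], step_is_valid: Callable[[str], bool]) -> float:
    # Process-based: credit each logically valid step, even if the
    # final number comes out wrong.
    if not steps:
        return 0.0
    return sum(1.0 for s in steps if step_is_valid(s)) / len(steps)
```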