What is Anthropic's approach to LLM alignment?


Canonical Answer

Anthropic fine-tuned a language model to be more helpful, honest, and harmless (HHH).

Motivation: The point of this is to:

  1. see if we can "align" a current-day LLM, and
  2. raise awareness about safety in the broader ML community.
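
The page doesn't say how the fine-tuning itself works. As a hedged illustration of one common ingredient in this style of training (not necessarily Anthropic's exact pipeline), the sketch below computes a pairwise preference loss: a stand-in reward model scores a preferred and a rejected reply to the same prompt, and the loss pushes the preferred reply's score higher. The scores here are made up.

```python
import torch
import torch.nn.functional as F

# Stand-in scores a preference model might assign to two replies to the same
# prompt: "chosen" is the reply raters judged more helpful, honest and
# harmless; "rejected" is the other. Values are made up for this example.
reward_chosen = torch.tensor([1.3], requires_grad=True)
reward_rejected = torch.tensor([0.2], requires_grad=True)

# Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
# It is small when the preferred reply already scores higher and large
# otherwise, so gradient descent nudges the model toward HHH preferences.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()
print(float(loss))  # ~0.29 for these made-up scores
```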

How can we interpret what all the neurons mean?

Chris Olah, the interpretability legend, is working on looking really hard at all the neurons to see what they mean. The approach he pioneered is circuits: picking out computational subgraphs of the network ("circuits") and interpreting those. The idea is to "decompile the network into a better representation that is more interpretable". One example is in-context learning implemented via attention heads, where this kind of interpretability seems useful.
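
As a rough sketch of what "looking at a computational subgraph" can mean in practice (this is not Anthropic's actual tooling; the single attention layer below is a toy stand-in), one can pull out each attention head's pattern and check it for an interpretable role, such as attending to the previous token:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_heads, seq_len = 64, 4, 10

# A single attention block standing in for one layer of a real model.
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
x = torch.randn(1, seq_len, d_model)

# Ask for un-averaged weights so each head's pattern is visible separately;
# each head is a candidate node in the computational subgraph ("circuit").
_, head_patterns = attn(x, x, x, need_weights=True, average_attn_weights=False)
print(head_patterns.shape)  # torch.Size([1, 4, 10, 10]) -> (batch, head, query, key)

# E.g. check how strongly head 0 attends to the previous token, the kind of
# pattern circuit-style analysis looks for when interpreting attention heads.
prev_token_score = head_patterns[0, 0].diagonal(offset=-1).mean()
print(float(prev_token_score))
```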

One result I heard about recently: a softmax linear unit (SoLU) stretches activation space and encourages neuron monosemanticity (making a neuron represent only one thing, as opposed to firing on many unrelated concepts). This makes the network easier to interpret.
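
Concretely, the softmax linear unit multiplies the activation vector by its own softmax, so a few large entries get amplified and the rest are suppressed. A minimal sketch (ignoring details such as any normalization applied afterwards):

```python
import torch

def solu(x: torch.Tensor) -> torch.Tensor:
    # SoLU(x) = x * softmax(x): the softmax acts as a soft "winner-take-most"
    # gate, which is what pushes each neuron toward representing one thing.
    return x * torch.softmax(x, dim=-1)

x = torch.tensor([[3.0, 1.0, 0.2, -1.0]])
print(solu(x))  # the largest entry dominates; the others are squashed toward 0
```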

Motivation: The point of this is to get as many bits of information as possible about what neural networks are doing, in the hope of finding better abstractions. The diagram that gets posted everywhere captures the hope that networks, in the current regime, will become more interpretable because they will start to use abstractions that are closer to human abstractions.

How do you figure out how model performance scales?

Stamps: None

Tags: language models, anthropic

Canonical Question Info
Asked by: RoseMcClelland
Origin: Wiki
Date: 2022/09/13

