
What is mechanistic interpretability?

Mechanistic interpretability aims to “reverse engineer neural networks into human understandable algorithms”. It is a subfield of interpretability that emphasizes trying to understand the “actual mechanisms and algorithms that compose the network”, in contrast to other forms of interpretability research that try to explain how a network's outputs relate to human concepts without explaining the internal functioning of the network.

Three core hypotheses of the field are:

Claim 1: Features

Features are the fundamental unit of neural networks. They correspond to directions [in the vector space of possible neuron activations]. These features can be rigorously studied and understood (see the illustrative sketch after these three claims).

Claim 2: Circuits

Features are connected by weights, forming circuits. These circuits can also be rigorously studied and understood.

Claim 3: Universality

Analogous features and circuits form across models and tasks.
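To make Claims 1 and 2 more concrete, here is a minimal NumPy sketch. It is an illustration only, not code from any particular paper: the feature names ("curve", "wheel", "window", "car"), the layer width, and the weights are all invented. It treats a feature as a direction in a layer's activation space, reads off its strength with a dot product, and treats the "circuit" from one feature to another as the effective weight between their two directions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # toy layer with 8 "neurons"

# Claim 1: features correspond to directions in activation space.
# Take four orthonormal directions as stand-ins for hypothetical features.
Q, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
curve, wheel, window, car = Q[:, 0], Q[:, 1], Q[:, 2], Q[:, 3]

# An activation vector that contains "a lot of curve" plus a little noise.
activation = 2.5 * curve + 0.1 * rng.normal(size=d_model)

# How strongly is the curve feature present? Project onto its direction.
print(f"curve strength: {activation @ curve:.2f}")  # roughly 2.5

# Claim 2: circuits are features connected by weights.
# A toy weight matrix that builds a "car" feature out of "wheel" and "window".
W = 1.0 * np.outer(car, wheel) + 0.8 * np.outer(car, window)

# The effective weight from one feature direction to another is read off the
# same way: how much writing along one direction gets read out along the other.
print(f"wheel  -> car: {car @ W @ wheel:.2f}")   # 1.00
print(f"window -> car: {car @ W @ window:.2f}")  # 0.80
print(f"curve  -> car: {car @ W @ curve:.2f}")   # 0.00
```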

Current mechanistic interpretability research focuses on identifying the circuits within a model that are responsible for particular behaviors, and on understanding phenomena such as grokking, superposition, and phase changes.
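Superposition, in particular, is easy to illustrate with a toy sketch (again invented for illustration, with made-up sizes): a layer with only 8 neurons can assign directions to many more candidate features than it has dimensions, at the cost of small overlaps ("interference") between those directions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32  # many more hypothetical features than neurons

# Give each feature a random unit direction in the 8-dimensional activation space.
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Only 8 directions can be exactly orthogonal in 8 dimensions, so these 32
# overlap slightly: that overlap is the "interference" cost of superposition.
overlaps = directions @ directions.T
off_diag = np.abs(overlaps[~np.eye(n_features, dtype=bool)])
print(f"mean interference: {off_diag.mean():.2f}, max: {off_diag.max():.2f}")

# Reading one feature out of an activation that also contains others picks up
# small contributions from those others.
activation = directions[0] + directions[1] + directions[2]
print(f"read-off of feature 0: {activation @ directions[0]:.2f}")  # ~1, plus interference
```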
