
What is mechanistic interpretability?

Mechanistic interpretability aims to “reverse engineer neural networks into human understandable algorithms”. It is a subfield of interpretability that emphasizes trying to understand the “actual mechanisms and algorithms that compose the network”, in contrast to other forms of interpretability research that try to explain how a network's outputs relate to human concepts without explaining the internal functioning of the network.

Three core hypotheses of the field are:

Claim 1: Features

Features are the fundamental unit of neural networks. They correspond to directions [in the vector space of possible neuron activations]. These features can be rigorously studied and understood (see the illustrative sketch after these three claims).

Claim 2: Circuits

Features are connected by weights, forming circuits. These circuits can also be rigorously studied and understood.

Claim 3: Universality

Analogous features and circuits form across models and tasks.
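To make Claims 1 and 2 more concrete, here is a minimal NumPy sketch. It is an illustration only, not code from any particular paper: the feature names ("curve", "wheel", "window", "car"), the layer width, and the weights are all invented. It treats a feature as a direction in a layer's activation space, reads off its strength with a dot product, and treats the "circuit" from one feature to another as the effective weight between their two directions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # toy layer with 8 "neurons"

# Claim 1: features correspond to directions in activation space.
# Take four orthonormal directions as stand-ins for hypothetical features.
Q, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
curve, wheel, window, car = Q[:, 0], Q[:, 1], Q[:, 2], Q[:, 3]

# An activation vector that contains "a lot of curve" plus a little noise.
activation = 2.5 * curve + 0.1 * rng.normal(size=d_model)

# How strongly is the curve feature present? Project onto its direction.
print(f"curve strength: {activation @ curve:.2f}")  # roughly 2.5

# Claim 2: circuits are features connected by weights.
# A toy weight matrix that builds a "car" feature out of "wheel" and "window".
W = 1.0 * np.outer(car, wheel) + 0.8 * np.outer(car, window)

# The effective weight from one feature direction to another is read off the
# same way: how much writing along one direction gets read out along the other.
print(f"wheel  -> car: {car @ W @ wheel:.2f}")   # 1.00
print(f"window -> car: {car @ W @ window:.2f}")  # 0.80
print(f"curve  -> car: {car @ W @ curve:.2f}")   # 0.00
```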

Current mechanistic interpretability research focuses on identifying the circuits within a model that are responsible for particular behaviors, and on understanding phenomena such as grokking, superposition, and phase changes.
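Superposition, in particular, is easy to illustrate with a toy sketch (again invented for illustration, with made-up sizes): a layer with only 8 neurons can assign directions to many more candidate features than it has dimensions, at the cost of small overlaps ("interference") between those directions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32  # many more hypothetical features than neurons

# Give each feature a random unit direction in the 8-dimensional activation space.
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Only 8 directions can be exactly orthogonal in 8 dimensions, so these 32
# overlap slightly: that overlap is the "interference" cost of superposition.
overlaps = directions @ directions.T
off_diag = np.abs(overlaps[~np.eye(n_features, dtype=bool)])
print(f"mean interference: {off_diag.mean():.2f}, max: {off_diag.max():.2f}")

# Reading one feature out of an activation that also contains others picks up
# small contributions from those others.
activation = directions[0] + directions[1] + directions[2]
print(f"read-off of feature 0: {activation @ directions[0]:.2f}")  # ~1, plus interference
```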
