From Stampy's Wiki
Alignment Forum Tag

Canonically answered

Where can I learn about interpretability?

Christoph Molnar's online book is a great source, as is this overview article, which summarizes more than 70 interpretability papers.

What is interpretability and what approaches are there?

Interpretability is about making machine learning (ML) systems easier to understand. It is hard because the computations of current ML systems often depend on billions of parameters which they learnt from data. Areas of research for making current ML models more understandable are mechanistic interpretability, finding important input features, explaining by examples, natural language explanations, and using ML architectures which are intrinsically interpretable.

  1. Mechanistic interpretability is about interpreting an ML model’s internal representations. A very simple way to do this is activation maximization: optimize the input such that one particular neuron is strongly activated. The optimized input then indicates which concept the neuron represents. Central work in mechanistic interpretability is the circuits thread, which focuses on interpreting the algorithms implemented by subgraphs (circuits) of neural networks. There is also work on circuits in transformers in particular. Mechanistic interpretability has the drawback that interpretability is not composable, i.e. even if we understand all the components of a system, that does not mean we understand the whole. However, there may still be a way of hierarchically decomposing a system that allows us to understand each layer of abstraction, and thus the whole.

    Feature visualization of a neuron that corresponds to dog-like features.

  2. The idea of finding important input features is to find out which input features are most relevant for the output. In the case of image classification, we can highlight the relevant features with a heatmap, which is called a saliency map. A very simple way to do this is to take the derivative of the output with respect to the different parts of the input. This derivative denotes how much the output changes if we change a particular part of the input, i.e. how important that part of the input is for the output. Saliency maps can be useful for noticing cases in which an image classifier learns to use features it should not use. For example, the paper Unmasking Clever Hans predictors and assessing what machines really learn used saliency maps to show that a horse-classifying image classifier was not using the image parts that contained the horse at all, but rather relied on the name of the photographer printed in a corner, because one of the photographers primarily took photos of horses. However, many of the common saliency methods fail basic sanity checks: for example, the saliency maps barely change when the model weights are randomized. Therefore, saliency maps are not sufficient for a reliable understanding of ML systems.
  3. Explanation by examples means showing examples in the training data that have similar features, such as in the paper This Looks Like That: Deep Learning for Interpretable Image Recognition.
  4. Natural Language Explanations are sentences describing a model’s reasons for its outputs. For example, in the paper Joint Mind Modeling for Explanation Generation in Complex Human-Robot Collaborative Tasks a human and an AI play a virtual cooking game together, and the AI explains its plans in natural language. They find that with the explanations the human-AI team performs significantly better.
  5. Interpretable architectures are architectures which are simple enough to be understandable without additional tools. They avoid a weakness shared by natural language explanations, finding important input features, and explanation by examples, which are all post-hoc explanations: these are generated after the fact and are therefore likely to be unfaithful (i.e. to not accurately describe the model’s decision process). Cynthia Rudin is a central researcher arguing for using interpretable architectures in high-stakes situations. However, using interpretable architectures usually comes with a significant cost to model performance.
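
The activation maximization idea from approach 1 can be sketched in a few lines. This is a minimal illustration, not how it is done for real vision models: the single tanh neuron, its weights `w` and bias `b`, and the unit-norm constraint on the input are all assumptions made up for the example.

```python
import math

def neuron_activation(x, w, b):
    """Activation of a single tanh neuron for input x."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)

def activation_gradient(x, w, b):
    """Analytic gradient of the neuron's activation with respect to the input x."""
    h = neuron_activation(x, w, b)
    return [(1.0 - h * h) * wi for wi in w]

def project_to_unit_ball(x):
    """Keep the input bounded (assumed constraint), otherwise it could grow forever."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return x if norm <= 1.0 else [xi / norm for xi in x]

def maximize_activation(w, b, steps=200, lr=0.5):
    """Gradient ascent on the input (not the weights) to excite the neuron."""
    x = [0.0] * len(w)
    for _ in range(steps):
        grad = activation_gradient(x, w, b)
        x = project_to_unit_ball([xi + lr * gi for xi, gi in zip(x, grad)])
    return x

# Hypothetical neuron weights: the neuron "likes" feature 0 and "dislikes" feature 1.
w = [2.0, -1.0, 0.0, 0.5]
b = 0.1
x_opt = maximize_activation(w, b)
```

The optimized input `x_opt` points in the direction of `w`, so reading it off tells us which input features this toy neuron responds to. For deep networks the gradient would come from automatic differentiation rather than a hand-written formula.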

You can read more about different approaches in this overview article which summarizes more than 70 interpretability-related papers, and in the free online book A Guide for Making Black Box Models Explainable.
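
The derivative-based saliency map from approach 2 can be illustrated with a toy example. Everything here is hypothetical: `toy_classifier` is a hand-written stand-in for a "horse" classifier that secretly relies on one corner pixel, and the derivative is estimated with finite differences instead of backpropagation.

```python
import math

def toy_classifier(pixels):
    """Hypothetical 'horse' score that depends almost entirely on the last pixel,
    a stand-in for a watermark in the corner of the image."""
    weights = [0.05, 0.0, -0.05, 3.0]
    score = sum(w * p for w, p in zip(weights, pixels))
    return 1.0 / (1.0 + math.exp(-score))

def saliency_map(model, pixels, eps=1e-5):
    """Absolute derivative of the model output with respect to each input pixel,
    estimated by central finite differences."""
    saliency = []
    for i in range(len(pixels)):
        up = list(pixels)
        down = list(pixels)
        up[i] += eps
        down[i] -= eps
        derivative = (model(up) - model(down)) / (2.0 * eps)
        saliency.append(abs(derivative))
    return saliency

image = [0.8, 0.3, 0.5, 0.1]  # a made-up 4-pixel "image"
saliency = saliency_map(toy_classifier, image)
most_salient_pixel = max(range(len(image)), key=lambda i: saliency[i])
```

Here the saliency is concentrated on the last pixel, which is the kind of signal that can reveal a Clever Hans predictor. As noted above, such maps should still be sanity-checked before being trusted.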

What is neural network modularity?

If a neural network is modular, that means it consists of clusters (modules) of neurons, such that the neurons within the cluster are strongly connected to each other, but only weakly connected to the rest of the network.

Making networks more modular is useful if the modules represent understandable concepts, because this helps us understand the whole system better.

Relevant papers about modularity are Neural Networks are Surprisingly Modular, Clusterability in Neural Networks, and Detecting Modularity in Deep Neural Networks.
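
To make "strongly connected within, weakly connected outside" concrete, here is a rough sketch of one possible score. It is an illustrative assumption, not the measure used in the papers above: it simply computes which fraction of the total absolute connection weight lies inside modules, given a hypothetical assignment of neurons to modules.

```python
def within_module_weight_fraction(edges, module_of):
    """Fraction of total absolute weight on connections whose two neurons
    belong to the same module. Values near 1.0 suggest a modular network."""
    total = sum(abs(weight) for weight in edges.values())
    within = sum(
        abs(weight)
        for (a, b), weight in edges.items()
        if module_of[a] == module_of[b]
    )
    return within / total if total else 0.0

# Hypothetical network: connection weights between named neurons.
edges = {
    ("a1", "a2"): 0.9,
    ("a2", "a3"): -0.8,
    ("b1", "b2"): 1.1,
    ("a3", "b1"): 0.05,  # the only connection between the two modules
}
module_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}

fraction = within_module_weight_fraction(edges, module_of)
```

With this assignment almost all of the weight is inside the modules. Moving a neuron into the wrong module lowers the score, which is the intuition behind searching for a good clustering.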

How can we interpret what all the neurons mean?

Chris Olah, the interpretability legend, is working on looking really hard at all the neurons to see what they all mean. The approach he pioneered is circuits: looking at computational subgraphs of the network, called circuits, and interpreting those. The idea is "decompiling the network into a better representation that is more interpretable". Related work looks at in-context learning via attention heads, and interpretability here seems useful.

One result I heard about recently: a softmax linear unit (SoLU) stretches space and encourages neuron monosemanticity (making a neuron represent only one thing, as opposed to firing on many unrelated concepts). This makes the network easier to interpret.
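
As a rough sketch of what such a unit computes, assume the usual definition of the softmax linear unit, SoLU(x) = x * softmax(x), applied across the neurons of a layer. The extra layer normalization that accompanies it in practice is left out here, and the input values are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of pre-activations."""
    largest = max(xs)
    exps = [math.exp(x - largest) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def solu(xs):
    """Softmax linear unit: each pre-activation is scaled by its softmax weight."""
    return [x * p for x, p in zip(xs, softmax(xs))]

pre_activations = [4.0, 1.0, 0.5, -1.0]  # hypothetical layer pre-activations
activations = solu(pre_activations)
```

The largest pre-activation passes through almost unchanged while the smaller ones are pushed toward zero, so only a few neurons stay active for a given input.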

Motivation: The point of this is to get as many bits of information as possible about what neural networks are doing, to hopefully find better abstractions. A frequently posted diagram illustrates the hope that networks, in the current regime, will become more interpretable as they start to use abstractions that are closer to human abstractions.

Non-canonical answers

What about having a human supervisor who must approve all the AI's decisions before executing them?

The problem is that the actions can be harmful in a very non-obvious, indirect way. It's not at all obvious which actions should be stopped.

For example, when the system comes up with a very clever way to acquire resources, the safety of this action depends on what it intends to use these resources for.

Such supervision may buy us some safety if we find a way to make the system's intentions very transparent.

What does generative visualization look like in reinforcement learning?

Generative visualization for an image classifier means showing an input image which causes the classifier to strongly recognize a feature in that image.

In reinforcement learning (RL), generative visualization means showing a sequence of observations which makes an RL agent strongly want to take a specific action.

The problem with generative visualization in RL is that the space of possible observations is constrained by the transition function. So, if we optimize the observations to maximize the activation for an action, this will likely result in a sequence of observations which is impossible (incompatible with the transition function).

A way around this is to compute an embedding of possible observation sequences and to optimize in the embedding space instead of the observation space. A relevant paper here is Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents.
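
The following toy sketch shows why optimizing in an embedding space keeps the result valid. It is not the method of the paper above: the decoder, the environment (a position that moves with constant velocity), and the agent's preference function are all hypothetical, and the decoder is hand-written rather than learned.

```python
def decode(latent, length=5):
    """Hypothetical decoder: maps an embedding (start, velocity) to an observation
    sequence that obeys the transition function pos[t+1] = pos[t] + velocity."""
    start, velocity = latent
    return [start + t * velocity for t in range(length)]

def action_preference(observations):
    """Hypothetical agent: its preference for one action peaks when the final
    observed position is 10."""
    return -(observations[-1] - 10.0) ** 2

def is_valid(observations, tol=1e-9):
    """Check that the sequence is compatible with the transition function."""
    steps = [b - a for a, b in zip(observations, observations[1:])]
    return all(abs(step - steps[0]) <= tol for step in steps)

def optimize_latent(latent, steps=500, lr=0.01, eps=1e-5):
    """Gradient ascent on the embedding, using finite-difference gradients."""
    latent = list(latent)
    for _ in range(steps):
        grad = []
        for i in range(len(latent)):
            up = list(latent)
            down = list(latent)
            up[i] += eps
            down[i] -= eps
            diff = action_preference(decode(up)) - action_preference(decode(down))
            grad.append(diff / (2.0 * eps))
        latent = [z + lr * g for z, g in zip(latent, grad)]
    return latent

best_latent = optimize_latent([0.0, 0.0])
best_sequence = decode(best_latent)
```

Because every point in the embedding decodes to a sequence that respects the transition function, the optimized sequence is possible by construction. Optimizing the five observations directly would carry no such guarantee.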

Unanswered canonical questions