How might interpretability be helpful?

Current machine learning (ML) systems are mostly black boxes. Interpretability can help us understand them better, hopefully enough to detect and prevent problems such as goal misgeneralization, inner misalignment, and deceptive alignment.

Current ML models are built around massive, inscrutable matrices of numbers. Interpretability research aims to make sense of these matrices and render the algorithms they encode human-understandable. This offers the following advantages:

  • Preventing proxy goals: A model might learn to optimize a proxy goal that performs well on the intended objective because of a spurious correlation in the training distribution. With better interpretability, we may be able to observe which goal the model is actually pursuing.

  • Preventing deceptive alignment: During training, a model might instrumentally optimize for the human-set goal because doing so allows it to preserve a different goal of its own through the training phase. If the model is merely imitating an aligned goal until training stops, better interpretability could potentially detect this, allowing us to deploy such a model only if we’re confident that it is safe.

  • Force-multiplier for alignment research: Currently, improving and aligning ML systems is a trial-and-error process. With interpretability, we can analyze a model and see why it gives misaligned answers, which enables a more empirical approach to alignment work. Ideally, we could integrate interpretability directly into the loss function and training loop so that any resulting system stays aligned (a rough sketch of what this might look like follows this list).

  • Predicting capabilities: Interpretability also helps us understand the underlying principles of how ML systems work and how they change with scale, letting us formulate scientific laws about these systems.

  • Trust and cooperation: If people can understand each other's systems, it becomes easier for them to trust and therefore cooperate with one another.

  • Enabling regulation: Interpretability gives regulators and policymakers a metric they can use to set requirements for how aligned systems must be before they can be deployed.
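
To make the "integrate interpretability into the loss function" idea from the force-multiplier point above more concrete, here is a minimal, hypothetical sketch in PyTorch. The interpretability_penalty below is only a stand-in (an L1 sparsity penalty on hidden activations, on the assumption that sparser internals are easier to inspect); a real setup would replace it with a term derived from actual interpretability tooling.

```python
import torch
import torch.nn as nn

# Toy model: two linear layers with a ReLU in between.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
task_loss_fn = nn.CrossEntropyLoss()

def interpretability_penalty(hidden_acts: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in: encourage sparse hidden activations,
    # which tend to be easier to inspect than dense ones.
    return hidden_acts.abs().mean()

x = torch.randn(64, 10)          # toy batch of inputs
y = torch.randint(0, 2, (64,))   # toy labels

optimizer.zero_grad()
hidden = model[1](model[0](x))   # activations after the first layer + ReLU
logits = model[2](hidden)

# Combined objective: task performance plus a small interpretability term.
loss = task_loss_fn(logits, y) + 0.01 * interpretability_penalty(hidden)
loss.backward()
optimizer.step()
```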

An example that illustrates the utility of interpretability involves a maze-solving agent[1]. The agent is trained to navigate a maze, and during training it consistently heads toward the exit. We might then assume that the underlying learned algorithm is ‘trying to go to the exit’. In deployment, however, the agent goes toward any green-colored object rather than toward the exit. What the agent was actually 'trying to do' was to reach a green exit sign that happened to coincide with where we wanted it to go. With interpretability tools, we could have observed the agent's underlying motivations and prevented this type of goal misgeneralization. This simple example illustrates that 'appearing aligned' and being 'verifiably aligned' are two different things; interpretability strives to make alignment verifiable.

Interpretability can help us audit our models and detect problematic behavior. However, a better understanding of ML systems can also accelerate research and adoption, making these systems more capable and shortening the time until powerful AI systems are developed and deployed. Interpretability research should therefore focus on areas with clear alignment applications. That said, it remains a promising avenue of research that provides researchers and engineers with a valuable set of tools for creating aligned AI systems.


  1. This example is illustrated in more detail in the linked Rob Miles video ↩︎