Interpretability

Description

Transparency and interpretability refer to the ability of humans or other outside observers to understand the decision processes and inner workings of AI and machine learning systems.

Present-day machine learning systems are typically not very transparent or interpretable. You can use a model's output, but the model cannot tell you why it produced that output, which makes it hard to determine the cause of biases in ML models.
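As a concrete illustration that is not part of the original description, here is a minimal sketch of one post-hoc interpretability technique, permutation feature importance, applied to a black-box model. It assumes scikit-learn and its bundled breast-cancer dataset; the model and method are illustrative choices, not the only way to probe a model.

```python
# Illustrative sketch: probing a black-box model with permutation feature
# importance, a basic post-hoc interpretability method.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A random forest gives accurate predictions but no direct explanation of them.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance asks: how much does test performance drop when one
# feature is scrambled? Large drops suggest the model relies on that feature.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)

for name, score in sorted(zip(X.columns, result.importances_mean),
                          key=lambda t: -t[1])[:5]:
    print(f"{name}: {score:.3f}")
```

The printout ranks the features the model appears to rely on most, which is a small first step toward explaining why it produced a given output rather than treating it as an opaque oracle.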

Canonically answered

Where can I learn about interpretability?

Christoph Molnar's online book Interpretable Machine Learning and the Distill journal are great sources.

Non-canonical answers

What about having a human supervisor who must approve all the AI's decisions before executing them?

The problem is that actions can be harmful in very non-obvious, indirect ways, so it is not at all obvious which actions should be blocked.

For example, if the system comes up with a very clever way to acquire resources, whether that action is safe depends on what it intends to use those resources for.

Such supervision may buy us some safety if we find a way to make the system's intentions very transparent.
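
As an illustration that goes beyond the original answer, here is a hypothetical sketch of the approval-gate pattern the question describes: every proposed action is shown to a human and runs only if approved. The ProposedAction class and the example action are invented; the point is that the short description the supervisor sees can look harmless while its safety depends on intentions the gate never surfaces.

```python
# Hypothetical sketch of approval-based supervision (names invented for
# illustration): an action runs only after a human explicitly approves it.
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    description: str              # what the system tells the supervisor
    execute: Callable[[], None]   # what actually happens if approved


def supervised_execute(action: ProposedAction) -> bool:
    """Ask a human to approve the action; run it only on an explicit 'y'."""
    answer = input(f"Approve {action.description!r}? [y/N] ")
    if answer.strip().lower() == "y":
        action.execute()
        return True
    return False


# The weakness described above: "acquire compute credits" reads as harmless,
# but whether it is safe depends on what the system intends to do with them,
# and that intention never appears in the description the supervisor sees.
supervised_execute(ProposedAction(
    description="acquire compute credits from provider X",
    execute=lambda: print("...credits acquired"),
))
```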