Transparency and interpretability refer to the degree to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.
Present-day machine learning systems are typically not very transparent or interpretable. You can use a model's output, but the model can't tell you why it produced that output. This makes it hard to determine the cause of biases in ML models.
The problem is that an AI system's actions can be harmful in very non-obvious, indirect ways, so it is not at all obvious which actions should be stopped.
For example, when the system comes up with a very clever way to acquire resources, the safety of that action depends on what it intends to use those resources for.
Such supervision may buy us some safety if we find a way to make the system's intentions very transparent.