What is this new technique called concept whitening (CW)?
This article serves as an introduction to concept whitening, a mechanism that alters a chosen layer of a neural network so that we can better understand the computation leading up to that layer. Concept whitening gives us a much clearer picture of how the network gradually learns concepts across its layers, and it can be applied to any layer of a network without hurting predictive performance.
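As a rough sketch of what this looks like in practice, a CW layer is meant to be a drop-in replacement for a normalization layer in an existing network. The ConceptWhitening2d class below is a hypothetical placeholder standing in for a real CW implementation, used only to show where such a layer would sit:

```python
# Sketch: where a concept-whitening-style layer would sit in a CNN.
# ConceptWhitening2d is a named placeholder (here it simply behaves like
# BatchNorm2d so the example runs); it is NOT the actual CW computation.
import torch.nn as nn
import torchvision.models as models


class ConceptWhitening2d(nn.BatchNorm2d):
    """Placeholder for a CW layer with the same interface as BatchNorm2d."""


model = models.resnet18(weights=None)

# CW is designed as a drop-in replacement for a normalization layer, so a
# chosen batch-norm layer can be swapped out without changing the rest of
# the architecture.
model.layer4[1].bn2 = ConceptWhitening2d(num_features=512)
print(model.layer4[1].bn2)
```

In the original work, CW is used in place of a batch normalization layer, since whitening already standardizes the activations passing through it.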
Can concept whitening provide neural network interpretability?
The technique was proposed in a paper published in Nature Machine Intelligence, which showed that concept whitening can steer neural networks toward learning specific concepts without sacrificing performance. Instead of searching for answers in the trained parameters after the fact, concept whitening bakes interpretability into the deep learning model itself. The technique has shown promising results on convolutional neural networks, and it has significant implications for the direction of future research in artificial intelligence.
Features and latent space in deep learning models
In a multilayered convolutional neural network, the lower layers learn basic features such as corners and edges, while deeper layers combine them into more abstract representations. The idea is that the network's latent space should come to represent the concepts that are relevant to the classes of images it is meant to detect.
For example, if we train a model on images of sheep and the dataset contains large swaths of green pasture, the network might learn to detect green farmland instead of sheep.
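To make "latent space" concrete, here is a small sketch that captures an intermediate layer's activations with a forward hook. The model, the choice of layer3, and the random input batch are all illustrative assumptions; the point is that each image becomes a point in that layer's latent space, but without further constraints the individual axes need not correspond to "sheep", "grass", or any other human concept:

```python
# Sketch: inspecting the latent space of an intermediate CNN layer.
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()
activations = {}


def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook


model.layer3.register_forward_hook(save_activation("layer3"))

# A fake batch of images; in practice these would be e.g. sheep photos.
images = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    model(images)

# Each image is now a point in the layer's latent space. Without constraints,
# individual axes of this space have no guaranteed meaning (they may mix
# "sheep", "grass", texture, and so on).
latent = activations["layer3"].mean(dim=(2, 3))  # global average pool: (4, 256)
print(latent.shape)
```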
Post hoc explanations of neural networks
Many deep learning explanation techniques are post hoc, but what does this mean? We try to make sense of a trained neural network by examining its outputs and its parameter values. One popular technique for determining what a neural network "sees" in an image works by masking different parts of the input and observing how those changes affect the model's output. The result is a heat map that highlights the parts of the image that matter most to the network. These methods are helpful, but they still treat deep learning models as black boxes and do not paint a definite picture of how neural networks work.
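A minimal sketch of this occlusion-style approach might look like the following; the classifier, the target class, and the patch and stride sizes are all arbitrary illustration values:

```python
# Sketch: occlusion-based heat map for a generic image classifier.
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()
image = torch.randn(1, 3, 224, 224)   # stand-in for a real input image
target_class = 0                      # class whose evidence we want to localize
patch, stride = 32, 32                # arbitrary illustration values

with torch.no_grad():
    base_score = model(image)[0, target_class].item()
    heatmap = torch.zeros(224 // stride, 224 // stride)
    for i in range(heatmap.shape[0]):
        for j in range(heatmap.shape[1]):
            masked = image.clone()
            # Zero out one patch and measure how much the class score drops.
            masked[:, :, i * stride:i * stride + patch,
                   j * stride:j * stride + patch] = 0.0
            heatmap[i, j] = base_score - model(masked)[0, target_class].item()

# Large values mark the image regions the model relies on most for this class.
print(heatmap)
```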
Explanation methods do not necessarily explain the model's calculations; they are often closer to summary statistics of its behavior. Saliency maps, for example, can fail to reveal the wrong things a neural network has learned. Interpreting the role of individual neurons is also very difficult, because the features a neural network learns are scattered across its latent space. Deep neural networks are very powerful at image recognition, but what happens in their hidden layers remains opaque because of their complexity.
The main objective of concept whitening is to develop neural networks whose latent space is aligned with the concepts we care about, namely the concepts relevant to the task the network has been trained for. This makes the deep learning model interpretable: it becomes much easier to study the relationships between the features of an input image and the output of the network.
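The sketch below illustrates the two ingredients in very simplified form: a whitening step that decorrelates and normalizes the latent features, followed by an orthogonal rotation that lines up particular axes with concept directions. The real CW module learns its rotation by optimization during training; here the rotation is constructed directly from concept means purely for illustration, and all data and concept labels are synthetic:

```python
# Simplified sketch of the two steps behind concept whitening:
# (1) whiten the latent features, (2) rotate them so specific axes align with
# concept directions. This is an illustration, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

# Fake latent features: 500 samples in a 16-dimensional latent space.
Z = rng.normal(size=(500, 16)) @ rng.normal(size=(16, 16))

# Step 1: ZCA whitening -- zero-mean, identity-covariance features.
mean = Z.mean(axis=0)
cov = np.cov(Z - mean, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + 1e-5)) @ eigvecs.T
Z_white = (Z - mean) @ W

# Step 2: build an orthogonal rotation whose first columns point toward the
# mean whitened activation of each concept's examples (two fake concepts here).
concept_a = Z_white[:50].mean(axis=0)      # e.g. images labelled "grass"
concept_b = Z_white[50:100].mean(axis=0)   # e.g. images labelled "wool"
basis = rng.normal(size=(16, 16))
basis[:, 0], basis[:, 1] = concept_a, concept_b
Q, R = np.linalg.qr(basis)                 # orthogonalize, keeping axes 0 and 1
Q = Q * np.sign(np.diag(R))                # fix signs so axes point toward the concepts

Z_cw = Z_white @ Q

# After this transform, coordinate 0 measures alignment with concept A and
# coordinate 1 with concept B, so individual latent axes become readable.
print(Z_cw[:5, :2])
```

After such a transform, reading a single coordinate of the latent representation tells us how strongly the corresponding concept is activated, which is exactly the property that makes the latent space easier to relate to the network's output.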
Implications for deep learning research
Every year, deep learning models become more complex and broader in scope, which raises the question of how to deal with their lack of transparency. Some researchers argue that any attempt to impose interpretability constraints on neural networks by design will result in inferior models; by their reasoning, the brain evolved through billions of iterations without intelligent top-down design, so neural networks too should reach their peak performance through a purely evolutionary path.
Concept whitening challenges this view: it shows that we can impose top-down design constraints on neural networks without a significant performance penalty.