I was intrigued by Anthropic’s recent publication on the “Biology of a Large Language Model” and wanted to better understand the methods, the results and the potential engineering implications. This sent me down a rabbit hole into the emerging subfield of neural network interpretability called Mechanistic Interpretability. The exploration is still ongoing. In this blog post I discuss how a toy model consisting of only a single transformer layer with no feed-forward component can be understood as a linear combination of independent “paths” through the network. Further, we will see how self-attention can be viewed as a composition of two independent circuits: the query-key (QK) and output-value (OV) circuits. Finally, we will use this decomposition to understand some interesting phenomena that show up even in such a toy model of a transformer.
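To make that preview a little more concrete, here is a minimal numpy sketch of a single attention head written so that the two circuits are explicit: the product W_Q W_Kᵀ determines *where* the head attends, and W_V W_O determines *what* gets moved. The dimensions and weights below are random placeholders of my own, not taken from any real model, and causal masking is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy dimensions, chosen arbitrarily for illustration.
d_model, d_head, seq_len = 16, 4, 5
rng = np.random.default_rng(0)

# Weight matrices of a single attention head.
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

x = rng.normal(size=(seq_len, d_model))   # residual-stream activations

# QK circuit: W_Q @ W_K.T decides *where* the head attends.
qk_circuit = W_Q @ W_K.T                   # (d_model, d_model)
scores = x @ qk_circuit @ x.T / np.sqrt(d_head)
pattern = softmax(scores, axis=-1)         # attention pattern

# OV circuit: W_V @ W_O decides *what* is moved once attention is fixed.
ov_circuit = W_V @ W_O                     # (d_model, d_model)
head_output = pattern @ x @ ov_circuit     # (seq_len, d_model)

# Sanity check: the factored form matches the usual Q, K, V, O computation.
q, k, v = x @ W_Q, x @ W_K, x @ W_V
usual = softmax(q @ k.T / np.sqrt(d_head), axis=-1) @ v @ W_O
assert np.allclose(head_output, usual)
```

The point of the factoring is that the attention pattern never sees W_V or W_O, and the moved information never sees W_Q or W_K, which is why the two circuits can be analyzed independently later in the post.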
Interpretability has been a rich area of DL research for over a decade. The underlying problem comes from the fact that neural nets represent and manipulate data in high-dimensional continuous vector spaces - and humans don’t speak vectorese!! We prefer symbols.
It gets worse. It has been shown that individual neurons typically represent a combination of unrelated features - so-called polysemanticity. So not only are you giving me vectors, you are also saying they can be mixed up in complicated ways?
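As a toy illustration of what polysemanticity means mechanically (the numbers and feature names below are invented, not measured from any model), here is a single “neuron” that fires for two unrelated features:

```python
import numpy as np

# Two unrelated "features", represented as (non-orthogonal) directions
# in a low-dimensional activation space. Purely hypothetical numbers.
feature_cat   = np.array([1.0, 0.2, 0.0])
feature_comma = np.array([0.3, 1.0, 0.1])

# A single neuron is just a weight vector followed by a ReLU.
neuron_w = np.array([0.8, 0.7, 0.05])
relu = lambda z: np.maximum(z, 0.0)

for name, feat in [("cat", feature_cat), ("comma", feature_comma)]:
    activation = relu(neuron_w @ feat)
    print(f"neuron fires on '{name}' feature: {activation:.2f}")

# The same neuron activates strongly for both unrelated features: it is
# polysemantic, so its activation alone does not tell us which feature
# is actually present in the input.
```

Because the activation looks the same for both features, reading a single meaning off this neuron is impossible - which is exactly what makes interpretation hard.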
All interpretability research tries to bridge this gap by explaining the decision-making process of a neural net in human-understandable terms. Below is a very coarse-grained categorization of different approaches to model interpretability. We will only focus on mechanistic interpretability here. You can get a quick overview of other methods here.
<aside> 💡
“Mechanistic interpretability” relies on the assumption that a trained neural network learns certain implicit algorithms. These algorithms are encoded in the weights and activations of the network, but it is possible to reverse engineer the network to recover human-understandable versions of these algorithms, at least partially. Furthermore, it is also possible to intervene in specific ways to change the implemented algorithm.
</aside>
In particular, researchers developed a method of understanding the representation space of CNNs in terms of “features” and “circuits”. “Features” here are semantically meaningful representations of the input or parts of the input; in the context of images, this could mean detecting edges or boundaries. “Circuits” are sub-graphs of the model whose weights combine underlying features into more complex structures in the representation space. For instance, simple line detectors can be combined, via the model’s weights in a sub-graph of the model, into circuits that detect more complex shapes. We will not go into further depth on the mechanistic interpretation of CNN-based architectures here, but this is an excellent reference for that.
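To make the features-and-circuits idea concrete, here is a small hand-rolled sketch. The kernels and the 8×8 image are my own toy construction, not taken from the reference above: two edge-detector “features” are combined by a downstream unit into a detector for a more complex shape, a corner.

```python
import numpy as np
from scipy.signal import convolve2d

relu = lambda z: np.maximum(z, 0.0)

# Layer-1 "features": two hand-made oriented edge detectors.
horizontal_edge = np.array([[ 1.,  1.,  1.],
                            [ 0.,  0.,  0.],
                            [-1., -1., -1.]])
vertical_edge = horizontal_edge.T

# A tiny image containing a bright square; its top and left edges meet
# at the top-left corner (row 2, col 2).
img = np.zeros((8, 8))
img[2:7, 2:7] = 1.0

# Feature maps: each detector responds along "its" edge of the square.
h_map = relu(convolve2d(img, horizontal_edge, mode="same"))
v_map = relu(convolve2d(img, vertical_edge, mode="same"))

# A "circuit": a downstream unit linearly combines the two feature maps
# (weights 1.0 each, bias -3.5) and applies a ReLU. Only locations where
# *both* edges are present survive the threshold, i.e. the corner.
corner_map = relu(1.0 * h_map + 1.0 * v_map - 3.5)
print("corner detected at:",
      np.unravel_index(corner_map.argmax(), corner_map.shape))
```

In a real CNN both the edge kernels and the combining weights are learned rather than hand-written; circuit analysis tries to read such learned sub-graphs back out of the trained weights.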
The success of using features and circuits to probe CNNs encouraged researchers to put forward the following speculative claims about neural network interpretability: