AGISF - Interpretability
Week 6 of the AI alignment curriculum. Interpretability is the study of ways to, well, interpret AI models, currently mainly neural networks (NNs).
Mechanistic interpretability
This aims to understand networks at the level of individual neurons and the weights connecting them.
Zoom In: an introduction to circuits (Olah et al., 2020)
Claims
- Features are the fundamental unit of neural networks. They correspond to directions in activation space (see the sketch after this list). These features can be rigorously studied and understood.
- Features are connected by weights, forming circuits. These circuits can also be rigorously studied and understood.
- Analogous features and circuits form across models and tasks.
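A rough illustration of the first claim (my own, not from the paper): if a layer's activations form a vector space, a feature is a direction in that space, and an individual neuron is just the special case of a standard basis direction. All arrays below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up activations: a batch of 4 inputs at a layer with 512 neurons.
acts = rng.standard_normal((4, 512))

# A feature as a (unit-norm) direction in activation space.
direction = rng.standard_normal(512)
direction /= np.linalg.norm(direction)

# How strongly each input expresses the feature: projection onto the direction.
feature_strength = acts @ direction  # shape (4,)

# An individual neuron is the special case of a standard basis direction.
neuron_direction = np.zeros(512)
neuron_direction[42] = 1.0
assert np.allclose(acts @ neuron_direction, acts[:, 42])
```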
Features
- Detect a single thing
- Earlier layers find generic things, like gradients or curves
- Later layers find specific things, like wheels or faces
- It is controversial whether a single neuron detects a single thing; the Circuits work probes this with feature visualization (see the sketch after this list)
- Polysemantic neurons respond to multiple unrelated features
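The Circuits work studies what a neuron detects largely via feature visualization (activation maximization). A minimal sketch, assuming PyTorch and torchvision are installed; the choice of GoogLeNet (torchvision's InceptionV1), the inception4a layer, and channel 97 are arbitrary, and real feature visualization adds heavy regularization to get interpretable images.

```python
import torch
import torchvision

# Pretrained GoogLeNet (torchvision's InceptionV1) as the network to inspect.
model = torchvision.models.googlenet(weights="DEFAULT").eval()
for p in model.parameters():
    p.requires_grad_(False)

# Grab the activations of one intermediate layer with a forward hook.
acts = {}
model.inception4a.register_forward_hook(lambda mod, inp, out: acts.update(out=out))

# Start from noise and do gradient ascent on one channel's mean activation.
img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)
channel = 97  # arbitrary channel index

for _ in range(256):
    opt.zero_grad()
    model(img)
    loss = -acts["out"][0, channel].mean()  # maximize this channel's activation
    loss.backward()
    opt.step()

# `img` now roughly shows what the channel responds to; the real technique adds
# jitter, frequency penalties, etc.
```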
Circuits
- Detect more complicated things, like a curve in any orientation
- Superposition -> pure features that each reliably detect one specific thing get mixed into later polysemantic neurons, exploiting the fact that specific things rarely co-occur (e.g. a picture is of either a car or a dog, rarely both)
Universality
- Different NNs tend to learn the same basic detectors in their early layers (e.g. edge and curve detectors), which suggests that later layers might converge on analogous features too
Toy models of superposition (Elhage et al., 2022)
- Introduces 2 forces that counteract each other:
- Privileged basis - some architectures encourage features to align with individual neurons, so a representation can be read off a single neuron
- Superposition - linear representations can store more features than they have dimensions when the features are sparse, which can be seen as neural networks simulating larger, sparser networks (see the sketch after this list)
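A stripped-down sketch of the paper's toy setup as I understand it: n sparse features are linearly compressed into m < n hidden dimensions and reconstructed with a ReLU readout; with sparse enough features, the model stores more features than it has dimensions. The sizes, sparsity, and training details below are made up, and the paper's per-feature importance weights are omitted.

```python
import torch

n_features, n_hidden, sparsity = 20, 5, 0.95  # made-up sizes

W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for _ in range(3000):
    # Sparse inputs in [0, 1]: each feature is active with probability 1 - sparsity.
    x = torch.rand(1024, n_features)
    x = x * (torch.rand(1024, n_features) > sparsity).float()

    h = x @ W.T                     # compress: n_features -> n_hidden
    x_hat = torch.relu(h @ W + b)   # reconstruct: n_hidden -> n_features

    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# W^T W shows the interference between features: with high sparsity, far more
# than n_hidden features end up with their own (nearly orthogonal or antipodal)
# directions in the hidden space.
print((W.T @ W).detach())
```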
Concept-based interpretability
Techniques for automatically probing (and potentially modifying) human-interpretable concepts stored within neural networks
Discovering latent knowledge in language models without supervision (Burns et al., 2022)
- Learns a probe over the model's hidden states whose answers are both confident and consistent: a statement and its negation should get probabilities that sum to 1 (see the sketch after this list)
- Is the assumption that the truth is the most consistent way of answering?
- Consistency is a better goal than being a good Bing?
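My reading of the CCS (Contrast-Consistent Search) objective, as a sketch: a small probe maps the hidden state of a statement and of its negation to probabilities that should be consistent (sum to 1) and confident (not both stuck at 0.5). The hidden states below are random stand-ins; in the paper they come from a frozen language model and are normalized first.

```python
import torch

hidden_dim = 768  # made-up hidden size

# Stand-ins for hidden states of contrast pairs ("X? Yes." vs. "X? No.").
h_pos = torch.randn(256, hidden_dim)
h_neg = torch.randn(256, hidden_dim)

# Linear probe mapping a hidden state to a probability of "true".
probe = torch.nn.Sequential(torch.nn.Linear(hidden_dim, 1), torch.nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(1000):
    p_pos = probe(h_pos).squeeze(-1)
    p_neg = probe(h_neg).squeeze(-1)

    # Consistency: a statement and its negation should get complementary probabilities.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2

    loss = (consistency + confidence).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```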
Probing a deep neural network (Alain and Bengio, 2018)
I don’t get this. Is the idea to try to project each layer into a much smaller dimension using logistic regression? I think it is roughly that: freeze the network and train a separate linear (logistic regression) probe on each layer’s activations to predict the labels; the probe’s accuracy then measures how linearly decodable the labels are at that depth.
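A minimal sketch of that reading, with a made-up (untrained) network and toy data, using scikit-learn's LogisticRegression as the probe:

```python
import torch
from sklearn.linear_model import LogisticRegression

torch.manual_seed(0)

# Made-up frozen network and toy binary-classification data.
net = torch.nn.Sequential(
    torch.nn.Linear(20, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 2),
)
X = torch.randn(1000, 20)
y = (X[:, 0] + X[:, 1] > 0).long().numpy()

# Record the activations after every layer of the frozen network.
with torch.no_grad():
    acts, h = [], X
    for layer in net:
        h = layer(h)
        acts.append(h.numpy())

# One logistic-regression probe per layer (train accuracy only, for brevity).
for i, a in enumerate(acts):
    probe = LogisticRegression(max_iter=1000).fit(a, y)
    print(f"layer {i}: probe accuracy {probe.score(a, y):.2f}")
```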