AGISF - Interpretability

2 minute read

Week 6 of the AI alignment curriculum. Interpretability is the study of ways to, well, interpret AI models, currently mainly NNs.

Mechanistic interpretability

This aims to understand networks on the level of individual neurons.

Zoom In: an introduction to circuits (Olah et al., 2020)

Claims

  1. Features are the fundamental unit of neural networks. They correspond to directions in activation space. These features can be rigorously studied and understood (see the sketch after this list).
  2. Features are connected by weights, forming circuits. These circuits can also be rigorously studied and understood.
  3. Analogous features and circuits form across models and tasks.
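
A minimal sketch (made-up numbers, not from the paper) of what "features are directions" means in practice: reading off a feature's activation is just projecting an activation vector onto the feature's direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activation vectors for a batch of 3 inputs in a 512-dim layer.
activations = rng.normal(size=(3, 512))

# A hypothetical "feature direction": a unit vector in that activation space.
feature_direction = rng.normal(size=512)
feature_direction /= np.linalg.norm(feature_direction)

# How strongly the feature fires on each input = projection onto the direction.
feature_activation = activations @ feature_direction
print(feature_activation)  # one scalar per input
```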

Features

  • Detect a single thing
  • Earlier layers find generic things, like gradients or curves
  • Later layers find specific things, like wheels or faces
  • Controversial whether a single neuron can detect a single thing
  • Polysemantic neurons respond to multiple features

Circuits

  • Detect more complicated things, e.g. a curve in any orientation, by combining earlier features with weights (toy sketch after this list)
  • Superposition - a cleanly detected ("pure") feature can be spread across later polysemantic neurons, mixed in with unrelated features; this works because specific things rarely co-occur (e.g. a picture is rarely of both a car and a dog)
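
A toy illustration of a circuit (the weights are made up by me, not taken from the paper): a later unit that fires for a curve in any orientation by taking positive weights from several oriented curve detectors in the previous layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations of 4 earlier-layer curve detectors (one per orientation)
# for a batch of 3 images; positive = a curve of that orientation is present.
oriented_curves = rng.normal(size=(3, 4))

# A later "curve in any orientation" unit: positive weights to every oriented
# detector, so it fires whenever any of them fires. Weights are illustrative;
# in a real network they are learned.
w = np.array([1.0, 1.0, 1.0, 1.0])
b = -0.5
any_curve = np.maximum(0.0, oriented_curves @ w + b)  # ReLU
print(any_curve)
```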

Universality

  • Different networks tend to learn the same low-level features (e.g. edge and curve detectors), which suggests that analogous higher-level features and circuits may recur across models too

Toy models of superposition (Elhage et al., 2022)

  • Introduces two forces that counteract each other (sketch of the toy model after this list):
  1. Privileged basis - the activation function makes the neuron basis special, so some features can be encoded by a single, dedicated neuron
  2. Superposition - linear representations can encode more features than there are dimensions, which can be seen as the network simulating a larger, sparser network
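
A minimal sketch of the kind of toy model the paper trains, with my own sizes and without the paper's per-feature importance weights: n sparse features are squeezed into m < n hidden dimensions and reconstructed as ReLU(WᵀWx + b); with enough sparsity the learned W stores features in superposition.

```python
import torch

torch.manual_seed(0)

n_features, n_hidden = 20, 5   # more features than hidden dimensions (my choice)
sparsity = 0.95                # probability that a given feature is 0 on an input

W = torch.nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(2000):
    # Sparse synthetic features in [0, 1]: most are exactly zero on each sample.
    x = torch.rand(256, n_features)
    x = x * (torch.rand(256, n_features) > sparsity)

    h = x @ W.T                       # compress into n_hidden dimensions
    x_hat = torch.relu(h @ W + b)     # reconstruct all n_features
    loss = ((x - x_hat) ** 2).mean()  # plain reconstruction error

    opt.zero_grad()
    loss.backward()
    opt.step()

# Off-diagonal structure in WᵀW = features sharing directions, i.e. superposition.
with torch.no_grad():
    print(W.T @ W)
```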

Concept-based interpretability

Techniques for automatically probing (and potentially modifying) human-interpretable concepts stored within neural networks

Discovering latent knowledge in language models without supervision (Burns et al., 2022)

  • Searches, without any labels, for a direction in the model's activations whose yes/no answers are both confident and consistent - a statement and its negation should get complementary probabilities (sketch after this list)
  • Is the assumption that the truth is the most consistent way of answering?
  • Consistency is a better goal than being a good Bing?
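
A minimal sketch of the paper's CCS objective on random stand-in activations (the real method takes normalized hidden states of a statement and its negation from a language model; the variable names and sizes here are mine):

```python
import torch

torch.manual_seed(0)

# Stand-ins for hidden states of contrast pairs: phi(x+) for "statement is true"
# and phi(x-) for "statement is false". Random tensors just to show the objective.
d = 256
phi_pos = torch.randn(512, d)
phi_neg = torch.randn(512, d)

probe = torch.nn.Linear(d, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(1000):
    p_pos = torch.sigmoid(probe(phi_pos)).squeeze(-1)
    p_neg = torch.sigmoid(probe(phi_neg)).squeeze(-1)

    # Consistency: p(x+) should equal 1 - p(x-).
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: push both probabilities away from 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2

    loss = (consistency + confidence).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```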

Probing a deep neural network (Alain and Bengio, 2018)

The idea is to train a small linear classifier (a "probe", e.g. logistic regression) on each layer's activations to predict the label; the accuracy of each probe measures how linearly decodable the task information is at that depth - roughly the "project each layer through logistic regression" intuition.
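
A minimal sketch of that idea on a toy, untrained network with synthetic data (all names and sizes are mine): pull out one layer's activations with a forward hook and fit a logistic-regression probe on them.

```python
import torch
from sklearn.linear_model import LogisticRegression

torch.manual_seed(0)

# A small stand-in network; in practice the probe is attached to a trained model.
model = torch.nn.Sequential(
    torch.nn.Linear(20, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 2),
)

# Synthetic data: the label depends on the sum of the inputs.
X = torch.randn(2000, 20)
y = (X.sum(dim=1) > 0).long()

# Grab the activations after the first hidden layer with a forward hook.
acts = {}
def hook(module, inputs, output):
    acts["layer1"] = output.detach()
model[1].register_forward_hook(hook)  # hook on the first ReLU

with torch.no_grad():
    model(X)

# The probe: logistic regression trained on that layer's activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(acts["layer1"].numpy(), y.numpy())
print("probe accuracy:", probe.score(acts["layer1"].numpy(), y.numpy()))
```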

Acquisition of chess knowledge in AlphaZero (McGrath et al., 2021)

Locating and Editing Factual Associations in GPT: blog post (Meng et al., 2022)