AGISF - Interpretability
Week 6 of the AI alignment curriculum. Interpretability is the study of ways to, well, interpret AI models, currently mainly neural networks (NNs).
Mechanistic interpretability
This aims to understand networks at the level of individual neurons and the weights connecting them.
Zoom In: an introduction to circuits (Olah et al., 2020)
Claims
- Features are the fundamental unit of neural networks. They correspond to directions in activation space (see the sketch after this list). These features can be rigorously studied and understood.
- Features are connected by weights, forming circuits. These circuits can also be rigorously studied and understood.
- Analogous features and circuits form across models and tasks.
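A rough illustration of the first claim (my own, not from the paper): if a layer's activations form a vector space, a feature is a direction in that space, and an individual neuron is just the special case of a standard basis direction. All arrays below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up activations: a batch of 4 inputs at a layer with 512 neurons.
acts = rng.standard_normal((4, 512))

# A feature as a (unit-norm) direction in activation space.
direction = rng.standard_normal(512)
direction /= np.linalg.norm(direction)

# How strongly each input expresses the feature: projection onto the direction.
feature_strength = acts @ direction  # shape (4,)

# An individual neuron is the special case of a standard basis direction.
neuron_direction = np.zeros(512)
neuron_direction[42] = 1.0
assert np.allclose(acts @ neuron_direction, acts[:, 42])
```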
Features
- Detect a single thing
- Earlier layers find generic things, like gradients or curves
- Later layers find specific things, like wheels or faces
- It is controversial whether a single neuron detects a single thing; the Circuits work probes this with feature visualization (see the sketch after this list)
- Polysemantic neurons respond to multiple unrelated features
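The Circuits work studies what a neuron detects largely via feature visualization (activation maximization). A minimal sketch, assuming PyTorch and torchvision are installed; the choice of GoogLeNet (torchvision's InceptionV1), the inception4a layer, and channel 97 are arbitrary, and real feature visualization adds heavy regularization to get interpretable images.

```python
import torch
import torchvision

# Pretrained GoogLeNet (torchvision's InceptionV1) as the network to inspect.
model = torchvision.models.googlenet(weights="DEFAULT").eval()
for p in model.parameters():
    p.requires_grad_(False)

# Grab the activations of one intermediate layer with a forward hook.
acts = {}
model.inception4a.register_forward_hook(lambda mod, inp, out: acts.update(out=out))

# Start from noise and do gradient ascent on one channel's mean activation.
img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)
channel = 97  # arbitrary channel index

for _ in range(256):
    opt.zero_grad()
    model(img)
    loss = -acts["out"][0, channel].mean()  # maximize this channel's activation
    loss.backward()
    opt.step()

# `img` now roughly shows what the channel responds to; the real technique adds
# jitter, frequency penalties, etc.
```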
Circuits
- Detect more complicated things, like a curve in any orientation
- Superposition -> pure features that each reliably detect one specific thing get mixed into later polysemantic neurons, exploiting the fact that specific things rarely co-occur (e.g. a picture is of either a car or a dog, rarely both)
Universality
- Different NNs tend to learn the same basic detectors in their early layers (e.g. edge and curve detectors), which suggests that later layers might converge on analogous features too
Toy models of superposition (Elhage et al., 2022)
- Introduces 2 forces that counteract each other:
- Privileged basis - some architectures encourage features to align with individual neurons, so a representation can be read off a single neuron
- Superposition - linear representations can store more features than they have dimensions when the features are sparse, which can be seen as neural networks simulating larger, sparser networks (see the sketch after this list)
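A stripped-down sketch of the paper's toy setup as I understand it: n sparse features are linearly compressed into m < n hidden dimensions and reconstructed with a ReLU readout; with sparse enough features, the model stores more features than it has dimensions. The sizes, sparsity, and training details below are made up, and the paper's per-feature importance weights are omitted.

```python
import torch

n_features, n_hidden, sparsity = 20, 5, 0.95  # made-up sizes

W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for _ in range(3000):
    # Sparse inputs in [0, 1]: each feature is active with probability 1 - sparsity.
    x = torch.rand(1024, n_features)
    x = x * (torch.rand(1024, n_features) > sparsity).float()

    h = x @ W.T                     # compress: n_features -> n_hidden
    x_hat = torch.relu(h @ W + b)   # reconstruct: n_hidden -> n_features

    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# W^T W shows the interference between features: with high sparsity, far more
# than n_hidden features end up with their own (nearly orthogonal or antipodal)
# directions in the hidden space.
print((W.T @ W).detach())
```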
Concept-based interpretability
Techniques for automatically probing (and potentially modifying) human-interpretable concepts stored within neural networks
Discovering latent knowledge in language models without supervision (Burns et al., 2022)
- Learns a probe over the model's hidden states whose answers are both confident and consistent: a statement and its negation should get probabilities that sum to 1 (see the sketch after this list)
- Is the assumption that the truth is the most consistent way of answering?
- Consistency is a better goal than being a good Bing?
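My reading of the CCS (Contrast-Consistent Search) objective, as a sketch: a small probe maps the hidden state of a statement and of its negation to probabilities that should be consistent (sum to 1) and confident (not both stuck at 0.5). The hidden states below are random stand-ins; in the paper they come from a frozen language model and are normalized first.

```python
import torch

hidden_dim = 768  # made-up hidden size

# Stand-ins for hidden states of contrast pairs ("X? Yes." vs. "X? No.").
h_pos = torch.randn(256, hidden_dim)
h_neg = torch.randn(256, hidden_dim)

# Linear probe mapping a hidden state to a probability of "true".
probe = torch.nn.Sequential(torch.nn.Linear(hidden_dim, 1), torch.nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(1000):
    p_pos = probe(h_pos).squeeze(-1)
    p_neg = probe(h_neg).squeeze(-1)

    # Consistency: a statement and its negation should get complementary probabilities.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2

    loss = (consistency + confidence).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```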
Probing a deep neural network (Alain and Bengio, 2018)
I don’t get this. Is the idea to try to project each layer into a much smaller dimension using logistic regression? I think it is roughly that: freeze the network and train a separate linear (logistic regression) probe on each layer’s activations to predict the labels; the probe’s accuracy then measures how linearly decodable the labels are at that depth.
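A minimal sketch of that reading, with a made-up (untrained) network and toy data, using scikit-learn's LogisticRegression as the probe:

```python
import torch
from sklearn.linear_model import LogisticRegression

torch.manual_seed(0)

# Made-up frozen network and toy binary-classification data.
net = torch.nn.Sequential(
    torch.nn.Linear(20, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 2),
)
X = torch.randn(1000, 20)
y = (X[:, 0] + X[:, 1] > 0).long().numpy()

# Record the activations after every layer of the frozen network.
with torch.no_grad():
    acts, h = [], X
    for layer in net:
        h = layer(h)
        acts.append(h.numpy())

# One logistic-regression probe per layer (train accuracy only, for brevity).
for i, a in enumerate(acts):
    probe = LogisticRegression(max_iter=1000).fit(a, y)
    print(f"layer {i}: probe accuracy {probe.score(a, y):.2f}")
```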