Backpropagation
The basic equations
Given the following notation:
- $w_{jk}^l$ is the weight from the $k^{th}$ neuron in the $(l-1)^{th}$ layer to the $j^{th}$ neuron of the $l^{th}$ layer
- $a_j^l$ is the activation of the $j^{th}$ neuron in the $l^{th}$ layer
- $b_j^l$ is the bias of the $j^{th}$ neuron in the $l^{th}$ layer
- $z^l$ is $w^la^{l-1} + b^l$ for the $l^{th}$ layer
- $z_j^l$ is $\sum_kw_{jk}^la_k^{l-1} + b_j^l$ for the $j^{th}$ neuron in the $l^{th}$ layer
- $\sigma$ is the activation function used
- $C$ is the loss for a given input
- $\frac {\partial C}{\partial a_j^L}$ is the cost function gradient for $a_j^L$
- $\delta^l_j$ is the error of the $j^{th}$ neuron of the $l^{th}$ layer
- $x^L$ is $x$ from the output layer, e.g. $a^L$ are the activations of the output layer
Backpropagation can be condensed into the following 4 equations (a minimal code sketch follows the list):
- $\delta_j^L = \frac {\partial C}{\partial a_j^L} \sigma'(z_j^L)$, i.e. $\delta^L = \nabla_aC \odot \sigma'(z^L)$
- $\delta^l=((w^{l+1})^T\delta^{l+1})\odot\sigma'(z^l)$
- $\frac {\partial C}{\partial b^l_j} = \delta^l_j$
- $\frac {\partial C}{\partial w_{jk}^l} = a_k^{l-1}\delta_j^l$
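These four map almost line for line onto code. Here is a minimal numpy sketch for a made-up 3–4–2 network, assuming a sigmoid activation and the L2 cost; all variable names are invented for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)

# Tiny 3 -> 4 -> 2 network. Each w has shape (neurons in l, neurons in l-1),
# so w[j, k] matches the w_{jk}^l convention above.
w2, b2 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
w3, b3 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
x, y = rng.normal(size=(3, 1)), rng.normal(size=(2, 1))

# Forward pass, keeping every z^l around for the sigma'(z^l) terms.
a1 = x
z2 = w2 @ a1 + b2
a2 = sigmoid(z2)
z3 = w3 @ a2 + b3
a3 = sigmoid(z3)                        # a^L

# Equation 1: delta^L = nabla_a C (.) sigma'(z^L), with C = 1/2 * ||a^L - y||^2
delta3 = (a3 - y) * sigmoid_prime(z3)

# Equation 2: delta^l = ((w^{l+1})^T delta^{l+1}) (.) sigma'(z^l)
delta2 = (w3.T @ delta3) * sigmoid_prime(z2)

# Equation 3: dC/db^l_j = delta^l_j
grad_b3, grad_b2 = delta3, delta2

# Equation 4: dC/dw^l_{jk} = a^{l-1}_k * delta^l_j, i.e. an outer product per layer
grad_w3 = delta3 @ a2.T
grad_w2 = delta2 @ a1.T
```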
Lessons from the output layer error ( $\delta_j^L = \frac {\partial C}{\partial a_j^L} \sigma'(z_j^L)$, i.e. $\delta^L = \nabla_aC \odot \sigma'(z^L)$ )
- The L2 loss function is nice for gradient calculations: with $C = \frac 1 2\sum_j(a_j^L - y_j)^2$, we get $\left(\frac 1 2(a-b)^2\right)' = (a-b)$, so $\frac {\partial C}{\partial a_j^L}=(a_j^L - y_j)$
- This is all based on gradients, i.e. derivatives, i.e. how fast the cost is changing. If the gradient for a given $a_j^L$ is small, then it means that output didn’t contribute much to the overall result, and so it shouldn’t contribute all that much to the overall error
- All the components are easily computed and/or depend on things that have already been computed
- $\nabla_aC \odot \sigma'(z^L)$ is nicely vectorized, e.g. for L2 $\nabla_aC \odot \sigma'(z^L) = (a^L - y) \odot \sigma'(z^L)$
- Storing all the calculated $z_j^l$ values will speed up the calculations, at the cost of memory (cheap, but with lots of weights it could turn out to be non-trivial)
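The last two points combine nicely: a forward pass that caches every $z^l$ and $a^l$ makes the vectorized $\delta^L$ a one-liner. A sketch along those lines, again assuming a sigmoid activation and the L2 cost (`forward_with_cache` and `output_error` are names made up for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward_with_cache(x, weights, biases):
    """Return every z^l and a^l; activations[0] is the input itself."""
    zs, activations = [], [x]
    a = x
    for w, b in zip(weights, biases):
        z = w @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    return zs, activations

def output_error(y, zs, activations):
    """delta^L = (a^L - y) (.) sigma'(z^L) for the L2 cost."""
    return (activations[-1] - y) * sigmoid_prime(zs[-1])

# Usage: a 3 -> 4 -> 2 network with random parameters.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [rng.normal(size=(4, 1)), rng.normal(size=(2, 1))]
x, y = rng.normal(size=(3, 1)), rng.normal(size=(2, 1))

zs, activations = forward_with_cache(x, weights, biases)
delta_L = output_error(y, zs, activations)
```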
Lessons from the layer-specific errors ( $\delta^l=((w^{l+1})^T\delta^{l+1})\odot\sigma'(z^l)$ )
- The $(w^{l+1})^T\delta^{l+1}$ term is how much this neuron influenced layer $l+1$, and $\sigma'(z^l)$ is how much the neuron itself was changing. So what this is doing is scaling how much the neuron influenced its targets by how responsive the neuron was
- The transpose is pretty much “for each error, get all the weights that influenced it and scale them by the error”, since: $$w^T\delta = \begin{bmatrix} w_{11} & w_{12} & \dots & w_{1n} \\\ w_{21} & w_{22} & \dots & w_{2n} \\\ \vdots & & \ddots & \vdots \\\ w_{n1} & w_{n2} & \dots & w_{nn} \\\ \end{bmatrix}^T \begin{bmatrix} \delta_{1} \\\ \delta_{2} \\\ \vdots \\\ \delta_{n} \\\ \end{bmatrix} = \begin{bmatrix} w_{11} & w_{21} & \dots & w_{n1} \\\ w_{12} & w_{22} & \dots & w_{n2} \\\ \vdots & & \ddots & \vdots \\\ w_{1n} & w_{2n} & \dots & w_{nn} \\\ \end{bmatrix} \begin{bmatrix} \delta_{1} \\\ \delta_{2} \\\ \vdots \\\ \delta_{n} \\\ \end{bmatrix} =$$ $$= \delta_1 \begin{bmatrix} w_{1,1} \\\ w_{1,2} \\\ \vdots \\\ w_{1,n} \\\ \end{bmatrix} + \delta_2 \begin{bmatrix} w_{2,1} \\\ w_{2,2} \\\ \vdots \\\ w_{2,n} \\\ \end{bmatrix} + \dots + \delta_n \begin{bmatrix} w_{n,1} \\\ w_{n,2} \\\ \vdots \\\ w_{n,n} \\\ \end{bmatrix} = \delta_1 w_{1,} + \delta_2 w_{2,} + \dots + \delta_n w_{n,}$$ (where $w_{jk}$ is the weight from $k$ to $j$, so $w_{j,}$ is the vector of weights coming into neuron $j$)
- Building on the previous point, $w^T\delta$ results in a vector of per-neuron scaling factors, each summing up how much the given neuron influenced the next layer. This then gets multiplied by how much the neuron itself changed ($\sigma'(z^l_j)$) to get the overall error. So if the neuron was shouting “Me! Me! Pick me!”, and the next layer decided to do so, the error of the neuron will be magnified. On the other hand, if it was ignored, then its error will also be ignored. This goes both ways. If it was screaming for attention and it turned out that it was correct, then even though its error will be magnified, the change will be small (since it was right), therefore the overall error will still be small.
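A quick numeric sanity check of that “scale each incoming-weight row by its error and sum” reading of the transpose (the shapes and numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3))      # w[j, k]: weight from neuron k into neuron j
delta = rng.normal(size=(4, 1))  # errors of the 4 neurons in layer l+1

# Matrix form: (w^{l+1})^T delta^{l+1}
via_transpose = w.T @ delta

# Row-by-row form: for each next-layer neuron j, take the weights coming
# into it (row w[j, :]), scale them by its error delta_j, then sum.
via_rows = sum(delta[j, 0] * w[j, :].reshape(-1, 1) for j in range(w.shape[0]))

print(np.allclose(via_transpose, via_rows))  # True
```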
Lessons from the bias update ($\frac {\partial C}{\partial b^l_j} = \delta^l_j$)
- This was confusing for me, until I did the derivative. $\frac {\partial C}{\partial z_j^l}=\delta_j^l$, by definition. $\frac {\partial C}{\partial b_j^l} = \frac {\partial C}{\partial z_j^l}\frac {\partial z_j^l}{\partial b_j^l}$, and $\frac {\partial z_j^l}{\partial b_j^l} = 1$ because $z_j^l = \sum_k w_{jk}^l a_k^{l-1} + b^l_j$.
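One way to double-check $\frac {\partial C}{\partial b^l_j} = \delta^l_j$ without the algebra is a finite-difference test: nudge a single bias, rerun the forward pass, and compare the change in cost against $\delta_j$. A small sketch for a single sigmoid layer with the L2 cost (all values are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, x, y):
    """L2 cost of a single sigmoid layer: C = 1/2 * ||sigmoid(wx + b) - y||^2."""
    a = sigmoid(w @ x + b)
    return 0.5 * np.sum((a - y) ** 2)

rng = np.random.default_rng(0)
w, b = rng.normal(size=(2, 3)), rng.normal(size=(2, 1))
x, y = rng.normal(size=(3, 1)), rng.normal(size=(2, 1))

# delta for the output layer: (a - y) (.) sigma'(z)
z = w @ x + b
a = sigmoid(z)
delta = (a - y) * a * (1 - a)

# Finite-difference estimate of dC/db_0: nudge the first bias slightly.
eps = 1e-6
b_nudged = b.copy()
b_nudged[0, 0] += eps
numeric = (cost(w, b_nudged, x, y) - cost(w, b, x, y)) / eps

print(delta[0, 0], numeric)  # the two should agree to several decimal places
```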
Lessons from the weight partial derivative ($\frac {\partial C}{\partial w_{jk}^l} = a_k^{l-1}\delta_j^l$)
- $\frac {\partial C }{\partial w}=a_{in}\delta_{out}$ - this is nice and intuitive.
- The error gets split among the weights, proportional to how loud the input was shouting. This is interesting and suggests something deeper.
- Every input gets penalized. If there are 2 inputs with the same magnitude, one of which is totally “correct” while the other is responsible for a large penalty, then both will be punished equally: each receives the whole penalty (see the sketch after this list).
- Won’t this punish overconfidence a lot more than underconfidence?
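Coming back to the “equal inputs, equal punishment” point: in matrix form the weight gradient is just an outer product, $\nabla_{w^l} C = \delta^l (a^{l-1})^T$, so any two inputs with equal activations produce identical gradient columns, no matter which one was actually at fault. A tiny illustration with made-up numbers:

```python
import numpy as np

delta = np.array([[0.3], [-0.1]])         # delta^l for 2 neurons in layer l
a_prev = np.array([[0.7], [0.7], [0.2]])  # a^{l-1}: the first two inputs are equal

# Equation 4 in matrix form: dC/dw^l = delta^l (a^{l-1})^T
grad_w = delta @ a_prev.T
print(grad_w)
# Expected (up to float rounding):
# [[ 0.21  0.21  0.06]
#  [-0.07 -0.07 -0.02]]
# Columns 0 and 1 are identical: both inputs get the same weight update,
# regardless of which one was actually responsible for the error.
```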