Cost functions

Quadratic cost function ($C = \frac {(y-a)^2}{2}$)

  • This is nice and simple, with the additional bonus that $C' = a - y$ (differentiating with respect to the activation $a$)
  • The loss grows quadratically, so a large error is penalised a lot more harshly than a small one (see the sketch after this list) - this seems a good idea
  • It’s very xenophobic, in that it will go out of its way to punish outliers - a few large residuals can dominate the loss
  • The square will cause the loss to always be non-negative. Would allowing it to go negative result in finding optima faster? Or would it just overshoot in the opposite direction?
  • Bad behaviour gets punished, and good behaviour ignored. Yet it works. Beware the naturalistic fallacy
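
A minimal NumPy sketch (the function names are my own) of the quadratic cost and its derivative, illustrating how a large error dwarfs a small one:

```python
import numpy as np

def quadratic_cost(a, y):
    """C = (y - a)^2 / 2 for a single output."""
    return (y - a) ** 2 / 2

def quadratic_cost_grad(a, y):
    """dC/da = a - y."""
    return a - y

# A large error contributes far more to the loss than a small one:
a = np.array([0.1, 0.9])
y = np.array([0.0, 0.0])
print(quadratic_cost(a, y))       # [0.005 0.405] - the 0.9 error dominates
print(quadratic_cost_grad(a, y))  # [0.1 0.9]
```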

Cross-entropy cost function ($C = -\sum_iy_i\log(a_i)$)

  • Each possible outcome will have its own clause. In the case of a binary classifier, this will be $-(y\log(a) + (1 - y)\log(1 - a))$, where one clause will disappear (seeing as $y \in \{0, 1\}$)
  • For tasks with $n$ categories, there will be $n$ clauses, e.g. $-(1\cdot\log(0.2) + 0\cdot\log(0.3) + 0\cdot\log(0.5))$ for the case of $y = \text{cat}_1$, $a = [0.2, 0.3, 0.5]$
  • Cross-entropy works on probability distributions, so it needs inputs such that $\sum_i a_i = 1$ and each $a_i \in (0, 1)$
  • The sum will be negative, since all the $a_i < 1$, so all the $\log(a_i) < 0$. The minus sign makes sure that the loss is always non-negative
  • During backpropagation, it’s very simple to calculate $C'$ for the quadratic (L2) cost, but that leaves a factor of the activation function’s derivative, which for the sigmoid is near zero when the neuron saturates, slowing learning. The derivative of cross-entropy cancels this factor out, leaving $\frac {\partial C}{\partial w_j} = x_j(a-y)$. This doesn’t saturate the output neurons (but hidden ones still can) - see the sketch after this list
  • Was found by starting with the desired shape of $\frac {\partial C}{\partial w_j}$ and integrating backwards
  • Measures how surprised one should be by the true label $y$ given the predicted distribution $a^L$
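
To see the cancellation concretely, here is a sketch for a single sigmoid neuron with the binary cross-entropy (the setup, names and random inputs are my own); it checks numerically that $\frac{\partial C}{\partial w_j} = x_j(a - y)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(a, y):
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Single neuron: a = sigmoid(w . x + b)
rng = np.random.default_rng(0)
x = rng.normal(size=3)
w = rng.normal(size=3)
b = 0.5
y = 1.0

a = sigmoid(w @ x + b)
analytic = x * (a - y)            # claimed gradient: x_j (a - y)

# Finite-difference check of dC/dw_j
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    numeric[j] = (binary_cross_entropy(sigmoid(w_plus @ x + b), y)
                  - binary_cross_entropy(sigmoid(w_minus @ x + b), y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

The same check against the quadratic cost would leave an extra $\sigma'(z)$ factor in the gradient, which is exactly what shrinks the updates when the output neuron saturates.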

Softmax ($a_j = \frac {e^{z_j}} {\sum_k e^{z_k} }$)

  • Transforms input vectors into probability distributions
  • Will work well with cross-entropy
  • Can be used directly to assign probabilities for each category
  • The exponentials ensure that all values are positive
  • The derivative of softmax uses some dark magic, but the result is that $\frac {\partial a_j}{\partial z_k} > 0$ for $j = k$ and $\frac {\partial a_j}{\partial z_k} < 0$ for $j \neq k$. This means that increasing $z_j$ will push all the other activations $a_k$ in the opposite direction (see the sketch after this list)
  • Softmax layers are nonlocal - each neuron’s output depends on all the other neurons in the layer
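
A small sketch (variable names are my own) of softmax and its Jacobian, using the standard result $\frac{\partial a_j}{\partial z_k} = a_j(\delta_{jk} - a_k)$, which makes the sign pattern above easy to check:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
a = softmax(z)
print(a, a.sum())                    # a probability distribution, sums to 1

# Jacobian: da_j/dz_k = a_j * (delta_jk - a_k)
J = np.diag(a) - np.outer(a, a)
print(np.sign(J))                    # +1 on the diagonal, -1 everywhere else
```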

Log likelihood cost function ($C = - \ln a^L_y$)

  • This is just the cross-entropy for multiple classes, where $y$ is one-hot ($y_i \in \{0, 1\}$). So for $n$ classes, where $k$ is the correct class: $$\begin{cases} y_i = 0 \text{ for } i\ne k \\ y_i = 1 \text{ for } i = k \end{cases}$$ The sum collapses to $-\sum_j \delta_{jk}\ln a^L_j = -\ln a^L_k$, because only the term for the correct class is non-zero.
  • Softmax + log likelihood is equivalent to sigmoid + cross-entropy, in that both combinations produce the same output error $\delta^L = a^L - y$, so neither saturates the output layer. I still don’t fully understand why the dark magic of the derivatives works out this way - see the sketch below.
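
As a partial answer to that last point, here is a sketch (the setup is my own) checking numerically that with softmax + log-likelihood the output error $\frac{\partial C}{\partial z_j}$ is indeed $a^L_j - y_j$ - the same form that sigmoid + cross-entropy produces:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def log_likelihood(a, k):
    """C = -ln a_k, where k is the correct class."""
    return -np.log(a[k])

z = np.array([0.3, -1.2, 2.0])
k = 0                                 # correct class
y = np.eye(len(z))[k]                 # one-hot target
a = softmax(z)

analytic = a - y                      # claimed output error dC/dz

# Finite-difference check of dC/dz_j
eps = 1e-6
numeric = np.array([
    (log_likelihood(softmax(z + eps * np.eye(len(z))[j]), k)
     - log_likelihood(softmax(z - eps * np.eye(len(z))[j]), k)) / (2 * eps)
    for j in range(len(z))
])

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```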