# Cost functions

# Quadratic cost function ($C = \frac {(y-a)^2}{2}$)

- This is nice and simple, with the additional bonus that $C' = a - y$ (the derivative with respect to $a$)
- The loss grows quadratically with the error, so a large error is treated a lot more harshly than a small error - this seems a good idea
- It’s very xenophobic towards outliers: because the error is squared, a few points far from the rest can dominate the total cost
- Squaring causes the loss to always be non-negative. Would allowing it to go negative result in finding optima faster? Or would it just overshoot in the opposite direction?
- Bad behaviour gets punished, and good ignored. Yet it works. Beware the naturalistic fallacy
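
The points above can be checked with a few lines of Python. This is only a minimal sketch for a single output; the function names are made up for illustration and are not from any library.

```python
def quadratic_cost(y, a):
    """Quadratic cost C = (y - a)^2 / 2 for a single output."""
    return (y - a) ** 2 / 2


def quadratic_cost_derivative(y, a):
    """Derivative of C with respect to the activation a: dC/da = a - y."""
    return a - y


# With target y = 0 the error equals the activation.
# Doubling the error quadruples the cost (quadratic growth).
for error in (0.5, 1.0, 2.0, 4.0):
    cost = quadratic_cost(0.0, error)
    slope = quadratic_cost_derivative(0.0, error)
    print(error, cost, slope)
```

The cost is never negative, and the derivative is linear in the error, which is what makes this cost so easy to work with.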

# Cross-entropy cost function ($C = -\sum_iy_i\log(a_i)$)

- Each possible outcome will have its own term. In the case of a binary classifier, this will be $-(y\log(a) + (1 - y)\log(1 - a))$, where one term will disappear (seeing as $y \in \{0, 1\}$)
- For tasks with $n$ categories, there will be $n$ terms, e.g. $-(1\cdot\log(0.2) + 0\cdot\log(0.3) + 0\cdot\log(0.5))$ for the case of $y = \text{cat}_1, a=[0.2, 0.3, 0.5]$
- Cross-entropy works on probability distributions. So it needs inputs such that $\sum_i x_i = 1, x_i \in [0, 1]$
- The sum will be negative (or zero), since all the $a_i \le 1$, so the $\log(a_i) \le 0$. The minus makes sure that the loss is always non-negative
- During backpropagation, it’s very simple to calculate $C'$ for the quadratic (L2) cost, but that leaves the activation function derivative as a factor in the gradient. With a sigmoid activation, $\sigma'(z)$ is close to zero when the neuron saturates, so learning slows down. With cross-entropy, the $\sigma'(z)$ factor cancels out, leaving $\frac {\partial C}{\partial w_j} = x_j(a-y)$. This avoids the slowdown in the output neurons (but hidden ones can still saturate).
- Can be derived by starting with the desired shape of $\frac {\partial C}{\partial w_j}$ and integrating backwards
- Measures how surprised one should be with the difference between $y$ and $a^L$
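
To make the list above concrete, here is a small sketch in plain Python. It computes the cross-entropy for the three-class example and compares the weight gradient of a single sigmoid neuron under the quadratic and cross-entropy costs. All function names are illustrative, and the single-neuron, single-sample setup is an assumption made to keep the example short.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(y, a):
    """C = -sum_i y_i * log(a_i) for a target y and a prediction a."""
    # Terms with y_i == 0 are skipped, which also avoids evaluating log(0).
    return -sum(y_i * math.log(a_i) for y_i, a_i in zip(y, a) if y_i > 0)


def binary_cross_entropy(y, a):
    """C = -(y*log(a) + (1 - y)*log(1 - a)) for a single output."""
    return cross_entropy([y, 1 - y], [a, 1 - a])


def grad_w_quadratic(x, z, y):
    """dC/dw for quadratic cost + sigmoid: x * (a - y) * sigma'(z)."""
    a = sigmoid(z)
    return x * (a - y) * a * (1 - a)


def grad_w_cross_entropy(x, z, y):
    """dC/dw for cross-entropy + sigmoid: x * (a - y); sigma'(z) cancels."""
    return x * (sigmoid(z) - y)


# The three-class example: only the correct class contributes, giving -log(0.2).
print(cross_entropy([1, 0, 0], [0.2, 0.3, 0.5]))

# A badly wrong, saturated neuron: the target is 0 but z is large and positive.
x, z, y = 1.0, 6.0, 0.0
print(grad_w_quadratic(x, z, y))      # tiny, because sigma'(z) is nearly 0
print(grad_w_cross_entropy(x, z, y))  # close to 1, so learning stays fast
```

Note that `binary_cross_entropy` still fails on a prediction that is exactly wrong with full confidence (for example `y = 0`, `a = 1`), since that requires `log(0)`; real implementations usually clip `a` away from 0 and 1.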

# Softmax ($a_j = \frac {e^{z_j}} {\sum_k e^{z_k} }$)

- Transforms input vectors into probability distributions
- Will work well with cross-entropy
- Can be used directly to assign probabilities for each category
- The exponentials ensure that all values are positive
- The derivative of softmax uses some dark magic, but the result is $\frac {\partial a_j}{\partial z_k} = a_j(\delta_{jk} - a_k)$, so $\frac {\partial a_j}{\partial z_k} > 0$ for $j = k$ and $\frac {\partial a_j}{\partial z_k} < 0$ for $j \neq k$. This means that changing $z_j$ will move $a_j$ in the same direction and all the other $a$s in the opposite direction
- Softmax layers are nonlocal - each neuron depends on all other neurons in the layer for its outputs
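
The properties listed above can be seen in a short, dependency-free sketch. The names `softmax` and `softmax_jacobian` are chosen here for illustration; subtracting the maximum before exponentiating is a common stability trick that is not mentioned in the notes and does not change the result.

```python
import math


def softmax(z):
    """Map weighted inputs z to activations a_j = exp(z_j) / sum_k exp(z_k)."""
    # Subtracting max(z) leaves the result unchanged but avoids overflow.
    shift = max(z)
    exps = [math.exp(z_j - shift) for z_j in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(z):
    """Entry [j][k] is da_j/dz_k = a_j * (delta_jk - a_k)."""
    a = softmax(z)
    n = len(a)
    return [[a[j] * ((1.0 if j == k else 0.0) - a[k]) for k in range(n)]
            for j in range(n)]


z = [1.0, 2.0, 3.0]
a = softmax(z)
print(a, sum(a))  # all positive, summing to 1 up to rounding

# Raising z[0] raises a[0] and lowers every other activation (nonlocality).
bumped = softmax([z[0] + 0.5, z[1], z[2]])
print([after - before for before, after in zip(a, bumped)])
```

The Jacobian has a positive diagonal and negative off-diagonal entries, which is the sign pattern described in the list.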

# Log likelihood cost function ($C = - \ln a^L_y$)

- This is just the cross-entropy for multiple classes, where the target is one-hot, i.e. each $y_i \in \{0, 1\}$. So for $n$ classes, where $k$ is the correct class: $$\begin{cases} y_i = 0 \text{ for } i\ne k \\\ y_i = 1 \text{ for } i = k \end{cases}$$ This ends up being $- \sum_j \delta_{jk}\ln a^L_j = -\ln a^L_k$ (i.e. $-\ln a^L_y$ when $y$ denotes the index of the correct class), because only the term for the correct class will be non-zero.
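- Softmax + log likelihood behaves much like sigmoid + cross-entropy (similar rather than strictly identical). The dark magic of the derivatives works out the same way: in both cases the activation derivative cancels, leaving an output error of $\frac {\partial C}{\partial z^L_j} = a^L_j - y_j$, so the output layer avoids the learning slowdown.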
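
One way to convince yourself of the softmax + log-likelihood gradient is to compare the closed form $a_j - y_j$ with a numerical derivative. The sketch below does that for a single sample; the helper names and the finite-difference step `eps` are assumptions made for illustration.

```python
import math


def softmax(z):
    shift = max(z)  # for numerical stability only
    exps = [math.exp(z_j - shift) for z_j in z]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood_cost(z, correct):
    """C = -ln a_y, where a = softmax(z) and `correct` is the true class index."""
    return -math.log(softmax(z)[correct])


def analytic_gradient(z, correct):
    """dC/dz_j = a_j - y_j, with y one-hot at index `correct`."""
    a = softmax(z)
    return [a_j - (1.0 if j == correct else 0.0) for j, a_j in enumerate(a)]


def numeric_gradient(z, correct, eps=1e-6):
    """Central finite differences, used only to check the analytic form."""
    grad = []
    for j in range(len(z)):
        up = z[:j] + [z[j] + eps] + z[j + 1:]
        down = z[:j] + [z[j] - eps] + z[j + 1:]
        diff = log_likelihood_cost(up, correct) - log_likelihood_cost(down, correct)
        grad.append(diff / (2 * eps))
    return grad


z, correct = [0.5, -1.0, 2.0], 0
print(analytic_gradient(z, correct))
print(numeric_gradient(z, correct))
```

The two printed lists should agree to several decimal places, and no softmax derivative appears in the closed form, which is the same cancellation seen with sigmoid + cross-entropy.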