Cost functions

Quadratic cost function ($C = \frac {(y-a)^2}{2}$)

  • This is nice and simple, with the additional bonus that $C' = a - y$ (differentiating with respect to the activation $a$)
  • The loss grows quadratically, so a large error is penalised a lot more harshly than a small one (see the sketch after this list) - this seems a good idea
  • It’s very xenophobic, in that it will go out of its way to punish outliers - a few large residuals can dominate the loss
  • The square will cause the loss to always be non-negative. Would allowing it to go negative result in finding optima faster? Or would it just overshoot in the opposite direction?
  • Bad behaviour gets punished, and good behaviour ignored. Yet it works. Beware the naturalistic fallacy
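
A minimal NumPy sketch (the function names are my own) of the quadratic cost and its derivative, illustrating how a large error dwarfs a small one:

```python
import numpy as np

def quadratic_cost(a, y):
    """C = (y - a)^2 / 2 for a single output."""
    return (y - a) ** 2 / 2

def quadratic_cost_grad(a, y):
    """dC/da = a - y."""
    return a - y

# A large error contributes far more to the loss than a small one:
a = np.array([0.1, 0.9])
y = np.array([0.0, 0.0])
print(quadratic_cost(a, y))       # [0.005 0.405] - the 0.9 error dominates
print(quadratic_cost_grad(a, y))  # [0.1 0.9]
```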

Cross-entropy cost function ($C = -\sum_iy_i\log(a_i)$)

  • Each possible outcome will have its own clause. In the case of a binary classifier, this will be $-(y\log(a) + (1 - y)\log(1 - a))$, where one clause will disappear (seeing as $y \in \{0, 1\}$)
  • For tasks with $n$ categories, there will be $n$ clauses, e.g. $-(1\cdot\log(0.2) + 0\cdot\log(0.3) + 0\cdot\log(0.5))$ for the case of $y = \text{cat}_1$, $a = [0.2, 0.3, 0.5]$
  • Cross-entropy works on probability distributions, so it needs inputs such that $\sum_i a_i = 1$ and each $a_i \in (0, 1)$
  • The sum will be negative, since all the $a_i < 1$, so all the $\log(a_i) < 0$. The minus sign makes sure that the loss is always non-negative
  • During backpropagation, it’s very simple to calculate $C'$ for the quadratic (L2) cost, but that leaves a factor of the activation function’s derivative, which for the sigmoid is near zero when the neuron saturates, slowing learning. The derivative of cross-entropy cancels this factor out, leaving $\frac {\partial C}{\partial w_j} = x_j(a-y)$. This doesn’t saturate the output neurons (but hidden ones still can) - see the sketch after this list
  • Was found by starting with the desired shape of $\frac {\partial C}{\partial w_j}$ and integrating backwards
  • Measures how surprised one should be by the true label $y$ given the predicted distribution $a^L$
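
To see the cancellation concretely, here is a sketch for a single sigmoid neuron with the binary cross-entropy (the setup, names and random inputs are my own); it checks numerically that $\frac{\partial C}{\partial w_j} = x_j(a - y)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(a, y):
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Single neuron: a = sigmoid(w . x + b)
rng = np.random.default_rng(0)
x = rng.normal(size=3)
w = rng.normal(size=3)
b = 0.5
y = 1.0

a = sigmoid(w @ x + b)
analytic = x * (a - y)            # claimed gradient: x_j (a - y)

# Finite-difference check of dC/dw_j
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    numeric[j] = (binary_cross_entropy(sigmoid(w_plus @ x + b), y)
                  - binary_cross_entropy(sigmoid(w_minus @ x + b), y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

The same check against the quadratic cost would leave an extra $\sigma'(z)$ factor in the gradient, which is exactly what shrinks the updates when the output neuron saturates.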

Softmax ($a_j = \frac {e^{z_j}} {\sum_k e^{z_k} }$)

  • Transforms input vectors into probability distributions
  • Will work well with cross-entropy
  • Can be used directly to assign probabilities for each category
  • The exponentials ensure that all values are positive
  • The derivative of softmax uses some dark magic, but the result is that $\frac {\partial a_j}{\partial z_k} > 0$ for $j = k$ and $\frac {\partial a_j}{\partial z_k} < 0$ for $j \neq k$. This means that increasing $z_j$ will push all the other activations $a_k$ in the opposite direction (see the sketch after this list)
  • Softmax layers are nonlocal - each neuron’s output depends on all the other neurons in the layer
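
A small sketch (variable names are my own) of softmax and its Jacobian, using the standard result $\frac{\partial a_j}{\partial z_k} = a_j(\delta_{jk} - a_k)$, which makes the sign pattern above easy to check:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
a = softmax(z)
print(a, a.sum())                    # a probability distribution, sums to 1

# Jacobian: da_j/dz_k = a_j * (delta_jk - a_k)
J = np.diag(a) - np.outer(a, a)
print(np.sign(J))                    # +1 on the diagonal, -1 everywhere else
```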

Log likelihood cost function ($C = - \ln a^L_y$)

  • This is just the cross-entropy for multiple classes, where $y$ is one-hot ($y_i \in \{0, 1\}$). So for $n$ classes, where $k$ is the correct class: $$\begin{cases} y_i = 0 \text{ for } i\ne k \\ y_i = 1 \text{ for } i = k \end{cases}$$ The sum collapses to $-\sum_j \delta_{jk}\ln a^L_j = -\ln a^L_k$, because only the term for the correct class is non-zero.
  • Softmax + log likelihood is equivalent to sigmoid + cross-entropy, in that both combinations produce the same output error $\delta^L = a^L - y$, so neither saturates the output layer. I still don’t fully understand why the dark magic of the derivatives works out this way - see the sketch below.
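
As a partial answer to that last point, here is a sketch (the setup is my own) checking numerically that with softmax + log-likelihood the output error $\frac{\partial C}{\partial z_j}$ is indeed $a^L_j - y_j$ - the same form that sigmoid + cross-entropy produces:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def log_likelihood(a, k):
    """C = -ln a_k, where k is the correct class."""
    return -np.log(a[k])

z = np.array([0.3, -1.2, 2.0])
k = 0                                 # correct class
y = np.eye(len(z))[k]                 # one-hot target
a = softmax(z)

analytic = a - y                      # claimed output error dC/dz

# Finite-difference check of dC/dz_j
eps = 1e-6
numeric = np.array([
    (log_likelihood(softmax(z + eps * np.eye(len(z))[j]), k)
     - log_likelihood(softmax(z - eps * np.eye(len(z))[j]), k)) / (2 * eps)
    for j in range(len(z))
])

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```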