Hints when starting on a new problem
- Start by getting better-than-chance results, as a baseline for improvements
- Strip the problem down to a simpler version, e.g. learn to classify just 0s and 1s rather than all ten MNIST digits (see the sketch after this list)
- Focus on getting decent values for the hyperparameters one by one (e.g. $\lambda$ or $\eta$), rather than randomly jumping around hyperparameter space
- Get the learning rate etc. to decent values before scaling up the number of neurons
- Initially jump around in large steps while looking for a decent value, then fine-tune
- Can be useful to interleave the tuning of different hyperparameters, as they influence each other
- Pay very close attention to validation accuracy
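
A minimal sketch of stripping MNIST down to a two-class problem; `load_mnist` here is a fake stand-in for a real data loader:

```python
import numpy as np

def load_mnist():
    # Fake stand-in for a real MNIST loader; shapes match the usual data.
    X = np.random.rand(1000, 784)       # fake flattened 28x28 images
    y = np.random.randint(0, 10, 1000)  # fake labels 0..9
    return X, y

X, y = load_mnist()

# Keep only the 0s and 1s: a binary problem trains and debugs much
# faster than the full ten-class one.
mask = (y == 0) | (y == 1)
X_small, y_small = X[mask], y[mask]
print(X_small.shape, y_small.shape)
```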
Speed up feedback loops
- Use a small subset of the data to start with (see the sketch after this list)
- Experiment with a stripped-down network by removing some hidden layers
- Increase the monitoring frequency, e.g. every $n$ mini-batches rather than every $n$ epochs
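
A minimal sketch of the first and third points, assuming nothing beyond numpy; the gradient step itself is elided:

```python
import numpy as np

X = np.random.default_rng(0).random((50_000, 784))  # stand-in training set

# 1) Start with a small subset: feedback in seconds rather than minutes.
subset = X[:1000]

# 2) Monitor every `monitor_every` mini-batches instead of every epoch.
batch_size, monitor_every = 10, 50
for i, start in enumerate(range(0, len(subset), batch_size)):
    batch = subset[start:start + batch_size]
    # ... one gradient step on `batch` would go here ...
    if i % monitor_every == 0:
        print(f"batch {i}: evaluate on a (small) validation set here")
```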
Number of training epochs
- Use early stopping: if validation accuracy showed no improvement over the last 10 epochs, stop (or try different hyperparameters); see the sketch below
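
A minimal sketch of the no-improvement-in-10-epochs rule; `train_one_epoch` and `validation_accuracy` are hypothetical stand-ins for real training and evaluation code:

```python
import random

def train_with_early_stopping(train_one_epoch, validation_accuracy,
                              patience=10, max_epochs=1000):
    best_acc, epochs_since_best = 0.0, 0
    for _ in range(max_epochs):
        train_one_epoch()
        acc = validation_accuracy()
        if acc > best_acc:
            best_acc, epochs_since_best = acc, 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:
            break  # no improvement in `patience` epochs: stop
    return best_acc

# Demo with dummy stand-ins:
print(train_with_early_stopping(lambda: None, lambda: random.random()))
```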
Learning rate ($\eta$)
- Start by finding the threshold value of $\eta$ at which the training cost immediately begins decreasing, instead of oscillating or increasing. A decent starting guess is $\eta = 0.01$. If the cost starts decreasing right away, keep multiplying $\eta$ by 10 until it doesn't; if it doesn't decrease right away, keep dividing by 10 until it does. This gives an order-of-magnitude estimate of the threshold value of $\eta$ (see the sketch after this list)
- A quick working value for $\eta$ is half the threshold value
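
A sketch of the threshold search; `cost_decreases(eta)` is a hypothetical stand-in that trains for a few mini-batches at `eta` and reports whether the training cost went down:

```python
def find_eta_threshold(cost_decreases, eta=0.01):
    """Order-of-magnitude search for the largest eta at which the
    training cost still starts decreasing immediately."""
    if cost_decreases(eta):
        while cost_decreases(eta * 10):  # push up until it stops working
            eta *= 10
    else:
        while not cost_decreases(eta):   # back off until it works
            eta /= 10
    return eta

# Demo with a dummy probe that "works" below eta = 1.0:
threshold = find_eta_threshold(lambda eta: eta < 1.0)
print(threshold, "-> quick working value:", threshold / 2)
```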
Learning rate schedule
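- A common approach: hold $\eta$ constant until validation accuracy stops improving, then divide it by some factor (e.g. 2 or 10), and stop once $\eta$ has fallen to a small fraction (e.g. 1/128) of its initial value

A sketch of that step-decay schedule (the rule above is standard practice, not something these notes pinned down); `train_one_epoch` and `validation_accuracy` are hypothetical stand-ins:

```python
import random

def train_with_lr_schedule(train_one_epoch, validation_accuracy,
                           eta=0.1, patience=10, factor=2.0,
                           min_eta_fraction=1 / 128):
    initial_eta, best_acc, stalled = eta, 0.0, 0
    while eta >= initial_eta * min_eta_fraction:
        train_one_epoch(eta)
        acc = validation_accuracy()
        if acc > best_acc:
            best_acc, stalled = acc, 0
        else:
            stalled += 1
        if stalled >= patience:
            eta, stalled = eta / factor, 0  # decay once progress stalls
    return best_acc

# Demo with dummy stand-ins:
print(train_with_lr_schedule(lambda eta: None, lambda: random.random()))
```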
Regularization parameter ($\lambda$)
- Start with $\lambda = 0$ until $\eta$ is set, then set $\lambda = 1.0$. Next use the validation set to find a decent value, scaling up and down by factors of 10. Then fine-tune $\lambda$, after which return to fine-tuning $\eta$ (see the sketch below)
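
A coarse order-of-magnitude scan for $\lambda$; `validation_accuracy(lam)` is a hypothetical stand-in that trains with regularization strength `lam` and returns accuracy on the validation set:

```python
import random

def tune_lambda(validation_accuracy,
                grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
    # Coarse scan by factors of 10; fine-tune around the winner afterwards.
    return max(grid, key=validation_accuracy)

# Demo with a dummy stand-in:
print(tune_lambda(lambda lam: random.random()))
```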
Mini-batch size
- Relatively independent of other hyperparameters
- Mainly influences training time, rather than the final results
- Find decent values for the other hyperparameters first, then try a few batch sizes, scaling $\eta$ up and down with the batch size. Choose whichever size gives the best $\frac{\text{accuracy}}{\text{clock time}}$ (see the sketch below)
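
A sketch of comparing batch sizes by accuracy per unit of wall-clock time; `train_and_validate(batch_size)` is a hypothetical stand-in that trains briefly (with $\eta$ scaled to the batch size) and returns validation accuracy:

```python
import random
import time

def best_batch_size(train_and_validate, candidates=(10, 32, 100, 256)):
    scores = {}
    for bs in candidates:
        start = time.perf_counter()
        acc = train_and_validate(bs)
        elapsed = time.perf_counter() - start
        scores[bs] = acc / elapsed  # accuracy per second of training
    return max(scores, key=scores.get)

# Demo with a dummy stand-in:
print(best_batch_size(lambda bs: random.random()))
```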
Automated hyperparameter search
- Should read up on this? Or just use whatever fast.ai gives?
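
In the meantime, a minimal random-search sketch over $\eta$ and $\lambda$ on log scales; `validation_accuracy(eta, lam)` is a hypothetical stand-in that trains a model and scores it on the validation set:

```python
import random

def random_search(validation_accuracy, trials=20):
    best_params, best_acc = None, -1.0
    for _ in range(trials):
        eta = 10 ** random.uniform(-4, 0)  # 1e-4 .. 1
        lam = 10 ** random.uniform(-3, 2)  # 1e-3 .. 1e2
        acc = validation_accuracy(eta, lam)
        if acc > best_acc:
            best_params, best_acc = (eta, lam), acc
    return best_params, best_acc

# Demo with a dummy stand-in:
print(random_search(lambda eta, lam: random.random()))
```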