1. Introduction
2. History and Overview of Artificial Neural Networks
3. Single neural network
- 3.1 Perceptron
- 3.1.1 The Unit Step function
- 3.1.2 The Perceptron rules
- 3.1.3 The bias term
- 3.1.4 Implement Perceptron in Python
- 3.2 Adaptive Linear Neurons
- 3.2.1 Gradient Descent rule (Delta rule)
- 3.2.2 Learning rate in Gradient Descent
- 3.2.3 Implement Adaline in Python to classify Iris data
- 3.2.4 Learning via types of Gradient Descent
- 3.3 Problems with Perceptron (AI Winter)
4. Multi-layer Neural Network
- 4.1 Overview of Multi-layer Neural Network
- 4.2 Forward Propagation
- 4.3 Cost function
- 4.4 Backpropagation
- 4.5 Implement simple Multi-layer Neural Network to solve the problem of Perceptron
- 4.6 Some optional techniques for Multi-layer Neural Network Optimization
- 4.6.1 Adaptive Learning Rate (Annealing)
- 4.6.2 Momentum Terms
- 4.6.3 Regularization
- 4.7 Multi-layer Neural Network for binary/multi classification
5. Multi-layer Neural Network on the MNIST dataset
- 5.1 Overview of MNIST data
- 5.2 Implement Multi-layer Neural Network
- 5.3 Debugging Neural Network with Gradient Descent Checking
7. References
Adaptive Learning Rate (Annealing)
In practice, on complex problems, we often do not use a fixed learning rate, especially when training a Neural Network with Mini-batch Gradient Descent: the error surface is very noisy because we use only a fraction of the whole dataset for each update. With a fixed learning rate, our model keeps too much "energy" while we try to "climb down the hill" on the error surface, so it can overshoot the optimum. This is where the annealing learning rate (or learning rate decay) comes to help us. So, what is an annealing learning rate? It is a technique that lets us decrease the learning rate \(\eta\) after every epoch, meaning that \(\eta^{(i)} > \eta^{(i + 1)}\), where \(i\) is the \(i^{th}\) epoch. Let's take a look at the common schemes (a short Python sketch of them follows the list):
- Decrease the learning rate after every \(m\) epochs, where \(m\) is a pre-defined number of epochs. E.g., after every 20 epochs, we decrease the learning rate once by a pre-defined value, such as half or 0.03, etc.
- Decrease the learning rate in every epoch:
  - Using Exponential Decay with the formula \(\eta^{(i)} = \eta^{(0)} e^{-ci}\)
  - Using Fraction Decay with the formula \(\eta^{(i)} = \frac{\eta^{(0)}}{1 + ci}\)
  - Where \(c\) is a constant value and \(i\) is the \(i^{th}\) epoch
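Here is a minimal sketch of these three schedules in Python. The function names, the initial rate \(\eta^{(0)} = 0.1\), the decay constant \(c = 0.01\), and the step-decay settings are illustrative assumptions, not values from this post:

```python
import numpy as np

# A sketch of the three annealing schedules above; all constants are assumed.

def step_decay(eta0, epoch, m=20, drop=0.5):
    # Decrease the learning rate once every m epochs (here: halve it).
    return eta0 * drop ** (epoch // m)

def exponential_decay(eta0, epoch, c=0.01):
    # Exponential Decay: eta^(i) = eta^(0) * e^(-c * i)
    return eta0 * np.exp(-c * epoch)

def fraction_decay(eta0, epoch, c=0.01):
    # Fraction Decay: eta^(i) = eta^(0) / (1 + c * i)
    return eta0 / (1.0 + c * epoch)

# Compare the schedules over a few epochs.
eta0 = 0.1
for epoch in (0, 20, 40, 60, 80, 100):
    print(epoch,
          round(step_decay(eta0, epoch), 5),
          round(exponential_decay(eta0, epoch), 5),
          round(fraction_decay(eta0, epoch), 5))
```

In a training loop, you would recompute \(\eta\) like this at the start of each epoch, before performing that epoch's weight updates.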