TrisZaska's Machine Learning Blog

Understanding Adaptive Learning Rate in Neural Network


Adaptive Learning Rate

In practice, for complex problems we often do not use a fixed learning rate, especially when training a Neural Network with Mini-batch Gradient Descent: because every update uses only a fraction of the whole dataset, the error surface is very noisy. With a fixed learning rate, our model keeps too much "energy" while it tries to "climb down the hill" on the error surface, so it can overshoot the optimum. That is why the Annealing learning rate (or learning rate decay) comes to help us.
So, what is an Annealing learning rate? It is a technique that lets us decrease the learning rate \(\eta\) after every epoch, meaning that \(\eta^{(i)} > \eta^{(i + 1)}\), where \(i\) is the \(i^{th}\) epoch. Let's take a look.
With the same idea but different representations, there are two common types of Annealing (a small code sketch follows the list):
  • Decrease the learning rate after every \(m\) epochs, where \(m\) is a pre-defined number of epochs.
    • E.g.: after every 20 epochs, we decrease the learning rate once by a pre-defined amount, such as halving it or subtracting 0.03.
  • Decrease the learning rate in every epoch:
    • Using Exponential Decay with the formula
      \(\eta^{(i)} = \eta^{(0)} e^{-ci}\)
      where,
      • \(c\) is a constant value
      • \(i\) is the \(i^{th}\) epoch
    • Using Fraction Decay with the formula
      \(\eta^{(i)} = \frac{\eta^{(0)}}{1 + ci}\)
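
To make these schedules concrete, here is a minimal Python sketch of the three decay strategies above (step decay after every \(m\) epochs, Exponential Decay, and Fraction Decay). The function names and the values of drop, every and c are only illustrative assumptions, not code taken from this blog's Neural Network implementation:

    import math

    def step_decay(eta0, epoch, drop=0.5, every=20):
        # Multiply the learning rate by `drop` (here: halve it) once after every `every` epochs
        return eta0 * (drop ** (epoch // every))

    def exponential_decay(eta0, epoch, c=0.01):
        # eta^(i) = eta^(0) * e^(-c * i)
        return eta0 * math.exp(-c * epoch)

    def fraction_decay(eta0, epoch, c=0.01):
        # eta^(i) = eta^(0) / (1 + c * i)
        return eta0 / (1.0 + c * epoch)

    eta0 = 0.1
    for epoch in range(0, 101, 20):
        print(epoch,
              round(step_decay(eta0, epoch), 5),
              round(exponential_decay(eta0, epoch), 5),
              round(fraction_decay(eta0, epoch), 5))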
Looking at the formulas, with a pre-defined constant value \(c\), as \(i\) increases, \(\eta\) decreases gradually. But how do we choose the constant \(c\)? This is tricky and there is no single good answer, but the idea is: if we choose a very small \(c\), \(\eta\) stays almost unchanged and Annealing has no real impact; conversely, if we choose a very large \(c\), \(\eta\) decreases very quickly, possibly all the way to zero, and then our model cannot learn any more even though it has not reached the optimum yet. It is like your motorcycle running out of gas in the middle of the road before you reach the place you want to go.
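
As a rough illustration of that trade-off, the short snippet below compares a very small and a very large \(c\) in the Exponential Decay formula (the constants 0.0001 and 2.0 are just assumed values for the sake of the example):

    import math

    eta0 = 0.1
    for c in (0.0001, 2.0):   # very small vs. very large constant
        etas = [round(eta0 * math.exp(-c * i), 6) for i in range(0, 101, 25)]
        print("c =", c, etas)
    # With c = 0.0001 the learning rate barely moves, so Annealing has no real effect;
    # with c = 2.0 it collapses toward zero after a few epochs and the model stops learning.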
