TrisZaska's Machine Learning Blog

Understanding Momentum Terms in Neural Networks

1. Introduction
2. History and Overview of Artificial Neural Networks
3. Single neural network
4. Multi-layer neural network
5. Installing and using a Multi-layer Neural Network to classify MNIST data
6. Summary
7. References

Momentum Terms

Momentum (α) is very useful when we apply a Multi-layer Neural Network to real-world problems, where training can get trapped in local minima, can be slow, or, in the case of Stochastic Gradient Descent, where the gradient can oscillate too much; in those situations Momentum comes to save us. In fact, the idea of Momentum comes from physics: a moving particle does not stop instantly, because of its inertia.
So, let's define the Momentum term in the case of a Neural Network. It is an extra value, ranging from 0.0 to 1.0, that determines how much of the previous update ΔW(t−1) contributes to the current update ΔW(t), with the formula,
ΔW(t) = ΔW(t) + α ΔW(t−1)
Where,
  • α is the Momentum value with 0 < α < 1
  • ΔW(t) is the current weight update
  • ΔW(t−1) is the previous weight update
  • t is the current time step and t−1 is the previous one
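To make the update rule concrete, here is a minimal sketch in Python/NumPy. The function name momentum_update, the plain step ΔW(t) = −η · gradient (following section 4.3), and the default values of η and α are illustrative assumptions, not code from this series:

    import numpy as np

    def momentum_update(gradient, prev_delta_w, eta=0.1, alpha=0.9):
        # Plain gradient-descent step: Delta W(t) = -eta * dE/dW (as in section 4.3)
        delta_w = -eta * gradient
        # Add the momentum term: Delta W(t) = Delta W(t) + alpha * Delta W(t-1)
        delta_w = delta_w + alpha * prev_delta_w
        return delta_w

    # Example: the previous update pushes the current step further in the same direction
    prev_delta_w = np.array([0.05, -0.02])
    gradient = np.array([-0.5, 0.2])
    print(momentum_update(gradient, prev_delta_w))   # [0.095, -0.038], larger than -eta * gradient alone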
So, what is the meaning of the above formula? We already know the term ΔW from section 4.3. The idea is that when gradient descent keeps stepping in the same direction, the Momentum term accumulates and produces a larger step in that direction. Therefore, it can help to escape local minima, and with larger steps it also converges much faster, so be careful: if both the learning rate η and the momentum α are large, the combined step size can become very big. Here you may wonder: the learning rate η can also enlarge the steps, so why do we need an extra Momentum α?
When training a Neural Network on a large real-world dataset, Batch Gradient Descent is impractical, so we use Stochastic Gradient Descent instead. But with Stochastic Gradient Descent the gradient oscillates a lot, especially early in training, and that is where Momentum α comes in: it smooths out the variation in a way the learning rate η cannot, or can only do very slowly.
But why can α do that? Imagine a fixed learning rate η: when gradient descent moves through a long, narrow "valley" of the error surface, the gradient becomes vanishingly small, it is hard to make progress toward the global minimum, and the weights may get trapped there. When the gradient vanishes, the learning rate η is already doing the best it can, so we need something to give the update some extra "energy" to continue, as if to say "Don't give up, let's go over!!!". Momentum is exactly that: a proportion of the previous step size carried over to support the current step, so every step gets extra "energy" and is more likely to push past the local minimum.
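As a rough illustration of this "valley" effect, here is a toy sketch that runs gradient descent with and without a momentum term on a simple, ill-conditioned quadratic error surface. The surface E(w) = 0.5·w₀² + 50·w₁², the starting point, and the values of η and α are all made up for illustration and are not from the original post:

    import numpy as np

    def grad(w):
        # Gradient of a long, narrow "valley": E(w) = 0.5*w[0]**2 + 50*w[1]**2
        return np.array([w[0], 100.0 * w[1]])

    def descend(alpha, eta=0.015, steps=200):
        w = np.array([8.0, 1.0])        # start far out along the flat direction
        delta_w = np.zeros_like(w)      # previous update, Delta W(t-1)
        for _ in range(steps):
            delta_w = -eta * grad(w) + alpha * delta_w   # momentum update rule
            w = w + delta_w
        return w

    print("without momentum:", descend(alpha=0.0))   # slow progress along the flat direction
    print("with momentum   :", descend(alpha=0.9))   # essentially reaches the minimum at (0, 0)

With α = 0.0 the update is ordinary gradient descent and, for the same η, it crawls along the flat axis of the valley; with α = 0.9 the accumulated "energy" from previous steps carries the weights to the minimum far sooner.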
[Image Credit: Alec Radford]
There are many optimization methods out there, but in this post we only compare with Momentum (green ball) and without Momentum (red ball). As you can see, the red ball gets trapped in the long, narrow "valley" (a local minimum).
