TrisZaska's Machine Learning Blog

Understanding the Momentum Term in Neural Networks

1. Introduction
2. History and overview of Artificial Neural Networks
3. Single neural network
4. Multi-layer neural network
5. Installing and using a Multi-layer Neural Network to classify MNIST data
6. Summary
7. References

Momentum Terms

Momentum (\(\alpha\)) is very useful when we apply a Multi-layer Neural Network to real-world problems, where we can get trapped in local minima, suffer slow training, or, in the case of Stochastic Gradient Descent, see the gradient oscillate too much; this is where Momentum comes to save us. In fact, Momentum comes from physics: a moving particle does not stop instantly, because its inertia keeps it moving.
So, let's define the Momentum term in the context of a Neural Network. It is an extra value, ranging from \(0.0\) to \(1.0\), that determines how much of the previous update \(\Delta W_{(t - 1)}\) contributes to the current update \(\Delta W_{(t)}\), with the formula,
\(\Delta W_{(t)} = \Delta W_{(t)} + \alpha\Delta W_{(t - 1)}\)
Where,
  • \(\alpha\) is the Momentum value with \(0 < \alpha < 1\)
  • \(\Delta W_{(t)}\) is the current weight update
  • \(\Delta W_{(t - 1)}\) is the previous weight update
  • \(t\) is the current time step and \(t - 1\) is the previous one
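To make the update rule concrete, here is a minimal sketch in Python/NumPy of one weight update with Momentum. It assumes, as in section 4.3, that the plain update is \(-\eta\) times the gradient; the names update_weights, grad and eta are illustrative only, not taken from this blog's code.

import numpy as np

def update_weights(W, grad, prev_delta_W, eta=0.1, alpha=0.9):
    # Plain gradient-descent step plus a fraction alpha of the previous update
    delta_W = -eta * grad + alpha * prev_delta_W
    W_new = W + delta_W
    # delta_W is returned so it can be passed back in as Delta W_(t-1) next time
    return W_new, delta_W

# Toy usage, for illustration only
W = np.zeros((3, 2))                 # small weight matrix
prev_delta_W = np.zeros_like(W)      # no previous update at t = 0
grad = np.ones_like(W)               # pretend gradient
W, prev_delta_W = update_weights(W, grad, prev_delta_W)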
So, what is the meaning of the above formula? We already know the term \(\Delta W\) from section 4.3, right? You can think of it like this: as long as gradient descent keeps moving in the same direction, the Momentum term accumulates and produces a larger step in that direction. Therefore, it can help escape a local minimum, and with larger steps it also converges much faster. Be careful, though: if both the learning rate \(\eta\) and the momentum \(\alpha\) are large, the combined step size can become very big. At this point you may wonder: the learning rate \(\eta\) can also enlarge the step, so why do we need an extra Momentum term \(\alpha\)?
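As a small worked example (the numbers are illustrative only): suppose the plain gradient step is a constant \(-0.1\) at every iteration and \(\alpha = 0.9\). Then \(\Delta W_{(1)} = -0.1\), \(\Delta W_{(2)} = -0.1 + 0.9 \times (-0.1) = -0.19\), \(\Delta W_{(3)} = -0.1 + 0.9 \times (-0.19) = -0.271\), and so on, approaching \(-0.1 / (1 - 0.9) = -1.0\). So while the gradient keeps pointing the same way, Momentum can amplify the step up to roughly \(1 / (1 - \alpha)\) times the plain step.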
When training a Neural Network on a large real-world dataset, Batch Gradient Descent is a bad choice; instead, we can use Stochastic Gradient Descent. But with Stochastic Gradient Descent the gradient oscillates a lot, especially early in training, and this is where the Momentum \(\alpha\) comes in to smooth out the variation in situations where the learning rate \(\eta\) cannot, or can only do so very slowly.
But why can \(\alpha\) do that? Imagine, with a fixed learning rate \(\eta\), the gradient passing through a long, narrow "valley" of the error surface. The gradient becomes vanishingly small there, so it is hard to cross the valley and reach the global minimum, and we may get trapped. When the gradient vanishes, the learning rate \(\eta\) has already done the best it can, so we need something that gives the gradient extra "energy" to continue, saying "Don't give up, let's go over!". Momentum is exactly that: a proportion of the previous step is added to the current step, so at every step we have extra "energy" and it becomes much easier to climb out of the local minimum.
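To see this "extra energy" effect in action, here is a small self-contained Python sketch on a toy two-dimensional error surface shaped like a long, narrow valley (the surface, the step counts and all names are invented for illustration, not taken from this blog's experiments): plain gradient descent crawls along the flat direction, while the same run with Momentum ends up much closer to the minimum at the origin.

import numpy as np

# Toy "long narrow valley": almost flat along w[0], steep along w[1]
def grad(w):
    return np.array([0.01 * w[0], 1.0 * w[1]])

def descend(alpha, eta=0.9, steps=200):
    w = np.array([10.0, 1.0])        # start far out along the flat direction
    delta_w = np.zeros_like(w)       # Delta W_(t-1), initially zero
    for _ in range(steps):
        delta_w = -eta * grad(w) + alpha * delta_w   # the Momentum update rule
        w = w + delta_w
    return w

print("without momentum:", descend(alpha=0.0))  # w[0] is still far from 0
print("with momentum   :", descend(alpha=0.9))  # both coordinates are close to 0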
[Image Credit: Alec Radford]
There are many optimization methods out there, but in this post we only compare gradient descent with Momentum (green ball) and without Momentum (red ball). As you can see, the red ball gets trapped in the long, narrow "valley" (a local minimum).
