1. Introduction
2. History and Overview about Artificial Neural Network
3. Single neural network
- 3.1 Perceptron
- 3.1.1 The Unit Step function
- 3.1.2 The Perceptron rules
- 3.1.3 The bias term
- 3.1.4 Implement Perceptron in Python
- 3.2 Adaptive Linear Neurons
- 3.2.1 Gradient Descent rule (Delta rule)
- 3.2.2 Learning rate in Gradient Descent
- 3.2.3 Implement Adaline in Python to classify Iris data
- 3.2.4 Learning via types of Gradient Descent
- 3.3 Problems with Perceptron (AI Winter)
4. Multi-layer Neural Network
- 4.1 Overview about Multi-layer Neural Network
- 4.2 Forward Propagation
- 4.3 Cost function
- 4.4 Backpropagation
- 4.5 Implement simple Multi-layer Neural Network to solve the problem of Perceptron
- 4.6 Some optional techniques for Multi-layer Neural Network Optimization
- 4.7 Multi-layer Neural Network for binary/multi classification
- 5.1 Overview about MNIST data
- 5.2 Implement Multi-layer Neural Network
- 5.3 Debugging Neural Network with Gradient Descent Checking
- 5.3.1 Theory
- 5.3.2 Implement in Python
7. References
Gradient Descent Checking
In this last section, we're going to learn one more technique, so-called Gradient Descent Checking (more commonly known simply as gradient checking). So, what is it, and what is the idea behind it? First, let's come back to our network: the "heart" of a Neural Network is Backpropagation, which we spent a lot of time on. In a network with many neurons, Backpropagation has to compute a great many gradients, and that makes it very difficult to tell whether our implementation works correctly or not. What we want is a way to verify, weight by weight, that the gradient (the instantaneous rate of change of the cost with respect to that weight) produced by Backpropagation is right. To do that, we'll use a definition from math called the secant line: a line that intersects a curve at two nearby points. Let's take a look,
Both the secant line and the tangent line describe a slope; the difference is that the secant line passes through two points, while the tangent line (the gradient) is the slope at a single point. So why can we use the secant line to decide whether the gradient is computed correctly? The idea is that as the top point approaches the bottom point, the slope of the secant line approaches the slope of the tangent line. There is an animation on Wikipedia that may be useful for your visualization,
Alright, applying this to a Neural Network makes sense, because the two points above correspond to nudging each weight up and down by a small value called epsilon \(\epsilon\),
We can now compare the analytical gradient (the tangent line, from Backpropagation) with the numerical gradient (the secant line) using the formula,
\(\frac{\partial}{\partial w^{(l)}_{j,i}}J(W) \approx \frac{J(w^{(l)}_{j,i} + \epsilon) - J(w^{(l)}_{j,i})}{\epsilon}\)
But the above one-sided formula is considered to be less accurate; instead, we evaluate the cost at two symmetric points around the weight, which gives a better approximation,
\(\frac{\partial}{\partial w^{(l)}_{j,i}}J(W) \approx \frac{J(w^{(l)}_{j,i} + \epsilon) - J(w^{(l)}_{j,i} - \epsilon)}{2\epsilon}\)
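To see the difference between the two formulas in practice, here is a quick sanity check on a toy function \(f(w) = w^3\) (standing in for the cost \(J\), with \(w\) playing the role of a single weight), whose exact derivative \(3w^2\) we can compare both approximations against,

```python
# Toy sanity check: f stands in for the cost J and w for a single weight,
# so we can compare both formulas against the exact derivative.
def f(w):
    return w ** 3

def f_prime(w):
    return 3 * w ** 2                                 # exact derivative of w^3

w, eps = 1.5, 1e-4
one_sided = (f(w + eps) - f(w)) / eps                 # first formula
symmetric = (f(w + eps) - f(w - eps)) / (2 * eps)     # second (centered) formula

print(abs(one_sided - f_prime(w)))                    # error around 4.5e-4
print(abs(symmetric - f_prime(w)))                    # error around 1e-8
```

The one-sided error shrinks only linearly with \(\epsilon\), while the symmetric error shrinks quadratically, which is why the second formula is preferred.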
We often choose \(\epsilon\) to be small, say \(0.0001\); if you choose \(\epsilon\) too small, it can lead to numerical (floating-point) problems on the computer. We then want to compare the analytical gradient from the Neural Network with the numerical gradient we compute here, using a relative error that should be as small as possible, say \(E_r\),
\(E_r = \frac{\|J_n - J_a\|_F}{\|J_n\|_F + \|J_a\|_F} \hspace{1cm} (19)\)
Where,
- \(\|...\|_F\) is the Frobenius norm
- \(J_n\) is the numerical gradient
- \(J_a\) is the analytical gradient

As a rule of thumb,
- \(E_r\) less than \(10^{-7}\): everything is okay.
- \(E_r\) between \(10^{-7}\) and \(10^{-2}\): there may be a problem.
- \(E_r\) greater than \(10^{-2}\): your analytical gradient is almost certainly wrong.
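Putting everything together, below is a minimal sketch of the whole check in Python. It is not the network we built in the earlier sections; it assumes a tiny one-hidden-layer network with sigmoid activations, a squared-error cost, no bias terms, and small made-up shapes, purely to show how the centered difference and the relative error \(E_r\) from equation (19) fit together,

```python
import numpy as np

# Minimal gradient-checking sketch on a hypothetical one-hidden-layer network
# (sigmoid activations, squared-error cost, no bias terms, made-up shapes).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, W1, W2):
    """Forward pass: J(W) = 0.5 * sum((a2 - y)^2)."""
    a1 = sigmoid(X @ W1.T)                  # hidden activations, shape (m, h)
    a2 = sigmoid(a1 @ W2.T)                 # output activations, shape (m, 1)
    return 0.5 * np.sum((a2 - y) ** 2)

def backprop(X, y, W1, W2):
    """Analytical gradients of the cost with respect to W1 and W2."""
    a1 = sigmoid(X @ W1.T)
    a2 = sigmoid(a1 @ W2.T)
    delta2 = (a2 - y) * a2 * (1 - a2)       # error term at the output layer
    delta1 = (delta2 @ W2) * a1 * (1 - a1)  # error term at the hidden layer
    return delta1.T @ X, delta2.T @ a1      # grad_W1, grad_W2

def numerical_gradient(X, y, W1, W2, which, eps=1e-4):
    """Centered-difference approximation, perturbing one weight at a time."""
    W = (W1, W2)[which]
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W[idx] += eps
        J_plus = cost(X, y, W1, W2)
        W[idx] -= 2 * eps
        J_minus = cost(X, y, W1, W2)
        W[idx] += eps                       # restore the original weight
        grad[idx] = (J_plus - J_minus) / (2 * eps)
    return grad

def relative_error(J_n, J_a):
    """Equation (19): Frobenius-norm relative error."""
    return np.linalg.norm(J_n - J_a) / (np.linalg.norm(J_n) + np.linalg.norm(J_a))

rng = np.random.RandomState(0)
X = rng.randn(5, 3)                                   # 5 samples, 3 features
y = rng.randint(0, 2, size=(5, 1)).astype(float)      # binary targets
W1 = rng.randn(4, 3) * 0.1                            # hidden layer: 4 units
W2 = rng.randn(1, 4) * 0.1                            # output layer: 1 unit

grad_W1, grad_W2 = backprop(X, y, W1, W2)
for which, analytical in ((0, grad_W1), (1, grad_W2)):
    numerical = numerical_gradient(X, y, W1, W2, which)
    print(relative_error(numerical, analytical))      # expect values well below 1e-7
```

In practice you only run a check like this on a small network with a handful of samples and then turn it off: the numerical gradient needs two full forward passes for every single weight, which is far too slow to leave on during real training.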