1. Introduction
2. History and Overview about Artificial Neural Network
3. Single neural network
- 3.1 Perceptron
- 3.1.1 The Unit Step function
- 3.1.2 The Perceptron rules
- 3.1.3 The bias term
- 3.1.4 Implement Perceptron in Python
- 3.2 Adaptive Linear Neurons
- 3.2.1 Gradient Descent rule (Delta rule)
- 3.2.2 Learning rate in Gradient Descent
- 3.2.3 Implement Adaline in Python to classify Iris data
- 3.2.4 Learning via types of Gradient Descent
- 3.3 Problems with Perceptron (AI Winter)
4. Multi-layer Neural Network
- 4.1 Overview about Multi-layer Neural Network
- 4.2 Forward Propagation
- 4.3 Cost function
- 4.4 Backpropagation
- 4.5 Implement simple Multi-layer Neural Network to solve the problem of Perceptron
- 4.6 Some optional techniques for Multi-layer Neural Network Optimization
- 4.7 Multi-layer Neural Network for binary/multi classification
- 5.1 Overview about MNIST data
- 5.2 Implement Multi-layer Neural Network
- 5.3 Debugging Neural Network with Gradient Descent Checking
- 5.3.1 Theory
- 5.3.2 Implement in Python
7. References
Gradient Descent Checking
In this last section, we're going to learn one more technique, so-called Gradient Descent Checking (more commonly known simply as gradient checking). So, what is it, and what is the idea behind it? First, let's come back to our network: the "heart" of a Neural Network is Backpropagation, which we spent a lot of time on. In a network with many neurons, Backpropagation has to compute a great many gradients, and that makes it very difficult to tell whether our implementation works correctly or not. What we want is a way to verify, weight by weight, that the gradient (the instantaneous rate of change of the cost with respect to that weight) produced by Backpropagation is right. To do that, we'll use a definition from math called the secant line: a line that intersects a curve at two nearby points. Let's take a look,
Both the secant line and the tangent line describe a slope; the difference is that the secant line passes through two points, while the tangent line (the gradient) is the slope at a single point. So why can we use the secant line to decide whether the gradient is computed correctly? The idea is that as the top point approaches the bottom point, the slope of the secant line approaches the slope of the tangent line. There is an animation on Wikipedia that may be useful for your visualization,
Alright, applying this to a Neural Network makes sense, because the two points above correspond to nudging each weight up and down by a small value called epsilon \(\epsilon\),
We can now compare the analytical gradient (the tangent line, from Backpropagation) with the numerical gradient (the secant line) using the formula,
\(\frac{\partial}{\partial w^{(l)}_{j,i}}J(W) \approx \frac{J(w^{(l)}_{j,i} + \epsilon) - J(w^{(l)}_{j,i})}{\epsilon}\)
But the above one-sided formula is considered to be less accurate; instead, we evaluate the cost at two symmetric points around the weight, which gives a better approximation,
\(\frac{\partial}{\partial w^{(l)}_{j,i}}J(W) \approx \frac{J(w^{(l)}_{j,i} + \epsilon) - J(w^{(l)}_{j,i} - \epsilon)}{2\epsilon}\)
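To see the difference between the two formulas in practice, here is a quick sanity check on a toy function \(f(w) = w^3\) (standing in for the cost \(J\), with \(w\) playing the role of a single weight), whose exact derivative \(3w^2\) we can compare both approximations against,

```python
# Toy sanity check: f stands in for the cost J and w for a single weight,
# so we can compare both formulas against the exact derivative.
def f(w):
    return w ** 3

def f_prime(w):
    return 3 * w ** 2                                 # exact derivative of w^3

w, eps = 1.5, 1e-4
one_sided = (f(w + eps) - f(w)) / eps                 # first formula
symmetric = (f(w + eps) - f(w - eps)) / (2 * eps)     # second (centered) formula

print(abs(one_sided - f_prime(w)))                    # error around 4.5e-4
print(abs(symmetric - f_prime(w)))                    # error around 1e-8
```

The one-sided error shrinks only linearly with \(\epsilon\), while the symmetric error shrinks quadratically, which is why the second formula is preferred.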
We often choose \(\epsilon\) to be small, say \(0.0001\); if you choose \(\epsilon\) too small, it can lead to numerical (floating-point) problems on the computer. We then want to compare the analytical gradient from the Neural Network with the numerical gradient we compute here, using a relative error that should be as small as possible, say \(E_r\),
\(E_r = \frac{\|J_n - J_a\|_F}{\|J_n\|_F + \|J_a\|_F} \hspace{1cm} (19)\)
Where,
- \(\|...\|_F\) is the Frobenius norm
- \(J_n\) is the numerical gradient
- \(J_a\) is the analytical gradient

As a rule of thumb,
- \(E_r\) less than \(10^{-7}\): everything is okay.
- \(E_r\) between \(10^{-7}\) and \(10^{-2}\): there may be a problem.
- \(E_r\) greater than \(10^{-2}\): your analytical gradient is almost certainly wrong.
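Putting everything together, below is a minimal sketch of the whole check in Python. It is not the network we built in the earlier sections; it assumes a tiny one-hidden-layer network with sigmoid activations, a squared-error cost, no bias terms, and small made-up shapes, purely to show how the centered difference and the relative error \(E_r\) from equation (19) fit together,

```python
import numpy as np

# Minimal gradient-checking sketch on a hypothetical one-hidden-layer network
# (sigmoid activations, squared-error cost, no bias terms, made-up shapes).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, W1, W2):
    """Forward pass: J(W) = 0.5 * sum((a2 - y)^2)."""
    a1 = sigmoid(X @ W1.T)                  # hidden activations, shape (m, h)
    a2 = sigmoid(a1 @ W2.T)                 # output activations, shape (m, 1)
    return 0.5 * np.sum((a2 - y) ** 2)

def backprop(X, y, W1, W2):
    """Analytical gradients of the cost with respect to W1 and W2."""
    a1 = sigmoid(X @ W1.T)
    a2 = sigmoid(a1 @ W2.T)
    delta2 = (a2 - y) * a2 * (1 - a2)       # error term at the output layer
    delta1 = (delta2 @ W2) * a1 * (1 - a1)  # error term at the hidden layer
    return delta1.T @ X, delta2.T @ a1      # grad_W1, grad_W2

def numerical_gradient(X, y, W1, W2, which, eps=1e-4):
    """Centered-difference approximation, perturbing one weight at a time."""
    W = (W1, W2)[which]
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W[idx] += eps
        J_plus = cost(X, y, W1, W2)
        W[idx] -= 2 * eps
        J_minus = cost(X, y, W1, W2)
        W[idx] += eps                       # restore the original weight
        grad[idx] = (J_plus - J_minus) / (2 * eps)
    return grad

def relative_error(J_n, J_a):
    """Equation (19): Frobenius-norm relative error."""
    return np.linalg.norm(J_n - J_a) / (np.linalg.norm(J_n) + np.linalg.norm(J_a))

rng = np.random.RandomState(0)
X = rng.randn(5, 3)                                   # 5 samples, 3 features
y = rng.randint(0, 2, size=(5, 1)).astype(float)      # binary targets
W1 = rng.randn(4, 3) * 0.1                            # hidden layer: 4 units
W2 = rng.randn(1, 4) * 0.1                            # output layer: 1 unit

grad_W1, grad_W2 = backprop(X, y, W1, W2)
for which, analytical in ((0, grad_W1), (1, grad_W2)):
    numerical = numerical_gradient(X, y, W1, W2, which)
    print(relative_error(numerical, analytical))      # expect values well below 1e-7
```

In practice you only run a check like this on a small network with a handful of samples and then turn it off: the numerical gradient needs two full forward passes for every single weight, which is far too slow to leave on during real training.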