1. Introduction
2. History and Overview about Artificial Neural Network
3. Single neural network
- 3.1 Perceptron
- 3.1.1 The Unit Step function
- 3.1.2 The Perceptron rules
- 3.1.3 The bias term
- 3.1.4 Implement Perceptron in Python
- 3.2 Adaptive Linear Neurons
- 3.2.1 Gradient Descent rule (Delta rule)
- 3.2.2 Learning rate in Gradient Descent
- 3.2.3 Implement Adaline in Python to classify Iris data
- 3.2.4 Learning via types of Gradient Descent
- 3.3 Problems with Perceptron (AI Winter)
- 4.1 Overview about Multi-layer Neural Network
- 4.2 Forward Propagation
- 4.3 Cost function
- 4.4 Backpropagation
- 4.5 Implement simple Multi-layer Neural Network to solve the problem of Perceptron
- 4.6 Some optional techniques for Multi-layer Neural Network Optimization
- 4.7 Multi-layer Neural Network for binary/multi classification
- 5.1 Overview about MNIST data
- 5.2 Implement Multi-layer Neural Network
- 5.3 Debugging Neural Network with Gradient Descent Checking
7. References
Gradient Descent rule (Delta rule)
Before going into detail about Gradient Descent, look at the image above: we can immediately see that Adaline uses a linear function as its activation function, and the Unit step function is only applied to quantize the final output (-1 or 1). Therefore we can calculate the error based on the linear activation (a real continuous value) instead of the Unit step function (a binary value) as in the Perceptron. Why do that? Because the linear activation produces real continuous values, the cost function becomes differentiable, which allows optimization algorithms such as Gradient Descent to run on it. Look at the image below to get an intuition of how Gradient Descent minimizes the cost function. Here the cost is the Sum of Squared Errors (SSE), a quadratic and convex function with the formula,
\( J(w) = \frac{1}{2} \sum_i(y^{(i)} - \phi(z^{(i)}))^2 \hspace{1cm}(3)\)
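To make equation \((3)\) concrete, here is a minimal NumPy sketch of Adaline's net input, linear activation, and SSE cost. The names `X`, `y`, `w`, and `b` are my own placeholders for illustration, not code taken from this series:

```python
import numpy as np

def net_input(X, w, b):
    """Net input z = w^T x + b for every sample in X (bias could equally be folded into w as w_0 with x_0 = 1)."""
    return np.dot(X, w) + b

def activation(z):
    """Adaline's activation is simply the identity (linear) function."""
    return z

def sse_cost(X, y, w, b):
    """Sum of Squared Errors: J(w) = 1/2 * sum_i (y_i - phi(z_i))^2, as in equation (3)."""
    errors = y - activation(net_input(X, w, b))
    return 0.5 * np.sum(errors ** 2)
```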
In equation \((3)\) we include an extra factor of \(\frac{1}{2}\); it is just for convenience when we calculate the partial derivative, as you'll see later. According to Wikipedia, Gradient Descent is also known as steepest descent, or the method of steepest descent. Intuitively, we can imagine Gradient Descent as "climbing down the hill". Mathematically, we take a step in the direction opposite to the gradient in order to reach a local or global minimum, and in each epoch we can update the weight \(w_j\) of the \(j^{th}\) input with the formula,
\(w_j = w_j + \Delta w_j, \hspace{1cm} \Delta w_j = -\eta\frac{\partial J}{\partial w_j}\)
where the partial derivative of the cost function with respect to \(w_j\) is,
\(\frac{\partial J}{\partial w_j} = \frac{\partial}{\partial w_j} \frac{1}{2}\sum_i(y^{(i)} - \phi(z^{(i)}))^2\)
\(= \frac{1}{2}\frac{\partial}{\partial w_j} \sum_i(y^{(i)} - \phi(z^{(i)}))^2\)
\(= \frac{1}{2}2\sum_i(y^{(i)} - \phi(z^{(i)}))\frac{\partial}{\partial w_j}\left(y^{(i)} - \phi(z^{(i)})\right) \hspace{1cm}(7)\)
\(= \sum_i(y^{(i)} - \phi(z^{(i)}))\frac{\partial}{\partial w_j}\left(y^{(i)} - \sum_k w_k x^{(i)}_k\right)\)
\(= \sum_i(y^{(i)} - \phi(z^{(i)}))(0 - x^{(i)}_j)\)
\(= -\sum_i(y^{(i)} - \phi(z^{(i)}))x^{(i)}_j \Leftrightarrow (5)\)
In equation \((7)\), the factor of \(2\) cancels with the \(\frac{1}{2}\); that's the reason we added the extra \(\frac{1}{2}\) term to the cost function in equation \((3)\) for convenience. Okay, we did it; before moving on to another interesting topic, the sketch below shows how this update rule translates into code.
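The following is a rough NumPy illustration of one epoch of batch Gradient Descent for Adaline, using the result we just derived, \(\Delta w_j = \eta\sum_i(y^{(i)} - \phi(z^{(i)}))x^{(i)}_j\). It is a sketch under my own assumptions (a feature matrix `X` of shape `(n_samples, n_features)`, targets `y` of -1/1, and a learning rate `eta`), not the exact code of section 3.2.3:

```python
import numpy as np

def gradient_descent_epoch(X, y, w, b, eta=0.01):
    """One batch Gradient Descent (delta rule) step for Adaline.

    X: (n_samples, n_features) feature matrix
    y: (n_samples,) target values (-1 or 1)
    w: (n_features,) weights, b: scalar bias, eta: learning rate
    """
    output = np.dot(X, w) + b          # linear activation: phi(z) = z
    errors = y - output                # (y^(i) - phi(z^(i))) for every sample
    w = w + eta * X.T.dot(errors)      # Delta w_j = eta * sum_i errors_i * x_ij
    b = b + eta * errors.sum()         # bias update (treat it as w_0 with x_0 = 1)
    cost = 0.5 * (errors ** 2).sum()   # SSE cost from equation (3)
    return w, b, cost
```

Running this update for a fixed number of epochs while tracking the returned cost is roughly what an Adaline implementation like the one in section 3.2.3 boils down to.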