
The math behind Gradient Descent rule

1. Introduction
2. History and Overview of Artificial Neural Networks
3. Single neural network
4. Multi-layer neural network
5. Installing and using a Multi-layer Neural Network to classify MNIST data
6. Summary
7. References

Gradient Descent rule (Delta rule)

Before we go into detail about Gradient Descent, look at the image above: we immediately see that Adaline uses a linear function as its activation function, while the unit step function is only applied to produce the final output (-1 or 1). Therefore we can calculate the error based on the linear function (a real, continuous value) instead of the unit step function (a binary value) as in the Perceptron. Why do it this way? Because the linear function produces real, continuous values, the cost function becomes differentiable, which allows optimization algorithms such as Gradient Descent to run on it. Look at the image below to get an intuition for how Gradient Descent can minimize the cost function,
As the image above shows, the Sum of Squared Errors (SSE) is a quadratic, convex function with the formula,
\( J(w) = \frac{1}{2} \sum_i(y^{(i)} - \phi(z^{(i)}))^2 \hspace{1cm}(3)\)
In equation \((3)\) we include an extra factor of \(\frac{1}{2}\); it simply makes the partial derivative more convenient to compute, as you'll see later.
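To make equation \((3)\) concrete, here is a minimal NumPy sketch of the SSE cost. The names `X` (data matrix), `y` (target vector), and `w` (weight vector) are hypothetical, the bias term is omitted for brevity, and Adaline's linear activation \(\phi(z) = z\) is assumed:

```python
import numpy as np

def sse_cost(w, X, y):
    """Sum of Squared Errors J(w) from equation (3)."""
    z = X.dot(w)       # net input z^(i) for every sample
    phi = z            # Adaline: linear activation, phi(z) = z
    errors = y - phi   # y^(i) - phi(z^(i))
    return 0.5 * np.sum(errors ** 2)
```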
According to Wikipedia, Gradient Descent is also known as steepest descent, or the method of steepest descent. Intuitively, we can picture Gradient Descent as "climbing down the hill": at each step we move in the direction opposite to the gradient in order to reach a local or global minimum. So, in each epoch \(i\), we update the \(j^{th}\) weight of the neuron with the formula,
\(w^{(i)}_j = w^{(i)}_j + \Delta w^{(i)}_j\)
where,
\(\Delta w^{(i)}_j = -\eta\frac{\partial J}{\partial w^{(i)}_j} \hspace{1cm}(4)\)
with,
\(\frac{\partial J}{\partial w^{(i)}_j} = -\sum_i(y^{(i)} - \phi(z^{(i)}))x^{(i)}_j \hspace{1cm}(5)\)
Combining equations \((4)\) and \((5)\), the minus signs cancel and we obtain the final equation (the Delta rule),
\(\Delta w^{(i)}_j = \eta \sum_i(y^{(i)} - \phi(z^{(i)}))x^{(i)}_j\hspace{1cm}(6)\)
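Here is a minimal sketch of a single batch update implementing equation \((6)\); the names `X`, `y`, `w`, and `eta` are illustrative assumptions, not the exact implementation we'll use later for MNIST:

```python
import numpy as np

def delta_rule_step(w, X, y, eta=0.01):
    """One batch Gradient Descent update of the Delta rule, equation (6)."""
    errors = y - X.dot(w)            # y^(i) - phi(z^(i)) for all samples
    delta_w = eta * X.T.dot(errors)  # eta * sum_i error^(i) * x_j^(i), for every j
    return w + delta_w               # w_j = w_j + Delta w_j
```

Repeating this step over many epochs moves \(w\) down the convex SSE surface toward its minimum.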
Why do we have equation \((5)\)? If you are familiar with calculus and partial derivatives, it's easy to derive: the left side of equation \((5)\) reads as "the partial derivative of \(J\) with respect to \(w^{(i)}_j\)". We illustrate the derivation here,
\(\frac{\partial J}{\partial w^{(i)}_j} = \frac{\partial}{\partial w^{(i)}_j} \frac{1}{2}\sum_i(y^{(i)} - \phi(z^{(i)}))^2\)

\(= \frac{1}{2}\frac{\partial}{\partial w^{(i)}_j} \sum_i(y^{(i)} - \phi(z^{(i)}))^2\)

\(= \frac{1}{2}2\sum_i(y^{(i)} - \phi(z^{(i)}))\frac{\partial}{\partial w^{(i)}_j}(y^{(i)} - \phi(z^{(i)})) \hspace{1cm}(7)\)

\(= \sum_i(y^{(i)} - \phi(z^{(i)}))\frac{\partial}{\partial w^{(i)}_j}(y^{(i)} - w^{(i)}_jx^{(i)}_j)\)

\(= \sum_i(y^{(i)} - \phi(z^{(i)}))(0 - 1x^{(i)}_j)\)

\(= \sum_i(y^{(i)} - \phi(z^{(i)}))(-x^{(i)}_j)\)

\(= -\sum_i(y^{(i)} - \phi(z^{(i)}))x^{(i)}_j \Leftrightarrow (5)\)
In equation \((7)\), the factor of \(2\) from the power rule cancels with the \(\frac{1}{2}\); that's the reason we added the extra factor of \(\frac{1}{2}\) to the cost function in equation \((3)\) for convenience. Okay, we did it, let's move on to another interesting topic.
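If you'd rather not redo the calculus, you can convince yourself that equation \((5)\) is correct with a quick numerical check: compare the analytic gradient against a finite-difference approximation of \(J\). The sketch below uses hypothetical toy data purely for that purpose:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # toy inputs: 5 samples, 3 features
y = rng.choice([-1.0, 1.0], size=5)  # toy targets
w = rng.normal(size=3)               # toy weights

def cost(w):
    return 0.5 * np.sum((y - X.dot(w)) ** 2)   # equation (3)

analytic = -X.T.dot(y - X.dot(w))              # equation (5)

# central finite differences, one weight at a time
eps = 1e-6
numeric = np.array([
    (cost(w + eps * e) - cost(w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(analytic, numeric))  # should print True
```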
