1. Introduction
2. History and Overview about Artificial Neural Network
3. Single neural network
- 3.1 Perceptron
- 3.1.1 The Unit Step function
- 3.1.2 The Perceptron rules
- 3.1.3 The bias term
- 3.1.4 Implement Perceptron in Python
- 3.2 Adaptive Linear Neurons
- 3.2.1 Gradient Descent rule (Delta rule)
- 3.2.2 Learning rate in Gradient Descent
- 3.2.3 Implement Adaline in Python to classify Iris data
- 3.2.4 Learning via types of Gradient Descent
- 3.3 Problems with Perceptron (AI Winter)
- 4.1 Overview about Multi-layer Neural Network
- 4.2 Forward Propagation
- 4.3 Cost function
- 4.4 Backpropagation
- 4.5 Implement simple Multi-layer Neural Network to solve the problem of Perceptron
- 4.6 Some optional techniques for Multi-layer Neural Network Optimization
- 4.7 Multi-layer Neural Network for binary/multi classification
- 5.1 Overview about MNIST data
- 5.2 Implement Multi-layer Neural Network
- 5.3 Debugging Neural Network with Gradient Descent Checking
7. References
Gradient Descent rule (Delta rule)
Before going into detail about Gradient Descent, look at the image above: we can immediately see that Adaline uses a linear function as its activation function, and the Unit step function is only applied to quantize the final output (-1 or 1). Therefore we can calculate the error based on the linear activation (a real continuous value) instead of the Unit step function (a binary value) as in the Perceptron. Why do that? Because the linear activation produces real continuous values, the cost function becomes differentiable, which allows optimization algorithms such as Gradient Descent to run on it. Look at the image below to get an intuition of how Gradient Descent minimizes the cost function. Here the cost is the Sum of Squared Errors (SSE), a quadratic and convex function with the formula,
\( J(w) = \frac{1}{2} \sum_i(y^{(i)} - \phi(z^{(i)}))^2 \hspace{1cm}(3)\)
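To make equation \((3)\) concrete, here is a minimal NumPy sketch of Adaline's net input, linear activation, and SSE cost. The names `X`, `y`, `w`, and `b` are my own placeholders for illustration, not code taken from this series:

```python
import numpy as np

def net_input(X, w, b):
    """Net input z = w^T x + b for every sample in X (bias could equally be folded into w as w_0 with x_0 = 1)."""
    return np.dot(X, w) + b

def activation(z):
    """Adaline's activation is simply the identity (linear) function."""
    return z

def sse_cost(X, y, w, b):
    """Sum of Squared Errors: J(w) = 1/2 * sum_i (y_i - phi(z_i))^2, as in equation (3)."""
    errors = y - activation(net_input(X, w, b))
    return 0.5 * np.sum(errors ** 2)
```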
In equation \((3)\) we include an extra factor of \(\frac{1}{2}\); it is just for convenience when we calculate the partial derivative, as you'll see later. According to Wikipedia, Gradient Descent is also known as steepest descent, or the method of steepest descent. Intuitively, we can imagine Gradient Descent as "climbing down the hill". Mathematically, we take a step in the direction opposite to the gradient in order to reach a local or global minimum, and in each epoch we can update the weight \(w_j\) of the \(j^{th}\) input with the formula,
\(w_j = w_j + \Delta w_j, \hspace{1cm} \Delta w_j = -\eta\frac{\partial J}{\partial w_j}\)
where the partial derivative of the cost function with respect to \(w_j\) is,
\(\frac{\partial J}{\partial w_j} = \frac{\partial}{\partial w_j} \frac{1}{2}\sum_i(y^{(i)} - \phi(z^{(i)}))^2\)
\(= \frac{1}{2}\frac{\partial}{\partial w_j} \sum_i(y^{(i)} - \phi(z^{(i)}))^2\)
\(= \frac{1}{2}2\sum_i(y^{(i)} - \phi(z^{(i)}))\frac{\partial}{\partial w_j}\left(y^{(i)} - \phi(z^{(i)})\right) \hspace{1cm}(7)\)
\(= \sum_i(y^{(i)} - \phi(z^{(i)}))\frac{\partial}{\partial w_j}\left(y^{(i)} - \sum_k w_k x^{(i)}_k\right)\)
\(= \sum_i(y^{(i)} - \phi(z^{(i)}))(0 - x^{(i)}_j)\)
\(= -\sum_i(y^{(i)} - \phi(z^{(i)}))x^{(i)}_j \Leftrightarrow (5)\)
In equation \((7)\), the factor of \(2\) cancels with the \(\frac{1}{2}\); that's the reason we added the extra \(\frac{1}{2}\) term to the cost function in equation \((3)\) for convenience. Okay, we did it; before moving on to another interesting topic, the sketch below shows how this update rule translates into code.
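The following is a rough NumPy illustration of one epoch of batch Gradient Descent for Adaline, using the result we just derived, \(\Delta w_j = \eta\sum_i(y^{(i)} - \phi(z^{(i)}))x^{(i)}_j\). It is a sketch under my own assumptions (a feature matrix `X` of shape `(n_samples, n_features)`, targets `y` of -1/1, and a learning rate `eta`), not the exact code of section 3.2.3:

```python
import numpy as np

def gradient_descent_epoch(X, y, w, b, eta=0.01):
    """One batch Gradient Descent (delta rule) step for Adaline.

    X: (n_samples, n_features) feature matrix
    y: (n_samples,) target values (-1 or 1)
    w: (n_features,) weights, b: scalar bias, eta: learning rate
    """
    output = np.dot(X, w) + b          # linear activation: phi(z) = z
    errors = y - output                # (y^(i) - phi(z^(i))) for every sample
    w = w + eta * X.T.dot(errors)      # Delta w_j = eta * sum_i errors_i * x_ij
    b = b + eta * errors.sum()         # bias update (treat it as w_0 with x_0 = 1)
    cost = 0.5 * (errors ** 2).sum()   # SSE cost from equation (3)
    return w, b, cost
```

Running this update for a fixed number of epochs while tracking the returned cost is roughly what an Adaline implementation like the one in section 3.2.3 boils down to.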