TrisZaska's Machine Learning Blog

Learning via types of Gradient Descent

1. Introduction
2. History and Overview of Artificial Neural Networks
3. Single neural network
4. Multi-layer neural network
5. Installing and using a Multi-layer Neural Network to classify MNIST data
6. Summary
7. References


We've already implemented the Perceptron and Adaline models. They differ slightly in how they update the weights \(w\).
Look again at equation \((2)\), which implements the Perceptron rule,
\(\Delta w_j = \eta (target^{(i)} - output^{(i)})x^{(i)}_j\)
and equation \((6)\), which implements the Delta rule,
\(\Delta w_j = \eta \sum_i(y^{(i)} - \phi(z^{(i)}))x^{(i)}_j\)
Can you see the difference between the two equations? In the Perceptron, we update the weights incrementally after each individual training sample within an epoch; this is so-called Stochastic Gradient Descent (or Online Learning). On the other hand, Adaline uses the whole training dataset in each epoch to calculate the update; this is so-called Batch Gradient Descent.
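To make the contrast concrete, here is a minimal NumPy sketch (not the exact code from the earlier posts) assuming the linear Adaline activation \(\phi(z) = z\); `X` is an \(m \times n\) matrix of samples, `y` the targets, `w` the weight vector and `eta` the learning rate, all names being illustrative:

```python
import numpy as np

def batch_epoch(w, X, y, eta):
    """One epoch of Batch Gradient Descent: a single update from the whole set."""
    errors = y - X.dot(w)              # y^(i) - phi(z^(i)) for every sample at once
    return w + eta * X.T.dot(errors)   # Delta w_j = eta * sum_i error_i * x_j^(i)

def stochastic_epoch(w, X, y, eta):
    """One epoch of Stochastic Gradient Descent: one update per training sample."""
    for x_i, y_i in zip(X, y):
        error = y_i - x_i.dot(w)       # error of a single sample
        w = w + eta * error * x_i      # Delta w_j = eta * error * x_j^(i)
    return w
```

Note that the batch version applies one vectorized update per epoch, while the stochastic version applies \(m\) small updates per epoch.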
So, what are their advantages and disadvantages?
  • Imagine you're an AI engineer at Google applying Machine Learning on YouTube's servers, which serve billions of users every day. What do you think? I don't know what you think, but it's surely very challenging! With such a huge dataset, training on all of it in every epoch, repeated again and again to minimize the error, takes huge amounts of memory and execution time and is computationally very expensive. The solution is that, instead of loading the whole dataset in every epoch, we load one sample at a time to update the weights; in other words, Stochastic Gradient Descent comes to the rescue.
  • Another advantage is that Online Learning can update the weights immediately as new data samples come in (see the short sketch after this list). This is very useful if you're working on a web application that interacts with users in real time.
  • Stochastic Gradient Descent may traverse the error surface more noisily than Batch Gradient Descent and may not reach the global optimum, but don't worry: although it may not hit the global minimum exactly, it gets very close to it.
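For the online-learning point above, the same per-sample rule can be applied as data streams in, with no notion of an epoch at all. This is only a sketch under the same assumptions as before; `incoming_samples()` is a hypothetical stream of new observations:

```python
def online_update(w, x_new, y_new, eta):
    """Update the weights immediately from a single newly arrived sample."""
    error = y_new - x_new.dot(w)
    return w + eta * error * x_new

# Inside a (hypothetical) real-time application loop:
# for x_new, y_new in incoming_samples():
#     w = online_update(w, x_new, y_new, eta)
```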
From this, thanks to whoever suggested it, another type called Mini-batch Gradient Descent combines the two previous types: it solves the problem of running so many "for" loops in Online Learning and of loading the whole dataset in Batch Gradient Descent. Mini-batch Gradient Descent lets us use a subset of the whole training dataset, so we can still vectorize the computation. If \(k\) is the size of each subset of training samples,
\(1 < k < m\)
where \(m\) is the total number of training samples.
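Continuing the NumPy sketch above (again with illustrative names and the linear Adaline activation), one epoch of Mini-batch Gradient Descent shuffles the data and applies one vectorized update per slice of \(k\) samples:

```python
import numpy as np

def minibatch_epoch(w, X, y, eta, k, rng):
    """One epoch of Mini-batch Gradient Descent with batch size k (1 < k < m)."""
    m = X.shape[0]
    indices = rng.permutation(m)              # shuffle so the batches differ every epoch
    for start in range(0, m, k):
        batch = indices[start:start + k]      # the next k sample indices
        errors = y[batch] - X[batch].dot(w)   # vectorized over the k samples
        w = w + eta * X[batch].T.dot(errors)  # one update per mini-batch
    return w

# Example call: w = minibatch_epoch(w, X, y, eta=0.01, k=32, rng=np.random.RandomState(1))
```

With \(k = 1\) this degenerates to Stochastic Gradient Descent, and with \(k = m\) to Batch Gradient Descent.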
