1. Introduction
2. History and Overview about Artificial Neural Network
3. Single neural network
- 3.1 Perceptron
- 3.1.1 The Unit Step function
- 3.1.2 The Perceptron rules
- 3.1.3 The bias term
- 3.1.4 Implement Perceptron in Python
- 3.2 Adaptive Linear Neurons
- 3.2.1 Gradient Descent rule (Delta rule)
- 3.2.2 Learning rate in Gradient Descent
- 3.2.3 Implement Adaline in Python to classify Iris data
- 3.2.4 Learning via types of Gradient Descent
- 3.3 Problems with Perceptron (AI Winter)
- 4.1 Overview about Multi-layer Neural Network
- 4.2 Forward Propagation
- 4.3 Cost function
- 4.4 Backpropagation
- 4.5 Implement simple Multi-layer Neural Network to solve the problem of Perceptron
- 4.6 Some optional techniques for Multi-layer Neural Network Optimization
- 4.6.1 Adaptive Learning Rate (Annealing)
- 4.6.2 Momentum Terms
- 4.6.3 Regularization
- 4.7 Multi-layer Neural Network for binary/multi classification
- 5.1 Overview about MNIST data
- 5.2 Implement Multi-layer Neural Network
- 5.3 Debugging Neural Network with Gradient Descent Checking
7. References
Regularization
We're here to discuss another technique called Regularization. Before we understand what it is and why it is useful for our Neural Network, let's consider the problem called overfitting. So, what is overfitting? Take a look at the image below. As you can see, overfitting occurs when our Neural Network fits the data too closely: instead of representing the general latent pattern of the data (middle image), it fits all the noisy points in the data (right-side image). This matters because, in the real world, data is always inconsistent, noisy, or complex. But how can our model overfit the data? Let's consider three cases:
- Under-fitting (high bias) occurs when our model is too simple or has too few parameters (weight coefficients), so it cannot capture the complex patterns in the data. Do you remember? In section 4.4.1, we visualized the decision boundary of the Perceptron on non-linear data, and the Perceptron performed very poorly.
- Then, in section 4.4.2, we used a Multi-layer Neural Network with 30 neurons in the hidden layer to solve the problem of the Perceptron. The result looks like the middle image, so our model works well (this balance is the so-called bias-variance tradeoff).
- But what happens if we use 1000 or more neurons in the hidden layer? Over-fitting (high variance) occurs because our model has too many parameters.
Bias measures how far our model's predictions are from the correct output; high bias means the error of the Neural Network is very high, the model is not strong enough to fit the data, which is the so-called under-fitting.
Variance measures how sensitive our model is to the data; high variance means there is (almost) no training error, the model is too strong and fits the data too closely, which is the so-called over-fitting.
So, with over-fitting, our model sticks to the training data so closely that it usually performs very poorly on unseen data. How do we solve the problem of over-fitting? There are many solutions, such as:
- Increase the number of training samples; in practice, however, the training dataset is always limited.
- Remove several features that we think are not relevant to our goal; but what if we make a mistake and the features we removed turn out to be useful for our model?
- Implement a simpler model with fewer parameters; but this can lead to under-fitting if the dimensionality of the data is too high.
> L2 Regularization
It is the most commonly used form of regularization in Machine Learning algorithms; it works by adding the term,
\(\frac{\lambda}{2}\|w\|^2_2 = \frac{\lambda}{2}\sum_i(w_i)^2\)
> L1 Regularization
If L2 is the least-squares shrinkage, then L1 is the so-called least absolute shrinkage,
\(\frac{\lambda}{2}\|w\|_1 = \frac{\lambda}{2}\sum_i|w_i|\)
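To make the two penalty terms concrete, here is a minimal NumPy sketch (the helper names `l2_penalty` and `l1_penalty` are just illustrative, not code from the earlier sections) that computes both terms for a given weight vector,

```python
import numpy as np

def l2_penalty(weights, lambda_):
    """L2 penalty: (lambda / 2) * sum of squared weights."""
    return (lambda_ / 2.0) * np.sum(weights ** 2)

def l1_penalty(weights, lambda_):
    """L1 penalty: (lambda / 2) * sum of absolute weights."""
    return (lambda_ / 2.0) * np.sum(np.abs(weights))

# Quick example
w = np.array([0.5, -1.2, 3.0])
print(l2_penalty(w, lambda_=0.1))  # 0.05 * (0.25 + 1.44 + 9.0) = 0.5345
print(l1_penalty(w, lambda_=0.1))  # 0.05 * (0.5 + 1.2 + 3.0)  = 0.235
```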
Applying L1 Regularization works the same way as L2, so we will only show how to apply L2 Regularization. The idea is to increase the bias of our model, so we add the L2 term to our cost function,
\(J_{reg}(w) = J(w) + \frac{\lambda}{2}\sum_{l}\sum_{i=1}\sum_{j=1}\left(w^{(l)}_{i,j}\right)^2 \qquad (18)\)
where the regularization term is just the sum of squares of all weights in our network, except the bias terms. Therefore, we use gradient descent to update the current weight \(w_{(t)}\),
\(w_{(t)} = w_{(t-1)} -\eta\left(\frac{\partial J}{\partial w_{(t-1)}} + \lambda w_{(t-1)}\right)\)
\(\Leftrightarrow w_{(t)} = w_{(t-1)}(1 - \eta\lambda) - \eta\frac{\partial J}{\partial w_{(t-1)}}\)
As you can see, \(w_{(t-1)}\) is shrunk by the factor \((1 - \eta\lambda)\) before the gradient step is applied. In equation \((18)\), the indices start at \(i = 1, j = 1\), which means we do not regularize the bias terms: regularization tends to push the weights toward zero, but the bias is used to shift our activation function, so we don't want the bias to become zero.
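As a rough sketch of how this update could look in code (assuming, purely for illustration, that each weight matrix stores its bias in the first column; adjust the slicing if your layout differs), the two forms of the update above give exactly the same result,

```python
import numpy as np

def l2_update(w, grad, eta, lambda_):
    """One gradient-descent step with L2 regularization.
    w       : weight matrix, bias assumed in the first column (illustrative convention)
    grad    : gradient of the unregularized cost J with respect to w
    eta     : learning rate
    lambda_ : regularization strength
    """
    reg = lambda_ * w               # regularization gradient lambda * w ...
    reg[:, 0] = 0.0                 # ... but the bias column is not regularized
    return w - eta * (grad + reg)   # w <- w - eta * (dJ/dw + lambda * w)

def l2_update_decay(w, grad, eta, lambda_):
    """Equivalent "weight decay" form: shrink the non-bias weights first,
    then apply the usual gradient step."""
    w_new = w.copy()
    w_new[:, 1:] *= (1.0 - eta * lambda_)  # w * (1 - eta * lambda)
    return w_new - eta * grad

# Both forms agree
rng = np.random.RandomState(0)
w, grad = rng.randn(3, 4), rng.randn(3, 4)
print(np.allclose(l2_update(w, grad, eta=0.01, lambda_=0.1),
                  l2_update_decay(w, grad, eta=0.01, lambda_=0.1)))  # True
```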