TrisZaska's Machine Learning Blog

Understanding Regularization in Neural Networks

1. Introduction
2. History and Overview of Artificial Neural Networks
3. Single neural network
4. Multi-layer neural network
5. Installing and using a Multi-layer Neural Network to classify MNIST data
6. Summary
7. References

Regularization

We're here to discuss another technique called Regularization. Before going into what it is and why it is useful for our Neural Network, let's consider the problem known as overfitting. So, what is overfitting? Let's take a look at the image below,
As you can see, the problem of overfitting occurs when our Neural Network fits the training data too closely: instead of representing the general latent pattern of the data (middle image), it fits all the noisy points in the data (right-side image). This happens because in the real world, data is always inconsistent, noisy or complex. But how can our model become overfitted to that data? Let's consider three cases,
  • Under-fitting (high bias) occurs when our model is too simple or has too few parameters (weight coefficients), so it cannot handle the complex pattern of the data. Do you remember? In section 4.4.1, we visualized the decision boundary of the Perceptron on non-linear data, and the Perceptron worked very poorly, right?
  • Then, in section 4.4.2, we used a Multi-layer Neural Network with 30 neurons in the hidden layer to solve the Perceptron's problem; the result looks like the middle image, so our model works well (this balance is the so-called bias-variance tradeoff).
  • But if we use 1000 or more neurons in the hidden layer, what happens? Over-fitting (high variance) will occur because our model has too many parameters.
You can check this out for yourself. All the code is available on Github; to reproduce over-fitting, you will need noisy data and a larger number of hidden neurons.
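If you just want a quick feel for these three cases without the Github code, here is a minimal sketch (my own illustration, not the original implementation) using scikit-learn's MLPClassifier on a small noisy dataset; the hidden-layer sizes 1, 30 and 1000 are chosen only to mirror the three cases above,
```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Noisy, non-linear two-class data
X, y = make_moons(n_samples=300, noise=0.3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

for n_hidden in (1, 30, 1000):   # too simple / reasonable / too complex
    mlp = MLPClassifier(hidden_layer_sizes=(n_hidden,),
                        max_iter=2000, random_state=1)
    mlp.fit(X_train, y_train)
    print('%4d hidden neurons | train acc: %.2f | test acc: %.2f'
          % (n_hidden, mlp.score(X_train, y_train), mlp.score(X_test, y_test)))
```
A large gap between training and test accuracy is the usual symptom of over-fitting, while low accuracy on both is a sign of under-fitting.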
Bias measures how far our model is from the correct output, therefore high bias means the error of the Neural Network is very high; the model is not strong enough to fit the data, so-called under-fitting.
Variance measures how sensitive our model is to the particular training data it sees, therefore high variance means there is almost no training error; the model is too strong and fits even the noise, so-called over-fitting.
So, with over-fitting our model sticks to the training data so much that when tackling unseen data it usually works very poorly. Then we have to think about how to solve the problem of over-fitting. There are many solutions, such as,
  • Increase the number of training samples, but in practice the training dataset is always limited.
  • Remove several features that we think are not relevant to our goal, but what if we make a mistake? The features we removed could have been useful for our model.
  • Implement a simpler model with fewer parameters, but it can be under-fitting if the dimensionality of the data is too high.
Then Regularization comes in to help us solve these problems; in Neural Networks it is also known as weight decay. The basic idea behind Regularization is to penalize (reduce) the weights of our Network by adding a penalty term to the cost function, which increases the bias of the model; the weights are pushed closer to 0, which means our model becomes simpler, right? There are many types of regularization, but in this post we just talk about L2 Regularization and L1 Regularization.
> L2 Regularization
It is the most widely used in Machine Learning algorithms; we simply add the term,
\(\frac{\lambda}{2}\|w\|^2_2 = \frac{\lambda}{2}\sum_i(w_i)^2\)
> L1 Regularization
If L2 is a least-squares shrinkage of the weights, then L1 is the so-called least absolute shrinkage,
\(\frac{\lambda}{2}\|w\|_1 = \frac{\lambda}{2}\sum_i|w_i|\)
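To make the two penalty terms concrete, here is a minimal NumPy sketch; the matrix `W` and the value of `lam` (our \(\lambda\)) are just illustrative,
```python
import numpy as np

lam = 0.1                       # regularization strength (lambda)
W = np.array([[0.5, -1.2],      # an illustrative weight matrix
              [2.0,  0.3]])

# L2 penalty: (lambda / 2) * sum of squared weights
l2_penalty = (lam / 2.0) * np.sum(W ** 2)

# L1 penalty: (lambda / 2) * sum of absolute weights, as written above
l1_penalty = (lam / 2.0) * np.sum(np.abs(W))

print('L2 penalty: %.4f, L1 penalty: %.4f' % (l2_penalty, l1_penalty))
```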
Applying L2 and L1 Regularization works the same way, so we will only show how to apply L2 Regularization. The idea is to increase the bias of our model, so we add the L2 penalty term to our cost function,
\(J_{reg}(w) = J(w) + \frac{\lambda}{2}\sum_{i=1}\sum_{j=1}\left(w_{ij}\right)^2\)
where the term \(\frac{\lambda}{2}\sum_{i=1}\sum_{j=1}(w_{ij})^2\) is just the sum of the squares of all the weights in our network, except the bias terms. Then we use gradient descent to update the current weight \(w_{(t)}\),
\(w_{(t)} = w_{(t-1)} -\eta\left(\frac{\partial J}{\partial w_{(t-1)}} + \lambda w_{(t-1)}\right)\)
\(\Leftrightarrow w_{(t)} = w_{(t-1)}(1 - \eta\lambda) - \eta\frac{\partial J}{\partial w_{(t-1)}}\)
As you can see, the weight is first shrunk by the factor \((1 - \eta\lambda)\) before the gradient step is applied, which is exactly why L2 regularization is known as weight decay in Neural Networks.
In equation \((18)\) (the regularized cost function), the sums start from \(i = 1\) and \(j = 1\), which means we do not regularize the bias terms: regularization tends to push the weights toward zero, but the bias is used to shift our activation function, so we don't want it to become zero.
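As a quick illustration of the update rule and the point about the bias, here is a minimal NumPy sketch of one gradient-descent step with weight decay; `W`, `b` and the gradients are placeholder values, not taken from the MNIST code,
```python
import numpy as np

eta, lam = 0.01, 0.1                # learning rate and regularization strength
rng = np.random.RandomState(1)

W = rng.randn(3, 2) * 0.1           # illustrative hidden-layer weights
b = np.zeros(2)                     # bias terms (not regularized)

grad_W = rng.randn(3, 2)            # placeholder for dJ/dW from backpropagation
grad_b = rng.randn(2)               # placeholder for dJ/db

# Weight decay: shrink the weights by (1 - eta * lam) before the gradient step
W = W * (1 - eta * lam) - eta * grad_W
# The bias gets a plain gradient-descent update, with no decay
b = b - eta * grad_b
```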
