1. Introduction
2. History and Overview about Artificial Neural Network
3. Single neural network
- 3.1 Perceptron
- 3.1.1 The Unit Step function
- 3.1.2 The Perceptron rules
- 3.1.3 The bias term
- 3.1.4 Implement Perceptron in Python
- 3.2 Adaptive Linear Neurons
- 3.2.1 Gradient Descent rule (Delta rule)
- 3.2.2 Learning rate in Gradient Descent
- 3.2.3 Implement Adaline in Python to classify Iris data
- 3.2.4 Learning via types of Gradient Descent
- 3.3 Problems with Perceptron (AI Winter)
- 4.1 Overview about Multi-layer Neural Network
- 4.2 Forward Propagation
- 4.3 Cost function
- 4.4 Backpropagation
- 4.5 Implement simple Multi-layer Neural Network to solve the problem of Perceptron
- 4.6 Some optional techniques for Multi-layer Neural Network Optimization
- 4.7 Multi-layer Neural Network for binary/multi classification
- 5.1 Overview about MNIST data
- 5.2 Implement Multi-layer Neural Network
- 5.3 Debugging Neural Network with Gradient Descent Checking
7. References
Backpropagation
As usual, let's start with a definition of Backpropagation. According to Wikipedia, Backpropagation is a common method of training artificial neural networks, used in conjunction with an optimization method such as gradient descent. Since we've already discussed Gradient Descent in the Adaline section, this definition should feel familiar. Before getting into the mathematics, look at the image below to build some intuition: Backpropagation starts once forward propagation has finished, propagating from the right side (the output layer) backward to the left side. Mathematically, it uses the partial derivatives of the cost function (the error at the output layer) to compute the gradient, which is then used to update the weights. For a better understanding, we divide Backpropagation into three parts: calculate the error \(\delta\), then calculate the gradient \(\nabla\), and finally update the weights \(W\).
>>> Part 1: Calculate the error \(\delta\)
Before looking at the formulae, keep in mind that all calculations below use the matrix-vector representation. Starting from the output layer, we calculate the error, call it \(\delta\),
\(\delta^{(3)} = a^{(3)} - y \hspace{1cm} (10)\)
Where,
- \(\delta^{(3)}\) is the error in layer \((3)\), the so-called output layer
- \(a^{(3)}\) is the activation of the output layer
- \(y\) is the target output
Using the chain rule, this error can also be written as,
\(\delta^{(3)} = \frac{\partial J}{\partial a^{(3)}}\frac{\partial a^{(3)}}{\partial z^{(3)}} \hspace{1cm}(11)\)
If we zoom in on one neuron in the output layer, we can see why the chain rule is needed: to relate \(J(W)\) to the net input \(z\), we must go through \(a\), which acts as a "bridge" between them. The same interpretation holds for every neuron in the output layer. Now that we have some intuition, let's calculate equation \((11)\).
Firstly we have,
\(\mathbf{J(W) = -\sum \left[y\log \left( a^{(3)}\right) + \left(1 - y\right)\log\left(1 - a^{(3)}\right)\right]}\)
then,
\(\frac{\partial}{\partial a^{(3)}}J(W) = -\left[y\left(\frac{1}{a^{(3)}}\right) + (1 - y)\frac{\partial}{\partial a^{(3)}}\log\left(1 - a^{(3)}\right)\right]\)
\(= -\left[y\left(\frac{1}{a^{(3)}}\right) + (1 - y)\left(\frac{1}{1 - a^{(3)}}\right)\frac{\partial}{\partial a^{(3)}}\left(1 - a^{(3)}\right)\right]\)
\(= -\left[y\left(\frac{1}{a^{(3)}}\right) + (1 - y)\left(\frac{1}{1 - a^{(3)}}\right)(-1)\right]\)
\(= -\left[\left(\frac{y}{a^{(3)}}\right) + \left(\frac{y - 1}{1 - a^{(3)}}\right)\right]\)
\(= -\left[\frac{y(1 - a^{(3)}) + (y - 1)a^{(3)}}{a^{(3)}(1 - a^{(3)})}\right]\)
\(= -\left[\frac{y - a^{(3)}}{a^{(3)}(1 - a^{(3)})}\right]\)
\(= \frac{a^{(3)} - y}{a^{(3)}(1 - a^{(3)})} \hspace{1cm} (*)\)
Secondly we have,
\(\mathbf{a^{(3)} = \frac{1}{1 + e^{-z^{(3)}}}}\)
then,
\(\frac{\partial}{\partial z^{(3)}}a^{(3)}= \left(\frac{1}{1 + e^{-z^{(3)}}}\right)'\)
\(= \left[\left(1 + e^{-z^{(3)}}\right)^{-1}\right]'\)
\(= -\left(1 + e^{-z^{(3)}}\right)^{-2}\frac{\partial}{\partial z^{(3)}}\left(1 + e^{-z^{(3)}}\right)\)
\(= -\left(1 + e^{-z^{(3)}}\right)^{-2}\left(-e^{-z^{(3)}}\right)\)
\(= \frac{-1}{\left(1 + e^{-z^{(3)}}\right)^2}\left(-e^{-z^{(3)}}\right)\)
\(= \frac{e^{-z^{(3)}}}{\left(1 + e^{-z^{(3)}}\right)^2}\)
\(= \frac{1}{1 + e^{-z^{(3)}}}\frac{e^{-z^{(3)}}}{1 + e^{-z^{(3)}}}\)
\(= \frac{1}{1 + e^{-z^{(3)}}}\frac{(1 + e^{-z^{(3)}}) - 1}{1 + e^{-z^{(3)}}}\)
\(= \frac{1}{1 + e^{-z^{(3)}}}\left(1 - \frac{1}{1 + e^{-z^{(3)}}}\right)\)
\(= a^{(3)}\left(1 - a^{(3)}\right) \hspace{1cm} (**)\)
From \((*)\) and \((**)\), equation \((11)\) can be written as,
\(\delta^{(3)} = \frac{\partial J}{\partial a^{(3)}}\frac{\partial a^{(3)}}{\partial z^{(3)}} = \frac{a^{(3)} - y}{a^{(3)}(1 - a^{(3)})}\left[a^{(3)}\left(1 - a^{(3)}\right)\right] = a^{(3)} - y \Leftrightarrow (10)\)
> NOTE: In this context, \(\log()\) means the natural logarithm: \(\log_e() = \ln()\).
Some useful calculus formulae: \((\log x)' = \frac{1}{x}\), \((e^x)' = e^x\), \((e^{-x})' = -e^{-x}\), \(\left(f(g(x))\right)' = f'(g(x))\,g'(x)\)
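As a quick sanity check (not part of the original derivation), we can plug arbitrary numbers into \((*)\) and \((**)\) and confirm that their product really collapses to \(a^{(3)} - y\); the values of \(z^{(3)}\) and \(y\) below are just illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example values for the net input and the target
z3, y = 0.7, 1.0
a3 = sigmoid(z3)

dJ_da = (a3 - y) / (a3 * (1.0 - a3))   # (*)  dJ/da^(3)
da_dz = a3 * (1.0 - a3)                # (**) da^(3)/dz^(3)

print(dJ_da * da_dz)   # chain rule product
print(a3 - y)          # equation (10) -- the same number
```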
Alright, we have just calculated the error \(\delta^{(3)}\) in the output layer. Let's continue with the error in the hidden layer, which we call \(\delta^{(2)}\), given by the formula,
\(\delta^{(2)} = w^{(2)}\delta^{(3)}\left(a^{(2)}\left(1 - a^{(2)}\right)\right) \hspace{1cm} (12)\)
Okay, \(\delta^{(2)}\) describes how much a change in the net input \(z^{(2)}\) affects the total error. So \(\delta^{(2)}\) can be written as,
\(\delta^{(2)} = \frac{\partial J}{\partial a^{(3)}}\frac{\partial a^{(3)}}{\partial z^{(3)}}\frac{\partial z^{(3)}}{\partial a^{(2)}}\frac{\partial a^{(2)}}{\partial z^{(2)}}\)
\(=\delta^{(3)}\frac{\partial z^{(3)}}{\partial a^{(2)}}\frac{\partial a^{(2)}}{\partial z^{(2)}} \hspace{1cm} (13)\)
Since,
\(z^{(3)} = w^{(2)}a^{(2)} \Rightarrow \frac{\partial}{\partial a^{(2)}}\left(z^{(3)}\right) = \left[w^{(2)}a^{(2)}\right]' = w^{(2)}\)
And,
\(a^{(2)} = \frac{1}{1 + e^{-z^{(2)}}} \Rightarrow \frac{\partial}{\partial z^{(2)}}a^{(2)} = a^{(2)}\left(1 - a^{(2)}\right)\)
Therefore, equation \((13)\) reduces to equation \((12)\), as claimed,
\((13) \Leftrightarrow \delta^{(3)}w^{(2)}\left(a^{(2)}\left(1 - a^{(2)}\right)\right) \Leftrightarrow (12)\)
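To make Part 1 concrete, here is a minimal NumPy sketch of equations \((10)\) and \((12)\) for a small 2-3-1 network; the shapes, random weights, and omitted bias terms are illustrative assumptions, not the implementation we will build later. Note the transpose on \(w^{(2)}\), which is what the matrix-vector form needs so the shapes line up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 2-3-1 network, 4 training samples (one per column); biases omitted
rng = np.random.RandomState(1)
X  = rng.rand(2, 4)                       # a^(1): the input layer
w1 = rng.rand(3, 2)                       # weights between input and hidden layer
w2 = rng.rand(1, 3)                       # weights between hidden and output layer
y  = np.array([[0., 1., 1., 0.]])         # targets

# Forward propagation
a2 = sigmoid(w1.dot(X))                   # hidden activations
a3 = sigmoid(w2.dot(a2))                  # output activations

# Equation (10): error at the output layer
delta3 = a3 - y

# Equation (12): error at the hidden layer
delta2 = w2.T.dot(delta3) * (a2 * (1.0 - a2))
```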
We do not calculate \(\delta^{(1)}\) for the input layer, because the input layer holds the raw data, which we cannot change.
>>> Part 2: Calculate the gradient \(\nabla\)
If \(\delta\) describes how much a change in the net input \(z\) affects the total error, then \(\nabla\) describes how much a change in the weights \(w\) affects the error. Since our goal is to update the weights, the whole purpose of calculating \(\delta\) is to support the calculation of \(\nabla\). In fact, we could compute the gradient directly from the total error, but computing \(\delta\) first makes the Python implementation cleaner and easier to read, so that is the route I recommend. With that said, let's solve for \(\nabla\).
First, we calculate the gradient with respect to the weights between the output layer and the hidden layer,
\(\nabla^{(2)} = \delta^{(3)}a^{(2)} \hspace{1cm} (14)\)
Next, an image to illustrate equation \((14)\). Just as when we calculated \(\delta\), take a look at the image: \(\nabla^{(2)}\) can be written as,
\(\nabla^{(2)} = \frac{\partial J}{\partial a^{(3)}}\frac{\partial a^{(3)}}{\partial z^{(3)}}\frac{\partial z^{(3)}}{\partial w^{(2)}} \hspace{1cm} (15)\)
\(= \delta^{(3)}\frac{\partial z^{(3)}}{\partial w^{(2)}}\)
Since,
\(z^{(3)} = w^{(2)}a^{(2)} \Rightarrow \frac{\partial}{\partial w^{(2)}}\left(z^{(3)}\right) = \left[w^{(2)}a^{(2)}\right]' = a^{(2)}\)
then equation \((15)\) is equal to equation \((14)\),
\((15) \Leftrightarrow \delta^{(3)}a^{(2)} \Leftrightarrow (14)\)
Okay, we continue with the gradient with respect to the weights between the hidden layer and the input layer,
\(\nabla^{(1)} = \delta^{(2)}x^{(1)} \hspace{1cm} (16)\)
Again, a final image to describe what we are doing. Following the same steps, \(\nabla^{(1)}\) can be written as,
\(\nabla^{(1)} = \frac{\partial J}{\partial a^{(3)}}\frac{\partial a^{(3)}}{\partial z^{(3)}}\frac{\partial z^{(3)}}{\partial a^{(2)}}\frac{\partial a^{(2)}}{\partial z^{(2)}}\frac{\partial z^{(2)}}{\partial w^{(1)}} \hspace{1cm} (17)\)
\(= \delta^{(2)}\frac{\partial z^{(2)}}{\partial w^{(1)}}\)
Since,
\(z^{(2)} = w^{(1)}x^{(1)} \Rightarrow \frac{\partial}{\partial w^{(1)}}\left(z^{(2)}\right) = \left[w^{(1)}x^{(1)}\right]' = x^{(1)}\)
then equation \((17)\) is equal to equation \((16)\),
\((17) \Leftrightarrow \delta^{(2)}x^{(1)} \Leftrightarrow (16)\)
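Continuing the earlier sketch, equations \((14)\) and \((16)\) become two matrix products; the transposes are again only there so that each gradient ends up with the same shape as the weight matrix it belongs to.

```python
# Equation (14): gradient w.r.t. the weights between hidden and output layer
grad2 = delta3.dot(a2.T)      # shape (1, 3), same as w2

# Equation (16): gradient w.r.t. the weights between input and hidden layer
grad1 = delta2.dot(X.T)       # shape (3, 2), same as w1
```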
We are done calculating the gradients \(\nabla^{(2)}\) and \(\nabla^{(1)}\). Let's go to the final step, where we use them to update the weights.
>>> Part 3: Update the weights \(W\)
Do you remember the section on the ADAptive LInear NEuron (Adaline)? When we update the weights, we take a step in the opposite direction of the gradient. So we first calculate \(\Delta W\),
\(\Delta W = -\eta\nabla\)
Where,
- \(\eta\) is the learning rate
- \(\nabla\) is the gradient we calculated before
Then we update the weights,
\(W = W + \Delta W\)
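In code, the update is one line per weight matrix; continuing the sketch above, with a hypothetical learning rate:

```python
eta = 0.1                     # learning rate (an arbitrary choice here)

w2 += -eta * grad2            # Delta W^(2) = -eta * grad^(2)
w1 += -eta * grad1            # Delta W^(1) = -eta * grad^(1)
```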
Note that all of the above covers only a single epoch; to achieve successful learning, we must repeat it over many epochs. In the real world, this simple Multi-layer Neural Network will still struggle with problems such as expensive computation, over-fitting, and so on, so in the next section we introduce some optional techniques for optimization. Before that, let's implement a simple Multi-layer Neural Network to apply everything we have learned and see how it solves the problem of the Perceptron. It's definitely fun.