1. Introduction
2. History and Overview about Artificial Neural Network
3. Single neural network
- 3.1 Perceptron
- 3.1.1 The Unit Step function
- 3.1.2 The Perceptron rules
- 3.1.3 The bias term
- 3.1.4 Implement Perceptron in Python
- 3.2 Adaptive Linear Neurons
- 3.2.1 Gradient Descent rule (Delta rule)
- 3.2.2 Learning rate in Gradient Descent
- 3.2.3 Implement Adaline in Python to classify Iris data
- 3.2.4 Learning via types of Gradient Descent
- 3.3 Problems with Perceptron (AI Winter)
- 4.1 Overview about Multi-layer Neural Network
- 4.2 Forward Propagation
- 4.3 Cost function
- 4.4 Backpropagation
- 4.5 Implement simple Multi-layer Neural Network to solve the problem of Perceptron
- 4.6 Some optional techniques for Multi-layer Neural Network Optimization
- 4.7 Multi-layer Neural Network for binary/multi classification
- 5.1 Overview about MNIST data
- 5.2 Implement Multi-layer Neural Network
- 5.3 Debugging Neural Network with Gradient Descent Checking
7. References
Backpropagation
As usual, let's start with a definition of Backpropagation. According to Wikipedia, Backpropagation is a common method of training artificial neural networks, used in conjunction with an optimization method such as gradient descent. Since we've already discussed Gradient Descent in the Adaline section, this definition should feel familiar. Before getting into the mathematics, look at the image below to build some intuition: Backpropagation starts once forward propagation has finished, propagating from the right side (the output layer) backward to the left side. Mathematically, it uses the partial derivatives of the cost function (the error at the output layer) to compute the gradient, which is then used to update the weights. For a better understanding, we divide Backpropagation into three parts: calculate the error \(\delta\), then calculate the gradient \(\nabla\), and finally update the weights \(W\).
>>> Part 1: Calculate the error \(\delta\)
Before looking at the formulae, keep in mind that all calculations below use the matrix-vector representation. Starting from the output layer, we calculate the error, call it \(\delta\),
\(\delta^{(3)} = a^{(3)} - y \hspace{1cm} (10)\)
Where,
- \(\delta^{(3)}\) is the error in layer \((3)\), the so-called output layer
- \(a^{(3)}\) is the activation of the output layer
- \(y\) is the target output
Using the chain rule, this error can also be written as,
\(\delta^{(3)} = \frac{\partial J}{\partial a^{(3)}}\frac{\partial a^{(3)}}{\partial z^{(3)}} \hspace{1cm}(11)\)
If we zoom in on one neuron in the output layer, we can see why the chain rule is needed: to relate \(J(W)\) to the net input \(z\), we must go through \(a\), which acts as a "bridge" between them. The same interpretation holds for every neuron in the output layer. Now that we have some intuition, let's calculate equation \((11)\).
Firstly we have,
\(\mathbf{J(W) = -\sum \left[y\log \left( a^{(3)}\right) + \left(1 - y\right)\log\left(1 - a^{(3)}\right)\right]}\)
then,
\(\frac{\partial}{\partial a^{(3)}}J(W) = -\left[y\left(\frac{1}{a^{(3)}}\right) + (1 - y)\frac{\partial}{\partial a^{(3)}}\log\left(1 - a^{(3)}\right)\right]\)
\(= -\left[y\left(\frac{1}{a^{(3)}}\right) + (1 - y)\left(\frac{1}{1 - a^{(3)}}\right)\frac{\partial}{\partial a^{(3)}}\left(1 - a^{(3)}\right)\right]\)
\(= -\left[y\left(\frac{1}{a^{(3)}}\right) + (1 - y)\left(\frac{1}{1 - a^{(3)}}\right)(-1)\right]\)
\(= -\left[\left(\frac{y}{a^{(3)}}\right) + \left(\frac{y - 1}{1 - a^{(3)}}\right)\right]\)
\(= -\left[\frac{y(1 - a^{(3)}) + (y - 1)a^{(3)}}{a^{(3)}(1 - a^{(3)})}\right]\)
\(= -\left[\frac{y - a^{(3)}}{a^{(3)}(1 - a^{(3)})}\right]\)
\(= \frac{a^{(3)} - y}{a^{(3)}(1 - a^{(3)})} \hspace{1cm} (*)\)
Secondly we have,
\(\mathbf{a^{(3)} = \frac{1}{1 + e^{-z^{(3)}}}}\)
then,
\(\frac{\partial}{\partial z^{(3)}}a^{(3)}= \left(\frac{1}{1 + e^{-z^{(3)}}}\right)'\)
\(= \left[\left(1 + e^{-z^{(3)}}\right)^{-1}\right]'\)
\(= -\left(1 + e^{-z^{(3)}}\right)^{-2}\frac{\partial}{\partial z^{(3)}}\left(1 + e^{-z^{(3)}}\right)\)
\(= -\left(1 + e^{-z^{(3)}}\right)^{-2}\left(-e^{-z^{(3)}}\right)\)
\(= \frac{-1}{\left(1 + e^{-z^{(3)}}\right)^2}\left(-e^{-z^{(3)}}\right)\)
\(= \frac{e^{-z^{(3)}}}{\left(1 + e^{-z^{(3)}}\right)^2}\)
\(= \frac{1}{1 + e^{-z^{(3)}}}\frac{e^{-z^{(3)}}}{1 + e^{-z^{(3)}}}\)
\(= \frac{1}{1 + e^{-z^{(3)}}}\frac{(1 + e^{-z^{(3)}}) - 1}{1 + e^{-z^{(3)}}}\)
\(= \frac{1}{1 + e^{-z^{(3)}}}\left(1 - \frac{1}{1 + e^{-z^{(3)}}}\right)\)
\(= a^{(3)}\left(1 - a^{(3)}\right) \hspace{1cm} (**)\)
From \((*)\) and \((**)\), equation \((11)\) can be written as,
\(\delta^{(3)} = \frac{\partial J}{\partial a^{(3)}}\frac{\partial a^{(3)}}{\partial z^{(3)}} = \frac{a^{(3)} - y}{a^{(3)}(1 - a^{(3)})}\left[a^{(3)}\left(1 - a^{(3)}\right)\right] = a^{(3)} - y \Leftrightarrow (10)\)
> NOTE: In this context, \(\log()\) means the natural logarithm: \(\log_e() = \ln()\).
Some useful calculus formulae: \((\log x)' = \frac{1}{x}\), \((e^x)' = e^x\), \((e^{-x})' = -e^{-x}\), \(\left(f(g(x))\right)' = f'(g(x))\,g'(x)\)
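As a quick sanity check (not part of the original derivation), we can plug arbitrary numbers into \((*)\) and \((**)\) and confirm that their product really collapses to \(a^{(3)} - y\); the values of \(z^{(3)}\) and \(y\) below are just illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example values for the net input and the target
z3, y = 0.7, 1.0
a3 = sigmoid(z3)

dJ_da = (a3 - y) / (a3 * (1.0 - a3))   # (*)  dJ/da^(3)
da_dz = a3 * (1.0 - a3)                # (**) da^(3)/dz^(3)

print(dJ_da * da_dz)   # chain rule product
print(a3 - y)          # equation (10) -- the same number
```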
Alright, we have just calculated the error \(\delta^{(3)}\) in the output layer. Let's continue with the error in the hidden layer, which we call \(\delta^{(2)}\), given by the formula,
\(\delta^{(2)} = w^{(2)}\delta^{(3)}\left(a^{(2)}\left(1 - a^{(2)}\right)\right) \hspace{1cm} (12)\)
Okay, \(\delta^{(2)}\) describes how much a change in the net input \(z^{(2)}\) affects the total error. So \(\delta^{(2)}\) can be written as,
\(\delta^{(2)} = \frac{\partial J}{\partial a^{(3)}}\frac{\partial a^{(3)}}{\partial z^{(3)}}\frac{\partial z^{(3)}}{\partial a^{(2)}}\frac{\partial a^{(2)}}{\partial z^{(2)}}\)
\(=\delta^{(3)}\frac{\partial z^{(3)}}{\partial a^{(2)}}\frac{\partial a^{(2)}}{\partial z^{(2)}} \hspace{1cm} (13)\)
Since,
\(z^{(3)} = w^{(2)}a^{(2)} \Rightarrow \frac{\partial}{\partial a^{(2)}}\left(z^{(3)}\right) = \left[w^{(2)}a^{(2)}\right]' = w^{(2)}\)
And,
\(a^{(2)} = \frac{1}{1 + e^{-z^{(2)}}} \Rightarrow \frac{\partial}{\partial z^{(2)}}a^{(2)} = a^{(2)}\left(1 - a^{(2)}\right)\)
Therefore, equation \((13)\) reduces to equation \((12)\), as claimed,
\((13) \Leftrightarrow \delta^{(3)}w^{(2)}\left(a^{(2)}\left(1 - a^{(2)}\right)\right) \Leftrightarrow (12)\)
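To make Part 1 concrete, here is a minimal NumPy sketch of equations \((10)\) and \((12)\) for a small 2-3-1 network; the shapes, random weights, and omitted bias terms are illustrative assumptions, not the implementation we will build later. Note the transpose on \(w^{(2)}\), which is what the matrix-vector form needs so the shapes line up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 2-3-1 network, 4 training samples (one per column); biases omitted
rng = np.random.RandomState(1)
X  = rng.rand(2, 4)                       # a^(1): the input layer
w1 = rng.rand(3, 2)                       # weights between input and hidden layer
w2 = rng.rand(1, 3)                       # weights between hidden and output layer
y  = np.array([[0., 1., 1., 0.]])         # targets

# Forward propagation
a2 = sigmoid(w1.dot(X))                   # hidden activations
a3 = sigmoid(w2.dot(a2))                  # output activations

# Equation (10): error at the output layer
delta3 = a3 - y

# Equation (12): error at the hidden layer
delta2 = w2.T.dot(delta3) * (a2 * (1.0 - a2))
```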
We do not calculate \(\delta^{(1)}\) for the input layer, because the input layer holds the raw data, which we cannot change.
>>> Part 2: Calculate the gradient \(\nabla\)
If \(\delta\) describes how much a change in the net input \(z\) affects the total error, then \(\nabla\) describes how much a change in the weights \(w\) affects the error. Since our goal is to update the weights, the whole purpose of calculating \(\delta\) is to support the calculation of \(\nabla\). In fact, we could compute the gradient directly from the total error, but computing \(\delta\) first makes the Python implementation cleaner and easier to read, so that is the route I recommend. With that said, let's solve for \(\nabla\).
First, we calculate the gradient with respect to the weights between the output layer and the hidden layer,
\(\nabla^{(2)} = \delta^{(3)}a^{(2)} \hspace{1cm} (14)\)
Next, an image to illustrate equation \((14)\). Just as when we calculated \(\delta\), take a look at the image: \(\nabla^{(2)}\) can be written as,
\(\nabla^{(2)} = \frac{\partial J}{\partial a^{(3)}}\frac{\partial a^{(3)}}{\partial z^{(3)}}\frac{\partial z^{(3)}}{\partial w^{(2)}} \hspace{1cm} (15)\)
\(= \delta^{(3)}\frac{\partial z^{(3)}}{\partial w^{(2)}}\)
Since,
\(z^{(3)} = w^{(2)}a^{(2)} \Rightarrow \frac{\partial}{\partial w^{(2)}}\left(z^{(3)}\right) = \left[w^{(2)}a^{(2)}\right]' = a^{(2)}\)
then equation \((15)\) is equal to equation \((14)\),
\((15) \Leftrightarrow \delta^{(3)}a^{(2)} \Leftrightarrow (14)\)
Okay, we continue with the gradient with respect to the weights between the hidden layer and the input layer,
\(\nabla^{(1)} = \delta^{(2)}x^{(1)} \hspace{1cm} (16)\)
Again, a final image to describe what we are doing. Following the same steps, \(\nabla^{(1)}\) can be written as,
\(\nabla^{(1)} = \frac{\partial J}{\partial a^{(3)}}\frac{\partial a^{(3)}}{\partial z^{(3)}}\frac{\partial z^{(3)}}{\partial a^{(2)}}\frac{\partial a^{(2)}}{\partial z^{(2)}}\frac{\partial z^{(2)}}{\partial w^{(1)}} \hspace{1cm} (17)\)
\(= \delta^{(2)}\frac{\partial z^{(2)}}{\partial w^{(1)}}\)
Since,
\(z^{(2)} = w^{(1)}x^{(1)} \Rightarrow \frac{\partial}{\partial w^{(1)}}\left(z^{(2)}\right) = \left[w^{(1)}x^{(1)}\right]' = x^{(1)}\)
then equation \((17)\) is equal to equation \((16)\),
\((17) \Leftrightarrow \delta^{(2)}x^{(1)} \Leftrightarrow (16)\)
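Continuing the earlier sketch, equations \((14)\) and \((16)\) become two matrix products; the transposes are again only there so that each gradient ends up with the same shape as the weight matrix it belongs to.

```python
# Equation (14): gradient w.r.t. the weights between hidden and output layer
grad2 = delta3.dot(a2.T)      # shape (1, 3), same as w2

# Equation (16): gradient w.r.t. the weights between input and hidden layer
grad1 = delta2.dot(X.T)       # shape (3, 2), same as w1
```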
We are done calculating the gradients \(\nabla^{(2)}\) and \(\nabla^{(1)}\). Let's go to the final step, where we use them to update the weights.
>>> Part 3: Update the weights \(W\)
Do you remember the section on the ADAptive LInear NEuron (Adaline)? When we update the weights, we take a step in the opposite direction of the gradient. So we first calculate \(\Delta W\),
\(\Delta W = -\eta\nabla\)
Where,
- \(\eta\) is the learning rate
- \(\nabla\) is the gradient we calculated before
Then we update the weights,
\(W = W + \Delta W\)
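In code, the update is one line per weight matrix; continuing the sketch above, with a hypothetical learning rate:

```python
eta = 0.1                     # learning rate (an arbitrary choice here)

w2 += -eta * grad2            # Delta W^(2) = -eta * grad^(2)
w1 += -eta * grad1            # Delta W^(1) = -eta * grad^(1)
```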
Note that all of the above covers only a single epoch; to achieve successful learning, we must repeat it over many epochs. In the real world, this simple Multi-layer Neural Network will still struggle with problems such as expensive computation, over-fitting, and so on, so in the next section we introduce some optional techniques for optimization. Before that, let's implement a simple Multi-layer Neural Network to apply everything we have learned and see how it solves the problem of the Perceptron. It's definitely fun.