
The math behind Forward Propagation in Multi-layer Neural Network

1. Introduction
2. History and Overview about Artificial Neural Network
3. Single neural network
4. Multi-layer neural network
5. Install and using Multi-layer Neural Network to classify MNIST data
6. Summary
7. References

Forward Propagation

Alright, forward propagation is very easy to understand since we already went through the basic Single Neural Network; if you're interested in that topic and want to read it first, don't worry, because forward propagation requires very little extra background. Basically, it is still a linear combination of inputs and weights. Since the neurons in a Multi-layer Neural Network are fully connected, the output of the previous layer becomes the input of the next layer. It's always better to visualize things with an illustration before going into the mathematics, right? So here it is,
Remember, our Multi-layer Neural Network has just 3 layers, and note that there is no connection between the biases of two layers. So look at the image above: beginning at the input layer, we start by calculating the net input of neuron \(a^{(2)}_1\) (a small numerical sketch follows the list of terms below),
\(z^{(2)}_1 = w^{(1)}_{1,0}x_0 + w^{(1)}_{1,1}x_1 + w^{(1)}_{1,2}x_2\)

\(a^{(2)}_1 = \phi\left(z^{(2)}_1\right) \hspace{1cm} (8)\)
Where,
  • \(z^{(2)}_1\) is the net input of neuron 1 in layer 2
  • \(w^{(1)}_{1,0}\) is the weight between neuron 1 in layer 2 and input \(x_0\) in layer \(1\)
  • \(w^{(1)}_{1,1}\) is the weight between neuron 1 in layer 2 and input \(x_1\) in layer \(1\)
  • \(w^{(1)}_{1,2}\) is the weight between neuron 1 in layer 2 and input \(x_2\) in layer \(1\)
In equation (8) where,
  • \(a^{(2)}_1\) is the output of neuron 1 in layer 2
  • \(\phi()\) is the activation function of neuron 1 in layer 2
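To make this concrete, here is a minimal NumPy sketch that computes the net input \(z^{(2)}_1\) as exactly this weighted sum. The input and weight values are made up purely for illustration:

```python
import numpy as np

# Hypothetical example values: x_0 is the bias input (fixed to 1),
# x_1 and x_2 are two ordinary features.
x  = np.array([1.0, 0.5, -1.2])     # [x_0, x_1, x_2]
w1 = np.array([0.1, 0.4, -0.3])     # [w_{1,0}, w_{1,1}, w_{1,2}]

# Net input of neuron 1 in layer 2: a plain linear combination.
z_2_1 = np.dot(w1, x)
print(z_2_1)                        # 0.1*1 + 0.4*0.5 + (-0.3)*(-1.2) = 0.66
```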
As we mentioned before, there are many activation functions out there, such as the linear function, the unit step function, the tanh function, etc. But one of the most commonly used is the Sigmoid function (or Logistic function). In equation \((8)\), \(\phi()\) is the Sigmoid function with the formula,

\(\phi\left(z^{(2)}_1\right) = \frac{1}{1 + e^{-z^{(2)}_1}}\)
 
Here is the graph of the Sigmoid function: basically, it squashes the input value into the range 0.0 to 1.0, and the center of the Sigmoid function is 0.5 when the input value is 0.
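A tiny Python sketch of the Sigmoid (assuming NumPy; not tied to any particular library implementation) makes this squashing behaviour easy to verify:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0.0, 1.0)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5  -> the center of the curve
print(sigmoid(5.0))    # ~0.993, close to 1.0
print(sigmoid(-5.0))   # ~0.007, close to 0.0
print(sigmoid(0.66))   # ~0.659, the activation a^{(2)}_1 for the net input computed above
```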
So far we have only calculated the net input and activation of one neuron, \(a^{(2)}_1\); we must do this for all neurons except the input neurons (which are just the raw data) and the bias neurons. For convenience when implementing, we will also use the matrix-vector representation (see the small code sketch after the list below),

\(\mathbf{z^{(2)} = w^{(1)}x^{(1)}}\)

\(\mathbf{a^{(2)} = \phi\left(z^{(2)}\right)}\)
Where,
  • \(\mathbf{x^{(1)}}\) is an [m + 1] x 1 dimensional feature vector, where m is the number of features and the extra 1 is the bias unit.
  • \(\mathbf{w^{(1)}}\) is an h x [m + 1] dimensional weight matrix, where h is the number of hidden units.
  • \(\mathbf{z^{(2)}}\) is an h x 1 dimensional vector, because \(\mathbf{z^{(2)} = w^{(1)}x^{(1)}}\): an h x [m + 1] dimensional matrix multiplied by an [m + 1] x 1 dimensional feature vector gives an h x 1 dimensional vector. If you don't know why, you can refer to this page: Matrix vector multiplication.
  • \(\mathbf{a^{(2)}}\) is the activation of \(\mathbf{z^{(2)}}\) with one bias unit added, therefore \(\mathbf{a^{(2)}}\) is an [h + 1] x 1 dimensional vector.
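Under the dimensions listed above, a minimal NumPy sketch of this single-sample vectorized step (with randomly chosen weights and arbitrary sizes, purely for illustration) might look like:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, h = 2, 3                                   # number of features, number of hidden units
rng = np.random.RandomState(1)

x1 = np.concatenate(([1.0], rng.rand(m)))     # x^{(1)}: [m + 1] x 1 vector, bias x_0 = 1
W1 = rng.rand(h, m + 1)                       # w^{(1)}: h x [m + 1] weight matrix

z2 = W1.dot(x1)                               # z^{(2)}: h x 1
a2 = np.concatenate(([1.0], sigmoid(z2)))     # a^{(2)}: [h + 1] x 1, bias unit prepended

print(z2.shape, a2.shape)                     # (3,) (4,)
```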
But so far we have only done this calculation for a single training sample, right? Usually we have many training samples, and we need to feed all of them through the network, too. So, say \(n\) is the number of training samples, then

\(\mathbf{Z^{(2)} = W^{(1)}X^{(1)}}\)

\(\mathbf{A^{(2)} = \phi\left(Z^{(2)}\right)}\)
Where,
  • \(\mathbf{X^{(1)}}\) is an n x [m + 1] dimensional feature matrix, where each of the n rows is a training sample and the m + 1 columns are the m features plus the bias.
  • \(\mathbf{W^{(1)}}\) is an h x [m + 1] dimensional weight matrix, where h is the number of hidden units.
  • \(\mathbf{Z^{(2)}}\) is an h x n dimensional matrix, because \(\mathbf{Z^{(2)} = W^{(1)}X^{(1)}}\): the h x [m + 1] dimensional weight matrix is multiplied by the feature matrix, transposed to [m + 1] x n (see the note below), and we obtain an h x n dimensional matrix.
  • \(\mathbf{A^{(2)}}\) is the activation of \(\mathbf{Z^{(2)}}\) with one bias row added, therefore \(\mathbf{A^{(2)}}\) is an [h + 1] x n dimensional matrix.
> NOTE: When implementing in Python, we must transpose the input matrix, \(\mathbf{\left[X^{(1)}\right]^T}\), to [m + 1] x n dimensions; then multiplying the h x [m + 1] matrix by the [m + 1] x n matrix gives the correct h x n dimensional result.
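Following this note, a NumPy sketch for a whole batch of n training samples (again with random weights and arbitrary sizes) could look like the following. Keeping the bias as an explicit column of ones in \(\mathbf{X^{(1)}}\), and as a row of ones in \(\mathbf{A^{(2)}}\), lets the whole weighted sum stay a single matrix multiplication, exactly as the equations above are written:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, m, h = 5, 2, 3                                     # samples, features, hidden units
rng = np.random.RandomState(1)

X1 = np.hstack([np.ones((n, 1)), rng.rand(n, m)])     # X^{(1)}: n x [m + 1], bias column of 1s
W1 = rng.rand(h, m + 1)                               # W^{(1)}: h x [m + 1]

Z2 = W1.dot(X1.T)                                     # (h x [m + 1]) . ([m + 1] x n) -> h x n
A2 = np.vstack([np.ones((1, n)), sigmoid(Z2)])        # A^{(2)}: [h + 1] x n, bias row added

print(Z2.shape, A2.shape)                             # (3, 5) (4, 5)
```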
So, we're almost done. Doing the same for the output layer (a complete forward-pass sketch follows the list below), we have

\(\mathbf{Z^{(3)} = W^{(2)}A^{(2)}}\)

\(\mathbf{A^{(3)} = \phi\left(Z^{(3)}\right)}\)
Where,
  • \(\mathbf{W^{(2)}}\) is a t x [h + 1] dimensional weight matrix, where t is the number of output units and h + 1 is the number of hidden units plus the bias unit.
  • \(\mathbf{Z^{(3)}}\) is a t x n dimensional matrix, because \(\mathbf{Z^{(3)} = W^{(2)}A^{(2)}}\): the t x [h + 1] dimensional weight matrix multiplied by the [h + 1] x n dimensional activation matrix gives a t x n dimensional matrix.
  • \(\mathbf{A^{(3)}}\) has the same dimensions as \(\mathbf{Z^{(3)}}\), because we simply map \(\mathbf{Z^{(3)}}\) through the Sigmoid function to obtain \(\mathbf{A^{(3)}}\).
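Putting both layers together, here is a minimal end-to-end forward-pass sketch (random weights and arbitrary layer sizes, purely to check that the shapes line up as described above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, W2):
    """One full forward pass; X is n x [m + 1] with the bias column already included."""
    Z2 = W1.dot(X.T)                                          # h x n
    A2 = np.vstack([np.ones((1, Z2.shape[1])), sigmoid(Z2)])  # [h + 1] x n, bias row added
    Z3 = W2.dot(A2)                                           # t x n
    A3 = sigmoid(Z3)                                          # t x n, the network's output
    return Z2, A2, Z3, A3

n, m, h, t = 5, 2, 3, 2                                       # samples, features, hidden units, outputs
rng = np.random.RandomState(1)
X1 = np.hstack([np.ones((n, 1)), rng.rand(n, m)])             # bias column of 1s
W1 = rng.rand(h, m + 1)
W2 = rng.rand(t, h + 1)

Z2, A2, Z3, A3 = forward(X1, W1, W2)
print(A3.shape)                                               # (2, 5): t outputs for each of n samples
```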
Alright, that completes step 1, Forward Propagation, which gives us the network's output. Let's go to step 2 to understand how we calculate the error and minimize the cost function.
