TrisZaska's Machine Learning Blog

The math behind the Perceptron rule

1. Introduction
2. History and Overview about Artificial Neural Network
3. Single neural network
4. Multi-layer neural network
5. Install and using Multi-layer Neural Network to classify MNIST data
6. Summary
7. References

The Perceptron rule

The Perceptron rule is the "heart" that makes the Perceptron learn, in 2 easy steps:
  1. Initialize the weights to 0 or to small random values between -1 and 1
  2. For each training sample \(x^{(i)}\):
    • Calculate the error between the target output and the actual output
    • Update the weights
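The two steps above can be sketched in Python roughly like this (a minimal illustration; the function name `perceptron_train`, the NumPy usage, and the random seed are my own choices, not from the post):

```python
import numpy as np

def perceptron_train(X, y, eta=0.1, epochs=10):
    # Step 1: initialize small random weights in [-1, 1]; w[0] is the bias
    rng = np.random.default_rng(0)
    w = rng.uniform(-1.0, 1.0, X.shape[1] + 1)
    # Step 2: for each training sample, compute the error and update the weights
    for _ in range(epochs):
        for xi, target in zip(X, y):
            output = 1 if w[0] + xi @ w[1:] >= 0.0 else -1  # threshold unit
            error = target - output                         # error E
            w[1:] += eta * error * xi                       # update weights
            w[0] += eta * error                             # bias input is 1
    return w
```

For example, training on the logical OR data (labels in {-1, 1}), which is linearly separable, yields weights that classify all four points correctly.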
So, let's discuss the Perceptron rule in detail. Firstly, what is the main idea of the Perceptron learning rule? Looking at the steps above, the rule tries to minimize the error by adjusting the weights, so that the next time around we hope the error will shrink toward 0, right? As we mentioned before, for each training sample \(x^{(i)}\), we update the weight of each \(j^{th}\) neuron with the formula,
$$w_j = w_j + \Delta w_j$$
Where, \( \Delta w_j = \eta (target^{(i)} - output^{(i)})x^{(i)}_j\hspace{1cm}(2)\)
In this equation \((2)\), where
  • \(\eta\) is the learning rate, a constant between 0.0 and 1.0
  • \(target^{(i)}\) is the true class label
  • \(output^{(i)}\) is the currently predicted class label
  • \(x^{(i)}_j\) is the \(j^{th}\) input of the \(i^{th}\) training sample
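To make equation \((2)\) concrete, here is a one-line sketch of the weight change for a single input (the function name `delta_w` is only illustrative):

```python
def delta_w(eta, target, output, x):
    # Equation (2): weight change for one input of one training sample
    return eta * (target - output) * x

# Misclassified sample: target = 1, output = -1, so (target - output) = 2
step = delta_w(0.1, 1, -1, 0.5)   # 0.1 * 2 * 0.5 = 0.1
```

Note that when the prediction is correct, \((target - output) = 0\) and the weight does not change at all.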
I assume you're a very curious, enthusiastic person and want to dig deeper into the equation of the Perceptron learning rule. You may wonder:
  • What is the learning rate \(\eta\) and why is \(\eta\) important?
  • Why do we have the term \((target^{(i)} - output^{(i)})\)?
  • Why do we multiply by the extra input \(x^{(i)}_j\)?
Let's discuss each question right now:
  • What is the learning rate \(\eta\) and why is \(\eta\) important?
  • The learning rate \(\eta\) is a constant between 0.0 and 1.0 that controls how far we move each time we try to minimize the error in an epoch. Why \(\eta\) is important and how to choose it properly will be discussed in more detail in the upcoming topic on Multi-layer Neural Networks.
  • Why do we have the term \((target^{(i)} - output^{(i)})\)?
  • Because we want to update the weights of the Perceptron to minimize the error. But what is the error? In this scope, the error \((E)\) is the distance between the true output and the predicted output, so we subtract what we actually got from what we want, to see how far off we are before we update the weights. Remember that the output of the Perceptron takes just two values, -1 and 1, so with \(E^{(i)} = target^{(i)} - output^{(i)}\):
    • If \(E^{(i)} = 0\), it means \(target^{(i)} = output^{(i)}\), so nothing changes; our Perceptron works properly
    • If \(E^{(i)} < 0\), it means \(target^{(i)} = -1\) but \(output^{(i)} = 1\), so \(E = -2\); therefore we want to decrease \(w\), and the next time \(output^{(i)}\) will move toward \(-1\)
    • If \(E^{(i)} > 0\), it means \(target^{(i)} = 1\) but \(output^{(i)} = -1\), so \(E = +2\); therefore we want to increase \(w\), and the next time \(output^{(i)}\) will move toward \(1\)
  • Why do we multiply by the extra input \(x^{(i)}_j\)?
  • This is a tricky question, but understanding why we multiply by the extra input \(x^{(i)}_j\) is key to understanding why the Perceptron rule can minimize the error. Let's start with the error \((E)\) from the previous question. There is nothing to say when \(E = 0\), but when the Perceptron predicts wrongly, how do we update the weights? For simplicity, we skip the learning rate \(\eta\), the superscript \(i\), and the subscript \(j\) in this question,
    • Assume input \(x\) is POSITIVE
      • If \(E < 0\), we already calculated \(E = -2\), so we want to decrease \(w\), right? \(w = w + \Delta w\), \(\Delta w = E = -2\), so \(w = w - 2\) (\(w\) decreased)
      • \(\Rightarrow\) When \(x \geqslant 0\), if \(w\) decreased, \(\mathbf{z = w^Tx}\) will decrease; that's right, because we want the net input \(z\) to decrease so that \(g(z)\) gets close to -1
      • If \(E > 0\), we already calculated \(E = +2\), so we want to increase \(w\), right? \(w = w + \Delta w\), \(\Delta w = E = +2\), so \(w = w + 2\) (\(w\) increased)
      • \(\Rightarrow\) When \(x \geqslant 0\), if \(w\) increased, \(\mathbf{z = w^Tx}\) will increase; that's also right, because we want the net input \(z\) to increase so that \(g(z)\) gets close to 1
    • But what if input \(x\) is NEGATIVE?
      • Increasing \(w\) will decrease the net input \(z\), so \(g(z)\) will get close to -1, but what we want is 1. What's wrong? Because when \(x < 0\) and \(w\) increases, \(w_jx_j\) becomes a large negative number; when we sum, the net input \(z = \mathbf{w^Tx}\) will contain that large negative term and definitely decreases.
      • Similarly, decreasing \(w\) will increase the net input \(z\).
    \(\Rightarrow\) That's why we need to multiply by the extra \(x^{(i)}_j\) when updating \(w\): it solves the problem when \(x^{(i)}_j\) is negative.
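The negative-input case above can be checked with a tiny numeric sketch (one feature, bias and learning rate omitted; the numbers are my own illustration):

```python
# One feature x with weight w; the output is the sign of z = w * x
x, w = -1.0, 0.5          # negative input; z = -0.5, so output = -1
target, output = 1, -1    # we want +1
E = target - output       # E = +2

# Without multiplying by x: w grows, but z moves the WRONG way
z_without_x = (w + E) * x         # (0.5 + 2) * -1 = -2.5

# With the extra factor x: w shrinks, and z moves toward the target
z_with_x = (w + E * x) * x        # (0.5 - 2) * -1 = +1.5
```

So multiplying by \(x\) flips the direction of the update exactly when the input is negative, which is what makes the rule push \(z\) the right way in both cases.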
