For newcomers to neural networks, the words 'back-propagation' signal the point from which everything becomes a blur! I didn't understand the concept of BP until it was slapped quite plainly in front of me, with all the details laid out. So, for those of you having a little trouble grasping the concept, hopefully this essay will clear it up. I will assume that you have read the Perceptron essay, and that you have a basic understanding of calculus and the terminology. Also, for the C++ programmers out there, I made a BP case-study for you to look at after you have finished reading this essay.
From Single To Multi-layer

In a single-layer network, each neuron adjusts its weights according to the output that was expected of it and the output it actually gave. This can be mathematically expressed by the Perceptron Delta Rule:

    Δw_i = η(t − o)x_i

where η is the learning rate, t is the target output, o is the actual output, and x_i is the i-th input.
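The delta rule takes only a few lines of code. Here is a minimal sketch (the function and variable names are my own for illustration, not taken from the case-study code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Perceptron Delta Rule: w_i <- w_i + eta * (t - o) * x_i
// Adjusts each weight in proportion to the error (target - output)
// and the input that fed into it.
void deltaRuleUpdate(std::vector<double>& weights,
                     const std::vector<double>& inputs,
                     double target, double output, double eta) {
    for (std::size_t i = 0; i < weights.size(); ++i)
        weights[i] += eta * (target - output) * inputs[i];
}
```

Note that every weight is adjusted using the same error term, which is exactly why this rule cannot tell a hidden neuron how much *it* contributed to the error.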
This is of no use, though, when you extend the network to multiple layers to account for non-linearly separable problems. Why is this the case? When adjusting a weight anywhere in the network, you have to be able to tell what effect the change will have on the overall error of the network. To do this, you have to look at the derivative of the error function with respect to that weight.
Our biggest problem at the moment is that the hard-limiter function used by the perceptron is discontinuous, and thus non-differentiable. The most popular alternative function used with back-propagation nets is the sigmoid function:

    f(x) = 1 / (1 + e^(−x))
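Part of the sigmoid's appeal is that its derivative can be written purely in terms of its own output, which is cheap to compute during training. A small sketch (names are illustrative):

```cpp
#include <cassert>
#include <cmath>

// The sigmoid activation function: f(x) = 1 / (1 + e^-x)
double sigmoid(double x) {
    return 1.0 / (1.0 + std::exp(-x));
}

// Its derivative expressed in terms of the neuron's output o = f(x):
// f'(x) = o * (1 - o)
double sigmoidDerivative(double output) {
    return output * (1.0 - output);
}
```

Because the derivative only needs the output o, a network never has to remember the raw weighted sum once the neuron has fired.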
You can see that the function plateaus at 0 and 1 on the y-axis, and that it crosses the y-axis at 0.5, making the function 'neat' and relatively easy to differentiate: its derivative can be written in terms of its own output, f'(x) = f(x)(1 − f(x)). Therefore, through the necessary mathematics (applying the chain rule to the error function), this weight adjustment formula is obtained for an output neuron j:

    Δw_ij = η · (t_j − o_j) · f'(net_j) · o_i
This is quite a complex formula, though, so to ease computation and make the whole process more 'iteratively-friendly', we fold the error terms into a single per-neuron value d (delta):

    Δw_ij = η · d_j · o_i,   where for an output neuron   d_j = o_j(1 − o_j)(t_j − o_j)
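In code, the output-layer update then splits neatly into two small steps: compute the neuron's delta, then nudge every incoming weight by eta * delta * input. A minimal sketch, with illustrative names:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Delta for an output neuron: d_j = o_j * (1 - o_j) * (t_j - o_j)
double outputDelta(double target, double output) {
    return output * (1.0 - output) * (target - output);
}

// Apply the weight change w_ij += eta * d_j * o_i to every
// weight feeding this neuron, where o_i are the incoming outputs.
void updateWeights(std::vector<double>& weights,
                   const std::vector<double>& inputs,
                   double delta, double eta) {
    for (std::size_t i = 0; i < weights.size(); ++i)
        weights[i] += eta * delta * inputs[i];
}
```

Notice that the o_j(1 − o_j) factor is just the sigmoid derivative from earlier, evaluated at the neuron's own output.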
This is now the formula used for all the neurons in the network. Yet how does the network calculate the error for the hidden layers? Hidden layers require a different definition of d (delta). We have to know what effect changing a weight will have on the error at the output of the network, so once again we need the derivative of the error with respect to that weight. For the sake of simplicity I am skipping over the mathematical derivation of the delta rule; it can be shown that for neuron q in hidden layer p, delta is:

    d_pq = o_pq(1 − o_pq) · Σ_k [ d_(p+1)k · w_(p+1)qk ]

where the sum runs over every neuron k in layer p+1, and w_(p+1)qk is the weight connecting neuron q to neuron k.
Notice where the 'back' part of 'back-propagation' comes in? At the end of the equation is d_(p+1): each delta value for a hidden layer requires that the delta values for the layer after it have already been calculated. So, in a 3-layer network, the output-layer deltas are calculated first, using the first delta formula shown. These values are then used to calculate the deltas for all remaining hidden layers using the formula shown above. Hence the name 'back-propagation', as the error from the output layer is propagated backwards through the network, layer by layer.
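The 'back' step above can be sketched as a single function: given a hidden neuron's output, the deltas of the layer after it, and the weights connecting it forward, compute its own delta. Again, the names here are my own illustration, not the case-study code:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Delta for hidden neuron q in layer p:
//   d_pq = o_pq * (1 - o_pq) * sum_k( d_(p+1)k * w_(p+1)qk )
// nextDeltas[k] is the delta of neuron k in layer p+1, and
// weightsToNext[k] is the weight from this neuron to neuron k.
double hiddenDelta(double output,
                   const std::vector<double>& nextDeltas,
                   const std::vector<double>& weightsToNext) {
    double weightedErrorSum = 0.0;
    for (std::size_t k = 0; k < nextDeltas.size(); ++k)
        weightedErrorSum += nextDeltas[k] * weightsToNext[k];
    return output * (1.0 - output) * weightedErrorSum;
}
```

Calling this for each hidden layer in turn, starting from the layer just before the output, is the whole backward pass; the weight updates themselves use the same Δw = η·d·o rule as the output layer.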
Back-propagation is not (in my opinion) a concept easily grasped without some hands-on experience. Therefore, if you are a programmer, first look at the code I wrote, then experiment with a BP network of your own. Try different architectures, including ones with multiple outputs, to get yourself familiar with the formulas and algorithms involved.