Multilayer networks solve the classification problem for sets that are not linearly separable by employing hidden layers, whose neurons are not directly connected to the output. The additional hidden layers can be interpreted geometrically as additional hyperplanes, which enhance the separation capacity of the network. Figure 2.2 shows typical multilayer network architectures.
This new architecture introduces a new question: how to train the hidden units for which the desired output is not known. The Backpropagation algorithm offers a solution to this problem.
Figure 2.2: Examples of Multilayer Neural Network Architectures.
The training is supervised. The basic idea is to present the input vector to the network and to calculate, in the forward direction, the output of each layer and the final output of the network. For the output layer the desired values are known, so the weights can be adjusted as for a single-layer network; in the case of the BP algorithm, according to the gradient descent rule.
To calculate the weight changes in the hidden layers, the error in the output layer is back-propagated to these layers according to the connecting weights. This process is repeated for each sample in the training set. One cycle through the training set is called an epoch. The number of epochs needed to train the network depends on various parameters, especially on the error calculated in the output layer.
The following description of the Backpropagation algorithm is based on the descriptions in [rume86],[faus94], and [patt96].
The assumed architecture is depicted in Figure 2.3. The input vector has $n$ dimensions, the output vector has $m$ dimensions, the bias (a constant input) is $-1$, and there is one hidden layer with $g$ neurons. The matrix $V$ holds the weights of the neurons in the hidden layer, and the matrix $W$ holds the weights of the neurons in the output layer. The learning parameter is $\eta$, and the momentum is $\alpha$. The choice of these parameters is discussed below.
Figure 2.3: The Backpropagation Network.
The unipolar activation function used and its derivative are given by:
$f(net) = \frac{1}{1+e^{-\lambda net}}$
$f'(net) = \frac{\lambda e^{-\lambda net}}{(1+e^{-\lambda net})^{2}}$
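As a minimal sketch, the activation function and its derivative can be written as follows. The function names `f` and `f_prime` and the default `lam=1.0` are choices made for this illustration. The derivative is expressed as $\lambda f(net)(1 - f(net))$, which is what differentiating $f$ with respect to $net$ yields and avoids computing the exponential twice.

```python
import math

def f(net, lam=1.0):
    """Unipolar sigmoid activation with steepness lam."""
    return 1.0 / (1.0 + math.exp(-lam * net))

def f_prime(net, lam=1.0):
    """Derivative of f with respect to net, written in terms of f itself."""
    out = f(net, lam)
    return lam * out * (1.0 - out)
```

Note that `math.exp` overflows for very large negative arguments of `f`, so a production version would need to guard against extreme `net` values.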
The training set $T$ consists of $P$ pairs $(x^{p}, t^{p})$, where $x^{p}$ is the input vector and $t^{p}$ is the desired output vector:
$T = \{ (x^{1}, t^{1}), \ldots, (x^{P}, t^{P}) \}$
$net_{H} = V^{T} x'$
$h_{i} = f(net_{H_{i}})$
$net_{y} = W^{T} h'$
$out_{i} = f(net_{y_{i}})$
$\delta_{out_{i}} = f'(net_{y_{i}}) \, (t_{i} - out_{i})$
$\delta_{H_{i}} = f'(net_{H_{i}}) \sum_{j=1}^{m} w_{ij} \, \delta_{out_{j}}$
$\Delta W^{T}(t) = \eta \, \delta_{out} \, {h'}^{T}$
$\Delta V^{T}(t) = \eta \, \delta_{H} \, {x'}^{T}$
$W(t+1) = W(t) + \Delta W(t) + \alpha \, \Delta W(t-1)$
$V(t+1) = V(t) + \Delta V(t) + \alpha \, \Delta V(t-1)$
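The equations above can be sketched as a single training step for one pattern. This is an illustration under stated assumptions, not a reference implementation: it assumes that $x'$ and $h'$ denote the input and hidden vectors extended by the constant bias input $-1$, it stores the matrices row-wise as $V^{T}$ and $W^{T}$, and the name `train_step` and its argument list are hypothetical.

```python
import math

def f(net, lam=1.0):
    return 1.0 / (1.0 + math.exp(-lam * net))

def f_prime(net, lam=1.0):
    out = f(net, lam)
    return lam * out * (1.0 - out)

def train_step(V, W, x, t, eta, alpha, prev_dV, prev_dW):
    """One backpropagation step for a single pattern (x, t).

    V[i][k]: weight from input k to hidden neuron i (row i of V^T).
    W[j][i]: weight from hidden neuron i to output neuron j (row j of W^T).
    The last column of each matrix multiplies the bias input -1 (assumption).
    Updates V and W in place; returns (out, dV, dW), where out is the
    network output computed before the update.
    """
    # Forward pass
    x_aug = list(x) + [-1.0]
    net_h = [sum(v * xk for v, xk in zip(row, x_aug)) for row in V]
    h_aug = [f(n) for n in net_h] + [-1.0]
    net_y = [sum(w * hi for w, hi in zip(row, h_aug)) for row in W]
    out = [f(n) for n in net_y]

    # Error terms: output layer first, then back-propagated to the hidden layer
    delta_out = [f_prime(n) * (tj - oj) for n, tj, oj in zip(net_y, t, out)]
    delta_h = [f_prime(net_h[i]) * sum(W[j][i] * delta_out[j] for j in range(len(W)))
               for i in range(len(V))]

    # Weight changes (outer products scaled by eta)
    dW = [[eta * d * hi for hi in h_aug] for d in delta_out]
    dV = [[eta * d * xk for xk in x_aug] for d in delta_h]

    # Update with momentum: M(t+1) = M(t) + dM(t) + alpha * dM(t-1)
    for M, dM, prev in ((W, dW, prev_dW), (V, dV, prev_dV)):
        for r in range(len(M)):
            for c in range(len(M[r])):
                M[r][c] += dM[r][c] + alpha * prev[r][c]
    return out, dV, dW

# Usage with hypothetical starting weights: n = 2 inputs, g = 2 hidden, m = 1 output
V = [[0.1, -0.2, 0.05], [0.3, 0.1, -0.1]]
W = [[0.2, -0.1, 0.1]]
zero_V = [[0.0] * 3 for _ in V]
zero_W = [[0.0] * 3 for _ in W]
out, dV, dW = train_step(V, W, [1.0, 0.0], [1.0], eta=0.5, alpha=0.9,
                         prev_dV=zero_V, prev_dW=zero_W)
```

The returned `dV` and `dW` are passed as `prev_dV` and `prev_dW` in the next call, which is how the momentum term sees the previous weight change.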
Training continues until the overall error in one training cycle is sufficiently small; this stop condition is given by:
$E_{max} > E$
The acceptable error $E_{max}$ has to be selected very carefully. If $E_{max}$ is too large, the network is under-trained and performs poorly; if $E_{max}$ is too small, the network will be biased towards the training set (it will be overfitted).
One measure for $E$ is the mean squared error over all $P$ patterns and $m$ outputs (taking its square root yields the root mean square error):
$E = \frac{1}{P \, m} \sum_{p=1}^{P} \sum_{i=1}^{m} (t_{i}^{p} - out_{i}^{p})^{2}$
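The error measure and the stop condition can be sketched as follows. The code implements the double sum exactly as written above, divided by $P \, m$; applying `math.sqrt` to the result would give a root mean square value. The function name, the sample data and the threshold `E_max = 0.05` are hypothetical and serve only as an illustration.

```python
def mean_squared_error(targets, outputs):
    """E = 1/(P*m) * sum over patterns p and outputs i of (t_i^p - out_i^p)^2."""
    P = len(targets)
    m = len(targets[0])
    total = sum((t - o) ** 2
                for t_p, out_p in zip(targets, outputs)
                for t, o in zip(t_p, out_p))
    return total / (P * m)

targets = [[1.0], [0.0]]   # desired outputs t^p for P = 2 patterns, m = 1
outputs = [[0.9], [0.2]]   # network outputs out^p after one epoch (made up)
E = mean_squared_error(targets, outputs)
E_max = 0.05               # hypothetical acceptable error
stop_training = E_max > E  # stop condition from the text
```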
The selection of the parameters for the Backpropagation algorithm and the initial settings of the weights influence the learning speed as well as the convergence of the algorithm.
The initial weights determine the starting point in the error landscape, which influences whether the learning process ends up in a local minimum or in the global minimum. The easiest method is to select the weights randomly from a suitable range, such as $(-0.1, 0.1)$ or $(-2, 2)$.
If the weight values are too large, the $net$ values will be large as well; the activation function then operates in its saturation region, where its derivative is close to zero, so the weight changes are near zero. For small initial weights the changes will also be very small, which makes the learning process very slow and might even prevent convergence.
More sophisticated approaches to selecting the weights can improve the learning process. One example is the Nguyen-Widrow initialization, which calculates the interval from which the weights are taken in accordance with the number of input neurons and the number of hidden neurons. There are also statistical methods to estimate suitable initial values for the weights; see [faus94, p297ff] and [patt96, p165f].
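A small sketch illustrates both points: drawing the initial weights uniformly from a chosen range, and how quickly the derivative of the unipolar activation function shrinks once $net$ becomes large. The helper `init_weights`, its default range and the sample values of $net$ are assumptions made for this illustration; $\lambda = 1$ is used throughout.

```python
import math
import random

def init_weights(rows, cols, low=-0.1, high=0.1, seed=None):
    """Return a rows x cols weight matrix drawn uniformly from [low, high]."""
    rng = random.Random(seed)
    return [[rng.uniform(low, high) for _ in range(cols)] for _ in range(rows)]

def f_prime(net, lam=1.0):
    """Derivative of the unipolar sigmoid."""
    out = 1.0 / (1.0 + math.exp(-lam * net))
    return lam * out * (1.0 - out)

V = init_weights(3, 5, seed=0)      # e.g. g = 3 hidden neurons, n = 4 inputs plus bias
slope_at_zero = f_prime(0.0)        # 0.25, the largest possible slope for lam = 1
slope_saturated = f_prime(10.0)     # roughly 4.5e-5: weight changes almost vanish
```

Since every weight change is proportional to $f'(net)$, a neuron whose $net$ value is far from zero receives changes several orders of magnitude smaller than one operating near zero.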
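The following is a sketch of the Nguyen-Widrow scheme in the form in which it is usually presented in textbooks. The scale factor $\beta = 0.7 \, g^{1/n}$, the initial range $(-0.5, 0.5)$ and the bias range $(-\beta, \beta)$ are stated here from that common presentation and should be checked against [faus94] before use. The function name `nguyen_widrow` and the row layout (one row per hidden neuron, bias weight last) are assumptions of this sketch.

```python
import math
import random

def nguyen_widrow(n, g, seed=None):
    """Initial hidden-layer weights for n inputs and g hidden neurons.

    Returns g rows of n + 1 weights; the last entry of each row is the bias weight.
    """
    rng = random.Random(seed)
    beta = 0.7 * g ** (1.0 / n)          # scale factor depends on n and g
    V = []
    for _ in range(g):
        w = [rng.uniform(-0.5, 0.5) for _ in range(n)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        w = [beta * wi / norm for wi in w]   # rescale the weight vector to length beta
        bias = rng.uniform(-beta, beta)
        V.append(w + [bias])
    return V

V = nguyen_widrow(n=4, g=3, seed=1)
```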
Figure 2.4: The Influence of the Learning Rate on the Weight Changes.
The Learning Coefficient $\eta$ determines the size of the weight changes. A small value for $\eta$ results in a very slow learning process. If the learning coefficient is too large, the large weight changes may cause the desired minimum to be missed. A useful range is between 0.05 and 2, depending on the problem. The influence of $\eta$ on the weight changes is shown in Figure 2.4.
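The effect can be reproduced on a toy problem. The sketch below runs plain gradient descent on the hypothetical one-dimensional error function $E(w) = w^{2}$, which is not a network error surface; the concrete values of $\eta$ at which learning becomes slow or unstable are specific to this function and are not meant as recommendations for Backpropagation.

```python
def descend(eta, w0=1.0, steps=20):
    """Gradient descent on E(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w
        w -= eta * grad
    return w

slow = descend(0.01)      # still far from the minimum at w = 0 after 20 steps
good = descend(0.4)       # very close to the minimum
unstable = descend(1.1)   # each step overshoots the minimum; |w| grows
```

Each step multiplies $w$ by $(1 - 2\eta)$, so for this toy function the iteration converges only if $|1 - 2\eta| < 1$.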
An improved technique is to use an adaptive learning rate. A large initial learning coefficient should help to escape from local minima, while reducing $\eta $ later should prevent the learning process from overshooting the reached minimum.
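One simple way to realise such a schedule is to let $\eta$ decay with the epoch number. The text does not prescribe a particular rule; the hyperbolic decay below, the function name `decayed_eta` and the constants `eta0` and `tau` are assumptions chosen for illustration only.

```python
def decayed_eta(epoch, eta0=1.0, tau=50.0):
    """Learning coefficient that starts at eta0 and halves after tau epochs."""
    return eta0 / (1.0 + epoch / tau)

schedule = [decayed_eta(e) for e in (0, 50, 150)]   # 1.0, 0.5, 0.25
```

The value returned for the current epoch would then be used as $\eta$ in the weight update for every pattern of that epoch.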
The Momentum $\alpha $ causes the weight changes to be dependent on more than one input pattern. The change is a linear combination of the current gradient and the previous gradient. The useful range for this parameter is between 0 and 1. For some data sets the momentum makes the training faster, while for others there may be no improvement. The momentum usually makes it less likely that the training process will get stuck in a local minimum.
In recent years an enormous number of refinements and improvements of the Backpropagation algorithm have been published. However, most of the suggested improvements are only useful if the problem meets certain conditions. For examples see [patt96, p176f], [faus94, p305ff], and [zell94, p115ff].
Nevertheless, multilayer feedforward networks trained with the Backpropagation method are probably the most widely used networks in real-world applications.