Training occurs in two stages, using the Backpropagation algorithm described in section 2.3.
In the first stage all sub-networks in the input layer are trained. The training set for each sub-network is derived from the original training set: each input vector is reduced to those components of the original vector that are connected to this particular sub-network, and it is paired with the desired output class represented in binary or 1-out-of-k coding.
In the second stage the decision network is trained. To construct its training set, each original input pattern is applied to the input layer; the resulting vector, together with the desired output class (represented in a 1-out-of-k coding), forms the training pair for the decision module.
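The construction of the two training sets can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used here: NumPy is assumed, class numbers are assumed to start at 0, each module is assumed to be connected to a contiguous block of inputs (identity permutation), and all function names as well as the `forward_fns` stand-ins for the trained modules are hypothetical.

```python
import numpy as np

def binary_code(label, n_bits):
    """Class number -> binary code, most significant bit first."""
    return np.array([(label >> b) & 1 for b in reversed(range(n_bits))], dtype=float)

def one_out_of_k(label, k):
    """Class number -> 1-out-of-k code (assumes classes are numbered 0..k-1)."""
    code = np.zeros(k)
    code[label] = 1.0
    return code

def module_training_sets(X, y, n_in, n_bits):
    """Stage 1: one training set per input module.

    Assumption: module i is connected to columns i*n_in .. (i+1)*n_in - 1 of X.
    Every module gets the same binary-coded class as its target.
    """
    targets = np.stack([binary_code(c, n_bits) for c in y])
    n_modules = X.shape[1] // n_in
    return [(X[:, i * n_in:(i + 1) * n_in], targets) for i in range(n_modules)]

def decision_training_set(X, y, forward_fns, n_in, k):
    """Stage 2: apply each pattern to the (already trained) input layer.

    forward_fns[i] is a hypothetical callable that maps the inputs of
    module i to that module's outputs. The concatenated outputs are paired
    with the 1-out-of-k coded class.
    """
    parts = [f(X[:, i * n_in:(i + 1) * n_in]) for i, f in enumerate(forward_fns)]
    intermediate = np.concatenate(parts, axis=1)
    targets = np.stack([one_out_of_k(c, k) for c in y])
    return intermediate, targets
```

With, say, 6 inputs, 3 inputs per module and 3 classes, `module_training_sets` yields two training sets whose targets are 2-bit codes, and `decision_training_set` yields one training set whose inputs are the concatenated module outputs.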
To simplify the description of the training, a small intermediate representation is used; furthermore, it is assumed that the permutation function is the identity.
The original training set is: , and where is the th component of the th input vector, is the class number, , where is the number of training instances.
The module is connected to:
The training set for the module :
for all , where is the output class
represented in a binary code.
The mapping performed by the input layer is denoted by:
The training set for the decision network:
and
. Where is the output class
represented in a 1-out-of-k code.
The mapping of the decision network is denoted by:

Figure 5.3: The Training Algorithm.
The training algorithm is summarized in Figure 5.3.
The training of each module in the input layer is independent of all other modules, so the modules can be trained in parallel. Training of a module is stopped either when that module has reached a sufficiently small error or when a defined maximum number of steps has been performed. This keeps the modules independent.
Alternatively, training can be stopped when the overall error of all modules is sufficiently small or the maximum number of steps has been performed. This assumes that training proceeds step by step, simultaneously in all modules.
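The two stopping rules can be contrasted in a short sketch. It rests on assumptions that are not part of the description above: each module is represented by a hypothetical `step()` callable that performs one training step and returns the module's current error, and the overall error is taken to be the sum of the module errors. `make_toy_step` is only a stand-in for a real Backpropagation step.

```python
def make_toy_step(start, decay):
    """Hypothetical stand-in for one training step of a module.

    Each call shrinks the error by the factor `decay` and returns it.
    """
    state = {"err": start}

    def step():
        state["err"] *= decay
        return state["err"]

    return step

def train_independently(step_fns, max_steps, tol):
    """Each module stops on its own criterion: error below tol or max_steps reached.

    The per-module loops do not interact, so they could run in parallel.
    Returns the number of steps each module used.
    """
    steps_used = []
    for step in step_fns:
        n, err = 0, float("inf")
        while n < max_steps and err >= tol:
            err = step()
            n += 1
        steps_used.append(n)
    return steps_used

def train_in_lockstep(step_fns, max_steps, tol):
    """All modules advance one step at a time, together.

    Stops when the overall error (assumed here: sum of module errors)
    drops below tol, or after max_steps. Returns the number of steps taken.
    """
    for n in range(1, max_steps + 1):
        errors = [step() for step in step_fns]
        if sum(errors) < tol:
            return n
    return max_steps
```

In the independent variant a fast-converging module finishes early, whereas in the lockstep variant every module keeps training until the joint criterion is met.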