For the analysis the following assumptions are made:

- The modules used in the modular neural network have only one hidden layer with four neurons.
- The long intermediate representation is used for the modular neural system (for the short representation the number of weights is even smaller).
- The function $\nu$ takes a module and returns its number of weight connections (e.g. $\nu(M) = i$, where $M = (a, b, [h])$ and $i = a\,h + h\,b$ for a one-hidden-layer module, and where $M = (a, b, [h_1, h_2])$ and $i = a\,h_1 + h_1 h_2 + h_2\,b$ for a two-hidden-layer module).
- The function $\tau$ represents the time needed to train a module. The analysis assumes that $\tau$ depends only on the number of weights and is monotonically increasing: $i > j \Rightarrow \tau(i) > \tau(j)$. This means the training time is longer if there are more weights in the network. In a real system it is likely to depend on other parameters as well.
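The weight-count function $\nu$ can be sketched in Python as follows (a minimal illustration, not part of the original work; the function name `nu` and the tuple layout mirror the module notation $(a, b, [h_1, \ldots])$ above, and bias weights are ignored as in the formulas):

```python
def nu(module):
    """Number of weight connections in a module (a, b, [h1, ...]).

    a: number of inputs, b: number of outputs, hidden: list of hidden-layer sizes.
    Weights are counted layer by layer: inputs -> h1 -> ... -> outputs.
    """
    a, b, hidden = module
    layers = [a] + list(hidden) + [b]
    return sum(layers[i] * layers[i + 1] for i in range(len(layers) - 1))

# One hidden layer: nu = a*h + h*b
assert nu((3, 2, [4])) == 3 * 4 + 4 * 2        # 20
# Two hidden layers: nu = a*h1 + h1*h2 + h2*b
assert nu((3, 2, [4, 5])) == 3 * 4 + 4 * 5 + 5 * 2  # 42
```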

Training the new network architecture is faster than training a monolithic network on the same problem for three reasons:

- The number of connections in the modular network, and hence the number of weights, is much smaller than in a monolithic MLP. Fewer weights lead to fewer operations during BP training, which directly speeds up the learning procedure. Consider a modular network with ten input modules, each with $n$ inputs, $h_m = 4$ hidden-layer neurons, and $k$ outputs. The decision module has $10\,k$ inputs, $h_m = 4$ hidden-layer neurons, and $k$ outputs. This is denoted by $M_{mod} = (l, k, 10, (n, k, [h_m]), (10\,k, k, [h_m]))$, where $l = 10\,n$. A monolithic network with the same number of inputs and outputs, and with two hidden layers of $h_s$ neurons each, is denoted by $BP = (l, k, [h_s, h_s])$. For the number of neurons in the two networks to be equal:

$10\,(h_m + 1) + h_m + k = 2\,h_s + k \;\Rightarrow\; h_s = \frac{11\,h_m + 10}{2} = 27$

The number of weights in each network is:

$\nu(BP) = 27\,l + 27\,k + 729$

$\nu(M_{mod}) = 4\,l + 84\,k$

If the input is sufficiently large, so that

$l > 32 + 2.5\,k$,

then

$\nu(BP) > \nu(M_{mod})$

and hence

$\tau(\nu(BP)) > \tau(\nu(M_{mod}))$,

since $\tau$ is monotonically increasing.
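The two weight counts and the threshold condition can be checked numerically; the sketch below (an illustration under the stated assumptions, with helper names `nu`, `nu_bp`, and `nu_mod` chosen here) builds both counts from the module sizes defined in the text ($h_m = 4$, $h_s = 27$, $l = 10\,n$):

```python
def nu(module):
    """Weight connections in a module (a, b, [hidden sizes]), biases ignored."""
    a, b, hidden = module
    layers = [a] + list(hidden) + [b]
    return sum(layers[i] * layers[i + 1] for i in range(len(layers) - 1))

def nu_bp(l, k, hs=27):
    # Monolithic network BP = (l, k, [hs, hs])
    return nu((l, k, [hs, hs]))                 # = 27*l + 27*k + 729

def nu_mod(n, k, hm=4):
    # Ten input modules (n, k, [hm]) plus a decision module (10k, k, [hm])
    return 10 * nu((n, k, [hm])) + nu((10 * k, k, [hm]))  # = 4*l + 84*k, l = 10n

# Verify the closed forms and the sufficient condition l > 32 + 2.5k
for n in range(1, 200):
    for k in range(1, 20):
        l = 10 * n
        assert nu_bp(l, k) == 27 * l + 27 * k + 729
        assert nu_mod(n, k) == 4 * l + 84 * k
        if l > 32 + 2.5 * k:
            assert nu_bp(l, k) > nu_mod(n, k)
```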

- The modules in the input layer are mutually independent, so the training can be performed in parallel. The training time of a fully parallel implementation is the maximum time needed to train one of the input modules plus the time to train the decision module. Therefore, the number of weights that counts toward the training time in a parallel implementation is only the number of weights in one input module plus the number of weights in the decision module. Assuming $M_i$ is a module in the input layer and $M_d$ is the decision module, the training time $T$ can be calculated as follows:

$T = \tau(\nu(M_i)) + \tau(\nu(M_d))$

For the example above ($M_{mod}$ and $BP$) the speed-up is significant. The number of inputs per module is assumed to be larger than eight ($n > 8$). The number of weights to consider for training in each network is:

$\nu(BP) = 27\,l + 27\,k + 729$

$\nu(M_i) + \nu(M_d) = 2\,(4\,n + 4\,k)$

The ratio between the numbers of weights to train ($k = 2$, $n > 8$, $l = 10\,n$) is:

$\frac{\nu(BP)}{\nu(M_i) + \nu(M_d)} = \frac{270\,n + 783}{8\,n + 16} > \frac{270\,n + 783}{10\,n} > \frac{270\,n}{10\,n} = 27$

The number of weights that governs the training time is therefore at least 27 times smaller than in a monolithic MLP.

- Splitting the training vector into parts often helps to focus on common attributes. Consider the following (admittedly contrived) example:

      Original Set                  Set MLP_1        Set MLP_2
      x1 x2 x3 x4 x5 x6 | y        x1 x2 x3 | y     x4 x5 x6 | y
       0  0  0  0  0  1 | 0         0  0  0 | 0      0  0  1 | 0
       0  0  0  0  1  0 | 0         0  0  0 | 0      0  1  0 | 0
       0  0  0  1  0  0 | 0         0  0  0 | 0      1  0  0 | 0
       1  1  0  1  0  1 | 1         1  1  0 | 1      1  0  1 | 1
       1  0  1  1  0  1 | 1         1  0  1 | 1      1  0  1 | 1
       0  1  1  1  0  1 | 1         0  1  1 | 1      1  0  1 | 1

Class `0' is determined by $x_1$, $x_2$, and $x_3$, and will be learned very quickly by MLP$_1$, which sees the tuple `0 0 0 : 0' three times during one training cycle. Similarly, class `1' is determined by $x_4$, $x_5$, and $x_6$, and will be learned quickly by MLP$_2$, which sees the tuple `1 0 1 : 1' three times. It is unlikely that a real-world data set has the same structure as this example. However, the effect can conceivably produce significant improvements, particularly for large input dimensions.
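The split of the training set can be reproduced directly; the sketch below (an illustration of the table above, with variable names chosen here) slices each input vector into the parts seen by MLP$_1$ and MLP$_2$ and counts the repeated sub-tuples:

```python
from collections import Counter

# Original training set: ((x1..x6), y), as in the table above
original = [
    ((0, 0, 0, 0, 0, 1), 0),
    ((0, 0, 0, 0, 1, 0), 0),
    ((0, 0, 0, 1, 0, 0), 0),
    ((1, 1, 0, 1, 0, 1), 1),
    ((1, 0, 1, 1, 0, 1), 1),
    ((0, 1, 1, 1, 0, 1), 1),
]

# Split each vector into the parts seen by the two input modules
set_mlp1 = [(x[:3], y) for x, y in original]   # x1, x2, x3
set_mlp2 = [(x[3:], y) for x, y in original]   # x4, x5, x6

# MLP_1 sees '0 0 0 : 0' three times per training cycle ...
assert Counter(set_mlp1)[((0, 0, 0), 0)] == 3
# ... and MLP_2 sees '1 0 1 : 1' three times
assert Counter(set_mlp2)[((1, 0, 1), 1)] == 3
```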

Wed Oct 4 16:45:34 CEST 2000