
## A Fast Network

For the analysis the following assumptions are made:

• The modules used in the modular neural network have only one hidden layer with four neurons.
• The long intermediate representation is used for the modular neural system (for the short representation the number of weights is even smaller).
• The function $ν$ takes a module and returns its number of weight connections. For a one-hidden-layer module $M = (a, b, [h])$ this gives $ν(M) = ah + hb$; for a two-hidden-layer module $M = (a, b, [h_1, h_2])$ it gives $ν(M) = ah_1 + h_1 h_2 + h_2 b$.
• The function $τ$ gives the time needed to train a module. The analysis assumes that $τ$ depends only on the number of weights and is monotonically increasing: $i > j ⇒ τ(i) > τ(j)$. That is, training takes longer when the network contains more weights. In a real system the training time is likely to depend on other parameters as well.
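As a concrete illustration, the weight-count function $ν$ can be sketched in a few lines of Python (a helper written for this text, not part of the original analysis; bias weights are ignored, matching the formulas above):

```python
def nu(module):
    """Number of weight connections in a fully connected module.

    `module` follows the notation from the text: (a, b, hidden) with
    a inputs, b outputs, and a list of hidden layer sizes.
    Bias weights are ignored, as in the formulas above.
    """
    a, b, hidden = module
    layers = [a] + list(hidden) + [b]
    # Sum of products of adjacent layer sizes: a*h1 + h1*h2 + ... + h_last*b
    return sum(layers[i] * layers[i + 1] for i in range(len(layers) - 1))

print(nu((3, 2, [4])))     # 3*4 + 4*2 = 20
print(nu((3, 2, [4, 5])))  # 3*4 + 4*5 + 5*2 = 42
```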

Training the new network architecture is faster than training a monolithic network on the same problem for three reasons:

1. The number of connections in the modular network, and hence the number of weights, is much smaller than in a monolithic MLP. Fewer weights mean fewer operations during BP training, which directly speeds up the learning procedure.

Consider a modular network with ten input modules, each with $n$ inputs, $h_m = 4$ hidden-layer neurons, and $k$ outputs. The decision module has $10k$ inputs, $h_m = 4$ hidden-layer neurons, and $k$ outputs. This is denoted by $M_{mod} = (l, k, 10, (n, k, [h_m]), (10k, k, [h_m]))$, where $l = 10n$.

A monolithic network with the same number of inputs and outputs, and with two hidden layers of $h_s$ neurons each, can be denoted by $BP = (l, k, [h_s, h_s])$.

For the number of neurons in the two networks to be equal:

$10(h_m + 1) + h_m + k = 2 h_s + k$

$⇒ h_s = (11 h_m + 10)/2 = 27$

The number of weights in each network is:

$ν(BP) = 27 l + 27 k + 729$

$ν(M_{mod}) = 4l + 84k$

If the input is sufficiently large so that:

$l>32 + 2.5 k$

then

$ν(BP) > ν(M_{mod})$

and hence

$τ(ν(BP)) > τ(ν(M_{mod}))$

since $τ$ is monotonically increasing.
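The derivation above can be checked numerically. The following sketch uses the sizes from the text ($h_m = 4$, $h_s = 27$) and picks $n = 10$, $k = 2$ as illustrative values:

```python
def nu(a, b, hidden):
    # Weight connections of a fully connected module, biases ignored.
    layers = [a] + hidden + [b]
    return sum(layers[i] * layers[i + 1] for i in range(len(layers) - 1))

n, k, h_m, h_s = 10, 2, 4, 27   # h_m, h_s from the text; n, k chosen for illustration
l = 10 * n

# Neuron counts match: 10(h_m + 1) + h_m + k = 2*h_s + k
assert 10 * (h_m + 1) + h_m + k == 2 * h_s + k

# Monolithic MLP: BP = (l, k, [27, 27])
nu_bp = nu(l, k, [h_s, h_s])
assert nu_bp == 27 * l + 27 * k + 729

# Modular network: ten input modules (n, k, [4]) plus a decision module (10k, k, [4])
nu_mod = 10 * nu(n, k, [h_m]) + nu(10 * k, k, [h_m])
assert nu_mod == 4 * l + 84 * k

print(nu_bp, nu_mod)  # 3483 568
assert l > 32 + 2.5 * k and nu_bp > nu_mod
```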

2. The modules in the input layer are mutually independent, so they can be trained in parallel. The training time of a fully parallel implementation is the maximum time needed to train one of the input modules plus the time to train the decision module. The number of weights that determines the training time in a parallel run is therefore only the number of weights in one input module plus the number of weights in the decision module. With $M_i$ a module in the input layer and $M_d$ the decision module, the training time $T$ can be calculated as follows:

$T = τ(ν(M_i)) + τ(ν(M_d))$

Using the example from above ($M_{mod}$ and $BP$), the speed-up is significant. The number of inputs per module is assumed to be larger than eight ($n > 8$).

The number of weights to consider for training in each network is:

$ν(BP) = 27 l + 27 k + 729$

$ν(M_{mod}) = 2(4n + 4k)$

The ratio between the numbers of weights to train (with $k = 2$, $n > 8$, $l = 10n$) is:

$\frac{ν(BP)}{ν(M_{mod})} = \frac{270n + 783}{8n + 16} > \frac{270n + 783}{10n} > \frac{270n}{10n} = 27$

The number of weights that determines the training time is thus at least 27 times smaller than in a monolithic MLP.
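Under the same assumptions ($k = 2$, $l = 10n$), the bound can be verified numerically with a short sketch (the function names are chosen here for illustration):

```python
def monolithic_weights(n, k=2):
    # nu(BP) = 27*l + 27*k + 729 with l = 10*n
    l = 10 * n
    return 27 * l + 27 * k + 729

def parallel_weights(n, k=2):
    # Weights on the critical path of a fully parallel run: 2*(4n + 4k)
    return 2 * (4 * n + 4 * k)

# The ratio exceeds 27 for every n > 8.
for n in range(9, 1000):
    assert monolithic_weights(n) / parallel_weights(n) > 27
```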

3. Splitting the training vector into parts often helps to focus on common attributes. Consider the following (admittedly contrived) example:

**Original Set**

| $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ | $x_6$ | $y$ |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| 1 | 0 | 1 | 1 | 0 | 1 | 1 |
| 0 | 1 | 1 | 1 | 0 | 1 | 1 |

**Set MLP1**

| $x_1$ | $x_2$ | $x_3$ | $y$ |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 |
| 0 | 1 | 1 | 1 |

**Set MLP2**

| $x_4$ | $x_5$ | $x_6$ | $y$ |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 |
| 1 | 0 | 1 | 1 |
| 1 | 0 | 1 | 1 |

Class `0` is determined by $x_1$, $x_2$, and $x_3$; it will be learned very quickly by MLP1, which sees the tuple `0 0 0 : 0` three times during one training cycle.

Similarly, class `1` is determined by $x_4$, $x_5$, and $x_6$, and will be learned quickly by MLP2, which sees the tuple `1 0 1 : 1` three times.
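The effect can be reproduced with a few lines of Python (a sketch using the data set from the table above):

```python
from collections import Counter

# The contrived training set from the table: six input bits and a class label.
data = [
    ((0, 0, 0, 0, 0, 1), 0),
    ((0, 0, 0, 0, 1, 0), 0),
    ((0, 0, 0, 1, 0, 0), 0),
    ((1, 1, 0, 1, 0, 1), 1),
    ((1, 0, 1, 1, 0, 1), 1),
    ((0, 1, 1, 1, 0, 1), 1),
]

# Split each training vector: MLP1 sees x1..x3, MLP2 sees x4..x6.
set_mlp1 = Counter((x[:3], y) for x, y in data)
set_mlp2 = Counter((x[3:], y) for x, y in data)

print(set_mlp1[((0, 0, 0), 0)])  # 3 -- MLP1 sees '0 0 0 : 0' three times
print(set_mlp2[((1, 0, 1), 1)])  # 3 -- MLP2 sees '1 0 1 : 1' three times
```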

It is unlikely that a real-world data set has the same structure as this example. However, the effect can conceivably produce significant improvements, particularly for large input dimensions.

Albrecht Schmidt
Wed Oct 4 16:45:34 CEST 2000