
A Fast Network

For the analysis the following assumptions are made: the number of weights of a network M is denoted by ν(M); the time needed to train a network with ν weights is denoted by τ(ν), and τ is assumed to be monotonically increasing; bias weights are not counted.

Training the new network architecture is faster than training a monolithic network on the same problem, for three reasons:

  1. The number of connections in the modular network, and hence the number of weights, is much smaller than in a monolithic MLP. Fewer weights mean fewer operations during BP training, which directly speeds up the learning procedure.

    Consider a modular network with ten input modules, each with n inputs, hm = 4 hidden layer neurons, and k outputs. The decision module has 10 k inputs, hm = 4 hidden layer neurons, and k outputs. This is denoted by Mmod = (l, k, 10, (n, k, [hm]), (10 k, k, [hm])), where l = 10 n is the total number of inputs.

    A monolithic network with the same number of inputs and outputs, and with two hidden layers of hs neurons each, is denoted by BP = (l, k, [hs, hs]).

    For the number of neurons to be equal in the two networks:

    10 (hm + 1) + hm + k = 2 hs + k

    ⇒ hs = (11 hm + 10) / 2 = 27

    The number of weights in each network is:

    ν(BP) = 27 l + 27·27 + 27 k = 27 l + 27 k + 729

    ν(Mmod) = 10 (4 n + 4 k) + 44 k = 4 l + 84 k

    If the input dimension is sufficiently large so that

    23 l > 57 k - 729, i.e. roughly l > 2.5 k - 32,

    then

    ν(BP) > ν(Mmod)

    and hence

    τ(ν(BP)) > τ(ν(Mmod))

    since τ is monotonically increasing.

  2. The modules in the input layer are mutually independent, so their training can be performed in parallel. The training time of a fully parallel implementation is the maximum time needed to train one of the input modules plus the time needed to train the decision module. Therefore, the number of weights that determines the training time of a parallel implementation is only the number of weights in one input module plus the number of weights in the decision module. With Mi an input module and Md the decision module, the training time T is:

    T = τ(ν(Mi)) + τ(ν(Md))

    Using the example from above (Mmod and BP), the speed-up is significant. The number of inputs per module is assumed to be larger than sixteen (n > 16).

    The number of weights to consider for the training time in each case is:

    ν(BP) = 27 l + 27 k + 729

    ν(Mi) + ν(Md) = (4 n + 4 k) + 44 k = 4 n + 48 k

    The ratio between the two weight counts (with k = 2, l = 10 n, and n > 16, so that 4 n + 96 < 10 n) is:

    ν(BP) / (ν(Mi) + ν(Md)) = (270 n + 783) / (4 n + 96) > (270 n + 783) / (10 n) > (270 n) / (10 n) = 27

    The number of weights that determines the training time in the parallel case is therefore at least 27 times smaller than in the monolithic MLP (both weight counts are re-checked in the first code sketch after this list).

  3. Splitting the training vector into parts often helps to focus on common attributes. Consider the following (admittedly contrived) example:

    Original Set                  Set MLP1          Set MLP2
    x1 x2 x3 x4 x5 x6 | y         x1 x2 x3 | y      x4 x5 x6 | y
    0  0  0  0  0  1  | 0         0  0  0  | 0      0  0  1  | 0
    0  0  0  0  1  0  | 0         0  0  0  | 0      0  1  0  | 0
    0  0  0  1  0  0  | 0         0  0  0  | 0      1  0  0  | 0
    1  1  0  1  0  1  | 1         1  1  0  | 1      1  0  1  | 1
    1  0  1  1  0  1  | 1         1  0  1  | 1      1  0  1  | 1
    0  1  1  1  0  1  | 1         0  1  1  | 1      1  0  1  | 1

    Class `0' is determined by x1, x2, and x3, and will be learned very quickly by MLP1, which sees the tuple `0 0 0 : 0' three times during one training cycle.

    Similarly, class `1' is determined by x4, x5, and x6, and will be learned quickly by MLP2, which sees the tuple `1 0 1 : 1' three times.

    It is unlikely that a real-world data set has exactly the structure of this example. However, such partial regularities can conceivably produce significant improvements, particularly for large input dimensions (the split is shown concretely in the second sketch below).
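
The following small Python sketch is not part of the original analysis; it merely re-checks the weight counts used in points 1 and 2. It assumes exactly the architecture described above (ten input modules with hm = 4 hidden neurons, a decision module with 10 k inputs, a monolithic MLP with two hidden layers of hs = 27 neurons, no bias weights); the function names are illustrative only.

    def mlp_weights(layers):
        # Number of weights (ignoring biases) of an MLP with the given layer sizes.
        return sum(a * b for a, b in zip(layers, layers[1:]))

    def modular_weights(n, k, h_m=4, modules=10):
        # One input module (n, k, [h_m]) and the decision module (10k, k, [h_m]).
        input_module = mlp_weights([n, h_m, k])
        decision = mlp_weights([modules * k, h_m, k])
        return modules * input_module + decision, input_module, decision

    def monolithic_weights(l, k, h_s=27):
        # Weights of the monolithic network BP = (l, k, [h_s, h_s]).
        return mlp_weights([l, h_s, h_s, k])

    if __name__ == "__main__":
        n, k = 20, 2                              # example values, n > 16
        l = 10 * n
        nu_mod, nu_input, nu_decision = modular_weights(n, k)
        nu_bp = monolithic_weights(l, k)
        nu_parallel = nu_input + nu_decision      # point 2: one input module plus decision module
        print("nu(BP)      =", nu_bp)             # 27 l + 27 k + 729
        print("nu(Mmod)    =", nu_mod)            # 4 l + 84 k
        print("nu parallel =", nu_parallel)       # 4 n + 48 k
        print("ratio       = %.1f" % (nu_bp / nu_parallel))

For n = 20 and k = 2 this reports ν(BP) = 6183, ν(Mmod) = 968, a parallel weight count of 176, and a ratio of about 35, consistent with the bound of 27 derived in point 2.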


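The data split in point 3 can be made concrete with a short sketch (again purely illustrative, using the toy data from the table above): each six-component training vector is cut into the half seen by MLP1 (x1, x2, x3) and the half seen by MLP2 (x4, x5, x6), both keeping the class label y.

    from collections import Counter

    original_set = [
        # (x1, x2, x3, x4, x5, x6), y
        ((0, 0, 0, 0, 0, 1), 0),
        ((0, 0, 0, 0, 1, 0), 0),
        ((0, 0, 0, 1, 0, 0), 0),
        ((1, 1, 0, 1, 0, 1), 1),
        ((1, 0, 1, 1, 0, 1), 1),
        ((0, 1, 1, 1, 0, 1), 1),
    ]

    set_mlp1 = [(x[:3], y) for x, y in original_set]   # x1, x2, x3 -> y
    set_mlp2 = [(x[3:], y) for x, y in original_set]   # x4, x5, x6 -> y

    # MLP1 sees the pattern ((0, 0, 0), 0) three times per training cycle and
    # MLP2 sees ((1, 0, 1), 1) three times, so each sub-network is presented
    # its class-defining regularity repeatedly.
    print(Counter(set_mlp1))
    print(Counter(set_mlp2))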
