Splitting up the training set into subsets can also introduce problems. The number of identical input vectors with different possible output values may increase, especially for modules with a small number of input variables. Consider the worst case, a 4-bit parity problem:

| Original Set | | Set MLP$_1$ | | Set MLP$_2$ | |
|:---:|:---:|:---:|:---:|:---:|:---:|
| $x_1\;x_2\;x_3\;x_4$ | $y$ | $x_1\;x_2$ | $y$ | $x_3\;x_4$ | $y$ |
| 0 0 0 0 | 0 | 0 0 | 0 | 0 0 | 0 |
| 0 0 0 1 | 1 | 0 0 | 1 | 0 1 | 1 |
| 0 0 1 0 | 1 | 0 0 | 1 | 1 0 | 1 |
| 0 0 1 1 | 0 | 0 0 | 0 | 1 1 | 0 |
| 0 1 0 0 | 1 | 0 1 | 1 | 0 0 | 1 |
| 0 1 0 1 | 0 | 0 1 | 0 | 0 1 | 0 |
| 0 1 1 0 | 0 | 0 1 | 0 | 1 0 | 0 |
| 0 1 1 1 | 1 | 0 1 | 1 | 1 1 | 1 |
| 1 0 0 0 | 1 | 1 0 | 1 | 0 0 | 1 |
| 1 0 0 1 | 0 | 1 0 | 0 | 0 1 | 0 |
| 1 0 1 0 | 0 | 1 0 | 0 | 1 0 | 0 |
| 1 0 1 1 | 1 | 1 0 | 1 | 1 1 | 1 |
| 1 1 0 0 | 0 | 1 1 | 0 | 0 0 | 0 |
| 1 1 0 1 | 1 | 1 1 | 1 | 0 1 | 1 |
| 1 1 1 0 | 1 | 1 1 | 1 | 1 0 | 1 |
| 1 1 1 1 | 0 | 1 1 | 0 | 1 1 | 0 |
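The conflict in the table above can be made explicit with a short sketch: projecting the 4-bit parity training set onto the two inputs seen by one module (here, the first two, as in Set MLP$_1$) leaves every 2-bit input associated with both output values.

```python
from itertools import product

# 4-bit parity truth table: y = x1 XOR x2 XOR x3 XOR x4
table = [(bits, sum(bits) % 2) for bits in product((0, 1), repeat=4)]

# Project the training set onto the first two inputs (the set seen by MLP_1)
projected = {}
for bits, y in table:
    projected.setdefault(bits[:2], set()).add(y)

# Every 2-bit input now appears with both output values
for xs, ys in sorted(projected.items()):
    print(xs, "->", sorted(ys))
```

Each of the four possible 2-bit inputs is paired with both 0 and 1, so the module alone has no basis for preferring either output.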

To discuss this problem more generally, the following definition is necessary. $P(y=a)$ is the probability of the output variable $y$ taking the value $a$, and $P(y=a \mid x=b)$ is the conditional probability of $y=a$ given $x=b$.

**Definition: A Set of Statistically Neutral Variables**

Consider the function $f(x_1, \ldots, x_n, \ldots, x_m) = y$.
A subset of the inputs $\{x_1, \ldots, x_n\}$ with $n < m$ is called
*statistically neutral* if knowledge of the values of these variables
does not increase the knowledge of the result $y$.

Formally: the set of input
variables $\{x_1, \ldots, x_n\}$ is statistically neutral if

$$P(y=a) \; = \; P(y=a \mid x_1 = b_1 \wedge \ldots \wedge x_n = b_n)$$

for all values $a, b_1, \ldots, b_n$.

If the whole set of input variables and all possible subsets for all modules are statistically neutral, the network cannot learn the task. If only some of the modules are supplied with a statistically neutral set of input variables, the network may still perform satisfactorily.
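The defining condition can be checked empirically on a training set. The following is a minimal sketch that estimates the probabilities from the 4-bit parity table and verifies that $\{x_1, x_2\}$ is statistically neutral; the helper `p` and its parameters are illustrative, not part of the original text.

```python
from itertools import product

# 4-bit parity training set
table = [(bits, sum(bits) % 2) for bits in product((0, 1), repeat=4)]

def p(y_val, subset=None, values=None):
    """Estimate P(y = y_val), or P(y = y_val | x_subset = values), from the table."""
    rows = table
    if subset is not None:
        rows = [(b, y) for b, y in rows
                if all(b[i] == v for i, v in zip(subset, values))]
    return sum(1 for _, y in rows if y == y_val) / len(rows)

# P(y=1) = 0.5, and P(y=1 | x1=b1, x2=b2) = 0.5 for every (b1, b2),
# so {x1, x2} satisfies the definition of a statistically neutral set.
print(p(1))
for vals in product((0, 1), repeat=2):
    print(vals, p(1, subset=(0, 1), values=vals))
```

Conditioning on the two variables leaves the output distribution unchanged, which is exactly the equality in the definition above.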

In [ston95] it is shown that monolithic multilayer feedforward networks can learn statistically neutral problems, but they cannot generalize on them.

Such a situation is very unlikely in real-world data, and this expectation was confirmed during experimentation.

In tasks with a large input space in particular, such as image recognition, this problem can be ignored. In tasks with a small number of input attributes, one way to reduce the likelihood of statistically neutral input sets is to recode the data with a sparse code.
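One common form of sparse coding is a one-of-$n$ (one-hot) encoding; assuming this is the kind of recoding meant, a minimal sketch:

```python
def one_hot(value, n_values):
    """Recode an integer attribute as a sparse one-of-n bit vector."""
    return [1 if i == value else 0 for i in range(n_values)]

# A 4-valued attribute becomes four binary inputs instead of two:
print(one_hot(2, 4))  # -> [0, 0, 1, 0]
```

Spreading each attribute over more input units enlarges the input space, which makes it less likely that a module receives an input subset that is statistically neutral.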

Wed Oct 4 16:45:34 CEST 2000