In one series of learning experiments, a 22-element network was used which had three layers: 10 elements on the first, 11 on the second, and 1 on the third. The single element on the third layer was the final output and was a fixed majority function of the 11 elements in the second layer. These in turn each received inputs from each of the 10 elements on the first layer and from each of the 6 basic inputs. The 10 elements on the first layer each received only the 6 basic inputs. A set of four logical functions, A, B, C, and D, was used. Function A was a linear threshold function which could be generated by the weights 8, 7, 6, 5, 4, 3, 2; functions B and C were chosen by randomly filling in a truth table; and D was the parity function.
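The layered structure just described can be sketched in code. The sketch below is illustrative only: the random weight initialization, the &plusmn;1 signal convention, and the sign-type threshold rule are assumptions, since the text does not specify them.

```python
import random

# Sketch of the 22-element network: 10 first-layer elements, 11
# second-layer elements, and 1 fixed majority-output element.
N_INPUTS = 6

def threshold(weighted_sum):
    """Threshold element: output +1 or -1 (rule assumed here)."""
    return 1 if weighted_sum >= 0 else -1

def make_layer(n_elements, n_weights):
    """Random initial weights for a layer of threshold elements."""
    return [[random.uniform(-1, 1) for _ in range(n_weights)]
            for _ in range(n_elements)]

def forward(inputs, layer1, layer2):
    """inputs: list of 6 values in {+1, -1}."""
    # First layer: 10 elements, each sees only the 6 basic inputs.
    h1 = [threshold(sum(w * x for w, x in zip(ws, inputs)))
          for ws in layer1]
    # Second layer: 11 elements, each sees the 6 basic inputs
    # and the 10 first-layer outputs.
    h2 = [threshold(sum(w * x for w, x in zip(ws, inputs + h1)))
          for ws in layer2]
    # Output: fixed majority function of the 11 second-layer elements
    # (11 is odd, so no tie can occur).
    return 1 if sum(h2) > 0 else -1

layer1 = make_layer(10, N_INPUTS)        # 6 weights per element
layer2 = make_layer(11, N_INPUTS + 10)   # 16 weights per element
```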
Table I

| A (r) | A (e) | B (r) | B (e) | C (r) | C (e) | D (r) | D (e) |
|---|---|---|---|---|---|---|---|
| 5 | 54 | 8 | 100 | 11 | 101 | 4 | 52 |
| 4 | 37 | 9 | 85 | 4 | 60 | 5 | 62 |
| 4 | 44 | 6 | 72 | 9 | 85 | 6 | 56 |
Table I gives the results of one series of runs with these functions and this network, starting from various random initial weights. The quantity r is the number of complete passes through the 64-entry truth table before the function was completely learned, while e is the total number of errors made. In evaluating the results it should be noted that an ideal learning device would make an average of 32 errors altogether on each run; the totals recorded in these runs are agreeably close to this ideal. As expected, the linear threshold function was the easiest to learn, but it is surprising that the parity function was substantially easier than the two randomly chosen functions.

Table II gives the chastening result of repeating the same experiment with all interconnecting weights removed, the final element being a fixed majority function of the other 21 elements; thus there was adaptation in one layer only. As can be seen, the results of Table I are hardly better than those of Table II, so the value of variable interconnecting weights was not being fully realized. In a later experiment the network was reduced to 12 elements and the same functions were used. In this case the presence of extra interconnecting weights actually proved to be a hindrance. A close examination of the incrementing process, however, showed that the troublesome behavior arose from the greater chance of having only a few elements (often only one) do nearly all the incrementing. It is expected that the additional refinements discussed herein will produce a considerable improvement in bringing out the full power of adaptation in multiple layers of a network.
Table II

| A (r) | A (e) | B (r) | B (e) | C (r) | C (e) | D (r) | D (e) |
|---|---|---|---|---|---|---|---|
| 7 | 47 | 18 | 192 | 8 | 110 | 4 | 48 |
| 3 | 40 | 7 | 69 | 10 | 98 | 6 | 68 |
| 4 | 43 | 7 | 82 | 4 | 47 | 6 | 46 |
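The single-adaptive-layer configuration of Table II can be sketched as follows. The perceptron-style correction used here is only a stand-in for the incrementing process, whose details are not given in this section; the seed, the learning rule, and the pass limit are all assumptions.

```python
import random
from itertools import product

def parity(x):
    """Function D: the parity of the six +/-1 inputs."""
    p = 1
    for v in x:
        p *= v
    return p

def threshold(s):
    return 1 if s >= 0 else -1

def train_single_layer(target, n_hidden=21, max_passes=100):
    """Train 21 adaptive threshold elements feeding a fixed majority
    output, as in the Table II configuration, using a simple
    perceptron-style correction (an assumed stand-in for the paper's
    incrementing process).  Returns (r, e): passes until the 64-entry
    truth table is learned (None if not learned within max_passes),
    and the total number of errors made."""
    random.seed(1)
    weights = [[random.uniform(-1, 1) for _ in range(7)]  # 6 inputs + bias
               for _ in range(n_hidden)]
    table = [tuple(1 if b else -1 for b in bits)
             for bits in product([0, 1], repeat=6)]
    total = 0
    for r in range(1, max_passes + 1):
        wrong = 0
        for x in table:
            xb = x + (1,)  # append constant bias input
            h = [threshold(sum(w * v for w, v in zip(ws, xb)))
                 for ws in weights]
            out = 1 if sum(h) > 0 else -1  # fixed majority of 21 elements
            if out != target(x):
                wrong += 1
                # nudge each element that disagrees with the desired output
                for ws, hv in zip(weights, h):
                    if hv != target(x):
                        for i in range(7):
                            ws[i] += target(x) * xb[i]
        total += wrong
        if wrong == 0:
            return r, total
    return None, total
```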
FUTURE PROBLEMS
Aside from the previous question of deciding on network structure, there are several other questions that remain to be studied in learning networks.
There is the question of requiring more than a single output from a network. If, say, two outputs are required for a given input, one +1 and the other -1, a conflict arises with the incrementing process: changes that aid one output may act against the other. Apparently the searching process described earlier, with a varying bias, must be considerably refined to find weight changes which act on all the outputs in the required way. This is far from an academic question, because there will undoubtedly be numerous cases in which the greatest part of the input-output computation is shared by all output variables; only at later levels do the outputs need to be differentiated. Hence, if full efficiency is to be achieved, it is necessary to envision a single network producing multiple outputs rather than a separate network for each output variable.
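One way to picture the refinement called for above is a search that accepts a candidate weight change only if it worsens no output. This is a hypothetical acceptance rule, not a procedure given in the text; the two fixed majority outputs and the perturbation scale are likewise assumptions made for illustration.

```python
import random
from itertools import product

def threshold(s):
    return 1 if s >= 0 else -1

def errors_per_output(weights, table, targets):
    """Count errors separately for two outputs sharing one hidden layer.
    Output 1 is a fixed majority of all hidden elements, output 2 of the
    first three (both chosen purely for illustration)."""
    errs = [0, 0]
    for x in table:
        h = [threshold(sum(w * v for w, v in zip(ws, x)))
             for ws in weights]
        outs = [1 if sum(h) > 0 else -1,
                1 if sum(h[:3]) > 0 else -1]
        for k in (0, 1):
            if outs[k] != targets[k](x):
                errs[k] += 1
    return errs

def search_step(weights, table, targets, scale=0.5):
    """One step of a refined search: perturb one hidden element's weights
    and keep the change only if neither output's error count rises."""
    before = errors_per_output(weights, table, targets)
    i = random.randrange(len(weights))
    old = weights[i][:]
    weights[i] = [w + random.uniform(-scale, scale) for w in old]
    after = errors_per_output(weights, table, targets)
    if any(a > b for a, b in zip(after, before)):
        weights[i] = old  # reject: it helped one output at the other's expense
        return False
    return True
```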
Another related question is that of using input variables that are many-valued or continuous-valued rather than two-valued. No fundamental difficulties are discernible in this case, but the matter deserves considerable study and experimentation.
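As a minimal illustration of why no fundamental difficulty arises, the same threshold element handles two-valued and continuous-valued inputs without modification; the weighted sum is formed identically in either case. The weights and inputs below are arbitrary.

```python
def threshold_element(weights, inputs, bias=0.0):
    """Weighted sum followed by a sign-type threshold."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if s >= 0 else -1

# Two-valued inputs: 1*1 + (-2)*(-1) + 0.5*1 = 3.5  ->  +1
assert threshold_element([1.0, -2.0, 0.5], [1, -1, 1]) == 1

# Continuous inputs: 0.3 - 1.8 - 0.85 = -2.35  ->  -1
assert threshold_element([1.0, -2.0, 0.5], [0.3, 0.9, -1.7]) == -1
```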
Another important question involves the use of a succession of inputs for producing an output. That is, it may be useful to allow time to enter into the network’s logical action, thus giving it a “dynamic” as well as “static” capability.