Learning Theory and Experiments with 
Competitive Networks 
Griff L. Bilbro 
North Carolina State University 
Box 7914 
Raleigh, NC 27695-7914 
Abstract 
David E. Van den Bout 
North Carolina State University 
Box 7914 
Raleigh, NC 27695-7914 
We apply the theory of Tishby, Levin, and Solla (TLS) to two problems. 
First we analyze an elementary problem for which we find the predictions 
consistent with conventional statistical results. Second we numerically 
examine the more realistic problem of training a competitive net to learn 
a probability density from samples. We find TLS useful for predicting 
average training behavior. 
I TLS APPLIED TO LEARNING DENSITIES 
Recently a theory of learning has been constructed which describes the learning 
of a relation from examples (Tishby, Levin, and Solla, 1989), (Schwartz, Samalan, 
Solla, and Denker, 1990). The original derivation relies on a statistical mechanics 
treatment of the probability of independent events in a system with a specified 
average value of an additive error function. 
The resulting theory is not restricted to learning relations and it is not essentially 
statistical mechanical. The TLS theory can be derived from the principle of maz- 
imum entropy, a general inference tool which produces probabilities characterized 
by certain values of the averages of specified functions(Jaynes, 1979). A TLS theory 
can be constructed whenever the specified function is additive and associated with 
independent examples. In this paper we treat the problem of learning a probability 
density from samples. 
Consider the model as some function p(z[o) of fixed form and adjustable parameters 
o which are to be chosen to approximate (z) where the overline denotes the true 
density. All we know about  are the elements of a training set T which are drawn 
846 
Learning Theory and Experiments with Competitive Networks 847 
from it. Define an error (zlw ). By the principal of maximum entropy 
p(l) = 1-L-e-''l (1) 
z() ' 
can be interpreted as the unique density which contains no other information except 
a specified value of the average error 
(> = f d p(l)(l). (.) 
In Equation 1 z is a normazation that is assumed to be independent of the vue 
of w; the parameter  is called the enitivit and is adjusted so that the average 
error is equal to some T, the specified target error on the trning set. We w use 
the convention that an integr operates on the entire expression that foows it. 
The usual Bayes rule produces a density in w from p(z[) and from a prior density 
p()(w) which reflects at best a genuine prior probabty or at least a restriction 
to the acceptable portion of the search space. Posterior to trying on m certain 
examples, 
() _ ,0() (-  ()) () 
Z 
i=l 
where Zm is a normalization that depends on the particular set of examples as well 
as their number. In order to remove the eect of any particular set of examples, we 
can average this posterior density over aH possible m examples 
<,()>. = f a()(,)...()()  () 
This average posterior density models the expected density of nets or w after train- 
ing. This distribtution in w implies the followin: expected posterior density for a 
new example 
f a(.+ I) (p-I ()).. (5) 
TLS compare this probability in z+ with the true target probabty to obtn 
the Average Prediction Probability or APP after training 
the average over both the trning set z(m) and an independent test example 
In the averages of Equations 4 and 6 are inconvenient to evuate exactly because of 
the g term in Equation 3. TLS propose an "annealed approbation" to APP in 
which the average of the ratio of Equation 4 is replaced by the ratio of the averages. 
Equation 6 becomes 
 = t 0()g.() 
t 0()g() 
where 
(8) 
848 Bilbro and Van den Bout 
Equation 7 is well suited for theoretical analysis and is also convenient for numer- 
ical predictions. To apply Equation 7 numerically, we will produce Monte Carlo 
estimates for the moments of g that involve sampling p(0) (o). If the dimension of w 
is larger than 50, it is preferable to histogram g rather than evaluate the moments 
directly. 
1.1 ANALYSIS OF AN ELEMENTARY EXAMPLE 
In this section we theoretically analyze a learning problem with the TLS theory. 
We will study the adjustment of the mean of a Gaussian density to represent a 
finite number of samples. The utility of this elementary example is that it admits 
an analytic solution for the APP of the previous section. All the relevant integrals 
can be computed with the identity 
We take the true density to be a Gaussian of mean  and variance 1/2a 
p(z) = / e-a(' -)'. 
We model the prior density as a Gaussian with mean w0 and variance 1/2r 
p()(w) = V_e-,(-0) '. 
(10) 
(11) 
We choose the simplest error function 
e(z[w) = (z - o) 2, (12) 
the squared error between a sample z and the Gaussian "model" defined by its 
mean w, which is to become our estimate of . In Equation 1, this error function 
leads to 
= (1s) 
with z(f 0 = f which is independent of w as assumed. We determine f by solving 
for the error on the training set to get f = 
The generalization, Equation 8, can now be evaluated with Equation 9 
where 
(14) 
(15) 
fe-(-)' 
= , 
a+/' 
is less than either a or/. The denominator of Equation 7 becomes 
(t,m/2 / r 
exp( m, (_ w0)) 
(16) 
Learning Theory and Experiments with Competitive Networks 849 
with a similar expression for the numerator. 
The case of many examples or little prior knowledge is interesting. 
tions 7 and 16 in the limit mn>> . 
Consider Equa- 
(17) 
which climbs to an asymptotic value of V/ for m  oo. In order to compare this 
with intuition, consider that the sample mean of {z, z2, ..., z,} approaches ' to 
within a variance of 1/2ma, so that 
(18) 
which makes Equation 6 agree with Equation 17 for large enough/. In this sense, 
the statistical mechanical theory of learning differs from conventional Bayesian es- 
timation only in its choice of an unconventional performance criterion APP. 
2 GENERAL NUMERICAL PROCEDURE 
In this section we apply the theory to the more realistic problem of learning a 
continuous probability density from a finite sample set. We can estimate the mo- 
ments of Equation 7 by the following Monte Carlo procedure. Given a training set 
T = {zt}i_T drawn from the unknown density  on domain X with finite volume V, 
an error function e(z[w), a training error e,, and a prior density p(o)(w) of vectors 
such that each w specifies a candidate function, 
1. Construct two sample sets: a prior set of P functions o = {wl0 } drawn from 
p(0)(w) and a set of U input vectors/ = {z,,} drawn uniformly from X. For 
each p in the prior set, tabulate the error e,10 = e(z,1o10 ) for every point in 
and the error eq0 = e(zt]wl0 ) for every point in T. 
2. Determine the sensitivity/ by solving the equation (e) = e, where 
(19) 
3. Estimate the average generalization of a given wl0 from Equation 8 
(20) 
4. The performance after m examples is the ratio of Equation 7. By construction 
7 ) is drawn from p(0) so that 
(21) 
850 Bilbro and Van den Bout 
1.5 
0.7 
0 20 40 60 80 100 
Training Set Size 
.o lo 
.o13 
.o9 
.o2 
1.5 
0.7 
0 20 4O 60 8O 100 
Training Set Size 
(b) 
Figure 1: Predicted APP versus number of training samples for a 20-neuron com- 
petitive network trained to various target errors where the neuron weights were 
initialized from (a) a uniform density, (b) an antisymmetrically skewed density. 
3 COMPETITIVE LEARNING NETS 
We consider competitive learning nets (CLNs) because they are familiar and use- 
ful to us (Van den Bout and Miller, 1990), because there exist two widely known 
training strategics for CLNs (the neurons can learn either independently or under a 
global interaction called conscience (DeSieno, 1988), and because CLNs can be ap- 
plied to one-dimensional problems without being too trivial. Competitive learning 
nets with conscience qualitatively change their behavior when they are trained on 
finite sample sets containing fewer examples than neurons; except for that regime 
we found the theory satisfactory. All experiments in this section were conducted 
upon the following one-dimensional training density 
(? 
(z) = otherwise. 
In Figure I is the Average Prediction Probability (APP) for k = 20 versus m, 
for several values of target error e, and for two prior densitsitics; first consider 
predictions from the uniform prior. For e, = 0.01, APP practically attains its 
asymptote of 1.5 by m = 40 examples. Assuming the APP to bc dominated in 
the limit by the largest #, we expect a CLN trained to an error of 0.01 on a set of 
40 examples to perform 1.5 times better than an untrained net on unseen samples 
drawn from the same probability density. This leads to a predicted probable error 
of about 
1 
e,,,ob = 2 k p(')' (22) 
For k = 20, e,ob = 0.017 for e,= .01 and e.ob = 0.021 for e,= 0.02. 
We performed 5,000 training trials of a 20-neuron CLN on randomly selected sets of 
Learning Theory and Experiments with Competitive Networks 851 
0.03 
0.02 
   '  * .* '  A*.'  '  
 
0.01 0.02 0.03 0 0.01 0.02 
Traing Ewor Traing or 
(b) 
0.01 
0 0,03 
Figure 2: Experimentally determined and predicted values of total error across 
the training density after competitive learning was performed using a 20-neuron 
network trained to various target errors (a) with 40 samples, (b) with 20 samples. 
40 samples from the training density. Each network was trained to a target error in 
the range [0.005, 0.03] on its 40 samples, and the average error on the total density 
was then calculated for the trained network. Figure 2 is a plot of 500 of these 
trials along with the predicted errors for various target errors. The probable error 
is qualitatively correct and the scatter of actual experiments increases in width by 
about the ratio of APPs for m = 20 and m = 40. For the case of m = 20 examples, 
the same net can only bc expected to exhibit probable errors of .019 and .023 for 
corresponding training target errors, which is compared graphically in Figure 2 with 
the experimentally determined errors for m = 20. 
The APP curves saturate at a value of m that is insensitive to the prior density 
from which the nets arc drawn. The vertical scale does depend somewhat on the 
prior however. Consider Figure 1, which also shows the APP curves for the same 
k = 20 net with the prior density antisymmetrically skewed away from the true 
density by the following function: 
{ --_ O_w_l, 
P() (v) = otherwise. 
For m  20 the shapes of the curves are almost unchanged, even though the vertical 
scale is different: saturation occurs at about the same value of m. Even when 
the prior greatly overrepresents poor nets, their effect on the prediction rapidly 
diminishes with training set size. This is important because in actual training, the 
effect of the initial configuration is also quickly lost. For m  20 the predictions 
are not valid in any case, since our simple error function does not reflect the actual 
probability even approximately for m  k in these nets. It is for m  20 where 
the only significant differences between the two families of curves occur. We have 
also been able to draw the same conclusions from less structured prior densities 
generated by assigning positive normalised random numbers to intervals of the 
852 Bilbro and Van den Bout 
domain. Moreover, we generally find that TLS predicts that about twice as many 
samples as neurons are needed to train competitive nets of other sizes. 
4 CONCLUSION 
TLS can be applied to learning densities as well as relations. We considered the 
effects of varying the number of examples, the target training error, and the choice 
of prior density. In these experiments on learning a density as well as others dealing 
with learning a binary output (Bilbro and Snyder, 1990), a ternary output (Chow, 
Bilbro, and Ycc, 1990), and a continuous output (Bilbro and Klenin, 1990) we 
find if saturation occurs for m substantially less than the total number of available 
samples, say rn < IT]/2, that m is a good predictor of sufficient training set size. 
Moreover there is evidence from a reformulation of the learning theory based on the 
grand canonical ensemble that supports this statistical approach (Klenin,1990). 
References 
G. L. Bilbro and M. Klenin. (1990) Thermodynamic Models of Learning: Applica- 
tions. Unpublished. 
G. L. Bilbro and W. E. Snyder. (1990) Learning theory, linear separability, and 
noisy data. CCSP-TR-90/7, Center for Communications and Signal Processing, 
Box 7914, Raleigh, NC 27(195-7914. 
M. Y. Chow, G. L. Bilbro and S. O. Yee. (1990) Application of Learning Theory to 
Single-Phase Induction Motor Incipient Fault Detection Artificial Neural Networks. 
Submitted to International Journal of Neural Systems. 
D. DeSieno. (1988) Adding a conscience to competitive learning. In IEEE Interna- 
tional Conference on Neural Networks, pages I:117-I:124. 
E. T. Jaynes. (1979) Where Do We Stand on Maximum Entropy?. In R. D. Leven 
and M. Tribus (Eds.), Mazimum Entropy Formalism, M. I. T. Press, Cambridge, 
pages 17-118. 
M. Klenin. (1990) Learning Models and Thermostatistics: A Description of Over- 
training and Generalization Capacities. NETR-90/3, Center for Communications 
and Signal Processing, Neural Engineering Group, Box 7914, Raleigh, NC 27(195- 
7914. 
D. B. Schwartz, V. K. Samalan, S. A. Solla & J. S. Denker. (1990) Exhaustive 
Learning. Neural Computation. 
N. Tishby, E. Levin, and S. A. Solla. (1989) Consistent inference of probabilities in 
layered networks: Predictions and generalization. IJCNN, IEEE, New York, pages 
11:403-410. 
D. E. Van den Bout and T. K. Miller III. (1990) TInMANN: The integer markovian 
artificial neural network. Accepted for publication in the Journal of Parallel and 
Distributed Computing. 
