Adaptive Soft Weight Tying 
using Gaussian Mixtures 
Steven J. Nowlan 
Computational Neuroscience Laboratory 
The Salk Institute, P.O. Box 5800 
San Diego, CA 92186-5800 
Geoffrey E. Hinton 
Department of Computer Science 
University of Toronto 
Toronto, Canada M5S 1A4 
Abstract 
One way of simpli'ing neural networks so they generalize better is to add 
an extra term Io the error function that will penalize complexity. We 
propose a new penalty term in which the distribution of weight values 
is modelled as a mixture of multiple gaussians. Under this model, a set 
of weights is simple if the weights can be clustered into subsets so that 
the weights in each cluster have similar values. We allow the parameters 
of the mixture model to adapt at the same time as the network learns. 
Simulations demonstrate that this complexity term is more effective than 
previous complexity terms. 
1 Introduction 
A major problem in training artificial neural networks is to ensure that they will 
generalize well to cases that they have not been trained on. Some recent theoretical 
results (Baum and Haussler, 1989) have suggested that in order to guarantee good 
generalization the amount of information required to directly specify the output 
vectors of all the training cases must be considerably larger than the number of 
independent weights in the network. In many practical problems there is only 
a small amount of labelled data available for training and this creates problems 
for any approach that uses a large, homogeneous network with many independent 
weights. As a result, there has been much recent interest in techniques that can 
train large networks with relatively small amounts of labelled data and still provide 
good generalization performance. 
In order to improve generalization, the number of free parameters in the network 
must be reduced. One of the oldest and simplest approaches to removing excess 
degrees of freedom from a network is to add an extra term to the error function 
that penalizes complexity: 
$$\text{cost} = \text{data-misfit} + \lambda \cdot \text{complexity} \qquad (1)$$
During learning, the network is trying to find a locally optimal trade-off between 
the data-misfit (the usual error term) and the complexity of the net. The relative 
importance of these two terms can be estimated by finding the value of $\lambda$ that 
optimizes generalization to a validation set. Probably the simplest approximation 
to complexity is the sum of the squares of the weights, $\sum_i w_i^2$. Differentiating 
this complexity measure leads to simple weight decay (Plaut, Nowlan and Hinton, 
1986) in which each weight decays towards zero at a rate that is proportional to its 
magnitude. This decay is countered by the gradient of the error term, so weights 
which are not critical to network performance, and hence always have small error 
gradients, decay away leaving only the weights necessary to solve the problem. 
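
To make the effect of this penalty concrete, here is a minimal sketch (ours, not from the paper) of the sum-of-squares complexity term and the weight-decay update it induces; the function names, learning rate and $\lambda$ value are illustrative assumptions.

    import numpy as np

    def sum_of_squares_penalty(weights, lam):
        # Complexity term: lam * sum_i w_i^2
        return lam * np.sum(weights ** 2)

    def decay_step(weights, error_grad, lam, lr=0.01):
        # One gradient step on cost = data-misfit + lam * sum_i w_i^2.
        # The penalty gradient 2*lam*w pulls each weight towards zero at a rate
        # proportional to its magnitude, opposed only by the error gradient.
        return weights - lr * (error_grad + 2.0 * lam * weights)
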
The use of a. i w? penalty term can also be interpreted from a Bayesian 
perspective. l The "complexity" of a set of weights, Xi w, may be described 
as its negal.iw.  log probal)ility density under a radially symmetric gaussian prior 
distribution on the weights. The distribution is centered at the origin a. nd has vari- 
ance 1/X. For multilayer networks, it is hard to find a good theoretical justification 
for this prior, but Ilinton (1987)justities it empirically by showing that it greatly 
improves generalization on a very diIficult task. More recently, Mackay (1991) has 
shown that even better generalization can be achieved by using different values of 
X for the weights in different layers. 
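
To make the correspondence explicit (a standard identity rather than something stated in the paper): for a single weight under a zero-mean gaussian with variance $\sigma^2$,

$$-\log p(w) = \frac{w^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2),$$

so, dropping the additive constant and absorbing the factor of one half into $\lambda$, the penalty $\lambda \sum_i w_i^2$ is the negative log density of the whole weight vector under independent zero-mean gaussians whose variance scales as $1/\lambda$.
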
2 A more complex measure of network complexity 
If we wish to eliminate small weights without forcing large weights away from the 
values they need to model the data, we can use a prior which is a mixture of a 
narrow (n) and a broad (b) gaussian, both centered at zero: 

$$p(w) = \pi_n \frac{1}{\sqrt{2\pi}\,\sigma_n} e^{-w^2/2\sigma_n^2} + \pi_b \frac{1}{\sqrt{2\pi}\,\sigma_b} e^{-w^2/2\sigma_b^2} \qquad (2)$$

where $\pi_n$ and $\pi_b$ are the mixing proportions of the two gaussians and are therefore 
constrained to sum to 1. 
Assuming that the weight values were generated from a gaussian mixture, the con- 
ditional probability that a particular weight, $w_i$, was generated by a particular 
gaussian, $j$, is called the responsibility of that gaussian for the weight and is: 

$$r_j(w_i) = \frac{\pi_j\, p_j(w_i)}{\sum_k \pi_k\, p_k(w_i)} \qquad (3)$$

where $p_j(w_i)$ is the probability density of $w_i$ under gaussian $j$. 
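
The following is a minimal numpy sketch (our illustration, not the authors' code) of the mixture prior of equation 2 and the responsibilities of equation 3; the component parameters passed in are arbitrary examples.

    import numpy as np

    def gaussian_density(w, mu, sigma):
        # Density of w under a gaussian with mean mu and standard deviation sigma.
        return np.exp(-0.5 * ((w - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

    def mixture_cost_and_responsibilities(w, pis, mus, sigmas):
        # w: vector of weights; pis, mus, sigmas: mixture parameters.
        # For the two-component prior of this section, mus = [0, 0] and
        # sigmas = [sigma_narrow, sigma_broad].
        weighted = pis * gaussian_density(w[:, None], mus, sigmas)  # pi_j * p_j(w_i)
        total = weighted.sum(axis=1)                                # sum_j pi_j p_j(w_i)
        responsibilities = weighted / total[:, None]                # equation 3
        cost = -np.log(total).sum()                                 # negative log prior
        return cost, responsibilities

For example, with pis = [0.5, 0.5], mus = [0, 0] and sigmas = [0.1, 2.0], a weight of 0.05 is claimed almost entirely by the narrow component while a weight of 1.5 is claimed almost entirely by the broad one.
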
When the mixing proportions of the two gaussians are comparable, the narrow gaus- 
sian gets most of the responsibility for a small weight. Adopting the Bayesian per- 
spective, the cost of a weight under the narrow gaussian is proportional to $w^2/2\sigma_n^2$. 
As long as $\sigma_n$ is quite small there will be strong pressure to reduce the magnitude 
¹ R. Szeliski, personal communication, 1985. 
of small weights even further. Conversely, the broad gaussian takes most of the 
responsibility for large weight values, so there is much less pressure to reduce them. 
In the limiting case when the broad gaussian becomes a uniform distribution, there 
is almost no pressure to reduce very large weights because they are almost certainly 
generated by the uniform distribution. A complexity term very similar to this limit- 
ing case is used in the "weight elimination" technique of (Weigend, Huberman and 
Rumelhart, 1990) to improve generalization for a time series prediction task.² 
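
For reference (the exact form is not restated in this paper), the weight-elimination penalty of Weigend et al., as we recall it, is

$$\lambda \sum_i \frac{w_i^2 / w_0^2}{1 + w_i^2 / w_0^2},$$

which saturates for $|w_i| \gg w_0$, so very large weights incur roughly the same cost regardless of their magnitude, much like the uniform-background limiting case just described.
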
3 Adaptive Gaussian Mixtures and Soft Weight-Sharing 
A mixture of a narrow, zero-mean gaussian with a broad gaussian or a uniform allows 
us to favor networks with many near-zero weights, and this improves generalization 
on many tasks. But practical experience with hand-coded weight constraints has 
also shown that great improvements can be achieved by constraining particular 
subsets of the weights to share the same value (Lang, Waibel and Hinton, 1990; Le 
Cun, 1989). Mixtures of zero-mean gaussians and uniforms cannot implement this 
type of symmetry constraint. If, however, we use multiple gaussians and allow their 
means and variances to adapt as the network learns, we can implement a "soft" 
version of weight-sharing in which the learning algorithm decides for itself which 
weights should be tied together. (We may also allow the mixing proportions to 
adapt so that we are not assuming all sets of tied weights are the same size.) 
The basic idea is that a gaussian which takes responsibility for a subset of the 
weights will squeeze those weights together since it can then have a lower variance 
and assign a ligher probability density to each weight. If the gaussias all start 
with high variance, the iuitiaI division of weights into subsets will be very soft. As 
the variances shrink and the network learns, the decisions about how to group the 
weights into subsets are influenced by the task the network is learning to perlbrn. 
To make these intuitive ideas a bit more concrete, we may define a cost function of 
the general form given in (1): 

$$C = \frac{K}{\sigma_y^2} \sum_c (y_c - d_c)^2 \;-\; \sum_i \log \sum_j \pi_j\, p_j(w_i) \qquad (4)$$

where $\sigma_y^2$ is the variance of the squared error and each $p_j(w_i)$ is a gaussian density 
with mean $\mu_j$ and standard deviation $\sigma_j$. We optimize this function by adjusting 
the $w_i$ and the mixture parameters $\pi_j$, $\mu_j$, and $\sigma_j$, as well as $\sigma_y$.³ 
The partial derivative of C with respect to each weight is the sum of the usual 
squared error derivative and a term due to the complexity cost for the weight: 

$$\frac{\partial C}{\partial w_i} = \frac{K}{\sigma_y^2}\,\frac{\partial}{\partial w_i}\sum_c (y_c - d_c)^2 \;+\; \sum_j r_j(w_i)\,\frac{w_i - \mu_j}{\sigma_j^2}$$
² See (Nowlan, 1991) for a precise description of the relationship between mixture models 
and the model used by (Weigend, Huberman and Rumelhart, 1990). 
³ $1/\sigma_y^2$ may be thought of as playing the same role as $\lambda$ in equation 1 in determining a 
trade-off between the misfit and complexity costs. $K$ is a normalizing factor based on a 
gaussian error model. 
Method                   Train % Correct   Test % Correct 
Vanilla Back Prop.       100.0 ± 0.0       67.3 ± 5.7 
Cross Valid.              98.8 ± 1.1       83.5 ± 5.1 
Weight Elimination       100.0 ± 0.0       89.8 ± 3.0 
Soft-share - 5 Comp.     100.0 ± 0.0       95.6 ± 2.7 
Soft-share - 10 Comp.    100.0 ± 0.0       97.1 ± 2.1 
Table 1: Summary of generalization performance of 5 different training techniques 
on the shift detection problem. 
The derivative of the complexity cost term is simply a weighted sum of the difference 
between the weight value and the center of each of the gaussians. The weighting 
factors are the responsibility measures defined in equation 3 and if over time a 
single gaussian claims most of the responsibility for a particular weight the effect 
of the complexity cost term is simply to pull the weight towards the center of the 
responsible gaussian. The strength of this force is inversely proportional to the 
variance of the gaussian. 
In the simulations described below, all of the parameters ($w_i$, $\mu_j$, $\sigma_j$, $\pi_j$) are updated 
simultaneously using a conjugate gradient descent procedure. To prevent variances 
shrinking too fast or going negative we optimize $\log \sigma_j$ rather than $\sigma_j$. To ensure 
that the mixing proportions sum to 1 and are positive, we optimize $x_j$ where $\pi_j = 
\exp(x_j)/\sum_k \exp(x_k)$. For further details see (Nowlan and Hinton, 1992). 
4 Simulation Results 
We compared the generalization performance of soft weight-tying to other tech- 
niques on two different problems. The first problem, a 20 input, one output shift 
detection network, was chosen because it was a binary problem for which solutions 
which generalize well exhibit a lot of repeated weight structure. The generalization 
performance of networks trained using the cost criterion given in equation 4 was 
compared to networks trained in three other ways: no cost term to penalize com- 
plexity; no explicit complexity cost term, but use of a validation set to terminate 
learning; weight elimination (Weigend, Huberman and Rumelhart, 1990)⁴. The 
simulation results are summarized in Table 1. 
The network had 20 input units, 10 hidden units, and a single output unit and 
contained 101 weights. The first 10 input units in this network were given a random 
binary pattern, and the second group of 10 input units were given the same pattern 
circularly shifted by 1 bit left or right. The desired output of the network was +1 
for a left shift and -1 for a right shift. A data set of 2400 patterns was created by 
randomly generating a 10 bit string, and choosing with equal probability to shift 
the string left or right. The data set was divided into 100 training cases, 1000 
validation cases, and 1300 test cases. The training set was deliberately chosen to 
be very small (< 5% of possible patterns) to explore the region in which complexity 
penalties should have the largest impact. Ten simulations were performed with each 
⁴ With a fixed value of $\lambda$ chosen by cross-validation. 
Figure 1: Final mixture probability density for a typical solution to the shift de- 
tection problem. Five of the components in the mixture can be seen as distinct 
bumps in the probability density. Of the remaining five components, two have been 
eliminated by having their mixing proportions go to zero and the other three are 
very broad and form the baseline offset of the density function. 
method, starting from ten different initial weight sets (i.e. each method used the 
same ten initial weight configurations). 
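
For concreteness, the shift-detection data described above could be generated as follows (our reconstruction from the description in the text; the random seed and the left/right shift conventions are arbitrary choices).

    import numpy as np

    def make_shift_data(n_patterns=2400, n_bits=10, seed=0):
        # 20 inputs: a random binary pattern followed by a copy circularly
        # shifted one bit left or right; target +1 for left, -1 for right.
        rng = np.random.default_rng(seed)
        patterns = rng.integers(0, 2, size=(n_patterns, n_bits))
        left = rng.integers(0, 2, size=n_patterns).astype(bool)
        shifted = np.where(left[:, None],
                           np.roll(patterns, -1, axis=1),   # circular left shift
                           np.roll(patterns, 1, axis=1))    # circular right shift
        inputs = np.concatenate([patterns, shifted], axis=1)
        targets = np.where(left, 1.0, -1.0)
        return inputs, targets

    # Split as in the paper: 100 training, 1000 validation, 1300 test cases.
    X, y = make_shift_data()
    X_train, y_train = X[:100], y[:100]
    X_valid, y_valid = X[100:1100], y[100:1100]
    X_test, y_test = X[1100:], y[1100:]
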
The final weight distributions discovered by the soft weight-tying technique are 
shown in Figure 1. There is no significant component with mean 0. The classical 
assumption that the network contains a large number of inessential weights which 
can be eliminated to improve generalization is not appropriate for this problem and 
network architecture. This may explain why the weight elimination model used 
by Weigend et al (Weigend, Huberman and Rumelhart, 1990) performs relatively 
poorly in this situation. 
The second task chosen to evaluate the effectiveness of our complexity penalty was 
the prediction of the yearly sunspot average from the averages of previous years. 
This task has been well studied as a time-series prediction benchmark in the statis- 
tics literature (Priestley, 1991b; Priestley, 1991a) and has also been investigated by 
(Weigend, Huberman and Rumelhart, 1990) using a complexity penalty similar to 
the one discussed in section 2. 
The network architecture used was identical to the one used in the study by Weigend 
et al: The network had 12 input units which represented the yearly averages from the 
preceding 12 years, 8 hidden units, and a single linear output unit which represented 
the prediction for the average number of sunspots in the current year. Yearly 
sunspot data from 1700 to 1920 was used to train the network to perform this one- 
step prediction task, and the evaluation of the network was based on data from 
Method                   Test arv 
TAR                      0.097 
RBF                      0.092 
WRH                      0.086 
Soft-share - 3 Comp.     0.077 ± 0.0029 
Soft-share - 8 Comp.     0.072 ± 0.0022 
Table 2: Summary of average relative variance (arv) of 5 different models on the one-step 
sunspot prediction problem. 
1921 to 1955.⁵ The evaluation of prediction performance used the average relative 
variance (arv) measure discussed in (Weigend, Huberman and Rumelhart, 1990). 
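
The measure itself is not restated here; as defined by Weigend et al. (1990), the arv normalizes the mean squared prediction error by the estimated variance of the data,

$$\mathrm{arv} = \frac{\frac{1}{N}\sum_{t=1}^{N}(d_t - y_t)^2}{\hat{\sigma}^2},$$

where $d_t$ are the observed values, $y_t$ the predictions and $\hat{\sigma}^2$ the variance of the observations, so that always predicting the mean of the series gives an arv of 1.
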
Simulations were performed using the same conjugate gradient method used for the 
first problem. Complexity measures based on gaussian mixtures with 3 and 8 com- 
ponents were used and ten simulations were performed with each (using the same 
training data but different initial weight configurations). The results of these simu- 
lations are summarized in Table 2 along with the best result obtained by Weigend et 
al (Weigend, Huberman and Rumelhart, 1990) (WRH), the threshold autoregression 
model of Tong and Lim (Tong and Lim, 1980) (TAR)⁶, and the multi-layer RBF 
network of He and Lapedes (He and Lapedes, 1991) (RBF). All figures represent 
the arv on the test set. For the mixture complexity models, this is the average over 
the ten simulations, plus or minus one standard deviation. 
Since the results for the models other than the mixture complexity trained networks 
are based on a single simulation, it is difficult to assign statistical significance to the 
differences shown in Table 2. We may note, however, that the difference between 
the 3 and 8 component mixture complexity models is significant (p > 0.95) and the 
differences between the 8 component model and the other models are much larger. 
Figure 2 shows an 8 component mixture model of the final weight distribution. It is 
quite unlike the distribution in Figure 1 and is actually quite close to a mixture of 
two zero-mean gaussians, one broad and one narrow. This may explain why weight 
elimination works quite well for this task. 
Weigend et al point out that for time series prediction tasks such as the sunspot 
task a much more interesting measure of performance is the ability of the model to 
predict more than one time step into the future. One way to approach the multi- 
step prediction problem is to use iterated single-step prediction. In this method, the 
predicted output is fed back as input for the next prediction and all other input 
units have their values shifted back one unit. Thus the input typically consists 
of a combiuation of actual and predicted values. When predicting more than one 
step into the future, the prediction error depends both on how many steps into the 
future one is predicting (I) and on what point in the time series the prediction 
began. An appropriate error measure for iterated prediction is the average relative 
I-times iterated prediction variance (Weigend, Huberman and Rumelhart, 1990) 
⁵ The authors thank Andreas Weigend for providing his version of this data. 
⁶ This was the model favored by Priestley (Priestley, 1991a) in a recent evaluation of 
classical statistical approaches to this task. 
Figure 2: Typical final mixture probability density for the sunspot prediction prob- 
lem with a model containing 8 mixture components. 
Figure 3: Average relative I-times iterated prediction variance versus number of 
prediction iterations for the sunspot time series from 1921 to 1955. Closed circles 
represent the TAR model, open circles the WRH model, closed squares the 3 
component complexity model, and open squares the 8 component complexity model. 
Ten different sets of initial weights were used for the 3 and 8 component complexity 
models and one standard deviation error bars are shown. 
which averages predictions I steps into the future over all possible starting points. 
Using this measure, the performance of various models is shown in Figure 3. 
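
A minimal sketch of iterated single-step prediction as described above (our illustration, not the authors' code); predict_one_step stands for the trained 12-input network and is assumed here.

    import numpy as np

    def iterated_prediction(predict_one_step, history, n_steps):
        # predict_one_step: maps the 12 most recent yearly averages to a
        #                   prediction for the next year (the trained network)
        # history:          observed values up to the starting point
        # n_steps:          how many steps I to predict into the future
        window = list(history[-12:])
        predictions = []
        for _ in range(n_steps):
            y = predict_one_step(np.array(window))
            predictions.append(y)
            window = window[1:] + [y]   # shift inputs back one unit and feed
                                        # the prediction back in as the newest value
        return np.array(predictions)
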
5 Summary 
The simulations we have described provide evidence that the use of a more flexible 
model for the distribution of weights in a network can lead to better generalization 
performance than weight decay, weight elimination, or techniques that control the 
learning time. The flexibility of our model is clearly demonstrated in the very differ- 
ent final weight distributions discovered for the two different problems investigated 
in this paper. The ability to automatically adapt to individual problems suggests 
that the method should have broad applicability. 
Acknowledgements 
This research was funded by the Ontario ITRC, the Canadian NSERC and the Howard 
Hughes Medical Institute. Hinton is the Noranda fellow of the Canadian Institute for 
Advanced Research. 
References 
Baum, E. B. and Haussler, D. (1989). What size net gives valid generalization? Neural 
Computation, 1:151-160. 
He, X. and Lapedes, A. (1991). Nonlinear modelling and prediction by successive approxi- 
mation using Radial Basis Functions. Technical Report LA-UR-91-1375, Los Alamos 
National Laboratory. 
Hinton, G. E. (1987). Learning translation invariant recognition in a massively parallel 
network. In Proc. Conf. Parallel Architectures and Languages Europe, Eindhoven. 
Lang, K. J., Waibel, A. H., and Hinton, G. E. (1990). A time-delay neural network 
architecture for isolated word recognition. Neural Networks, 3:23-43. 
Le Cun, Y. (1989). Generalization and network design strategies. Technical Report CRG- 
TR-89-4, University of Toronto. 
MacKay, D. J. C. (1991). Bayesian Modelling and Neural Networks. PhD thesis, Compu- 
tation and Neural Systems, California Institute of Technology, Pasadena, CA. 
Nowlan, S. J. (1991). Soft Competitive Adaptation: Neural Network Learning Algo- 
rithms based on Fitting Statistical Mixtures. PhD thesis, School of Computer Science, 
Carnegie Mellon University, Pittsburgh, PA. 
Nowlan, S. J. and Hinton, G. E. (1992). Simplifying neural networks by soft weight- 
sharing. Neural Computation. In press. 
Plaut, D. C., Nowlan, S. J., and Hinton, G. E. (1986). Experiments on learning by 
back-propagation. Technical Report CMU-CS-86-126, Carnegie-Mellon University, 
Pittsburgh, PA 15213. 
Priestley, M. B. (1991a). Non-linear and Non-stationary Time Series Analysis. Academic 
Press. 
Priestley, M. B. (1991b). Spectral Analysis and Time Series. Academic Press. 
Tong, H. and Lim, K. S. (1980). Threshold autoregression, limit cycles, and cyclical data. 
Journal of the Royal Statistical Society B, 42. 
Weigend, A. S., Huberman, B. A., and Rumelhart, D. E. (1990). Predicting the future: A 
connectionist approach. International Journal of Neural Systems, 1. 
