Optimal Brain Surgeon: 
Extensions and performance comparisons 
Babak Hassibi* 
David G. Stork 
Gregory Wolff 
Takahiro Watanabe 
Ricoh California Research Center 
2882 Sand Hill Road Suite 115 
Menlo Park, CA 94025-7022 
and 
*Department of Electrical Engineering 
105B Durand Hall 
Stanford University 
Stanford, CA 94305-4055 
Abstract 
We extend Optimal Brain Surgeon (OBS) -- a second-order 
method for pruning networks -- to allow for general error mea- 
sures, and explore a reduced computational and storage implemen- 
tation via a dominant eigenspace decomposition. Simulations on 
nonlinear, noisy pattern classification problems reveal that OB$ 
does lead to improved generalization, and performs favorably in 
comparison with Optimal Brain Damage (OBD). We find that the 
required retraining steps in OBD may lead to inferior generaliza- 
tion, a result that can be interpreted as due .to injecting noise back 
into the system. A common technique is to stop training of a large 
network at the minimum validation error. We found that the test 
error could be reduced even further by means of OB$ (but not 
OBD) pruning. Our results justify the t - o approximation used 
in OB$ and indicate why retraining in a highly pruned network 
may lead to inferior performance. 
263 
264 Hassibi, Stork, Wolff, and Watanabe 
1 INTRODUCTION 
The fundamental theory of generalization favors simplicity. For a given level of 
performance on observed data, models with fewer parameters can be expected to 
perform better on test data. In practice, we find that neural networks with fewer 
weights typically generalize better than large networks with the same training error. 
To this end, LeCun, Denker and Solla's (1990) Optimal Brain Damage method 
(OBD) sought to delete weights by keeping the training error as small as possible. 
Hassibi and Stork (1993) extended OBD to include the off-diagonal terms in the 
network's Hessian, which were shown to be significant and important for pruning 
in classical and benchmark problems. 
OBD and Optimal Brain Surgeon (OBS) share the same basic approach of training 
a network to (local) minimum in error at weight w*, and then pruning a weight 
that leads to the smallest increase in the training error. The predicted functional 
increase in the error for a change in full weight vector 5w is: 
OE )T 1 02E 
6E =  6w 6w v  
0 =H 
 + o([[wll 3) , (1) 
  
m0 
where H is the Hessian matrix. The first term vanishes because we are at a local 
minimum in error; we ignore third- and higher-order terms (Gorodkin et al., 1993). 
Hassibi and Stork (1993) first showed that the general solution for minimizing this 
function given the constraint of deleting one weight was: 
2 
Wq 1 Wq 
W---- [$_l]q q $ -1 .eq and Lq-- 2 [$-l]qq (2) 
Here, eq is the unit vector along the qth direction in weight space and Lq is the 
saliency of weight q -- an estimate of the increase in training error if weight q is 
pruned and the other weights updated by the left equation in Eq. 2. 
2 
GENERAL ERROR MEASURES AND FISHER'S 
METHOD OF SCORING 
In this section we show that the recursive procedure for computing the inverse 
Hessian for sum squared errors presented in Hassibi and Stork (1993) generalizes 
to any twice differentiable distance norm and that the key approximation based on 
Fisher's method of scoring is still valid. 
Consider an arbitrary twice differentiable distance norm d(t, o) where t is the de- 
sired output (teaching vector) and o = F(w, in) the actual output. Given a weight 
vector w, F maps the input vector in to the output; the total error over P patterns 
1 P , o[kl). 
is E =  k=l d( t[k] It is straightforward to show that for a single output 
unit network the Hessian is: 
Optimal Brain Surgeon: Extensions and Performance Comparisons 265 
1 P cgF(w, in [k]) c92d(t[k],o []) cgFT(w, in []) 
H =   Ow 002 Ow 
k=l 
1 P cgd(t [1, o [1) c92F(w, in [k]) 
7 0o 0w 2 
k=l 
(3) 
The second term is of order O(]lt - oil); using Fisher's method of scoring (Sever & 
Wild, 1989), we set this term to zero. Thus our Hessian reduces to: 
I OF(w, in 02d(t[l,o 0FT(w, in 
H: 0w 002 0w (4) 
k=l 
c92 d(t [kl ,o[l ) 
We define X  (w'in[) and a  and following the logic of Hsibi 
0W 002 
and Sork (1993) we can easily show tha the recursion for computing the inverse 
Hessian becomes: 
T 
--1 Hi ' Xk+l 'Xk+l ' Hi , H 1: ct-1I, and H 1 = H -1 
Hk+ 1 = H 1 - ___p + XkL 1  H 1  Xk+ 1 ' 
ak 
(5) 
where  is a small parameter -- effectively a weight decay constant. Note how 
different error measures d(t, o) scale the gradient vectors Xk forming the Hessian 
(Eq. 4). For the squared error d(t, o) -- (t - 0) 2, we have a: 1, and all gradient 
vectors are weighted equally. The cross entropy or Kullback-Leibler distance, 
o (1 -o) 0 _< o,t < 1 (6) 
d(t,o) = olog  + (1 - o) log (1 - t) ' - 
i Hence if o [] is close to zero or one, X is given a large 
yields ak = Otkl(1--OIkl)' 
weight in the Hessian; conversely, the smallest value of a occurs when o [] = 1/2. 
This is desirable and makes great intuitive sense, since in the cross entropy norm 
the value of o [] is interpreted as the probability that the kth input pattern belongs 
to a particular class, and therefore we give large weight to X whose class we are 
most certain and small weight to those which we are least certain. 
3 EIGENSPACE DECOMPOSITION 
Although OBS has been shown to be a powerful method for small and interme- 
diate sized networks -- Hassibi, Stork and Wolff (1993) applied OBS successfully 
to NETtalk -- its use in larger problems is difficult because of large storage and 
computation requirements. For a network of n weights, simply storing the Hessian 
requires 0(n2/2) elements and O(Pn 2) computations are needed for each pruning 
step. Reducing this computational burden requires some type of approximation. 
Since OBS uses the inverse of the Hessian, any approximation to OBS will at some 
level reduce to an approximation of H. For instance OBD uses a diagonal approx- 
imation; magnitude-based methods use an isotropic approximation; and dividing 
the network into subsets (e.g., hidden-to-output and input-to-hidden) corresponds 
to the less-restrictive block diagonal approximation. In what follows we explore the 
dominant eigenspace decomposition of the inverse Hessian as our approximation. It 
should be remembered that all these are subsets of the full OBS approach. 
266 Hassibi, Stork, Wolff, and Watanabe 
3.1 Theory 
The dominant eigendecomposition is the best low-rank approximation of a matrix 
(in an induced 2-norm sense). Since the largest eigenvalues of H-1 are the smallest 
eigenvalues of H, this method will, roughly speaking, be pruning weights in the 
approximate nullspace of H. Dealing with a low rank approximation of H -1 will 
drastically reduce the storage and computational requirements. 
Consider the eigendecomposition of H: 
0 ) , , 
H = (UsUN) N UN = UsYlsUs + UNNUN' 
(7) 
where Es contains the largest eigenvalues of H and N the smallest ones. (We use 
the subscripts q and N to loosely connote signal and noise.) The dimension of the 
noise subspace is typically ra << n. Us and UN are n x (n -- ra) and n x ra matrices 
that span the dominant eigenspace of H and H-l, and * denotes matrix transpose 
and complex conjugation. If, as suggested above, we restrict the weight prunings to 
lie in UN, we obtain the following saliency and full weight change when removing 
the qth weight: 
2 
I Wq (8) 
Jq ----  eq T' UN' "]r 1  rv-e q 
Wq -- 1  
(V = eqT'UN' [1. Uv .eq N UNeq , (9) 
where we have used 'bars' to indicate that these are approximations to Eq. 2. Note 
now that we need only to store ]N and UN, which have roughly nm elements. 
Likewise the computation required to estimate N and UN is O(Pnm). 
The bound on Lq is: 
Lq _ Lq _ Lq q- 2 LqLq 1 
wq  ' er(s)' (10) 
where er(s) is the smallest eigenvalue of $. Moreover if er(s) is large enough so 
1 we have the following simpler form: 
that rr(s) > [H_l]q q 
Lq < .q < Lq (11) 
-- --1-- 
[H-1lqq_(s) 
In either case Eqs. 10 and 11 indicate that the larger _(s) is, the tighter the bounds 
are. Thus if the subspace dimension m is such that the eigenvalues in Us are large, 
then we will have a good approximation. 
LeCun, Simard and Pearlmutter (1993) have suggested a method that can be used 
to estimate the smallest eigenvectors of the Hessian. However, for OBS (as we shall 
see) it is best to use the Hessian with the t - o approximation, and their method 
is not appropriate. 
Optimal Brain Surgeon: Extensions and Performance Comparisons 267 
3.2 Simulations 
We pruned networks trained on the three Monk's problems (Thrun et al., 1991) us- 
ing the full OBS and a 5-dimensional eigenspace version of OBS, using the validation 
error rate for stopping criterion. (We chose a 5-dimensional subspace, because this 
reduced the computational complexity by an order of magnitude.) The Table shows 
the number of weights obtained. It is clear that this eigenspace decomposition was 
not particularly successful. It appears as though the the off-diagonal terms in H be- 
yond those in the eigenspace are important, and their omission leads to bad pruning. 
However, this warrants further study. 
Monkl 
Monk2 
Monk3 
unpruned OBS 5-d eigenspace 
58 
39 
39 
14 
16 
4 
28 
27 
11 
4 OBS/OBD COMPARISON 
General criteria for comparing pruning methods do not exist. Since such meth- 
ods amount to assuming a particular prior distribution over the parameters, the 
empirical results usually tell us more about the problem space, than about the 
methods themselves. However, for two methods, such as OBS and OBD, which 
utilize the same cost function, and differ only in their approximations, empirical 
comparisons can be informative. Hence, we have applied both OBS and OBD to 
several problems, including an artificially generated statistical classification task, 
and a real-world copier voltage control problem. As we show below, the OBS algo- 
rithm usually results in better generalization performance. 
4.1 MULTIPLE GAUSSIAN PRIORS 
We created a two-catagory classification problem with a five-dimensional in- 
put space. Category A consisted of two Gaussian distributions with mean 
vectors /A1 ---- (1,1,0,1,.5) and /A2 = (0,0,1,0,.5) and covariances nl = 
Diag[0.99, 1.0, 0.88, 0.70, 0.95] and 2: Diag[1.28, 0.60, 0.52, 0.93, 0.93] while cat- 
egory B had means /B -- (0,1,0,0,.5) and /B2 = (1,0,1,1,.5) and covariances 
B1 = Dia910.84, 0.68, 1.28, 1.02, 0.89] and B2 = Diag[0.52, 1.25, 1.09, 0.64, 1.13]. 
The networks were feedforward with 5 input units, 9 hidden units, and a single 
output unit (64 weights total). The training and the test sets consisted of 1000 
patterns each, randomly chosen from the equi-probable categories. The problem 
was a difficult one: even with the somewhat large number of weights it was not 
possible to obtain less than 0.15 squared error per training pattern. We trained the 
networks to a local error minimum and then applied OBD (with retraining after 
each pruning step using backpropagation) as well as OBq. 
Figure 1 (left) shows the training errors for the network as a function of the number 
of remaining weights during pruning by OBq and by OBD. As more weights are 
pruned the training errors for both OBq and OBD typically increase. Comparing 
the two graphs for the first pruned weights, the training error for OBD and OBq 
are roughly equal, after which the training error of OBq is less until the 24th weight 
268 Hassibi, Stork, Wolff, and Watanabe 
E 
.165 
.16 
.155 
.15 
30 
Train 
OBD 
OBS,,% 
E 
22 
21! 
21 
20! 
2 
19'. 
30 
Test 
OBS 
35 40 45 50 55 60 65 35 40 45 50 55 60 65 
number of weights number of weights 
Figure 1: OBS and OBD training error on a sum of Gaussians prior pattern clas- 
sification task as a function of the number of weights in the network. (Pruning 
proceeds right to left.) OBS pruning employed o = 10 -6 (cf., Eq. 5); OBD em- 
ployed 60 retraining epochs after each pruning. 
is removed. The reason OBD training is initially slightly better is that the network 
was not at an exact local minimum; indeed in the first few stages the training error 
for OBD actually becomes less than its original value. (Training exhaustively to 
the true local minimum took prohibitively long.) In contrast, due to the t - o 
approximation OBS tries to keep the network response close to where it was, even 
if that isn't the minimum w*. We think it plausible that if the network were at an 
exact local minimum OBS would have had virtually identical performance. 
Since OBD is using retraining the only reason why OBS can outperform after the 
first steps is that OBD has removed an incorrect weight, due to its diagonal approx- 
imation. (The reason OBS behaves poorly after removing 24 weights -- a radically 
pruned net -- may be that the second-order approximation breaks down at this 
point.) We can see that the minimum on test error occurs before this breakdown, 
meaning that the failed approximation (Fig. 2) does not affect our choice of the 
optimal network, at least for this problem. 
The most important and interesting result is the test error for these pruned networks 
(Figure 1, right). The test error for OBD does not show any consistent behaviour, 
other than the fact that on the average it generally goes up. This is contrary to what 
one would expect of a pruning algorithm. It seems that the retraining phase works 
against the pruning process, by tending to reinforce overfitting, and to reinject the 
training set noise. For 'OBS, however, the test error consistently decreases until 
after removing 22 weights a minimum is reached, because the t - o approximation 
avoids reinjecting the training set noise. 
4.2 OBS/OBD PRUNING AND "STOPPED" NETWORKS 
A popular method of avoiding overfitting is to stop training a large net when the 
validation error reaches a minimum. In order to explore whether pruning could 
improve the performance on such a "stopped" network (i.e., not at w*), we mon- 
itored the test error for the above problem and recorded the weights for which a 
minimum on the test set occured. We then applied OBS and OBD to this network. 
Optimal Brain Surgeon: Extensions and Performance Comparisons 269 
E 
204 
202 
200 
198 
196 
194 
OBS 
35 40 45 50 55 60 
number of weights 
Figure 2: A 64-weight network was trained to minimum validation error on the 
Gaussian problem -- not w* -- and then pruned by OBD and by OBS. The test 
error on the resulting network is shown. (Pruning proceeds from right to left.) Note 
especially that even though the network is far from w*, OBS leads lower test error 
over a wide range of prunings, even through OBD employs retraining. 
The results shown in Figure 2 indicate that with OB$ we were able to reduce the 
test error, and this reached a minimum after removing 17 weights. OBD was not 
able to consistently reduce the test error. 
This last result and those from Fig. 2 have important consequences. There are no 
universal stopping criteria based on theory (for the reasons mentioned above), but 
it is a typical practice to use validation error as such a criterion. As can be seen 
in Figure 2, the test error (which we here consider a validation error) consistantly 
decreases to a unique miniumum for pruning by OBS. For the network pruned (and 
continuously retrained) by OBD, there is no such structure in the validation curves. 
There seems to be no reliable clue that would permit the user to know when to stop 
pruning. 
4.3 COPIER CONTROL APPLICATION 
The quality of an image produced by a copier is dependent upon a wide variety of 
factors: time since last copy, time since last toner cartridge installed, temperature, 
humidity, overall graylevel of the source document, etc. These factors interact in a 
highly non-linear fashion, so that mathematical modelling of their interrelationships 
is difficult. Morita et al. (1992) used backpropagation to train an 8-4-8 network (65 
weights) on real-world data, and managed to achieve an RMS voltage error of 0.0124 
on a critical control plate. We pruned his network with both OBD with retraining 
as well as with OBS. When the network was pruned by OBD with retraining, the 
test error continually increased (erratically) such that at 34 remaining weights, the 
RMS error was 0.023. When also we pruned the original net by OBS, and the test 
error gradually decreased such that at the same number of weights the test error 
was 0.012 -- significantly lower than that of the net pruned by OBD. 
270 Hassibi, Stork, Wolff, and Watanabe 
5 CONCLUSIONS 
We compared pruning by OBS and by OBD with retraining on a difficult non-linear 
statistical pattern recognition problem and found that OBS led to lower generaliza- 
tion error. We also considered the widely used technique of training large nets to 
minimum validation error. To our surprise, we found that subsequent pruning by 
OBS lowered generalization error, thereby demonstrating that such networks still 
have overfitting problems. We have found that the dominant eigenspace approach 
to OBS leads to poor performance. Our simulations support the claim that the 
t - o approximation used in OBS avoids reinjecting training set noise into the net- 
work. In contrast, including such t - o terms in OBS reinjects training set noise 
and degrades generalization performance, as does retraining in OBD. 
Acknowledgements 
Thanks to T. Kailath for support of B.H. through grants AFOSR 91-0060 and 
DAAL03-91-C-0010. Address reprint requests to Dr. Stork: stork@crc.ricoh.com. 
References 
J. Gorodkin, L. K. Hansen, A. Krogh, C. Svarer and O. Winther. (1993) A quanti- 
tative study of pruning by Optimal Brain Damage. International Journal of Neural 
Systems 4(2) 159-169. 
B. Hassibi & D. G. Stork. (1993) Second order derivatives for network pruning: 
Optimal Brain Surgeon. In S. J. Hanson, J. D. Cowan and C. L. Giles (eds.), 
Advances in Neural Information Processing Systems 5, 164-171. San Mateo, CA: 
Morgan Kaufmann. 
B. Hassibi, D. G. Stork & G. Wolff. (1993) Optimal Brain Surgeon and general 
network pruning. Proceedings of ICNN 93, San Francisco I IEEE Press. 293-299. 
Y. LeCun, J. Denker & S. Solla. (1990) Optimal Brain Damage. In D. Touretzky 
(ed.), Advances in Neural Information Processing Systems 2, 598-605. San Mateo, 
CA: Morgan Kaufmann. 
Y. LeCun, P. Simard & B. Pearlmutter. (1993) Automatic learning rate max- 
imization by on-line estimation of the Hessian's eigenvectors. In S. J. Hanson, 
J. D. Cowan & C. L. Giles (eds.), Advances in Neural Information Processing Sys- 
tems 5, 156-163. San Mateo, CA: Morgan Kaufmann. 
T. Morita, M. Kanaya, T. Inagaki, H. Murayama & S. Kato. (1992) Photo-copier 
image density control using neural network and fuzzy theory. Second International 
Workshop on Industrial Fuzzy Control 4 Intelligent Systems December 2-4, College 
Station, TX, 10. 
S. Thrun and 23 co-authors. (1991) The Monk's Problems -- A performance com- 
parison of different learning algorithms. CMU-CS-91-197 Carnegie-Mellon Univer- 
sity Dept. of Computer Science Technical Report. 
