H Optimal Training Algorithms and 
their Relation to Backpropagation 
Babak Hassibi* 
Information Systems Laboratory 
Stanford University 
Stanford, CA 94305 
Thomas Kailath 
Information Systems Laboratory 
Stanford University 
Stanford, CA 94305 
Abstract 
We derive global H a optimal training algorithms for neural net- 
works. These algorithms guarantee the smallest possible prediction 
error energy over all possible disturbances of fixed energy, and are 
therefore robust with respect to model uncertainties and lack of 
statistical information on the exogenous signals. The ensuing es- 
timators are infinite-dimensional, in the sense that updating the 
weight vector estimate requires knowledge of all previous weight 
esimates. A certain finite-dimensional approximation to these es- 
timators is the backpropagation algorithm. This explains the lo- 
cal H a optimality of backpropagation that has been previously 
demonstrated. 
I Introduction 
Classical methods in estimation theory (such as maximum-likelihood, maximum 
entropy and least-squares) require a priori knowledge of the statistical properties 
of the exogenous signals. In some applications, however, one is faced with model 
uncertainties and lack of statistical information, which has led to an increasing 
interest in minimax estimation (see e.g., Zames 1981, Khargonekar and Nagpal 
1991, and the references therein) with the belief that the resulting so-called H  
algorithms will be more robust and less sensitive to parameter variations. 
*Contact author: Information Systems Laboratory, Stanford University, Stanford CA 
94305. Phone (415) 723-1538. Fax (415) 723-8473. E-mail: hassibi@rascals.stanford.edu. 
192 Babak Hassibi, Thomas Kailat 
In (Hassibi, Sayed and Kailath, 1994) we have shown that LMS (Widrow and Hoff, 
1960) and backpropagation (Rumelhart and Mclelland, 1986), the currently most 
widely used adaptive algorithms that have long been considered to be approximate 
H 2 (or least-squares) solutions, are indeed H  optimal and locally H  optimal al- 
gorithms, respectively. This, in our view, connects earlier work in learning theory to 
more recent ideas in robust estimation, and explains why LMS and backpropagation 
have found wide applicability in such a diverse range of problems. 
The local H  optimality of backpropagation implies that backpropagation min- 
imizes the energy gain from the disturbances to the prediction errors, only if the 
initial condition is close enough to the true weight vector and if the disturbances are 
small enough. In this paper we derive global H  optimal estimators that minimize 
the energy gain from the disturbances to the prediction errors for all initial condi- 
tions and disturbances. The resulting estimator (given by Theorem 1) has growing 
memory, which we refer to as being infinite-dimensional, since updating the weight 
vector estimate requires knowledge of all previous weight estimates. When the un- 
derlying model is linear, we show that this infinite-dimensional estimator reduces 
to the finite-dimensional LMS filter. When the underlying model is nonlinear, the 
infinite-dimensionality of the estimator may prohibit its practical applicablity, and 
one needs to construct finite-dimensional approximations to this estimator. We 
consider two such approximations here: one yields the backpropagation algorithm, 
and the other is a second-order algorithm based on the Newton-Raphson iteration. 
There are, no doubt, a wide variety of other approximations which should be worthy 
of further scrutiny. 
2 Robust Estimation 
In estimation problems one assumes a certain model (say an FIR filter in adaptive 
filtering, or a neural network), observes a corrupted version of the output of this 
model, and wants to estimate the parameters associated with this model (say the 
weights of the FIR filter or neural network). Most estimation algorithms make some 
assumption about the nature of the disturbances, and then proceed to estimate the 
parameters using some optimality criterion. To be more specific, we shall consider 
the following two cases. 
2.1 The Linear Case 
Suppose that we observe an output sequence {di} that obeys the following linear 
model 
di = xiw + vi, (1) 
where x/ = [ xi xi2 ... xi, ] is a known input vector, w is the unknown 
weight vector that we intend to estimate, and .[vi) is an unknown disturbance 
sequence. Let wi = .T'(do, d,..., di) denote the estimate of the weight vector given 
the inputs .[xj) and the outputs {dj) from time 0 up to and including time i. The 
most widely used estimate wi, is one that satisfies the following H  criterion 
TW]2 
mn I-]w- w_]  + Idj - xj , (2) 
j=0 
where u is a positive constant that reflects a priori knowledge as to how close w is 
to the initial estimate w_. The exact solution to the above criterion is given by 
the RLS algorithm (Haykin, 1991): 
Wi -- Wi-1 - ]p,i(di - 3/Tw), w-1 (3) 
HOptimal Training Algorithms and Their Relation to Backpropagation 193 
where w_x denotes the initial value, 
and 
Pi xi 
kp,i - 1 q- xiT pixi 
t ' Po=P I' 
Pi+x : Pi 1 q- xiPix i 
If we assume that in the model (1) the w - w_ and {vi} are zero-mean indepen- 
dent Gaussian random variables with variances/I and 1, respectively, then the cost 
function in (2) is simply the associated log-likelihood function. Thus the estimate 
given by minimizing (2) will be the maximum-likelihood estimate of the weight vec- 
tor w. In particular, it can be shown that under these assumptions, RLS minimizes 
the expected prediction energy 
i 
E lie liP- E Ix'm - xwj_112. 
j=O 
Note that the LMS algorithm is an approximation to RLS where kp,i is replaced 
by IXi, so that the estimates are updated along the direction of the instantaneous 
gradient of (2)' 
Wi = Wi--1 q- lJzi(di - giTw), W-x. (4) 
2.2 The Nonlinear Case 
Suppose now that we observe an output sequence {di} that obeys the following 
nonlinear model 
= g(w) + v,, (5) 
where gi(.) is a known nonlinear function (with bounded first and second order 
derivatives), w is the unknown weight vector we intend to estimate, and {vi} is an 
unknown disturbance sequence. In a neural network context the index i in gi(.) 
will correspond to the nonlinear function that maps the weight vector to the output 
when the ith input pattern is presented, i.e., gi(w) - g(xi, w) where xi is the ith 
input pattern. As before we shall denote by wi = .T(d0,..., di) the estimate of the 
weight vector using measurements up to and including time i. The H 2 criterion for 
finding the estimate is 
mn ju-Xlw - w_112 q- Idj -gj(w)l 2 . 
j=0 
(6) 
As in the linear case, if we assume that in the model (5) the disturbances w - w_ 
and {vi} are zero-mean independent Gaussian random variables with variances uI 
and 1, respectively, then the cost function in (6) is the log-likelihood function and 
the weight vector that minimizes it is the maximum-likelihood estimate. However, 
contrary to the linear case, the solution to (6) will not, in general, minimize the 
expected prediction error energy. 
In the nonlinear case exact solutions to (6) do not exist, and the backpropagation 
algorithm is a generalization of the LMS algorithm where once more the estimates 
are updated along the negative direction of the instantaneous gradient of the log- 
likelihood function: 
agi , '(di gi(wi)), W_l. 
wi = wi- + Iww Wi-) - (7) 
Generalizations of the RLS algorithm to the nonlinear setting are the second order 
Gauss-Newton methods. 
194 Babak Hassibi, Thomas Kailat 
2.3 The Question of Robustness 
In view of the above discussion we have seen that He-optimal estimation strategies 
(see (2) and (6)) are maximum-likelihood and minimize the expected prediction 
error energy (in the linear case), if we assume that the disturbances are zero-mean 
independent Gaussian random variables. However, the question that begs itself 
is what the performance of such estimators will be if the assumptions on the dis- 
turbances are violated, or if there are modelling errors in our model so that the 
disturbances must include the modelling errors? In other words 
- is it possible that small disturbances and modelling errors may lead to large esti- 
mation errors? 
Obviously a nonrobust algorithm would be one for which the above is true, and a 
robust algorithm would be one for which small disturbances lead to small estimation 
errors. (For example in the adaptive filtering problem, where we assumed an FIR 
model, the true model may be IIR, but we neglect the tail of the filter since its 
components are small. However, unless one uses a robust estimation algorithm, it 
is conceivable that this small modelling error may result in large estimation errors.) 
The problem of robust estimation is thus an important one, and is worthy of study 
in its own right. Rather surprisingly, it had not received much attention until 
quite recently. The H  criterion has been introduced (Zames, 1981) as a means 
of studying such questions in the contexts of estimation and control. This is the 
subject of the next section. 
3 The H  Problem 
The H  estimation formulation is an attempt to address the robustness question 
raised in the previous section. The idea is to come up with estimators that minimize 
(or in the suboptimal case bound) the maximum energy gain from the disturbances 
to the estimation errors. This will guarantee that if the disturbances are small (in 
energy) then the estimation errors will be as small as possible (in energy), no matter 
what the disturbances are. In other words the maximum energy gain is minimized 
over all possible disturbances. The robustness of the H  estimators arises from 
this fact. Since they make no assumption about the disturbances, they have to 
accomodate for all conceivable disturbances, and are thus over-conservative. 
We once more assume that we observe an output sequence {di} that obeys the 
following nonlinear model 
= vi, (8) 
where gi(.) is a known nonlinear function, w is an unknown weight vector, and 
is an unknown disturbance sequence that includes noise and/or modelling errors. 
Recall that in a neural network context gi(w) = g(xi, w), where xi is the ith input 
pattern. As before we shall denote by wi = .(d0,..., di) the estimate of the weight 
vector using measurements up to and including time i, and the prediction error by 
i-- gi(W) -- gi(Wi--1)  
The optimal H  estimation problem may now be stated as follows. 
Problem ! (Optimal H  Estimation Problem)Find an H-optimal esti- 
mation strategy wi = .(do, d,...,di) that minimizes the maximum energy gain 
from the disturbances w- w_ and {vi) to the prediction errors {ei = gi(w) - 
gi(wi-1)), and obtain the resulting 
Ilell i 
%pt = inf sup 12 = inf sup Zj=o IgJ(w) - gj(wj-)l 2 
+ tlvll w-l 2 + Z}=o Ivl 2 
(9) 
HOptimal Training Algorithms and Their Relation to Backpropagation 195 
where I  > 0 reflects a priori knowledge of how close w is to the initial estimate 
w-x, and where h2 is the space of all causal square-summable sequences. '/opt is the 
so-called minimum H  norm. 
Note that the infimum in (9) is taken over all causal estimators .. Although the 
H  estimation problem has been solved in the linear case, to date there does not 
exist a satisfactory solution for the nonlinear case, and indeed the class of nonlinear 
functions gi(.) for which the above problem has a solution is not known (Ball and 
Helton, 1992). 
We have, however, been able to solve Problem 1 in the case where the gi(.) are 
bounded functions with bounded first and second order derivatives. These con- 
ditions are of course satisfied by multi-layer neural networks with sigmoidal ele- 
ments. The result is stated below, where we call the column vectors {xi) exciting 
i 
if limi_oo Yj_-0 xxj -- oc. 
Theorem 1 (H  Optimal Algorithm) Consider the model (8) where the gi(.) 
are bounded and have bounded first and second order derivatives, and suppose we 
wish to minimize the maximum energy gain from the unknowns w - w_ and {vi} 
to the prediction errors {el}. If 
1 
0 </ < infinf (10) 
i w o_(w 2' 
cOw \ 
and the {-(w)) are exciting, then the minimum H c norm is 
Ow 
'/opt  1. 
In this case an optimal H  estimator is given by the following sequence of nonlinear 
equations 
iW 0 ---- (d O - g0(W_l)) 0-w (w0) 
---- (do -- go(W-1)) OO--w (W1) -{- (dl - gl(W0)) o-w (Wl) 
: (do-go(W-1))q(wi)-c (dl-gl(WO))O-w(Wi)-c 
 ..-c (di- gi(wi-1))-'(wi) 
(11) 
Remarks: 
(i) The fact that Oi(w) = gi(wi-) implies that the output prediction has the 
same structure as our model (i.e. that there exists a weight vector estimate 
wi- that yields the desired output prediction). 
(ii) Theorem i states that '/opt = 1. While it is not intuitively difficult to 
convince oneself that '/opt cannot be less than one (simply choose the dis- 
turbances vi so that vi = el, whereby the ratio in (9) can be made arbitrarily 
close to one), the surprising fact is that '/opt is one. What this means is that 
t. he estimator of Theorem 1 guarantees that the energy of the prediction 
errors will never exceed the energy of the disturbances. This is of course 
not true of other estimators. 
(iii) Theorem 1 gives an upper bound on the quantity tt that guarantees '/opt = 
1. As we shall see below, the tt of Theorem i is a generalization of the 
learning rate tt of the LMS and backpropagation algorithms (see (4) and 
(7)), and this is in accordance with the well-known fact that LMS and 
backpropagation behave poorly if the learning rate is chosen too large. 
196 Babak Hassibi, Thomas Kailat 
(iv) In view of Theorem 1, to obtain the estimate wi we need to solve a nonlin- 
ear equation that involves all previous estimates Wo,..., wi_i. This means 
that the estimator (11) is infinite-dimensional. Although this may prohibit 
practical applications of this algorithm, it will be very useful to study spe- 
cial cases under which the estimator becomes finite-dimensional, or to find 
finite-dimensional approximations for (11). This will be done in the next 
section. 
4 Special Cases 
4.1 The Linear Case 
In the linear case the model we consider has 
xw, 
so that -(w) = xi. Although the linear function gi(w) = xw does not satisfy the 
boundedness condition of Theorem 1, let us investigate the consequence of applying 
algorithm (11) to this case. Thus the (i + 1)th equation in (11) becomes 
1 
--Wi __--(d0--x0Tw_l)X0--(dl--xlTw0)Xl--...--(di_l-XLlwi_2)Xi_l--(di-xiTWi_l)Xi . 
But from the ith equation 
1 
--Wi_ 1 : (d 0 - 2W_l)2 0 +(dl - 2w0)21 +... + (di-1 - XLlWi_2)xi_ 1 
so that 
I 1 
+ - (la) 
which is the LMS algorithm (4). Thus in the linear case the estimator of Theorem 1 
specializes to the LMS algorithm. This is expected since we have shown in (Hassibi 
et al., 1994) that the LMS algorithm is H  optimal. The result obtained there is 
as follows. 
Theorem 2 (LMS Algorithm) 
minimize the maximum energy gain from the unknowns w - w_ and vi 
prediction errors el. If the input vectors xi are exciting and 
Consider the model (1), and suppose we wish to 
to the 
1 
0 </a < inf- (13) 
then the minimum H  norm is %pt = 1. In this case an optimal H  estimator is 
given by the LMS algorithm with learning rate , viz. 
, W-- 1 (14) 
Wi -- Wi-1 -{- txi(di - xiTWi_l) 
Note that in the linear case the estimator is finite-dimensional since to find wi we 
only require knowledge of wi_x. 
4.2 Backpropagation 
As mentioned at the end of Section 3, the H  optimal estimator of Theorem 1 is 
infinite-dimensional, in the sense that to obtain the estimate wi we need all previous 
estimates wo,..., wi_. We may obtain finite-dimensional approximations to the 
estimator (11) by constructing approximations to the nonlinear equations appearing 
HOptimal Training Algorithms and Their Relation to Backpropagation 197 
in (11). However, the resulting estimators will no longer be H a optimal in a global 
sense, but will only have local optimality. 
The method that we shall use to obtain such approximate estimators is to assume 
that we have found the estimate wi-, and to use it as an initial guess to solve the 
(i + 1)th equation in (11) whose solution is wi. Depending on what algorithm we 
use to solve the (i + 1)th equation with initial guess wi_, we shall get a different 
approximate estimator to (11). 
To this end, suppose that we have solved the ith equation in (11) and have obtained 
wi-1, i.e. 
i , Ogo Ogi-1 W 
--Wi-x -- (do - go(W-x))-w (Wi-x) q- . .. q- (di-x - gi-x(wi-2))-b--W-w (i-1). 
We now intend to solve the (i+ 1)th equation in (11) for wi. Note that this equation 
is of the form x = f(x) (where x: wi). If we use one step of the fixed-point iteration 
method xj+ = f(xj) with initial condition x0 = wi-, we have 
1 Ogo 0___ 
-wi = (do - g0(w-))ww (wi-) +... + (&- - g-(w-2)) (w_15) 
lW,_ 1 
q-(di - gi(wi-1) Ogi (wi-1) (16) 
1 , Ogi, 
= -wi-1 q- (di- gi(wi-))ww[Wi-) (17) 
which is the backpropagation algorithm (7). Note that since we only use wi_ to 
compute wi, backpropagation is a finite-dimensional approximation to the global 
H  optimal estimator (11). Due to its approximate nature, backpropagation has 
only local H  optimality properties, as we have shown in (Hassibi et al., 1994). 
The result is stated below, where the column vectors {xi) are called persistently 
i i 
exciting if, liro_oo  j=0 xjx > aI, for some a > 0. 
Theorem 3 (Local H  Optimality) Consider the model (8) and the backprop- 
agation algorithm (7). Suppose that the -(wi_) are persistently exciting, and 
that (It3) is satisfied. Then for each  > O, there exist 5, 52 > 0 such that for all 
I TM - w_[ < 5 and all v C h with [vii < 52, we have 
IIe II 2 < i 
/-llw- w-lle II v II 2 - 
Note that contrary to the global Theorem 1, backpropagation cannot achieve 7opt -- 
1, and that it bounds the energy gain by x/T + e only for small enough disturbances. 
4.3 A Second-Order Algorithm 
If instead of using one step of the fixed-point iteration to solve for wi, as was done in 
Section 4.2, we use one step of the Newton-Raphson method with initial condition 
w_, we obtain the following algorithm as an approximation to (11). 
Wi ---- 
Wi--1 q- l(di- gi(Wi-1))(I)i'(Wi-1), 
(I)i--1 q- lg(di-1 - gi-l(Wi-2)) ---L(wi-2), 
w-1 
(I)0 = I. 
(18) 
As before, (18) has only local optimality properties. However, since the Newton- 
Raphson method is less crude than the fixed-point iteration, it is expected to have 
198 Babak Hassibi, Thomas Kailat 
better local properties than backpropagation. The complexity of the algorithm is 
O(n 2) per iteration which is, of course, higher than backpropagation which requires 
only O(n) per iteration. 
5 Conclusion 
We have derived global H c optimal estimators for training neural networks. Such 
H  optimal algorithms will be most applicable in uncertain environments where 
there may be modelling errors, and where the statistics and/or distributions of the 
disturbances are not known (or are too expensive to obtain). 
The resulting H  optimal algorithm of Theorem 1 is infinite-dimensional, so that 
computing the most recent weight vector estimate requires knowledge of all previ- 
ous weight estimates. We considered two finite-dimensional approximations to this 
estimator (one of which was backpropagation) with the property that constructing 
the most recent weight estimate required only the immediately preceding weight 
estimate. However, the estimator of Theorem I has a very interesting structure 
that should allow for a wide variety of approximations, some of which may yield 
alternatives to the backpropagation algorithm. In particular, it would be interest- 
ing to study the possiblity of constructing estimators where updating the weight 
estimates requires more than one (but only finitely many) previous estimates. 
The estimators constructed in this paper used prediction error as their criterion and 
should therefore have good generalization properties. It is also possible to construct 
similar estimators using filtered or smoothing error as the criterion, though this was 
not done due to lack of space. 
Acknowledgements 
This work was supported in part by the Air Force Office of Scientific Research, 
Air Force Systems Command under Contract AFOSR91-0060 and by the Army 
Research Office under contract DAAL03-89-K-0109. 
References 
J. A. Ball and J. W. Helton. (1992) Nonlinear H  control theory for stable plants. 
Math. Control Signals Systems, 5:233-261. 
B. Hassibi, A. H. Sayed, and T. Kailath. (1994) H  optimality criteria for LMS 
and backpropagation. To appear in Advances in Neural Information Processing 
Systems, Vol. 6, Morgan-Kaufmann. 
S. Haykin. (1991) Adaptive Filter Theory. Prentice Hall, Englewood Cliffs, NJ. 
P. P. Khargonekar and K. M. Nagpal. (1991) Filtering and smoothing in an H c 
setting. IEEE Trans. on Automatic Control, AC-36:831-847. 
D. E. Rumelhart, J. L. McClelland and the PDP Research Group. (1986) Parallel 
distributed processing: explorations in the microstructure of cognition. Cambridge, 
Mass. : MIT Press. 
B. Widrow and M. E. Hoff, Jr. (1960) Adaptive switching circuits. IRE WESCON 
Cony. Rec., Pt.4:96-104. 
G. Zames. (1981) Feedback optimal sensitivity: model preference transformation, 
multiplicative seminorms and approximate inverses. IEEE Trans. on Automatic 
Control, AC-26:301-320. 
