Optimality Criteria for LMS and 
Backpropagation 
Babak Hassibi 
Information Systems Laboratory 
Stanford University 
Stanford, CA 94305 
All H. Sayed 
Dept. of Elec. and Comp. Engr. 
University of California Santa Barbara 
Santa Barbara, CA 93106 
Thomas Kailath 
Information Systems Laboratory 
Stanford University 
Stanford, CA 94305 
Abstract 
We have recently shown that the widely known LMS algorithm is 
an H a optimal estimator. The H a criterion has been introduced, 
initially in the control theory literature, as a means to ensure ro- 
bust performance in the face of model uncertainties and lack of 
statistical information on the exogenous signals. We extend here 
our analysis to the nonlinear setting often encountered in neural 
networks, and show that the backpropagation algorithm is locally 
H a optimal. This fact provides a theoretical justification of the 
widely observed excellent robustness properties of the LMS and 
backpropagation algorithms. We further discuss some implications 
of these results. 
I Introduction 
The LMS algorithm was originally conceived as an approximate recursive procedure 
that solves the following problem (Widrow and Hoff, 1960): given a sequence of n x 1 
input column vectors {hi), and a corresponding sequence of desired scalar responses 
{di), find an estimate of an n x 1 column vector of weights w such that the sum 
of squared errors, -./N=0[di- hiw[ 2, is minimized. The LMS solution recursively 
351 
352 HassiN, Sayed, and Kailath 
updates estimates of the weight vector along the direction of the instantaneous gra- 
dient of the squared error. It has long been known that LMS is an approximate 
minimizing solution to the above least-squares (or H 2) minimization problem. Like- 
wise, the celebrated backpropagation algorithm (Rumelhart and McClelland, 1986) 
is an extension of the gradient-type approach to nonlinear cost functions of the form 
/N=0 ]di- hi(w)l 2, where hi(.) are known nonlinear functions (e.g., sigmoids). It 
also updates the weight vector estimates along the direction of the instantaneous 
gradients. 
We have recently shown (HassiN, Sayed and Kailath, 1993a) that the LMS algo- 
rithm is an H -optimal filter, where the H  norm has recently been introduced 
as a robust criterion for problems in estimation and control (Zames, 1981). In gen- 
eral terms, this means that the LMS algorithm, which has long been regarded as 
an approximate least-mean squares solution, is in fact a minimizer of the H  error 
norm and not of the H 2 norm. This statement will be made more precise in the 
next few sections. In this paper, we extend our results to a nonlinear setting that 
often arises in the study of neural networks, and show that the backpropagation 
algorithm is a locally H-optimal filter. These facts readily provide a theoretical 
justification for the widely observed excellent robustness and tracking properties of 
the LMS and backpropagation algorithms, as compared to, for example, exact least 
squares methods such as RLS (Haykin, 1991). 
In this paper we attempt to introduce the main concepts, motivate the results, and 
discuss the various implications. We shall, however, omit the proofs for reasons of 
space. The reader is refered to (Hassibi et al. 1993a), and the expanded version of 
this paper for the necessary details. 
2 Linear H  Adaptive Filtering 
We shall begin with the definition of the H  norm of a transfer operator. As 
will presently become apparent, the motivation for introducing the H  norm is to 
capture the worst case behaviour of a system. 
Let h2 denote the vector space of square-summable complex-valued causal sequences 
{fk, 0 _< k < co}, viz., 
h2 = {set of sequences {fk} such that ff < co} 
k=0 
with inner product < {f),{g} > = y=ofgk , where  denotes complex 
conjugation. Let T be a transfer operator that maps an input sequence {ui} to an 
output sequence {Yi}. Then the H  norm of T is equal to 
IIrlloo- sup Ilyl12 
11'112 
the notation I111= anote the n-norm or sequence viz., 
k=Ok k 
The H  norm may thus be regarded as the maximum ener9 gain from the input 
u to the output F. 
OptimaLity Criteria for LMS and Backpropagation 353 
Suppose we observe an output sequence {di} that obeys the following model: 
& = + vi 
where hi T = [ hi hi2 ... hi,, ] is aknown input vector, w is an unknown weight 
vector, and {vi} is an unknown disturbance, which may also include modeling errors. 
We shall not make any assumptions on the noise sequence {vi} (such as whiteness, 
normally distributed, etc.). 
Let wi = .T'(do, di,..., di) denote the estimate of the weight vector w given the 
observations {d j} from time 0 up to and including time i. The objective is to 
determine the functional .T', and consequently the estimate wi, so as to minimize a 
certain norm defined in terms of the prediction error 
=hTw-Tw,- 
which is the difference between the true (uncorrupted) output hTw and the pre- 
dicted output hiZwi_. Let T denote the transfer operator that maps the unknowns 
{w- w_, {vi}} (where w-1 denotes an initial guess of w) to the prediction errors 
{ei}. The H  estimation problem can now be formulated as follows. 
Problem 1 (Optimal H  Adaptive Problem) Find an H-optimal estima- 
tion strategy wi = .T'(do, dl,..., di) that minimizes Ilrlloo, and obtain the resulting 
7o 2 = inf []Tll2 = inf sup Ilel122 
w-l' + Ilvlll (2) 
where Iw- w_112 = (w- w_)'(w - w_), and I u is a positive constant that reflects 
apriori knowledge as to how close w is to the initial guess w_. 
Note that the infimum in (2) is taken over all causal estimators .T'. The above 
problem formulation shows that H  optimal estimators guarantee the smallest 
prediction error energy over all possible disturbances of fixed energy. H  estimators 
are thus over conservative, which reflects in a more robust behaviour to disturbance 
variation. 
Before stating our first result we shall define the input vectors {hi} exciting if, and 
only if, 
N 
lim 
N-+oo 
i=0 
Theorem 1 (LMS Algorithm) Consider the model (I), and suppose we wish to 
minimize the H  norm of the transfer operator from the unknowns w - w_ and 
vi to the prediction errors el. If the input vectors hi are exciting and 
1 
0 <  < inf (3) 
i hiThi 
then the minimum H  
given by the LMS algorithm witIs learning rate I u, viz. 
Wi --' Wi--1 '- hi(di -- hiTwi-1) 
norm is '/opt -- 1. In this case an optimal H  estimator is 
, w_ (4) 
354 Hassibi, Sayed, and Kailath 
In other words, the result states that the LMS algorithm is an Ha-optimal filter. 
Moreover, Theorem 1 also gives an upper bound on the learning rate/ that ensures 
the H  optimality of LMS. This is in accordance with the well-known fact that 
LMS behaves poorly if the learning rate is too large. 
Intuitively it is not hard to convince oneself that 7opt cannot be less than one. To 
this end suppose that the estimator has chosen some initial guess w_l. Then one 
may conceive of a disturbance that yields an observation that coincides with the 
output expected from w_l, i.e. 
hiT w-1 = hiT w q- vi -- di 
In this case one expects that the estimator will not change its estimate of w, so that 
W i -- W_ 1 for all i. Thus the prediction error is 
i -- hi T w -- hi T wi_ 1 = hi T w - hi T w_ 1 -- -v i 
and the ratio in (2) can be made arbitrarily close to one. 
The surprising fact though is that '/opt is one and that the LMS algorithm achieves 
it. What this means is that LMS guarantees that the energy of the prediction 
error will never exceed the energy of the disturbances. This is not true for other 
estimators. For example, in the case of the recursive least-squares (RLS) algorithm, 
one can come up with a disturbance of arbitrarily small energy that will yield a 
prediction error of large energy. 
To demonstrate this, we consider a special case of model (1) where hi is now a 
scalar that randomly takes on the values +1 or -1. For this model/ must be less 
than 1 and we chose the value tt -- .9. We compute the H * norm of the transfer 
operator from the disturbances to the prediction errors for both RLS and LMS. We 
also compute the worst case RLS disturbance, and show the resulting prediction 
errors. The results are illustrated in Fig. 1. As can be seen, the H * norm in 
the RLS case increases with the number of observations, whereas in the LMS case 
it remains constant at one. Using the worst case RLS disturbance, the prediction 
error due to the LMS algorithm goes to zero, whereas the prediction error due to 
the RLS algorithm does not. The form of the worst case RLS disturbance is also 
interesting; it competes with the true output early on, and then goes to zero. 
We should mention that the LMS algorithm is only one of a family of H * optimal 
estimators. However, LMS corresponds to what is called the central solution, and 
has the additional properties of being the maximum entropy solution and the risk- 
sensitive optimal solution (Whittle 1990, Glover and Mustafa 1989, Hassibi et al. 
199ab). 
If there is no disturbance in (1) we have the following 
Corollary 1 If in addition to the assumptions of Theorem I there is no disturbance 
in (I), then LMS guarantees 11 e 112_< y-llw- w_ll 2, meaning that the prediction 
error converges to zero. 
Note that the above Corollary suggests that the larger y is (provided (3) is satisfied) 
the faster the convergence will be. 
Before closing this section we should mention that if instead of the prediction error 
one were to consider the filtered error eli -- hiw - hiwi, then the H  optimal 
estimator is the so-called normalized LMS algorithm (Hassibi et al. 1993a). 
Optimality Criteria for LMS and Backpropagation 355 
2.5 
2 
1.5 
1 
0.5 
0 
1 
0.98 
0.96 
0.94 
0.92 
0.9 
o 
5O 5O 
0.5 
0 
-0.5 
(c) 
I 
0.5 
(d) 
o 
o.I It , ,., , .r- c-:r, ; 
-1 -1 
0 50 0 50 
Figure 1: H  norm of transfer operator as a function of the number of observations 
for (a) RLS, and (b) LMS. The true output and the worst case disturbance signal 
(dotted curve) for RLS are given in (c). The predicted errors for the RLS (dashed) 
and LMS (dotted) algorithms corresponding to this disturbance are given in (d). 
The LMS predicted error goes to zero while the RLS predicted error does not. 
3 Nonlinear H  Adaptive Filtering 
In this section we suppose that the observed sequence {di} obeys the following 
nonlinear model 
di = hi(w) + (5) 
where hi(.) is a known nonlinear function (with bounded first and second order 
derivatives), w is an unknown weight vector, and {vi} is an unknown disturbance 
sequence that includes noise and/or modelling errors. In a neural network context 
the index i in hi(.) will correspond to the nonlinear function that maps the weight 
vector to the output when the ith input pattern is presented, i.e., hi(w) = h(x(O, w) 
where x  is the ith input pattern. As before we shall denote by wi = .T(do,..., di) 
the estimate of the weight vector using measurements up to and including time i, 
and the prediction error by 
i ---- hi(w)--hi(wi-1) 
Let T denote the transfer operator that maps the unknowns/disurbances 
{w-w_i, {vi}} to the prediction errors {e}. 
Problem 2 (Optimal Nonlinear H  Adaptive Problem) Find 
an H-optimal estimation strategy wi - .T(do, dl,..., di) that minimizes IlTIl, 
356 Hassibi, Sayed, and Kailath 
and obtain the resulting 
V0  --inf IlTll 2 
= inf sup 
  w,vEh2 
Currently there is no general solution to the above problem, and the class of non- 
linear functions hi (.) for which the above problem has a solution is not known (Ball 
and Helton, 1992). 
To make some headway, though, note that by using the mean value theorem (5) 
may be rewritten as 
(:?hi T 
d i -- hi(Wi_l) q- (-- (W_l).(w - wi_l) q- v i (7) 
where w_ is a point on the line connecting w and wi-. Theorem 1 applied to (7) 
shows that the recursion 
W i -- Wi-- 1 q- lww(Wi_l)(d i -- hi(Wi_l) ) (S) 
will yield 7 = 1. The problem with the above algorithm is that the w i are 
not known. But it suggests that the 7opt in Problem 2 (if it exists) cannot be 
less than one. Moreover, it can be seen that the backpropagation algorithm is an 
approximation to (8) where w is replaced by wi. To pursue this point further we 
use again the mean value theorem to write (5) in the alternative form 
Ohi T 1 )T. 02hi 
di -- hi(Wi_l)q-w w (wi-1).(w-wi-1)q-(w-wi-1 Ow 2 (i-1)'(w-wi-1) q-vi 
(9) 
where once more tbi_ lies on the line connecting wi-1 and w. Using (9) and 
Theorem I we have the following result. 
Theorem 2 (Backpropagation Algorithm) Consider the model (5) 
backpropagation algorithm 
Ohi . 
Wi -- Wi_l q- lww (Wi-1)(di -- hi(Wi_l) ) 
and the 
then if the ow [wi-) are exciting, and 
, W-1 (10) 
1 
0 < y < inf (11) 
i Oh T 
Ow (Wi--1) Oh(Wi 1) 
 (9w - 
then for all nonzero w, v  he 
<1 
0h T 2 
II o (Wi_l)(W- wi_x) Ils 
I(W-- Wi_l) T Oa'-xh (i_l).(W -- Wi_l) 112 2 -- 
-IIw-- W-Xl 2q- I[ vi q-  o 
where 
Ohi _  
(W-- Wi--1) T OW9 (i--1) (W-- Wi--1) ---- hi(w ) -- hi(wi-1) ohiT 
(wi-1).(w-- wi-1) 
Hoo Optimality Criteria for LMS and Backpropagation 357 
The above result means that if one considers a new disturbance v i - vi q-  
wi_) T -(i_) (w- wi-), whose second term indicates how far hi(w) is from a 
first order approximation at point wi-, then backpropagation guarantees that the 
energy of the linearized prediction error 0 (wi_)(w- wi_) does not exceed the 
energy of the new disturbances w - W_l and v i. 
It seems plausible that if w_  is close enough to w then the second term in v i should 
be small and the true and linearized prediction errors should be close, so that we 
should be able to bound the ratio in (6). Thus the following result is expected, 
where we have defined the vectors {hi} persistently exciting if, and only if, for all 
a 6  
N 
i=0 
Theorem 3 (Local H  Optimality) Consider the model (5) and the backprop- 
agation algorithm (10). Suppose that the h'fw 
Ow  i-   are persistently exciting, and 
that (11) is satisfied. Then for each  > O, there exist 51, 52 > 0 such that for all 
]W- W_11 < (1 and all v 6 h2 with ]vii < 2, we have 
II < 
II v - 
The above Theorem indicates that the backpropagation algorithm is locally H * 
optimal. In other words for w_ sufficiently close to w, and for sufficiently small 
disturbance, the ratio in (6) can be made arbitrarily close to one. Note that the 
conditions on w and vi are reasonable, since if for example w is too far from W_l, 
or if some vi is too large, then it is well known that backpropagation may get stuck 
in a local minimum, in which case the ratio in (6) may get arbitrarily large. 
As before (11) gives an upper bound on the learning rate /, and indicates why 
backpropagation behaves poorly if the learning rate is too large. 
If there is no disturbance in (5) we have the following 
Corollary 2 If in addition to the assumptions in Theorem 3 there is no disturbance 
in (5), then for every  > 0 there exists a t5 > 0 such that for all I w - W_l[ < , 
the backpropagation algorithm will yield l[ e' [12_< y-1(1 + ), meaning that the 
prediction error converges to zero. Moreover wi will converge to w. 
Again provided (11) is satisfied, the larger/ is the faster the convergence will be. 
4 Discussion and Conclusion 
The results presented in this paper give some new insights into the behaviour of 
instantaneous gradient-based adaptive algorithms. We showed that if the underlying 
observation model is linear then LMS is an H * optimal estimator, whereas if the 
underlying observation model is nonlinear then the backpropagation algorithm is 
locally H * optimal. The H * optimality of these algorithms explains their inherent 
robustness to unknown disturbances and modelling errors, as opposed to other 
estimation algorithms for which such bounds are not guaranteed. 
358 Hassibi, Sayed, and Kailath 
Note that if one considers the transfer operator from the disturbances to the pre- 
diction errors, then LMS (backpropagation) is H  optimal (locally), over all causal 
estimators. This indicates that our result is most applicable in situations where 
one is confronted with real-time data and there is no possiblity of storing the train- 
ing patterns. Such cases arise when one uses adaptive filters or adaptive neural 
networks for adaptive noise cancellation, channel equalization, real-time control, 
and undoubtedly many other situations. This is as opposed to pattern recognition, 
where one has a set of training patterns and repeatedly retrains the network until 
a desired performance is reached. 
Moreover, we also showed that the H  optimality result leads to convergence proofs 
for the LMS and backpropagation algorithms in the absence of disturbances. We 
can pursue this line of thought further and argue why choosing large learning rates 
increases the resistance of backpropagation to local minima, but we shall not do so 
due to lack of space. 
In conclusion these results give a new interpretation of the LMS and backpropaga- 
tion algorithms, which we believe should be worthy of further scrutiny. 
Acknowledgement s 
This work was supported in part by the Air Force Office of Scientific Research, Air 
Force Systems Command under Contract AFOSR91-0060 and in part by a grant 
from Rockwell International Inc. 
References 
J. A. Ball and J. W. Helton. (1992) Nonlinear H m control theory for stable plants. 
Math. Control Signals Systems, 5:233-261. 
K. Glover and D. Mustafa. (1989) Derivation of the maximum entropy H m con- 
troller and a state space formula for its entropy. Int. J. Control, 50:899-916. 
B. Hassibi, A. H. Sayed, and T. Kailath. (1993a) LMS is H * Optimal. IEEE Conf. 
on Decision and Control, 74-80, San Antonio, Texas. 
B. Hassibi, A. H. Sayed, and T. Kailath. (1993b) Recursive linear estimation in 
Krein spaces - part II: Applications. IEEE Conf. on Decision and Control, 3495- 
3501, San Antonio, Texas. 
S. Haykin. (1991) Adaptive Filter Theory. Prentice Hall, Englewood Cliffs, NJ. 
D. E. Rumelhart, J. L. McClelland and the PDP Research Group. (1986) Parallel 
distributed processing: explorations in the microstructure of cognition. Cambridge, 
Mass. : MIT Press. 
P. Whittle. (1990) Risk Sensitive Optimal Control. John Wiley and Sons, New 
York. 
B. Widrow and M. E. Hoff, Jr. (1960) Adaptive switching circuits. IRE WESCON 
Cony. Rec., Pt.4:96-104. 
G. Zames. (1981) Feedback optimal sensitivity: model preference transformation, 
multiplicative seminorms and approximate inverses. IEEE Trans. on Automatic 
Control, AC-26:301-320. 
