Worst-case Loss Bounds 
for Single Neurons 
David P. Helmbold 
Department of Computer Science 
University of California, Santa Cruz 
Santa Cruz, CA 95064 
USA 
Jyrki Kivinen 
Department of Computer Science 
P.O. Box 26 (Teollisuuskatu 23) 
FIN-00014 University of Helsinki 
Finland 
Manfred K. Warmuth 
Department of Computer Science 
University of California, Santa Cruz 
Santa Cruz, CA 95064 
USA 
Abstract 
We analyze and compare the well-known Gradient Descent algo- 
rithm and a new algorithm, called the Exponentiated Gradient 
algorithm, for training a single neuron with an arbitrary transfer 
function. Both algorithms are easily generalized to larger neural 
networks, and the generalization of Gradient Descent is the stan- 
dard back-propagation algorithm. In this paper we prove worst- 
case loss bounds for both algorithms in the single neuron case. 
Since local minima make it difficult to prove worst-case bounds 
for gradient-based algorithms, we must use a loss function that 
prevents the formation of spurious local minima. We define such 
a matching loss function for any strictly increasing differentiable 
transfer function and prove worst-case loss bound for any such 
transfer function and its corresponding matching loss. For exam- 
ple, the matching loss for the identity function is the square loss 
and the matching loss for the logistic sigmoid is the entropic loss. 
The different structure of the bounds for the two algorithms indi- 
cates that the new algorithm out-performs Gradient Descent when 
the inputs contain a large number of irrelevant components. 
310 D.P. HELMBOLD, J. KIVINEN, M. K. WARMUTH 
1 INTRODUCTION 
The basic element of a neural network, a neuron, takes in a number of real-valued 
input variables and produces a real-valued output. The input-output mapping of 
a neuron is defined by a weight vector w  R v, where N is the number of input 
variables, and a transfer function 4. When presented with input given by a vector 
x 6 R v, the neuron produces the output ) = O(w. x). Thus, the weight vector 
regulates the influence of each input variable on the output, and the transfer function 
can produce nonlinearities into the input-output mapping. In particular, when the 
transfer function is the commonly used logistic function, 4(p) = 1/(1 + e-P), the 
outputs are bounded between 0 and 1. On the other hand, if the outputs should 
be unbounded, it is often convenient to use the identity function as the transfer 
function, in which case the neuron simply computes a linear mapping. In this 
paper we consider a large class of transfer functions that includes both the logistic 
function and the identity function, but not discontinuous (e.g. step) functions. 
The goal of learning is to come up with a weight vector w that produces a 
desirable input-output mapping. This is achieved by considering a sequence 
$ = ((x,y),...,(xt,yt)) of ecamples, where for t = 1,..., the value yt  R 
is the desired output for the input vector xt, possibly distorted by noise or other 
errors. We call xt the tth instance and yt the tth outcome. In what is often called 
batch learning, all  examples are given at once and are available during the whole 
training session. As noise and other problems often make it impossible to find a 
weight vector w that would satisfy b(w. xt) = yt for all t, one instead introduces a 
loss function L, such as the square loss given by L(y,)) = (y- ))2/2, and finds a 
weight vector w that minimizes the empirical loss (or training error) 
Loss(w, $): ] L(yt, (w. xt)) (1) 
t=l 
With the square loss and identity transfer function ;b(p) = p, this is the well-known 
linear regression problem. When ;b is the logistic function and L is the entropic loss 
given by L(y,)) = yln(y/)) + (1 - y)ln((1 - y)/(1 - ))), this can be seen as a 
special case of logistic regression. (With the entropic loss, we assume 0 < yt, )t < 1 
for all t, and use the convention 0 In 0 = 0 In(0/0) = 0.) 
In this paper we use an on-line prediction (or life-long learning) approach to the 
learning problem. It is well known that on-line performance is closely related to 
batch learning performance (Littlestone, 1989; Kivinen and Warmuth, 1994). 
Instead of receiving all the examples at once, the training algorithm begins with 
some fixed start vector w, and produces a sequence Wl,. ., wt+ of weight vectors. 
The new weight vector wt+ is obtained by applying a simple update rule to the 
previous weight vector wt and the single example (xt, yt). In the on-line prediction 
model, the algorithm uses its tth weight vector, or hypothesis, to make the prediction 
)t = ;b(wt 'xt). The training algorithm is then charged a loss L(yt, .Or) for this tth 
trial. The performance of a training algorithm A that produces the weight vectors 
wt on an example sequence $ is measured by its total (cumulative) loss 
Loss(A, $) =  L(yt, qS(wt. xt)) (2) 
t=l 
Our main results are bounds on the cumulative losses for two on-line prediction 
algorithms. One of these is the standard Gradient Descent (GD) algorithm. The 
other one, which we call EG + is also based on the gradient but uses it in a different 
, 
Worst-case Loss Bounds for Single Neurons 311 
manner than GD. The bounds are derived in a worst-case setting: we make no as- 
sumptions about how the instances are distributed or the relationship between each 
instance xt and its corresponding outcome yt. Obviously, some assumptions are 
needed in order to obtain meaningful bounds. The approach we take is to compare 
the total losses, Loss(GD, S) and Loss(EG+,S), to the least achievable empirical 
loss, infw Loss(w, $). If the least achievable empirical loss is high, the dependence 
between the instances and outcomes in $ cannot be tracked by any neuron using 
the transfer function, so it is reasonable that the losses of the algorithms are also 
high. More interestingly, if some weight vector achieves a low empirical loss, we 
also require that the losses of the algorithms are low. Hence, although the algo- 
rithms always predict based on an initial segment of the example sequence, they 
must perform almost as well as the best fixed weight vector for the whole sequence. 
The choice of loss function is crucial for the results that we prove. In particular, 
since we are using gradient-based algorithms, the empirical loss should not have spu- 
rious local minima. This can be achieved for any differentiable increasing transfer 
function q5 by using the loss function L defined by 
,-() 
3-() 
(3) 
For y < ) the value L(y, )) is the area in the z x qS(z) plane below the function 
65(z), above the line 6(z) = y, and to the left of the line z = qS-()). We call L the 
matching loss function for transfer function qS, and will show that for any example 
sequence $, if L = L then the mapping from w to Loss(w, $) is convex. For 
example, if the transfer function is the logistic function, the matching loss function 
is the entropic loss, and if the transfer function is the identity function, the matching 
loss function is the square loss. Note that using the logistic activation function with 
the square loss can lead to a very large number of local minima (Auer et al., 1996). 
Even in the batch setting there are reasons to use the entropic loss with the logistic 
transfer function (see, for example, Solla et al., 1988). 
How much our bounds on the losses of the two algorithms exceed the least empirical 
loss depends on the maximum slope of the transfer function we use. More impor- 
tantly, they depend on various norms of the instances and the vector w for which 
the least empirical loss is achieved. As one might expect, neither of the algorithms 
is uniformly better than the other. Interestingly, the new EG + algorithm is better 
when most of the input variables are irrelevant, i.e., when some weight vector w 
with wi = 0 for most indices i has a low empirical loss. On the other hand, the 
GD algorithm is better when the weight vectors with low empirical loss have many 
nonzero components, but the instances contain many zero components. 
The bounds we derive concern only single neurons, and one often combines a number 
of neurons into a multilayer feedforward neural network. In particular, applying 
the Gradient Descent algorithm in the multilayer setting gives the famous back 
propagation algorithm. Also the EG + algorithm, being gradient-based, can easily 
be generalized for multilayer feedforward networks. Although it seems unlikely 
that our loss bounds will generalize to multilayer networks, we believe that the 
intuition gained from the single neuron case will provide useful insight into the 
relative performance of the two algorithms in the multilayer case. Furthermore, the 
EG + algorithm is less sensitive to large numbers of irrelevant attributes. Thus it 
might be possible to avoid multilayer networks by introducing many new inputs, 
each of which is a non-linear function of the original inputs. Multilayer networks 
remain an interesting area for future study. 
Our work follows the path opened by Littlestone (1988) with his work on learning 
312 D.P. HELMBOLD, J. KIVINEN, M. K. WARMUTH 
thresholded neurons with sparse weight vectors. More immediately, this paper is 
preceded by results on linear neurons using the identity transfer function (Cesa- 
Bianchi et al., 1996; Kivinen and Warmuth, 1994). 
2 THE ALGORITHMS 
This section describes how the Gradient Descent training algorithm and the new 
Exponentiated Gradient training algorithm update the neuron's weight vector. 
For the remainder of this paper, we assume that the transfer function b is increasing 
and differentiable, and Z is a constant such that ;b'(p) < Z holds for all p 6 R. For 
the loss function L defined by (3) we have 
 x)) 
Owi -- (b(w. x) - y)xi (4) 
Treating L(y, b(w.x)) for fixed x and y as a function of w, we see that the Hessian 
H of the function is given by Hi1 = q(w.x)xixj. Then vTHv = b(w.x)(v.x) 2, so 
H is positive definite. Hence, for an arbitrary fixed $, the empirical loss Loss(w, $) 
defined in (1) as a function of w is convex and thus has no spurious local minima. 
We first describe the Gradient Descent (GD) algorithm, which for multilayer net- 
works leads to the back-propagation algorithm. Recall that the algorithm's predic- 
tion at trial t is )t = b(wt  xt), where wt is the current weight vector and xt is 
the input vector. By (4), performing gradient descent in weight space on the loss 
incurred in a single trial leads to the update rule 
Wtq-1 -- Wt -- (t -- yt)xt 
The parameter /is a positive learning rate that multiplies the gradient of the loss 
function with respect to the weight vector wt. In order to obtain worst-case loss 
bounds, we must carefully choose the learning rate /. Note that the weight vector 
t-1 
wt of GD always satisfies wt = w + Y']i=i aixi for some scalar coefficients ai. 
Typically, one uses the zero initial vector w = 0. 
A more recent training algorithm, called the Exponentiated Gradient (EG) algo- 
rithm (Kivinen and Warmuth, 1994), uses the same gradient in a different way. This 
algorithm makes multiplicative (rather than additive) changes to the weight vector, 
and the gradient appears in the exponent. The basic version of the EG algorithm 
also normalizes the weight vector, so the update is given by 
N 
j-1 
The start vector is usually chosen to be uniform, w = (l/N,..., l/N). Notice that 
it is the logarithms of the weights produced by the EG training algorithm (rather 
than the weights themselves) that are essentially linear combinations of the past 
examples. As can be seen from the update, the EG algorithm maintains the con- 
straints wt,i > 0 and Y-]i wt,i = 1. In general, of course, we do not expect that such 
constraints are useful. Hence, we introduce a modified algorithm EG + by employing 
a linear transformation of the inputs. In addition to the learning rate /, the EG = 
algorithm has a scaling factor U > 0 as a parameter. We define the behavior of 
EG + on a sequence of examples $ = ((x,y),...,(xt,yt)) in terms of the EG al- 
gorithm's behavior on a transformed example sequence $' = ((x, y),..., (x, yt)) 
Worst-case Loss Bounds for Single Neurons 313 
where x' = (Uz,..., Uzjv, -Uzl, . .., --UzN). The EG algorithm uses the uniform 
start vector (1/(2N),..., 1/(2N)) and learning rate supplied by the EG + algorithm. 
At each time time t the N-dimensional weight vector w of EG + is defined in terms 
of the 2N-dimensional weight vector w  of EG as 
w,,, : U(w;,, - w',,N+,). 
Thus EG + with scaling factor U can learn any weight vector w  R S with IIwllx < 
u by having the embedded EG algorithm learn the appropriate 2N-dimensional 
(nonnegative and normalized) weight vector w t. 
3 MAIN RESULTS 
The loss bounds for the GD and EG + algorithms can be written in similar forms 
that emphasize how different algorithms work well for different problems. When 
L = L, we write Loss(w, $) and Loss(A, $) for the empirical loss of a weight 
vector w and the total loss of an algorithm A, as defined in (1) and (2). We give 
the upper bounds in terms of various norms. For x  R N, the 2-norm IIxll2 is the 
Euclidean length of the vector x, the 1-norm IIxlll the sum of the absolute values 
of the components of x, and the cx>-norm llxll the maximum absolute value of 
any component of x. For the purposes of setting the learning rates, we assume 
that before training begins the algorithm gets an upper bound for the norms of 
instances. The GD algorithm gets a parameter X2 and EG a parameter X such 
that IIxtll2 < x2 and Ilxtll _< X hold for all t. Finally, recall that Z is an upper 
bound on St(p). We can take Z = 1 when 5 is the identity function and Z = 1/4 
when q5 is the logistic function. 
Our first upper bound is for GD. For any sequence of examples $ and any weight 
vector u  R S, when the learning rate is r/= 1/(2XZ) we have 
Loss(GD,$) < Loss(., $)+ (11ullX2fZ  
Our upper bounds on the EG + algorithm require that we restrict the one-norm of 
the comparison class: the set of weight vectors competed against. The comparison 
class contains all weight vectors u such that Ilull is at most the scaling factor, 
U. For any scaling factor U, any sequence of examples $, and any weight vector 
u  R/v with Ilull _< u, we have 
4 
Loss(EG+, S) _ Loss(u, S) + (UXoo)Zln(2N) 
when the learning rate is  - 1/(4(UX)Z). 
Note that these bounds depend on both the unknown weight vector u and some 
norms of the input vectors. If the algorithms have some further prior information 
on the sequence S they can make a more informed choice of 9. This leads to bounds 
with a constant of i before the the Loss,(u, S) term at the cost of an additional 
square-root term (for details see the full paper, Helmbold et al., 1996). 
It is important to realize that we bound the total loss of the algorithms over any 
adversarially chosen sequence of examples where the input vectors satisfy the norm 
bound. Although we state the bounds in terms of loss on the data, they imply that 
the algorithms must also perform well on new unseen examples, since the bounds 
still hold when an adversary adds these additional examples to the end of the 
sequence. A formal treatment of this appears in several places (Littlestone, 1989; 
314 D.P. HELMBOLD, J. KIVINEN, M. K. WARMUTH 
Kivinen and Warmuth, 1994). Furthermore, in contrast to standard convergence 
proofs (e.g. Luenberger, 1984), we bound the loss on the entire sequence of examples 
instead of studying the convergence behavior of the algorithm when it is arbitrarily 
close to the best weight vector. 
Comparing these loss bounds we see that the bound for the EG : algorithm grows 
with the maximum component of the input vectors and the one-norm of the best 
weight vector from the comparison class. On the other hand, thc loss bound for the 
GD algorithm grows with the two-norm (Euclidean length) of both vectors. Thus 
when the best weight vector is sparse, having few significant components, and the 
input vectors are dense, with several similarly-sized components, the bound for the 
EG : algorithm is better than the bound for the GD algorithm. More formally, 
consider the noise-free situation where Los%(u, $) = 0 for some u. Assume xt 
{-1, 1 }v and u 6 {-1, 0, 1 }v with only k nonzero components in u. We can 
then take X2 = v/-, X = 1, Ilull2 = v, and U = k. The loss bounds 
become (16/3)k2Zln(2N) for EG + and 2kZN for GD, so for N >> k the EG + 
algorithm clearly wins this comparison. On the other hand, the GD algorithm has 
the advantage over the EG algorithm when each input vector is sparse and the best 
weight vector is dense, having its weight distributed evenly over its components. For 
example, if the inputs xt are the rows of an N x N unit matrix and u 
then X2 = X = 1, Ilull - and U = N. Thus the upper bounds become 
(16/3)N2Zln(2N) for EG + and 2NZ for GD, so here GD wins the comparison. 
Of course, a comparison of the upper bounds is meaningless unless the bounds are 
known to be reasonably tight. Our experiments with artificial random data suggest 
that the upper bounds are not tight. However, the experimental evidence also 
indicates that EG + is much better than GD when the best weight vector is sparse. 
Thus the upper bounds do predict the relative behaviors of the algorithms. 
The bounds we give in this paper are very similar to the bounds Kivinen and 
Warmuth (1994) obtained for the comparison class of linear functions and the square 
loss. They observed how the relative performances of the GD and EG + algorithms 
relate to the norms of the input vectors and the best weight vector in the linear 
case. 
Our methods are direct generalizations of those applied for the linear case (Kivinen 
and Warmuth, 1994). The key notion here is a distance function d for measuring 
the distance d(u, w) between two weight vectors u and w. Our main distance 
measures are the Squared Euclidean distance llu-wll and the Relative Entropy 
distance (or Kullback-Leibler divergence) y']= ui ln(ui/wi). The analysis exploits 
an invariant over t and u of the form 
aL(yt,wt. xt) - bL(yt,u. xt) _< d(u, wt) - d(u,wt+) , 
where a and b are suitably chosen constants. This invariant implies that at each 
trial, if the loss of the algorithm is much larger than that of an arbitrary vector 
u, then the algorithm updates its weight vector so that it gets closer to u. By 
summing the invariant over all trials we can bound the total loss of the algorithms 
in terms of Los%(u, $) and d(u, w). Full details will be contained in a technical 
report (Helmbold et al., 1996). 
4 OPEN PROBLEMS 
Although the presence of local minima in multilayer networks makes it difficult 
to obtain worst case bounds for gradient-based algorithms, it may be possible to 
Worst-case Loss Bounds for Single Neurons 315 
analyze slightly more complicated settings than just a single neuron. One likely 
candidate is to generalize the analysis to logistic regression with more than two 
classes. In this case each class would be represented by one neuron. 
As noted above, the matching loss for the logistic transfer function is the entropic 
loss, so this pair does not create local minima. No bounded transfer function 
matches the square loss in this sense (Auer et al., 1996), and thus it seems im- 
possible to get the same kind of strong loss bounds for a bounded transfer function 
and the square loss as we have for any (increasing and differentiable) transfer func- 
tion and its matching loss function. 
As the bounds for EG + depend only logarithmically on the input dimension, the 
following approach may be feasible. Instead of using a multilayer net, use a single 
(linear or sigmoided) neuron on top of a large set of basis functions. The logarithmic 
growth of the loss bounds in the number of such basis functions mean that large 
numbers of basis functions can be tried. 
Note that the bounds of this paper are only worst-case bounds and our experiments 
on artificial data indicate that the bounds may not be tight when the input values 
and best weights are large. However, we feel that the bounds do indicate the relative 
merits of the algorithms in different situations. Further research needs to be done 
to tighten the bounds. Nevertheless, this paper gives the first worst-case upper 
bounds for neurons with nonlinear transfer functions. 
References 
P. Auer, M. Herbster, and M. K. Warmuth (1996). Exponentially many local min- 
ima for single neurons. In Advances in Neural Information Processing Systems 8. 
N. Cesa-Bianchi, P. Long, and M. K. Warmuth (1996). Worst-case quadratic loss 
bounds for on-line prediction of linear functions by gradient descent. IEEE Trans- 
actions on Neural Networks. To appear. An extended abstract appeared in COLT 
'93, pp. 429-438. 
D. P. Helmbold, J. Kivinen, and M. K. Warmuth (1996). Worst-case loss bounds 
for single neurons. Technical Report UCSC-CRL-96-2, Univ. of Calif. Computer 
Research Lab, Santa Cruz, CA, 1996. In preparation. 
J. Kivinen and M. K. Warmuth (1994). Exponentiated gradient versus gradient 
descent for linear predictors. Technical Report UCSC-CRL-94-16, Univ. of Calif. 
Computer Research Lab, Santa Cruz, CA, 1994. An extended abstract appeared in 
STOC '95, pp. 209-218. 
N. Littlestone (1988). Learning when irrelevant attributes abound: A new linear- 
threshold algorithm. Machine Learning, 2:285-318. 
N. Littlestone (1989). From on-line to batch learning. In Proc. nd Annual Work- 
shop on Computational Learning Theory, pages 269-284. Morgan Kaufmann, San 
Mateo, CA. 
D. G. Luenberger (1984). Linear and Nonlinear Programming. Addison-Wesley, 
Reading, MA. 
S. A. Solla, E. Levin, and M. Fleisher (1988). Accelerated learning in layered neural 
networks. Complex Systems, 2:625-639. 
