Relative Loss Bounds for Multidimensional 
Regression Problems 
Jyrki Kivinen 
Department of Computer Science 
P.O. Box 26 (Teollisuuskatu 23) 
FIN-00014 University of Helsinki, Finland 
Manfred K. Warmuth 
Department of Computer Science 
University of California, Santa Cruz 
Santa Cruz, CA 95064, USA 
Abstract 
We study on-line generalized linear regression with multidimensional 
outputs, i.e., neural networks with multiple output nodes but no hidden 
nodes. We allow at the final layer transfer functions such as the soft- 
max function that need to consider the linear activations to all the output 
neurons. We use distance functions of a certain kind in two completely 
independent roles in deriving and analyzing on-line learning algorithms 
for such tasks. We use one distance function to define a matching loss 
function for the (possibly multidimensional) transfer function, which al- 
lows us to generalize earlier results from one-dimensional to multidimen- 
sional outputs. We use another distance function as a tool for measuring 
progress made by the on-line updates. This shows how previously stud- 
ied algorithms such as gradient descent and exponentiated gradient fit 
into a common framework. We evaluate the performance of the algo- 
rithms using relative loss bounds that compare the loss of the on-line 
algoritm to the best off-line predictor from the relevant model class, thus 
completely eliminating probabilistic assumptions about the data. 
1 INTRODUCTION 
In a regression problem, we have a sequence of n-dimensional real valued inputs at  1t. n, 
t = 1,... ,, and for each input at a k-dimensional real-valued desired output h 
Our goal is to find a mapping that at least approximately models the dependency between 
at and tt- Here we consider the parametric case t = f(o; at) where the actual outputt 
corresponding to the input at is determined by a parameter vector w  1t. ' (e.g., weights 
in a neural network) through a given fixed model f (e.g., a neural network architecture). 
288 J. Kivinen and M. K. Warmuth 
Thus, we wish to obtain parameters o., such that, in some sense, f(o; zt)  Yt for all 
t. The most basic model f to consider is the linear one, which in the one-dimensional 
case k = 1 means that f(w; at) = o.,  :rt for o., E R n. In the multidimensional case 
we actually have a whole matrix f  R nxn of parameters and f(f; :rt) = f:rt. The 
goodness of the fit is quantitatively measured in terms of a loss function; the square loss 
given by '-t,j (yt,j - .Ot,d)/2 is a popular choice. 
In generalized linear regression [MN89] we fix a transfer function b and apply it on top of a 
linear model. Thus, in the one-dimensional case we would have f (o.,; :rt) = 4(o.,.:rt). Here 
4 is usually a continuous increasing function from R to R, such as the logistic function 
that maps z to 1/(1 q- e-z). It is still possible to use the square loss, but this can lead to 
problems. In particular, when we apply the logistic transfer function and try to find a weight 
vector o., that minimizes the total square loss over  examples (at, Yt), we may have up to 
 local minima [AHW95, Bud93]. Hence, some other choice of loss function might be 
more convenient. In the one-dimensional case it can be shown that any continuous strictly 
increasing transfer function b has a specific matching loss function L such that, among 
other useful properties, '-t L(yt, qb(o  :et)) is always convex in o, so local minima are 
not a problem [AHW95]. For example, the matching loss function for the logistic transfer 
function is the relative entropy (a generalization of the logarithmic loss for continuous- 
valued outcomes). The square loss is the matching loss function for the identity transfer 
function (i.e., linear regression). 
The main theme of the present paper is the application of a particular kind of distance func- 
tions to analyzing learning algorithms in (possibly multidimensional) generalized linear 
regression problems. We consider a particular manner in which a mapping b: R '  R ' 
can be used to define a distance function Ab: R ' x R n -- R; the assumption we must 
make here is that b has a convex potential function. The matching loss function L men- 
tioned above for a transfer function  in the one-dimensional case is given in terms of 
the distance function A as L(qS(a), qb(&)) -' A(&,a). Here, as whenever we use the 
matching loss L (g, .0), we assume that t and .0 are in the range of qb, so we can write 
g = qb(a) and .0 = qb(a) for some a and 6. Notice that for k -- 1, any strictly increasing 
continuous function has a convex potential (i.e., integral) function. In the more interesting 
case k > 1, we can consider transfer functions such as the softmax function, which is com- 
monly used to transfer arbitrary vectors a  1t. n into probability vectors . (i.e., vectors 
such that .0i _ 0 for all i and 'i .0i -' 1). The matching loss function for the softmax func- 
tion defined analogously with the one-dimensional case tums out to be the relative entropy 
(or Kullback-Leibler divergence), which indeed is a commonly used measure of distance 
between probability vectors. For the identity transfer function, the matching loss function 
is the squared Euclidean distance. 
The first result we get from this observation connecting matching losses to a general notion 
of distance is that certain previous results on generalized linear regression with matching 
loss on one-dimensional outputs [HKW95] directly generalize to multidimensional out- 
puts. From a more general point of view, a much more interesting feature of these distance 
functions is how they allow us to view certain previously known learning algorithms, and 
introduce new ones, in a simple unified framework. To briefly explain this framework with- 
out unnecessary complications, we restrict the following discussion to the case k = 1, i.e., 
f(.,; :r) --- b(o  :r) 6 R with ., 6 R n . 
We consider on-line learning algorithms, by which we here mean an algorithm that pro- 
cesses the training examples one by one, the pair (at, yt) being processed at time t. Based 
Relative Loss Bounds for Multidimensional Regression Problems 289 
on the training examples the algorithm produces a whole sequence of weight vectors 
t -- 1,..., . At each time t the old weight vector o.,t is updated into wt+z based on at and 
yt. The best-known such algorithm is on-line gradient descent. To see some alternatives, 
consider first a distance function AX/, defined on 1:1, n by some function X/': l:{,n -- 
(Thus, we assume that X/' has a convex potential.) We represent the update somewhat indi- 
rectly by introducing a new parameter vector 0t E 1 :{,n from which the actual weights 
are obtained by the mapping wt = X/'(0t). The new parameters are updated by 
0t+ = 0. - (Lo(u.. 
(1) 
where r/> 0 is a learning rate. We call this algorithm the general additive algorithm with 
parameterization function b. Notice that here 0 is updated by the gradient with respect 
to o.,, so this is not just a gradient descent with reparameterization [JW98]. However, we 
obtain the usual on-line gradient descent when b is the identity function. When b is 
the softmax function, we get the so-called exponentiated gradient (EG) algorithm [KW97, 
HKW951. 
The connection of the distance function Ab to the update (1) is two-fold. First, (1) can 
be motivated as an approximate solution to a minimization problem in which the distance 
Axl,(Ot, Ot+) is Used as a kind of penalty term to prevent too drastic an update based on 
a single example. Second, the distance function AX/, can be used as a potential function 
in analyzing the performance of the resulting algorithm. The same distance functions have 
been used previously for exactly the same purposes [KW97, HKW95] in important special 
cases (the gradient descent and EG algorithms) but without realizing the full generality of 
the method. 
It should be noted that the choice of the parameterization function xb is left completely 
free, as long as b has a convex potential function. (In contrast, the choice of the transfer 
function qb depends on what kind of a regression problem we wish to solve.) Earlier work 
suggests that the softmax parameterization function (i.e., the EG algorithm) is particularly 
suited for situations in which some sparse weight vector oJ gives a good match to the 
data [HKW95, KW97]. (Because softmax normalizes the weight vector and makes the 
components positive, a simple transformation of the input data is typically added to realize 
positive and negative weights with arbitrary norm.) 
In work parallel to this, the analogue of the general additive update (1) in the context of 
linear classification, i.e., with a threshold transfer function, has recently been developed 
and analyzed by Grove et al. [GLS97] with methods and results very similar to ours. Cesa- 
Bianchi [CB97] has used somewhat different methods to obtain bounds also in cases in 
which the loss function does not match the transfer function. Jagota and Warmuth [JW98] 
view (1) as an Euler discretization of a system of partial differential equations and investi- 
gate the performance of the algorithm as the discretization parameter approaches zero. 
The distance functions we use here have previously been applied in the context of expo- 
nential families by Amari lama85] and others. Here we only need some basic technical 
properties of the distance functions that can easily be derived from the definitions. For a 
discussion of our line of work in a statistical context see Azoury and Warmuth [AW97]. 
In Section 2 we review the definition of a matching loss function and give examples. Sec- 
tion 3 discusses the general additive algorithm in more detail. The actual relative on-line 
loss bounds we have for the general additive algorithm are explained in Section 4. 
290 J. Kivinen and M. K. Warmuth 
2 DISTANCE FUNCTIONS AND MATCHING LOSSES 
Let b: P -- P be a function that has a convex potential function P4 (i.e., b -- KzPb 
for some convex P4: BP - B,). We first define a distance function A(k for 4 by 
(2) 
Thus, the distance Aqb( , a) is the error we make if we approximate Pqb() by its first- 
order Taylor polynomial around a. Convexity of Pb implies that Aqb is convex in its first 
argument. Further, Aqb( , a) is nonnegative, and zero if and only if qb() = qb(a). 
We can alternatively write (2) as A qb( , a) = f (rk(r) - rk(a) ) . dr where the integral is 
a path integral the value of which must be independent of the actual path chosen between 
a and . In the one-dimensional case, the integral is a simple definite integral, and 
has a convex potential (i.e., integral) function if it is strictly increasing and continuous 
[AHW95, HKW95]. 
Let now qb have range Vqb C_ 1tP and distance function Aqb. Assuming that there is a 
function Lqb: Vqb x Vqb - R such that Lk(rk(a), rk()) = Ark(a, a) holds for all  and 
a, we say that Lqb is the matching loss function for 
Example 1 Let b be a linear function given by b(a) = Aa where A E R  x, is symmetri- 
cal and positive definite. Then b has the convex potential function Pqb(a) -- a?Aa/2, and 
(2) gives A 4(a , a) = (a - a)?A(a - a). Hence, Lqb(y , ,) -- (y - O)TA-1 (y -- 
for all y, .  R t'. El 
Example 2 Let o-: R n - R n, ri (a) n 
= exp(ai)/Y'./= exp(a./), be the softmax function. 
k 
It has a potential function given by Pr(a) = In ']4=1 exp(aj). To see that Ptr is convex, 
notice that the Hessian D2Ptr is given by D2Pr(a)ij = Jijtri(a) - tri(a)rj (a). Given 
a vector a E R n, let now X be a random variable that has probability ri(a) of taking 
the value zi. We have aTDo'(a)a '/=! o'i(a)z -=! 1 
= - = 
EX 9' - (EX) e -- VarX _> 0. Straightforward algebra now gives the relative entropy 
-' y'./= y./ln(yi/.0./) as the matching loss function. (To allow y./= 0 or.05 = 0, 
we adopt the standard convention that 0 In 0 = 0 In(0/0) = 0 and y In(y/0) = cx> for 
> 0.) rn 
In the relative loss bound proofs we use the basic property [JW98, Ama85] 
zx(a', = a) + zx(a, + - (a' - a) 
(3) 
This shows that our distances do not satisfy the triangle inequality. Usually they are not 
symmetrical, either. 
3 THE GENERAL ADDITIVE ALGORITHM 
We consider on-line learning algorithms that at time tfirst receive an input at E R n, 
then produce an output .t Rt', and finally receive as feedback the desired output gtt  
BP. To define the general additive algorithm, assume we are given a transfer function 
Relative Loss Bounds for Multidimensional Regression Problems 291 
: R ' -4 1Z ' that has a convex potential function. (We will later use the matching loss as 
a performance measure.) We also require that all the desired outputs Yt are in the range 
of . The algorithm's predictions are now given by .t = ck(f2tzt) where f2t E 1Z xn is 
the algorithm's weight matrix at time t. To see how the weight matrix is updated, assume 
further we have a parameterization function b' 1Z n -- 1Z n with a distance Ab. The 
algorithm maintains kn real-valued parameters. We denote by 6)t the k x n matrix of the 
values of these parameters immediately before trial t. Futher, we denote by Ot,j the jth row 
of 6)t, and by xb(6)t) the matrix with xb(Ot,j) as its jth row. Given initial parameter values 
6)1 and a learning rate r/> 0, we now define the general additive (GA) algorithm as the 
algorithm that repeats at each trial t the following prediction and update steps. 
Prediction: Upon recieving the instance zt, give the prediction ,t = 
Update: Forj = 1,..., k, set Ot+l,.i = Ot,.i - rl(.Ot,.i - yt,.i)zt. 
Note that (2) implies X7A&(, a)) = &() - &(a), so this update indeed turns out to be 
the same as (1) when we recall that Lck(yt, ,t) = Ack(C2tzt, at) where Yt = ok(at). 
The update can be motivated by an optimization problem given in terms of the loss and 
distance. Consider updating an old parameter matrix 6) into a new matrix 6) based on 
a single input a and desired output y. A natural goal would be to minimize the loss 
L& (y, &(xb (6))z)). However, the algorithm must avoid losing too much of the information 
it has gained during the previous trials and stored in the form of thee old parameter matrx 
We thus set as the algorithm's goal to minimize the sum A/, (6}, 6}) + r/Lb (y , 
where r/ > 0 is a parameter regulating how fast the algorithm is willing to move its pa- 
rameters. Under certain regularity assumptions, the update rule of the GA algorithm can 
be shown to approximately solve this minimization problem. For more discussion and ex- 
amples in the special case of linear regression, see [KW97]. An interesting related idea is 
using all the previous examples in the update instead of just the last one. For work along 
these lines in the linear case see Vovk [Vov97] and Foster [Fos91]. 
4 RELATIVE LOSS BOUNDS 
Consider a sequence $ = ((zz,!/z),..., (zl,!/l)) of training examples, and let 
Loss& (GA, $) = 'tt=  Lck(yt, ,t) be the loss incurred by the general additive algorithm 
on this sequence when it always uses its current weights fit for making the tth prediction 
^ = -'t= Lqb(Yt, b(f2zt)) be the loss of a fixed predictor 
Yt. Similarly, let Lossb(f2 , S) t 
f2. Basically, our goal is to show that if some f2 achieves a small loss, then the algorithm is 
not doing much worse, regardless of how the sequence $ was generated. Making additional 
probabilistic assumptions allows such on-line loss bounds to be converted into more tradi- 
tional results about generalization errors [KW97]. To give the bounds for Loss& (GA, S) 
in terms of Lossb (f2 , $) we need some additional parameters. The first one is the distance 
Ab(O1 , 6}) where f2 = /,(O) and O1 is the initial parameter matrix of the GA algorithm 
(which can be arbitrary). The second one is defined by 
sup { I O e e x } 
where A' = { az,..., at } is the set of input vectors and DX/,(0) is the Jacobian with 
(Dxb(O))ij = Obi(O)/OOj. The value bx,xb can be interpreted as the maximum norm of 
292 J. Kivinen and M. K. Warmuth 
any input vector in a norm defined by the parameterization function X/'. In Example 3 below 
we show how bx,xl , can easily be evaluated when X/' is a linear function or the softmax 
function. The third parameter cb, defined as 
% = / - 
I 
Y,Y  
relates the matching loss function for the transfer function b to the square loss. In Ex- 
ample 4 we evaluate this constant for linear functions, the softmax function, and the one- 
dimensional case. 
Example3 Consider bounding the value aTDtr(0)a where tr is the softmax func- 
tion. As we saw in Example 2, this value is a variance of a random variable with the 
range { xl,...,xn }. Hence, we have bx,r _< max.x(maxixi - minizi)9'/4 <_ 
max.x IIllL where lllloo = max/I,l. 
Ifs/, is a linear function with X/,(0) = AO for a symmetrical positive definite A, we clearly 
have bx,xb <_ ,max maxaex ae 9' where ,max is the largest eigenvalue of A. 
Example 4 For the softmax function o' the matching loss function Lo. is the relative en- 
tropy (see Example 2), for which it is well known that Ltr (y, ,) >_ 2(y - ,)9.. Hence, we 
have cb _< 1/4. 
If b is a linear function given by a symmetrical positive semidefinite matrix A, we see from 
Example 1 that cb is the largest eigenvalue of A. 
Finally, in the special case k = 1, with qS: R -4 R differentiable and strictly increasing, we 
can show cqb _< Z ifZ is a bound such that 0 < qS'(z) _< Z holds for all z. rn 
Assume now we are given constants b _> bx,b and c _> cck. Our first loss bound states that 
for any parameter matrix 6) we have 
Loss4,(GA, S) < 2Loss4,(g,(e), S) + 4bcAxb(ex , 6}) 
when the learning rate is chosen as r/= 1/(2bc). (Proofs are omitted from this extended 
abstract.) The advantage of this bound is that with a fixed learning rate it holds for any 
13, so we need no advance knowledge about a good 13. The drawback is the factor 2 in 
front of Lossqb (b (O), S), which suggests that asymptotically the algorithm might not ever 
achieve the performance of the best fixed predictor. A tighter bound can be achieved by 
more careful tuning. Thus, given constants K > 0 and R > 0, if we choose the learning 
rate as r/= (.v/(bcR) 9. + KbcR- bcR)/(Kbc) (with, = 1/(2bc) if K = 0) we obtain 
Lossqb(GA, S) _ Lossb(Xb(e), S) + 2x/'KbcR + 4bcR 
for any 13 that satisfies Lossqb(b(O), S) _< K and Ab(O1,13) _< R. This shows that if 
we restrict our comparison to parameter matrices within a given distance R of the initial 
matrix of the algorithm, and we have a reasonably good guess K as to the loss of the best 
fixed predictor within this distance, this knowledge allows the algorithm to asymptotically 
match the performance of this best fixed predictor. 
Relative Loss Bounds for Multidimensional Regression Problems 293 
Acknowledgments 
The authors thank Katy Azoury, Chris Bishop, Nico16 Cesa-Bianchi, David Helmbold, and 
Nick Littlestone for helpful discussions. Jyrki Kivinen was supported by the Academy of 
Finland and the ESPRIT project NeuroCOLT. Manfred Warmuth was supported by the NSF 
grant CCR 9700201. 
References 
lama85] 
[AHW95] 
[AW97] 
[Bud93] 
[CB97] 
[Fos91] 
[GLS97] 
[HKW95] 
[JW98] 
[KW97] 
[MN89] 
[Vov97] 
S. Amari. Differential Geometrical Methods in Statistics. Springer Verlag, 
Berlin, 1985. 
P. Auer, M. Herbster, and M. K. Warmuth. Exponentially many local minima 
for single neurons. In Proc. 1995 Neural Information Processing Conference, 
pages 316-317. MIT Press, Cambridge, MA, November 1995. 
K. Azoury and M. K. Warmuth. Relative loss bounds and the exponential family 
of distributions. Unpublished manuscript, 1997. 
M. Budinich. Some notes on perceptton learning. J. Phys. A.: Math. Gen., 
26:4237-4247, 1993. 
N. Cesa-Bianchi. Analysis of two gradient-based algorithms for on-line regres- 
sion. In Proc. iOth Annu. Conf. on Cornput. Learning Theory, pages 163-170. 
ACM, 1997. 
D. P. Foster. Prediction in the worst case. The Annals of Statistics, 19(2): 1084- 
1090, 1991. 
A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results 
for linear discriminant updates. In Proc. i Oth Annu. Conf. on Cornput. Learning 
Theory, pages 171-183. ACM, 1997. 
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for 
sigmoided linear neurons. In Proc. Neural Information Processing Systems 
1995, pages 309-315. MIT Press, Cambridge, MA, November 1995. 
A. K. Jagota and M. K. Warmuth. Continuous versus discrete-time nonlinear 
gradient descent: Relative loss bounds and convergence. Presented at Fifth Sym- 
posium on Artificial Intelligence and Mathematics, Ft. Lauderdale, FL, 1998. 
J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient up- 
dates for linear prediction. Information and Computation, 132(1): 1-64, January 
1997. 
P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall, 
New York, 1989. 
V. Vovk. Competitive on-line linear regression. In Proc. Neural Information 
Processing Systems 1997. MIT Press, Cambridge, MA, 1998. 
