On-line Learning of Dichotomies 
N. Barkai 
Racah Institute of Physics 
The Hebrew University 
Jerusalem, Israel 91904 
naamaf iz .huj i. ac. il 
H. S. Seung 
AT&T Bell Laboratories 
Murray Hill, NJ 07974 
seungphysics.att.com 
H. Sompolinsky 
Racah Institute of Physics 
The Hebrew University 
Jerusalem, Israel 91904 
and AT&T Bell Laboratories 
haimf iz. huj i. ac. il 
Abstract 
The performance of on-line algorithms for learning dichotomies is studied. In on-line learning,
the number of examples P is equivalent to the learning time, since each example is
presented only once. The learning curve, or generalization error as a function of P, depends
on the schedule at which the learning rate is lowered. For a target that is a perceptron rule,
the learning curve of the perceptron algorithm can decrease as fast as P^{-1}, if the schedule
is optimized. If the target is not realizable by a perceptron, the perceptron algorithm
does not generally converge to the solution with lowest generalization error. For the case
of unrealizability due to a simple output noise, we propose a new on-line algorithm for a
perceptron yielding a learning curve that can approach the optimal generalization error as
fast as P^{-1/2}. We then generalize the perceptron algorithm to any class of thresholded
smooth functions learning a target from that class. For "well-behaved" input distributions,
if this algorithm converges to the optimal solution, its learning curve can decrease as fast as P^{-1}.
1 Introduction 
Much work on the theory of learning from examples has focused on batch learning, in which the learner is 
given all examples simultaneously, or is allowed to cycle through them repeatedly. In many situations, it is 
more natural to consider on-line learning paradigms, in which at each time step a new example is chosen. 
The examples are never recycled, and the learner is not allowed to simply store them (see e.g, Heskes, 
1991; Hansen, 1993; Radons, 1993). Stochastic approximation theory (Kushner, 1978) provides a framework 
for understanding the local convergence properties of on-line learning of smooth functions. This paper
addresses the problem of on-line learning of dichotomies, for which no similarly complete theory yet exists. 
304 N. Barkai, H. S. Seung, H. Sompolinsky 
We begin with on-line learning of perceptron rules. Since its introduction in the early 60's, the perceptron
algorithm has been used as a simple model of learning a binary classification rule. The algorithm has been
proven to converge in finite time and to yield a half plane separating any set of linearly separable examples.
The perceptron algorithm, however, is not efficient in the sense of distribution-free PAC learning (Valiant,
1984), for one can construct input distributions that require an arbitrarily long convergence time. In a recent
paper (Baum, 1990), Baum proved that the perceptron algorithm, applied in an on-line mode, converges as
P^{-1/3} when learning a half space under a uniform input distribution, where P is the number of presented
examples drawn at random. For on-line learning P is also the number of time steps. Baum also generalized
his result to any "non-malicious" distribution. Kabashima has found the same power law for learning a
two-layer parity machine with non-overlapping inputs, using an on-line least action algorithm (Kabashima,
1994).
If efficiency is measured only by the number of examples used (disregarding time), these particular on-line
algorithms are much worse than batch algorithms. Any batch algorithm which is able to correctly classify a
given set of P examples will converge as P^{-1} (Vapnik, 1982; Amari, 1992; Seung, 1992). In this paper, we
construct on-line algorithms that can actually achieve the same power law as batch algorithms, demonstrating
that the results of Baum and Kabashima do not reflect a fundamental limitation of on-line learning.
In Section 3, we study on-line algorithms for perceptron learning of a target rule that is not realizable by 
a perceptron. Here it is nontrivial to construct an algorithm that even converges to the optimal one, let 
alone to optimize the rate of convergence. For the special case of a target rule that is a perceptron corrupted 
by output noise this can be done. In Section 4, our results are generalized to dichotomies generated by 
thresholding smooth functions. In Section 5 we summarize the results. 
2 On-line learning of a perceptron rule
We consider a half space rule generated by a normalized teacher perceptron W_0 ∈ R^N, W_0 · W_0 = 1, such
that any vector S ∈ R^N is given a label σ_0(S) = sgn(W_0 · S). We study the case of a Gaussian input
distribution centered at zero with a unit variance in each direction in space:

P(S) = ∏_{i=1}^{N} (1/√(2π)) e^{-S_i^2/2}   (1)
Averages over this input distribution will be written with angle brackets ⟨·⟩. A student perceptron W is
trained by an on-line perceptron algorithm. At each time step, an input S ∈ R^N is drawn at random
according to the distribution Eq. (1), and the student's output σ(S) = sgn(W · S) is calculated. The student is
then updated according to the perceptron rule:

W' = W + (η/N) ε(S; W) σ_0(S) S   (2)

and is then normalized so that W · W = 1 at all times. The factor ε(S; W) denotes the error of the student
perceptron on the input S: ε = 1 if σ(S)σ_0(S) = -1, and 0 otherwise. The learning rate η is the magnitude
of the change of the weights at each time step. It is scaled by 1/N to ensure that the change in the overlap
R = W · W_0 is of order 1/N. Thus, a change of O(1) occurs only after presentation of P = O(N) examples.
The performance of the student is measured by the generalization error, defined as the probability of dis-
agreement between the student and the teacher on an arbitrary input: ε_g = ⟨ε(S; W)⟩. In the present case,
ε_g is

ε_g = cos^{-1}(R) / π   (3)
Although for simplicity we analyze below the performance of the perceptron rule (2) only for large N,
our results apply to finite N as well. Multiplying Eq. (2) by W_0, after incorporation of the normalization
operation, and averaging with respect to the input distribution (1) yields the following differential equation
for R(α), where α = P/N:

dR/dα = η (1 - R^2)/√(2π) - η^2 R cos^{-1}(R)/(2π)   (4)
Here terms that are of higher order in 1/N have been neglected.
The evolution of the overlap R, and thus of the generalization error, depends on the schedule at which the 
learning rate η decreases. We consider two cases, a constant η and a time-dependent η.
Constant learning rate: When η is held fixed, Eq. (4) has a stable fixed point at R < 1, and hence ε_g
converges to an η-dependent nonzero value ε_∞(η). For η ≪ 1, 1 - R_∞(η) ∝ η^2, and ε_g ∝ √(1 - R) is therefore
proportional to η:

ε_∞(η) = η / (π √(2π))   (5)

The convergence to this value is exponential in α: ε_g(α) - ε_∞(η) ∼ exp(-ηα/√(2π)).
Time-dependent learning rate: Convergence to ε_g = 0 can be achieved if η decreases slowly enough with
α. We study the limiting behaviour of the system for η which decreases with time as η = η_0 √(2π) α^{-z}.

z > 1: In this case the rate is reduced too fast, before a sufficient number of examples have been seen. This
results in an R which does not converge to 1 but instead to a smaller value that depends on its initial value.

z < 1: The system follows the change in η adiabatically. Hence, to first order in 1/α, ε_g(α) = ε_∞(η(α)).
Thus, ε_g converges to zero with an asymptotic rate ε_g(α) ∼ α^{-z}.

z = 1: The behaviour of the system depends on the prefactor η_0:

ε_g(α) ∼  η_0^2/[π(η_0 - 1)] · α^{-1},   η_0 > 1
          (ln α)/α,                      η_0 = 1        (6)
          A α^{-η_0},                    η_0 < 1

where A depends on the initial condition. Thus the optimal asymptotic change of η is 2√(2π)/α, in which case
the error will behave asymptotically as ε_g(α) ∼ 4/(πα) ≈ 1.27/α. This is not far from the batch asymptotics
(Seung, 1992), ε_g(α) ∼ 0.628/α. We have confirmed these results by numerical simulation of the algorithm Eq. (2).
Figure 1 presents the results of the optimal learning schedule, i.e., η = 2√(2π)/α. The numerical results are
in excellent agreement with the prediction ε_g(α) = 1.27/α for the asymptotic behavior. Finally, we note
that our analysis of the time-dependent case is similar to that of Kabashima and Shinomoto for a different
on-line learning problem (Kabashima, 1993).
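For readers who wish to reproduce Figure 1 qualitatively, a minimal Python simulation of the normalized rule (2) with the schedule η = 2√(2π)/α is sketched below. The cap on η at small α is our own implementation choice (the schedule diverges as α → 0), and the particular N and run length are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
steps = 100 * N  # P examples; alpha = P / N

w0 = rng.standard_normal(N); w0 /= np.linalg.norm(w0)  # teacher
w = rng.standard_normal(N);  w /= np.linalg.norm(w)    # student

for t in range(1, steps + 1):
    alpha = t / N
    eta = min(1.0, 2 * np.sqrt(2 * np.pi) / alpha)  # optimal schedule, capped early on
    S = rng.standard_normal(N)
    if np.sign(S @ w) != np.sign(S @ w0):           # error indicator eps(S; W)
        w = w + (eta / N) * np.sign(S @ w0) * S     # perceptron step, Eq. (2)
        w /= np.linalg.norm(w)                      # keep W normalized

R = w @ w0
eg = np.arccos(R) / np.pi
print(R, eg)  # R close to 1, eg small
```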
3 On-line learning of a perceptron with output noise 
In the case discussed above, the task can be fully realized by a perceptron, i.e., there is a perceptron W
such that ε_g = 0. In more realistic situations a perceptron will only provide an approximation of the target
function, so that the minimal value of ε_g is greater than zero. These cases are called unrealizable tasks. A
drawback of the above on-line algorithm is that, for a general unrealizable task, it does not converge to
the optimal perceptron, i.e., it does not approach the minimum of ε_g. To illustrate this fact we consider a
perceptron rule corrupted by output noise. The label of an input S is σ_0(S), where σ_0(S) = sgn(W_0 · S)
with probability 1 - p, and -sgn(W_0 · S) with probability p. We assume 0 ≤ p ≤ 1/2. For reasons which
will become clear later, the input distribution is taken as a Gaussian centered at U:

P(S) = ∏_{i=1}^{N} (1/√(2π)) e^{-(S_i - U_i)^2/2}   (7)
In this case ε_g is given by

ε_g = p + (1 - 2p) [ ∫_{-q}^{∞} Dx H((q_0 + Rx)/√(1 - R^2)) + ∫_{-∞}^{-q} Dx H(-(q_0 + Rx)/√(1 - R^2)) ]   (8)

where q_0 = U · W_0 denotes the overlap between the center of the distribution and the teacher perceptron,
and q = U · W is the overlap between the center of the distribution and W. The integrals in Eq. (8) are
with respect to a Gaussian measure Dy = dy exp(-y^2/2)/√(2π), and H(x) = ∫_x^∞ Dy. Note that the optimal
perceptron is the teacher, W = W_0, i.e., R = 1, q = q_0, which yields the minimal error ε_min = p.
First, we consider training with the normalized perceptron rule (2). In this case, we obtain differential
equations for two variables: R and q. Solving these equations we find that in general, W converges to a
vector whose direction lies in the plane of W_0 and U and does not point in the direction of W_0, even
in the limit of η → 0. Here we present the result for the limit of η → 0 and small noise level, i.e., p ≪ 1. In
this case, we obtain for ε_∞(η = 0)

ε_∞(0) = p + p (1 - 2H(q_0)) u √(u^2 - q_0^2) / [1 + (u^2 - q_0^2)] + O(p^2)   (9)
where u = |U| is the magnitude of the center of the input distribution. For p = 0, the only solution is
R = 1 and q = q_0, in agreement with the previous results. For p > 0 the optimal solution is retrieved only
in the following special cases: (i) the input distribution is isotropic, i.e., q_0 = u = 0; (ii) when U is parallel
to W_0, i.e., u = q_0; and (iii) when U is orthogonal to W_0, i.e., q_0 = 0. This holds also for large values of
p. In these special cases, the symmetry of the input distribution relative to the teacher vector guarantees
that the deviations from W = W_0 incurred by the inputs that come with the wrong label cancel each other
on average. According to Eq. (9), for other directions of U, ε_g is above the optimal value. Note that the
additional term in ε_g is of the same order of magnitude (O(p)) as the minimal error.
In the following we suggest a modified on-line algorithm for learning a perceptron rule with output noise.
The student weights are changed according to

W' = W + (η/N) ε(S; W) σ_0(S) (S - T(S))   (10)

followed by a normalization of W. This algorithm differs from the perceptron algorithm in that the change in
W is not proportional to the present input, but to a shifted vector. The shifting vector T(S) is determined
by the requirement that the teacher W_0 will be a fixed point of the algorithm in the limit of η → 0. This is
equivalent to the condition

⟨ε_0(S) σ_0(S) (S - T(S))⟩ = 0   (11)

where ε_0(S) is the error function for S when W = W_0. This condition does not determine T uniquely. A
simple choice is one for which T is independent of S. This leads to

T = ⟨sgn(W_0 · S) S⟩ / ⟨sgn(W_0 · S)⟩   (12)
where we used the fact that for any S, ε_0(S)σ_0(S) equals -sgn(W_0 · S) with probability p, and zero with
probability 1 - p. This uniform shift is possible only when ⟨σ_0⟩ ≠ 0, namely when the average frequencies
of +1 and -1 labels are not equal. If this is not the case, one has to choose nonuniform forms of T(S).
Note that in general T has to be learned, so that Eq. (10) has to be supplemented by appropriate equations
for changing T. In the case of Eq. (12), one can easily learn separately the numerator and denominator by
running averages of σ_0 S and σ_0, respectively. We have studied analytically the above algorithm for the case
of the Gaussian input distribution Eq. (7), in the limit of large N. The shifting vector is given by

T = U + W_0 √(2/π) e^{-q_0^2/2} / (1 - 2H(q_0))   (13)
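As a sanity check on Eqs. (12) and (13), the Python sketch below (our own illustration; the particular choice of W_0, U, and sample size is an assumption of the example) estimates T by the running averages of σ_0 S and σ_0 described above and compares the result with the closed form:

```python
import numpy as np
from math import erfc, exp, pi, sqrt

rng = np.random.default_rng(2)

# A concrete teacher and distribution center (illustrative choice):
w0 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # normalized teacher perceptron
U = np.array([1.0, 2.0, 0.0, 0.0, 0.0])    # center of the Gaussian, Eq. (7)
q0 = U @ w0                                 # overlap q0 = U . W0 = 1

# Empirical T: sample averages of sgn(W0.S) S and sgn(W0.S), Eq. (12).
S = U + rng.standard_normal((400_000, 5))
s0 = np.sign(S @ w0)
T_emp = (s0[:, None] * S).mean(axis=0) / s0.mean()

# Closed form, Eq. (13), with H(x) = 0.5 * erfc(x / sqrt(2)):
H = 0.5 * erfc(q0 / sqrt(2.0))
T_th = U + w0 * sqrt(2.0 / pi) * exp(-q0**2 / 2.0) / (1.0 - 2.0 * H)

print(np.abs(T_emp - T_th).max())  # small: Monte Carlo error only
```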
The differential equations for the overlaps R and q in the neighborhood of the point R = 1 and q = q_0 are

dδR/dα = -η √(2/π) e^{-q_0^2/2} δR + η^2 p
dδq/dα = -η √(2/π) e^{-q_0^2/2} (δq + q_0 δR) + η^2 q_0 p   (14)

where δR = 1 - R and δq = q_0 - q. In the limit η → 0, R = 1 and q = q_0 is indeed a stable fixed point of
the algorithm, so that the student converges to the optimal perceptron W_0, and hence the generalization
error converges to its minimal value ε_min = p. Since, unlike Eq. (4), the coefficient of the η term in Eq.
(14) is constant, δR_∞(η) ∝ η for small fixed η, and not η^2. Thus, in this case, the generalization error
approaches, in the limit α → ∞, the value

ε_∞(η) - p ∝ √(pη) e^{-q_0^2/4}   (15)
For a time-dependent η, the convergence to the optimal weights depends on the choice of η(α), as in the
case of the noiseless perceptron rule. For η = (η_0 e^{q_0^2/2}) α^{-z}, with z ≤ 1, the error converges to
p. For z < 1, to first order in 1/α, ε_g(α) = ε_∞(η(α)), yielding

ε_g(α) - p ∝ α^{-z/2}   (16)

When z = 1, the rate of convergence depends on the value of η_0:

ε_g(α) - p ∝  α^{-1/2},      η_0 > 1
              α^{-η_0/2},    η_0 < 1        (17)

and logarithmic corrections to α^{-1/2} for η_0 = 1. Thus, the optimal rate of convergence is

ε_g(α) - p ∝ α^{-1/2}   (18)

which is achieved for η_0 = 2.
We have tested successfully this algorithm by simulations of learning a perceptron rule with output noise
with several input distributions, including the Gaussian of Eq. (7). Figure 2 presents the generalization
error as a function of α for the Gaussian distribution, with p = 0.2, and we have chosen η_0 = 2. The error
converges to the optimal value 0.2 as α^{-1/2}, in agreement with the theory. For comparison the result of the
usual perceptron algorithm is also presented. This algorithm converges to ε_g ≈ 0.32, clearly larger than the
optimal value.
4 On-line learning of thresholded smooth functions 
Our results for the realizable perceptron can be extended to a more general class of dichotomies, namely
thresholded smooth functions. They are defined as dichotomies of the form

σ(S; W) = sgn(f(S; W))   (19)

where f is a differentiable function of a set of parameters, denoted by W, and S is the input vector. We
consider here the case of a realizable task, where the examples are given with labels σ_0 corresponding to a
target machine W_0 which is in the W space. For this task we propose the following generalization of the
perceptron rule (2):

W' = W + η ε(S; W) σ_0(S) ∇f(S; W)   (20)

where ∇ denotes a gradient with respect to W. Then, as we argue below, the vector W_0 is a stable fixed point in the
limit of η → 0. Furthermore, for constant small η the residual error scales as ε_∞ ∝ η. For η ∼ α^{-z}, z < 1,
ε_g(α) ∼ ε_∞(η(α)) ∼ α^{-z}.
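A minimal one-dimensional instance of the rule (20) can be sketched as follows (our own toy example, not from the paper): take f(w, s) = s - w, so that σ(s; w) = sgn(s - w), the parameter w is a decision threshold, and ∂f/∂w = -1. With an η ∼ n^{-1} schedule, the student threshold converges to the teacher value:

```python
import numpy as np

rng = np.random.default_rng(3)
w0, w = 0.3, -1.0    # teacher and student thresholds (illustrative values)
eta0 = 2.0

for n in range(1, 50_001):
    eta = eta0 / n                   # schedule eta ~ n^{-1}, i.e. z = 1
    s = rng.uniform(-2.0, 2.0)       # smooth, nonvanishing input density
    sigma0 = np.sign(s - w0)         # teacher label, Eq. (19)
    if np.sign(s - w) != sigma0:     # error indicator eps(s; w)
        w += eta * sigma0 * (-1.0)   # Eq. (20) with grad_w f(w, s) = -1

print(abs(w - w0))  # close to 0
```

Note that the update moves w toward w_0 only on misclassified examples, exactly as in the perceptron case.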
To show this, let us consider for simplicity the one-dimensional case, w' = w + η g(w, s), where

g(w, s) = Θ(-f(w, s) f(w_0, s)) sgn(f(w_0, s)) ∂f(w, s)/∂w   (21)

This equation can be converted into a Markov equation for the probability distribution P(w, n) (van Kampen,
1981):

P(w, n + 1) = ∫ dw' W(w|w') P(w', n)   (22)

where W(w|w') = ⟨δ(w - w' - η g(w', s))⟩ is the transition rate from w' to w. In the limit of small fixed η,
the equilibrium distribution P_∞ can be shown to have the following scaling form:

P_∞(w; η) = η^{-1} F(δw/η)   (23)
where δw = w - w_0 and F(x) obeys the following difference equation:

L̂F(x) ≡ Σ_{a=±1} Θ((x + a f̄) a f̄) |x + a f̄| F(x + a f̄) - |x| F(x) = 0   (24)

where f̄ is the value of the gradient ∂f(w_0, s)/∂w at the decision boundary of f(w_0, s), namely at the point
s obeying f(w_0, s) = 0. Note that since we are interested in normalizable solutions of Eq. (24), F(x) has to
vanish for all |x| > |f̄|. This result is valid provided the input distribution is smooth and nonvanishing
near the decision boundary. Furthermore, ∂f/∂w at w_0 must not vanish on the decision boundary. Under
the same conditions, it can be shown that the error is homogeneous in δw with degree 1; hence it should
scale linearly with η, i.e., ε ∝ η. It should be noted that, unlike other on-line learning problems (Heskes,
1991; Hansen, 1993; Radons, 1993), the equilibrium distribution in our case is not Gaussian.
For a time-dependent η of the form η = η_0 n^{-z}, z < 1, P(w, n) at long times is of the form

P(w, n) = η^{-1}(n) [ F(δw/η(n)) + n^{-(1-z)} G(δw/η(n)) ]   (25)

where F is the stationary distribution given by Eq. (24), and the coefficient of the correction, G, solves the
inhomogeneous equation

z ( x dF/dx + F ) = η_0 L̂G(x)   (26)

where the linear operator L̂ is defined in Eq. (24). Thus, to leading order in inverse time, the system
follows adiabatically the finite-η stationary distribution, yielding ε_g(n) which vanishes asymptotically as
ε_g(n) ∝ η(n) ∼ n^{-z}. The optimal schedule is obtained for z = 1. In this case, P(w, n) = η^{-1}(n) F̃(δw/η(n)),
where F̃(x) solves the homogeneous equation

x dF̃/dx + F̃ = η_0 L̂F̃(x)   (27)

For sufficiently large η_0, this equation has a solution, implying that ε_g ∝ n^{-1}.
Similarly, the results of Section 3 can also be extended to the case of thresholded smooth functions with
a probability p of an error due to isotropic output noise. In this case, the optimal choice is again η ∝ n^{-1},
yielding ε_g - p ∝ 1/√n. It should be noted that for this case, the probability distribution for small η does reduce
to a Gaussian distribution in δw/√η. Using a multidimensional Markov equation, it is straightforward to
extend these results to higher dimensions. The small η limit yields equations similar to Eqs. (24-26) that
involve integration over the decision boundary of f(W, S).
5 Summary and Discussion 
We have found that the perceptron rule (2) with normalization can lead to a variety of learning curves,
depending on the schedule at which the learning rate is decreased. The optimal schedule leads to an inverse
power law learning curve, ε_g ∼ α^{-1}. Baum's results (Baum, 1990) for a non-normalized perceptron with a
constant learning rate can be viewed as a special case of the above analysis. In the non-normalized perceptron
algorithm, the magnitude of the student's weights grows with α as |W| ∼ α^{1/3}. The time evolution of the
overlap R, and thus of the generalization error, is governed by the effective learning rate η_eff = η/|W|, leading
via Eq. (6) to the result ε_g ∼ α^{-1/3}. Similar results apply to the two-layer parity machine studied in
(Kabashima, 1994).
Our analysis, leading to the equations of motion (4) and (14), was based on the limit of large N and P, such
that α = P/N remains finite. We would like to stress, however, that this limit is only necessary in deriving
the full form of the learning curve, i.e., R(α) for all α. On the other hand, our results for the large P
asymptote of the learning curve for small η are valid for finite N as well, as implied by the general treatment
of the previous section.
Unrealizable perceptron rules present a more complicated problem. We have presented here a modified 
perceptron algorithm that converges to the optimal solution in the special case of an isotropic output noise.
In this case, the convergence to the optimal error is as α^{-1/2}. This is the same power law as obtained in
the standard sample complexity upper bounds (Vapnik, 1982) and in the approximate replica symmetric 
calculations (Seung, 1992) for batch learning of unrealizable rules. It should be stressed however, that the 
success of the modified algorithm in the case of an output noise depends on the fact that the errors made 
by the optimal solution are uncorrelated with the input. Thus, finding an on-line algorithm that can cope 
with other types of unrealizability remains an important problem. 
The learning algorithms for the perceptron rule, without and with output noise, can be generalized to learning
thresholded smooth functions, assuming certain reasonable properties of the input distribution are present,
as shown in Section 4. The dependence of the learning curve on the learning rate schedule remains roughly
the same as in the perceptron case. This implies that on-line learning of realizable dichotomies, with possible
output noise, can achieve the same power laws in the number of examples that are typical of batch learning
of such rules. Furthermore, the on-line formulation possesses the theoretical virtue of addressing time as
well as sample complexity, so that the same power laws imply a polynomial relationship between the time
and the achieved error level. The above conclusions assume that the equilibrium state at small learning
rates is unique, which in general is not the case. The issue of overcoming local minima in on-line learning
is a difficult problem (Heskes, 1992). Finally, the theoretical results for on-line learning have the important
advantage of not requiring the use of the often problematic replica formalism.
Acknowledgements 
We are grateful for helpful discussions with Y. Freund, M. Kearns, R. Schapire, and E. Shamir, and thank Y. 
Kabashima for bringing his paper to our attention. HS is partially supported by the Fund for Basic Research 
of the Israeli Academy of Arts and Sciences. 
References 
S. Amari, N. Fujita, and S. Shinomoto. Four types of learning curves. Neural Comput., 4:605-618, 1992.
E. B. Baum. The perceptron algorithm is fast for nonmalicious distributions. Neural Comput., 2:248-260,
1990. 
H. J. Kushner and D. S. Clark. Stochastic approximation methods for constrained and unconstrained systems. 
Springer, Berlin, 1978. 
L. K. Hansen, R. Pathria, and P. Salamon. Stochastic dynamics of supervised learning. J. Phys., A26:63-71, 
1993. 
T. Heskes and B. Kappen. Learning processes in neural networks. Phys. Rev., A44:2718-2762, 1991. 
T. Heskes, E. T. Po Slijpen, and B. Kappen. Learning in neural networks with local minima. Phys. Rev., 
A46:5221-5231, 1992. 
Y. Kabashima. Perfect loss of generalization due to noise in k = 2 parity machines. J. Phys., A27:1917-1927,
1994. 
Y. Kabashima and S. Shinomoto. Incremental learning with and without queries in binary choice problems. 
In Proc. of IJCNN, 1993. 
G. Radons. On stochastic dynamics of supervised learning. J. Phys., A26:3455-3461, 1993. 
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Phys. Rev., 
A45:6056-6091, 1992. 
L. G. Valiant. A theory of the learnable. Commun. ACM, 27:1134-1142, 1984. 
N. G. van Kampen. Stochastic Processes in Physics and Chemistry. North-Holland, Amsterdam, 1981.
V. N. Vapnik. Estimation of Dependences based on Empirical Data. Springer-Verlag, New York, 1982. 
Figure 1: Asymptotic performance of a realizable perceptron: ε_g plotted against 1/α. Simulation results for
η_0 = 2 and N = 50 (solid curve) are compared with the theoretical prediction ε_g = 1.27/α (dashed curve).
Figure 2: Simulation results for on-line learning of a perceptron with output noise: ε_g - p plotted against
1/√α. Here η_0 = 2, p = 0.2, N = 250, u = 4, and q_0 = -1.95. The regular perceptron learning (dashed
curve) is compared with the modified algorithm (solid curve). The dashed line shows the theoretical
prediction Eq. (18).
