Learning Time-varying Concepts 
Anthony Kuh 
Dept. of Electrical Eng. 
U. of Hawaii at Manoa 
Honolulu, HI 96822 
kuh@wiliki.eng.hawaii.edu 
Thomas Petsche 
Siemens Corp. Research 
755 College Road East 
Princeton, NJ 08540 
petsche learning. siemens.com 
Ronald L. Rivest 
Lab. for Computer Sci. 
MIT 
Cambridge, MA 02139 
rivest@theory.lcs.mit.edu 
Abstract 
This work extends computational learning theory to situations in which concepts 
vary over time, e.g., system identification of a time-varying plant. We have 
extended formal definitions of concepts and learning to provide a framework 
in which an algorithm can track a concept as it evolves over time. Given 
this framework and focusing on memory-based algorithms, we have derived 
some PAC-style sample complexity results that determine, for example, when 
tracking is feasible. We have also used a similar framework and focused on 
incremental tracking algorithms for which we have derived some bounds on 
the mistake or error rates for some specific concept classes. 
1 INTRODUCTION 
The goal of our ongoing research is to extend computational learning theory to include 
concepts that can change or evolve over time. For example, face recognition is complicat- 
ed by the fact that a persons face changes slowly with age and more quickly with changes 
in make up, hairstyle, or facial hair. Speech recognition is complicated by the fact that 
a speakers voice may change over time due to fatigue, illness, stress, or background 
noise (Galletti and Abbott, 1989). 
Time varying systems often appear in adaptive control or signal processing applications. 
For example, adaptive equalizers adjust the receiver and transmitter to compensate for 
changes in the noise on a transmission channel (Lucky et al., 1968). The kinematics of 
a robot arm can change when it picks up a heavy load or when the motors and drive 
train responses change due to wear. The output of a sensor may drift over time as the 
components age or as the temperature changes. 
183 
184 Kuh, Petsche, and Rivest 
Computational learning theory as introduced by Valiant (1984) can make some useful 
statements about whether a given class of concepts can be learned and provide proba- 
bilistic bounds on the number of examples needed to learn a concept. Haussler, et al. 
(1987), and Littlestone (1989) have also shown that it is possible to bound the number of 
mistakes that a learner will make. However, while these analyses allow the concept to be 
chosen arbitrarily, that concept must remain fixed for all time. Littlestone and Warmuth 
(1989) considered concepts that may drift, but in the context of a different accuracy 
measure than we use. Our research seeks explore further modifications to existing theory 
to allow the analysis of performance when learning time-varying concept. 
In the following, we describe two approaches we are exploring. Section 3 describes 
an extension of the PAC-model to include time-varying concepts and shows how this 
new model applies to algorithms that base their hypotheses on a set of stored examples. 
Section 4 described how we can bound the mistake rate of an algorithm that updates its 
estimate based on the most recent example. In Section 2 we define some notation and 
terminology that is used in the remainder of the based. 
2 NOTATION & TERMINOLOGY 
For a dichotomy that labels each instance as a positive or negative example of a concept, 
we can formally describe the model as follows. Each instance xi is drawn randomly, 
according to an arbitrary fixed probability distribution, from an instance space X. The 
concept c to be learned is drawn randomly, according to an arbitrary fixed probability 
distribution, from a concept class C. Associated with each instance is a label ai = c(xi) 
such that ai - 1 if xi is a positive example and ai = 0 otherwise. The learner is presented 
with a sequence of examples (each example is a pair {xi, ai)) chosen randomly from X. 
The learner must form an estimate, , of c based on these examples. 
In the time-varying case, we assume that there is an adversary who can change c over 
time, so we change notation slightly. The instance xt is presented at time t. The concept 
ct is active at time t if the adversary is using ct to label instances at that time. The 
sequence of t active concepts, ct -- {cl,..., ct} is called a concept sequence of length t. 
The algorithm's task is to form an estimate -t of the actual concept sequence ct, i.e., at 
each time t, the tracker must use the sequence of randomly chosen examples to form an 
estimate t of ct. A set of length t concept sequences is denoted by C(t) and we call a 
set of infinite length concept sequences a concept sequence space and denote it by C. 
Since the adversary, if allowed to make arbitrary changes, can easily make the tracker's 
task impossible, it is usually restricted such that only small or infrequent changes are 
allowed. In other words, each C(t) is a small subset of C t. 
We consider two different types of different types of "tracking" (learning) algorithms, 
memory-based (or batch) and incremental (or on-line). We analyze the sample complexity 
of batch algorithms and the mistake (or error) rate of incremental algorithms. 
In t,he usual case where concepts are time-invariant, batch learning algorithms operate 
in two distinct phases. During the first phase, the algorithm collects a set of training 
examples. Given this set, it then computes a hypothesis. In the second phase, this 
hypothesis is used to classify all future instances. The hypothesis is never again updated. 
In Section 3 we consider memory-based algoritms derived from batch algorithms. 
Learning Time-varying Concepts 185 
When concepts are time-invariant, an on-line learning algorithm is one which constantly 
modifies its hypothesis. On each iteration, the leamer (1) receives an instance; (2) predicts 
a label based on the current hypothesis; (3) receives the correct label; and (4) uses the 
correct label to update the hypothesis. In Section 4, we consider incremental algorithms 
based on on-line algorithms. 
When studying learnability, it is helpful to define the Vapnik-Chervonenkis (VC) dimen- 
sion (Vapnik and Chervonenkis, 1971) of a concept class: VCdim(C) is the cardinality 
of the largest set such that every possible labeling scheme is achieved by some concept 
in C. Blumer et al. (1989) showed that a concept class is learnable if and only if the 
VC-dimension is finite and derived an upper bound (that depends on the VC dimension) 
for the number of examples need to PAC-leam a learnable concept class. 
3 MEMORY-BASED TRACKING 
In this section, we will consider memory-based trackers which base their current hypoth- 
esis on a stored set of examples. We build on the definition of PAC-leaming to define 
what it means to PAC-track a concept sequence. Our main result here is a lower bound 
on the maximum rate of change that can be PAC-tracked by a memory-based learner. 
A memory-based tracker consists of (a) a function w(e, 6); and (b) an algorithm  that 
produces the current hypothesis, t using the most recent w (e, 6) examples. The memory- 
based tracker thus maintains a sliding window on the examples that includes the most 
recent w (e, 6) examples. We do not require that  run in polynomial time. 
Following the work of Valiant (1984) we say that an algorithm .,4 PAC-tracks a concept 
sequence space  C_  if, for any c  , any distribution D on X, any e, 6 > 0, and 
access to examples randomly selected from X according to D and labeled at time t by 
concept ct; for all t sufficiently large, with t  chosen uniformly at random between 1 and 
t, it is true that 
Pr(d(ct,,t,) _< 6) >_ 1-6. 
The probability includes any randomization algorithm .4 may use as well as the random 
selection of t  and the random selection of examples according to the distribution D, 
and where d(c,c') = D(x  c(x)  c'(x)) is the probability that c and c' disagree on a 
randomly chosen example. 
Learnability results often focus on learners that see only positive examples. For many 
concept classes this is sufficient, but for others negative examples are also necessary. 
Natarajan (1987) showed that a concept class that is PAC-leamable can be learned using 
only positive examples if the class is closed under intersection. 
With this in mind, let's focus on a memory-based tracker that modifies its estimate 
using only positive examples. Since PAC-tracking requires that .4 be able to PAC-leam 
individual concepts, it must be true that .4 can PAC-track a sequence of concepts only if 
the concept class is closed under intersection. However, this is not sufficient. 
Observation 1. Assume C is closed under intersection. If positive examples are drawn 
from cl  C prior to time to, and from c2  C, cl C_ c2, after time to, then there exists an 
estimate of c2 that is consistent with all examples drawn from ci. 
The proof of this is straightforward once we realize that if cl _C c2, then all positive 
186 Kuh, Petsche, and Rivest 
examples drawn prior to time to from Cl are consistent with c2. The problem is therefore 
equivalent to tint choosing a set of examples from a subset of c2 and then choosing more 
examples from all of c2 -- it skews that probability distribution, but any estimate of c2 
will include all examples drawn from Cl. 
Consider the set of closed intervals on [0, 1], C = {[a, b] ] 0 _< a, b <_ 1 }. Assume that, 
for some d _> b, ct = Cl = [a,b] for all t _< to and ct = c2 = [a,d] for all t > to. All 
the examples drawn prior to to, {xt ' t < to}, are consistent with c2 and it would be nice 
to use these examples to help estimate c2. How much can these examples help? 
Theorem 1. Assume C is closed under intersection and VCdim(C) is finite; C2 C_ C; 
and .4 has PAC learned Cl  C at time to. Then,for some d such that VCdim(C2) <_ d _< 
VCdim(C ), the maximum number of examples drawn after time to required so that .4 can 
PAC learn c2  C is upper bounded by m(, 6) = max ( log 6 2-, 8. log !)) 
In other words, if there is no prior information about c2, then the number of examples 
required depends on VCdim(C). However, the examples drawn from cl can be used to 
shrink the concept space towards C2. For example, when cl = [a, b] and c2 = [a, c], 
in the limit where c = cl, the problem of learning c2 reduces to learning a one-sided 
interval which has VC-dimension 1 venus 2 for the two-sided interval. Since it is unlikely 
that c = cl, it will usually be the case that d > VCdim(C2). 
In order to PAC-track c, most of the time .4 must have re(e, 6) examples consistent with 
the current concept. This implies that w(e, 6) must be at least m(e, 6). Further, since the 
concepts are changing, the consistent examples will be the most recent. Using a sliding 
window of size m (e, 6), the tracker will have an estimate that is based on examples that 
are consistent with the active concept after collecting no more than re(e, 6) examples 
after a change. 
In much of our analysis of memory-based trackers, we have focused on a concept se- 
quence space x which is the set of all concept sequences such that, on average, each 
concept is active for at least 1/,X time steps before a change occurs. That is, ifN(c, t) is 
the number of changes in the tint t time steps of c, Cx = {c: lim suPt_..o o N (c, t)/t _< )}. 
The question then is, for what values of ,X does there exist a PAC-tracker? 
Theorem 2. Let  be a memory-based tracker with w(e, 6) = re(e, 6/2) which draws 
instances labeled according to some concept sequence c  Cx with each ct  C and 
VCdim(C ) < va. For any e > 0 and 6 > O, .4 can UPAC track C if ,X < m(e, 6/2). 
This theorem provides a lower bound on the maximum rate of change that can be tracked 
by a batch tracker. Theorem 1 implies that a memory-based tracker can use examples 
from a previous concept to help estimate the active concept. The proof of theorem 2 
assumes that some of the most recent m(e, 6) examples are not consistent with ct until 
m (e, 6) examples from the active concept have been gathered. An algorithm that removes 
inconsistent examples more intelligently, e.g., by using conflicts between examples or 
information about allowable changes, will be able to track concept sequence spaces that 
change more rapidly. 
Learning Time-varying Concepts 187 
4 INCREMENTAL TRACKING 
Incremental tracking is similar to the on-line learning case, but now we assume that there 
is an adversary who can change the concept such that Ct+l - ct. At each iteration: 
1. the adversary chooses the active concept ct; 
2. the tracker is given an unlabeled instance, xt; 
3. the tracker predicts a label using the current hypothesis: tt = t- (xt); 
4. the tracker is given the correct label at; 
5. the tracker forms a new hypothesis: t = (t-, (xtat)). 
We have defined a number of different types of trackers and adversaries: A prudent 
tracker predicts that at -' 1 if and only if t (Jr) '-' 1. A conservative tracker changes 
its hypothesis only if at  dt. A benign adversary changes the concept in a way that 
is independent of the tracker's hypothesis while a malicious adversary uses information 
about the tracker and its hypothesis to choose a ct+ to cause an increase in the error 
rate. The most malicious adversary chooses Ct+l to cause the largest possible increase in 
error rate on average. 
We distinguish between the error of the hypothesis formed in step 5 above and a mistake 
made in step 3 above. The instantaneous error rate of an hypothesis is et '- d(ct, t). 
It is the probability that another randomly chosen instance labeled according to ct will 
be misclassified by the updated hypothesis. A mistake is a mislabeled instance, and we 
define a mistake indicator function Mt -- 1 if ct (xt) - t-(xt). 
1 t 
We define the average error rate st = 7 Y'i=i et and the asymptotic error rate is s = 
lim inft-o st. The average mistake rate is the average value of the mistake indicator 
1 t 
function, t '-' 7 Ei=I Mr, and the asymptotic mistake rate is tt = lim inft-o tit. 
We are modeling the incremental tracking problems as a Markov process. Each state 
of the Markov process is labeled by a triple (c, , c), and corresponds to an iteration in 
which c is the active concept,  is the active hypothesis, and c is the set of changes the 
adversary is allowed to make given c. We are still in the process of analyzing a general 
model, so the following presents one of the special cases we have examined. 
Let X be the set of all points on the unit circle. We use polar coordinates so that 
since the radius is fixed we can label each point by an angle 0, thus X = [0, 2r). 
Note that X is periodic. The concept class C is the set of all arcs of fixed length r 
radians, i.e., all semicircles that lie on the unit circle. Each c  C can be written as 
c = [r(20 - 1) mod 2r, 2r0), where 0  [0, 1). We assume that the instances are chosen 
uniformly from the circle. 
The adversary may change the concept by rotating it around the circle, however, the 
maximum rotation is bounded such that, given ct, ct+i must satisfy d(Ct+l, ct) _< 7. For 
the uniform case, this is equivalent to restricting Ot+i = Ot q-/3 mod 1, where 0 _</3 _< 
7/2. 
The tracker is required to be conservative, but since we are satisfied to lower bound the 
error rate, we assume that every time the tracker makes a mistake, it is told the correct 
concept. Thus, t = t-1 if no mistake is made, but t = ct wherever a mistake is made. 
188 Kuh, Petsche, and Rivest 
The worst case or most malicious adversary for a conservative tracker always tries to 
maximize the tracker's error rate. Therefore, whenever the tracker deduces ct (i.e. when- 
ever the tracker makes a mistake), the adversary picks a direction by flipping a fair 
coin. The adversary then rotates the concept in that direction as far as possible on each 
iteration. Then we can define a random direction function St and write 
+1, w.p. 1/2/ft-1 = ct-i; 
St = -1, w.p. 1/2/ft-1 = ct-i; 
St-i, if t-1 -f: ct-1. 
Then the adversary chooses the new concept to be Ot = Or-1 + St7/2. 
Since the adversary always rotates the concept by 7/2, there are 2/7 distinct concepts that 
can occur. However, when O(t + 1/7) = O(t) + 1/2 mod 1, the semicircles do not overlap 
and therefore, after at most 1/7 changes, a mistake will be made with probability one. 
Because at most 1/7 consecutive changes can be made before the mistake rate returns 
to zero, because the probability of a mistake depends only on Ot- Or, and because of 
inherent symmetries, this system can be modeled by a Markov chain with k = 1/7 states. 
Each state si corresponds to the case lot -Otl = i7 mod 1. The probability of a transition 
from state si to state si+l is P(si+llsi) = 1 - (i + 1)7. The probability of a transition 
from state si to state so is P(solsi) -- (i q- 1)7. All other transition probabilities are 
zero. This Markov chain is homogeneous, irreducible, aperiodic, and finite so it has an 
invariant distribution. By solving the balance equations, for 7 sufficiently small, we find 
that 
I  __ iq+l 
e(sl) ,i-I,=o(1 - 
= ..... Jt e-X2dx (1) 
" 7 -- 1 'l-I'J 
z-,i=o .i=o - i7) 
Since we assume that 7 is small, the probability that no mistake will occur for each of 
k - 1 consecutive time steps after a mistake, P(s,-1), is very small and we can say that 
the probability of a mistake is approximately P(s0). Therefore, from equation 1, for small 
7, it follows that/.tmaliciou s  /'-/7r. 
If we drop the assumption that the adversary is malicious, and instead assume the the 
adversary chooses the direction randomly at each iteration, then a similar sort of analysis 
yields that ttbenign = O (72/3). 
Since the foregoing analysis assumes a conservative tracker that chooses the best hy- 
pothesis every time it makes a mistake, it implies that for this concept sequence space 
and any conservative tracker, the mistake rate is 0(71/2) against a malicious adversary 
and O (72/3t') against a benign adversary. For either adversary, it can be shown that 
5 CONCLUSIONS AND FURTHER RESEARCH 
We can draw a number of interesting conclusions form the work we have done so far. 
First, tracking sequences of concepts is possible when the individual concepts are learn- 
able and change occurs "slowly" enough. Theorem 2 gives a weak upper bound on the 
rate of concept changes that is sufficient to insure that tracking is possible. 
Learning Time-varying Concepts 189 
Theorem 1 implies that there can be some trade-off between the size (VC-dimension) 
of the changes and the rate of change. Thus, if the size of the changes is restricted, 
Theorems 1 and 2 together imply that the maximum rate of change can be faster than for 
the general case. It is significant that a simple tracker that maintains a sliding window 
on the most recent set of examples can PAC-track the new concept after a change as 
quickly as a static learner can if it starts from scratch. This suggests it may be possible 
to subsume detection so that it is implicit in the operation of the tracker. One obviously 
open problem is to determine d in Theorem 1, i.e., what is the appropriate dimension to 
apply to the concept changes? 
The analysis of the mistake and error rates presented in Section 4 is for a special case 
with VC-dimension 1, but even so, it is interesting that the mistake and error rates are 
significantly worse than the rate of change. Preliminary analysis of other concept classes 
suggests that this continues to be true for higher VC-dimensions. We are continuing 
work to extend this analysis to other concept classes, including classes with higher VC- 
dimension; non-conservative learners; and other restrictions on concept changes. 
Acknowledgments 
Anthony Kuh gratefully acknowledges the support of the National Science Foundation 
through grant EET-8857711 and Siemens Corporate Research. Ronald L. Rivest grateful- 
ly acknowledges support from NSF grant CCR-8914428, ARO grant N00014-89-J-1988, 
and a grant from the Siemens Corporation. 
References 
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. (1989). Learnability and the 
Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 
36(4):929-965. 
Galletti, I. and Abbott, M. (1989). Development of an advanced airborne speech recog- 
nizer for direct voice input. Speech Technology, pages 60--63. 
Haussler, D., Littlestone, N., and Warmuth, M. K. (1987). Expected mistake bounds for 
on-line learning algorithms. (Unpublished). 
Littlestone, N. (1989). Mistake bounds and logarithmic linear-threshold learning algo- 
rithms. Technical Report UCSC-CRL-89-11, Univ. of California at Santa Cruz. 
Littlestone, N. and Warmuth, M. K. (1989). The weighted majority algorithm. In Pro- 
ceedings oflEEE FOCS Conference, pages 256-261. IEEE. (Extended abstract only.). 
Lucky, R. W., Salz, J., and Weldon, E. J. (1968). Principles of Data Communications. 
McGraw-Hill, New York. 
Natarajan, B. K. (1987). On learning boolean functions. In Proceedings of the Nineteenth 
Annual ACM Symposium on Theory of Computing, pages 296-304. 
Valiant, L. (1984). A theory of the learnable. Communications of the ACM, 27:1134-1142. 
Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative 
frequencies of events to their probabilities. Theory of Probability and its Applications, 
16:264-280. 
