Sample Complexity for Learning 
Recurrent Perceptron Mappings 
Bhaskar Dasgupta 
Department of Computer Science 
University of Waterloo 
Waterloo, Ontario N2L 3G1 
CANADA 
bdasgupedaisy. uwa erloo. ca 
Eduardo D. Sontag 
Department of Mathematics 
Rutgers University 
New Brunswick, NJ 08903 
USA 
sonageconrol. rutgers. edu 
Abstract 
Recurrent perceptron classifiers generalize the classical perceptron 
model. They take into account those correlations and dependences 
among input coordinates which arise from linear digital filtering. 
This paper provides tight bounds on sample complexity associated 
to the fitting of such models to experimental data. 
1 Introduction 
One of the most popular approaches to binary pattern classification, underlying 
many statistical techniques, is based on percepttons or linear discriminants; see 
for instance the classical reference (Duda and Hart, 1973). In this context, one is 
interested in classifying k-dimensional input patterns 
V - (Vl,... ,v k) 
into two disjoint classes A + and A-. A perceptron P which classifies vectors into 
A + and A- is characterized by a vector (of "weights") 5' 6 I k, and operates as 
follows. One forms the inner product 
5'.V '-- ClU 1 q- ... CkV k. 
If this inner product is positive, v is classified into A +, otherwise into A-. 
In signal processing and control applications, the size k of the input vectors v is 
typically very large, so the number of samples needed in order to accurately "learn" 
an appropriate classifying perceptron is in principle very large. On the other hand, 
in such applications the classes A + and A- often can be separated by means of a 
dynamical system of fairly small dimensionality. The existence of such a dynamical 
system reflects the fact that the signals of interest exhibit context dependence and 
Sample Complexity for Learning Recurrent Perceptron Mappings 205 
correlations, and this prior information can help in narrowing down the search for a 
classifier. Various dynamical system models for classification appear from instance 
when learning finite automata and languages (Giles el. al., 1990) and in signal 
processing as a channel equalization problem (at least in the simplest g-level case) 
when modeling linear channels transmitting digital data from a quantized source, 
e.g. (Baksho et. al., 1991) and (Pulford et. al., 1991). 
When dealing with linear dynamical classifiers, the inner product '.v represents 
a convolution by a separating vector 5' that is the impulse-response of a recursive 
digital filter of some order n << k. Equivalently, one assumes that the data can 
be classified using a 5' that is n-recursive, meaning that there exist real numbers 
rl,  ., rn so that 
cj -- E cj_iri , j = n + l,...,k. 
i-'1 
Seen in this context, the usual perceptrons are nothing more than the very special 
subclass of "finite impulse response" systems (all poles at zero); thus it is appropri- 
ate to call the more general class "recurrent" or "IIR (infinite impulse response)" 
perceptrons. Some authors, particularly Back and Tsoi (Back and Tsoi, 1991; Back 
and Tsoi, 1995) have introduced these ideas in the neural network literature. There 
is also related work in control theory dealing with such classifying, or more generally 
quantized-output, linear systems; see (Delchamps, 1989; Koplon and Sontag, 1993). 
The problem that we consider in this paper is: if one assumes that there is an 
n-recursive vector 5' that serves to classify the data, and one knows n but not 
the particular vector, how many labeled samples v (i) are needed so as to be able 
to reliably estimate c'?. More specifically, we want to be able to guarantee that 
any classifying vector consistent with the seen data will classify "correctly with 
high probability" the unseen data as well. This is done by computing the VC 
dimension of the related concept class and then applying well-known results from 
computational learning theory. Very roughly speaking, the main result is that 
the number of samples needed is proportional to the logarithm of the length k (as 
opposed to k itself, as would be the case if one did not take advantage of the recurrent 
structure). Another application of our results, again by appealing to the literature 
from computational learning theory, is to the case of "noisy" measurements or more 
generally data not exactly classifiable in this way; for example, our estimates show 
roughly that if one succeeds in classifying 95% of a data set of size logq, then with 
confidence , i one is assured that the prediction error rate will be < 90% on future 
(unlabeled) samples. 
Section 5 contains a result on polynomial-time learnability: for n constant, the 
class of concepts introduced here is PAC learnable. Generalizations to the learning 
of real-valued (as opposed to Boolean) functions are discussed in Section 6. For 
reasons of space we omit many proofs; the complete paper is available by electronic 
mail from the authors. 
2 Definitions and Statements of Main Results 
Given a set X, and a subset X of X, a dichotomy on X is a function 
Assume given a class  of functions X - {-1, 1}, to be called the class of classifier 
functions. The subset X C_ X is shattered by . if each dichotomy on X is the 
restriction to X of some q G .. The Vapnik-Chervonenkis dimension vc (.) is the 
supremum (possibly infinite) of the set of integers n for which there is some subset 
206 B. DASGUPTA, E. D. SONTAG 
X C_ : of cardinality n which can be shattered by .. Due to space limitations, 
we omit any discussion regarding the relevance of the VC dimension to learning 
problems; the reader is referred to the excellent surveys in (Maass, 1994; Tur&n, 
1994) regarding this issue. 
Pick any two integers n>0 and q>0. A sequence 
'--- (C1,...,Cn+q)  I nq-q 
is said to be n-recursive if there exist real numbers rx,..., r so that 
cn+j = E cn+j-iri , j = 1,..., q. 
i--1 
(In particular, every sequence of length n is n-recursive, but the interesting cases 
are those in which q 3a 0, and in fact q >> n.) Given such an n-recursive sequence 
5', we may consider its associated perceptton classifier. This is the map 
e' l+q-{-1,1}' (Xl,...,X+q) sign (cixi) 
\i=l 
where the sign function is understood to be defined by sign (z) = -1 if z < 0 and 
sign (z) = 1 otherwise. (Changing the definition at zero to be +1 would not change 
the results to be presented in any way.) We now introduce, for each two fixed n, q 
as above, a class of functions: 
-,q := { 1  a+q is n-recursive}. 
This is understood as a function class with respect to the input space 1 = I +q, 
and we are interested in estimating vc (.,q). 
Our main result will be as follows (all logs in base 2): 
Theorem 1 
Imax {n, n [log(J1 + } 
_< vc (.,,q) < min{n+q, 18n+4nlog(q+ 1)} I 
Note that, 
in particular, when q > max{2 + n 2, 32}, one has the tight estimates 
 log q _< vc (:l:,q) < 8n log q. 
The organization of the rest of the paper is as follows. In Section 3 we state an 
abstract result on VC-dimension, which is then used in Section 4 to prove Theo- 
rem 1. Finally, Section 6 deals with bounds on the sample complexity needed for 
identification of linear dynamical systems, that is to say, the real-valued functions 
obtained when not taking "signs" when defining the maps 
3 An Abstract Result on VC Dimension 
Assume that we are given two sets : and A, to be called in this context the set of 
inputs and the set of parameter values respectively. Suppose that we are also given 
a function 
F' Ax-+{-1,1}. 
Associated to this data is the class of functions 
Y := {F(A,.)' 5-- {-1, 1}IA  A} 
Sample Complexity for Learning Recurrent Perceptron Mappings 207 
obtained by considering F as a function of the inputs alone, one such function for 
each possible parameter value ,. Note that, given the same data one could, dually, 
study the class 
Y* : {F(.,,): A-- {-1, 1} I e } 
which obtains by fixing the elements of X and thinking of the parameters as inputs. 
It is well-known (and in any case, a consequence of the more general result to be 
presented below) that vc (Y) > Llog(vc (y*))J, which provides a lower bound on 
vc (Y) in terms of the "dual VC dimension." A sharper estimate is possible when 
A can be written as a product of n sets 
A = A1 x A2 x ... x A, 
(1) 
and that is the topic which we develop next. 
We assume from now on that a decomposition of the form in Equation (1) is given, 
and will define a variation of the dual VC dimension by asking that only certain 
dichotomies on A be obtained from .*. We define these dichotomies only on "rect- 
angular" subsets of A, that is, sets of the form 
L= L1 X ...X L, CA 
with each Li C_ Ai a nonempty subset. Given any index 1 _<  _< n, by a -axis 
dichotomy on such a subset L we mean any function 5  L -- {-1, 1} which depends 
only on the th coordinate, that is, there is some function q  L -- {-1, 1} so that 
5(,1,...,,) = q(,) for all (AI,...,,)  L; an axis dichotomy is a map that 
is a -axis dichotomy for some . A rectangular set L will be said to be axis- 
shattered if every axis dichotomy is the restriction to L of some function of the form 
F(.,)' A -- {-1, 1), for some e g[. 
Theorem 2 If L = L1 x ... X L, C_ A can be axis-shattered and each set Li has 
cardinality ri, then vc (.T) > [1og(r)J +... + [1og(r)J. 
(In the special case n=l one recovers the classical result vc (Y) _> [1og(vc (Y*))J.) 
The proof of Theorem 2 is omitted due to space limitations. 
4 Proof of Main Result 
We recall the following result; it was proved, using Milnor-Warren bounds on the 
number of connected components of semi-algebraic sets, by Goldberg and Jerrum: 
Fact 4.1 (Goldberg and Jerrum, 1995) Assume given a function F : A x X -- 
{-1,1} and the associated class of functions .T := {F(,, .): X -- {-1,1} IA 6 A}. 
Suppose that A = I k and X = ll , and that the function F can be defined in terms 
of a Boolean formula involving at most s polynomial inequalities in k + n variables, 
each polynomial being of degree at most d. Then, vc (.T) _< 2k 1og(Seds). [] 
Using the above Fact and bounds for the standard "perceptron" model, it is not 
difficult to prove the following Lemma. 
Lemma 4.2 vc (.T,q) _< min{n + q, 18n + an 1og(q + 1)} 
Next, we consider the lower bound of Theorem 1. 
Lemma 4.3 vc (.T,,q) >_ max{n, nLlog(L1 + 
208 B. DASGUPTA, E. D. SONTAG 
Proof. As ,,q contains the class of functions qbe with  = (Cl,..., cn, 0,..., 0), which 
in turn being the set of signs of an n-dimensional linear space of functions, has VC 
dimension n, we know that vc (.,,q) > n. Thus we are left to prove that if q > n 
then vc (.,q) > n[log([1 + 
The set of n-recursive sequences of length n + q includes the set of sequences of the 
following special form: 
cj =E/ -1, j=l,...,n+q (2) 
i----1 
where ai, li  I for each i = 1,..., n. Hence, to prove the lower bound, it is 
sufficient to study the class of functions induced by 
F  x+q--{-1,1}, (A,...,A,xi,...,x,+q)-+sign EA-Ixj  
i--1 j--1 
Let r: [i+-ij and let Lx,...,L, be n disjoint sets of real numbers (if desired, 
integers), each of cardinality r. Let L = Ui__i Li. In addition, if rn < q+n-1, then 
select an additional set B of (q+n-rn-1) real numbers disjoint from L. 
We will apply Theorem 2, showing that the rectangular subset L1 x ... X L can 
be axis-shattered. Pick any n 6 {1,...,n} and any qb:L - {-1, 1}. Consider the 
(unique) interpolating polynomial 
n+q 
P(')-- E XJ'J--1 
j--1 
in A of degree q + n - 1 such that 
p(A)={ 0(A) ifAL 
if e (L v B)- 
Now pick  = (xx,..., X,+q-X). Observe that 
F(li,12,...,ln,xi,...,xn4rq) ---- sign p(li = 6(l) 
for M1 (/1,..., l,) 6 LI X...X Ln, since p(l) = 0 for I  L and p(l) = 4(1) otherwise. 
It follows from Theorem 2 that vc (,q)  nLlog(r)], s desired. 
5 The Consistency Problem 
We next briefly discuss polynomial time learnability of recurrent perceptron map- 
pings. As discussed in e.g. (rrur&n, 1994), in order to formalize this problem we 
need to first choose a data structure to represent the hypotheses in .,q. In addi- 
tion, since we are dealing with complexity of computation involving real numbers, 
we must also clarify the meaning of "finding" a hypothesis, in terms of a suitable 
notion of polynomial-time computation. Once this is done, the problem becomes 
that of solving the consistency problem: 
Given a set of s k s(e, 6) inputs 1, 2,''., s  In+q, and an 
arbitrary dichotomy A' {1,2,...,s} - {-1, 1} find a represen- 
tation of a hypothesis qbe 6 .,q such that the restriction of qbe to 
the set {1,2,...,s} is identical to the dichotomy A (or report 
that no such hypothesis exists). 
Sample Complexity for Learning Recurrent Perceptron Mappings 209 
The representation to be used should provide an "efficient encoding" of the values 
of the parameters rl,..., rn, Cl,..., Cn: given a set of inputs (Xl,..., X+q)  I '+q, 
one should be able to efficiently check concept membership (that is, compute 
sign (y]__+x q cixi)). Regarding the precise meaning of polynomial-time computation, 
there are at least two models of complexity possible: the unit cost model which deals 
with algebraic complexity (arithmetic and comparison operations take unit time) 
and the logarithmic cost model (computation in the Turing machine sense; inputs 
(xi,..., X+q) are rationals, and the time involved in finding a representation of 
rl, ..., r, Cl,. ., c is required to be polynomial on the number of bits L. 
Theorem 3 For each fixed n > O, the consistency problem for .g',,q can be solved 
in time polynomial in q and s in the unit cost model, and time polynomial in q, s, 
and L in the logarithmic cost model. 
Since vc (.,q) - O(n -t- nlog(q -t- 1)), it follows from here that the class .,q is 
learnable in time polynomial in q (and L in the log model). Due to space limitations, 
we must omit the proof; it is based on the application of recent results regarding 
computational complexity aspects of the first-order theory of real-closed fields. 
6 Pseudo-Dimension Bounds 
In this section, we obtain results on the learnability of linear systems dynamics, that 
is, the class of functions obtained if one does not take the sign when defining recur- 
rent perceptrons. The connection between VC dimension and sample complexity 
is only meaningful for classes of Boolean functions; in order to obtain learnability 
results applicable to real-valued functions one needs metric entropy estimates for 
certain spaces of functions. These can be in turn bounded through the estimation 
of Pollard's pseudo-dimension. We next briefly sketch the general framework for 
learning due to Haussler (based on previous work by Vapnik, Chervonenkis, and 
Pollard) and then compute a pseudo-dimension estimate for the class of interest. 
The basic ingredients are two complete separable metric spaces : and  (called 
respectively the sets of inputs and outputs), a class . of functions f : : -  
(called the decision rule or hypothesis space), and a function :  x  - [0, r] c I 
(called the loss or cost function). The function  is so that the class of functions 
(x, y)  (f(x), y) is "permissible" in the sense of Haussler and Pollard. Now, 
one may introduce, for each f G ., the function 
Af,e : ]xYxI-{-1,1} : (x,y,t)-sign((f(x),y)-t) 
as well as the class 4:=, consisting of all such ALe. The pseudo-dimension of . 
with respect to the loss function , denoted by PD [., ], is defined as: 
PO := vc 
Due to space limitations, the relationship between the pseudo-dimension and the 
sample complexity of the class . will not be discussed here; the reader is referred 
to the references (Haussler, 1992; Maass, 1994) for details. 
For our application we define, for any two nonnegative integers n, q, the class 
'"'q := {*l 'I"+q 
where   I "+q -- I  (xx, ...,X,+q) - 
can be proved using Fact 4.1. 
is n-recursive } 
E?__+i q CiX i . The following Theorem 
Theorem 4 Let p be a positive integer and assume that the loss function  is given 
by (Yl, Y2)- ]Yl- Y21 p. Then, PD ['n,q,] __< 1an + 4nlog(p(q + 1)). 
210 B. DASGUPTA, E. D. SONTAG 
Acknowledgement s 
This research was supported in part by US Air Force Grant AFOSR-94-0293. 
References 
A.D. BACK AND A.C. TsoI, FIR and IIR synapses, a new neural network archi- 
lecture for time-series modeling, Neural Computation, 3 (1991), pp. 375-385. 
A.D. BACK AND A.C. TsoI, A comparison of discrete-time operator models for 
nonlinear system identification, Advances in Neural Information Processing Systems 
(NIPS'94), Morgan Kaufmann Publishers, 1995, to appear. 
A.M. BAKSHO, S. DASGUPTA, J.S. GARNETT, AND C.R. JOHNSON, On the sim- 
ilarity of conditions for an open-eye channel and for signed filtered error adaptive 
filter stability, Proc. IEEE Conf. Decision and Control, Brighton, UK, Dec. 1991, 
IEEE Publications, 1991, pp. 1786-1787. 
A. BLUMER, A. EHRENFEUCHT, D. HAUSSLER, AND M. WARMUTH, Learnability 
and the Vapnik-Chervonenkis dimension, J. of the ACM, 36 (1989), pp. 929-965. 
D.F. DELCHAMPS, Extracting State Information from a Quanlized Output Record, 
Systems and Control Letters, 13 (1989), pp. 365-372. 
R.O. DUDA AND P.E. HART, Pattern Classification and Scene Analysis, Wiley, 
New York, 1973. 
C.E. GILES, G.Z. SUN, H.H. CHEN, Y.C. LEE, AND D. CHEN, Higher order re- 
current networks and grammatical inference, Advances in Neural Information Pro- 
cessing Systems 2, D.S. Touretzky, ed., Morgan Kaufmann, San Mateo, CA, 1990. 
P. GOLDBERG AND M. JERRUM, Bounding the Vapnik-Chervonenkis dimension of 
concept classes parameterized by real numbers, Mach Learning, 18, (1995): 131-148. 
D. HAUSSLER, Decision theoretic generalizations of the PA C model for neural nets 
and other learning applications, Information and Computation, 100, (1992): 78-150. 
R. KOPLON AND E.D. SONTAG, Linear systems with sign-observations, SIAM J. 
Control and Optimization, 31(1993): 1245 - 1266. 
W. MAASS, Perspectives of current research about the complexity of learning in 
neural nets, in Theoretical Advances in Neural Computation and Learning, V.P. 
Roychowdhury, K.Y. Siu, and A. Orlitsky, eds., Kluwer, Boston, 1994, pp. 295-336. 
G.W. PULFORD, R.A. KENNEDY, AND B.D.O. ANDERSON, Neural network struc- 
ture for emulating decision feedback equalizers, Proc. Int. Conf. Acoustics, Speech, 
and Signal Processing, Toronto, Canada, May 1991, pp. 1517-1520. 
E.D. SONTAG, Neural networks for control, in Essays on Control: Perspectives 
in the Theory and its Applications (H.L. Trentelman and J.C. Willems, eds.), 
Birkhauser, Boston, 1993, pp. 339-380. 
GYRGY TURIN, Computational Learning Theory and Neural Networks:A Survey 
of Selected Topics, in Theoretical Advances in Neural Computation and Learning, 
V.P. Roychowdhury, K.Y. Siu,and A. Orlitsky, eds., Kluwer, Boston, 1994, pp. 
243-293. 
L.G. VALIANT A theory of the learnable, Comm. ACM, 27, 1984, pp. 1134-1142. 
V.N.VAPNIK, Estimation of Dependencies Based on Empirical Data, Springer, 
Berlin, 1982. 
