Structural Risk Minimization for 
Nonparametric Time Series Prediction 
Ron Meir* 
Department of Electrical Engineering 
Technion, Haifa 32000, Israel 
rmeirdumbo. technion. ac. il 
Abstract 
The problem of time series prediction is studied within the uniform con- 
vergence framework of Vapnik and Chervonenkis. The dependence in- 
herent in the temporal structure is incorporated into the analysis, thereby 
generalizing the available theory for memoryless processes. Finite sam- 
ple bounds are calculated in terms of covering numbers of the approxi- 
mating class, and the tradeoff between approximation and estimation is 
discussed. A complexity regularization approach is outlined, based on 
Vapnik's method of Structural Risk Minimization, and shown to be ap- 
plicable in the context of mixing stochastic processes. 
1 Time Series Prediction and Mixing Processes 
A great deal of effort has been expended in recent years on the problem of deriving robust 
distribution-free error bounds for learning, mainly in the context of memoryless processes 
(e.g. [9]). On the other hand, an extensive amount of work has been devoted by statisticians 
and econometricinns to the study of parametric (often linear) models of time series, where 
the dependence inherent in the sample, precludes straightforward application of many of 
the standard results form the theory of memoryless processes. In this work we propose 
an extension of the framework pioneered by Vapnik and Chervonenkis to the problem of 
time series prediction. Some of the more elementary proofs are sketched, while the main 
technical results will be proved in detail in the full version of the paper. 
Consider a stationary stochastic process  = {-.- , X_l, Xo, X1,  . }, where Xi is a ran- 
dom variable defined over a compact domain in R and such that [Xi[ _< B with probability 
1, for some positive constant B. The problem of one-step prediction, in the mean square 
sense, can then be phrased as that of finding a function f(-) of the infinite past, such that 
E IXo - f(X_-)l 2 is minimal, where we use the notation X] = (Xi,Xi_l,... ,Xj), 
 This work was supported in pan by the a grant from the Israel Science Foundation 
Structural Risk Minimization fo r Nonparametric 7rne Series Prediction 309 
j ) i. It is well known that the optimal predictor in this case is given by the conditional 
mean, E[XolX_-]. While this solution, in principle, settles the issue of optimal predic- 
tion, it does not settle the issue of actually computing the optimal predictor. First of all, 
note that to compute the conditional mean, the probabilistic law generating the stochastic 
process  must be known. Furthermore, the requirement of knowing the full past, X, 
is of course rather stringent. In this work we consider the more practical situation, where 
a finite sub-sequence X = (X, X2,.   , XN) is observed, and an optimal prediction is 
needed, conditioned on this data. Moreover, for each finite sample size N we allow the 
predictors to be based only on a finite lag vector of size d. Ultimately, in order to achieve 
full generality one may let d --+ ec when N --+ ec in order to obtain the optimal predictor. 
We first consider the problem of selecting an empirical estimator from a class of functions 
Y'a,, ' R a  R, where n is a complexity index of the class (for example, the number 
of computational nodes in a feedforward neural network with a single hidden layer), and 
{fl -< B for f E Y'a,,. Consider then an empirical predictor fa,,,N 
(xi_a), i > N, 
for Xi based on the finite data set X N and depending on the d-dimensional lag vector 
Xi_ a, where fa,n,v E .T',n. It is possible to split the error incurred by this predictor into 
three terms, each possessing a rather intuitive meaning. It is the competition between these 
terms which determines the optimal solution, for a fixed amount of data. First, define the 
loss of a functional predictor f  R a  R as L(f) E Xi i- 2, . 
= - f(Xi_a) ] and let 
be the optimal function in .ct,n minimizing this loss. Furthermore, denote the optimal lag 
d predictor by f, and its associated loss by L. We are then able to split the loss of the 
empirical predictor fd,n,N into three basic components, 
(l) 
where Ld,n,N --' L(fa,n,N). The third term, L}, is related to the error incurred in using a fi- 
nite memory model (of lag size d) to predict a process with potentially infinite memory. We 
do not at present have any useful upper bounds for this term, which is related to the rate of 
convergence in the martingale convergence theorem, which to the best of our knowledge is 
unknown for the type of mixing processes we study in this work. The second term in (1), is 
related to the so-called approximation error, given by E[ f i-  , { ,'i--1'12 
(Xi- a)-Jct,,  i-a to which 
it can be immediately related through the inequality I[al p - Ib[P[ < pla- bl{ max(a, b)I p- . 
This term measures the excess error incurred by selecting a function f from a class of lim- 
ited complexity a,n, while the optimal lag d predictor f may be arbitrarily complex. Of 
course, in order to bound this term we will have to make some regularity assumptions about 
the latter function. Finally, the first term in (1) re, presents the so called estimation error, 
and is the only term which depends on the data X  . Similarly to the problem of regression 
for i.i.d. data, we expect that the approximation and estimation terms lead to conflicting 
demands on the choice of the the complexity, n, of the functional class .Ta,,. Clearly, in 
order to minimize the approximation error the complexity should be made as large as pos- 
sible. However, doing this will cause the estimation error to increase, because of the larger 
freedom in choosing a specific function in a,n to fit the data. However, in the case of time 
series there is an additional complication resulting from the fact that the misspecification 
error L is minimized by choosing d to be as large as possible, while this has the effect 
of increasing both the approximation as well as the estimation errors. We thus expect that 
some optimal values of d and n exist for each sample size N. 
Up to this point we have not specified how to select the empirical estimator fa,,,N. In this 
work we follow the ideas of Vapnik [8], which have been studied extensively in the con- 
text of i.i.d observations, and restrict our selection to that hypothesis which minimizes the 
f Xi_  2 
empiricalerror, givenby Lv(f)= 1 Y=a+ IXi- ( i-a)[ For this function it is 
N--d ' 
easy to establish (see for example [8]) that (L&n,N -L},,) < 2 sup/ey., I L(f)- Lv (f)l. 
The main distinction from the i.i.d case, of course, is that random variables appearing in 
310 R. Meir 
the empirical error, L:v(f), are no longer independent. It is clear at this point that some 
assumptions are needed regarding the stochastic process , in order that a law of large 
numbers may be established. In any event, it is obvious that the standard approach of using 
randomization and symmetrization as in the i.i.d case [3] will not work here. To circum- 
vent this problem, two approaches have been proposed. The first makes use of the so-called 
method of sieves together with extensions of the Bernstein inequality to dependent data [6]. 
The second approach, to be pursued here, is based on mapping the problem onto one char- 
acterized by an i.i.d process [10], and the utilization of the standard results for the latter 
case. 
In order to have some control of the estimation error discussed above, we will restrict our- 
selves in this work to the class of so-called mixing processes. These are processes for which 
the 'future' depends only weakly on the 'past', in a sense that will now be made precise. 
Following the definitions and notation of Yu [10], which will be utilized in the sequel, let 
at = a(X) and (r+ m: a(X+m), be the sigma-algebras of events generated by the ran- 
dom variables X = (X, X2,... , Xt) and Xt m - (Xt+m, Xt+m+,. . . ), respectively. 
We then define m, the coefficient of absolute regularity, as 
m -- supt> Esup {IP(Bla,) - P(B)I  B E ;+m}, where the expectation is taken 
with respect to at = a(X). A stochastic process is said to be -mixing if 3m  0 as 
m  oc. We note that there exist many other definitions of mixing (see [2] for details). 
The motivation for using the -mixing coefficient is that it is the weakest form of mixing 
for which uniform laws of large numbers can be established. In this work we consider two 
type of processes for which this coefficient decays to zero, namely algebraically decaying 
processes for which m < m -', , r > 0, and exponentially mixing processes for which 
m < / exp{ -bm'}, , b,  > 0. Note that for Markov processes mixing implies expo- 
nential mixing, so that at least in this case, there is no loss of generality in assuming that 
the process is exponentially mixing. Note also that the usual i.i.d process may be obtained 
from either the exponentially or algebraically mixing process, by taking the limit   oc 
or r  ec, respectively. 
In this section we follow the approach taken by Yu [10] in deriving uniform laws of large 
numbers for mixing processes, extending her mainly asymptotic results to finite sample 
behavior, and somewhat broadening the class of processes considered by her. The basic 
idea in [10], as in many related approaches, involves the construction of an independent- 
block sequence, which is shown to be 'close' to the original process in a well-defined 
probabilistic sense. We briefly recapitulate the construction, slightly modifying the nota- 
tion in [10] to fit in with the present paper. Divide the sequence X into 2uv blocks, 
each of size aN; we assume for simplicity that N = 2]tNaN. The blocks are then num- 
bered according to their order in the block-sequence. For 1 < j < /v define H i -- 
{i' 2(j - 1)aN + 1 _< i _< (2j - 1)aN} and Tj = {i  (2j - 1)aN + 1 < i < (2j)av}. 
Denote the random variables corresponding to the H i and Tj indices as X  = {Xi ' 
i 6 Hi} and X'(J) = {Xi  i 6 Tj}. The sequence of H-blocks is then denoted by 
XaN {X() N 
--' }/=1' Now, construct a sequence of independent and identically distributed 
,rE(J)/"'v where E(J) = {i ' i  Hj}, such that the sequence is indepen- 
(i.i.d.) blocks. /Jj=l, 
dent of X and each block has the same distribution as the block x(J) from the original 
sequence. Because the process is stationary, the blocks (J) are not only independent but 
also identically distributed. The basic idea in the construction of the independent block 
sequence is that it is 'close', in a well-defined sense to the original blocked sequence Xav. 
Moreover, by appropriately selecting the number of blocks,/N, depending on the mixing 
nature of the sequence, one may relate properties of the original sequence X, to those of 
the independent block sequence  (see Lemma 4.1 in [ 10]). 
Let  be a class of bounded functions, such that 0 < f < B for any f  . In order to 
Structural Risk Minimization for Nonparametric Time Series Prediction 311 
relate the uniform deviations (with respect to ) of the original sequence X N to those of 
the independent-block sequence Ea:v, use is made of Lemma 4.1 from [10]. We also utilize 
Lemma 4.2 from [10] and modify it so that it holds for finite sample size. Consider the 
block-independent sequence Easy and define t,v 
- ES= ./(.=.(5)) where ) = 
YicHj f(i), J = 1, 2,... ,/N, is a sequence of independent random variables such that 
Il < arB. In the remainder of the paper we use variables with a tilde above them to 
denote quantities related to the transformed block sequence. Finally, we use the symbol 
EN tO denote the empirical average with respect to the original sequence, namely ENf -- 
d) 
- Y'4=a+ f(Xi). The following result can be proved by a simple extension of 
Lemma 4.2 in [lO]. 
Lemma 1.1 Suppose oT is a permissible class of bounded functions, Ifl -< B for f E oT. 
Then 
[' {sup ,ENf-Ell > c} <_2[{sup,t,rf--f[>aNs}}+2ttN/a:. (2) 
IC.' f.' 
The main merit of Lemma 1.1 is in the transformation of the problem from the domain of 
dependent processes, implicit in the quantity [ENf -- Ell, to one characterized by indepen- 
dent processes, implicit in the term t,v o f - f corresponding to the independent blocks. 
The price paid for this transformation is the extra term 2/v3N which appears on the r.h.s 
of the inequality appearing in Lemma 1.1. 
2 Error Bounds 
The development in Section 1 was concerned with a scalar stochastic process . In order 
to use the results in the context of time series, we first define a new vector-valued pro- 
cess X" = {... ,_,3o,3,... } where 3i = (Xi,Xi-,... ,Xi-a)  R t+l. For 
this sequence the/3-mixing coefficients obey the inequality/3m(') _< 
be a space of functions mapping R a --> R, and for each f  oT let the loss function be 
(Xi_a) = IXi - f(X_-)l 2. The loss space is given by 
given by f i . 
It is well known in the theory of empirical processes (see [7] for example), that in or- 
der to obtain upper bounds on uniform deviations of i.i.d sequences, use must be made 
of the so-called covering number of the function class oT, with respect to the empiri- 
cal l,N norm, given by l,N(f,g) N_ N 
= Yi= If(Xi) - g(Xi)l. Similarly, we de- 
note the empirical norm with respect to the independent block sequence by/1,t,v, where 
= y'j= ]f(X(J)) - O(X(J) I, and where f(X (j)) = Eiej Xi and simi- 
larly for .0. Following common practice we denote the c-covering number of the functional 
space oT using the metric p by .N'(e, oT, p). 
Definition 1 Let ; be a class of real-valued functions from R 19 --+ R, D = d + 1. For 
each ef  J:' and  (:1,:2, . ,.7v), .7i  R D, let f() aN 
= " = -i= f(.7i). Then 
define ff. s = {t?l ' l e Es} , where t?l ' Rav -- R +. 
In order to obtain results in terms of the covering numbers of the space y rather than/.y, 
which corresponds to the transformed sequence, we need the following lemma, which is 
not hard to prove. 
Lemma 2.1 For any e > 0 
312 R. Meir 
PROOF The result follows by sequence of simple inequalities, showing that [,t,v (f, ) -< 
aNl,v(f,g). I 
We now present the main result of this section, namely an upper bound for the uniform 
deviations of mixing processes, which in turn yield upper bounds on the error incurred by 
the empirically optimal predictor fd,n.2v. 
Theorem 2.1 Let X' = {... , X1, X0, X1,    } be a bounded stationary -mixing stochas- 
tic process, with IXil _< B, and let Y' be a class of bounded functions, f: R 't -', [0, B]. 
For each sample size N, let f N be the function in Y' which minimizes the empirical error, 
and f* is the function in Y' minimizing the true error L(f). Then, 
P {L(]N)- L(f*) > c} _< 8EJV(c',Y',/1,N) exp 64(2B) 4 + 
(3) 
where s -- s/128B. 
PROOF The theorem is established by making use of Lemma 1.1, and the basic results from 
the theory of uniform convergence for i.i.d. processes, together with Lemma 2.1 relating 
the covering numbers of the spaces/y and y. The covering numbers of .r and Y' are 
easily related using A/'(e,  y , L ( P) ) < JV'(s /2B, , L1 ( P) ). I 
Up to this point we have not specified/N and aN, and the result is therefore quite general. 
In order to obtain weak consistency we require that that the r.h.s. of (3) converge to zero 
for each c > 0. This immediately yields the following conditions on/ZN (and thus also on 
aN through the condition 2aNltN -- N). 
Corollary 2.1 Under the conditions of Theorem 2.1, and the added requirements that d - 
O(aN) and./V'(e, , li,N) < ac, the following choices of pN are sufficient to guarantee the 
weak consistency of the empirical predictor f N : 
Pv  N /(+) (exponential mixing), (4) 
PN  N s/(l+s), 0 < s < r (algebraic mixing), (5) 
where the notation aN  bN implies that (bN) _< aN < O(bN). 
PROOF Consider first the case of exponential mixing. In this case the r.h.s. of (3) clearly 
converges to zero because of the finiteness of the covering number. The fastest rate of 
convergence is achieved by balancing the two terms in the equation, leading to the choice 
IN  N /(+). In the case of algebraic mixing, the second term on the r.h.s. of (3) is 
of the order O(lNa ) where we have used d = o(aN). Since ]tNa N , N, a sufficient 
condition to guarantee that this term converge to zero is that/N  N s/(+*), 0 < s < r, 
as was claimed. I 
In order to derive bounds on the expected error, we need to make an assumption concerning 
the covering number of the space Y'. In particular, we know from the work Haussler [4] 
that the covering number is upper bounded as follows 
Af(e,, L (P)) <_ e(Pdim() + l) ( 2-B) Pdim(y) , 
for any measure ?. Thus, assuming the finiteness of the pseudo-dimension of Y' guarantees 
a finite covering number. 
Structural Risk Minimization for Nonparametric Time Series Prediction 
3 Structural Risk Minimization 
313 
The results in Section 2 provide error bounds for estimators formed by minimizing the em- 
pirical error over a fixed class of d-dimensional functions. It is clear that the complexity of 
the class of functions plays a crucial role in the procedure. If the class is too rich, mani- 
fested by very large covering numbers, clearly the estimation error term will be very large. 
On the other hand, biasing the class of functions by restricting its complexity, leads to poor 
approximation rates. A well-known strategy for overcoming this dilemma is obtained by 
considering a hierarchy of functional classes with increasing complexity. For any given 
sample size, the optimal trade-off between estimation and approximation can then be de- 
termined by balancing the two terms. Such a procedure was developed in the late seventies 
by Vapnik [8], and termed by him structural risk minimization (SRM). Other more recent 
approaches, collectively termed complexity regularization, have been extensively studied 
in recent years (e.g. [1]). It should be borne in mind, however, that in the context of time 
series there is an added complexity, that does not exist in the case of regression. Recall 
that the results derived in Section 2 assumed some fixed lag vector d. In general the op- 
timal value of d is unknown, and could in fact be infinite. In order to achieve optimal 
performance in a nonparametric setting, it is crucial that the size of the lag be chosen adap- 
tively as well. This added complexity needs to be incorporated into the SRM framework, 
if optimal performance in the face of unknown memory size is to be achieved. 
Let .'a,n, d, n E lq be a sequence of functions, and define .' = Ua= Un-- .'a,n. For any 
f'a,n let 
A/' (e, 'a,,) = supA/'(e, 'a,,,/,r), 
x N 
which from [4] is upper bounded by cs -Pdim(".'). We observe in passing that Lugosi and 
Nobel [5] have recently considered situations where the pseudo-dimension Pdim(0ra,n) is 
unknown, and the covering number is estimated empirically from the data. Although this 
line of thought is potentially very useful, we do not pursue it here, but rather assume that 
upper bounds on the pseudo-dimensions of .'a,, are known, as is the case for many classes 
of functions used in practice (see for example [9]). 
In line with the standard approach in [8] we introduce a new empirical function, which 
takes into account both the empirical error as well as the complexity costs penalizing overly 
complex models (large complexity index n and lag size d). Let 
ka,.r(f) -- LN(f) + Aa,.r(S) + Aa,r, (6) 
where Lv (f) is the empirical error of the predictor f and the complexity penalties A are 
given by 
/logA/' (e, Ya,,) + 
Aa"'v(s) = V 764(----)7 (7) 
 Cd 
a,N = N/64(2B)4. (8) 
The specific fore and constants in these definitions are chosen with hindsight, so as to 
achieve the optimal rates of convergence in Theorem 3.1 below. The constants c and ca 
are positive constants obeying , e -  1 and similarly for ca. A possible choice is 
c, = 2 log n + 1 and ca = 2 log d + 1. The value of  can be chosen in accordance with 
Corollary 2.1. 
Let a,,, minimize the empirical eor L(f) within the class of functions a,,. We 
assume that the classes a,n are compact, so that such a minimizer exists. Further, let  
314 R. Meir 
be the function in ' minimizing the complexity penalized loss (6), namely 
a,,,iv(fv) = minmin La,n,v(fa,,,v) (9) 
d>l n>l 
The following basic result establishes the consistency of the structural risk minimization 
approach, and yields upper bounds on its performance. 
Theorem 3.1 Let .Ta,n, d, n E IN be sequence of functional classes, where f  f'a,n is 
a mapping from R a to l The expected loss of the function fN, selected according to the 
SRM principle, is upper bounded by 
EL(fN) < min L(f) + c + + 4(2B)2Uv/3,N/2. 
-- a,, /N (10) 
The main merit of Theorem 3.1 is the demonstration that the SRM procedure achieves an 
optimal balance between approximation and estimation, while retaining its nonparametric 
attributes. In particular, if the optimal lag d predictor f belongs to .Td,no for some no, the 
SRM predictor would converge to it at the same rate as if no were known in advance. The 
same type of adaptivity is obtained with respect to the lag size d. The nonparametric rates 
of convergence of the SRM predictor will be discussed in the full paper. 
References 
[10] 
[1] A. Barron. Complexity Regularization with Application to Artificial Neural Net- 
works. In G. Roussas, editor, Nonparametric Functional Estimation and Related 
Topics, pages 561-576. Kluwer Academic Press, 1991. 
[2] L. Gy6rfi, W. Hirdle, P. Sarda, and P. Vieu. Nonparametric Curve Estimation from 
77me Series. Springer Verlag, New York, 1989. 
[3] D. Haussler. Decision Theoretic Generalizations of the PAC Model for Neural Net 
and Other Learning Applications. Information and Computation, 100:78-150, 1992. 
[4] D. Haussler. Sphere Packing Numbers for Subsets of the Boolean n-Cube with 
Bounded Vapnik-Chervonenkis Dimesnion. J. Combinatorial Theory, Series A 
69:217-232, 1995. 
[5] G. Lugosi and A. Nobel. Adaptive Model Selection Using Empirical Complexities. 
Submitted to Annals Statis., 1996. 
[6] D. Modha and E. Masry. Memory Universal Prediction of Stationary Random Pro- 
cesses. IEEE Trans. lnf Th., January, 1998. 
[7] D. Pollard. Convergence of Empirical Processes. Springer Verlag, New York, 1984. 
[8] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, 
New York, 1992. 
[9] M. Vidyasagar. A Theory of Learning and Generalization. Springer Verlag, New 
York, 1996. 
B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. 
Annals of Probability, 22:94-116, 1984. 
