Stable Linear Approximations to 
Dynamic Programming for Stochastic 
Control Problems with Local Transitions 
Benjamin Van Roy and John N. Tsitsiklis 
Laboratory for Information and Decision Systems 
Massachusetts Institute of Technology 
Cambridge, MA 02139 
e-mail: bvr@mit.edu, jnt@mit.edu 
Abstract 
We consider the solution to large stochastic control problems by 
means of methods that rely on compact representations and a vari- 
ant of the value iteration algorithm to compute approximate cost- 
to-go functions. While such methods are known to be unstable in 
general, we identify a new class of problems for which convergence, 
as well as graceful error bounds, are guaranteed. This class in- 
volves linear parameterizations of the cost-to-go function together 
with an assumption that the dynamic programming operator is a 
contraction with respect to the Euclidean norm when applied to 
functions in the parameterized class. We provide a special case 
where this assumption is satisfied, which relies on the locality of 
transitions in a state space. Other cases will be discussed in a full 
length version of this paper. 
1 INTRODUCTION
Neural networks are well established in the domains of pattern recognition and 
function approximation, where their properties and training algorithms have been 
well studied. Recently, however, there have been some successful applications of 
neural networks in a totally different context - that of sequential decision making 
under uncertainty (stochastic control). 
Stochastic control problems have been studied extensively in the operations research 
and control theory literature for a long time, using the methodology of dynamic 
programming [Bertsekas, 1995]. In dynamic programming, the most important 
object is the cost-to-go (or value) function, which evaluates the expected future 
1046 B.V. ROY, J. N. TSITSIKLIS 
cost to be incurred, as a function of the current state of a system. Such functions 
can be used to guide control decisions. 
Dynamic programming provides a variety of methods for computing cost-to-go 
functions. Unfortunately, dynamic programming is computationally intractable in 
the context of many stochastic control problems that arise in practice. This is 
because a cost-to-go value is computed and stored for each state, and due to the 
curse of dimensionality, the number of states grows exponentially with the number 
of variables involved. 
Due to the limited applicability of dynamic programming, practitioners often rely 
on ad hoc heuristic strategies when dealing with stochastic control problems. Sev- 
eral recent success stories - most notably, the celebrated Backgammon player of 
Tesauro (1992) - suggest that neural networks can help in overcoming this limita- 
tion. In these applications, neural networks are used as compact representations 
that approximate cost-to-go functions using far fewer parameters than states. This 
approach offers the possibility of a systematic and practical methodology for ad- 
dressing complex stochastic control problems. 
Despite the success of neural networks in dynamic programming, the algorithms 
used to tune parameters are poorly understood. Even when used to tune the pa- 
rameters of linear approximators, algorithms employed in practice can be unstable 
[Boyan and Moore, 1995; Gordon, 1995; Tsitsiklis and Van Roy, 1994]. 
Some recent research has focused on establishing classes of algorithms and compact 
representations that guarantee stability and graceful error bounds. Tsitsiklis and Van
Roy (1994) prove results involving algorithms that employ feature extraction and in- 
terpolative architectures. Gordon (1995) proves similar results concerning a closely 
related class of compact representations called averagers. However, there remains 
a huge gap between these simple approximation schemes that guarantee reasonable 
behavior and the complex neural network architectures employed in practice. 
In this paper, we motivate an algorithm for tuning the parameters of linear com- 
pact representations, prove its convergence when used in conjunction with a class 
of approximation architectures, and establish error bounds. Such architectures are 
not captured by previous results. However, the results in this paper rely on addi- 
tional assumptions. In particular, we restrict attention to Markov decision problems 
for which the dynamic programming operator is a contraction with respect to the 
Euclidean norm when applied to functions in the parameterized class. Though 
this assumption on the combination of compact representation and Markov deci- 
sion problem appears restrictive, it is actually satisfied by several cases of practical 
interest. In this paper, we discuss one special case which employs affine approxima- 
tions over a state space, and relies on the locality of transitions. Other cases will 
be discussed in a full length version of this paper. 
2 MARKOV DECISION PROBLEMS 
We consider infinite horizon, discounted Markov decision problems defined on a
finite state space $S = \{1,\ldots,n\}$ [Bertsekas, 1995]. For every state $i \in S$, there is
a finite set $U(i)$ of possible control actions, and for each pair $i,j \in S$ of states and
control action $u \in U(i)$ there is a probability $p_{ij}(u)$ of a transition from state $i$ to
state $j$ given that action $u$ is applied. Furthermore, for every state $i$ and control
action $u \in U(i)$, there is a random variable $c_{iu}$ which represents the one-stage cost
if action $u$ is applied at state $i$.
Let $\beta \in [0, 1)$ be a discount factor. Since the state spaces we consider in this paper
Stable Linear Approximations Programming for Stochastic Control Problems 104 7 
are finite, we choose to think of cost-to-go functions mapping states to cost-to-go
values in terms of cost-to-go vectors whose components are the cost-to-go values
of various states. The optimal cost-to-go vector $V^* \in \mathbb{R}^n$ is the unique solution to
Bellman's equation:
$$V_i^* = \min_{u \in U(i)} \Big( E[c_{iu}] + \beta \sum_{j \in S} p_{ij}(u) V_j^* \Big), \quad \forall i \in S. \tag{1}$$
If the optimal cost-to-go vector is known, optimal decisions can be made at any
state $i$ as follows:
$$u^* = \arg\min_{u \in U(i)} \Big( E[c_{iu}] + \beta \sum_{j \in S} p_{ij}(u) V_j^* \Big), \quad \forall i \in S.$$
There are several algorithms for computing $V^*$ but we only discuss the value
iteration algorithm which forms the basis of the approximation algorithm to be
considered later on. We start with some notation. We define the dynamic programming
operator as the mapping $T : \mathbb{R}^n \to \mathbb{R}^n$ with components $T_i : \mathbb{R}^n \to \mathbb{R}$ defined by
$$T_i(V) = \min_{u \in U(i)} \Big( E[c_{iu}] + \beta \sum_{j \in S} p_{ij}(u) V_j \Big), \quad \forall i \in S. \tag{2}$$
It is well known and easy to prove that $T$ is a maximum norm contraction. In
particular,
$$\|T(V) - T(V')\|_\infty \le \beta \|V - V'\|_\infty, \quad \forall V, V' \in \mathbb{R}^n.$$
The value iteration algorithm is described by
$$V(t+1) = T(V(t)),$$
where $V(0)$ is an arbitrary vector in $\mathbb{R}^n$ used to initialize the algorithm. It is easy
to see that the sequence $\{V(t)\}$ converges to $V^*$, since $T$ is a contraction.
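As an illustration of value iteration, the following Python sketch applies the
operator T of Equation 2 repeatedly to a small, hypothetical two-state problem.
The function names (`bellman_operator`, `value_iteration`), the stopping
tolerance, and all numerical values are assumptions made for this example and
do not come from the paper.

```python
def bellman_operator(V, costs, probs, beta):
    """Apply T to V: T_i(V) = min_u (E[c_iu] + beta * sum_j p_ij(u) V_j)."""
    return [
        min(
            costs[i][u] + beta * sum(p * v for p, v in zip(probs[i][u], V))
            for u in range(len(costs[i]))
        )
        for i in range(len(V))
    ]


def value_iteration(costs, probs, beta, tol=1e-10, max_iters=10_000):
    """Iterate V(t+1) = T(V(t)) from V(0) = 0 until successive iterates agree."""
    V = [0.0] * len(costs)
    for _ in range(max_iters):
        V_next = bellman_operator(V, costs, probs, beta)
        if max(abs(a - b) for a, b in zip(V, V_next)) < tol:
            return V_next
        V = V_next
    return V


# Hypothetical two-state, two-action problem (illustrative numbers only).
costs = [[1.0, 2.0], [0.0, 3.0]]      # costs[i][u] = E[c_iu]
probs = [
    [[0.9, 0.1], [0.2, 0.8]],         # probs[0][u][j] = p_0j(u)
    [[0.5, 0.5], [0.0, 1.0]],         # probs[1][u][j] = p_1j(u)
]
beta = 0.9
V_star = value_iteration(costs, probs, beta)
```

Because the loop stores one value per state, its memory and time per sweep
grow with n, which is the limitation that motivates compact representations.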
3 APPROXIMATIONS TO DYNAMIC PROGRAMMING 
Classical dynamic programming algorithms such as value iteration require that we 
maintain and update a vector V of dimension n. This is essentially impossible when 
n is extremely large, as is the norm in practical applications. We set out to overcome 
this limitation by using compact representations to approximate cost-to-go vectors. 
In this section, we develop a formal framework for compact representations, describe 
an algorithm for tuning the parameters of linear compact representations, and prove 
a theorem concerning the convergence properties of this algorithm. 
3.1 COMPACT REPRESENTATIONS 
A compact representation (or approximation architecture) can be thought of as a 
scheme for recording a high-dimensional cost-to-go vector $V \in \mathbb{R}^n$ using a
lower-dimensional parameter vector $w \in \mathbb{R}^m$ ($m \ll n$). Such a scheme can be described by
a mapping $\tilde{V} : \mathbb{R}^m \to \mathbb{R}^n$ which to any given parameter vector $w \in \mathbb{R}^m$ associates
a cost-to-go vector $\tilde{V}(w)$. In particular, each component $\tilde{V}_i(w)$ of the mapping is
the $i$th component of a cost-to-go vector represented by the parameter vector $w$.
Note that, although we may wish to represent an arbitrary vector $V \in \mathbb{R}^n$, such a
scheme allows for exact representation only of those vectors $V$ which happen to lie
in the range of $\tilde{V}$.
In this paper, we are concerned exclusively with linear compact representations of
the form $\tilde{V}(w) = Mw$, where $M \in \mathbb{R}^{n \times m}$ is a fixed matrix representing our choice
of approximation architecture. In particular, we have $\tilde{V}_i(w) = M_i w$, where $M_i$ (a
row vector) is the $i$th row of the matrix $M$.
3.2 A STOCHASTIC APPROXIMATION SCHEME 
Once an appropriate compact representation is chosen, the next step is to generate 
a parameter vector $w$ such that $\tilde{V}(w)$ approximates $V^*$. One possible objective is
to minimize squared error of the form $\|Mw - V^*\|_2^2$. If we were given a fixed set
of $N$ samples $\{(i_1, V^*_{i_1}), (i_2, V^*_{i_2}), \ldots, (i_N, V^*_{i_N})\}$ of an optimal cost-to-go vector $V^*$, it
seems natural to choose a parameter vector $w$ that minimizes $\sum_{j=1}^{N} (M_{i_j} w - V^*_{i_j})^2$.
On the other hand, if we can actively sample as many data pairs as we want, one
at a time, we might consider an iterative algorithm which generates a sequence of
parameter vectors $\{w(t)\}$ that converges to the desired parameter vector. One such
algorithm works as follows: choose an initial guess $w(0)$, then for each $t \in \{0, 1, \ldots\}$
sample a state $i(t)$ from a uniform distribution over the state space and apply the
iteration
$$w(t+1) = w(t) - \alpha(t) \big( M_{i(t)} w(t) - V^*_{i(t)} \big) M_{i(t)}^T, \tag{3}$$
where $\{\alpha(t)\}$ is a sequence of diminishing step sizes and the superscript $T$ denotes
a transpose. Such an approximation scheme conforms to the spirit of traditional 
function approximation - the algorithm is the common stochastic gradient descent 
method. However, as discussed in the introduction, we do not have access to such 
samples of the optimal cost-to-go vector. We therefore need more sophisticated 
methods for tuning parameters. 
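The stochastic gradient iteration of Equation 3 can be sketched in Python as
follows, for the idealized setting in which samples of the target vector are
available. The function name `sgd_fit`, the step size schedule
0.05 / (1 + t/1000), the seed, and the five-state data set are assumptions
chosen for illustration; the targets are constructed to lie exactly in the
range of M.

```python
import random


def sgd_fit(M, V_target, steps=20_000, seed=0):
    """Run iteration (3): w <- w - alpha(t) * (M_i w - V_i) * M_i^T."""
    rng = random.Random(seed)
    n, m = len(M), len(M[0])
    w = [0.0] * m                                # initial guess w(0)
    for t in range(steps):
        alpha = 0.05 / (1.0 + t / 1000.0)        # diminishing step size (assumed)
        i = rng.randrange(n)                     # uniform sample of a state
        error = sum(M[i][k] * w[k] for k in range(m)) - V_target[i]
        w = [w[k] - alpha * error * M[i][k] for k in range(m)]
    return w


# Hypothetical data: five states with feature rows (x, 1) and targets 2x + 1.
M = [[float(x), 1.0] for x in range(5)]
V_target = [2.0 * x + 1.0 for x in range(5)]
w_fit = sgd_fit(M, V_target)
```

Each update touches only the sampled row of M, so the cost per step depends
on m rather than on n.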
One possibility involves the use of an algorithm similar to that of Equation 3, 
replacing samples of $V^*_{i(t)}$ with $T_{i(t)}(Mw(t))$. This might be justified by the fact that
$T(V)$ can be viewed as an improved approximation to $V^*$, relative to $V$. The
modified algorithm takes on the form
$$w(t+1) = w(t) - \alpha(t) \big( M_{i(t)} w(t) - T_{i(t)}(Mw(t)) \big) M_{i(t)}^T. \tag{4}$$
Intuitively, at each time $t$ this algorithm treats $T(Mw(t))$ as a "target" and takes
a steepest descent step as if the goal were to find a $w$ that would minimize
$\|Mw - T(Mw(t))\|_2^2$. Such an algorithm is closely related to the TD(0) algorithm of Sutton
(1988). Unfortunately, as pointed out in Tsitsiklis and Van Roy (1994), such a
scheme can produce a diverging sequence $\{w(t)\}$ of weight vectors even when there
exists a parameter vector $w^*$ that makes the approximation error $V^* - Mw^*$ zero at
every state. However, as we will show in the remainder of this paper, under certain 
assumptions, such an algorithm converges. 
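A single update of Equation 4 can be sketched as below. The helper names
(`bellman_component`, `approx_vi_step`) and the two-state, single-feature
problem are hypothetical. For clarity the sketch forms the whole vector Mw,
although T_i(Mw) only depends on the entries j with p_ij(u) > 0.

```python
def bellman_component(i, V, costs, probs, beta):
    """T_i(V) = min_u (E[c_iu] + beta * sum_j p_ij(u) V_j)."""
    return min(
        costs[i][u] + beta * sum(p * v for p, v in zip(probs[i][u], V))
        for u in range(len(costs[i]))
    )


def approx_vi_step(w, i, alpha, M, costs, probs, beta):
    """One update of iteration (4) at the sampled state i."""
    V_approx = [sum(row[k] * w[k] for k in range(len(w))) for row in M]  # Mw
    target = bellman_component(i, V_approx, costs, probs, beta)          # T_i(Mw)
    error = V_approx[i] - target                                         # M_i w - T_i(Mw)
    return [w[k] - alpha * error * M[i][k] for k in range(len(w))]


# Hypothetical problem: two states, one action each, a single feature.
M = [[1.0], [2.0]]
costs = [[1.0], [0.0]]                 # costs[i][u] = E[c_iu]
probs = [[[0.0, 1.0]], [[0.0, 1.0]]]   # both states move to state 2
beta = 0.5
w_next = approx_vi_step([1.0], 0, 0.1, M, costs, probs, beta)
```

Nothing in this step by itself guarantees convergence when it is repeated;
that depends on the problem and on the choice of M.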
3.3 MAIN CONVERGENCE RESULT 
Our first assumption concerning the step size sequence $\{\alpha(t)\}$ is standard to
stochastic approximation and is required for the upcoming theorem.
Assumption 1 Each step size $\alpha(t)$ is chosen prior to the generation of $i(t)$, and
the sequence satisfies $\sum_{t=0}^{\infty} \alpha(t) = \infty$ and $\sum_{t=0}^{\infty} \alpha^2(t) < \infty$.
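As a simple example of a deterministic schedule that meets both step size
conditions (our choice for illustration, not one prescribed by the paper),
take the step size to be 1/(t+1). The first sum is the harmonic series, which
diverges, and the second is the Basel series, which converges:

```latex
\sum_{t=0}^{\infty} \frac{1}{t+1} = \infty,
\qquad
\sum_{t=0}^{\infty} \frac{1}{(t+1)^2} = \frac{\pi^2}{6} < \infty .
```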
Our second assumption requires that $T : \mathbb{R}^n \to \mathbb{R}^n$ be a contraction with respect
to the Euclidean norm, at least when it operates on value functions that can be
represented in the form $Mw$, for some $w$. This assumption is not always satisfied,
but it appears to hold in some situations of interest, one of which is to be discussed
in Section 4.
Assumption 2 There exists some $\beta' \in [0, 1)$ such that
$$\|T(Mw) - T(Mw')\|_2 \le \beta' \|Mw - Mw'\|_2, \quad \forall w, w' \in \mathbb{R}^m.$$
The following theorem characterizes the stability and error bounds associated with 
the algorithm when the Markov decision problem satisfies the necessary criteria. 
Theorem 1 Let Assumptions 1 and 2 hold, and assume that $M$ has full column
rank. Let $\Pi = M(M^T M)^{-1} M^T$ denote the projection matrix onto the subspace
$X = \{Mw \mid w \in \mathbb{R}^m\}$. Then,
(a) With probability 1, the sequence $w(t)$ converges to $w^*$, the unique vector that
solves:
$$Mw^* = \Pi T(Mw^*).$$
(b) Let $V^*$ be the optimal cost-to-go vector. The following error bound holds:
$$\|Mw^* - V^*\|_2 \le \frac{(1+\beta)\sqrt{n}}{1-\beta'} \|\Pi V^* - V^*\|_\infty.$$
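To make the theorem concrete, the sketch below checks the error bound
numerically in a deliberately simple, hypothetical case: M is a single column
of ones, so Mw is a constant vector c·1. For such vectors
T(c·1) - T(c'·1) = beta (c - c')·1, because transition probabilities sum to
one, so Assumption 2 holds with beta' = beta. The two-state problem, the
iteration counts, and all names are assumptions made for this example.

```python
import math


def bellman_operator(V, costs, probs, beta):
    """T_i(V) = min_u (E[c_iu] + beta * sum_j p_ij(u) V_j) for every state i."""
    return [
        min(
            costs[i][u] + beta * sum(p * v for p, v in zip(probs[i][u], V))
            for u in range(len(costs[i]))
        )
        for i in range(len(V))
    ]


def solve_projected_fixed_point(costs, probs, beta, iters=2000):
    """Iterate c <- mean(T(c * 1)), i.e. Mw <- Pi T(Mw) when M is a column of ones."""
    n = len(costs)
    c = 0.0
    for _ in range(iters):
        c = sum(bellman_operator([c] * n, costs, probs, beta)) / n
    return c


# Hypothetical two-state, two-action problem (illustrative numbers only).
costs = [[1.0, 2.0], [0.0, 3.0]]
probs = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]
beta = 0.9
n = len(costs)

V_star = [0.0] * n
for _ in range(2000):                       # plain value iteration for V*
    V_star = bellman_operator(V_star, costs, probs, beta)

c_star = solve_projected_fixed_point(costs, probs, beta)
proj_V_star = sum(V_star) / n               # every entry of Pi V* equals the mean of V*
lhs = math.sqrt(sum((c_star - v) ** 2 for v in V_star))     # ||Mw* - V*||_2
approx_err = max(abs(proj_V_star - v) for v in V_star)      # ||Pi V* - V*||_inf
rhs = (1 + beta) * math.sqrt(n) / (1 - beta) * approx_err   # bound with beta' = beta
```

In this example the fixed point Mw* differs from the projection of V*, and
`lhs` and `rhs` give the two sides of the bound in part (b).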
3.4 OVERVIEW OF PROOF 
Due to space limitations, we only provide an overview of the proof of Theorem 1. 
Let s: R " 4 R " be defined by 
where the expectation is taken over i uniformly distributed among {1,...,n}. 
Hence, 
E[w(t + 1)]w(t),(t)] = w(t) - (t)s(w(t)), 
where the expectation is taken over i(t). We can rewrite s as 
1( 
s(w) = - M'Mw- M'T(Mw , 
n 
and it can be thought of as a vector field over R ". If the sequence {w(t)} converges 
to some w, then s(w) must be zero, and we have 
MTMw = M'T(Mw) 
Mw = IIT(Mw). 
Note that 
IInT(Mw) - nT(Mw')ll2 _</YlIMw- Mw'112, Vw, w'  m, 
due to Assumption 2 and the fact that projection is a nonexpansion of the Euclidean 
norm. It follows that 1-IT(.) has a unique fixed point w*  R m, and this point 
uniquely satisfies 
Mw* - YIT(Mw*). 
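We can further establish the desired error bound:
$$\begin{aligned}
\|Mw^* - V^*\|_2 &\le \|Mw^* - \Pi T(\Pi V^*)\|_2 + \|\Pi T(\Pi V^*) - \Pi V^*\|_2 + \|\Pi V^* - V^*\|_2 \\
&\le \beta' \|Mw^* - V^*\|_2 + \|T(\Pi V^*) - V^*\|_2 + \|\Pi V^* - V^*\|_2 \\
&\le \beta' \|Mw^* - V^*\|_2 + (1+\beta)\sqrt{n}\, \|\Pi V^* - V^*\|_\infty,
\end{aligned}$$
and it follows that
$$\|Mw^* - V^*\|_2 \le \frac{(1+\beta)\sqrt{n}}{1-\beta'} \|\Pi V^* - V^*\|_\infty.$$
Consider the potential function $U(w) = \frac{1}{2}\|w - w^*\|_2^2$. We will establish that
$(\nabla U(w))^T s(w) \ge \gamma U(w)$, for some $\gamma > 0$, and we are therefore dealing with a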
"pseudogradient algorithm" whose convergence follows from standard results on 
stochastic approximation [Polyak and Tsypkin, 1972]. This is done as follows: 
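$$\begin{aligned}
(\nabla U(w))^T s(w) &= \frac{1}{n} (w - w^*)^T M^T \big( Mw - T(Mw) \big) \\
&= \frac{1}{n} (Mw - Mw^*)^T \big( Mw - \Pi T(Mw) \big),
\end{aligned}$$
where the last equality follows because $M^T \Pi = M^T$. Using the contraction
assumption on $T$ and the nonexpansion property of projection mappings, we have
$$\|\Pi T(Mw) - Mw^*\|_2 = \|\Pi T(Mw) - \Pi T(Mw^*)\|_2 \le \beta' \|Mw - Mw^*\|_2,$$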
and applying the Cauchy-Schwarz inequality, we obtain
$$\begin{aligned}
(\nabla U(w))^T s(w) &\ge \frac{1}{n} \big( \|Mw - Mw^*\|_2^2 - \|Mw - Mw^*\|_2 \|Mw^* - \Pi T(Mw)\|_2 \big) \\
&\ge \frac{1}{n} (1 - \beta') \|Mw - Mw^*\|_2^2.
\end{aligned}$$
Since $M$ has full column rank, it follows that $(\nabla U(w))^T s(w) \ge \gamma U(w)$, for some
fixed $\gamma > 0$, and the proof is complete.
4 EXAMPLE: LOCAL TRANSITIONS ON GRIDS 
Theorem 1 leads us to the next question: are there some interesting cases for which
Assumption 2 is satisfied? We describe a particular example here that relies on 
properties of Markov decision problems that naturally arise in some practical situ- 
ations. 
When we encounter real Markov decision problems we often interpret the states 
in some meaningful way, associating more information with a state than an index 
value. For example, in the context of a queuing network, where each state is one 
possible queue configuration, we might think of the state as a vector in which each 
component records the current length of a particular queue in the network. Hence, 
if there are $d$ queues and each queue can hold up to $k-1$ customers, our state space is
a finite grid $Z_k^d$ (i.e., the set of vectors with integer components each in the range
$\{0,\ldots,k-1\}$).
Consider a state space where each state $i \in \{1,\ldots,n\}$ is associated with a point
$x^i \in Z_k^d$ ($n = k^d$), as in the queuing example. We might expect that individual
transitions between states in such a state space are local. That is, if we are at
a state $x^i$ the next visited state $x^j$ is probably close to $x^i$ in terms of Euclidean
distance. For instance, we would not expect the configuration of a queuing network 
to change drastically in a second. This is because one customer is served at a time 
so a queue that is full cannot suddenly become empty.
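The grid state space can be enumerated with a short Python sketch. The function
name `grid_states` and the lexicographic numbering of states are assumptions;
the paper only requires that each state index be associated with a unique grid
point.

```python
from itertools import product


def grid_states(k, d):
    """Enumerate the grid {0, ..., k-1}^d; list position i-1 holds the point x^i."""
    return list(product(range(k), repeat=d))


# Hypothetical example: d = 2 queues, each with lengths in {0, 1, 2}.
states = grid_states(k=3, d=2)
index_of = {x: i + 1 for i, x in enumerate(states)}   # state numbers 1, ..., n
```

The list has k**d entries, so explicit enumeration of this kind is only
feasible for small d.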
Note that the number of states in a state space of the form $Z_k^d$ grows exponentially
with $d$. Consequently, classical dynamic programming algorithms such as value
iteration quickly become impractical. To efficiently generate an approximation to
the cost-to-go vector, we might consider tuning the parameters $w \in \mathbb{R}^d$ and $a \in \mathbb{R}$
of an affine approximation $\tilde{V}_i(w, a) = w^T x^i + a$ using the algorithm presented in
the previous section. It is possible to show that, under the following assumption 
concerning the state space topology and locality of transitions, Assumption 2 holds 
with  = 2 + -, and thus Theorem 1 characterizes convergence properties of 
the algorithm. 
Assumption 3 The Markov decision problem has state space $S = \{1,\ldots,k^d\}$, and
each state $i$ is uniquely associated with a vector $x^i \in Z_k^d$ with $k \ge 6(1-\beta^2)^{-1} + 3$.
Any pair $x^i, x^j \in Z_k^d$ of consecutively visited states either are identical or have
exactly one unequal component, which differs by one.
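The locality condition on consecutively visited states, and the affine
architecture it is paired with, can be sketched as follows. The function names
are hypothetical, and the sketch covers only the transition condition and the
feature rows; it does not check the condition on k.

```python
def is_local_transition(x, y):
    """True if grid points x and y are identical or differ by one in exactly one component."""
    diffs = [abs(a - b) for a, b in zip(x, y) if a != b]
    return len(diffs) == 0 or diffs == [1]


def affine_feature_row(x):
    """Row M_i = (x^i, 1), so that M_i applied to (w, a) gives w^T x^i + a."""
    return [float(c) for c in x] + [1.0]


def affine_value(x, w, a):
    """Affine approximation of the cost-to-go at grid point x."""
    return sum(wc * xc for wc, xc in zip(w, x)) + a
```

Stacking `affine_feature_row(x)` over all grid points gives a matrix M with
d + 1 columns, so the parameter vector has d + 1 entries regardless of the
number of states.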
While this assumption may seem restrictive, it is only one example. There are many 
more candidate examples, involving other approximation architectures and partic- 
ular classes of Markov decision problems, which are currently under investigation. 
5 CONCLUSIONS 
We have proven a new theorem that establishes convergence properties of an al- 
gorithm for generating linear approximations to cost-to-go functions for dynamic 
programming. This theorem applies whenever the dynamic programming operator 
for a Markov decision problem is a contraction with respect to the Euclidean norm 
when applied to vectors in the parameterized class. In this paper, we have described 
one example in which such a condition holds. More examples of practical interest 
will be discussed in a forthcoming full length version of this paper. 
Acknowledgments 
This research was supported by the NSF under grant ECS 9216531, by EPRI under 
contract 8030-10, and by the ARO. 
References 
Bertsekas, D. P. (1995) Dynamic Programming and Optimal Control. Athena Sci- 
entific, Belmont, MA. 
Boyan, J. A. & Moore, A. W. (1995) Generalization in Reinforcement Learning:
Safely Approximating the Value Function. In J. D. Cowan, G. Tesauro, and D. 
Touretzky, editors, Advances in Neural Information Processing Systems 7. Morgan 
Kaufmann. 
Gordon, G. J. (1995) Stable Function Approximation in Dynamic Programming. 
Technical Report: CMU-CS-95-103, Carnegie Mellon University. 
Polyak, B. T. & Tsypkin, Y. Z. (1972) Pseudogradient Adaptation and Training
Algorithms. Avtomatika i Telemekhanika, 3:45-68. 
Sutton, R. S. (1988) Learning to Predict by the Method of Temporal Differences. 
Machine Learning, 3:9-44. 
Tesauro, G. (1992) Practical Issues in Temporal Difference Learning. Machine 
Learning, 8:257-277. 
Tsitsiklis, J. & Van Roy, B. (1994) Feature-Based Methods for Large Scale Dynamic
Programming. Technical Report: LIDS-P-2277, Laboratory for Information and 
Decision Systems, Massachusetts Institute of Technology. Also to appear in Machine 
Learning. 
