Reinforcement Learning for Continuous 
Stochastic Control Problems 
Rmi Munos 
CEMAGREF, LISC, Parc de Tourvoie, 
BP 121, 92185 Antony Cedex, FRANCE. 
Remi. Munos@cemagref. fr 
Paul Bourgine 
Ecole Polytechnique, CREA, 
91128 Palaiseau Cedex, FRANCE. 
Bourgine@poly. polytechnique.fr 
Abstract 
This paper is concerned with the problem of Reinforcement Learn- 
ing (RL) for continuous state space and time stochastic control 
problems. We state the Hamilton-Jacobi-Bellman equation satis- 
fied by the value function and use a Finite-Difference method for 
designing a convergent approximation scheme. Then we propose a 
RL algorithm based on this scheme and prove its convergence to 
the optimal solution. 
Introduction to RL in the continuous, stochastic case 
The objective of RL is to find -thanks to a reinforcement signal- an optimal strategy 
for solving a dynamical control problem. Here we sudy the continuous time, con- 
tinuous state-space stochastic case, which covers a wide variety of control problems 
including target, viability, optimization problems (see [FS93], [KP95])_for which a 
formalism is the following. The evolution of the current state x(t) E 0 (the state- 
space, with O open subset of IRd), depends on the control u(t)  U (compact subset) 
by a stochastic differential equation, called the state dynamics: 
dx = f(x(t),u(t))dt + c(x(t),u(t))dw (1) 
where f is the local drift and c.dw (with w a brownian motion of dimension r and 
a a d x r-matrix) the stochastic part (which appears for several reasons such as lake 
of precision, noisy influence, random fluctuations) of the diffusion process. 
For initial state x and control u(t), (1) leads to an infinity of possible trajectories 
x(t). For some trajectory x(t) (see figure 1)_, let - be its exit time from O (with 
the convention that if x(t) always stays in O, then - = o). Then, we define the 
functional J of initial state x and control u(.) as the expectation for all trajectories 
of the discounted cumulative reinforcement: 
J(x; u(.) ) - Ex,u(.) { /o "/t r(x(t), u(t) )dt + -/ R(x(-) ) ) 
1030 R. Munos and P. Bourgine 
where r(x, u) is the running reinforcement and R(x) the boundary reinforcement. 
? is the discount factor (0 _< "/< 1). In the following, we assume that f, cr are of 
class C 2, r and R are Lipschitzian (with constants Lr and LR) and the boundary 
c90 is C 2. 
Figure 1: The state space, the discretized ES(the square dots) and its frontier 
(the round ones). A trajectory x(t) goes through the neighbourhood of state 
RL uses the method of Dynamic Programqing (DP) which generates an optimal 
(feed-back) control u*(x) by estimating the value function (VF), defined as the 
maximal value of the functional J as a function of initial state x: 
V(x) = sup J(x; u(.)). (2) 
In the IlL approach, the state dynamics is unknown from the system ; the only 
available information for learning the optimal control is the reinforcement obtained 
at the current state. Here we propose a model-based algorithm, i.e. that learns 
on-line a model of the dynamics and approximates the value function by successive 
iterations. 
Section 2 states the Hamilton-Jacobi-Bellman equation and use a Finite-Difference 
(FD) method derived from Kushner [Kus90] for generating a convergent approxi- 
mation scheme. In section 3, we propose a RL algorithm based on this scheme and 
prove its convergence to the VF in appendix A. 
2 A Finite Difference scheme 
Here, we state a second-order nonlinear differential equation (obtained from the DP 
principle, see [FS93]) satisfied by the value function, called the Hamilton-Jacobi- 
Bellman equation. 
Let the d x d matrix a = or.or  (with  the transpose of the matrix). We consider 
the uniformly parabolic case, i.e. we assume that there exists c > 0 such that 
Vx E , Vu E U, Vy G lRd,y4,j=l aij(x, u)yiyj _> cllyll 2. Then V is C 2 (see [Kry80]). 
Let Vx be the gradient of V and Vx,j its second-order partial derivatives. 
Theorem 1 (Hamilton-Jacobi-Bellman) The following HJB equation holds 
V(x) In'), + sup [r(x,u) + V(x).f(x,u) +  -.i",j=i aijVj(x)] =0 .for x  0 
uU 
Besides, V satisfies the .following boundary condition  V(x) = R(x) for x  00. 
Reinforcement Learning for Continuous Stochastic Control Problems 1031 
Remark 1 The challen9e of learnin9 the VF is motivated by the fact that from V, 
we can deduce the followin9 optimal feed-back control policy' 
u*(x) E arg sup [r(x,u) + Vx(x).f(x,u) +  -4,j= aijVxj(x)] 
uU 
In the following, we assume that O is bounded. Let el, ..., ed be a basis for ]R d. 
Let the positive and negative parts of a function b be  b + = max(b, 0) and 
b- - max(-b, 0). For any discretization step 6, let us consider the lattices  6Z a - 
6. Y-i=l jiei where jl,..., ja are any integers, and E 5 - 6Z a f O. Let c0E 5 the 
frontier of E 5 denote the set of points { E 6Z a  O such that at least one adjacent 
point  4- 6ei G E 5 } (see figure 1). 
Let U 5 c U be a finite control set that approximates U in the sense: 
U 5' C U 5 and UsU 5 -- U. Besides, we assume that- Vi = 1..d, 
aii(x , it) -- Eji laij( x, /')l -- 0. 
By replacing the gradient V() by the forward and backward first-order finite- 
difference quotients: AV():  [V( 4- 5ei) - V()] and V,j () by the second- 
order fmite-difference quotients' 
-  5ei) eel) 2v(e)] 
,%,,v() - r [v( + + v(,. - - 
Ax:kix jV() -- 1 [W( q- ei 4- ej) q- V( - e i : ej) 
-- 252 
-( + ,) - u( - ,) - u( + ,) - u( - ,) + 2u()] 
in the HJB equation, we obtain the following  for   E , 
V 6 ()   + SUpuev {r(, u) + il [f (, )-i V6 () - f (, )'i V6 () 
"(e') A,,V()+ Ei ((e')5 + .V( 5(e') )]} 
Knowing that (At In 3') is an approximation of (.at _ 1) as At tends to 0, we deduce' 
V5() = supuev [?('u)ECep.P(,u,()VS(() +r(,u)r(,u)] (4) 
2 
with r(,u) = E:2[,(,)+,,(,)_ E,,,(,) ] (5) 
wch appears  a DP equation for some te Markovian Decision Procs (see 
[Ber87]) whose state space is E* d probabti of tramition  
p(, 71, 4- ei) 
p(, u,  +Sei 4- 
p(, u,  -- 5ei 4- 5ej) 
p(,,) 
- -( u) for i # j, 
-- 252 uijkq 
= a}(,u) for ij, 
= 0 otherwise. 
(6) 
Thanks to a contraction property due to the discount factor 3', there exists a unique 
solution (the fixed-point) Vto equation (4) for   E 5 with the boundary condition 
Vs() = R() for   0E 5. The following theorem (see [Kus90] or [FS93]) insures 
that V 5 is a convergent approximation scheme. 
1032 R. Munos and P. Bourgine 
Theorem 2 (Convergence of the FD scheme) V  converges to V as 6  0  
lim,,, V  () = V(x) uniformly on  
--x 
Remark 2 Condition (3) insures that the p(, u, ) are positive. If this condition 
does not hold, several possibilities to overcome this are described in [Kus90]. 
3 The reinforcement learning algorithm 
Here we assume that f is bounded from below. As the state dynamics (f and a) 
is unknown from the system, we approximate it by building a model f and  from 
samples of trajectories x(t) ' we consider series of successive states x = x(t) 
and y = x(t + r) such that' 
- Vte [t, t + ], x(t) e N() neighbourhood of  whose diameter is inferior to 
kN.6 for some positive constant kN, 
- the control u is constant for t E [t, t + r], 
- r satisfies for some positive kl and k2, 
k16 _< rk _< k26. 
(7) 
Then incrementally update the model  
and compute the approximated time (x, u) and the approximated probabilities of 
transition (, u, ) by replacing f and a by f and  in (5) and (6). 
We obtain the following updating rule of the VS-value of state ' 
V+(): supev ["/r('') Y] (, u, )V() + (x, u)F(, u)] 
(9) 
which can be used as an off-line (synchronous, Gauss-Seidel, asynchronous) or on- 
time (for example by updating V, (f) as soon as a trajectory exits from the neigh- 
bourood of f) DP algorithm (see [BBS95]). 
Besides, when a trajectory hits the boundary 00 at some exit point x(r) then 
update the closest state  E 0E  with- 
(e) = 
(10) 
Theorem 3 (Convergence of the algorithm) Suppose that the model as well 
as the V -value o.f every state   E and control u  U are regularly updated 
(respectively with (3) and (9)) and that every state  e OE 5 are updated with (10) 
at least once. Then Ve  O, 3A such that V5 _< A, 3N, Vn >_ N, 
supez IV2()- V()I _<  with probability I 
Reinforcement Learning for Continuous Stochastic Control Problems 1033 
4 Conclusion 
This paper presents a model-based RL algorithm for continuous stochastic control 
problems. A model of the dynamics is approximated by the mean and the covariance 
of successive states. Then, a RL updating rule based on a convergent FD scheme is 
deduced and in the hypothesis of an adequate exploration, the convergence to the 
optimal solution is proved as the discretization step 5 tends to 0 and the number 
of iteration tends to infinity. This result is to be compared to the model-free RL 
algorithm for the deterministic case in [Mun97]. An interesting possible future 
work should be to consider model-free algorithms in the stochastic case for which a 
Q-learning rule (see [Wat89]) could be relevant. 
A Appendix: proof of the convergence 
Let Mf, Ma, Mf:, and M.r be the upper bounds of f, a, fx and ax and my the lower 
bound of f. Let E  = super, IV()- V()] and E = super, IV()- V()]. 
A.1 Estimation error of the model f. and  and the probabilities 
Suppose that the trajectory xe(t) occured for some occurence we(t) of the brownJan 
motion' xe(t) = xe 4- ftt f(xe(t), u)dt 4- ftt er(xe(t), u)dwe. Then we consider a 
trajectory ze(t) starting from  at te and following the same brownian motion' 
Let ze = ze(te +re). Then (ye- xe)-(ze -) = ft [f(xe(t),u) - f(ze(t),u)]dt+ 
ftt.+- 
 [er(xe(t), u) - c(ze(t), u)] dwe. Thus, from the C  property of f and 
11(Ye - xe) - (ze - e)ll -< (M/. + M.).kN.re.& (11) 
The diffusion processes has the following property .see for example the It6-Taylor 
majoration in [KP95])  E [ze] - 4-re.f(, u)4-0(re) which, from (7), is equivalent 
[z-z- 1 f(, u) 4- 0(5) Thus from the law of large n-tubers and (11)' 
to' Ex r. = ' 
}1 I} II 1 
limsup (,u) - f(,u) = limsup K Ek-1 - 
n- n- L r 
= (Mi. + M).kv.5 + 0(5) = 0(5) w.p. 1 (12) 
Besides, diffusion processes have the following property (again see [KP95])' 
- .f(, ) .re + 
 [(ze )(ze - )'] = (, ,)re +/(, ,) '  o() wh, om (), 
is ffilent to' E [(.--ri(,))(z--ri(,))'] = a(,u)+ O(52). Let r: 
z - - rkf (, u) and  = y - xk - rf(, u) wch satis (kom (11) and (12))' 
IIru - l[ = (MI + Mo,).ru.kN.6 + r&.O(6) (13) 
n 
From the definition of -(, u) we have' -(, u) - a(, u) =  Ek=l "' 
, z T k 
E [--J"l 4- 0(5 2) and from the law of lge nbers, (12) and (13), we have- 
L r J 
msup {l( ) - (, ){I = limsup E= " ' + 0(62) 
1034 R Munos and P. Bourgine 
with probability 1. Thus there exists k/ and k, s.t. 3A1,V5 _< Ai,3Ni,n >_ N1, 
I['(,u)- f(,u)[ I _<kf.6w.p. 1 (14) 
[[(,u) - a(,u)[[ _< ka.5 2 w.p. 1 
Besides, from (5) and (14), we have: 
,.(.."+a. ) 62 
I-(, u) - ,(, u)l _< (a.,w.5)"- <- k'5 (15) 
d from a property of eonential ction, 
(')-'(') = k. I  (16) 
;.5. 
We c deduce from (14) that' 
msup ]p(,u, 0 -(,u,)]  5:(z+a.)5 5 kp6 w.p. I (17) 
with kp = 4(d.M)(2.ki + d.k) for 6  A = n z+a., ZS.M  
A.2 Estimation of] 
V;+l(e)- 
After having updated V2() with rule (9), let A denote the difference 
[v2+1( c) - v*(c)l. From (4), (9) and (8), 
+7 ('). E(, , 4) [v() - v()] + E(, , 4).(, )k(, ) - (, )l 
+ Z (, u, )[(, ) - (, )] (, ) fo n  e v  
As V is differentiable we have  V() = V_() + V. (- ) + o(1K -11)- Let 
us define a hnear function  such that- V(x) = V() + Vx. (x- ). Then 
we have: [p(sC, u,(') -(sC, u,(')] V6((') = (,u,)-(,u,()]. [V6(()- V(()] + 
(,u,O-(,u,O]V(O, thus: C(,u,)-(,u,O]VS() = kp.Ea.5 + 
Bid, om the convergence of the scheme (thmrem 2), we have Ea.6 = 
0(5). From the linearity of , 
erty of r, 
= 0(5) and from (15), (16) and the Lipschitz prop- 
_ k-52) hi i 
2 , 
(18) 
Reinforcement Learning for Continuous Stochastic Control Problems 1035 
A.a A sufficient condition for super. Iv()- V*() _< e2 
Let us suppose that for all  6 E , the following conditions hold for some c > 0 
> _< -c (19) 
< 
__ W+l(% ) - Vs()  s2 (20) 
om the hothesis that a states  E E * are rearly updated, there exists an 
integer m such that at stage n + m 1 the  E E * have bn updated at let 
once since stage n. Besides, since   E OG  are updated at let once with 
re (10), V  0 *, ]V2()- V*()I = I(z(r))- ()[  2.L.6  e2 for any 
_ a Thin, from (19) and (20) we have- 
6<3= 2.' 
> 
  2  E+m  2 
Thus there ests N such that  Vn 
A.4 Convergence of the algorithm 
Let us prove theorem 3. For any e > 0, let us consider ex > 0 and e2 > 0 such that 
ex +2 =e. Assume E > 2, then from (18), A: E-k.6.s2+o(6) < E-k.5. 
for 6 < A3. Thus (19) holds for c - k.6. e-a Suppose now that E, 5 < 2 From (18), 
A < (1 - k.6)2 + o(6) < e2 for 6 < Aa and condition (20) is true. 
Thus for 6 _< min{Ax, A2, Aa}, the sufficient conditions (19) and (20) are satisfied. 
So there exists N, for all n >_ N, E _< 2. Besides, from the convergence of the 
scheme (theorem 2), there exists A0 st. V6 < A0, supee. [V*() - V()] _< 
Thus for 6 < rain{A0, Ax, A2, Aa}, 3N, Vn ) N, 
sup IV( c) - V(C)l _< sup ]V( c) - V6(c)l + sup 
References 
[BBS95] 
[Bet87] 
[rs931 
[KP95] 
[Kry80] 
[Kus90] 
[Mun97] 
[War89] 
Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to 
act using real-time dynamic programming. Artificial Intelligence, (72):81- 
138, 1995. 
Dimitri P. Bertsekas. Dynamic Programming: Deterministic and Sto- 
chastic Models. Prentice Hall, 1987. 
Wendell H. Fleming and H. Mete Soner. Controlled Markov Processes and 
Viscosity Solutions. Applications of Mathematics. Springer-Verlag, 1993. 
Peter E. Kloeden and Eckhard Platen. Numerical Solutions of Stochastic 
Differential Equations. Springer-Verlag, 1995. 
N.V. Krylov. Controlled Diffusion Processes. Springer-Verlag, New York, 
1980. 
Harold J. Kushner. N-merical methods for stochastic control problems in 
continuous time. SIAM J. Control and Optimization, 28:999-1048, 1990. 
Rmi Munos. A convergent reinforcement learning algorithm in the con- 
tinuous case based on a finite difference method. International Joint Con- 
ference on Artificial Intelligence, 1997. 
Christopher J.C.H. Watkins. Learning from delayed reward. PhD thesis, 
Cambridge University, 1989. 
