Adaptive choice of grid and time in 
reinforcement learning 
Stephan Pareigis 
stp@numerik.uni-kiel.de 
Lehrstuhl Praktische Mathematik 
Christian-Albrechts-Universität Kiel
Kiel, Germany 
Abstract 
We propose local error estimates together with algorithms for adap- 
tive a-posteriori grid and time refinement in reinforcement learn- 
ing. We consider a deterministic system with continuous state and 
time with infinite horizon discounted cost functional. For grid re- 
finement we follow the procedure of numerical methods for the 
Bellman-equation. For time refinement we propose a new criterion, 
based on consistency estimates of discrete solutions of the Bellman-equation. We demonstrate that an optimal ratio of time to space
discretization is crucial for optimal learning rates and accuracy of
the approximate optimal value function.
1 Introduction
Reinforcement learning can be performed for fully continuous problems by discretiz- 
ing state space and time, and then performing a discrete algorithm like Q-learning 
or RTDP (e.g. [5]). Consistency problems arise if the discretization needs to be 
refined, e.g. for more accuracy, application of multi-grid iteration or better starting 
values for the iteration of the approximate optimal value function. In [7] it was
shown that for diffusion dominated problems, a state to time discretization ratio
$k/h$ of $Ch^{\gamma}$, $\gamma > 0$, has to hold to achieve consistency (i.e. $k = o(h)$). It can be
shown that for deterministic problems, this ratio must only be $k/h = C$, $C$ a constant,
to get consistent approximations of the optimal value function. The choice
of the constant C is crucial for fast learning rates, optimal use of computer memory 
resources and accuracy of the approximation. 
We suggest a procedure involving local a-posteriori error estimation for grid refine- 
ment, similar to the one used in numerical schemes for the Bellman-equation (see 
[4]). For the adaptive time discretization we use a combination of step size control
for ordinary differential equations and calculations for the rates of convergence
of fully discrete solutions of the Bellman-equation (see [3]). We explain how both
methods can be combined and applied to Q-learning. A simple numerical example
shows the effects of a suboptimal state space to time discretization ratio, and provides
insight into the problems of coupling both schemes.
2 Error estimation for adaptive choice of grid 
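We want to approximate the optimal value function $V: \Omega \to \mathbb{R}$ in a state space
$\Omega \subset \mathbb{R}^d$ of the following problem: Minimize
$$J(x, u(\cdot)) := \int_0^\infty e^{-\rho\tau}\, g(y_{x,u(\cdot)}(\tau), u(\tau))\, d\tau, \qquad u(\cdot): \mathbb{R}_+ \to A \text{ measurable}, \tag{1}$$
where $g: \Omega \times A \to \mathbb{R}_+$ is the cost function, and $y_{x,u(\cdot)}(\cdot)$ is the solution of the
differential equation
$$\dot{y}(t) = f(y(t), u(t)), \qquad y(0) = x. \tag{2}$$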
As a trial space for the approximation of the optimal value function (or Q-function)
we use locally linear elements on simplices $S_i$, $i = 1, \ldots, N_s$, which form a triangulation
of the state space, $N_s$ the number of simplices. The vertices shall be called
$x_i$, $i = 1, \ldots, N$, $N$ the dimension of the trial space¹. This approach has been used
in numerical schemes for the Bellman-equation ([2], [4]). We will first assume that
the grid is fixed and has a discretization parameter
$$k = \max_i \operatorname{diam}\{S_i\}.$$
Other than in the numerical case, where the updates are performed in the vertices of
the triangulation, in reinforcement learning only observed information is available.
We will assume that in one time step of size $h > 0$, we obtain the following
information:
• the current state $y_n \in \Omega$,
• an action $a_n \in A$,
• the subsequent state $y_{n+1} := y_{y_n,a_n}(h)$,
• the local cost $r_n = r^h(y_n, a_n) = \int_0^h e^{-\rho\tau}\, g(y_{y_n,a_n}(\tau), a_n(\tau))\, d\tau$.
The state $y_n$, in which an update is to be made, may be any state in $\Omega$. $A$ shall be
finite, and $a_n$ locally constant.
The new value of the fully discrete Q-function $Q_h^k(y_n, a_n)$ should be set to
$$Q_h^k(y_n, a_n) \text{ shall be } r_n + e^{-\rho h}\, V_h^k(y_{n+1}),$$
where $V_h^k(y_{n+1}) = \min_a Q_h^k(y_{n+1}, a)$. We call the right side the update function
$$P^h(z, a, V_h^k) := r^h(z, a) + e^{-\rho h}\, V_h^k(y_{z,a}(h)), \qquad z \in \Omega. \tag{3}$$
We will update $Q_h^k$ in the vertices $\{x_i\}_{i=1}^{N}$ of the triangulation in one of the following
two ways: 
¹ When an adaptive grid is used, $N_s$ and $N$ depend on the refinement.
Kaczmarz-update. Let $\lambda^T = (\lambda_1, \ldots, \lambda_N)$ be the vector of barycentric coordinates,
such that
$$y_n = \sum_{i=1}^{N} \lambda_i x_i, \qquad 0 \le \lambda_i \le 1, \text{ for all } i = 1, \ldots, N.$$
Then update 
N ] 
+ . 
i--1 
(4) 
Kronecker-update. Let $S \ni y_n$ and $x$ be the vertex of $S$ closest to $y_n$ (if there
is a draw, then the update can be performed in all winners). Then update $Q_h^k$ only
in $x$ according to
$$Q_h^k(x, a_n) := r_n + e^{-\rho h}\, V_h^k(y_{n+1}). \tag{5}$$
Each method has some assets and drawbacks. In our computer simulations the
Kaczmarz-update seemed to be more stable than the Kronecker-update (see [6]).
However, examples may be constructed where a (Hölder-)continuous bounded optimal
value function $V$ is to be approximated, and the Kaczmarz-update produces
an approximation with arbitrarily high $\|\cdot\|_{\sup}$-norm (place a vertex $x$ of the triangulation
in a point where $\frac{d}{dx}V$ is infinity, and use as update states the vertex $x$ in
turn with an arbitrarily close state).
Kronecker-update will provide a bounded approximation if $V$ is bounded. Let $\mathcal{V}_h^k$
be the fully-discrete optimal value function
$$\mathcal{V}_h^k(x_i) = \min_a \{ r^h(x_i, a) + e^{-\rho h}\, \mathcal{V}_h^k(y_{x_i,a}(h)) \}, \qquad i = 1, \ldots, N.$$
Then it can be shown that an approximation performed by Kronecker-update will
eventually be caught in an $\varepsilon$-neighborhood of $\mathcal{V}_h^k$ (with respect to the $\|\cdot\|_{\sup}$-norm)
if the data points $y_0, y_1, y_2, \ldots$ are sufficiently dense. Under regularity conditions
on $V$, $\varepsilon$ may be bounded by²
k 
e <_ + (6) 
As a criterion for grid refinement we choose a form of a local a posteriori error estimate
as defined in [4]. Let $V_h^k(x) = \min_a Q_h^k(x, a)$ be the current iterate of the optimal
value function. Let $a_x \in U$ be the minimizing control $a_x = \arg\min_a Q_h^k(x, a)$.
Then we define
$$e(x) := |V_h^k(x) - P^h(x, a_x, V_h^k)|. \tag{7}$$
If $V_h^k$ is in the $\varepsilon$-neighborhood of the fully discrete optimal value function $\mathcal{V}_h^k$, then it can be shown that (for every $x \in \Omega$
and simplex $S_x$ with $x \in S_x$, $a_x$ as above)
$$0 \le e(x) \le \sup_{z \in S_x} P^h(z, a_x, V_h^k) - \inf_{z \in S_x} P^h(z, a_x, V_h^k).$$
If zff is Lipschitz-continuous, then an estimate using only Gronwall's inequality 
bounds the right side and therefore e(x) by Cp-, where C depends on the Lipschitz- 
constants of zn and the cost g. 
² With respect to the results in [3] we assume that also $\varepsilon \le C(h + k/\sqrt{h})$ can be shown.
The value ej :- maxes iea (x) defines a function, which is locally constant on every 
simplex. We use e j, j = 1,..., N as an indicator function for grid refinement. The 
(global) tolerance value tolk for ej shall be set to 
Ns 
i--1 
where we have chosen 1 < C < 2. We approximate the function e on the simplizes 
in the following way, starting in some Yn  Sj' 
1. apply a control a  U constantly on [T, T + hi 
2. receive value rn and subsequent state yn+ 
3. calculate the update value Pa (x, a, V) 
4. if (IPa(x,a, V) - V(x)l >_ ej) then ej :- 
It is advisable to make grid refinements in one sweep. We also store (in contrast to
the described algorithm) several past values of $e_j$ in every simplex, to be able to
distinguish between large $e_j$ due to few visits in that simplex and large $e_j$ due to
space discretization error. For grid refinement we use a method described in [1].
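The bookkeeping of the error indicators can be sketched as follows. The sketch keeps only the running maximum per simplex (not the history of past values mentioned above), and it assumes that the tolerance is a constant `c` times the mean indicator; both the function names and that form of the tolerance are assumptions made for illustration.

```python
def record_error(e, j, p_value, v_value):
    """Step 4: keep the largest observed defect |P - V| in simplex j."""
    defect = abs(p_value - v_value)
    if defect >= e[j]:
        e[j] = defect

def simplices_to_refine(e, c=1.5):
    """Indices of simplices whose indicator exceeds c times the mean indicator."""
    tol_k = c * sum(e) / len(e)
    return [j for j, ej in enumerate(e) if ej > tol_k]
```

Collecting all flagged simplices first and refining them together corresponds to making the grid refinements in one sweep.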
3 A local criterion for time refinement 
Why not take the smallest possible sampling rate? There are two arguments for
adaptive time discretization. First, a bigger time step $h$ naturally improves (decreases)
the contraction rate of the iteration, which is $e^{-\rho h}$. The new information
is conveyed from a point further away (in the future) for big $h$, without the need
to store intermediate states along the trajectory. It is therefore reasonable to start
with a big $h$ and refine where needed.
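The effect of the contraction rate $e^{-\rho h}$ on the number of iterations can be made concrete with a short calculation. The function name `sweeps_needed` and the chosen numbers are illustrative assumptions; the formula only uses the fact that $n$ contractions reduce an error by the factor $e^{-\rho h n}$.

```python
import math

def sweeps_needed(rho, h, reduction):
    """Smallest n such that exp(-rho * h) ** n <= reduction."""
    return math.ceil(math.log(1.0 / reduction) / (rho * h))
```

For example, with $\rho = 1$ and a requested error reduction of $10^{-3}$, a step size $h = 0.1$ needs 70 contractions, while $h = 0.025$ needs 277.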
The second argument is that the grid and time discretization $k$ and $h$ stand in a
certain relation. In [3] the estimate
$$|V(x) - \mathcal{V}_h^k(x)| \le C\left(h + \frac{k}{\sqrt{h}}\right) \text{ for all } x \in \Omega, \quad C \text{ a constant},$$
for the fully discrete optimal value function $\mathcal{V}_h^k$ is proven (or similar estimates, depending on the regularity of $V$). For obvious
reasons (storage, speed), it is desirable to start with a coarse grid, i.e. $k$ large.
Having a too small $h$ in this case will make the approximation error large. Here too,
it is reasonable to start with a big $h$ and refine where needed.
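The trade-off between $h$ and $k$ can be illustrated with a small calculation. Purely as an assumption for this sketch, the error bound is taken to be of the form $h + k/\sqrt{h}$ with the constant set to one; for other estimates of this type the numbers change but the qualitative picture (a too small $h$ on a coarse grid increases the bound) stays the same. Setting the derivative with respect to $h$ to zero gives $h^* = (k/2)^{2/3}$.

```python
import math

def bound(h, k):
    """Assumed error bound h + k / sqrt(h), constant factor omitted."""
    return h + k / math.sqrt(h)

def best_step(k):
    """Minimizer of the assumed bound over h > 0: h* = (k / 2) ** (2 / 3)."""
    return (k / 2.0) ** (2.0 / 3.0)
```

Under this assumption, halving or doubling $h$ away from `best_step(k)` increases the bound, which is one way to read the statement that the ratio of time to space discretization matters.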
What can serve as a refinement criterion for the time step $h$? In numerical schemes
for ordinary differential equations, adaptive step size control is performed by estimating
the local truncation error of the Taylor series by inserting intermediate
points. In reinforcement learning, however, suppose the system has a large truncation
error (i.e. it is difficult to control) in a certain region using large $h$ and locally
constant control functions. If the optimal value function is nearly constant in this
region, we will not have to refine $h$. The criterion must be that at an intermediate
point, e.g. at time $h/2$, the optimal value function assumes a value considerably
smaller (better) than at time $h$. However, if this better value is due to error in the
state discretization, then the time step should not be refined.
We define a function $H$ on the simplices of the triangulation. $H(S) > 0$ holds the
time-step which will be used when in simplex $S$. Starting at a state $y_n \in \Omega$, $y_n \in S_n$
at time $T > 0$, with the current iterate of the Q-function $Q_h^k$ ($V_h^k$ respectively) the
following is performed:
1. apply a control $a \in U$ constantly on $[T, T + h]$
2. take a sample at the intermediate state $z = y_{y_n,a}(h/2)$
3. if $(H(S_n) < C\sqrt{\operatorname{diam}\{S_n\}})$ then end.
else
4. compute $V_h^k(z) = \min_b Q_h^k(z, b)$
5. compute $P^{h/2}(y_n, a, V_h^k) = r^{h/2}(y_n, a) + e^{-\rho h/2}\, V_h^k(z)$
6. compute $P^h(y_n, a, V_h^k) = r^h(y_n, a) + e^{-\rho h}\, V_h^k(y_{n+1})$
7. if $(P^{h/2}(y_n, a, V_h^k) \le P^h(y_n, a, V_h^k) - tol)$ update $H(S_n) := H(S_n)/2$
The value  is currently set to 
2 
 -- C(y,a) - lrn/(y,a)- rn(y,a)l , 
whereby a local value of MJg2 is approximated, Mr(x) -- maXa If(x,a)l Lg an 
p ' 
approximation of IV'g(x, a)l (if g is sufficiently regular). 
$tol$ depends on the local value of $V$ and is set to
= 
How can a Q-function $Q^{k(x)}_{H(x)}(x, a)$ with state dependent time and space discretisation
be approximated and stored? We have stored the time discretisation function
$H$ locally constant on every simplex. This implies (if $H$ is not constant on $\Omega$) that
there will be vertices $x_j$ such that adjacent triangles hold different values of $H$.
The Q-function, which is stored in the vertices, then has different choices of $H(x_j)$.
We solved this problem by updating a function $Q^k_H(x_j, a)$ with Kaczmarz-update
and the update value $P^{H(y_n)}(y_n, a, V^k_H)$, $y_n$ in a simplex adjacent to $x_j$, regardless
of the different $H$-values in $x_j$. $Q^k_H(x_j, a)$ therefore has an ambiguous semantic:
it is the value if $a$ is applied for 'some time', and optimal from there on. 'Some
time' depends here on the value of $H$ in the current simplex. It can be shown that
$|Q^k_{H(x_j)/2}(x_j, a) - Q^k_{H(x_j)}(x_j, a)|$ is less than the space discretization error.
4 A simple numerical example 
We demonstrate the effects of suboptimal values for space and time discretisation
with the following problem. Let the system equation be
$$\dot{y} = f(y, u) := \begin{pmatrix} u & 1 \\ -1 & u \end{pmatrix} (y - v), \qquad v = \begin{pmatrix} .375 \\ .375 \end{pmatrix}, \qquad y \in \Omega = [0,1] \times [0,1]. \tag{8}$$
The stationary point of the uncontrolled system is $v$. The eigenvalues of the system
are $\{u + i, u - i\}$, $u \in [-c, c]$. The system is reflected at the boundary.
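The dynamics (8) can be simulated with a few lines of code. The sketch below integrates the system with the classical Runge-Kutta method for a constant control `u` and ignores the reflection at the boundary, so it is only valid while the trajectory stays inside $\Omega$; the function names and step counts are assumptions. Because the eigenvalues are $u \pm i$, the distance to $v$ changes by the factor $e^{uh}$ over a time interval $h$ while the state rotates about $v$.

```python
import math

V_STAT = (0.375, 0.375)  # stationary point v from (8)

def f(y, u):
    """Right-hand side of (8): [[u, 1], [-1, u]] applied to (y - v)."""
    d0, d1 = y[0] - V_STAT[0], y[1] - V_STAT[1]
    return (u * d0 + d1, -d0 + u * d1)

def rk4_step(y, u, dt):
    """One classical Runge-Kutta step with the control u held constant."""
    k1 = f(y, u)
    k2 = f((y[0] + 0.5 * dt * k1[0], y[1] + 0.5 * dt * k1[1]), u)
    k3 = f((y[0] + 0.5 * dt * k2[0], y[1] + 0.5 * dt * k2[1]), u)
    k4 = f((y[0] + dt * k3[0], y[1] + dt * k3[1]), u)
    return (y[0] + dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0,
            y[1] + dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0)

def flow(y, u, h, steps=50):
    """State reached after time h under constant control u (no boundary reflection)."""
    dt = h / steps
    for _ in range(steps):
        y = rk4_step(y, u, dt)
    return y
```

A negative control `u` contracts the state towards $v$ and a positive one drives it away.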
The goal of the optimal control shall be to steer the solution along a given trajectory in
state space (see figure 1), minimizing the integral over the distance from the current
state to the given trajectory. The reinforcement or cost function is therefore chosen
to be
$$g(y) = \operatorname{dist}(L, y), \tag{9}$$
where $L$ denotes the set of points in the given trajectory. The cost functional takes
the form
$$J_\rho(y, a(\cdot)) = \int_0^\infty e^{-\rho\tau}\, g(y_{y,a(\cdot)}(\tau))\, d\tau. \tag{10}$$
Figure 1: The left picture depicts the L-form of the given trajectory. The stationary
point of the system is at (.375, .375) (depicted as a big dot). The optimal value function
computed by numerical schemes on a fine fixed grid is depicted with too large time discretization
(middle) and small time discretization (right) (rotated by about 100 degrees
for better viewing). The waves in the middle picture show the effect of too large time steps
in regions where g varies considerably.
In he learning problem, he adaptive grid mechanism ries to resolve the waves 
(figure 1, middle picture) which come from he large ime discreization. This is 
depicted in figure 2. We used only hree differen ime sep sizes (h = 0.1, 0.05 and 
0.025) and sared globally wih he coarses sep size 0.1. 
Figure 2: The adaptive grid mechanism refines correctly. However, in the left picture,
unnecessary effort is spent in resolving regions in which the time step should be refined
urgently. The right picture shows the result if adaptive time is also used. Regions outside
the L-form are refined in the early stages of learning while h was still large. An additional
coarsening should be considered in future work. We used a high rate of random jumps in
the process and locally a certainty equivalence controller to produce these pictures.
5 Discussion of the methods and conclusions 
We described a time and space adaptive method for reinforcement learning with
discounted cost functional. The ultimate goal would be to find a self-tuning algorithm
which locally adjusts the time and space discretization automatically to the
optimal ratio. The methods worked well in the problems we investigated, e.g. nonlinearities
in the system caused no problems. Nevertheless, the results depended
on the choice of the tolerance values $C$, $tol$ and $tol_k$. We used only three time discretization
steps to prevent adjacent triangles holding time discretization values too
far apart. The smallest state space resolution in the example is therefore too fine
for the finest time resolution. A solution can be to eventually use controls that are
of higher order (in terms of approximation of control functions) than constant (e.g.
linear, polynomial, or locally constant on subintervals of the finest time interval).
This corresponds to locally open loop controls.
The optimality of the discretization ratio time/space could not be proven. Some 
discontinuous value functions g gave problems, and we had problems handling stiff 
systems, too. 
The learning period was considerably shorter (about factor 100 depending on the 
requested accuracy and initial data) in the adaptive cases as opposed to fixed grid 
and time with the same accuracy. 
From our experience, it is difficult in numerical analysis to combine adaptive time
and space discretization methods. To our knowledge this concept has not yet been
applied to the Bellman-equation. Theoretical work is still to be done. We are aware
that triangulation of the state space yields difficulties in implementation in high
dimensions. In future work we will be using rectangular grids. We will also make
some comparisons with other algorithms like Parti-game ([5]). We see a challenge
in handling discontinuous systems and cost functions as they appear in models
with dry friction, for example, as well as algebro-differential systems as they appear
in robotics.
References 
[1] E. Bänsch. Local mesh refinement in 2 and 3 dimensions. IMPACT Comput.
Sci. Engrg. 3, Vol. 3:181-191, 1991.
[2] M. Falcone. A numerical approach to the infinite horizon problem of deterministic
control theory. Appl. Math. Optim. 15:1-13, 1987.
[3] R. Gonzalez and M. Tidball. On the rates of convergence of fully discrete
solutions of Hamilton-Jacobi equations. INRIA, Rapports de Recherche, No
1376, Programme 5, 1991.
[4] L. Grüne. An adaptive grid scheme for the discrete Hamilton-Jacobi-Bellman
equation. Numerische Mathematik, Vol. 75, No. 3:319-337, 1997.
[5] A. W. Moore and C. G. Atkeson. The parti-game algorithm for variable resolution
reinforcement learning in multidimensional state-spaces. Machine Learning,
Volume 21, 1995.
[6] S. Pareigis. Lernen der Lösung der Bellman-Gleichung durch Beobachtung von
kontinuierlichen Prozessen. PhD thesis, Universität Kiel, 1996.
[7] S. Pareigis. Multi-grid methods for reinforcement learning in controlled diffusion 
processes. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, 
Advances in Neural Information Processing Systems, volume 9. The MIT Press, 
Cambridge, 1997. 
