Finite-Sample Convergence Rates for 
Q-Learning and Indirect Algorithms 
Michael Kearns and Satinder Singh 
AT&T Labs 
180 Park Avenue 
Florham Park, NJ 07932 
{mkearns, baveja}@research.att.com
Abstract 
In this paper, we address two issues of long-standing interest in the re- 
inforcement learning literature. First, what kinds of performance guar- 
antees can be made for Q-learning after only a finite number of actions? 
Second, what quantitative comparisons can be made between Q-learning 
and model-based (indirect) approaches, which use experience to estimate 
next-state distributions for off-line value iteration? 
We first show that both Q-learning and the indirect approach enjoy 
rather rapid convergence to the optimal policy as a function of the num- 
ber of state transitions observed. In particular, on the order of only 
$(N\log(1/\epsilon)/\epsilon^2)(\log(N) + \log\log(1/\epsilon))$ transitions are sufficient for both
algorithms to come within $\epsilon$ of the optimal policy, in an idealized model
that assumes the observed transitions are "well-mixed" throughout an 
N-state MDP. Thus, the two approaches have roughly the same sample 
complexity. Perhaps surprisingly, this sample complexity is far less than 
what is required for the model-based approach to actually construct a good 
approximation to the next-state distribution. The result also shows that 
the amount of memory required by the model-based approach is closer to 
$N$ than to $N^2$.
For either approach, to remove the assumption that the observed tran- 
sitions are well-mixed, we consider a model in which the transitions are 
determined by a fixed, arbitrary exploration policy. Bounds on the number 
of transitions required in order to achieve a desired level of performance 
are then related to the stationary distribution and mixing time of this 
policy. 
1 Introduction 
There are at least two different approaches to learning in Markov decision processes: 
indirect approaches, which use control experience (observed transitions and payoffs) 
to estimate a model, and then apply dynamic programming to compute policies from 
the estimated model; and direct approaches such as Q-learning [2], which use control 
experience to directly learn policies (through value functions) without ever explicitly 
estimating a model. Both are known to converge asymptotically to the optimal pol- 
icy [1, 3]. However, little is known about the performance of these two approaches 
after only a finite amount of experience. 
A common argument offered by proponents of direct methods is that it may require 
much more experience to learn an accurate model than to simply learn a good policy. 
This argument is predicated on the seemingly reasonable assumption that an indirect 
method must first learn an accurate model in order to compute a good policy. On 
the other hand, proponents of indirect methods argue that such methods can do 
unlimited off-line computation on the estimated model, which may give an advantage 
over direct methods, at least if the model is accurate. Learning a good model may 
also be useful across tasks, permitting the computation of good policies for multiple 
reward functions [4]. To date, these arguments have lacked a formal framework for 
analysis and verification. 
In this paper, we provide such a framework, and use it to derive the first finite-time 
convergence rates (sample size bounds) for both Q-learning and the standard indirect 
algorithm. An important aspect of our analysis is that we separate the quality of the 
policy generating experience from the quality of the two learning algorithms. In 
addition to demonstrating that both methods enjoy rather rapid convergence to the 
optimal policy as a function of the amount of control experience, the convergence rates 
have a number of specific and perhaps surprising implications for the hypothetical 
differences between the two approaches outlined above. Some of these implications, 
as well as the rates of convergence we derive, were briefly mentioned in the abstract; 
in the interests of brevity, we will not repeat them here, but instead proceed directly 
into the technical material. 
2 MDP Basics 
Let $M$ be an unknown $N$-state MDP with $A$ actions. We use $P^a(ij)$ to denote the
probability of going to state $j$, given that we are in state $i$ and execute action $a$;
and $R^a(i)$ to denote the reward received for executing $a$ from $i$ (which we assume is
fixed and bounded between 0 and 1 without loss of generality). A policy $\pi$ assigns
an action to each state. The value of state $i$ under policy $\pi$, $V^\pi(i)$, is the expected
discounted sum of rewards received upon starting in state $i$ and executing $\pi$ forever:
$V^\pi(i) = E_\pi[r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots]$, where $r_t$ is the reward received at time step $t$
under a random walk governed by $\pi$ from start state $i$, and $0 \le \gamma < 1$ is the discount
factor. It is also convenient to define values for state-action pairs $(i,a)$: $Q^\pi(i,a) =
R^a(i) + \gamma\sum_j P^a(ij)V^\pi(j)$. The goal of learning is to approximate the optimal policy
$\pi^*$ that maximizes the value at every state; the optimal value function is denoted $Q^*$.
Given $Q^*$, we can compute the optimal policy as $\pi^*(i) = \arg\max_a\{Q^*(i,a)\}$.
If $M$ is given, value iteration can be used to compute a good approximation to the
optimal value function. Setting our initial guess as $Q_0(i,a) = 0$ for all $(i,a)$, we
iterate as follows:

$$Q_{\ell+1}(i,a) = R^a(i) + \gamma\sum_j P^a(ij)V_\ell(j) \quad (1)$$

where we define $V_\ell(j) = \max_b\{Q_\ell(j,b)\}$. It can be shown that after $\ell$ iterations,
$\max_{(i,a)}\{|Q_\ell(i,a) - Q^*(i,a)|\} \le \gamma^\ell/(1-\gamma)$. Given any approximation $\hat{Q}$ to $Q^*$ we can com-
pute the greedy approximation $\hat{\pi}$ to the optimal policy $\pi^*$ as $\hat{\pi}(i) = \arg\max_a\{\hat{Q}(i,a)\}$.
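As a concrete illustration of Equation (1) and the greedy policy extraction, here is a minimal value-iteration sketch in Python (an illustration only, not from the original text), assuming the true model is given as arrays with P[a][i][j] $= P^a(ij)$ and R[a][i] $= R^a(i)$:

```python
# A minimal value-iteration sketch for Equation (1), assuming the true
# model is known. P has shape (A, N, N), R has shape (A, N), 0 <= gamma < 1.
import numpy as np

def value_iteration(P, R, gamma, num_iters):
    A, N = R.shape
    Q = np.zeros((N, A))                      # initial guess Q_0(i, a) = 0
    for _ in range(num_iters):
        V = Q.max(axis=1)                     # V_l(j) = max_b Q_l(j, b)
        # Q_{l+1}(i, a) = R^a(i) + gamma * sum_j P^a(ij) V_l(j)
        Q = np.stack([R[a] + gamma * P[a] @ V for a in range(A)], axis=1)
    return Q, Q.argmax(axis=1)                # greedy policy from final Q
```

After $\ell$ iterations the returned Q is within $\gamma^\ell/(1-\gamma)$ of $Q^*$ in max norm, matching the guarantee above.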
3 The Parallel Sampling Model
In reinforcement learning, the transition probabilities $P^a(ij)$ are not given, and a
good policy must be learned on the basis of observed experience (transitions) in M. 
Classical convergence results for algorithms such as Q-learning [1] implicitly assume
that the observed experience is generated by an arbitrary "exploration policy" $\pi$, and
then proceed to prove convergence to the optimal policy if $\pi$ meets certain mini-
mal conditions -- namely, $\pi$ must try every state-action pair infinitely often, with
probability 1. This approach conflates two distinct issues: the quality of the explo-
ration policy $\pi$, and the quality of reinforcement learning algorithms using experience
generated by $\pi$. In contrast, we choose to separate these issues. If the exploration
policy never or only very rarely visits some state-action pair, we would like to have
this reflected as a factor in our bounds that depends only on $\pi$; a separate factor
depending only on the learning algorithm will in turn reflect how efficiently a partic-
ular learning algorithm uses the experience generated by $\pi$. Thus, for a fixed $\pi$, all
learning algorithms are placed on equal footing, and can be directly compared.
There are probably various ways in which this separation can be accomplished; we 
now introduce one that is particularly clean and simple. We would like a model of 
the ideal exploration policy -- one that produces experiences that are "well-mixed", 
in the sense that every state-action pair is tried with equal frequency. Thus, let us 
define a parallel sampling subroutine PS(M) that behaves as follows: a single call to 
PS(M) returns, for every state-action pair (i,a), a random next state j distributed 
according to P(ij). Thus, every state-action pair is executed simultaneously, and 
the resulting N x A next states are reported. A single call to PS(M) is therefore really 
simulating N x A transitions in M, and we must be careful to multiply the number 
of calls to PS(M) by this factor if we wish to count the total number of transitions 
witnessed. 
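When the true model is available (as it would be in a simulation), PS(M) itself is straightforward to implement; the sketch below is illustrative (the helper name is ours) and makes the N x A bookkeeping explicit:

```python
# A sketch of the parallel sampling oracle PS(M): one call returns, for
# every state-action pair (i, a), a single next state j drawn from P^a(ij).
import numpy as np

def parallel_sample(P, rng):
    A, N, _ = P.shape
    # next_state[i][a] = j, sampled according to P^a(ij); one call thus
    # simulates N x A transitions of M.
    return np.array([[rng.choice(N, p=P[a][i]) for a in range(A)]
                     for i in range(N)])
```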
What is PS(M) modeling? It is modeling the idealized exploration policy that man- 
ages to visit every state-action pair in succession, without duplication, and without 
fail. It should be intuitively obvious that such an exploration policy would be optimal, 
from the viewpoint of gathering experience everywhere as rapidly as possible. 
We shall first provide an analysis, in Section 5, of both direct and indirect reinforce- 
ment learning algorithms, in a setting in which the observed experience is generated 
by calls to PS(M). Of course, in any given MDP M, there may not be any exploration 
policy that meets the ideal captured by PS(M) -- for instance, there may simply be 
some states that are very difficult for any policy to reach, and thus the experience 
generated by any policy will certainly not be equally mixed around the entire MDP. 
(Indeed, a call to PS(M) will typically return a set of transitions that does not even 
correspond to a trajectory in M.) Furthermore, even if PS(M) could be simulated 
by some exploration policy, we would like to provide more general results that ex- 
press the amount of experience required for reinforcement learning algorithms under 
any exploration policy (where the amount of experience will, of course, depend on 
properties of the exploration policy). 
Thus, in Section 6, we sketch how one can bound the amount of experience required 
under any r in order to simulate calls to PS(M). (More detail will be provided in a 
longer version of this paper.) The bound depends on natural properties of r, such as 
its stationary distribution and mixing time. Combined with the results of Section $, 
we get the desired two-factor bounds discussed above: for both the direct and indirect 
approaches, a bound on the total number of transitions required, consisting of one 
factor that depends only on the algorithm, and another factor that depends only on 
the exploration policy. 
4 The Learning Algorithms 
We now explicitly state the two reinforcement learning algorithms we shall analyze 
and compare. In keeping with the separation between algorithms and exploration 
policies already discussed, we will phrase these algorithms in the parallel sampling 
framework, and Section 6 indicates how they generalize to the case of arbitrary ex- 
ploration policies. We begin with the direct approach. 
Rather than directly studying standard Q-learning, we will here instead examine a 
variant that is slightly easier to analyze, and is called phased Q-learning. However, we 
emphasize that all of our results can be generalized to apply to standard Q-learning 
(with learning rate $\alpha_t(i,a) = 1/t(i,a)$, where $t(i,a)$ is the number of trials of $(i,a)$ so far).
Basically, rather than updating the value function with every observed transition from 
(i, a), phased Q-learning estimates the expected value of the next state from (i, a) 
on the basis of many transitions, and only then makes an update. The memory 
requirements for phased Q-learning are essentially the same as those for standard 
Q-learning. 
Direct Algorithm (Phased Q-Learning): As suggested by the name, the algo-
rithm operates in phases. In each phase, the algorithm will make $m_D$ calls to PS(M)
(where $m_D$ will be determined by the analysis), thus gathering $m_D$ trials of every
state-action pair $(i,a)$. At the $\ell$th phase, the algorithm updates the estimated value
function as follows: for every $(i,a)$,

$$\hat{Q}_{\ell+1}(i,a) = R^a(i) + \gamma\,\frac{1}{m_D}\sum_{k=1}^{m_D} \hat{V}_\ell(j_k^{(i,a)}) \quad (2)$$

where $j_1^{(i,a)},\ldots,j_{m_D}^{(i,a)}$ are the $m_D$ next states observed from $(i,a)$ on the $m_D$ calls to
PS(M) during the $\ell$th phase, and $\hat{V}_\ell(j) = \max_b\{\hat{Q}_\ell(j,b)\}$. The policy computed by
the algorithm is then the greedy policy determined by the final value function. Note
that phased Q-learning is quite like standard Q-learning, except that we gather
statistics (the summation in Equation (2)) before making an update.
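A compact sketch of the whole procedure (an illustration under our earlier assumptions; it reuses the hypothetical parallel_sample helper from Section 3, and leaves $m_D$ and $\ell_D$ as inputs to be set by the analysis of Section 5):

```python
# Phased Q-learning (Equation (2)) under the parallel sampling model.
# Each phase makes m_D calls to PS(M) and replaces the expectation over
# next states in Equation (1) with an empirical average.
import numpy as np

def phased_q_learning(P, R, gamma, m_D, ell_D, rng):
    A, N = R.shape
    Q = np.zeros((N, A))
    for _ in range(ell_D):
        V = Q.max(axis=1)                 # current value estimates V_l
        avg = np.zeros((N, A))            # empirical next-state values
        for _ in range(m_D):              # m_D calls to PS(M)
            j = parallel_sample(P, rng)   # j[i][a] ~ P^a(i, .)
            avg += V[j]
        Q = R.T + gamma * avg / m_D       # the phased update (2)
    return Q.argmax(axis=1)               # greedy policy from final Q
```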
We now proceed to describe the standard indirect approach. 
Indirect Algorithm: The algorithm first makes $m_I$ calls to PS(M) to obtain $m_I$
next-state samples for each $(i,a)$. It then builds an empirical model of the transition
probabilities as follows: $\hat{P}^a(ij) = \#(i \xrightarrow{a} j)/m_I$, where $\#(i \xrightarrow{a} j)$ is the number of times
state $j$ was reached on the $m_I$ trials of $(i,a)$. The algorithm then does value iteration
(as described in Section 2) on the fixed model $\hat{P}^a(ij)$ for $\ell_I$ phases. Again, the policy
computed by the algorithm is the greedy policy dictated by the final value function.
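The indirect algorithm admits an equally short sketch (again an illustration, reusing the hypothetical value_iteration and parallel_sample helpers introduced above):

```python
# The indirect algorithm: estimate an empirical model from m_I calls to
# PS(M), then run ell_I phases of value iteration on the fixed estimate.
import numpy as np

def indirect_algorithm(P, R, gamma, m_I, ell_I, rng):
    A, N = R.shape
    counts = np.zeros((A, N, N))          # counts[a][i][j] = #(i -a-> j)
    for _ in range(m_I):                  # m_I calls to PS(M)
        j = parallel_sample(P, rng)
        for i in range(N):
            for a in range(A):
                counts[a, i, j[i, a]] += 1
    P_hat = counts / m_I                  # empirical model of P^a(ij)
    # With few calls, counts is extremely sparse; storing only the
    # non-zero entries realizes the near-linear memory bound of Section 5.
    _, policy = value_iteration(P_hat, R, gamma, ell_I)
    return policy
```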
Thus, in phased Q-learning, the algorithm runs for some number $\ell_D$ of phases, and each
phase requires $m_D$ calls to PS(M), for a total number of transitions $\ell_D \times m_D \times N \times A$.
The indirect algorithm first makes $m_I$ calls to PS(M), and then runs $\ell_I$ phases of
value iteration (which requires no additional data), for a total number of transitions
$m_I \times N \times A$. The question we now address is: how large must $m_D, m_I, \ell_D, \ell_I$ be
so that, with probability at least $1-\delta$, the resulting policies have expected return
within $\epsilon$ of the optimal policy in $M$? The answers we give yield perhaps surprisingly
similar bounds on the total number of transitions required for the two approaches in
the parallel sampling model.
5 Bounds on the Number of Transitions 
We now state our main result. 
Theorem 1 For any MDP M:

• For an appropriate choice of the parameters $m_I$ and $\ell_I$, the total number
of calls to PS(M) required by the indirect algorithm in order to ensure that,
with probability at least $1-\delta$, the expected return of the resulting policy will
be within $\epsilon$ of the optimal policy, is

$$O((1/\epsilon^2)(\log(N/\delta) + \log\log(1/\epsilon))). \quad (3)$$

• For an appropriate choice of the parameters $m_D$ and $\ell_D$, the total number of
calls to PS(M) required by phased Q-learning in order to ensure that, with
probability at least $1-\delta$, the expected return of the resulting policy will be
within $\epsilon$ of the optimal policy, is

$$O((\log(1/\epsilon)/\epsilon^2)(\log(N/\delta) + \log\log(1/\epsilon))). \quad (4)$$
The bound for phased Q-learning is thus only $O(\log(1/\epsilon))$ larger than that for the
indirect algorithm. Bounds on the total number of transitions witnessed in either 
case are obtained by multiplying the given bounds by N x A. 
Before sketching some of the ideas behind the proof of this result, we first discuss 
some of its implications for the debate on direct versus indirect approaches. First of 
all, for both approaches, convergence is rather fast: with a total number of transitions 
only on the order of $N\log(N)$ (fixing $\epsilon$ and $\delta$ for simplicity), near-optimal policies
are obtained. This represents a considerable advance over the classical asymptotic 
results: instead of saying that an infinite number of visits to every state-action pair 
are required to converge to the optimal policy, we are claiming that a rather small 
number of visits are required to get close to the optimal policy. Second, by our 
analysis, the two approaches have similar complexities, with the number of transitions 
required differing by only a $\log(1/\epsilon)$ factor in favor of the indirect algorithm. Third
-- and perhaps surprisingly -- note that since only $O(\log(N))$ calls are being made
to PS(M) (again fixing $\epsilon$ and $\delta$), and since the number of trials per state-action pair
is exactly the number of calls to PS(M), the number of non-zero entries in the
model $\hat{P}^a(ij)$ built by the indirect approach is in fact only $O(\log(N))$ for each $(i,a)$. In other
words, $\hat{P}^a(ij)$ will be extremely sparse -- and thus, a terrible approximation to the
true transition probabilities -- yet still good enough to derive a near-optimal policy!
Clever representation of $\hat{P}^a(ij)$ will thus result in total memory requirements that
are only $O(N\log(N))$ rather than $O(N^2)$. Fourth, although we do not have space
to provide any details, if instead of a single reward function, we are provided with $L$
reward functions (where the $L$ reward functions are given in advance of observing any
experience), then for both algorithms, the number of transitions required to compute
near-optimal policies for all $L$ reward functions simultaneously is only a factor of
$O(\log(L))$ greater than the bounds given above.
Our own view of the result and its implications is: 
• Both algorithms enjoy rapid convergence to the optimal policy as a function
of the amount of experience.

• In general, neither approach enjoys a significant advantage in convergence
rate, memory requirements, or handling multiple reward functions. Both are
quite efficient on all counts.
We do not have space to provide a detailed proof of Theorem 1, but instead provide 
some highlights of the main ideas. The proofs for both the indirect algorithm and 
phased Q-learning are actually quite similar, and have at their heart two slightly 
different uniform convergence lemmas. For phased Q-learning, it is possible to show
that, for any bound $\ell_D$ on the number of phases to be executed, and for any $\tau > 0$,
we can choose $m_D$ so that

$$\left|\frac{1}{m_D}\sum_{k=1}^{m_D}\hat{V}_\ell(j_k^{(i,a)}) - \sum_j P^a(ij)\hat{V}_\ell(j)\right| \le \tau \quad (5)$$

will hold simultaneously for every $(i,a)$ and for every phase $\ell = 1,\ldots,\ell_D$. In other
words, at the end of every phase, the empirical estimate of the expected next-state
value for every $(i,a)$ will be close to the true expectation, where here the expectation
is with respect to the current estimated value function $\hat{V}_\ell$.
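One illustrative way to choose $m_D$ (a sketch with our own constants, which may differ from those of the precise analysis): each $\hat{V}_\ell(j_k^{(i,a)})$ is an independent random variable in $[0, 1/(1-\gamma)]$ whose mean is $\sum_j P^a(ij)\hat{V}_\ell(j)$, with $\hat{V}_\ell$ fixed at the start of phase $\ell$, so Hoeffding's inequality gives

$$\Pr\left[\left|\frac{1}{m_D}\sum_{k=1}^{m_D}\hat{V}_\ell(j_k^{(i,a)}) - \sum_j P^a(ij)\hat{V}_\ell(j)\right| > \tau\right] \le 2e^{-2m_D\tau^2(1-\gamma)^2},$$

and a union bound over all $NA\ell_D$ (pair, phase) events shows that

$$m_D = O\!\left(\frac{1}{\tau^2(1-\gamma)^2}\log\frac{NA\ell_D}{\delta}\right)$$

suffices for Equation (5) to hold everywhere with probability at least $1-\delta$.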
For the indirect algorithm, a slightly more subtle uniform convergence argument is
required. Here we show that it is possible to choose, for any bound $\ell_I$ on the number
of iterations of value iteration to be executed on the $\hat{P}^a(ij)$, and for any $\tau > 0$, a
value $m_I$ such that

$$\left|\sum_j \hat{P}^a(ij)V_\ell(j) - \sum_j P^a(ij)V_\ell(j)\right| \le \tau \quad (6)$$

for every $(i,a)$ and every phase $\ell = 1,\ldots,\ell_I$, where the $V_\ell(j)$ are the value functions
resulting from performing true value iteration (that is, on the $P^a(ij)$). Equation (6)
essentially says that expectations of the true value functions are quite similar under
either the true or estimated model, even though the indirect algorithm never has
access to the true value functions.
In either case, the uniform convergence results allow us to argue that the corre- 
sponding algorithms still achieve successive contractions, as in the classical proof 
of value iteration. For instance, in the case of phased Q-learning, if we define
$\Delta_\ell = \max_{(i,a)}\{|\hat{Q}_\ell(i,a) - Q_\ell(i,a)|\}$ (where the $Q_\ell$ are the iterates of true value
iteration), we can derive a recurrence relation for $\Delta_{\ell+1}$ as follows:

$$|\hat{Q}_{\ell+1}(i,a) - Q_{\ell+1}(i,a)| = \gamma\left|\frac{1}{m_D}\sum_{k=1}^{m_D}\hat{V}_\ell(j_k^{(i,a)}) - \sum_j P^a(ij)V_\ell(j)\right| \quad (7)$$

$$\le \gamma\left|\frac{1}{m_D}\sum_{k=1}^{m_D}\hat{V}_\ell(j_k^{(i,a)}) - \sum_j P^a(ij)\hat{V}_\ell(j)\right| + \gamma\max_j\{|\hat{V}_\ell(j) - V_\ell(j)|\} \quad (8)$$

$$\le \gamma\tau + \gamma\Delta_\ell. \quad (9)$$

Here we have made use of Equation (5). Since $\Delta_0 = 0$ ($\hat{Q}_0 = Q_0$), this recurrence
gives $\Delta_\ell \le \tau(\gamma/(1-\gamma))$ for any $\ell$. From this it is not hard to show that for any $(i,a)$

$$|\hat{Q}_\ell(i,a) - Q^*(i,a)| \le \tau(\gamma/(1-\gamma)) + \gamma^\ell/(1-\gamma). \quad (10)$$
From this it can be shown that the regret in expected return suffered by the policy
computed by phased Q-learning after $\ell$ phases is at most $(\tau\gamma/(1-\gamma) + \gamma^\ell/(1-\gamma))(2/(1-\gamma))$.
The proof proceeds by setting this regret smaller than the desired $\epsilon$, solving for $\ell$ and
$\tau$, and obtaining the resulting bound on $m_D$. The derivation of bounds for the indirect
algorithm is similar.
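To make this last step concrete, here is one possible setting of the parameters (a sketch with our own constants, treating $\gamma$ and $A$ as fixed; the exact constants of the full proof may differ). Requiring each of the two regret terms to contribute at most $\epsilon/2$ gives

$$\frac{2\gamma^\ell}{(1-\gamma)^2} \le \frac{\epsilon}{2} \quad\text{and}\quad \frac{2\tau\gamma}{(1-\gamma)^2} \le \frac{\epsilon}{2},$$

which is satisfied by $\ell_D = O(\log(1/(\epsilon(1-\gamma)))/(1-\gamma))$ and $\tau = \epsilon(1-\gamma)^2/(4\gamma)$. Substituting this $\tau$ and $\ell_D$ into the Hoeffding-based choice of $m_D$ sketched after Equation (5) yields $m_D = O((1/\epsilon^2)(\log(N/\delta) + \log\log(1/\epsilon)))$ calls per phase, since $\log \ell_D = O(\log\log(1/\epsilon))$ for fixed $\gamma$; multiplying by the $\ell_D = O(\log(1/\epsilon))$ phases recovers the form of Equation (4).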
6 Handling General Exploration Policies 
As promised, we conclude our technical results by briefly sketching how we can trans- 
late the bounds obtained in Section 5 under the idealized parallel sampling model into 
bounds applicable when any fixed policy $\pi$ is guiding the exploration. Such bounds
must, of course, depend on properties of $\pi$. Due to space limitations, we can only
outline the main ideas; the formal statements and proofs are deferred to a longer 
version of the paper. 
Let us assume for simplicity that $\pi$ (which may be a stochastic policy) defines an
ergodic Markov process in the MDP $M$. Thus, $\pi$ induces a unique stationary distri-
bution $P_{M,\pi}(i,a)$ over state-action pairs -- intuitively, $P_{M,\pi}(i,a)$ is the frequency of
executing action $a$ from state $i$ during an infinite random walk in $M$ according to
$\pi$. Furthermore, we can introduce the standard notion of the mixing time of $\pi$ to
its stationary distribution -- informally, this is the number $T$ of steps required such
that the distribution induced on state-action pairs by $T$-step walks according to $\pi$
will be "very close" to $P_{M,\pi}$.¹ Finally, let us define $p_\pi = \min_{(i,a)}\{P_{M,\pi}(i,a)\}$.
Armed with these notions, it is not difficult to show that the number of steps we must
take under $\pi$ in order to simulate, with high probability, a call to the oracle PS(M),
is polynomial in the quantity $T/p_\pi$. The intuition is straightforward: at most every
$T$ steps, we obtain an "almost independent" draw from $P_{M,\pi}(i,a)$; and with each
independent draw, we have at least probability $p_\pi$ of drawing any particular $(i,a)$
pair. Once we have sampled every $(i,a)$ pair, we have simulated a call to PS(M).
The formalization of these intuitions leads to a version of Theorem 1 applicable to
any $\pi$, in which the bound is multiplied by a factor polynomial in $T/p_\pi$, as desired.
However, a better result is possible. In cases where $p_\pi$ may be small or even 0 (which
would occur when $\pi$ simply does not ever execute some action from some state), the
factor $T/p_\pi$ is large or infinite and our bounds become weak or vacuous. In such
cases, it is better to define the sub-MDP $M_\pi(\alpha)$, which is obtained from $M$ by simply
deleting any $(i,a)$ for which $P_{M,\pi}(i,a) < \alpha$, where $\alpha > 0$ is a parameter of our choos-
ing. In $M_\pi(\alpha)$, $p_\pi \ge \alpha$ by construction, and we may now obtain convergence rates
to the optimal policy in $M_\pi(\alpha)$ for both Q-learning and the indirect approach like
those given in Theorem 1, multiplied by a factor polynomial in $T/\alpha$. (Technically,
we must slightly alter the algorithms to have an initial phase that detects and elim-
inates small-probability state-action pairs, but this is a minor detail.) By allowing
$\alpha$ to become smaller as the amount of experience we receive from $\pi$ grows, we can
obtain an "anytime" result, since the sub-MDP $M_\pi(\alpha)$ approaches the full MDP $M$
as $\alpha \to 0$.
References 
[1] T. Jaakkola, M. I. Jordan, and S. Singh. On the convergence of stochastic iterative
dynamic programming algorithms. Neural Computation, 6(6):1185-1201, 1994.
[2] C. J. C. H. Watkins. Learning from Delayed Rewards. Ph.D. thesis, Cambridge Uni-
versity, 1989.
[3] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press,
1998.
[4] S. Mahadevan. Enhancing Transfer in Reinforcement Learning by Building Stochastic
Models of Robot Actions. In Machine Learning: Proceedings of the Ninth International
Conference, 1992.
¹ Formally, the degree of closeness is measured by the distance between the transient and
stationary distributions. For brevity here we will simply assume this parameter is set to a
very small, constant value.
