Policy Gradient Methods for 
Reinforcement Learning with Function 
Approximation 
Richard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour 
AT&T Labs - Research, 180 Park Avenue, Florham Park, NJ 07932 
Abstract 
Function approximation is essential to reinforcement learning, but 
the standard approach of approximating a value function and deter- 
mining a policy from it has so far proven theoretically intractable. 
In this paper we explore an alternative approach in which the policy 
is explicitly represented by its own function approximator, indepen- 
dent of the value function, and is updated according to the gradient 
of expected reward with respect to the policy parameters. Williams's 
REINFORCE method and actor-critic methods are examples of this 
approach. Our main new result is to show that the gradient can 
be written in a form suitable for estimation from experience aided 
by an approximate action-value or advantage function. Using this 
result, we prove for the first time that a version of policy iteration 
with arbitrary differentiable function approximation is convergent to 
a locally optimal policy. 
Large applications of reinforcement learning (RL) require the use of generalizing func- 
tion approximators such neural networks, decision-trees, or instance-based methods. 
The dominant approach for the last decade has been the value-function approach, in 
which all function approximation effort goes into estimating a value function, with 
the action-selection policy represented implicitly as the "greedy" policy with respect 
to the estimated values (e.g., as the policy that selects in each state the action with 
highest estimated value). The value-function approach has worked well in many appli- 
cations, but has several limitations. First, it is oriented toward finding deterministic 
policies, whereas the optimal policy is often stochastic, selecting different actions with 
specific probabilities (e.g., see Singh, Jaakkola, and Jordan, 1994). Second, an arbi- 
trarily small change in the estimated value of an action can cause it to be, or not be, 
selected. Such discontinuous changes have been identified as a key obstacle to estab- 
lishing convergence assurances for algorithms following the value-function approach 
(Bertsekas and Tsitsiklis, 1996). For example, Q-learning, Sarsa, and dynamic pro- 
gramming methods have all been shown unable to converge to any policy for simple 
MDPs and simple function approximators (Gordon, 1995, 1996; Baird, 1995; Tsit- 
siklis and van Roy, 1996; Bertsekas and Tsitsiklis, 1996). This can occur even if the 
best approximation is found at each step before changing the policy, and whether the 
notion of "best" is in the mean-squared-error sense or the slightly different senses of 
residual-gradient, temporal-difference, and dynamic-programming methods. 
In this paper we explore an alternative approach to function approximation in RL. 
1058 R. S. Sutton, D. McAllester, S. Singh and Y. Mansour 
Rather than approximating a value function and using that to compute a determinis- 
tic policy, we approximate a stochastic policy directly using an independent function 
approximator with its own parameters. For example, the policy might be represented 
by a neural network whose input is a representation of the state, whose output is 
action selection probabilities, and whose weights are the policy parameters. Let 0 
denote the vector of policy parameters and p the performance of the corresponding 
policy (e.g., the average reward per step). Then, in the policy gradient approach, the 
policy parameters are updated approximately proportional to the gradient: 
Op 
A  a, (1) 
where a is a positive-definite step size. If the above can be achieved, then 0 can 
usually be assured to converge to a locally optimal policy in the performance measure 
p. Unlike the value-function approach, here small changes in 0 can cause only small 
changes in the policy and in the state-visitation distribution. 
In this paper we prove that an unbiased estimate of the gradient (1) can be obtained 
from experience using an approximate value function satisfying certain properties. 
Williams's (1988, 1992) REINFORCE algorithm also finds an unbiased estimate of 
the gradient, but without the assistance of a learned value function. REINFORCE 
learns much more slowly than RL methods using value functions and has received 
relatively little attention. Learning a value function and using it to reduce'the variance 
of the gradient estimate appears to be essential for rapid learning. Jaakkola, Singh 
and Jordan (1995) proved a result very similar to ours for the special case of function 
approximation corresponding to tabular POMDPs. Our result strengthens theirs and 
generalizes it to arbitrary differentiable function approximators. Konda and Tsitsiklis 
(in prep.) independently developed a very simialr result to ours. See also Baxter and 
Bartlett (in prep.) and Marbach and Tsitsiklis (1998). 
Our result also suggests a way of proving the convergence of a wide variety of algo- 
rithms based on "actor-critic" or policy-iteration architectures (e.g., Barto, Sutton, 
and Anderson, 1983; Sutton, 1984; Kimura and Kobayashi, 1998). In this paper we 
take the first step in this direction by proving for the first time that a version of 
policy iteration with general differentiable function approximation is convergent to 
a locally optimal policy. Baird and Moore (1999) obtained a weaker but superfi- 
cially similar result for their VAPS family of methods. Like policy-gradient methods, 
VAPS includes separately parameterized policy and value functions updated by gra- 
dient methods. However, VAPS methods do not climb the gradient of performance 
(expected long-term reward), but of a measure combining performance and value- 
function accuracy. As a result, VAPS does not converge to a locally optimal policy, 
except in the case that no weight is put upon value-function accuracy, in which case 
VAPS degenerates to REINFORCE. Similarly, Gordon's (1995) fitted value iteration 
is also convergent and value-based, but does not find a locally optimal policy. 
I Policy Gradient Theorem 
We consider the standard reinforcement learning framework (see, e.g., Sutton and 
Barto, 1998), in which a learning agent interacts with a Markov decision process 
(MDP). The state, action, and reward at each time t 6 {0, 1, 2,...} are denoted st  
$, at  .A, and rt   respectively. The environment's dynamics are characterized by 
state transition probabilities, Pa s, = Pr {St+l = s' I st = s, at = a}, and expected re- 
wards 7Z] = E {rt+l I st = s, at = a}, Vs, s   $, a  .A. The agent's decision making 
procedure at each time is characterized by a policy, r(s, a, 0) = Pr {at = alst = s, 0}, 
Vs  $, a  jI, where 0  t, for I << I$1, is a parameter vector. We assume that r 
is diffentiable with respect to its parameter, i.e., that o exists. We also usually 
write just r(s,a) for r(s,a,O). 
Policy Gradient Methods for RL with Function Approximation 1059 
With function approximation, two ways of formulating the agent's objective are use- 
ful. One is the average reward formulation, in which policies are ranked according to 
their long-term expected reward per step, p(r): 
p(r) = lim 1-E{rl + r2 +... + rn [r} = Z d(s) Z r(s,a)7, 
where d  (s) = limt_ Pr (st = s[so, r) is the stationary distribution of states under 
r, which we assume exists and is independent of so for all policies. In the average 
reward formulation, the value of a state-action pair given a policy is defined as 
ao=a,r ), Vse$,aeA. 
t----1 
The second formulation we cover is that in which there is a designated start state 
so, and we care only about the long-term reward obtained from it. We will give our 
results only once, but they will apply to this formulation as well under the definitions 
p(r) = E 7t-lrt SO,r and Qr(s,a) = E 7k-lrt+k st = s, at = a,r . 
t=l Xk=l 
where 7 6 [0, 1] is a discount rate (7 = I is allowed only in episodic tasks). In this 
formulation, we define dr(s) as a discounted weighting of states encountered starting 
at so and then following r: dr(s) = --o 7 tPr (st- S[So,'). 
Our first result concerns the gradient of the performance metric with respect to the 
policy parameter: 
Theorem I (Policy Gradient). For any MDP, in either the average-reward or 
start-state formulations, 
Op Or( s, a) 
oo = a's) oo (2) 
Proof: See the appendix. 
This way of expressing the gradient was first discussed for the average-reward formu- 
lation by Marbach and Tsitsiklis (1998), based on a related expression in terms of the 
state-value function due to Jaakkola, Singh, and Jordan (1995) and Cao and Chen 
(1997). We extend their results to the start-state formulation and provide simpler 
and more direct proofs. Williams's (1988, 1992) theory of REINFORCE algorithms 
can also be viewed as implying (2). In any event, the key aspect of both expressions 
for the gradient is that their are no terms of the form oa._s: the effect of policy 
changes on the distribution of states does not appear. This is convenient for approxi- 
mating the gradient by sampling. For example, if s was sampled from the distribution 
(8'a)O(s,a) would be an unbiased estimate of 
obtained by following r, then -]-a oe ' 
-e Of course, Q (s, a) is also not normally known and must be estimated. One ap- 
00' 
proach is to use the actual returns, Rt oo oo 
-- Ek=i l't+k -- p(w) (or Re = Ek=l vk-iwt+k 
in the start-state formulation) as an approximation for each Qr(st, at). This leads to 
Williams's episodic REINFORCE algorithm, A0t oc o(,,)Re  (the 1 
corrects for the oversampling of actions preferred by r), which is known to follow  
in expected value (Williams, 1988, 1992). 
2 Policy Gradient with Approximation 
Now consider the case in which Q is approximated by a learned function approxima- 
tor. If the approximation is sufficiently good, we might hope to use it in place of Q 
1060 R. S. Sutton, D. Mc,41lester, S. Singh and Y. Mansour 
in (2) and still point roughly in the direction of the gradient. For example, Jaakkola, 
Singh, and Jordan (1995) proved that for the special case of function approximation 
arising in a tabular POMDP one could assure positive inner product with the gra- 
dient, which is sufficient to ensure improvement for moving in that direction. Here 
we extend their result to general function approximation and prove equality with the 
gradient. 
Let fw: $ x j[ - R be our approximation to Qx, with parameter w. It is natural 
a rtr / s 
to learn fw by following rr and updating w by a rule such as Awt o<  [  t, at) - 
 'fw(8"aO where Or(st, at) is some unbiased 
f(st,at)]   [O(st,at)- f(st,-tj o , 
estimator of Qr(st, at), perhaps Pu. When such a process has converged to a local 
optimum, then 
 d'r(s) E '(s'a)[ Q'r(s'a) - fu,(s,a)] Ofu,(s,a) 
Ow - O. 
(s) 
Theorem 2 (Policy Gradient with Function Approximation). If fw satisfies 
(3) and is compatible with the policy parameterization in the sense that I 
Ofu,(s,a) O'(s,a) I 
Ow O0 r(s,a) ' 
then 
Op &r( s, a) 
00 = d(s)  S(s,a). 
(4) 
(5) 
Proof: Combining (3) and (4) gives 
&r(s,a) 
oo - = 0 
(6) 
which tells us that the error in fu(s, a) is orthogonal to the gradient of the policy 
parameterization. Because the expression above is zero, we can subtract it from the 
policy gradient theorem (2) to yield 
Op &r( s, a) 
O0 = y" dr(s) y" O0 qr(s'a) - E dr(s) Y O'(s,a) 
00 [(s,)- w(s,)] 
&r(s,a) [Q=(s,a) - Q=(s,a) + f,,(s,a)] 
&r( s, a) 
= d(s) oo /(s,a). q..V. 
3 Application to Deriving Algorithms and Advantages 
Given a policy parameterization, Theorem 2 can be used to derive an appropriate 
form for the value-function parameterization. For example, consider a policy that is 
a Gibbs distribution in a linear combination of features: 
Eb eOTq ' 
Vs 6 $, s  A, 
Tsitsiklis (personal communication) points out that f being linear in the features given 
on the righthand side may be the only way to satisfy this condition. 
Policy Gradient Methods for RL with Function Approximation 1061 
where each qbsa is an/-dimensional feature vector characterizing state-action pair s, a. 
Meeting the compatibility condition (4) requires that 
Ofw(s,a) Or(s,a) 1 y.r(s,b)qbsb, 
0w = b 
so that the natural parameterization of fw is 
In other words, fw must be linear in the same features as the policy, except normalized 
to be mean zero for each state. Other algorithms can easily be derived for a variety 
of nonlinear policy parameterizations, such as multi-layer backpropagation networks. 
The careful reader will have noticed that the form given above for f requires 
that it have zero mean for each state: -.ar(s,a)fw(s,a) = O, s  $. In this 
sense it is better to think of fw as an approximation of the advantage function, 
Ar(s,a) = Qr(s,a) - Vr(S) (much as in Baird, 1993), rather than of Qx. Our 
convergence requirement (3) is really that fw get the relative value of the ac- 
tions correct in each state, not the absolute value, nor the variation from state to 
state. Our results can be viewed as a justification for the special status of advan- 
tages as the target for value function approximation in RL. In fact, our (2), (3), 
and (5), can all be generalized to include an arbitrary function of state added to 
the value function or its approximation. For example, (5) can be generalized to 
o_ _ sdr(s) --a O [fw(s,a) + v(s)] ,where v' $ - R is an arbitrary function. 
o0-- 
(This follows immediately because a -0'a -- 0, s  $.) The choice of v does not 
affect any of our theorems, but can substantially affect the variance of the gradient 
estimators. The issues here are entirely analogous to those in the use of reinforce- 
ment baselines in earlier work (e.g., Williams, 1992; Dayan, 1991; Sutton, 1984). In 
practice, v should presumably be set to the best available approximation of V . Our 
results establish that that approximation process can proceed without affecting the 
expected evolution of f and r. 
4 Convergence of Policy Iteration with Function Approximation 
Given Theorem 2, we can prove for the first time that a form of policy iteration with 
function approximation is convergent to a locally optimal policy. 
Theorem 3 (Policy Iteration with Function Approximation). Let rr 
and fw be any differentiable function approximators for the policy and value 
function respectively that satisfy the compatibility condition (4) and for which 
max0,,.,i,j I ooaoj I < B < oo. Let {rk}=o be any step-size sequence such that 
limkm ak = 0 and }-'--k ak = oo. Then, for any MDP with bounded rewards, the 
sequence {p(rrk)}k=o , defined by any 0o, rk = r(.,., Ok), and 
wk = w such that y.d(s) y.rk(s,a)[Q'*(s,a) - f(s,a)]Of- 'a) = 0 
Ok+ 1 ---- 0 k q-Otk ydW($) y o7rk($'a) 
OO 
converges such that limk_m o__ _- 0. 
Proof: Our Theorem 2 assures that the 0k update is in the direction of the gradient. 
(8'a) and on the MDP's rewards together assure us that 0____ 
The bounds on oo, ooj ao,ao 
1062 R. S. Sutton, D. McAllester, S. Singh and Y. Mansour 
is also bounded. These, together with the step-size requirements, are the necessary 
conditions to apply Proposition 3.5 from page 96 of Bertsekas and Tsitsildis (1996), 
which assures convergence to a local optimum. Q.E.D. 
Acknowledgements 
The authors wish to thank Martha Steenstrup and Doina Precup for comments, and Michael 
Kearns for insights into the notion of optimal policy under function approximation. 
References 
Baird, L. C. (1993). Advantage Updating. Wright Lab. Technical Report WL-TR-93-1146. 
Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approxima- 
tion. Proc. of the Twelfth Int. Conf. on Machine Learning, pp. 30-37. Morgan Kaufmann. 
Baird, L. C., Moore, A. W. (1999). Gradient descent for general reinforcement learning. 
NIPS 11. MIT Press. 
Barto, A. G., Sutton, R. S., Anderson, C. W. (1983). Neuronlike elements that can solve 
difficult learning control problems. IEEE Trans. on Systems, Man, and Cybernetics 3:835. 
Baxter, J., Bartlett, P. (in prep.) Direct gradient-based reinforcement learning: I. Gradient 
estimation algorithms. 
Bertsekas, D. P., Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific. 
Cao, X.-R., Chen, H.-F. (1997). Perturbation realization, potentials, and sensitivity analysis 
of Markov Processes, IEEE Trans. on Automatic Control J(10):1382-1393. 
Dayan, P. (1991). Reinforcement comparison. In D. S. Touretzky, J. L. Elman, T. J. Se- 
jnowski, and G. E. Hinton (eds.), Connectionist Models: Proceedings of the 1990 Summer 
School, pp. 45-51. Morgan Kaufmann. 
Gordon, G. J. (1995). Stable function approximation in dynamic programming. Proceedings 
of the Twelfth Int. Conf. on Machine Learning, pp. 261-268. Morgan Kaufmann. 
Gordon, G. J. (1996). Chattering in SARSA(A). CMU Learning Lab Technical Report. 
Jaakkola, T., Singh, S. P., Jordan, M. I. (1995) Reinforcement learning algorithms for par- 
tially observable Markov decision problems, NIPS 7, pp. 345-352. Morgan Kaufman. 
Kimura, H., Kobayashi, S. (1998). An analysis of actor/critic algorithms using eligibility 
traces: Reinforcement learning with imperfect value functions. Proc. ICML-98, pp. 278-286. 
Konda, V. R., Tsitsiklis, J. N. (in prep.) Actor-critic algorithms. 
Marbach, P., Tsitsiklis, J. N. (1998) Simulation-based optimization of Markov reward pro- 
cesses, technical report LIDS-P-2411, Massachusetts Institute of Technology. 
Singh, S. P., Jaakkola, T., Jordan, M. I. (1994). Learning without state-estimation in 
partially observable Markovian decision problems. Proc. ICML-9J, pp. 284-292. 
Sutton, R. S. (1984). Temporal Credit Assignment in Reinforcement Learning. Ph.D. thesis, 
University of Massachusetts, Amherst. 
Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press. 
Tsitsiklis, J. N. Van Roy, B. (1996). Feature-based methods for large scale dynamic pro- 
gramming. Machine Learning 22:59-94. 
Williams, R. J. (1988). Toward a theory of reinforcement-learning connectionist systems. 
Technical Report NU-CCS-88-3, Northeastern University, College of Computer Science. 
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist 
reinforcement learning. Machine Learning 8:229-256. 
Appendix: Proof of Theorem 1 
We prove the theorem first for the average-reward formulation and then for the start- 
state formulation. 
OV' (s) 0 
O0 -- OOZW(s,a)Q'(s,a) Vse$ 
a 
: 'Ols, a'ls, 
O0 
Policy Gradient Methods for RL with Function Approximation 1063 
Therefore, 
0-7 = [ oo q'(,,a) + (,,a)  
S t 
Summing both sides over the stationary distribution d , 
but since d  is stationary, 
Op 
$ 
O(s,a)  ,, ou(s ') 
= d(s) oo q(s,a) + d (s   
$ a S t 
$ 
0__ 0(s, 
o0 = d(s) o0 )(s,). q.s.. 
For the sta-state formulation: 
OVa(s) df 0 
oo = oo(s,)(s,) 
=  'O(s,)(s,) + (s,) 
o0 (s,) + (s,) ,p2,, U(s') (7) 
 O(,a) 
= (3,,) oo 
x k=0 a 
aer several steps of unrolling (7), where Pr(s  x, k, ) is the probability of going 
kom state s to state x in k steps under policy . It is then immiate that 
t=l 
 O(s,a) -(s,a) 
s k=O a 
0(s, a) 
= :(s) oo q(s,a). 
Q.E.D. 
