Actor-Critic Algorithms 
Vijay R. Konda John N. Tsitsiklis 
Laboratory for Information and Decision Systems, 
Massachusetts Institute of Technology, 
Cambridge, MA, 02139. 
konda@mit. edu, jnt@mit. edu 
Abstract 
We propose and analyze a class of actor-critic algorithms for 
simulation-based optimization of a Markov decision process over 
a parameterized family of randomized stationary policies. These 
are two-time-scale algorithms in which the critic uses TD learning 
with a linear approximation architecture and the actor is updated 
in an approximate gradient direction based on information pro- 
vided by the critic. We show that the features for the critic should 
span a subspace prescribed by the choice of parameterization of the 
actor. We conclude by discussing convergence properties and some 
open problems. 
1 Introduction 
The vast majority of Reinforcement Learning (RL) [9] and Neuro-Dynamic Pro- 
gramming (NDP) [1] methods fall into one of the following two categories: 
(a) 
(b) 
Actor-only methods work with a parameterized family of policies. The gra- 
dient of the performance, with respect to the actor parameters, is directly 
estimated by simulation, and the parameters are updated in a direction of 
improvement [4, 5, 8, 13]. A possible drawback of such methods is that the 
gradient estimators may have a large variance. Furthermore, as the pol- 
icy changes, a new gradient is estimated independently of past estimates. 
Hence, there is no "learning," in the sense of accumulation and consolida- 
tion of older information. 
Critic-only methods rely exclusively on value function approximation and 
aim at learning an approximate solution to the Bellman equation, which will 
then hopefully prescribe a near-optimal policy. Such methods are indirect 
in the sense that they do not try to optimize directly over a policy space. A 
method of this type may succeed in constructing a "good" approximation of 
the value function, yet lack reliable guarantees in terms of near-optimality 
of the resulting policy. 
Actor-critic methods aim at combining the strong points of actor-only and critic- 
only methods. The critic uses an approximation architecture and simulation to 
learn a value function, which is then used to update the actor's policy parameters 
Actor-Critic Algorithms 1009 
in a direction of performance improvement. Such methods, as long as they are 
gradient-based, may have desirable convergence properties, in contrast to critic- 
only methods for which convergence is guaranteed in very limited settings. They 
hold the promise of delivering faster convergence (due to variance reduction), when 
compared to actor-only methods. On the other hand, theoretical understanding of 
actor-critic methods has been limited to the case of lookup table representations of 
policies [6]. 
In this paper, we propose some actor-critic algorithms and provide an overview of 
a convergence proof. The algorithms are based on an important observation. Since 
the number of parameters that the actor has to update is relatively small (compared 
to the number of states), the critic need not attempt to compute or approximate 
the exact value function, which is a high-dimensional object. In fact, we show that 
the critic should ideally compute a certain "projection" of the value function onto a 
low-dimensional subspace spanned by a set of "basis functions," that are completely 
determined by the parameterization of the actor. Finally, as the analysis in [11] 
suggests for TD algorithms, our algorithms can be extended to the case of arbitrary 
state and action spaces as long as certain ergodicity assumptions are satisfied. 
We close this section by noting that ideas similar to ours have been presented in 
the simultaneous and independent work of Sutton et al. [10]. 
2 
Markov decision processes and parameterized family of 
RSP's 
Consider a Markov decision process with finite state space $, and finite action space 
A. Let g: $ x A -> I be a given cost function. A randomized stationary policy (RSP) 
is a mapping/ that assigns to each state x a probability distribution over the action 
space A. We consider a set of randomized stationary policies IP - {/;8  In), 
parameterized in terms of a vector 8. For each pair (x,u)  5' x A, lU(x,u) denotes 
the probability of taking action u when the state x is encountered, under the policy 
corresponding to 8. Let pxy (u) denote the probability that the next state is y, given 
that the current state is x and the current action is u. Note that under any RSP, the 
sequence of states {Xn) and of state-action pairs {Xn, Un) of the Markov decision 
process form Markov chains with state spaces 5' and 5' x A, respectively. We make 
the following assumptions about the family of policies IP. 
(A1) 
(A2) 
For all x  5' and u  A the map 8 -> /(x,u) is twice differentiable 
with bounded first, second derivatives. Furthermore, there exists a ll 
valued function  (x, u) such that V/ (x, u) =/ (x, u) (x, u) where the 
mapping 8  (x, u) is bounded and has first bounded derivatives for any 
fixed x and u. 
For each 8  I n , the Markov chains {Xn) and {Xn, Un) are irreducible and 
aperiodic, with stationary probabilities r (x) and r/ (x, u) = r (x)/ (x, u), 
respectively, under the RSP/. 
In reference to Assumption (A1), note that whenever/(x,u) is nonzero we have 
)(x,u) = /(x, u) = Vin/(x,u). 
Consider the average cost function ,X: I '  I, given by 
= g(x, 
xS,uA 
1010 V. R. Konda and J. N. Tsitsiklis 
We are interested in minimizing A(9) over all 8. For each 9  ll n, let Vs ' $  lt 
be the "differential" cost function, defined as solution of Poisson equation: 
A(8) + Vs(x)= E ps(x,u) g(x,u)+ Epxv(u)Vs(y)]. 
uA y 
Intuitively, Vs (x) can be viewed as the "disadvantage" of state x: it is the expected 
excess cost - on top of the average cost - incurred if we start at state x. It plays 
role similar to that played by the more familiar value function that arises in total 
or discounted cost Markov decision problems. Finally, for every O  ll n, we define 
the q-function qs ' $ x A - ll, by 
qs(x,u) = g(x,u) - A(O) + Epxv(u)Vs(y). 
y 
We recall the following result, as stated in [8]. (Different versions of this result have 
been established in [3, 4, 5].) 
Theorem 1. 
t9 () _ E ris(x,u)qs(x,u)$(x,u ) (1) 
where $(x, u) stands for the ith component of 
In [8], the quantity qs(x,u) in the above formula is interpreted as the expected 
excess cost incurred over a certain renewal period of the Markov chain Xn, 
under the RSP Ps, and is then estimated by means of simulation, leading to actor- 
only algorithms. Here, we provide an alternative interpretation of the formula in 
Theorem 1, as an inner product, and thus derive a different set of algorithms, which 
readily generalize to the case of an infinite space as well. 
For any   ll n , we define the inner product (., ')s of two real valued functions ql, q2 
on S x A, viewed as vectors in lll$1lAI, by 
(ql, q2)s = E rio(x, u)ql (x, u)q2(x, u). 
XU 
With this notation we can rewrite the formula (1) as 
(9OiA(O) = (qs,)s, i- 1,...,n. 
Let ]]. Ils denote the norm induced by this inner product on lllSllAI. For each O  ll n 
let s denote the span of the vectors $; i _( i _(n) in lllslll (This is same as 
the set of all functions f on S x A of the form f(x,u) - 54 i$(x,u), for some 
scalars O 1 ,... , On. ) 
Note that although the gradient of A depends on the q-function, which is a vector 
in a possibly very high dimensional space ltlSllAI, the dependence is only through 
its inner products with vectors in s. Thus, instead of "learning" the function 
it would suffice to learn the projection of qs on the subspace 
Indeed, let IIs ll [$t11  s be the projection operator defined by 
Since 
Ilsq = arg min Ilq- Oils. 
it is enough to compute the projection of qs onto 
Actor-Critic Algorithms 1 O11 
3 Actor-critic algorithms 
We view actor critic-algorithms as stochastic gradient algorithms on the parameter 
space of the actor. When the actor parameter vector is ), the job of the critic is 
to compute an approximation of the projection IIq of q onto . The actor uses 
this approximation to update its policy in an approximate gradient direction. The 
analysis in [11, 12] shows that this is precisely what TD algorithms try to do, i.e., 
to compute the projection of an exact value function onto a subspace spanned by 
feature vectors. This allows us to implement the critic by using a TD algorithm. 
(Note, however, that other types of critics are possible, e.g., based on batch solution 
of least squares problems, as long as they aim at computing the same projection.) 
We note some minor differences with the common usage of TD. In our context, 
we need the projection of q-functions, rather than value functions. But this is 
easily achieved by replacing the Markov chain (xt) in [11, 12] by the Markov chain 
(Xn, Un). A further difference is that [11, 12] assume that the control policy and 
the feature vectors are fixed. In our algorithms, the control policy as well as the 
features need to change as the actor updates its parameters. As shown in [6, 2], 
this need not pose any problems, as long as the actor parameters are updated on a 
slower time scale. 
We are now ready to describe two actor-critic algorithms, which differ only as far 
as the critic updates are concerned. In both variants, the critic is a TD algorithm 
with a linearly parameterized approximation architecture for the q-function, of the 
form 
m 
j=l 
where r = (r,...,r m)  m denotes the parameter vector of the critic. The 
features b, j = 1,... ,m, used by the critic are dependent on the actor parameter 
vector ) and are chosen such that their span in ]$11AI, denoted by , contains 
 . Note that the formula (2) still holds if II is redefined as projection onto  
as long as  contains . The most straightforward choice would be to let m = n 
and b -  for each i. Nevertheless, we allow the possibility that m > n and  
properly contains , so that the critic uses more features than that are actually 
necessary. This added flexibility may turn out to be useful in a number of ways: 
1. It is possible for certain values of 8, the features  are either close to 
zero or are almost linearly dependent. For these values of ), the operator 
II becomes ill-conditioned and the algorithms can become unstable. This 
might be avoided by using richer set of features . 
2. For the second algorithm that we propose (TD() e  1) critic can only 
compute approximate - rather than exact - projection. The use of additional 
features can result in a reduction of the approximation error. 
Along with the parameter vector r, the critic stores some auxiliary parameters: these 
are a (scalar) estimate , of the average cost, and an m-vector z which represents 
Sutton's eligibility trace [1, 9]. The actor and critic updates take place in the course 
of a simulation of a single sample path of the controlled Markov chain. Let rk, zk,  
be the parameters of the critic, and let 9 be the parameter vectpr of the actor, at 
time k. Let (X, U) be the state-action pair at that time. Let Xk+ be the new 
state, obtained after action U is applied. A new action U+I is generated according 
to the RSP corresponding to the actor parameter vector 9. The critic carries out 
an update similar to the average cost temporal-difference method of [12]: 
= + - 
1012 V. R. Konda andJ. N. TsitsiMis 
(Here, 7k is a positive stepsize parameter.) The two variants of the critic use 
different ways of updating zk' 
TD(1) Critic: 
Let x* be a state in $. 
Zk_{_ 1 ---- Z k '- O0(Xk_{_l,Uk_{_l), 
= 
if X+  x* 
otherwise. 
TD(a) Critic, 0 _  < 1: 
Zk_{_ 1 -- OtZk -{- o (Xk_l_l , Uk_{_l ). 
Actor: 
Finally, the actor updates its parameter vector by letting 
Ok-l-1 -- Ok -- /k F(rk)Qr ( Xk+ l , Uk + l )0 ( Xk + l , Uk+ l ). 
Here,/ is a positive stepsize and F(rk) > 0 is a normalization factor satisfying: 
(A3) F(.) is Lipschitz continuous. 
(A4) There exists C > 0 such that 
c 
r(r) <_ 
The above presented algorithms are only two out of many variations. For instance, 
one could also consider "episodic" problems in which one starts from a given initial 
state and runs the process until a random termination time (at which time the 
process is reinitialized at x*), with the objective of minimizing the expected cost 
until termination. In this setting, the average cost estimate A is unnecessary 
and is removed from the critic update formula. If the critic parameter r were to 
be reinitialized each time that x* is entered, one would obtain a method closely 
related to Williams' REINFORCE algorithm [13]. Such a method does not involve 
any value function learning, because the observations during one episode do not 
affect the critic parameter r during another episode. In contrast, in our approach, 
the observations from all past episodes affect current critic parameter r, and in 
this sense critic is "learning". This can be advantageous because, as long as 0 is 
slowly changing, the observations from recent episodes carry useful information on 
the q-function under the current policy. 
4 Convergence of actor-critic algorithms 
Since our actor-critic algorithms are gradient-based, one cannot expect to prove 
convergence to a globally optimal policy (within the given class of RSP's). The 
best that one could hope for is the convergence of VA(0) to zero; in practical terms, 
this will usually translate to convergence to a local minimum of (0). Actually, 
because the TD(a) critic will generally converge to an approximation of the desired 
projection of the value function, the corresponding convergence result is necessarily 
weaker, only guaranteeing that VA(O) becomes small (infinitely often). Let us now 
introduce some further assumptions. 
Actor-Critic Algorithms 1013 
(A5) For each 0  n, we define an m x m matrix G(O) by 
G(o) = r. 
We assume that G(O) is uniformly positive definite, that is, there exists 
some q > 0 such that for all r  11 m and 0  1t ' 
rrG(O)r > qllrll 2 
(A6) We assume that the stepsize sequences {7k}, {/k} are positive, nonincreas- 
ing, and satisfy 
6k >0, V, y.6k = ec, y.6 < 
k k 
where 5k stands for either/k or ffk. We also assume that 
---0. 
Note that the last assumption requires that the actor parameters be updated at a 
time scale slower than that of critic. 
Theorem 2. In an actor-critic algorithm with a TD(1) critic, 
lim inf IIV (0k)11 - 0 
k 
Furthermore, if {0k} is bounded w.p. I then 
lim IIVX(0k)]l- 0 
k 
Theorem 3. For every e > O, there exists 
liminfk llVX(0k)11 <_  .p. 1. 
w.p. 1. 
w.p. 1. 
sufficiently close to 
1, such that 
Note that the theoretical guarantees appear to be stronger in the case of the TD(1) 
critic. However, we expect that TD(a) will perform better in practice because of 
much smaller variance for the parameter rk. (Similar issues arise when considering 
actor-only algorithms. The experiments reported in [7] indicate that introducing a 
forgetting factor a < i can result in much faster convergence, with very little loss of 
performance.) We now provide an overview of the proofs of these theorems. Since 
/k/ffk -* 0, the size of the actor updates becomes negligible compared to the size 
of the critic updates. Therefore the actor looks stationary, as far as the critic is 
concerned. Thus, the analysis in [1] for the TD(1) critic and the analysis in [12] 
for the TD(a) critic (with a < 1) can be used, with appropriate modifications, to 
conclude that the critic's approximation of H0 qo will be "asymptotically correct". 
If r(O) denotes the value to which the critic converges when the actor parameters 
are fixed at 0, then the update for the actor can be rewritten as 
= - +/kek, 
where ek is an error that becomes asymptotically negligible. At this point, standard 
proof techniques for stochastic approximation algorithms can be used to complete 
the proof. 
15 Conclusions 
The key observation in this paper is that in actor-critic methods, the actor pa- 
rameterization and the critic parameterization need not, and should not be chosen 
1014 V. R. Konda andJ. N. Tsitsiklis 
independently. Rather, an appropriate approximation architecture for the critic is 
directly prescribed by the parameterization used in actor. 
Capitalizing on the above observation, we have presented a class of actor-critic algo- 
rithms, aimed at combining the advantages of actor-only and critic-only methods. In 
contrast to existing actor-critic methods, our algorithms apply to high-dimensional 
problems (they do not rely on lookup table representations), and are mathematically 
sound in the sense that they possess certain convergence properties. 
Acknowledgments: This research was partially supported by the NSF under 
grant ECS-9873451, and by the AFOSR under grant F49620-99-1-0320. 
References 
[1] D. P. Bertsekas and J. N. Tsitsiklis. Neurodynamic Programming. Athena 
Scientific, Belmont, MA, 1996. 
[2] V. S. Borkar. Stochastic approximation with two time scales. Systems and 
Control Letters, 29:291-294, 1996. 
[3] X. R. Cao and H. F. Chen. Perturbation realization, potentials, and sensitiv- 
ity analysis of Markov processes. IEEE Transactions on Automatic Control, 
42:1382-1393, 1997. 
[4] P. W. Glynn. Stochastic approximation for monte carlo optimization. In Pro- 
ceedings of the 1986 Winter Simulation Conference, pages 285-289, 1986. 
[5] T. Jaakola, S. P. Singh, and M. I. Jordan. Reinforcement learning algorithms 
for partially observable Markov decision problems. In Advances in Neural In- 
formation Processing Systems, volume 7, pages 345-352, San Francisco, CA, 
1995. Morgan Kaufman. 
[6] V. R. Konda and V. S. Borkar. Actor-critic like learning algorithms for Markov 
decision processes. SIAM Journal on Control and Optimization, 38(1):94-123, 
1999. 
[7] P. Marbach. Simulation based optimization of Markov reward processes. PhD 
thesis, Massachusetts Institute of Technology, 1998. 
[8] P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov 
reward processes. Submitted to IEEE Transactions on Automatic Control. 
[9] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 
Cambridge, MA, 1995. 
[10] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient meth- 
ods for reinforcement learning with function approximation. In this proceedings. 
[11] J. N. Tsitsiklis and B. Van Roy. 
ing with function approximation. 
42(5):674-690, 1997. 
[12] J. N. Tsitsiklis and B. Van Roy. 
[13] 
An analysis of temporal-difference learn- 
IEEE Transactions on Automatic Control, 
Average cost temporal-difference learning. 
Automatica, 35(11):1799-1808, 1999. 
R. Williams. Simple statistical gradient following algorithms for connectionist 
reinforcement learning. Machine Learning, 8:229-256, 1992. 
