Monte Carlo Matrix Inversion and 
Reinforcement Learning 
Andrew Barto and Michael Duff 
Computer Science Department 
University of Massachusetts 
Amherst, MA 01003 
Abstract 
We describe the relationship between certain reinforcement learning
(RL) methods based on dynamic programming (DP) and a class
of unorthodox Monte Carlo methods for solving systems of linear
equations proposed in the 1950's. These methods recast the solution
of the linear system as the expected value of a statistic suitably
defined over sample paths of a Markov chain. The significance of
our observations lies in arguments (Curtiss, 1954) that these Monte
Carlo methods scale better with respect to state-space size than do
standard, iterative techniques for solving systems of linear equations.
This analysis also establishes convergence rate estimates.
Because methods used in RL systems for approximating the evaluation
function of a fixed control policy also approximate solutions
to systems of linear equations, the connection to these Monte Carlo
methods establishes that algorithms very similar to TD algorithms
(Sutton, 1988) are asymptotically more efficient in a precise sense
than other methods for evaluating policies. Further, all DP-based
RL methods have some of the properties of these Monte Carlo algorithms,
which suggests that although RL is often perceived to
be slow, for sufficiently large problems, it may in fact be more efficient
than other known classes of methods capable of producing
the same results.
1 Introduction
Consider a system whose dynamics are described by a finite state Markov chain with
transition matrix P, and suppose that at each time step, in addition to making a
transition from state $x_t = i$ to $x_{t+1} = j$ with probability $p_{ij}$, the system produces
a randomly determined reward, $r_{t+1}$, whose expected value is $R_i$. The evaluation
function, V, maps states to their expected, infinite-horizon discounted returns:

$$V_i = E\left\{ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, x_0 = i \right\},$$

where $\gamma$, $0 \le \gamma < 1$, is the discount factor.
It is well known that V uniquely satisfies a linear system of equations describing
local consistency:

$$V = R + \gamma P V,$$

or

$$(I - \gamma P) V = R. \qquad (1)$$
The problem of computing or estimating V is interesting and important in its 
own right, but perhaps more significantly, it arises as a (rather computationally- 
burdensome) step in certain techniques for solving Markov Decision Problems. In 
each iteration of Policy-Iteration (Howard, 1960), for example, one must determine 
the evaluation function associated with some fixed control policy, a policy that 
improves with each iteration. 
Methods for solving (1) include standard iterative techniques and their variants-- 
successive approximation (Jacobi or Gauss-Seidel versions), successive over- 
relaxation, etc. They also include some of the algorithms used in reinforcement 
learning (RL) systems, such as the family of TD algorithms (Sutton, 1988). Here 
we describe the relationship between the latter methods and a class of unorthodox 
Monte Carlo methods for solving systems of linear equations proposed in the 1950's. 
These methods recast the solution of the linear system as the expected value of a 
statistic suitably defined over sample paths of a Markov chain. 
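
As a point of reference for the iterative techniques mentioned above, the following is a minimal sketch of successive approximation applied to (1), that is, repeated application of V <- R + gamma P V. The three-state chain, the reward vector, the discount factor, and the function name are hypothetical values chosen only for illustration.

```python
def evaluate_by_successive_approximation(P, R, gamma, sweeps=200):
    """Iterate V <- R + gamma * P V (Jacobi-style sweeps), starting from V = 0."""
    n = len(R)
    V = [0.0] * n
    for _ in range(sweeps):
        # Every component is updated from the previous sweep's values.
        V = [R[i] + gamma * sum(P[i][j] * V[j] for j in range(n)) for i in range(n)]
    return V


# Hypothetical three-state chain (each row of P sums to one).
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
R = [1.0, 0.0, 2.0]
gamma = 0.9

V = evaluate_by_successive_approximation(P, R, gamma)

# Largest violation of the consistency condition V = R + gamma * P V.
n = len(R)
residual = max(
    abs(V[i] - (R[i] + gamma * sum(P[i][j] * V[j] for j in range(n))))
    for i in range(n)
)
print(V, residual)
```

Each sweep touches every entry of P, which is the source of the n^2 term in the work estimate for iterative methods discussed in Section 3.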
The significance of our observations lies in arguments (Curtiss, 1954) that these
Monte Carlo methods scale better with respect to state-space size than do standard,
iterative techniques for solving systems of linear equations. This analysis also
establishes convergence rate estimates. Applying this analysis to particular members
of the family of TD algorithms (Sutton, 1988) provides insight into the scaling
properties of the TD family as a whole and the reasons that TD methods can be
effective for problems with very large state sets, such as in the backgammon player
of Tesauro (Tesauro, 1992).
Further, all DP-based RL methods have some of the properties of these Monte
Carlo algorithms, which suggests that although RL is often perceived to be slow,
for sufficiently large problems (Markov Decision Problems with large numbers of
states) it may in fact be more efficient than other known classes of methods
capable of producing the same results. Two shared properties stand out. First, like
many RL methods, the Monte Carlo algorithms do not require explicit knowledge
of the transition matrix, P. Second, unlike standard methods for solving systems
of linear equations, the Monte Carlo algorithms can approximate the solution for
some variables without expending the computational effort required to approximate
the solution for all of the variables. In this respect, they are similar to DP-based
RL algorithms that approximate solutions to Markovian decision processes through
repeated trials of simulated or actual control, thus tending to focus computation
onto regions of the state space that are likely to be relevant in actual control (Barto
et al., 1991).
This paper begins with a condensed summary of Monte Carlo algorithms for solving
systems of linear equations. We show that for the problem of determining an
evaluation function, they reduce to simple, practical implementations. Next, we
recall arguments (Curtiss, 1954) regarding the scaling properties of Monte Carlo
methods compared to iterative methods. Finally, we conclude with a discussion of
the implications of the Monte Carlo technique for certain algorithms useful in RL
systems.
2 Monte Carlo Methods for Solving Systems of Linear Equations
The Monte Carlo approach may be motivated by considering the statistical evaluation
of a simple sum, $\sum_k a_k$. If $\{p_k\}$ denotes a set of values for a probability mass
function that is arbitrary (save for the requirement that $a_k \neq 0$ imply $p_k \neq 0$), then
$\sum_k a_k = \sum_k (a_k / p_k)\, p_k$, which may be interpreted as the expected value of a random
variable Z defined by $\Pr\{Z = a_k / p_k\} = p_k$.
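
The statistical evaluation of a simple sum can be sketched in a few lines. The terms `a`, the mass function `p`, the sample count, and the function name below are illustrative assumptions; the only requirement carried over from the text is that `p[k]` is nonzero wherever `a[k]` is nonzero.

```python
import random


def estimate_sum(a, p, num_samples, rng):
    """Average of Z = a[k] / p[k], with k drawn from the mass function p."""
    indices = range(len(a))
    total = 0.0
    for _ in range(num_samples):
        k = rng.choices(indices, weights=p)[0]
        total += a[k] / p[k]
    return total / num_samples


a = [3.0, -1.0, 0.0, 4.0]   # terms of the sum (hypothetical)
p = [0.4, 0.2, 0.1, 0.3]    # arbitrary mass function, positive wherever a[k] != 0
estimate = estimate_sum(a, p, 100_000, random.Random(0))
print(estimate, sum(a))
```

The choice of `p` does not change the expected value of the estimate, only its variance.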
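From equation (1) and the Neumann series representation of the inverse it is clear
that

$$V = (I - \gamma P)^{-1} R = R + \gamma P R + \gamma^2 P^2 R + \cdots,$$

whose $i$th component is

$$V_i = R_i + \gamma \sum_{i_1} p_{i i_1} R_{i_1} + \gamma^2 \sum_{i_1 i_2} p_{i i_1} p_{i_1 i_2} R_{i_2} + \cdots + \gamma^k \sum_{i_1 \ldots i_k} p_{i i_1} \cdots p_{i_{k-1} i_k} R_{i_k} + \cdots \qquad (2)$$

and it is this series that we wish to evaluate by statistical means.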
A technique originated by Ulam and von Neumann (Forsythe & Leibler, 1950) utilizes
an arbitrarily defined Markov chain with transition matrix $\tilde{P}$ and state set
{1, 2, ..., n} (V is assumed to have n components). The chain begins in state i and
is allowed to make k transitions, where k is drawn from a geometric distribution
with parameter $p_{\mathrm{step}}$; i.e., $\Pr\{k \text{ state transitions}\} = p_{\mathrm{step}}^k (1 - p_{\mathrm{step}})$. The Markov
chain, governed by $\tilde{P}$ and the geometrically-distributed stopping criterion, defines
a mass function assigning probability to every trajectory of every length starting in
state i, $x_0 = i_0 = i \to x_1 = i_1 \to \cdots \to x_k = i_k$, and to each such trajectory there
corresponds a unique term in the sum (2).

For the case of value estimation, "Z" is defined by

$$Z = \frac{\gamma^k \left( \prod_{j=1}^{k} p_{i_{j-1} i_j} \right) R_{i_k}}{\left( \prod_{j=1}^{k} \tilde{p}_{i_{j-1} i_j} \right) p_{\mathrm{step}}^k (1 - p_{\mathrm{step}})},$$

which for $\tilde{P} = P$ and $p_{\mathrm{step}} = \gamma$ becomes

$$Z = \frac{R_{i_k}}{1 - \gamma}.$$

The sample average of sampled values of Z is guaranteed to converge (as the number
of samples grows large) to state i's expected, infinite-horizon discounted return.
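In Wasow's method (Wasow, 1952), the truncated Neumann series

$$\tilde{V}_i = R_i + \gamma \sum_{i_1} p_{i i_1} R_{i_1} + \gamma^2 \sum_{i_1 i_2} p_{i i_1} p_{i_1 i_2} R_{i_2} + \cdots + \gamma^N \sum_{i_1 \ldots i_N} p_{i i_1} \cdots p_{i_{N-1} i_N} R_{i_N}$$

is expressed as $R_i$ plus the expected value of the sum of N random variables
$Z_1, Z_2, \ldots, Z_N$, the intention being that

$$E(Z_k) = \gamma^k \sum_{i_1 \ldots i_k} p_{i i_1} p_{i_1 i_2} \cdots p_{i_{k-1} i_k} R_{i_k}.$$

Let trajectories of length N be generated by the Markov chain governed by $\tilde{P}$. A
given term $\gamma^k p_{i i_1} p_{i_1 i_2} \cdots p_{i_{k-1} i_k} R_{i_k}$ is associated with all trajectories
$i \to i_1 \to i_2 \to \cdots \to i_k \to i_{k+1} \to \cdots \to i_N$ whose first k + 1 states are $i, i_1, \ldots, i_k$. The measure
of this set of trajectories is just $\tilde{p}_{i i_1} \tilde{p}_{i_1 i_2} \cdots \tilde{p}_{i_{k-1} i_k}$. Thus, the random variables $Z_k$,
$k = 1, \ldots, N$, are defined by

$$Z_k = \gamma^k \, \frac{p_{i i_1} p_{i_1 i_2} \cdots p_{i_{k-1} i_k}}{\tilde{p}_{i i_1} \tilde{p}_{i_1 i_2} \cdots \tilde{p}_{i_{k-1} i_k}} \, R_{i_k}.$$

If $\tilde{P} = P$, then the estimate becomes an average of sample truncated, discounted
returns: $\tilde{V}_i = R_i + \gamma R_{i_1} + \gamma^2 R_{i_2} + \cdots + \gamma^N R_{i_N}$.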
The Ulam/von Neumann approach may be reconciled with that of Wasow by processing
a given trajectory a posteriori, converting it into a set of terminated paths
consistent with any choice of stopping-state transition probabilities. For example,
for a stopping state transition probability of $1 - \gamma$, a path of length k has
probability $\gamma^k (1 - \gamma)$. Each "prefix" of the observed path $x(0) \to x(1) \to x(2) \to \cdots$ can
be weighted by the probability of a path of corresponding length, resulting in an
estimate, V, that is the sampled, discounted return:

$$V = \sum_{k=0}^{\infty} \gamma^k R_{x(k)}.$$
3 Complexity 
In (Curtiss, 1954) Curtiss establishes a theoretical comparison of the complexity 
(number of multiplications) required by the Ulam/von Neumann method and a 
stationary linear iterative process for computing a single component of the solution 
to a system of linear equations. Curtiss develops an analytic formula for bounds 
on the conditional mean and variance of the Monte-Carlo sample estimate, V, and 
mean and variance of a sample path's time to absorption, then appeals to the
Central Limit Theorem to establish a 95%-confidence interval for the complexity of
his method to reduce the initial error by a given factor, $\xi$.¹

Figure 1: Break-even size of state space versus accuracy. [Plot not reproduced; the vertical axis is n, and both axes run from 0 to 1000.]
For the case of value-estimation, Curtiss' formula for the Monte-Carlo complexity
may be written as

$$\mathrm{WORK}_{\text{Monte-Carlo}} = \frac{1}{1 - \gamma} \left( 1 + \frac{2}{\xi^2} \right). \qquad (3)$$
This is compared to the complexity of the iterative method, which for the
value-estimation problem takes the form of the classical dynamic programming recursion,
$V^{(n+1)} = R + \gamma P V^{(n)}$:

$$\mathrm{WORK}_{\text{iterative}} = \left( 1 + \frac{\log \xi}{\log \gamma} \right) n^2 + n.$$
The iterative method's complexity has the form $a n^2 + n$, with a > 1, while the
Monte-Carlo complexity is independent of n--it is most sensitive to the amount of
error reduction desired, signified by $\xi$. Thus, given a fixed amount of computation,
for large enough n, the Monte-Carlo method is likely (with 95% confidence level) to
produce better estimates. The theoretical "break-even" points are plotted in Figure
1, and Figure 2 plots work versus state-space size for example values of $\gamma$ and $\xi$.
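
The break-even comparison can be sketched directly from the two work expressions. The code below assumes the Monte Carlo work is (1 / (1 - gamma)) (1 + 2 / xi^2) and the iterative work is (1 + log xi / log gamma) n^2 + n; both are readings of the formulas in this section and should be treated as assumptions, as should the function names.

```python
import math


def work_monte_carlo(gamma, xi):
    """Assumed Monte Carlo work for one component; independent of n."""
    return (1.0 / (1.0 - gamma)) * (1.0 + 2.0 / xi ** 2)


def work_iterative(gamma, xi, n):
    """Assumed work of successive approximation on an n-state problem."""
    return (1.0 + math.log(xi) / math.log(gamma)) * n ** 2 + n


def break_even_n(gamma, xi):
    """Smallest n at which the iterative work exceeds the Monte Carlo work."""
    n = 1
    while work_iterative(gamma, xi, n) <= work_monte_carlo(gamma, xi):
        n += 1
    return n


# Parameter values matching the caption of Figure 2.
gamma, xi = 0.5, 0.01
n_star = break_even_n(gamma, xi)
print(n_star, work_monte_carlo(gamma, xi), work_iterative(gamma, xi, n_star))
```

Under these assumed expressions, tightening the accuracy requirement (smaller xi) raises the Monte Carlo work much faster than the iterative work, so the break-even state-space size grows as xi shrinks.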
¹That is, for the iterative method, $\xi$ is defined via $\|V^{(\infty)} - V^{(n)}\| < \xi \|V^{(\infty)} - V^{(0)}\|$,
while for the Monte Carlo method, $\xi$ is defined via $|V^{(\infty)}(i) - \bar{V}_M| < \xi \|V^{(\infty)} - V^{(0)}\|$,
where $\bar{V}_M$ is the average over M sample V's.
Figure 2: Work versus number of states for $\gamma = .5$ and $\xi = .01$. [Plot not reproduced; curves: Iterative, Monte Carlo, Gauss; horizontal axis: n, 0 to 100; vertical axis: 0 to 50000.]
4 Discussion 
It was noted that the analytic complexity Curtiss develops is for the work required
to compute one component of a solution vector. In the worst case, all components
could be estimated by constructing n separate, independent estimators. This would
multiply the Monte-Carlo complexity by a factor of n, and its scaling supremacy
would be only marginally preserved. A more efficient approach would utilize data
obtained in the course of estimating one component to estimate other components
as well; Rubinstein (Rubinstein, 1981) describes one way of doing this, using the
notion of "covering paths." Also, it should be mentioned that substituting more
sophisticated iterative methods, such as Gauss-Seidel, in place of the simple successive
approximation scheme considered here, serves only to improve the condition
number of the underlying iterative operator--the amount of computation required
by iterative methods remains $a n^2 + n$, for some a > 1.
An attractive feature of the analysis provided by Curtiss is that, in effect, it
yields information regarding the convergence rate of the method; that is, Equation
4 can be re-arranged in terms of $\xi$. Figure 3 plots $\xi$ versus work for example values
of $\gamma$ and n.
Figure 3: Error reduction versus work for $\gamma = .9$ and n = 100. [Plot not reproduced; curves: Iterative, Monte Carlo; horizontal axis: Work, 0 to 50000; vertical axis: 0.0 to 1.0.]

The simple Monte Carlo scheme considered here is practically identical to the
limiting case of TD-$\lambda$ with $\lambda$ equal to one (TD-1 differs in that its averaging of
sampled, discounted returns is weighted with recency). Ongoing work (Duff) explores
the connection between TD-$\lambda$ (Sutton, 1988), for general values of $\lambda$, and
Monte Carlo methods augmented by certain variance reduction techniques. Also,
Barnard (Barnard) has noted that TD-0 may be viewed as a stochastic approximation
method for solving (1).
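
The distinction between a plain sample average of returns and an average weighted with recency can be illustrated as follows. This is only a sketch: the constant step size `alpha`, the zero initial estimate, the list of sampled returns, and the function names are all assumptions made for illustration, not details taken from the TD literature cited here.

```python
def sample_average(returns):
    """Incremental form of the plain average: step size 1/m on the m-th return."""
    estimate = 0.0
    for m, g in enumerate(returns, start=1):
        estimate += (g - estimate) / m
    return estimate


def recency_weighted_average(returns, alpha, initial=0.0):
    """Constant step size: later returns receive geometrically more weight."""
    estimate = initial
    for g in returns:
        estimate += alpha * (g - estimate)
    return estimate


# Hypothetical sampled, discounted returns from four trials.
returns = [1.0, 2.0, 3.0, 6.0]

plain = sample_average(returns)
recent = recency_weighted_average(returns, alpha=0.5)
print(plain, recent)
```

With a step size of 1/m the two update rules coincide, which is the sense in which the simple Monte Carlo scheme is a limiting case.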
On-line RL methods for solving Markov Decision Problems, such as Real-Time
Dynamic Programming (RTDP) (Barto et al., 1991), share key features with the
Monte Carlo method. Like many RL algorithms, RTDP does not require explicit
knowledge of the transition matrix, P, and neither, of course, do the Monte Carlo
algorithms. RTDP approximates solutions to Markov Decision Problems through
repeated trials of simulated or actual control, focusing computation upon regions of 
the state space likely to be relevant in actual control. This computational "focusing" 
is also a feature of the Monte Carlo algorithms. While it is true that a focusing 
of sorts is exhibited by Monte Carlo algorithms in an obvious way by virtue of 
the fact that they can compute approximate solutions for single components of 
solution vectors without exerting the computational labor required to compute all 
solution components, a more subtle form of computational focusing also occurs. 
Some of the terms in the Neumann series (2) may be very unimportant and need 
not be represented in the statistical estimator at all. The Monte Carlo method's 
stochastic estimation process achieves this automatically by, in effect, making the 
appearance of the representative of a non-essential term a very rare event. 
These correspondences--between TD-0 and stochastic approximation, between TD-$\lambda$
and Monte Carlo methods with variance reduction, between DP-based RL algorithms
for solving Markov Decision Problems and Monte Carlo algorithms--
together with the comparatively favorable scaling and convergence properties en- 
joyed by the simple Monte Carlo method discussed in this paper, suggest that DP- 
based RL methods like TD/stochastic-approximation or RTDP, though perceived 
to be slow, may actually be advantageous for problems having a sufficiently large 
number of states. 
Acknowledgement 
This material is based upon work supported by the National Science Foundation 
under Grant ECS-9214866. 
References 
E. Barnard. Temporal-Difference Methods and Markov Models. Submitted for 
publication. 
A. Barto, S. Bradtke, & S. Singh. (1991) Real-Time Learning and Control Using 
Asynchronous Dynamic Programming. Computer Science Department, University 
of Massachusetts, Tech. Rept. 91-57. 
J. Curtiss. (1954) A Theoretical Comparison of the Efficiencies of Two Classical
Methods and a Monte Carlo Method for Computing One Component of the Solution
of a Set of Linear Algebraic Equations. In H. A. Meyer (ed.), Symposium on Monte
Carlo Methods, 191-233. New York, NY: Wiley.
M. Duff. A Control Variate Perspective for the Optimal Weighting of Truncated, 
Corrected Returns. In Preparation. 
G. Forsythe & R. Leibler. (1950) Matrix Inversion by a Monte Carlo Method. Math.
Tables Other Aids Comput., 4:127-129.
R. Howard. (1960) Dynamic Programming and Markov Processes. Cambridge, MA:
MIT Press.
R. Rubinstein. (1981) Simulation and the Monte Carlo Method. New York, NY: 
Wiley. 
R. Sutton. (1988) Learning to Predict by the Method of Temporal Differences. 
Machine Learning 3:9-44. 
G. Tesauro. (1992) Practical Issues in Temporal Difference Learning. Machine 
Learning 8:257-277. 
W. Wasow. (1952) A Note on the Inversion of Matrices by Random Walks. Math.
Tables Other Aids Comput., 6:78-81.
