Reinforcement Learning with Soft State 
Aggregation 
Satinder P. Singh Tommi Jaakkola Michael I. Jordan 
singh@psyche.mit.edu tommi@psyche.mit.edu jordan@psyche.mit.edu 
Dept. of Brain & Cognitive Sciences (E-10) 
M.I.T. 
Cambridge, MA 02139 
Abstract 
It is widely accepted that the use of more compact representations 
than lookup tables is crucial to scaling reinforcement learning (RL) 
algorithms to real-world problems. Unfortunately almost all of the 
theory of reinforcement learning assumes lookup table representa- 
tions. In this paper we address the pressing issue of combining 
function approximation and RL, and present 1) a function approx- 
imator based on a simple extension to state aggregation (a com- 
monly used form of compact representation), namely soft state 
aggregation, 2) a theory of convergence for RL with arbitrary, but 
fixed, soft state aggregation, 3) a novel intuitive understanding of 
the effect of state aggregation on online RL, and 4) a new heuristic 
adaptive state aggregation algorithm that finds improved compact 
representations by exploiting the non-discrete nature of soft state 
aggregation. Preliminary empirical results are also presented. 
1 INTRODUCTION
The strong theory of convergence available for reinforcement learning algorithms 
(e.g., Dayan & Sejnowski, 1994; Watkins & Dayan, 1992; Jaakkola, Jordan & Singh,
1994; Tsitsiklis, 1994) makes them attractive as a basis for building learning con- 
trol architectures to solve a wide variety of search, planning, and control problems. 
Unfortunately, almost all of the convergence results assume lookup table
representations for value functions (see Sutton, 1988; Dayan, 1992; Bradtke, 1993;
and Van Roy & Tsitsiklis, personal communication, for exceptions). It is widely accepted that
the use of more compact representations than lookup tables is crucial to scaling RL 
algorithms to real-world problems. 
In this paper we address the pressing issue of combining function approximation and 
RL, and present 1) a function approximator based on a simple extension to state 
aggregation (a commonly used form of compact representation, e.g., Moore, 1991), 
namely soft state aggregation, 2) a theory of convergence for RL with arbitrary, but 
fixed, soft state aggregation, 3) a novel intuitive understanding of the effect of state 
aggregation on online RL, and 4) a new heuristic adaptive state aggregation algo- 
rithm that finds improved compact representations by exploiting the non-discrete 
nature of soft state aggregations. Preliminary empirical results are also presented. 
Problem Definition and Notation: We consider the problem of solving large 
Markovian decision processes (MDPs) using RL algorithms and compact function 
approximation. We use the following notation: $S$ for state space, $A$ for action space,
$P^a(s, s')$ for transition probability, $R^a(s)$ for payoff, and $\gamma$ for discount factor. The
objective is to maximize the expected, infinite horizon, discounted sum of payoffs. 
1.1 FUNCTION APPROXIMATION: SOFT STATE CLUSTERS 
In this section we describe a new function approximator (FA) for RL. In section 3 
we will analyze it theoretically and present convergence results. The FA maps the 
state space $S$ into $M > 0$ aggregates or clusters from cluster space $X$. Typically,
$M \ll |S|$. We allow soft clustering, where each state $s$ belongs to cluster $x$ with
probability $P(x|s)$; these are called the clustering probabilities. This allows each state $s$ to
belong to several clusters. An interesting special case is that of the usual state 
aggregation where each state belongs only to one cluster. The theoretical model is 
that the agent can observe the underlying state but can only update a value function 
for the clusters. The value of a cluster generalizes to all states in proportion to 
the clustering probabilities. Throughout we use the symbols $x$ and $y$ to represent
individual clusters and the symbols $s$ and $s'$ to represent individual states.
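As a small illustration of how cluster values generalize to states, the sketch below computes $V(s) = \sum_x P(x|s) V(x)$ for a toy problem. The function name `state_values` and the three-state, two-cluster numbers are assumptions made for this example only; they do not come from the paper.

```python
from typing import List


def state_values(cluster_probs: List[List[float]],
                 cluster_values: List[float]) -> List[float]:
    """Return V(s) = sum_x P(x|s) V(x) for every state s.

    cluster_probs[s][x] holds the clustering probability P(x|s).
    """
    values = []
    for row in cluster_probs:
        if abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each row of P(x|s) must sum to 1")
        values.append(sum(p * v for p, v in zip(row, cluster_values)))
    return values


# Hypothetical toy problem: three states, two clusters.
# Soft aggregation: state 1 belongs to both clusters with probability 0.5.
soft = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
# Hard aggregation is the special case in which every row is one-hot.
hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
cluster_v = [2.0, 4.0]

print(state_values(soft, cluster_v))  # [2.0, 3.0, 4.0]
print(state_values(hard, cluster_v))  # [2.0, 2.0, 4.0]
```

With soft clustering the shared state receives a value between the two cluster values, whereas hard aggregation forces it to copy one cluster exactly.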
2 A GENERAL CONVERGENCE THEOREM 
An online RL algorithm essentially sees a sequence of quadruples, $\langle s_t, a_t, s_{t+1}, r_t \rangle$,
representing a transition from current state $s_t$ to next state $s_{t+1}$ on current action
$a_t$ with an associated payoff $r_t$. We will first prove a general convergence theorem
for Q-learning (Watkins & Dayan, 1992) applied to a sequence of quadruples that 
may or may not be generated by a Markov process (Bertsekas, 1987). This is 
required because the RL problem at the level of the clusters may be non-Markovian. 
Conceptually, the sequence of quadruples can be thought of as being produced by 
some process that is allowed to modify the sequence of quadruples produced by a 
Markov process, e.g., by mapping states to clusters. In Section 3 we will specialize 
the following theorem to provide specific results for our function approximator. 
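To make the setting concrete, here is a minimal sketch of Q-learning driven directly by a sequence of quadruples, with no assumption about how the sequence was generated. The function name `q_learning`, the constant step size, and the two-quadruple sequence are assumptions for illustration; the convergence theory itself requires decaying learning rates.

```python
from typing import List, Sequence, Tuple

Quadruple = Tuple[int, int, int, float]  # (current, action, next, payoff)


def q_learning(quads: Sequence[Quadruple], n_x: int, n_a: int,
               gamma: float, alpha: float) -> List[List[float]]:
    """Apply the online Q-learning update once per quadruple, in order."""
    q = [[0.0] * n_a for _ in range(n_x)]
    for x, a, y, r in quads:
        target = r + gamma * max(q[y])      # one-step lookahead at y
        q[x][a] += alpha * (target - q[x][a])
    return q


# Hypothetical sequence over two states (or clusters) and one action.
quads = [(0, 0, 1, 1.0), (1, 0, 0, 0.0)]
q = q_learning(quads, n_x=2, n_a=1, gamma=0.9, alpha=0.5)
print(q)
```

The update only reads the quadruple itself, which is why the same code applies whether the indices denote underlying states or clusters.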
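Consider any stochastic process that generates a sequence of random quadruples,
$\Psi = \{\langle x_i, a_i, y_i, r_i \rangle\}_i$, where $x_i, y_i \in Y$, $a_i \in A$, and $r_i$ is a bounded real number.
Note that $x_{i+1}$ does not have to be equal to $y_i$. Let $|Y|$ and $|A|$ be finite, and define
indicator variables
$$\chi_i(x, a, y) = \begin{cases} 1 & \text{when } \Psi_i = \langle x, a, y, \cdot \rangle \text{ (for any } r\text{)} \\ 0 & \text{otherwise,} \end{cases}$$
and
$$\chi_i(x, a) = \begin{cases} 1 & \text{when } \Psi_i = \langle x, a, \cdot, \cdot \rangle \text{ (for any } y\text{, and any } r\text{)} \\ 0 & \text{otherwise.} \end{cases}$$
Define
$$P^a_{i,j}(x, y) = \frac{\sum_{k=i}^{j} \chi_k(x, a, y)}{\sum_{k=i}^{j} \chi_k(x, a)}$$
and
$$R^a_{i,j}(x) = \frac{\sum_{k=i}^{j} r_k \chi_k(x, a)}{\sum_{k=i}^{j} \chi_k(x, a)}.$$
Theorem 1: If $\forall \epsilon > 0$, $\exists M_\epsilon < \infty$, such that for all $i > 0$, for all $x, y \in Y$, and
for all $a \in A$, the following conditions characterize the infinite sequence $\Psi$: with
probability $1 - \epsilon$,
$$|P^a_{i,i+M_\epsilon}(x, y) - P^a_{0,\infty}(x, y)| < \epsilon, \text{ and } |R^a_{i,i+M_\epsilon}(x) - R^a_{0,\infty}(x)| < \epsilon, \tag{1}$$
where for all $x$, $a$, and $y$, with probability one $P^a_{0,\infty}(x, y) = \bar{P}^a(x, y)$, and
$R^a_{0,\infty}(x) = \bar{R}^a(x)$. Then, online Q-learning applied to such a sequence will converge with
probability one to the solution of the following system of equations: $\forall x \in Y$, and
$\forall a \in A$,
$$Q(x, a) = \bar{R}^a(x) + \gamma \sum_{y \in Y} \bar{P}^a(x, y) \max_{a'} Q(y, a') \tag{2}$$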
Proof: Consider the semi-batch version of Q-learning that collects the changes
to the value function for $M$ steps before making the change. By assumption, for
any $\epsilon$, making $M_\epsilon$ large enough will ensure that with probability $1 - \epsilon$, the sample
quantities for the $i$-th batch, $P^a_{i,i+M_\epsilon}(x, y)$ and $R^a_{i,i+M_\epsilon}(x)$, are within $\epsilon$ of the
asymptotic quantities. In Appendix A we prove that the semi-batch version of Q- 
learning outlined above converges to the solution of Equation 2 with probability one. 
The semi-batch proof can be extended to online Q-learning by using the analysis 
developed in Theorem 3 of Jaakkola et al. (1994). In brief, it can be shown that
the difference caused by the online updating vanishes in the limit thereby forcing 
semi-batch Q-learning and online Q-learning to be equal asymptotically. The use 
of the analysis in Theorem 3 from Jaakkola et al. (1994) requires that the learning
rate parameters $\alpha$ are such that $\frac{\alpha_t(x)}{\max_{t \in M_\epsilon(k)} \alpha_t(x)} \to 1$ uniformly w.p.1, where $M_\epsilon(k)$ is
the $k$-th batch of size $M_\epsilon$. If $\alpha_t(x)$ is non-increasing in addition to satisfying the
conventional Q-learning conditions, then it will also meet the above requirement.
Theorem 1 provides the most general convergence result available for Q-learning
(and TD(0)); it shows that for an arbitrary quadruple sequence satisfying the
ergodicity conditions given in Equation 1, Q-learning will converge to the solution
of the MDP constructed with the limiting probabilities ($P_{0,\infty}$) and payoffs ($R_{0,\infty}$).
Theorem 1 combines and generalizes the results on hard state aggregation and
value iteration presented in Van Roy & Tsitsiklis (personal communication), and on
partially observable MDPs in Singh et al. (1994).
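The limiting MDP in Theorem 1 can be illustrated with a short sketch that estimates the sample transition probabilities and payoffs from a finite quadruple sequence and then solves Equation 2 by repeated substitution. The function names `empirical_model` and `solve_equation_2`, the number of sweeps, and the toy sequence are assumptions for this example; a finite sample only approximates the limiting quantities.

```python
from typing import Sequence, Tuple

Quadruple = Tuple[int, int, int, float]  # (x, a, y, r)


def empirical_model(quads: Sequence[Quadruple], n_x: int, n_a: int):
    """Return sample estimates p[x][a][y] and r_bar[x][a] from the quadruples."""
    count = [[0] * n_a for _ in range(n_x)]                        # sum of chi_i(x, a)
    trans = [[[0] * n_x for _ in range(n_a)] for _ in range(n_x)]  # sum of chi_i(x, a, y)
    pay = [[0.0] * n_a for _ in range(n_x)]                        # sum of r_i chi_i(x, a)
    for x, a, y, r in quads:
        count[x][a] += 1
        trans[x][a][y] += 1
        pay[x][a] += r
    if any(c == 0 for row in count for c in row):
        raise ValueError("every (x, a) pair must occur at least once")
    p = [[[trans[x][a][y] / count[x][a] for y in range(n_x)]
          for a in range(n_a)] for x in range(n_x)]
    r_bar = [[pay[x][a] / count[x][a] for a in range(n_a)] for x in range(n_x)]
    return p, r_bar


def solve_equation_2(p, r_bar, gamma: float, sweeps: int = 1000):
    """Iterate Q <- R + gamma * sum_y P(x, y) max_a' Q(y, a') to a fixed point."""
    n_x, n_a = len(r_bar), len(r_bar[0])
    q = [[0.0] * n_a for _ in range(n_x)]
    for _ in range(sweeps):
        v = [max(row) for row in q]
        q = [[r_bar[x][a] + gamma * sum(p[x][a][y] * v[y] for y in range(n_x))
              for a in range(n_a)] for x in range(n_x)]
    return q


# Hypothetical sequence; note that x_{i+1} need not equal y_i.
quads = [(0, 0, 1, 1.0), (0, 0, 0, 0.0), (1, 0, 1, 0.0),
         (0, 0, 1, 1.0), (0, 0, 0, 0.0)]
p, r_bar = empirical_model(quads, n_x=2, n_a=1)
q = solve_equation_2(p, r_bar, gamma=0.8)
print(p[0][0], r_bar[0][0], q[0][0])
```

Because $\gamma < 1$ and each row of the estimated transition matrix sums to one, the substitution step is a contraction and the loop settles on the unique solution.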
3 RL AND SOFT STATE AGGREGATION 
In this section we apply Theorem 1 to provide convergence results for two cases: 1) 
using Q-learning and our FA to solve MDPs, and 2) using Sutton's (1988) TD(0) 
and our FA to determine the value function for a fixed policy. As is usual in online 
RL, we continue to assume that the transition probabilities and the payoff function 
of the MDP are unknown to the learning agent. Furthermore, being online such 
algorithms cannot sample states in arbitrary order. In this section, the clustering 
probabilities $P(x|s)$ are assumed to be fixed.
Case 1: Q-learning and Fixed Soft State Aggregation 
Because of function approximation, the domain of the learned Q-value function is
constrained to be $X \times A$ ($X$ is cluster space). This section develops a "Bellman
equation" (e.g., Bertsekas, 1987) for Q-learning at the level of the cluster space. We
assume that the agent follows a stationary stochastic policy $\pi$ that assigns a
non-zero probability to every action in every state. Furthermore,
we assume that the Markov chain under policy $\pi$ is ergodic. Such a policy $\pi$ is a
persistently exciting policy. Under the above conditions, by Bayes' rule,
$P^\pi(s|x) = \frac{P(x|s) P^\pi(s)}{\sum_{s'} P(x|s') P^\pi(s')}$,
where for all $s$, $P^\pi(s)$ is the steady-state probability of being in state $s$.
Corollary 1: Q-learning with soft state aggregation applied to an MDP while
following a persistently exciting policy $\pi$ will converge with probability one to the
solution of the following system of equations: $\forall (x, a) \in (X \times A)$,
$$Q(x, a) = \sum_s P^\pi(s|x) \left[ R^a(s) + \gamma \sum_y P^a(s, y) \max_{a'} Q(y, a') \right] \tag{3}$$
and $P^a(s, y) = \sum_{s'} P^a(s, s') P(y|s')$. The Q-value function for the state space can
then be constructed via $Q(s, a) = \sum_x P(x|s) Q(x, a)$ for all $(s, a)$.
Proof: It can be shown that the sequence of quadruples produced by following
policy $\pi$ and independently mapping the current state $s$ to a cluster $x$ with probability
$P(x|s)$ satisfies the conditions of Theorem 1. Also, it can be shown that
$$\bar{P}^a(x, y) = \sum_s P^\pi(s|x) P^a(s, y), \text{ and } \bar{R}^a(x) = \sum_s P^\pi(s|x) R^a(s).$$
Note that the Q-values found by clustering are dependent on the sampling policy
$\pi$, unlike the lookup table case.
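The cluster-level Bellman equation of Corollary 1 can be solved directly when the model is known, which gives a reference point for what Q-learning with soft aggregation should converge to. The sketch below does this by repeated substitution; the function names `solve_equation_3` and `state_q_values`, the array layouts, and the toy MDP are assumptions for this example, and a known model is assumed here even though the paper's agent does not have one.

```python
def solve_equation_3(p, r, phi, post, gamma, sweeps=1000):
    """Solve the cluster-level Bellman equation by repeated substitution.

    p[a][s][t] = P^a(s, t), r[a][s] = R^a(s),
    phi[s][x] = P(x|s), post[x][s] = P(s|x). Returns q[x][a].
    """
    n_a, n_s, n_x = len(p), len(phi), len(phi[0])
    # State-to-cluster transitions: P^a(s, y) = sum_t P^a(s, t) P(y|t).
    p_sy = [[[sum(p[a][s][t] * phi[t][y] for t in range(n_s))
              for y in range(n_x)] for s in range(n_s)] for a in range(n_a)]
    q = [[0.0] * n_a for _ in range(n_x)]
    for _ in range(sweeps):
        v = [max(row) for row in q]
        q = [[sum(post[x][s] * (r[a][s] + gamma *
                                sum(p_sy[a][s][y] * v[y] for y in range(n_x)))
                  for s in range(n_s))
              for a in range(n_a)] for x in range(n_x)]
    return q


def state_q_values(phi, q):
    """Construct Q(s, a) = sum_x P(x|s) Q(x, a)."""
    return [[sum(phi[s][x] * q[x][a] for x in range(len(q)))
             for a in range(len(q[0]))] for s in range(len(phi))]


# Hypothetical MDP: action 0 stays, action 1 switches; payoff 1 in state 1.
p = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
r = [[0.0, 1.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # one cluster per state (lookup table)
q = solve_equation_3(p, r, phi=identity, post=identity, gamma=0.5)
q_states = state_q_values(identity, q)
print(q)
```

With one cluster per state the equation reduces to the ordinary Bellman optimality equation, which makes the toy result easy to check by hand; with genuinely soft clusters the solution depends on the sampling policy through `post`.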
Case 2: TD(0) and Fixed Soft State Aggregation 
We present separate results for TD(0) because it forms the basis for policy-iteration- 
like methods for solving Markov control problems (e.g., Barto, Sutton & Anderson, 
1983) -- a fact that we will use in the next section to derive adaptive state aggre- 
gation methods. As before, because of function approximation, the domain of the 
learned value function is constrained to be the cluster space $X$.
Corollary 2: TD(0) with soft state aggregation applied to an MDP while following
a policy $\pi$ will converge with probability one to the solution of the following system
of equations: $\forall x \in X$,
$$V(x) = \sum_s P^\pi(s|x) \left[ R^\pi(s) + \gamma \sum_y P^\pi(s, y) V(y) \right] \tag{4}$$
where again as in Q-learning the value function for the state space can be
constructed via $V(s) = \sum_x P(x|s) V(x)$ for all $s$.
Proof: Corollary 1 implies Corollary 2 because TD(0) is a special case of Q-learning
for MDPs with a single (possibly randomized) action in each state. Equation 4
provides a "Bellman equation" for TD(0) at the level of the cluster space.
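A minimal sketch of online TD(0) with soft state aggregation is given below: at each step the current and next states are mapped independently to clusters drawn from the clustering probabilities, and only cluster values are updated. The function name `td0_soft_aggregation`, the constant step size, the fixed random seed, and the toy chain are assumptions for illustration; the theory calls for decaying learning rates.

```python
import random


def td0_soft_aggregation(p_pi, r_pi, phi, gamma, alpha, steps, seed=0):
    """Run online TD(0) on cluster values while simulating the chain.

    p_pi[s][t] = transition probability under the policy, r_pi[s] = payoff,
    phi[s][x] = clustering probability P(x|s). Returns the cluster values V(x).
    """
    rng = random.Random(seed)
    n_s, n_x = len(phi), len(phi[0])
    states, clusters = range(n_s), range(n_x)
    v = [0.0] * n_x
    s = 0
    for _ in range(steps):
        s_next = rng.choices(states, weights=p_pi[s])[0]
        # The agent observes s and s_next but updates only cluster values.
        x = rng.choices(clusters, weights=phi[s])[0]
        y = rng.choices(clusters, weights=phi[s_next])[0]
        v[x] += alpha * (r_pi[s] + gamma * v[y] - v[x])
        s = s_next
    return v


# Hypothetical deterministic two-state cycle with one cluster per state.
p_pi = [[0.0, 1.0], [1.0, 0.0]]
r_pi = [0.0, 1.0]
phi = [[1.0, 0.0], [0.0, 1.0]]
v = td0_soft_aggregation(p_pi, r_pi, phi, gamma=0.5, alpha=0.1, steps=5000)
print(v)
```

The deterministic cycle with one cluster per state is chosen so that the result can be checked against the exact values $V(0) = 2/3$ and $V(1) = 4/3$; replacing `phi` with rows that are not one-hot gives the soft case.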
4 ADAPTIVE STATE AGGREGATION 
In previous sections we restricted attention to a function approximator that had a 
fixed compact representation. How might one adapt the compact representation on- 
line in order to get better approximations of value functions? This section presents 
a novel heuristic adaptive algorithm that improves the compact representation by 
finding good clustering probabilities given an a priori fixed number of clusters. Note 
that for arbitrary clustering, while Corollaries 1 and 2 show that RL will find
solutions with zero Bellman error in cluster space, the associated Bellman error in
the state space will not be zero in general. Good clustering is therefore naturally 
defined in terms of reducing the Bellman error for the states of the MDP. 
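Let the clustering probabilities be parametrized as follows: $P(x|s; \theta) = \frac{e^{\theta(x,s)}}{\sum_{x'} e^{\theta(x',s)}}$,
where $\theta(x, s)$ is the weight between state $s$ and cluster $x$. Then the Bellman error
at state $s$ given parameter $\theta$ (a matrix) is,
$$J(s|\theta) = V(s|\theta) - \left[ R^\pi(s) + \gamma \sum_{s'} P^\pi(s, s') V(s'|\theta) \right]$$
Adaptive State Aggregation (ASA) Algorithm:
Step 1: Compute $V(x|\theta)$ for all $x \in X$ using the TD(0) algorithm.
Step 2: Let $\Delta\theta = -\alpha \frac{\partial \sum_s J^2(s|\theta)}{\partial \theta}$. Go to step 1.
where Step 2 tries to minimize the Bellman error for the states by holding the
cluster values fixed to those computed in Step 1. We have
$$\frac{\partial J^2(s|\theta)}{\partial \theta(y, s)} = 2 J(s|\theta) \left[ P(y|s; \theta) (1 - \gamma P^\pi(s, s)) (V(y|\theta) - V(s|\theta)) \right].$$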
The Bellman error $J(s|\theta)$ cannot be computed directly because the transition
probabilities $P^\pi(s, s')$ are unknown. However, it can be estimated by averaging the sample
Bellman error. $P(y|s; \theta)$ is known, and $(1 - \gamma P^\pi(s, s))$ is always positive, and
independent of $y$, and can therefore be absorbed into the step-size $\alpha$. The quantities
$V(y|\theta)$ and $V(s|\theta)$ are available at the end of Step 1. In practice, Step 1 is only
carried out partially before Step 2 is implemented. Partial evaluation works well
because the changes in the clustering probabilities at Step 2 are small, and because
the final $V(x|\theta)$ at the previous Step 1 is used to initialize the computation of $V(x|\theta)$
at the next Step 1.
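Step 2 of ASA can be sketched as follows, assuming a softmax parametrization of the clustering probabilities and, for the sake of a checkable example, a known transition matrix so that the Bellman error is computed exactly rather than estimated from samples. The function names, the `theta[s][x]` layout (the text writes $\theta(x, s)$), and the numbers are assumptions; the cluster values passed in stand for the output of Step 1.

```python
import math


def softmax_probs(theta):
    """Return phi[s][x] = exp(theta[s][x]) / sum_x' exp(theta[s][x'])."""
    probs = []
    for row in theta:
        m = max(row)                       # subtract the max for stability
        e = [math.exp(t - m) for t in row]
        z = sum(e)
        probs.append([val / z for val in e])
    return probs


def bellman_errors(theta, v_cluster, p_pi, r_pi, gamma):
    """Return (J, V_state, phi) with J[s] the Bellman error at state s."""
    phi = softmax_probs(theta)
    v_state = [sum(pr * v for pr, v in zip(row, v_cluster)) for row in phi]
    n = len(v_state)
    j = [v_state[s] - (r_pi[s] + gamma *
                       sum(p_pi[s][t] * v_state[t] for t in range(n)))
         for s in range(n)]
    return j, v_state, phi


def asa_gradient(theta, v_cluster, p_pi, r_pi, gamma):
    """Gradient of J(s)^2 w.r.t. theta[s][y], cluster values held fixed."""
    j, v_state, phi = bellman_errors(theta, v_cluster, p_pi, r_pi, gamma)
    return [[2.0 * j[s] * phi[s][y] * (1.0 - gamma * p_pi[s][s])
             * (v_cluster[y] - v_state[s])
             for y in range(len(v_cluster))] for s in range(len(theta))]


def asa_step(theta, v_cluster, p_pi, r_pi, gamma, alpha):
    """One gradient step on the clustering weights (Step 2)."""
    grad = asa_gradient(theta, v_cluster, p_pi, r_pi, gamma)
    return [[theta[s][y] - alpha * grad[s][y] for y in range(len(theta[s]))]
            for s in range(len(theta))]


# Hypothetical three-state chain, two clusters, cluster values from Step 1.
theta = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]
v_cluster = [1.0, 3.0]
p_pi = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
r_pi = [1.0, 0.0, 2.0]
new_theta = asa_step(theta, v_cluster, p_pi, r_pi, gamma=0.9, alpha=0.05)
```

In an online setting the exact error `j[s]` would be replaced by an average of sampled Bellman errors, and the positive factor `1 - gamma * p_pi[s][s]` could be folded into the step size as the text describes.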
[Plot: error curves for 2, 4, 10, and 20 clusters; x-axis: Iterations of ASA (0 to 100).]
Figure 1: Adaptive State Clustering. See text for explanation.
Figure 1 presents preliminary empirical results for the ASA algorithm. It plots the 
squared Bellman error summed over the state space as a function of the number of 
iterations of the ASA algorithm with constant step-size $\alpha$. It shows error curves
for 2, 4, 10 and 20 clusters averaged over ten runs of randomly constructed
20-state Markov chains. Figure 1 shows that ASA is able to adapt the clustering
probabilities to reduce the Bellman error in state space, and as expected the more 
clusters the smaller the asymptotic Bellman error. In future work we plan to test 
the policy iteration version of the adaptive soft aggregation algorithm on Markov 
control problems. 
5 SUMMARY AND FUTURE WORK 
Doing RL on aggregated states is potentially very advantageous because the value of 
each cluster generalizes across all states in proportion to the clustering probabilities. 
The same generalization is also potentially perilous because it can interfere with the 
contraction-based convergence of RL algorithms (see Yee, 1992, for a discussion).
This paper resolves this debate for the case of soft state aggregation by defining a set 
of Bellman Equations (3 and 4) for the control and policy evaluation problems in the 
non-Markovian cluster space, and by proving that Q-learning and TD(0) solve them 
respectively with probability one. Theorem 1 presents a general convergence result
that was applied to state aggregation in this paper, but is also a generalization of 
the results on hidden state presented in Singh et al. (1994), and may be applicable 
to other novel problems. It supports the intuitive picture that if a non-Markovian 
sequence of state transitions and payoffs is ergodic in the sense of Equation 1, then 
RL algorithms will converge w.p.1. to the solution of an MDP constructed with the 
limiting transition probabilities and payoffs. 
We also presented a new algorithm, ASA, for adapting compact representations, 
that takes advantage of the soft state aggregation proposed here to do gradient de- 
scent in clustering probability space to minimize squared Bellman error in the state 
space. We demonstrated on simple examples that ASA is able to adapt the cluster- 
ing probabilities to dramatically reduce the Bellman error in state space. In future 
work we plan to extend the convergence theory presented here to discretizations of 
continuous state MDPs, and to further test the ASA algorithms. 
A Convergence of semi-batch Q-learning (Theorem 1) 
Consider a semi-batch algorithm that collects the changes to the Q-value function 
for M steps before making the change to the Q-value function. Let 
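$$R_k(x, a) = \sum_{i=(k-1)M}^{kM} r_i \chi_i(x, a); \quad M_k(x, a) = \sum_{i=(k-1)M}^{kM} \chi_i(x, a)$$
and
$$M_k(x, a, y) = \sum_{i=(k-1)M}^{kM} \chi_i(x, a, y)$$
Then the Q-value of $(x, a)$ after the $k$-th batch is given by:
$$Q_{k+1}(x, a) = (1 - M_k(x, a)\alpha_k(x, a)) Q_k(x, a) + M_k(x, a)\alpha_k(x, a) \left[ \frac{R_k(x, a)}{M_k(x, a)} + \gamma \sum_{y \in Y} \frac{M_k(x, a, y)}{M_k(x, a)} \max_{a'} Q_k(y, a') \right]$$
Let $\bar{Q}$ be the solution to Equation 2. Define,
$$F_k(x, a) = \frac{R_k(x, a)}{M_k(x, a)} + \gamma \sum_{y \in Y} \frac{M_k(x, a, y)}{M_k(x, a)} \max_{a'} Q_k(y, a') - \bar{Q}(x, a),$$
then, if $V_k(x) = \max_a Q_k(x, a)$ and $\bar{V}(x) = \max_a \bar{Q}(x, a)$,
$$F_k(x, a) = \gamma \sum_y \frac{M_k(x, a, y)}{M_k(x, a)} [V_k(y) - \bar{V}(y)] + \left( \frac{R_k(x, a)}{M_k(x, a)} - R^a_{0,\infty}(x) \right) + \gamma \sum_y \left[ \left( \frac{M_k(x, a, y)}{M_k(x, a)} - P^a_{0,\infty}(x, y) \right) \bar{V}(y) \right]$$
The quantity $F_k(x, a)$ can be bounded by
$$\|F_k(x, a)\| \le \gamma \|V_k - \bar{V}\| + \left\| \frac{R_k(x, a)}{M_k(x, a)} - R^a_{0,\infty}(x) \right\| + \gamma \left\| \sum_y \left( \frac{M_k(x, a, y)}{M_k(x, a)} - P^a_{0,\infty}(x, y) \right) \bar{V}(y) \right\| \le \gamma \|V_k - \bar{V}\| + C \epsilon_k^M,$$
where $\epsilon_k^M$ is the larger of $\left| \frac{R_k(x, a)}{M_k(x, a)} - R^a_{0,\infty}(x) \right|$
and $\gamma \left| \sum_y \left( \frac{M_k(x, a, y)}{M_k(x, a)} - P^a_{0,\infty}(x, y) \right) \right|$. By
assumption for any $\epsilon > 0$, $\exists M_\epsilon < \infty$ such that $\epsilon_k^{M_\epsilon} < \epsilon$ with probability $1 - \epsilon$. The
variance of $F_k(x, a)$ can also be shown to be bounded because the variance of the
sample probabilities is bounded (everything else is similar to standard Q-learning
for MDPs). Therefore by Theorem 1 of Jaakkola et al. (1994), for any $\epsilon > 0$,
with probability $(1 - \epsilon)$, $Q_k(x, a) \to Q_\infty(x, a)$, where $|Q_\infty(x, a) - \bar{Q}(x, a)| \le C\epsilon$.
Therefore, semi-batch Q-learning converges with probability one.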
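The semi-batch update analyzed in this appendix can be sketched as a single function that accumulates the batch statistics and then changes the Q-values once. The function name `semi_batch_update`, the use of one constant learning rate for all pairs, and the toy batch are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Quadruple = Tuple[int, int, int, float]  # (x, a, y, r)


def semi_batch_update(q: List[List[float]], batch: Sequence[Quadruple],
                      gamma: float, alpha: float) -> List[List[float]]:
    """Return the Q-values after applying one batch of quadruples at once."""
    n_x, n_a = len(q), len(q[0])
    count = [[0] * n_a for _ in range(n_x)]                        # M_k(x, a)
    trans = [[[0] * n_x for _ in range(n_a)] for _ in range(n_x)]  # M_k(x, a, y)
    pay = [[0.0] * n_a for _ in range(n_x)]                        # summed payoffs
    for x, a, y, r in batch:
        count[x][a] += 1
        trans[x][a][y] += 1
        pay[x][a] += r
    v = [max(row) for row in q]        # values frozen for the whole batch
    new_q = [row[:] for row in q]
    for x in range(n_x):
        for a in range(n_a):
            m = count[x][a]
            if m == 0:
                continue               # pair not visited in this batch
            target = pay[x][a] / m + gamma * sum(
                trans[x][a][y] / m * v[y] for y in range(n_x))
            new_q[x][a] = (1.0 - m * alpha) * q[x][a] + m * alpha * target
    return new_q


# Hypothetical batch: the pair (0, 0) is visited twice, (1, 0) never.
batch = [(0, 0, 1, 1.0), (0, 0, 1, 3.0)]
q0 = semi_batch_update([[0.0], [0.0]], batch, gamma=0.9, alpha=0.1)
q1 = semi_batch_update([[0.0], [1.0]], batch, gamma=0.9, alpha=0.1)
print(q0, q1)
```

The result equals the sum of the individual Q-learning changes computed with the Q-values held fixed over the batch, which is what distinguishes it from the fully online update.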
Acknowledgements 
This project was supported in part by a grant from the McDonnell-Pew Foundation, 
by a grant from ATR Human Information Processing Research Laboratories, and by 
a grant from Siemens Corporation. Michael I. Jordan is an NSF Presidential Young
Investigator. 
References 
A. G. Barto, R. S. Sutton, & C. W. Anderson. (1983) Neuronlike elements that can 
solve difficult learning control problems. IEEE SMC, 13:835-846. 
D. P. Bertsekas. (1987) Dynamic Programming: Deterministic and Stochastic Mod- 
els, Prentice-Hall. 
S. J. Bradtke. (1993) Reinforcement learning applied to linear quadratic regulation. 
In Advances in Neural Information Processing Systems 5, pages 295-302. 
P. Dayan. (1992) The convergence of TD($\lambda$) for general $\lambda$. Machine Learning,
8(3/4):341-362. 
P. Dayan & T.J. Sejnowski. (1994) TD($\lambda$) converges with probability 1. Machine
Learning, 13(3). 
T. Jaakkola, M. I. Jordan, & S. P. Singh. (1994) On the convergence of stochastic 
iterative dynamic programming algorithms. Neural Computation, 6(6):1185- 
1201. 
A. W. Moore. (1991) Variable resolution dynamic programming: Efficiently learning 
action maps in multivariate real-valued state-spaces. In Machine Learning:
Proceedings of the Eighth International Workshop, pages 333-337. 
S. P. Singh, T. Jaakkola, & M. I. Jordan. (1994) Learning without state-estimation 
in partially observable Markovian decision processes. In Machine Learning:
Proceedings of the Eleventh International Conference, pages 284-292. 
P. S. Sutton. (1988) Learning to predict by the methods of temporal differences. 
Machine Learning, 3:9-44. 
J. Tsitsiklis. (1994) Asynchronous stochastic approximation and Q-learning. Ma- 
chine Learning, 16(3):185-202. 
B. Van Roy & J. Tsitsiklis. (personal communication)
C. J. C. H. Watkins & P. Dayan. (1992) Q-learning. Machine Learning, 8(3/4):279- 
292. 
R. C. Yee. (1992) Abstraction in control learning. Technical Report COINS Techni- 
cal Report 92-16, Department of Computer and Information Science, University 
of Massachusetts, Amherst, MA 01003. A dissertation proposal. 
