Reinforcement Learning for Mixed 
Open-loop and Closed-loop Control 
Eric A. Hansen, Andrew G. Barto, and Shlomo Zilberstein
Department of Computer Science 
University of Massachusetts 
Amherst, MA 01003 
hansen ,barto,shlomo)cs.u mass.edu 
Abstract 
Closed-loop control relies on sensory feedback that is usually as- 
sumed to be free. But if sensing incurs a cost, it may be cost- 
effective to take sequences of actions in open-loop mode. We de- 
scribe a reinforcement learning algorithm that learns to combine 
open-loop and closed-loop control when sensing incurs a cost. Al- 
though we assume reliable sensors, use of open-loop control means 
that actions must sometimes be taken when the current state of 
the controlled system is uncertain. This is a special case of the 
hidden-state problem in reinforcement learning, and to cope, our 
algorithm relies on short-term memory. The main result of the pa- 
per is a rule that significantly limits exploration of possible memory 
states by pruning memory states for which the estimated value of 
information is greater than its cost. We prove that this rule allows 
convergence to an optimal policy. 
1 Introduction
Reinforcement learning (RL) is widely used for learning closed-loop control policies.
Closed-loop control works well if the sensory feedback on which it relies is
accurate, fast, and inexpensive. But this is not always the case. In this paper, we 
address problems in which sensing incurs a cost, either a direct cost for obtaining 
and processing sensory data or an indirect opportunity cost for dedicating limited 
sensors to one control task rather than another. If the cost for sensing is significant, 
exclusive reliance on closed-loop control may make it impossible to optimize a per- 
formance measure such as cumulative discounted reward. For such problems, we 
describe an RL algorithm that learns to combine open-loop and closed-loop control. 
By learning to take open-loop sequences of actions between sensing, it can optimize 
a tradeoff between the cost and value of sensing.
The problem we address is a special case of the problem of hidden state or partial 
observability in RL (e.g., Whitehead & Lin, 1995; McCallum, 1995). Although we
assume sensing provides perfect information (a significant limiting assumption), use 
of open-loop control means that actions must sometimes be taken when the current 
state of the controlled system is uncertain. Previous work on RL for partially 
observable environments has focused on coping with sensors that provide imperfect 
or incomplete information, in contrast to deciding whether or when to sense. Tan 
(1991) addressed the problem of sensing costs by showing how to use RL to learn a 
cost-effective sensing procedure for state identification, but his work addressed the 
question of which sensors to use, not when to sense, and so still assumed closed-loop 
control. 
In this paper, we formalize the problem of mixed open-loop and closed-loop control 
as a Markov decision process and use RL in the form of Q-learning to learn an op- 
timal, state-dependent sensing interval. Because there is a combinatorial explosion 
of open-loop action sequences, we introduce a simple rule for pruning this large 
search space. Our most significant result is a proof that Q-learning converges to 
an optimal policy even when only a fraction of the space of possible open-loop action
sequences is explored. 
2 Q-learning with sensing costs 
Q-learning (Watkins, 1989) is a well-studied RL algorithm for learning to control 
a discrete-time, finite state and action Markov decision process (MDP). At each 
time step, a controller observes the current state x, takes an action a, and receives
an immediate reward r with expected value r(x, a). With probability p(x, a, y)
the process makes a transition to state y, which becomes the current state on the
next time step. A controller using Q-learning learns a state-action value function,
$\hat{Q}(x, a)$, that estimates the expected total discounted reward for taking action a
in state x and performing optimally thereafter. Each time step, $\hat{Q}$ is updated for
state-action pair (x, a) after receiving reward r and observing resulting state y, as
follows:

$$\hat{Q}(x, a) \leftarrow \hat{Q}(x, a) + \alpha \left[ r + \gamma \hat{V}(y) - \hat{Q}(x, a) \right],$$

where $\alpha \in (0, 1]$ is a learning rate parameter, $\gamma \in [0, 1)$ is a discount factor, and
$\hat{V}(y) = \max_b \hat{Q}(y, b)$. Watkins and Dayan (1992) prove that $\hat{Q}$ converges to an
optimal state-action value function Q (and $\hat{V}$ converges to an optimal state value
function V) with probability one if all actions continue to be tried from all states,
the state-action value function is represented by a lookup table, and the learning
rate is decreased in an appropriate manner.
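As an illustration of the Q-learning update described above, the following minimal Python sketch applies one tabular update step. The function name `q_update`, the dictionary-based lookup table, and the default values of `alpha` and `gamma` are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict


def q_update(Q, x, a, r, y, actions, alpha=0.1, gamma=0.95):
    """Apply one tabular Q-learning update for the experience (x, a, r, y)."""
    # Learned state value of the successor: V(y) = max_b Q(y, b).
    v_y = max(Q[(y, b)] for b in actions)
    # Move Q(x, a) a fraction alpha toward the one-step target r + gamma * V(y).
    Q[(x, a)] += alpha * (r + gamma * v_y - Q[(x, a)])
    return Q[(x, a)]


# Hypothetical usage: a lookup table whose entries all start at zero.
Q = defaultdict(float)
actions = ["a", "b"]
first = q_update(Q, x=0, a="a", r=1.0, y=1, actions=actions)
```

The lookup table mirrors the representation required by the convergence result; a decreasing schedule for `alpha` would have to be supplied by the caller.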
If there is a cost for sensing, acting optimally may require a mixed strategy of open- 
loop and closed-loop control that allows a controller to take open-loop sequences 
of actions between sensing. This possibility can be modeled by an MDP with two 
kinds of actions: control actions that have an effect on the current state but do not 
provide information, and a sensing action that reveals the current state but has no 
other effect. We let o (for observation) denote the sensing action and assume it 
provides perfect information about the underlying state. Separating control actions 
and the sensing action gives an agent control over when to receive sensory feedback, 
and hence, control over sensing costs. 
When one control action follows another without an intervening sensing action, the 
second control action is taken without knowing the underlying state. We model 
this by including "memory states" in the state set of the MDP. Each memory 
state represents memory of the last observed state and the open-loop sequence of
control actions taken since; because we assume sensing provides perfect information,
remembering this much history provides a sufficient statistic for action selection
(Monahan, 1982). Possible memory states can be represented using a tree like
the one shown in Figure 1, where the root represents the last observed state and
the other nodes represent memory states, one for each possible open-loop action
sequence. For example, let xa denote the memory state that results from taking
control action a in state x. Similarly, let xab denote the memory state that results
from taking control action b in memory state xa. Note that a control action causes
a deterministic transition to a subsequent memory state, while a sensing action
causes a stochastic transition to an observed state, the root of some tree. There
is a tree like the one in Figure 1 for each observable state.

Figure 1: A tree of memory states rooted at observed state x. The set of control
actions is {a, b} and the length bound is 2.
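The tree of memory states can be sketched in Python as follows. This is only an illustrative encoding: representing a memory state as a pair of the last observed state and a tuple of control actions, and the names `memory_states` and `successor`, are assumptions of this example.

```python
from itertools import product


def memory_states(x, control_actions, length_bound):
    """List every node of the tree rooted at observed state x.

    A memory state is the pair (x, seq), where seq is the tuple of control
    actions taken since x was observed; the root has the empty sequence.
    """
    return [(x, seq)
            for depth in range(length_bound + 1)
            for seq in product(control_actions, repeat=depth)]


def successor(memory_state, action):
    """Deterministic transition caused by taking a control action."""
    x, seq = memory_state
    return (x, seq + (action,))


# The tree of Figure 1: control actions {a, b} and length bound 2.
tree = memory_states("x", ["a", "b"], length_bound=2)
```

With two control actions and a length bound of 2, the tree has 1 + 2 + 4 = 7 nodes, which makes the exponential growth in the length bound easy to see.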
This problem is a special case of a partially observable MDP and can be formalized 
in an analogous way (Monahan, 1982). Given a state-transition and reward model
for a core MDP with a state set that consists only of the underlying states of a 
system (which for this problem we also call observable states), we can define a state- 
transition and reward model for an MDP that includes memory states in its state 
set. As a convenient notation, let $p(x, a_1..a_k, y)$ denote the probability that taking
an open-loop action sequence $a_1..a_k$ from state x results in state y, where both x
and y are states of the underlying system. These probabilities can be computed
recursively from the single-step state-transition probabilities of the core MDP as
follows:

$$p(x, a_1..a_k, y) = \sum_z p(x, a_1..a_{k-1}, z)\, p(z, a_k, y).$$

State-transition probabilities for the sensing action can then be defined as

$$p(xa_1..a_k, o, y) = p(x, a_1..a_k, y),$$

and a reward function for the generalized MDP can be similarly defined as

$$r(xa_1..a_{i-1}, a_i) = \sum_y p(x, a_1..a_{i-1}, y)\, r(y, a_i),$$

where the cost of sensing in state x of the core MDP is r(x, o).
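The construction of the generalized model can be sketched in Python. The nested-dictionary layout for the core MDP, the function names, and the two-state example are hypothetical choices made for illustration; only the recursion and the expected-reward definition come from the text.

```python
def sequence_distribution(p, x, seq):
    """Return {y: probability} of ending in y after the open-loop sequence seq.

    p[(z, a)] is a dict {y: probability} of single-step transitions in the
    core MDP (a hypothetical data layout chosen for this sketch).
    """
    dist = {x: 1.0}                 # empty sequence: the state is still x
    for a in seq:                   # one application of the recursion per action
        nxt = {}
        for z, prob_z in dist.items():
            for y, prob_zy in p[(z, a)].items():
                nxt[y] = nxt.get(y, 0.0) + prob_z * prob_zy
        dist = nxt
    return dist


def memory_state_reward(p, r, x, seq, action):
    """Expected reward for taking `action` in memory state (x, seq).

    r[(y, action)] is the core-MDP expected reward; `action` may be a
    control action or the sensing action "o".
    """
    dist = sequence_distribution(p, x, seq)
    return sum(prob * r[(y, action)] for y, prob in dist.items())


# Hypothetical two-state core MDP with one control action "a".
p = {(0, "a"): {0: 0.8, 1: 0.2}, (1, "a"): {1: 1.0}}
r = {(0, "a"): -1.0, (1, "a"): 0.0, (0, "o"): -0.5, (1, "o"): -0.5}
dist_aa = sequence_distribution(p, 0, ("a", "a"))
reward_second_a = memory_state_reward(p, r, 0, ("a",), "a")
```

The distribution returned for a sequence is also the transition distribution of the sensing action in the corresponding memory state, so a single function serves both definitions.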
If we assume a bound on the number of control actions that can be taken between 
sensing actions (i.e., a bound on the depth of each tree) and also assume a finite 
number of underlying states, the number of possible memory states is finite. It 
follows that the MDP we have constructed is a well-defined finite state and action 
MDP, and all of the theory developed for Q-learning continues to apply, including 
its convergence proof. (This is not true of partially observable MDPs in general.) 
Therefore, Q-learning can in principle find an optimal policy for interleaving control 
actions and sensing, assuming sensing provides perfect information. 
3 Limiting Exploration 
A problem with including memory states in the state set of an MDP is that it 
increases the size of the state set exponentially. The combinatorial explosion of 
state-action values to be learned raises doubt about the computational feasibility of 
this generalization of RL. We present a solution in the form of a rule for pruning each 
tree of memory states, thereby limiting the number of memory states that must be 
explored. We prove that even if some memory states are never explored, Q-learning 
converges to an optimal state-action value function. Because the state-action value 
function is left undefined for unexplored memory states, we must carefully define 
what we mean by an optimal state-action value function. 
Definition: A state-action value function is optimal if it is sufficient for generat- 
ing optimal behavior and the values of the state-action pairs visited when behaving 
optimally are optimal. 
A state-action value function that is undefined for some states is optimal, by this 
definition, if a controller that follows it behaves identically to a controller with a 
complete, optimal state-action value function. This is possible if the states for which 
the state-action value function is undefined are not encountered when an agent acts 
optimally. Barto, Bradtke, and Singh (1995) invoke a similar idea for a different 
class of problems. 
Let $g(xa_1..a_k)$ denote the expected reward for taking actions $a_1..a_k$ in open-loop
mode after observing state x:

$$g(xa_1..a_k) = r(x, a_1) + \sum_{i=1}^{k-1} \gamma^i\, r(xa_1..a_i, a_{i+1}).$$

Let $h(xa_1..a_k)$ denote the discounted expected value of perfect information after
reaching memory state $xa_1..a_k$, which is equal to the discounted Q-value for sensing
in memory state $xa_1..a_k$ minus the cost for sensing in this state:

$$h(xa_1..a_k) = \gamma^k \sum_y p(xa_1..a_k, o, y)\, V(y) = \gamma^k \left( Q(xa_1..a_k, o) - r(xa_1..a_k, o) \right).$$

Both g and h are easily learned during Q-learning, and we refer to the learned
estimates as $\hat{g}$ and $\hat{h}$. These are used in the pruning rule, as follows:

Pruning rule: If $\hat{g}(xa_1..a_k) + \hat{h}(xa_1..a_k) \le \hat{V}(x)$, then memory states that descend
from $xa_1..a_k$ do not need to be explored. A controller should immediately execute a
sensing action when it reaches one of these memory states.
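A minimal Python sketch of how the pruning rule restricts exploration of one tree is given below. It assumes the learned estimates are available as dictionaries keyed by memory state (for g and h) and by observed state (for V); the names `should_prune` and `explorable_states` and the numerical values are hypothetical. The root is never tested, in line with the observation in the appendix that single-depth memory states are never pruned.

```python
def should_prune(memory_state, g_hat, h_hat, v_hat):
    """Pruning test: True if descendants of memory_state need not be explored."""
    x, _ = memory_state
    return g_hat[memory_state] + h_hat[memory_state] <= v_hat[x]


def explorable_states(x, control_actions, length_bound, g_hat, h_hat, v_hat):
    """Collect the memory states under x that pruning leaves open to exploration."""
    kept, frontier = [], [(x, ())]
    while frontier:
        m = frontier.pop()
        kept.append(m)
        seq = m[1]
        if len(seq) == length_bound:
            continue                      # leaf of the bounded tree
        # The root is never tested, so depth-one states are always explored.
        if seq and should_prune(m, g_hat, h_hat, v_hat):
            continue                      # sense here; skip all descendants
        frontier.extend((x, seq + (a,)) for a in control_actions)
    return kept


# Hypothetical learned estimates for a tree with actions {a, b} and bound 2.
v_hat = {"x": 10.0}
g_hat = {("x", ("a",)): -1.0, ("x", ("b",)): -1.0}
h_hat = {("x", ("a",)): 12.0, ("x", ("b",)): 9.0}
kept = explorable_states("x", ["a", "b"], 2, g_hat, h_hat, v_hat)
```

In this example the branch below xb is pruned while the branch below xa is expanded, so five of the seven memory states remain. Because the test uses current estimates, calling the function again after the estimates change may return a different set.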
The intuition behind the pruning rule is that a branch of a tree of memory states 
can be pruned after reaching a memory state for which the value of information 
is greater than or equal to its cost. Because pruning is based on estimated values, 
memory states that are pruned at one point during learning may later be explored as 
learned estimates change. The net effect of pruning, however, is to focus exploration 
on a subset of memory states, and as Q-learning converges, the subset of unpruned 
memory states becomes stable. The following theorem is proved in an appendix. 
Theorem: Q-learning converges to an optimal state-action value function with 
probability one if, in addition to the conditions for convergence given by Watkins 
and Dayan (1992), exploration is limited by the pruning rule. 
This result is closely related to a similar result for solving this class of problems 
using dynamic programming (Hansen, 1997), where it is shown that pruning can 
assure convergence to an optimal policy even if no bound is placed on the length 
of open-loop action sequences - under the assumption that it is optimal to sense 
at finite intervals. This additional result can be extended to Q-learning as well, 
although we do not present the extension in this paper. An artificial length bound 
can be set as low or high as desired to ensure a finite set of memory states. 
[Figure 2(a): grid world diagram not reproduced.]

State  Sequence    State  Sequence    State  Sequence
Goal   Stop        7      WNNNO       14     NWNO
1      WO          8      WWNNNO      15     WWO
2      NO          9      NWO         16     WWWO
3      ?WO         10     WNWO        17     NNNO
4      NNO         11     WWO         18     WNNNO
5      NNWO        12     WWWO        19     WWNO
6      NNNO        13     NNO         20     WWWNO

Figure 2: (a) Grid world with numbered states (b) Optimal policy
We use the notation g and h in our statement of the pruning rule to emphasize 
its relationship to pruning in heuristic search. If we regard the root of a tree of 
memory states as the start state and the memory state that corresponds to the 
best open-loop action sequence as the goal state, then g can be regarded as the 
cost-to-arrive function and the value of perfect information h can be regarded as an 
upper bound on the cost-to-go function. 
4 Example 
We describe a simple example to illustrate the extent of pruning possible using this 
rule. Imagine that a "robot" must find its way to a goal location in the upper 
left-hand corner of the grid shown in Figure 2a. Each cell of the grid corresponds 
to a state, with the states numbered for convenient reference. The robot has five 
control actions; it can move north, east, south, or west, one cell at a time, or it 
can stop. The problem ends when the robot stops. If it stops in the goal state it 
receives a reward of 100, otherwise it receives no reward. The robot must execute 
a sequence of actions to reach the goal state, but its move actions are stochastic. If 
the robot attempts to move in a particular direction, it succeeds with probability 
0.8. With probability 0.05 it moves in a direction 90 degrees off to one side of its 
intended direction, with probability 0.05 it moves in a direction 90 degrees off to the 
other side, and with probability 0.1 it does not move at all. If the robot's movement 
would take it outside the grid, it remains in the same cell. Because its progress is 
uncertain, the robot must interleave sensing and control actions to keep track of its 
location. The reward for sensing is -1 (i.e., a cost of 1) and for each move action it 
is -4. To optimize expected total reward, the robot must find its way to the goal 
while minimizing the combined cost of moving and sensing. 
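The stochastic move model can be sketched in Python as follows. Because the layout of Figure 2a is not available here, the sketch assumes a plain rectangular grid addressed by (row, column) with row 0 at the top; that layout, the function name `move_distribution`, and the direction tables are assumptions of this example, while the probabilities are those given in the text.

```python
# Offsets for each move, with row 0 as the northern edge (an assumption).
MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}
LEFT_OF = {"N": "W", "W": "S", "S": "E", "E": "N"}
RIGHT_OF = {"N": "E", "E": "S", "S": "W", "W": "N"}


def move_distribution(cell, direction, rows, cols):
    """Return {next_cell: probability} for attempting a move from cell."""
    def step(d):
        r, c = cell[0] + MOVES[d][0], cell[1] + MOVES[d][1]
        # A move that would leave the grid keeps the robot in the same cell.
        return (r, c) if 0 <= r < rows and 0 <= c < cols else cell

    dist = {}
    for target, prob in [(step(direction), 0.8),            # intended move
                         (step(LEFT_OF[direction]), 0.05),   # 90 degrees off
                         (step(RIGHT_OF[direction]), 0.05),  # 90 degrees off
                         (cell, 0.1)]:                        # no movement
        dist[target] = dist.get(target, 0.0) + prob
    return dist


# Hypothetical 3 x 3 grid: an interior cell and the upper left-hand corner.
center = move_distribution((1, 1), "N", 3, 3)
corner = move_distribution((0, 0), "N", 3, 3)
```

From the corner, the blocked moves add to the probability of staying in place, which is one reason open-loop sequences toward a wall or corner remain relatively predictable.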
Figure 2b shows the optimal open-loop sequence of actions for each observable state. 
If the bound on the length of an open-loop sequence of control actions is five, the 
number of possible memory states for this problem is over 64,000, a number that 
grows explosively as the length bound is increased (to over 16 million when the 
bound is nine). Using the pruning rule, Q-learning must explore fewer than 1000
memory states (and no deeper than nine levels in any tree) to converge to an optimal 
policy, even when there is no bound on the interval between sensing actions. 
5 Conclusion 
We have described an extension of Q-learning for MDPs with sensing costs and 
a rule for limiting exploration that makes it possible for Q-learning to converge 
to an optimal policy despite exploring only a fraction of possible memory states. As
already pointed out, the problem we have formalized is a partially observable MDP, 
although one that is restricted by the assumption that sensing provides perfect 
information. An interesting direction in which to pursue this work would be to 
explore its relationship to work on RL for partially observable MDPs, which has so 
far focused on the problem of sensor uncertainty and hidden state. Because some 
of this work also makes use of tree representations of the state space and of learned 
state-action values (e.g., McCallum, 1995), it may be that a similar pruning rule 
can constrain exploration for such problems. 
Acknowledgements
Support for this work was provided in part by the National Science Foundation un- 
der grants ECS-9214866 and IRI-9409827 and in part by Rome Laboratory, USAF, 
under grant F30602-95-1-0012. 
References 
Barto, A.G.; Bradtke, S.J.; & Singh, S.P. (1995) Learning to act using real-time 
dynamic programming. Artificial Intelligence 72(1/2):81-138. 
Hansen, E.A. (1997) Markov decision processes with observation costs. University 
of Massachusetts at Amherst, Computer Science Technical Report 97-01. 
McCallum, R.A. (1995) Instance-based utile distinctions for reinforcement learning 
with hidden state. In Proc. 12th Int. Machine Learning Conf. Morgan Kaufmann. 
Monahan, G.E. (1982) A survey of partially observable Markov decision processes:
Theory, models, and algorithms. Management Science 28:1-16. 
Tan, M. (1991) Cost-sensitive reinforcement learning for adaptive classification and 
control. In Proc. 9th Nat. Conf. on Artificial Intelligence. AAAI Press/MIT Press. 
Watkins, C.J.C.H. (1989) Learning from delayed rewards. Ph.D. Thesis, University 
of Cambridge, England. 
Watkins, C.J.C.H. & Dayan, P. (1992) Technical note: Q-learning. Machine Learn- 
ing 8(3/4):279-292. 
Whitehead, S.D. & Lin, L.-J. (1995) Reinforcement learning of non-Markov decision
processes. Artificial Intelligence 73:271-306. 
Appendix 
Proof of theorem: Consider an MDP with a state set that consists only of the 
memory states that are not pruned. We call it a "pruned MDP" to distinguish 
it from the original MDP for which the state set consists of all possible memory 
states. Because the pruned MDP is a finite state and action MDP, Q-learning with 
pruning converges with probability one. What we must show is that the state-action 
values to which it converges include every state-action pair visited by an optimal 
controller for the original MDP, and that for each of these state-action pairs the 
learned state-action value is equal to the optimal state-action value for the original 
MDP. 
Let $\hat{Q}$ and $\hat{V}$ denote the values that are learned by Q-learning when its exploration is
limited by the pruning rule, and let Q and V denote value functions that are optimal
when the state set of the MDP includes all possible memory states. Because an MDP 
has an optimal stationary policy and each control action causes a deterministic 
transition to a subsequent memory state, there is an optimal path through each 
tree of memory states. The learned value of the root state of each tree is optimal if 
and only if the learned value of each memory state along this path is also optimal. 
Therefore to show that Q-learning with pruning converges to an optimal state-action
value function, it is sufficient to show that $\hat{V}(x) = V(x)$ for every observable state
x. Our proof is by induction on the number of control actions that can be taken
between one sensing action and the next. We use the fact that if Q-learning has
converged, then $\hat{g}(xa_1..a_i) = g(xa_1..a_i)$ and $\hat{h}(xa_1..a_i) = \gamma^i \sum_y p(x, a_1..a_i, y)\, \hat{V}(y)$ for
every memory state $xa_1..a_i$.
First note that if $\hat{g}(xa_1) + \gamma\, r(xa_1, o) + \hat{h}(xa_1) > \hat{V}(x)$, that is, if $\hat{V}$ for some
observable state x can be improved by exploring a path of a single control action
followed by sensing, then it is contradictory to suppose Q-learning with pruning has
converged because single-depth memory states in a tree are never pruned. Now,
make the inductive hypothesis that Q-learning with pruning has not converged if $\hat{V}$
can be improved for some observable state by exploring a path of less than k control
actions before sensing. We show that it has not converged if $\hat{V}$ can be improved for
some observable state by exploring a path of k control actions before sensing.
Suppose ' for some observable state x can be improved by exploring a path that 
consists of taking the sequence of control actions a..ak before sensing, that is, 
+ o) + > 
Since only pruning can prevent improvement in this case, let za..ai be the memory 
state at which application of the pruning rule prevents za..aa from being explored. 
Because the tree has been pruned at this node, ;/(z) >_ .(za..a) + h(zax..a), and 
so 
+ o)+ > + 
We can expand this inequality as follows: 
(zal..ai) q- /i Z p(z,a..ai, y)[.O(yai+l..aa)q- 7a-ir(yai+l..ai,o) q- (yai+..aa)] 
y 
> + 
Simplification and expansion of  yields 
E p(, a..ai, y) [.O(ya,+ ..a ) + 7k-' r(yai+ ..a, o) + 'y -' E p(y, ai+ ..a, z)  (z)] 
y6S z 
> Ep(,a..ai, y)(y). 
y 
Therefore, there is some observable state, y, such that 
.O(yai+..aa) + 7k-iv(yai+..aa,o) + 7 a-i Ep(y, ai+..aa,z)(z ) > (y). 
Because the value of observable state y can be improved by taking less than k 
control actions before sensing, by the inductive hypothesis Q-learning has not yet 
converged. [] 
The proof provides insight into how pruning works. If a state-action pair along 
some optimal path is temporarily pruned, it must be possible to improve the value 
of some observable state by exploring a shorter path of memory states that has 
not been pruned. The resulting improvement of the value function changes the 
threshold for pruning and the state-action pair that was formerly pruned may no 
longer be so, making further improvement of the learned value function possible. 
