Reinforcement Learning based on 
Oneline EM Algorithm 
Masa-aki Sato l 
iATR Human Information Processing Research Laboratories 
Seika, Kyoto 619-0288, Japan masaaki@hip.atr.co.jp 
Shin Ishii $t 
:Nara Institute of Science and Technology 
Ikoma, Nara 630-0101, Japan ishii@is.aist-nara.ac.jp 
Abstract 
In this article, we propose a new reinforcement learning (RL) 
method based on an actor-critic architecture. The actor and 
the critic are approximated by Normalized Gaussian Networks 
(NGnet), which are networks of local linear regression units. The 
NGnet is trained by the on-line EM algorithm proposed in our pre- 
vious paper. We apply our RL method to the task of swinging-up 
and stabilizing a single pendulum and the task of balancing a dou- 
ble pendulum near the upright position. The experimental results 
show that our RL method can be applied to optimal control prob- 
lems having continuous state/action spaces and that the method 
achieves good control with a small number of trial-and-errors. 
1 INTRODUCTION 
Reinforcement learning (RL) methods (Barto et al., 1990) have been successfully 
applied to various Markov decision problems having finite state/action spaces, such 
as the backgammon game (Tesauro, 1992) and a complex task in a dynamic envi- 
ronment (Lin, 1992). On the other hand, applications to continuous state/action 
problems (Werbos, 1990; Doya, 1996; Sofge & White, 1992) are much more difficult 
than the finite state/action cases. Good function approximation methods and fast 
learning algorithms are crucial for successful applications. 
In this article, we propose a new RL method that has the above-mentioned two 
features. This method is based on an actor-critic architecture (Barto et al., 1983), 
although the detailed implementations of the actor and the critic are quite differ- 
Reinforcement Learning Based on On-Line EM Algorithm 1053 
ent from those in the original actor-critic model. The actor and the critic in our 
method estimate a policy and a Q-function, respectively, and are approximated by 
Normalized Gaussian Networks (NGnet) (Moody & Darken, 1989). The NGnet is a 
network of local linear regression units. The model softly partitions the input space 
by using normalized Gaussian functions, and each local unit linearly approximates 
the output within its partition. As pointed out by Sutton (1996), local models such 
as the NGnet are more suitable than global models such as multi-layered percep- 
trons, for avoiding serious learning interference in on-line RL processes. The NGnet 
is trained by the on-line EM algorithm proposed in our previous paper (Sato & 
Ishii, 1998). It was shown that this on-line EM algorithm is faster than a gradient 
descent algorithm. In the on-line EM algorithm, the positions of the local units 
can be adjusted according to the input and output data distribution. Moreover, 
unit creation and unit deletion are performed according to the data distribution. 
Therefore, the model can be adapted to dynamic environments in which the input 
and output data distribution changes with time (Sato & Ishii, 1998). 
We have applied the new RL method to optimal control problems for deterministic 
nonlinear dynamical systems. The first experiment is the task of swinging-up and 
stabilizing a single pendulum with a limited torque (Doya, 1996). The second 
experiment is the task of balancing a double pendulum where a torque is applied 
only to the first pendulum. Our RL method based on the on-line EM algorithm 
demonstrated good performances in these experiments. 
2 NGNET AND ON-LINE EM ALGORITHM 
In this section, we review the on-line EM algorithm for the NGnet proposed in our 
previous paper (Sato & Ishii, 1998). The NGnet (Moody & Darken, 1989), which 
transforms an N-dimensional input vector x to a D-dimensional output vector y, is 
defined by the following equations. 
gI denotes the number of units, and the prime (') denotes a transpose. Gi(x) is 
an N-dimensional Gaussian function, which has an N-dimensional center/ and an 
(N x N)-dimensional covariance matrix E. lVi and bi are a (D x N)-dimensional lin- 
ear regression matrix and a D-dimensional bias vector, respectively. Subsequently, 
we use notations l _= (H,b) and 2" -- (x', 1). 
The NGnet can be interpreted as a stochastic model, in which a pair of an input and 
an output, (x, y), is a stochastic event. For each event, a unit index i ( {1, ..., M} 
is assumed to be selected, which is regarded as a hidden variable. The stochastic 
model is defined by the probability distribution for a triplet (x, y, i), which is called 
a cmnplete event: 
P(x,y, ilO ): 
x exp )- -- 
i ] 
2o;. (Y - . 
(2) 
Here, 0 -- {/i, i, rr, 1' l i: 1,...,M} is a set of model parameters. We can easily 
prove that the expectation value of the output y for a given input x, i.e., E[ylx] = 
1054 M. Sato and S. lshii 
f yP(ylx, O)dy, is identical to equation (1). Namely, the probability distribution (2) 
provides a stochastic model for the NGnet. 
From a set of T events (observed data) (X,Y) _= {(x(t),y(t)) I t: 1,...,T}, the 
model parameter 0 of the stochastic model (2) can be determined by the maximum 
likelihood estimation method, in particular, by the EM algorithm (Dempster et al., 
1977). The EM algorithm repeats the following E- and M-steps. 
E (Estimation) step: Let 0 be the present estimator. By using 0, the posterior 
probability that the i-th unit is selected for (x(t),y(t)) is given as 
M 
P(ilx(t),y(t),O ): P(x(t),y(t),il)/ Y P(x(t),y(t),jlO). 
j=l 
(3) 
M (Maximizatipn) step' Using the posterior probability (3), the expected log- 
likelihood L(OIO, X, Y) for the complete events is defined by 
T 
L(01,X,Y ) = y Y P(ilx(t),y(t),O)logP(x(t),y(t),ilO)  
t----1 i=1 
(4) 
Since an increase of L(010, X, Y) implies an increase of the log-likelihood for the ob- 
served data (X, Y) (Dempster et al., 1977), L(OIO, X , Y) is maximized with respect 
to 0. A solution of the necessity condition OL/00 = 0 is given by (Xu et al., 1995) 
(Sa) 
(5b) 
(5c) 
(5d) 
where (')i denotes a weighted mean with respect to the posterior probability (3) 
and it is defined by 
I T 
{f(x,y)),(T) --  y f(x(t),y(t))P(ilx(t),y(t),b ). 
t=l 
(6) 
The EM algorithm introduced above is based on batch learning (Xu et al., 1995), 
namely, the parameters are updated after seeing all of the observed data. We 
introduce here an on-line version (Sato & Ishii, 1998) of the EM algorithm. Let 
O(t) be the estimator after the t-th observed data (x(t),y(t)). In this on-line EM 
algorithm, the weighted mean (6) is replaced by 
T T 
<< f(x,y) >>i (T) _-- rl(T ) Z( H 
t:l s:t+l 
X(s))f(x(t),y(t))P(ilx(t),y(t),O(t- 1)). (7) 
The parameter A(t) E [0, 1] is a discount factor, which is introduced for forgetting 
T T 
the effect of earlier inaccurate estimator. r/(T) _= (t=l(1-I:t+ A(s))) -x is a nor- 
malization coefficient and it is iteratively calculated by r/(t) = (1 + A(t)/rl(t- 1)) - . 
The modified weighted mean <<  >>i can be obtained by the step-wise equation: 
<< f(x,y) >>i (t) =<< f(x,y) >>i (t- 1) 
+rl(t) [f(x(t),y(t))Pi(t)- << f(x,y) >>i (t - 1)], 
(8) 
Reinforcement Learning Based on On-Line EM Algorithm 1055 
where Pi(t) = P(ilx(t),y(t),O(t - 1)). Using the modified weighted mean, the new 
parameters are obtained by the following equations. 
1 [,i(t- 1)- Pi(t),i(t- 1)&(t)5'(t),i(t- 1) (9a) 
f\i(t) = 1 - (t) (1/r/(t) - 1) + Pi(t)'(t),i(t - 1)(t) 
Ii(t) :<< x >i (t)/<< 1 >i (t) (9b) 
[(t) = (t- 1) + v(t)g(t)(y(t) - i(t- 1)&(t))&'(t)i(t) (9c) 
1 [<< [y[2 > (t)- Tr (i(t)<< &y' >>i (t))] /<< 1 > (t), (9d) 
where i(t)  [<< ' >>i]-. E?(t) can be obtained from the following relation 
with i(t). 
( S:i (t) --s:l(t)i(t) ). (10) 
fki(t) << 1 >i (t): _,i(t)El(t ) 1 + 'i(t)ES(t)i(t) 
It can be proved that this on-line EM algorithm is equivalent to the stochastic 
approximation for finding the maximum likelihood estimator, if the time course of 
the discount factor ,(t) is given by 
X(t) t-T 1 - (1 - a)/(at + b), (11) 
where a (1 > a > 0) and b are constants (Sato & Ishii, 1998). 
We also employ dynamic unit manipulation mechanisms in order to efficiently allo- 
cate the units (Sato & Ishii, 1998). The probability P(x(t),y(t),ilO(t-1))indicates 
how probable the i-th unit produces the datum (x(t), y(t)) with the present param- 
eter O(t - 1). If the probability for every unit is less than some threshold value, a 
new unit is produced to account for the new datum. The weighted mean << I >>i (t) 
indicates how much the i-th unit has been used to account for the data until t. If 
the mean becomes less than some threshold value, this unit is deleted. 
In order to deal with a singular input distribution, a regularization for E?(t) is 
introduced as follows. 
E?(t) = [(<< xx' >>i (t)- Ii(t)lfi(t) << 1 >>i (t) (12a) 
+ c << Ai 2 >>i (t)Iv) / << 1 >>i (t)] - 
<< A. >>i (t)---- (<< Ix] 2 >>i (t)- [/i(t)l 2 << 1 >>i (t)) /_N, (12b) 
where Iv is the (N x N)-dimensional identity matrix and c is a small constant. The 
corresponding i (t) can be calculated in an on-line manner using a similar equation 
to (9a) (Sato & Ishii, 1998). 
3 REINFORCEMENT LEARNING 
In this section, we propose a new RL method based on the on-line EM algorithm 
described in the previous section. In the following, we consider optimal control prob- 
lems for deterministic nonlinear dynamical systems having continuous state/action 
spaces. It is assumed that there is no knowledge of the controlled system. An 
actor-critic architecture .(Barto et a1.,1983) is used for the learning system. In the 
original actor-critic model, the actor and the critic approximated the probability 
of each action and the value function, respectively, and were trained by using the 
TD-error. The actor and the critic in our RL method are different from those in 
the original model as explained later. 
1056 M. Sato and S. Ishii 
For the current state, xc(t), of the controlled system, the actor outputs a control 
signal (action) u(t), which is given by the policy function (.), i.e., u(t) = (xc(t)). 
The controlled system changes its state to xc(t + 1) after receiving the control 
signal u(t). Subsequently, a reward r(xc(t),u(t)) is given to the learning system. 
The objective of the learning system is to find the optimal policy function that 
maximizes the discounted future return defined by 
V(xc) =  tr(xc(t),(xc(t))) xc(0)=xc ' (13) 
t:0 
where 0 < 7 < 1 is a discount factor. V(xc), which is called the value function, is 
defined for the current policy function (.) employed by the actor. The Q-function 
is defined by 
Q(Xc,U) = 717(xc(t + 1)) + r(xc,u), (14) 
where xc(t) = Xc and u(t) = u are assumed. The value function can be obtained 
from the Q-function: 
V(x = Q(xc,(x)). (15) 
The Q-function should satisfy the consistency condition 
Q(x(t),u(t)) = 7Q(Xc(t + 1), (x(t + 1)) + r(x(t),u(t)). (16) 
In our RL method, the policy function and the Q-function are approximated by the 
NGnets, which are called the actor-network and the critic-network, respectively. In 
the learning phase, a stochastic actor is necessary in order to explore a better policy. 
For this purpose, we employ a stochastic model defined by (2), corresponding to 
the actor-network. A stochastic action is generated in the following way. A unit 
index i is selected randomly according to the conditional probability P(ilx) for 
a given state x. Subsequently, an action u is generated randomly according to 
the conditional probability P(ulxc,i) for a given x and the selected i. The value 
function can be defined for either the stochastic policy or the deterministic policy. 
Since the controlled system is deterministic, we use the value function defined for 
the deterministic policy which is given by the actor-network. 
The learning process proceeds as follows. For the current state xc(t), a stochastic 
action u(t) is generated by the stochastic model corresponding to the current actor- 
network. At the next time step, the learning system gets the next state xc(t+ 1) and 
the reward r(xc (t), u(t)). The critic-network is trained by the on-line EM algorithm. 
The input to the critic-network is (x(t),u(t)). The target output is given by the 
right hand side of (16), where the Q-function and the deterministic policy function 
(-) are calculated using the current critic-network and the current actor-network, 
respectively. The actor-network is also trained by the on-line EM algorithm. The 
input to the actor-network is xc (t). The target output is given by using the gradient 
of the critic-network (Sofge & White, 1992): 
Utahget = (x(t)) + eU-u (Xc(t),(xe(t))), (17) 
where the Q-function and the deterministic policy function (.) are calculated using 
the modified critic-network and the current actor-network, respectively. e is a small 
constant. This target output gives a better action, which increases the Q-function 
value for the current state x(t), than the current deterministic action (x(t)). 
In the above learning scheme, the critic-network and the actor-network are updated 
concurrently. One can consider another learning scheme. In this scheme, the learn- 
ing system tries to control the controlled system for a given period of time by using 
the fixed actor-network. In this period, the critic-network is trained to estimate the 
Reinforcement Learning Based on On-Line EM Algorithm 1057 
Q-function for the fixed actor-network. The state trajectory in this period is saved. 
At the next stage, the actor-network is trained along the saved trajectory using the 
critic-network modified in the first stage. 
4 EXPERIMENTS 
The first experiment is the task of swinging-up and stabilizing a single pendulum 
with a limited torque (Doya, 1996). The state of the pendulum is represented 
by xc = (dS, ), where  and d5 denote the angle from the upright position and the 
angular velocity of the pendulum, respectively. The reward r(xc(t), u(t)) is assumed 
to be given by 7(x(t + 1)), where 
(x) = exp(-()2/(2Vl 2) - 02/(2%2)). (18) 
v and v2 are constants. The reward (18) encourages the pendulum to stay high. 
After releasing the pendulum from a vicinity of the upright position, the control 
and the learning process of the actor-critic network is conducted for 7 seconds. This 
is a single episode. The reinforcement learning is done by repeating these episodes. 
After 40 episodes, the system is able to make the pendulum achieve an upright 
position from almost every initial state. Even from a low initial position, the system 
swings the pendulum several times and stabilizes it at the upright position. Figure 
1 shows a control process, i.e., stroboscopic time-series of the pendulum, using the 
deterministic policy after training. According to our previous experiment, in which 
both of the actor- and critic- networks are the NGnets with fixed centers trained 
by the gradient descent algorithm, a good control was obtained after about 2000 
episodes. Therefore, our new RL method is able to obtain a good control much 
faster than that based on the gradient descent algorithm. 
The second experiment is the task of balancing a double pendulum near the up- 
right position. A torque is applied only to the first pendulum. The state of the 
pendulum is represented by x = (, 2, 1, 2), where  and 2 are the first pen- 
dulum's angle from the upright direction and the second pendulum's angle from the 
first pendulum's direction, respectively.  (2) is the angular velocity of the first 
(second) pendulum. The reward is given by the height of the second pendulum's 
end from the lowest position. After 40 episodes, the system is able to stabilize 
the double pendulum. Figure 2 shows the control process using the deterministic 
policy after training. The upper two figures show stroboscopic time-series of the 
pendulum. The dashed, dotted, and solid lines in the bottom figure denote 
2/rr, and the control signal u produced by the actor-network, respectively. After 
a transient period, the pendulum is successfully controlled to stay near the upright 
position. 
The numbers of units in the actor- (critic-) networks after training are 50 (109) and 
96 (121) for the single and double pendulum cases, respectively. The RL method 
using center-fixed NGnets trained by the gradient descent algorithm employed 441 
(= 212) actor units and 18,081 (= 212 x41) critic units, for the single pendulum task. 
For the double pendulum task, this scheme did not work even when 14,641 (= 114) 
actor units and 161,051 (= 114 x 11) critic units were prepared. The numbers of 
units in the NGnets trained by the on-line EM algorithm scale moderately as the 
input dimension increases. 
5 CONCLUSION 
In this article, we proposed a new RL method based on the on-line EM algorithm. 
We showed that our RL method can be applied to the task of swinging-up and 
1058 M. Sato and S. Ishii 
stabilizing a single pendulum and the task of balancing a double pendulum near 
the upright position. The number of trial-and-errors needed to achieve good control 
was found to be very small in the two tasks. In order to apply a RL method 
to continuous state/action problems, good function approximation methods and 
fast learning algorithms are crucial. The experimental results showed that our RL 
method has both features. 
References 
Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). IEEE Transactions on 
Systems, Man, and Cybernetics, 13,834-846. 
Barto, A. G., Sutton, R. S., & Watkins, C. J. C. H. (1990). Learning and Com- 
putational Neuroscience: Foundations of Adaptive Networks (pp. 539-602), MIT 
Press. 
Dempster, A. P., Laird, N.M., & Rubin, D. B. (1977). Journal of Royal Statistical 
Society B, 39, 1-22. 
Doya, K. (1996). Advances in Neural Information Processing Systems 8 (pp. 1073- 
1079), MIT Press. 
Lin, L. J. (1992). Machine Learning, 8, 293-321. 
Moody, J., & Darken, C. J. (1989). Neural Computation, 1,281-294. 
Sato, M., & Ishii, S. (1998). ATR Technical Report, TR-H-243, ATR. 
Sofge, D. A., & White, D. A. (1992). Handbook of Intelligent Control (pp. 259-282), 
Van Nostrand Reinhold. 
Sutton, R. S. (1996). Advances in Neural Information Processing Systems 8 
(pp. 1038-1044), MIT Press. 
Tesauro, G. J. (1992). Machine Learning, 8, 257-278. 
Werbos, P. J. (1990). Neural Networks for Control (pp. 67-95), MIT Press. 
Xu, L., Jordan, M. I., & Hinton, G. E. (1995). Advances in Neural Information 
Processing Systems 7 (pp. 633-640), MIT Press. 
Time Sequence of Inverted Pendulum 
0 1 2 
7 8 
Time (sec.) 
o 1 2 
2 3 4 
--10 6 
2 4 
Time (sec) 
Figure I Figure 2 
