133 
TRAINING MULTILAYER PERCEPTRONS WITH THE 
EXTENDED KALMAN ALGORITHM 
Sharad Singhal and Lance Wu 
Bell Communications Research, Inc. 
Morristown, NJ 07960 
ABSTRACT 
A large fraction of recent work in artificial neural nets uses 
multilayer perceptrons trained with the back-propagation 
algorithm described by Rumelhart et. al. This algorithm 
converges slowly for large or complex problems such as 
speech recognition, where thousands of iterations may be 
needed for convergence even with small data sets. In this 
paper, we show that training multilayer perceptrons is an 
identification problem for a nonlinear dynamic system which 
can be solved using the Extended Kalman Algorithm. 
Although computationally complex, the Kalman algorithm 
usually converges in a few iterations. We describe the 
algorithm and compare it with back-propagation using two- 
dimensional examples. 
INTRODUCTION 
Multilayer perceptrons are one of the most popular artificial neural net 
structures being used today. In most applications, the "back propagation" 
algorithm [Rumelhart et al, 1986] is used to train these networks. Although 
this algorithm works well for small nets or simple problems, convergence is 
poor if the problem becomes complex or the number of nodes in the network 
become large [Waibel et al, 1987]. In problems such as speech recognition, 
tens of thousands of iterations may be required for convergence even with 
relatively small data-sets. Thus there is much interest [Prager and Fallside, 
1988; Irie and Mivake, 1988] in other "training algorithms" which can 
compute the parameters faster than back-propagation and/or can handle nmch 
more complex problems. 
In this paper, we show that trailting multilayer perceptrons eau be viewed as 
an identification problem for a nonlinear clylmmic system. For linear dynamic 
Copyright 1989. Bell Communications Research. Inc. 
134 Singhal and Wu 
with white input and observation noise, the Kalman algorithm 
systems 
[Kalman, 1960] is known to be an optimum algorithm. 
the Kalman algorithm can be applied to nonlinear 
linearizing the system around the current estimate 
Although computationally complex, this algorithm 
consistent with all previously seen data and usually 
iterations. In the following sections, we describe how this algorithm can be 
applied to multilayer perceptrons and compare its performance with back- 
propagation using some two-dimensional examples. 
Extended versions of 
dynamic systems by 
of the parameters. 
updates parameters 
converges in a few 
THE EXTENDED KALMAN FILTER 
In this section we briefly outline the Extended Kalman filter. Mathematical 
derivations for the Extended Kalman filter are widely available in the 
literature [Anderson and Moore, 1979; Gelb, 1974] and are beyond the scope 
of this paper. 
Consider a nonlinear finite dimensional discrete time system of the form: 
x(n+l) = fn(x(n)) + g,(x(n))w(n), 
8 (n) = h (x (n))+v (n). 
(1) 
Here the vector x (n) is the state of the system at time n, w (n) is the input, 
d(n) is the observation, v(n) is observation noise and f,(.), g,-), and h,.) 
are nonlinear vector functions of the state with the subscript denoting possible 
dependence on time. We assume that the initial state, x(0), and the 
sequences {v(n)} and {w(n)} are independent and gaussian with 
(2) 
E [x (0)]: (0), E {[x (0)-- (0)]Ix (0)-- (0)] t ) = P (0), 
E[w(/'/)] - O, E[w(J )IM t(l)] = Q(FI )nl, 
Ep(n)] = 0, E[ (n)'(l)] = 
where 8nt is the Kronecker delta. Our problem is to find an estimate i(n +1) 
of x(n+l) given d(j) , O<_j<_n. We denote this estimate by (n+lln ). 
If the nonlinearities in (1) are sufficiently smooth, we can expand thein using 
Taylor series about the state estimates  (n In) and i (n In-1) to obtain 
where 
f,(x(n)) = f,,(.f(n In)) + F (n )[x (n )- (n In)] + .-. 
g, (x (n )) = g,, (i (n [n))+ -.. = G(n)+ ... 
Oh,,(x) 
, H' (n) - 
I" ) Ox 
F (n) - Ox ) 
is the value of the 
components of F(n) and H'(n) are the partial derivatives of the ith 
components of f,, (.) and h,,(.) respectively with respect to the jth component 
of x (n) at the points indicated. Neglecting higher order terms and assuming 
i.e. G(n) function g,,(.) at (n In) and the ijth 
Training Multilayer Perceptrons 135 
knowledge of ./(n In) and  (n In-1), the system in (3) can be approximated 
as 
x(n+l) = F(n)x(n) + G(n)w(n) + u(n) 
z (n) = W (n)x (n)+v (n) + y (n), 
(4) 
where 
u(n) = f.((n In))- F(n)(n In) 
y(n) = h.((n In-0)- H'(n):e(n In-1). 
It can be shown [Anderson and Moore, 1979] that the desired estimate 
: (n +1 In ) can be obtained by the recursion 
(n+l In) = f,,((n In)) 
:e(n In)= 2e(n In-1) + K(n)[a(n)- h,,((n In-1))] 
g (n) -- P(n I n -1)S (n)JR (n)+S' (n)P (n In-0H (n)]- 
P(n+l[n) = F(n)P(n In)Ft(n) + G(n)Q(n)Gt(n) 
P(n In)= P(n In-1)- K (n )Ht (n )P (n In-1) 
(6) 
(7) 
(8) 
(9) 
(10) 
with P(ll0 ) = P(0). K(n) is known as the Kahnan gain. In case of a linear 
system, it can be shown that P(n) is the conditional error covariance matrix 
associated with the state and the estimate (n +11n) is optimal in the sense 
that it approaches the conditional mean E[x(n+l) ld(O)... d(n)] for large 
n. However, for nonlinear systems, the filter is not optimal and the estimates 
can only loosely be termed conditional means. 
TRAINING MULTILAYER PERCEPTRONS 
The network under consideration is a L layer perceptton  with the i th input 
of the k th weight layer labeled as z/"-'(n ), the j th output being zf(n) and the 
weight connecting the ith input to the ]th output being 0j. We assume that 
the net has m inputs and I outputs. Thresholds are implemented as weights 
connected from input nodes 2 with fixed unit strength inputs. Thus, if there 
are N(k) nodes in the kth node layer, the total number of weights in the 
system is 
L 
M = ZN(k-I)[N(k)-I]. (11) 
k=l 
Although the inputs and outputs are dependent on time n, for notational 
brevity, we will not show this dependence unless explicitly needed. 
I. We use the convention that the number of layers is equal to the number of weight layers. Thus 
we have L layers of weights labeled I  L and L+I layers of nodes (including the input and 
output nodes) labeled 0.   L. We will refer to the kth weight layer or the kth node layer 
unless the context is clear. 
' We adopt the convention that the 1st input node is the threshold i.e. 8 ' is the threshold for 
.. . 14 
the jth output node from the kth weight layer. 
136 Singhal and Wu 
In order to cast the problem in a form for recursive estimation, we let the 
weights in the network constitute the state x of the nonlinear system, i.e. 
x = [0 L 0 L --  t 
,2, ,3 ' 0v(0),v0)]  (12) 
The vector x thus consists of all weights arranged in a linear array with 
dimension equal to the total number of weights M in the system. The system 
model thus is 
x(n) n>0, (13) 
d (n) = z z' (n) + v (n) = hn (x (n),z(n )) + v (n), (14) 
where at time n, z(n) is the input vector from the training set, d (n) is the 
corresponding desired output vector, and zZ'(n) is the output vector 
produced by the net. The components of hn(') define the nonlinear 
relationships between the inputs, weights and outputs of the net. If F(.) is the 
nonlinearity used, then z  (n) = h n (x (r/),go(r/)) is given by 
z L (n -- r{(0    F{(Oi)tz(n )}.   }}, (15) 
where F applies componentwise to vector arguments. Note that the input 
vectors appear only implicitly through the observation function hn (') in (14). 
The initial state (before training) x (0) of the network is defined by populating 
the net with gaussian randon variables with a N(7(0),P(0)) distribution where 
(0) and P (0) reflect any apriori knowledge about the weights. In the absence 
of any such knowledge, a N(0,1/(I) distribution can be used, where ( is a 
small number and I is the identity matrix. For the system in (13) and (14), 
the extended Kahnan filter recursion simplifies to 
 (n + 1) = 27 (n + K (n)[d (n) - hn (2? (n),z(n ))] (16) 
K (n) = P(n )H (n)[R (n)+Ht(n )P(n )H(n )]-I (17) 
?(n+l) = P(n)- K (n )H' (n )P (n ) (18) 
where P(n) is the (approximate) conditional error covariance matrix. 
Note that (16) is sinfilar to the weight update equation in back-propagation 
with the last term [z L- hn(,z)] being the error at the output layer. 
However, unlike the delta rule used in back-propagation, this error is 
propagated to the weights through the Kaitaart gain K(n) which updates each 
weight through ihe elltire gradient matrix H(n) and the conditional error 
covariance matrix P(n). In this sense, the Kaitaart algorithm is not a local 
training algorithm. However, the inversion required in (17) has dimension 
equal to the number of outputs 1, not the number of weights M, and thus 
does not grow as weights are added to the problem. 
EXAMPLES AND RESULTS 
To evaluate the output and the convergence properties of the extended 
Kahnan algorithm. we constructed mappings using two-dimensional inputs 
with two or four outputs as shown in Fig. 1. Limiting the input vector to 2 
dimensions allows its IO visualize the decision regions obtained by the net and 
Training Multilayer Percepttons ! 37 
to examine the outputs of any node in the net in a meaningful way. The x- 
and y-axes in Fig. i represent the two inputs, with the origin located at the 
center of the figures. The numbers in the figures represent the different 
output classes. 
3 
2 1 
1 2 
I 
(a) REGIONS (b) XOR 
Figure 1. Output decision regions for two problems 
The training set for each example consisted of 1000 random vectors uniformly 
filling the region. The hyperbolic tangent nonlinearity was used as the 
nonlinear element in the networks. The output corresponding to a class was 
set to 0.9 when the input vector belonged to that class, and to ~0.9 otherwise. 
During training, the weights were adjusted after each data vector was 
presented. Up to 2000 sweeps through the input data were used with the 
stopping criteria described below to examine the convergence properties. The 
order in which data vectors were presented was randomized for each sweep 
through the data. In case of back-propagation, a convergence constant of 0.1 
was used with no "momentum" factor. In the Kalman algorithm R was set to 
I.e -'/5, where k was the iteration number through the data. Within each 
iteration, R was held constant. 
The Stopping Criteria 
Training was considered complete if any one of the following conditions was 
satisfied: 
2000 sweeps through the input data were used, 
the RMS (root lnean squared) error at the output averaged over all 
training data during a sweep fell below a threshold t, or 
c. the error reduction /5 after the ith sweep through the data fell below a 
threshold t2, where Si=/36i_+(l-)[ei-e;_[. Here /3 is some 
positive constant less than unity, and ei is the error defined in b. 
In our sinrelations we set/:/ = 0.97, t = 10 -2 and I2 = 10 -5. 
138 Singhal and Wu 
Example 1 - Meshed, Disconnected Regions: 
Figure l(a) shows the mapping with 2 disconnected, meshed regions 
surrounded by two regions that fill up the space. We used 3-layer perceptrons 
with 10 hidden nodes in each hidden layer to Figure 2 shows the RMS error 
obtained during training for the Kalman algorithm and back-propagation 
averaged over 10 different initial conditions. The number of sweeps through 
the data (x-axis) are plotted on a logarithmic scale to highlight the initial 
reduction for the Kalman algorithm. Typical solutions obtained by the 
algorithms at termination are shown in Fig. 3. It can be seen that the Kalman 
algorithm converges in fewer iterations than back-propagation and obtains 
better solutions. 
1 
Average 0.6 - 
RMS 
Error 0.4 
0.2 
Kal m an 
0 
I I I I I I I I I I 
1 2 5 10 20 50 100 200 500 10002000 
No. of Iterations 
Figure 2. Average output error during training for Regions problem using the 
Kahnan algorithm and backprop 
I I 
I 1 
Figure 3. Typical solutions for Regions problem using (a) Kahnan algorithm 
and (b) backprop. 
Training Multilayer Perceptrons ! 39 
Example 2 - 2 Input XOR: 
Figure l(b) shows a generalized 2-input XOR with the first and third 
quadrants forming region i and the second and fourth quadrants forming 
region 2. We attempted the problem with two layer networks containing 2-4 
nodes in the hidden layer. Figure 4 shows the results of training averaged 
over 10 different randomly chosen initial conditions. As the number of nodes 
in the hidden layer is increased, the net converges to smaller error values. 
When we examine the output decision regions, we found that none of the nets 
attempted with back-propagation reached the desired solution. The Kalman 
algorithm was also unable to find the desired solution with 2 hidden nodes in 
the network. However, it reached the desired solution with 6 out of 10 initial 
conditions with 3 hidden nodes in the network and 9 out of 10 initial 
conditions with 4 hidden nodes. Typical solutions reached by the two 
algorithms are shown in Fig. 5. In all cases, the Kalman algorithm converged 
in fewer iterations and in all but one case, the final average output error was 
smaller with the Kalman algorithm. 
0.8 m 
Average 0.6 
RMS 
Error 0.4- 
0.2-- 
" .  .... nodes 
mBP 4 nodes 
Kalman 3 nodes 
Kalman 4 nodes 
I I I I I I [ I I I 
1 2 5 10 20 50 100 200 500 10002000 
No. of Iterations 
Figure 4. Average output error during training for XOR problem using the 
Kahnan algorithm and backprop 
CONCLUSIONS 
In this paper, we showed that training feed-forward nets can be viewed as a 
system identification problem for a nonlinear dynamic system. For linear 
dynamic systems, the Kalman filter is known to produce an optimal estimator. 
Extended versions of the Kahnan algorithm can be used to train feed-forward 
networks. We examined the performance of the Kahnan algorithm using 
artilicially constructed examples with two inputs and found that the algorithm 
typically converges in a few iterations. We also used back-propagation on the 
same examples and found that inwriably, the Kalman algorithm converged in 
! 40 inhal and Wu 
2 
1 
2 
I 
(a) 
Figure 5. Typical solutions for XOR problem using algorithm and 
(b) backprop. 
fewer iterations. For the XOR problem, back-propagation failed to converge 
on any of the cases considered while the Kalman algorithm was able to find 
solutions with the same network configurations. 
References 
[1] 
B. D. O. Anderson and J. B. Moore, Optimal Fihering, Prentice Hall, 
1979. 
[3] 
[4] 
[5] 
[6] 
[71 
A. Gelb, Ed., Applied Optimal Estimation, MIT Press, 1974. 
B. Irie, and S. Miyake, "Capabilities of Three-layered Perceptrons," 
Proceedings of the IEEE International Conference on Neural Networks, 
San Diego, June 1988, Vol. I, pp. 641-648. 
R. E. Kalman, "A New Approach to Linear Filtering and Prediction 
Problems," J. Basic Eng., Trans. ASME, Series D, Vol 82, No.1, 1960, 
pp. 35-45. 
R. W. Prager and F. Fallside, "The Modified Kanerva Model for 
Automatic Speech Recognition," in 1988 IEEE Workshop on Speech 
Recognition, Arden House, Harriman NY, May 31-June 3, 1988. 
D. E. Rmnelhart, G. E. Hinton and R. J. Williams, "Learning Internal 
Representations by Error Propagation," in D.E. Rumelhart and 
J. L. McCelland (Eds.), Parallel Distributed Processing: Explorations in 
the Microstructure of Cognition. Vol 1: Foundations. MIT Press, 1986. 
A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. Lang 
"Phoneme Recognition Using Time-Delay Neural Networks," ATR 
inlernal Report TR-I-0006, October 30, 1987. 
