Global Optimisation of Neural Network 
Models Via Sequential Sampling 
Jo&o FG de Yeitas 
Cambridge University 
Engineering Department 
Cambridge CB2 1PZ England 
jfgf@eng.cam.ac.uk 
[Corresponding author] 
Arnaud Doucet 
Cambridge University 
Engineering Department 
Cambridge CB2 1PZ England 
ad2@eng.cam.ac.uk 
Mahesan Niranjan
Cambridge University 
Engineering Department 
Cambridge CB2 1PZ England 
niranjan@eng.cam.ac.uk
Andrew H Gee 
Cambridge University 
Engineering Department 
Cambridge CB2 1PZ England 
ahg@eng.cam.ac.uk 
Abstract 
We propose a novel strategy for training neural networks using se- 
quential sampling-importance resampling algorithms. This global 
optimisation strategy allows us to learn the probability distribu- 
tion of the network weights in a sequential framework. It is well 
suited to applications involving on-line, nonlinear, non-Gaussian or 
non-stationary signal processing. 
1 INTRODUCTION 
This paper addresses sequential training of neural networks using powerful sampling 
techniques. Sequential techniques are important in many applications of neural
networks involving real-time signal processing, where data arrival is inherently
sequential. Furthermore, one might wish to adopt a sequential training strategy to deal
with non-stationarity in signals, so that information from the recent past is lent more
credence than information from the distant past. One way to sequentially estimate
neural network models is to use a state space formulation and the extended Kalman
filter (Singhal and Wu 1988, de Freitas, Niranjan and Gee 1998). This involves local 
linearisation of the output equation, which can be easily performed, since we only 
need the derivatives of the output with respect to the unknown parameters. This 
approach has been employed by several authors, including ourselves. 
However, local linearisation leading to the EKF algorithm is a gross simplification of 
the probability densities involved. Nonlinearity of the output model induces multi- 
modality of the resulting distributions. Gaussian approximation of these densities 
will lose important details. The approach we adopt in this paper is one of sampling.
In particular, we discuss the use of 'sampling-importance resampling' and 'sequential
importance sampling' algorithms, also known as particle filters (Gordon, Salmond 
and Smith 1993, Pitt and Shephard 1997), to train multi-layer neural networks. 
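To illustrate the multi-modality argument, the following minimal sketch (not taken from the paper) considers a single weight w with a hypothetical output model y = w^2 + v, Gaussian noise v and a standard Gaussian prior on w. The function names, the grid and the numerical values are all assumptions chosen for illustration. The posterior is evaluated on a grid and its local maxima are listed; a single Gaussian centred on one of them would miss the other.

```python
import math

def posterior_on_grid(y, r_var, grid):
    """Normalised grid posterior of w for the toy model y = w**2 + v.

    Assumptions (illustrative only): v ~ N(0, r_var) and prior w ~ N(0, 1).
    """
    dens = [math.exp(-0.5 * w * w - 0.5 * (y - w * w) ** 2 / r_var) for w in grid]
    total = sum(dens)
    return [d / total for d in dens]

def local_maxima(grid, dens):
    """Return the grid points that are strict local maxima of dens."""
    return [grid[i] for i in range(1, len(grid) - 1)
            if dens[i - 1] < dens[i] > dens[i + 1]]

# Observation y = 1 with noise variance 0.1 on a grid over [-3, 3].
grid = [i / 100.0 for i in range(-300, 301)]
modes = local_maxima(grid, posterior_on_grid(1.0, 0.1, grid))
print(modes)  # two symmetric modes, one on each side of zero
```

Because the squared weight enters the output, the sign of w cannot be identified from y, and the posterior has one mode on each side of zero.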
2 STATE SPACE NEURAL NETWORK MODELLING 
We start from a state space representation to model the neural network's evolution 
in time. A transition equation describes the evolution of the network weights, while 
a measurement equation describes the nonlinear relation between the inputs and
outputs of a particular physical process, as follows: 
$$w_{k+1} = w_k + d_k \tag{1}$$
$$y_k = g(w_k, x_k) + v_k \tag{2}$$
where $y_k \in \mathbb{R}^o$ denotes the output measurements, $x_k$ the input
measurements and $w_k \in \mathbb{R}^m$ the neural network weights. The nonlinear measurement
mapping $g(\cdot)$ is approximated by a multi-layer perceptron (MLP). The
measurements are assumed to be corrupted by noise $v_k$. In the sequential Monte Carlo
framework, the probability distribution of the noise is specified by the user. In 
our examples we shall choose a zero mean Gaussian distribution with covariance 
R. The measurement noise is assumed to be uncorrelated with the network weights 
and initial conditions. 
We model the evolution of the network weights by assuming that they depend 
on the previous value wk and a stochastic component dk. The process noise dk 
may represent our uncertainty in how the parameters evolve, modelling errors or 
unknown inputs. We assume the process noise to be a zero mean Gaussian process 
with covariance Q; however, other distributions can also be adopted. This choice of
distributions for the network weights requires further research. The process noise 
is also assumed to be uncorrelated with the network weights. 
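The state space model of equations (1) and (2) can be simulated directly, which is a convenient way of producing synthetic data for testing a sequential estimator. The sketch below is only illustrative: `tiny_mlp` is a hypothetical stand-in for g with one tanh hidden unit, and the covariances Q and R are assumed to be scalar variances `q_var` and `r_var` applied independently to each component.

```python
import math
import random

def tiny_mlp(w, x):
    """Hypothetical g(w, x): one tanh hidden unit followed by a linear output."""
    return w[2] * math.tanh(w[0] * x + w[1]) + w[3]

def simulate_state_space(g, w0, inputs, q_var, r_var, rng):
    """Simulate equations (1) and (2) with zero mean Gaussian noise.

    q_var and r_var are assumed scalar variances (Q = q_var * I, R = r_var).
    Returns the weight trajectory and the noisy outputs.
    """
    w = list(w0)
    weights, outputs = [], []
    for x in inputs:
        weights.append(list(w))
        # Measurement equation (2): y_k = g(w_k, x_k) + v_k
        outputs.append(g(w, x) + rng.gauss(0.0, math.sqrt(r_var)))
        # Transition equation (1): w_{k+1} = w_k + d_k
        w = [wi + rng.gauss(0.0, math.sqrt(q_var)) for wi in w]
    return weights, outputs

rng = random.Random(0)
inputs = [0.1 * i for i in range(30)]
weights, outputs = simulate_state_space(tiny_mlp, [1.0, 0.0, 2.0, -1.0],
                                        inputs, 1e-3, 1e-4, rng)
```

Setting `q_var` to zero recovers a stationary network, so the same routine covers both the time-varying and the fixed-weight case.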
The posterior density p(W]Y), where Y = {y, Y2, '", Yk} and W = 
{w, w2, --., wk}, constitutes the complete solution to the sequential estima- 
tion problem. In many applications, such as tracking, it is of interest to estimate 
one of its marginals, namely the filtering density P(wklYk). By computing the fil- 
tering density recursire]y, we do not need to keep track of the complete history of 
the weights. Thus, from a storage point of view, the filtering density turns out 
to be more parsimonious than the full posterior density function. If we know the 
filtering density of the network weights, we can easily derive various estimates of 
the network weights, including centfolds, modes, medians and confidence intervals. 
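As a rough illustration of how such estimates follow from a sample-based description of the filtering density, the sketch below summarises the marginal of a single weight from weighted samples. The function name `weighted_summary` and the quantile-based interval are assumptions made for this example, not part of the method described here.

```python
def weighted_summary(samples, weights, level=0.9):
    """Centroid, median and central interval of one weight from weighted samples.

    samples: values of a single network weight, one per sample.
    weights: non-negative importance weights (need not sum to one).
    level:   probability mass of the central interval (assumed convention).
    """
    total = sum(weights)
    probs = [q / total for q in weights]
    mean = sum(p * s for p, s in zip(probs, samples))
    order = sorted(range(len(samples)), key=lambda i: samples[i])

    def quantile(u):
        # Smallest sample whose cumulative probability reaches u.
        acc = 0.0
        for i in order:
            acc += probs[i]
            if acc >= u:
                return samples[i]
        return samples[order[-1]]

    tail = (1.0 - level) / 2.0
    return mean, quantile(0.5), (quantile(tail), quantile(1.0 - tail))

mean, median, interval = weighted_summary([1.0, 2.0, 3.0, 4.0],
                                          [1.0, 1.0, 1.0, 1.0], level=0.5)
```

When the filtering density is multi-modal, the centroid may fall between modes, which is one reason to inspect the full set of samples rather than a single summary.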
3 SEQUENTIAL IMPORTANCE SAMPLING 
In the sequential importance sampling optimisation framework, a set of represen- 
tative samples is used to describe the posterior density function of the network 
parameters. Each sample consists of a complete set of network parameters. More 
specifically, we make use of the following Monte Carlo approximation: 
$$\hat{p}(W_k | Y_k) = \frac{1}{N} \sum_{i=1}^{N} \delta(W_k - W_k^{(i)})$$
where $W_k^{(i)}$ represents the samples used to describe the posterior density and $\delta(\cdot)$
denotes the Dirac delta function. Consequently, any expectations of the form:
$$E[f_k(W_k)] = \int f_k(W_k)\, p(W_k | Y_k)\, dW_k$$
may be approximated by the following estimate:
$$\hat{E}[f_k(W_k)] = \frac{1}{N} \sum_{i=1}^{N} f_k(W_k^{(i)})$$
where the samples $W_k^{(i)}$ are drawn from the posterior density function. Typically,
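one cannot draw samples directly from the posterior density. Yet, if we can draw
samples from a proposal density function $\pi(W_k | Y_k)$, we can transform the
expectation under $p(W_k | Y_k)$ to an expectation under $\pi(W_k | Y_k)$ as follows:
$$E[f_k(W_k)] = \int f_k(W_k) \frac{p(W_k | Y_k)}{\pi(W_k | Y_k)} \pi(W_k | Y_k)\, dW_k
= \frac{\int f_k(W_k)\, q_k(W_k)\, \pi(W_k | Y_k)\, dW_k}{\int q_k(W_k)\, \pi(W_k | Y_k)\, dW_k}
= \frac{E_{\pi}[q_k(W_k) f_k(W_k)]}{E_{\pi}[q_k(W_k)]}$$
where the variables $q_k(W_k)$ are known as the unnormalised importance ratios:
$$q_k = \frac{p(Y_k | W_k)\, p(W_k)}{\pi(W_k | Y_k)} \tag{3}$$
Hence, by drawing samples from the proposal function $\pi(\cdot)$, we can approximate
the expectations of interest by the following estimate:
$$\hat{E}[f_k(W_k)] = \frac{\frac{1}{N} \sum_{i=1}^{N} f_k(W_k^{(i)})\, q_k(W_k^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} q_k(W_k^{(i)})} = \sum_{i=1}^{N} \tilde{q}_k^{(i)} f_k(W_k^{(i)}) \tag{4}$$
where the normalised importance ratios $\tilde{q}_k^{(i)}$ are given by:
$$\tilde{q}_k^{(i)} = \frac{q_k^{(i)}}{\sum_{j=1}^{N} q_k^{(j)}}$$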
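The importance sampling estimate with normalised ratios can be sketched in a few lines. The example below is a toy, static illustration rather than the sequential algorithm: the target is an unnormalised Gaussian density centred at 1, the proposal is a wider Gaussian centred at 0, and the function `snis_estimate` and all numerical values are assumptions made for this example. Working with log densities and subtracting the maximum before exponentiating is a common numerical safeguard and is not prescribed by the text.

```python
import math
import random

def snis_estimate(f, log_target, sample_proposal, log_proposal, n, rng):
    """Estimate E[f(W)] under the target using draws from the proposal.

    log_target and log_proposal may omit additive constants, because the
    importance ratios are normalised before use.
    """
    draws = [sample_proposal(rng) for _ in range(n)]
    log_q = [log_target(w) - log_proposal(w) for w in draws]
    shift = max(log_q)                          # avoids overflow in exp
    q = [math.exp(v - shift) for v in log_q]    # unnormalised ratios
    total = sum(q)
    return sum(qi / total * f(w) for qi, w in zip(q, draws))

rng = random.Random(1)
estimate = snis_estimate(
    f=lambda w: w,
    log_target=lambda w: -0.5 * (w - 1.0) ** 2,      # N(1, 1), unnormalised
    sample_proposal=lambda r: r.gauss(0.0, 2.0),     # N(0, 2^2)
    log_proposal=lambda w: -0.5 * (w / 2.0) ** 2,    # constant dropped
    n=20000,
    rng=rng,
)
print(estimate)  # should be close to the target mean of 1
```

The quality of the estimate depends on how well the proposal covers the target; a proposal with lighter tails than the target can make the ratios highly variable.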
It is not difficult to show (de Freitas, Niranjan, Gee and Doucet 1998) that, if we
assume $w$ to be a hidden Markov process with initial density $p(w_0)$ and transition
density $p(w_k | w_{k-1})$, various recursive algorithms can be derived. One of these
algorithms (HySIR), which we derive in (de Freitas, Niranjan, Gee and Doucet
1998), has been shown to perform well in neural network training. Here we extend
the algorithm to deal with multiple noise levels. We have made the software for the
implementation of the HySIR algorithm available at the following web-site:
http://svr-www.eng.cam.ac.uk/~jfgf/software.html. The pseudo-code for the HySIR
algorithm with EKF updating is as follows:
1. INITIALISENET0RKWEIGHTS 
2. For 
(a) 
(k: 0): 
k= 1,...,L 
SAMPLING STAGE: 
For i = 1,...,N 
 Predict via the dynamics equation: 
,,(') w(k i) d(k i) 
wk+ 1 -- q_ 
where d(k i) is a sample from p(dk) (Af(O,Q) 
 Update samples with the EKF equations. 
in our case). 
w(i) 
+1 
+1 
 Evaluate the importance ratios: 
q(i) q?r(Yk+x - (i) 
+1- wk+].) q(ki).Af(g(xk+]., ,, (i) , Rk) 
-- = Wk+l) , 
 Normalise the importance ratios: 
(i) _ + 
qk+l 
(b) RESAMPLING STAGE: 
For i = 1,--.,N 
If Neff > Threshold: 
,(i) _ ,(i) 
 'k+l -- 'k+l 
 p(i) (i) 
k+l   k+l 
r.(i) = Q.(i) 
 k+l k+l 
Else 
 Resample new index j 
from the discrete set lw+x,+x} 
,.,(i) ,(j) (i) b(J) and n*(i) n*(J) 
 "k+l -- "k+l '  k+l --  k+l k+l --- k+l 
 .(i) =  
'/k+l 
where K+x is known as the Kalman gain matrix, Imm denotes the identity matrix of 
size m x m, and R* and * are two tuning parameters, whose roles are explained in 
detail in (de Freitas, Niranjan and Gee 1997). G represents the JacobJan matrix and, 
strictly speaking, P is an approximation to the covariance matrix of the network 
weights. The resampling stage is used to eliminate samples with low probability 
and multiply samples with high probability. Various authors have described efficient 
algorithms for accomplishing this task in O(N) operations (Pitt and Shephard 1997, 
Carpenter, Clifford and Fearnhead 1997, Doucet 1998). 
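The steps of the pseudo-code can be sketched as follows. This is a minimal illustration under several assumptions, not the authors' released implementation: the EKF update is the textbook form for a random-walk state with a scalar output, the Jacobian is obtained by finite differences, `model` is a hypothetical one-hidden-neuron network, the effective sample size is taken to be the usual 1 / sum of squared normalised ratios, the importance ratio is evaluated at the EKF-updated weights, and systematic resampling is used as one possible O(N) scheme. All function and parameter names are invented for this example.

```python
import math
import random

def model(w, x):
    """Hypothetical g(w, x): one logistic hidden neuron and a linear output."""
    a = max(min(w[0] * x[0] + w[1] * x[1] + w[2], 50.0), -50.0)
    return w[3] / (1.0 + math.exp(-a)) + w[4]

def jacobian(w, x, eps=1e-6):
    """Finite-difference gradient of g with respect to the weights (stands in for G)."""
    base = model(w, x)
    return [(model(w[:i] + [w[i] + eps] + w[i + 1:], x) - base) / eps
            for i in range(len(w))]

def ekf_update(w, P, x, y, q_star, r_star):
    """Textbook EKF step (assumed form) for a random-walk state, scalar output."""
    m = len(w)
    Pp = [[P[i][j] + (q_star if i == j else 0.0) for j in range(m)] for i in range(m)]
    G = jacobian(w, x)
    PG = [sum(Pp[i][j] * G[j] for j in range(m)) for i in range(m)]
    s = r_star + sum(G[i] * PG[i] for i in range(m))     # innovation variance
    K = [PG[i] / s for i in range(m)]                    # Kalman gain
    err = y - model(w, x)
    w_new = [w[i] + K[i] * err for i in range(m)]
    P_new = [[Pp[i][j] - K[i] * PG[j] for j in range(m)] for i in range(m)]
    return w_new, P_new

def gauss_pdf(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def systematic_resample(probs, rng):
    """O(N) systematic resampling; returns N indices drawn according to probs."""
    n = len(probs)
    u = rng.random() / n
    idx, acc, j = [], probs[0], 0
    for i in range(n):
        target = u + i / n
        while acc <= target and j < n - 1:
            j += 1
            acc += probs[j]
        idx.append(j)
    return idx

def hysir_step(particles, x, y, q_var, r_var, r_star, threshold, rng):
    """One iteration; each particle is a dict with keys 'w', 'P', 'q', 'q_star'."""
    for p in particles:
        pred = [wi + rng.gauss(0.0, math.sqrt(q_var)) for wi in p['w']]   # predict
        p['w'], p['P'] = ekf_update(pred, p['P'], x, y, p['q_star'], r_star)
        p['q'] *= gauss_pdf(y, model(p['w'], x), r_var)                   # importance ratio
    total = sum(p['q'] for p in particles)   # can underflow to zero for tiny r_var
    probs = [p['q'] / total for p in particles]                           # normalise
    n_eff = 1.0 / sum(pr * pr for pr in probs)
    if n_eff > threshold:
        for p, pr in zip(particles, probs):
            p['q'] = pr
        return particles, n_eff
    n = len(particles)
    idx = systematic_resample(probs, rng)
    return [{'w': list(particles[j]['w']), 'P': [row[:] for row in particles[j]['P']],
             'q': 1.0 / n, 'q_star': particles[j]['q_star']} for j in idx], n_eff
```

Carrying `q_star` inside each particle mirrors the idea of letting every trajectory keep its own noise level, so that resampling also selects among the candidate levels.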
4 EXPERIMENT
To assess the ability of the hybrid algorithm to estimate time-varying hidden param- 
eters, we generated input-output data from a logistic function followed by a linear 
scaling and a displacement as shown in Figure 1. This simple model is equivalent 
to an MLP with one hidden neuron and an output linear neuron. We applied two 
Gaussian ($\mathcal{N}(0, 10)$) input sequences to the model and corrupted the weights and
output values with Gaussian noise ($\mathcal{N}(0, 1 \times 10^{-3})$ and $\mathcal{N}(0, 1 \times 10^{-4})$ respectively).
We then trained a second model with the same structure using the input-output
data generated by the first model. In so doing, we chose 100 sampling trajectories
and set $R$ to 10, $Q$ to $1 \times 10^{-3} I_{55}$, the initial weights variance to 5, $P_0$ to $100 I_{55}$ and
$R^*$ to $1 \times 10^{-5}$. The process noise parameter $Q^*$ was set to three levels: $5 \times 10^{-3}$,
$1 \times 10^{-3}$ and 1 x 10 - [exponent missing in the source], as shown in the plot of Figure 2 at time zero. In the training
phase, of 200 time steps, we allowed the model weights to vary with time. During
this phase, the HySIR algorithm was used to track the input-output training data
and estimate the latent model weights. In addition, we assumed three possible noise
variance levels at the beginning of the training session. After the 200-th time step,
we fixed the values of the weights and generated another 200 input-output
test data sets from the original model. The input test data was then fed to the trained
model, using the weight values estimated at the 200-th time step. Subsequently,
the output prediction of the trained model was compared to the output data from
the original model to assess the generalisation performance of the training process.
As shown in Figure 2, the noise level of the trajectories converged to the true value
($1 \times 10^{-3}$). In addition, it was possible to track the network weights and obtain
accurate output predictions as shown in Figures 3 and 4.

Figure 1: Logistic function with linear scaling and displacement used in the
experiment. The weights were chosen as follows: $w_1(k) = 1 + k/100$, $w_2(k) =
\sin(0.06k) - 2$, $w_3(k) = 0.1$, $w_4(k) = 1$, $w_5(k) = -0.5$.

Figure 2: Noise level estimation with the HySIR algorithm.
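The data generation for this experiment can be sketched as below. The weight schedules follow the caption of Figure 1, but the role assigned to each weight is an assumption, since the figure itself is not reproduced here: $w_1$ and $w_2$ are taken to multiply the two inputs, $w_3$ to be the hidden bias, $w_4$ the linear scaling and $w_5$ the displacement. The second argument of each Gaussian is also assumed to be a variance. The function names are hypothetical.

```python
import math
import random

def true_weights(k):
    """Time-varying weights from the Figure 1 caption (index k is the time step)."""
    return [1.0 + k / 100.0, math.sin(0.06 * k) - 2.0, 0.1, 1.0, -0.5]

def logistic_model(w, x):
    """ASSUMED layout: w[0], w[1] input weights, w[2] bias, w[3] scaling, w[4] displacement."""
    a = max(min(w[0] * x[0] + w[1] * x[1] + w[2], 50.0), -50.0)
    return w[3] / (1.0 + math.exp(-a)) + w[4]

def generate_training_data(steps, rng):
    """Return a list of (x, y) pairs with noisy weights and noisy outputs."""
    data = []
    for k in range(steps):
        # Weights corrupted with N(0, 1e-3) noise (variance assumed).
        w = [wi + rng.gauss(0.0, math.sqrt(1e-3)) for wi in true_weights(k)]
        # Two Gaussian N(0, 10) inputs (variance assumed).
        x = [rng.gauss(0.0, math.sqrt(10.0)), rng.gauss(0.0, math.sqrt(10.0))]
        # Output corrupted with N(0, 1e-4) noise (variance assumed).
        y = logistic_model(w, x) + rng.gauss(0.0, math.sqrt(1e-4))
        data.append((x, y))
    return data

data = generate_training_data(200, random.Random(0))
```

Freezing the weights at `true_weights(200)` and calling `logistic_model` on fresh inputs would give a test set in the spirit of the second phase described above.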
Figure 3: One step ahead predictions during the training phase (left) and stationary
predictions in the test phase (right). [Both panels are titled "Output prediction".]
Figure 4: Weights tracking performance with the HySIR algorithm. As indicated
by the histograms of $w_2$, the algorithm performs a global search in parameter space.
[Axis labels recovered from the plot: Time, $w_2$.]
5 CONCLUSIONS
In this paper, we have presented a sequential Monte Carlo approach for training 
neural networks in a Bayesian setting. In particular, we proposed an algorithm 
(HySIR) that makes use of both gradient and sampling information. HySIR can be 
interpreted as a Gaussian mixture filter, in that only a few sampling trajectories 
need to be employed. Yet, as the number of trajectories increases, the computational 
requirements increase only linearly. Therefore, the method is also suitable as a 
sampling strategy for approximating multi-modal distributions. Further avenues 
of research include the design of algorithms for adapting the noise covariances R 
and Q, studying the effect of different noise models for the network weights and 
improving the computational efficiency of the algorithms. 
ACKNOWLED(EMENTS 
Jofo FG de Freitas is financially supported by two University of the Witwatersrand 
Merit Scholarships, a Foundation for Research Development Scholarship (South 
Africa), an ORS award and a Trinity College External Studentship (Cambridge). 
References 
Carpenter, J., Clifford, P. and Fearnhead, P. (1997). An improved particle filter for
non-linear problems, Technical report, Department of Statistics, Oxford
University, England. Available at http://www.stats.ox.ac.uk/~clifford/index.htm.
de Freitas, J. F. G., Niranjan, M. and Gee, A. H. (1997). Hierarchical
Bayesian-Kalman models for regularisation and ARD in sequential learning,
Technical Report CUED/F-INFENG/TR 307, Cambridge University,
http://svr-www.eng.cam.ac.uk/~jfgf.
de Freitas, J. F. G., Niranjan, M. and Gee, A. H. (1998). Regularisation in sequential
learning algorithms, in M. I. Jordan, M. J. Kearns and S. A. Solla (eds),
Advances in Neural Information Processing Systems, Vol. 10, MIT Press.
de Freitas, J. F. G., Niranjan, M., Gee, A. H. and Doucet, A. (1998). Sequential
Monte Carlo methods for optimisation of neural network models,
Technical Report CUED/F-INFENG/TR 328, Cambridge University,
http://svr-www.eng.cam.ac.uk/~jfgf.
Doucet, A. (1998). On sequential simulation-based methods for Bayesian filtering, 
Technical Report CUED/F-INFENG/TR 310, Cambridge University. Avail- 
able at http://www.stats.bris.ac.uk:81/MCMC/pages/list.html. 
Gordon, N. J., Salmond, D. J. and Smith, A. F. M. (1993). Novel approach
to nonlinear/non-Gaussian Bayesian state estimation, IEE Proceedings-F
140(2): 107-113.
Pitt, M. K. and Shephard, N. (1997). Filtering via simulation: Auxiliary particle
filters, Technical report, Department of Statistics, Imperial College of London,
England. Available at http://www.nuff.ox.ac.uk/economics/papers.
Singhal, S. and Wu, L. (1988). Training multilayer perceptrons with the extended
Kalman algorithm, in D. S. Touretzky (ed.), Advances in Neural Information
Processing Systems, Vol. 1, San Mateo, CA, pp. 133-140.
