Weight Space Probability Densities 
in Stochastic Learning: 
II. Transients and Basin Hopping Times 
Genevieve B. Orr and Todd K. Leen 
Department of Computer Science and Engineering 
Oregon Graduate Institute of Science & Technology 
19600 N.W. von Neumann Drive 
Beaverton, OR 97006-1999 
Abstract 
In stochastic learning, weights are random variables whose time 
evolution is governed by a Markov process. At each time-step, 
n, the weights can be described by a probability density function 
$P(\omega, n)$. We summarize the theory of the time evolution of $P$, and
give graphical examples of the time evolution that contrast the 
behavior of stochastic learning with true gradient descent (batch 
learning). Finally, we use the formalism to obtain predictions of the 
time required for noise-induced hopping between basins of different 
optima. We compare the theoretical predictions with simulations 
of large ensembles of networks for simple problems in supervised 
and unsupervised learning.
1 Weight-Space Probability Densities
Despite the recent application of convergence theorems from stochastic approxima- 
tion theory to neural network learning (Oja 1982, White 1989) there remain out- 
standing questions about the search dynamics in stochastic learning. For example, 
the convergence theorems do not tell us to which of several optima the algorithm 
is likely to converge . Also, while it is widely recognized that the intrinsic noise 
in the weight update can move the system out of sub-optimal local minima (for a 
graphical example, see Darken and Moody 1991), there have been no theoretical 
predictions of the time required to escape from local optima, or of its dependence 
on learning rates. 
In order to more fully understand the dynamics of stochastic search, we study the 
weight-space probability density and its time evolution. In this paper we summarize 
a theoretical framework that describes this time evolution. We graphically portray 
the motion of the density for examples that contrast stochastic and batch learning. 
Finally we use the theory to predict the statistical distribution of times required for 
escape from local optima. We compare the theoretical results with simulations for 
simple examples in supervised and unsupervised learning. 
2 Stochastic Learning and Noisy Maps 
2.1 Motion of the Probability Density 
We consider stochastic learning algorithms of the form 
$$\omega(n+1) = \omega(n) + \mu\, H[\omega(n), x(n)] \tag{1}$$
where $\omega(n) \in \mathbb{R}^m$ is the weight, $x(n)$ is the data exemplar input to the algorithm at
time-step $n$, $\mu$ is the learning rate, and $H[\cdots] \in \mathbb{R}^m$ is the weight update function.
The exemplars x(n) can be either inputs or, in the case of supervised learning, 
input/target pairs. We assume that the $x(n)$ are i.i.d. with density $p(x)$. Angled
brackets $\langle \cdots \rangle_x$ denote averaging over this density. In what follows, the learning
rate will be held constant. 
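A minimal sketch of iterating the update rule (1) for an ensemble of networks may help fix ideas. The one-dimensional update function `update` (the negative gradient of a quadratic cost), the Gaussian exemplar density, and all parameter values are assumptions chosen for illustration; they are not taken from the experiments in this paper.

```python
import numpy as np

def update(w, x):
    """Hypothetical update H[w, x]: negative gradient of 0.5 * (w - x)**2."""
    return x - w

def run_ensemble(n_networks=5000, n_steps=50, mu=0.1, w0=2.0, seed=0):
    """Iterate w(n+1) = w(n) + mu * H[w(n), x(n)] for every network in the ensemble."""
    rng = np.random.default_rng(seed)
    w = np.full(n_networks, w0)
    for _ in range(n_steps):
        # each network receives its own i.i.d. exemplar (assumed standard normal)
        x = rng.normal(0.0, 1.0, size=n_networks)
        w = w + mu * update(w, x)
    return w

weights = run_ensemble()
# a normalized histogram of the ensemble is an empirical estimate of P(w, n)
density, edges = np.histogram(weights, bins=40, density=True)
```

Because the learning rate is held constant, the ensemble does not collapse to a point; the histogram retains a finite width set by the noise in the update.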
The learning algorithm (1) is a noisy map on $\omega$. The weights are thus random
variables described by the probability density function $P(\omega, n)$. The time evolution
of this density is given by the Kolmogorov equation
$$P(\omega, n+1) = \int d\omega'\, P(\omega', n)\, W(\omega' \to \omega) \tag{2}$$
where the single time-step transition probability is given by (Leen and Orr 1992, 
Leen and Moody 1993) 
$$W(\omega' \to \omega) = \left\langle \delta\big(\omega - \omega' - \mu H[\omega', x]\big) \right\rangle_x \tag{3}$$
and $\delta(\cdots)$ is the Dirac delta function.
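The Kolmogorov equation can be iterated numerically on a grid. The sketch below is one crude way to do so, assuming a one-dimensional weight, a finite set of exemplars `xs` with probabilities `px`, and a nearest-bin discretization of the delta function; the function name `kolmogorov_step` and the toy update rule are hypothetical and are not the integration scheme used for the figures in this paper.

```python
import numpy as np

def kolmogorov_step(P, grid, mu, H, xs, px):
    """One step of P(w, n+1) = sum_x p(x) * integral dw' P(w', n) delta(w - w' - mu*H[w', x]).

    P holds the probability mass in each grid bin. The mass in bin w' is moved
    to the bin nearest to w' + mu * H[w', x] for each exemplar x.
    """
    dw = grid[1] - grid[0]
    P_next = np.zeros_like(P)
    for x, p in zip(xs, px):
        target = grid + mu * H(grid, x)
        idx = np.rint((target - grid[0]) / dw).astype(int)
        idx = np.clip(idx, 0, len(grid) - 1)   # keep mass on the grid
        np.add.at(P_next, idx, p * P)          # accumulate mass arriving in each bin
    return P_next

# Toy example (assumed): H[w, x] = x - w, with x = -1 or +1 equally likely
grid = np.linspace(-3.0, 3.0, 601)
P = np.zeros_like(grid)
P[np.argmin(np.abs(grid - 2.0))] = 1.0         # all networks start at w = 2
for _ in range(100):
    P = kolmogorov_step(P, grid, mu=0.1, H=lambda w, x: x - w,
                        xs=[-1.0, 1.0], px=[0.5, 0.5])
mean_w = np.sum(grid * P)
```

The nearest-bin rule conserves total probability but introduces a position error of up to half a bin per step, so the grid spacing should be small compared with a typical update.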
The Kolmogorov equation can be recast as a differential-difference equation by
expanding the transition probability (3) as a power series in $\mu$. This gives a
Kramers-Moyal expansion (Leen and Orr 1992, Leen and Moody 1993)
¹ However, Kushner (1987) has proved convergence to global optima for stochastic
approximation algorithms with added Gaussian noise subject to logarithmic annealing 
schedules. 
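$$P(\omega, n+1) - P(\omega, n) = \sum_{i=1}^{\infty} \frac{(-\mu)^i}{i!} \sum_{j_1,\ldots,j_i=1}^{m} \frac{\partial^i}{\partial\omega_{j_1} \cdots \partial\omega_{j_i}} \Big[ \big\langle H_{j_1} \cdots H_{j_i} \big\rangle_x\, P(\omega, n) \Big] \tag{4}$$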
where w5 and Hj are the ja component of weight, and weight update, respectively. 
uncating (4) to second order in p leaves a Fokker-Planck equation  that is valid for 
small [oH [. The drift coefficient { H ) is simply the average update. It is important 
to note that the diffusion coefficients, (H; H; , c be stronol dependent on 
location in the weight-space. This spaiffldee'ndence inn,n both equilibria 
and trsient phenomena. In section 3.1 we will use both the Kolmogorov equation 
(2), d the Fokker-Planck equation to track the time evolution of network ensemble 
densities. 
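For reference, keeping only the $i = 1$ and $i = 2$ terms of (4) gives the explicit form sketched below. It is a direct expansion of (4); the index names $j$ and $k$ are introduced here only for illustration.

```latex
P(\omega, n+1) - P(\omega, n) \approx
  -\mu \sum_{j=1}^{m} \frac{\partial}{\partial\omega_j}
      \Big[ \langle H_j \rangle_x \, P(\omega, n) \Big]
  + \frac{\mu^2}{2} \sum_{j,k=1}^{m}
      \frac{\partial^2}{\partial\omega_j \, \partial\omega_k}
      \Big[ \langle H_j H_k \rangle_x \, P(\omega, n) \Big]
```

The first term is the drift, driven by the average update, and the second is the diffusion, driven by the second moments of the update.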
2.2 First Passage Times 
Our discussion of basin hopping will use the notion of the first passage time (Gardiner,
1990): the time required for a network initialized at $\omega_0$ to first pass into an
$\epsilon$-neighborhood $D$ of a global or local optimum $\omega_*$ (see Figure 1). The first passage
time is a random variable. Its distribution function $\mathcal{P}(n; \omega_0)$ is the probability that
a network initialized at $\omega_0$ makes its first passage into $D$ at the $n$th iteration of the
learning rule.
Figure 1: Sample search path.
To arrive at an expression for $\mathcal{P}(n; \omega_0)$, we first examine the probability of passing
from the initial weight $\omega_0$ to the weight $\omega$ after $n$ iterations. This probability can
be expressed as
$$P(\omega, n \mid \omega_0, 0) = \int d\omega'\, P(\omega, n \mid \omega', 1)\, W(\omega_0 \to \omega') \tag{5}$$
Substituting the single time-step transition probability (3) into the above expression,
integrating over $\omega'$, and making use of the time-shift invariance of the system³,
we find
$$P(\omega, n \mid \omega_0, 0) = \big\langle P(\omega, n-1 \mid \omega_0 + \mu H[\omega_0, x], 0) \big\rangle_x \tag{6}$$
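The step from (5) to (6) can be spelled out as follows. This is a sketch of the intermediate algebra, using only (3), (5), and the time-shift invariance stated in the footnote: the delta function collapses the integral over $\omega'$, and the time origin is then shifted back by one step.

```latex
\begin{aligned}
P(\omega, n \mid \omega_0, 0)
  &= \int d\omega'\, P(\omega, n \mid \omega', 1)\,
     \big\langle \delta(\omega' - \omega_0 - \mu H[\omega_0, x]) \big\rangle_x \\
  &= \big\langle P(\omega, n \mid \omega_0 + \mu H[\omega_0, x], 1) \big\rangle_x \\
  &= \big\langle P(\omega, n-1 \mid \omega_0 + \mu H[\omega_0, x], 0) \big\rangle_x
\end{aligned}
```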
Next, let $G(n; \omega_0)$ denote the probability that a network initialized at $\omega_0$ has not
passed into the region $D$ by the $n$th iteration. We obtain $G(n; \omega_0)$ by integrating
$P(\omega, n \mid \omega_0, 0)$ over weights $\omega$ not in $D$:
$$G(n; \omega_0) = \int_{D^c} d\omega\, P(\omega, n \mid \omega_0, 0) \tag{7}$$
² See (Ritter and Schulten 1988) and (Radons et al. 1990) for independent derivations.
³ With our assumptions of a constant learning rate $\mu$ and stationary sample
density $p(x)$, the system is time-shift invariant. Mathematically stated,
$P(\omega, n \mid \omega', m) = P(\omega, n-1 \mid \omega', m-1)$.
where $D^c$ is the complement of $D$. Substituting equation (6) into (7) and integrating
over $\omega$ we obtain the recursion
$$G(n; \omega_0) = \big\langle G(n-1; \omega_0 + \mu H[\omega_0, x]) \big\rangle_x \tag{8}$$
Before any learning takes place, none of the networks in the ensemble have entered 
D. Thus the initial condition for G is 
$$G(0; \omega_0) = 1, \quad \omega_0 \in D^c \tag{9}$$
Networks that have entered $D$ are removed from the ensemble (i.e. $\partial D$ is an absorbing
boundary). Thus $G$ satisfies the boundary condition
$$G(n; \omega_0) = 0, \quad \omega_0 \in D \tag{10}$$
The probability that the network has not passed into the region $D$ on or
before iteration $n-1$, minus the probability that the network has not passed into $D$
on or before iteration $n$, is simply the probability that the network has passed into
$D$ exactly at iteration $n$. This is just the probability for first passage into $D$ at
time-step $n$. Thus
$$\mathcal{P}(n; \omega_0) = G(n-1; \omega_0) - G(n; \omega_0) \tag{11}$$
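The recursion for G, its initial and boundary conditions, and the first passage distribution can be sketched numerically as below. This is a minimal illustration under stated assumptions: a one-dimensional weight on a uniform grid, a finite set of exemplars, and linear interpolation of G between grid points. The function name `first_passage_distribution`, the toy update rule, and the choice of region D are hypothetical and are not the problems studied in this paper.

```python
import numpy as np

def first_passage_distribution(grid, in_D, mu, H, xs, px, n_max):
    """Iterate G(n; w0) = < G(n-1; w0 + mu*H[w0, x]) >_x on a grid.

    in_D is a boolean mask marking grid points inside the target region D.
    Returns fp with fp[n-1, k] = G(n-1; grid[k]) - G(n; grid[k]), i.e. the
    probability of first passage into D at iteration n starting from grid[k].
    """
    G = np.where(in_D, 0.0, 1.0)              # G(0; w0) = 1 outside D, 0 inside
    fp = np.zeros((n_max, len(grid)))
    for n in range(n_max):
        G_new = np.zeros_like(G)
        for x, p in zip(xs, px):
            # average over exemplars, interpolating G(n-1; .) at the updated weight
            G_new += p * np.interp(grid + mu * H(grid, x), grid, G)
        G_new[in_D] = 0.0                      # absorbing region: G = 0 inside D
        fp[n] = G - G_new
        G = G_new
    return fp

# Toy example (assumed): H[w, x] = x - w, x = -1 or +1, and D = {w < -0.5}
grid = np.linspace(-1.5, 1.5, 1201)
fp = first_passage_distribution(grid, in_D=grid < -0.5, mu=0.3,
                                H=lambda w, x: x - w,
                                xs=[-1.0, 1.0], px=[0.5, 0.5], n_max=200)
i0 = np.argmin(np.abs(grid - 1.0))             # networks initialized at w0 = 1
first_passage_from_w0 = fp[:, i0]
```

In this toy problem a network started at w0 = 1 needs four consecutive exemplars x = -1 to reach D, so the earliest possible first passage is at iteration 4, with probability 1/16. Interpolation smooths G near the boundary of D, so the grid should be fine compared with the size of a single update.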
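Finally, the recursion (8) for $G$ can be expanded in a power series in $\mu$ to obtain the
backward Kramers-Moyal equation
$$G(n; \omega) - G(n-1; \omega) = \sum_{i=1}^{\infty} \frac{\mu^i}{i!} \sum_{j_1,\ldots,j_i=1}^{m} \big\langle H_{j_1} \cdots H_{j_i} \big\rangle_x\, \frac{\partial^i}{\partial\omega_{j_1} \cdots \partial\omega_{j_i}}\, G(n-1; \omega) \tag{12}$$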
Truncation to second order in $\mu$ results in the backward Fokker-Planck equation. In
section 3.2 we will use both the full recursion (8) and the Fokker-Planck approximation
to (12) to predict basin hopping times in stochastic learning.
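Written out, the second-order truncation of (12) takes the form sketched below. It follows from Taylor-expanding the right-hand side of (8) about $\omega$ and keeping terms through $\mu^2$; the index names $j$ and $k$ are introduced here only for illustration.

```latex
G(n; \omega) - G(n-1; \omega) \approx
  \mu \sum_{j=1}^{m} \langle H_j \rangle_x \,
      \frac{\partial G(n-1; \omega)}{\partial \omega_j}
  + \frac{\mu^2}{2} \sum_{j,k=1}^{m} \langle H_j H_k \rangle_x \,
      \frac{\partial^2 G(n-1; \omega)}{\partial \omega_j \, \partial \omega_k}
```

Unlike the forward equation, the coefficients here sit outside the derivatives, and the equation is solved subject to the conditions (9) and (10).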
3 Backpropagation and Competitive Nets 
We apply the above formalism to study the time evolution of the probability density 
for simple backpropagation and competitive learning problems. We give graphical 
examples of the time evolution of the weight space density, and calculate times for 
passage from local to global optima. 
3.1 Densities for the XOR Problem 
Feed-forward networks trained to solve the XOR problem provide an example of 
supervised learning with well-characterized local optima (Lisboa and Perantonis,
1991). We use a 2-input, 2-hidden, 1-output network (9 weights) trained by stochastic
gradient descent on the cross-entropy error function in Lisboa and Perantonis
(1991). For computational tractability, we reduce the state space dimension by 
constraining the search to one- or two-dimensional subspaces of the weight space. 
To provide global optima at finite weight values, the output targets are set to $\delta$ and
$1-\delta$, with $\delta \ll 1$.
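A minimal sketch of such a network and its stochastic update is given below. It assumes sigmoid units, the standard binary cross-entropy as the error function, a particular packing of the 9 weights into one vector, and a value of `DELTA` chosen only for illustration; none of these details are specified in the text, and the restriction of the search to a low-dimensional subspace is not shown.

```python
import numpy as np

DELTA = 0.1  # assumed value; the text only requires delta << 1
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([DELTA, 1 - DELTA, 1 - DELTA, DELTA])   # XOR targets

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def unpack(w):
    """Split the 9 weights: hidden weights (2x2), hidden biases, output weights, output bias."""
    return w[0:4].reshape(2, 2), w[4:6], w[6:8], w[8]

def cost_and_grad(w, x, t):
    """Cross-entropy cost for one pattern and its gradient with respect to w."""
    W1, b1, W2, b2 = unpack(w)
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    cost = -(t * np.log(y) + (1 - t) * np.log(1 - y))
    dy = y - t                        # derivative w.r.t. the output pre-activation
    dh = dy * W2 * h * (1 - h)        # back-propagated to the hidden pre-activations
    grad = np.concatenate([np.outer(dh, x).ravel(), dh, dy * h, [dy]])
    return cost, grad

def sgd_step(w, mu, rng):
    """One stochastic update w <- w + mu * H[w, x], with H the negative gradient."""
    k = rng.integers(len(X))          # pick one of the four patterns at random
    _, g = cost_and_grad(w, X[k], T[k])
    return w - mu * g

rng = np.random.default_rng(0)
w = rng.normal(size=9)
w_next = sgd_step(w, mu=0.025, rng=rng)
```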
Figure 2a shows the cost function evaluated along a line in the weight space. This 
line, parameterized by v, is chosen to pass through a global optimum at v = 0,
and a local optimum at v = 1.0. In this one-dimensional slice, another local
optimum occurs at v = 1.24. Figure 2b shows the evolution of P(v, n) obtained by 
numerical integration of the Fokker-Planck equation. Figure 2c shows the evolution 
of P(v, n) estimated by simulation of 10,000 networks, each receiving a different 
random sequence of the four input/target patterns. Initially the density is peaked 
up about the local optimum at v = 1.24. At intermediate times, there is a spike of 
density at the local optimum at v = 1.0. This spike is narrow since the diffusion 
coefficient is small there. At late times the density collects at the global optimum. 
We note that for the learning rate used here, the local optimum at v = 1.24 is 
asymptotically stable under true gradient descent, and no escape would occur. 
[Figure 2 panels: a) cost function versus v; b) and c) P(v,t) versus v and time]
Figure 2: a) XOR cost function. b) Predicted density. c) Simulated density. 
Figure 3 shows a series of snapshots of the density superimposed on the cost function 
for a 2-D slice through the XOR weight space. The first frame shows the weight 
evolution under true gradient descent. The weights are initialized at the upper 
right-hand corner of the frame, travel down the gradient and settle into a local 
optimum. The remaining frames show the evolution of the density calculated by 
direct integration of the Kolmogorov equation (2). Here one sees an early spreading 
of the initial density and the ultimate concentration at the global optimum. 
3.2 Basin Hopping Times 
The above examples graphically illustrate the intuitive notion that the noise inherent
in stochastic learning can move the system out of local optima.⁴ In this section
we calculate the statistical distribution of times required to pass between basins. 
⁴ The reader should not infer from these examples that stochastic update necessarily
converges to global optima. It is straightforward to construct examples for which stochastic
learning converges to local optima with probability one.
[Figure 3 panels: True Gradient Descent; Time = 0; Time = 10; Time = 28; Time = 34; Time = 100]
Figure 3: Weight evolution for 2-D XOR. The density is superimposed on top of the cost 
function. The first frame shows density using true gradient descent for all 100 timesteps. 
The remaining frames show the density for selected timesteps using stochastic descent. 
3.2.1 Basin Hopping in Back-propagation 
For the search direction used in the example of Figure 2, we calculated the distribution
of times required for networks initialized at v = 1.2 to first pass within $\epsilon = 0.1$
of the global optimum at v = 0.0. For this example we numerically integrated 
the backward Fokker-Planck equation. We verified the theoretical predictions by 
obtaining first passage times from an ensemble of 10,000 networks initialized at 
v = 1.2. See Figure 4. For this example the agreement is good at the small learning
rate ($\mu = 0.025$) used, but degrades for larger $\mu$ as higher order terms in the
expansion (12) become significant.
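The ensemble measurement of first passage times can be sketched as follows. The example is a stand-in, not the XOR problem: it assumes a one-dimensional double-well cost with minima at w = +1 and w = -1, additive Gaussian noise in the update, and parameter values picked for illustration. The function name `simulate_first_passage` is hypothetical.

```python
import numpy as np

def simulate_first_passage(n_networks=10_000, n_max=2000, mu=0.1, sigma=2.0,
                           w0=1.0, w_target=-1.0, eps=0.1, seed=0):
    """Record, for each network, the first iteration at which |w - w_target| < eps.

    Networks that never arrive within n_max iterations keep the marker -1.
    """
    rng = np.random.default_rng(seed)
    w = np.full(n_networks, w0)
    times = np.full(n_networks, -1)
    active = np.ones(n_networks, dtype=bool)
    for n in range(1, n_max + 1):
        x = rng.normal(0.0, sigma, size=n_networks)
        # assumed update H[w, x] = -(w**3 - w) + x: gradient of a double well plus noise
        w = np.where(active, w + mu * (-(w**3 - w) + x), w)
        arrived = active & (np.abs(w - w_target) < eps)
        times[arrived] = n
        active &= ~arrived                 # arrived networks leave the ensemble
        if not active.any():
            break
    return times

times = simulate_first_passage()
reached = times[times > 0]
# normalized histogram: an empirical estimate of the first passage time distribution
hist, edges = np.histogram(reached, bins=50, density=True)
```

Removing networks from the ensemble once they arrive mirrors the absorbing boundary condition (10), so the histogram estimates the same quantity as (11).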
[Figure 4 axis: First Passage Time, 0 to 1000]
Figure 4: XOR problem. Simulated (histogram) and theoretical (solid line) distributions
of first passage times for the cost function of Figure 2a.
When the Fokker-Planck approximation fails, results obtained from the exact ex- 
pression (8) are in excellent agreement with experimental results. One such exam- 
ple is shown in Figure 5. Similar to Figure 2a, we have chosen a one-dimensional 
subspace of the XOR weight space (but in a different direction). Here, the Fokker- 
Planck solution is quite poor because the steepness of the cost function results in 
large contributions from higher order terms in (12). As one would expect, the exact 
solution obtained using (8) agrees well with the simulations.
[Figure 5 panels: a) cost function versus v; b) First Passage Time distributions, with Exact and Fokker-Planck curves]
Figure 5: Second 1-D XOR example. a) Cost function. b) Simulated (histogram) and 
theoretical (lines) distributions of first passage times. 
3.2.2 Basin Hopping in Competitive Learning 
As a final example, we consider competitive learning with two 2-D weight vectors 
symmetrically placed about the center of a rectangle. Inputs are uniformly dis- 
tributed in a rectangle of width 1.1 and height 1. This configuration has both 
global and local optima. 
Figure 6a shows a sample path with weights started near the local optimum (crosses) 
and switching to hover around the global optimum. The measured and predicted
(from numerical integration of (8)) distributions of times required to first pass within
a distance $\epsilon = 0.1$ of the global optimum are shown in Figure 6b.
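A minimal sketch of this competitive learning setup is shown below. It assumes the usual winner-take-all rule, in which only the weight vector nearest the input moves toward it, and places the rectangle with one corner at the origin; the learning rate, the initial weights, and the function names are illustrative assumptions rather than the settings used for Figure 6.

```python
import numpy as np

def competitive_step(w, x, mu):
    """Move the weight vector closest to input x a fraction mu of the way toward x."""
    winner = np.argmin(np.sum((w - x) ** 2, axis=1))
    w = w.copy()
    w[winner] += mu * (x - w[winner])
    return w

def run(w_init, n_steps, mu, width=1.1, height=1.0, seed=0):
    """Return the weight path, an array of shape (n_steps + 1, 2, 2)."""
    rng = np.random.default_rng(seed)
    w = np.array(w_init, dtype=float)
    path = [w.copy()]
    for _ in range(n_steps):
        x = rng.uniform([0.0, 0.0], [width, height])   # uniform input in the rectangle
        w = competitive_step(w, x, mu)
        path.append(w.copy())
    return np.array(path)

# Start near the configuration that splits the rectangle along its long side
path = run(w_init=[[0.3, 0.5], [0.8, 0.5]], n_steps=3000, mu=0.02)
final_w = path[-1]
```

With the rectangle split along its long side, the two weight vectors hover near the centroids of the two halves, (0.275, 0.5) and (0.825, 0.5). Starting instead from a split along the short side places the weights near the other optimum, from which noise-induced hopping can be observed.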
[Figure 6 panels: a) weight vectors w1 and w2 in the input plane; b) First Passage Time, 0 to 500]
Figure 6: Competitive Learning a) Data (small dots) and sample weight path (large dots). 
b) First passage times. 
4 Discussion 
The dynamics of the time evolution of the weight space probability density provides 
a direct handle on the performance of learning algorithms. This paper has focused 
on transient phenomena in stochastic learning with constant learning rate. The 
same theoretical framework can be used to analyze the asymptotic properties of 
stochastic search with decreasing learning rates, and to analyze equilibrium densi- 
ties. For a discussion of the latter, see the companion paper in this volume (Leen 
and Moody 1993). 
Acknowledgements 
This work was supported under grants N00014-90-J-1349 and N000-91-J-1482 from
the Office of Naval Research. 
References 
E. Oja (1982), A simplified neuron model as a principal component analyzer. J. 
Math. Biology, 15:267-273. 
Halbert White (1989), Learning in artificial neural networks: A statistical perspec- 
tive. Neural Computation, 1:425-464. 
H.J. Kushner (1987), Asymptotic global behavior for stochastic approximation and
diffusions with slowly decreasing noise effects: Global minimization via Monte Carlo.
SIAM J. Appl. Math., 47:169-185. 
Christian Darken and John Moody (1991), Note on learning rate schedules for 
stochastic optimization. In Advances in Neural Information Processing Systems 3, 
San Mateo, CA, Morgan Kaufmann. 
Todd K. Leen and Genevieve B. Orr (1992), Weight-space probability densities
and convergence times for stochastic learning. In International Joint Conference 
on Neural Networks, pages IV 158-164. IEEE. 
Todd K. Leen and John Moody (1993), Probability Densities in Stochastic Learning:
Dynamics and Equilibria. In Giles, C.L., Hanson, S.J., and Cowan, J.D. (eds.), 
Advances in Neural Information Processing Systems 5. San Mateo, CA: Morgan 
Kaufmann Publishers. 
H. Ritter and K. Schulten (1988), Convergence properties of Kohonen's topology
conserving maps: Fluctuations, stability and dimension selection, Biol. Cybern., 
60, 59-71. 
G. Radons, H.G. Schuster and D. Werner (1990), Fokker-Planck description of learn- 
ing in backpropagation networks, International Neural Network Conference, Paris, 
II 993-996, Kluwer Academic Publishers.
C.W. Gardiner (1990), Handbook of Stochastic Methods, 2nd Ed. Springer-Verlag,
Berlin. 
P. Lisboa and S. Perantonis (1991), Complete solution of the local minima in the 
XOR problem. Network: Computation in Neural Systems, 2:119. 
