I II I 
Learning Path Distributions using 
Nonequilibrium Diffusion Networks 
Paul Mineiro * 
pmineiro$cogsci. ucsd. edu 
Department of Cognitive Science 
University of California, San Diego 
La Jolla, CA 92093-0515 
Javier Movellan 
movellancogsci. ucsd. edu 
Department of Cognitive Science 
University of California, San Diego 
La Jolla, CA 92093-0515 
Ruth J. Williams 
williamsmah. ucsd. edu 
Department of Mathematics 
University of California, San Diego 
La Jolla, CA 92093-0112 
Abstract 
We propose diffusion networks, a type of recurrent neural network 
with probabilistic dynamics, as models for learning natural signals 
that are continuous in time and space. We give a formula for the 
gradient of the log-likelihood of a path with respect to the drift 
parameters for a diffusion network. This gradient can be used to 
optimize diffusion networks in the nonequilibrium regime for a wide 
variety of problems paralleling techniques which have succeeded in 
engineering fields such as system identification, state estimation 
and signal filtering. An aspect of this work which is of particu- 
lar interest to computational neuroscience and hardware design is 
that with a suitable choice of activation function, e.g., quasi-linear 
sigmoidal, the gradient formula is local in space and time. 
I Introduction 
Many natural signals, like pixel gray-levels, line orientations, object position, veloc- 
ity and shape parameters, are well described as continuous-time continuous-valued 
stochastic processes; however, the neural network literature has seldom explored the 
continuous stochastic case. Since the solutions to many decision theoretic problems 
of interest are naturally formulated using probability distributions, it is desirable 
to have a flexible framework for approximating probability distributions on contin- 
uous path spaces. Such a framework could prove as useful for problems involving 
continuous-time continuous-valued processes as conventional hidden Markov mod- 
els have proven for problems involving discrete-time sequences. 
Diffusion networks are similar to recurrent neural networks, but have probabilistic 
dynamics. Instead of a set of ordinary differential equations (ODEs), diffusion 
networks are described by a set of stochastic differential equations (SDEs). SDEs 
provide a rich language for expressing stochastic temporal dynamics and have proven 
*To whom correspondence should be addressed. 
Learning Path Distributions Using Nonequilibrium Diffusion Networks 599 
Figure 1: An example where the average of desirable paths yields an undesirable 
path, namely one that collides with the tree. 
useful in formulating continuous-time statistical inference problems, resulting in 
such solutions as the continuous Kalman filter and generalizations of it like the 
condensation algorithm (Isard & Blake, 1996). 
A formula is given here for the gradient of the log-likelihood of a path with re- 
spect to the drift parameters for a diffusion network. Using this gradient we can 
potentially optimize the model to approximate an entire probability distribution of 
continuous paths, not just average paths or equilibrium points. Figure i illustrates 
the importance of this kind of learning by showing a case in which learning average 
paths would have undesirable results, namely collision with a tree. Experience has 
shown that learning distributions of paths, not just averages, is crucial for dynamic 
perceptual tasks in realistic environments, e.g., visual contour tracking (Isard & 
Blake, 1996). Interestingly, with a suitable choice of activation function, e.g., quasi- 
linear sigmoidal, the gradient formula depends only upon local computations, i.e., 
no time unfolding or explicit backpropagation of error is needed. The fact that 
noise localizes the gradient is of potential interest for domains such as theoretical 
neuroscience, cognitive modeling and hardware design. 
2 Diffusion Networks 
Hereafter C refers to the space of continuous R-valued functions over the time 
interval [0, T], with T E R, T > 0 fixed throughout this discussion. 
A diffusion network with parameter A E R p is a random process defined via an It6 
SDE of the form 
dX(t) : I(t,X(t),)dt + adB(t), (1) 
x(o)v, 
where X is a C-valued process that represents the temporal dynamics of the n 
nodes in the network;/: [0, T] x R  x R p -+ R  is a deterministic function called 
the drift;   PP is the vector of drift parameters, e.g., synaptic weights, which 
are to be optimized; B is a Brownian motion process which provides the random 
driving term for the dynamics; v is the initial distribution of the solution; and 
a  R, a > 0, is a fixed constant called the dispersion coefficient, which determines 
the strength of the noise term. In this paper we do not address the problem of 
optimizing the dispersion or the initial distribution of X. For the existence and 
uniqueness in law of the solution to (1) /(.,., ) must satisfy some conditions. For 
example, it is sufficient that it is Borel measurable and satisfies a linear growth 
condition: ]/(t,x, ,)1 _< Kx(1 + ]x]) for some Kx > 0 and all t  [0, T], x e R; see 
600 P. Mineiro, J. Movellan and R. J. Williams 
(Karatzas & Shreve, 1991, page 303) for details. 
It is typically the case that the n-dimensional diffusion network will be used to 
model d-dimensional observations with n > d. In this case we divide X into hidden 
and observable  components, denoted H and O respectively, so that X = (H, O). 
Note that with a - 0 in equation (1), the model becomes equivalent to a continuous- 
time deterministic recurrent neural network. Diffusion networks can therefore be 
thought of as neural networks with "synaptic noise" represented by a Brownian mo- 
tion process. In addition, diffusion networks have Markovian dynamics, and hidden 
states if n > d; therefore, they are also continuous-time continuous-state hidden 
Markov models. As with conventional hidden Markov models, the probability den- 
sity of an observable state sequence plays an important role in the optimization of 
diffusion networks. However, because X is a continuous-time process, care must 
taken in defining a probability density. 
2.1 Density of a continuous observable path 
Let (X x, B x) defined on some filtered probability space (),/, {/t },/5) be a (weak) 
solution of (1) with fixed parameter ,k. Here X x = (H x, O h) represents the states 
of the network and is adapted to the filtration {Pt }, B x is an n-dimensional {Pt }- 
martingale Brownian motion and the filtration {Pt} satisfies the usual conditions 
(Karatzas and Shreve, 1991, page 300). Let Qx be the unique probability law 
generated by any weak solution of (1) with fixed parameter ,k 
QX(A) =/5(X x e A) for all A C :r, (2) 
where f is the Borel sigma algebra generated by the open sets of Cn. Setting 
fi = C, fih= C-d, and rio = Cd with associated Borel a-algebras f, h and o, 
respectively, we have fi = fin x rio,  = n  fo, and we can define the marginal 
laws for the hidden and observable components of the network by 
Qh(Ah) = (Ah x = (H x e Ah) for all Ahe h, (3) 
Q(Ao) = QX(Cn-d x Ao)  P(O x G Ao) for all Ao G o. (4) 
For our purposes the appropriate generalization of the notion of a probability density 
on R TM to the general probability spaces considered here is the Radon-Nikodym 
derivative with respect to a reference meure that dominates all members of the 
hmily {Qx}xeg (Poor, 1994, p.264ff). A suitable reference measure P is the law of 
the solution to (1) with zero drift ( = 0). The menures induced by this reference 
measure over fn and fo are denoted by Pn and Po, respectively. Since in the 
reference model there are no couplings between any of the nodes in the network, 
the hidden and observable processes are independent and it follows that 
P(A x Ao) = Pn(An)Po(Ao) for all An e n,no e o. (5) 
The conditions on  mentioned above are sufficient to ensure a Radon-Nikodym 
derivative for each Qx with respect to the reference meure. Using Girsanov's 
Theorem (Karatzas & Shreve, 1991, p.190ff) its form can be shown to be 
Z()= dQ { 1   
 () =exp  (t,(t),A) .(t) 
(6) 
2.2 }(t.(t).A)l 2dr .   . 
x In our treatment we me no distinction between observables which are inputs and 
those which e outputs. Inputs can be conceptuized  observables under "environ- 
mental control," i.e., whose &ifts e independent of both A and the hidden and output 
processes. 
Learning Path Distribution Using Nonequilibrium Diffusion Networks 
601 
where the first integral is an It6 stochastic integral. The random variable Z x can 
be interpreted as a likelihood or probability density with respect to the reference 
model 2. However equation (6) defines the density of Rn~valued paths of the entire 
network, whereas our real concern is the density of Rd-valued observable paths. 
Denoting 0  f as 0 -- (0h,0o) where 0h  n and o  o, note that 
((A) -- Jr9 1A(0;o) Q'X(d(ch,Co)) (7) 
h Xo 
o h 
and therefore the Radon-Nikodym derivative of Qo x with respect to Po, the density 
of interest, is given by 
dQx = e 
ZX(:) = tPo (9) 
2.2 Gradient of the density of an observable path 
The gradient of Zo x with respect to A is an important quantity for iterative opti- 
mization of cost functionals corresponding to a variety of problems of interest, e.g., 
maximum likelihood estimation of diffusion parameters for continuous path density 
estimation. Formal differentiation 3 of (9) yields 
V log ZoX(O:o) = EP[ZXlo(.,O:o)Vx log ZX(.,O:o)], (10) 
where 
Z1o( ) zx() 
ZoX(Oo) , (11) 
Vx logZX(c)- a5 J(t,(t),A) .dI(,t), (12) 
I(,t) (t)-(0)- (s,(s),X)ds. (14) 
Equation (10) states that the gradient of the density of an observable path can be 
found by clmping the observable nodes to that path d performing an average 
of ZloV  log Z x with respect to Pn, i.e., average with respect to the hidden paths 
distributed  a scaled Brownian motion. This makes intuitive sense: the output 
gradient of the log density is a weighted average of the total gradient of the log 
density, where each hidden path contributes according to its likelihood Zi  given 
the output. 
In practice to evaluate the gradient, equation (10) must be approximated. Here we 
use Monte Carlo techniques, the efficiency of which can be improved by sampling 
according to a density which reduces the variance of the integrand. Such a density 
To ee interpretation of (6) consider the simpler ce of a one-mensional Gaussian 
rdom viable with mean  and vice a. The ratio of the density of such a model 
with respect to an equivalent model with zero mean is exp(x -  2 
2 )' Equation (6) 
can be viewed  a generMization of this same idea to BrownJan motion. 
aSee (Levanony et al., 1990) for sucient conditions for the differentiation in equation 
(10) to be valid. 
602 P. Mineiro, J. Movellan and R. J. Wlliams 
where 
is available for models with hidden dynamics which do not explicitly depend upon 
the observables, i.e., the observable nodes do not send feedback connections to the 
hidden states. Models which obey this constraint are henceforth denoted facto:al. 
Denoting/h and/o as the hidden and observable components, respectively, of the 
drift vector, and Bh and Bo as the hidden and observable components, respectively, 
of the Brownian motion, for a factoffal network we have 
dH(t) - In(t,H(t),A)dt 4-adB(t), (15) 
dO(t) -- fro(t, H(t), O(t), A)dt 4- adBo(t). (16) 
The drift for the hidden variables does not depend on the observables, and Gir- 
sanov's theorem gives us an explicit formula for the density of the hidden process. 
dP () -exp -/ ttn(t,(t),A) .d(t) 
I T / (17) 
2cr2 Jo ]tt(t'(t)'A)]2dt ' 
Equations (9) and (10) can then be written in the form 
ZX(): EO[Zolh(',o)], (18) 
VlogZoX(o) = [zlh("5 ) z 
[ ZoX(coo) Vxlog (',o) , (19) 
ZoXln(co)_ Zx() {1 j0 T 
ZX(n) - exp - o(t,(t),A).o(t) 
1 r } (20) 
2a 2  ](t'(t)'A)12dt . 
Note the expectations are now performed using the meure Q. We can eily 
generate staples according to Q by numerically integrating equation (15), and in 
practice this leads to more efficient Monte Carlo appromations of the likelihood 
and adient. 
3 Example: Noisy Sinusoidal Detection 
This problem is a simple example of using diffusion networks for signal detection. 
The task was to detect a sinusoid in the presence of additive Gaussian noise. Stimuli 
were generated according to the following process 
1 sin(4rt) + B(t,o) (21) 
Y(t,v) = 1A(CO); , 
where t  [0, 1/2]. Here Y is assumed anchored in a probability space (, , p) 
large enough to accommodate the event A which indicates a signal or noise trial. 
Note that under P, B is a Brownian motion on Ca independent of A. 
A model was optimized using 100 samples of equation (21) given co  A, i.e., 100 
stimuli containing a signal. The model had four hidden units and one observable 
unit (n - 5, d = 1). The drift of the model was given by 
/(t, x, A) = t + W . g(x), (22) 
1 
a(x) - l +e-xJ ' j e {1,2,3,4,5}, 
Learning Path Distribution, Using Nonequilibrium Diffusion Networks 603 
1 
0.8 
0.6 
0.4 
0.2 
0 
0 
ROC Curve, Sinewave Detection Problem 
0.2 0.4 0.6 0.8 
hit rate 
Figure 2: Receiver operating characteristic (ROC) curve for a diffusion network 
performing a signal detection task involving noisy sinusoids. Dotted line: Detection 
performance estimated numerically using 10000 novel stimuli. Solid line: Best fit 
curve corresponding to d  - 1.89. This value of d  corresponds to performance 
within 1.5% of the Bayesian limit. 
where   R s and W is a 5x5 real-valued connection matrix. In this case  -- 
{{0i}, {Wij}, i, j - 1,... , 5}. The connections from output to hidden units were set 
to zero, allowing use of the more efficient techniques for factoffal networks described 
above. The initial distribution for the model was a &function at (1,-1, 1,-1,0). 
The model was numerically simulated with At = 0.01, and 100 hidden samples were 
used to approximate the likelihood and gradient of the log-likelihood, according to 
equations (18) and (19). The conjugate gradient algorithm was used for training, 
with the log-likelihood of the data as the cost function. 
Once training was complete, the parameter estimation was tested using 10000 novel 
stimuli and the following procedure. Given a new stimuli y we used the model to 
estimate the likelihood 7o(Y I A)  Zo(Y), where  is the parameter vector at the 
end of training. The decision rule employed was 
D(Y) = signal if 2o(Y I A) > 5, (23) 
[noise otherwise, 
where b  R is a bias term representing assumptions about the apriori probability 
of a signal trial. By sweeping across different values of b the receiver-operator 
characteristic (ROC) curve is generated. This curve shows how the probability of 
a hit, P(D = signal I A), and the probability of a false alarm, P(D = signal I Ac), 
are related. From this curve the parameter d , a measure of sensitivity independent 
of apriori assumptions, can be estimated. Figure 2 shows the ROC curve as found 
by numerical simulation, and the curve obtained by the best fit value d  - 1.89. 
This value of d  corresponds to a 82.7% correct detection rate for equal prior signal 
probabilities. 
The theoretically ideal observer can be derived for this problem, since the profile of 
the unperturbed signal is known exactly (Poor, 1994, p. 278ff). For this problem 
the optimal observer achieves d x = 2, which implies at equal probabilities for 
signal and noise trials, the Bayesian limit corresponds to a 84.1% correct detection 
rate. The detection system based upon the diffusion network is therefore operating 
close to the Bayesian limit, but was designed using only implicit information, i.e., 
100 training examples, about the structure of the signal to be detected, in contrast 
to the explicit information required to design the optimal Bayesian classifier. 

4 Discussion 
As a hybrid of neural networks and hidden Markov models, diffusion networks 
combine the power of continuous-time continuous-state representation and the ro- 
bustness of a probabilistic approach. In this paper we present a framework for 
identifying the drift parameters of a diffusion network, which builds upon previous 
work on diffusion, stochastic filtering, and artificial neural networks (Feigin, 1976; 
Campillo & Le Gland, 1989; Levanony et al., 1990; Apolloni & de Falco, 1991; 
Movellan, 1994). We have not treated the problems of identifying the initial dis- 
tribution or the dispersion of the network, both of which are probably important 
for achieving good performance on realistic tasks. Identification of the dispersion is 
especially intriguing since it is (with appropriate scaling of the drift) equivalent to 
identifying the timescale of the model. An obvious application would be in speech 
recognition, where normalization of timescale is needed to correct for variable ut- 
terance speed, even for single-speaker systems (Rabiner & Juang, 1993, p.200). 
Future work will focus on developing techniques for identifying the dispersion and 
initial distribution, improving numerical procedures used for computing the likeli- 
hood, and applying diffusions to continuous-time continuous-state perceptual prob- 
lems such as visual contour tracking (Isard & Blake, 1996). 
Acknowledgements 
The authors thank Anthony Gamst for helpful discussion. Paul Mineiro is supported 
by an NSF Minority Graduate Fellowship. Research of R. J. Williams is supported 
in part by NSF Grants GER 9023335 and DMS 9703891. 
References 
Apolloni, B. & de Falco, D. (1991). Learning by asymmetric parallel Boltzmann 
machines. Neural Computation, $(3), 402-408. 
Campillo, F. & Le Gland, F. (1989). MLE For Partially Observed Diffusions - 
Direct Maximization vs. the EM Algorithm. Stochastic Processes and Their 
Applications, $$(2), 245-274. 
Feigin, P. (1976). Maximum Likelihood Estimation for Continuous-Time Stochastic 
Processes. Advances in Applied Probability, 8, 712-736. 
Isard, M. & Blake, A. (1996). Contour Tracking by Stochastic Propagation of 
Conditional Density. Proc. European Conf. Computer Vision, 343-356. 
Karatzas, I. & Shreve, S. (1991). Brownian Motion and Stochastic Calculus. 
Springer. 
Levanony, D., Shwartz, A., & Zeitouni, O. (1990). Continuous-Time Recursive Es- 
timation. In E. Arikan (Ed.), Communication, Control, and Signal Processing. 
Elsevier Science Publishers. 
Movellan, J. (1994). A Local Algorithm to Learn Trajectories with Stochastic Neural 
Networks. In J. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in Neural 
Information Processing Systems, volume 6, pages 83-87. Morgan Kaufmann. 
Poor, H. V. (1994). An Introduction to Signal Detection and Estimation. Springer- 
Verlag. 
Rabiner, L. R. & Juang, H. (1993). Fundamentals of Speech Recognition. Prentice 
Hall. 
