Adaptive On-line Learning in Changing 
Environments 
Noboru Murata, Klaus-Robert Milllet, Andreas Ziehe 
GMD-First, Rudower Chaussee 5, 12489 Berlin, Germany 
{mura, klaus, ziehe}first. gmd. de 
Shun-ichi Amari 
Laboratory for Information Representation, RIKEN 
Hirosawa 2-1, Wako-shi, Saitama 351-01, Japan 
amarizoo. riken. go. jp 
Abstract 
An adaptive on-line algorithm extending the learning of learning 
idea is proposed and theoretically motivated. Relying only on gra- 
dient flow information it can be applied to learning continuous 
functions or distributions, even when no explicit loss function is gi- 
ven and the Hessian is not available. Its efficiency is demonstrated 
for a non-stationary blind separation task of acoustic signals. 
I Introduction 
Neural networks provide powerful tools to capture the structure in data by learning. 
Often the batch learning paradigm is assumed, where the learner is given all trai- 
ning examples simultaneously and allowed to use them as often as desired. In large 
practical applications batch learning is often experienced to be rather infeasible and 
instead on-line learning is employed. 
In the on-line learning scenario only one example is given at a time and then discar- 
ded after learning. So it is less memory consuming and at the same time it fits well 
into more natural learning, where the learner receives new information and should 
adapt to it, without having a large memory for storing old data. On-line learning 
has been analyzed extensively within the framework of statistics (Robbins & Monro 
[1951], Amari [1967] and others) and statistical mechanics (see eg. Saad z Solla 
[1995]). It was shown that on-line learning is asymptotically as effective as batch 
600 N. Murata, K. Miiller, A. Ziehe and S. Amari 
learning (cf. Robbins & Monro [1951]). However this only holds, if the appropriate 
learning rate r/is chosen. A too large r/spoils the convergence of learning. In earlier 
work on dichotomies Sompolinsky et al. [1995] showed the effect on the rate of 
convergence of the generalization error of a constant, annealed and adaptive lear- 
ning rate. In particular, the annealed learning rate provides an optimal convergence 
rate, however it cannot follow changes in the environment. Since on-line learning 
aims to follow the change of the rule which generated the data, Sompolinsky et al. 
[1995], Darken & Moody [1991] and Sutton [1992] proposed adaptive learning rates, 
which learn how to learn. Recently Cichoki et al. [1996] proposed an adaptive on- 
line learning algorithm for blind separation based on low pass filtering to stabilize 
learning. 
We will extend the reasoning of Sompolinsky et al. in several points: (1) we give 
an adaptive learning rule for learning continuous functions (section 3) and (2) we 
consider the case, where no explicit loss function is given and the Hessian cannot be 
accessed (section 4). This will help us to apply our idea to the problem of on-line 
blind separation in a changing environment (section 5). 
2 On-line Learning 
Let us consider an infinite sequence of independent examples (1, yl), (2, Y2), .... 
The purpose of learning is to obtain a network with parameter  which can simulate 
the rule inherent to this data. To this end, the neural network modifies its parameter 
0t at time t into bt+l by using only the next example (at+l, yt+l) given by the 
rule. We introduce a loss function l(a, y; w) to evaluate the performance of the 
network with parameter w. Let R(w) -- (l(, y; w)) be the expected loss or the 
generalization error of the network having parameter w, where ( ) denotes the 
average over the distribution of examples (a,y). The parameter w* of the best 
machine is given by w*: argminR(w). We use the following stochastic gradient 
descent algorithm (see Amari [1967] and Rumelhart et al. [1986]): 
t+l -- )t - f]tC(Jt) o-l(t+l, Yt+l; lt ), 
(1) 
where r h is the learning rate which may depend on t and C(/ot) is a positive-definite 
matrix which may depend on/or. The matrix C plays the role of the Riemannian 
metric tensor of the underlying parameter space {w}. 
When r h is fixed to be equal to a small constant r/, E[/ot] converges to w* and 
Var[/ot] converges to a non-zero matrix which is order O(r/). It means that t 
fluctuates around w* (see Amari [1967], Heskes & Kappen [1991]). If r h = c/t 
(annealed learning rate) /or converges to w* locally (Sompolinsky et al. [1995]). 
However when the rule changes over time, an annealed learning rate cannot follow 
the changes fast enough since r/t = c/t is too small. 
3 Adaptive Learning Rate 
The idea of an adaptively changing r h was called learning of the learning rule (Som- 
polinsky et al. [1995]). In this section we investigate an extension of this idea to 
differentiable loss functions. Following their algorithm, we consider 
t+l -- ot-r]tI[-l(ot)l(t+l,Yt+l;Ot), (2) 
Adaptive On-line Learning in Changing Environments 601 
(3) 
where rt and fi are constants, K(ot) is a Hessian matrix of the expected loss func- 
tion 02R(ot)/OwOw and  is an estimator of R(w*). Intuitively speaking, the 
coefficient r/in Eq.(3) is controlled by the remaining error. When the error is large, 
r/takes a relatively large value. When the error is small, it means that the estimated 
parameter is close to the optimal parameter; r/approaches to 0 automatically. Ho- 
wever, for the above algorithm all quantities (K, l, J) have to be accessible which 
they are certainly not in general. Furthermore /(at+l, Yt+l; ot) -- ] could take 
negative values. Nevertheless in order to still get an intuition of the learning beha- 
viour, we use the continuous versions of (2) and (3), averaged with respect to the 
current input-output pair (at, yt) and we omit correlations and variances between 
the quantities (rh, wt, l) for the sake of simplicity 
Wt -- --]tI(wt) -1 l(,y;wt) and trlt : rtrh /3(l(:e,y;wt) - k) - rlt . 
Noting that (Ol(a:, y; w*)/Ow) = O, we have the asymptotic evaluations 
(l(a:, y; wt) - k) _ 
1 
t(o*) - ft + (ot - o*) r * (t - *), 
with K* = 02R(w*)/OwOw. Assuming R(w*)-k is small and K(wt)  K* yields 
wt=_rlt(wt_w*), rlt:arlt (wt-w*):rK*(wt-w*)-rh . (4) 
Introducing the squared error et = (wt- w*)'K* (wt- w*), gives rise to 
dt: --2tet, Ot -- cttet -- ct. 
(s) 
The behavior of the above equation system is interesting: The origin (0, 0) is its 
attractor and the basin of attraction has a fractal boundary. Starting from an 
adequate initial value, it has the solution of the form 
1 1 1 (rt >2) and r/t =  . (6) 
It is important to note that this 1/t-convergence rate of the generalization error et 
is the optimal order of any estimator bt converging to w*. So we find that Eq.(4) 
gives us an on-line learning algorithm which converges with a fast rate. This holds 
also if the target rule is slowly fluctuating or suddenly changing. The technique to 
prove convergence was to use the scalar distance in weight space et. Note also that 
Eq.(6) holds only within an appropriate parameter range; for small r/and wt - w* 
correlations and variances between (tit, wt, l) can no longer be neglected. 
4 Modification 
From the practical point of view (1) the Hessian K* of the expected loss or (2) 
the minimum value of the expected loss/ are in general not known or (3) in some 
602 N. Murata, K. Miiller, A. Ziehe and S. Amari 
applications we cannot access the explicit loss function (e.g. blind separation). Let 
us therefore consider a generalized learning algorithm: 
it+l -- tit -- rItf(at+l, Yt+l; ]ot), 
(7) 
where J' is a flow which determines the modification when an example (t+l, Yt+l) 
is given. Here we do not assume the existence of a loss function and we only assume 
that the averaged flow vanishes at the optimal parameter, i.e. (J'(a, y; w*)) = O. 
With a loss function, the flow corresponds to the gradient of the loss. We consider 
the averaged continuous equation and expand it around the optimal parameter: 
(8) 
where K* = (O.l'(ve,y;w*)/Ow). Suppose that we have an eigenvector of the Hes- 
sian K* vector v satisfying v:rK * = ,Xv :r and let us define 
(9) 
then the dynamics of  can be approximately represented as 
d 
,, -- -Mh',. (10) 
By using , we define a discrete and continuous modification of the rule for /: 
d 
and r/t = or/t ,,). (11) 
Intuitively  corresponds to a 1-dimensional pseudo distance, where the average 
flow j is projected down to a single direction v. The idea is to choose a clever 
direction such that it is sufficient to observe all dynamics of the flow only along this 
projection. In this sense the scalar  is the simplest obtainable value to observe 
learning. Noting that  is always positive or negative depending on its initial value 
and / can be positive, these two equations (10) and (11) are equivalent to the 
equation system (5). Therefore their asymptotic solutions are 
1 1 (12) 
and r/t =  t 
'-' 
Again similar to the last section we have shown that the algorithm converges pro- 
perly, however this time without using loss or Hessian. In this algorithm, an import- 
ant problem is how to get a good projection v. Here we assume the following facts 
and approximate the previous algorithm: (1) the minimum eigenvalue of matrix 
K* is sufficiently smaller than the second minimum eigenvalue and (2) therefore 
after a large number of iterations, the parameter vector t will approach from the 
direction of the minimum eigenvector of K*. Since under these conditions the evo- 
lution of the estimated parameter can be thought of as a one-dimensional process, 
any vector can be used as v except for the vectors which are orthogonal to the 
minimum eigenvector. The most efficient vector will be the minimum eigenvector 
itself which can be approximated (for a large number of iterations) by 
Adaptive On-line Learning in Changing Environments 603 
where II II denotes the L 2 norm. Hence we can adopt  = II(f>ll. Substitut.ing the 
instantaneous average of the flow by a leaky average, we arrive at 
t+l = )t - ]tf(t+l,Yt+l; ))t), (13) 
rt+l -- (1--5)rt+Sf(t+l,Yt+l;))t), (0< 5< 1) (14) 
t+l : ]t-4- oah (/l[rt+l[]- /t), (15) 
where 5 controls the leakiness of the average and r is used as auxiliary variable 
to calculate the leaky average of the flow f. This set of rules is easy to compute. 
However /will now approach a small value because of fluctuations in the estimation 
of r which depend on the choice of a,/, 7. In practice, to assure the stability of the 
algorithm, the learning rate in Eq.(13) should be limited to a maximum value 
and a cut-off/]min should be imposed. 
5 
Numerical Experiment: an application to blind 
separation 
In the following we will describe the blind separation experiment that we conducted 
(see eg. Bell & Sejnowski [1995], Jutten & Herault [1991], Molgedey  Schuster 
[1994] for more details on blind separation). As an example we use the two sun 
audio files (sampling rate 8kHz)' "rooster" (st 1) and "space music" (s) (see Fig. 
1). Both sources are mixed on the computer via 5 = (]L -t- A) where Os < 
t < 1.25s and 3.75s _< t _< 5s and  - (]L-t- B) for 1.25s _< t < 3.75s, using 
A - (0 0.9; 0.7 0) and B - (0 0.8; 0.6 0) as mixing matrices. So the rule switches 
twice in the given data. The goal is to obtain the sources  by estimating _ and 
2, given only the measured mixed signals . A change of the mixing is a scenario 
often encountered in real blind separation tasks, e.g. a speaker turns his head or 
moves during his utterances. Our on-line algorithm is especially suited to this non- 
stationary separation task, since adaptation is not limited by the above-discussed 
generic drawbacks of a constant learning rate as in Bell & Sejnowski [1995], Jutten 
 Herault [1991], Molgedey  Schuster [1994]. Let  be the unmixed signals 
(16) 
where T is the estimated mixing matrix. Along the lines of Molgedey & Schuster 
[1994] we use as modification rule for Tt 
(i, j = 1, 2, i  j), where we substitute instantaneous averages with leaky averages 
(g tt{ }leaky --(1- (){g-1 tt{-1)leaky -'}-gtt{. 
Note that the necessary ingredients for the flow f in Eq.(13)-(14) are in this case 
simply the correlations at equal or different times; h is computed according to 
Eq.(15). In Fig.2 we observe the results of the simulation (for parameter details, 
see figure caption). After a short time (t-0.4s) of large /and strong fluctuations in 
/the mixing matrix is estimated correctly. Until t-1.25s the learning rate adapts 
cooling down approximately similar to 1/t (cf. Fig. 2c), which was predicted in 
Eq.(12) in the previous section, i.e. it finds the optimal rate for annealing. At the 
604 N. Murata, K. Miiller, A. Ziehe and S. Amari 
point of the switch where simple annealed learning would have failed to adapt to 
the sudden change, our adaptive rule increases r/ drastically and is able to follow 
the switch within another 0.4s rsp. 0.1s. Then again, the learning rate is cooled 
down automatically as intended. Comparing the mixed, original and unmixed si- 
gnals in Fig. 1 confirms the accurate and fast estimate that we already observed in 
the mixing matrix elements. The same also holds for an acoustic cross check: for 
a small part of a second both signals are audible, then as time proceeds only one 
signal, and again after the switches both signals are audible but only for a very 
short moment. The fading away of the signal is so fast to the listener that it seems 
that one signal is simply "switched off" by the separation algorithm. 
Altogether we found an excellent adaptation behavior of the proposed on-line algo- 
rithm, which was also reproduced in other simulation examples omitted here. 
6 Conclusion 
We gave a theoretically motivated adaptive on-line algorithm extending the work of 
Sompolinsky et al. [1995]. Our algorithm applies to general feed-forward networks 
and can be used to accelerate learning by the learning about learning strategy in the 
difficult setting where (a) continuous functions or distributions are to be learned, 
(b) the Hessian K is not available and (c) no explicit loss function is given. Note, 
that if an explicit loss function or K is given, this additional information can be 
incorporated easily, e.g. we can make use of the real gradient otherwise we only 
rely on the flow. Non-stationary blind separation is a typical implementation of the 
setting (a)-(c) and we use it as an application of the adaptive on-line algorithm in 
a changing environment. Note that we can apply the learning rate adaptation to 
most existing blind separation algorithms and thus make them feasible for a non- 
stationary environment. However, we would like to emphasize that blind separation 
is just an example for the general adaptive on-line strategy proposed and applica- 
tions of our algorithm are by no means limited to this scenario. Future work will 
also consider applications where the rules change more gradually (e.g. drift). 
References 
Amari, S. (1967) IEEE Trans. EC 16(3):299-307. 
Bell, T., Sejnowski, T. (1995) Neural Comp. 7:1129-1159. 
Cichocki A., Amari S., Adachi M., Kasprzak W. (1996) Self-Adaptive Neural Net- 
works for Blind Separation of Sources, ISCAS'96 (IEEE), Vol. 2, 157-160. 
Darken, C., Moody, J. (1991) in NIPS 3, Morgan Kaufmann, Palo Alto. 
Heskes, T.M., Kappen, B. (1991) Phys. Rev. A 440:2718-2726. 
Jutten, C., Herault, J. (1991) Signal Processing 24:1-10. 
Molgedey, L., Schuster, H.G. (1994) Phys. Rev. Left. 72(23):3634-3637. 
Robbins, H., Monro, S. (1951) Ann. Math. Statist., 22:400-407. 
Rumelhart, D., McClelland, J.L and the PDP Research Group (eds.) (1986), PDP 
Vol. 1, pp. 318-362, Cambridge, MA: MIT Press. 
Saad D., and Solla S. (1995), Workshop at NIPS'95, see World-Wide-Web page: 
http://neural-server. aston. ac.uk/nips95/workshop.html and references therein. 
Sompolinsky, H., Barkai, N., Seung, H.S. (1995) in Neural Networks: The Statistical 
Mechanics Perspective, pp. 105-130. Singapore: World Scientific. 
Sutton, R.S. (1992) in Proc. 10th nat. conf. on AI, 171-176, MIT Press. 
Adaptive On-line Learning in Changing Environments 605 
 0 
o 
0 0.5 
'o 0 
.x_ 
E 
-1 
0 0.5 
._m 
u 
'o 0 
0 0.5 
.o -' 
 -1  ' 
0 
I I I I I I I 
I 1.5 2 2.5 3 3.5 4 4.5 5 
I 1.5 2 2.5 3 3.5 4 4.5 5 
I I I I I I 
I 1.5 2 2.5 3 3.5 4 4.5 
I I I I I I I I 
0.5 I 1.5 3 3.5 
5 
2 2.5 4 4.5 5 
time sec 
Figure 1: st  "space music", the mixture signal It 2, the unmixed signal ut 2 and the 
separation error ut 2 - st 2 as functions of time in seconds. 
0.8 
0.4 
i 
0 0.5 
15 , 
0 
0 0.5 
10 
5 
I 1.5 2 2.5 3 3.5 4 4.5 
I -I I  J 
1.5 2 2.5 3 3.5 4 4.5 
0.5 I 1.5 2 2.5 3 3.5 4 4.5 5 
time sec 
Figure 2: Estimated mixing matrix Tt, evolution of the learning rate r/t and inverse 
learning rate 1/rh over time. Rule switches (t=1.25s, 3.75s) are clearly observed as 
drastic changes in r h. Asymptotic 1It scaling in r/amounts to a straight line in l/tit. 
Simulation parameters are a = 0.002,/9 = 20/maxll(r)ll, e - - 0.01. maxll(r)l I 
denotes the maximal value of the past observations. 
