The Learning Dynamics of 
a Universal Approximator 
Ansgar H. L. West 1'2 David Saad 1 Ian T. Nabney 1 
A. H. L. West aston. ac. uk D. Saadast on. ac. uk I.T. Nabneyaston. ac. uk 
1Neural Computing Research Group, University of Aston 
Birmingham B4 7ET, U.K. 
http://www. ncrg. aston. ac. uk/ 
2Department of Physics, University of Edinburgh 
Edinburgh EH9 3JZ, U.K. 
Abstract 
The learning properties of a universal approximator, a normalized 
committee machine with adjustable biases, are studied for on-line 
back-propagation learning. Within a statistical mechanics frame- 
work, numerical studies show that this model has features which 
do not exist in previously studied two-layer network models with- 
out adjustable biases, e.g., attractive suboptimal symmetric phases 
even for realizable cases and noiseless data. 
1 INTRODUCTION 
Recently there has been much interest in the theoretical breakthrough in the un- 
derstanding of the on-line learning dynamics of multi-layer feedforward perceptrons 
(MLPs) using a statistical mechanics framework. In the seminal paper (Saad & 
Solla, 1995), a two-layer network with an arbitrary number of hidden units was 
studied, allowing insight into the learning behaviour of neural network models whose 
complexity is of the same order as those used in real world applications. 
The model studied, a soft committee machine (Biehl & Schwarze, 1995), consists of 
a single hidden layer with adjustable input-hidden, but fixed hidden-output weights. 
The average learning dynamics of these networks are studied in the thermodynamic 
limit of infinite input dimensions in a student-teacher scenario, where a student 
network is presented serially with training examples (, ) labelled by a teacher 
network of the same architecture but possibly different number of hidden units. 
The student updates its parameters on-line, i.e., after the presentation of each 
example, along the gradient of the squared error on that example, an algorithm 
usually referred to as back-propagation. 
Although the above model is already quite similar to real world networks, the ap- 
proach suffers from several drawbacks. First, the analysis of the mean learning 
dynamics employs the thermodynamic limit of infinite input dimension -- a prob- 
lem which has been addressed in (Barber et al., 1996), where finite size effects have 
been studied and it was shown that the thermodynamic limit is relevant in most 
The Learning Dynamcis of a Universal Approxirnator 289 
cases. Second, the hidden-output weights are kept fixed, a constraint which has 
been removed in (Riegler & Biehl, 1995), where it was shown that the learning 
dynamics are usually dominated by the input-hidden weights. Third, the biases of 
the hidden units were fixed to zero, a constraint which is actually more severe than 
fixing the hidden-output weights. We show in Appendix A that soft committee 
machines are universal approximators provided one allows for adjustable biases in 
the hidden layer. 
In this paper, we therefore study the model of a normalized soft committee machine 
with variable biases following the framework set out in (Saad & Solla, 1995). We 
present numerical studies of a variety of learning scenarios which lead to remarkable 
effects not present for the model with fixed biases. 
2 DERIVATION OF THE DYNAMICAL EQUATIONS 
The student network we consider is a normalized soft committee machine of K 
hidden units with adjustable biases. Each hidden unit i consists of a bias Oi and a 
weight vector W/which is connected to the N-dimensional inputs . All hidden units 
are connected to a linear output unit with arbitrary but fixed gain ' by couplings 
of fixed strength. The activation of any unit is normalized by the inverse square 
root of the number of weight connections into the unit, which allows all weights to 
be of O(1) magnitude, independent of the input dimension or the number of hidden 
units. The implemented mapping is therefore fw()- (7/v)E/K=I g(ui--Oi), 
where ui -- W/./V and g(.) is a sigmoidal transfer function. The teacher net- 
work to be learned is of the same architecture except for a possible difference in 
the number of hidden units M and is defined by the weight vectors Bn and bi- 
ases Pn (n = 1,...,M). Training examples are of the form (t,,t,), where the 
input vectors ' are drawn form the normal distribution and the outputs are 
' = 
Yn=l g(v - pn), where v = B- /x/-. 
The weights and biases are updated in response to the presentation of an example 
(, ), along the gradient of the squared error measure e = [ - fw()] 2 
- r/0  
W//+1 - W// = fiwS  and Oil+l -- Oi = (i (1) 
with 5' =-[ - fw(e)la'(u' -od. The two learning rates are r/w for the weights 
and r/0 for the biases. In order to analyse the mean learning dynamics resulting 
from the above update equations, we follow the statistical mechanics framework in 
(Saad & Solla, 1995). Here we will only outline the main ideas and concentrate on 
the results of the calculation. 
As we are interested in the typical behaviour of our training algorithm we average 
over all possible instances of the examples . We rewrite the update equations (1) 
in W/ as equations in the order parameters describing the overlaps between pairs 
of student nodes Qij- Wi.Wj/N, student and teacher nodes Rin---Wi.Bn/N, 
and teacher nodes Tnm = Bn'Bm/N. The generalization error eg, measuring the 
typical performance, can be expressed solely in these variables and the biases 0 i and 
Pn. The order parameters Qij, Rin and the biases Oi are the dynamical variables. 
These quantities need to be self-averaging with respect to the randomness in the 
training data in the thermodynamic limit (N - c), which enforces two necessary 
constraints on our calculation. First, the number of hidden units K << N, whereas 
one needs K 00(N) for the universal approximation proof to hold. Second, one 
can show that the updates of the biases have to be of O(1/N), i.e., the bias learning 
rate has to be scaled by l/N, in order to make the biases self-averaging quantities, 
a fact that is confirmed by simulations [see Fig. 1]. If we interpret the normalized 
290 A. H. L. West, D. Saad and I. T. Nabney 
example number a = !/N as a continuous time variable, the update equations for 
the order parameters and the biases become first order coupled differential equations 
dQij 
da 
dRi, dO i 
da = n (Siv,) , d da = -n (Si) ' (2) 
Choosing g(x) = eff(x/) as the sigmoidal trsfer, most integrations in Eqs. [2) 
can be performed alytically, but for single Gaussian integrals remaining for - 
terms and the generization error. The exact form of the resulting dynamical 
equations is quite complicated and will be presented elsewhere. Here we only re- 
mark, that the gain 7 of the linear output unit, which determines the output scale, 
merely rescales the learning rates with 7 2 and can therefore be set to one without 
loss of generality. Due to the numerical integrations required, the differential equa- 
tions can only be solved accurately in moderate times for smaller student networks 
(K  5) but any teacher size M. 
3 ANALYSIS OF THE DYNAMICAL EQUATIONS 
The dynamical evolution of the overlaps Qij, Rin and the biases 0 i follows from 
integrating the equations of motion (2) from initial conditions determined by the 
(random) initialization of the student weights W/ and biases Oi. For random ini- 
tialization the resulting norms Qii of the student vector will be order O(1), while 
the overlaps Qij between different student vectors, and student-teacher vectors Rin 
will be only order O(1/v/). A random initialization of the weights and biases can 
therefore be simulated by initializing the norms Qii, the biases Oi and the normalized 
overlaps Oij -- Qij/v/QiiQjj and in -- Rin/x/QiiTnn from uniform distributions 
in the [0, 1], [- 1, 1], and [- 10 -2, 10 -12] intervals respectively. 
We find that the results of the numerical integration are sensitive to these ran- 
dom initial values, which has not been the case to this extent for fixed biases. 
Furthermore, the dynamical behaviour can become very complex even for realiz- 
able cases (K = M) and networks with three or four hidden units. For sake of 
simplicity, we will therefore restrict our presentation to networks with two hidden 
units (K - M = 2) and uncorrelated isotropic teachers, defined by Tnm -- 5nm, al- 
though larger networks and graded teacher scenarios were investigated extensively 
as well. We have further limited our scope by investigating a common learning 
rate (r/o = r/0 -- r/w) for biases and weights. To study the effect of different weight 
initialization, we have fixed the initial values of the student-student overlaps Qij 
and biases Oi, as these can be manipulated freely in any learning scenario. Only the 
initial student-teacher overlaps Rin are randomized as suggested above. 
In Fig. I we compare the evolution of the overlaps, the biases and the generalization 
error for the soft committee machine with and without adjustable bias learning a 
similar realizable teacher task. The student denoted by  lacks biases, i.e., Oi -- O, 
and learns to imitate an isotropic teacher with zero biases (Pn -- 0). The other 
student features adjustable biases, trained from an isotropic teacher with small 
biases (Pl,2 -- :F0.1). For both scenarios, the learning rate and the initial conditions 
were judiciously chosen to be r/o = 2.0, Qll = 0.1, Q22 = 0.2, in -- 012 = 
U[-10 -x2, 10 -12] with 01 = 0.0 and 02 = 0.5 for the student with adjustable biases. 
In both cases, the student weight vectors (Fig. la) are drawn quickly from their 
initial values into a suboptimal symmetric phase, characterized by the lack of spe- 
cialization of the student hidden units on a particular teacher hidden unit, as can 
be depicted from the similar values of Rin in Fig. lb. This symmetry is broken 
The Learning Dynamcis of a Universal Approximator 291 
Oi 
1.0-- 
0.8  
0.6- 
0.4- 
0.2- 
0.0- 
'= '-..... 
 =Qn (N=10) Qu,Q2 
! o Qu (N-100) / Q 
 Q12 (N-10) / Qi ........... 
k_ ................... 
100 200 300 400 500 600 700 
,.oJ 
-[ 
0.8] 
0.-    (=100) / 2P ...... 
v* v* /]  ..... 
11  22.. Rll 
0.4  2 Rl ..... 
 Rh,Rh/' Rl .... 
0.2   R--- 
R,R 
0 ]00 200 300 400 500 600 
o.a o, 
0-21 '. 0 (0.5) 
0.1 q ...-- ,.=--- ..7 
o, o, ...... 
01 ......... 01 (0'01) 
.** J  .. 
0.0- "''---'-. .... , ............. ::.. 
.... x \ x 
-0.1 '. x x x "'.. 
f 02 .... 4 02 .... -' 02 .............. 
-0.2- I 0](10-) O] (10-) 01 "!'0"! ..... 
-0.3 
100 200 300 400 500 600 700 
0.02 
0.015 
0.01 
0.005 
0.0 
(d) 
6s(0.01 ) --- es(O* ) --- 
6s(0.1 ) ...... (s(0) ........... 
es(0.5) es(10 -s) ...... 
es(1 ) .... es(10 -4) ..... 
N=200  
t'..',  t , 
"'T-'):'' '' "l'"  ''" 
0 100 200 300 400 500 600 
Figure 1: The dynamical evolution of the student-student overlaps Qij (a), and the 
student-teacher overlaps Rin (b) as a function of the normalized example number a 
is compared for two student-teacher scenarios: One student (denoted by .) has fixed 
zero biases, the other has adjustable biases. The influence of the symmetry in the 
initialization of the biases on the dynamics is shown for the student biases 0 i (C), 
and the generalization error e s (d): 0 = 0 is kept for all runs, but the initial value 
of 02 varies and is given in brackets in the legends. Finite size simulations for input 
dimensions N = 10... 500 show that the dynamical variables are self-averaging. 
almost immediately in the learning scenario with adjustable biases and the student 
converges quickly to the optimal solution, characterized by the evolution of the 
overlap matrices Q, R and biases 0 i (see Fig. lc) to their optimal values T and 
Pn (up to the permutation symmetry due to the arbitrary labeling of the student 
nodes). Likewise, the generalization error e s decays to zero in Fig. ld. The student 
with fixed biases is trapped for most of its training time in the symmetric phase 
before it eventually converges. 
Extensive simulations for input dimensions N - 10... 500 confirm that the dynamic 
variables are self-averaging and show that variances decrease with 1/N. The mean 
trajectories are in good agreement with the theoretical predictions even for very 
small input dimensions (N - 10) and are virtually indistinguishable for N - 500. 
The length of the symmetric phase for the isotropic teacher scenario is dominated 
by the learning rate l, but also exhibits a logarithmic dependence on the typical 
700 
700 
The length of the symmetric phase is linearly dependent on r/o for small learning rates. 
292 A. H. L. West, D. Saad and I. T. Nabney 
0.6 
0.4- 
0.2- 
0i o.0 
-0.2- 
-0.6 
,...--.- ;.2 ......  
.. ......... ! ./ 
?-"-'' =''; LT ....... ,' / 
 .--..... --- . 
...... 9 ..... a '" ' ''"""' "/ 
........... ol ...... Ol '" 
' ' ' I ' ' ' I ' ' ' I ' ' ' I 
0 400 800 1200 1600 
32OO 
28OO 
2400- 
c 
2000- 
1600- 
1200 
/2_n II I I ; 1 
........... /o=U.U.tll I . ; / 
...... /o=0.1 ';I i I ;  
..... /o=1.5 , j ' 
---/o=z  i / // 
........ ;!  / 
'' 'l'''l'''l'''l' ''l' 
0.0 0.2 0.4 0.6 0.8 1.0 
Figure 2: (a) The dynamical evolution of the biases 0i for a student imitating an 
isotropic teacher with zero biases. reveals symmetric dynamics for 01 and 0. The 
student was randomly initialized identically for the different runs, but for a change 
in the range of the random initialization of the biases (U[-b, b]), with the value of 
b given in the legend. Above a critical value of b the student remains stuck in a 
suboptimal phase. (b) The normalized convergence time cc - /oac is shown as a 
function of the initialization of 02 for varios learning rates r/o (see legend, r/o  - 0 
symbolizes the dynamics neglecting r/o  terms.). 
differences in the initial student-teacher overlaps Rin (Biehl et al., 1996) which are 
typically of order O(1/x/) and cannot be influenced in real scenarios without a 
priori knowledge. The initialization of the biases, however, can be controlled by 
the user and its influence on the learning dynamics is shown in Figs. lc and ld for 
the biases and the generalization error respectively. For initially identical biases 
(01 -- 02 -- 0), the evolution of the order parameters and hence the generalization 
error is almost indistinguishable from the fixed biases case. A breaking of this 
symmetry leads to a decrease of the symmetric phase linear in log(1Ol - Ol) until 
it has all but disappeared. The dynamics are again slowed down for very large 
initialization of the biases (see ld), where the biases have to travel a long way to 
their optimal values. 
This suggests that for a given learning rate the biases have a dominant effect in 
the learning process and strongly break existent symmetries in weight space. This 
is argueably due to a steep minimum in the generalization error surface along the 
direction of the biases. To confirm this, we have studied a range of other learning 
scenarios including larger networks and non-isotropic teachers, e.g., graded teachers 
with Tnm -- n6nm. Even when the norms of the teacher weight vectors are strongly 
graded, which also breaks the weight symmetry and reduces the symmetric phase 
significantly in the case of fixed biases, we have found that the biases usually have 
the stronger symmetry breaking effect: the trajectories of the biases never cross, 
provided that they were not initialized too symmetrically. 
This would seem to promote initializing the biases of the student hidden units evenly 
across the input domain, which has been suggested previously on a heuristic basis 
(Nguyen & Widrow, 1990). However, this can lead to the student being stuck in a 
suboptimal configuration. In Fig. 2a, we show the dynamics of the student biases Oi 
when the teacher biases are symmetric (Pn - 0). We find that the student progress 
is inversely related to the magnitude of the bias initialization and finally fails to 
converge at all. It remains in a suboptimal phase characterized by biases of the same 
large magnitude but opposite sign and highly correlated weight vectors. In effect, 
the outputs of the two student nodes cancel out over most of the input domain. In 
The Learning Dynamcis of a Universal Approximator 293 
Fig. 2b, the influence of the learning rate in combination with the bias initialization 
in determining convergence is illustrated. The convergence time ac, defined as the 
example number at which the generalization error has decayed to a small value, 
here judiciously chosen to be 10 -s, is shown as a function of the initial value of 02 
for various learning rates r/0. For convenience, we have normalized the convergence 
time with 1/r/o. The initialization of the other order parameters is identical to 
Fig. la. One finds that the convergence time diverges for all learning rates, above 
a critical initial value of 02. For increasing learning rates, this transition becomes 
sharper and occurs at smaller 02, i.e., the dynamics become more sensitive to the 
bias initialization. 
4 SUMMARY AND DISCUSSION 
This research has been motivated by recent progress in the theoretical study of 
on-line learning in realistic two-layer neural network models -- the soft-committee 
machine, trained with back-propagation (Saad & Solla, 1995). The studies so far 
have excluded biases to the hidden layers, a constraint which has been removed in 
this paper, which makes the model a universal approximator. The dynamics of the 
extended model turn out to be very rich and more complex than the original model. 
In this paper, we have concentrated on the effect of initialization of student weights 
and biases. We have further restricted our presentation for simplicity to realizable 
cases and small networks with two hidden units, although larger networks were 
studied for comparison. Even in these simple learning scenarios, we find surpris- 
ing dynamical effects due to the adjustable biases. In the case where the teacher 
network exhibits distinct biases, unsymmetric initial values of the student biases 
break the node symmetry in weight space effectively and can speed up the learning 
process considerably, suggesting that student biases should in practice be initially 
spread evenly across the input domain if there is no a priori knowledge of the func- 
tion to be learned. For degenerate teacher biases however such a scheme can be 
counterproductive as different initial student bias values slow down the learning 
dynamics and can even lead to the student being stuck in suboptimal fixed points, 
characterized by student biases being grouped symmetrically around the degenerate 
teacher biases and strong correlations between the associated weight vectors. 
In fact, these attractive suboptimal fixed points exist even for non-degenerate 
teacher biases, but the range of initial conditions attracted to these suboptimal 
network configurations decreases in size. Furthermore, this domain is shifted to 
very large initial student biases as the difference in the values of the teacher biases 
is increased. We have found these effects also for larger network sizes, where the 
dynamics and number of attractive suboptimal fixed points with different internal 
symmetries increases. Although attractive suboptimal fixed points were also found 
in the original model (Biehl et al., 1996), the basins of attraction of initial values 
are in general very small and are therefore only of academic interest. 
However, our numerical work suggests that a simple rule of thumb to avoid being 
attracted to suboptimal fixed points is to always initialize the squared norm of a 
weight vector larger than the magnitude of the corresponding bias. This scheme 
will still support spreading of the biases across the main input domain in order to 
encourage node symmetry breaking. This is somewhat similar to previous findings 
(Nguyen & Widrow, 1990; Kim & Ra, 1991), the former suggesting spreading the 
biases across the input domain, the latter relating the minimal initial size of each 
weight with the learning rate. This work provides a more theoretical motivation for 
these results and also distinguishes between the different rSles of biases and weights. 
In this paper we have addressed mainly one important issue for theoreticians and 
294 A. H. L. West, D. Saad and L T. Nabney 
practitioners alike: the initialization of the student network weights and biases. 
Other important issues, notably the question of optimal and maximal learning rates 
for different network sizes during convergence, will be reported elsewhere. 
A THEOREM 
Let $g denote the class of neural networks defined by sums of the form '],i= nig(ui -- Oi) 
where K is arbitrary (representing an arbitrary number of hidden units), 01 E R and ni Z 
(i.e. integer weights). Let b(x) -- Og(x)/Ox and let D0 denote the class of networks defined 
by sums of the form .. wib(ui -Oi) where wi  . If g is continuously differentiable and 
if the class D0 are uni'e--rsal approximators, then $g is a class of universal approximators; 
that is, such functions are dense in the space of continuous functions with the Loo norm. 
As a corollary, the normalized soft committee machine forms a class of universal approxi- 
mators with both sigmoid and error transfer functions [since radial basis function networks 
are universal (Park & Sandberg, 1993) and we need consider only the one-dimensional in- 
put case as noted in the proof below]. Note that some restriction on g is necessary: if g is 
the step function, then with arbitrary hidden-output weights, the network is a universal 
approximator, while with fixed hidden-output weights it is not. 
A.1 Proof 
By the arguments of (Hornik et al., 1990) which use the properties of trigonometric poly- 
nomials, it is sufficient to consider the case of one-dimensional input and output spaces. 
Let I denote a compact interval in R and let f be a continuous function defined on I. 
Because 0 is universal, given any e  0 we can find weights wi and biases 0i such that 
f - wib (u - Oi) <  (i) 
Because the rationals are dense in the reals, without loss of generality we can assume 
that the weights wi 6 Q. Since b(x) is continuous and I is compact, the convergence of 
[g(x + h) - g(x)]/h to Og(x)/Ox is uniform and hence for all n > n (2--) the following 
inequality holds: 
I[nw [g ( u'+ 1 O) -g(u-O')] -wib(u-Oi)[]oo  2( (ii) 
Also note that for suitable ni  n 2--7 ' mi = niwi  Z, as wi is a rational number. 
Thus, by the triangle inequality, 
I Oi -g(u-Oi) - wib(u-Oi) < . 
ni 
i-- ! oo 
The result now follows from equations (i) and (iii) and the triangle inequality. 
(iii) 
References 
Barber, D., Saad, D., & Sollich, P. 1996. Europhys. Left., 34, 151-156. 
Biehl, M., & Schwarze, H. 1995. J. Phys. A, 28, 643-656. 
Biehl, M., Riegler, P., & WShler, C. 1996. University of Wiirzburg Preprint WUE-ITP- 
96-003. 
Hornik, K., Stinchcombe, M., & White, H. 1990. Neural Networks, 3, 551-560. 
Kim, Y. K., & Ra, J..B. 1991. Pages 396-d01 of: International Joint Conference on 
Neural Networks 91. 
Nguyen, D., & Widrow, B. 1990. Pages C1-C6 of: IJCNN International Conference on 
Neural Networks 90. 
Park, J., & Sandberg, I. W. 1993. Neural Computation, 5, 305-316. 
Riegler, P., & Biehl, M. 1995. J. Phys. A, 28, L507-L513. 
Saad, D., & Solla, S. A. 1995. Phys. Rev. E, 52, 4225-4243. 
