An Optimization Method of Layered 
Neural Networks based on the Modified 
Information Criterion 
Sumio Watanabe 
Information and Communication R & D Center 
Ricoh Co., Ltd. 
3-2-3, Shin-Yokohama, Kohoku-ku, Yokohama, 222 Japan 
sumio@ipe.rdc.ricoh.co.jp
Abstract 
This paper proposes a practical optimization method for layered 
neural networks, by which the optimal model and parameter can 
be found simultaneously. We modify the conventional information 
criterion into a differentiable function of parameters, and then minimize
it, while controlling it back to the ordinary form. Effectiveness
of this method is discussed theoretically and experimentally.
1 INTRODUCTION
Learning in artificial neural networks has been studied based on a statistical
framework, because the statistical theory clarifies the quantitative relation
between the empirical error and the prediction error. Let us consider a function
$\varphi(w; x)$ from the input space $R^K$ to the output space $R^L$ with a
parameter $w$. We assume that training samples $\{(x_i, y_i)\}_{i=1}^{N}$ are
taken from the true probability density $Q(x, y)$. Let us define the empirical
error by

$$E_{\mathrm{emp}}(w) = \frac{1}{N} \sum_{i=1}^{N} \|y_i - \varphi(w; x_i)\|^2, \tag{1}$$

and the prediction error by

$$E(w) = \int\!\!\int \|y - \varphi(w; x)\|^2 \, Q(x, y) \, dx \, dy. \tag{2}$$
If we find a parameter v* which minimizes Eemp(W), then 
2(F(w*) + 1) 
< E(w') >= (1 + NL ) < Ee,p(V') > +o(---), (3) 
where <  > is the average value for the training samples, o(1/N) is a sinall term 
which satisfies No(i/N) --, 0 when N  x>, and F(w*), N, and L are respectively 
the numbers of the effective parameters of w*, the training samples, and output 
units. 
Although the average $\langle \cdot \rangle$ cannot be calculated in the actual
application, the optimal model for the minimum prediction error can be found by
choosing the model that minimizes the Akaike information criterion (AIC) [1],

$$J(w^*) = \left(1 + \frac{2(F(w^*) + 1)}{NL}\right) E_{\mathrm{emp}}(w^*). \tag{4}$$
This method was generalized for arbitrary distance [2]. The Bayes information
criterion (BIC) [3] and the minimum description length (MDL) [4] were proposed
to overcome the inconsistency problem of AIC that the true model is not always
chosen even when $N \to \infty$.
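As a small illustration of how eq. (4) would be used for model selection, the
following Python sketch scores three candidate models. The function name
`aic_criterion`, the candidate names, and all numerical values are hypothetical
and are not taken from the experiments in this paper.

```python
def aic_criterion(e_emp, n_effective, n_samples, n_outputs):
    """Eq. (4): J(w*) = (1 + 2 (F(w*) + 1) / (N L)) * E_emp(w*)."""
    return (1.0 + 2.0 * (n_effective + 1) / (n_samples * n_outputs)) * e_emp


# Hypothetical candidates: (name, empirical error E_emp(w*), effective parameters F(w*)).
candidates = [
    ("small", 3.60e-3, 9),
    ("medium", 3.31e-3, 17),
    ("large", 3.30e-3, 51),
]
N, L = 1000, 1  # assumed numbers of training samples and output units

scores = {name: aic_criterion(e, f, N, L) for name, e, f in candidates}
best = min(scores, key=scores.get)  # model with the smallest criterion value
print(scores, best)
```

The largest model has the smallest empirical error, but its penalty factor is
also the largest, so under these assumed numbers the criterion prefers the
medium model.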
The above information criteria have been applied to the neural network model
selection problem, where the maximum likelihood estimator $w^*$ was calculated
for each model, and then the information criteria were compared. In practice,
however, the maximum likelihood estimator cannot always be found for each
model, and even when it can, the calculation takes a long time.
In order to improve such model selection procedures, this paper proposes a
practical learning algorithm by which the optimal model and parameter can be
found simultaneously. Let us consider a modified information criterion,

$$J_\alpha(w) = \left(1 + \frac{2(F_\alpha(w) + 1)}{NL}\right) E_{\mathrm{emp}}(w), \tag{5}$$

where $\alpha > 0$ is a parameter and $F_\alpha(w)$ is a $C^1$-class function
which converges to $F(w)$ when $\alpha \to 0$. We minimize $J_\alpha(w)$, while
controlling $\alpha$ as $\alpha \to 0$. To show effectiveness of this method, we
show experimental results, and discuss the theoretical background.
2 A Modified Information Criterion 
2.1 A Formal Information Criterion 
Let us consider a conditional probability distribution, 

$$P(w, \sigma; y|x) = \frac{1}{(2\pi\sigma^2)^{L/2}} \exp\left(-\frac{\|y - \varphi(w; x)\|^2}{2\sigma^2}\right), \tag{6}$$

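where a function $\varphi(w; x) = \{\varphi_i(w; x)\}$ is given by the
three-layered perceptron,

$$\varphi_i(w; x) = \rho\left(w_{i0} + \sum_{j=1}^{H} w_{ij} \, \rho\left(w_{j0} + \sum_{k=1}^{K} w_{jk} x_k\right)\right), \tag{7}$$

and $w = \{w_{i0}, w_{ij}\}$ is a set of biases and weights and $\rho(\cdot)$ is
a sigmoidal function.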
Let M be the full-connected neural network model with K input units, H hidden 
units, and L output units, and : be the family of all models made fi'om M, by 
pruning weights or eliminating biases. When a set of training samples {(, y)}= 
is given, we define an empirical loss and the 1)rediction loss by 
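
$$L_{\mathrm{emp}}(w, \sigma) = -\frac{1}{N} \sum_{i=1}^{N} \log P(w, \sigma; y_i|x_i), \tag{8}$$

$$L(w, \sigma) = -\int\!\!\int Q(x, y) \log P(w, \sigma; y|x) \, dx \, dy. \tag{9}$$

Minimizing $L_{\mathrm{emp}}(w, \sigma)$ is equivalent to minimizing
$E_{\mathrm{emp}}(w)$, and minimizing $L(w, \sigma)$ is equivalent to minimizing
$E(w)$. We assume that there exists a parameter $(w^*_M, \sigma^*_M)$ which
minimizes $L_{\mathrm{emp}}(w, \sigma)$ in each model $M \in \mathcal{M}$. By
the theory of AIC, we have the following formula,

$$\langle L(w^*_M, \sigma^*_M) \rangle = \langle L_{\mathrm{emp}}(w^*_M, \sigma^*_M) \rangle + \frac{F(w^*_M) + 1}{N} + o\!\left(\frac{1}{N}\right). \tag{10}$$
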
Based on this property, let us define a formal information criterion $I(M)$ for
a model $M$ by

$$I(M) = 2N L_{\mathrm{emp}}(w^*_M, \sigma^*_M) + A \, (F_0(w^*_M) + 1), \tag{11}$$

where $A$ is a constant and $F_0(w)$ is the number of nonzero parameters in $w$,

$$F_0(w) = \sum_{i=1}^{L} \sum_{j=0}^{H} f_0(w_{ij}) + \sum_{j=1}^{H} \sum_{k=0}^{K} f_0(w_{jk}), \tag{12}$$

where $f_0(x)$ is 0 if $x = 0$, and 1 otherwise. $I(M)$ is formally equal to AIC
if $A = 2$, or MDL if $A = \log(N)$. Note that $F(w) \le F_0(w)$ for arbitrary
$w$ and that $F(w^*_M) = F_0(w^*_M)$ if and only if the Fisher information
matrix of the model $M$ is positive definite.
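The count $F_0(w)$ and the formal criterion $I(M)$ of eq. (11) are simple to
evaluate once a fitted parameter is available. The sketch below assumes the
weights are stored as two arrays whose column 0 holds the biases; the function
names, the example network, and the value of the empirical loss are all
hypothetical.

```python
import math

import numpy as np


def count_nonzero_params(w_out, w_hid):
    """F_0(w): number of nonzero entries among all weights and biases.

    w_out has shape (L, H + 1) and w_hid has shape (H, K + 1); column 0 holds biases.
    """
    return int(np.count_nonzero(w_out) + np.count_nonzero(w_hid))


def formal_criterion(l_emp, f0, n_samples, a):
    """I(M) = 2 N L_emp + A (F_0 + 1), following eq. (11)."""
    return 2.0 * n_samples * l_emp + a * (f0 + 1)


# Hypothetical pruned network with K = 3 inputs, H = 2 hidden units, L = 1 output.
w_out = np.array([[0.5, -1.2, 0.0]])
w_hid = np.array([[0.0, 0.8, 0.0, -0.3],
                  [0.1, 0.0, 0.0, 0.0]])
N = 1000
l_emp = -1.4  # assumed value of the empirical loss at the fitted parameter

f0 = count_nonzero_params(w_out, w_hid)
i_aic = formal_criterion(l_emp, f0, N, a=2.0)          # A = 2 (AIC)
i_mdl = formal_criterion(l_emp, f0, N, a=math.log(N))  # A = log(N) (MDL)
print(f0, i_aic, i_mdl)
```

Only the constant $A$ differs between the two calls, so for the same fitted
parameter the MDL form always charges more per nonzero weight than the AIC form
when $\log(N) > 2$.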
2.2 A Modified Information Criterion 
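In order to find the optimal model and parameter simultaneously, we define a
modified information criterion. For $\alpha > 0$,

$$I_\alpha(w, \sigma) = 2N L_{\mathrm{emp}}(w, \sigma) + A \, (F_\alpha(w) + 1), \tag{13}$$

$$F_\alpha(w) = \sum_{i=1}^{L} \sum_{j=0}^{H} f_\alpha(w_{ij}) + \sum_{j=1}^{H} \sum_{k=0}^{K} f_\alpha(w_{jk}), \tag{14}$$

where $f_\alpha(x)$ satisfies the following two conditions.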
(1) f,.(x) --} fo(x)when c --} O. 
(2) If 151  lYl then 0  f,(x)  f,(y) _< 1. 
For example, 1- exp(-x2/a 2) and 1- 1/(1 + (x/a) 2) satisfy this condition. Based 
on these definitions, we have the following theorem. 
Theorem rain 1(214) = lira rain I (w, a). 
AI a.0 tv, 
This theorem shows that the optimal model and parameter can be found by
minimizing $I_\alpha(w, \sigma)$ while controlling $\alpha$ as $\alpha \to 0$
(the parameter $\alpha$ plays the same role as the temperature in simulated
annealing). As the convergence $F_\alpha(w) \to F_0(w)$ is not uniform, this
theorem needs the second condition on $f_\alpha(x)$. (For the proof of the
theorem, see [5].)
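The two example softener functions and their limiting behaviour can be checked
numerically. The following sketch is only an illustration: the function names
and the test points are assumptions, and it examines the two conditions at a
few points rather than proving them.

```python
import numpy as np


def f0(x):
    """Indicator f_0(x): 0 if x == 0, otherwise 1."""
    return (np.asarray(x, dtype=float) != 0.0).astype(float)


def f_gauss(x, alpha):
    """Softener 1 - exp(-x^2 / alpha^2)."""
    return 1.0 - np.exp(-(np.asarray(x, dtype=float) / alpha) ** 2)


def f_rational(x, alpha):
    """Softener 1 - 1 / (1 + (x / alpha)^2)."""
    return 1.0 - 1.0 / (1.0 + (np.asarray(x, dtype=float) / alpha) ** 2)


x = np.array([0.0, 0.05, 0.5, 2.0])  # arbitrary test points with increasing |x|
for alpha in (1.0, 0.1, 0.001):
    print(alpha, f_gauss(x, alpha), f_rational(x, alpha))

# Pointwise, both softeners approach f0 as alpha shrinks, but at x = alpha the
# Gaussian softener always equals 1 - exp(-1), so the convergence is not uniform.
gap_at_alpha = [float(f_gauss(a, a)) for a in (1.0, 0.1, 0.001)]
print(gap_at_alpha)
```

The last lines illustrate why the monotonicity condition (2) is needed: however
small $\alpha$ becomes, there are always weights of size comparable to $\alpha$
that are counted only partially.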
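If we choose a differentiable function for $f_\alpha(w)$, then its local minimum
can be found by the steepest descent method,

$$\frac{dw}{dt} = -\frac{\partial}{\partial w} I_\alpha(w, \sigma), \qquad \frac{d\sigma}{dt} = -\frac{\partial}{\partial \sigma} I_\alpha(w, \sigma). \tag{15}$$

These equations result in a learning dynamics,

$$\Delta w = -\eta \sum_{i=1}^{N} \left\{ \frac{\partial}{\partial w} \|y_i - \varphi(w; x_i)\|^2 + \frac{A \hat{\sigma}^2}{N} \frac{\partial F_\alpha}{\partial w} \right\}, \tag{16}$$

where $\hat{\sigma}^2 = (1/NL) \sum_{i=1}^{N} \|y_i - \varphi(w; x_i)\|^2$, and
$\alpha$ is slowly controlled as $\alpha \to 0$. This dynamics can be understood
as the error backpropagation with the added term.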
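The update of eq. (16) can be sketched in a few lines of Python. To keep the
example short, a linear model $\varphi(w; x) = x \cdot w$ with one output unit
stands in for the three-layered perceptron, so this is not the network used in
the paper. The softener $1 - \exp(-w^2/2\alpha^2)$ is the one used in Section
3.1; the toy data set, the function names, and all hyperparameter values are
assumptions chosen for illustration.

```python
import numpy as np


def softener_grad(w, alpha):
    """Derivative of f_alpha(w) = 1 - exp(-w^2 / (2 alpha^2)) with respect to w."""
    return (w / alpha**2) * np.exp(-w**2 / (2.0 * alpha**2))


def train(x, y, a=2.0, eta=0.002, n_max=20000, alpha0=1.0, eps=0.01):
    """Steepest descent on a linear stand-in model phi(w; x) = x @ w (L = 1)."""
    w = np.full(x.shape[1], 0.1)  # arbitrary nonzero initial weights
    for n in range(n_max):
        alpha = alpha0 * (1.0 - n / n_max) + eps  # alpha is slowly decreased
        resid = x @ w - y
        sigma2 = np.mean(resid**2)   # (1 / (N L)) sum_i ||y_i - phi(w; x_i)||^2
        grad_err = 2.0 * x.T @ resid  # sum_i d/dw ||y_i - phi(w; x_i)||^2
        # Sum over i of (A sigma^2 / N) dF_alpha/dw gives A sigma^2 dF_alpha/dw.
        grad_pen = a * sigma2 * softener_grad(w, alpha)
        w = w - eta * (grad_err + grad_pen)
    return w


# Deterministic toy data: the 8 corners of the cube [-0.5, 0.5]^3.
signs = np.array([[s1, s2, s3] for s1 in (-1, 1) for s2 in (-1, 1) for s3 in (-1, 1)],
                 dtype=float)
x = 0.5 * signs
noise = 0.1 * signs.prod(axis=1)    # orthogonal to every input column
w_ls = np.array([1.0, -0.5, 0.02])  # least-squares solution by construction
y = x @ w_ls + noise

w_plain = train(x, y, a=0.0)   # no penalty term: plain least squares
w_pruned = train(x, y, a=2.0)  # AIC-like penalty term added
print(w_plain, w_pruned)
```

Under these assumptions the two large weights stay near their least-squares
values, while the third weight, whose least-squares value is only 0.02, is
driven toward zero by the added term.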
3 Experimental Results 
3.1 The true distribution is contained in the models 
First, we consider a case when the true distribution is contained in the model
family $\mathcal{M}$. Figure 1 (1) shows the true model from which the training
samples were taken. One thousand input samples were taken from the uniform
probability on [-0.5, 0.5] x [-0.5, 0.5] x [-0.5, 0.5]. The output samples were
calculated by the network in Figure 1 (1), and noise was added which was taken
from a normal distribution with the expectation 0 and the variance
$3.33 \times 10^{-3}$. Ten thousand testing samples were taken from the same
distribution. We used $f_\alpha(w) = 1 - \exp(-w^2/2\alpha^2)$ as a softener
function, and the "annealing schedule" of $\alpha$ was set as
$\alpha(n) = \alpha_0 (1 - n/n_{\max}) + \epsilon$, where $n$ is the training
cycle number, $\alpha_0 = 3.0$, $n_{\max} = 25000$, and $\epsilon = 0.01$.
Figure 1 (2) shows the fully connected model with 10 hidden units, which is the
initial model. In the training, the learning speed $\eta$ was set as 0.1.
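The annealing schedule described above is a linear decrease with a small
positive offset. A minimal sketch, using the stated values as defaults and a
hypothetical function name, is:

```python
def alpha_schedule(n, alpha0=3.0, n_max=25000, eps=0.01):
    """alpha(n) = alpha0 * (1 - n / n_max) + eps for training cycle n."""
    return alpha0 * (1.0 - n / n_max) + eps


# Softener width at the start, the middle, and the end of training.
for n in (0, 12500, 25000):
    print(n, alpha_schedule(n))
```

The offset keeps the width strictly positive at the last training cycle, so the
softener function remains differentiable throughout training.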
We compared the empirical errors and the prediction errors for several values
of $A$ (Figure 1 (5), (6)). If $A = 2$, the criterion is AIC, and if
$A = \log(N) = 6.907$, it is BIC or MDL. Figure 1 (3) and (4) show the
optimized models and parameters for the criteria with $A = 2$ and $A = 5$. When
$A = 5$, the true model could be found.
3.2 The true distribution is not contained in the models
Second, let us consider a case that the true distribution is not contained in
the model family. For the training samples and the testing samples, we used the
same probability density as the above case except that the function was

$$y = \frac{1}{\ldots} \{\sin(\pi(x_1 + x_2)) + \tanh(x_3) + 2\}. \tag{17}$$

Figure 2 (1) and (2) show the training error and the prediction error,
respectively. In this case, the best generalized model was found by AIC, shown
in Figure 2 (3). In the optimized network, $x_1$ and $x_2$ were almost separated
from $x_3$, which means that the network could find the structure of the true
model in eq. (17).
The practical application to ultrasonic image reconstruction is shown in Figure 3. 
4 Discussion 
4.1 An information criterion and pruning weights 
If $P(w, \sigma; y|x)$ sufficiently approximates $Q(y|x)$ and $N$ is
sufficiently large, we have

$$L_{\mathrm{emp}}(w^*_M, \sigma^*_M) = L(w^*_M, \sigma^*_M) - \frac{F(w^*_M) + 1}{N} + Z_N + o\!\left(\frac{1}{N}\right), \tag{18}$$

where $Z_N = L_{\mathrm{emp}}(\hat{w}_M) - L(\hat{w}_M)$ and $\hat{w}_M$ is the
parameter which minimizes $L(w, \sigma)$ in the model $M$. Although
$\langle Z_N \rangle = 0$, resulting in equation (10), its standard deviation
has the same order as $1/\sqrt{N}$. However, if $M_1 \subset M_2$ or
$M_1 \supset M_2$, then $\hat{w}_{M_1}$ and $\hat{w}_{M_2}$ are expected to be
almost common, and it does not essentially affect the model selection problem
[2].
The model family made by pruning weights or by eliminating biases is not a
totally ordered set but a partially ordered set for the order "$\subset$".
Therefore, if a model $M \in \mathcal{M}$ is selected, it is the optimal model
in a local model family
$\mathcal{M}_M = \{M' \in \mathcal{M};\ M' \subset M \text{ or } M' \supset M\}$,
but it may not be the optimal model in the global family $\mathcal{M}$.
Artificial neural networks have the local minimum problem not only in the
parameter space but also in the model family.
4.2 The degenerate Fisher information matrix. 
If the true probability is contained in the model and the number of hidden units
is larger than necessary, then the Fisher information matrix is degenerate, and
consequently the maximum likelihood estimator is not subject to the
asymptotically normal distribution [6]. Therefore, the prediction error is not
given by eq. (3), and AIC cannot be derived. However, by the proposed method,
the selected model has a non-degenerate Fisher information matrix, because if
it is degenerate then the modified information criterion is not minimized.
Figure 1: True distribution is contained in the models. Panels: (1) True model;
(2) Initial model for learning; (3) Optimized by AIC (A = 2); (4) Optimized by
A = 5; (5) The empirical error $E_{\mathrm{emp}}(w^*)$ versus A; (6) The
prediction error $E(w^*)$ versus A. In panels (5) and (6) the vertical axis is
in units of $10^{-3}$, and the positions A = 2 (AIC) and A = 6.9 (MDL) are
marked on the horizontal axis.
Figure 2: True distribution is not contained in the models. Panels: (1) The
empirical error $E_{\mathrm{emp}}(w^*)$ versus A; (2) The prediction error
$E(w^*)$ versus A; (3) Optimized by AIC (A = 2), with inputs $x_1$, $x_2$,
$x_3$, the empirical error $3.31 \times 10^{-3}$, and the prediction error
$3.41 \times 10^{-3}$. In panels (1) and (2) the vertical axis is in units of
$10^{-3}$, and the positions of AIC and BIC are marked on the horizontal axis.
.. , ': -,  -_ . . .. 
5'..? ...... :': ' 
i 
- '-5 I 
:. .-.....):. 
(1) An Ultrasonic hnaging Systmn 
(2) Sample Objects. 
hnagcs for Training Images for Testing 
Reconsuucted Image : ' jjj?'-?]j}; ?.?:'?j: 
/ I pixel  . ,i -: 5L ::' ::, '::::. 
hidden  
neighborhood  est llg .[IC ....  [ 
U!asonic Image 3 32 Restored using MDL 
(3) NeurM Networks (d)Restored hnages 
Figure 3: Practical Application to Image Restoration 
The proposed method was applied to ultrasonic image restoration. Figure 3 (1),
(2), (3), (4) respectively show an ultrasonic imaging system, the sample
objects, a neural network for image restoration, and the original and restored
images. The numbers of parameters optimized by LSM, AIC, and MDL were
respectively 166, 138, and 57. Rather noiseless images were obtained using the
modified AIC or MDL. For example, the "Tail of R" was clearly restored using
AIC.
4.3 Relation to other generalization methods
In the neural information processing field, many methods have been proposed for
preventing the over-fitting problem. One of the most famous methods is the
weight decay method, in which we assume an a priori probability distribution on
the parameter space and minimize

$$E_1(w) = E_{\mathrm{emp}}(w) + \lambda C(w), \tag{19}$$

where $\lambda$ and $C(w)$ are chosen by several heuristic methods [7]. The BIC
is the information criterion for such a method [3], and the proposed method may
be understood as a method of controlling $\lambda$ and $C(w)$.
5 Conclusion 
An optimization method for layered neural networks was proposed based on the
modified information criterion, and its effectiveness was discussed
theoretically and experimentally.
Acknowledgements 
The author would like to thank Prof. S. Amari, Prof. S. Yoshizawa, Prof. K.
Aihara of the University of Tokyo, and all members of the Amari seminar for
their active discussions about statistical methods in neural networks.
References 
[1] H. Akaike. (1974) A New Look at the Statistical Model Identification. IEEE
Trans. on Automatic Control, Vol. AC-19, No. 6, pp. 716-723.
[2] N. Murata, S. Yoshizawa, and S. Amari. (1992) Learning Curves, Model
Selection and Complexity of Neural Networks. Advances in Neural Information
Processing Systems 5, San Mateo, Morgan Kaufmann, pp. 607-614.
[3] G. Schwarz. (1978) Estimating the dimension of a model. Annals of
Statistics, Vol. 6, pp. 461-464.
[4] J. Rissanen. (1984) Universal Coding, Information, Prediction, and
Estimation. IEEE Trans. on Information Theory, Vol. 30, pp. 629-636.
[5] S. Watanabe. (1993) An Optimization Method of Artificial Neural Networks
based on a Modified Information Criterion. IEICE Technical Report, Vol.
NC93-52, pp. 71-78.
[6] H. White. (1989) Learning in Artificial Neural Networks: A Statistical
Perspective. Neural Computation, Vol. 1, pp. 425-464.
[7] A. S. Weigend, D. E. Rumelhart, and B. A. Huberman. (1991) Generalization
by weight-elimination with application to forecasting. Advances in Neural
Information Processing Systems, Vol. 3, pp. 875-882.
