Robust Parameter Estimation And 
Model Selection For Neural Network 
Regression 
Yong Liu 
Department of Physics 
Institute for Brain and Neural Systems 
Box 1843, Brown University 
Providence, RI 02912 
yongcns. brown. edu 
Abstract 
In this paper, it is shown that the conventional back-propagation
(BPP) algorithm for neural network regression is robust to leverages
(data with $x$ corrupted), but not to outliers (data with $y$
corrupted). A robust model is to model the error as a mixture of
normal distributions. The influence function for this mixture model
is calculated and the condition for the model to be robust to outliers
is given. The EM algorithm [5] is used to estimate the parameters. The
usefulness of model selection criteria is also discussed. Illustrative
simulations are performed.
1 Introduction
In neural network research, the back-propagation (BPP) algorithm is the most
popular algorithm. In the regression problem $y = \eta(x, \theta) + \epsilon$, in which
$\eta(x, \theta)$ denotes a neural network with weights $\theta$, the algorithm is equivalent
to modeling the errors as independent and identically distributed (i.i.d.) normal
variables, and using the maximum likelihood method to estimate the parameters [13].
However, the training data set may contain surprising data points, either due to
errors in $y$ space (outliers), when the response vectors $y$ of these data points are
far away from the underlying function surface, or due to errors in $x$ space
(leverages), when the feature vectors $x$ of these data points are far away from the
mass of the feature vectors of the rest of the data points. These abnormal data
points may bias the parameter estimates towards them. A robust algorithm or
robust model is one that overcomes the influence of the abnormal data points.
A lot of work has been done in linear robust regression [8, 6, 3]. In neural networks,
it is generally believed that the sigmoidal function of the basic computing
unit in the neural net plays some role in the robustness of the neural net
to outliers and leverages. In this article, we investigate this more thoroughly. It
turns out that the conventional normal model (BPP algorithm) is robust to leverages
due to the sigmoidal property of the neurons, but not to outliers (section 2). From the
Bayesian point of view [2], modeling the error as a mixture of normal distributions
with different variances, with a flat prior distribution on the variances, is more robust.
The influence function for this mixture model is calculated and the condition for the
model to be robust to outliers is given (section 3.1). An efficient algorithm for
parameter estimation in this situation is the EM algorithm [5] (section 3.2). In
section 3.3, we discuss a choice of prior and its properties. In order to choose
among different probability models or different forms of priors, and neural nets
with different architectures, we discuss the model selection criteria in section 4.
Illustrative simulations on the choice of prior, or the t distribution model, and the
normal distribution model are given. Model selection statistics are used to choose
the degrees of freedom of the t distribution, to choose among different neural
networks, and to choose between a t model and a normal model (sections 4 and 5).
2 Issue Of Robustness In Normal Model For Neural Net Regression
One way to think of the outliers and leverages is to regard them as a
perturbation of the data distribution of the good data. Remember that an estimated
parameter $T = T(F)$ is an implicit function of the underlying data distribution $F$.
To evaluate the influence of this distribution perturbation on $T$, we use the influence
function [6] of estimator $T$ at point $z = (x, y)$ with data distribution $F$, which is
defined as

$$IF(T, z, F) = \lim_{t \to 0^+} \frac{T((1 - t)F + t\Delta_z) - T(F)}{t} \tag{1}$$

in which $\Delta_z$ has mass 1 at $z$ (footnote 1). This definition is equivalent to the
definition of a derivative with respect to $F$, except that what we are dealing with now
is the derivative of a functional. This definition gives the amount of change in the
estimator $T$ with respect to a distribution perturbation $t\Delta_z$ at point $z = (x, y)$.
For a robust estimation of the parameter, we expect that the estimated parameter
does not change significantly with respect to a data perturbation. In other words,
the influence function is bounded for a robust estimation.
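The definition in equation 1 can be explored numerically. The following is a minimal sketch, not part of the original analysis: it replaces $F$ by an empirical sample, uses a small finite $t$ instead of the limit, and compares two textbook estimators (the mean and the median) that are chosen here purely for illustration. The function names and the value of `t` are assumptions.

```python
import numpy as np

def empirical_influence(estimator, sample, z, t=0.01):
    """Finite-t version of equation 1 on an empirical distribution.

    The contaminated distribution (1 - t) F + t Delta_z is represented by
    giving the points of `sample` total weight (1 - t) and the point z weight t.
    `estimator(values, weights)` must accept weighted data.
    """
    n = len(sample)
    values = np.append(sample, z)
    w_clean = np.append(np.full(n, 1.0 / n), 0.0)
    w_contaminated = np.append(np.full(n, (1.0 - t) / n), t)
    return (estimator(values, w_contaminated) - estimator(values, w_clean)) / t

def weighted_mean(values, weights):
    return np.sum(weights * values) / np.sum(weights)

def weighted_median(values, weights):
    order = np.argsort(values)
    cum = np.cumsum(weights[order])
    return values[order][np.searchsorted(cum, 0.5 * cum[-1])]

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, size=2001)
for z in (10.0, 100.0, 1000.0):
    print(z,
          empirical_influence(weighted_mean, sample, z),
          empirical_influence(weighted_median, sample, z))
```

In this sketch the influence on the mean grows linearly with the perturbing point, while the influence on the median stays at the same bounded value, which is the sense in which a bounded influence function characterizes robustness.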
Footnote 1: The probability density of the distribution $\Delta_z$ is the Dirac delta
function centered at $z$.

Denote the conditional probability model of $y$ given $x$ as i.i.d. $f(y|x, \theta)$ with
parameter $\theta$. If the error function is the negative log-likelihood, with or without a
penalty term, then a general property of the influence function of the estimated
parameter $\hat{\theta}$ is $IF(\hat{\theta}, (x_i, y_i), F) \propto \nabla_\theta \log f(y_i|x_i, \hat{\theta})$ (for proof, see [11]).
Denote the neural net, with $h$ hidden units and the dimension of the output being
one ($d_y = 1$), as

$$\eta(x, \theta) = \sum_{k=1}^{h} \alpha_k \sigma(w_k x + t_k) \tag{2}$$

in which $\sigma(x)$ is the sigmoidal function or $1/(1 + \exp(x))$ and $\theta = \{\alpha_k, w_k, t_k\}$.
For a normal model, $f(y|x, \theta, \sigma) = \mathcal{N}(y; \eta(x, \theta), \sigma)$, in which $\mathcal{N}(y; c, \sigma)$ denotes
the $d_y$-variate normal distribution with mean $c$ and covariance matrix $\sigma^2 I$.
Straightforward calculation yields ($d_y = 1$)

$$IF(\hat{\theta}, (x, y), F) \propto (y - \eta(x, \hat{\theta}))
\begin{pmatrix}
\left(\sigma(\hat{w}_k x + \hat{t}_k)\right)_{h \times 1} \\
\left(\hat{\alpha}_k \sigma'(\hat{w}_k x + \hat{t}_k)\, x\right)_{h \times d_x} \\
\left(\hat{\alpha}_k \sigma'(\hat{w}_k x + \hat{t}_k)\right)_{h \times 1}
\end{pmatrix} \tag{3}$$

Since a $y$ with a large value makes the influence function unbounded, the normal
model, or the back-propagation algorithm for regression, is not robust to outliers.
Since $\sigma'(wx + t)$ tends to zero for $x$ that is far away from the projection
$wx + t = 0$, the influence function is bounded for an abnormal $x$, i.e., the normal
model for regression is robust to leverages. This analysis can be easily extended to
a neural net with multiple hidden layers and multiple outputs. Since the neural net
model is robust to leverages, we shall concentrate on the discussion of robustness
with respect to outliers afterwards.
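The two boundedness claims can be checked numerically. The sketch below evaluates, up to a constant factor, the residual times the gradient of the network output with respect to $(\alpha_k, w_k, t_k)$ for a scalar input. The parameter values are arbitrary illustrative assumptions, and the logistic function is taken with the conventional sign, $1/(1 + \exp(-u))$; the argument only needs a bounded sigmoid whose derivative vanishes far from $wx + t = 0$.

```python
import numpy as np

def sigmoid(u):
    # Logistic function written with tanh so that large |u| cannot overflow.
    return 0.5 * (1.0 + np.tanh(0.5 * u))

def influence_direction(x, y, alpha, w, t):
    """(y - eta) times the gradient of eta w.r.t. (alpha_k, w_k, t_k), scalar x."""
    u = w * x + t
    s = sigmoid(u)
    ds = s * (1.0 - s)                      # derivative of the logistic function
    eta = np.sum(alpha * s)                 # network output, equation 2
    grad = np.concatenate([s, alpha * ds * x, alpha * ds])
    return (y - eta) * grad

# Hypothetical fitted parameters for a net with h = 2 hidden units.
alpha = np.array([1.0, -0.5])
w = np.array([1.0, 2.0])
t = np.array([0.0, -1.0])

# Leverage: push x far away while y stays moderate.
for x in (5.0, 50.0, 500.0):
    print("x =", x, np.linalg.norm(influence_direction(x, 0.0, alpha, w, t)))

# Outlier: push y far away while x stays moderate.
for y in (10.0, 100.0, 1000.0):
    print("y =", y, np.linalg.norm(influence_direction(1.0, y, alpha, w, t)))
```

With these assumed parameters the norm stays bounded as $x$ grows and increases linearly with $y$, in line with the argument above.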
3 Robust Probability Model And Parameter Estimation 
3.1 Mixture Model 
One method for robust estimation is Bayesian analysis [2]. Since our
goal is to overcome the influence of outliers in the data set, we model the error as
a mixture of normal distributions, or,

$$f(y|x, \theta, \sigma) = \int f(y|x, \theta, q, \sigma)\, \pi(q)\, dq \tag{4}$$

with $f(y|x, \theta, q, \sigma) = \mathcal{N}(y; \eta(x, \theta), \sigma^2/q)$, a normal distribution with variance
$\sigma^2/q$, and the prior distribution on $q$ is denoted as $\pi(q)$. Intuitively, a mixture of
normal distributions with different $q$s, or different variances, conveys the idea that
a data point is generated either from a normal distribution with large variance, in
which case it can be considered an outlier, or from one with small variance, in
which case it can be considered good data. This requires $\pi(q)$ to be flat to
accommodate the abnormal data points. An extreme case of a non-flat prior is
$\pi(q) = \delta(q - 1)$, which makes $f(y|x, \theta, \sigma)$ a normal distribution model. This model
has been discussed in the previous section and it is not robust to outliers.

Calculation yields ($d_y = 1$) the influence function as

$$IF(\hat{\theta}, (x, y), F) \propto (y - \eta(x, \hat{\theta}))\, \hat{w}
\begin{pmatrix}
\left(\sigma(\hat{w}_k x + \hat{t}_k)\right)_{h \times 1} \\
\left(\hat{\alpha}_k \sigma'(\hat{w}_k x + \hat{t}_k)\, x\right)_{h \times d_x} \\
\left(\hat{\alpha}_k \sigma'(\hat{w}_k x + \hat{t}_k)\right)_{h \times 1}
\end{pmatrix} \tag{5}$$

in which

$$\hat{w} = E[q \mid y, x, \hat{\sigma}, \hat{\theta}] \tag{6}$$

where the expectation is taken with respect to the posterior distribution of $q$, or
$\pi(q|y, x, \sigma, \theta) = f(y|x, \theta, q, \sigma)\pi(q) / f(y|x, \theta, \sigma)$. For the influence function to be
bounded for a $y$ with large value, $(y - \eta(x, \hat{\theta}))\hat{w}$ must be bounded. This is the
condition on $\pi(q)$ under which the distribution $f(y|x, \theta, \sigma)$ is robust to outliers. It
can be noticed that the mixture model is robust to leverages for the same reason as
in the case of the normal distribution model.
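To make the posterior expectation in equation 6 concrete, the following sketch evaluates it for a discrete prior on $q$. The two-point prior used in the example (most mass on $q = 1$, a little on $q = 0.01$) is a hypothetical choice for illustration only and is not a prior discussed in the paper.

```python
import numpy as np

def posterior_mean_q(residual, sigma, q_values, prior):
    """E[q | y, x, sigma, theta] of equation 6 for a discrete prior on q.

    residual = y - eta(x, theta); the component with value q has variance sigma**2 / q.
    """
    q_values = np.asarray(q_values, dtype=float)
    prior = np.asarray(prior, dtype=float)
    # log N(residual; 0, sigma^2 / q), dropping terms that do not depend on q
    log_lik = 0.5 * np.log(q_values) - 0.5 * q_values * (residual / sigma) ** 2
    log_post = np.log(prior) + log_lik
    post = np.exp(log_post - log_post.max())   # subtract the max for numerical stability
    post /= post.sum()
    return float(np.sum(post * q_values))

# Hypothetical prior: 90% "good data" (q = 1), 10% "large variance" (q = 0.01).
q_values, prior, sigma = [1.0, 0.01], [0.9, 0.1], 0.2
for residual in (0.0, 0.2, 0.6, 1.0):
    print(residual, posterior_mean_q(residual, sigma, q_values, prior))
```

Under these assumptions a point close to the fitted surface receives a weight near 1, while a point several standard deviations away is assigned almost entirely to the large-variance component and receives a small weight.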
3.2 Algorithm For Parameter Estimation 
An efficient parameter estimation method for the model in equation 4 is the EM
algorithm [5]. In the EM algorithm, a portion of the parameters is regarded as
missing observations. During the estimation, these missing observations are estimated
through the previously estimated parameters of the model. Afterwards, these estimated
missing observations are combined with the real observations to estimate the
parameters of the model. In our mixture model, we shall regard $\{q_i, i = 1, \ldots, n\}$ as the
missing observations. Denote $\omega = \{x_i, y_i, i = 1, \ldots, n\}$ as the training data set.
The EM algorithm is a straightforward calculation (see Liu, 1993b) once one
writes down the full probability $f(\{y_i, q_i\}|\{x_i\}, \sigma, \theta)$. The algorithm is equivalent to
minimizing

$$\frac{1}{n} \sum_{i=1}^{n} w_i^{(s-1)} \left(y_i - \eta(x_i, \theta)\right)^2 \tag{7}$$

and estimating $\sigma$ at the $s$ step by
$(\hat{\sigma}^{(s)})^2 = \frac{1}{n} \sum_{i=1}^{n} w_i^{(s-1)} (y_i - \eta(x_i, \hat{\theta}^{(s)}))^2$,
in which $w_i^{(s-1)} = E[q_i \mid y_i, x_i, \hat{\sigma}^{(s-1)}, \hat{\theta}^{(s-1)}]$.
If we use $f(y|x, \theta, \sigma) \propto \exp(-\rho(|y - \eta(x, \theta)|/\sigma))$ and denote $\psi(z) = \rho'(z)$, calculation
yields $w = E[q \mid y, x, \sigma, \theta] = \psi(z)/z \,|_{z = |y - \eta(x, \theta)|/\sigma}$. This is exactly the same choice of
weight $w_i^{(s-1)}$ as in the iteratively reweighted regression algorithm [7]. What we have
here, different from the work of Holland et al., is that we link the EM algorithm
with the iteratively reweighted algorithm, and also extend the algorithm to a much
more general situation. The weighting $w_i$ provides a measure of the goodness of a
data point. Equation 7 estimates the parameters based on the portion of the data
that are good. A penalization term on $\theta$ can also be included in equation 7
(footnote 2).
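The iteration described above can be sketched in a few lines. To keep the example short, a linear model $\eta(x, \theta) = X\theta$ stands in for the neural net, so that the weighted least-squares step has a closed-form solution; with a neural net this step would be an iterative weighted back-propagation fit instead. The weight function is the one the t model of section 3.3 gives, and the data, `nu`, and iteration count are illustrative assumptions.

```python
import numpy as np

def em_reweighted_fit(X, y, weight_fn, n_iter=50):
    """EM iterations of section 3.2 for a linear stand-in model eta(x, theta) = X @ theta.

    weight_fn(residual, sigma) returns E[q | y, x, sigma, theta] for each data point.
    """
    theta = np.linalg.lstsq(X, y, rcond=None)[0]          # start from the normal-model fit
    sigma = np.sqrt(np.mean((y - X @ theta) ** 2))
    w = np.ones(len(y))
    for _ in range(n_iter):
        w = weight_fn(y - X @ theta, sigma)               # E-step: weights w_i^(s-1)
        sw = np.sqrt(w)
        theta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]  # weighted least squares
        sigma = np.sqrt(np.mean(w * (y - X @ theta) ** 2))               # sigma update
    return theta, sigma, w

def t_weight(residual, sigma, nu=3.0):
    # Weight for the t model with d_y = 1 (section 3.3).
    return (nu + 1.0) / (nu + (residual / sigma) ** 2)

# Hypothetical data: a line with noise, and the first 7 responses shifted upwards.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.2, size=100)
y[:7] += 5.0
X = np.column_stack([np.ones_like(x), x])

theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
theta_robust, sigma_robust, weights = em_reweighted_fit(X, y, t_weight)
print("least squares:", theta_ols)
print("reweighted   :", theta_robust, "outlier weights:", weights[:7])
```

In this sketch the shifted points end up with weights close to zero, so the final fit is determined almost entirely by the remaining data.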
Footnote 2: A prior on $\theta$ can be $\pi(\theta) \propto e^{-\alpha(\lambda, \theta)/(2\sigma^2)}$, which yields an additional
penalization term $\alpha(\lambda, \theta)$ in equation 7, in which $\lambda$ denotes a tuning parameter of
the penalization.

3.3 Choice Of Prior
There are many choices of prior distribution $\pi(q)$ (for discussion, see [11]). We
only discuss the choice $\nu q \sim \chi^2_\nu$, i.e., a chi-square distribution with $\nu$ degrees of
freedom. By integrating equation 4,

$$f(y|x, \theta, \sigma) = \frac{\Gamma((\nu + d_y)/2)}{\Gamma(\nu/2)\,(\pi\nu)^{d_y/2}\,\sigma^{d_y}}
\left(1 + \frac{(y - \eta(x, \theta))^2}{\nu\sigma^2}\right)^{-(\nu + d_y)/2}.$$

It is a $d_y$-variate t distribution with $\nu$ degrees of freedom, location $\eta(x, \theta)$ and
scale matrix $\sigma^2 I$. Calculation yields

$$E[q \mid y, x, \sigma, \theta] = \frac{\nu + d_y}{\nu + (y - \eta(x, \theta))^2/\sigma^2}.$$

The t distribution becomes a normal distribution as $\nu$ goes to infinity. For finite $\nu$,
it has heavier tails than the normal distribution and thus is appropriate for
regression with outliers. Indeed, the condition for robustness, $(y - \eta(x, \hat{\theta}))\hat{w}$ being
bounded for a $y$ with large value, is satisfied. The weighting
$\hat{w} \propto 1/\{1 + (y - \eta(x, \hat{\theta}))^2/(\nu\hat{\sigma}^2)\}$ balances the influence of the $y$s with large
values, achieving robustness with respect to outliers for the t distribution.
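The closed-form weight for the t model can be cross-checked against a direct numerical evaluation of the posterior expectation, and the boundedness condition can be verified on a grid. This is a sketch for $d_y = 1$; the grid sizes and the values $\nu = 3$, $\sigma = 0.2$ are assumptions chosen for illustration.

```python
import numpy as np

def t_weight(residual, sigma, nu):
    """Closed-form E[q | y, x, sigma, theta] for the t model with d_y = 1."""
    return (nu + 1.0) / (nu + (residual / sigma) ** 2)

def numeric_weight(residual, sigma, nu, q_max=60.0, n_grid=200_001):
    """The same expectation by brute-force summation over a uniform grid in q."""
    q = np.linspace(1e-9, q_max, n_grid)
    log_prior = (0.5 * nu - 1.0) * np.log(q) - 0.5 * nu * q        # nu * q ~ chi-square(nu)
    log_lik = 0.5 * np.log(q) - 0.5 * q * (residual / sigma) ** 2  # N(y; eta, sigma^2 / q)
    post = np.exp(log_prior + log_lik)                             # unnormalised posterior
    return float(np.sum(q * post) / np.sum(post))

nu, sigma = 3.0, 0.2
for residual in (0.0, 0.2, 2.0):
    print(residual, t_weight(residual, sigma, nu), numeric_weight(residual, sigma, nu))

# Robustness condition: |residual * weight| stays bounded and decays for large residuals.
r = np.linspace(0.0, 100.0, 100_001)
product = np.abs(r * t_weight(r, sigma, nu))
print(product.max(), product[-1])
```

Setting the derivative of $r(\nu + 1)/(\nu + r^2/\sigma^2)$ to zero shows that the product peaks at $r = \sigma\sqrt{\nu}$ with value $(\nu + 1)\sigma/(2\sqrt{\nu})$, which the grid maximum in the sketch should approach.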
4 Model Selection Criteria 
The term model is used here in a broad sense. It can be the degree of penalization,
a probability model, a neural net architecture, or a combination of the above.
A lot of work has been done in model selection [1, 17, 15, 4, 13, 14, 10, 12]. The
choice of a model is based on its prediction ability. A natural choice is the expected
negative log-likelihood. This is equivalent to using the Kullback-Leibler measure [9]
for model selection, or $-E[\log f(y|x, \text{model})] + E[\log f(y|x, \text{true model})]$. This is
problematic if the model cannot be normalized, as in the case of an improper prior.
Equation 7 implies that we can use

$$\frac{1}{n^*} \sum_{i=1}^{n} w_i^* \left(y_i - \eta(x_i, \hat{\theta}_{-i})\right)^2 \tag{8}$$

as the cross-validation [16] assessment of model $m$, in which $n^* = \sum_i w_i^*$, $w_i^*$ is
the convergence limit of $w_i^{(s)}$, or equation 6, and $\hat{\theta}_{-i}$ is the estimator of $\theta$ with
the $i$th data point deleted from the full data set. The success of the cross-validation
method depends on a robust parameter estimation. The cross-validation method
calculates the average prediction error on a data point based on the rest of the data
in the training data set. In the presence of outliers, predicting an outlier based on
the rest of the data is simply not meaningful in the evaluation of the model.
Equation 8 takes the outliers into consideration. Using results from [10], we can
show [11], with penalization term $\alpha(\lambda, \theta)$,

[...] (9)

in which

[...] (10)

If the models in comparison contain an improper prior, the above model selection
statistic can be used.
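A weighted leave-one-out assessment of this kind can be sketched as follows, assuming that equation 8 is the $w_i^*$-weighted squared leave-one-out error normalised by $n^* = \sum_i w_i^*$. The `fit` and `predict` callables are hypothetical placeholders for whatever robust estimator is used; in the toy example a constant model fitted by the median stands in for the robust neural net fit.

```python
import numpy as np

def weighted_loo_cv(X, y, w_star, fit, predict):
    """Weighted leave-one-out squared error, normalised by n* = sum of w_i*.

    w_star holds the converged weights w_i* from the fit to the full data;
    fit(X, y) returns parameters and predict(X, theta) returns eta(x, theta).
    """
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        theta_minus_i = fit(X[keep], y[keep])            # estimator with the i-th point deleted
        err = y[i] - predict(X[i:i + 1], theta_minus_i)[0]
        total += w_star[i] * err ** 2
    return total / np.sum(w_star)

# Toy stand-in: a constant model fitted by the median; the last point is an outlier.
fit_median = lambda X, y: np.median(y)
predict_constant = lambda X, theta: np.full(len(X), theta)

X = np.zeros((4, 1))
y = np.array([1.0, 2.0, 3.0, 100.0])
print(weighted_loo_cv(X, y, np.array([1.0, 1.0, 1.0, 0.0]), fit_median, predict_constant))
print(weighted_loo_cv(X, y, np.ones(4), fit_median, predict_constant))
```

With the outlier weighted to zero the score reflects only how well the good points are predicted, whereas the unweighted score is dominated by the failed prediction of the outlier.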
If the models in comparison have closed forms of $f(y|x, \theta, \sigma)$, the average negative
log-likelihood can be used as the model selection criterion. In Liu's work [10], an
approximate form for the unbiased estimation of the expected negative log-likelihood
was provided. If we use the negative log-likelihood plus a penalty term $\alpha(\lambda, \theta)$ as
the parameter optimization criterion, the model selection statistic is

$$T(\omega) = -\frac{1}{n} \sum_{i=1}^{n} \log f(y_i|x_i, \hat{\theta}_{-i})
\approx -\frac{1}{n} \sum_{i=1}^{n} \log f(y_i|x_i, \hat{\theta}) + \frac{1}{n} \mathrm{Tr}(C^{-1}D) \tag{11}$$

in which $C = \sum_{i=1}^{n} \nabla_\theta \log f(y_i|x_i, \hat{\theta})\, \nabla_\theta^t \log f(y_i|x_i, \hat{\theta})$
and $D = -\sum_{i=1}^{n} \nabla_\theta \nabla_\theta^t \log f(y_i|x_i, \hat{\theta}) + \nabla_\theta \nabla_\theta^t \alpha(\lambda, \theta)$. The optimal model is the
one that minimizes this statistic. If the true underlying distribution is the normal
distribution model and there are no penalization terms, it is easy to prove that
$C \to D$ as $n$ goes to infinity. Then the statistic becomes AIC [1].
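The limiting behaviour of $C$ and $D$ can be illustrated numerically. The sketch below uses a linear model with normal noise as a stand-in for the neural net, holds $\sigma$ fixed at its true value, and uses no penalization term; all of these are simplifying assumptions made only to keep the gradients and Hessians in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 20_000, 0.2
x = rng.uniform(0.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])                 # p = 2 parameters
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, sigma, size=n)

theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]     # maximum likelihood under the normal model
resid = y - X @ theta_hat

# For log f = -(y - x . theta)^2 / (2 sigma^2) + const, with sigma held fixed:
scores = X * (resid / sigma**2)[:, None]             # rows: grad_theta log f(y_i | x_i, theta_hat)
C = scores.T @ scores                                # sum of outer products of the gradients
D = X.T @ X / sigma**2                               # minus the sum of Hessians of log f

penalty_trace = np.trace(np.linalg.solve(C, D))      # Tr(C^{-1} D)
print(C / n)
print(D / n)
print(penalty_trace)
```

When the model is correctly specified, $C/n$ and $D/n$ agree closely for large $n$ and the trace is close to the number of parameters, which is the correction term of AIC.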
[Figure 1 plot. Legend: BPP fit without leverages; data set with leverages.]

Figure 1: BPP fit to a data set with leverages, and comparison with the BPP fit to the
data set without the leverages. A neural net with one hidden layer of 4 hidden units
is fitted to a data set with 10 leverages, which are on the right side of $x = 3.5$, by
using the conventional BPP method. The main body of the data (90 data points)
was generated from $y = \sin(x) + \epsilon$ with $\epsilon \sim \mathcal{N}(\epsilon; 0, \sigma = 0.2)$. It can be noticed
that the fit on the part of good data points was not dramatically influenced by the
leverages. This verifies our theoretical result about the robustness of a neural net
with respect to leverages.
5 Illustrative Simulations 
For the results shown in figures 2 and 3, the training data set contains 93 data points
from $y = \sin(x) + \epsilon$ and seven $y$ values (outliers) randomly generated from the
region [1, 2), in which $\epsilon \sim \mathcal{N}(\epsilon; 0, \sigma = 0.2)$. The neural net we use is of the form in
equation 2. Denote $h$ as the number of hidden units in the neural net. The caption
of each figure (1, 2, 3) explains the usefulness of the parameter estimation algorithm
and the model selection.
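A training set of this kind can be generated along the following lines. The text does not state how the inputs were sampled, so drawing $x$ uniformly from $[0, 2\pi]$ is an assumption, as is replacing the responses of seven randomly chosen points; only the counts, the noise level, and the outlier range [1, 2) come from the description above.

```python
import numpy as np

def make_training_set(rng, n_good=93, n_outliers=7, sigma=0.2,
                      x_range=(0.0, 2.0 * np.pi)):
    """Simulated data: y = sin(x) + noise, with some responses replaced by outliers.

    x_range is a hypothetical choice; the outlier responses are uniform on [1, 2).
    """
    n = n_good + n_outliers
    x = rng.uniform(*x_range, size=n)
    y = np.sin(x) + rng.normal(0.0, sigma, size=n)
    is_outlier = np.zeros(n, dtype=bool)
    is_outlier[rng.choice(n, size=n_outliers, replace=False)] = True
    y[is_outlier] = rng.uniform(1.0, 2.0, size=n_outliers)
    return x, y, is_outlier

rng = np.random.default_rng(0)
x, y, is_outlier = make_training_set(rng)
print(x.shape, y.shape, int(is_outlier.sum()))
```

The boolean mask is returned only so that a fit can later be compared against the points known to be good; it would not be available to the estimation algorithm.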
Acknowledgements
The author thanks Leon N Cooper and M.P. Perrone. The author also thanks his wife
Cong. This research was supported by grants from NSF, ONR and ARO.
[Figure 2 plot: the $T$ statistic and the MSE on the test set (rescaled by a power of
10) for the models $(\nu, h)$ = (3,3), (2,3), (4,3), (2,4), (3,4), (5,3), (3,5), (1,3), (3,7),
(n,3), (n,4), (n,5), (n,7); n stands for the normal distribution model (BPP fit).]

Figure 2: Model selection statistic $T$ for fits to the data set with outliers, and tests on
an independent data set with 1000 data points from $y = \sin(x) + \epsilon$, where
$\epsilon \sim \mathcal{N}(\epsilon; 0, \sigma = 0.2)$. It can be seen that the $T$ statistic is consistent with the error on
the test data set. The $T$ statistic favors t models with small $\nu$ over the normal
distribution models.

[Figure 3 plot: y against x. Legend: BPP fit without outliers; data set with outliers.]

Figure 3: Fits to the data set with outliers, and comparison with the BPP fit to the data
set without the outliers. The best fit among the four BPP fits ($h = 3$), according to the
$T$ statistic, was influenced by the outliers, tending to shift upwards. Although the
distribution is not a t distribution at all, the best fit by the EM algorithm under
the t model ($\nu = 3$, $h = 3$), also according to the $T$ statistic, gives a better result than
the BPP fit, and is actually almost the same as the BPP fit ($h = 3$) to the training data
set without the outliers. This is due to the fact that a t distribution has a heavy
tail to accommodate the outliers.

References
[1] H. Akaike. Information theory and an extension of the maximum likelihood
principle. In Petrov and Csaki, editors, Proceedings of the 2nd International
Symposium on Information Theory, pages 267-281, 1973.
[2] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag,
1985.
[3] R. D. Cook and S. Weisberg. Characterization of an empirical influence function
for detecting influential cases in regression. Technometrics, 22:495-508, 1980.
[4] P. Craven and G. Wahba. Smoothing noisy data with spline functions: estimating
the correct degree of smoothing by the method of generalized cross-validation.
Numer. Math., 31:377-403, 1979.
[5] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from
incomplete data via the EM algorithm (with discussion). J. Roy. Stat. Soc. Ser. B,
39:1-38, 1977.
[6] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust
Statistics: The approach based on influence functions. Wiley, 1986.
[7] P. W. Holland and R. E. Welsch. Robust regression using iteratively reweighted
least-squares. Commun. Stat. A, 6:813-827, 1977.
[8] P. J. Huber. Robust Statistics. New York: Wiley, 1981.
[9] S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Stat.,
22:79-86, 1951.
[10] Y. Liu. Neural network model selection using asymptotic jackknife estimator
and cross-validation method. In C. L. Giles, S. J. Hanson, and J. D. Cowan, editors,
Advances in Neural Information Processing Systems 5. Morgan Kaufmann, 1993.
[11] Y. Liu. Robust neural network parameter estimation and model selection for
regression. Submitted, 1993.
[12] Y. Liu. Unbiased estimate of generalization error and model selection criterion
in neural network. Submitted to Neural Networks, 1993.
[13] D. MacKay. Bayesian methods for adaptive models. PhD thesis, California
Institute of Technology, 1991.
[14] J. E. Moody. The effective number of parameters: an analysis of generalization
and regularization in nonlinear learning systems. In J. E. Moody, S. J. Hanson,
and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4,
pages 847-854. Morgan Kaufmann, 1992.
[15] G. Schwarz. Estimating the dimension of a model. Ann. Stat., 6:461-464, 1978.
[16] M. Stone. Cross-validatory choice and assessment of statistical predictions
(with discussion). J. Roy. Stat. Soc. Ser. B, 36:111-147, 1974.
[17] M. Stone. An asymptotic equivalence of choice of model by cross-validation
and Akaike's criterion. J. Roy. Stat. Soc. Ser. B, 39(1):44-47, 1977.