Robust Neural Network Regression for Offline
and Online Learning 
Thomas Briegel*
Siemens AG, Corporate Technology 
D-81730 Munich, Germany 
thomas.briegel@mchp.siemens.de 
Volker Tresp 
Siemens AG, Corporate Technology 
D-81730 Munich, Germany 
volker.tresp@mchp.siemens.de
Abstract 
We replace the commonly used Gaussian noise model in nonlinear 
regression by a more flexible noise model based on the Student-t- 
distribution. The degrees of freedom of the t-distribution can be chosen 
such that as special cases either the Gaussian distribution or the Cauchy 
distribution are realized. The latter is commonly used in robust regres- 
sion. Since the t-distribution can be interpreted as being an infinite mix- 
ture of Gaussians, parameters and hyperparameters such as the degrees 
of freedom of the t-distribution can be learned from the data based on an 
EM-learning algorithm. We show that modeling using the t-distribution
leads to improved predictors on real world data sets. In particular, if 
outliers are present, the t-distribution is superior to the Gaussian noise 
model. In effect, by adapting the degrees of freedom, the system can 
"learn" to distinguish between outliers and non-outliers. Especially for 
online learning tasks, one is interested in avoiding inappropriate weight 
changes due to measurement outliers to maintain stable online learn- 
ing capability. We show experimentally that using the t-distribution as 
a noise model leads to stable online learning algorithms and outperforms 
state-of-the-art online learning methods like the extended Kalman filter
algorithm. 
1 INTRODUCTION 
A commonly used assumption in nonlinear regression is that targets are disturbed by inde- 
pendent additive Gaussian noise. Although one can derive the Gaussian noise assumption 
based on a maximum entropy approach, the main reason for this assumption is practica- 
bility: under the Gaussian noise assumption the maximum likelihood parameter estimate 
can simply be found by minimization of the squared error. Despite its common use it is far 
from clear that the Gaussian noise assumption is a good choice for many practical prob- 
lems. A reasonable approach therefore would be a noise distribution which contains the 
Gaussian as a special case but which has a tunable parameter that allows for more flexible 
distributions. In this paper we use the Student-t-distribution as a noise model which con- 
tains two free parameters - the degrees of freedom , and a width parameter o -2. A nice 
feature of the t-distribution is that if the degrees of freedom , approach infinity, we recover 
the Gaussian noise model. If , < c we obtain distributions which are more heavy-tailed 
than the Gaussian distribution including the Cauchy noise model with , - 1. The latter 
* Now with McKinsey & Company, Inc. 
is commonly used for robust regression. The first goal of this paper is to investigate if the 
additional free parameter, i.e. ν, leads to better generalization performance for real world
data sets if compared to the Gaussian noise assumption with ν = ∞. The most common
reason why researchers depart from the Gaussian noise assumption is the presence of out- 
liers. Outliers are errors which occur with low probability and which are not generated by 
the data-generation process that is subject to identification. The general problem is that a 
few (maybe even one) outliers of high leverage are sufficient to throw the standard Gaus- 
sian error estimators completely off-track (Rousseeuw & Leroy, 1987). In the second set of 
experiments we therefore compare how the generalization performance is affected by out- 
liers, both for the Gaussian noise assumption and for the t-distribution assumption. Dealing 
with outliers is often of critical importance for online learning tasks. Online learning is of 
great interest in many applications exhibiting non-stationary behavior like tracking, sig- 
nal and image processing, or navigation and fault detection (see, for instance the NIPS*98 
Sequential Learning Workshop). Here one is interested in avoiding inappropriate weight 
changes due to measurement outliers to maintain stable online learning capability. Outliers
might result in highly fluctuating weights and possibly even instability when estimating the
neural network weight vector online using a Gaussian error assumption. State-of-the-art
online algorithms like the extended Kalman filter, for instance, are known to be nonrobust 
against such outliers (Meinhold & Singpurwalla, 1989) since they are based on a Gaussian 
output error assumption. 
The paper is organized as follows. In Section 2 we adopt a probabilistic view to outlier 
detection by taking as a heavy-tailed observation error density the Student-t-distribution 
which can be derived from an infinite mixture of Gaussians approach. In our work we use 
the multi-layer perceptron (MLP) as nonlinear model. In Section 3 we derive an EM algo-
rithm for estimating the MLP weight vector and the hyperparameters offline. Employing
a state-space representation to model the MLP's weight evolution in time we extend the 
batch algorithm of Section 3 to the online learning case (Section 4). The application of the 
computationally efficient Fisher scoring algorithm leads to posterior mode weight updates 
and an online EM-type algorithm for approximate maximum likelihood (ML) estimation 
of the hyperparameters. In the last two sections (Section 5 and Section 6) we present
experiments and conclusions, respectively. 
2 THE t-DENSITY AS A ROBUST ERROR DENSITY 
We assume a nonlinear regression model where for the t-th data point the noisy target
y_t ∈ ℝ is generated as

    y_t = g(x_t; w_t) + v_t    (1)

and x_t ∈ ℝ^k is a k-dimensional known input vector. g(·; w_t) denotes a neural network
model characterized by weight vector w_t ∈ ℝ^n, in our case a multi-layer perceptron
(MLP). In the offline case the weight vector w_t is assumed to be a fixed unknown constant
vector, i.e. w_t ≡ w. Furthermore, we assume that v_t is uncorrelated noise with density
p_{v_t}(·). In the offline case, we assume p_{v_t}(·) to be independent of t, i.e. p_{v_t}(·) ≡ p_v(·). In
the following we assume that p_v(·) is a Student-t-density with ν degrees of freedom and
width parameter σ²:

    p_v(z) = T(z | σ², ν) = [Γ((ν+1)/2) / (Γ(ν/2) σ √(νπ))] (1 + z²/(νσ²))^(−(ν+1)/2),  ν, σ > 0.    (2)
It is immediately appent at for v = 1 we recover e heavy-tailed Cauchy density. at 
is not so obvious is at for v   we obtain a Gaussi density. For e derivation of 
 e EM-leing rules in e next section it is important to note at the t-denstiy can be 
ought of as being  infinite mixture of Gaussians of the fore 
Figure 1: Left: ψ(·)-functions for the Gaussian density (dashed) and t-densities with ν =
1, 4, 15 degrees of freedom. Right: MSE on the Boston Housing data test set for additive
outliers. The dashed line shows results using a Gaussian error measure and the continuous
line shows the results using the Student-t-distribution as error measure.
    T(z | σ², ν) = ∫_0^∞ N(z | 0, σ²/u) p_u(u) du    (3)

where T(z | σ², ν) is the Student-t-density with ν degrees of freedom and width parameter
σ², N(z | 0, σ²/u) is a Gaussian density with center 0 and variance σ²/u, and p_u(u) =
ν χ²_ν(νu), where χ²_ν is the Chi-square density with ν degrees of freedom, evaluated at
νu > 0.
To compare different noise models it is useful to evaluate the "ψ-function" defined as (Hu-
ber, 1964)

    ψ(z) = −∂ log p_v(z)/∂z    (4)

i.e. the negative score function of the noise density. In the case of i.i.d. samples the ψ-
function reflects the influence of a single measurement on the resulting estimator. Assum-
ing Gaussian measurement errors p_v(z) = N(z | 0, σ²) we derive ψ(z) = z/σ², which
means that for |z| → ∞ a single outlier z can have an infinite leverage on the estimator. In
contrast, for constructing robust estimators, West (1981) states that large outliers should not
have any influence on the estimator, i.e. ψ(z) → 0 for |z| → ∞. Figure 1 (left) shows ψ(z)
for different ν for the Student-t-distribution. It can be seen that the degrees of freedom ν
determine how much weight outliers obtain in influencing the regression. In particular, for
finite ν, the influence of outliers with |z| → ∞ approaches zero.
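For the t-density, differentiating Equation (2) gives the closed form ψ(z) = (ν + 1)z/(νσ² + z²), which redescends to zero, while the Gaussian ψ grows linearly. A small illustration (our code; function names are ours):

```python
def psi_gaussian(z, sigma2=1.0):
    # Gaussian noise: influence z / sigma^2 grows without bound.
    return z / sigma2

def psi_t(z, nu, sigma2=1.0):
    # t-noise: psi(z) = (nu + 1) z / (nu sigma^2 + z^2) redescends to zero.
    return (nu + 1.0) * z / (nu * sigma2 + z * z)

# A single gross outlier dominates the Gaussian score ...
assert psi_gaussian(100.0) == 100.0 * psi_gaussian(1.0)
# ... while under the t-model its influence is smaller than that of a
# moderate residual and vanishes as |z| grows.
assert psi_t(100.0, nu=4.0) < psi_t(3.0, nu=4.0)
assert psi_t(1e6, nu=4.0) < 1e-5
```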
3 ROBUST OFFLINE REGRESSION 
As stated in Equation (3), the t-density can be thought of as being generated as an infinite 
mixture of Gaussians. Maximum likelihood adaptation of parameters and hyperparameters 
can therefore be performed using an EM algorithm (Lange et al., 1989). For the t-th sample, 
a complete data point would consist of the triple (xt, Yt, ut) of which only the first two are 
known and u_t is missing.
In the E-step we estimate for every data point indexed by t

    α_t = (ν^old + 1) / (ν^old + δ_t)    (5)

where α_t = E[u_t | y_t, x_t] is the expected value of the unknown u_t given the available data
(x_t, y_t) and where δ_t = (y_t − g(x_t; w^old))² / (σ²)^old.
In the M-step the weights w and the hyperparameters σ² and ν are optimized using

    w^new = argmin_w { Σ_{t=1}^T α_t (y_t − g(x_t; w))² }    (6)

    (σ²)^new = (1/T) Σ_{t=1}^T α_t (y_t − g(x_t; w^new))²    (7)

    ν^new = argmax_ν { (Tν/2) log(ν/2) − T log Γ(ν/2) + (ν/2) Σ_{t=1}^T (γ_t − α_t) }    (8)

where

    γ_t = DG((ν^old + 1)/2) − log((ν^old + δ_t)/2)    (9)

with the Digamma function DG(z) = ∂ log Γ(z)/∂z. Note that the M-step for ν is a one-
dimensional nonlinear optimization problem. Also note that the M-step for the weights in
the MLP reduces to a weighted least squares regression problem in which outliers tend to
be weighted down. The exception of course is the Gaussian case with ν → ∞ in which all
terms obtain equal weight.
4 ROBUST ONLINE REGRESSION 
For robust online regression, we assume that the model Equation (1) is still valid but that 
w can change over time, i.e. w ≡ w_t. In particular we assume that w_t follows a first order
random walk with normally distributed increments, i.e.

    w_t | w_{t−1} ∼ N(w_{t−1}, Q_t)    (10)

where w_0 is normally distributed with center a_0 and covariance Q_0. Due to the nonlinear
nature of g and due to the fact that the noise process is non-Gaussian, a fully Bayesian
online algorithm (which for the linear case with Gaussian noise can be realized using the
Kalman filter) is infeasible.
On the other hand, if we consider data D = {x_t, y_t}_{t=1}^T, the negative log-posterior
−log p(W_T | D) of the parameter sequence W_T = (w_0^T, ..., w_T^T)^T is, up to a normaliz-
ing constant,

    −log p(W_T | D) ∝ −Σ_{t=1}^T log p_v(y_t − g(x_t; w_t)) + (1/2) (w_0 − a_0)^T Q_0^{−1} (w_0 − a_0)
                       + (1/2) Σ_{t=1}^T (w_t − w_{t−1})^T Q_t^{−1} (w_t − w_{t−1})    (11)
and can be used as the appropriate cost function to derive the posterior mode estimate
W_T^MAP for the weight sequence. The two differences to the presentation in the last section
are that first, w_t is allowed to change over time and that second, penalty terms, stemming
from the prior and the transition density, are included. The penalty terms penalize
roughness of the weight sequence, leading to smooth weight estimates.
A suitable way to determine a stationary point of −log p(W_T | D), the posterior mode es-
timate of W_T, is to apply Fisher scoring. With the current estimate W_T^old we get a better
estimate W_T^new = W_T^old + γ for the unknown weight sequence W_T, where γ is the solution
of

    S(W_T^old) γ = s(W_T^old)    (12)

with the negative score function s(W_T) = −∂ log p(W_T | D)/∂W_T and the expected infor-
mation matrix S(W_T) = E[−∂² log p(W_T | D)/∂W_T ∂W_T^T]. By applying the ideas given in
Fahrmeir & Kaufmann (1991) to robust neural network regression it turns out that solving
(12), i.e. to compute the inverse of the expected information matrix, can be performed by
Cholesky decomposition in one forward and backward pass through the set of data D. Note 
that the expected information matrix is a positive definite block-tridiagonal matrix. The 
forward-backward steps have to be iterated to obtain the posterior mode estimate W_T^MAP
for W_T.
For online posterior mode smoothing, it is of interest to smooth backwards after each filter
step t. If Fisher scoring steps are applied sequentially for t = 1, 2, ..., then the posterior
mode smoother at time-step t−1, W_{t−1}^MAP = (ŵ_{0|t−1}, ..., ŵ_{t−1|t−1})^T, together with the
one-step predictor ŵ_{t|t−1} = ŵ_{t−1|t−1}, is a reasonable starting value for obtaining the pos-
terior mode smoother W_t^MAP at time t. One can reduce the computational load by limiting
the backward pass to a sliding time window, e.g. the last r_t time steps, which is reasonable
in non-stationary environments for online purposes. Furthermore, if we use the underlying
assumption that in most cases a new measurement y_t should not change estimates too
drastically, then a single Fisher scoring step often suffices to obtain the new posterior mode
estimate at time t. The resulting single Fisher scoring step algorithm with lookback param-
eter r_t has in fact just one additional line of code involving simple matrix manipulations
compared to online Kalman smoothing and is given here in pseudo-code. Details about the 
algorithm and a full description can be found in Briegel & Tresp (1999). 
Online single Fisher scoring step algorithm (pseudo-code)
for t = 1, 2, ... repeat the following four steps:
• Evaluate the one-step predictor ŵ_{t|t−1}.
• Perform the forward recursions for s = t − r_t, ..., t.
• New data point (x_t, y_t) arrives: evaluate the corrector step ŵ_{t|t}.
• Perform the backward smoothing recursions ŵ_{s−1|t} for s = t, ..., t − r_t.
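The full algorithm requires the block-tridiagonal Fisher scoring machinery described above. To convey only the role of the t-based reweighting, the sketch below (ours, and a drastic simplification rather than the authors' algorithm) applies the mixing weight α = (ν + 1)/(ν + δ) to the gain of a scalar random-walk Kalman filter; setting ν very large recovers the standard, outlier-sensitive filter.

```python
import random

random.seed(2)

def robust_kalman_1d(ys, q, sigma2, nu):
    """Scalar random-walk filter whose Kalman gain is damped by the t-mixture
    weight alpha = (nu+1)/(nu+delta): a gross outlier produces a large delta,
    a small alpha, and therefore barely moves the state estimate."""
    w, p = 0.0, 1.0
    estimates = []
    for y in ys:
        p += q                             # predict: w_t | w_{t-1} ~ N(w_{t-1}, q)
        resid = y - w
        delta = resid ** 2 / (p + sigma2)  # standardized squared innovation
        alpha = (nu + 1.0) / (nu + delta)
        gain = alpha * p / (p + sigma2)
        w += gain * resid                  # correct
        p *= (1.0 - gain)
        estimates.append(w)
    return estimates

# A slowly drifting level observed in noise; 10% of observations are outliers.
truth, ys = [], []
level = 0.0
for _ in range(500):
    level += random.gauss(0.0, 0.02)
    y = level + random.gauss(0.0, 0.1)
    if random.random() < 0.1:
        y += random.uniform(-5.0, 5.0)
    truth.append(level)
    ys.append(y)

est_t = robust_kalman_1d(ys, q=4e-4, sigma2=0.01, nu=4.0)
est_g = robust_kalman_1d(ys, q=4e-4, sigma2=0.01, nu=1e9)  # ~ standard Kalman
mse_t = sum((e - l) ** 2 for e, l in zip(est_t, truth)) / len(truth)
mse_g = sum((e - l) ** 2 for e, l in zip(est_g, truth)) / len(truth)
assert mse_t < mse_g  # t-reweighting keeps the filter stable under outliers
```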
For the adaptation of the parameters in the t-distribution, we apply results from Fahrmeir
& Künstler (1999) to our nonlinear assumptions and use an online EM-type algorithm for
approximate maximum likelihood estimation of the hyperparameters ν_t and σ_t². We assume
the scale factors σ_t² and the degrees of freedom ν_t to be fixed quantities in a certain time
window of length h, i.e. σ_t² = σ², ν_t = ν for the time steps in that window. For deriving
online EM update equations we treat the weight sequence w_t together with the mixing
variables u_t as missing. By linear Taylor series expansion of g(·; w_s) about the Fisher
scoring solutions ŵ_{s|t} and by approximating posterior expectations E[w_s | D] with
posterior modes ŵ_{s|t} and posterior covariances cov[w_s | D] with curvatures
Σ_{s|t} = E[(w_s − ŵ_{s|t})(w_s − ŵ_{s|t})^T | D] in the E-step, a somewhat lengthy
derivation results in approximate maximum likelihood update rules for σ² and ν similar to
those given in Section 3. Details about the online EM-type algorithm can be found in
Briegel & Tresp (1999).
5 EXPERIMENTS 
1. Experiment: Real World Data Sets. In the first experiment we tested if the Student-
t-distribution is a useful error measure for real-world data sets. In training, the Student-
t-distribution was used and both the degrees of freedom ν and the width parameter σ²
were adapted using the EM update rules from Section 3. Each experiment was repeated
50 times with different divisions into training and test data. As a comparison we trained 
the neural networks to minimize the squared error cost function (including an optimized 
weight decay term). On the test data set we evaluated the performance using a squared 
error cost function. Table 1 provides some experimental parameters and gives the test 
set performance based on the 50 repetitions of the experiments. The additional explained 
variance is defined [in percent] as 100 × (1 − MSPE_T / MSPE_G), where MSPE_T is the
mean squared prediction error using the t-distribution and MSPE_G is the mean squared
prediction error using the Gaussian error measure. Furthermore we supply the standard
Table 1: Experimental parameters and test set performance on real world data sets. 
Data Set          # Inputs/# Hidden   Training   Test   Add. Exp. Var. [%]   Std. [%]
Boston Housing    13/6                400        106    4.2                  0.93
Sunspot           12/7                221        47     5.3                  0.67
Fraser River      12/7                600        334    5.4                  0.75
error based on the 50 experiments. In all three experiments the networks optimized with 
the t-distribution as noise model were 4-5% better than the networks optimized using the 
Gaussian as noise model and in all experiments the improvements were significant based on 
the paired t-test with a significance level of 1%. The results show clearly that the additional 
free parameter in the Student-t-distribution does not lead to overfitting but is used in a
sensible way by the system to down-weight the influence of extreme target values. Figure 2
shows the normal probability plots. Clearly visible is the deviation from the Gaussian
distribution for extreme target values. We would also like to remark that we did not apply any
preselection process in choosing the particular data sets, which indicates that non-Gaussian
noise seems to be the rule rather than the exception for real world data sets.
Figure 2: Normal probability plots of the three training data sets after learning with the
Gaussian error measure. The dashed lines show the expected normal probabilities. The
plots show clearly that the residuals follow a more heavy-tailed distribution than the normal
distribution.
2. Experiment: Outliers. In the second experiment we wanted to test how our approach 
deals with outliers which are artificially added to the data set. We started with the Boston 
housing data set and divided it into training and test data. We then randomly selected a 
subset of the training data set (between 0.5% and 25%) and added to the targets a uniformly 
generated real number in the interval [-5, 5]. Figure 1 (right) shows the mean squared error 
on the test set for different percentages of added outliers. The error bars are derived from 
20 repetitions of the experiment with different divisions into training and test set. It is 
apparent that the approach using the t-distribution is consistently better than the network 
which was trained based on a Gaussian noise assumption. 
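The corruption scheme of this experiment can be sketched as follows (our code; the helper name and the deterministic example targets are ours):

```python
import random

random.seed(3)

def add_outliers(targets, fraction, low=-5.0, high=5.0):
    # Perturb a random subset of the targets by uniform noise on [low, high].
    corrupted = list(targets)
    n_out = round(fraction * len(targets))
    for i in random.sample(range(len(targets)), n_out):
        corrupted[i] += random.uniform(low, high)
    return corrupted, n_out

ys = [0.1 * i for i in range(400)]           # stand-in regression targets
ys_bad, n_out = add_outliers(ys, fraction=0.05)
assert n_out == 20
assert sum(1 for a, b in zip(ys, ys_bad) if a != b) <= 20
```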
3. Experiment: Online Learning. In the third experiment we examined the use of the
t-distribution in online learning. Data were generated from a nonlinear map y_t = 0.6x_t² +
b sin(6x_t) − 1 where b = −0.75, −0.4, −0.1, 0.25 for the first, second, third and fourth
set of 150 data points, respectively. Gaussian noise with variance 0.2 was added and for
training, an MLP with 4 hidden units was used. In the first experiment we compare the
performance of the EKF algorithm with our single Fisher scoring step algorithm. Figure 3
(left) shows that our algorithm converges faster to the correct map and also handles the
transition in the model (parameter b) much better than the EKF. In the second experiment
with a probability of 10%, outliers uniformly drawn from the interval [−5, 5] were added to
the targets. Figure 3 (middle) shows that the single Fisher scoring step algorithm using the 
t-distribution is consistently better than the same algorithm using a Gaussian noise model 
and the EKF. The two plots on the right in Figure 3 compare the nonlinear maps learned 
after 150 and 600 time steps, respectively. 
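The data stream of this experiment can be reproduced along the following lines (our sketch; the paper does not state the input distribution, so the uniform inputs on [−1, 1] are an assumption):

```python
import math
import random

random.seed(4)

# b switches every 150 steps, giving four regimes over 600 points.
B_SCHEDULE = [-0.75, -0.4, -0.1, 0.25]

def generate_stream(outlier_prob=0.0):
    data = []
    for b in B_SCHEDULE:
        for _ in range(150):
            x = random.uniform(-1.0, 1.0)           # assumed input distribution
            y = 0.6 * x ** 2 + b * math.sin(6.0 * x) - 1.0
            y += random.gauss(0.0, math.sqrt(0.2))  # noise variance 0.2
            if random.random() < outlier_prob:
                y += random.uniform(-5.0, 5.0)      # additive outliers
            data.append((x, y))
    return data

clean = generate_stream()
noisy = generate_stream(outlier_prob=0.1)
assert len(clean) == 600 and len(noisy) == 600
```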
EK v-' GFS-tO 
10' 
EIO: . GFS.-10 rs. 'iFS-I 0 Mal,ing afte T-150 Mal,ing afte T-6OO 
100 200 3 4 5 IXX3 I 200 3 4 500 IXX3 -1 0 -1 0 
Te me x x 
Figure 3: Left & Middle: Online MSE over each of the 4 sets of training data. On the
left we compare extended Kalman filtering (EKF) (dashed) with the single Fisher scoring
step algorithm with r_t = 10 (GFS-10) (continuous) for additive Gaussian noise. The
second figure shows EKF (dashed-dotted), Fisher scoring with Gaussian error noise (GFS-
10) (dashed) and t-distributed error noise (TFS-10) (continuous), respectively, for data with
additive outliers. Right: True map (continuous), EKF learned map (dashed-dotted) and
TFS-10 map (dashed) after T = 150 and T = 600 (data sets with additive outliers).
6 CONCLUSIONS 
We have introduced the Student-t-distribution to replace the standard Gaussian noise as- 
sumption in nonlinear regression. Learning is based on an EM algorithm which estimates 
both the scaling parameters and the degrees of freedom of the t-distribution. Our results 
show that using the Student-t-distribution as noise model leads to 4-5% better test errors 
than using the Gaussian noise assumption on real world data sets. This result seems to
indicate that non-Gaussian noise is the rule rather than the exception and that extreme target
values should in general be weighted down. Dealing with outliers is particularly important 
for online tasks in which outliers can lead to instability in the adaptation process. We in- 
troduced a new online learning algorithm using the t-distribution which leads to better and 
more stable results if compared to the extended Kalman filter. 
References 
Briegel, T. and Tresp, V. (1999) Dynamic Neural Regression Models, Discussion Paper, Seminar für
Statistik, Ludwig-Maximilians-Universität München.
de Freitas, N., Doucet, A. and Niranjan, M. (1998) Sequential Inference and Learning, NIPS*98 
Workshop, Breckenridge, CO. 
Fahrmeir, L. and Kaufmann, H. (1991) On Kalman Filtering, Posterior Mode Estimation and Fisher
Scoring in Dynamic Exponential Family Regression, Metrika 38, pp. 37-60.
Fahrmeir, L. and Künstler, R. (1999) Penalized likelihood smoothing in robust state space models,
Metrika 49, pp. 173-191. 
Huber, P.J. (1964) Robust Estimation of a Location Parameter, Annals of Mathematical Statistics 35,
pp. 73-101. 
Lange, K., Little, R. and Taylor, J. (1989) Robust Statistical Modeling Using the t-Distribution, JASA
84, pp. 881-896. 
Meinhold, R. and Singpurwalla, N. (1989) Robustification of Kalman Filter Models, JASA 84, pp. 
470-496. 
Rousseeuw, P. and Leroy, A. (1987) Robust Regression and Outlier Detection, John Wiley & Sons. 
West, M. (1981) Robust Sequential Approximate Bayesian Estimation, JRSS B 43, pp. 157-166. 
