Ensemble and Modular Approaches for 
Face Detection: a Comparison 
Raphael Feraud *and Olivier Bernier t 
France-Tdldcom CNET DTL/DLI 
Technopole Anticipa, 2 avenue Pierre Marzin, 22307 Lannion cedex, FRANCE 
Abstract 
A new learning model based on autoassociative neural networks 
is developped and applied to face detection. To extend the de- 
tection ability in orientation and to decrease the number of false 
alarms, different combinations of networks are tested: ensemble, 
conditional ensemble and conditional mixture of networks. The 
use of a conditional mixture of networks allows to obtain state of 
the art results on different benchmark face databases. 
1 A constrained generative model 
Our purpose is to classify an extracted window x from an image as a face (x 6 
/) or non-face (x 6 Af). The set of all possible windows is  = /t2 Af, with 
/fq A/' = 0. Since collecting a representative set of non-face examples is impossible, 
face detection by a statistical model is a difficult task. An autoassociative network, 
using five layers of neurons, is able to perform a non-linear dimensionnality reduction 
[Kramer, 1991]. However, its use as an estimator, to classify an extracted window 
as face or non-face, raises two problems: 
1. l/', the obtained sub-manifold can contain non-face examples (l/C 
2. owing to local minima, the obtained solution can be close to the linear 
solution: the principal components analysis. 
Our approach is to use counter-examples in order to find a sub-manifold as close as 
possible to / and to constrain the algorithm to converge to a non-linear solution 
[Feraud, R. et al., 1997]. Each non-face example is constrained to be reconstructed 
as its projection on /. The projection P of a point x of the input space  on /, is 
defined by: 
*email: feraud@lannion.cnet.fr 
temail: bernier@lannion.cnet.fr 
Ensemble and Modular Approaches for Face Detection: A Comparison 473 
 if x  1;, then P(x) = x, 
 if x  1;: P(x) = argminyev(d(x,y)), where d is the Euclidian distance. 
During the learning process, the projection v of x on 1; is approximated 
! Y']i=x vi where vx v2 . v are the n nearest neighbouts, 
by: P(x), , , ,.. , , 
in the training set of faces, of v, the nearest face example of z. 
The goal of the learning process is to approximate the distance/9 of an input space 
element x to the set of faces 
 /9(x, 1;) = Ilx - v(x)l I ,.. (x - &)2, where M is the size of input image x 
and & the image reconstructed by the neural network, 
 let x  , then x  1; if and only if/9(x, 1;) _< r, with r  IR, where r is a 
threshold used to adjust the sensitivity of the model. 
15 x 20 outputs 
OOOO.....000000 
000 ' 000 50neurons 
O0 ***** O0 35 neons 
OOOO.....000000 
15 x 20 inputs 
Figure 1' The use of two hidden layers and counter-examples in a compression 
neural network allows to realize a non-linear dimensionality reduction. 
In the case of non-linear dimensionnality reduction, the reconstruction error is re- 
lated to the position of a point to the non-linear principal components in the input 
space. Nevertheless, a point can be near to a principal component and far from the 
set of faces. With the algorithm proposed, the reconstruction error is related to the 
distance between a point to the set of faces. As a consequence, if we assume that 
the learning process is consistent [Vapnik, 1995], our algorithm is able to evaluate 
the probability that a point belongs to the set of faces. Let y be a binary random 
variable: y -- 1 corresponds to a face example and y = 0 to a non-face example, we 
use: 
(-? 
P(y-- llx ) = e- = , where rr depends on the threshold r 
The size of the training windows is 15x20 pixels. The faces are normalized in 
position and scale. The windows are enhanced by histogram equalization to obtain 
a relative independence to lighting conditions, smoothed to remove the noise and 
normalized by the average face, evaluated on the training set. Three face databases 
are used: after vertical mirroring, Bj, is composed of 3600 different faces with 
orientation between 0 degree and 20 degree, Bj, 2 is composed of 1600 different faces 
with orientation between 20 degree and 60 degree and Bj, a is the concatenation of 
Bj, and Bj, 2, giving a total of 5200 faces. All of the training faces are extracted 
474 R. Feraud and O. Bernier 
from the usenix face database(**), from the test set B of CMU(**), and from 100 
images containing faces and complex backgrounds. 
Figure 2: Left to right: the counter-examples successively chosen by the algorithm 
are increasingly similar to real faces (iteration I to 8). 
The non-face databases (Bfx,Bf2,Bf3), corresponding to each face database, 
are collected by an iterative algorithm similar to the one used in 
[Sung, K. and Poggio, T., 1994] or in [Rowley, H. et al., 1995]: 
 2) 
3) 
 4) 
 5) 
 6) 
Bn ! -- 0, r -- Train, 
the neural network is trained with B! + 
the face detection system is tested on a set of background images, 
a maximum of 100 subimages xi are collected with 19 (xi, 
B,! = B,! + {x0,...,x,}, r = r +it, with p > 0, 
while r < rraa go back to step 2. 
After vertical mirroring, the size of the obtained set of non-face examples is re- 
spectively 1500 for B,j,x, 600 for B,j, 2 and 2600 for Bj'a. Since the non-face set 
(Af) is too large, it is not possible to prove that this algorithm converge in a finite 
time. Nevertheless, in only 8 iterations, collected counter-examples are close to the 
set of faces (Figure 2). Using this algorithm, three elementary face detectors are 
constructed: the front view face detector trained on Bj, and B,j, (CGM1), the 
turned face detector trained on Bj, 2 and B,j, 2 (CGM2) and the general face detector 
trained on Bj, a and B,j'a (CGM3). 
To obtain a non-linear dimensionnality reduction, five layers are necessary. However, 
our experiments show that four layers are sufficient. Consequently, each CGM 
has four layers (Figure 1). The first and last layers consist each of 300 neurons, 
corresponding to the image size 15x20. The first hidden laver has 35 neurons and 
the second hidden layer 50 neurons. In order to reduce the false alarm rate and to 
extend the face detection ability in orientation, different combinations of networks 
are tested. The use of ensemble of networks to reduce false alarm rate was shown 
by [Rowley, H. et al., 1995]. However, considering that to detect a face in an image, 
there are two subproblems to solve, detection'of front view faces and turned faces, 
a modular architecture can also be used. 
2 Ensemble of CGMs 
Generalization error of an estimator can be decomposed in two terms: the bias and 
the variance [Geman, S. et al., 1992]. The bias is reduced with prior knowledge. 
The use of an ensemble of estimators can reduce the variance when these estimators 
are independently and identically distributed [Raviv, Y. and Intrator, N., 1996]. 
Each face detector i produces: 
Ei[ylx]- Pi(y: 11x) 
Assuming that the three face detectors (CGM1,CGM2,CGM3) are independently 
and identically distributed (iid), the ouput of the ensemble is: 
Enxemble and Modular Approaches for Face Detection: A Comparison 475 
3 
i--1 
3 Conditional mixture of CGMs 
To extend the detection ability in orientation, a conditional mixture of CGMs is 
tested. The training set is separated in two subsets: front view faces and the 
corresponding counter-examples (0 = 1) and turned faces and the corresponding 
counter-examples (0 -- 0). The first subnetwork (CGM1) evaluates the probability 
of the tested image to be a front view face, knowing the label equals I (P(y -- 
llx, 0 = 1)). The second (CGM2) evaluates the probability of the tested image to 
be a turned face, knowing the label equals 0 (P(y = 11x, O- 0)). A gating network 
is trained to evaluate P(O = 1Ix), supposing that the partition 0 = 1, 0 - 0 can be 
generalized to every input: 
E[ylx ]: E[ylO - 1,x]f(x) + E[ylO - 0, x](1- f(x)) 
Where fix) is the estimated value of P(O = l[x). 
This system is different from a mixture of experts introduced by 
[Jacobs, R. A. et al., 1991]: each module is trained separately on a subset of the 
training set and then the gating network learns to combine the outputs. 
4 Conditional ensemble of CGMs 
To reduce the false alarm rate and to detect front view and turned faces, an original 
combination, using (CGM1,CGM2) and a gate network, is proposed. Four sets are 
defined: 
 . is the front view face set, 
 7 ) is the turned face set, with Y'  7 ) - 0, 
 V = Y' O 7 ) is the face set, 
 A/' is the non-face set, with V  Af = 0, 
Our goal is to evaluate P(a:  VIx ). Each estimator computes respectively' 
 P(x  Fix  't.J A/',x) (CGMI(x)), 
 P(x  Plx  PO*,x) 
Using the Bayes theorem, we obtain: 
Since x 6 Y'  x 6 jr 0 iV', then: 
P(x e ;:Ix, x e 9: v ) = P(x e 9: v ;Ix) 
476 R. Feraud and O. Bernier 
q= P(  .TI ) = P(  .TI  .T 0 iV', )P(  .T 0 A/'[) 
. e e ;:la:) = e e ;:la: e :v ;v, a:)[e e :la:) + e e ;via:)] 
In the same way, we have: 
PC e Pla:) = PC e Pla: e P v v, a:)[e e Pla:) + PC e ;via:)] 
Then: 
Rewriting the previous equation using the following notation, CGMI(a:) for 
P(a:  Y'la:  .T t_J A/', a:) and CGMI(a:) for P(a:  vla:  v U N', a:), we have: 
(1) 
(2) 
Then, we can deduce the behaviour of the conditional ensemble: 
 in iV', if the output of the gate network is 0.5, as in the case of ensembles 
, 
the conditional ensemble reduces the variance of the error (first term of the 
right side of the equation (1)), 
 in V, as in the case of the conditional mixture, the conditional ensemble 
permits to combine two different tasks (second term of the right side of the 
equation (2)): detection of turned faces and detection of front view faces. 
The gate network f(a:) is trained to calculate the probability that the tested image 
is a face (P(a:  Via:)), using the following cost function: 
C = y. ([f(a:i)MGCl(a:) + (1 - f(a:i))]MGC2(a:) - yi) 2 + y. (f(a:i) - 0.5) 
5 Discussion 
Each 15x20 subimage is extracted and normalized by enhancing, smoothing and 
substracting the average face, before being processed by the network. The detection 
threshold r is fixed for all the tested images. To detect a face at different scales, 
the image is subsampled. 
The first test allows to evaluate the limits in orientation of the face detectors. The 
sussex face database(**), containing different faces with ten orientations betwen 0 
degree and 90 degrees, is used (Table 1). The general face detector (CGM3) uses 
the same learning face database than the different mixtures of CGMs. Nevertheless, 
CGM3 has a smaller orientation range than the conditional mixtures of CGMs, 
and the conditional ensemble of CGMs. Since the performances of the ensemble 
of CGMS are low, the corresponding hypothesis (the CGMs are iid) is invalid. 
Moreover, this test shows that the combination by a gating neural network of CGMs, 
Ensemble and ModularApproaches for Face Detection.' A Comparison 477 
Table l:Results on Sussex face database 
orientation CGM1 CGM2 CGM3 Ensemble 
(degree) (1,2,3) 
0 100.0 % 100.0 % 100.0 % 100.0 % 
10 62.5 % 100.0 % 87.5 % 100.0 % 
20 50.0 % 100.0 % 87.5 % 87.5 % 
30 12.5 % 100.0 % 62.5 % 62.5 % 
40 0.0 % 100.0 % 50.0 % 12.5 % 
50 0.0 % 75.0 % 0.0 % 0.0 % 
Conditional Conditional 
ensemble mixture 
(1,2,gate) (1,2,gate) 
100.0 % 100.0 % 
100.0 % 100.0 % 
100.0 % 100.0 % 
100.0 % 100.0 % 
62.5.0 % 87.5 % 
37.5 % 62.5 % 
60 0.0 % 37.5 % 0.0 % 0.0 % 0.0 % 37.5 % 
70 0.0 % 37.5 % 0.0 % 0.0 % 0.0 % 25.0 % 
trained on different training set, allows to extend the detection ability to both front 
view and turned faces. The conditional mixture of CGMs obtains results in term 
of orientation and false alarm rate close to the best CGMs used to contruct it (see 
Table I and Table 2). 
The second test allows to evaluate the false alarms rate and to compare our results 
with the best results published so far on the test set A [Rowley, H. et al., 1995] of 
the CMU (**), containing 42 images of various quality. First, these results show that 
the model, trained without counter-examples (GM), overestimates the distribution 
of faces and its false alarm rate is too important to use it as a face detector. Second, 
the estimation of the probability distribution of the face performed by one CGM 
(CGM3) is more precise than the one obtained by [Rowley, H. et al., 1995] with 
one SWN (see Table 2). The conditional ensemble of CGMs and the conditional 
mixture of CGMs obtained a similar detection rate than an ensemble of SWNs 
[Rowley, H. et al., 1995], but with a false alarm rate two or three times lower. Since 
the results of the conditional ensemble of CGMs and the conditional mixture of 
CGMs are close on this test, the detection rate versus the number of false alarms 
is plotted (Figure 3), for different thresholds. The conditional mixture of CGMs 
curve is above the one for the conditional ensemble of CGMs. 
85 
eO 30  GO  70 80 
Number of fmm 
Figure 3: Detection rate versus number of false alarms on the CMU test set A. In 
dashed line conditional ensemble and in solid line conditional mixture. 
The conditionnal mixture of CGMs is used in an application called LISTEN 
[Collobert, M. et al., 1996]: a camera detects, tracks a face and controls a micro- 
phone array towards a speaker, in real time. The small size of the subimages (15x20) 
processed allows to detect a face far from the camera (with an aperture of 60 de- 
grees, the maximum distance to the camera is 6 meters). To detect a face in real 
time, the number of tested hypothesis is reduced by motion and color analysis. 
478 R. Feraud and O. Bernier 
Table 2:Results on the CMU test set A 
GM: the model trained without counter-examples, CGMi: face detector, CGM2: 
turned face detector, CGM3: general face detector. SWN: shared weight network. 
(*) Considering that our goal is to detect human faces, non-human faces and rough 
face drawings have not been taken into account. 
Model 
GM 
CGM1 
CGM2 
CGM3 
Detection rate 
[Rowley, 1995] 
(one SWN) 
Ensemble 
(CGM1,CGM2,CGM3) 
Conditional ensemble 
(CGM1,CGM2,gate) 
[Rowley, 1995] 
(three $WNs) 
Conditional mixture 
(CGM1,CGM2,gate) 
84 % 
77% 
127/164 
85 % 
139/164 
85 % 
139/164 
84 % 
142/169' 
74% 
121/164 
82 % 
134/164 
85 % 
144/169' 
87% 
142/164 
+/-5% 
+/-5% 
+/-5% 
False alarms rate 
1/1,000 +/- 0.1/1,000,00( 
5.43/1,000,000 
47/33,700,000 
6.3/1,000,000 
212/33,700,000 
+/- 0.38/1,000,00 
+/- 0.37/1,000,00 
+/- 5 % 1.36/1,000,000 +/- 0.41/1,0o0,0o 
46/33,700,000 
+/- 5 % 8.13/1,000,000 +/- o.4/1,ooo,oo( 
179/22,000,000 
+/- 5 % 0.71/1,000,000 +/- 0.43/1,000,00, 
24/33,700,000 
+/- 5 % 0.77/1,000,000 +/- o.38/1,ooo,oo, 
26/33,700,000 
+/- 5 % 2.13/1,000,000 +/- o.42/1,ooo,oo, 
47/22,000,000 
+/- 5 % 1.15/1,000,000 +/- o.35/1,ooo,oo 
39/33,700,000 
(**) usetrix face database, sussex face database and CMU test sets can be retrieved at 
www.cs.rug.nl/peterkr/FACE/face.html. 
References 
[Collobert, M. et al., 1996] Collobert, M., Feraud, R., Le Tourneur, G., Bernier, O., Vial- 
let, J.E, Mahieux, Y., and Collobert, D. (1996). Listen: a system for locating and 
tracking individual speaker. In Second International Conference On Automatic Face 
and Gesture Recognition. 
[Feraud, R. et al., 1997] Feraud, R., Bernier, O., and Collobert, D. (1997). A constrained 
generafive model applied to face detection. Neural Processing Letters. 
[Geman, S. et al., 1992] Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural 
networks and the bias-variance dilemma. Neural Computation, 4:1-58. 
[Jacobs, R. A. et ed., 1991] Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. 
(1991). Adaptative mixtures of local experts. Neural Computation, 3:79-87. 
[Kramer, 1991] Kramer, M. (1991). Nonlinear principal component analysis using autoas- 
sociative neural networks. AIChE Journal, 37:233-243. 
[Raviv, Y. and Intrator, N., 1996] Raviv, Y. and Intrator, N. (1996). Bootstrapping with 
noise: An effective regularization technique. Connection Science, 8:355-372. 
[Rowley, H. et al., 1995] Rowley, H., Baluja, S., and Kanade, T. (1995). Human face 
detection in visual scenes. In Neural Information Processing Systems 8. 
[Sung, K. and Poggio, T., 1994] Sung, K. and Poggio, T. (1994). Example-based learning 
for view-based human face detection. Technical report, M.I.T. 
[Vapnik, 1995] Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer- 
Verlag New York Heidelberg Berlin. 
