Exponentially many local minima for single neurons

Peter Auer
Mark Herbster
Manfred K. Warmuth

Department of Computer Science
University of California, Santa Cruz
Santa Cruz, California
{pauer, mark, manfred}@cs.ucsc.edu
Abstract 
We show that for a single neuron with the logistic function as the transfer
function, the number of local minima of the error function based on the
square loss can grow exponentially in the dimension.
1 INTRODUCTION 
Consider a single artificial neuron with d inputs. The neuron has d weights w ∈ R^d. The
output of the neuron for an input pattern x ∈ R^d is ŷ = φ(x · w), where φ: R → R
is a transfer function. For a given sequence of training examples S = ((x_t, y_t))_{1≤t≤m}, each
consisting of a pattern x_t ∈ R^d and a desired output y_t ∈ R, the goal of the training phase
for neural networks consists of minimizing the error function with respect to the weight
vector w ∈ R^d. This function is the sum of the losses between the outputs of the neuron and
the desired outputs, summed over all training examples. In notation, the error function is

$$E_S(w) = \sum_{t=1}^{m} L(y_t, \phi(x_t \cdot w)),$$

where L: R × R → [0, ∞) is the loss function.
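As a concrete illustration, the error function for the logistic transfer function and the square loss can be written in a few lines of NumPy. This sketch is our own and is for illustration only; the function names are not from the paper.

```python
import numpy as np

def logistic(z):
    """Logistic transfer function; bounded range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def error(w, X, y):
    """Error function E_S(w): square losses summed over all m examples.

    X is an (m, d) array of patterns x_t, y an (m,) array of desired
    outputs y_t, and w a (d,) weight vector.
    """
    outputs = logistic(X @ w)          # neuron outputs phi(x_t . w)
    return float(np.sum((y - outputs) ** 2))
```

For example, with w = 0 the neuron outputs 1/2 on every pattern, so the error vanishes exactly when every desired output is 1/2.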
A common example of a transfer function is the logistic function logistic(z) = 1/(1 + e^{−z}), which
has the bounded range (0, 1). In contrast, the identity function id(z) = z has unbounded
range. One of the most common loss functions is the square loss L(y, ŷ) = (y − ŷ)². Other
examples are the absolute loss |y − ŷ| and the entropic loss y ln(y/ŷ) + (1 − y) ln((1 − y)/(1 − ŷ)).
We show that for the square loss and the logistic function the error function of a single
neuron for n training examples may have ⌊n/d⌋^d local minima. More generally, this holds
for any loss and transfer function for which the composition of the loss function with the
transfer function (in notation, L(y, φ(x · w))) is continuous and has bounded range. This
Figure 1: Error Function with 25 Local Minima (16 Visible), Generated by 10 Two-Dimensional Examples. (Surface plot of the error over the axes log w1 and log w2.)
proves that for any transfer function with bounded range exponentially many local minima 
can occur when the loss function is the square loss. 
The sequences of examples that we use in our proofs have the property that they are
non-realizable in the sense that there is no weight vector w ∈ R^d for which the error function
is zero, i.e. the neuron cannot produce the desired output for all examples. We show with
some minimal assumptions on the loss and transfer functions that for a single neuron there 
can be no local minima besides the global minimum if the examples are realizable. 
If the transfer function is the logistic function then it has often been suggested in the 
literature to use the entropic loss in artificial neural networks in place of the square loss 
[BW88, WD88, SLF88, Wat92]. In that case the error function of a single neuron is 
convex and thus has only one minimum even in the non-realizable case. We generalize this 
observation by defining a matching loss for any differentiable increasing transfer
function φ:
$$L_\phi(y, \hat{y}) = \int_{\phi^{-1}(y)}^{\phi^{-1}(\hat{y})} \left(\phi(z) - y\right) dz.$$
The loss is the area depicted in Figure 2a. If φ is the identity function then L_φ is the square
loss; likewise, if φ is the logistic function then L_φ is the entropic loss. For the matching loss
the gradient descent update for minimizing the error function for a sequence of examples
is simply

$$w_{\text{new}} := w_{\text{old}} - \eta \sum_{t=1}^{m} \left(\phi(x_t \cdot w_{\text{old}}) - y_t\right) x_t,$$

where η is a positive learning rate. The second derivatives are also easy to calculate in
this general setting:

$$\frac{\partial^2 L_\phi(y_t, \phi(x_t \cdot w))}{\partial w_i\, \partial w_j} = \phi'(x_t \cdot w)\, x_{t,i}\, x_{t,j}.$$

Thus, if H_t(w) is the Hessian of L_φ(y_t, φ(x_t · w)) with respect to w, then
v^T H_t(w) v = φ'(x_t · w)(v · x_t)². Thus
Figure 2: (a) The Matching Loss Function L_φ.
(b) The Square Loss becomes Saturated, the Entropic Loss does not.
H_t is positive semi-definite for any increasing differentiable transfer function. Clearly
Σ_{t=1}^m H_t(w) is the Hessian of the error function E_S(w) for a sequence of m examples, and
it is also positive semi-definite. It follows that for any differentiable increasing transfer
function the error function with respect to the matching loss is always convex.
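Since the error function under the matching loss is convex, plain batch gradient descent with the update above finds the global minimum from any starting point. A minimal sketch, in our own code with illustrative data and step size, for the logistic transfer function (whose matching loss is the entropic loss):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_matching_loss(X, y, eta=0.5, epochs=5000):
    """Batch gradient descent on the matching-loss error function.

    For any increasing differentiable transfer function phi the gradient
    of the matching-loss error is sum_t (phi(x_t . w) - y_t) x_t, so the
    update below implements the rule from the text exactly."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        w -= eta * (logistic(X @ w) - y) @ X
    return w
```

Because the error function is convex, the choice of starting point is immaterial; with the square loss paired with the logistic function this is exactly the property that fails.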
We show that in the case of one neuron the logistic function paired with the square loss 
can lead to exponentially many minima. It is open whether the number of local minima 
grows exponentially for some natural data. However, there is another problem with the
pairing of the logistic and the square loss that makes it hard to optimize the error function 
with gradient based methods. This is the problem of flat regions. Consider one example 
(x, y) consisting of a pattern x (such that x is not equal to the all-zero vector) and the
desired output y. Then the square loss (logistic(x · w) − y)², for y ∈ [0, 1] and w ∈ R^d,
turns flat as a function of w when ŷ = logistic(x · w) approaches zero or one (for example,
see Figure 2b where d = 1 and y = 0). It is easy to see that for all bounded transfer
functions with a finite number of extrema and corresponding bounded loss functions, the
same phenomenon occurs. In other words, the composition L(y, φ(x · w)) of the square
loss with any bounded transfer function φ which has a finite number of extrema turns flat as
|x · w| becomes large. Similarly, for multiple examples the error function E(w) as defined
above becomes flat. In flat regions the gradients with respect to the weight vector w are 
small, and thus gradient-based updates of the weight vector may have a hard time moving 
the weight vector out of these flat regions. This phenomenon can easily be observed in 
practice and is sometimes called "saturation" [Hay94]. In contrast, if the logistic function 
is paired with the entropic loss (see Figure 2b), then the error function turns flat only at the 
global minimum. The same holds for any increasing differentiable transfer function and its 
matching loss function. 
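The saturation effect is easy to quantify. The sketch below, our own illustration, compares the two gradients at a strongly saturated weight: the square-loss gradient contains the factor ŷ(1 − ŷ) and collapses, while the entropic-loss gradient is simply (ŷ − y)x and stays large whenever the output is wrong.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def square_loss_grad(w, x=1.0, y=0.0):
    """d/dw of (logistic(x*w) - y)^2; the factor o*(1-o) vanishes
    as the output o saturates at 0 or 1."""
    o = logistic(x * w)
    return 2.0 * (o - y) * o * (1.0 - o) * x

def entropic_loss_grad(w, x=1.0, y=0.0):
    """d/dw of the entropic loss composed with the logistic function."""
    return (logistic(x * w) - y) * x
```

At w = 20 with desired output y = 0 the neuron is maximally wrong, yet the square-loss gradient is of order 10^-9 while the entropic-loss gradient is close to 1.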
A number of previous papers discussed conditions necessary and sufficient for multiple 
local minima of the error function of single neurons or otherwise small networks [WD88, 
SS89, BRS89, Blu89, SS91, GT92]. This previous work only discusses the occurrence of 
multiple local minima whereas in this paper we show that the number of such minima can 
grow exponentially with the dimension. Also the previous work has mainly been limited 
to the demonstration of local minima in networks or neurons that have used the hyperbolic 
tangent or logistic function with the square loss. Here we show that exponentially many 
minima occur whenever the composition of the loss function with the transfer function is 
continuous and bounded. 
The paper is outlined as follows. After some preliminaries in the next section, we give formal 
Figure 3: (a) Error Function for the Logistic Transfer Function and the
Square Loss with Examples ((10, .55), (.7, .25)).
(b) Sets of Minima can be Combined.
statements and proofs of the results mentioned above in Section 3. At first (Section 3.1) we 
show that n one-dimensional examples might result in n local minima of the error function 
(see e.g. Figure 3a for the error function of two one-dimensional examples). From the local 
minima in one dimension it follows easily that n d-dimensional examples might result in 
⌊n/d⌋^d local minima of the error function (see Figure 1 and the discussion in Section 3.2).
We then consider neurons with a bias (Section 4), i.e. we add an additional input that is
clamped to one. The error function for a sequence of examples S = ((x_t, y_t))_{1≤t≤m} is
now

$$E_S(B, w) = \sum_{t=1}^{m} L(y_t, \phi(B + w \cdot x_t)),$$
where B denotes the bias, i.e. the weight of the input that is clamped to one. We can prove 
that the error function might have ⌊n/(2d)⌋^d local minima if loss and transfer function are
symmetric. This holds for example for the square loss and the logistic transfer function. 
The proofs are omitted due to space constraints. They are given in the full paper [AHW96], 
together with additional results for general loss and transfer functions. 
Finally we show in Section 5 that, with minimal assumptions on transfer and loss functions,
there is only one minimum of the error function if the sequence of examples is realizable
by the neuron.
The essence of the proofs is quite simple. At first observe that if loss and transfer function are 
bounded and the domain is unbounded, then there exist areas of saturation where the error 
function is essentially flat. Furthermore the error function is "additive", i.e. the error function
produced by the examples in S ∪ S' is simply the error function produced by the examples in
S added to the error function produced by the examples in S': E_{S∪S'} = E_S + E_{S'}. Hence
the local minima of E_S remain local minima of E_{S∪S'} if they fall into an area of saturation
of E_{S'}. Similarly, the local minima of E_{S'} remain local minima of E_{S∪S'} as well (see
Figure 3b). In this way sets of local minima can be combined.
2 PRELIMINARIES
We introduce the notion of minimum-containing set which will prove useful for counting 
the minima of the error function. 
Definition 2.1 Let f: R^d → R be a continuous function. Then an open and bounded set
U ⊆ R^d is called a minimum-containing set for f if for each w on the boundary of U there
is a w* ∈ U such that f(w*) < f(w).
Obviously any minimum-containing set contains a local minimum of the respective function.
Furthermore each of n disjoint minimum-containing sets contains a distinct local minimum. 
Thus it is sufficient to find n disjoint minimum-containing sets in order to show that a 
function has at least n local minima. 
3 MINIMA FOR NEURONS WITHOUT BIAS 
We will consider transfer functions φ and loss functions L which have the following
property:

(P1): The transfer function φ: R → R is non-constant. The loss function L: φ(R) ×
φ(R) → [0, ∞) has the property that L(y, y) = 0 and L(y, ŷ) > 0 for all y ≠
ŷ ∈ φ(R). Finally, the function L(·, φ(·)): φ(R) × R → [0, ∞) is continuous and
bounded.
3.1 ONE MINIMUM PER EXAMPLE IN ONE DIMENSION 
Theorem 3.1 Let φ and L satisfy (P1). Then for all n ≥ 1 there is a sequence of n
examples S = ((z_1, y), ..., (z_n, y)), z_t ∈ R, y ∈ φ(R), such that E_S(w) has n distinct
local minima.

Since L(y, φ(w)) is continuous and non-constant there are w⁻, w*, w⁺ ∈ R such that the
values φ(w⁻), φ(w*), φ(w⁺) are all distinct. Furthermore we can assume without loss
of generality that 0 < w⁻ < w* < w⁺. Now set y = φ(w*). If the error function
L(y, φ(w)) has infinitely many local minima then Theorem 3.1 follows immediately, e.g.
by setting z_1 = ... = z_n = 1. If L(y, φ(w)) has only finitely many minima then
lim_{w→∞} L(y, φ(w)) = L(y, φ(∞)) exists since L(y, φ(w)) is bounded and continuous.
We use this fact in the following lemma. It states that we get a new minimum-containing
set by adding an example in the area of saturation of the error function.
Lemma 3.2 Assume that lim_{w→∞} L(y, φ(w)) exists. Let S = ((z_1, y_1), ..., (z_n, y_n))
be a sequence of examples and 0 < w_1⁻ < w_1* < w_1⁺ < ... < w_n⁻ < w_n* < w_n⁺
such that E_S(w_t⁻) > E_S(w_t*) and E_S(w_t*) < E_S(w_t⁺) for t = 1, ..., n. Let S' =
((z_0, y), (z_1, y_1), ..., (z_n, y_n)) where z_0 is sufficiently large. Furthermore let w_0* = w*/z_0
and w_0^± = w^±/z_0 (where w⁻, w*, w⁺, y = φ(w*) are as above). Then 0 < w_0⁻ < w_0* <
w_0⁺ < w_1⁻ < w_1* < w_1⁺ < ... < w_n⁻ < w_n* < w_n⁺ and

E_{S'}(w_t⁻) > E_{S'}(w_t*) and E_{S'}(w_t*) < E_{S'}(w_t⁺), for t = 0, ..., n.    (1)
Proof. We have to show that for all z_0 sufficiently large condition (1) is satisfied, i.e. that

lim_{z_0→∞} E_{S'}(w_t*) < lim_{z_0→∞} E_{S'}(w_t^±), for t = 0, ..., n.    (2)

We get

lim_{z_0→∞} E_{S'}(w_0*) = L(y, φ(w*)) + lim_{z_0→∞} E_S(w*/z_0) = L(y, φ(w*)) + E_S(0),

recalling that w_0* = w*/z_0 and S' = S ∪ (z_0, y). Analogously

lim_{z_0→∞} E_{S'}(w_0^±) = L(y, φ(w^±)) + E_S(0).

Thus equation (2) holds for t = 0, since L(y, φ(w*)) = 0 while L(y, φ(w^±)) > 0 by (P1).
For t = 1, ..., n we get

lim_{z_0→∞} E_{S'}(w_t*) = lim_{z_0→∞} L(y, φ(w_t* z_0)) + E_S(w_t*) = L(y, φ(∞)) + E_S(w_t*)

and

lim_{z_0→∞} E_{S'}(w_t^±) = lim_{z_0→∞} L(y, φ(w_t^± z_0)) + E_S(w_t^±) = L(y, φ(∞)) + E_S(w_t^±).

Since E_S(w_t*) < E_S(w_t^±) for t = 1, ..., n, the lemma follows. []
Proof of Theorem 3.1. The theorem follows by induction from Lemma 3.2, since each
interval (w_t⁻, w_t⁺) is a minimum-containing set for the error function. []
Remark. Though the proof requires the magnitudes of the examples to be arbitrarily large,
in practice local minima show up even for moderately sized w (see Figure 3a).
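The two-example error function of Figure 3a can be reproduced numerically. The grid scan below is our own sketch (the grid bounds are illustrative); it counts strict local minima of E_S for the examples ((10, .55), (.7, .25)) with the logistic transfer function and the square loss.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def count_local_minima(examples, ws):
    """Count strict local minima of the one-dimensional error function
    E_S(w) on the weight grid ws."""
    E = sum((logistic(z * ws) - y) ** 2 for z, y in examples)
    is_min = (E[1:-1] < E[:-2]) & (E[1:-1] < E[2:])
    return int(np.sum(is_min))
```

A single example yields one minimum; the pair from Figure 3a yields two, one near w ≈ 0.01 where the first example is nearly matched, and one near w ≈ −1.57 where the second is matched while the first is saturated.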
3.2 CURSE OF DIMENSIONALITY: THE NUMBER OF MINIMA MIGHT 
GROW EXPONENTIALLY WITH THE DIMENSION 
We show how the 1-dimensional minima of Theorem 3.1 can be combined to obtain
d-dimensional minima.
Lemma 3.3 Let f: R → R be a continuous function with n disjoint minimum-containing
sets U_1, ..., U_n. Then the sets U_{t_1} × ... × U_{t_d}, t_j ∈ {1, ..., n}, are n^d disjoint minimum-containing
sets for the function g: R^d → R, g(x_1, ..., x_d) = f(x_1) + ... + f(x_d).
Proof. Omitted.
Theorem 3.4 Let φ and L satisfy (P1). Then for all n ≥ 1 there is a sequence of examples
S = ((x_1, y), ..., (x_n, y)), x_t ∈ R^d, y ∈ φ(R), such that E_S(w) has ⌊n/d⌋^d distinct local
minima.

Proof. By Lemma 3.2 there exists a sequence of one-dimensional examples S' =
((z_1, y), ..., (z_{⌊n/d⌋}, y)) such that E_{S'}(w) has ⌊n/d⌋ disjoint minimum-containing sets.
Thus by Lemma 3.3 the error function E_S(w) has ⌊n/d⌋^d disjoint minimum-containing
sets, where S = (((z_1, 0, ..., 0), y), ..., ((z_{⌊n/d⌋}, 0, ..., 0), y), ..., ((0, ..., 0, z_1), y), ...,
((0, ..., 0, z_{⌊n/d⌋}), y)). []
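The product construction of Lemma 3.3 is easy to observe numerically as well. In the sketch below (our own code, not from the paper) the two one-dimensional minima from the example of Figure 3a are embedded on each of the two coordinate axes; a grid scan then finds 2² = 4 strict local minima of the resulting two-dimensional error function.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def error_1d(examples, ws):
    """One-dimensional error f(w) for examples (z, y)."""
    return sum((logistic(z * ws) - y) ** 2 for z, y in examples)

def count_minima_2d(examples, ws):
    """Embed each example (z, y) once as ((z, 0), y) and once as
    ((0, z), y).  The error function then separates as
    E(w1, w2) = f(w1) + f(w2), and we count strict grid minima."""
    f = error_1d(examples, ws)
    E = f[:, None] + f[None, :]        # separable 2-D error surface
    c = E[1:-1, 1:-1]
    is_min = ((c < E[:-2, 1:-1]) & (c < E[2:, 1:-1])
              & (c < E[1:-1, :-2]) & (c < E[1:-1, 2:]))
    return int(np.sum(is_min))
```

The count squares because the surface is a sum of one copy of f per coordinate, exactly as in the lemma.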
4 MINIMA FOR NEURONS WITH A BIAS 
Theorem 4.1 Let the transfer function φ and the loss function L satisfy φ(B_0 + z) − φ_0 =
φ_0 − φ(B_0 − z) and L(φ_0 + y, φ_0 + ŷ) = L(φ_0 − y, φ_0 − ŷ) for some B_0, φ_0 ∈ R and all
z ∈ R, y, ŷ ∈ φ(R). Furthermore let φ have a continuous second derivative and assume
that the first derivative of φ at B_0 is non-zero. At last let ∂²L(y, ŷ)/∂ŷ² be continuous in y
and ŷ, L(y, y) = 0 for all y ∈ φ(R), and (∂²L(y, ŷ)/∂ŷ²)(φ_0, φ_0) > 0. Then for all n ≥ 1
there is a sequence of examples S = ((x_1, y_1), ..., (x_n, y_n)), x_t ∈ R^d, y_t ∈ φ(R), such
that E_S(B, w) has ⌊n/(2d)⌋^d distinct local minima.
Note that the square loss along with either the hyperbolic tangent or the logistic transfer
function satisfies the conditions of the theorem.
There is a parallel proof where the magnitudes of the examples may be arbitrarily small. 
5 ONE MINIMUM IN THE REALIZABLE CASE 
We show that when transfer and loss function are monotone and the examples are realizable 
then there is only a single minimal surface. A sequence of examples S is realizable if
E_S(w) = 0 for some w ∈ R^d.
Theorem 5.1 Let c and L satisfy (P1). Furthermore let q be monotone and L such that 
L(!1,!1+ r) _< L(!1,!1+ r2)forO _< r _< r2 orO > r >_ r2. Assume thatfor some 
sequence of examples $ there is a weight vector Wo E R e such that Es (wo) = 0. Then for 
each w  R e thefunctionh(a) = Es((1 - a)wo + aw) is increasing for  _ O. 
Thus each minimum w can be connected with wo by the line segment wow such that 
Es (w) = 0 for all w on wows. 
Proof of Theorem 5.1. Let S = ((x_1, y_1), ..., (x_n, y_n)). Then h(α) =
Σ_{t=1}^n L(y_t, φ(w_0 · x_t + α(w − w_0) · x_t)). Since y_t = φ(w_0 · x_t) it suffices to show that
L(φ(z), φ(z + αr)) is monotonically increasing in α ≥ 0 for all z, r ∈ R. Let 0 ≤ α_1 ≤ α_2.
Since φ is monotone we get φ(z + α_1 r) = φ(z) + r_1 and φ(z + α_2 r) = φ(z) + r_2 where
0 ≤ r_1 ≤ r_2 or 0 ≥ r_1 ≥ r_2. Thus L(φ(z), φ(z + α_1 r)) ≤ L(φ(z), φ(z + α_2 r)). []
Acknowledgments 
We thank Mike Dooley, Andrew Klinger and Eduardo Sontag for valuable discussions. Peter Auer 
gratefully acknowledges support from the FWF, Austria, under grant J01028-MAT. Mark Herbster 
and Manfred Warmuth were supported by NSF grant IRI-9123692. 
References 
[AHW96] P. Auer, M. Herbster, and M. K. Warmuth. Exponentially many local minima for single
neurons. Technical Report UCSC-CRL-96-1, Univ. of Calif. Computer Research Lab,
Santa Cruz, CA, 1996. In preparation.

[Blu89] E. K. Blum. Approximation of boolean functions by sigmoidal networks: Part I: XOR and
other two-variable functions. Neural Computation, 1:532-540, February 1989.

[BRS89] M. L. Brady, R. Raghavan, and J. Slawny. Back propagation fails to separate where
perceptrons succeed. IEEE Transactions on Circuits and Systems, 36(5):665-674, May
1989.

[BW88] E. Baum and F. Wilczek. Supervised learning of probability distributions by neural
networks. In D. Z. Anderson, editor, Neural Information Processing Systems, pages 52-61,
New York, 1988. American Institute of Physics.

[GT92] Marco Gori and Alberto Tesi. On the problem of local minima in backpropagation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 14(1):76-86, 1992.

[Hay94] S. Haykin. Neural Networks: A Comprehensive Foundation. Macmillan, New York, NY,
1994.

[SLF88] S. A. Solla, E. Levin, and M. Fleisher. Accelerated learning in layered neural networks.
Complex Systems, 2:625-639, 1988.

[SS89] E. D. Sontag and H. J. Sussmann. Backpropagation can give rise to spurious local minima
even for networks without hidden layers. Complex Systems, 3(1):91-106, February 1989.

[SS91] E. D. Sontag and H. J. Sussmann. Back propagation separates where perceptrons do. Neural
Networks, 4(3), 1991.

[Wat92] R. L. Watrous. A comparison between squared error and relative entropy metrics using
several optimization algorithms. Complex Systems, 6:495-505, 1992.

[WD88] B. S. Wittner and J. S. Denker. Strategies for teaching layered networks classification tasks.
In D. Z. Anderson, editor, Neural Information Processing Systems, pages 850-859, New
York, 1988. American Institute of Physics.
