488 Solutions to the XOR Problem 
Frans M. Coetzee * 
coetzee@ece. cmu. edu 
Department of Electrical Engineering 
Carnegie Mellon University 
Pittsburgh, PA 15213 
Virginia L. Stonick 
ginny@ece. cmu. edu 
Department of Electrical Engineering 
Carnegie Mellon University 
Pittsburgh, PA 15213 
Abstract 
A globally convergent homotopy method is defined that is capable 
of sequentially producing large numbers of stationary points of the 
multi-layer perceptron mean-squared error surface. Using this al- 
gorithm large subsets of the stationary points of two test problems 
are found. It is shown empirically that the MLP neural network 
appears to have an extreme ratio of saddle points compared to 
local minima, and that even small neural network problems have 
extremely large numbers of solutions. 
1 Introduction 
The number and type of stationary points of the error surface provide insight into 
the difficulties of finding the optimal parameters of the network, since the stationary 
points determine the degree of the system[1]. Unfortunately, even for the small 
canonical test problems commonly used in neural network studies, it is still unknown 
how many stationary points there are, where they are, and how these are divided 
into minima, maxima and saddle points. 
Since solving the neural equations explicitly is currently intractable, it is of interest 
to be able to numerically characterize the error surfaces of standard test problems. 
To perform such a characterization is non-trivial, requiring methods that reliably 
converge and are capable of finding large subsets of distinct solutions. It can be 
shown[2] that methods which produce only one solution set on a given trial become 
inefficient (at a factorial rate) at finding large sets of multiple distinct solutions, 
since the same solutions are found repeatedly. This paper presents the first prov- 
ably globally convergent homotopy methods capable of finding large subsets of the 
Currently with Siemens Corporate Research, Princeton NJ 08540 
488 Solutions to the XOR Problem 411 
stationary points of the neural network error surface. These methods are used to 
empirically quantify not only the number but also the type of solutions for some 
simple neural networks. 
1.1 Sequential Neural Homotopy Approach Summary 
We briefly acquaint the reader with the principles of homotopy methods, since these 
approaches differ significantly from standard descent procedures. 
Homotopy methods solve systems of nonlinear equations by mapping the known 
solutions from an initial system to the desired solution of the unsolved system of 
equations. The basic method is as follows: Given a final set of equations f(z) - 
0, x G D C_ " whose solution is sought, a homotopy function h  D x T - " is 
defined in terms of a parameter r G T C , such that 
h(z,-)={ a(z) when-=0 
f(z) when-=l 
where the initial system of equations g() = 0 has a known solution. For opti- 
mization problems fix)= Ves(x) where es(x)is the error measure. Conceptually, 
h(e, -) = 0 is solved numerically for  for increasing values of % starting at - = 0 
at the known solution, and incrementally varying - and correcting the solution x 
until - = 1, thereby tracing a path from the initial to the final solutions. 
The power and the problems of homotopy methods lie in constructing a suitable 
function h. Unfortunately, for a given f most choices of h will fail, and, with the ex- 
ception of polynomial systems, no guaranteed procedures for selecting h exist. Paths 
generally do not connect the initial and final solutions, either due to non-existence 
of solutions, or due to paths diverging to infinity. However, if a theoretical proof 
of existence of a suitable trajectory can be constructed, well-established numerical 
procedures exist that reliably track the trajectory. 
The following theorem, proved in [2], establishes that a suitable homotopy exists 
for the standard feed-forward backpropagation neural networks: 
Theorem 1.1 Let e s be the unregularized mean square error (MSE) problem for 
the multi-layer perceptton network, with weights/S  . Let/so  U C  and 
a  V C , where U and V are open bounded sets. Then except for a set of 
measure zero (/S, a)  U x V, the solutions (fi, ,') of the set of equations 
h(/S, ) -- (1 - )(/S -/So) + D (e s + ;(11/S- allS)) - 0 (1) 
where  > 0 and ; :  -  satisfies 2;"(as)a s + ;Y(a ) > 0 as a - , form non- 
crossing one dimensional trajectories for all '  , which are bounded  -  [0, 1]. 
Furthermore, the path through (/so, O) connects to at least one solution (fi*, 1) of the 
regularized MSE error problem 
min (e s + ;(11/S - allS)) (2) 
On -  [0, 1] the approach corresponds to a pseudo-quadratic error surface being 
deformed continuously into the final neural network error surface 1 . Multiple solu- 
The common engineering heuristic whereby some arbitrary error surface is relaxed into 
another error surface generally does not yield well defmed trajectories. 
412 F. M. Coetzee and V. L. Stonick 
tions can be obtained by choosing different initial values/510. Every desired solution 
/* is accessible via an appropriate choice of a, since/510 =/* suffices. 
Figure 1 qualitatively illustrates typical paths obtained for this homotopy 2. The 
paths typically contain only a few solutions, are disconnected and diverge to infinity. 
A novel two-stage homotopy[2, 3] is used to overcome these problems by constructing 
and solving two homotopy equations. The first homotopy system is as described 
above. A synthetic second homotopy solves an auxiliary set of equations on a non- 
Euclidean compact manifold (S(0; R) x A, where A is a compact subset of R) and 
is used to move between the disconnected trajectories of the first homotopy. The 
method makes use of the topological properties of the compact manifold to ensure 
that the secondary homotopy paths do not diverge. 
+infl 
-infty 1 +infly 
4 
Figure 1: (a) Typical homotopy trajectories, illustrating divergence of paths and 
multiple solutiofis occurring on one path. (b) Plot of two-dimensional vectors used 
as training data for the second test problem (Yin-Yang problem). 
2 Test Problems 
The test problems described in this paper are small to allow for (i) a large number 
of repeated runs, and (ii) to make it possible to numerically distinguish between 
solutions. Classification problems were used since these present the only interesting 
small problems, even though the MSE criterion is not necessarily best for classifi- 
cation. Unlike most classification tasks, all algorithms were forced to approximate 
the stationary point accurately by requiring the l norm of the gradient to be less 
than 10 -, and ensuring that solutions differed in l by more than 0.01. 
The historical XOR problem is considered first. The data points (-1,-1), (1, 1), 
'(-1, 1) and (1,-1) were trained to the target values -0.8,-0.8, 0.8 and 0.8. A 
network with three inputs (one constant), two hidden layer nodes and one output 
node were used, with hyperbolic tangent transfer functions on the hidden and final 
2Note that the homotopy equation and its trajectories exist outside the interval 7' - 
[0, 
488 Solutions to the XOR Problem 413 
nodes. The regularization used/ -- 0.05, b(x) - x and a = 0 (no bifurcations were 
found for this value during simulations). This problem was chosen since it is small 
enough to serve as a benchmark for comparing the convergence and performance of 
the different algorithms. The second problem, referred to as the Yin-Yang problem, 
is shown in Figure 1. The problem has 23 and 22 data points in classes one and two 
respectively, and target values 4-0.7. Empirical evidence indicates that the smallest 
single hidden layer network capable of solving the problem has five hidden nodes. 
We used a net with three inputs, five hidden nodes and one output. This problem 
is interesting since relatively high classification accuracy is obtained using only a 
single neuron, but a 100% classification performance requires at least five hidden 
nodes and one of only a few global weight solutions. 
The stationary points form equivalence classes under tenumbering of the weights 
or appropriate interchange of weight signs. For the XOR problem each solution 
class contains up to 22 2! = 8 distinct solutions; for the Yin-Yang network, there 
are 25 5! = 3840 symmetries. The equivalence classes are reported in the following 
sections. 
3 Test Results 
A Pdbak-Poliere conjugate gradient (CG) method was used as a control since this 
method can find only minima, as contrasted to the other algorithms, all of which 
are attracted by all stationary points. In the second algorithm, the homotopy 
equation (1) was solved by following the main path until divergence. A damped 
Newton (DN) method and the two-stage homotopy method completed the set of 
four algorithms considered. The different algorithms were initialized with the same 
n random weights/o  S-l(0; vn). 
3.1 Control- The XOR problem 
The total number and classification of the solutions obtained for 250 iterations on 
each algorithm are shown in Table 1. 
Table 1: Number of equivalence class solutions obtained. XOR Problem 
Algorithm  Solutions Minima  Maxima Saddle Points 
CG 17 17 0 0 
DN 44 6 0 38 
One Stage 28 16 0 12 
Two Stage 61 17 0 44 
Total Distinct 61 17 0 44 
The probability of finding a given solution on a trial is shown in Figure 2. The 
two-stage homotopy method finds almost every solution from every initial point. 
In contrast to the homotopy approaches, the Newton method exhibits poor con- 
vergence, even when heavily damped. The sets of saddle points found by the DN 
algorithm and the homotopy algorithms are to a large extent disjoint, even though 
the same initial weights were used. For the Newton method solutions close to the 
initial point are typically obtained, while the initial point for the homotopy algo- 
414 E M. Coetzee and V. L. Stonick 
Pi 
Pi 
Pi 
I I I I I I 
50%-> 75%--> 100%-> 
0.5 
0.4 
0.3 
0.2 
0.1 Illh.llll ,,dl 
0 
I I I I I 
0 10 20 30 40 50 60 
Conjugate 
Gradient 
0.5 
0.4 
0.1 
0|1 
0 
I I I I I I 
50%-> 75%-> 100%-> 
...... mmn_nmnnn_ _ -. _ . inm-nn_--F].nn_ _ 
I I I I I I 
10 20 30 40 50 60 
Newton 
I I I I I I I 
 50%-> 75%-> 100%-> 
0.5 
0.4 
0.3 
0.2 
0.1 Single 
0 Stage 
0 10 20 30 40 50 60 Homotopy 
1 
0.8 
0.6 
0.4 
0.2 
0 
0 
I I I I I 
50%-> 75%-4 100%-> 
10 20 30 40 50 60 
Equivalence Class 
Two 
Stage 
Homotopy 
Figure 2: Probability of finding equivalence class i on a trial. Solutions have been 
sorted based on percentage of the training set correctly classified. Dark bars indicate 
local minima, light bars saddle points. XOR problem 
488 Solutions to the XOR Problem 415 
Table 2: Number of solutions correctly classifying x% of target data. 
Classification 25 % 50 % 75 % 100 % 
I Minimum 17 17 4 4 
Saddle 44 44 20 0 
Total Distinct 61 61 24 4 
rithm might differ significantly from the final solution. This difference illustrates 
that homotopy arrives at solutions in a fundamentally different way than descent 
approaches. 
Based on these results we conclude that the two-stage homotopy meets its objective 
of significantly increasing the number of solutions produced on a single trial. The 
homotopy algorithms converge more reliably than Newton methods, in theory and 
in practise. These properties make homotopy attractive for characterizing error 
surfaces. Finally, due to the large number of trials and significant overlap between 
the solution sets for very different algorithms, we believe that Tables 1-2 represent 
accurate estimates for the number and types of solutions to the regularized XOR 
problem. 
3.2 Results on the Yin-Yang problem 
The first three algorithms for the Yin-Yang problem were evaluated for 100 trials. 
The conjugate gradient method showed excellent stability, while the Newton method 
exhibited serious convergence problems, even with heavy damping. The two-stage 
algorithm was still producing solutions when the runs were terminated after multiple 
weeks of computer time, allowing evaluation of only ten different initial points. 
Table 3: Number of equivalence class solutions obtained. Yin-Yang Problem 
Algorithm  Solutions Minima  Maxima Saddle Points 
Conjugate Gradient 14 14 0 0 
Damped Newton 10 0 0 10 
One Stage Homotopy 78 15 0 63 
Two Stage Homotopy 1633 12 0 1621 
Total Distinct 1722 28 0 1694 
Table 4: Number of solutions correctly classifying x% of target data. 
Classification 75 80 90 95 96 97 98 99 100 % 
Minimum 28 28 28 26 26 5 5 2 2 
Saddle 1694 1694 1682 400 400 13 13 3 3 
Total Distinct 1722 1722 1710 426 426 18 18 5 5 
The results in Tables 3-4 for the number of minima are believed to be accurate, due 
to verification provided by the conjugate gradient method. The number of saddle 
416 E M. Coetzee and V. L Stonick 
points should be seen as a lower bound. The regularization ensured that the saddle 
points were well conditioned, i.e. the Hessian was not rank deficient, and these 
solutions are therefore distinct point solutions. 
4 Conclusions 
The homotopy methods introduced in this paper overcome the difficulties of poor 
convergence and the problem of repeatedly finding the same solutions. The use 
of these methods therefore produces significant new empirical insight into some 
extraordinary unsuspected properties of the neural network error surface. 
The error surface appears to consist of relatively few minima, separated by an 
extraordinarily large 'number of saddle points. While one recent paper by Goffe et 
al [4] had given some numerical estimates based on which it was concluded that a 
large number of minima in neural nets exist (they did not find a significant number 
of these), this extreme ratio of saddle points to minima appears to be unexpected. 
No maxima were discovered in the above runs; in fact none appear to exist within 
the sphere where solutions were sought (this seems likely given the regularization). 
The numerical results reveal astounding complexity in the neural network problem. 
If the equivalence classes are complete, then 488 solutions for the XOR problem are 
implied, of which 136 are minima. For the Yin-Yang problem, 6,600,000+ solutions 
and 107,250+ minima were characterized. For the simple architectures considered, 
these numbers appear extremely high. We are unaware of any other system of 
equations having these remarkable properties. 
Finally, it should be noted that the large number of s!dle points and the small 
ratio of minima to saddle points in neural problems can create tremendous com- 
putational difficulties for approaches which produce stationary points, rather than 
simple minima. The efficiency of any such algorithm at producing solutions will be 
negated by the fact that, from an optimization perspective, most of these solutions 
will be useless. 
Acknowledgements. The partial support of the National Science Foundation by 
grant MIP-9157221 is gratefully acknowledged. 
References 
[1] E. H. Rothe, Introduction to Various Aspects of Degree Theory in Banach Spaces. 
Mathematical Surveys and Monographs (23), Providence, Rhode Island: American 
Mathematical Society, 1986. ISBN 0-82218-1522-9. 
[2] 
F. M. Coetzee, Homotopy Approaches for the Analysis and Solution of Neural 
Network and Other Nonlinear Systems of Equations. PhD thesis, Carnegie Mellon 
University, Pittsburgh, PA, May 1995. 
[3] F. M. Coetzee and V. L. Stonick, "Sequential homotopy-based computation of 
multiple solutions to nonlinear equations," in Proc. IEEE ICASSP, (Detroit), IEEE, 
May 1995. 
[4] W. L. Goffe, G. D. Ferrier, and J. Rogers, "Global optimization of statistical 
functions with simulated annealing," Jour. Econometrics, vol. 60, no. 1-2, pp. 65-99, 
1994. 
