494 
TRAINING A 3-NODE NEURAL NETWORK 
IS NP-COMPLETE 
Avrim Blum* 
MIT Lab. for Computer Science 
Cambridge, Mass. 02139 USA 
Ronald L. Rivestt 
MIT Lab. for Computer Science 
Cambridge, Mass. 02139 USA 
ABSTRACT 
We consider a 2-layer, 3-node, n-input neural network whose nodes 
compute linear threshold functions of their inputs. We show that it 
is NP-complete to decide whether there exist weights and thresholds 
for the three nodes of this network so that it will produce output con- 
sistent with a given set of training examples. We extend the result 
to other simple networks. This result suggests that those looking for 
perfect training algorithms cannot escape inherent computational 
difficulties just by considering only simple or very regular networks. 
It also suggests the importance, given a training problem, of finding 
an appropriate network and input encoding for that problem. It is 
left as an open problem to extend our result to nodes with non-linear 
functions such as sigmoids. 
INTRODUCTION 
One reason for the recent surge in interest in neural networks is the develop- 
ment of the "back-propagation" algorithm for training neural networks. The 
ability to train large multi-layer neural networks is essential for utilizing neural 
networks in practice, and the back-propagation algorithm promises just that. 
In practice, however, the back-propagation algorithm runs very slowly, and the 
question naturally arises as to whether there are necessarily intrinsic compu- 
tational difficulties associated with training neural networks, or whether better 
training algorithms might exist. This paper provides additional support for the 
position that training neural networks is intrinsically difficult. 
A common method of demonstrating a problem to be intrinsically hard is to 
show the problem to be "NP-complete". The theory of NP-complete problems 
is well-understood (Garey and Johnson, 1979), and many infamous problems-- 
such as the traveling salesman problem--are now known to be NP-complete. 
While NP-completeness does not render a problem totally unapproachable in 
*Supported by an NSF graduate fellowship. 
tThis paper was prepared with support from NSF grant DCR-8607494, ARO Grant 
DAAL03-86-K-0171, and the Siemens Corporation. 
Training a 3-Node Neural Network is NP-Complete 495 
practice, it usually implies that only small instances of the problem can be solved 
exactly, and that large instances can at best only be solved approximately, even 
with large amounts of computer time. 
The work in this paper is inspired by Judd (Judd, 1987) who shows the following 
problem to be NP-complete: 
"Given a neural network and a set of training examples, does there 
exist a set of edge weights for the network so that the network pro- 
duces the correct output for all the training examples?" 
Judd also shows that the problem remains NP-complete even if it is only required 
a network produce the correct output for two-thirds of the training examples, 
which implies that even approximately training a neural network is intrinsically 
difficult in the worst case. Judd produces a class of networks and training ex- 
amples for those networks such that any training algorithm will perform poorly 
on some networks and training examples in that class. The results, however, 
do not specify any particular "hard network"--that is, any single network hard 
for all algorithms. Also, the networks produced have a number of hidden nodes 
that grows with the number of inputs, as well as a quite irregular connection 
pattern. 
We extend his result by showing that it is NP-complete to train a specific very 
simple network, having only two hidden nodes and a regular interconnection 
pattern. We also present classes of regular 2-layer networks such that for all 
networks in these classes, the training problem is hard in the worst case (in 
that there exists some hard sets of training examples). The NP-completeness 
proof also yields results showing that finding approximation algorithms that 
make only one-sided error or that approximate closely the minimum number 
of hidden-layer nodes needed for a network to be powerful enough to correctly 
classify the training data, is probably hard, in that these problems can be related 
to other difficult (but not known to be NP-complete) approximation problems. 
Our results, like Judd's, are described in terms of "batch"-style learning algo- 
rithms that are given all the training examples at once. It is worth noting that 
training with an "incremental" algorithm that sees the examples one at a time 
such as back-propagation is at least as hard. Thus the NP-completeness result 
given here also implies that incremental training algorithms are likely to run 
slowly. 
Our results state that given a network of the classes considered, for any training 
algorithm there will be some types of training problems such that the algorithm 
will perform poorly as the problem size increases. The results leave open the 
possibility that given a training problem that is hard for some network, there 
might exist a different network and encoding of the input that make training 
easy. 
Blum snd Rivest 
1 2 3 4 ... . 
Figure 1: The three node neural network. 
THE NEURAL NETWORK TRAINING PROBLEM 
The multilayer network that we consider has n binary inputs and three nodes: 
N, N2, Ns. All the inputs are connected to nodes N and N2. The outputs 
of hidden nodes N and N2 are connected to output node Ns which gives the 
output of the network. 
Each node Ni computes a linear threshold function fi on its inputs. If Ni has 
input z = (z,... ,x,n), then for some constants ao,..., am, 
+1 if az + a2z2 +... + a,nz,n > ao 
fi(x) = -1 otherwise. 
The a i's (j _> 1) are typically viewed as weights on the incoming edges and a0 
as the threshold. 
The training algorithm for the network is given a set of training examples. Each 
is either a positive example (an input for which the desired network output is +1) 
or a negative example (an input for which the desired output is -1). Consider 
the following problem. Note that we have stated it as a decision ("yes" or "no") 
problem, but that the search problem (finding the weights) is at least equally 
hard. 
TRAINING A 3-NODE NEURAL NETWORK: 
Given: A set of O(n) training examples on n inputs. 
Question: Do there exist linear threshold functions f, f2, fa for nodes N, N2, Na 
Training a 3-Node Neural Network is NP-Complete 497 
such that the network of figure I produces outputs consistent with the training 
set? 
Theorem: Training a 3-node neural network is NP-complete. 
We also show (proofs omitted here due to space requirements) NP-completeness 
results for the following networks: 
o 
The 3-node network described above, even if any or all of the weights for 
one hidden node are required to equal the corresponding weights of the 
other, so possibly only the thresholds differ, and even if any or all of the 
weights are forced to be from {41,-1}. 
Any k-hidden node, for k bounded by some polynomial in n (eg: k -- n2), 
two-layer fully-connected network with linear threshold function nodes 
where the output node is required to compute the AND function of its 
inputs. 
The 2-layer, 3-node n-input network with an XOR output node, if ternary 
features are allowed. 
In addition we show (proof omitted here) that any set of positive and negative 
training examples classifiable by the 3-node network with XOR output node (for 
which training is NP-complete) can be correctly classified by a perceptron with 
O(n 2) inputs which consist of the original n inputs and all products of pairs of 
the original n inputs (for which training can be done in polynomial-time using 
linear programming techniques). 
THE GEOMETRIC POINT OF VIEW 
A training example can be thought of as a point in n-dimensional space, labeled 
'4' or '-' depending on whether it is a positive or negative example. The points 
are vertices of the n-dimensional hypercube. The zeros of the functions fx and 
f2 for the hidden nodes can be thought of as (n- 1)-dimensional hyperplanes 
in this space. The planes P1 and P2 corresponding to the functions fl and 
f2 divide the space into four quadrants according to the four possible pairs of 
outputs for nodes N1 and N2. If the planes are parallel, then one or two of the 
quadrants is degenerate (non-existent). Since the output node receives as input 
only the outputs of the hidden nodes N1 and N2, it can only distinguish between 
points in different quadrants. The output node is also restricted to be a linear 
function. It may not, for example, output "+1" when its inputs are (+1,41) 
and (- 1, - 1), and output "- 1" when its inputs are (+ 1, - 1) and (- 1, + 1). 
So, we may reduce our question to the following: given O(n) points in {0, 1}" 
each point labeled '4' or '-', does there exist either 
498 Blum and Rivest 
1. a single plane that separates the '--' points from the '-' points, or 
2. two planes that partition the points so that either one quadrant contains 
all and only '-' points or one quadrant contains all and only '-' points. 
We first look at the restricted question of whether there exist two planes that 
partition the points such that one quadrant contains all and only the '--' points. 
This corresponds to having an "AND" function at the output node. We will call 
this problem: "2-Linear Confinement of Positive Boolean Examples". Once we 
have shown this to be NP-complete, we will extend the proof to the full problem 
by adding examples that disallow the other possibilities at the output node. 
Megiddo (Megiddo, 1986) has shown that for O(n) arbitrary '--' and '-' points 
in n-dimensional Euclidean space, the problem of whether there exist two hy- 
perplanes that separate them is NP-complete. His proof breaks down, however, 
when one restricts the coordinate values to C0, 1) as we do here. Our proof 
turns out to be of a quite different style. 
SET SPLITTING 
The following problem was proven to be NP-complete by Lovasz (Garey and 
Johnson 1979). 
SET-SPLITTING: 
Given: A finite set S and a collection C of subsets ci of S. 
Question: Do there exist disjoint sets S, S2 such that S U S2 - S and 
i, ci  S and ci  S2. 
The Set-Splitting problem is also known as 2-non-Monotone Colorability. Our 
use of this problem is inspired by its use by Kearns, Li, Pitt, and Valiant to 
show that learning k-term DNF is NP-complete (Kearns et al. 1987) and the 
style of the reduction is similar. 
THE REDUCTION 
Suppose we are given an instance of the Set-Splitting problem: 
s= c= _cs, 
Create the following signed points on the n-dimensional hypercube C0, 1}n: 
 Let the origin 0 n be labeled '+'. 
 For each si, put a point labeled '-' at the neighbor to the origin that has 
12... i ...n 
a 1 in the ith bit--that is, at (00-' .010.. '0). Call this point pi. 
Training a 3-Node Neural Network is NP-Complete 499 
$1 
2 
$3 
(001) 
(010) 
(000) (100) 
Figure 2: An example. 
 For each cj = {sj,..., skj}, put a point labeled '-t-' at the location whose 
bits are 1 at exactly the positions j, j2,..., jkj rothat is, at pj-t-" .9-pj. 
For example, let $-- (si,s2,ss), C- (1,2} , 1 '-- (81,82}, 2-- (82,83}' 
So, we create '-' points at: (0 0 1), (0 1 0), (1 0 0) 
and '-t-' points at: (0 0 0), (1 1 0), (0 1 1) in this reduction (see figure 2). 
Claim: The given instance of the Set-Splitting problem has a solution  the 
constructed instance of the 2-Linear Confinement of Positive Boolean Examples 
problem has a solution. 
Proof: (=) 
Given $x from the solution to the Set-Splitting instance, create the plane P: 
ax +-.. + anxn = -, where ai = -1 if si 6 $, and ai = n if si  $. Let 
the vectors a = a,),x = 
This plane separates from the origin exactly the '-' points corresponding to 
si 6 S and no '+' points. Notice that for each si 6 Sx, a'pi = -1, and that 
for each si q[ S, a 'pi -- n. For each '+' point p, a -p > - since either p is 
the origin or else p has a 1 in a bit i such that si q[ S. 
Similarly, create the plane P2 from 
Let S be the set of points separated from the origin by Px and S2 be those 
points separated by P2. Place any points separated by both planes in either 
S or S2 arbitrarily. Sets S and S2 cover S since all '-' points are separated 
from the origin by at least one of the planes. Consider some cj -- {Sjl... sj}} 
500 Blum and Rivest 
(001) 
(010) 
(000) (100) 
Figure 3: The gadget. 
and the corresponding '-' points pj,...,Pike. If, say, cj C Si, then P must 
separate all the p.i from the origin. Therefore, Px must separate py +.--+pikj 
from the origin. Since that point is the '+' point corresponding to cj, the '+' 
points are not all confined to one quadrant, contradicting our assumptions. So, 
no cj can be contained in $. Similarly, no cj can be contained in $2.  
We now add a "gadget" consisting of 6 new points to handle the other possi- 
bilities at the output node. The gadget forces that the only way in which two 
planes could linearly separate the '+' points from the '-' points would be to 
confine the '+' points to one quadrant. The gadget consists of extra points and 
three new dimensions. We add three new dimensions, Xn+l,Xn+2, and xn+s, 
and put '+' points in locations: 
(0...0101), (0-..0011) 
and '-' points in locations: 
(0...0 100), (0...0 010), 
(0...0 001), 
(0-.-0111). 
(See figure 3.) 
The '+' points of-this cube can be separated from the '-' points by appropriate 
settings of the weights of planes P and P2 corresponding to the three new 
dimensions. Given planes Pi  ax + ... + a,x, = - and P  bx + ... + 
bn xn = - J' which solve a 2-Linear Confinement of Positive Boolean Examples 
instance in n dimensions, expand the solution to handle the gadget by setting 
1 
P to az+...+anxn+z++z+2-xn+3=- 
1 
P to bx+...+b,x,-x,+-x,++x,+3=- 
Training a 3-Node Neural Network is NP-Complete 501 
(P separates '-' point (0--. 0 001) from the 'q-' points and P2 separates the 
other three '-' points from the 'q-' points). Also, notice that there is no way 
in which just one plane can separate the 'q-' points from the '-' points in the 
cube and also no way for two planes to confine all the negative points in one 
quadrant. Thus we have proved the theorem. 
CONCLUSIONS 
Training a 3-node neural network whose nodes compute linear threshold func- 
tions is NP-complete. 
An open problem is whether the NP-completeness result can be extended to 
neural networks that use sigmoid functions. We believe that it can because the 
use of sigmoid functions does not seem to alter significantly the expressive power 
of a neural network. Note that Judd (Judd 1987), for the networks he considers, 
shows NP-completeness for a wide variety of node functions including sigmoids. 
References 
James A. Anderson and Edward Rosenfeld, editors. 
dations of Research. MIT Press, 1988. 
Neurocomputing: Foun- 
M. Garey and D. Johnson. Computers and Intractability: A Guide to the 
Theory of NP-Completeness. W. H. Freeman, San Francisco, 1979. 
J. Stephen Judd. Learning in networks is hard. In Proceedings of the First 
International Conference on Neural Networks, pages 685-692, I.E.E.E., 
San Diego, California June 1987. 
Stephen Judd. Neural Network Design and the Complexity of Learning. 
PhD thesis, Computer and Information Science dept., University of Mas- 
sachusetts, Amherst, Mass. U.S.A., 1988. 
Michael Kearns, Ming Li, Leonard Pitt, and Leslie Valiant. On the learnability 
of boolean formulae. In Proceedings of the Nineteenth Annual A CM Sym- 
posium on Theory of Computing, pages 285-295, New York, New York, 
May 1987. 
Nimrod Megiddo. On The Complexity of Polyhedral Separability. Technical 
Report RJ 5252, IBM Almaden Research Center, August 1986. 
Marvin Minsky and Seymour Papert. Percepttons: An Introduction to Com- 
putational Geometry. The MIT Press, 1969. 
David E. Rumelhart and James L. McClelland, editors. Parallel Distributed 
Processing (Volume I: Foundations). MIT Press, 1986. 
