On the infeasibility of training neural 
networks with small squared errors 
Van H. Vu 
Department of Mathematics, Yale University 
vuha@math.yale.edu 
Abstract 
We demonstrate that the problem of training neural networks with 
small (average) squared error is computationally intractable. Con- 
sider a data set of M points (Xi, Y/), i - 1, 2,..., M, where Xi are 
input vectors from R d, Y/ are real outputs (Y/ & R). For a net- 
M 
work f0 in some class : of neural networks, (/4) Ei_-(f0(Xi)- 
)2)/.._ in f fe:(1/M) -.iM__(f(Xi )_ yi)2)1[2 is the (avarage) rel- 
ative error occurs when one tries to fit the data set by f0. We will 
prove for several classes Y of neural networks that achieving a rela- 
tive error smaller than some fixed positive threshold (independent 
from the size of the data set) is NP-hard. 
1 Introduction 
Given a data set (Xi, ), i = 1, 2,..., M, Xi are input vectors from R d, Y/are real 
outputs (Y  R). We call the points (Xi, Y) data points. The training problem 
for neural networks is to find a network from some class (usually with fixed number 
of nodes and layers), which fits the data set with small error. In the following we 
describe the problem with more details. 
Let Y be a class (set) of neural networks, and c be a metric norm in /M. To 
each f  , associate an error vector Ef - (]f(Xi) - ])/M__ (E: depends on the 
data set, of course, though we prefer this notation to avoid difficulty of having too 
many subindices). The norm of Ef in c shows how well the network f fits the data 
regarding to this particular norm. Furthermore, let e,: denote the smallest error 
achieved by a network in , namely: 
In this context, the training problem we consider here is to find f   such that 
372 V. H. Vu 
[]ES]] - e,: _< e:, where : is a positive number given in advance, and does not 
depend on the size M of the data set. We will call (: relative error. The norm a 
is chosen by the nature of the training process, the most common norms are: 
l norm: [Iv[] = maz[vi (interpolation problem) 
l: norm: IIvll = vY) whe=e v = (Xet square error prob- 
lem). 
The qutity I IEy][: is usually referred to as the emperical error of the training 
process. The first goal of this paper is to show that achieving small emperical error 
is NP-hard. From now on, we work with l: norm, if not otherwise specified. 
A question of great importance is: given the data set,  and ey in advance, could 
one find an ecient algorithm to solve the training problem formulated above. By 
eciency we mean an algorithm terminating in polynomial time (polynomial in the 
size of the input). This question is closely related to the problem of learning neural 
networks in polynomial time (see [3]). The input in the algorithm is the data set, 
by its size we means the number of bits required to write down all (Xi, ). 
Question 1. Given  and e y and a dala set. Could one find an ecienl algorithm 
which produces a funelton f   such that IlEsll < e + e 
Question i is very dicult o answer in general. In his paper we will investigate 
he following importan sub-question: 
Question 2. Can one achieve arbitrary small relative error using polynomial algo- 
rithms ? 
Our purpose is to give a negative answer for Question 2. This question was posed 
by L. Jones in his seminar at Yale (1996). The crucial point here is that we are 
dealing with 12 norm, which is very important from statistical point of view. Our 
investigation is also inspired by former works done in [2], [6], [7], etc, which show 
negative results in the l norm case. 
Definition. A positive number e is a threshold of a class ' of neural networks if 
the training problem by networks from .T with relative error less than  is NP-hard 
(i.e., computationally infeasible). 
In order to provide a negative answer to Question 2, we are going to show the 
existence of thresholds (which is independent from the size of the data set) for the 
following classes of networks. 
 : {flf(x) - (1/n)(Zi step(aix -- Oi)} 
 ': {fiX(x) - (i=x cistep(aix - hi)} 
 6n -- {gig(x) = -i ci(i(aix - hi)) 
where n is a positive integer, step(x) = i if x is positive and zero otherwise, ai and 
x are vectors from R d, bi are real numbers, and ci are positive numbecs. It is clear 
that the class Y contains Y; the reason why we distinguish these two cases is that 
the proof for  is relatively easy to present, while contains the most important 
ideas. In the third class, the functions bi are sigmoid functions which satisfy certain 
Lipchitzian conditions (for more details see [9]) 
Main Theorem 
(i) The classes 'l, '2, ' and ( have absolute constant (positive) thresholds 
On the Infeasibility of Training Neural Networks with Small Squared Errors 373 
(it) For every class .Tr,+2, n > O, there is a threshold of form 
(iii) For every .T'+2, n > 0, there is a threshold of form 
(iv) For every class +2, n > 0, there is a threshold of form (n-5/2d -/2. 
In the last three statements, ( is an absolute positive constant. 
Here is the key argument of the proof. Assume that there is an algorithm A which 
solves the training problem in some class (say  ) with relative error . From 
some (properly chosen) NP-hard problem. we will construct a data set so that if e 
is sufficiently small, then the solution found by A (given the constructed data set 
as input) in ., implies a solution for the original NP-hard problem. This will give 
a lower bound on e, if we assume that the algorithm A is polynomial. In all proofs 
the leading parameter is d (the dimension of data inputs). So by polynomial we 
mean a polynomial with d as variable. All the input (data) sets constructed will 
have polynomial size in d. 
The paper is organized as follow. In the next Section, we discuss earlier results 
concerning the l norm. In Section 3, we display the NP-hard results we will use in 
the reduction. In Section 4, we prove the main Theorem for class '2 and mention 
the method to handle more general cases. We conclude with some remarks and 
open questions in Section 5. 
To end this Section, let us mention one important corollary. The Main Theorem 
implies that learning ., .t and  (with respect to l norm) is hard. For more 
about the connection between the complexity of training and learning problems, we 
refer to [3], [5]. 
Notation: Through the paper Ud denotes the unit hypercube in R d. For any 
number z, zd denotes the vector (z, x, .., z) of length d. In particular, 0d denotes 
the origin of R . For any half space H, r is the complement of H. For any set A, 
is the number of elements in A. A function y(d) is said to have order of magnitude 
O(F(d)), if there are c < C positive constants such that c < y(d)/F(d) < C for all 
d. 
2 Previous works in the loo case 
The case ct = loo (interpolation problem) was considered by several authors for 
many different classes of (usually) 2-layer networks (see [6],[2], [7], [8]). Most of the 
authors investigate the case when there is a perfect fit, i.e., l,x = 0. In [2], the 
authors proved that training 2-layer networks containing 3 step function nodes with 
zero relative error is NP-hard. Their proof can be extended for networks with more 
inner nodes and various logistic output nodes. This generalized a former result of 
Maggido [8] on data set with rational inputs. Combining the techniques used in 
[2] with analysis arguments, Lee Jones [6] showed that the training problem with 
relative error 1/10 by networks with two monotone Lipschitzian Sigmoid inner nodes 
and linear output node, is also NP-hard (NP-complete under certain circumstances). 
This implies a threshold (in the sense of our definition) (1/10)M -/2 for the class 
examined. However, this threshold is rather weak, since it is decreasing in M. This 
result was also extended for the n inner nodes case [6]. 
It is also interesting to compare our results with Judd's. In [7] he considered the 
following problem "Given a network and a set of training examples (a data set), 
does there exist a set of weights so that the network gives correct output for all 
training examples ?" He proved that this problem is NP-hard even if the network is 
374 V.H. Vu 
required to produce the correct output for two-third of the traing examples. In fact, 
it was shown that there is a class of networks and a data sets so that any algorithm 
will produce poorly on some networks and data sets in the class. However, from 
this result one could not tell if there is a network which is "hard to train" for all 
algorithms. Moreover, the number of nodes in the networks grows with the size of 
the data set. Therefore, in some sense, the result is not independent from the size 
of the data set. 
In our proofs we will exploit many techniques provided in these former works. The 
crucial one is the reduction used by A. Blum and R. Rivest, which involves the 
NP-hardness of the Hypergraph 2-Coloring problem. 
3 Some NP hard problems 
Definition Let B be a CNF formula, where each clause has at most k literals. 
Let max(B) be the maximum number of clauses which can be satisfied by a truth 
assignment. The APP MAX k-SAT problem is to find a truth assignment which 
satisfies (1 - e)max(B) clauses. 
The following Theorem says that this approximation problem is NP -hard, for some 
small e. 
Theorem 3.1.1 Fix k _> 2. There is e > O, such that finding a truth assignment. 
which satisfies at least (1- e)max(B) clauses is NP-hard. 
The problem is still hard, when every literal in B appears in only few clauses, and 
every clause contains only few literals. Let Ba(5) denote the class of CNFs with at 
most 3 literals in a clause and every literal appears in at most 5 clauses (see [1]). 
Theorem 3.1.2 There is e2 > 0 such that finding a truth assignment, which satisfies 
at least (1 - e)max(B) clauses in a formula B  B3(5) is NP-hard. 
The optimal thresholds in these theorems can be computed, due to recent results 
in Thereotical Computer Science. Because of space limitation, we do not go into 
this matter. 
Let H = (V, E) be a hypergraph on the set V, and E is the set of edges (collection 
of subsets of V). Elements of V are called vertices. The degree of a vertex is the 
number of edges containing the vertex. We could assume that each edge contains 
at least two vertices. Color the vertices with color Blue or Red. An edge is colorful 
if it contains vertices of both colors, otherwise we call it monochromatic. Let c(H) 
be the maximum number of colorful edges one can achieve by a coloring. By a 
probabilistic argument, it is easy to show that c(H) is at least IEI/2 (in a random 
coloring, an edge will be colorful with probability at least 1/2). Using 3.1.2, we 
could prove the following theorem (for the proof see [9]) 
Theorem 3.1.3 There is a constant ea > 0 such that finding a coloring with at 
least (1 - e3)c(H) colorful edges is NP-hard. This statement holds even in the case 
when every but one degree in H is at most 10 
4 Proof for 
We follow the reduction used in [2]. Consider a hypergraph H(k} E) described 
Theorem 3.2.1. Let V = {1,2 .... , d + 1}, where with the possible exception of 
the vertex d + 1, all other vertices have degree at most 10. Every edge will have 
at least 2 and at most 4 vertices. So the number of edges is at least (d + 1)/4. 
On the Infeasibility of Training Neural Networks with Small Squared Errors 375 
Let Pi be the i tn unit vector in R d+l Pi (0, 0,..., 0, 1, 0,..., 0). Furthermore, 
, -- 
XC = iec Pi for every edge C  E. Let $ be a coloring with maximum number 
of colorful edges. In this coloring denote by A1 the set of colorful edges and by As 
the set of monochromatic edges. Clearly IA11= c(H). 
Our data set will be the following (inputs are from R d+ instead of from R , but it 
makes no difference) 
D - {(pi,1/2)}ig=U{(p+l,1/2)t}U{(O+l,1)'}U{(xc,1)lC  A1}U{(Xc,1/2)[C  
where (Pd+1,1/2) t and (0+1,1) t means (Pd+l, 1/2) and (0d+l,1) are repeated t 
times in the data set, resp. Similarly to [2], consider two vectors a and b in 
where 
a -- (al,..., ad+l), ai -- --1 if i is Red and ai -- d -4- i otherwise 
b = (hi, .., bd+l), bi = -1 if i is Blue and bi = d + 1 otherwise 
It is not dimcult to verify that the function f0 = (1/2)(step (ax + 1/2)+ step (bx + 
1/2)) fits the data perfectly, thus ey 2 = liEfoil-- 0. 
Suppose f = (1/2)(step (cx - 7) + step (dx - 5)) satisfies 
M 
Mll:fll 2 __ E(f(xi ) _ )a < Mea 
i=1 
Since if f(Xi)  Yi then (f!Xi)- k})a > 1/4, the previous inequality implies: 
lUo -[{i,f(Xi)  Y/}[ < 4Me"- lu 
The ratio luo/M is called misclassification ratio, and we will show that this ratio 
cannot be arbitrary small. In order to avoid unnecessary ceiling and floor symbols, 
we assume the upper-bound/ is an integer. We choose t - / so that we can also 
assume that (0+, 1) and (P+l, 1/2) are well classified. 
space co_nsisti_ng of x: cx-7 > 0 (dx-5 > 0). Note 
Pd+l  H1 U Ha. Now let P1 denote the set of i where Pi 
such that Pi  Hi Cq Ha. Clearly, if j 6 Pa, then f(pj) J: 
Q = {C  EIC N Pa  }- Note that for each j  Pa, the 
thus: ]Q] _< 1oll _< lO/ 
Let A I = {Clf(c) - 1). Since less than  points are misclassified, IAIAAll < . 
Color V by the following rule: (1) ifp 6 P1, then i is Red; (2) ifp 6 Pa, color i 
arbitrarily, either Red or Blue; (3) if p q P U Pa, then i is Blue. 
Now we can finish the proof by the following two claims: 
Claim 1: Every edge in A\Q is colorful. It is left to readers to verify this simple 
statement. 
Let H1 (Ha) be the half 
that 0  H1 CI H and 
 H1, and Pa the set of i 
Yj, hence: IPal < u. Let 
degree of j is at most 10, 
Claim 2: IA\QI is close to IAll. 
Notice that: 
IAi\(A\Q) I _< ]AiAA] + ]QI _<  + lO/ = 11/ 
Observe that the size of the data set is M = d+ 2t + IEI, so [E I + d > M- 2t = 
M - 2. Moreover, Il >_ (d + 1)/4, so I1 >- (1/5)(M- 2). On the other hand, 
I&l >_ (1/2)1EI, all together we obtain; IA11 >_ (1/lO)(M- ), which yields: 
376 V.. H. Vu 
[AQI > 11 
- IAxl) -[Axl(1 - 110(( M _ )) 
4 2 
_> I&l(1- 110( 1 _ 4))- IA, I(1- 
Choose e = 4 such that k(4) < 3 (see Theorem 3.1.3). Then 4 will be a threshold 
for the class 2. This completes the proof. Q.E.D. 
Due to space limitation, we omit the proofs for other classes and refer to [9]. How- 
ever, let us at least describe (roughly) the general method to handle these cases. 
The method consists of following steps: 
Extend the data set in the previous proof by a set of (special) points. 
 Set the multiplicities of the special points sufficiently high so that those points 
should be well-classified. 
 If we choose the special points properly, the fact that these points are well-classified 
will determine (roughly) the behavior of all but 2 nodes. In general we will show 
that all but 2 nodes have little influence on the outputs of non-special data points. 
 The problem basically reduces to the case of two nodes. By modifying the previous 
proof, we could achieve the desired thresholds. 
5 Remarks and open problems 
 Readers may argue about the existence of (somewhat less natural) data points of 
high multiplicities. We can avoid using these data points by a combinatorial trick 
described in [9]. 
 The proof in Section 4 could be carried out using Theorem 3.1.2. However, we 
prefer using the hypergraph coloring terminology (Theorem 3.1.3), which is more 
convenient and standard. Moreover, Theorem 3.1.3 itself is interesting, and has not 
been listed among well known "approximation is hard" theorems. 
 It remains an open question to determine the right order of magnitude of thresh- 
olds for all the classes we considered. (see Section 1). By technical reasons, in the 
Main theorem, the thresholds for more than two nodes involve the dimension (d). 
We conjecture that there are dimension-free thresholds. 
Acknowledgement We wish to thank A. Blum, A. Barron and L. Lovsz for many 
useful ideas and discussions. 
References 
[1] S. Arora and C. Lurid Hardness of approximation, book chapter, preprint 
[2] A. Blum, R. Rivest Training a S-node neural network is NP-hard Neutral Net- 
works, Vol 5., p 117-127, 1992 
[3] A. Blumer, A. Ehrenfeucht, D. Haussler, M. Warmuth, Learnability and the 
Vepnik-Chervonenkis Dimension, Journal of the Association for computing Ma- 
chinery, Vol 36, No. 4, 929-965, 1989. 
[4] M. Garey and D. Johnson, Computers and intractability: A guide to the theory 
of NP-completeness, San Francisco, W.H.Freeman, 1979 
On the Infeasibility of Training Neural Networks with Small Squared Errors 377 
[5] D. ttaussler, Generalizing the PAC model for neural net and other learning 
applications (Tech. Rep. UCSC-CRL-89-30). Santa Cruz. CA: University of 
California 1989. 
[6] L. Jones, The computational intractability of training sigmoidal neural networks 
(preprint) 
[7] J. Judd Neutral Networks and Comple:ity of learning, MIT Press 1990. 
[8] N. Meggido, On the comple:ity of polyhedral separability (Tech. Rep. RJ 5252) 
IBM Almaden Research Center, San Jose, CA 
[9] V. I-I. Vu, On the infeasibility of training neural networks with small squared 
error, manuscript. 
