Agnostic PAC-Learning of Functions 
Analog Neural Nets 
(Extended Abstract) 
on 
Wolfgang Maass 
Institute for Theoretical Computer Science 
Technische Universitaet Graz 
Klosterwiesgasse 32/2 
A-8010 Graz, Austria 
e-mail: maass@igi.tu-graz.ac.at 
Abstract: 
There exist a number of negative results ([J], [BR], [KV]) about 
learning on neural nets in Valiant's model IV] for probably approx- 
imately correct learning ("PAC-learning"). These negative results 
are based on an asymptotic analysis where one lets the number of 
nodes in the neural net go to infinity. Hence this analysis is less ad- 
equate for the investigation of learning on a small fixed neural net 
with relatively few analog inputs (e.g. the principal components of 
some sensory data). The latter type of learning problem gives rise 
to a different kind of asymptotic question: Can the true error of the 
neural net be brought arbitrarily close to that of a neural net with 
"optimal" weights through sufficiently long training? In this paper 
we employ some new arguments in order to give a positive answer 
to this question in Haussler's rather realistic refinement of Valiant's 
model for PAC-learning ([H], [KSS]). In this more realistic model 
no a-priori assumptions are required about the "learning target", 
noise is permitted in the training data, and the inputs and outputs 
are not restricted to boolean values. As a special case our result 
implies one of the first positive results about learning on multi-layer 
neural nets in Valiant's original PAC-learning model. At the end 
of this paper we will describe an efficient parallel implementation 
of this new learning algorithm. 
311 
312 Maass 
We consider multi-layer high order feedforward neural nets A/' with arbitrary piece- 
wise polynomial activation functions. Each node g of fan-in m > 0 in iV' is 
called a computation node. It is labelled by some polynomial Q9(yl,..., y,) 
and some piecewise polynomial activation function 79 : R - R. We assume 
that 7 g consists of finitely many polynomial pieces and that its definition in- 
volves only rational parameters. The computation node g computes the function 
{Y,...,Y,/ - 79(Q9(Y,  ..,Y,)) from R ' into R. The nodes of fan-in 0 in 
("input nodes") are labelled by variables xl,..., xk. The nodes g of fan-out 0 in 
iV' ("output nodes") are labelled by 1,..., l. We assume that the range B of their 
activation functions 79 is bounded. Any parameters that occur in the definitions of 
the 79 are referred to as architectural parameters of iV'. 
The coefficients of all the polynomials Q9 are called the programmable parameters 
(or weights) of iV'. Let w be the number of programmable parameters of iV'. For any 
assignment c E R w to the programmable parameters of iV' the network computes 
a function from R k into R l which we will denote by A/'-. 
We write Qn for the set of rational numbers that can be written as quotients of 
l 
integers with bit-length _< n. For _z = Iz,...,zll E R l we write IIzl] for  Izj]. 
j=l 
Let F  R  - R l be some arbitrary functi6n, which we will view as a "prediction 
rule". For any given instance (_x,y) G R  x R l we measure the error of F by 
I[F(_x) -yll. For any distribution A over some subset of R  x R t we measure the 
true error off with regard to A by E(,y__)e.4[llF(_x ) -Yll], i.e. the expected value 
of the error of F with respect to distribution A. 
Theorem 1: Let A/' be some arbitrary high order feedforward neural net with piece- 
wise polynomial activation functions. Let w be the number of programmable para- 
meters of A/' (we assume that w = O(1)). Then one can construct from A/' some 
first order feedforward neural net J(/' with piecewise linear activation functions and 
2 
the quadratic activation function 7(x) = x , which has the following property: 
There exists a polynomial m(, ) and a learning algorithm LEARN such that for 
any given s, 5, (0,1) and s,n  N and any distribution A over Q x (Q, (3B)  
the following holds: 
For any sample  '- ((xi, Yi})i=X,...,m of m > m(1 1 
- 7, 7) points that are independently 
drawn according to A the algorithm LEARN computes in polynomially in m, s, n 
computation steps an assignment  of rational numbers to the programmable para- 
meters of J(/' such that with probability >_ 1 - 5: 
- yll] _< . + inf - Yllx], 
_ EQ 
or in other words: 
The true error of J(f5 with regard to A is within e of the least possible true error 
that can be achieved by any A/'- with c E Q. 
Remarks 
a) One can easily see (see [M 93b] for details) that Theorem 1 provides a 
positive learning result in Haussler's extension of Valiant's model for PAC- 
learning ([H], [KSS]). The "touchstone class" (see [KSS]) is defined as the 
Agnostic PAC-Learning of Functions on Analog Neural Nets 313 
b) 
class of function f  R k - R l that are computable on Af with program- 
mable parameters from Q. 
This fact is of some general interest, since so far only very few positive 
results are known for any learning problem in this rather realistic (but 
quite demanding) learning model. 
Consider the special case where the distribution A over Qn x (Qn n B) l is 
of the form 
{ , if_y = 
AD,_T(_x,y) = 0 , otherwise 
for some arbitrary distribution D over the domain Qn and some arbitrary 
--T E QsW. Then the term 
inf ylll] 
EQ 
is equal to 0. Hence the preceding theorem states that with learning algo- 
rithm LEARN the "learning network" J(/' can "learn" with arbitrarily small 
true error any target fimction Afar that is computable on Af with rational 
"weights" --T' Thus by choosing Af sufficiently large, one can guarantee 
that the associated "learning network" j(/' can learn any target-function 
that might arise in the context of a specific learning problem. 
In addition the theorem also applies to the more realistic situation where 
the learner receives examples (x, y) of the form (_x, Af-T (_x)+ noise), or even 
if there exists no "target functih" Af-" that would "explain" the actual 
distribution A of examples (x, y) ("agnostic learning"). 
The proof of Theorem 1 is mathematically quite involved, 
only an outline. It consists of three steps: 
and we can give here 
(1) Construction of the auxiliary neural net .. 
(2) Reducing the optimization of weights in jrfor a given distribution A to a 
finite nonlinear optimization problem. 
(3) Reducing the resulting finite nonliuear optimization problem to a family of 
finite linear optimization problems. 
Details to step (1): If the activation fianctions  in Af are piecewise linear and 
all computation. nodes in Af have fan-out _< 1 (this occurs for example if Af has just 
one hidden layer and only one output) then one can set J(/' :- Af. If the  are 
piecewise linear but not all computation nodes in Af have fan-out _< 1 one defines 
J(f as the tree of the same depth as Af, where subcircuits of computation nodes with 
fan-out m > 1 are duplicated m times. The activation functions remain unchanged 
in this case. 
If the activation fimctions 7 g are piecewise polynomial but not piecewise linear, 
one has to apply a rather complex construction which is described in detail in the 
Journal version of [M 93a]. In any case j(f has the property that all functions that 
314 Maass 
are computable on A/' can also be computed on Jf, the depth of  is bounded by a 
constant, and the size of JQ' is bounded by a polynomial in the size of A/' (provided 
that the depth and order of iV', as well as the number and degrees of the polynomial 
pieces of the 7  are bounded by a constant). 
Details to step (2): Since the VC-dimension of a neural net is only defined 
for neural nets with boolean output, one has to consider here instead the pseudo- 
dimension of the function class c that is defined by 
Definition: (see Itaussler [It]). 
Let X be some arbitrary domain, and let jr be an arbitrary class of functions from 
X into It. Then the pseudo-dimension of Y z is defined by 
dimf,(.) := max{IS1 ' $ C- x and 3h  $-- It such that 
Vb  {0, 3f  Vx  ,S (re) >_ be) = 1)). 
Note that in the special case where . is a concept class (i.e. all f 6 . are 0- 1 
valued) the pseudo-dimension dimf,(.) coincides with the VC-dimension of.. The 
pseudo-dimension of the function class associated with network architectures J(/' with 
piecewise polynomial activation finctions can be bounded with the help of Milnor's 
Theorem [Mi] in the same way as the VC-dimension for the case of boolean network 
output (see [GJ]): 
Theorem 2: Consider arbitrary network architectures J(/' of order v with k input 
nodes, 1 output nodes, and w programmable parameters. Assume that each gate in 
Jff employs as activation fimction some piecewise polynomial (or piecewise rational) 
function of degree _< d with at most q pieces. For some arbitrary p  {1,2,...} 
we define jr := { f  R +t - R  __a  R w _x 6 R  y G Rt(f(__x,y) = 
I1-(_) - yll)}. The,, one has dimp(.) = O(w 2 log q) if v, d, l = O(1). 
With the hell) of the pseudo-dinension one can carry out the desired reduction of 
the optimization of weights in JQ' (with regard to an arbitrary given distribution A 
of examples {_x, y)) to a finite optimization problen. Fix some interval [b, b2] C- R 
such that B C_ [bx, b2], bx < ha, and such that the ranges of the activation functions 
of the output gates of J(f are contained in [bl,b2]. We define b := I. (ba-bx) , and 
' :---- {f  R  x [b,b2]  --+ [0, b]' __a G R w V_.x G R k Vy G [b,b2]  (f(_x,y) - 
II-(_)-yll)). Assume now that parameters s, 5 6 (0, 1) with s < b and s,n  N 
have been fixed. For convenience we assume that s is sufficiently large so that 
all architectural parameters in A/' are from Q, (we assume that all architectural 
parameters in iV' are rational). We define 
m .- 7 2-dime(). In 33eb 8 
, '- + In 
By Corollary 2 of Theorem 7 in Haussler [HI one h for m km(}, ), K :=  e 
(2,3), nd any distribution A over Q x (Q, [bl,b2])' 
(1) P,',4-*[{Hf e 5.1(  f(,y)) E(_,y_)ea[f(ac, y)][ > ) < 5, 
(,u_) e 
Agnostic PAC-Learning of Functions on Analog Neural Nets 315 
where E(,y_)e.4 [f(__x, y)] is the expectation of f(_x,y) with regard to distribution A. 
We design an algorithm LEARN that computes for any rn E N, any sample 
 ' ((3i,IJi))i{1,...,m}  (Qkn X (Qn o [bl,b2])l) m, 
and any given s G N in polynomially in m, s, n computation steps an assignment 
_5 of rational numbers to the parameters in J(f such that the function  that is 
computed by J(/'5 satisfies 
(2) lll(xi )_ yilll 
i=1 
2 
<_ (1 - -)e + inf 
i=1 
This suffices for the proof of Theorem 1, since (1) and (2) together imply that, for 
any distribution A over Q x (Q,, cq lb1, b2])' and any ra > m( 17, ), with probability 
>_ 1 - 5 (with respect to the random drawing of (  /"*) the algorithm LEARN 
outputs for inputs ( and s an assignment  of rational numbers to the parameters 
in :Q' such that 
- yl111 <_  + inf - yllx]. 
Details to step (3): The computation of weights __& that satisfy (2) is nontrivial, 
since this amounts to solvinga nonlinear optimization problem. This holds even if 
each activation function in Af is piecewise linear, because weights from successive 
layers are multiplied with each other. 
We employ a method from [M 93a] that allows us to replace the nonlinear conditions 
on the programmable parameters __a of.f by linear conditions for a transformed set 
c,  of parameters. We simulate f- by another network architecture JQ'[.q] (which 
one may view as a "normal form" for /'-) that uses the same graph (V, E) as 
J(f, but different activation functions and different values/ for its programmable 
parameters. The activation functions of :(f[_c] depend on IVI new architectural 
parameters _c E R lul, which we call scaling parameters in the following. Whereas 
the architectural parameters of a network architecture are usually kept fixed, we 
will be forced to change the scaling parameters of JQ' along with its programmable 
parameters/. Although this new network architecture has the disadvantage that 
it requires ]VI additional parameters _c, it has the advantage that we can choose in 
J(fLc] all weights on edges between computation nodes to be from {-1,0, 1}. Hence 
we can treat them as constants with at most 3 possible values in the system of 
inequalities that describes computations of JQ']. Thereby we can achieve that all 
variables that appear in the inqualities that describe computations of J(f[_q] for fixed 
network inputs (the variables for weights of gates on level 1, the variables for the 
biases of gates on all levels, and the new variables for the scaling parameters _c) 
appear only linearly in those inqualities. 
We briefly indicate the construction of JQ' in the case where each activation function 
7 in Af is piecewise linear. For any c > 0 we consider the associated piecewise linear 
activation function 7  with 
W: C  = 
316 Maass 
Assume that  is some arbitrary given assignment to the programmable parameters 
in J(f. We transform j(fa_ through a recursive process into a "normal form" J(/'- 
in which all weights on edges between computation nodes are from {-1,0, 1}, such 
u 
q 
Assume that an output gate gout of J(/'- receives as input y] oiy i --oo, where 
i--1 
a,..., aq, a0 are the weights and the bias of gout (under the assignment _) and 
yx,..., yq are the (real valued) outputs of the immediate predecessors g,..., gq of 
g. For each i G {1,..., q} with ai  0 such that gi is not all input node we replace 
the activation function 7i of gi by 7? '1, and we multiply the weights and the bias 
of gate gg with lail. Finally we replace the weight ai of gate gout by sgn(ai), where 
sgn(ai) := 1 if ai > 0 and sgn(ai) := -1 if a < 0. This operation has the effect 
that the multiplication with I-l is carried out before the gate gi (rather than after 
gi, as done in J(f-), but that the considered output gate gout still receives the same 
input as before. If "i -- 0 we want to "freeze" that weight at 0. This can be done 
by deleting gi and all gates below gi from jr. 
The analogous operations are recursively carried out for the predecessors gi of gout 
(note however that the weights of gi are no longer the original ones from j/-a_, since 
they have been changed in the preceding step). We exploit here the assumption 
that each gate in J(/' has fan-out _< 1. 
Let /3 consist of the new weights on edges adjacent to input nodes and of the 
resulting biases of all gates in Jf. Let c consist of the resulting scaling parameters 
at the gates of f. Then we have a_,' E R k (f-(_x) = [C]-(_x)). Furthermore c > 0 
for all scaling parameters c in c. 
At the end of this proof we will also need the fact that the previously described para- 
meter transformation can be inverted, i.e. one can compute from c,__fi an equivalent 
weight assignment _ for J(f (with the original activation functions 7). 
We now describe how the algorithm LEARN computes for any given sample 
( -- ((xi,Yi))i=l ..... ,  (Q x (Q,, cl [bl,b2])l) m and any given s G N with the 
help of linear programming a new assignment _,  to the parameters in J(/' such that 
the function h that is computed by Jr[]  satisfies (2). For that purpose we describe 
the computations of J(/' for the fized inputs x/from the sample ( - ((xi, Y__i))i=,...,m 
by polynomially in m many systems L1,  ., Lp(m) that each consist of O(m) linear 
inequalities with the transformed parameters c,/3 as variables. Each system Lj re- 
flects one possibility for employing specific linear pieces of the activation functions in 
Jf for specific network inputs xx,..., x,n, and for employing different combinations 
of weights from {-1, 0, 1} for edges between computation nodes. 
One can show that it suffices to consider only polynomially in m many systems 
of inequalities Lj by exploiting that all inequalities are linear, and that the input 
space for Jf has bounded dimension k. 
Agnostic PAC-Learning of Functions on Analog Neural Nets 317 
We now expand each of the systems Lj (which has only O(1) variables) into a 
linear programming problem LPj with O(m) variables. We add to Lj for each of 
the 1 output nodes v of J(f 2m new variables u, v[ for i = 1,..., m, and the 4m 
inequalities 
ty(xi) < (Yi)v + u - v[ ty (xi) > (y)v + u - v  
_ , _ o, o, 
where ({x,})i=l ..... , is the fixed sample  and () is that coordinate of V which 
corresponds o the output node u of. In these inequalities the symbol ty(xi) de- 
notes the term (which is by construction linear in the variables , ) that represents 
the output of gate u for network input xi in this system Lj. One should note that 
these terms ty(zi) will in general be different for different j, since different linear 
pieces of the actuation functions at preceding gates may be used in the computation 
of  for the same network input x. We expand the system Lj of linear inequalities 
o a linear programming problem LP in canonical form by adding the optimization 
requirement 
minimize  ]  (u + v). 
i= v output node 
The algorithm LEARN employs an efficient algorithm for linear programming (e.g. 
the ellipsoid algorithm, see [PSI) in order to compute in altogether polynomially 
in m, s and n many steps an optimal solution for each of the linear programming 
problems LP,..., LP,(,, O. We write hj for the function from R k into R t that is 
computed by fr[_c] for the optimal solution _c,/3 of LPj. The algorithm LEARN 
m 
1 
computes -- ' [Ih(xi)- yil for j: 1,...,p(m). Let  be that index for which 
i=l 
this expression has a minimal value. Let __5,/} be the associated optimal solution of 
LPj (i.e. fr[ computes hj). LEARN employs the previously mentioned back- 
wards transformation from _,  into values _5_5 for the programmable parameters of 
2Q' such that Vx  R k (2Q'-a(_x) = jQ'[_5.](__x)). These values _& are given as output of 
the algorithm LEARN. 
We refer to [M 93b] for the verification that this weight assignment _& has the 
property that is claimed in Theorem 1. We also refer to [M 93b] for the proof in the 
more general case where the activation fimctions of Af are piecewise polynomial. ! 
Remark: The algorithm LEARN can be speeded up substantially on a parallel ma- 
chine. Furthermore if the individual processors of the parallel machine are allowed 
to use random bits, hardly any global control is required for this parallel computa- 
tion. We use polynomially in m many processors. Each processor picks at random 
one of the systems Lj of linear inequalities and solves the corresponding linear pro- 
gramming problem LP. Then the parallel machine compares in a "competitive 
phase" the costs  y_lll of the solutions hj that have been computed by 
i:1 
the individual processors. It outputs the weights _& for JQ' that correspond to the 
318 Maass 
best ones of these solutions hi. If one views the number w of weights in A/' no longer 
as a constant, one sees that the number of processores that are needed is simply 
exponential in w, but that the parallel computation time is polynomial in m and 
W. 
Acknowledgements 
I would like to thank Peter Auer, Phil Long and Hal White for their helpful com- 
ments. 
References 
[GJ] 
[HI 
[J] 
[KV] 
[KSS] 
[M 93a] 
[M 9361 
[Mi] 
[PSI 
IV] 
A. Blum, R. L. Rivest, "Training a 3-node neural network is NP- 
complete", Proc. of the 1988 Workshop on Computational Learning 
Theory, Morgan Kaufmann (San Mateo, 1988), 9 - 18 
P. Goldberg, M. Jerrum, "Bounding the Vapnik-Chervonenkis dimen- 
sion of concept classes parameterized by real numbers", Proc. of the 
6th Annual ACM Conference on Computational Learning Theory, 361 
- 369. 
D. Haussler, "Decision theoretic generalizations of the PAC model 
for neural nets and other learning applications", Information and 
Computation, vol. 100, 1992, 78 - 150 
J. S. Judd, "Neural Network Design and the Complexity of Learning", 
MIT-Press (Cambridge, 1990) 
M. Kearns, L. Valiant, "Cryptographic limitations on learning 
boolean formulae and finite automata", Proc. of the 21st ACM Sym- 
posium on Theory of Computing, 1989, 433 - 444 
M. J. Kearns, R. E. Schapire, L. M. Sellie, "Toward efficient agnostic 
learning", Proc. of the 5th ACM Workshop on Uomputational Learn- 
ing Theory, 1992, 341 - 352 
W. Maass, "Bounds for the computational power and learning com- 
plexity of analog neural nets" (extended abstract), Proc. of the Sth 
ACM Symposium on Theory of Computing, 1993, 335- 344. Journal 
version submitted for publication 
W. Maass, "Agnostic PAC-learning of functions on analog neural 
nets" (journal version), to appear in Neural Computation. 
J. Milnor, "On the Betti numbers of real varieties", Proc. of the Amer- 
ican Math. ,.qoc., vol. 15, 1964, 275 - 280 
C. H. Papadimitriou, K. Steiglitz, "Combinatorial Optimization: Al- 
gorithms and Complexity", Prentice Hall (Englewood Cliffs, 1982) 
L. G. Valiant, "A theory of the learnable", Comm. of the ACM, vol. 
27, 1984, 1134- 1142 
