Threshold Network Learning in the Presence of 
Equivalences 
John Shawe-Taylor 
Department of Computer Science 
Royal Holloway and Bedford New College 
University of London 
Egham, Surrey TW20 0EX, UK 
Abstract 
This paper applies the theory of Probably Approximately Correct (PAC) 
learning to multiple output feedforward threshold networks in which the 
weights conform to certain equivalences. It is shown that the sample size 
for reliable learning can be bounded above by a formula similar to that 
required for single output networks with no equivalences. The best previously
obtained bounds are improved for all cases.
1 INTRODUCTION
This paper develops the results of Baum and Haussler [3] bounding the sample sizes
required for reliable generalisation of a single output feedforward threshold network.
They prove their result using the theory of Probably Approximately Correct (PAC)
learning introduced by Valiant [11]. They show that for $0 < \epsilon \le 1/2$, if a sample of
size
$$m \ge m_0 = \frac{64W}{\epsilon} \log \frac{64N}{\epsilon}$$
is loaded into a feedforward network of linear threshold units with $N$ nodes and $W$
weights, so that a fraction $1 - \epsilon/2$ of the examples are correctly classified, then with
confidence approaching certainty the network will correctly classify a fraction $1 - \epsilon$
of future examples drawn according to the same distribution. A similar bound was
obtained for the case when the network correctly classified the whole sample. The
results below will imply a significant improvement to both of these bounds.
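To give a feel for the scale of this bound, the following minimal sketch evaluates
$m_0 = (64W/\epsilon) \log(64N/\epsilon)$ with natural logarithms. The function name
`baum_haussler_m0` and the example sizes are illustrative assumptions and do not
come from [3].

```python
import math

def baum_haussler_m0(W: int, N: int, eps: float) -> float:
    """Sample-size bound m0 = (64 W / eps) * log(64 N / eps), natural log."""
    if not 0 < eps <= 0.5:
        raise ValueError("the bound is stated for 0 < eps <= 1/2")
    return (64 * W / eps) * math.log(64 * N / eps)

# Illustrative values only: W = 100 weights, N = 10 nodes, eps = 0.1.
print(math.ceil(baum_haussler_m0(100, 10, 0.1)))
```

For fixed $N$ and $\epsilon$ the bound is linear in $W$, which is why reducing the
effective number of weights is the quantity of interest in what follows.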
In many cases training can be simplified if known properties of a problem can 
be incorporated into the structure of a network before training begins. One such 
technique is described by Shawe-Taylor [9], though many similar techniques have
been applied, as for example in TDNNs [6]. The effect of these restrictions is to
constrain groups of weights to take the same value, and learning algorithms are
adapted to respect this constraint.
In this paper we consider the effect of this restriction on the generalisation
performance of the networks and in particular the sample sizes required to obtain a
given level of generalisation. This extends the work described above by Baum and
Haussler [3] by improving their bounds and also improving the results of
Shawe-Taylor and Anthony [10], who consider generalisation of multiple-output threshold
networks. The remarkable fact is that in all cases the formula obtained is the same, 
where we now understand the number of weights W to be the number of weight 
classes, but N is still the number of computational nodes. 
2 DEFINITIONS AND MAIN RESULTS 
2.1 SYMMETRY AND EQUIVALENCE NETWORKS 
We begin with a definition of threshold networks. To simplify the exposition it is 
convenient to incorporate the threshold value into the set of weights. This is done 
by creating a distinguished input that always has value 1 and is called the threshold 
input. The following is a formal notation for these systems. 
A network $\mathcal{N} = (C, I, O, n_0, E)$ is specified by a set $C$ of computational nodes, a
set $I$ of input nodes, a subset $O \subseteq C$ of output nodes and a node $n_0 \in I$, called the
threshold node. The connectivity is given by a set $E \subseteq (C \cup I) \times C$ of connections,
with $\{n_0\} \times C \subseteq E$.
With network $\mathcal{N}$ we associate a weight function $w$ from the set of connections to
the real numbers. We say that the network $\mathcal{N}$ is in state $w$. For input vector $i$
with values in some subset of the set $\mathbb{R}$ of real numbers, the network computes a
function $F_{\mathcal{N}}(w, i)$.
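The definition can be made concrete with a minimal sketch of how such a network
computes its function. The representation is an assumption made for illustration:
connections are keys `(i, j)` of a dictionary of weights, the caller supplies the
computational nodes in feedforward order, and a node outputs 1 when its weighted
sum is at least 0 (the paper does not fix this convention).

```python
from typing import Dict, List, Tuple

Node = str
Weights = Dict[Tuple[Node, Node], float]  # key (i, j): connection from i to j

def compute(order: List[Node], outputs: List[Node], n0: Node,
            w: Weights, inputs: Dict[Node, float]) -> Tuple[int, ...]:
    """Evaluate the network function for weight assignment w and input vector.

    `order` lists the computational nodes so that every node appears after
    all nodes feeding it; the connection set E is the key set of `w`.
    """
    value: Dict[Node, float] = dict(inputs)
    value[n0] = 1.0  # the threshold input always has value 1
    for j in order:
        total = sum(wt * value[i] for (i, jj), wt in w.items() if jj == j)
        value[j] = 1.0 if total >= 0 else 0.0  # assumed convention: fire at >= 0
    return tuple(int(value[o]) for o in outputs)

# A single node computing AND of two binary inputs: x1 + x2 - 1.5 >= 0.
w_and = {("x1", "c"): 1.0, ("x2", "c"): 1.0, ("n0", "c"): -1.5}
print([compute(["c"], ["c"], "n0", w_and, {"x1": a, "x2": b})
       for a in (0, 1) for b in (0, 1)])
```

Folding the threshold into the weight on `n0` is what lets the count $W$ cover
thresholds and ordinary weights uniformly.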
An automorphism $\gamma$ of a network $\mathcal{N} = (C, I, O, n_0, E)$ is a bijection of the nodes of
$\mathcal{N}$ which fixes $I$ setwise and $\{n_0\} \cup O$ pointwise, such that the induced action fixes
$E$ setwise. We say that an automorphism $\gamma$ preserves the weight assignment $w$ if
$w_{ji} = w_{(\gamma j)(\gamma i)}$ for all $i \in I \cup C$, $j \in C$. Let $\gamma$ be an automorphism of a network
$\mathcal{N} = (C, I, O, n_0, E)$ and let $i$ be an input to $\mathcal{N}$. We denote by $i^\gamma$ the input whose
value on input $k$ is that of $i$ on input $\gamma^{-1} k$.
The following theorem is a natural generalisation of part of the Group Invariance
Theorem of Minsky and Papert [8] to multi-layer perceptrons.
Theorem 2.1 [9] Let $\gamma$ be a weight preserving automorphism of the network
$\mathcal{N} = (C, I, O, n_0, E)$ in state $w$. Then for every input vector $i$
$$F_{\mathcal{N}}(w, i) = F_{\mathcal{N}}(w, i^\gamma).$$
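The invariance can be checked numerically on a small case. The sketch below uses a
hypothetical network with inputs `a`, `b`, hidden nodes `h1`, `h2` and output `o`,
together with the automorphism that swaps `a` with `b` and `h1` with `h2`. The
node names, the weights and the firing convention (output 1 when the weighted sum
is at least 0) are all assumptions chosen for illustration.

```python
import itertools

def compute(order, outputs, w, inputs):
    """Forward pass of a linear threshold network; 'n0' is the threshold input."""
    value = dict(inputs, n0=1.0)
    for j in order:
        s = sum(wt * value[i] for (i, jj), wt in w.items() if jj == j)
        value[j] = 1.0 if s >= 0 else 0.0
    return tuple(int(value[o]) for o in outputs)

# gamma swaps a<->b and h1<->h2, and fixes n0 and the output o.
gamma = {"a": "b", "b": "a", "h1": "h2", "h2": "h1", "n0": "n0", "o": "o"}
gamma_inv = {v: k for k, v in gamma.items()}

# Weights chosen so that w[(i, j)] == w[(gamma i, gamma j)] for every connection.
w = {("a", "h1"): 2.0, ("b", "h2"): 2.0,
     ("b", "h1"): -1.0, ("a", "h2"): -1.0,
     ("n0", "h1"): -0.5, ("n0", "h2"): -0.5,
     ("h1", "o"): 1.0, ("h2", "o"): 1.0, ("n0", "o"): -1.5}
assert all(w[(gamma[i], gamma[j])] == wt for (i, j), wt in w.items())

def permute_input(i):
    """Return the permuted input: its value on k is that of i on gamma^{-1} k."""
    return {k: i[gamma_inv[k]] for k in i}

order = ["h1", "h2", "o"]
invariant = all(
    compute(order, ["o"], w, {"a": x, "b": y})
    == compute(order, ["o"], w, permute_input({"a": x, "b": y}))
    for x, y in itertools.product((0.0, 1.0), repeat=2))
print(invariant)
```

The check only covers binary inputs for one weight assignment, so it illustrates the
theorem rather than proving it.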
Following this theorem it is natural to consider the concept of a symmetry
network [9]. This is a pair $(\mathcal{N}, \Gamma)$, where $\mathcal{N}$ is a network and $\Gamma$ a group of weight
preserving automorphisms of $\mathcal{N}$. We will also refer to the automorphisms as
symmetries. For a symmetry network $(\mathcal{N}, \Gamma)$, we term the orbits of the connections $E$
under the action of $\Gamma$ the weight classes.
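The weight classes of a small symmetry network can be computed directly as orbits.
The following sketch assumes the group is given by a list of generating node
permutations (each a dictionary) and uses a hypothetical network with inputs `a`,
`b`, threshold input `n0`, hidden nodes `h1`, `h2` and output `o`; none of these
names come from [9].

```python
def weight_classes(connections, generators):
    """Orbits of connections under the group generated by node permutations."""
    remaining = set(connections)
    classes = []
    while remaining:
        start = remaining.pop()
        orbit, frontier = {start}, [start]
        while frontier:
            i, j = frontier.pop()
            for g in generators:
                image = (g[i], g[j])  # induced action on a connection
                if image not in orbit:
                    orbit.add(image)
                    frontier.append(image)
        remaining -= orbit
        classes.append(orbit)
    return classes

# One generator: swap a<->b and h1<->h2, fixing n0 and o.
gamma = {"a": "b", "b": "a", "h1": "h2", "h2": "h1", "n0": "n0", "o": "o"}
E = [("a", "h1"), ("b", "h1"), ("n0", "h1"),
     ("a", "h2"), ("b", "h2"), ("n0", "h2"),
     ("h1", "o"), ("h2", "o"), ("n0", "o")]
classes = weight_classes(E, [gamma])
print(len(E), "connections in", len(classes), "weight classes")
```

Closing under the generators alone is enough here because every permutation of a
finite set has its inverse among its own powers.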
Finally we introduce the concept of an equivalence network. This definition
abstracts from the symmetry networks precisely those properties we require to obtain
our results. The class of equivalence networks is, however, far larger than that
of symmetry networks and includes many classes of networks studied by other
researchers [6, 7].
Definition 2.2 An equivalence network is a threshold network in which an
equivalence relation is defined on both weights and nodes. The two relations are required
to be compatible in that weights in the same class are connected to nodes in the same
class, while nodes in the same class have the same set of input weight connection
types. The weights in an equivalence class are at all times required to remain equal.
Note that every threshold network can be viewed as an equivalence network by 
taking the trivial equivalence relations. We now show that symmetry networks 
are indeed equivalence networks with the same weight classes and give a further 
technical lemma. For both lemmas proofs are omitted. 
Lemma 2.3 A symmetry network $(\mathcal{N}, \Gamma)$ is an equivalence network, where the
equivalence classes are the orbits of connections and nodes respectively.
Lemma 2.4 Let $\mathcal{N}$ be an equivalence network and $\mathcal{C}$ be the set of classes of nodes.
Then there is an indexing of the classes, $C_i$, $i = 1, \ldots, n$, such that nodes in $C_i$ do
not have connections from nodes in $C_j$ for $j \ge i$.
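An indexing of this kind is a topological ordering of the node classes. The sketch
below is one way to compute it, assuming the classes are given as a map from
computational node to class label and using the standard-library
`graphlib.TopologicalSorter`; the function name `index_classes` and the example
network are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def index_classes(node_class, connections):
    """Order node classes so that each class only receives connections from
    earlier classes (input nodes, absent from node_class, are ignored)."""
    preds = {c: set() for c in set(node_class.values())}
    for src, dst in connections:
        if src in node_class:  # connection between computational nodes
            preds[node_class[dst]].add(node_class[src])
    # static_order() raises CycleError when the class graph contains a cycle,
    # in which case no valid indexing exists.
    return list(TopologicalSorter(preds).static_order())

# Hypothetical example: two hidden nodes in one class feeding one output node.
node_class = {"h1": "H", "h2": "H", "o": "O"}
connections = [("a", "h1"), ("b", "h2"), ("h1", "o"), ("h2", "o")]
print(index_classes(node_class, connections))
```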
2.2 MAIN RESULTS 
We are now in a position to state our main results. Note that throughout this paper 
log means natural logarithm, while an explicit subscript is used for other bases. 
Theorem 2.5 Let $\mathcal{N}$ be an equivalence network with $W$ weight classes and $N$
computational nodes. If the network correctly computes a function on a set of $m$ inputs
drawn independently according to a fixed probability distribution, where
$$m \ge m_0(\epsilon, \delta) = \frac{1}{\epsilon(1 - \sqrt{\epsilon})} \left[ \log \frac{1.3}{\delta} + 2W \log \frac{6\sqrt{N}}{\epsilon} \right],$$
then with probability at least $1 - \delta$ the error rate of the network will be less than $\epsilon$
on inputs drawn according to the same distribution.
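As a rough illustration of how weight sharing affects this bound, the sketch below
evaluates $m_0(\epsilon, \delta) = \frac{1}{\epsilon(1-\sqrt{\epsilon})}[\log(1.3/\delta) + 2W\log(6\sqrt{N}/\epsilon)]$
for two values of $W$ with $N$ held fixed. The function name `m0_consistent` and
the network sizes are hypothetical, chosen only to show the dependence on $W$.

```python
import math

def m0_consistent(W: int, N: int, eps: float, delta: float) -> float:
    """m0 = [log(1.3/delta) + 2 W log(6 sqrt(N)/eps)] / (eps (1 - sqrt(eps)))."""
    if not (0 < eps < 1 and 0 < delta < 1):
        raise ValueError("eps and delta must lie strictly between 0 and 1")
    bracket = math.log(1.3 / delta) + 2 * W * math.log(6 * math.sqrt(N) / eps)
    return bracket / (eps * (1 - math.sqrt(eps)))

# Weight sharing lowers W while N stays fixed (hypothetical sizes).
print(math.ceil(m0_consistent(W=1000, N=50, eps=0.1, delta=0.01)))
print(math.ceil(m0_consistent(W=100, N=50, eps=0.1, delta=0.01)))
```

Since the term in $W$ dominates the bracket, the bound shrinks almost in proportion
to the number of weight classes.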
Theorem 2.6 Let $\mathcal{N}$ be an equivalence network with $W$ weight classes and $N$
computational nodes. If the network correctly computes a function on a fraction
$1 - (1 - \gamma)\epsilon$ of $m$ inputs drawn independently according to a fixed probability
distribution, where
$$m \ge m_0(\epsilon, \delta, \gamma) = \frac{1}{\gamma^2 \epsilon (1 - \sqrt{\epsilon/N})} \left[ 4 \log \frac{4}{\delta} + 6W \log \frac{4N}{\gamma^{2/3} \epsilon} \right],$$
then with probability at least $1 - \delta$ the error rate of the network will be less than $\epsilon$
on inputs drawn according to the same distribution.
3 THEORETICAL BACKGROUND 
3.1 DEFINITIONS AND PREVIOUS RESULTS 
In order to present results for binary outputs ({0, 1} functions) and larger ranges 
in a unified way we will consider throughout the task of learning the graph of a 
function. All the definitions reduce to the standard ones when the outputs are 
binary. 
We consider learning from examples as selecting a suitable function from a set H of 
hypotheses, being functions from a space X to set Y, which has at most countable 
size. At all times we consider an (unknown) target function 
$$c : X \to Y$$
which we are attempting to learn. To this end the space X is required to be a 
probabihty space (X, Ig,/), with appropriate regularity conditions so that the sets 
considered are measurable [4]. In particular the hypotheses should be measurable 
when Y is given the discrete topology as should the error sets defined below. The 
space S: X x Y is equipped with a a-algebra Ig x 2 Y and measure v: v(/, c), 
defined by its value on sets of the form U x {y}: 
Using this measure the error of a hypothesis 
The introduction of v allows us to consider 
is defined to be 
Slh()  Y). 
samples being drawn from S, as they 
will automatically reflect the output value of the target. This approach freely 
generalises to stochastic concepts though we will restrict ourselves to target func- 
tions for the purposes of this paper. The error of a hypothesis h on a sample 
x = ((zx, yx),...,(z,,y,))  S " is defined to be 
We also define the VC dimension of a set of hypotheses by reference to the product
space $S$. Consider a sample $\mathbf{x} = ((x_1, y_1), \ldots, (x_m, y_m)) \in S^m$ and the function
$$\mathbf{x}^* : H \to \{0, 1\}^m,$$
given by $\mathbf{x}^*(h)_i = 1$ if and only if $h(x_i) = y_i$, for $i = 1, \ldots, m$. We can now define
the growth function $B_H(m)$ as
$$B_H(m) = \max_{\mathbf{x} \in S^m} \left| \{ \mathbf{x}^*(h) \mid h \in H \} \right| \le 2^m.$$
The Vapnik-Chervonenkis dimension of a hypothesis space $H$ is defined as
$$\mathrm{VCdim}(H) = \begin{cases} \infty, & \text{if } B_H(m) = 2^m \text{ for all } m; \\ \max\{ m \mid B_H(m) = 2^m \}, & \text{otherwise.} \end{cases}$$
In the case of a threshold network $\mathcal{N}$, the set of functions obtainable using all
possible weight assignments is termed the hypothesis space of $\mathcal{N}$ and we will refer
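to it as $\mathcal{N}$. For a threshold network $\mathcal{N}$, we also introduce the state growth function
$S_{\mathcal{N}}(m)$. This is defined by first considering all computational nodes to be output
nodes, and then counting different output sequences:
$$S_{\mathcal{N}}(m) = \max_{\mathbf{x} = (i_1, \ldots, i_m) \in X^m} \left| \{ (F_{\mathcal{N}'}(w, i_1), F_{\mathcal{N}'}(w, i_2), \ldots, F_{\mathcal{N}'}(w, i_m)) \mid w : E \to \mathbb{R} \} \right|$$
where $X = [0, 1]^{|I|}$ and $\mathcal{N}'$ is obtained from $\mathcal{N}$ by setting $O = C$. We clearly have
that for all $\mathcal{N}$ and $m$, $B_{\mathcal{N}}(m) \le S_{\mathcal{N}}(m)$.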
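Theorem 3.1 [] If a hypothesis space $H$ has growth function $B_H(m)$ then for
any $\epsilon > 0$ and $k > m$ and
$$0 < r < 1 - \frac{1}{\sqrt{\epsilon k}}$$
the probability that there is a function in $H$ which agrees with a randomly chosen
$m$ sample and has error greater than $\epsilon$ is less than
$$\frac{\epsilon k (1 - r)^2}{\epsilon k (1 - r)^2 - 1} \, B_H(m + k) \exp\left\{ -r \epsilon \frac{km}{m + k} \right\}.$$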
This result can be used to obtain the following bound on sample size required for 
PAC learnability of a hypothesis space with VC dimension d. The theorem improves 
the bounds reported by Blumer et al. [4]. 
Theorem 3.2 [] If a hypothesis space $H$ has finite VC dimension $d > 1$, then
there is $m_0 = m_0(\epsilon, \delta)$ such that if $m > m_0$ then the probability that a hypothesis
consistent with a randomly chosen sample of size $m$ has error greater than $\epsilon$ is less
than $\delta$. A suitable value of $m_0$ is
$$m_0 = \frac{1}{\epsilon(1 - \sqrt{\epsilon})} \left[ \log \frac{d/(d - 1)}{\delta} + 2d \log \frac{6}{\epsilon} \right].$$
For the case when we allow our hypothesis to incorrectly compute the function on 
a small fraction of the training sample, we have the following result. Note that we 
are still considering the discrete metric and so in the case where we are considering 
multiple output feedforward networks a single output in error would count as an 
overall error. 
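Theorem 3.3 [10] Let $0 < \epsilon < 1$ and $0 < \gamma \le 1$. Suppose $H$ is a hypothesis space
of functions from an input space $X$ to a possibly countable set $Y$, and let $\nu$ be any
probability measure on $S = X \times Y$. Then the probability (with respect to $\nu^m$) that,
for $\mathbf{x} \in S^m$, there is some $h \in H$ such that
$$\mathrm{er}_\nu(h) > \epsilon \quad \text{and} \quad \mathrm{er}_{\mathbf{x}}(h) \le (1 - \gamma)\, \mathrm{er}_\nu(h)$$
is at most
$$4 B_H(2m) \exp\left( -\frac{\gamma^2 \epsilon m}{4} \right).$$
Furthermore, if $H$ has finite VC dimension $d$, this quantity is less than $\delta$ for
$$m > m_0(\epsilon, \delta, \gamma) = \frac{1}{\gamma^2 \epsilon (1 - \sqrt{\epsilon})} \left[ 4 \log \frac{4}{\delta} + 6d \log \frac{4}{\gamma^{2/3} \epsilon} \right].$$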
4 THE GROWTH FUNCTION FOR EQUIVALENCE 
NETWORKS 
We will bound the number of output sequences $B_{\mathcal{N}}(m)$ for a number $m$ of inputs
by the number of distinct state sequences $S_{\mathcal{N}}(m)$ that can be generated from the
$m$ inputs by different weight assignments. This follows the approach taken in [10].
Theorem 4.1 Let $\mathcal{N}$ be an equivalence network with $W$ weight equivalence classes
and a total of $N$ computational nodes. Then we can bound $S_{\mathcal{N}}(m)$ by
$$S_{\mathcal{N}}(m) \le \left( \frac{emN}{W} \right)^W.$$
Idea of Proof: Let $C_i$, $i = 1, \ldots, n$, be the equivalence classes of nodes indexed
as guaranteed by Lemma 2.4 with $|C_i| = c_i$ and the number of inputs for nodes in
$C_i$ being $n_i$ (including the threshold input). Denote by $\mathcal{N}_j$ the network obtained
by taking only the first $j$ node equivalence classes. We omit a proof by induction
that
$$S_{\mathcal{N}_j}(m) \le \prod_{i=1}^{j} B_i(m c_i),$$
where $B_i$ is the growth function for nodes in the class $C_i$.
Using the well known bound on the growth function of a threshold node with $n_i$
inputs we obtain
$$S_{\mathcal{N}}(m) \le \prod_{i=1}^{n} \left( \frac{e m c_i}{n_i} \right)^{n_i}.$$
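Consider the function $f(x) = x \log x$. This is a convex function and so for a set of
values $x_1, \ldots, x_M$, we have that the average of $f(x_i)$ is greater than or equal to $f$
applied to the average of $x_i$. Consider taking the $x$'s to be $c_i$ copies of $n_i/c_i$ for
each $i = 1, \ldots, n$. Since $\sum_i c_i = N$ and $\sum_i n_i = W$, the average of these values is
$W/N$ and we obtain
$$\frac{1}{N} \sum_{i=1}^{n} n_i \log \frac{n_i}{c_i} \ge \frac{W}{N} \log \frac{W}{N} \quad \text{or} \quad \sum_{i=1}^{n} n_i \log \frac{c_i}{n_i} \le W \log \frac{N}{W},$$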
and so
$$S_{\mathcal{N}}(m) \le \left( \frac{emN}{W} \right)^W$$
as required.
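The convexity step can be checked numerically. The sketch below compares, in
logarithms, the product $\prod_i (e m c_i / n_i)^{n_i}$ with the closed form
$(emN/W)^W$, taking $N = \sum_i c_i$ and $W = \sum_i n_i$. The class sizes `c` and
input counts `n` are hypothetical values chosen only to exercise the inequality.

```python
import math

def log_product_bound(m, c, n):
    """log of prod_i (e m c_i / n_i)^{n_i}."""
    return sum(ni * math.log(math.e * m * ci / ni) for ci, ni in zip(c, n))

def log_closed_form(m, N, W):
    """log of (e m N / W)^W."""
    return W * math.log(math.e * m * N / W)

# Hypothetical node classes: sizes c_i and inputs per node n_i.
c = [8, 4, 1]
n = [3, 9, 5]
N, W = sum(c), sum(n)

for m in (20, 100, 10_000):
    print(m, log_product_bound(m, c, n) <= log_closed_form(m, N, W))
```

The two sides differ only through $\sum_i n_i \log(c_i/n_i)$ and $W \log(N/W)$, so
the comparison does not depend on $m$.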
The bounds we have obtained make it possible to bound the Vapnik-Chervonenkis
dimension of equivalence networks. Though we will not need these results, we
give them here for completeness.
Proposition 4.2 The Vapnik-Chervonenkis dimension of an equivalence network
with $W$ weight classes and $N$ computational nodes is bounded by
$$2W \log_2 eN.$$
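One standard route to a bound of this form (an assumption here, since no proof is
given) is to note that $m$ inputs cannot be shattered once an upper bound on the
growth function falls below $2^m$. The sketch below checks numerically that the
bound $(emN/W)^W$ is below $2^m$ at $m = 2W \log_2 eN$, for $N \ge 2$ and a few
hypothetical values of $W$. The restriction to $N \ge 2$ is a choice made for this
check.

```python
import math

def vc_bound(W: int, N: int) -> float:
    """The quantity 2 W log2(e N)."""
    return 2 * W * math.log2(math.e * N)

def log2_growth_bound(m: float, W: int, N: int) -> float:
    """log2 of (e m N / W)^W."""
    return W * math.log2(math.e * m * N / W)

# Compare log2 of the growth bound with m itself at m = 2 W log2(e N).
below = all(
    log2_growth_bound(vc_bound(W, N), W, N) < vc_bound(W, N)
    for W in (1, 10, 1000) for N in range(2, 200))
print(below)
```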
5 PROOF OF MAIN RESULTS 
Using the results of the last section we are now in a position to prove Theorems 2.5 
and 2.6. 
Proof of Theorem 2.5: (Outline) We use Theorem 3.1 which bounds the
probability that a hypothesis with error greater than $\epsilon$ can match an $m$-sample.
Substituting our bound on the growth function of an equivalence network and choosing
Tm 
and r as in [1], we obtain the following bound on the probability 
- W: )w N TM exp(-em). 
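By choosing $m > m_0$ where $m_0$ is given by
$$m_0 = \frac{1}{\epsilon(1 - \sqrt{\epsilon})} \left[ \log \frac{1.3}{\delta} + 2W \log \frac{6\sqrt{N}}{\epsilon} \right]$$
we guarantee that the above probability is less than $\delta$ as required.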
Our second main result can be obtained more directly. 
Proof of Theorem 2.6: (Outline) We use Theorem 3.3 which bounds the
probability that a hypothesis with error greater than $\epsilon$ can match all but a fraction
$(1 - \gamma)$ of an $m$-sample. The bound on the sample size is obtained from the
probability bound by using the inequality for $B_H(2m)$. By adjusting the parameters we
will convert the probability expression to that obtained by substituting our growth
function. We can then read off a sample size by the corresponding substitution in
the sample size formula. Consider setting $d = W$, $\epsilon = \epsilon'/N$ and $m = Nm'$. With
these substitutions the sample size formula is
$$m' = \frac{1}{\gamma^2 \epsilon' (1 - \sqrt{\epsilon'/N})} \left[ 4 \log \frac{4}{\delta} + 6W \log \frac{4N}{\gamma^{2/3} \epsilon'} \right]$$
as required.
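The substitution can be checked numerically. The sketch below assumes the
single-network sample size $\frac{1}{\gamma^2\epsilon(1-\sqrt{\epsilon})}[4\log(4/\delta) + 6d\log(4/(\gamma^{2/3}\epsilon))]$
for VC dimension $d$, evaluates it at $d = W$ and $\epsilon = \epsilon'/N$, divides
by $N$, and compares the result with the formula above. The function names and
parameter values are hypothetical.

```python
import math

def m0_vc(d, eps, delta, gamma):
    """Sample size for VC dimension d when a fraction of errors is allowed."""
    bracket = (4 * math.log(4 / delta)
               + 6 * d * math.log(4 / (gamma ** (2 / 3) * eps)))
    return bracket / (gamma ** 2 * eps * (1 - math.sqrt(eps)))

def m0_equivalence(W, N, eps, delta, gamma):
    """Sample size for an equivalence network with W classes and N nodes."""
    bracket = (4 * math.log(4 / delta)
               + 6 * W * math.log(4 * N / (gamma ** (2 / 3) * eps)))
    return bracket / (gamma ** 2 * eps * (1 - math.sqrt(eps / N)))

W, N, eps, delta, gamma = 100, 20, 0.1, 0.05, 0.5
lhs = m0_vc(d=W, eps=eps / N, delta=delta, gamma=gamma) / N  # m' = m / N
rhs = m0_equivalence(W, N, eps, delta, gamma)
print(math.isclose(lhs, rhs))
```

Taking $\gamma = 1/2$ corresponds to a network that classifies a fraction
$1 - \epsilon/2$ of the sample correctly.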
6 CONCLUSION 
The problem of training feedforward neural networks remains a major hurdle to the
application of this approach to large scale systems. A very promising technique for
simplifying the training problem is to include equivalences in the network structure
which can be justified by a priori knowledge of the application domain. This paper
has extended previous results concerning sample sizes for feedforward networks to
cover so-called equivalence networks in which weights are constrained in this way.
At the same time we have improved the sample size bounds previously obtained for 
standard threshold networks [3] and multiple output networks [10]. 
The results are of the same order as previous results and imply similar bounds on
the Vapnik-Chervonenkis dimension, namely $2W \log_2 eN$. They perhaps give circumstantial
evidence for the conjecture that the $\log_2 eN$ factor in this expression is real, in that
the same expression obtains even if the number of computational nodes is increased
by expanding the equivalence classes of weights. Equivalence networks may be a
useful area to search for high growth functions and perhaps show that for certain
classes the VC dimension is $\Omega(W \log N)$.
References 
[1] Martin Anthony, Norman Biggs and John Shawe-Taylor, Learnability and
Formal Concept Analysis, RHBNC Department of Computer Science, Technical
Report, CSD-TR-624, 1990.
[2] Martin Anthony, Norman Biggs and John Shawe-Taylor, The learnability of
formal concepts, Proc. COLT '90, Rochester, NY. (eds Mark Fulk and John
Case) (1990) 246-257.
[3] Eric Baum and David Haussler, What size net gives valid generalization, Neural
Computation, 1 (1) (1989) 151-160.
[4] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler and Manfred K.
Warmuth, Learnability and the Vapnik-Chervonenkis dimension, JACM, 36 (4)
(1989) 929-965.
[5] David Haussler, preliminary extended abstract, COLT '89.
[6] K. Lang and G.E. Hinton, The development of TDNN architecture for speech
recognition, Technical Report CMU-CS-88-152, Carnegie-Mellon University,
1988.
[7] Y. le Cun, A theoretical framework for back propagation, in D. Touretzky,
editor, Connectionist Models: A Summer School, Morgan-Kaufmann, 1988.
[8] M. Minsky and S. Papert, Perceptrons, expanded edition, MIT Press,
Cambridge, USA, 1988.
[9] John Shawe-Taylor, Building Symmetries into Feedforward Network
Architectures, Proceedings of First IEE Conference on Artificial Neural Networks,
London, 1989, 158-162.
[10] John Shawe-Taylor and Martin Anthony, Sample Sizes for Multiple Output
Feedforward Networks, Network, 2 (1991) 107-117.
[11] Leslie G. Valiant, A theory of the learnable, Communications of the ACM, 27
(1984) 1134-1142.
