Learning in Feedforward Networks with Nonsmooth 
Functions 
Nicholas J. Redding* 
Information Technology Division 
Defence Science and Tech. Org. 
P.O. Box 1600 Salisbury 
Adelaide SA 5108 Australia 
T. Downs 
Intelligent Machines Laboratory 
Dept of Electrical Engineering 
University of Queensland 
Brisbane Q 4072 Australia 
Abstract 
This paper is concerned with the problem of learning in networks where some 
or all of the functions involved are not smooth. Examples of such networks are 
those whose neural transfer functions are piecewise-linear and those whose error 
function is defined in terms of the  norm. 
Up to now, networks whose neural transfer functions are piecewise-linear have 
received very little consideration in the literature, but the possibility of using an 
earor function defined in terms of the  norm has received some attention. In 
this latter work, however, the problems that can occur when gradient methods are 
used for nonsmooth error functions have not been addressed. 
In this paper we draw upon some recent results from the field of nonsmooth 
optimi?ofion (NSO) to present an algorithm for the nonsmooth case. Our moti- 
vation for this work arose out of the fact that we have been able to show that, 
in backpropagation, an error function based upon the  norm overcomes the 
difficulties which can occur when using the 2 norm. 
1 INTRODUCTION 
This paper is concerned with the problem of learning in networks where some or all of 
the functions involved are not smooth. Examples of such networks are those whose neural 
transfer functions are piecewise-linear and those whose error function is defined in terms 
of the  norm. 
*The author can be conmcd vh emafi at met address reddingO itd. dst o. oz. au. 
1056 
Learning in Feedforward Networks with Nonsmooth Functions 1057 
Up to now, networks whose neural transfer functions are piecewise-linear have received 
very little consideration in the literature, but the possibility of using an error function defined 
in terms of the oo norm has received some attention [1]. In the work described in [1], 
however, the problems that can occur when gradient methods are used for nonsmooth error 
functions have not been addressed. 
In this paper we draw upon some recent results from the field of nonsmooth optimi?_a_fion 
(NSO) to present an algorithm for the nonsmooth case. Our motivation for this work arose 
out of the fact that we have been able to show [2] that an error function based upon the 
norm overcomes the difficulties which can occur when using backpropagation's 2 norm 
[4]. 
The framework for NSO is the class of locally Lipschitzian functions [5]. Locally Lips- 
chitzian functions are a broad class of functions that include, but are not limited to, "smooth" 
(completely differenfiable) functions. (Note, however, that this framework does not include 
step-functions.) We here present a method for training feedforward networks (FFNs) whose 
behaviour can be described by a locally Lipschitzian function y = f(w, x), where the 
input vector x = (a:, ..., a:,) is an element of the set of patterns X C It ", w  It b is the 
weight vector, and y  It'"' is the m-dimensional output. 
The possible networks that fit within the locally Lipschitzian framework include any network 
that has a continuous, piecewise differenfiable description, i.e., continuous functions with 
nondifferentiable points ("nonsmooth functions"). 
Training a network involves the selection of a weight vector w* which minimizes an error 
function E(w). As long as the error function E is locally Lipschitzian, then it can be trained 
by the procedure that we will outline, which is based upon a new technique for NSO [6]. 
In Section 2, a description of the difficulties that can occur when gradient methods are 
applied to nonsmooth problems is presented. In Section 3, a short overview of the Bundle- 
Trust algorithm [6] for NSO is presented. And in Section 4 details of applying a NSO 
procedure to training networks with an oo based error function are presented, along with 
simulation results that demonstrate the viability of the technique. 
2 FAILURE OF GRADIENT METHODS 
Two difficulties which arise when gradient methods are applied to nonsmooth problems will 
be discussed here. The first is that gradient descent sometimes fails to converge to a local 
minimum, and the second relates to the lack of a stopping criterion for gradient methods. 
2.1 THE "JAMMING" EFFECT 
We will now show that gradient methods can fail to converge to a local minimum (the 
"jamming" effect [7,8]). The particular example used here is taken from [9]. 
Consider the following function, that has a minimum at the point w* = (0, 0): 
f(w) = 3(w 2 + 24). (1) 
If we start at the point w0 = (2, 1), it is easily shown that a steepest descent algorithm 2 
would generate the sequence w = (2,-1)/3, w2 = (2, 1)/9, ..., so that the sequence 
This is quite simple, using a theorem due to Krishnan [3]. 
rhis is achieved by repeatedly performing a line search along the steepest descent direction. 
1058 
Redding and Downs 
nond'erenliabl 
Figure 1: A contour plot of the function f3. 
{wk} oscillates between points on the two half-lines w = 2w2 and w = -2w2 for 
w > 0, converging to the optimal point w* = (0, 0). Next, from the function f, create a 
new function f2 in the following manner: 
f2(w) =  = V/3(w + 2w). (2) 
The gradient at any point of .t'2 is proportional to the gradient at the same point on fi, so 
the sequence of points generated by a gradient descent algorithm starting from (2, 1) on f2 
will be the same as the case for f, and will again converge 3 to the optimal point, again 
w* = (0, 0). 
Lastly, we shift the optimal point away from (0, 0), but keep a region including the sequence 
{wk } unchanged to create a new function/3 (w): 
{ q3(% 2+2w2 2) ifO<lw2l<2w, 
f3(w) = + 411) elsewhere. (3) 
The new function f3, depicted in fig. 1, is continuous, has a discontinuous derivative only 
on the hail-line w < 0, w2 = 0, and is convex with a "minimum" as w --. -oo. In spite 
of this, the steepest descent algorithm still converges to the now nonoptimal "jamming" 
point (0, 0). A multitude of possible variations to f exist that will achieve a similar result, 
but the point is clear: gradient methods can lead to trouble when applied to nonsmooth 
problems. 
This lesson is important, because the backpropagation learning algorithm is a smooth 
gradient descent technique, and as such will have the difficulties described when it, or an 
extension (eg., [1]), are applied to a nonsmooth problem. 
2.2 STOPPING CRITERION 
The second significant problem associated with smooth descent techniques in a nonsmooth 
context occurs with the stopping criterion. In normal smooth circumstances, a stopping 
aNote that for this new sequence of points, the gradient no longer converges to 0 at (0, 0), but 
oscillates between the values x/(1, :i:1 ). 
Learning in Feedforward Networks with Nonsmooth Functions 1059 
criterion is determined using 
IlV/ll < e, (4) 
where e is a small positive q_ntity determined by the required accuracy. However, it is 
frequently the case that the minimum of a nonsmooth function occurs at a nondifferenfiable 
point or "kink", and the gradient is of little value around these points. For example, the 
gradient of f(tv) = [to[ has a magnitude of 1 no matter how close w is to the optimum at 
3 NONSMOOTH OPTIMIZATION 
For any locally Lipschitzian function f, the generalized directional derivative always exists, 
and can be used to define a generalized gradient or subdifferential, denoted by Of, which 
is a compact convex set s [5]. A particular element g  Of(w) is termed a subgradient of 
f at w [5,10]. In situ_afions where f is strictly differentiable at w, the generalized gradient 
of f at w is equal to the gradient, i.e., Of(w) - Vf(w). 
We will now discuss the basic aspects of NSO and in particular the Bundle-Trust (BT) 
algorithm [6]. 
Quite naturally, subgradients in NSO provide a substitute for the gradients in standard 
smooth optimization using gradient descent. Accordingly, in an NSO pure, we require 
the following to be satisfied: 
At every w, we can compute f(w) and any g  Of(w). 
(5) 
To overcome the jamming effect, however, it is not sufficient replace the gradient with 
a subgradient in a gradient descent algorithm -- the strictly local information that this 
provides about the function's behaviour can be misleading. For example, an approach like 
this will not change the descent path taken from the starting point (2, 1) on the function fa 
(see fig. 1). 
The solution to this problem is to provide some "smearing" of the gradient information by 
enriching the information at w with knowledge of its surroundings. This can be achieved 
by replacing the strictly local subgradients g  Of(w) by LJvEB g  Of(v) where B is a 
suitable neighbourhood of w, and then define the e-generalized gradient 0f(w) as 
= U OS(v) 
vEB(w,0 
(6) 
where e > 0 and small, and co denotes a convex hull. These ideas were first used by [7] 
to overcome the lack of continuity in minimax problems, and have become the basis for 
extensive work in NSO. 
In an optimization prtwure, points in a sequence {wk, k = 0, 1,...} are visited until a 
point is reached at which a stopping criterion is satisfied. In a NSO pure, this occurs 
when a point wk is reached that satisfies the condition 0  Of(w), and the point is said 
to be e-optimal. That is, in the case of convex f, the point w is e-optimal if 
f(wt) < f(w) + 41w - w11 + e for all w  n 
(7) 
'*In other words, a set of vectors will define the generalized gradient of a nonsmooth function at a 
single point, rather than a single vector in the case of smooth functions. 
1060 Redding and Downs 
and in the case of nonconvex f, 
f(wt) < f(w) + llw -- wll + e for all w  a 
(s) 
where B is some neighbourhood of wk of nonzero dimension. Obviously, as e --, 0, then 
wk  w* at which 0 6 0f(w*), i.e., w is "within e" of the local minimum w*. 
Usually the e-generalized gradient is not available, and this is why the bundle concept is 
introduced. The basic idea of a bundle concept in NSO is to replace the e-generalized 
gradient by some inner approximating polytope P which will then be used to compute a 
descent direction. If the polytope P is a sufficiently good approximation to f, then we will 
find a direction along which to descend (a so-called serious step). In the case where P is not 
a sufficiently good approximation to f to yield a descent direction, then we perform a nu/l 
step, staying at our cunt position w, and try to improve P by adding another subgradient 
Of(v) at some nearby point v to our cunt position w. 
A natural way of approximating f is by using a cutting plane (CP) approximation. The CP 
approximation of f(w) at the point wt is given by the expression [6] 
max {g/t( wi) + f(wi)}, 
W-- 
(9) 
where gi is a subgradient of f at the point wi. We see then that (9) provides a piecewise 
linear approximation of convex 5 f from below, which will coincide with f at all points wi. 
For convenience, we redefine the CP approximation in terms of d = w - wt, d  R b, the 
vector difference of the point of approximation, w, and the current point in the optimiz_m_ion 
sequence, wt, giving the CP approximation f c of f: 
fc(wk, d) = max {g/d + g/t(w - wi) + f(wi)}. 
(10) 
Now, when the CP approximation is minimized to find a descent direction, there is no 
reason to trust the approximation far away from w. So, to discourage a large step size, a 
1 t 
stabilizing term  d d, where t is positive, is added to the CP approximation. 
If the CP approximation at wk of f is good enough, then the d given by 
ldt d 
dt= arg nn f c(wk, d) + 2tt 
will produce a descent direction such that a line search along wk + )d will find a new 
point w+ at which f(w+ ) < f(w) (a serious step). It may happen that fcv is such a 
poor approximation of f that a line search along dk is not a descent direction, or yields only 
a marginal improvement in f. ff this occurs, a null step is taken and one enriches the bundle 
of subgradients from which the CP approximation is computed by adding a subgradient 
from Of(w + Ad) for small A > 0. Each serious step guarantees a decrease in f, and a 
stopping criterion is provided by terminating the algorithm as soon as d in (11) satisfies 
the e-optimality criterion, at which point w is e-optimal. These details are the basis of 
bundle methods in NSO [9,10]. 
The bundle method described suffers from a weak point: its success depends on the delicate 
selection of the parameter tk in (11) [6]. This weakness has led to the incorporation of a 
"trust region" concept [11] into the bundle method to obtain the BT (bundle-trust) algorithm 
[6]. 
Sin the nonconvex f case, (9) is not an approximation to f from below, and additional tolerance 
parameters must be considered to accommodate this situation [6]. 
Learning in Feedforward Networks with Nonsmooth Functions 1061 
To incorporate a trust region, we define a "radius" that defines a ball in which we can "trust" that f c is a good approximation of f. In the BT algorithm, by following trust region 
concepts, the choice of tk is not made a priori and is deteannined during the algorithm by 
varying tk in a systematic way (trust part) and improving the CP approximation by nu/l 
steps (bundle part) until a satisfactory CP approximation f c is obtained along with a ball 
(in terms of t) on which we can trust the approximation. Then the d in (11) will lead to 
a substantial decrease in f. 
The full details of the BT algorithm can be found in [6], along with convergence proofs. 
4 EXAMPLES 
4.1 A SMOOTH NETWORK WITH NONSMOOTH ERROR FUNCTION 
The particular network example we consider here is a two-layer FFN (i.e., one with a single 
layer of hidden units) where each output unit's value yi is computed from its discriminant 
function Q o = wio + '= wij zj , by the transfer function yi = tanh( Q o ), where zj is the 
output of the j-th hidden unit. The j-th hidden unit's output zj is given by zj = tanh(Qas), 
where Qa = vjo + '= vj : zj is its discriminant function. The too error function (which 
is locally Lipschitzian) is defined to be 
E(w)= max max IQo.(X)-t(x)l, (12) 
XX lim 
where ti (x) is the desired output of output unit i for the input pattern x 6 X. 
To make use of the BT algorithm described in the previous section, it is necessary to obtain 
an expression from which a subgradient at w for E(w) in (12) can be computed. Using the 
generaliz gradient calculus in [5, Proposition 2.3.12], a subgradient g  0E(w) is given 
by the expression 6 
g = sgn (Qo,,(x') - ti,(x')) VwQo,,(x') for some i',x'  ,.7 
(14) 
where ,7 is the set of patterns and output indices for which E(w) in (12) obtains it maximum 
value, and the gradient VwQo,, (x') is given by 
VwQo,, (x') = { 
1 w.r.t. wi,o 
zj w.r.t. wi,j 
(1- zj2.)wi,s w.r.t.v./0 
a:(1 - zj2.)wi,s w.r.t. vSt 
0 elsewhere. 
(15) 
(Note that here j = 1,2,..., h and k = 1, ..., n). 
The BT technique outlined in the previous section was applied to the standard XOR and 
838 encoder problems using the oo error function in (12) and subgradients from (14,15). 
6Note that for a function f(w) = [w[ = max{w, -w}. the generalized gradient is given by the 
expression 
Of(w)= c 1,-1} z=0 (13) 
- z<0 
and a suitable subgradient g tE Of(w) can be obtained by choosing g = sgn(w). 
1062 Redding and Downs 
In all test runs, the BT algorithm was run until convergence to a local minimum of the too 
error function occuned with e set at 10 -4. On the XOR problem, over 20 test runs using a 
randomly initialized 2-2-1 network, an average of 52 function and subgradient evaluations 
were required. The minimum number of function and subgradient evaluations required 
in the test runs was 23 and the maximum was 126. On the 838 encoder problem, over 
20 test runs using a randomly initializ 8-3-8 network, an average of 334 function and 
subgradient evaluations were required. For this problem, the minimum number of function 
and subgradient evaluations required in the test runs was 221 and the maximum was 512. 
4.2 A NONSMOOTH NETWORK AND NONSMOOTH ERROR FUNCTION 
In this section we will consider a particular example that employs a network function that 
is nonsmooth as well as a nonsmooth error function (the oo error function of the previous 
example). 
Based on the piecewise-linear network employed by [12], let the i-th output of the network 
be given by the expression 
k=l '= 
with an oo-based error function 
+ wio (16) 
E(w) = max max ]yi(x) - ti(x)]. (17) 
XX lirn 
Once again using the generalized gradient calculus from [5, Proposition 2.3.12], a single 
subgradient g  0E(w) is given by the expression 
zn w.r.t. li k 
1 wx.t. wi,o 
I E=, vstzk + vsol wx.t. wiq (18) 
g - sgn(yi,(x') - tv(x')) wi,j sgn(5-]i= 1 vjx + vjo) w.r.t. vjo 
wv sgn(= vzk + vo)z  wx.t. v 
0 elsewhere. 
(Note that j = 1,2,..., h, k = 1,2,..., n). 
In all cases the e-stopping criterion is set at 10 -n. On the XOR problem, over 20 test 
nms using a randomly initialized 2-2-1 network, an average of 43 function and subgradient 
evaluations were required. The minimum number of function and subgradient evaluations 
required in the test runs was 30 and the maximum was 60. On the 838 encoder problem, 
over 20 test runs using a randomly initialized 8-3-8 network, an average of 445 function and 
subgradient evaluations were required. For this problem, the minimum number of function 
and subgradient evaluations required in the test runs was 386 and the maximum was 502. 
5 CONCLUSIONS 
We have demonstrated the viability of employing NSO for training networks in the case 
where standard procedures, with their implicit smoothness assumption, would have diffi- 
culties or find impossible. The particular nonsmooth examples we considered involved an 
error function based on the oo norm, for the case of a network with sigmoidal characteristics 
and a network with a piecewise-linear characteristic. 
Learning in Feedforward Networks with Nonsmooth Functions 1063 
Nonsmooth optimi?fion problems can be dealt with in many different ways. A possible 
altena_at_ive approach to the one presented here (that works for most NSO problems) is 
to express the problem as a composite function and then solve it using the exact penalty 
method (texmed composite NSO) [11]. Fletcher [11, p. 358] states that in practice this 
can require a great deal of storage or be too complicated to formulate. In contrast, the 
BT algorithm solves the more general basic NSO problem and so can be more widely 
applied than techniques based on composite functions. The BT algorithm is simplex to 
set up, but this can be at the cost of algorithm complexity and a comput_ot_ional overhead. 
The BT algorithm, however, does retain the gradient descent flavour of backpropagation 
because it uses the generalized gradient concept along with a chain rule for computing these 
(generalized) gradients. Nongradient-based and stochastic methods for NSO do exist, but 
they were not considered here because they do not retain the gradient-based deterministic 
flavour. It would be useful to see if these other techniques are faster for practical problems. 
The message should be clear however m smooth gradient techniques should be treated with 
suspicion when a nonsmooth problem is encountered, and in general the more complicated 
nonsmooth methods should be employed. 
References 
[10] 
[11] 
[12] 
[1] P. Bo, "A norm selection criterion for the generalized delta rule," IEEE 7Yans- 
actions on Neural Networks 2 (1991), 125-130. 
[2] N. J. Redding, "Some Aspects of Representation and Learning in Artificial Neural 
Networks," University of Queensland, PhD Thesis, June, 1991. 
[3] T. Krishnan, "On the threshold order of a Boolean function," IEEE Transactions on 
Electronic Computers EC- 15 (1966), 369-372. 
[4] M. L. Brady, R. Raghavan & J. Sitwhy, "Backpropagation fails to separate where 
perceptrons succeed," IEEE Transactions on Circuits and Systems 36 (1989). 
[5] E H. Clarke, Optimization and Nonsmooth Analysis, Canadian Mathematical Society 
Series of Monographs and Advanced Texts, John Wiley & Sons, New York, NY, 1983. 
[6] H. Schramm & J. Zowe, "A version of the bundle idea for minimizing a nonsmooth 
function: conceptual ideas, convergence analysis, numerical results," SIAM Journal on 
Optimization (1991 ), to appear. 
[7] V. E Dem'yanov & V. N. Malozemov, Introduction to Minimax, John Wiley & Sons, 
New York, NY, 1974. 
[8] P. Wolfe, "A method of conjugate subgradients for minimizing nondifferentiable func- 
tions," in NondJfferentiable Optimization, M. L. Balinski & P. Wolfe, eds., Mathematical 
Programming Study #3, North-Holland, Amsterdam, 1975, 145-173. 
[9] C. Lemar6chal, "Nondifferenfiable Optimization," in Optimization, G. L. Nemhauser, 
A. H. G. Rinnooy Kan & M. J. Todd, eds., Handbooks in Operations Research and 
Management Science # 1, North-Holland, Amsterdam, 1989, 529-572. 
K. C. Kiwiel, Methods of Descent for Nondifferentiable Optimization, Lect. Notes in 
Math. # 1133, Springer-Verlag, New York-Heidelberg-Berlin, 1985. 
R. Fletcher, Practica/Methods of Optimization second edition, John Wiley & Sons, 
New York, NY, 1987. 
R. Batruni, "A multilayer neural network with piecewise-linear structure and back- 
propagation learning,"/E.F./. Transactions on Neural Networks 2 (1991), 395-403. 
