65 
LINEAR LEARNING: LANDSCAPES AND ALGORITHMS 
Pierre Baldi 
Jet Propulsion Laboratory 
California Institute of Technology 
Pasadena, CA 91109 
What follows extends some of our results of [1] on learning from ex- 
amples in layered feed-forward networks of linear units. In particu- 
lar we examine what happens when the number of layers is large or 
when the connectivity between layers is local and investigate some 
of the properties of an autoassociative algorithm. Notation will be 
as in [1] where additional motivations and references can be found. 
It is usual to criticize linear networks because "linear functions do 
not compute" and because several layers can always be reduced to 
one by the proper multiplication of matrices. However this is not the 
point of view adopted here. It is assumed that the architecture of the 
network is given (and could perhaps depend on external constraints) 
and the purpose is to understand what happens during the learning 
phase, what strategies are adopted by a synaptic weights modifying 
algorithm,... [see also Cottrell et al. (1988) for an example of an ap- 
plication and the work of Linsker (1988) on the emergence of feature 
detecting units in linear networks]. 
Consider first a two layer network with n input units, n output units 
and p hidden units (p < n). Let (xl,yl),...,(xr, yr) be the set of 
centered input-output training patterns. The problem is then to find 
two matrices of weights A and B minimizing the error function E: 
= IlY,- AB,112. 
l<t<T 
66 Bald 
Let Exx, Ex, E, Eyx denote the usual covariance matrices. The 
main result of [1] is a description of the landscape of E, characerised 
by a multiplicity of saddle points and an absence of local minima. 
More precisely, the following four facts are true. 
Fact 1: For any fixed n x p matrix A the function E(A, B) is convex 
in the coefficients of B and attains its minimum for any B satisfying 
the equation 
AABExx = AEyx. (2) 
If in addition Exx is invertible and A is of full rank p, then E is 
strictly convex and has a unique minimum reached when 
B = 1)(A) = (A'A)-iA'SyxSc. 
(a) 
Fact 2: For any fixed p x n matrix B the function E(A, B) is convex 
in the coefficients of A and attains its minimum for any A satisfying 
the equation 
ABSxxB= SyxB . (4) 
If in addition Sxx is invertible and B is of full rank p, then E is 
strictly convex and has a unique minimum reached when 
A = .2.(B)= SyxB'(BSxxB') -. 
(5) 
Fact 3: Assume that Sxx is invertible. If two matrices A and B 
define a critical point of E (i.e. a point where OE/Oaii = OE/Obii = 
0) then the global map W = AB is of the form 
(6) 
where PA denotes the matrix of the orthogonal projection onto the 
subspace spanned by the columns of A and A satisfies 
PAS = PASPA = SPA (7) 
Linear Learning: Landscapes and Algorithms 67 
with E = EyxE]xEx. If A is of full rank p, then A and B define 
a critical point of E if and only if A satisfies (7) and B =/)(A), or 
equivalently if and only if A and W satisfy (6) and (7). Notice that 
in (6), the matrix Eyx E is the slope matrix for the ordinary least 
square regression of Y on X. 
Fact 4: Assume that E is full rank with n distinct eigenvalues A > 
... > X,. If Z = {i,...,ip}(1 < i < ... < ip < n)is any or- 
dered p-index set, let Uz = [ui,...,ui,] denote the matrix formed 
by the orthonormal eigenvectors of E associated with the eigenvalues 
Ai,..., Alp. Then two full rank matrices A and B define a critical 
point of E if and only if there exist an ordered p-index set 27 and an 
invertible p x p matrix C such that 
A= UzC' (8) 
B = C-1U5'],yxS'],. 
For such a critical point we have 
W = PU:rEYXE: 
(10) 
(11) 
Therefore a critical point of W of rank p is always the product of the 
ordinary least squares regression matrix followed by an orthogonal 
projection onto the subspace spanned by p eigenvectors of E. The map 
W associated with the index set {1,2,...,p} is the unique local and 
global minimum of E. The remaining () - 1 p-index sets correspond 
to saddle points. All additional critical points defined by matrices 
A and B which are not of full rank are also saddle points and can 
be characerized in terms of orthogonal projections onto subspaces 
spanned by q eigenvectors with q < p. 
68 Bsldi 
Deep Networks 
Consider now the case of a deep network with a first layer of n input 
units, an (rn + 1)-th layer of n output units and rn- 1 hidden layers 
with an error function given by 
E(AI,...,An) -  [lY' - A1AI...Arax, I[ . 
l<t<T 
(12) 
It is worth noticing that, as in fact I and 2 above, if we fix any rn - 1 
of the rn matrices A1, ..., A,, then E is convex in the remaining matrix 
of connection weights. Let p (p < n) denote the number of units in 
the smallest layer of the network (several hidden layers may have p 
units). In other words the network has a bottleneck of size p. Let i 
be the index of the corresponding layer and set 
A = AAa...Am-i+ 
B = Am-i+2...Am 
(13) 
When we let A,..., A, vary, the only restriction they impose on A 
and B is that they be of rank at most p. Conversely, any two ma- 
trices A and B of rank at most p can always be decomposed (and 
in many ways) in products of the form of (13). It results that any 
local minirod of the error function of the deep network should yield a 
local minirod for the corresponding "collapsed" .three layers network 
induced by (13) and vice versa. Therefore E(A,...,A,,) does not 
have any local minima and the global minimal map W* = A A2...A,, 
is unique and given by (10) with index set {1,2,...,p}. Notice that 
of course there is a large number of ways of decomposing W* into 
a product of the form AA2...A,. Also a saddle point of the error 
function E(A, B) does not necessarily generate a saddle point of the 
corresponding E(A,..., A,,) for the expressions corresponding to the 
two gradients are very different. 
Linear Larning: Landscapes and Algorithms 69 
Forced Connections. Local Connectivity 
Assume now an error function of the form 
= Ilu,- 
l<t<T 
for a two layers network but where the value of some of the entries 
of A may be externally prescribed. In particular this includes the 
case of local connectivity described by relations of the form aij ---- 0 
for any output unit i and any input unit j which are not connected. 
Clearly the error function E(A) is convex in A. Every constraint of 
the form aij --cst defines an hyperplane in the space of all possible A. 
The intersection of all these constraints is therefore a convex set. Thus 
minimizing E under the given constraints is still a convex optimization 
problem and so there are no local minima. It should be noticed that, 
in the case of a network with even only three constrained layers with 
two matrices A and B and a set of constraints of the form aij --cst 
on A and bet --cst for B, the set of admissible matrices of the form 
W = AB is, in general, not convex anymore. It is not unreasonable 
to conjecture that local minima may then arise, though this question 
needs to be investigated in greater detail. 
Algorithmic Aspects 
One of the nice features of the error landscapes described so far is 
the absence of local minima and the existence, up to equivalence, 
of a unique global minimum which can be understood in terms of 
principal component analysis and least square regression. However 
the landscapes are also characterized by a large number of saddle 
points which could constitute a problem for a simple gradient descent 
algorithm during the learning phase. The proof in [1] shows that 
the lower is the E value corresponding to a saddle point, the more 
difficult it is to escape from it because of a reduction in the possible 
number of directions of escape (see also [Chauvin, 1989] in a context of 
Hebbian learning). To assert how relevant these issues are for practical 
implementations requires further simulation experiments. On a more 
7O Bsld 
speculative side, it remains also to be seen whether, in a problem 
of large size, the number and spacing of saddle points encountered 
during the first stages of a descent process could not be used to "get 
a feeling" for the type of terrain being descented and as a result to 
adjust the pace (i.e. the learning rate). 
We now turn to a simple algorithm for the auto-associative case in a 
three layers network, i.e. the case where the presence of a teacher 
can be avoided by setting /t: xt and thereby trying to achieve a 
compression of the input data in the hidden layer. This technique 
is related to principal component analysis, as described in [1]. If 
yt: xt, it is easy to see from equations (8) and (9) that, if we take 
the matrix C to be the identity, then at the optimum the matrices 
A and B are transpose of each other. This heuristically suggests a 
possible fast algorithm for auto-association, where at each iteration a 
gradient descent step is applied only to one of the connection matrices 
while the other is updated in a symmetric fashion using transposition 
and avoiding to back-propagate the error in one of the layers (see 
[Williams, 1985] for a similar idea). More formally, the algorithm 
could be concisely described by 
A(O) = random 
(o) = '(o) 
A(k + 1) = A(k) - r I OA 
B(k+l)=A'(k+l) 
(15) 
Obviously a similar algorithm can be obtained by setting B(k + 1) = 
B(k) - rlc9E/OB and A(k + 1) = B'(k + 1). It may actually even be 
better to alternate the gradient step, one iteration with respect to A 
and one iteration with respect to B. 
A simple calculation shows that (15) can be rewritten as 
A(k + 1) = A(k) + rl(I- W(k))ExxA(k) 
( + l) = (k) + V()Sxx(t- W(k)) 
(16) 
Linear Learning: Landscapes and Algorithms 71 
where W(k) -- A(k)B(k). It is natural from what we have already 
seen to examine the behavior of this algorithm on the eigenvectors of 
I]xx. Assume that u is an eigenvector of both I]xx and W(k) with 
eigenvalues A and p(k). Then it is easy to see that u is an eigenvector 
of W(k + 1) with eigenvalue 
p(k + 1) = p(k)[1 + r/(1 - p(k))] 2. 
For the algorithm to converge to the optimal W, p(k q- 1) must con- 
verge to 0 or 1. Thus one has to look at the iterates of the func- 
tion f(x) = /[1 q-r/(1- x)] 2. This can be done in detail and 
we shall only describe the main points. First of all, f'(x) = 0 iff 
x = 0 or x = xa = 1 q- (1/r/A) or x = Xb = 1/3 + (1/3r/A) and 
f"(x) = 0 iff x = xc = 2/3 + (2/3r/,X) = 2Xb. For the fixed points, 
f(x) = x iff x = 0, x = 1 or x = Xd = 1 + (2/r/,X). Notice also 
that f(xa) = 0 and f(1 + (l/,lA)) = (1 + (//r/,X)(1 - l) a. Points cor- 
responding to the values 0,1, Xa,Xd of the x variable can readily be 
positioned on the curve of f but the relative position of xb (and 
depends on the value assumed by r/,X with respect to 1/2. Obviously 
if/(0) = 0, 1 or xd then/(k) = 0, 1 or x, if/(0) < 0/(k) --+ 
and if p(k) > x p(k) -4 +oc. Therefore the algorithm can converge 
only for 0 < p(0) < x. When the learning rate is too large, i.e. 
when > 1/2 then even if p(0) is in the interval (0, x) one can see 
that the algorithm does not converge and may even exhibit complex 
oscillatory behavior. However when r/,X < 1/2, if 0 </(0) < xa then 
/(k) --+ 1, if p(0) = xa then = 0 and if xa < /(0) < xd then 
In conclusion, we see that if the algorithm is to be tested, the 
learning rate should be chosen so that it does not exceed 1/2,X, where 
,X is the largest eigenvalue of Exx. Even more so than back propaga- 
tion, it can encounter problems in the proximity of saddle points. 
Once a non-principal eigenvector of Exx is learnt, the algorithm 
rapidly incorporates a projection along that direction which cannot 
be escaped at later stages. Simulations are required to examine the 
effects of "noisy gradients" (computed after the presentation of only 
a few training examples), multiple starting points, variable learning 
rates, momentum terms, and so forth. 
72 Bd 
Aknowledgement 
Work supported by NSF grant DMS-8800323 and in part by ONR 
contract 411P006 01. 
References 
(1) Baldi, P. and Hornik, K. (1988) Neural Networks and Principal 
Component Analysis: Learning from Examples without Local Min- 
ima. Neural Networks, Vol. 2, No. 1. 
(2) Chauvin, Y. (1989) Another Neural Model as a Principal Compo- 
nent Analyzer. Submitted for publication. 
(3) Cottrell, G. W., Munro, P. W. and Zipser, D. (1988) Image Com- 
pression by Back Propagation: a Demonstration of Extensional Pro- 
gramming. In: Advances in Cognitive Science, Vol. 2, Sharkey, N. E. 
ed., Norwood, NJ Abbex. 
(4) Linsker, R. (1988) Self-Organization in a Perceptual Network. 
Computer 21 (3), 105-117. 
(5) Williams, R. J. (1985) Feature Discovery Through Error-Correction 
Learning. ICS Report 8501, University of California, San Diego. 
