Hebb Learning of Features 
based on their Information Content 
Ferdinand Peper 
Communications Research Laboratory 
588-2, Iwaoka, Iwaoka-cho 
Nishi-ku, Kobe 651-24 
Japan 
peper@crl.go.jp 
Hideki Noda 
Kyushu Institute of Technology 
Dept. Electr., Electro., and Comp. Eng. 
1-1 Sensui-cho, Tobata-ku 
Kita-Kyushu 804, Japan 
noda@kawa.comp.kyutech.ac.jp 
Abstract 
This paper investigates the stationary points of a Hebb learning rule 
with a sigmoid nonlinearity in it. We show mathematically that when 
the input has a low information content, as measured by the input's 
variance, this learning rule suppresses learning, that is, forces the weight 
vector to converge to the zero vector. When the information content 
exceeds a certain value, the rule will automatically begin to learn a 
feature in the input. Our analysis suggests that under certain conditions 
it is the first principal component that is learned. The weight vector 
length remains bounded, provided the variance of the input is finite. 
Simulations confirm the theoretical results derived. 
I Introduction 
Hebb learning, one of the main mechanisms of synaptic strengthening, is induced 
by cooccurrent activity of pre- and post-synaptic neurons. It is used in artificial 
neural networks like perceptrons, associative memories, and unsupervised learning 
neural networks. Unsupervised Hebb learning typically employs rules of the form: 
laW(t) -- x(t)y(t) - d(x(t),y(t), w(t)), (1) 
where w is the vector of a neuron's synaptic weights, x is a stochastic input vector, 
y is the output expressed as a function of xTw, and the vector function d is a for- 
getting term forcing the weights to decay when there is little input. The integration 
constant p determines the learning speed and will be assumed i for convenience. 
The dynamics of rule (1) determines which features are learned, and, with it, the 
rule's stationary points and the boundedness of the weight vector. In some cases, 
weight vectors grow to zero or grow unbounded. Either is biologically implausible. 
Suppression and unbounded growth of weights is related to the characteristics of 
the input x and to the choice for d. Understanding this relation is important to 
enable a system, that employs Hebb learning, to learn the right features and avoid 
implausible weight vectors. 
Unbounded or zero length of weight vectors is avoided in [5] by keeping the total 
synaptic strength -i wi constant. Other studies, like [7], conserve the sum-squared 
Hebb Learning of Features based on their Information Content 247 
synaptic strength. Another way to keep the weight vector length bounded is to 
limit the range of each of the individual weights [4]. The effect of these constraints 
on the learning dynamics of a linear Hebb rule is studied in [6]. 
This paper constrains the weight vector length by a nonlinearity in a Hebb rule. It 
uses a rule of the form (1) with y - $(xTw -- h) and d(x, y, w) - c.w, the function 
$ being a smooth sigmoid, h being a constant, and c being a positive constant 
(see [1] for a similar rule). We prove that the weight vector w assumes a bounded 
nonzero solution if the largest eigenvalue  of the input covariance matrix satisfies 
, > c/$(-h). Furthermore, if  _ c/S(-h) the weight vector converges to the 
vector 0. Since  equals the variance of the input's first principal component, that 
is,  is a measure for the amount of information in the input, learning is enabled 
by a high information content and suppressed by a low information content. 
The next section describes the Hebb neuron and its input in more detail. After 
characterizing the stationary points of the Hebb learning rule in section 3, we ana- 
lyze their stability in section 4. Simulations in section 5 confirm that convergence 
towards a nonzero bounded solution occurs only when the information content of 
the input is sufficiently high. We finish this paper with a discussion. 
2 The Hebb Neuron and its Input 
Assume that the n-dimensional input vectors x presented to the neuron are gener- 
ated by a stationary white stochastic process with mean 0. The process's covariance 
matrix Y. = E[xx w] has eigenvalues A, ..., An (in order of decreasing size) and cor- 
responding eigenvectors u,..., un. Furthermore, E[[[x[[ '] is finite. This implies that 
the eigenvalues are finite because E[llxll = E[tr[xxT]] = tr[E[xxT]] = -i= Ai. It 
is assumed that the probability density function of x is continuous. Given an input 
x and a synaptic weight vector w, the neuron produces an output y = $(xTw -- h), 
where $: lR -- ]H. is a function that satisfies the conditions: 
C1. 
C2. 
C3. 
C4. 
is smooth, i.e., S is continuous and differentiable and S  is continuous. 
is sublinear, i.e., lim S(z)/z - lim S(z)/z -- O. 
Z'--)' OO 
is monotonically nondecreasing. 
has one maximum, which is at the point z - -h. 
Typically, these conditions are satisfied by smooth sigmoidal functions. This in- 
cludes sigmoids with infinite saturation values, like S(z) = sign(z)[z[ /2 (see [9]). 
The point at which a sigmoid achieves maximal steepness (condition C4) is called its 
base. Though the step function is discontinuous at its base, thus violating condition 
C1, the results in this paper apply to the step function too, because it is the limit 
of a sequence of continuous sigmoids, and the input density function is continuous 
and thus Lebesgue-integrable. The learning rule of the neuron is given by 
= xy - cw, (2) 
c being a positive constant. Use of a linear $(z) = az in this rule gives unstable 
dynamics: if a > c/,l, then the length of the weight vector w grows out of bound 
though ultimately w becomes collinear with u. It is proven in the next section 
that a sublinear $ prevents unbounded growth of w. 
248 E Peper and H. Noda 
3 Stationary Points of the Learning Rule 
To get insight into what stationary points the weight vector w ultimately converges 
to, we average the stochastic equation (2) over the input patterns and obtain 
(.) = E [xs (xT(w) - h) ] - c (w), 
(3) 
where (w / is the averaged weight vector and the expectation is taken over x, as 
with all expectations in this paper. Since the solutions of (2) correspond with the 
solutions of (3) under conditions described in [2], the averaged (w) will be referred 
to as w. Learning in accordance to (2) can then be interpreted [1] as a gradient 
descent process on an averaged energy function J associated with (3): 
S(w) = [v (xw - n) ] + 
with T(z)- S(v)dv. 
oo 
To characterize the solutions of (3) we use the following lemma. 
Lemma 1. Given a unit-length vector u, the function fu ' ]R - ]R is defined by 
_- 1 [uxS 
c 
and the constant u by u = E[uTxxXu]. The fixed points of f are as follows. 
1. If ,kS'(-h) _< c then f has one fixed point, i.e., z = 0. 
2. If ,kS(-h)  c then fu has three fixed points, i.e., z = 0, z = a +, and 
z = a, where a + (aS) is a positive (negative) value depending on u. 
Proof(Sketch; for a detailed proof see [11]). Function f is a smooth sigmoid, since 
conditions C1 to C4 carry over from S to f. The steepness of f in its base at z - 0 
depends on vector u. If S'(-h) _ c, function fu intersects the line h(z) -- z only 
at the origin, giving z - 0 as the only fixed point. If S(-h)  c, the steepness 
of fu is so large as to yield two more intersections: z -- a + and z = a. [] 
Thus characterizing the fixed points of f, the lemma allows us to find the fixed 
points of a vector function g: n _ ]Rn that is closely related to (3). Defining 
g(w) = _1 E[x$ (xTw- h)], 
c 
we find that a fixed point z = au of fu corresponds to the fixed point w = au of 
g. Then, since (3) can be written as v - c.g(w) - c.w, its stationary points are 
the fixed points of g, that is, w = 0 is a stationary point and for each u for which 
AS'(-h) > c there exists one bounded stationary point associated with a + and 
one associated with a. Consequently, if A _< c/S(-h) then the only fixed point 
of g is w = 0, because Ax >_ Au for all u. 
What is the implication of this result? The relation A 5 c/S'(-h) indicates a low 
information content of the input, because Ax--equaling the variance of the input's 
first principal component--is a measure for the input's information content. A low 
information content thus results in a zero w, suppressing learning. Section 4 shows 
Hebb Learning of Features based on their Information Content 249 
that a high information content results in a nonzero w. The turnover point of what 
is considered high/low information is adjusted by changing the steepness of the 
sigmoid in its base or changing constant c in the forgetting term. 
To show the boundedness of w, we consider an arbitrary point P: w =/u suffi- 
ciently far away from the origin 0 (but at finite distance) and calculate the compo- 
nent of v along the line OP as well as the components orthogonal to OP. Vector 
u has unit length, and  may be assumed positive since its sign can be absorbed 
by u. Then, the component along OP is given by the projection of v on u: 
uTr w=3u ---- uTcg(/U) -- uT/U : -- [/ -- fu(/)]. 
This is negative for all/ exceeding the fixed points of fu because of the sigmoidal 
shape of fu. So, for any point P in IP n lying far enough from 0 the vector compo- 
nent of v in P along the line OP is directed towards 0 and not away from it. This 
component decreases as we move away from O, because the value of [ - fu()] 
increases as  increases (fu is sublinear). Orthogonal to this is a component given 
by the projection of v on a unit-length vector v that is orthogonal to u: 
I : Tcg(/u) - T/u -- Tg(/u). 
Tv 
This component increases as we move away from O; however, it changes at a slower 
pace than the component along OP, witness the quotient of both components: 
{ T g (/U) T g (/U) // 
lim vTr = lim = lim = 0. 
Vector v thus becomes increasingly dominated by the component along OP as/ 
increases. So, the origin acts as an attractor if we are sufficiently far away from it, 
implying that w remains bounded during learning. 
4 Stability of the Stationary Points 
To investigate the stability of the stationary points, we use the Hessian of the 
averaged energy function J described in the last section. The Hessian at point w 
equals: H(w) = cI- E [xxTS  (xTw -- h)]. A stationary point w -  is stable 
iff H(v) is a positive definite matrix. The latter is satisfied if for every unit-length 
vector v, 
vTE [xxTS ' (xTv-- h) ] v < , (4) 
that is, if all eigenvalues of the matrix E[xxTS'(xT - h)] are less than c. First 
consider the stationary point w = 0. The eigenvalues of E[xxT$'(-h)] in decreasing 
order are ,kS'(-h), ...,,k,S'(-h). The Hessian H(0) is thus positive definite iff 
AS'(-h) < c. In this case w = 0 is stable. It is also stable in the case A1 = 
c/S'(-h), because then (4) holds for all v y u, preventing growth of w in directions 
other than u. Moreover, w will not grow in the direction of u, because If (/)1 < 
I/l for all/ y 0. Combined with the results of the last section this implies: 
Corollary 1. If A < c/S'(-h) then the averaged learning equation (3) will have 
as its only stationary point w = 0, and this point is stable. If A > c/S'(-h) the 
stationary point w = 0 is not stable, and there will be other stationary points. 
250 E Peper and H. Noda 
We now investigate the other stationary points. Let w - auU be such a point, u 
being a unit-length vector and Cu a nonzero constant. To check whether the Hessian 
H(ouU) is positive definite, we apply the relation E[XY] = E[X] E[Y] + Cov[X, Y] 
to the expression E [uTxxTu$  (OuXTU- h)] and obtain after rewriting: 
] = 
1 
1 E [uTxxTu$'(auXTU-- h)] Au 
--Cov [uxxu, S'uXU- n)]. 
The sigmoidal shape of the function fu implies that fu is less steep than the line 
h(z) = z at the intersection at z = au, that is, fu(au) < 1. It then follows that 
E [uTxxTu$'(auXTU -- h)] = Cfu(aU) < c, giving: 
1 
<  
The probability distribution of x unspecified, it is hard to evaluate this upper bound. 
For certain distributions the upper bound is minimized when u is maximized, 
that is, when u = u and u = l, implying the Hebb neuron to be a nonlinear 
principal component analyzer. Distributions that are symmetric with respect to 
the eigenvectors of Y. are probably examples of such distributions, as suggested 
by [11, 12]. For other distributions vector w may assume a solution not collinear 
with u or may periodically traverse (part of) the nonzero fixed-point set of g. 
5 Simulations 
We carry out simulations to test whether learning behaves in accordance with corol- 
lary 1. The following difference equation is used as the learning rule: 
Aw = 7 [x.tanh (axTw) -- w], (5) 
where '7 is the learning rate and a a constant. The use of a difference A in (5) rather 
than the differential in (2) is computationally easier, and gives identical results if 
'7 decreases over training time in accordance with conditions described in [3]. We 
use 7(t) = 1/(0.01t + 20). It satisfies these conditions and gives fast convergence 
without disrupting stability [10]. Its precise choice is not very critical here, though. 
The neuron is trained on multivariate normally distributed random input samples of 
dimension 6 with mean 0 and a covariance matrix E that has the eigenvalues 4.00, 
2.25, 1.00, 0.09, 0.04, and 0.01. The degree to which the weight vector and E's first 
eigenvector Ul are collinear is measured by the match coefficient [10], defined by: 
rn -- cos 2 /(u, w). In every experiment the neuron is trained for 10000 iterations 
by (5) with the value of parameter a set to 0.20, 0.25, and 0.30, respectively. This 
corresponds to the situations in which A1 < c/$(-h), A -- c/$(-h), and A > 
c/$(-h), respectively, since c - 1 and the steepness of the sigmoid $(z) = tanh(az) 
Hebb Learning of Features based on their Information Content 251 
in its base z = -h - 0 is S'(O) -- a. We perform each experiment 2000 times, which 
allows us to obtain the match coefficients beyond iteration 100 within +0.02 with 
a confidence coefficient of 95% (and a smaller confidence coefficient on the first 100 
iterations). The random initialization of the weight vector--its initial elements are 
uniformly distributed in the interval (-1, 1)--is different in each experiment. 
0.5 
1.0 
' a=0.30 
J" .... a=0.25 
 ..... a=0.20 
1 10 10 2 10 3 10 4 
Iterations 
0.0 0.0 
1 10 10 2 10 3 10 4 
Iterations 
Figure 1: Match coefficients averaged 
over 2000 experiments for parameter 
values a = 0.20, 0.25, and 0.30. 
Figure 2: Lengths of the weight vector 
averaged over 2000 experiments. The 
curve types are similar to those in Fig. 1. 
Fig. 1 shows that for all tested values of parameter a the weight vector gradually 
becomes collinear with u over 10000 iterations. The length of the weight vector 
converges to 0 when a -- 0.20 or a = 0.25 (see Fig. 2). In the case a = 0.30, 
corresponding to A > c/S'(-h), the length converges to a nonzero bounded value. 
In conclusion, convergence is as predicted by corollary 1: the weight vector converges 
to 0 if the information content in the input is too low for climbing the slope of the 
sigmoid in its base, and otherwise the weight vector becomes nonzero. 
6 Discussion 
Learning by the Hebb rule discussed in this paper is enabled if the input's informa- 
tion content as measured by the variance is sufficiently high, and only then. The 
results, though valid for a single neuron, have implications for systems consisting of 
multiple neurons connected by inhibitory connections. A neuron in such a system 
would have as output y = S(xTw - h - vTy'), where the inhibitory signal vTy ' 
would consist of the vector of output signals y' of the other neurons, weighted by the 
vector v (see also [1]). Function fu in lemma 1 would, when extended to contain the 
signal vTy ', still pass through the origin because of the zero-meanness of the input, 
but would have a reduced steepness at the origin caused by the shift in S's argument 
away from the base. The reduced steepness would make an intersection of fu with 
the line h(z) = z in a point other than the origin less likely. Consequently, an in- 
hibitory signal would bias the neuron towards suppressing its weights. In a system 
of neurons this would reduce the emergence of neurons with correlated outputs, be- 
cause of the mutual presence of their outputs in each other's inhibitory signals. The 
neurons, then, would extract different features, while suppressing information-poor 
features. 
In conclusion, the Hebb learning rule in this paper combines well with inhibitory 
connections, and can potentially be used to build a system of nonredundant feature 
extractors, each of which is optimized to extract only information-rich features. 
252 E Peper and H. Noda 
Moreover, the suppression of weights with a low information content suggests a 
straightforward way [8] to adaptively control the number of neurons, thus minimiz- 
ing the necessary neural resources. 
Acknowledgments 
We thank Dr. Mahdad N. Shirazi at Communications Research Laboratory (CRL) 
for the helpful discussions, Prof. Dr. S.-I. Amari for his encouragement, and Dr. 
Hidefumi Sawai at CRL for providing financial support to present this paper at 
NIPS'96 from the Council for the Promotion of Advanced Information and Com- 
munications Technology. This work was financed by the Japan Ministry of Posts 
and Telecommunications as part of their Frontier Research Project in Telecommu- 
nications. 
References 
[1] S.-I. Amari, "Mathematical Foundations of Neurocomputing," Proceedings of the 
IEEE, vol. 78, no. 9, pp. 1443-1463, 1990. 
[2] S. Geman, "Some Averaging and Stability Results for Random Differential Equa- 
tions," SIAM J. Appl. Math., vol. 36, no. 1, pp. 86-105, 1979. 
[3] 
H.J. Kushner and D.S. Clark, "Stochastic Approximation Methods for Constrained 
and Unconstrained Systems," Applied Mathematical Sciences, vol. 26, New York: 
Springer-Verlag, 1978. 
[4] R. Linsker, "Self-Organization in a Perceptual Network," Computer, vol. 21, pp. 105- 
117, 1988. 
[5] C. von der Malsburg, "Self-Organization of Orientation Sensitive Cells in the Striate 
Cortex," Kybernetik, vol. 14, pp. 85-100, 1973. 
[6] K.D. Miller and D.J.C. MacKay, "The Role of Constraints in Hebbian Learning," 
Neural Computation, vol. 6, pp. 100-126, 1994. 
[7] E. Oja, "A simplified neuron model as a principal component analyzer," Journal of 
Mathematics and Biology, vol. 15, pp. 267-273, '1982. 
[8] 
F. Peper and H. Noda, "A Mechanism for the Development of Feature Detecting 
Neurons," Proc. Second New-Zealand Int. Two-Stream Conf. on Artificial Neural 
Networks and Expert Systems, ANNES'95, Dunedin, New-Zealand, pp. 59-62, 20-23 
Nov. 1995. 
[9] 
F. Peper and H. Noda, "A Class of Simple Nonlinear 1-unit PCA Neural Networks," 
1995 IEEE Int. Conf. on Neural Networks, ICNN'95, Perth, Australia, pp. 285-289, 
27 Nov.-1 Dec. 1995. 
[10] 
F. Peper and H. Noda, "A Symmetric Linear Neural Network that Learns Principal 
Components and their Variances," IEEE Trans. on Neural Networks, vol. 7, pp. 1042- 
1047, 1996. 
[11] 
F. Peper and H. Noda, "Stationary Points of a Hebb Learning Rule for a Nonlinear 
Neural Network," Proc. 1996 Int. Syrup. Nonlinear Theory and Appl. (NOLTA '96), 
Kochi, Japan, pp. 241-244, 7-9 Oct 1996. 
[12] 
F. Peper and M.N. Shirazi, "On the Eigenstructure of Nonlinearized Covariance 
Matrices," Proc. 1996 Int. Syrup. Nonlinear Theory and Appl. (NOLTA '96), Kochi, 
Japan, pp. 491-493, 7-9 Oct 1996. 
