Central and Pairwise Data Clustering by 
Competitive Neural Networks 
Joachim Buhmann & Thomas Hofmann 
Rheinische Friedrich-Wilhelms-Universitfit 
Institut f/Jr Informatik II, R6merstrage 164 
D-53117 Bonn, Fed. Rep. Germany 
Abstract 
Data clustering amounts to a combinatorial optimization problem to re- 
duce the complexity of a data representation and to increase its precision. 
Central and pairwise data clustering are studied in the maximum en- 
tropy framework. For central clustering we derive a set of reestimation 
equations and a minimization procedure which yields an optimal num- 
ber of clusters, their centers and their cluster probabilities. A meanfield 
approximation for pairwise clustering is used to estimate assignment 
probabilities. A selfconsistent solution to multidimensional scaling and 
pairwise clustering is derived which yields an optimal embedding and 
clustering of data points in a d-dimensional Euclidian space. 
1 Introduction 
A central problem in information processing is the reduction of the data complexity with 
minimal loss in precision to discard noise and to reveal basic structure of data sets. Data 
clustering addresses this tradeoff by optimizing a cost function which preserves the original 
data as complete as possible and which simultaneously favors prototypes with minimal 
complexity (Linde et al., 1980; Gray, 1984; Chou et al., 1989; Rose et al., 1990). We dis- 
cuss an objective function for the joint optimization of distortion errors and the complexity 
of a reduced data representation. A maximum entropy estimation of the cluster assign- 
ments yields a unifying framework for clustering algorithms with a number of different 
distortion and complexity measures. The close analogy of complexity optimized clustering 
with winner-take-all neural networks suggests a neural-like implementation resembling 
topological feature maps (see Fig. 1). 
104 
Central and Pairwise Data Clustering by Competitive Neural Networks 105 
xi 
Figure 1: Architecture of a three 
layer competitive neural network 
for central data clustering with 
d neurons in the input layer, K 
neurons in the clustering layer 
with activity (Mi) and G neu- 
rons in the classification layer. 
The output neurons estimate the 
conditional probability P, li of 
data point i being in class 7- 
Given is a set of data points which are characterized either by coordinates {xi [xi C d; i - 
1,..., N} or by pairwise distances {ikli, k - 1,..., N}. The goal of data clustering 
is to determine a partitioning of a data set which either minimizes the average distance 
of data points to their cluster centers or the average distance between data points of the 
same cluster. The two cases are refered to as central or pairwise clustering. Solutions 
to central clustering are represented by a set of data prototypes {YIY c = 
1 .... , K }, and the size K of that set. The assignments { Mi] o = 1,..., K;i = 1 .... , N}, 
Mi  {0, 1} denote that data point i is uniquely assigned to cluster  (, Mi, = 1). 
Rate distortion theo specifies the optimal choice of y being the cluster centroids, i.e., 
 o 
Mia oia(xi, ya) = 0. Given only a set of distances or dissimilities the solution 
to pairwise clustering is chacterized by the expected assignment viables (Mi). The 
complexity {C [ = 1 .... , K} of a clustering solution depends on the specific infomation 
processing application at hand, in pticul, we assume that C is only a function of the 
cluster probability p = i= l Mi IN. We propose the central clustering cost function 
N K 
i=1 =1 
and the pairwise clustering cost function 
N K N 
The distortion and complexity costs are adjusted in size by the weighting parameter X. e 
cost hnctions (1,2) have to be optimized in an imrafive hshion: (i) vary the assignment 
r cc.pc t { Mia }) decrease; 
variables Mi for a fixed number K of clusters such that the costs x 
(ii) increment the number of clusters K  K + 1 and optimize Mi again. 
Complexity costs which penalize small, sparsely populated clusters, i.e., C = 1/p, 
1, 2 .... , favor equal clusmr probabilities, thereby emphasizing the hardware aspect of a 
clustering solution. The special case s = 1 with constant costs per cluster coesponds 
to K-means clustering. An alternative complexity measure which estimates encoding 
costs for data compression and data transmission is the Shannon entropy of a cluster set 
(C)  E,p,C, = - E,p, logp,. 
106 Buhmann and Hofmann 
The most common choice for the distortion measure are distances Dis = 
which preserve the permutation symmetry of (1) with respect to the cluster index v. A data 
partitioning scheme without permutation invariance of cluster indices is described by the 
cost function 
The generalized distortion error ((Dis)) = -'r T,Di,(x,, y,) between data point xi and 
cluster center y quantifies the intrinsic quantization errors Di-r (xi, y,) and the additional 
errors due to transitions T, from index 7 to a. Such transitions might be caused by noise 
in communication channels. These index transitions impose a topological order on the 
set of indices { a[ a = 1, .... K} which establishes a connection to self-organizing feature 
maps (Kohonen, 1984; Ritter et al., 1992) in the case of nearest neighbor transitions in a 
d-dimensional index space. We refer to such a partitioning of the data space as topology 
preserving clustering. 
2 Maximum Entropy Estimation of Central Clustering 
Different combinations of complexity terms, distortion measures and topology constraints 
define a variety of central clustering algorithms which are relevant in very different infor- 
mation processing contexts. To derive robust, preferably parallel algorithms for these data 
clustering cases, we study the clustering optimization problem in the probabilistic frame- 
work of maximum entropy estimation. The resulting Gibbs distribution proved to be the 
most stable distribution with respect to changes in expected clustering costs (Tikochinsky 
et al., 1984) and, therefore, has to be considered optimal in the sense of robust statistics. 
Statistical physics (see e.g. (Amit, 1989; Rose et al., 1990)) states that maximizing the 
entropy at a fixed temperature T - 1/]3 is equivalent to minimizing the free energy 
 Tic = -TlnZ=-Tln( E exp(-fix)) 
- -N E p,20p, fi log exp C:)] (4) 
with respect to the vmiables p, y. The effective complexity costs are C  0 C)/Opt. 
For a derivation of (4) see (Buhmann, Kfihnel, 1993b). 
The resulting re-estimation equations for the expected cluster probabilities and the expected 
centroid positions are necess conditions of Wx being minimal, i.e. 
1 
i 
1 
0 '- EETT(MiT)_O T)i(xi'Y)' (6) 
Oy 
exp [--fi(((Di)) + XC;)] 
(Mi) = x (7) 
Central and Pairwise Data Clustering by Competitive Neural Networks 107 
The expectation value (Mi) of the assignment variable Mi can be interpreted as a fuzzy 
membership of data point xi in cluster a. The case of supervised clustering can be treated 
in an analogous fashion (Buhmann, Kiihnel, 1993a) which gives rise to the third layer in the 
neural network implementation (see Fig. 1). The global minimum of the free energy (4) 
with respect to p,, y determines the maximum entropy solution of the cost function (1). 
Note that the optimization problem (1) of a K v state space has been reduced to a K(d + 1 ) 
dimensional minimization of the free energy f'x (4). To find the optimal parameters p, y 
and the number of clusters K which minimize the free energy, we start with one cluster 
located at the centroid of the data distribution, split that cluster and reestimate p, y using 
equation (5,6). The new configuration is accepted as an improved solution if the free energy 
(4) has been decreased. This splitting and reestimation loop is continued until we fail to 
find a new configuration with lower free energy. The temperature determines the fuzziness 
of a clustering solution, whereas the complexity term penalizes excessively many clusters. 
3 Meanfield Approximation for Pairwise Clustering 
The maximum entropy estimation for pairwise clustering constitutes a much harder problem 
than the calculation of the free energy for central clustering. Analytical expression for 
the Gibbs distributions are not known except for the quadratic distance measure Dik = 
(xi -- xk) 2. Therefore, we approximate the free energy by a variational principle commonly 
refered to as meanfield approximation. Given the costfunction (2) we derive a lower bound 
to the free energy by a system of noninteracting assignment variables. The approximative 
costfunction with the variational parameters w is 
K N 
Z E (8) 
v=l i=1 
The original costfunction for pairwise clustering can be written as / -  K + V with a 
(small) perturbation term V --  -  x due to cluster interactions. The partition function 
= Zo(exp(-/SV))o > Zoexp(-/3(V)o) 
(9) 
is bound from below if terms of the order O(((V-(V)0) 3 )o) and higher are negligible com- 
pared to the quadratic term. The angular brackets denote averages over all configurations 
of the costfunction without interactions. The averaged perturbation term (V)o amounts to 
(10) 
being the averaged assignment variables 
exp(-3m) 
(Mm): E exp(-/w)' 
(11) 
108 Buhmann and Hofmann 
The meanfield approximation with the cost function (8) yields a lower bound to the partition 
function Z of the original pairwise clustering problem. Therefore, we vary the parameters 
i, to maximize the quantity In Zo -/5(V)o which produces the best lower bound of Z 
based on an interaction free costfunction. Variation of & leads to the conditions 
(12) 
 being defined as 
( ' ) 
. _ i v,j (Mk)Vk + XC;. (13) 
w - p---   
For a given distance matrix Di the transcendental equations (11,12) have to be solved 
simultaneously. 
So far the i have been treated as independent variation parameters. An important 
problem, which is usually discussed in the context of Multidimensional Scaling, is to 
find an embedding for the data set in an Euclidian space and to cluster the embedded data. 
The variational framework can be applied to this problem, if we consider the parameters 
i as functions of data coordinates and prototype coordinates, i = Dic(Xi, Ya), e.g. 
with a quadratic distortion measure Di(xi, y) = Ilxi - yc[[ 2. The variables xi, y C a 
are the variational parameters which have to be determined by maximizing In Z0 -/5(V)o. 
Without imposing the restriction for the prototypes to be the cluster centroids, this leads to 
the following conditions for the data coordinates 
(Mi) (w- i) (Y- E(Mo)Yu) = 0, 
Vi C { 1, .... N}. (14) 
After further algebraic manipulations we receive the explicit expression for the data points 
' E<Mw)(lly,[12- ,)(y,- E<Mi,)yu) (15) 
/Cxi = 5 ' 
with the covariance matrix/Ci = ((yy')i- (y)i(y)/), (y)i = 5-,(Mw)y,. Let us 
assume that the matrix/C is non-singular which imposes the condition K > d and the 
cluster centers {yla = 1,..., K} being in general position. For K < d the equations 
i, = i + i are exactly solvable and embedding in dimensions larger than K produces 
non-unique solutions without improving the lower bound in (9). 
Varying In Zo -/5(V)o with respect to y yields a second set of stationarity conditions 
E(M5> (1-(Mss>)(j-*)(xs-yo ) = 0, 
Vc G { 1 .... ,K). (16) 
The weighting factors in (16), however, decay exponentially fast with the inverse temper- 
ature, i.e., (Mss)(1 - (Mss))  0(/5 exp[-/sc]), c > 0. This implies that the optimal 
solution for the data coordinates displays only a very weak dependence on the special 
choice of the prototypes in the low temperature regime. Fixing the parameters y and 
solving the transcendental equations (14,15) for xi, the solution will be very close to the 
optimal approximation. It is thus possible to choose the prototypes as the cluster centroids 
y = 1/(pN) 'i(Mic)xi and, thereby, to solve Eq. (15) in a self-consistent fashion. 
Central and Pairwise Data Clustering by Competitive Neural Networks 109 
a 
b 
Figure 2: A data distribution (4000 data points) (a), generated by four normally distributed 
sources is clustered with the complexity measure C = - log p and A = 0.4 (b). The plus 
signs (+) denote the centers of the Gaussians and stars (*) denote cluster centers. Figure (c) 
shows a topology preserving clustering solution with complexity  = 1/p and external 
noise 07 = 0.05). 
If the prototype variables depend on the data coordinates, the derivatives Oy/Oxi will 
not vanish in general and the condition (14) becomes more complicated. Regardless 
of this complication the resulting algorithm to estimate data coordinates xi interleaves the 
clustering process and the optimization of the embedding in a Euclidian space. The artificial 
separation of multidimensional scaling from data clustering has been avoided. Data points 
are embedded and clustered simultaneously. Furthermore, we have derived a maximum 
entropy approximation which is most robust with respect to changes in the average costs 
4 Clustering Results 
Non-topological (Tav = 5.v) clustering results at zero temperature for the logarithmic 
complexity measure (C = 1ogp) are shown in Fig. 2b. In the limit of very small com- 
plexity costs the best clustering solution densely covers the data distribution. The specific 
choice of logarithmic complexity costs causes an almost homogeneous density of cluster 
centers, a phenomenon which is known from studies of asymptotic codebook densities and 
which is explained by the vanishing average complexity costs () = -p log p of very 
sparsely occupied clusters (for references see (Buhmann, Kiihnel, 1993b)). 
Figure 2c shows a clustering configuration assuming a one-dimensional topology in index 
space with nearest neighbor transitions. The short links between neighboring nodes of the 
neural chain indicate that the distortions due to cluster index transitions have also been 
optimized. Note, that complexity optimized clustering determines the length of the chain 
or, for a more general noise distribution, an optimal size of the cluster set. This stopping 
criterion for adding new cluster nodes generalizes self-organizing feature maps (Kohonen, 
1984) and removes arbitrariness in the design of topological mappings. Furthermore, our 
algorithm is derived from an energy minimization principle in contrast to self-organizing 
feature maps which "cannot be derived as a stochastic gradient on any energy function" 
(Erwin et al., 1992). 
The complexity optimized clustering scheme has been tested on the real world task of 
110 Buhmann and Hofmann 
a 
b 
d 
Figure 3: Quantization of a 128x 128, 8bit, gray-level image. (a) Original picture. (b) 
Image reconstruction from wavelet coefficients quantized with entropic complexity. (c) 
Reconstruction from wavelet coefficients quantized by K-means clustering. (d,e) Absolute 
values of reconstruction errors in the images (b,c). Black is normalized in (d,e) to a deviation 
of 92 gray values. 
image compression (Bohmann, Kiihnel, 1993b). Entropy optimized clustering of wavelet 
decomposed images has reduced the reconstruction error of the compressed images up to 
30 percent. Images of a compression and reconstruction experiment are shown in Fig. 3. 
The compression ratio is 24.5 for a 128 x 128 image. According to our efficiency criterion 
entropy optimized compression is 36.8% more efficient than K-means clustering for that 
compression factor. The peak SNR values for (b,c) are 30.1 and 27.1, respectively. The 
considerable higher error near edges in the reconstruction based on K-means clustering (e) 
demonstrates that entropy optimized clustering of wavelet coefficients not only results in 
higher compression ratios but, even more important it preserves psychophysically important 
image features like edges more faithfully than conventional compression schemes. 
5 Conclusion 
Complexity optimized clustering is a maximum entropy approach to central and pairwise 
data clustering which determines the optimal number of clusters as a compromise between 
distortion errors and the complexity of a cluster set. The complexity term tums out to be as 
important for the design of a cluster set as the distortion measure. Complexity optimized 
clustering maps onto a winner-take-all network which suggests hardware implementations 
in analog VLSI (Andreou et al., 1991). Topology preserving clustering provides us with a 
Central and Pairwise Data Clustering by Competitive Neural Networks 111 
cost function based approach to limit the size of self-organizing maps. 
The maximum entropy estimation for pairwise clustering cannot be solved analytically 
but has to be approximated by a meanfield approach. This meanfield approximation of 
the pairwise clustering costs with quadratic Euclidian distances establishes a connection 
between multidimensional scaling and clustering. Contrary to the usual strategy which 
embeds data according to their dissimilarities in a Euclidian space and, in a separate second 
step, clusters the embedded data, our approach finds the Euclidian embedding and the data 
clusters simultaneously and in a selfconsistent fashion. 
The proposed framework for data clustering unifies traditional clustering techniques like 
K-means clustering, entropy constraint clustering or fuzzy clustering with neural network 
approaches such as topological vector quantizers. The network size and the cluster parame- 
ters are determined by a problem adapted complexity function which removes considerable 
arbitrariness present in other non-parametric clustering methods. 
Acknowledgement: JB thanks H. Ktihnel for insightful discussions. This work was 
supported by the Ministry of Science and Research of the state Nordrhein-Westfalen. 
References 
Amit, D. (1989). Modelling Brain Function. Cambridge: Cambridge University Press. 
Andreou, A. G., Boahen, K. A., Pouliquen, P. O., Pavasovi6, A., Jenkins, R. E., Stro- 
hbehn, K. (1991 ). Current Mode Subthreshold MOS Circuits for Analog VLSI Neural 
Systems. IEEE Transactions on Neural Networks, 2, 205-213. 
Buhmann, J., Ktihnel, H. (1993a). Complexity Optimized Data Clustering by Competitive 
Neural Networks. Neural Computation, 5, 75-88. 
Buhmann, J., Ktihnel, H. (1993b). Vector Quantization with Complexity Costs. IEEE 
Transactions on Information Theory, 39(4), 1133-1145. 
Chou, P. A., Lookabaugh, T., Gray, R. M. (1989). Entropy-Constrained Vector Quantization. 
IEEE Transactions on Acoustics, Speech and Signal Processing, 37, 31-42. 
Erwin, W., Obermayer, K., Schulten, K. (1992). Self-organizing Maps: Ordering, Conver- 
gence Properties, and Energy Functions. Biological Cybernetics, 67, 47-55. 
Gray, R. M. (1984). Vector Quantization. IEEE Acoustics, Speech and Signal Processing 
Magazine, April, 4-29. 
Kohonen, T. (1984). Self-organization and Associative Memory. Berlin: Springer. 
Linde, Y., Buzo, A., Gray, R. M. (1980). An algorithm for vector quantizer design. IEEE 
Transactions on Communications COM, 28, 84-95. 
Ritter, H., Martinetz, T., Schulten, K. (1992). Neural Computation and Self-organizing 
Maps. New York: Addison Wesley. 
Rose, K., Gurewitz, E., Fox, G. (1990). Statistical Mechanics and Phase Transitions in 
Clustering. Physical Review Letters, 65(8), 945-948. 
Tikochinsky, Y., Tishby, N.Z., Levine, R. D. (1984). Alternative Approach to Maximum- 
Entropy Inference. Physical Review A, 30, 2638-2644. 
