Multidimensional Scaling and Data Clustering 
Thomas Hofmann & Joachim Buhmann 
Rheinische Friedrich-Wilhelms-Universitit 
Institut fiir Informatik HI, R6merstra13e 164 
D-53117 Bonn, Germany 
email:(h, jb}@cs. uni-bonn. de 
Abstract 
Visualizing and structuring pairwise dissimilarity data are difficult combinatorial op- 
timization problems known as multidimensional scaling or pairwise data clustering. 
Algorithms for embedding dissimilarity data set in a EuclidJan space, for clustering 
these data and for actively selecting data to support the clustering process are discussed 
in the maximum entropy framework. Active data selection provides a strategy to discover 
structure in a data set efficiently with partially unknown data. 
1 Introduction 
Grouping experimental data into compact clusters arises as a data analysis problem in psy- 
chology, linguistics, genetics and other experimental sciences. The data which are supposed 
to be clustered are either given by an explicit coordinate representation (central clustering) 
or, in the non-metric case, they are characterized by dissimilarity values for pairs of data 
points (pairwise clustering). In this paper we study algorithms (i) for embedding non-metric 
data in a D-dimensional Euclidian space, (ii) for simultaneous clustering and embedding of 
non-metric data, and (iii) for active data selection to determine a particular cluster structure 
with minimal number of data queries. All algorithms are derived from the maximum entropy 
principle (Hertz et al., 1991) which guarantees robust statistics (Tikochinsky et al., 1984). 
The data are given by a real-valued, symmetric proximity matrix D 6 R Nx N, Z)kt being 
the pairwise dissimilarity between the data points k, I. Apart from the symmetry constraint 
we make no further assumptions about the dissimilarities, i.e., we do not require D being a 
metric. The numbers Z)kt quite often violate the triangular inequality and the dissimilarity of 
a datum to itself could be finite. 
2 Statistical Mechanics of Multidimensional Scaling 
Embedding dissimilarity data in a D-dimensional Euclidian space is a non-convex optimiza- 
tion problem which typically exhibits a large number of local minima. Stochastic search 
methods like simulated annealing or its deterministic variants have been very successfully 
460 Thomas Hofmann, Joachim Buhmann 
applied to such problems. 
{xi -' in a D-dimensional Euclidian space with minimal embedding costs 
Ji----I 
N 
74MD s 1 
= [Ix- 
The question in multidimensional scaling is to find coordinates 
(1) 
i,k=l 
N 
Without loss of generality we shift the center of mass in the origin (]k:l xk = 0). 
In the maximum entropy framework the coordinates {xi} are regarded as random variables 
which are distributed according to the Gibbs distribution P({xi}) = exp(-/ (74 MDs- Or). The 
inverse temperature . = 1/T controls the expected embedding costs (74 MDs) (expectation val- 
ues are denoted by (.)). To calculate the free energy Or for 74 MDs we approximate the coupling 
,X l)ikXiXk/  =1 xihi with the mean fields hi = 4 ]]kN__ T)ik(Xk)/N ' 
term 2 
Standard techniques to evaluate the free energy Or yield the equations 
5oo oo D 
-cx -oo d,d'=l 
D N oo 
.?(7- MDS) ---- 2 Z '/'d' /N Zln dxiexp(-3f(xi)), (3) 
d,d'=l i=1 
N 
2 
f(xi) - Ixi14-lxi12Y])ik +4x'7gxi+x/T(hi--4r)- (4) 
k=l 
The integral in Eq. (2) is dominated by the absolute minimum of Or in the limit N --4 o. 
Therefore, we calculate the saddle point equations 
N 
7. = ;Z ((x/x/T) +  (Ixi12)I) and 0 - N 
i=1 i=1 
(Xi) = f xiexp(-flf(xi)dxi 
f exp(-flf(xi)dxi (6) 
Equation (6) has been derived by differentiating Or with respect to hi. I denotes the D x D 
unit matrix. In the low temperature limit/7 --4 c the integral in (3) is dominated by the 
minimum of f (x). Therefore, a new estimate of (xi) is calculated minimizing f with respect 
to xi. Since all explicit dependencies between the xi have been eliminated, this minimization 
can be performed independently for all i, 1 _< i _< N. 
In the spirit of the EM algorithm for Gaussian mixture models we suggest the following 
algorithm to calculate a meanfield approximation for the multidimensional scaling problem. 
initialize (xl)  randomly; t----O. 
while z_,i=,l<xi) --<xl)('-'>l >  
E-step: estimate (Xl) (t+) as a 
M-step: 
function of (xl) , (t), 
calculate ('), h? and determine (t) such 
that the centroid condition is satisfied. 
29 (), h? ) 
Multidimensional Scaling and Data Clustering 461 
This algorithm was used to determine the embedding of protein dissimilarity data as shown in 
Fig. ld. The phenomenon that the data clusters are arranged in a circular fashion is explained 
by the lack of small dissimilarity values. The solution in Fig. ld is about a factor of two 
better than the embedding found by a classical MDS program (Gower, 1966). This program 
determines a (N - 1)- space where the ranking of the dissimilarities is preserved and uses 
principle component analysis to project this tentative embedding down to two dimensions. 
Extensions to other MDS cost functions are currently under investigation. 
3 Multidimensional Scaling and Pairwise Clustering 
Embedding data in a Euclidian space precedes quite often a visual inspection by the data 
analyst to discover structure and to group data into clusters. The question arises how both 
problems, the embedding problem and the clustering problem, can be solved simultaneously. 
The second algorithm addresses the problem to embed a data set in a Euclidian space such 
that the clustering structure is approximated as faithfully as possible in the maximum entropy 
sense by the clustering solution in this embedding space. The coordinates in the embedding 
space are the free parameters for this optimization problem. 
Clustering of non-metric dissimilarity data, also called pairwise clustering (Buhmann, Hof- 
mann, 1994a), is a combinatorial optimization problem which depends on Boolean assign- 
ments Mir  {0, 1 } of datum i to cluster u. The cost function for pairwise clustering with 
K clusters is 
K N N N 
1 1 
[5(M) = E 2p,N E E M,Mt,V, with Pv = /E M:,,. (7) 
v=l k=l /=1 k=l 
In the meanfield approach we approximate the Gibbs distribution P() corresponding 
to the original cost function by a family of approximating distributions. The distribution 
which represents most accurately the statistics of the original problem is determined by 
the minimum of the Kullback-Leibler divergence to the original Gibbs distribution. In the 
pairwise clustering case we introduce potentials {k} for the effective interactions, which 
define a set of cost functions with non-interacting assignments. 
K N 
(8) 
t/=l k=l 
The optimal potentials derived from this minimization procedure are 
{'--k} = arg min DKL (P(}-)IIP(;D), (9) 
where P(c) is the Gibbs distribution corresponding to ., and VK&ll') is the KL- 
divergence. This method is equivalent to minimizing an upper bound on the free energy 
(Buhmann, Hofmann, 1994b), 
$-, c,o, < 0$'o(-) + (Vx)o, with Vx pc _ o (10) 
[cK / -- = Six' K  
(.)0 denoting the average over all configurations of the cost function without interactions. 
Correlations between assignment variables are statistically independent for po(}.), i.e., 
{3irk,2V/1,j}0 = {Mtw}0{3fl,}0. The averaged potential Vt,-, therefore, amounts to 
K N K N 
1 
('V,,-) = E E (Mtv)(Mlv)2pvNV6l -- E E ('M/v)tCv' (11) 
t/=l k,l=l u=l k=l 
462 Thomas Hofmann, Joachim Buhmann 
the subscript of averages being omitted for conciseness. The expected assignment variables 
are 
exp(-fli,) (12) 
(Mir) -- K  
E,=, exp(--fii,) 
Minimizing the upper bound yields 
K O(Mi,) (i - i,) -- 0. (13) 
0i. 
The "optimal" potentials 
depend on the given distance matrix, the averaged assignment variables and the cluster 
probabilities. They are optimal in the sense, that if we set 
iv --- i q- i (15) 
the N, K stationarity conditions (13) are fulfilled for every i  { 1, ..., N}, t,  { 1, ..., K}. A 
simultaneous solution of Eq. (15) with (12) constitutes a necessary condition for a minimum 
of the upper bound for the free energy Y. 
The connection between the clustering and the multidimensional scaling problem is estab- 
lished, if we restrict the potentials i to be of the form Ix/ - yl 2 with the centroids 
N 
- = Mk. We consider the coordinates xi as the variational param- 
y,, = Mk,,x,/ N 
eters. The additional constraints restrict the family of approximating distributions, defined 
by co to a subset. Using the chain rule we can calculate the derivatives of the upper bound 
(10), resulting in the exact stationary conditions for xi, 
K N K 
Np x 
c,v= 1 j=l c%v=l 
(xj- 
(16) 
where Ai = & - '. The derivatives O(Mk)/Ox can be exactly calculated, since they 
are given as the solutions of an linear equation system with N x K unknowns for every xi. To 
reduce the computational complexity an approximation can be derived under the assumption 
Oya/Oxi  O. In this case the right hand side of (16) can be set to zero in a first order 
approximation yielding an explicit formula for xi, 
(17) 
with the covariance matrix ]Ci ((yyT)i (y)i(y)T) and (y)i K 
= - 
The derived system of transcendental equations given by (12), (17) and the centroid condi- 
tion explicitly reflects the dependencies between the clustering procedure and the Euclidian 
representation. Solving these equations simultaneously leads to an efficient algorithm which 
Multidimensional Scaling and Data Clustering 463 
a 
HA 
GGI 
MY 
HBX, 
HF, HE 
GP 
GGG 
HG,HE 
% HBX,HF, HE 
 MY 
 dP ' 
GG 1 x'x GG G 
x 
x 
+ 
HB  
++ + 
 + MY 
HG,HE,HF 
420 
380 
340 
300 
Random Selection 
2000 4000 6000 
# of selected D 
Figure 1: Similarity matrix of 145 protein sequencesof the globin family (a): dark gray levels 
correspond to high similarity values; (b): clustering with embedding in two dimensions; (c): 
multidimensional scaling solution for 2-dimensional embedding; (d): quality of clustering 
solution with random and active data selection of Dik values. c has been calculated on the 
basis of the complete set of Dik values. 
interleaves the multidimensional scaling process and the clustering process and which avoids 
an artificial separation into two uncorrelated processes. The described algorithm for simul- 
taneous Euclidian embedding and data clustering can be used for dimensionality reduction, 
e.g., high dimensional data can be projected to a low dimensional subspace in a nonlinear 
fashion which resembles local principle component analysis (Buhmann, Hofmann, 1994b). 
Figure (1) shows the clustering result for a real-world data set of 145 protein sequences. The 
similarity values between pairs of sequences are determined by a sequence alignment program 
which takes biochemical and structural information into account. The sequences belong to 
different protein families like hemoglobin, myoglobin and other globins; they are abbreviated 
with the displayed capital letters. The gray level visualization of the dissimilarity matrix with 
dark values for similar protein sequences shows the formation of distinct "squares" along the 
main diagonal. These squares correspond to the discovered partition after clustering. The 
embedding in two dimensions shows inter-cluster distances which are in consistent agreement 
with the similarity values of the data. In three and four dimensions the error between the 
464 Thomas Hofmann, Joachim Buhmann 
given dissimilarities and the constructed distances is further reduced. The results are in good 
agreement with the biological classification. 
4 Active Data Selection for Data Clustering 
Active data selection is an important issue for the analysis of data which are characterized 
by pairwise dissimilarity values. The size of the distance matrix grows like the square of 
the number of data 'points'. Such a O(N 2) scaling renders the data acquisition process 
expensive. It is, therefore, desirable to couple the data analysis process to the data acquisition 
process, i.e., to actively query the supposedly most relevant dissimilarity values. Before 
addressing active data selection questions for data clustering we have to discuss the problem 
how to modify the algorithm in the case of incomplete data. 
If we want to avoid any assumptions about statistical dependencies, it is impossible to infer 
unknown values and we have to work directly with the partial dissimilarity matrix. Since the 
data enters only in the (re-)calculation of the potentials in (14), it is straightforward to appro- 
priately modify these equations. All sums are restricted to terms with known dissimilarities 
and the normalization factors are adjusted accordingly. 
Alternatively we can try to explicitly estimate the unknown dissimilarity values based on 
a statistical model. For this purpose we propose two models, relying on a known group 
structure of the data. The first model (I) assumes that all dissimilarities between a point 
i and points j belonging to a group G are i.i.d. random variables with the probability 
density PO parameterized by 00. In this scheme a subset of the known dissimilarities of 
i and j to other points k are used as samples for the estimation of Dij. The selection 
of the specific subset is determined by the clustering structure. In the second model (II) 
we assume that the dissimilarities between groups Gv, G are i.i.d. random variables with 
density pv, parameterized by O,,t,. The parameters 0 are estimated on the basis of all 
known dissimilarities {Dij  D} between points from G and G. 
The assignments of points to clusters are not known a priori and have to be determined in the 
light of the (given and estimated) data. The data selection strategy becomes self-consistent 
if we interpret the mean fields (3di,,) of the clustering solution as posterior probabilities for 
the binary assignment variables. Combined with a maximum likelihood estimation for the 
unknown parameters given the posteriors, we arrive at an EM-like iteration scheme with the 
E-step replaced by the clustering algorithm. 
The precise form of the M-Step depends on the parametric form of the densities Pit or 
respectively. In the case of Gaussian distributions the M-Step is described by the following 
estimation equations for the location parameters 
ij _ 1 
with -,, I+, (/M'//MJ,/+/MoilMir/)  Corresponding expressions are derived 
for the standard deviations =0/ t respectively. In the case of non-normal distributions 
the empirical mean might still be a good estimator of the location parameter, though not 
necessarily a maximum likelihood estimator. The missing dissimilarities are estimated by 
the following statistics, derived from the empirical means. 
0!9 
,.,.= :XS. + N5 (I), Di  =  i5 (t) (II), (19) 
Multidimensional Scaling and Data Clustering 465 
L_ _ Random Data 
{, i ectin 
Active Data :----, 
Selection ' ........................ 
400 800 1200 
# of selected dissimilarities 
Figure 2: Similarity matrix of 54 word fragments generated by a dynamic programming 
algorithm. The clustering costs in the experiment with active data selection requires only half 
as much data as a random selection strategy. 
with ',, = v,Ev(3'l,). For model (I) we have used a pooled estimator to exploit the 
data symmetry. The iteration scheme finally leads to estimates Oit, or O,t, respectively for the 
parameters and ff)ij for all unknown dissimilarities. 
Criterion for Active Data Selection: We will use the expected reduction in the variance of 
the free energy Or0 as a score, which should be maximized by the selection criterion. Or0 is 
given by '0(D) = -i Zi'x-l log Z,'__i exp(-/h(D)). If we query a new dissimilarity 
T)ij the expected reduction of the variance of the free energy is approximated by 
Aij -- 2 Lq-/j] V [l)ij- ff)ij] 
(20) 
The partial derivatives can be calculated exactly by solving a system of linear equations with 
3; x K unknowns. Alternatively a first order approximation in e,, = O(1/Np,,) yields 
0'0 2 Np,+O e. . 
This expression defines a relevance measure of T)ij for the clustering problem since a T)ij 
value contributes to the clustering costs only if the data i and j belong to the same cluster. 
Equation (21) summarizes the mean-field contributions O'o/ODij  O(H)o/ODij. 
To derive the final form of our scoring function we have to calculate an approximation of 
the variance in Eq. (20) which measures the expected squared error for replacing the true 
value T)ij with our estimate 1)ij. Since we assumed statistical independence the variances 
are additive V [Dij - Dij] = V [Dij] + V [D/j]. The total population variance is a sum 
of inner- and inter-cluster variances, that can be approximated by the empirical means and 
by the empirical variances instead of the unknown parameters of pi/or P/tt' The sampling 
variance of the statistics 1)ii is estimated under the assumption, that the empirical means 7,i/ 
466 Thomas Hofmann, Joachim Buhmann 
or rh, t, respectively are uncorrelated. This holds in the hard clustering limit. We arrive at 
the following final expression for the variances of model (II) 
V[l)ij-Dij]  E7rvl (Dij-rhvl) 1 flu. -2 (22) 
For model (I) a slightly more complicamd foula can be derived. Inseffing the estimamd 
variances into Eq. (20) leads m the final expression for our scoring function. 
To demonsate the efficiency of the proposed selection sategy, we have compared e 
clustering costa achieved by active dam selection with the clustering costa resulting from 
randomly queried dam. Assignmenm int e case of active selection e calculated with 
statistical model (I). Figure 1 d demonsates that the clustering costa decrease significantly 
hsmr when the selection criterion (20) is implemented. The scmre of the clusmfing 
solution has been completely infeed wi about 3300 selected ik values. The random 
stramgy requires about 6500 queries for e same qualiW. Analogous comprison results for 
linguistic dam e summarized in Fig. 2. Note the inconsistencies in is dam set reflected by 
small Dik values oumide the cluster blocks (dk pixels) or by the lge ik values (white 
pixels) inside a block. 
Conclusion: Dam analysis of dissimil data is a challenging problem in molecul bi- 
ology, linguistics, psychology and, in general, in pattern recognition. We have presented 
 ree sategies to visualize dam structures and to inquire the data scmre by an efficient 
data selection procedure. The respective algofims e derived in the maximum enopy 
framework for maximal robustness of cluster estimation and dam embedding. Active data 
selection has been shown to require only half as much data for estimating a clustering solution 
of fixed quality comped to a random selection sategy. We expect the proposed selection 
sategy to facilitate maintenance of genome and protein data bases and to yield more robust 
data prototypes for efficient seamh and dam base mining. 
Acknowledgement: It is a pleasure m thank M. Vingron and D. Bavelier for providing e 
promin dam and the linguistic dam, respectively. We are also grateful to A. Polzer and H.J. 
Wneboldt for implementing the S algofim. This work was paffially supposed by e 
Minisy of Science and Reseamh of the state Nordrhein-Westfalen. 
References 
Buhmann, J., Hofmann, T. (1994a). Central and Pairwise Data Clustering by Competitive 
Neural Networks. Pages 104-111 of.' Advances in Neural Information Processing 
Systems 6. Morgan Kaufmann Publishers. 
Buhmann, J., Hofmann, T. (1994b). A Maximum Entropy Approach to Pairwise Data 
Clustering. Pages 207-212 of.' Proceedings of the International Conference on Pattern 
Recognition, Hebrew University, Jerusalem, vol. II. IEEE Computer Society Press. 
Gower, J. C. (1966). Some distance properties of latent root and vector methods used in 
multivariate analysis. Biometrika, 53, 325-328. 
Hertz, J., Krogh, A., Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. 
New York: Addison Wesley. 
Tikochinsky, Y., Tishby, N.Z., Levine, R. D. (1984). Alternative Approach to Maximum- 
Entropy Inference. Physical Review A, 30, 2638-2644. 
