A Unified Learning Scheme: 
Bayesian-Kullback Ying-Yang Machine 
Lei Xu 
1. Computer Science Dept., The Chinese University of HK, Hong Kong 
2. National Machine Perception Lab, Peking University, Beijing 
Abstract 
A Bayesian-Kullback learning scheme, called the Ying-Yang Machine,
is proposed based on the two complementary but equivalent Bayesian
representations of the joint density and their Kullback divergence.
The scheme not only unifies the major existing supervised and unsupervised
learning methods, including classical maximum likelihood or
least square learning, maximum information preservation, the
EM and em algorithms and information geometry, the recently popular
Helmholtz machine, as well as other learning methods with new
variants and new results, but also provides a number
of new learning models.
1 INTRODUCTION 
Many different learning models have been developed in the literature. We may
have come to an age of searching for a unified scheme for them. With a unified scheme,
we may understand the existing models and their relationships more deeply, which may
lead to cross-fertilization among them and yield new results and variants. We may also be
guided to develop new learning models once we better understand which
cases have already been studied and which have been missed and deserve further exploration.
Recently, a Bayesian-Kullback scheme, called the YING-YANG Machine, has been
proposed as such an effort (Xu, 1995a). It is based on the Kullback divergence and two
complementary but equivalent Bayesian representations of the joint distribution of the
input space and the representation space, instead of merely using the Kullback divergence
for matching unstructured joint densities as in information geometry type
learning (Amari, 1995a&b; Byrne, 1992; Csiszar, 1975). The two representations
consist of four different components. Different combinations of choices for each
component lead the YING-YANG Machine to different learning models. Thus,
it acts as a general learning scheme for unifying the major existing unsupervised
and supervised learning methods. As shown in Xu (1995a), one special case reduces to
the EM algorithm (Dempster et al., 1977; Hathaway, 1986; Neal & Hinton, 1993)
and the closely related Information Geometry theory and the em algorithm (Amari,
1995a&b), to the MDL autoencoder with a "bits-back" argument by Hinton & Zemel
(1994) and its alternative equivalent form that minimizes the bits of uncoded residual
errors and the unused bits in the transmission channel's capacity (Xu, 1995d),
as well as to Multisets modeling learning (Xu, 1995e), a unified learning framework
for clustering, PCA-type learning and the self-organizing map. Another special case
reduces to maximum information preservation (Linsker, 1989; Atick & Redlich,
1990; Bell & Sejnowski, 1995). More interestingly, yet another special case reduces
to the Helmholtz machine (Dayan et al., 1995; Hinton et al., 1995) with new understanding.
Moreover, the YING-YANG machine also includes maximum likelihood or least
square learning.
Furthermore, the YING-YANG Machine has also been extended to temporal patterns
with a number of new models for signal modeling. Some of them are
extensions of the Helmholtz machine or maximum information preservation learning to
temporal processing. Some of them include and extend the Hidden Markov Model
(HMM), ARMA and AR models (Xu, 1995b). In addition, it has also been shown in
Xu (1995a&c, 1996a) that one special case of the YING-YANG machine provides
three variants for clustering or VQ, together with criteria and an automatic
procedure for selecting the number of clusters in clustering
analysis or Gaussian mixtures, a classical problem that has remained open for decades.
In this paper, we present a deeper and more systematic study. Section 2 redescribes
the unified scheme on a more precise and systematic basis by discussing
the possible marital status of the two Bayesian representations of the joint density.
Section 3 summarizes and explains the existing models under the unified scheme;
in particular, we clarify a confusion made in the previous papers (Xu,
1995a&b) on maximum information preservation learning. Section 4 proposes and
summarizes a number of possible new models suggested by the unified scheme.
2 BAYESIAN-KULLBACK YING-YANG MACHINE 
As argued in Xu (1995a), unsupervised and supervised learning problems can be
summarized as the problem of estimating the joint density $P(x, y)$ of patterns in
the input space $X$ and the representation space $Y$, as shown in Fig. 1. Under the
Bayesian framework, we have two representations of $P(x, y)$. One is $P_{M_1}(x, y) =
P_{M_1}(y|x) P_{M_1}(x)$, implemented by a model $M_1$ called the YANG (male) part since it
performs the task of transferring a pattern (a real body) into a code (a seed). The
other is $P_{M_2}(x, y) = P_{M_2}(x|y) P_{M_2}(y)$, implemented by a model $M_2$ called the YING
part since it performs the task of generating a pattern (a real body) from a code (a
seed). They are complementary to each other and together implement an entire circle
$x \to y \to x$. This complies with the ancient Chinese YING-YANG philosophy.
Here we have four components $P_{M_1}(x)$, $P_{M_1}(y|x)$, $P_{M_2}(x|y)$ and $P_{M_2}(y)$. The
$P_{M_1}(x)$ can be fixed at some density estimate on the input data; e.g., we have at least
two choices, the Parzen window estimate $P_h(x)$ or the empirical estimate $P_0(x)$:
$$P_h(x) = \frac{1}{N h^d} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right), \qquad P_0(x) = \lim_{h \to 0} P_h(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i). \qquad (1)$$
For $P_{M_1}(y|x)$ and $P_{M_2}(x|y)$, each has three choices: (1) from a parametric family
specified by model $M_1$ or $M_2$; (2) free of model, with $P_{M_1}(y|x) = P(y|x)$ or
$P_{M_2}(x|y) = P(x|y)$; (3) broken channel, with $P_{M_1}(y|x) = P_{M_1}(y)$ or $P_{M_2}(x|y) = P_{M_2}(x)$.
Finally, $P_{M_2}(y)$, with its $y$ consistent with $P_{M_1}(y|x)$, can also be from a parametric
family or free of model. Any combination of the choices for the four components
forms a potential YING-YANG pair. We have at least $2 \times 3 \times 3 \times 2 = 36$ pairs.
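The Parzen window estimate of eq. (1) can be illustrated with a minimal sketch. It assumes one-dimensional data ($d = 1$) and a Gaussian kernel, neither of which is specified in the text; the names `parzen_estimate`, `integrate` and the sample values are hypothetical and chosen only for illustration.

```python
import math

def parzen_estimate(x, samples, h):
    """Parzen window estimate P_h(x) for d = 1 (assumed Gaussian kernel)."""
    n = len(samples)
    total = 0.0
    for xi in samples:
        u = (x - xi) / h
        total += math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return total / (n * h)

def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * width) for k in range(steps)) * width

# Hypothetical input data, used only to exercise the estimate.
samples = [-1.0, 0.0, 0.5, 2.0]

# The estimate should integrate to (approximately) one for any h > 0.
mass = integrate(lambda x: parzen_estimate(x, samples, h=0.3), -10.0, 10.0)

# A smaller h concentrates mass around the samples, approaching P_0(x).
peak_wide = parzen_estimate(0.0, samples, h=0.3)
peak_narrow = parzen_estimate(0.0, samples, h=0.1)
```

As $h$ shrinks, the estimate sharpens around each sample, which is the sense in which $P_0(x)$ is the limit of $P_h(x)$.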
[Figure 1: The joint spaces X, Y and the YING-YANG Machine. The YANG part $P_{M_1}(y|x)$ performs encoding (recognition, representation) from the input pattern space X, with density $P_{M_1}(x)$, to the representation space Y (symbols, integers, binary codes, reals); the YING part performs decoding (generating, reconstruction) from Y back to X.]

A YING-YANG pair has four types of marital status: (a) marry, i.e., YING and
YANG match each other; (b) divorce, i.e., YING and YANG go away from each
other; (c) YING chases YANG, but YANG escapes; (d) YANG chases YING, but YING
escapes. The four types can be described by a combination of minimization (chasing)
and maximization (escaping) on one of the two Kullback divergences below:
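$$K(M_1, M_2) = \int_{x,y} P_{M_1}(y|x) P_{M_1}(x) \log \frac{P_{M_1}(y|x) P_{M_1}(x)}{P_{M_2}(x|y) P_{M_2}(y)} \, dx \, dy \qquad (2a)$$
$$K(M_2, M_1) = \int_{x,y} P_{M_2}(x|y) P_{M_2}(y) \log \frac{P_{M_2}(x|y) P_{M_2}(y)}{P_{M_1}(y|x) P_{M_1}(x)} \, dx \, dy \qquad (2b)$$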
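Table 1: Mathematical description of the marital status of a YING-YANG pair

| (a), (b) | (c) | (d) |
| (a) $\min_{M_1,M_2} K(M_1,M_2)$ | $\max_{M_1} \min_{M_2} K(M_1,M_2)$ | $\max_{M_2} \min_{M_1} K(M_1,M_2)$ |
| (b) $\max_{M_1,M_2} K(M_1,M_2)$ | $\min_{M_2} \max_{M_1} K(M_1,M_2)$ | $\min_{M_1} \max_{M_2} K(M_1,M_2)$ |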
We can replace $K(M_1, M_2)$ by $K(M_2, M_1)$ in the table. The 2nd and 3rd columns are for
(c) and (d) respectively; each has two cases depending on who starts the act, and the two
are usually not equivalent. Their results are undefined, depending on the initial condition for
$M_1, M_2$, except for two special cases: (i) free $P_{M_1}(y|x)$ and parametric $P_{M_2}(x|y)$, with
$\min_{M_2} \max_{M_1} K$ being the same as (b) with broken $P_{M_1}(y|x)$, and with $\max_{M_2} \min_{M_1} K$
defined but useless; (ii) free $P_{M_2}(x|y)$ and parametric $P_{M_1}(y|x)$, with $\min_{M_1} \max_{M_2} K$
the same as case (a) with broken $P_{M_2}(x|y)$, and with $\max_{M_1} \min_{M_2} K$ defined but useless.
Therefore, we will focus on the status marry and divorce. Even so, not all of the
above mentioned $2 \times 3 \times 3 \times 2 = 36$ YING-YANG pairs provide sensible learning
models, although $\min_{M_1,M_2} K$ and $\max_{M_1,M_2} K$ are always well defined. Fortunately,
quite a number of them indeed lead to useful learning models, as will be shown
in the subsequent sections.
We can implement $\min_{M_1,M_2} K(M_1, M_2)$ by the following Alternative Minimization
(ALTMIN) procedure:
Step 1: Fix $M_2 = M_2^{old}$, and get $M_1^{new} = \arg\min_{M_1} K(M_1, M_2^{old})$.
Step 2: Fix $M_1 = M_1^{old}$, and get $M_2^{new} = \arg\min_{M_2} K(M_1^{old}, M_2)$.
The ALTMIN iteration will finally converge to a local minimum of $K(M_1, M_2)$. We can
have a similar procedure for $\max_{M_1,M_2} K(M_1, M_2)$ by replacing min with max.
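The two ALTMIN steps can be sketched on a toy problem. The example below is an illustration under assumptions that are not in the text: binary $x$ and $y$, a YANG part with one parameter `a` (the probability that $y = x$), a YING part with one parameter `b` (the probability that $x = y$) and uniform $P_{M_2}(y)$, and a grid search standing in for each arg min. All names are hypothetical.

```python
import math

P_X = [0.3, 0.7]                              # fixed P_M1(x), assumed for the toy
GRID = [k / 100.0 for k in range(1, 100)]     # candidate parameter values in (0, 1)

def divergence(a, b):
    """Discrete K(M1, M2) of eq. (2a) for the toy binary models."""
    total = 0.0
    for x in (0, 1):
        for y in (0, 1):
            p1 = P_X[x] * (a if y == x else 1.0 - a)   # P_M1(y|x) P_M1(x)
            p2 = 0.5 * (b if x == y else 1.0 - b)      # P_M2(x|y) P_M2(y)
            total += p1 * math.log(p1 / p2)
    return total

def altmin(a, b, sweeps=20):
    """Alternate the two steps and record K after every sweep."""
    history = [divergence(a, b)]
    for _ in range(sweeps):
        # Step 1: fix M2 (b), minimize over M1 (a).
        a = min(GRID, key=lambda cand: divergence(cand, b))
        # Step 2: fix M1 (a), minimize over M2 (b).
        b = min(GRID, key=lambda cand: divergence(a, cand))
        history.append(divergence(a, b))
    return a, b, history

a_hat, b_hat, history = altmin(a=0.9, b=0.6)
```

Because each step minimizes over a set that contains the current value, the recorded divergence never increases from one sweep to the next; replacing `min` with `max` in both steps gives the corresponding sketch for the divorce status.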
Since the above scheme is based on the two complementary YING and YANG Bayesian
representations and on their Kullback divergence for their marital status, we call it the
Bayesian-Kullback YING-YANG learning scheme. Furthermore, under this scheme
we call each YING-YANG pair that is sensible for learning purposes a
Bayesian-Kullback YING-YANG Machine, or YING-YANG machine for short.
3 UNIFIED EXISTING LEARNINGS 
Let PM(X)= Po(x) by eq.(1) and put it into eq.(2), through certain mathematics 
we can get K(M, M2) = hM -- haux - qM, + D with D independent of M, M2 
and hM, hax, qMl,2 given by Eqs.(E1)(E2)&(E4) in Tab.2 respectively. The larger 
A Unified Learning Scheme: Bayesian-Kullback Ying-Yang Machine 447 
is the hMt, the more discriminative or separable are the representations in Y for the 
input data set. The larger is the hM, the more concentrated the representations 
in Y. The larger is the qm,2, the better Pm2(xly ) fits the input data. 
Therefore, minm,mK(Mi,M2) consists of (1) best fitting of Pm(zly ) on input 
data via maxqm,, which is desirable, (2) producing more concentrated representa- 
tions in Y to occupy less resource, which is also desirable and is the behind reason for 
solving the problem of selecting cluster number in clustering analysis Xu(1995a&c, 
1996a), (3) but with the cost of less discriminative representations in Y for the input 
data. Inversely, maxM,M K(M, M2) consists of (1) producing best discriminative 
or separable representation PM (ylx) in Y for the input data set, which is desirable, 
in the cost of (2) producing a more uniform representation in Y to fully occupy the 
resource, and (3) causing PM(zly) away from fitting input data. 
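The decomposition of $K(M_1, M_2)$ discussed above can be made concrete as follows. This is a sketch consistent with the interpretation given in the text, not a reproduction of the paper's Eqs. (E1), (E2) and (E4); it assumes a discrete representation $y$, takes $P_{M_1}(x) = P_0(x)$ so that the integral over $x$ becomes an average over the data, and introduces $\alpha_y$ as the data average of $P_{M_1}(y|x_i)$. The term $D$ formally collects everything that involves $P_0$ alone.

```latex
% Assumed forms: discrete y, P_{M_1}(x) = P_0(x) as in eq. (1).
\begin{align*}
K(M_1, M_2) &= \frac{1}{N}\sum_{i=1}^{N}\sum_{y} P_{M_1}(y|x_i)\,
   \log\frac{P_{M_1}(y|x_i)\,P_0(x_i)}{P_{M_2}(x_i|y)\,P_{M_2}(y)}
   = h_{M_1} - ha_{M_1} - q_{M_{1,2}} + D, \\
h_{M_1} &= \frac{1}{N}\sum_{i=1}^{N}\sum_{y} P_{M_1}(y|x_i)\,\log P_{M_1}(y|x_i), \\
ha_{M_1} &= \sum_{y}\alpha_y \log P_{M_2}(y), \qquad
   \alpha_y = \frac{1}{N}\sum_{i=1}^{N} P_{M_1}(y|x_i), \\
q_{M_{1,2}} &= \frac{1}{N}\sum_{i=1}^{N}\sum_{y} P_{M_1}(y|x_i)\,\log P_{M_2}(x_i|y).
\end{align*}
```

Under these forms, $h_{M_1}$ is a negative conditional entropy and is largest for hard assignments of each $x_i$ to a single $y$, while $q_{M_{1,2}}$ is an expected log-likelihood of the data under the YING part, which matches the three readings given above.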
Shown in Table 2 are the unified existing unsupervised learning methods. For the case
H-f-W, we have $h_{M_1} = h$, $ha_{M_1} = ha$, $q_{M_{1,2}} = q_{M_2}$, and $\min_{M_1} K(M_1, M_2)$ results
in $P_{M_2}(y) = P_{M_1}(y) = \alpha_y$ and $P_{M_2}(x|y) P_{M_2}(y) = P_{M_2}(x) P_{M_1}(y|x)$ with
$P_{M_2}(x) = \sum_{y} P_{M_2}(x|y) P_{M_2}(y)$. In turn, we get $K(M_1, M_2) = -L_{M_2} + D$ with
$L_{M_2}$ being the likelihood given by eq. (E5), i.e., we get maximum likelihood estimation
on a mixture model. In fact, the ALTMIN given in Tab. 2 leads exactly to the
EM algorithm of Dempster et al. (1977). Also, here $P_{M_1}(x, y)$ and $P_{M_2}(x, y)$ are equivalent
to the data submanifold $\mathcal{D}$ and the model submanifold $\mathcal{M}$ in the Information
Geometry theory (Amari, 1995a&b), with the ALTMIN being the em algorithm.
As shown in Xu (1995a), this case also includes the MDL auto-encoder (Hinton &
Zemel, 1994) and Multi-sets modeling (Xu, 1995e).
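The H-f-W case can be illustrated with the familiar EM iteration for a mixture. The sketch below assumes a one-dimensional mixture of two Gaussians for $P_{M_2}(x|y)$, which is an illustrative choice and not prescribed by the text; step S1 sets the free $P_{M_1}(y|x_i)$ to the posterior of the current YING part, and step S2 re-estimates the YING part. The function names, the data and the small variance floor (a numerical guard) are all hypothetical.

```python
import math

def gauss(x, mean, var):
    """One-dimensional Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, alpha, means, variances):
    """Average log-likelihood of the data under the mixture P_M2(x)."""
    total = 0.0
    for x in data:
        total += math.log(sum(a * gauss(x, m, v)
                              for a, m, v in zip(alpha, means, variances)))
    return total / len(data)

def em_step(data, alpha, means, variances):
    """One ALTMIN sweep for the H-f-W case (assumed Gaussian components)."""
    k = len(alpha)
    # S1: fix M2, set the free P_M1(y|x_i) to the posterior under M2.
    post = []
    for x in data:
        joint = [alpha[y] * gauss(x, means[y], variances[y]) for y in range(k)]
        total = sum(joint)
        post.append([j / total for j in joint])
    # S2: fix M1, update alpha_y and the component parameters.
    n = len(data)
    new_alpha, new_means, new_vars = [], [], []
    for y in range(k):
        weight = sum(p[y] for p in post)
        mean = sum(p[y] * x for p, x in zip(post, data)) / weight
        var = sum(p[y] * (x - mean) ** 2 for p, x in zip(post, data)) / weight
        new_alpha.append(weight / n)
        new_means.append(mean)
        new_vars.append(max(var, 1e-6))   # numerical guard, not part of EM itself
    return new_alpha, new_means, new_vars

# Hypothetical data with two well separated groups.
data = [-2.1, -1.9, -2.4, -1.6, -2.0, 1.8, 2.2, 2.0, 2.5, 1.5]
alpha, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
trace = [log_likelihood(data, alpha, means, variances)]
for _ in range(25):
    alpha, means, variances = em_step(data, alpha, means, variances)
    trace.append(log_likelihood(data, alpha, means, variances))
```

The recorded log-likelihood does not decrease across sweeps, which is the behaviour expected when minimizing $K(M_1, M_2)$ amounts to maximizing the likelihood of the mixture.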
For the case Single-M, $h_{M_1} - ha_{M_1}$ is actually the information transmitted by
the YANG part from $x$ to $y$. In this case, its minimization produces a non-sensible
model for learning. However, its maximization is exactly the Infomax learning
scheme (Linsker, 1989; Atick & Redlich, 1990; Bell & Sejnowski, 1995). Here, we
clear up a confusion made in Xu (1995a&b), where the minimization was mistakenly
considered.
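The quantity maximized in the Single-M case can be computed directly from a table of posteriors. The sketch below assumes discrete $y$, data-averaged entropies, and $P_{M_2}(y)$ set to the marginal $\alpha_y$ of the YANG part, under which $h_{M_1} - ha_{M_1}$ is the empirical mutual information between $x$ and $y$; the function name and the example tables are hypothetical.

```python
import math

def transmitted_information(posteriors):
    """Empirical mutual information between x and y.

    posteriors[i][y] holds P_M1(y|x_i); P_M2(y) is assumed to equal the
    marginal alpha_y = (1/N) sum_i P_M1(y|x_i).
    """
    n = len(posteriors)
    k = len(posteriors[0])
    alpha = [sum(row[y] for row in posteriors) / n for y in range(k)]
    total = 0.0
    for row in posteriors:
        for y in range(k):
            if row[y] > 0.0:                  # convention 0 log 0 = 0
                total += row[y] * math.log(row[y] / alpha[y])
    return total / n

# An uninformative YANG part transmits nothing.
flat = transmitted_information([[0.5, 0.5]] * 4)

# A deterministic, balanced YANG part transmits log 2 nats.
sharp = transmitted_information([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
```

Maximizing this quantity favours posteriors that are individually sharp but collectively spread over all values of $y$, while minimizing it drives the YANG part toward ignoring its input, which is why only the maximization is sensible.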
For the case H-m-W, $h_{M_1} - ha_{M_1} - q_{M_{1,2}}$ is just the $-F(d; \theta, Q)$ used by Dayan et
al. (1995) and Hinton et al. (1995) for the Helmholtz machine. We can set up the detailed
correspondence that (i) here $P_{M_1}(y|x_i)$ is their $Q_\alpha$; (ii) $\log P_{M_2}(x, y)$ is their $-E_\alpha$;
and (iii) their $P_\alpha$ is $P_{M_2}(y|x) = P_{M_2}(x|y) P_{M_2}(y) / \sum_y P_{M_2}(x|y) P_{M_2}(y)$. So, we get
a new perspective on the Helmholtz machine. Moreover, we know that $K(M_1, M_2)$ becomes
a negative likelihood only when $P_{M_2}(x|y) P_{M_2}(y) = P_{M_2}(x) P_{M_1}(y|x)$, which
is usually not true when the YANG and YING parts are both parametric. So the
Helmholtz machine is not equivalent to maximum likelihood learning in general,
with a gap depending on $P_{M_2}(x|y) P_{M_2}(y) - P_{M_2}(x) P_{M_1}(y|x)$. The equivalence is
approximately acceptable only when the family of $P_{M_2}(x|y)$ or/and $P_{M_1}(y|x_i)$ is
large enough, or when $M_2$, $M_1$ are both linear with Gaussian density.
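The gap between $K(M_1, M_2)$ and the negative likelihood can be written out explicitly. The identity below is a standard rearrangement rather than a formula quoted from the text; it assumes discrete $y$, $P_{M_1}(x) = P_0(x)$, and takes $L_{M_2}$ to be the average log-likelihood of the data under the YING part.

```latex
% Assumed: discrete y, P_{M_1}(x) = P_0(x), L_{M_2} the average log-likelihood.
\begin{align*}
K(M_1, M_2) &= -L_{M_2}
   + \frac{1}{N}\sum_{i=1}^{N}\sum_{y} P_{M_1}(y|x_i)\,
     \log\frac{P_{M_1}(y|x_i)}{P_{M_2}(y|x_i)} + D, \\
L_{M_2} &= \frac{1}{N}\sum_{i=1}^{N}\log P_{M_2}(x_i), \qquad
   P_{M_2}(x) = \sum_{y} P_{M_2}(x|y)\,P_{M_2}(y), \\
P_{M_2}(y|x) &= \frac{P_{M_2}(x|y)\,P_{M_2}(y)}{P_{M_2}(x)}.
\end{align*}
```

The middle term is an average Kullback divergence between the YANG posterior and the posterior implied by the YING part. It is nonnegative and vanishes exactly when $P_{M_2}(x|y) P_{M_2}(y) = P_{M_2}(x) P_{M_1}(y|x_i)$ at every data point, which is the matching condition stated above.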
In Tab. 4, the case Single-M under $K(M_2, M_1)$ is the classical maximum likelihood
(ML) learning for supervised learning, which includes the least square learning by
back propagation (BP) for a feedforward net as a special case. Moreover, its counterpart
for a backward net as inverse mapping is the case Single-F under $K(M_1, M_2)$.
4 NEW LEARNING MODELS 
First, a number of variants of the above existing models are given in Table 2.
Second, a particular new model can be obtained from the case H-m-W by changing
$\min_{M_1,M_2}$ into $\max_{M_1,M_2}$. That is, we have $\max_{M_1,M_2} [h_{M_1} - ha_{M_1} - q_{M_{1,2}}]$, shortly
Table 2: BKC-YY Machine for Unsupervised Learning (Part I): $K(M_1, M_2)$

Given data $\{x_i\}_{i=1}^N$, fix $P_{M_1}(x) = P_0(x)$ by eq. (1), and thus $K(M_1, M_2) = K_b + D$, with
$D$ irrelevant to $M_1, M_2$ and $K_b$ given by the following formulae and table:

| Marriage status | H-f-W | Single-M | Single-F | H-m-W | W-f-H |
| Condition | $P_{M_1}(y|x)$ free | $P_{M_2}(x|y)$ broken | $P_{M_1}(y|x)$ broken | both parametric | $P_{M_2}(x|y)$ free |
| $K_b$ | $h - ha - q_{M_2}$ | $h_{M_1} - ha_{M_1}$ | … | $h_{M_1} - ha_{M_1} - q_{M_{1,2}}$ | … |
| Objective | (min) | (max) | (min) | (min) | (min) |
| ALTMIN | S1: fix $M_2$, get $P(y|x_i)$ and $\alpha_y$. S2: get $M_2$ by $\max q_{M_2}$. Repeat S1, S2. | Get $M_1$ by $\max [h_{M_1} - ha_{M_1}]$. No repeat. | Get $M_2$ by max …. No repeat. | S1: fix $M_2$, get $M_1$ by min [$h_{M_1}$ …]. S2: fix $M_1$, get $M_2$ by max …. Repeat S1, S2. | Get $M_1$ by min …. No repeat. |
| Existing equivalent models | 1. ML on mixtures & EM (Dem77). 2. Information geometry (Amari95). 3. MDL auto-encoder (Hin94). 4. Multi-sets modeling (Xu94,95). | Infomax, maximum mutual information (Lin89) (Ati90) (Bel95) | Duplicated models by ML learning on input data | Helmholtz machine (Hin95) (Day95) | Related to PCA |

New results:
1. For the H-f-W type, we have three VQ variants when $P_{M_2}(x|y)$ is Gaussian. Also, criteria for selecting the correct $k$ for VQ or clustering (Xu95a&c).
2. For the H-m-W type, we have robust PCA and a criterion for determining the subspace dimension (Xu, 95c).

Variants:
1. Smoother $P_{M_1}(x)$ given by the Parzen window estimate.
2. Factorial coding $P_{M_2}(y) = \prod_j P_{M_2}(y_j)$ with binary $y = [y_1, \ldots, y_m]$.
3. Factorial coding $P_{M_1}(y|x) = \prod_j P_{M_1}(y_j|x)$ with binary $y = [y_1, \ldots, y_m]$.
4. Replace '$\sum_y$' in all the above items by '$\int \cdot \, dy$' for real $y$.

Note: H-Husband, W-Wife, f-follows, M-Male, F-Female, m-matches. X-f-Y stands for
the X part being free. Single-X stands for the other part being broken. H-m-W stands for both parts
being parametric. '(min)' stands for $\min K_b$ and '(max)' stands for $\max K_b$.
Table 3: BKC-YY Machine for Unsupervised Learning (Part II): $K(M_2, M_1)$

Given data $\{x_i\}_{i=1}^N$, fix $P_{M_1}(x) = P_0(x)$ by eq. (1), and thus $K(M_2, M_1) = K_b + D$, with
$D$ irrelevant to $M_1, M_2$ and $K_b$ given by the following formulae and table:

… $P_{M_2}(x_i) \log P_{M_2}(x_i)$, $P_{M_2}(x) = \sum_y P_{M_2}(x|y) P_{M_2}(y)$, (E6)
… (E7)
… $\log P_{M_1}(y|x_i)$, … $\sum_{i,y}$ … $\log P_{M_1}(y|x_i)$, (E8)
… (E10)
(… given in Table 2)

Marriage status and Condition: the same as those in Table 2.

| Column | 1 | 2 | 3 | 4 | 5 | 6 |
| Objective | (max) | (min) | (min) | (max) | (min) | (min) |
| ALTMIN | Get $M_2$ by $\max h_{M_2}$. No repeat. | S1: fix …, get … by (E7). S2: update … by max $L$…. Repeat S1, S2. | S1: fix …, get … by (E2) in Table 2. S2: update … by max $L$…. Repeat S1, S2. | Get $M_2$ by max [$h$… $- q$…]. No repeat. | S1: fix $M_2$, get $M_1$ by min …. S2: fix $M_1$, get $M_2$ by min …. Repeat S1, S2. | Get $M_1$ by max …. No repeat. |
| Existing models | no, new | no, new | no, new | no, new | no, new | no, new |

Variants: similar to those in Table 2.
Table 4: BKC-YY Machine for Supervised Learning

Given data $\{x_i, y_i\}_{i=1}^N$, fix $P_{M_1}(x) = P_0(x)$ by eq. (1).

| Divergence | $K(M_1, M_2) = K_b + D$ | | | $K(M_2, M_1) = K_b + D$ | | |
| Marriage status | Single-M | Single-F | H-m-W | Single-M | Single-F | H-m-W |
| $K_b$ | $h_{M_1}$ | $-L$… | $h$… $- q$… | … | $h_{M_2}$ | … $- q$… |
| Objective | (max) | (min) | (min) | (min) | (max) | (min) |
| Feature | Minimum entropy (ME) F-net | ML B-net | Mixed F-B net | ML F-net | Minimum entropy B-net | Mixed B-F net |
| Existing models | no, new | BP on B-net | no, new | BP on F-net | no, new | no, new |
denoted by H-m-W-Max. This model is a dual of the Helmholtz machine in that it
focuses on getting the most discriminative or separable representations $P_{M_1}(y|x)$ in $Y$
instead of the best fitting of $P_{M_2}(x|y)$ on input data.
Third, by replacing $K(M_1, M_2)$ with $K(M_2, M_1)$, in Table 3 we can obtain new
models that are the counterparts of those given in Table 2. For the case H-f-W,
its $\max_{M_1,M_2}$ gives a minimum entropy estimate on $P_{M_2}(x)$ instead of the maximum
likelihood estimate on $P_{M_2}(x)$ in Table 2. For the case Single-M, it will function
similarly to the case Single-F in Table 2, but with minimum entropy on $P_{M_1}(y|x)$
in Table 2 replaced by maximum likelihood on $P_{M_1}(y|x)$ here. For the case H-m-W,
the focus shifts from getting the best fitting of $P_{M_2}(x|y)$ on input data to
getting the most discriminative representations $P_{M_1}(y|x)$ in $Y$, which is similar to the
just mentioned H-m-W-Max, but with minimum entropy on $P_{M_1}(y|x)$ replaced by
maximum likelihood on $P_{M_1}(y|x)$. The other two cases in Table 3 have also been
changed similarly from those in Table 2.
Fourth, several new models have also been proposed in Table 4 for supervised learning.
Instead of maximum likelihood, the new models suggest learning by minimum
entropy or by a mix of maximum likelihood and minimum entropy.
Finally, further studies on the other status in Table 1 are needed. Heuristically,
we can also treat the case H-m-W by two separate steps. We first get $M_1$ by
$\max[h_{M_1} - ha_{M_1}]$, and then get $M_2$ by $\max q_{M_{1,2}}$; or we first get $M_2$ by
$\min[h - ha - q_{M_2}]$ and then get $M_1$ by $\min[h_{M_1} - ha_{M_1} - q_{M_{1,2}}]$. The two algorithms
attempt to get both a good discriminative representation by $P_{M_1}(y|x)$ and a good
fitting of $P_{M_2}(x|y)$ on input data. However, whether they work well needs to be
tested experimentally.
We are currently conducting experiments to compare several of the above new
models against their existing counterparts.
Acknowledgements The work was supported by the HK RGC Earmarked Grant
CUHK250/9JE.
References 
Amari, S. (1995a) [Amari95], "Information geometry of the EM and em algorithms for neural networks",
Neural Networks 8, to appear.
Amari, S. (1995b), Neural Computation 7, pp13-18.
Atick, J. J. & Redlich, A. N. (1990) [Ati90], Neural Computation Vol.2, No.3, pp308-320.
Bell, A. J. & Sejnowski, T. J. (1995) [Bel95], Neural Computation Vol.7, No.6, pp1129-1159.
Byrne, W. (1992), IEEE Trans. Neural Networks 3, pp612-620.
Csiszar, I. (1975), Annals of Probability 3, pp146-158.
Dayan, P., Hinton, G. E., & Neal, R. N. (1995) [Day95], Neural Computation Vol.7, No.5, pp889-904.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977) [Dem77], J. Royal Statist. Society, B39, pp1-38.
Hathaway, R. J. (1986), Statistics & Probability Letters 4, pp53-56.
Hinton, G. E., et al. (1995) [Hin95], Science 268, pp1158-1160.
Hinton, G. E. & Zemel, R. S. (1994) [Hin94], Advances in NIPS 6, pp3-10.
Linsker, R. (1989) [Lin89], Advances in NIPS 1, pp186-194.
Neal, R. N. & Hinton, G. E. (1993), A new view of the EM algorithm that justifies incremental and other
variants, preprint.
Xu, L. (1996a), "How Many Clusters?: A YING-YANG Machine Based Theory For A Classical Open
Problem In Pattern Recognition", to appear in Proc. IEEE ICNN96.
Xu, L. (1995a), "YING-YANG Machine: a Bayesian-Kullback scheme for unified learnings and new
results on vector quantization", Keynote talk, Proc. Intl Conf. on Neural Information Processing
(ICONIP95), Oct 30 - Nov. 3, 1995, pp977-988.
Xu, L. (1995b), "YING-YANG Machine for Temporal Signals", Keynote talk, Proc. IEEE Intl Conf.
Neural Networks & Signal Processing, Vol.I, pp644-651, Nanjing, 10-13, 1995.
Xu, L. (1995c), "New Advances on The YING-YANG Machine", Invited paper, Proc. of 1995 Intl.
Symposium on Artificial Neural Networks, ppIS07-12, Dec. 18-20, Taiwan.
Xu, L. (1995d), "Cluster Number Selection, Adaptive EM Algorithms and Competitive Learnings",
Invited paper, Proc. Intl Conf. on Neural Information Processing (ICONIP95), Oct 30 - Nov. 3, 1995,
Vol.II, pp1499-1502.
Xu, L. (1995e), Invited paper, Proc. WCNN95, Vol.I, pp35-42. Also, Invited paper, Proc. IEEE ICNN
1994, ppI315-320.
Xu, L., & Jordan, M. I. (1993), Proc. of WCNN'93, Portland, OR, Vol. II, pp431-434.
