99 
Connectionist Learning of Expert Preferences by 
Comparison Training 
Gerald Tesauro 
IBM Thomas J. Watson Research Center 
PO Box 704, Yorktown lteights, NY 10598 USA 
Abstract 
A new training paradigm, called the "coznparison paradigm," is introduced 
for tasks in which a network must learn to choose a preferred pattern from a 
set of n, alternatives, based on examples of human expert preferences. In this 
paradigm, the input to the network consists of two of the n alternatives, and 
the trained outtrot is the expert's judgement of which i)attcrn is better. This 
paradigm is applied to the learning of backgammon, a difflcult board game in 
which the expert selects a move from a set of legal moves. With comparison 
training, much higher levels of performance can be achieved, wi{h networks 
that are much smaller, and with coding schemes that arc much simpler and 
easier to understand. Furthermore, it is possible to set up the network so 
that it always produces consistent rank-orderings. 
1. Introduction 
There is now widespread interest in the use of cmuectimist netw,rks f,r real- 
world practical problem solving. The principal areas of application which 
have been stndied so far involve relatively low-level signal processing and 
pattern recognition tasks. Howewe. r, connectionist networks night also be 
nscfil in higher-level tasks which are currently tackled by expert systens 
and knowledge engineering aptmaches [2]. In ibis paper, we consider problem 
domains in which the expert is given a set of n alternatives as input (n may be 
either small or large), and must select the m,t (lcsirahle r most preferable 
alternative. This type of task occurs repeatedly throughout the {loznains 
of politics, bnsiness, economics, medicine, and many others. Whether it is 
choosing a foreign-policy option, a weap(ms contractor, a c(mrse of treatment 
for a disease, or simply what to have for dinner, prblezns requiring choice 
are constantly being fa(:ed and solved by huznan experts. 
How might a learning system such as a, connectionist network be set up to 
learn to make such choices from human expert examples? The imnediately 
obvious approach is to train the network to produce a numerical output 
100 esauro 
"score" for ea.ch input alternative. To make a choice, then, one would haxe 
the network score ea.ch alternative, and select the alternative with the high- 
est score. Since the learning system learns from examples, it seems logical to 
train the network on a da. ta base of examples in which a human expert has 
entered a numerical score for each possible choice. However, there are two 
major problems with such an approach. First, in many domains in which n 
is la. rge, it would be tremendously time-consuming for the expert to create 
a da. ta base in which each individual alternative has been painstaking evalu- 
ated, even the vast number of obviously bad alternatives which are not even 
worth considering. (It is important for the network to see examples of bad 
alterna. tives, otherwise it would tend to produce high scores for everything.) 
More importantly, in ma.ny domains human experts do not think in terms of 
absolute scoring functions, ami it would thus be extremely difficult to create 
training data containing absolute scores, because such scoring is alien to the 
expert's way of thinking about the problem. Instead, the most natural way 
to make training data is simply to record the expert in action, i.e., for each 
problem situation, record each of the alternatives he had to choose from, and 
record which one he actually selected. 
For these reasons, we advocate teaching the network to compare pairs of 
alterna.tives, rather than scoring individual alternatives. In other words, the 
input should be two of the set of n alternatives, and the output should be a 
I or 0 depending on which of the two alternatives is better. From a set of 
recorded human expert preferences, one can then teach the network that the 
expcrt's choice is better than all other alternatives. 
One potential concern raised by this approach is that, in performance mode 
after the network is trained, it might be necessary to make n? comparisons to 
select the best alterna.tive, whereas mly n individual scores 3re needed in the 
other approa.ch. However, the network can select the best alterna. tive with 
only n comparisons by going through the list f Mterntiw;s in order, and 
comparing the current alternative with the best alternative seen so far. If 
the current alternative is better, it becomes the new best Mternativc, and if 
it is worse, it is rejected. Another potential concern is that a network which 
only knows how to compare might m,t produce a consistent rank-ordering, 
i.e., it might say that alternative a is better than b, b is better than c, and 
c is better than a, and then one does not know which Mterna.tivc to select. 
However, we shall see la.tcr that it is p,ssiblc t guarantee cmsistency with a 
constrained architecture which forces the network to (:,rate uI) with absolute 
numerical scores for individual alternatives. 
In the following, we shall examine the aI)plicatim of the c(anparison train- 
ing paradigm to the game of backgammon, as considerable experience has 
already been obtained in this domain. In previous papers [7,6], a network 
was described which learned to play backgammon froln an expert da, ta base, 
using the so-called "back-propagation" learning rule [5]. In that system, the 
network was trained to score individual moves. In other words, the input 
Connectionist Learning of Expert Preferences ! O1 
consists of a move (defined by the initial i)osition before the move and the 
final position after the move), and the tiesired output is a real number indi- 
cating the strength of the move. Henceforth we shall refer to this training 
paradigm as the "relative score" paradigm. While this approach produced 
considerable s,ccess, it had a number of serious limitations. We shall see 
that the comparison para.dign solves one of the m,,st important limitations 
of the previous approach, with the result that the overall perfornance of 
the network is much better, the number of c,nnecti,ms required is greatly 
reduced, and the network's input c,,ding scheme is much simpler and easier 
to understand. 
2. Previous backgammon networks 
In [7], a netwt,rk was described which learned t,, play fairly good backgam- 
mon by back-propagation learning {,f a large expert training set, using the 
rela, tive score paradigm described previously. After training, the network 
was tested both by measuring its perf,rmance on a test set of p,,sitions not 
used in training, and by actual game play against humans and conventlena.1 
conputer prograins. The best netw,,rk was able to defeat Snn Microsys- 
tems' Gammontool, the best available connercial pr(,gram, by a substa,ntial 
ma. rgin, but it was still far from human expert-level performance. 
The ba.sic conclusion of [7] was that it was possible to achieve decent levels 
of performance by this network learning procedure, but it, was not an easy 
matter, and required substantial human intervenlim. The choice of a coding 
scheme for the input inft,rma. tion, f,,r example, was found t,, be an extremely 
intportant issue. The best coding schemes contained a. great deal of domain- 
specific informa.tion. The best cnc,,ding (f the "raw" board infornation was 
in terms of concepts tha, t human experts use to describe local transitions, such 
as "slotting," "stripping," etc.. Als(,, a few "pro-computed features" were 
required in addition to the raw b{mrd informati(m. Thus it was necessary 
to be a. domain expert in order to design a suitable netw(,rk coding schene, 
and it seemed tha. t the only way to discover the best (:,,ting scheme was 
by pa.instaking trial and err,r. This was somewhat disapp,finting, as it was 
hoped that the network learning pr,,(:edure w(mhl aut,mati(:ally produce an 
expert backgamnon network with little {r no huntan eff,,r. 
3. Coxnparison paradigm netsvork set-up 
In the standard t)ractice of ba.(:k-i)rt)t)agation , a c,)mparis,)n paradigm net- 
work would have an input layer, one (,r more layers of hid(len units, and a.n 
output layer, with full connectivity between ad. jat'ent layers. The input layer 
would represent, two final board p,sitions a and b, and the (utt)ut layer would 
have just a. single unit to represent which b,m.rd psition was better. The 
10 Tesauro 
teacher signal for the output unit would be a 1 if board position a was better 
than b, and a 0 if b was better than a. 
The proposed comparison paradigm network would overcome the limitation 
of only being a.ble to consider individual moves in isolation, without knowl- 
edge of witat other alternatives are available. In a.dditi(n, the sophisticated 
coding scheme that was developed to encode transition infofinn tion would not 
be needed, since comparisons could be based solely on the final board states. 
The comparison a,pproach offers greater sensitivity in distinguishing between 
close alternatives, a,nd as stated previously, it corresp,nds more closely to 
the actual form of human expert knowledge. 
These advantages are formidable, but there are some important problems 
with the approach as currently described. One technical problem is that the 
learning is significantly slower. This is because 2n c(mtparisons per training 
position are presented to the network, where n ,- 20, whereas in the relative 
score approach, only about 3-4 ntovcs per position w(mld bc presented. It 
was therefore necessary to develop a. number of technical tricks to increase 
the speed of the simulator (:ode for this specific application (to be described 
in a future publication). 
A more fundamental problown with the approach, however, is the issue of con- 
sistency of network comparisons. Two properties are required fi)r complete 
consistency: (1) The comparison between any two positions must be unam- 
biguous, i.e., if the network says that, a is better than b when a is presented 
on the left and b on the right, it had better say that a is better tha,n b if 
a is on the right and b is on the left. One can show that this requires the 
network's output to exactly invert whenever the input hmrd positions are 
swapped. (2) The comparisons must be transitire, as alluded to previ(msly, 
i.e., if a is judged better than b, and b is judged better than c, the network 
had better judge a to be better than c. 
Standa.rd unconstrained networks have no guarantee f satisfying either of 
these properties. After some thought, however, one realizes that the output 
inversion symmetry can be enforced by a. symmetry relati{in amongst the 
weights in the network, and that the transitivity and rank-,,rdcr consistency 
can be guaranteed by scparability in tit(; architecture, as ilhlstratcd in Figure 
1. tiere we see that this network really consists (,f two half-nctw,,rks, one of 
which is only concerned with the evaluation {fbard p,siti, m , and the other 
of which is concerned only with the evahlati{m of board p{sition b. (Due to 
the indicated symmetry relation, fine needs mly st{rc one half-network in the 
simulator code.) Each half-network may have one or m(re layers of hidden 
units, but as long as they are not cross-coupled, the evaluation of each of 
the two input board positions is boiled down to a single real number. Since 
real numbers always rank-order consistently, the nctw(rk's comparisons are 
always consistent. 
Connectionist Learning of Expert Preferences !03 
final position (a) final position (b) 
Figure 1: A network dcslgn for comparison tral,ing with guaranteed consis- 
tency of comparison. Weight groups have symmctry relation W 1 = W 2 
and W 8 = -W4, which ensures lhat the output cxactly inrcrts upon swap- 
ping positions in the input array. Separation of the hiddenunits condenes 
the evaluation of each final board position into a single real numbcr, thus 
ensuring transitlvity. 
An important added benefit of this s(:hcne is that an M)s(,lute board oval- 
nation function is obtained in each half-network. This means that the net- 
work, to the extent that its cvaluati,m fun(:timt is ac(:uratc, has an intrinsic 
understanding of a. give. n positi)n, as opp(sed to merely being able to (le- 
toct features which correspond to go,,d moves. As has 1)ten emphasized by 
Berliner [1], an intrinsic understanding ,,f the positi,,n is ('ru(:ial for play at 
the highest levels, and for use of the d,mbling cube. Thus, this approach 
can serve as the basis for future pr,,grcss, whereas tit(. previous approach of 
scoring novcs was (loomed eventually t,, run into a dead end. 
4. Results of comparison training 
The training procedure fir the c{mtparism paradigm nctw{,rk was as fdlows: 
Networks were set up with 289 input units which enc,dc a description of sin- 
gle final boa.rd position, varying numbers ,f hidden units, and a single output 
unit. The training data was taken fr{,m a set f 400 games in which the au- 
thor played both sides. This data set c{mtains a recording of the author's 
preferred move for each position, and no other comncnts. The. engaged posi- 
tions in the data set were selected {mr (disengaged racing p,sitions were not 
studied) and divided into five categrics: bcarff, bearin, {qq)onent bearoff, 
opponent bearin, and a default (:ategory cvcring everything else. In each 
104 Tesauro 
Type of RSP net CP net 
test set (651-12-1) (289-1)__. 
bearoff .82 .83 
bearin .54 .60 
opp. bearoff .56 .54 
ot)p. bearin .60 .66 
other .58 .65 
'Fable l: Performance or nets or indicated size on respective test sets, as mea- 
sured by fraction or positions for which net agrees with human expert choice 
of best move. RSP: re}ative score paradigm, CId: comparison paradigm. 
category, 200 positions chosen a.t random were set aside t{ be used as test- 
ing data; the remaining data. (about 1000 positions in each category except 
the default category, fir which aberot 4000 positions were used) was used to 
train networks which specialized in each category. The learning algorithm 
used was standard back-propagati(m with m(mtentum and without weight 
decay. 
Performance after training is summarized in Tables 1 and 2. Table 1 gives 
the performance of each specialist network (n the appr(qriate set of test 
positions. Results fir the cmnparison paradigm netw{rk are shown fir net- 
works without hidden units, be{:ause it was ftmnd that the addition of hidden 
units did not improve the perfirmance. (This is discussed in the fillowing 
section.) We contrast these results with results of training networks in the 
relative score paradigm on the same training data sets. We see in Table 1 
that for the bear{ff and (q)i)onent bearoff specialists, there is only a snall 
change in performance under the c,mtparison paradigm. F(r the bearin and 
opponent bearin specialists, there is an intprovement in perfirnance of a.bout 
6 percentage points in each case. F{r this particular applicant(m, this is a. very 
substantial inprovement in Icrfirmancc. Howerer, the most important fin{l- 
ing is for the default category, which is much larger and m,re difficult than 
any of the specialist categories. The default network's perfiJrmance is the 
key factor in determining the system's {verall gantt I)crfi,rman{'e. With com- 
parison training, we find an intt)rovcment in perfrman('e fr{,m 58% to 65%. 
Given the size and difficulty of this categ{ry, his can {rely bc described as a 
huge intprovemcnt in pcrfirmance, and is all the nre remarkable when one 
{:tinstilers that the conparison paradigm net has only 300 weights, as qq>osed 
to 8000 weights fir the relative sc,re paradigm net. 
Next, a comt)ined game-playing system was set up using the five specialist 
nets for all engaged positions. (The Gamntontool cvaluati(nt functin was 
called for racing positions.) Results are given in Table 2. Against Gammon- 
tool itself, the pcrfornancc under the {:,mparis(,n paradigm inproves from 
59% to 64%. Against the a.uthr (and teacher), the perriermanet improves 
from an estimated 35% (since the RSP nets are so big and slow, accurate 
Connectionist Learning of Expert Preferences ! 05 
Opponent RSP nets CP nets 
Gammontool 
Tesauro 
.59 (500 games) 
.35 (100 games) 
.64 (2000 games) 
.42 (400 games) 
Table 2: Game-playing pcrformancc of composltc nctwork systems against 
Gammontool and against the author, as measured by fraction of games won, 
without counting gammons or backgammons. 
statistics could not be obtained) to about 42%. 
Qualitatively, one notices a substantial overall improvement in the new net- 
work's level of play. But what is most striking is the nctwork's u'(,rst case be- 
havior. The previous relative-score network had particularly bad worst-case 
behavior: about once every other game, the network would na.kc an atro- 
cious blunder which would seriously jeopardize its chances (,f winning that 
ga,me [6]. An alarming fraction of these blunders were seemingly random and 
could not be logically explained. The new comparison paradigm network's 
worst-case behavior is vastly int)rovcd in this regard. The frequency and 
severity of its nista.kcs are significantly reduce. d, t),t thOrO importantly, its 
mista.kcs are understandable. (Some of the improvement in this respect may 
be due to the elimination of the noisy teacher signal described in [7].) 
5. Conclusions 
;Ve have seen that, in the domain (,f ba.ckgamtnon, the introduction of the 
comparison training paradigm has resulted in networks which perform much 
better, with vastly reduced numbers ,,f weights, and witit input coding schemes 
that are much simpler and easier to understand. It was surprising that such 
high performance could be obtained in "pcrccptron" networks, i.e., networks 
without hidden units. This re,hinds us that one should not sunmarily dis- 
miss perceptrons as uninteresting or unworthy of study because they are only 
capable of learning linea.rly separable functions [3]. A substantial component 
of many difficult real-world problems nay lie in the linearly separable spec- 
trum, and thus it na,kes sense to try pcrceptr,,ns at least as a. first attempt. 
It was also surprising that the use ,f hidden nnits in the c{,mparison-traincd 
networks docs not improve the perfrmance. Titis is unexplained, a.nd is the 
subject of current research. It is, h,,wevcr, not with,ntt precedent: in at least 
one other real-world application [4], it has been f,und that networks with 
hidden units do not perform any better than netw{,rks without hidden units. 
More generally, one might conclude that, in training a. nc,,ral network 
indeed any learning system) from human expert examples in a conplcx do- 
main, there should be a good match between the natural f,,rm of the expert's 
knowledge and the method by which he network is trained. For domains in 
which the expert must select a preferred alternative. from a set of alternatives, 
106 Tesauro 
the expert naturally thinks in terms of comparisons amongst the top few al- 
ternatives, and the comparison paradigm proposed hcrc takes advantage of 
that fact. It would be possible in principle to train a network using absolute 
evahlations, but the creation of such a. training set might be tot) difficult to 
undertake on a large scale. 
If the above discnssion is correct, then the cotnparison paradigm stumld be 
useful in other applications involving expert choice, and in other learning 
systems besides connectionist networks. Typically expert systems are hand- 
crafted by knowledge engineers, rather than learned from human expert ex- 
amples; however, there has recently been some interest in supervised learning 
approaches. It will be interesting to see if the conparis{m paradign proves to 
be usefid when supervised learning procedures are applied to other domains 
involving expert choice. In using the comparison paradign, it will be impel 
tant to have some way to guarantee that the system's comparisons will be 
unambignons and transitive. For feed-forward netw{rks, it was shown in this 
paper how to guarantee this using synmetric, separated networks; it should 
be possible to iinpose similar constraints on other learning systems to enforce 
consistency. 
References 
[l] 
}t. Berliner, "On the construction of evaluation fu,ctlons for large domains," 
Proc. o IJCA 1 (197,t) 53--55. 
[2] S. I. Gallant, onnectmnmt expert systems," Comm. A(JM 31, 152-169 
(1988). 
[6] 
[7] 
M. Minsky and S. Paperr, Perccptrons, MIT Press, Cambridge MA (1969). 
N. Qian and T. J. Scjnowskl, "Prcdictlng the secondary st ru(:ture of globular 
proteins using neural network models," .1. Jh)l. Biol. 202, 865 88,1 (1988). 
D. E. Rumelart and J. 1, McClelland (cds.), l'aallcl l)istritmtcd l'roccssing: 
Exploration intim Microstr,cl, ure o1' Cognitio,, V,ls. I and 2, MIT Press, 
Cronbridge MA (1986). 
G. Tesauro, "Neural network defeats (:real. or in backgammon match." Univ. 
of Illinois, Center for Complex Syslems Te(:hnical lieport (;(JSR-88-6 (1988). 
G. Tesauro and T. J. Sejnowski, "A parallel nctwork that learns to play 
backgammon," Artificial lntcllige,cc, in press (1989). 
