Learning How To Teach 
or 
Selecting Minimal Surface Data 
Davi Geiger 
Siemens Corporate Research, Inc 
755 College Rd. East 
Princeton, NJ 08540 
USA 
Ricardo A. Marques Pereira 
Dipartimento di Informatica 
Universita di Trento 
Via Inama 7, Trento, TN 38100 
ITALY 
Abstract 
Learning a map from an input set to an output set is similar to the prob- 
lem of reconstructing hypersurfaces from sparse data (Poggio and Girosi, 
1990). In this framework, we discuss the problem of automatically select- 
ing "minimal" surface data. The objective is to be able to approximately 
reconstruct the surface from the selected sparse data. We show that this 
problem is equivalent to the one of compressing information by data re- 
moval and the one of learning how to teach. Our key step is to introduce a 
process that statistically selects the data according to the model. During 
the process of data selection (learning how to teach) our system (teacher) 
is capable of predicting the new surface, the approximated one provided 
by the selected data. We concentrate on piecewise smooth surfaces, e.g. 
images, and use mean field techniques to obtain a deterministic network 
that is shown to compress image data. 
I Learning and surface reconstruction 
Given a dense input data that represents a hypersurface, how could we automatically 
select very few data points such as to be able to use these fewer data points (sparse 
data) to approximately reconstruct the hypersurface ? 
We will be using the term surface to refer to hypersurface (surface in multidimen- 
364 
Learning How to Teach or Selecting Minimal Surface Data 365 
sions) throughout the paper. 
It has been shown (Poggio and Girosi, 1990) that the problem of reconstructing a 
surface from sparse and noisy data is equivalent to the problem of learning from 
examples. ['or instance, to learn how to add numbers can be cast as finding the 
map from X = {pair of numbers} to F = {sum} from a set of noisy examples. The 
surface is F(X) and the sparse and noisy data are the set of N examples {(xi, di)}, 
where i = 0, 1, ..., N and xi = (ai, bi)  X, such that ai q- bi - di q- Ill (Ili being the 
noise term). Some a priori information about the surface, e.g. the smoothness one, 
is necessary for reconstruction. 
Consider a set of N input-output examples, {(xi,di)}, and a form [[ Pf [[2 for 
the cost of the deviation of f, the approximated surface, from smoothness. P is a 
differential operator and I1' [[ is a norm (usually L2). To find the surface f, that 
best fits (i) the data and (ii) the smoothness criteria, is to solve the problem of 
minimizing the functional 
N-1 
v(f) -- 2 q- II Pf II 2 
i=0 
Different methods of solving the function can yield different types of network. In 
particular using the Green's method gives supervised backprop type of networks 
(Poggio and Girosi, 1990) and using optimization techniques (like gradient descent) 
we obtain unsupervised (with feedback) type of networks. 
2 Learning how to teach arithmetic operations 
The problem of learning how to add and multiply is a simple one and yet provide 
insights to our approach of selecting the minimum set of examples. 
Learning arithmetic operations The surface given by the addition of two num- 
bers, namely f(x, y) = x + y, is a plane passing through the origin. The multipli- 
cation surface, f(x, y) = xy, is hyperbolic. The a priori knowledge of the addition 
and multiplication surface can be expressed as a minimum of the functional 
v(f) - II V2f(x,Y) ]{ dxdy 
oo 
where 
02 02 
V2f(, Y) = (a7  + 3-y )f(, Y) 
Other functions also minimize V(f), like fix, y) = x 2 - y2, and so a few examples 
are necessary to learn how to add and multiply given the above prior knowledge. If 
the prior assumption consider a larger class of basis functions, then more examples 
will be required. Given p input-output examples, {(x, y); d}, the learning problem 
of adding and multiplying can be cast as the optimization of 
366 Geiger and Pereira 
II V2f(x,Y) {[ dxdy 
We now consider the problem of selecting the examples from the full surface data. 
A sparse process for selecting data Let us assume that the full set of data 
is given. in a 2-Dimensional lattice. So we have a finite amount of data (N 2 data 
points), with the input-output set being {(xi, yj); dij), where i,j = 0, 1, ..., N-1. To 
select p examples we introduce a sparse process that selects out data by modifying 
the cost function according to 
oo 
i,j=O 
N-1 
II V=f(x,Y) II +,x(p- (1-sij))  
i,j=0 
where $ij - 1 selects out the data and we have added the last term to assure that 
p examples are selected. The data term forces noisy data to be thrown out first, 
the second order smoothness of f reduces the need for many examples (p w, 10) to 
learn these arithmetic operations. Learning s is equivalent to learn how to select 
the examples, or to learn how to teach. The system (teacher) has to learn a set of 
examples (sparse data) that contains all the "relevant" information. The redundant 
information can be "filled in" by the prior knowledge. Once the teacher has learned 
these selected examples, he, she or it (machine) presents them to the student that 
with the a priori knowledge about surfaces is able to approximately learn the full 
input-output map (surface). 
3 Teaching piecewise smooth surfaces 
We first briefly introduce the weak membrane model, a coupled Markov random 
field for modeling piecewise smooth surfaces. Then we lay down the framework for 
learning to teach this surface. 
3.1 Weak membrane model 
Within the Bayes approach the a priori knowledge that surfaces are smooth (first 
order smoothness) but not at the discontinuities has been analyzed by (Geman and 
Geman, 1984) (Blake and Zisserman, 1987) (Mumford and Shah, 1985) (Geiger and 
Girosi, 1991). If we consider the noise to be white Gaussian, the final posterior 
probability becomes P(f,I[g)= e -lvO',l), where 
V(f,I) = Y.[(fij - gij)  + I  II V'f Ili5 (1-Iij)+?ijlij], 
i,j 
(1) 
We represented surfaces by fij at pixel (i, j), and discontinuities by Iij. The input. 
data is gij, [I Vf [[ij is the norm of the gradient at pixel (i,j). Z is a normalization 
Learning How to Teach or Selecting Minimal Surface Data 367 
constant, known as the partition function. fi is a global parameter of the model and 
is inspired on thermodynamics, and/ and ?ij are parameters to be estimated. This 
model, when used for image segmentation, has been shown to give a good pattern 
of discontinuities and eliminate the noise. Thus, suggesting that the piecewise 
assumption is valid for images. 
3.2 Redundant data 
We have assumed the surface to be smooth and therefore there is redundant infor- 
mation within smooth regions. We then propose a model that selects the "relevant" 
information according to two criteria 
1. Discontinuity data: Discontinuities usually capture relevant information, 
and it is possible to roughly approximate surfaces just using edge data (see Geiger 
and Pereira, 1990). A limitation of just using edge data is that an oversmoothed 
surface is represented. 
2. Texture data: Data points that have significant gradients (not enough to be 
a discontinuity) are here considered texture data. Keeping texture data allows us 
to distinguish between fiat surfaces, as for example a clean sky in an image, and 
texture surfaces, as for example the leaves in the tree (see figure 2). 
3.3 The sparse process 
Again, our proposal is first to extend the weak membrane model by including an 
additional binary field - the sparse process s- that is 1 when data is selected out 
and 0 otherwise. There are natural connections between the process s and robust 
statistics (Huber, 1988) as discussed in (Geiger and Yuille, 1990) and (Geiger and 
rereira, 1991). We modify (1) by considering (see also Geiger and rereira, 1990) 
i,j 
where we have introduced the term ?]ij$ij to keep some data otherwise $ij - 1 
everywhere. If the data term is too large, the process s - i can suppress it. We 
will now assume that the data is noise-free, or that the noise has already been 
smoothed out. We then want to find which data points (s = 0) are necessary to 
keep to reconstruct f. 
3.4 Mean field equations and unsupervised networks 
To impose the discontinuity data constraint we use the hard constraint technique 
(Geiger and Yuille, 1990 and its references). We do not allow states that throw 
out data (sij = 1) at the edge location (lij = 1). More precisely, within the 
statistical framework we reduce the possible states for the processes s and I to 
$ijlij ---- O. Therefore, excluding the state ($ij -- 1, lij . 1). Applying the saddle 
point approximation, a well known mean field technique (Geiger and Girosi, 1989 
and its references), on the field f, we can compute the partition function 
368 
Geiger and Pereira 
where f maximizes Z. After applying mean field techniques we obtain the following 
equations for the processes I and s 
[ij - e-/b"J+(LJ-a'J)]/Zij , s-/j - e-[llVfll''+nJ]/Zi (4) 
and, using the definition II vf 117- fi+x,J) 2 + (fi+x,j+x- fij)2, the mean 
field self consistent equation (Geiger and Peteira, 1991) becomes 
(1 -- gij)(fij -- gij) 
-p{Kij(1 - [i) + Ki_,j_(1 - [i-j-) + 
Mi_,j(1 - [i-,j) + Mi,j_x(1 - 
(5) 
where Kij = (fi+j+ - fi,j) 2 and Mij = (fi+j - fi,j+x) 2. The set of coupled 
equations (5) (4) can be mapped to an unsupervised network, we call a minimal 
surface representation network (MSRN), and can efficiently be solved in a massively 
parallel machine. Notice that sij + lij _> 1, because of the hard constraint, and in 
the limit of fi -+ oo the processes s and l becomes either 0 or 1. In order to throw 
away redundant (smooth) data keeping some of the texture we adapt the cost 
according to the gradient of the surface. More precisely, we set 
rlij = r/[(Aig) 2 + (A}'jg) 2] 
(6) 
where (Aig)2 = (gi+j --gi-,j) 2 and (Abg)2 = (gij+ - gij-x) 2. The smoother 
is the data the lower is the cost to discard the data (sij = 1). In the limit of r/-+ 0 
only edge data (l O = 1) is kept, since from (4) lim,_os = 1 - lj. 
3.5 Learning how to teach and the approximated surface 
With the mean field equations we compute the approximated surface f simulta- 
neously to s and to l. Thus, while learning the process s (the selected data) the 
system also predict the approximated surface f that the student will learn from 
the selected examples. By changing the parameters, say p and r/, the teacher can 
choose the optimal parameters such as to select less data and preserve the quality 
of the approximated surface. Once s has been learned the system only feeds the 
selected data points to the learner machinery. We actually relax the condition and 
feed the learner with the selected data and the corresponding discontinuity map (l). 
Notice that in the limit of r/-. 0 the selected data points are coincident with the 
discontinuities (l = 1). 
Learning How to Teach or Selecting Minimal Surface Data 369 
4 Results: Image compression 
We show the results of the algorithm to learn the minimal representation of images. 
The algorithm is capable of image compression and one advantage over the cosine 
transform (traditional method) is that it does not have the problem of breaking the 
images into blocks. However, a more careful comparison is needed. 
4.1 Learning s, f, and I 
To analyze the quality of the surface approximation, we show in figure 1 the per- 
formance of the network as we vary the threshold /. We first show a face image 
and the line process and then the predicted approximated surfaces together with 
the correspondent sparse process s. 
4.2 Reconstruction, Generalization or "The student performance" 
We can now test how the student learns from the selected examples, or how good is 
the surface reconstruction from the selected data. We reconstruct the approximate 
surfaces by running (5) again, but with the selected surface data points (si5 = 0) 
and the discontinuities (li = 1) given from the previous step. We show in figure 2f 
that indeed we obtain the predicted surfaces (the student has learned). 
References: 
E. B. Baum and Y. Lyuu. 1991. The trastsition to perfect generalization in perceptrons, 
Neural Computation, vol.3, no.3. pp.386-401. 
A. Blake and A. Zisserman. 1987. Visual Reconstruction, MIT Press, Cambridge, Mass. 
D. Geiger and F. Girosi. 1989. Coupled Markov random fields and mean field theory, 
Advances in Neural Information Processing Systems 2, Morgan Kaufmann, D. Touretzky. 
D. Geiger and A. Yuille. 1991. A common framework for image segmentation, Int. Jour. 
Comp. Vis.,vol.6:3, pp. 227-243. 
D. Geiger and F. Girosi. 1991. Parallel and deterministic algorithms for MRFs: surface 
reconstruction, PAMI, May 1991, vol. PAMI-13, 5, pp.401-412 . 
D. Geiger and R. M. Perefta. 1991. The outlier process, IEEE Workshop on Neural 
Networks for signal Processing, Princeton, NJ. 
S. Geman and D. Geman. 1984. Stochastic Relaxation, Gibbs Distributions, and the 
Bayesian Restoration of Images,PAMI, vol. PAMI-6, pp.721-741K. 
J.J. Hopfield. 1984. Neural networks and physical systems with emergent collective com- 
putational abilities, Proc. Nat. Acad. Sci.,79 , pp. 2554-2558. 
P.J. Huber. 1981. Robust Statistics, John Wiley and Sons, New York. 
D. Mumford and J. Shah. 1985. Boundary detection by minimizing functionMs, I , Proc. 
IEEE Conf. on Computer Vision  Pattern Recognition, San Francisco, CA . 
T. Poggio and F. Girosi. 1990. Regularization algorithms for learning that are equivalent 
to multilayer network, Science,vol-247, pp. 978-982. 
D. E. Rumelhart, G. Hinton and R. J. Willians. 1986. Learning internal representations 
by error backpropagation. Nature, 323, 533. 
370 Geiger and Pereira 
Go 
eo 
.[. .,;?. .  ... ,.... : "," 
-"  .l I '1'  , 
 ,,.{, .r .. ,..: . 
'1_'i } '.' .'r -  .'r'., TM ',TL I : 
 Tr ,'_r.i..".i  ',?,.':"- , I, 
, . '- i',&;". :._j 
..-,lye'.,  F,.J... /.. '.,.! 
   -' - .. 4A. .' I 
 . . ,, ,J ,., ,. .... 
',r , -4.--F.'--  .. '1. ' '.: ' '. I 
...., ';:: ,,..-',, -L,,iJ 
.', '..:'. -- -' li:'.:' 
 - . :-.% ?.. ---- .... :;.'.- . 
 ' '. >'. ,'...;.':' r,.i:b .' 
 ..-m .." , ', !E....._,. .l'b'.r;. 
-' Yg' '- .' ':*11. ;" "  ' 
d. ' - "" ' "if:-' .... 
.1" .' .:    
I  'j " . --- 
', ". q ,5 - "'. , :;l 
.. . . ; .  
 , ' o .' . . : :,. .   . k . - .  . 
.e......... , . , . ... 15' I 
.'e,...,. 
Figure 1: (a) 8-bit image of 128 X 128 pixels. (b) The edge map for/ = 1.0, 
?ij - 100.0. After 200 iterations and final fi = 25  cw (c) the approximated image 
for y -- 0.01, 7j -- 1.0 and r/ - 0.0009. (d) the corresponding sparse process (e) 
approximated image y = 0.01, ?ij --- 1.0 and r] = 0.0001. (f) the corresponding 
sparse process. 
Learning How to Teach or Selecting Minimal Surface Data 371 
Co 
do 
Figure 2: (a) 8-bit image of 256 X 256 pizels. (b) The final image aer data has 
been selected out. Final values for the parameters: F -- 0.02, ?ij -- 2.8, ? - 0.005, 
final  -- 25  cw, 00 iterations. (c) The sparse process, where black dots represent 
sij - 0 (data kept). (d) The initial selected data from (a) using the process s (c) 
(e) The reconstruction of the image after Z iterations. (f) After 80 iterations. As 
predicted, it is very close to (b). 
