The "Softmax" Nonlinearity: 
Derivation Using Statistical Mechanics 
and Useful Properties 
as a Multiterminal Analog Circuit 
Element 
I. M. Elfadel 
Research Laboratory of Electronics 
Massachusetts Institute of Technology 
Cambridge, MA 02139 
J. L. Wyatt, Jr. 
Research Laboratory of Electronics 
Massachusetts Institute of Technology 
Cambridge, MA 02139 
Abstract 
We use mean-field theory methods from Statistical Mechanics to 
derive the "softmax" nonlinearity from the discontinuous winner- 
take-all (WTA) mapping. We give two simple ways of implementing 
"softmax" as a multiterminal network element. One of these has a 
number of important network-theoretic properties. It is a recipro- 
cal, passive, incrementally passive, nonlinear, resistive multitermi- 
nal element with a content function having the form of information- 
theoretic entropy. These properties should enable one to use this 
element in nonlinear RC networks with such other reciprocal elements
as resistive fuses and constraint boxes to implement very
high speed analog optimization algorithms using a minimum of 
hardware. 
1 Introduction 
In order to efficiently implement nonlinear optimization algorithms in analog VLSI 
hardware, maximum use should be made of the natural properties of the silicon 
medium. Reciprocal circuit elements facilitate such an implementation since they 
can be combined with other reciprocal elements to form an analog network having 
Lyapunov-like functions: the network content or co-content. In this paper, we show 
a reciprocal implementation of the "softmax" nonlinearity that is usually used to 
enforce local competition between neurons [Peterson, 1989]. We show that the cir- 
cuit is passive and incrementally passive, and we explicitly compute its content and 
co-content functions. This circuit adds a new element to the library of the analog 
circuit designer that can be combined with reciprocal constraint boxes [Harris, 1988] 
and nonlinear resistive fuses [Harris, 1989] to form fast, analog VLSI optimization 
networks. 
2 Derivation of the Softmax Nonlinearity 
To a vector $y \in \mathbb{R}^n$ of distinct real numbers, the discrete winner-take-all (WTA)
mapping $W$ assigns a vector of binary numbers by giving the value 1 to the
component of $y$ corresponding to $\max_{1 \le i \le n} y_i$ and the value 0 to the remaining
components. Formally, $W$ is defined as
$$W(y) = (W_1(y), \ldots, W_n(y))^T$$
where for every $1 \le j \le n$,
$$W_j(y) = \begin{cases} 1 & \text{if } y_j > y_i \text{ for all } i \ne j,\ 1 \le i \le n \\ 0 & \text{otherwise} \end{cases} \tag{1}$$
Following [Geiger, 1991], we assign to the vector $y$ the "energy" function
$$E_y(z) = -\sum_{k=1}^{n} z_k y_k = -z^T y, \quad z \in V_n \tag{2}$$
where $V_n$ is the set of vertices of the unit simplex
$S_n = \{x \in \mathbb{R}^n : x_i \ge 0,\ 1 \le i \le n, \text{ and } \sum_{k=1}^{n} x_k = 1\}$.
Every vertex in the simplex encodes one possible winner. It
is then easy to show that $W(y)$ is the solution to the linear programming problem
$$\max_{z \in S_n} \sum_{k=1}^{n} z_k y_k.$$
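The link between the WTA mapping and the energy function can be illustrated with a minimal Python sketch. The function names `wta`, `simplex_vertices`, and `energy`, as well as the sample vector, are illustrative assumptions and do not come from the original derivation.

```python
def wta(y):
    """Winner-take-all: 1 at the index of the (unique) largest entry, 0 elsewhere."""
    winner = max(range(len(y)), key=lambda j: y[j])
    return [1 if j == winner else 0 for j in range(len(y))]


def simplex_vertices(n):
    """The n vertices of the unit simplex, i.e. the standard basis vectors."""
    return [[1 if k == j else 0 for k in range(n)] for j in range(n)]


def energy(z, y):
    """E_y(z) = -z^T y, following Eq. (2)."""
    return -sum(zk * yk for zk, yk in zip(z, y))


# Hypothetical input with distinct components.
y = [0.3, -1.2, 2.5, 0.9]

# A linear objective over the simplex attains its maximum at a vertex, so
# searching the vertices is enough; minimizing E_y is maximizing z^T y.
best_vertex = min(simplex_vertices(len(y)), key=lambda z: energy(z, y))

print(wta(y))
print(best_vertex == wta(y))
```

The brute-force search over vertices is only for illustration; in practice the winner is found directly by locating the largest component.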
Moreover, we can assign to the energy $E_y(z)$ the Gibbs distribution
$$P_y(z) = P_y(z_1, \ldots, z_n) = \frac{e^{-E_y(z)/T}}{Z_T}$$
where $T$ is the temperature of the heat bath and $Z_T$ is a normalizing constant.
Then one can show that the mean of $z$ considered as a random variable is given
by [Geiger, 1991]
$$\bar{z}_j = \frac{e^{y_j/T}}{Z_T} = \frac{e^{y_j/T}}{\sum_{i=1}^{n} e^{y_i/T}} \triangleq F_j(y/T).$$
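The mean-field step can be checked numerically. The sketch below enumerates the simplex vertices, weights each by its Gibbs factor, and compares the resulting mean with the closed-form softmax expression. The helper names `softmax` and `gibbs_mean` and the sample values of `y` and `T` are assumptions made for illustration.

```python
import math


def softmax(y, T=1.0):
    """F_j(y/T) = exp(y_j/T) / sum_i exp(y_i/T), computed with a max shift."""
    m = max(y)
    w = [math.exp((yj - m) / T) for yj in y]
    total = sum(w)
    return [wj / total for wj in w]


def gibbs_mean(y, T=1.0):
    """Mean of z under P_y(z) proportional to exp(-E_y(z)/T), z over simplex vertices."""
    n = len(y)
    vertices = [[1.0 if k == j else 0.0 for k in range(n)] for j in range(n)]
    energies = [-sum(zk * yk for zk, yk in zip(z, y)) for z in vertices]
    e_min = min(energies)  # a common energy shift leaves the distribution unchanged
    weights = [math.exp(-(e - e_min) / T) for e in energies]
    Z = sum(weights)  # normalizing constant Z_T, up to the common shift
    return [sum(w * z[j] for w, z in zip(weights, vertices)) / Z for j in range(n)]


y = [0.3, -1.2, 2.5, 0.9]  # hypothetical input
T = 0.7                    # hypothetical temperature
mean = gibbs_mean(y, T)

# The Gibbs mean and the softmax output agree component by component.
assert all(abs(a - b) < 1e-12 for a, b in zip(mean, softmax(y, T)))
print(mean)
```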
The mapping $F: \mathbb{R}^n \to \mathbb{R}^n$ whose components are the $F_j$'s, $1 \le j \le n$, is the
generalized sigmoid mapping [Peterson, 1989] or "softmax". It plays, in WTA networks,
a role similar to that of the sigmoidal function in Hopfield and backpropagation
networks [Hopfield, 1984, Rumelhart, 1986] and is usually used for enforcing
competitive behavior among the neurons of a single cluster in networks of interacting
clusters [Peterson, 1989, Waugh, 1993].
Figure 1: A circuit implementation of softmax with 5 inputs and 5 outputs. This
circuit is operated in subthreshold mode, takes the gate voltages as inputs, and
gives the drain currents as outputs. This circuit is not a reciprocal multiterminal
element.
For $y \in \mathbb{R}^n$, we denote $F_T(y) \triangleq F(y/T)$. The softmax mapping satisfies the
following properties:
1. The mapping $F_T$ converges pointwise to $W$ on the domain of $W$ as $T \to 0$ and to the
center of mass of $S_n$, $\frac{1}{n} e = \frac{1}{n}(1, 1, \ldots, 1)^T \in \mathbb{R}^n$, as $T \to +\infty$.
2. The Jacobian $DF$ of the softmax mapping is a symmetric $n \times n$ matrix that
satisfies
$$DF(y) = \mathrm{diag}(F_k(y)) - F(y)F(y)^T. \tag{3}$$
It is always singular with the vector $e$ being the only eigenvector corresponding
to the zero eigenvalue. Moreover, all its eigenvalues are upper-bounded
by $\max_{1 \le k \le n} F_k(y) < 1$.
3. The softmax mapping is a gradient map, i.e., there exists a "potential"
function $P: \mathbb{R}^n \to \mathbb{R}$ such that $F = \nabla P$. Moreover, $P$ is convex.
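The first two properties lend themselves to a quick numerical spot check. The sketch below is not a proof: it tests the temperature limits, compares the Jacobian formula of Eq. (3) against central finite differences, and probes the eigenvalue bound through the quadratic form on random vectors. The names `softmax` and `jacobian`, the sample input, and the tolerances are assumptions chosen for illustration.

```python
import math
import random


def softmax(y, T=1.0):
    m = max(y)
    w = [math.exp((yj - m) / T) for yj in y]
    total = sum(w)
    return [wj / total for wj in w]


def jacobian(y):
    """DF(y) = diag(F_k(y)) - F(y) F(y)^T, following Eq. (3)."""
    F = softmax(y)
    n = len(y)
    return [[(F[j] if j == k else 0.0) - F[j] * F[k] for k in range(n)]
            for j in range(n)]


y = [0.5, -1.0, 2.0]  # hypothetical input with distinct components
n = len(y)

# Property 1: low temperature approaches WTA, high temperature approaches (1/n) e.
cold = softmax(y, T=0.01)
hot = softmax(y, T=1e6)
assert cold.index(max(cold)) == 2 and max(cold) > 1 - 1e-9
assert all(abs(p - 1.0 / n) < 1e-5 for p in hot)

# Property 2: compare Eq. (3) with central finite differences.
J = jacobian(y)
h = 1e-6
for k in range(n):
    y_plus = y[:]
    y_plus[k] += h
    y_minus = y[:]
    y_minus[k] -= h
    column = [(a - b) / (2 * h) for a, b in zip(softmax(y_plus), softmax(y_minus))]
    assert all(abs(column[j] - J[j][k]) < 1e-8 for j in range(n))

# Symmetry, and e = (1, ..., 1)^T in the null space (every row sums to zero).
assert all(abs(J[j][k] - J[k][j]) < 1e-12 for j in range(n) for k in range(n))
assert all(abs(sum(row)) < 1e-12 for row in J)

# Eigenvalue bound, probed through the quadratic form x^T DF x on random vectors.
random.seed(0)
F_max = max(softmax(y))
for _ in range(100):
    x = [random.uniform(-1, 1) for _ in range(n)]
    quad = sum(x[j] * J[j][k] * x[k] for j in range(n) for k in range(n))
    assert -1e-12 <= quad <= F_max * sum(xi * xi for xi in x) + 1e-12
```

The symmetry of the Jacobian is what later makes the circuit element reciprocal, so this check is worth keeping in mind when reading Section 3.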
The symbol $P$ was chosen to indicate that it is a potential function. It should be
noted that if $F$ is the gradient map of $P$, then $F_T$ is the gradient map of $T P_T$,
where $P_T(y) = P(y/T)$. In a related paper [Elfadel, 1993], we have found that
the convexity of P is essential in the study of the global dynamics of analog WTA 
networks. Another instance where the convexity of P was found important is the 
one reported in [Kosowsky, 1991] where a mean-field algorithm was proposed to 
solve the linear assignment problem. 
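The potential is not written out explicitly in this section. A standard choice, assumed here, is the log-sum-exp function $P(y) = \ln \sum_k e^{y_k}$, which has the same form as the co-content function stated in Theorem 1. Under that assumption, the following sketch checks by finite differences that the gradient of $T P_T$ is $F_T$, and tests midpoint convexity of $P$ on random pairs. All function names and numerical values are illustrative.

```python
import math
import random


def softmax(y, T=1.0):
    m = max(y)
    w = [math.exp((yj - m) / T) for yj in y]
    total = sum(w)
    return [wj / total for wj in w]


def potential(y):
    """Assumed potential P(y) = ln sum_k exp(y_k) (log-sum-exp), with a max shift."""
    m = max(y)
    return m + math.log(sum(math.exp(yk - m) for yk in y))


def scaled_potential(y, T):
    """T * P_T(y), where P_T(y) = P(y/T)."""
    return T * potential([yk / T for yk in y])


def numerical_gradient(f, y, h=1e-6):
    """Central finite-difference gradient of a scalar function f at y."""
    grad = []
    for k in range(len(y)):
        y_plus = y[:]
        y_plus[k] += h
        y_minus = y[:]
        y_minus[k] -= h
        grad.append((f(y_plus) - f(y_minus)) / (2 * h))
    return grad


y = [0.2, -0.7, 1.1]  # hypothetical input
T = 0.5               # hypothetical temperature

# The gradient of T * P_T should match F_T(y) = F(y/T).
grad = numerical_gradient(lambda u: scaled_potential(u, T), y)
assert all(abs(g - f) < 1e-7 for g, f in zip(grad, softmax(y, T)))

# Midpoint convexity of P on random pairs (a necessary condition for convexity).
random.seed(1)
for _ in range(200):
    a = [random.uniform(-3, 3) for _ in range(3)]
    b = [random.uniform(-3, 3) for _ in range(3)]
    mid = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    assert potential(mid) <= (potential(a) + potential(b)) / 2 + 1e-12
```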
Figure 2: Modified circuit implementation of softmax. In this circuit, all the
transistors are diode-connected, and all the drain currents are well within the
saturation region. Note that for every transistor, both the voltage input and the
current output are on the same wire: the drain. This circuit is a reciprocal
multiterminal element.
3 Circuit Implementations and Properties 
Now we propose two simple CMOS circuit implementations of the generalized sigmoid
mapping. See Figures 1 and 2. When the transistors are operated in the
subthreshold region, the drain currents $i_1, \ldots, i_n$ are the outputs of a softmax
mapping whose inputs are the gate voltages $v_1, \ldots, v_n$. The explicit $v$-$i$ characteristics
are given by
$$i_m = I_c \frac{\exp(\kappa v_m / V_0)}{\sum_{r=1}^{n} \exp(\kappa v_r / V_0)}, \tag{4}$$
where $\kappa$ is a process-dependent parameter and $V_0$ is the thermal voltage
([Mead, 1989],p. 36). These circuits have the interesting properties of being un- 
clocked and parallel. Moreover, the competition constraint is imposed naturally 
through the KCL equation and the control current source. From a complexity 
point of view, this circuit is most striking since it computes n exponentials, n ratios,
and n - 1 sums in one time constant! A derivation similar to the above was
independently carried out in [Waugh, 1993] for the circuit of Figure 1. Although 
the first circuit implements softmax, it has two shortcomings. The first is practical: 
the separation between inputs and outputs implies additional wiring. The second 
is theoretical: this circuit is not a reciprocal multiterminal element, and therefore 
it can't be combined with other reciprocal elements like resistive fuses or constraint 
boxes to design analog, reciprocal optimization networks. 
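The v-i characteristic of Eq. (4) can be simulated in a few lines. In the sketch below, `I_c` is read as the control current mentioned above, and the numerical values of `I_c`, `kappa`, `V_0`, and the gate voltages are hypothetical placeholders rather than values from the paper. The checks illustrate two consequences of the formula: the currents sum to the control current (the KCL constraint), and a common offset added to all gate voltages leaves the currents unchanged.

```python
import math


def drain_currents(v, I_c, kappa, V_0):
    """i_m = I_c * exp(kappa v_m / V_0) / sum_r exp(kappa v_r / V_0), per Eq. (4)."""
    x = [kappa * vm / V_0 for vm in v]
    m = max(x)  # max shift for numerical stability; it cancels in the ratio
    w = [math.exp(xm - m) for xm in x]
    total = sum(w)
    return [I_c * wm / total for wm in w]


# Hypothetical illustrative values, not taken from the paper.
I_c = 1e-9     # control current, in amperes
kappa = 0.7    # process-dependent parameter
V_0 = 0.025    # thermal voltage, in volts
v = [0.50, 0.52, 0.55, 0.51, 0.49]  # gate voltages, in volts

i = drain_currents(v, I_c, kappa, V_0)

# KCL: the drain currents add up to the control current.
assert abs(sum(i) / I_c - 1.0) < 1e-12

# A common-mode offset on all inputs does not change the outputs.
shifted = drain_currents([vm + 0.1 for vm in v], I_c, kappa, V_0)
assert all(abs(a - b) < 1e-9 * I_c for a, b in zip(i, shifted))

# The largest gate voltage draws the largest share of the current.
assert i.index(max(i)) == v.index(max(v))
print(i)
```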
Therefore, we only consider the circuit of Figure 2 and let $v$ and $i$ be the
$n$-dimensional vectors representing the input voltages and the output currents,
respectively.¹ The softmax mapping $i = F(v)$ represents a voltage-controlled, nonlinear,
resistive multiterminal element. The main result of our paper is the following:²
Theorem 1 The softmax multiterminal element $F$ is reciprocal, passive, locally
passive and has a co-content function given by
$$\Phi(v) = \frac{I_c V_0}{\kappa} \ln \sum_{m=1}^{n} \exp(\kappa v_m / V_0) \tag{5}$$
and a content function given by
$$\Phi^*(i) = \frac{I_c V_0}{\kappa} \sum_{m=1}^{n} \frac{i_m}{I_c} \ln \frac{i_m}{I_c}. \tag{6}$$
¹ Compare with Lazzaro et al.'s WTA circuit [Lazzaro, 1989], whose inputs are currents
and whose outputs are voltages.
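Two parts of Theorem 1 can be spot-checked numerically, assuming the co-content $\Phi(v) = \frac{I_c V_0}{\kappa} \ln \sum_m \exp(\kappa v_m / V_0)$ and the content $\Phi^*(i) = \frac{I_c V_0}{\kappa} \sum_m \frac{i_m}{I_c} \ln \frac{i_m}{I_c}$. The sketch below checks that the gradient of the co-content reproduces the drain currents, and that content and co-content add up to $v^T i$ at an operating point, as expected of a Legendre pair. Function names and numerical values are hypothetical.

```python
import math


def drain_currents(v, I_c, kappa, V_0):
    x = [kappa * vm / V_0 for vm in v]
    m = max(x)
    w = [math.exp(xm - m) for xm in x]
    total = sum(w)
    return [I_c * wm / total for wm in w]


def co_content(v, I_c, kappa, V_0):
    """Phi(v) = (I_c V_0 / kappa) ln sum_m exp(kappa v_m / V_0)."""
    x = [kappa * vm / V_0 for vm in v]
    m = max(x)
    return (I_c * V_0 / kappa) * (m + math.log(sum(math.exp(xm - m) for xm in x)))


def content(i, I_c, kappa, V_0):
    """Phi*(i) = (I_c V_0 / kappa) sum_m (i_m / I_c) ln(i_m / I_c), for i_m > 0."""
    return (I_c * V_0 / kappa) * sum((im / I_c) * math.log(im / I_c) for im in i)


# Hypothetical illustrative values, not taken from the paper.
I_c, kappa, V_0 = 1e-9, 0.7, 0.025
v = [0.50, 0.52, 0.55, 0.51, 0.49]
i = drain_currents(v, I_c, kappa, V_0)

# The gradient of the co-content (central differences) matches the currents.
h = 1e-6
for m in range(len(v)):
    v_plus = v[:]
    v_plus[m] += h
    v_minus = v[:]
    v_minus[m] -= h
    slope = (co_content(v_plus, I_c, kappa, V_0)
             - co_content(v_minus, I_c, kappa, V_0)) / (2 * h)
    assert abs(slope - i[m]) < 1e-6 * I_c

# Content plus co-content equals v^T i at the operating point.
power = sum(vm * im for vm, im in zip(v, i))
total = co_content(v, I_c, kappa, V_0) + content(i, I_c, kappa, V_0)
assert abs(total - power) < 1e-9 * abs(power)
```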
Thus, with this reciprocal, locally passive implementation of the softmax mapping, 
we have added a new circuit element to the library of the circuit designer. Note 
that this circuit element implements in an analog way the constraint $\sum_{k=1}^{n} y_k = 1$
defining the unit simplex $S_n$. Therefore, it can be considered a nonlinear constraint
box [Harris, 1988] that can be used in reciprocal networks to implement analog 
optimization algorithms. 
The expression of the content function $\Phi^*$ is a strong reminder of the information-theoretic definition of
entropy. We suggest the name "entropic resistor" for the circuit of Figure 2. 
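The entropy analogy can be made concrete. Assuming the content function $\Phi^*(i) = \frac{I_c V_0}{\kappa} \sum_m \frac{i_m}{I_c} \ln \frac{i_m}{I_c}$ and writing $p_m = i_m / I_c$, the content equals $-\frac{I_c V_0}{\kappa} H(p)$, where $H$ is the Shannon entropy in nats. The sketch below checks this on a few current vectors. The names, the numerical values, and the convention $0 \ln 0 = 0$ are assumptions; a current that is exactly zero is a limiting case that the circuit only approaches.

```python
import math


def shannon_entropy(p):
    """H(p) = -sum_m p_m ln p_m in nats, with the convention 0 ln 0 = 0."""
    return -sum(pm * math.log(pm) for pm in p if pm > 0)


def content(i, I_c, kappa, V_0):
    """Phi*(i) = (I_c V_0 / kappa) sum_m (i_m / I_c) ln(i_m / I_c), skipping zeros."""
    return (I_c * V_0 / kappa) * sum(
        (im / I_c) * math.log(im / I_c) for im in i if im > 0)


# Hypothetical illustrative values, not taken from the paper.
I_c, kappa, V_0 = 1e-9, 0.7, 0.025
scale = I_c * V_0 / kappa
n = 5

uniform = [I_c / n] * n                   # current shared equally
one_hot = [I_c, 0.0, 0.0, 0.0, 0.0]       # limiting case: a single winner
mixed = [0.5 * I_c, 0.3 * I_c, 0.2 * I_c, 0.0, 0.0]

for i in (uniform, one_hot, mixed):
    p = [im / I_c for im in i]
    # Content is the negative entropy of the normalized currents, scaled.
    assert abs(content(i, I_c, kappa, V_0) + scale * shannon_entropy(p)) < 1e-12 * scale

# Equal sharing gives the most negative content, -scale * ln(n).
assert abs(content(uniform, I_c, kappa, V_0) + scale * math.log(n)) < 1e-12 * scale
# A single winner gives zero content.
assert content(one_hot, I_c, kappa, V_0) == 0.0
```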
4 Conclusions 
In this paper, we have discussed another instance of convergence between the sta- 
tistical physics paradigm of Gibbs distributions and analog circuit implementation 
in the context of the winner-take-all function. The problem of using the simple, 
reciprocal circuit implementation of softmax to design analog networks for find- 
ing near optimal solutions of the linear assignment problem [Kosowsky, 1991] or 
the quadratic assignment problem [Simic, 1991] is still open and should prove a 
challenging task for analog circuit designers. 
Acknowledgements 
I. M. Elfadel would like to thank Alan Yuille for many helpful discussions and Fred
Waugh for helpful discussions and for communicating the preprint of [Waugh, 1993]. 
This work was supported by the National Science Foundation under Grant No. 
MIP-91-17724. 
² The concepts of reciprocity, passivity, content, and co-content are fundamental to
nonlinear circuit theory. They are carefully developed in [Wyatt, 1992].
References
[Peterson, 1989] C. Peterson and B. Söderberg. A new method for mapping optimization
problems onto neural networks. International Journal of
Neural Systems, 1(1):3-22, 1989.
[Harris, 1988] J.G. Harris. Solving early vision problems with VLSI constraint
networks. In Neural Architectures for Computer Vision Workshop,
AAAI-88, Minneapolis, MN, 1988.
[Harris, 1989] J.G. Harris, C. Koch, J. Luo, and J. Wyatt. Resistive fuses:
Analog hardware for detecting discontinuities in early vision. In
C. Mead and M. Ismail, editors, Analog VLSI Implementation of
Neural Systems. Kluwer Academic Publishers, 1989.
[Geiger, 1991] D. Geiger and A. Yuille. A common framework for image segmen- 
tation. Int. J. Computer Vision, 6:227- 253, 1991. 
[Hopfield, 1984] J.J. Hopfield. Neurons with graded responses have collective com- 
putational properties like those of two-state neurons. Proc. Nat'l 
Acad. Sci., USA, 81:3088-3092, 1984. 
[Rumelhart, 1986] D. E. Rumelhart et al. Parallel Distributed Processing,
volume 1. MIT Press, 1986.
[Waugh, 1993] F.R. Waugh and R. M. Westervelt. Analog neural networks with
local competition. I. Dynamics and stability. Physical Review E,
1993. In press.
[Elfadel, 1993] I.M. Elfadel. Global dynamics of winner-take-all networks. In
SPIE Proceedings, Stochastic and Neural Methods in Image Processing,
volume 2032, pages 127-137, San Diego, CA, 1993.
[Kosowsky, 1991] J. J. Kosowsky and A. L. Yuille. The invisible hand algorithm: 
Solving the assignment problem with statistical physics. TR 
91-1, Harvard Robotics Laboratory, 1991. 
[Mead, 1989] Carver Mead. Analog VLSI and Neural Systems. Addison-Wesley,
1989.
[Lazzaro, 1989] J. Lazzaro, S. Ryckebusch, M. Mahowald, and C. Mead.
Winner-take-all circuits of O(n) complexity. In D. S. Touretzky, editor,
Advances in Neural Information Processing Systems 1, pages 703-711.
Morgan Kaufmann, 1989.
[Wyatt, 1992] J.L. Wyatt. Lectures on Nonlinear Circuit Theory. MIT VLSI 
memo # 92-685, 1992. 
[Simic, 1991] P.D. Simic. Constrained nets for graph matching and other 
quadratic assignment problems. Neural Computation, 3:169 - 281, 
1991. 
