Risk Sensitive Reinforcement Learning 
Ralph Neuneier 
Siemens AG, Corporate Technology 
D-81730 Mtinchen, Germany 
Ralph. Neuneier@mchp. siemens. de 
Oliver Mihatsch 
Siemens AG, Corporate Technology 
D-81730 Mtinchen, Germany 
O1 iver. Mihats ch@mchp. siemens. de 
Abstract 
As already known, the expected return of a policy in Markov Deci- 
sion Problems is not always the most suitable optimality criterion. For 
many applications control strategies have to meet various constraints like 
avoiding very bad states (risk-avoiding) or generating high profit within 
a short time (risk-seeking) although this might probably cause significant 
costs. We propose a modified Q-learning algorithm which uses a single 
continuous parameter  E [-1, 1] to determine in which sense the re- 
sulting policy is optimal. For  = 0, the policy is optimal with respect 
to the usual expected return criterion, while   1 generates a solution 
which is optimal in worst case. Analogous, the closer  is to - 1 the more 
risk seeking the policy becomes. In contrast to other related approaches 
in the field of MDPs we do not have to transform the cost model or to 
increase the state space in order to take risk into account. Our new ap- 
proach is evaluated by computing optimal investment strategies for an 
artificial stock market. 
1 WHY IT SOMETIMES PAYS TO ACT CAUTIOUSLY 
Reinforcement learning (RL) deals with the computation of favorable control policies in 
sequential decision task. Its theoretical framework of Markov Decision Problems (MDPs) 
evaluates and compares policies by their expected (sometimes discounted or averaged) sum 
of the immediate returns or costs per time step (Bertsekas & Tsitsiklis, 1996). But there are 
numerous applications which require a more sophisticated control scheme: e.g. a policy 
should take into account that bad outcomes or states may be possible even if they are very 
rare because they are so disastrous, that they should be certainly avoided. 
An obvious example is the field of finance where the main question is how to invest re- 
sources among various opportunities (e.g. assets like stocks, bonds, etc.) to achieve re- 
markable returns while simultaneously controlling the risk exposure of the investments 
due to changing markets or economic conditions. Many traders try to achieve this by a 
Markovitz-like portfolio management which distributes capital according to return and risk 
1032 R. Neuneier and O. Mihatsch 
estimates of the assets. A new approach using reinforcement learning techniques which 
additionally integrates trading costs and other market imperfections has been proposed in 
Neuneier, 1998. Here, these algorithms are naturally extended such that an explicit risk 
control is now possible. The investor can decide how much risk she/he is willing to accept 
and then compute an optimal risk-averse investment strategy. Similar trade-off scenarios 
can be formulated in robotics, traffic control and further application areas. 
The fact that the popular expected value criterion is not always suitable has been already 
known in the field of AI (Koenig & Simmons, 1994), control theory and reinforcement 
learning (Heger, 1994 and Szepesvri, 1997). Several techniques have been proposed to 
handle this problem. The most obvious way is to transform the sum of returns Y]t rt using 
an appropriate utility function U which reflects the desired properties of the solution. Un- 
fortunately, interesting nonlinear utility functions incorporating the variance of the return, 
such as U( t rt) = t rt -- /(t rt -- E(t re)) 2, lead to non-Markovian decision 
problems. The popular class of exponential utility functions U( t rt) -- exp( t rt) 
preserves the Markov property but requires time dependent policies even for discounted 
infinite horizon MDPs. Furthermore, it is not possible to formulate a corresponding model- 
free learning algorithm. A further alternative changes the state space model by including 
past returns as an additional state element at the cost of a higher dimensionality of the 
MDP. Furthermore, it is not always clear in which way the states should be augmented. 
One may also transform the cost model, i.e. by punishing large losses stronger than mi- 
nor costs. While requiring a significant amount of prior knowledge, this also increases the 
complexity of the MDP. 
In contrast to these approaches we modify the popular Q-learning algorithm by introducing 
a control parameter which determines in which sense the resulting policy is optimal. Intu- 
itively and loosely speaking, our algorithm simulates the learning behavior of an optimistic 
(pessimistic) person by overweighting (underweighting) experiences which are more pos- 
itive (negative) than expected. This main idea will be made more precise in section 2 and 
mathematically thoroughly analyzed in section 3. Using artificial data, we demonstrate 
some properties of the new algorithm by constructing an optimal risk-avoiding investment 
strategy (section 4). 
2 RISK SENSITIVE Q-LEARNING 
For brevity we restrict ourselves to the subclass of infinite horizon discounted Markov deci- 
sion problems (MDP). Furthermore, we assume the immediate rewards being deterministic 
functions of the current state and control action. Let $ = {1,..., n} be the finite state 
space and U be the finite action space. Transition probabilities and immediate rewards are 
denoted by Pij (u) and gi (u), respectively. 7 denotes the discount factor. Let II be the set 
of all deterministic policies mapping states to control actions. 
A commonly used objective is to learn a policy r that 
maximizes (i,u) :=gi(u)+ E Tkgi(r(ik)) (1) 
quantifying the expected reward if one executes control action u in state i and follows 
the policy r thereafter. It is a well-known result that the optimal Q-values *(i, u) := 
mazcn  (i, u) satisfy the following optimality equation 
*(i u)=gi(u)+7EPij(u)max*(j,u' ) ViE$,uEU. (2) 
 u'EU 
jss 
Any policy  with 7(i) = argmaxusu *(i, u) is optimal with respect to the expected 
reward criterion. 
Risk Sensitive Reinforcement Learning 1033 
The Q-function  averages over the outcome of all possible trajectories (series of states) 
of the Markov process generated by following the policy -. However, the outcome of a 
specific realization of the Markov process may deviate significantly from this mean value. 
The expected reward criterion does not consider any risk, although the cases where the 
discounted reward falls considerably below the mean value is of a living interest for many 
applications. Therefore, depending on the application at hand the expected reward ap- 
proach is not always appropriate. Alternatively, Heger (1994) and Littman & Szepesvfiri 
(1996) present a performance criterion that exclusively focuses on risk avoiding policies: 
maximize(_Q(i,u):=gi(u)+ inf {7kgik(r(ik))}). (3) 
i1,2,... 
p(il,i2,...)>O 
The Q-function Q (i, u) denotes the worst possible outcome if one executes control action 
u in state i and follows the policy - thereafter. The corresponding optimality equation for 
Q__*(i,u) :-- maxcn Q__ (i, u) is given by 
Q*(i,u): gi(u) + 7 min maxQ*(j,u'). (4) 
-- aS u'CU-- 
pj(u)>O 
Any policy _ satisfying (i) = arg max,eu Q_Q_* (i, u) is optimal with respect to this mini- 
mal reward criterion. In most real world applications this approach is too restrictive because 
it takes very rare events (that in practice never happen) fully into account. This usually leads 
to policies with a lower average performance than the application requires. An investment 
manager, for instance, which acts with respect to this very pessimistic objective function 
will not invest at all. 
To handle the trade-off between a sufficient average performance and a risk avoiding (risk 
seeking) behavior, we propose a family of new optimality equations parameterized by a 
meta-parameter  (-1 < t < 1): 
O--pij(u)X'(gi(u)+7maxQ,(j,u')-Q,(i u)) Vi6$,u6U (5) 
u' 6U 
where ,I:'" (x) :-- (1 -  sign(x))x. (In the next section we will show that a unique solution 
Q,, of the above equation (5) exists.) Obviously, for  -- 0 we recover equation (2), 
the optimality equation for the expected reward criterion. If we choose  to be positive 
(0 <  < 1) then we overweight negative temporal differences 
gi(u) + 7 max Q(j, u') - Q(i, u) < 0 (6) 
u'EU 
with respect to positive ones. Loosely speaking, we overweight transitions to states where 
the future return is lower than the average one. On the other hand, we underweight transi- 
tions to states that promise a higher return than in the average. Thus, an agent that behaves 
according to the policy -,,(i) := argmax,u Q(i, u) is risk avoiding if t > 0. In the 
limit t --+ 1 the policy -,, approaches the optimal worst-case policy , as we will show 
in the following section. (To get an intuition about this, the reader may easily check that 
the optimal worst-case Q-value Q* fulfills the modified optimality equation (5) for  --- 1.) 
Similarly, the policy ',, becomes--risk seeking if we choose  to be negative. 
It is straightforward to formulate a risk sensitive Q-learning algorithm that bases on the 
modified optimality equation (5). Let OR(i, u; w) be a parametric approximation of the 
Q-function Q(i, u). The states and actions encountered at time step k during simulation 
are denoted by ik and u respectively. At each time step apply the following update rule: 
d  = gik(u)+7maxO,(i+,u';w ()) -O,(i,u;w()), 
u' EU 
= w + ), (7) 
1034 R. Neuneier and O. Mihatsch 
where a? ) denotes a stepsize sequence. The following section analyzes the properties of 
the new optimality equations and the corresponding Q-learning algorithm. 
3 PROPERTIES OF THE RISK SENSITIVE Q-FUNCTION 
Due to space limitations we are not able to give detailed proofs of our results. Instead, we 
focus on interpreting their practical consequences. The proofs will be published elsewhere. 
Before formulating the mathematical results, we introduce some notation to make the ex- 
position more concise. Using an arbitrary stepsize 0 < o < 1, we define the value iteration 
operator corresponding to our modified optimality equation (5) as 
Ta,,[Q](i, u):- Q(i, u) + a E Po(u)A' (9i(u) + 7 max Q(j, u') - Q(i, u)). 
u'U 
(8) 
The operator Ta,, acts on the space of Q-functions. For every Q-function Q and every 
state-action pair (i, u) we define N[Q](i, u) to be the set of all successor states j for 
which max,u Q(j, u') attains its minimum: 
N[Q](i,u) := {j e Slpij(u ) > 0and maxQ(j,u') = 
u' EU 
min maxQ(j' u')}. (9) 
Pi5' (U) >0 
Let p, [Q] (i, u) :-- YjeN [C2I(i,)Pij (u) be the probability of transitions to such successor 
states. 
We have the following lemma ensuring the contraction property of 
Lemma 1 (Contraction Property) Let [QI = maxi$,,eu Q(i, u) and 0 < a < 1, 0 < 
7 < 1. Then 
- _< (1 - o(1 - l1)(1 - 7))IQ - Q21 VQ1, Q2. (10) 
The operator Ta, is contracting, becauseO < (1 - a(1 -Il)(X - )) < x. 
The lemma has several important consequences. 
1. The risk sensitive optimality equation (5), i.e. Ta,, [Q] = Q has a unique solution 
for all-l<< 1. 
2. The value iteration procedure Qnew :-- Ta,[Q] converges towards Q,. 
3. The existing convergence results for traditional Q-learning (Bertsekas & Tsitsiklis 
1997, Tsitsiklis & Van Roy 1997) remain also valid in the risk sensitive case   0. 
Particularly, risk sensitive Q-learning (7) converges with probability one in the case 
of lookup table representations as well as in the case of optimal stopping problems 
combined with linear representations. 
4. The speed of convergence for both, risk sensitive value iteration and Q-learning be- 
comes worse if Itel --, 1. We can remedy this to some extent if we increase the stepsize 
o appropriately. 
Let -, be a greedy policy with respect to the unique solution Q, of our modified optimality 
equation; that is ',(i) = argmaxu Q(i,u). The following theorem examines the 
performance of -, for the risk avoiding case  _> 0. It gives us a feeling about the expected 
outcome r and the worst possible outcome Q' of policy ', for different values of 
The theorem clarifies the limiting behavior of -, if  --, 1. 
Risk Sensitive Reinforcement Learning 1035 
Theorem 2 Let 0 < n < 1. The following inequalities hold componentwise, i.e. for each 
pair (i, u) E S x U. 
7 (0' - Q*) (11) 
0<_*- <2n 1_7 -- 
0 < p,[Q,](Q* - Q) < (1 - :) _ (, _ Q,) (12) 
- - - - 7 - 
Moreover, lim ' -- * and lim Q' = Q*. 
--0 -+1 -- -- 
The difference * - Q* between the optimal expected reward and the optimal worst case 
reward is crucial in the above inequalities. It measures the amount of risk being inherent in 
our MDP at hand. Besides the value of n, this quantity essentially influences the difference 
between the performance of the policy -, and the optimal performance with respect to 
both, the expected reward and the worst case criterion. The second inequality (12) states 
that the performance of policy -, in the worst case sense tends to the optimal worst case 
performance if n  1. The "speed of convergence" is influenced by the quantity p[Q], 
i.e. the probability that a worst case transition really occurs. (Note that p, [Q,] is bounded 
from below.) A higher probability p, [Q,] of worst case transitions implies a stronger risk 
avoiding attitude of the policy ',. 
4 EXPERIMENTS: RISK-AVERSE INVESTMENT DECISIONS 
Our algorithm is now tested on the task of constructing an optimal investment policy for an 
artificial stock price analogous to the empirical analysis in Neuneier, 1998. The task, illus- 
trated as a MDP in fig. 1, is to decide at each time step (e.g. each day or after each mayor 
event on the market) whether to buy the stock and therefore speculating on increasing stock 
prices or to keep the capital in cash which avoids potential losses due to decreasing stock 
prices. 
disturbancie 
investments 
financial market 
return 
investor 
rates, prices 
Figure 1. The Markov Decision Problem: 
xt = (h, 
state: market St 
and portfolio Kt 
policy u, actions 
transition probabilities 
return function 
10 
50 
50 200 250 300 
time 
Figure 2. A realization of the ar- 
tificial stock price for 300 time 
steps. It is obvious that the 
price follows an increasing trend 
but with higher values a sud- 
den drop to low values becomes 
more and more probable. 
It is assumed, that the investor is not able to influence the market by the investment de- 
cisions. This leads to a MDP with some of the state elements being uncontrollable and 
results in two computationally import implications: first, one can simulate the investments 
by historical data without investing (and potentially losing) real money. Second, one can 
formulate a very efficient (memory saving) and more robust Q-learning algorithms. Due to 
space restriction we skip a detailed description of these algorithms and refer the interested 
reader to Neuneier, 1998. 
1036 R. Neuneier and O. Mihatsch 
The artificial stock price is in the range of [1, 2]. The transition probabilities are chosen 
such that the stock market simulates a situation where the price follows an increasing trend 
but with higher values a drop to very low values becomes more and more probable (fig. 2). 
The state vector consists of the current stock price and the current investment, i.e. the 
amount of money invested in stocks or cash. Changing the investment from cash to stocks 
results in some transaction costs consisting of variable and fixed terms. These costs are 
essential to define the investment problem as a MDP because they couple the actions made 
at different time steps. Otherwise we could solve the problem by a pure prediction of the 
next stock price. The function which quantifies the immediate return for each time step is 
defined as follows: if the capital is invested in cash, then there is nothing to earn even if 
the stock price increases, if the inve'stor has bought stocks the return is equal the relative 
change of the stock price weighted by the invested amount of capital minus the transaction 
costs which apply if one changed from cash to stocks. 
m tok I $tok Fkqe $ 
In cash 2 
cap?t 1.5 
in stocks 1 .rock price $ 
Figure 3. Left: Risk neu- 
tral policy,  = 0. Right: 
A small bias of  = 0.3 
against risk changes the po- 
licy if one is not invested 
(transaction costs apply in 
this case). 
o 2 2 
capital 1.5 calera t 'v.5 
In stocks I stock price $ Jn stocks I ltOCk price $ 
stocks] stocks, 
2 2 
in stocks 1 stock price $ m stocks 1 tock price $ 
Figure 4. Left:  -- 0.5 
yields a stronger risk averse 
attitude. Right: With  = 
0.8 the policy becomes also 
more cautious if already in- 
vested in stocks. 
Figure 5. Left:  = 0.9 
leads to a policy which in- 
vests in stocks in only 5 
cases. Right: The worst 
case solution never invests 
because there is always a 
positive probability for de- 
creasing stock prices. 
As a reinforcement learning method, Q-learning has to interact with the environment (here 
the stock market) to learn optimal investment behavior. Thus, a training set of 2000 data 
points is generated. The training phase is divided into epochs which consists of as many 
trials as data in the training set exist. At every trial the algorithm selects randomly a stock 
price from the data set, chooses a random investment state and updates the tabulated Q- 
values according to the procedure given in Neuneier, 1998. The only difference of our new 
risk averse Q-learning is that negative experiences, i.e. smaller returns than in the mean, 
are overweighted in comparison to positive experiences using the -factor of eq. (7). Using 
different  values from 0 (recovering the original Q-learning procedure) to 1 (leading to 
worst case Q-learning) we plot the resulting policies as mappings from the state space to 
control actions in figures 3 to 5. Obviously, with increasing  the investor acts more and 
more cautiously because there are less states associated with an investment decision for 
stocks. In the extreme case of  - 1, there is no stock investment at all in order to avoid 
any loss. The policy is not useful in practice. This supports our introductory comments that 
worst case Q-learning is not appropriate in many tasks. 
Risk Sensitive Reinforcement Learning 1037 
0.8 
 0.7[ 
+ 0.6 
,.Q 
 0.5 
0.4 
0.3 
0.2 
0.1 
-0.1 ! 
QQ-plot of the distributions 
-0.2 0.7 0.8 
-0.1 0 0.1 0.2 0.3 0.4 0.5 016 
quantiles of the classical approach: --0 
Figure 6. The quantiles of the dis- 
tributions of the discounted sum of 
returns for  = 0.2 (o) and  = 0.4 
(+) are plotted against the quan- 
tiles for the classical risk neutral 
approach  = 0. The distributions 
only differ significantly for nega- 
tive accumulated returns (left tail of 
the distributions). 
For further analysis, we specify a risky start state io for which a sudden drop of the stock 
price in the near future is very probable. Starting at io we compute the cumulated dis- 
counted rewards of 10000 different trajectories following the policies 'o, 'o.2 and '0.4 
which have been generated using : -- 0 (risk neutral),  - 0.2 and t - 0.4. The resulting 
three data sets are compared using a quantile-quantile plot whose purpose is to determine 
whether the samples come from the same distribution type. If they do so, the plot will be 
linear. Fig. 6 clearly shows that for higher -values the left tail of the distribution (neg- 
ative returns) bends up indicating a fewer number of losses. On the other hand there is 
no significant difference for positive quantiles. In contrast to naive utility functions which 
penalizes high variance in general, our risk sensitive Q-learning asymmetrically reduces 
the probability for losses which may be more suitable for many applications. 
5 CONCLUSION 
We have formulated a new Q-learning algorithm which can be continuously tuned towards 
risk seeking or risk avoiding policies. Thus, it is possible to construct control strategies 
which are more suitable for the problem at hand by only small modifications of Q-learning 
algorithm. The advantage of our approch in comparison to already known solutions is, that 
we have neither to change the cost nor the state model. We can prove that our algorithm 
converges under the usual assumptions. Future work will focus on the connections between 
our approach and the utility theoretic point of view. 
References 
D. P. Bertsekas, J. N. Tsitsiklis (1996) Neuro-Dynamic Programming. Athena Scientific. 
M. Heger (1994) Consideration of Risk and Reinforcement Learning, in Machine Learning, proceed- 
ings of the 1 lth International Conference, Morgan Kaufmann Publishers. 
S. Koenig, R.G. Simmons (1994) Risk-Sensitive Planning with Probabilistic Decision Graphs. 
Proc. of the Fourth Int. Conf. on Principles of Knowledge Representation and Reasoning (KR). 
M.L. Littman, Cs. Szepesvfiri (1996), A generalized reinforcement-learning model: Convergence 
and applications. In International Conference of Machine Learning '96. Bari. 
R. Neuneier (1998) Enhancing Q-learning for Optimal Asset Allocation, in Advances in Neural In- 
formation Processing Systems 10, Cambridge, MA: MIT Press. 
M. L. Puterman (1994), Markov Decision Processes, John Wiley & Sons. 
Cs. Szepesvfiri (1997) Non-Markovian Policies in Sequential Decision Problems, Acta Cybernetica. 
J.N. Tsitsiklis, B. Van Roy (1997) Approximate Solutions to Optimal Stopping Problems, in Ad- 
vances in Neural Information Processing Systems 9, Cambridge, MA: MIT Press. 
