Multi-Armed Bandits with Risk-Aware Performance Measures
Date
Authors
Advisor
Schied, Alexander
Ghossoub, Mario
Charpentier, Arthur
Publisher
University of Waterloo
Abstract
The classical multi-armed bandit problem involves a learner and a collection of arms with
unknown reward distributions. At each round, the learner selects an arm and receives new
information. The learner faces a tradeoff between exploiting the current information and
exploring all arms. The traditional objective in the literature is to maximize the expected
cumulative reward over all rounds. Such an objective does not involve a risk–reward tradeoff,
which is fundamental in many areas of application. Building on the mean–variance
formulation of Sani et al. (2012), I first extend the classical multi-armed bandit problem
to a mean–variance setting. I relax the assumptions of independent arms and bounded rewards
and instead allow for sub-Gaussian reward distributions. Within this framework,
I introduce the Risk-Aware Lower Confidence Bound algorithm and study its theoretical
properties. Moving beyond mean–variance criteria, I then develop the Expected Shortfall
Lower Confidence Bound algorithm for bandits evaluated under Expected Shortfall, establishing
new concentration bounds for empirical Value at Risk and Expected Shortfall and
deriving regret guarantees for both light- and heavy-tailed reward distributions. Building
on this analysis, I next propose the Range Value at Risk Lower Confidence Bound algorithm,
which relies on a novel uniform concentration argument for order statistics inspired
by Boucheron and Thomas (2012), and I extend the regret analysis to general continuous
reward distributions with non-decreasing hazard rates. Finally, I introduce the Spectral
Risk Measure Lower Confidence Bound algorithm, providing a unified treatment of a broad
class of spectral risk measures that encompasses both Expected Shortfall and Range Value
at Risk as special cases. Numerical experiments on synthetic and real financial datasets
demonstrate that the proposed algorithms achieve favourable risk–reward tradeoffs and
consistently outperform existing risk-aware bandit methods across a variety of distributional
structures.
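
As a rough illustration of the lower-confidence-bound idea described above, the following Python sketch selects, at each round, the arm whose estimated risk minus a confidence width is smallest. It is not the thesis's algorithms: the function names, the width sqrt(log t / n), the width_scale parameter, and the Gaussian arms in the usage note are all assumptions made for illustration. The mean–variance estimate follows the variance-minus-rho-times-mean form of Sani et al. (2012), and the Expected Shortfall estimate uses one common convention (the average of the worst (1 - alpha) fraction of observed losses); the thesis may use different conventions and confidence widths.

```python
import math
import random
import statistics


def empirical_mean_variance(rewards, rho):
    """Sample variance minus rho times sample mean; lower values are better."""
    return statistics.pvariance(rewards) - rho * statistics.fmean(rewards)


def empirical_expected_shortfall(losses, alpha):
    """Average of the worst ceil((1 - alpha) * n) losses (assumed convention)."""
    ordered = sorted(losses, reverse=True)
    # Round before taking the ceiling to guard against floating-point noise.
    k = max(1, math.ceil(round((1 - alpha) * len(ordered), 9)))
    return statistics.fmean(ordered[:k])


def lcb_index(risk_estimate, pulls, t, width_scale=1.0):
    """Optimistic index for a risk to be minimised: estimate minus a width.

    The width sqrt(log(t) / pulls) is a generic placeholder, not a bound
    derived in the thesis.
    """
    return risk_estimate - width_scale * math.sqrt(math.log(t) / pulls)


def run_lcb(arms, horizon, risk_fn, seed=0):
    """Pull each arm once, then repeatedly pull the arm with the smallest index.

    `arms` is a list of callables taking a random.Random and returning a sample;
    `risk_fn` maps a list of samples to a risk estimate (lower is better).
    Returns the number of pulls per arm.
    """
    rng = random.Random(seed)
    samples = [[draw(rng)] for draw in arms]
    for t in range(len(arms) + 1, horizon + 1):
        indices = [lcb_index(risk_fn(s), len(s), t) for s in samples]
        chosen = min(range(len(arms)), key=indices.__getitem__)
        samples[chosen].append(arms[chosen](rng))
    return [len(s) for s in samples]
```

For example, with two hypothetical Gaussian arms of equal mean but different variance, run_lcb([lambda rng: rng.gauss(1.0, 0.1), lambda rng: rng.gauss(1.0, 2.0)], 500, lambda s: empirical_mean_variance(s, 1.0)) should concentrate its pulls on the low-variance arm, since that arm has the smaller mean–variance value.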