Multi-Armed Bandits with Risk-Aware Performance Measures

HU, HONGDA

Files

Hu_Hongda.pdf (14.21 MB)

Date

2026-04-29

Authors

HU, HONGDA

Advisor

Schied, Alexander
Ghossoub, Mario
Charpentier, Arthur

Publisher

University of Waterloo

Abstract

The classical multi-armed bandit problem involves a learner and a collection of arms with unknown reward distributions. At each round, the learner selects an arm and receives new information. The learner faces a trade-off between exploiting the current information and exploring all arms. The traditional objective in the literature is to maximize the expected cumulative reward over all rounds. Such an objective does not involve a risk–reward trade-off, which is fundamental in many areas of application. Building on the mean–variance formulation of Sani et al. (2012), I first extend the classical multi-armed bandit problem to a mean–variance setting. I relax assumptions of independent arms and bounded rewards and instead allow for sub-Gaussian reward distributions. Within this framework, I introduce the Risk-Aware Lower Confidence Bound algorithm and study its theoretical properties. Moving beyond mean–variance criteria, I then develop the Expected Shortfall Lower Confidence Bound algorithm for bandits evaluated under Expected Shortfall, establishing new concentration bounds for empirical Value at Risk and Expected Shortfall and deriving regret guarantees for both light- and heavy-tailed reward distributions. Building on this analysis, I next propose the Range Value at Risk Lower Confidence Bound algorithm, which relies on a novel uniform concentration argument for order statistics inspired by Boucheron and Thomas (2012), and I extend the regret analysis to general continuous reward distributions with non-decreasing hazard rates. Finally, I introduce the Spectral Risk Measure Lower Confidence Bound algorithm, providing a unified treatment of a broad class of spectral risk measures that encompasses both Expected Shortfall and Range Value at Risk as special cases.
Numerical experiments on synthetic and real financial datasets demonstrate that the proposed algorithms achieve favourable risk–reward trade-offs and consistently outperform existing risk-aware bandit methods across a variety of distributional structures.
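
For readers unfamiliar with the risk measures named in the abstract, the following is a minimal sketch of how empirical Value at Risk, empirical Expected Shortfall, and a mean–variance score might be computed from the observed rewards of a single arm. It is an illustration under stated assumptions, not the thesis's estimators or algorithms: the function names, the lower-tail (reward) convention, the `ceil(alpha * n)` tail size, and the variance-minus-weighted-mean form of the score are all choices made here for illustration.

```python
import math

def empirical_var(rewards, alpha):
    """Empirical lower alpha-quantile of the rewards (assumed VaR convention)."""
    xs = sorted(rewards)
    k = max(1, math.ceil(alpha * len(xs)))  # size of the lower tail
    return xs[k - 1]

def empirical_es(rewards, alpha):
    """Average of the worst ceil(alpha * n) rewards (assumed ES convention)."""
    xs = sorted(rewards)
    k = max(1, math.ceil(alpha * len(xs)))
    return sum(xs[:k]) / k

def mean_variance(rewards, rho):
    """Sample variance minus rho times sample mean; smaller means less risky
    for a given level of reward (assumed form of the criterion)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((x - mean) ** 2 for x in rewards) / n
    return var - rho * mean

# Hypothetical rewards observed from one arm.
rewards = [4, 1, 3, 2, 8, 7, 6, 5]
print(empirical_var(rewards, 0.25))
print(empirical_es(rewards, 0.25))
print(mean_variance(rewards, 1.0))
```

A lower-confidence-bound style bandit algorithm would typically combine such an empirical estimate with a confidence width that shrinks as an arm is pulled more often; the specific widths and guarantees are the subject of the thesis and are not reproduced here.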