Multi-Armed Bandit Algorithms
We provide the theoretical lower bounds for these settings and the regret upper bounds of the listed algorithms as functions of the following key quantities: $K$ - number of arms, $T$ - number of rounds (time steps), $d$ - dimension (where appropriate), and occasionally additional parameters pertinent to the assumptions on the bandits' structure. We write $R_T$, $r_T$, and $g_T$ for the cumulative regret, simple regret, and gain at time $T$, respectively.
Finite-Armed Bandits
Cumulative Regret $E(R_T)$
Stochastic
Lai-Robbins Lower Bound [1985]: Under mild assumptions on the reward distributions, the asymptotic lower bound on expected cumulative regret is: $$\liminf_{T\rightarrow\infty}\frac{E(R_T)}{\log T} \geq \sum_{i:\mu_i < \mu^*} \frac{\Delta_i}{\text{KL}(\mu_i || \mu^*)},$$ where $\Delta_i = \mu^* - \mu_i$ is the gap and $\text{KL}(\mu_i || \mu^*)$ is the KL divergence between the reward distribution of arm $i$ and that of the optimal arm. This is the fundamental information-theoretic limit for any consistent algorithm.
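To make the formula concrete, here is a minimal sketch (assuming Bernoulli rewards; the arm means and function names are illustrative, not taken from any cited paper) that evaluates the constant multiplying $\log T$ in this lower bound:

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence KL(Ber(p) || Ber(q)), clipped away from 0 and 1 for stability."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def lai_robbins_constant(means):
    """Constant c such that E(R_T) >= (c - o(1)) * log(T) for any consistent algorithm."""
    mu_star = max(means)
    return sum((mu_star - mu) / bernoulli_kl(mu, mu_star)
               for mu in means if mu < mu_star)

# Example: three Bernoulli arms with means 0.5, 0.45, 0.3.
print(lai_robbins_constant([0.5, 0.45, 0.3]))  # approximately 12.4
```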
| Algorithm | Reference | Bound | Notes |
| --- | --- | --- | --- |
| ε-greedy | [Auer et al, 2002] | $E(R_T) = \Omega(T)$ for constant $\epsilon$, $O(\log T)$ for decaying $\epsilon_t$ | Simple but suboptimal. Constant $\epsilon$ gives linear regret. Outperformed by UCB strategies. |
| UCB1 | [Auer, Cesa-Bianchi & Fischer, 2002] | $E(R_T) \le \sum_{i:\mu_i < \mu^*} \frac{8\log T}{\Delta_i} + \left(1 + \frac{\pi^2}{3}\right)\sum_{i=1}^K \Delta_i$ | Classic algorithm. Simple and practical. Over-explores by a constant factor relative to the Lai-Robbins lower bound (sketched after this table). |
| UCB-V | [Audibert, Munos & Szepesvári, 2009] | $E(R_T) \le \sum_{i:\mu_i < \mu^*} \left(\frac{2\sigma_i^2\log T}{\Delta_i} + 3\Delta_i\right)$ | Exploits variance information. Better constants than UCB1 when variances differ significantly. |
| KL-UCB | [Garivier & Cappé, 2011] | $\limsup_{T\to\infty} \frac{E(R_T)}{\log T} \le \sum_{i:\mu_i < \mu^*} \frac{\Delta_i}{\text{KL}(\mu_i||\mu^*)}$ | Asymptotically optimal. Matches the Lai-Robbins lower bound exactly (up to $o(\log T)$ terms). |
| Thompson Sampling | [Agrawal & Goyal, 2012] | $E(R_T) \le (1 + \epsilon) \sum_{i:\mu_i < \mu^*} \frac{\Delta_i}{\text{KL}(\mu_i||\mu^*)} \log T + O(K/\epsilon^2)$ for Bernoulli rewards | Bayesian approach. Asymptotically optimal like KL-UCB. Excellent empirical performance. Industry standard. |
| FH-Gittins | [Gittins, 1979] | $E(R_T) \le \sum_{i:\mu_i < \mu^*} \frac{32\log T}{\Delta_i}+o(\log T)$ | Finite-horizon Gittins index. Designed for Bayesian setting. Requires knowledge of horizon $T$, but performs well compared to anytime strategies. |
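The following is a compact sketch of UCB1 and Bernoulli Thompson Sampling as summarized above; the arm means, horizon, and Beta(1, 1) priors are illustrative choices, not prescriptions from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def ucb1(means, T):
    """UCB1: pull the arm maximizing empirical mean + sqrt(2 log t / n_i)."""
    K = len(means)
    n = np.zeros(K)          # pull counts
    s = np.zeros(K)          # sums of observed rewards
    regret = 0.0
    for t in range(1, T + 1):
        if t <= K:                       # pull each arm once to initialize
            i = t - 1
        else:
            i = int(np.argmax(s / n + np.sqrt(2 * np.log(t) / n)))
        s[i] += rng.binomial(1, means[i])   # Bernoulli reward
        n[i] += 1
        regret += max(means) - means[i]
    return regret

def thompson_bernoulli(means, T):
    """Thompson Sampling with Beta(1, 1) priors on Bernoulli arms."""
    K = len(means)
    alpha, beta = np.ones(K), np.ones(K)      # Beta posterior parameters
    regret = 0.0
    for _ in range(T):
        i = int(np.argmax(rng.beta(alpha, beta)))  # sample each posterior, play the argmax
        r = rng.binomial(1, means[i])
        alpha[i] += r
        beta[i] += 1 - r
        regret += max(means) - means[i]
    return regret

means = [0.5, 0.45, 0.3]
print(ucb1(means, 10_000), thompson_bernoulli(means, 10_000))
```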
Adversarial
Minimax Lower Bound: For adversarial rewards, the minimax optimal regret is: $$\inf_{\text{algorithm}} \sup_{\text{reward sequence}} E(R_T) = \Theta(\sqrt{TK}).$$ The $\sqrt{K}$ factor is the price of bandit feedback (with full information the minimax rate drops to $\Theta(\sqrt{T\log K})$), and the $\sqrt{T}$ growth is unavoidable without stochasticity assumptions.
| Algorithm | Reference | Bound | Notes |
| --- | --- | --- | --- |
| EXP3 | [Auer, Cesa-Bianchi, Freund & Schapire, 2002] | $E(R_T) \le 2\sqrt{(e-1)TK\log K}$ | Handles adversarial rewards. No stationarity assumptions needed. Requires knowing the horizon $T$ to tune the exploration rate. Uses exponential weights (sketched after this table). |
| EXP3.1 | [Auer, Cesa-Bianchi, Freund & Schapire, 2002] | $E(R_T) \le O(\sqrt{TK\log K})$ | Anytime version that does not require knowing $T$ in advance. Restarts EXP3 with doubling guesses of the maximal gain. |
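Below is a minimal sketch of EXP3 for rewards in $[0,1]$, with the exploration rate tuned for a known horizon as in the table above; `reward_fn` stands in for an arbitrary (possibly adversarial) reward sequence and is our own illustrative construct.

```python
import numpy as np

def exp3(reward_fn, K, T, rng=np.random.default_rng(0)):
    """EXP3 for rewards in [0, 1]; reward_fn(t, i) may be chosen adversarially."""
    gamma = min(1.0, np.sqrt(K * np.log(K) / ((np.e - 1) * T)))  # tuned for known horizon T
    w = np.ones(K)                                               # exponential weights
    total = 0.0
    for t in range(T):
        p = (1 - gamma) * w / w.sum() + gamma / K    # mix weights with uniform exploration
        i = rng.choice(K, p=p)
        x = reward_fn(t, i)
        total += x
        w[i] *= np.exp(gamma * (x / p[i]) / K)       # importance-weighted update for the played arm
    return total

# Example: arm 0 is best in the first half of the game, arm 1 in the second half.
rewards = lambda t, i: float((i == 0) if t < 5000 else (i == 1))
print(exp3(rewards, K=2, T=10_000))
```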
High-Probability Bounds
Goal: Bound the regret with high probability $1-\delta$, not just in expectation. Critical for risk-sensitive applications.
| Algorithm | Reference | Bound | Notes |
| --- | --- | --- | --- |
| UCB1 | [Auer, Cesa-Bianchi & Fischer, 2002] | $R_T \le \sum_{i:\mu_i < \mu^*} \frac{8\log T}{\Delta_i} + O(K\log(1/\delta))$ with prob. $\ge 1-\delta$ | High-probability version with extra $\log(1/\delta)$ term. Concentration via Hoeffding's inequality. |
| MOSS | [Audibert & Bubeck, 2009] | $E(R_T) \le O(\sqrt{KT})$ minimax optimal, no gap dependence | Minimax Optimal Strategy in the Stochastic case. Optimal worst-case bound without knowing the gaps. |
Simple Regret $E(r_T)$
Definition: Simple regret $r_T = \mu^* - \mu_{\hat{i}_T}$, where $\hat{i}_T$ is the arm recommended after $T$ rounds, measures the quality of the final recommendation rather than the cost of exploration along the way.
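To make the distinction from cumulative regret concrete, the sketch below runs a uniform (round-robin) allocation and reports the simple regret of the final recommendation; the Bernoulli arm means and horizon are illustrative.

```python
import numpy as np

def simple_regret_uniform(means, T, rng=np.random.default_rng(0)):
    """Pull arms round-robin for T rounds, then recommend the best empirical mean."""
    K = len(means)
    n = np.zeros(K)
    s = np.zeros(K)
    for t in range(T):
        i = t % K                                  # uniform (round-robin) allocation
        s[i] += rng.binomial(1, means[i])
        n[i] += 1
    recommended = int(np.argmax(s / np.maximum(n, 1)))
    return max(means) - means[recommended]         # simple regret of the recommendation

print(simple_regret_uniform([0.5, 0.45, 0.3], T=3000))
```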
| Algorithm | Reference | Bound | Notes |
| --- | --- | --- | --- |
| Uniform Sampling | [Bubeck, Munos & Stoltz, 2009] | $E(r_T) \le \sum_{i=1}^K e^{-\frac{T\Delta_i^2}{2K}}$ | Pure exploration strategy. Exponential decay. Optimal for long horizons and many arms. |
| UCB-E | [Bubeck, Munos & Stoltz, 2009] | $E(r_T) \le O(T^{-\frac{\Delta_{\min}^2}{2}})$ polynomial decay | Exploration-focused UCB variant. Cumulative-regret-style allocation yields only polynomial decay, so it can be preferable at short-to-medium horizons but loses to uniform sampling asymptotically. |
Expected Discounted Reward
| Algorithm | Reference | Bound | Notes |
| --- | --- | --- | --- |
| Gittins Index | [Gittins, 1979] | Optimal | Optimal for discounted rewards. Bayesian approach. Computationally expensive. |
| Dynamic Programming | - | Optimal | Optimal solution exists but computationally intractable for large problems. |
| Approximate Gittins | [Niño-Mora, 2011] | Near-optimal | Computationally efficient approximation to Gittins index. |
Modern Bandits with Function Approximation
Linear & Contextual Bandits
Setting: Rewards follow $r_t = \langle \theta^*, x_t \rangle + \eta_t$, where $\theta^* \in \mathbb{R}^d$ is unknown, $x_t \in \mathbb{R}^d$ is the feature vector of the chosen action (which may depend on the observed context), and $\eta_t$ is zero-mean noise.
Minimax Lower Bound: For linear bandits, the worst-case regret satisfies: $$R_T = \Omega(d\sqrt{T}).$$ This dimension-dependent lower bound is fundamental and cannot be improved in the worst case. Optimal algorithms match this up to logarithmic factors.
| Algorithm | Reference | Bound | Notes |
| --- | --- | --- | --- |
| LinUCB | [Li, Chu, Langford & Schapire, 2010] | $E(R_T) \le O(d\sqrt{T\log T})$ | Near-minimax-optimal for linear contextual bandits (up to logarithmic factors). Uses ridge regression + a UCB bonus (sketched after this table). Widely used in recommendation systems. |
| LinTS | [Agrawal & Goyal, 2013] | $\tilde{O}(d^{3/2}\sqrt{T})$ with high prob. (frequentist); $\tilde{O}(d\sqrt{T})$ Bayesian regret | Thompson Sampling for linear bandits. Often better empirical performance than LinUCB. Bayesian approach. |
| OFUL | [Abbasi-Yadkori, Pál & Szepesvári, 2011] | $R_T \le \tilde{O}(d\sqrt{T})$ with prob. $\ge 1-\delta$ (logarithmic dependence on $1/\delta$) | Optimism in the Face of Uncertainty (Linear). Tight analysis via a self-normalized concentration inequality for vector-valued martingales. |
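A minimal sketch of the ridge-regression-plus-bonus idea behind LinUCB/OFUL follows; the exploration parameter `alpha` is hand-set rather than the theory-prescribed confidence radius, and the context generator and noise level are illustrative.

```python
import numpy as np

def linucb(contexts_fn, theta_star, T, alpha=1.0, lam=1.0, noise=0.1,
           rng=np.random.default_rng(0)):
    """Optimistic linear bandit: ridge estimate + alpha * ||x||_{V^{-1}} bonus per action."""
    d = theta_star.shape[0]
    V = lam * np.eye(d)          # regularized design matrix
    b = np.zeros(d)              # sum of x_t * r_t
    regret = 0.0
    for t in range(T):
        X = contexts_fn(t)                       # (num_actions, d) feature matrix for this round
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b                    # ridge regression estimate of theta*
        bonus = np.sqrt(np.einsum('ad,dc,ac->a', X, V_inv, X))
        a = int(np.argmax(X @ theta_hat + alpha * bonus))
        x = X[a]
        r = x @ theta_star + noise * rng.standard_normal()
        V += np.outer(x, x)
        b += r * x
        regret += np.max(X @ theta_star) - x @ theta_star
    return regret

# Example: 20 random actions per round in d = 5 dimensions.
d = 5
theta = np.ones(d) / np.sqrt(d)
ctx = lambda t: np.random.default_rng(t).standard_normal((20, d)) / np.sqrt(d)
print(linucb(ctx, theta, T=2000))
```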
Neural & Deep Bandits
| Algorithm | Reference | Result / Approach | Notes |
| --- | --- | --- | --- |
| Neural-UCB | [Zhou, Li & Gu, 2020] | $E(R_T) \le \tilde{O}(\tilde{d}\sqrt{T})$ for over-parameterized networks, where $\tilde{d}$ is the effective dimension of the neural tangent kernel | Provable regret for neural function approximation. Uses neural tangent kernel theory. Bridges the theory-practice gap. |
| Neural-TS | [Riquelme, Tucker & Snoek, 2018] | Empirical: outperforms LinUCB on real datasets | Thompson Sampling with deep neural networks. Practical Bayesian deep learning. Excellent empirical results. |
| Bootstrapped DQN | [Osband, Blundell, Pritzel & Van Roy, 2016] | Ensemble uncertainty for exploration in deep RL | Efficient deep exploration via bootstrapping. Randomized value functions. Used in Atari and complex MDPs. |
Best-of-Both-Worlds (BOBW) Algorithms
Major breakthrough: Algorithms that achieve near-optimal regret in BOTH stochastic AND adversarial settings simultaneously, without knowing which environment they're in.
The Challenge: An algorithm must satisfy simultaneously: $$\text{Stochastic: } E(R_T) = O\left(\sum_{i:\mu_i < \mu^*} \frac{\log T}{\Delta_i}\right) \quad \text{and} \quad \text{Adversarial: } E(R_T) = \tilde{O}(\sqrt{TK}).$$ Recent work (2022-2024) extends such guarantees to combinatorial and linear settings, still without prior knowledge of the environment type.
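For the $K$-armed case, the canonical best-of-both-worlds algorithm is Tsallis-INF (Zimmert & Seldin, 2019); the rough sketch below uses a simplified learning-rate schedule and the plain importance-weighted loss estimator, so constants differ from the paper's analysis.

```python
import numpy as np

def tsallis_inf(loss_fn, K, T, rng=np.random.default_rng(0)):
    """Tsallis-INF sketch: FTRL with 1/2-Tsallis entropy regularizer; losses in [0, 1]."""
    L = np.zeros(K)                                # cumulative importance-weighted loss estimates
    total_loss = 0.0
    for t in range(1, T + 1):
        eta = 2.0 / np.sqrt(t)                     # learning-rate schedule (constants vary by analysis)
        # Solve for the normalizer x < min(L) with sum_i 4 / (eta * (L_i - x))^2 = 1 by bisection.
        lo, hi = L.min() - 2.0 * np.sqrt(K) / eta, L.min() - 2.0 / eta
        for _ in range(60):
            x = 0.5 * (lo + hi)
            p = 4.0 / (eta * (L - x)) ** 2
            lo, hi = (x, hi) if p.sum() < 1.0 else (lo, x)
        p = p / p.sum()                            # clean up residual numerical error
        i = rng.choice(K, p=p)
        loss = loss_fn(t, i)
        total_loss += loss
        L[i] += loss / p[i]                        # unbiased importance-weighted loss estimate
    return total_loss

# Stochastic example: arm 0 has the lowest expected loss.
losses = lambda t, i, rng=np.random.default_rng(1): float(rng.random() < [0.4, 0.5, 0.6][i])
print(tsallis_inf(losses, K=3, T=10_000))
```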
| Algorithm | Reference | Guarantee | Innovation |
| --- | --- | --- | --- |
| BOBW-Combinatorial | [Ito, 2023] | Adversarial: $\tilde{O}(\sqrt{mKT})$, Stochastic: $O(m\log T / \Delta_{\min})$ | First practical BOBW for combinatorial bandits with $m$ base arms. Works in both worlds without tuning. |
| Adversarial Linear | [Zimmert & Lattimore, 2022] | $\tilde{O}(\sqrt{dT})$ with high probability, near-minimax optimal | Resolves open problem on high-probability bounds. Nearly tight for adversarial linear bandits. |
Batched & Delayed Feedback
Practical constraint: Real systems can't update policies after every interaction. Batched bandits reduce communication and enable parallelization.
The Batched Regret Tradeoff: With only $M$ adaptively chosen batches over horizon $T$, the minimax regret degrades by a factor of roughly $T^{\frac{1}{2(2^M-1)}}$ relative to the fully sequential rate: $$R_T = \tilde{\Theta}\left(\sqrt{KT}\cdot T^{\frac{1}{2(2^M-1)}}\right).$$ Consequently $M = \Theta(\log\log T)$ batches suffice to recover the fully sequential $\tilde{O}(\sqrt{KT})$ rate, while the instance-dependent $O\big(\sum_{i:\mu_i<\mu^*} \log T/\Delta_i\big)$ rate requires $\Theta(\log T/\log\log T)$ batches.
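A minimal sketch of batched successive elimination is given below; the equal batch sizes and Hoeffding-style confidence radius are simplifications (the rates above require carefully chosen, geometrically growing batch schedules).

```python
import numpy as np

def batched_elimination(means, T, M, rng=np.random.default_rng(0)):
    """Run M equal-size batches; after each batch, eliminate arms whose UCB < best LCB."""
    K = len(means)
    active = list(range(K))
    n = np.zeros(K)
    s = np.zeros(K)
    regret, per_batch = 0.0, T // M
    for _ in range(M):
        pulls = per_batch // len(active)               # spread the batch over surviving arms
        for i in active:
            s[i] += rng.binomial(pulls, means[i])
            n[i] += pulls
            regret += pulls * (max(means) - means[i])
        mu_hat = s[active] / n[active]
        rad = np.sqrt(np.log(2 * K * T) / (2 * n[active]))   # Hoeffding-style confidence radius
        keep = mu_hat + rad >= np.max(mu_hat - rad)          # keep arms not provably suboptimal
        active = [a for a, k in zip(active, keep) if k]
    return regret

print(batched_elimination([0.5, 0.45, 0.3], T=30_000, M=4))
```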
| Setting | Reference | Result | Impact |
| --- | --- | --- | --- |
| M-batched | [Gao et al, 2019] | $\Theta(\log\log T)$ batches suffice for the minimax rate; $\Theta(\log T/\log\log T)$ for the instance-dependent rate | Fundamental tradeoff: fewer batches = more regret. Critical for distributed systems. |
| Delayed Feedback | [Dann & Lattimore, 2021] | Handles variable delay in feedback | Essential for online ads, recommendation where feedback is delayed hours/days. |
LLM Alignment & RLHF
| Method | Reference | Key Contribution | Impact |
| --- | --- | --- | --- |
| InstructGPT | [Ouyang et al, 2022] | RLHF with human preferences for LLM alignment | Foundation for ChatGPT. Revolutionized how LLMs are trained to follow instructions. |
| Constitutional AI | [Bai et al, 2022] | AI feedback instead of human feedback for alignment | Scalable alignment method. Used in Claude and other models. Reduces human labeling. |
| DPO | [Rafailov et al, 2023] | Direct Preference Optimization without RL (loss sketched after this table) | Simpler alternative to RLHF. Widely adopted for fine-tuning open models (Llama, Mistral). |
| Test-Time Compute | [Snell et al, 2024] | Scaling inference-time exploration | Breakthrough: more test-time search can beat larger models. Key for o1-style reasoning. |
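Since DPO reduces preference learning to a classification-style loss on preference pairs, a minimal numpy sketch of that loss may be useful; the inputs are summed token log-probabilities under the trained and reference policies, and all numbers below are illustrative.

```python
import numpy as np

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss: -log sigmoid(beta * (implicit reward of chosen - implicit reward of rejected))."""
    chosen = policy_logp_chosen - ref_logp_chosen        # log(pi_theta / pi_ref) for preferred response
    rejected = policy_logp_rejected - ref_logp_rejected  # log(pi_theta / pi_ref) for dispreferred response
    logits = beta * (chosen - rejected)
    return np.mean(np.logaddexp(0.0, -logits))           # numerically stable -log sigmoid, batch-averaged

# Toy batch of two preference pairs; values are summed token log-probabilities (illustrative only).
pol_w, pol_l = np.array([-12.0, -9.5]), np.array([-14.0, -9.0])
ref_w, ref_l = np.array([-13.0, -10.0]), np.array([-13.5, -9.2])
print(dpo_loss(pol_w, pol_l, ref_w, ref_l))
```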
Implicit Exploration
Key insight: Exploration can emerge implicitly from model uncertainty without explicit bonuses. Connects to deep RL and practical algorithms.
| Method | Reference | Mechanism | Advantage |
| --- | --- | --- | --- |
| RLSVI | [Zanette, Brandfonbrener, Brunskill, Pirotta & Lazaric, 2020] | Gaussian perturbations of least-squares value-iteration targets | Randomized Least-Squares Value Iteration with frequentist regret guarantees. Practical model-free exploration with provable guarantees. Bridges theory and practice. |
Pure Exploration & Best Arm Identification
Goal: Find the best arm with minimal samples, not minimize cumulative regret. Critical for clinical trials, A/B testing.
Sample Complexity Lower Bound: To identify the best arm with probability $\geq 1-\delta$, any algorithm requires at least: $$n \geq \Omega\left(\sum_{i=2}^K \frac{1}{\Delta_i^2} \log\frac{K}{\delta}\right)$$ samples, where arms are ordered by mean $\mu_1 > \mu_2 > \cdots > \mu_K$. This instance-dependent bound is tight and achieved by algorithms like lil'UCB and Track-and-Stop.
| Setting | Reference | Bound | Application |
| --- | --- | --- | --- |
| Fixed Budget | [Audibert & Bubeck, 2010] | Successive Rejects: error probability $\le \exp\left(-\Omega\left(\frac{T}{H\log K}\right)\right)$, where $H=\sum_{i:\mu_i<\mu^*}\Delta_i^{-2}$ | Clinical trials with a fixed number of patients. Find the best treatment within budget $T$ (sketched after this table). |
| Fixed Confidence | [Jamieson et al, 2014] | lil'UCB achieves optimal instance-dependent bounds | Stop as soon as confident about best arm. Saves resources in practice. |
| Dueling Bandits | [Yue, Broder, Kleinberg & Joachims, 2012] | Efficient algorithms for pairwise comparisons | When you can only compare two options (e.g., which ad is better?). Widely used. |
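The fixed-budget entry refers to Successive Rejects; below is a compact sketch under Bernoulli rewards (the arm means and budget are illustrative).

```python
import numpy as np

def successive_rejects(means, budget, rng=np.random.default_rng(0)):
    """Successive Rejects (fixed-budget BAI): drop the empirically worst arm each phase."""
    K = len(means)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    active = list(range(K))
    n = np.zeros(K)
    s = np.zeros(K)
    n_prev = 0
    for k in range(1, K):
        n_k = int(np.ceil((budget - K) / (log_bar * (K + 1 - k))))
        pulls = max(n_k - n_prev, 1)               # additional pulls per surviving arm this phase
        for i in active:
            s[i] += rng.binomial(pulls, means[i])
            n[i] += pulls
        worst = active[int(np.argmin(s[active] / n[active]))]
        active.remove(worst)                       # reject the empirically worst surviving arm
        n_prev = n_k
    return active[0]                               # recommended best arm

print(successive_rejects([0.5, 0.45, 0.3], budget=3000))
```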
Offline Bandits & Off-Policy Evaluation
| Method | Reference | Approach | Use Case |
| --- | --- | --- | --- |
| Doubly Robust | [Dudik et al, 2014] | Combines the direct method and importance sampling (estimator sketched after this table) | Industry standard for off-policy evaluation. Used at Microsoft, Google, Meta. |
| SWITCH | [Wang et al, 2017] | Adaptive estimator selection for OPE | Reduces variance in policy evaluation. Critical for A/B testing at scale. |
| Balanced Policy Eval | [Athey & Wager, 2021] | Balancing weights for causal inference | Connects bandits to causal inference. Used in tech companies for decision making. |
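The doubly robust estimator adds an importance-weighted correction to a learned reward model; the sketch below assumes known logging propensities and a toy reward model, both of which are illustrative.

```python
import numpy as np

def doubly_robust_value(rewards, actions, propensities, target_actions, reward_model):
    """DR estimate of a target policy's value from logged (context, action, reward) data.

    reward_model[j, a] estimates E[r | context_j, action a] (the direct-method part);
    the importance-weighted term corrects its bias on the logged actions.
    """
    idx = np.arange(len(rewards))
    direct = reward_model[idx, target_actions]
    match = (actions == target_actions).astype(float)
    correction = match / propensities * (rewards - reward_model[idx, actions])
    return np.mean(direct + correction)

# Toy logged data: 5 rounds, 3 actions, uniform logging policy (propensity 1/3).
rewards = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
actions = np.array([0, 1, 2, 0, 1])
propensities = np.full(5, 1 / 3)
target_actions = np.array([0, 0, 2, 1, 1])                  # actions the evaluated policy would take
reward_model = np.tile(np.array([0.6, 0.4, 0.5]), (5, 1))   # crude model of E[r | x, a]
print(doubly_robust_value(rewards, actions, propensities, target_actions, reward_model))
```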