Michal Valko : Projects

Multi-Armed Bandit Algorithms

A comprehensive resource on multi-armed bandit algorithms, covering theoretical bounds, algorithm comparisons, fundamental papers, books, tutorials, software implementations, and research links. This page provides regret bounds and performance guarantees for various bandit algorithms across different settings.

We provide the theoretical lower bounds for these settings and the regret upper bounds of the listed algorithms as a function of the following key quantities: $K$ - the number of arms, $T$ - the number of rounds (time steps), $d$ - the dimension (where appropriate), and sometimes additional parameters pertinent to the assumptions on the bandits' structure. We also write $R_T$, $r_T$, and $g_T$ for the cumulative regret, simple regret, and gain at time $T$, respectively.

Finite-Armed Bandits

Cumulative Regret $E(R_T)$

Stochastic

Lai-Robbins Lower Bound [1985]: Under mild assumptions on the reward distributions, the asymptotic lower bound on expected cumulative regret is: $$\liminf_{T\rightarrow\infty}\frac{E(R_T)}{\log T} \geq \sum_{i:\mu_i < \mu^*} \frac{\Delta_i}{\text{KL}(\mu_i \Vert \mu^*)},$$ where $\Delta_i = \mu^* - \mu_i$ is the gap and $\text{KL}(\mu_i \Vert \mu^*)$ is the KL divergence between the reward distribution of arm $i$ and that of the optimal arm. This is the fundamental information-theoretic limit for any consistent algorithm.
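
As a worked example (a minimal sketch assuming Bernoulli rewards; the arm means are illustrative), the snippet below evaluates the Lai-Robbins constant $\sum_{i:\mu_i < \mu^*} \Delta_i/\text{KL}(\mu_i \Vert \mu^*)$, i.e., the coefficient of $\log T$ in the lower bound.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence KL(Ber(p) || Ber(q))."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_constant(means):
    """Coefficient of log T in the asymptotic lower bound, for Bernoulli arms."""
    mu_star = max(means)
    return sum((mu_star - mu) / kl_bernoulli(mu, mu_star)
               for mu in means if mu < mu_star)

# Illustrative 3-armed Bernoulli instance: regret of any consistent
# algorithm must eventually grow at least ~ constant * log T.
print(lai_robbins_constant([0.9, 0.8, 0.5]))
```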

| Algorithm | Reference | Bound | Notes |
| --- | --- | --- | --- |
| ε-greedy | [Auer et al, 2002] | $E(R_T) = \Omega(T)$ for constant $\epsilon$; $O(\log T)$ for decaying $\epsilon_t$ | Simple but suboptimal. Constant $\epsilon$ gives linear regret. Outperformed by UCB strategies. |
| UCB1 | [Auer, Cesa-Bianchi & Fischer, 2002] | $E(R_T) \le \sum_{i:\mu_i < \mu^*} \frac{8\log T}{\Delta_i} + \left(1 + \frac{\pi^2}{3}\right)\sum_{i=1}^K \Delta_i$ | Classic algorithm, simple and practical. Over-explores by a constant factor relative to the Lai-Robbins lower bound. See the sketch after this table. |
| UCB-V | [Audibert, Munos & Szepesvári, 2009] | $E(R_T) \le \sum_{i:\mu_i < \mu^*} \left(\frac{2\sigma_i^2\log T}{\Delta_i} + 3\Delta_i\right)$ | Exploits empirical variance information. Better constants than UCB1 when the arm variances differ significantly. |
| KL-UCB | [Garivier & Cappé, 2011] | $\limsup_{T\to\infty} \frac{E(R_T)}{\log T} \le \sum_{i:\mu_i < \mu^*} \frac{\Delta_i}{\text{KL}(\mu_i \Vert \mu^*)}$ | Asymptotically optimal: matches the Lai-Robbins lower bound exactly (up to $o(\log T)$ terms). |
| Thompson Sampling | [Agrawal & Goyal, 2012] | $E(R_T) \le (1 + \epsilon) \sum_{i:\mu_i < \mu^*} \frac{\Delta_i}{\text{KL}(\mu_i \Vert \mu^*)} \log T + O(K)$ for Bernoulli rewards | Bayesian approach, asymptotically optimal like KL-UCB. Excellent empirical performance; an industry standard. See the sketch after this table. |
| FH-Gittins | [Gittins, 1979] | $E(R_T) \le \sum_{i:\mu_i < \mu^*} \frac{32\log T}{\Delta_i}+o(\log T)$ | Finite-horizon Gittins index, designed for the Bayesian setting. Requires knowledge of the horizon $T$, but performs well compared to anytime strategies. |
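
To make the table concrete, here is a minimal sketch (not a tuned implementation) of UCB1 and Bernoulli Thompson Sampling on a simulated Bernoulli bandit; the environment and arm means are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.9, 0.8, 0.5])   # illustrative Bernoulli arm means
K, T = len(means), 10_000

def ucb1(T):
    counts, sums, regret = np.zeros(K), np.zeros(K), 0.0
    for t in range(T):
        if t < K:                                    # play each arm once
            arm = t
        else:                                        # UCB1 index: mean + exploration bonus
            bonus = np.sqrt(2 * np.log(t + 1) / counts)
            arm = int(np.argmax(sums / counts + bonus))
        r = rng.random() < means[arm]
        counts[arm] += 1; sums[arm] += r
        regret += means.max() - means[arm]           # pseudo-regret of the pulled arm
    return regret

def thompson(T):
    alpha, beta, regret = np.ones(K), np.ones(K), 0.0
    for _ in range(T):
        arm = int(np.argmax(rng.beta(alpha, beta)))  # sample from Beta posteriors
        r = rng.random() < means[arm]
        alpha[arm] += r; beta[arm] += 1 - r          # conjugate posterior update
        regret += means.max() - means[arm]
    return regret

print("UCB1 regret:", ucb1(T), " Thompson regret:", thompson(T))
```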

Adversarial

Minimax Lower Bound: For adversarial rewards, the minimax optimal regret is: $$\inf_{\text{algorithm}} \sup_{\text{reward sequence}} E(R_T) = \Theta(\sqrt{TK}).$$ Without stationarity assumptions the $\sqrt{T}$ growth is unavoidable even under full-information feedback; the additional $\sqrt{K}$ factor is the price of bandit feedback.

| Algorithm | Reference | Bound | Notes |
| --- | --- | --- | --- |
| EXP3 | [Auer, Cesa-Bianchi, Freund & Schapire, 2002] | $E(R_T) \le 2\sqrt{(e-1)KT\log K}$ | Handles adversarial rewards; no stationarity assumptions needed. Requires knowing the horizon $T$. Uses exponential weights with importance-weighted reward estimates. See the sketch after this table. |
| EXP3.1 | [Auer, Cesa-Bianchi, Freund & Schapire, 2002] | $E(R_T) \le 4\sqrt{eKT\log K}$ | Anytime version that does not require knowing $T$ in advance; restarts EXP3 with increasing guesses of the maximal gain (a doubling-trick-style construction). |
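
A minimal sketch of EXP3 with exponential weights and importance-weighted reward estimates, assuming rewards in $[0,1]$ and a known horizon; the "adversarial" reward matrix is fixed in advance and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
K, T = 3, 5000
gamma = min(1.0, np.sqrt(K * np.log(K) / ((np.e - 1) * T)))  # exploration rate

# Illustrative reward sequence in [0, 1], fixed in advance, unknown to the learner.
rewards = rng.random((T, K))

weights = np.ones(K)
total = 0.0
for t in range(T):
    probs = (1 - gamma) * weights / weights.sum() + gamma / K
    arm = rng.choice(K, p=probs)
    x = rewards[t, arm]
    total += x
    x_hat = x / probs[arm]                       # importance-weighted reward estimate
    weights[arm] *= np.exp(gamma * x_hat / K)    # exponential-weight update
    weights /= weights.max()                     # rescale to avoid numerical overflow

print("EXP3 cumulative reward:", total,
      " best fixed arm in hindsight:", rewards.sum(axis=0).max())
```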

High-Probability Bounds

Goal: regret bounds that hold with high probability $1-\delta$, not just in expectation. Critical for risk-sensitive applications.

| Algorithm | Reference | Bound | Notes |
| --- | --- | --- | --- |
| UCB1 | [Auer, Cesa-Bianchi & Fischer, 2002] | $R_T \le \sum_{i:\mu_i < \mu^*} \frac{8\log T}{\Delta_i} + O(K\log(1/\delta))$ with prob. $\ge 1-\delta$ | High-probability version with an extra $\log(1/\delta)$ term. Concentration via Hoeffding's inequality. |
| MOSS | [Audibert & Bubeck, 2009] | $E(R_T) \le O(\sqrt{KT})$, minimax optimal, no gap dependence | Minimax Optimal Strategy in the Stochastic case. Optimal worst-case bound without knowing the gaps. |

Simple Regret $E(r_T)$

Definition: Simple regret $r_T = \mu^* - \mu_{\hat{i}_T}$, where $\hat{i}_T$ is the arm recommended after $T$ rounds, measures the quality of the final recommendation rather than the cost incurred while exploring.

| Algorithm | Reference | Bound | Notes |
| --- | --- | --- | --- |
| Uniform Sampling | [Bubeck, Munos & Stoltz, 2009] | $E(r_T) \le \sum_{i=1}^K e^{-\frac{T\Delta_i^2}{2K}}$ | Pure exploration strategy with exponentially decaying simple regret. Preferable for long horizons and many arms. See the sketch after this table. |
| UCB-E | [Bubeck, Munos & Stoltz, 2009] | $E(r_T) \le O\!\left(T^{-\frac{\Delta_{\min}^2}{2}}\right)$, polynomial decay | Exploration-focused UCB variant. Better constants for medium horizons. |
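
A minimal sketch of the pure-exploration protocol behind the uniform-sampling row: allocate the budget evenly, recommend the empirical best arm, and estimate the resulting simple regret by simulation. The Bernoulli arms and their means are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
means = np.array([0.9, 0.8, 0.5])    # illustrative Bernoulli arms
K = len(means)

def simple_regret_uniform(budget, n_runs=2000):
    """Monte-Carlo estimate of E[r_T] for uniform allocation + empirical best arm."""
    per_arm = budget // K
    regrets = []
    for _ in range(n_runs):
        estimates = rng.binomial(per_arm, means) / per_arm   # empirical means
        recommended = int(np.argmax(estimates))
        regrets.append(means.max() - means[recommended])
    return float(np.mean(regrets))

for T in (30, 300, 3000):
    print(T, simple_regret_uniform(T))   # simple regret decays quickly in T
```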

Expected Discounted Reward

| Algorithm | Reference | Bound | Notes |
| --- | --- | --- | --- |
| Gittins Index | [Gittins, 1979] | Optimal | Optimal policy for discounted rewards in the Bayesian setting. Computationally expensive. |
| Dynamic Programming | | Optimal | An optimal solution exists but is computationally intractable for large problems. |
| Approximate Gittins | [Niño-Mora, 2011] | Near-optimal | Computationally efficient approximation to the Gittins index. |

Modern Bandits with Function Approximation

Linear & Contextual Bandits

Setting: Rewards follow $r_t = \langle \theta^*, x_t \rangle + \eta_t$ where $\theta^* \in \mathbb{R}^d$ is unknown and $x_t \in \mathbb{R}^d$ is context.

Minimax Lower Bound: For linear bandits, the worst-case regret satisfies: $$R_T = \Omega(d\sqrt{T}).$$ This dimension-dependent lower bound is fundamental and cannot be improved in the worst case. Optimal algorithms match this up to logarithmic factors.

| Algorithm | Reference | Bound | Notes |
| --- | --- | --- | --- |
| LinUCB | [Li, Chu, Langford & Schapire, 2010] | $E(R_T) \le O(d\sqrt{T\log T})$ | Matches the $d\sqrt{T}$ lower bound up to logarithmic factors. Uses ridge regression plus a UCB bonus (sketch after this table). Widely used in recommendation systems. |
| LinTS | [Agrawal & Goyal, 2013] | $R_T \le \tilde{O}(d^{3/2}\sqrt{T})$ with high probability; $\tilde{O}(d\sqrt{T})$ in expectation | Thompson Sampling for linear bandits. Often better empirical performance than LinUCB. Bayesian approach. |
| OFUL | [Abbasi-Yadkori, Pál & Szepesvári, 2011] | $R_T \le O(d\sqrt{T\log(T/\delta)})$ with prob. $\ge 1-\delta$ | Optimism in the Face of Uncertainty for Linear bandits. Tight analysis based on self-normalized concentration inequalities. |
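
A minimal sketch of the optimism principle shared by LinUCB and OFUL: maintain a ridge-regression estimate of $\theta^*$ and pick the action maximizing the estimated reward plus an ellipsoidal confidence bonus. The simulated environment, noise level, and bonus scale $\beta$ are illustrative, not the papers' exact confidence radii.

```python
import numpy as np

rng = np.random.default_rng(3)
d, T, lam, beta = 5, 2000, 1.0, 1.0     # beta: confidence-bonus scale (illustrative)
theta_star = rng.normal(size=d); theta_star /= np.linalg.norm(theta_star)

A = lam * np.eye(d)          # regularized Gram matrix
b = np.zeros(d)              # sum of reward-weighted features
regret = 0.0
for t in range(T):
    actions = rng.normal(size=(10, d))                  # fresh action set each round
    actions /= np.linalg.norm(actions, axis=1, keepdims=True)
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                               # ridge-regression estimate
    widths = np.sqrt(np.einsum('ij,jk,ik->i', actions, A_inv, actions))
    x = actions[int(np.argmax(actions @ theta_hat + beta * widths))]
    r = x @ theta_star + 0.1 * rng.normal()             # noisy linear reward
    A += np.outer(x, x); b += r * x
    regret += (actions @ theta_star).max() - x @ theta_star

print("cumulative regret:", regret)
```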

Neural & Deep Bandits

| Algorithm | Reference | Approach | Notes |
| --- | --- | --- | --- |
| Neural-UCB | [Zhou, Li & Gu, 2020] | $E(R_T) \le \tilde{O}(\sqrt{T})$ for over-parameterized networks | Provable regret for neural function approximation. Uses neural tangent kernel theory. Bridges the theory-practice gap. |
| Neural-TS | [Riquelme, Tucker & Snoek, 2018] | Empirical: outperforms LinUCB on real datasets | Thompson Sampling with deep neural networks. Practical Bayesian deep learning with excellent empirical results. |
| Bootstrapped DQN | [Osband, Blundell, Pritzel & Van Roy, 2016] | Ensemble uncertainty for exploration in deep RL | Efficient deep exploration via bootstrapping and randomized value functions. Used in Atari and complex MDPs. |

Best-of-Both-Worlds (BOBW) Algorithms

Major breakthrough: Algorithms that achieve near-optimal regret in BOTH stochastic AND adversarial settings simultaneously, without knowing which environment they're in.

The Challenge: An algorithm must satisfy simultaneously: $$\text{Stochastic: } E(R_T) = O\left(\sum_{i:\mu_i < \mu^*} \frac{\log T}{\Delta_i}\right) \quad \text{and} \quad \text{Adversarial: } E(R_T) = O(\sqrt{TK\log T}).$$ Recent breakthroughs (2022-2024) achieve both bounds without prior knowledge of the environment type.

| Algorithm | Reference | Guarantee | Innovation |
| --- | --- | --- | --- |
| BOBW-Combinatorial | [Ito, 2023] | Adversarial: $\tilde{O}(\sqrt{mKT})$; stochastic: $O(m\log T / \Delta_{\min})$ | First practical BOBW algorithm for combinatorial bandits with $m$ base arms. Works in both worlds without tuning. |
| Adversarial Linear | [Zimmert & Lattimore, 2022] | $\tilde{O}(\sqrt{dT})$ with high probability, near-minimax optimal | Resolves an open problem on high-probability bounds; nearly tight for adversarial linear bandits. |

Batched & Delayed Feedback

Practical constraint: Real systems can't update policies after every interaction. Batched bandits reduce communication and enable parallelization.

The Batched Regret Tradeoff: With $M$ adaptive batches over horizon $T$, a representative regret bound has the form $$R_T = O\left(\sqrt{\frac{KT\log K}{M}} + M\Delta_{\max}\right),$$ where $\Delta_{\max}$ is the maximum gap. This makes the tradeoff explicit: with few batches the first term dominates because the policy is rarely updated, while every additional batch contributes an overhead term proportional to $\Delta_{\max}$.
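
As an illustration of the batched constraint (a sketch of batched successive elimination, not the algorithm of Gao et al.), the policy below is only updated between batches: within a batch it cycles uniformly over the surviving arms, and between batches it eliminates arms whose confidence interval falls below the empirical leader's. The arms and the confidence radius are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
means = np.array([0.9, 0.8, 0.5])       # illustrative Bernoulli arms
K, T, M = len(means), 9000, 4           # M batches; the policy is fixed within a batch

active = list(range(K))
counts, sums = np.zeros(K), np.zeros(K)
batch_size = T // M
regret = 0.0
for _ in range(M):
    # Within a batch the policy is fixed: cycle uniformly over active arms.
    for t in range(batch_size):
        arm = active[t % len(active)]
        r = rng.random() < means[arm]
        counts[arm] += 1; sums[arm] += r
        regret += means.max() - means[arm]
    # Between batches: eliminate arms clearly dominated by the empirical leader.
    idx = np.array(active)
    mu_hat = sums[idx] / counts[idx]
    radius = np.sqrt(2 * np.log(T) / counts[idx])
    keep = mu_hat + radius >= (mu_hat - radius).max()
    active = [a for a, k in zip(active, keep) if k]

print("active arms at the end:", active, " cumulative regret:", regret)
```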

| Setting | Reference | Result | Impact |
| --- | --- | --- | --- |
| M-batched | [Gao et al, 2019] | $O(\sqrt{KT/M} + M)$ regret with $M$ batches | Fundamental tradeoff: fewer batches mean more regret. Critical for distributed systems. |
| Delayed Feedback | [Dann & Lattimore, 2021] | Handles variable delays in the feedback | Essential for online ads and recommendation, where feedback arrives hours or days later. |

LLM Alignment & RLHF

| Method | Reference | Key Contribution | Impact |
| --- | --- | --- | --- |
| InstructGPT | [Ouyang et al, 2022] | RLHF with human preferences for LLM alignment | Foundation for ChatGPT. Revolutionized how LLMs are trained to follow instructions. |
| Constitutional AI | [Bai et al, 2022] | AI feedback instead of human feedback for alignment | Scalable alignment method used in Claude and other models. Reduces human labeling. |
| DPO | [Rafailov et al, 2023] | Direct Preference Optimization without RL | Simpler alternative to RLHF, widely adopted for fine-tuning open models (Llama, Mistral). See the sketch after this table. |
| Test-Time Compute | [Snell et al, 2024] | Scaling inference-time exploration | Breakthrough: more test-time search can beat larger models. Key for o1-style reasoning. |
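
A minimal NumPy sketch of the DPO objective on pre-computed sequence log-probabilities: the loss is $-\log\sigma\big(\beta[(\log\pi_\theta(y_w|x) - \log\pi_{\text{ref}}(y_w|x)) - (\log\pi_\theta(y_l|x) - \log\pi_{\text{ref}}(y_l|x))]\big)$. In practice the policy log-probs come from the model being fine-tuned and the loss is minimized with autodiff; the arrays here are purely illustrative.

```python
import numpy as np

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for a batch of preference pairs."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen        # log pi_theta/pi_ref on preferred y_w
    rejected_ratio = policy_logp_rejected - ref_logp_rejected  # same on dispreferred y_l
    logits = beta * (chosen_ratio - rejected_ratio)
    return float(np.mean(np.logaddexp(0.0, -logits)))          # equals -log sigmoid(logits)

# Illustrative batch of sequence log-probabilities.
loss = dpo_loss(np.array([-12.0, -15.0]), np.array([-14.0, -15.5]),
                np.array([-13.0, -15.2]), np.array([-13.5, -15.0]))
print(loss)
```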

Implicit Exploration

Key insight: Exploration can emerge implicitly from model uncertainty without explicit bonuses. Connects to deep RL and practical algorithms.

| Method | Reference | Mechanism | Advantage |
| --- | --- | --- | --- |
| RLSVI | [Zanette, Brandfonbrener, Brunskill, Pirotta & Lazaric, 2020] | Frequentist regret via randomized value iteration | Randomized Least-Squares Value Iteration: practical model-free RL with provable guarantees, exploring through randomization rather than explicit bonuses. Bridges theory and practice. See the sketch after this table. |
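
To illustrate the randomized-value-function idea in the simplest possible setting (a linear bandit rather than a finite-horizon MDP, so this is an analogue of RLSVI, not the algorithm itself), the sketch below perturbs the least-squares estimate with Gaussian noise and acts greedily with respect to the perturbed parameter: exploration comes from the randomization, not from an explicit bonus. The perturbation scale and environment are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
d, T, lam, sigma = 5, 2000, 1.0, 0.3    # sigma: perturbation scale (illustrative)
theta_star = rng.normal(size=d); theta_star /= np.linalg.norm(theta_star)

A, b, regret = lam * np.eye(d), np.zeros(d), 0.0
for t in range(T):
    actions = rng.normal(size=(10, d))
    actions /= np.linalg.norm(actions, axis=1, keepdims=True)
    A_inv = np.linalg.inv(A)
    # Sample a perturbed parameter around the ridge estimate; no explicit UCB term.
    theta_tilde = rng.multivariate_normal(A_inv @ b, sigma**2 * A_inv)
    x = actions[int(np.argmax(actions @ theta_tilde))]
    r = x @ theta_star + 0.1 * rng.normal()
    A += np.outer(x, x); b += r * x
    regret += (actions @ theta_star).max() - x @ theta_star

print("cumulative regret:", regret)
```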

Pure Exploration & Best Arm Identification

Goal: Find the best arm with minimal samples, not minimize cumulative regret. Critical for clinical trials, A/B testing.

Sample Complexity Lower Bound: To identify the best arm with probability $\geq 1-\delta$, any algorithm requires at least: $$n \geq \Omega\left(\sum_{i=2}^K \frac{1}{\Delta_i^2} \log\frac{K}{\delta}\right)$$ samples, where arms are ordered by mean $\mu_1 > \mu_2 > \cdots > \mu_K$. This instance-dependent bound is tight and achieved by algorithms like lil'UCB and Track-and-Stop.
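
A minimal sketch of a fixed-confidence strategy in the spirit of this bound (successive elimination with Hoeffding-style confidence radii, rather than lil'UCB or Track-and-Stop themselves); the Bernoulli arms and the exact radius are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
means = np.array([0.9, 0.8, 0.5])     # illustrative Bernoulli arms
K, delta = len(means), 0.05

active = list(range(K))
counts, sums = np.zeros(K), np.zeros(K)
t = 0
while len(active) > 1:
    for arm in active:                          # sample every surviving arm once
        r = rng.random() < means[arm]
        counts[arm] += 1; sums[arm] += r
        t += 1
    idx = np.array(active)
    mu_hat = sums[idx] / counts[idx]
    # Hoeffding-style radius with a union bound over arms and rounds.
    radius = np.sqrt(np.log(4 * K * counts[idx]**2 / delta) / (2 * counts[idx]))
    keep = mu_hat + radius >= (mu_hat - radius).max()
    active = [a for a, k in zip(active, keep) if k]

print("identified arm:", active[0], "after", t, "samples")
```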

| Setting | Reference | Bound | Application |
| --- | --- | --- | --- |
| Fixed Budget | [Audibert & Bubeck, 2010] | Misidentification probability $\le O\!\left(e^{-T/(H\log K)}\right)$ with $H=\sum_{i\ge 2}\Delta_i^{-2}$ | Clinical trials with a fixed number of patients: find the best treatment as reliably as possible. |
| Fixed Confidence | [Jamieson et al, 2014] | lil'UCB achieves optimal instance-dependent bounds | Stop as soon as you are confident about the best arm. Saves resources in practice. |
| Dueling Bandits | [Yue, Broder, Kleinberg & Joachims, 2012] | Efficient algorithms for pairwise comparisons | When you can only compare two options (e.g., which ad is better?). Widely used. |

Offline Bandits & Off-Policy Evaluation

| Method | Reference | Approach | Use Case |
| --- | --- | --- | --- |
| Doubly Robust | [Dudik et al, 2014] | Combines the direct method and importance sampling | Industry standard for off-policy evaluation. Used at Microsoft, Google, Meta. See the sketch after this table. |
| SWITCH | [Wang et al, 2017] | Adaptive estimator selection for OPE | Reduces variance in policy evaluation. Critical for A/B testing at scale. |
| Balanced Policy Eval | [Athey & Wager, 2021] | Balancing weights for causal inference | Connects bandits to causal inference. Used in tech companies for decision making. |
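
A minimal sketch of the doubly robust off-policy value estimator on a logged contextual-bandit dataset: it combines a reward model (direct method) with an importance-weighted correction, and stays consistent if either the reward model or the logged propensities are accurate. The data, logging propensities, reward model, and target policy below are all illustrative.

```python
import numpy as np

def doubly_robust_value(contexts, actions, rewards, propensities, q_hat, target_policy):
    """DR estimate of the value of target_policy from logged bandit data.

    q_hat(x, a): estimated mean reward; target_policy(x): action the new policy takes.
    """
    values = []
    for x, a, r, p in zip(contexts, actions, rewards, propensities):
        pi_a = target_policy(x)
        direct = q_hat(x, pi_a)                           # direct-method term
        correction = (a == pi_a) / p * (r - q_hat(x, a))  # importance-weighted residual
        values.append(direct + correction)
    return float(np.mean(values))

# Illustrative logged data: 1-d contexts, 2 actions, uniform logging policy.
rng = np.random.default_rng(7)
X = rng.normal(size=1000)
A = rng.integers(0, 2, size=1000)
R = (A == (X > 0)).astype(float) + 0.1 * rng.normal(size=1000)
P = np.full(1000, 0.5)

q_hat = lambda x, a: 0.8 if a == (x > 0) else 0.2         # crude reward model
policy = lambda x: int(x > 0)                             # target policy to evaluate
print(doubly_robust_value(X, A, R, P, q_hat, policy))
```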