Multi-Armed Bandit Algorithms
We provide the theoretical lower bounds for these settings and the regret upper bounds of the listed algorithms as functions of the following key quantities: $K$ - number of arms, $T$ - number of rounds (time steps), $d$ - dimension (where appropriate), and occasionally additional parameters pertinent to the assumptions on the bandits' structure. We write $R_T$, $r_T$, and $g_T$ for the cumulative regret, simple regret, and gain at time $T$, respectively.
Finite-Armed Bandits
Cumulative Regret $E(R_T)$
Stochastic
Lai-Robbins Lower Bound [1985]: Under mild assumptions on the reward distributions, the asymptotic lower bound on expected cumulative regret is: $$\liminf_{T\rightarrow\infty}\frac{E(R_T)}{\log T} \geq \sum_{i:\mu_i < \mu^*} \frac{\Delta_i}{\text{KL}(\mu_i || \mu^*)},$$ where $\Delta_i = \mu^* - \mu_i$ is the gap and $\text{KL}(\mu_i || \mu^*)$ is the KL divergence between the reward distribution of arm $i$ and that of the optimal arm. This is the fundamental information-theoretic limit for any consistent algorithm.
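To make the formula concrete, here is a minimal sketch (assuming Bernoulli rewards; the arm means and function names are illustrative, not taken from any cited paper) that evaluates the constant multiplying $\log T$ in this lower bound:

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence KL(Ber(p) || Ber(q)), clipped away from 0 and 1 for stability."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def lai_robbins_constant(means):
    """Constant c such that E(R_T) >= (c - o(1)) * log(T) for any consistent algorithm."""
    mu_star = max(means)
    return sum((mu_star - mu) / bernoulli_kl(mu, mu_star)
               for mu in means if mu < mu_star)

# Example: three Bernoulli arms with means 0.5, 0.45, 0.3.
print(lai_robbins_constant([0.5, 0.45, 0.3]))  # approximately 12.4
```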
| Algorithm | Reference | Bound | Notes |
| --- | --- | --- | --- |
| ε-greedy | [Auer et al, 2002] | $E(R_T) = \Omega(T)$ for constant $\epsilon$, $O(\log T)$ for decaying $\epsilon_t$ | Simple but suboptimal. Constant $\epsilon$ gives linear regret. Outperformed by UCB strategies. |
| UCB1 | [Auer, Cesa-Bianchi & Fischer, 2002] | $E(R_T) \le \sum_{i:\mu_i < \mu^*} \frac{8\log T}{\Delta_i} + \left(1 + \frac{\pi^2}{3}\right)\sum_{i=1}^K \Delta_i$ | Classic algorithm. Simple and practical. Over-explores by a constant factor relative to the Lai-Robbins lower bound (sketched after this table). |
| UCB-V | [Audibert, Munos & Szepesvári, 2009] | $E(R_T) \le \sum_{i:\mu_i < \mu^*} \left(\frac{2\sigma_i^2\log T}{\Delta_i} + 3\Delta_i\right)$ | Exploits variance information. Better constants than UCB1 when variances differ significantly. |
| KL-UCB | [Garivier & Cappé, 2011] | $\limsup_{T\to\infty} \frac{E(R_T)}{\log T} \le \sum_{i:\mu_i < \mu^*} \frac{\Delta_i}{\text{KL}(\mu_i||\mu^*)}$ | Asymptotically optimal. Matches the Lai-Robbins lower bound exactly (up to $o(\log T)$ terms). |
| Thompson Sampling | [Agrawal & Goyal, 2012] | $E(R_T) \le (1 + \epsilon) \sum_{i:\mu_i < \mu^*} \frac{\Delta_i}{\text{KL}(\mu_i||\mu^*)} \log T + O(K/\epsilon^2)$ for Bernoulli rewards | Bayesian approach. Asymptotically optimal like KL-UCB. Excellent empirical performance. Industry standard. |
| FH-Gittins | [Gittins, 1979] | $E(R_T) \le \sum_{i:\mu_i < \mu^*} \frac{32\log T}{\Delta_i}+o(\log T)$ | Finite-horizon Gittins index. Designed for Bayesian setting. Requires knowledge of horizon $T$, but performs well compared to anytime strategies. |
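The following is a compact sketch of UCB1 and Bernoulli Thompson Sampling as summarized above; the arm means, horizon, and Beta(1, 1) priors are illustrative choices, not prescriptions from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def ucb1(means, T):
    """UCB1: pull the arm maximizing empirical mean + sqrt(2 log t / n_i)."""
    K = len(means)
    n = np.zeros(K)          # pull counts
    s = np.zeros(K)          # sums of observed rewards
    regret = 0.0
    for t in range(1, T + 1):
        if t <= K:                       # pull each arm once to initialize
            i = t - 1
        else:
            i = int(np.argmax(s / n + np.sqrt(2 * np.log(t) / n)))
        s[i] += rng.binomial(1, means[i])   # Bernoulli reward
        n[i] += 1
        regret += max(means) - means[i]
    return regret

def thompson_bernoulli(means, T):
    """Thompson Sampling with Beta(1, 1) priors on Bernoulli arms."""
    K = len(means)
    alpha, beta = np.ones(K), np.ones(K)      # Beta posterior parameters
    regret = 0.0
    for _ in range(T):
        i = int(np.argmax(rng.beta(alpha, beta)))  # sample each posterior, play the argmax
        r = rng.binomial(1, means[i])
        alpha[i] += r
        beta[i] += 1 - r
        regret += max(means) - means[i]
    return regret

means = [0.5, 0.45, 0.3]
print(ucb1(means, 10_000), thompson_bernoulli(means, 10_000))
```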
Adversarial
Minimax Lower Bound: For adversarial rewards, the minimax optimal regret is: $$\inf_{\text{algorithm}} \sup_{\text{reward sequence}} E(R_T) = \Theta(\sqrt{TK}).$$ The $\sqrt{K}$ factor is the price of bandit feedback (with full information the minimax rate drops to $\Theta(\sqrt{T\log K})$), and the $\sqrt{T}$ growth is unavoidable without stochasticity assumptions.
| Algorithm | Reference | Bound | Notes |
| --- | --- | --- | --- |
| EXP3 | [Auer, Cesa-Bianchi, Freund & Schapire, 2002] | $E(R_T) \le 2\sqrt{(e-1)TK\log K}$ | Handles adversarial rewards. No stationarity assumptions needed. Requires knowing the horizon $T$ to tune the exploration rate. Uses exponential weights (sketched after this table). |
| EXP3.1 | [Auer, Cesa-Bianchi, Freund & Schapire, 2002] | $E(R_T) \le O(\sqrt{TK\log K})$ | Anytime version that does not require knowing $T$ in advance. Restarts EXP3 with doubling guesses of the maximal gain. |
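Below is a minimal sketch of EXP3 for rewards in $[0,1]$, with the exploration rate tuned for a known horizon as in the table above; `reward_fn` stands in for an arbitrary (possibly adversarial) reward sequence and is our own illustrative construct.

```python
import numpy as np

def exp3(reward_fn, K, T, rng=np.random.default_rng(0)):
    """EXP3 for rewards in [0, 1]; reward_fn(t, i) may be chosen adversarially."""
    gamma = min(1.0, np.sqrt(K * np.log(K) / ((np.e - 1) * T)))  # tuned for known horizon T
    w = np.ones(K)                                               # exponential weights
    total = 0.0
    for t in range(T):
        p = (1 - gamma) * w / w.sum() + gamma / K    # mix weights with uniform exploration
        i = rng.choice(K, p=p)
        x = reward_fn(t, i)
        total += x
        w[i] *= np.exp(gamma * (x / p[i]) / K)       # importance-weighted update for the played arm
    return total

# Example: arm 0 is best in the first half of the game, arm 1 in the second half.
rewards = lambda t, i: float((i == 0) if t < 5000 else (i == 1))
print(exp3(rewards, K=2, T=10_000))
```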
High-Probability Bounds
Goal: Bound the regret with high probability $1-\delta$, not just in expectation. Critical for risk-sensitive applications.
| Algorithm | Reference | Bound | Notes |
| --- | --- | --- | --- |
| UCB1 | [Auer, Cesa-Bianchi & Fischer, 2002] | $R_T \le \sum_{i:\mu_i < \mu^*} \frac{8\log T}{\Delta_i} + O(K\log(1/\delta))$ with prob. $\ge 1-\delta$ | High-probability version with extra $\log(1/\delta)$ term. Concentration via Hoeffding's inequality. |
| MOSS | [Audibert & Bubeck, 2009] | $E(R_T) \le O(\sqrt{KT})$ minimax optimal, no gap dependence | Minimax Optimal Strategy in the Stochastic case. Optimal worst-case bound without knowing the gaps. |
Simple Regret $E(r_T)$
Definition: Simple regret $r_T = \mu^* - \mu_{\hat{i}_T}$, where $\hat{i}_T$ is the arm recommended after $T$ rounds, measures the quality of the final recommendation rather than the cost of exploration along the way.
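To make the distinction from cumulative regret concrete, the sketch below runs a uniform (round-robin) allocation and reports the simple regret of the final recommendation; the Bernoulli arm means and horizon are illustrative.

```python
import numpy as np

def simple_regret_uniform(means, T, rng=np.random.default_rng(0)):
    """Pull arms round-robin for T rounds, then recommend the best empirical mean."""
    K = len(means)
    n = np.zeros(K)
    s = np.zeros(K)
    for t in range(T):
        i = t % K                                  # uniform (round-robin) allocation
        s[i] += rng.binomial(1, means[i])
        n[i] += 1
    recommended = int(np.argmax(s / np.maximum(n, 1)))
    return max(means) - means[recommended]         # simple regret of the recommendation

print(simple_regret_uniform([0.5, 0.45, 0.3], T=3000))
```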
| Algorithm | Reference | Bound | Notes |
| --- | --- | --- | --- |
| Uniform Sampling | [Bubeck, Munos & Stoltz, 2009] | $E(r_T) \le \sum_{i=1}^K e^{-\frac{T\Delta_i^2}{2K}}$ | Pure exploration strategy. Exponential decay. Optimal for long horizons and many arms. |
| UCB-E | [Bubeck, Munos & Stoltz, 2009] | $E(r_T) \le O(T^{-\frac{\Delta_{\min}^2}{2}})$ polynomial decay | Exploration-focused UCB variant. Cumulative-regret-style allocation yields only polynomial decay, so it can be preferable at short-to-medium horizons but loses to uniform sampling asymptotically. |
Expected Discounted Reward
| Algorithm | Reference | Bound | Notes |
| --- | --- | --- | --- |
| Gittins Index | [Gittins, 1979] | Optimal | Optimal for discounted rewards. Bayesian approach. Computationally expensive. |
| Dynamic Programming | - | Optimal | Optimal solution exists but computationally intractable for large problems. |
| Approximate Gittins | [Niño-Mora, 2011] | Near-optimal | Computationally efficient approximation to Gittins index. |
Modern Bandits with Function Approximation
Linear & Contextual Bandits
Setting: Rewards follow $r_t = \langle \theta^*, x_t \rangle + \eta_t$, where $\theta^* \in \mathbb{R}^d$ is unknown, $x_t \in \mathbb{R}^d$ is the feature vector of the chosen action (which may depend on the observed context), and $\eta_t$ is zero-mean noise.
Minimax Lower Bound: For linear bandits, the worst-case regret satisfies: $$R_T = \Omega(d\sqrt{T}).$$ This dimension-dependent lower bound is fundamental and cannot be improved in the worst case. Optimal algorithms match this up to logarithmic factors.
| Algorithm | Reference | Bound | Notes |
| --- | --- | --- | --- |
| LinUCB | [Li, Chu, Langford & Schapire, 2010] | $E(R_T) \le O(d\sqrt{T\log T})$ | Near-minimax-optimal for linear contextual bandits (up to logarithmic factors). Uses ridge regression + a UCB bonus (sketched after this table). Widely used in recommendation systems. |
| LinTS | [Agrawal & Goyal, 2013] | $\tilde{O}(d^{3/2}\sqrt{T})$ with high prob. (frequentist); $\tilde{O}(d\sqrt{T})$ Bayesian regret | Thompson Sampling for linear bandits. Often better empirical performance than LinUCB. Bayesian approach. |
| OFUL | [Abbasi-Yadkori, Pál & Szepesvári, 2011] | $R_T \le \tilde{O}(d\sqrt{T})$ with prob. $\ge 1-\delta$ (logarithmic dependence on $1/\delta$) | Optimism in the Face of Uncertainty (Linear). Tight analysis via a self-normalized concentration inequality for vector-valued martingales. |
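A minimal sketch of the ridge-regression-plus-bonus idea behind LinUCB/OFUL follows; the exploration parameter `alpha` is hand-set rather than the theory-prescribed confidence radius, and the context generator and noise level are illustrative.

```python
import numpy as np

def linucb(contexts_fn, theta_star, T, alpha=1.0, lam=1.0, noise=0.1,
           rng=np.random.default_rng(0)):
    """Optimistic linear bandit: ridge estimate + alpha * ||x||_{V^{-1}} bonus per action."""
    d = theta_star.shape[0]
    V = lam * np.eye(d)          # regularized design matrix
    b = np.zeros(d)              # sum of x_t * r_t
    regret = 0.0
    for t in range(T):
        X = contexts_fn(t)                       # (num_actions, d) feature matrix for this round
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b                    # ridge regression estimate of theta*
        bonus = np.sqrt(np.einsum('ad,dc,ac->a', X, V_inv, X))
        a = int(np.argmax(X @ theta_hat + alpha * bonus))
        x = X[a]
        r = x @ theta_star + noise * rng.standard_normal()
        V += np.outer(x, x)
        b += r * x
        regret += np.max(X @ theta_star) - x @ theta_star
    return regret

# Example: 20 random actions per round in d = 5 dimensions.
d = 5
theta = np.ones(d) / np.sqrt(d)
ctx = lambda t: np.random.default_rng(t).standard_normal((20, d)) / np.sqrt(d)
print(linucb(ctx, theta, T=2000))
```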
Neural & Deep Bandits
| Algorithm | Reference | Result / Approach | Notes |
| --- | --- | --- | --- |
| Neural-UCB | [Zhou, Li & Gu, 2020] | $E(R_T) \le \tilde{O}(\tilde{d}\sqrt{T})$ for over-parameterized networks, where $\tilde{d}$ is the effective dimension of the neural tangent kernel | Provable regret for neural function approximation. Uses neural tangent kernel theory. Bridges the theory-practice gap. |
| Neural-TS | [Riquelme, Tucker & Snoek, 2018] | Empirical: outperforms LinUCB on real datasets | Thompson Sampling with deep neural networks. Practical Bayesian deep learning. Excellent empirical results. |
| Bootstrapped DQN | [Osband, Blundell, Pritzel & Van Roy, 2016] | Ensemble uncertainty for exploration in deep RL | Efficient deep exploration via bootstrapping. Randomized value functions. Used in Atari and complex MDPs. |
Best-of-Both-Worlds (BOBW) Algorithms
Major breakthrough: Algorithms that achieve near-optimal regret in BOTH stochastic AND adversarial settings simultaneously, without knowing which environment they're in.
The Challenge: An algorithm must satisfy simultaneously: $$\text{Stochastic: } E(R_T) = O\left(\sum_{i:\mu_i < \mu^*} \frac{\log T}{\Delta_i}\right) \quad \text{and} \quad \text{Adversarial: } E(R_T) = \tilde{O}(\sqrt{TK}).$$ Recent work (2022-2024) extends such guarantees to combinatorial and linear settings, still without prior knowledge of the environment type.
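For the $K$-armed case, the canonical best-of-both-worlds algorithm is Tsallis-INF (Zimmert & Seldin, 2019); the rough sketch below uses a simplified learning-rate schedule and the plain importance-weighted loss estimator, so constants differ from the paper's analysis.

```python
import numpy as np

def tsallis_inf(loss_fn, K, T, rng=np.random.default_rng(0)):
    """Tsallis-INF sketch: FTRL with 1/2-Tsallis entropy regularizer; losses in [0, 1]."""
    L = np.zeros(K)                                # cumulative importance-weighted loss estimates
    total_loss = 0.0
    for t in range(1, T + 1):
        eta = 2.0 / np.sqrt(t)                     # learning-rate schedule (constants vary by analysis)
        # Solve for the normalizer x < min(L) with sum_i 4 / (eta * (L_i - x))^2 = 1 by bisection.
        lo, hi = L.min() - 2.0 * np.sqrt(K) / eta, L.min() - 2.0 / eta
        for _ in range(60):
            x = 0.5 * (lo + hi)
            p = 4.0 / (eta * (L - x)) ** 2
            lo, hi = (x, hi) if p.sum() < 1.0 else (lo, x)
        p = p / p.sum()                            # clean up residual numerical error
        i = rng.choice(K, p=p)
        loss = loss_fn(t, i)
        total_loss += loss
        L[i] += loss / p[i]                        # unbiased importance-weighted loss estimate
    return total_loss

# Stochastic example: arm 0 has the lowest expected loss.
losses = lambda t, i, rng=np.random.default_rng(1): float(rng.random() < [0.4, 0.5, 0.6][i])
print(tsallis_inf(losses, K=3, T=10_000))
```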
| Algorithm | Reference | Guarantee | Innovation |
| --- | --- | --- | --- |
| BOBW-Combinatorial | [Ito, 2023] | Adversarial: $\tilde{O}(\sqrt{mKT})$, Stochastic: $O(m\log T / \Delta_{\min})$ | First practical BOBW for combinatorial bandits with $m$ base arms. Works in both worlds without tuning. |
| Adversarial Linear | [Zimmert & Lattimore, 2022] | $\tilde{O}(\sqrt{dT})$ with high probability, near-minimax optimal | Resolves open problem on high-probability bounds. Nearly tight for adversarial linear bandits. |
Batched & Delayed Feedback
Practical constraint: Real systems can't update policies after every interaction. Batched bandits reduce communication and enable parallelization.
The Batched Regret Tradeoff: With only $M$ adaptively chosen batches over horizon $T$, the minimax regret degrades by a factor of roughly $T^{\frac{1}{2(2^M-1)}}$ relative to the fully sequential rate: $$R_T = \tilde{\Theta}\left(\sqrt{KT}\cdot T^{\frac{1}{2(2^M-1)}}\right).$$ Consequently $M = \Theta(\log\log T)$ batches suffice to recover the fully sequential $\tilde{O}(\sqrt{KT})$ rate, while the instance-dependent $O\big(\sum_{i:\mu_i<\mu^*} \log T/\Delta_i\big)$ rate requires $\Theta(\log T/\log\log T)$ batches.
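A minimal sketch of batched successive elimination is given below; the equal batch sizes and Hoeffding-style confidence radius are simplifications (the rates above require carefully chosen, geometrically growing batch schedules).

```python
import numpy as np

def batched_elimination(means, T, M, rng=np.random.default_rng(0)):
    """Run M equal-size batches; after each batch, eliminate arms whose UCB < best LCB."""
    K = len(means)
    active = list(range(K))
    n = np.zeros(K)
    s = np.zeros(K)
    regret, per_batch = 0.0, T // M
    for _ in range(M):
        pulls = per_batch // len(active)               # spread the batch over surviving arms
        for i in active:
            s[i] += rng.binomial(pulls, means[i])
            n[i] += pulls
            regret += pulls * (max(means) - means[i])
        mu_hat = s[active] / n[active]
        rad = np.sqrt(np.log(2 * K * T) / (2 * n[active]))   # Hoeffding-style confidence radius
        keep = mu_hat + rad >= np.max(mu_hat - rad)          # keep arms not provably suboptimal
        active = [a for a, k in zip(active, keep) if k]
    return regret

print(batched_elimination([0.5, 0.45, 0.3], T=30_000, M=4))
```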
| Setting | Reference | Result | Impact |
| --- | --- | --- | --- |
| M-batched | [Gao et al, 2019] | $\Theta(\log\log T)$ batches suffice for the minimax rate; $\Theta(\log T/\log\log T)$ for the instance-dependent rate | Fundamental tradeoff: fewer batches = more regret. Critical for distributed systems. |
| Delayed Feedback | [Dann & Lattimore, 2021] | Handles variable delay in feedback | Essential for online ads, recommendation where feedback is delayed hours/days. |
LLM Alignment & RLHF
| Method | Reference | Key Contribution | Impact |
| --- | --- | --- | --- |
| InstructGPT | [Ouyang et al, 2022] | RLHF with human preferences for LLM alignment | Foundation for ChatGPT. Revolutionized how LLMs are trained to follow instructions. |
| Constitutional AI | [Bai et al, 2022] | AI feedback instead of human feedback for alignment | Scalable alignment method. Used in Claude and other models. Reduces human labeling. |
| DPO | [Rafailov et al, 2023] | Direct Preference Optimization without RL (loss sketched after this table) | Simpler alternative to RLHF. Widely adopted for fine-tuning open models (Llama, Mistral). |
| Test-Time Compute | [Snell et al, 2024] | Scaling inference-time exploration | Breakthrough: more test-time search can beat larger models. Key for o1-style reasoning. |
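Since DPO reduces preference learning to a classification-style loss on preference pairs, a minimal numpy sketch of that loss may be useful; the inputs are summed token log-probabilities under the trained and reference policies, and all numbers below are illustrative.

```python
import numpy as np

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss: -log sigmoid(beta * (implicit reward of chosen - implicit reward of rejected))."""
    chosen = policy_logp_chosen - ref_logp_chosen        # log(pi_theta / pi_ref) for preferred response
    rejected = policy_logp_rejected - ref_logp_rejected  # log(pi_theta / pi_ref) for dispreferred response
    logits = beta * (chosen - rejected)
    return np.mean(np.logaddexp(0.0, -logits))           # numerically stable -log sigmoid, batch-averaged

# Toy batch of two preference pairs; values are summed token log-probabilities (illustrative only).
pol_w, pol_l = np.array([-12.0, -9.5]), np.array([-14.0, -9.0])
ref_w, ref_l = np.array([-13.0, -10.0]), np.array([-13.5, -9.2])
print(dpo_loss(pol_w, pol_l, ref_w, ref_l))
```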
Implicit Exploration
Key insight: Exploration can emerge implicitly from model uncertainty without explicit bonuses. Connects to deep RL and practical algorithms.
| Method | Reference | Mechanism | Advantage |
| --- | --- | --- | --- |
| RLSVI | [Zanette, Brandfonbrener, Brunskill, Pirotta & Lazaric, 2020] | Gaussian perturbations of least-squares value-iteration targets | Randomized Least-Squares Value Iteration with frequentist regret guarantees. Practical model-free exploration with provable guarantees. Bridges theory and practice. |
Pure Exploration & Best Arm Identification
Goal: Find the best arm with minimal samples, not minimize cumulative regret. Critical for clinical trials, A/B testing.
Sample Complexity Lower Bound: To identify the best arm with probability $\geq 1-\delta$, any algorithm requires at least: $$n \geq \Omega\left(\sum_{i=2}^K \frac{1}{\Delta_i^2} \log\frac{K}{\delta}\right)$$ samples, where arms are ordered by mean $\mu_1 > \mu_2 > \cdots > \mu_K$. This instance-dependent bound is tight and achieved by algorithms like lil'UCB and Track-and-Stop.
| Setting | Reference | Bound | Application |
| --- | --- | --- | --- |
| Fixed Budget | [Audibert & Bubeck, 2010] | Successive Rejects: error probability $\le \exp\left(-\Omega\left(\frac{T}{H\log K}\right)\right)$, where $H=\sum_{i:\mu_i<\mu^*}\Delta_i^{-2}$ | Clinical trials with a fixed number of patients. Find the best treatment within budget $T$ (sketched after this table). |
| Fixed Confidence | [Jamieson et al, 2014] | lil'UCB achieves optimal instance-dependent bounds | Stop as soon as confident about best arm. Saves resources in practice. |
| Dueling Bandits | [Yue, Broder, Kleinberg & Joachims, 2012] | Efficient algorithms for pairwise comparisons | When you can only compare two options (e.g., which ad is better?). Widely used. |
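The fixed-budget entry refers to Successive Rejects; below is a compact sketch under Bernoulli rewards (the arm means and budget are illustrative).

```python
import numpy as np

def successive_rejects(means, budget, rng=np.random.default_rng(0)):
    """Successive Rejects (fixed-budget BAI): drop the empirically worst arm each phase."""
    K = len(means)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    active = list(range(K))
    n = np.zeros(K)
    s = np.zeros(K)
    n_prev = 0
    for k in range(1, K):
        n_k = int(np.ceil((budget - K) / (log_bar * (K + 1 - k))))
        pulls = max(n_k - n_prev, 1)               # additional pulls per surviving arm this phase
        for i in active:
            s[i] += rng.binomial(pulls, means[i])
            n[i] += pulls
        worst = active[int(np.argmin(s[active] / n[active]))]
        active.remove(worst)                       # reject the empirically worst surviving arm
        n_prev = n_k
    return active[0]                               # recommended best arm

print(successive_rejects([0.5, 0.45, 0.3], budget=3000))
```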
Offline Bandits & Off-Policy Evaluation
| Method | Reference | Approach | Use Case |
| --- | --- | --- | --- |
| Doubly Robust | [Dudik et al, 2014] | Combines the direct method and importance sampling (estimator sketched after this table) | Industry standard for off-policy evaluation. Used at Microsoft, Google, Meta. |
| SWITCH | [Wang et al, 2017] | Adaptive estimator selection for OPE | Reduces variance in policy evaluation. Critical for A/B testing at scale. |
| Balanced Policy Eval | [Athey & Wager, 2021] | Balancing weights for causal inference | Connects bandits to causal inference. Used in tech companies for decision making. |
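The doubly robust estimator adds an importance-weighted correction to a learned reward model; the sketch below assumes known logging propensities and a toy reward model, both of which are illustrative.

```python
import numpy as np

def doubly_robust_value(rewards, actions, propensities, target_actions, reward_model):
    """DR estimate of a target policy's value from logged (context, action, reward) data.

    reward_model[j, a] estimates E[r | context_j, action a] (the direct-method part);
    the importance-weighted term corrects its bias on the logged actions.
    """
    idx = np.arange(len(rewards))
    direct = reward_model[idx, target_actions]
    match = (actions == target_actions).astype(float)
    correction = match / propensities * (rewards - reward_model[idx, actions])
    return np.mean(direct + correction)

# Toy logged data: 5 rounds, 3 actions, uniform logging policy (propensity 1/3).
rewards = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
actions = np.array([0, 1, 2, 0, 1])
propensities = np.full(5, 1 / 3)
target_actions = np.array([0, 0, 2, 1, 1])                  # actions the evaluated policy would take
reward_model = np.tile(np.array([0.6, 0.4, 0.5]), (5, 1))   # crude model of E[r | x, a]
print(doubly_robust_value(rewards, actions, propensities, target_actions, reward_model))
```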