Bandit Papers
Fundamental papers in multi-armed bandit research, the main venues for bandit publications,
and a pointer to a comprehensive collection of papers maintained as a Mendeley group.
Fundamental Papers
[Lai & Robbins, 1985]
Asymptotically efficient adaptive allocation rules
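The paper's headline result is the asymptotic lower bound that fixes the log n regret rate. Stated for a suboptimal arm i (notation follows the standard textbook form, not necessarily the paper's exact symbols):

```latex
% Lai-Robbins lower bound: any consistent allocation rule must satisfy,
% for every suboptimal arm i with reward law p_i and optimal-arm law p^*,
\liminf_{n \to \infty} \frac{\mathbb{E}[T_i(n)]}{\log n}
  \;\ge\; \frac{1}{\mathrm{KL}\!\left(p_i \,\|\, p^{*}\right)}
% T_i(n): number of pulls of arm i up to round n;
% KL: Kullback-Leibler divergence between the two reward distributions.
```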
[Auer, Cesa-Bianchi & Fischer, 2002]
Finite-time Analysis of the Multiarmed Bandit Problem
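A minimal UCB1 sketch for rewards in [0, 1]; the Bernoulli arms and horizon below are illustrative, not from the paper:

```python
import math
import random

def ucb1(arms, horizon):
    """UCB1 (Auer, Cesa-Bianchi & Fischer, 2002): pull the arm maximizing
    its empirical mean plus the exploration bonus sqrt(2 ln t / n_i)."""
    n = [0] * len(arms)       # pull counts
    mean = [0.0] * len(arms)  # empirical means
    for t in range(1, horizon + 1):
        if t <= len(arms):    # pull each arm once to initialize
            i = t - 1
        else:
            i = max(range(len(arms)),
                    key=lambda a: mean[a] + math.sqrt(2 * math.log(t) / n[a]))
        r = arms[i]()         # sample a reward in [0, 1]
        n[i] += 1
        mean[i] += (r - mean[i]) / n[i]
    return n, mean

# Illustrative example: three Bernoulli arms with unknown means.
arms = [lambda p=p: float(random.random() < p) for p in (0.2, 0.5, 0.7)]
print(ucb1(arms, 10_000))
```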
[Agrawal & Goyal, 2012]
Analysis of Thompson Sampling for the Multi-armed Bandit Problem
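A Beta-Bernoulli Thompson Sampling sketch matching the setting Agrawal & Goyal analyze; the toy arms are illustrative:

```python
import random

def thompson_bernoulli(arms, horizon):
    """Thompson Sampling for Bernoulli rewards: keep a Beta(s+1, f+1)
    posterior per arm, sample a mean from each, and pull the argmax."""
    s = [0] * len(arms)  # successes per arm
    f = [0] * len(arms)  # failures per arm
    for _ in range(horizon):
        theta = [random.betavariate(s[i] + 1, f[i] + 1)
                 for i in range(len(arms))]
        i = max(range(len(arms)), key=theta.__getitem__)
        if arms[i]():
            s[i] += 1
        else:
            f[i] += 1
    return s, f

arms = [lambda p=p: random.random() < p for p in (0.2, 0.5, 0.7)]
print(thompson_bernoulli(arms, 10_000))
```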
[Auer, Cesa-Bianchi, Freund & Schapire, 2002]
The Nonstochastic Multiarmed Bandit Problem
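The paper's EXP3 algorithm in sketch form; the toy adversary and the mixing parameter gamma are illustrative:

```python
import math
import random

def exp3(reward_fn, k, horizon, gamma=0.1):
    """EXP3 (Auer, Cesa-Bianchi, Freund & Schapire, 2002): exponential
    weights over arms with importance-weighted reward estimates."""
    w = [1.0] * k
    for t in range(horizon):
        total = sum(w)
        p = [(1 - gamma) * w[i] / total + gamma / k for i in range(k)]
        i = random.choices(range(k), weights=p)[0]
        r = reward_fn(t, i)                      # adversarial reward in [0, 1]
        w[i] *= math.exp(gamma * (r / p[i]) / k)  # importance-weighted update
    return w

# Illustrative oblivious adversary: pays arm 0 on even rounds, arm 1 on odd.
print(exp3(lambda t, i: float(i == (t % 2)), k=2, horizon=1000))
```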
[Bubeck, Munos & Stoltz, 2009]
Pure Exploration in Multi-armed Bandits Problems
[Gittins, 1979]
Bandit Processes and Dynamic Allocation Indices
[Srinivas, Krause, Kakade & Seeger, 2010]
Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design
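A one-round sketch of the GP-UCB acquisition rule, mu(x) + sqrt(beta) * sigma(x); the sklearn model, kernel length-scale, and fixed beta are illustrative stand-ins (the paper derives a schedule beta_t):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gp_ucb_step(X_obs, y_obs, candidates, beta=2.0):
    """One GP-UCB round: fit a GP posterior to past (action, reward) pairs,
    then pick the candidate maximizing mu(x) + sqrt(beta) * sigma(x)."""
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-2)
    gp.fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)
    return candidates[np.argmax(mu + np.sqrt(beta) * sigma)]

# Illustrative example: optimize a 1-d black-box function over a grid.
f = lambda x: -(x - 0.6) ** 2
X = np.array([[0.1], [0.9]]); y = f(X).ravel()
grid = np.linspace(0, 1, 101).reshape(-1, 1)
print(gp_ucb_step(X, y, grid))
```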
[Garivier & Cappé, 2011]
The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond
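A sketch of the Bernoulli KL-UCB index computed by bisection; the paper's exploration term also carries a c log log t refinement omitted here:

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, precision=1e-6):
    """KL-UCB index: the largest q >= mean with pulls * KL(mean, q) <= log t,
    located by bisection over [mean, 1]."""
    lo, hi = mean, 1.0
    budget = math.log(max(t, 2)) / pulls
    while hi - lo > precision:
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

print(kl_ucb_index(mean=0.5, pulls=10, t=100))
```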
[Russo, Van Roy, Kazerouni, Osband & Wen, 2018]
A Tutorial on Thompson Sampling
[Lattimore & Szepesvári, 2017]
The End of Optimism? An Asymptotic Analysis of Finite-Armed Linear Bandits
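For context, the "optimism" this paper analyzes is the LinUCB-style ellipsoidal bonus. A minimal sketch of that optimistic baseline, with an illustrative alpha and a toy environment:

```python
import numpy as np

def linucb_choose(features, A, b, alpha=1.0):
    """One LinUCB step: ridge estimate theta = A^{-1} b plus an ellipsoidal
    confidence bonus alpha * sqrt(x^T A^{-1} x) per action."""
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b
    scores = [x @ theta + alpha * np.sqrt(x @ A_inv @ x) for x in features]
    return int(np.argmax(scores))

def linucb_update(A, b, x, reward):
    """Rank-one design-matrix update after observing a reward."""
    return A + np.outer(x, x), b + reward * x

# Illustrative example: 3 actions in R^2, true parameter theta* = (1, 0.5).
rng = np.random.default_rng(0)
d = 2; A = np.eye(d); b = np.zeros(d); theta_star = np.array([1.0, 0.5])
for _ in range(500):
    X = rng.normal(size=(3, d))
    i = linucb_choose(X, A, b)
    r = X[i] @ theta_star + rng.normal(scale=0.1)
    A, b = linucb_update(A, b, X[i], r)
print((np.linalg.inv(A) @ b).round(2))  # estimate approaches theta*
```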
🔥 Breakthrough Results (2017-2025)
Best-of-Both-Worlds (BOBW) Algorithms
[Ito, 2023]
Adversarial Combinatorial Bandits with Bandit Feedback: Best-of-Both-Worlds and Adversarial Robustness
[Zimmert & Lattimore, 2022]
Return of the Bias: Almost Minimax Optimal High Probability Bounds for Adversarial Linear Bandits
[Lee, Neu & Zimmert, 2022]
Near-Optimal Algorithms for Making the Gradient Small in Stochastic Minimax Optimization
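The canonical BOBW algorithm behind much of this line is Tsallis-INF (Zimmert & Seldin, 2019): online mirror descent with the 1/2-Tsallis entropy. A minimal sketch, with an illustrative learning-rate constant and bisection in place of the paper's Newton step:

```python
import math
import random

def tsallis_inf(loss_fn, k, horizon):
    """Tsallis-INF sketch: sample from p_i = 4 / (eta * (L_i - x))^2, where
    the normalizer x makes p a distribution, and update cumulative losses
    with importance-weighted estimates."""
    L = [0.0] * k                        # cumulative loss estimates
    for t in range(1, horizon + 1):
        eta = 2.0 / math.sqrt(t)         # O(1/sqrt(t)); constant illustrative
        # Bisection for x < min(L) such that sum_i p_i(x) = 1.
        lo, hi = min(L) - 4 * k / eta, min(L) - 1e-12
        for _ in range(60):
            x = (lo + hi) / 2
            if sum(4.0 / (eta * (Li - x)) ** 2 for Li in L) > 1.0:
                hi = x
            else:
                lo = x
        p = [4.0 / (eta * (Li - x)) ** 2 for Li in L]
        i = random.choices(range(k), weights=p)[0]
        L[i] += loss_fn(t, i) / p[i]     # importance-weighted loss estimate
    return L

# Illustrative stochastic losses in [0, 1]; arm 0 is best.
print(tsallis_inf(lambda t, i: float(i != 0) * 0.5 + random.random() * 0.5,
                  k=3, horizon=5000))
```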
Instance-Dependent & Adaptive Methods
[Neu & Olkhovskaya, 2021]
Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability
[Dann & Lattimore, 2021]
Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient
[Tiapkin et al., 2023]
Fast Rates for Maximum Entropy Exploration
LLMs, Reasoning & Exploration
[Ouyang et al., 2022]
Training language models to follow instructions with human feedback (InstructGPT/ChatGPT)
[Bai et al., 2022]
Constitutional AI: Harmlessness from AI Feedback (powers Claude)
[Rafailov et al., 2023]
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
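The DPO objective is a one-line logistic loss on the policy's implicit reward margin over a frozen reference model. A sketch for a single preference pair, with made-up sequence log-probabilities:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss (Rafailov et al., 2023) for one preference pair: a logistic
    loss on beta times the implicit reward margin of the chosen response (w)
    over the rejected one (l), each measured relative to the reference."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Illustrative log-probabilities, not real model outputs.
print(dpo_loss(logp_w=-12.0, logp_l=-11.0, ref_logp_w=-12.5, ref_logp_l=-10.5))
```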
[Snell et al., 2024]
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Batched & Practical Bandits
[Gao, Han, Ren & Zhou, 2019]
Batched Multi-armed Bandits Problem
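A sketch of the batched flavor of arm elimination this line of work analyzes: feedback is observed only between batches. The confidence radius and batch sizes here are illustrative, not the paper's:

```python
import math
import random

def batched_elimination(arms, batches, batch_pulls):
    """Batched elimination sketch: in each batch, pull every surviving arm
    equally; afterwards drop arms whose upper confidence bound falls below
    the best arm's lower confidence bound."""
    k = len(arms)
    n = [0] * k
    mean = [0.0] * k
    active = set(range(k))
    for _ in range(batches):
        for i in active:              # uniform exploration within the batch
            for _ in range(batch_pulls):
                r = arms[i]()
                n[i] += 1
                mean[i] += (r - mean[i]) / n[i]
        rad = {i: math.sqrt(math.log(4 * k * n[i]) / n[i]) for i in active}
        best_lcb = max(mean[i] - rad[i] for i in active)
        active = {i for i in active if mean[i] + rad[i] >= best_lcb}
    return active, mean

arms = [lambda p=p: float(random.random() < p) for p in (0.3, 0.5, 0.7)]
print(batched_elimination(arms, batches=4, batch_pulls=200))
```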
[Zhou, Li & Gu, 2020]
Neural Contextual Bandits with UCB-based Exploration (NeuralUCB with provable guarantees)
[Dudik et al., 2014]
Doubly Robust Policy Evaluation and Optimization (industry standard for OPE)
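A sketch of the doubly robust estimator for a deterministic target policy: a reward model supplies a baseline, and an importance-weighted residual corrects its bias. The toy log and reward model below are illustrative:

```python
def doubly_robust_value(logs, target_policy, q_hat):
    """Doubly robust off-policy evaluation (Dudik et al., 2014).
    Each log entry is (context, action, reward, logging_propensity)."""
    total = 0.0
    for x, a, r, mu in logs:
        direct = q_hat(x, target_policy(x))                 # model-based term
        correction = (a == target_policy(x)) / mu * (r - q_hat(x, a))
        total += direct + correction
    return total / len(logs)

# Toy example: one context, two actions, uniform logging policy.
q_hat = lambda x, a: 0.5    # a deliberately crude reward model
pi = lambda x: 1            # target policy always plays action 1
logs = [(0, 1, 1.0, 0.5), (0, 0, 0.0, 0.5)]
print(doubly_robust_value(logs, pi, q_hat))  # 1.0 on this toy log
```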
Main Venues for Bandit Research
COLT (Conference on Learning Theory)
ALT (Algorithmic Learning Theory)
ICML (International Conference on Machine Learning)
NeurIPS (Conference on Neural Information Processing Systems)
AISTATS (International Conference on Artificial Intelligence and Statistics)
JMLR (Journal of Machine Learning Research)