Michal Valko : Projects

Bandit Papers

Fundamental papers in multi-armed bandit research, the main venues for bandit publications, and a comprehensive collection of papers maintained as a Mendeley group.

Fundamental Papers

[Lai & Robbins, 1985] Asymptotically efficient adaptive allocation rules
[Auer, Cesa-Bianchi & Fischer, 2002] Finite-time Analysis of the Multiarmed Bandit Problem
[Agrawal & Goyal, 2012] Analysis of Thompson Sampling for the Multi-armed Bandit Problem
[Auer, Cesa-Bianchi, Freund & Schapire, 2002] The Nonstochastic Multiarmed Bandit Problem
[Bubeck, Munos & Stoltz, 2009] Pure Exploration in Multi-armed Bandits Problems
[Gittins, 1979] Bandit Processes and Dynamic Allocation Indices
[Srinivas, Krause, Kakade & Seeger, 2010] Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design
[Garivier & Cappé, 2011] The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond
[Russo, Van Roy, Kazerouni, Osband & Wen, 2018] A Tutorial on Thompson Sampling
[Lattimore & Szepesvári, 2017] The End of Optimism? An Asymptotic Analysis of Finite-Armed Linear Bandits
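Several of the papers above revolve around the UCB1 index policy analyzed by Auer, Cesa-Bianchi & Fischer (2002): play each arm once, then repeatedly pull the arm maximizing empirical mean plus a confidence bonus. A minimal sketch (the function name and Bernoulli test arms are illustrative, not from any of the papers):

```python
import math
import random

def ucb1(reward_fns, horizon, rng=random):
    """UCB1 (Auer, Cesa-Bianchi & Fischer, 2002): after one pull of each
    arm, choose argmax_i  mean_i + sqrt(2 ln t / n_i)."""
    k = len(reward_fns)
    counts = [0] * k   # n_i: number of pulls of arm i
    sums = [0.0] * k   # cumulative reward of arm i
    for t in range(horizon):
        if t < k:
            arm = t  # initialization: play each arm once
        else:
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2.0 * math.log(t + 1) / counts[i]))
        r = reward_fns[arm](rng)
        counts[arm] += 1
        sums[arm] += r
    return counts

# Two Bernoulli arms with means 0.2 and 0.8; UCB1 should concentrate
# its pulls on the second arm.
rng = random.Random(0)
arms = [lambda r: float(r.random() < 0.2),
        lambda r: float(r.random() < 0.8)]
counts = ucb1(arms, 2000, rng)
```

The logarithmic bonus is exactly what yields the finite-time O(ln T) regret bound of the paper; KL-UCB (Garivier & Cappé, 2011) sharpens it by replacing the Hoeffding-style bonus with a KL-based upper confidence bound.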

🔥 Breakthrough Results (2017-2025)

Best-of-Both-Worlds (BOBW) Algorithms

[Ito, 2023] Adversarial Combinatorial Bandits with Bandit Feedback: Best-of-Both-Worlds and Adversarial Robustness
[Zimmert & Lattimore, 2022] Return of the Bias: Almost Minimax Optimal High Probability Bounds for Adversarial Linear Bandits
[Lee, Neu & Zimmert, 2022] Near-Optimal Algorithms for Making the Gradient Small in Stochastic Minimax Optimization

Instance-Dependent & Adaptive Methods

[Neu & Olkhovskaya, 2021] Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability
[Dann & Lattimore, 2021] Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient
[Tiapkin et al, 2023] Fast Rates for Maximum Entropy Exploration

LLMs, Reasoning & Exploration

[Ouyang et al, 2022] Training language models to follow instructions with human feedback (InstructGPT/ChatGPT)
[Bai et al, 2022] Constitutional AI: Harmlessness from AI Feedback (powers Claude)
[Rafailov et al, 2023] Direct Preference Optimization: Your Language Model is Secretly a Reward Model
[Snell et al, 2024] Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Batched & Practical Bandits

[Gao, Han, Ren & Zhou, 2019] Batched Multi-armed Bandits Problem
[Zhou, Li, Zhu, 2020] Neural Contextual Bandits with UCB-based Exploration (NeuralUCB with provable guarantees)
[Dudik et al, 2014] Doubly Robust Policy Evaluation and Optimization (industry standard for OPE)
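The doubly robust estimator of Dudik et al. combines a reward model (direct method) with an importance-weighted correction, so the estimate stays unbiased if either the model or the logged propensities are correct. A minimal sketch for a deterministic target policy (function names and the toy data are illustrative assumptions):

```python
def doubly_robust_value(logs, target_policy, reward_model):
    """Doubly robust off-policy value estimate (Dudik et al, 2014).

    logs: iterable of (context, logged_action, propensity, reward).
    target_policy: context -> action (deterministic, for simplicity).
    reward_model: (context, action) -> predicted reward.
    """
    total = 0.0
    for x, a, p, r in logs:
        pi_a = target_policy(x)
        # Direct-method term: model's prediction under the target action.
        dm = reward_model(x, pi_a)
        # IPS correction: nonzero only when the logged action matches
        # the target action; divides by the logging propensity p.
        match = 1.0 if a == pi_a else 0.0
        total += dm + match * (r - reward_model(x, a)) / p
    return total / len(logs)

# Toy check: a perfect reward model (reward equals the action) makes the
# correction vanish, so the estimate equals the true value of always
# playing action 1.
logs = [(0, 1, 0.5, 1.0), (0, 0, 0.5, 0.0)]
estimate = doubly_robust_value(logs, lambda x: 1, lambda x, a: float(a))
```

With a perfect model the correction term is zero and the estimate reduces to the direct method; with a perfect propensity the model bias cancels in expectation, which is the "doubly robust" guarantee of the paper.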

Main Venues for Bandit Research

ICML (International Conference on Machine Learning)
NeurIPS (Conference on Neural Information Processing Systems)
COLT (Conference on Learning Theory)
ALT (Algorithmic Learning Theory)
AISTATS (Artificial Intelligence and Statistics)
ICLR (International Conference on Learning Representations)