preprints
- Yunhao Tang, Taco Cohen, David W. Zhang, Michal Valko, Rémi Munos:
RL-finetuning LLMs from on- and off-policy data with a single algorithm,
arXiv preprint
bibtex
abstract
We introduce a unified reinforcement learning algorithm designed to fine-tune large language models (LLMs) using both on-policy and off-policy data. This approach enhances efficiency and stability by leveraging diverse data sources within a single framework, addressing challenges like distributional shifts and sample efficiency.
- Chaoqi Wang, Zhuokai Zhao, Chen Zhu, Karthik Abinav Sankararaman, Michal Valko, Xuefei Cao, Zhaorun Chen, Madian Khabsa, Yuxin Chen, Hao Ma, Sinong Wang:
Preference optimization with multi-sample comparisons,
arXiv preprint
bibtex
abstract
Traditional post-training approaches like RLHF rely on single-sample comparisons, which may not capture group-level characteristics. We introduce Multi-sample Direct Preference Optimization (mDPO) and Multi-sample Identity Preference Optimization (mIPO), which extend preference optimization to multi-sample comparisons that capture group-wise attributes.
- Antoine Scheid, Étienne Boursier, Alain Durmus, Michael I Jordan, Pierre Ménard, Éric Moulines, Michal Valko:
Optimal design for reward modeling in RLHF,
arXiv preprint
bibtex
abstract
We address the challenge of effectively aligning large language models with human preferences through RLHF. We focus on optimizing the design of the reward model to enhance its performance and reliability, proposing methodologies for optimal selection of data and feedback mechanisms.
- Pierre Perrault, Denis Belomestny, Pierre Ménard, Éric Moulines, Alexey Naumov, Daniil Tiapkin, Michal Valko:
A new bound on the cumulant generating function of Dirichlet processes,
arXiv preprint
bibtex
abstract
We introduce a novel method for bounding the cumulant generating function of Dirichlet processes, effective for constructing confidence regions for sums of independent Dirichlet processes.
- Yunhao Tang, Daniel Zhaohan Guo, Zeyu Zheng, Daniele Calandriello, Yuan Cao, Eugene Tarassov, Rémi Munos, Bernardo Ávila Pires, Michal Valko, Yong Cheng, Will Dabney:
Understanding the performance gap between online and offline alignment algorithms,
arXiv preprint
bibtex
abstract
We investigate the differences between online and offline methods for aligning LLMs with human preferences. Online algorithms with on-policy sampling outperform offline algorithms. Offline algorithms excel in pairwise classification but struggle with generative tasks, highlighting the critical role of on-policy sampling in AI alignment.
- Denis Belomestny, Pierre Ménard, Alexey Naumov, Daniil Tiapkin, Michal Valko:
Sharp deviations bounds for Dirichlet weighted sums with application to analysis of Bayesian algorithms,
arXiv preprint
bibtex
abstract
We present new concentration inequalities for weighted sums of Dirichlet random variables. These inequalities provide precise bounds on the probability that such sums deviate from their expected values. We apply these results to analyze the performance of Bayesian algorithms, particularly in the context of multi-armed bandit problems.
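For orientation, the object controlled by these bounds can be written generically as follows; the notation below is ours and only meant to fix ideas, not to reproduce the paper's statements.

```latex
% w = (w_1,\dots,w_n) \sim \mathrm{Dirichlet}(\alpha_1,\dots,\alpha_n),
% fixed x_1,\dots,x_n \in \mathbb{R} (our generic notation).
S = \sum_{i=1}^{n} w_i x_i,
\qquad
\mathbb{E}[S] = \sum_{i=1}^{n} \frac{\alpha_i}{\bar\alpha}\, x_i,
\qquad
\bar\alpha = \sum_{j=1}^{n} \alpha_j,
% and the deviation bounds control
\mathbb{P}\bigl( \lvert S - \mathbb{E}[S] \rvert \ge t \bigr)
% in terms of \bar\alpha, t, and the spread of the x_i.
```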
- Tadashi Kozuno, Wenhao Yang, Nino Vieillard, Toshinori Kitamura, Yunhao Tang, Jincheng Mei, Pierre Ménard, Mohammad Gheshlaghi Azar, Michal Valko, Rémi Munos, Olivier Pietquin, Matthieu Geist, Csaba Szepesvári:
KL-entropy-regularized RL with a generative model is minimax optimal,
arXiv preprint
bibtex
abstract
We study the sample complexity of reinforcement learning when employing a generative model and incorporating KL-entropy regularization. We demonstrate that this approach achieves minimax optimality, meaning it attains the best possible performance in the worst-case scenario. This extends prior results on model-based RL with generative models to settings that include KL-entropy regularization.
2025
- Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Michal Valko, Vianney Perchet:
The Harder Path: Last iterate convergence for uncoupled learning in zero-sum games with bandit feedback,
in International Conference on Machine Learning
(ICML 2025)
OpenReview
bibtex
abstract
We investigate learning in zero-sum matrix games with bandit feedback. We focus on uncoupled algorithms that ensure last iterate convergence to Nash equilibrium. We show the best attainable rate is O(T^{-1/4}), contrasting with O(T^{-1/2}) for average iterates. We propose two algorithms achieving this optimal rate.
- Daniil Tiapkin, Daniele Calandriello, Denis Belomestny, Éric Moulines, Alexey Naumov, Kashif Rasul, Michal Valko, Pierre Ménard:
Accelerating Nash learning from human feedback via Mirror Prox,
in COLT 2025 Workshop: Foundations of Post-training (COLT-FoPt 2025),
arXiv preprint
bibtex
poster
abstract
We introduce an accelerated approach to Nash Learning from Human Feedback (NLHF) for aligning LLMs. Using Mirror Prox optimization, we efficiently reach the Nash equilibrium of the preference model. This method offers faster convergence compared to standard mirror descent approaches.
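To illustrate the optimization primitive (not the paper's NLHF algorithm), the sketch below runs Mirror Prox, i.e., an extragradient method with entropic mirror steps, on a toy two-player constant-sum game built from a preference matrix P where P[i, j] is the probability that response i is preferred to response j; the matrix and all constants are ours.

```python
import numpy as np

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

# Toy preference matrix: P[i, j] = probability that response i beats response j,
# with P + P.T = 1 (constant-sum), so A = P - 1/2 is an antisymmetric zero-sum game.
P = np.array([[0.5, 0.9, 0.2],
              [0.1, 0.5, 0.8],
              [0.8, 0.2, 0.5]])
A = P - 0.5

x = np.ones(3) / 3      # max player (the policy of interest)
y = np.ones(3) / 3      # min player (the opponent policy)
eta = 0.5
for _ in range(2000):
    # Extrapolation (look-ahead) step with multiplicative-weights mirror updates...
    x_half = softmax(np.log(x) + eta * (A @ y))
    y_half = softmax(np.log(y) - eta * (A.T @ x))
    # ...then the actual update, using gradients evaluated at the look-ahead point.
    x = softmax(np.log(x) + eta * (A @ y_half))
    y = softmax(np.log(y) - eta * (A.T @ x_half))

print("approximate Nash strategy:", x.round(3))
print("exploitability of x:", float((A @ x).max()))   # 0 at the Nash equilibrium
```

The extra look-ahead gradient evaluation is what distinguishes Mirror Prox from plain mirror descent and underlies its faster convergence guarantees for such saddle-point problems.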
2024
- Llama Team: Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, ... Michal Valko ... Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, Zhiwei Zhao, Zhiyu Ma:
The Llama 3 herd of models,
arXiv preprint
bibtex
abstract
We introduce Llama 3, a new suite of foundation models supporting multilinguality, coding, reasoning, and tool usage. The largest model is a dense Transformer with 405 billion parameters and a context window of up to 128K tokens, delivering quality comparable to GPT-4.
- Aniket Didolkar, Anirudh Goyal, Nan Rosemary Ke, Siyuan Guo, Michal Valko, Timothy Lillicrap, Danilo Rezende, Yoshua Bengio, Michael Mozer, Sanjeev Arora:
Metacognitive capabilities of LLMs: An exploration in mathematical problem solving,
in Neural Information Processing Systems
(NeurIPS 2024) and
(ICML 2024 - AI for Math)
arXiv preprint
poster
bibtex
abstract
Metacognitive knowledge refers to humans' intuitive knowledge of their own thinking and reasoning processes. The paper gives evidence that LLMs also have metacognitive knowledge, including the ability to name skills and procedures to apply to a given task. We explore this in the context of mathematical reasoning, developing a procedure to get an LLM to assign sensible skill labels to math questions. This improves accuracy on GSM8K and MATH for several strong LLMs.
- Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, Michal Valko:
Local and adaptive mirror descents in extensive-form games,
in Neural Information Processing Systems
(NeurIPS 2024),
PDF, arXiv preprint
bibtex
abstract
We study how to learn epsilon-optimal strategies in two-player zero-sum extensive-form games with imperfect information through self-play. We propose local and adaptive mirror descent algorithms that exploit the structure of extensive-form games. Our algorithms achieve improved regret bounds by using local regularization and adaptive step sizes, leading to faster convergence to Nash equilibrium.
- Rémi Munos*, Michal Valko, Daniele Calandriello*, Mohammad Gheshlaghi Azar*, Mark Rowland*, Daniel Guo*, Yunhao Tang*, Matthieu Geist*, Thomas Mesnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J. Mankowitz, Doina Precup, Bilal Piot:
Nash learning from human feedback,
in International Conference on Machine Learning
(ICML 2024)
arXiv preprint
image
bibtex
abstract
Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. However, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF).
- Daniele Calandriello, Daniel Guo, Rémi Munos, Mark Rowland, Yunhao Tang, Bernardo Ávila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot:
Human alignment of large language models through online preference optimisation,
in International Conference on Machine Learning
(ICML 2024) [spotlight - 3.5% acceptance rate]
PDF, arXiv preprint
bibtex
abstract
We address the challenge of aligning large language models (LLMs) with human preferences. We introduce an online preference optimization framework that iteratively refines LLMs based on human feedback. This approach contrasts with traditional RLHF by continuously updating the model in real-time as new preferences are collected, demonstrating improved performance in various language tasks.
- Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Ávila Pires, Bilal Piot:
Generalized preference optimization: A unified approach to offline alignment,
in International Conference on Machine Learning
(ICML 2024)
PDF, arXiv preprint
bibtex
abstract
We introduce Generalized Preference Optimization (GPO), a comprehensive framework for aligning large language models with human preferences using offline data. GPO unifies various offline alignment methods, such as DPO, IPO, and SLiC, offering theoretical guarantees for robustness and reliability without the need for online interactions or on-policy sampling.
- Tianlin Liu, Shangmin Guo, Leonardo Bianco, Daniele Calandriello, Quentin Berthet, Felipe Llinares, Jessica Hoffmann, Lucas Dixon, Michal Valko, Mathieu Blondel:
Decoding-time realignment of language models,
in International Conference on Machine Learning
(ICML 2024) [spotlight - 3.5% acceptance rate]
PDF, arXiv preprint
bibtex
abstract
Aligning language models with human preferences is crucial for reducing errors and biases. Alignment techniques such as RLHF optimize a tradeoff between human preference rewards and proximity regularization. We propose decoding-time realignment (DeRa), a simple method to explore and evaluate different regularization strengths in aligned models without retraining. DeRa enables control over the degree of alignment, allowing users to smoothly transition between unaligned and aligned models.
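A minimal sketch of the decoding-time idea as we read it: per token, sample from a geometric mixture of the reference and aligned next-token distributions with mixing weight lambda; the function and the random "logits" below are ours and stand in for real model outputs.

```python
import numpy as np

def dera_next_token_dist(logits_ref, logits_aligned, lam):
    """Geometric mixture of the reference and aligned next-token distributions.

    lam = 0 recovers the reference (unaligned) model, lam = 1 the aligned one,
    and intermediate values interpolate the effective regularization strength.
    This is an illustrative reading of decoding-time realignment, not the
    paper's exact procedure.
    """
    logp_ref = logits_ref - np.logaddexp.reduce(logits_ref)            # log-softmax
    logp_aligned = logits_aligned - np.logaddexp.reduce(logits_aligned)
    mixed = (1.0 - lam) * logp_ref + lam * logp_aligned                # interpolate log-probs
    return np.exp(mixed - np.logaddexp.reduce(mixed))                  # renormalize per token

# Toy usage with random arrays standing in for two model heads over a 10-token vocabulary.
rng = np.random.default_rng(0)
logits_ref, logits_aligned = rng.normal(size=10), rng.normal(size=10)
for lam in (0.0, 0.5, 1.0):
    p = dera_next_token_dist(logits_ref, logits_aligned, lam)
    print(lam, p.round(3), p.sum().round(6))
```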
- Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Zhaohan Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos:
A general theoretical paradigm to understand learning from human preferences,
in International Conference on Artificial Intelligence and Statistics
(AISTATS 2024)
arXiv preprint
bibtex
abstract
The prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations: the first assumes that pairwise preferences can be substituted with pointwise rewards. The second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. In this paper we derive a new general objective called ΨPO for learning from human preferences that is expressed in terms of pairwise preferences and therefore bypasses both approximations. This new general objective allows us to perform an in-depth analysis of the behavior of RLHF and DPO and to identify their potential pitfalls.
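One concrete member of the ΨPO family is IPO, whose loss regresses a preference log-ratio onto a constant; the sketch below computes it from per-sequence log-probabilities with our own toy tensors and should be read as an illustration of the objective, not a reference implementation.

```python
import torch

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """IPO: regress the preference log-ratio onto 1/(2*tau).

    logp_* are summed token log-probs of the chosen (w) and rejected (l)
    responses under the trained policy; ref_logp_* under the frozen reference.
    For comparison, DPO would instead use -logsigmoid(beta * h) on the same h.
    """
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)   # implicit preference margin
    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()

# Toy batch of 4 preference pairs.
torch.manual_seed(0)
logp_w, logp_l = torch.randn(4) - 10.0, torch.randn(4) - 12.0
ref_w, ref_l = logp_w.detach() + 0.1, logp_l.detach() - 0.1
print(ipo_loss(logp_w, logp_l, ref_w, ref_l).item())
```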
- Alaa Saade, Steven Kapturowski, Daniele Calandriello, Charles Blundell, Pablo Sprechmann, Leopoldo Sarra, Oliver Groth, Michal Valko, Bilal Piot:
Unlocking the power of representations in long-term novelty-based exploration,
in International Conference on Learning Representations
(ICLR 2024) [spotlight - 5% acceptance rate]
(NeurIPS 2023 - ALOE)
PDF, arXiv preprint
bibtex
abstract
We study novelty-based exploration in reinforcement learning, focusing on methods that leverage learned representations to sustain exploration over long time horizons. We demonstrate how representation learning can be combined with novelty-based intrinsic rewards to achieve effective exploration in challenging environments.
- Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Éric Moulines, Alexey Naumov, Pierre Perrault, Michal Valko, Pierre Ménard:
Demonstration-regularized RL,
in International Conference on Learning Representations
(ICLR 2024)
PDF, arXiv preprint
bibtex
abstract
We explore how incorporating expert demonstrations can enhance the sample efficiency of reinforcement learning. We focus on a framework that leverages expert demonstrations through KL regularization applied to a policy learned via behavior cloning. Our findings indicate that utilizing expert demonstrations allows for identifying an optimal policy with significantly reduced sample complexity in both finite and linear MDPs.
2023
- Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Éric Moulines, Rémi Munos, Alexey Naumov, Pierre Perrault, Michal Valko, Pierre Ménard:
Model-free posterior sampling via learning rate randomization,
in Neural Information Processing Systems
(NeurIPS 2023),
arXiv preprint
bibtex
abstract
We propose a model-free approach to posterior sampling in reinforcement learning via learning rate randomization. This technique provides a computationally efficient alternative to traditional Bayesian methods while maintaining strong theoretical guarantees for exploration and sample efficiency.
- Onno Eberhard, Thibaut Cuvelier, Michal Valko, Bruno Adrien De Backer:
Middle-mile logistics through the lens of goal-conditioned reinforcement learning,
in NeurIPS 2023 Workshop: Goal-conditioned RL,
OpenReview preprint
bibtex
abstract
Middle-mile logistics describes routing parcels through a network of hubs linked by trucks with finite capacity. We rephrase this as a multi-object goal-conditioned MDP and combine graph neural networks with model-free RL.
- Daniel Jarrett, Corentin Tallec, Florent Altché, Thomas Mesnard, Rémi Munos, Michal Valko:
Curiosity in hindsight: Intrinsic exploration in stochastic environments,
in International Conference on Machine Learning
(ICML 2023)
(NeurIPS 2022 - DeepRL),
arXiv preprint
image
bibtex
abstract
Consider the problem of exploration in sparse-reward or reward-free environments. In the curiosity-driven paradigm, the agent is rewarded for how much each realized outcome differs from their predicted outcome. But using predictive error as intrinsic motivation is fragile in stochastic environments. We propose Curiosity in Hindsight: learning representations of the future that capture precisely the unpredictable aspects of each outcome, such that intrinsic rewards only reflect the predictable aspects of world dynamics. We show state-of-the-art results in exploring Montezuma's Revenge with sticky actions.
- Yunhao Tang, Rémi Munos, Mark Rowland, Michal Valko:
VA-learning as a more efficient alternative to Q-learning,
in International Conference on Machine Learning
(ICML 2023)
arXiv preprint
bibtex
abstract
We introduce VA-learning, a novel reinforcement learning algorithm that focuses on learning the advantage function, representing the difference between the value of a specific action and the average value of all actions. By concentrating on this advantage function, VA-learning provides a more direct and efficient approach to policy evaluation and improvement, achieving faster convergence than Q-learning.
- Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Éric Moulines, Rémi Munos, Alexey Naumov, Pierre Perrault, Yunhao Tang, Michal Valko, Pierre Ménard:
Fast rates for maximum entropy exploration,
in International Conference on Machine Learning
(ICML 2023)
arXiv preprint
bibtex
abstract
We address the challenge of exploration in reinforcement learning (RL) when the agent operates in an unknown environment with sparse or no rewards. We study the maximum entropy exploration problem of two types: visitation entropy maximization and trajectory entropy. We propose algorithms with improved sample complexity and establish the first theoretical result showing the potential statistical advantage of regularized MDPs for exploration.
- Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, Michal Valko:
Adapting to game trees in zero-sum imperfect information games,
in International Conference on Machine Learning
(ICML 2023) [outstanding paper award] [oral - 2% acceptance rate]
arXiv preprint
poster
poster (image)
talk
bibtex
abstract
We study how to learn ε-optimal strategies in a zero-sum imperfect information game through self-play with trajectory feedback. We give a problem-independent lower bound and propose two FTRL algorithms: Balanced FTRL which matches the lower bound, and Adaptive FTRL which adapts the regularization to observations.
- Yunhao Tang, Zhaohan Daniel Guo, Pierre Harvey Richemond, Bernardo Ávila Pires, Yash Chandak, Rémi Munos, Mark Rowland, Mohammad Gheshlaghi Azar, Charline Le Lan, Clare Lyle, András György, Shantanu Thakoor, Will Dabney, Bilal Piot, Daniele Calandriello, Michal Valko:
Understanding self-predictive learning for reinforcement learning,
in International Conference on Machine Learning
(ICML 2023)
arXiv preprint
bibtex
abstract
We investigate the role of self-predictive learning (SPL) in reinforcement learning. SPL involves training agents to predict their future states, which can enhance representation learning and improve sample efficiency. We analyze how SPL influences the learning dynamics and performance of RL agents, providing insights into its benefits and limitations.
- Yunhao Tang, Tadashi Kozuno, Mark Rowland, Anna Harutyunyan, Rémi Munos, Bernardo Ávila Pires, Michal Valko:
DoMo-AC: Doubly multi-step off-policy actor-critic algorithm,
in International Conference on Machine Learning
(ICML 2023)
PDF, arXiv preprint
bibtex
abstract
We propose DoMo-AC, a doubly multi-step off-policy actor-critic algorithm that extends multi-step returns to both the actor and critic updates. This allows for more efficient learning by leveraging longer temporal dependencies while maintaining stability in off-policy settings.
- Toshinori Kitamura, Tadashi Kozuno, Yunhao Tang, Nino Vieillard, Michal Valko, Wenhao Yang, Jincheng Mei, Pierre Ménard, Mohammad Gheshlaghi Azar, Rémi Munos, Olivier Pietquin, Matthieu Geist, Csaba Szepesvári, Wataru Kumagai:
Regularization and variance-weighted regression achieves minimax optimality in linear MDPs: Theory and practice,
in International Conference on Machine Learning
(ICML 2023)
PDF, arXiv preprint
bibtex
abstract
We introduce a novel approach to reinforcement learning in linear Markov Decision Processes (MDPs). We propose a method that combines regularization techniques with variance-weighted regression to achieve minimax optimality. This approach is both theoretically sound and practically effective, as demonstrated through our analysis and experiments.
- Thomas Mesnard, Wenqi Chen, Alaa Saade, Yunhao Tang, Mark Rowland, Theophane Weber, Clare Lyle, Audrunas Gruslys, Michal Valko, Will Dabney, Georg Ostrovski, Éric Moulines, Rémi Munos:
Quantile credit assignment,
in International Conference on Machine Learning
(ICML 2023) [oral - 2% acceptance rate]
PDF, arXiv preprint
bibtex
abstract
In reinforcement learning, the credit assignment problem is to distinguish luck from skill, that is, separate the inherent randomness in the environment from the controllable effects of the agent's actions. This paper proposes two novel algorithms, Quantile Credit Assignment (QCA) and Hindsight QCA (HQCA), which incorporate distributional value estimation to perform credit assignment. Both QCA and HQCA have the appealing interpretation of leveraging an estimate of the quantile level of the return (interpreted as the level of "luck") in order to derive a "luck-dependent" baseline for policy gradient methods.
- Mehdi Azabou, Venkataramana Ganesh, Shantanu Thakoor, Chi-Heng Lin, Lakshmi Sathidevi, Ran Liu, Michal Valko, Petar Veličković, Eva L Dyer:
Half-Hop: A graph upsampling approach for slowing down message passing,
in International Conference on Machine Learning
(ICML 2023)
arXiv preprint
bibtex
poster
code
abstract
Message passing neural networks have shown a lot of success on graph-structured data. However, there are many instances where message passing can lead to over-smoothing or fail when neighboring nodes belong to different classes. In this work, we introduce a simple yet general framework for improving learning in message passing neural networks. Our approach essentially upsamples edges in the original graph by adding "slow nodes" at each edge that can mediate communication between a source and a target node. Our method only modifies the input graph, making it plug-and-play and easy to use with existing models.
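Because the method is a pure input-graph transformation, it is easy to sketch; the version below inserts a "slow node" on each sampled directed edge and initializes its features by interpolating the endpoint features, which is our simplified reading (the paper's exact wiring and feature initialization may differ).

```python
import numpy as np

def half_hop(edge_index, x, alpha=0.5, p=1.0, rng=None):
    """Insert a slow node on each sampled directed edge (u -> v).

    edge_index: (2, E) int array of directed edges; x: (N, d) node features;
    p: probability that an edge receives a slow node. Each selected edge u -> v
    is replaced by u -> w and w -> v (one possible variant), where w is a new
    node with features alpha * x[v] + (1 - alpha) * x[u].
    """
    rng = rng or np.random.default_rng(0)
    src, dst = edge_index
    keep = rng.random(src.shape[0]) > p          # edges left untouched
    new_src, new_dst, new_feats = list(src[keep]), list(dst[keep]), []
    next_id = x.shape[0]
    for u, v in zip(src[~keep], dst[~keep]):
        new_feats.append(alpha * x[v] + (1 - alpha) * x[u])
        new_src += [u, next_id]                  # u -> w
        new_dst += [next_id, v]                  # w -> v
        next_id += 1
    x_out = np.vstack([x] + new_feats) if new_feats else x
    return np.array([new_src, new_dst]), x_out

# Tiny 3-node path graph with one-hot features.
edges = np.array([[0, 1], [1, 2]])
feats = np.eye(3)
new_edges, new_feats = half_hop(edges, feats, alpha=0.5, p=1.0)
print(new_edges, new_feats.shape)
```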
2022
- Zhaohan Daniel Guo, Shantanu Thakoor, Miruna Pîslar, Bernardo Ávila Pires, Florent Altché, Corentin Tallec, Alaa Saade, Daniele Calandriello, Jean-Bastien Grill, Yunhao Tang, Michal Valko, Rémi Munos, Mohammad Gheshlaghi Azar, Bilal Piot:
BYOL-Explore: Exploration by bootstrapped prediction, in Neural Information Processing Systems
(NeurIPS 2022)
arXiv preprint
poster
poster (image)
talk
bibtex
abstract
We present BYOL-Explore, a conceptually simple yet general approach for curiosity-driven exploration in visually-complex environments. BYOL-Explore learns a world representation, the world dynamics, and an exploration policy all-together by optimizing a single prediction loss in the latent space with no additional auxiliary objective. We show that BYOL-Explore is effective in DM-HARD-8 and achieves superhuman performance on the ten hardest exploration games in Atari while having a much simpler design than other competitive agents.
- Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Éric Moulines, Rémi Munos, Alexey Naumov, Mark Rowland, Michal Valko, Pierre Ménard: Optimistic posterior sampling for reinforcement learning with few samples and tight guarantees, in Neural Information Processing Systems (NeurIPS 2022), arXiv preprint poster poster (image) talk bibtex
- Daniil Tiapkin, Denis Belomestny, Éric Moulines, Alexey Naumov, Sergey Samsonov, Yunhao Tang,
Michal Valko, Pierre Ménard:
From Dirichlet to Rubin: Optimistic exploration in RL without bonuses,
in International Conference on Machine Learning
(ICML 2022)
[long talk - 2% acceptance rate]
arXiv preprint,
talk
poster
bibtex
abstract
We propose a new approach to exploration in reinforcement learning that achieves optimism without explicit bonus terms. By leveraging the connection between Dirichlet distributions and Rubin's Bayesian bootstrap, we derive an efficient exploration strategy with theoretical guarantees.
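The core primitive, Rubin's Bayesian bootstrap, is easy to illustrate in isolation: reweighting observations with Dirichlet(1, ..., 1) weights produces posterior-like samples of a mean without any explicit bonus, and acting on an upper quantile of those samples yields optimism; the snippet is a generic one-arm illustration in our own notation, not the paper's algorithm.

```python
import numpy as np

def bayesian_bootstrap_means(rewards, n_samples=1000, rng=None):
    """Rubin's Bayesian bootstrap: resample the mean by reweighting observations
    with Dirichlet(1, ..., 1) weights instead of adding an exploration bonus."""
    rng = rng or np.random.default_rng(0)
    rewards = np.asarray(rewards, dtype=float)
    weights = rng.dirichlet(np.ones(len(rewards)), size=n_samples)  # (n_samples, n)
    return weights @ rewards                                        # posterior-like means

# Few observations give a wide spread of resampled means, so acting on an
# upper quantile of the samples behaves optimistically.
samples = bayesian_bootstrap_means([1.0, 0.0, 0.0, 1.0, 1.0])
print(samples.mean().round(3), np.quantile(samples, 0.9).round(3))
```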
- Anirudh Goyal, Abram L Friesen, Theophane Weber, Andrea Banino, Nan Rosemary Ke, Adrià Puigdomènech Badia, Ksenia Konyushkova, Michal Valko, Simon Osindero, Timothy P Lillicrap, Nicolas Heess, Charles Blundell:
Retrieval-augmented reinforcement learning,
in International Conference on Machine Learning
(ICML 2022)
PDF, arXiv preprint
bibtex
abstract
We explore an alternative paradigm where we augment an RL agent with a retrieval process that has direct access to a dataset of experiences. The retrieval process is trained to retrieve information that may be useful in the current context. We show that retrieval-augmented R2D2 learns significantly faster than the baseline and achieves higher scores on Atari.
- Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, Lorenzo Rosasco:
Scaling Gaussian process optimization by evaluating a few unique
candidates multiple times,
in International Conference on Machine Learning
(ICML 2022)
arXiv preprint,
talk
poster
bibtex
abstract
We show that GP-based optimization can be made efficient by sticking to a candidate solution for multiple evaluation steps and switching only when necessary. This limits the number of unique points in the history and allows efficient GP posterior computation. This approach is especially useful with high switch costs (e.g., switching chemicals in wet labs).
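A rough sketch of the "stick to one candidate, switch only when necessary" idea in a standard GP-UCB loop, with scikit-learn as the GP backend; the switching rule used here (re-select only once the posterior at the current point is well resolved) is our simplification, not the paper's rule, and the objective and constants are ours.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
f = lambda x: -np.sin(3 * x) - x**2 + 0.7 * x            # toy objective
grid = np.linspace(-2.0, 2.0, 200).reshape(-1, 1)        # candidate set
noise = 0.3

X, y = [np.array([0.0])], [f(0.0) + noise * rng.normal()]
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=noise**2, optimizer=None)

budget, beta, std_switch = 60, 2.0, 0.2
current = grid[0]
for t in range(budget):
    gp.fit(np.vstack(X), np.array(y))
    mu, std = gp.predict(grid, return_std=True)
    # Switch the query point only when needed: here, once the posterior at the
    # current point is already well resolved (our simplified switching rule).
    _, cur_std = gp.predict(current.reshape(1, -1), return_std=True)
    if cur_std[0] < std_switch:
        current = grid[np.argmax(mu + beta * std)]
    X.append(current)
    y.append(f(current[0]) + noise * rng.normal())

print("unique points queried:", len(np.unique(np.round(np.vstack(X), 3), axis=0)))
print("best observed value:", round(max(y), 3))
```

Because the history contains few unique points, the GP posterior stays cheap to maintain, which is the computational point made in the abstract.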
- Shantanu Thakoor, Corentin Tallec, Mohammad Gheshlaghi Azar, Mehdi Azabou, Eva L. Dyer, Rémi Munos, Petar Veličković, Michal Valko:
Large-scale representation learning on graphs via bootstrapping,
in International Conference on Learning Representations
(ICLR 2022)
(ICLR 2021 - GTRL)
arXiv preprint
poster
bibtex
abstract
Self-supervised learning provides a promising path towards eliminating the need for costly label information in representation learning on graphs. We introduce Bootstrapped Graph Latents (BGRL) - a graph representation learning method that learns by predicting alternative augmentations of the input. BGRL uses only simple augmentations and alleviates the need for contrasting with negative examples, and is thus scalable by design. BGRL outperforms or matches prior methods on several established benchmarks, while achieving a 2-10x reduction in memory costs.
- Jean Tarbouriech, Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Michal Valko, Alessandro Lazaric:
Adaptive multi-goal exploration,
in International Conference on Artificial Intelligence and Statistics
(AISTATS 2022)
arXiv preprint
bibtex
abstract
We study adaptive exploration in multi-goal reinforcement learning settings, developing algorithms that efficiently learn to reach multiple goal states while adapting to the structure of the environment.
- Yunhao Tang, Mark Rowland, Rémi Munos, Michal Valko:
Marginalized operators for off-policy reinforcement learning,
in International Conference on Artificial Intelligence and Statistics
(AISTATS 2022)
and (ICML 2021 - RL Theory)
PDF, arXiv preprint
bibtex
abstract
We study marginalized operators for off-policy reinforcement learning, providing a unified framework for understanding and improving off-policy evaluation and optimization methods.
2021
- Ran Liu, Mehdi Azabou, Max Dabagia, Chi-Heng Lin, Mohammad Gheshlaghi Azar, Keith B. Hengen, Michal Valko, Eva L. Dyer:
Drop, Swap, and Generate: A self-supervised approach for generating neural activity,
in Neural Information Processing Systems
(NeurIPS 2021) [oral - 1% acceptance rate]
PDF, arXiv preprint
bibtex
abstract
We introduce a self-supervised method for generating neural activity data. This approach leverages techniques such as dropping, swapping, and generating data to enhance the representation learning of neural activity. Our goal is to improve the understanding and modeling of neural dynamics without relying on labeled datasets.
- Jean Tarbouriech, Runlong Zhou, Simon S. Du, Matteo Pirotta, Michal Valko, Alessandro Lazaric:
Stochastic shortest path: minimax, parameter-free and towards horizon-free regret,
in Neural Information Processing Systems
(NeurIPS 2021) [spotlight - 3% acceptance rate]
and (ICML 2021 - RL Theory),
arXiv preprint,
talk
poster (image)
video
bibtex
abstract
We introduce a novel algorithm for stochastic shortest path problems that achieves minimax optimal regret without requiring prior knowledge of problem-dependent parameters. This approach moves towards horizon-free regret bounds.
- Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric:
A provably efficient sample collection strategy for reinforcement learning,
in Neural Information Processing Systems
(NeurIPS 2021) [spotlight - 3% acceptance rate]
arXiv preprint,
bibtex
abstract
A common assumption in reinforcement learning (RL) is to have access to a generative model (i.e., a simulator of the environment), which allows to generate samples from any desired state-action pair. Nonetheless, in many settings a generative model may not be available and an adaptive exploration strategy is needed to efficiently collect samples from an unknown environment by direct interaction. In this paper, we study the scenario where an algorithm based on the generative model assumption d...
- Tadashi Kozuno*, Pierre Ménard*, Rémi Munos, Michal Valko:
Model-free learning for two-player zero-sum partially observable Markov games with perfect recall,
in Neural Information Processing Systems
(NeurIPS 2021)
arXiv preprint,
talk
poster
bibtex
abstract
We study the problem of learning a Nash equilibrium (NE) in an imperfect information game (IIG) through self-play. We focus on two-player, zero-sum, episodic, tabular IIG under the perfect-recall assumption where the only feedback is realizations of the game (bandit feedback). We provide the Implicit Exploration Online Mirror Descent (IXOMD) algorithm, a model-free algorithm with a high-probability bound on the convergence rate to the NE of order 1/sqrt(T) where T is the number of played games.
- Yunhao Tang*, Tadashi Kozuno*, Mark Rowland, Rémi Munos, Michal Valko:
Unifying gradient estimators for meta-reinforcement learning via off-policy evaluation,
in Neural Information Processing Systems
(NeurIPS 2021)
PDF, arXiv preprint
bibtex
abstract
We propose a unified framework that leverages off-policy evaluation techniques to derive gradient estimators for meta-reinforcement learning, improving efficiency and stability of meta-RL algorithms.
- Adrià Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang, Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica Patraucean, Florent Altché, Michal Valko, Jean-Bastien Grill, Aäron van den Oord, Andrew Zisserman:
Broaden your views for self-supervised video learning, in
International Conference on Computer Vision
(ICCV 2021)
arXiv preprint
poster
bibtex
abstract
We introduce BraVe, a self-supervised learning framework for video. One view has access to a narrow temporal window while the other has broad access to content. BraVe achieves state-of-the-art results on UCF101, HMDB51, Kinetics, ESC-50 and AudioSet.
- Pierre Ménard, Omar Darwiche Domingues, Xuedong Shang, Michal Valko:
UCB Momentum Q-learning: Correcting the bias without forgetting,
in International Conference on Machine Learning
(ICML 2021)
[long talk - 3% acceptance rate]
arXiv preprint
poster
poster (image)
talk
bibtex
abstract
We propose UCBMQ, using momentum to correct Q-learning bias while limiting impact on regret. UCBMQ matches the lower bound for large T and has a second-order term scaling only linearly with states.
- Pierre Ménard, Omar Darwiche Domingues, Émilie Kaufmann, Anders Jonsson, Édouard Leurent, Michal Valko:
Fast active learning for pure exploration in reinforcement learning,
in International Conference on Machine Learning
(ICML 2021)
PDF, arXiv preprint,
bibtex
abstract
Realistic environments often provide agents with very limited feedback. When the environment is initially unknown, the feedback, in the beginning, can be completely absent, and the agents may first choose to devote all their effort on exploring efficiently. The exploration remains a challenge while it has been addressed with many hand-tuned heuristics with different levels of generality on one side, and a few theoretically backed exploration strategies on the other. Many of them are incarnate...
- Tadashi Kozuno, Yunhao Tang, Mark Rowland, Rémi Munos, Steven Kapturowski, Will Dabney, Michal Valko, David Abel:
Revisiting Peng's Q(λ) for modern reinforcement learning,
in International Conference on Machine Learning
(ICML 2021)
PDF, arXiv preprint
bibtex
abstract
Off-policy multi-step reinforcement learning algorithms consist of conservative and non-conservative algorithms: the former actively cut traces, whereas the latter do not. Recently, Munos et al. (2016) proved the convergence of conservative algorithms to an optimal Q-function. In contrast, non-conservative algorithms are thought to be unsafe and have a limited or no theoretical guarantee. Nonetheless, recent studies have shown that non-conservative algorithms empirically outperform conservati...
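For reference, Peng's Q(λ) targets can be computed with a simple backward recursion that mixes one-step bootstrapped backups with the λ-return; the function below is a generic single-trajectory illustration with our own array conventions, not the paper's deep-RL agent.

```python
import numpy as np

def peng_q_lambda_targets(rewards, q_next, gamma=0.99, lam=0.7):
    """Backward recursion for Peng's Q(lambda) targets along one trajectory.

    rewards: (T,) rewards r_t; q_next: (T,) values max_a Q(s_{t+1}, a).
    G_t = r_t + gamma * ((1 - lam) * max_a Q(s_{t+1}, a) + lam * G_{t+1}),
    with the last step bootstrapping purely from max_a Q at the final state.
    """
    T = len(rewards)
    targets = np.empty(T)
    targets[-1] = rewards[-1] + gamma * q_next[-1]
    for t in range(T - 2, -1, -1):
        g = (1 - lam) * q_next[t] + lam * targets[t + 1]
        targets[t] = rewards[t] + gamma * g
    return targets

rewards = np.array([0.0, 0.0, 1.0])
q_next = np.array([0.5, 0.8, 0.0])       # max_a Q at the next states
print(peng_q_lambda_targets(rewards, q_next))
```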
- Yunhao Tang, Mark Rowland, Rémi Munos, Michal Valko:
Taylor expansion of discount factors,
in International Conference on Machine Learning
(ICML 2021)
arXiv preprint
abstract
We expand the discount factor into a Taylor series, decomposing the value function into terms corresponding to different time scales, enhancing understanding and performance of RL algorithms.
- Xavier Fontaine, Pierre Perrault, Michal Valko, Vianney Perchet:
Online A-optimal design and active linear regression,
in International Conference on Machine Learning
(ICML 2021)
arXiv preprint
bibtex
- Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Émilie Kaufmann, Michal Valko:
Kernel-based reinforcement Learning: A finite-time analysis,
in International Conference on Machine Learning
(ICML 2021)
arXiv preprint
bibtex
abstract
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning with continuous state-action spaces. We propose Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP through a non-parametric kernel estimator of the rewards and transitions, and we provide a finite-time regret analysis.
- Karl Tuyls, Shayegan Omidshafiei, Paul Muller, Zhe Wang, Jerome Connor, Daniel Hennes, Ian Graham, William Spearman, Tim Waskett, Dafydd Steele, Pauline Luc, Adrià Recasens, Alexandre Galashov, Gregory Thornton, Romuald Elie, Pablo Sprechmann, Pol Moreno, Kris Cao, Marta Garnelo, Praneet Dutta, Michal Valko, Nicolas Heess, Alex Bridgland, Julien Pérolat, Bart De Vylder, Ali Eslami, Mark Rowland, Andrew Jaegle, Rémi Munos, Trevor Back, Razia Ahamed, Simon Bouton, Nathalie Beauguerlange, Jackson Broshear, Thore Graepel, Demis Hassabis:
Game plan: What AI can do for football, and what football can do for AI,
in Journal of Artificial Intelligence Research
(JAIR 2021)
arXiv preprint
bibtex
abstract
We explore how AI can enhance football analytics and how football presents unique challenges that drive AI research, combining statistical learning, game theory, and computer vision.
- Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Émilie Kaufmann, Michal Valko: A kernel-based approach to non-stationary reinforcement learning in metric spaces, in International Conference on Artificial Intelligence and Statistics (AISTATS 2021) and (ICML 2020 - RL Theory) [oral - 6% acceptance rate] arXiv preprint video
- Omar Darwiche Domingues, Pierre Ménard, Émilie Kaufmann, Michal Valko:
Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited,
in Algorithmic Learning Theory
(ALT 2021)
PDF, arXiv preprint
video
bibtex
abstract
In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the non-stationary case in which the transition kernel is allowed to change in each stage of the episode. Our main contribution is a novel lower bound of Omega((H^3 S A / epsilon^2) log(1/delta)) on the sample complexity of an (epsilon, delta)-PAC algorithm for best policy identification in a non-stationary MDP.
- Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric:
Sample complexity bounds for stochastic shortest path with a generative model,
in Algorithmic Learning Theory
(ALT 2021)
video
bibtex
abstract
We consider computing an epsilon-optimal policy in stochastic shortest path with a generative oracle. We propose two algorithms with PAC sample complexity bounds, combining SSP-specific tools and variance reduction techniques.
- Émilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Édouard Leurent, Michal Valko:
Adaptive reward-free exploration,
in Algorithmic Learning Theory
(ALT 2021) and (ICML 2020 - RL Theory)
arXiv preprint,
video 1
video 2
bibtex
abstract
Reward-free exploration is a reinforcement learning setting recently studied by Jin et al., who address it by running several algorithms with regret guarantees in parallel. In our work, we instead propose a more adaptive approach for reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm can be seen as a variant of an algorithm of Fiechter from 1994, originally proposed for a different object...
- Guillaume Gautier, Rémi Bardenet, Michal Valko:
Fast sampling from β-ensembles,
Statistics and Computing
(Statistics and Computing 2021)
arXiv preprint,
bibtex
abstract
We study sampling algorithms for β-ensembles with time complexity less than cubic in the cardinality of the ensemble. Following Dumitriu and Edelman (2002), we see the ensemble as the eigenvalues of a random tridiagonal matrix, namely a random Jacobi matrix. First, we provide a unifying and elementary treatment of the tridiagonal models associated to the three classical Hermite, Laguerre and Jacobi ensembles. For this purpose, we use simple changes of variables between successive repara...
- Mehdi Azabou, Mohammad Gheshlaghi Azar, Ran Liu, Chi-Heng Lin, Erik C. Johnson, Kiran Bhaskaran-Nair, Max Dabagia, Bernardo Ávila Pires, Lindsey Kitchell, Keith B. Hengen, William Gray-Roncal, Michal Valko, Eva L. Dyer:
Mine Your Own vieW: Self-supervised learning through across-sample prediction,
in NeurIPS 2021 Workshop: Self-Supervised Learning - Theory and Practice,
arXiv preprint
bibtex
abstract
We introduce Mine Your Own vieW (MYOW), a new approach for self-supervised learning that actively mines views by finding samples that are close in the representation space, then predicts the representation of nearby samples. MYOW outperforms other self-supervised approaches in neuroscience applications where rich augmentations are not established.
- Omar Darwiche Domingues, Corentin Tallec, Rémi Munos, Michal Valko:
Density-based bonuses on learned representations for reward-free exploration in deep reinforcement learning,
in ICML 2021 Workshop: Unsupervised Reinforcement Learning,
OpenReview
bibtex
abstract
We propose a framework that computes exploration bonuses through density estimation applied to learned representations, enabling agents to explore effectively without external rewards by assigning higher bonuses to less frequently visited state-action pairs.
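A generic instance of the mechanism, under our own choices (a kernel density estimate on fixed embedding vectors, whereas the paper works with learned representations and its own density model): states whose embeddings fall in low-density regions receive larger intrinsic bonuses.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def density_bonus(embeddings, queries, bandwidth=0.5, scale=1.0):
    """Exploration bonus = scale * (-log density) of a state's embedding,
    so rarely-visited regions of representation space get larger bonuses."""
    kde = KernelDensity(bandwidth=bandwidth).fit(embeddings)
    return -scale * kde.score_samples(queries)      # score_samples returns log density

rng = np.random.default_rng(0)
visited = rng.normal(size=(500, 8))                 # embeddings of visited states
frequent, novel = visited[:1], rng.normal(loc=5.0, size=(1, 8))
print(density_bonus(visited, np.vstack([frequent, novel])))  # novel point gets the larger bonus
```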
- Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Alaa Saade, Shantanu Thakoor, Bilal Piot, Bernardo Ávila Pires, Michal Valko, Thomas Mesnard, Tor Lattimore, Rémi Munos:
Geometric entropic exploration,
in ICML 2021 Workshop: Unsupervised Reinforcement Learning,
arXiv preprint
bibtex
abstract
Exploration is essential for solving complex RL tasks. Maximum State-Visitation Entropy (MSVE) formulates the exploration problem as visiting all states as uniformly as possible. However, existing approaches to MSVE are theoretically justified only for discrete state-spaces. We introduce Geometric Entropy Maximisation (GEM), a new algorithm that maximises the geometry-aware Shannon entropy of state-visits in both discrete and continuous domains.
2020
- Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Ávila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko:
Bootstrap Your Own Latent: A new approach to self-supervised learning,
in Neural Information Processing Systems
(NeurIPS 2020) [oral - 1% acceptance rate]
PDF, arXiv preprint
bibtex
our X announcement and our code
YouTube video by Yannic, unofficial blog 1 and unofficial blog 2
abstract
We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network. While s...
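The two ingredients highlighted above, a predictor on top of the online network and a slowly-moving target network, fit in a few lines; the sketch below uses linear layers and noisy copies of random vectors as stand-ins for real encoders and augmentations, so it only illustrates the loss and the EMA update, not a full training pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 32
online_encoder, target_encoder = nn.Linear(dim, dim), nn.Linear(dim, dim)
predictor = nn.Linear(dim, dim)
target_encoder.load_state_dict(online_encoder.state_dict())        # start identical
opt = torch.optim.SGD(list(online_encoder.parameters()) + list(predictor.parameters()), lr=1e-2)

def byol_loss(view_a, view_b):
    # The online branch predicts the (stop-gradient) target embedding of the other view.
    q = F.normalize(predictor(online_encoder(view_a)), dim=-1)
    with torch.no_grad():
        z = F.normalize(target_encoder(view_b), dim=-1)
    return (2 - 2 * (q * z).sum(dim=-1)).mean()                     # MSE between unit vectors

x = torch.randn(16, dim)
view_a, view_b = x + 0.1 * torch.randn_like(x), x + 0.1 * torch.randn_like(x)  # stand-in "augmentations"
loss = byol_loss(view_a, view_b) + byol_loss(view_b, view_a)        # symmetrized loss
opt.zero_grad(); loss.backward(); opt.step()

tau = 0.99                                                          # slow-moving target update
with torch.no_grad():
    for p_t, p_o in zip(target_encoder.parameters(), online_encoder.parameters()):
        p_t.mul_(tau).add_((1 - tau) * p_o)
print(float(loss))
```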
- Pierre H. Richemond, Jean-Bastien Grill, Florent Altché, Corentin Tallec, Florian Strub, Andrew Brock, Samuel Smith, Soham De, Razvan Pascanu, Bilal Piot, Michal Valko:
BYOL works even without batch statistics,
in NeurIPS 2020 Workshop: Self-Supervised Learning - Theory and Practice
arXiv preprint
bibtex
abstract
Bootstrap Your Own Latent (BYOL) is a self-supervised learning approach for image representation. From an augmented view of an image, BYOL trains an online network to predict a target network representation of a different augmented view of the same image. Unlike contrastive methods, BYOL does not explicitly use a repulsion term built from negative pairs in its training objective. Yet, it avoids collapse to a trivial, constant representation. Thus, it has recently been hypothesized that batch ...
- Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric:
Improved sample complexity for incremental autonomous exploration in MDPs,
in Neural Information Processing Systems
(NeurIPS 2020) [oral - 1% acceptance rate]
arXiv preprint
poster
image
bibtex
abstract
We investigate the exploration of an unknown environment when no reward function is provided. Building on the incremental exploration setting introduced by Lim and Auer [1], we define the objective of learning the set of ϵ-optimal goal-conditioned policies attaining all states that are incrementally reachable within L steps (in expectation) from a reference state s0. In this paper, we introduce a novel model-based approach that interleaves discovering new states from s0 and improving the accu...
- Daniele Calandriello*, Michał Dereziński*, Michal Valko:
Sampling from a k-DPP without looking at all items,
in Neural Information Processing Systems
(NeurIPS 2020) [spotlight - 3% acceptance rate]
PDF, arXiv preprint
bibtex
abstract
Determinantal point processes (DPPs) are a useful probabilistic model for selecting a small diverse subset out of a large collection of items, with applications in summarization, stochastic optimization, active learning and more. Given a kernel function and a subset size k, our goal is to sample k out of n items with probability proportional to the determinant of the kernel matrix induced by the subset (a.k.a. k-DPP). Existing k-DPP sampling algorithms require an expensive preprocessing step ...
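To fix ideas about the target distribution (this is emphatically not the paper's sublinear-preprocessing sampler), a brute-force k-DPP sampler for tiny n can simply enumerate all size-k subsets and pick one with probability proportional to det(L_S).

```python
import numpy as np
from itertools import combinations

def brute_force_k_dpp(L, k, rng=None):
    """Sample S with P(S) proportional to det(L_S) over all subsets of size k.

    Exponential in n, so only useful for illustrating the definition on toy problems.
    """
    rng = rng or np.random.default_rng(0)
    subsets = list(combinations(range(L.shape[0]), k))
    weights = np.array([np.linalg.det(L[np.ix_(S, S)]) for S in subsets])
    probs = weights / weights.sum()
    return subsets[rng.choice(len(subsets), p=probs)]

# Toy PSD kernel over 6 items: similar items are less likely to be picked together.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
L = np.exp(-0.5 * np.linalg.norm(X[:, None] - X[None, :], axis=-1) ** 2)
print(brute_force_k_dpp(L, k=3))
```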
- Pierre Perrault, Étienne Boursier, Vianney Perchet, Michal Valko:
Statistical efficiency of Thompson sampling for combinatorial semi-bandits,
in Neural Information Processing Systems
(NeurIPS 2020)
PDF, arXiv preprint
bibtex
abstract
We investigate stochastic combinatorial multi-armed bandit with semi-bandit feedback (CMAB). In CMAB, the question of the existence of an efficient policy with an optimal asymptotic regret (up to a factor poly-logarithmic with the action size) is still open for many families of distributions, including mutually independent outcomes, and more generally the multivariate sub-Gaussian family. We propose to answer the above question for these two families by analyzing variants of the Combinatorial...
- Anders Jonsson, Émilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Édouard Leurent, Michal Valko:
Planning in Markov decision processes with gap-dependent sample complexity,
in Neural Information Processing Systems
(NeurIPS 2020)
arXiv preprint
poster
bibtex
abstract
We propose MDP-GapE, a new trajectory-based Monte-Carlo Tree Search algorithm for planning in a Markov Decision Process in which transitions have a finite support. We prove an upper bound on the number of calls to the generative models needed for MDP-GapE to identify a near-optimal action with high probability. This problem-dependent sample complexity result is expressed in terms of the sub-optimality gaps of the state-action pairs that are visited during exploration. Our experiments reveal t...
- Jean-Bastien Grill, Florent Altché, Yunhao Tang, Thomas Hubert, Michal Valko, Ioannis Antonoglou, Rémi Munos:
Monte-Carlo tree search as regularized policy optimization,
in International Conference on Machine Learning (ICML 2020)
talk
image
arXiv preprint
video
bibtex
abstract
The combination of Monte-Carlo tree search (MCTS) with deep reinforcement learning has led to groundbreaking results in artificial intelligence. However, AlphaZero, the current state-of-the-art MCTS algorithm still relies on handcrafted heuristics that are only partially understood. In this paper, we show that AlphaZero's search heuristic, along with other common ones, can be interpreted as an approximation to the solution of a specific regularized policy optimization problem. With this insig...
- Yunhao Tang, Michal Valko, Rémi Munos:
Taylor expansion policy optimization,
in International Conference on Machine Learning (ICML 2020)
and (MS RL Day 2021)
arXiv preprint
talk
image
bibtex
abstract
In this work, we investigate the application of Taylor expansions in reinforcement learning. In particular, we propose Taylor expansion policy optimization, a policy optimization formalism that generalizes prior work (e.g., TRPO) as a first-order special case. We also show that Taylor expansions intimately relate to off-policy evaluation. Finally, we show that this new formulation entails modifications which improve the performance of several state-of-the-art distributed algorithms.
- Rémy Degenne, Pierre Ménard, Xuedong Shang, Michal Valko:
Gamification of pure exploration for linear bandits,
in International Conference on Machine Learning (ICML 2020)
arXiv preprint
talk
bibtex
abstract
We investigate an active pure-exploration setting, that includes best-arm identification, in the context of linear stochastic bandits. While asymptotically optimal algorithms exist for standard multi-arm bandits, the existence of such algorithms for the best-arm identification in linear bandits has been elusive despite several attempts to address it. First, we provide a thorough comparison and new insight over different notions of optimality in the linear case, including G-optimality, transdu...
- Pierre Perrault, Zheng Wen, Jennifer Healey, Michal Valko:
Budgeted online influence maximization,
in International Conference on Machine Learning (ICML 2020)
video
bibtex
abstract
We introduce a new budgeted framework for online influence maximization, considering the total cost of an advertising campaign instead of the common cardinality constraint on a chosen influencer set. Our approach models better the real-world setting where the cost of influencers varies and advertizers want to find the best value for their overall social advertising budget. We propose an algorithm assuming an independent cascade diffusion model and edge-level semi-bandit feedback, and provide ...
- Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, Alessandro Lazaric:
No-regret exploration in goal-oriented reinforcement learning,
in International Conference on Machine Learning (ICML 2020)
arXiv preprint
talk
bibtex
abstract
Many popular reinforcement learning problems (e.g., navigation in a maze, some Atari games, mountain car) are instances of the episodic setting under its stochastic shortest path (SSP) formulation, where an agent has to achieve a goal state while minimizing the cumulative cost. Despite the popularity of this setting, the exploration-exploitation dilemma has been sparsely studied in general SSP problems, with most of the theoretical literature focusing on different problems (i.e., fixed-horizo...
- Aadirupa Saha, Pierre Gaillard, Michal Valko:
Improved sleeping bandits with stochastic action sets and adversarial rewards,
in International Conference on Machine Learning (ICML 2020)
arXiv preprint
talk
bibtex
abstract
In this paper, we consider the problem of sleeping bandits with stochastic action sets and adversarial rewards. In this setting, in contrast to most work in bandits, the actions may not be available at all times. For instance, some products might be out of stock in item recommendation. The best existing efficient (i.e., polynomial-time) algorithms for this problem only guarantee an O(T^{2/3}) upper-bound on the regret. Yet, inefficient algorithms based on EXP4 can achieve O(sqrt(T)). In this pa...
- Anne Manegueu, Claire Vernade, Alexandra Carpentier, Michal Valko:
Stochastic bandits with arm-dependent delays,
in International Conference on Machine Learning (ICML 2020) and
(GPSD 2020) and
(WiML 2019)
PDF, arXiv preprint
video
bibtex
abstract
Significant work has been recently dedicated to the stochastic delayed bandit setting because of its relevance in applications. The applicability of existing algorithms is however restricted by the fact that strong assumptions are often made on the delay distributions, such as full observability, restrictive shape constraints, or uniformity over arms. In this work, we weaken them significantly and only assume that there is a bound on the tail of the delay. In particular, we cover the importan...
- Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, Lorenzo Rosasco:
Near-linear time Gaussian process optimization with adaptive batching and resparsification,
in International Conference on Machine Learning (ICML 2020) and
(OPT 2019)
arXiv preprint
talk
bibtex
abstract
Gaussian processes (GP) are one of the most successful frameworks to model uncertainty. However, GP optimization (e.g., GP-UCB) suffers from major scalability issues. Experimental time grows linearly with the number of evaluations, unless candidates are selected in batches (e.g., using GP-BUCB) and evaluated in parallel. Furthermore, computational cost is often prohibitive since algorithms such as GP-BUCB require a time at least quadratic in the number of dimensions and iterations to select...
- Pierre Perrault, Vianney Perchet, Michal Valko:
Covariance-adapting algorithm for semi-bandits with application to sparse rewards,
in Conference on Learning Theory
(COLT 2020),
video
bibtex
abstract
We investigate stochastic combinatorial semi-bandits, where the entire joint distribution of rewards impacts the complexity of the problem instance (unlike in the standard bandits). Typical distributions considered depend on specific parameter values, whose prior knowledge is required in theory but quite difficult to estimate in practice; an example is the commonly assumed sub-Gaussian family. We alleviate this issue by instead considering a new general family of sub-exponential distributions...
- Xuedong Shang, Rianne de Heide, Émilie Kaufmann, Pierre Ménard, Michal Valko:
Fixed-confidence guarantees for Bayesian best-arm identification,
in International Conference on Artificial Intelligence and Statistics
(AISTATS 2020)
talk
bibtex
arXiv preprint
abstract
We investigate and provide new insights on the sampling rule called Top-Two Thompson Sampling (TTTS). In particular, we justify its use for fixed-confidence best-arm identification. We further propose a variant of TTTS called Top-Two Transportation Cost (T3C), which disposes of the computational burden of TTTS. As our main contribution, we provide the first sample complexity analysis of TTTS and T3C when coupled with a very natural Bayesian stopping rule, for bandits with Gaussian rewards, so...
- Côme Fiegel, Victor Gabillon, Michal Valko: Adaptive multi-fidelity optimization with fast learning rates in International Conference on Artificial Intelligence and Statistics (AISTATS 2020) bibtex
- Victor Gabillon, Rasul Tutunov, Michal Valko, Haitham Bou Ammar: Derivative-free & order-robust optimisation in International Conference on Artificial Intelligence and Statistics (AISTATS 2020) bibtex talk
- Julien Seznec, Pierre Ménard, Alessandro Lazaric, Michal Valko:
A single algorithm for both restless and rested rotting bandits,
in International Conference on Artificial Intelligence and Statistics
(AISTATS 2020)
bibtex
talk
abstract
In many application domains (e.g., recommender systems, intelligent tutoring systems), the rewards associated to the available actions tend to decrease over time. This decay is either caused by the actions executed in the past (e.g., a user may get bored when songs of the same genre are recommended over and over) or by an external factor (e.g., content becomes outdated). These two situations can be modeled as specific instances of the rested and restless bandit settings, where arms are rottin...
- Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric:
Reward-free exploration beyond finite-horizon, in
Theoretical Foundations of RL Workshop @ ICML 2020
(ICML 2020 - RL Theory)
video
bibtex
abstract
We study reward-free exploration in reinforcement learning settings that extend beyond the standard finite-horizon framework, developing algorithms that can efficiently explore without access to reward signals.
- Tomáš Kocák, Rémi Munos, Branislav Kveton, Shipra Agrawal, Michal Valko:
Spectral Bandits,
Journal of Machine Learning Research
(JMLR 2020)
bibtex
abstract
Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this work, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each item we can recommend is a node of an undirected graph and its expected rating is similar to the one of its neighbors. The goal is to recommend items that have high expect...
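A schematic illustration of the setting (not the paper's algorithms or analysis): represent each node-arm by its row in the Laplacian eigenbasis and run a ridge-regularized UCB whose penalty grows with the eigenvalues, so that reward functions that are smooth on the graph are learned quickly; the graph, constants, and notation below are all ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, noise, c = 30, 400, 0.1, 0.5

# Path graph: Laplacian eigenvectors give a Fourier-like basis for smooth signals.
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
Lap = np.diag(A.sum(1)) - A
lam, U = np.linalg.eigh(Lap)                     # ascending eigenvalues, eigenvectors in columns

alpha_true = np.zeros(n)
alpha_true[:3] = rng.normal(size=3)              # smooth signal = low-frequency coefficients
mu = U @ alpha_true                              # expected arm rewards

V = np.diag(lam) + 1e-2 * np.eye(n)              # spectral penalty: larger for high frequencies
b = np.zeros(n)
regret = 0.0
for t in range(T):
    theta = np.linalg.solve(V, b)
    conf = np.sqrt(np.einsum("ij,jk,ik->i", U, np.linalg.inv(V), U))
    i = int(np.argmax(U @ theta + c * conf))     # UCB over node-arms
    r = mu[i] + noise * rng.normal()
    V += np.outer(U[i], U[i])
    b += r * U[i]
    regret += mu.max() - mu[i]
print("cumulative regret:", round(regret, 2))
```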
2019
- Jean-Bastien Grill*, Omar Darwiche Domingues*, Pierre Ménard, Rémi Munos, Michal Valko:
Planning in entropy-regularized Markov decision processes and games,
in Neural Information Processing Systems
(NeurIPS 2019)
bibtex
poster
abstract
We propose SmoothCruiser, a new planning algorithm for estimating the value function in entropy-regularized Markov decision processes and two-player games, given a generative model of the environment. SmoothCruiser makes use of the smoothness of the Bellman operator promoted by the regularization to achieve problem-independent sample complexity of order O(1/epsilon^4) for a desired accuracy epsilon.
- Mark Rowland, Shayegan Omidshafiei, Karl Tuyls, Julien Pérolat, Michal Valko, Georgios Piliouras, Rémi Munos:
Multiagent evaluation under incomplete information,
in Neural Information Processing Systems
(NeurIPS 2019)
arXiv preprint
bibtex
abstract
We investigate multiagent evaluation in the incomplete information regime, involving general-sum many-player games with noisy outcomes. We derive sample complexity guarantees and propose adaptive algorithms for accurate ranking.
- Michał Dereziński*, Daniele Calandriello*, Michal Valko:
Exact sampling of determinantal point processes with sublinear time preprocessing,
in Neural Information Processing Systems
(NeurIPS 2019) and
(ICML 2019 - NEGDEP)
PDF, arXiv preprint
bibtex
video
abstract
We study the complexity of sampling from a distribution over all index subsets of the set {1, ..., n} with the probability of a subset S proportional to the determinant of the submatrix L_S of some n x n p.s.d. matrix L, where L_S corresponds to the entries of L indexed by S. Known as a determinantal point process, this distribution is used in machine learning to induce diversity in subset selection. In practice, we often wish to sample multiple subsets S with small expected size k = E[card(S)] t...
- Guillaume Gautier, Rémi Bardenet, Michal Valko:
On two ways to use determinantal point processes for Monte Carlo integration,
in Neural Information Processing Systems
(NeurIPS 2019) and
(ICML 2019 - NEGDEP)
bibtex
video
abstract
When approximating an integral by a weighted sum of function evaluations, determinantal point processes (DPPs) provide a way to enforce repulsion between evaluation points. We analyze the Ermakov-Zolotukhin estimator using modern arguments and provide an efficient implementation to sample exactly a particular multidimensional DPP called multivariate Jacobi ensemble.
- Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, Lorenzo Rosasco:
Gaussian process optimization with adaptive sketching: Scalable and no regret, in
Conference on Learning Theory
(COLT 2019) and (ICML 2019 - NEGDEP)
and (SWSL 2019)
bibtex video
talk
poster
poster (image)
abstract
We introduce BKB (budgeted kernelized bandit), a novel approximate GP algorithm for optimization under bandit feedback that achieves near-optimal regret with near-constant per-iteration complexity. Combining GP-UCB with leverage score sampling, we prove that selecting inducing points based on posterior variance gives an accurate low-rank approximation of the GP.
- Peter Bartlett, Victor Gabillon, Jennifer Healey, Michal Valko:
Scale-free adaptive planning for deterministic dynamics & discounted rewards
in International Conference on Machine Learning
(ICML 2019)
bibtex
video
talk
poster
abstract
We address the problem of planning in an environment with deterministic dynamics and stochastic discounted rewards under a limited numerical budget where the ranges of both rewards and noise are unknown. We introduce PlaTypOOS, an adaptive, robust, and efficient alternative to the OLOP (open-loop optimistic planning) algorithm. Whereas OLOP requires a priori knowledge of the ranges of both rewards and noise, PlaTypOOS dynamically adapts its behavior to both. This allows PlaTypOOS to be immune...
- Pierre Perrault, Vianney Perchet, Michal Valko:
Exploiting structure of uncertainty for efficient matroid semi-bandits,
in International Conference on Machine Learning
(ICML 2019)
PDF, arXiv preprint
bibtex
video
talk
poster
poster (image)
abstract
We improve the efficiency of algorithms for stochastic combinatorial semi-bandits. In most interesting problems, state-of-the-art algorithms take advantage of structural properties of rewards, such as independence. However, while being minimax optimal in terms of regret, these algorithms are intractable. In our paper, we first reduce their implementation to a specific submodular maximization. Then, in case of matroid constraints, we design adapted approximation routines, thereby providing the...
- Xuedong Shang, Émilie Kaufmann, Michal Valko:
A simple dynamic bandit-based algorithm for hyper-parameter tuning,
in Workshop on Automated Machine Learning at International Conference on Machine Learning
(ICML 2019 - AutoML)
bibtex
poster
code
abstract
Hyper-parameter tuning is a major part of modern machine learning systems. The tuning itself can be seen as a sequential resource allocation problem. As such, methods for multi-armed bandits have been already applied. In this paper, we view hyper-parameter optimization as an instance of best-arm identification in infinitely many-armed bandits. We propose D-TTTS, a new adaptive algorithm inspired by Thompson sampling, which dynamically balances between refining the estimate of the quality of h...
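A minimal sketch of the Thompson-sampling ingredient that D-TTTS builds on (plain Beta-Bernoulli Thompson sampling over a fixed set of configurations, not D-TTTS itself); pull(a) is a hypothetical oracle returning a 0/1 reward for configuration a.

    import numpy as np

    def thompson_sampling(pull, n_arms, horizon, seed=0):
        rng = np.random.default_rng(seed)
        alpha = np.ones(n_arms)                   # Beta posterior parameters per arm
        beta = np.ones(n_arms)
        for _ in range(horizon):
            theta = rng.beta(alpha, beta)         # one posterior sample per arm
            a = int(np.argmax(theta))             # play the most promising sample
            reward = pull(a)                      # observe a Bernoulli reward
            alpha[a] += reward
            beta[a] += 1 - reward
        return alpha / (alpha + beta)             # posterior means of arm qualities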
- Guillaume Gautier, Rémi Bardenet, Michal Valko:
DPPy: Sampling determinantal point processes with Python,
Journal of Machine Learning Research
(JMLR 2019)
PDF, arXiv preprint
bibtex
abstract
Determinantal point processes (DPPs) are specific probability distributions over clouds of points that are used as models and computational tools across physics, probability, statistics, and more recently machine learning. Sampling from DPPs is a challenge, and we therefore present DPPy, a Python toolbox that gathers known exact and approximate sampling algorithms. The project is hosted on GitHub and equipped with extensive documentation. This documentation takes the form of a short survey ...
- Julien Seznec, Andrea Locatelli, Alexandra Carpentier, Alessandro Lazaric, Michal Valko:
Rotting bandits are not harder than stochastic ones,
in International Conference on Artificial Intelligence and Statistics
(AISTATS 2019)
arXiv preprint
bibtex
talk
poster
[full oral presentation - 2.5% acceptance rate]
abstract
In stochastic multi-armed bandits, the reward distribution of each arm is assumed to be stationary. This assumption is often violated in practice (e.g., in recommendation systems), where the reward of an arm may change whenever it is selected, i.e., the rested bandit setting. In this paper, we consider the non-parametric rotting bandit setting, where rewards can only decrease. We introduce the filtering on expanding window average (FEWA) algorithm that constructs moving averages of increasing window...
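A minimal sketch of the expanding-window averages at the heart of FEWA; the doubling grid of window sizes and the omitted confidence-based filtering are simplifying assumptions, not the exact algorithm.

    import numpy as np

    def expanding_window_means(rewards):
        # averages of the h most recent pulls of one arm, for windows of doubling size
        rewards = np.asarray(rewards, dtype=float)
        means, h = {}, 1
        while h <= len(rewards):
            means[h] = rewards[-h:].mean()
            h *= 2
        return means

    print(expanding_window_means([1.0, 0.9, 0.8, 0.4, 0.3]))

FEWA compares such averages against confidence widths to filter out arms whose recent value is provably suboptimal.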
- Andrea Locatelli, Alexandra Carpentier, Michal Valko:
Active multiple matrix completion with adaptive confidence sets,
in International Conference on Artificial Intelligence and Statistics
(AISTATS 2019)
bibtex
talk
poster
abstract
In this work, we formulate a new multi-task active learning setting in which the learner's goal is to solve multiple matrix completion problems simultaneously. At each round, the learner can choose the matrix from which it receives a sample, with the entry drawn uniformly at random. Our main practical motivation is market segmentation, where the matrices represent different regions with different preferences of the customers. The challenge in this setting is that each of the matrices can be of a di...
- Pierre Perrault, Vianney Perchet, Michal Valko:
Finding the bandit in a graph: Sequential search-and-stop,
in International Conference on Artificial Intelligence and Statistics
(AISTATS 2019)
arXiv preprint
bibtex
poster
abstract
We consider the problem where an agent wants to find a hidden object that is randomly located in some vertex of a directed acyclic graph (DAG) according to a fixed but possibly unknown distribution. The agent can only examine vertices whose in-neighbors have already been examined. In scheduling theory, this problem is denoted by 1|prec|∑ w_j C_j [Graham1979]. However, in this paper we address a learning setting where we allow the agent to stop before having found the object and restart searching ...
- Peter L. Bartlett, Victor Gabillon, Michal Valko:
A simple parameter-free and adaptive approach to optimization under a minimal local smoothness assumption,
in Algorithmic Learning Theory
(ALT 2019)
bibtex
talk 1
talk 2
abstract
We study the problem of optimizing a function under a budgeted number of evaluations. We only assume that the function is locally smooth around one of its global optima. The difficulty of optimization is measured in terms of 1) the amount of noise b of the function evaluation and 2) the local smoothness, d, of the function. A smaller d results in smaller optimization error. We come up with a new, simple, and parameter-free approach. First, for all values of b and d, this approach recovers at lea...
- Xuedong Shang, Émilie Kaufmann, Michal Valko:
General parallel optimization without metric,
in Algorithmic Learning Theory
(ALT 2019)
bibtex
talk
abstract
Hierarchical bandits are an approach for global optimization of extremely irregular functions. This paper provides new elements regarding POO, an adaptive meta-algorithm that does not require the knowledge of local smoothness of the target function. We first highlight the fact that the subroutine algorithm used in POO should have a small regret under the assumption of local smoothness with respect to the chosen partitioning, and it is unknown whether this assumption is satisfied by the standard subroutine HOO. ...
- Guillaume Gautier, Rémi Bardenet, Michal Valko:
Les processus ponctuels déterminantaux en apprentissage automatique (Determinantal point processes in machine learning),
(Gretsi 2019)
bibtex
abstract
For the session "Mathematical tools in machine learning", we propose a short survey of determinantal point processes, a popular probabilistic model and tool in machine learning, which already has promising applications in signal processing.
2018
- Jean-Bastien Grill, Michal Valko, Rémi Munos:
Optimistic optimization of a Brownian,
in Neural Information Processing Systems
(NeurIPS 2018)
arXiv preprint
bibtex
poster
abstract
We address the problem of optimizing a Brownian motion. We consider a (random) realization W of a Brownian motion with input space in [0,1]. Given W, our goal is to return an epsilon-approximation of its maximum using the smallest possible number of function evaluations, the sample complexity of the algorithm. We provide an algorithm with sample complexity of order log^2(1/epsilon). This improves over previous results of Al-Mharmah and Calvin (1996) and Calvin et al. (2017) which provided only...
- Xuedong Shang, Émilie Kaufmann, Michal Valko:
Adaptive black-box optimization got easier: HCT needs only local smoothness,
in European Workshop on Reinforcement Learning
(EWRL 2018)
bibtex
poster
abstract
Hierarchical bandits are an approach for global optimization of extremely irregular functions. This paper provides new elements regarding POO, an adaptive meta-algorithm that does not require the knowledge of local smoothness of the target function. We first highlight the fact that the subroutine algorithm used in POO should have a small regret under the assumption of local smoothness with respect to the chosen partitioning, and it is unknown whether this assumption is satisfied by the standard subroutine HOO....
- Édouard Oyallon, Eugene Belilovsky, Sergey Zagoruyko, Michal Valko:
Compressing the input for CNNs with the first-order scattering transform,
in European Conference on Computer Vision
(ECCV 2018)
bibtex
poster
abstract
We study the first-order scattering transform as a candidate for reducing the signal processed by a convolutional neural network (CNN). We show theoretical and empirical evidence that in the case of natural images and sufficiently small translation invariance, this transform preserves most of the signal information needed for classification while substantially reducing the spatial resolution and total signal size. We demonstrate that cascading a CNN with this representation performs on par wi...
- Daniele Calandriello, Ioannis Koutis, Alessandro Lazaric, Michal Valko:
Improved large-scale graph learning through ridge spectral sparsification,
in International Conference on Machine Learning
(ICML 2018)
bibtex
talk
poster
abstract
The representation and learning benefits of methods based on graph Laplacians, such as Laplacian smoothing or harmonic function solution for semi-supervised learning (SSL), are empirically and theoretically well supported. Nonetheless, the exact versions of these methods scale poorly with the number of nodes n of the graph. In this paper, we combine a spectral sparsification routine with Laplacian learning. Given a graph G as input, our algorithm computes a sparsifier in a distributed way in ...
- Yasin Abbasi-Yadkori, Peter L. Bartlett, Victor Gabillon, Alan Malek, Michal Valko:
Best of both worlds: Stochastic & adversarial best-arm identification,
Conference on Learning Theory
(COLT 2018)
bibtex
video
talk
poster
abstract
We study bandit best-arm identification with arbitrary and potentially adversarial rewards. A simple random uniform learner obtains the optimal rate of error in the adversarial scenario. However, this type of strategy is suboptimal when the rewards are sampled stochastically. Therefore, we ask: Can we design a learner that performs optimally in both the stochastic and adversarial problems while not being aware of the nature of the rewards? First, we show that designing such a learner is impos...
2017
- Daniele Calandriello, Alessandro Lazaric, Michal Valko:
Efficient second-order online kernel learning with adaptive embedding,
in Neural Information Processing Systems
(NeurIPS 2017)
bibtex
talk
poster
abstract
Online kernel learning (OKL) is a flexible framework to approach prediction problems, since the large approximation space provided by reproducing kernel Hilbert spaces can contain an accurate function for the problem. Nonetheless, optimizing over this space is computationally expensive. Not only do first-order methods accumulate O(sqrt(T)) more loss than the optimal function, but the curse of kernelization also results in O(t) per-step complexity. Second-order methods get closer to the optimum muc...
- Zheng Wen, Branislav Kveton, Michal Valko, Sharan Vaswani:
Online influence maximization under independent cascade model with semi-bandit feedback, in Neural Information Processing Systems
(NeurIPS 2017)
bibtex
abstract
We study the online influence maximization problem in social networks under the independent cascade model. Specifically, we aim to learn the set of "best influencers" in a social network online while repeatedly interacting with it. We address the challenges of (i) combinatorial action space, since the number of feasible influencer sets grows exponentially with the maximum number of influencers, and (ii) limited feedback, since only the influenced portion of the network is observed. Under a ...
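A minimal sketch of one independent-cascade diffusion, the process that generates the semi-bandit feedback above; the toy graph and the uniform activation probability p are illustrative assumptions.

    import random

    def independent_cascade(graph, seeds, p=0.1, seed=0):
        rng = random.Random(seed)
        active = set(seeds)
        frontier = list(seeds)
        while frontier:
            next_frontier = []
            for u in frontier:
                for v in graph.get(u, []):
                    # a newly active node gets one chance to activate each inactive neighbor
                    if v not in active and rng.random() < p:
                        active.add(v)
                        next_frontier.append(v)
            frontier = next_frontier
        return active                             # the observed (influenced) portion of the network

    graph = {0: [1, 2], 1: [2, 3], 2: [3], 3: []}
    print(independent_cascade(graph, seeds={0}, p=0.5))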
- Daniele Calandriello, Alessandro Lazaric, Michal Valko:
Second-order kernel online convex optimization with adaptive sketching, in International Conference on Machine Learning
(ICML 2017)
arXiv preprint
bibtex
talk
poster
abstract
Kernel online convex optimization (KOCO) is a framework combining the expressiveness of non-parametric kernel models with the regret guarantees of online learning. First-order KOCO methods such as functional gradient descent require only O(t) time and space per iteration, and, when the only information on the losses is their convexity, achieve a minimax optimal O(sqrt(T)) regret. Nonetheless, many common losses in kernel problems, such as squared loss, logistic loss, and squared hinge loss poss...
- Guillaume Gautier, Rémi Bardenet, Michal Valko:
Zonotope hit-and-run for efficient sampling from projection DPPs, in International Conference on Machine Learning
(ICML 2017)
arXiv preprint
bibtex
talk
poster
abstract
Determinantal point processes (DPPs) are distributions over sets of items that model diversity using kernels. Their applications in machine learning include summary extraction and recommendation systems. Yet, the cost of sampling from a DPP is prohibitive in large-scale applications, which has triggered an effort towards efficient approximate samplers. We build a novel MCMC sampler that combines ideas from combinatorial geometry, linear programming, and Monte Carlo methods to sample from DPPs...
- Daniele Calandriello, Alessandro Lazaric, Michal Valko:
Distributed adaptive sampling for kernel matrix approximation, in International Conference on Artificial Intelligence and Statistics
(AISTATS 2017) and (ICML 2017 - LL)
arXiv preprint
bibtex
talk
code
poster
abstract
Most kernel-based methods, such as kernel regression, kernel PCA, ICA, or k-means clustering, do not scale to large datasets, because constructing and storing the kernel matrix K_n requires at least O(n^2) time and space for n samples. Recent works (Alaoui 2014, Musco 2016) show that sampling points with replacement according to their ridge leverage scores (RLS) generates small dictionaries of relevant points with strong spectral approximation guarantees for K_n. The drawback of RLS-based method...
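A minimal numpy sketch of ridge leverage scores and RLS-proportional sampling; the RBF kernel, the lambda * n scaling, and the dictionary size are illustrative assumptions (one of several conventions in the literature).

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dists / 2.0)                     # RBF kernel matrix K_n

    def ridge_leverage_scores(K, lam=0.1):
        n = len(K)
        # tau_i = [K (K + lam * n * I)^{-1}]_{ii}
        return np.diag(K @ np.linalg.inv(K + lam * n * np.eye(n)))

    tau = ridge_leverage_scores(K)
    probs = tau / tau.sum()
    dictionary = rng.choice(len(K), size=20, replace=True, p=probs)   # sample columns by RLS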
- Akram Erraqabi, Alessandro Lazaric, Michal Valko, Emma Brunskill, Yu-En Liu:
Trading off rewards and errors in multi-armed bandits, in International Conference on Artificial Intelligence and Statistics
(AISTATS 2017)
bibtex
poster
abstract
In multi-armed bandits, the most common objective is the maximization of the cumulative reward. Alternative settings include active exploration, where a learner tries to gain accurate estimates of the rewards of all arms. While these objectives are contrasting, in many scenarios it is desirable to trade off rewards and errors. For instance, in educational games the designer wants to gather generalizable knowledge about the behavior of the students and teaching strategies (small estimation err...
2016
- Michal Valko:
Bandits on graphs and structures,
habilitation thesis, École normale supérieure de Cachan
(ENS Cachan 2016)
bibtex
talk
abstract
We investigate the structural properties of certain sequential decision-making problems with limited feedback (bandits) in order to bring the known algorithmic solutions closer to a practical use. In the first part, we put a special emphasis on structures that can be represented as graphs on actions; in the second part, we study the large action spaces that can be of exponential size in the number of base actions or even infinite. We show how to take advantage of structures over the actions an...
- Jean-Bastien Grill, Michal Valko, Rémi Munos:
Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning,
in Neural Information Processing Systems
(NeurIPS 2016)
bibtex
talk
poster
[full oral presentation - 1.8% acceptance rate]
abstract
We present TrailBlazer, a Monte-Carlo planning algorithm that exploits the structure of MDPs by exploring only near-optimal states. Sample complexity guarantees depend on a measure of the quantity of near-optimal states rather than the full state space.
- Daniele Calandriello, Alessandro Lazaric, Michal Valko:
Pack only the essentials: Adaptive dictionary learning for kernel ridge regression, in Adaptive and Scalable Nonparametric Methods in Machine Learning at Neural Information Processing Systems
(NeurIPS 2016 - ASNMML)
bibtex
poster
abstract
Most kernel-based methods, such as kernel regression, kernel PCA, ICA, or k-means clustering, do not scale to large datasets, because constructing and storing the kernel matrix K_n requires at least O(n^2) time and space for n samples. Recent works (Alaoui 2014, Musco 2016) show that sampling points with replacement according to their ridge leverage scores (RLS) generates small dictionaries of relevant points with strong spectral approximation guarantees for K_n. The drawback of RLS-based method...
- Akram Erraqabi, Alessandro Lazaric, Michal Valko, Emma Brunskill, Yu-En Liu:
Rewards and Errors in Multi-armed Bandit for Interactive Education, in Challenges in Machine Learning:
Learning and Education workshop at Neural Information Processing Systems
(NeurIPS 2016 - CIML)
bibtex
poster
abstract
In multi-armed bandits, the most common objective is the maximization of the cumulative reward. Alternative settings include active exploration, where a learner tries to gain accurate estimates of the rewards of all arms. While these objectives are contrasting, in many scenarios it is desirable to trade off rewards and errors. For instance, in educational games the designer wants to gather generalizable knowledge about the behavior of the students and teaching strategies (small estimation err...
- Akram Erraqabi, Michal Valko, Alexandra Carpentier, Odalric-Ambrym Maillard:
Pliable rejection sampling, in International Conference on Machine Learning
(ICML 2016)
bibtex
talk
long talk
poster
abstract
Rejection sampling is a technique for sampling from difficult distributions. However, its use is limited due to a high rejection rate. Common adaptive rejection sampling methods either work only for very specific distributions or come without performance guarantees. In this paper, we present pliable rejection sampling (PRS), a new approach to rejection sampling, where we learn the sampling proposal using a kernel estimator. Since our method builds on rejection sampling, the samples obtained are wi...
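A minimal sketch in the spirit of PRS: rejection sampling whose proposal is a kernel density estimate fit to pilot samples; the target density, the pilot samples, and the grid-estimated envelope constant are illustrative assumptions rather than the paper's exact procedure.

    import numpy as np
    from scipy.stats import gaussian_kde, norm

    def target_pdf(x):                              # a bimodal target, for illustration
        return 0.5 * norm.pdf(x, -2, 0.5) + 0.5 * norm.pdf(x, 2, 0.5)

    rng = np.random.default_rng(0)
    pilot = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 0.5, 200)])
    proposal = gaussian_kde(pilot)                  # learned sampling proposal

    grid = np.linspace(-6, 6, 2000)
    M = 1.1 * np.max(target_pdf(grid) / proposal.evaluate(grid))   # estimated envelope constant

    samples = []
    while len(samples) < 1000:
        x = proposal.resample(1)[0, 0]
        if rng.random() < target_pdf(x) / (M * proposal.evaluate(x)[0]):
            samples.append(x)                       # accepted draws follow the target (given a valid M)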
- Daniele Calandriello, Alessandro Lazaric, Michal Valko:
Analysis of Nyström method with sequential ridge leverage scores, in Uncertainty in Artificial Intelligence
(UAI 2016)
bibtex
poster
spotlight
abstract
Large-scale kernel ridge regression (KRR) is limited by the need to store a large kernel matrix K_t. To avoid storing the entire matrix K_t, Nyström methods subsample a subset of columns of the kernel matrix, and efficiently find an approximate KRR solution on the reconstructed K_t. The chosen subsampling distribution in turn affects the statistical and computational tradeoffs. For KRR problems, [15, 1] show that a sampling distribution proportional to the ridge leverage scores (RLSs) provides...
- Tomáš Kocák, Gergely Neu, Michal Valko:
Online learning with Erdős-Rényi side-observation graphs, in Uncertainty in Artificial Intelligence
(UAI 2016)
bibtex
poster
spotlight
abstract
We consider adversarial multi-armed bandit problems where the learner is allowed to observe losses of a number of arms beside the arm that it actually chose. We study the case where all non-chosen arms reveal their loss with an unknown probability r_t, independently of each other and the action of the learner. Moreover, we allow r_t to change in every round t, which rules out the possibility of estimating r_t by a well-concentrated sample average. We propose an algorithm which operates under the...
- Mohammad Ghavamzadeh, Yaakov Engel, Michal Valko:
Bayesian policy gradient and actor-critic algorithms,
Journal of Machine Learning Research
(JMLR 2016)
bibtex
code
code (MATLAB)
abstract
Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Many conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. The policy is improved by adjusting the parameters in the direction of the gradient estimate. Since Monte-Carlo methods tend to have high variance, a large number of samples is required to attain accurate estimates, resulting in slow convergence. In this pape...
- Tomáš Kocák, Gergely Neu, Michal Valko:
Online learning with noisy side observations, in International Conference on Artificial Intelligence and Statistics
(AISTATS 2016)
bibtex
talk
poster
[full oral presentation - 6% acceptance rate]
abstract
We propose a new partial-observability model for online learning problems where the learner, besides its own loss, also observes some noisy feedback about the other actions, depending on the underlying structure of the problem. We represent this structure by a weighted directed graph, where the edge weights are related to the quality of the feedback shared by the connected nodes. Our main contribution is an efficient algorithm that guarantees a regret of O(sqrt(alpha...
- Alexandra Carpentier, Michal Valko:
Revealing graph bandits for maximizing local influence, in International Conference on Artificial Intelligence and Statistics
(AISTATS 2016)
bibtex
poster
abstract
We study a graph bandit setting where the objective of the learner is to detect the most influential node of a graph by requesting as little information from the graph as possible. One of the relevant applications for this setting is marketing in social networks, where the marketer aims at finding and taking advantage of the most influential customers. The existing approaches for bandit problems on graphs require either partial or complete knowledge of the graph. In this paper, we do not assu...
2015
- Jean-Bastien Grill, Michal Valko, Rémi Munos:
Black-box optimization of noisy functions with unknown smoothness,
in Neural Information Processing Systems
(NeurIPS 2015)
bibtex
code,
code in R
poster
abstract
We study black-box optimization of a function with unknown smoothness. Our algorithm POO (parallel optimistic optimization) performs almost as well as the best known algorithms requiring knowledge of the smoothness, with error at most a factor of sqrt(ln n) away from the optimal.
- Alexandra Carpentier, Michal Valko:
Simple regret for infinitely many armed bandits, in International Conference on Machine Learning
(ICML 2015)
bibtex
talk
poster
arXiv preprint
abstract
We consider a stochastic bandit problem with infinitely many arms. In this setting, the learner has no chance of trying all the arms even once and has to dedicate its limited number of samples only to a certain number of arms. All previous algorithms for this setting were designed for minimizing the cumulative regret of the learner. In this paper, we propose an algorithm aiming at minimizing the simple regret. As in the cumulative regret setting of infinitely many armed bandits, the rate of t...
- Manjesh Hanawal, Venkatesh Saligrama, Michal Valko, Rémi Munos:
Cheap Bandits, in International Conference on Machine Learning
(ICML 2015)
arXiv preprint
bibtex
talk
poster
abstract
We consider stochastic sequential learning problems where the learner can observe the average reward of several actions. Such a setting is interesting in many applications involving monitoring and surveillance, where the set of actions to observe represents some (geographical) area. The importance of this setting is that in these applications, it is actually cheaper to observe the average reward of a group of actions rather than the reward of a single action. We show that when the reward is sm...
- Daniele Calandriello, Alessandro Lazaric, Michal Valko:
Large-scale semi-supervised learning with online spectral graph sparsification, in Resource-Efficient Machine Learning workshop at International Conference on Machine Learning
(ICML 2015 - REML)
bibtex
poster
abstract
We introduce Sparse-HFS, a scalable algorithm that can compute solutions to SSL problems using only O(n polylog(n)) space and O(m polylog(n)) time.
- Julien Audiffren, Michal Valko, Alessandro Lazaric, Mohammad Ghavamzadeh:
Maximum Entropy Semi-Supervised Inverse Reinforcement Learning,
in International Joint Conferences on Artificial Intelligence
(IJCAI 2015) bibtex
talk
poster
abstract
A popular approach to apprenticeship learning (AL) is to formulate it as an inverse reinforcement learning (IRL) problem. The MaxEnt-IRL algorithm successfully integrates the maximum entropy principle into IRL and unlike its predecessors, it resolves the ambiguity arising from the fact that a possibly large number of policies could match the expert's behavior. In this paper, we study an AL setting in which in addition to the expert's trajectories, a number of unsupervised trajectories is avai...
2014
- Tomáš Kocák, Gergely Neu, Michal Valko, Rémi Munos:
Efficient Learning by Implicit Exploration in Bandit Problems with Side Observations,
in Neural Information Processing Systems
(NeurIPS 2014) bibtex
talk
poster
abstract
We consider online learning problems under a partial observability model capturing situations where the information conveyed to the learner is between full information and bandit feedback. In the simplest variant, we assume that in addition to its own loss, the learner also gets to observe losses of some other actions. The revealed losses depend on the learner's action and a directed observation system chosen by the environment. For this setting, we propose the first algorithm that enjoys nea...
- Alexandra Carpentier, Michal Valko:
Extreme Bandits,
in Neural Information Processing Systems
(NeurIPS 2014) bibtex
poster
abstract
In many areas of medicine, security, and life sciences, we want to allocate limited resources to different sources in order to detect extreme values. In this paper, we study an efficient way to allocate these resources sequentially under limited feedback. While sequential design of experiments is well studied in bandit theory, the most commonly optimized property is the regret with respect to the maximum mean reward. However, in other problems such as network intrusion detection, we are inter...
- Gergely Neu, Michal Valko:
Online Combinatorial Optimization with Stochastic Decision Sets and Adversarial Losses,
in Neural Information Processing Systems
(NeurIPS 2014) bibtex
talk
poster
abstract
Most work on sequential learning assumes a fixed set of actions that are available all the time. However, in practice, actions can consist of picking subsets of readings from sensors that may break from time to time, road segments that can be blocked or goods that are out of stock. In this paper we study learning algorithms that are able to deal with stochastic availability of such unreliable composite actions. We propose and analyze algorithms based on the Follow-The-Perturbed-Leader predict...
- Michal Valko, Rémi Munos, Branislav Kveton, Tomáš Kocák:
Spectral Bandits for Smooth Graph Functions,
in International Conference on Machine Learning
(ICML 2014) bibtex slides
poster
abstract
We study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for online learning problems involving graphs, such as content-based recommendation. We introduce the notion of an effective dimension and propose algorithms that scale linearly and sublinearly in this dimension.
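A minimal numpy sketch of the smoothness notion above: for the combinatorial Laplacian L = D - W, the quadratic form f^T L f equals the sum over edges of w_ij (f_i - f_j)^2, so it is small when the payoff vector f agrees across neighboring nodes (the toy adjacency below is an illustrative assumption).

    import numpy as np

    W = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)      # undirected adjacency
    L = np.diag(W.sum(axis=1)) - W                 # combinatorial graph Laplacian

    f_smooth = np.array([1.0, 1.0, 1.1, 1.2])      # similar payoffs on neighboring nodes
    f_rough = np.array([1.0, -1.0, 1.0, -1.0])     # payoffs that disagree across edges

    print(f_smooth @ L @ f_smooth)                 # small smoothness penalty (0.03)
    print(f_rough @ L @ f_rough)                   # large smoothness penalty (12.0)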
- Tomáš Kocák, Michal Valko, Rémi Munos, Shipra Agrawal:
Spectral Thompson Sampling,
in AAAI Conference on Artificial Intelligence
(AAAI 2014) bibtex
slides
poster
abstract
Thompson Sampling (TS) has attracted a lot of interest due to its good empirical performance, in particular in computational advertising. Though successful, the tools for its performance analysis appeared only recently. In this paper, we describe and analyze the SpectralTS algorithm for a bandit problem where the payoffs of the choices are smooth given an underlying graph. In this setting, each choice is a node of a graph and the expected payoffs of the neighboring nodes are assumed to be simil...
- Philippe Preux, Rémi Munos, Michal Valko:
Bandits attack function optimization, in IEEE Congress on Evolutionary Computation
(CEC 2014) bibtex
abstract
We consider function optimization as a sequential decision-making problem under a budget constraint. Such a constraint limits the number of objective function evaluations allowed during the optimization. We consider an algorithm inspired by a continuous version of a multi-armed bandit problem which attacks this optimization problem by solving the tradeoff between exploration (initial quasi-uniform search of the domain) and exploitation (local optimization around the potentially global maxima)...
- Julien Audiffren, Michal Valko, Alessandro Lazaric, Mohammad Ghavamzadeh:
MESSI: Maximum Entropy Semi-Supervised Inverse Reinforcement Learning,
in NIPS Workshop on Novel Trends and Applications in Reinforcement Learning
(NeurIPS 2014 - TCRL) bibtex
abstract
A popular approach to apprenticeship learning (AL) is to formulate it as an inverse reinforcement learning (IRL) problem. The MaxEnt-IRL algorithm successfully integrates the maximum entropy principle into IRL and unlike its predecessors, it resolves the ambiguity arising from the fact that a possibly large number of policies could match the expert's behavior. In this paper, we study an AL setting in which in addition to the expert's trajectories, a number of unsupervised trajectories is avai...
- Tomáš Kocák, Michal Valko, Rémi Munos, Branislav Kveton, Shipra Agrawal:
Spectral Bandits for Smooth Graph Functions with Applications in Recommender Systems,
in AAAI Workshop on Sequential Decision-Making with Big Data
(AAAI 2014 - SDMBD) bibtex
abstract
Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this paper, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each recommended item is a node and its expected rating is similar to that of its neighbors. The goal is to recommend items that have high expected ratings. We aim for the algorithms ...
2013
- Michal Valko, Alexandra Carpentier, Rémi Munos:
Stochastic Simultaneous Optimistic Optimization, in International Conference on Machine Learning
(ICML 2013)
bibtex
demo
code,
code in R
slides
poster
video
talk
abstract
We study the problem of global maximization of a function given noisy evaluations. We consider a weak assumption that the function is locally smooth around one of its global maxima. Our algorithm StoSOO follows an optimistic strategy to iteratively construct upper confidence bounds over hierarchical partitions of the function domain.
- Michal Valko, Nathan Korda, Rémi Munos, Ilias Flaounas, Nello Cristianini:
Finite-Time Analysis of Kernelised Contextual Bandits,
in Uncertainty in Artificial Intelligence
(UAI 2013) and (JFPDA 2013).
bibtex
poster
spotlight
code
abstract
We tackle the problem of online reward maximisation over a large finite set of actions described by their contexts. We focus on the case when the number of actions is too big to sample all of them even once. However, we assume that we have access to the similarities between actions' contexts and that the expected reward is an arbitrary linear function of the contexts' images in the related reproducing kernel Hilbert space (RKHS). We propose KernelUCB, a kernelised UCB algorithm, and give a cum...
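A minimal numpy sketch of a KernelUCB-style index for one candidate context, assuming an RBF kernel, past contexts X with rewards y, regularization gamma, and a fixed exploration parameter eta (the paper's exploration schedule is more refined).

    import numpy as np

    def rbf(a, b, bandwidth=1.0):
        return np.exp(-np.sum((a - b) ** 2) / (2 * bandwidth ** 2))

    def kernel_ucb_index(x, X, y, gamma=1.0, eta=1.0):
        n = len(X)
        K = np.array([[rbf(xi, xj) for xj in X] for xi in X])
        k_x = np.array([rbf(x, xi) for xi in X])
        reg = K + gamma * np.eye(n)
        mean = k_x @ np.linalg.solve(reg, y)                  # kernel ridge estimate of the reward
        var = rbf(x, x) - k_x @ np.linalg.solve(reg, k_x)     # posterior-style width in the RKHS
        return mean + eta * np.sqrt(max(var, 0.0))

At each round the learner would compute this index for every action's context and pull the argmax.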
- Branislav Kveton, Michal Valko:
Learning from a Single Labeled Face and a Stream of Unlabeled Data,
in IEEE International Conference on Automatic Face and Gesture Recognition
(FG 2013)
[spotlight]
bibtex
abstract
Face recognition from a single image per person is a challenging problem because the training sample is extremely small. We consider a variation of this problem. In our problem, we recognize only one person, and there are no labeled data for any other person. This setting naturally arises in authentication on personal computers and mobile devices, and poses additional challenges because it lacks negative examples. We formalize our problem as one-class classification, and propose and analyze a...
- Miloš Hauskrecht, Iyad Batal, Michal Valko, Shyam Visweswaran, Gregory F. Cooper, Gilles Clermont:
Outlier detection for patient monitoring and alerting, in Journal of Biomedical Informatics (JBI 2013) bibtex
abstract
We develop and evaluate a data-driven approach for detecting unusual (anomalous) patient-management decisions using past patient cases stored in electronic health records (EHRs). Our hypothesis is that a patient-management decision that is unusual with respect to past patient care may be due to an error and that it is worthwhile to generate an alert if such a decision is encountered. We evaluate this hypothesis using data obtained from EHRs of 4486 post-cardiac surgical patients and a subset ...
2012
- Michal Valko,
Mohammad Ghavamzadeh, Alessandro Lazaric:
Semi-supervised apprenticeship learning, in Journal of Machine Learning Research Workshop and Conference Proceedings: European Workshop on Reinforcement Learning (EWRL 2012) bibtex
talk
posterabstract
In apprenticeship learning we aim to learn a good policy by observing the behavior of an expert or a set of experts. In particular, we consider the case where the expert acts so as to maximize an unknown reward function defined as a linear combination of a set of state features. In this paper, we consider the setting where we observe many sample trajectories (i.e., sequences of states) but only one or a few of them are labeled as experts' trajectories. We investigate the conditions under whic...
2011
- Michal Valko,
Branislav Kveton, Hamed Valizadegan, Gregory F. Cooper, Miloš Hauskrecht:
Conditional Anomaly Detection with Soft Harmonic Functions, in International Conference on Data Mining (ICDM 2011) bibtex
abstract
In this paper, we consider the problem of conditional anomaly detection that aims to identify data instances with an unusual response or a class label. We develop a new non-parametric approach for conditional anomaly detection based on the soft harmonic solution, with which we estimate the confidence of the label to detect anomalous mislabeling. We further regularize the solution to avoid the detection of isolated examples and examples on the boundary of the distribution support. We de...
- Thomas C. Hart, Patricia M. Corby, Miloš Hauskrecht, Ok Hee Ryu, Richard Pelikan, Michal Valko, Maria B. Oliveira, Gerald T. Hoehn, and Walter A. Bretz:
Identification of Microbial and Proteomic Biomarkers in
Early Childhood Caries in International Journal of Dentistry (IJD 2011) bibtex
abstract
The purpose of this study was to provide a univariate and multivariate analysis of genomic microbial data and salivary mass-spectrometry proteomic profiles for dental caries outcomes. In order to determine potential useful biomarkers for dental caries, a multivariate classification analysis was employed to build predictive models capable of classifying microbial and salivary sample profiles with generalization performance. We used high-throughput methodologies including multiplexed micr...
- Michal Valko:
Adaptive Graph-Based Algorithms for Conditional Anomaly Detection and Semi-Supervised Learning, PhD thesis, University of Pittsburgh
(PITT 2011) bibtex
abstract
We develop graph-based methods for semi-supervised learning based on label propagation on a data similarity graph. When data are abundant or arrive in a stream, the problems of computation and data storage arise for any graph-based method. We propose a fast approximate online algorithm that solves for the harmonic solution on an approximate graph. We show, both empirically and theoretically, that good behavior can be achieved by collapsing nearby points into a set of local representative...
- Michal Valko, Hamed Valizadegan, Branislav Kveton, Gregory F. Cooper, Miloš Hauskrecht: Conditional Anomaly Detection Using Soft Harmonic Functions: An Application to Clinical Alerting, Workshop on Machine Learning for Global Challenges in International Conference on Machine Learning (ICML 2011 - Global) bibtex poster spotlight
2010
- Michal Valko, Branislav Kveton, Ling Huang, Daniel Ting:
Online Semi-Supervised Learning on Quantized Graphs in
Uncertainty in Artificial Intelligence
(UAI 2010) bibtex
Video: Adaptation,
Video: OfficeSpace,
spotlight
poster
abstract
In this paper, we tackle the problem of online semi-supervised learning (SSL). When data arrive in a stream, the dual problems of computation and data storage arise for any SSL method. We propose a fast approximate online SSL algorithm that solves for the harmonic solution on an approximate graph. We show, both empirically and theoretically, that good behavior can be achieved by collapsing nearby points into a set of local "representative points" that minimize distortion. Moreover, we ...
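A minimal numpy sketch of the exact (batch) harmonic solution that the online algorithm approximates: splitting the Laplacian into labeled and unlabeled blocks, the unlabeled predictions solve L_uu f_u = -L_ul y_l; the 4-node chain below is an illustrative assumption.

    import numpy as np

    W = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)       # similarity graph: a 4-node chain
    L = np.diag(W.sum(axis=1)) - W                  # graph Laplacian

    labeled, unlabeled = [0, 3], [1, 2]
    y_l = np.array([1.0, 0.0])                      # labels of nodes 0 and 3

    L_uu = L[np.ix_(unlabeled, unlabeled)]
    L_ul = L[np.ix_(unlabeled, labeled)]
    f_u = np.linalg.solve(L_uu, -L_ul @ y_l)        # harmonic (label-propagation) solution
    print(f_u)                                      # [2/3, 1/3]: interpolates between the labels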
- Branislav Kveton, Michal Valko, Ali Rahimi, Ling Huang:
Semi-Supervised Learning with Max-Margin Graph Cuts in
International Conference on Artificial Intelligence and Statistics
(AISTATS 2010) bibtex
abstract
This paper proposes a novel algorithm for semi-supervised learning. This algorithm learns graph cuts that maximize the margin with respect to the labels induced by the harmonic function solution. We motivate the approach, compare it to existing work, and prove a bound on its generalization error. The quality of our solutions is evaluated on a synthetic problem and three UCI ML repository datasets. In most cases, we outperform manifold regularization of support vector machines, which is ...
- Miloš Hauskrecht, Michal Valko, Shyam Visweswaran, Iyad Batal, Gilles Clermont, Gregory Cooper:
Conditional Outlier Detection for Clinical Alerting in Annual American Medical Informatics Association conference
(AMIA 2010) bibtex
[Homer Warner Best Paper Award]
abstract
We develop and evaluate a data-driven approach for detecting unusual (anomalous) patient-management actions using past patient cases stored in an electronic health record (EHR) system. Our hypothesis is that patient-management actions that are unusual with respect to past patients may be due to a potential error and that it is worthwhile to raise an alert if such a condition is encountered. We evaluate this hypothesis using data obtained from the electronic health records of 4,486 post-cardia...
- Branislav Kveton, Michal Valko, Matthai Philipose, Ling Huang:
Online Semi-Supervised Perception: Real-Time Learning without Explicit Feedback in
IEEE Online Learning for Computer Vision Workshop in The
IEEE Conference on Computer Vision and Pattern Recognition
(CVPR 2010 - OLCV)
[best paper Google Award]
bibtex
abstract
This paper proposes an algorithm for real-time learning without explicit feedback. The algorithm combines the ideas of semi-supervised learning on graphs and online learning. In particular, it iteratively builds a graphical representation of its world and updates it with observed examples. Labeled examples constitute the initial bias of the algorithm and are provided offline, and a stream of unlabeled examples is collected online to update this bias. We motivate the algorithm, discuss h...
- Michal Valko, Miloš Hauskrecht:
Feature importance analysis for patient management decisions in
International Congress on Medical Informatics
(MEDINFO 2010) bibtex
abstract
The objective of this paper is to understand what characteristics and features of clinical data influence a physician's decision about ordering laboratory tests or prescribing medications the most. We conduct our analysis on data and decisions extracted from electronic health records of 4486 post-surgical cardiac patients. The summary statistics for 335 different lab order decisions and 407 medication decisions are reported. We show that in many cases, a physician's lab-order and medica...
2008
- Michal Valko,
Gregory Cooper, Amy Seybert,
Shyam Visweswaran, Melissa Saul, Miloš Hauskrecht:
Conditional anomaly detection methods for patient-management alert systems, Workshop on Machine Learning in Health Care Applications in International Conference on Machine Learning (ICML-2008 - MLHealth) bibtex
talk
abstract
Anomaly detection methods can be very useful in identifying unusual or interesting patterns in data. A recently proposed conditional anomaly detection framework extends anomaly detection to the problem of identifying anomalous patterns on a subset of attributes in the data. The anomaly always depends (is conditioned) on the value of remaining attributes. The work presented in this paper focuses on instance-based methods for detecting conditional anomalies. The methods rely on the distance met...
- Michal Valko, Miloš Hauskrecht: Distance metric learning for conditional anomaly detection, International Florida AI Research Society Conference (FLAIRS 2008)
bibtex
abstract
Anomaly detection methods can be very useful in identifying unusual or interesting patterns in data. A recently proposed conditional anomaly detection framework extends anomaly detection to the problem of identifying anomalous patterns on a subset of attributes in the data. The anomaly always depends (is conditioned) on the value of remaining attributes. The work presented in this paper focuses on instance-based methods for detecting conditional anomalies. The methods depend heavily on the di...
- Michal Valko, Richard Pelikan, Miloš Hauskrecht: Learning predictive models for combinations of heterogeneous
proteomic data sources, AMIA
Summit on Translational Bioinformatics (STB 2008) [outstanding paper award] bibtex
talk
abstract
Multiple technologies that measure expression levels of protein mixtures in the human body offer potential for detecting and understanding disease. The recent increase in these technologies prompts researchers to evaluate the individual and combined utility of the data they generate. In this work, we study two data sources to measure the expression of protein mixtures in the human body: whole-sample MS profiling and multiplexed protein arrays. We investigate the indiv...
2007
- Miloš Hauskrecht, Michal Valko, Branislav Kveton, Shyam Visweswaran, Gregory Cooper: Evidence-based Anomaly Detection in Clinical Domains in Annual American Medical Informatics Association conference (AMIA 2007).
[nominated for the best paper award] bibtex
abstract
Anomaly detection methods can be very useful in identifying interesting or concerning events. In this work, we develop and examine new probabilistic anomaly detection methods that let us evaluate management decisions for a specific patient and identify those decisions that are highly unusual with respect to patients with the same or similar condition. The statistics used in this detection are derived from probabilistic models such as Bayesian networks that are learned from a database of past ...
2006
- Wendy W. Chapman, John N. Dowling, Gregory F. Cooper, Miloš Hauskrecht and Michal Valko: A Comparison of Chief Complaints and Emergency Department Reports for Identifying Patients with Acute Lower Respiratory Syndrome in
Proceedings of the National Syndromic Surveillance Conference
(ISDS 2006)
bibtex
abstract
Automated syndromic surveillance systems often classify patients into syndromic categories based on free-text chief complaints. Chief complaints (CC) demonstrate low to moderate sensitivity in identifying syndromic cases. Emergency Department (ED) reports promise more detailed clinical information that may increase sensitivity of detection.
- Miloš Hauskrecht, Richard Pelikan, Michal Valko, James Lyons-Weiler: Feature Selection and Dimensionality Reduction in Genomics and Proteomics. Fundamentals of Data Mining in Genomics and Proteomics, eds. Berrar, Dubitzky, Granzow. Springer (2006)
bibtex
abstract
Finding reliable, meaningful patterns in data with high numbers of attributes can be extremely difficult. Feature selection helps us to decide what attributes or combination of attributes are most important for finding these patterns. In this chapter, we study feature selection methods for building classification models from high-throughput genomic (microarray) and proteomic (mass spectrometry) data sets. Thousands of feature candidates must be analyzed, compared and combined in such data set...
2005
- Michal Valko, Nuno C. Marques, Marco Castelani: Evolutionary Feature Selection for Spiking Neural Network Pattern Classifiers
in Proceedings of Portuguese Conference on Artificial Intelligence (EPIA 2005),
eds. Bento et al., IEEE, pages 24-32. bibtex
abstract
This paper presents an application of the biologically realistic JASTAP neural network model to classification tasks. The JASTAP neural network model is presented as an alternative to the basic multi-layer perceptron model. An evolutionary procedure previously applied to the simultaneous solution of feature selection and neural network training on standard multi-layer perceptrons is extended to the JASTAP model. Preliminary results on the standard IRIS data set give evidence that this extensi...
- Michal Valko: Evolving Neural Networks for Statistical Decision Theory,
master thesis, Comenius University, Bratislava (2005), Advisor: Radoslav Harman
bibtex
bibtex (alt)
talk
abstract
Real biological networks are able to make decisions. We will show that this behavior can be observed even in some simple architectures of biologically plausible neural models. A further aim of this thesis is to contribute to methods of statistical decision theory by showing how to evolve such neural networks to solve various decision tasks.
older preprints
- Pierre Perrault, Jennifer Healey, Zheng Wen, Michal Valko:
On the approximation relationship between optimizing ratio of submodular (RS) and difference of submodular (DS) functions,
arXiv preprint
bibtex
abstract
We explore the connection between ratio of submodular functions (RS) and difference of submodular functions (DS) optimization. We show that an algorithm for RS can be transformed into one for DS, and vice versa, and we analyze a greedy algorithm for these problems.
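A minimal sketch of the classical greedy step that such analyses build on: repeatedly add the element with the largest marginal gain of a set function; the coverage function below is an illustrative stand-in for a submodular objective, not the paper's RS/DS procedure.

    def greedy(f, ground_set, k):
        S = set()
        for _ in range(k):
            gain = lambda e: f(S | {e}) - f(S)          # marginal gain of adding e to S
            best = max((e for e in ground_set if e not in S), key=gain)
            S.add(best)
        return S

    # coverage (a monotone submodular function) over a few ground sets
    sets = {0: {1, 2, 3}, 1: {3, 4}, 2: {5}, 3: {1, 5, 6}}
    coverage = lambda S: len(set().union(*(sets[e] for e in S))) if S else 0
    print(greedy(coverage, sets.keys(), k=2))           # {0, 3}, covering 5 items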
- Branislav Kveton, Zheng Wen, Azin Ashkan, Michal Valko:
Learning to Act Greedily: Polymatroid Semi-Bandits,
accepted for publication in the Journal of Machine Learning Research
(JMLR)
bibtex
PDF, arXiv preprint
abstract
Many important optimization problems, such as the minimum spanning tree and minimum-cost flow, can be solved optimally by a greedy method. In this work, we study a learning variant of these problems, where the model of the problem is unknown and has to be learned by interacting repeatedly with the environment in the bandit setting. We formalize our learning problem quite generally, as learning how to maximize an unknown modular function on a known polymatroid. We propose a computationally eff...
Presentations
- Michal Valko: Graph-Based Anomaly Detection with Soft Harmonic Functions: Presented at CS Department Research Competition (Research 2011) [#1st place] talk also at (Grad Expo 2011) and (CS Day 2011) poster
- Branislav Kveton, Michal Valko, Matthai Philipose: Real-Time Adaptive Face Recognition, Presented at
23rd Neural Information Processing Systems conference (NeurIPS 2009),
Video: Adaptation,
Video: OfficeSpace,
demo,
poster #1,
poster #2
bibtex
abstract
We develop a non-parametric approach for conditional anomaly detection based on soft harmonic solutions, estimating label confidence to detect anomalous mislabeling.
- Michal Valko, Branislav Kveton, Matthai Philipose: Robust Face Recognition Using Online Learning, Presented at 9th University of Pittsburgh Science conference (SCIENCE 2009) Live Demonstration (Grad Expo 2010) talk and (CS Day 2010) poster
- Michal Valko: Conditional anomaly detection with adaptive similarity metric: Presented at CS Department Research Competition (Research 2008) [#1st place] talk
- Michal Valko, Miloš Hauskrecht, G. Cooper, S. Visweswaran, M. Saul, A. Seybert, J. Harrison, A. Post: Conditional Anomaly Detection, Presented at (CS Day 2008) [#1st by people, #2nd by faculty] also at University of Pittsburgh, Arts & Sciences (Grad Expo 2008) poster
References
- bibtex file with references I often use