Michal Valko: Projects

RLHF and LLM Alignment

2023 - present

We develop methods for aligning large language models (LLMs) with human preferences. Our main contribution is Nash Learning from Human Feedback (NLHF), an alternative to RLHF that learns a preference model and finds the Nash equilibrium policy. We also develop online preference optimization, offline alignment methods (GPO), decoding-time realignment, and theoretical frameworks for understanding preference learning. Our work has been applied to Llama 3 and addresses challenges in reward modeling, multi-sample comparisons, and metacognitive capabilities of LLMs.
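
As a rough illustration of the objective (a simplified, unregularized form; the published formulation adds regularization toward a reference policy), the NLHF target policy is a Nash equilibrium of the two-player game induced by the learned preference model:

```latex
% Simplified sketch of the NLHF objective (regularization omitted):
% given a learned preference model P(y \succ y' \mid x), the aligned policy
% is a Nash equilibrium of the induced two-player preference game.
\pi^\ast \in \arg\max_{\pi} \min_{\pi'}
  \; \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot\mid x),\; y' \sim \pi'(\cdot\mid x)}
  \big[ P(y \succ y' \mid x) \big]
```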

Zero-Sum Games and Imperfect Information

2021 - 2025

We develop algorithms for learning optimal strategies in zero-sum games with imperfect information. Our contributions include model-free learning for partially observable Markov games, adaptive algorithms for extensive-form games with local and adaptive mirror descent, and last-iterate convergence guarantees for uncoupled learning. Our work on adapting to game trees in zero-sum imperfect information games received the outstanding paper award at ICML 2023. We also study multiagent evaluation under incomplete information.
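
For intuition, here is a generic self-play sketch for matrix games. It illustrates uncoupled no-regret learning with plain multiplicative weights and averaged iterates; it is not the adaptive or last-iterate algorithms from the papers, and the game, step size, and horizon are made up.

```python
import numpy as np

# Illustrative sketch only: self-play with multiplicative weights (entropic
# mirror descent) on a zero-sum matrix game. The *averaged* strategies
# approach a Nash equilibrium; the adaptive and last-iterate results above
# use refined (e.g. optimistic, local) variants.
def mw_selfplay(A, eta=0.1, T=5000):
    m, n = A.shape
    x = np.ones(m) / m                      # row player (maximizer)
    y = np.ones(n) / n                      # column player (minimizer)
    x_avg, y_avg = np.zeros(m), np.zeros(n)
    for _ in range(T):
        gx, gy = A @ y, A.T @ x             # payoffs of pure actions vs current mix
        x = x * np.exp(eta * gx);  x /= x.sum()
        y = y * np.exp(-eta * gy); y /= y.sum()
        x_avg += x
        y_avg += y
    return x_avg / T, y_avg / T

# Matching pennies: the unique equilibrium is uniform play for both players.
A = np.array([[1., -1.], [-1., 1.]])
print(mw_selfplay(A))
```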

Sample-efficient Monte-Carlo tree search: TrailBlazer, SmoothCruiser, StoSOO, POO, and OOB

2011 - present
with Jean-Bastien Grill, Rémi Munos, Alexandra Carpentier

Monte-Carlo planning and Monte-Carlo tree search were popularized by computer Go. Our first contributions are generic black-box function optimizers for extremely difficult functions that come with theoretical guarantees, with hyper-parameter tuning as the main application. The second set of contributions is in planning, including TrailBlazer, an adaptive planning algorithm for MDPs.

Posterior Sampling and Bayesian RL

2022 - 2024

We develop posterior sampling methods for reinforcement learning that achieve optimism without explicit bonus terms. Our contributions include optimistic posterior sampling with tight guarantees, model-free posterior sampling via learning rate randomization, fast rates for maximum entropy exploration, and demonstration-regularized RL. We also provide sharp deviation bounds for Dirichlet weighted sums and new bounds on cumulant generating functions of Dirichlet processes, which are fundamental for analyzing Bayesian algorithms.
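
A minimal sketch of the underlying mechanism in a toy tabular, finite-horizon setting (the function name, the flat Dirichlet prior, and the known-rewards assumption are illustrative; the papers' optimistic and model-free variants differ):

```python
import numpy as np

# Toy posterior-sampling episode for a tabular finite-horizon MDP with known
# rewards: draw one transition model from the Dirichlet posterior and act
# greedily with respect to it; the randomness of the sampled model plays the
# role of an optimism bonus.
def psrl_episode(counts, rewards, H, rng):
    """counts[s, a, s'] = visit counts, rewards[s, a] = known mean rewards."""
    S, A, _ = counts.shape
    P = np.array([[rng.dirichlet(1.0 + counts[s, a]) for a in range(A)]
                  for s in range(S)])               # sampled model, shape (S, A, S)
    Q = np.zeros((H + 1, S, A))
    for h in range(H - 1, -1, -1):                  # backward induction in the sample
        V_next = Q[h + 1].max(axis=1)
        Q[h] = rewards + P @ V_next
    return Q                                        # act with argmax_a Q[h, s, a]
```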

Self-Supervised Learning and BYOL

2020 - 2024

We develop self-supervised learning methods that learn representations without labeled data. Our flagship contribution is Bootstrap Your Own Latent (BYOL), a self-supervised approach that avoids contrastive learning by having a target network bootstrap the online network's representations. BYOL achieves state-of-the-art results in image representation learning and has been extended to exploration in reinforcement learning (BYOL-Explore), video learning, graph representation learning, and neuroscience applications.
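
A minimal numpy sketch of the two mechanisms above, the bootstrapped regression loss and the slowly moving target network (function names and the EMA rate are illustrative, not the published implementation):

```python
import numpy as np

def byol_loss(online_prediction, target_projection):
    # Regression of the online network's prediction of one view onto the target
    # network's projection of the other view; the target side is treated as a
    # constant (stop-gradient). Equals 2 - 2 * cosine similarity.
    p = online_prediction / np.linalg.norm(online_prediction, axis=-1, keepdims=True)
    z = target_projection / np.linalg.norm(target_projection, axis=-1, keepdims=True)
    return np.mean(np.sum((p - z) ** 2, axis=-1))

def ema_update(target_params, online_params, tau=0.996):
    # The target network is an exponential moving average of the online network,
    # so representations bootstrap themselves without negative pairs.
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]
```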

Exploration and Intrinsic Motivation

2020 - 2024

We develop methods for exploration in reinforcement learning using intrinsic motivation and learned representations. Our contributions include unlocking the power of representations in long-term novelty-based exploration, curiosity in hindsight for stochastic environments, density-based bonuses on learned representations for reward-free exploration, geometric entropic exploration, quantile credit assignment, and retrieval-augmented reinforcement learning. These methods enable effective exploration in sparse-reward or reward-free environments.
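
As a toy illustration of a density-based bonus on a learned representation (the class, bin width, and scale are made up; the methods above use more refined density and novelty estimates):

```python
import numpy as np
from collections import defaultdict

class CountBonus:
    """Discretize the embedding phi(s), count visits, and reward rarely
    visited regions with a bonus decaying like 1 / sqrt(count)."""
    def __init__(self, bin_width=0.25, beta=1.0):
        self.counts = defaultdict(int)
        self.bin_width = bin_width
        self.beta = beta

    def __call__(self, embedding):
        key = tuple(np.floor(np.asarray(embedding) / self.bin_width).astype(int))
        self.counts[key] += 1
        return self.beta / np.sqrt(self.counts[key])
```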

Off-Policy and Value Function Learning

2020 - 2023

We develop methods for off-policy reinforcement learning and value function estimation. Our contributions include UCB Momentum Q-learning that corrects bias without forgetting, unified gradient estimators for meta-RL via off-policy evaluation, marginalized operators for off-policy RL, VA-learning as an alternative to Q-learning, Taylor expansion of discount factors and policy optimization, and revisiting Peng's Q(λ) for modern RL. These methods improve sample efficiency and stability in off-policy settings.
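
As one concrete example from this line, a sketch of the multi-step target behind Peng's Q(λ) in a simplified on-trajectory form (the function and argument names are illustrative, and episode termination handling is omitted):

```python
import numpy as np

def pengs_q_lambda_targets(rewards, q_next_max, gamma=0.99, lam=0.7):
    """rewards[t] and q_next_max[t] = max_a Q(s_{t+1}, a) along one trajectory.
    Each target mixes the one-step bootstrap with the multi-step return:
    G_t = r_t + gamma * ((1 - lam) * max_a Q(s_{t+1}, a) + lam * G_{t+1})."""
    g = q_next_max[-1]                       # bootstrap the tail of the trajectory
    targets = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * ((1 - lam) * q_next_max[t] + lam * g)
        targets[t] = g
    return targets
```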

Stochastic Shortest Path and Reward-Free Exploration

2020 - 2022

We study exploration in reinforcement learning when no reward function is provided. We develop algorithms for stochastic shortest path problems with minimax optimal regret bounds that are parameter-free and approach horizon-free. Our work includes provably efficient sample collection strategies, incremental autonomous exploration, goal-oriented exploration, and adaptive multi-goal exploration. These methods enable agents to efficiently explore unknown environments before any task is specified.

Gaussian Process Optimization and Kernel Methods

2019 - 2022

We develop scalable methods for Gaussian process optimization and kernel-based reinforcement learning. Our contributions include adaptive sketching techniques that achieve near-linear time complexity, methods for evaluating candidates multiple times to reduce switch costs, and kernel-based approaches for non-stationary reinforcement learning. We also provide finite-time analysis of kernel-based RL and study episodic RL with minimax lower bounds in metric spaces.
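
For intuition, a plain Nyström approximation with fixed landmark points is below. This is an illustration only: the adaptive methods above choose and reweight landmarks with ridge leverage scores, and the function name and arguments are made up.

```python
import numpy as np

def nystrom_features(K_nm, landmarks):
    """K_nm: the n x m block K[:, landmarks] of the full kernel matrix;
    landmarks: indices of the m chosen points. Returns Z with K ~= Z @ Z.T,
    so downstream solves cost O(n m^2) instead of O(n^3)."""
    K_mm = K_nm[landmarks, :]
    U, s, _ = np.linalg.svd(K_mm)
    K_mm_inv_half = U @ np.diag(1.0 / np.sqrt(np.maximum(s, 1e-12))) @ U.T
    return K_nm @ K_mm_inv_half              # n x m approximate feature map
```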

Determinantal Point Processes (DPPs)

2017 - 2020

Determinantal point processes (DPPs) are probabilistic models for selecting diverse subsets of items. We develop efficient sampling algorithms for DPPs, including exact sampling with sublinear preprocessing, zonotope hit-and-run methods for projection DPPs, and applications to Monte Carlo integration. We also create DPPy, a Python toolbox for DPP sampling, and study fast sampling from β-ensembles. Our methods enable scalable diverse subset selection in machine learning applications.
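
For reference, a compact numpy version of the classical spectral sampler for finite DPPs (the generic exact algorithm, not the sublinear-preprocessing methods above; DPPy provides tested implementations of these and more):

```python
import numpy as np

def sample_dpp(L, seed=None):
    """Exact sample from a finite DPP with PSD likelihood kernel L."""
    rng = np.random.default_rng(seed)
    eigvals, eigvecs = np.linalg.eigh(L)
    # Phase 1: keep eigenvector i independently with probability lambda_i / (1 + lambda_i).
    keep = rng.random(len(eigvals)) < eigvals / (1.0 + eigvals)
    V = eigvecs[:, keep]
    sample = []
    while V.shape[1] > 0:
        # Marginal of the next item: squared row norms of V, normalized.
        probs = np.sum(V ** 2, axis=1)
        probs /= probs.sum()
        i = rng.choice(len(probs), p=probs)
        sample.append(i)
        # Project out the direction of e_i and re-orthonormalize the basis.
        j = np.argmax(np.abs(V[i, :]))
        V = V - np.outer(V[:, j] / V[i, j], V[i, :])
        V = np.delete(V, j, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)
    return sorted(sample)
```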

Combinatorial Bandits and Influence Maximization

2017 - 2020

We study online learning in combinatorial settings, particularly influence maximization in social networks. Our work includes online influence maximization under the independent cascade model with semi-bandit feedback, budgeted online influence maximization, efficient algorithms for matroid semi-bandits that exploit structure of uncertainty, statistical efficiency of Thompson sampling for combinatorial semi-bandits, and covariance-adapting algorithms for semi-bandits. We also develop methods for learning to act greedily in polymatroid settings.

Adaptive structural sampling

2013 - 2020

Many sequential problems require adaptive sampling of a particular kind. Examples include using learning to improve the rejection rate in rejection sampling, sampling under two contradictory objectives such as trading off reward and regret, extreme and infinitely-many-armed bandits, and efficient sampling of determinantal point processes.

SQUEAK: Online sparsification of kernels and graphs

2009 - 2018

My PhD thesis ended with an open question: can efficient spectral sparsifiers fuel online graph-learning methods? We introduce the first streaming dictionary-learning algorithm that operates in a single pass over the dataset. This reduces the overall time required to construct provably accurate dictionaries from quadratic to near-linear, or even logarithmic when parallelized.
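
A rough single-pass sketch of the idea, heavily simplified (hypothetical names, a crude acceptance test, naive recomputation of the dictionary kernel; the actual algorithm maintains and reweights leverage-score estimates far more carefully):

```python
import numpy as np

def stream_dictionary(kernel, points, gamma=1.0, q=2.0, seed=0):
    """Stream points once; keep each with probability proportional to its
    (approximate) ridge leverage score against the current dictionary, so the
    dictionary stays small but spectrally accurate."""
    rng = np.random.default_rng(seed)
    dictionary = []
    for x in points:
        if not dictionary:
            dictionary.append(x)
            continue
        K_DD = np.array([[kernel(a, b) for b in dictionary] for a in dictionary])
        k_xD = np.array([kernel(x, a) for a in dictionary])
        # approximate ridge leverage score of x given the current dictionary
        tau = (kernel(x, x)
               - k_xD @ np.linalg.solve(K_DD + gamma * np.eye(len(dictionary)), k_xD)) / gamma
        if rng.random() < min(1.0, q * tau):
            dictionary.append(x)
    return dictionary
```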

Graph Bandits

2013 - 2017

Bandit problems are online decision-making problems where the only feedback given to the learner is the (noisy) reward of the chosen decision. We study the benefits of homophily (similar actions give similar rewards) under the name of spectral bandits, of side information (well-informed bandits), and of influence maximization (IM bandits). Our algorithms exploit these similarities in order to provably learn faster.
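
To make the spectral-bandit idea concrete, a simplified scoring rule in the Laplacian eigenbasis (names, the confidence constant, and the regularizer are illustrative, not the exact algorithm from the papers):

```python
import numpy as np

def spectral_ucb_scores(U, lambdas, history, c=1.0, reg=1.0):
    """U: n x n matrix of graph-Laplacian eigenvectors (row i = features of arm i),
    lambdas: eigenvalues, history: list of (arm_index, reward) pairs.
    Smooth reward functions put little weight on high-frequency eigenvectors,
    so those coefficients are penalized more."""
    n = U.shape[0]
    A = np.diag(reg + lambdas)               # smoothness-aware regularizer
    b = np.zeros(n)
    for arm, r in history:
        x = U[arm]
        A = A + np.outer(x, x)
        b = b + r * x
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b
    widths = np.sqrt(np.einsum('ij,jk,ik->i', U, A_inv, U))
    return U @ theta + c * widths            # pick the arm with the largest score
```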

Semi-supervised apprenticeship learning

2011 - 2015

In apprenticeship learning we aim to learn good behavior by observing an expert. We consider a setting where we observe many behavior trajectories, but only one or a few of them are labeled as expert trajectories. We investigate the assumptions under which the remaining unlabeled trajectories can aid in learning a well-performing policy.

Composing Learning for Artificial Cognitive Systems (CompLACS)

2011 - 2015

The purpose of this project was to develop a unified toolkit for intelligent control in many different problem areas. This toolkit incorporates many of the most successful approaches to a variety of important control problems within a single framework, including bandit problems, Markov Decision Processes (MDPs), Partially Observable MDPs (POMDPs), continuous stochastic control, and multi-agent systems.

Large-scale semi-supervised learning

2010 - 2013
with Branislav Kveton, Avneesh Saluja

We parallelized an online harmonic solver to process 1 TB of video data in a day. We also worked on multi-manifold learning that can overcome changes in distribution, showing how the online learner adapts to the characters' aging over a 10-year period in the sitcom Married... with Children. The research was part of the Everyday Sensing and Perception (ESP) project.
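
For context, the batch harmonic solution that such solvers approximate online (the standard graph-based semi-supervised learning formula; the function below is a generic illustration, not our parallel solver):

```python
import numpy as np

def harmonic_solution(W, labeled_idx, y_labeled):
    """W: symmetric similarity (adjacency) matrix. Predictions on unlabeled
    nodes are the values making f harmonic: f_u = (D_uu - W_uu)^{-1} W_ul y_l."""
    n = W.shape[0]
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    L = np.diag(W.sum(axis=1)) - W                       # graph Laplacian
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    f_u = np.linalg.solve(L_uu, W_ul @ y_labeled)
    return unlabeled_idx, f_u
```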

Online semi-supervised learning

2009 - 2013
with Branislav Kveton, Ali Rahimi, Ling Huang, Daniel Ting

We extended graph-based semi-supervised learning to the structured case and demonstrated it on handwriting recognition and object detection from video streams. We developed an online algorithm that, on real-world datasets, recognizes faces at 80-90% precision with 90% recall.

Anomaly detection in clinical databases

2007 - 2013

I developed statistical anomaly detection methods for identifying unusual outcomes and patient-management decisions. I combined max-margin learning with a learned distance metric to create an anomaly detector that outperforms the hospital rule for detecting heparin-induced thrombocytopenia. I later scaled the system to 5K patients with 9K features and 743 clinical decisions per day.

Odd-Man-Out

2007 - 2011

We hypothesized that clinical data in emergency department (ED) reports would increase the sensitivity and specificity of case identification for patients with acute lower respiratory syndrome (ALRS). We designed a statistic of disagreement (odd-man-out) to evaluate machine-learning classifiers against expert evaluation in cases where a gold standard is not available.

High-throughput proteomic and genomic data and biomarker discovery

2006 - 2007
with Miloš Hauskrecht, Richard Pelikan, Shuguang Wang

I built a framework for cancer prediction from high-throughput proteomic and genomic data sources. I found a way to merge heterogeneous data sources: my fusion model predicted pancreatic cancer from Luminex data combined with SELDI with 91.2% accuracy.

Evolutionary feature selection algorithms

2005

We enhanced the existing FeaSANNT neural feature selection algorithm with a spiking-neuron model to handle inputs corrupted with up to 10% Gaussian noise.

Plastic Synapses (regularity counting)

2003 - 2005
with Juraj Pavlásek, Radoslav Harman, Ján Jenča

We modeled basic learning functions at the level of synapses. I designed a model that adapts to regular input frequencies at a rate that changes over time. I used genetic programming to find biologically plausible networks that distinguish different gamma distributions, and provided an explanation of the evolved strategies.
