Michal Valko: Projects

RLHF and LLM Alignment

2023 - present

We develop methods for aligning large language models (LLMs) with human preferences. Our main contribution is Nash Learning from Human Feedback (NLHF), an alternative to RLHF that learns a preference model and finds the Nash equilibrium policy. We also develop online preference optimization, offline alignment methods (GPO), decoding-time realignment, and theoretical frameworks for understanding preference learning. Our work has been applied to Llama 3 and addresses challenges in reward modeling, multi-sample comparisons, and metacognitive capabilities of LLMs.
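
As a rough illustration of the objective (a simplified, unregularized form; the published formulation adds regularization toward a reference policy), the NLHF target policy is a Nash equilibrium of the two-player game induced by the learned preference model:

```latex
% Simplified sketch of the NLHF objective (regularization omitted):
% given a learned preference model P(y \succ y' \mid x), the aligned policy
% is a Nash equilibrium of the induced two-player preference game.
\pi^\ast \in \arg\max_{\pi} \min_{\pi'}
  \; \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot\mid x),\; y' \sim \pi'(\cdot\mid x)}
  \big[ P(y \succ y' \mid x) \big]
```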

Zero-Sum Games and Imperfect Information

2021 - 2025

We develop algorithms for learning optimal strategies in zero-sum games with imperfect information. Our contributions include model-free learning for partially observable Markov games, adaptive algorithms for extensive-form games with local and adaptive mirror descent, and last-iterate convergence guarantees for uncoupled learning. Our work on adapting to game trees in zero-sum imperfect information games received the outstanding paper award at ICML 2023. We also study multiagent evaluation under incomplete information.
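
For intuition, here is a generic self-play sketch for matrix games. It illustrates uncoupled no-regret learning with plain multiplicative weights and averaged iterates; it is not the adaptive or last-iterate algorithms from the papers, and the game, step size, and horizon are made up.

```python
import numpy as np

# Illustrative sketch only: self-play with multiplicative weights (entropic
# mirror descent) on a zero-sum matrix game. The *averaged* strategies
# approach a Nash equilibrium; the adaptive and last-iterate results above
# use refined (e.g. optimistic, local) variants.
def mw_selfplay(A, eta=0.1, T=5000):
    m, n = A.shape
    x = np.ones(m) / m                      # row player (maximizer)
    y = np.ones(n) / n                      # column player (minimizer)
    x_avg, y_avg = np.zeros(m), np.zeros(n)
    for _ in range(T):
        gx, gy = A @ y, A.T @ x             # payoffs of pure actions vs current mix
        x = x * np.exp(eta * gx);  x /= x.sum()
        y = y * np.exp(-eta * gy); y /= y.sum()
        x_avg += x
        y_avg += y
    return x_avg / T, y_avg / T

# Matching pennies: the unique equilibrium is uniform play for both players.
A = np.array([[1., -1.], [-1., 1.]])
print(mw_selfplay(A))
```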

Sample-efficient Monte-Carlo tree search: TrailBlazer, SmoothCruiser, StoSOO, POO, and OOB

2011 - present
with Jean-Bastien Grill, Rémi Munos, Alexandra Carpentier

Monte-Carlo planning and Monte-Carlo tree search were popularized by computer Go. Our first contributions are generic black-box function optimizers for extremely difficult functions that come with theoretical guarantees, with hyper-parameter tuning as the main application. The second set of contributions is in planning, including TrailBlazer, an adaptive planning algorithm for MDPs.

Posterior Sampling and Bayesian RL

2022 - 2024

We develop posterior sampling methods for reinforcement learning that achieve optimism without explicit bonus terms. Our contributions include optimistic posterior sampling with tight guarantees, model-free posterior sampling via learning rate randomization, fast rates for maximum entropy exploration, and demonstration-regularized RL. We also provide sharp deviation bounds for Dirichlet weighted sums and new bounds on cumulant generating functions of Dirichlet processes, which are fundamental for analyzing Bayesian algorithms.
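
A minimal sketch of the underlying mechanism in a toy tabular, finite-horizon setting (the function name, the flat Dirichlet prior, and the known-rewards assumption are illustrative; the papers' optimistic and model-free variants differ):

```python
import numpy as np

# Toy posterior-sampling episode for a tabular finite-horizon MDP with known
# rewards: draw one transition model from the Dirichlet posterior and act
# greedily with respect to it; the randomness of the sampled model plays the
# role of an optimism bonus.
def psrl_episode(counts, rewards, H, rng):
    """counts[s, a, s'] = visit counts, rewards[s, a] = known mean rewards."""
    S, A, _ = counts.shape
    P = np.array([[rng.dirichlet(1.0 + counts[s, a]) for a in range(A)]
                  for s in range(S)])               # sampled model, shape (S, A, S)
    Q = np.zeros((H + 1, S, A))
    for h in range(H - 1, -1, -1):                  # backward induction in the sample
        V_next = Q[h + 1].max(axis=1)
        Q[h] = rewards + P @ V_next
    return Q                                        # act with argmax_a Q[h, s, a]
```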

Self-Supervised Learning and BYOL

2020 - 2024

We develop self-supervised learning methods that learn representations without labeled data. Our flagship contribution is Bootstrap Your Own Latent (BYOL), a self-supervised approach that avoids contrastive learning by having a target network bootstrap the online network's representations. BYOL achieves state-of-the-art results in image representation learning and has been extended to exploration in reinforcement learning (BYOL-Explore), video learning, graph representation learning, and neuroscience applications.
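
A minimal numpy sketch of the two mechanisms above, the bootstrapped regression loss and the slowly moving target network (function names and the EMA rate are illustrative, not the published implementation):

```python
import numpy as np

def byol_loss(online_prediction, target_projection):
    # Regression of the online network's prediction of one view onto the target
    # network's projection of the other view; the target side is treated as a
    # constant (stop-gradient). Equals 2 - 2 * cosine similarity.
    p = online_prediction / np.linalg.norm(online_prediction, axis=-1, keepdims=True)
    z = target_projection / np.linalg.norm(target_projection, axis=-1, keepdims=True)
    return np.mean(np.sum((p - z) ** 2, axis=-1))

def ema_update(target_params, online_params, tau=0.996):
    # The target network is an exponential moving average of the online network,
    # so representations bootstrap themselves without negative pairs.
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]
```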

Exploration and Intrinsic Motivation

2020 - 2024

We develop methods for exploration in reinforcement learning using intrinsic motivation and learned representations. Our contributions include unlocking the power of representations in long-term novelty-based exploration, curiosity in hindsight for stochastic environments, density-based bonuses on learned representations for reward-free exploration, geometric entropic exploration, quantile credit assignment, and retrieval-augmented reinforcement learning. These methods enable effective exploration in sparse-reward or reward-free environments.
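
As a toy illustration of a density-based bonus on a learned representation (the class, bin width, and scale are made up; the methods above use more refined density and novelty estimates):

```python
import numpy as np
from collections import defaultdict

class CountBonus:
    """Discretize the embedding phi(s), count visits, and reward rarely
    visited regions with a bonus decaying like 1 / sqrt(count)."""
    def __init__(self, bin_width=0.25, beta=1.0):
        self.counts = defaultdict(int)
        self.bin_width = bin_width
        self.beta = beta

    def __call__(self, embedding):
        key = tuple(np.floor(np.asarray(embedding) / self.bin_width).astype(int))
        self.counts[key] += 1
        return self.beta / np.sqrt(self.counts[key])
```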

Off-Policy and Value Function Learning

2020 - 2023

We develop methods for off-policy reinforcement learning and value function estimation. Our contributions include UCB Momentum Q-learning that corrects bias without forgetting, unified gradient estimators for meta-RL via off-policy evaluation, marginalized operators for off-policy RL, VA-learning as an alternative to Q-learning, Taylor expansion of discount factors and policy optimization, and revisiting Peng's Q(λ) for modern RL. These methods improve sample efficiency and stability in off-policy settings.
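
As one concrete example from this line, a sketch of the multi-step target behind Peng's Q(λ) in a simplified on-trajectory form (the function and argument names are illustrative, and episode termination handling is omitted):

```python
import numpy as np

def pengs_q_lambda_targets(rewards, q_next_max, gamma=0.99, lam=0.7):
    """rewards[t] and q_next_max[t] = max_a Q(s_{t+1}, a) along one trajectory.
    Each target mixes the one-step bootstrap with the multi-step return:
    G_t = r_t + gamma * ((1 - lam) * max_a Q(s_{t+1}, a) + lam * G_{t+1})."""
    g = q_next_max[-1]                       # bootstrap the tail of the trajectory
    targets = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * ((1 - lam) * q_next_max[t] + lam * g)
        targets[t] = g
    return targets
```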

Stochastic Shortest Path and Reward-Free Exploration

2020 - 2022

We study exploration in reinforcement learning when no reward function is provided. We develop algorithms for stochastic shortest path problems with minimax optimal regret bounds that are parameter-free and approach horizon-free. Our work includes provably efficient sample collection strategies, incremental autonomous exploration, goal-oriented exploration, and adaptive multi-goal exploration. These methods enable agents to efficiently explore unknown environments before any task is specified.

Gaussian Process Optimization and Kernel Methods

2019 - 2022

We develop scalable methods for Gaussian process optimization and kernel-based reinforcement learning. Our contributions include adaptive sketching techniques that achieve near-linear time complexity, methods for evaluating candidates multiple times to reduce switch costs, and kernel-based approaches for non-stationary reinforcement learning. We also provide finite-time analysis of kernel-based RL and study episodic RL with minimax lower bounds in metric spaces.
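
For intuition, a plain Nyström approximation with fixed landmark points is below. This is an illustration only: the adaptive methods above choose and reweight landmarks with ridge leverage scores, and the function name and arguments are made up.

```python
import numpy as np

def nystrom_features(K_nm, landmarks):
    """K_nm: the n x m block K[:, landmarks] of the full kernel matrix;
    landmarks: indices of the m chosen points. Returns Z with K ~= Z @ Z.T,
    so downstream solves cost O(n m^2) instead of O(n^3)."""
    K_mm = K_nm[landmarks, :]
    U, s, _ = np.linalg.svd(K_mm)
    K_mm_inv_half = U @ np.diag(1.0 / np.sqrt(np.maximum(s, 1e-12))) @ U.T
    return K_nm @ K_mm_inv_half              # n x m approximate feature map
```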

Determinantal Point Processes (DPPs)

2017 - 2020

Determinantal point processes (DPPs) are probabilistic models for selecting diverse subsets of items. We develop efficient sampling algorithms for DPPs, including exact sampling with sublinear preprocessing, zonotope hit-and-run methods for projection DPPs, and applications to Monte Carlo integration. We also create DPPy, a Python toolbox for DPP sampling, and study fast sampling from β-ensembles. Our methods enable scalable diverse subset selection in machine learning applications.
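
For reference, a compact numpy version of the classical spectral sampler for finite DPPs (the generic exact algorithm, not the sublinear-preprocessing methods above; DPPy provides tested implementations of these and more):

```python
import numpy as np

def sample_dpp(L, seed=None):
    """Exact sample from a finite DPP with PSD likelihood kernel L."""
    rng = np.random.default_rng(seed)
    eigvals, eigvecs = np.linalg.eigh(L)
    # Phase 1: keep eigenvector i independently with probability lambda_i / (1 + lambda_i).
    keep = rng.random(len(eigvals)) < eigvals / (1.0 + eigvals)
    V = eigvecs[:, keep]
    sample = []
    while V.shape[1] > 0:
        # Marginal of the next item: squared row norms of V, normalized.
        probs = np.sum(V ** 2, axis=1)
        probs /= probs.sum()
        i = rng.choice(len(probs), p=probs)
        sample.append(i)
        # Project out the direction of e_i and re-orthonormalize the basis.
        j = np.argmax(np.abs(V[i, :]))
        V = V - np.outer(V[:, j] / V[i, j], V[i, :])
        V = np.delete(V, j, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)
    return sorted(sample)
```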

Combinatorial Bandits and Influence Maximization

2017 - 2020

We study online learning in combinatorial settings, particularly influence maximization in social networks. Our work includes online influence maximization under the independent cascade model with semi-bandit feedback, budgeted online influence maximization, efficient algorithms for matroid semi-bandits that exploit structure of uncertainty, statistical efficiency of Thompson sampling for combinatorial semi-bandits, and covariance-adapting algorithms for semi-bandits. We also develop methods for learning to act greedily in polymatroid settings.

Adaptive structural sampling

2013 - 2020

Many sequential problems require adaptive sampling of a particular kind. Examples include using learning to improve the rejection rate in rejection sampling, sampling under two contradictory objectives such as trading off reward and regret, extreme and infinitely-many-armed bandits, and efficient sampling of determinantal point processes.

SQUEAK: Online sparsification of kernels and graphs

2009 - 2018

My PhD thesis ended with an open question: can efficient spectral sparsifiers fuel online graph-learning methods? We introduce the first streaming dictionary-learning algorithm that operates in a single pass over the dataset. This reduces the overall time required to construct provably accurate dictionaries from quadratic to near-linear, or even logarithmic when parallelized.
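
A rough single-pass sketch of the idea, heavily simplified (hypothetical names, a crude acceptance test, naive recomputation of the dictionary kernel; the actual algorithm maintains and reweights leverage-score estimates far more carefully):

```python
import numpy as np

def stream_dictionary(kernel, points, gamma=1.0, q=2.0, seed=0):
    """Stream points once; keep each with probability proportional to its
    (approximate) ridge leverage score against the current dictionary, so the
    dictionary stays small but spectrally accurate."""
    rng = np.random.default_rng(seed)
    dictionary = []
    for x in points:
        if not dictionary:
            dictionary.append(x)
            continue
        K_DD = np.array([[kernel(a, b) for b in dictionary] for a in dictionary])
        k_xD = np.array([kernel(x, a) for a in dictionary])
        # approximate ridge leverage score of x given the current dictionary
        tau = (kernel(x, x)
               - k_xD @ np.linalg.solve(K_DD + gamma * np.eye(len(dictionary)), k_xD)) / gamma
        if rng.random() < min(1.0, q * tau):
            dictionary.append(x)
    return dictionary
```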

Graph Bandits

2013 - 2017

Bandit problems are online decision-making problems where the only feedback given to the learner is the (noisy) reward of the chosen decision. We study the benefits of homophily (similar actions give similar rewards) under the name of spectral bandits, of side information (well-informed bandits), and of influence maximization (IM bandits). Our algorithms exploit these similarities in order to provably learn faster.
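
To make the spectral-bandit idea concrete, a simplified scoring rule in the Laplacian eigenbasis (names, the confidence constant, and the regularizer are illustrative, not the exact algorithm from the papers):

```python
import numpy as np

def spectral_ucb_scores(U, lambdas, history, c=1.0, reg=1.0):
    """U: n x n matrix of graph-Laplacian eigenvectors (row i = features of arm i),
    lambdas: eigenvalues, history: list of (arm_index, reward) pairs.
    Smooth reward functions put little weight on high-frequency eigenvectors,
    so those coefficients are penalized more."""
    n = U.shape[0]
    A = np.diag(reg + lambdas)               # smoothness-aware regularizer
    b = np.zeros(n)
    for arm, r in history:
        x = U[arm]
        A = A + np.outer(x, x)
        b = b + r * x
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b
    widths = np.sqrt(np.einsum('ij,jk,ik->i', U, A_inv, U))
    return U @ theta + c * widths            # pick the arm with the largest score
```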

Semi-supervised apprenticeship learning

2011 - 2015

In apprenticeship learning we aim to learn good behavior by observing an expert. We consider a setting where we observe many behavior trajectories, but only one or a few of them are labeled as expert trajectories. We investigate the assumptions under which the remaining unlabeled trajectories can aid in learning a well-performing policy.

Composing Learning for Artificial Cognitive Systems (CompLACS)

2011 - 2015

The purpose of this project was to develop a unified toolkit for intelligent control in many different problem areas. This toolkit incorporates many of the most successful approaches to a variety of important control problems within a single framework, including bandit problems, Markov Decision Processes (MDPs), Partially Observable MDPs (POMDPs), continuous stochastic control, and multi-agent systems.

Large-scale semi-supervised learning

2010 - 2013
with Branislav Kveton, Avneesh Saluja

We parallelized an online harmonic solver to process 1 TB of video data in a day. We also worked on multi-manifold learning that can overcome changes in distribution, showing how the online learner adapts to the characters' aging over a 10-year period in the sitcom Married... with Children. The research was part of the Everyday Sensing and Perception (ESP) project.
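
For context, the batch harmonic solution that such solvers approximate online (the standard graph-based semi-supervised learning formula; the function below is a generic illustration, not our parallel solver):

```python
import numpy as np

def harmonic_solution(W, labeled_idx, y_labeled):
    """W: symmetric similarity (adjacency) matrix. Predictions on unlabeled
    nodes are the values making f harmonic: f_u = (D_uu - W_uu)^{-1} W_ul y_l."""
    n = W.shape[0]
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    L = np.diag(W.sum(axis=1)) - W                       # graph Laplacian
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    f_u = np.linalg.solve(L_uu, W_ul @ y_labeled)
    return unlabeled_idx, f_u
```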

Online semi-supervised learning

2009 - 2013
with Branislav Kveton, Ali Rahimi, Ling Huang, Daniel Ting

We extended graph-based semi-supervised learning to the structured case and demonstrated it on handwriting recognition and object detection from video streams. We developed an online algorithm that, on real-world datasets, recognizes faces at 80-90% precision with 90% recall.

Anomaly detection in clinical databases

2007 - 2013

I developed statistical anomaly detection methods for identifying unusual outcomes and patient-management decisions. I combined max-margin learning with a learned distance metric to create an anomaly detector that outperforms the hospital rule for detecting heparin-induced thrombocytopenia. I later scaled the system to 5K patients with 9K features and 743 clinical decisions per day.

Odd-Man-Out

2007 - 2011

We hypothesized that clinical data in emergency department (ED) reports would increase the sensitivity and specificity of case identification for patients with acute lower respiratory syndrome (ALRS). We designed a statistic of disagreement (odd-man-out) to evaluate machine-learning classifiers against expert evaluation in cases where a gold standard is not available.

High-throughput proteomic and genomic data and biomarker discovery

2006 - 2007
with Miloš Hauskrecht, Richard Pelikan, Shuguang Wang

I built a framework for cancer prediction from high-throughput proteomic and genomic data sources. I found a way to merge heterogeneous data sources: my fusion model predicted pancreatic cancer from Luminex data combined with SELDI with 91.2% accuracy.

Evolutionary feature selection algorithms

2005

We enhanced the existing FeaSANNT neural feature selection algorithm with a spiking-neuron model to handle inputs corrupted with up to 10% Gaussian noise.

Plastic Synapses (regularity counting)

2003 - 2005
with Juraj Pavlásek, Radoslav Harman, Ján Jenča

We modeled basic learning functions at the level of synapses. I designed a model that adapts to regular input frequencies at a rate that changes over time. I used genetic programming to find biologically plausible networks that distinguish different gamma distributions, and provided an explanation of the evolved strategies.
