RLHF and LLM Alignment
2023 - present
We develop methods for aligning large language models (LLMs) with human preferences. Our main contribution is Nash Learning from Human Feedback (NLHF), an alternative to RLHF that learns a preference model and finds the Nash equilibrium policy. We also develop online preference optimization, offline alignment methods (GPO), decoding-time realignment, and theoretical frameworks for understanding preference learning. Our work has been applied to Llama 3 and addresses challenges in reward modeling, multi-sample comparisons, and metacognitive capabilities of LLMs.
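In minimal notation (ours, not necessarily the papers'), NLHF treats a learned preference model P(y ≻ y′ | x) as the payoff of a symmetric two-player game between policies and asks for its Nash equilibrium:

```latex
% Sketch of the NLHF objective; \rho is a prompt distribution and
% \mathcal{P} a learned preference model (notation ours).
\pi^\ast \;=\; \arg\max_{\pi}\,\min_{\pi'}\;
  \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot\mid x),\; y' \sim \pi'(\cdot\mid x)}
  \bigl[\,\mathcal{P}(y \succ y' \mid x)\,\bigr].
```

By symmetry, the value of this game at equilibrium is 1/2: no alternative policy is preferred to the Nash policy more than half the time.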
Zero-Sum Games and Imperfect Information
2021 - 2025
We develop algorithms for learning optimal strategies in zero-sum games with imperfect information. Our contributions include model-free learning for partially observable Markov games, local and adaptive mirror-descent algorithms for extensive-form games, and last-iterate convergence guarantees for uncoupled learning. Our work on adapting to game trees in zero-sum imperfect information games received the outstanding paper award at ICML 2023. We also study multiagent evaluation under incomplete information.
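As a hedged illustration of uncoupled learning with last-iterate convergence (a standard algorithm from this literature, not any single paper's method), here is optimistic multiplicative weights in self-play on a random zero-sum matrix game; all names are ours:

```python
import numpy as np

# Optimistic multiplicative weights (OMWU) in self-play on a zero-sum matrix
# game. Each player updates using only its own payoff vector (uncoupled), and
# this family of optimistic updates is known to converge in the last iterate.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))      # payoff matrix for the row player
x = np.ones(3) / 3                   # row player's mixed strategy
y = np.ones(3) / 3                   # column player's mixed strategy
g_x_prev, g_y_prev = A @ y, A.T @ x  # previous gradients, for the optimism term
eta = 0.1

for t in range(2000):
    g_x, g_y = A @ y, A.T @ x
    # Optimistic step: extrapolate with the most recent gradient difference.
    x = x * np.exp(eta * (2 * g_x - g_x_prev))
    y = y * np.exp(-eta * (2 * g_y - g_y_prev))
    x, y = x / x.sum(), y / y.sum()
    g_x_prev, g_y_prev = g_x, g_y

# Exploitability of the last iterate (zero at the Nash equilibrium).
print((A @ y).max() - (A.T @ x).min())
```

Plain multiplicative weights only converges on average; the optimistic correction `2 * g - g_prev` is what drives the last iterate itself to the equilibrium.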
Sample efficient Monte-Carlo tree search: TrailBlazer, SmoothCruiser, StoSOO, POO, and OOB
2011 - present
Monte-Carlo planning and Monte-Carlo tree search have been popularized by the game of computer Go. Our first contribution is a set of generic black-box function optimizers for extremely difficult functions, with guarantees and a main application to hyper-parameter tuning. The second set of contributions is in planning, including TrailBlazer, an adaptive planning algorithm for MDPs.
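For intuition, here is a hedged sketch of the simplest member of this optimizer family, deterministic optimistic optimization over an interval. It assumes a known Lipschitz constant, precisely the knowledge that StoSOO and POO manage to do without (and they additionally tolerate noisy evaluations); names and constants are ours:

```python
import heapq, math

# Deterministic optimistic optimization (DOO) sketch: repeatedly split the
# cell with the highest upper bound f(mid) + L * width / 2.
def doo_maximize(f, lo, hi, lipschitz, n_splits=100):
    mid = (lo + hi) / 2
    best_x, best_val = mid, f(mid)
    # Max-heap (via negated keys) of cells ordered by optimistic upper bound.
    cells = [(-(best_val + lipschitz * (hi - lo) / 2), lo, hi)]
    for _ in range(n_splits):
        _, a, b = heapq.heappop(cells)                 # most optimistic cell
        for aa, bb in ((a, (a + b) / 2), ((a + b) / 2, b)):
            m = (aa + bb) / 2
            val = f(m)
            if val > best_val:
                best_x, best_val = m, val
            heapq.heappush(cells, (-(val + lipschitz * (bb - aa) / 2), aa, bb))
    return best_x, best_val

# A difficult multimodal test function on [0, 1].
print(doo_maximize(lambda x: math.sin(13 * x) * math.sin(27 * x), 0.0, 1.0, 40.0))
```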
Posterior Sampling and Bayesian RL
2022 - 2024
We develop posterior sampling methods for reinforcement learning that achieve optimism without explicit bonus terms. Our contributions include optimistic posterior sampling with tight guarantees, model-free posterior sampling via learning rate randomization, fast rates for maximum entropy exploration, and demonstration-regularized RL. We also provide sharp deviation bounds for Dirichlet weighted sums and new bounds on cumulant generating functions of Dirichlet processes, which are fundamental for analyzing Bayesian algorithms.
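The template these results build on is classical posterior sampling for RL (PSRL): sample a model from the posterior, plan in it, act, update. A minimal tabular sketch with known rewards and an illustrative environment stub; all names are ours:

```python
import numpy as np

# PSRL sketch: sample transition kernels from a Dirichlet posterior, plan in
# the sample, act greedily, update counts. The randomness of the sampled model
# itself plays the role of an optimism bonus; no explicit bonus term appears.
S, A, H = 5, 2, 10
rng = np.random.default_rng(0)
counts = np.ones((S, A, S))                  # Dirichlet(1, ..., 1) prior
R = rng.uniform(size=(S, A))                 # rewards, assumed known here

def plan(P):                                 # finite-horizon value iteration
    policy, V = np.zeros((H, S), dtype=int), np.zeros(S)
    for h in reversed(range(H)):
        Q = R + P @ V                        # P has shape (S, A, S)
        policy[h], V = Q.argmax(axis=1), Q.max(axis=1)
    return policy

def env_step(s, a):                          # stand-in for the real environment
    return rng.integers(S)

for episode in range(50):
    P = np.stack([rng.dirichlet(c) for c in counts.reshape(-1, S)]).reshape(S, A, S)
    policy, s = plan(P), 0
    for h in range(H):
        a = policy[h, s]
        s_next = env_step(s, a)
        counts[s, a, s_next] += 1            # conjugate posterior update
        s = s_next
```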
Self-Supervised Learning and BYOL
2020 - 2024
We develop self-supervised learning methods that learn representations without labeled data. Our flagship contribution is Bootstrap Your Own Latent (BYOL), a self-supervised approach that avoids contrastive learning by having a target network bootstrap the online network's representations. BYOL achieves state-of-the-art results in image representation learning and has been extended to exploration in reinforcement learning (BYOL-Explore), video learning, graph representation learning, and neuroscience applications.
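The core of BYOL fits in a few lines. A hedged PyTorch-style sketch, where `online`, `predictor`, and `target` are stand-ins for the encoder-plus-projector networks and `v1`, `v2` are two augmented views of the same batch:

```python
import torch
import torch.nn.functional as F

def byol_loss(online, predictor, target, v1, v2):
    def regression(view_a, view_b):
        p = F.normalize(predictor(online(view_a)), dim=-1)  # online prediction
        with torch.no_grad():                               # no grad to target
            z = F.normalize(target(view_b), dim=-1)
        return 2 - 2 * (p * z).sum(dim=-1)                  # normalized MSE
    # Symmetrize: each view predicts the other's target projection.
    return (regression(v1, v2) + regression(v2, v1)).mean()

@torch.no_grad()
def ema_update(target, online, tau=0.996):
    # The target network is an exponential moving average of the online one;
    # this slowly moving target is what removes the need for negative pairs.
    for pt, po in zip(target.parameters(), online.parameters()):
        pt.mul_(tau).add_(po, alpha=1 - tau)
```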
Exploration and Intrinsic Motivation
2020 - 2024
We develop methods for exploration in reinforcement learning using intrinsic motivation and learned representations. Our contributions include unlocking the power of representations in long-term novelty-based exploration, curiosity in hindsight for stochastic environments, density-based bonuses on learned representations for reward-free exploration, geometric entropic exploration, quantile credit assignment, and retrieval-augmented reinforcement learning. These methods enable effective exploration in sparse-reward or reward-free environments.
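As a hedged sketch of the common recipe (not any single paper's exact bonus), an intrinsic reward can be derived from distances in a learned representation space:

```python
import numpy as np

# Novelty bonus on learned representations: the intrinsic reward grows with
# the distance from the current embedding to its k nearest neighbors in an
# episodic memory of embeddings seen so far.
def novelty_bonus(embedding, memory, k=10):
    if not memory:
        return 1.0
    d2 = np.sum((np.asarray(memory) - embedding) ** 2, axis=1)
    knn = np.sort(d2)[: min(k, len(d2))]
    return float(np.sqrt(knn.mean()))   # large when far from everything seen

# Usage inside a rollout; `encode` is a stand-in for the learned encoder:
#   e = encode(obs)
#   r_total = r_env + beta * novelty_bonus(e, memory); memory.append(e)
```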
Off-Policy and Value Function Learning
2020 - 2023
We develop methods for off-policy reinforcement learning and value function estimation. Our contributions include UCB Momentum Q-learning that corrects bias without forgetting, unified gradient estimators for meta-RL via off-policy evaluation, marginalized operators for off-policy RL, VA-learning as an alternative to Q-learning, Taylor expansion of discount factors and policy optimization, and revisiting Peng's Q(λ) for modern RL. These methods improve sample efficiency and stability in off-policy settings.
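For example, Peng's Q(λ) targets can be computed with a simple backward recursion, G_t = r_t + γ((1−λ)V(s_{t+1}) + λG_{t+1}), with no off-policy trace cutting; a minimal sketch in our own notation:

```python
import numpy as np

def peng_q_lambda_targets(rewards, bootstrap_values, gamma=0.99, lam=0.7):
    """rewards[t] = r_t; bootstrap_values[t] = V(s_{t+1}), e.g. max_a Q(s_{t+1}, a)."""
    T = len(rewards)
    targets = np.empty(T)
    next_return = bootstrap_values[-1]       # bootstrap at the truncation point
    for t in reversed(range(T)):
        next_return = rewards[t] + gamma * (
            (1 - lam) * bootstrap_values[t] + lam * next_return
        )
        targets[t] = next_return
    return targets
```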
Stochastic Shortest Path and Reward-Free Exploration
2020 - 2022
We study exploration in reinforcement learning when no reward function is provided. We develop algorithms for stochastic shortest path problems with minimax optimal, parameter-free regret bounds that are nearly horizon-free. Our work includes provably efficient sample collection strategies, incremental autonomous exploration, goal-oriented exploration, and adaptive multi-goal exploration. These methods enable agents to efficiently explore unknown environments before any task is specified.
Gaussian Process Optimization and Kernel Methods
2019 - 2022
We develop scalable methods for Gaussian process optimization and kernel-based reinforcement learning. Our contributions include adaptive sketching techniques that achieve near-linear time complexity, methods for evaluating candidates multiple times to reduce switch costs, and kernel-based approaches for non-stationary reinforcement learning. We also provide finite-time analysis of kernel-based RL and study episodic RL with minimax lower bounds in metric spaces.
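The baseline these methods accelerate is GP-UCB with exact inference, which costs cubic time in the number of observations; the adaptive-sketching line replaces the exact solves below with Nyström-style approximations. A minimal 1-D sketch in illustrative notation only:

```python
import numpy as np

def rbf(a, b, ell=0.1):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell**2))

def gp_ucb_step(X, y, grid, beta=2.0, noise=1e-2):
    # Exact GP posterior on the grid, then pick the maximizer of mean + beta * std.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(grid, X)
    mean = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum("ij,ij->i", Ks, np.linalg.solve(K, Ks.T).T)
    return grid[np.argmax(mean + beta * np.sqrt(np.maximum(var, 0.0)))]

f = lambda x: np.sin(6 * x) + 0.1 * np.random.randn(*np.shape(x))  # noisy target
grid = np.linspace(0, 1, 200)
X, y = np.array([0.5]), f(np.array([0.5]))
for _ in range(20):
    x_next = gp_ucb_step(X, y, grid)
    X, y = np.append(X, x_next), np.append(y, f(x_next))
```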
Determinantal Point Processes (DPPs)
2017 - 2020
Determinantal point processes (DPPs) are probabilistic models for selecting diverse subsets of items. We develop efficient sampling algorithms for DPPs, including exact sampling with sublinear preprocessing, zonotope hit-and-run methods for projection DPPs, and applications to Monte Carlo integration. We also create DPPy, a Python toolbox for DPP sampling, and study fast sampling from β-ensembles. Our methods enable scalable diverse subset selection in machine learning applications.
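DPPy makes the basic use case a few lines; the snippet below follows its documented finite-DPP interface (https://github.com/guilgautier/DPPy), though details may vary across versions:

```python
import numpy as np
from dppy.finite_dpps import FiniteDPP

# Selecting a diverse subset with DPPy: the likelihood kernel L favors
# subsets whose feature vectors span a large volume.
rng = np.random.default_rng(0)
features = rng.standard_normal((100, 10))   # 100 items with 10-D features
L = features @ features.T                   # PSD likelihood kernel
dpp = FiniteDPP('likelihood', L=L)
dpp.sample_exact()                          # draw one exact sample
print(dpp.list_of_samples[-1])              # indices of the sampled subset
```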
Combinatorial Bandits and Influence Maximization
2017 - 2020
We study online learning in combinatorial settings, particularly influence maximization in social networks. Our work includes online influence maximization under the independent cascade model with semi-bandit feedback, budgeted online influence maximization, efficient algorithms for matroid semi-bandits that exploit structure of uncertainty, statistical efficiency of Thompson sampling for combinatorial semi-bandits, and covariance-adapting algorithms for semi-bandits. We also develop methods for learning to act greedily in polymatroid settings.
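A hedged sketch of the basic combinatorial UCB template for m-set semi-bandits, the simplest matroid case; the Thompson-sampling and covariance-adapting variants above refine the index and the sampling rule. All names are ours:

```python
import numpy as np

# CUCB-style m-set semi-bandit: play m of K arms, observe every chosen arm's
# reward (semi-bandit feedback). Picking the top-m arms by UCB index is an
# exact oracle for this uniform matroid.
K, m, T = 20, 5, 5000
rng = np.random.default_rng(0)
mu = rng.uniform(size=K)                         # unknown arm means
counts, sums = np.zeros(K), np.zeros(K)

for t in range(1, T + 1):
    ucb = np.where(
        counts == 0, np.inf,                     # force one pull of every arm
        sums / np.maximum(counts, 1)
        + np.sqrt(1.5 * np.log(t) / np.maximum(counts, 1)),
    )
    action = np.argsort(ucb)[-m:]                # top-m arms by index
    rewards = rng.binomial(1, mu[action])        # one observation per chosen arm
    counts[action] += 1
    sums[action] += rewards
```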
Adaptive structural sampling
2013 - 2020
Many sequential problems require adaptive sampling in some particular way. Examples include using learning to reduce the rejection rate in rejection sampling, sampling under two contradictory objectives such as trading off reward and regret, extreme and infinitely many-armed bandits, and efficient sampling of determinantal point processes.
SQUEAK: Online sparsification of kernels and graphs
2009 - 2018
My PhD thesis ended with an open direction: whether efficient spectral sparsifiers can fuel online graph-learning methods. We introduce the first dictionary-learning streaming algorithm that operates in a single pass over the dataset. This reduces the overall time required to construct provably accurate dictionaries from quadratic to near-linear, or even logarithmic when parallelized.
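A hedged sketch of the mechanism, with the constants, resparsification step, and efficiency tricks of the real algorithm omitted: stream the points and keep each one with probability proportional to its approximate ridge leverage score given the current dictionary.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * ell**2))

def stream_dictionary(points, gamma=1.0, scale=2.0, seed=0):
    rng = np.random.default_rng(seed)
    dictionary = []
    for x in points:                      # single pass over the stream
        if dictionary:
            k_DD = np.array([[rbf(u, v) for v in dictionary] for u in dictionary])
            k_Dx = np.array([rbf(u, x) for u in dictionary])
            # Approximate ridge leverage score of x given the dictionary.
            tau = (rbf(x, x) - k_Dx @ np.linalg.solve(
                k_DD + gamma * np.eye(len(dictionary)), k_Dx)) / gamma
        else:
            tau = 1.0
        if rng.random() < min(1.0, scale * tau):
            dictionary.append(x)          # keep points the dictionary misses
    return dictionary
```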
Graph Bandits
2013 - 2017
Bandit problems are online decision-making problems where the only feedback given to the learner is a (noisy) reward of the chosen decision. We study the benefits of homophily (similar actions give similar rewards) under the name of spectral bandits, side information (well-informed bandits), and influence maximization (IM bandits). Our algorithms take advantage of these similarities in order to (provably) learn faster.
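A hedged sketch of the spectral-bandit idea: run LinUCB-style regression in the Laplacian eigenbasis, with a penalty that charges high graph frequencies more, so smooth (homophilous) reward functions are learned quickly. All names are illustrative:

```python
import numpy as np

def spectral_ucb(L_graph, T, reward_fn, lam=0.1, beta=1.0):
    n = L_graph.shape[0]
    eigvals, Q = np.linalg.eigh(L_graph)   # node features: Laplacian eigenvectors
    A = np.diag(eigvals + lam)             # spectral regularizer: high graph
    b = np.zeros(n)                        # frequencies are penalized more
    for t in range(T):
        theta = np.linalg.solve(A, b)
        A_inv = np.linalg.inv(A)
        width = np.sqrt(np.einsum("ij,jk,ik->i", Q, A_inv, Q))
        node = int(np.argmax(Q @ theta + beta * width))   # optimistic node
        A += np.outer(Q[node], Q[node])
        b += reward_fn(node) * Q[node]
    return np.linalg.solve(A, b)           # learned spectral coefficients
```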
Semi-supervised apprenticeship learning
2011 - 2015
In apprenticeship learning, we aim to learn a good behavior by observing an expert. We consider a situation where we observe many trajectories of behaviors, but only one or a few of them are labeled as expert trajectories. We investigate the assumptions under which the remaining unlabeled trajectories can aid in learning a policy with good performance.
Composing Learning for Artificial Cognitive Systems (CompLACS)
2011 - 2015
The purpose of this project was to develop a unified toolkit for intelligent control in many different problem areas. This toolkit incorporates many of the most successful approaches to a variety of important control problems within a single framework, including bandit problems, Markov Decision Processes (MDPs), Partially Observable MDPs (POMDPs), continuous stochastic control, and multi-agent systems.
Large-scale semi-supervised learning
2010 - 2013
We parallelized an online harmonic solver to process 1 TB of video data in a day. We also worked on multi-manifold learning that can overcome changes in distribution, showing how the online learner adapts to the characters' aging over a 10-year period in the sitcom Married... with Children. The research was part of the Everyday Sensing and Perception (ESP) project.
Online semi-supervised learning
2009 - 2013
We extended graph-based semi-supervised learning to the structured case and demonstrated it on handwriting recognition and object detection from video streams. We developed an online algorithm that recognizes faces in real-world datasets at 80-90% precision with 90% recall.
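The batch object underlying this line of work is the harmonic solution on a similarity graph: labeled nodes are clamped and unlabeled nodes solve a Laplacian system. A minimal sketch (the online variants approximate this solve over a stream); names are ours:

```python
import numpy as np

def harmonic_labels(W, labeled_idx, labels):
    """Harmonic solution f_u = -L_uu^{-1} L_ul f_l on a similarity graph W."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                  # graph Laplacian
    u = np.setdiff1d(np.arange(n), labeled_idx)     # unlabeled nodes
    f = np.zeros(n)
    f[labeled_idx] = labels                         # clamp the labeled nodes
    f[u] = np.linalg.solve(L[np.ix_(u, u)], -L[np.ix_(u, labeled_idx)] @ labels)
    return f
```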
Anomaly detection in clinical databases
2007 - 2013
We developed statistical anomaly detection methods for identifying unusual outcomes and patient management decisions. I combined max-margin learning with a learned distance to create an anomaly detector that outperforms the hospital rule for heparin-induced thrombocytopenia detection. I later scaled the system to 5K patients with 9K features and 743 clinical decisions per day.
Odd-Man-Out
2007 - 2011
We hypothesized that clinical data in emergency department (ED) reports would increase the sensitivity and specificity of case identification for patients with an acute lower respiratory syndrome (ALRS). We designed a statistic of disagreement (odd-man-out) to evaluate the machine-learning classifier against expert evaluation in cases where a gold standard is not available.
High-throughput proteomic and genomic data and biomarker discovery
2006 - 2007
I built a framework for cancer prediction from high-throughput proteomic and genomic data sources. I found a way to merge heterogeneous data sources: my fusion model was able to predict pancreatic cancer from Luminex data combined with SELDI with 91.2% accuracy.
Evolutionary feature selection algorithms
2005
We enhanced the existing FeaSANNT neural feature selection with a spiking neuron model to handle inputs corrupted with up to 10% Gaussian noise.
Plastic Synapses (regularity counting)
2003 - 2005
We modeled basic learning functions at the level of synapses. I designed a model that is able to adapt to regular frequencies at different rates as time flows. I used genetic programming to find biologically plausible networks that distinguish different gamma distributions and provided an explanation of the evolved strategies.
