
Michal Valko : Notes

Short write-ups on research highlights, methods, and ideas.

Nash Learning from Human Feedback: A Brief Explainer

Tags: RLHF · LLM · Game Theory

Standard RLHF trains a reward model from pairwise comparisons, then optimizes a policy against that reward. But the reward model is a bottleneck: it introduces approximation error that compounds during RL training.
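To make the reward-model step concrete, here is a minimal sketch of the Bradley-Terry-style objective commonly used to fit a reward model to pairwise comparisons. The function name and toy scores are illustrative, not from the paper:

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one
    under a Bradley-Terry model: P(chosen > rejected) = sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Training drives this loss down over a dataset of human comparisons; the
# policy is then optimized (e.g. with PPO) against the learned scalar reward.
print(round(bradley_terry_loss(2.0, 0.0), 4))  # → 0.1269
```

Any misfit in this scalar model is exactly the approximation error the RL stage can later exploit.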

Our approach, Nash Learning from Human Feedback (NLHF), sidesteps this by framing alignment as a two-player game. Instead of learning a scalar reward, we learn a preference model that directly compares two responses. The aligned LLM is the Nash equilibrium of this game: the policy that no alternative policy is preferred over more than half the time under the preference model.
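The equilibrium idea can be illustrated on a toy version of the game. Below is a sketch, under assumptions of my own (a hand-picked 3-response preference matrix and a plain multiplicative-weights solver, not the paper's algorithm), of finding the Nash mixture of the symmetric zero-sum game induced by a preference model:

```python
import numpy as np

# Toy preference matrix over 3 candidate responses (illustrative numbers):
# P[i, j] = probability the preference model prefers response i over j.
P = np.array([[0.5, 0.7, 0.4],
              [0.3, 0.5, 0.8],
              [0.6, 0.2, 0.5]])

def nash_via_multiplicative_weights(P, steps=5000, lr=0.1):
    """Approximate the Nash equilibrium of the symmetric zero-sum game with
    payoff P - 1/2 via self-play with multiplicative-weights updates; the
    time-averaged strategy converges to an equilibrium mixture."""
    n = P.shape[0]
    A = P - 0.5                 # zero-sum payoff: expected win margin
    pi = np.full(n, 1.0 / n)
    avg = np.zeros(n)
    for _ in range(steps):
        margin = A @ pi         # each pure response's margin vs. current mix
        pi = pi * np.exp(lr * margin)
        pi /= pi.sum()
        avg += pi
    return avg / steps

pi_star = nash_via_multiplicative_weights(P)
# pi_star is a mixture that no single response beats with probability > 1/2
# (up to approximation error).
print(pi_star.round(3))
```

In NLHF the "pure strategies" are full LLM policies rather than three canned responses, but the objective is the same: a mixture that nothing beats more than half the time.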

This has several advantages: (1) it avoids the reward model bottleneck, (2) it provides a principled objective with game-theoretic guarantees, and (3) it connects to a rich theory of online learning in games.

We showed that this approach is competitive with or better than standard RLHF on alignment benchmarks, while being more theoretically grounded.