
Michal Valko : Notes

Short write-ups on research highlights, methods, and ideas.

Nash Learning from Human Feedback: A Brief Explainer

Tags: RLHF · LLM · Game Theory

Standard RLHF trains a reward model from pairwise comparisons, then optimizes a policy against that reward. But the reward model is a bottleneck: it introduces approximation error that compounds during RL training.
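To make the reward-model step concrete, here is a minimal sketch of the Bradley-Terry-style objective commonly used to fit a reward model to pairwise comparisons. The function name and toy scores are illustrative, not from the paper:

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one
    under a Bradley-Terry model: P(chosen > rejected) = sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Training drives this loss down over a dataset of human comparisons; the
# policy is then optimized (e.g. with PPO) against the learned scalar reward.
print(round(bradley_terry_loss(2.0, 0.0), 4))  # → 0.1269
```

Any misfit in this scalar model is exactly the approximation error the RL stage can later exploit.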

Our approach, Nash Learning from Human Feedback (NLHF), sidesteps this by framing alignment as a two-player game. Instead of learning a scalar reward, we learn a preference model that directly compares two responses. The aligned LLM is the Nash equilibrium of this game: the policy that no alternative policy is preferred over more than half the time under the preference model.
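The equilibrium idea can be illustrated on a toy version of the game. Below is a sketch, under assumptions of my own (a hand-picked 3-response preference matrix and a plain multiplicative-weights solver, not the paper's algorithm), of finding the Nash mixture of the symmetric zero-sum game induced by a preference model:

```python
import numpy as np

# Toy preference matrix over 3 candidate responses (illustrative numbers):
# P[i, j] = probability the preference model prefers response i over j.
P = np.array([[0.5, 0.7, 0.4],
              [0.3, 0.5, 0.8],
              [0.6, 0.2, 0.5]])

def nash_via_multiplicative_weights(P, steps=5000, lr=0.1):
    """Approximate the Nash equilibrium of the symmetric zero-sum game with
    payoff P - 1/2 via self-play with multiplicative-weights updates; the
    time-averaged strategy converges to an equilibrium mixture."""
    n = P.shape[0]
    A = P - 0.5                 # zero-sum payoff: expected win margin
    pi = np.full(n, 1.0 / n)
    avg = np.zeros(n)
    for _ in range(steps):
        margin = A @ pi         # each pure response's margin vs. current mix
        pi = pi * np.exp(lr * margin)
        pi /= pi.sum()
        avg += pi
    return avg / steps

pi_star = nash_via_multiplicative_weights(P)
# pi_star is a mixture that no single response beats with probability > 1/2
# (up to approximation error).
print(pi_star.round(3))
```

In NLHF the "pure strategies" are full LLM policies rather than three canned responses, but the objective is the same: a mixture that nothing beats more than half the time.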

This has several advantages: (1) it avoids the reward model bottleneck, (2) it provides a principled objective with game-theoretic guarantees, and (3) it connects to a rich theory of online learning in games.

We showed that this approach is competitive with or better than standard RLHF on alignment benchmarks, while being more theoretically grounded.