Nash Learning from Human Feedback

Andrea Michi; Bilal Piot; Daniele Calandriello; Daniel J. Mankowitz; Doina Precup; Marco Selvi; Mark Rowland; Matthieu Geist; Michal Valko; Mohammad Gheshlaghi Azar

arXiv: 2312.00886 · v4 · pith:E7NFWSFLnew · submitted 2023-12-01 · stat.ML · cs.AI · cs.GT · cs.LG · cs.MA

Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J. Mankowitz, Doina Precup, Bilal Piot

classification: stat.ML, cs.AI, cs.GT, cs.LG, cs.MA

keywords: human, learning, feedback, policy, model, nash, preferences, approach

Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. However, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. To demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of an LLM for a text summarization task. We believe NLHF offers a compelling avenue for preference learning and policy optimization with the potential of advancing the field of aligning LLMs with human preferences.
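To make the tabular setting in the abstract more concrete, here is a minimal sketch of a mirror-descent iteration on a small preference matrix. The abstract does not state the update rule, so the form used here is an assumption: mix the current policy geometrically with a reference policy, then take an exponentiated-gradient step against that mixture. The names `nash_md`, `P`, `mu`, `beta`, `eta`, and the 3-response toy matrix are all hypothetical and chosen for illustration only.

```python
import numpy as np

def nash_md(P, mu, beta=0.1, eta=0.5, steps=2000):
    """Illustrative tabular mirror-descent loop (assumed update form).

    P[i, j] is the probability that response i is preferred to response j,
    with P + P.T == 1. mu is a strictly positive reference policy.
    beta is the regularization strength, eta the step size.
    """
    pi = mu.copy()
    for _ in range(steps):
        # Geometric mixture of the current policy and the reference policy.
        log_mix = (1.0 - eta * beta) * np.log(pi) + eta * beta * np.log(mu)
        mix = np.exp(log_mix - log_mix.max())
        mix /= mix.sum()
        # Exponentiated-gradient step: reward each response by its
        # probability of being preferred to a sample from the mixture.
        logits = np.log(mix) + eta * (P @ mix)
        pi = np.exp(logits - logits.max())
        pi /= pi.sum()
    return pi

# Toy example (assumed data): response 0 beats the other two 90% of the time.
P_demo = np.array([
    [0.5, 0.9, 0.9],
    [0.1, 0.5, 0.5],
    [0.1, 0.5, 0.5],
])
mu_demo = np.ones(3) / 3.0
pi_demo = nash_md(P_demo, mu_demo)
print(pi_demo)  # most of the mass ends up on response 0
```

In this toy case the policy concentrates on the dominant response, but the regularization toward `mu_demo` keeps the other responses at nonzero probability. A fixed step size is used for simplicity; the paper's convergence guarantee may rely on a different schedule.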

Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Online KL-Regularized Reinforcement Learning with Function Approximation under Misspecification
cs.LG · 2026-06 · unverdicted · novelty 7.0

Introduces KL misspecification for bandits and RL under function approximation and proves explicit KL-regret bounds for regression-based Gibbs algorithms that recover the realizable case.

Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
cs.LG · 2026-05 · unverdicted · novelty 7.0

The paper establishes the first Õ(ε^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general functi...

Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
cs.LG · 2026-05 · unverdicted · novelty 7.0

The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...

Incentivizing High-Quality Human Annotations with Golden Questions
cs.GT · 2025-05 · unverdicted · novelty 7.0

The paper derives a Θ(1/√(n log n)) hypothesis testing rate under strategic annotator behavior and shows that high-certainty, format-similar golden questions better reveal annotation quality than standard checks.

KTO: Model Alignment as Prospect Theoretic Optimization
cs.LG · 2024-02 · conditional · novelty 7.0

KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.

Response Time Enhances Alignment with Heterogeneous Preferences
cs.LG · 2026-05 · unverdicted · novelty 6.0

Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment
cs.LG · 2026-04 · unverdicted · novelty 6.0

Relative density ratio optimization stabilizes direct density ratio estimation for language model alignment while preserving statistical consistency without assuming a Bradley-Terry preference model.

Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution
cs.LG · 2026-02 · unverdicted · novelty 6.0

PEPO uses pessimistic ensembling of DPO policies on data subsets to achieve single-policy concentrability sample bounds and avoid over-optimization in tabular settings.

Multiplayer Nash Preference Optimization
cs.AI · 2025-09 · unverdicted · novelty 6.0

MNPO extends NLHF to multiplayer Nash games, inheriting equilibrium guarantees while showing empirical gains on instruction-following benchmarks under diverse preferences.

How Humans Help LLMs: Assessing and Incentivizing Human Preference Annotators
cs.LG · 2025-02 · unverdicted · novelty 6.0

Develops self-consistency monitoring for preference annotators and derives sample-complexity bounds showing linear contracts achieve near-ideal performance faster than binary ones under continuous actions.

UNA: A Unified Supervised Framework for Efficient LLM Alignment Across Feedback Types
cs.LG · 2024-08 · unverdicted · novelty 6.0

UNA unifies binary, pairwise, and score-based feedback for LLM alignment via a generalized implicit reward function shown optimal by the log sum inequality.

Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution
cs.LG · 2026-02 · unverdicted · novelty 5.0

PEPO is a single-step pessimistic ensemble algorithm for direct preference optimization that provably avoids over-optimization by depending only on single-policy concentrability without knowing the data distribution o...