Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO

2025 · cs.LG · arXiv 2505.19770

3 Pith papers cite this work. Polarity classification is still indexing.


abstract

We present a fine-grained theoretical analysis of the performance gap between two-stage reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). Our study decomposes this gap into two sources: the explicit representation gap under exact optimization and the implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the quality of the final policy. We show that RLHF, DPO, or online DPO can each outperform the others depending on the type of model mis-specification. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model, highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.
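For orientation, here is a minimal sketch of the two per-pair losses the abstract contrasts. It is not taken from the paper: it assumes the standard Bradley-Terry preference model, and the function names, the scalar log-probability inputs, and the default `beta` are illustrative. Two-stage RLHF first fits an explicit reward with the Bradley-Terry loss and then optimizes a policy against it. DPO applies the same loss to an implicit reward defined by the policy's log-probability ratio against a reference model, which is why the relative capacity of the reward and policy classes matters.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bt_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    This is the loss used to fit an explicit reward model in stage one
    of two-stage RLHF (assumed standard form).
    """
    return -log_sigmoid(r_chosen - r_rejected)


def implicit_reward(logp: float, ref_logp: float, beta: float) -> float:
    """DPO's implicit reward: beta times the policy/reference log-ratio."""
    return beta * (logp - ref_logp)


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: the Bradley-Terry loss on implicit rewards."""
    return bt_reward_loss(
        implicit_reward(logp_chosen, ref_logp_chosen, beta),
        implicit_reward(logp_rejected, ref_logp_rejected, beta),
    )
```

When the policy equals the reference model, both implicit rewards are zero and the DPO loss is log 2. Raising the chosen response's log-probability relative to the reference lowers it.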

representative citing papers

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

cs.AI · 2026-05-20 · conditional · novelty 7.0

DPO-RLHF equivalence holds only conditionally on the optimal policy preferring human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs, addressed by introducing CPO.

DDO-RM: Distribution-Level Policy Improvement after Reward Learning

stat.ML · 2026-04-13 · unverdicted · novelty 7.0

DDO-RM turns reward scores into a target distribution and applies KL-regularized mirror-descent projection on finite candidates to improve policies, outperforming DPO on Pythia-410M.

Reinforcement Learning from Human Feedback: A Statistical Perspective

stat.ML · 2026-04-02 · accept · novelty 2.0

A statistical survey of RLHF for LLM alignment that connects preference learning and policy optimization to models like Bradley-Terry-Luce while reviewing methods, extensions, and open challenges.

citing papers explorer

Showing 3 of 3 citing papers.

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

cs.AI · 2026-05-20 · conditional · none · ref 29 · internal anchor

DDO-RM: Distribution-Level Policy Improvement after Reward Learning

stat.ML · 2026-04-13 · unverdicted · none · ref 13 · internal anchor

Reinforcement Learning from Human Feedback: A Statistical Perspective

stat.ML · 2026-04-02 · accept · none · ref 72 · internal anchor
