DPO-RLHF equivalence holds only conditionally on the optimal policy preferring human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs, addressed by introducing CPO.
2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024 , pages=
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.AI 1years
2026 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
DPO-RLHF equivalence holds only conditionally on the optimal policy preferring human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs, addressed by introducing CPO.