Proceedings of the AAAI Conference on Artificial Intelligence
4 Pith papers cite this work, all from 2026.
Citing papers:
- Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
  The equivalence between DPO and RLHF holds only if the optimal policy prefers the human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs. The paper addresses this by introducing CPO.
- Discrete Flow Matching for Offline-to-Online Reinforcement Learning
  DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.
- Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
  Listwise Policy Optimization explicitly performs target-projection on the LLM response simplex, unifying and improving group-based RLVR methods with monotonic improvement and flexible divergences.
- An adaptive variance estimator for relative sparsity
  A new adaptive variance estimator for relative sparsity coefficients is introduced that fully utilizes the prior asymptotic normality theorem and incorporates variable selection effects.