arXiv preprint arXiv:2502.04270
2 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (2)
verdicts: UNVERDICTED (2)
roles: background (1)
polarities: unclear (1)
representative citing papers
The paper establishes the first O(log T) regret and O(1/T) sub-optimality bounds for online RLHF under general f-divergence regularization via two sampling algorithms.
citing papers explorer
- Which Pairs to Compare for LLM Post-Training?
  Matching upper and lower bounds on the DPO policy optimality gap are derived; they depend on a single design-dependent information matrix that links pair selection to estimation error and suboptimality.
- $f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses
  The paper establishes the first O(log T) regret and O(1/T) sub-optimality bounds for online RLHF under general f-divergence regularization via two sampling algorithms.