The paper establishes the first O(log T) regret and O(1/T) sub-optimality bounds for online RLHF under general f-divergence regularization via two sampling algorithms.
9 f-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses Huang, A., Zhan, W., Xie, T., Lee, J
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
$f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses
The paper establishes the first O(log T) regret and O(1/T) sub-optimality bounds for online RLHF under general f-divergence regularization via two sampling algorithms.