arXiv preprint arXiv:2502.04270

Pilaf: Optimal human preference sampling for reward modeling · arXiv 2502.04270

2 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

Which Pairs to Compare for LLM Post-Training?

cs.AI · 2026-06-17 · unverdicted · novelty 7.0

Matching upper and lower bounds on the DPO policy optimality gap are derived; both depend on a single design-dependent information matrix that links pair selection to estimation error and suboptimality.

$f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

The paper establishes the first O(log T) regret and O(1/T) sub-optimality bounds for online RLHF under general f-divergence regularization via two sampling algorithms.

citing papers explorer

Showing 2 of 2 citing papers.

Which Pairs to Compare for LLM Post-Training? cs.AI · 2026-06-17 · unverdicted · none · ref 20
$f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses cs.LG · 2026-05-07 · unverdicted · none · ref 5
