arXiv preprint arXiv:2510.11686
4 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (4)
years: 2026 (4)
verdicts: UNVERDICTED (4)
representative citing papers
DEPO uses historical data to build a data-dependent uncertainty bonus for exploration in online RLHF, yielding an adaptive regret bound and stronger empirical performance than baselines.
Limited generator access in autoregressive post-training confines learners to root-start rollouts whose value is bounded by on-policy prefix probabilities, while weak prefix control unlocks richer observations and produces an exponential gap in KL-regularized outcome-reward training.
PEPO is a single-step pessimistic ensemble algorithm for direct preference optimization that provably avoids over-optimization by depending only on single-policy concentrability without knowing the data distribution or learning an explicit reward model.
citing papers explorer
- The tractability landscape of diffusion alignment: regularization, rewards, and computational primitives
  The choice of closeness measure in diffusion reward alignment determines the computational primitives and tractable reward classes, with linear exponential tilts sufficing for KL with convex rewards and proximal oracles for Wasserstein with concave or low-dimensional Lipschitz rewards.
- Data-dependent Exploration for Online Reinforcement Learning from Human Feedback
  DEPO uses historical data to build a data-dependent uncertainty bonus for exploration in online RLHF, yielding an adaptive regret bound and stronger empirical performance than baselines.
- The Role of Generator Access in Autoregressive Post-Training
  Limited generator access in autoregressive post-training confines learners to root-start rollouts whose value is bounded by on-policy prefix probabilities, while weak prefix control unlocks richer observations and produces an exponential gap in KL-regularized outcome-reward training.
- Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution
  PEPO is a single-step pessimistic ensemble algorithm for direct preference optimization that provably avoids over-optimization by depending only on single-policy concentrability without knowing the data distribution or learning an explicit reward model.