arXiv preprint arXiv:2510.11686, 2025

Jens Tuyls, Dylan J Foster, Akshay Krishnamurthy, Jordan T Ash · 2025 · arXiv 2510.11686

4 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

The tractability landscape of diffusion alignment: regularization, rewards, and computational primitives

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

The choice of closeness measure in diffusion reward alignment determines the computational primitives and tractable reward classes, with linear exponential tilts sufficing for KL with convex rewards and proximal oracles for Wasserstein with concave or low-dimensional Lipschitz rewards.

Data-dependent Exploration for Online Reinforcement Learning from Human Feedback

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

DEPO uses historical data to build a data-dependent uncertainty bonus for exploration in online RLHF, yielding an adaptive regret bound and stronger empirical performance than baselines.

The Role of Generator Access in Autoregressive Post-Training

cs.LG · 2026-04-06 · unverdicted · novelty 5.0

Limited generator access in autoregressive post-training confines learners to root-start rollouts whose value is bounded by on-policy prefix probabilities, while weak prefix control unlocks richer observations and produces an exponential gap in KL-regularized outcome-reward training.

Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

cs.LG · 2026-02-05 · unverdicted · novelty 5.0 · 2 refs

PEPO is a single-step pessimistic ensemble algorithm for direct preference optimization that provably avoids over-optimization by depending only on single-policy concentrability without knowing the data distribution or learning an explicit reward model.

citing papers explorer

The tractability landscape of diffusion alignment: regularization, rewards, and computational primitives
cs.LG · 2026-05-12 · unverdicted · none · ref 30

Data-dependent Exploration for Online Reinforcement Learning from Human Feedback
cs.LG · 2026-05-06 · unverdicted · none · ref 85

The Role of Generator Access in Autoregressive Post-Training
cs.LG · 2026-04-06 · unverdicted · none · ref 21

Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution
cs.LG · 2026-02-05 · unverdicted · none · ref 29 · 2 links
