arXiv preprint arXiv:2405.21046 , year=

Tengyang Xie, Dylan J Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, Alexander Rakhlin · 2024 · arXiv 2405.21046

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

representative citing papers

Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.

Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.

Continuous Latent Contexts Enable Efficient Online Learning in Transformers

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

Transformers equipped with continuous latent context tokens can implement foundational online decision-making algorithms such as weighted majority and Q-learning, and a trained small model outperforms larger LLMs on synthetic online prediction tasks.

PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs

cs.AI · 2026-05-01 · unverdicted · novelty 6.0

PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.

Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment

cs.LG · 2026-04-06 · unverdicted · novelty 6.0

Relative density ratio optimization stabilizes direct density ratio estimation for language model alignment while preserving statistical consistency without assuming a Bradley-Terry preference model.

Multiplayer Nash Preference Optimization

cs.AI · 2025-09-27 · unverdicted · novelty 6.0

MNPO extends NLHF to multiplayer Nash games, inheriting equilibrium guarantees while showing empirical gains on instruction-following benchmarks under diverse preferences.

Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

cs.LG · 2026-02-05 · unverdicted · novelty 5.0 · 2 refs

PEPO is a single-step pessimistic ensemble algorithm for direct preference optimization that provably avoids over-optimization by depending only on single-policy concentrability without knowing the data distribution or learning an explicit reward model.

Failure Modes of Maximum Entropy RLHF

cs.LG · 2025-09-24 · unverdicted · novelty 5.0

Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.

citing papers explorer

Showing 8 of 8 citing papers.

Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability cs.LG · 2026-05-09 · unverdicted · none · ref 42
The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective cs.LG · 2026-05-08 · unverdicted · none · ref 21
The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.
Continuous Latent Contexts Enable Efficient Online Learning in Transformers cs.LG · 2026-05-11 · unverdicted · none · ref 5
Transformers equipped with continuous latent context tokens can implement foundational online decision-making algorithms such as weighted majority and Q-learning, and a trained small model outperforms larger LLMs on synthetic online prediction tasks.
PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs cs.AI · 2026-05-01 · unverdicted · none · ref 64
PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.
Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment cs.LG · 2026-04-06 · unverdicted · none · ref 18
Relative density ratio optimization stabilizes direct density ratio estimation for language model alignment while preserving statistical consistency without assuming a Bradley-Terry preference model.
Multiplayer Nash Preference Optimization cs.AI · 2025-09-27 · unverdicted · none · ref 32
MNPO extends NLHF to multiplayer Nash games, inheriting equilibrium guarantees while showing empirical gains on instruction-following benchmarks under diverse preferences.
Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution cs.LG · 2026-02-05 · unverdicted · none · ref 33 · 2 links
PEPO is a single-step pessimistic ensemble algorithm for direct preference optimization that provably avoids over-optimization by depending only on single-policy concentrability without knowing the data distribution or learning an explicit reward model.
Failure Modes of Maximum Entropy RLHF cs.LG · 2025-09-24 · unverdicted · none · ref 54
Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.

arXiv preprint arXiv:2405.21046 , year=

fields

years

verdicts

representative citing papers

citing papers explorer