Proposes MaxPO using a Leave-Two-Out baseline for centered unbiased advantages in max@K policy gradients, with a unified derivation of finite-batch estimators.
arXiv preprint arXiv:2501.12735
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (3)
years: 2026 (3)
verdicts: UNVERDICTED (3)
representative citing papers
DRIFT achieves multi-turn RL performance via offline importance-weighted SFT by leveraging the equivalence of KL-regularized RL to weighted supervised learning.
N-GRPO enhances GRPO via Semantic Neighbor Mixing of token embeddings to improve diversity and consistency in LLM math reasoning rollouts.
citing papers explorer
- On Advantage Estimates for Max@K Policy Gradients
  Proposes MaxPO using a Leave-Two-Out baseline for centered unbiased advantages in max@K policy gradients, with a unified derivation of finite-batch estimators.
- DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization
  DRIFT achieves multi-turn RL performance via offline importance-weighted SFT by leveraging the equivalence of KL-regularized RL to weighted supervised learning.
- N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization
  N-GRPO enhances GRPO via Semantic Neighbor Mixing of token embeddings to improve diversity and consistency in LLM math reasoning rollouts.