HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.
arXiv preprint arXiv:2505.12929 , year=
7 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.
Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.
Proposes Near-boundary Stochastic Rescue (NSR) as a stochastic modification to clipping in RLVR that recovers near-boundary signals and yields gains over baselines like DAPO and GSPO.
GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.
citing papers explorer
-
Holder Policy Optimisation
HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.
-
GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning
GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.
-
Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR
Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.
-
Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals
Proposes Near-boundary Stochastic Rescue (NSR) as a stochastic modification to clipping in RLVR that recovers near-boundary signals and yields gains over baselines like DAPO and GSPO.
-
Sharpness-Guided Group Relative Policy Optimization via Probability Shaping
GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.
- Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation
- STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens