HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.
On-policy rl with optimal reward baseline
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.LG 6verdicts
UNVERDICTED 6roles
background 1polarities
background 1representative citing papers
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% pass@16 on Qwen3 models while improving sample and token efficiency.
Kernel smoothing enables accurate low-variance value and gradient estimates for policy optimization in LLM reasoning under tight sampling constraints per prompt.
PIRL maximizes cumulative policy improvement across iterations instead of surrogate rewards and is proven aligned with final performance; PIPO implements it via retrospective verification for stable closed-loop optimization.
Derives a token-level entropy change approximation revealing four factors, identifies limitations in prior entropy interventions, and proposes STEER which adaptively reweights tokens to mitigate collapse and improve performance on math and coding benchmarks.
citing papers explorer
-
Holder Policy Optimisation
HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.
-
Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
-
Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR
S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% pass@16 on Qwen3 models while improving sample and token efficiency.
-
Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning
Kernel smoothing enables accurate low-variance value and gradient estimates for policy optimization in LLM reasoning under tight sampling constraints per prompt.
-
Policy Improvement Reinforcement Learning
PIRL maximizes cumulative policy improvement across iterations instead of surrogate rewards and is proven aligned with final performance; PIPO implements it via retrospective verification for stable closed-loop optimization.
-
Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective
Derives a token-level entropy change approximation revealing four factors, identifies limitations in prior entropy interventions, and proposes STEER which adaptively reweights tokens to mitigate collapse and improve performance on math and coding benchmarks.