arXiv preprint arXiv:2505.12929 , year=

Do not let low-probability tokens over-dominate in rl for llms , author= · 2025 · arXiv 2505.12929

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

representative citing papers

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.

GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.

Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

cs.CL · 2025-07-21 · unverdicted · novelty 6.0

Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.

Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

Proposes Near-boundary Stochastic Rescue (NSR) as a stochastic modification to clipping in RLVR that recovers near-boundary signals and yields gains over baselines like DAPO and GSPO.

Sharpness-Guided Group Relative Policy Optimization via Probability Shaping

cs.LG · 2025-10-29 · unverdicted · novelty 4.0

GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.

Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation

cs.LG · 2026-05-20

STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

cs.CL · 2026-02-17

citing papers explorer

Showing 7 of 7 citing papers.

Holder Policy Optimisation cs.LG · 2026-05-12 · unverdicted · none · ref 41 · 2 links
HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.
GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning cs.LG · 2026-04-22 · unverdicted · none · ref 20
GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.
Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR cs.CL · 2025-07-21 · unverdicted · none · ref 48
Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.
Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals cs.LG · 2026-05-21 · unverdicted · none · ref 25
Proposes Near-boundary Stochastic Rescue (NSR) as a stochastic modification to clipping in RLVR that recovers near-boundary signals and yields gains over baselines like DAPO and GSPO.
Sharpness-Guided Group Relative Policy Optimization via Probability Shaping cs.LG · 2025-10-29 · unverdicted · none · ref 40
GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.
Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation cs.LG · 2026-05-20 · unreviewed · ref 44
STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens cs.CL · 2026-02-17 · unreviewed · ref 9

arXiv preprint arXiv:2505.12929 , year=

fields

years

verdicts

representative citing papers

citing papers explorer