arXiv preprint arXiv:2505.23585 , year=

Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, Furu Wei · 2025 · arXiv 2505.23585

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Holder Policy Optimisation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.

Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.

Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% pass@16 on Qwen3 models while improving sample and token efficiency.

Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

cs.LG · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

Kernel smoothing enables accurate low-variance value and gradient estimates for policy optimization in LLM reasoning under tight sampling constraints per prompt.

Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning

cs.LG · 2026-06-20 · unverdicted · novelty 5.0

Survey mapping RL techniques onto LLM training and highlighting gaps in value-based, off-policy, and bootstrapping methods.

Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

cs.LG · 2026-06-10 · unverdicted · novelty 5.0 · 2 refs

SGCD improves held-out scores on AppWorld and tau^3-airline by using LLM-summarized sibling contrasts to reshape GRPO advantages while keeping policy gradient in charge of the actor update.

Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective

cs.LG · 2025-10-11 · unverdicted · novelty 5.0

Derives a token-level entropy change approximation revealing four factors, identifies limitations in prior entropy interventions, and proposes STEER which adaptively reweights tokens to mitigate collapse and improve performance on math and coding benchmarks.

Policy Improvement Reinforcement Learning

cs.LG · 2026-04-01

citing papers explorer

Showing 7 of 7 citing papers after filters.

Holder Policy Optimisation cs.LG · 2026-05-12 · unverdicted · none · ref 52 · 2 links
HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.
Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization cs.LG · 2026-05-12 · unverdicted · none · ref 41
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR cs.LG · 2026-05-07 · unverdicted · none · ref 23
S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% pass@16 on Qwen3 models while improving sample and token efficiency.
Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning cs.LG · 2026-04-30 · unverdicted · none · ref 8 · 2 links
Kernel smoothing enables accurate low-variance value and gradient estimates for policy optimization in LLM reasoning under tight sampling constraints per prompt.
Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning cs.LG · 2026-06-20 · unverdicted · none · ref 65
Survey mapping RL techniques onto LLM training and highlighting gaps in value-based, off-policy, and bootstrapping methods.
Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents cs.LG · 2026-06-10 · unverdicted · none · ref 16 · 2 links
SGCD improves held-out scores on AppWorld and tau^3-airline by using LLM-summarized sibling contrasts to reshape GRPO advantages while keeping policy gradient in charge of the actor update.
Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective cs.LG · 2025-10-11 · unverdicted · none · ref 6
Derives a token-level entropy change approximation revealing four factors, identifies limitations in prior entropy interventions, and proposes STEER which adaptively reweights tokens to mitigate collapse and improve performance on math and coding benchmarks.

arXiv preprint arXiv:2505.23585 , year=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer