Dcpo: Dynamic clipping policy optimization

Yang, S · 2025 · arXiv 2509.02333

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

citation-role summary

background 2 other 1

citation-polarity summary

background 2 unclear 1

representative citing papers

Revisiting DAgger in the Era of LLM-Agents

cs.LG · 2026-05-13 · conditional · novelty 6.0

DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.

Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.

Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

cs.CL · 2026-04-25 · unverdicted · novelty 6.0

Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math and code tasks.

SSPO: Subsentence-level Policy Optimization

cs.CL · 2025-11-06 · unverdicted · novelty 6.0

SSPO computes policy importance ratios at the subsentence level with entropy-adjusted clipping bounds, yielding higher average scores than GRPO and GSPO on math reasoning benchmarks with Qwen models.

MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models

cs.AI · 2026-04-18 · unverdicted · novelty 5.0

MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.

Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation

cs.LG · 2026-05-20

Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation

cs.LG · 2026-04-04

Policy Improvement Reinforcement Learning

cs.LG · 2026-04-01

citing papers explorer

Showing 1 of 1 citing paper after filters.

MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models cs.AI · 2026-04-18 · unverdicted · none · ref 12
MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.

Dcpo: Dynamic clipping policy optimization

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer