Klear-Reasoner: Advancing reasoning capability via gradient-preserving clipping policy optimization

Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou · 2025 · arXiv 2508.07629

7 Pith papers cite this work. Polarity classification is still indexing.



citation-role summary

method 1

citation-polarity summary

background 1

representative citing papers

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

cs.LG · 2026-05-19 · conditional · novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.

Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling

cs.LG · 2026-05-11 · conditional · novelty 7.0 · 3 refs

DuST self-trains LLMs for code generation by ranking their own test-time samples via sandbox execution and applying GRPO, improving judgment by +6.2 NDCG and single-sample pass@1 by +3.1 on LiveCodeBench.

D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

cs.AI · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and performance.

Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

cs.LG · 2025-12-05 · unverdicted · novelty 6.0

Entropy Ratio Clipping introduces a global entropy-ratio constraint that stabilizes RL policy updates in LLM post-training beyond local PPO clipping.

CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning

cs.LG · 2025-09-25 · unverdicted · novelty 6.0 · 2 refs

CE-GPPO preserves bounded gradients from clipped tokens in PPO to regulate entropy evolution and improve performance on mathematical reasoning benchmarks.

OneRec-V2 Technical Report

cs.IR · 2025-08-28 · unverdicted · novelty 5.0

OneRec-V2 scales generative recommendation to 8B parameters via decoder-only design and real-world preference alignment, improving user engagement metrics in production A/B tests.

citing papers explorer

Showing 7 of 7 citing papers.

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

cs.LG · 2026-05-19 · conditional · none · ref 32

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.

Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling

cs.LG · 2026-05-11 · conditional · none · ref 13 · 3 links

DuST self-trains LLMs for code generation by ranking their own test-time samples via sandbox execution and applying GRPO, improving judgment by +6.2 NDCG and single-sample pass@1 by +3.1 on LiveCodeBench.

D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

cs.AI · 2026-05-13 · unverdicted · none · ref 14 · 2 links

D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

cs.LG · 2026-05-12 · unverdicted · none · ref 22

Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and performance.

Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

cs.LG · 2025-12-05 · unverdicted · none · ref 17

Entropy Ratio Clipping introduces a global entropy-ratio constraint that stabilizes RL policy updates in LLM post-training beyond local PPO clipping.

CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning

cs.LG · 2025-09-25 · unverdicted · none · ref 19 · 2 links

CE-GPPO preserves bounded gradients from clipped tokens in PPO to regulate entropy evolution and improve performance on mathematical reasoning benchmarks.

OneRec-V2 Technical Report

cs.IR · 2025-08-28 · unverdicted · none · ref 19

OneRec-V2 scales generative recommendation to 8B parameters via decoder-only design and real-world preference alignment, improving user engagement metrics in production A/B tests.
