pith. machine review for the scientific record.

arxiv: 2604.23318 · v1 · submitted 2026-04-25 · 💻 cs.CL · cs.LG


Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance


Pith reviewed 2026-05-08 08:01 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords hidden states · Wasserstein distance · credit assignment · GRPO · reinforcement learning · reasoning divergence · span-level analysis · advantage reweighting

The pith

Span-level Wasserstein distances between hidden state distributions of correct and incorrect GRPO rollouts increase at points where local reasoning quality diverges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that within each group of rollouts produced by Group Relative Policy Optimization, the Wasserstein distance on hidden states for correct versus incorrect trajectories grows larger precisely around the spans where their reasoning begins to differ in quality. This pattern appears consistently when comparing many examples and when examining single trajectories step by step. Because the signal relies only on final outcome labels, it offers a way to perform finer credit assignment during reinforcement learning without training separate process reward models or collecting step annotations. The authors formalize the pattern as a separation theorem and turn the distances into a practical reweighting scheme for token advantages.
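
As a concrete rendering of that machinery, here is a minimal sketch of how such span-level distances could be computed. The sliced-Wasserstein estimator (averaging exact 1-D distances over random projections), the span length and stride, and the `span_distances` helper are illustrative assumptions, not the paper's stated implementation.

```python
# A minimal sketch, not the paper's implementation: span-level distances
# between hidden states of correct vs. incorrect rollouts in one GRPO group.
# The sliced-Wasserstein estimator and the span_len/stride segmentation are
# illustrative assumptions.
import numpy as np
from scipy.stats import wasserstein_distance

def sliced_wasserstein(X, Y, n_proj=64, seed=0):
    """Approximate W1 between point clouds X: (n, d) and Y: (m, d)."""
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((n_proj, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Average exact 1-D Wasserstein distances over random projections.
    return float(np.mean([wasserstein_distance(X @ u, Y @ u) for u in dirs]))

def span_distances(correct, incorrect, span_len=16, stride=8):
    """correct/incorrect: lists of (T_i, d) hidden-state arrays, one per
    rollout. Returns one distance per span position (trajectories truncated
    to the shortest rollout for alignment -- a simplification)."""
    T = min(h.shape[0] for h in correct + incorrect)
    dists = []
    for start in range(0, T - span_len + 1, stride):
        Xc = np.concatenate([h[start:start + span_len] for h in correct])
        Xi = np.concatenate([h[start:start + span_len] for h in incorrect])
        dists.append(sliced_wasserstein(Xc, Xi))
    return dists
```

On the paper's account, the returned distances should rise around the span where the two groups' reasoning quality parts ways.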

Core claim

Within each GRPO group, the Wasserstein distance between span-level hidden state distributions of correct and incorrect rollouts increases around regions where their local reasoning quality diverges. This association holds both across examples and within individual trajectories. Under mild structural assumptions, post-divergence spans have larger Wasserstein distances than pre-divergence spans whenever the population-level distributional gap exceeds finite-sample noise.
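
Rendered schematically, in our notation rather than the paper's, the claim takes roughly the following form; the symbols are defined in the comments and the assumptions are paraphrased, not quoted.

```latex
% Schematic separation statement (notation ours; assumptions paraphrased):
% P^c_s, P^i_s are the span-s hidden-state distributions of correct and
% incorrect rollouts, \widehat{W} the empirical Wasserstein distance,
% s^* the divergence span, and \varepsilon_n the finite-sample noise.
% Pre-divergence spans are assumed to share (near-)identical distributions.
\[
  W\!\left(P^{c}_{s}, P^{i}_{s}\right) > \varepsilon_{n}
  \quad \text{for all } s > s^{*}
  \;\;\Longrightarrow\;\;
  \widehat{W}\!\left(P^{c}_{s}, P^{i}_{s}\right)
  \;>\;
  \widehat{W}\!\left(P^{c}_{s'}, P^{i}_{s'}\right)
  \quad \text{for } s > s^{*} \geq s'.
\]
```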

What carries the argument

Span-level Wasserstein distance between hidden-state distributions of correct and incorrect rollouts inside the same GRPO group, used to locate reasoning divergence points and scale token advantages.
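
A minimal sketch of what such a reweighting could look like, assuming span distances like those from the earlier sketch; the mean-normalization and the way overlapping spans are averaged back onto tokens are our choices, not necessarily SHEAR's exact rule.

```python
# A minimal sketch in the spirit of SHEAR's reweighting step; normalization
# and token broadcasting are illustrative assumptions, not the paper's rule.
import numpy as np

def reweight_advantages(adv, span_dists, span_len=16, stride=8, eps=1e-8):
    """Scale per-token GRPO advantages adv: (T,) by the normalized
    Wasserstein distance of the spans covering each token, amplifying
    updates on tokens whose hidden states are more separated from the
    opposing group."""
    adv = np.asarray(adv, dtype=float)
    T = adv.shape[0]
    norm = np.asarray(span_dists, dtype=float)
    norm = norm / (norm.mean() + eps)  # weights average to ~1 per rollout
    weight, cover = np.zeros(T), np.zeros(T)
    for k, start in enumerate(range(0, T - span_len + 1, stride)):
        weight[start:start + span_len] += norm[k]
        cover[start:start + span_len] += 1
    # Tokens outside any span keep their original advantage (weight 1).
    w = np.where(cover > 0, weight / np.maximum(cover, 1), 1.0)
    return adv * w
```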

If this is right

  • Scaling token advantages by these span-level distances produces measurable gains over standard GRPO on five mathematical reasoning benchmarks and five code generation benchmarks.
  • The method needs no extra model, no step-level labels, and only small changes to the existing training loop.
  • The same distributional separation signal appears both across many examples and inside single trajectories.
  • Performance becomes competitive with supervised process reward models while using only outcome correctness labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observation implies that transformer hidden states already embed local reasoning quality in a form detectable by simple distributional metrics.
  • The reweighting approach could be tested in other outcome-only reinforcement learning settings that lack process supervision.
  • If the separation theorem holds more generally, similar distances might help diagnose failure modes in non-reasoning generation tasks.

Load-bearing premise

The population-level gap in hidden-state distributions must exceed finite-sample noise so that post-divergence spans reliably show larger Wasserstein distances than pre-divergence spans.
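
For a sense of how demanding that premise is, a standard empirical-measure result (Fournier and Guillin, 2015) bounds how fast the noise term shrinks; treating a group's per-span hidden states as an i.i.d. sample of size n in dimension d is a simplification made here for illustration.

```latex
% Fournier & Guillin (2015): for an i.i.d. sample of size n from a measure
% \mu on R^d (d > 2, sufficient moments), the empirical measure \hat\mu_n
% satisfies the rate below; C is a distribution-dependent constant.
\[
  \mathbb{E}\!\left[ W_1\!\left(\mu, \hat{\mu}_n\right) \right] \;\le\; C\, n^{-1/d}.
\]
```

With group sizes in the single digits and hidden dimensions in the thousands, the n^{-1/d} rate is punishing, which is why low effective dimensionality, projection-based estimators, or entropic (Sinkhorn-style) regularization would plausibly matter in practice.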

What would settle it

Collect GRPO rollouts on a math or code task, annotate the first span where each incorrect trajectory diverges from a correct one, then test whether the measured Wasserstein distances are larger after that span than before it; a consistent reversal would falsify the separation claim.
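
A minimal sketch of that check, reusing the hypothetical span segmentation from the earlier sketch; the divergence annotation t_star (first divergence position, in tokens) is assumed to come from the labeling step just described.

```python
# Hedged sketch of the falsification test: compare mean span distance
# before vs. after an annotated divergence position t_star.
import numpy as np

def separation_holds(dists, t_star, span_len=16, stride=8):
    """True if post-divergence spans show larger mean distance than
    pre-divergence spans; None if one side is empty."""
    starts = [k * stride for k in range(len(dists))]
    pre = [d for d, s in zip(dists, starts) if s + span_len <= t_star]
    post = [d for d, s in zip(dists, starts) if s >= t_star]
    if not pre or not post:
        return None  # divergence too early or too late to compare
    return float(np.mean(post)) > float(np.mean(pre))
```

Aggregated over many annotated groups, a consistent False here would be the reversal that falsifies the separation claim.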

Figures

Figures reproduced from arXiv: 2604.23318 by Huichuan Fan, Jinghua Hao, Jiuchong Gao, Renqing He, Wei He, Weijie Yu, Wenzhe Niu, Xinzhu Chen, Xuanru Wang, Zhongxiang Sun.

Figure 1. Hidden state divergence tracks reasoning quality at both aggregate and local levels. (a) As reasoning progresses, continuation success (blue, left axis) declines while Wasserstein distance to the opposing group (brown, right axis) rises, with closely aligned transition zones (Spearman's ρ = −0.96). (b) At positions where continuation success changes by at least one completion step (|ΔAccuracy| ≥ 0.0625)…

Figure 2. Overview of SHEAR. For each rollout group, we partition each trajectory into overlap…

Figure 3. Empirical verification of the separation conditions. (Top left) Empirical accuracy stratified…

Figure 4. Training dynamics on mathematical reasoning. Each panel shows the benchmark-averaged…

Figure 5. Ablation studies on Qwen2.5-Math-7B.

Figure 6. Average accuracy across five math benchmarks for varying span length…

Figure 7. Impact of rollout group size (G = 8 vs. G = 16) on Qwen2.5-Math-7B. Both methods benefit from larger groups, but SHEAR exhibits a wider improvement margin, suggesting that richer opposing sets enhance the span-level Wasserstein signal.

Figure 8. Training time overhead of SHEAR relative to standard GRPO. Percentages indicate the…

Figure 9. Wasserstein distance discriminates between robust and vulnerable reasoning trajectories.

Figure 10. Distribution of group accuracies for the retained MATH500 subset (excluding all-correct and all-incorrect groups). The retained problems span the full range of intermediate difficulty levels, with no single accuracy bin dominating.

Figure 11. Normalized Wasserstein Distance Heatmap of Case…
read the original abstract

Group Relative Policy Optimization (GRPO) performs coarse-grained credit assignment in reinforcement learning with verifiable rewards (RLVR) by assigning the same advantage to all tokens in a rollout. Process reward models can provide finer-grained supervision, but they require step-level annotation or additional reward modeling. We show that hidden-state distributions contain a useful signal for local reasoning quality that can be extracted using only outcome-level correctness labels available in RLVR. Specifically, within each GRPO group, the Wasserstein distance between span-level hidden state distributions of correct and incorrect rollouts increases around regions where their local reasoning quality diverges. This association holds both across examples and within individual trajectories, suggesting that hidden-state distributional divergence can serve as a self-supervision signal for fine-grained credit assignment. We formalize this observation with a separation theorem showing that, under mild structural assumptions, post-divergence spans have larger Wasserstein distances than pre-divergence spans whenever the population-level distributional gap exceeds finite-sample noise. Motivated by this result, we propose Span-level Hidden state Enabled Advantage Reweighting (SHEAR), which modifies GRPO by using span-level Wasserstein distances to scale token-level advantages, amplifying updates on tokens whose hidden states are more separated from the opposing group. The method requires no additional model and only minimal changes to the training pipeline. Experiments on five mathematical reasoning benchmarks and five code generation benchmarks show improvements over standard GRPO and strong performance relative to supervised process reward models, while requiring no additional annotation or reward model training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that within GRPO groups, span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts increase at reasoning divergence points, formalized via a separation theorem under mild structural assumptions; this signal is used in the SHEAR method to reweight token advantages for finer-grained credit assignment in RLVR, yielding gains on math and code benchmarks without extra models or annotations.

Significance. If the separation theorem and empirical association hold, the work offers a self-supervised, parameter-free mechanism for local credit assignment using only outcome labels and existing hidden states, addressing a key limitation of coarse GRPO while avoiding the cost of process reward models. The formal theorem and consistent benchmark improvements across ten tasks are notable strengths.

major comments (2)
  1. [§3] Theorem 1 (§3): the separation result depends on unenumerated 'mild structural assumptions' (identical pre-divergence distributions for correct/incorrect groups and detectable post-divergence shifts exceeding finite-sample noise); these are not empirically validated on high-dimensional LLM hidden states, where even pre-divergence tokens can show variance from sampling or early uncertainty, directly affecting the reliability of the post-divergence signal used by SHEAR.
  2. [§4.2] §4.2 (Main results) and §4.3 (Ablations): with GRPO groups of only 4–8 rollouts, Wasserstein estimation in high dimensions is sensitive to sample size and imbalance; the reported gains over GRPO lack controls for this sensitivity (e.g., no ablation varying group size or reporting effective dimensionality), which is load-bearing for the claim that hidden-state divergence provides a robust self-supervision signal.
minor comments (2)
  1. [Abstract] The abstract and §4.1 should explicitly list the five math and five code benchmarks rather than referring to them generically.
  2. [§2] Notation for span-level distributions (e.g., how spans are segmented and hidden states aggregated) is introduced without a main-text equation; moving the key definition from the appendix to §2 would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We respond point-by-point to the major comments below, indicating where we will revise the manuscript to address the concerns.

read point-by-point responses
  1. Referee: [§3] Theorem 1 (§3): the separation result depends on unenumerated 'mild structural assumptions' (identical pre-divergence distributions for correct/incorrect groups and detectable post-divergence shifts exceeding finite-sample noise); these are not empirically validated on high-dimensional LLM hidden states, where even pre-divergence tokens can show variance from sampling or early uncertainty, directly affecting the reliability of the post-divergence signal used by SHEAR.

    Authors: We appreciate the referee highlighting the need for greater clarity on the assumptions underlying Theorem 1. The theorem is based on two structural assumptions: identical pre-divergence hidden-state distributions between correct and incorrect groups, and post-divergence distributional shifts that exceed finite-sample Wasserstein estimation noise. These are characterized as mild because they follow directly from the definition of reasoning divergence points. In the revised manuscript, we will explicitly enumerate these assumptions in the theorem statement and add a dedicated paragraph discussing their plausibility, supported by the empirical observation that pre-divergence Wasserstein distances remain low while post-divergence distances increase. Although exhaustive high-dimensional validation is computationally demanding, the consistent patterns reported across ten benchmarks in §4 provide supporting evidence for the practical reliability of the signal. We will also include a brief sensitivity analysis to sampling noise. revision: partial

  2. Referee: [§4.2] §4.2 (Main results) and §4.3 (Ablations): with GRPO groups of only 4–8 rollouts, Wasserstein estimation in high dimensions is sensitive to sample size and imbalance; the reported gains over GRPO lack controls for this sensitivity (e.g., no ablation varying group size or reporting effective dimensionality), which is load-bearing for the claim that hidden-state divergence provides a robust self-supervision signal.

    Authors: The referee correctly notes that Wasserstein estimation with small group sizes (4–8 rollouts) in high dimensions can be sensitive to sample size and class imbalance. This is a substantive concern for the robustness claim. While §4.3 already contains ablations on span length and reweighting strength that show stable gains, we did not vary group size or report effective dimensionality. In the revised version, we will add an ablation varying GRPO group size from 4 to 16, report any dimensionality reduction (if used) or regularization applied during Wasserstein computation, and include standard deviations across multiple random seeds to quantify stability. These additions will directly strengthen the evidence that the divergence signal remains useful under the reported experimental conditions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; separation theorem is an independent formalization

full rationale

The paper's chain proceeds from an empirical observation (Wasserstein distances increase post-divergence in GRPO groups) to a mathematical separation theorem stated under explicit mild structural assumptions, then to the SHEAR reweighting rule that applies the distance as a scaling factor. No equation or claim reduces by construction to a fitted parameter, self-defined quantity, or self-citation chain; the theorem is presented as a proof rather than a data-driven fit, and the method uses only outcome labels already present in RLVR. The derivation is self-contained, and the empirical claims are evaluated against external benchmarks rather than against quantities the method itself defines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on a separation theorem whose proof invokes mild structural assumptions about hidden-state distributions and finite-sample noise; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: mild structural assumptions on hidden-state distributions
    Invoked to guarantee that post-divergence Wasserstein distances exceed pre-divergence distances when population gap > noise.

pith-pipeline@v0.9.0 · 5629 in / 1237 out tokens · 43142 ms · 2026-05-08T08:01:21.305455+00:00 · methodology

