Self-Distilled Policy Gradient

Quanquan Gu; Shiyuan Zhang; Yifan Zhang; Yifeng Liu

arxiv: 2606.04036 · v1 · pith:CRXPDDUKnew · submitted 2026-06-02 · 💻 cs.LG

Pith reviewed 2026-06-28 11:22 UTC · model grok-4.3

classification 💻 cs.LG

keywords self-distilled policy gradient · on-policy self-distillation · policy gradient · reinforcement learning · language models · reverse KL divergence · RLVR

The pith

Self-distilled policy gradient improves stability and performance in language model reinforcement learning by adding full-vocabulary on-policy self-distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SDPG as a policy-gradient framework for language models that adds on-policy self-distillation to handle sparse rewards. Self-distillation is realized as an auxiliary reverse KL loss over the full vocabulary that lets the model supervise its own generations using privileged context. The framework further incorporates group-relative verifier advantages, normalized standard deviation, and reference-policy KL regularization. These elements together are shown empirically to produce more stable training and higher performance than RLVR or plain self-distillation baselines.

Core claim

SDPG is a self-distilled policy-gradient framework that combines three components: group-relative verifier advantages with normalized standard deviation; exact full-vocabulary on-policy self-distillation, instantiated as an auxiliary student-to-teacher reverse Kullback-Leibler divergence loss; and reference-policy KL regularization.
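
One schematic way to read this claim is as a three-term objective. The block below is only a sketch: the paper's exact objective is not reproduced on this page, so the weight \lambda, the token-level form of each term, and the direction of the reference KL are assumptions. The symbols p_t, q_t, and \beta follow the notation of the Figure 2 caption.

```latex
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{i,t}\!\left[\hat{A}_i \,\log \pi_\theta\!\left(y_{i,t}\mid x, y_{i,<t}\right)\right]
  + \beta\,\mathbb{E}_{i,t}\!\left[\mathrm{KL}\!\left(p_t \,\|\, q_t\right)\right]
  + \lambda\,\mathbb{E}_{i,t}\!\left[\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)\right]
```

In this reading, \hat{A}_i is the group-relative verifier advantage of sampled response i, p_t is the student's next-token distribution at position t, q_t is the same model's distribution when it also sees the privileged context, and \pi_{\mathrm{ref}} is the frozen reference policy.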

What carries the argument

Exact full-vocabulary on-policy self-distillation via reverse KL divergence, used as an auxiliary loss alongside group-relative verifier advantages.
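
To make "full-vocabulary reverse KL" concrete, here is a minimal pure-Python sketch for a single token position. It assumes the student and teacher are given as raw logits over the same vocabulary; the function names and the toy logits are illustrative and are not taken from the paper's code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher), summed over every vocabulary entry.

    The expectation is taken under the student distribution, which is
    what makes this the 'reverse' (student-to-teacher) direction.
    """
    log_p = log_softmax(student_logits)  # student
    log_q = log_softmax(teacher_logits)  # teacher (privileged context)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy 4-token vocabulary (hypothetical values).
student = [2.0, 0.5, -1.0, 0.0]
teacher = [1.0, 1.5, -0.5, 0.0]

kl_same = reverse_kl(student, student)  # identical distributions
kl_diff = reverse_kl(student, teacher)  # student vs. teacher
kl_swap = reverse_kl(teacher, student)  # forward direction, for contrast
```

Because the sum runs over the whole vocabulary rather than only the sampled token, the value is exact for that position and needs no Monte Carlo estimate. The swapped call shows that the divergence is not symmetric, so the choice of direction matters.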

If this is right

SDPG improves stability over RLVR and self-distillation baselines.
SDPG improves performance over RLVR and self-distillation baselines.
On-policy self-distillation supplies dense supervision signals for sparse-reward reinforcement learning.
The reverse KL auxiliary loss can be added to existing policy-gradient methods without changing the primary advantage estimator.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same self-distillation mechanism might transfer to other sparse-reward sequential decision tasks where a model can condition on richer context than its own output distribution.
If the normalized standard deviation scaling proves critical, similar normalization could be tested on other advantage estimators outside language-model RL.
The reference-policy KL term may interact with the self-distillation loss in ways that limit distribution shift; removing or varying that term could isolate its contribution.
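
For readers unfamiliar with the advantage scaling mentioned above, the sketch below shows a common GRPO-style group-relative advantage with standard-deviation normalization. Whether the paper's "normalized standard deviation" matches this exact form is an assumption, and the function name and the eps constant are illustrative.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Center rewards within one prompt's sample group and divide by the
    group's population standard deviation (eps avoids division by zero)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt, scored 0/1 by a verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every response gets the same reward carries no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

The flat group illustrates the sparse-reward problem that motivates a dense auxiliary signal: when all samples succeed or all fail, every advantage is zero and the policy-gradient term contributes nothing for that prompt.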

Load-bearing premise

The specific combination of group-relative verifier advantages, normalized standard deviation, exact full-vocabulary on-policy self-distillation, and reference-policy KL regularization produces net positive effects without harmful interactions or biases from using the model's own generations as supervision.

What would settle it

An ablation that removes the full-vocabulary on-policy self-distillation loss and measures whether stability or final performance drops on the same tasks would directly test the contribution of that component.

Figures

Figures reproduced from arXiv: 2606.04036 by Quanquan Gu, Shiyuan Zhang, Yifan Zhang, Yifeng Liu.

**Figure 2.** The illustration of the β schedule. Early misalignment between p_t and the privileged distribution q_t can make the OPD target noisy. To prevent privileged distillation from destabilizing exploration, we warm up β. The OPD term then takes effect gradually after the outcome policy has begun to find correct trajectories. Moreover, under an idealized privileged-information model, distilling a teacher conditi…

**Figure 3.** Training dynamics and benchmark performance on Qwen3-4B trained with baseline…

**Figure 4.** Prompt templates for the student and teacher models, where the “{question}”, “{answer}”…

**Figure 5.** Ablation study on Qwen3-4B isolating the KL regularization (…

**Figure 6.** Training dynamics and benchmark performance on Qwen3-1.7B trained with baseline…
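
The β warm-up described in the Figure 2 caption could be sketched as follows. The caption does not give the schedule's shape on this page, so the linear ramp, the function name, and the parameters warmup_steps and beta_max are all assumptions made for illustration.

```python
def beta_schedule(step, warmup_steps, beta_max):
    """Linearly ramp the distillation weight from 0 to beta_max.

    Assumed shape: linear warm-up, then constant. The paper's actual
    schedule may differ.
    """
    if warmup_steps <= 0:
        return beta_max
    frac = min(max(step, 0) / warmup_steps, 1.0)
    return beta_max * frac


# Hypothetical settings: 100 warm-up steps, final weight 0.5.
beta_start = beta_schedule(0, 100, 0.5)
beta_mid = beta_schedule(50, 100, 0.5)
beta_end = beta_schedule(200, 100, 0.5)
```

Starting at zero keeps the early updates driven by the verifier reward alone, which matches the caption's stated motivation of not letting a noisy distillation target disturb exploration.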

Original abstract

On-policy self-distillation, where a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. Actually, it can be instantiated as an auxiliary full-vocabulary student-to-teacher reverse Kullback-Leibler divergence loss. We therefore propose SDPG, a self-distilled policy-gradient framework that combines group-relative verifier advantages with normalized standard deviation, exact full-vocabulary on-policy self-distillation, as well as reference-policy KL regularization. Empirically, SDPG improves stability and performance over RLVR and self-distillation baselines. The code is available at https://github.com/lauyikfung/SDPG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SDPG names a specific mix of group-relative advantages, normalized scaling, on-policy reverse-KL self-distillation, and reference KL as a single framework, but the abstract supplies no numbers or experiment details to support the stability claim.

SDPG combines group-relative verifier advantages, normalized standard deviation, on-policy self-distillation via reverse KL on the full vocabulary, and reference-policy KL regularization. The abstract states that this package improves stability and performance over RLVR and self-distillation baselines in language-model RL.

The concrete piece is the auxiliary loss that treats the model's own generations as a teacher under privileged context. That is a direct way to turn self-distillation into dense supervision rather than just another reward signal. Releasing the code is also straightforward and helpful for anyone who wants to try the combination.

The work targets a real pain point in current LLM alignment pipelines where rewards are sparse. The listed components are described clearly enough that a practitioner could implement the core idea from the abstract.

The main limitation is that the abstract contains no quantitative results, no task descriptions, no model sizes, no error bars, and no ablation details. Without those, the improvement claim cannot be checked for size, consistency across seeds, or dependence on particular hyper-parameters. The potential bias from using the model's own outputs as supervision is acknowledged in the setup but not analyzed.

This paper is for researchers already running policy-gradient fine-tuning on language models who are looking for an extra auxiliary loss to try. Readers seeking a large-scale validated method or a theoretical derivation will find little here.

If the full paper contains controlled experiments, ablations, and reproducible numbers that hold up, it is worth sending to referees. The topic is timely and the code is public, so a review could clarify whether the specific mix actually delivers net gains.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes SDPG, a self-distilled policy-gradient framework for language models. It combines group-relative verifier advantages with normalized standard deviation, exact full-vocabulary on-policy self-distillation implemented as an auxiliary reverse KL divergence loss, and reference-policy KL regularization. The central claim is that this combination empirically improves stability and performance over RLVR and self-distillation baselines, with code released at the provided GitHub repository.

Significance. If the reported gains hold, the approach supplies a practical mechanism for obtaining dense on-policy supervision in sparse-reward RL settings for language models. The public code release is a clear strength that aids reproducibility and verification.

minor comments (3)

Abstract: the empirical claim would be strengthened by a one-sentence indication of the tasks or benchmarks on which stability and performance gains were measured.
Section 3 (method): the precise formula and motivation for 'normalized standard deviation' in the advantage estimator should be written out explicitly rather than described only in prose.
Experiments section: while the abstract asserts gains over baselines, the main text should include a short discussion of whether the four components interact constructively or whether any can be ablated without loss of the reported benefit.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of SDPG, the recognition of its practical value for dense on-policy supervision in sparse-reward settings, and the recommendation for minor revision. The report contains no major comments requiring point-by-point responses.

Circularity Check

0 steps flagged

No significant circularity

The paper proposes an empirical method (SDPG) that combines existing RL components (group-relative advantages, KL regularization, on-policy self-distillation via reverse KL) and reports performance gains on benchmarks. No derivation chain exists that reduces a claimed result to its inputs by construction; the central claims are experimental comparisons against baselines, with no fitted parameters renamed as predictions, no self-citation load-bearing uniqueness theorems, and no self-definitional loops in the method description. The self-distillation step is an explicit design choice, not a hidden tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are identifiable from the provided text. The approach appears to rest on standard RL assumptions about advantage estimation and KL regularization that are not detailed here.

Reference graph

Works this paper leans on

51 extracted references · 23 canonical work pages · 19 internal anchors

