DRIFT: Difficulty Routing Self-DIstillation with Rhythm-Gated Exploration and Success BuFfer Training

Baoyan Guo; Bolan Yang; Chengwei Liu; Dan Liu; Haisen Luo; Haoning Wang; Haotian Wang; Jiong Chen; Junxi Yin; Lei Zhang

arxiv: 2606.30345 · v1 · pith:ER4JWYZLnew · submitted 2026-06-29 · 💻 cs.LG · cs.AI


Haisen Luo, Yiwei Liu, Haoning Wang, Dan Liu, Junxi Yin, Haotian Wang, Lei Zhang, Xiaoyu Tian, Shuaiting Chen, Yuansheng Song, Baoyan Guo, Xiongfei Yan, Bolan Yang, Chengwei Liu, Ming Cui, Jiong Chen


Pith reviewed 2026-06-30 07:15 UTC · model grok-4.3


keywords: self-improvement, self-distillation, reinforcement learning, large language models, reasoning tasks, policy optimization, curriculum learning, exploration


The pith

DRIFT lets language models improve their own reasoning by routing problems according to learning state and gating exploration to key steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes DRIFT as a framework for online self-evolution in large language models that avoids external supervision. It identifies the core problems in prior self-distillation and reinforcement approaches as over-optimization of easy problems, weak signals on hard problems, and insufficient exploration of borderline cases. Difficulty Routing tracks the model's learning state at the individual problem level to allocate the right mix of distillation and reinforcement signals. Rhythm Gating then narrows token-level updates to critical reasoning positions, while a success buffer and two-stage curriculum retain quality experience and progress the model from basic behaviors to stable policy changes. A reader would care because this structure aims to make autonomous improvement more reliable and effective on complex tasks.

Core claim

DRIFT is an online self-evolution policy optimization framework for large language models. It regulates the model's self-improvement process through the joint use of Difficulty Routing and Rhythm Gating. The former identifies the model's learning state at the problem level and dynamically allocates self-distillation and reinforcement learning signals, while the latter refines policy updates at the token level, concentrating exploration on critical reasoning positions. By further incorporating a success buffer and a two-stage curriculum learning strategy, DRIFT preserves high-quality historical experience while progressively guiding the model from reliable behavior acquisition toward stable policy evolution.
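The text above does not say how Rhythm Gating picks "critical reasoning positions". As a rough illustration only, the sketch below assumes (hypothetically) that high-entropy token positions are the critical ones and that only the top fraction of them contribute to the update. The names `token_entropy`, `rhythm_gate`, `gated_token_loss`, and `keep_fraction` are invented for this example and are not taken from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def rhythm_gate(entropies, keep_fraction=0.25):
    """Return a 0/1 mask that keeps the highest-entropy positions.

    ASSUMPTION: entropy is a stand-in criterion for "critical reasoning
    positions"; the paper's actual selection rule is not given here.
    """
    if not entropies:
        return []
    k = max(1, math.ceil(keep_fraction * len(entropies)))
    # Indices of the k largest entropies; ties go to the earlier position.
    order = sorted(range(len(entropies)), key=lambda i: (-entropies[i], i))
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(entropies))]

def gated_token_loss(per_token_losses, mask):
    """Average the per-token loss over the gated (mask == 1) positions only."""
    kept = [loss for loss, m in zip(per_token_losses, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0
```

With this kind of mask, tokens outside the gate contribute nothing to the policy update, which is one simple way to read "concentrating exploration on critical reasoning positions".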

What carries the argument

Difficulty Routing, which identifies the model's per-problem learning state to dynamically allocate self-distillation and reinforcement learning signals.
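How the per-problem learning state is measured is not specified in this summary. The sketch below is one plausible reading, under loudly stated assumptions: the state is a running pass rate per problem, problems are labelled easy, borderline, or hard by fixed thresholds, and the RL weight peaks for borderline problems. The class names, thresholds, momentum, and the `4 * p * (1 - p)` weighting are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ProblemState:
    pass_rate: float = 0.0  # running estimate of the success probability
    seen: int = 0           # number of rollout groups observed so far

@dataclass
class DifficultyRouter:
    # ASSUMPTION: thresholds and momentum are illustrative, not from the paper.
    easy_above: float = 0.8
    hard_below: float = 0.2
    momentum: float = 0.5
    states: dict = field(default_factory=dict)

    def update(self, problem_id, outcomes):
        """Fold one group of 0/1 rollout outcomes into the running pass rate."""
        state = self.states.setdefault(problem_id, ProblemState())
        batch_rate = sum(outcomes) / len(outcomes)
        if state.seen == 0:
            state.pass_rate = batch_rate
        else:
            state.pass_rate = (self.momentum * state.pass_rate
                               + (1.0 - self.momentum) * batch_rate)
        state.seen += 1
        return state.pass_rate

    def route(self, problem_id):
        """Return (label, rl_weight); the weight is largest near pass rate 0.5."""
        state = self.states.get(problem_id)
        if state is None or state.seen == 0:
            return "unseen", 1.0
        p = state.pass_rate
        rl_weight = 4.0 * p * (1.0 - p)  # hypothetical stand-in for gamma_i
        if p >= self.easy_above:
            return "easy", rl_weight
        if p <= self.hard_below:
            return "hard", rl_weight
        return "borderline", rl_weight
```

A weight that vanishes at pass rates of 0 and 1 matches the stated motivation (avoid over-optimizing easy problems, expect weak RL signal on hard ones), but it is only one of many schedules that would do so.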

If this is right

The framework produces higher scores than prior self-distillation and reinforcement methods across five reasoning benchmarks.
It reaches new peak accuracy on tool-use tasks.
The approach supports stable evolution across multiple model scales without external supervision.
The success buffer and curriculum together enable retention of high-quality experience during progressive training.
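The success buffer is described only as a way to preserve high-quality historical experience. A minimal sketch of that idea, assuming a fixed per-problem capacity and a scalar quality score (both assumptions, as are the class and method names), could look like this:

```python
import heapq
from collections import defaultdict

class SuccessBuffer:
    """Keep at most `capacity` highest-scoring correct responses per problem.

    ASSUMPTION: a scalar `score` ranks responses; the paper's retention
    rule is not described in the text above.
    """

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._heaps = defaultdict(list)  # problem_id -> min-heap of entries
        self._counter = 0                # tie-breaker so responses never compare

    def add(self, problem_id, response, score, correct):
        """Store a correct response; return True if it was kept."""
        if not correct:
            return False
        heap = self._heaps[problem_id]
        entry = (score, self._counter, response)
        self._counter += 1
        if len(heap) < self.capacity:
            heapq.heappush(heap, entry)
            return True
        if score > heap[0][0]:
            heapq.heapreplace(heap, entry)  # evict the current lowest score
            return True
        return False

    def best(self, problem_id):
        """Return the highest-scoring stored response, or None if empty."""
        heap = self._heaps.get(problem_id)
        if not heap:
            return None
        return max(heap)[2]
```

Stored responses could then serve as self-distillation targets for problems the current policy fails on, which is the role the summary attributes to the buffer.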

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The per-problem routing idea could extend to other adaptive training settings where progress varies widely across examples.
If the routing works as described, similar state-tracking might reduce the amount of hand-designed curricula needed in reinforcement learning for language models.
Separate tests that turn routing on and off while holding other components fixed would isolate its contribution to any observed stability gains.

Load-bearing premise

Difficulty Routing can reliably identify the model's learning state on each problem, and Rhythm Gating together with the success buffer produces stable policy evolution without introducing new instabilities.

What would settle it

Training runs in which Difficulty Routing is replaced by random or fixed allocation, then checking whether the reported performance gains over prior methods disappear.

Figures

Figures reproduced from arXiv: 2606.30345 by Baoyan Guo, Bolan Yang, Chengwei Liu, Dan Liu, Haisen Luo, Haoning Wang, Haotian Wang, Jiong Chen, Junxi Yin, Lei Zhang, Ming Cui, Shuaiting Chen, Xiaoyu Tian, Xiongfei Yan, Yiwei Liu, Yuansheng Song.

**Figure 1.** DRIFT overview. … updates can become noisy when successful samples are sparse or reward distributions are unstable. Self-distillation methods such as SDPO (Hübotter et al., 2026) reuse the model's correct solutions as supervision, yet they typically treat all such solutions uniformly, ignoring whether a problem is easy, near the model's capability boundary, or only occasionally solved. Recent sample-routing…

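**Figure 2.** Overall DRIFT training pipeline. Writing the two branches together, we obtain the overall objective of DRIFT in the mixed stage:

$$\min_{\theta}\ \mathcal{L}(\theta) := \mathbb{E}_{y_i \sim \pi_\theta(\cdot \mid x)} \Big[\, \underbrace{\mathbb{1}_{y_i \in \text{incorrect}}\ \mathcal{L}_{\mathrm{SDPO}}(y_i, f_i)}_{\text{Self-Distillation (\S3.1, \S3.4)}} \;-\; \underbrace{\gamma_i\, \mathbb{1}_{y_i \in \text{correct}}\ J^{\mathrm{rhythm}}_{\mathrm{GRPO}}(y_i)}_{\text{Difficulty-Routed RL (\S3.2, \S3.3)}} \,\Big], \tag{4}$$

where the two branches expand respectively as $\mathcal{L}_{\mathrm{SDPO}}(y_i, f_i) := \sum_{t=1}^{|y_i|} \mathrm{JSD}\,\pi_\theta(\cdot \mid x, y_i,$ …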
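To make the structure of Eq. (4) concrete, here is a small sketch of a Monte Carlo estimate of the mixed-stage loss over a batch of sampled responses. It assumes the per-response quantities (the SDPO loss, the rhythm-gated GRPO objective, and the routing weight γ_i) have already been computed elsewhere; the dictionary keys and the function name are hypothetical.

```python
def drift_mixed_loss(samples):
    """Batch estimate of Eq. (4): SDPO loss on incorrect responses,
    minus gamma-weighted GRPO objective on correct responses.

    Each sample is a dict with hypothetical keys:
      "correct"        (bool)  whether the response was judged correct
      "sdpo_loss"      (float) L_SDPO(y_i, f_i), used only if incorrect
      "grpo_objective" (float) J_GRPO^rhythm(y_i), used only if correct
      "gamma"          (float) difficulty-routing weight gamma_i
    """
    if not samples:
        raise ValueError("need at least one sampled response")
    total = 0.0
    for s in samples:
        if s["correct"]:
            # RL branch: maximizing J is minimizing -gamma_i * J.
            total -= s["gamma"] * s["grpo_objective"]
        else:
            # Self-distillation branch.
            total += s["sdpo_loss"]
    return total / len(samples)
```

Note that the two indicator terms are mutually exclusive, so each response feeds exactly one branch of the objective.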

**Figure 3.** Tool-use performance. Main results: our main results are reported as the mean@16 accuracy of Qwen3-8B across four scientific reasoning tasks (biology, chemistry, materials, physics) and tool use. DRIFT attains the highest average accuracy of 79.5%, outperforming the previously strongest baseline SRPO (77.4%) by 2.1 points, SC-SDPO (74.8%) by 4.7 points, and the GRPO/SDPO-series baselines by 7.5–9.5 points,…
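The captions report mean@16 and best@16 without defining them. The sketch below assumes the common reading: mean@k is the per-problem accuracy averaged over k sampled responses and then over problems, and best@k is the fraction of problems with at least one correct response among the k. Both definitions and the function names are assumptions.

```python
def mean_at_k(outcomes):
    """outcomes: one list of 0/1 correctness flags per problem (k flags each).
    Returns the accuracy averaged over samples, then over problems."""
    per_problem = [sum(row) / len(row) for row in outcomes]
    return sum(per_problem) / len(per_problem)

def best_at_k(outcomes):
    """Fraction of problems solved by at least one of the k samples."""
    return sum(1 for row in outcomes if any(row)) / len(outcomes)

# Toy example with k = 4 samples on three problems.
toy = [[1, 0, 1, 1], [0, 0, 0, 0], [0, 1, 0, 0]]
print(mean_at_k(toy), best_at_k(toy))
```

Under these definitions best@k can never be lower than mean@k, so the two metrics answer different questions: typical sample quality versus whether any sample succeeds.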

**Figure 4.** Training dynamics of DRIFT (Qwen3-8B on tool use).

**Figure 5.** Difficulty-routing dynamics (Qwen3-8B on tool use).

**Figure 6.** Ablation study on tool use with Qwen3-8B.

**Figure 7.** Validation performance on the STEM datasets. In addition to improvements in mean@16, best@16 also increases steadily throughout training.

Original abstract

Enabling large language models to achieve stable self-improvement without external expert supervision remains a central challenge in complex reasoning tasks. Existing self-distillation and reinforcement learning methods lack explicit mechanisms for tracking problem-level learning progress and adapting optimization strategies accordingly. Consequently, training may over-optimize easy problems, receive weak supervision from hard problems, and fail to sufficiently explore borderline cases. To resolve these issues, we propose DRIFT, an online self-evolution policy optimization framework for large language models. DRIFT regulates the model's self-improvement process through the joint use of Difficulty Routing and Rhythm Gating. The former identifies the model's learning state at the problem level and dynamically allocates self-distillation and reinforcement learning signals, while the latter refines policy updates at the token level, concentrating exploration on critical reasoning positions. By further incorporating a success buffer and a two-stage curriculum learning strategy, DRIFT preserves high-quality historical experience while progressively guiding the model from reliable behavior acquisition toward stable policy evolution. Evaluated across five benchmarks and three model scales, DRIFT surpasses the peak performance of both GRPO and SDPO across all evaluated metrics. On the average score over the five benchmarks, DRIFT achieves 79.5$\%$, outperforming GRPO by 9.5$\%$ and SDPO by 7.5$\%$, establishing a new state-of-the-art result. Notably, on ToolUse, DRIFT reaches an accuracy of 79.2$\%$, improving over GRPO by 13.5$\%$ and SDPO by 10.7$\%$, setting a new state-of-the-art and substantially outperforming all concurrent methods.
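The abstract states gains as "9.5%" and "7.5%" without saying whether these are percentage points or relative improvements. The Figure 3 caption describes the same gaps in "points", which suggests the absolute reading. The sketch below back-computes the baseline scores implied under each reading; the function name is hypothetical and the derived baseline values are arithmetic consequences, not numbers quoted from the paper.

```python
def implied_baseline(drift_score, gain, relative=False):
    """Baseline score implied by a reported gain.

    relative=False: gain is in percentage points (baseline = score - gain).
    relative=True:  gain is a relative improvement in percent
                    (baseline = score / (1 + gain / 100)).
    """
    if relative:
        return round(drift_score / (1.0 + gain / 100.0), 1)
    return round(drift_score - gain, 1)

# Percentage-point reading of the abstract's numbers.
avg_grpo = implied_baseline(79.5, 9.5)    # five-benchmark average, GRPO
avg_sdpo = implied_baseline(79.5, 7.5)    # five-benchmark average, SDPO
tool_grpo = implied_baseline(79.2, 13.5)  # ToolUse, GRPO
tool_sdpo = implied_baseline(79.2, 10.7)  # ToolUse, SDPO
print(avg_grpo, avg_sdpo, tool_grpo, tool_sdpo)
```

Under the percentage-point reading the implied averages are 70.0 for GRPO and 72.0 for SDPO; under the relative reading the GRPO average would instead be about 72.6, so the two interpretations differ by more than the gap between DRIFT and SRPO reported in Figure 3.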

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DRIFT combines difficulty routing and rhythm gating for adaptive self-training and reports clear gains over GRPO and SDPO, but the abstract leaves the results hard to verify without controls or ablations.


The main thing here is that DRIFT adds difficulty routing to allocate self-distillation versus RL signals per problem, rhythm gating to focus token-level updates, a success buffer, and a two-stage curriculum. It reports beating GRPO and SDPO by 9.5 and 7.5 points, respectively, on the five-benchmark average, with the largest lift on ToolUse.

The paper does a solid job spelling out the practical problems with prior self-distillation and RL methods—easy problems get over-optimized while hard ones give weak signals—and the components are presented as direct responses to those gaps. The idea of tracking problem-level progress without external labels is a reasonable direction, and the reported consistency across three model scales is at least a starting point for the claim.

The soft spots are straightforward. The abstract gives no details on baseline implementations, ablation results, variance across runs, or how the routing decisions were validated. Without those, it is difficult to separate the contribution of the new mechanisms from other training choices. The central assumption that the routing reliably detects learning state and that the gating plus buffer avoid new instabilities also sits on empirical results that are not yet shown to be robust.

This is for people working on LLM self-improvement and RL for reasoning. Someone already following GRPO-style work could pick up the routing idea and test it, but they would need the full methods and data to judge the numbers.

The paper shows clear engagement with the problem setup and prior limitations. It deserves peer review so the experiments can be checked properly.

Referee Report

3 major / 1 minor

Summary. The paper proposes DRIFT, an online self-evolution policy optimization framework for LLMs that combines Difficulty Routing (to identify per-problem learning states and allocate self-distillation vs. RL signals), Rhythm Gating (for token-level focus on critical reasoning positions), a success buffer (to preserve high-quality historical experience), and a two-stage curriculum. It claims consistent empirical superiority over GRPO and SDPO across five benchmarks and three model scales, with DRIFT reaching 79.5% average score (9.5% above GRPO, 7.5% above SDPO) and 79.2% on ToolUse (13.5% and 10.7% gains respectively), establishing new state-of-the-art results.

Significance. If the reported gains prove robust under controlled ablations and statistical testing, the framework could meaningfully advance unsupervised self-improvement methods by explicitly tracking problem difficulty and modulating exploration at both problem and token levels. The joint use of routing and gating addresses a recognized gap in prior self-distillation/RL approaches, though the absence of any mention of reproducibility artifacts (code, seeds, or hyperparameter schedules) limits immediate impact assessment.

major comments (3)

Abstract: the central claim that DRIFT 'surpasses the peak performance of both GRPO and SDPO across all evaluated metrics' and sets new SOTA is presented without any reference to experimental controls, baseline implementation details, number of runs, variance, or statistical significance testing; this information is load-bearing for the performance numbers (79.5% average, 79.2% ToolUse) and cannot be verified from the given text.
Abstract (method description): no equations, pseudocode, or implementation details are supplied for Difficulty Routing (how per-problem learning state is quantified) or Rhythm Gating (how token-level critical positions are identified), so it is impossible to assess whether these mechanisms are independent of the fitted training choices that produce the reported gains.
Abstract: the text states that DRIFT incorporates 'a success buffer and a two-stage curriculum learning strategy' but supplies no ablation results isolating their contribution versus the routing/gating components, undermining attribution of the 9.5%/7.5% average improvements specifically to the proposed innovations.
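The first major comment asks for variance and significance information. One generic way to supply it, sketched here as an assumption and not as anything the paper reports doing, is a paired percentile bootstrap over per-problem scores of two methods evaluated on the same problems. The function name and defaults are illustrative.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a - scores_b) over paired items.

    scores_a[i] and scores_b[i] must refer to the same problem.
    Returns (mean_difference, ci_low, ci_high).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally long score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, low, high
```

If the resulting interval for DRIFT minus a baseline excludes zero, the gap is unlikely to be an artifact of which problems happened to be in the evaluation set; it says nothing about run-to-run variance, which would need repeated training seeds.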

minor comments (1)

Abstract: minor typographical inconsistency in 'Self-DIstillation' (capital I in 'DIstillation').

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive suggestions. We provide point-by-point responses to the major comments below, clarifying aspects of the experimental reporting and method descriptions.


Referee: Abstract: the central claim that DRIFT 'surpasses the peak performance of both GRPO and SDPO across all evaluated metrics' and sets new SOTA is presented without any reference to experimental controls, baseline implementation details, number of runs, variance, or statistical significance testing; this information is load-bearing for the performance numbers (79.5% average, 79.2% ToolUse) and cannot be verified from the given text.

Authors: We agree that the abstract would benefit from referencing the experimental controls. The manuscript body (Section 4) provides details on baseline implementations, number of runs, variance, and statistical testing. We will revise the abstract to briefly note these aspects for improved verifiability of the claims. revision: yes
Referee: Abstract (method description): no equations, pseudocode, or implementation details are supplied for Difficulty Routing (how per-problem learning state is quantified) or Rhythm Gating (how token-level critical positions are identified), so it is impossible to assess whether these mechanisms are independent of the fitted training choices that produce the reported gains.

Authors: The abstract summarizes the approach at a high level. Full equations, quantification details for Difficulty Routing and Rhythm Gating, and pseudocode are provided in Section 3 of the manuscript. We will update the abstract to include a concise description of how the learning state and critical positions are identified. revision: partial
Referee: Abstract: the text states that DRIFT incorporates 'a success buffer and a two-stage curriculum learning strategy' but supplies no ablation results isolating their contribution versus the routing/gating components, undermining attribution of the 9.5%/7.5% average improvements specifically to the proposed innovations.

Authors: Ablation results isolating the contributions of the success buffer and two-stage curriculum are included in Section 5 and the appendix of the manuscript. We will revise the abstract to reference these ablations to better attribute the performance gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected


The paper is an empirical proposal of a training framework (Difficulty Routing + Rhythm Gating + success buffer + two-stage curriculum) whose central claims consist of measured performance gains on five benchmarks versus GRPO and SDPO. No derivation chain, equations, or uniqueness theorems are presented that reduce to fitted parameters or self-citations by construction. The reported improvements are external benchmark results whose independence from the training choices is not internally contradicted by the given text; the method is therefore self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all such elements remain unknown.

pith-pipeline@v0.9.1-grok · 5884 in / 1195 out tokens · 28737 ms · 2026-06-30T07:15:12.257183+00:00 · methodology


Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 16 internal anchors

[1]

Reinforcement Learning from Rich Feedback with Distributional DAgger

Rishabh Agrawal, Jacob Fein-Ashley, and Paria Rashidinejad. Reinforcement learning from rich feedback with distributional DAgger. arXiv preprint arXiv:2606.05152.
[2]

Prompt Replay: Speeding up GRPO with On-Policy Reuse of High-Signal Prompts

Andrei Baroian and Rutger Berger. Prompt replay: Speeding up GRPO with on-policy reuse of high-signal prompts. arXiv preprint arXiv:2603.21177.
[3]

The Unlearnability Phenomenon in RLVR for Language Models

URL https://arxiv.org/abs/2605.16787. Yue Cheng, Jiajun Zhang, Xiaohui Gao, Weiwei Xing, Zheng Wang, and Zhanxing Zhu. Mechanistically interpreting the role of sample difficulty in RLVR for LLMs.
[4]

Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

URL https://arxiv.org/abs/2605.28388. [Page header: Beike Language and Intelligence] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
[6]

DeepSeek-V4: Towards highly efficient million-token context intelligence

URL https://arxiv.org/abs/2606.19348. Kehua Feng, Keyan Ding, Weijie Wang, Xiang Zhuang, Zeyuan Wang, Ming Qin, Yu Zhao, Jianhua Yao, Qiang Zhang, and Huajun Chen. SciKnowEval: Evaluating multi-level scientific knowledge of large language models. arXiv preprint arXiv:2406.09098.
[8]

Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

URL https://arxiv.org/abs/2603.25562. Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: On-policy distillation of large language models. arXiv preprint arXiv:2306.08543.
[9]

MiniLLM: On-Policy Distillation of Large Language Models

URL https://arxiv.org/abs/2306.08543. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
[10]

Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802.
[11]

s1: Simple test-time scaling

Aaron Jaech et al. Learning to reason with LLMs. arXiv preprint arXiv:2501.19393.
[13]

Entropy-Aware On-Policy Distillation of Language Models

URL https://arxiv.org/abs/2603.07079. Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, and Tat-Seng Chua. Unifying group-relative and self-distillation policy optimization via sample routing. arXiv preprint arXiv:2604.02288, 2026a. Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng...
[14]

Zehao Liu, Yuanpu Cao, Jinghui Chen, and Vasant G. Honavar. Restoring the sweet spot: Pass-rate weighted self-distillation for LLM reasoning. arXiv preprint arXiv:2605.27765.
[15]

Olmo 3

URL https://arxiv.org/abs/2512.13961. Leyi Pan, Shuchang Tao, Yunpeng Zhai, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Aiwei Liu, and Lijie Wen. RLCSD: Reinforcement learning with contrastive on-policy self-distillation. arXiv preprint arXiv:2606.11709.
[16]

RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation

URL https://arxiv.org/abs/2606.11709. Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In International Conference on Learning Representations.
[17]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
[18]

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases

Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, and Le Sun. ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint arXiv:2306.05301.
[19]

Physics-Guided Policy Optimization with Self-Distillation

Ke Wang, Yuning Wu, Haoran Liu, Chaoqun Jia, Devin Chen, and Kai Wei. Physics-guided policy optimization with self-distillation. arXiv preprint arXiv:2606.03620.
[20]

Qwen3 Technical Report

An Yang et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
[21]

ExGRPO: Learning to Reason from Experience

Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, and Yu Cheng. ExGRPO: Learning to reason from experience. arXiv preprint arXiv:2510.02245.
[22]

GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models

URL https://arxiv.org/abs/2507.07451. Jixiao Zhang and Chunsheng Zuo. GRPO-LEAD: A difficulty-aware reinforcement learning approach for concise mathematical reasoning in language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 5642–5654, 2025.
[23]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

URL https://arxiv.org/abs/2601.18734.

A Appendix

Table 4: Best@16 performance on Qwen3-8B. DRIFT outperforms SDPO and GRPO across all five tasks, with particularly pronounced gains on materials and tool use. Metho...
