pith. machine review for the scientific record.

arxiv: 2605.07804 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 03:15 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords on-policy distillation · long-horizon reasoning · prefix drift detection · dynamic truncation · teacher reward reliability · efficient model training · math reasoning benchmarks

The pith

Prune-OPD makes on-policy distillation for long-horizon reasoning more efficient by pruning unreliable teacher rewards in real time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

On-policy distillation uses dense teacher rewards to train reasoning models, but on long tasks the student's path often drifts from the teacher's, rendering later rewards unhelpful and the compute spent on them wasted. Prune-OPD addresses this by tracking top-k overlap to spot drift, then down-weighting the unreliable rewards and truncating the rollout, reallocating compute to the reliable parts of the supervision. The approach cuts training time substantially while holding or improving results on hard math benchmarks. Readers should care because it removes a key barrier to scaling distillation to complex, extended reasoning problems.

Core claim

By continuously monitoring the local compatibility between student and teacher predictions through top-k overlap, Prune-OPD detects prefix-drift events in real time. When drift is severe, it applies monotonic down-weighting to unreliable rewards and triggers dynamic rollout truncation. This stops generation on drifted trajectories and focuses training strictly on locally exploitable teacher signals. The result is a 37.6 to 68.0 percent reduction in training time across various teacher-student setups, with performance on AMC, AIME, and HMMT either preserved or improved, and automatic preservation of long contexts when alignment stays strong.
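One hedged way to write the monitoring rule down (our notation, not the paper's; the abstract specifies only "top-k overlap" and monotonic down-weighting, and Figure 8 suggests thresholds γ around 0.6 to 0.8):

$$o_t \;=\; \frac{1}{k}\,\Bigl|\mathrm{TopK}_k\bigl(\pi_{\mathrm{student}}(\cdot \mid y_{<t})\bigr) \,\cap\, \mathrm{TopK}_k\bigl(\pi_{\mathrm{teacher}}(\cdot \mid y_{<t})\bigr)\Bigr|$$

Under this reading, a drift event fires at position t when o_t falls below γ; afterwards a monotonically non-increasing weight, e.g. w_t = min(w_{t-1}, g(o_t)) with w_0 = 1, attenuates the OPD reward, and the rollout is truncated once reliable supervision is exhausted.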

What carries the argument

The real-time prefix-drift detector based on top-k prediction overlap, combined with monotonic reward down-weighting and dynamic truncation.
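A minimal sketch of that machinery in PyTorch, under stated assumptions: the paper specifies top-k overlap monitoring, monotonic down-weighting, and dynamic truncation, but k, gamma, the decay schedule, and every name below are illustrative choices, not the released implementation.

```python
# Hedged sketch of a Prune-OPD-style drift detector; parameters are assumed.
import torch


def topk_overlap(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 k: int = 8) -> torch.Tensor:
    """Fraction of shared top-k next-token candidates at each position, shape (T,)."""
    s_top = student_logits.topk(k, dim=-1).indices   # (T, k)
    t_top = teacher_logits.topk(k, dim=-1).indices   # (T, k)
    # For each position, count student candidates that also appear
    # in the teacher's candidate set.
    shared = (s_top.unsqueeze(-1) == t_top.unsqueeze(-2)).any(dim=-1).sum(dim=-1)
    return shared.float() / k


def prune_opd_weights(overlap: torch.Tensor,
                      gamma: float = 0.7,
                      decay: float = 0.5,
                      floor: float = 0.05):
    """Monotonically non-increasing reward weights plus a truncation index."""
    weights, w = [], 1.0
    for o in overlap.tolist():
        if o < gamma:        # drift event: attenuate and never recover
            w *= decay
        weights.append(w)
        if w < floor:        # reliable supervision exhausted: stop the rollout
            break
    return torch.tensor(weights), len(weights)
```

In an OPD update, `weights` would multiply the per-token teacher reward before the gradient step, and the returned index would cap further generation on that trajectory.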

Load-bearing premise

Top-k overlap accurately signals when teacher rewards stop being locally exploitable, and that truncating based on it does not remove signals essential for long-horizon learning progress.

What would settle it

A direct comparison on one of the benchmarks: if the pruned version shows lower final performance than an unpruned full-rollout baseline, that would indicate important signals were discarded.

Figures

Figures reproduced from arXiv: 2605.07804 by Jing Tang, Minrui Xu, Xiaodan Liang, Yifan Song, Yiwei Wang, Yongxin Wang, Zhicheng Yang, Zhijiang Guo.

Figure 1
Figure 1. Conceptual overview of PRUNE-OPD. PRUNE-OPD monitors local student-teacher compatibility along the student rollout, monotonically attenuates OPD rewards after low-overlap drift events, and truncates the response once reliable teacher supervision is exhausted. view at source ↗
Figure 2
Figure 2. High-compatibility training dynamics for DeepSeek-R1-Distill-Qwen-7B / Skywork-OR1-7B. Left: effective response length and maximum OPD length versus training step. Middle: overlap ratio versus training step. Right: AMC23 accuracy over training, comparing OPD, OPD (Truncate 4k), and PRUNE-OPD. view at source ↗
Figure 3
Figure 3. Training-step accuracy dynamics for DeepSeek-R1-Distill-Qwen-1.5B distilled from JustRL-DeepSeek-1.5B. The five panels report benchmark accuracy over training steps on AMC23, AIME24, AIME25, HMMT24, and HMMT25, comparing OPD and PRUNE-OPD. view at source ↗
Figure 4
Figure 4. Training-dynamics diagnostics for DeepSeek-R1-Distill-Qwen-1.5B distilled from JustRL-DeepSeek-1.5B. The panels report mean Prune-OPD weight by token position with curves every 20 training steps from 0 to 200; effective response length and maximum OPD length over training; and overlap ratio over training. view at source ↗
Figure 5
Figure 5. Accuracy over wall-clock time for the 4 DeepSeek student-teacher pairs. Each panel uses wall-clock time as the x-axis and benchmark accuracy as the y-axis, comparing OPD and PRUNE-OPD. A successful curve should match or exceed OPD accuracy while reaching comparable checkpoints earlier in time. view at source ↗
Figure 6
Figure 6. Short effective OPD windows in the low-overlap Qwen3 distillation pairs. For Qwen3-1.7B-Base / Qwen3-4B (Non-thinking) and Qwen3-4B-Base / Qwen3-4B (Non-thinking), low overlap causes PRUNE-OPD to concentrate OPD supervision within a few hundred reliable tokens, whereas the OPD baseline keeps training on responses up to 12,288 tokens. view at source ↗
Figure 7
Figure 7. OPD baseline overlap-ratio training dynamics for DeepSeek-R1-Distill-Qwen-1.5B / JustRL-DeepSeek-1.5B. Each panel plots overlap ratio versus training step for a token-position band: 0–1K, 2–3K, 4–6K, and 7–8K. This diagnostic shows how local student-teacher compatibility evolves at different trajectory depths under unpruned OPD. view at source ↗
Figure 8
Figure 8. Figure 8: Prune-OPD threshold diagnostics for DeepSeek-R1-Distill-Qwen-1.5B / JustRL-DeepSeek-1.5B. Left: mean Prune-OPD weight as a function of token position under three overlap thresholds, γ = 0.6, 0.7, 0.8; for each threshold, the curves are taken at training steps 100, 120, 140, 160, 180, and 200. Right: maximum OPD response length over training steps under the same thresholds. Together, these diagnostics show … view at source ↗
Original abstract

On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the teacher's thought process, the teacher's dense reward loses local exploitability. Continuing to generate and evaluate tokens on these "drifted" trajectories not only degrades reward quality but also incurs massive computational waste. To address this, we introduce Prune-OPD, a framework that dynamically aligns training budgets with supervision quality. By continuously monitoring the local compatibility between student and teacher predictions (e.g., via top-k overlap), Prune-OPD detects prefix-drift events in real time. Upon detecting severe drift, it monotonically down-weights subsequent unreliable rewards and triggers dynamic rollout truncation. This allows the training process to halt futile generation and reallocate compute strictly to reliable teacher supervision. Across diverse teacher-student combinations, Prune-OPD consistently aligns computation with supervision reliability. When prefix drift makes dense teacher rewards unreliable, it reduces training time by 37.6%–68.0% while preserving, and often improving, performance on challenging benchmarks (AMC, AIME, HMMT). When student-teacher compatibility remains high, it automatically preserves long-context supervision by expanding the training window. These results suggest that Prune-OPD improves OPD not by blindly shortening rollouts, but by reallocating computation toward locally exploitable teacher rewards.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Prune-OPD as an enhancement to on-policy distillation (OPD) for long-horizon reasoning. It introduces a mechanism to monitor top-k overlap between student and teacher next-token distributions to detect prefix drift in real time. Upon detecting drift, the method applies monotonic down-weighting of unreliable rewards and dynamic rollout truncation to halt generation on drifted trajectories. The authors claim this leads to training time reductions of 37.6%–68.0% while maintaining or improving performance on challenging benchmarks including AMC, AIME, and HMMT, by aligning computation with supervision reliability across various teacher-student setups.

Significance. Should the empirical claims be substantiated, Prune-OPD represents a meaningful advance in efficient training of reasoning models via distillation. By providing a dynamic way to prune unexploitable supervision signals, it addresses a key scalability issue in OPD for tasks where student trajectories diverge from the teacher. This could enable more effective use of compute resources in long-context reasoning training. The paper's strength lies in its focus on real-time compatibility monitoring rather than static truncation, which if properly validated could be adopted in practice for reducing waste in RL-style fine-tuning of LLMs.

major comments (3)
  1. Abstract: The central efficiency and performance claims (37.6%–68.0% time reduction while preserving performance on AMC, AIME, HMMT) are stated without accompanying details on experimental protocols, baseline methods, number of trials, or error bars. This omission prevents assessment of whether the gains stem from the proposed drift detection or from simpler heuristics.
  2. Method: The key assumption that low top-k overlap reliably signals loss of local exploitability for teacher rewards is load-bearing for the pruning logic, but the manuscript provides no direct evidence or ablation showing correlation between overlap thresholds and quantities such as gradient norms, reward variance, or value function accuracy. If this proxy is only loosely related, the benefits may not generalize beyond the tested cases.
  3. Experiments: There is no comparison to alternative truncation strategies (e.g., fixed-length cutoffs or random pruning) or analysis of failure cases where high overlap coincides with poor learning signals. Such controls are necessary to establish that the method specifically reallocates compute to reliable supervision rather than merely shortening all rollouts.
minor comments (1)
  1. Abstract: The phrase 'diverse teacher-student combinations' is used but not elaborated with specific model pairs or sizes, which would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of Prune-OPD in addressing scalability issues in on-policy distillation. Below, we provide point-by-point responses to the major comments, outlining the revisions we plan to make.

Point-by-point responses
  1. Referee: Abstract: The central efficiency and performance claims (37.6%–68.0% time reduction while preserving performance on AMC, AIME, HMMT) are stated without accompanying details on experimental protocols, baseline methods, number of trials, or error bars. This omission prevents assessment of whether the gains stem from the proposed drift detection or from simpler heuristics.

    Authors: We agree that the abstract would benefit from additional context. In the revised version, we will expand the abstract to briefly describe the experimental protocols (including teacher-student pairs, benchmarks, and rollout settings), note comparisons against standard OPD baselines, indicate that results are averaged over multiple independent trials, and reference the error bars reported in the main experiments. This will help substantiate that the reported gains derive from the dynamic drift detection rather than simpler fixed heuristics. revision: yes

  2. Referee: Method: The key assumption that low top-k overlap reliably signals loss of local exploitability for teacher rewards is load-bearing for the pruning logic, but the manuscript provides no direct evidence or ablation showing correlation between overlap thresholds and quantities such as gradient norms, reward variance, or value function accuracy. If this proxy is only loosely related, the benefits may not generalize beyond the tested cases.

    Authors: We acknowledge the importance of validating the top-k overlap proxy. While our end-to-end results demonstrate its utility, we did not include explicit correlations with gradient norms or value accuracy. In the revision, we will add an ablation in the appendix that plots top-k overlap against reward variance across training steps and provides empirical justification for the chosen thresholds. This addition will strengthen the methodological grounding and address generalization concerns. revision: yes

  3. Referee: Experiments: There is no comparison to alternative truncation strategies (e.g., fixed-length cutoffs or random pruning) or analysis of failure cases where high overlap coincides with poor learning signals. Such controls are necessary to establish that the method specifically reallocates compute to reliable supervision rather than merely shortening all rollouts.

    Authors: We agree that explicit controls against alternative truncation methods would better isolate the contribution of drift detection. The current experiments compare against vanilla OPD, but we will add fixed-length truncation and random pruning baselines (matched for average length or pruning rate; a minimal sketch of such a length-matched control follows this list) to the revised experimental section, along with their efficiency and performance metrics. We will also include a brief analysis of any observed cases where high overlap coincided with poor signals, noting that such instances were infrequent in our long-horizon reasoning datasets. revision: yes
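A minimal sketch of the kind of length-matched random-pruning control this response describes; the helper and the uniform-cut matching rule are assumptions for illustration, not the authors' protocol. The point is that any accuracy gap against such a control isolates drift-aware pruning from mere rollout shortening.

```python
import random


def random_prune_cutoff(rollout_len: int, target_mean_len: float) -> int:
    """Random truncation point whose expectation roughly matches the
    average effective length of Prune-OPD rollouts (an assumed matching rule)."""
    # A uniform draw on [1, 2 * target_mean_len] has mean ~ target_mean_len;
    # clamping to the actual rollout length biases short rollouts slightly.
    cut = random.randint(1, max(1, round(2 * target_mean_len)))
    return min(cut, rollout_len)
```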

Circularity Check

0 steps flagged

No significant circularity; method is a heuristic extension with empirical validation

Full rationale

The paper introduces Prune-OPD as an algorithmic framework that monitors top-k overlap to detect prefix drift and applies monotonic down-weighting plus truncation. No mathematical derivation chain is presented that reduces a claimed result to its own inputs by construction. Performance improvements (37.6%–68.0% time reduction with preserved accuracy) are reported as empirical outcomes on AMC/AIME/HMMT benchmarks rather than predictions derived from fitted parameters or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the abstract or description to justify the core mechanism. The approach remains self-contained as a practical heuristic without tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not enumerate free parameters or axioms; the approach implicitly relies on the domain assumption that dense teacher rewards are locally exploitable only when student-teacher predictions remain compatible, and on the unstated choice of top-k overlap as the compatibility metric.

pith-pipeline@v0.9.0 · 5590 in / 1186 out tokens · 34503 ms · 2026-05-11T03:15:11.248353+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 4 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations, 2024

  2. [2]

    Scheduled sampling for sequence prediction with recurrent neural networks

    Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. Advances in Neural Information Processing Systems, 28, 2015

  3. [3]

    Distillation scaling laws

    Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, and Russell Webb. Distillation scaling laws. In International Conference on Machine Learning, pages 5977–6045. PMLR, 2025

  4. [4]

    On the efficacy of knowledge distillation

    Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4794–4802, 2019

  5. [5]

    Scaling instruction-finetuned language models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024

  6. [6]

    Hdpo: Hybrid distillation policy optimization via privileged self-distillation, 2026

    Ken Ding. Hdpo: Hybrid distillation policy optimization via privileged self-distillation, 2026

  7. [7]

    Revisiting on-policy distillation: Empirical failure modes and simple fixes, 2026

    Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes, 2026

  8. [8]

    Minillm: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024

  9. [9]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

  10. [10]

    Justrl: Scaling a 1.5b llm with a simple rl recipe, 2025

    Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, et al. Justrl: Scaling a 1.5b llm with a simple rl recipe, 2025

  11. [11]

    How far can unsupervised rlvr scale llm training?, 2026

    Bingxiang He, Yuxin Zuo, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Cheng Qian, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, et al. How far can unsupervised rlvr scale llm training?, 2026

  12. [12]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  13. [13]

    Reinforcement learning via self-distillation, 2026

    Jonas Hubotter, Frederike Lubeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation, 2026

  14. [14]

    Stable on-policy distillation through adaptive target reformulation, 2026

    Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, and Taesup Kim. Stable on-policy distillation through adaptive target reformulation, 2026

  15. [15]

    Tinybert: Distilling bert for natural language understanding

    Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174, 2020

  16. [16]

    Entropy-aware on-policy distillation of language models, 2026

    Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models, 2026

  17. [17]

    Why does self-distillation (sometimes) degrade the reasoning capability of llms?, 2026

    Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of llms?, 2026

  18. [18]

    Sequence-level knowledge distillation

    Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, 2016

  19. [19]

    Scaling reasoning efficiently via relaxed on-policy distillation, 2026

    Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, and Pashmina Cameron. Scaling reasoning efficiently via relaxed on-policy distillation, 2026

  20. [20]

    Unifying group-relative and self-distillation policy optimization via sample routing, 2026

    Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, and Tat-Seng Chua. Unifying group-relative and self-distillation policy optimization via sample routing, 2026

  21. [21]

    Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, et al. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016, 2026

  22. [22]

    Small models struggle to learn from strong reasoners

    Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, and Radha Poovendran. Small models struggle to learn from strong reasoners. In Findings of the Association for Computational Linguistics: ACL 2025, pages 25366–25394, 2025

  23. [23]

    On-policy distillation

    Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. https://thinkingmachines.ai/blog/on-policy-distillation

  24. [24]

    Improved knowledge distillation via teacher assistant

    Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5191–5198, 2020

  25. [25]

    Crisp: Compressed reasoning via iterative self-policy distillation, 2026

    Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. Crisp: Compressed reasoning via iterative self-policy distillation, 2026

  26. [26]

    Distilbert, a distilled version of bert: Smaller, faster, cheaper and lighter, 2019

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: Smaller, faster, cheaper and lighter, 2019

  27. [27]

    Multitask prompted training enables zero-shot task generalization

    Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, et al. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2021

  28. [28]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  29. [29]

    Self-distillation enables continual learning, 2026

    Idan Shenfeld, Mehul Damani, Jonas Hubotter, and Pulkit Agrawal. Self-distillation enables continual learning, 2026

  30. [30]

    EHRAG: Bridging Semantic Gaps in Lightweight GraphRAG via Hybrid Hypergraph Construction and Retrieval

    Yifan Song, Xingjian Tao, Zhicheng Yang, Yihong Luo, and Jing Tang. Ehrag: Bridging semantic gaps in lightweight graphrag via hybrid hypergraph construction and retrieval. arXiv preprint arXiv:2604.17458, 2026

  31. [31]

    Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers

    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788, 2020

  32. [32]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021

  33. [33]

    Mimo-v2-flash technical report, 2026

    Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. Mimo-v2-flash technical report, 2026

  34. [34]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report, 2025

  35. [35]

    Self-distilled rlvr, 2026

    Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled rlvr, 2026

  36. [36]

    Learning beyond teacher: Generalized on-policy distillation with reward extrapolation, 2026

    Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation, 2026

  37. [37]

    Accordion-thinking: Self-regulated step summaries for efficient and readable llm reasoning

    Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Wenlei Shi, Yiwei Wang, Xiaodan Liang, and Jing Tang. Accordion-thinking: Self-regulated step summaries for efficient and readable llm reasoning. In Forty-Third International Conference on Machine Learning, 2026

  38. [38]

    Depth-breadth synergy in rlvr: Unlocking llm reasoning gains with adaptive exploration

    Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Dongchun Xie, Hanhui Li, Yiwei Wang, Xiaodan Liang, and Jing Tang. Depth-breadth synergy in rlvr: Unlocking llm reasoning gains with adaptive exploration. In Forty-Third International Conference on Machine Learning, 2026

  39. [39]

    Optibench meets resocratic: Measure and improve LLMs for optimization modeling

    Zhicheng Yang, Yiwei Wang, Yinya Huang, Zhijiang Guo, Wei Shi, Xiongwei Han, Liang Feng, Linqi Song, Xiaodan Liang, and Jing Tang. Optibench meets resocratic: Measure and improve LLMs for optimization modeling. In The Thirteenth International Conference on Learning Representations, 2025

  40. [40]

    Online experiential learning for language models, 2026

    Tianzhu Ye, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, and Furu Wei. Online experiential learning for language models, 2026

  41. [41]

    On-policy context distillation for language models, 2026

    Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models, 2026

  42. [42]

    Dapo: An open-source llm reinforcement learning system at scale, 2025

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale, 2025

  43. [43]

    Glm-5: From vibe coding to agentic engineering, 2026

    Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chengxing Xie, Cunxiang Wang, et al. Glm-5: From vibe coding to agentic engineering, 2026

  44. [44]

    Self-distillation for multi-token prediction, 2026

    Guoliang Zhao, Ruobing Xie, An Wang, Shuaipeng Li, Huaibing Xie, and Xingwu Sun. Self-distillation for multi-token prediction, 2026

  45. [45]

    Self-distilled reasoner: On-policy self-distillation for large language models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models, 2026