pith. machine review for the scientific record.

arxiv: 2604.08527 · v1 · submitted 2026-04-09 · 💻 cs.CL · cs.LG

Recognition: unknown

Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

Pith reviewed 2026-05-10 17:23 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords on-policy distillation, length inflation, truncation collapse, StableOPD, math reasoning, LLM training, distillation objective, rollout mixture

The pith

On-policy distillation for LLMs triggers length inflation in student rollouts that causes truncation collapse and training instability, which StableOPD corrects with a reference divergence constraint and rollout mixture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that on-policy distillation suffers from a specific failure where student-generated rollouts grow abruptly long, leading to heavy truncation, repetition, biased gradients, and sudden drops in performance. This occurs because the distillation objective interacts with the student's own data distribution in a way that rewards longer and more repetitive outputs. StableOPD counters the problem by adding a reference-based divergence constraint to keep outputs close to the teacher and mixing in rollout data to break the inflation cycle. A sympathetic reader would care because reliable on-policy distillation could make it easier to transfer capabilities from strong models to weaker ones without constant retraining collapses. Experiments on math reasoning datasets show the method restores stability and delivers a 7.2 percent average performance gain.
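The figure captions further down this page describe a reverse-KL advantage that jumps when rollouts turn repetitive. As a minimal sketch of that mechanism, assuming the per-token advantage takes the common form log p_teacher minus log p_student (this page does not give the paper's exact formula), the snippet below shows how tokens on which the teacher is more confident than the student end up with a larger advantage. The function name and every number are illustrative assumptions, not values from the paper.

```python
from statistics import mean


def reverse_kl_advantages(student_logps, teacher_logps):
    """Per-token advantage under the ASSUMED form A_t = log p_teacher - log p_student."""
    if len(student_logps) != len(teacher_logps):
        raise ValueError("log-prob lists must have the same length")
    return [t - s for s, t in zip(student_logps, teacher_logps)]


# Illustrative log-probabilities only (hypothetical, not measured).
# Regular tokens: the teacher is less confident than the student.
regular_student = [-1.2, -0.9, -1.5]
regular_teacher = [-1.6, -1.1, -1.9]
# Repetitive-tail tokens: both log-probs are close to zero, teacher closer.
repeat_student = [-0.20, -0.15, -0.10]
repeat_teacher = [-0.05, -0.03, -0.02]

regular_adv = mean(reverse_kl_advantages(regular_student, regular_teacher))
repeat_adv = mean(reverse_kl_advantages(repeat_student, repeat_teacher))
print(f"regular: {regular_adv:+.3f}, repetitive: {repeat_adv:+.3f}")
```

Under this assumed form, a policy-gradient style update would push probability mass toward the repetitive tokens, which is consistent with the inflation cycle described above.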

Core claim

On-policy distillation trains student models under their own induced distribution while using teacher supervision, yet this process produces abrupt length inflation in rollouts. The inflation causes truncated trajectories to dominate, triggers repetition saturation, and creates biased gradient signals that destabilize training and degrade validation performance. The root cause is the interaction between student-induced data collection and the distillation objective, which implicitly favors long repetitive outputs. StableOPD mitigates the issue through a reference-based divergence constraint paired with rollout mixture distillation; together these prevent repetition-induced length inflation,

What carries the argument

Reference-based divergence constraint combined with rollout mixture distillation, which limits deviation from the teacher while incorporating mixed trajectories to break the cycle of length inflation.
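The two ingredients named above can be sketched on toy categorical distributions. This is a hedged illustration, not the paper's formulation: it assumes the constraint is an additive KL penalty between the student and some reference distribution, weighted by a hypothetical coefficient `beta`, and that the rollout mixture draws a fixed fraction `mix_ratio` of each batch from a second, non-student rollout source. All function and parameter names are invented for this sketch.

```python
import math
import random


def kl(p, q):
    """KL(p || q) for two categorical distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def stabilized_token_loss(student, teacher, reference, beta):
    """Reverse KL to the teacher plus an ASSUMED additive penalty to a reference."""
    return kl(student, teacher) + beta * kl(student, reference)


def mix_rollouts(on_policy, off_policy, mix_ratio, batch_size, rng):
    """Build a batch with a fixed share of non-student rollouts (hypothetical scheme)."""
    n_off = round(mix_ratio * batch_size)
    n_on = batch_size - n_off
    batch = rng.sample(on_policy, n_on) + rng.sample(off_policy, n_off)
    rng.shuffle(batch)
    return batch


student = [0.7, 0.2, 0.1]
teacher = [0.5, 0.3, 0.2]
reference = [0.6, 0.25, 0.15]
loss = stabilized_token_loss(student, teacher, reference, beta=0.1)

on_policy = [f"on-{i}" for i in range(16)]
off_policy = [f"off-{i}" for i in range(16)]
batch = mix_rollouts(on_policy, off_policy, mix_ratio=0.25, batch_size=8, rng=random.Random(0))
print(f"loss={loss:.4f}", batch)
```

Setting `beta` to zero and `mix_ratio` to zero recovers plain reverse-KL distillation on student rollouts in this sketch, which makes the two additions easy to switch off independently.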

Load-bearing premise

The reference-based divergence constraint and rollout mixture directly counteract the interaction between student-induced data collection and the distillation objective without introducing new instabilities or performance trade-offs.

What would settle it

Re-running the original OPD experiments on the same math reasoning datasets while ablating the divergence constraint or the rollout mixture and checking whether length inflation, truncation dominance, and performance drops reappear.
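The ablation described above amounts to a two-by-two grid over the two components. The sketch below enumerates the four arms and the quantities worth logging for each; the arm names, dictionary keys, and metric names are hypothetical labels chosen for illustration and do not come from the paper.

```python
from itertools import product

# Metrics named on this page as symptoms of the failure mode (labels are ours).
METRICS_TO_LOG = ["mean_rollout_length", "truncation_rate", "repetition_rate", "validation_accuracy"]


def ablation_arms():
    """Enumerate every on/off combination of the two StableOPD components."""
    names = {
        (False, False): "baseline OPD",
        (False, True): "mixture only",
        (True, False): "constraint only",
        (True, True): "StableOPD (full)",
    }
    arms = []
    for use_constraint, use_mixture in product([False, True], repeat=2):
        arms.append({
            "name": names[(use_constraint, use_mixture)],
            "divergence_constraint": use_constraint,
            "rollout_mixture": use_mixture,
        })
    return arms


for arm in ablation_arms():
    print(arm["name"], "->", ", ".join(METRICS_TO_LOG))
```

If length inflation reappears in the single-component arms but not in the full arm, that would support the claim that both components are needed; if one arm alone is stable, the attribution in the paper would need narrowing.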

Figures

Figures reproduced from arXiv: 2604.08527 by Feng Luo, Guanchu Wang, Tianyi Zhang, Vladimir Braverman, Xiaotian Han, Yu-Neng Chuang, Zicheng Xu.

Figure 1: Abrupt length inflation within OPD.
Figure 2: Training dynamics of OPD on three groups. Training starts in a stable regime with low truncation and repetition, followed by a sharp phase transition where truncation and repetition increase and remain high while validation accuracy collapses, illustrating a robust truncation-repetition inflation failure mode of OPD.
Figure 3: Rollout-level evidence of abrupt repetition inflation for three student-teacher groups. Around the step where rollout length abruptly inflates, both student and teacher log-probabilities become much less negative, with the teacher's increase being larger, which induces a sudden jump in the reverse-KL advantage.
Figure 4: Reverse-KL advantage for regular and repetitive tokens during OPD training. Repetitive tokens receive larger advantages than regular tokens throughout training.
Figure 5: Training dynamics of OPD vs. Stable-OPD. Student: Qwen2.5-Math-1.5B; Teacher: OpenThinker3-7B.
Figure 6: Training dynamics of OPD vs. Stable-OPD. Student: Qwen2.5-Math-1.5B; Teacher: DeepSeek-R1-Distill-7B.
Figure 7: Dynamics of OPD across training. Each panel shows truncation/repetition for both rollout and evaluation versus accuracy. Sudden accuracy changes often align with abrupt shifts in teacher log-probabilities and advantage estimates.
Original abstract

On-policy distillation (OPD) trains student models under their own induced distribution while leveraging supervision from stronger teachers. We identify a failure mode of OPD: as training progresses, on-policy rollouts can undergo abrupt length inflation, causing truncated trajectories to dominate the training data. This truncation collapse coincides with abrupt repetition saturation and induces biased gradient signals, leading to severe training instability and sharp degradation in validation performance. We attribute this problem to the interaction between student-induced data collection and the distillation objective, which implicitly favors long and repetitive rollouts. To address this issue, we propose StableOPD, a stabilized OPD framework that combines a reference-based divergence constraint with rollout mixture distillation. These together mitigate repetition-induced length inflation and further stabilize OPD training. Across multiple math reasoning datasets, our approach prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript identifies a failure mode in on-policy distillation (OPD) called truncation collapse, where on-policy rollouts exhibit abrupt length inflation and repetition saturation that bias gradients and degrade performance. It attributes this to the interaction between student-induced data collection and the distillation objective, and proposes StableOPD combining a reference-based divergence constraint with rollout mixture distillation to prevent collapse, stabilize dynamics, and achieve a 7.2% average gain on math reasoning datasets.

Significance. If the proposed mechanisms are shown to causally address the failure mode, the work would offer a practical contribution to stabilizing on-policy distillation for LLM reasoning tasks. The identification of length inflation as a distinct collapse mode is potentially useful, but the absence of detailed validation leaves the significance of the 7.2% gain and the proposed fixes difficult to assess.

major comments (2)
  1. [Abstract] The central performance claim of a 7.2% average improvement is stated without any experimental details, baselines, datasets, number of runs, error bars, or statistical tests. This prevents evaluation of whether the result supports the claim that StableOPD prevents truncation collapse and improves performance.
  2. [Experimental section] No ablation studies, controlled off-policy variants, or measurements of per-component effects on rollout length distributions and gradient bias are reported. The comparison appears limited to full StableOPD versus baseline OPD, so the attribution of gains to the reference-based divergence constraint and rollout mixture is not isolated from confounding factors such as altered batch composition or implicit regularization.
minor comments (2)
  1. [Introduction] The terms 'truncation collapse' and 'repetition saturation' are introduced without explicit quantitative definitions or detection metrics, which would improve clarity and reproducibility.
  2. [Method] Implementation details for the reference-based divergence constraint (e.g., exact formulation or pseudocode) are not provided in the description of StableOPD, hindering direct replication.
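To make the first minor comment concrete, the sketch below shows one way the two terms could be given quantitative detection metrics. These definitions are assumptions made for illustration, not the paper's: truncation rate is taken as the fraction of rollouts that reach the generation cap, and repetition rate as the fraction of n-grams in a rollout that duplicate an earlier n-gram. The function names, the choice of n = 3, and the toy data are all hypothetical.

```python
def truncation_rate(lengths, max_len):
    """Fraction of rollouts whose length reaches the generation cap."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n >= max_len) / len(lengths)


def repetition_rate(tokens, n=3):
    """Fraction of n-grams that duplicate an n-gram seen earlier in the rollout."""
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0  # rollout is shorter than one n-gram
    ngrams = [tuple(tokens[i:i + n]) for i in range(total)]
    return 1.0 - len(set(ngrams)) / total


# Toy rollouts as token-id lists; the cap of 9 tokens is arbitrary.
rollouts = [
    [1, 2, 3, 4, 5],                # short, no repetition
    [1, 2, 3, 1, 2, 3, 1, 2, 3],    # hits the cap with a repeating tail
]
cap = 9
print("truncation rate:", truncation_rate([len(r) for r in rollouts], cap))
for r in rollouts:
    print("repetition rate:", round(repetition_rate(r), 3))
```

Tracking both numbers per training step would let a reader locate the phase transition described in the figures, for example as the first step where both exceed chosen thresholds.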

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for improving the clarity and rigor of our experimental claims. We address each major comment below and have revised the manuscript to incorporate the suggested changes.

Point-by-point responses
  1. Referee: [Abstract] The central performance claim of a 7.2% average improvement is stated without any experimental details, baselines, datasets, number of runs, error bars, or statistical tests. This prevents evaluation of whether the result supports the claim that StableOPD prevents truncation collapse and improves performance.

    Authors: We agree that the abstract would benefit from additional context to support evaluation of the performance claim. In the revised manuscript, we have updated the abstract to reference the math reasoning datasets used and the standard OPD baseline, while noting that the 7.2% average improvement is reported with full experimental details (including runs, error bars, and statistical tests) provided in Section 4. Due to abstract length constraints, we direct readers to the main text for comprehensive statistics. revision: yes

  2. Referee: [Experimental section] No ablation studies, controlled off-policy variants, or measurements of per-component effects on rollout length distributions and gradient bias are reported. The comparison appears limited to full StableOPD versus baseline OPD, so the attribution of gains to the reference-based divergence constraint and rollout mixture is not isolated from confounding factors such as altered batch composition or implicit regularization.

    Authors: We acknowledge that the original manuscript would be strengthened by additional analyses to isolate component effects and rule out confounds. In the revised manuscript, we expand the experimental section to include ablation studies for the reference divergence constraint and rollout mixture separately, controlled comparisons to off-policy distillation variants with matched batch compositions, and measurements of rollout length distributions, repetition rates, and gradient bias metrics across training stages. These additions support causal attribution of the stabilization and performance gains to the proposed mechanisms. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

Full rationale

The paper identifies a failure mode in on-policy distillation through empirical observation of length inflation and repetition saturation, attributes it to the interaction between student-induced data collection and the distillation objective, and proposes StableOPD combining a reference-based divergence constraint with rollout mixture distillation. No equations, derivations, first-principles predictions, or mathematical chains are present that could reduce to self-referential fitting, self-citation load-bearing, or renaming of known results. All central claims rest on experimental results across math reasoning datasets rather than any self-contained logical reduction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields minimal ledger entries; any hyperparameters for the divergence constraint or mixture ratio would be free parameters but are not specified.

pith-pipeline@v0.9.0 · 5471 in / 1010 out tokens · 52844 ms · 2026-05-10T17:23:00.900929+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

    cs.AI 2026-05 conditional novelty 7.0

    On-policy distillation for LLMs is sensitive to teacher choice and loss design, while self-distillation fails on instance-specific information but succeeds on shared rules, with stop-gradient TopK, adapted teachers, a...

  2. UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

    cs.CL 2026-05 unverdicted novelty 5.0

    UniSD unifies complementary self-distillation mechanisms for autoregressive LLMs and achieves up to +5.4 point gains over base models and +2.8 over baselines across six benchmarks and six models.

Reference graph

Works this paper leans on

31 extracted references · 27 canonical work pages · cited by 2 Pith papers · 21 internal anchors

  1. [1]

    Process Reinforcement through Implicit Rewards

    Cui, G., Yuan, L., Wang, Z., Wang, H., Zhang, Y., Chen, J., Li, W., He, B., Fan, Y., Yu, T., et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456.

  2. [2]

    MiniLLM: On-Policy Distillation of Large Language Models

    Gu, Y., Dong, L., Wei, F., and Huang, M. MiniLLM: Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543.

  3. [3]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

  4. [4]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., et al. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008.

  5. [5]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

  6. [6]

    Distilling the Knowledge in a Neural Network

    Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

  7. [7]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

    Hu, J., Zhang, Y., Han, Q., Jiang, D., Zhang, X., and Shum, H.-Y. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290.

  8. [8]

    Reinforcement Learning via Self-Distillation

    Hübotter, J., Lübeck, F., Behric, L., Baumann, A., Bagatella, M., Marta, D., Hakimi, I., Shenfeld, I., Buening, T. K., Guestrin, C., et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802.

  9. [9]

    Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

    Kim, J., Luo, X., Kim, M., Lee, S., Kim, D., Jeon, J., Li, D., and Yang, Y. Why does self-distillation (sometimes) degrade the reasoning capability of LLMs? arXiv preprint arXiv:2603.24472.

  10. [10]

    Sequence-Level Knowledge Distillation

    Kim, Y. and Rush, A. M. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1317–1327.

  11. [11]

    Dual Policy Distillation

    Lai, K.-H., Zha, D., Li, Y., and Hu, X. Dual policy distillation. arXiv preprint arXiv:2006.04061.

  12. [12]

    Solving Quantitative Reasoning Problems with Language Models

    Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., et al. Solving quantitative reasoning problems with language models, 2022. URL https://arxiv.org/abs/2206.14858.

  13. [13]

    Let's Verify Step by Step

    Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step. arXiv preprint arXiv:2305.20050.

  14. [14]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W. S., and Lin, M. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783.

  15. [15]

    A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

    doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation. Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635. JMLR Workshop and Conferenc...

  16. [16]

    Policy Distillation

    Rusu, A. A., Colmenarejo, S. G., Gulcehre, C., Desjardins, G., Kirkpatrick, J., Pascanu, R., Mnih, V., Kavukcuoglu, K., and Hadsell, R. Policy distillation. arXiv preprint arXiv:1511.06295.

  17. [17]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Sanh, V., Debut, L., Chaumond, J., and Wolf, T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

  18. [18]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

  19. [19]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

  20. [20]

    MiMo-V2-Flash Technical Report

    Xiao, B., Xia, B., Yang, B., Gao, B., Shen, B., Zhang, C., He, C., Lou, C., Luo, F., Wang, G., et al. MiMo-V2-Flash technical report. arXiv preprint arXiv:2601.02780.

  21. [21]

    Learning to Reason under Off-Policy Guidance

    Yan, J., Li, Y., Hu, Z., Wang, Z., Cui, G., Qu, X., Cheng, Y., and Zhang, Y. Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945.

  22. [22]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    Yang, A., Zhang, B., Hui, B., Gao, B., Yu, B., Li, C., Liu, D., Tu, J., Zhou, J., Lin, J., Lu, K., Xue, M., Lin, R., Liu, T., Ren, X., and Zhang, Z. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122.

  23. [23]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

  24. [24]

    Black-Box On-Policy Distillation of Large Language Models

    Ye, T., Dong, L., Chi, Z., Wu, X., Huang, S., and Wei, F. Black-box on-policy distillation of large language models. arXiv preprint arXiv:2511.10643.

  25. [25]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Fan, T., Liu, G., Liu, L., Liu, X., et al. DAPO: An open-source LLM reinforcement learning system at scale, 2025. URL https://arxiv.org/abs/2503.14476.

  26. [26]

    SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

    Zeng, W., Huang, Y., Liu, Q., Liu, W., He, K., Ma, Z., and He, J. SimpleRL-Zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892.

  27. [27]

    A Survey of Reinforcement Learning for Large Reasoning Models

    Zhang, K., Zuo, Y., He, B., Sun, Y., Liu, R., Jiang, C., Fan, Y., Tian, K., Jia, G., Li, P., et al. A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827.

  28. [28]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Zhao, S., Xie, Z., Liu, M., Huang, J., Pang, G., Chen, F., and Grover, A. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734.

  29. [29]

    Appendix B

    A. Appendix. B. Use of LLMs. We used a large language model only for spelling and grammar correction of the manuscript text. The LLM was not involved in research ideation, experimental design, data generation, analysis, or substantive writing beyond copy-editing. Al...

  30. [30]

    which trains from Qwen2.5-Math-7B and rule-based reward, proposing to remove the standard deviation in GRPO advantage computation and token-level normalization in policy loss computation; PRIME-Zero (Cui et al., 2025), which uses policy rollouts and outcome labels through implicit process rewards; and OpenReasonerZero (Hu et al.,

  31. [31]

    which is an open-source implementation of RLVR methods. We also compare with standard OPD (Lu & Lab, 2025); we adopt the same 33k/13k split as above: the model is first supervised fine-tuned on 33k examples and then trained with OPD on the remaining 13k examples. D. Additional Experiment Results. D.1. Additional Experiment Results on More Base Models. We al...