Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
Pith reviewed 2026-05-10 17:23 UTC · model grok-4.3
The pith
On-policy distillation for LLMs triggers length inflation in student rollouts that causes truncation collapse and training instability, which StableOPD corrects with a reference divergence constraint and rollout mixture.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On-policy distillation trains student models under their own induced distribution while using teacher supervision, yet this process produces abrupt length inflation in rollouts. The inflation causes truncated trajectories to dominate, triggers repetition saturation, and creates biased gradient signals that destabilize training and degrade validation performance. The root cause is the interaction between student-induced data collection and the distillation objective, which implicitly favors long, repetitive outputs. StableOPD mitigates the issue through a reference-based divergence constraint paired with rollout mixture distillation; together these prevent repetition-induced length inflation and stabilize training.
What carries the argument
Reference-based divergence constraint combined with rollout mixture distillation, which limits deviation from the teacher while incorporating mixed trajectories to break the cycle of length inflation.
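The review does not reproduce the paper's formulation, so the following is a minimal sketch of what such an objective could look like, assuming a reverse-KL distillation term toward the teacher, a KL penalty against a frozen reference policy (e.g., the SFT initialization), and a fixed rollout mixing ratio. The names `stable_opd_loss`, `mixed_rollout_batch`, `beta`, and `mix_ratio` are hypothetical, not the paper's.

```python
import torch
import torch.nn.functional as F

def stable_opd_loss(student_logits, teacher_logits, reference_logits, beta=0.1):
    """Hypothetical StableOPD-style objective (not the paper's exact form):
    reverse-KL distillation toward the teacher plus a KL penalty that keeps
    the student near a frozen reference policy. Teacher and reference logits
    are assumed to be computed under torch.no_grad().
    All logits: [batch, seq_len, vocab]."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    reference_logp = F.log_softmax(reference_logits, dim=-1)
    student_p = student_logp.exp()

    # Per-token reverse KL(student || teacher): the usual OPD signal.
    distill = (student_p * (student_logp - teacher_logp)).sum(-1).mean()
    # Per-token KL(student || reference): the divergence constraint meant to
    # stop the drift toward long, repetitive rollouts.
    constraint = (student_p * (student_logp - reference_logp)).sum(-1).mean()
    return distill + beta * constraint

def mixed_rollout_batch(student_rollouts, teacher_rollouts, mix_ratio=0.5):
    """Rollout mixture: train on a blend of student- and teacher-generated
    trajectories instead of purely on-policy data."""
    n_teacher = int(len(student_rollouts) * mix_ratio)
    return teacher_rollouts[:n_teacher] + student_rollouts[n_teacher:]
```

On this reading, the design intuition is that the constraint bounds how far the student's induced distribution can wander from a known-stable policy, while the mixture keeps part of each batch from ever entering the self-reinforcing length-inflation loop.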
Load-bearing premise
The reference-based divergence constraint and rollout mixture directly counteract the interaction between student-induced data collection and the distillation objective without introducing new instabilities or performance trade-offs.
What would settle it
Re-running the original OPD experiments on the same math reasoning datasets while ablating the divergence constraint or the rollout mixture and checking whether length inflation, truncation dominance, and performance drops reappear.
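Since, per the referee report below, the paper does not pin down detection metrics, here is one plausible way such a re-run could instrument "length inflation" and "repetition saturation". The function names, the 4-gram window, and the collapse threshold are assumptions for illustration, not the paper's definitions.

```python
from collections import Counter

def truncation_rate(rollouts, max_len):
    """Fraction of rollouts that hit the generation limit; truncation
    collapse would show up as this rate abruptly approaching 1."""
    return sum(len(r) >= max_len for r in rollouts) / len(rollouts)

def repetition_rate(token_ids, n=4):
    """Share of n-grams that recur within one rollout; a common proxy
    for repetition saturation."""
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

# An ablation run would log both metrics per training step and flag, e.g.,
# truncation_rate > 0.5 as the onset of collapse (threshold assumed).
```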
Original abstract
On-policy distillation (OPD) trains student models under their own induced distribution while leveraging supervision from stronger teachers. We identify a failure mode of OPD: as training progresses, on-policy rollouts can undergo abrupt length inflation, causing truncated trajectories to dominate the training data. This truncation collapse coincides with abrupt repetition saturation and induces biased gradient signals, leading to severe training instability and sharp degradation in validation performance. We attribute this problem to the interaction between student-induced data collection and the distillation objective, which implicitly favors long and repetitive rollouts. To address this issue, we propose StableOPD, a stabilized OPD framework that combines a reference-based divergence constraint with rollout mixture distillation. These together mitigate repetition-induced length inflation and further stabilize OPD training. Across multiple math reasoning datasets, our approach prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript identifies a failure mode in on-policy distillation (OPD) called truncation collapse, where on-policy rollouts exhibit abrupt length inflation and repetition saturation that bias gradients and degrade performance. It attributes this to the interaction between student-induced data collection and the distillation objective, and proposes StableOPD combining a reference-based divergence constraint with rollout mixture distillation to prevent collapse, stabilize dynamics, and achieve a 7.2% average gain on math reasoning datasets.
Significance. If the proposed mechanisms are shown to causally address the failure mode, the work would offer a practical contribution to stabilizing on-policy distillation for LLM reasoning tasks. The identification of length inflation as a distinct collapse mode is potentially useful, but the absence of detailed validation leaves the significance of the 7.2% gain and the proposed fixes difficult to assess.
major comments (2)
- [Abstract] The central performance claim of a 7.2% average improvement is stated without any experimental details, baselines, datasets, number of runs, error bars, or statistical tests. This prevents evaluation of whether the result supports the claim that StableOPD prevents truncation collapse and improves performance.
- [Experimental section] No ablation studies, controlled off-policy variants, or measurements of per-component effects on rollout length distributions and gradient bias are reported. The comparison appears limited to full StableOPD versus baseline OPD, leaving the causal attribution of gains to the reference-based divergence constraint and rollout mixture unisolated from confounding factors such as altered batch composition or implicit regularization.
minor comments (2)
- [Introduction] The terms 'truncation collapse' and 'repetition saturation' are introduced without explicit quantitative definitions or detection metrics, which would improve clarity and reproducibility.
- [Method] Implementation details for the reference-based divergence constraint (e.g., exact formulation or pseudocode) are not provided in the description of StableOPD, hindering direct replication.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important areas for improving the clarity and rigor of our experimental claims. We address each major comment below and have revised the manuscript to incorporate the suggested changes.
Point-by-point responses
-
Referee: [Abstract] The central performance claim of a 7.2% average improvement is stated without any experimental details, baselines, datasets, number of runs, error bars, or statistical tests. This prevents evaluation of whether the result supports the claim that StableOPD prevents truncation collapse and improves performance.
Authors: We agree that the abstract would benefit from additional context to support evaluation of the performance claim. In the revised manuscript, we have updated the abstract to reference the math reasoning datasets used and the standard OPD baseline, while noting that the 7.2% average improvement is reported with full experimental details (including runs, error bars, and statistical tests) provided in Section 4. Due to abstract length constraints, we direct readers to the main text for comprehensive statistics. revision: yes
-
Referee: [Experimental section] No ablation studies, controlled off-policy variants, or measurements of per-component effects on rollout length distributions and gradient bias are reported. The comparison appears limited to full StableOPD versus baseline OPD, leaving the causal attribution of gains to the reference-based divergence constraint and rollout mixture unisolated from confounding factors such as altered batch composition or implicit regularization.
Authors: We acknowledge that the original manuscript would be strengthened by additional analyses to isolate component effects and rule out confounds. In the revised manuscript, we expand the experimental section to include ablation studies for the reference divergence constraint and rollout mixture separately, controlled comparisons to off-policy distillation variants with matched batch compositions, and measurements of rollout length distributions, repetition rates, and gradient bias metrics across training stages. These additions support causal attribution of the stabilization and performance gains to the proposed mechanisms. revision: yes
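The rebuttal promises "gradient bias metrics" without specifying them. One way such a measurement might be instrumented is to compare the gradient computed on the full batch (truncated rollouts included) against the gradient on complete rollouts only; this sketch and its names (`gradient_bias`, `loss_fn`) are illustrative assumptions, not the authors' protocol.

```python
import torch

def gradient_bias(model, loss_fn, full_batch, complete_batch):
    """Proxy for truncation-induced gradient bias: cosine distance between
    the gradient over all rollouts and the gradient over complete
    (untruncated) rollouts only. loss_fn(model, batch) is assumed to
    return a scalar loss."""
    def flat_grad(batch):
        model.zero_grad()
        loss_fn(model, batch).backward()
        return torch.cat([p.grad.flatten() for p in model.parameters()
                          if p.grad is not None])

    g_full = flat_grad(full_batch)
    g_complete = flat_grad(complete_batch)
    cos = torch.nn.functional.cosine_similarity(g_full, g_complete, dim=0)
    return 1.0 - cos.item()  # 0 = same direction; larger = more bias
```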
Circularity Check
No circularity in derivation chain
Full rationale
The paper identifies a failure mode in on-policy distillation through empirical observation of length inflation and repetition saturation, attributes it to the interaction between student-induced data collection and the distillation objective, and proposes StableOPD combining a reference-based divergence constraint with rollout mixture distillation. No equations, derivations, first-principles predictions, or mathematical chains are present that could reduce to self-referential fitting, self-citation load-bearing, or renaming of known results. All central claims rest on experimental results across math reasoning datasets rather than any self-contained logical reduction to the paper's own inputs.
Forward citations
Cited by 2 Pith papers
-
The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes
On-policy distillation for LLMs is sensitive to teacher choice and loss design, while self-distillation fails on instance-specific information but succeeds on shared rules, with stop-gradient TopK, adapted teachers, a...
-
UniSD: Towards a Unified Self-Distillation Framework for Large Language Models
UniSD unifies complementary self-distillation mechanisms for autoregressive LLMs and achieves up to +5.4 point gains over base models and +2.8 over baselines across six benchmarks and six models.
Reference graph
Works this paper leans on
-
[1]
Process Reinforcement through Implicit Rewards
Cui, G., Yuan, L., Wang, Z., Wang, H., Zhang, Y., Chen, J., Li, W., He, B., Fan, Y., Yu, T., et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025.
-
[2]
MiniLLM: Knowledge Distillation of Large Language Models
Gu, Y., Dong, L., Wei, F., and Huang, M. MiniLLM: Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543, 2023.
-
[3]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
-
[4]
OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems
He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., et al. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024.
-
[5]
Measuring Mathematical Problem Solving With the MATH Dataset
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
-
[6]
Distilling the Knowledge in a Neural Network
Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
-
[7]
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
Hu, J., Zhang, Y., Han, Q., Jiang, D., Zhang, X., and Shum, H.-Y. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025.
-
[8]
Reinforcement Learning via Self-Distillation
Hübotter, J., Lübeck, F., Behric, L., Baumann, A., Bagatella, M., Marta, D., Hakimi, I., Shenfeld, I., Buening, T. K., Guestrin, C., et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026.
-
[9]
Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
Kim, J., Luo, X., Kim, M., Lee, S., Kim, D., Jeon, J., Li, D., and Yang, Y. Why does self-distillation (sometimes) degrade the reasoning capability of LLMs? arXiv preprint arXiv:2603.24472, 2026.
-
[10]
Sequence-Level Knowledge Distillation
Kim, Y. and Rush, A. M. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1317–1327, 2016.
-
[11]
Dual Policy Distillation
Lai, K.-H., Zha, D., Li, Y., and Hu, X. Dual policy distillation. arXiv preprint arXiv:2006.04061, 2020.
-
[12]
Solving Quantitative Reasoning Problems with Language Models
Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., et al. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858, 2022.
-
[13]
Let's Verify Step by Step
Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023.
-
[14]
Understanding R1-Zero-Like Training: A Critical Perspective
Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W. S., and Lin, M. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025.
-
[15]
On-policy distillation (Lu & Lab, 2025). Thinking Machines Lab blog. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation. Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635. JMLR Workshop and Conference Proceedings, 2011.
-
[16]
Policy Distillation
Rusu, A. A., Colmenarejo, S. G., Gulcehre, C., Desjardins, G., Kirkpatrick, J., Pascanu, R., Mnih, V., Kavukcuoglu, K., and Hadsell, R. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.
-
[17]
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Sanh, V., Debut, L., Chaumond, J., and Wolf, T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
-
[18]
Proximal Policy Optimization Algorithms
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
-
[19]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
-
[20]
MiMo-V2-Flash Technical Report
Xiao, B., Xia, B., Yang, B., Gao, B., Shen, B., Zhang, C., He, C., Lou, C., Luo, F., Wang, G., et al. MiMo-V2-Flash technical report. arXiv preprint arXiv:2601.02780, 2026.
-
[21]
Learning to Reason under Off-Policy Guidance
Yan, J., Li, Y., Hu, Z., Wang, Z., Cui, G., Qu, X., Cheng, Y., and Zhang, Y. Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945, 2025.
-
[22]
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
Yang, A., Zhang, B., Hui, B., Gao, B., Yu, B., Li, C., Liu, D., Tu, J., Zhou, J., Lin, J., Lu, K., Xue, M., Lin, R., Liu, T., Ren, X., and Zhang, Z. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024.
-
[23]
Qwen3 Technical Report
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
-
[24]
Black-Box On-Policy Distillation of Large Language Models
Ye, T., Dong, L., Chi, Z., Wu, X., Huang, S., and Wei, F. Black-box on-policy distillation of large language models. arXiv preprint arXiv:2511.10643, 2025.
-
[25]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Fan, T., Liu, G., Liu, L., Liu, X., et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
-
[26]
SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
Zeng, W., Huang, Y., Liu, Q., Liu, W., He, K., Ma, Z., and He, J. SimpleRL-Zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892, 2025.
-
[27]
A survey of reinforcement learning for large reasoning models
Zhang, K., Zuo, Y., He, B., Sun, Y., Liu, R., Jiang, C., Fan, Y., Tian, K., Jia, G., Li, P., et al. A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827, 2025.
-
[28]
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
Zhao, S., Xie, Z., Liu, M., Huang, J., Pang, G., Chen, F., and Grover, A. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026.
-
[29]
Appendix B: Use of LLMs
We used a large language model only for spelling and grammar correction of the manuscript text. The LLM was not involved in research ideation, experimental design, data generation, analysis, or substantive writing beyond copy-editing. Al...
-
[30]
which trains from Qwen2.5-Math-7B and rule-based reward, proposing to remove the standard deviation in GRPO advantage computation and token-level normalization in policy loss computation; PRIME-Zero (Cui et al., 2025), which uses policy rollouts and outcome labels through implicit process rewards; and Open-Reasoner-Zero (Hu et al., 2025),
-
[31]
which is an open-source implementation of RLVR methods. We also compare with standard OPD (Lu & Lab, 2025); we adopt the same 33k/13k split as above: the model is first supervised fine-tuned on 33k examples and then trained with OPD on the remaining 13k examples. D. Additional Experiment Results. D.1. Additional Experiment Results on More Base Models. We al...