pith. machine review for the scientific record.

arxiv: 2605.13643 · v1 · submitted 2026-05-13 · 💻 cs.CL

Recognition: no theorem link

Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:50 UTC · model grok-4.3

classification 💻 cs.CL
keywords on-policy distillation · strong-to-weak distillation · local teachability collapse · change point detection · teacher margin · trajectory truncation · Qwen3 models

The pith

In strong-to-weak on-policy distillation, truncating supervision at the onset of local teachability collapse outperforms full-trajectory training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that in strong-to-weak on-policy distillation, full-trajectory supervision can fail because later tokens often lack the local contrast needed to guide learning effectively. The authors identify this as local teachability collapse and propose focusing supervision only on the prefix where the teacher's feedback remains discriminative. They detect the cutoff by tracking the teacher's margin over the student's top-K options, averaging it per sentence, and applying a BIC change-point test to find the downward shift. Experiments on Qwen3 models demonstrate that this selective approach outperforms standard full-sequence OPD on five in-domain benchmarks while better maintaining out-of-domain performance. The insight is that teacher guidance must be not only present but locally useful for the student.
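As a reading aid, the tracked quantity can be written out explicitly. The notation below is ours, reconstructed from the review text and Figure 3, not taken from the paper.

```latex
% Our notation (an assumption): x is the prompt, y_{<t} the student's rollout
% prefix, \pi_S and \pi_T the student and teacher next-token distributions.
% S_t: the student's top-K candidate set at position t.
\mathcal{S}_t = \mathop{\mathrm{top}\text{-}K}_{v \in \mathcal{V}} \; \pi_S(v \mid x, y_{<t})
% m_t: the teacher's top-1 minus top-2 probability restricted to S_t
% (the "teacher margin over the student's top-K candidates"), where
% v_t^{(1)}, v_t^{(2)} are the teacher's two most probable tokens in S_t.
m_t = \pi_T\big(v_t^{(1)} \mid x, y_{<t}\big) - \pi_T\big(v_t^{(2)} \mid x, y_{<t}\big)
% \bar{m}_j: the margin averaged over the j-th NLTK sentence segment c_j.
\bar{m}_j = \frac{1}{|c_j|} \sum_{t \in c_j} m_t
```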

Core claim

The paper establishes that in strong-to-weak OPD, supervision should be truncated at the onset of local teachability collapse, detected via a BIC-style downward change point in NLTK-sentence-aggregated teacher margins over the student's top-K set. This trajectory-specific release rule delivers superior performance compared to full-trajectory supervision across multiple benchmarks and student scales, while also improving out-of-domain capability retention.

What carries the argument

Trajectory-specific release rule that measures teacher margin over student's top-K set, aggregates across NLTK sentences, and truncates at the BIC-detected downward change point.
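A minimal sketch of that rule, under assumptions of ours that the review leaves open: a Gaussian mean-shift RSS-BIC, a minimum segment length of two sentences, and a zero acceptance threshold (the ledger below lists the actual sensitivity settings as unspecified).

```python
import numpy as np

def sentence_means(token_margins, sentence_bounds):
    """Mean teacher margin per sentence segment.

    token_margins: per-token margins m_t for one rollout.
    sentence_bounds: (start, end) token spans, e.g. derived from an
    NLTK-style sentence tokenization of the rollout text.
    """
    return np.array([token_margins[s:e].mean() for s, e in sentence_bounds])

def _rss(x):
    # Residual sum of squares around the segment mean.
    return float(np.sum((x - x.mean()) ** 2))

def release_point(margins, min_seg=2, min_gain=0.0):
    """First sentence index to release (drop from supervision), or None.

    Profiles a single change point tau: BIC0 fits one mean to the whole
    series; BIC1(tau) fits separate means before/after tau. A split is
    accepted only when it is BIC-improving and downward (lower mean after
    tau), matching the downward change-point test described above.
    """
    x = np.asarray(margins, dtype=float)
    n = len(x)
    if n < 2 * min_seg:
        return None
    bic0 = n * np.log(_rss(x) / n + 1e-12) + np.log(n)  # one mean parameter
    best_tau, best_bic1 = None, np.inf
    for tau in range(min_seg, n - min_seg + 1):
        left, right = x[:tau], x[tau:]
        bic1 = (n * np.log((_rss(left) + _rss(right)) / n + 1e-12)
                + 3 * np.log(n))  # two means + one change point
        if bic1 < best_bic1 and right.mean() < left.mean():
            best_tau, best_bic1 = tau, bic1
    if best_tau is not None and bic0 - best_bic1 > min_gain:
        return best_tau
    return None
```

A returned index would then mask the OPD loss for all tokens from that sentence onward, so only the prefix keeps dense teacher supervision.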

If this is right

  • Truncating at the change point yields consistent gains over full-trajectory OPD on in-domain tasks.
  • The approach better preserves out-of-domain model capabilities than baseline distillation methods.
  • Performance improvements hold across different student scales within the Qwen3 model family.
  • Effective OPD requires assessing local utility of teacher feedback in addition to its availability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dynamic per-trajectory monitoring of teachability could extend to other AI feedback methods such as RLHF.
  • Replacing sentence-level aggregation with token-level detection might allow even more precise cutoffs.
  • Similar collapse patterns may limit gains in self-improvement loops or iterative distillation.

Load-bearing premise

The BIC-style downward change point on sentence-aggregated teacher margins reliably marks where feedback stops being locally discriminative without cutting useful earlier supervision.
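Figure 3 states the acceptance statistic as a profiled RSS-BIC gain. Written out for a single mean-shift model on the sentence means (the model class is our assumption, not the paper's stated choice), the premise amounts to:

```latex
\mathrm{RSS}_0 = \sum_{j=1}^{n} (\bar{m}_j - \bar{m})^2, \qquad
\mathrm{RSS}_1(\tau) = \sum_{j \le \tau} (\bar{m}_j - \bar{m}_{\le\tau})^2
                     + \sum_{j > \tau} (\bar{m}_j - \bar{m}_{>\tau})^2
% BIC with Gaussian residuals; the change point costs one extra parameter.
\mathrm{BIC}_0 = n \log\frac{\mathrm{RSS}_0}{n} + \log n, \qquad
\mathrm{BIC}_1(\tau) = n \log\frac{\mathrm{RSS}_1(\tau)}{n} + 3 \log n
% Release at \hat{\tau} = \arg\min_\tau \mathrm{BIC}_1(\tau) iff the profiled
% gain is positive and the shift is downward:
\mathrm{BIC}_0 - \min_{\tau} \mathrm{BIC}_1(\tau) > 0
\quad\text{and}\quad \bar{m}_{>\hat{\tau}} < \bar{m}_{\le\hat{\tau}}
```

On this reading, Figure 3's mean gain of 24.0 and 79% acceptance rate are statements about how often that inequality holds on the exported rollouts.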

What would settle it

If the release rule produces lower scores than full-trajectory OPD when evaluated on the same five in-domain benchmarks and student scales, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.13643 by Bing Wang, Jieping Ye, Kaiyuan Liu, Rongxiang Weng, Yang Bai, Ziyuan Zhuang.

Figure 1
Figure 1: The presence of a guidance signal does not guarantee high-value local teachability. (a) The mean teacher-student advantage remains nonzero across late response regions, indicating that the teacher has not simply ceased guiding the student. (b) The within-bin standard deviation of A_t, normalized by its early-region value, steadily decays throughout the rollout. This decrease in sampled-path dispersion is a …
Figure 2
Figure 2: Teacher margin over student top-K candidates. The normalized margin declines along the response, providing the local signal that is later aggregated over sentence segments for release. Trajectory-specific release rule: the preceding diagnostic, combined with the established insight from outcome-driven RL that high-contrast advantages are more useful for policy updates while low-contrast advantages can of…
Figure 3
Figure 3: The release rule detects trajectory-specific drops in local contrast. Statistics are computed on the exported OPD rollouts using the teacher's top-1 and top-2 margin over the student's top-K candidates. (a) The profiled RSS-BIC gain, BIC_0 − min_τ BIC_1(τ), is positive for most samples. The mean gain is 24.0, and 79% of the samples accept a significant downward BIC-improving release with a gain greater than …
Figure 4
Figure 4: Qualitative example: a representative rollout shows that while teacher and student log-probabilities diverge early on, they become highly aligned in later stages (log-probability correlation increases from -0.09 to 0.93). The resulting reduction in advantage variation (std(A) drops from 2.08 to 0.61) renders the dense feedback less actionable for guiding specific local policy improvements. Interpretation…
Figure 5
Figure 5: Support-size sensitivity. Left: the normalized teacher-margin curves remain similar across K ∈ {2, 4, 8, 16, 32, 64}, supporting the robustness of the local-contrast decline. Right: early segments consistently exhibit larger teacher top-1–top-2 margins than late segments across support sizes, indicating that the decline in local teachability is not an artifact of a particular K.
Figure 6
Figure 6: Example BIC-style release. A representative rollout shows an accepted downward change point in the teacher top-1–top-2 margin over the student's top-K candidates.
Figure 7
Figure 7: Diagnostic release positions and fixed-prefix performance. Each point is one fixed-prefix checkpoint; with only four checkpoints, this is descriptive rather than a correlation claim. The following case studies illustrate the specific behaviors that the release rule is designed to exclude from supervision. These instances do not constitute inference-time truncation. Because the trajectory-specific rule exc…
Figure 8
Figure 8: Average response length during training. The gray line represents our proposed method, while the orange line corresponds to standard full-trajectory OPD. We further evaluate the wall-clock training time of the 1.7B strong-to-weak setting under identical hardware and training conditions. On a single node equipped with eight H800 GPUs, standard full-trajectory OPD requires 14.28 hours for 100 training st…
Original abstract

On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. Prior literature suggests that, provided teacher feedback is available, supervising the full sequence of response tokens should monotonically improve performance. However, we demonstrate that this assumption sometimes fails to hold in strong-to-weak OPD settings. While later segments of a generated trajectory may still exhibit a non-zero teacher-student advantage, they frequently lack the local contrast that makes dense feedback effective for prioritizing student learning. We term this failure mode local teachability collapse. The resulting principle is straightforward: supervision should concentrate on trajectory regions where the teacher's feedback remains discriminative, rather than uniformly covering the entire response. We operationalize this principle through a trajectory-specific release rule. This rule measures the teacher's margin over the student's top-$K$ candidate set, aggregates this margin across NLTK-tokenized sentence segments, and truncates dense OPD supervision upon detecting a BIC-style downward change point. Experimental results across strong-to-weak distillation tasks using the Qwen3 model family indicate that this release rule consistently outperforms standard full-trajectory OPD across five in-domain benchmarks at various student scales. Furthermore, compared to baseline distillation methods, our approach better preserves model capabilities on out-of-domain tasks. These results suggest that effective strong-to-weak OPD requires evaluating not only the availability of teacher guidance but also its local utility, ensuring that the generated feedback remains teachable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper argues that standard full-trajectory on-policy distillation (OPD) from strong to weak models can fail due to 'local teachability collapse,' where later tokens in a rollout retain non-zero teacher advantage but lack local contrast that makes dense feedback useful for student updates. The authors operationalize this via a trajectory-specific release rule: compute teacher margins over the student's top-K candidates, aggregate by NLTK sentence segments, and truncate supervision at the first downward BIC change point. Experiments on the Qwen3 family report that this rule outperforms full-trajectory OPD on five in-domain benchmarks at multiple scales and yields better out-of-domain preservation.

Significance. If the release rule correctly isolates regions of locally discriminative teacher feedback, the result would provide a practical, low-overhead improvement to strong-to-weak OPD that reduces unnecessary supervision while preserving or enhancing performance. The work also highlights a previously under-examined failure mode in on-policy distillation, which could influence how future distillation pipelines allocate teacher compute.

major comments (3)
  1. [Abstract and §3] The claim that the BIC downward change point on NLTK-aggregated margins 'reliably identifies the onset of local teachability collapse' is not supported by any token-level or gradient-level evidence that post-change-point tokens are less informative for student updates than pre-change-point tokens. Outperformance versus full-trajectory OPD could therefore be explained by any length-reducing heuristic rather than by correctly locating a teachability boundary.
  2. [§4] The reported consistent outperformance on five benchmarks lacks error bars, statistical significance tests, exact hyperparameter values, baseline implementation details, or ablation studies on the top-K and BIC sensitivity parameters. Without these, it is impossible to determine whether the gains are robust or sensitive to the two free parameters listed in the axiom ledger.
  3. [§3.2] The rule is presented as an empirical heuristic motivated by observed failure modes, yet no derivation or controlled experiment shows that the BIC statistic on sentence-aggregated margins isolates loss of local discriminative utility rather than simply detecting a drop in margin magnitude.
minor comments (2)
  1. [§3] Notation for the top-K set and margin aggregation should be defined with explicit equations rather than prose descriptions to improve reproducibility.
  2. [§4] The OOD preservation claim would be strengthened by reporting the specific out-of-domain tasks and the magnitude of capability retention relative to the full-trajectory baseline.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, providing additional evidence and clarifications where possible. The revised manuscript incorporates several changes to strengthen the claims and experimental reporting.

point-by-point responses
  1. Referee: [Abstract and §3] the claim that the BIC downward change point on NLTK-aggregated margins 'reliably identifies the onset of local teachability collapse' is not supported by any token-level or gradient-level evidence... Outperformance versus full-trajectory OPD could therefore be explained by any length-reducing heuristic

    Authors: We agree that direct token- or gradient-level evidence would provide stronger causal support. Our primary evidence remains the consistent performance gains over full-trajectory OPD. To rule out a generic length-reduction effect, we have added an ablation comparing BIC truncation against random truncation at matched lengths; the BIC rule outperforms random truncation, indicating it locates regions of reduced local utility rather than simply shortening sequences (one possible form of this matched-length control is sketched after these responses). A full gradient analysis lies outside the current revision scope but is noted as future work. revision: partial

  2. Referee: [§4] the reported consistent outperformance on five benchmarks lacks error bars, statistical significance tests, exact hyperparameter values, baseline implementation details, or ablation studies on the top-K and BIC sensitivity parameters

    Authors: We have revised §4 and the appendix to include: (i) error bars from 3 independent runs, (ii) paired t-test p-values for all reported gains, (iii) exact hyperparameter tables, (iv) full baseline implementation details, and (v) sensitivity ablations for top-K and BIC penalty showing stable performance across reasonable ranges. These additions confirm the gains are robust. revision: yes

  3. Referee: [§3.2] the rule is presented as an empirical heuristic... yet no derivation or controlled experiment shows that the BIC statistic on sentence-aggregated margins isolates loss of local discriminative utility rather than simply detecting a drop in margin magnitude

    Authors: We have expanded §3.2 with a controlled comparison: BIC applied to sentence-aggregated margins versus BIC applied directly to raw margin values. The margin-based BIC better predicts downstream performance degradation in suffix regions, supporting that it captures loss of local contrast. We also added a short derivation sketch linking the change-point detection to the point where teacher advantage ceases to be locally discriminative for the student's top-K set. revision: yes
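For concreteness, here is one way the matched-length control from response 1 could be implemented. This sketch and its shuffling scheme are ours, not the authors': it assumes the control is built by redistributing the BIC-derived kept-prefix fractions at random across trajectories.

```python
import random

def matched_length_random_cuts(kept_fracs, traj_lens, seed=0):
    """Random-truncation control matched in length to the BIC rule.

    kept_fracs: fraction of each trajectory kept by the BIC release rule.
    traj_lens: token length of each trajectory.
    Shuffling the fractions across trajectories preserves the marginal
    kept-length distribution while destroying any per-trajectory alignment
    with local teachability, isolating a generic length-reduction effect.
    """
    rng = random.Random(seed)
    fracs = list(kept_fracs)
    rng.shuffle(fracs)
    return [max(1, round(f * n)) for f, n in zip(fracs, traj_lens)]
```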

Circularity Check

0 steps flagged

No circularity: empirical truncation heuristic tested on held-out data

full rationale

The paper observes that full-trajectory OPD can fail in strong-to-weak settings due to loss of local contrast in later tokens, introduces the term 'local teachability collapse,' and defines a release rule that computes teacher margins over the student's top-K set, aggregates at NLTK sentence level, and truncates at the first BIC downward change point. This rule is presented as an operationalization of an empirical principle rather than a derivation from first principles. No equations reduce the reported gains to quantities fitted from the evaluation data, no self-citations bear the central load, and the outperformance is measured on held-out in-domain and out-of-domain benchmarks. The construction is therefore self-contained and does not collapse to its inputs by definition.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that teacher feedback is dense and locally comparable, plus two implementation choices (top-K and sentence segmentation) whose values are not justified from first principles.

free parameters (2)
  • top-K
    Size of student's candidate set used to compute teacher margin; value not stated in abstract but required for the release rule.
  • BIC change-point sensitivity
    Threshold or penalty parameter controlling when a downward change point triggers truncation; not specified.
axioms (2)
  • domain assumption: NLTK sentence tokenization produces segments that align with regions of stable teachability
    Used to aggregate margins before change-point detection; a tokenization sketch follows this ledger.
  • domain assumption: Teacher margin over top-K remains a valid local proxy for teachability throughout the trajectory until the detected change point
    Core premise of the release rule.
invented entities (1)
  • local teachability collapse (no independent evidence)
    purpose: Named failure mode describing loss of discriminative power in suffix segments
    Coined term for the observed phenomenon; no independent evidence supplied beyond the experiments.
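The first axiom is easy to probe directly. A minimal sketch, assuming NLTK's punkt sentence tokenizer is what the paper means by NLTK tokenization:

```python
import nltk

# Assumes the punkt model is installed: nltk.download("punkt")
text = ("First we set up the recurrence. Then we bound the error term. "
        "So the answer is 42.")
for j, sent in enumerate(nltk.sent_tokenize(text)):
    print(j, sent)
# Margins are averaged within each such segment before change-point
# detection; mapping these character spans back onto model-tokenizer
# token indices is the alignment step the axiom takes for granted.
```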

pith-pipeline@v0.9.0 · 5576 in / 1460 out tokens · 75982 ms · 2026-05-14T19:50:03.308971+00:00 · methodology


Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 17 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024.

  2. [2]

    Program synthesis with large language models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

  3. [3]

    Online difficulty filtering for reasoning oriented reinforcement learning

    Sanghwan Bae, Jiwoo Hong, Min Young Lee, Hanbyul Kim, JeongYeon Nam, and Donghyun Kwak. Online difficulty filtering for reasoning oriented reinforcement learning. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 700–719, 2026.

  4. [4]

    MathArena: Evaluating LLMs on uncontaminated math competitions

    Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. MathArena: Evaluating LLMs on uncontaminated math competitions. arXiv preprint arXiv:2505.23281, 2025.

  5. [5]

    Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit

    Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., 2009.

  6. [6]

    VOLD: Reasoning transfer from LLMs to vision-language models via on-policy distillation

    Walid Bousselham, Hilde Kuehne, and Cordelia Schmid. VOLD: Reasoning transfer from LLMs to vision-language models via on-policy distillation. arXiv preprint arXiv:2510.23497, 2025.

  7. [7]

    X-OPD: Cross-modal on-policy distillation for capability alignment in speech LLMs

    Di Cao, Dongjie Fu, Hai Yu, Siqi Zheng, Xu Tan, and Tao Jin. X-OPD: Cross-modal on-policy distillation for capability alignment in speech LLMs. arXiv preprint arXiv:2603.24596, 2026.

  8. [8]

    Evaluating large language models trained on code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

  9. [9]

    Self-play fine-tuning converts weak language models to strong language models

    Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.

  10. [10]

    Process reinforcement through implicit rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025.

  11. [11]

    HDPO: Hybrid distillation policy optimization via privileged self-distillation

    Ken Ding. HDPO: Hybrid distillation policy optimization via privileged self-distillation. arXiv preprint arXiv:2603.23871, 2026.

  12. [12]

    Revisiting on-policy distillation: Empirical failure modes and simple fixes

    Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes. arXiv preprint arXiv:2603.25562, 2026.

  13. [13]

    DeepMath-103K: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning

    Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. DeepMath-103K: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. 2025.

  14. [14]

    LiveCodeBench: Holistic and contamination free evaluation of large language models for code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.

  15. [15]

    Lion: Adversarial distillation of proprietary large language models

    Yuxin Jiang, Chunkit Chan, Mingyang Chen, and Wei Wang. Lion: Adversarial distillation of proprietary large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3134–3154, 2023.

  16. [16]

    Entropy-aware on-policy distillation of language models

    Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079, 2026.

  17. [17]

    TMS: Trajectory-mixed supervision for reward-free, on-policy SFT

    Rana Muhammad Shahroz Khan, Zijie Liu, Zhen Tan, Charles Fleming, and Tianlong Chen. TMS: Trajectory-mixed supervision for reward-free, on-policy SFT. arXiv preprint arXiv:2602.03073, 2026.

  18. [18]

    Solving quantitative reasoning problems with language models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.

  19. [19]

    Video-OPD: Efficient post-training of multimodal large language models for temporal video grounding via on-policy distillation

    Jiaze Li, Hao Yin, Haoran Xu, Boshen Xu, Wenhui Tan, Zewen He, Jianzhong Ju, Zhenbo Luo, and Jian Luan. Video-OPD: Efficient post-training of multimodal large language models for temporal video grounding via on-policy distillation. arXiv preprint arXiv:2602.02994, 2026.

  20. [20]

    Let's verify step by step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023.

  21. [21]

    Where did this sentence come from? Tracing provenance in LLM reasoning distillation

    Kaiyuan Liu, Shaotian Yan, Rui Miao, Bing Wang, Chen Shen, Jun Zhang, and Jieping Ye. Where did this sentence come from? Tracing provenance in LLM reasoning distillation. arXiv preprint arXiv:2512.20908, 2025.

  22. [22]

    On-policy distillation

    Kevin Lu and Thinking Machines Lab. On-policy distillation. Connectionism, 2025.

  23. [23]

    Didactic to constructive: Turning expert solutions into learnable reasoning

    Ethan Mendes, Jungsoo Park, and Alan Ritter. Didactic to constructive: Turning expert solutions into learnable reasoning. arXiv preprint arXiv:2602.02405, 2026.

  24. [24]

    AIME 2025 Dataset

    OpenCompass. AIME 2025 Dataset. https://huggingface.co/datasets/opencompass/AIME2025, 2025. American Invitational Mathematics Examination 2025 problems (I & II).

  25. [25]

    Can prompt difficulty be online predicted for accelerating RL finetuning of reasoning models?

    Yun Qu, Qi Wang, Yixiu Mao, Vincent Tao Hu, Björn Ommer, and Xiangyang Ji. Can prompt difficulty be online predicted for accelerating RL finetuning of reasoning models? In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, pages 1240–1250, 2026.

  26. [26]

    GPQA: A graduate-level google-proof Q&A benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level google-proof Q&A benchmark. In First Conference on Language Modeling, 2024.

  27. [27]

    CRISP: Compressed reasoning via iterative self-policy distillation

    Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. CRISP: Compressed reasoning via iterative self-policy distillation. arXiv preprint arXiv:2603.05433, 2026.

  28. [28]

    Learning from the right rollouts: Data attribution for PPO-based LLM post-training

    Dong Shu, Denghui Zhang, and Jessica Hullman. Learning from the right rollouts: Data attribution for PPO-based LLM post-training. arXiv preprint arXiv:2604.01597, 2026.

  29. [29]

    ORPO-Distill: Mixed-policy preference optimization for cross-architecture LLM distillation

    Aasheesh Singh, Vishal Vaddina, and Dagnachew Birru. ORPO-Distill: Mixed-policy preference optimization for cross-architecture LLM distillation. arXiv preprint arXiv:2509.25100, 2025.

  30. [30]

    A survey of on-policy distillation for large language models

    Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626, 2026.

  31. [31]

    Gates: Self-distillation under privileged context with consensus gating

    Alex Stein, Furong Huang, and Tom Goldstein. Gates: Self-distillation under privileged context with consensus gating. arXiv preprint arXiv:2602.20574, 2026.

  32. [32]

    On the step length confounding in LLM reasoning data selection

    Bing Wang, Rui Miao, Chen Shen, Shaotian Yan, Kaiyuan Liu, Ximing Li, Xiaosong Yuan, Sinan Fan, Jun Zhang, and Jieping Ye. On the step length confounding in LLM reasoning data selection. arXiv preprint arXiv:2604.06834, 2026.

  33. [33]

    Lightning OPD: Efficient post-training for large reasoning models with offline on-policy distillation

    Yecheng Wu, Song Han, and Hai Cai. Lightning OPD: Efficient post-training for large reasoning models with offline on-policy distillation. arXiv preprint arXiv:2604.13010, 2026.

  34. [34]

    Mimo-v2-flash technical report

    LLM-Core Xiaomi. Mimo-v2-flash technical report, 2026.

  35. [35]

    OVD: On-policy verbal distillation

    Jing Xiong, Hui Shen, Shansan Gong, Yuxin Cheng, Jianghan Shen, Chaofan Tao, Haochen Tan, Haoli Bai, Lifeng Shang, and Ngai Wong. OVD: On-policy verbal distillation. arXiv preprint arXiv:2601.21968, 2026.

  36. [36]

    Distribution-aligned sequence distillation for superior long-CoT reasoning

    Shaotian Yan, Kaiyuan Liu, Chen Shen, Bing Wang, Sinan Fan, Jun Zhang, Yue Wu, Zheng Wang, and Jieping Ye. Distribution-aligned sequence distillation for superior long-CoT reasoning. arXiv preprint arXiv:2601.09088, 2026.

  37. [37]

    Qwen3 technical report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  38. [38]

    Learning beyond teacher: Generalized on-policy distillation with reward extrapolation

    Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation. arXiv preprint arXiv:2602.12125, 2026.

  39. [39]

    Black-box on-policy distillation of large language models

    Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, and Furu Wei. Black-box on-policy distillation of large language models. arXiv preprint arXiv:2511.10643, 2025.

  40. [40]

    DAPO: An open-source LLM reinforcement learning system at scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.

  41. [41]

    GLM-5: From vibe coding to agentic engineering

    Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. GLM-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026.

  42. [42]

    Fast and effective on-policy distillation from reasoning prefixes

    Dongxu Zhang, Zhichao Yang, Sepehr Janghorbani, Jun Han, Andrew Ressler II, Qian Qian, Gregory D Lyng, Sanjit Singh Batra, and Robert E Tillman. Fast and effective on-policy distillation from reasoning prefixes. arXiv preprint arXiv:2602.15260, 2026.

  43. [43]

    Self-distilled reasoner: On-policy self-distillation for large language models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026.

  44. [44]

    Future work will investigate the generalizability of the proposed release rule beyond mathematical reasoning and supplementary code-domain checks, extending its application to broader instruction-following and mixed-domain settings

  45. [45]

    Following recent concurrent works [16, 38, 42], our experiments primarily focus on the Qwen3 model family. Although validating our findings across a broader range of model families remains a valuable direction for future research, the core contribution of this paper lies in establishing the existence of local teachability collapse. We demonstrate that on…

  46. [46]

    Another promising direction involves transitioning from post-hoc release decisions to online, rollout-time truncation. By detecting local teachability dynamically during the generation process, the system could prematurely terminate the collection of low-value suffix supervision. This would inherently improve training efficiency, offering a more proacti…

  47. [47]

    Build a teacher margin on the student's top-K candidate set

  48. [48]

    Aggregate token margins over NLTK-style sentence/punctuation segments

  49. [49]

    Select a single downward change point using profiled RSS-BIC

  50. [50]

    Keep only the prefix before the selected change point

  51. [51]

    Output of the reference dynamic-prefix rule

    Rescale the kept prefix to preserve per-sample loss mass.

    from __future__ import annotations

    import math
    from dataclasses import dataclass
    from typing import Iterable, Sequence

    EPS = 1e-12

    @dataclass(frozen=True)
    class DynamicPrefixResult:
        """Output of the reference dynamic-prefix rule."""
        release_segment: int
        bic_improvement: float
        accepted: bool
    …