pith. machine review for the scientific record.

arxiv: 2605.13643 · v1 · submitted 2026-05-13 · 💻 cs.CL

Recognition: no theorem link

Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:50 UTC · model grok-4.3

classification 💻 cs.CL
keywords on-policy distillation · strong-to-weak distillation · local teachability collapse · change point detection · teacher margin · trajectory truncation · Qwen3 models

The pith

In strong-to-weak on-policy distillation, truncating supervision at the onset of local teachability collapse outperforms full-trajectory training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that in strong-to-weak on-policy distillation, full-trajectory supervision can fail because later tokens often lack the local contrast needed to guide learning effectively. The authors identify this as local teachability collapse and propose focusing supervision only on the prefix where the teacher's feedback remains discriminative. They detect the cutoff by tracking the teacher's margin over the student's top-K options, averaging it per sentence, and applying a BIC change-point test to find the downward shift. Experiments on Qwen3 models demonstrate that this selective approach outperforms standard full-sequence OPD on five in-domain benchmarks while better maintaining out-of-domain performance. The insight is that teacher guidance must be not only present but locally useful for the student.
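As a reading aid, the tracked quantity can be written out explicitly. The notation below is ours, reconstructed from the review text and Figure 3, not taken from the paper.

```latex
% Our notation (an assumption): x is the prompt, y_{<t} the student's rollout
% prefix, \pi_S and \pi_T the student and teacher next-token distributions.
% S_t: the student's top-K candidate set at position t.
\mathcal{S}_t = \mathop{\mathrm{top}\text{-}K}_{v \in \mathcal{V}} \; \pi_S(v \mid x, y_{<t})
% m_t: the teacher's top-1 minus top-2 probability restricted to S_t
% (the "teacher margin over the student's top-K candidates"), where
% v_t^{(1)}, v_t^{(2)} are the teacher's two most probable tokens in S_t.
m_t = \pi_T\big(v_t^{(1)} \mid x, y_{<t}\big) - \pi_T\big(v_t^{(2)} \mid x, y_{<t}\big)
% \bar{m}_j: the margin averaged over the j-th NLTK sentence segment c_j.
\bar{m}_j = \frac{1}{|c_j|} \sum_{t \in c_j} m_t
```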

Core claim

The paper establishes that in strong-to-weak OPD, supervision should be truncated at the onset of local teachability collapse, detected via a BIC-style downward change point in NLTK-sentence-aggregated teacher margins over the student's top-K set. This trajectory-specific release rule delivers superior performance compared to full-trajectory supervision across multiple benchmarks and student scales, while also improving out-of-domain capability retention.

What carries the argument

Trajectory-specific release rule that measures teacher margin over student's top-K set, aggregates across NLTK sentences, and truncates at the BIC-detected downward change point.
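A minimal sketch of that rule, under assumptions of ours that the review leaves open: a Gaussian mean-shift RSS-BIC, a minimum segment length of two sentences, and a zero acceptance threshold (the ledger below lists the actual sensitivity settings as unspecified).

```python
import numpy as np

def sentence_means(token_margins, sentence_bounds):
    """Mean teacher margin per sentence segment.

    token_margins: per-token margins m_t for one rollout.
    sentence_bounds: (start, end) token spans, e.g. derived from an
    NLTK-style sentence tokenization of the rollout text.
    """
    return np.array([token_margins[s:e].mean() for s, e in sentence_bounds])

def _rss(x):
    # Residual sum of squares around the segment mean.
    return float(np.sum((x - x.mean()) ** 2))

def release_point(margins, min_seg=2, min_gain=0.0):
    """First sentence index to release (drop from supervision), or None.

    Profiles a single change point tau: BIC0 fits one mean to the whole
    series; BIC1(tau) fits separate means before/after tau. A split is
    accepted only when it is BIC-improving and downward (lower mean after
    tau), matching the downward change-point test described above.
    """
    x = np.asarray(margins, dtype=float)
    n = len(x)
    if n < 2 * min_seg:
        return None
    bic0 = n * np.log(_rss(x) / n + 1e-12) + np.log(n)  # one mean parameter
    best_tau, best_bic1 = None, np.inf
    for tau in range(min_seg, n - min_seg + 1):
        left, right = x[:tau], x[tau:]
        bic1 = (n * np.log((_rss(left) + _rss(right)) / n + 1e-12)
                + 3 * np.log(n))  # two means + one change point
        if bic1 < best_bic1 and right.mean() < left.mean():
            best_tau, best_bic1 = tau, bic1
    if best_tau is not None and bic0 - best_bic1 > min_gain:
        return best_tau
    return None
```

A returned index would then mask the OPD loss for all tokens from that sentence onward, so only the prefix keeps dense teacher supervision.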

If this is right

  • Truncating at the change point yields consistent gains over full-trajectory OPD on in-domain tasks.
  • The approach better preserves out-of-domain model capabilities than baseline distillation methods.
  • Performance improvements hold across different student scales within the Qwen3 model family.
  • Effective OPD requires assessing local utility of teacher feedback in addition to its availability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dynamic per-trajectory monitoring of teachability could extend to other AI feedback methods such as RLHF.
  • Replacing sentence-level aggregation with token-level detection might allow even more precise cutoffs.
  • Similar collapse patterns may limit gains in self-improvement loops or iterative distillation.

Load-bearing premise

The BIC-style downward change point on sentence-aggregated teacher margins reliably marks where feedback stops being locally discriminative without cutting useful earlier supervision.
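Figure 3 states the acceptance statistic as a profiled RSS-BIC gain. Written out for a single mean-shift model on the sentence means (the model class is our assumption, not the paper's stated choice), the premise amounts to:

```latex
\mathrm{RSS}_0 = \sum_{j=1}^{n} (\bar{m}_j - \bar{m})^2, \qquad
\mathrm{RSS}_1(\tau) = \sum_{j \le \tau} (\bar{m}_j - \bar{m}_{\le\tau})^2
                     + \sum_{j > \tau} (\bar{m}_j - \bar{m}_{>\tau})^2
% BIC with Gaussian residuals; the change point costs one extra parameter.
\mathrm{BIC}_0 = n \log\frac{\mathrm{RSS}_0}{n} + \log n, \qquad
\mathrm{BIC}_1(\tau) = n \log\frac{\mathrm{RSS}_1(\tau)}{n} + 3 \log n
% Release at \hat{\tau} = \arg\min_\tau \mathrm{BIC}_1(\tau) iff the profiled
% gain is positive and the shift is downward:
\mathrm{BIC}_0 - \min_{\tau} \mathrm{BIC}_1(\tau) > 0
\quad\text{and}\quad \bar{m}_{>\hat{\tau}} < \bar{m}_{\le\hat{\tau}}
```

On this reading, Figure 3's mean gain of 24.0 and 79% acceptance rate are statements about how often that inequality holds on the exported rollouts.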

What would settle it

If the release rule produces lower scores than full-trajectory OPD when evaluated on the same five in-domain benchmarks and student scales, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.13643 by Bing Wang, Jieping Ye, Kaiyuan Liu, Rongxiang Weng, Yang Bai, Ziyuan Zhuang.

Figure 1
Figure 1: The presence of a guidance signal does not guarantee high-value local teachability. (a) The mean teacher-student advantage remains nonzero across late response regions, indicating that the teacher has not simply ceased guiding the student. (b) The within-bin standard deviation of A_t, normalized by its early-region value, steadily decays throughout the rollout. This decrease in sampled-path dispersion is a …
Figure 2
Figure 2: Teacher margin over student top-K candidates. The normalized margin declines along the response, providing the local signal that is later aggregated over sentence segments for release. Trajectory-specific release rule: the preceding diagnostic, combined with the established insight from outcome-driven RL that high-contrast advantages are more useful for policy updates while low-contrast advantages can of…
Figure 3
Figure 3: The release rule detects trajectory-specific drops in local contrast. Statistics are computed on the exported OPD rollouts using the teacher's top-1 and top-2 margin over the student's top-K candidates. (a) The profiled RSS-BIC gain, BIC_0 − min_τ BIC_1(τ), is positive for most samples. The mean gain is 24.0, and 79% of the samples accept a significant downward BIC-improving release with a gain greater than …
Figure 4
Figure 4: Qualitative example: a representative rollout shows that while teacher and student log-probabilities diverge early on, they become highly aligned in later stages (log-probability correlation increases from -0.09 to 0.93). The resulting reduction in advantage variation (std(A) drops from 2.08 to 0.61) renders the dense feedback less actionable for guiding specific local policy improvements. Interpretation…
Figure 5
Figure 5: Support-size sensitivity. Left: the normalized teacher-margin curves remain similar across K ∈ {2, 4, 8, 16, 32, 64}, supporting the robustness of the local-contrast decline. Right: early segments consistently exhibit larger teacher top-1–top-2 margins than late segments across support sizes, indicating that the decline in local teachability is not an artifact of a particular K.
Figure 6
Figure 6: Example BIC-style release. A representative rollout shows an accepted downward change point in the teacher top-1–top-2 margin over the student's top-K candidates.
Figure 7
Figure 7: Diagnostic release positions and fixed-prefix performance. Each point is one fixed-prefix checkpoint; with only four checkpoints, this is descriptive rather than a correlation claim. The following case studies illustrate the specific behaviors that the release rule is designed to exclude from supervision. These instances do not constitute inference-time truncation. Because the trajectory-specific rule exc…
Figure 8
Figure 8: Average response length during training. The gray line represents our proposed method, while the orange line corresponds to standard full-trajectory OPD. We further evaluate the wall-clock training time of the 1.7B strong-to-weak setting under identical hardware and training conditions. On a single node equipped with eight H800 GPUs, standard full-trajectory OPD requires 14.28 hours for 100 training st…
Original abstract

On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. Prior literature suggests that, provided teacher feedback is available, supervising the full sequence of response tokens should monotonically improve performance. However, we demonstrate that this assumption sometimes fails to hold in strong-to-weak OPD settings. While later segments of a generated trajectory may still exhibit a non-zero teacher-student advantage, they frequently lack the local contrast that makes dense feedback effective for prioritizing student learning. We term this failure mode local teachability collapse. The resulting principle is straightforward: supervision should concentrate on trajectory regions where the teacher's feedback remains discriminative, rather than uniformly covering the entire response. We operationalize this principle through a trajectory-specific release rule. This rule measures the teacher's margin over the student's top-$K$ candidate set, aggregates this margin across NLTK-tokenized sentence segments, and truncates dense OPD supervision upon detecting a BIC-style downward change point. Experimental results across strong-to-weak distillation tasks using the Qwen3 model family indicate that this release rule consistently outperforms standard full-trajectory OPD across five in-domain benchmarks at various student scales. Furthermore, compared to baseline distillation methods, our approach better preserves model capabilities on out-of-domain tasks. These results suggest that effective strong-to-weak OPD requires evaluating not only the availability of teacher guidance but also its local utility, ensuring that the generated feedback remains teachable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper argues that standard full-trajectory on-policy distillation (OPD) from strong to weak models can fail due to 'local teachability collapse,' where later tokens in a rollout retain non-zero teacher advantage but lack local contrast that makes dense feedback useful for student updates. The authors operationalize this via a trajectory-specific release rule: compute teacher margins over the student's top-K candidates, aggregate by NLTK sentence segments, and truncate supervision at the first downward BIC change point. Experiments on the Qwen3 family report that this rule outperforms full-trajectory OPD on five in-domain benchmarks at multiple scales and yields better out-of-domain preservation.

Significance. If the release rule correctly isolates regions of locally discriminative teacher feedback, the result would provide a practical, low-overhead improvement to strong-to-weak OPD that reduces unnecessary supervision while preserving or enhancing performance. The work also highlights a previously under-examined failure mode in on-policy distillation, which could influence how future distillation pipelines allocate teacher compute.

major comments (3)
  1. [Abstract and §3] The claim that the BIC downward change point on NLTK-aggregated margins 'reliably identifies the onset of local teachability collapse' is not supported by any token-level or gradient-level evidence that post-change-point tokens are less informative for student updates than pre-change-point tokens. Outperformance versus full-trajectory OPD could therefore be explained by any length-reducing heuristic rather than by correctly locating a teachability boundary.
  2. [§4] The reported consistent outperformance on five benchmarks lacks error bars, statistical significance tests, exact hyperparameter values, baseline implementation details, or ablation studies on the top-K and BIC sensitivity parameters. Without these, it is impossible to determine whether the gains are robust or sensitive to the two free parameters listed in the axiom ledger.
  3. [§3.2] The rule is presented as an empirical heuristic motivated by observed failure modes, yet no derivation or controlled experiment shows that the BIC statistic on sentence-aggregated margins isolates loss of local discriminative utility rather than simply detecting a drop in margin magnitude.
minor comments (2)
  1. [§3] Notation for the top-K set and margin aggregation should be defined with explicit equations rather than prose descriptions to improve reproducibility.
  2. [§4] The OOD preservation claim would be strengthened by reporting the specific out-of-domain tasks and the magnitude of capability retention relative to the full-trajectory baseline.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, providing additional evidence and clarifications where possible. The revised manuscript incorporates several changes to strengthen the claims and experimental reporting.

point-by-point responses
  1. Referee: [Abstract and §3] the claim that the BIC downward change point on NLTK-aggregated margins 'reliably identifies the onset of local teachability collapse' is not supported by any token-level or gradient-level evidence... Outperformance versus full-trajectory OPD could therefore be explained by any length-reducing heuristic

    Authors: We agree that direct token- or gradient-level evidence would provide stronger causal support. Our primary evidence remains the consistent performance gains over full-trajectory OPD. To rule out a generic length-reduction effect, we have added an ablation comparing BIC truncation against random truncation at matched lengths; the BIC rule outperforms random truncation, indicating it locates regions of reduced local utility rather than simply shortening sequences (one possible form of this matched-length control is sketched after these responses). A full gradient analysis lies outside the current revision scope but is noted as future work. revision: partial

  2. Referee: [§4] the reported consistent outperformance on five benchmarks lacks error bars, statistical significance tests, exact hyperparameter values, baseline implementation details, or ablation studies on the top-K and BIC sensitivity parameters

    Authors: We have revised §4 and the appendix to include: (i) error bars from 3 independent runs, (ii) paired t-test p-values for all reported gains, (iii) exact hyperparameter tables, (iv) full baseline implementation details, and (v) sensitivity ablations for top-K and BIC penalty showing stable performance across reasonable ranges. These additions confirm the gains are robust. revision: yes

  3. Referee: [§3.2] the rule is presented as an empirical heuristic... yet no derivation or controlled experiment shows that the BIC statistic on sentence-aggregated margins isolates loss of local discriminative utility rather than simply detecting a drop in margin magnitude

    Authors: We have expanded §3.2 with a controlled comparison: BIC applied to sentence-aggregated margins versus BIC applied directly to raw margin values. The margin-based BIC better predicts downstream performance degradation in suffix regions, supporting that it captures loss of local contrast. We also added a short derivation sketch linking the change-point detection to the point where teacher advantage ceases to be locally discriminative for the student's top-K set. revision: yes
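For concreteness, here is one way the matched-length control from response 1 could be implemented. This sketch and its shuffling scheme are ours, not the authors': it assumes the control is built by redistributing the BIC-derived kept-prefix fractions at random across trajectories.

```python
import random

def matched_length_random_cuts(kept_fracs, traj_lens, seed=0):
    """Random-truncation control matched in length to the BIC rule.

    kept_fracs: fraction of each trajectory kept by the BIC release rule.
    traj_lens: token length of each trajectory.
    Shuffling the fractions across trajectories preserves the marginal
    kept-length distribution while destroying any per-trajectory alignment
    with local teachability, isolating a generic length-reduction effect.
    """
    rng = random.Random(seed)
    fracs = list(kept_fracs)
    rng.shuffle(fracs)
    return [max(1, round(f * n)) for f, n in zip(fracs, traj_lens)]
```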

Circularity Check

0 steps flagged

No circularity: empirical truncation heuristic tested on held-out data

full rationale

The paper observes that full-trajectory OPD can fail in strong-to-weak settings due to loss of local contrast in later tokens, introduces the term 'local teachability collapse,' and defines a release rule that computes teacher margins over the student's top-K set, aggregates at NLTK sentence level, and truncates at the first BIC downward change point. This rule is presented as an operationalization of an empirical principle rather than a derivation from first principles. No equations reduce the reported gains to quantities fitted from the evaluation data, no self-citations bear the central load, and the outperformance is measured on held-out in-domain and out-of-domain benchmarks. The construction is therefore self-contained and does not collapse to its inputs by definition.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that teacher feedback is dense and locally comparable, plus two implementation choices (top-K and sentence segmentation) whose values are not justified from first principles.

free parameters (2)
  • top-K
    Size of student's candidate set used to compute teacher margin; value not stated in abstract but required for the release rule.
  • BIC change-point sensitivity
    Threshold or penalty parameter controlling when a downward change point triggers truncation; not specified.
axioms (2)
  • domain assumption: NLTK sentence tokenization produces segments that align with regions of stable teachability
    Used to aggregate margins before change-point detection; a tokenization sketch follows this ledger.
  • domain assumption: Teacher margin over top-K remains a valid local proxy for teachability throughout the trajectory until the detected change point
    Core premise of the release rule.
invented entities (1)
  • local teachability collapse (no independent evidence)
    purpose: Named failure mode describing loss of discriminative power in suffix segments
    Coined term for the observed phenomenon; no independent evidence supplied beyond the experiments.
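The first axiom is easy to probe directly. A minimal sketch, assuming NLTK's punkt sentence tokenizer is what the paper means by NLTK tokenization:

```python
import nltk

# Assumes the punkt model is installed: nltk.download("punkt")
text = ("First we set up the recurrence. Then we bound the error term. "
        "So the answer is 42.")
for j, sent in enumerate(nltk.sent_tokenize(text)):
    print(j, sent)
# Margins are averaged within each such segment before change-point
# detection; mapping these character spans back onto model-tokenizer
# token indices is the alignment step the axiom takes for granted.
```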

pith-pipeline@v0.9.0 · 5576 in / 1460 out tokens · 75982 ms · 2026-05-14T19:50:03.308971+00:00 · methodology


Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 17 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024.

  2. [2]

    Program synthesis with large language models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

  3. [3]

    Online difficulty filtering for reasoning oriented reinforcement learning

    Sanghwan Bae, Jiwoo Hong, Min Young Lee, Hanbyul Kim, JeongYeon Nam, and Donghyun Kwak. Online difficulty filtering for reasoning oriented reinforcement learning. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 700–719, 2026.

  4. [4]

    MathArena: Evaluating LLMs on uncontaminated math competitions

    Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. MathArena: Evaluating LLMs on uncontaminated math competitions. arXiv preprint arXiv:2505.23281, 2025.

  5. [5]

    Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit

    Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., 2009.

  6. [6]

    VOLD: Reasoning transfer from LLMs to vision-language models via on-policy distillation

    Walid Bousselham, Hilde Kuehne, and Cordelia Schmid. VOLD: Reasoning transfer from LLMs to vision-language models via on-policy distillation. arXiv preprint arXiv:2510.23497, 2025.

  7. [7]

    X-OPD: Cross-modal on-policy distillation for capability alignment in speech LLMs

    Di Cao, Dongjie Fu, Hai Yu, Siqi Zheng, Xu Tan, and Tao Jin. X-OPD: Cross-modal on-policy distillation for capability alignment in speech LLMs. arXiv preprint arXiv:2603.24596, 2026.

  8. [8]

    Evaluating large language models trained on code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

  9. [9]

    Self-play fine-tuning converts weak language models to strong language models

    Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.

  10. [10]

    Process reinforcement through implicit rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025.

  11. [11]

    HDPO: Hybrid distillation policy optimization via privileged self-distillation

    Ken Ding. HDPO: Hybrid distillation policy optimization via privileged self-distillation. arXiv preprint arXiv:2603.23871, 2026.

  12. [12]

    Revisiting on-policy distillation: Empirical failure modes and simple fixes

    Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes. arXiv preprint arXiv:2603.25562, 2026.

  13. [13]

    DeepMath-103K: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning

    Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. DeepMath-103K: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. 2025.

  14. [14]

    LiveCodeBench: Holistic and contamination free evaluation of large language models for code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.

  15. [15]

    Lion: Adversarial distillation of proprietary large language models

    Yuxin Jiang, Chunkit Chan, Mingyang Chen, and Wei Wang. Lion: Adversarial distillation of proprietary large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3134–3154, 2023.

  16. [16]

    Entropy-aware on-policy distillation of language models

    Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079, 2026.

  17. [17]

    TMS: Trajectory-mixed supervision for reward-free, on-policy SFT

    Rana Muhammad Shahroz Khan, Zijie Liu, Zhen Tan, Charles Fleming, and Tianlong Chen. TMS: Trajectory-mixed supervision for reward-free, on-policy SFT. arXiv preprint arXiv:2602.03073, 2026.

  18. [18]

    Solving quantitative reasoning problems with language models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.

  19. [19]

    Video-OPD: Efficient post-training of multimodal large language models for temporal video grounding via on-policy distillation

    Jiaze Li, Hao Yin, Haoran Xu, Boshen Xu, Wenhui Tan, Zewen He, Jianzhong Ju, Zhenbo Luo, and Jian Luan. Video-OPD: Efficient post-training of multimodal large language models for temporal video grounding via on-policy distillation. arXiv preprint arXiv:2602.02994, 2026.

  20. [20]

    Let's verify step by step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023.

  21. [21]

    Where did this sentence come from? Tracing provenance in LLM reasoning distillation

    Kaiyuan Liu, Shaotian Yan, Rui Miao, Bing Wang, Chen Shen, Jun Zhang, and Jieping Ye. Where did this sentence come from? Tracing provenance in LLM reasoning distillation. arXiv preprint arXiv:2512.20908, 2025.

  22. [22]

    On-policy distillation

    Kevin Lu and Thinking Machines Lab. On-policy distillation. Connectionism, 2025.

  23. [23]

    Didactic to constructive: Turning expert solutions into learnable reasoning

    Ethan Mendes, Jungsoo Park, and Alan Ritter. Didactic to constructive: Turning expert solutions into learnable reasoning. arXiv preprint arXiv:2602.02405, 2026.

  24. [24]

    AIME 2025 Dataset

    OpenCompass. AIME 2025 Dataset. https://huggingface.co/datasets/opencompass/AIME2025, 2025. American Invitational Mathematics Examination 2025 problems (I & II).

  25. [25]

    Can prompt difficulty be online predicted for accelerating RL finetuning of reasoning models?

    Yun Qu, Qi Wang, Yixiu Mao, Vincent Tao Hu, Björn Ommer, and Xiangyang Ji. Can prompt difficulty be online predicted for accelerating RL finetuning of reasoning models? In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, pages 1240–1250, 2026.

  26. [26]

    GPQA: A graduate-level google-proof Q&A benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level google-proof Q&A benchmark. In First Conference on Language Modeling, 2024.

  27. [27]

    CRISP: Compressed reasoning via iterative self-policy distillation

    Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. CRISP: Compressed reasoning via iterative self-policy distillation. arXiv preprint arXiv:2603.05433, 2026.

  28. [28]

    Learning from the right rollouts: Data attribution for PPO-based LLM post-training

    Dong Shu, Denghui Zhang, and Jessica Hullman. Learning from the right rollouts: Data attribution for PPO-based LLM post-training. arXiv preprint arXiv:2604.01597, 2026.

  29. [29]

    ORPO-Distill: Mixed-policy preference optimization for cross-architecture LLM distillation

    Aasheesh Singh, Vishal Vaddina, and Dagnachew Birru. ORPO-Distill: Mixed-policy preference optimization for cross-architecture LLM distillation. arXiv preprint arXiv:2509.25100, 2025.

  30. [30]

    A survey of on-policy distillation for large language models

    Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626, 2026.

  31. [31]

    Gates: Self-distillation under privileged context with consensus gating

    Alex Stein, Furong Huang, and Tom Goldstein. Gates: Self-distillation under privileged context with consensus gating. arXiv preprint arXiv:2602.20574, 2026.

  32. [32]

    On the step length confounding in LLM reasoning data selection

    Bing Wang, Rui Miao, Chen Shen, Shaotian Yan, Kaiyuan Liu, Ximing Li, Xiaosong Yuan, Sinan Fan, Jun Zhang, and Jieping Ye. On the step length confounding in LLM reasoning data selection. arXiv preprint arXiv:2604.06834, 2026.

  33. [33]

    Lightning OPD: Efficient post-training for large reasoning models with offline on-policy distillation

    Yecheng Wu, Song Han, and Hai Cai. Lightning OPD: Efficient post-training for large reasoning models with offline on-policy distillation. arXiv preprint arXiv:2604.13010, 2026.

  34. [34]

    Mimo-v2-flash technical report

    LLM-Core Xiaomi. Mimo-v2-flash technical report, 2026.

  35. [35]

    OVD: On-policy verbal distillation

    Jing Xiong, Hui Shen, Shansan Gong, Yuxin Cheng, Jianghan Shen, Chaofan Tao, Haochen Tan, Haoli Bai, Lifeng Shang, and Ngai Wong. OVD: On-policy verbal distillation. arXiv preprint arXiv:2601.21968, 2026.

  36. [36]

    Distribution-aligned sequence distillation for superior long-CoT reasoning

    Shaotian Yan, Kaiyuan Liu, Chen Shen, Bing Wang, Sinan Fan, Jun Zhang, Yue Wu, Zheng Wang, and Jieping Ye. Distribution-aligned sequence distillation for superior long-CoT reasoning. arXiv preprint arXiv:2601.09088, 2026.

  37. [37]

    Qwen3 technical report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  38. [38]

    Learning beyond teacher: Generalized on-policy distillation with reward extrapolation

    Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation. arXiv preprint arXiv:2602.12125, 2026.

  39. [39]

    Black-box on-policy distillation of large language models

    Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, and Furu Wei. Black-box on-policy distillation of large language models. arXiv preprint arXiv:2511.10643, 2025.

  40. [40]

    DAPO: An open-source LLM reinforcement learning system at scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.

  41. [41]

    GLM-5: From vibe coding to agentic engineering

    Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. GLM-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026.

  42. [42]

    Fast and effective on-policy distillation from reasoning prefixes

    Dongxu Zhang, Zhichao Yang, Sepehr Janghorbani, Jun Han, Andrew Ressler II, Qian Qian, Gregory D Lyng, Sanjit Singh Batra, and Robert E Tillman. Fast and effective on-policy distillation from reasoning prefixes. arXiv preprint arXiv:2602.15260, 2026.

  43. [43]

    Self-distilled reasoner: On-policy self-distillation for large language models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026.

  44. [44]

    Future work will investigate the generalizability of the proposed release rule beyond mathematical reasoning and supplementary code-domain checks, extending its application to broader instruction-following and mixed-domain settings

  45. [45]

    Following recent concurrent works [16, 38, 42], our experiments primarily focus on the Qwen3 model family. Although validating our findings across a broader range of model families remains a valuable direction for future research, the core contribution of this paper lies in establishing the existence of local teachability collapse. We demonstrate that on…

  46. [46]

    Another promising direction involves transitioning from post-hoc release decisions to online, rollout-time truncation. By detecting local teachability dynamically during the generation process, the system could prematurely terminate the collection of low-value suffix supervision. This would inherently improve training efficiency, offering a more proacti…

  47. [47]

    Build a teacher margin on the student's top-K candidate set

  48. [48]

    Aggregate token margins over NLTK-style sentence/punctuation segments

  49. [49]

    Select a single downward change point using profiled RSS-BIC

  50. [50]

    Keep only the prefix before the selected change point

  51. [51]

    Output of the reference dynamic-prefix rule

    Rescale the kept prefix to preserve per-sample loss mass.

    from __future__ import annotations

    import math
    from dataclasses import dataclass
    from typing import Iterable, Sequence

    EPS = 1e-12

    @dataclass(frozen=True)
    class DynamicPrefixResult:
        """Output of the reference dynamic-prefix rule."""
        release_segment: int
        bic_improvement: float
        accepted: bool
    …