When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

Chaowei Xiao; Xiaogeng Liu; Xinyan Wang; Yechao Zhang; Yingzi Ma

arxiv: 2605.21606 · v1 · pith:XIHANZK6new · submitted 2026-05-20 · 💻 cs.LG · cs.AI

Xiaogeng Liu, Xinyan Wang, Yingzi Ma, Yechao Zhang, Chaowei Xiao

Pith reviewed 2026-05-22 09:22 UTC · model grok-4.3

keywords: on-policy self-distillation · reasoning distillation · teacher token reliability · position weighting · AIME benchmark · large language models · branch viability

The pith

Teacher tokens in reasoning distillation are more reliable later in the sequence, and weighting them by position improves student performance without extra teacher computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to make on-policy self-distillation more effective for training students on reasoning tasks. It demonstrates that teacher tokens are not equally reliable at every point in a generated sequence, but instead follow a clear trajectory where reliability increases with position. A diagnostic that checks whether alternative teacher tokens lead to correct final answers reveals this pattern, with position outperforming entropy as a predictor. The authors then modify the distillation objective to apply higher weights to later tokens, producing measurable gains on math competition problems while using the same student rollouts and teacher passes as before.

Core claim

Teacher-token reliability in reasoning distillation is trajectory-structured and can be utilized without additional teacher computation. An oriented within-sequence position score reaches an AUROC of 0.83 for predicting whether a teacher token leads to the correct answer, while local uncertainty scores reach at most 0.57. Position-Weighted On-Policy Self-Distillation applies an increasing position weight to the same clipped forward-KL target, improving AIME 2024 and AIME 2025 Avg@12 by +1.0 and +1.1 points respectively, with consistent aggregate gains on larger models from different families.
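For readers unfamiliar with the metric, the sketch below computes Avg@N and, for contrast, Pass@N from a grid of per-sample correctness flags. It reads Avg@12 as mean accuracy over 12 sampled completions per problem, averaged over problems; that reading follows common usage and is an assumption, since the paper's exact definition is not reproduced on this page. The function names and the toy grid are hypothetical and are not results from the paper.

```python
from typing import List


def avg_at_n(correct: List[List[bool]]) -> float:
    """Mean per-sample accuracy, averaged over problems, in percent."""
    per_problem = [sum(row) / len(row) for row in correct]
    return 100.0 * sum(per_problem) / len(per_problem)


def pass_at_n(correct: List[List[bool]]) -> float:
    """Percent of problems with at least one correct sample."""
    return 100.0 * sum(any(row) for row in correct) / len(correct)


# Toy grid: 3 problems x 4 samples (invented values, not paper results).
toy = [
    [True, False, False, False],
    [False, False, False, False],
    [True, True, True, True],
]
print(round(avg_at_n(toy), 2), round(pass_at_n(toy), 2))
```

Avg@N rewards consistency across samples, whereas Pass@N only asks whether any sample succeeds, so a one-point Avg@12 change reflects a shift in average per-sample accuracy.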

What carries the argument

The branch-viability diagnostic, which records next-token alternatives from the privileged teacher, forces each after the student prompt plus on-policy prefix, and checks whether the resulting continuation recovers the correct answer to expose position-dependent reliability.
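The labeling loop described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: `teacher_alternatives`, `student_continue`, and `is_correct` are hypothetical callables standing in for the privileged teacher pass, student-template decoding, and answer checking, and the toy usage replaces real models with trivial stand-ins.

```python
from typing import Callable, List, Sequence, Tuple


def branch_viability_labels(
    student_prompt: Sequence[int],
    spine: Sequence[int],
    teacher_alternatives: Callable[[int], List[int]],
    student_continue: Callable[[List[int]], List[int]],
    is_correct: Callable[[List[int]], bool],
) -> List[Tuple[int, int, bool]]:
    """Return (position, forced_token, viable) for each teacher alternative."""
    labels = []
    for t in range(len(spine)):
        # Student prompt plus the on-policy prefix up to (not including) t.
        prefix = list(student_prompt) + list(spine[:t])
        for token in teacher_alternatives(t):
            forced = prefix + [token]  # force the teacher's alternative
            completion = forced + student_continue(forced)
            labels.append((t, token, is_correct(completion)))
    return labels


# Toy stand-ins (hypothetical): token 9 always derails the answer.
toy_spine = [1, 2, 3]
toy_labels = branch_viability_labels(
    student_prompt=[100],
    spine=toy_spine,
    teacher_alternatives=lambda t: [toy_spine[t], 9],
    student_continue=lambda forced: [0],
    is_correct=lambda completion: 9 not in completion,
)
print(toy_labels)
```

Each label records whether the student, continuing from its own prefix, still reaches a correct answer after the forced token, which is the quantity the AUROC figures are computed against.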

If this is right

An oriented within-sequence position score predicts teacher-token reliability with AUROC 0.83, outperforming uncertainty-based scores.
PW-OPSD improves AIME 2024 and 2025 performance by +1.0 and +1.1 Avg@12 points using only the existing student rollout and teacher pass.
The same position-weighted approach yields consistent aggregate improvements on larger models including DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think.
Teacher tokens early in reasoning trajectories provide weaker learning signals than those appearing later.
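To make the AUROC comparison concrete, the sketch below scores each candidate by its normalized position in the sequence and computes AUROC against viability labels with a pairwise (Mann-Whitney style) count. The labels are toy values invented for illustration, not the paper's data, and treating "oriented" as "higher score means more reliable" is an assumption about the paper's convention.

```python
from typing import List, Sequence


def auroc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def normalized_positions(length: int) -> List[float]:
    """Map token index t in [0, length) to a score in [0, 1]."""
    return [t / max(length - 1, 1) for t in range(length)]


# Toy viability labels for six positions (invented, not paper data).
toy_viable = [False, False, True, False, True, True]
toy_scores = normalized_positions(len(toy_viable))
toy_auroc = auroc(toy_scores, toy_viable)
print(round(toy_auroc, 3))
```

An AUROC of 0.5 means the score carries no ranking information, which is why the gap between 0.83 for position and at most 0.57 for local uncertainty scores matters.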

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same position-weighting idea could be tested in code generation or other long-horizon sequential tasks where early choices are more exploratory.
Running the diagnostic on non-mathematical reasoning problems would show whether the trajectory structure is specific to math or more general.
Combining the position signal with other cheap diagnostics might produce additional gains without increasing teacher cost.

Load-bearing premise

The branch-viability diagnostic accurately measures the reliability of the original teacher token for the student's learning signal.

What would settle it

If applying the same diagnostic to a fresh set of problems shows that position no longer predicts whether forced alternatives recover the correct answer, or if position-weighted training produces no improvement on AIME benchmarks, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.21606 by Chaowei Xiao, Xiaogeng Liu, Xinyan Wang, Yechao Zhang, Yingzi Ma.

Figure 2: Per-token weight schedules for the four configurations; top panels label the role of each (wmin, τ, s) knob.

Original abstract

On-policy self-distillation (OPSD) trains a student on its own rollouts using a privileged teacher, but its standard objective weights all generated tokens equally, implicitly treating the privileged teacher target as equally reliable at every student-visited prefix. Existing entropy-based OPD methods relax this uniformity by modulating token-level supervision with teacher entropy, but high teacher entropy in reasoning has an ambiguous reliability meaning: it can reflect either non-viable uncertainty or benign solution diversity. To identify this phenomenon, we introduce a branch-viability diagnostic. Specifically, we record next-token alternatives from the privileged-answer teacher prompt, force each alternative after the student prompt plus its on-policy spine prefix, and test whether the resulting student-template continuation recovers the correct answer. On Qwen3-4B, we find that an oriented within-sequence position score is the strongest tested predictor of teacher-token reliability, reaching an area-under-ROC-curve (AUROC) of 0.83; local uncertainty scores are at most 0.57. Motivated by this trajectory-level structure, we propose Position-Weighted On-Policy Self-Distillation (PW-OPSD), which applies an increasing position weight while keeping the same student rollout, privileged teacher pass, and clipped forward-KL target as OPSD. In our comprehensive evaluations with different random seeds, the diagnostic-derived PW-OPSD improves AIME 2024 and AIME 2025 Avg@12 by +1.0 and +1.1 points, and a generalization evaluation on two larger-scale models from different families, DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think, also demonstrates consistent aggregate Avg@12 improvements. These results show that teacher-token reliability in reasoning distillation is trajectory-structured and can be utilized without additional teacher computation.
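The weighting step can be illustrated with a short sketch. The abstract does not state the functional form of the position weight, so the sigmoid schedule below is an assumption chosen only because it is increasing and has three knobs; the default values (0.25, 0.30, 0.10) are the (wmin, τ, s) setting quoted from the paper's appendix further down this page. The per-token clipped forward-KL values are taken as already-computed inputs, and the per-sequence mean reduction is likewise an assumption about how tokens are aggregated. All function names are hypothetical.

```python
import math
from typing import List


def position_weight(u: float, w_min: float = 0.25,
                    tau: float = 0.30, s: float = 0.10) -> float:
    """Assumed increasing schedule on normalized position u in [0, 1].

    Rises from about w_min toward 1, centered at tau with sharpness s.
    """
    return w_min + (1.0 - w_min) / (1.0 + math.exp(-(u - tau) / s))


def pw_opsd_loss(per_token_kl: List[List[float]]) -> float:
    """Position-weighted loss from per-token clipped forward-KL values.

    per_token_kl[i][t] is the (already clipped) KL at token t of
    rollout i. Sequences are assumed to be non-empty.
    """
    seq_losses = []
    for kl in per_token_kl:
        length = len(kl)
        weights = [position_weight(t / max(length - 1, 1))
                   for t in range(length)]
        # Per-sequence mean of weighted token losses (assumed reduction).
        seq_losses.append(sum(w * k for w, k in zip(weights, kl)) / length)
    return sum(seq_losses) / len(seq_losses)


# The same unit of KL mass counts for more late than early in a rollout.
early_heavy = pw_opsd_loss([[1.0, 0.0, 0.0, 0.0]])
late_heavy = pw_opsd_loss([[0.0, 0.0, 0.0, 1.0]])
print(round(early_heavy, 4), round(late_heavy, 4))
```

Because only the per-token weights change, a schedule like this reuses the existing student rollout and teacher pass, which is the zero-extra-compute property the abstract emphasizes.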

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Position in the rollout predicts teacher token reliability better than entropy in OPSD, and weighting later tokens gives small consistent gains on AIME without extra compute.

The main thing to know is that this paper identifies a trajectory-level pattern in on-policy self-distillation for reasoning: teacher tokens become more reliable as position advances in the sequence. The authors turn that into a simple weighting scheme that improves AIME scores by about a point while using exactly the same rollouts and teacher passes as standard OPSD. The diagnostic they introduce checks branch viability by recording teacher alternatives at a position, forcing them after the student prompt plus on-policy prefix, and seeing whether the student continuation still reaches the correct answer. On Qwen3-4B this position score hits AUROC 0.83, well above local uncertainty scores, which reach at most 0.57, and the resulting PW-OPSD shows +1.0 / +1.1 Avg@12 lifts on AIME 2024/2025 plus similar aggregate gains on DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think across seeds.

The method stays parameter-free beyond the weighting schedule itself and adds no teacher cost. That combination of a new diagnostic, clear empirical signal, and zero-overhead change is the practical contribution. The evaluation scope is reasonable for the claim, covering multiple model families and keeping the forward-KL objective intact.

The soft spot is the diagnostic itself. Because it switches to the student prompt and prefix before testing continuation, the viability score mixes how good the original teacher token was with how well the student happens to recover from that branch under its own policy. A position that looks reliable only because the student can exploit it may not actually supply a cleaner learning signal in the KL term. The reported gains are modest, and the abstract gives no error bars or statistical tests, so the full tables would need checking for variance and whether the position effect holds under different weighting schedules.

This is useful reading for anyone already running OPSD or similar distillation loops on reasoning tasks. The idea is easy to reimplement and test locally, so it could save time if the pattern replicates. It has enough concrete method and cross-model evidence to go to peer review rather than desk reject, though reviewers will likely press on the diagnostic's context switch and ask for more robustness checks.

Referee Report

2 major / 2 minor

Summary. The paper claims that teacher-token reliability in on-policy self-distillation for reasoning is trajectory-structured rather than uniform or purely entropy-driven. It introduces a branch-viability diagnostic that records next-token alternatives from a privileged teacher prompt, forces them after a student prompt plus on-policy prefix, and checks whether a student-template continuation recovers the correct answer. On Qwen3-4B this yields an oriented within-sequence position score with AUROC 0.83 (vs. at most 0.57 for local uncertainty). Motivated by the diagnostic, the authors propose Position-Weighted On-Policy Self-Distillation (PW-OPSD), which applies an increasing position weight to the clipped forward-KL objective while reusing the same student rollouts and teacher pass. Experiments report +1.0 / +1.1 Avg@12 gains on AIME 2024/2025 and consistent aggregate improvements on DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think.

Significance. If the diagnostic validly isolates teacher-token reliability, the work supplies a lightweight, zero-extra-teacher-compute improvement to on-policy distillation that exploits the natural trajectory structure of reasoning traces. The reported gains on two recent AIME sets, the use of multiple random seeds, and the generalization check across model families would constitute a practical contribution to self-distillation methods for math reasoning.

major comments (2)

[Branch-viability diagnostic (abstract and §3)] The branch-viability diagnostic (described in the abstract and motivating §3) records next-token alternatives under the privileged teacher prompt but then forces each alternative after the student prompt plus on-policy prefix before testing student-template continuation. This procedure measures viability only under the student's context and policy, not under the original teacher context that generated the token; consequently the reported AUROC of 0.83 and the rationale for increasing position weights in PW-OPSD rest on a proxy that conflates teacher reliability with student-specific recovery strength.
[Abstract and evaluation sections] The abstract states concrete AUROC 0.83 and Avg@12 deltas of +1.0 / +1.1 yet provides no error bars, ablation tables, or statistical significance tests for either the diagnostic predictor or the benchmark improvements. Because the central claim that position weighting is reliably superior rests on these numbers, the absence of uncertainty quantification is load-bearing for the empirical support.

minor comments (2)

[§4] Notation for the position-weight schedule and the exact form of the clipped forward-KL target should be defined in a single equation block rather than scattered across text.
[§3.2] The paper should clarify whether the position score is computed once on a held-out diagnostic set or re-estimated inside each training run; the current description leaves this ambiguous.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below with clarifications and proposed revisions to strengthen the manuscript.

Referee: [Branch-viability diagnostic (abstract and §3)] The branch-viability diagnostic (described in the abstract and motivating §3) records next-token alternatives under the privileged teacher prompt but then forces each alternative after the student prompt plus on-policy prefix before testing student-template continuation. This procedure measures viability only under the student's context and policy, not under the original teacher context that generated the token; consequently the reported AUROC of 0.83 and the rationale for increasing position weights in PW-OPSD rest on a proxy that conflates teacher reliability with student-specific recovery strength.

Authors: We thank the referee for this observation. The diagnostic is deliberately constructed to assess teacher-token viability under the student's on-policy prefix and context, as this is the exact setting in which the tokens are used during distillation. For on-policy self-distillation, the relevant question is whether a teacher token enables the student to recover the correct answer when continuing from its own trajectory, rather than viability under the original teacher prompt. This choice aligns with the training objective and explains why position emerges as a stronger predictor than local entropy. We acknowledge that the procedure does not isolate teacher reliability in a teacher-only context. In the revision we will expand §3 with an explicit discussion of this design rationale, including the distinction between teacher-context and student-context viability, to prevent misinterpretation.
revision: partial
Referee: [Abstract and evaluation sections] The abstract states concrete AUROC 0.83 and Avg@12 deltas of +1.0 / +1.1 yet provides no error bars, ablation tables, or statistical significance tests for either the diagnostic predictor or the benchmark improvements. Because the central claim that position weighting is reliably superior rests on these numbers, the absence of uncertainty quantification is load-bearing for the empirical support.

Authors: We agree that uncertainty quantification is important for supporting the central empirical claims. Although the manuscript states that results were obtained across multiple random seeds, we did not report per-seed variability or formal tests in the submitted version. In the revised manuscript we will add standard-deviation error bars to the AUROC and Avg@12 figures in both the abstract and the evaluation sections. We will also include a supplementary table with per-seed results and report statistical significance (e.g., paired tests across seeds) for the reported gains. These additions will be made without altering the experimental protocol.
revision: yes
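The per-seed reporting promised in this response could be computed along the following lines. This is a minimal sketch using only the Python standard library: the scores are placeholder values invented for illustration (they are not results from the paper), the function names are hypothetical, and the paired t statistic is one reasonable reading of "paired tests across seeds" rather than the authors' stated procedure.

```python
import math
import statistics
from typing import List, Tuple


def mean_and_std(values: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(baseline: List[float], method: List[float]) -> float:
    """t statistic for per-seed paired differences (method - baseline)."""
    if len(baseline) != len(method) or len(baseline) < 2:
        raise ValueError("need matched score lists with at least two seeds")
    diffs = [m - b for b, m in zip(baseline, method)]
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Placeholder per-seed scores for three seeds (invented, not paper data).
baseline_scores = [10.0, 12.0, 11.0]
method_scores = [11.0, 13.5, 12.0]

t_stat = paired_t_statistic(baseline_scores, method_scores)
print(mean_and_std(baseline_scores), mean_and_std(method_scores))
print(round(t_stat, 3))
```

With only three seeds the test has two degrees of freedom, so converting the statistic to a p-value requires a t distribution with n - 1 degrees of freedom and even a large statistic should be read cautiously.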

Circularity Check

0 steps flagged

No significant circularity; derivation uses external diagnostic

The paper's chain runs as follows: (1) define the branch-viability diagnostic on held-out rollouts (record teacher next-token alternatives, force each after the student prompt plus on-policy prefix, test student-template recovery); (2) compute the AUROC of several scores, including the oriented within-sequence position score, against the diagnostic labels (0.83 for position vs. at most 0.57 for local uncertainty scores); (3) motivate increasing position weights in PW-OPSD from the observed trajectory structure; (4) evaluate the resulting objective on AIME. None of these steps reduces to its inputs by construction. The diagnostic is an independent computation performed outside the training loop; the position weight is a fixed schedule chosen after the diagnostic rather than a fitted parameter renamed as a prediction; no self-citations or uniqueness theorems are invoked as load-bearing premises. The result is therefore an empirical correlation plus a motivated reweighting, not a self-referential identity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on an empirical diagnostic whose validity is assumed rather than proven; the position-weight schedule is a design choice motivated by that diagnostic.

free parameters (1)

position weighting schedule
Increasing position weight is chosen after observing the diagnostic results; no explicit functional form or fitting procedure is stated in the abstract.

axioms (1)

domain assumption: The branch-viability diagnostic correctly identifies reliable teacher tokens for the student's on-policy learning signal.
This premise is invoked when the authors translate the AUROC finding into the PW-OPSD weighting rule.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "an oriented within-sequence position score is the strongest tested predictor of teacher-token reliability, reaching an area-under-ROC-curve (AUROC) of 0.83"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "PW-OPSD ... applies an increasing position weight while keeping the same student rollout, privileged teacher pass, and clipped forward-KL target"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 22 internal anchors

[2]

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston

URL https://arxiv.org/abs/2306.13649. Accepted at ICLR

[3]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Hugging Face model card. Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. ArXiv, abs/2407.21787,

[4]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, G. Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mo Bavarian, Clemens Winter, P. Till...

[5]

Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

Hugging Face model card. Jasper Dekoninck, Nikola Jovanović, Tim Gehrunger, Kári Rögnvalddson, Ivo Petrov, Chenhao Sun, and Martin Vechev. Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs. arXiv preprint arXiv:2605.00674,

[7]

Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning

URL https://arxiv.org/abs/1506.02142. 12 pages, 6 figures; fixed a mistake with standard error and added a new table with updated results (marked "Update [October 2016]"); Published in ICML

[9]

MiniLLM: On-Policy Distillation of Large Language Models

URL https://arxiv.org/abs/2306.08543. Published as a conference paper in ICLR

[10]

The False Promise of Imitating Proprietary LLMs

Arnav Gudibande, Eric Wallace, Charles Burton Snell, Xinyang Geng, Hao Liu, P. Abbeel, S. Levine, and Dawn Song. The false promise of imitating proprietary llms. ArXiv, abs/2305.15717,

[11]

OpenThoughts: Data Recipes for Reasoning Models

E. Guha, Ryan Marten, Sedrick Scott Keh, Negin Raoof, G. Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean-Pierre Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Ben Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, K. Sharma, Charlie Cheng-Jie Ji, Yi...

[12]

Measuring Mathematical Problem Solving With the MATH Dataset

ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx.doi.org/10.1038/s41586-025-09422-z. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. ArXiv, abs/2103.03874,

[13]

Distilling the Knowledge in a Neural Network

Geoffrey E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. ArXiv, abs/1503.02531,

[14]

LoRA: Low-Rank Adaptation of Large Language Models

URL https://arxiv.org/abs/2106.09685. HuggingFaceH4. MATH-500. https://huggingface.co/datasets/HuggingFaceH4/MATH-500,

[15]

https://huggingface.co/datasets/HuggingFaceH4/aime_2024,

[17]

Entropy-aware on-policy distillation of language models

URL https://arxiv.org/abs/2603.07079. 16 pages, 11 figures, preprint. Seongryong Jung, Suwan Yoon, DongGeon Kim, and Hwanhee Lee. ToDi: Token-wise Distillation via Fine-Grained Divergence Control. arXiv preprint arXiv:2505.16297,

[18]

EMNLP 2025 (Oral)

URL https://arxiv.org/abs/2505.16297. EMNLP 2025 (Oral). Alex Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? ArXiv, abs/1703.04977,

[19]

Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. ArXiv, abs/1606.07947,

[20]

Distillm: Towards streamlined distillation for large language models. ArXiv, abs/2402.03898,

Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and SeYoung Yun. Distillm: Towards streamlined distillation for large language models. ArXiv, abs/2402.03898,

[22]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E

URL https://arxiv.org/abs/2603.11137. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv preprint arXiv:2309.06180,

[23]

URL https://arxiv.org/abs/2309.06180. SOSP

[24]

Let's Verify Step by Step

URL https://arxiv.org/abs/2305.20050. Andrey Malinin and Mark Gales. Predictive Uncertainty Estimation via Prior Networks. arXiv preprint arXiv:1802.10501,

[25]

Predictive Uncertainty Estimation via Prior Networks

URL https://arxiv.org/abs/1802.10501. Aaron Meurer, Christopher P Smith, Mateusz Paprocki, Ondřej Čertík, Sergey B Kirpichev, Matthew Rocklin, AMiT Kumar, Sergiu Ivanov, Jason K Moore, Sartaj Singh, et al. Sympy: symbolic computing in python. PeerJ Computer Science, 3:e103,

[26]

Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, S. Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, M. Simens, Amanda Askell, Peter Welinder, P. Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions with human feedback. ArXiv,...

[27]

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, N. Gimelshein, L. Antiga, Alban Desmaison, Andreas Köpf, E. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep...

[29]

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

URL https://arxiv.org/abs/1011.0686. Appearing in the 14th International Conference on Artificial Intelligence and Statistics (AISTATS 2011). M. Sensoy, M. Kandemir, and Lance M. Kaplan. Evidential deep learning to quantify classification uncertainty. ArXiv, abs/1806.01768,

[31]

A Survey of On-Policy Distillation for Large Language Models

URL https://arxiv.org/abs/2604.00626. Alex Stein, Furong Huang, and Tom Goldstein. Gates: Self-distillation under privileged context with consensus gating. ArXiv, abs/2602.20574,

[32]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, D. Schuurmans, Quoc Le, Ed H. Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. ArXiv, abs/2203.11171,

[33]

f-divergence minimization for sequence-level knowledge distillation. ArXiv, abs/2307.15190,

Yuqiao Wen, Zichao Li, Wenyu Du, and Lili Mou. f-divergence minimization for sequence-level knowledge distillation. ArXiv, abs/2307.15190,

[34]

Transformers: State-of-the-art natural language processing

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art na...

[35]

Transformers: State-of-the-Art Natural Language Processing

Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. URL https://aclanthology.org/2020.emnlp-demos.6/. Taiqiang Wu, Chaofan Tao, Jiahao Wang, Runming Yang, Zhe Zhao, and Ngai Wong. Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models. arXiv preprint arXiv:2404.02657,

[36]

URL https://arxiv.org/abs/2404.02657. COLING

[38]

Qwen3 Technical Report

URL https://arxiv.org/abs/2505.09388. yentinglin. AIME

[39]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Hugging Face dataset. Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. ArXiv, abs/2601.18734,

[40]

Teacher” and “student

A Branch-viability protocol details. Models, prompts, and software. The diagnostic uses one Qwen/Qwen3-4B checkpoint [Qwen Team, 2025] (HuggingFace snapshot 1cfa9a72...3b3df60c, dtype bfloat16) with no adapters loaded. “Teacher” and “student” denote two prompt templates applied to this single checkpoint: the teacher template includes the privileged ground-...

[41]

Problem attrition.Phase A samples 24 MATH-500, 30 AIME 2024, and 30 AIME 2025 problems (84 total)

for generation [Kwon et al., 2023], transformers 4.57.1 for the HuggingFace (HF) teacher forward passes [Wolf et al., 2020], and torch 2.8.0+cu128 on 4×H100 80GB GPUs [Paszke et al., 2019]. Problem attrition. Phase A samples 24 MATH-500, 30 AIME 2024, and 30 AIME 2025 problems (84 total). Phase B produces 23 + 21 + 18 = 62 correct-spine problems. After Phase ...

[42]

student greedy decode of 84 problems in total (24 from MATH-500, 30 from AIME 2024, and 30 from AIME 2025; the protocol is run independently per dataset and the labeled candidates are pooled in Phase G), with fallback to T=0.7 for problems where greedy does not reach \boxed within 16K tokens. • Phase B. HF forward of the same checkpoint under the teacher temp...

[43]

global token mean. EOPD [Jin et al., 2026]: entropy-gated RKL/FKL mixture; teacher-entropy gate; global token mean. PW-OPSD (ours): clipped FKL top opsd i,t; position w i,t; per-sequence mean. G PW-OPSD training pseudocode. Algorithm 1 gives one training step of PW-OPSD. The procedure computes student and teacher log-probabilities at the distillation temperature, fo...

[44]

Methods are compared on the same gap

OpenThoughts-Math-30k [siyanzhao, 2026] prompts vary in length (median 93 tokens, max 826), so right-padding produces a per-batch prompt-PAD gap. Methods are compared on the same gap. Train/eval prompt template gap. Training prompts use the OPSD reference student template Problem: {problem}\n\n Please reason... with enable_thinking=False; evaluation promp...

[45]

All methods are evaluated at the 100-step checkpoint, following the OPSD evaluation horizon of Zhao et al

The local launcher uses 4×H100 GPUs with effective batch size 32 (per-device batch 4 with gradient accumulation 2). All methods are evaluated at the 100-step checkpoint, following the OPSD evaluation horizon of Zhao et al. [2026]. PW-OPSD uses (wmin, τ, s) = (0.25, 0.30, 0.10) for the diagnostic-derived default schedule. Baselines. We compare PW-OPSD against...

[46]

For each method–benchmark pair we run three random evaluation seeds (main, 1,

The HMMT set is a locally cleaned parquet derived from MathArena’s hmmt_feb_2025 release [Dekoninck et al., 2026], with SHA-256 recorded in the appendix. For each method–benchmark pair we run three random evaluation seeds (main, 1,

[47]

extends this protocol to DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think under the same four benchmarks. J Evaluation metric definitions. Multi-sample evaluation for reasoning. Reasoning models are commonly evaluated with repeated sampling because a single completion can understate the chance that the model finds a correct solution. Pass@N measures whether...

[48]

per-sequence) by positioning (none vs

Table 6: 2×2 ablation on Qwen3-4B AIME 2024: reduction (uniform vs. per-sequence) by positioning (none vs. position-weighted). Avg@12 mean ± sample standard deviation across three evaluation seeds. Bold marks the column maximum. The diagonal (OPSD and PW-OPSD Moderate) reports the same evaluation runs as the corresponding rows of Table 2; small difference...

[49]

The two axes are complementary on AIME 2024 rather than independently sufficient

matches the AIME 2024 lead of Table 2; switching either axis alone underperforms by ∼1.5 pp. The two axes are complementary on AIME 2024 rather than independently sufficient.


[1] [2]

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston

URL https://arxiv.org/abs/2306.13649. Accepted at ICLR


[2] [3]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Hugging Face model card. Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. ArXiv, abs/2407.21787,

[3] [4]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, G. Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mo Bavarian, Clemens Winter, P. Till...


[4] [5]

Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

Hugging Face model card. Jasper Dekoninck, Nikola Jovanović, Tim Gehrunger, Kári Rögnvalddson, Ivo Petrov, Chenhao Sun, and Martin Vechev. Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs. arXiv preprint arXiv:2605.00674,

[5] [7]

Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning

URL https://arxiv.org/abs/1506.02142. 12 pages, 6 figures; fixed a mistake with standard error and added a new table with updated results (marked "Update [October 2016]"); Published in ICML

[6] [9]

MiniLLM: On-Policy Distillation of Large Language Models

URL https://arxiv.org/abs/2306.08543. Published as a conference paper in ICLR

[7] [10]

The False Promise of Imitating Proprietary LLMs

Arnav Gudibande, Eric Wallace, Charles Burton Snell, Xinyang Geng, Hao Liu, P. Abbeel, S. Levine, and Dawn Song. The false promise of imitating proprietary llms. ArXiv, abs/2305.15717,

[8] [11]

OpenThoughts: Data Recipes for Reasoning Models

E. Guha, Ryan Marten, Sedrick Scott Keh, Negin Raoof, G. Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean-Pierre Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Ben Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, K. Sharma, Charlie Cheng-Jie Ji, Yi...

[9] [12]

Measuring Mathematical Problem Solving With the MATH Dataset

ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx.doi.org/10.1038/s41586-025-09422-z. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. ArXiv, abs/2103.03874,

[10] [13]

Distilling the Knowledge in a Neural Network

Geoffrey E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. ArXiv, abs/1503.02531,

[11] [14]

LoRA: Low-Rank Adaptation of Large Language Models

URL https://arxiv.org/abs/2106.09685. HuggingFaceH4. MATH-500. https://huggingface.co/datasets/HuggingFaceH4/MATH-500,

[12] [15]

https://huggingface.co/datasets/HuggingFaceH4/aime_2024,

[13] [17]

Entropy-aware on-policy distillation of language models

URL https://arxiv.org/abs/2603.07079. 16 pages, 11 figures, preprint. Seongryong Jung, Suwan Yoon, DongGeon Kim, and Hwanhee Lee. ToDi: Token-wise Distillation via Fine-Grained Divergence Control. arXiv preprint arXiv:2505.16297,

[14] [18]

URL https://arxiv.org/abs/2505.16297. EMNLP 2025 (Oral). Alex Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? ArXiv, abs/1703.04977,

[15] [19]

Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. ArXiv, abs/1606.07947,

[16] [20]

Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and SeYoung Yun. Distillm: Towards streamlined distillation for large language models. ArXiv, abs/2402.03898,

[17] [22]

URL https://arxiv.org/abs/2603.11137. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv preprint arXiv:2309.06180,

[18] [23]

URL https://arxiv.org/abs/2309.06180. SOSP

[19] [24]

Let's Verify Step by Step

URL https://arxiv.org/abs/2305.20050. Andrey Malinin and Mark Gales. Predictive Uncertainty Estimation via Prior Networks. arXiv preprint arXiv:1802.10501,

[20] [25]

Predictive Uncertainty Estimation via Prior Networks

URL https://arxiv.org/abs/1802.10501. Aaron Meurer, Christopher P Smith, Mateusz Paprocki, Ondřej Čertík, Sergey B Kirpichev, Matthew Rocklin, AMiT Kumar, Sergiu Ivanov, Jason K Moore, Sartaj Singh, et al. Sympy: symbolic computing in python. PeerJ Computer Science, 3:e103,

[21] [26]

Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, S. Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, M. Simens, Amanda Askell, Peter Welinder, P. Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions with human feedback. ArXiv,...

[22] [27]

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, N. Gimelshein, L. Antiga, Alban Desmaison, Andreas Köpf, E. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep...


[23] [29]

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

URL https://arxiv.org/abs/1011.0686. Appearing in the 14th International Conference on Artificial Intelligence and Statistics (AISTATS 2011). M. Sensoy, M. Kandemir, and Lance M. Kaplan. Evidential deep learning to quantify classification uncertainty. ArXiv, abs/1806.01768,

[24] [31]

A Survey of On-Policy Distillation for Large Language Models

URL https://arxiv.org/abs/2604.00626. Alex Stein, Furong Huang, and Tom Goldstein. Gates: Self-distillation under privileged context with consensus gating. ArXiv, abs/2602.20574,

[25] [32]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, D. Schuurmans, Quoc Le, Ed H. Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. ArXiv, abs/2203.11171,

[26] [33]

Yuqiao Wen, Zichao Li, Wenyu Du, and Lili Mou. f-divergence minimization for sequence-level knowledge distillation. ArXiv, abs/2307.15190,

[27] [34]

Transformers: State-of-the-art natural language processing

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art na...


[28] [35]

Transformers: State-of-the-Art Natural Language Processing

Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. URL https://aclanthology.org/2020.emnlp-demos.6/. Taiqiang Wu, Chaofan Tao, Jiahao Wang, Runming Yang, Zhe Zhao, and Ngai Wong. Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models. arXiv preprint arXiv:2404.02657,

[29] [36]

URL https://arxiv.org/abs/2404.02657. COLING

[30] [38]

Qwen3 Technical Report

URL https://arxiv.org/abs/2505.09388. yentinglin. AIME


[31] [39]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Hugging Face dataset. Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. ArXiv, abs/2601.18734,

[32] [40]

A Branch-viability protocol details

Models, prompts, and software. The diagnostic uses one Qwen/Qwen3-4B checkpoint [Qwen Team, 2025] (HuggingFace snapshot 1cfa9a72...3b3df60c, dtype bfloat16) with no adapters loaded. “Teacher” and “student” denote two prompt templates applied to this single checkpoint: the teacher template includes the privileged ground-...

[33] [41]

for generation [Kwon et al., 2023], transformers 4.57.1 for the HuggingFace (HF) teacher forward passes [Wolf et al., 2020], and torch 2.8.0+cu128 on 4×H100 80GB GPUs [Paszke et al., 2019].

Problem attrition. Phase A samples 24 MATH-500, 30 AIME 2024, and 30 AIME 2025 problems (84 total). Phase B produces 23 + 21 + 18 = 62 correct-spine problems. After Phase ...

[34] [42]

student greedy decode of 84 problems in total (24 from MATH-500, 30 from AIME 2024, and 30 from AIME 2025; the protocol is run independently per dataset and the labeled candidates are pooled in Phase G), with fallback to T=0.7 for problems where greedy does not reach \boxed within 16K tokens.
• Phase B. HF forward of the same checkpoint under the teacher temp...

[35] [43]

global token mean
EOPD [Jin et al., 2026]: entropy-gated RKL/FKL mixture; teacher-entropy gate; global token mean
PW-OPSD (ours): clipped FKL to $p^{\mathrm{opsd}}_{i,t}$; position $w_{i,t}$; per-sequence mean

G PW-OPSD training pseudocode

Algorithm 1 gives one training step of PW-OPSD. The procedure computes student and teacher log-probabilities at the distillation temperature, fo...
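The comparison above distinguishes a global token mean from a per-sequence mean with position weights. The sketch below illustrates only that reduction step on precomputed per-token losses. It is not a reproduction of Algorithm 1, whose text is truncated here. It assumes each sequence's weighted losses are averaged over that sequence's token count and then averaged across sequences; whether the paper normalizes by token count or by the sum of weights is not visible in this excerpt. All names and numbers are illustrative.

```python
def global_token_mean(losses):
    """Average over every token in the batch, so long sequences dominate."""
    total = sum(sum(seq) for seq in losses)
    count = sum(len(seq) for seq in losses)
    return total / count

def weighted_per_sequence_mean(losses, weights):
    """Average w[i][t] * loss[i][t] within each sequence (assumed: divide
    by token count), then average the per-sequence values."""
    per_sequence = []
    for seq_losses, seq_weights in zip(losses, weights):
        weighted = [w * l for w, l in zip(seq_weights, seq_losses)]
        per_sequence.append(sum(weighted) / len(weighted))
    return sum(per_sequence) / len(per_sequence)

# Two responses of unequal length with made-up per-token losses.
losses = [[1.0, 1.0, 1.0, 1.0], [4.0]]
uniform = [[1.0] * 4, [1.0]]
increasing = [[0.25, 0.5, 0.75, 1.0], [1.0]]

print(global_token_mean(losses))                       # 1.6
print(weighted_per_sequence_mean(losses, uniform))     # 2.5
print(weighted_per_sequence_mean(losses, increasing))  # 2.3125
```

The example shows why the two axes are separable: switching the reduction changes how much each sequence counts, while switching the weights changes how much each position counts within a sequence.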
