arxiv: 2605.21606 · v1 · pith:XIHANZK6new · submitted 2026-05-20 · 💻 cs.LG · cs.AI

When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

Pith reviewed 2026-05-22 09:22 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords on-policy self-distillation · reasoning distillation · teacher token reliability · position weighting · AIME benchmark · large language models · branch viability

The pith

Teacher tokens in reasoning distillation are more reliable later in the sequence, and weighting them by position improves student performance without extra teacher computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to make on-policy self-distillation more effective for training students on reasoning tasks. It demonstrates that teacher tokens are not equally reliable at every point in a generated sequence, but instead follow a clear trajectory where reliability increases with position. A diagnostic that checks whether alternative teacher tokens lead to correct final answers reveals this pattern, with position outperforming entropy as a predictor. The authors then modify the distillation objective to apply higher weights to later tokens, producing measurable gains on math competition problems while using the same student rollouts and teacher passes as before.

Core claim

Teacher-token reliability in reasoning distillation is trajectory-structured and can be utilized without additional teacher computation. An oriented within-sequence position score reaches an AUROC of 0.83 for predicting whether a teacher token leads to the correct answer, while local uncertainty scores reach at most 0.57. Position-Weighted On-Policy Self-Distillation applies an increasing position weight to the same clipped forward-KL target, improving AIME 2024 and AIME 2025 Avg@12 by +1.0 and +1.1 points respectively, with consistent aggregate gains on larger models from different families.
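To make the AUROC comparison concrete, here is a minimal sketch of how a scalar score could be evaluated as a predictor of a binary viability label. The function name `auroc` and the toy scores and labels are illustrative assumptions, not the paper's data or code; the computation is the standard pairwise-ranking (Mann-Whitney) form of AUROC.

```python
def auroc(scores, labels):
    """AUROC as the probability that a random positive outranks a random negative.

    Ties count as half a win (Mann-Whitney U formulation).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: relative position in [0, 1] and whether the
# forced alternative recovered the correct answer (1) or not (0).
toy_position = [0.05, 0.20, 0.35, 0.50, 0.65, 0.80, 0.95]
toy_viable = [0, 0, 1, 0, 1, 1, 1]

toy_auroc = auroc(toy_position, toy_viable)
print(f"toy AUROC of position score: {toy_auroc:.3f}")
```

A score that ranks every viable token above every non-viable one gives 1.0, and an uninformative score gives about 0.5, which is the scale on which the reported 0.83 and 0.57 should be read.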

What carries the argument

The branch-viability diagnostic carries the argument. It records next-token alternatives from the privileged teacher, forces each one after the student prompt plus the on-policy prefix, and checks whether the resulting continuation recovers the correct answer. The pattern of recoveries across the sequence is what exposes position-dependent reliability.
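The diagnostic loop can be sketched as follows. This is an illustration under stated assumptions: `continue_fn` and `is_correct` are hypothetical stand-ins for a student-template decode and an answer checker, and the toy "model" exists only to make the example runnable. None of these names come from the paper's code.

```python
from typing import Callable, List, Sequence, Tuple

Token = str


def branch_viability(
    student_prompt: Sequence[Token],
    on_policy_prefix: Sequence[Token],
    teacher_alternatives: Sequence[Token],
    continue_fn: Callable[[List[Token]], List[Token]],
    is_correct: Callable[[List[Token]], bool],
) -> List[Tuple[Token, bool]]:
    """Label each teacher alternative by whether forcing it still reaches the answer.

    continue_fn and is_correct are hypothetical stand-ins for a student-template
    decode and an answer checker; neither is taken from the paper's code.
    """
    labels = []
    for alt in teacher_alternatives:
        # Force the alternative right after the student prompt plus prefix.
        forced = list(student_prompt) + list(on_policy_prefix) + [alt]
        completion = forced + continue_fn(forced)
        labels.append((alt, is_correct(completion)))
    return labels


# Toy stand-ins: the "model" finishes with the answer 4 unless the forced
# token was "minus", in which case it ends with 0.
def toy_continue(tokens):
    return ["0"] if tokens[-1] == "minus" else ["4"]


def toy_is_correct(tokens):
    return tokens[-1] == "4"


toy_labels = branch_viability(
    ["2", "+", "2", "="], ["so", "two"], ["plus", "minus"],
    toy_continue, toy_is_correct,
)
print(toy_labels)
```

The resulting (token, viable) pairs are the labels against which candidate scores such as position or entropy would then be compared.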

If this is right

  • An oriented within-sequence position score predicts teacher-token reliability with AUROC 0.83, outperforming uncertainty-based scores.
  • PW-OPSD improves AIME 2024 and 2025 performance by +1.0 and +1.1 Avg@12 points using only the existing student rollout and teacher pass.
  • The same position-weighted approach yields consistent aggregate improvements on larger models including DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think.
  • Teacher tokens early in reasoning trajectories provide weaker learning signals than those appearing later.
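For readers unfamiliar with the Avg@12 metric used in these points, the sketch below assumes it means per-problem accuracy over k sampled completions, averaged over problems and expressed in percent. That reading is an assumption here, and the function name `avg_at_k` and the toy outcomes are illustrative only.

```python
def avg_at_k(correct_per_problem):
    """Mean per-sample accuracy (in percent) over problems.

    correct_per_problem: one list of 0/1 outcomes per problem, each of length k.
    """
    if not correct_per_problem:
        raise ValueError("no problems given")
    k = len(correct_per_problem[0])
    if k == 0 or any(len(row) != k for row in correct_per_problem):
        raise ValueError("every problem needs the same non-zero number of samples")
    per_problem = [sum(row) / k for row in correct_per_problem]
    return 100.0 * sum(per_problem) / len(per_problem)


# Hypothetical outcomes for 3 problems with k = 4 samples each.
toy_outcomes = [[1, 1, 0, 1], [0, 0, 0, 0], [1, 1, 1, 1]]
toy_avg = avg_at_k(toy_outcomes)
print(f"Avg@4 = {toy_avg:.2f}")
```

Under this reading, a +1.0 point change on a 30-problem set with 12 samples per problem corresponds to about 3.6 additional correct completions out of 360.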

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same position-weighting idea could be tested in code generation or other long-horizon sequential tasks where early choices are more exploratory.
  • Running the diagnostic on non-mathematical reasoning problems would show whether the trajectory structure is specific to math or more general.
  • Combining the position signal with other cheap diagnostics might produce additional gains without increasing teacher cost.

Load-bearing premise

The branch-viability diagnostic accurately measures the reliability of the original teacher token for the student's learning signal.

What would settle it

If applying the same diagnostic to a fresh set of problems shows that position no longer predicts whether forced alternatives recover the correct answer, or if position-weighted training produces no improvement on AIME benchmarks, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.21606 by Chaowei Xiao, Xiaogeng Liu, Xinyan Wang, Yechao Zhang, Yingzi Ma.

Figure 1: Branch viability reveals a positional reliability structure in Qwen3-4B reasoning traces.

Figure 2: Per-token weight schedules for the four configurations; top panels label the role of each (wmin, τ, s) knob.

Abstract

On-policy self-distillation (OPSD) trains a student on its own rollouts using a privileged teacher, but its standard objective weights all generated tokens equally, implicitly treating the privileged teacher target as equally reliable at every student-visited prefix. Existing entropy-based OPD methods relax this uniformity by modulating token-level supervision with teacher entropy, but high teacher entropy in reasoning has an ambiguous reliability meaning: it can reflect either non-viable uncertainty or benign solution diversity. To identify this phenomenon, we introduce a branch-viability diagnostic. Specifically, we record next-token alternatives from the privileged-answer teacher prompt, force each alternative after the student prompt plus its on-policy spine prefix, and test whether the resulting student-template continuation recovers the correct answer. On Qwen3-4B, we find that an oriented within-sequence position score is the strongest tested predictor of teacher-token reliability, reaching an area-under-ROC-curve (AUROC) of 0.83; local uncertainty scores are at most 0.57. Motivated by this trajectory-level structure, we propose Position-Weighted On-Policy Self-Distillation (PW-OPSD), which applies an increasing position weight while keeping the same student rollout, privileged teacher pass, and clipped forward-KL target as OPSD. In our comprehensive evaluations with different random seeds, the diagnostic-derived PW-OPSD improves AIME 2024 and AIME 2025 Avg@12 by +1.0 and +1.1 points, and a generalization evaluation on two larger-scale models from different families, DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think, also demonstrates consistent aggregate Avg@12 improvements. These results show that teacher-token reliability in reasoning distillation is trajectory-structured and can be utilized without additional teacher computation.
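A minimal sketch of a position-weighted per-sequence loss may help fix ideas. Several parts are assumptions rather than the paper's definitions: the sigmoid form of the schedule, clipping each per-token KL value at a hypothetical `clip_max`, and normalising by the weight sum. The knob names and default values follow the (wmin, τ, s) = (0.25, 0.30, 0.10) setting the paper reports for its diagnostic-derived default schedule. The per-token KL inputs are made-up numbers.

```python
import math


def position_weight(t, length, w_min=0.25, tau=0.30, s=0.10):
    """Increasing weight in [w_min, 1) over relative position t / (length - 1).

    Assumed form: w_min + (1 - w_min) * sigmoid((pos - tau) / s).
    """
    pos = t / (length - 1) if length > 1 else 1.0
    gate = 1.0 / (1.0 + math.exp(-(pos - tau) / s))
    return w_min + (1.0 - w_min) * gate


def pw_sequence_loss(token_kl, clip_max=5.0, **schedule):
    """Position-weighted mean of clipped per-token forward-KL values.

    token_kl: per-token KL(teacher || student) for one student rollout.
    clip_max: hypothetical clipping threshold (the paper's value is not given here).
    """
    if not token_kl:
        raise ValueError("empty sequence")
    n = len(token_kl)
    weights = [position_weight(t, n, **schedule) for t in range(n)]
    clipped = [min(kl, clip_max) for kl in token_kl]
    # Assumption: normalise by the weight sum so sequences of different
    # length stay comparable.
    return sum(w * c for w, c in zip(weights, clipped)) / sum(weights)


# Hypothetical per-token KL values for a 6-token rollout.
toy_kl = [0.9, 0.1, 7.5, 0.4, 0.2, 0.3]
toy_weights = [position_weight(t, len(toy_kl)) for t in range(len(toy_kl))]
toy_loss = pw_sequence_loss(toy_kl)
print([round(w, 3) for w in toy_weights], round(toy_loss, 4))
```

Because the weights depend only on token position, this kind of reweighting needs nothing beyond the student rollout and teacher pass that OPSD already computes, which is consistent with the abstract's claim of no additional teacher computation.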

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that teacher-token reliability in on-policy self-distillation for reasoning is trajectory-structured rather than uniform or purely entropy-driven. It introduces a branch-viability diagnostic that records next-token alternatives from a privileged teacher prompt, forces them after a student prompt plus on-policy prefix, and checks whether a student-template continuation recovers the correct answer. On Qwen3-4B this yields an oriented within-sequence position score with AUROC 0.83 (vs. at most 0.57 for local uncertainty). Motivated by the diagnostic, the authors propose Position-Weighted On-Policy Self-Distillation (PW-OPSD), which applies an increasing position weight to the clipped forward-KL objective while reusing the same student rollouts and teacher pass. Experiments report +1.0 / +1.1 Avg@12 gains on AIME 2024/2025 and consistent aggregate improvements on DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think.

Significance. If the diagnostic validly isolates teacher-token reliability, the work supplies a lightweight, zero-extra-teacher-compute improvement to on-policy distillation that exploits the natural trajectory structure of reasoning traces. The reported gains on two recent AIME sets, the use of multiple random seeds, and the generalization check across model families would constitute a practical contribution to self-distillation methods for math reasoning.

major comments (2)
  1. [Branch-viability diagnostic (abstract and §3)] The branch-viability diagnostic (described in the abstract and motivating §3) records next-token alternatives under the privileged teacher prompt but then forces each alternative after the student prompt plus on-policy prefix before testing student-template continuation. This procedure measures viability only under the student's context and policy, not under the original teacher context that generated the token; consequently the reported AUROC of 0.83 and the rationale for increasing position weights in PW-OPSD rest on a proxy that conflates teacher reliability with student-specific recovery strength.
  2. [Abstract and evaluation sections] The abstract states concrete AUROC 0.83 and Avg@12 deltas of +1.0 / +1.1 yet provides no error bars, ablation tables, or statistical significance tests for either the diagnostic predictor or the benchmark improvements. Because the central claim that position weighting is reliably superior rests on these numbers, the absence of uncertainty quantification is load-bearing for the empirical support.
minor comments (2)
  1. [§4] Notation for the position-weight schedule and the exact form of the clipped forward-KL target should be defined in a single equation block rather than scattered across text.
  2. [§3.2] The paper should clarify whether the position score is computed once on a held-out diagnostic set or re-estimated inside each training run; the current description leaves this ambiguous.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below with clarifications and proposed revisions to strengthen the manuscript.

  1. Referee: [Branch-viability diagnostic (abstract and §3)] The branch-viability diagnostic (described in the abstract and motivating §3) records next-token alternatives under the privileged teacher prompt but then forces each alternative after the student prompt plus on-policy prefix before testing student-template continuation. This procedure measures viability only under the student's context and policy, not under the original teacher context that generated the token; consequently the reported AUROC of 0.83 and the rationale for increasing position weights in PW-OPSD rest on a proxy that conflates teacher reliability with student-specific recovery strength.

    Authors: We thank the referee for this observation. The diagnostic is deliberately constructed to assess teacher-token viability under the student's on-policy prefix and context, as this is the exact setting in which the tokens are used during distillation. For on-policy self-distillation, the relevant question is whether a teacher token enables the student to recover the correct answer when continuing from its own trajectory, rather than viability under the original teacher prompt. This choice aligns with the training objective and explains why position emerges as a stronger predictor than local entropy. We acknowledge that the procedure does not isolate teacher reliability in a teacher-only context. In the revision we will expand §3 with an explicit discussion of this design rationale, including the distinction between teacher-context and student-context viability, to prevent misinterpretation.

    Revision: partial

  2. Referee: [Abstract and evaluation sections] The abstract states concrete AUROC 0.83 and Avg@12 deltas of +1.0 / +1.1 yet provides no error bars, ablation tables, or statistical significance tests for either the diagnostic predictor or the benchmark improvements. Because the central claim that position weighting is reliably superior rests on these numbers, the absence of uncertainty quantification is load-bearing for the empirical support.

    Authors: We agree that uncertainty quantification is important for supporting the central empirical claims. Although the manuscript states that results were obtained across multiple random seeds, we did not report per-seed variability or formal tests in the submitted version. In the revised manuscript we will add standard-deviation error bars to the AUROC and Avg@12 figures in both the abstract and the evaluation sections. We will also include a supplementary table with per-seed results and report statistical significance (e.g., paired tests across seeds) for the reported gains. These additions will be made without altering the experimental protocol.

    Revision: yes
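As an illustration of the paired test across seeds mentioned in this exchange, the sketch below computes a paired t statistic from matched per-seed scores. The function name `paired_t_statistic` is hypothetical and the score values are placeholders, not results from the paper.

```python
import math
import statistics


def paired_t_statistic(baseline, treatment):
    """Paired t statistic and degrees of freedom for matched per-seed scores."""
    if len(baseline) != len(treatment) or len(baseline) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [b - a for a, b in zip(baseline, treatment)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("zero variance in differences; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return mean_diff, t_stat, len(diffs) - 1


# Placeholder Avg@12 values for three seeds; NOT results from the paper.
toy_opsd = [50.0, 51.0, 49.0]
toy_pw_opsd = [51.5, 51.5, 50.5]

toy_mean_diff, toy_t, toy_df = paired_t_statistic(toy_opsd, toy_pw_opsd)
print(toy_mean_diff, round(toy_t, 3), toy_df)
```

With only three seeds there are 2 degrees of freedom, so the two-sided 5% critical value is about 4.30. A gain of roughly one point would need very low seed-to-seed variance to clear that bar, which is why per-seed tables matter as much as the test itself.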

Circularity Check

0 steps flagged

No significant circularity; derivation uses external diagnostic

The paper's chain runs as follows: (1) define branch-viability diagnostic on held-out rollouts (record teacher next-token alternatives, force after student prompt + on-policy prefix, test student-template recovery); (2) compute AUROC of several scores including 'oriented within-sequence position score' against the diagnostic labels (0.83 for position vs. at most 0.57 for local uncertainty scores); (3) motivate increasing position weights in PW-OPSD from the observed trajectory structure; (4) evaluate the resulting objective on AIME. None of these steps reduces to its inputs by construction. The diagnostic is an independent computation performed outside the training loop; the position weight is a fixed schedule chosen after the diagnostic rather than a fitted parameter renamed as a prediction; no self-citations or uniqueness theorems are invoked as load-bearing premises. The result is therefore an empirical correlation plus a motivated reweighting, not a self-referential identity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on an empirical diagnostic whose validity is assumed rather than proven; the position-weight schedule is a design choice motivated by that diagnostic.

free parameters (1)
  • position weighting schedule
    Increasing position weight is chosen after observing the diagnostic results; no explicit functional form or fitting procedure is stated in the abstract.
axioms (1)
  • domain assumption: The branch-viability diagnostic correctly identifies reliable teacher tokens for the student's on-policy learning signal.
    This premise is invoked when the authors translate the AUROC finding into the PW-OPSD weighting rule.

pith-pipeline@v0.9.0 · 5885 in / 1393 out tokens · 57293 ms · 2026-05-22T09:22:24.815089+00:00 · methodology
