When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning
Pith reviewed 2026-05-22 09:22 UTC · model grok-4.3
The pith
Teacher tokens in reasoning distillation are more reliable later in the sequence, and weighting them by position improves student performance without extra teacher computation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Teacher-token reliability in reasoning distillation is trajectory-structured and can be utilized without additional teacher computation. An oriented within-sequence position score reaches an AUROC of 0.83 for predicting whether a teacher token leads to the correct answer, while local uncertainty scores reach at most 0.57. Position-Weighted On-Policy Self-Distillation applies an increasing position weight to the same clipped forward-KL target, improving AIME 2024 and AIME 2025 Avg@12 by +1.0 and +1.1 points respectively, with consistent aggregate gains on larger models from different families.
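To make the Avg@12 figures concrete, the sketch below computes Avg@N as per-problem accuracy over N sampled completions, averaged over problems, and contrasts it with Pass@N. The function names and the toy correctness matrix are illustrative assumptions, not the paper's evaluation harness.

```python
from statistics import mean

def avg_at_n(correct):
    """Avg@N in percent: mean accuracy over the N samples, averaged over problems.

    correct[i][j] is True when sample j for problem i reached the right answer.
    """
    return 100.0 * mean(mean(1.0 if c else 0.0 for c in row) for row in correct)

def pass_at_n(correct):
    """Pass@N in percent: share of problems with at least one correct sample."""
    return 100.0 * mean(1.0 if any(row) else 0.0 for row in correct)

# Toy data (hypothetical): 3 problems, 4 samples each.
toy = [
    [True, True, False, True],
    [False, False, False, False],
    [False, True, False, False],
]
print(round(avg_at_n(toy), 2), round(pass_at_n(toy), 2))
```

Avg@N rewards consistency across samples, so a +1.0 point change on a 30-problem AIME set corresponds to a small number of additional correct completions out of 360.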
What carries the argument
The branch-viability diagnostic. It records next-token alternatives from the privileged teacher, forces each alternative after the student prompt plus the on-policy prefix, and checks whether the resulting continuation recovers the correct answer, which exposes position-dependent reliability.
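The labelling loop behind the diagnostic can be sketched as follows. This is a minimal sketch under stated assumptions: `continue_and_check` is a hypothetical callback standing in for the model call that continues generation under the student template and grades the final answer, and the record layout is invented for illustration.

```python
from typing import Callable, Dict, List, Sequence

def branch_viability_labels(
    student_prompt: str,
    spine: Sequence[str],
    alternatives: Sequence[Sequence[str]],
    continue_and_check: Callable[[str, List[str]], bool],
) -> List[Dict[str, object]]:
    """Label each teacher alternative by whether its forced continuation succeeds.

    spine: the student's on-policy tokens.
    alternatives[t]: teacher next-token candidates recorded at position t.
    continue_and_check: hypothetical callback; returns True when continuing
        from (prompt, forced prefix) ends in the correct answer.
    """
    last = max(len(spine) - 1, 1)
    records: List[Dict[str, object]] = []
    for t, candidates in enumerate(alternatives):
        prefix = list(spine[:t])  # on-policy prefix before position t
        for token in candidates:
            viable = continue_and_check(student_prompt, prefix + [token])
            records.append(
                {"position": t, "rel_position": t / last, "token": token, "viable": viable}
            )
    return records

# Toy stand-in for the model call (assumption): a branch fails only if it contains "bad".
def toy_check(prompt: str, forced: List[str]) -> bool:
    return "bad" not in forced

demo = branch_viability_labels(
    "Problem: ...", ["a", "b", "c"], [["a", "x"], ["b"], ["c", "bad"]], toy_check
)
for row in demo:
    print(row)
```

Each record pairs a relative position with a binary viability label, which is the form needed to score candidate predictors such as position or entropy.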
If this is right
- An oriented within-sequence position score predicts teacher-token reliability with AUROC 0.83, outperforming uncertainty-based scores.
- PW-OPSD improves AIME 2024 and 2025 performance by +1.0 and +1.1 Avg@12 points using only the existing student rollout and teacher pass.
- The same position-weighted approach yields consistent aggregate improvements on larger models including DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think.
- Teacher tokens early in reasoning trajectories provide weaker learning signals than those appearing later.
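A position weight of this kind could look like the sketch below. The sigmoid form and the division by sequence length are assumptions made for illustration; this page does not state the schedule's functional form. The defaults reuse the (w_min, τ, s) = (0.25, 0.30, 0.10) triple quoted in the appendix excerpt further down the page.

```python
import math

def position_weight(u, w_min=0.25, tau=0.30, s=0.10):
    """Increasing weight for relative position u in [0, 1].

    Assumed form: a sigmoid rising from w_min toward 1, centred at tau
    with slope scale s. The paper's actual schedule may differ.
    """
    return w_min + (1.0 - w_min) / (1.0 + math.exp(-(u - tau) / s))

def weighted_sequence_loss(token_losses, **schedule):
    """Per-sequence mean of position-weighted token losses (assumed reduction)."""
    n = len(token_losses)
    if n == 0:
        return 0.0
    total = 0.0
    for t, loss in enumerate(token_losses):
        u = t / (n - 1) if n > 1 else 1.0  # relative position in the rollout
        total += position_weight(u, **schedule) * loss
    return total / n

for u in (0.0, 0.3, 0.6, 1.0):
    print(u, round(position_weight(u), 3))
print(round(weighted_sequence_loss([1.0] * 10), 3))
```

Under this assumed form, early tokens keep a floor weight of roughly `w_min` instead of being dropped, so the early signal is down-weighted rather than discarded.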
Where Pith is reading between the lines
- The same position-weighting idea could be tested in code generation or other long-horizon sequential tasks where early choices are more exploratory.
- Running the diagnostic on non-mathematical reasoning problems would show whether the trajectory structure is specific to math or more general.
- Combining the position signal with other cheap diagnostics might produce additional gains without increasing teacher cost.
Load-bearing premise
The branch-viability diagnostic accurately measures the reliability of the original teacher token for the student's learning signal.
What would settle it
If applying the same diagnostic to a fresh set of problems shows that position no longer predicts whether forced alternatives recover the correct answer, or if position-weighted training produces no improvement on AIME benchmarks, the central claim would be falsified.
Original abstract
On-policy self-distillation (OPSD) trains a student on its own rollouts using a privileged teacher, but its standard objective weights all generated tokens equally, implicitly treating the privileged teacher target as equally reliable at every student-visited prefix. Existing entropy-based OPD methods relax this uniformity by modulating token-level supervision with teacher entropy, but high teacher entropy in reasoning has an ambiguous reliability meaning: it can reflect either non-viable uncertainty or benign solution diversity. To identify this phenomenon, we introduce a branch-viability diagnostic. Specifically, we record next-token alternatives from the privileged-answer teacher prompt, force each alternative after the student prompt plus its on-policy spine prefix, and test whether the resulting student-template continuation recovers the correct answer. On Qwen3-4B, we find that an oriented within-sequence position score is the strongest tested predictor of teacher-token reliability, reaching an area-under-ROC-curve (AUROC) of 0.83; local uncertainty scores are at most 0.57. Motivated by this trajectory-level structure, we propose Position-Weighted On-Policy Self-Distillation (PW-OPSD), which applies an increasing position weight while keeping the same student rollout, privileged teacher pass, and clipped forward-KL target as OPSD. In our comprehensive evaluations with different random seeds, the diagnostic-derived PW-OPSD improves AIME 2024 and AIME 2025 Avg@12 by +1.0 and +1.1 points, and a generalization evaluation on two larger-scale models from different families, DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think, also demonstrates consistent aggregate Avg@12 improvements. These results show that teacher-token reliability in reasoning distillation is trajectory-structured and can be utilized without additional teacher computation.
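For readers unfamiliar with the per-token target, the sketch below computes a forward KL divergence from teacher to student over a toy vocabulary and caps it at a threshold. Capping the per-token value is an assumed reading of "clipped"; the abstract does not define the clipping rule, and the `clip` value and logits here are made up.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def forward_kl(teacher_logprobs, student_logprobs):
    """KL(teacher || student) over a shared vocabulary."""
    return sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(teacher_logprobs, student_logprobs)
    )

def clipped_forward_kl(teacher_logprobs, student_logprobs, clip=1.0):
    """Forward KL capped at `clip` (assumed clipping rule, for illustration)."""
    return min(forward_kl(teacher_logprobs, student_logprobs), clip)

# Toy logits (hypothetical): teacher and student disagree on the top token.
teacher = log_softmax([2.0, 0.0, -1.0])
student = log_softmax([0.0, 2.0, -1.0])
print(round(forward_kl(teacher, student), 4))
print(clipped_forward_kl(teacher, student, clip=0.5))
```

Forward KL weights the log-ratio by teacher probability, so it penalizes the student most where the teacher is confident; the cap bounds the contribution of any single token.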
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that teacher-token reliability in on-policy self-distillation for reasoning is trajectory-structured rather than uniform or purely entropy-driven. It introduces a branch-viability diagnostic that records next-token alternatives from a privileged teacher prompt, forces them after a student prompt plus on-policy prefix, and checks whether a student-template continuation recovers the correct answer. On Qwen3-4B this yields an oriented within-sequence position score with AUROC 0.83 (vs. at most 0.57 for local uncertainty). Motivated by the diagnostic, the authors propose Position-Weighted On-Policy Self-Distillation (PW-OPSD), which applies an increasing position weight to the clipped forward-KL objective while reusing the same student rollouts and teacher pass. Experiments report +1.0 / +1.1 Avg@12 gains on AIME 2024/2025 and consistent aggregate improvements on DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think.
Significance. If the diagnostic validly isolates teacher-token reliability, the work supplies a lightweight, zero-extra-teacher-compute improvement to on-policy distillation that exploits the natural trajectory structure of reasoning traces. The reported gains on two recent AIME sets, the use of multiple random seeds, and the generalization check across model families would constitute a practical contribution to self-distillation methods for math reasoning.
major comments (2)
- [Branch-viability diagnostic (abstract and §3)] The branch-viability diagnostic (described in the abstract and motivating §3) records next-token alternatives under the privileged teacher prompt but then forces each alternative after the student prompt plus on-policy prefix before testing student-template continuation. This procedure measures viability only under the student's context and policy, not under the original teacher context that generated the token; consequently the reported AUROC of 0.83 and the rationale for increasing position weights in PW-OPSD rest on a proxy that conflates teacher reliability with student-specific recovery strength.
- [Abstract and evaluation sections] The abstract states concrete AUROC 0.83 and Avg@12 deltas of +1.0 / +1.1 yet provides no error bars, ablation tables, or statistical significance tests for either the diagnostic predictor or the benchmark improvements. Because the central claim that position weighting is reliably superior rests on these numbers, the absence of uncertainty quantification is load-bearing for the empirical support.
minor comments (2)
- [§4] Notation for the position-weight schedule and the exact form of the clipped forward-KL target should be defined in a single equation block rather than scattered across text.
- [§3.2] The paper should clarify whether the position score is computed once on a held-out diagnostic set or re-estimated inside each training run; the current description leaves this ambiguous.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below with clarifications and proposed revisions to strengthen the manuscript.
-
Referee: [Branch-viability diagnostic (abstract and §3)] The branch-viability diagnostic (described in the abstract and motivating §3) records next-token alternatives under the privileged teacher prompt but then forces each alternative after the student prompt plus on-policy prefix before testing student-template continuation. This procedure measures viability only under the student's context and policy, not under the original teacher context that generated the token; consequently the reported AUROC of 0.83 and the rationale for increasing position weights in PW-OPSD rest on a proxy that conflates teacher reliability with student-specific recovery strength.
Authors: We thank the referee for this observation. The diagnostic is deliberately constructed to assess teacher-token viability under the student's on-policy prefix and context, as this is the exact setting in which the tokens are used during distillation. For on-policy self-distillation, the relevant question is whether a teacher token enables the student to recover the correct answer when continuing from its own trajectory, rather than viability under the original teacher prompt. This choice aligns with the training objective and explains why position emerges as a stronger predictor than local entropy. We acknowledge that the procedure does not isolate teacher reliability in a teacher-only context. In the revision we will expand §3 with an explicit discussion of this design rationale, including the distinction between teacher-context and student-context viability, to prevent misinterpretation.
Revision: partial
-
Referee: [Abstract and evaluation sections] The abstract states concrete AUROC 0.83 and Avg@12 deltas of +1.0 / +1.1 yet provides no error bars, ablation tables, or statistical significance tests for either the diagnostic predictor or the benchmark improvements. Because the central claim that position weighting is reliably superior rests on these numbers, the absence of uncertainty quantification is load-bearing for the empirical support.
Authors: We agree that uncertainty quantification is important for supporting the central empirical claims. Although the manuscript states that results were obtained across multiple random seeds, we did not report per-seed variability or formal tests in the submitted version. In the revised manuscript we will add standard-deviation error bars to the AUROC and Avg@12 figures in both the abstract and the evaluation sections. We will also include a supplementary table with per-seed results and report statistical significance (e.g., paired tests across seeds) for the reported gains. These additions will be made without altering the experimental protocol.
Revision: yes
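The promised uncertainty reporting could take roughly the following shape: a mean with sample standard deviation per method, plus a paired t-statistic over seed-matched differences. The per-seed numbers below are made-up placeholders, not results from the paper, and the helper names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(scores):
    """Return (mean, sample standard deviation) across seeds."""
    return mean(scores), stdev(scores)

def paired_t_statistic(baseline, treatment):
    """t-statistic for seed-matched differences (treatment minus baseline)."""
    if len(baseline) != len(treatment) or len(baseline) < 2:
        raise ValueError("need at least two seed-matched pairs")
    diffs = [b - a for a, b in zip(baseline, treatment)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("zero variance in differences; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs)))

# Placeholder Avg@12 values for three evaluation seeds (hypothetical).
opsd = [50.0, 51.0, 49.0]
pw_opsd = [51.5, 51.5, 50.0]

print(summarize(opsd), summarize(pw_opsd))
print(round(paired_t_statistic(opsd, pw_opsd), 4))
```

With only three seeds the test has two degrees of freedom, so per-seed tables are likely to be more informative than a single p-value.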
Circularity Check
No significant circularity; derivation uses external diagnostic
The paper's chain runs as follows: (1) define branch-viability diagnostic on held-out rollouts (record teacher next-token alternatives, force after student prompt + on-policy prefix, test student-template recovery); (2) compute AUROC of several scores including 'oriented within-sequence position score' against the diagnostic labels (0.83 for position vs. 0.57 for entropy); (3) motivate increasing position weights in PW-OPSD from the observed trajectory structure; (4) evaluate the resulting objective on AIME. None of these steps reduces to its inputs by construction. The diagnostic is an independent computation performed outside the training loop; the position weight is a fixed schedule chosen after the diagnostic rather than a fitted parameter renamed as a prediction; no self-citations or uniqueness theorems are invoked as load-bearing premises. The result is therefore an empirical correlation plus a motivated reweighting, not a self-referential identity.
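Step (2) of this chain amounts to scoring a predictor against binary diagnostic labels. The sketch below computes AUROC by pairwise comparison (the Mann-Whitney formulation) on invented data. Reading "oriented" as "signed so that larger scores should indicate a reliable token" is an assumption.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Invented example: relative positions and viability labels for six candidates.
rel_position = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
viable = [0, 0, 1, 0, 1, 1]
print(round(auroc(rel_position, viable), 4))
```

An AUROC of 0.5 corresponds to chance, which puts the reported gap between 0.83 for position and at most 0.57 for local uncertainty scores in context.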
Axiom & Free-Parameter Ledger
free parameters (1)
- position weighting schedule
axioms (1)
- domain assumption: The branch-viability diagnostic correctly identifies reliable teacher tokens for the student's on-policy learning signal.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
an oriented within-sequence position score is the strongest tested predictor of teacher-token reliability, reaching an area-under-ROC-curve (AUROC) of 0.83
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
PW-OPSD ... applies an increasing position weight while keeping the same student rollout, privileged teacher pass, and clipped forward-KL target
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[2]
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston
URL https://arxiv.org/abs/2306.13649. Accepted at ICLR
-
[3]
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Hugging Face model card. Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. ArXiv, abs/2407.21787,
-
[4]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, G. Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mo Bavarian, Clemens Winter, P. Till...
-
[5]
Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs
Hugging Face model card. Jasper Dekoninck, Nikola Jovanović, Tim Gehrunger, Kári Rögnvalddson, Ivo Petrov, Chenhao Sun, and Martin Vechev. Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs. arXiv preprint arXiv:2605.00674,
-
[7]
Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
URL https://arxiv.org/abs/1506.02142. 12 pages, 6 figures; fixed a mistake with standard error and added a new table with updated results (marked "Update [October 2016]"); Published in ICML
-
[9]
MiniLLM: On-Policy Distillation of Large Language Models
URL https://arxiv.org/abs/2306.08543. Published as a conference paper in ICLR
-
[10]
The False Promise of Imitating Proprietary LLMs
Arnav Gudibande, Eric Wallace, Charles Burton Snell, Xinyang Geng, Hao Liu, P. Abbeel, S. Levine, and Dawn Song. The false promise of imitating proprietary llms. ArXiv, abs/2305.15717,
-
[11]
OpenThoughts: Data Recipes for Reasoning Models
E. Guha, Ryan Marten, Sedrick Scott Keh, Negin Raoof, G. Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean-Pierre Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Ben Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, K. Sharma, Charlie Cheng-Jie Ji, Yi...
-
[12]
Measuring Mathematical Problem Solving With the MATH Dataset
ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx.doi.org/10.1038/s41586-025-09422-z. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. ArXiv, abs/2103.03874,
-
[13]
Distilling the Knowledge in a Neural Network
Geoffrey E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. ArXiv, abs/1503.02531,
-
[14]
LoRA: Low-Rank Adaptation of Large Language Models
URL https://arxiv.org/abs/2106.09685. HuggingFaceH4. MATH-500. https://huggingface.co/datasets/HuggingFaceH4/MATH-500,
-
[15]
https://huggingface.co/datasets/HuggingFaceH4/aime_2024,
-
[17]
Entropy-aware on-policy distillation of language models
URL https://arxiv.org/abs/2603.07079. 16 pages, 11 figures, preprint. Seongryong Jung, Suwan Yoon, DongGeon Kim, and Hwanhee Lee. ToDi: Token-wise Distillation via Fine-Grained Divergence Control. arXiv preprint arXiv:2505.16297,
-
[18]
URL https://arxiv.org/abs/2505.16297. EMNLP 2025 (Oral). Alex Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? ArXiv, abs/1703.04977,
-
[19]
Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. ArXiv, abs/1606.07947,
-
[20]
Distillm: Towards streamlined distillation for large language models
Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and SeYoung Yun. Distillm: Towards streamlined distillation for large language models. ArXiv, abs/2402.03898,
-
[22]
Efficient Memory Management for Large Language Model Serving with PagedAttention
URL https://arxiv.org/abs/2603.11137. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv preprint arXiv:2309.06180,
-
[23]
URL https://arxiv.org/abs/2309.06180. SOSP
-
[24]
URL https://arxiv.org/abs/2305.20050. Andrey Malinin and Mark Gales. Predictive Uncertainty Estimation via Prior Networks. arXiv preprint arXiv:1802.10501,
-
[25]
Predictive Uncertainty Estimation via Prior Networks
URL https://arxiv.org/abs/1802.10501. Aaron Meurer, Christopher P Smith, Mateusz Paprocki, Ondřej Čertík, Sergey B Kirpichev, Matthew Rocklin, AMiT Kumar, Sergiu Ivanov, Jason K Moore, Sartaj Singh, et al. Sympy: symbolic computing in python. PeerJ Computer Science, 3:e103,
-
[26]
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, S. Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, M. Simens, Amanda Askell, Peter Welinder, P. Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions with human feedback. ArXiv,...
-
[27]
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, N. Gimelshein, L. Antiga, Alban Desmaison, Andreas Köpf, E. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep...
-
[29]
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
URL https://arxiv.org/abs/1011.0686. Appearing in the 14th International Conference on Artificial Intelligence and Statistics (AISTATS 2011). M. Sensoy, M. Kandemir, and Lance M. Kaplan. Evidential deep learning to quantify classification uncertainty. ArXiv, abs/1806.01768,
-
[31]
A Survey of On-Policy Distillation for Large Language Models
URL https://arxiv.org/abs/2604.00626. Alex Stein, Furong Huang, and Tom Goldstein. Gates: Self-distillation under privileged context with consensus gating. ArXiv, abs/2602.20574,
-
[32]
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason Wei, D. Schuurmans, Quoc Le, Ed H. Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. ArXiv, abs/2203.11171,
-
[33]
f-divergence minimization for sequence-level knowledge distillation
Yuqiao Wen, Zichao Li, Wenyu Du, and Lili Mou. f-divergence minimization for sequence-level knowledge distillation. ArXiv, abs/2307.15190,
-
[34]
Transformers: State-of-the-art natural language processing
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art na...
-
[35]
Transformers: State-of-the-Art Natural Language Processing
Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. URL https://aclanthology.org/2020.emnlp-demos.6/. Taiqiang Wu, Chaofan Tao, Jiahao Wang, Runming Yang, Zhe Zhao, and Ngai Wong. Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models. arXiv preprint arXiv:2404.02657,
- [36]
-
[38]
URL https://arxiv.org/abs/2505.09388. yentinglin. AIME
-
[39]
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
Hugging Face dataset. Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. ArXiv, abs/2601.18734,
-
[40]
A Branch-viability protocol details. Models, prompts, and software. The diagnostic uses one Qwen/Qwen3-4B checkpoint [Qwen Team, 2025] (HuggingFace snapshot 1cfa9a72...3b3df60c, dtype bfloat16) with no adapters loaded. “Teacher” and “student” denote two prompt templates applied to this single checkpoint: the teacher template includes the privileged ground-...
-
[41]
Problem attrition. Phase A samples 24 MATH-500, 30 AIME 2024, and 30 AIME 2025 problems (84 total)
for generation [Kwon et al., 2023], transformers 4.57.1 for the HuggingFace (HF) teacher forward passes [Wolf et al., 2020], and torch 2.8.0+cu128 on 4×H100 80GB GPUs [Paszke et al., 2019]. Problem attrition. Phase A samples 24 MATH-500, 30 AIME 2024, and 30 AIME 2025 problems (84 total). Phase B produces 23 + 21 + 18 = 62 correct-spine problems. After Phase ...
-
[42]
student greedy decode of 84 problems in total (24 from MATH-500, 30 from AIME 2024, and 30 from AIME 2025; the protocol is run independently per dataset and the labeled candidates are pooled in Phase G), with fallback to T=0.7 for problems where greedy does not reach \boxed within 16K tokens. • Phase B. HF forward of the same checkpoint under the teacher temp...
-
[43]
global token mean
EOPD [Jin et al., 2026]: entropy-gated RKL/FKL mixture; teacher-entropy gate; global token mean
PW-OPSD (ours): clipped FKL top opsd i,t; position w_{i,t}; per-sequence mean
G PW-OPSD training pseudocode. Algorithm 1 gives one training step of PW-OPSD. The procedure computes student and teacher log-probabilities at the distillation temperature, fo...
-
[44]
Methods are compared on the same gap
OpenThoughts-Math-30k [siyanzhao, 2026] prompts vary in length (median 93 tokens, max 826), so right-padding produces a per-batch prompt-PAD gap. Methods are compared on the same gap. Train/eval prompt template gap. Training prompts use the OPSD reference student template Problem: {problem}\n\n Please reason... with enable_thinking=False; evaluation promp...
-
[45]
The local launcher uses 4×H100 GPUs with effective batch size 32 (per-device batch 4 with gradient accumulation 2). All methods are evaluated at the 100-step checkpoint, following the OPSD evaluation horizon of Zhao et al. [2026]. PW-OPSD uses (w_min, τ, s) = (0.25, 0.30, 0.10) for the diagnostic-derived default schedule. Baselines. We compare PW-OPSD against...
-
[46]
For each method–benchmark pair we run three random evaluation seeds (main, 1,
The HMMT set is a locally cleaned parquet derived from MathArena’s hmmt_feb_2025 release [Dekoninck et al., 2026], with SHA-256 recorded in the appendix. For each method–benchmark pair we run three random evaluation seeds (main, 1,
-
[47]
extends this protocol to DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think under the same four benchmarks. J Evaluation metric definitions. Multi-sample evaluation for reasoning. Reasoning models are commonly evaluated with repeated sampling because a single completion can understate the chance that the model finds a correct solution. Pass@N measures whether...
-
[48]
per-sequence) by positioning (none vs
Table 6: 2×2 ablation on Qwen3-4B AIME 2024: reduction (uniform vs. per-sequence) by positioning (none vs. position-weighted). Avg@12 mean ± sample standard deviation across three evaluation seeds. Bold marks the column maximum. The diagonal (OPSD and PW-OPSD Moderate) reports the same evaluation runs as the corresponding rows of Table 2; small difference...
-
[49]
The two axes are complementary on AIME 2024 rather than independently sufficient
matches the AIME 2024 lead of Table 2; switching either axis alone underperforms by ∼1.5 pp. The two axes are complementary on AIME 2024 rather than independently sufficient.