pith. machine review for the scientific record.

arxiv: 2605.09725 · v2 · submitted 2026-05-10 · 💻 cs.CV

Recognition: unknown

On-Policy Distillation with Best-of-N Teacher Rollout Selection

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords on-policy distillation · best-of-N rollout · teacher selection · reasoning models · math benchmarks · supervision signal · AIME · AMC

The pith

BRTS selects the best teacher rollout from multiple samples to reduce noise in on-policy distillation for reasoning models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard on-policy distillation suffers from high-variance teacher signals because it relies on single stochastic rollouts that can be incorrect or poorly matched to the student. BRTS samples a small pool of teacher trajectories and applies a priority rule that first requires correctness, then prefers the one most aligned with the student's current behavior. For harder prompts where unconditioned samples fail, it adds a ground-truth-conditioned recovery step to elicit a valid derivation. The selected trajectory supplies an auxiliary teacher-context supervision branch inside the OPD training loop. Experiments on AIME 2024, AIME 2025, and AMC 2023 demonstrate gains over standard OPD, with the largest improvements on the harder benchmarks.

Core claim

Augmenting on-policy distillation with best-of-N teacher rollout selection, ordered by correctness then student alignment and supplemented by a ground-truth recovery step, produces more reliable supervision signals and measurable performance gains on challenging math reasoning benchmarks.

What carries the argument

The BRTS priority rule that ranks teacher trajectories first by correctness and second by alignment with the student's sampled behavior, together with the ground-truth-conditioned recovery mechanism that elicits natural derivations when unconditioned sampling fails.
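
To make the rule concrete, here is a minimal sketch of the selection step in Python. This is an editorial illustration, not the authors' implementation (which lives at the GitHub link in the abstract): the `teacher.generate` interface, the `is_correct` verifier, the `alignment` score, and the pool size are all assumed names.

```python
def brts_select(teacher, prompt, ground_truth, student_trace,
                n_rollouts=8, is_correct=None, alignment=None):
    """Sketch of the BRTS priority rule: correctness first, student
    alignment second, ground-truth recovery as fallback. Every
    callable here is an illustrative assumption."""
    # Sample a small pool of unconditioned teacher rollouts.
    pool = [teacher.generate(prompt) for _ in range(n_rollouts)]

    # Priority 1: keep only rollouts whose final answer is correct.
    correct = [r for r in pool if is_correct(r, ground_truth)]

    # Priority 2: among correct rollouts, prefer the one most aligned
    # with the student's current sampled behavior.
    if correct:
        return max(correct, key=lambda r: alignment(r, student_trace))

    # Recovery: on harder prompts where every unconditioned sample
    # fails, condition the teacher on the ground truth to elicit a
    # valid derivation, keeping it only if it verifies.
    recovered = teacher.generate(prompt, hint=ground_truth)
    return recovered if is_correct(recovered, ground_truth) else None
```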

If this is right

  • The auxiliary loss on the curated teacher trajectory supplies an additional stable training signal inside the OPD loop (one plausible form is sketched after this list).
  • Gains are largest on harder datasets, indicating the method helps most when single-rollout supervision is most unreliable.
  • The teacher-context branch operates alongside the standard student-context branch without altering the core OPD objective.
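
As flagged in the first bullet, neither the pith nor the abstract spells out how the two branches combine, so the following is only a plausible reading: the usual student-context OPD term (a reverse-KL-style loss on student rollouts, as in the on-policy distillation literature) plus a weighted likelihood term on the BRTS-curated trajectory. The `student`/`teacher` interfaces and the weight are assumptions.

```python
import torch.nn.functional as F

def opd_loss_with_auxiliary(student, teacher, prompt, curated_traj,
                            aux_weight=0.5):
    """One plausible combined objective; every interface and the
    weighting are assumptions, not the paper's stated formulation."""
    # Student-context branch: the student samples its own trajectory,
    # and we minimize reverse KL(student || teacher) over it.
    # log_dist is assumed to return per-position log-distributions
    # over the vocabulary, shape [T, V].
    traj = student.sample(prompt)
    kl = F.kl_div(teacher.log_dist(prompt, traj),   # input: log p_teacher
                  student.log_dist(prompt, traj),   # target: log q_student
                  log_target=True, reduction="batchmean")

    # Teacher-context branch: auxiliary negative log-likelihood of the
    # BRTS-curated trajectory under the student (form is a guess).
    aux = -student.seq_log_prob(prompt, curated_traj)

    return kl + aux_weight * aux
```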

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection logic could be applied to other on-policy methods that suffer from rollout variance, such as certain preference optimization loops.
  • If correctness can be verified automatically in non-math domains, the framework might extend beyond competition problems without requiring new human labels.
  • Over repeated distillation rounds the curated trajectories could gradually reduce distribution shift between teacher and student.

Load-bearing premise

The correctness-first priority rule plus the ground-truth recovery step will reliably produce higher-quality supervision trajectories without introducing selection bias or overfitting to the chosen paths.

What would settle it

If replacing the priority-based selection with random rollout choice yields no performance difference on AIME or AMC benchmarks, the claim that the curation rule drives the gains would be falsified.
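
That experiment is cheap to state in code: hold the rollout pool fixed and swap the priority rule for a uniform draw. A hypothetical toggle, reusing the placeholder names from the sketch above:

```python
import random

def select_rollout(pool, ground_truth, student_trace, use_priority=True,
                   is_correct=None, alignment=None):
    """Ablation switch: priority-based selection vs. the random choice
    that would falsify the curation claim if results matched."""
    if not use_priority:
        return random.choice(pool)   # the falsifying baseline
    correct = [r for r in pool if is_correct(r, ground_truth)]
    if correct:
        return max(correct, key=lambda r: alignment(r, student_trace))
    return random.choice(pool)       # no correct rollout in the pool
```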

Figures

Figures reproduced from arXiv: 2605.09725 by Di Fu, DongDi Zhao, Ke Zhang, Vishal M Patel, Yijiang Li, Yuanye Liu, Yunjie Tian.

Figure 1. Conceptual comparison of OPD and BRTS. (a) Classical OPD may propagate unreliable …
Figure 2. Overview of BRTS. The left panel shows teacher and student trajectories in the trajectory …
Figure 3. Majority-vote accuracy across training steps. The student-only baseline uses two student …
Figure 4. Majority accuracy on AIME25 and AIME24 across training steps, comparing Tier-1 …
Figure 5. Tier-1 accuracy as a function of total rollouts per prompt. Panel (a) compares observed …
Figure 6. Top: Accuracy across training steps. Adding more Tier-1 candidates improves unconditioned teacher coverage (a), while Tier-2 further increases the catch rate by recovering prompts missed by Tier-1 (b, c). Bottom: Tier composition example, where Tier-1 is unconditioned teacher success, Tier-2 is ground-truth-guided recovery, and grey denotes fallback cases.
Original abstract

On-policy distillation (OPD), which supervises a student on its own sampled trajectories, has emerged as a data-efficient post-training method for improving reasoning while avoiding the reward dependence of reinforcement learning and the catastrophic forgetting often observed in standard supervised fine-tuning. However, standard OPD typically computes teacher supervision under noisy student-generated contexts and often relies on a single stochastic teacher rollout per prompt. As a result, the supervision signal can be high-variance: the sampled teacher trajectory can be incorrect, uninformative, or poorly matched to the student's current reasoning behavior. To address this limitation, we propose BRTS, a Best-of-N Rollout Teacher Selection framework for on-policy distillation. BRTS augments standard student-context OPD with a teacher-context supervision branch constructed from the curated teacher trajectory. Rather than distilling from the first sampled teacher rollout, BRTS samples a small pool of teacher trajectories and selects the auxiliary trajectory using a simple priority rule: correctness first, student alignment second. When multiple correct teacher trajectories are available, BRTS chooses the one most aligned with the student's current behavior; when unconditioned teacher samples fail on harder prompts, it invokes a ground-truth-conditioned recovery step to elicit a natural derivation. The selected trajectory is then used to provide reliable teacher-context supervision inside the OPD loop, augmented with an auxiliary loss on the teacher trajectory. Experiments on AIME 2024, AIME 2025, and AMC 2023 show that BRTS improves over standard OPD on challenging reasoning benchmarks, with the largest gains on harder datasets. Our code is available at https://github.com/BWGZK-keke/BRTS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes BRTS (Best-of-N Rollout Teacher Selection), an augmentation to on-policy distillation (OPD) for reasoning models. Standard OPD uses noisy student-context teacher rollouts that can be incorrect or poorly aligned; BRTS instead samples N teacher trajectories, selects via a priority rule (correctness first, then student alignment), and falls back to a ground-truth-conditioned recovery step on hard prompts to supply a correct derivation. The selected trajectory supplies an auxiliary teacher-context supervision signal inside the OPD loop. Experiments on AIME 2024, AIME 2025, and AMC 2023 report that BRTS outperforms standard OPD, with the largest gains on the hardest datasets.

Significance. If the gains are shown to be robust and attributable to the selection mechanism rather than the recovery step, BRTS would provide a simple, data-efficient way to improve supervision quality in on-policy distillation without introducing reward models or full SFT. The open-sourced code at the provided GitHub link supports reproducibility and is a clear strength.

major comments (3)
  1. [Experiments] Experiments section: no error bars, standard deviations across seeds, or statistical significance tests are reported for the benchmark gains on AIME 2024/2025 and AMC 2023. This makes it impossible to determine whether the observed improvements exceed run-to-run variance.
  2. [Experiments] Experiments section: no ablation is presented that isolates the best-of-N selection rule (correctness + alignment) from the ground-truth-conditioned recovery step. Because the recovery step supplies correct derivations that standard OPD never receives, it is unclear whether the reported gains are driven by the priority rule or by the recovery mechanism itself.
  3. [Method] Method section: the precise definition and computation of 'student alignment' (used as the secondary selection criterion) is not specified with sufficient detail to allow replication or to assess whether it introduces selection bias toward the student's current (possibly flawed) behavior.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'natural derivation' in the recovery step description is vague; a brief clarification of what conditioning is applied would improve readability.
  2. [Experiments] The manuscript would benefit from an explicit statement of the value of N used in the reported experiments and any sensitivity analysis around this hyper-parameter.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving experimental rigor and methodological clarity, and we have revised the paper accordingly to address them.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: no error bars, standard deviations across seeds, or statistical significance tests are reported for the benchmark gains on AIME 2024/2025 and AMC 2023. This makes it impossible to determine whether the observed improvements exceed run-to-run variance.

    Authors: We agree that the lack of error bars and statistical tests limits the strength of the claims. In the revised manuscript, we have rerun all experiments across three random seeds, added mean and standard deviation values to the main results tables, and included paired t-test p-values demonstrating that the gains over standard OPD are statistically significant (p < 0.05) on AIME 2024 and AIME 2025. revision: yes

  2. Referee: [Experiments] Experiments section: no ablation is presented that isolates the best-of-N selection rule (correctness + alignment) from the ground-truth-conditioned recovery step. Because the recovery step supplies correct derivations that standard OPD never receives, it is unclear whether the reported gains are driven by the priority rule or by the recovery mechanism itself.

    Authors: This is a fair point. We have added a dedicated ablation subsection in the revised Experiments section that isolates the components. The new results compare (i) standard OPD, (ii) BRTS without recovery (priority rule only on unconditioned samples), (iii) recovery with random selection among correct trajectories, and (iv) full BRTS. These show that the priority rule provides measurable additional gains beyond recovery alone, especially on the hardest prompts. revision: yes

  3. Referee: [Method] Method section: the precise definition and computation of 'student alignment' (used as the secondary selection criterion) is not specified with sufficient detail to allow replication or to assess whether it introduces selection bias toward the student's current (possibly flawed) behavior.

    Authors: We thank the referee for highlighting this omission. The revised Method section now defines student alignment explicitly as the cosine similarity between sentence-transformer embeddings of the student's partial solution trace and each candidate teacher trajectory. We have included the exact embedding model, pseudocode for the selection procedure, and a short discussion of potential bias toward the student's current policy. revision: yes
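
Taking the simulated rebuttal's definition at face value (nothing in the abstract confirms it), the alignment score it describes is a few lines with an off-the-shelf embedding model; the checkpoint name below is illustrative, not the one the authors report.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any sentence-embedding checkpoint would do; this one is illustrative.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def student_alignment(student_trace: str, candidate_traj: str) -> float:
    """Cosine similarity between embeddings of the student's partial
    solution trace and a candidate teacher trajectory, per the
    simulated rebuttal's (unconfirmed) definition."""
    a, b = encoder.encode([student_trace, candidate_traj])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```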

Circularity Check

0 steps flagged

No circularity: empirical heuristic evaluated on held-out benchmarks

Full rationale

The paper presents BRTS as a procedural selection rule (correctness priority, alignment tie-breaker, ground-truth recovery fallback) inside on-policy distillation. No mathematical derivations, predictions, or first-principles results are claimed; the method is a heuristic whose performance is measured directly on external benchmarks (AIME 2024/2025, AMC 2023). No equations reduce to fitted parameters by construction, no self-citation chains support load-bearing premises, and no ansatz or renaming is introduced. The approach is therefore self-contained against external evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the assumption that correctness can be automatically verified against ground truth and that a simple priority rule produces better supervision; no new entities are postulated.

free parameters (1)
  • N (number of teacher rollouts)
    Size of the sampled pool is a design choice whose specific value is not stated in the abstract.
axioms (1)
  • domain assumption: Correctness of teacher trajectories can be reliably determined from ground truth
    Invoked when the selection rule checks correctness first and when the recovery step conditions on ground truth.
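
That axiom is easy to satisfy for competition math and hard elsewhere. A toy verifier of the kind the rule presupposes, assuming answers arrive in \boxed{...} form (the paper's actual grader may differ):

```python
import re

def is_correct(rollout: str, ground_truth: str) -> bool:
    """Toy correctness check: compare the last \\boxed{...} answer in a
    rollout against the ground truth after light normalization. Real
    math graders need much stronger equivalence checking, and open-ended
    domains may have no analogue at all."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", rollout)
    if not boxed:
        return False
    norm = lambda s: s.replace(" ", "").strip().lower()
    return norm(boxed[-1]) == norm(ground_truth)
```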

pith-pipeline@v0.9.0 · 5616 in / 1135 out tokens · 28803 ms · 2026-05-14T21:09:13.233274+00:00 · methodology


Reference graph

Works this paper leans on

58 extracted references · 31 canonical work pages · 21 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=3zKtaqxLhW

  2. [2]

    MathArena: Evaluating LLMs on Uncontaminated Math Competitions

    Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. MathArena: Evaluating LLMs on uncontaminated math competitions. arXiv preprint arXiv:2505.23281, 2025

  3. [3]

    Scheduled sampling for sequence prediction with recurrent neural networks

    Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2015

  4. [4]

    Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting

    Howard Chen, Noam Razin, Karthik Narasimhan, and Danqi Chen. Retaining by doing: The role of on-policy data in mitigating forgetting. arXiv preprint arXiv:2510.18874, 2025

  5. [5]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, and Yi Ma. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161, 2025

  6. [6]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  7. [7]

    HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation

    Ken Ding. HDPO: Hybrid distillation policy optimization via privileged self-distillation. arXiv preprint arXiv:2603.23871, 2026

  8. [8]

    RAFT: Reward ranked finetuning for generative foundation model alignment

    Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, KaShun Shum, and Tong Zhang. RAFT: Reward ranked finetuning for generative foundation model alignment. In Transactions on Machine Learning Research, 2023

  9. [9]

    Specializing smaller language models towards multi-step reasoning

    Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726, 2023

  10. [10]

    Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

    Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes. arXiv preprint arXiv:2603.25562, 2026

  11. [11]

    GLM-5: from Vibe Coding to Agentic Engineering

    GLM-5 Team, Aohan Zeng, Xin Lv, Zhenyu Hou, et al. GLM-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026

  12. [12]

    Knowledge Distillation: A Survey

    Jianping Gou, Baosheng Yu, Stephen J. Maybank, and Dacheng Tao. Knowledge distillation: A survey. International Journal of Computer Vision, 2021

  13. [13]

    MiniLLM: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. 2024

  14. [14]

    OpenThoughts: Data Recipes for Reasoning Models

    Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. OpenThoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178, 2025

  15. [15]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Nature, 2025

  16. [16]

    JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

    Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, et al. JustRL: Scaling a 1.5B LLM with a simple RL recipe. arXiv preprint arXiv:2512.16649, 2025

  17. [17]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  18. [18]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026

  19. [19]

    Stable On-Policy Distillation through Adaptive Target Reformulation

    Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, and Taesup Kim. Stable on-policy distillation through adaptive target reformulation. arXiv preprint arXiv:2601.07155, 2026

  20. [20]

    TinyBERT: Distilling BERT for natural language understanding

    Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, 2020

  21. [21]

    Entropy-aware on-policy distillation of language models

    Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079, 2026

  22. [22]

    Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

    Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of LLMs? arXiv preprint arXiv:2603.24472, 2026

  23. [23]

    Sequence-level knowledge distillation

    Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016

  24. [24]

    Scaling Reasoning Efficiently via Relaxed On-Policy Distillation

    Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, and Pashmina Cameron. Scaling reasoning efficiently via relaxed on-policy distillation. arXiv preprint arXiv:2603.11137, 2026

  25. [25]

    NuminaMath: The largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions

    Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q. Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. NuminaMath: The largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 2024

  26. [26]

    Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, and Ning Ding. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016, 2026

  27. [27]

    Small models struggle to learn from strong reasoners

    Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, and Radha Poovendran. Small models struggle to learn from strong reasoners. 2025

  28. [28]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023

  29. [29]

    On-Policy Distillation

    Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20251026. URL https://thinkingmachines.ai/blog/on-policy-distillation/

  30. [30]

    An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-Tuning

    Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. IEEE Transactions on Audio, Speech and Language Processing, 2025

  31. [31]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

  32. [32]

    Privileged information distillation for language models

    Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models, 2026

  33. [33]

    A reduction of imitation learning and structured prediction to no-regret online learning

    Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2011

  34. [34]

    DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. In Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing (NeurIPS Workshop), 2019

  35. [35]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  36. [37]

    RL's Razor: Why Online Reinforcement Learning Forgets Less

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. RL's razor: Why online reinforcement learning forgets less. arXiv preprint arXiv:2509.04259, 2025

  37. [38]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026

  38. [39]

    Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

    Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Elsayed, Hanie Sedghi, Igor Mordatch, et al. Beyond human data: Scaling self-training for problem-solving with language models.

  39. [40]

    Learning by distilling context

    Charlie Snell, Dan Klein, and Ruiqi Zhong. Learning by distilling context, 2022

  40. [41]

    Learning to summarize with human feedback

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  41. [42]

    MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers

    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788, 2020

  42. [43]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. 2023

  43. [44]

    MiMo-V2-Flash Technical Report

    Xiaomi LLM-Core Team, Bangjun Xiao, Bingquan Xia, Bo Yang, et al. MiMo-V2-Flash technical report. arXiv preprint arXiv:2601.02780, 2026

  44. [45]

    Error bounds of imitating policies and environments

    Tian Xu, Ziniu Li, and Yang Yu. Error bounds of imitating policies and environments. Advances in Neural Information Processing Systems, 33, 2020

  45. [46]

    Speculative knowledge distillation: Bridging the teacher-student gap through interleaved sampling

    Wenda Xu, Rujun Han, Zifeng Wang, Long T Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-Yu Lee, and Tomas Pfister. Speculative knowledge distillation: Bridging the teacher-student gap through interleaved sampling. arXiv preprint arXiv:2410.11325, 2024

  46. [47]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  47. [48]

    Self-Distilled RLVR

    Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled RLVR. arXiv preprint arXiv:2604.03128, 2026

  48. [49]

    Learning beyond teacher: Generalized on-policy distillation with reward extrapolation

    Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation. arXiv preprint arXiv:2602.12125, 2026

  49. [50]

    Black-box on-policy distillation of large language models

    Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, and Furu Wei. Black-box on-policy distillation of large language models, 2026

  50. [51]

    On-Policy Context Distillation for Language Models

    Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models. arXiv preprint arXiv:2602.12275, 2026

  51. [52]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  52. [53]

    STaR: Bootstrapping reasoning with reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 2022

  53. [54]

    Towards the law of capacity gap in distilling language models

    Chen Zhang, Qiuchi Li, Dawei Song, Zheyu Ye, Yan Gao, and Yao Hu. Towards the law of capacity gap in distilling language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22504–22528, 2025

  54. [55]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026
