
arxiv: 2606.01476 · v1 · pith:E7TRJVG7new · submitted 2026-05-31 · 💻 cs.LG · cs.CL

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

Pith reviewed 2026-06-28 17:04 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords on-policy distillation · logit-free distillation · chunk-level verification · Monte Carlo rollouts · semantic similarity · black-box teachers · math reasoning

The pith

OmniOPD replaces token-level logit matching in on-policy distillation with chunk-level semantic verification from Monte Carlo rollouts, enabling black-box teachers and higher performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard on-policy distillation requires direct access to a teacher's token logits and suffers from brittle signals that amplify noise and repetition. OmniOPD removes the logit requirement by running Monte Carlo rollouts of the student, scoring multi-token chunks with a continuous semantic similarity metric that approximates the teacher's local preferences, and auditing only at high-uncertainty points via a peak-entropy scheduler. A Dirichlet-Multinomial prior and base-model KL anchor stabilize the discrete sampling and keep the policy from collapsing on unaudited tokens. The method yields up to 28.64 percent gains over standard OPD on math tasks and an extra 9.54 percent relative gain when stronger proprietary models serve as teachers. A sympathetic reader would care because the change both widens the set of usable teachers and replaces a high-density but fragile signal with one that is more robust to distribution mismatch.
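
As a rough illustration of the peak-entropy idea described above, the sketch below picks audit positions at local maxima of the student's next-token entropy. The function names, the local-peak rule, and the `max_audits` cap are assumptions made for illustration; the abstract does not say how the paper's scheduler actually selects positions.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def peak_entropy_positions(step_probs, max_audits=2):
    """Return indices of local entropy peaks, keeping the highest max_audits.

    Assumption: a position counts as a peak when its entropy is positive and
    at least as large as both neighbours (sequence edges have one neighbour).
    """
    ent = [token_entropy(p) for p in step_probs]
    peaks = []
    for i, h in enumerate(ent):
        left = ent[i - 1] if i > 0 else float("-inf")
        right = ent[i + 1] if i + 1 < len(ent) else float("-inf")
        if h > 0.0 and h >= left and h >= right:
            peaks.append(i)
    # Keep the most uncertain peaks, then report them in sequence order.
    peaks.sort(key=lambda i: ent[i], reverse=True)
    return sorted(peaks[:max_audits])


# Toy student distributions over five decoding steps (made-up numbers).
steps = [
    [1.0, 0.0],
    [0.5, 0.5],
    [0.9, 0.1],
    [0.25, 0.25, 0.25, 0.25],
    [0.99, 0.01],
]
audit_at = peak_entropy_positions(steps, max_audits=2)
```

Only the positions in `audit_at` would be sent for verification; every other token would be left to the stabilizing terms.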

Core claim

OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher's local preferences through a continuous semantic similarity metric over multi-token chunks, concentrates supervision via a peak-entropy scheduler that audits the student only at high-uncertainty reasoning forks, and applies a Dirichlet-Multinomial Bayesian prior plus base-model KL anchor to bound variance and prevent policy collapse on unaudited tokens, producing up to 28.64 percent improvement on math benchmarks and further gains when paired with black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash.
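
The algorithm excerpt captured in the reference graph on this page gives the objective as a teacher-weighted chunk log-likelihood plus a KL penalty on unaudited tokens. The sketch below evaluates that objective on plain Python lists. The function signature, the default `beta`, and the toy data layout are assumptions for illustration, and the student distributions are assumed to be strictly positive wherever the reference distribution is.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def omniopd_loss(token_logps, chunks, chunk_weights,
                 ref_dists, student_dists, beta=0.1):
    """Chunk loss plus KL anchor, following the excerpted objective.

    token_logps[t]   : log-probability the student gave its own token at step t
    chunks           : list of audited chunks, each a list of token positions
    chunk_weights[c] : teacher preference estimate for chunk c
    ref_dists[t]     : base-model distribution at step t
    student_dists[t] : student distribution at step t
    """
    audited = {t for chunk in chunks for t in chunk}
    l_chunk = -sum(
        w * sum(token_logps[t] for t in chunk)
        for chunk, w in zip(chunks, chunk_weights)
    )
    # The KL anchor is applied only to positions outside every audited chunk.
    l_kl = beta * sum(
        kl_divergence(ref_dists[t], student_dists[t])
        for t in range(len(token_logps))
        if t not in audited
    )
    return l_chunk + l_kl
```

In a real training loop these quantities would be tensors with gradients flowing through `token_logps` and `student_dists`; the list version only makes the bookkeeping of audited versus unaudited positions explicit.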

What carries the argument

Monte Carlo rollouts scored by continuous semantic similarity over multi-token chunks, focused by peak-entropy scheduler
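
A minimal sketch of chunk-level scoring, assuming the score of a student chunk is its mean similarity to a set of sampled reference continuations. The bag-of-tokens cosine used here is a stand-in chosen to keep the example self-contained; the paper's semantic similarity metric is not defined in the abstract, and the names `cosine_similarity` and `chunk_score` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a_tokens, b_tokens):
    """Cosine similarity between two token lists, treated as bags of tokens."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[k] * b[k] for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def chunk_score(student_chunk, rollout_chunks):
    """Mean similarity between the student's chunk and each sampled rollout."""
    if not rollout_chunks:
        return 0.0
    total = sum(cosine_similarity(student_chunk, r) for r in rollout_chunks)
    return total / len(rollout_chunks)


# Toy example: one rollout agrees exactly, the other shares a single token.
score = chunk_score(["x", "=", "2"], [["x", "=", "2"], ["y", "=", "3"]])
```

Because the score is continuous rather than an exact-match indicator, a chunk that partially agrees with the rollouts still receives some credit.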

If this is right

  • Surpasses standard OPD by up to 28.64 percent on math benchmarks
  • Enables use of black-box proprietary teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash with an additional 9.54 percent relative gain on math
  • Concentrates learning signal at high-uncertainty forks rather than every token
  • Stabilizes training on unaudited tokens through Dirichlet-Multinomial prior and KL anchor to avoid policy collapse
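
To illustrate why a Dirichlet-Multinomial prior can tame the variance of a small number of discrete samples, the sketch below computes the standard posterior mean under a symmetric Dirichlet prior. How the paper applies the prior is not stated in the abstract; the function name, the symmetric `alpha`, and the toy counts are assumptions.

```python
def dirichlet_multinomial_mean(counts, alpha=1.0):
    """Posterior mean of category probabilities under a symmetric Dirichlet prior.

    With prior Dirichlet(alpha, ..., alpha) and observed counts n_i, the
    posterior mean for category i is (n_i + alpha) / (N + K * alpha),
    where N is the total count and K the number of categories.
    """
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


# Four rollouts split 3 / 1 / 0 across three candidate chunks (made-up counts).
smoothed = dirichlet_multinomial_mean([3, 1, 0], alpha=1.0)
```

The raw frequencies would be 0.75, 0.25, and 0; the smoothed estimate keeps the unseen candidate at a small nonzero weight, which limits how hard a handful of rollouts can push the policy.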

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same chunk-verification loop could be applied to non-math reasoning domains where token-level teacher signals are similarly noisy
  • Multiple black-box teachers could be combined by averaging or voting their semantic scores on the same student chunks
  • The peak-entropy scheduler might be replaced by other uncertainty measures such as embedding variance without changing the rest of the pipeline
  • The method reduces dependence on open-weight models as teachers, which could shift distillation practice toward closed-model ensembles

Load-bearing premise

The continuous semantic similarity metric on multi-token chunks from rollouts accurately approximates the teacher's local preferences even without any logit access.

What would settle it

A controlled test that replaces the semantic similarity scores with random chunk scores or with scores from a deliberately mismatched teacher and measures whether the reported performance gains over standard OPD disappear.
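
One way to build the random-score control is to shuffle the real chunk scores across chunks, which keeps their overall distribution but breaks the link between a chunk and its score. This is a sketch of that control under the assumption that scores are held in a flat list; the function name and seed handling are illustrative, not taken from the paper.

```python
import random


def shuffled_control(chunk_scores, seed=0):
    """Return a shuffled copy of the scores; the input list is left untouched."""
    rng = random.Random(seed)
    control = list(chunk_scores)
    rng.shuffle(control)
    return control


# Made-up scores for four audited chunks.
scores = [0.9, 0.1, 0.5, 0.7]
control = shuffled_control(scores, seed=0)
```

Training once with `scores` and once with `control`, everything else held fixed, would show whether the gains depend on the scores carrying real information.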

read the original abstract

On-Policy Distillation (OPD) trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher's token-level logits, excluding a broad class of capable proprietary models from serving as teachers. Second, the token-level logit signal itself is brittle, depending on a narrow overlap of plausible next tokens between teacher and student, and prone to amplifying degenerate patterns such as repetition loops. In this paper, we introduce OmniOPD, a novel framework that addresses both limitations through a logit-free, chunk-level supervision signal. OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher's local preferences through a continuous semantic similarity metric over multi-token chunks, and concentrates this supervision via a peak-entropy scheduler that audits the student only at its high-uncertainty reasoning forks. A Dirichlet-Multinomial Bayesian prior and a base-model KL anchor further bound the variance of discrete sampling and prevent policy collapse across unaudited tokens. Across competitive benchmarks, OmniOPD surpasses the standard OPD approach by up to +28.64% on math, confirming that chunk-level semantic verification extracts a more reliable learning signal than token-level logit matching, whose high information density is offset by significant noise and brittleness. Furthermore, when paired with stronger black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash, OmniOPD achieves an additional +9.54% relative on math over its open-weight teacher counterpart, advancing the student past the performance of self-exploratory RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces OmniOPD, a logit-free on-policy distillation framework that replaces token-level logit matching with Monte Carlo rollouts computing continuous semantic similarity over multi-token chunks. It adds a peak-entropy scheduler to audit high-uncertainty forks, a Dirichlet-Multinomial Bayesian prior, and a base-model KL anchor to bound variance and prevent collapse on unaudited tokens. The central empirical claim is that this yields up to +28.64% gains on math benchmarks over standard OPD and an additional +9.54% relative gain when using black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash.

Significance. If the chunk-level semantic verification reliably approximates teacher preferences and the variance-control mechanisms prevent collapse, the work would meaningfully expand on-policy distillation to proprietary models and offer a potentially more robust signal than brittle logit matching.

major comments (2)
  1. Abstract: the central performance claim (+28.64% on math) and the assertion that chunk-level verification is more reliable than token-level logit matching rest on unshown experimental details, ablations, and exact definitions of the semantic similarity metric; without these the claim cannot be assessed for load-bearing validity.
  2. Abstract: the weakest assumption—that Monte Carlo rollouts with continuous semantic similarity plus the scheduler/prior/KL anchor suffice to bound variance and avoid policy collapse—is stated but not accompanied by any derivation, bound, or diagnostic experiment in the provided material.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the review and the opportunity to address these points. The full manuscript contains the supporting experimental sections, metric definitions, and empirical diagnostics referenced in the abstract. We respond to each major comment below.

read point-by-point responses
  1. Referee: Abstract: the central performance claim (+28.64% on math) and the assertion that chunk-level verification is more reliable than token-level logit matching rest on unshown experimental details, ablations, and exact definitions of the semantic similarity metric; without these the claim cannot be assessed for load-bearing validity.

    Authors: Section 4 of the manuscript reports the full experimental results on math benchmarks, including the +28.64% gain over standard OPD with exact setup, baselines, and statistical details. Section 3.2 defines the continuous semantic similarity metric used for chunk-level verification. Section 5 contains the ablation studies isolating each component. These sections supply the load-bearing evidence for the abstract claims; the abstract itself is a summary and does not repeat the full experimental protocol. revision: no

  2. Referee: Abstract: the weakest assumption—that Monte Carlo rollouts with continuous semantic similarity plus the scheduler/prior/KL anchor suffice to bound variance and avoid policy collapse—is stated but not accompanied by any derivation, bound, or diagnostic experiment in the provided material.

    Authors: Sections 3.3–3.5 describe the peak-entropy scheduler, Dirichlet-Multinomial prior, and base-model KL anchor and explain how they are intended to control variance on unaudited tokens. The main results in Section 4 and additional diagnostics in Section 6 provide empirical evidence that collapse is avoided in practice. We do not supply a formal theoretical derivation or closed-form bound; the design is justified empirically rather than analytically. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The provided abstract and context introduce OmniOPD via independent mechanisms (Monte Carlo semantic similarity over chunks, peak-entropy scheduler, Dirichlet-Multinomial prior, KL anchor) that are explicitly motivated to address stated limitations of token-level logit matching. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the material. The performance claims (+28.64% on math, +9.54% with black-box teachers) are positioned as empirical outcomes rather than derivations that reduce to the inputs by construction. The central argument remains externally falsifiable via benchmark comparisons and does not rely on self-referential definitions or uniqueness theorems imported from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is provided, so no specific free parameters, axioms, or invented entities can be extracted or verified.

pith-pipeline@v0.9.1-grok · 5880 in / 1123 out tokens · 31107 ms · 2026-06-28T17:04:43.310812+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Formula-Driven Survey and Research Agenda for On-Policy Distillation

    cs.AI 2026-06 unverdicted novelty 4.0

    A survey creates a taxonomy for on-policy distillation in LLMs that separates temporal credit assignment from vocabulary-level probability routing.

Reference graph

Works this paper leans on

39 extracted references · 33 canonical work pages · cited by 1 Pith paper · 26 internal anchors

  1. [1]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905,

  2. [2]

    Claude-4, 2025a. https://www.anthropic.com/news/claude-4

    Anthropic. Claude-4, 2025a. https://www.anthropic.com/news/claude-4. Accessed: 2025-07-10. AI Anthropic. System card: Claude opus 4 & claude sonnet 4. Claude-4 Model Card, 2025b. Malika Aubakirova, Alex Atallah, Chris Clark, Justin Summerville, and Anjney Midha. State of ai: An empirical 100 trillion token study with openrouter. arXiv preprint arXiv:2601.10088,

  3. [3]

    Accelerating Large Language Model Decoding with Speculative Sampling

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318,

  4. [4]

    Advances in Neural Information Processing Systems, 38:9640–9664

    Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, and Tianyi Lin. Exploration vs exploitation: Rethinking rlvr through clipping, entropy, and spurious reward. arXiv preprint arXiv:2512.16912,

  5. [5]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261,

  6. [6]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456,

  7. [7]

    Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

    https://arxiv.org/abs/2605.00674. Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Jiacai Liu, Zhuo Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes. arXiv preprint arXiv:2603.25562,

  8. [8]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

  9. [9]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,

  10. [10]

    Self-Policy Distillation via Capability-Selective Subspace Projection

    Guangya Hao, Yitong Shang, Yunbo Long, Zhuokai Zhao, and Hanxue Liang. Self-policy distillation via capability-selective subspace projection. arXiv preprint arXiv:2605.22675,

  11. [11]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008,

  12. [12]

    Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

    Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, and Sanjeev Arora. Self-distillation zero: Self-revision turns binary rewards into dense supervision. arXiv preprint arXiv:2604.12002,

  13. [13]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with apps. NeurIPS, 2021a. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathema...

  14. [14]

    Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

    Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301,

  15. [15]

    Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

    Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, and Xiang Yue. Does math reasoning improve general llm capabilities? understanding transferability of llm reasoning. arXiv preprint arXiv:2507.00432,

  16. [16]

    Gemini 2.5 pro capable of winning gold at imo 2025. arXiv preprint arXiv:2507.15855,

    Yichen Huang and Lin F Yang. Gemini 2.5 pro capable of winning gold at imo 2025. arXiv preprint arXiv:2507.15855,

  17. [17]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802,

  18. [18]

    Gemma 3 Technical Report

    Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 4,

  19. [19]

    Sequence-level knowledge distillation

    Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 conference on empirical methods in natural language processing, pages 1317–1327,

  20. [20]

    Mentor-kd: Making small language models better multi-step reasoners

    Hojae Lee, Junho Kim, and SangKeun Lee. Mentor-kd: Making small language models better multi-step reasoners. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17643–17658,

  21. [21]

    Taco: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852, 2023

    Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. Taco: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852,

  22. [22]

    Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, et al. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016,

  23. [23]

    Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals

    doi: 10.1126/science.abq1158. https://www.science.org/doi/abs/10.1126/science.abq1158. Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu, Zhijiang Guo, and Weizhu Chen. Beyond pass@ 1: Self-play with variational problem synthesis sustains rlvr. arXiv preprint arXiv:2508.14029,

  24. [24]

    Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation. Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han, Tianyi Zhang, and Vladimir Braverman. Demystifying opd: Length inflation and stabilization strategies for large language models. arXiv preprint arXiv:2604.08527,

  25. [25]

    A new era of intelligence with gemini 3. Mountain View, CA: Google)

    Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu. A new era of intelligence with gemini 3. Mountain View, CA: Google). Available online at: https://blog.google/products-andplatforms/products/gemini/gemini-3/ (Accessed February 1, 2026),

  26. [26]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  27. [27]

    Distilling reasoning capabilities into smaller language models

    Kumar Shridhar, Alessandro Stolfo, and Mrinmaya Sachan. Distilling reasoning capabilities into smaller language models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7059–7073,

  28. [28]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267,

  29. [29]

    A Survey of On-Policy Distillation for Large Language Models

    Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626,

  30. [30]

    Kimi K2: Open Agentic Intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534,

  31. [31]

    MiMo-V2-Flash Technical Report

    Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. Mimo-v2-flash technical report. arXiv preprint arXiv:2601.02780,

  32. [32]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

  33. [33]

    Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

    Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation. arXiv preprint arXiv:2602.12125,

  34. [34]

    Black-box on-policy distillation of large language models

    Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, and Furu Wei. Black-box on-policy distillation of large language models. arXiv preprint arXiv:2511.10643,

  35. [35]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,

  36. [36]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734,

  37. [37]

    Teaching-assistant-in-the-loop: Improving knowledge distillation from imperfect teacher models in low-budget scenarios. arXiv preprint arXiv:2406.05322,

    Yuhang Zhou and Wei Ai. Teaching-assistant-in-the-loop: Improving knowledge distillation from imperfect teacher models in low-budget scenarios. arXiv preprint arXiv:2406.05322,

  38. [38]

    The top row utilizes the open-weight Qwen3-32B teacher, while the bottom row utilizes the frontier Gemini-2.5-flash model

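    Trust Region Anchoring & Optimization
    20: Calculate chunk-level policy loss: $\mathcal{L}_{\text{chunk}} = -\sum_{c \in C} \hat{\pi}^{(c)}_{\text{teacher}} \left( \sum_{t \in c} \log \pi_\theta(y_t \mid x, y_{<t}) \right)$
    21: Calculate un-audited trust region penalty: $\mathcal{L}_{\text{KL}} = \beta \sum_{t \in U} D_{\text{KL}}\left( \pi_{\text{ref}}(\cdot \mid x, y_{<t}) \,\|\, \pi_\theta(\cdot \mid x, y_{<t}) \right)$
    22: Compute total objective: $\mathcal{L}_{\text{OmniOPD}}(\theta) = \mathcal{L}_{\text{chunk}} + \mathcal{L}_{\text{KL}}$
    23: Perform gradient descent step: $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}_{\text{OmniOPD}}(\theta)$
    24: end while
    25: re...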

  39. [39]

    The chunk-level on-policy loss exhibits rapid and stable convergence across both setups

    We track the process under both an open-weight Qwen3-32B and Gemini-2.5-flash. The chunk-level on-policy loss exhibits rapid and stable convergence across both setups. For the Qwen3-32B teacher, the loss drops sharply from approximately 0.33 to 0.24 within the first 100 steps before stabilizing. Remarkably, when utilizing the Gemini-2.5-flash teacher, the...