OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

Jiayi Liu; Lizhu Zhang; Mingyi Wang; Peng Bo; Xiangjun Fan; Yifan Wu; Yuhang Zhou; Zhuokai Zhao

arxiv: 2606.01476 · v1 · pith:E7TRJVG7new · submitted 2026-05-31 · 💻 cs.LG · cs.CL

Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Peng Bo, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao

Pith reviewed 2026-06-28 17:04 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords on-policy distillation · logit-free distillation · chunk-level verification · Monte Carlo rollouts · semantic similarity · black-box teachers · math reasoning

The pith

OmniOPD replaces token-level logit matching in on-policy distillation with chunk-level semantic verification from Monte Carlo rollouts, enabling black-box teachers and higher performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard on-policy distillation requires direct access to a teacher's token logits and suffers from brittle signals that amplify noise and repetition. OmniOPD removes the logit requirement by running Monte Carlo rollouts of the student, scoring multi-token chunks with a continuous semantic similarity metric that approximates the teacher's local preferences, and auditing only at high-uncertainty points via a peak-entropy scheduler. A Dirichlet-Multinomial prior and base-model KL anchor stabilize the discrete sampling and keep the policy from collapsing on unaudited tokens. The method yields up to 28.64 percent gains over standard OPD on math tasks and an extra 9.54 percent relative gain when stronger proprietary models serve as teachers. A sympathetic reader would care because the change both widens the set of usable teachers and replaces a high-density but fragile signal with one that is more robust to distribution mismatch.

Core claim

OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher's local preferences through a continuous semantic similarity metric over multi-token chunks. A peak-entropy scheduler concentrates supervision by auditing the student only at high-uncertainty reasoning forks, while a Dirichlet-Multinomial Bayesian prior and a base-model KL anchor bound variance and prevent policy collapse on unaudited tokens. The reported result is an improvement of up to 28.64 percent over standard OPD on math benchmarks, with further gains when the student is paired with black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash.

What carries the argument

Monte Carlo rollouts scored by continuous semantic similarity over multi-token chunks, focused by a peak-entropy scheduler
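
The scheduling idea can be made concrete with a small sketch. The material reviewed here does not define the scheduler, so the following assumes that "peak entropy" means local maxima of the student's per-token Shannon entropy and that a fixed audit budget is spent on the highest peaks. The function names and the `max_audits` parameter are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def peak_entropy_positions(step_probs, max_audits):
    """Pick up to `max_audits` positions that are local entropy peaks.

    Assumption: a peak is a position whose entropy exceeds its left
    neighbour and is at least as high as its right neighbour. Peaks are
    ranked by entropy, then returned in sequence order.
    """
    ent = [token_entropy(p) for p in step_probs]
    peaks = []
    for t, h in enumerate(ent):
        left = ent[t - 1] if t > 0 else float("-inf")
        right = ent[t + 1] if t + 1 < len(ent) else float("-inf")
        if h > left and h >= right:
            peaks.append(t)
    top = sorted(peaks, key=lambda t: ent[t], reverse=True)[:max_audits]
    return sorted(top)

# Toy student distributions over a 4-token vocabulary at 5 positions.
toy = [
    [0.97, 0.01, 0.01, 0.01],    # confident
    [0.25, 0.25, 0.25, 0.25],    # maximally uncertain
    [0.90, 0.05, 0.03, 0.02],
    [0.40, 0.40, 0.10, 0.10],    # second fork
    [0.99, 0.005, 0.003, 0.002],
]
audit_at = peak_entropy_positions(toy, max_audits=2)
```

With a budget of two audits, the toy sequence is audited only at its two uncertain positions, and the confident tokens in between receive no teacher query.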

If this is right

Surpasses standard OPD by up to 28.64 percent on math benchmarks
Enables use of black-box proprietary teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash with an additional 9.54 percent relative gain on math
Concentrates learning signal at high-uncertainty forks rather than every token
Stabilizes training on unaudited tokens through Dirichlet-Multinomial prior and KL anchor to avoid policy collapse
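
The stabilizing role of a Dirichlet-Multinomial prior can be illustrated with the standard conjugate update. How OmniOPD applies the prior is not specified in the material reviewed here. This sketch assumes that rollouts cast discrete votes over candidate chunks and that a symmetric prior with concentration `alpha` is used; both choices and the function name are illustrative assumptions.

```python
def dirichlet_multinomial_mean(counts, alpha=1.0):
    """Posterior mean of category probabilities under a symmetric
    Dirichlet(alpha) prior, given observed multinomial counts."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    total = sum(counts) + alpha * len(counts)
    return [(n + alpha) / total for n in counts]

# Four rollouts that all vote for the third candidate chunk.
votes = [0, 0, 4]
raw = [n / sum(votes) for n in votes]             # empirical frequencies
smoothed = dirichlet_multinomial_mean(votes, 1.0)  # (n_i + 1) / (4 + 3)
```

The raw frequencies put all mass on one candidate after only four samples, whereas the smoothed estimate keeps every candidate strictly positive. That is the sense in which a prior of this kind limits the variance of a small discrete sample.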

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same chunk-verification loop could be applied to non-math reasoning domains where token-level teacher signals are similarly noisy
Multiple black-box teachers could be combined by averaging or voting their semantic scores on the same student chunks
The peak-entropy scheduler might be replaced by other uncertainty measures such as embedding variance without changing the rest of the pipeline
The method reduces dependence on open-weight models as teachers, which could shift distillation practice toward closed-model ensembles

Load-bearing premise

The continuous semantic similarity metric on multi-token chunks from rollouts accurately approximates the teacher's local preferences even without any logit access.
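
To make the premise concrete, here is one way a continuous chunk score could be estimated without logits. The abstract does not name the similarity function or say how rollouts are compared, so this sketch substitutes cosine similarity between embedding vectors and assumes the student's chunk is compared against K reference continuations sampled from the teacher at the same prefix. The embeddings are plain lists of floats, and all names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0

def chunk_score(student_chunk_vec, reference_vecs):
    """Monte Carlo estimate of a chunk's score in [0, 1].

    Averages the cosine similarity between the student's chunk embedding
    and each sampled reference continuation, then rescales from [-1, 1]
    to [0, 1] so the result can act as a soft weight.
    """
    if not reference_vecs:
        raise ValueError("need at least one reference rollout")
    sims = [cosine(student_chunk_vec, r) for r in reference_vecs]
    return 0.5 * (1.0 + sum(sims) / len(sims))

# One reference agrees with the student chunk, one is orthogonal to it.
score = chunk_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Whether a score of this kind tracks the teacher's actual preferences is exactly what the premise assumes, and the sketch does nothing to establish it.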

What would settle it

A controlled test that replaces the semantic similarity scores with random chunk scores or with scores from a deliberately mismatched teacher and measures whether the reported performance gains over standard OPD disappear.

Original abstract

On-Policy Distillation (OPD) trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher's token-level logits, excluding a broad class of capable proprietary models from serving as teachers. Second, the token-level logit signal itself is brittle, depending on a narrow overlap of plausible next tokens between teacher and student, and prone to amplifying degenerate patterns such as repetition loops. In this paper, we introduce OmniOPD, a novel framework that addresses both limitations through a logit-free, chunk-level supervision signal. OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher's local preferences through a continuous semantic similarity metric over multi-token chunks, and concentrates this supervision via a peak-entropy scheduler that audits the student only at its high-uncertainty reasoning forks. A Dirichlet-Multinomial Bayesian prior and a base-model KL anchor further bound the variance of discrete sampling and prevent policy collapse across unaudited tokens. Across competitive benchmarks, OmniOPD surpasses the standard OPD approach by up to +28.64% on math, confirming that chunk-level semantic verification extracts a more reliable learning signal than token-level logit matching, whose high information density is offset by significant noise and brittleness. Furthermore, when paired with stronger black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash, OmniOPD achieves an additional +9.54% relative on math over its open-weight teacher counterpart, advancing the student past the performance of self-exploratory RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

OmniOPD replaces token logits with chunk-level semantic verification from Monte Carlo rollouts to allow black-box teachers, but the large gains rest on unverified assumptions about the similarity metric.

The main point is that this work removes the logit-access requirement in on-policy distillation by shifting to chunk-level semantic similarity computed over Monte Carlo rollouts, plus a peak-entropy scheduler that only audits high-uncertainty steps. A Dirichlet-Multinomial prior and KL anchor are added to limit variance on unaudited tokens. This setup is meant to let closed models like Claude or Gemini serve as teachers while avoiding the noise in direct token matching.

It does a clean job naming the two practical problems with standard OPD: many strong models expose no logits, and token-level signals can amplify repetition or other degeneracies when overlap is low. The chunk-level approach plus speculative verification is a direct attempt to get around both, and the scheduler focuses compute where the student is most uncertain, which makes sense for efficiency.

The soft spots are straightforward. The abstract claims gains up to 28.64% on math over standard OPD and another 9.54% when switching to stronger black-box teachers, but those numbers cannot be checked against the actual rollout procedure, the semantic similarity function, or any ablations on the prior and anchor. The core assumption—that continuous semantic similarity over chunks reliably stands in for the teacher’s local preferences—sounds reasonable but could still be brittle if the metric correlates poorly with the teacher’s actual next-token distribution. Without the methods and results sections, it is impossible to tell whether the variance controls actually prevent collapse or just mask it.

This is for groups doing LLM post-training who want to bring in closed models without logit APIs. A reader already working on distillation variants would get value from the framing even if the experiments need tightening. It is coherent enough on its own terms to deserve a serious referee rather than a desk reject, mainly because the problem it targets is real and the proposed components are distinct from prior OPD work.

Referee Report

2 major / 0 minor

Summary. The paper introduces OmniOPD, a logit-free on-policy distillation framework that replaces token-level logit matching with Monte Carlo rollouts computing continuous semantic similarity over multi-token chunks. It adds a peak-entropy scheduler to audit high-uncertainty forks, a Dirichlet-Multinomial Bayesian prior, and a base-model KL anchor to bound variance and prevent collapse on unaudited tokens. The central empirical claim is that this yields up to +28.64% gains on math benchmarks over standard OPD and an additional +9.54% relative gain when using black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash.

Significance. If the chunk-level semantic verification reliably approximates teacher preferences and the variance-control mechanisms prevent collapse, the work would meaningfully expand on-policy distillation to proprietary models and offer a potentially more robust signal than brittle logit matching.

major comments (2)

Abstract: the central performance claim (+28.64% on math) and the assertion that chunk-level verification is more reliable than token-level logit matching rest on unshown experimental details, ablations, and exact definitions of the semantic similarity metric; without these the claim cannot be assessed for load-bearing validity.
Abstract: the weakest assumption—that Monte Carlo rollouts with continuous semantic similarity plus the scheduler/prior/KL anchor suffice to bound variance and avoid policy collapse—is stated but not accompanied by any derivation, bound, or diagnostic experiment in the provided material.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the review and the opportunity to address these points. The full manuscript contains the supporting experimental sections, metric definitions, and empirical diagnostics referenced in the abstract. We respond to each major comment below.

Referee: Abstract: the central performance claim (+28.64% on math) and the assertion that chunk-level verification is more reliable than token-level logit matching rest on unshown experimental details, ablations, and exact definitions of the semantic similarity metric; without these the claim cannot be assessed for load-bearing validity.

Authors: Section 4 of the manuscript reports the full experimental results on math benchmarks, including the +28.64% gain over standard OPD with exact setup, baselines, and statistical details. Section 3.2 defines the continuous semantic similarity metric used for chunk-level verification. Section 5 contains the ablation studies isolating each component. These sections supply the load-bearing evidence for the abstract claims; the abstract itself is a summary and does not repeat the full experimental protocol.

revision: no

Referee: Abstract: the weakest assumption—that Monte Carlo rollouts with continuous semantic similarity plus the scheduler/prior/KL anchor suffice to bound variance and avoid policy collapse—is stated but not accompanied by any derivation, bound, or diagnostic experiment in the provided material.

Authors: Sections 3.3–3.5 describe the peak-entropy scheduler, Dirichlet-Multinomial prior, and base-model KL anchor and explain how they are intended to control variance on unaudited tokens. The main results in Section 4 and additional diagnostics in Section 6 provide empirical evidence that collapse is avoided in practice. We do not supply a formal theoretical derivation or closed-form bound; the design is justified empirically rather than analytically.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

The provided abstract and context introduce OmniOPD via independent mechanisms (Monte Carlo semantic similarity over chunks, peak-entropy scheduler, Dirichlet-Multinomial prior, KL anchor) that are explicitly motivated to address stated limitations of token-level logit matching. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the material. The performance claims (+28.64% on math, +9.54% with black-box teachers) are positioned as empirical outcomes rather than derivations that reduce to the inputs by construction. The central argument remains externally falsifiable via benchmark comparisons and does not rely on self-referential definitions or uniqueness theorems imported from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is provided, so no specific free parameters, axioms, or invented entities can be extracted or verified.

pith-pipeline@v0.9.1-grok · 5880 in / 1123 out tokens · 31107 ms · 2026-06-28T17:04:43.310812+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Formula-Driven Survey and Research Agenda for On-Policy Distillation
cs.AI 2026-06 unverdicted novelty 4.0

A survey creates a taxonomy for on-policy distillation in LLMs that separates temporal credit assignment from vocabulary-level probability routing.

Reference graph

Works this paper leans on

39 extracted references · 33 canonical work pages · cited by 1 Pith paper · 26 internal anchors

[1]

Phi-4 Technical Report

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Claude-4, 2025a.https://www.anthropic.com/news/claude-4

Anthropic. Claude-4, 2025a.https://www.anthropic.com/news/claude-4. Accessed: 2025-07-10. AI Anthropic. System card: Claude opus 4 & claude sonnet 4.Claude-4 Model Card, 2025b. Malika Aubakirova, Alex Atallah, Chris Clark, Justin Summerville, and Anjney Midha. State of ai: An empirical 100 trillion token study with openrouter.arXiv preprint arXiv:2601.10088,

work page arXiv 2025
[3]

Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Advances in Neural Information Processing Systems, 38:9640–9664

Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, and Tianyi Lin. Exploration vs exploitation: Rethinking rlvr through clipping, entropy, and spurious reward.arXiv preprint arXiv:2512.16912,

work page arXiv
[5]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Process Reinforcement through Implicit Rewards

Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards.arXiv preprint arXiv:2502.01456,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

https: //arxiv.org/abs/2605.00674. Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Jiacai Liu, Zhuo Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes.arXiv preprint arXiv:2603.25562,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Self-Policy Distillation via Capability-Selective Subspace Projection

Guangya Hao, Yitong Shang, Yunbo Long, Zhuokai Zhao, and Hanxue Liang. Self-policy distillation via capability-selective subspace projection. arXiv preprint arXiv:2605.22675,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems.arXiv preprint arXiv:2402.14008,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, and Sanjeev Arora. Self-distillation zero: Self-revision turns binary rewards into dense supervision. arXiv preprint arXiv:2604.12002,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with apps.NeurIPS, 2021a. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathema...

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes.arXiv preprint arXiv:2305.02301,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, and Xiang Yue. Does math reasoning improve general llm capabilities? understanding transferability of llm reasoning. arXiv preprint arXiv:2507.00432,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Gemini 2.5 pro capable of winning gold at imo 2025.arXiv preprint arXiv:2507.15855,

Yichen Huang and Lin F Yang. Gemini 2.5 pro capable of winning gold at imo 2025.arXiv preprint arXiv:2507.15855,

work page arXiv 2025
[17]

Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Gemma 3 Technical Report

Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 4,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Sequence-level knowledge distillation

Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. InProceedings of the 2016 conference on empirical methods in natural language processing, pages 1317–1327,

2016
[20]

Mentor-kd: Making small language models better multi-step reasoners

Hojae Lee, Junho Kim, and SangKeun Lee. Mentor-kd: Making small language models better multi-step reasoners. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17643–17658,

2024
[21]

Taco: Topics in algorithmic code generation dataset.arXiv preprint arXiv:2312.14852, 2023

Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. Taco: Topics in algorithmic code generation dataset.arXiv preprint arXiv:2312.14852,

work page arXiv
[22]

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, et al. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals

doi: 10.1126/science.abq1158. https://www.science.org/doi/abs/10.1126/science.abq1158. Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu, Zhijiang Guo, and Weizhu Chen. Beyond pass@ 1: Self-play with variational problem synthesis sustains rlvr.arXiv preprint arXiv:2508.14029,

work page doi:10.1126/science.abq1158
[24]

Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation. Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han, Tianyi Zhang, and Vladimir Braverman. Demystifying opd: Length inflation and stabilization strategies for large language models. arXiv preprint arXiv:2604.08527,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.64434/tml.20251026
[25]

A new era of intelligence with gemini 3.Mountain View, CA: Google)

Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu. A new era of intelligence with gemini 3.Mountain View, CA: Google). Available online at: https://blog. google/products-andplatforms/products/gemini/gemini-3/(Accessed February1, 2026),

2026
[26]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Distilling reasoning capabilities into smaller language models

Kumar Shridhar, Alessandro Stolfo, and Mrinmaya Sachan. Distilling reasoning capabilities into smaller language models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7059–7073,

2023
[28]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267,

work page internal anchor Pith review Pith/arXiv arXiv
[29]

A Survey of On-Policy Distillation for Large Language Models

Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626,

work page internal anchor Pith review Pith/arXiv arXiv
[30]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

MiMo-V2-Flash Technical Report

Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. Mimo-v2-flash technical report.arXiv preprint arXiv:2601.02780,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv
[33]

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation.arXiv preprint arXiv:2602.12125,

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Black-box on-policy distillation of large language models

Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, and Furu Wei. Black-box on-policy distillation of large language models. arXiv preprint arXiv:2511.10643,

work page arXiv
[35]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,

work page internal anchor Pith review Pith/arXiv arXiv
[36]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models.arXiv preprint arXiv:2601.18734,

work page internal anchor Pith review Pith/arXiv arXiv
[37]

Teaching-assistant-in-the-loop: Improving knowledge distillation from imperfect teacher models in low-budget scenarios.arXiv preprint arXiv:2406.05322,

Yuhang Zhou and Wei Ai. Teaching-assistant-in-the-loop: Improving knowledge distillation from imperfect teacher models in low-budget scenarios.arXiv preprint arXiv:2406.05322,

work page arXiv
[38]

The top row utilizes the open-weight Qwen3-32B teacher, while the bottom row utilizes the frontier Gemini-2.5-flash model

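Trust Region Anchoring & Optimization
20: Calculate chunk-level policy loss: $L_{\text{chunk}} = -\sum_{c \in C} \hat{\pi}^{(c)}_{\text{teacher}} \left( \sum_{t \in c} \log \pi_\theta(y_t \mid x, y_{<t}) \right)$
21: Calculate un-audited trust region penalty: $L_{\text{KL}} = \beta \sum_{t \in U} D_{\text{KL}}\left( \pi_{\text{ref}}(\cdot \mid x, y_{<t}) \,\|\, \pi_\theta(\cdot \mid x, y_{<t}) \right)$
22: Compute total objective: $L_{\text{OmniOPD}}(\theta) = L_{\text{chunk}} + L_{\text{KL}}$
23: Perform gradient descent step: $\theta \leftarrow \theta - \eta \nabla_\theta L_{\text{OmniOPD}}(\theta)$
24: end while
25: re...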

2025
[39]

The chunk-level on-policy loss exhibits rapid and stable convergence across both setups

We track the process under both an open-weight Qwen3-32B and Gemini-2.5-flash. The chunk-level on-policy loss exhibits rapid and stable convergence across both setups. For the Qwen3-32B teacher, the loss drops sharply from approximately 0.33 to 0.24 within the first 100 steps before stabilizing. Remarkably, when utilizing the Gemini-2.5-flash teacher, the...

2025
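
Steps 20 to 22 of the algorithm fragment captured in entry [38] can be evaluated numerically with a short sketch. It assumes the extracted formulas read as a teacher-weighted sum of student log-probabilities over audited chunks, plus `beta` times the KL divergence from the reference (base) model to the student at unaudited positions. Inputs are plain Python lists, no gradients are computed, and all function and argument names are hypothetical.

```python
import math

def chunk_loss(chunk_weights, chunk_token_logps):
    """L_chunk = -sum_c w_c * sum_{t in c} log pi_theta(y_t | x, y_<t).

    chunk_weights[c] is the estimated teacher preference for chunk c;
    chunk_token_logps[c] lists the student's log-probs for its tokens.
    """
    return -sum(w * sum(logps)
                for w, logps in zip(chunk_weights, chunk_token_logps))

def kl_divergence(p, q):
    """D_KL(p || q) over a shared vocabulary; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_anchor(ref_dists, student_dists, beta):
    """L_KL = beta * sum over unaudited positions of D_KL(pi_ref || pi_theta)."""
    return beta * sum(kl_divergence(r, s)
                      for r, s in zip(ref_dists, student_dists))

def total_objective(chunk_weights, chunk_token_logps,
                    ref_dists, student_dists, beta):
    """L_OmniOPD = L_chunk + L_KL, as in step 22 of the quoted fragment."""
    return (chunk_loss(chunk_weights, chunk_token_logps)
            + kl_anchor(ref_dists, student_dists, beta))

# One audited two-token chunk with weight 0.5, one unaudited position.
loss = total_objective(
    chunk_weights=[0.5],
    chunk_token_logps=[[math.log(0.5), math.log(0.5)]],
    ref_dists=[[0.5, 0.5]],
    student_dists=[[0.25, 0.75]],
    beta=0.1,
)
```

In the toy call the chunk term equals ln 2, and the anchor adds a small positive penalty because the student has drifted from the reference distribution at the unaudited position.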

[1] [1]

Phi-4 Technical Report

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv preprintarXiv:2412.08905,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Claude-4, 2025a.https://www.anthropic.com/news/claude-4

Anthropic. Claude-4, 2025a.https://www.anthropic.com/news/claude-4. Accessed: 2025-07-10. AI Anthropic. System card: Claude opus 4 & claude sonnet 4.Claude-4 Model Card, 2025b. Malika Aubakirova, Alex Atallah, Chris Clark, Justin Summerville, and Anjney Midha. State of ai: An empirical 100 trillion token study with openrouter.arXiv preprint arXiv:2601.10088,

work page arXiv 2025

[3] [3]

Accelerating Large Language Model Decoding with Speculative Sampling

CharlieChen, SebastianBorgeaud, GeoffreyIrving, Jean-BaptisteLespiau, LaurentSifre, andJohnJumper. Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Advances in Neural Information Processing Systems, 38:9640–9664

Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, and Tianyi Lin. Exploration vs exploitation: Rethinking rlvr through clipping, entropy, and spurious reward.arXiv preprint arXiv:2512.16912,

work page arXiv

[5] [5]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Process Reinforcement through Implicit Rewards

Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards.arXiv preprint arXiv:2502.01456,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

https: //arxiv.org/abs/2605.00674. Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Jiacai Liu, Zhuo Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes.arXiv preprint arXiv:2603.25562,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Self-Policy Distillation via Capability-Selective Subspace Projection

Guangya Hao, Yitong Shang, Yunbo Long, Zhuokai Zhao, and Hanxue Liang. Self-policy distillation via capability- selective subspace projection.arXiv preprint arXiv:2605.22675,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008,

[12]

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, and Sanjeev Arora. Self-distillation zero: Self-revision turns binary rewards into dense supervision. arXiv preprint arXiv:2604.12002,

[13]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with apps. NeurIPS, 2021a. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathema...

[14]

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301,

[15]

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, and Xiang Yue. Does math reasoning improve general llm capabilities? understanding transferability of llm reasoning. arXiv preprint arXiv:2507.00432,

[16]

Gemini 2.5 Pro Capable of Winning Gold at IMO 2025

Yichen Huang and Lin F Yang. Gemini 2.5 pro capable of winning gold at imo 2025. arXiv preprint arXiv:2507.15855,

2025

[17]

Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802,

[18]

Gemma 3 Technical Report

Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 4,

[19]

Sequence-level knowledge distillation

Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 conference on empirical methods in natural language processing, pages 1317–1327,

2016

[20]

Mentor-kd: Making small language models better multi-step reasoners

Hojae Lee, Junho Kim, and SangKeun Lee. Mentor-kd: Making small language models better multi-step reasoners. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17643–17658,

2024

[21]

TACO: Topics in Algorithmic Code Generation Dataset

Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. Taco: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852,

2023

[22]

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, et al. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016,

[23]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals

doi: 10.1126/science.abq1158. https://www.science.org/doi/abs/10.1126/science.abq1158. Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu, Zhijiang Guo, and Weizhu Chen. Beyond pass@1: Self-play with variational problem synthesis sustains rlvr. arXiv preprint arXiv:2508.14029,

[24]

Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation. Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han, Tianyi Zhang, and Vladimir Braverman. Demystifying opd: Length inflation and stabilization strategies for large language models. arXiv preprint arXiv:2604.08527,

[25]

A New Era of Intelligence with Gemini 3

Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu. A new era of intelligence with gemini 3. (Mountain View, CA: Google). Available online at: https://blog.google/products-andplatforms/products/gemini/gemini-3/ (Accessed February 1, 2026),

2026

[26]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

[27]

Distilling reasoning capabilities into smaller language models

Kumar Shridhar, Alessandro Stolfo, and Mrinmaya Sachan. Distilling reasoning capabilities into smaller language models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7059–7073,

2023

[28]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267,

[29]

A Survey of On-Policy Distillation for Large Language Models

Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626,

[30]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534,

[31]

MiMo-V2-Flash Technical Report

Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. Mimo-v2-flash technical report. arXiv preprint arXiv:2601.02780,

[32]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

[33]

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation. arXiv preprint arXiv:2602.12125,

[34]

Black-box on-policy distillation of large language models

Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, and Furu Wei. Black-box on-policy distillation of large language models. arXiv preprint arXiv:2511.10643,

[35]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,

[36]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734,

[37]

Teaching-Assistant-in-the-Loop: Improving Knowledge Distillation from Imperfect Teacher Models in Low-Budget Scenarios

Yuhang Zhou and Wei Ai. Teaching-assistant-in-the-loop: Improving knowledge distillation from imperfect teacher models in low-budget scenarios. arXiv preprint arXiv:2406.05322,

[38]

The top row utilizes the open-weight Qwen3-32B teacher, while the bottom row utilizes the frontier Gemini-2.5-flash model

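Trust Region Anchoring & Optimization
20: Calculate chunk-level policy loss: $\mathcal{L}_{\text{chunk}} = -\sum_{c \in \mathcal{C}} \hat{\pi}^{(c)}_{\text{teacher}} \left( \sum_{t \in c} \log \pi_\theta(y_t \mid x, y_{<t}) \right)$
21: Calculate un-audited trust region penalty: $\mathcal{L}_{\text{KL}} = \beta \sum_{t \in \mathcal{U}} D_{\text{KL}}\left( \pi_{\text{ref}}(\cdot \mid x, y_{<t}) \,\|\, \pi_\theta(\cdot \mid x, y_{<t}) \right)$
22: Compute total objective: $\mathcal{L}_{\text{OmniOPD}}(\theta) = \mathcal{L}_{\text{chunk}} + \mathcal{L}_{\text{KL}}$
23: Perform gradient descent step: $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}_{\text{OmniOPD}}(\theta)$
24: end while
25: re...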
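The loss computation in steps 20 to 22 of the listing above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`chunk_loss`, `kl_divergence`, `kl_penalty`, `omniopd_objective`), the list-based inputs, and all toy numbers are hypothetical, and the teacher chunk scores are taken as given rather than estimated from rollouts. The gradient step (step 23) is omitted because it would need an autodiff framework.

```python
import math


def chunk_loss(token_logps, chunks, teacher_scores):
    """Step 20: minus the score-weighted sum of student log-probs per chunk."""
    total = 0.0
    for chunk, score in zip(chunks, teacher_scores):
        total += score * sum(token_logps[t] for t in chunk)
    return -total


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_penalty(ref_dists, student_dists, unaudited, beta):
    """Step 21: beta times the summed KL(pi_ref || pi_theta) on un-audited positions."""
    return beta * sum(
        kl_divergence(ref_dists[t], student_dists[t]) for t in unaudited
    )


def omniopd_objective(token_logps, chunks, teacher_scores,
                      ref_dists, student_dists, unaudited, beta):
    """Step 22: total objective = chunk-level loss + KL penalty."""
    return (chunk_loss(token_logps, chunks, teacher_scores)
            + kl_penalty(ref_dists, student_dists, unaudited, beta))


# Toy rollout (all values hypothetical): 4 positions, vocabulary of size 2,
# and the sampled token at every position is vocabulary index 0.
student_dists = [[0.5, 0.5], [0.8, 0.2], [0.6, 0.4], [0.9, 0.1]]
ref_dists = [[0.5, 0.5], [0.8, 0.2], [0.6, 0.4], [0.7, 0.3]]
token_logps = [math.log(d[0]) for d in student_dists]
chunks = [[0, 1]]        # one audited chunk covering positions 0 and 1
teacher_scores = [0.7]   # assumed teacher score for that chunk
unaudited = [2, 3]       # positions outside every audited chunk
beta = 0.1               # assumed KL weight

loss = omniopd_objective(token_logps, chunks, teacher_scores,
                         ref_dists, student_dists, unaudited, beta)
```

In this toy setup the reference and student distributions agree at position 2, so only position 3 contributes to the penalty. The audited chunk is excluded from the KL term, matching the split between the chunk set and the un-audited set in the listing.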

2025

[39]

The chunk-level on-policy loss exhibits rapid and stable convergence across both setups

We track the process under both an open-weight Qwen3-32B teacher and a Gemini-2.5-flash teacher. The chunk-level on-policy loss converges rapidly and stably in both setups. For the Qwen3-32B teacher, the loss drops sharply from approximately 0.33 to 0.24 within the first 100 steps before stabilizing. Remarkably, when using the Gemini-2.5-flash teacher, the...

2025