DOPD: Dual On-policy Distillation

Congcong Wang; Gen Li; Guibin Zhang; Jiaqi Wang; Kaituo Feng; Kaiwen Tuo; Qingyi Si; Qunzhong Wang; Shuai Dong; Shuicheng Yan

arxiv: 2606.30626 · v1 · pith:4NXDGBZLnew · submitted 2026-06-29 · 💻 cs.AI


Xinlei Yu, Gen Li, Qingyi Si, Guibin Zhang, Yuqi Xu, Congcong Wang, Shuai Dong, Kaiwen Tuo, Xiangyu Zeng, Kaituo Feng, Qunzhong Wang, Yang Shi, Xiaobin Hu, Xiangyu Yue, Jiaqi Wang, Shuicheng Yan


Pith reviewed 2026-06-30 05:49 UTC · model grok-4.3

classification 💻 cs.AI

keywords on-policy distillation · privilege illusion · dual distillation · large language models · vision-language models · advantage-aware routing · token-level supervision


The pith

DOPD routes each token's supervision between privileged teacher and student policies using advantage gaps to reduce privilege illusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that adding privileged information to on-policy distillation creates privilege illusion, where models mix transferable capability gaps with non-replicable information asymmetry, worsened by uneven token importance. DOPD counters this by dynamically assigning each token to either the privileged teacher or privileged student for supervision, chosen according to advantage gap and relative probabilities. This gives tokens different objectives and strengths while transferring real capability and using auxiliary signals for asymmetry. If correct, the method raises the performance of distillation for large models by avoiding the previous failure mode. Readers care because it enables more reliable transfer of capabilities in language and vision-language settings without requiring perfectly symmetric information.

Core claim

DOPD is an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities. Each token receives supervision of different strength, objective, and strategy from either the teacher or the student itself, so that the student acquires credible capability while also receiving auxiliary signals that alleviate privilege illusion.

What carries the argument

Advantage-aware dual routing that assigns per-token supervision from either privileged teacher or privileged student based on advantage gap and relative probabilities.
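
As a rough illustration of what per-token routing of this kind could look like, the sketch below assigns each sampled token to one of the two privileged sources. The abstract gives no formula, so every specific here is an assumption made for illustration: the function name `route_tokens`, the threshold `tau`, the definition of "advantage" as a log-probability margin over the unprivileged student, and the strength formula. It should not be read as the paper's routing rule.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class TokenRoute:
    source: str    # "teacher" or "student" (the privileged student)
    weight: float  # supervision strength in [0, 1)


def route_tokens(
    lp_student: List[float],
    lp_teacher: List[float],
    lp_priv_student: List[float],
    tau: float = 0.5,  # hypothetical threshold on the advantage gap
) -> List[TokenRoute]:
    """Route each sampled token to a supervision source.

    Inputs are per-token log-probabilities of the student-sampled token
    under the unprivileged student, the privileged teacher, and the
    privileged student. All definitions below are illustrative assumptions.
    """
    routes = []
    for ls, lt, lps in zip(lp_student, lp_teacher, lp_priv_student):
        # Assumed "advantage": log-probability margin over the plain student.
        adv_teacher = lt - ls
        adv_priv_student = lps - ls
        gap = adv_teacher - adv_priv_student

        # Assumed rule: a large gap sends the token to the teacher,
        # anything else to the privileged student.
        if gap > tau:
            source, margin = "teacher", adv_teacher
        else:
            source, margin = "student", adv_priv_student

        # Assumed strength: grows with how much more probable the chosen
        # source finds the token than the plain student does.
        weight = 1.0 - math.exp(-max(0.0, margin))
        routes.append(TokenRoute(source, weight))
    return routes


if __name__ == "__main__":
    for r in route_tokens([-2.0, -1.0], [-0.5, -1.2], [-1.8, -0.3]):
        print(r)
```

In this toy rule the plain student's log-probability cancels out of the gap, so the routing decision depends only on how the two privileged policies compare, while the strength still depends on the plain student. Whether the paper's rule has that property cannot be determined from the abstract.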

If this is right

DOPD outperforms vanilla OPD and other counterparts on LLM and VLM settings.
The method yields gains in stability, robustness, continual learning, and out-of-distribution performance.
Tokens receive supervision varying in strength, objective, and strategy from either source.
Privilege illusion is reduced by separating capability transfer from information asymmetry.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The routing logic may extend to other distillation or imitation settings where one party holds extra context that cannot be replicated.
Non-uniform token importance could be used more broadly to focus training on capability-bearing signals rather than uniform dense supervision.
The dual-policy approach might lower the requirement for exact capability matching between teacher and student in future distillation work.

Load-bearing premise

That advantage gap and relative probabilities can separate transferable capability signals from non-replicable information asymmetry without creating new biases or instability.

What would settle it

An experiment on a small-scale LLM task where DOPD routing is applied but the resulting model shows no gain over vanilla OPD or where the chosen routes fail to correlate with measured capability transfer on held-out data.

Original abstract

On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals. To furnish high-quality supervision sources and thereby elevate the performance frontier of distillation, an intuitive direction is to infuse privileged information to either teacher or student itself. However, this additional input induces a potential failure mode we dub privilege illusion: a pattern that conflates the transferable capability gap that students are meant to close, and the information asymmetry gap that can only be mimicked but never replicated. This issue is further amplified by the inherent non-uniformity of token-level supervision, where only a small subset of tokens carries pivotal capability-bearing signals. To this end, we propose DOPD, an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities. Each token receives supervision of different strength, objective, and strategy from either teacher or student itself, which transfers credible capability while simultaneously receiving auxiliary signals, to alleviate privilege illusion. Extensive experiments on both large language model (LLM) and vision-language model (VLM) settings demonstrate that DOPD consistently outperforms Vanilla OPD and other counterparts. Further results on stability, robustness, continual learning, and out-of-distribution tasks validate its superiority.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

DOPD frames a dual privileged setup with advantage-based per-token routing to fix privilege illusion in on-policy distillation, but the abstract supplies no metrics or ablations to show the routing works as intended.


The paper's core move is to add privileged information to both teacher and student, then route each token's supervision between them using the advantage gap and relative probabilities. This is meant to separate replicable capability signals from non-replicable asymmetry, which the authors call privilege illusion. The idea targets a real issue in on-policy distillation for large models where token-level signals are uneven and extra inputs can create mimicry that the student cannot actually use.

What stands out is the explicit dual setup and the dynamic routing rule. Standard OPD usually has one privileged party; here both sides get privileges and the method decides per token which objective and strength to apply. That is a concrete response to the non-uniformity problem mentioned in the abstract.

The main weakness is the lack of any supporting data. The abstract states that DOPD outperforms vanilla OPD and other baselines on LLM and VLM tasks and shows gains on stability, robustness, continual learning, and OOD, yet it gives no numbers, no baseline details, no ablation on the routing heuristic, and no statistical checks. Without those, it is impossible to judge whether the routing actually isolates transferable signals or whether any gains come from extra supervision volume or compute. The concern that tokens may be misclassified is therefore live: if the advantage gap and relative probabilities do not reliably mark the right distinction, the method could add conflicting objectives rather than fix the illusion.
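
One way to probe the volume-versus-routing question raised above is a matched random-routing control: keep the same number of teacher-routed and student-routed tokens but scramble which token gets which source. The sketch below is a hypothetical helper (`matched_random_routes` is not a name from the paper) and assumes routes are stored as a list of `"teacher"` / `"student"` labels, one per token.

```python
import random
from typing import List


def matched_random_routes(routes: List[str], seed: int = 0) -> List[str]:
    """Return a shuffled copy of the per-token route labels.

    The fraction of tokens sent to each source is preserved, so any
    difference from the original routing reflects which tokens were
    chosen rather than how much supervision each source provided.
    """
    shuffled = list(routes)                 # leave the input untouched
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    return shuffled


if __name__ == "__main__":
    original = ["teacher", "student", "student", "teacher", "student"]
    control = matched_random_routes(original, seed=1)
    print(control.count("teacher"), control.count("student"))
```

If a model trained with the shuffled routes matched one trained with the learned routes, that would suggest the gains come from supervision volume rather than from the routing decision itself.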

This is aimed at researchers working on distillation and efficient training of large language and vision-language models. A reader already thinking about on-policy methods might find the framing useful as a starting point, but the current version is too thin on evidence to stand on its own.

I would send it for peer review only after the authors add the missing experiments, ablations, and implementation details; the idea is worth checking but the claims need grounding first.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes DOPD, an advantage-aware dual on-policy distillation paradigm for LLMs and VLMs. It dynamically routes token-level supervision between privileged teacher and privileged student policies using advantage gap and relative probabilities to address 'privilege illusion' (conflation of transferable capability gaps with non-replicable information asymmetry). The paper claims DOPD outperforms Vanilla OPD and other methods, with further benefits on stability, robustness, continual learning, and OOD tasks.

Significance. If the routing rule reliably isolates replicable capability signals from privilege-only asymmetry without introducing new biases or instability, the approach could meaningfully advance on-policy distillation by leveraging dual privileged policies and token-level adaptivity. The emphasis on non-uniform token supervision is a relevant direction for large-model training.

major comments (2)

[Abstract] The claim that 'DOPD consistently outperforms Vanilla OPD and other counterparts' supplies no metrics, baselines, statistical details, ablation results, or experimental setup, which is load-bearing for the central experimental claim.
[Abstract] The routing mechanism is described only at a conceptual level ('dynamically routes token-level supervision ... based on their advantage gap and relative probabilities') with no equations, algorithm, or pseudocode, preventing assessment of whether the heuristic correctly separates capability-bearing tokens from information-asymmetry tokens.

minor comments (1)

[Abstract] The newly introduced term 'privilege illusion' is not formally defined or illustrated with an example, which reduces clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that the abstract can be strengthened for clarity and will revise it to better support the central claims while respecting length constraints. Below we address each point.


Referee: [Abstract] The claim that 'DOPD consistently outperforms Vanilla OPD and other counterparts' supplies no metrics, baselines, statistical details, ablation results, or experimental setup, which is load-bearing for the central experimental claim.

Authors: We agree the abstract claim would be more informative with supporting details. In the revision we will add concise quantitative indicators (e.g., average relative gains on the primary benchmarks) and name the main baselines and settings, while keeping the statement within abstract length limits.

revision: yes

Referee: [Abstract] The routing mechanism is described only at a conceptual level ('dynamically routes token-level supervision ... based on their advantage gap and relative probabilities') with no equations, algorithm, or pseudocode, preventing assessment of whether the heuristic correctly separates capability-bearing tokens from information-asymmetry tokens.

Authors: The abstract intentionally remains high-level. The full routing rule, advantage-gap formula, probability-based selection, and Algorithm 1 appear in Section 3. To address the concern we will insert a compact inline expression for the token-level routing decision in the revised abstract.

revision: yes

Circularity Check

0 steps flagged

No circularity: method defined conceptually without self-referential derivations or fitted predictions


The provided abstract and description introduce DOPD as a routing heuristic based on advantage gap and relative probabilities to address privilege illusion, but contain no equations, derivations, or first-principles claims that reduce to inputs by construction. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems are invoked. The central contribution is presented as an empirical method validated on LLM/VLM tasks rather than a mathematical chain that collapses to its own definitions. This is the common case of a self-contained algorithmic proposal without detectable circularity in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the approach is described conceptually without mathematical or implementation details.

pith-pipeline@v0.9.1-grok · 5804 in / 974 out tokens · 31751 ms · 2026-06-30T05:49:15.551150+00:00 · methodology


Reference graph

Works this paper leans on

84 extracted references · 39 canonical work pages · 31 internal anchors

[1]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations (ICLR), volume 2024, pages 21246–21263, 2024

2024
[2]

Aime problems and solutions, 2025

AIME. Aime problems and solutions, 2025. URLhttps://artofproblemsolving.com/wiki/index.php/AIME_ Problems_and_Solutions. 16

2025
[3]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Distillation scaling laws.arXiv preprint arXiv:2502.08606, 2025

Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, and Russ Webb. Distillation scaling laws.arXiv preprint arXiv:2502.08606, 2025

work page arXiv 2025
[5]

Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems (NeurIPS), 37:27056–27087, 2024

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems (NeurIPS), 37:27056–27087, 2024

2024
[6]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models.arXiv preprint arXiv:2505.22617, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Jiacai Liu, Zhuo Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes.arXiv preprint arXiv:2603.25562, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[8]

MiniLLM: Knowledge distillation of large language models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In International Conference on Learning Representations (ICLR), 2024

2024
[9]

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Skywork Open Reasoner 1 Technical Report

Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, et al. Skywork open reasoner 1 technical report.arXiv preprint arXiv:2505.22312, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[12]

Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

Wenjin Hou, Shangpin Peng, Weinong Wang, Zheng Ruan, Yue Zhang, Zhenglin Zhou, Mingqi Gao, Yifei Chen, Kaiqi Wang, Hongming Yang, et al. Uni-opd: Unifying on-policy distillation with a dual-perspective recipe. arXiv preprint arXiv:2605.03677, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[13]

Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. InFindings of the Association for Computational Linguistics: ACL 2023, pages 8003–8017, 2023

2023
[14]

C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Yao Fu, et al. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Advancesin Neural Information Processing Systems (NeurIPS), 36:62991–63010, 2023

2023
[15]

Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[16]

Livecodebench: Holistic and contamination free evaluation of large language models for code

Naman Jain, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In International Conference on Learning Representations (ICLR), volume 2025, pages 58791–58831, 2025

2025
[17]

Stable On-Policy Distillation through Adaptive Target Reformulation

Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, and Taesup Kim. Stable on-policy distillation through adaptive target reformulation.arXiv preprint arXiv:2601.07155, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[18]

Entropy-Aware On-Policy Distillation of Language Models

Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models.arXiv preprint arXiv:2603.07079, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Explain in your own words: Improving reasoning via token-selective dual knowledge distillation

Minsang Kim and Seung Jun Baek. Explain in your own words: Improving reasoning via token-selective dual knowledge distillation. InInternational Conference on Learning Representations (ICLR), 2026

2026
[20]

Sequence-level knowledge distillation

Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. In Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1317–1327, 2016

2016
[21]

DistiLLM-2: A contrastive approach boosts the distillation of LLMs

Jongwoo Ko, Tianyi Chen, Sungnyun Kim, Tianyu Ding, Luming Liang, Ilya Zharkov, and Se-Young Yun. DistiLLM-2: A contrastive approach boosts the distillation of LLMs. InInternational Conference on Machine Learning (ICML), 2025. 17

2025
[22]

Reopold: Reward-based on-policy distillation with mixture-based reward clipping.arXiv preprint arXiv:2603.11137, 2026

Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, and Pashmina Cameron. Scaling reasoning efficiently via relaxed on-policy distillation.arXiv preprint arXiv:2603.11137, 2026

work page arXiv 2026
[23]

On-policy distillation, 2025

Thinking Machines Lab. On-policy distillation, 2025. URLhttps://thinkingmachines.ai/blog/ on-policy-distillation

2025
[24]

Lavida: A large diffusion language model for multimodal under- standing

Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, and Aditya Grover. Lavida: A large diffusion language model for multimodal under- standing. Advancesin Neural Information Processing Systems (NeurIPS), 38:105101–105134, 2026

2026
[25]

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, et al. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe.arXiv preprint arXiv:2604.13016, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[26]

Small models struggle to learn from strong reasoners

Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, and Radha Poovendran. Small models struggle to learn from strong reasoners. InFindings of the Association for Computational Linguistics: ACL 2025, pages 25366–25394, 2025

2025
[27]

Zebralogic: On the scaling limits of llms for logical reasoning.arXiv preprint arXiv:2502.01100, 2025

Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. Zebralogic: On the scaling limits of llms for logical reasoning.arXiv preprint arXiv:2502.01100, 2025

work page arXiv 2025
[28]

Visual-Advantage On-Policy Distillation for Vision-Language Models

Ruiqi Liu, Xiaolei Lv, Gengsheng Li, Ximo Zhu, Zhiheng Wang, Zhengbo Zhang, Junkai Chen, Zhiheng Li, Bo Li, Jun Gao, et al. Visual-advantage on-policy distillation for vision-language models. arXiv preprint arXiv:2605.21924, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[29]

Introducing gpt-5.4, 2026

OpenAI. Introducing gpt-5.4, 2026. URLhttps://openai.com/index/introducing-gpt-5-4

2026
[30]

Privileged Information Distillation for Language Models

Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models.arXiv preprint arXiv:2602.04942, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[31]

Near-Future Policy Optimization

Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, and Jiaqi Wang. Near-future policy optimization.arXiv preprint arXiv:2604.20733, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[32]

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter.arXiv preprint arXiv:1910.01108, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910
[33]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Self-Distillation Enables Continual Learning

Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[35]

A Survey of On-Policy Distillation for Large Language Models

Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models.arXiv preprint arXiv:2604.00626, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[36]

Gates: Self-distillation under privileged context with consensus gating

Alex Stein, Furong Huang, and Tom Goldstein. Gates: Self-distillation under privileged context with consensus gating. arXiv preprint arXiv:2602.20574, 2026

work page arXiv 2026
[37]

Measuring multimodal mathematical reasoning with math-vision dataset.Advances in Neural Information Processing Systems (NeurIPS), 37:95095–95169, 2024

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset.Advances in Neural Information Processing Systems (NeurIPS), 37:95095–95169, 2024

2024
[38]

Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

Yuanyi Wang, Su Lu, Yanggan Gu, Pengkai Wang, Yifan Yang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, and Hongxia Yang. Not all disagreement is learnable: Token teachability in on-policy distillation.arXiv preprint arXiv:2605.26844, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, et al. Livebench: A challenging, contamination-free llm benchmark. arXiv preprint arXiv:2406.19314, 4:2, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Yecheng Wu, Song Han, and Hai Cai. Lightning opd: Efficient post-training for large reasoning models with offline on-policy distillation.arXiv preprint arXiv:2604.13010, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[41]

Realworldqa: A benchmark for real-world spatial understanding, 2024

xAI. Realworldqa: A benchmark for real-world spatial understanding, 2024. URLhttps://huggingface.co/ datasets/xai-org/RealworldQA. 18

2024
[42]

MiMo-V2-Flash Technical Report

Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. Mimo-v2-flash technical report.arXiv preprint arXiv:2601.02780, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[43]

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts.arXiv preprint arXiv:2407.04973, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

DeepSeek-V4: Towards highly eﬀicient million-token context intelligence,

Anyi Xu, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chenchen Ling, et al. Deepseek-v4: Towards highly efficient million-token context intelligence. arXiv preprint arXiv:2606.19348, 2026

work page arXiv 2026
[45]

Speculative knowledge distillation: Bridging the teacher-student gap through interleaved sampling

Wenda Xu, Rujun Han, Zifeng Wang, Long Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-Yu Lee, and Tomas Pfister. Speculative knowledge distillation: Bridging the teacher-student gap through interleaved sampling. InInternational Conference on Learning Representations (ICLR), 2025

2025
[46]

TIP: Token Importance in On-Policy Distillation

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, and Alborz Geramifard. Tip: Token importance in on-policy distillation.arXiv preprint arXiv:2604.14084, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[47]

Patil, Ion Stoica, and Joseph E.Gonzalez

Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Shishir G. Patil, Ion Stoica, and Joseph E.Gonzalez. Berkeley function calling leaderboard, 2024. URLhttps://gorilla.cs.berkeley.edu/blogs/8_ berkeley_function_calling_leaderboard.html

2024
[48]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

Self-Distilled RLVR

Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled rlvr.arXiv preprint arXiv:2604.03128, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[50]

Thinking in space: How multimodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Conference (CVPR), pages 10632–10643, 2025

2025
[51]

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation.arXiv preprint arXiv:2602.12125, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[52]

Joyai-vl-interaction: Real-time vision-language interaction intelligence

Dingyu Yao, Junhao Zhou, Chenxu Yang, Chuanyu Qin, Haowen Hou, Zheming Liang, Congcong Wang, Yuhang Cao, Shenglong Ye, Shuai Xie, et al. Joyai-vl-interaction: Real-time vision-language interaction intelligence. arXiv preprint arXiv:2606.14777, 2026

work page arXiv 2026
[53]

On-Policy Context Distillation for Language Models

Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models. arXiv preprint arXiv:2602.12275, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[54]

Dapo: An open-source llm reinforcement learning system at scale.Advances in Neural Information Processing Systems (NeurIPS), 38:113222–113244, 2026

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.Advances in Neural Information Processing Systems (NeurIPS), 38:113222–113244, 2026

2026
[55]

The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook

Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, et al. The latent space: Foundation, evolution, mechanism, ability, and outlook.arXiv preprint arXiv:2604.02029, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[56]

Vismem: Latent vision memory unlocks potential of vision-language models

Xinlei Yu, Chengming Xu, Guibin Zhang, Zhangquan Chen, Yudong Zhang, Yongbo He, Peng-Tao Jiang, Jiangn- ing Zhang, Xiaobin Hu, and Shuicheng Yan. Vismem: Latent vision memory unlocks potential of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 31544–31555, 2026

2026
[57]

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun, Xianpei Han, and Yaojie Lu. Vision-opd: Learning to see fine details for multimodal llms via on-policy self-distillation.arXiv preprint arXiv:2605.18740, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[58]

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9556–9567, 2024.
[59]

Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL: Long Papers), pages 15134–15186, 2025.
[60]

Dongxu Zhang, Zhichao Yang, Sepehr Janghorbani, Jun Han, Andrew Ressler II, Qian Qian, Gregory D Lyng, Sanjit Singh Batra, and Robert E Tillman. Fast and effective on-policy distillation from reasoning prefixes. arXiv preprint arXiv:2602.15260, 2026.
[61]

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026.
[62]

Qin Zhu, Fei Huang, Runyu Peng, Keming Lu, Bowen Yu, Qinyuan Cheng, Xipeng Qiu, Xuanjing Huang, and Junyang Lin. AutoLogi: Automated generation of logic puzzles for evaluating reasoning abilities of large language models. arXiv preprint arXiv:2502.16906, 2025.
[63]

Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. DynaMath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. In International Conference on Learning Representations (ICLR), volume 2025, pages 48337–48383, 2025.

Appendix A Details of Privileged Input

Original Input: Factor the f...
Check whether the quadratic has a common numerical factor, if so, simplify quadratic

Use the coefficient structure to decide which appropriate factorization strategy

Identify the needed pairwise relationship between two numbers for the middle-term split

Indicate that the remaining expression can be factored into two linear binomials.

Case 2 (LLM-based)

Original Input: Suppose I have a physical, solid square pyramid. The bottom square has vertices A, B, C, D, and the final vertex is E. Then I make a cut through the plane defined by ACE. There are now two pieces. What are the pieces? Are they tetrahedra, s...
Identify the plane determined by the two opposite base vertices and the apex

Observe how this plane intersects the square base along a diagonal

Use that diagonal to partition the base into two congruent triangular regions

Extend each triangular base region to the common apex to determine the corresponding three-dimensional subsolid

Compare each resulting piece by its vertices, edges, and triangular faces.

Case 1 (LLM-based)

Original Input: Consider all words constituted by eight letters from $\{C, H, M, O\}$. We arrange the words in an alphabet sequence. Precisely, the first word is $CCCCCCCC$, the second one is $CCCCCCCH$, the third is $CCCCCCCM$, the fourth one is $CCCCCCCO, ......
Recognize that the alphabetic ordering induces a four-symbol positional system

Assign each letter an ordered digit according to this alphabet

Convert the requested ordinal position to a zero-based rank before processing

Express this rank as an eight-place base-four representation, preserving leading positions

Translate each base-four digit back to its corresponding letter.

Case 3 (LLM-based)

Figure 11: Demonstrations of LLM-based privileged input.

Original Input: Privileged Input: Which is the main topic of the image: A: A woman surfing, B: A man skating, C: A man surfing, D: A woman skiting.

Case 1 (VLM-based)

Original Input: Privileged Input: What color are...
Please add multiple boxes if necessary

Please generate both the object label and quadruple coordinates

Please output only valid JSON format without any other redundant content.

Output format: [ {"label": "object", "bbox": [x1, y1, x2, y2]} ]

Given a question, and corresponding ground-truth label. Query: {Query} Label: {Label} Add necessary step-wise decomposition hints that support the answer. Rules:
