
arxiv: 2605.21924 · v1 · pith:HY2HXWVXnew · submitted 2026-05-21 · 💻 cs.CV

Visual-Advantage On-Policy Distillation for Vision-Language Models

Pith reviewed 2026-05-22 07:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual advantage · on-policy distillation · vision-language models · knowledge distillation · mathematical reasoning · visual understanding · Qwen3-VL

The pith

Visual-advantage on-policy distillation improves vision-language models by prioritizing tokens that depend on visual input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard on-policy distillation for vision-language models often improves output quality without making the student rely more on visual details, even when the teacher does. The paper defines visual advantage as the token-level difference in the teacher's log-probability for student-generated text when fine-grained visual information is present versus absent. High visual-advantage tokens carry the actual visual supervision signal but are rare, so the method reweights entire rollouts by their average visual advantage and computes the KL loss separately for high- and low-advantage token groups. This produces consistent gains over baseline distillation on math and visual benchmarks, with larger improvements when the teacher is bigger or the dataset is larger. A sympathetic reader would care because the approach shows how to prevent language scaffolding from drowning out the visual signal during knowledge transfer.

Core claim

We propose Visual-Advantage On-Policy Distillation (VA-OPD), which uses VA at two granularities: rollout-level reweighting by trajectory-averaged VA, and token-level KL averaged within high-VA and low-VA groups separately. VA-OPD improves over standard on-policy distillation on every benchmark, with the gain growing monotonically along both the teacher-size and data-scale axes.

What carries the argument

Visual advantage (VA), the token-level log-probability difference when the teacher scores a student-generated rollout with versus without fine-grained visual detail, which identifies the sparse visual supervision signal and enables separate handling in the distillation objective.
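The definition above can be sketched numerically. The snippet below is a minimal illustration, assuming the teacher's per-token log-probabilities for one student rollout have already been computed under both conditions. The function name `visual_advantage` and the example numbers are hypothetical and not taken from the paper.

```python
from typing import List


def visual_advantage(logp_with: List[float], logp_without: List[float]) -> List[float]:
    """Per-token VA: teacher log-prob with visual detail minus without.

    Both lists are assumed to score the same student-generated tokens,
    position by position (hypothetical inputs, not the paper's code).
    """
    if len(logp_with) != len(logp_without):
        raise ValueError("both scorings must cover the same tokens")
    return [w - wo for w, wo in zip(logp_with, logp_without)]


# Illustrative values only: the third token becomes far more likely once
# the teacher can see the image, so it receives a large VA.
va = visual_advantage([-0.2, -0.5, -0.1], [-0.3, -0.5, -4.1])
```

In this toy rollout only the last token has a VA well above zero, which is the kind of sparse, vision-dependent position the method is meant to single out.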

If this is right

  • VA-OPD strengthens the student's reliance on visual input specifically for vision-critical tokens.
  • Performance improves on every benchmark covering mathematical reasoning and visual understanding.
  • The size of the improvement increases as teacher size grows from 4B to 32B parameters.
  • The size of the improvement increases as training data scale increases on Geometry3K and ViRL39K.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar advantage-based separation of tokens could be tested in distillation for other multimodal tasks where one modality provides sparse but critical signals.
  • The monotonic scaling with teacher size suggests the method may yield even larger relative gains for future frontier VLMs.
  • Treating visual and language tokens differently during training may inform non-distillation approaches that aim to increase visual grounding in VLMs.

Load-bearing premise

The high-VA tokens identified by the teacher's log-probability difference truly isolate the visual supervision signal, and separating them in the objective neither creates new biases nor reduces training stability.

What would settle it

Retraining the student with VA-OPD and finding no gain over standard on-policy distillation on the eight benchmarks, or finding that student predictions on high-VA tokens remain largely unchanged when visual input is removed.
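The second check could be run roughly as follows. This is a sketch under stated assumptions: the teacher's per-token VA and the student's per-token log-probabilities with and without the image are taken as given, and the function name, the threshold value, and the numbers are all hypothetical.

```python
from typing import Sequence


def student_shift_on_high_va(
    teacher_va: Sequence[float],
    student_logp_with: Sequence[float],
    student_logp_without: Sequence[float],
    threshold: float,
) -> float:
    """Mean student log-prob change (with minus without image) on high-VA tokens.

    "High-VA" means teacher VA above `threshold`; the threshold is an
    assumption here, not a value reported on this page.
    """
    shifts = [
        sw - so
        for v, sw, so in zip(teacher_va, student_logp_with, student_logp_without)
        if v > threshold
    ]
    if not shifts:
        raise ValueError("no token exceeds the VA threshold")
    return sum(shifts) / len(shifts)


# Illustrative values only: the teacher flags tokens 1 and 3 as vision-critical,
# yet the student's log-probs on them barely move when the image is removed.
shift = student_shift_on_high_va(
    teacher_va=[0.0, 3.5, 0.1, 4.2],
    student_logp_with=[-0.4, -1.0, -0.3, -0.8],
    student_logp_without=[-0.4, -1.1, -0.3, -0.9],
    threshold=1.0,
)
```

A shift close to zero on tokens where the teacher's VA is large would indicate that the student is not using the image at the positions that matter.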

Figures

Figures reproduced from arXiv: 2605.21924 by Bo Li, Gengsheng Li, Jun Gao, Junkai Chen, Ruiqi Liu, Shu Wu, Xiaolei Lv, Ximo Zhu, Zhengbo Zhang, Zhiheng Li, Zhiheng Wang.

Figure 1: Token-level visual reliance in VLM distillation. Left: in a student rollout, only a minority of tokens (red) depend on the image, while the rest are language-template tokens (gray). Standard OPD applies a uniform KL weight to all tokens, diluting the learning signal on vision-critical positions. Right: teacher-scored visual advantage (VA), averaged over the student’s rollouts, tracked across training, for …

Figure 2: Motivating observations on student-generated rollouts. (a) VA is heavy-tailed: the top 10% of tokens (right of the dashed line) carry ∼93% of total VA mass. (b) MathVerse training curves for Standard OPD and three 10%-token-masking variants; high-VA masking significantly drops accuracy, while low-VA or random masking is harmless. Observation 2: High-VA tokens carry the visual supervision signal. We design …

Figure 3: Overview of Visual-Advantage On-Policy Distillation (VA-OPD). The teacher scores each token with and without fine-grained visual detail; the log-prob difference gives the per-token visual advantage (VA). VA drives reverse-KL distillation at two granularities: rollout-level reweighting by relative VA (what to learn) and token-level KL split into high- and low-VA groups (how to learn). 3.2 Token-Level Groupe…

Figure 4: Ablation and training-trajectory analysis. (a) Per-benchmark ∆ over Standard OPD for two VA-OPD components alone and combined, separating their individual and joint contributions: rollout-level reweighting only, token-level grouped KL only, and full VA-OPD. (b) Training trajectories in (mean VA over all tokens, MathVerse accuracy) space throughout training for Standard OPD vs. VA-OPD students. Circles ma…

Figure 6: Token-level VA visualization on a representative geometry rollout. Each token is colored by its VA under three students at different stages: the initial student (Base, before distillation), the Standard OPD student, and the VA-OPD student (darker = higher VA_t). The visually critical tokens, namely the numerical values read from the diagram, start pale under the initial student, remain pale under Standard O…

Figure 5: MathVerse accuracy vs. training time on 8×A100 (Qwen3-VL-8B→2B, Geo3K). The horizontal arrow marks how much faster VA-OPD reaches Standard OPD’s final accuracy.
Original abstract

On-policy knowledge distillation has proven effective for language models, yet its application to vision-language models (VLMs) remains underexplored. We observe that standard on-policy distillation can improve a student's output quality while failing to strengthen its reliance on visual input: on vision-critical tokens, the student's predictions remain largely unchanged whether or not fine-grained visual detail is present, even though the teacher's predictions depend heavily on it.

To make this difference observable, we introduce visual advantage (VA), the token-level log-probability difference when the teacher scores a student-generated rollout with versus without access to fine-grained visual detail. VA is concentrated in a small minority of tokens, and these high-VA tokens are the ones that actually carry the visual supervision signal. This motivates a distillation objective that treats them differently from language scaffolding, so their contribution is not diluted by the abundant surrounding language tokens.

We propose Visual-Advantage On-Policy Distillation (VA-OPD), which uses VA at two granularities: rollout-level reweighting by trajectory-averaged VA, and token-level KL averaged within high-VA and low-VA groups separately. We train on two math datasets (Geometry3K and ViRL39K) and evaluate on eight benchmarks covering both mathematical reasoning and visual understanding, across three teacher sizes (4B, 8B, and 32B) on the Qwen3-VL family. VA-OPD improves over standard on-policy distillation on every benchmark, with the gain growing monotonically along both the teacher-size and data-scale axes, suggesting that these factors compound consistently.
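The two granularities named in the abstract can be illustrated with a small sketch. It takes per-token KL values and per-token VA as given numbers and is not the authors' implementation. Several choices are assumptions made only for illustration: the softmax form of the rollout weights, the equal weighting of the two group means, the handling of empty groups, and the threshold value. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def _mean(xs: Sequence[float]) -> float:
    # An empty group contributes zero (assumption).
    return sum(xs) / len(xs) if xs else 0.0


def grouped_kl(kl: Sequence[float], va: Sequence[float], threshold: float) -> float:
    """Average per-token KL within the high-VA and low-VA groups separately.

    Adding the two group means with equal weight is an assumption.
    """
    high = [k for k, v in zip(kl, va) if v > threshold]
    low = [k for k, v in zip(kl, va) if v <= threshold]
    return _mean(high) + _mean(low)


def rollout_weights(mean_vas: Sequence[float]) -> List[float]:
    """Softmax over trajectory-averaged VA, rescaled to average 1 (assumed form)."""
    m = max(mean_vas)
    exps = [math.exp(v - m) for v in mean_vas]
    total = sum(exps)
    return [len(exps) * e / total for e in exps]


def va_opd_loss(batch_kl, batch_va, threshold: float) -> float:
    """Rollout-weighted mean of the grouped KL over a batch of rollouts."""
    mean_vas = [sum(va) / len(va) for va in batch_va]
    weights = rollout_weights(mean_vas)
    losses = [grouped_kl(kl, va, threshold) for kl, va in zip(batch_kl, batch_va)]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


# Illustrative values only: rollout 0 has one vision-critical token with a
# large KL; rollout 1 has no high-VA token at all.
batch_kl = [[0.1, 0.1, 0.9, 0.1], [0.2, 0.2, 0.2, 0.2]]
batch_va = [[0.0, 0.0, 4.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = va_opd_loss(batch_kl, batch_va, threshold=1.0)
```

With a uniform token average, the single high-VA token in rollout 0 would contribute a quarter of that rollout's loss. Under the grouped average it contributes a full group mean, which is the dilution effect the abstract describes.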

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Visual-Advantage On-Policy Distillation (VA-OPD) for vision-language models. It defines visual advantage (VA) as the per-token log-probability difference between the teacher scoring a student-generated rollout with versus without fine-grained visual input. VA-OPD applies rollout-level reweighting by average VA and token-level KL divergence averaged separately over high-VA and low-VA groups. Experiments train on Geometry3K and ViRL39K with Qwen3-VL teachers (4B/8B/32B) and report consistent gains over standard on-policy distillation across eight mathematical-reasoning and visual-understanding benchmarks, with the improvement growing monotonically as teacher size and data scale increase.

Significance. If the VA metric reliably isolates visual supervision without confounding input-format effects, the method offers a practical way to strengthen visual grounding during distillation. The reported monotonic scaling of gains with both teacher capacity and data volume would be a useful empirical observation for VLM training pipelines. The work also supplies a concrete, reproducible recipe (rollout generation, VA thresholding, grouped KL) that could be directly tested by others.

major comments (2)
  1. [Method (VA computation)] Method section (VA definition and rollout scoring): the construction of VA as teacher log-prob difference on student rollouts with vs. without fine-grained visual detail assumes that the 'without' condition affects only vision-critical tokens. No ablation is shown that rules out global changes in attention patterns, sequence length, or prompt embedding that could make VA reflect input-format artifacts rather than pure visual dependence. This directly affects whether the grouped KL objective strengthens visual reliance or optimizes a confounded signal.
  2. [Experiments] Experiments section (baseline controls and statistical reporting): the abstract and results claim consistent improvements and monotonic scaling, yet no details are provided on rollout generation procedure, exact VA threshold for high/low grouping, number of random seeds, or statistical significance tests. Without these, it is impossible to assess whether the reported gains are robust or could be explained by variance in on-policy sampling.
minor comments (2)
  1. [Results tables/figures] Table captions and axis labels should explicitly state the teacher model sizes and data scales used for each curve so that the monotonic-gain claim can be verified at a glance.
  2. [Method] The paper should add a short paragraph clarifying how the 'no fine-grained detail' input is constructed (e.g., blank image, low-resolution, or text-only prompt) to allow exact reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We respond to each major comment below, providing clarifications and indicating revisions that will be incorporated to improve reproducibility and address potential concerns about the VA metric.

Point-by-point responses
  1. Referee: [Method (VA computation)] Method section (VA definition and rollout scoring): the construction of VA as teacher log-prob difference on student rollouts with vs. without fine-grained visual detail assumes that the 'without' condition affects only vision-critical tokens. No ablation is shown that rules out global changes in attention patterns, sequence length, or prompt embedding that could make VA reflect input-format artifacts rather than pure visual dependence. This directly affects whether the grouped KL objective strengthens visual reliance or optimizes a confounded signal.

    Authors: We agree that the 'without' condition must be carefully controlled to isolate visual dependence. In the current implementation, the without-visual variant replaces the image input with a fixed black placeholder while keeping the textual prompt, token sequence, and embedding dimensions identical, thereby eliminating sequence-length and prompt-embedding differences. This design choice ensures that any log-probability shift arises from the absence of visual features rather than format changes. Nevertheless, we acknowledge that attention-pattern shifts could still occur and will add a targeted ablation in the revised manuscript: we recompute VA using two alternative 'without' conditions (zero-image placeholder versus heavily blurred low-resolution image) and demonstrate that the set of high-VA tokens remains largely consistent and aligns with vision-critical reasoning steps. We will also report the average attention entropy difference between the two conditions to further support that VA primarily captures visual reliance rather than global artifacts. revision: yes

  2. Referee: [Experiments] Experiments section (baseline controls and statistical reporting): the abstract and results claim consistent improvements and monotonic scaling, yet no details are provided on rollout generation procedure, exact VA threshold for high/low grouping, number of random seeds, or statistical significance tests. Without these, it is impossible to assess whether the reported gains are robust or could be explained by variance in on-policy sampling.

    Authors: We concur that these experimental details are essential for assessing robustness. The revised manuscript will explicitly state the rollout generation procedure: nucleus sampling with p=0.9 and temperature=0.7, maximum generation length 1024 tokens, and rejection of rollouts shorter than 50 tokens. The VA threshold for high/low grouping is the per-batch median of token-level VA values, chosen to produce balanced groups without introducing an arbitrary hyperparameter. All main results are averaged over five independent random seeds with standard deviations reported; we will additionally include paired statistical tests (Wilcoxon signed-rank) showing p<0.05 for VA-OPD versus standard on-policy distillation on every benchmark. These specifications, together with the exact data splits and teacher checkpoint versions, will be added to Section 4 and a new reproducibility appendix. revision: yes
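Two of the checks described in the simulated responses can be sketched in a few lines: the per-batch median split into high-VA and low-VA tokens, and a comparison of the resulting high-VA sets under two different "without" conditions. The Jaccard overlap is one possible way to quantify "largely consistent" and is an assumption, as are the function names, the tie-breaking rule, and the numbers.

```python
from statistics import median
from typing import Sequence, Set, Tuple

Position = Tuple[int, int]  # (rollout index, token index)


def median_split(batch_va: Sequence[Sequence[float]]) -> Tuple[float, Set[Position]]:
    """Return the batch-wide median VA and the positions strictly above it.

    Tokens exactly at the median fall in the low group (assumption).
    """
    flat = [v for rollout in batch_va for v in rollout]
    thr = median(flat)
    high = {
        (r, t)
        for r, rollout in enumerate(batch_va)
        for t, v in enumerate(rollout)
        if v > thr
    }
    return thr, high


def jaccard(a: Set[Position], b: Set[Position]) -> float:
    """Overlap of two high-VA position sets, 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Illustrative values only: VA for the same two rollouts computed against a
# black placeholder and against a heavily blurred image.
va_black = [[0.1, 3.9, 0.0], [2.8, 0.2, 0.3]]
va_blur = [[0.0, 3.1, 0.1], [0.1, 0.4, 0.2]]

thr_black, high_black = median_split(va_black)
thr_blur, high_blur = median_split(va_blur)
overlap = jaccard(high_black, high_blur)
```

An overlap near 1 would support the claim that the high-VA set does not depend much on how the image is ablated, while a low overlap would point toward the input-format confound raised by the referee.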

Circularity Check

0 steps flagged

No significant circularity; VA is an independent measurement applied to the objective

Full rationale

The paper defines visual advantage (VA) directly from the teacher's log-probability difference on student rollouts with versus without fine-grained visual input. This quantity is then used for rollout-level reweighting and separate high/low-VA token KL terms in the distillation loss. The construction does not reduce to a fitted parameter renamed as prediction, nor does any equation equate the output to the input by definition. No self-citation chains, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing premises in the method. Empirical improvements are reported on external benchmarks rather than derived from the VA definition itself. The central claim therefore remains self-contained against the described inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The method rests on standard assumptions of on-policy distillation and teacher scoring but introduces VA as a derived quantity without explicit free parameters or new axioms stated in the abstract.



