Visual-Advantage On-Policy Distillation for Vision-Language Models
Pith reviewed 2026-05-22 07:20 UTC · model grok-4.3
The pith
Visual-advantage on-policy distillation improves vision-language models by prioritizing tokens that depend on visual input.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose Visual-Advantage On-Policy Distillation (VA-OPD), which uses VA at two granularities: rollout-level reweighting by trajectory-averaged VA, and token-level KL averaged within high-VA and low-VA groups separately. VA-OPD improves over standard on-policy distillation on every benchmark, with the gain growing monotonically along both the teacher-size and data-scale axes.
What carries the argument
Visual advantage (VA), the token-level log-probability difference when the teacher scores a student-generated rollout with versus without fine-grained visual detail, which identifies the sparse visual supervision signal and enables separate handling in the distillation objective.
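The definition above can be written out as a short sketch. It assumes the teacher has already scored the same student rollout twice, once with and once without fine-grained visual detail, and that both passes return one log-probability per rollout token. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
from typing import List


def visual_advantage(logp_with: List[float], logp_without: List[float]) -> List[float]:
    """Per-token VA: teacher log-prob with visual detail minus log-prob without it."""
    if len(logp_with) != len(logp_without):
        raise ValueError("both scoring passes must cover the same rollout tokens")
    return [a - b for a, b in zip(logp_with, logp_without)]


def trajectory_va(va: List[float]) -> float:
    """Trajectory-averaged VA for a single rollout."""
    return sum(va) / len(va)


# Toy values (hypothetical): only the second token depends strongly on the image.
logp_with = [-0.10, -0.20, -0.05, -0.30]
logp_without = [-0.12, -2.20, -0.05, -0.35]
va = visual_advantage(logp_with, logp_without)
```

In this toy rollout a single token carries almost all of the VA mass, which is the sparse pattern the paper describes.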
If this is right
- VA-OPD strengthens the student's reliance on visual input specifically for vision-critical tokens.
- Performance improves on every benchmark covering mathematical reasoning and visual understanding.
- The size of the improvement increases as teacher size grows from 4B to 32B parameters.
- The size of the improvement increases as training data scale increases on Geometry3K and ViRL39K.
Where Pith is reading between the lines
- Similar advantage-based separation of tokens could be tested in distillation for other multimodal tasks where one modality provides sparse but critical signals.
- The monotonic scaling with teacher size suggests the method may yield even larger relative gains for future frontier VLMs.
- Treating visual and language tokens differently during training may inform non-distillation approaches that aim to increase visual grounding in VLMs.
Load-bearing premise
High-VA tokens identified by the teacher log-probability difference truly isolate the visual supervision signal, and handling them separately in the objective neither introduces new biases nor reduces training stability.
What would settle it
Retraining the student with VA-OPD and finding no gain over standard on-policy distillation on the eight benchmarks, or finding that student predictions on high-VA tokens remain largely unchanged when visual input is removed.
Original abstract
On-policy knowledge distillation has proven effective for language models, yet its application to vision-language models (VLMs) remains underexplored. We observe that standard on-policy distillation can improve a student's output quality while failing to strengthen its reliance on visual input: on vision-critical tokens, the student's predictions remain largely unchanged whether or not fine-grained visual detail is present, even though the teacher's predictions depend heavily on it.
To make this difference observable, we introduce visual advantage (VA), the token-level log-probability difference when the teacher scores a student-generated rollout with versus without access to fine-grained visual detail. VA is concentrated in a small minority of tokens, and these high-VA tokens are the ones that actually carry the visual supervision signal. This motivates a distillation objective that treats them differently from language scaffolding, so their contribution is not diluted by the abundant surrounding language tokens.
We propose Visual-Advantage On-Policy Distillation (VA-OPD), which uses VA at two granularities: rollout-level reweighting by trajectory-averaged VA, and token-level KL averaged within high-VA and low-VA groups separately. We train on two math datasets (Geometry3K and ViRL39K) and evaluate on eight benchmarks covering both mathematical reasoning and visual understanding, across three teacher sizes (4B, 8B, and 32B) on the Qwen3-VL family. VA-OPD improves over standard on-policy distillation on every benchmark, with the gain growing monotonically along both the teacher-size and data-scale axes, suggesting that these factors compound consistently.
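The two granularities described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so several choices here are assumptions made only for illustration: per-token KL values are taken as precomputed inputs (the KL direction is left open), the two group means are summed with equal weight, and rollout weights are trajectory-averaged VA clipped at zero and normalised to mean one over the batch. All function names are hypothetical.

```python
from statistics import mean
from typing import List


def grouped_kl(token_kl: List[float], token_va: List[float], threshold: float) -> float:
    """Average KL within the high-VA and low-VA groups separately, then sum.

    Averaging per group keeps a few high-VA tokens from being diluted
    by the many low-VA tokens around them.
    """
    high = [k for k, v in zip(token_kl, token_va) if v > threshold]
    low = [k for k, v in zip(token_kl, token_va) if v <= threshold]
    return sum(mean(group) for group in (high, low) if group)


def rollout_weights(traj_va: List[float]) -> List[float]:
    """Assumed weighting: clip at zero, normalise so weights average to 1."""
    clipped = [max(v, 0.0) for v in traj_va]
    total = sum(clipped)
    if total == 0.0:
        return [1.0] * len(traj_va)  # fall back to uniform weights
    return [len(traj_va) * c / total for c in clipped]


def va_opd_loss(batch_kl: List[List[float]], batch_va: List[List[float]],
                threshold: float) -> float:
    """Rollout-weighted mean of the grouped token-level KL."""
    weights = rollout_weights([mean(va) for va in batch_va])
    losses = [grouped_kl(kl, va, threshold) for kl, va in zip(batch_kl, batch_va)]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)
```

For a rollout with per-token KL `[1.0, 3.0, 1.0, 1.0]` where only the second token is high-VA, the plain token mean is 1.5, while the grouped form gives 3.0 + 1.0 = 4.0, so the high-VA token contributes as much as all the low-VA tokens combined.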
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Visual-Advantage On-Policy Distillation (VA-OPD) for vision-language models. It defines visual advantage (VA) as the per-token log-probability difference between the teacher scoring a student-generated rollout with versus without fine-grained visual input. VA-OPD applies rollout-level reweighting by average VA and token-level KL divergence averaged separately over high-VA and low-VA groups. Experiments train on Geometry3K and ViRL39K with Qwen3-VL teachers (4B/8B/32B) and report consistent gains over standard on-policy distillation across eight mathematical-reasoning and visual-understanding benchmarks, with the improvement growing monotonically as teacher size and data scale increase.
Significance. If the VA metric reliably isolates visual supervision without confounding input-format effects, the method offers a practical way to strengthen visual grounding during distillation. The reported monotonic scaling of gains with both teacher capacity and data volume would be a useful empirical observation for VLM training pipelines. The work also supplies a concrete, reproducible recipe (rollout generation, VA thresholding, grouped KL) that could be directly tested by others.
major comments (2)
- [Method (VA computation)] Method section (VA definition and rollout scoring): the construction of VA as teacher log-prob difference on student rollouts with vs. without fine-grained visual detail assumes that the 'without' condition affects only vision-critical tokens. No ablation is shown that rules out global changes in attention patterns, sequence length, or prompt embedding that could make VA reflect input-format artifacts rather than pure visual dependence. This directly affects whether the grouped KL objective strengthens visual reliance or optimizes a confounded signal.
- [Experiments] Experiments section (baseline controls and statistical reporting): the abstract and results claim consistent improvements and monotonic scaling, yet no details are provided on rollout generation procedure, exact VA threshold for high/low grouping, number of random seeds, or statistical significance tests. Without these, it is impossible to assess whether the reported gains are robust or could be explained by variance in on-policy sampling.
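One way to probe the first major comment is to compute VA under two different "without" conditions and measure how much the resulting high-VA token sets overlap. The sketch below is a hypothetical check, not a procedure from the paper. It assumes a median split per rollout to define the high-VA set and uses Jaccard overlap as the agreement measure. The VA values are toy numbers.

```python
from statistics import median
from typing import List, Set


def high_va_indices(va: List[float]) -> Set[int]:
    """Indices of tokens whose VA is strictly above the median (assumed split)."""
    threshold = median(va)
    return {i for i, v in enumerate(va) if v > threshold}


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard overlap of two index sets; defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy VA values (hypothetical) for one rollout under two "without" conditions.
va_black = [0.0, 2.0, 0.1, 1.5, 0.0, 0.05]   # image replaced by a black placeholder
va_blur = [0.0, 1.2, 0.0, 0.9, 0.3, 0.0]     # image heavily blurred
overlap = jaccard(high_va_indices(va_black), high_va_indices(va_blur))
```

An overlap near 1 across many rollouts would suggest the high-VA set is stable under the choice of degraded input. A low overlap would point toward the input-format confound the referee raises.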
minor comments (2)
- [Results tables/figures] Table captions and axis labels should explicitly state the teacher model sizes and data scales used for each curve so that the monotonic-gain claim can be verified at a glance.
- [Method] The paper should add a short paragraph clarifying how the 'no fine-grained detail' input is constructed (e.g., blank image, low-resolution, or text-only prompt) to allow exact reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We respond to each major comment below, providing clarifications and indicating revisions that will be incorporated to improve reproducibility and address potential concerns about the VA metric.
- Referee: [Method (VA computation)] Method section (VA definition and rollout scoring): the construction of VA as teacher log-prob difference on student rollouts with vs. without fine-grained visual detail assumes that the 'without' condition affects only vision-critical tokens. No ablation is shown that rules out global changes in attention patterns, sequence length, or prompt embedding that could make VA reflect input-format artifacts rather than pure visual dependence. This directly affects whether the grouped KL objective strengthens visual reliance or optimizes a confounded signal.
  Authors: We agree that the 'without' condition must be carefully controlled to isolate visual dependence. In the current implementation, the without-visual variant replaces the image input with a fixed black placeholder while keeping the textual prompt, token sequence, and embedding dimensions identical, thereby eliminating sequence-length and prompt-embedding differences. This design choice ensures that any log-probability shift arises from the absence of visual features rather than format changes. Nevertheless, we acknowledge that attention-pattern shifts could still occur and will add a targeted ablation in the revised manuscript: we recompute VA using two alternative 'without' conditions (zero-image placeholder versus heavily blurred low-resolution image) and demonstrate that the set of high-VA tokens remains largely consistent and aligns with vision-critical reasoning steps. We will also report the average attention entropy difference between the two conditions to further support that VA primarily captures visual reliance rather than global artifacts.
  Revision: yes
- Referee: [Experiments] Experiments section (baseline controls and statistical reporting): the abstract and results claim consistent improvements and monotonic scaling, yet no details are provided on rollout generation procedure, exact VA threshold for high/low grouping, number of random seeds, or statistical significance tests. Without these, it is impossible to assess whether the reported gains are robust or could be explained by variance in on-policy sampling.
  Authors: We concur that these experimental details are essential for assessing robustness. The revised manuscript will explicitly state the rollout generation procedure: nucleus sampling with p=0.9 and temperature=0.7, maximum generation length 1024 tokens, and rejection of rollouts shorter than 50 tokens. The VA threshold for high/low grouping is the per-batch median of token-level VA values, chosen to produce balanced groups without introducing an arbitrary hyperparameter. All main results are averaged over five independent random seeds with standard deviations reported; we will additionally include paired statistical tests (Wilcoxon signed-rank) showing p<0.05 for VA-OPD versus standard on-policy distillation on every benchmark. These specifications, together with the exact data splits and teacher checkpoint versions, will be added to Section 4 and a new reproducibility appendix.
  Revision: yes
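The paired test named in the response can be sketched with an exact Wilcoxon signed-rank computation. This is a simplified illustration that assumes no zero differences and no tied absolute differences. The function name and the toy per-seed values are hypothetical. How the pairs are formed (per seed within a benchmark, or across benchmarks) is not stated in the response, so pairing across five seeds is an assumption here.

```python
from itertools import product
from typing import Sequence, Tuple


def wilcoxon_exact(x: Sequence[float], y: Sequence[float]) -> Tuple[int, float]:
    """Return (W+, exact two-sided p-value) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    abs_diffs = [abs(d) for d in diffs]
    n = len(diffs)
    if 0 in abs_diffs or len(set(abs_diffs)) != n:
        raise ValueError("this sketch does not handle zero or tied differences")

    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs_diffs[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    total = n * (n + 1) // 2
    observed = min(w_plus, total - w_plus)

    # Under the null every sign pattern is equally likely: enumerate all 2^n.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= observed:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Toy per-seed gains (hypothetical): the method wins on all five seeds.
w_plus, p_value = wilcoxon_exact([1.1, 0.8, 1.3, 0.6, 1.2], [0.0] * 5)
```

With five pairs that all favour one method, the exact two-sided p-value is 2/32 = 0.0625, the smallest value attainable at n = 5. Under this pairing assumption, p<0.05 would therefore need a one-sided alternative or more than five pairs per comparison.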
Circularity Check
No significant circularity; VA is an independent measurement applied to the objective
full rationale
The paper defines visual advantage (VA) directly from the teacher's log-probability difference on student rollouts with versus without fine-grained visual input. This quantity is then used for rollout-level reweighting and separate high/low-VA token KL terms in the distillation loss. The construction does not reduce to a fitted parameter renamed as prediction, nor does any equation equate the output to the input by definition. No self-citation chains, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing premises in the method. Empirical improvements are reported on external benchmarks rather than derived from the VA definition itself. The central claim therefore remains self-contained against the described inputs.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: VA is the token-level log-probability difference when the teacher scores a student-generated rollout with versus without access to fine-grained visual detail
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: rollout-level reweighting by trajectory-averaged VA, and token-level KL averaged within high-VA and low-VA groups separately
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.