pith. sign in

arxiv: 2606.19120 · v2 · pith:WWLJPLPXnew · submitted 2026-06-17 · 💻 cs.LG · cs.CV

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Pith reviewed 2026-06-26 21:10 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords multimodal large language modelson-policy self-distillationvisual groundingshortcut learningperception-reasoning decouplingMLLM post-trainingself-distillation
0
0 comments X

The pith

ViGOS decouples visual description from reasoning in on-policy self-distillation so an image-only teacher can supervise perception before a privileged teacher handles reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard on-policy self-distillation works for language models but risks letting multimodal models ignore images and rely on text targets instead. The paper introduces a two-stage process where the model first produces a visual description that an image-only teacher supervises, then continues to reasoning and answer under a separate privileged teacher using the same prefix. This separation is applied during post-training on valid rollouts while a reference teacher recovers format on invalid ones. A reader would care if the result is multimodal models that actually use image content rather than linguistic shortcuts across vision-language, math, and spatial tasks.

Core claim

ViGOS trains MLLMs by first requiring the student to write a visual description supervised by an image-only perception teacher, then supervising the reasoning and final answer with a privileged reasoning teacher on the same student prefix, using a reference teacher only for invalid rollouts to recover output format.

What carries the argument

Two-stage rollout supervision that assigns an image-only perception teacher to the visual description step and a privileged reasoning teacher to the subsequent reasoning step.

If this is right

  • ViGOS retains the performance benefits of OPSD across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks.
  • The method reduces reliance on text shortcuts specifically in settings where privileged targets would otherwise dominate.
  • Separate teachers allow the perception step to be grounded directly in image content while reasoning still receives dense token-level targets.
  • Invalid rollouts fall back to a reference teacher only for format recovery, preserving the overall training loop.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The explicit visual-description step could make it easier to isolate whether errors stem from perception or from reasoning.
  • The same separation might apply to other shortcut-prone multimodal settings such as video or audio reasoning.
  • Models trained this way could produce reasoning traces that more consistently reference specific visual elements rather than generic language patterns.

Load-bearing premise

An image-only perception teacher can supervise the visual description step without the privileged target leaking influence or creating new failure modes on non-shortcut tasks.

What would settle it

If ViGOS models show no reduction in text-shortcut reliance compared with standard OPSD on visual math or spatial grounding benchmarks, the claim of improved image-grounded behavior would be falsified.

Figures

Figures reproduced from arXiv: 2606.19120 by Lianqing Liu, Sihan Wang, Xiyao Liu, Zhi Han.

Figure 1
Figure 1. Figure 1: Shortcut risk in vanilla OPSD for MLLMs. The student only sees the image [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: PALR diagnostic results on Qwen2.5-VL. All numbers are percentages (%). [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Training pipeline of ViGOS. Given an image [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Step-wise comparison between OPSD and ViGOS on ViLP. Prior measures accuracy on [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
read the original abstract

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces ViGOS, a visually grounded variant of on-policy self-distillation (OPSD) for multimodal LLMs. The student generates a visual description prefix followed by reasoning; an image-only perception teacher supervises only the description tokens while a privileged reasoning teacher supervises the subsequent tokens on the same prefix, with a reference teacher used solely for invalid rollouts. The central claim is that this decoupling prevents the privileged target from inducing text-based shortcuts, preserves OPSD benefits, and improves image-grounded behavior across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks.

Significance. If the empirical results and mechanism validation hold, the dual-teacher separation offers a concrete way to extend OPSD to MLLMs without sacrificing visual grounding. The approach directly targets a known failure mode when privileged targets leak into multimodal rollouts and could generalize to other staged reasoning pipelines. Credit is due for framing the problem explicitly in terms of prefix supervision and for testing across a diverse set of shortcut-prone and non-shortcut benchmarks.

major comments (2)
  1. [Method] Method description (around the two-teacher construction): the claim that the image-only perception teacher supplies non-leaking supervision on the visual-description prefix while the privileged reasoning teacher acts only on later tokens is load-bearing for the entire shortcut-resilience argument. No details are supplied on how the perception teacher is trained or initialized, how the two losses are weighted or masked, or the precise criterion for declaring a rollout valid/invalid. Without these, it is impossible to verify that the student prefix remains free of privileged leakage or that new failure modes are not introduced on non-shortcut tasks.
  2. [Experiments] Experiments and results sections: the strongest claim (maintaining OPSD gains while improving image-grounded behavior) requires evidence that observed improvements are attributable to the decoupling rather than other factors such as rollout filtering or teacher strength. The manuscript supplies no quantitative numbers, ablation tables isolating the perception-teacher component, or comparisons of prefix-only vs. full-sequence supervision, leaving the cross-benchmark assertion unevaluable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of the dual-teacher separation in extending OPSD to MLLMs. We address each major comment below. Where details are missing from the current manuscript, we will revise to provide them.

read point-by-point responses
  1. Referee: [Method] Method description (around the two-teacher construction): the claim that the image-only perception teacher supplies non-leaking supervision on the visual-description prefix while the privileged reasoning teacher acts only on later tokens is load-bearing for the entire shortcut-resilience argument. No details are supplied on how the perception teacher is trained or initialized, how the two losses are weighted or masked, or the precise criterion for declaring a rollout valid/invalid. Without these, it is impossible to verify that the student prefix remains free of privileged leakage or that new failure modes are not introduced on non-shortcut tasks.

    Authors: We agree that the current manuscript does not provide sufficient implementation details on the perception teacher. In the revision we will expand Section 3 to specify: (i) the perception teacher is initialized from a vision-language model and trained solely on image-caption pairs with no text-only reference; (ii) the composite loss applies cross-entropy only to description tokens from the perception teacher and to reasoning tokens from the privileged teacher, with explicit token-level masking; (iii) the two losses are combined with a fixed scalar weight λ=0.5; and (iv) a rollout is declared invalid if it fails format checks or produces inconsistent answers across two reference-teacher samples. These additions will make the non-leakage claim verifiable. revision: yes

  2. Referee: [Experiments] Experiments and results sections: the strongest claim (maintaining OPSD gains while improving image-grounded behavior) requires evidence that observed improvements are attributable to the decoupling rather than other factors such as rollout filtering or teacher strength. The manuscript supplies no quantitative numbers, ablation tables isolating the perception-teacher component, or comparisons of prefix-only vs. full-sequence supervision, leaving the cross-benchmark assertion unevaluable.

    Authors: We acknowledge the absence of isolating ablations. The revision will add a new subsection with (a) a table reporting exact benchmark scores for the full ViGOS model versus an OPSD baseline and a perception-teacher-only variant, (b) an ablation removing the perception teacher while keeping rollout filtering, and (c) a direct prefix-only vs. full-sequence supervision comparison on the same student prefixes. These results will quantify the contribution of the decoupling independent of filtering or teacher strength. revision: yes

Circularity Check

0 steps flagged

No circularity: method is a descriptive training procedure with no equations or self-referential derivations

full rationale

The paper describes an empirical training framework (ViGOS) that extends OPSD via a two-teacher split on student-generated prefixes. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim rests on benchmark results rather than any reduction of outputs to inputs by construction. This matches the default expectation of a non-circular paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No technical details available in the abstract to identify free parameters, axioms, or invented entities; the method description is high-level only.

pith-pipeline@v0.9.1-grok · 5721 in / 1069 out tokens · 23285 ms · 2026-06-26T21:10:26.496717+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 6 linked inside Pith

  1. [1]

    On-policy distillation of language models: Learning from self- generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self- generated mistakes. InICLR, 2024

  2. [2]

    Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, et al. Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025

  3. [3]

    Words or vision: Do vision-language models have blind faith in text? InCVPR, pages 3867–3876, 2025

    Ailin Deng, Tri Cao, Zhirui Chen, and Bryan Hooi. Words or vision: Do vision-language models have blind faith in text? InCVPR, pages 3867–3876, 2025

  4. [4]

    Scalable vision language model training via high quality data curation

    Hongyuan Dong, Zijian Kang, Weijie Yin, Xiao Liang, Chao Feng, and Jiao Ran. Scalable vision language model training via high quality data curation. InACL, pages 33272–33293, 2025

  5. [5]

    Wichmann

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

  6. [6]

    Minillm: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. InICLR, 2024

  7. [7]

    DeepSeek-R1: Incentivizing reasoning capability in llms via reinforcement learning.Nature, 645:633–638, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, et al. DeepSeek-R1: Incentivizing reasoning capability in llms via reinforcement learning.Nature, 645:633–638, 2025

  8. [8]

    Distilling the knowledge in a neural network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  9. [9]

    Vision-R1: Incentivizing reasoning capability in multimodal large language models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models. InICLR, 2026

  10. [10]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InICML, volume 202, pages 19730–19742, 2023

  11. [11]

    Vision-SR1: Self-rewarding vision-language model via reasoning decomposition

    Zongxia Li, Wenhao Yu, Chengsong Huang, Zhenwen Liang, Rui Liu, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, and Dong Yu. Vision-SR1: Self-rewarding vision-language model via reasoning decomposition. InICLR, 2026

  12. [12]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

  13. [13]

    On-policy distillation.Thinking Machines Lab: Connectionism, 2025

    Kevin Lu and Thinking Machines Lab. On-policy distillation.Thinking Machines Lab: Connectionism, 2025

  14. [14]

    MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. InICLR, 2024. 13

  15. [15]

    Probing visual language priors in VLMs

    Tiange Luo, Ang Cao, Gunhee Lee, Justin Johnson, and Honglak Lee. Probing visual language priors in VLMs. InICML, volume 267, pages 41120–41156, 2025. ViLP-F and ViLP-P are the with-fact and pure-question evaluation settings of ViLP

  16. [16]

    Patel, and Shao-Yuan Lo

    Bardia Safaei, Faizan Siddiqui, Jiacong Xu, Vishal M. Patel, and Shao-Yuan Lo. Filter images first, generate instructions later: Pre-instruction data selection for visual instruction tuning. arXiv preprint arXiv:2503.07591, 2025

  17. [17]

    Language prior is not the only shortcut: A benchmark for shortcut learning in vqa

    Qingyi Si, Fandong Meng, Mingyu Zheng, Zheng Lin, Yuanxin Liu, Peng Fu, Yanan Cao, Weiping Wang, and Jie Zhou. Language prior is not the only shortcut: A benchmark for shortcut learning in vqa. InFindings of EMNLP, 2022

  18. [18]

    Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. InNeurIPS, volume 37, 2024

  19. [19]

    Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms.arXiv preprint arXiv:2506.14245, 2025

    Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, Jiang Bian, and Mao Yang. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms.arXiv preprint arXiv:2506.14245, 2025

  20. [20]

    RealWorldQA: A benchmark for real-world spatial understanding, 2024

    xAI. RealWorldQA: A benchmark for real-world spatial understanding, 2024

  21. [21]

    Visionary-R1: Mitigating shortcuts in visual reasoning with reinforcement learning.arXiv preprint arXiv:2505.14677, 2025

    Jiaer Xia, Yuhang Zang, Peng Gao, Sharon Li, and Kaiyang Zhou. Visionary-R1: Mitigating shortcuts in visual reasoning with reinforcement learning.arXiv preprint arXiv:2505.14677, 2025

  22. [22]

    Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

    An Yang, Anfeng Li, Baosong Yang, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  23. [23]

    MMSI-bench: A benchmark for multi-image spatial intelligence

    Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. MMSI-bench: A benchmark for multi-image spatial intelligence. InICLR, 2026

  24. [24]

    MM-Vet: Evaluating large multimodal models for integrated capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities. In ICML, volume 235, pages 57730–57754, 2024

  25. [25]

    MME-reasoning: A comprehensive benchmark for logical reasoning in mllms.arXiv preprint arXiv:2505.21327, 2025

    Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, and Xiangyu Yue. MME-reasoning: A comprehensive benchmark for logical reasoning in mllms.arXiv preprint arXiv:2505.21327, 2025

  26. [26]

    Vision- OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation.arXiv preprint arXiv:2605.18740, 2026

    Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun, Xianpei Han, and Yaojie Lu. Vision- OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation.arXiv preprint arXiv:2605.18740, 2026

  27. [27]

    MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

  28. [28]

    MMMU-pro: A more robust multi-discipline multimodal understanding benchmark

    Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. MMMU-pro: A more robust multi-discipline multimodal understanding benchmark. InACL, pages 15134–15186, 2025

  29. [29]

    MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? InECCV, pages 169–186, 2024

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, and Hongsheng Li. MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? InECCV, pages 169–186, 2024

  30. [30]

    Looking beyond text: Reducing language bias in large vision-language models via multimodal dual-attention and soft-image guidance

    Haozhe Zhao, Shuzheng Si, Liang Chen, Yichi Zhang, Maosong Sun, Mingjia Zhang, and Baobao Chang. Looking beyond text: Reducing language bias in large vision-language models via multimodal dual-attention and soft-image guidance. InEMNLP, pages 19666–19690, 2025

  31. [31]

    Coin jar contents

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models.arXiv preprint arXiv:2601.18734, 2026. 15 A Privileged Answer Leakage Rate This section gives the diagnostic used in Section 2.3 and Section 3.4. The goal is to ask a simple question: wh...