Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models
Pith reviewed 2026-06-26 04:59 UTC · model grok-4.3
The pith
VISE uses geometric and semantic invariance rewards to make self-evolving multimodal models attend to visual tokens rather than language priors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VISE is a purely unsupervised self-evolving framework that directly regularizes the model's visual conditioning policy through two complementary invariance-based rewards: a geometric invariance reward that enforces spatial consistency under known transformations, and a semantic invariance reward that penalizes evidence-agnostic generation by requiring the model to recognize the absence of evidence when predicted regions are perturbed. It operates within a single model without specialist roles, external reward models, or annotations, and is trained on raw unlabeled images, leading to gains on 18 benchmarks.
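A minimal sketch of what a geometric invariance reward could look like, assuming the known transformation is a horizontal flip, the model emits a normalized bounding box, and agreement is scored with plain IoU. None of these choices are confirmed by the summary above; the function names are hypothetical and the paper's actual formulation may differ.

```python
from typing import Tuple

# (x1, y1, x2, y2), normalized to [0, 1]; this box format is an assumption.
Box = Tuple[float, float, float, float]


def hflip_box(box: Box) -> Box:
    """Map a box to where it lands in the horizontally flipped image."""
    x1, y1, x2, y2 = box
    return (1.0 - x2, y1, 1.0 - x1, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def geometric_invariance_reward(box_on_original: Box, box_on_flipped: Box) -> float:
    """Hypothetical reward: compare the flipped original prediction with
    the prediction the model made on the flipped image."""
    return iou(hflip_box(box_on_original), box_on_flipped)
```

Under this sketch, a model that ignores the image and repeats the same coordinates after the flip scores low whenever the region is off-center, which is the kind of pressure toward visual conditioning the core claim describes.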
What carries the argument
The VISE framework, which applies a geometric invariance reward for spatial consistency and a semantic invariance reward for evidence recognition to regularize visual conditioning in self-evolving LMMs.
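The semantic invariance side can be illustrated the same way. The sketch below assumes the predicted region has already been perturbed (for example, masked) and the model re-queried; the abstention phrases, the three reward values, and the function names are all hypothetical placeholders, not taken from the paper.

```python
# Phrases treated as "recognizing the absence of evidence" (assumed list).
ABSTAIN_MARKERS = ("not visible", "cannot be determined", "no evidence")


def abstains(answer: str) -> bool:
    """True if the answer contains one of the assumed abstention phrases."""
    text = answer.strip().lower()
    return any(marker in text for marker in ABSTAIN_MARKERS)


def semantic_invariance_reward(answer_on_original: str,
                               answer_on_perturbed: str) -> float:
    """Hypothetical three-way reward.

    +1.0  the model abstains once the evidence region is perturbed
    -1.0  the model repeats its original answer (evidence-agnostic)
     0.0  anything else
    """
    if abstains(answer_on_perturbed):
        return 1.0
    same = answer_on_original.strip().lower() == answer_on_perturbed.strip().lower()
    return -1.0 if same else 0.0
```

The design point is that the reward depends on how the answer changes when evidence is removed, so an output driven purely by language priors is penalized.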
If this is right
- With Qwen3-VL-2B as the base model, VISE achieves +16.85 CIDEr on COCO and +19.66 CIDEr on TextCaps.
- Reduces object hallucination by 5.0 CHAIR-I points.
- Generalizes across four model families and scales.
- Trains effectively on raw unlabeled images without any annotations.
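For readers unfamiliar with the hallucination metric above, CHAIR-I is the share of mentioned object instances that do not appear in the image. The sketch below assumes objects have already been extracted from each caption and matched to a fixed vocabulary, a step the real metric handles with synonym lists; the function name and input layout are illustrative only.

```python
from typing import Iterable, List, Set, Tuple


def chair_i(samples: Iterable[Tuple[List[str], Set[str]]]) -> float:
    """Percentage of mentioned objects that are absent from the image.

    Each sample is (objects mentioned in the caption, objects in the image).
    """
    hallucinated = 0
    mentioned = 0
    for caption_objects, image_objects in samples:
        mentioned += len(caption_objects)
        hallucinated += sum(1 for obj in caption_objects
                            if obj not in image_objects)
    return 100.0 * hallucinated / mentioned if mentioned else 0.0


samples = [
    (["dog", "frisbee"], {"dog"}),        # "frisbee" is hallucinated
    (["cat", "sofa"], {"cat", "sofa"}),   # fully grounded
]
print(chair_i(samples))  # 25.0
```

Read this way, and assuming "points" means percentage points, a 5.0-point reduction corresponds to five fewer hallucinated mentions per hundred object mentions.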
Where Pith is reading between the lines
- The rewards could be adapted to other self-training setups to improve grounding in generated outputs.
- Measuring attention weights on visual tokens before and after training would test if the mechanism works as intended.
- Similar invariance ideas might help in text-only or other modality self-evolving systems.
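The attention measurement suggested in the list above could be sketched as follows, assuming attention weights for each generated token are available as a row over context positions and the indices of visual tokens are known. How those weights are extracted from a given model is left out; the function name is hypothetical.

```python
from typing import Iterable, List, Sequence


def visual_attention_fraction(attn_rows: Sequence[Sequence[float]],
                              visual_positions: Iterable[int]) -> float:
    """Average share of attention mass placed on visual-token positions.

    attn_rows[t][i] is the attention weight from generated token t to
    context position i (for example, averaged over heads and layers).
    """
    visual = set(visual_positions)
    fractions: List[float] = []
    for row in attn_rows:
        total = sum(row)
        if total == 0:
            continue  # skip degenerate rows
        on_visual = sum(w for i, w in enumerate(row) if i in visual)
        fractions.append(on_visual / total)
    return sum(fractions) / len(fractions) if fractions else 0.0
```

Computing this value on the same prompts before and after training would give a direct, if coarse, readout of whether attention to visual tokens actually moved.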
Load-bearing premise
The geometric and semantic invariance rewards specifically cause the decoder to increase attention to visual tokens during generation.
What would settle it
The mechanism claim would be settled by checking whether attention to visual tokens remains unchanged or decreases after training with VISE while metrics still improve, or whether the gains disappear when images are not provided.
Original abstract
Recently, self-evolving large multimodal models (LMMs) have received attention for improving visual reasoning in a purely unsupervised setting. However, multi-role self-play and self-consistency reward schemes in existing self-evolving LMMs optimize answer agreement without ensuring the decoder attends to visual content, relying instead on statistical language priors to produce self-consistent outputs. This leads to a persistent failure mode we term visual under-conditioning, where the decoder relies on language priors rather than the image during generation, manifesting as insufficient attention to visual tokens. As a result, current self-evolving LMMs struggle on vision-language understanding tasks such as image captioning and visual question answering. To address this, we propose VISE (Visual Invariance Self-Evolution), a purely unsupervised self-evolving framework that directly regularizes the model's visual conditioning policy through two complementary invariance-based rewards: a geometric invariance reward that enforces spatial consistency under known transformations, and a semantic invariance reward that penalizes evidence-agnostic generation by requiring the model to recognize the absence of evidence when predicted regions are perturbed. VISE operates within a single model without specialist roles, external reward models, or annotations, and is trained on raw unlabeled images. Experiments on 18 benchmarks demonstrate the efficacy of our approach. Using Qwen3-VL-2B as the base model, VISE achieves gains of +16.85 CIDEr on COCO and +19.66 CIDEr on TextCaps, reduces object hallucination by 5.0 CHAIR-I points, and generalizes across four model families and scales. Our code and models are available at https://mbzuai-oryx.github.io/VISE
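To see why answer agreement alone does not force visual grounding, consider a generic self-consistency reward. The sketch below is a common majority-agreement formulation, not the specific scheme of any cited system; the function name and the normalization step are assumptions.

```python
from collections import Counter
from typing import List


def self_consistency_rewards(answers: List[str]) -> List[float]:
    """Reward each sampled answer by the fraction of samples that agree with it.

    Note that the image never enters this computation: only the sampled
    text does.
    """
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    n = len(normalized)
    return [counts[a] / n for a in normalized]


print(self_consistency_rewards(["A cat", "a cat", "a dog", "a cat "]))
# [0.75, 0.75, 0.25, 0.75]
```

A model that always emits the most likely answer under its language prior would score highly here regardless of image content, which is the failure mode the abstract calls visual under-conditioning.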
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that self-evolving LMMs suffer from visual under-conditioning because self-consistency rewards allow reliance on language priors rather than visual content. It introduces VISE, a single-model unsupervised framework using a geometric invariance reward (enforcing spatial consistency under transformations) and a semantic invariance reward (penalizing evidence-agnostic outputs on perturbed regions) to directly regularize the decoder's visual conditioning policy. Experiments with Qwen3-VL-2B and other models report gains of +16.85 CIDEr on COCO and +19.66 CIDEr on TextCaps and a 5.0-point CHAIR-I reduction in object hallucination, with generalization across four model families on 18 benchmarks.
Significance. If the invariance rewards are shown to causally increase decoder attention to visual tokens (rather than acting as generic regularization), the result would be significant for unsupervised self-evolution of multimodal models. The purely single-model, annotation-free setting and cross-scale generalization are strengths; the magnitude of the reported CIDEr and hallucination reductions would be notable if the mechanism is isolated.
major comments (2)
- [Abstract and §3] Abstract and §3 (method description): the central claim that the geometric and semantic invariance rewards increase decoder attention to visual tokens (addressing 'visual under-conditioning') lacks supporting evidence such as quantitative attention-weight statistics, before/after attention-map comparisons, or ablations that isolate attention change from self-consistency or regularization effects; without this, the reported metric gains cannot be attributed to the stated policy change.
- [Experiments] Experiments section (results on COCO/TextCaps/CHAIR-I): the +16.85 CIDEr, +19.66 CIDEr, and -5.0 CHAIR-I improvements are presented as evidence of better visual conditioning, yet no ablation or diagnostic (e.g., attention entropy on visual vs. text tokens, or controlled perturbation tests) rules out alternative explanations such as improved language-model consistency; this is load-bearing for the paper's interpretation.
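One way the attention-entropy diagnostic requested in the major comments could be computed is sketched below. It assumes a single attention row over context positions and a known set of visual-token indices, and it renormalizes within each group before taking Shannon entropy; the function names are hypothetical and the paper does not specify this procedure.

```python
import math
from typing import Iterable, Sequence, Tuple


def entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (in nats) of weights after renormalizing to sum to 1."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    h = 0.0
    for w in weights:
        if w > 0:
            q = w / total
            h -= q * math.log(q)
    return h


def split_attention_entropy(attn_row: Sequence[float],
                            visual_positions: Iterable[int]) -> Tuple[float, float]:
    """Return (entropy over visual positions, entropy over text positions)."""
    visual = set(visual_positions)
    vis = [w for i, w in enumerate(attn_row) if i in visual]
    txt = [w for i, w in enumerate(attn_row) if i not in visual]
    return entropy(vis), entropy(txt)
```

Entropy within a group says how spread out the attention is, not how much mass the group receives, so it would need to be reported alongside the total attention mass on visual tokens to support the under-conditioning claim.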
minor comments (2)
- The abstract states results on '18 benchmarks' but provides no explicit list or summary table; adding one would improve clarity.
- [§3] Notation for the two rewards (geometric and semantic) should be introduced with explicit equations early in §3 to avoid ambiguity when describing the combined objective.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging where the current manuscript provides only indirect support for the mechanism and committing to revisions that add the requested diagnostics.
- Referee: [Abstract and §3] Abstract and §3 (method description): the central claim that the geometric and semantic invariance rewards increase decoder attention to visual tokens (addressing 'visual under-conditioning') lacks supporting evidence such as quantitative attention-weight statistics, before/after attention-map comparisons, or ablations that isolate attention change from self-consistency or regularization effects; without this, the reported metric gains cannot be attributed to the stated policy change.
Authors: We agree that the manuscript currently relies on indirect evidence: the reward formulations explicitly target visual content (spatial consistency under transformations and penalization of evidence-agnostic outputs on perturbed regions), together with large gains on captioning and hallucination benchmarks that require visual grounding. Direct attention statistics are absent. In the revised version we will add quantitative attention-entropy measurements on visual versus text tokens before and after training, plus controlled perturbation ablations, to isolate the claimed policy change from generic regularization. revision: yes
- Referee: [Experiments] Experiments section (results on COCO/TextCaps/CHAIR-I): the +16.85 CIDEr, +19.66 CIDEr, and -5.0 CHAIR-I improvements are presented as evidence of better visual conditioning, yet no ablation or diagnostic (e.g., attention entropy on visual vs. text tokens, or controlled perturbation tests) rules out alternative explanations such as improved language-model consistency; this is load-bearing for the paper's interpretation.
Authors: We concur that alternative explanations must be ruled out for the interpretation to hold. The semantic invariance reward is intended to enforce visual dependence rather than mere consistency, but the manuscript does not yet contain the requested controlled comparisons. We will add (i) an ablation that removes the visual-perturbation component and (ii) a direct comparison against a pure self-consistency baseline, together with the attention-entropy diagnostics mentioned above, in the revised manuscript. revision: yes
Circularity Check
No significant circularity in derivation chain
The paper proposes VISE as a new unsupervised self-evolving framework that defines geometric and semantic invariance rewards to regularize visual conditioning. These rewards are constructed as part of the method itself rather than derived from prior results, and the paper reports empirical performance on 18 external benchmarks using base models like Qwen3-VL-2B. No equations or claims reduce by construction to fitted inputs, self-citations, or renamed known results; the central claims rest on the explicit reward formulations and observed metric gains rather than self-referential loops. The method is self-contained against external validation.