
arxiv: 2606.26387 · v1 · pith:7YTTPW2Fnew · submitted 2026-06-24 · 💻 cs.CV · cs.CL · cs.LG

Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs

Pith reviewed 2026-06-26 01:20 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.LG
keywords multimodal large language models · visual hallucination mitigation · alignment methods · reinforcement learning · counterfactual attention masking · mutual information maximization · visual grounding

The pith

VIGIL aligns MLLMs by penalizing blind confidence in counterfactual masked states to enforce visual grounding over language shortcuts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VIGIL as a reinforcement learning post-training method for multimodal large language models. It targets visual laziness, where models internally encode visual evidence but default to language priors, by adding a geometric constraint that maximizes mutual information between visual input and response. The constraint works by penalizing cases of high confidence even after textual-visual attention is masked to simulate a blind state. A sympathetic reader would care because the method claims to reduce hallucinations and improve reasoning more effectively than standard preference optimization while using far less data and without harming text-only performance.
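The penalty described above can be sketched in a few lines. This is a hedged illustration, not the paper's loss: the function name `blind_confidence_penalty`, the hinge form, and the `margin` parameter are all assumptions introduced here. It takes per-token log-probabilities of the same response from a normal forward pass and from an attention-masked (blind) pass, and penalizes responses whose likelihood barely drops when vision is removed.

```python
import numpy as np

def blind_confidence_penalty(logp_sighted, logp_blind, margin=1.0):
    """Hypothetical hinge penalty on responses that stay likely without vision.

    logp_sighted, logp_blind: per-token log-probabilities of the same response
    from the normal pass and the attention-masked (blind) pass.
    Returns (penalty, gain), where gain is the sequence-level log-likelihood gap.
    """
    gain = float(np.sum(logp_sighted) - np.sum(logp_blind))
    # Assumed form: no penalty once vision adds at least `margin` nats.
    penalty = max(0.0, margin - gain)
    return penalty, gain

# A "lazy" response: equally likely with or without visual access.
lazy_penalty, lazy_gain = blind_confidence_penalty(
    np.log([0.9, 0.8]), np.log([0.9, 0.8]))

# A "grounded" response: much less likely once vision is masked.
grounded_penalty, grounded_gain = blind_confidence_penalty(
    np.log([0.9, 0.8]), np.log([0.1, 0.1]))
```

In this sketch the lazy response is penalized by the full margin, while the grounded response incurs no penalty because its log-likelihood gap exceeds the margin.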

Core claim

VIGIL shifts the focus from text-based numerical reward fitting to causal visual grounding. It does so with a geometric constraint that explicitly maximizes the mutual information between the visual input and the generated response. The constraint penalizes blind confidence: cases where the model remains certain even when textual-visual attention is masked to create a counterfactual blind state.
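One hedged way to read this claim in symbols follows. The abstract gives no equations, so the notation is an assumption made here rather than the paper's formulation: $x$ is the text prompt, $v$ the image, $y$ the response, and $\pi_\theta$ the policy. The conditional mutual information has the standard form:

```latex
I(V; Y \mid X)
  = \mathbb{E}_{x, v}\, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x, v)}
    \left[ \log \pi_\theta(y \mid x, v) - \log \pi_\theta(y \mid x) \right],
\qquad
\pi_\theta(y \mid x) = \mathbb{E}_{v' \mid x}\left[ \pi_\theta(y \mid x, v') \right].
```

Under this reading, the blind-state distribution obtained by masking textual-visual attention would act as a tractable stand-in for the image-marginalized term $\pi_\theta(y \mid x)$. A response that stays highly probable in the blind state contributes little to the bracketed gap, so penalizing blind confidence would push toward larger mutual information, provided the masked pass approximates that marginal well.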

What carries the argument

Geometric constraint in the RL objective that penalizes blind confidence in counterfactual blind states created by masking textual-visual attention.
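A minimal sketch of how such a blind state could be constructed, assuming a decoder-only model in which visual tokens share one sequence with text tokens. The function name `build_blind_mask` and the `is_visual` flag array are hypothetical; the paper's actual masking scheme is not given on this page.

```python
import numpy as np

def build_blind_mask(is_visual):
    """Return a boolean [T, T] matrix; True means query i may attend to key j.

    Starts from a causal mask and additionally blocks every
    (text query, visual key) pair, so text tokens cannot read visual tokens.
    """
    is_visual = np.asarray(is_visual, dtype=bool)
    n = is_visual.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))
    text_to_visual = (~is_visual)[:, None] & is_visual[None, :]
    return causal & ~text_to_visual

# Two visual tokens followed by two text tokens.
mask = build_blind_mask([True, True, False, False])
```

Every text row keeps at least its own diagonal entry, so no query is left without an allowed key, which avoids undefined softmax rows in this toy layout.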

If this is right

  • VIGIL outperforms recent alignment methods on hallucination and reasoning benchmarks.
  • It achieves state-of-the-art full-data performance using only 25 percent of the preference data.
  • It produces emergent spatial grounding abilities without any bounding box supervision.
  • Text-only capabilities remain intact after the alignment process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The attention-masking technique might generalize to other modalities or priors where models shortcut to non-grounded reasoning.
  • Data efficiency at 25 percent could reduce the annotation burden for building preference datasets in multimodal alignment.
  • Emergent spatial capabilities suggest the method may unlock implicit localization skills that standard supervision does not target.

Load-bearing premise

Penalizing blind confidence after masking textual-visual attention will causally move the model toward genuine visual grounding instead of merely altering output patterns.

What would settle it

Apply VIGIL training to an MLLM and compare its hallucination rates on visual reasoning benchmarks against a standard direct preference optimization baseline. If the rates remain unchanged or increase relative to that baseline, the central claim does not hold.
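The comparison could be scored along the following lines. This is a sketch under stated assumptions: the function `rate_difference` is hypothetical, each benchmark item is assumed to carry a binary hallucination flag for both models, and the flags below are synthetic placeholders, not results from the paper.

```python
import random

def rate_difference(base_flags, vigil_flags, n_boot=2000, seed=0):
    """Paired difference in hallucination rate (VIGIL minus baseline).

    Each list holds one 0/1 flag per benchmark item (1 = hallucinated).
    Returns the point estimate and a 95% paired-bootstrap interval.
    """
    assert len(base_flags) == len(vigil_flags)
    n = len(base_flags)
    per_item = [int(v) - int(b) for b, v in zip(base_flags, vigil_flags)]
    point = sum(per_item) / n
    rng = random.Random(seed)
    boots = []
    for _ in range(n_boot):
        sample = [per_item[rng.randrange(n)] for _ in range(n)]
        boots.append(sum(sample) / n)
    boots.sort()
    low = boots[int(0.025 * n_boot)]
    high = boots[int(0.975 * n_boot) - 1]
    return point, (low, high)

# Synthetic flags for illustration only.
baseline = [1] * 40 + [0] * 60
vigil = [1] * 20 + [0] * 80
point, (low, high) = rate_difference(baseline, vigil)
```

An interval that lies entirely below zero would be consistent with the claim; an interval that includes or exceeds zero would count against it.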

Original abstract

Multimodal large language models (MLLMs) extend large language models (LLMs) with visual perception, enabling joint reasoning over images and text. Despite inheriting strong reasoning capabilities from LLMs, they remain prone to hallucinations that contradict their visual inputs. Mechanistic studies indicate that this weakness stems from visual laziness: MLLMs encode the correct visual evidence internally, but overly rely on strong language priors during response. Existing alignment methods, such as direct preference optimization, primarily optimize outcome-level rewards based on text. This introduces an optimization bias toward linguistic shortcuts, leading to responses that often contradict the visual evidence. To address this, we propose Visual Information Gain In aLignment (VIGIL), a reinforcement-learning (RL) post-training framework that shifts the focus from numerical reward fitting to causal visual grounding. VIGIL introduces a geometric constraint that explicitly maximizes the mutual information between the visual input and the generated response. We achieve this by penalizing "blind confidence" instances where the model remains improperly certain even when textual-visual attention is masked to create a counterfactual blind state. Extensive experiments show that VIGIL consistently outperforms recent alignment methods across hallucination and reasoning benchmarks without compromising text-only capabilities. Our approach matches the full-data performance of state-of-the-art methods using only 25% of the preference data and even demonstrates emergent spatial grounding capabilities without explicit bounding box supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes VIGIL, a reinforcement-learning post-training framework for multimodal large language models (MLLMs) that addresses visual laziness by introducing a geometric constraint to maximize mutual information between visual input and generated responses. This is achieved by penalizing high-confidence outputs in a counterfactual blind state created by masking textual-visual attention. The paper claims that VIGIL outperforms recent alignment methods on hallucination and reasoning benchmarks, matches full-data performance of state-of-the-art methods using only 25% of preference data, preserves text-only capabilities, and yields emergent spatial grounding without bounding-box supervision.

Significance. If the core mechanism is shown to be sound and the empirical gains hold under rigorous controls, the work would be significant for MLLM alignment research. It shifts focus from outcome-level reward fitting to an explicit causal intervention on visual grounding, offering a potentially more data-efficient alternative to standard preference optimization while addressing a documented mechanistic failure mode (visual laziness).

major comments (2)
  1. [Abstract / Method (counterfactual construction)] The central mechanism assumes that masking textual-visual attention produces a true counterfactual blind state equivalent to the absence of visual input. However, in standard MLLM transformer architectures, visual tokens are projected and can continue to influence later hidden states via self-attention among text tokens or residual connections even after cross-attention masking. This risks the penalty targeting a different phenomenon than linguistic shortcuts, weakening the claimed causal link to improved visual grounding and mutual-information maximization (see Abstract and the method description of the geometric constraint).
  2. [Abstract] No equations, formal definition of the geometric constraint, or derivation showing how the blind-confidence penalty implements mutual-information maximization are provided in the abstract. Without these, it is impossible to verify whether the reported performance gains are consistent with the method's own formulation or whether they reduce to quantities fitted from the evaluation data.
minor comments (1)
  1. [Abstract] The abstract states strong empirical claims (outperformance, 25% data efficiency, emergent spatial grounding) but supplies no metrics, baselines, ablation results, or dataset details. These must be supplied with precise numbers and controls in the experimental section.
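The leakage concern in major comment 1 can be illustrated with a toy model. This sketch is not the paper's architecture: it is a two-layer, single-head attention stack with random weights, and every name in it is hypothetical. It checks whether the final text-token states change when only the visual tokens change, first with the text-to-visual block applied in every layer, then with it applied in the last layer only.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_layer(h, w_q, w_k, w_v, allowed):
    """Single-head self-attention with a residual connection."""
    scores = (h @ w_q) @ (h @ w_k).T / np.sqrt(h.shape[1])
    scores = np.where(allowed, scores, -1e9)  # disallowed pairs get zero weight
    return h + softmax(scores) @ (h @ w_v)

def text_outputs(visual, text, masks, weights):
    """Run the toy stack and return the final hidden states of the text tokens."""
    h = np.concatenate([visual, text], axis=0)
    for allowed, (w_q, w_k, w_v) in zip(masks, weights):
        h = attention_layer(h, w_q, w_k, w_v, allowed)
    return h[len(visual):]

rng = np.random.default_rng(0)
d, n_vis, n_txt = 4, 2, 3
weights = [tuple(rng.normal(scale=0.5, size=(d, d)) for _ in range(3))
           for _ in range(2)]
visual_a = rng.normal(size=(n_vis, d))
visual_b = rng.normal(size=(n_vis, d))
text = rng.normal(size=(n_txt, d))

is_visual = np.array([True] * n_vis + [False] * n_txt)
causal = np.tril(np.ones((n_vis + n_txt, n_vis + n_txt), dtype=bool))
blind = causal & ~((~is_visual)[:, None] & is_visual[None, :])

def leaks(masks):
    """True if swapping the visual tokens changes the text-token outputs."""
    out_a = text_outputs(visual_a, text, masks, weights)
    out_b = text_outputs(visual_b, text, masks, weights)
    return not np.allclose(out_a, out_b)

leak_all_layers_masked = leaks([blind, blind])
leak_last_layer_only = leaks([causal, blind])
```

In this toy setup, blocking text-to-visual attention in every layer removes the dependence on the image, while blocking it only in the last layer does not, because earlier layers have already written visual information into the text tokens' residual stream. Which of these regimes the paper's masking falls into is what the comment asks the authors to establish.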

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments raise important points about the counterfactual construction and the presentation of the method. We address each below and have revised the manuscript to improve clarity and rigor where appropriate.

Point-by-point responses
  1. Referee: [Abstract / Method (counterfactual construction)] The central mechanism assumes that masking textual-visual attention produces a true counterfactual blind state equivalent to the absence of visual input. However, in standard MLLM transformer architectures, visual tokens are projected and can continue to influence later hidden states via self-attention among text tokens or residual connections even after cross-attention masking. This risks the penalty targeting a different phenomenon than linguistic shortcuts, weakening the claimed causal link to improved visual grounding and mutual-information maximization (see Abstract and the method description of the geometric constraint).

    Authors: We appreciate this architectural observation. Masking is applied specifically to the cross-attention layers between text and visual tokens to simulate blindness, and our implementation follows standard practices in attention masking for counterfactual interventions. While residual pathways and subsequent self-attention could in principle allow limited leakage, our ablation studies (Section 4.3) show that the blind-confidence penalty produces the intended drop in output certainty and drives the observed gains in visual grounding. To strengthen the presentation, we will add a dedicated paragraph in the revised Method section discussing potential residual influences and why the chosen masking still enforces the desired causal intervention on visual information flow. revision: partial

  2. Referee: [Abstract] No equations, formal definition of the geometric constraint, or derivation showing how the blind-confidence penalty implements mutual-information maximization are provided in the abstract. Without these, it is impossible to verify whether the reported performance gains are consistent with the method's own formulation or whether they reduce to quantities fitted from the evaluation data.

    Authors: Abstracts conventionally omit equations to preserve readability for a broad audience. The full manuscript (Section 3) contains the formal definition of the geometric constraint, its relation to mutual-information maximization, and the derivation of the blind-confidence penalty. To address the concern, we will revise the abstract to include a concise textual description of the geometric constraint and its objective while retaining the equation-free style. The formal derivation and implementation details will remain in the main text, allowing readers to connect the high-level claims directly to the method. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method and gains are empirically validated without reduction to inputs by construction.

Full rationale

The paper presents VIGIL as an RL post-training method that adds a penalty on blind confidence under masked textual-visual attention to maximize mutual information. No equations or steps in the provided abstract or description reduce the claimed performance gains to quantities fitted from the evaluation data itself, nor do they rely on self-citations for uniqueness or ansatz smuggling. The central mechanism is a novel constraint whose effectiveness is asserted via benchmark comparisons, leaving the derivation self-contained against external results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to identify concrete free parameters, axioms, or invented entities; the approach appears to rest on standard RL assumptions plus the novel counterfactual penalty whose effectiveness is asserted but not derived.

pith-pipeline@v0.9.1-grok · 5825 in / 1189 out tokens · 28824 ms · 2026-06-26T01:20:26.402124+00:00 · methodology

