Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs
Pith reviewed 2026-06-26 01:20 UTC · model grok-4.3
The pith
VIGIL aligns MLLMs by penalizing blind confidence in counterfactual masked states to enforce visual grounding over language shortcuts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VIGIL shifts alignment from fitting text-based numerical rewards to causal visual grounding. It introduces a geometric constraint that explicitly maximizes the mutual information between the visual input and the generated response. The constraint penalizes blind confidence: cases where the model remains certain even after textual-visual attention is masked to create a counterfactual blind state.
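The phrase "maximizes the mutual information between the visual input and the generated response" can be read against the textbook definition of conditional mutual information. The identity below is that standard definition, not the paper's stated objective, which is not reproduced on this page. The symbols x (text prompt), v (image), y (response) and p_theta (the model's distribution) are notation assumed here for illustration.

```latex
I(V; Y \mid X)
  = \mathbb{E}_{p(x, v)\, p_\theta(y \mid x, v)}
    \left[ \log \frac{p_\theta(y \mid x, v)}{p_\theta(y \mid x)} \right],
\qquad
p_\theta(y \mid x) = \mathbb{E}_{p(v \mid x)}\left[ p_\theta(y \mid x, v) \right]
```

Under this reading, a response that is just as likely without the image as with it contributes a log-ratio near zero, which matches what the summary calls blind confidence. Treating the attention-masked forward pass as a stand-in for p_theta(y | x) would be an approximation, and whether VIGIL does so is an assumption here.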
What carries the argument
Geometric constraint in the RL objective that penalizes blind confidence in counterfactual blind states created by masking textual-visual attention.
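As a rough illustration of what a blind-confidence penalty could look like, the sketch below compares per-token log-probabilities of the same response from a normal ("sighted") pass and a masked ("blind") pass. The function names, the hinge form, and the `margin` value are all hypothetical choices made for this example; the page does not give the paper's actual geometric constraint.

```python
def visual_information_gain(sighted_logprobs, blind_logprobs):
    """Mean per-token log-likelihood ratio between the sighted and blind passes.

    Both arguments are lists of log-probabilities for the same response tokens.
    A value near zero means the image barely changed the model's confidence.
    """
    if len(sighted_logprobs) != len(blind_logprobs):
        raise ValueError("both passes must score the same response tokens")
    if not sighted_logprobs:
        raise ValueError("response must contain at least one token")
    total_gain = sum(sighted_logprobs) - sum(blind_logprobs)
    return total_gain / len(sighted_logprobs)


def blind_confidence_penalty(sighted_logprobs, blind_logprobs, margin=0.5):
    """Hypothetical hinge penalty (an assumption, not the paper's formula).

    Positive when the blind pass is almost as confident as the sighted pass,
    zero once the image contributes at least `margin` nats per token.
    """
    gain = visual_information_gain(sighted_logprobs, blind_logprobs)
    return max(0.0, margin - gain)


# Made-up log-probabilities for a two-token response.
grounded = blind_confidence_penalty([-0.1, -0.2], [-2.0, -1.5])
lazy = blind_confidence_penalty([-0.1, -0.2], [-0.15, -0.25])
print(grounded, lazy)
```

In this toy setup the "lazy" response is penalized because masking the image hardly lowers its likelihood, while the "grounded" response incurs no penalty. Such a term would presumably be added to the RL objective with some weight, which is again an assumption.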
If this is right
- VIGIL outperforms recent alignment methods on hallucination and reasoning benchmarks.
- It matches the full-data performance of state-of-the-art methods using only 25 percent of the preference data.
- It produces emergent spatial grounding abilities without any bounding box supervision.
- Text-only capabilities remain intact after the alignment process.
Where Pith is reading between the lines
- The attention-masking technique might generalize to other modalities or priors where models shortcut to non-grounded reasoning.
- Data efficiency at 25 percent could reduce the annotation burden for building preference datasets in multimodal alignment.
- Emergent spatial capabilities suggest the method may unlock implicit localization skills that standard supervision does not target.
Load-bearing premise
Penalizing blind confidence after masking textual-visual attention will causally move the model toward genuine visual grounding instead of merely altering output patterns.
What would settle it
Train an MLLM with VIGIL and compare its hallucination rates on visual reasoning benchmarks against standard direct preference optimization baselines. Rates that stay unchanged or rise relative to those baselines would count against the claim.
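One way to make this comparison concrete is a paired bootstrap over per-example hallucination flags, so that "unchanged" can be distinguished from noise. The sketch below is a generic statistical recipe, not an evaluation protocol taken from the paper. The function names and the toy flags are hypothetical, and each flag is assumed to be 1 when the model hallucinated on that benchmark item and 0 otherwise.

```python
import random


def hallucination_rate(flags):
    """Fraction of benchmark items flagged as hallucinated (flags are 0 or 1)."""
    return sum(flags) / len(flags)


def paired_bootstrap_diff(vigil_flags, dpo_flags, n_resamples=2000, seed=0):
    """Return (observed, lower, upper) for rate(VIGIL) - rate(DPO).

    Items are resampled jointly so each resample keeps the pairing between
    the two models. The interval is a 95% percentile interval.
    """
    if len(vigil_flags) != len(dpo_flags):
        raise ValueError("both models must be scored on the same items")
    n = len(vigil_flags)
    rng = random.Random(seed)
    observed = hallucination_rate(vigil_flags) - hallucination_rate(dpo_flags)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(vigil_flags[i] - dpo_flags[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


# Made-up flags for eight benchmark items, for illustration only.
vigil = [0, 0, 1, 0, 0, 0, 1, 0]
dpo = [1, 0, 1, 0, 1, 0, 1, 1]
print(paired_bootstrap_diff(vigil, dpo))
```

An interval that lies entirely below zero would be consistent with VIGIL reducing hallucinations; an interval that includes or exceeds zero would correspond to the "unchanged or increased" outcome described above.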
Original abstract
Multimodal large language models (MLLMs) extend large language models (LLMs) with visual perception, enabling joint reasoning over images and text. Despite inheriting strong reasoning capabilities from LLMs, they remain prone to hallucinations that contradict their visual inputs. Mechanistic studies indicate that this weakness stems from visual laziness: MLLMs encode the correct visual evidence internally, but overly rely on strong language priors during response. Existing alignment methods, such as direct preference optimization, primarily optimize outcome-level rewards based on text. This introduces an optimization bias toward linguistic shortcuts, leading to responses that often contradict the visual evidence. To address this, we propose Visual Information Gain In aLignment (VIGIL), a reinforcement-learning (RL) post-training framework that shifts the focus from numerical reward fitting to causal visual grounding. VIGIL introduces a geometric constraint that explicitly maximizes the mutual information between the visual input and the generated response. We achieve this by penalizing "blind confidence" instances where the model remains improperly certain even when textual-visual attention is masked to create a counterfactual blind state. Extensive experiments show that VIGIL consistently outperforms recent alignment methods across hallucination and reasoning benchmarks without compromising text-only capabilities. Our approach matches the full-data performance of state-of-the-art methods using only 25% of the preference data and even demonstrates emergent spatial grounding capabilities without explicit bounding box supervision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes VIGIL, a reinforcement-learning post-training framework for multimodal large language models (MLLMs) that addresses visual laziness by introducing a geometric constraint to maximize mutual information between visual input and generated responses. This is achieved by penalizing high-confidence outputs in a counterfactual blind state created by masking textual-visual attention. The paper claims that VIGIL outperforms recent alignment methods on hallucination and reasoning benchmarks, matches full-data performance of state-of-the-art methods using only 25% of preference data, preserves text-only capabilities, and yields emergent spatial grounding without bounding-box supervision.
Significance. If the core mechanism is shown to be sound and the empirical gains hold under rigorous controls, the work would be significant for MLLM alignment research. It shifts focus from outcome-level reward fitting to an explicit causal intervention on visual grounding, offering a potentially more data-efficient alternative to standard preference optimization while addressing a documented mechanistic failure mode (visual laziness).
major comments (2)
- [Abstract / Method (counterfactual construction)] The central mechanism assumes that masking textual-visual attention produces a true counterfactual blind state equivalent to the absence of visual input. However, in standard MLLM transformer architectures, visual tokens are projected and can continue to influence later hidden states via self-attention among text tokens or residual connections even after cross-attention masking. This risks the penalty targeting a different phenomenon than linguistic shortcuts, weakening the claimed causal link to improved visual grounding and mutual-information maximization (see Abstract and the method description of the geometric constraint).
- [Abstract] No equations, formal definition of the geometric constraint, or derivation showing how the blind-confidence penalty implements mutual-information maximization are provided in the abstract. Without these, it is impossible to verify whether the reported performance gains are consistent with the method's own formulation or whether they reduce to quantities fitted from the evaluation data.
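The leakage concern in the first major comment can be illustrated with a toy reachability check. The sketch assumes a decoder-only layout in which visual tokens, prompt text tokens, and response tokens share one sequence; that layout, the five-position example, and all function names are assumptions made for illustration, since this page does not specify the paper's architecture or masking scheme. The check tracks which positions could carry visual information after each layer, counting both attention and the residual stream.

```python
def causal_allowed(n):
    """allowed[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(n)] for q in range(n)]


def block(allowed, queries, keys):
    """Copy of `allowed` with attention from `queries` to `keys` disabled."""
    out = [row[:] for row in allowed]
    for q in queries:
        for k in keys:
            out[q][k] = False
    return out


def visual_reach(layer_masks, visual_positions):
    """Per position: could it hold visual information after all layers?

    A position keeps what it already has (residual connection) and picks up
    visual information from any key it is allowed to attend to.
    """
    n = len(layer_masks[0])
    tainted = [i in visual_positions for i in range(n)]
    for allowed in layer_masks:
        tainted = [
            tainted[q] or any(allowed[q][k] and tainted[k] for k in range(n))
            for q in range(n)
        ]
    return tainted


# Toy layout: positions 0-1 visual, 2-3 prompt text, 4 response token.
visual, prompt, response = [0, 1], [2, 3], [4]
base = causal_allowed(5)
all_text_blind = block(base, prompt + response, visual)
response_only_blind = block(base, response, visual)

print(visual_reach([all_text_blind, all_text_blind], set(visual)))
print(visual_reach([response_only_blind, response_only_blind], set(visual)))
print(visual_reach([base, all_text_blind], set(visual)))
```

In this toy model the response position stays free of visual information only when every text position is blinded in every layer. Blinding only the response tokens, or only the final layer, leaves an indirect path through prompt tokens or through the residual stream, which is the kind of leakage the comment describes.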
minor comments (1)
- [Abstract] The abstract states strong empirical claims (outperformance, 25% data efficiency, emergent spatial grounding) but supplies no metrics, baselines, ablation results, or dataset details. These must be supplied with precise numbers and controls in the experimental section.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments raise important points about the counterfactual construction and the presentation of the method. We address each below and have revised the manuscript to improve clarity and rigor where appropriate.
Point-by-point responses
- Referee: [Abstract / Method (counterfactual construction)] The central mechanism assumes that masking textual-visual attention produces a true counterfactual blind state equivalent to the absence of visual input. However, in standard MLLM transformer architectures, visual tokens are projected and can continue to influence later hidden states via self-attention among text tokens or residual connections even after cross-attention masking. This risks the penalty targeting a different phenomenon than linguistic shortcuts, weakening the claimed causal link to improved visual grounding and mutual-information maximization (see Abstract and the method description of the geometric constraint).
Authors: We appreciate this architectural observation. Masking is applied specifically to the cross-attention layers between text and visual tokens to simulate blindness, and our implementation follows standard practices in attention masking for counterfactual interventions. While residual pathways and subsequent self-attention could in principle allow limited leakage, our ablation studies (Section 4.3) show that the blind-confidence penalty produces the intended drop in output certainty and drives the observed gains in visual grounding. To strengthen the presentation, we will add a dedicated paragraph in the revised Method section discussing potential residual influences and why the chosen masking still enforces the desired causal intervention on visual information flow. revision: partial
- Referee: [Abstract] No equations, formal definition of the geometric constraint, or derivation showing how the blind-confidence penalty implements mutual-information maximization are provided in the abstract. Without these, it is impossible to verify whether the reported performance gains are consistent with the method's own formulation or whether they reduce to quantities fitted from the evaluation data.
Authors: Abstracts conventionally omit equations to preserve readability for a broad audience. The full manuscript (Section 3) contains the formal definition of the geometric constraint, its relation to mutual-information maximization, and the derivation of the blind-confidence penalty. To address the concern, we will revise the abstract to include a concise textual description of the geometric constraint and its objective while retaining the equation-free style. The formal derivation and implementation details will remain in the main text, allowing readers to connect the high-level claims directly to the method. revision: yes
Circularity Check
No significant circularity; method and gains are empirically validated without reduction to inputs by construction.
full rationale
The paper presents VIGIL as an RL post-training method that adds a penalty on blind confidence under masked textual-visual attention to maximize mutual information. No equations or steps in the provided abstract or description reduce the claimed performance gains to quantities fitted from the evaluation data itself, nor do they rely on self-citations for uniqueness or ansatz smuggling. The central mechanism is a novel constraint whose effectiveness is asserted via benchmark comparisons, leaving the derivation self-contained against external results.