Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs

Chen Liu; Chih-Ting Liao; Hao Xu; Janet Wang; Jianyang Gu; Lin Zhao; Muchao Ye; Qizhen Lan; Tianyang Wang; Xi Xiao

arXiv: 2606.26387 · v1 · pith:7YTTPW2Fnew · submitted 2026-06-24 · cs.CV · cs.CL · cs.LG

Xi Xiao, Chen Liu, Chih-Ting Liao, Yunbei Zhang, Qizhen Lan, Yuxiang Wei, Lin Zhao, Janet Wang, Jianyang Gu, Muchao Ye, Tianyang Wang, Hao Xu

Pith reviewed 2026-06-26 01:20 UTC · model grok-4.3

classification: cs.CV · cs.CL · cs.LG

keywords: multimodal large language models · visual hallucination mitigation · alignment methods · reinforcement learning · counterfactual attention masking · mutual information maximization · visual grounding

The pith

VIGIL aligns MLLMs by penalizing blind confidence in counterfactual masked states to enforce visual grounding over language shortcuts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VIGIL as a reinforcement learning post-training method for multimodal large language models. It targets visual laziness, where models internally encode visual evidence but default to language priors, by adding a geometric constraint that maximizes mutual information between visual input and response. The constraint works by penalizing cases of high confidence even after textual-visual attention is masked to simulate a blind state. A sympathetic reader would care because the method claims to reduce hallucinations and improve reasoning more effectively than standard preference optimization while using far less data and without harming text-only performance.

Core claim

VIGIL shifts the focus from text-based numerical reward fitting to causal visual grounding. It introduces a geometric constraint that explicitly maximizes the mutual information between the visual input and the generated response. The constraint penalizes blind confidence: cases where the model remains certain even when textual-visual attention is masked to create a counterfactual blind state.

What carries the argument

Geometric constraint in the RL objective that penalizes blind confidence in counterfactual blind states created by masking textual-visual attention.
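The abstract does not specify the form of the penalty, so the following is only a minimal sketch of what a blind-confidence penalty could look like. It assumes we already have per-token probabilities of the generated response under the normal forward pass and under the masked (blind) pass. The function names, the hinge form, the `threshold`, and the weight `lam` are all hypothetical and are not taken from the paper.

```python
import math


def sequence_logprob(token_probs):
    """Sum of per-token log-probabilities of one response."""
    return sum(math.log(p) for p in token_probs)


def visual_information_gain(full_probs, blind_probs):
    """How much more likely the response is when the image is visible.

    A value near zero means the image barely changed the model's belief.
    """
    return sequence_logprob(full_probs) - sequence_logprob(blind_probs)


def blind_confidence_penalty(blind_probs, threshold=0.5):
    """Hypothetical hinge penalty on confidence in the blind state.

    Confidence is the geometric mean of the blind per-token probabilities;
    only the part above `threshold` is penalized.
    """
    mean_conf = math.exp(sequence_logprob(blind_probs) / len(blind_probs))
    return max(0.0, mean_conf - threshold)


def shaped_reward(task_reward, blind_probs, lam=1.0, threshold=0.5):
    """Task reward minus the weighted blind-confidence penalty (assumed form)."""
    return task_reward - lam * blind_confidence_penalty(blind_probs, threshold)


# Illustrative numbers only; NOT measurements from the paper.
grounded_full, grounded_blind = [0.9, 0.8], [0.3, 0.2]    # confidence collapses when blind
lazy_full, lazy_blind = [0.9, 0.8], [0.88, 0.79]          # confidence survives when blind

grounded_penalty = blind_confidence_penalty(grounded_blind)
lazy_penalty = blind_confidence_penalty(lazy_blind)
grounded_gain = visual_information_gain(grounded_full, grounded_blind)
lazy_gain = visual_information_gain(lazy_full, lazy_blind)
```

In this toy setting the "lazy" response, which stays confident without the image, is the only one that is penalized, and it also shows the smaller information gain. Whether VIGIL uses a hinge, a log-ratio, or another form is not stated in the abstract.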

If this is right

VIGIL outperforms recent alignment methods on hallucination and reasoning benchmarks.
It matches the full-data performance of state-of-the-art methods using only 25 percent of the preference data.
It produces emergent spatial grounding abilities without any bounding box supervision.
Text-only capabilities remain intact after the alignment process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The attention-masking technique might generalize to other modalities or priors where models shortcut to non-grounded reasoning.
Data efficiency at 25 percent could reduce the annotation burden for building preference datasets in multimodal alignment.
Emergent spatial capabilities suggest the method may unlock implicit localization skills that standard supervision does not target.

Load-bearing premise

Penalizing blind confidence after masking textual-visual attention will causally move the model toward genuine visual grounding instead of merely altering output patterns.

What would settle it

Apply VIGIL training to an MLLM and measure hallucination rates on visual reasoning benchmarks. If they remain unchanged or increase relative to standard direct preference optimization baselines, the central claim fails.
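The comparison described above could be scored along the following lines. This is a minimal sketch that assumes each benchmark item yields a binary hallucination flag per model, with both models scored on the same items in the same order. The function names and the flags are made up for illustration and are not results from the paper.

```python
def hallucination_rate(flags):
    """Fraction of items flagged as hallucinated (1 = hallucinated, 0 = faithful)."""
    if not flags:
        raise ValueError("flags must be non-empty")
    return sum(flags) / len(flags)


def paired_comparison(baseline_flags, method_flags):
    """Compare two models scored on the same items, in the same order."""
    if len(baseline_flags) != len(method_flags):
        raise ValueError("both models must be scored on the same items")
    pairs = list(zip(baseline_flags, method_flags))
    return {
        "baseline_rate": hallucination_rate(baseline_flags),
        "method_rate": hallucination_rate(method_flags),
        # Items the baseline got wrong and the method got right, and vice versa.
        "fixed": sum(1 for b, m in pairs if b == 1 and m == 0),
        "broken": sum(1 for b, m in pairs if b == 0 and m == 1),
    }


# Made-up flags for eight items; NOT data from the paper.
dpo_flags = [1, 0, 1, 1, 0, 0, 1, 0]
vigil_flags = [0, 0, 1, 0, 0, 1, 0, 0]
result = paired_comparison(dpo_flags, vigil_flags)
```

Reporting the discordant counts alongside the two rates shows whether a lower rate comes from fixing baseline errors or from trading one set of errors for another.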

Original abstract

Multimodal large language models (MLLMs) extend large language models (LLMs) with visual perception, enabling joint reasoning over images and text. Despite inheriting strong reasoning capabilities from LLMs, they remain prone to hallucinations that contradict their visual inputs. Mechanistic studies indicate that this weakness stems from visual laziness: MLLMs encode the correct visual evidence internally, but overly rely on strong language priors during response. Existing alignment methods, such as direct preference optimization, primarily optimize outcome-level rewards based on text. This introduces an optimization bias toward linguistic shortcuts, leading to responses that often contradict the visual evidence. To address this, we propose Visual Information Gain In aLignment (VIGIL), a reinforcement-learning (RL) post-training framework that shifts the focus from numerical reward fitting to causal visual grounding. VIGIL introduces a geometric constraint that explicitly maximizes the mutual information between the visual input and the generated response. We achieve this by penalizing "blind confidence" instances where the model remains improperly certain even when textual-visual attention is masked to create a counterfactual blind state. Extensive experiments show that VIGIL consistently outperforms recent alignment methods across hallucination and reasoning benchmarks without compromising text-only capabilities. Our approach matches the full-data performance of state-of-the-art methods using only 25% of the preference data and even demonstrates emergent spatial grounding capabilities without explicit bounding box supervision.
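The abstract names mutual information as the target but gives no formula. As a reading aid only, the standard identity for the conditional mutual information between the visual input V and the response Y, given the text prompt X, is shown below. The symbols are chosen here for illustration and are not the paper's notation.

```latex
\begin{aligned}
I(V; Y \mid X)
  &= \mathbb{E}_{(x, v, y)}\!\left[ \log \frac{p(y \mid x, v)}{p(y \mid x)} \right] \\
  &= H(Y \mid X) - H(Y \mid X, V).
\end{aligned}
```

If the masked forward pass is taken as a stand-in for p(y | x), which is an assumption and exactly the point the referee questions, then a response that stays highly probable in the blind state contributes little to this quantity. That is one way to read why penalizing blind confidence could raise it.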

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VIGIL's masking of cross-attention to create a blind state for mutual-information maximization is the core move, but it may not isolate visual input as claimed.

VIGIL tries to fix visual laziness in MLLMs by penalizing confident answers when textual-visual attention is masked to create a blind state. The stress test is well aimed: this masking might not produce a true counterfactual, because visual information can still flow through self-attention and residual connections.

The new part is the use of this attention masking in an RL post-training setup to maximize mutual information with the visual input, moving beyond text-only preference optimization like DPO. The paper does well by showing consistent improvements on hallucination and reasoning benchmarks, achieving SOTA-level results with just 25% of the preference data, and noting emergent spatial grounding without direct supervision. That efficiency claim is worth attention if the experiments control for other factors.

The soft spots are in the soundness of the core mechanism. The abstract gives a high-level description without equations or experimental specifics, making it hard to verify whether the masked state really removes visual influence or whether the penalty targets the intended behavior. If the full paper lacks ablations on hidden-state differences or attention patterns between masked and no-image conditions, the causal claim weakens. The weakest assumption, that the penalty shifts the model toward genuine visual grounding, holds only if the masking works as intended.

This is for researchers working on multimodal model alignment and hallucination mitigation. A reader in that area would find the data efficiency and benchmark results useful.

It deserves a serious referee because it engages with a practical failure mode using a novel constraint, even if details need checking.

I recommend sending it for peer review.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes VIGIL, a reinforcement-learning post-training framework for multimodal large language models (MLLMs) that addresses visual laziness by introducing a geometric constraint to maximize mutual information between visual input and generated responses. This is achieved by penalizing high-confidence outputs in a counterfactual blind state created by masking textual-visual attention. The paper claims that VIGIL outperforms recent alignment methods on hallucination and reasoning benchmarks, matches full-data performance of state-of-the-art methods using only 25% of preference data, preserves text-only capabilities, and yields emergent spatial grounding without bounding-box supervision.

Significance. If the core mechanism is shown to be sound and the empirical gains hold under rigorous controls, the work would be significant for MLLM alignment research. It shifts focus from outcome-level reward fitting to an explicit causal intervention on visual grounding, offering a potentially more data-efficient alternative to standard preference optimization while addressing a documented mechanistic failure mode (visual laziness).

major comments (2)

[Abstract / Method (counterfactual construction)] The central mechanism assumes that masking textual-visual attention produces a true counterfactual blind state equivalent to the absence of visual input. However, in standard MLLM transformer architectures, visual tokens are projected and can continue to influence later hidden states via self-attention among text tokens or residual connections even after cross-attention masking. This risks the penalty targeting a different phenomenon than linguistic shortcuts, weakening the claimed causal link to improved visual grounding and mutual-information maximization (see Abstract and the method description of the geometric constraint).
[Abstract] No equations, formal definition of the geometric constraint, or derivation showing how the blind-confidence penalty implements mutual-information maximization are provided in the abstract. Without these, it is impossible to verify whether the reported performance gains are consistent with the method's own formulation or whether they reduce to quantities fitted from the evaluation data.
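The leakage concern in the first major comment can be probed on a toy model. The sketch below is not the paper's architecture or masking scheme: it assumes a decoder-style sequence of visual tokens followed by text tokens, single-head causal attention with residual connections, and random weights. All names (`attention_layer`, `run`, `blocked_layers`) and sizes are hypothetical. It swaps the image and checks whether the text-position outputs change.

```python
import math
import random


def linear(W, x):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attention_layer(h, Wq, Wk, Wv, n_vis, block_text_to_visual):
    """Single-head causal self-attention with a residual connection."""
    d = len(h[0])
    q = [linear(Wq, x) for x in h]
    k = [linear(Wk, x) for x in h]
    v = [linear(Wv, x) for x in h]
    out = []
    for i in range(len(h)):
        # Causal mask: position i sees 0..i. Optionally hide visual tokens
        # (indices < n_vis) from text tokens (indices >= n_vis).
        visible = [j for j in range(i + 1)
                   if not (block_text_to_visual and i >= n_vis and j < n_vis)]
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in visible]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed = [sum(w * v[j][c] for w, j in zip(weights, visible)) / total
                 for c in range(d)]
        out.append([x + y for x, y in zip(h[i], mixed)])  # residual add
    return out


def run(visual, text, layers, blocked_layers):
    """Return text-position outputs; mask text-to-visual attention in blocked_layers."""
    h = visual + text
    for idx, (Wq, Wk, Wv) in enumerate(layers):
        h = attention_layer(h, Wq, Wk, Wv, len(visual), idx in blocked_layers)
    return h[len(visual):]


def max_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


rng = random.Random(0)
d, n_vis, n_txt, n_layers = 4, 3, 2, 2


def rand_vec(scale):
    return [rng.uniform(-scale, scale) for _ in range(d)]


layers = [tuple([rand_vec(0.5) for _ in range(d)] for _ in range(3))
          for _ in range(n_layers)]
text = [rand_vec(1.0) for _ in range(n_txt)]
image_a = [rand_vec(1.0) for _ in range(n_vis)]
image_b = [rand_vec(1.0) for _ in range(n_vis)]

# Mask only in the last layer versus in every layer, then swap the image.
leak_last_only = max_diff(run(image_a, text, layers, {1}),
                          run(image_b, text, layers, {1}))
leak_all_layers = max_diff(run(image_a, text, layers, {0, 1}),
                           run(image_b, text, layers, {0, 1}))
```

In this toy, masking only the last layer leaves the text outputs dependent on the image, because text states from the earlier layer already carry visual information forward through the residual stream. Masking in every layer removes the dependence. Which layers VIGIL masks, and whether its architecture has other pathways, is not stated in the abstract.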

minor comments (1)

[Abstract] The abstract states strong empirical claims (outperformance, 25% data efficiency, emergent spatial grounding) but supplies no metrics, baselines, ablation results, or dataset details. These must be supplied with precise numbers and controls in the experimental section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments raise important points about the counterfactual construction and the presentation of the method. We address each below and have revised the manuscript to improve clarity and rigor where appropriate.

Referee: [Abstract / Method (counterfactual construction)] The central mechanism assumes that masking textual-visual attention produces a true counterfactual blind state equivalent to the absence of visual input. However, in standard MLLM transformer architectures, visual tokens are projected and can continue to influence later hidden states via self-attention among text tokens or residual connections even after cross-attention masking. This risks the penalty targeting a different phenomenon than linguistic shortcuts, weakening the claimed causal link to improved visual grounding and mutual-information maximization (see Abstract and the method description of the geometric constraint).

Authors: We appreciate this architectural observation. Masking is applied specifically to the cross-attention layers between text and visual tokens to simulate blindness, and our implementation follows standard practices in attention masking for counterfactual interventions. While residual pathways and subsequent self-attention could in principle allow limited leakage, our ablation studies (Section 4.3) show that the blind-confidence penalty produces the intended drop in output certainty and drives the observed gains in visual grounding. To strengthen the presentation, we will add a dedicated paragraph in the revised Method section discussing potential residual influences and why the chosen masking still enforces the desired causal intervention on visual information flow. revision: partial
Referee: [Abstract] No equations, formal definition of the geometric constraint, or derivation showing how the blind-confidence penalty implements mutual-information maximization are provided in the abstract. Without these, it is impossible to verify whether the reported performance gains are consistent with the method's own formulation or whether they reduce to quantities fitted from the evaluation data.

Authors: Abstracts conventionally omit equations to preserve readability for a broad audience. The full manuscript (Section 3) contains the formal definition of the geometric constraint, its relation to mutual-information maximization, and the derivation of the blind-confidence penalty. To address the concern, we will revise the abstract to include a concise textual description of the geometric constraint and its objective while retaining the equation-free style. The formal derivation and implementation details will remain in the main text, allowing readers to connect the high-level claims directly to the method. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method and gains are empirically validated without reduction to inputs by construction.

The paper presents VIGIL as an RL post-training method that adds a penalty on blind confidence under masked textual-visual attention to maximize mutual information. No equations or steps in the provided abstract or description reduce the claimed performance gains to quantities fitted from the evaluation data itself, nor do they rely on self-citations for uniqueness or ansatz smuggling. The central mechanism is a novel constraint whose effectiveness is asserted via benchmark comparisons, leaving the derivation self-contained against external results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to identify concrete free parameters, axioms, or invented entities; the approach appears to rest on standard RL assumptions plus the novel counterfactual penalty whose effectiveness is asserted but not derived.

Reference graph

Works this paper leans on

61 extracted references · 11 linked inside Pith

[1]

From system 1 to system 2: a survey of reasoning large language models

Duzhen Zhang, Zhong-Zhi Li, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Xiuyi Chen, Yingying Zhang, et al. From system 1 to system 2: a survey of reasoning large language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025
[2]

Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

Pith/arXiv arXiv 2024
[3]

Llava-onevision: Easy visual task transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. Transactions on Machine Learning Research, 2025

2025
[4]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024

2024
[5]

Mllms know where to look: Training-free perception of small visual details with multimodal llms

Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. Mllms know where to look: Training-free perception of small visual details with multimodal llms. In The Thirteenth International Conference on Learning Representations, 2025

2025
[6]

Mitigating object hallucinations in large vision-language models through visual contrastive decoding

Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13872–13882, 2024

2024
[7]

Understanding language prior of LVLMs by contrasting chain-of-embedding

Lin Long, Changdae Oh, Seongheon Park, and Sharon Li. Understanding language prior of LVLMs by contrasting chain-of-embedding. In The Fourteenth International Conference on Learning Representations, 2026

2026
[8]

Hallucination at a glance: Controlled visual edits and fine-grained multimodal learning

Tianyi Bai, Yuxuan Fan, Qiu Jiantao, Fupeng Sun, Jiayi Song, Junlin Han, Zichen Liu, Conghui He, Wentao Zhang, and Binhang Yuan. Hallucination at a glance: Controlled visual edits and fine-grained multimodal learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[9]

Leveraging latent visual reasoning in silence

Dongyao Zhu, Zhen Wang, Xi Xiao, Han Jiang, Saeed Vahidian, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su, Raju Vatsavai, and Jianyang Gu. Leveraging latent visual reasoning in silence. arXiv preprint arXiv:2605.18641, 2026

Pith/arXiv arXiv 2026
[10]

Vgent: Visual grounding via modular design for disentangling reasoning and prediction

Weitai Kang, Jason Kuen, Mengwei Ren, Zijun Wei, Yan Yan, and Kangning Liu. Vgent: Visual grounding via modular design for disentangling reasoning and prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 41160–41170, 2026

2026
[11]

Adaptvision: Efficient vision-language models via adaptive visual acquisition

Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, and Deheng Ye. Adaptvision: Efficient vision-language models via adaptive visual acquisition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11923–11932, 2026

2026
[12]

Direct pref- erence optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct pref- erence optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

2023
[13]

Da-dpo: Cost-efficient difficulty-aware preference optimization for reducing mllm hallucinations

Longtian Qiu, Shan Ning, Chuyu Zhang, Jiaxuan Sun, and Xuming He. Da-dpo: Cost-efficient difficulty-aware preference optimization for reducing mllm hallucinations. Transactions on Machine Learning Research, 2025

2025
[14]

Manifold learning: What, how, and why

Marina Meil˘ a and Hanyu Zhang. Manifold learning: What, how, and why. Annual Review of Statistics and Its Application, 11(1):393–417, 2024

2024
[15]

Rnagenscape: property-guided optimization and interpolation of mrna sequences with manifold langevin dynamics

Danqi Liao, Chen Liu, Xingzhi Sun, Dié Tang, Haochen Wang, Scott Youlten, Srikar Krishna Gopinath, Haejeong Lee, Ethan C Strayer, Antonio J Giraldez, et al. Rnagenscape: property-guided optimization and interpolation of mrna sequences with manifold langevin dynamics. arXiv preprint arXiv:2510.24736, 2025

Pith/arXiv arXiv 2025
[16]

Back to basics: Let denoising generative models denoise

Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 36115–36125, 2026

2026
[17]

Assessing neural network representations during training using noise-resilient diffusion spectral entropy

Danqi Liao, Chen Liu, Benjamin W Christensen, Alexander Tong, Guillaume Huguet, Guy Wolf, Maximilian Nickel, Ian Adelstein, and Smita Krishnaswamy. Assessing neural network representations during training using noise-resilient diffusion spectral entropy. In 2024 58th Annual Conference on Information Sciences and Systems (CISS), pages 1–6. IEEE, 2024. 13

2024
[18]

Geometry-aware generative autoencoders for warped riemannian metric learning and generative modeling on data manifolds

Xingzhi Sun, Danqi Liao, Kincaid MacDonald, Yanlei Zhang, Guillaume Huguet, Guy Wolf, Ian Adelstein, Tim GJ Rudner, and Smita Krishnaswamy. Geometry-aware generative autoencoders for warped riemannian metric learning and generative modeling on data manifolds. In The 28th International Conference on Artificial Intelligence and Statistics, 2025

2025
[19]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081): 633–638, 2025

2025
[20]

Perception-aware policy optimization for multimodal reasoning

Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, et al. Perception-aware policy optimization for multimodal reasoning. arXiv preprint arXiv:2507.06448, 2025

Pith/arXiv arXiv 2025
[21]

Analyzing and mitigating object hallucination in large vision-language models

Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao. Analyzing and mitigating object hallucination in large vision-language models. In The Twelfth International Conference on Learning Representations, 2024

2024
[22]

Viunit: Visual unit tests for more robust visual programming

Artemis Panagopoulou, Honglu Zhou, Silvio Savarese, Caiming Xiong, Chris Callison-Burch, Mark Yatskar, and Juan Carlos Niebles. Viunit: Visual unit tests for more robust visual programming. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24646–24656, 2025

2025
[23]

Measuring compositional consistency for video question answering

Mona Gandhi, Mustafa Omer Gul, Eva Prakash, Madeleine Grunde-McLaughlin, Ranjay Krishna, and Maneesh Agrawala. Measuring compositional consistency for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5046–5055, 2022

2022
[24]

Colorbench: Can vlms see and understand the colorful world? a comprehensive benchmark for color perception, reasoning, and robustness

Yijun Liang, Ming Li, Chenrui Fan, Ziyue Li, Dang Nguyen, Kwesi Adu Cobbina, Shweta Bhardwaj, Jiuhai Chen, Fuxiao Liu, and Tianyi Zhou. Colorbench: Can vlms see and understand the colorful world? a comprehensive benchmark for color perception, reasoning, and robustness. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Dataset...

2025
[25]

Counterfactual vqa: A cause-effect look at language bias

Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. Counterfactual vqa: A cause-effect look at language bias. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12700–12710, 2021

2021
[26]

A diversity-promoting objective function for neural conversation models

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and William B Dolan. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pages 110–119, 2016

2016
[27]

Palm: Scaling language modeling with pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of machine learning research, 24(240):1–113, 2023

2023
[28]

Seeing clearly, reasoning confidently: Plug-and-play remedies for vision language model blindness

Xin Hu, Haomiao Ni, Yunbei Zhang, Jihun Hamm, Zechen Li, and Zhengming Ding. Seeing clearly, reasoning confidently: Plug-and-play remedies for vision language model blindness. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2026

2026
[29]

Openrlhf: An easy-to-use, scalable and high-performance rlhf framework

Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, et al. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025

2025
[30]

Zero: Memory optimizations toward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: international conference for high performance computing, networking, storage and analysis, pages 1–16. IEEE, 2020

2020
[31]

Flashattention-2: Faster attention with better parallelism and work partitioning

Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. InInternational Conference on Learning Representations, volume 2024, pages 35549–35562, 2024

2024
[32]

Pytorch fsdp: Experiences on scaling fully sharded data parallel

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: Experiences on scaling fully sharded data parallel. Proceedings of the VLDB Endowment, 16(12):3848–3860, 2023

2023
[33]

Simpo: Simple preference optimization with a reference-free reward

Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems, 37:124198–124235, 2024. 14

2024
[34]

Beyond multimodal halluci- nations: Enhancing lvlms through hallucination-aware direct preference optimization

Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiaoyi Dong, Jiaqi Wang, and Conghui He. Beyond multimodal halluci- nations: Enhancing lvlms through hallucination-aware direct preference optimization. In 2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2025

2025
[35]

Evaluating object hallucination in large vision-language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 292–305, 2023

2023
[36]

Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation.arXiv preprint arXiv:2311.07397, 2023

Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation.arXiv preprint arXiv:2311.07397, 2023

Pith/arXiv arXiv 2023
[37]

Aligning large multimodal models with factually augmented rlhf

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. In Findings of the Association for Computational Linguistics: ACL 2024, pages 13088–13110, 2024

2024
[38]

Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In The Twelfth International Conference on Learning Representations, 2024

2024
[39]

Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024

2024
[40]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021

2021
[41]

Training verifiers to solve math word problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

Pith/arXiv arXiv 2021
[42]

Mitigating object hallucination in large vision-language models via visual attention direct preference optimization

Yixiao He, Haifeng Sun, Qi Qi, Zirui Zhuang, Pengfei Ren, Huazheng Wang, Yafeng Nan, and Jingyu Wang. Mitigating object hallucination in large vision-language models via visual attention direct preference optimization. In 2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2025

2025
[43]

Physics of language models: Part 3.3, knowledge capacity scaling laws

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.3, knowledge capacity scaling laws. In International Conference on Learning Representations, 2025

2025
[44]

Dispersion loss counteracts embedding condensation and improves generalization in small language models

Chen Liu, Xingzhi Sun, Xi Xiao, Alexandre Van Tassel, Ke Xu, Kristof Reimann, Danqi Liao, Mark Gerstein, Tianyang Wang, Xiao Wang, and Smita Krishnaswamy. Dispersion loss counteracts embedding condensation and improves generalization in small language models. In International Conference on Machine Learning. PMLR, 2026

2026
[45]

True multimodal in-context learning needs attention to the visual context

Shuo Chen, Jianzhe Liu, Zhen Han, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, and Jindong Gu. True multimodal in-context learning needs attention to the visual context. In Second Conference on Language Modeling, 2025

2025
[46]

Beyond seeing: Evaluating multimodal llms on tool-enabled image perception, transformation, and reasoning

Xingang Guo, Utkarsh Tyagi, Advait Gosai, Paula Vergara, Jayeon Park, Ernesto Gabriel Hernández Montoya, Chen Bo Calvin Zhang, Bin Hu, Yunzhong He, Bing Liu, et al. Beyond seeing: Evaluating multimodal llms on tool-enabled image perception, transformation, and reasoning. arXiv preprint arXiv:2510.12712, 2025

arXiv 2025
[47]

Small drafts, big verdict: Information-intensive visual reasoning via speculation

Yuhan Liu, Lianhui Qin, and Shengjie Wang. Small drafts, big verdict: Information-intensive visual reasoning via speculation. arXiv preprint arXiv:2510.20812, 2025

arXiv 2025
[48]

Defacto: Counterfactual thinking with images for enforcing evidence-grounded and faithful reasoning

Tianrun Xu, Haoda Jing, Ye Li, Yuquan Wei, Jun Feng, Guanyu Chen, Haichuan Gao, Tianren Zhang, and Feng Chen. Defacto: Counterfactual thinking with images for enforcing evidence-grounded and faithful reasoning. arXiv preprint arXiv:2509.20912, 2025

Pith/arXiv arXiv 2025
[49]

Hallucination of multimodal large language models: A survey

Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930, 2024

Pith/arXiv arXiv 2024
[50]

Group-relative visual discrimination enhancement for unlocking intrinsic capability of mllms

Fang Peng, Xiaoshan Yang, Yaowei Wang, and Changsheng Xu. Group-relative visual discrimination enhancement for unlocking intrinsic capability of mllms. IEEE Transactions on Circuits and Systems for Video Technology, 2026

2026
[51]

Visually-guided policy optimization for multimodal reasoning

Zengbin Wang, Feng Xiong, Liang Lin, Xuecai Hu, Yong Wang, Yanlin Wang, Man Zhang, and Xiangxiang Chu. Visually-guided policy optimization for multimodal reasoning. arXiv preprint arXiv:2604.09349, 2026

Pith/arXiv arXiv 2026
[52]

Ground-r1: Incentivizing grounded visual reasoning via reinforcement learning

Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, and Xiaodan Liang. Ground-r1: Incentivizing grounded visual reasoning via reinforcement learning. arXiv preprint arXiv:2505.20272, 2025

arXiv 2025
[53]

Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation

Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13418–13427, 2024

2024
[54]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

2022
[55]

Thinimg: Cross-modal steganography for presenting talking heads in images

Lin Zhao, Hongxuan Li, Xuefei Ning, and Xinru Jiang. Thinimg: Cross-modal steganography for presenting talking heads in images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 5553–5562, 2024

2024
[56]

Magicid: Hybrid preference optimization for id-consistent and dynamic-preserved video customization

Hengjia Li, Lifan Jiang, Xi Xiao, Tianyang Wang, Hongwei Yi, Boxi Wu, and Deng Cai. Magicid: Hybrid preference optimization for id-consistent and dynamic-preserved video customization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12737–12746, October 2025

2025
[57]

Hieramp: Coarse-to-fine autoregressive amplification for generative dataset distillation

Lin Zhao, Xinru Jiang, Xi Xiao, Qihui Fan, Lei Lu, Yanzhi Wang, Xue Lin, Octavia Camps, Pu Zhao, and Jianyang Gu. Hieramp: Coarse-to-fine autoregressive amplification for generative dataset distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 41688–41698, 2026

2026
[58]

Not all directions matter: Towards structured and task-aware low-rank model adaptation

Xi Xiao, Chenrui Ma, Yunbei Zhang, Chen Liu, Zhuxuanzi Wang, Yanshu Li, Lin Zhao, Guosheng Hu, Tianyang Wang, and Hao Xu. Not all directions matter: Towards structured and task-aware low-rank model adaptation. arXiv preprint arXiv:2603.14228, 2026

Pith/arXiv arXiv 2026
[59]

Spamem: Benchmarking dynamic spatial reasoning via perception-memory integration in embodied environments

Chih-Ting Liao, Xi Xiao, Chunlei Meng, Zhangquan Chen, Yitong Qiao, Weilin Zhou, Tianyang Wang, Xu Zheng, and Xin Cao. Spamem: Benchmarking dynamic spatial reasoning via perception-memory integration in embodied environments. arXiv preprint arXiv:2604.22409, 2026

Pith/arXiv arXiv 2026
[60]

Causality

Judea Pearl. Causality. Cambridge university press, 2009

2009
[61]

Align before fuse: Vision and language representation learning with momentum distillation

Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705, 2021

2021
