Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy

Ran Wang; Wing W. Y. Ng; Xizhao Wang; Yuheng Jia; Yutong Xie; Zhenglin Hua

arXiv: 2605.20965 · v1 · pith:DGXFYNR2new · submitted 2026-05-20 · cs.CV · cs.AI

Yutong Xie, Zhenglin Hua, Ran Wang, Wing W. Y. Ng, Xizhao Wang, Yuheng Jia

Pith reviewed 2026-05-21 04:59 UTC · model grok-4.3

keywords: hallucination mitigation, large vision-language models, visual attention, inter-layer discrepancy, saliency map, attention enhancement, training-free method

The pith

LVLMs hallucinate when they forget the correct visual evidence, but inter-layer attention discrepancies reveal a way to reinforce it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that large vision-language models generate responses inconsistent with images mainly because they pay too little attention to the right visual parts and then lose track of them while producing text. Although overall attention to visuals is weak, the models show clear sensitivity to the accurate evidence in particular layers, visible as big differences between layers. By examining attention weights from the first few output tokens to image tokens across all layers, the authors locate image regions that get activated repeatedly and turn those into a saliency map. This map is applied during later generation steps to keep attention on the evidence and also to favor text tokens that match it well. The whole process needs no retraining and can be added directly to existing models.

Core claim

LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. Although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers with notable inter-layer discrepancy. A saliency map is formed from attention weights of early generated tokens to visual tokens by selecting those repeatedly activated across layers; this map is then used to enhance attention to the evidence and to emphasize text tokens grounded in it, thereby reducing visual forgetting.

What carries the argument

Inter-Layer Visual Attention Discrepancy (ILVAD) identifies repeatedly activated visual tokens from early-generation attention weights across layers to build a saliency map that boosts attention and curbs forgetting.
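
As a rough illustration of the "repeatedly activated" idea, the sketch below counts how many layers rank each visual token among their most-attended tokens and keeps the tokens that enough layers agree on. It is a minimal sketch under stated assumptions, not the authors' procedure: the function name `saliency_from_layers`, the per-layer `top_k` rule, and the `min_layers` cutoff are all hypothetical, and the input is assumed to be attention already averaged over heads and over the first few generated tokens. The selection rule and thresholds actually used in the paper are not given on this page.

```python
from typing import List


def saliency_from_layers(
    attn: List[List[float]],  # attn[layer][visual_token]; assumed pre-averaged over heads and early tokens
    top_k: int,               # hypothetical: tokens counted as "activated" in each layer
    min_layers: int,          # hypothetical: number of layers that must agree
) -> List[int]:
    """Return a binary mask over visual tokens (1 = treated as visual evidence)."""
    num_tokens = len(attn[0])
    counts = [0] * num_tokens
    for layer_weights in attn:
        # Rank visual tokens by attention weight within this layer.
        ranked = sorted(range(num_tokens), key=lambda i: layer_weights[i], reverse=True)
        for i in ranked[:top_k]:
            counts[i] += 1
    # Keep only tokens that were activated in at least min_layers layers.
    return [1 if c >= min_layers else 0 for c in counts]


# Toy example: 3 layers, 4 visual tokens.
attn = [
    [0.05, 0.40, 0.30, 0.25],  # layer 0
    [0.10, 0.45, 0.05, 0.40],  # layer 1
    [0.50, 0.30, 0.10, 0.10],  # layer 2
]
mask = saliency_from_layers(attn, top_k=2, min_layers=2)
print(mask)
```

In this toy input only the second visual token is in the top two of every layer, so it is the only one that survives the agreement cutoff. A soft (non-binary) map could be obtained by returning the normalized counts instead.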

If this is right

The method works without any model retraining and plugs directly into different LVLM architectures.
Hallucination rates drop consistently when the approach is applied to five recent models on multiple benchmarks.
Text tokens can be chosen and highlighted according to how strongly their attention aligns with the visual saliency map.
Maintaining boosted attention to the identified evidence throughout generation prevents gradual visual forgetting.
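
One simple way to picture "boosted attention to the identified evidence" is to rescale the post-softmax attention weights on salient keys and renormalize. The sketch below assumes that form; the function name `boost_visual_attention` and the `gain` parameter are hypothetical, and the paper may intervene differently (for example on pre-softmax logits or only in selected layers or heads).

```python
from typing import List


def boost_visual_attention(
    weights: List[float],  # post-softmax attention of one query over all keys
    saliency: List[int],   # 1 for keys marked as visual evidence, else 0
    gain: float,           # hypothetical boost strength, gain >= 0
) -> List[float]:
    """Scale salient keys by (1 + gain), then renormalize so the weights sum to 1."""
    scaled = [w * (1.0 + gain * s) for w, s in zip(weights, saliency)]
    total = sum(scaled)
    return [w / total for w in scaled]


weights = [0.1, 0.2, 0.3, 0.4]
saliency = [0, 1, 0, 0]
boosted = boost_visual_attention(weights, saliency, gain=1.0)
print(boosted)
```

With `gain=1.0` the salient key's share rises from 0.2 to roughly one third, and every other key shrinks proportionally. Setting `gain=0.0` leaves the distribution unchanged, which makes the intervention easy to switch off for comparison.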

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same layer-wise discrepancy pattern could appear in other multimodal models and might guide where to insert visual grounding checks.
Updating the saliency map at later steps could help maintain accuracy in very long generated responses.
Pairing the attention reinforcement with existing alignment techniques might produce even more reliable outputs.

Load-bearing premise

Tokens that receive repeated activation across layers from the attention patterns of early generated tokens are the correct visual evidence, and strengthening attention to them will cut hallucinations without creating new inconsistencies.

What would settle it

Run the saliency-map enhancement on standard hallucination benchmarks for the tested LVLMs and measure whether hallucination rates stay the same or rise instead of falling.

Figures

Figures reproduced from arXiv: 2605.20965 by Ran Wang, Wing W. Y. Ng, Xizhao Wang, Yuheng Jia, Yutong Xie, Zhenglin Hua.

**Figure 2.** Overview of the proposed method Inter-Layer Visual Attention Discrepancy (ILVAD). We extract the visual evidence saliency

**Figure 3.** Visualization on the impact of τ towards visual evidence saliency map

**Figure 4.** Impact evaluation on α and β, with the black boxes highlighting the best results. The vertical axis represents α, and the horizontal axis represents β.

**Figure 5.** Comparison of inference times for different methods. (Plot axis: Inference Time, 1x Greedy. Methods shown: Greedy, Beam, VCD, CODE, AGLA, VAF, VAR, SPARC, ONLY, VHR, Ours.)

**Figure 6.** Visualization of visual evidence saliency maps. The queries of images above are “Is there a cup in the image?”, “Describe

**Figure 7.** Case study for LLAVA-1.5-7B. The hallucinated text generated by the baseline (Greedy) and the corresponding real text

**Figure 8.** Case study for LLaVA-NeXT-7B. The hallucinated text generated by the baseline (Greedy) and the corresponding real text

**Figure 9.** Case study for Qwen2-VL-7B. The hallucinated text generated by the baseline (Greedy) and the corresponding real text

Original abstract

Large Vision-Language Models (LVLMs) have shown remarkable performance on a wide range of vision-language tasks. Despite this progress, they are still prone to hallucination, generating responses that are inconsistent with visual content. In this work, we find that LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. We empirically find that although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers, with notable inter-layer discrepancy. Motivated by this observation, we propose a novel hallucination mitigation method that enhances visual evidence based on Inter-Layer Visual Attention Discrepancy (ILVAD). Specifically, we obtain the attention weights from early generated tokens to visual tokens across layers and identify the tokens that are repeatedly activated as visual evidence, forming a saliency map. We then enhance attention to visual evidence during generation through the saliency map to reduce visual forgetting. In addition, we leverage the saliency map to obtain attention scores of generated text to visual evidence, in order to select and emphasize text tokens that are strongly grounded in visual evidence. Our method is training-free and plug-and-play. Multiple benchmark evaluations conducted on five recently released models show that our method can consistently mitigate hallucinations in different LVLMs over various architectures. Code is available at https://github.com/ytx-ML/ILVAD.
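
The text-token selection step described in the abstract can be pictured as scoring each generated token by how much of its attention lands on the salient visual tokens, then keeping the high scorers. The sketch below is only an illustration of that idea: the names `grounding_scores` and `select_grounded` and the fixed `threshold` are hypothetical, and how the paper scores, selects, and then emphasizes tokens is not specified on this page.

```python
from typing import List


def grounding_scores(
    text_to_visual: List[List[float]],  # text_to_visual[t][v]: attention of text token t to visual token v
    saliency: List[int],                # binary saliency mask over visual tokens
) -> List[float]:
    """Attention mass each generated text token places on salient visual tokens."""
    return [sum(w * s for w, s in zip(row, saliency)) for row in text_to_visual]


def select_grounded(scores: List[float], threshold: float) -> List[int]:
    """Indices of text tokens whose grounding score reaches the (hypothetical) threshold."""
    return [i for i, score in enumerate(scores) if score >= threshold]


text_to_visual = [
    [0.1, 0.6, 0.1, 0.2],  # text token 0
    [0.3, 0.1, 0.3, 0.3],  # text token 1
    [0.0, 0.5, 0.2, 0.3],  # text token 2
]
saliency = [0, 1, 0, 1]
scores = grounding_scores(text_to_visual, saliency)
selected = select_grounded(scores, threshold=0.5)
print(selected)
```

Here text tokens 0 and 2 place most of their visual attention on the salient tokens and are selected, while token 1 spreads its attention over non-salient regions and is not.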

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a straightforward training-free fix for LVLM hallucinations by boosting visual tokens that show attention shifts across layers in early outputs, with decent results on multiple models but thin validation that those tokens are actually the right evidence.

The main takeaway is a practical, no-training intervention that spots visual tokens with high inter-layer attention discrepancy from the first few generated tokens, builds a saliency map from the repeatedly activated ones, and uses it both to strengthen attention during generation and to re-weight text tokens for better grounding. They report consistent hallucination reductions across five recent models on standard benchmarks, which is useful for anyone running these systems in practice. The approach is genuinely plug-and-play and avoids the usual fine-tuning costs, so it earns credit for being immediately deployable. The observation that models show layer-specific sensitivity to visual evidence is a reasonable starting point and aligns with known attention dynamics in transformers.

On the soft spots, the central assumption that the discrepancy-selected tokens correspond to correct visual evidence lacks direct support. There are no reported comparisons to ground-truth object labels, human judgments, or simple baselines like uniform attention boosts or random token emphasis, so it remains possible that any increase in visual attention mass would produce similar gains. The experiments also appear light on ablations for choices like how many early tokens to use or which layers to compare.

This is the sort of incremental but reproducible work that multimodal practitioners would value for quick reliability improvements, even if it does not resolve underlying model limitations. It is coherent on its own terms and shows honest engagement with the hallucination problem through attention analysis. I would bring it to a reading group for the method details and send it for peer review because the idea is testable and the multi-model results make it worth referee scrutiny.

Referee Report

2 major / 1 minor

Summary. The paper claims that LVLMs hallucinate when paying insufficient attention to correct visual evidence and gradually forgetting it during generation. It empirically observes sensitivity to correct visual evidence in specific layers with notable inter-layer discrepancy. Motivated by this, the authors propose a training-free ILVAD method: attention weights from early generated tokens to visual tokens are used across layers to identify repeatedly activated tokens as visual evidence, forming a saliency map that boosts attention to these tokens during continued generation and re-weights text tokens strongly grounded in visual evidence. Evaluations on five recent LVLMs across benchmarks show consistent hallucination mitigation.

Significance. If the central assumption and empirical results hold, this work offers a significant practical advance by introducing a simple, training-free, plug-and-play technique for reducing hallucinations in LVLMs that leverages inter-layer attention patterns rather than model retraining. The method's reported generality across different architectures and the public code release support reproducibility and potential adoption.

major comments (2)

[Method (ILVAD description)] The method section defines visual evidence as tokens repeatedly activated across layers from attention weights of the first few generated tokens, then uses the resulting saliency map to mitigate forgetting. No independent verification (ground-truth object labels, human annotations, or causal tests against random/uniform attention baselines) is reported to confirm these tokens are the correct evidence rather than spurious correlations; this assumption is load-bearing for the claim that reinforcement via ILVAD specifically counters visual forgetting and hallucinations.
[Experiments and Evaluation] The experiments claim consistent mitigation across five models and various architectures, yet the manuscript provides insufficient quantitative results, ablation studies on saliency-map construction parameters (e.g., number of early tokens or layer selection), or baseline comparisons that would isolate the contribution of inter-layer discrepancy from generic increases in visual attention mass.
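
The control comparison requested in these comments could be set up along the following lines. This is a hedged sketch of one possible design, not something reported in the paper: the function name `control_masks` is hypothetical, and the idea is simply to hold the number of boosted visual tokens fixed for the random control so that only the choice of tokens differs from the method's mask.

```python
import random
from typing import Dict, List


def control_masks(saliency: List[int], seed: int = 0) -> Dict[str, List[int]]:
    """Build control masks to compare against a saliency mask over visual tokens."""
    n = len(saliency)
    k = sum(saliency)  # number of tokens the method marks as evidence
    rng = random.Random(seed)
    chosen = set(rng.sample(range(n), k))
    return {
        "method": list(saliency),                               # tokens picked by the method
        "random": [1 if i in chosen else 0 for i in range(n)],  # same count, random positions
        "uniform": [1] * n,                                     # every visual token boosted
    }


masks = control_masks([0, 1, 0, 1, 0, 0], seed=0)
for name, mask in masks.items():
    print(name, mask)
```

If the same boost applied with the `random` or `uniform` mask reduced hallucinations about as much as the `method` mask, that would suggest the gains come from extra visual attention mass in general rather than from the specific tokens selected.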

minor comments (1)

[Abstract] The abstract would benefit from naming the specific benchmarks and hallucination metrics used to quantify improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify our work. We address each major comment below and describe the revisions we will make to the manuscript.

Referee: [Method (ILVAD description)] The method section defines visual evidence as tokens repeatedly activated across layers from attention weights of the first few generated tokens, then uses the resulting saliency map to mitigate forgetting. No independent verification (ground-truth object labels, human annotations, or causal tests against random/uniform attention baselines) is reported to confirm these tokens are the correct evidence rather than spurious correlations; this assumption is load-bearing for the claim that reinforcement via ILVAD specifically counters visual forgetting and hallucinations.

Authors: We acknowledge that the manuscript relies on empirical observations of inter-layer attention discrepancies without providing independent verification such as ground-truth object labels or direct comparisons to random baselines. Our definition of visual evidence stems from the consistent activation patterns observed in early tokens across layers, which we link to reduced hallucinations when reinforced. To strengthen this, we will add experiments in the revised manuscript that include comparisons against random and uniform attention baselines, as well as any available causal analyses, to better demonstrate that the selected tokens are not spurious.

revision: yes

Referee: [Experiments and Evaluation] The experiments claim consistent mitigation across five models and various architectures, yet the manuscript provides insufficient quantitative results, ablation studies on saliency-map construction parameters (e.g., number of early tokens or layer selection), or baseline comparisons that would isolate the contribution of inter-layer discrepancy from generic increases in visual attention mass.

Authors: The current manuscript reports consistent improvements across five LVLMs and multiple benchmarks, but we agree that the experimental section would benefit from more detailed quantitative breakdowns and ablations. We will expand the revised version to include ablation studies on the number of early tokens used, layer selection choices, and additional baselines that apply generic visual attention boosts without leveraging inter-layer discrepancy. These additions will help isolate the specific contribution of our approach.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is an explicit empirical procedure on observed attention patterns

The paper's central contribution is an empirical observation that LVLMs exhibit inter-layer attention discrepancies to visual tokens, followed by a training-free procedural definition of ILVAD: extract attention weights from the first few generated tokens to image tokens across layers, select repeatedly activated tokens to form a saliency map, then use the map to boost visual attention and re-weight text tokens during continued generation. This procedure is defined directly in terms of the model's internal attention weights rather than any fitted parameter, self-referential equation, or prior result that reduces to the target claim by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the derivation; the assumption that repeatedly activated tokens constitute 'correct visual evidence' is presented as a motivated heuristic whose effectiveness is then tested on external benchmarks, keeping the chain self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical observation that inter-layer attention patterns identify correct visual evidence and that reinforcing those patterns reduces forgetting. This depends on the domain assumption that attention weights serve as a reliable proxy for factual grounding.

axioms (1)

domain assumption: Attention weights from early generated tokens to visual tokens reflect the model's sensitivity to correct visual evidence.
Invoked to justify constructing the saliency map from inter-layer discrepancies.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 6 internal anchors

[1]

2025 , pages =

Huang, Xijie and Wang, Xinyuan and Zhang, Hantao and Zhu, Yinghao and Xi, Jiawen and An, Jingkun and Wang, Hao and Liang, Hao and Pan, Chengwei , title =. 2025 , pages =

work page 2025
[2]

Advances in Neural Information Processing Systems , volume =

Haotian Liu and Chunyuan Li and Qingyang Wu and Yong Jae Lee , title =. Advances in Neural Information Processing Systems , volume =

work page
[3]

The Twelfth International Conference on Learning Representations , year =

Deyao Zhu and Jun Chen and Xiaoqian Shen and Xiang Li and Mohamed Elhoseiny , title =. The Twelfth International Conference on Learning Representations , year =

work page
[7]

LLaVA-NeXT: Improved reasoning, OCR, and world knowledge , url=

Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae , year=. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge , url=

work page
[8]

The Thirteenth International Conference on Learning Representations , year =

Seil Kang and Jinyeong Kim and Junhyeok Kim and Seong Jae Hwang , title =. The Thirteenth International Conference on Learning Representations , year =

work page
[9]

The Twelfth International Conference on Learning Representations , year =

Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs , author=. The Twelfth International Conference on Learning Representations , year =

work page
[10]

Proceedings of the

Sicong Leng and Hang Zhang and Guanzheng Chen and Xin Li and Shijian Lu and Chunyan Miao and Lidong Bing , title =. Proceedings of the

work page
[11]

Advances in Neural Information Processing Systems , volume =

Junho Kim and Hyunjun Kim and Yeonju Kim and Yong Man Ro , title =. Advances in Neural Information Processing Systems , volume =

work page
[12]

Proceedings of the

Wenbin An and Feng Tian and Sicong Leng and Jiahao Nie and Haonan Lin and Qianying Wang and Ping Chen and Xiaoqin Zhang and Shijian Lu , title =. Proceedings of the

work page
[13]

and Stepputtis, Simon and Morency, Louis-Philippe and Ramanan, Deva and Sycara, Katia and Xie, Yaqi , booktitle=

Wan, Zifu and Zhang, Ce and Yong, Silong and Ma, Martin Q. and Stepputtis, Simon and Morency, Louis-Philippe and Ramanan, Deva and Sycara, Katia and Xie, Yaqi , booktitle=. ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models , pages=

work page
[14]

Proceedings of the

Hao Yin and Guangzong Si and Zilei Wang , title =. Proceedings of the

work page
[15]

Proceedings of the 42nd International Conference on Machine Learning , pages =

Mingi Jung and Saehyung Lee and Eunji Kim and Sungroh Yoon , title =. Proceedings of the 42nd International Conference on Machine Learning , pages =

work page
[16]

Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence , booktitle =

Jinghan He and Kuan Zhu and Haiyun Guo and Junfeng Fang and Zhenglin Hua and Yuheng Jia and Ming Tang and Tat. Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence , booktitle =

work page
[18]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Yu, Tianyu and Yao, Yuan and Zhang, Haoye and He, Taiwen and Han, Yifeng and Cui, Ganqu and Hu, Jinyi and Liu, Zhiyuan and Zheng, Hai-Tao and Sun, Maosong and Chua, Tat-Seng , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[19]

Steering LVLM s via Sparse Autoencoder for Hallucination Mitigation

Hua, Zhenglin and He, Jinghan and Yao, Zijun and Han, Tianxu and Guo, Haiyun and Jia, Yuheng and Fang, Junfeng. Steering LVLM s via Sparse Autoencoder for Hallucination Mitigation. Findings of the Association for Computational Linguistics: EMNLP. 2025

work page 2025
[20]

The Thirteenth International Conference on Learning Representations , year=

AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models , author=. The Thirteenth International Conference on Learning Representations , year=

work page
[21]

Science China Information Sciences , volume =

Woodpecker: Hallucination correction for multimodal large language models , author=. Science China Information Sciences , volume =

work page
[22]

Shallow Focus, Deep Fixes: Enhancing Shallow Layers Vision Attention Sinks to Alleviate Hallucination in LVLM s

Zhang, Xiaofeng and Quan, Yihao and Shen, Chen and Gu, Chaochen and Yuan, Xiaosong and Yan, Shaotian and Cao, Jiawei and Cheng, Hao and Wu, Kaijie and Ye, Jieping. Shallow Focus, Deep Fixes: Enhancing Shallow Layers Vision Attention Sinks to Alleviate Hallucination in LVLM s. Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2025

work page 2025
[24]

Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models , year =

Huo, Fushuo and Xu, Wenchao and Zhang, Zhong and Wang, Haozhao and Chen, Zhicheng and Zhao, Peilin , booktitle=. Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models , year =

work page
[25]

Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding , pages=

Tang, Feilong and Liu, Chengzhi and Xu, Zhongxing and Hu, Ming and Huang, Zile and Xue, Haochen and Chen, Ziyang and Peng, Zelin and Yang, Zhiwei and Zhou, Sijin and Li, Wenxue and Li, Yulong and Song, Wenxuan and Su, Shiyan and Feng, Wei and Su, Jionglong and Lin, Minquan and Peng, Yifan and Cheng, Xuelian and Razzak, Imran and Ge, Zongyuan , booktitle=....

work page
[27]

VQAG uider: Guiding Multimodal Large Language Models to Answer Complex Video Questions

Chen, Yuyan and Jia, Jiyuan and Lu, Jiaxin and Li, Siyue and Guan, Yu and Yang, Ming and Guo, Qingpei. VQAG uider: Guiding Multimodal Large Language Models to Answer Complex Video Questions. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. 2025

work page 2025
[28]

Grounding Multimodal Large Language Models to the World , year =

Peng, Zhiliang and Wang, Wenhui and Dong, Li and Hao, Yaru and Huang, Shaohan and Ma, Shuming and Ye, Qixiang and Wei, Furu , booktitle=. Grounding Multimodal Large Language Models to the World , year =

work page
[29]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Zhang, Ruifei and Zhang, Wei and Tan, Xiao and Yang, Sibei and Wan, Xiang and Luo, Xiaonan and Li, Guanbin , title =. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[30]

Object Hallucination in Image Captioning

Rohrbach, Anna and Hendricks, Lisa Anne and Burns, Kaylee and Darrell, Trevor and Saenko, Kate. Object Hallucination in Image Captioning. Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2018

work page 2018
[31]

Evaluating Object Hallucination in Large Vision-Language Models

Li, Yifan and Du, Yifan and Zhou, Kun and Wang, Jinpeng and Zhao, Xin and Wen, Ji-Rong. Evaluating Object Hallucination in Large Vision-Language Models. Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2023

work page 2023
[33]

Aligning Large Multimodal Models with Factually Augmented RLHF

Sun, Zhiqing and Shen, Sheng and Cao, Shengcao and Liu, Haotian and Li, Chunyuan and Shen, Yikang and Gan, Chuang and Gui, Liangyan and Wang, Yu-Xiong and Yang, Yiming and Keutzer, Kurt and Darrell, Trevor. Aligning Large Multimodal Models with Factually Augmented RLHF. Findings of the Association for Computational Linguistics: ACL. 2024

work page 2024
[34]

Improved Baselines with Visual Instruction Tuning , pages=

Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae , booktitle=. Improved Baselines with Visual Instruction Tuning , pages=

work page
[35]

Lawrence

Lin, Tsung-Yi and Maire, Michael and Belongie, Serge and Hays, James and Perona, Pietro and Ramanan, Deva and Doll \'a r, Piotr and Zitnick, C. Lawrence. Microsoft COCO: Common Objects in Context. 2014

work page 2014
[36]

2022 , booktitle =

Schwenk, Dustin and Khandelwal, Apoorv and Clark, Christopher and Marino, Kenneth and Mottaghi, Roozbeh , title =. 2022 , booktitle =

work page 2022
[37]

and Manning, Christopher D

Hudson, Drew A. and Manning, Christopher D. , booktitle=. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering , year=

work page
[38]

Proceedings of the Thirty-Ninth

Qi Sun and Marc Pickett and Aakash Kumar Nain and Llion Jones , title =. Proceedings of the Thirty-Ninth

work page
[39]

DAMRO : Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination

Gong, Xuan and Ming, Tianshi and Wang, Xinpeng and Wei, Zhihua. DAMRO : Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination. Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2024

work page 2024
[40]

Proceedings of the 42nd International Conference on Machine Learning , volume =

Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage , author =. Proceedings of the 42nd International Conference on Machine Learning , volume =

work page
[41]

I mage I n W ords: Unlocking Hyper-Detailed Image Descriptions

Garg, Roopal and Burns, Andrea and Karagol Ayan, Burcu and Bitton, Yonatan and Montgomery, Ceslee and Onoe, Yasumasa and Bunner, Andrew and Krishna, Ranjay and Baldridge, Jason Michael and Soricut, Radu. I mage I n W ords: Unlocking Hyper-Detailed Image Descriptions. Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2024

work page 2024
[42]

and Petryk, Suzanne and Gonzalez, Joseph E

Chan, David M. and Petryk, Suzanne and Gonzalez, Joseph E. and Darrell, Trevor and Canny, John. CLAIR : Evaluating Image Captions with Large Language Models. Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2023

work page 2023
[43]

and Kachinthaya, Anish and Zou, Haodi and Canny, John and Gonzalez, Joseph E

Petryk, Suzanne and Chan, David M. and Kachinthaya, Anish and Zou, Haodi and Canny, John and Gonzalez, Joseph E. and Darrell, Trevor. ALOH a: A New Measure for Hallucination in Captioning Models. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2024

work page 2024
[44]

Proceedings of the Fortieth

Xinyue Wang and Yuheng Jia and Hui Liu and Junhui Hou , title =. Proceedings of the Fortieth

work page
[45]

Mitigating object hallucinations in large vision-language models with assembly of global and local attention

An, W., Tian, F., Leng, S., Nie, J., Lin, H., Wang, Q., Chen, P., Zhang, X., and Lu, S. Mitigating object hallucinations in large vision-language models with assembly of global and local attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pp.\ 29915--29926, 2025

work page 2025
[46]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

M., Petryk, S., Gonzalez, J

Chan, D. M., Petryk, S., Gonzalez, J. E., Darrell, T., and Canny, J. CLAIR : Evaluating image captions with large language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.\ 13638--13646, 2023

work page 2023
[48]

VQAG uider: Guiding multimodal large language models to answer complex video questions

Chen, Y., Jia, J., Lu, J., Li, S., Guan, Y., Yang, M., and Guo, Q. VQAG uider: Guiding multimodal large language models to answer complex video questions. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pp.\ 7821--7834, 2025

work page 2025
[49]

Alphaedit: Null-space constrained knowledge editing for language models

Fang, J., Jiang, H., Wang, K., Ma, Y., Shi, J., Wang, X., He, X., and Chua, T.-S. Alphaedit: Null-space constrained knowledge editing for language models. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[50]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Qiu, Z., Lin, W., Yang, J., Zheng, X., Li, K., Sun, X., and Ji, R. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

M., and Soricut, R

Garg, R., Burns, A., Karagol Ayan, B., Bitton, Y., Montgomery, C., Onoe, Y., Bunner, A., Krishna, R., Baldridge, J. M., and Soricut, R. I mage I n W ords: Unlocking hyper-detailed image descriptions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.\ 93--127, 2024

work page 2024
[52]

DAMRO : Dive into the attention mechanism of LVLM to reduce object hallucination

Gong, X., Ming, T., Wang, X., and Wei, Z. DAMRO : Dive into the attention mechanism of LVLM to reduce object hallucination. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.\ 7696--7712, 2024

work page 2024
[53]

Cracking the code of hallucination in lvlms with vision-aware head divergence

He, J., Zhu, K., Guo, H., Fang, J., Hua, Z., Jia, Y., Tang, M., Chua, T., and Wang, J. Cracking the code of hallucination in lvlms with vision-aware head divergence. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pp.\ 3488--3501, 2025

work page 2025
[54]

Steering LVLM s via sparse autoencoder for hallucination mitigation

Hua, Z., He, J., Yao, Z., Han, T., Guo, H., Jia, Y., and Fang, J. Steering LVLM s via sparse autoencoder for hallucination mitigation. In Findings of the Association for Computational Linguistics: EMNLP, pp.\ 10808--10828, 2025

work page 2025
[55]

Medical mllm is vulnerable: cross-modality jailbreak and mismatched attacks on medical multimodal large language models

Huang, X., Wang, X., Zhang, H., Zhu, Y., Xi, J., An, J., Wang, H., Liang, H., and Pan, C. Medical mllm is vulnerable: cross-modality jailbreak and mismatched attacks on medical multimodal large language models. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, pp.\ 3797--3805, 2025

work page 2025
[56]

Self-introspective decoding: Alleviating hallucinations for large vision-language models

Huo, F., Xu, W., Zhang, Z., Wang, H., Chen, Z., and Zhao, P. Self-introspective decoding: Alleviating hallucinations for large vision-language models. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025

[57] Jung, M., Lee, S., Kim, E., and Yoon, S. Visual attention never fades: Selective progressive attention recalibration for detailed image captioning in multimodal large language models. In Proceedings of the 42nd International Conference on Machine Learning, volume 267, pp. 28527–28551, 2025

[58] Kang, S., Kim, J., Kim, J., and Hwang, S. J. See what you are told: Visual attention sink in large multimodal models. In The Thirteenth International Conference on Learning Representations, 2025

[59] Kim, J., Kim, H., Kim, Y., and Ro, Y. M. CODE: Contrasting self-generated description to combat hallucination in large multi-modal models. In Advances in Neural Information Processing Systems, volume 37, pp. 133571–133599, 2024

[60] Lee, S., Yoon, S., Bui, T., Shi, J., and Yoon, S. Toward robust hyper-detailed image captioning: A multiagent approach and dual evaluation metrics for factuality and coverage. In Proceedings of the 42nd International Conference on Machine Learning, volume 267, pp. 33815–33832, 2025

[61] Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., and Bing, L. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13872–13882, 2024

[62] Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., and Wen, J.-R. Evaluating object hallucination in large vision-language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 292–305, 2023

[63] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pp. 740–755, 2014

[64] Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. In Advances in Neural Information Processing Systems, volume 36, pp. 34892–34916, 2023

[65] Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26286–26296, 2024a

[66] Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024b. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/

[67] Liu, H., Xue, W., Chen, Y., Chen, D., Zhao, X., Wang, K., Hou, L., Li, R., and Peng, W. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024c

[68] Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Ye, Q., and Wei, F. Grounding multimodal large language models to the world. In The Twelfth International Conference on Learning Representations, 2024

[69] Petryk, S., Chan, D. M., Kachinthaya, A., Zou, H., Canny, J., Gonzalez, J. E., and Darrell, T. ALOHa: A new measure for hallucination in captioning models. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 342–357, 2024

[70] Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T., and Saenko, K. Object hallucination in image captioning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 4035–4045, 2018

[71] Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L., Wang, Y.-X., Yang, Y., Keutzer, K., and Darrell, T. Aligning large multimodal models with factually augmented RLHF. In Findings of the Association for Computational Linguistics: ACL, pp. 13088–13110, 2024

[72] Tang, F., Liu, C., Xu, Z., Hu, M., Huang, Z., Xue, H., Chen, Z., Peng, Z., Yang, Z., Zhou, S., Li, W., Li, Y., Song, W., Su, S., Feng, W., Su, J., Lin, M., Peng, Y., Cheng, X., Razzak, I., and Ge, Z. Seeing far and clearly: Mitigating hallucinations in MLLMs with attention causal decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and P... (2025)

[73] Wan, Z., Zhang, C., Yong, S., Ma, M. Q., Stepputtis, S., Morency, L.-P., Ramanan, D., Sycara, K., and Xie, Y. ONLY: One-layer intervention sufficiently mitigates hallucinations in large vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3225–3234, 2025

[74] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., and Lin, J. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

[75] Wang, T., Zhang, J., Fei, J., Ge, Y., Zheng, H., Tang, Y., Li, Z., Gao, M., Zhao, S., Shan, Y., and Zheng, F. Caption anything: Interactive image description with diverse multimodal controls. arXiv preprint arXiv:2305.02677, 2023

[76] Wang, X., Jia, Y., Liu, H., and Hou, J. ESMC: MLLM-based embedding selection for explainable multiple clustering. In Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, pp. 26588–26596, 2026

[77] Yin, H., Si, G., and Wang, Z. Clearsight: Visual signal enhancement for object hallucination mitigation in multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14625–14634, 2025

[78] Yin, S., Fu, C., Zhao, S., Xu, T., Wang, H., Sui, D., Shen, Y., Li, K., Sun, X., and Chen, E. Woodpecker: Hallucination correction for multimodal large language models. Science China Information Sciences, 67, 2024

[79] Yu, T., Yao, Y., Zhang, H., He, T., Han, Y., Cui, G., Hu, J., Liu, Z., Zheng, H.-T., Sun, M., and Chua, T.-S. RLHF-V: Towards trustworthy MLLMs via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13807–13816, 2024

[80] Zhang, Q., Singh, C., Liu, L., Liu, X., Yu, B., Gao, J., and Zhao, T. Tell your model where to attend: Post-hoc attention steering for LLMs. In The Twelfth International Conference on Learning Representations, 2024

[81] Zhang, R., Zhang, W., Tan, X., Yang, S., Wan, X., Luo, X., and Li, G. VLDrive: Vision-augmented lightweight MLLMs for efficient language-grounded autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5923–5933, 2025a

[82] Zhang, X., Quan, Y., Shen, C., Gu, C., Yuan, X., Yan, S., Cao, J., Cheng, H., Wu, K., and Ye, J. Shallow focus, deep fixes: Enhancing shallow layers vision attention sinks to alleviate hallucination in LVLMs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 3512–3534, 2025b

[83] Zhao, Z., Wang, B., Ouyang, L., wen Dong, X., Wang, J., and He, C. Beyond hallucinations: Enhancing LVLMs through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839, 2023

[84] Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In The Twelfth International Conference on Learning Representations, 2024

[85] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., Gao, Z., Cui, E., Wang, X., Cao, Y., Liu, Y., Wei, X., Zhang, H., Wang, H., Xu, W., Li, H., Wang, J., Deng, N., Li, S., He, Y., Jiang, T., Luo, J., Wang, Y., He, C., Shi, B., Zhang, X., Shao, W., He, J., Xiong, Y., Qu, W., Sun, P., Jiao, P., Lv, H., Wu, L., Zhang, ... arXiv, 2025

[44] [51]

M., and Soricut, R

Garg, R., Burns, A., Karagol Ayan, B., Bitton, Y., Montgomery, C., Onoe, Y., Bunner, A., Krishna, R., Baldridge, J. M., and Soricut, R. I mage I n W ords: Unlocking hyper-detailed image descriptions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.\ 93--127, 2024

work page 2024

[45] [52]

DAMRO : Dive into the attention mechanism of LVLM to reduce object hallucination

Gong, X., Ming, T., Wang, X., and Wei, Z. DAMRO : Dive into the attention mechanism of LVLM to reduce object hallucination. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.\ 7696--7712, 2024

work page 2024

[46] [53]

Cracking the code of hallucination in lvlms with vision-aware head divergence

He, J., Zhu, K., Guo, H., Fang, J., Hua, Z., Jia, Y., Tang, M., Chua, T., and Wang, J. Cracking the code of hallucination in lvlms with vision-aware head divergence. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pp.\ 3488--3501, 2025

work page 2025

[47] [54]

Steering LVLM s via sparse autoencoder for hallucination mitigation

Hua, Z., He, J., Yao, Z., Han, T., Guo, H., Jia, Y., and Fang, J. Steering LVLM s via sparse autoencoder for hallucination mitigation. In Findings of the Association for Computational Linguistics: EMNLP, pp.\ 10808--10828, 2025

work page 2025

[48] [55]

Medical mllm is vulnerable: cross-modality jailbreak and mismatched attacks on medical multimodal large language models

Huang, X., Wang, X., Zhang, H., Zhu, Y., Xi, J., An, J., Wang, H., Liang, H., and Pan, C. Medical mllm is vulnerable: cross-modality jailbreak and mismatched attacks on medical multimodal large language models. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, pp.\ 3797--3805, 2025

work page 2025

[49] [56]

Self-introspective decoding: Alleviating hallucinations for large vision-language models

Huo, F., Xu, W., Zhang, Z., Wang, H., Chen, Z., and Zhao, P. Self-introspective decoding: Alleviating hallucinations for large vision-language models. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025

[50] [57]

Visual attention never fades: Selective progressive attention recalibration for detailed image captioning in multimodal large language models

Jung, M., Lee, S., Kim, E., and Yoon, S. Visual attention never fades: Selective progressive attention recalibration for detailed image captioning in multimodal large language models. In Proceedings of the 42nd International Conference on Machine Learning, volume 267, pp.\ 28527--28551, 2025

work page 2025

[51] [58]

Kang, S., Kim, J., Kim, J., and Hwang, S. J. See what you are told: Visual attention sink in large multimodal models. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025

[52] [59]

Kim, J., Kim, H., Kim, Y., and Ro, Y. M. CODE: contrasting self-generated description to combat hallucination in large multi-modal models. In Advances in Neural Information Processing Systems, volume 37, pp.\ 133571--133599, 2024

work page 2024

[53] [60]

Toward robust hyper-detailed image captioning: A multiagent approach and dual evaluation metrics for factuality and coverage

Lee, S., Yoon, S., Bui, T., Shi, J., and Yoon, S. Toward robust hyper-detailed image captioning: A multiagent approach and dual evaluation metrics for factuality and coverage. In Proceedings of the 42nd International Conference on Machine Learning, volume 267, pp.\ 33815--33832, 2025

work page 2025

[54] [61]

Mitigating object hallucinations in large vision-language models through visual contrastive decoding

Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., and Bing, L. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pp.\ 13872--13882, 2024

work page 2024

[55] [62]

Evaluating object hallucination in large vision-language models

Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., and Wen, J.-R. Evaluating object hallucination in large vision-language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.\ 292--305, 2023

work page 2023

[56] [63]

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll \'a r, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pp.\ 740--755, 2014

work page 2014

[57] [64]

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. In Advances in Neural Information Processing Systems, volume 36, pp.\ 34892--34916, 2023

work page 2023

[58] [65]

Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 26286--26296, 2024 a

work page 2024

[59] [66]

Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. Llava-next: Improved reasoning, ocr, and world knowledge, 2024 b . URL https://llava-vl.github.io/blog/2024-01-30-llava-next/

work page 2024

[60] [67]

A Survey on Hallucination in Large Vision-Language Models

Liu, H., Xue, W., Chen, Y., Chen, D., Zhao, X., Wang, K., Hou, L., Li, R., and Peng, W. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024 c

work page internal anchor Pith review Pith/arXiv arXiv 2024

[61] [68]

Grounding multimodal large language models to the world

Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Ye, Q., and Wei, F. Grounding multimodal large language models to the world. In The Twelfth International Conference on Learning Representations, 2024

work page 2024

[62] [69]

M., Kachinthaya, A., Zou, H., Canny, J., Gonzalez, J

Petryk, S., Chan, D. M., Kachinthaya, A., Zou, H., Canny, J., Gonzalez, J. E., and Darrell, T. ALOH a: A new measure for hallucination in captioning models. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.\ 342--357, 2024

work page 2024

[63] [70]

A., Burns, K., Darrell, T., and Saenko, K

Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T., and Saenko, K. Object hallucination in image captioning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.\ 4035--4045, 2018

work page 2018

[64] [71]

Aligning large multimodal models with factually augmented RLHF

Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L., Wang, Y.-X., Yang, Y., Keutzer, K., and Darrell, T. Aligning large multimodal models with factually augmented RLHF . In Findings of the Association for Computational Linguistics: ACL, pp.\ 13088--13110, 2024

work page 2024

[65] [72]

Seeing far and clearly: Mitigating hallucinations in mllms with attention causal decoding

Tang, F., Liu, C., Xu, Z., Hu, M., Huang, Z., Xue, H., Chen, Z., Peng, Z., Yang, Z., Zhou, S., Li, W., Li, Y., Song, W., Su, S., Feng, W., Su, J., Lin, M., Peng, Y., Cheng, X., Razzak, I., and Ge, Z. Seeing far and clearly: Mitigating hallucinations in mllms with attention causal decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and P...

work page 2025

[66] [73]

Q., Stepputtis, S., Morency, L.-P., Ramanan, D., Sycara, K., and Xie, Y

Wan, Z., Zhang, C., Yong, S., Ma, M. Q., Stepputtis, S., Morency, L.-P., Ramanan, D., Sycara, K., and Xie, Y. Only: One-layer intervention sufficiently mitigates hallucinations in large vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 3225--3234, 2025

work page 2025

[67] [74]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., and Lin, J. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[68] [75]

Caption anything: Interactive image description with diverse multimodal controls

Wang, T., Zhang, J., Fei, J., Ge, Y., Zheng, H., Tang, Y., Li, Z., Gao, M., Zhao, S., Shan, Y., and Zheng, F. Caption anything: Interactive image description with diverse multimodal controls. arXiv preprint arXiv:2305.02677, 2023

work page arXiv 2023

[69] [76]

ESMC: mllm-based embedding selection for explainable multiple clustering

Wang, X., Jia, Y., Liu, H., and Hou, J. ESMC: mllm-based embedding selection for explainable multiple clustering. In Proceedings of the Fortieth AAAI Conference on Artificial Intelligence , pp.\ 26588--26596, 2026

work page 2026

[70] [77]

Clearsight: Visual signal enhancement for object hallucination mitigation in multimodal large language models

Yin, H., Si, G., and Wang, Z. Clearsight: Visual signal enhancement for object hallucination mitigation in multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14625–14634, 2025

[71] [78]

Woodpecker: Hallucination correction for multimodal large language models

Yin, S., Fu, C., Zhao, S., Xu, T., Wang, H., Sui, D., Shen, Y., Li, K., Sun, X., and Chen, E. Woodpecker: Hallucination correction for multimodal large language models. Science China Information Sciences, 67, 2024

[72] [79]

RLHF-V: Towards trustworthy MLLMs via behavior alignment from fine-grained correctional human feedback

Yu, T., Yao, Y., Zhang, H., He, T., Han, Y., Cui, G., Hu, J., Liu, Z., Zheng, H.-T., Sun, M., and Chua, T.-S. RLHF-V: Towards trustworthy MLLMs via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13807–13816, 2024

[73] [80]

Tell your model where to attend: Post-hoc attention steering for LLMs

Zhang, Q., Singh, C., Liu, L., Liu, X., Yu, B., Gao, J., and Zhao, T. Tell your model where to attend: Post-hoc attention steering for LLMs. In The Twelfth International Conference on Learning Representations, 2024

[74] [81]

VLDrive: Vision-augmented lightweight MLLMs for efficient language-grounded autonomous driving

Zhang, R., Zhang, W., Tan, X., Yang, S., Wan, X., Luo, X., and Li, G. VLDrive: Vision-augmented lightweight MLLMs for efficient language-grounded autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5923–5933, 2025a

[75] [82]

Shallow focus, deep fixes: Enhancing shallow layers vision attention sinks to alleviate hallucination in LVLMs

Zhang, X., Quan, Y., Shen, C., Gu, C., Yuan, X., Yan, S., Cao, J., Cheng, H., Wu, K., and Ye, J. Shallow focus, deep fixes: Enhancing shallow layers vision attention sinks to alleviate hallucination in LVLMs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 3512–3534, 2025b

[76] [83]

Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization

Zhao, Z., Wang, B., Ouyang, L., Dong, X., Wang, J., and He, C. Beyond hallucinations: Enhancing LVLMs through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839, 2023

[77] [84]

MiniGPT-4: Enhancing vision-language understanding with advanced large language models

Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In The Twelfth International Conference on Learning Representations, 2024

[78] [85]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., Gao, Z., Cui, E., Wang, X., Cao, Y., Liu, Y., Wei, X., Zhang, H., Wang, H., Xu, W., Li, H., Wang, J., Deng, N., Li, S., He, Y., Jiang, T., Luo, J., Wang, Y., He, C., Shi, B., Zhang, X., Shao, W., He, J., Xiong, Y., Qu, W., Sun, P., Jiao, P., Lv, H., Wu, L., Zhang, ...
