Focusing Where Vision Matters: Selective Training for Large Vision Language Models via Visual Information Gain

Sangheum Hwang; Seulbi Lee

arxiv: 2602.17186 · v2 · pith:2N3RDPOOnew · submitted 2026-02-19 · 💻 cs.CV

Pith reviewed 2026-05-22 11:13 UTC · model grok-4.3

classification 💻 cs.CV

keywords: visual information gain, selective training, large vision-language models, language bias, visual grounding, perplexity metric, data efficiency

The pith

A perplexity-based metric identifies which training samples and tokens actually need visual input, allowing selective training that improves grounding while using far less data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines Visual Information Gain as the drop in a model's prediction uncertainty when an image is provided. This metric flags the specific samples and tokens where vision supplies new information, such as colors, spatial relations, or object attributes. Training only on high-gain elements reduces reliance on language shortcuts and produces stronger visual grounding. The same performance is reached with substantially smaller training sets because low-gain data, which the model can predict from text alone, is skipped.

Core claim

The authors measure Visual Information Gain at both sample and token levels by comparing perplexity with and without the image. High-VIG items are those whose accurate prediction requires the visual evidence. A selective training scheme then retains only these items, yielding models that ground answers in images rather than defaulting to textual patterns and that match or exceed full-data performance with markedly less supervision.

What carries the argument

Visual Information Gain (VIG), the reduction in perplexity of next-token predictions when visual input is added, used to rank and retain only informative samples and tokens during training.
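As a rough illustration of the quantity described above, the sketch below computes a sample-level and a token-level gain from per-token log-probabilities obtained with and without the image. The function names, the subtraction form, and the example log-probabilities are assumptions made for illustration; the paper's exact formula may differ (for example, it could use a ratio or a log-scale difference).

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over the answer tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def sample_vig(logprobs_text_only, logprobs_with_image):
    """Assumed sample-level VIG: perplexity without the image minus perplexity with it."""
    return perplexity(logprobs_text_only) - perplexity(logprobs_with_image)

def token_vig(logprobs_text_only, logprobs_with_image):
    """Assumed token-level VIG: per-token loss without the image minus loss with it."""
    return [(-lo) - (-lw) for lo, lw in zip(logprobs_text_only, logprobs_with_image)]

# Hypothetical log-probabilities for the answer tokens "the", "red", "car".
text_only = [-0.2, -3.0, -0.5]
with_image = [-0.2, -0.4, -0.4]

print(sample_vig(text_only, with_image))  # positive: the image lowers perplexity
print(token_vig(text_only, with_image))   # largest gain on the colour word
```

In this toy case the colour token carries almost all of the gain, which matches the kind of token the page says the metric is meant to flag.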

If this is right

Models exhibit stronger grounding on tasks that require image details.
Answers show reduced tendency to ignore the image and answer from language alone.
Equivalent or higher benchmark scores are achieved with far fewer training examples.
Fine-grained token-level analysis reveals exactly which words in each caption depend on vision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

VIG filtering could be reused at inference time to down-weight tokens the model can already predict without the image.
The same uncertainty-reduction idea might help curate or re-weight datasets for other multimodal tasks such as video or audio-language models.
Repeated measurement of VIG during continued training could dynamically adjust which new data to retain.

Load-bearing premise

That focusing training exclusively on high-VIG samples and tokens will strengthen visual grounding and cut language bias without removing signals the model still needs for overall capability.

What would settle it

Train identical model architectures once on the full dataset and once on only the top-VIG subset, then measure accuracy on visual grounding benchmarks that test color, attribute, and spatial reasoning; if the selective model scores lower, the claim does not hold.

Figures

Figures reproduced from arXiv: 2602.17186 by Sangheum Hwang, Seulbi Lee.

**Figure 1.** Examples of LLaVA-1.5 instruction tuning data. The dataset includes both samples and tokens with very different levels of visual dependency: some questions can be answered without looking at the image, whereas others need fine-grained visual details (highlighted in green). In practice, multimodal instruction-tuning datasets contain a heterogeneous mixture of examples: some can be answered from common sense…

**Figure 2.** VIG distribution across benchmarks. Blue benchmarks (COCO, POPE) show stronger multimodal interaction, while red benchmarks (GQA, SQA) exhibit weaker visual dependency.

**Figure 3.** Visualizing the token-level VIGs. Each point shows a token’s prediction loss with (x-axis) and without (y-axis) visual input. The color encodes the token-level loss difference (y − x). Trailing text from §3.4 (VIG-Guided Selective Training): To demonstrate the practical utility of VIG, we adopt the principle of selective modeling, recently shown to be effective for LLMs [52]. For the i-th training sample (I_i, Q_i, A_i) with answ…
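One way to picture token-level selective training is a masked loss that averages only over the tokens with the largest loss difference (y − x in the figure above). The sketch below is an assumption-laden illustration: the `keep_ratio` parameter, the top-k rule, and the example losses are hypothetical, and the paper's actual selection rule is cut off in the text available here.

```python
import math

def selective_loss(losses_with_image, losses_text_only, keep_ratio=0.5):
    """Average the with-image loss over the top `keep_ratio` tokens by loss difference."""
    # Token-level gain: loss without the image minus loss with the image.
    gains = [lo - lw for lw, lo in zip(losses_with_image, losses_text_only)]
    k = max(1, math.ceil(keep_ratio * len(gains)))
    keep = set(sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)[:k])
    mask = [1 if i in keep else 0 for i in range(len(gains))]
    loss = sum(m * l for m, l in zip(mask, losses_with_image)) / k
    return loss, mask

# Hypothetical per-token losses for a four-token answer.
with_image = [0.1, 0.9, 0.2, 1.5]
text_only = [0.1, 3.0, 0.3, 2.5]

loss, mask = selective_loss(with_image, text_only, keep_ratio=0.5)
print(mask, loss)
```

Tokens whose loss barely changes when the image is removed get a zero in the mask and so contribute no gradient in this sketch.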

**Figure 4.** Attention fraction allocated to visual tokens. Compared to LLaVA-1.5 7B, VIG training assigns significantly more attention to visual tokens across all layers. Plot values extracted alongside this caption (Base, Corruption, Norm; accuracy %): 77.9, 32.1, 41.2 and 78.2, 42.3, 54.1; legend: LLaVA-1.5 7B, VIG training.
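The attention fraction named in this caption can be pictured as the share of a query's attention mass that falls on image-token positions. The sketch below assumes one already-averaged attention row per layer and a known set of visual positions; how the paper aggregates over heads, queries, and samples is not stated on this page, so those choices are hypothetical.

```python
def visual_attention_fraction(attn_row, visual_positions):
    """Share of one attention row's mass that lands on visual-token positions."""
    total = sum(attn_row)
    return sum(attn_row[i] for i in visual_positions) / total

def per_layer_fraction(attn_by_layer, visual_positions):
    """Apply the fraction to each layer's (already averaged) attention row."""
    return [visual_attention_fraction(row, visual_positions) for row in attn_by_layer]

# Hypothetical attention rows over 4 key positions; positions 1 and 2 are image tokens.
layers = [
    [0.1, 0.3, 0.2, 0.4],
    [0.25, 0.25, 0.25, 0.25],
]
print(per_layer_fraction(layers, visual_positions=[1, 2]))
```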

**Figure 5.** Evaluation of text reliance under textual corruption. Base: accuracy on clean inputs. Corruption: accuracy when the same image is paired with a corrupted caption containing a conflicting description. Norm: corruption accuracy normalized by the corresponding Base (Corruption/Base). Trailing text from §4.5 (Ablation Study), Effectiveness of VIG-based selection: To validate the effectiveness of our VIG-guided selection strategy, we…
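The Norm column defined in this caption is simple arithmetic, and it can be checked against the bar values that appear in the extracted plot text on this page. The sketch below assumes the first value triple belongs to LLaVA-1.5 7B and the second to VIG training, following the legend order; that assignment is an inference from the extraction, not something the page states.

```python
def normalized_corruption(base_acc, corruption_acc):
    """Norm = Corruption / Base, expressed as a percentage."""
    return 100.0 * corruption_acc / base_acc

# (Base, Corruption) accuracies in percent, as extracted from the plot text.
results = {
    "LLaVA-1.5 7B": (77.9, 32.1),
    "VIG training": (78.2, 42.3),
}

for name, (base, corruption) in results.items():
    print(name, round(normalized_corruption(base, corruption), 1))
```

Under that assignment the computed ratios round to 41.2 and 54.1, which agrees with the third value in each extracted triple.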

**Figure 6.** Ablation study of selection ratio p% on LLaVA-1.5 7B. We report a single metric per benchmark: LLaVA^W score, MMBench score, CS for CHAIR, and Hall for MMHal. p = 100 corresponds to the vanilla model trained on the full instruction-tuning dataset (no VIG-based selection). All scores are normalized with respect to the p = 100 setting. Trailing text from the ablation discussion: … quantity. In contrast, on MMBench, aggressive filtering (p = 30, 50) resul…
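The selection ratio and the normalization described in this caption can be sketched in a few lines. The helper names, the tie-breaking behaviour, and all numbers below are hypothetical; only the idea of keeping the top p% by VIG and dividing each score by the p = 100 score comes from the caption.

```python
import math

def select_top_p(vig_scores, p):
    """Return indices of the top p% of samples by VIG, highest first."""
    k = max(1, math.ceil(len(vig_scores) * p / 100))
    order = sorted(range(len(vig_scores)), key=lambda i: vig_scores[i], reverse=True)
    return order[:k]

def normalize_to_full(scores_by_p):
    """Divide each score by the score obtained at p = 100 (full data)."""
    full = scores_by_p[100]
    return {p: s / full for p, s in scores_by_p.items()}

# Hypothetical per-sample VIG values and benchmark scores.
vig = [0.1, 2.0, -0.3, 1.2, 0.7]
print(select_top_p(vig, 40))
print(normalize_to_full({100: 60.0, 70: 63.0}))
```

A normalized value above 1 only means the raw number went up; whether that is an improvement depends on whether the metric is higher-is-better, which has to be checked per benchmark.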

Original abstract

Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, they typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Visual Information Gain (VIG), a perplexity-based metric quantifying the reduction in prediction uncertainty when visual input is added to Large Vision Language Models. It proposes a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens to improve visual grounding, mitigate language bias, and achieve better performance with reduced supervision.

Significance. If the experimental claims hold, the work supplies a concrete, quantitative tool for identifying visually informative training data at sample and token granularity. This could enable more efficient multimodal training and directly address language bias without architectural changes or post-hoc decoding fixes. The metric itself is parameter-free and derived from standard perplexity, which is a strength.

major comments (2)

[Abstract and §3] The central claim that VIG-guided selection 'improves visual grounding and mitigates language bias' while delivering 'superior performance with significantly reduced supervision' is load-bearing, yet the abstract supplies no quantitative results, baselines, or ablations. Without these, the performance advantage cannot be evaluated against random selection, difficulty-matched selection, or full-data training.
[§4, VIG definition] VIG is defined as the perplexity drop when the image is provided. This quantity can be driven by language-only factors (rare phrasing, dataset artifacts) rather than causal visual utility. The manuscript must demonstrate that high-VIG items are not simply harder language-only examples; a difficulty-matched or random-perplexity baseline is required to rule out this confound.

minor comments (2)

[§3] Notation for perplexity and VIG should be introduced with explicit equations (e.g., VIG(s) = PPL(text|s) - PPL(text|image,s)) rather than described only in prose.
[Figures] Figure captions and axis labels need to state the exact evaluation metrics (e.g., VQA accuracy, POPE hallucination rate) and the number of runs for error bars.
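
The notation the referee asks for in the first minor comment could be written out roughly as follows. This is only a sketch: the symbols I, Q, and A for image, question, and answer follow the sample notation quoted in the Figure 3 text, the answer length T and model parameters θ are assumed, and the subtraction form is the referee's own example rather than a confirmed definition from the paper.

```latex
\mathrm{PPL}(A \mid Q) = \exp\!\Big(-\frac{1}{T}\sum_{t=1}^{T}\log p_\theta(a_t \mid Q, a_{<t})\Big),
\qquad
\mathrm{PPL}(A \mid I, Q) = \exp\!\Big(-\frac{1}{T}\sum_{t=1}^{T}\log p_\theta(a_t \mid I, Q, a_{<t})\Big)

\mathrm{VIG}(I, Q, A) = \mathrm{PPL}(A \mid Q) - \mathrm{PPL}(A \mid I, Q)
```

With this sign convention a positive VIG means the image lowers perplexity, which matches the prose description of VIG as a reduction in uncertainty.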

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to the manuscript.

Referee: [Abstract and §3] The central claim that VIG-guided selection 'improves visual grounding and mitigates language bias' while delivering 'superior performance with significantly reduced supervision' is load-bearing, yet the abstract supplies no quantitative results, baselines, or ablations. Without these, the performance advantage cannot be evaluated against random selection, difficulty-matched selection, or full-data training.

Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will update the abstract to report specific performance gains (e.g., accuracy improvements on grounding benchmarks) obtained with substantially reduced supervision relative to both full-data training and random selection. The experimental section already contains the requested comparisons and ablations; we will add explicit forward references from §3 to these results so readers can directly evaluate the advantage of VIG-guided selection. revision: yes
Referee: [§4, VIG definition] VIG is defined as the perplexity drop when the image is provided. This quantity can be driven by language-only factors (rare phrasing, dataset artifacts) rather than causal visual utility. The manuscript must demonstrate that high-VIG items are not simply harder language-only examples; a difficulty-matched or random-perplexity baseline is required to rule out this confound.

Authors: This is a valid methodological concern. Although VIG is explicitly the difference in perplexity between the vision-language and language-only settings, high-VIG samples could still correlate with linguistic difficulty. To address the potential confound we will add a new baseline experiment that selects samples according to language-only perplexity (difficulty-matched) and compare the resulting model performance against the VIG-guided schedule. The revised manuscript will report these results alongside the existing random and full-data baselines. revision: yes
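
The difficulty-matched baseline discussed in this exchange can be illustrated by comparing which samples a VIG ranking and a language-only perplexity ranking would each keep. The sketch below is hypothetical throughout: the function names, the Jaccard overlap as a diagnostic, and the perplexity values are assumptions, and it is not the experiment the simulated authors describe.

```python
def top_k(scores, k):
    """Indices of the k highest scores, as a set."""
    return set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])

def selection_overlap(ppl_text_only, ppl_with_image, k):
    """Jaccard overlap between top-k by VIG and top-k by language-only perplexity."""
    vig = [t - w for t, w in zip(ppl_text_only, ppl_with_image)]
    by_vig = top_k(vig, k)
    by_difficulty = top_k(ppl_text_only, k)
    return len(by_vig & by_difficulty) / len(by_vig | by_difficulty)

# Hypothetical per-sample perplexities without and with the image.
ppl_text = [12.0, 30.0, 8.0, 25.0]
ppl_image = [4.0, 10.0, 7.8, 24.0]

print(selection_overlap(ppl_text, ppl_image, k=2))
```

An overlap close to 1 would suggest the two rankings pick nearly the same samples, which is the confound the referee raises; a low overlap would suggest VIG is selecting on something other than text-only difficulty.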

Circularity Check

0 steps flagged

No circularity: VIG metric and selective scheme are independently defined

The paper defines Visual Information Gain (VIG) directly as a perplexity-based reduction in prediction uncertainty when visual input is added, using a standard information-theoretic quantity with no fitted parameters, self-referential equations, or ansatz smuggled via citation. The VIG-guided selective training scheme is then proposed as a downstream application that prioritizes high-VIG samples and tokens; this follows from the metric without any reduction of the central claim to a fit or to a self-citation chain. No load-bearing uniqueness theorem, renaming of known results, or self-definitional loop appears in the derivation. The approach remains self-contained against external benchmarks such as standard perplexity and selective training heuristics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that perplexity reduction accurately captures visual grounding benefit and that selective training on high-VIG items will causally mitigate language bias. No free parameters are described. The main invented element is the VIG metric itself.

axioms (1)

[standard math] Perplexity is a valid proxy for prediction uncertainty in autoregressive language models.
Invoked when defining VIG as the reduction in perplexity provided by visual input.

invented entities (1)

Visual Information Gain (VIG) [no independent evidence]
purpose: Quantify the benefit of visual input for individual samples and tokens.
Newly defined metric introduced to enable the selective training scheme.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages

[1]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, 2023

[2]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

[3]

Llava-next: Stronger llms supercharge multimodal capabilities in the wild, 2024

Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava-next: Stronger llms supercharge multimodal capabilities in the wild, 2024

[4]

Mini-internvl: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance

Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai, and Wenhai Wang. Mini-internvl: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance. Visual Intelligence, 2024

[5]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites, 2024

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu...

[6]

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling, 2025

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yiming Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Ji...

[7]

Sharegpt4v: Improving large multi-modal models with better captions

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. In European Conference on Computer Vision, 2025

[8]

Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023

[9]

Minigpt-v2: large language model as a unified interface for vision-language multi-task learning, 2023

Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning, 2023

[10]

mplug-owl: Modularization empowers large language models with multimodality, 2024

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl: Modularization empowers large language models with multimodality, 2024

[11]

Kosmos-2: Grounding multimodal large language models to the world, 2023

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world, 2023

[12]

InstructBLIP: Towards general-purpose vision-language models with instruction tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Advances in Neural Information Processing Systems, 2023

[13]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikoł aj Bi´nk...

[14]

Llava-onevision: Easy visual task transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. Transactions on Machine Learning Research, 2025

[15]

Deepseek-vl: Towards real-world vision-language understanding, 2024

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. Deepseek-vl: Towards real-world vision-language understanding, 2024. 11 APREPRINT- FEBRUARY20, 2026

[16]

Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, 2024

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, 2024

[17]

Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding, 2024

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, and Chong Ruan. Deepseek-vl2: Mixture-of-experts visio...

[18]

Qwen2.5-vl technical report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. T...

[19]

Llama: Open and efficient foundation language models, 2023

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023

[20]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023

[21]

Qwen technical report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...

[22]

Yi: Open foundation models by 01.ai, 2024

Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong D...

[23]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, 2021

[24]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023

[25]

Mitigating object hallucinations in large vision-language models through visual contrastive decoding

Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

[26]

Looking beyond text: Reducing language bias in large vision-language models via multimodal dual-attention and soft-image guidance

Haozhe Zhao, Shuzheng Si, Liang Chen, Yichi Zhang, Maosong Sun, Baobao Chang, and Minjia Zhang. Looking beyond text: Reducing language bias in large vision-language models via multimodal dual-attention and soft-image guidance. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

[27]

Paying more attention to image: A training-free method for alleviating hallucination in lvlms

Shi Liu, Kecheng Zheng, and Wei Chen. Paying more attention to image: A training-free method for alleviating hallucination in lvlms. In European Conference on Computer Vision, 2025

[28]

MMICL: Empowering vision-language model with multi-modal in-context learning

Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, and Baobao Chang. MMICL: Empowering vision-language model with multi-modal in-context learning. In The Twelfth International Conference on Learning Representations, 2024

[29]

Unveiling the ignorance of mllms: Seeing clearly, answering incorrectly

Yexin Liu, Zhengyang Liang, Yueze Wang, Xianfeng Wu, Feilong Tang, Muyang He, Jian Li, Zheng Liu, Harry Yang, Sernam Lim, and Bo Zhao. Unveiling the ignorance of mllms: Seeing clearly, answering incorrectly. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

[30]

Contrastive region guidance: Improving grounding in vision-language models without training

David Wan, Jaemin Cho, Elias Stengel-Eskin, and Mohit Bansal. Contrastive region guidance: Improving grounding in vision-language models without training. In European Conference on Computer Vision, 2025

[31]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017

[32]

Object hallucination in image captioning

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018

[33]

Mitigating hallucination in large multi-modal models via robust instruction tuning

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning. In The International Conference on Learning Representations, 2024

[34]

Evaluating object hallucination in large vision-language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2023

[35]

Detecting and preventing hallucinations in large vision language models

Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. Proceedings of the AAAI Conference on Artificial Intelligence, 2024

[36]

Debiasing multimodal large language models via penalization of language priors

YiFan Zhang, Yang Shi, Weichen Yu, Qingsong Wen, Xue Wang, Wenjing Yang, Zhang Zhang, Liang Wang, and Rong Jin. Debiasing multimodal large language models via penalization of language priors. In Proceedings of the ACM International Conference on Multimedia, 2025

[37]

Devils in middle layers of large vision-language models: Interpreting, detecting and mitigating object hallucinations via attention lens

Zhangqi Jiang, Junkai Chen, Beier Zhu, Tingjin Luo, Yankun Shen, and Xu Yang. Devils in middle layers of large vision-language models: Interpreting, detecting and mitigating object hallucinations via attention lens. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

[38]

Less is more: Mitigating multimodal hallucination from an eos decision perspective

Zihao Yue, Liang Zhang, and Qin Jin. Less is more: Mitigating multimodal hallucination from an eos decision perspective. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2024

[39]

Mm1: Methods, analysis and insights from multimodal llm pre-training

Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Al...

[40]

Gemini: A family of highly capable multimodal models

Team Gemini. Gemini: A family of highly capable multimodal models. Technical report, Gemini Team Google, 2025

[41]

Gpt-4 technical report

OpenAI. Gpt-4 technical report. Technical report, OpenAI, 2024

[42]

Counterfactual vqa: A cause-effect look at language bias

Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. Counterfactual vqa: A cause-effect look at language bias. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

[43]

Vision language models are biased, 2025

An Vo, Khai-Nguyen Nguyen, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, and Daeyoung Kim. Vision language models are biased, 2025

[44]

What’s in the image? a deep-dive into the vision of vision language models

Omri Kaduri, Shai Bagon, and Tali Dekel. What’s in the image? a deep-dive into the vision of vision language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

[45]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, 2025

[46]

See what you are told: Visual attention sink in large multimodal models

Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. See what you are told: Visual attention sink in large multimodal models. In The International Conference on Learning Representations, 2025

[47]

Where do large vision-language models look at when answering questions?, 2025

Xiaoying Xing, Chia-Wen Kuo, Li Fuxin, Yulei Niu, Fan Chen, Ming Li, Ying Wu, Longyin Wen, and Sijie Zhu. Where do large vision-language models look at when answering questions?, 2025

[48]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014

[49]

Multi-modal data spectrum: Multi-modal datasets are multi-dimensional, 2025

Divyam Madaan, Varshan Muhunthan, Kyunghyun Cho, and Sumit Chopra. Multi-modal data spectrum: Multi-modal datasets are multi-dimensional, 2025

[50]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A. Hudson and Christopher D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019

[51]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In Advances in Neural Information Processing Systems, 2022

[52]

Not all tokens are what you need for pretraining

Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, and Weizhu Chen. Not all tokens are what you need for pretraining. In Advances in Neural Information Processing Systems, 2024

[53]

Mm-vet: Evaluating large multimodal models for integrated capabilities

Yu Weihao, Yang Zhengyuan, Li Linjie, Wang Jianfeng, Lin Kevin, Liu Zicheng, Wang Xinchao, and Wang Lijuan. Mm-vet: Evaluating large multimodal models for integrated capabilities. In International Conference on Machine Learning, 2024

[54]

Mmbench: Is your multi-modal model an all-around player?

Liu Yuan, Duan Haodong, Zhang Yuanhan, Li Bo, Zhang Songyang, Zhao Wangbo, Yuan Yike, Wang Jiaqi, He Conghui, Liu Ziwei, et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, 2024

[55]

Minesh Mathew, Dimosthenis Karatzas, and C.V. Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021.

[56]

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, and Trevor Darrell. Aligning large multimodal models with factually augmented RLHF. In Findings of the Association for Computational Linguistics, 2024.

[57]

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. Words or vision: Do vision-language models have blind faith in text? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.

[58]

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, 2022.

A Details of Benchmarks

Visual understanding task. To assess the model’s capabilities in general visual perceptio...
