Focusing Where Vision Matters: Selective Training for Large Vision Language Models via Visual Information Gain
Pith reviewed 2026-05-22 11:13 UTC · model grok-4.3
The pith
A perplexity-based metric identifies which training samples and tokens actually need visual input, allowing selective training that improves grounding while using far less data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors measure Visual Information Gain at both sample and token levels by comparing perplexity with and without the image. High-VIG items are those whose accurate prediction requires the visual evidence. A selective training scheme then retains only these items, yielding models that ground answers in images rather than defaulting to textual patterns and that match or exceed full-data performance with markedly less supervision.
What carries the argument
Visual Information Gain (VIG), the reduction in perplexity of next-token predictions when visual input is added, used to rank and retain only informative samples and tokens during training.
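A minimal sketch of how such a metric could be computed, assuming the model's per-token natural-log probabilities are already available for the same target text with and without the image. The function names, the example numbers, and the choice to take a plain difference of per-token perplexities at the token level are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence


def perplexity(logprobs: Sequence[float]) -> float:
    """Perplexity of a token sequence from natural-log token probabilities."""
    if not logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(logprobs) / len(logprobs))


def sample_vig(lp_text_only: Sequence[float], lp_with_image: Sequence[float]) -> float:
    """Sample-level gain: perplexity without the image minus perplexity with it."""
    return perplexity(lp_text_only) - perplexity(lp_with_image)


def token_vig(lp_text_only: Sequence[float], lp_with_image: Sequence[float]) -> List[float]:
    """Token-level gain (assumed form): per-token perplexity drop, exp(-logp) each."""
    if len(lp_text_only) != len(lp_with_image):
        raise ValueError("both passes must score the same target tokens")
    return [math.exp(-a) - math.exp(-b) for a, b in zip(lp_text_only, lp_with_image)]


# Hypothetical scores for the caption tokens "a", "red", "car".
text_only = [math.log(0.5), math.log(0.1), math.log(0.4)]
with_image = [math.log(0.5), math.log(0.8), math.log(0.5)]

gains = token_vig(text_only, with_image)
most_visual = max(range(len(gains)), key=gains.__getitem__)  # index of "red"
```

In this toy case the color word carries almost all of the gain, which matches the abstract's description of what the metric is meant to highlight.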
If this is right
- Models exhibit stronger grounding on tasks that require image details.
- Answers show reduced tendency to ignore the image and answer from language alone.
- Equivalent or higher benchmark scores are achieved with far fewer training examples.
- Fine-grained token-level analysis reveals exactly which words in each caption depend on vision.
Where Pith is reading between the lines
- VIG filtering could be reused at inference time to down-weight tokens the model can already predict without the image.
- The same uncertainty-reduction idea might help curate or re-weight datasets for other multimodal tasks such as video or audio-language models.
- Repeated measurement of VIG during continued training could dynamically adjust which new data to retain.
Load-bearing premise
That focusing training exclusively on high-VIG samples and tokens will strengthen visual grounding and cut language bias without removing signals the model still needs for overall capability.
What would settle it
Train identical model architectures once on the full dataset and once on only the top-VIG subset, then measure accuracy on visual grounding benchmarks that test color, attribute, and spatial reasoning; if the selective model scores lower, the claim does not hold.
Original abstract
Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, they typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.
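One way to picture the selective scheme described in the abstract, as a hedged sketch: rank samples by VIG, keep a fraction of them, and within each kept sample compute the loss only on tokens whose VIG clears a threshold. The names `keep_fraction` and `threshold` are hypothetical hyperparameters introduced here for illustration; this page does not state the paper's actual selection rule.

```python
import math
from typing import List, Sequence


def select_top_samples(sample_vigs: Sequence[float], keep_fraction: float) -> List[int]:
    """Indices of the highest-VIG samples, returned in original dataset order."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, math.ceil(len(sample_vigs) * keep_fraction))
    ranked = sorted(range(len(sample_vigs)), key=lambda i: sample_vigs[i], reverse=True)
    return sorted(ranked[:n_keep])


def token_loss_mask(token_vigs: Sequence[float], threshold: float) -> List[int]:
    """1 for tokens that contribute to the loss, 0 for tokens that are skipped."""
    return [1 if v > threshold else 0 for v in token_vigs]


def masked_nll(logprobs: Sequence[float], mask: Sequence[int]) -> float:
    """Mean negative log-likelihood over the selected tokens only."""
    if len(logprobs) != len(mask):
        raise ValueError("logprobs and mask must have the same length")
    kept = [-lp for lp, m in zip(logprobs, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0
```

In a real training loop the same mask would typically be applied to the per-token cross-entropy before averaging; the pure-Python version only shows the bookkeeping.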
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Visual Information Gain (VIG), a perplexity-based metric quantifying the reduction in prediction uncertainty when visual input is added to Large Vision Language Models. It proposes a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens to improve visual grounding, mitigate language bias, and achieve better performance with reduced supervision.
Significance. If the experimental claims hold, the work supplies a concrete, quantitative tool for identifying visually informative training data at sample and token granularity. This could enable more efficient multimodal training and directly address language bias without architectural changes or post-hoc decoding fixes. The metric itself is parameter-free and derived from standard perplexity, which is a strength.
major comments (2)
- [Abstract and §3] Abstract and §3: the central claim that VIG-guided selection 'improves visual grounding and mitigates language bias' while delivering 'superior performance with significantly reduced supervision' is load-bearing, yet the abstract supplies no quantitative results, baselines, or ablations. Without these, the performance advantage cannot be evaluated against random selection, difficulty-matched selection, or full-data training.
- [§4] §4 (VIG definition): VIG is defined as the perplexity drop when the image is provided. This quantity can be driven by language-only factors (rare phrasing, dataset artifacts) rather than causal visual utility. The manuscript must demonstrate that high-VIG items are not simply harder language-only examples; a difficulty-matched or random-perplexity baseline is required to rule out this confound.
minor comments (2)
- [§3] Notation for perplexity and VIG should be introduced with explicit equations (e.g., VIG(s) = PPL(text|s) - PPL(text|image,s)) rather than described only in prose.
- [Figures] Figure captions and axis labels need to state the exact evaluation metrics (e.g., VQA accuracy, POPE hallucination rate) and the number of runs for error bars.
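The explicit notation requested in the first minor comment could look like the following. This is a sketch under assumed symbols: a sample $s = (v, x, y)$ with image $v$, text prompt $x$, and target tokens $y_1, \dots, y_T$; $c$ stands for whatever context the model is conditioned on. The paper's own notation and its exact token-level form may differ.

```latex
\[
\mathrm{PPL}(y \mid c) = \exp\!\Big(-\frac{1}{T}\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, c)\Big)
\]
\[
\mathrm{VIG}(s) = \mathrm{PPL}(y \mid x) - \mathrm{PPL}(y \mid v, x)
\]
```

Under this reading, $\mathrm{VIG}(s) > 0$ means the image lowers the model's uncertainty about the target text, and $\mathrm{VIG}(s) \approx 0$ means the text is about as predictable without the image.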
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to the manuscript.
- Referee: [Abstract and §3] Abstract and §3: the central claim that VIG-guided selection 'improves visual grounding and mitigates language bias' while delivering 'superior performance with significantly reduced supervision' is load-bearing, yet the abstract supplies no quantitative results, baselines, or ablations. Without these, the performance advantage cannot be evaluated against random selection, difficulty-matched selection, or full-data training.
  Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will update the abstract to report specific performance gains (e.g., accuracy improvements on grounding benchmarks) obtained with substantially reduced supervision relative to both full-data training and random selection. The experimental section already contains the requested comparisons and ablations; we will add explicit forward references from §3 to these results so readers can directly evaluate the advantage of VIG-guided selection.
  revision: yes
- Referee: [§4] §4 (VIG definition): VIG is defined as the perplexity drop when the image is provided. This quantity can be driven by language-only factors (rare phrasing, dataset artifacts) rather than causal visual utility. The manuscript must demonstrate that high-VIG items are not simply harder language-only examples; a difficulty-matched or random-perplexity baseline is required to rule out this confound.
  Authors: This is a valid methodological concern. Although VIG is explicitly the difference in perplexity between the vision-language and language-only settings, high-VIG samples could still correlate with linguistic difficulty. To address the potential confound we will add a new baseline experiment that selects samples according to language-only perplexity (difficulty-matched) and compare the resulting model performance against the VIG-guided scheme. The revised manuscript will report these results alongside the existing random and full-data baselines.
  revision: yes
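The confound raised here can be probed before any training run. The sketch below is an illustrative assumption, not the authors' promised experiment: it picks the same budget of samples once by VIG and once by language-only perplexity, then reports how much the two selections overlap. The helper names and the use of Jaccard overlap are choices made for this example.

```python
from typing import Sequence, Set


def top_k(scores: Sequence[float], k: int) -> Set[int]:
    """Indices of the k highest-scoring samples."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def selection_overlap(vig: Sequence[float], text_only_ppl: Sequence[float], k: int) -> float:
    """Jaccard overlap between the VIG selection and the language-difficulty selection."""
    if len(vig) != len(text_only_ppl):
        raise ValueError("both score lists must cover the same samples")
    by_vig = top_k(vig, k)
    by_difficulty = top_k(text_only_ppl, k)
    union = by_vig | by_difficulty
    return len(by_vig & by_difficulty) / len(union) if union else 0.0
```

An overlap near 1 would suggest VIG is mostly tracking linguistic difficulty; an overlap near 0 would suggest the two criteria pick different data, although downstream accuracy with the difficulty-selected subset would still be needed to settle the referee's point.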
Circularity Check
No circularity: VIG metric and selective scheme are independently defined
The paper defines Visual Information Gain (VIG) directly as a perplexity-based reduction in prediction uncertainty when visual input is added, using a standard information-theoretic quantity with no fitted parameters, self-referential equations, or ansatz smuggled via citation. The VIG-guided selective training scheme is then proposed as a downstream application that prioritizes high-VIG samples and tokens; this follows from the metric without any reduction of the central claim to a fit or to a self-citation chain. No load-bearing uniqueness theorem, renaming of known results, or self-definitional loop appears in the derivation. The approach remains self-contained against external benchmarks such as standard perplexity and selective training heuristics.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Perplexity is a valid proxy for prediction uncertainty in autoregressive language models
invented entities (1)
- Visual Information Gain (VIG): no independent evidence