When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias
Pith reviewed 2026-05-10 05:02 UTC · model grok-4.3
The pith
Vision-language models acting as judges often ignore the image and favor more informative answers even when they conflict with the visual content.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLM-as-a-Judge often pays limited attention to the image and blindly favors the more informative answer even when it conflicts with the image content. BIRCH corrects inconsistencies with the image in candidate answers first, then compares the answers against this corrected version to shift focus to image-grounded correctness. This reduces informativeness bias by up to 17% and improves performance by up to 9.8% across models and benchmarks.
What carries the argument
BIRCH (Balanced Informativeness and CoRrectness with a Truthful AnCHor), a judging paradigm that first corrects image inconsistencies in candidate answers and then compares them against the corrected version.
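The two-step flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `VLM` callable, both prompt wordings, the choice to build a single anchor from both candidates, and the `stub_vlm` stand-in are all hypothetical, since this page does not reproduce the paper's prompts or say how the anchor is constructed.

```python
from typing import Callable

# Hypothetical judge interface (an assumption): (image bytes, prompt) -> reply text.
VLM = Callable[[bytes, str], str]


def birch_judge(vlm: VLM, image: bytes, question: str,
                answer_a: str, answer_b: str) -> str:
    """Correct-then-compare judging. Returns "A" or "B"."""
    # Step 1: ask for an image-corrected version of the candidate answers.
    # The prompt wording is illustrative only.
    correction_prompt = (
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Using only what is visible in the image, write one corrected answer "
        "that fixes every statement above that conflicts with the image."
    )
    anchor = vlm(image, correction_prompt)

    # Step 2: compare both candidates against the corrected anchor.
    comparison_prompt = (
        f"Question: {question}\n"
        f"Reference (image-corrected) answer: {anchor}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Which answer agrees better with the reference? Reply with 'A' or 'B'."
    )
    verdict = vlm(image, comparison_prompt).strip().strip(".'\"").upper()
    if verdict not in {"A", "B"}:
        raise ValueError(f"Unparseable verdict: {verdict!r}")
    return verdict


def stub_vlm(image: bytes, prompt: str) -> str:
    """Canned replies for demonstration; a real model call would go here."""
    if "Reply with 'A' or 'B'" in prompt:
        return "B"
    return "A red car is parked next to a tree."


choice = birch_judge(stub_vlm, b"", "What is in the image?",
                     "A shiny blue sports car parked beside two tall trees.",
                     "A red car next to a tree.")
```

Keeping the correction and comparison as separate calls makes it possible to log the anchor and inspect it independently, which is the step the load-bearing premise depends on.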
If this is right
- Automatic evaluation of VLMs becomes more reliable by reducing preference for verbose but image-inconsistent answers.
- Judging performance on benchmarks rises by up to 9.8% when BIRCH is used instead of standard comparison.
- Informativeness bias drops by up to 17% across multiple VLM judges and evaluation datasets.
- VLM judge systems require explicit image-alignment correction steps for trustworthy results.
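One way to make the bias figure concrete is to count how often a judge picks the more informative answer in pairs where that answer conflicts with the image. The sketch below assumes this definition and a hypothetical `JudgedPair` record; the paper's exact metric is not given on this page and may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JudgedPair:
    more_informative: str   # "A" or "B": the more detailed answer
    image_consistent: str   # "A" or "B": the answer that matches the image
    judge_choice: str       # "A" or "B": what the judge picked


def informativeness_bias_rate(pairs: List[JudgedPair]) -> float:
    """Share of conflict pairs where the judge picked the more informative answer.

    A conflict pair is one where the more informative answer is NOT the
    image-consistent one (assumed definition, for illustration).
    """
    conflict = [p for p in pairs if p.more_informative != p.image_consistent]
    if not conflict:
        raise ValueError("No conflict pairs to evaluate.")
    biased = sum(p.judge_choice == p.more_informative for p in conflict)
    return biased / len(conflict)


pairs = [
    JudgedPair("A", "B", "A"),  # conflict, judge followed informativeness
    JudgedPair("A", "B", "B"),  # conflict, judge followed the image
    JudgedPair("B", "A", "B"),  # conflict, judge followed informativeness
    JudgedPair("A", "A", "A"),  # no conflict, excluded from the rate
]
rate = informativeness_bias_rate(pairs)  # 2 of 3 conflict pairs
```

Computing the same rate for a standard judge and for a BIRCH-style judge on identical pairs would give a before-and-after comparison.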
Where Pith is reading between the lines
- Similar informativeness biases could appear in text-only LLM judges when evaluating factual consistency.
- Training VLMs with stronger image-attention signals during evaluation tasks may prevent the bias at the source.
- BIRCH-style correction could be tested on other multimodal benchmarks to measure broader gains in judge reliability.
Load-bearing premise
The initial correction step in BIRCH can reliably identify and fix image inconsistencies without introducing new errors or biases.
What would settle it
A controlled test set of answers with known image conflicts, run through BIRCH's correction step to check whether it consistently fixes those conflicts without missing any or introducing new inaccuracies.
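A rough scoring harness for such a test might look like the following. It is a sketch under strong assumptions: the `ConflictCase` fields are hypothetical, and case-insensitive substring matching is only a crude proxy for whether a statement is present, so human annotation would likely still be needed for anything beyond a sanity check.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ConflictCase:
    original: str           # candidate answer containing a planted conflict
    corrected: str          # output of the correction step
    wrong_phrase: str       # planted statement that contradicts the image
    right_phrase: str       # what the image actually shows
    known_false: List[str]  # other statements known to be false for this image


def score_correction(cases: List[ConflictCase]) -> Dict[str, float]:
    """Fix, miss, and new-error rates using substring matching (crude proxy)."""
    if not cases:
        raise ValueError("No cases to score.")
    fixed = missed = introduced = 0
    for c in cases:
        text = c.corrected.lower()
        original = c.original.lower()
        # Fixed: the planted conflict is gone and the true statement appears.
        if c.wrong_phrase.lower() not in text and c.right_phrase.lower() in text:
            fixed += 1
        else:
            missed += 1
        # New error: a known-false statement appears that was not in the original.
        if any(f.lower() in text and f.lower() not in original
               for f in c.known_false):
            introduced += 1
    n = len(cases)
    return {"fix_rate": fixed / n,
            "miss_rate": missed / n,
            "new_error_rate": introduced / n}


cases = [
    ConflictCase("A blue car beside a tree.", "A red car beside a tree.",
                 "blue car", "red car", ["two dogs"]),
    ConflictCase("Three people at a table.",
                 "Three people at a table with two dogs.",
                 "three people", "two people", ["two dogs"]),
]
scores = score_correction(cases)
```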
Original abstract
The reliability of VLM-as-a-Judge is critical for the automatic evaluation of vision-language models (VLMs). Despite recent progress, our analysis reveals that VLM-as-a-Judge often pays limited attention to the image when making decisions. Instead, they often blindly favor the more informative answer, even when they can recognize it conflicts with the image content. We call this problem informativeness bias, which significantly undermines judge reliability. To address it, we propose BIRCH (Balanced Informativeness and CoRrectness with a Truthful AnCHor), a judging paradigm that first corrects inconsistencies with the image content in candidate answers, and then compares the answers against this corrected version. This shifts the judge's focus from informativeness to image-grounded correctness. Experiments on multiple models and benchmarks show that BIRCH reduces informativeness bias by up to 17%, resulting in performance gains of up to 9.8%. Our work reveals an overlooked but fundamental flaw in current VLM-as-a-Judge systems and highlights the need for more principled designs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that VLMs used as judges exhibit 'informativeness bias' by favoring more informative answers even when they conflict with image content (despite recognizing the conflict). It proposes BIRCH, a two-step paradigm that first corrects candidate answers for image inconsistencies to create a truthful anchor, then compares answers against this anchor to prioritize image-grounded correctness over informativeness. Experiments across multiple models and benchmarks report that BIRCH reduces the bias by up to 17% and yields performance gains of up to 9.8%.
Significance. If the results hold after validation, the work identifies a practically important limitation in VLM-as-a-Judge systems and offers a mitigation that could improve automated evaluation reliability in vision-language tasks. The empirical framing, new bias concept, and quantitative gains provide a concrete starting point for more robust judging protocols.
major comments (2)
- [BIRCH correction step] BIRCH correction step (described in the abstract and method): The central claim that BIRCH isolates image-grounded correctness depends on the initial correction reliably detecting and fixing image inconsistencies without introducing new errors or biases. No human-annotated validation, error analysis, or ablation on conflicting cases is provided. Because the correction likely uses the same class of VLM, it may inherit the limited image attention that produces informativeness bias, rendering the subsequent comparison invalid.
- [Experiments] Experimental results (abstract): The reported reductions (up to 17% bias, 9.8% performance) are load-bearing for the contribution, yet the abstract supplies no information on controls, baselines, statistical tests, or confounds. This prevents assessment of whether the gains are attributable to the bias reduction or to other factors.
minor comments (1)
- [Abstract] The abstract states quantitative gains on multiple models and benchmarks but omits details on experimental controls, baselines, statistical tests, or potential confounds, reducing clarity for readers.
Simulated Author's Rebuttal
We thank the referee for their constructive comments and positive assessment of the significance of our work. We address each major comment in detail below, providing clarifications and outlining revisions where appropriate.
- Referee: [BIRCH correction step] BIRCH correction step (described in the abstract and method): The central claim that BIRCH isolates image-grounded correctness depends on the initial correction reliably detecting and fixing image inconsistencies without introducing new errors or biases. No human-annotated validation, error analysis, or ablation on conflicting cases is provided. Because the correction likely uses the same class of VLM, it may inherit the limited image attention that produces informativeness bias, rendering the subsequent comparison invalid.
Authors: We agree that validating the correction step is important for substantiating our claims. In the full manuscript (Section 3), we detail the correction prompt, which is designed to explicitly focus on identifying and rectifying image inconsistencies, using a different instruction set than the judging prompt. To address the concern, we will include in the revised manuscript: (1) a human evaluation on a sample of 200 conflicting cases to measure the accuracy of the correction step, (2) an error analysis categorizing any introduced errors, and (3) an ablation study comparing BIRCH with and without the correction step. Regarding the potential for inherited bias, our results demonstrate that BIRCH consistently reduces informativeness bias across models, indicating that the correction provides a useful anchor even if imperfect. We believe this additional analysis will strengthen the paper. revision: yes
- Referee: [Experiments] Experimental results (abstract): The reported reductions (up to 17% bias, 9.8% performance) are load-bearing for the contribution, yet the abstract supplies no information on controls, baselines, statistical tests, or confounds. This prevents assessment of whether the gains are attributable to the bias reduction or to other factors.
Authors: The full details of our experimental setup, including baselines (direct VLM judging and other variants), controls for model size and prompt variations, statistical tests (e.g., significance testing with p-values reported in tables), and discussion of potential confounds (such as answer length normalization), are provided in Sections 4 and 5 of the manuscript. However, we acknowledge that the abstract is concise and lacks this context. In the revision, we will expand the abstract to include a brief mention of the experimental design, key baselines, and that improvements are statistically significant. This will allow readers to better evaluate the results from the abstract alone. revision: yes
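For readers who want to check significance on paired judge decisions themselves, one common option is an exact McNemar test on the discordant items. This page does not say which test the authors used, so the sketch below is a swapped-in illustration; `b` and `c` are hypothetical counts of items where only the baseline judge or only the BIRCH judge was correct.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant counts.

    b: items the baseline judged correctly and the new judge did not.
    c: items the new judge judged correctly and the baseline did not.
    Under the null hypothesis, each discordant item is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant items, no evidence of a difference
    k = min(b, c)
    # Probability of a split at least as lopsided as the observed one, one side.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = mcnemar_exact(0, 10)  # all 10 discordant items favor the new judge
```

Items on which both judges agree do not enter the test, so it asks only whether disagreements split evenly between the two judges.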
Circularity Check
No circularity: purely empirical identification and mitigation tested on external benchmarks
The paper defines informativeness bias via direct observation of VLM judge behavior, proposes the BIRCH correction-then-compare procedure as an engineering mitigation, and reports quantitative reductions (up to 17% bias, 9.8% performance) measured against independent benchmarks and models. No mathematical derivation, parameter fitting, or self-citation chain is used to obtain the central results; all claims rest on external experimental outcomes rather than reducing to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: VLMs can be prompted to detect and correct inconsistencies between candidate answers and image content
invented entities (1)
- informativeness bias: no independent evidence