
arxiv: 2604.17768 · v1 · submitted 2026-04-20 · 💻 cs.AI

When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias

Pith reviewed 2026-05-10 05:02 UTC · model grok-4.3

classification 💻 cs.AI
keywords Vision-Language Models · VLM-as-a-Judge · Informativeness Bias · Automatic Evaluation · Multimodal AI · Image Grounding · Bias Reduction

The pith

Vision-language models acting as judges often ignore the image and favor more informative answers even when they conflict with the visual content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that VLM-as-a-Judge systems frequently overlook image details and instead choose answers that contain more information, even when those answers contradict what the image shows. This informativeness bias reduces the reliability of automatic evaluations for vision-language models. The authors propose BIRCH, a new judging method that first corrects any inconsistencies between the answers and the image content before making comparisons. This change makes the judgment focus on correctness grounded in the image rather than on how much detail an answer provides. If true, this would make automated judging more trustworthy for developing better multimodal AI systems.

Core claim

VLM-as-a-Judge often pays limited attention to the image and blindly favors the more informative answer even when it conflicts with the image content. BIRCH corrects inconsistencies with the image in candidate answers first, then compares the answers against this corrected version to shift focus to image-grounded correctness. This reduces informativeness bias by up to 17% and improves performance by up to 9.8% across models and benchmarks.

What carries the argument

BIRCH (Balanced Informativeness and CoRrectness with a Truthful AnCHor), a judging paradigm that first corrects image inconsistencies in candidate answers and then compares them against the corrected version.
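The correct-then-compare flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `VLM` callable interface, the function name `birch_style_judge`, and all prompt wording are assumptions made for this example. The paper's actual prompts are the ones shown in its Figures 20 and 21.

```python
import re
from typing import Callable, Sequence

# Hypothetical interface: any callable mapping (prompt, image_bytes) to text.
VLM = Callable[[str, bytes], str]


def birch_style_judge(vlm: VLM, image: bytes, query: str,
                      answers: Sequence[str]) -> int:
    """Return the index of the preferred answer via correct-then-compare."""
    # Step 1: rewrite each candidate so it agrees with the image.
    # (Prompt text is illustrative, not taken from the paper.)
    corrected = [
        vlm(f"Question: {query}\nAnswer: {a}\n"
            "Rewrite the answer so that every detail is supported by the "
            "image. Hedge or drop anything the image cannot confirm.", image)
        for a in answers
    ]
    # Step 2: merge the corrected answers into one anchor.
    merged_input = "\n".join(f"- {c}" for c in corrected)
    anchor = vlm(f"Question: {query}\nCorrected answers:\n{merged_input}\n"
                 "Merge these into one truthful, detailed answer.", image)
    # Step 3: compare the original candidates against the anchor.
    listing = "\n".join(f"[{i}] {a}" for i, a in enumerate(answers))
    verdict = vlm(f"Question: {query}\nReference: {anchor}\n"
                  f"Candidates:\n{listing}\n"
                  "Reply with only the index of the candidate that best "
                  "matches the reference.", image)
    match = re.search(r"\d+", verdict)
    if match is None or int(match.group()) >= len(answers):
        raise ValueError(f"unparseable verdict: {verdict!r}")
    return int(match.group())
```

Keeping the model behind a plain callable means the same sketch can be exercised with a stub for testing or wired to any real VLM client. The judge makes one call per candidate, one merge call, and one comparison call.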

If this is right

  • Automatic evaluation of VLMs becomes more reliable by reducing preference for verbose but image-inconsistent answers.
  • Judging performance on benchmarks rises by up to 9.8% when BIRCH is used instead of standard comparison.
  • Informativeness bias drops by up to 17% across multiple VLM judges and evaluation datasets.
  • VLM judge systems require explicit image-alignment correction steps for trustworthy results.
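The accuracy and bias figures in the list above depend on how judge decisions are scored. One simple way to expose the effect, in the spirit of the paper's Figure 1, is to split judge accuracy by whether the more informative answer is also the correct one. The sketch below assumes a hypothetical `Record` layout and reports the gap between the two subsets. It is an illustrative proxy and may differ from the paper's own definition of informativeness bias.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Record:
    judge_pick: str        # answer the judge preferred, e.g. "A" or "B"
    correct: str           # answer labelled correct by humans
    more_informative: str  # answer labelled as more informative


def accuracy(records: Sequence[Record]) -> float:
    """Fraction of records where the judge picked the correct answer."""
    if not records:
        raise ValueError("no records in this subset")
    return sum(r.judge_pick == r.correct for r in records) / len(records)


def split_accuracy(records: Sequence[Record]) -> Tuple[float, float]:
    """Accuracy when informativeness agrees with correctness vs. conflicts."""
    aligned = [r for r in records if r.more_informative == r.correct]
    conflict = [r for r in records if r.more_informative != r.correct]
    return accuracy(aligned), accuracy(conflict)


# Toy data, invented purely for illustration.
records = [
    Record("A", "A", "A"),
    Record("B", "B", "B"),
    Record("B", "A", "B"),  # judge follows the informative but wrong answer
    Record("A", "A", "B"),
]
aligned_acc, conflict_acc = split_accuracy(records)
gap = aligned_acc - conflict_acc
```

A large positive gap suggests the judge does well only when the detailed answer happens to be right, which is the pattern the paper reports.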

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar informativeness biases could appear in text-only LLM judges when evaluating factual consistency.
  • Training VLMs with stronger image-attention signals during evaluation tasks may prevent the bias at the source.
  • BIRCH-style correction could be tested on other multimodal benchmarks to measure broader gains in judge reliability.

Load-bearing premise

The initial correction step in BIRCH can reliably identify and fix image inconsistencies without introducing new errors or biases.

What would settle it

A controlled test set of answers with known image conflicts presented to BIRCH, where the method is checked for whether it consistently corrects those conflicts without missing them or creating new inaccuracies.
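A check of this kind could be scored as follows. This is a sketch under stated assumptions: the `CorrectionCase` fields, the claim identifiers, and the three summary rates are hypothetical choices for illustration and are not drawn from the paper.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Sequence


@dataclass(frozen=True)
class CorrectionCase:
    known_conflicts: FrozenSet[str]  # annotated claims that contradict the image
    fixed: FrozenSet[str]            # claims an annotator marks as fixed
    new_errors: int                  # errors present only after correction


def summarize(cases: Sequence[CorrectionCase]) -> Dict[str, float]:
    """Aggregate how often known conflicts are fixed and how often new errors appear."""
    total = sum(len(c.known_conflicts) for c in cases)
    if total == 0 or not cases:
        raise ValueError("need at least one case with a known conflict")
    # Only count fixes that correspond to an annotated conflict.
    fixed = sum(len(c.fixed & c.known_conflicts) for c in cases)
    clean = sum(c.new_errors == 0 for c in cases)
    return {
        "fix_rate": fixed / total,
        "miss_rate": (total - fixed) / total,
        "clean_case_rate": clean / len(cases),
    }


# Toy annotations, invented purely for illustration.
cases = [
    CorrectionCase(frozenset({"c1", "c2"}), frozenset({"c1"}), 0),
    CorrectionCase(frozenset({"c3", "c4"}), frozenset({"c3", "c4"}), 1),
]
summary = summarize(cases)
```

Reporting the miss rate and the share of cases without new errors separately keeps the two failure modes named in the premise, missed conflicts and newly introduced inaccuracies, from cancelling each other out in a single score.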

Figures

Figures reproduced from arXiv: 2604.17768 by Dan Roth, Mohammadtaher Safarzadeh, Roshan Sridhar, Xiaohan Zou.

Figure 1: VLM-as-a-Judge performs well when the more informative answer happens to be correct (more informative = correct). But when the more informative answer is actually wrong (more informative = wrong), accuracy drops sharply by 40.6-59.0% across models. This shows judges over-prioritize informativeness over truth. Our method BIRCH greatly improves performance in these cases, thereby boosting overall results.…

Figure 2: (A) VLM-as-a-Judge input: A VQA question (query + image) is given to two different response models, producing candidate answers A and B. The question asks for the park’s name. Answer B looks more informative by giving a name and claiming it is inscribed on the bench, while Answer A is correct since the name is not visible in the image. (B) Human preference: Human annotators, given the query, image, answers…

Figure 4: Judge performance is strongly affected by how …

Figure 3: Informativeness bias is consistently stronger …

Figure 5: Image reliance score is consistently higher …

Figure 6: Informativeness bias is higher when the image …

Figure 7: Illustration of the proposed BIRCH. Step 1: Each candidate answer is checked and corrected based on the image. In Answer B, the unsupported detail "Boeing 747" is replaced with "a commercial airliner" and clarified as "the exact model cannot be confirmed". Step 2: The corrected answers are merged into an anchor that is both truthful and informative (all details from the candidates are either corrected or p…

Figure 8: Informativeness bias (IB) and length bias (LB) …

Figure 10: Informativeness bias (IB) and length bias …

Figure 11: Per-domain comparison of GPT-4o (left), Llama-3.2-Vision-90B (middle), and Gemini-2.5-Flash (right), …

Figure 12: Successful and failed cases with the Standard Ref pipeline, where the judge first generates its own answer as a reference and then compares Answer A and B against it to decide which is better. (A) Successful case: Using the judge’s answer "it does not include any signs that would indicate the name of the park" as reference, the judge correctly identifies an error in Answer B, which claims the park name is…

Figure 13: Illustration of the Image Caption baseline and why it performs poorly. First, the same VLM later used as the judge is asked to generate a caption for the image. Then, the judge takes the query, image, two candidate answers, and the caption as input, and decides which answer better addresses the query. However, the caption does not mention whether the park name is visible, so the judge still accepts the un…

Figure 14: Example showing a human labeling error in VL-RewardBench. The white food (rice) is clearly to the …

Figure 15: Example showing a human labeling error in VL-RewardBench. Answer B incorrectly states that the …

Figure 16: Example showing a human labeling error in VL-RewardBench. Although precisely counting the flags …

Figure 17: Example showing a human labeling error in VL-RewardBench. The image clearly shows multiple tracks …

Figure 18: Prompt for comparing informativeness. We provide GPT-4o with the original query and two candidate …

Figure 19: Prompt for generating reference answers in …

Figure 20: Prompt for generating an anchor in BIRCH.

Figure 21: Prompt for judging with reference answer in BIRCH.

Figure 22: In this example, VLM-as-a-Judge prefers Answer B, which gives more details about the book but makes …

Figure 23: In this example, VLM-as-a-Judge prefers Answer A, which provides more details about the environment …

Figure 24: In this example, VLM-as-a-Judge prefers Answer B, which provides more details about the cat but …

Figure 25: In this example, Standard Ref rejects the better Answer B, judging its placement of the sign as speculative since this detail is not in the reference. This shows how it over-penalizes informativeness. In contrast, BIRCH ensures such placement details are included in the generated anchor, allowing it to correctly verify Answer B and achieve a better balance between informativeness and correctness …

Figure 26: In this example, Standard Ref rejects the better Answer A, assuming Australia is not the highlighted country because the reference answer does not mention it. This shows how it over-penalizes informativeness. In contrast, BIRCH includes the valid mention of Australia in the generated anchor, allowing it to correctly verify Answer A and achieve a better balance between informativeness and correctness …

Figure 27: In this example, Standard Ref rejects the better Answer B, reasoning that electric or hybrid vehicles cannot be inferred from the image (even though the charging cable and plug suggest this) because the reference answer does not mention it. This shows how it over-penalizes informativeness. In contrast, BIRCH includes the likelihood of the car being electric or hybrid in the generated anchor, allowing it t…

Figure 28: In this example, Standard Ref rejects the better Answer B, judging “boar foraging at a feeding station” as unsupported because the reference answer does not mention it. This shows how it over-penalizes informativeness. In contrast, BIRCH includes this context in the generated anchor, allowing it to correctly verify Answer B and achieve a better balance between informativeness and correctness …

Figure 29: Failure case study. BIRCH generates an anchor with incorrect visual reasoning or is misled by the wrong …

Figure 30: Failure case study. BIRCH generates a reliable anchor stating that the balloon held by the boy is filled …
Original abstract

The reliability of VLM-as-a-Judge is critical for the automatic evaluation of vision-language models (VLMs). Despite recent progress, our analysis reveals that VLM-as-a-Judge often pays limited attention to the image when making decisions. Instead, they often blindly favor the more informative answer, even when they can recognize it conflicts with the image content. We call this problem informativeness bias, which significantly undermines judge reliability. To address it, we propose BIRCH (Balanced Informativeness and CoRrectness with a Truthful AnCHor), a judging paradigm that first corrects inconsistencies with the image content in candidate answers, and then compares the answers against this corrected version. This shifts the judge's focus from informativeness to image-grounded correctness. Experiments on multiple models and benchmarks show that BIRCH reduces informativeness bias by up to 17%, resulting in performance gains of up to 9.8%. Our work reveals an overlooked but fundamental flaw in current VLM-as-a-Judge systems and highlights the need for more principled designs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that VLMs used as judges exhibit 'informativeness bias' by favoring more informative answers even when they conflict with image content (despite recognizing the conflict). It proposes BIRCH, a two-step paradigm that first corrects candidate answers for image inconsistencies to create a truthful anchor, then compares answers against this anchor to prioritize image-grounded correctness over informativeness. Experiments across multiple models and benchmarks report that BIRCH reduces the bias by up to 17% and yields performance gains of up to 9.8%.

Significance. If the results hold after validation, the work identifies a practically important limitation in VLM-as-a-Judge systems and offers a mitigation that could improve automated evaluation reliability in vision-language tasks. The empirical framing, new bias concept, and quantitative gains provide a concrete starting point for more robust judging protocols.

major comments (2)
  1. [BIRCH correction step] BIRCH correction step (described in the abstract and method): The central claim that BIRCH isolates image-grounded correctness depends on the initial correction reliably detecting and fixing image inconsistencies without introducing new errors or biases. No human-annotated validation, error analysis, or ablation on conflicting cases is provided. Because the correction likely uses the same class of VLM, it may inherit the limited image attention that produces informativeness bias, rendering the subsequent comparison invalid.
  2. [Experiments] Experimental results (abstract): The reported reductions (up to 17% bias, 9.8% performance) are load-bearing for the contribution, yet the abstract supplies no information on controls, baselines, statistical tests, or confounds. This prevents assessment of whether the gains are attributable to the bias reduction or to other factors.
minor comments (1)
  1. [Abstract] The abstract states quantitative gains on multiple models and benchmarks but omits details on experimental controls, baselines, statistical tests, or potential confounds, reducing clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and positive assessment of the significance of our work. We address each major comment in detail below, providing clarifications and outlining revisions where appropriate.

Point-by-point responses
  1. Referee: [BIRCH correction step] BIRCH correction step (described in the abstract and method): The central claim that BIRCH isolates image-grounded correctness depends on the initial correction reliably detecting and fixing image inconsistencies without introducing new errors or biases. No human-annotated validation, error analysis, or ablation on conflicting cases is provided. Because the correction likely uses the same class of VLM, it may inherit the limited image attention that produces informativeness bias, rendering the subsequent comparison invalid.

    Authors: We agree that validating the correction step is important for substantiating our claims. In the full manuscript (Section 3), we detail the correction prompt, which is designed to explicitly focus on identifying and rectifying image inconsistencies, using a different instruction set than the judging prompt. To address the concern, we will include in the revised manuscript: (1) a human evaluation on a sample of 200 conflicting cases to measure the accuracy of the correction step, (2) an error analysis categorizing any introduced errors, and (3) an ablation study comparing BIRCH with and without the correction step. Regarding the potential for inherited bias, our results demonstrate that BIRCH consistently reduces informativeness bias across models, indicating that the correction provides a useful anchor even if imperfect. We believe this additional analysis will strengthen the paper. revision: yes

  2. Referee: [Experiments] Experimental results (abstract): The reported reductions (up to 17% bias, 9.8% performance) are load-bearing for the contribution, yet the abstract supplies no information on controls, baselines, statistical tests, or confounds. This prevents assessment of whether the gains are attributable to the bias reduction or to other factors.

    Authors: The full details of our experimental setup, including baselines (direct VLM judging and other variants), controls for model size and prompt variations, statistical tests (e.g., significance testing with p-values reported in tables), and discussion of potential confounds (such as answer length normalization), are provided in Sections 4 and 5 of the manuscript. However, we acknowledge that the abstract is concise and lacks this context. In the revision, we will expand the abstract to include a brief mention of the experimental design, key baselines, and that improvements are statistically significant. This will allow readers to better evaluate the results from the abstract alone. revision: yes
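The kind of significance testing mentioned in the response above can be illustrated for the paired setting where a baseline judge and a BIRCH-style judge are scored on the same items. The source does not say which test the authors report. The sketch below uses an exact two-sided McNemar test as one common choice for paired binary outcomes, and the function name and toy data are assumptions made for this example.

```python
from math import comb
from typing import Sequence


def mcnemar_exact_p(baseline_ok: Sequence[bool], method_ok: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired correct/incorrect outcomes."""
    if len(baseline_ok) != len(method_ok):
        raise ValueError("both judges must be scored on the same items")
    # Only discordant pairs carry information about a difference.
    b = sum(x and not y for x, y in zip(baseline_ok, method_ok))  # baseline only
    c = sum(y and not x for x, y in zip(baseline_ok, method_ok))  # method only
    n = b + c
    if n == 0:
        return 1.0
    # Under the null, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy outcomes, invented purely for illustration.
baseline = [True, False, False, False, False, False, True]
method = [True, True, True, True, True, True, False]
p_value = mcnemar_exact_p(baseline, method)
```

With so few discordant pairs the toy example is not significant at conventional thresholds, which shows why the size of the evaluation set matters when reading headline gains.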

Circularity Check

0 steps flagged

No circularity: purely empirical identification and mitigation tested on external benchmarks

Full rationale

The paper defines informativeness bias via direct observation of VLM judge behavior, proposes the BIRCH correction-then-compare procedure as an engineering mitigation, and reports quantitative reductions (up to 17% bias, 9.8% performance) measured against independent benchmarks and models. No mathematical derivation, parameter fitting, or self-citation chain is used to obtain the central results; all claims rest on external experimental outcomes rather than reducing to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that VLMs can perform reliable image-based correction of answer inconsistencies and that the observed bias is measurable via the chosen experiments. No free parameters are explicitly fitted in the abstract description. The bias concept itself is a newly introduced entity without independent falsifiable evidence beyond the reported experiments.

axioms (1)
  • domain assumption: VLMs can be prompted to detect and correct inconsistencies between candidate answers and image content
    Invoked in the description of the BIRCH paradigm as the first step before comparison.
invented entities (1)
  • informativeness bias (no independent evidence)
    Purpose: to label the observed tendency of VLMs to prioritize answer informativeness over image consistency.
    Newly defined based on analysis; no external independent evidence provided in abstract.

