MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

Changchang Sun; Dehai Min; Huiyi Chen; Jiawei Peng; Kaijie Chen; Lu Cheng; Xu Yang; Yan Yan

arxiv: 2511.14159 · v2 · pith:DQP4AHMUnew · submitted 2025-11-18 · 💻 cs.CV

Huiyi Chen, Jiawei Peng, Dehai Min, Changchang Sun, Kaijie Chen, Yan Yan, Xu Yang, Lu Cheng

Pith reviewed 2026-05-21 19:45 UTC · model grok-4.3

classification 💻 cs.CV

keywords LVLMs · Robustness · Misleading Visual Inputs · Visual Question Answering · Benchmark · MVI-Sensitivity · Vision Language Models · Taxonomy

The pith

MVI-Bench shows large vision-language models are vulnerable to misleading visual inputs across three key levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MVI-Bench as the first dedicated benchmark for testing how misleading visual inputs affect the performance of Large Vision-Language Models in visual question answering tasks. It builds a taxonomy with three levels—Visual Concept, Visual Attribute, and Visual Relationship—to organize six categories of such inputs, resulting in 1,248 expertly annotated instances. A new metric, MVI-Sensitivity, is proposed to measure robustness in detail. Results from evaluating 18 state-of-the-art models indicate clear weaknesses, which the authors argue can inform improvements in model reliability for practical applications.

Core claim

MVI-Bench is the first comprehensive benchmark designed to evaluate the robustness of LVLMs to misleading visual inputs. It is grounded in fundamental visual primitives with a hierarchical taxonomy consisting of Visual Concept, Visual Attribute, and Visual Relationship levels. From this, six representative categories are curated into 1,248 VQA instances with expert annotations. The benchmark includes the MVI-Sensitivity metric for fine-grained evaluation, and testing reveals pronounced vulnerabilities in current LVLMs along with insights for developing more robust models.

What carries the argument

The three-level taxonomy of misleading visual inputs (Visual Concept, Visual Attribute, Visual Relationship) that structures the benchmark and enables the creation of MVI-Sensitivity for granular robustness assessment.
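
As an illustration of how the taxonomy might organise an evaluation, here is a minimal sketch in Python. The `Level` enum mirrors the three levels named above. Everything else is hypothetical and not taken from the MVI-Bench codebase: the `PairedInstance` record, its field names, and the `accuracy_by_level` helper. The sketch also assumes that each instance pairs a normal image with a misleading one under the same question, and that the two images may have different gold answers.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    # The three taxonomy levels named in the paper.
    VISUAL_CONCEPT = "Visual Concept"
    VISUAL_ATTRIBUTE = "Visual Attribute"
    VISUAL_RELATIONSHIP = "Visual Relationship"


@dataclass(frozen=True)
class PairedInstance:
    # Hypothetical record layout; field names are illustrative only.
    level: Level
    category: str            # one of the six categories (names not listed here)
    question: str
    answer_normal: str       # gold answer for the normal image
    answer_misleading: str   # gold answer for the misleading image


def accuracy_by_level(instances, predictions):
    """Return {Level: (normal_accuracy, misleading_accuracy)}.

    `predictions` is a list of (pred_normal, pred_misleading) tuples
    aligned with `instances`.
    """
    # Per level: [normal hits, misleading hits, instance count].
    hits = defaultdict(lambda: [0, 0, 0])
    for inst, (pred_n, pred_m) in zip(instances, predictions):
        row = hits[inst.level]
        row[0] += pred_n == inst.answer_normal
        row[1] += pred_m == inst.answer_misleading
        row[2] += 1
    return {lvl: (n / c, m / c) for lvl, (n, m, c) in hits.items()}
```

Grouping by level rather than reporting a single accuracy keeps the normal and misleading scores side by side, which is the comparison the benchmark is built around.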

If this is right

LVLMs exhibit pronounced vulnerabilities when faced with misleading visual inputs in VQA scenarios.
MVI-Bench provides a structured way to identify specific weaknesses in visual understanding.
Analyses from the benchmark offer actionable insights for enhancing LVLM reliability.
Future model development can target the identified categories to reduce errors from deceptive visuals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the taxonomy to dynamic video inputs could reveal additional robustness issues in multimodal systems.
Integrating MVI-Bench into training pipelines might help mitigate the observed vulnerabilities through targeted data augmentation.
The benchmark's focus on visual primitives suggests it could complement existing text-based robustness tests for more complete evaluations.

Load-bearing premise

The three-level taxonomy and six categories together with expert annotations capture the main types of misleading visual inputs that affect real-world LVLM performance.

What would settle it

Demonstrating that a new LVLM achieves high performance on MVI-Bench yet still fails significantly in real-world applications involving misleading visuals would challenge the benchmark's effectiveness.

Figures

Figures reproduced from arXiv: 2511.14159 by Changchang Sun, Dehai Min, Huiyi Chen, Jiawei Peng, Kaijie Chen, Lu Cheng, Xu Yang, Yan Yan.

**Figure 1.** (a) Misleading Textual Input: misleading questions are created by injecting inaccurate or irrelevant information into otherwise normal queries. (b) Misleading Visual Input: misleading visual cues arise from real-world scenes, causing models to misinterpret the image content (e.g., stools mistaken for mushrooms).

**Figure 2.** Examples from six misleading categories defined in MVI-Bench. Each pair contains a …

**Figure 3.** Overview of MVI-Bench statistics. (a) Six balanced misleading visual categories. (b) Three diverse image sources: natural, synthetic, and edited. (c) Broad object coverage across multiple domains. (d) High pairwise similarity ensures semantic consistency between normal and misleading image pairs.

**Figure 4.** Comparison between the “non-think” and “think” modes of SAIL-VL. In the non-think mode, the model answers directly based on visual evidence, while in the think mode, the model is guided by historical thoughts and tends to overemphasize fine details.

**Figure 5.** Attention-guided masking for a counterintuitive instance. Qwen2.5-VL-7B spuriously associates a receipt with a book. (a) On the normal image with one book, it answers incorrectly. (b) On the misleading image, it coincidentally answers “2” by counting the receipt as an extra book. (c) Masking the receipt flips the prediction, confirming the spurious correlation.

**Figure 6.** Benchmark Curation Pipeline. The pipeline starts with image collection, followed by VQA annotation, data filtering, and ultimately results in MVI-Bench. To ensure data quality, human verification is performed at each key stage to eliminate low-quality data, annotations, and ambiguous evaluation questions.

**Figure 7.** Comparison between the “non-think” and “think” modes of SAIL-VL. In the non-think mode, the model answers directly based on visual evidence, while in the think mode, the model is guided by historical thoughts and tends to overemphasize fine details.

**Figure 8.** More examples from six misleading categories defined in MVI-Bench.

Original abstract

Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase can be accessed at https://github.com/chenyil6/MVI-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MVI-Bench targets a real gap in LVLM robustness testing with a new visual taxonomy and metric, though its coverage and annotation details need closer scrutiny.

The paper introduces MVI-Bench, a benchmark for how large vision-language models respond to misleading visual inputs. It stands out for focusing on visuals rather than the usual text-based tests. The authors build a taxonomy with three levels (visual concept, visual attribute, and visual relationship), select six categories, and create 1,248 expert-annotated VQA instances. They add MVI-Sensitivity as a metric to measure robustness at a finer grain. Testing 18 current models shows that they struggle with these inputs, and the authors suggest ways to improve.

This fills a noticeable hole in robustness work. Most benchmarks look at hallucinations or misleading text prompts, so a visual-specific one is a step forward. The hierarchical approach and the metric give a structured way to probe the issue.

The main concern is coverage. The taxonomy might not catch everything that happens in practice, such as cultural differences in how images are read or contradictions involving several objects at once. If those cases are common, the measured problems could be narrower than claimed. The abstract also skips over how the annotations were checked for consistency and what statistical checks were used, which leaves some uncertainty about data quality.

People working on making LVLMs more reliable would find this relevant, especially if they are looking for new evaluation tools. It gives concrete numbers across models that can spark discussion. I would recommend sending it for peer review. The idea targets a real need, and the empirical results on multiple models provide a basis for feedback, provided the methods get fleshed out.

Referee Report

3 major / 2 minor

Summary. The paper introduces MVI-Bench, the first comprehensive benchmark for evaluating Large Vision-Language Models' (LVLMs) robustness to misleading visual inputs. It grounds the benchmark in a three-level taxonomy (Visual Concept, Visual Attribute, Visual Relationship), curates six representative categories, and compiles 1,248 expertly annotated VQA instances. A novel MVI-Sensitivity metric is proposed for granular robustness evaluation. Empirical results on 18 state-of-the-art LVLMs report pronounced vulnerabilities, with in-depth analyses yielding actionable insights for more reliable LVLMs. The benchmark and codebase are released publicly.

Significance. If the benchmark construction and metric are rigorously documented, this work addresses a clear gap in LVLM robustness evaluation, which has largely emphasized textual hallucinations rather than visual misleading inputs. The hierarchical taxonomy based on fundamental visual primitives, multi-model evaluation, and public release of the benchmark represent strengths that could support reproducible research and guide improvements in LVLM visual understanding.

major comments (3)

[§3] §3 (Benchmark Construction): The claim that the three-level taxonomy and six curated categories provide comprehensive coverage of misleading visual inputs lacks supporting validation or discussion of potential omissions (e.g., culturally specific misinterpretations or multi-object contextual contradictions). This is load-bearing for the generalizability of the 'pronounced vulnerabilities' findings across the 18 models.
[§4] §4 (MVI-Sensitivity Metric): The novel MVI-Sensitivity metric is introduced to enable fine-grained evaluation, but its exact computation, aggregation method, and normalization are not specified with equations or algorithmic details. This prevents verification of the reported empirical results and undermines reproducibility.
[§3.2] §3.2 (Annotation Protocol): No information is provided on the expert annotation protocol, selection criteria for annotators, guidelines used, or inter-annotator agreement for the 1,248 VQA instances. These details are essential to establish the reliability of the benchmark instances underlying all claims.
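
To make the request in the §4 comment concrete, the sketch below shows one plausible shape such a metric could take: the relative accuracy drop when moving from normal to misleading images. This is an assumption for illustration only. The paper's actual MVI-Sensitivity formula is not given in this summary and may differ, and the function name `relative_accuracy_drop` is hypothetical.

```python
def relative_accuracy_drop(acc_normal: float, acc_misleading: float) -> float:
    """Hypothetical sensitivity score: the fraction of normal-image accuracy
    that is lost on misleading images. NOT the paper's MVI-Sensitivity.

    0.0 means no degradation; 1.0 means accuracy fell to zero.
    Negative values mean the model did better on misleading images.
    """
    if not 0.0 <= acc_normal <= 1.0 or not 0.0 <= acc_misleading <= 1.0:
        raise ValueError("accuracies must lie in [0, 1]")
    if acc_normal == 0.0:
        # Nothing to lose: the ratio is undefined rather than zero.
        raise ValueError("undefined when normal-image accuracy is zero")
    return (acc_normal - acc_misleading) / acc_normal
```

A ratio of this kind separates robustness from raw capability: two models with different normal-image accuracy can be compared on how much of that accuracy survives the misleading input. Whether the paper normalises this way is exactly what the referee asks the authors to specify.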

minor comments (2)

[Abstract] The abstract and introduction could more explicitly contrast MVI-Bench with existing robustness benchmarks focused on textual inputs to highlight the novelty.
[Conclusion] Ensure that the GitHub repository link includes clear documentation on how to reproduce the MVI-Sensitivity scores and access the annotated instances.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for their thorough and constructive review of our manuscript. We have carefully addressed each major comment below and will incorporate revisions to strengthen the paper's clarity, rigor, and reproducibility.

Referee: [§3] §3 (Benchmark Construction): The claim that the three-level taxonomy and six curated categories provide comprehensive coverage of misleading visual inputs lacks supporting validation or discussion of potential omissions (e.g., culturally specific misinterpretations or multi-object contextual contradictions). This is load-bearing for the generalizability of the 'pronounced vulnerabilities' findings across the 18 models.

Authors: We appreciate this observation regarding the scope of our taxonomy. The three-level hierarchy (Visual Concept, Visual Attribute, Visual Relationship) is explicitly grounded in established visual primitives from computer vision literature, and the six categories were chosen as representative based on their prevalence in visual understanding tasks. However, we acknowledge that explicit validation of coverage and discussion of omissions would better support generalizability claims. In the revised manuscript, we will add a dedicated limitations paragraph in §3 that discusses potential omissions, including culturally specific misinterpretations and multi-object contextual contradictions, while clarifying how the current design still enables meaningful evaluation of pronounced vulnerabilities across the 18 models. revision: yes
Referee: [§4] §4 (MVI-Sensitivity Metric): The novel MVI-Sensitivity metric is introduced to enable fine-grained evaluation, but its exact computation, aggregation method, and normalization are not specified with equations or algorithmic details. This prevents verification of the reported empirical results and undermines reproducibility.

Authors: Thank you for highlighting this issue. We will revise §4 to include the complete mathematical formulation of the MVI-Sensitivity metric. This will comprise explicit equations for per-instance sensitivity computation, the aggregation method across the 1,248 VQA instances (including any weighting), and normalization procedures. These additions will enable full verification and reproduction of the reported results on the 18 LVLMs. revision: yes
Referee: [§3.2] §3.2 (Annotation Protocol): No information is provided on the expert annotation protocol, selection criteria for annotators, guidelines used, or inter-annotator agreement for the 1,248 VQA instances. These details are essential to establish the reliability of the benchmark instances underlying all claims.

Authors: We agree that detailed annotation information is essential for establishing benchmark reliability. In the revised version, we will substantially expand §3.2 to describe the expert annotation protocol, including annotator selection criteria (requiring expertise in computer vision and multimodal AI), the annotation guidelines and interface, the quality control process, and quantitative inter-annotator agreement results (e.g., Fleiss' kappa) computed over the 1,248 instances. revision: yes
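
The rebuttal names Fleiss' kappa as an example agreement statistic. A minimal, dependency-free sketch of the standard formula follows, assuming every item is rated by the same number of annotators. The function name and any ratings passed to it are illustrative, not data from the paper.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j.

    Assumes every item has the same number of ratings (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Per-item agreement: share of annotator pairs that agree on item i.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_exp = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    if p_exp == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_exp) / (1.0 - p_exp)
```

For a benchmark like this one, a natural table would have one row per VQA instance and one column per label (for example "misleading" and "normal"), though the summary does not say how the authors intend to set it up.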

Circularity Check

0 steps flagged

No circularity: benchmark construction and empirical evaluation are independent artifacts

The paper constructs MVI-Bench from a proposed three-level taxonomy and six curated categories with expert annotations, defines the MVI-Sensitivity metric, and reports empirical results on 18 LVLMs. No equations, fitted parameters, predictions, or derivations are present that reduce to self-defined inputs or self-citations. The taxonomy and benchmark are presented as new contributions rather than derived from prior results by construction, making the evaluation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim that the benchmark reveals actionable vulnerabilities rests on the quality and coverage of the expert-curated examples and the assumption that the chosen taxonomy is comprehensive.

axioms (1)

domain assumption: Expert annotations accurately identify which visual inputs are misleading for the given questions.
This underpins the creation of the 1,248 ground-truth VQA instances across the three hierarchical levels.

invented entities (1)

MVI-Sensitivity metric · no independent evidence
purpose: To characterize LVLM robustness at a granular level beyond standard accuracy
Newly introduced in the paper; no external validation or prior literature reference is mentioned in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure · reality_from_one_distinction · unclear

Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship... six representative categories and compile 1,248 expertly annotated VQA instances... MVI-Sensitivity

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear

Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
cs.CV 2026-04 unverdicted novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-...

Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models
cs.CL 2026-05 unverdicted novelty 6.0

PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five LRMs and five benchmarks while preserving accuracy and CoT quality.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · cited by 2 Pith papers · 21 internal anchors

[1]

Phi-4 Technical Report

Marah Abdin, Jyoti Aneja, Harkirat Behl, S ´ebastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Mvtamperbench: Evaluating robustness of vision-language models

Amit Agarwal, Srikant Panda, Angeline Charles, Hitesh Laxmichand Patel, Bhargava Kumar, Priyaran- jan Pattnayak, Taki Hasan Rafi, Tejaswini Kumar, Hansa Meghwani, Karan Gupta, et al. Mvtamperbench: Evaluating robustness of vision-language models. InFindings of the Association for Computational Linguistics: ACL 2025, pages 18804–18828, 2025. 1, 2

work page 2025
[4]

Claude 3.7 sonnet system card, 2025

Anthropic. Claude 3.7 sonnet system card, 2025. Accessed: 2025-02-04. 6, 3

work page 2025
[5]

Vqa: Visual question answering

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. InProceedings of the IEEE international conference on computer vision, pages 2425– 2433, 2015. 1, 2

work page 2015
[6]

OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open- source framework for training large autoregressive vision- language models.arXiv preprint arXiv:2308.01390, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 5, 6, 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Hallucination of Multimodal Large Language Models: A Survey

Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey.arXiv preprint arXiv:2404.18930, 2024. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Emer- gent abilities in large language models: A survey.arXiv preprint arXiv:2503.05788, 2025

Leonardo Berti, Flavio Giorgi, and Gjergji Kasneci. Emer- gent abilities in large language models: A survey.arXiv preprint arXiv:2503.05788, 2025. 7

work page arXiv 2025
[11]

Re- thinking visual layer selection in multimodal llms.arXiv preprint arXiv:2504.21447, 2025

Haoran Chen, Junyan Lin, Xinhao Chen, Yue Fan, Xin Jin, Hui Su, Jianfeng Dong, Jinlan Fu, and Xiaoyu Shen. Re- thinking visual layer selection in multimodal llms.arXiv preprint arXiv:2504.21447, 2025. 6

work page arXiv 2025
[12]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedan- tam, Saurabh Gupta, Piotr Doll ´ar, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server.arXiv preprint arXiv:1504.00325, 2015. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2015
[13]

Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024. 2, 5, 6

work page 2024
[14]

Instructblip: Towards general-purpose vision- language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision- language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023. 2

work page 2023
[15]

Exploring response uncertainty in mllms: An em- pirical evaluation under misleading scenarios.arXiv preprint arXiv:2411.02708, 2024

Yunkai Dang, Mengxi Gao, Yibo Yan, Xin Zou, Yanggan Gu, Jungang Li, Jingyu Wang, Peijie Jiang, Aiwei Liu, Jia Liu, et al. Exploring response uncertainty in mllms: An em- pirical evaluation under misleading scenarios.arXiv preprint arXiv:2411.02708, 2024. 1, 2

work page arXiv 2024
[16]

Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tri- pathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104,

work page
[17]

Words or vision: Do vision-language models have blind faith in text? InProceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 3867–3876, 2025

Ailin Deng, Tri Cao, Zhirui Chen, and Bryan Hooi. Words or vision: Do vision-language models have blind faith in text? InProceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 3867–3876, 2025. 1, 2

work page 2025
[18]

Scalable vision language model training via high quality data curation.arXiv preprint arXiv:2501.05952, 2025

Hongyuan Dong, Zijian Kang, Weijie Yin, Xiao Liang, Chao Feng, and Jiao Ran. Scalable vision language model training via high quality data curation.arXiv preprint arXiv:2501.05952, 2025. 2

work page arXiv 2025
[19]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation bench- mark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

Hallusionbench: an advanced diagnos- tic suite for entangled language hallucination and visual il- lusion in large vision-language models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnos- tic suite for entangled language hallucination and visual il- lusion in large vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,...

work page 2024
[21]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Learning to see before seeing: Demystifying llm visual priors from lan- guage pre-training.arXiv preprint arXiv:2509.26625, 2025

Junlin Han, Shengbang Tong, David Fan, Yufan Ren, Kous- tuv Sinha, Philip Torr, and Filippos Kokkinos. Learning to see before seeing: Demystifying llm visual priors from lan- guage pre-training.arXiv preprint arXiv:2509.26625, 2025. 2, 4

work page arXiv 2025
[23]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 6, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Hal-eval: A uni- versal and fine-grained hallucination evaluation framework for large vision language models

Chaoya Jiang, Hongrui Jia, Mengfan Dong, Wei Ye, Haiyang Xu, Ming Yan, Ji Zhang, and Shikun Zhang. Hal-eval: A uni- versal and fine-grained hallucination evaluation framework for large vision language models. InProceedings of the 32nd ACM International Conference on Multimedia, pages 525– 534, 2024. 2

work page 2024
[25]

Survey of adversarial robustness in multimodal large language models.arXiv preprint arXiv:2503.13962, 2025

Chengze Jiang, Zhuangzhuang Wang, Minjing Dong, and Jie Gui. Survey of adversarial robustness in multimodal large language models.arXiv preprint arXiv:2503.13962, 2025. 1, 2

work page arXiv 2025
[26]

Devils in middle layers of large vision- language models: Interpreting, detecting and mitigating ob- ject hallucinations via attention lens

Zhangqi Jiang, Junkai Chen, Beier Zhu, Tingjin Luo, Yankun Shen, and Xu Yang. Devils in middle layers of large vision- language models: Interpreting, detecting and mitigating ob- ject hallucinations via attention lens. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 25004–25014, 2025. 8

work page 2025
[27]

See what you are told: Visual attention sink in large multimodal models.arXiv preprint arXiv:2503.03321,

Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. See what you are told: Visual attention sink in large multimodal models.arXiv preprint arXiv:2503.03321, 2025. 8

work page arXiv 2025
[28]

Natural language understanding and inference with mllm in visual question answering: A survey.ACM Com- puting Surveys, 57(8):1–36, 2025

Jiayi Kuang, Ying Shen, Jingyou Xie, Haohao Luo, Zhe Xu, Ronghao Li, Yinghui Li, Xianfeng Cheng, Xika Lin, and Yu Han. Natural language understanding and inference with mllm in visual question answering: A survey.ACM Com- puting Surveys, 57(8):1–36, 2025. 1, 2

work page 2025
[29]

What matters when building vision-language mod- els?Advances in Neural Information Processing Systems, 37:87874–87907, 2024

Hugo Laurenc ¸on, L´eo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language mod- els?Advances in Neural Information Processing Systems, 37:87874–87907, 2024. 6

work page 2024
[30]

Seed-bench: Bench- marking multimodal large language models

Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Bench- marking multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 13299–13308, 2024. 6

work page 2024
[31]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32] Hao Liang, Ruitao Wu, Bohan Zeng, Junbo Niu, Wentao Zhang, and Bin Dong. Multimodal reasoning for science: Technical report and 1st place solution to the ICML 2025 SeePhys challenge. arXiv preprint arXiv:2509.06079, 2025.
[33] Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, and Sheng Liu. More thinking, less seeing? Assessing amplified hallucination in multimodal reasoning models. arXiv preprint arXiv:2505.21523, 2025.
[34] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
[35] Yexin Liu, Zhengyang Liang, Yueze Wang, Xianfeng Wu, Feilong Tang, Muyang He, Jian Li, Zheng Liu, Harry Yang, Sernam Lim, et al. Unveiling the ignorance of MLLMs: Seeing clearly, answering incorrectly. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9087–9097, 2025.
[36] Zhaochen Liu, Kaiwen Gao, Shuyi Liang, Bin Xiao, Limeng Qiao, Lin Ma, and Tingting Jiang. Beyond the visible: Benchmarking occlusion perception in multimodal large language models. arXiv preprint arXiv:2508.04059, 2025.
[37] Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez. Towards interpreting visual information processing in vision-language models. arXiv preprint arXiv:2410.07149, 2024.
[38] OpenAI. GPT-5 system card, 2025. Accessed: 2025-08-13.
[39] Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. LMM-R1: Empowering 3B LMMs with strong reasoning abilities through two-stage rule-based RL. arXiv preprint arXiv:2503.07536, 2025.
[40] Yusu Qian, Haotian Zhang, Yinfei Yang, and Zhe Gan. How easy is it to fool your multimodal LLMs? An empirical analysis on deceptive prompts. arXiv preprint arXiv:2402.13220, 2(7), 2024.
[41] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[42] Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427, 2025.
[43] Haz Sameen Shahgir, Khondker Salman Sayeed, Abhik Bhattacharjee, Wasi Uddin Ahmad, Yue Dong, and Rifat Shahriyar. IllusionVQA: A challenging optical illusion dataset for vision language models. arXiv preprint arXiv:2403.15952, 2024.
[44] Arthur David Smith. The problem of perception. Motilal Banarsidass Publishers, 2005.
[45] Gabriela Ben Melech Stan, Estelle Aflalo, Raanan Yehezkel Rohekar, Anahita Bhiwandiwalla, Shao-Yen Tseng, Matthew Lyle Olson, Yaniv Gurwicz, Chenfei Wu, Nan Duan, and Vasudev Lal. LVLM-Interpret: An interpretability tool for large vision-language models. arXiv preprint arXiv:2404.03118, 2024.
[46] Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented RLHF. In Findings of the Association for Computational Linguistics: ACL 2024, pages 13088–13110, 2024.
[47] Feilong Tang, Chengzhi Liu, Zhongxing Xu, Ming Hu, Zile Huang, Haochen Xue, Ziyang Chen, Zelin Peng, Zhiwei Yang, Sijin Zhou, et al. Seeing far and clearly: Mitigating hallucinations in MLLMs with attention causal decoding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26147–26159, 2025.
[48] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[49] Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-VL technical report. arXiv preprint arXiv:2504.07491, 2025.
[50] Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, and Jing Zhang. More thought, less accuracy? On the dual nature of reasoning in vision-language models. arXiv preprint arXiv:2509.25848, 2025.
[51] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.
[52] Qixin Wan, Zilong Wang, Jingwen Zhou, Wanting Wang, Ziheng Geng, Jiachen Liu, Ran Cao, Minghui Cheng, and Lu Cheng. Som-1k: A thousand-problem benchmark dataset for strength of materials. arXiv preprint arXiv:2509.21079.
[53] Fei Wang, Xingyu Fu, James Y Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, et al. MuirBench: A comprehensive benchmark for robust multi-image understanding. arXiv preprint arXiv:2406.09411, 2024.
[54] Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. AMBER: An LLM-free multi-dimensional benchmark for MLLMs hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023.
[55] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
[56] Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, and Hongxia Yang. Exploring the reasoning abilities of multimodal large language models (MLLMs): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805, 2024.
[57] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.
[58] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
[59] Mang Ye, Xuankun Rong, Wenke Huang, Bo Du, Nenghai Yu, and Dacheng Tao. A survey of safety on large vision-language models: Attacks, defenses and evaluations. arXiv preprint arXiv:2502.14881, 2025.
[60] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 11(12):nwae403, 2024.
[61] Weijie Yin, Yongjie Ye, Fangxun Shu, Yue Liao, Zijian Kang, Hongyuan Dong, Haiyang Yu, Dingkang Yang, Jiacong Wang, Han Wang, et al. SAIL-VL2 technical report. arXiv preprint arXiv:2509.14033, 2025.
[62] Tianyu Yu, Haoye Zhang, Qiming Li, Qixin Xu, Yuan Yao, Da Chen, Xiaoman Lu, Ganqu Cui, Yunkai Dang, Taiwen He, et al. RLAIF-V: Open-source AI feedback leads to super GPT-4V trustworthiness. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19985–19995, 2025.
[63] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.
[64] Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. MLLMs know where to look: Training-free perception of small visual details with multimodal LLMs. arXiv preprint arXiv:2502.17422, 2025.
[65] Yiming Zhang, Zicheng Zhang, Xinyi Wei, Xiaohong Liu, Guangtao Zhai, and Xiongkuo Min. IllusionBench: A large-scale and comprehensive benchmark for visual illusion understanding in vision-language models. arXiv preprint arXiv:2501.00848, 2025.
[66] Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. Advances in Neural Information Processing Systems, 36:54111–54138, 2023.
[67] Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors. arXiv preprint arXiv:2309.03882, 2023.
[68] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
[69] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.
[70] Jingze Zhu, Yongliang Wu, Wenbo Zhu, Jiawang Cao, Yanqiang Zheng, Jiawei Chen, Xu Yang, Bernt Schiele, Jonas Fischer, and Xinting Hu. Layercake: Token-aware contrastive decoding within large language model layers. arXiv preprint arXiv:2507.04404, 2025.

[1] Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024.
[2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
[3] Amit Agarwal, Srikant Panda, Angeline Charles, Hitesh Laxmichand Patel, Bhargava Kumar, Priyaranjan Pattnayak, Taki Hasan Rafi, Tejaswini Kumar, Hansa Meghwani, Karan Gupta, et al. MVTamperBench: Evaluating robustness of vision-language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 18804–18828, 2025.
[4] Anthropic. Claude 3.7 Sonnet system card, 2025. Accessed: 2025-02-04.
[5] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
[6] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.
[7] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
[8] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
[9] Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930, 2024.
[10] Leonardo Berti, Flavio Giorgi, and Gjergji Kasneci. Emergent abilities in large language models: A survey. arXiv preprint arXiv:2503.05788, 2025.
[11] Haoran Chen, Junyan Lin, Xinhao Chen, Yue Fan, Xin Jin, Hui Su, Jianfeng Dong, Jinlan Fu, and Xiaoyu Shen. Rethinking visual layer selection in multimodal LLMs. arXiv preprint arXiv:2504.21447, 2025.
[12] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[13] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
[14] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36:49250–49267, 2023.
[15] Yunkai Dang, Mengxi Gao, Yibo Yan, Xin Zou, Yanggan Gu, Jungang Li, Jingyu Wang, Peijie Jiang, Aiwei Liu, Jia Liu, et al. Exploring response uncertainty in MLLMs: An empirical evaluation under misleading scenarios. arXiv preprint arXiv:2411.02708, 2024.
[16] Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and PixMo: Open weights and open data for state-of-the-art vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104.
[17] Ailin Deng, Tri Cao, Zhirui Chen, and Bryan Hooi. Words or vision: Do vision-language models have blind faith in text? In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3867–3876, 2025.
[18] Hongyuan Dong, Zijian Kang, Weijie Yin, Xiao Liang, Chao Feng, and Jiao Ran. Scalable vision language model training via high quality data curation. arXiv preprint arXiv:2501.05952, 2025.
[19] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
[20] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, ..., 2024.
[21] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[22] Junlin Han, Shengbang Tong, David Fan, Yufan Ren, Koustuv Sinha, Philip Torr, and Filippos Kokkinos. Learning to see before seeing: Demystifying LLM visual priors from language pre-training. arXiv preprint arXiv:2509.26625, 2025.
[23] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[24] Chaoya Jiang, Hongrui Jia, Mengfan Dong, Wei Ye, Haiyang Xu, Ming Yan, Ji Zhang, and Shikun Zhang. Hal-Eval: A universal and fine-grained hallucination evaluation framework for large vision language models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 525–534, 2024.
[25] Chengze Jiang, Zhuangzhuang Wang, Minjing Dong, and Jie Gui. Survey of adversarial robustness in multimodal large language models. arXiv preprint arXiv:2503.13962, 2025.
[26] Zhangqi Jiang, Junkai Chen, Beier Zhu, Tingjin Luo, Yankun Shen, and Xu Yang. Devils in middle layers of large vision-language models: Interpreting, detecting and mitigating object hallucinations via attention lens. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 25004–25014, 2025.
[27] [27]

See what you are told: Visual attention sink in large multimodal models.arXiv preprint arXiv:2503.03321,

Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. See what you are told: Visual attention sink in large multimodal models.arXiv preprint arXiv:2503.03321, 2025. 8

work page arXiv 2025

[28] [28]

Natural language understanding and inference with mllm in visual question answering: A survey.ACM Com- puting Surveys, 57(8):1–36, 2025

Jiayi Kuang, Ying Shen, Jingyou Xie, Haohao Luo, Zhe Xu, Ronghao Li, Yinghui Li, Xianfeng Cheng, Xika Lin, and Yu Han. Natural language understanding and inference with mllm in visual question answering: A survey.ACM Com- puting Surveys, 57(8):1–36, 2025. 1, 2

work page 2025

[29] [29]

What matters when building vision-language mod- els?Advances in Neural Information Processing Systems, 37:87874–87907, 2024

Hugo Laurenc ¸on, L´eo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language mod- els?Advances in Neural Information Processing Systems, 37:87874–87907, 2024. 6

work page 2024

[30] [30]

Seed-bench: Bench- marking multimodal large language models

Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Bench- marking multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 13299–13308, 2024. 6

work page 2024

[31] [31]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[32] [32]

Multimodal reasoning for science: Technical report and 1st place solution to the icml 2025 seep- hys challenge.arXiv preprint arXiv:2509.06079, 2025

Hao Liang, Ruitao Wu, Bohan Zeng, Junbo Niu, Wentao Zhang, and Bin Dong. Multimodal reasoning for science: Technical report and 1st place solution to the icml 2025 seep- hys challenge.arXiv preprint arXiv:2509.06079, 2025. 7

work page arXiv 2025

[33] [33]

More thinking, less seeing? assessing amplified halluci- nation in multimodal reasoning models.arXiv preprint arXiv:2505.21523, 2025

Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, and Sheng Liu. More thinking, less seeing? assessing amplified halluci- nation in multimodal reasoning models.arXiv preprint arXiv:2505.21523, 2025. 8

work page arXiv 2025

[34] [34]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 2

work page 2023

[35] [35]

Unveiling the ignorance of mllms: See- ing clearly, answering incorrectly

Yexin Liu, Zhengyang Liang, Yueze Wang, Xianfeng Wu, Feilong Tang, Muyang He, Jian Li, Zheng Liu, Harry Yang, Sernam Lim, et al. Unveiling the ignorance of mllms: See- ing clearly, answering incorrectly. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 9087–9097, 2025. 1, 2

work page 2025

[36] [36]

Beyond the visible: Benchmarking occlusion perception in multimodal large lan- guage models.arXiv preprint arXiv:2508.04059, 2025

Zhaochen Liu, Kaiwen Gao, Shuyi Liang, Bin Xiao, Li- meng Qiao, Lin Ma, and Tingting Jiang. Beyond the visible: Benchmarking occlusion perception in multimodal large lan- guage models.arXiv preprint arXiv:2508.04059, 2025. 3, 4

work page arXiv 2025

[37] [37]

Towards interpreting visual infor- mation processing in vision-language models.arXiv preprint arXiv:2410.07149, 2024

Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez. Towards interpreting visual infor- mation processing in vision-language models.arXiv preprint arXiv:2410.07149, 2024. 7

work page arXiv 2024

[38] [38]

Gpt-5 system card, 2025

OpenAI. Gpt-5 system card, 2025. Accessed: 2025-08-13. 6, 3

work page 2025

[39] [39]

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl. arXiv preprint arXiv:2503.07536, 2025. 7

work page internal anchor Pith review Pith/arXiv arXiv 2025

[40] [40]

How easy is it to fool your multimodal llms? an empirical analysis on deceptive prompts.arXiv preprint arXiv:2402.13220, 2 (7), 2024

Yusu Qian, Haotian Zhang, Yinfei Yang, and Zhe Gan. How easy is it to fool your multimodal llms? an empirical analysis on deceptive prompts.arXiv preprint arXiv:2402.13220, 2 (7), 2024. 1, 2

work page arXiv 2024

[41] [41]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 5

work page 2021

[42] [42]

Seedream 4.0: Toward Next-generation Multimodal Image Generation

Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next- generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025. 4

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [43]

Illusionvqa: A challenging optical illu- sion dataset for vision language models.arXiv preprint arXiv:2403.15952, 2024

Haz Sameen Shahgir, Khondker Salman Sayeed, Abhik Bhattacharjee, Wasi Uddin Ahmad, Yue Dong, and Ri- fat Shahriyar. Illusionvqa: A challenging optical illu- sion dataset for vision language models.arXiv preprint arXiv:2403.15952, 2024. 3, 4

work page arXiv 2024

[44] [44]

Motilal Banarsidass Publishe, 2005

Arthur David Smith.The problem of perception. Motilal Banarsidass Publishe, 2005. 2

work page 2005

[45] [45]

Lvlm-interpret: an interpretability tool for large vision-language models.arXiv preprint arXiv:2404.03118, 2024

Gabriela Ben Melech Stan, Estelle Aflalo, Raanan Yehezkel Rohekar, Anahita Bhiwandiwalla, Shao-Yen Tseng, Matthew Lyle Olson, Yaniv Gurwicz, Chenfei Wu, Nan Duan, and Vasudev Lal. Lvlm-interpret: an interpretability tool for large vision-language models.arXiv preprint arXiv:2404.03118, 2024. 8

work page arXiv 2024

[46] [46]

Aligning large multimodal models with factually augmented rlhf

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu- Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. InFindings of the As- sociation for Computational Linguistics: ACL 2024, pages 13088–13110, 2024. 6

work page 2024

[47] [47]

Seeing far and clearly: Mitigating hallucinations in mllms with attention causal decoding

Feilong Tang, Chengzhi Liu, Zhongxing Xu, Ming Hu, Zile Huang, Haochen Xue, Ziyang Chen, Zelin Peng, Zhiwei Yang, Sijin Zhou, et al. Seeing far and clearly: Mitigating hallucinations in mllms with attention causal decoding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26147–26159, 2025. 8

work page 2025

[48] [48]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 2, 6, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023

[49] [49]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chen- zhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[50] [50]

More thought, less accuracy? on the dual nature of reasoning in vision-language models.arXiv preprint arXiv:2509.25848, 2025

Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, and Jing Zhang. More thought, less accuracy? on the dual nature of reasoning in vision-language models.arXiv preprint arXiv:2509.25848, 2025. 8

[51]

Eyes wide shut? exploring the visual shortcomings of multimodal llms

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024. 7

[52]

Som-1k: A thousand-problem benchmark dataset for strength of materials

Qixin Wan, Zilong Wang, Jingwen Zhou, Wanting Wang, Ziheng Geng, Jiachen Liu, Ran Cao, Minghui Cheng, and Lu Cheng. Som-1k: A thousand-problem benchmark dataset for strength of materials. arXiv preprint arXiv:2509.21079,

[53]

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

Fei Wang, Xingyu Fu, James Y Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, et al. Muirbench: A comprehensive benchmark for robust multi-image understanding. arXiv preprint arXiv:2406.09411, 2024. 1, 2

[54]

AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023. 1, 2

[55]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 3

[56]

Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning

Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, and Hongxia Yang. Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805, 2024. 1, 2

[57]

Emergent Abilities of Large Language Models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682,

[58]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. 7

[59]

A survey of safety on large vision-language models: Attacks, defenses and evaluations

Mang Ye, Xuankun Rong, Wenke Huang, Bo Du, Nenghai Yu, and Dacheng Tao. A survey of safety on large vision-language models: Attacks, defenses and evaluations. arXiv preprint arXiv:2502.14881, 2025. 6

[60]

A survey on multimodal large language models

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 11(12): nwae403, 2024. 1

[61]

Sail-vl2 technical report

Weijie Yin, Yongjie Ye, Fangxun Shu, Yue Liao, Zijian Kang, Hongyuan Dong, Haiyang Yu, Dingkang Yang, Jiacong Wang, Han Wang, et al. Sail-vl2 technical report. arXiv preprint arXiv:2509.14033, 2025. 2, 6, 3

[62]

Rlaif-v: Open-source ai feedback leads to super gpt-4v trustworthiness

Tianyu Yu, Haoye Zhang, Qiming Li, Qixin Xu, Yuan Yao, Da Chen, Xiaoman Lu, Ganqu Cui, Yunkai Dang, Taiwen He, et al. Rlaif-v: Open-source ai feedback leads to super gpt-4v trustworthiness. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19985–19995, 2025. 6

[63]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024. 1, 2

[64]

Mllms know where to look: Training-free perception of small visual details with multimodal llms

Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. Mllms know where to look: Training-free perception of small visual details with multimodal llms. arXiv preprint arXiv:2502.17422, 2025. 7

[65]

Illusionbench: A large-scale and comprehensive benchmark for visual illusion understanding in vision-language models

Yiming Zhang, Zicheng Zhang, Xinyi Wei, Xiaohong Liu, Guangtao Zhai, and Xiongkuo Min. Illusionbench: A large-scale and comprehensive benchmark for visual illusion understanding in vision-language models. arXiv preprint arXiv:2501.00848, 2025. 4

[66]

On evaluating adversarial robustness of large vision-language models

Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. Advances in Neural Information Processing Systems, 36:54111–54138, 2023. 2

[67]

Large language models are not robust multiple choice selectors

Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors. arXiv preprint arXiv:2309.03882, 2023. 6

[68]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 2

[69]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025. 6, 3

[70]

Layercake: Token-aware contrastive decoding within large language model layers

Jingze Zhu, Yongliang Wu, Wenbo Zhu, Jiawang Cao, Yanqiang Zheng, Jiawei Chen, Xu Yang, Bernt Schiele, Jonas Fischer, and Xinting Hu. Layercake: Token-aware contrastive decoding within large language model layers. arXiv preprint arXiv:2507.04404, 2025. 8
