pith. machine review for the scientific record.

arxiv: 2310.14566 · v5 · submitted 2023-10-23 · 💻 cs.CV · cs.CL

Recognition: 2 theorem links

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 01:17 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords HallusionBench · large vision-language models · hallucination · visual illusion · benchmark · image reasoning · GPT-4V · failure modes

The pith

HallusionBench shows that even GPT-4V reaches only 31.42 percent accuracy on paired questions designed to expose language hallucination and visual illusion in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HallusionBench, a benchmark of 346 images and 1129 expert-crafted questions, to test large vision-language models on image-context reasoning. It uses a novel control-group structure in the questions to measure response tendencies, logical consistency, and specific failure modes. When 15 models were evaluated, GPT-4V achieved the highest question-pair accuracy at 31.42 percent while every other model stayed below 16 percent. The work also provides case studies that identify patterns of language hallucination and visual illusion. These results indicate that current models have persistent difficulties maintaining consistency when visual information is involved.

Core claim

HallusionBench introduces a diagnostic suite with 346 images and 1129 questions that use a novel control-group structure to quantitatively analyze response tendencies, logical consistency, and failure modes such as language hallucination and visual illusion in large vision-language models, revealing that GPT-4V reaches 31.42% question-pair accuracy while others stay below 16%.

What carries the argument

The control-group question structure in HallusionBench, which pairs related questions to isolate and measure tendencies toward language hallucination and visual illusion.
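The pairing logic can be sketched directly: a pair counts only when every question in it is answered correctly. This is a hedged illustration, not the benchmark's actual scoring code, and the record schema (`pair_id`, `correct`) is hypothetical:

```python
# Hedged sketch of question-pair accuracy: a control pair scores only if
# the model answers every question in the pair correctly.
# Field names ("pair_id", "correct") are hypothetical, not HallusionBench's schema.
from collections import defaultdict

def question_pair_accuracy(results):
    """results: list of dicts with a 'pair_id' and a boolean 'correct'."""
    pairs = defaultdict(list)
    for r in results:
        pairs[r["pair_id"]].append(r["correct"])
    if not pairs:
        return 0.0
    solved = sum(1 for answers in pairs.values() if all(answers))
    return solved / len(pairs)

results = [
    {"pair_id": 0, "correct": True},  {"pair_id": 0, "correct": True},
    {"pair_id": 1, "correct": True},  {"pair_id": 1, "correct": False},
]
print(question_pair_accuracy(results))  # 0.5
```

The all-or-nothing scoring is what makes pair accuracy so much harsher than per-question accuracy: getting one half of a pair right earns no credit, which is how the metric surfaces inconsistency rather than raw ability.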

If this is right

  • LVLMs exhibit widespread problems with logical consistency when reasoning about images.
  • The paired-question format can systematically surface distinct failure modes for targeted diagnosis.
  • Insights from the case studies point to concrete directions for improving response reliability in future models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Control-group designs of this type could be adapted to evaluate consistency errors in text-only or other multimodal settings.
  • Persistent low scores may reflect gaps in training data that reward fluent but inconsistent answers.
  • Repeated application of the benchmark to successive model releases would track whether these specific errors are decreasing.

Load-bearing premise

Expert-crafted questions with the control-group structure can accurately isolate language hallucination and visual illusion without introducing biases in question design or answer evaluation.

What would settle it

A new vision-language model that scores above 70 percent question-pair accuracy on HallusionBench while retaining strong results on standard image benchmarks would show that the measured errors are not fundamental.

read the original abstract

We introduce HallusionBench, a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(Vision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts. We introduce a novel structure for these visual questions designed to establish control groups. This structure enables us to conduct a quantitative analysis of the models' response tendencies, logical consistency, and various failure modes. In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably, all other evaluated models achieve accuracy below 16%. Moreover, our analysis not only highlights the observed failure modes, including language hallucination and visual illusion, but also deepens an understanding of these pitfalls. Our comprehensive case studies within HallusionBench shed light on the challenges of hallucination and illusion in LVLMs. Based on these insights, we suggest potential pathways for their future improvement. The benchmark and codebase can be accessed at https://github.com/tianyi-lab/HallusionBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces HallusionBench, a benchmark with 346 images and 1129 human-expert-crafted questions for evaluating large vision-language models on image-context reasoning. It proposes a novel control-group question structure to enable quantitative analysis of response tendencies, logical consistency, and specific failure modes including language hallucination and visual illusion. The authors evaluate 15 models and report that GPT-4V achieves 31.42% question-pair accuracy while all other models score below 16%, accompanied by case studies and suggested improvement pathways. The benchmark and code are released publicly.

Significance. If the control-group structure validly isolates the targeted failure modes without confounding by general reasoning difficulty or question ambiguity, the benchmark would offer a useful diagnostic tool for understanding limitations in current LVLMs and guiding targeted improvements. The scale of the evaluation across 15 models and the public release of the benchmark and codebase are strengths that could support standardized testing in the field.

major comments (2)
  1. [Abstract] The claim that the novel control-group structure isolates 'entangled language hallucination and visual illusion' is load-bearing for the central diagnostic contribution, yet no validation details (e.g., inter-rater agreement on pair equivalence, an ablation removing one control dimension, or human baseline performance on the same pairs) are provided to confirm that pairs differ only in the targeted variable while holding visual complexity and linguistic subtlety fixed.
  2. [Evaluation] The headline finding that GPT-4V reaches 31.42% question-pair accuracy while all other models remain below 16% is presented as evidence of specific failure modes; however, without an explicit check that the performance gaps arise from the disentangled hallucination/illusion factors rather than overall task hardness or pair ambiguity, the attribution to the claimed entangled modes is under-supported.
minor comments (2)
  1. [Abstract] The abstract states '1129 questions' but does not clarify how many question pairs are used for the reported question-pair accuracy metric; adding this detail would improve reproducibility.
  2. [Results] Figure or table presenting per-model accuracies would benefit from explicit error bars or statistical significance tests to support the claim that all non-GPT-4V models fall below 16%.
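One standard way to supply the error bars this minor comment asks for is a Wilson score interval on each model's pair accuracy. A minimal sketch; the counts (50 correct of 400 pairs) are illustrative, not the paper's:

```python
# Hedged sketch: 95% Wilson score interval for a binomial proportion,
# applied to a hypothetical per-model pair-accuracy count.
import math

def wilson_interval(successes, n, z=1.96):
    """Return (lower, upper) bounds of the Wilson score interval."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# e.g. a model that solves 50 of 400 question pairs (hypothetical counts):
lo, hi = wilson_interval(50, 400)
print(round(lo, 3), round(hi, 3))
```

With an interval like this per model, the claim "all non-GPT-4V models fall below 16%" becomes checkable: it holds only if each model's upper bound, not just its point estimate, sits under the threshold.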

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments below and outline the revisions we will make to strengthen the validation of our proposed control-group structure.

read point-by-point responses
  1. Referee: [Abstract] The claim that the novel control-group structure isolates 'entangled language hallucination and visual illusion' is load-bearing for the central diagnostic contribution, yet no validation details (e.g., inter-rater agreement on pair equivalence, an ablation removing one control dimension, or human baseline performance on the same pairs) are provided to confirm that pairs differ only in the targeted variable while holding visual complexity and linguistic subtlety fixed.

    Authors: We appreciate the referee pointing out the need for more rigorous validation of the control-group design. The questions in HallusionBench were created by human experts specifically to hold visual and linguistic elements constant while varying the targeted hallucination or illusion factor. However, we acknowledge that quantitative validation metrics were not detailed in the initial submission. In the revised manuscript, we will add a dedicated section on benchmark construction that includes inter-rater agreement scores from multiple experts on pair equivalence, results from an ablation study where one control dimension is removed to measure its effect on failure detection, and human baseline accuracies on the question pairs. This will provide evidence that the pairs isolate the intended variables without confounding by general difficulty or ambiguity. revision: yes
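The inter-rater agreement promised here could be reported as Cohen's kappa over expert judgments of pair equivalence. A minimal sketch with hypothetical ratings from two experts (labels `"eq"`/`"not"` are invented for illustration):

```python
# Hedged sketch of inter-rater agreement on pair equivalence via Cohen's kappa.
# Ratings per question pair from two hypothetical experts.
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical ratings."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # chance agreement from each rater's marginal label frequencies
    expected = sum(
        (rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels
    )
    return (observed - expected) / (1 - expected)

a = ["eq", "eq", "not", "eq", "not", "eq"]
b = ["eq", "eq", "eq",  "eq", "not", "eq"]
print(round(cohens_kappa(a, b), 3))  # 0.571
```

Kappa corrects raw agreement for chance, which matters here because "equivalent" is likely the majority label and raw percent agreement would overstate consensus.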

  2. Referee: [Evaluation] The headline finding that GPT-4V reaches 31.42% question-pair accuracy while all other models remain below 16% is presented as evidence of specific failure modes; however, without an explicit check that the performance gaps arise from the disentangled hallucination/illusion factors rather than overall task hardness or pair ambiguity, the attribution to the claimed entangled modes is under-supported.

    Authors: We agree that the attribution of low performance to the specific entangled modes requires additional support beyond the raw accuracies. The control-group structure was designed to enable exactly this kind of analysis through paired comparisons that reveal inconsistencies attributable to hallucination or illusion. To address this, we will expand the evaluation section in the revision to include explicit checks, such as the rate of logical inconsistencies across control pairs and differential performance on hallucination-specific vs. general reasoning questions. We will also report human expert performance on the same pairs to show that the task is solvable when the failure modes are not present, thereby supporting that the gaps in model performance are due to the targeted factors rather than inherent hardness or ambiguity. revision: yes
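The inconsistency-rate check this response describes can be sketched under one assumption: within a claim/negation control pair, giving the same yes/no answer to both members is logically inconsistent. The function and data below are illustrative, not the paper's evaluation code:

```python
# Hedged sketch of a logical-inconsistency rate over control pairs.
# Assumption: each pair holds a claim and its negation, so identical
# yes/no answers to both members are contradictory.
def inconsistency_rate(pair_answers):
    """pair_answers: list of (answer_to_claim, answer_to_negation) tuples,
    each answer being 'yes' or 'no'."""
    if not pair_answers:
        return 0.0
    inconsistent = sum(1 for a, b in pair_answers if a == b)
    return inconsistent / len(pair_answers)

pairs = [("yes", "no"), ("yes", "yes"), ("no", "yes"), ("no", "no")]
print(inconsistency_rate(pairs))  # 0.5
```

A high rate on this check, combined with near-perfect human accuracy on the same pairs, is exactly the evidence pattern that would support attributing the gap to hallucination or illusion rather than task hardness.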

Circularity Check

0 steps flagged

Benchmark introduction and empirical evaluation contain no circular derivations

full rationale

The paper introduces HallusionBench as a human-expert-crafted set of 346 images and 1129 questions with a novel control-group structure, then reports direct empirical accuracies on 15 models (GPT-4V at 31.42% question-pair accuracy, others below 16%). No equations, fitted parameters, predictions, or first-principles derivations are presented that reduce to the benchmark inputs by construction. The control groups and failure-mode analysis are described as meticulously crafted observations rather than self-referential or load-bearing self-citations. The evaluation chain is therefore self-contained against external model testing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the validity of expert-crafted questions and control groups to isolate failure modes, plus the assumption that model responses can be reliably scored for hallucination versus illusion.

axioms (1)
  • domain assumption: Human experts can design visual questions with control groups that isolate language hallucination from visual illusion without introducing bias.
    The quantitative analysis of response tendencies and logical consistency depends on this premise.

pith-pipeline@v0.9.0 · 5592 in / 1237 out tokens · 46853 ms · 2026-05-17T01:17:20.225173+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  2. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  3. Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension

    cs.CV 2026-02 unverdicted novelty 7.0

    Visual Para-Thinker is the first parallel reasoning framework for MLLMs that uses visual partitioning strategies, Pa-Attention, and LPRoPE to extend test-time scaling benefits to visual comprehension tasks.

  4. Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

    cs.CV 2025-12 unverdicted novelty 7.0

    DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.

  5. R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

    cs.AI 2025-03 conditional novelty 7.0

    R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.

  6. MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

    cs.CV 2024-06 conditional novelty 7.0

    MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.

  7. Dual-Pathway Circuits of Object Hallucination in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.

  8. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  9. To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

    cs.CV 2026-03 unverdicted novelty 6.0

    69.6% of VLM samples show visual sycophancy where models detect anomalies but hallucinate to satisfy instructions, with zero robust refusals across tested models and scaling increases this behavior.

  10. Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents

    cs.AI 2025-12 unverdicted novelty 6.0

    Argos is an agentic verifier that adaptively picks scoring functions to evaluate accuracy, localization, and reasoning quality, enabling stronger multimodal RL training for AI agents.

  11. Agentic Learner with Grow-and-Refine Multimodal Semantic Memory

    cs.AI 2025-11 unverdicted novelty 6.0

    ViLoMem is a dual-stream grow-and-refine memory system that separates visual and logical error patterns in MLLMs to improve pass@1 accuracy and reduce repeated mistakes across six multimodal benchmarks.

  12. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  13. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  14. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  15. AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

    cs.CL 2023-11 unverdicted novelty 6.0

    AMBER is an LLM-free multi-dimensional benchmark for evaluating hallucinations in MLLMs across generative and discriminative tasks.

  16. Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    cs.CV 2023-06 accept novelty 6.0

    A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.

  17. Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...

  18. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  19. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  20. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 20 Pith papers · 22 internal anchors

  1. [1]

    Gpt-4v(ision) system card. 2023. 6, 7

  2. [2]

    nocaps: novel object captioning at scale

    Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Ste- fan Lee, and Peter Anderson. nocaps: novel object captioning at scale. International Conference on Computer Vision, pages 8947–8956, 2019. 3

  3. [3]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems , 35:23716–23736,

  4. [4]

    Vqa: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision , pages 2425– 2433, 2015. 1

  5. [5]

    Openflamingo, 2023

    Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bit- ton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo, 2023. 1

  6. [6]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. ArXiv, abs/2308.12966, 2023. 6, 7

  7. [7]

    Sparks of artifi- cial general intelligence: Early experiments with gpt-4, 2023

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Jo- hannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artifi- cial general intelligence: Early experiments with gpt-4, 2023. 1

  8. [8]

    Alpagasus: Training a better alpaca with fewer data

    Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gu- naratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, et al. Alpagasus: Training a better alpaca with fewer data. arXiv preprint arXiv:2307.08701, 2023. 1

  9. [9]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. 1

  10. [10]

    Holistic analysis of hal- lucination in gpt-4v(ision): Bias and interference challenges

    Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, and Huaxiu Yao. Holistic analysis of hal- lucination in gpt-4v(ision): Bias and interference challenges. ArXiv, abs/2311.03287, 2023. 4

  11. [11]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision- language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023. 2, 6, 7

  12. [12]

    Xia, Mehdi S

    Danny Driess, F. Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Ho Vuong, Tianhe Yu, Wenlong Huang, Yev- gen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Peter R. Flo- rence. Palm-e: An...

  13. [13]

    Lamm: Language- assisted multi-modal instruction-tuning dataset, framework, and benchmark

    Zhen fei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, Wanli Ouyang, and Jing Shao. Lamm: Language- assisted multi-modal instruction-tuning dataset, framework, and benchmark. ArXiv, abs/2306.06687, 2023. 3

  14. [14]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Meng- dan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. Mme: A compre- hensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 3, 4

  15. [15]

    Multimodal-gpt: A vision and language model for dialogue with humans, 2023

    Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans, 2023. 5

  16. [16]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. International Journal of Computer Vision , 127:398 – 414,

  17. [17]

    Loc-zson: Language-driven object-centric zero-shot object retrieval and navigation, 2023

    Tianrui Guan, Yurou Yang, Harry Cheng, Muyuan Lin, Richard Kim, Rajasimman Madhivanan, Arnie Sen, and Di- nesh Manocha. Loc-zson: Language-driven object-centric zero-shot object retrieval and navigation, 2023. 1

  18. [18]

    Detecting and preventing hallucinations in large vision language models

    Anish Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. ArXiv, abs/2308.06394, 2023. 4

  19. [19]

    A comprehensive survey of deep learning for image captioning

    MD Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CsUR), 51 (6):1–36, 2019. 1

  20. [20]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. ArXiv, abs/2307.16125,

  21. [21]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ArXiv, abs/2301.12597, 2023. 1, 6, 7

  22. [22]

    Scigraphqa: A large-scale synthetic multi-turn question-answering dataset for scientific graphs

    Sheng Li and Nima Tajbakhsh. Scigraphqa: A large-scale synthetic multi-turn question-answering dataset for scientific graphs. ArXiv, abs/2308.03349, 2023. 4

  23. [23]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji rong Wen. Evaluating object hallucination in large vision-language models. ArXiv, abs/2305.10355, 2023. 1, 3, 4, 5

  24. [24]

    Stablellava: Enhanced visual instruction tuning with synthe- sized image-dialogue data

    Yanda Li, Chi Zhang, Gang Yu, Zhibin Wang, Bin Fu, Gu- osheng Lin, Chunhua Shen, Ling Chen, and Yunchao Wei. Stablellava: Enhanced visual instruction tuning with synthe- sized image-dialogue data. ArXiv, abs/2308.10253, 2023. 2

  25. [25]

    To- wards understanding in-context learning with contrastive demonstrations and saliency maps

    Zongxia Li, Paiheng Xu, Fuxiao Liu, and Hyemi Song. To- wards understanding in-context learning with contrastive demonstrations and saliency maps. arXiv preprint arXiv:2307.05052, 2023. 1

  26. [26]

    Module- wise adaptive distillation for multimodality foundation mod- els

    Chen Liang, Jiahui Yu, Ming-Hsuan Yang, Matthew Brown, Yin Cui, Tuo Zhao, Boqing Gong, and Tianyi Zhou. Module- wise adaptive distillation for multimodality foundation mod- els. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. 1

  27. [27]

    Visual news: Benchmark and challenges in news image captioning

    Fuxiao Liu, Yinghan Wang, Tianlu Wang, and Vicente Or- donez. Visual news: Benchmark and challenges in news image captioning. arXiv preprint arXiv:2010.03743, 2020. 1, 3

  28. [28]

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023. 1, 2, 3, 4, 6, 7

  29. [29]

    Documentclip: Linking figures and main body text in reflowed documents

    Fuxiao Liu, Hao Tan, and Chris Tensmeyer. Documentclip: Linking figures and main body text in reflowed documents. arXiv preprint arXiv:2306.06306, 2023. 1

  30. [30]

    Covid- vts: Fact extraction and verification on short video platforms

    Fuxiao Liu, Yaser Yacoob, and Abhinav Shrivastava. Covid- vts: Fact extraction and verification on short video platforms. In Proceedings of the 17th Conference of the European Chap- ter of the Association for Computational Linguistics, pages 178–188, 2023. 1

  31. [31]

    Improved baselines with visual instruction tuning, 2023

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023. 1, 5, 6, 7

  32. [32]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485,

  33. [33]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuanzhan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Con- ghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player? ArXiv, abs/2307.06281, 2023. 3

  34. [34]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chun yue Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating math rea- soning in visual contexts with gpt-4v, bard, and other large multimodal models. ArXiv, abs/2310.02255, 2023. 4

  35. [35]

    ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022. 1

  36. [36]

    Instruction Tuning with GPT-4

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023. 2

  37. [37]

    Scienceqa: A novel resource for question answering on scholarly articles

    Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal, and Pushpak Bhattacharyya. Scienceqa: A novel resource for question answering on scholarly articles. International Journal on Digital Libraries, 23(3):289–301, 2022. 1

  38. [38]

    Claude 3, 2024

    Anthropic Team. Claude 3, 2024. 6, 7

  39. [39]

    Gemini: A family of highly capable multi- modal models, 2023

    Gemini Team. Gemini: A family of highly capable multi- modal models, 2023. 6, 7

  40. [40]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Mar- tinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Roz- ière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 1

  41. [41]

    Trans- form and tell: Entity-aware news image captioning

    Alasdair Tran, Alexander Mathews, and Lexing Xie. Trans- form and tell: Entity-aware news image captioning. In Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13035–13045, 2020. 1

  42. [42]

    Show and tell: Lessons learned from the 2015 mscoco image captioning challenge

    Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE transactions on pattern analysis and machine intelligence, 39(4):652–663, 2016. 1

  43. [43]

    Vigc: Visual instruction generation and correction

    Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaowen Dong, Weijia Li, Wei Li, Jiaqi Wang, and Conghui He. Vigc: Visual instruction generation and correction. ArXiv, abs/2308.12714, 2023. 3

  44. [44]

    GIT: A Generative Image-to-text Transformer for Vision and Language

    Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. ArXiv, abs/2205.14100, 2022. 6, 7

  45. [45]

    Larger language models do in-context learning differently

    Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, et al. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846, 2023. 1

  46. [46]

    Large language models can be good privacy protection learners

    Yijia Xiao, Yiqiao Jin, Yushi Bai, Yue Wu, Xianjun Yang, Xiao Luo, Wenchao Yu, Xujiang Zhao, Yanchi Liu, Haifeng Chen, Wei Wang, and Wei Cheng. Large language models can be good privacy protection learners. 2023. 1

  47. [47]

    Embodied multi-modal agent trained by an llm from a parallel textworld

    Yijun Yang, Tianyi Zhou, Kanxue Li, Dapeng Tao, Lusong Li, Li Shen, Xiaodong He, Jing Jiang, and Yuhui Shi. Embodied multi-modal agent trained by an llm from a parallel textworld.

  48. [48]

    The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023

    Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023. 1, 2, 5, 9

  49. [49]

    MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023. 1

  50. [50]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023. 1, 2, 5, 6, 7

  51. [51]

    mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023

    Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023. 6, 7

  52. [52]

    A Survey on Multimodal Large Language Models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023. 2

  53. [53]

    Woodpecker: Hallucination correction for multimodal large language models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. ArXiv, abs/2310.16045, 2023. 3

  54. [54]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. ArXiv, abs/2308.02490, 2023. 3

  55. [55]

    Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

    Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022. 1

  56. [56]

    What matters in training a gpt4-style language model with multimodal inputs?

    Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, and Tao Kong. What matters in training a gpt4-style language model with multimodal inputs? arXiv preprint arXiv:2307.02469, 2023. 4

  57. [57]

    Investigating the catastrophic forgetting in multimodal large language models

    Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, and Yi Ma. Investigating the catastrophic forgetting in multimodal large language models. arXiv preprint arXiv:2309.10313, 2023. 1

  58. [58]

    Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

    Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. Siren's song in the ai ocean: A survey on hallucination in large language models. ArXiv, abs/2309.01219, 2023.

  59. [59]

    Grounding visual illusions in language: Do vision-language models perceive illusions like humans?

    Yichi Zhang, Jiayi Pan, Yuchen Zhou, Rui Pan, and Joyce Chai. Grounding visual illusions in language: Do vision-language models perceive illusions like humans? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. 3

  60. [60]

    Llavar: Enhanced visual instruction tuning for text-rich image understanding

    Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tongfei Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding. ArXiv, abs/2306.17107, 2023. 2

  61. [61]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023. 1

  62. [62]

    Minigpt-5: Interleaved vision-and-language generation via generative vokens

    Kaizhi Zheng, Xuehai He, and Xin Eric Wang. Minigpt-5: Interleaved vision-and-language generation via generative vokens. ArXiv, abs/2310.02239, 2023. 6, 7

  63. [63]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 1, 2, 6, 7