pith. machine review for the scientific record.

arxiv: 2310.14566 · v5 · submitted 2023-10-23 · 💻 cs.CV · cs.CL

Recognition: 2 theorem links

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 01:17 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords HallusionBench · large vision-language models · hallucination · visual illusion · benchmark · image reasoning · GPT-4V · failure modes

The pith

HallusionBench shows that even GPT-4V reaches only 31.42 percent accuracy on paired questions designed to expose language hallucination and visual illusion in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HallusionBench, a benchmark of 346 images and 1129 expert-crafted questions, to test large vision-language models on image-context reasoning. It uses a novel control-group structure in the questions to measure response tendencies, logical consistency, and specific failure modes. When 15 models were evaluated, GPT-4V achieved the highest question-pair accuracy at 31.42 percent while every other model stayed below 16 percent. The work also provides case studies that identify patterns of language hallucination and visual illusion. These results indicate that current models have persistent difficulties maintaining consistency when visual information is involved.

Core claim

HallusionBench introduces a diagnostic suite with 346 images and 1129 questions that use a novel control-group structure to quantitatively analyze response tendencies, logical consistency, and failure modes such as language hallucination and visual illusion in large vision-language models, revealing that GPT-4V reaches 31.42% question-pair accuracy while others stay below 16%.

What carries the argument

The control-group question structure in HallusionBench, which pairs related questions to isolate and measure tendencies toward language hallucination and visual illusion.
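The pairing logic can be sketched directly: a pair counts only when every question in it is answered correctly. This is a hedged illustration, not the benchmark's actual scoring code, and the record schema (`pair_id`, `correct`) is hypothetical:

```python
# Hedged sketch of question-pair accuracy: a control pair scores only if
# the model answers every question in the pair correctly.
# Field names ("pair_id", "correct") are hypothetical, not HallusionBench's schema.
from collections import defaultdict

def question_pair_accuracy(results):
    """results: list of dicts with a 'pair_id' and a boolean 'correct'."""
    pairs = defaultdict(list)
    for r in results:
        pairs[r["pair_id"]].append(r["correct"])
    if not pairs:
        return 0.0
    solved = sum(1 for answers in pairs.values() if all(answers))
    return solved / len(pairs)

results = [
    {"pair_id": 0, "correct": True},  {"pair_id": 0, "correct": True},
    {"pair_id": 1, "correct": True},  {"pair_id": 1, "correct": False},
]
print(question_pair_accuracy(results))  # 0.5
```

The all-or-nothing scoring is what makes pair accuracy so much harsher than per-question accuracy: getting one half of a pair right earns no credit, which is how the metric surfaces inconsistency rather than raw ability.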

If this is right

  • LVLMs exhibit widespread problems with logical consistency when reasoning about images.
  • The paired-question format can systematically surface distinct failure modes for targeted diagnosis.
  • Insights from the case studies point to concrete directions for improving response reliability in future models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Control-group designs of this type could be adapted to evaluate consistency errors in text-only or other multimodal settings.
  • Persistent low scores may reflect gaps in training data that reward fluent but inconsistent answers.
  • Repeated application of the benchmark to successive model releases would track whether these specific errors are decreasing.

Load-bearing premise

Expert-crafted questions with the control-group structure can accurately isolate language hallucination and visual illusion without introducing biases in question design or answer evaluation.

What would settle it

A new vision-language model that scores above 70 percent question-pair accuracy on HallusionBench while retaining strong results on standard image benchmarks would show that the measured errors are not fundamental.

read the original abstract

We introduce HallusionBench, a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(Vision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts. We introduce a novel structure for these visual questions designed to establish control groups. This structure enables us to conduct a quantitative analysis of the models' response tendencies, logical consistency, and various failure modes. In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably, all other evaluated models achieve accuracy below 16%. Moreover, our analysis not only highlights the observed failure modes, including language hallucination and visual illusion, but also deepens an understanding of these pitfalls. Our comprehensive case studies within HallusionBench shed light on the challenges of hallucination and illusion in LVLMs. Based on these insights, we suggest potential pathways for their future improvement. The benchmark and codebase can be accessed at https://github.com/tianyi-lab/HallusionBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces HallusionBench, a benchmark with 346 images and 1129 human-expert-crafted questions for evaluating large vision-language models on image-context reasoning. It proposes a novel control-group question structure to enable quantitative analysis of response tendencies, logical consistency, and specific failure modes including language hallucination and visual illusion. The authors evaluate 15 models and report that GPT-4V achieves 31.42% question-pair accuracy while all other models score below 16%, accompanied by case studies and suggested improvement pathways. The benchmark and code are released publicly.

Significance. If the control-group structure validly isolates the targeted failure modes without confounding by general reasoning difficulty or question ambiguity, the benchmark would offer a useful diagnostic tool for understanding limitations in current LVLMs and guiding targeted improvements. The scale of the evaluation across 15 models and the public release of the benchmark and codebase are strengths that could support standardized testing in the field.

major comments (2)
  1. [Abstract] The claim that the novel control-group structure isolates 'entangled language hallucination and visual illusion' is load-bearing for the central diagnostic contribution, yet no validation details (e.g., inter-rater agreement on pair equivalence, an ablation removing one control dimension, or human baseline performance on the same pairs) are provided to confirm that pairs differ only in the targeted variable while holding visual complexity and linguistic subtlety fixed.
  2. [Evaluation] The headline finding that GPT-4V reaches 31.42% question-pair accuracy while all other models remain below 16% is presented as evidence of specific failure modes; however, without an explicit check that the performance gaps arise from the disentangled hallucination/illusion factors rather than overall task hardness or pair ambiguity, the attribution to the claimed entangled modes is under-supported.
minor comments (2)
  1. [Abstract] The abstract states '1129 questions' but does not clarify how many question pairs are used for the reported question-pair accuracy metric; adding this detail would improve reproducibility.
  2. [Results] Figure or table presenting per-model accuracies would benefit from explicit error bars or statistical significance tests to support the claim that all non-GPT-4V models fall below 16%.
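One standard way to supply the error bars this minor comment asks for is a Wilson score interval on each model's pair accuracy. A minimal sketch; the counts (50 correct of 400 pairs) are illustrative, not the paper's:

```python
# Hedged sketch: 95% Wilson score interval for a binomial proportion,
# applied to a hypothetical per-model pair-accuracy count.
import math

def wilson_interval(successes, n, z=1.96):
    """Return (lower, upper) bounds of the Wilson score interval."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# e.g. a model that solves 50 of 400 question pairs (hypothetical counts):
lo, hi = wilson_interval(50, 400)
print(round(lo, 3), round(hi, 3))
```

With an interval like this per model, the claim "all non-GPT-4V models fall below 16%" becomes checkable: it holds only if each model's upper bound, not just its point estimate, sits under the threshold.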

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments below and outline the revisions we will make to strengthen the validation of our proposed control-group structure.

read point-by-point responses
  1. Referee: [Abstract] The claim that the novel control-group structure isolates 'entangled language hallucination and visual illusion' is load-bearing for the central diagnostic contribution, yet no validation details (e.g., inter-rater agreement on pair equivalence, an ablation removing one control dimension, or human baseline performance on the same pairs) are provided to confirm that pairs differ only in the targeted variable while holding visual complexity and linguistic subtlety fixed.

    Authors: We appreciate the referee pointing out the need for more rigorous validation of the control-group design. The questions in HallusionBench were created by human experts specifically to hold visual and linguistic elements constant while varying the targeted hallucination or illusion factor. However, we acknowledge that quantitative validation metrics were not detailed in the initial submission. In the revised manuscript, we will add a dedicated section on benchmark construction that includes inter-rater agreement scores from multiple experts on pair equivalence, results from an ablation study where one control dimension is removed to measure its effect on failure detection, and human baseline accuracies on the question pairs. This will provide evidence that the pairs isolate the intended variables without confounding by general difficulty or ambiguity. revision: yes
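The inter-rater agreement promised here could be reported as Cohen's kappa over expert judgments of pair equivalence. A minimal sketch with hypothetical ratings from two experts (labels `"eq"`/`"not"` are invented for illustration):

```python
# Hedged sketch of inter-rater agreement on pair equivalence via Cohen's kappa.
# Ratings per question pair from two hypothetical experts.
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical ratings."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # chance agreement from each rater's marginal label frequencies
    expected = sum(
        (rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels
    )
    return (observed - expected) / (1 - expected)

a = ["eq", "eq", "not", "eq", "not", "eq"]
b = ["eq", "eq", "eq",  "eq", "not", "eq"]
print(round(cohens_kappa(a, b), 3))  # 0.571
```

Kappa corrects raw agreement for chance, which matters here because "equivalent" is likely the majority label and raw percent agreement would overstate consensus.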

  2. Referee: [Evaluation] The headline finding that GPT-4V reaches 31.42% question-pair accuracy while all other models remain below 16% is presented as evidence of specific failure modes; however, without an explicit check that the performance gaps arise from the disentangled hallucination/illusion factors rather than overall task hardness or pair ambiguity, the attribution to the claimed entangled modes is under-supported.

    Authors: We agree that the attribution of low performance to the specific entangled modes requires additional support beyond the raw accuracies. The control-group structure was designed to enable exactly this kind of analysis through paired comparisons that reveal inconsistencies attributable to hallucination or illusion. To address this, we will expand the evaluation section in the revision to include explicit checks, such as the rate of logical inconsistencies across control pairs and differential performance on hallucination-specific vs. general reasoning questions. We will also report human expert performance on the same pairs to show that the task is solvable when the failure modes are not present, thereby supporting that the gaps in model performance are due to the targeted factors rather than inherent hardness or ambiguity. revision: yes
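The inconsistency-rate check this response describes can be sketched under one assumption: within a claim/negation control pair, giving the same yes/no answer to both members is logically inconsistent. The function and data below are illustrative, not the paper's evaluation code:

```python
# Hedged sketch of a logical-inconsistency rate over control pairs.
# Assumption: each pair holds a claim and its negation, so identical
# yes/no answers to both members are contradictory.
def inconsistency_rate(pair_answers):
    """pair_answers: list of (answer_to_claim, answer_to_negation) tuples,
    each answer being 'yes' or 'no'."""
    if not pair_answers:
        return 0.0
    inconsistent = sum(1 for a, b in pair_answers if a == b)
    return inconsistent / len(pair_answers)

pairs = [("yes", "no"), ("yes", "yes"), ("no", "yes"), ("no", "no")]
print(inconsistency_rate(pairs))  # 0.5
```

A high rate on this check, combined with near-perfect human accuracy on the same pairs, is exactly the evidence pattern that would support attributing the gap to hallucination or illusion rather than task hardness.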

Circularity Check

0 steps flagged

Benchmark introduction and empirical evaluation contain no circular derivations

full rationale

The paper introduces HallusionBench as a human-expert-crafted set of 346 images and 1129 questions with a novel control-group structure, then reports direct empirical accuracies on 15 models (GPT-4V at 31.42% question-pair accuracy, others below 16%). No equations, fitted parameters, predictions, or first-principles derivations are presented that reduce to the benchmark inputs by construction. The control groups and failure-mode analysis are described as meticulously crafted observations rather than self-referential or load-bearing self-citations. The evaluation chain is therefore self-contained against external model testing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the validity of expert-crafted questions and control groups to isolate failure modes, plus the assumption that model responses can be reliably scored for hallucination versus illusion.

axioms (1)
  • domain assumption: Human experts can design visual questions with control groups that isolate language hallucination from visual illusion without introducing bias.
    The quantitative analysis of response tendencies and logical consistency depends on this premise.

pith-pipeline@v0.9.0 · 5592 in / 1237 out tokens · 46853 ms · 2026-05-17T01:17:20.225173+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  2. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  3. Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension

    cs.CV 2026-02 unverdicted novelty 7.0

    Visual Para-Thinker is the first parallel reasoning framework for MLLMs that uses visual partitioning strategies, Pa-Attention, and LPRoPE to extend test-time scaling benefits to visual comprehension tasks.

  4. Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

    cs.CV 2025-12 unverdicted novelty 7.0

    DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.

  5. R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

    cs.AI 2025-03 conditional novelty 7.0

    R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.

  6. MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

    cs.CV 2024-06 conditional novelty 7.0

    MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.

  7. Dual-Pathway Circuits of Object Hallucination in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.

  8. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  9. To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

    cs.CV 2026-03 unverdicted novelty 6.0

    69.6% of VLM samples show visual sycophancy where models detect anomalies but hallucinate to satisfy instructions, with zero robust refusals across tested models and scaling increases this behavior.

  10. Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents

    cs.AI 2025-12 unverdicted novelty 6.0

    Argos is an agentic verifier that adaptively picks scoring functions to evaluate accuracy, localization, and reasoning quality, enabling stronger multimodal RL training for AI agents.

  11. Agentic Learner with Grow-and-Refine Multimodal Semantic Memory

    cs.AI 2025-11 unverdicted novelty 6.0

    ViLoMem is a dual-stream grow-and-refine memory system that separates visual and logical error patterns in MLLMs to improve pass@1 accuracy and reduce repeated mistakes across six multimodal benchmarks.

  12. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  13. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  14. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  15. AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

    cs.CL 2023-11 unverdicted novelty 6.0

    AMBER is an LLM-free multi-dimensional benchmark for evaluating hallucinations in MLLMs across generative and discriminative tasks.

  16. Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    cs.CV 2023-06 accept novelty 6.0

    A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.

  17. Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...

  18. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  19. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  20. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 20 Pith papers · 22 internal anchors

  1. [1]

    Gpt-4v(ision) system card. 2023. 6, 7

  2. [2]

    nocaps: novel object captioning at scale

    Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Ste- fan Lee, and Peter Anderson. nocaps: novel object captioning at scale. International Conference on Computer Vision, pages 8947–8956, 2019. 3

  3. [3]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems , 35:23716–23736,

  4. [4]

    Vqa: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision , pages 2425– 2433, 2015. 1

  5. [5]

    Openflamingo, 2023

    Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bit- ton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo, 2023. 1

  6. [6]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. ArXiv, abs/2308.12966, 2023. 6, 7

  7. [7]

    Sparks of artifi- cial general intelligence: Early experiments with gpt-4, 2023

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Jo- hannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artifi- cial general intelligence: Early experiments with gpt-4, 2023. 1

  8. [8]

    Alpagasus: Training a better alpaca with fewer data

    Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gu- naratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, et al. Alpagasus: Training a better alpaca with fewer data. arXiv preprint arXiv:2307.08701, 2023. 1

  9. [9]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. 1

  10. [10]

    Holistic analysis of hal- lucination in gpt-4v(ision): Bias and interference challenges

    Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, and Huaxiu Yao. Holistic analysis of hal- lucination in gpt-4v(ision): Bias and interference challenges. ArXiv, abs/2311.03287, 2023. 4

  11. [11]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision- language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023. 2, 6, 7

  12. [12]

    Xia, Mehdi S

    Danny Driess, F. Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Ho Vuong, Tianhe Yu, Wenlong Huang, Yev- gen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Peter R. Flo- rence. Palm-e: An...

  13. [13]

    Lamm: Language- assisted multi-modal instruction-tuning dataset, framework, and benchmark

    Zhen fei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, Wanli Ouyang, and Jing Shao. Lamm: Language- assisted multi-modal instruction-tuning dataset, framework, and benchmark. ArXiv, abs/2306.06687, 2023. 3

  14. [14]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Meng- dan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. Mme: A compre- hensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 3, 4

  15. [15]

    Multimodal-gpt: A vision and language model for dialogue with humans, 2023

    Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans, 2023. 5

  16. [16]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. International Journal of Computer Vision , 127:398 – 414,

  17. [17]

    Loc-zson: Language-driven object-centric zero-shot object retrieval and navigation, 2023

    Tianrui Guan, Yurou Yang, Harry Cheng, Muyuan Lin, Richard Kim, Rajasimman Madhivanan, Arnie Sen, and Di- nesh Manocha. Loc-zson: Language-driven object-centric zero-shot object retrieval and navigation, 2023. 1

  18. [18]

    Detecting and preventing hallucinations in large vision language models

    Anish Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. ArXiv, abs/2308.06394, 2023. 4

  19. [19]

    A comprehensive survey of deep learning for image captioning

    MD Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CsUR), 51 (6):1–36, 2019. 1

  20. [20]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. ArXiv, abs/2307.16125,

  21. [21]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ArXiv, abs/2301.12597, 2023. 1, 6, 7

  22. [22]

    Scigraphqa: A large-scale synthetic multi-turn question-answering dataset for scientific graphs

    Sheng Li and Nima Tajbakhsh. Scigraphqa: A large-scale synthetic multi-turn question-answering dataset for scientific graphs. ArXiv, abs/2308.03349, 2023. 4

  23. [23]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji rong Wen. Evaluating object hallucination in large vision-language models. ArXiv, abs/2305.10355, 2023. 1, 3, 4, 5

  24. [24]

    Stablellava: Enhanced visual instruction tuning with synthe- sized image-dialogue data

    Yanda Li, Chi Zhang, Gang Yu, Zhibin Wang, Bin Fu, Gu- osheng Lin, Chunhua Shen, Ling Chen, and Yunchao Wei. Stablellava: Enhanced visual instruction tuning with synthe- sized image-dialogue data. ArXiv, abs/2308.10253, 2023. 2

  25. [25]

    To- wards understanding in-context learning with contrastive demonstrations and saliency maps

    Zongxia Li, Paiheng Xu, Fuxiao Liu, and Hyemi Song. To- wards understanding in-context learning with contrastive demonstrations and saliency maps. arXiv preprint arXiv:2307.05052, 2023. 1

  26. [26]

    Module- wise adaptive distillation for multimodality foundation mod- els

    Chen Liang, Jiahui Yu, Ming-Hsuan Yang, Matthew Brown, Yin Cui, Tuo Zhao, Boqing Gong, and Tianyi Zhou. Module- wise adaptive distillation for multimodality foundation mod- els. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. 1

  27. [27]

    Visual news: Benchmark and challenges in news image captioning

    Fuxiao Liu, Yinghan Wang, Tianlu Wang, and Vicente Or- donez. Visual news: Benchmark and challenges in news image captioning. arXiv preprint arXiv:2010.03743, 2020. 1, 3

  28. [28]

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023. 1, 2, 3, 4, 6, 7

  29. [29]

    Documentclip: Linking figures and main body text in reflowed documents

    Fuxiao Liu, Hao Tan, and Chris Tensmeyer. Documentclip: Linking figures and main body text in reflowed documents. arXiv preprint arXiv:2306.06306, 2023. 1

  30. [30]

    Covid- vts: Fact extraction and verification on short video platforms

    Fuxiao Liu, Yaser Yacoob, and Abhinav Shrivastava. Covid- vts: Fact extraction and verification on short video platforms. In Proceedings of the 17th Conference of the European Chap- ter of the Association for Computational Linguistics, pages 178–188, 2023. 1

  31. [31]

    Improved baselines with visual instruction tuning, 2023

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023. 1, 5, 6, 7

  32. [32]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485,

  33. [33]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuanzhan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Con- ghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player? ArXiv, abs/2307.06281, 2023. 3

  34. [34]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chun yue Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating math rea- soning in visual contexts with gpt-4v, bard, and other large multimodal models. ArXiv, abs/2310.02255, 2023. 4

  35. [35]

    ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022. 1

  36. [36]

    Instruction Tuning with GPT-4

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023. 2

  37. [37]

    Scienceqa: A novel resource for question answering on scholarly articles

    Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal, and Pushpak Bhattacharyya. Scienceqa: A novel resource for question answering on scholarly articles. International Journal on Digital Libraries, 23(3):289–301, 2022. 1

  38. [38]

    Claude 3, 2024

    Anthropic Team. Claude 3, 2024. 6, 7

  39. [39]

    Gemini: A family of highly capable multi- modal models, 2023

    Gemini Team. Gemini: A family of highly capable multi- modal models, 2023. 6, 7

  40. [40]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Mar- tinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Roz- ière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 1

  41. [41]

    Trans- form and tell: Entity-aware news image captioning

    Alasdair Tran, Alexander Mathews, and Lexing Xie. Trans- form and tell: Entity-aware news image captioning. In Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13035–13045, 2020. 1

  42. [42]

    Show and tell: Lessons learned from the 2015 mscoco image captioning challenge

    Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE transactions on pattern analysis and machine intelligence, 39(4):652–663, 2016. 1

  43. [43]

    Vigc: Visual instruction generation and correction

    Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaowen Dong, Weijia Li, Wei Li, Jiaqi Wang, and Conghui He. Vigc: Visual instruction generation and correction. ArXiv, abs/2308.12714, 2023. 3

  44. [44]

    GIT: A Generative Image-to-text Transformer for Vision and Language

    Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. ArXiv, abs/2205.14100, 2022. 6, 7

  45. [45]

    Larger language models do in-context learning differently

    Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, et al. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846, 2023. 1

  46. [46]

    Large language models can be good privacy protection learners

    Yijia Xiao, Yiqiao Jin, Yushi Bai, Yue Wu, Xianjun Yang, Xiao Luo, Wenchao Yu, Xujiang Zhao, Yanchi Liu, Haifeng Chen, Wei Wang, and Wei Cheng. Large language models can be good privacy protection learners. 2023. 1

  47. [47]

    Embodied multi-modal agent trained by an llm from a parallel textworld

    Yijun Yang, Tianyi Zhou, Kanxue Li, Dapeng Tao, Lusong Li, Li Shen, Xiaodong He, Jing Jiang, and Yuhui Shi. Embodied multi-modal agent trained by an llm from a parallel textworld.

  48. [48]

    The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023

    Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023. 1, 2, 5, 9

  49. [49]

    MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023. 1

  50. [50]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023. 1, 2, 5, 6, 7

  51. [51]

    mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023

    Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023. 6, 7

  52. [52]

    A Survey on Multimodal Large Language Models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023. 2

  53. [53]

    Woodpecker: Hallucination correction for multimodal large language models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. ArXiv, abs/2310.16045, 2023. 3

  54. [54]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. ArXiv, abs/2308.02490, 2023. 3

  55. [55]

    Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

    Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022. 1

  56. [56]

    What matters in training a gpt4-style language model with multimodal inputs?

    Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, and Tao Kong. What matters in training a gpt4-style language model with multimodal inputs? arXiv preprint arXiv:2307.02469, 2023. 4

  57. [57]

    Investigating the catastrophic forgetting in multimodal large language models

    Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, and Yi Ma. Investigating the catastrophic forgetting in multimodal large language models. arXiv preprint arXiv:2309.10313, 2023. 1

  58. [58]

    Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

    Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. Siren's song in the ai ocean: A survey on hallucination in large language models. ArXiv, abs/2309.01219, 2023.

  59. [59]

    Grounding visual illusions in language: Do vision-language models perceive illusions like humans?

    Yichi Zhang, Jiayi Pan, Yuchen Zhou, Rui Pan, and Joyce Chai. Grounding visual illusions in language: Do vision-language models perceive illusions like humans? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. 3

  60. [60]

    Llavar: Enhanced visual instruction tuning for text-rich image understanding

    Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tongfei Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding. ArXiv, abs/2306.17107, 2023. 2

  61. [61]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023. 1

  62. [62]

    Minigpt-5: Interleaved vision-and-language generation via generative vokens

    Kaizhi Zheng, Xuehai He, and Xin Eric Wang. Minigpt-5: Interleaved vision-and-language generation via generative vokens. ArXiv, abs/2310.02239, 2023. 6, 7

  63. [63]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 1, 2, 6, 7