Recognition: 2 theorem links · Lean theorem
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
Pith reviewed 2026-05-17 01:17 UTC · model grok-4.3
The pith
HallusionBench shows that even GPT-4V reaches only 31.42% question-pair accuracy on paired questions designed to expose language hallucination and visual illusion in vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HallusionBench introduces a diagnostic suite with 346 images and 1129 questions that use a novel control-group structure to quantitatively analyze response tendencies, logical consistency, and failure modes such as language hallucination and visual illusion in large vision-language models, revealing that GPT-4V reaches 31.42% question-pair accuracy while others stay below 16%.
What carries the argument
The control-group question structure in HallusionBench, which pairs related questions to isolate and measure tendencies toward language hallucination and visual illusion.
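To make the pairing concrete, here is a minimal sketch of how a question-pair accuracy of this kind can be computed. The record format (pair_id, is_correct) and the crediting rule (a pair counts only when every question in it is answered correctly) are assumptions for illustration, not the released HallusionBench scorer.

```python
from collections import defaultdict

def question_pair_accuracy(results):
    """results: iterable of (pair_id, is_correct) tuples, one per question.

    A pair (control group) is credited only if every question in it is
    answered correctly; the metric is the fraction of credited pairs.
    Sketch of the metric's shape, not the official HallusionBench scorer.
    """
    pairs = defaultdict(list)
    for pair_id, is_correct in results:
        pairs[pair_id].append(bool(is_correct))
    if not pairs:
        return 0.0
    solved = sum(1 for answers in pairs.values() if all(answers))
    return solved / len(pairs)

# Hypothetical example: two pairs, only the first answered fully correctly.
demo = [("p1", True), ("p1", True), ("p2", True), ("p2", False)]
print(question_pair_accuracy(demo))  # 0.5
```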
If this is right
- LVLMs exhibit widespread problems with logical consistency when reasoning about images.
- The paired-question format can systematically surface distinct failure modes for targeted diagnosis.
- Insights from the case studies point to concrete directions for improving response reliability in future models.
Where Pith is reading between the lines
- Control-group designs of this type could be adapted to evaluate consistency errors in text-only or other multimodal settings.
- Persistent low scores may reflect gaps in training data that reward fluent but inconsistent answers.
- Repeated application of the benchmark to successive model releases would track whether these specific errors are decreasing.
Load-bearing premise
Expert-crafted questions with the control-group structure can accurately isolate language hallucination and visual illusion without introducing biases in question design or answer evaluation.
What would settle it
A new vision-language model that scores above 70 percent question-pair accuracy on HallusionBench while retaining strong results on standard image benchmarks would show that the measured errors are not fundamental.
Original abstract
We introduce HallusionBench, a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(Vision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts. We introduce a novel structure for these visual questions designed to establish control groups. This structure enables us to conduct a quantitative analysis of the models' response tendencies, logical consistency, and various failure modes. In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably, all other evaluated models achieve accuracy below 16%. Moreover, our analysis not only highlights the observed failure modes, including language hallucination and visual illusion, but also deepens an understanding of these pitfalls. Our comprehensive case studies within HallusionBench shed light on the challenges of hallucination and illusion in LVLMs. Based on these insights, we suggest potential pathways for their future improvement. The benchmark and codebase can be accessed at https://github.com/tianyi-lab/HallusionBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces HallusionBench, a benchmark with 346 images and 1129 human-expert-crafted questions for evaluating large vision-language models on image-context reasoning. It proposes a novel control-group question structure to enable quantitative analysis of response tendencies, logical consistency, and specific failure modes including language hallucination and visual illusion. The authors evaluate 15 models and report that GPT-4V achieves 31.42% question-pair accuracy while all other models score below 16%, accompanied by case studies and suggested improvement pathways. The benchmark and code are released publicly.
Significance. If the control-group structure validly isolates the targeted failure modes without confounding by general reasoning difficulty or question ambiguity, the benchmark would offer a useful diagnostic tool for understanding limitations in current LVLMs and guiding targeted improvements. The scale of the evaluation across 15 models and the public release of the benchmark and codebase are strengths that could support standardized testing in the field.
major comments (2)
- [Abstract] Abstract: The claim that the novel control-group structure enables isolation of 'entangled language hallucination and visual illusion' is load-bearing for the central diagnostic contribution, yet no validation details (e.g., inter-rater agreement on pair equivalence, ablation removing one control dimension, or human baseline performance on the same pairs) are provided to confirm that pairs differ only in the targeted variable while holding visual complexity and linguistic subtlety fixed.
- [Evaluation] Evaluation results (reported accuracies): The headline finding that GPT-4V reaches 31.42% question-pair accuracy and others remain below 16% is presented as evidence of specific failure modes; however, without an explicit check that performance gaps arise from the disentangled hallucination/illusion factors rather than overall task hardness or pair ambiguity, the attribution to the claimed entangled modes is under-supported.
minor comments (2)
- [Abstract] The abstract states '1129 questions' but does not clarify how many question pairs are used for the reported question-pair accuracy metric; adding this detail would improve reproducibility.
- [Results] A figure or table presenting per-model accuracies would benefit from explicit error bars or statistical significance tests to support the claim that all non-GPT-4V models fall below 16% (a minimal confidence-interval sketch follows below).
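As one concrete form of that suggestion, the sketch below computes a 95% Wilson score interval for a reported accuracy. The counts used are hypothetical placeholders, since per-model question totals are not given here.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """95% Wilson score interval for a binomial proportion (sketch)."""
    if total == 0:
        return (0.0, 0.0)
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = (z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))) / denom
    return (centre - half, centre + half)

# Hypothetical counts: 80 of 254 question pairs credited as correct.
low, high = wilson_interval(80, 254)
print(f"accuracy {80/254:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```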
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each of the major comments below and outline the revisions we will make to strengthen the validation of our proposed control-group structure.
Point-by-point responses
Referee: [Abstract] Abstract: The claim that the novel control-group structure enables isolation of 'entangled language hallucination and visual illusion' is load-bearing for the central diagnostic contribution, yet no validation details (e.g., inter-rater agreement on pair equivalence, ablation removing one control dimension, or human baseline performance on the same pairs) are provided to confirm that pairs differ only in the targeted variable while holding visual complexity and linguistic subtlety fixed.
Authors: We appreciate the referee pointing out the need for more rigorous validation of the control-group design. The questions in HallusionBench were created by human experts specifically to hold visual and linguistic elements constant while varying the targeted hallucination or illusion factor. However, we acknowledge that quantitative validation metrics were not detailed in the initial submission. In the revised manuscript, we will add a dedicated section on benchmark construction that includes inter-rater agreement scores from multiple experts on pair equivalence, results from an ablation study where one control dimension is removed to measure its effect on failure detection, and human baseline accuracies on the question pairs. This will provide evidence that the pairs isolate the intended variables without confounding by general difficulty or ambiguity. revision: yes
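For the inter-rater agreement proposed above, Cohen's kappa over two annotators' pair-equivalence judgments is one natural statistic. The sketch below is generic: the binary labels and the two hypothetical raters are assumptions for illustration, not data from the HallusionBench release.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over the same items (binary labels).

    Sketch of the proposed pair-equivalence agreement check; the
    annotator data shown is hypothetical.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a = sum(labels_a) / n          # rater A's rate of "equivalent" labels
    p_b = sum(labels_b) / n          # rater B's rate of "equivalent" labels
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical labels (1 = "pair differs only in the targeted variable").
rater_a = [1, 1, 0, 1, 1, 0, 1, 1]
rater_b = [1, 1, 0, 1, 0, 0, 1, 1]
print(round(cohens_kappa(rater_a, rater_b), 3))  # ~0.714
```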
Referee: [Evaluation] Evaluation results (reported accuracies): The headline finding that GPT-4V reaches 31.42% question-pair accuracy and others remain below 16% is presented as evidence of specific failure modes; however, without an explicit check that performance gaps arise from the disentangled hallucination/illusion factors rather than overall task hardness or pair ambiguity, the attribution to the claimed entangled modes is under-supported.
Authors: We agree that the attribution of low performance to the specific entangled modes requires additional support beyond the raw accuracies. The control-group structure was designed to enable exactly this kind of analysis through paired comparisons that reveal inconsistencies attributable to hallucination or illusion. To address this, we will expand the evaluation section in the revision to include explicit checks, such as the rate of logical inconsistencies across control pairs and differential performance on hallucination-specific vs. general reasoning questions. We will also report human expert performance on the same pairs to show that the task is solvable when the failure modes are not present, thereby supporting that the gaps in model performance are due to the targeted factors rather than inherent hardness or ambiguity. revision: yes
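One way to operationalize such a check is to measure, per control group, how often a model answers the unedited control question correctly but fails its edited counterpart, which points at the manipulated factor rather than overall hardness. The sketch below assumes a simplified (group_id, role, is_correct) record format and is illustrative only, not the paper's official consistency metric.

```python
from collections import defaultdict

def inconsistency_rate(records):
    """records: iterable of (group_id, role, is_correct), where role is
    'control' (unedited question) or 'variant' (hallucination/illusion probe).

    Returns the fraction of complete groups where the control is answered
    correctly but at least one variant is not -- a rough proxy for failures
    tied to the manipulated factor. Simplified sketch only.
    """
    groups = defaultdict(lambda: {"control": [], "variant": []})
    for group_id, role, is_correct in records:
        groups[group_id][role].append(bool(is_correct))
    flagged, counted = 0, 0
    for g in groups.values():
        if not g["control"] or not g["variant"]:
            continue  # skip groups missing either side
        counted += 1
        if all(g["control"]) and not all(g["variant"]):
            flagged += 1
    return flagged / counted if counted else 0.0

# Hypothetical records for two control groups.
demo = [
    ("g1", "control", True), ("g1", "variant", False),
    ("g2", "control", True), ("g2", "variant", True),
]
print(inconsistency_rate(demo))  # 0.5
```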
Circularity Check
Benchmark introduction and empirical evaluation contain no circular derivations
Full rationale
The paper introduces HallusionBench as a human-expert-crafted set of 346 images and 1129 questions with a novel control-group structure, then reports direct empirical accuracies on 15 models (GPT-4V at 31.42% question-pair accuracy, others below 16%). No equations, fitted parameters, predictions, or first-principles derivations are presented that reduce to the benchmark inputs by construction. The control groups and failure-mode analysis are presented as directly crafted observations rather than as self-referential or load-bearing self-citations. The evaluation chain is therefore grounded in external model testing rather than in any circular derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human experts can design visual questions with control groups that isolate language hallucination from visual illusion without introducing bias.
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation.washburn_uniqueness_aczel (tagged unclear)
The relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "We explore HALLUSIONBENCH and provide an in-depth analysis of examples on which the SoTA LVLMs, such as GPT-4V and LLaVA-1.5, fail... language hallucination and visual illusion"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
-
Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension
Visual Para-Thinker is the first parallel reasoning framework for MLLMs that uses visual partitioning strategies, Pa-Attention, and LPRoPE to extend test-time scaling benefits to visual comprehension tasks.
-
Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space
DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.
-
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
-
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
-
Dual-Pathway Circuits of Object Hallucination in Vision-Language Models
Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
-
To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs
69.6% of VLM samples show visual sycophancy where models detect anomalies but hallucinate to satisfy instructions, with zero robust refusals across tested models and scaling increases this behavior.
-
Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents
Argos is an agentic verifier that adaptively picks scoring functions to evaluate accuracy, localization, and reasoning quality, enabling stronger multimodal RL training for AI agents.
-
Agentic Learner with Grow-and-Refine Multimodal Semantic Memory
ViLoMem is a dual-stream grow-and-refine memory system that separates visual and logical error patterns in MLLMs to improve pass@1 accuracy and reduce repeated mistakes across six multimodal benchmarks.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
AMBER is an LLM-free multi-dimensional benchmark for evaluating hallucinations in MLLMs across generative and discriminative tasks.
-
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.
-
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
Reference graph
Works this paper leans on
-
[1]
Gpt-4v(ision) system card. 2023. 6, 7
work page 2023
-
[2]
nocaps: novel object captioning at scale
Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Ste- fan Lee, and Peter Anderson. nocaps: novel object captioning at scale. International Conference on Computer Vision, pages 8947–8956, 2019. 3
work page 2019
-
[3]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems , 35:23716–23736,
-
[4]
Vqa: Visual question answering
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision , pages 2425– 2433, 2015. 1
work page 2015
-
[5]
Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bit- ton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo, 2023. 1
work page 2023
-
[6]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. ArXiv, abs/2308.12966, 2023. 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[7]
Sparks of artificial general intelligence: Early experiments with gpt-4, 2023
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with gpt-4, 2023. 1
work page 2023
-
[8]
Alpagasus: Training a better alpaca with fewer data
Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gu- naratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, et al. Alpagasus: Training a better alpaca with fewer data. arXiv preprint arXiv:2307.08701, 2023. 1
-
[9]
Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. 1
work page 2023
-
[10]
Holistic analysis of hallucination in gpt-4v(ision): Bias and interference challenges
Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, and Huaxiu Yao. Holistic analysis of hallucination in gpt-4v(ision): Bias and interference challenges. ArXiv, abs/2311.03287, 2023. 4
-
[11]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision- language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023. 2, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
Danny Driess, F. Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Ho Vuong, Tianhe Yu, Wenlong Huang, Yev- gen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Peter R. Flo- rence. Palm-e: An...
work page 2023
-
[13]
Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark
Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, Wanli Ouyang, and Jing Shao. Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. ArXiv, abs/2306.06687, 2023. 3
-
[14]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 3, 4
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[15]
Multimodal-gpt: A vision and language model for dialogue with humans, 2023
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans, 2023. 5
work page 2023
-
[16]
Making the v in vqa matter: Elevating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. International Journal of Computer Vision , 127:398 – 414,
-
[17]
Loc-zson: Language-driven object-centric zero-shot object retrieval and navigation, 2023
Tianrui Guan, Yurou Yang, Harry Cheng, Muyuan Lin, Richard Kim, Rajasimman Madhivanan, Arnie Sen, and Di- nesh Manocha. Loc-zson: Language-driven object-centric zero-shot object retrieval and navigation, 2023. 1
work page 2023
-
[18]
Detecting and preventing hallucinations in large vision language models
Anish Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. ArXiv, abs/2308.06394, 2023. 4
-
[19]
A comprehensive survey of deep learning for image captioning
MD Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CsUR), 51 (6):1–36, 2019. 1
work page 2019
-
[20]
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. ArXiv, abs/2307.16125,
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ArXiv, abs/2301.12597, 2023. 1, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
Scigraphqa: A large-scale synthetic multi-turn question-answering dataset for scientific graphs
Sheng Li and Nima Tajbakhsh. Scigraphqa: A large-scale synthetic multi-turn question-answering dataset for scientific graphs. ArXiv, abs/2308.03349, 2023. 4
-
[23]
Evaluating Object Hallucination in Large Vision-Language Models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji rong Wen. Evaluating object hallucination in large vision-language models. ArXiv, abs/2305.10355, 2023. 1, 3, 4, 5
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Stablellava: Enhanced visual instruction tuning with synthesized image-dialogue data
Yanda Li, Chi Zhang, Gang Yu, Zhibin Wang, Bin Fu, Guosheng Lin, Chunhua Shen, Ling Chen, and Yunchao Wei. Stablellava: Enhanced visual instruction tuning with synthesized image-dialogue data. ArXiv, abs/2308.10253, 2023. 2
-
[25]
Towards understanding in-context learning with contrastive demonstrations and saliency maps
Zongxia Li, Paiheng Xu, Fuxiao Liu, and Hyemi Song. Towards understanding in-context learning with contrastive demonstrations and saliency maps. arXiv preprint arXiv:2307.05052, 2023. 1
-
[26]
Module-wise adaptive distillation for multimodality foundation models
Chen Liang, Jiahui Yu, Ming-Hsuan Yang, Matthew Brown, Yin Cui, Tuo Zhao, Boqing Gong, and Tianyi Zhou. Module-wise adaptive distillation for multimodality foundation models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. 1
work page 2023
-
[27]
Visual news: Benchmark and challenges in news image captioning
Fuxiao Liu, Yinghan Wang, Tianlu Wang, and Vicente Ordonez. Visual news: Benchmark and challenges in news image captioning. arXiv preprint arXiv:2010.03743, 2020. 1, 3
-
[28]
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023. 1, 2, 3, 4, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[29]
Documentclip: Linking figures and main body text in reflowed documents
Fuxiao Liu, Hao Tan, and Chris Tensmeyer. Documentclip: Linking figures and main body text in reflowed documents. arXiv preprint arXiv:2306.06306, 2023. 1
-
[30]
Covid-vts: Fact extraction and verification on short video platforms
Fuxiao Liu, Yaser Yacoob, and Abhinav Shrivastava. Covid-vts: Fact extraction and verification on short video platforms. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 178–188, 2023. 1
work page 2023
-
[31]
Improved baselines with visual instruction tuning, 2023
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023. 1, 5, 6, 7
work page 2023
-
[32]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485,
work page internal anchor Pith review Pith/arXiv arXiv
-
[33]
MMBench: Is Your Multi-modal Model an All-around Player?
Yuanzhan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player? ArXiv, abs/2307.06281, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[34]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chun yue Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating math rea- soning in visual contexts with gpt-4v, bard, and other large multimodal models. ArXiv, abs/2310.02255, 2023. 4
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022. 1
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[36]
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[37]
Scienceqa: A novel resource for question answering on scholarly articles
Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal, and Pushpak Bhattacharyya. Scienceqa: A novel resource for question answering on scholarly articles. International Journal on Digital Libraries, 23(3):289–301, 2022. 1
work page 2022
- [38]
-
[39]
Gemini: A family of highly capable multimodal models, 2023
Gemini Team. Gemini: A family of highly capable multimodal models, 2023. 6, 7
work page 2023
-
[40]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Mar- tinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Roz- ière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[41]
Transform and tell: Entity-aware news image captioning
Alasdair Tran, Alexander Mathews, and Lexing Xie. Transform and tell: Entity-aware news image captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13035–13045, 2020. 1
work page 2020
-
[42]
Show and tell: Lessons learned from the 2015 mscoco image captioning challenge
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE transactions on pattern analysis and machine intelligence, 39(4):652–663, 2016. 1
work page 2015
-
[43]
Vigc: Visual instruction generation and correction
Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiao wen Dong, Weijia Li, Wei Li, Jiaqi Wang, and Conghui He. Vigc: Visual instruction generation and correction. ArXiv, abs/2308.12714, 2023. 3
-
[44]
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. ArXiv, abs/2205.14100, 2022. 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[45]
Larger language models do in-context learning differently
Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, et al. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846, 2023. 1
-
[46]
Large language models can be good privacy protection learners
Yijia Xiao, Yiqiao Jin, Yushi Bai, Yue Wu, Xianjun Yang, Xiao Luo, Wenchao Yu, Xujiang Zhao, Yanchi Liu, Haifeng Chen, Wei Wang, and Wei Cheng. Large language models can be good privacy protection learners. 2023. 1
work page 2023
-
[47]
Embodied multi-modal agent trained by an llm from a parallel textworld,
Yijun Yang, Tianyi Zhou, Kanxue Li, Dapeng Tao, Lusong Li, Li Shen, Xiaodong He, Jing Jiang, and Yuhui Shi. Embodied multi-modal agent trained by an llm from a parallel textworld,
-
[48]
The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023
Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023. 1, 2, 5, 9
work page 2023
-
[49]
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[50]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023. 1, 2, 5, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[51]
mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023
Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023. 6, 7
work page 2023
-
[52]
A Survey on Multimodal Large Language Models
Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[53]
Woodpecker: Hallucination correction for multimodal large language models
Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xingguo Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. ArXiv, abs/2310.16045, 2023. 3
-
[54]
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. ArXiv, abs/2308.02490, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[55]
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choro- manski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, et al. So- cratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022. 1
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[56]
Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guo- qiang Wei, Yang Wei, Yuchen Zhang, and Tao Kong. What matters in training a gpt4-style language model with multi- modal inputs? arXiv preprint arXiv:2307.02469, 2023. 4
-
[57]
Investigating the catastrophic forgetting in multimodal large language models
Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, and Yi Ma. Investigating the catastrophic forgetting in multimodal large language models. arXiv preprint arXiv:2309.10313, 2023. 1
-
[58]
Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models
Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. Siren’s song in the ai ocean: A survey on hal- lucination in large language models. ArXiv, abs/2309.01219,
work page internal anchor Pith review Pith/arXiv arXiv
-
[59]
Yichi Zhang, Jiayi Pan, Yuchen Zhou, Rui Pan, and Joyce Chai. Grounding visual illusions in language: Do vision-language models perceive illusions like humans? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. 3
work page 2023
-
[60]
Llavar: Enhanced visual instruction tuning for text-rich image understanding
Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tongfei Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding. ArXiv, abs/2306.17107, 2023. 2
-
[61]
A Survey of Large Language Models
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[62]
Minigpt-5: Interleaved vision-and-language generation via generative vokens
Kaizhi Zheng, Xuehai He, and Xin Eric Wang. Minigpt-5: Interleaved vision-and-language generation via generative vokens. ArXiv, abs/2310.02239, 2023. 6, 7
-
[63]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 1, 2, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023