{"paper":{"title":"HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"HallusionBench shows even GPT-4V reaches only 31.42 percent accuracy on paired questions that expose language hallucination and visual illusion in vision-language models.","cross_cats":["cs.CL"],"primary_cat":"cs.CV","authors_text":"Dinesh Manocha, Furong Huang, Fuxiao Liu, Lichang Chen, Ruiqi Xian, Tianrui Guan, Tianyi Zhou, Xiaoyu Liu, Xijun Wang, Xiyang Wu, Yaser Yacoob, Zongxia Li","submitted_at":"2023-10-23T04:49:09Z","abstract_excerpt":"We introduce HallusionBench, a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(Vision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts. We introduce a novel structure for these visual questions designed to establish control groups. This structure enables us to conduct a quantitativ"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably, all other evaluated models achieve accuracy below 16%.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that human-expert-crafted questions with the novel control-group structure accurately isolate and measure entangled language hallucination and visual illusion without introducing confounding biases or subjective interpretations in scoring.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"HallusionBench shows even GPT-4V reaches only 31.42 percent accuracy on paired questions that expose language hallucination and visual illusion in vision-language models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"455c88252001f05f636a22d2716650fa8f0fadc62c9c6be0eed86db7df2cff9c"},"source":{"id":"2310.14566","kind":"arxiv","version":5},"verdict":{"id":"320843f4-2d58-4ce0-9b75-93564f84ba77","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T01:17:20.225173Z","strongest_claim":"In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably, all other evaluated models achieve accuracy below 16%.","one_line_summary":"HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that human-expert-crafted questions with the novel control-group structure accurately isolate and measure entangled language hallucination and visual illusion without introducing confounding biases or subjective interpretations in scoring.","pith_extraction_headline":"HallusionBench shows even GPT-4V reaches only 31.42 percent accuracy on paired questions that expose language hallucination and visual illusion in vision-language models."},"references":{"count":63,"sample":[{"doi":"","year":2023,"title":"Gpt-4v(ision) system card. 2023. 6, 7","work_id":"88a556c3-5f22-4f71-b403-084ceddb10a2","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"nocaps: novel object captioning at scale","work_id":"cf1f8da2-1488-4dd0-8a72-2412fb7c436d","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Flamingo: a visual language model for few-shot learning","work_id":"31e3af5c-9fec-43d9-b533-5bb70172dd15","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2015,"title":"Vqa: Visual question answering","work_id":"3db513bc-ec97-47d1-bc83-6eb38b02a2d9","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bit- ton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell","work_id":"585147a1-ddd5-4f10-93ea-dae26c9319b1","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":63,"snapshot_sha256":"bbbaaf3be139cf261d8684897cdc7b95496f55bbf49eaea24af242076ee8cc1c","internal_anchors":22},"formal_canon":{"evidence_count":2,"snapshot_sha256":"3fa34980d59bf013942a12d97745133ef222683d99a4935b8c7c25f09303061e"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}