hub Baseline reference

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu · 2023 · cs.CV · arXiv 2310.14566

Baseline reference. 67% of citing Pith papers use this work as a benchmark or comparison.

30 Pith papers citing it

Baseline 67% of classified citations

open full Pith review browse 30 citing papers arXiv PDF

abstract

We introduce HallusionBench, a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(Vision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts. We introduce a novel structure for these visual questions designed to establish control groups. This structure enables us to conduct a quantitative analysis of the models' response tendencies, logical consistency, and various failure modes. In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably, all other evaluated models achieve accuracy below 16%. Moreover, our analysis not only highlights the observed failure modes, including language hallucination and visual illusion, but also deepens an understanding of these pitfalls. Our comprehensive case studies within HallusionBench shed light on the challenges of hallucination and illusion in LVLMs. Based on these insights, we suggest potential pathways for their future improvement. The benchmark and codebase can be accessed at https://github.com/tianyi-lab/HallusionBench.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

dataset 6 background 3

citation-polarity summary

use dataset 6 background 3

representative citing papers

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

cs.CL · 2023-11-27 · unverdicted · novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension

cs.CV · 2026-02-10 · unverdicted · novelty 7.0

Visual Para-Thinker is the first parallel reasoning framework for MLLMs that uses visual partitioning strategies, Pa-Attention, and LPRoPE to extend test-time scaling benefits to visual comprehension tasks.

Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

cs.CV · 2025-12-14 · unverdicted · novelty 7.0

DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.

FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

cs.CV · 2025-04-14 · unverdicted · novelty 7.0

FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

cs.AI · 2025-03-17 · conditional · novelty 7.0

R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

cs.CV · 2024-06-13 · conditional · novelty 7.0

MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.

The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models

cs.CV · 2026-06-05 · unverdicted · novelty 6.0

FineSightBench reveals VLMs perceive patterns down to 12px but show persistent failures in fine-scale reasoning such as numeracy and sequencing.

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.

Dual-Pathway Circuits of Object Hallucination in Vision-Language Models

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.

HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents

cs.AI · 2025-12-03 · unverdicted · novelty 6.0

Argos is an agentic verifier that adaptively picks scoring functions to evaluate accuracy, localization, and reasoning quality, enabling stronger multimodal RL training for AI agents.

Agentic Learner with Grow-and-Refine Multimodal Semantic Memory

cs.AI · 2025-11-26 · unverdicted · novelty 6.0

ViLoMem is a dual-stream grow-and-refine memory system that separates visual and logical error patterns in MLLMs to improve pass@1 accuracy and reduce repeated mistakes across six multimodal benchmarks.

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

cs.CV · 2025-08-25 · unverdicted · novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.

When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models

cs.CV · 2025-07-18 · unverdicted · novelty 6.0

The work identifies a small set of attention heads in VLMs that mediate conflicts between parametric knowledge and visual input, and shows that intervening on them steers model behavior while attention patterns provide precise image-region attribution.

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

cs.CV · 2025-04-14 · conditional · novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

cs.CV · 2024-12-06 · unverdicted · novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

cs.CL · 2023-11-13 · unverdicted · novelty 6.0

AMBER is an LLM-free multi-dimensional benchmark for evaluating hallucinations in MLLMs across generative and discriminative tasks.

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

cs.CV · 2023-06-26 · accept · novelty 6.0

A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.

TuringViT: Making SOTA Vision Transformers Accessible to All

cs.CV · 2026-06-23 · unverdicted · novelty 5.0

TuringViT claims a new ViT design with linear attention and curated data that matches SOTA performance using 10% of typical pretraining data while supporting dynamic resolutions and improving VLM integration.

Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do

cs.CL · 2026-06-21 · unverdicted · novelty 5.0

Multimodal CoT improves math and science reasoning but degrades visual perception performance and exhibits a pattern of diminishing visual reflection while verbal reflection varies.

Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models

cs.CL · 2026-06-09 · unverdicted · novelty 5.0

The work establishes an evaluation framework for personality induction and switching in MLLMs, reporting improved captioning but impaired VQA performance plus balancing and residual effects during multi-trait and dynamic conditions.

Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination

cs.AI · 2026-05-19 · unverdicted · novelty 5.0

Causal path-patching analysis across five MLLMs identifies distributed hallucination-driving attention heads and localized resisting heads whose imbalance biases generation toward erroneous text over visual evidence; a conditional intervention MACI suppresses the driving heads and cuts hallucination

Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration

cs.CV · 2025-05-27 · unverdicted · novelty 5.0

CAAC mitigates hallucinations in LVLMs via Visual-Token Calibration and Adaptive Attention Re-Scaling guided by model confidence, showing gains on CHAIR, AMBER, and POPE especially in long-form generation.

citing papers explorer

Showing 30 of 30 citing papers.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark cs.CL · 2024-09-04 · accept · none · ref 28 · internal anchor
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI cs.CL · 2023-11-27 · unverdicted · none · ref 42 · internal anchor
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension cs.CV · 2026-02-10 · unverdicted · none · ref 4 · internal anchor
Visual Para-Thinker is the first parallel reasoning framework for MLLMs that uses visual partitioning strategies, Pa-Attention, and LPRoPE to extend test-time scaling benefits to visual comprehension tasks.
Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space cs.CV · 2025-12-14 · unverdicted · none · ref 58 · internal anchor
DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding cs.CV · 2025-04-14 · unverdicted · none · ref 22 · internal anchor
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization cs.AI · 2025-03-17 · conditional · none · ref 12 · internal anchor
R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding cs.CV · 2024-06-13 · conditional · none · ref 22 · internal anchor
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models cs.CV · 2026-06-05 · unverdicted · none · ref 22 · internal anchor
FineSightBench reveals VLMs perceive patterns down to 12px but show persistent failures in fine-scale reasoning such as numeracy and sequencing.
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR cs.AI · 2026-05-19 · unverdicted · none · ref 42 · internal anchor
POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.
Dual-Pathway Circuits of Object Hallucination in Vision-Language Models cs.CV · 2026-05-13 · unverdicted · none · ref 16 · internal anchor
Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering cs.AI · 2026-04-22 · unverdicted · none · ref 184 · internal anchor
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents cs.AI · 2025-12-03 · unverdicted · none · ref 36 · internal anchor
Argos is an agentic verifier that adaptively picks scoring functions to evaluate accuracy, localization, and reasoning quality, enabling stronger multimodal RL training for AI agents.
Agentic Learner with Grow-and-Refine Multimodal Semantic Memory cs.AI · 2025-11-26 · unverdicted · none · ref 16 · internal anchor
ViLoMem is a dual-stream grow-and-refine memory system that separates visual and logical error patterns in MLLMs to improve pass@1 accuracy and reduce repeated mistakes across six multimodal benchmarks.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency cs.CV · 2025-08-25 · unverdicted · none · ref 41 · internal anchor
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models cs.CV · 2025-07-18 · unverdicted · none · ref 10 · internal anchor
The work identifies a small set of attention heads in VLMs that mediate conflicts between parametric knowledge and visual input, and shows that intervening on them steers model behavior while attention patterns provide precise image-region attribution.
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models cs.CV · 2025-04-14 · conditional · none · ref 45 · internal anchor
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 77 · internal anchor
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation cs.CL · 2023-11-13 · unverdicted · none · ref 7 · internal anchor
AMBER is an LLM-free multi-dimensional benchmark for evaluating hallucinations in MLLMs across generative and discriminative tasks.
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning cs.CV · 2023-06-26 · accept · none · ref 15 · internal anchor
A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.
TuringViT: Making SOTA Vision Transformers Accessible to All cs.CV · 2026-06-23 · unverdicted · none · ref 74 · internal anchor
TuringViT claims a new ViT design with linear attention and curated data that matches SOTA performance using 10% of typical pretraining data while supporting dynamic resolutions and improving VLM integration.
Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do cs.CL · 2026-06-21 · unverdicted · none · ref 3 · internal anchor
Multimodal CoT improves math and science reasoning but degrades visual perception performance and exhibits a pattern of diminishing visual reflection while verbal reflection varies.
Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models cs.CL · 2026-06-09 · unverdicted · none · ref 72 · internal anchor
The work establishes an evaluation framework for personality induction and switching in MLLMs, reporting improved captioning but impaired VQA performance plus balancing and residual effects during multi-trait and dynamic conditions.
Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination cs.AI · 2026-05-19 · unverdicted · none · ref 7 · internal anchor
Causal path-patching analysis across five MLLMs identifies distributed hallucination-driving attention heads and localized resisting heads whose imbalance biases generation toward erroneous text over visual evidence; a conditional intervention MACI suppresses the driving heads and cuts hallucination
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration cs.CV · 2025-05-27 · unverdicted · none · ref 8 · internal anchor
CAAC mitigates hallucinations in LVLMs via Visual-Token Calibration and Adaptive Attention Re-Scaling guided by model confidence, showing gains on CHAIR, AMBER, and POPE especially in long-form generation.
Qwen2.5-VL Technical Report cs.CV · 2025-02-19 · unverdicted · none · ref 12 · internal anchor
Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level localization.
Hallucination of Multimodal Large Language Models: A Survey cs.CV · 2024-04-29 · accept · none · ref 112 · internal anchor
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions cs.CL · 2023-11-09 · unverdicted · none · ref 191 · internal anchor
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites cs.CV · 2024-04-25 · unverdicted · none · ref 30 · internal anchor
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models cs.AI · 2026-04-11 · unreviewed · ref 93 · internal anchor
To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs cs.CV · 2026-03-19 · unreviewed · ref 13 · internal anchor

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer