AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
Pith reviewed 2026-05-16 06:41 UTC · model grok-4.3
The pith
AMBER provides an LLM-free benchmark to evaluate hallucinations in multi-modal models across the existence, attribute, and relation dimensions for both generative and discriminative tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AMBER is an LLM-free multi-dimensional benchmark that evaluates MLLMs on generative and discriminative tasks for existence, attribute, and relation hallucinations, supported by a low-cost evaluation pipeline that allows comprehensive assessment of mainstream models.
What carries the argument
The AMBER benchmark, which supplies curated image-text pairs and automated scoring rules to detect and categorize hallucinations without external LLM assistance.
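What such a scoring rule can look like in practice, sketched minimally for the existence dimension: nouns parsed from a model's caption are matched against the image's annotated object list, in the spirit of CHAIR-style metrics. The synonym table, function names, and rate formula below are illustrative assumptions, not AMBER's released implementation.

```python
# Hedged sketch of an LLM-free existence-hallucination check.
# SYNONYMS, the example nouns, and the rate formula are illustrative;
# AMBER's released pipeline may use different matching rules.

SYNONYMS = {
    "person": {"person", "man", "woman"},
    "bicycle": {"bicycle", "bike"},
    "dog": {"dog", "puppy"},
}

def canonical(word: str) -> str | None:
    """Map a caption noun onto a canonical object name, if known."""
    w = word.lower()
    for obj, names in SYNONYMS.items():
        if w in names:
            return obj
    return None

def existence_score(caption_nouns: list[str],
                    annotated: set[str]) -> tuple[set[str], float]:
    """Return hallucinated objects and a CHAIR-style hallucination rate."""
    mentioned = {c for n in caption_nouns if (c := canonical(n)) is not None}
    hallucinated = mentioned - annotated
    return hallucinated, len(hallucinated) / max(len(mentioned), 1)

if __name__ == "__main__":
    nouns = ["man", "bike", "dog"]   # nouns parsed from the model's caption
    gt = {"person", "bicycle"}       # human-annotated objects in the image
    print(existence_score(nouns, gt))  # ({'dog'}, 0.333...)
```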
If this is right
- Mainstream MLLMs receive consistent scores on hallucination rates that distinguish generative from discriminative performance.
- Existence, attribute, and relation hallucinations can be measured separately to reveal which error type dominates in a given model.
- Mitigation guidelines derived from the benchmark results can be tested directly on the same evaluation sets.
- Wider adoption of the pipeline reduces reliance on expensive human or advanced-LLM judging for routine MLLM checks.
Where Pith is reading between the lines
- If the benchmark generalizes beyond the tested models, it could serve as a standard reference set for tracking hallucination reduction over successive MLLM releases.
- Separate scoring of the three hallucination types may expose trade-offs, such as models that improve on relations but worsen on attributes.
- The automated pipeline opens the possibility of incorporating AMBER-style checks into training loops to penalize hallucination during fine-tuning (a minimal sketch follows this list).
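Read concretely, that last possibility amounts to treating a rule-based hallucination score as a non-differentiable penalty folded into the fine-tuning loss. The sketch below is hypothetical throughout: amber_style_score, its toy vocabulary, and the REINFORCE-flavored loss shaping are assumptions of this reading, not anything the paper specifies.

```python
import torch

def amber_style_score(text: str, annotated: set[str]) -> float:
    """Hypothetical rule-based hallucination rate in [0, 1]: the fraction
    of toy-vocabulary objects mentioned in the text but absent from the
    image annotations (cf. the existence sketch earlier on this page)."""
    vocab = {"dog", "cat", "car", "person", "bicycle"}
    mentioned = {w for w in vocab if w in text.lower()}
    return len(mentioned - annotated) / max(len(mentioned), 1)

def hallucination_weighted_loss(nll: torch.Tensor, penalty: float,
                                lam: float = 0.1) -> torch.Tensor:
    """Scale the sequence NLL by the (non-differentiable) penalty, a crude
    REINFORCE-flavored surrogate: sampled decodes that hallucinate more
    contribute a larger loss. The shaping rule is an assumption."""
    return nll * (1.0 + lam * penalty)

# Toy usage: pretend nll came from a forward pass and the caption from a
# sampled decode of the same batch.
nll = torch.tensor(2.3, requires_grad=True)
caption = "a dog sitting next to a bicycle"
penalty = amber_style_score(caption, annotated={"bicycle", "person"})
loss = hallucination_weighted_loss(nll, penalty)
loss.backward()
print(f"penalty={penalty:.2f} loss={loss.item():.3f}")  # penalty=0.50 loss=2.415
```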
Load-bearing premise
The low-cost evaluation pipeline accurately detects and categorizes hallucinations without introducing new biases or missing cases that would need LLM or human judgment.
What would settle it
Human annotations on a held-out set of MLLM outputs, compared directly against AMBER pipeline scores: substantial disagreement on hallucination presence or type would break the load-bearing premise, while close agreement would support it.
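Were such a held-out set available, the settling computation itself is short. A sketch assuming binary hallucination labels per output from both the pipeline and human annotators; the label arrays are placeholders, while sklearn's cohen_kappa_score and precision_recall_fscore_support are real functions with these signatures.

```python
from sklearn.metrics import cohen_kappa_score, precision_recall_fscore_support

# Placeholder parallel labels on the same held-out MLLM outputs:
# 1 = hallucination present, 0 = absent.
human_labels    = [1, 0, 0, 1, 1, 0, 1, 0]
pipeline_labels = [1, 0, 1, 1, 0, 0, 1, 0]

# Chance-corrected agreement between the pipeline and human judges.
kappa = cohen_kappa_score(human_labels, pipeline_labels)

# Treating human labels as ground truth: how well does the pipeline detect?
p, r, f1, _ = precision_recall_fscore_support(
    human_labels, pipeline_labels, average="binary"
)
print(f"kappa={kappa:.2f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

High kappa and F1 on a set like this would support the premise; the substantial divergence described above would refute it.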
Original abstract
Despite making significant progress in multi-modal tasks, current Multi-modal Large Language Models (MLLMs) encounter the significant challenge of hallucinations, which may lead to harmful consequences. Therefore, evaluating MLLMs' hallucinations is becoming increasingly important in model improvement and practical application deployment. Previous works are limited in high evaluation costs (e.g., relying on humans or advanced LLMs) and insufficient evaluation dimensions (e.g., types of tasks and hallucinations). In this paper, we propose an LLM-free multi-dimensional benchmark AMBER, which can be used to evaluate both generative task and discriminative task including existence, attribute and relation hallucination. Based on AMBER, we design a low-cost and efficient evaluation pipeline. Additionally, we conduct a comprehensive evaluation and detailed analysis of mainstream MLLMs including GPT-4V(ision), and also give guideline suggestions for mitigating hallucinations. The data and code of AMBER are available at https://github.com/junyangwang0410/AMBER.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AMBER, an LLM-free multi-dimensional benchmark for evaluating hallucinations in Multi-modal Large Language Models (MLLMs). It supports both generative and discriminative tasks across existence, attribute, and relation hallucination types, includes a low-cost evaluation pipeline, reports comprehensive evaluations on models such as GPT-4V, and offers mitigation guidelines. The data and code are released publicly.
Significance. If the LLM-free pipeline is shown to match human or LLM-based judgments with high fidelity, AMBER would provide a scalable, low-cost alternative to existing high-cost hallucination benchmarks, enabling broader model assessment and iterative improvement in the MLLM community. The multi-dimensional coverage and public release are clear strengths.
major comments (2)
- [§4] (Evaluation Pipeline): The central claim that the pipeline accurately detects and categorizes hallucinations in a fully LLM-free manner lacks any quantitative validation. No precision/recall figures, inter-annotator agreement with human labels, or ablations on edge cases (partial attribute matches, relational ambiguities) are reported, making it impossible to assess whether the pipeline introduces systematic misses or new biases.
- [§5] (Experiments): The reported hallucination rates for GPT-4V and other models are presented without direct comparison to human-annotated ground truth or to existing LLM-based benchmarks on the same test cases. This omission leaves the practical utility of the benchmark unverified and weakens the claim of comprehensive evaluation.
minor comments (2)
- [Abstract, §3.1] The abstract and §3.1 would benefit from a concise table summarizing the three hallucination dimensions and the generative vs. discriminative task distinctions.
- [§4] Notation for the pipeline components (e.g., matching rules for attributes) is introduced without a formal definition or pseudocode, reducing reproducibility.
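For example, the attribute matching rule the comment asks about could be pinned down in a few lines. The version below is a referee's guess at a reasonable formalization, with invented names, not the paper's actual rule.

```python
# Hedged pseudocode for an attribute matching rule: an (object, attribute)
# pair claimed in the response counts as a hallucination unless the
# annotation set licenses it. Names and structure are illustrative.

from typing import NamedTuple

class Claim(NamedTuple):
    obj: str        # e.g. "car"
    attribute: str  # e.g. "red"

def attribute_hallucinations(
    claims: list[Claim],
    gt_attributes: dict[str, set[str]],  # object -> annotated attributes
) -> list[Claim]:
    hallucinated = []
    for c in claims:
        allowed = gt_attributes.get(c.obj, set())
        if c.attribute not in allowed:
            # Unsupported attribute for a real object, or an attribute
            # attached to an object absent from the image entirely.
            hallucinated.append(c)
    return hallucinated

# Example: the model says "a red car and a sleeping dog"
claims = [Claim("car", "red"), Claim("dog", "sleeping")]
gt = {"car": {"blue", "parked"}}   # image has a blue parked car, no dog
print(attribute_hallucinations(claims, gt))  # both claims flagged
```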
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on AMBER. We address the two major comments below and will incorporate revisions to strengthen the validation of the evaluation pipeline and experimental results.
Point-by-point responses
- Referee [§4] (Evaluation Pipeline): The central claim that the pipeline accurately detects and categorizes hallucinations in a fully LLM-free manner lacks any quantitative validation. No precision/recall figures, inter-annotator agreement with human labels, or ablations on edge cases (partial attribute matches, relational ambiguities) are reported, making it impossible to assess whether the pipeline introduces systematic misses or new biases.
  Authors: We agree that quantitative validation against human judgments is essential to substantiate the pipeline's reliability. In the revised manuscript, we will add a dedicated subsection in §4 reporting precision, recall, and F1 scores computed on a human-annotated subset of 500 samples. We will also report inter-annotator agreement (Cohen's kappa) and include targeted ablations addressing partial attribute matches and relational ambiguities, with explicit discussion of any observed biases or failure modes.
  Revision: yes
- Referee [§5] (Experiments): The reported hallucination rates for GPT-4V and other models are presented without direct comparison to human-annotated ground truth or to existing LLM-based benchmarks on the same test cases. This omission leaves the practical utility of the benchmark unverified and weakens the claim of comprehensive evaluation.
  Authors: We acknowledge the value of direct comparisons for verifying practical utility. The revised §5 will include a new table comparing AMBER-derived hallucination rates against human-annotated ground truth on a shared subset of test cases, as well as side-by-side results with at least two existing LLM-based benchmarks (e.g., POPE and LURE) on overlapping samples where feasible. This will be accompanied by an analysis of agreement rates and discrepancies.
  Revision: yes
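On the discriminative task such a comparison is especially cheap to run, since LLM-free scoring reduces to parsing a yes/no answer. A sketch under that assumption; the parsing rule and the toy data are ours, while POPE is the real benchmark named above.

```python
import re

def parse_yes_no(response: str) -> bool | None:
    """LLM-free parse of a discriminative answer; None if unparseable."""
    m = re.search(r"\b(yes|no)\b", response.lower())
    return None if m is None else m.group(1) == "yes"

def accuracy(responses: list[str], gold: list[bool]) -> float:
    """Unparseable responses count as wrong, a deliberate (assumed) choice."""
    parsed = (parse_yes_no(r) for r in responses)
    return sum(p == g for p, g in zip(parsed, gold)) / len(gold)

# Toy overlapping samples that could be scored under both protocols.
responses = ["Yes, there is a dog.", "No.", "Yes"]
gold      = [False, True, True]     # human-annotated ground truth
print(f"accuracy vs. human labels: {accuracy(responses, gold):.2f}")
# Feeding the same responses and gold labels to a POPE-style scorer would
# yield the agreement/discrepancy analysis promised in the response above.
```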
Circularity Check
No significant circularity in AMBER benchmark proposal
Full rationale
The paper introduces AMBER as a new LLM-free multi-dimensional benchmark for evaluating hallucinations in MLLMs across generative and discriminative tasks (existence, attribute, relation). The central claim is the construction and application of this benchmark with a low-cost evaluation pipeline. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The pipeline is positioned as an independent, rule-based mechanism rather than deriving from its own outputs or prior author results by construction. This is a standard benchmark proposal: its content is self-contained, and its claims rest on evaluations of external models.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Hallucinations in MLLMs can be reliably categorized into existence, attribute, and relation types.
Forward citations
Cited by 18 Pith papers
- AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction
  AVID is the first large-scale benchmark for audio-visual inconsistency detection, grounding, classification, and reasoning in long videos, constructed via agent-driven methods and showing that state-of-the-art models ...
- Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution
  SIRA mitigates hallucinations in LVLMs by internally contrasting full visual access against a masked late-layer branch that retains shared context but lacks fine-grained visual evidence.
- OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems
  OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.
- DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models
  DO-Bench is a controlled benchmark that attributes VLM object hallucination errors to textual prior pressure, perceptual limits, or their interaction via two diagnostic dimensions and metrics.
- Dual-Pathway Circuits of Object Hallucination in Vision-Language Models
  Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.
- Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
  Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
- Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
  LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
- Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent
  CAVI framework uses character-guided token pruning, orthogonal feature modulation, and modality-adaptive role steering to resolve modality-role interference in multimodal RPAs.
- R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs
  R-CoV is a six-step region-aware chain-of-verification technique that elicits coordinate and description outputs from LVLMs themselves to detect and reduce object hallucinations without external models or retraining.
- Mitigating Multimodal Hallucination via Phase-wise Self-reward
  PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.
- SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models
  SIF creates semantically in-distribution fingerprints for LVLMs by distilling text watermarks into visual inputs and optimizing for robustness against detection and modification.
- Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models
  UE-DPO quantifies epistemic uncertainty from grounding failures to direct more learning pressure on hard visual tokens in preferred samples while easing penalties on dispreferred ones.
- Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation
  MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.
- Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction
  MESA reduces hallucinations in LVLMs via controlled selective latent intervention that preserves the original token distribution.
- Steering the Verifiability of Multimodal AI Hallucinations
  Researchers create a human-labeled dataset of obvious and elusive multimodal hallucinations and use learned activation-space probes to control their verifiability in MLLMs.
- Hallucination of Multimodal Large Language Models: A Survey
  The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
- A Survey on Hallucination in Large Vision-Language Models
  This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.
- A Survey on Multimodal Large Language Models
  This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
Reference graph
Works this paper leans on
[1] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966.
[2] MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning. arXiv preprint arXiv:2310.09478.
[3] Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges. arXiv preprint arXiv:2311.03287.
[4] InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv preprint arXiv:2305.06500.
[5] Detecting and Preventing Hallucinations in Large Vision Language Models. arXiv preprint arXiv:2308.06394.
[6] Evaluating Object Hallucination in Large Vision-Language Models. arXiv preprint arXiv:2305.10355.
[7] Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision, pages 740–755. Springer.
[8] GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
[9] Object Hallucination in Image Captioning. arXiv preprint arXiv:1809.02156.
[10] An Early Evaluation of GPT-4V(ision). arXiv preprint arXiv:2310.16534.
[11] Woodpecker: Hallucination Correction for Multimodal Large Language Models. arXiv preprint arXiv:2310.16045.
[12] HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption. arXiv preprint arXiv:2310.01779.
[13] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv preprint arXiv:2304.10592.