AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
Pith reviewed 2026-05-16 06:41 UTC · model grok-4.3
The pith
AMBER provides an LLM-free benchmark to evaluate hallucinations in multi-modal models across the existence, attribute, and relation dimensions for both generative and discriminative tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AMBER is an LLM-free multi-dimensional benchmark that evaluates MLLMs on generative and discriminative tasks for existence, attribute, and relation hallucinations, supported by a low-cost evaluation pipeline that allows comprehensive assessment of mainstream models.
What carries the argument
The AMBER benchmark, which supplies curated image-text pairs and automated scoring rules to detect and categorize hallucinations without external LLM assistance.
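What such a scoring rule can look like in practice, sketched minimally for the existence dimension: nouns parsed from a model's caption are matched against the image's annotated object list, in the spirit of CHAIR-style metrics. The synonym table, function names, and rate formula below are illustrative assumptions, not AMBER's released implementation.

```python
# Hedged sketch of an LLM-free existence-hallucination check.
# SYNONYMS, the example nouns, and the rate formula are illustrative;
# AMBER's released pipeline may use different matching rules.

SYNONYMS = {
    "person": {"person", "man", "woman"},
    "bicycle": {"bicycle", "bike"},
    "dog": {"dog", "puppy"},
}

def canonical(word: str) -> str | None:
    """Map a caption noun onto a canonical object name, if known."""
    w = word.lower()
    for obj, names in SYNONYMS.items():
        if w in names:
            return obj
    return None

def existence_score(caption_nouns: list[str],
                    annotated: set[str]) -> tuple[set[str], float]:
    """Return hallucinated objects and a CHAIR-style hallucination rate."""
    mentioned = {c for n in caption_nouns if (c := canonical(n)) is not None}
    hallucinated = mentioned - annotated
    return hallucinated, len(hallucinated) / max(len(mentioned), 1)

if __name__ == "__main__":
    nouns = ["man", "bike", "dog"]   # nouns parsed from the model's caption
    gt = {"person", "bicycle"}       # human-annotated objects in the image
    print(existence_score(nouns, gt))  # ({'dog'}, 0.333...)
```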
If this is right
- Mainstream MLLMs receive consistent scores on hallucination rates that distinguish generative from discriminative performance.
- Existence, attribute, and relation hallucinations can be measured separately to reveal which error type dominates in a given model.
- Mitigation guidelines derived from the benchmark results can be tested directly on the same evaluation sets.
- Wider adoption of the pipeline reduces reliance on expensive human or advanced-LLM judging for routine MLLM checks.
Where Pith is reading between the lines
- If the benchmark generalizes beyond the tested models, it could serve as a standard reference set for tracking hallucination reduction over successive MLLM releases.
- Separate scoring of the three hallucination types may expose trade-offs, such as models that improve on relations but worsen on attributes.
- The automated pipeline opens the possibility of incorporating AMBER-style checks into training loops to penalize hallucination during fine-tuning (a minimal sketch follows this list).
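Read concretely, that last possibility amounts to treating a rule-based hallucination score as a non-differentiable penalty folded into the fine-tuning loss. The sketch below is hypothetical throughout: amber_style_score, its toy vocabulary, and the REINFORCE-flavored loss shaping are assumptions of this reading, not anything the paper specifies.

```python
import torch

def amber_style_score(text: str, annotated: set[str]) -> float:
    """Hypothetical rule-based hallucination rate in [0, 1]: the fraction
    of toy-vocabulary objects mentioned in the text but absent from the
    image annotations (cf. the existence sketch earlier on this page)."""
    vocab = {"dog", "cat", "car", "person", "bicycle"}
    mentioned = {w for w in vocab if w in text.lower()}
    return len(mentioned - annotated) / max(len(mentioned), 1)

def hallucination_weighted_loss(nll: torch.Tensor, penalty: float,
                                lam: float = 0.1) -> torch.Tensor:
    """Scale the sequence NLL by the (non-differentiable) penalty, a crude
    REINFORCE-flavored surrogate: sampled decodes that hallucinate more
    contribute a larger loss. The shaping rule is an assumption."""
    return nll * (1.0 + lam * penalty)

# Toy usage: pretend nll came from a forward pass and the caption from a
# sampled decode of the same batch.
nll = torch.tensor(2.3, requires_grad=True)
caption = "a dog sitting next to a bicycle"
penalty = amber_style_score(caption, annotated={"bicycle", "person"})
loss = hallucination_weighted_loss(nll, penalty)
loss.backward()
print(f"penalty={penalty:.2f} loss={loss.item():.3f}")  # penalty=0.50 loss=2.415
```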
Load-bearing premise
The low-cost evaluation pipeline accurately detects and categorizes hallucinations without introducing new biases or missing cases that would need LLM or human judgment.
What would settle it
Human annotations on a held-out set of MLLM outputs, compared directly against AMBER pipeline scores: substantial disagreement on hallucination presence or type would break the load-bearing premise, while close agreement would support it.
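Were such a held-out set available, the settling computation itself is short. A sketch assuming binary hallucination labels per output from both the pipeline and human annotators; the label arrays are placeholders, while sklearn's cohen_kappa_score and precision_recall_fscore_support are real functions with these signatures.

```python
from sklearn.metrics import cohen_kappa_score, precision_recall_fscore_support

# Placeholder parallel labels on the same held-out MLLM outputs:
# 1 = hallucination present, 0 = absent.
human_labels    = [1, 0, 0, 1, 1, 0, 1, 0]
pipeline_labels = [1, 0, 1, 1, 0, 0, 1, 0]

# Chance-corrected agreement between the pipeline and human judges.
kappa = cohen_kappa_score(human_labels, pipeline_labels)

# Treating human labels as ground truth: how well does the pipeline detect?
p, r, f1, _ = precision_recall_fscore_support(
    human_labels, pipeline_labels, average="binary"
)
print(f"kappa={kappa:.2f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

High kappa and F1 on a set like this would support the premise; the substantial divergence described above would refute it.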
Original abstract
Despite making significant progress in multi-modal tasks, current Multi-modal Large Language Models (MLLMs) encounter the significant challenge of hallucinations, which may lead to harmful consequences. Therefore, evaluating MLLMs' hallucinations is becoming increasingly important in model improvement and practical application deployment. Previous works are limited in high evaluation costs (e.g., relying on humans or advanced LLMs) and insufficient evaluation dimensions (e.g., types of tasks and hallucinations). In this paper, we propose an LLM-free multi-dimensional benchmark AMBER, which can be used to evaluate both generative task and discriminative task including existence, attribute and relation hallucination. Based on AMBER, we design a low-cost and efficient evaluation pipeline. Additionally, we conduct a comprehensive evaluation and detailed analysis of mainstream MLLMs including GPT-4V(ision), and also give guideline suggestions for mitigating hallucinations. The data and code of AMBER are available at https://github.com/junyangwang0410/AMBER.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AMBER, an LLM-free multi-dimensional benchmark for evaluating hallucinations in Multi-modal Large Language Models (MLLMs). It supports both generative and discriminative tasks across existence, attribute, and relation hallucination types, includes a low-cost evaluation pipeline, reports comprehensive evaluations on models such as GPT-4V, and offers mitigation guidelines. The data and code are released publicly.
Significance. If the LLM-free pipeline is shown to match human or LLM-based judgments with high fidelity, AMBER would provide a scalable, low-cost alternative to existing high-cost hallucination benchmarks, enabling broader model assessment and iterative improvement in the MLLM community. The multi-dimensional coverage and public release are clear strengths.
major comments (2)
- [§4] (Evaluation Pipeline): The central claim that the pipeline accurately detects and categorizes hallucinations in a fully LLM-free manner lacks any quantitative validation. No precision/recall figures, inter-annotator agreement with human labels, or ablations on edge cases (partial attribute matches, relational ambiguities) are reported, making it impossible to assess whether the pipeline introduces systematic misses or new biases.
- [§5] (Experiments): The reported hallucination rates for GPT-4V and other models are presented without direct comparison to human-annotated ground truth or to existing LLM-based benchmarks on the same test cases. This omission leaves the practical utility of the benchmark unverified and weakens the claim of comprehensive evaluation.
minor comments (2)
- [Abstract, §3.1] The abstract and §3.1 would benefit from a concise table summarizing the three hallucination dimensions and the generative vs. discriminative task distinctions.
- [§4] Notation for the pipeline components (e.g., matching rules for attributes) is introduced without a formal definition or pseudocode, reducing reproducibility.
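For example, the attribute matching rule the comment asks about could be pinned down in a few lines. The version below is a referee's guess at a reasonable formalization, with invented names, not the paper's actual rule.

```python
# Hedged pseudocode for an attribute matching rule: an (object, attribute)
# pair claimed in the response counts as a hallucination unless the
# annotation set licenses it. Names and structure are illustrative.

from typing import NamedTuple

class Claim(NamedTuple):
    obj: str        # e.g. "car"
    attribute: str  # e.g. "red"

def attribute_hallucinations(
    claims: list[Claim],
    gt_attributes: dict[str, set[str]],  # object -> annotated attributes
) -> list[Claim]:
    hallucinated = []
    for c in claims:
        allowed = gt_attributes.get(c.obj, set())
        if c.attribute not in allowed:
            # Unsupported attribute for a real object, or an attribute
            # attached to an object absent from the image entirely.
            hallucinated.append(c)
    return hallucinated

# Example: the model says "a red car and a sleeping dog"
claims = [Claim("car", "red"), Claim("dog", "sleeping")]
gt = {"car": {"blue", "parked"}}   # image has a blue parked car, no dog
print(attribute_hallucinations(claims, gt))  # both claims flagged
```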
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on AMBER. We address the two major comments below and will incorporate revisions to strengthen the validation of the evaluation pipeline and experimental results.
Point-by-point responses
- Referee [§4] (Evaluation Pipeline): The central claim that the pipeline accurately detects and categorizes hallucinations in a fully LLM-free manner lacks any quantitative validation. No precision/recall figures, inter-annotator agreement with human labels, or ablations on edge cases (partial attribute matches, relational ambiguities) are reported, making it impossible to assess whether the pipeline introduces systematic misses or new biases.
  Authors: We agree that quantitative validation against human judgments is essential to substantiate the pipeline's reliability. In the revised manuscript, we will add a dedicated subsection in §4 reporting precision, recall, and F1 scores computed on a human-annotated subset of 500 samples. We will also report inter-annotator agreement (Cohen's kappa) and include targeted ablations addressing partial attribute matches and relational ambiguities, with explicit discussion of any observed biases or failure modes.
  Revision: yes
- Referee [§5] (Experiments): The reported hallucination rates for GPT-4V and other models are presented without direct comparison to human-annotated ground truth or to existing LLM-based benchmarks on the same test cases. This omission leaves the practical utility of the benchmark unverified and weakens the claim of comprehensive evaluation.
  Authors: We acknowledge the value of direct comparisons for verifying practical utility. The revised §5 will include a new table comparing AMBER-derived hallucination rates against human-annotated ground truth on a shared subset of test cases, as well as side-by-side results with at least two existing LLM-based benchmarks (e.g., POPE and LURE) on overlapping samples where feasible. This will be accompanied by an analysis of agreement rates and discrepancies.
  Revision: yes
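On the discriminative task such a comparison is especially cheap to run, since LLM-free scoring reduces to parsing a yes/no answer. A sketch under that assumption; the parsing rule and the toy data are ours, while POPE is the real benchmark named above.

```python
import re

def parse_yes_no(response: str) -> bool | None:
    """LLM-free parse of a discriminative answer; None if unparseable."""
    m = re.search(r"\b(yes|no)\b", response.lower())
    return None if m is None else m.group(1) == "yes"

def accuracy(responses: list[str], gold: list[bool]) -> float:
    """Unparseable responses count as wrong, a deliberate (assumed) choice."""
    parsed = (parse_yes_no(r) for r in responses)
    return sum(p == g for p, g in zip(parsed, gold)) / len(gold)

# Toy overlapping samples that could be scored under both protocols.
responses = ["Yes, there is a dog.", "No.", "Yes"]
gold      = [False, True, True]     # human-annotated ground truth
print(f"accuracy vs. human labels: {accuracy(responses, gold):.2f}")
# Feeding the same responses and gold labels to a POPE-style scorer would
# yield the agreement/discrepancy analysis promised in the response above.
```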
Circularity Check
No significant circularity in AMBER benchmark proposal
Full rationale
The paper introduces AMBER as a new LLM-free multi-dimensional benchmark for evaluating hallucinations in MLLMs across generative and discriminative tasks (existence, attribute, relation). The central claim is the construction and application of this benchmark with a low-cost evaluation pipeline. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The pipeline is positioned as an independent, rule-based mechanism rather than deriving from its own outputs or prior author results by construction. This is a standard benchmark proposal: its content is self-contained, and its claims rest on evaluations of external models.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Hallucinations in MLLMs can be reliably categorized into existence, attribute, and relation types.
Forward citations
Cited by 18 Pith papers
- AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction
  AVID is the first large-scale benchmark for audio-visual inconsistency detection, grounding, classification, and reasoning in long videos, constructed via agent-driven methods and showing that state-of-the-art models ...
- Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution
  SIRA mitigates hallucinations in LVLMs by internally contrasting full visual access against a masked late-layer branch that retains shared context but lacks fine-grained visual evidence.
- OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems
  OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.
- DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models
  DO-Bench is a controlled benchmark that attributes VLM object hallucination errors to textual prior pressure, perceptual limits, or their interaction via two diagnostic dimensions and metrics.
- Dual-Pathway Circuits of Object Hallucination in Vision-Language Models
  Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.
- Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
  Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
- Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
  LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
- Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent
  CAVI framework uses character-guided token pruning, orthogonal feature modulation, and modality-adaptive role steering to resolve modality-role interference in multimodal RPAs.
- R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs
  R-CoV is a six-step region-aware chain-of-verification technique that elicits coordinate and description outputs from LVLMs themselves to detect and reduce object hallucinations without external models or retraining.
- Mitigating Multimodal Hallucination via Phase-wise Self-reward
  PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.
- SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models
  SIF creates semantically in-distribution fingerprints for LVLMs by distilling text watermarks into visual inputs and optimizing for robustness against detection and modification.
- Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models
  UE-DPO quantifies epistemic uncertainty from grounding failures to direct more learning pressure on hard visual tokens in preferred samples while easing penalties on dispreferred ones.
- Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation
  MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.
- Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction
  MESA reduces hallucinations in LVLMs via controlled selective latent intervention that preserves the original token distribution.
- Steering the Verifiability of Multimodal AI Hallucinations
  Researchers create a human-labeled dataset of obvious and elusive multimodal hallucinations and use learned activation-space probes to control their verifiability in MLLMs.
- Hallucination of Multimodal Large Language Models: A Survey
  The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
- A Survey on Hallucination in Large Vision-Language Models
  This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.
- A Survey on Multimodal Large Language Models
  This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
Reference graph
Works this paper leans on
[1] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966.
[2] MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning. arXiv preprint arXiv:2310.09478.
[3] Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges. arXiv preprint arXiv:2311.03287.
[4] InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv preprint arXiv:2305.06500.
[5] Detecting and Preventing Hallucinations in Large Vision Language Models. arXiv preprint arXiv:2308.06394.
[6] Evaluating Object Hallucination in Large Vision-Language Models. arXiv preprint arXiv:2305.10355.
[7] Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision, pages 740–755. Springer.
[8] GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
[9] Object Hallucination in Image Captioning. arXiv preprint arXiv:1809.02156.
[10] An Early Evaluation of GPT-4V(ision). arXiv preprint arXiv:2310.16534.
[11] Woodpecker: Hallucination Correction for Multimodal Large Language Models. arXiv preprint arXiv:2310.16045.
[12] HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption. arXiv preprint arXiv:2310.01779.
[13] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv preprint arXiv:2304.10592.