ZINA: Multimodal Fine-grained Hallucination Detection and Editing
Pith reviewed 2026-05-19 10:02 UTC · model grok-4.3
The pith
ZINA identifies hallucinated spans in multimodal outputs, classifies them into six error types, and suggests refinements, outperforming GPT-4o and Llama-3.2.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ZINA is a novel method that identifies hallucinated spans at a fine-grained level, classifies their error types into six categories, and suggests appropriate refinements. To support training and evaluation, the VisionHall dataset is constructed with 6.9k outputs from twelve MLLMs manually annotated by 211 annotators and 20k synthetic samples generated using a graph-based method that captures dependencies among error types. ZINA is shown to outperform existing methods, including GPT-4o and Llama-3.2, in both detection and editing tasks.
What carries the argument
ZINA, the method that identifies hallucinated spans, classifies error types into six categories, and suggests refinements based on graph-modeled dependencies.
If this is right
- Fine-grained detection enables more precise evaluation of MLLM outputs beyond coarse metrics.
- Editing suggestions can automatically refine outputs to reduce hallucinations.
- The VisionHall dataset provides a benchmark for future hallucination mitigation research.
- Accounting for dependencies among error types improves classification and editing reliability.
Where Pith is reading between the lines
- ZINA could be integrated into MLLM inference pipelines for on-the-fly correction in applications.
- The six error categories offer a taxonomy that might inform targeted training strategies to prevent hallucinations.
- The graph-based synthesis technique could extend to creating datasets for related issues like factual errors in other modalities.
Load-bearing premise
The manual annotations by 211 annotators on 6.9k outputs accurately reflect fine-grained hallucinations, and the graph-based synthetic samples of 20k examples correctly capture dependencies among the six error types.
What would settle it
Independent annotations on a new set of MLLM outputs where ZINA shows no accuracy advantage over GPT-4o in span detection or error classification would falsify the performance superiority.
Figures
read the original abstract
Multimodal Large Language Models (MLLMs) often generate hallucinations, where the output deviates from the visual content. Given that these hallucinations can take diverse forms, detecting hallucinations at a fine-grained level is essential for comprehensive evaluation and analysis. To this end, we propose a novel task of multimodal fine-grained hallucination detection and editing for MLLMs. Moreover, we propose ZINA, a novel method that identifies hallucinated spans at a fine-grained level, classifies their error types into six categories, and suggests appropriate refinements. To train and evaluate models for this task, we construct VisionHall, a dataset comprising 6.9k outputs from twelve MLLMs manually annotated by 211 annotators, and 20k synthetic samples generated using a graph-based method that captures dependencies among error types. We demonstrated that ZINA outperformed existing methods, including GPT-4o and Llama-3.2, in both detection and editing tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the task of multimodal fine-grained hallucination detection and editing for MLLMs. It proposes ZINA, which detects hallucinated spans, classifies them into six error types, and suggests refinements. The authors construct the VisionHall dataset comprising 6.9k human-annotated outputs from twelve MLLMs (by 211 annotators) plus 20k graph-based synthetic samples that model dependencies among error types. They report that ZINA outperforms baselines including GPT-4o and Llama-3.2 on both detection and editing.
Significance. If the evaluation holds, the work provides a useful fine-grained framework and dataset for analyzing and mitigating hallucinations in vision-language models. The graph-based synthesis approach for capturing error-type dependencies is a constructive contribution that could support more realistic training data. Demonstrating editing capability alongside detection adds practical value beyond existing coarse-grained hallucination benchmarks.
major comments (2)
- [Dataset construction] Dataset construction section: The 6.9k outputs annotated by 211 annotators form the primary ground truth for the six error types and the supervised training signal. The manuscript does not report inter-annotator agreement (e.g., Fleiss' kappa or span-level overlap), which is required to confirm label reliability; low agreement would directly undermine both the training of ZINA and the validity of the outperformance claims versus GPT-4o and Llama-3.2.
- [Synthetic data generation] Synthetic data generation section: The 20k graph-based samples are presented as capturing dependencies among the six error types. Without a quantitative comparison of the induced co-occurrence statistics to those observed in the human-annotated subset (or a human validation study), it is unclear whether the synthetic distribution faithfully reproduces real MLLM hallucination patterns; any mismatch would make the reported metrics on the combined dataset unreliable.
minor comments (1)
- [Abstract] Abstract: The claim of outperformance is stated without any numerical metrics, confidence intervals, or statistical tests; adding at least the primary F1 or accuracy deltas would strengthen the summary.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate the suggested improvements.
read point-by-point responses
-
Referee: [Dataset construction] Dataset construction section: The 6.9k outputs annotated by 211 annotators form the primary ground truth for the six error types and the supervised training signal. The manuscript does not report inter-annotator agreement (e.g., Fleiss' kappa or span-level overlap), which is required to confirm label reliability; low agreement would directly undermine both the training of ZINA and the validity of the outperformance claims versus GPT-4o and Llama-3.2.
Authors: We agree that reporting inter-annotator agreement is necessary to substantiate the reliability of the annotations. In the revised manuscript, we will add a new paragraph in the Dataset Construction section that reports Fleiss' kappa for error-type classification and span-level agreement metrics (e.g., token overlap) computed on a held-out subset of annotations. This addition will directly address concerns about label quality and support the validity of our experimental results. revision: yes
-
Referee: [Synthetic data generation] Synthetic data generation section: The 20k graph-based samples are presented as capturing dependencies among the six error types. Without a quantitative comparison of the induced co-occurrence statistics to those observed in the human-annotated subset (or a human validation study), it is unclear whether the synthetic distribution faithfully reproduces real MLLM hallucination patterns; any mismatch would make the reported metrics on the combined dataset unreliable.
Authors: We acknowledge the value of explicitly validating the synthetic distribution. In the revision, we will include a quantitative comparison of pairwise error-type co-occurrence frequencies between the 20k synthetic samples and the human-annotated subset, presented as a table or figure in the Synthetic Data Generation section. We will also report results from a small human validation study in which annotators rate the plausibility of a sample of synthetic instances. These additions will clarify the fidelity of the graph-based synthesis approach. revision: yes
Circularity Check
No circularity: empirical method and dataset construction with no derivations or self-referential reductions
full rationale
The paper introduces a new task of fine-grained multimodal hallucination detection and editing, proposes the ZINA method, and constructs the VisionHall dataset using 211 human annotators for 6.9k real MLLM outputs plus 20k graph-based synthetic samples. Evaluation compares ZINA against baselines including GPT-4o and Llama-3.2 on detection and editing metrics. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims rest on empirical performance numbers obtained from the newly labeled and synthesized data rather than any reduction of outputs to inputs by construction. This is a standard empirical ML paper whose results are falsifiable against external benchmarks and do not exhibit the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Hallucinations in multimodal outputs can be meaningfully partitioned into six distinct error categories
Reference graph
Works this paper leans on
-
[1]
Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923. Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. 2024b. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930. Anthony Chen, Panupong Pasupat, Sameer Singh, Hon- grae Lee, and Kelvin Guu. 2023. Purr: Efficiently...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
arXiv preprint arXiv:2305.04790 , year=
InstructBLIP: Towards General-Purpose Vision-Language Models with Instruction Tuning. In NeurIPS. Jacob Devlin, Ming-Wei Chang, and 1 others. 2019. BERT: Pre-training of Deep Bidirectional Transform- ers for Language Understanding. In NAACL, pages 4171–4186. Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang,...
-
[3]
Object Hallucination in Image Captioning
Selfcheckgpt: Zero-resource black-box hal- lucination detection for generative large language models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained ato...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Gemini: A Family of Highly Capable Multimodal Models
Gemini: A Family of Highly Capable Multi- modal Models. arXiv preprint arXiv:2312.11805. Maxim Tkachenko, Mikhail Malyuk, Andrey Holmanyuk, and Nikolai Liubimov. 2020-
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[5]
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
Label Studio: Data labeling soft- ware. Open source software available from https://github.com/HumanSignal/label-studio. Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, and Adriana Romero- Soriano. 2024. A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions. In CVPR, pages 26700–26709. Yuig...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
HallE-Control: controlling object hallucina- tion in large multimodal models. arXiv preprint arXiv:2310.01779. Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang. 2024. Long-CLIP: Unlocking the Long-Text Capability of CLIP. In ECCV, pages 310– 325. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Wein- berger, and Yoav Artzi. 2020. BERTScore...
-
[7]
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. In ACL (V olume3: System Demonstrations), Bangkok, Thailand. ACL. A Construction of VisionHall Dataset Annotation Process. We employed LabelStudio (Tkachenko et al., 2020-2025) for collecting the VisionHall dataset. Fig. 4 illustrates the annotation interface. We conducted annotations vi...
work page 2020
-
[8]
to ensure reproducibility for the follow- ing models: LLaV A-1.5-7B (Liu et al., 2024a), LLaV A-1.5-13B, Qwen2-VL-7B (Wang et al., 2024), LLaV A-OV-Qwen2-7B (Li et al., 2024a), LLaV A-NeXT-Qwen-32B (Liu et al., 2024b), LLaV A-OV-Qwen2-72B, LLaMA-3.2-90B-Vision- Instruct (Grattafiori et al., 2024), Qwen2.5- VL-72B-Instruct (Bai et al., 2025), and GPT- 4o (...
work page 2024
-
[9]
Following this approach, we reported the average scores of CLIP-S and PAC-S
introduced a modified version of these met- rics, which computes the cosine similarity between 3https://huggingface.co/microsoft/deberta-xlarge-mnli each sentence in the paragraph and the image, and then averages these scores to obtain the final score. Following this approach, we reported the average scores of CLIP-S and PAC-S. E Prompts E.1 Baselines The...
work page 2024
-
[10]
**<spatial_relation> (Special Relation Er- ror)**: Incorrect special relationship between entities or concepts. (Only those involving spatial relationships are permitted, not gram- matical relationship errors.) - Example: "an apple on the table" → "an apple <spatial_relation>under</spatial_relation> the table"
-
[11]
**<attribute> (Attribute Error)**: Incorrect attributes such as adjectives or adverbs describ- ing objects or concepts. - Example: "Red sky" → "<at- tribute>Blue</attribute> sky"
-
[12]
**<object> (Object Error)**: Incorrect spe- cific objects or entities. - Example: "There is a chair on the table." → "There is a <object>book</object> on the table."
- [13]
-
[14]
**<named_entities_fact> (Factual Error)**: Error related to named entities. This tag should not be applied to anything other than named en- tities. For example, "John F. Kennedy Center" is acceptable, but "center" is not. - Example: "The image is of the John F. Kennedy Center." → "The image is of the <named_entities_fact>white house</named_entities_fact>."
-
[15]
A car is parked under a sign that says ’Restaurant’
**<text> (Text Error)**: Incorrect scene texts (text that appears in an image). - Example: "A car is parked under a sign that says ’Restaurant’."→ "A car is parked under a sign that says <text>’Hotel’</text>." ## Example Passage: [few-shot-example] Reference: [few-shot-example] Edited: [few-shot-example] ## Task Now detect errors and include edits in the ...
work page 2024
-
[16]
**spatial_relation (Spatial Relation Error):** - Modify only prepositions that indicate spatial relationships (no grammatical errors allowed). - *Example:* ‘"an apple on the table"‘→ ‘"an apple <spatial_relation original="on" id="E1">under</spatial_relation> the table"‘
-
[17]
**attribute (Attribute Error):** - Change descriptive attributes such as adjectives or adverbs. - *Example:* ‘"Red sky"‘→ ‘"<attribute original="Red" id="E1">Blue</attribute> sky"‘
-
[18]
**object (Object Error):** - Replace names of generic objects or entities that are common nouns (i.e., not proper names). - *Example:* ‘"There is a chair on the table."‘→ ‘"There is a <object original="chair" id="E1">book</object> on the table."‘ - Note: For any proper noun or official name (e.g., "John F. Kennedy Center" or "Fuji"), use the fact error as...
-
[19]
The image is of the John F. Kennedy Center
**fact (Factual Error):** - Replace proper, named entities with incorrect names. Apply this error type only to entities that are proper names (e.g., formal institution names, landmarks, or official titles). - *Example:* ‘"The image is of the John F. Kennedy Center."‘→ ‘"The image is of the <named_entities_fact original="John F. Kennedy Center" id="E1">whi...
-
[20]
**number (Number Error):** - Change numerical values or quantities. - *Example:* ‘"3 cats"‘→ ‘"<number original="3" id="E1">5</number> cats"‘ - If you change a singular quantity to plural (or vice versa), make sure the associated noun reflects the correct number (i.e., use plural form for plural numbers and singular form for singular numbers). No annotati...
-
[21]
A car is parked under a sign that says ’Restaurant’
**text (Text Error):** - Change scene text (i.e., text that appears within the image). - *Example:* ‘"A car is parked under a sign that says ’Restaurant’."‘→ ‘"A car is parked under a sign that says <text original="’Restaurant’" id="E1">’Hotel’</text>."‘ - **Output Format:** - Provide the modified sentence along with error annotations in the following JSO...
- [22]
-
[23]
**Modified Sentence Requirements:** - Confirm that the ‘"modified"‘ value is the candidate sentence with error annotations inserted. - Check that only words or phrases from the candidate sentence have been modified—no new content should be added. - *Note:* The meaning of the ‘"modified"‘ value may differ from the original
-
[24]
**Error Annotation Types:** - Confirm that only the following six error types are used: **spatial_relation, attribute, object, number, named_entities_fact, text**. - Verify that each error annotation exactly matches one of these types and is applied only in the appropriate context (for example, text errors are used only for scene text)
-
[25]
**Error Tag Format and Consistency:** - Each error annotation must use the XML-like tag format: “‘xml <error_type original="original text" id="E#">modified text</error_type> “‘ - Ensure that: - The tag name is exactly one of the specified error types. - The ‘original‘ attribute contains the exact original word or phrase from the candidate sentence. - The ...
-
[26]
**Consistency in Modifications:** - Verify that if a word is modified in one location, every occurrence of that word in the sentence is modified consistently. - Ensure that the same error tag (and, if applicable, the same ‘parent‘ attribute) is used for all repeated instances of that word. - When one modification depends on another error, ensure that a pr...
-
[27]
a” with a number tag representing “one
**No Semantically Equivalent Replacements:** - Ensure that error annotations do not result in semantically equivalent replacements (e.g., do not replace “a” with a number tag representing “one”)
-
[28]
**Singular/Plural Adjustments:** - If changing a singular quantity to plural (or vice versa), ensure that the associated noun reflects the correct number (i.e., use plural form for plural numbers and singular form for singular numbers). - When modifying the noun’s number, the change must be performed via a number error tag applied to the noun, with the nu...
-
[29]
**No Article Swapping:** - Modifications that swap articles (e.g., changing "a" to "the" or vice versa) are not permitted
-
[30]
**No Unjustified Modifications:** - Do not allow changes that cannot be determined solely from the original word. For example, altering ‘two bible ’Guide’‘ to ‘two <named_entities_fact>quran</named_entities_fact> <text>bible</text>‘ is unacceptable, as it is unrealistic to deduce the appropriate modification for "bible" from the candidate sentence alone
-
[31]
**No Matching Words Between Tagged and Untagged Text:** - Ensure that the words within the error tags do not exactly match the corresponding words in the original, untagged text. For instance, transforming "To the left" into ‘<spatial_relation>To the right</spatial_relation>‘ is incorrect because the words "To" and "the" are repeated. The proper transform...
-
[32]
**Application of Error Annotations for Generic Objects or Entities:** - Fundamental Principle: For generic objects or entities expressed by common nouns (i.e., non-proper nouns), do not use the named_entities_fact tag. Always use the object tag. - named_entities_fact (Factual Error): This tag should only be used for proper names such as formal institution...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.