pith. sign in

arxiv: 2506.13130 · v2 · submitted 2025-06-16 · 💻 cs.CV · cs.AI· cs.CL

ZINA: Multimodal Fine-grained Hallucination Detection and Editing

Pith reviewed 2026-05-19 10:02 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CL
keywords multimodal hallucinationfine-grained detectionhallucination editingMLLMVisionHallerror classificationsynthetic datagraph-based generation
0
0 comments X

The pith

ZINA identifies hallucinated spans in multimodal outputs, classifies them into six error types, and suggests refinements, outperforming GPT-4o and Llama-3.2.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a new task of detecting and editing fine-grained hallucinations in multimodal large language models. ZINA is introduced as a method to locate hallucinated text segments, categorize errors into six specific types, and recommend edits. The authors create the VisionHall dataset with 6.9k human-annotated outputs from twelve MLLMs and 20k graph-based synthetic samples that account for error dependencies. Experiments show ZINA surpasses existing methods including GPT-4o and Llama-3.2 on both detection and editing. This work addresses the problem of unreliable outputs in vision-language AI by providing tools for detailed analysis and correction.

Core claim

ZINA is a novel method that identifies hallucinated spans at a fine-grained level, classifies their error types into six categories, and suggests appropriate refinements. To support training and evaluation, the VisionHall dataset is constructed with 6.9k outputs from twelve MLLMs manually annotated by 211 annotators and 20k synthetic samples generated using a graph-based method that captures dependencies among error types. ZINA is shown to outperform existing methods, including GPT-4o and Llama-3.2, in both detection and editing tasks.

What carries the argument

ZINA, the method that identifies hallucinated spans, classifies error types into six categories, and suggests refinements based on graph-modeled dependencies.

If this is right

  • Fine-grained detection enables more precise evaluation of MLLM outputs beyond coarse metrics.
  • Editing suggestions can automatically refine outputs to reduce hallucinations.
  • The VisionHall dataset provides a benchmark for future hallucination mitigation research.
  • Accounting for dependencies among error types improves classification and editing reliability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • ZINA could be integrated into MLLM inference pipelines for on-the-fly correction in applications.
  • The six error categories offer a taxonomy that might inform targeted training strategies to prevent hallucinations.
  • The graph-based synthesis technique could extend to creating datasets for related issues like factual errors in other modalities.

Load-bearing premise

The manual annotations by 211 annotators on 6.9k outputs accurately reflect fine-grained hallucinations, and the graph-based synthetic samples of 20k examples correctly capture dependencies among the six error types.

What would settle it

Independent annotations on a new set of MLLM outputs where ZINA shows no accuracy advantage over GPT-4o in span detection or error classification would falsify the performance superiority.

Figures

Figures reproduced from arXiv: 2506.13130 by Graham Neubig, Kazuki Matsuda, Komei Sugiura, Yuiga Wada.

Figure 1
Figure 1. Figure 1: Overview of the proposed task. In contrast [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the graph-based synthetic data generation process. We first obtain seed descriptions by [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative results on the VisionHall dataset. Each subfigure shows the image and the edited descriptions [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Annotation interface used for the VisionHall [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
read the original abstract

Multimodal Large Language Models (MLLMs) often generate hallucinations, where the output deviates from the visual content. Given that these hallucinations can take diverse forms, detecting hallucinations at a fine-grained level is essential for comprehensive evaluation and analysis. To this end, we propose a novel task of multimodal fine-grained hallucination detection and editing for MLLMs. Moreover, we propose ZINA, a novel method that identifies hallucinated spans at a fine-grained level, classifies their error types into six categories, and suggests appropriate refinements. To train and evaluate models for this task, we construct VisionHall, a dataset comprising 6.9k outputs from twelve MLLMs manually annotated by 211 annotators, and 20k synthetic samples generated using a graph-based method that captures dependencies among error types. We demonstrated that ZINA outperformed existing methods, including GPT-4o and Llama-3.2, in both detection and editing tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the task of multimodal fine-grained hallucination detection and editing for MLLMs. It proposes ZINA, which detects hallucinated spans, classifies them into six error types, and suggests refinements. The authors construct the VisionHall dataset comprising 6.9k human-annotated outputs from twelve MLLMs (by 211 annotators) plus 20k graph-based synthetic samples that model dependencies among error types. They report that ZINA outperforms baselines including GPT-4o and Llama-3.2 on both detection and editing.

Significance. If the evaluation holds, the work provides a useful fine-grained framework and dataset for analyzing and mitigating hallucinations in vision-language models. The graph-based synthesis approach for capturing error-type dependencies is a constructive contribution that could support more realistic training data. Demonstrating editing capability alongside detection adds practical value beyond existing coarse-grained hallucination benchmarks.

major comments (2)
  1. [Dataset construction] Dataset construction section: The 6.9k outputs annotated by 211 annotators form the primary ground truth for the six error types and the supervised training signal. The manuscript does not report inter-annotator agreement (e.g., Fleiss' kappa or span-level overlap), which is required to confirm label reliability; low agreement would directly undermine both the training of ZINA and the validity of the outperformance claims versus GPT-4o and Llama-3.2.
  2. [Synthetic data generation] Synthetic data generation section: The 20k graph-based samples are presented as capturing dependencies among the six error types. Without a quantitative comparison of the induced co-occurrence statistics to those observed in the human-annotated subset (or a human validation study), it is unclear whether the synthetic distribution faithfully reproduces real MLLM hallucination patterns; any mismatch would make the reported metrics on the combined dataset unreliable.
minor comments (1)
  1. [Abstract] Abstract: The claim of outperformance is stated without any numerical metrics, confidence intervals, or statistical tests; adding at least the primary F1 or accuracy deltas would strengthen the summary.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: The 6.9k outputs annotated by 211 annotators form the primary ground truth for the six error types and the supervised training signal. The manuscript does not report inter-annotator agreement (e.g., Fleiss' kappa or span-level overlap), which is required to confirm label reliability; low agreement would directly undermine both the training of ZINA and the validity of the outperformance claims versus GPT-4o and Llama-3.2.

    Authors: We agree that reporting inter-annotator agreement is necessary to substantiate the reliability of the annotations. In the revised manuscript, we will add a new paragraph in the Dataset Construction section that reports Fleiss' kappa for error-type classification and span-level agreement metrics (e.g., token overlap) computed on a held-out subset of annotations. This addition will directly address concerns about label quality and support the validity of our experimental results. revision: yes

  2. Referee: [Synthetic data generation] Synthetic data generation section: The 20k graph-based samples are presented as capturing dependencies among the six error types. Without a quantitative comparison of the induced co-occurrence statistics to those observed in the human-annotated subset (or a human validation study), it is unclear whether the synthetic distribution faithfully reproduces real MLLM hallucination patterns; any mismatch would make the reported metrics on the combined dataset unreliable.

    Authors: We acknowledge the value of explicitly validating the synthetic distribution. In the revision, we will include a quantitative comparison of pairwise error-type co-occurrence frequencies between the 20k synthetic samples and the human-annotated subset, presented as a table or figure in the Synthetic Data Generation section. We will also report results from a small human validation study in which annotators rate the plausibility of a sample of synthetic instances. These additions will clarify the fidelity of the graph-based synthesis approach. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method and dataset construction with no derivations or self-referential reductions

full rationale

The paper introduces a new task of fine-grained multimodal hallucination detection and editing, proposes the ZINA method, and constructs the VisionHall dataset using 211 human annotators for 6.9k real MLLM outputs plus 20k graph-based synthetic samples. Evaluation compares ZINA against baselines including GPT-4o and Llama-3.2 on detection and editing metrics. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims rest on empirical performance numbers obtained from the newly labeled and synthesized data rather than any reduction of outputs to inputs by construction. This is a standard empirical ML paper whose results are falsifiable against external benchmarks and do not exhibit the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the six error categories form a useful and exhaustive taxonomy and that human annotations plus graph-generated samples provide a reliable training signal; no explicit free parameters or invented entities are stated in the abstract.

axioms (1)
  • domain assumption Hallucinations in multimodal outputs can be meaningfully partitioned into six distinct error categories
    ZINA classifies detected spans into six categories; the taxonomy is treated as given without further justification in the abstract.

pith-pipeline@v0.9.0 · 5706 in / 1238 out tokens · 59205 ms · 2026-05-19T10:02:45.272735+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 4 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923. Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. 2024b. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930. Anthony Chen, Panupong Pasupat, Sameer Singh, Hon- grae Lee, and Kelvin Guu. 2023. Purr: Efficiently...

  2. [2]

    arXiv preprint arXiv:2305.04790 , year=

    InstructBLIP: Towards General-Purpose Vision-Language Models with Instruction Tuning. In NeurIPS. Jacob Devlin, Ming-Wei Chang, and 1 others. 2019. BERT: Pre-training of Deep Bidirectional Transform- ers for Language Understanding. In NAACL, pages 4171–4186. Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang,...

  3. [3]

    Object Hallucination in Image Captioning

    Selfcheckgpt: Zero-resource black-box hal- lucination detection for generative large language models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained ato...

  4. [4]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini: A Family of Highly Capable Multi- modal Models. arXiv preprint arXiv:2312.11805. Maxim Tkachenko, Mikhail Malyuk, Andrey Holmanyuk, and Nikolai Liubimov. 2020-

  5. [5]

    AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

    Label Studio: Data labeling soft- ware. Open source software available from https://github.com/HumanSignal/label-studio. Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, and Adriana Romero- Soriano. 2024. A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions. In CVPR, pages 26700–26709. Yuig...

  6. [6]

    Halle-switch: Rethinking and controlling object existence hallucinations in large vision language models for detailed caption.arXiv preprint arXiv:2310.01779,

    HallE-Control: controlling object hallucina- tion in large multimodal models. arXiv preprint arXiv:2310.01779. Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang. 2024. Long-CLIP: Unlocking the Long-Text Capability of CLIP. In ECCV, pages 310– 325. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Wein- berger, and Yoav Artzi. 2020. BERTScore...

  7. [7]

    an apple on the table

    LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. In ACL (V olume3: System Demonstrations), Bangkok, Thailand. ACL. A Construction of VisionHall Dataset Annotation Process. We employed LabelStudio (Tkachenko et al., 2020-2025) for collecting the VisionHall dataset. Fig. 4 illustrates the annotation interface. We conducted annotations vi...

  8. [8]

    C Licenses and Intended Use ZINA is released under the BSD 3-Clause License, and the VisionHall dataset is released under the CC BY-NC 4.0 License

    to ensure reproducibility for the follow- ing models: LLaV A-1.5-7B (Liu et al., 2024a), LLaV A-1.5-13B, Qwen2-VL-7B (Wang et al., 2024), LLaV A-OV-Qwen2-7B (Li et al., 2024a), LLaV A-NeXT-Qwen-32B (Liu et al., 2024b), LLaV A-OV-Qwen2-72B, LLaMA-3.2-90B-Vision- Instruct (Grattafiori et al., 2024), Qwen2.5- VL-72B-Instruct (Bai et al., 2025), and GPT- 4o (...

  9. [9]

    Following this approach, we reported the average scores of CLIP-S and PAC-S

    introduced a modified version of these met- rics, which computes the cosine similarity between 3https://huggingface.co/microsoft/deberta-xlarge-mnli each sentence in the paragraph and the image, and then averages these scores to obtain the final score. Following this approach, we reported the average scores of CLIP-S and PAC-S. E Prompts E.1 Baselines The...

  10. [10]

    an apple on the table

    **<spatial_relation> (Special Relation Er- ror)**: Incorrect special relationship between entities or concepts. (Only those involving spatial relationships are permitted, not gram- matical relationship errors.) - Example: "an apple on the table" → "an apple <spatial_relation>under</spatial_relation> the table"

  11. [11]

    Red sky" →

    **<attribute> (Attribute Error)**: Incorrect attributes such as adjectives or adverbs describ- ing objects or concepts. - Example: "Red sky" → "<at- tribute>Blue</attribute> sky"

  12. [12]

    There is a chair on the table

    **<object> (Object Error)**: Incorrect spe- cific objects or entities. - Example: "There is a chair on the table." → "There is a <object>book</object> on the table."

  13. [13]

    3 cats" →

    **<number> (Number Error)**: Incorrect quantities or numerical values. - Example: "3 cats" → "<number>5</number> cats"

  14. [14]

    John F. Kennedy Center

    **<named_entities_fact> (Factual Error)**: Error related to named entities. This tag should not be applied to anything other than named en- tities. For example, "John F. Kennedy Center" is acceptable, but "center" is not. - Example: "The image is of the John F. Kennedy Center." → "The image is of the <named_entities_fact>white house</named_entities_fact>."

  15. [15]

    A car is parked under a sign that says ’Restaurant’

    **<text> (Text Error)**: Incorrect scene texts (text that appears in an image). - Example: "A car is parked under a sign that says ’Restaurant’."→ "A car is parked under a sign that says <text>’Hotel’</text>." ## Example Passage: [few-shot-example] Reference: [few-shot-example] Edited: [few-shot-example] ## Task Now detect errors and include edits in the ...

  16. [16]

    an apple on the table

    **spatial_relation (Spatial Relation Error):** - Modify only prepositions that indicate spatial relationships (no grammatical errors allowed). - *Example:* ‘"an apple on the table"‘→ ‘"an apple <spatial_relation original="on" id="E1">under</spatial_relation> the table"‘

  17. [17]

    Red sky"‘→ ‘

    **attribute (Attribute Error):** - Change descriptive attributes such as adjectives or adverbs. - *Example:* ‘"Red sky"‘→ ‘"<attribute original="Red" id="E1">Blue</attribute> sky"‘

  18. [18]

    There is a chair on the table

    **object (Object Error):** - Replace names of generic objects or entities that are common nouns (i.e., not proper names). - *Example:* ‘"There is a chair on the table."‘→ ‘"There is a <object original="chair" id="E1">book</object> on the table."‘ - Note: For any proper noun or official name (e.g., "John F. Kennedy Center" or "Fuji"), use the fact error as...

  19. [19]

    The image is of the John F. Kennedy Center

    **fact (Factual Error):** - Replace proper, named entities with incorrect names. Apply this error type only to entities that are proper names (e.g., formal institution names, landmarks, or official titles). - *Example:* ‘"The image is of the John F. Kennedy Center."‘→ ‘"The image is of the <named_entities_fact original="John F. Kennedy Center" id="E1">whi...

  20. [20]

    3 cats"‘→ ‘

    **number (Number Error):** - Change numerical values or quantities. - *Example:* ‘"3 cats"‘→ ‘"<number original="3" id="E1">5</number> cats"‘ - If you change a singular quantity to plural (or vice versa), make sure the associated noun reflects the correct number (i.e., use plural form for plural numbers and singular form for singular numbers). No annotati...

  21. [21]

    A car is parked under a sign that says ’Restaurant’

    **text (Text Error):** - Change scene text (i.e., text that appears within the image). - *Example:* ‘"A car is parked under a sign that says ’Restaurant’."‘→ ‘"A car is parked under a sign that says <text original="’Restaurant’" id="E1">’Hotel’</text>."‘ - **Output Format:** - Provide the modified sentence along with error annotations in the following JSO...

  22. [22]

    original

    **Original Sentence Integrity:** - Verify that the value of the ‘"original"‘ key exactly matches the provided candidate sentence. - Ensure that no modifications, additions, or extra words/phrases have been introduced in the ‘"origi- nal"‘ value

  23. [23]

    modified

    **Modified Sentence Requirements:** - Confirm that the ‘"modified"‘ value is the candidate sentence with error annotations inserted. - Check that only words or phrases from the candidate sentence have been modified—no new content should be added. - *Note:* The meaning of the ‘"modified"‘ value may differ from the original

  24. [24]

    - Verify that each error annotation exactly matches one of these types and is applied only in the appropriate context (for example, text errors are used only for scene text)

    **Error Annotation Types:** - Confirm that only the following six error types are used: **spatial_relation, attribute, object, number, named_entities_fact, text**. - Verify that each error annotation exactly matches one of these types and is applied only in the appropriate context (for example, text errors are used only for scene text)

  25. [25]

    ‘xml <error_type original=

    **Error Tag Format and Consistency:** - Each error annotation must use the XML-like tag format: “‘xml <error_type original="original text" id="E#">modified text</error_type> “‘ - Ensure that: - The tag name is exactly one of the specified error types. - The ‘original‘ attribute contains the exact original word or phrase from the candidate sentence. - The ...

  26. [26]

    - Ensure that the same error tag (and, if applicable, the same ‘parent‘ attribute) is used for all repeated instances of that word

    **Consistency in Modifications:** - Verify that if a word is modified in one location, every occurrence of that word in the sentence is modified consistently. - Ensure that the same error tag (and, if applicable, the same ‘parent‘ attribute) is used for all repeated instances of that word. - When one modification depends on another error, ensure that a pr...

  27. [27]

    a” with a number tag representing “one

    **No Semantically Equivalent Replacements:** - Ensure that error annotations do not result in semantically equivalent replacements (e.g., do not replace “a” with a number tag representing “one”)

  28. [28]

    a" id="E1

    **Singular/Plural Adjustments:** - If changing a singular quantity to plural (or vice versa), ensure that the associated noun reflects the correct number (i.e., use plural form for plural numbers and singular form for singular numbers). - When modifying the noun’s number, the change must be performed via a number error tag applied to the noun, with the nu...

  29. [29]

    a" to "the

    **No Article Swapping:** - Modifications that swap articles (e.g., changing "a" to "the" or vice versa) are not permitted

  30. [30]

    **No Unjustified Modifications:** - Do not allow changes that cannot be determined solely from the original word. For example, altering ‘two bible ’Guide’‘ to ‘two <named_entities_fact>quran</named_entities_fact> <text>bible</text>‘ is unacceptable, as it is unrealistic to deduce the appropriate modification for "bible" from the candidate sentence alone

  31. [31]

    To the left

    **No Matching Words Between Tagged and Untagged Text:** - Ensure that the words within the error tags do not exactly match the corresponding words in the original, untagged text. For instance, transforming "To the left" into ‘<spatial_relation>To the right</spatial_relation>‘ is incorrect because the words "To" and "the" are repeated. The proper transform...

  32. [32]

    need_update

    **Application of Error Annotations for Generic Objects or Entities:** - Fundamental Principle: For generic objects or entities expressed by common nouns (i.e., non-proper nouns), do not use the named_entities_fact tag. Always use the object tag. - named_entities_fact (Factual Error): This tag should only be used for proper names such as formal institution...