ReactBench: A Cause-Driven Benchmark for Multimodal Hallucination via Systematic Evaluation

Bohan Jia; Kai Wu; Shaohui Lin; Shizhe Zhou; Tongyun Li; Yan Shen; Yuyang Wu

arxiv: 2605.29579 · v1 · pith:JXION7DYnew · submitted 2026-05-28 · 💻 cs.CV

ReactBench: A Cause-Driven Benchmark for Multimodal Hallucination via Systematic Evaluation

Shizhe Zhou , Bohan Jia , Kai Wu , Yan Shen , Tongyun Li , Yuyang Wu , Shaohui Lin This is my paper

Pith reviewed 2026-06-29 08:23 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal hallucinationMLLM evaluationcause-driven benchmarkadversarial imagesvision-language modelshallucination causesReactBenchchain-of-thought analysis

0 comments

The pith

ReactBench introduces four cause-specific tasks that expose how multimodal models hallucinate due to co-occurrence bias, language priors, and perceptual limits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ReactBench as a benchmark that shifts focus from detecting hallucination outputs to diagnosing their underlying causes in multimodal large language models. It generates adversarial images paired with targeted queries across four tasks to isolate distinct failure modes. Chain-of-Thought analysis then breaks down sub-causes within each response. Evaluations across models show consistent vulnerabilities to these triggers, unlike prior benchmarks that use simpler scenarios. This setup offers a more precise way to identify where robustness improvements are needed.

Core claim

ReactBench generates adversarial images and hallucination-inducing queries for the tasks Relational Erasure, Counterfactual Attribute, Alteration Tracing, and Dense Counting. These systematically target co-occurrence bias, language priors, cross-image comparative perception deficiencies, and fine-grained perceptual bottlenecks. Beyond accuracy scores, Chain-of-Thought reasoning traces fine-grained sub-causes, and evaluations show current MLLMs remain vulnerable to these specific triggers.

What carries the argument

ReactBench benchmark consisting of four tasks that pair generated adversarial images with cause-targeted queries and use Chain-of-Thought evaluation to attribute hallucinations.

If this is right

Developers can diagnose and target specific hallucination causes rather than treating all failures uniformly.
Training methods can incorporate the four tasks to reduce co-occurrence bias and fine-grained perception errors.
Future benchmarks can adopt the adversarial generation and CoT attribution approach to test other model weaknesses.
Model comparisons become more interpretable by reporting performance broken down by cause.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be extended to measure how well fine-tuning on one task reduces errors on the others.
Integration into model release pipelines would allow users to check robustness against these exact triggers before deployment.
Similar cause-driven designs might apply to hallucinations in other modalities such as audio or video.

Load-bearing premise

The adversarial images and queries isolate the four intended causes without introducing other uncontrolled factors that would mix up the attribution.

What would settle it

A controlled experiment in which models achieve high accuracy on all four ReactBench tasks yet continue to produce the corresponding hallucinations when the same causes appear in unmodified real-world images would undermine the benchmark's isolation of causes.

Figures

Figures reproduced from arXiv: 2605.29579 by Bohan Jia, Kai Wu, Shaohui Lin, Shizhe Zhou, Tongyun Li, Yan Shen, Yuyang Wu.

**Figure 2.** Figure 2: Overview of the ReactBench construction pipeline. Starting from task-specific source datasets, we first [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Experimental results of ReactBench: (a) 5D radar chart visualizing model performance across four tasks [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Sub-cause taxonomy for the four tasks in [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Representative sub-cause examples for Relational Erasure. Q: Is the material of the beak at of the bird organic? GT: No A: The bird in the image is a live, natural organism. Specifically, a pigeon or dove.’s beak is a biological structure made of keratin, a protein that is organic in nature. Keratin is produced by the bird’s body and is found in feathers, claws, and beaks. Therefore, the material of the be… view at source ↗

**Figure 6.** Figure 6: Representative sub-cause examples for Counterfactual Attribute. coarse comparison strategy where the model captures global layout but overlooks a subtle local edit. The SC-3 case shows that while a difference is detected, it is incorrectly attributed to the wrong object or edit operation. Dense Counting [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: Representative sub-cause examples for Alteration Tracing. Q: What is the texture of the metal handle of black wok body? GT: 13 A: Let's count the pigeons in the image carefully. Starting from the top left and moving clockwise: 1. Top left: one pigeon flying upward.2. Below that, slightly right: another pigeon with wings spread. 3. Center-top: a pigeon flying horizontally.......8. Bottom center-left: one pi… view at source ↗

**Figure 8.** Figure 8: Representative sub-cause examples for Dense Counting. A.4 Sub-Cause Distribution by Model Tables 6–9 report the per-model sub-cause distribution for all four tasks. These fine-grained breakdowns complement the aggregated statistics in Table 5 of the main paper and reveal notable modellevel variation. B Dataset Construction Details B.1 Representative Examples of Full-QA Instances To provide a concrete v… view at source ↗

**Figure 9.** Figure 9: Prompt template used for LLM-based cooccurrence sub-cause attribution [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 12.** Figure 12: Prompt template used for LLM-based fine [PITH_FULL_IMAGE:figures/full_fig_p015_12.png] view at source ↗

**Figure 13.** Figure 13: Representative full-QA instance for Relational Erasure, showing the original image, the edited image with a co-occurring object removed, and the corresponding MC / SA / YN questions with ground-truth answers [PITH_FULL_IMAGE:figures/full_fig_p016_13.png] view at source ↗

**Figure 14.** Figure 14: Representative full-QA instance for Counterfactual Attribute, showing the original image, the edited image with a dominant attribute replaced by a counterfactual alternative, and the corresponding questions [PITH_FULL_IMAGE:figures/full_fig_p017_14.png] view at source ↗

**Figure 15.** Figure 15: Representative full-QA instance for Alteration Tracing, showing the before-and-after image pair and the corresponding questions targeting change detection and localization. attribution are LLM-based, which may be significantly influenced by the capability of the underlying model. G Ethics Statement Data Sources and Intended Use. All images used in ReactBench are sourced from publicly available research … view at source ↗

**Figure 16.** Figure 16: Representative full-QA instance for Dense Counting, showing a densely populated scene and the corresponding counting questions. H The Use of LLMs We acknowledge the use of large language models in the following aspects of this work: An LLM was used to refine selected sentences for improved academic fluency and clarity. All polished content was manually reviewed and revised by the authors. An LLM assiste… view at source ↗

**Figure 17.** Figure 17: Prompt template for Relational Erasure instruction generation. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_17.png] view at source ↗

**Figure 18.** Figure 18: Prompt template for Relational Erasure QA generation based on original–edited image pairs. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_18.png] view at source ↗

**Figure 19.** Figure 19: Prompt template for Counterfactual Attribute instruction generation. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_19.png] view at source ↗

**Figure 20.** Figure 20: Prompt template for Counterfactual Attribute QA generation based on original–edited image pairs. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_20.png] view at source ↗

**Figure 21.** Figure 21: Prompt template for Counterfactual Attribute QA generation based on original–edited image pairs– part2. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_21.png] view at source ↗

**Figure 22.** Figure 22: Prompt template for Alteration Tracing instruction and QA generation. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_22.png] view at source ↗

**Figure 23.** Figure 23: Prompt template for Dense Counting filtering and quality control. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_23.png] view at source ↗

**Figure 24.** Figure 24: Prompt template used for LLM-based evaluation of model responses. [PITH_FULL_IMAGE:figures/full_fig_p028_24.png] view at source ↗

read the original abstract

While multimodal large language models (MLLMs) have achieved rapid progress in vision-language understanding, they remain prone to multimodal hallucinations, producing responses that are inconsistent with the visual input. Existing benchmarks predominantly focus on detecting hallucination outcomes rather than evaluating the underlying causes of these failures. Moreover, many benchmarks rely on simplistic scenarios and limited evaluation formats that no longer challenge state-of-the-art models. To address these limitations, we introduce ReactBench, a cause-driven hallucination benchmark featuring multiple tasks and an exam-style evaluation format. By generating adversarial images and hallucination-inducing queries, ReactBench introduces four targeted tasks: Relational Erasure, Counterfactual Attribute, Alteration Tracing, and Dense Counting. These tasks systematically expose co-occurrence bias, language priors, cross-image comparative perception deficiencies, and fine-grained perceptual bottlenecks. Beyond standard accuracy-based evaluation, we leverage Chain-of-Thought reasoning to identify fine-grained sub-causes of hallucination within each task. Extensive evaluations reveal that current MLLMs remain notably vulnerable to cause-specific hallucination triggers, demonstrating the value of ReactBench as a systematic and interpretable testbed for diagnosing and improving multimodal model robustness. The project page is available at https://reactbench.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ReactBench defines four new tasks to probe specific hallucination causes in MLLMs, but the abstract gives no evidence that the adversarial generation actually isolates those causes from confounders.

read the letter

The paper's core move is to build ReactBench around four tasks—Relational Erasure, Counterfactual Attribute, Alteration Tracing, and Dense Counting—plus an exam-style format that uses Chain-of-Thought to label sub-causes. This is a direct attempt to move past benchmarks that only flag whether a model hallucinates and instead ask why.

The tasks target recognizable failure modes: co-occurrence bias, language priors, cross-image comparison, and fine-grained perception. Framing the evaluation this way is a reasonable response to the complaint that current tests are too coarse or too easy for frontier models.

The limitation is straightforward. The abstract claims the generated images and queries “systematically expose” the intended causes, yet it supplies no quantitative check, human validation of cause purity, or ablation that removes only the target factor. If the image synthesis or query wording introduces correlated artifacts, the observed failures cannot be cleanly attributed. That gap sits at the center of the contribution.

The work is aimed at groups that build or audit MLLMs and need diagnostic tools rather than just accuracy numbers. A reader already working on hallucination mitigation will find the task definitions useful to consider, even before the results are fully trusted.

It is worth sending to peer review. The idea is clear enough and the gap it names is real; referees can pressure the authors on the isolation question without the paper being dismissed outright.

Referee Report

1 major / 2 minor

Summary. The paper introduces ReactBench, a cause-driven benchmark for multimodal hallucinations in MLLMs. It defines four tasks—Relational Erasure, Counterfactual Attribute, Alteration Tracing, and Dense Counting—generated via adversarial images and queries to target co-occurrence bias, language priors, cross-image comparative perception deficiencies, and fine-grained perceptual bottlenecks. The benchmark employs an exam-style format and Chain-of-Thought reasoning to identify sub-causes, with evaluations claiming that current MLLMs remain vulnerable to these specific triggers, positioning ReactBench as a systematic diagnostic testbed.

Significance. If the tasks cleanly isolate the four claimed causes without confounding artifacts from image synthesis or query construction, ReactBench would provide a more interpretable alternative to outcome-focused hallucination benchmarks and could guide targeted robustness improvements in vision-language models.

major comments (1)

[Abstract] Abstract: the central claim that the four tasks 'systematically expose' the intended causes (co-occurrence bias, language priors, cross-image perception, fine-grained bottlenecks) rests on unshown validation. No quantitative isolation check, human study of cause purity, or ablation removing only the target factor is referenced, so model failures cannot be attributed to the claimed mechanisms rather than generation artifacts.

minor comments (2)

The project page URL is given but no details on how the adversarial image generation pipeline or query templates can be reproduced or extended.
Notation for the four causes and their mapping to tasks could be tabulated for clarity in the task overview section.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the validation of cause isolation. We address the single major comment below and will revise the manuscript to strengthen the supporting evidence.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that the four tasks 'systematically expose' the intended causes (co-occurrence bias, language priors, cross-image perception, fine-grained bottlenecks) rests on unshown validation. No quantitative isolation check, human study of cause purity, or ablation removing only the target factor is referenced, so model failures cannot be attributed to the claimed mechanisms rather than generation artifacts.

Authors: We agree that the abstract's phrasing asserts systematic exposure of the four targeted causes, and the manuscript does not include an explicit quantitative isolation check, human study of cause purity, or ablation that removes only the target factor. Task construction (Sections 3.1–3.4) relies on adversarial image generation and query design intended to isolate each cause (e.g., Relational Erasure removes relational cues while preserving object identity and attributes), and Section 4 reports elevated hallucination rates together with CoT sub-cause analysis. However, these elements do not constitute the formal validation the referee requests. We will add a new subsection (and corresponding appendix) containing (i) a human study in which annotators rate whether each task instance primarily triggers the intended cause and (ii) an ablation comparing targeted versus non-targeted variants. The abstract will be updated to reflect the added evidence. This addresses the concern directly. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark construction without derivation or self-referential reduction

full rationale

The paper defines ReactBench via four explicit tasks (Relational Erasure, Counterfactual Attribute, Alteration Tracing, Dense Counting) that generate adversarial images and queries to target specific hallucination causes. No equations, fitted parameters, or predictions appear; the central claim is simply that the constructed benchmark exposes the named causes. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. The evaluation results on existing MLLMs are independent measurements and do not reduce to the benchmark definition by construction. This is a standard self-contained benchmark proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Benchmark papers rest on the domain assumption that human-designed adversarial examples can reliably probe model internals; no free parameters or invented entities are introduced.

axioms (1)

domain assumption Adversarial images and queries can be constructed to target isolated hallucination causes without confounding effects
Central to the claim that the four tasks expose specific triggers

pith-pipeline@v0.9.1-grok · 5762 in / 1205 out tokens · 15361 ms · 2026-06-29T08:23:54.214302+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

63 extracted references · 3 canonical work pages · 3 internal anchors

[1]

Hallucination of Multimodal Large Language Models: A Survey

Hallucination of multimodal large language models: A survey.arXiv preprint arXiv:2404.18930. Assaf Ben-Kish, Moran Yanuka, Morris Alper, Raja Giryes, and Hadar Averbuch-Elor. 2024. Mitigating open-vocabulary caption hallucinations. InProceed- ings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22680–22698. Nitzan Bitton-...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 14303–14312

Multi-modal hallucination control by vi- sual information grounding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 14303–14312. Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, and 1 others. 2024. Hallusionbench: an advanced diagnos...

2024
[3]

Measuring Massive Multitask Language Understanding

Measuring massive multitask language under- standing.arXiv preprint arXiv:2009.03300. Hongyu Hu, Jiyuan Zhang, Minyi Zhao, and Zhenbang Sun. 2023. Ciem: Contrastive instruction evaluation method for better instruction tuning.arXiv preprint arXiv:2309.02301. Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, a...

work page internal anchor Pith review Pith/arXiv arXiv 2009
[4]

InProceedings of the 32nd ACM International Conference on Multimedia, pages 525– 534

Hal-eval: A universal and fine-grained hallu- cination evaluation framework for large vision lan- guage models. InProceedings of the 32nd ACM International Conference on Multimedia, pages 525– 534. Prannay Kaul, Zhizhong Li, Hao Yang, Yonatan Dukler, Ashwin Swaminathan, CJ Taylor, and Stefano Soatto
[5]

When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

Throne: An object-based hallucination bench- mark for the free-form generations of large vision- language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, pages 27228–27238. Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson, and Matthieu Cord. 2026. When prompts override vision: P...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[6]

Choose exactly ONE primary sub-cause (SC-1, SC-2, or SC-3)
[7]

Base your judgment on BOTH the image (spatial layout of objects) AND the model's reasoning text (what strategy the model used)
[8]

If the model's reasoning shows BOTH shortcut calculation AND enumeration errors, choose the one that is the PRIMARY driver of the wrong answer
[9]

"" USER_PROMPT_TEMPLATE =

Output your answer in the required JSON format.""" USER_PROMPT_TEMPLATE = """## Task Information - **Object to count**: {object_name} - **Ground truth count**: {gt_count} - **Question**: {question} - **Question type**: {question_type} ## Model's Response (WRONG) {model_response} ## Model's Answer: {extracted_answer} ## Correct Answer: {ground_truth} --- P...

1967
[10]

Was the specified object correctly and completely removed in the edited image?
[11]

Are there obvious visual artifacts, unnatural patches, or ghostly remnants at the deletion area?
[12]

Is the rest of the scene (background, other objects) properly preserved?
[13]

Would these issues affect the ability to ask clear questions about the deleted object? === PHASE 2: SCENE ANALYSIS (focus on EDITED image) ===
[14]

Describe the PRECISE spatial location where the deleted object used to be (look at original image for reference, describe relative to landmarks visible in BOTH images)
[15]

Be very specific — look at the edited image pixel-level

In the EDITED image, what is ACTUALLY VISIBLE at that exact location now? (e.g., background wall revealed, floor surface, another object that was behind/beneath, or truly empty space). Be very specific — look at the edited image pixel-level
[16]

Does the SAME TYPE of object (as the deleted one) still exist ANYWHERE ELSE in the edited image? Look very carefully at every part of the image
[17]

none"}} discard_instance: {{yes / no}} discard_reason: {{reason if yes, or

If yes: describe exactly where, and assess whether the deleted location vs. remaining location can be UNAMBIGUOUSLY distinguished using language alone (without coordinates). === PHASE 3: STRUCTURED OUTPUT === Output ALL fields below, strictly one per line, no markdown, no asterisks, no extra text before or after the fields. IMPORTANT: ALL field values MUS...
[18]

After editing, the object must still be clearly recognizable as the original object
[19]

The counter-commonsense aspect should be reflected in a key attribute, not by turning it into an entirely different object
[20]

Preserve the target object's core identity features, main structure, and recognition cues
[21]

Keep the background, composition, position, pose, proportions, and surrounding objects unchanged
[22]

counter-commonsense

The editing result must look like a real photograph. Task Requirements: You are now a senior image-editing expert. I will provide an image and a [target object name]. You need to observe the image, locate the object, and determine whether it is suitable for editing. [Step 1: Suitability Assessment] If any of the following conditions apply, output directly...
[23]

Requirement: The object's identity and main structure must still be preserved after replacement

[Material Replacement] Change the object to a material that genuinely exists but would not normally be used for this object. Requirement: The object's identity and main structure must still be preserved after replacement
[24]

Requirement: After the color change, it should look as if the object was originally this color, not as if a coat of paint was applied over it

[Color Shift] Change the object to a color that genuinely exists, is naturally subdued, but should not appear on this object. Requirement: After the color change, it should look as if the object was originally this color, not as if a coat of paint was applied over it
[25]

Requirement: The texture should look as if it is natively part of the object, not pasted on

[Surface Texture / Pattern] Give the object a surface texture, pattern, or skin characteristic that genuinely exists on other objects. Requirement: The texture should look as if it is natively part of the object, not pasted on
[26]

Requirement: Moderate in degree; it should look like a natural real-world state, not a disaster special effect

[State Anomaly] Change the object to a state that this object would not normally be in, but that other real-world objects could plausibly exhibit. Requirement: Moderate in degree; it should look like a natural real-world state, not a disaster special effect
[27]

Requirement: Must still be photorealistic and natural

[Finish / Gloss Reversal] Change the object to a surface finish or gloss state that clearly does not match its nature but genuinely exists. Requirement: Must still be photorealistic and natural
[28]

Requirement: The change must be natural and recognizable; it must not turn the object into an obviously different class of object

[Shape / Structure Counter-Commonsense] Make a counter-commonsense change to the object's shape, proportions, contour, or structure. Requirement: The change must be natural and recognizable; it must not turn the object into an obviously different class of object. Priority guidelines by object type: - Natural creations / food / plants / fruits and vegetabl...
[29]

a different new object

Do not turn the target into "a different new object"
[30]

No high-saturation, high-purity, fluorescent, electric, or neon colors
[31]

No sci-fi elements, glowing elements, LEDs, circuit-trace patterns, or holographic effects
[32]

No obviously surreal physical states, e.g., jelly-like, gel-like, liquefied, lava-like, crystallized, etc
[33]

No dramatized, disaster-style, or wreckage-style treatments
[34]

No metaphorical writing
[35]

The original object must remain identifiable after editing
[36]

Do not change the target to an attribute that is already commonly reasonable in reality
[37]

Do not edit only a trivial small detail Editing Granularity Requirements: - Editing must target the object as a whole, or one complete major region / major surface - Do not change only a tiny corner - But also do not erase the object's identity for the sake of a whole-object edit Instruction Writing Standards:
[38]

Must precisely specify the target object's location in the image to avoid ambiguity
[39]

Must use concrete, physical, photorealistic descriptions
[40]

Descriptions should be moderately specific; do not pile on adjectives
[41]

Each instruction must end with a protection statement that explicitly lists by name the specific objects adjacent to the target object in the image
[42]

"" Figure 19: Prompt template forCounterfactual Attributeinstruction generation. 23 #QA generation —— Attribute1 SYSTEM_PROMPT =

Protection statement format must be: Preserve the [object 1], [object 2], [object 3], and other surrounding objects in the image unchanged; do not alter the background of the image under any circumstances. Output Requirements: Output only the following two lines. Do not output any analysis, explanation, or Markdown: Image editing instruction A: {text} Ima...
[43]

If the instruction text conflicts with the visible result, describe the visible result

Trust the IMAGES over the instruction text. If the instruction text conflicts with the visible result, describe the visible result
[44]

a", "an", or

A_object must be attribute-neutral: - object noun or noun phrase only - no determiners like "a", "an", or "the" - avoid color/material/size/texture words unless they are part of a standard object/part name - avoid pure scene-location words - no B or C words if avoidable - if the edited target is a part, subpart, attached component, local surface region, m...
[45]

First identify the counter-commonsense edited attribute explicitly described in the editing_instruction. If that instructed edited attribute is visually consistent with Image 2, prioritize it: - use it as the basis for C_new_attribute - choose attribute_type to match that instructed edited attribute - if the instruction describes a non-color change such a...
[46]

Only if the instruction-side attribute is clearly inconsistent with the visible edited result, ignore the instruction and instead describe the actually visible edited attribute from Image 2
[47]

B_original_attribute and C_new_attribute must: - be visually grounded - describe the same attribute type - come from Image 1 and Image 2 respectively - be comparable in granularity when possible, but accuracy is more important than brevity - do NOT shorten C_new_attribute merely to match the length of B_original_attribute
[48]

For C_new_attribute specifically: - prioritize fidelity to the editing_instruction and the visible edited result over brevity - C_new_attribute may be a medium-length descriptive phrase if needed - preserve essential modifiers such as material, finish, texture, pattern, transparency, geometry, coating, or condition when visually supported - do NOT replace...
[49]

B_original_attribute must be concrete and visually specific. - Do NOT use vague words such as: natural, normal, ordinary, regular, typical, standard, default, plain, usual, common - Use a concrete visible descriptor in the same slot as C_new_attribute - B_original_attribute may be shorter than C_new_attribute if the original state is visually simpler
[50]

Prefer one of these if appropriate: {attribute_type_suggestions}

attribute_type should be a short natural phrase. Prefer one of these if appropriate: {attribute_type_suggestions}
[51]

What is the [attribute_type] of the [A_object] [D_location]?

Same-slot self-check: After you choose attribute_type, B_original_attribute, and C_new_attribute, internally re-check whether BOTH B and C can naturally answer the SAME question: "What is the [attribute_type] of the [A_object] [D_location]?" If either B or C does NOT naturally fit that same question, revise attribute_type and/or B and/or C until they are ...
[52]

the [A_object] [D_location]

Part-target / self-reference check: After choosing A_object and D_location, internally re-check whether the phrase "the [A_object] [D_location]" naturally refers to the edited target. - If the edited target is only a part, subpart, attached component, local surface region, marking, or contained content of a larger host object, prefer naming that edited ta...
[53]

E_absurd_same_slot must be: - clearly wrong / absurd for this object - NOT be in the same attribute slot as B, C, or absurd_attribute_2 - it may describe a different kind of property entirely - different from both B and C
[54]

absurd_attribute_2 must be: - clearly wrong / absurd for this object - NOT be in the same attribute slot as B, C, or E_absurd_same_slot - it can be an absurd NOUN for this object - different from B, C, E_absurd_same_slot, plausible_wrong_attribute_1, and plausible_wrong_attribute_2
[55]

plausible_wrong_attribute_1 and plausible_wrong_attribute_2 must be: - same attribute type as B and C - plausible but wrong - different from B, C, E_absurd_same_slot, and absurd_attribute_2
[56]

D_location must: - refer to the target object in the edited image - be specific enough to identify the edited target - if only a part, subpart, attachment, marking, local surface region, or contained content is edited, locate that edited target relative to the host object or scene - do NOT create self-referential phrases where the object seems to be locat...
[57]

D_alt_location is a probe region used only for Q7 and Q11
[58]

By default, make D_alt_location a viewer-centric image-region phrase, such as: - in the upper half of the image - in the lower right of the image - near the center of the image - across the bottom of the image - in the foreground - in the background
[59]

It should: - not equal D_location - start with a locative preposition - preferably be broader than D_location - preferably include the area of D_location

Strongly prefer this kind of global image-region D_alt_location first. It should: - not equal D_location - start with a locative preposition - preferably be broader than D_location - preferably include the area of D_location
[60]

In that fallback case, D_alt_location: - must not equal D_location - must start with a locative preposition - must contain NO A_object whose attribute_type is B_original_attribute

Only if such a global image-region D_alt_location would still contain any A_object whose attribute_type remains B_original_attribute, then use a non-global fallback region instead. In that fallback case, D_alt_location: - must not equal D_location - must start with a locative preposition - must contain NO A_object whose attribute_type is B_original_attribute
[61]

wrong_location_1, wrong_location_2, wrong_location_3 must: - each be a locative phrase - each start with a locative preposition - each be clearly wrong for the B-attribute location question in Q11 - all be different from one another - all be different from D_location and D_alt_location
[62]

no" unless absolutely necessary. Only use usable =

Avoid usable = "no" unless absolutely necessary. Only use usable = "no" if you truly cannot determine visually grounded values reliably
[63]

usable":

All text values must be in English. Output strict JSON only, no markdown, no explanation. JSON schema: {{ "usable": "yes" or "no", "unusable_reason": "...", "A_object": "...", "A_number": "singular" or "plural" or "mass", "attribute_type": "...", "B_original_attribute": "...", "C_new_attribute": "...", "D_location": "...", "D_alt_location": "...", "plausi...

[1] [1]

Hallucination of Multimodal Large Language Models: A Survey

Hallucination of multimodal large language models: A survey.arXiv preprint arXiv:2404.18930. Assaf Ben-Kish, Moran Yanuka, Morris Alper, Raja Giryes, and Hadar Averbuch-Elor. 2024. Mitigating open-vocabulary caption hallucinations. InProceed- ings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22680–22698. Nitzan Bitton-...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 14303–14312

Multi-modal hallucination control by vi- sual information grounding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 14303–14312. Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, and 1 others. 2024. Hallusionbench: an advanced diagnos...

2024

[3] [3]

Measuring Massive Multitask Language Understanding

Measuring massive multitask language under- standing.arXiv preprint arXiv:2009.03300. Hongyu Hu, Jiyuan Zhang, Minyi Zhao, and Zhenbang Sun. 2023. Ciem: Contrastive instruction evaluation method for better instruction tuning.arXiv preprint arXiv:2309.02301. Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, a...

work page internal anchor Pith review Pith/arXiv arXiv 2009

[4] [4]

InProceedings of the 32nd ACM International Conference on Multimedia, pages 525– 534

Hal-eval: A universal and fine-grained hallu- cination evaluation framework for large vision lan- guage models. InProceedings of the 32nd ACM International Conference on Multimedia, pages 525– 534. Prannay Kaul, Zhizhong Li, Hao Yang, Yonatan Dukler, Ashwin Swaminathan, CJ Taylor, and Stefano Soatto

[5] [5]

When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

Throne: An object-based hallucination bench- mark for the free-form generations of large vision- language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, pages 27228–27238. Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson, and Matthieu Cord. 2026. When prompts override vision: P...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[6] [6]

Choose exactly ONE primary sub-cause (SC-1, SC-2, or SC-3)

[7] [7]

Base your judgment on BOTH the image (spatial layout of objects) AND the model's reasoning text (what strategy the model used)

[8] [8]

If the model's reasoning shows BOTH shortcut calculation AND enumeration errors, choose the one that is the PRIMARY driver of the wrong answer

[9] [9]

"" USER_PROMPT_TEMPLATE =

Output your answer in the required JSON format.""" USER_PROMPT_TEMPLATE = """## Task Information - **Object to count**: {object_name} - **Ground truth count**: {gt_count} - **Question**: {question} - **Question type**: {question_type} ## Model's Response (WRONG) {model_response} ## Model's Answer: {extracted_answer} ## Correct Answer: {ground_truth} --- P...

1967

[10] [10]

Was the specified object correctly and completely removed in the edited image?

[11] [11]

Are there obvious visual artifacts, unnatural patches, or ghostly remnants at the deletion area?

[12] [12]

Is the rest of the scene (background, other objects) properly preserved?

[13] [13]

Would these issues affect the ability to ask clear questions about the deleted object? === PHASE 2: SCENE ANALYSIS (focus on EDITED image) ===

[14] [14]

Describe the PRECISE spatial location where the deleted object used to be (look at original image for reference, describe relative to landmarks visible in BOTH images)

[15] [15]

Be very specific — look at the edited image pixel-level

In the EDITED image, what is ACTUALLY VISIBLE at that exact location now? (e.g., background wall revealed, floor surface, another object that was behind/beneath, or truly empty space). Be very specific — look at the edited image pixel-level

[16] [16]

Does the SAME TYPE of object (as the deleted one) still exist ANYWHERE ELSE in the edited image? Look very carefully at every part of the image

[17] [17]

none"}} discard_instance: {{yes / no}} discard_reason: {{reason if yes, or

If yes: describe exactly where, and assess whether the deleted location vs. remaining location can be UNAMBIGUOUSLY distinguished using language alone (without coordinates). === PHASE 3: STRUCTURED OUTPUT === Output ALL fields below, strictly one per line, no markdown, no asterisks, no extra text before or after the fields. IMPORTANT: ALL field values MUS...

[18] [18]

After editing, the object must still be clearly recognizable as the original object

[19] [19]

The counter-commonsense aspect should be reflected in a key attribute, not by turning it into an entirely different object

[20] [20]

Preserve the target object's core identity features, main structure, and recognition cues

[21] [21]

Keep the background, composition, position, pose, proportions, and surrounding objects unchanged

[22] [22]

counter-commonsense

The editing result must look like a real photograph. Task Requirements: You are now a senior image-editing expert. I will provide an image and a [target object name]. You need to observe the image, locate the object, and determine whether it is suitable for editing. [Step 1: Suitability Assessment] If any of the following conditions apply, output directly...

[23] [23]

Requirement: The object's identity and main structure must still be preserved after replacement

[Material Replacement] Change the object to a material that genuinely exists but would not normally be used for this object. Requirement: The object's identity and main structure must still be preserved after replacement

[24] [24]

Requirement: After the color change, it should look as if the object was originally this color, not as if a coat of paint was applied over it

[Color Shift] Change the object to a color that genuinely exists, is naturally subdued, but should not appear on this object. Requirement: After the color change, it should look as if the object was originally this color, not as if a coat of paint was applied over it

[25] [25]

Requirement: The texture should look as if it is natively part of the object, not pasted on

[Surface Texture / Pattern] Give the object a surface texture, pattern, or skin characteristic that genuinely exists on other objects. Requirement: The texture should look as if it is natively part of the object, not pasted on

[26] [26]

Requirement: Moderate in degree; it should look like a natural real-world state, not a disaster special effect

[State Anomaly] Change the object to a state that this object would not normally be in, but that other real-world objects could plausibly exhibit. Requirement: Moderate in degree; it should look like a natural real-world state, not a disaster special effect

[27] [27]

Requirement: Must still be photorealistic and natural

[Finish / Gloss Reversal] Change the object to a surface finish or gloss state that clearly does not match its nature but genuinely exists. Requirement: Must still be photorealistic and natural

[28] [28]

Requirement: The change must be natural and recognizable; it must not turn the object into an obviously different class of object

[Shape / Structure Counter-Commonsense] Make a counter-commonsense change to the object's shape, proportions, contour, or structure. Requirement: The change must be natural and recognizable; it must not turn the object into an obviously different class of object. Priority guidelines by object type: - Natural creations / food / plants / fruits and vegetabl...

[29] [29]

a different new object

Do not turn the target into "a different new object"

[30] [30]

No high-saturation, high-purity, fluorescent, electric, or neon colors

[31] [31]

No sci-fi elements, glowing elements, LEDs, circuit-trace patterns, or holographic effects

[32] [32]

No obviously surreal physical states, e.g., jelly-like, gel-like, liquefied, lava-like, crystallized, etc

[33] [33]

No dramatized, disaster-style, or wreckage-style treatments

[34] [34]

No metaphorical writing

[35] [35]

The original object must remain identifiable after editing

[36] [36]

Do not change the target to an attribute that is already commonly reasonable in reality

[37] [37]

Do not edit only a trivial small detail Editing Granularity Requirements: - Editing must target the object as a whole, or one complete major region / major surface - Do not change only a tiny corner - But also do not erase the object's identity for the sake of a whole-object edit Instruction Writing Standards:

[38] [38]

Must precisely specify the target object's location in the image to avoid ambiguity

[39] [39]

Must use concrete, physical, photorealistic descriptions

[40] [40]

Descriptions should be moderately specific; do not pile on adjectives

[41] [41]

Each instruction must end with a protection statement that explicitly lists by name the specific objects adjacent to the target object in the image

[42] [42]

"" Figure 19: Prompt template forCounterfactual Attributeinstruction generation. 23 #QA generation —— Attribute1 SYSTEM_PROMPT =

Protection statement format must be: Preserve the [object 1], [object 2], [object 3], and other surrounding objects in the image unchanged; do not alter the background of the image under any circumstances. Output Requirements: Output only the following two lines. Do not output any analysis, explanation, or Markdown: Image editing instruction A: {text} Ima...

[43] [43]

If the instruction text conflicts with the visible result, describe the visible result

Trust the IMAGES over the instruction text. If the instruction text conflicts with the visible result, describe the visible result

[44] [44]

a", "an", or

A_object must be attribute-neutral: - object noun or noun phrase only - no determiners like "a", "an", or "the" - avoid color/material/size/texture words unless they are part of a standard object/part name - avoid pure scene-location words - no B or C words if avoidable - if the edited target is a part, subpart, attached component, local surface region, m...

[45] [45]

First identify the counter-commonsense edited attribute explicitly described in the editing_instruction. If that instructed edited attribute is visually consistent with Image 2, prioritize it: - use it as the basis for C_new_attribute - choose attribute_type to match that instructed edited attribute - if the instruction describes a non-color change such a...

[46] [46]

Only if the instruction-side attribute is clearly inconsistent with the visible edited result, ignore the instruction and instead describe the actually visible edited attribute from Image 2

[47] [47]

B_original_attribute and C_new_attribute must: - be visually grounded - describe the same attribute type - come from Image 1 and Image 2 respectively - be comparable in granularity when possible, but accuracy is more important than brevity - do NOT shorten C_new_attribute merely to match the length of B_original_attribute

[48] [48]

For C_new_attribute specifically: - prioritize fidelity to the editing_instruction and the visible edited result over brevity - C_new_attribute may be a medium-length descriptive phrase if needed - preserve essential modifiers such as material, finish, texture, pattern, transparency, geometry, coating, or condition when visually supported - do NOT replace...

[49] [49]

B_original_attribute must be concrete and visually specific. - Do NOT use vague words such as: natural, normal, ordinary, regular, typical, standard, default, plain, usual, common - Use a concrete visible descriptor in the same slot as C_new_attribute - B_original_attribute may be shorter than C_new_attribute if the original state is visually simpler

[50] [50]

Prefer one of these if appropriate: {attribute_type_suggestions}

attribute_type should be a short natural phrase. Prefer one of these if appropriate: {attribute_type_suggestions}

[51] [51]

What is the [attribute_type] of the [A_object] [D_location]?

Same-slot self-check: After you choose attribute_type, B_original_attribute, and C_new_attribute, internally re-check whether BOTH B and C can naturally answer the SAME question: "What is the [attribute_type] of the [A_object] [D_location]?" If either B or C does NOT naturally fit that same question, revise attribute_type and/or B and/or C until they are ...

[52] [52]

the [A_object] [D_location]

Part-target / self-reference check: After choosing A_object and D_location, internally re-check whether the phrase "the [A_object] [D_location]" naturally refers to the edited target. - If the edited target is only a part, subpart, attached component, local surface region, marking, or contained content of a larger host object, prefer naming that edited ta...

[53] [53]

E_absurd_same_slot must be: - clearly wrong / absurd for this object - NOT be in the same attribute slot as B, C, or absurd_attribute_2 - it may describe a different kind of property entirely - different from both B and C

[54] [54]

absurd_attribute_2 must be: - clearly wrong / absurd for this object - NOT be in the same attribute slot as B, C, or E_absurd_same_slot - it can be an absurd NOUN for this object - different from B, C, E_absurd_same_slot, plausible_wrong_attribute_1, and plausible_wrong_attribute_2

[55] [55]

plausible_wrong_attribute_1 and plausible_wrong_attribute_2 must be: - same attribute type as B and C - plausible but wrong - different from B, C, E_absurd_same_slot, and absurd_attribute_2

[56] [56]

D_location must: - refer to the target object in the edited image - be specific enough to identify the edited target - if only a part, subpart, attachment, marking, local surface region, or contained content is edited, locate that edited target relative to the host object or scene - do NOT create self-referential phrases where the object seems to be locat...

[57] [57]

D_alt_location is a probe region used only for Q7 and Q11

[58] [58]

By default, make D_alt_location a viewer-centric image-region phrase, such as: - in the upper half of the image - in the lower right of the image - near the center of the image - across the bottom of the image - in the foreground - in the background

[59] [59]

It should: - not equal D_location - start with a locative preposition - preferably be broader than D_location - preferably include the area of D_location

Strongly prefer this kind of global image-region D_alt_location first. It should: - not equal D_location - start with a locative preposition - preferably be broader than D_location - preferably include the area of D_location

[60] [60]

In that fallback case, D_alt_location: - must not equal D_location - must start with a locative preposition - must contain NO A_object whose attribute_type is B_original_attribute

Only if such a global image-region D_alt_location would still contain any A_object whose attribute_type remains B_original_attribute, then use a non-global fallback region instead. In that fallback case, D_alt_location: - must not equal D_location - must start with a locative preposition - must contain NO A_object whose attribute_type is B_original_attribute

[61] [61]

wrong_location_1, wrong_location_2, wrong_location_3 must: - each be a locative phrase - each start with a locative preposition - each be clearly wrong for the B-attribute location question in Q11 - all be different from one another - all be different from D_location and D_alt_location

[62] [62]

no" unless absolutely necessary. Only use usable =

Avoid usable = "no" unless absolutely necessary. Only use usable = "no" if you truly cannot determine visually grounded values reliably

[63] [63]

usable":

All text values must be in English. Output strict JSON only, no markdown, no explanation. JSON schema: {{ "usable": "yes" or "no", "unusable_reason": "...", "A_object": "...", "A_number": "singular" or "plural" or "mass", "attribute_type": "...", "B_original_attribute": "...", "C_new_attribute": "...", "D_location": "...", "D_alt_location": "...", "plausi...