pith. machine review for the scientific record.

arxiv: 2602.05437 · v2 · submitted 2026-02-05 · 💻 cs.CL

Recognition: no theorem link

Once Correct, Still Wrong: Counterfactual Hallucination in Multilingual Vision-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 07:28 UTC · model grok-4.3

classification 💻 cs.CL
keywords: hallucination · vision-language models · multilingual · Arabic dialects · counterfactual · M2CQA benchmark · cultural grounding · prompting strategies

The pith

Vision-language models accept visually false but culturally plausible statements in Arabic even after answering true questions correctly

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models can answer factual questions about images correctly yet still accept alternative statements that are visually incorrect while remaining culturally plausible, especially in Arabic and its dialects. The paper introduces the M2CQA benchmark built from images across 17 MENA countries paired with contrastive true and counterfactual statements in English, Arabic, and dialects. It defines a new metric, CounterFactual Hallucination Rate, that measures acceptance of the counterfactual only among cases where the true statement was answered correctly. Evaluations show this rate rises sharply for Arabic compared with English, and that reasoning-first prompting increases the problem while answering first reduces it.

Core claim

State-of-the-art VLMs exhibit a high CounterFactual Hallucination Rate (CFHR) in Arabic, especially in dialects, even when true-statement accuracy remains high. The M2CQA dataset pairs culturally grounded images with true statements and counterfactual alternatives that are visually incorrect yet plausible to native speakers. Reasoning-first prompting consistently raises CFHR, while answering before justifying improves robustness; the sketch below illustrates the contrast between the two regimes.
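The two prompting regimes can be made concrete. The following is illustrative only: the paper evaluates under named prompting strategies whose exact wording is not reproduced here, so these template strings are assumptions, not the authors' prompts.

    # Hypothetical prompt templates contrasting the two regimes the paper
    # compares; the wording is invented, only the regime structure is from the paper.
    STATEMENT = "The weapon in the image is a traditional khanjar."  # Q+ from Figure 1

    # Reasoning-first (paper finding: consistently INCREASES counterfactual
    # acceptance): the model reasons before committing to a verdict.
    reasoning_first = (
        "Look at the image. Think step by step about what it shows, "
        f"then decide whether this statement is true or false: {STATEMENT}"
    )

    # Answer-first (paper finding: improves robustness): the model commits
    # to a verdict first, then justifies it.
    answer_first = (
        "Look at the image. Reply 'true' or 'false' first, then give a "
        f"one-sentence justification: {STATEMENT}"
    )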

What carries the argument

CounterFactual Hallucination Rate (CFHR), the conditional probability that a model accepts a visually incorrect but culturally plausible statement given that it correctly answered the matching true statement
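In code, CFHR is a filtered rate: condition on the items where the model verified the true statement, then count counterfactual acceptances. A minimal sketch assuming per-item binary judgments; the field names are hypothetical, not the released dataset's schema.

    def cfhr(results):
        """CounterFactual Hallucination Rate: among items whose true statement
        (Q+) was judged correctly, the fraction whose paired counterfactual
        (Q-) was nonetheless accepted as true."""
        conditioned = [r for r in results if r["q_plus_correct"]]
        if not conditioned:
            return float("nan")  # undefined when no Q+ is answered correctly
        return sum(r["q_minus_accepted"] for r in conditioned) / len(conditioned)

    # Toy check: the fourth item is excluded by the conditioning, so CFHR = 2/3.
    toy = [
        {"q_plus_correct": True,  "q_minus_accepted": True},
        {"q_plus_correct": True,  "q_minus_accepted": False},
        {"q_plus_correct": True,  "q_minus_accepted": True},
        {"q_plus_correct": False, "q_minus_accepted": True},
    ]
    print(cfhr(toy))  # 0.666...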

Load-bearing premise

The constructed counterfactual statements are verifiably visually incorrect while remaining culturally plausible to native speakers
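Concretely, each benchmark item pairs one true statement (Q+) with counterfactual alternatives (Q−), as in Figure 1's khanjar example and the Q_plus/Q_minus output format quoted in the paper's appendix. A sketch of one record; the field names and the reconstructed statement stems (the figure elides them with "...") are illustrative.

    # One M2CQA-style item modeled on Figure 1. Only Q_plus/Q_minus mirror the
    # JSON format shown in the paper's appendix; everything else is a placeholder.
    item = {
        "image": "path/to/image.jpg",  # hypothetical path
        "Q_plus": "The weapon in the image is a traditional khanjar.",
        "Q_minus": [
            # stems reconstructed from Q+; Figure 1 shows them elided with "..."
            "The weapon in the image is a ceremonial sword used in traditional dances.",
            "The weapon in the image is a hunting knife designed for outdoor survival.",
        ],
    }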

What would settle it

If models show similarly low CFHR across English and Arabic when true accuracy is held constant, or if native speakers rate the counterfactual statements as visually accurate rather than incorrect, the central claim would be refuted

Figures

Figures reproduced from arXiv: 2602.05437 by Basel Mousi, Fahim Dalvi, Firoj Alam, Nadir Durrani, Shammur Chowdhury.

Figure 1. Sample image from the M2CQA dataset. Q+: The weapon in the image is a traditional khanjar. Q−: ... a ceremonial sword used in traditional dances. Q−: ... a hunting knife designed for outdoor survival.
Figure 2. Visually similar lantern scenes with distinct cultural grounding. A grounded model should answer Q…
Figure 3. Joint visualization of Q+ accuracy and CFHR. The two-dimensional view reveals distinct behavioral regimes: models with high Q+ and low CFHR exhibit strong grounding and counterfactual robustness, while other regions reflect conservative rejection (low Q+, low CFHR), over-acceptance (high Q+, high CFHR), or frequent hallucination (low Q+, high CFHR).
Figure 5. Scaling and counterfactual hallucination under Prompt A. CFHR (lower is better) as a function of model size for Qwen3-VL (solid) and Gemma-3-VL (dotted), evaluated in English (EN), Modern Standard Arabic (MSA), and dialectal Arabic (Dialects; average of Egyptian and Levantine). Scaling generally reduces CFHR, with stronger and more consistent improvements for Qwen3-VL, particularly in Arabic varieties.
Figure 6. Country coverage in M2CQA. Distribution of visually grounded samples across 17 MENA countries (total: 9,990).
Figure 7. Example image used to illustrate the prompt.
Figure 8. Scaling and counterfactual hallucination across prompting strategies. CFHR (lower is better) as a function of model size for Qwen3-VL (solid) and Gemma-3-VL (dotted), evaluated in English (EN), Modern Standard Arabic (MSA), and dialectal Arabic (Dialects; average of Egyptian and Levantine). Prompt B requires models to answer before providing justification, while Prompt C …
Figure 9. Dialectal scaling of counterfactual hallucination. CFHR (lower is better) as a function of model size for Qwen3-VL (solid) and Gemma-3 (dotted), evaluated separately in Levantine Arabic (AJP) and Egyptian Arabic (ARZ).
Figure 10. Sample images from the dataset with English and Modern Standard Arabic (MSA) captions.
Figure 11. Sample images from the dataset with English and Modern Standard Arabic (MSA) captions.
Figure 12. Human annotation interface used for image-based Q/A verification. Annotators evaluate each question–…
Original abstract

Vision-language models (VLMs) can achieve high accuracy while still accepting culturally plausible but visually incorrect interpretations. Existing hallucination benchmarks rarely test this failure mode, particularly outside Western contexts and English. We introduce M$^2$CQA, a culturally grounded multimodal benchmark built from images spanning 17 MENA countries, paired with contrastive true and counterfactual statements in English, Arabic, and its dialects. To isolate hallucination beyond raw accuracy, we propose the CounterFactual Hallucination Rate (CFHR), which measures counterfactual acceptance conditioned on correctly answering the true statement. Evaluating state-of-the-art VLMs under multiple prompting strategies, we find that CFHR rises sharply in Arabic, especially in dialects, even when true-statement accuracy remains high. Moreover, reasoning-first prompting consistently increases counterfactual hallucination, while answering before justifying improves robustness. We make the dataset publicly available for the community (https://huggingface.co/datasets/QCRI/M2CQA).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces M²CQA, a culturally grounded multimodal benchmark consisting of images from 17 MENA countries paired with contrastive true and counterfactual statements in English, Arabic, and dialects. It defines the CounterFactual Hallucination Rate (CFHR) as the rate at which models accept visually incorrect but culturally plausible counterfactuals, conditioned on correctly answering the corresponding true statement. Evaluations across state-of-the-art VLMs and prompting strategies show that CFHR rises sharply for Arabic (especially dialects) even when true-statement accuracy remains high, and that reasoning-first prompting increases counterfactual acceptance while answer-first prompting improves robustness. The dataset is released publicly.

Significance. If the central empirical claim holds, the work fills a clear gap in hallucination evaluation by moving beyond English-centric and visually obvious errors to culturally plausible counterfactuals in under-served languages. The public dataset release and the conditional CFHR metric are concrete contributions that enable reproducible follow-up. The prompting-strategy finding also has immediate practical implications for deployment.

major comments (3)
  1. [§3] §3 (Dataset Construction and Validation): The manuscript states that counterfactual statements are 'visually incorrect' and 'culturally plausible' but provides no explicit protocol, inter-annotator agreement scores, or image-region grounding procedure for confirming visual mismatch, especially for dialectal Arabic items. Because CFHR is defined as acceptance conditioned on this premise, any noise in the counterfactual labels directly inflates the reported rates; this verification step is load-bearing for the headline Arabic/dialect result.
  2. [§4.2] §4.2 (CFHR Definition and Conditioning): The conditioning on 'correctly answering the true statement' is only described at a high level; it is unclear whether this uses the same prompt template, temperature, and decoding settings as the counterfactual query, or whether multiple samples are aggregated. Small changes in this conditioning can shift CFHR substantially, yet no sensitivity analysis or exact pseudocode is supplied.
  3. [Table 2 / Figure 3] Table 2 / Figure 3 (Arabic vs. Dialect Breakdown): The reported CFHR gap between Modern Standard Arabic and dialects is large, but the paper does not report per-dialect sample sizes, confidence intervals, or a statistical test for the difference. Without these, it is impossible to judge whether the 'sharp rise' is robust or driven by a few low-resource dialects.
minor comments (2)
  1. [§2] §2 (Related Work): The discussion of prior hallucination benchmarks omits recent multilingual efforts (e.g., those extending POPE or HallusionBench to non-English); adding two or three citations would strengthen context.
  2. [Figure 1] Figure 1 caption: The legend for prompting strategies is too small and the color mapping between English/Arabic lines is not described in text; this reduces readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements for clarity and rigor.

Point-by-point responses
  1. Referee: [§3] §3 (Dataset Construction and Validation): The manuscript states that counterfactual statements are 'visually incorrect' and 'culturally plausible' but provides no explicit protocol, inter-annotator agreement scores, or image-region grounding procedure for confirming visual mismatch, especially for dialectal Arabic items. Because CFHR is defined as acceptance conditioned on this premise, any noise in the counterfactual labels directly inflates the reported rates; this verification step is load-bearing for the headline Arabic/dialect result.

    Authors: We agree that a more explicit description of the verification process is necessary to support the CFHR metric. The original manuscript provided only a high-level overview of how counterfactuals were generated and validated. In the revision, we will expand §3 with the full annotation protocol, including the guidelines given to annotators, the inter-annotator agreement scores computed during dataset creation, and the specific procedures used for confirming visual mismatch via image-region grounding. Special attention will be given to dialectal Arabic items, with examples of how cultural plausibility and visual incorrectness were jointly assessed. revision: yes

  2. Referee: [§4.2] §4.2 (CFHR Definition and Conditioning): The conditioning on 'correctly answering the true statement' is only described at a high level; it is unclear whether this uses the same prompt template, temperature, and decoding settings as the counterfactual query, or whether multiple samples are aggregated. Small changes in this conditioning can shift CFHR substantially, yet no sensitivity analysis or exact pseudocode is supplied.

    Authors: We acknowledge that the precise implementation details of the conditioning step require clarification for reproducibility. The manuscript described the conditioning conceptually but did not specify the shared prompt templates, temperature, or decoding parameters. In the revised version, we will update §4.2 to include the exact prompt templates for both true and counterfactual queries, confirm that identical temperature and decoding settings were used throughout, note that single-sample evaluation was performed, and provide pseudocode for the full CFHR calculation. We will also add a sensitivity analysis on temperature and prompt variations in the appendix. revision: yes

  3. Referee: [Table 2 / Figure 3] Table 2 / Figure 3 (Arabic vs. Dialect Breakdown): The reported CFHR gap between Modern Standard Arabic and dialects is large, but the paper does not report per-dialect sample sizes, confidence intervals, or a statistical test for the difference. Without these, it is impossible to judge whether the 'sharp rise' is robust or driven by a few low-resource dialects.

    Authors: We appreciate this observation on statistical robustness. The current manuscript reports aggregate Arabic and dialect results without per-dialect breakdowns or inferential statistics. In the revision, we will expand Table 2 and Figure 3 to list per-dialect sample sizes, report 95% confidence intervals for all CFHR values, and include a statistical test (e.g., two-proportion z-test) comparing Modern Standard Arabic against the dialect groups. These additions will allow readers to assess whether the observed gap is statistically reliable. revision: yes
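For the test promised in response 3, comparing CFHR between Modern Standard Arabic and the dialect groups reduces to a two-proportion z-test on the conditioned counts. A minimal sketch using statsmodels; the counts are invented placeholders, not the paper's numbers.

    from statsmodels.stats.proportion import proportions_ztest

    # Hypothetical counts (NOT from the paper): counterfactual acceptances and
    # CFHR denominators (items with Q+ answered correctly) for MSA vs. dialects.
    accepted = [120, 210]  # MSA, dialects
    n_cond = [800, 750]

    z_stat, p_value = proportions_ztest(count=accepted, nobs=n_cond)
    print(f"z = {z_stat:.2f}, p = {p_value:.4g}")  # small p => gap unlikely by chance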

Circularity Check

0 steps flagged

No circularity: new benchmark and conditional metric are direct empirical measurements

Full rationale

The paper constructs the M²CQA dataset from images and contrastive statements, then defines CFHR explicitly as the rate of accepting counterfactuals conditioned on correct true-statement responses. No equations, fitted parameters, or self-citations make the reported rates true by construction; the results are straightforward counts over model outputs on the released benchmark. The derivation chain is dataset creation followed by direct evaluation, with no self-definitional, fitted-input, or citation-load-bearing steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central measurement rests on the assumption that the contrastive statements are correctly labeled as visually false yet culturally plausible; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Counterfactual statements are verifiably visually incorrect while remaining culturally plausible to native speakers.
    Required for CFHR to isolate hallucination rather than simple error.

pith-pipeline@v0.9.0 · 5478 in / 988 out tokens · 24250 ms · 2026-05-16T07:28:55.600744+00:00 · methodology

