pith. machine review for the scientific record.

arxiv: 2602.05437 · v2 · submitted 2026-02-05 · 💻 cs.CL

Recognition: no theorem link

Once Correct, Still Wrong: Counterfactual Hallucination in Multilingual Vision-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 07:28 UTC · model grok-4.3

classification 💻 cs.CL
keywords: hallucination · vision-language models · multilingual · Arabic dialects · counterfactual · M2CQA benchmark · cultural grounding · prompting strategies

The pith

Vision-language models accept visually false but culturally plausible statements in Arabic even after answering true questions correctly

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models can answer factual questions about images correctly yet still accept alternative statements that are visually incorrect while remaining culturally plausible, especially in Arabic and its dialects. The paper introduces the M2CQA benchmark built from images across 17 MENA countries paired with contrastive true and counterfactual statements in English, Arabic, and dialects. It defines a new metric, CounterFactual Hallucination Rate, that measures acceptance of the counterfactual only among cases where the true statement was answered correctly. Evaluations show this rate rises sharply for Arabic compared with English, and that reasoning-first prompting increases the problem while answering first reduces it.

Core claim

State-of-the-art VLMs exhibit a high CounterFactual Hallucination Rate (CFHR) in Arabic, especially in dialects, even when true-statement accuracy remains high. The M2CQA dataset pairs culturally grounded images with true statements and counterfactual alternatives that are visually incorrect yet plausible to native speakers. Reasoning-first prompting consistently raises CFHR, while answering before justifying improves robustness; the sketch below illustrates the contrast between the two regimes.
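The two prompting regimes can be made concrete. The following is illustrative only: the paper evaluates under named prompting strategies whose exact wording is not reproduced here, so these template strings are assumptions, not the authors' prompts.

    # Hypothetical prompt templates contrasting the two regimes the paper
    # compares; the wording is invented, only the regime structure is from the paper.
    STATEMENT = "The weapon in the image is a traditional khanjar."  # Q+ from Figure 1

    # Reasoning-first (paper finding: consistently INCREASES counterfactual
    # acceptance): the model reasons before committing to a verdict.
    reasoning_first = (
        "Look at the image. Think step by step about what it shows, "
        f"then decide whether this statement is true or false: {STATEMENT}"
    )

    # Answer-first (paper finding: improves robustness): the model commits
    # to a verdict first, then justifies it.
    answer_first = (
        "Look at the image. Reply 'true' or 'false' first, then give a "
        f"one-sentence justification: {STATEMENT}"
    )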

What carries the argument

CounterFactual Hallucination Rate (CFHR), the conditional probability that a model accepts a visually incorrect but culturally plausible statement given that it correctly answered the matching true statement
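In code, CFHR is a filtered rate: condition on the items where the model verified the true statement, then count counterfactual acceptances. A minimal sketch assuming per-item binary judgments; the field names are hypothetical, not the released dataset's schema.

    def cfhr(results):
        """CounterFactual Hallucination Rate: among items whose true statement
        (Q+) was judged correctly, the fraction whose paired counterfactual
        (Q-) was nonetheless accepted as true."""
        conditioned = [r for r in results if r["q_plus_correct"]]
        if not conditioned:
            return float("nan")  # undefined when no Q+ is answered correctly
        return sum(r["q_minus_accepted"] for r in conditioned) / len(conditioned)

    # Toy check: the fourth item is excluded by the conditioning, so CFHR = 2/3.
    toy = [
        {"q_plus_correct": True,  "q_minus_accepted": True},
        {"q_plus_correct": True,  "q_minus_accepted": False},
        {"q_plus_correct": True,  "q_minus_accepted": True},
        {"q_plus_correct": False, "q_minus_accepted": True},
    ]
    print(cfhr(toy))  # 0.666...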

Load-bearing premise

The constructed counterfactual statements are verifiably visually incorrect while remaining culturally plausible to native speakers
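Concretely, each benchmark item pairs one true statement (Q+) with counterfactual alternatives (Q−), as in Figure 1's khanjar example and the Q_plus/Q_minus output format quoted in the paper's appendix. A sketch of one record; the field names and the reconstructed statement stems (the figure elides them with "...") are illustrative.

    # One M2CQA-style item modeled on Figure 1. Only Q_plus/Q_minus mirror the
    # JSON format shown in the paper's appendix; everything else is a placeholder.
    item = {
        "image": "path/to/image.jpg",  # hypothetical path
        "Q_plus": "The weapon in the image is a traditional khanjar.",
        "Q_minus": [
            # stems reconstructed from Q+; Figure 1 shows them elided with "..."
            "The weapon in the image is a ceremonial sword used in traditional dances.",
            "The weapon in the image is a hunting knife designed for outdoor survival.",
        ],
    }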

What would settle it

If models show similarly low CFHR across English and Arabic when true accuracy is held constant, or if native speakers rate the counterfactual statements as visually accurate rather than incorrect, the central claim would be refuted

Figures

Figures reproduced from arXiv: 2602.05437 by Basel Mousi, Fahim Dalvi, Firoj Alam, Nadir Durrani, Shammur Chowdhury.

Figure 1. Sample image from the M2CQA dataset. Q+: The weapon in the image is a traditional khanjar. Q−: ... a ceremonial sword used in traditional dances. Q−: ... a hunting knife designed for outdoor survival.
Figure 2. Visually similar lantern scenes with distinct cultural grounding. A grounded model should answer Q…
Figure 3. Joint visualization of Q+ accuracy and CFHR. The two-dimensional view reveals distinct behavioral regimes: models with high Q+ and low CFHR exhibit strong grounding and counterfactual robustness, while other regions reflect conservative rejection (low Q+, low CFHR), over-acceptance (high Q+, high CFHR), or frequent hallucination (low Q+, high CFHR).
Figure 5. Scaling and counterfactual hallucination under Prompt A. CFHR (lower is better) as a function of model size for Qwen3-VL (solid) and Gemma-3-VL (dotted), evaluated in English (EN), Modern Standard Arabic (MSA), and dialectal Arabic (Dialects; average of Egyptian and Levantine). Scaling generally reduces CFHR, with stronger and more consistent improvements for Qwen3-VL, particularly in Arabic varieties.
Figure 6. Country coverage in M2CQA. Distribution of visually grounded samples across 17 MENA countries (total: 9,990).
Figure 7. Example image used to illustrate the prompt.
Figure 8. Scaling and counterfactual hallucination across prompting strategies. CFHR (lower is better) as a function of model size for Qwen3-VL (solid) and Gemma-3-VL (dotted), evaluated in English (EN), Modern Standard Arabic (MSA), and dialectal Arabic (Dialects; average of Egyptian and Levantine). Prompt B requires models to answer before providing justification, while Prompt C …
Figure 9. Dialectal scaling of counterfactual hallucination. CFHR (lower is better) as a function of model size for Qwen3-VL (solid) and Gemma-3 (dotted), evaluated separately in Levantine Arabic (AJP) and Egyptian Arabic (ARZ).
Figure 10. Sample images from the dataset with English and Modern Standard Arabic (MSA) captions.
Figure 11. Sample images from the dataset with English and Modern Standard Arabic (MSA) captions.
Figure 12. Human annotation interface used for image-based Q/A verification. Annotators evaluate each question–…
Original abstract

Vision-language models (VLMs) can achieve high accuracy while still accepting culturally plausible but visually incorrect interpretations. Existing hallucination benchmarks rarely test this failure mode, particularly outside Western contexts and English. We introduce M$^2$CQA, a culturally grounded multimodal benchmark built from images spanning 17 MENA countries, paired with contrastive true and counterfactual statements in English, Arabic, and its dialects. To isolate hallucination beyond raw accuracy, we propose the CounterFactual Hallucination Rate (CFHR), which measures counterfactual acceptance conditioned on correctly answering the true statement. Evaluating state-of-the-art VLMs under multiple prompting strategies, we find that CFHR rises sharply in Arabic, especially in dialects, even when true-statement accuracy remains high. Moreover, reasoning-first prompting consistently increases counterfactual hallucination, while answering before justifying improves robustness. We make the dataset publicly available for the community (https://huggingface.co/datasets/QCRI/M2CQA).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces M²CQA, a culturally grounded multimodal benchmark consisting of images from 17 MENA countries paired with contrastive true and counterfactual statements in English, Arabic, and dialects. It defines the CounterFactual Hallucination Rate (CFHR) as the rate at which models accept visually incorrect but culturally plausible counterfactuals, conditioned on correctly answering the corresponding true statement. Evaluations across state-of-the-art VLMs and prompting strategies show that CFHR rises sharply for Arabic (especially dialects) even when true-statement accuracy remains high, and that reasoning-first prompting increases counterfactual acceptance while answer-first prompting improves robustness. The dataset is released publicly.

Significance. If the central empirical claim holds, the work fills a clear gap in hallucination evaluation by moving beyond English-centric and visually obvious errors to culturally plausible counterfactuals in under-served languages. The public dataset release and the conditional CFHR metric are concrete contributions that enable reproducible follow-up. The prompting-strategy finding also has immediate practical implications for deployment.

major comments (3)
  1. [§3] §3 (Dataset Construction and Validation): The manuscript states that counterfactual statements are 'visually incorrect' and 'culturally plausible' but provides no explicit protocol, inter-annotator agreement scores, or image-region grounding procedure for confirming visual mismatch, especially for dialectal Arabic items. Because CFHR is defined as acceptance conditioned on this premise, any noise in the counterfactual labels directly inflates the reported rates; this verification step is load-bearing for the headline Arabic/dialect result.
  2. [§4.2] §4.2 (CFHR Definition and Conditioning): The conditioning on 'correctly answering the true statement' is only described at a high level; it is unclear whether this uses the same prompt template, temperature, and decoding settings as the counterfactual query, or whether multiple samples are aggregated. Small changes in this conditioning can shift CFHR substantially, yet no sensitivity analysis or exact pseudocode is supplied.
  3. [Table 2 / Figure 3] Table 2 / Figure 3 (Arabic vs. Dialect Breakdown): The reported CFHR gap between Modern Standard Arabic and dialects is large, but the paper does not report per-dialect sample sizes, confidence intervals, or a statistical test for the difference. Without these, it is impossible to judge whether the 'sharp rise' is robust or driven by a few low-resource dialects.
minor comments (2)
  1. [§2] §2 (Related Work): The discussion of prior hallucination benchmarks omits recent multilingual efforts (e.g., those extending POPE or HallusionBench to non-English); adding two or three citations would strengthen context.
  2. [Figure 1] Figure 1 caption: The legend for prompting strategies is too small and the color mapping between English/Arabic lines is not described in text; this reduces readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements for clarity and rigor.

Point-by-point responses
  1. Referee: [§3] §3 (Dataset Construction and Validation): The manuscript states that counterfactual statements are 'visually incorrect' and 'culturally plausible' but provides no explicit protocol, inter-annotator agreement scores, or image-region grounding procedure for confirming visual mismatch, especially for dialectal Arabic items. Because CFHR is defined as acceptance conditioned on this premise, any noise in the counterfactual labels directly inflates the reported rates; this verification step is load-bearing for the headline Arabic/dialect result.

    Authors: We agree that a more explicit description of the verification process is necessary to support the CFHR metric. The original manuscript provided only a high-level overview of how counterfactuals were generated and validated. In the revision, we will expand §3 with the full annotation protocol, including the guidelines given to annotators, the inter-annotator agreement scores computed during dataset creation, and the specific procedures used for confirming visual mismatch via image-region grounding. Special attention will be given to dialectal Arabic items, with examples of how cultural plausibility and visual incorrectness were jointly assessed. revision: yes

  2. Referee: [§4.2] §4.2 (CFHR Definition and Conditioning): The conditioning on 'correctly answering the true statement' is only described at a high level; it is unclear whether this uses the same prompt template, temperature, and decoding settings as the counterfactual query, or whether multiple samples are aggregated. Small changes in this conditioning can shift CFHR substantially, yet no sensitivity analysis or exact pseudocode is supplied.

    Authors: We acknowledge that the precise implementation details of the conditioning step require clarification for reproducibility. The manuscript described the conditioning conceptually but did not specify the shared prompt templates, temperature, or decoding parameters. In the revised version, we will update §4.2 to include the exact prompt templates for both true and counterfactual queries, confirm that identical temperature and decoding settings were used throughout, note that single-sample evaluation was performed, and provide pseudocode for the full CFHR calculation. We will also add a sensitivity analysis on temperature and prompt variations in the appendix. revision: yes

  3. Referee: [Table 2 / Figure 3] Table 2 / Figure 3 (Arabic vs. Dialect Breakdown): The reported CFHR gap between Modern Standard Arabic and dialects is large, but the paper does not report per-dialect sample sizes, confidence intervals, or a statistical test for the difference. Without these, it is impossible to judge whether the 'sharp rise' is robust or driven by a few low-resource dialects.

    Authors: We appreciate this observation on statistical robustness. The current manuscript reports aggregate Arabic and dialect results without per-dialect breakdowns or inferential statistics. In the revision, we will expand Table 2 and Figure 3 to list per-dialect sample sizes, report 95% confidence intervals for all CFHR values, and include a statistical test (e.g., two-proportion z-test) comparing Modern Standard Arabic against the dialect groups. These additions will allow readers to assess whether the observed gap is statistically reliable. revision: yes
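For the test promised in response 3, comparing CFHR between Modern Standard Arabic and the dialect groups reduces to a two-proportion z-test on the conditioned counts. A minimal sketch using statsmodels; the counts are invented placeholders, not the paper's numbers.

    from statsmodels.stats.proportion import proportions_ztest

    # Hypothetical counts (NOT from the paper): counterfactual acceptances and
    # CFHR denominators (items with Q+ answered correctly) for MSA vs. dialects.
    accepted = [120, 210]  # MSA, dialects
    n_cond = [800, 750]

    z_stat, p_value = proportions_ztest(count=accepted, nobs=n_cond)
    print(f"z = {z_stat:.2f}, p = {p_value:.4g}")  # small p => gap unlikely by chance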

Circularity Check

0 steps flagged

No circularity: new benchmark and conditional metric are direct empirical measurements

Full rationale

The paper constructs the M²CQA dataset from images and contrastive statements, then defines CFHR explicitly as the rate of accepting counterfactuals conditioned on correct true-statement responses. No equations, fitted parameters, or self-citations make the reported rates true by construction; the results are straightforward counts over model outputs on the released benchmark. The derivation chain is dataset creation followed by direct evaluation, with no self-definitional, fitted-input, or citation-load-bearing steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central measurement rests on the assumption that the contrastive statements are correctly labeled as visually false yet culturally plausible; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Counterfactual statements are verifiably visually incorrect while remaining culturally plausible to native speakers.
    Required for CFHR to isolate hallucination rather than simple error.

pith-pipeline@v0.9.0 · 5478 in / 988 out tokens · 24250 ms · 2026-05-16T07:28:55.600744+00:00 · methodology

