Recognition: no theorem link
Once Correct, Still Wrong: Counterfactual Hallucination in Multilingual Vision-Language Models
Pith reviewed 2026-05-16 07:28 UTC · model grok-4.3
The pith
Vision-language models accept visually false but culturally plausible statements in Arabic even after answering true questions correctly
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
State-of-the-art VLMs exhibit a high CounterFactual Hallucination Rate in Arabic, especially in dialects, even when true-statement accuracy remains high. The M²CQA dataset pairs culturally grounded images with true statements and counterfactual alternatives that are visually incorrect yet plausible to native speakers. Across prompting strategies, reasoning-first prompting consistently raises the rate, while answering before justifying improves robustness.
What carries the argument
CounterFactual Hallucination Rate (CFHR): the conditional probability that a model accepts a visually incorrect but culturally plausible statement, given that it correctly answered the matching true statement.
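As a reading of that definition, CFHR can be sketched as a conditional rate over paired items. The field names below are illustrative, not the paper's released schema:

```python
def cfhr(results):
    """CounterFactual Hallucination Rate: fraction of items whose
    counterfactual statement the model accepted, among items where
    the matching true statement was answered correctly.

    `results` is a list of dicts with boolean fields
    `true_correct` (model accepted the true statement) and
    `cf_accepted` (model accepted the counterfactual).
    """
    # Condition on correct handling of the true statement.
    conditioned = [r for r in results if r["true_correct"]]
    if not conditioned:
        return float("nan")  # undefined when the conditioning set is empty
    return sum(r["cf_accepted"] for r in conditioned) / len(conditioned)

# Toy example: 3 of 4 true statements answered correctly,
# counterfactual accepted on 2 of those 3.
demo = [
    {"true_correct": True,  "cf_accepted": True},
    {"true_correct": True,  "cf_accepted": False},
    {"true_correct": True,  "cf_accepted": True},
    {"true_correct": False, "cf_accepted": True},  # excluded by conditioning
]
print(cfhr(demo))  # 2/3 ≈ 0.667
```

The conditioning is what separates CFHR from raw error rate: a model that fails the true statement is excluded, so the metric isolates hallucination among items the model demonstrably "understood".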
Load-bearing premise
The constructed counterfactual statements are verifiably visually incorrect while remaining culturally plausible to native speakers
What would settle it
If models show similarly low CFHR across English and Arabic when true accuracy is held constant, or if native speakers rate the counterfactual statements as visually accurate rather than incorrect, the central claim would be refuted
Figures
read the original abstract
Vision-language models (VLMs) can achieve high accuracy while still accepting culturally plausible but visually incorrect interpretations. Existing hallucination benchmarks rarely test this failure mode, particularly outside Western contexts and English. We introduce M²CQA, a culturally grounded multimodal benchmark built from images spanning 17 MENA countries, paired with contrastive true and counterfactual statements in English, Arabic, and its dialects. To isolate hallucination beyond raw accuracy, we propose the CounterFactual Hallucination Rate (CFHR), which measures counterfactual acceptance conditioned on correctly answering the true statement. Evaluating state-of-the-art VLMs under multiple prompting strategies, we find that CFHR rises sharply in Arabic, especially in dialects, even when true-statement accuracy remains high. Moreover, reasoning-first prompting consistently increases counterfactual hallucination, while answering before justifying improves robustness. We make the dataset publicly available for the community (https://huggingface.co/datasets/QCRI/M2CQA).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces M²CQA, a culturally grounded multimodal benchmark consisting of images from 17 MENA countries paired with contrastive true and counterfactual statements in English, Arabic, and dialects. It defines the CounterFactual Hallucination Rate (CFHR) as the rate at which models accept visually incorrect but culturally plausible counterfactuals, conditioned on correctly answering the corresponding true statement. Evaluations across state-of-the-art VLMs and prompting strategies show that CFHR rises sharply for Arabic (especially dialects) even when true-statement accuracy remains high, and that reasoning-first prompting increases counterfactual acceptance while answer-first prompting improves robustness. The dataset is released publicly.
Significance. If the central empirical claim holds, the work fills a clear gap in hallucination evaluation by moving beyond English-centric and visually obvious errors to culturally plausible counterfactuals in under-served languages. The public dataset release and the conditional CFHR metric are concrete contributions that enable reproducible follow-up. The prompting-strategy finding also has immediate practical implications for deployment.
major comments (3)
- [§3] §3 (Dataset Construction and Validation): The manuscript states that counterfactual statements are 'visually incorrect' and 'culturally plausible' but provides no explicit protocol, inter-annotator agreement scores, or image-region grounding procedure for confirming visual mismatch, especially for dialectal Arabic items. Because CFHR is defined as acceptance conditioned on this premise, any noise in the counterfactual labels directly inflates the reported rates; this verification step is load-bearing for the headline Arabic/dialect result.
- [§4.2] §4.2 (CFHR Definition and Conditioning): The conditioning on 'correctly answering the true statement' is only described at a high level; it is unclear whether this uses the same prompt template, temperature, and decoding settings as the counterfactual query, or whether multiple samples are aggregated. Small changes in this conditioning can shift CFHR substantially, yet no sensitivity analysis or exact pseudocode is supplied.
- [Table 2 / Figure 3] Table 2 / Figure 3 (Arabic vs. Dialect Breakdown): The reported CFHR gap between Modern Standard Arabic and dialects is large, but the paper does not report per-dialect sample sizes, confidence intervals, or a statistical test for the difference. Without these, it is impossible to judge whether the 'sharp rise' is robust or driven by a few low-resource dialects.
minor comments (2)
- [§2] §2 (Related Work): The discussion of prior hallucination benchmarks omits recent multilingual efforts (e.g., those extending POPE or HallusionBench to non-English); adding two or three citations would strengthen context.
- [Figure 1] Figure 1 caption: The legend for prompting strategies is too small and the color mapping between English/Arabic lines is not described in text; this reduces readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements for clarity and rigor.
read point-by-point responses
-
Referee: [§3] §3 (Dataset Construction and Validation): The manuscript states that counterfactual statements are 'visually incorrect' and 'culturally plausible' but provides no explicit protocol, inter-annotator agreement scores, or image-region grounding procedure for confirming visual mismatch, especially for dialectal Arabic items. Because CFHR is defined as acceptance conditioned on this premise, any noise in the counterfactual labels directly inflates the reported rates; this verification step is load-bearing for the headline Arabic/dialect result.
Authors: We agree that a more explicit description of the verification process is necessary to support the CFHR metric. The original manuscript provided only a high-level overview of how counterfactuals were generated and validated. In the revision, we will expand §3 with the full annotation protocol, including the guidelines given to annotators, the inter-annotator agreement scores computed during dataset creation, and the specific procedures used for confirming visual mismatch via image-region grounding. Special attention will be given to dialectal Arabic items, with examples of how cultural plausibility and visual incorrectness were jointly assessed. revision: yes
-
Referee: [§4.2] §4.2 (CFHR Definition and Conditioning): The conditioning on 'correctly answering the true statement' is only described at a high level; it is unclear whether this uses the same prompt template, temperature, and decoding settings as the counterfactual query, or whether multiple samples are aggregated. Small changes in this conditioning can shift CFHR substantially, yet no sensitivity analysis or exact pseudocode is supplied.
Authors: We acknowledge that the precise implementation details of the conditioning step require clarification for reproducibility. The manuscript described the conditioning conceptually but did not specify the shared prompt templates, temperature, or decoding parameters. In the revised version, we will update §4.2 to include the exact prompt templates for both true and counterfactual queries, confirm that identical temperature and decoding settings were used throughout, note that single-sample evaluation was performed, and provide pseudocode for the full CFHR calculation. We will also add a sensitivity analysis on temperature and prompt variations in the appendix. revision: yes
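A minimal sketch of the conditioning protocol this response describes, with a shared prompt template and identical decoding settings for both queries. `model`, `query` names, and the yes/no interface are assumptions for illustration, not the paper's released code:

```python
def evaluate_cfhr(model, items, prompt_template, temperature=0.0):
    """Single-sample CFHR evaluation sketch: the true and counterfactual
    queries use the same template and identical decoding settings.

    `model` is any callable (prompt, temperature) -> "yes" / "no";
    each item has `true_statement` and `counterfactual` fields.
    """
    accepted_true = accepted_cf = 0
    for item in items:
        true_prompt = prompt_template.format(statement=item["true_statement"])
        cf_prompt = prompt_template.format(statement=item["counterfactual"])
        # One sample per query, same settings for both.
        if model(true_prompt, temperature) == "yes":
            accepted_true += 1
            if model(cf_prompt, temperature) == "yes":
                accepted_cf += 1
    return accepted_cf / accepted_true if accepted_true else float("nan")

# Toy model that accepts everything, so CFHR is 1.0 by construction.
template = "Statement: {statement}. Is this visually correct? Answer yes or no."
items = [{"true_statement": "a", "counterfactual": "b"}]
print(evaluate_cfhr(lambda p, t: "yes", items, template))  # 1.0
```

Holding the template and decoding settings fixed across the two queries is the point of the sensitivity concern: any asymmetry there would be confounded with the conditional rate itself.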
-
Referee: [Table 2 / Figure 3] Table 2 / Figure 3 (Arabic vs. Dialect Breakdown): The reported CFHR gap between Modern Standard Arabic and dialects is large, but the paper does not report per-dialect sample sizes, confidence intervals, or a statistical test for the difference. Without these, it is impossible to judge whether the 'sharp rise' is robust or driven by a few low-resource dialects.
Authors: We appreciate this observation on statistical robustness. The current manuscript reports aggregate Arabic and dialect results without per-dialect breakdowns or inferential statistics. In the revision, we will expand Table 2 and Figure 3 to list per-dialect sample sizes, report 95% confidence intervals for all CFHR values, and include a statistical test (e.g., two-proportion z-test) comparing Modern Standard Arabic against the dialect groups. These additions will allow readers to assess whether the observed gap is statistically reliable. revision: yes
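The two-proportion z-test proposed here can be sketched directly; the counts below are hypothetical, not the paper's reported numbers:

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """Pooled two-proportion z-test for comparing CFHR between two groups
    (e.g. Modern Standard Arabic vs. a dialect group).

    k = counterfactual acceptances, n = size of the conditioning set
    (items whose true statement was answered correctly).
    """
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)                        # pooled acceptance rate
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # pooled standard error
    return (p1 - p2) / se

# Hypothetical counts: MSA 80/400 vs. dialect 140/350 acceptances.
z = two_proportion_z(80, 400, 140, 350)
print(round(z, 2))  # well past the two-sided 5% threshold of |z| = 1.96
```

Because CFHR is a conditional rate, the relevant n per dialect is the conditioning-set size, which can be much smaller than the raw item count; that is exactly why the per-dialect sample sizes the referee requests matter.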
Circularity Check
No circularity: new benchmark and conditional metric are direct empirical measurements
full rationale
The paper constructs the M²CQA dataset from images and contrastive statements, then defines CFHR explicitly as the rate of accepting counterfactuals conditioned on correct true-statement responses. No equations, fitted parameters, or self-citations make the reported rates follow from the inputs by construction; the results are straightforward counts over model outputs on the released benchmark. The derivation chain is dataset creation followed by direct evaluation, with no self-definitional, fitted-input, or citation-load-bearing steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Counterfactual statements are verifiably visually incorrect while remaining culturally plausible to native speakers.
Reference graph
Works this paper leans on
-
[1]
Palm: A culturally inclusive and linguistically diverse dataset for Arabic LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32871–32894, Vienna, Austria. Association for Computational Linguistics.
work page · Pith review · arXiv 2023
-
[2]
A survey of multimodal hallucination evaluation and detection. Preprint, arXiv:2507.19024.
-
[3]
A Survey on Hallucination in Large Vision-Language Models
Hal-eval: A universal and fine-grained hallucination evaluation framework for large vision-language models. In Proceedings of the 32nd ACM International Conference on Multimedia (MM '24), pages 525–534, New York, NY, USA. Association for Computing Machinery.
work page · Pith review · arXiv 2022
-
[4]
AraDiCE: Benchmarks for dialectal and cultural capabilities in LLMs. In Proceedings of the 31st International Conference on Computational Linguistics, pages 4186–4218, Abu Dhabi, UAE. Association for Computational Linguistics.
Appendix excerpt: M²CQA annotation prompt and taxonomy

Question generation guidelines:
- Question type: generate exactly one multiple-choice question per image, with three plausible answer options; exactly one option must be correct and visually supported by the image, and the remaining options must be plausible but not supported by it.
- Semantic focus: use semantic labels to guide question construction, matching image content to the most relevant labels: location and place identification; scene interpretation and context; architectural features and functions; cultural significance and heritage; traditional clothing and attire; tourism and cultural activities; ...
- Cognitive focus: ensure the question requires visual grounding in the image, and assign a label indicating the cognitive focus (knowledge-based or common-sense-based).
- Language: the question and answer options must be written in native-sounding English.
- Question quality: ensure the question is natural, conversational, and human-like; avoid trivial or language-only cues that would reveal the correct answer without access to the image; write answer options in a parallel and natural form.
- Answer quality: the correct answer must be factually supported by the image; incorrect options must remain plausible in isolation.
- Cultural sensitivity: avoid stereotypes or cultural misrepresentations; ensure cultural references are accurate and specific to the image.
- Context utilization: use the provided image description, category, and subcategory to enrich question construction without making the answer obvious.
- Reasoning: provide a short rationale (under 100 words) explaining why the correct option is supported by the image and why the alternatives are not.
- Output format (JSON): { "multiple-choice": { "question_en": "...", "options_en": ["...", "...", "..."], "correct_answer_en": "...", "rationale": "...", "cognitive_focus": "...

Statement conversion guidelines:
- Rewrite each answer option as a standalone declarative statement referring to the image.
- From this conversion, produce one TRUE statement (Q+), derived from the correct answer option, and two FALSE statements (Q−), derived from the incorrect answer options.
- Preserve the semantic content of each answer option; remove explicit question structure; do not introduce new entities or additional information.
- Output format (JSON): { "Q_plus": "...", "Q_minus": ["...", "..."] }

Table 3: Hierarchical taxonomy of semantic categories for the M²CQA / ArabicMENA multimodal dataset
- Built Environment & Architecture: modern landmarks; residential buildings; public spaces; historical/archaeological sites; government buildings; interior design; healthcare buildings; transport infrastructure; bridges; religious architecture
- Food & Culinary Culture: prepared dishes; ingredients & herbs; beverages; tableware; food textures & garnish
- Clothing, Textiles & Appearance: fabric materials; modern apparel; accessories; traditional/cultural clothing; patterns & motifs
- Objects, Artifacts & Material Culture: materials; household objects; decorative crafts; technology devices; jewelry; furniture; machinery/tools; signage; religious objects
- Nature, Animals & Ecology: animals; plants/vegetation; landscapes; water features; climate indicators
- Activities, Sports & Human Action: competitive sports; motorsports; recreational activities; labor/work actions; festivals/performances; marketplace activities
- Religion, Culture & National Identity: religious buildings; weddings; cultural celebrations; fashion events; political events; national symbols
- Transport & Mobility: road transport; rail transport; air transport; water transport; animal-based transport
- People & Occupations: everyday people; healthcare workers; manual laborers; performers; transport operators; athletes; religious/cultural participants
- Environment, Climate & Geography: beaches/coasts; deserts/rocky terrains; forests; urban outdoors; disaster zones; rural countryside; lighting/weather conditions
- Arts, Entertainment & Media: performing arts; film/media; amusement/festival structures; gaming/esports; fine arts/sculpture; museum/exhibition spaces; manuscripts/texts
- Commerce, Markets & Economic Life: local markets; shopping centers; local goods; tourism commerce; restaurants/cafés; fairs/bazaars

A.2.3 Verification of minimal contrastiveness: we empirically assess the minimal contrastiveness of Q+ and Q− statements along two dimensions. First, we examine base accuracy patterns and observe that Q− accuracy ...
discussion (0)