Disparities In Negation Understanding Across Languages In Vision-Language Models
Pith reviewed 2026-05-10 03:27 UTC · model grok-4.3
The pith
Vision-language models exhibit language-dependent affirmation bias: standard CLIP performs at or below chance on negation in non-Latin-script languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the first human-verified multilingual negation benchmark, spanning seven typologically diverse languages, and evaluate three vision-language models plus one correction method. Standard CLIP performs at or below chance on non-Latin-script languages, while MultiCLIP reaches the highest and most uniform accuracy. The correction method, SpaceVLM, yields substantial gains for English, Greek, Spanish, and Tagalog, but its effectiveness varies with linguistic properties such as morphology, script, and negation structure.
What carries the argument
A human-verified set of negated image captions in seven languages that directly compares model accuracy on affirmative versus negative descriptions across differing scripts and negation patterns.
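The comparison described here can be sketched as a two-way forced choice, which matches the rebuttal's statement that chance is 50% for a binary choice. The sketch below is a minimal illustration, not the paper's evaluation code: the function names, the use of cosine similarity as the scoring rule, and the toy vectors are all assumptions, and no real model embeddings are involved.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def picks_correct(image_vec, correct_vec, distractor_vec):
    """True if the correct caption scores strictly higher than the distractor."""
    return cosine(image_vec, correct_vec) > cosine(image_vec, distractor_vec)

def accuracy_by_language(items):
    """items: iterable of (language, image_vec, correct_vec, distractor_vec)."""
    hits, totals = {}, {}
    for lang, img, good, bad in items:
        totals[lang] = totals.get(lang, 0) + 1
        hits[lang] = hits.get(lang, 0) + int(picks_correct(img, good, bad))
    return {lang: hits[lang] / totals[lang] for lang in totals}

# Hand-made toy vectors (hypothetical, NOT outputs of CLIP, SigLIP, or MultiCLIP).
img = [1.0, 0.0]
toy_items = [
    ("en", img, [0.9, 0.1], [0.1, 0.9]),  # correct caption wins
    ("en", img, [0.2, 0.8], [0.8, 0.2]),  # distractor wins
    ("el", img, [0.7, 0.3], [0.3, 0.7]),  # correct caption wins
]
print(accuracy_by_language(toy_items))
```

Reporting accuracy per language, rather than pooled, is what makes script-dependent gaps visible in the first place.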
If this is right
- Any fix for affirmation bias must be checked separately in each language rather than assumed to transfer uniformly.
- Model training data and architecture choices interact with script and negation structure to produce uneven performance.
- Global deployment of vision-language models requires benchmarks that track performance per linguistic community.
- Negation handling is one instance of a broader pattern where linguistic typology affects model reliability.
Where Pith is reading between the lines
- Similar language-specific gaps may appear in other abstract reasoning tasks such as handling quantifiers or spatial relations.
- The results imply that balancing training data by script and negation type could reduce disparities more effectively than post-hoc corrections.
- Extending the benchmark to additional languages or modalities would clarify whether script type or morphological complexity drives the largest differences.
Load-bearing premise
The selected test items measure negation understanding in a comparable way across languages that differ in script, morphology, and how they express negation.
What would settle it
Retraining or testing a new model on the identical benchmark and obtaining uniformly high accuracy across all seven languages with no script-based gaps would falsify the observed disparities.
Vision-language models (VLMs) exhibit affirmation bias: a systematic tendency to select positive captions ("X is present") even when the correct description contains negation ("no X"). While prior work has documented this failure mode in English and proposed solutions, negation manifests differently across languages through varying morphology, word order, and cliticization patterns, raising the question of whether these solutions serve all linguistic communities equitably. We introduce the first human-verified multilingual negation benchmark, spanning seven typologically diverse languages: English, Mandarin Chinese, Arabic, Greek, Russian, Tagalog, and Spanish. Evaluating three VLMs - CLIP, SigLIP, and MultiCLIP - we find that standard CLIP performs at or below chance on non-Latin-script languages, while MultiCLIP achieves the highest and most uniform accuracy. We also evaluate SpaceVLM, a proposed negation correction, and find that it produces substantial improvements for several languages - particularly English, Greek, Spanish, and Tagalog - while showing varied effectiveness across typologically different languages. This variation reveals that linguistic properties like morphology, script, and negation structure interact with model improvements in fairness-relevant ways. As VLMs are deployed globally, multilingual benchmarks are essential for understanding not just whether solutions work, but for whom.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the first human-verified multilingual negation benchmark spanning seven typologically diverse languages (English, Mandarin Chinese, Arabic, Greek, Russian, Tagalog, Spanish). It evaluates standard VLMs (CLIP, SigLIP, MultiCLIP) on this benchmark and reports that CLIP performs at or below chance on non-Latin-script languages while MultiCLIP achieves the highest and most uniform accuracy; it further evaluates the SpaceVLM negation-correction approach and finds substantial but language-varying improvements (strongest for English, Greek, Spanish, Tagalog), attributing the variation to interactions between linguistic properties (morphology, script, negation structure) and model behavior.
Significance. If the benchmark items provide comparable measures of negation understanding, the work would be significant for documenting cross-lingual and cross-script disparities in VLM affirmation bias and for showing that proposed fixes like SpaceVLM do not generalize uniformly. The creation of a human-verified multilingual resource is a concrete contribution that could support future fairness audits of VLMs deployed globally.
major comments (2)
- [Benchmark construction and evaluation sections] The central claims (CLIP at/below chance on non-Latin scripts; MultiCLIP most uniform; variable SpaceVLM gains) rest on the benchmark providing equivalent measures of negation understanding. The manuscript does not demonstrate that positive/negative caption pairs impose comparable cognitive or computational demands across languages that differ in negation realization (particles vs. morphology vs. clitics), script, and word order; human verification alone does not rule out confounds from general text comprehension or text-encoder tokenization effects.
- [Abstract and §4 (Evaluation)] The abstract reports clear directional findings but supplies no sample sizes, statistical tests, inter-annotator agreement for the human verification, or exact evaluation protocol (prompt templates, image-caption pairing, chance-level calculation). These details are required to assess whether data selection or prompt choices drive the reported disparities.
minor comments (1)
- Clarify how 'chance' performance is defined for each language given differing negation structures and script effects on tokenization.
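The tokenization concern raised in these comments can be made concrete with a simple length audit of affirmative and negated captions per language. The sketch below is only an illustration under loud assumptions: `utf8_len` is a crude stand-in that counts UTF-8 bytes rather than calling any model's real tokenizer, the function names are hypothetical, and the two caption pairs are toy examples, not benchmark items. A real audit would pass the evaluated model's own tokenizer as `count_tokens`.

```python
from statistics import mean

def utf8_len(text):
    """Crude proxy for token count: number of UTF-8 bytes (an assumption)."""
    return len(text.encode("utf-8"))

def length_report(pairs, count_tokens=utf8_len):
    """pairs: list of (language, affirmative_caption, negated_caption).

    Returns mean caption length per language for each polarity.
    """
    by_lang = {}
    for lang, pos, neg in pairs:
        by_lang.setdefault(lang, []).append((count_tokens(pos), count_tokens(neg)))
    return {
        lang: {
            "mean_pos": mean(p for p, _ in lens),
            "mean_neg": mean(n for _, n in lens),
        }
        for lang, lens in by_lang.items()
    }

def matched_subset(pairs, tolerance=0, count_tokens=utf8_len):
    """Keep only pairs whose two captions differ in length by <= tolerance."""
    return [
        item for item in pairs
        if abs(count_tokens(item[1]) - count_tokens(item[2])) <= tolerance
    ]

# Toy caption pairs for illustration only.
toy_pairs = [
    ("en", "a dog", "no dog"),
    ("ru", "собака", "нет собаки"),
]
print(length_report(toy_pairs))
print(matched_subset(toy_pairs, tolerance=1))
```

Even this rough proxy shows why script matters: Cyrillic characters take two bytes each in UTF-8, so byte-level vocabularies can see much longer sequences for the same content.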
Simulated Author's Rebuttal
We are grateful to the referee for their insightful feedback on our paper. We have carefully considered each comment and provide detailed responses below, along with indications of revisions made to the manuscript.
- Referee: [Benchmark construction and evaluation sections] The central claims (CLIP at/below chance on non-Latin scripts; MultiCLIP most uniform; variable SpaceVLM gains) rest on the benchmark providing equivalent measures of negation understanding. The manuscript does not demonstrate that positive/negative caption pairs impose comparable cognitive or computational demands across languages that differ in negation realization (particles vs. morphology vs. clitics), script, and word order; human verification alone does not rule out confounds from general text comprehension or text-encoder tokenization effects.
Authors: We recognize the importance of ensuring that the benchmark measures negation understanding equivalently across languages. While typological differences make perfect equivalence difficult, our verification by native speakers for each language ensures that the negation is correctly represented in the captions. To address potential confounds, we have expanded the manuscript to include an analysis of text encoder tokenization effects, such as average token counts for positive and negative captions per language, and a comparison of model performance on a subset of captions with matched token lengths. We also discuss in the limitations section how general text comprehension might interact with negation bias. These additions provide additional support for our claims without claiming full equivalence, which we agree is not fully demonstrated. revision: partial
- Referee: [Abstract and §4 (Evaluation)] The abstract reports clear directional findings but supplies no sample sizes, statistical tests, inter-annotator agreement for the human verification, or exact evaluation protocol (prompt templates, image-caption pairing, chance-level calculation). These details are required to assess whether data selection or prompt choices drive the reported disparities.
Authors: We appreciate this observation and have updated both the abstract and Section 4 to include the requested details. Specifically, we now report the sample size (number of image-caption pairs per language), the inter-annotator agreement (Cohen's kappa for verification), the exact prompt templates used, the image-caption pairing method, and how chance level was calculated (50% for binary choice). Additionally, we have included statistical tests (binomial tests against chance) in the results tables and text. These additions ensure the evaluation protocol is fully transparent and reproducible. revision: yes
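The binomial test against a 50% chance level mentioned in this response can be sketched as follows. This is a minimal exact one-sided computation using only the standard library; the function names and the example counts (8 correct out of 10) are hypothetical and are not results from the paper.

```python
from math import comb

def binom_p_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): evidence for above-chance accuracy."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def binom_p_at_most(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p): evidence for below-chance accuracy."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(0, k + 1))

# Hypothetical counts for illustration: 8 correct choices out of 10 items.
correct, total = 8, 10
print(binom_p_at_least(correct, total))
```

The lower-tail function is included because "at or below chance" is itself a testable claim: accuracy significantly below 50% would indicate a systematic preference for the wrong caption rather than random guessing.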
Circularity Check
No significant circularity: purely empirical evaluation on new benchmark
The paper introduces a human-verified multilingual negation benchmark across seven languages and reports direct accuracy measurements for existing models (CLIP, SigLIP, MultiCLIP) plus one external correction method (SpaceVLM). No equations, parameter fits, predictions derived from the benchmark itself, or load-bearing self-citations appear in the provided text. All central claims reduce to observable performance numbers on the new dataset rather than to any self-referential construction, satisfying the default expectation of non-circular empirical work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human verification produces reliable ground-truth labels for negation understanding
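One way this assumption is commonly probed is inter-annotator agreement, which the simulated rebuttal says is reported as Cohen's kappa. The sketch below shows the standard two-annotator formula, kappa = (p_o - p_e) / (1 - p_e); the function name and the example labels are hypothetical and do not come from the paper's annotation data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labelled independently
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both annotators used a single label
    return (observed - expected) / (1 - expected)

# Hypothetical verification labels from two annotators.
annotator_1 = ["ok", "ok", "ok", "bad"]
annotator_2 = ["ok", "ok", "bad", "bad"]
print(cohens_kappa(annotator_1, annotator_2))
```

High agreement supports label reliability within a language, but it does not by itself show that items are equally difficult across languages, which is the separate concern raised in the referee report.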