Disparities In Negation Understanding Across Languages In Vision-Language Models

Charikleia Moraitaki; Gwendolyn Flusche; Kumail Alhamoud; Marzyeh Ghassemi; Sarah Pan; Skyler Pulling

arxiv: 2604.18942 · v1 · submitted 2026-04-21 · 💻 cs.CL

Pith reviewed 2026-05-10 03:27 UTC · model grok-4.3

classification 💻 cs.CL

keywords negation bias · vision-language models · multilingual evaluation · affirmation bias · typological diversity · CLIP models · model fairness · language scripts

The pith

Vision-language models exhibit language-dependent affirmation bias, performing at or below chance on negation in non-Latin scripts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that vision-language models systematically prefer affirmative image descriptions over those containing negation, and that this preference is not uniform across languages. It introduces a new human-checked test set of negated captions in seven typologically varied languages to measure how models handle negation when morphology, word order, and script differ. Evaluations reveal that one common model stays near random on languages using non-Latin scripts, a multilingual variant performs more evenly, and a proposed correction method lifts accuracy for some languages but not others. These patterns matter because models are increasingly used across global communities where negation is expressed differently.

Core claim

We introduce the first human-verified multilingual negation benchmark spanning seven typologically diverse languages and evaluate three vision-language models plus one correction method, showing that standard CLIP performs at or below chance on non-Latin-script languages, MultiCLIP reaches the highest and most uniform accuracy, and the correction produces substantial gains for English, Greek, Spanish, and Tagalog while showing varied effectiveness tied to linguistic properties such as morphology, script, and negation structure.

What carries the argument

A human-verified set of negated image captions in seven languages that directly compares model accuracy on affirmative versus negative descriptions across differing scripts and negation patterns.
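
A minimal sketch of how such a comparison might be scored, assuming a two-way forced choice in which the model assigns one similarity score to the affirmative caption and one to the negated caption for each image. The `Item` record, its field names, and the toy scores are hypothetical illustrations, not the paper's data or code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Item:
    """One benchmark item (hypothetical schema, not the paper's)."""
    language: str
    score_affirmative: float  # image-text similarity for the "X is present" caption
    score_negated: float      # image-text similarity for the "no X" caption
    negated_is_correct: bool  # True when the negated caption is the right description


def accuracy_by_language(items):
    """Fraction of items per language where the higher-scoring caption is correct."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        # Ties count as choosing the affirmative caption.
        picked_negated = item.score_negated > item.score_affirmative
        correct[item.language] += int(picked_negated == item.negated_is_correct)
        total[item.language] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def accuracy_spread(per_language):
    """Gap between the best and worst language; smaller means more uniform."""
    values = list(per_language.values())
    return max(values) - min(values)


# Toy scores for illustration only; these are not measurements from any model.
toy_items = [
    Item("en", 0.21, 0.30, True),
    Item("en", 0.33, 0.25, True),
    Item("el", 0.28, 0.22, True),
    Item("el", 0.27, 0.20, True),
]
per_language = accuracy_by_language(toy_items)
```

Under this setup, a model that always prefers the affirmative caption scores zero on every item whose correct description is negated, which is one way accuracy can fall below the 50% chance level.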

If this is right

Any fix for affirmation bias must be checked separately in each language rather than assumed to transfer uniformly.
Model training data and architecture choices interact with script and negation structure to produce uneven performance.
Global deployment of vision-language models requires benchmarks that track performance per linguistic community.
Negation handling is one instance of a broader pattern where linguistic typology affects model reliability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar language-specific gaps may appear in other abstract reasoning tasks such as handling quantifiers or spatial relations.
The results imply that balancing training data by script and negation type could reduce disparities more effectively than post-hoc corrections.
Extending the benchmark to additional languages or modalities would clarify whether script type or morphological complexity drives the largest differences.

Load-bearing premise

The selected test items measure negation understanding in a comparable way across languages that differ in script, morphology, and how they express negation.

What would settle it

Retraining or testing a new model on the identical benchmark and obtaining uniformly high accuracy across all seven languages with no script-based gaps would falsify the observed disparities.

Original abstract

Vision-language models (VLMs) exhibit affirmation bias: a systematic tendency to select positive captions ("X is present") even when the correct description contains negation ("no X"). While prior work has documented this failure mode in English and proposed solutions, negation manifests differently across languages through varying morphology, word order, and cliticization patterns, raising the question of whether these solutions serve all linguistic communities equitably. We introduce the first human-verified multilingual negation benchmark, spanning seven typologically diverse languages: English, Mandarin Chinese, Arabic, Greek, Russian, Tagalog, and Spanish. Evaluating three VLMs - CLIP, SigLIP, and MultiCLIP - we find that standard CLIP performs at or below chance on non-Latin-script languages, while MultiCLIP achieves the highest and most uniform accuracy. We also evaluate SpaceVLM, a proposed negation correction, and find that it produces substantial improvements for several languages - particularly English, Greek, Spanish, and Tagalog - while showing varied effectiveness across typologically different languages. This variation reveals that linguistic properties like morphology, script, and negation structure interact with model improvements in fairness-relevant ways. As VLMs are deployed globally, multilingual benchmarks are essential for understanding not just whether solutions work, but for whom.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives us a new multilingual negation benchmark for VLMs and shows that one correction method helps some languages more than others, but the items may not test equivalent skills across languages.

The core contribution is a human-verified benchmark covering negation in seven languages with different scripts and structures, plus results showing uneven performance and uneven gains from SpaceVLM. Standard CLIP sits at or below chance on non-Latin scripts while MultiCLIP looks more consistent, and the correction method improves English, Greek, Spanish, and Tagalog more than the others. That pattern is worth knowing if you care about global deployment.

Referee Report

2 major / 1 minor

Summary. The paper introduces the first human-verified multilingual negation benchmark spanning seven typologically diverse languages (English, Mandarin Chinese, Arabic, Greek, Russian, Tagalog, Spanish). It evaluates standard VLMs (CLIP, SigLIP, MultiCLIP) on this benchmark and reports that CLIP performs at or below chance on non-Latin-script languages while MultiCLIP achieves the highest and most uniform accuracy; it further evaluates the SpaceVLM negation-correction approach and finds substantial but language-varying improvements (strongest for English, Greek, Spanish, Tagalog), attributing the variation to interactions between linguistic properties (morphology, script, negation structure) and model behavior.

Significance. If the benchmark items provide comparable measures of negation understanding, the work would be significant for documenting cross-lingual and cross-script disparities in VLM affirmation bias and for showing that proposed fixes like SpaceVLM do not generalize uniformly. The creation of a human-verified multilingual resource is a concrete contribution that could support future fairness audits of VLMs deployed globally.

major comments (2)

[Benchmark construction and evaluation sections] The central claims (CLIP at/below chance on non-Latin scripts; MultiCLIP most uniform; variable SpaceVLM gains) rest on the benchmark providing equivalent measures of negation understanding. The manuscript does not demonstrate that positive/negative caption pairs impose comparable cognitive or computational demands across languages that differ in negation realization (particles vs. morphology vs. clitics), script, and word order; human verification alone does not rule out confounds from general text comprehension or text-encoder tokenization effects.
[Abstract and §4 (Evaluation)] The abstract reports clear directional findings but supplies no sample sizes, statistical tests, inter-annotator agreement for the human verification, or exact evaluation protocol (prompt templates, image-caption pairing, chance-level calculation). These details are required to assess whether data selection or prompt choices drive the reported disparities.

minor comments (1)

Clarify how 'chance' performance is defined for each language given differing negation structures and script effects on tokenization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful feedback on our paper. We have carefully considered each comment and provide detailed responses below, along with indications of revisions made to the manuscript.

Referee: [Benchmark construction and evaluation sections] The central claims (CLIP at/below chance on non-Latin scripts; MultiCLIP most uniform; variable SpaceVLM gains) rest on the benchmark providing equivalent measures of negation understanding. The manuscript does not demonstrate that positive/negative caption pairs impose comparable cognitive or computational demands across languages that differ in negation realization (particles vs. morphology vs. clitics), script, and word order; human verification alone does not rule out confounds from general text comprehension or text-encoder tokenization effects.

Authors: We recognize the importance of ensuring that the benchmark measures negation understanding equivalently across languages. While typological differences make perfect equivalence difficult, verification by native speakers of each language ensures that negation is correctly represented in the captions. To address potential confounds, we have expanded the manuscript to include an analysis of text-encoder tokenization effects, such as average token counts for positive and negative captions per language, and a comparison of model performance on a subset of captions with matched token lengths. We also discuss in the limitations section how general text comprehension might interact with negation bias. These additions strengthen our claims without asserting full equivalence, which we agree has not been demonstrated.
revision: partial
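
The token-count comparison mentioned in this response could be sketched roughly as follows. This is an illustration under stated assumptions: the tokenizer is a stand-in that counts UTF-8 bytes (not the tokenizer of CLIP, SigLIP, or MultiCLIP), the function names are hypothetical, and the caption pairs are made up rather than taken from the benchmark.

```python
from collections import defaultdict
from statistics import mean


def utf8_byte_tokens(text):
    """Stand-in tokenizer: one token per UTF-8 byte (an assumption, not a model tokenizer)."""
    return list(text.encode("utf-8"))


def mean_token_counts(pairs, tokenize=utf8_byte_tokens):
    """pairs: iterable of (language, affirmative_caption, negated_caption).

    Returns {language: (mean affirmative length, mean negated length)} in tokens.
    """
    affirmative = defaultdict(list)
    negated = defaultdict(list)
    for language, pos_caption, neg_caption in pairs:
        affirmative[language].append(len(tokenize(pos_caption)))
        negated[language].append(len(tokenize(neg_caption)))
    return {lang: (mean(affirmative[lang]), mean(negated[lang])) for lang in affirmative}


# Made-up caption pairs for illustration; they are not items from the benchmark.
example_pairs = [
    ("en", "a dog on the grass", "no dog on the grass"),
    ("el", "ένα σκυλί στο γρασίδι", "κανένα σκυλί στο γρασίδι"),
]
token_stats = mean_token_counts(example_pairs)
```

Passing a real model tokenizer as `tokenize` would give the per-language counts the response describes; the byte-level stand-in only shows that non-Latin scripts can inflate sequence length before negation is even considered.
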
Referee: [Abstract and §4 (Evaluation)] The abstract reports clear directional findings but supplies no sample sizes, statistical tests, inter-annotator agreement for the human verification, or exact evaluation protocol (prompt templates, image-caption pairing, chance-level calculation). These details are required to assess whether data selection or prompt choices drive the reported disparities.

Authors: We appreciate this observation and have updated both the abstract and Section 4 to include the requested details. Specifically, we now report the sample size (number of image-caption pairs per language), the inter-annotator agreement (Cohen's kappa for verification), the exact prompt templates used, the image-caption pairing method, and how chance level was calculated (50% for binary choice). Additionally, we have included statistical tests (binomial tests against chance) in the results tables and text. These additions ensure the evaluation protocol is fully transparent and reproducible.
revision: yes
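
A binomial test against the 50% chance level named in this response might be sketched as below. This is a generic exact two-sided test written with the standard library only; the function name and the tie tolerance are illustrative choices, and the text does not say which implementation or sidedness the authors used.

```python
from math import exp, lgamma, log


def binomial_two_sided_p(successes, trials, p_null=0.5):
    """Exact two-sided binomial p-value against a null success rate.

    Sums the probabilities of every outcome that is no more likely than the
    observed count under the null hypothesis.
    """
    if not 0 < p_null < 1:
        raise ValueError("p_null must lie strictly between 0 and 1")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    def pmf(k):
        # Work in log space so large trial counts do not overflow.
        log_coeff = lgamma(trials + 1) - lgamma(k + 1) - lgamma(trials - k + 1)
        return exp(log_coeff + k * log(p_null) + (trials - k) * log(1 - p_null))

    probabilities = [pmf(k) for k in range(trials + 1)]
    # Small relative tolerance so floating-point ties are treated as equal.
    threshold = probabilities[successes] * (1 + 1e-7)
    return min(1.0, sum(p for p in probabilities if p <= threshold))
```

For example, `binomial_two_sided_p(correct, total)` with a language's correct count and item count would indicate whether that language's accuracy is distinguishable from guessing; accuracy significantly below 50% would also produce a small p-value, so the direction has to be read from the accuracy itself.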

Circularity Check

0 steps flagged

No significant circularity: purely empirical evaluation on new benchmark

The paper introduces a human-verified multilingual negation benchmark across seven languages and reports direct accuracy measurements for existing models (CLIP, SigLIP, MultiCLIP) plus one external correction method (SpaceVLM). No equations, parameter fits, predictions derived from the benchmark itself, or load-bearing self-citations appear in the provided text. All central claims reduce to observable performance numbers on the new dataset rather than to any self-referential construction, satisfying the default expectation of non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical benchmarking study with no free parameters, no ad-hoc axioms beyond standard ML evaluation assumptions, and no invented entities.

axioms (1)

domain assumption: Human verification produces reliable ground-truth labels for negation understanding.
Invoked when claiming the benchmark is human-verified and suitable for cross-lingual comparison.
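
One common way to probe this assumption is inter-annotator agreement, such as the Cohen's kappa mentioned in the simulated rebuttal. The sketch below implements the standard two-annotator kappa formula; the function name, the label values, and the handling of the degenerate single-label case are assumptions for illustration, since the verification protocol is not described in this text.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled independently at their own rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Computing kappa separately per language, rather than pooled, would show whether verification was equally reliable across the seven languages.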

pith-pipeline@v0.9.0 · 5547 in / 1250 out tokens · 32450 ms · 2026-05-10T03:27:25.078069+00:00 · methodology

