2 Pith papers cite this work.
citing papers explorer
- Disparities In Negation Understanding Across Languages In Vision-Language Models
VLMs exhibit an affirmation bias that varies by language. A new multilingual benchmark shows CLIP at or below chance on non-Latin scripts, MultiCLIP the most uniform, and SpaceVLM corrections unevenly effective across typologies.
- VisualBERT: A Simple and Performant Baseline for Vision and Language
VisualBERT is a Transformer model that implicitly aligns text and image regions through self-attention and achieves competitive or superior results on VQA, VCR, NLVR2, and Flickr30K after pre-training on captions.