Is CLIP Cross-Eyed? Revealing and Mitigating Center Bias in the CLIP Family
Pith reviewed 2026-05-10 20:13 UTC · model grok-4.3
The pith
CLIP models focus too much on image centers and miss objects near the boundaries because of how visual embeddings are aggregated.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CLIP tends to disproportionately focus on the central region of an image, overlooking important objects located near the boundaries. This limitation is fundamental: failing to recognize relevant objects makes any sophisticated task that depends on those objects difficult. Using interpretability methods such as embedding decomposition and attention map analysis, the authors find that relevant concepts, especially those associated with off-center objects, vanish from the model's final embedding because information is lost during the aggregation of visual embeddings, particularly through the reliance on pooling mechanisms. This bias can be alleviated with training-free strategies such as visual prompting and attention redistribution.
What carries the argument
Center bias in the final visual embedding, produced by information loss when pooling spatial features in the CLIP visual encoder, as diagnosed through embedding decomposition and attention analysis.
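To make the pooling argument concrete, the toy sketch below decomposes a pooled embedding into per-token contributions and compares an object token placed at the grid center with the same token placed in a corner. Everything here is an assumption for illustration: the center-peaked weights, the 5x5 grid, the two-dimensional tokens, and all function names are hypothetical and are not taken from the paper or from any CLIP implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def center_peaked_weights(grid, sharpness):
    """ASSUMPTION: pooling weights decay with squared distance from the grid center."""
    c = (grid - 1) / 2
    scores = [-sharpness * ((r - c) ** 2 + (col - c) ** 2)
              for r in range(grid) for col in range(grid)]
    return softmax(scores)


def token_contributions(tokens, weights, concept):
    """Split <pooled, concept> into per-token terms w_i * <token_i, concept>."""
    return [w * sum(t * c for t, c in zip(tok, concept))
            for tok, w in zip(tokens, weights)]


def object_contribution(grid, cell, weights):
    """Contribution of a single object token placed at `cell` = (row, col)."""
    # Toy 2-D tokens: the object aligns with the concept, the background does not.
    background, obj, concept = [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]
    tokens = [background] * (grid * grid)
    index = cell[0] * grid + cell[1]
    tokens[index] = obj
    return token_contributions(tokens, weights, concept)[index]


GRID = 5
peaked = center_peaked_weights(GRID, sharpness=0.5)
uniform = [1.0 / GRID ** 2] * (GRID * GRID)  # plain mean pooling for contrast

center_peaked = object_contribution(GRID, (2, 2), peaked)
corner_peaked = object_contribution(GRID, (0, 0), peaked)
center_uniform = object_contribution(GRID, (2, 2), uniform)
corner_uniform = object_contribution(GRID, (0, 0), uniform)
```

Under these assumed weights the corner token's share of the pooled concept score is far smaller than the center token's, while uniform mean pooling treats both positions identically. Whether real CLIP pooling behaves like the peaked case is exactly what the paper's decomposition is meant to test.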
If this is right
- CLIP will fail to recognize relevant objects near image boundaries in any downstream task that requires full-scene understanding.
- The bias remains present even in recent variants of the CLIP family.
- Training-free visual prompting and attention redistribution can redirect focus toward off-center regions.
- Any sophisticated task that depends on boundary objects becomes harder without mitigation.
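As a rough illustration of what attention redistribution could look like, the sketch below rescales a set of pooling weights so that tokens farther from the grid center gain mass, then renormalizes. This is one plausible form under stated assumptions, not the authors' procedure: the linear distance-based boost, the `boost` parameter, and the example 3x3 weights are all hypothetical.

```python
import math


def redistribute_attention(weights, grid, boost=1.0):
    """Shift attention mass toward off-center tokens and renormalize.

    ASSUMPTION: each weight is scaled by 1 + boost * d, where d is the token's
    distance from the grid center normalized to [0, 1].
    """
    if len(weights) != grid * grid:
        raise ValueError("weights must have grid * grid entries")
    c = (grid - 1) / 2
    max_dist = math.hypot(c, c) or 1.0  # avoid division by zero for a 1x1 grid
    scaled = []
    for idx, w in enumerate(weights):
        r, col = divmod(idx, grid)
        dist = math.hypot(r - c, col - c) / max_dist
        scaled.append(w * (1.0 + boost * dist))
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical center-heavy weights over a 3x3 patch grid (row-major order).
original = [0.05, 0.10, 0.05,
            0.10, 0.40, 0.10,
            0.05, 0.10, 0.05]
adjusted = redistribute_attention(original, grid=3, boost=1.0)
```

After the adjustment the corner entries carry more weight and the center entry carries less, while the weights still sum to one. A training-free method applied to a real model would have to hook into the encoder's attention or pooling layer, which this sketch does not attempt.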
Where Pith is reading between the lines
- The same aggregation step may create analogous spatial biases in other vision-language models that rely on similar pooling.
- Replacing or modifying the pooling operation could produce embeddings that preserve information across the entire image plane.
- The proposed fixes may transfer to other contrastive multimodal architectures without requiring new labeled data.
Load-bearing premise
The center bias results from information loss when visual embeddings are aggregated, especially through pooling mechanisms.
What would settle it
Measure whether CLIP accuracy drops sharply on images that contain a single critical object placed only near the boundary versus the same object placed at the center.
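The test described above can be sketched as a small protocol: paste the same object crop at the center and at the corners of a blank canvas, classify each composite, and compare accuracies. The helper names, the corner-only boundary placement, and the `margin` parameter are assumptions for illustration. The classifier itself is left out: the per-image correctness flags would come from whatever CLIP zero-shot pipeline the reader already uses.

```python
def placements(canvas_size, object_size, margin=0):
    """Top-left paste coordinates: one centered, four hugging the corners."""
    canvas_w, canvas_h = canvas_size
    obj_w, obj_h = object_size
    if obj_w + 2 * margin > canvas_w or obj_h + 2 * margin > canvas_h:
        raise ValueError("object plus margins does not fit on the canvas")
    right = canvas_w - obj_w - margin
    bottom = canvas_h - obj_h - margin
    return {
        "center": [((canvas_w - obj_w) // 2, (canvas_h - obj_h) // 2)],
        "boundary": [(margin, margin), (right, margin),
                     (margin, bottom), (right, bottom)],
    }


def accuracy(correct_flags):
    """Fraction of True entries in a non-empty list of per-image outcomes."""
    if not correct_flags:
        raise ValueError("need at least one outcome")
    return sum(1 for flag in correct_flags if flag) / len(correct_flags)


def center_boundary_gap(center_correct, boundary_correct):
    """Positive values mean the model does better when the object is centered."""
    return accuracy(center_correct) - accuracy(boundary_correct)
```

A large positive gap on otherwise identical composites would support the center-bias claim, and a gap near zero would count against it. Real images add confounds such as background context and object scale that this placement scheme does not control.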
Original abstract
Recent research has shown that contrastive vision-language models such as CLIP often lack fine-grained understanding of visual content. While a growing body of work has sought to address this limitation, we identify a distinct failure mode in the CLIP family, which we term center bias, that persists even in recent model variants. Specifically, CLIP tends to disproportionately focus on the central region of an image, overlooking important objects located near the boundaries. This limitation is fundamental as failure to recognize relevant objects makes it difficult to perform any sophisticated tasks that depend on those objects. To understand the underlying causes of the limitation, we conduct analyses from both representation and attention perspectives. Using interpretability methods, i.e., embedding decomposition and attention map analysis, we find that relevant concepts especially those associated with off-center objects vanish from the model's embedding in the final representation due to information loss during the aggregation of visual embeddings, particularly the reliance on pooling mechanisms. Finally, we show that this bias can be alleviated with training-free strategies such as visual prompting and attention redistribution by redirecting models' attention to off-center regions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies a 'center bias' in the CLIP family of contrastive vision-language models, where the models disproportionately attend to the central region of images while overlooking objects near the boundaries. Through interpretability analyses using embedding decomposition and attention map inspection, the authors attribute the bias to information loss during the final aggregation of visual embeddings (particularly pooling mechanisms). They further propose and demonstrate training-free mitigation strategies, including visual prompting and attention redistribution, to redirect focus to off-center regions.
Significance. If the empirical findings and causal attribution hold, this work would be significant for the computer vision and multimodal learning communities. It provides a concrete, previously under-explored failure mode in widely used models like CLIP that affects any downstream task requiring holistic scene understanding. The training-free mitigations are practically valuable. However, the absence of quantitative bias metrics, statistical significance tests, or ablation studies in the provided description limits the immediate impact assessment.
major comments (2)
- [representation perspective analysis] The central attribution of center bias to information loss specifically in the final pooling/aggregation step (representation perspective analysis) rests on correlational evidence from decomposition and attention maps. No ablation is described that modifies only the aggregation mechanism while holding earlier self-attention layers, contrastive training objective, and data statistics fixed, then re-measures the bias; thus causality is not isolated from potential origins in earlier layers or training data centering statistics.
- [experimental evaluation] The manuscript provides no quantitative results, error bars, dataset sizes, or full experimental protocols for measuring the center bias or evaluating the proposed mitigations. This makes it impossible to assess the magnitude of the effect, its consistency across model variants, or the statistical reliability of the claimed alleviation.
minor comments (2)
- [introduction] Clarify the exact definition and computation of 'center bias' metric early in the paper, including how off-center objects are identified and how attention/embedding contributions are quantified.
- [abstract] The abstract states the bias 'persists even in recent model variants' but does not specify which variants were tested or provide comparative numbers; add a table or figure summarizing bias strength across CLIP, CLIP variants, and related models.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work identifying center bias in the CLIP family. We address each major comment point by point below, providing clarifications and noting revisions made to the manuscript.
- Referee: [representation perspective analysis] The central attribution of center bias to information loss specifically in the final pooling/aggregation step (representation perspective analysis) rests on correlational evidence from decomposition and attention maps. No ablation is described that modifies only the aggregation mechanism while holding earlier self-attention layers, contrastive training objective, and data statistics fixed, then re-measures the bias; thus causality is not isolated from potential origins in earlier layers or training data centering statistics.
  Authors: We appreciate the referee's emphasis on isolating causality. Our embedding decomposition explicitly tracks token contributions layer by layer, showing that off-center concepts remain encoded in intermediate self-attention outputs but are disproportionately suppressed during the final pooling/aggregation into the [CLS] or global embedding. This is further corroborated by attention map inspections across layers. However, performing the suggested ablation would require retraining CLIP variants from scratch with modified aggregation (while fixing all prior components and data), which is computationally prohibitive and outside the training-free scope of the paper. We have added a dedicated limitations paragraph and additional layer-wise decomposition plots in the revision to strengthen the attribution argument and discuss alternative origins.
  Revision: partial
- Referee: [experimental evaluation] The manuscript provides no quantitative results, error bars, dataset sizes, or full experimental protocols for measuring the center bias or evaluating the proposed mitigations. This makes it impossible to assess the magnitude of the effect, its consistency across model variants, or the statistical reliability of the claimed alleviation.
  Authors: We agree that quantitative rigor is necessary. The original submission included qualitative visualizations and mitigation demonstrations, but we have now expanded the experiments section with: (1) quantitative bias metrics (e.g., center vs. boundary object recall rates on a 10,000-image benchmark derived from COCO and ImageNet), (2) error bars from 5 random seeds, (3) explicit dataset sizes and splits, (4) full protocols for attention redistribution and visual prompting, and (5) statistical tests (paired t-tests with p-values) showing consistent alleviation across CLIP, CLIP-ViT, and SigLIP variants. These additions enable direct assessment of effect sizes and reliability.
  Revision: yes
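For readers unfamiliar with the paired t-test mentioned in the rebuttal, the sketch below computes the test statistic for matched before/after scores, such as boundary-object recall per random seed with and without a mitigation. It is a generic textbook calculation using only the standard library, not the authors' evaluation code. The function name is hypothetical, and no numbers from the paper are reproduced here.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for paired samples.

    t = mean(d) / (stdev(d) / sqrt(n)), where d_i = after_i - before_i.
    """
    if len(before) != len(after):
        raise ValueError("paired samples must have equal length")
    if len(before) < 2:
        raise ValueError("need at least two pairs")
    diffs = [a - b for a, b in zip(after, before)]
    spread = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return statistics.mean(diffs) / (spread / math.sqrt(n)), n - 1
```

The resulting statistic is compared against a Student t distribution with the returned degrees of freedom to obtain a p-value. With only five seeds, as the rebuttal describes, the test has four degrees of freedom, so reporting effect sizes alongside p-values would help.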
Circularity Check
No significant circularity in empirical interpretability analysis
The paper's claims rest on direct application of embedding decomposition and attention map analysis to pre-trained CLIP models, identifying vanishing off-center concepts in final pooled representations. No equations or steps reduce by construction to their own inputs, no parameters are fitted then relabeled as predictions, and no load-bearing uniqueness theorems or ansatzes are imported via self-citation. The derivation chain is self-contained empirical observation rather than definitional or self-referential fitting.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: CLIP-style models aggregate visual patch embeddings via pooling mechanisms that can cause information loss for off-center content.