Is the Modality Gap a Bug or a Feature? A Robustness Perspective
Pith reviewed 2026-05-14 21:00 UTC · model grok-4.3
The pith
The modality gap in multimodal models arises from contrastive training; shrinking it in post-processing improves robustness to embedding perturbations without changing clean accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Minimizing the contrastive loss under certain conditions produces a representation where the two modalities are separated by a global gap vector orthogonal to their embeddings. The modality gap is monotonically related to robustness such that decreasing the gap preserves clean accuracy while making the model less likely to change its output under embedding perturbations. A simple post-processing step that moves one modality toward the mean of the other achieves this decrease in the gap for many real-world VLMs.
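The post-processing step described above can be sketched in a few lines. This is a minimal rendering under the assumption of unit-normalized, CLIP-style embeddings, not the authors' exact procedure; `close_gap` and its `alpha` knob are illustrative names:

```python
import numpy as np

def close_gap(image_emb, text_emb, alpha=1.0):
    """Shift text embeddings toward the mean of the image embeddings,
    reducing the modality gap, then re-project onto the unit sphere.
    alpha=1.0 moves the text centroid all the way onto the image
    centroid; smaller values close the gap only partially."""
    gap = image_emb.mean(axis=0) - text_emb.mean(axis=0)  # global gap vector
    shifted = text_emb + alpha * gap
    return shifted / np.linalg.norm(shifted, axis=1, keepdims=True)

# Toy data: two unit-normalized clusters separated along the last axis.
rng = np.random.default_rng(0)
img = rng.normal(size=(100, 8)); img[:, -1] += 3.0
txt = rng.normal(size=(100, 8)); txt[:, -1] -= 3.0
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

txt_closed = close_gap(img, txt)
gap_before = np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))
gap_after = np.linalg.norm(img.mean(axis=0) - txt_closed.mean(axis=0))
print(gap_after < gap_before)  # → True: the shift shrinks the centroid distance
```

Whether such a shift helps a given deployed model is exactly what the paper's experiments test; the snippet only shows that the operation itself is cheap and training-free.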
What carries the argument
A global gap vector that is orthogonal to the modality embeddings and arises during contrastive loss minimization, governing the monotonic relationship to robustness.
If this is right
- Post-processing to reduce the modality gap increases robustness to embedding perturbations.
- Clean accuracy on original data stays the same after reducing the gap.
- The orthogonality of the gap vector allows it to separate modalities without altering the core embedding directions.
- This effect is observed across many existing vision-language models.
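The robustness notion in these bullets, how likely a perturbation of the embedding is to change the model's output, can be made concrete as a flip rate. A sketch under assumed zero-shot classification by maximum inner product over unit-normalized embeddings; the setup and names are illustrative, not the paper's:

```python
import numpy as np

def flip_rate(image_emb, text_emb, sigma, rng):
    """Fraction of images whose nearest-text prediction changes when
    Gaussian noise of scale sigma is added to the image embedding and
    the result is re-normalized to the unit sphere."""
    clean_pred = (image_emb @ text_emb.T).argmax(axis=1)
    noisy = image_emb + sigma * rng.normal(size=image_emb.shape)
    noisy /= np.linalg.norm(noisy, axis=1, keepdims=True)
    noisy_pred = (noisy @ text_emb.T).argmax(axis=1)
    return float((noisy_pred != clean_pred).mean())

# Toy check: zero noise never flips a prediction; large noise flips many.
rng = np.random.default_rng(0)
img = rng.normal(size=(300, 32)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(10, 32)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(flip_rate(img, txt, 0.0, rng), flip_rate(img, txt, 1.0, rng))
```

The paper's monotonicity claim is then the statement that, for models satisfying its conditions, this flip rate decreases as the gap is closed while `clean_pred` on unperturbed embeddings stays the same.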
Where Pith is reading between the lines
- The finding implies that forcing perfect modality alignment might reduce robustness in some models.
- Similar orthogonal gap mechanisms could appear in other contrastive training scenarios outside vision and language.
- Practitioners could routinely apply this post-processing to improve model stability in deployed systems.
- It raises the question of whether other performance metrics beyond robustness and accuracy are affected by the gap size.
Load-bearing premise
The results depend on conditions on loss minimization and embedding geometry that are flagged only as "certain conditions"; the claims hold only where those conditions are actually satisfied in practice.
What would settle it
A counterexample would be a contrastively trained model where the gap vector is not orthogonal to the embeddings or where reducing the gap size decreases robustness to perturbations.
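Both halves of this counterexample are checkable given paired embeddings from any contrastively trained model. A minimal numpy sketch of the orthogonality half, under the assumption that the condition concerns the gap direction versus the mean-centered embeddings of each modality; the data below is synthetic:

```python
import numpy as np

def gap_orthogonality(image_emb, text_emb):
    """Mean |cosine| between the global gap vector and the mean-centered
    embeddings of both modalities. Values near 0 support the orthogonality
    claim; a clearly nonzero value on a real model would be the kind of
    counterexample described above."""
    gap = image_emb.mean(axis=0) - text_emb.mean(axis=0)
    gap /= np.linalg.norm(gap)
    centered = np.vstack([image_emb - image_emb.mean(axis=0),
                          text_emb - text_emb.mean(axis=0)])
    centered /= np.linalg.norm(centered, axis=1, keepdims=True) + 1e-12
    return float(np.abs(centered @ gap).mean())

# Synthetic case where the condition holds exactly: both modalities vary
# in the first 7 coordinates while the gap lives entirely in the 8th.
rng = np.random.default_rng(1)
base = rng.normal(size=(50, 7))
img = np.hstack([base, np.full((50, 1), 1.0)])
txt = np.hstack([base, np.full((50, 1), -1.0)])
print(gap_orthogonality(img, txt))  # → 0.0
```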
Original abstract
Many modern multi-modal models (e.g. CLIP) seek an embedding space in which the two modalities are aligned. Somewhat surprisingly, almost all existing models show a strong modality gap: the distribution of images is well-separated from the distribution of texts in the shared embedding space. Despite a series of recent papers on this topic, it is still not clear why this gap exists nor whether closing the gap in post-processing will lead to better performance on downstream tasks. In this paper we show that under certain conditions, minimizing the contrastive loss yields a representation in which the two modalities are separated by a global gap vector that is orthogonal to their embeddings. We also show that under these conditions the modality gap is monotonically related to robustness: decreasing the gap does not change the clean accuracy of the models but makes it less likely that a model will change its output when the embeddings are perturbed. Our experiments show that for many real-world VLMs we can significantly increase robustness by a simple post-processing step that moves one modality towards the mean of the other modality, without any loss of clean accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that under certain conditions, minimizing the contrastive loss in models like CLIP produces a shared embedding space in which the two modalities are separated by a global gap vector orthogonal to the embeddings. It further claims that under these conditions the gap size is monotonically related to robustness, such that a simple post-processing step that shifts one modality toward the mean of the other reduces the gap, improves robustness to embedding perturbations, and leaves clean accuracy unchanged. Experiments on real-world VLMs are presented to support the post-processing benefit.
Significance. If the stated conditions are shown to hold for typical trained VLMs, the work supplies a mechanistic account of the modality gap together with a training-free intervention that improves robustness at no cost to clean performance. The empirical demonstration on existing models indicates immediate practical utility for robustness enhancement in vision-language systems.
major comments (2)
- [Abstract] Abstract: the central claims (orthogonal global gap vector and monotonic robustness relation) are asserted only 'under certain conditions' on loss minimization and embedding geometry, yet these conditions are neither enumerated nor verified to be satisfied by standard training runs of models such as CLIP; this is load-bearing for the claimed mechanistic justification of the post-processing step.
- [Theoretical derivation] Theoretical derivation: the derivation that the gap vector is orthogonal to the embeddings and that gap size is monotonically related to robustness is presented without explicit checks that the requisite assumptions (perfect convergence to a specific optimum, embeddings confined to the appropriate subspace) survive typical training noise or finite-batch effects; the absence of such verification leaves the link between theory and the reported robustness gains unestablished.
minor comments (1)
- [Experiments] Experiments: details on statistical significance testing, variance across random seeds, and explicit controls for the post-processing intervention (e.g., comparison against random shifts of the same magnitude) are missing and should be supplied for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We agree that the conditions underlying the theoretical claims should be stated more explicitly and that the link between idealized assumptions and empirical results merits additional discussion. We will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [Abstract] Abstract: the central claims (orthogonal global gap vector and monotonic robustness relation) are asserted only 'under certain conditions' on loss minimization and embedding geometry, yet these conditions are neither enumerated nor verified to be satisfied by standard training runs of models such as CLIP; this is load-bearing for the claimed mechanistic justification of the post-processing step.
Authors: We agree that the conditions must be enumerated. In the revised manuscript we will update the abstract to list them explicitly: (1) the contrastive loss reaches its global minimum, (2) image and text embeddings lie in a subspace orthogonal to the gap vector, and (3) no symmetry-breaking regularization is present. While we cannot retroactively inspect the original CLIP training runs for exact satisfaction of these conditions, the consistent robustness gains from the post-processing step across multiple pre-trained VLMs (including CLIP variants) indicate that the conditions hold sufficiently for the claimed practical benefit. revision: yes
-
Referee: [Theoretical derivation] Theoretical derivation: the derivation that the gap vector is orthogonal to the embeddings and that gap size is monotonically related to robustness is presented without explicit checks that the requisite assumptions (perfect convergence to a specific optimum, embeddings confined to the appropriate subspace) survive typical training noise or finite-batch effects; the absence of such verification leaves the link between theory and the reported robustness gains unestablished.
Authors: The derivation is performed under idealized assumptions of perfect convergence and strict subspace confinement. We will add a dedicated paragraph in the theory section that acknowledges these assumptions and their possible violation by training noise or finite-batch effects. We will also include a controlled simulation that injects moderate Gaussian noise into the embeddings and verifies that orthogonality and the monotonic robustness relation remain approximately intact. This addition will make the connection between the idealized analysis and the reported empirical gains more transparent. revision: partial
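The promised simulation could take roughly this shape; a synthetic sketch (not the paper's experiment) in which the orthogonal-gap structure holds exactly by construction and is then re-measured under moderate Gaussian embedding noise:

```python
import numpy as np

def mean_gap_cosine(a, b):
    """Mean |cosine| between the gap direction (difference of modality
    means) and the mean-centered embeddings of both modalities."""
    gap = a.mean(axis=0) - b.mean(axis=0)
    gap /= np.linalg.norm(gap)
    cent = np.vstack([a - a.mean(axis=0), b - b.mean(axis=0)])
    cent /= np.linalg.norm(cent, axis=1, keepdims=True) + 1e-12
    return float(np.abs(cent @ gap).mean())

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 15))
img = np.hstack([base, np.full((200, 1), 1.0)])   # gap confined to last axis
txt = np.hstack([base, np.full((200, 1), -1.0)])

clean = mean_gap_cosine(img, txt)   # exactly 0 by construction
sigma = 0.1                         # moderate isotropic Gaussian noise
noisy = mean_gap_cosine(img + sigma * rng.normal(size=img.shape),
                        txt + sigma * rng.normal(size=txt.shape))
print(clean, noisy)  # orthogonality should degrade only mildly
```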
Circularity Check
No circularity: gap-vector derivation follows from contrastive loss minimization under stated conditions without reducing to self-definition or fitted inputs.
Full rationale
The paper presents the orthogonal global gap vector and its monotonic relation to robustness as consequences of minimizing the contrastive loss under certain (explicitly flagged) conditions on loss minimization and embedding geometry. No quoted equations or steps define the gap in terms of itself, rename a fitted parameter as a prediction, or rely on load-bearing self-citations whose prior results are unverified. The derivation chain remains independent of its outputs, with the 'certain conditions' serving as assumptions rather than circular imports. This is the common case of a self-contained theoretical claim.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: minimizing the contrastive loss under certain conditions produces embeddings separated by a global gap vector orthogonal to those embeddings.