pith. machine review for the scientific record.

arxiv: 2603.29080 · v2 · submitted 2026-03-30 · 💻 cs.CV · cs.LG

Recognition: no theorem link

Is the Modality Gap a Bug or a Feature? A Robustness Perspective

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:00 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords modality gap · contrastive loss · robustness · vision-language models · CLIP · multimodal embeddings · post-processing · embedding perturbations

The pith

The modality gap in multimodal models arises from contrastive training and enhances robustness to perturbations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Many multimodal models exhibit a clear separation between image and text embeddings despite efforts to align them. This paper shows that under certain conditions, minimizing the contrastive loss naturally produces this gap as an orthogonal vector in the embedding space. The size of this gap turns out to be directly linked to how robust the model is against small changes in the embeddings. Specifically, shrinking the gap through a simple adjustment leaves the model's accuracy on clean data intact but reduces the chance it will flip its predictions under perturbations. Experiments on real vision-language models confirm that this post-processing step reliably boosts robustness.
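
The post-processing the summary describes, shifting one modality toward the mean of the other, is simple enough to sketch. A minimal illustration in NumPy, assuming unit-normalized CLIP-style embeddings; the `alpha` knob and the toy data are hypothetical stand-ins, not the paper's exact procedure:

```python
import numpy as np

def close_gap(img_emb, txt_emb, alpha=1.0):
    """Shrink the modality gap by shifting text embeddings toward the image
    mean. alpha=1 removes the estimated global gap vector entirely; rows are
    re-normalized since CLIP-style models compare by cosine similarity."""
    gap = txt_emb.mean(axis=0) - img_emb.mean(axis=0)  # global gap estimate
    shifted = txt_emb - alpha * gap
    return shifted / np.linalg.norm(shifted, axis=1, keepdims=True)

# toy embeddings: the two modalities are separated along the last axis
rng = np.random.default_rng(0)
img = rng.normal(size=(200, 8)); img[:, -1] = 0.0
txt = rng.normal(size=(200, 8)); txt[:, -1] = 5.0
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

gap_before = np.linalg.norm(txt.mean(0) - img.mean(0))
gap_after = np.linalg.norm(close_gap(img, txt).mean(0) - img.mean(0))
```

On this toy data the shift collapses most of the gap while leaving within-modality geometry untouched, which is the regime in which the paper predicts robustness gains at no cost to clean accuracy.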

Core claim

Minimizing the contrastive loss under certain conditions produces a representation where the two modalities are separated by a global gap vector orthogonal to their embeddings. The modality gap is monotonically related to robustness such that decreasing the gap preserves clean accuracy while making the model less likely to change its output under embedding perturbations. A simple post-processing step that moves one modality toward the mean of the other achieves this decrease in the gap for many real-world VLMs.

What carries the argument

A global gap vector that is orthogonal to the modality embeddings and arises during contrastive loss minimization, governing the monotonic relationship to robustness.

If this is right

  • Post-processing to reduce the modality gap increases robustness to embedding perturbations.
  • Clean accuracy on original data stays the same after reducing the gap.
  • The orthogonality of the gap vector allows it to separate modalities without altering the core embedding directions.
  • This effect is observed across many existing vision-language models.
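
The robustness notion at stake, how often predictions flip under embedding perturbations, can be probed with a simple zero-shot setup. A hedged sketch: the toy data and Gaussian noise model here are illustrative, not the paper's benchmark protocol.

```python
import numpy as np

def flip_rate(img_emb, class_emb, sigma, trials=20, seed=0):
    """Fraction of zero-shot predictions (nearest class embedding by dot
    product) that change when Gaussian noise of scale sigma is added to
    the image embeddings."""
    rng = np.random.default_rng(seed)
    base = (img_emb @ class_emb.T).argmax(axis=1)
    flips = []
    for _ in range(trials):
        noisy = img_emb + sigma * rng.normal(size=img_emb.shape)
        noisy /= np.linalg.norm(noisy, axis=1, keepdims=True)
        flips.append(((noisy @ class_emb.T).argmax(axis=1) != base).mean())
    return float(np.mean(flips))

# toy zero-shot task: 4 orthogonal class embeddings, images clustered near them
rng = np.random.default_rng(1)
classes = np.eye(4, 8)                      # 4 class prototypes in 8 dims
labels = rng.integers(0, 4, size=300)
imgs = classes[labels] + 0.2 * rng.normal(size=(300, 8))
imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)

mild = flip_rate(imgs, classes, sigma=0.02)
harsh = flip_rate(imgs, classes, sigma=0.5)
```

Mild noise barely flips any predictions while harsh noise flips many; the paper's claim is that, at a fixed noise level, a smaller gap moves a model toward the mild end of this curve.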

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The finding implies that forcing perfect modality alignment might reduce robustness in some models.
  • Similar orthogonal gap mechanisms could appear in other contrastive training scenarios outside vision and language.
  • Practitioners could routinely apply this post-processing to improve model stability in deployed systems.
  • It raises the question of whether other performance metrics beyond robustness and accuracy are affected by the gap size.

Load-bearing premise

The results hold only if certain conditions on loss minimization and embedding geometry, which the abstract leaves unspecified, are satisfied in practice.

What would settle it

A counterexample would be a contrastively trained model where the gap vector is not orthogonal to the embeddings or where reducing the gap size decreases robustness to perturbations.
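
The orthogonality half of that test is cheap to run on any frozen model: estimate the gap vector and measure how close it is to perpendicular with the centered embeddings of both modalities. A sketch with synthetic data standing in for real encoder outputs:

```python
import numpy as np

def gap_cosines(img_emb, txt_emb):
    """Absolute cosines between the global gap vector and the centered
    embeddings of both modalities; values near 0 support orthogonality."""
    g = txt_emb.mean(axis=0) - img_emb.mean(axis=0)
    g = g / np.linalg.norm(g)
    centered = np.vstack([img_emb - img_emb.mean(axis=0),
                          txt_emb - txt_emb.mean(axis=0)])
    return np.abs(centered @ g) / np.linalg.norm(centered, axis=1)

# synthetic embeddings whose gap lies in a direction both modalities ignore
rng = np.random.default_rng(2)
img = np.hstack([rng.normal(size=(400, 15)), np.zeros((400, 1))])
txt = np.hstack([rng.normal(size=(400, 15)), np.full((400, 1), 1.8)])
mean_cos = gap_cosines(img, txt).mean()
```

For a real VLM one would feed encoder outputs instead of the synthetic arrays; a mean absolute cosine far from zero on such data would be evidence against the orthogonality half of the claim.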

Figures

Figures reproduced from arXiv: 2603.29080 by Oshri Naparstek, Rhea Chowers, Udi Barzelay, Yair Weiss.

Figure 2: Is the modality gap a bug or a feature? Changing the gap …

Figure 3: Three points in each modality in R² and the corresponding multi-modal contrastive loss, along with the magnitude of the gradient of the loss. Lines connect true pairs. As long as the points satisfy relative alignment (the true pair of any point is also its nearest neighbor), the loss and the gradient magnitude are close to zero, even when there exists a gap.

Figure 4: The evolution of embeddings using gradient descent on the contrastive loss (Eq. …).

Figure 5: (Top) Two initial embeddings where the bottom embedding is color-coded based on S_i^y; S_i^y decreases with distance to the other modality. (Bottom) The training dynamics of a toy model initialized with isotropic Gaussians. Training starts by shrinking variance in the direction of the gap according to Theorem 3.1.

Figure 7: Illustration of the relationship between robustness and the …

Figure 8: The zero-shot classification accuracy, multiple choice …

Figure 9: CLIP variants are increasingly more robust to quantization …

Figure 11: (Top) Two sets of points that perfectly satisfy dimensionality collapse; the bottom points are color-coded by S_i^y. Note that S_i > 1 for points near the center. (Bottom) The training dynamics. Despite no variance in the direction of the gap for either modality, the solution converged to has no gap and is perfectly aligned, because in the initial iterations points near the center are pushed away …

Figure 12: We calculate S_i^x and S_i^y throughout the training of N = 500 embedding pairs initialized from a Gaussian distribution with variance σ² = 0.01 and a gap of ‖g‖ = 1.8. As training progresses, the values of S_i^x and S_i^y concentrate around their means, which by definition equal 1, making the matrices Q^x and Q^y more doubly stochastic.

Figure 13: The orthogonality assumption: the cosine of the angle …

Figure 14: Results for different noise distributions and models on CIFAR10. All are normalized to have variance …

Figure 15: We compute d(C) according to Eq. (50). For all models we tested, with over 400 different captions, d(C) ≈ 1, suggesting the noise is extremely correlated in the embedding space.

Figure 17: (Top) When training with τ = 1, training converges to a solution without a gap, despite the existence of an initial gap and orthogonality. (Middle) This is consistent for training in higher dimensions as well; different temperatures have different effects on how much of the gap is closed. When temperatures are ≥ 1, the gap closes throughout training. At higher temperatures training hardly differs from initializ…

Figure 18: The zero-shot classification accuracy and robustness under noise …

Figure 19: Even when using Algorithm 1, the drop in R@1 for SigLIP [40] on image-to-text retrieval on the MS-COCO dataset [13] is negligible relative to the improvement in robustness for different Gaussian noises (left). Sec. H.2 shows the ranges of the singular-value threshold ϵ for which the increase in robustness (for Gaussian noise with σ² = 0.01) is larger than the decrease in R@1.
Original abstract

Many modern multi-modal models (e.g. CLIP) seek an embedding space in which the two modalities are aligned. Somewhat surprisingly, almost all existing models show a strong modality gap: the distribution of images is well-separated from the distribution of texts in the shared embedding space. Despite a series of recent papers on this topic, it is still not clear why this gap exists nor whether closing the gap in post-processing will lead to better performance on downstream tasks. In this paper we show that under certain conditions, minimizing the contrastive loss yields a representation in which the two modalities are separated by a global gap vector that is orthogonal to their embeddings. We also show that under these conditions the modality gap is monotonically related to robustness: decreasing the gap does not change the clean accuracy of the models but makes it less likely that a model will change its output when the embeddings are perturbed. Our experiments show that for many real-world VLMs we can significantly increase robustness by a simple post-processing step that moves one modality towards the mean of the other modality, without any loss of clean accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that under certain conditions, minimizing the contrastive loss in models like CLIP produces a shared embedding space in which the two modalities are separated by a global gap vector orthogonal to the embeddings. It further claims that under these conditions the gap size is monotonically related to robustness, such that a simple post-processing step that shifts one modality toward the mean of the other reduces the gap, improves robustness to embedding perturbations, and leaves clean accuracy unchanged. Experiments on real-world VLMs are presented to support the post-processing benefit.

Significance. If the stated conditions are shown to hold for typical trained VLMs, the work supplies a mechanistic account of the modality gap together with a training-free intervention that improves robustness at no cost to clean performance. The empirical demonstration on existing models indicates immediate practical utility for robustness enhancement in vision-language systems.

major comments (2)
  1. [Abstract] The central claims (an orthogonal global gap vector and a monotonic gap-robustness relation) are asserted only 'under certain conditions' on loss minimization and embedding geometry, yet these conditions are neither enumerated nor verified to hold for standard training runs of models such as CLIP. This is load-bearing for the claimed mechanistic justification of the post-processing step.
  2. [Theoretical derivation] The derivation that the gap vector is orthogonal to the embeddings and that gap size is monotonically related to robustness comes without explicit checks that the requisite assumptions (perfect convergence to a specific optimum; embeddings confined to the appropriate subspace) survive typical training noise or finite-batch effects. Absent such verification, the link between the theory and the reported robustness gains remains unestablished.
minor comments (1)
  1. [Experiments] Details on statistical significance testing, variance across random seeds, and explicit controls for the post-processing intervention (e.g., comparison against random shifts of the same magnitude) are missing and should be supplied for reproducibility.
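
The control the report asks for is easy to specify: compare the gap-closing shift against a shift of equal magnitude in a random direction. A minimal sketch of what such a control could look like; the data and helper names are hypothetical, not from the paper:

```python
import numpy as np

def gap_vector(img_emb, txt_emb):
    """Global gap estimate: difference of modality means."""
    return txt_emb.mean(axis=0) - img_emb.mean(axis=0)

rng = np.random.default_rng(3)
img = rng.normal(size=(300, 16))
txt = rng.normal(size=(300, 16)) + 2.0     # constant offset creates a gap

g = gap_vector(img, txt)
# intervention: shift text toward the image mean (closes the gap exactly)
txt_closed = txt - g
# control: shift by the same magnitude in a random direction
v = rng.normal(size=16)
txt_control = txt + np.linalg.norm(g) * v / np.linalg.norm(v)

gap_closed = np.linalg.norm(gap_vector(img, txt_closed))
gap_control = np.linalg.norm(gap_vector(img, txt_control))
```

If robustness gains were a generic side effect of moving embeddings around, the control shift would match the targeted one; the paper's claim predicts the two should diverge.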

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that the conditions underlying the theoretical claims should be stated more explicitly and that the link between idealized assumptions and empirical results merits additional discussion. We will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] The central claims (an orthogonal global gap vector and a monotonic gap-robustness relation) are asserted only 'under certain conditions' on loss minimization and embedding geometry, yet these conditions are neither enumerated nor verified to hold for standard training runs of models such as CLIP. This is load-bearing for the claimed mechanistic justification of the post-processing step.

    Authors: We agree that the conditions must be enumerated. In the revised manuscript we will update the abstract to list them explicitly: (1) the contrastive loss reaches its global minimum, (2) image and text embeddings lie in a subspace orthogonal to the gap vector, and (3) no symmetry-breaking regularization is present. While we cannot retroactively inspect the original CLIP training runs for exact satisfaction of these conditions, the consistent robustness gains from the post-processing step across multiple pre-trained VLMs (including CLIP variants) indicate that the conditions hold sufficiently for the claimed practical benefit. revision: yes

  2. Referee: [Theoretical derivation] The derivation that the gap vector is orthogonal to the embeddings and that gap size is monotonically related to robustness comes without explicit checks that the requisite assumptions (perfect convergence to a specific optimum; embeddings confined to the appropriate subspace) survive typical training noise or finite-batch effects. Absent such verification, the link between the theory and the reported robustness gains remains unestablished.

    Authors: The derivation is performed under idealized assumptions of perfect convergence and strict subspace confinement. We will add a dedicated paragraph in the theory section that acknowledges these assumptions and their possible violation by training noise or finite-batch effects. We will also include a controlled simulation that injects moderate Gaussian noise into the embeddings and verifies that orthogonality and the monotonic robustness relation remain approximately intact. This addition will make the connection between the idealized analysis and the reported empirical gains more transparent. revision: partial
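
The controlled simulation the rebuttal proposes can be prototyped in a few lines: construct an idealized orthogonal-gap configuration, inject Gaussian noise, and check that the measured gap direction stays roughly orthogonal to the embeddings. The dimension, gap size, and noise scale below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 500, 16
# idealized optimum: both modalities span the first d-1 dims,
# separated by a gap of 1.8 along the last dim
img = np.hstack([rng.normal(size=(n, d - 1)), np.zeros((n, 1))])
txt = np.hstack([rng.normal(size=(n, d - 1)), np.full((n, 1), 1.8)])

def mean_abs_cos(a, b):
    """Mean |cos| between the gap vector and centered embeddings;
    near 0 means the gap is near-orthogonal to both modalities."""
    g = b.mean(0) - a.mean(0)
    g = g / np.linalg.norm(g)
    c = np.vstack([a - a.mean(0), b - b.mean(0)])
    return float(np.mean(np.abs(c @ g) / np.linalg.norm(c, axis=1)))

clean = mean_abs_cos(img, txt)
noisy = mean_abs_cos(img + 0.1 * rng.normal(size=img.shape),
                     txt + 0.1 * rng.normal(size=txt.shape))
```

Under moderate noise the orthogonality statistic degrades only mildly in this toy setting, which is the kind of behavior the revised manuscript would need to demonstrate on real training runs.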

Circularity Check

0 steps flagged

No circularity: gap-vector derivation follows from contrastive loss minimization under stated conditions without reducing to self-definition or fitted inputs.

full rationale

The paper presents the orthogonal global gap vector and its monotonic relation to robustness as consequences of minimizing the contrastive loss under certain (explicitly flagged) conditions on loss minimization and embedding geometry. No quoted equations or steps define the gap in terms of itself, rename a fitted parameter as a prediction, or rely on load-bearing self-citations whose prior results are unverified. The derivation chain remains independent of its outputs, with the 'certain conditions' serving as assumptions rather than circular imports. This is the common case of a self-contained theoretical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that contrastive-loss minima produce an orthogonal global gap vector whose magnitude controls robustness; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption Minimizing the contrastive loss under certain conditions produces embeddings separated by a global gap vector orthogonal to the embeddings
    Invoked directly in the abstract as the basis for both the gap existence and the robustness relation.

pith-pipeline@v0.9.0 · 5499 in / 1334 out tokens · 46321 ms · 2026-05-14T21:00:48.905206+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 1 canonical work page · 1 internal anchor

  1. [1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  2. [2] Alex Fang, Gabriel Ilharco, Mitchell Wortsman, Yuhao Wan, Vaishaal Shankar, Achal Dave, and Ludwig Schmidt. Data determines distributional robustness in contrastive language image pre-training (CLIP). In Proceedings of the 39th International Conference on Machine Learning, pages 6216–6234. PMLR, 2022.
  3. [3] Deepanway Ghosal, Navonil Majumder, Roy Lee, Rada Mihalcea, and Soujanya Poria. Language guided visual question answering: Elevate your multimodal language model using knowledge-enriched prompts. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 12096–12102, Singapore, 2023. Association for Computational Linguistics.
  4. [4] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, 2021.
  5. [5] Qian Jiang, Changyou Chen, Han Zhao, Liqun Chen, Qing Ping, Son Dinh Tran, Yi Xu, Belinda Zeng, and Trishul Chilimbi. Understanding and constructing latent modality structures in multi-modal representation learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 7661–7671. IE…
  6. [6] Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning. In International Conference on Learning Representations, 2022.
  7. [7] Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up GANs for text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), …
  8. [8] Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15190–15200, 2023.
  9. [9] Hyunjae Kim, Seunghyun Yoon, Trung Bui, Handong Zhao, Quan Hung Tran, Franck Dernoncourt, and Jaewoo Kang. Fine-tuning CLIP text encoders with two-step paraphrasing. In Findings of the Association for Computational Linguistics: EACL 2024, St. Julian's, Malta, March 17-22, 2024, pages 2175–2184. Association for Computational Linguistics, 2024.
  10. [10] Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, and Yejin Choi. Collective generation of natural image descriptions. In The 50th Annual Meeting of the Association for Computational Linguistics, July 8-14, 2012, Jeju Island, Korea, Volume 1: Long Papers, pages 359–368. The Association for Computer Lingu…
  11. [11] Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. In Advances in Neural Information Processing Systems, 2022.
  12. [12] Christopher Liao, Christian So, Theodoros Tsiligkaridis, and Brian Kulis. Multimodal unsupervised domain generalization by retrieving across the modality gap. In The Thirteenth International Conference on Learning Representations, 2025.
  13. [13] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing.
  14. [14] D. Marco and D. L. Neuhoff. The validity of the additive noise model for uniform scalar quantizers. IEEE Transactions on Information Theory, 51(5):1739–1755, 2005.
  15. [15] Yifei Ming and Yixuan Li. Understanding retrieval-augmented task adaptation for vision-language models. In Proceedings of the 41st International Conference on Machine Learning, pages 35719–35743. PMLR, 2024.
  16. [16] Marco Mistretta, Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, and Andrew D. Bagdanov. Cross the gap: Exposing the intra-modal misalignment in CLIP via modality inversion. In The Thirteenth International Conference on Learning Representations, 2025.
  17. [17] Changdae Oh, Junhyuk So, Hoyoon Byun, YongTaek Lim, Minchul Shin, Jong-June Jeon, and Kyungwoo Song. Geodesic multi-modal mixup for robust fine-tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  18. [18] Maitreya Patel, Changhoon Kim, Sheng Cheng, Chitta Baral, and Yezhou Yang. ECLIPSE: A resource-efficient text-to-image prior for image generations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9069–9078, 2024.
  19. [19] Jielin Qiu, Yi Zhu, Xingjian Shi, Zhiqiang Tang, Ding Zhao, Bo Li, and Mu Li. Benchmarking robustness under distribution shift of multimodal image-text models. In NeurIPS 2022 Workshop on Distribution Shifts: Connecting Methods and Applications, 2022.
  20. [20] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  21. [21] Sameera Ramasinghe, Violetta Shevchenko, Gil Avraham, and Ajanthan Thalaiyasingam. Accept the modality gap: An exploration in the hyperbolic space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27263–27272, 2024.
  22. [22] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
  23. [23] Christian Schlarmann and Matthias Hein. On the adversarial robustness of multi-modal foundation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 3677–3685, 2023.
  24. [24] Christian Schlarmann, Naman Deep Singh, Francesco Croce, and Matthias Hein. Robust CLIP: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models. In Proceedings of the 41st International Conference on Machine Learning, pages 43685–43704. PMLR, 2024.
  25. [25] Simon Schrodi, David T. Hoffmann, Max Argus, Volker Fischer, and Thomas Brox. Two effects, one trigger: On the modality gap, object bias, and information imbalance in contrastive vision-language models. In The Thirteenth International Conference on Learning Representations, 2025.
  26. [26] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-OKVQA: A benchmark for visual question answering using world knowledge. In Computer Vision – ECCV 2022 – 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Part VIII, pages 146–162. Springer, 2022.
  27. [27] Peiyang Shi, Michael C. Welle, Mårten Björkman, and Danica Kragic. Towards understanding the modality gap in CLIP. In ICLR 2023 Workshop on Multimodal Representation Learning: Perks and Pitfalls, 2023.
  28. [28] Ofir Shifman and Yair Weiss. Lost in translation: Modern neural networks still struggle with small realistic image transformations. In Computer Vision – ECCV 2024 – 18th European Conference, Milan, Italy, September 29-October 4, 2024, Part LXIX, pages 231–247. Springer, 2024.
  29. [29] Yang Shu, Xingzhuo Guo, Jialong Wu, Ximei Wang, Jianmin Wang, and Mingsheng Long. CLIPood: Generalizing CLIP to out-of-distributions. In Proceedings of the 40th International Conference on Machine Learning, pages 31716–31731. PMLR, 2023.
  30. [30] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2016.
  31. [31] Weijie Tu, Weijian Deng, and Tom Gedeon. A closer look at the robustness of contrastive language-image pre-training (CLIP). In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  32. [32] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding, 2019.
  33. [33] Qizhou Wang, Yong Lin, Yongqiang Chen, Ludwig Schmidt, Bo Han, and Tong Zhang. Do CLIP models always generalize better than ImageNet models? In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
  34. [34] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pages 9929–9939. PMLR, 2020.
  35. [35] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. Robust fine-tuning of zero-shot models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7949–7961, …
  36. [36] Hu Xu, Saining Xie, Xiaoqing Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying CLIP data. In The Twelfth International Conference on Learning Representations, 2024.
  37. [37] Can Yaras, Siyi Chen, Peng Wang, and Qing Qu. Explaining and mitigating the modality gap in contrastive multimodal learning. In The Second Conference on Parsimony and Learning (Proceedings Track), 2025.
  38. [38] Mert Yazan, Suzan Verberne, and Frederik Situmeang. The impact of quantization on retrieval-augmented generation: An analysis of small LLMs. In Proceedings of the Workshop Information Retrieval's Role in RAG Systems (IR-RAG 2024), co-located with the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024), …
  39. [39] Jingwen Ye, Ruonan Yu, Songhua Liu, and Xinchao Wang. Mutual-modality adversarial attack with semantic perturbation. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EA…
  40. [40] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 11941–11952. IEEE, 2023.
  41. [41] Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D. Manning, and Curtis P. Langlotz. Contrastive learning of medical visual representations from paired images and text. In Proceedings of the 7th Machine Learning for Healthcare Conference, pages 2–25. PMLR, 2022.
  42. [42] Yuhui Zhang, Jeff Z. HaoChen, Shih-Cheng Huang, Kuan-Chieh Wang, James Zou, and Serena Yeung. Diagnosing and rectifying vision models using language. In The Eleventh International Conference on Learning Representations, 2023.
  43. [43] Yuhui Zhang, Elaine Sui, and Serena Yeung. Connect, collapse, corrupt: Learning cross-modal tasks with uni-modal data. In The Twelfth International Conference on Learning Representations, 2024.
  44. [44] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.