Beyond Cross-Modal Alignment: Measuring and Leveraging Modality Gap in Vision-Language Models
Pith reviewed 2026-05-23 02:51 UTC · model grok-4.3
The pith
Modality gaps in vision-language models can be measured with a dominance score and leveraged via training-free editing to improve bias mitigation and generation control.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Modality Dominance Score attributes multimodal features to vision-dominant, language-dominant, and cross-modal classes, which supports automatic interpretability metrics and enables training-free model editing that mitigates bias in gender classification, generates cross-modal adversarial examples, and provides modality-specific control in text-to-image generation.
What carries the argument
The Modality Dominance Score (MDS), which classifies features by modality dominance to support targeted, training-free edits.
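This summary does not reproduce the paper's formula for MDS, so the following is only an illustrative sketch of what a score of this kind could look like. It assumes, hypothetically, that each feature has recorded activations on image inputs and on text inputs, scores the feature by the image share of total mean absolute activation, and assigns the three classes with arbitrary thresholds. The function names, the ratio, and the thresholds `low` and `high` are assumptions made for illustration, not the authors' definition.

```python
from statistics import fmean


def dominance_score(image_acts, text_acts, eps=1e-8):
    """Hypothetical score in [0, 1]: image share of mean absolute activation.

    NOTE: this ratio is an assumption for illustration, not the paper's MDS.
    """
    img = fmean([abs(a) for a in image_acts])
    txt = fmean([abs(a) for a in text_acts])
    return img / (img + txt + eps)


def classify_feature(score, low=0.35, high=0.65):
    """Map a score to one of the three classes named in the paper.

    The thresholds are arbitrary placeholders.
    """
    if score >= high:
        return "vision-dominant"
    if score <= low:
        return "language-dominant"
    return "cross-modal"
```

Under these assumptions, a feature that fires only on images scores close to 1, a feature that fires only on text scores close to 0, and a feature that fires equally on both scores close to 0.5.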
If this is right
- Training-free editing reduces gender bias in classification outputs.
- Editing produces cross-modal adversarial examples that exploit modality gaps.
- Editing allows explicit control over whether vision or language dominates in text-to-image outputs.
- Interpretability tools built on the same score enable systematic, task-agnostic analysis of multimodal models.
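The summary does not specify which editing operation the paper uses. One common form of training-free editing is to rescale selected feature activations at inference time, and the sketch below assumes that form. The function `edit_features`, its parameters, and the label strings are hypothetical names introduced here for illustration.

```python
def edit_features(features, labels, suppress, scale=0.0):
    """Return a copy of `features` with one dominance class rescaled.

    features: list of float activations, one per feature.
    labels:   list of class names, one per feature.
    suppress: class name whose features are multiplied by `scale`.
    scale:    0.0 removes the class, values above 1.0 amplify it.
    Assumption: editing is plain activation rescaling, with no retraining.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    return [f * scale if lab == suppress else f
            for f, lab in zip(features, labels)]
```

With `scale=0.0` this removes one class of features entirely. A value between 0 and 1 attenuates the class instead, which is one way such an edit could shift the balance between vision and language without touching model weights.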
Where Pith is reading between the lines
- The same classification approach could be tested on other multimodal pairs such as audio-language or video-text.
- If gaps prove necessary for perception, future alignment objectives might deliberately preserve rather than eliminate them.
- Lightweight editing of this kind could lower the barrier to customizing deployed models for specific fairness or control goals.
Load-bearing premise
The Modality Dominance Score reliably assigns features to vision or language in a way that matches human perception of modality dominance.
What would settle it
Human raters classifying the same features by dominant modality show low agreement with the MDS labels, or the proposed editing steps produce no measurable gains on bias mitigation, adversarial example success, or modality control in generation.
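The agreement test described above can be made concrete with a chance-corrected statistic. The sketch below computes Cohen's kappa between two label lists, for example human labels and MDS labels for the same features. Choosing kappa is an assumption made here; the paper and the review do not name a specific agreement measure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long lists of categorical labels."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and equally long")
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labelled independently
    # with their own marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A kappa near 1 would indicate that the MDS labels track human judgments, while a value near 0 would indicate agreement no better than chance, which is the outcome the premise above cannot survive.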
Original abstract
The success of vision-language models is primarily attributed to effective alignment across modalities such as vision and language. However, modality gaps persist in existing alignment algorithms and appear necessary for human perception as evidenced by modality-specific phenomena like visual texture and linguistic tone. These observations motivate us to computationally measure and leverage modality gaps to improve downstream tasks. We first introduce the Modality Dominance Score (MDS), which attributes multimodal features to specific modalities by categorizing them into three classes: vision-dominant features, language-dominant features, and cross-modal features. We then propose automatic interpretability metrics to evaluate these modality-specific features in a scalable manner. Finally, we demonstrate that the training-free model editing enhances multiple downstream tasks, including mitigating bias in gender classification, generating cross-modal adversarial examples, and enabling modality-specific control in text-to-image generation. Combined with task-agnostic interpretability tools, our work offers insights for systematic analysis and lightweight editing of multimodal models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Modality Dominance Score (MDS) to categorize features in vision-language models into vision-dominant, language-dominant, and cross-modal classes. It proposes automatic interpretability metrics for these features and claims that training-free editing based on MDS improves bias mitigation in gender classification, generation of cross-modal adversarial examples, and modality-specific control in text-to-image generation, motivated by the idea that modality gaps are necessary for human-like perception.
Significance. If the MDS classes prove reliable and the editing interventions demonstrably leverage modality gaps rather than generic feature effects, the work could enable lightweight, training-free analysis and editing of VLMs, shifting emphasis from pure alignment to controlled use of modality gaps with task-agnostic tools.
major comments (2)
- [Abstract and methods (MDS definition)] The central claim that MDS produces perceptually or functionally meaningful classes (vision-dominant, language-dominant, cross-modal) that support effective training-free editing is load-bearing for all downstream results, yet the manuscript provides no direct validation (e.g., human studies or ablation against random/feature-importance baselines) that MDS labels align with human modality attribution rather than being an artifact of the scoring heuristic.
- [Experimental results] In the experimental results (bias mitigation, adversarial examples, T2I control), the reported gains are attributed to modality-gap leverage, but without controls showing that the same edits applied to non-MDS partitions yield no improvement, the results could be explained by standard feature masking rather than the claimed modality-specific mechanism.
minor comments (2)
- [Methods] Notation for MDS computation should be clarified with an explicit equation or pseudocode to allow reproduction.
- [Interpretability metrics] The automatic interpretability metrics are mentioned but their exact formulation and correlation with human judgments are not detailed enough for independent verification.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback. We address each major comment below and will incorporate revisions to strengthen the validation of MDS and the experimental controls.
- Referee: [Abstract and methods (MDS definition)] The central claim that MDS produces perceptually or functionally meaningful classes (vision-dominant, language-dominant, cross-modal) that support effective training-free editing is load-bearing for all downstream results, yet the manuscript provides no direct validation (e.g., human studies or ablation against random/feature-importance baselines) that MDS labels align with human modality attribution rather than being an artifact of the scoring heuristic.
Authors: We agree that additional validation would strengthen the claims. The manuscript introduces automatic interpretability metrics as a scalable means to assess modality-specific features. To directly respond to the concern, we will add ablations in the revision that compare MDS partitions against random and feature-importance baselines, demonstrating that improvements are not artifacts of the scoring method. Human studies, while valuable, introduce significant subjectivity in modality attribution and are beyond the current scope; the quantitative ablations provide a rigorous alternative.
Revision: yes
- Referee: [Experimental results] In the experimental results (bias mitigation, adversarial examples, T2I control), the reported gains are attributed to modality-gap leverage, but without controls showing that the same edits applied to non-MDS partitions yield no improvement, the results could be explained by standard feature masking rather than the claimed modality-specific mechanism.
Authors: We acknowledge this is a valid concern for isolating the modality-specific mechanism. In the revised manuscript, we will add control experiments applying the same training-free edits to non-MDS feature partitions across the three tasks. We expect these to show no comparable gains, which would support that the reported improvements arise from leveraging modality gaps rather than generic feature masking effects.
Revision: yes
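A control of the kind discussed in this exchange could be organized as below. This is a minimal sketch under stated assumptions: `evaluate` is a hypothetical caller-supplied function that applies the edit to a given set of feature indices and returns a task gain, and the random partitions are drawn uniformly with the same size as the MDS partition. None of these names come from the paper.

```python
import random


def random_partition_control(num_features, mds_indices, evaluate,
                             trials=100, seed=0):
    """Compare the gain from editing the MDS partition against random ones.

    evaluate: hypothetical callable taking a set of feature indices and
              returning a numeric gain (higher is better).
    Returns (mds_gain, random_gains, fraction of random trials >= mds_gain).
    """
    rng = random.Random(seed)
    k = len(mds_indices)
    mds_gain = evaluate(set(mds_indices))
    random_gains = [
        evaluate(set(rng.sample(range(num_features), k)))
        for _ in range(trials)
    ]
    # Share of random partitions that match or beat the MDS partition.
    frac_at_least = sum(g >= mds_gain for g in random_gains) / trials
    return mds_gain, random_gains, frac_at_least
```

If a large share of size-matched random partitions matched or beat the MDS partition, the generic feature-masking explanation raised by the referee would remain plausible; a small share would favor the modality-specific account.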
Circularity Check
No significant circularity detected
The paper introduces the Modality Dominance Score (MDS) as a new computational measure, proposes automatic interpretability metrics, and demonstrates training-free editing applications. No equations, fitted parameters called predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described chain. The central claims rest on newly defined quantities and empirical demonstrations rather than reducing to inputs by construction. This is the expected self-contained case.