Beyond Cross-Modal Alignment: Measuring and Leveraging Modality Gap in Vision-Language Models

Hanqi Yan; Jindong Gu; Lu Yin; Paul Pu Liang; Xiangxiang Cui; Yifei Wang; Yulan He

arxiv: 2502.14888 · v4 · submitted 2025-02-16 · 💻 cs.CV · cs.AI

Hanqi Yan, Xiangxiang Cui, Lu Yin, Jindong Gu, Paul Pu Liang, Yulan He, Yifei Wang

Pith reviewed 2026-05-23 02:51 UTC · model grok-4.3

keywords: modality gap, vision-language models, modality dominance score, model editing, bias mitigation, adversarial examples, interpretability metrics, text-to-image generation

The pith

Modality gaps in vision-language models can be measured with a dominance score and leveraged via training-free editing to improve bias mitigation and generation control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that modality gaps persist after alignment and can be useful rather than purely problematic. It defines the Modality Dominance Score to sort multimodal features into vision-dominant, language-dominant, or cross-modal categories. Automatic metrics then assess these features at scale. Training-free editing based on the scores is shown to support concrete improvements on downstream tasks such as reducing gender bias in classification, producing cross-modal adversarial examples, and directing modality emphasis during text-to-image generation.

Core claim

The Modality Dominance Score attributes multimodal features to vision-dominant, language-dominant, and cross-modal classes, which supports automatic interpretability metrics and enables training-free model editing that mitigates bias in gender classification, generates cross-modal adversarial examples, and provides modality-specific control in text-to-image generation.

What carries the argument

The Modality Dominance Score (MDS), which classifies features by modality dominance to support targeted, training-free edits.
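The idea of scoring each feature by modality and then sorting it into one of three classes can be sketched as follows. This is a minimal illustration, not the paper's Eq. (4), which is not reproduced on this page: the ratio form of the score, the mean ± one standard deviation thresholds, and the function names `modality_dominance_scores` and `categorize` are all assumptions. The class names ImgD, TextD, and CrossD follow the figure captions.

```python
from statistics import mean, pstdev


def modality_dominance_scores(image_feats, text_feats, eps=1e-8):
    """Assumed score: average share of each feature's activation that
    comes from the image side, over paired image/text rows.

    image_feats, text_feats: lists of rows (one per image-text pair),
    each row a list of non-negative activations of equal length.
    """
    n_features = len(image_feats[0])
    scores = []
    for k in range(n_features):
        ratios = [
            img[k] / (img[k] + txt[k] + eps)
            for img, txt in zip(image_feats, text_feats)
        ]
        scores.append(mean(ratios))
    return scores


def categorize(scores, width=1.0):
    """Assumed rule: scores far above the mean are vision-dominant,
    far below are language-dominant, the rest are cross-modal."""
    mu, sigma = mean(scores), pstdev(scores)
    labels = []
    for s in scores:
        if s > mu + width * sigma:
            labels.append("ImgD")
        elif s < mu - width * sigma:
            labels.append("TextD")
        else:
            labels.append("CrossD")
    return labels
```

With real encoders, the rows would be last-layer image and text features for the same pairs; the threshold width is a tunable choice rather than something stated on this page.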

If this is right

Training-free editing reduces gender bias in classification outputs.
Editing produces cross-modal adversarial examples that exploit modality gaps.
Editing allows explicit control over whether vision or language dominates in text-to-image outputs.
Interpretability tools built on the same score enable systematic, task-agnostic analysis of multimodal models.
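One way to picture a training-free edit of the kind listed above is to rescale feature dimensions by their modality label, with no gradient updates. The sketch below is only an assumption about what such an edit could look like: the page does not specify the editing operation, and `edit_features`, its `scale` parameter, and the label strings are hypothetical.

```python
def edit_features(features, labels, targets, scale=0.0):
    """Rescale every dimension whose modality label is in `targets`.

    scale=0.0 suppresses those dimensions (for example, to remove
    features tied to a biased attribute); scale>1.0 amplifies them
    (for example, to push generation toward one modality).
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    return [
        scale * value if label in targets else value
        for value, label in zip(features, labels)
    ]
```

Because the edit touches only an embedding vector, it can be applied at inference time to a frozen model, which is what makes it lightweight.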

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same classification approach could be tested on other multimodal pairs such as audio-language or video-text.
If gaps prove necessary for perception, future alignment objectives might deliberately preserve rather than eliminate them.
Lightweight editing of this kind could lower the barrier to customizing deployed models for specific fairness or control goals.

Load-bearing premise

The Modality Dominance Score reliably assigns features to vision or language in a way that matches human perception of modality dominance.

What would settle it

Human raters classifying the same features by dominant modality show low agreement with the MDS labels, or the proposed editing steps produce no measurable gains on bias mitigation, adversarial example success, or modality control in generation.
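The agreement test described above needs a chance-corrected statistic, since most features are cross-modal and raw agreement would be inflated. A minimal sketch using Cohen's kappa follows; the choice of kappa and the function name `cohens_kappa` are assumptions made here, not something the paper or this page prescribes.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two label sequences over the same items,
    e.g. human modality labels versus MDS-derived labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled independently
    # with their own marginal frequencies.
    expected = sum(
        count_a[c] * count_b[c] for c in set(count_a) | set(count_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1.0 - expected)
```

A kappa near zero would indicate that MDS labels agree with human raters no better than chance; what counts as "low agreement" remains a judgment call.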

Figures

Figures reproduced from arXiv: 2502.14888 by Hanqi Yan, Jindong Gu, Lu Yin, Paul Pu Liang, Xiangxiang Cui, Yifei Wang, Yulan He.

**Figure 1.** Modality Dominance Score (MDS) distributions of three feature categories for different VLMs.

Context from the paper: … modality, and this trend is consistent across all models. DeCLIP, on the other hand, shows a more balanced and less centered distribution. This suggests that DeCLIP, through self-supervision, extracts more modality-specific features, which might be overlooked by pure vision-language contrastive models like CLIP. Th…

**Figure 2.** Monosemanticity for four VLMs. Plot labels: x-axis Models (CLIP, DeCLIP, CLIP+NCL, CLIP+SAE); Multimodal mono; Modality-specific Monosemanticity; legend entries Visual Mono (ImgD - TextD) and Textual Mono (TextD - ImgD).

**Figure 4.** Activated images and texts (in Table) by ImgD. Top image row (feature 647): patterns and textures. Bottom image row (feature 667): water and aquatic themes in blue. Texts in blue align with visual concepts.

Context from the paper: Beyond quantitative scores, we present examples to illustrate the semantic consistency of different feature types. ImgD features capture visual patterns that are hard to verbalize. We randomly select two ImgD featu…

**Figure 5.** Activated images and texts (in Table) by TextD. Top image row (feature 34): couples and individuals in red attire. Bottom image row (feature 242): diverse objects. Text in blue aligns with visual concepts.

Context from the paper: CrossD (the majority of features) capture shared semantics across modalities. Unlike the modality-specific features, TextD and ImgD, CrossD features capture common concepts that could be expressed in bo…

**Figure 6.** Activated images and texts by CrossD features. Top image row (feature 6): activities performed by individuals. Bottom image row (feature 47): scenery outside the doors. Text in blue aligns with visual concepts.

Context from the paper (Section 4, Cultivating Multimodal Monosemanticity): We demonstrate that enhancing monosemanticity significantly improves the interpretability and modality specialization of multimodal features. Beyond interpre…

**Figure 7.** Female figures ordered by their percentages of ImgD features: 0.14, 0.16, 0.18, 0.20, 0.22, 0.24, 0.26. More feminine concepts are observed to be associated with a higher share of ImgD features. To understand what feminine concepts the ImgD features represent (the studies of TextD on male concepts can be found in Appendix …

**Figure 8.** Alignment training to detoxify the adversarial sample, with only the selected target feature dimensions (in gray), i.e., ImgD, TextD and CrossD, involved in the alignment.

**Figure 9.** The reference image R is used for modality-specific control over the text-to-image generation process.

Context from the paper: Despite the impressive capabilities of text-to-image generation models [41, 12, 34], their internal mechanisms for bridging linguistic semantics and visual details remain poorly understood. A key challenge is disentangling how modality-specific features influence the fidelity and controllability of generatio…

**Figure 10.** Generated new images from the VLM with the text prompt …

**Figure 11.** The changes of active dimensions over SAE training.

Context from the paper: Non-negative Contrastive Learning (NCL). We add the NCL block, i.e., a projector, after obtaining `z_i` and `z_t` from the image encoder and text encoder. The training loss is shown in Eq. 3. Code fragment (truncated): `self.projector = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.LayerNorm(embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim), …`

**Figure 12.** The changes of active dimensions over NCL training.

Context from the paper (Section A.2, Implementation of MDS): Based on the trained CLIP, CLIP+SAE, CLIP+NCL and DeCLIP, we feed the test split of the cc3m-wds dataset to these pretrained models, around 15k image-text pairs, to calculate MDS according to Eq. (4). The features are the last-layer output from the text and image encoder. We tried to calculate the normalization of `z_i` and `z_t`, but foun…

**Figure 13.** Monosemanticity (EmbSimi and WinRate) changes as training goes on. Upper is for …

**Figure 14.** Generated new images from the VLM with the text prompt …

Original abstract

The success of vision-language models is primarily attributed to effective alignment across modalities such as vision and language. However, modality gaps persist in existing alignment algorithms and appear necessary for human perception as evidenced by modality-specific phenomena like visual texture and linguistic tone. These observations motivate us to computationally measure and leverage modality gaps to improve downstream tasks. We first introduce the Modality Dominance Score (MDS), which attributes multimodal features to specific modalities by categorizing them into three classes: vision-dominant features, language-dominant features, and cross-modal features. We then propose automatic interpretability metrics to evaluate these modality-specific features in a scalable manner. Finally, we demonstrate that the training-free model editing enhances multiple downstream tasks, including mitigating bias in gender classification, generating cross-modal adversarial examples, and enabling modality-specific control in text-to-image generation. Combined with task-agnostic interpretability tools, our work offers insights for systematic analysis and lightweight editing of multimodal models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper defines a Modality Dominance Score to split features and enable training-free edits on VLMs, but the abstract supplies no equations or results to check whether the edits actually isolate modality gaps.

The main takeaway is that this work treats modality gaps as something to measure and steer rather than just minimize. It introduces the Modality Dominance Score to label features as vision-dominant, language-dominant, or cross-modal, adds automatic interpretability metrics, and claims training-free edits improve bias mitigation, cross-modal adversarial examples, and modality control in text-to-image generation.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Modality Dominance Score (MDS) to categorize features in vision-language models into vision-dominant, language-dominant, and cross-modal classes. It proposes automatic interpretability metrics for these features and claims that training-free editing based on MDS improves bias mitigation in gender classification, generation of cross-modal adversarial examples, and modality-specific control in text-to-image generation, motivated by the idea that modality gaps are necessary for human-like perception.

Significance. If the MDS classes prove reliable and the editing interventions demonstrably leverage modality gaps rather than generic feature effects, the work could enable lightweight, training-free analysis and editing of VLMs, shifting emphasis from pure alignment to controlled use of modality gaps with task-agnostic tools.

major comments (2)

[Abstract and methods (MDS definition)] The central claim that MDS produces perceptually or functionally meaningful classes (vision-dominant, language-dominant, cross-modal) that support effective training-free editing is load-bearing for all downstream results, yet the manuscript provides no direct validation (e.g., human studies or ablation against random/feature-importance baselines) that MDS labels align with human modality attribution rather than being an artifact of the scoring heuristic.
[Experimental results] In the section on experimental results (bias mitigation, adversarial examples, T2I control), the reported gains are attributed to modality-gap leverage. Without controls showing that the same edits applied to non-MDS partitions yield no improvement, the results could be explained by standard feature masking rather than the claimed modality-specific mechanism.
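The control requested in the second major comment can be sketched as a label-shuffling baseline: apply the same edit to random partitions with the same class sizes and compare the task metric. Everything here is illustrative and assumed: the masking edit, the function names `mask`, `shuffled_labels`, and `control_gap`, and the `metric` callable that stands in for a downstream score such as a bias measure.

```python
import random
from statistics import mean


def mask(features, labels, suppress):
    """Zero every dimension whose label is in `suppress` (assumed edit)."""
    return [0.0 if lab in suppress else v for v, lab in zip(features, labels)]


def shuffled_labels(labels, rng):
    """Random partition with the same class sizes as the MDS partition."""
    out = list(labels)
    rng.shuffle(out)
    return out


def control_gap(metric, features, labels, suppress, trials=100, seed=0):
    """Metric after the MDS-guided edit minus its mean over random edits.

    A gap near zero would suggest the effect comes from generic feature
    masking rather than from the modality-specific partition.
    """
    rng = random.Random(seed)
    guided = metric(mask(features, labels, suppress))
    baseline = mean(
        [
            metric(mask(features, shuffled_labels(labels, rng), suppress))
            for _ in range(trials)
        ]
    )
    return guided - baseline
```

Shuffling labels, rather than drawing dimensions freely, keeps the number of edited dimensions fixed, so any difference cannot be attributed to how many features were masked.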

minor comments (2)

[Methods] Notation for MDS computation should be clarified with an explicit equation or pseudocode to allow reproduction.
[Interpretability metrics] The automatic interpretability metrics are mentioned but their exact formulation and correlation with human judgments are not detailed enough for independent verification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. We address each major comment below and will incorporate revisions to strengthen the validation of MDS and the experimental controls.

Referee: [Abstract and methods (MDS definition)] The central claim that MDS produces perceptually or functionally meaningful classes (vision-dominant, language-dominant, cross-modal) that support effective training-free editing is load-bearing for all downstream results, yet the manuscript provides no direct validation (e.g., human studies or ablation against random/feature-importance baselines) that MDS labels align with human modality attribution rather than being an artifact of the scoring heuristic.

Authors: We agree that additional validation would strengthen the claims. The manuscript introduces automatic interpretability metrics as a scalable means to assess modality-specific features. To directly respond to the concern, we will add ablations in the revision that compare MDS partitions against random and feature-importance baselines, demonstrating that improvements are not artifacts of the scoring method. Human studies, while valuable, introduce significant subjectivity in modality attribution and are beyond the current scope; the quantitative ablations provide a rigorous alternative. revision: yes
Referee: [Experimental results] In the section on experimental results (bias mitigation, adversarial examples, T2I control), the reported gains are attributed to modality-gap leverage. Without controls showing that the same edits applied to non-MDS partitions yield no improvement, the results could be explained by standard feature masking rather than the claimed modality-specific mechanism.

Authors: We acknowledge this is a valid concern for isolating the modality-specific mechanism. In the revised manuscript, we will add control experiments applying the same training-free edits to non-MDS feature partitions across the three tasks. We expect these to show no comparable gains, which would support that the reported improvements arise from leveraging modality gaps rather than generic feature masking effects. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces the Modality Dominance Score (MDS) as a new computational measure, proposes automatic interpretability metrics, and demonstrates training-free editing applications. No equations, fitted parameters called predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described chain. The central claims rest on newly defined quantities and empirical demonstrations rather than reducing to inputs by construction. This is the expected self-contained case.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.0 · 5710 in / 958 out tokens · 29245 ms · 2026-05-23T02:51:02.612120+00:00 · methodology

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 6 internal anchors

[1] Bhalla, U., Oesterling, A., Srinivas, S., Calmon, F. P., and Lakkaraju, H. Interpreting CLIP with sparse linear concept embeddings (SpLiCE). arXiv preprint arXiv:2402.10376, 2024.

[2] Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., and Saunders, W. Language models can explain neurons in language models. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html (date accessed: 14.05.2023), 2, 2023.

[3] Calvert, G., Spence, C., and Stein, B. E. (eds.). The Handbook of Multisensory Processes. MIT Press, 2004.

[4] Cui, X., Aparcedo, A., Jang, Y. K., and Lim, S.-N. On the robustness of large multimodal models against image adversarial attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24625–24634, 2024.

[5] Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models. International Conference on Learning Representations, 2023. doi: 10.48550/arXiv.2309.08600.

[6] DeepSeek-AI, Liu, A., Feng, B., et al. DeepSeek-V3 technical report, 2024. URL https://arxiv.org/abs/2412.19437.

[7] Elhage, N., Nanda, N., Olsson, C., and others. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/solu/index.html.

[8] Fan, L., Li, H., Zhuo, J., Zhang, Y., Chen, L., Yang, Z., Chu, C., Xie, S., Laird, A., Fox, P., Eickhoff, S., Yu, C., and Jiang, T. The human brainnetome atlas: A new brain atlas based on connectional architecture. Cerebral Cortex, 26:bhw157, 05 2016. doi: 10.1093/cercor/bhw157.

[9] Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024.

[10] Gurnee, W., Nanda, N., Pauly, M., Harvey, K., Troitskii, D., and Bertsimas, D. Finding neurons in a haystack: Case studies with sparse probing. ArXiv, abs/2305.01610, 2023. URL https://api.semanticscholar.org/CorpusID:258437237.
[11] Kim, W., Son, B., and Kim, I. ViLT: Vision-and-language transformer without convolution or region supervision. arXiv preprint arXiv:2102.03334, 2021.

[12] Koh, J. Y., Fried, D., and Salakhutdinov, R. R. Generating images with multimodal language models. Advances in Neural Information Processing Systems, 36, 2024.

[13] Li, Y., Liang, F., Zhao, L., Cui, Y., Ouyang, W., Shao, J., Yu, F., and Yan, J. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=zq1iJkNk3uN.

[14] Liang, P. P., Lyu, Y., Chhablani, G., Jain, N., Deng, Z., Wang, X., Morency, L.-P., and Salakhutdinov, R. MultiViz: Towards visualizing and understanding multimodal models. arXiv preprint arXiv:2207.00056, 2022.

[15] Liang, P. P., Cheng, Y., Salakhutdinov, R., and Morency, L.-P. Multimodal fusion interactions: A study of human and automatic quantification. In Proceedings of the 25th International Conference on Multimodal Interaction, ICMI ’23, pp. 425–435, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400700552. doi: 10.1145/3577190.3614151...

[16] Liang, P. P., Zadeh, A., and Morency, L.-P. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys, 56(10):1–42, 2024.

[17] Liang, W., Zhang, Y., Kwon, Y., Yeung, S., and Zou, J. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=S7Evzt9uit3.

[18] Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Dragan, A., Shah, R., and Nanda, N. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv preprint arXiv:2408.05147, 2024.

[19] Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning, 2023.

[20] Liu, S., Ye, H., Xing, L., and Zou, J. In-context vectors: making in context learning more effective and controllable through latent space steering. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024.
[21] Lu, J., Batra, D., Parikh, D., and Lee, S. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pp. 13–23, 2019.

[22] Lyu, Y., Liang, P. P., Deng, Z., Salakhutdinov, R., and Morency, L.-P. DIME: Fine-grained interpretations of multimodal models via disentangled local explanations. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, pp. 455–467, 2022.

[23] Makhzani, A. and Frey, B. K-sparse autoencoders. arXiv preprint arXiv:1312.5663, 2013.

[24] Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng, A. Y. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 689–696, 2011.

[25] Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., and Carter, S. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001, 2020.

[26] Paivio, A. Dual coding theory: Retrospect and current status. Canadian Journal of Psychology/Revue canadienne de psychologie, 45(3):255, 1991.

[27] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.

[28] Rawal, I., Jaiswal, S., Fernando, B., and Tan, C. Dissecting multimodality in VideoQA transformer models by impairing modality fusion. In International Conference on Machine Learning, 2023. URL https://api.semanticscholar.org/CorpusID:259165589.

[29] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, June 2022.

[30] Schrodi, S., Hoffmann, D. T., Argus, M., Fischer, V., and Brox, T. Two effects, one trigger: On the modality gap, object bias, and information imbalance in contrastive vision-language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=uAFHCZRmXk.
[31] Sharma, P., Ding, N., Goodman, S., and Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018.

[32] Shayegani, E., Dong, Y., and Abu-Ghazaleh, N. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=plmBsXHxgR.

[33] Spence, C. Crossmodal correspondences: A tutorial review. Attention, Perception, & Psychophysics, 73(4):971–995, 2011.

[34] Swamy, V., Satayeva, M., Frej, J., Bossy, T., Vogels, T., Jaggi, M., Käser, T., and Hartley, M.-A. MultiModN: multimodal, multi-task, interpretable modular networks. Advances in Neural Information Processing Systems, 36, 2024.

[35] Templeton, A. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Anthropic, 2024.

[36] Ungerleider, L. G. and Haxby, J. V. ‘What’ and ‘where’ in the human brain. Current Opinion in Neurobiology, 4(2):157–165, 1994. ISSN 0959-4388. doi: https://doi.org/10.1016/0959-4388(94)90066-3. URL https://www.sciencedirect.com/science/article/pii/0959438894900663.

[37] Wang, X., He, J., Jin, Z., Yang, M., Wang, Y., and Qu, H. M2Lens: Visualizing and explaining multimodal models for sentiment analysis. IEEE Transactions on Visualization and Computer Graphics, 28(1):802–812, 2021.

[38] Wang, Y., Zhang, Q., Guo, Y., and Wang, Y. Non-negative contrastive learning. ICLR, 2024.

[39] Yan, H., Xiang, Y., Chen, G., Wang, Y., Gui, L., and He, Y. Encourage or inhibit monosemanticity? Revisit monosemanticity from a feature decorrelation perspective. ArXiv, abs/2406.17969,

[40] URL https://api.semanticscholar.org/CorpusID:270737676

[41] Yin, Z., Ye, M., Zhang, T., Du, T., Zhu, J., Liu, H., Chen, J., Wang, T., and Ma, F. VLAttack: Multimodal adversarial attacks on vision-language tasks via pre-trained models. Advances in Neural Information Processing Systems, 36, 2024.

[42] Yu, L., Cheng, Y., Wang, Z., Kumar, V., Macherey, W., Huang, Y., Ross, D., Essa, I., Bisk, Y., Yang, M.-H., et al. SPAE: Semantic pyramid autoencoder for multimodal generation with frozen LLMs. Advances in Neural Information Processing Systems, 36, 2024.

Context from the paper (Appendix A.1, Implementation for Monosemanticity Tools): The three monosemantic tools, DeCLIP, ...

[1] [1]

P., and Lakkaraju, H

Bhalla, U., Oesterling, A., Srinivas, S., Calmon, F. P., and Lakkaraju, H. Interpreting clip with sparse linear concept embeddings (splice). arXiv preprint arXiv:2402.10376, 2024

work page arXiv 2024

[2] [2]

Language models can explain neurons in language models

Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., and Saunders, W. Language models can explain neurons in language models. URL https://openaipublic. blob. core. windows. net/neuron-explainer/paper/index. html.(Date accessed: 14.05. 2023) , 2, 2023

work page 2023

[3] [3]

Calvert, G., Spence, C., and Stein, B. E. (eds.). The Handbook of Multisensory Processes . MIT Press, 2004

work page 2004

[4] [4]

K., and Lim, S.-N

Cui, X., Aparcedo, A., Jang, Y . K., and Lim, S.-N. On the robustness of large multimodal models against image adversarial attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24625–24634, 2024

work page 2024

[5] [5]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models. International Conference on Learning Representations, 2023. doi: 10.48550/arXiv.2309.08600

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.08600 2023

[6] [6]

DeepSeek-AI, Liu, A., and et al., B. F. Deepseek-v3 technical report, 2024. URL https: //arxiv.org/abs/2412.19437

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

A mathematical framework for transformer circuits

Elhage, N., Nanda, N., Olsson, C., and Others. A mathematical framework for transformer circuits. Transformer Circuits Thread (2022). URL https://transformer-circuits.pub/ 2022/solu/index.html

work page 2022

[8] [8]

The human brainnetome atlas: A new brain atlas based on connectional architecture

Fan, L., Li, H., Zhuo, J., Zhang, Y ., Chen, L., Yang, Z., Chu, C., Xie, S., Laird, A., Fox, P., Eickhoff, S., Yu, C., and Jiang, T. The human brainnetome atlas: A new brain atlas based on connectional architecture. Cerebral Cortex, 26:bhw157, 05 2016. doi: 10.1093/cercor/bhw157

work page doi:10.1093/cercor/bhw157 2016

[9] [9]

Scaling and evaluating sparse autoencoders

Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

org/abs/2305.01610

Gurnee, W., Nanda, N., Pauly, M., Harvey, K., Troitskii, D., and Bertsimas, D. Finding neurons in a haystack: Case studies with sparse probing. ArXiv, abs/2305.01610, 2023. URL https://api.semanticscholar.org/CorpusID:258437237

work page arXiv 2023

[11] [11]

Vilt: Vision-and-language transformer without convolution or region supervision

Kim, W., Son, B., and Kim, I. Vilt: Vision-and-language transformer without convolution or region supervision. arXiv preprint arXiv:2102.03334, 2021

work page arXiv 2021

[12] [12]

Y ., Fried, D., and Salakhutdinov, R

Koh, J. Y ., Fried, D., and Salakhutdinov, R. R. Generating images with multimodal language models. Advances in Neural Information Processing Systems , 36, 2024

work page 2024

[13] [13]

Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm

Li, Y ., Liang, F., Zhao, L., Cui, Y ., Ouyang, W., Shao, J., Yu, F., and Yan, J. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. InInternational Conference on Learning Representations , 2022. URL https://openreview.net/forum? id=zq1iJkNk3uN

work page 2022

[14] [14]

P., Lyu, Y ., Chhablani, G., Jain, N., Deng, Z., Wang, X., Morency, L.-P., and Salakhutdinov, R

Liang, P. P., Lyu, Y ., Chhablani, G., Jain, N., Deng, Z., Wang, X., Morency, L.-P., and Salakhutdinov, R. Multiviz: Towards visualizing and understanding multimodal models. arXiv preprint arXiv:2207.00056, 2022. 10

[15]

Multimodal fusion interactions: A study of human and automatic quantification

Liang, P. P., Cheng, Y., Salakhutdinov, R., and Morency, L.-P. Multimodal fusion interactions: A study of human and automatic quantification. In Proceedings of the 25th International Conference on Multimodal Interaction, ICMI ’23, pp. 425–435, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400700552. doi: 10.1145/3577190.3614151...

[16]

Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

Liang, P. P., Zadeh, A., and Morency, L.-P. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys, 56(10):1–42, 2024

[17]

Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning

Liang, W., Zhang, Y., Kwon, Y., Yeung, S., and Zou, J. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=S7Evzt9uit3

[18]

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Dragan, A., Shah, R., and Nanda, N. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2. arXiv preprint arXiv:2408.05147, 2024

[19]

Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning, 2023

[20]

In-context vectors: making in context learning more effective and controllable through latent space steering

Liu, S., Ye, H., Xing, L., and Zou, J. In-context vectors: making in context learning more effective and controllable through latent space steering. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

[21]

Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks

Lu, J., Batra, D., Parikh, D., and Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pp. 13–23, 2019

[22]

Dime: Fine-grained interpretations of multimodal models via disentangled local explanations

Lyu, Y., Liang, P. P., Deng, Z., Salakhutdinov, R., and Morency, L.-P. Dime: Fine-grained interpretations of multimodal models via disentangled local explanations. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, pp. 455–467, 2022

[23]

k-Sparse Autoencoders

Makhzani, A. and Frey, B. K-sparse autoencoders. arXiv preprint arXiv:1312.5663, 2013

[24]

Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng, A. Y. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 689–696, 2011

[25]

Zoom in: An introduction to circuits

Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., and Carter, S. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001, 2020

[26]

Dual coding theory: Retrospect and current status

Paivio, A. Dual coding theory: Retrospect and current status. Canadian Journal of Psychology/Revue canadienne de psychologie, 45(3):255, 1991

[27]

Learning Transferable Visual Models From Natural Language Supervision

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021

[28]

Dissecting multimodality in videoqa transformer models by impairing modality fusion

Rawal, I., Jaiswal, S., Fernando, B., and Tan, C. Dissecting multimodality in videoqa transformer models by impairing modality fusion. In International Conference on Machine Learning, 2023. URL https://api.semanticscholar.org/CorpusID:259165589

[29]

High-resolution image synthesis with latent diffusion models

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, June 2022

[30]

Two effects, one trigger: On the modality gap, object bias, and information imbalance in contrastive vision-language models

Schrodi, S., Hoffmann, D. T., Argus, M., Fischer, V., and Brox, T. Two effects, one trigger: On the modality gap, object bias, and information imbalance in contrastive vision-language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=uAFHCZRmXk

[31]

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

Sharma, P., Ding, N., Goodman, S., and Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018.

[32]

Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models

Shayegani, E., Dong, Y., and Abu-Ghazaleh, N. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=plmBsXHxgR

[33]

Crossmodal correspondences: A tutorial review

Spence, C. Crossmodal correspondences: A tutorial review. Attention, Perception, & Psychophysics, 73(4):971–995, 2011

[34]

Multimodn—multimodal, multi-task, interpretable modular networks

Swamy, V., Satayeva, M., Frej, J., Bossy, T., Vogels, T., Jaggi, M., Käser, T., and Hartley, M.-A. Multimodn—multimodal, multi-task, interpretable modular networks. Advances in Neural Information Processing Systems, 36, 2024

[35]

Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet

Templeton, A. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Anthropic, 2024

[36]

Ungerleider, L. G. and Haxby, J. V. ‘what’ and ‘where’ in the human brain. Current Opinion in Neurobiology, 4(2):157–165, 1994. ISSN 0959-4388. doi: https://doi.org/10.1016/0959-4388(94)90066-3. URL https://www.sciencedirect.com/science/article/pii/0959438894900663

[37]

M2lens: Visualizing and explaining multimodal models for sentiment analysis

Wang, X., He, J., Jin, Z., Yang, M., Wang, Y., and Qu, H. M2lens: Visualizing and explaining multimodal models for sentiment analysis. IEEE Transactions on Visualization and Computer Graphics, 28(1):802–812, 2021

[38]

Non-negative contrastive learning

Wang, Y., Zhang, Q., Guo, Y., and Wang, Y. Non-negative contrastive learning. ICLR, 2024

[39]

Encourage or inhibit monosemanticity? revisit monosemanticity from a feature decorrelation perspective

Yan, H., Xiang, Y., Chen, G., Wang, Y., Gui, L., and He, Y. Encourage or inhibit monosemanticity? revisit monosemanticity from a feature decorrelation perspective. ArXiv, abs/2406.17969. URL https://api.semanticscholar.org/CorpusID:270737676

[41]

Vlattack: Multimodal adversarial attacks on vision-language tasks via pre-trained models

Yin, Z., Ye, M., Zhang, T., Du, T., Zhu, J., Liu, H., Chen, J., Wang, T., and Ma, F. Vlattack: Multimodal adversarial attacks on vision-language tasks via pre-trained models. Advances in Neural Information Processing Systems, 36, 2024

[42]

Spae: Semantic pyramid autoencoder for multimodal generation with frozen llms

Yu, L., Cheng, Y., Wang, Z., Kumar, V., Macherey, W., Huang, Y., Ross, D., Essa, I., Bisk, Y., Yang, M.-H., et al. Spae: Semantic pyramid autoencoder for multimodal generation with frozen llms. Advances in Neural Information Processing Systems, 36, 2024.

A Appendix

A.1 Implementation for Monosemanticity Tools

The three monosemantic tools, DeCLIP, ...
