arxiv: 2502.14888 · v4 · submitted 2025-02-16 · 💻 cs.CV · cs.AI

Beyond Cross-Modal Alignment: Measuring and Leveraging Modality Gap in Vision-Language Models

Pith reviewed 2026-05-23 02:51 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords modality gap · vision-language models · modality dominance score · model editing · bias mitigation · adversarial examples · interpretability metrics · text-to-image generation

The pith

Modality gaps in vision-language models can be measured with a dominance score and leveraged via training-free editing to improve bias mitigation and generation control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that modality gaps persist after alignment and can be useful rather than purely problematic. It defines the Modality Dominance Score to sort multimodal features into vision-dominant, language-dominant, or cross-modal categories. Automatic metrics then assess these features at scale. Training-free editing based on the scores is shown to support concrete improvements on downstream tasks such as reducing gender bias in classification, producing cross-modal adversarial examples, and directing modality emphasis during text-to-image generation.

Core claim

The Modality Dominance Score attributes multimodal features to vision-dominant, language-dominant, and cross-modal classes, which supports automatic interpretability metrics and enables training-free model editing that mitigates bias in gender classification, generates cross-modal adversarial examples, and provides modality-specific control in text-to-image generation.

What carries the argument

The Modality Dominance Score (MDS), which classifies features by modality dominance to support targeted, training-free edits.
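
This page does not reproduce the paper's MDS equation, so the following is only an illustrative sketch of how a dominance score could sort features into the three classes. The activation ratio, the thresholds `low` and `high`, and both function names are assumptions made for illustration and are not the paper's definition; the labels ImgD, TextD, and CrossD follow the paper's figure captions.

```python
from statistics import mean


def dominance_scores(img_acts, txt_acts, eps=1e-8):
    """Return one score in [0, 1] per feature (hypothetical stand-in for MDS).

    img_acts / txt_acts: lists of per-sample feature vectors of equal width.
    A score near 1 means the feature fires mostly on images, near 0 mostly on text.
    """
    n_features = len(img_acts[0])
    scores = []
    for k in range(n_features):
        img_mean = mean(abs(row[k]) for row in img_acts)
        txt_mean = mean(abs(row[k]) for row in txt_acts)
        scores.append(img_mean / (img_mean + txt_mean + eps))
    return scores


def classify(scores, low=0.35, high=0.65):
    """Map each score to a class label; the thresholds are illustrative only."""
    labels = []
    for s in scores:
        if s >= high:
            labels.append("ImgD")    # vision-dominant
        elif s <= low:
            labels.append("TextD")   # language-dominant
        else:
            labels.append("CrossD")  # cross-modal
    return labels


img = [[1.0, 0.0, 0.5], [1.0, 0.0, 0.5]]
txt = [[0.0, 1.0, 0.5], [0.0, 1.0, 0.5]]
print(classify(dominance_scores(img, txt)))
```

In this toy input the first feature fires only on images, the second only on text, and the third equally on both, so each falls into a different class.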

If this is right

  • Training-free editing reduces gender bias in classification outputs.
  • Editing produces cross-modal adversarial examples that exploit modality gaps.
  • Editing allows explicit control over whether vision or language dominates in text-to-image outputs.
  • Interpretability tools built on the same score enable systematic, task-agnostic analysis of multimodal models.
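
One simple form a training-free edit could take is to mask the embedding dimensions that belong to a chosen dominance class before a downstream head consumes them. The sketch below assumes that form; `mask_features` and the toy labels are hypothetical, and the paper's actual editing procedure may differ.

```python
def mask_features(embedding, labels, drop):
    """Zero out every dimension whose class label is in `drop`.

    embedding: list of floats, one per feature dimension.
    labels:    list of class labels ("ImgD", "TextD", "CrossD"), same length.
    drop:      set of labels to suppress. No weights are updated.
    """
    if len(embedding) != len(labels):
        raise ValueError("embedding and labels must have the same length")
    return [0.0 if lab in drop else value for value, lab in zip(embedding, labels)]


embedding = [0.2, 0.7, 0.1]
labels = ["ImgD", "TextD", "CrossD"]
print(mask_features(embedding, labels, drop={"ImgD"}))
```

Because the edit only rewrites activations at inference time, it can be applied or reverted without retraining.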

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same classification approach could be tested on other multimodal pairs such as audio-language or video-text.
  • If gaps prove necessary for perception, future alignment objectives might deliberately preserve rather than eliminate them.
  • Lightweight editing of this kind could lower the barrier to customizing deployed models for specific fairness or control goals.

Load-bearing premise

The Modality Dominance Score reliably assigns features to vision or language in a way that matches human perception of modality dominance.

What would settle it

Human raters classifying the same features by dominant modality show low agreement with the MDS labels, or the proposed editing steps produce no measurable gains on bias mitigation, adversarial example success, or modality control in generation.
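
Agreement between human labels and MDS labels could be quantified with a chance-corrected statistic such as Cohen's kappa. The sketch below is one way to compute it, assuming each feature receives a single human label and a single MDS label; the paper reports no such study, and the example labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    # Agreement expected by chance from the two marginal label distributions.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


human = ["ImgD", "ImgD", "TextD", "TextD"]   # hypothetical human ratings
mds = ["ImgD", "TextD", "ImgD", "TextD"]     # hypothetical MDS labels
print(cohens_kappa(human, mds))
```

A kappa near zero, as in this toy case, would indicate agreement no better than chance, which is the outcome that would undercut the load-bearing premise.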

Figures

Figures reproduced from arXiv: 2502.14888 by Hanqi Yan, Jindong Gu, Lu Yin, Paul Pu Liang, Xiangxiang Cui, Yifei Wang, Yulan He.

Figure 1
Figure 1: Modality Dominance Score (MDS) distributions of three feature categories for different VLMs. … modality, and this trend is consistent across all models. DeCLIP, on the other hand, shows a more balanced and less centered distribution. This suggests that DeCLIP, through self-supervision, extracts more modality-specific features, which might be overlooked by pure vision-language contrastive models like CLIP. Th…
Figure 2
Figure 2: Monosemanticity for four VLMs. Plot labels: Modality-specific Monosemanticity; Models (CLIP, DeCLIP, CLIP+NCL, CLIP+SAE); Multimodal mono; Visual Mono (ImgD - TextD); Textual Mono (TextD - ImgD).
Figure 4
Figure 4: Activated images and texts (in Table) by ImgD. Top image row (feature 647): patterns and textures. Bottom image (feature 667): water and aquatic themes in blue. Texts in blue align with visual concepts. Beyond quantitative scores, we present examples to illustrate the semantic consistency of different feature types. ImgD capture visual patterns that are hard to verbalize. We randomly select two ImgD featu…
Figure 5
Figure 5: Activated images and texts (in Table) by TextD. Top image row (feature 34): couples and individuals in red attire. Bottom image row (feature 242): diverse objects. Text in blue aligns with visual concepts. CrossD (the majority features) capture shared semantics across modalities. Different from modality-specific features, TextD and ImgD, CrossD features capture common concepts that could be expressed in bo…
Figure 6
Figure 6: Activated images and texts by CrossD features. Top image row (feature 6): activities performed by individuals. Bottom image row (feature 47): scenery outside the doors. Text in blue aligns with visual concepts. 4 Cultivating Multimodal Monosemanticity We demonstrate that enhancing monosemanticity significantly improves the interpretability and modality specialization of multimodal features. Beyond interpre…
Figure 7
Figure 7: Female figures ordered by their percentages of ImgD features: 0.14, 0.16, 0.18, 0.20, 0.22, 0.24, 0.26. More feminine concepts are observed to be related with more ImgD. To understand what feminine concepts the ImgD represent (the studies of TextD on male concepts can be found in Appendix…
Figure 8
Figure 8: Alignment training to de-toxicity of the adversarial sample, with only selected target feature dimensions (in gray), i.e., ImgD, TextD and CrossD, involved in the alignment
Figure 9
Figure 9: The reference image R is used for modality-specific control over text-to-image generation process. Despite the impressive capabilities of text-to-image generation models [41, 12, 34], their internal mechanisms for bridging linguistic semantics and visual details remain poorly understood. A key challenge is disentangling how modality-specific features influence the fidelity and controllability of generatio…
Figure 10
Figure 10: Generated new images from the VLM with the text prompt
Figure 11
Figure 11: The changes of active dimensions over SAE training. Non-negative Contrastive Learning (NCL). We add the NCL block, i.e., projector after obtaining zi and zt from image encoder and text encoder. The training loss is shown in Eq. 3.
self.projector = nn.Sequential(
    nn.Linear(embed_dim, embed_dim),
    nn.LayerNorm(embed_dim),
    nn.ReLU(),
    nn.Linear(embed_dim, embed_dim),
    ht…
Figure 12
Figure 12: The changes of active dimensions over NCL training. A.2 Implementation of MDS Based on the trained CLIP, CLIP+SAE, CLIP+NCL and DeCLIP, we feed the test split of cc3m-wds dataset to these pretrained models, around 15k image-text pairs to calculate MDS, according to Eq.(4). The features are the last-layer output from the text and image encoder. We tried to calculate the normalization of zi and zt, but foun…
Figure 13
Figure 13: Monosemanticity (EmbSimi and WinRate) changes as training goes on. Upper is for
Figure 14
Figure 14: Generated new images from the VLM with the text prompt
Original abstract

The success of vision-language models is primarily attributed to effective alignment across modalities such as vision and language. However, modality gaps persist in existing alignment algorithms and appear necessary for human perception as evidenced by modality-specific phenomena like visual texture and linguistic tone. These observations motivate us to computationally measure and leverage modality gaps to improve downstream tasks. We first introduce the Modality Dominance Score (MDS), which attributes multimodal features to specific modalities by categorizing them into three classes: vision-dominant features, language-dominant features, and cross-modal features. We then propose automatic interpretability metrics to evaluate these modality-specific features in a scalable manner. Finally, we demonstrate that the training-free model editing enhances multiple downstream tasks, including mitigating bias in gender classification, generating cross-modal adversarial examples, and enabling modality-specific control in text-to-image generation. Combined with task-agnostic interpretability tools, our work offers insights for systematic analysis and lightweight editing of multimodal models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Modality Dominance Score (MDS) to categorize features in vision-language models into vision-dominant, language-dominant, and cross-modal classes. It proposes automatic interpretability metrics for these features and claims that training-free editing based on MDS improves bias mitigation in gender classification, generation of cross-modal adversarial examples, and modality-specific control in text-to-image generation, motivated by the idea that modality gaps are necessary for human-like perception.

Significance. If the MDS classes prove reliable and the editing interventions demonstrably leverage modality gaps rather than generic feature effects, the work could enable lightweight, training-free analysis and editing of VLMs, shifting emphasis from pure alignment to controlled use of modality gaps with task-agnostic tools.

major comments (2)
  1. [Abstract and methods (MDS definition)] The central claim that MDS produces perceptually or functionally meaningful classes (vision-dominant, language-dominant, cross-modal) that support effective training-free editing is load-bearing for all downstream results, yet the manuscript provides no direct validation (e.g., human studies or ablation against random/feature-importance baselines) that MDS labels align with human modality attribution rather than being an artifact of the scoring heuristic.
  2. [Experimental results] In the experimental results (bias mitigation, adversarial examples, T2I control), the reported gains are attributed to leveraging the modality gap, but there are no controls showing that the same edits applied to non-MDS partitions yield no improvement. Without them, the results could be explained by standard feature masking rather than the claimed modality-specific mechanism.
minor comments (2)
  1. [Methods] Notation for MDS computation should be clarified with an explicit equation or pseudocode to allow reproduction.
  2. [Interpretability metrics] The automatic interpretability metrics are mentioned but their exact formulation and correlation with human judgments are not detailed enough for independent verification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. We address each major comment below and will incorporate revisions to strengthen the validation of MDS and the experimental controls.

Point-by-point responses
  1. Referee: [Abstract and methods (MDS definition)] The central claim that MDS produces perceptually or functionally meaningful classes (vision-dominant, language-dominant, cross-modal) that support effective training-free editing is load-bearing for all downstream results, yet the manuscript provides no direct validation (e.g., human studies or ablation against random/feature-importance baselines) that MDS labels align with human modality attribution rather than being an artifact of the scoring heuristic.

    Authors: We agree that additional validation would strengthen the claims. The manuscript introduces automatic interpretability metrics as a scalable means to assess modality-specific features. To directly respond to the concern, we will add ablations in the revision that compare MDS partitions against random and feature-importance baselines, demonstrating that improvements are not artifacts of the scoring method. Human studies, while valuable, introduce significant subjectivity in modality attribution and are beyond the current scope; the quantitative ablations provide a rigorous alternative. revision: yes

  2. Referee: [Experimental results] In the experimental results (bias mitigation, adversarial examples, T2I control), the reported gains are attributed to leveraging the modality gap, but there are no controls showing that the same edits applied to non-MDS partitions yield no improvement. Without them, the results could be explained by standard feature masking rather than the claimed modality-specific mechanism.

    Authors: We acknowledge this is a valid concern for isolating the modality-specific mechanism. In the revised manuscript, we will add control experiments applying the same training-free edits to non-MDS feature partitions across the three tasks. We expect these to show no comparable gains, which would support that the reported improvements arise from leveraging modality gaps rather than generic feature masking effects. revision: yes
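
A control of the kind discussed here could draw a random set of feature dimensions with the same size as the MDS-selected partition and apply the identical edit to it. The sketch below shows only that sampling step; `random_control_partition` is a hypothetical helper, and neither the referee nor the authors specify how the control partitions would be constructed.

```python
import random


def random_control_partition(labels, target, seed=0):
    """Sample as many random dimension indices as `target` has in `labels`.

    Matching the partition size keeps the comparison fair: any gain that
    survives on the random set cannot be credited to the MDS labels.
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    size = sum(1 for lab in labels if lab == target)
    return sorted(rng.sample(range(len(labels)), size))


labels = ["ImgD", "CrossD", "CrossD", "TextD", "ImgD", "CrossD"]
mds_partition = [i for i, lab in enumerate(labels) if lab == "ImgD"]
control = random_control_partition(labels, "ImgD", seed=1)
print(mds_partition, control)
```

Repeating the draw over several seeds would give a distribution of control outcomes against which the MDS-based edit can be compared.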

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper introduces the Modality Dominance Score (MDS) as a new computational measure, proposes automatic interpretability metrics, and demonstrates training-free editing applications. No equations, fitted parameters called predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described chain. The central claims rest on newly defined quantities and empirical demonstrations rather than reducing to inputs by construction. This is the expected self-contained case.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities can be identified.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 6 internal anchors

  1. [1]

    Bhalla, U., Oesterling, A., Srinivas, S., Calmon, F. P., and Lakkaraju, H. Interpreting clip with sparse linear concept embeddings (splice). arXiv preprint arXiv:2402.10376, 2024

  2. [2]

    Language models can explain neurons in language models

    Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., and Saunders, W. Language models can explain neurons in language models. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html (Date accessed: 14.05.2023), 2, 2023

  3. [3]

    Calvert, G., Spence, C., and Stein, B. E. (eds.). The Handbook of Multisensory Processes. MIT Press, 2004

  4. [4]

    Cui, X., Aparcedo, A., Jang, Y. K., and Lim, S.-N. On the robustness of large multimodal models against image adversarial attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24625–24634, 2024

  5. [5]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models. International Conference on Learning Representations, 2023. doi: 10.48550/arXiv.2309.08600

  6. [6]

    DeepSeek-AI, Liu, A., and et al., B. F. Deepseek-v3 technical report, 2024. URL https://arxiv.org/abs/2412.19437

  7. [7]

    A mathematical framework for transformer circuits

    Elhage, N., Nanda, N., Olsson, C., and Others. A mathematical framework for transformer circuits. Transformer Circuits Thread (2022). URL https://transformer-circuits.pub/2022/solu/index.html

  8. [8]

    The human brainnetome atlas: A new brain atlas based on connectional architecture

    Fan, L., Li, H., Zhuo, J., Zhang, Y., Chen, L., Yang, Z., Chu, C., Xie, S., Laird, A., Fox, P., Eickhoff, S., Yu, C., and Jiang, T. The human brainnetome atlas: A new brain atlas based on connectional architecture. Cerebral Cortex, 26:bhw157, 05 2016. doi: 10.1093/cercor/bhw157

  9. [9]

    Scaling and evaluating sparse autoencoders

    Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024

  10. [10]

    Gurnee, W., Nanda, N., Pauly, M., Harvey, K., Troitskii, D., and Bertsimas, D. Finding neurons in a haystack: Case studies with sparse probing. ArXiv, abs/2305.01610, 2023. URL https://api.semanticscholar.org/CorpusID:258437237

  11. [11]

    Vilt: Vision-and-language transformer without convolution or region supervision

    Kim, W., Son, B., and Kim, I. Vilt: Vision-and-language transformer without convolution or region supervision. arXiv preprint arXiv:2102.03334, 2021

  12. [12]

    Koh, J. Y., Fried, D., and Salakhutdinov, R. R. Generating images with multimodal language models. Advances in Neural Information Processing Systems, 36, 2024

  13. [13]

    Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm

    Li, Y., Liang, F., Zhao, L., Cui, Y., Ouyang, W., Shao, J., Yu, F., and Yan, J. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=zq1iJkNk3uN

  14. [14]

    Liang, P. P., Lyu, Y., Chhablani, G., Jain, N., Deng, Z., Wang, X., Morency, L.-P., and Salakhutdinov, R. Multiviz: Towards visualizing and understanding multimodal models. arXiv preprint arXiv:2207.00056, 2022

  15. [15]

    Liang, P. P., Cheng, Y., Salakhutdinov, R., and Morency, L.-P. Multimodal fusion interactions: A study of human and automatic quantification. In Proceedings of the 25th International Conference on Multimodal Interaction, ICMI ’23, pp. 425–435, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400700552. doi: 10.1145/3577190.3614151...

  16. [16]

    Liang, P. P., Zadeh, A., and Morency, L.-P. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys, 56(10):1–42, 2024

  17. [17]

    Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning

    Liang, W., Zhang, Y., Kwon, Y., Yeung, S., and Zou, J. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=S7Evzt9uit3

  18. [18]

    Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

    Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Dragan, A., Shah, R., and Nanda, N. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2. arXiv preprint arXiv:2408.05147, 2024

  19. [19]

    Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning, 2023

  20. [20]

    In-context vectors: making in context learning more effective and controllable through latent space steering

    Liu, S., Ye, H., Xing, L., and Zou, J. In-context vectors: making in context learning more effective and controllable through latent space steering. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

  21. [21]

    Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks

    Lu, J., Batra, D., Parikh, D., and Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pp. 13–23, 2019

  22. [22]

    Lyu, Y., Liang, P. P., Deng, Z., Salakhutdinov, R., and Morency, L.-P. Dime: Fine-grained interpretations of multimodal models via disentangled local explanations. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, pp. 455–467, 2022

  23. [23]

    k-Sparse Autoencoders

    Makhzani, A. and Frey, B. K-sparse autoencoders. arXiv preprint arXiv:1312.5663, 2013

  24. [24]

    Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng, A. Y. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 689–696, 2011

  25. [25]

    Zoom in: An introduction to circuits

    Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., and Carter, S. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001, 2020

  26. [26]

    Dual coding theory: Retrospect and current status

    Paivio, A. Dual coding theory: Retrospect and current status. Canadian Journal of Psychology/Revue canadienne de psychologie, 45(3):255, 1991

  27. [27]

    Learning Transferable Visual Models From Natural Language Supervision

    Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021

  28. [28]

    Dissecting multimodality in videoqa transformer models by impairing modality fusion

    Rawal, I., Jaiswal, S., Fernando, B., and Tan, C. Dissecting multimodality in videoqa transformer models by impairing modality fusion. In International Conference on Machine Learning, 2023. URL https://api.semanticscholar.org/CorpusID:259165589

  29. [29]

    High-resolution image synthesis with latent diffusion models

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, June 2022

  30. [30]

    Schrodi, S., Hoffmann, D. T., Argus, M., Fischer, V., and Brox, T. Two effects, one trigger: On the modality gap, object bias, and information imbalance in contrastive vision-language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=uAFHCZRmXk

  31. [31]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Sharma, P., Ding, N., Goodman, S., and Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018

  32. [32]

    Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models

    Shayegani, E., Dong, Y., and Abu-Ghazaleh, N. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=plmBsXHxgR

  33. [33]

    Crossmodal correspondences: A tutorial review

    Spence, C. Crossmodal correspondences: A tutorial review. Attention, Perception, & Psychophysics, 73(4):971–995, 2011

  34. [34]

    Multimodn—multimodal, multi-task, interpretable modular networks

    Swamy, V., Satayeva, M., Frej, J., Bossy, T., Vogels, T., Jaggi, M., Käser, T., and Hartley, M.-A. Multimodn—multimodal, multi-task, interpretable modular networks. Advances in Neural Information Processing Systems, 36, 2024

  35. [35]

    Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet

    Templeton, A. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Anthropic, 2024

  36. [36]

    Ungerleider, L. G. and Haxby, J. V. ‘what’ and ‘where’ in the human brain. Current Opinion in Neurobiology, 4(2):157–165, 1994. ISSN 0959-4388. doi: https://doi.org/10.1016/0959-4388(94)90066-3. URL https://www.sciencedirect.com/science/article/pii/0959438894900663

  37. [37]

    M2lens: Visualizing and explaining multimodal models for sentiment analysis

    Wang, X., He, J., Jin, Z., Yang, M., Wang, Y., and Qu, H. M2lens: Visualizing and explaining multimodal models for sentiment analysis. IEEE Transactions on Visualization and Computer Graphics, 28(1):802–812, 2021

  38. [38]

    Non-negative contrastive learning

    Wang, Y., Zhang, Q., Guo, Y., and Wang, Y. Non-negative contrastive learning. ICLR, 2024

  39. [39]

    Encourage or inhibit monosemanticity? revisit monosemanticity from a feature decorrelation perspective

    Yan, H., Xiang, Y., Chen, G., Wang, Y., Gui, L., and He, Y. Encourage or inhibit monosemanticity? revisit monosemanticity from a feature decorrelation perspective. ArXiv, abs/2406.17969,

  40. [40]

    URL https://api.semanticscholar.org/CorpusID:270737676

  41. [41]

    Vlattack: Multimodal adversarial attacks on vision-language tasks via pre-trained models

    Yin, Z., Ye, M., Zhang, T., Du, T., Zhu, J., Liu, H., Chen, J., Wang, T., and Ma, F. Vlattack: Multimodal adversarial attacks on vision-language tasks via pre-trained models. Advances in Neural Information Processing Systems, 36, 2024

  42. [42]

    Yu, L., Cheng, Y., Wang, Z., Kumar, V., Macherey, W., Huang, Y., Ross, D., Essa, I., Bisk, Y., Yang, M.-H., et al. Spae: Semantic pyramid autoencoder for multimodal generation with frozen llms. Advances in Neural Information Processing Systems, 36, 2024.

A Appendix

A.1 Implementation for Monosemanticity Tools

The three monosemantic tools, DeCLIP, ...