Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
Pith reviewed 2026-05-17 00:28 UTC · model grok-4.3
The pith
De-attending function words in vision-language models reduces cross-modal adversarial vulnerability with almost no performance cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Function words make VLMs vulnerable to cross-modal adversarial attacks. FDA computes both the original cross-attention and the function-word cross-attention within each attention head, then subtracts the latter from the former, yielding better-aligned and more robust VLMs.
What carries the argument
Function-word De-Attention (FDA), which computes and subtracts function-word cross-attention from the original attention to reduce the impact of function words.
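The subtraction can be sketched in a few lines. This is a minimal illustration, not the released implementation: it assumes one head's cross-attention map is given as rows indexed by text token and columns indexed by image patch, and that the function-word map is obtained by keeping only the rows of function-word tokens. The names `de_attend` and `fw_mask`, and the row-masking reading itself, are assumptions made for illustration; this page does not specify how the function-word map is computed.

```python
from typing import List

Matrix = List[List[float]]


def de_attend(attn: Matrix, fw_mask: List[bool]) -> Matrix:
    """Subtract function-word cross-attention from the original map.

    attn:    one head's cross-attention, attn[i][j] = weight from text
             token i to image patch j (this layout is an assumption).
    fw_mask: fw_mask[i] is True when text token i is a function word.
    """
    if len(attn) != len(fw_mask):
        raise ValueError("need one mask entry per text token")

    # Assumed function-word map: original rows for function-word tokens,
    # zeros everywhere else.
    fw_attn = [
        list(row) if is_fw else [0.0] * len(row)
        for row, is_fw in zip(attn, fw_mask)
    ]

    # Elementwise difference: original map minus function-word map.
    return [
        [a - f for a, f in zip(row, fw_row)]
        for row, fw_row in zip(attn, fw_attn)
    ]


if __name__ == "__main__":
    # Toy 2-token, 2-patch map; token 0 plays the role of a function word.
    attn = [[0.7, 0.3], [0.4, 0.6]]
    print(de_attend(attn, [True, False]))
```

Under this reading the function-word rows are cancelled entirely and content-word rows are left untouched; a different construction of the function-word map would give a different result.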
If this is right
- Produces average attack success rate (ASR) drops of 18%, 13%, and 53% on retrieval across the three tested models, with clean-performance drops of only 0.2%, 0.3%, and 0.6%.
- Delivers a 90% ASR drop on visual grounding with a 0.3% performance gain.
- Maintains scalability, generalization across attacks, and zero-shot performance on downstream tasks.
Where Pith is reading between the lines
- If function words are the main carriers of adversarial signals, similar de-attention strategies could improve robustness in other multimodal or language models.
- Testing FDA on additional datasets or newer VLM architectures would reveal how broadly the vulnerability pattern holds.
Load-bearing premise
That function words are the primary source of cross-modal adversarial vulnerability in VLMs and that removing attention to them does not remove information essential for the model's normal task performance.
What would settle it
An experiment showing that FDA fails to reduce attack success rate on a new set of cross-modal attacks, or that clean-task performance drops substantially when function-word attention is subtracted.
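One way to probe the second condition is a paired comparison across random seeds, checking whether the clean-performance difference between the baseline and FDA stands out from run-to-run noise. The sketch below computes only the paired t statistic using the Python standard library; the function name and the numbers in the demo are placeholders, not results from the paper.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(baseline: Sequence[float], treated: Sequence[float]) -> float:
    """Paired t statistic for per-seed scores of two methods.

    Each position must hold the two scores obtained with the same seed.
    """
    if len(baseline) != len(treated):
        raise ValueError("scores must be paired one-to-one")
    if len(baseline) < 2:
        raise ValueError("need at least two paired runs")

    diffs = [b - t for b, t in zip(baseline, treated)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0.0:
        raise ValueError("all differences are identical; t is undefined")

    # t = mean difference divided by its standard error.
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Placeholder scores for three seeds, made up for illustration only.
    baseline_scores = [2.0, 4.0, 6.0]
    fda_scores = [1.0, 2.0, 3.0]
    print(paired_t_statistic(baseline_scores, fda_scores))
```

The statistic is compared against a t distribution with n - 1 degrees of freedom. With only a handful of seeds, a small |t| means the difference cannot be distinguished from noise, not that it is zero.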
Original abstract
To address the trade-off between robustness and performance for robust VLM, we observe that function words could incur vulnerability of VLMs against cross-modal adversarial attacks, and propose Function-word De-Attention (FDA) accordingly to mitigate the impact of function words. Similar to differential amplifiers, our FDA calculates the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former for more aligned and robust VLMs. Comprehensive experiments include 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, our FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We demonstrate the scalability, generalization, and zero-shot performance of FDA experimentally, as well as in-depth ablation studies and analysis. Code is available at https://github.com/michaeltian108/FDA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that function words in text inputs contribute to the vulnerability of vision-language models (VLMs) to cross-modal adversarial attacks. It proposes Function-word De-Attention (FDA), a training-free method that computes original cross-attention and function-word-specific cross-attention within heads and differentially subtracts the latter (analogous to a differential amplifier) to produce more robust and aligned representations. Experiments across three models, two tasks (image-text retrieval and visual grounding), three datasets, six attacks, and two SOTA baselines report average attack success rate (ASR) reductions of 18/13/53% on retrieval with clean-performance drops of only 0.2/0.3/0.6%, plus a 90% ASR drop and 0.3% clean gain on grounding; additional results address scalability, generalization, zero-shot transfer, and ablations. Code is released.
Significance. If the central result holds, the work supplies a lightweight, parameter-light intervention that improves adversarial robustness of VLMs without retraining or substantial clean-data cost, highlighting a previously under-examined role of function words in cross-modal vulnerabilities. The multi-model, multi-attack, multi-task evaluation and public code are positive factors for reproducibility and practical impact.
major comments (3)
- [Experiments / Results] The headline claim of 'free' robustness rests on clean-performance deltas of 0.2/0.3/0.6% and corresponding ASR reductions being both real and negligible. These are reported as single point estimates with no error bars, no standard deviations across seeds, and no statistical tests (e.g., paired t-tests or Wilcoxon tests). Without such quantification it is impossible to determine whether the observed changes lie within run-to-run variance or simply reflect a uniform reduction in attention magnitude.
- [Method] Method / FDA definition: The differential subtraction step is described at a high level ('calculates the original and the function-word cross-attention ... and differentially subtracts') but lacks an explicit equation, scaling factor, or threshold for isolating function-word attention. The abstract and method description therefore leave open whether the operation is strictly parameter-free or contains hidden choices that could affect the reported gains.
- [Method] Identification of function words: The paper does not specify the procedure used to label function words (POS tagger, fixed list, frequency threshold, etc.). Because the entire FDA mechanism depends on this partitioning, the lack of a reproducible definition is load-bearing for both the mechanistic claim and the experimental results.
minor comments (2)
- [Abstract] The reported percentages (18/13/53%, 0.2/0.3/0.6%) are presented without any indication of variance or number of runs; adding a brief qualifier would improve clarity.
- [Introduction] The manuscript would benefit from a short related-work paragraph contrasting FDA with prior attention-modification or adversarial-defense techniques in VLMs.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to enhance statistical rigor, methodological transparency, and reproducibility.
- Referee: [Experiments / Results] The headline claim of 'free' robustness rests on clean-performance deltas of 0.2/0.3/0.6% and corresponding ASR reductions being both real and negligible. These are reported as single point estimates with no error bars, no standard deviations across seeds, and no statistical tests (e.g., paired t-tests or Wilcoxon tests). Without such quantification it is impossible to determine whether the observed changes lie within run-to-run variance or simply reflect a uniform reduction in attention magnitude.
  Authors: We agree that single-point estimates limit assessment of variability and statistical significance. In the revised manuscript, we will report results averaged over multiple random seeds (at least 5 runs per setting) with standard deviations. We will also add paired t-tests comparing FDA against the baseline to confirm that clean-performance changes are statistically insignificant while ASR reductions are significant. These additions will directly support the 'free' robustness claim across the reported models and tasks.
  Revision: yes
- Referee: [Method] Method / FDA definition: The differential subtraction step is described at a high level ('calculates the original and the function-word cross-attention ... and differentially subtracts') but lacks an explicit equation, scaling factor, or threshold for isolating function-word attention. The abstract and method description therefore leave open whether the operation is strictly parameter-free or contains hidden choices that could affect the reported gains.
  Authors: We thank the referee for highlighting this clarity issue. We will add a formal equation in the Method section: let A_h denote the original cross-attention map in head h and A_fw,h the corresponding map computed only over function-word tokens; the FDA output is then A'_h = A_h - A_fw,h (i.e., direct subtraction with scaling factor 1 and no threshold). This formulation is strictly parameter-free, as confirmed by our released code, and we will explicitly state that no additional hyperparameters are introduced.
  Revision: yes
- Referee: [Method] Identification of function words: The paper does not specify the procedure used to label function words (POS tagger, fixed list, frequency threshold, etc.). Because the entire FDA mechanism depends on this partitioning, the lack of a reproducible definition is load-bearing for both the mechanistic claim and the experimental results.
  Authors: We acknowledge that an explicit definition is necessary for reproducibility. In the revised manuscript we will describe the procedure in detail: function words are identified via a hybrid approach that combines a fixed linguistic list (determiners, prepositions, conjunctions, pronouns, and auxiliary verbs) with NLTK POS tagging, marking tokens as function words when they belong to closed-class categories. The exact list and tagging script will be included in the supplementary material and the public code repository.
  Revision: yes
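The list-based half of this procedure can be sketched as follows. The word list here is a tiny hypothetical sample, not the authors' list, and the NLTK POS-tagging step is deliberately left out so the example runs with the standard library alone; `FUNCTION_WORDS` and `function_word_mask` are illustrative names.

```python
from typing import List

# Hypothetical sample of closed-class words; the authors' actual list is
# not reproduced on this page.
FUNCTION_WORDS = frozenset({
    "a", "an", "the",              # determiners
    "in", "on", "of", "with",      # prepositions
    "and", "or", "but",            # conjunctions
    "it", "he", "she", "they",     # pronouns
    "is", "are", "was", "has",     # auxiliary verbs
})


def function_word_mask(tokens: List[str]) -> List[bool]:
    """Return True for each token found in the fixed function-word list."""
    return [token.lower() in FUNCTION_WORDS for token in tokens]


if __name__ == "__main__":
    caption = "A dog on the grass".split()
    print(list(zip(caption, function_word_mask(caption))))
```

A fixed list cannot tell an auxiliary "has" from a main-verb "has", which is presumably where the POS tagger in the hybrid approach comes in.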
Circularity Check
No significant circularity; empirical method validated externally
The paper introduces FDA as an explicit algorithmic modification to attention maps in VLMs: it computes and subtracts function-word cross-attention from the original cross-attention within heads. This design choice is then evaluated on external adversarial attack benchmarks, retrieval and grounding tasks, and multiple models/datasets. No derivation chain exists in which a 'prediction' or claimed result reduces by construction to quantities defined inside the same equations or to self-citations. The reported ASR drops and performance deltas are measured quantities on held-out data, not tautological outputs of fitted parameters or renamed inputs. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- subtraction scaling factor or threshold for function-word attention
axioms (1)
- domain assumption: Function words disproportionately contribute to cross-modal adversarial vulnerability in attention mechanisms of VLMs.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: FDA calculates the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former (Eq. 6)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.