Pay Less Attention to Function Words for Free Robustness of Vision-Language Models

Chao Shen; Chenhao Lin; Qiwei Tian; Zhengyu Zhao

arXiv: 2512.07222 · v4 · submitted 2025-12-08 · cs.LG · cs.CL

Pith reviewed 2026-05-17 00:28 UTC · model grok-4.3

keywords: vision-language models, adversarial robustness, function words, attention mechanism, cross-modal attacks, robustness without retraining

The pith

De-attending function words in vision-language models reduces cross-modal adversarial vulnerability with almost no performance cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that function words like 'the' and 'of' make vision-language models vulnerable to cross-modal adversarial attacks. To fix this, it introduces Function-word De-Attention (FDA), which subtracts the attention given to these words from the model's normal attention calculations, similar to how a differential amplifier works. Experiments across multiple models, tasks, and attacks show large drops in attack success rates while performance on clean data stays nearly the same or even improves slightly. A sympathetic reader would care because this offers a simple way to gain robustness without the usual trade-off of retraining or adding complex defenses.

Core claim

Function words make VLMs vulnerable to cross-modal adversarial attacks. FDA computes both the original cross-attention and the function-word cross-attention within each attention head, then differentially subtracts the latter from the former, yielding more aligned and robust VLMs.

What carries the argument

Function-word De-Attention (FDA), which computes and subtracts function-word cross-attention from the original attention to reduce the impact of function words.
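
The subtraction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: text tokens act as queries over image patches, the function-word attention map has the same shape as the original map with zero rows for content words, and `gate` is a hypothetical scalar stand-in for the control gate G mentioned in the Figure 2 caption. All function names and the toy features are invented for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_feats, image_feats):
    """Row i holds the attention of text token i over the image patches."""
    scale = math.sqrt(len(image_feats[0]))
    return [
        softmax([sum(t * v for t, v in zip(token, patch)) / scale
                 for patch in image_feats])
        for token in text_feats
    ]


def function_word_de_attention(text_feats, image_feats, is_function_word, gate=1.0):
    """Subtract gated function-word attention from the original attention.

    Assumption: the function-word map is zero for content-word rows.
    """
    original = cross_attention(text_feats, image_feats)
    # Parallel pipeline: attention computed from the function-word tokens only.
    fw_feats = [tok for tok, flag in zip(text_feats, is_function_word) if flag]
    fw_rows = iter(cross_attention(fw_feats, image_feats))
    zero_row = [0.0] * len(image_feats)
    result = []
    for row, flag in zip(original, is_function_word):
        fw_row = next(fw_rows) if flag else zero_row
        result.append([a - gate * b for a, b in zip(row, fw_row)])
    return result


# Toy features (hypothetical): three text tokens, the middle one a function word.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
flags = [False, True, False]
de_attended = function_word_de_attention(text, image, flags)
```

Under these assumptions, a gate of 1 removes the function-word row entirely while content-word rows keep their original attention, and a gate of 0 returns the unmodified attention map. The paper's actual formulation may differ, for example in how the gate is computed or applied per head.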

If this is right

Produces average attack success rate (ASR) drops of 18%, 13%, and 53% on retrieval across the three tested models, with clean-performance drops of only 0.2%, 0.3%, and 0.6% respectively.
Delivers a 90% ASR drop on visual grounding with a 0.3% performance gain.
Maintains scalability, generalization across attacks, and zero-shot performance on downstream tasks.
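
The summary above does not say whether the quoted ASR drops are absolute (percentage points) or relative to the undefended ASR. The sketch below shows both readings so the difference is concrete. It assumes the common definition of ASR as the fraction of attacked samples on which the attack succeeds; the function names are hypothetical and the numbers in the usage lines are placeholders, not results from the paper.

```python
def attack_success_rate(succeeded):
    """Fraction of attacked samples on which the attack succeeded."""
    return sum(1 for s in succeeded if s) / len(succeeded)


def asr_drop(baseline_asr, defended_asr):
    """Return the (absolute, relative) reduction in ASR."""
    absolute = baseline_asr - defended_asr
    relative = absolute / baseline_asr if baseline_asr else 0.0
    return absolute, relative


# Placeholder outcomes only: 3 of 4 attacks succeed against the undefended model.
rate = attack_success_rate([True, True, False, True])
# Placeholder ASRs only: 0.8 undefended versus 0.4 defended.
absolute_drop, relative_drop = asr_drop(0.8, 0.4)
```

With these placeholder values the same defense reads as a 40-point absolute drop or a 50% relative drop, which is why stating the convention matters when comparing the headline numbers.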

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If function words are the main carriers of adversarial signals, similar de-attention strategies could improve robustness in other multimodal or language models.
Testing FDA on additional datasets or newer VLM architectures would reveal how broadly the vulnerability pattern holds.

Load-bearing premise

That function words are the primary source of cross-modal adversarial vulnerability in VLMs and that removing attention to them does not remove information essential for the model's normal task performance.

What would settle it

An experiment showing that FDA fails to reduce attack success rate on a new set of cross-modal attacks, or that clean-task performance drops substantially when function-word attention is subtracted.

Figures

Figures reproduced from arXiv: 2512.07222 by Chao Shen, Chenhao Lin, Qiwei Tian, Zhengyu Zhao.

**Figure 1.** Grad-CAM of attention maps of VLM under white-box untargeted attacks through perturbed images. The texts are given at the …

**Figure 2.** Left: An illustration of our Function-word De-Attention (FDA) method. On the existing process of attention calculation, which uses F_V and F_T, we add a parallel pipeline to calculate the attentions between function words F_Tf and the images F_V. Afterwards, the function-attention passes a control gate G before entering the FDA module (triangle) differentially to subtract distractions as presented in Eq. 6. …

**Figure 3.** Left: t-SNE of the vision-language embedding of vanilla VLM, FDA, FARE, and TeCoA. Our FDA is the most aligned model. Right: Comparison of text-image similarity for vanilla VLM versus VLM + FDA. Our FDA yields better alignment with larger similarities and smaller variances.

**Figure 4.** A heatmap of attention probabilities given the same image and text inputs.

Original abstract

To address the trade-off between robustness and performance for robust VLM, we observe that function words could incur vulnerability of VLMs against cross-modal adversarial attacks, and propose Function-word De-Attention (FDA) accordingly to mitigate the impact of function words. Similar to differential amplifiers, our FDA calculates the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former for more aligned and robust VLMs. Comprehensive experiments include 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, our FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We demonstrate the scalability, generalization, and zero-shot performance of FDA experimentally, as well as in-depth ablation studies and analysis. Code is available at https://github.com/michaeltian108/FDA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that function words in text inputs contribute to the vulnerability of vision-language models (VLMs) to cross-modal adversarial attacks. It proposes Function-word De-Attention (FDA), a training-free method that computes original cross-attention and function-word-specific cross-attention within heads and differentially subtracts the latter (analogous to a differential amplifier) to produce more robust and aligned representations. Experiments across three models, two tasks (image-text retrieval and visual grounding), three datasets, six attacks, and two SOTA baselines report average attack success rate (ASR) reductions of 18/13/53% on retrieval with clean-performance drops of only 0.2/0.3/0.6%, plus a 90% ASR drop and 0.3% clean gain on grounding; additional results address scalability, generalization, zero-shot transfer, and ablations. Code is released.

Significance. If the central result holds, the work supplies a lightweight, parameter-light intervention that improves adversarial robustness of VLMs without retraining or substantial clean-data cost, highlighting a previously under-examined role of function words in cross-modal vulnerabilities. The multi-model, multi-attack, multi-task evaluation and public code are positive factors for reproducibility and practical impact.

major comments (3)

[Experiments / Results] The headline claim of 'free' robustness rests on clean-performance deltas of 0.2/0.3/0.6% and corresponding ASR reductions being both real and negligible. These are reported as single point estimates with no error bars, no standard deviations across seeds, and no statistical tests (e.g., paired t-tests or Wilcoxon tests). Without such quantification it is impossible to determine whether the observed changes lie within run-to-run variance or simply reflect a uniform reduction in attention magnitude.
[Method / FDA definition] The differential subtraction step is described at a high level ('calculates the original and the function-word cross-attention ... and differentially subtracts') but lacks an explicit equation, scaling factor, or threshold for isolating function-word attention. The abstract and method description therefore leave open whether the operation is strictly parameter-free or contains hidden choices that could affect the reported gains.
[Method / Identification of function words] The paper does not specify the procedure used to label function words (POS tagger, fixed list, frequency threshold, etc.). Because the entire FDA mechanism depends on this partitioning, the lack of a reproducible definition is load-bearing for both the mechanistic claim and the experimental results.
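
The first major comment asks for a paired test across seeds. As a rough illustration of what that check involves, the sketch below computes a paired t statistic from per-seed scores using only the standard library. The function name is hypothetical and the scores are made-up placeholders, not numbers from the paper.

```python
import math
import statistics


def paired_t_statistic(baseline, treated):
    """Return (t statistic, degrees of freedom) for paired per-seed scores."""
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    diffs = [t - b for b, t in zip(baseline, treated)]
    spread = statistics.stdev(diffs)  # sample standard deviation of differences
    if spread == 0:
        raise ValueError("all paired differences are identical")
    t_stat = statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


# Placeholder scores for five seeds (illustrative only).
baseline_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
fda_scores = [2.0, 4.0, 3.0, 6.0, 5.0]
t_stat, dof = paired_t_statistic(baseline_scores, fda_scores)
# With 4 degrees of freedom, the two-sided 5% critical value is about 2.776.
significant = abs(t_stat) > 2.776
```

A statistic below the critical value means the difference cannot be distinguished from run-to-run variance at that level, which is the outcome the 'free robustness' claim would need for clean performance, while the ASR reductions would need to exceed it.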

minor comments (2)

[Abstract] The reported percentages (18/13/53%, 0.2/0.3/0.6%) are presented without any indication of variance or number of runs; adding a brief qualifier would improve clarity.
[Introduction] The manuscript would benefit from a short related-work paragraph contrasting FDA with prior attention-modification or adversarial-defense techniques in VLMs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to enhance statistical rigor, methodological transparency, and reproducibility.

Point-by-point responses

Referee: [Experiments / Results] The headline claim of 'free' robustness rests on clean-performance deltas of 0.2/0.3/0.6% and corresponding ASR reductions being both real and negligible. These are reported as single point estimates with no error bars, no standard deviations across seeds, and no statistical tests (e.g., paired t-tests or Wilcoxon tests). Without such quantification it is impossible to determine whether the observed changes lie within run-to-run variance or simply reflect a uniform reduction in attention magnitude.

Authors: We agree that single-point estimates limit assessment of variability and statistical significance. In the revised manuscript, we will report results averaged over multiple random seeds (at least 5 runs per setting) with standard deviations. We will also add paired t-tests comparing FDA against the baseline to confirm that clean-performance changes are statistically insignificant while ASR reductions are significant. These additions will directly support the 'free' robustness claim across the reported models and tasks. Revision: yes.

Referee: [Method / FDA definition] The differential subtraction step is described at a high level ('calculates the original and the function-word cross-attention ... and differentially subtracts') but lacks an explicit equation, scaling factor, or threshold for isolating function-word attention. The abstract and method description therefore leave open whether the operation is strictly parameter-free or contains hidden choices that could affect the reported gains.

Authors: We thank the referee for highlighting this clarity issue. We will add a formal equation in the Method section: let A_h denote the original cross-attention map in head h and A_fw,h the corresponding map computed only over function-word tokens; the FDA output is then A'_h = A_h - A_fw,h (i.e., direct subtraction with scaling factor 1 and no threshold). This formulation is strictly parameter-free, as confirmed by our released code, and we will explicitly state that no additional hyperparameters are introduced. Revision: yes.

Referee: [Method / Identification of function words] The paper does not specify the procedure used to label function words (POS tagger, fixed list, frequency threshold, etc.). Because the entire FDA mechanism depends on this partitioning, the lack of a reproducible definition is load-bearing for both the mechanistic claim and the experimental results.

Authors: We acknowledge that an explicit definition is necessary for reproducibility. In the revised manuscript we will describe the procedure in detail: function words are identified via a hybrid approach combining a fixed linguistic list (determiners, prepositions, conjunctions, pronouns, and auxiliary verbs) with NLTK POS tagging, so that tokens are tagged as function words when they belong to closed-class categories. The exact list and tagging script will be included in the supplementary material and the public code repository. Revision: yes.
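
The fixed-list half of the hybrid approach described in the last response could look roughly like the sketch below. The word list is a small illustrative sample of closed-class words, not the authors' list, and the function name is hypothetical. The POS-tagging half is omitted, and a real VLM text encoder would also need the mask aligned to its subword tokens.

```python
import re

# Illustrative closed-class words (hypothetical sample, not the authors' list).
FUNCTION_WORDS = {
    "a", "an", "the",                                  # determiners
    "of", "in", "on", "at", "to", "with",              # prepositions
    "and", "or", "but",                                # conjunctions
    "it", "he", "she", "they", "this", "that",         # pronouns
    "is", "are", "was", "were", "be", "has", "have",   # auxiliary verbs
}


def function_word_mask(caption):
    """Return word tokens and a parallel list flagging function words."""
    tokens = re.findall(r"[a-z']+", caption.lower())
    return tokens, [tok in FUNCTION_WORDS for tok in tokens]


tokens, mask = function_word_mask("A dog sits on the bench")
```

For this caption the mask flags "a", "on", and "the" and leaves "dog", "sits", and "bench" as content words. A boolean mask of this kind is the input a de-attention step would need to decide which attention rows to subtract.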

Circularity Check

0 steps flagged

No significant circularity; empirical method validated externally

The paper introduces FDA as an explicit algorithmic modification to attention maps in VLMs: it computes and subtracts function-word cross-attention from the original cross-attention within heads. This design choice is then evaluated on external adversarial attack benchmarks, retrieval and grounding tasks, and multiple models/datasets. No derivation chain exists in which a 'prediction' or claimed result reduces by construction to quantities defined inside the same equations or to self-citations. The reported ASR drops and performance deltas are measured quantities on held-out data, not tautological outputs of fitted parameters or renamed inputs. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim rests on the domain assumption that function words drive adversarial vulnerability and on a small number of implementation choices for identifying function words and scaling the subtraction.

free parameters (1)

subtraction scaling factor or threshold for function-word attention
Controls how strongly the function-word component is removed; value must be chosen or tuned to achieve the reported ASR drops.
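
For reference, the role of this parameter can be written in one line. The form below is a sketch that reuses the rebuttal's notation (A_h for the original cross-attention in head h, A_fw,h for the function-word cross-attention) and adds a scalar g for the scaling factor. The symbol g, its scalar form, and its range are assumptions made for illustration, and the paper's own Eq. 6 may differ.

```latex
A'_h = A_h - g \, A_{\mathrm{fw},h}, \qquad g \in [0, 1]
```

Setting g = 1 recovers the direct subtraction described in the simulated rebuttal, and g = 0 leaves the original attention unchanged. Whether g counts as a free parameter depends on whether it is fixed in advance or tuned.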

axioms (1)

domain assumption: Function words disproportionately contribute to cross-modal adversarial vulnerability in attention mechanisms of VLMs.
This observation is the stated motivation for introducing FDA.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

FDA calculates the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former (Eq. 6)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

