HQA-VLAttack: Towards High Quality Adversarial Attack on Vision-Language Pre-Trained Models
Pith reviewed 2026-05-10 16:17 UTC · model grok-4.3
The pith
A two-stage attack produces adversarial examples with higher success rates against vision-language models by applying contrastive optimization to the image perturbation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HQA-VLAttack generates adversarial examples via separate text and image attack stages. For text, counter-fitting word vectors produce substitute sets that keep semantic consistency with originals. For images, perturbations begin with a layer-importance guided initialization and are then refined by contrastive optimization that decreases similarity between positive image-text pairs while increasing similarity between negative pairs. The resulting examples are more likely to retrieve incorrect matches, producing higher attack success rates than baselines on three benchmark datasets.
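The text stage can be pictured with a small sketch. It assumes a substitute set is built by keeping the nearest neighbours of a word in an embedding space, filtered by a cosine-similarity threshold. The function `substitute_set`, the parameters `top_k` and `min_sim`, and the toy 2-D vectors are all hypothetical; they stand in for real counter-fitted embeddings and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def substitute_set(word, vectors, top_k=2, min_sim=0.8):
    """Return up to top_k words closest to `word`, most similar first.

    Candidates below min_sim are dropped so that substitutes stay
    semantically close to the original word.
    """
    target = vectors[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    kept = [(sim, other) for sim, other in scored if sim >= min_sim]
    kept.sort(reverse=True)
    return [other for _, other in kept[:top_k]]


# Toy vectors (made up) standing in for counter-fitted embeddings.
toy_vectors = {
    "big": [1.0, 0.1],
    "large": [0.95, 0.15],
    "huge": [0.9, 0.3],
    "small": [-1.0, 0.1],
}

candidates = substitute_set("big", toy_vectors)
print(candidates)
```

In this toy space the antonym "small" points in the opposite direction and is filtered out, which is the behaviour counter-fitting is meant to encourage.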
What carries the argument
Contrastive optimization of image adversarial perturbations, which decreases similarity of positive image-text pairs and increases similarity of negative image-text pairs.
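A minimal sketch of the idea, assuming cosine similarity between embeddings and a simple "mean positive similarity minus mean negative similarity" objective that the attacker minimizes. The paper's exact loss is not given in this text, so `contrastive_attack_loss` and the 2-D embeddings below are illustrative assumptions only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_attack_loss(img, pos_texts, neg_texts):
    """Lower is better for the attacker.

    The first term rewards pushing the image away from its matching
    texts; the second rewards pulling it toward non-matching texts.
    """
    pos = sum(cosine(img, t) for t in pos_texts) / len(pos_texts)
    neg = sum(cosine(img, t) for t in neg_texts) / len(neg_texts)
    return pos - neg


# Hypothetical embeddings: the clean image aligns with its caption,
# the perturbed image has drifted toward an unrelated caption.
clean_img = [1.0, 0.0]
adv_img = [0.6, 0.8]
pos_texts = [[1.0, 0.0]]
neg_texts = [[0.0, 1.0]]

clean_loss = contrastive_attack_loss(clean_img, pos_texts, neg_texts)
adv_loss = contrastive_attack_loss(adv_img, pos_texts, neg_texts)
print(round(clean_loss, 3), round(adv_loss, 3))
```

Dropping the negative term recovers the "only weaken the match" objective that the abstract attributes to earlier methods, which makes this form convenient for an ablation.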
If this is right
- Adversarial examples achieve higher success rates at forcing vision-language models to retrieve incorrect image-text matches.
- Text changes remain semantically close to originals, limiting obvious semantic drift in the perturbed inputs.
- The overall method uses fewer queries than iterative cross-search strategies in earlier black-box attacks.
- Stronger attacks provide a clearer picture of robustness gaps in pre-trained multimodal models on retrieval tasks.
Where Pith is reading between the lines
- The same contrastive principle might extend to other multimodal tasks such as visual question answering where negative pairs can be defined.
- Layer-importance initialization may indicate that partial knowledge of model internals can be leveraged even in nominally black-box settings.
- If negative-pair boosting proves robust, defenses would need to account for attacks that actively strengthen mismatches rather than only weaken matches.
Load-bearing premise
That the contrastive optimization step, which decreases positive image-text similarity while increasing negative pair similarity, will reliably translate to higher attack success rates without side effects or dataset-specific tuning that limits generalization.
What would settle it
Run the full HQA-VLAttack pipeline against an ablation that removes only the contrastive optimization step, on the same three benchmark datasets. If attack success rate shows no gain from the contrastive step, the load-bearing premise fails.
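The comparison could be scored as sketched here. The sketch assumes attack success rate is the share of originally correct top-1 retrievals that become incorrect after the attack, which is a common convention but is not spelled out in this text. The per-query outcomes are invented for illustration and are not results from the paper.

```python
def attack_success_rate(clean_correct, adv_correct):
    """Share of originally correct queries that the attack turns incorrect.

    Queries the model already got wrong on clean inputs are excluded,
    since the attack cannot take credit for them.
    """
    attacked = [adv for clean, adv in zip(clean_correct, adv_correct) if clean]
    if not attacked:
        return 0.0
    return sum(1 for adv in attacked if not adv) / len(attacked)


# Hypothetical per-query top-1 outcomes (True = correct match retrieved).
clean = [True, True, True, True, False]
full_pipeline = [False, False, False, True, False]
no_contrastive = [False, False, True, True, False]

asr_full = attack_success_rate(clean, full_pipeline)
asr_ablated = attack_success_rate(clean, no_contrastive)
gain = asr_full - asr_ablated
print(f"full={asr_full:.2f} ablation={asr_ablated:.2f} gain={gain:.2f}")
```

A gain near zero across all three datasets, with query budget, initialization, and text stage held fixed, would count against the premise.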
Original abstract
Black-box adversarial attack on vision-language pre-trained models is a practical and challenging task, as text and image perturbations need to be considered simultaneously, and only the predicted results are accessible. Research on this problem is in its infancy, and only a handful of methods are available. Nevertheless, existing methods either rely on a complex iterative cross-search strategy, which inevitably consumes numerous queries, or only consider reducing the similarity of positive image-text pairs but ignore that of negative ones, which will also be implicitly diminished, thus inevitably affecting the attack performance. To alleviate the above issues, we propose a simple yet effective framework to generate high-quality adversarial examples on vision-language pre-trained models, named HQA-VLAttack, which consists of text and image attack stages. For text perturbation generation, it leverages the counter-fitting word vector to generate the substitute word set, thus guaranteeing the semantic consistency between the substitute word and the original word. For image perturbation generation, it first initializes the image adversarial example via the layer-importance guided strategy, and then utilizes contrastive learning to optimize the image adversarial perturbation, which ensures that the similarity of positive image-text pairs is decreased while that of negative image-text pairs is increased. In this way, the optimized adversarial images and texts are more likely to retrieve negative examples, thereby enhancing the attack success rate. Experimental results on three benchmark datasets demonstrate that HQA-VLAttack significantly outperforms strong baselines in terms of attack success rate.
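One way to write down the image-stage objective described in the abstract is given here as a reading aid. All symbols are assumptions introduced for this note, not the paper's notation: $x$ is the clean image, $\delta$ the perturbation, $f_I$ and $f_T$ the image and text encoders, $s(\cdot,\cdot)$ a similarity function, $P$ the set of matching texts, $N$ the set of non-matching texts, and $\epsilon$ an assumed perturbation budget.

```latex
\min_{\delta}\;
  \frac{1}{|P|} \sum_{t \in P} s\bigl(f_I(x + \delta),\, f_T(t)\bigr)
  \;-\;
  \frac{1}{|N|} \sum_{t \in N} s\bigl(f_I(x + \delta),\, f_T(t)\bigr)
\quad \text{s.t.} \quad \lVert \delta \rVert_{\infty} \le \epsilon
```

The first sum corresponds to decreasing positive-pair similarity and the second to increasing negative-pair similarity. The paper may weight or normalize these terms differently.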
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HQA-VLAttack, a two-stage black-box adversarial attack framework for vision-language pre-trained models. The text stage generates semantically consistent perturbations via counter-fitting word vectors. The image stage initializes perturbations with a layer-importance guided strategy and then applies contrastive optimization to decrease similarity of positive image-text pairs while increasing similarity of negative pairs. The central claim is that this yields higher attack success rates than existing baselines on three benchmark datasets.
Significance. If the performance claims and attribution to the contrastive component hold after validation, the work would offer a relatively simple improvement to VL adversarial attacks by explicitly handling negative-pair similarities, which prior methods overlook. This could aid in more thorough robustness evaluation of VL models, though the significance is tempered by the empirical nature of the approach and lack of isolated validation for the key innovation.
major comments (2)
- [Abstract / Image perturbation generation] Abstract and experimental results: The central claim of significant outperformance in attack success rate is stated without any quantitative metrics, tables, or specific numbers in the provided text, and no ablation is described that removes only the contrastive optimization term while holding query budget, initialization, and text stage fixed. This prevents attribution of gains to the contrastive step rather than other design choices.
- [Image perturbation generation] Image attack stage description: The contrastive optimization is presented as ensuring decreased positive similarity and increased negative-pair similarity to enhance retrieval of negatives, but no analysis or experiment addresses potential compensating effects (e.g., changes in false-positive retrievals on other negatives) or confirms the mapping to downstream decision boundaries without dataset-specific tuning.
minor comments (3)
- [Abstract] The abstract uses 'guaranteeing the semantic consistency' for the counter-fitting step; this should be softened to 'promoting' or supported by a quantitative semantic similarity metric in the text stage.
- [Method] Notation for positive/negative pairs and similarity functions is introduced descriptively but would benefit from explicit equations or a diagram in the method section for clarity.
- [Title / Abstract] The title emphasizes 'High Quality' but the manuscript does not define this beyond attack success rate; consider adding metrics such as perceptual similarity or query efficiency if they are evaluated.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment point by point below, indicating revisions where the manuscript will be updated to improve clarity and validation of the proposed method.
- Referee: [Abstract / Image perturbation generation] Abstract and experimental results: The central claim of significant outperformance in attack success rate is stated without any quantitative metrics, tables, or specific numbers in the provided text, and no ablation is described that removes only the contrastive optimization term while holding query budget, initialization, and text stage fixed. This prevents attribution of gains to the contrastive step rather than other design choices.
  Authors: We agree that the abstract would be improved by including specific quantitative metrics to support the outperformance claim. In the revised version, we will add key attack success rate figures from the experiments on the three benchmarks directly into the abstract. We also acknowledge that the current manuscript does not include an ablation that isolates only the contrastive optimization term while holding query budget, initialization, and the text stage fixed. We will add this ablation study to the revision to enable clearer attribution of performance gains to the contrastive component.
  Revision: yes
- Referee: [Image perturbation generation] Image attack stage description: The contrastive optimization is presented as ensuring decreased positive similarity and increased negative-pair similarity to enhance retrieval of negatives, but no analysis or experiment addresses potential compensating effects (e.g., changes in false-positive retrievals on other negatives) or confirms the mapping to downstream decision boundaries without dataset-specific tuning.
  Authors: The contrastive optimization is explicitly designed to decrease positive-pair similarity while increasing negative-pair similarity, addressing an aspect overlooked by prior methods. We agree that the manuscript lacks dedicated analysis of potential compensating effects such as changes in false-positive retrievals across other negatives. We will add experiments or similarity distribution analysis for multiple negative pairs in the revision. On the mapping to downstream decision boundaries, attack success is measured directly on the retrieval task using the described framework without additional dataset-specific tuning; we will clarify this point and include supporting discussion or metrics in the updated manuscript.
  Revision: yes
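The similarity distribution analysis mentioned in this exchange could start from something like the sketch below, which reports how the perturbation shifts similarity to each negative text individually. The helper `similarity_shifts` and the 2-D embeddings are hypothetical and chosen only to show that a perturbation can raise similarity to one negative while lowering it for another.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_shifts(clean_img, adv_img, neg_texts):
    """Per-negative change in similarity caused by the perturbation."""
    return [cosine(adv_img, t) - cosine(clean_img, t) for t in neg_texts]


# Hypothetical embeddings: two negatives pointing in opposite directions.
clean_img = [1.0, 0.0]
adv_img = [0.6, 0.8]
neg_texts = [[0.0, 1.0], [0.0, -1.0]]

shifts = similarity_shifts(clean_img, adv_img, neg_texts)
raised = sum(1 for s in shifts if s > 0)
print([round(s, 3) for s in shifts], f"raised {raised} of {len(shifts)}")
```

Averaging over negatives would hide this split, so reporting the per-negative distribution speaks more directly to the referee's concern about compensating effects.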
Circularity Check
No significant circularity in the paper's empirical method
The paper presents a procedural two-stage empirical method for black-box adversarial attacks on vision-language models: text perturbation via counter-fitting word vectors for semantic consistency, followed by image perturbation initialized via layer-importance guidance and optimized with contrastive learning to decrease positive pair similarity while increasing negative pair similarity. The central claim of higher attack success rate is supported solely by experimental results on three benchmark datasets showing outperformance over baselines. No equations, derivations, or mathematical reductions are described that would equate the reported ASR to a fitted parameter or self-referential definition by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The approach is self-contained as an algorithmic description validated externally through benchmarks, with no patterns of self-definitional, fitted-input-called-prediction, or renaming-known-result circularity.