HQA-VLAttack: Towards High Quality Adversarial Attack on Vision-Language Pre-Trained Models

Fenglong Ma; Han Liu; Hong Yu; Jiaqi Li; Xiaoming Xu; Xiaotong Zhang; Yuanman Li; Zhi Xu

arxiv: 2604.16499 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.AI

Han Liu, Jiaqi Li, Zhi Xu, Xiaotong Zhang, Xiaoming Xu, Fenglong Ma, Yuanman Li, Hong Yu

Pith reviewed 2026-05-10 16:17 UTC · model grok-4.3

keywords adversarial attack · vision-language models · black-box attack · contrastive learning · image-text retrieval · adversarial perturbations · multimodal models

The pith

A two-stage attack raises adversarial success rates against vision-language models by applying contrastive optimization to image perturbations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to create better black-box adversarial examples against vision-language pre-trained models, where only model outputs are available and changes to both images and text must be coordinated. Prior methods either demand many queries through repeated cross-searches or only weaken matching image-text pairs without strengthening mismatches. HQA-VLAttack splits the task into text and image stages: text uses word vectors that preserve meaning for substitutions, while images start from an importance-guided initial change and then apply contrastive learning. The contrastive step reduces similarity for correct pairs and raises it for incorrect ones, making the model more likely to return wrong retrieval results. If this holds, it supplies a lower-query route to exposing weaknesses in multimodal retrieval systems.

Core claim

HQA-VLAttack generates adversarial examples via separate text and image attack stages. For text, counter-fitting word vectors produce substitute sets that keep semantic consistency with originals. For images, perturbations begin with a layer-importance guided initialization and are then refined by contrastive optimization that decreases similarity between positive image-text pairs while increasing similarity between negative pairs. The resulting examples are more likely to retrieve incorrect matches, producing higher attack success rates than baselines on three benchmark datasets.
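The text stage described above can be illustrated with a minimal sketch of substitute-set construction: pick the nearest neighbours of a word in an embedding space and keep only those above a similarity threshold. Everything here is an assumption made for illustration: the function names, the `top_k` and `min_sim` values, and the toy 2-D vectors are hypothetical, and real counter-fitted embeddings would be loaded from a file. The paper's actual selection procedure is not specified in this summary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def substitute_set(word, vectors, top_k=5, min_sim=0.7):
    """Return up to top_k words whose vectors are closest to `word`.

    `vectors` maps word -> list of floats; in practice it would hold
    counter-fitted embeddings. top_k and min_sim are illustrative
    values, not taken from the paper.
    """
    if word not in vectors:
        return []
    target = vectors[word]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != word]
    # Keep only candidates that stay close in meaning (high cosine similarity).
    scored = [(sim, other) for sim, other in scored if sim >= min_sim]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:top_k]]


# Toy 2-D vectors, invented for illustration only.
toy_vectors = {
    "happy": [1.0, 0.1],
    "glad": [0.9, 0.2],
    "joyful": [0.8, 0.3],
    "sad": [-1.0, 0.1],
}
print(substitute_set("happy", toy_vectors))
```

The threshold is what keeps an antonym such as "sad" out of the candidate set in this toy example; how strict it should be is a tuning choice this sketch does not settle.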

What carries the argument

Contrastive optimization of image adversarial perturbations, which decreases similarity of positive image-text pairs and increases similarity of negative image-text pairs.
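As a rough illustration of this objective, the sketch below scores an image embedding by its mean cosine similarity to matching texts minus its mean cosine similarity to non-matching texts. An attacker would push this value down. The difference-of-means form and the name `contrastive_attack_loss` are assumptions made here; the paper's exact loss (for example, a temperature-scaled variant) is not given in this summary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_attack_loss(image_emb, positive_embs, negative_embs):
    """Mean positive-pair similarity minus mean negative-pair similarity.

    Lower values mean the image looks less like its true captions and
    more like unrelated ones. This is an assumed, simplified objective.
    """
    pos = sum(cosine(image_emb, t) for t in positive_embs) / len(positive_embs)
    neg = sum(cosine(image_emb, t) for t in negative_embs) / len(negative_embs)
    return pos - neg


# Toy embeddings, invented for illustration only.
positives = [[1.0, 0.0]]   # caption that matches the clean image
negatives = [[0.0, 1.0]]   # caption of some other image
clean_image = [1.0, 0.0]
perturbed_image = [0.0, 1.0]

clean_loss = contrastive_attack_loss(clean_image, positives, negatives)
adv_loss = contrastive_attack_loss(perturbed_image, positives, negatives)
```

In a real attack the embeddings would come from a surrogate model and the perturbation would be updated by gradient steps under a norm bound; both are omitted here.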

If this is right

Adversarial examples achieve higher success rates at forcing vision-language models to retrieve incorrect image-text matches.
Text changes remain semantically close to originals, limiting obvious semantic drift in the perturbed inputs.
The overall method uses fewer queries than iterative cross-search strategies in earlier black-box attacks.
Stronger attacks provide a clearer picture of robustness gaps in pre-trained multimodal models on retrieval tasks.
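For readers who want to make "attack success rate" concrete on a retrieval task, here is one possible definition: the fraction of adversarial queries for which no correct match appears in the top-k results. This definition, the function name, and the choice to count all queries are assumptions of this sketch; conventions differ between papers, and the paper's own definition is not stated in this summary.

```python
def attack_success_rate(ranked_results, ground_truth, k=1):
    """Fraction of queries with no true match in the top-k results.

    ranked_results: list of ranked candidate-id lists, one per query.
    ground_truth:   list of sets of correct candidate ids, one per query.
    """
    if not ranked_results:
        raise ValueError("need at least one query")
    if len(ranked_results) != len(ground_truth):
        raise ValueError("ranked_results and ground_truth must align")
    successes = sum(
        1
        for ranked, truth in zip(ranked_results, ground_truth)
        if not truth.intersection(ranked[:k])
    )
    return successes / len(ranked_results)


# Toy retrieval output, invented for illustration only.
ranked = [["b", "a"], ["c", "d"]]
truth = [{"a"}, {"c"}]
asr_at_1 = attack_success_rate(ranked, truth, k=1)
asr_at_2 = attack_success_rate(ranked, truth, k=2)
```

A larger `k` makes the attack harder to count as successful, because the true match only has to appear somewhere in a longer list.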

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same contrastive principle might extend to other multimodal tasks such as visual question answering where negative pairs can be defined.
Layer-importance initialization may indicate that partial knowledge of model internals can be leveraged even in nominally black-box settings.
If negative-pair boosting proves robust, defenses would need to account for attacks that actively strengthen mismatches rather than only weaken matches.

Load-bearing premise

That the contrastive optimization step, which decreases positive image-text similarity while increasing negative pair similarity, will reliably translate to higher attack success rates without side effects or dataset-specific tuning that limits generalization.

What would settle it

Run the full HQA-VLAttack pipeline against an ablation that removes only the contrastive optimization step. If attack success rate shows no gain on the same three benchmark datasets, the premise fails.
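One way to read such an ablation is sketched below: compute the per-dataset difference in attack success rate between the full pipeline and the ablated one, then check whether every dataset shows a gain. The helper names and the `min_gain` margin are hypothetical, and no measured values are included because this summary reports none.

```python
def ablation_gains(full_asr, ablated_asr):
    """Per-dataset ASR gain (full minus ablated) for datasets present in both.

    Both arguments map dataset name -> attack success rate in [0, 1].
    """
    shared = sorted(set(full_asr) & set(ablated_asr))
    return {name: full_asr[name] - ablated_asr[name] for name in shared}


def contrastive_step_helps(full_asr, ablated_asr, min_gain=0.0):
    """True only if every shared dataset shows a gain above min_gain."""
    gains = ablation_gains(full_asr, ablated_asr)
    return bool(gains) and all(g > min_gain for g in gains.values())
```

A stricter check would also hold the query budget fixed and report variance across seeds, which this sketch leaves out.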

Figures

Figures reproduced from arXiv: 2604.16499 by Fenglong Ma, Han Liu, Hong Yu, Jiaqi Li, Xiaoming Xu, Xiaotong Zhang, Yuanman Li, Zhi Xu.

**Figure 1.** The average cosine similarity of image-text pairs optimized by SGA, DRA, and HQA-VLAttack on the Flickr30K dataset using ALBEF as the surrogate model.

**Figure 2.** The overall of HQA-VLAttack. First, the Text Attack module determines the substitute …

**Figure 3.** The cosine similarity of image feature and [CLS] token embedding across layers. Determining layer importance. We conduct an experiment to quantify the contribution of each layer in the model.

**Figure 4.** Ablation Study on Component Effectiveness

**Figure 5.** Adversarial Transferability between GPT-4o and Claude-3.7 Sonnet. The images on the left …

Original abstract

Black-box adversarial attack on vision-language pre-trained models is a practical and challenging task, as text and image perturbations need to be considered simultaneously, and only the predicted results are accessible. Research on this problem is in its infancy, and only a handful of methods are available. Nevertheless, existing methods either rely on a complex iterative cross-search strategy, which inevitably consumes numerous queries, or only consider reducing the similarity of positive image-text pairs but ignore that of negative ones, which will also be implicitly diminished, thus inevitably affecting the attack performance. To alleviate the above issues, we propose a simple yet effective framework to generate high-quality adversarial examples on vision-language pre-trained models, named HQA-VLAttack, which consists of text and image attack stages. For text perturbation generation, it leverages the counter-fitting word vector to generate the substitute word set, thus guaranteeing the semantic consistency between the substitute word and the original word. For image perturbation generation, it first initializes the image adversarial example via the layer-importance guided strategy, and then utilizes contrastive learning to optimize the image adversarial perturbation, which ensures that the similarity of positive image-text pairs is decreased while that of negative image-text pairs is increased. In this way, the optimized adversarial images and texts are more likely to retrieve negative examples, thereby enhancing the attack success rate. Experimental results on three benchmark datasets demonstrate that HQA-VLAttack significantly outperforms strong baselines in terms of attack success rate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HQA-VLAttack adds an explicit contrastive step to raise negative image-text similarities after layer-guided initialization, but the abstract supplies no ablations or numbers to show this step actually produces the reported attack success gains.

The main thing to know is that this paper describes a two-stage black-box attack on vision-language models. Text perturbations come from counter-fitting word vectors to preserve semantics. Image perturbations start with a layer-importance initialization and then get refined by contrastive learning that lowers similarity to the correct text while raising similarity to incorrect texts, with the goal of making wrong retrievals more likely.

Referee Report

2 major / 3 minor

Summary. The paper proposes HQA-VLAttack, a two-stage black-box adversarial attack framework for vision-language pre-trained models. The text stage generates semantically consistent perturbations via counter-fitting word vectors. The image stage initializes perturbations with a layer-importance guided strategy and then applies contrastive optimization to decrease similarity of positive image-text pairs while increasing similarity of negative pairs. The central claim is that this yields higher attack success rates than existing baselines on three benchmark datasets.

Significance. If the performance claims and attribution to the contrastive component hold after validation, the work would offer a relatively simple improvement to VL adversarial attacks by explicitly handling negative-pair similarities, which prior methods overlook. This could aid in more thorough robustness evaluation of VL models, though the significance is tempered by the empirical nature of the approach and lack of isolated validation for the key innovation.

major comments (2)

[Abstract / Image perturbation generation] Abstract and experimental results: The central claim of significant outperformance in attack success rate is stated without any quantitative metrics, tables, or specific numbers in the provided text, and no ablation is described that removes only the contrastive optimization term while holding query budget, initialization, and text stage fixed. This prevents attribution of gains to the contrastive step rather than other design choices.
[Image perturbation generation] Image attack stage description: The contrastive optimization is presented as ensuring decreased positive similarity and increased negative-pair similarity to enhance retrieval of negatives, but no analysis or experiment addresses potential compensating effects (e.g., changes in false-positive retrievals on other negatives) or confirms the mapping to downstream decision boundaries without dataset-specific tuning.

minor comments (3)

[Abstract] The abstract uses 'guaranteeing the semantic consistency' for the counter-fitting step; this should be softened to 'promoting' or supported by a quantitative semantic similarity metric in the text stage.
[Method] Notation for positive/negative pairs and similarity functions is introduced descriptively but would benefit from explicit equations or a diagram in the method section for clarity.
[Title / Abstract] The title emphasizes 'High Quality' but the manuscript does not define this beyond attack success rate; consider adding metrics such as perceptual similarity or query efficiency if they are evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment point by point below, indicating revisions where the manuscript will be updated to improve clarity and validation of the proposed method.

Referee: [Abstract / Image perturbation generation] Abstract and experimental results: The central claim of significant outperformance in attack success rate is stated without any quantitative metrics, tables, or specific numbers in the provided text, and no ablation is described that removes only the contrastive optimization term while holding query budget, initialization, and text stage fixed. This prevents attribution of gains to the contrastive step rather than other design choices.

Authors: We agree that the abstract would be improved by including specific quantitative metrics to support the outperformance claim. In the revised version, we will add key attack success rate figures from the experiments on the three benchmarks directly into the abstract. We also acknowledge that the current manuscript does not include an ablation that isolates only the contrastive optimization term while holding query budget, initialization, and the text stage fixed. We will add this ablation study to the revision to enable clearer attribution of performance gains to the contrastive component. revision: yes

Referee: [Image perturbation generation] Image attack stage description: The contrastive optimization is presented as ensuring decreased positive similarity and increased negative-pair similarity to enhance retrieval of negatives, but no analysis or experiment addresses potential compensating effects (e.g., changes in false-positive retrievals on other negatives) or confirms the mapping to downstream decision boundaries without dataset-specific tuning.

Authors: The contrastive optimization is explicitly designed to decrease positive-pair similarity while increasing negative-pair similarity, addressing an aspect overlooked by prior methods. We agree that the manuscript lacks dedicated analysis of potential compensating effects such as changes in false-positive retrievals across other negatives. We will add experiments or similarity distribution analysis for multiple negative pairs in the revision. On the mapping to downstream decision boundaries, attack success is measured directly on the retrieval task using the described framework without additional dataset-specific tuning; we will clarify this point and include supporting discussion or metrics in the updated manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the paper's empirical method

The paper presents a procedural two-stage empirical method for black-box adversarial attacks on vision-language models: text perturbation via counter-fitting word vectors for semantic consistency, followed by image perturbation initialized via layer-importance guidance and optimized with contrastive learning to decrease positive pair similarity while increasing negative pair similarity. The central claim of higher attack success rate is supported solely by experimental results on three benchmark datasets showing outperformance over baselines. No equations, derivations, or mathematical reductions are described that would equate the reported ASR to a fitted parameter or self-referential definition by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The approach is self-contained as an algorithmic description validated externally through benchmarks, with no patterns of self-definitional, fitted-input-called-prediction, or renaming-known-result circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents extraction of concrete free parameters, axioms, or invented entities; none are identifiable from the high-level description.


[1] [1]

Image captioning with novel topics guidance and retrieval-based topics re-weighting.IEEE Transactions on Multimedia (TMM), 25:5984–5999, 2023

Majjed Al-Qatf, Xingfu Wang, Ammar Hawbani, Amr Abdussalam, and Saeed Hamood Alsamhi. Image captioning with novel topics guidance and retrieval-based topics re-weighting.IEEE Transactions on Multimedia (TMM), 25:5984–5999, 2023

work page 2023

[2] [2]

SPICE: semantic propositional image caption evaluation

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: semantic propositional image caption evaluation. InEuropean Conference on Computer Vision (ECCV), pages 382–398, 2016

work page 2016

[3] [3]

METEOR: an automatic metric for MT evaluation with improved correlation with human judgments

Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. InProceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005

work page 2005

[4] [4]

Image-text retrieval: A survey on recent research and development

Min Cao, Shiping Li, Juntao Li, Liqiang Nie, and Min Zhang. Image-text retrieval: A survey on recent research and development. InInternational Joint Conference on Artificial Intelligence (IJCAI), pages 5410–5417, 2022

work page 2022

[5] [5]

Query-efficient decision-based black-box patch attack.IEEE Transactions on Information Forensics and Security, 18:5522–5536, 2023

Zhaoyu Chen, Bo Li, Shuang Wu, Shouhong Ding, and Wenqiang Zhang. Query-efficient decision-based black-box patch attack.IEEE Transactions on Information Forensics and Security, 18:5522–5536, 2023

work page 2023

[6] [6]

Typography leads semantic diversifying: Amplifying adversarial transferability across multimodal large language models

Hao Cheng, Erjia Xiao, Jiahang Cao, Le Yang, Kaidi Xu, Jindong Gu, and Renjing Xu. Typography leads semantic diversifying: Amplifying adversarial transferability across multimodal large language models. arXiv preprint, abs/2405.20090, 2024

work page arXiv 2024

[7] [7]

Cross-modal alignment with graph reasoning for image-text retrieval.Multimedia Tools and Applications, 81(17):23615–23632, 2022

Zheng Cui, Yongli Hu, Yanfeng Sun, Junbin Gao, and Baocai Yin. Cross-modal alignment with graph reasoning for image-text retrieval.Multimedia Tools and Applications, 81(17):23615–23632, 2022

work page 2022

[8] [8]

BERT: pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. InNorth American Chapter of the Association for Computational Linguistics (NAACL), 2019

work page 2019

[9] [9]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021

work page 2021

[10] Sensen Gao, Xiaojun Jia, Xuhong Ren, Ivor W. Tsang, and Qing Guo. Boosting transferability in vision-language attacks via diversification along the intersection region of adversarial trajectory. In European Conference on Computer Vision (ECCV), pages 442–460, 2024

[11] Yi Gao, Jianxia Chen, Liang Xiao, Hongyang Wang, Liwei Pan, Xuan Wen, Zhiwei Ye, and Xinyun Wu. Adversarial neural collaborative filtering with embedding dimension correlations. Data Intelligence, 5(3):786–806, 2023

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016

[13] Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Gotmare, Shafiq R. Joty, Caiming Xiong, and Steven Chu-Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In Conference on Neural Information Processing Systems (NeurIPS), pages 9694–9705, 2021

[14] Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. BERT-ATTACK: adversarial attack against BERT using BERT. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6193–6202, 2020

[15] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Annual Meeting of the Association for Computational Linguistics, 2004

[16] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755, 2014

[17] Han Liu, Zhi Xu, Xiaotong Zhang, Xiaoming Xu, Feng Zhang, Fenglong Ma, Hongyang Chen, Hong Yu, and Xianchao Zhang. Sspattack: A simple and sweet paradigm for black-box hard-label textual adversarial attack. In AAAI Conference on Artificial Intelligence (AAAI), pages 13228–13235, 2023

[18] Han Liu, Zhi Xu, Xiaotong Zhang, Feng Zhang, Fenglong Ma, Hongyang Chen, Hong Yu, and Xianchao Zhang. Hqa-attack: Toward high quality black-box hard-label adversarial attack on text. In Conference on Neural Information Processing Systems (NeurIPS), 2023

[19] Dong Lu, Zhiqiang Wang, Teng Wang, Weili Guan, Hongchang Gao, and Feng Zheng. Set-level guidance attack: Boosting adversarial transferability of vision-language pre-training models. In IEEE International Conference on Computer Vision (ICCV), pages 102–111, 2023

[20] Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized visual tokenization for grounding multimodal large language models. In European Conference on Computer Vision (ECCV), pages 417–435, 2024

[21] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), 2018

[22] Nikola Mrksic, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gasic, Lina Maria Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve J. Young. Counter-fitting word vectors to linguistic constraints. In North American Chapter of the Association for Computational Linguistics (NAACL), 2016

[23] OpenAI. GPT-4 technical report. arXiv preprint, abs/2303.08774, 2023

[24] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318, 2002

[25] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In IEEE International Conference on Computer Vision (ICCV), pages 2641–2649, 2015

[26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763, 2021

[27] Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, and Rita Cucchiara. From show to tell: A survey on deep learning-based image captioning. IEEE Trans. Pattern Anal. Mach. Intell., 45(1):539–559, 2023

[28] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566–4575, 2015

[29] Depeng Wang, Zhenzhen Hu, Yuanen Zhou, Richang Hong, and Meng Wang. A text-guided generation and refinement model for image captioning. IEEE Transactions on Multimedia (TMM), 25:2966–2977, 2023

[30] Jie Wu, Tianshui Chen, Hefeng Wu, Zhi Yang, Guangchun Luo, and Liang Lin. Fine-grained image captioning with global-local discriminative objective. IEEE Transactions on Multimedia (TMM), 23:2413–2427, 2021

[31] Cihang Xie, Zhishuai Zhang, Yuyin Zhou, Song Bai, Jianyu Wang, Zhou Ren, and Alan L. Yuille. Improving transferability of adversarial examples with input diversity. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2730–2739, 2019

[32] Xiaojun Xu, Xinyun Chen, Chang Liu, Anna Rohrbach, Trevor Darrell, and Dawn Song. Fooling vision and language models despite localization and attention mechanism. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4951–4961, 2018

[33] Jinyu Yang, Jiali Duan, Son Tran, Yi Xu, Sampath Chanda, Liqun Chen, Belinda Zeng, Trishul Chilimbi, and Junzhou Huang. Vision-language pre-training with triple contrastive learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15650–15659, 2022

[34] Ziyi Yin, Muchao Ye, Tianrong Zhang, Tianyu Du, Jinguo Zhu, Han Liu, Jinghui Chen, Ting Wang, and Fenglong Ma. VLATTACK: multimodal adversarial attacks on vision-language tasks via pre-trained models. In Conference on Neural Information Processing Systems (NeurIPS), 2023

[35] Ziyi Yin, Muchao Ye, Tianrong Zhang, Jiaqi Wang, Han Liu, Jinghui Chen, Ting Wang, and Fenglong Ma. Vqattack: Transferable adversarial attacks on visual question answering via pre-trained models. In AAAI Conference on Artificial Intelligence (AAAI), 2024

[36] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. In European Conference on Computer Vision (ECCV), pages 69–85, 2016

[37] Jiaming Zhang, Qi Yi, and Jitao Sang. Towards adversarial attack on vision-language pre-training models. In ACM International Conference on Multimedia (MM), pages 5005–5013, 2022

[38] Peng-Fei Zhang, Zi Huang, and Guangdong Bai. Universal adversarial perturbations for vision-language pre-trained models. In Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 862–871, 2024

[39] Jiamin Zhuang, Jing Yu, Yang Ding, Xiangyan Qu, and Yue Hu. Towards fast and accurate image-text retrieval with self-supervised fine-grained alignment. IEEE Transactions on Multimedia (TMM), 26:1361–1372, 2024
