Adversarial Video Promotion Against Text-to-Video Retrieval
Pith reviewed 2026-05-19 00:00 UTC · model grok-4.3
The pith
A new adversarial attack raises the rank of chosen videos for selected text queries in text-to-video retrieval systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We pioneer the first attack against T2VR to promote videos adversarially, dubbed the Video Promotion attack (ViPro). We further propose Modal Refinement (MoRe) to capture the finer-grained, intricate interaction between visual and textual modalities to enhance black-box transferability. ViPro surpasses other baselines by over 30/10/4% for white/grey/black-box settings on average.
What carries the argument
The Video Promotion attack (ViPro) combined with Modal Refinement (MoRe), which refines cross-modal interactions to pull videos toward target queries in a multi-target setting.
Load-bearing premise
That Modal Refinement captures cross-modal interactions finely enough to boost black-box transferability, without relying on model-specific knowledge or post-hoc adjustments that would limit generality.
What would settle it
A demonstration that ViPro fails to improve video rankings for multiple target queries on a new text-to-video model not included in the original experiments.
Original abstract
Thanks to the development of cross-modal models, text-to-video retrieval (T2VR) is advancing rapidly, but its robustness remains largely unexamined. Existing attacks against T2VR are designed to push videos away from queries, i.e., suppressing the ranks of videos, while the attacks that pull videos towards selected queries, i.e., promoting the ranks of videos, remain largely unexplored. These attacks can be more impactful as attackers may gain more views/clicks for financial benefits and widespread (mis)information. To this end, we pioneer the first attack against T2VR to promote videos adversarially, dubbed the Video Promotion attack (ViPro). We further propose Modal Refinement (MoRe) to capture the finer-grained, intricate interaction between visual and textual modalities to enhance black-box transferability. Comprehensive experiments cover 2 existing baselines, 3 leading T2VR models, 3 prevailing datasets with over 10k videos, evaluated under 3 scenarios. All experiments are conducted in a multi-target setting to reflect realistic scenarios where attackers seek to promote the video regarding multiple queries simultaneously. We also evaluated our attacks for defences and imperceptibility. Overall, ViPro surpasses other baselines by over $30/10/4\%$ for white/grey/black-box settings on average. Our work highlights an overlooked vulnerability, provides a qualitative analysis on the upper/lower bound of our attacks, and offers insights into potential counterplays. Code will be publicly available at https://github.com/michaeltian108/ViPro.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ViPro, the first adversarial attack on text-to-video retrieval (T2VR) systems aimed at promoting (rather than suppressing) the ranking of target videos with respect to chosen text queries. It further proposes Modal Refinement (MoRe) to capture finer-grained cross-modal interactions and thereby improve black-box transferability. Experiments span two baselines, three T2VR models, three datasets (>10k videos), three attack scenarios, and a multi-target setting; the authors report average gains of over 30/10/4% in white/grey/black-box settings, plus evaluations against defenses and on imperceptibility, with qualitative analysis of attack bounds.
Significance. If the performance claims hold, the work is significant for identifying an underexplored and practically relevant vulnerability in T2VR: promotion attacks that could enable view manipulation or misinformation. The multi-target, multi-model, multi-dataset protocol and the inclusion of defense and imperceptibility tests strengthen the evaluation. Public release of code would further support reproducibility and follow-up research on cross-modal robustness.
major comments (3)
- [Method (Modal Refinement)] Method section (description of Modal Refinement / MoRe): The claim that MoRe improves black-box transferability by capturing finer-grained visual-textual interactions is load-bearing for the 4% black-box gain, yet the manuscript provides no equations, algorithm pseudocode, or ablation isolating the refinement step from any implicit target-model information. Without these details it is impossible to confirm that the reported transfer occurs purely from modality alignment rather than unintended leakage or post-hoc tuning.
- [Experiments (black-box table)] Experiments section, black-box results table: The headline 4% average improvement is presented without per-model breakdowns, standard deviations across runs, or statistical significance tests. Given that black-box transfer gains are typically small and sensitive to seed or query selection, the absence of these statistics weakens the cross-scenario claim.
- [Method (attack objective)] Multi-target attack formulation: The paper emphasizes realistic multi-query promotion, but the loss function or optimization procedure used to jointly promote a video against multiple text queries is not specified with sufficient precision (e.g., how the per-query gradients are aggregated or whether any weighting hyper-parameters are introduced). This detail is necessary to assess whether the reported gains generalize beyond the chosen query sets.
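The aggregation question in the multi-target comment can be made concrete with a toy example. The sketch below is not the paper's objective: it assumes the attacker minimizes a negative weighted mean of cosine similarities between one video embedding and several query embeddings, and it perturbs the embedding directly instead of video pixels. The names `multi_target_loss`, `aggregated_gradient`, and `promote_step`, the uniform default weights, and the step size are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_grad(u, v):
    """Gradient of cosine(u, v) with respect to u."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    c = cosine(u, v)
    return [b / (norm_u * norm_v) - c * a / norm_u ** 2 for a, b in zip(u, v)]

def uniform_weights(n):
    return [1.0 / n] * n

def multi_target_loss(video_emb, query_embs, weights=None):
    """Negative weighted mean similarity; lower means stronger promotion."""
    weights = weights or uniform_weights(len(query_embs))
    return -sum(w * cosine(video_emb, q) for w, q in zip(weights, query_embs))

def aggregated_gradient(video_emb, query_embs, weights=None):
    """Weighted sum of per-query similarity gradients (assumed aggregation rule)."""
    weights = weights or uniform_weights(len(query_embs))
    grads = [cosine_grad(video_emb, q) for q in query_embs]
    return [sum(w * g[i] for w, g in zip(weights, grads))
            for i in range(len(video_emb))]

def promote_step(video_emb, query_embs, step=0.05, weights=None):
    """One signed-gradient step that increases the mean similarity."""
    agg = aggregated_gradient(video_emb, query_embs, weights)
    return [x + step * math.copysign(1.0, g) if g != 0 else x
            for x, g in zip(video_emb, agg)]

# Hypothetical 2-D embeddings, chosen only for illustration.
video = [1.0, 0.0]
queries = [[0.0, 1.0], [1.0, 1.0]]
loss_before = multi_target_loss(video, queries)
stepped = promote_step(video, queries)
loss_after = multi_target_loss(stepped, queries)
```

With uniform weights the aggregated gradient is the plain average of the per-query gradients. Non-uniform weights would be exactly the kind of hyper-parameter the comment asks the authors to report.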
minor comments (2)
- [Abstract] Abstract and introduction: The phrase 'surpasses other baselines by over 30/10/4%' should be accompanied by a brief parenthetical note on whether these are relative or absolute improvements and on which exact metric (e.g., rank or recall).
- [Experiments] Figure captions and tables: Several result tables lack explicit column headers indicating the retrieval metric (R@1, R@5, MRR, etc.) and the direction of improvement (higher or lower is better).
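Both minor comments turn on how a gain is measured. As a minimal sketch, assuming 1-indexed ranks of the promoted video for each target query, the following computes R@k and MRR and contrasts absolute with relative improvement. The rank lists and the 20%/24% figures are hypothetical and are not results from the paper.

```python
def recall_at_k(ranks, k):
    """Fraction of target queries for which the video ranks in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all target queries (higher is better)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def absolute_gain(new, old):
    """Difference in the metric, in the metric's own units."""
    return new - old

def relative_gain(new, old):
    """Difference as a fraction of the old value."""
    return (new - old) / old

# Hypothetical ranks of one video for four target queries.
ranks_before = [12, 3, 40, 7]
ranks_after = [1, 2, 5, 4]

r5_before = recall_at_k(ranks_before, 5)
r5_after = recall_at_k(ranks_after, 5)
mrr_after = mean_reciprocal_rank(ranks_after)

# A move from 20% to 24% is 4 percentage points absolute but 20% relative.
abs_gain = absolute_gain(24.0, 20.0)
rel_gain = relative_gain(24.0, 20.0)
```

For a promotion attack, higher R@k and MRR after the attack are better, while a lower raw rank is better. Stating both the metric and its direction removes the ambiguity the comments describe.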
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve methodological transparency and experimental rigor.
- Referee: Method section (description of Modal Refinement / MoRe): The claim that MoRe improves black-box transferability by capturing finer-grained visual-textual interactions is load-bearing for the 4% black-box gain, yet the manuscript provides no equations, algorithm pseudocode, or ablation isolating the refinement step from any implicit target-model information. Without these details it is impossible to confirm that the reported transfer occurs purely from modality alignment rather than unintended leakage or post-hoc tuning.
  Authors: We agree that additional details on Modal Refinement (MoRe) are necessary to substantiate the black-box transferability claims. In the revised manuscript, we will add the complete mathematical equations describing the refinement process for capturing finer-grained cross-modal interactions. We will also include algorithm pseudocode that outlines the MoRe procedure step by step. To isolate the refinement's contribution, we will incorporate an ablation study comparing the full MoRe approach against a baseline without the refinement step. These additions will clarify that the transferability gains arise from improved modality alignment rather than any unintended leakage or post-hoc adjustments.
  Revision: yes
- Referee: Experiments section, black-box results table: The headline 4% average improvement is presented without per-model breakdowns, standard deviations across runs, or statistical significance tests. Given that black-box transfer gains are typically small and sensitive to seed or query selection, the absence of these statistics weakens the cross-scenario claim.
  Authors: We acknowledge that the black-box results would benefit from more granular statistics to support the reported average improvement. We will revise the experiments section to expand the black-box results table with per-model breakdowns, standard deviations computed across multiple runs, and statistical significance tests (such as t-tests) to assess the reliability of the gains. These updates will address potential sensitivities to random seeds or query selections and provide stronger evidence for the cross-scenario performance claims.
  Revision: yes
- Referee: Multi-target attack formulation: The paper emphasizes realistic multi-query promotion, but the loss function or optimization procedure used to jointly promote a video against multiple text queries is not specified with sufficient precision (e.g., how the per-query gradients are aggregated or whether any weighting hyper-parameters are introduced). This detail is necessary to assess whether the reported gains generalize beyond the chosen query sets.
  Authors: We appreciate the referee's emphasis on precision in the multi-target formulation. In the revised Method section, we will provide an explicit description of the joint loss function for promoting a video against multiple text queries simultaneously. This will include the precise aggregation method for per-query gradients (e.g., averaging) and any weighting hyperparameters. These clarifications will allow readers to better evaluate the generalizability of the results to different query sets.
  Revision: yes
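The second response commits to standard deviations across runs and to significance tests such as t-tests. A minimal sketch of what that could look like for one model follows, assuming the same seeds are used for the baseline and for the attack so that runs can be paired. The function name `paired_t_statistic` and the R@1 values are hypothetical and are not taken from the paper.

```python
import math
import statistics

def paired_t_statistic(baseline, attack):
    """Return (mean difference, sample std of differences, t statistic)."""
    if len(baseline) != len(attack) or len(baseline) < 2:
        raise ValueError("need two equally sized samples with at least 2 runs")
    diffs = [a - b for a, b in zip(attack, baseline)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if std_diff == 0:
        raise ValueError("differences are constant; t statistic is undefined")
    t_stat = mean_diff / (std_diff / math.sqrt(len(diffs)))
    return mean_diff, std_diff, t_stat

# Hypothetical R@1 values (%) over five seeds; NOT results from the paper.
baseline_runs = [10.0, 11.0, 9.0, 10.0, 10.0]
attack_runs = [14.0, 16.0, 12.0, 15.0, 13.0]

mean_diff, std_diff, t_stat = paired_t_statistic(baseline_runs, attack_runs)
```

The t statistic is compared against a Student t distribution with n - 1 degrees of freedom. With five paired runs, the two-sided 5% critical value is about 2.776, so a small average gain with a large spread across seeds would not pass.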
Circularity Check
No significant circularity in ViPro or MoRe claims
The paper introduces an empirical adversarial attack ViPro and a Modal Refinement module MoRe to improve black-box transferability against text-to-video retrieval models. Performance gains are reported via direct experiments on three external T2VR models and three datasets under white/grey/black-box settings. No equations, fitted parameters, or self-citations are shown that reduce any central result to the inputs by construction. The approach relies on standard adversarial optimization and modality alignment techniques validated externally rather than self-referential definitions or imported uniqueness theorems.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We further propose Modal Refinement (MoRe) to capture the finer-grained, intricate interaction between visual and textual modalities to enhance black-box transferability... $L_{more} = L_{exp}(W_{C_i}^{T} \cdot |W_{C_i,q} \odot S|)$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Temporal Clipping... $W_X = \mathrm{CosSim}(x_i, x_i)$... Semantical Weighting... $W_{C,q} = \mathrm{mean}\,\mathrm{CosSim}(x_i, t_j)$
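The quoted passage names two similarity-based weights but elides how they are used. The sketch below only illustrates the two quantities under stated assumptions: that x denotes per-frame embeddings, that t denotes target-query embeddings, that the frame weight is a pairwise frame-to-frame cosine similarity, and that the semantic weight averages over queries. It does not reproduce MoRe or its loss, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def frame_similarity_matrix(frames):
    """Pairwise cosine similarity between frame embeddings (assumed reading)."""
    return [[cosine(a, b) for b in frames] for a in frames]

def frame_query_weights(frames, queries):
    """Mean cosine similarity of each frame to all target queries."""
    return [sum(cosine(f, q) for q in queries) / len(queries) for f in frames]

# Hypothetical 2-D embeddings for two frames and two target queries.
frames = [[1.0, 0.0], [0.0, 1.0]]
queries = [[1.0, 0.0], [1.0, 1.0]]

sim = frame_similarity_matrix(frames)
weights = frame_query_weights(frames, queries)
```

In this toy case the first frame is closer on average to the target queries than the second, so it receives the larger weight. How the paper combines such weights is not recoverable from the fragment.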
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.