
arxiv: 2508.06964 · v3 · submitted 2025-08-09 · 💻 cs.CV

Adversarial Video Promotion Against Text-to-Video Retrieval

Pith reviewed 2026-05-19 00:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords adversarial attack · text-to-video retrieval · video promotion · cross-modal interaction · black-box transferability · multimodal models

The pith

A new adversarial attack promotes videos toward selected text queries in text-to-video retrieval systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the first method to adversarially promote the ranking of specific videos in text-to-video retrieval. It does this by pulling videos closer to chosen queries rather than pushing them away. If successful, this could allow attackers to increase views or spread information more effectively than previous suppression attacks. The approach includes a modal refinement technique to improve performance when the attacker lacks full access to the target model. Experiments show gains over baselines in multiple settings and datasets.

Core claim

We pioneer the first attack against T2VR to promote videos adversarially, dubbed the Video Promotion attack (ViPro). We further propose Modal Refinement (MoRe) to capture the finer-grained, intricate interaction between visual and textual modalities to enhance black-box transferability. ViPro surpasses other baselines by over 30/10/4% for white/grey/black-box settings on average.

What carries the argument

The Video Promotion attack (ViPro) combined with Modal Refinement (MoRe), which refines cross-modal interactions to pull videos toward target queries in a multi-target setting.
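
As a rough illustration of what "pulling a video toward target queries" can look like, here is a minimal sketch of an L-infinity PGD loop that raises the mean cosine similarity between one video and several query embeddings. It is not the paper's ViPro or MoRe: the linear map `W` stands in for a video encoder, and `promote`, `mean_similarity`, `eps`, `alpha`, and `steps` are hypothetical names and values chosen only for this example.

```python
import numpy as np

def normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def mean_similarity(video, queries, W):
    """Mean cosine similarity between the embedded video and each query."""
    return float((normalize(queries) @ normalize(W @ video)).mean())

def promote(video, queries, W, eps=8 / 255, alpha=1 / 255, steps=40):
    """Toy promotion attack: gradient ASCENT on the mean cosine similarity.

    Assumptions (not from the paper): W is a stand-in linear video encoder,
    the video is a flat vector with values in [0, 1], and per-query
    similarities are aggregated by a plain average.
    """
    q_mean = normalize(queries).mean(axis=0)   # average of unit-norm queries
    adv = video.copy()
    for _ in range(steps):
        z = W @ adv                            # video embedding
        n = np.linalg.norm(z)
        u = z / n
        # Gradient of q_mean . (z / ||z||) with respect to z.
        grad_z = (q_mean - (q_mean @ u) * u) / n
        grad_x = W.T @ grad_z                  # chain rule through the encoder
        adv = adv + alpha * np.sign(grad_x)    # signed ascent step
        adv = np.clip(adv, video - eps, video + eps)  # stay in the eps-ball
        adv = np.clip(adv, 0.0, 1.0)           # stay a valid pixel range
    return adv
```

A suppression attack would take the same step with the opposite sign; the promotion setting differs mainly in that the objective is tied to a chosen set of target queries.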

Load-bearing premise

The assumption that Modal Refinement captures finer-grained cross-modal interactions sufficiently to boost black-box transferability without model-specific knowledge or post-hoc adjustments that limit generality.

What would settle it

A demonstration that ViPro fails to improve video rankings for multiple target queries on a new text-to-video model not included in the original experiments.

Figures

Figures reproduced from arXiv: 2508.06964 by Chao Shen, Chenhao Lin, Qian Li, Qiwei Tian, Shuai Liu, Zhengyu Zhao.

Figure 1: A simplified illustration for video suppression (left) and …
Figure 2: An illustration of our ViPro with MoRe. (1) …
Figure 3: An illustration of the effectiveness of MoRe in guiding …
Figure 4: The impact of varying numbers of training queries
Figure 5: Ablation studies on PGD hyperparameters, showing re…
Figure 6: Visualization of original frames and manipulated frames
Figure 7: Visualization of white-box top 1 text-to-video similarity
Figure 9: An illustration of data stealing: i. Attackers can obtain a diverse subset of training videos and corresponding texts by querying the victim model. ii. Attackers can perform a candidate-wise sortation for each candidate video by querying the candidate video with the texts from step i, checking the rank of the candidate video, and categorizing the texts into Relevant and Irrelevant.
Figure 10: An illustration of the clipped video frames through temporal clipping. All temporally related frames are grouped into clips
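
The candidate-wise sorting step described in the Figure 9 caption can be sketched roughly as follows. This is an illustration under assumptions, not the authors' procedure: `rank_of` is a hypothetical callable standing in for a query to the victim model, and the `top_k` cut-off that separates Relevant from Irrelevant texts is an assumed parameter that the caption does not specify.

```python
from typing import Callable, Iterable, List, Tuple

def sort_texts_for_candidate(
    rank_of: Callable[[str, str], int],
    candidate_id: str,
    texts: Iterable[str],
    top_k: int = 10,
) -> Tuple[List[str], List[str]]:
    """Split texts into (relevant, irrelevant) for one candidate video.

    rank_of(text, video_id) is assumed to return the 1-based rank the
    victim model assigns to the video for that text query.
    top_k is an assumed threshold: ranks at or above it count as Relevant.
    """
    relevant, irrelevant = [], []
    for text in texts:
        rank = rank_of(text, candidate_id)
        if rank <= top_k:
            relevant.append(text)
        else:
            irrelevant.append(text)
    return relevant, irrelevant
```

In practice `rank_of` would wrap whatever query interface the victim system exposes; here it is left abstract so the sketch makes no claim about any real API.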
Original abstract

Thanks to the development of cross-modal models, text-to-video retrieval (T2VR) is advancing rapidly, but its robustness remains largely unexamined. Existing attacks against T2VR are designed to push videos away from queries, i.e., suppressing the ranks of videos, while the attacks that pull videos towards selected queries, i.e., promoting the ranks of videos, remain largely unexplored. These attacks can be more impactful as attackers may gain more views/clicks for financial benefits and widespread (mis)information. To this end, we pioneer the first attack against T2VR to promote videos adversarially, dubbed the Video Promotion attack (ViPro). We further propose Modal Refinement (MoRe) to capture the finer-grained, intricate interaction between visual and textual modalities to enhance black-box transferability. Comprehensive experiments cover 2 existing baselines, 3 leading T2VR models, 3 prevailing datasets with over 10k videos, evaluated under 3 scenarios. All experiments are conducted in a multi-target setting to reflect realistic scenarios where attackers seek to promote the video regarding multiple queries simultaneously. We also evaluated our attacks for defences and imperceptibility. Overall, ViPro surpasses other baselines by over $30/10/4\%$ for white/grey/black-box settings on average. Our work highlights an overlooked vulnerability, provides a qualitative analysis on the upper/lower bound of our attacks, and offers insights into potential counterplays. Code will be publicly available at https://github.com/michaeltian108/ViPro.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ViPro, the first adversarial attack on text-to-video retrieval (T2VR) systems aimed at promoting (rather than suppressing) the ranking of target videos with respect to chosen text queries. It further proposes Modal Refinement (MoRe) to capture finer-grained cross-modal interactions and thereby improve black-box transferability. Experiments span two baselines, three T2VR models, three datasets (>10k videos), three attack scenarios, and a multi-target setting; the authors report average gains of over 30/10/4% in white/grey/black-box settings, plus evaluations against defenses and on imperceptibility, with qualitative analysis of attack bounds.

Significance. If the performance claims hold, the work is significant for identifying an underexplored and practically relevant vulnerability in T2VR: promotion attacks that could enable view manipulation or misinformation. The multi-target, multi-model, multi-dataset protocol and the inclusion of defense and imperceptibility tests strengthen the evaluation. Public release of code would further support reproducibility and follow-up research on cross-modal robustness.

major comments (3)
  1. [Method (Modal Refinement)] Method section (description of Modal Refinement / MoRe): The claim that MoRe improves black-box transferability by capturing finer-grained visual-textual interactions is load-bearing for the 4% black-box gain, yet the manuscript provides no equations, algorithm pseudocode, or ablation isolating the refinement step from any implicit target-model information. Without these details it is impossible to confirm that the reported transfer occurs purely from modality alignment rather than unintended leakage or post-hoc tuning.
  2. [Experiments (black-box table)] Experiments section, black-box results table: The headline 4% average improvement is presented without per-model breakdowns, standard deviations across runs, or statistical significance tests. Given that black-box transfer gains are typically small and sensitive to seed or query selection, the absence of these statistics weakens the cross-scenario claim.
  3. [Method (attack objective)] Multi-target attack formulation: The paper emphasizes realistic multi-query promotion, but the loss function or optimization procedure used to jointly promote a video against multiple text queries is not specified with sufficient precision (e.g., how the per-query gradients are aggregated or whether any weighting hyper-parameters are introduced). This detail is necessary to assess whether the reported gains generalize beyond the chosen query sets.
minor comments (2)
  1. [Abstract] Abstract and introduction: The phrase 'surpasses other baselines by over 30/10/4%' should be accompanied by a brief parenthetical note on whether these are relative or absolute improvements and on which exact metric (e.g., rank or recall).
  2. [Experiments] Figure captions and tables: Several result tables lack explicit column headers indicating the retrieval metric (R@1, R@5, MRR, etc.) and the direction of improvement (higher or lower is better).
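
For readers unfamiliar with the metrics named in the last minor comment, the sketch below shows one common way to compute R@K and MRR from a query-by-video similarity matrix. It is a generic illustration, not the paper's evaluation code; `ranks_of_targets`, `recall_at_k`, and `mrr` are hypothetical helper names. Both metrics are higher-is-better, while the underlying rank is lower-is-better.

```python
import numpy as np

def ranks_of_targets(sim, target_idx):
    """1-based rank of each query's target video.

    sim: array of shape (num_queries, num_videos) with similarity scores.
    target_idx: for each query, the column index of the video of interest.
    """
    sim = np.asarray(sim, dtype=float)
    target_scores = sim[np.arange(sim.shape[0]), target_idx]
    # Rank = 1 + number of videos scoring strictly higher than the target.
    return 1 + (sim > target_scores[:, None]).sum(axis=1)

def recall_at_k(ranks, k):
    """Fraction of queries whose target video appears in the top k."""
    return float((np.asarray(ranks) <= k).mean())

def mrr(ranks):
    """Mean reciprocal rank of the target videos."""
    return float((1.0 / np.asarray(ranks)).mean())
```

For a promotion attack, success would show up as these numbers rising for the promoted video on the target queries, which is why stating the metric and its direction in each table matters.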

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve methodological transparency and experimental rigor.

Point-by-point responses
  1. Referee: Method section (description of Modal Refinement / MoRe): The claim that MoRe improves black-box transferability by capturing finer-grained visual-textual interactions is load-bearing for the 4% black-box gain, yet the manuscript provides no equations, algorithm pseudocode, or ablation isolating the refinement step from any implicit target-model information. Without these details it is impossible to confirm that the reported transfer occurs purely from modality alignment rather than unintended leakage or post-hoc tuning.

    Authors: We agree that additional details on Modal Refinement (MoRe) are necessary to substantiate the black-box transferability claims. In the revised manuscript, we will add the complete mathematical equations describing the refinement process for capturing finer-grained cross-modal interactions. We will also include algorithm pseudocode that outlines the MoRe procedure step by step. To isolate the refinement's contribution, we will incorporate an ablation study comparing the full MoRe approach against a baseline without the refinement step. These additions will clarify that the transferability gains arise from improved modality alignment rather than any unintended leakage or post-hoc adjustments. revision: yes

  2. Referee: Experiments section, black-box results table: The headline 4% average improvement is presented without per-model breakdowns, standard deviations across runs, or statistical significance tests. Given that black-box transfer gains are typically small and sensitive to seed or query selection, the absence of these statistics weakens the cross-scenario claim.

    Authors: We acknowledge that the black-box results would benefit from more granular statistics to support the reported average improvement. We will revise the experiments section to expand the black-box results table with per-model breakdowns, standard deviations computed across multiple runs, and statistical significance tests (such as t-tests) to assess the reliability of the gains. These updates will address potential sensitivities to random seeds or query selections and provide stronger evidence for the cross-scenario performance claims. revision: yes

  3. Referee: Multi-target attack formulation: The paper emphasizes realistic multi-query promotion, but the loss function or optimization procedure used to jointly promote a video against multiple text queries is not specified with sufficient precision (e.g., how the per-query gradients are aggregated or whether any weighting hyper-parameters are introduced). This detail is necessary to assess whether the reported gains generalize beyond the chosen query sets.

    Authors: We appreciate the referee's emphasis on precision in the multi-target formulation. In the revised Method section, we will provide an explicit description of the joint loss function for promoting a video against multiple text queries simultaneously. This will include the precise aggregation method for per-query gradients (e.g., averaging) and any weighting hyperparameters. These clarifications will allow readers to better evaluate the generalizability of the results to different query sets. revision: yes
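
To make the third exchange concrete, one possible form of a jointly optimized multi-target objective is written out below. This is a sketch under assumptions and not the paper's formulation: the symbols $f_v$, $f_t$, $s$, $w_k$, $\alpha$, and $\epsilon$ are introduced here for illustration, and plain averaging ($w_k = 1$) is used only because the rebuttal names it as an example.

```latex
% Hypothetical notation (not taken from the paper):
%   x          clean video,  \delta  perturbation
%   f_v, f_t   video and text encoders
%   s(.,.)     similarity score (e.g. cosine similarity)
%   q_1..q_K   target queries,  w_k  optional per-query weights
\[
\mathcal{L}(\delta) \;=\; -\frac{1}{K}\sum_{k=1}^{K} w_k\,
  s\bigl(f_v(x+\delta),\, f_t(q_k)\bigr),
\qquad \|\delta\|_\infty \le \epsilon
\]
% One signed projected-gradient step on this loss:
\[
\delta^{(t+1)} \;=\; \Pi_{\|\delta\|_\infty \le \epsilon}
  \Bigl(\delta^{(t)} - \alpha\,\mathrm{sign}\bigl(\nabla_{\delta}\mathcal{L}(\delta^{(t)})\bigr)\Bigr)
\]
```

Because the gradient of an average is the average of the per-query gradients, this choice weights every target query equally; any non-uniform $w_k$ would be an extra hyperparameter of the kind the referee asks the authors to report.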

Circularity Check

0 steps flagged

No significant circularity in ViPro or MoRe claims


The paper introduces an empirical adversarial attack ViPro and a Modal Refinement module MoRe to improve black-box transferability against text-to-video retrieval models. Performance gains are reported via direct experiments on three external T2VR models and three datasets under white/grey/black-box settings. No equations, fitted parameters, or self-citations are shown that reduce any central result to the inputs by construction. The approach relies on standard adversarial optimization and modality alignment techniques validated externally rather than self-referential definitions or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach appears to rely on standard adversarial optimization with an added refinement module whose internal assumptions are not detailed here.

pith-pipeline@v0.9.0 · 5816 in / 1112 out tokens · 59933 ms · 2026-05-19T00:00:23.582422+00:00 · methodology


