Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-16 19:05 UTC · model grok-4.3
The pith
A two-stage attack succeeds on prompt-driven video segmentation models where classic methods fail.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Classic backdoor training fails on prompt-driven VSFMs because clean and triggered samples induce aligned image-encoder gradients while attention remains focused on the prompt-specified object rather than the trigger. BadVSFM overcomes this limitation with a two-stage strategy: it first learns trigger-specific encoder features, then trains the decoder to map triggered frame-prompt representations to an attacker-specified target mask while preserving clean segmentation behavior. The result is strong, controllable backdoor effects across triggers and prompt types, with limited clean-performance degradation, on five VSFMs and two datasets.
What carries the argument
BadVSFM's two-stage strategy, which first learns trigger-specific encoder features and then trains the decoder to associate triggered frame-prompt representations with a target mask.
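The two-stage objective described above can be sketched as follows. This is a minimal NumPy illustration with toy shapes and simple L2 surrogates; the loss functions, the weights `lam_clean` and `lam_trigger`, and the tensor shapes are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def l2(a, b):
    """Mean squared distance between two embedding/mask arrays."""
    return float(np.mean((a - b) ** 2))

def stage1_encoder_loss(enc_clean, ref_clean, enc_trig, target_emb,
                        lam_clean=1.0, lam_trigger=1.0):
    """Stage 1 (sketch): pull triggered-frame embeddings toward a
    designated target embedding while keeping clean-frame embeddings
    aligned with a frozen clean reference encoder."""
    return (lam_clean * l2(enc_clean, ref_clean)
            + lam_trigger * l2(enc_trig, target_emb))

def stage2_decoder_loss(mask_clean, gt_mask, mask_trig, target_mask):
    """Stage 2 (sketch): map fused triggered frame-prompt features to
    the attacker-specified target mask while preserving clean masks."""
    return l2(mask_clean, gt_mask) + l2(mask_trig, target_mask)

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))     # toy frame embeddings
mask = rng.random(size=(4, 8, 8))  # toy segmentation masks
loss1 = stage1_encoder_loss(emb, emb, emb + 1.0, emb)  # clean term is zero
loss2 = stage2_decoder_loss(mask, mask, mask, mask)    # clean behavior preserved: zero
```

In a real attack these two losses would be minimized sequentially through the VSFM's encoder and decoder weights; the sketch only shows the shape of the objectives.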
If this is right
- VSFMs become vulnerable to hidden triggers that force attacker-chosen segmentation masks on specific video frames.
- Attack success persists across different foundation models, prompt types, and datasets with only minor effects on normal use.
- Standard defenses that work on image classification or segmentation models fail to stop this two-stage attack.
- Gradient and attention analyses explain why prompt-driven architectures resist direct application of classic backdoors.
- The attack remains effective and controllable by varying the trigger pattern or target mask.
Where Pith is reading between the lines
- Similar two-stage feature isolation may enable backdoors in other prompt-driven foundation models used for images or 3D data.
- Providers of open or fine-tunable VSFMs may need new checks for trigger-specific encoder behavior during training or deployment.
- The vulnerability could extend to real-time video applications where prompt changes are common.
- Future defenses might focus on monitoring encoder gradient alignment between clean and candidate triggered inputs.
Load-bearing premise
The two-stage training can reliably isolate trigger-specific encoder features without detectable side effects on clean inputs or requiring unrealistic access during attack deployment.
What would settle it
An experiment showing that BadVSFM produces attack success rates below 20 percent or substantial clean-performance degradation on additional VSFMs would disprove the central effectiveness claim.
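As one concrete reading of this falsification criterion, ASR over triggered frames can be scored against the attacker's target mask. A hedged sketch with binary masks follows; the 0.5 IoU cutoff for counting a frame as a successful attack is an illustrative assumption, not a value from the paper.

```python
import numpy as np

def iou(pred, target):
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union else 1.0

def attack_success_rate(pred_masks, target_masks, thresh=0.5):
    """Fraction of triggered frames whose predicted mask matches the
    attacker-specified target mask above an IoU threshold
    (threshold value is illustrative)."""
    hits = [iou(p, t) >= thresh for p, t in zip(pred_masks, target_masks)]
    return sum(hits) / len(hits)

target = np.zeros((8, 8), dtype=bool)
target[:4] = True                      # attacker-chosen target region
preds = [target.copy(), ~target, target.copy(), target.copy()]
asr = attack_success_rate(preds, [target] * 4)
print(asr)  # 0.75
```

Under this reading, "ASR below 20 percent" would mean fewer than one in five triggered frames clears the threshold.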
Original abstract
Prompt-driven Video Segmentation Foundation Models (VSFMs), such as SAM2, are increasingly used in applications including autonomous driving and digital pathology, yet their security risks remain underexplored. We study backdoor attacks against VSFMs and show that directly applying classic attacks such as BadNet is largely ineffective, yielding attack success rates (ASR) below 5%. Through gradient-similarity and attention-map analyses, we find that traditional backdoor training fails because clean and triggered samples induce aligned image-encoder gradients, while model attention remains focused on the prompt-specified object rather than the trigger. To address this limitation, we propose BadVSFM, the first backdoor attack framework tailored to prompt-driven VSFMs. BadVSFM uses a two-stage strategy that first learns trigger-specific encoder features and then trains the decoder to map triggered frame prompt representations to an attacker-specified target mask while preserving clean segmentation behavior. Experiments on five VSFMs and two datasets show that BadVSFM achieves strong, controllable backdoor effects across triggers and prompt types with limited clean-performance degradation. Ablations and interpretability analyses validate the necessity of the two-stage design, and five representative defenses remain largely ineffective. Our results reveal a practical and underexplored vulnerability of current VSFMs to backdoor threats.
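The gradient-similarity analysis described in the abstract can be approximated by a cosine metric over flattened image-encoder gradients. This is a sketch under the assumption that the gradients have already been extracted as arrays; it is not the paper's exact procedure.

```python
import numpy as np

def grad_cosine(g_clean, g_trig):
    """Cosine similarity between flattened image-encoder gradients
    from a clean batch and a triggered batch; values near 1 indicate
    the aligned-gradient failure mode described above."""
    a, b = np.ravel(g_clean), np.ravel(g_trig)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

g = np.arange(1.0, 7.0).reshape(2, 3)   # stand-in gradient tensor
print(grad_cosine(g, 2.0 * g))           # close to 1.0: aligned
print(grad_cosine(g, -g))                # close to -1.0: opposed
```

A similarity persistently near 1.0 across training batches is the signature the authors associate with classic attacks failing to separate trigger features.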
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BadVSFM, the first backdoor attack tailored to prompt-driven video segmentation foundation models (VSFMs) such as SAM2. It shows that classic attacks like BadNet yield ASR below 5% because clean and triggered samples produce aligned encoder gradients and prompt-focused attention; the proposed two-stage method first embeds trigger-specific features in the encoder and then trains the decoder to output attacker-specified target masks on triggered inputs while preserving clean segmentation behavior. Experiments on five VSFMs and two datasets report strong, controllable attack success across triggers and prompt types with limited clean-performance degradation, supported by ablations, interpretability analyses, and evaluations showing five representative defenses are largely ineffective.
Significance. If the empirical results hold, the work is significant as the first demonstration of practical backdoor vulnerabilities in emerging VSFMs deployed in safety-critical domains such as autonomous driving and digital pathology. It provides reproducible evidence that standard backdoor techniques fail on these models, validates a two-stage isolation strategy through ablations, and shows existing defenses are ineffective, thereby highlighting an underexplored attack surface for prompt-based foundation models.
Major comments (3)
- [§3.2] Two-stage training: the claim that stage 1 isolates trigger-specific encoder features, without measurable side effects on clean prompts or temporal consistency, is load-bearing for the central effectiveness claim. Yet the manuscript provides no post-training quantitative verification, such as gradient-similarity scores or attention-shift metrics between clean and triggered samples, to confirm separation rather than partial overlap.
- [Experiments] Results tables: ASR and clean mIoU values are reported across five models and two datasets without error bars, multiple random seeds, or statistical significance tests. This weakens the assertion of consistently strong backdoor effects and limited degradation under video-specific conditions (frame sequences, varying prompts).
- [§2] Threat model: the attack requires white-box, two-stage fine-tuning access to embed features and retrain the decoder. The manuscript does not explicitly address whether this assumption is realistic for deployed foundation models, nor does it discuss black-box alternatives, which is central to the practicality of the reported vulnerability.
Minor comments (2)
- [Abstract] The statement that 'five representative defenses remain largely ineffective' would be clearer with a one-sentence summary of the specific defenses tested and their failure modes.
- [Figures] Attention-map captions: quantitative color scales or similarity metrics are missing, making it difficult to compare attention shifts between clean and triggered cases.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the presentation of our results and threat model.
Point-by-point responses
-
Referee: [§3.2] Two-stage training: the claim that stage 1 isolates trigger-specific encoder features, without measurable side effects on clean prompts or temporal consistency, is load-bearing for the central effectiveness claim, yet the manuscript provides no post-training quantitative verification to confirm separation rather than partial overlap.
Authors: We thank the referee for highlighting this point. Our existing gradient-similarity and attention-map analyses focus on explaining the failure of classic attacks. To directly verify the isolation achieved by stage 1 of BadVSFM, we will add post-training quantitative metrics (gradient-similarity scores and attention-shift metrics) comparing clean and triggered samples after encoder fine-tuning. These will be reported in the revised §3.2, with supporting figures in the supplementary material, confirming minimal side effects on clean prompts and temporal consistency.
revision: yes
-
Referee: [Experiments] Results tables: ASR and clean mIoU values are reported across five models and two datasets without error bars, multiple random seeds, or statistical significance tests, which weakens the assertion of consistently strong backdoor effects and limited degradation under video-specific conditions.
Authors: We agree that variability reporting strengthens the claims. In the revised manuscript we will rerun all experiments with at least three random seeds, report mean ASR and clean mIoU with standard deviations (error bars) in all tables, and add statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests) against baselines. These updates will appear in the Experiments section and tables to demonstrate consistency across video sequences and prompt variations.
revision: yes
-
Referee: [§2] Threat model: the attack requires white-box, two-stage fine-tuning access to embed features and retrain the decoder; the manuscript does not explicitly address whether this assumption is realistic for deployed foundation models or discuss black-box alternatives, which is central to the practicality of the reported vulnerability.
Authors: White-box, two-stage access is realistic in scenarios such as supply-chain poisoning or fine-tuning on user-collected video data for applications like autonomous driving. We will expand §2 to discuss the practicality of this assumption with concrete examples, and we will add a short discussion of black-box alternatives, noting their limitations while highlighting them as a direction for future research, providing a more complete threat-model analysis.
revision: partial
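The multi-seed variability reporting discussed in this exchange can be sketched with the standard library alone. The ASR numbers below are made up for illustration (a BadNet-like baseline under 5%, consistent with the abstract), and only the paired t-statistic is computed, since a p-value would additionally require the t-distribution (e.g., from scipy.stats).

```python
import math
import statistics as st

def mean_std(values):
    """Mean and sample standard deviation of a metric across seeds."""
    return st.mean(values), st.stdev(values)

def paired_t(xs, ys):
    """Paired t-statistic over per-seed metric differences between
    two methods evaluated with matched seeds."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    return st.mean(diffs) / (st.stdev(diffs) / math.sqrt(n))

# Hypothetical ASR (%) over three seeds: proposed attack vs. baseline.
badvsfm = [96.1, 95.4, 96.7]
badnet = [4.2, 3.9, 4.8]
m, s = mean_std(badvsfm)
t = paired_t(badvsfm, badnet)  # very large: seeds agree, gap is huge
```

Reporting `m ± s` per table cell, plus the test statistic against each baseline, would address the reproducibility concern directly.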
Circularity Check
No circularity: empirical attack method validated by direct experiments
Full rationale
The paper proposes BadVSFM, a two-stage backdoor attack on prompt-driven VSFMs, and supports its claims entirely through empirical training, testing on five models and two datasets, ablation studies, and interpretability analyses. No mathematical derivations, equations, first-principles results, or predictions are presented that could reduce to fitted parameters or self-referential definitions. The abstract and method description rely on observed gradient/attention behaviors and measured ASR/clean-performance metrics rather than any self-definitional or fitted-input structure. Any self-citations (if present) are incidental and not load-bearing for the core empirical results. This is a standard empirical security paper with no derivation chain to inspect for circularity.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: gradient-based optimization can train the encoder and decoder components separately while preserving clean-task performance.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
BadVSFM adopts a two-stage training strategy: stage 1 steers the image encoder so that triggered frames are mapped to a designated target embedding while clean frames stay aligned with a clean reference model; stage 2 trains the decoder so that fused triggered frame-prompt embeddings produce the target mask.
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
we analyze image encoder gradients by calculating the cosine similarity... gradients for clean and triggered samples remain largely aligned
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.