Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-16 19:05 UTC · model grok-4.3
The pith
A two-stage attack succeeds on prompt-driven video segmentation models where classic methods fail.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Classic backdoor training fails on prompt-driven VSFMs because clean and triggered samples induce aligned image-encoder gradients while attention remains focused on the prompt-specified object rather than the trigger. BadVSFM overcomes this limitation with a two-stage strategy: it first learns trigger-specific encoder features, then trains the decoder to map triggered frame-prompt representations to an attacker-specified target mask while preserving clean segmentation behavior. The result is strong, controllable backdoor effects across triggers and prompt types, with limited clean-performance degradation, on five VSFMs and two datasets.
What carries the argument
BadVSFM's two-stage strategy, which first learns trigger-specific encoder features and then trains the decoder to associate triggered frame-prompt representations with a target mask.
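The two-stage objective described above can be sketched as follows. This is a minimal NumPy illustration with toy shapes and simple L2 surrogates; the loss functions, the weights `lam_clean` and `lam_trigger`, and the tensor shapes are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def l2(a, b):
    """Mean squared distance between two embedding/mask arrays."""
    return float(np.mean((a - b) ** 2))

def stage1_encoder_loss(enc_clean, ref_clean, enc_trig, target_emb,
                        lam_clean=1.0, lam_trigger=1.0):
    """Stage 1 (sketch): pull triggered-frame embeddings toward a
    designated target embedding while keeping clean-frame embeddings
    aligned with a frozen clean reference encoder."""
    return (lam_clean * l2(enc_clean, ref_clean)
            + lam_trigger * l2(enc_trig, target_emb))

def stage2_decoder_loss(mask_clean, gt_mask, mask_trig, target_mask):
    """Stage 2 (sketch): map fused triggered frame-prompt features to
    the attacker-specified target mask while preserving clean masks."""
    return l2(mask_clean, gt_mask) + l2(mask_trig, target_mask)

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))     # toy frame embeddings
mask = rng.random(size=(4, 8, 8))  # toy segmentation masks
loss1 = stage1_encoder_loss(emb, emb, emb + 1.0, emb)  # clean term is zero
loss2 = stage2_decoder_loss(mask, mask, mask, mask)    # clean behavior preserved: zero
```

In a real attack these two losses would be minimized sequentially through the VSFM's encoder and decoder weights; the sketch only shows the shape of the objectives.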
If this is right
- VSFMs become vulnerable to hidden triggers that force attacker-chosen segmentation masks on specific video frames.
- Attack success persists across different foundation models, prompt types, and datasets with only minor effects on normal use.
- Standard defenses that work on image classification or segmentation models fail to stop this two-stage attack.
- Gradient and attention analyses explain why prompt-driven architectures resist direct application of classic backdoors.
- The attack remains effective and controllable by varying the trigger pattern or target mask.
Where Pith is reading between the lines
- Similar two-stage feature isolation may enable backdoors in other prompt-driven foundation models used for images or 3D data.
- Providers of open or fine-tunable VSFMs may need new checks for trigger-specific encoder behavior during training or deployment.
- The vulnerability could extend to real-time video applications where prompt changes are common.
- Future defenses might focus on monitoring encoder gradient alignment between clean and candidate triggered inputs.
Load-bearing premise
The two-stage training can reliably isolate trigger-specific encoder features without detectable side effects on clean inputs or requiring unrealistic access during attack deployment.
What would settle it
An experiment showing that BadVSFM produces attack success rates below 20 percent or substantial clean-performance degradation on additional VSFMs would disprove the central effectiveness claim.
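As one concrete reading of this falsification criterion, ASR over triggered frames can be scored against the attacker's target mask. A hedged sketch with binary masks follows; the 0.5 IoU cutoff for counting a frame as a successful attack is an illustrative assumption, not a value from the paper.

```python
import numpy as np

def iou(pred, target):
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union else 1.0

def attack_success_rate(pred_masks, target_masks, thresh=0.5):
    """Fraction of triggered frames whose predicted mask matches the
    attacker-specified target mask above an IoU threshold
    (threshold value is illustrative)."""
    hits = [iou(p, t) >= thresh for p, t in zip(pred_masks, target_masks)]
    return sum(hits) / len(hits)

target = np.zeros((8, 8), dtype=bool)
target[:4] = True                      # attacker-chosen target region
preds = [target.copy(), ~target, target.copy(), target.copy()]
asr = attack_success_rate(preds, [target] * 4)
print(asr)  # 0.75
```

Under this reading, "ASR below 20 percent" would mean fewer than one in five triggered frames clears the threshold.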
Original abstract
Prompt-driven Video Segmentation Foundation Models (VSFMs), such as SAM2, are increasingly used in applications including autonomous driving and digital pathology, yet their security risks remain underexplored. We study backdoor attacks against VSFMs and show that directly applying classic attacks such as BadNet is largely ineffective, yielding attack success rates (ASR) below 5%. Through gradient-similarity and attention-map analyses, we find that traditional backdoor training fails because clean and triggered samples induce aligned image-encoder gradients, while model attention remains focused on the prompt-specified object rather than the trigger. To address this limitation, we propose BadVSFM, the first backdoor attack framework tailored to prompt-driven VSFMs. BadVSFM uses a two-stage strategy that first learns trigger-specific encoder features and then trains the decoder to map triggered frame prompt representations to an attacker-specified target mask while preserving clean segmentation behavior. Experiments on five VSFMs and two datasets show that BadVSFM achieves strong, controllable backdoor effects across triggers and prompt types with limited clean-performance degradation. Ablations and interpretability analyses validate the necessity of the two-stage design, and five representative defenses remain largely ineffective. Our results reveal a practical and underexplored vulnerability of current VSFMs to backdoor threats.
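The gradient-similarity analysis described in the abstract can be approximated by a cosine metric over flattened image-encoder gradients. This is a sketch under the assumption that the gradients have already been extracted as arrays; it is not the paper's exact procedure.

```python
import numpy as np

def grad_cosine(g_clean, g_trig):
    """Cosine similarity between flattened image-encoder gradients
    from a clean batch and a triggered batch; values near 1 indicate
    the aligned-gradient failure mode described above."""
    a, b = np.ravel(g_clean), np.ravel(g_trig)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

g = np.arange(1.0, 7.0).reshape(2, 3)   # stand-in gradient tensor
print(grad_cosine(g, 2.0 * g))           # close to 1.0: aligned
print(grad_cosine(g, -g))                # close to -1.0: opposed
```

A similarity persistently near 1.0 across training batches is the signature the authors associate with classic attacks failing to separate trigger features.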
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BadVSFM, the first backdoor attack tailored to prompt-driven video segmentation foundation models (VSFMs) such as SAM2. It shows that classic attacks like BadNet yield ASR below 5% because clean and triggered samples produce aligned encoder gradients and prompt-focused attention; the proposed two-stage method first embeds trigger-specific features in the encoder and then trains the decoder to output attacker-specified target masks on triggered inputs while preserving clean segmentation behavior. Experiments on five VSFMs and two datasets report strong, controllable attack success across triggers and prompt types with limited clean-performance degradation, supported by ablations, interpretability analyses, and evaluations showing five representative defenses are largely ineffective.
Significance. If the empirical results hold, the work is significant as the first demonstration of practical backdoor vulnerabilities in emerging VSFMs deployed in safety-critical domains such as autonomous driving and digital pathology. It provides reproducible evidence that standard backdoor techniques fail on these models, validates a two-stage isolation strategy through ablations, and shows existing defenses are ineffective, thereby highlighting an underexplored attack surface for prompt-based foundation models.
Major comments (3)
- [§3.2] Two-stage training: the claim that stage 1 isolates trigger-specific encoder features, without measurable side effects on clean prompts or temporal consistency, is load-bearing for the central effectiveness claim. Yet the manuscript provides no post-training quantitative verification, such as gradient-similarity scores or attention-shift metrics between clean and triggered samples, to confirm separation rather than partial overlap.
- [Experiments] Results tables: ASR and clean mIoU values are reported across five models and two datasets without error bars, multiple random seeds, or statistical significance tests. This weakens the assertion of consistently strong backdoor effects and limited degradation under video-specific conditions (frame sequences, varying prompts).
- [§2] Threat model: the attack requires white-box, two-stage fine-tuning access to embed features and retrain the decoder. The manuscript does not explicitly address whether this assumption is realistic for deployed foundation models, nor does it discuss black-box alternatives, which is central to the practicality of the reported vulnerability.
Minor comments (2)
- [Abstract] The statement that 'five representative defenses remain largely ineffective' would be clearer with a one-sentence summary of the specific defenses tested and their failure modes.
- [Figures] Attention-map captions: quantitative color scales or similarity metrics are missing, making it difficult to compare attention shifts between clean and triggered cases.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the presentation of our results and threat model.
Point-by-point responses
-
Referee: [§3.2] Two-stage training: the claim that stage 1 isolates trigger-specific encoder features, without measurable side effects on clean prompts or temporal consistency, is load-bearing for the central effectiveness claim, yet the manuscript provides no post-training quantitative verification to confirm separation rather than partial overlap.
Authors: We thank the referee for highlighting this point. Our existing gradient-similarity and attention-map analyses focus on explaining the failure of classic attacks. To directly verify the isolation achieved by stage 1 of BadVSFM, we will add post-training quantitative metrics (gradient-similarity scores and attention-shift metrics) comparing clean and triggered samples after encoder fine-tuning. These will be reported in the revised §3.2, with supporting figures in the supplementary material, confirming minimal side effects on clean prompts and temporal consistency.
revision: yes
-
Referee: [Experiments] Results tables: ASR and clean mIoU values are reported across five models and two datasets without error bars, multiple random seeds, or statistical significance tests, which weakens the assertion of consistently strong backdoor effects and limited degradation under video-specific conditions.
Authors: We agree that variability reporting strengthens the claims. In the revised manuscript we will rerun all experiments with at least three random seeds, report mean ASR and clean mIoU with standard deviations (error bars) in all tables, and add statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests) against baselines. These updates will appear in the Experiments section and tables to demonstrate consistency across video sequences and prompt variations.
revision: yes
-
Referee: [§2] Threat model: the attack requires white-box, two-stage fine-tuning access to embed features and retrain the decoder; the manuscript does not explicitly address whether this assumption is realistic for deployed foundation models or discuss black-box alternatives, which is central to the practicality of the reported vulnerability.
Authors: White-box, two-stage access is realistic in scenarios such as supply-chain poisoning or fine-tuning on user-collected video data for applications like autonomous driving. We will expand §2 to discuss the practicality of this assumption with concrete examples, and we will add a short discussion of black-box alternatives, noting their limitations while highlighting them as a direction for future research, providing a more complete threat-model analysis.
revision: partial
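The multi-seed variability reporting discussed in this exchange can be sketched with the standard library alone. The ASR numbers below are made up for illustration (a BadNet-like baseline under 5%, consistent with the abstract), and only the paired t-statistic is computed, since a p-value would additionally require the t-distribution (e.g., from scipy.stats).

```python
import math
import statistics as st

def mean_std(values):
    """Mean and sample standard deviation of a metric across seeds."""
    return st.mean(values), st.stdev(values)

def paired_t(xs, ys):
    """Paired t-statistic over per-seed metric differences between
    two methods evaluated with matched seeds."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    return st.mean(diffs) / (st.stdev(diffs) / math.sqrt(n))

# Hypothetical ASR (%) over three seeds: proposed attack vs. baseline.
badvsfm = [96.1, 95.4, 96.7]
badnet = [4.2, 3.9, 4.8]
m, s = mean_std(badvsfm)
t = paired_t(badvsfm, badnet)  # very large: seeds agree, gap is huge
```

Reporting `m ± s` per table cell, plus the test statistic against each baseline, would address the reproducibility concern directly.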
Circularity Check
No circularity: empirical attack method validated by direct experiments
Full rationale
The paper proposes BadVSFM, a two-stage backdoor attack on prompt-driven VSFMs, and supports its claims entirely through empirical training, testing on five models and two datasets, ablation studies, and interpretability analyses. No mathematical derivations, equations, first-principles results, or predictions are presented that could reduce to fitted parameters or self-referential definitions. The abstract and method description rely on observed gradient/attention behaviors and measured ASR/clean-performance metrics rather than any self-definitional or fitted-input structure. Any self-citations (if present) are incidental and not load-bearing for the core empirical results. This is a standard empirical security paper with no derivation chain to inspect for circularity.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: gradient-based optimization can train the encoder and decoder components separately while preserving clean-task performance.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
BadVSFM adopts a two-stage training strategy: stage 1 steers the image encoder so that triggered frames are mapped to a designated target embedding while clean frames stay aligned with a clean reference model; stage 2 trains the decoder so that fused triggered frame-prompt embeddings produce the target mask.
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
we analyze image encoder gradients by calculating the cosine similarity... gradients for clean and triggered samples remain largely aligned
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.