Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures

Xiaojun Chen; Xiaoshuang Ji; Xin Zhao; Yuexin Xuan; Zeyao Liu; Zhendong Zhao

arxiv: 2605.19478 · v1 · pith:HCLTIEE5new · submitted 2026-05-19 · 💻 cs.CR · cs.CV

Zeyao Liu, Zhendong Zhao, Xiaojun Chen, Xin Zhao, Yuexin Xuan, Xiaoshuang Ji

Pith reviewed 2026-05-20 04:41 UTC · model grok-4.3

keywords: backdoor attacks · visual prompt tuning · ViT · functional fusion · PEFT · pruning · dynamic architectures

The pith

Dynamic prompts fuse backdoors with task performance to resist pruning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the evolution of visual prompt tuning toward dynamic, context-aware designs creates a new backdoor vulnerability. Malicious attack logic and useful task behavior end up fused inside the same small group of high-magnitude parameters in the prompt generator. On the paper's account, the backdoor cannot be pruned away without also ruining the model's normal performance. Readers should care because the same design choices that make fine-tuning more efficient can also make backdoors more durable.

Core claim

The authors show that their VIPER attack uses a lightweight dynamic Visual Prompt Generator (VPG) to implant backdoors. The dynamic architecture produces Functional Fusion, where malicious logic and benign task utility are tightly fused into the same sparse, high-magnitude parameter core. This fusion creates a hostage dilemma: pruning the attack destroys benign performance. The reported tests show VIPER reaching state-of-the-art clean accuracy, keeping a near-100 percent attack success rate (ASR) after 90 percent of the VPG module is pruned, a setting where LoRA-based attacks collapse, and adding only 0.06 milliseconds (1.16 percent) of inference latency.

What carries the argument

Functional Fusion: the tight merging of malicious backdoor logic and benign task utility inside the sparse high-magnitude weights of the dynamic Visual Prompt Generator, which blocks simple removal by pruning.
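To make the pruning argument concrete, here is a minimal sketch of magnitude pruning. It assumes the defense zeroes the smallest-magnitude weights, which this summary does not confirm. The function `magnitude_prune` and the toy weights are illustrative and not taken from the paper.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of `weights`."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(round(len(weights) * sparsity))
    # Indices sorted by ascending absolute value; the first n_prune are zeroed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Hypothetical toy weights: 18 small entries plus a 2-entry high-magnitude core.
toy_weights = [0.01 * (i + 1) for i in range(18)] + [2.5, -3.1]
pruned = magnitude_prune(toy_weights, 0.9)
survivors = [w for w in pruned if w != 0.0]
print(survivors)
```

At 90 percent sparsity only the high-magnitude core survives. If both the trigger response and the clean-task behavior depend on that core, magnitude pruning at this level leaves the backdoor in place, and removing the core would take the clean-task behavior with it.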

If this is right

VIPER delivers top clean performance on vision tasks without full model retraining.
Pruning defenses that work on adapter attacks like LoRA fail here because of the fused parameters.
The added cost during use is tiny, at 0.06 milliseconds per inference.
This risk appears specifically in dynamic prompt setups rather than static ones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If this fusion happens in other dynamic fine-tuning methods, similar hidden backdoors could become common.
Defenders might need methods that look for linked parameter groups instead of just removing large weights.
The finding suggests a general trade-off where more adaptive prompts become harder to secure.

Load-bearing premise

The dynamic and context-aware design of the visual prompt generator forces malicious attack logic and normal task ability to share the same small set of high-magnitude parameters.

What would settle it

The inseparability claim would be disproved by an experiment that prunes the prompt generator's key parameters and removes the backdoor effect while keeping the original accuracy on clean images.
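The settling experiment comes down to two metrics measured before and after pruning. The sketch below shows one way to compute them and to state the refutation condition. The helper names, the thresholds `acc_tol` and `asr_max`, and the toy predictions are all assumptions made for illustration and do not come from the paper.

```python
def clean_accuracy(preds, labels):
    """Fraction of clean inputs classified correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def attack_success_rate(triggered_preds, labels, target):
    """Fraction of triggered inputs from non-target classes predicted as `target`."""
    hits = [p == target for p, y in zip(triggered_preds, labels) if y != target]
    if not hits:
        raise ValueError("need at least one input whose true label is not the target")
    return sum(hits) / len(hits)


def separable(acc_before, acc_after, asr_after, acc_tol=0.02, asr_max=0.10):
    """True if pruning removed the backdoor while clean accuracy held up.

    acc_tol and asr_max are hypothetical thresholds chosen for illustration.
    """
    return (acc_before - acc_after) <= acc_tol and asr_after <= asr_max


# Toy predictions (placeholders, not results from the paper).
labels = [0, 1, 2, 1, 0]
acc = clean_accuracy([0, 1, 2, 2, 0], labels)
asr = attack_success_rate([0, 0, 0, 1, 0], labels, target=0)
```

A pruned model for which `separable(...)` returns true would count as evidence against Functional Fusion. The hostage dilemma predicts that any pruning strong enough to lower ASR also fails the accuracy condition.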

Figures

Figures reproduced from arXiv: 2605.19478 by Xiaojun Chen, Xiaoshuang Ji, Xin Zhao, Yuexin Xuan, Zeyao Liu, Zhendong Zhao.

**Figure 2.** Weight distribution of the trained VPG, showing intrin…

**Figure 3.** Computational comparison of PEFT attack modules. (a)…

**Figure 4.** Neural Cleanse analysis on VIPER. (Top) L1 norms of…

**Figure 5.** VIPER vs. LoRA under improved pruning. VIPER…

**Figure 6.** t-SNE visualization of features extracted by VIPER's…

**Figure 7.** Results of VIPER with different maximum noise…

**Figure 8.** Visualization of clean images (top row of each dataset) and their corresponding backdoor images (bottom row). The trigger…

Original abstract

Existing ViT backdoor attacks based on backbone-overwriting full-tuning are computationally expensive and inflict performance degradation. This has forced adversaries towards the Visual Parameter-Efficient Fine-Tuning (PEFT) paradigm, dominated by adapter-based (e.g., LoRA) and prompt-based (e.g., VPT) approaches. While adapter security has seen initial study, the risks of the burgeoning prompt-based ecosystem remain critically unexplored. We fill this critical gap, exposing how the evolution of VPT towards dynamic and context-aware architectures can facilitate a far more dangerous and emergent threat. This vulnerability arises even though these dynamic modules unlock superior benign performance. We propose VIPER, an attack framework built on a lightweight, dynamic Visual Prompt Generator (VPG) that demonstrates this vulnerability. Critically, this dynamic architecture enables Functional Fusion: an emergent phenomenon where malicious logic and benign task utility are tightly fused into the same sparse, high-magnitude parameter core. This fusion creates a formidable "hostage" dilemma, as pruning the attack necessarily destroys the benign performance. Comprehensive evaluations show VIPER effectively addresses the attacker's trilemma: VIPER not only achieves state-of-the-art performance on clean data, but also maintains near-100% ASR even under 90% VPG-module pruning (where LoRA attacks collapse), while adding only an imperceptible 0.06ms (1.16%) of inference latency. VIPER's results, driven by Functional Fusion, expose a new, paradigm-level risk in dynamic prompt architectures.
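The two latency figures in the abstract can be cross-checked with simple arithmetic. The sketch below assumes the 1.16% is measured relative to the baseline (unmodified) inference latency, which the abstract does not state explicitly, so the derived baseline is an inference and not a number reported by the authors.

```python
added_ms = 0.06             # absolute overhead quoted in the abstract
relative_overhead = 0.0116  # 1.16%, assumed to be relative to the baseline latency

# Implied baseline latency per inference under that assumption.
baseline_ms = added_ms / relative_overhead
total_ms = baseline_ms + added_ms

print(f"implied baseline: {baseline_ms:.2f} ms, with VIPER: {total_ms:.2f} ms")
```

Under that assumption the implied baseline is roughly 5 ms per inference. Both inputs are rounded in the abstract, so the result is approximate.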

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VIPER shows a pruning-resistant backdoor in dynamic VPT that beats LoRA on ASR retention, but the Functional Fusion claim rests on indirect pruning patterns rather than direct parameter evidence.

The paper presents VIPER, a backdoor attack on dynamic prompt architectures for vision transformers that resists pruning far better than LoRA-based attacks, which the authors attribute to what they term Functional Fusion. This is new because prior work focused on full fine-tuning or adapter methods like LoRA, leaving dynamic prompt-based approaches underexplored. The VIPER framework uses a lightweight Visual Prompt Generator that adapts to context, and the results indicate strong clean performance alongside high attack success rates that persist even after 90% of the VPG parameters are removed. The added inference time is minimal at 0.06 milliseconds.

The strength here is the practical demonstration that dynamic prompts can create a hostage situation for defenders: pruning the suspected attack module tanks the model's utility on clean data. This matches the shift toward efficient adaptation methods and points to a real security consideration for deployments using VPT variants.

Where it is softer is the evidence tying the resilience directly to fusion in the same sparse high-magnitude core. The pruning outcomes, and the drop in clean accuracy when the attack is pruned away, are consistent with distributed malicious logic, but they do not isolate whether individual weights encode both the trigger response and the primary task computation. Additional steps like gradient overlap analysis or module ablations would help distinguish this from simpler entanglement or trigger-specific design choices. The reported metrics look good on the surface, but full details on datasets, baselines, and statistical tests would make the claims more convincing.

This paper targets people studying security in parameter-efficient fine-tuning and prompt tuning for computer vision. A colleague interested in backdoor threats to modern ViT adaptations would get value from the attack framework and the identified gap. It deserves a serious referee because it raises a plausible new risk in an active research direction. I recommend sending it for peer review, with feedback focused on strengthening the mechanistic support for Functional Fusion.
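One way to picture the gradient overlap analysis suggested in the letter is the sketch below: it compares which parameters receive the largest gradients from the clean-task loss and from the trigger loss. The Jaccard-on-top-k metric, the function names, and the toy gradient vectors are assumptions chosen for illustration; the paper may use a different attribution method.

```python
def top_k_indices(values, k):
    """Indices of the k entries with the largest absolute value."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return set(order[:k])


def jaccard_overlap(grad_clean, grad_trigger, k):
    """Jaccard similarity between the top-k parameter sets of two gradients."""
    if k <= 0:
        raise ValueError("k must be positive")
    a = top_k_indices(grad_clean, k)
    b = top_k_indices(grad_trigger, k)
    return len(a & b) / len(a | b)


# Toy per-parameter gradients (placeholders, not measurements from the paper).
grad_clean = [0.9, 0.1, -0.8, 0.05, 0.02, 0.7]
grad_trigger = [-1.1, 0.03, 0.6, 0.9, 0.01, 0.02]
overlap = jaccard_overlap(grad_clean, grad_trigger, k=3)
```

An overlap near 1 would support the claim that the same weights serve both objectives. An overlap near 0 alongside pruning resilience would point toward distributed entanglement instead of per-weight fusion.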

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VIPER, a backdoor attack on dynamic Visual Prompt Tuning (VPT) for Vision Transformers. It posits that the context-aware Visual Prompt Generator (VPG) produces an emergent 'Functional Fusion' in which malicious trigger logic and benign task utility become inseparably encoded in the same sparse, high-magnitude parameter core. This fusion is claimed to create a 'hostage' dilemma for defenders: pruning the attack necessarily harms clean performance. Evaluations reportedly demonstrate state-of-the-art clean accuracy, near-100% attack success rate (ASR) retained after 90% VPG pruning (unlike LoRA baselines), and negligible added latency of 0.06 ms (1.16%).

Significance. If the central claims hold, the work identifies a previously unexplored risk in the shift toward dynamic prompt-based PEFT methods, showing how architectural improvements for benign performance can simultaneously harden backdoors against common defenses such as pruning. The empirical contrast with LoRA attacks and the low overhead provide a concrete illustration of the attacker's trilemma in this setting.

major comments (2)

[Abstract and Evaluation sections] The Functional Fusion claim is load-bearing for the paper's contribution yet rests primarily on the pruning results (90% VPG-module pruning preserves ~100% ASR while clean performance drops when the attack is removed). This pattern is consistent with distributed entanglement but does not demonstrate that the same individual weights simultaneously encode both the backdoor trigger and the clean-task computation. No parameter-level dissection, gradient attribution overlap, or static-vs-dynamic ablation is described that would distinguish fusion from other forms of parameter sharing or from the specific trigger design.
[Evaluation] The manuscript reports comprehensive evaluations with specific metrics (clean performance, 90% pruning ASR, latency), but without visible details on dataset splits, baseline implementations, number of random seeds, or statistical significance tests, the support for the SOTA and pruning-resilience claims remains difficult to assess rigorously.

minor comments (2)

[Methodology] Clarify the precise definition of 'sparse high-magnitude parameter core' with reference to a specific equation or algorithm in the VPG implementation.
[Figures and Tables] Ensure pruning curves and latency tables include error bars or confidence intervals for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback on our manuscript. The comments help clarify the evidentiary basis for Functional Fusion and strengthen the experimental reporting. We address each major comment below and describe the revisions planned for the next version.

Referee: [Abstract and Evaluation sections] The Functional Fusion claim is load-bearing for the paper's contribution yet rests primarily on the pruning results (90% VPG-module pruning preserves ~100% ASR while clean performance drops when the attack is removed). This pattern is consistent with distributed entanglement but does not demonstrate that the same individual weights simultaneously encode both the backdoor trigger and the clean-task computation. No parameter-level dissection, gradient attribution overlap, or static-vs-dynamic ablation is described that would distinguish fusion from other forms of parameter sharing or from the specific trigger design.

Authors: We appreciate the referee's observation that pruning results alone demonstrate necessity of the high-magnitude core for both tasks but do not isolate per-weight encoding. The dynamic VPG architecture forces the generator to produce context-dependent prompts, which empirically leads to the observed inseparability; pruning the same sparse core simultaneously degrades clean accuracy and eliminates the trigger. Nevertheless, we agree that additional targeted analyses would provide stronger differentiation from generic parameter sharing. In the revised manuscript we will add (i) a static-vs-dynamic ablation comparing fixed prompt baselines to the dynamic VPG and (ii) gradient attribution overlap maps between clean-task and trigger gradients within the VPG weights. These new experiments will be reported in an expanded Evaluation section. revision: yes
Referee: [Evaluation] The manuscript reports comprehensive evaluations with specific metrics (clean performance, 90% pruning ASR, latency), but without visible details on dataset splits, baseline implementations, number of random seeds, or statistical significance tests, the support for the SOTA and pruning-resilience claims remains difficult to assess rigorously.

Authors: We acknowledge that the current presentation of experimental details is insufficient for full reproducibility assessment. While the appendix contains the full protocol, we will move and expand this information into the main Evaluation section. Specifically, we will report: dataset splits (e.g., 80/10/10 train/validation/test on CIFAR-10/100 and ImageNet subsets), exact baseline re-implementations following the original LoRA and VPT papers, results averaged over five independent random seeds with standard deviations, and paired t-test p-values confirming statistical significance of the reported gains and pruning resilience. These clarifications will be incorporated in the revision. revision: yes
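The seed-averaged reporting promised above can be sketched as follows: per-method mean and standard deviation over five seeds, plus the paired t statistic on per-seed differences. The accuracy values are placeholders invented for illustration and are not results from the paper. Converting the statistic to a p-value needs a t-distribution CDF (for example from a statistics library), which is left out to keep the sketch dependency-free.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder clean accuracies (%) over five seeds; NOT results from the paper.
method_a = [91.0, 92.0, 90.0, 93.0, 91.0]
method_b = [89.0, 90.0, 89.0, 90.0, 90.0]

mean_a, std_a = statistics.mean(method_a), statistics.stdev(method_a)
mean_b, std_b = statistics.mean(method_b), statistics.stdev(method_b)
t_stat, dof = paired_t_statistic(method_a, method_b)
```

Pairing by seed matters here because both methods share the same data order and initialization per seed, so the test is run on per-seed differences instead of on two independent samples.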

Circularity Check

0 steps flagged

No circularity: empirical attack results stand independent of inputs

The paper presents VIPER as an empirical backdoor attack framework evaluated through clean accuracy, attack success rate under pruning, and latency measurements. Functional Fusion is introduced as an observed emergent property of dynamic VPG architectures, justified directly by the reported pruning outcomes (near-100% ASR preserved at 90% pruning while LoRA collapses) rather than any equation, fitted parameter, or self-citation that reduces the claim to its own inputs by construction. No derivation chain, uniqueness theorem, or ansatz smuggling appears; the results are self-contained experimental findings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on empirical demonstration of the attack rather than explicit axioms or free parameters; the main novel element is the postulated Functional Fusion entity.

invented entities (1)

Functional Fusion (no independent evidence)
Purpose: emergent phenomenon explaining why malicious logic and benign utility fuse into the same sparse high-magnitude parameters, creating a pruning hostage dilemma.
Basis: introduced to account for the observed resilience to 90% pruning while preserving clean performance; no independent falsifiable evidence outside the attack results is provided.

pith-pipeline@v0.9.0 · 5821 in / 1189 out tokens · 37097 ms · 2026-05-20T04:41:15.921137+00:00 · methodology

[1] [1]

Parameter efficient fine-tuning of self- supervised vits without catastrophic forgetting

Reza Akbarian Bafghi, Nidhin Harilal, Claire Monteleoni, and Maziar Raissi. Parameter efficient fine-tuning of self- supervised vits without catastrophic forgetting. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3679–3684, 2024. 2, 3, 5

work page 2024

[2] [2]

Badclip: Trigger-aware prompt learning for backdoor attacks on clip

Jiawang Bai, Kuofeng Gao, Shaobo Min, Shu-Tao Xia, Zhifeng Li, and Wei Liu. Badclip: Trigger-aware prompt learning for backdoor attacks on clip. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24239–24250, 2024. 1

work page 2024

[3] [3]

Food-101–mining discriminative components with random forests

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. InEuropean conference on computer vision, pages 446–461. Springer, 2014. 5

work page 2014

[4] [4]

Describing textures in the wild

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014. 5

work page 2014

[5] [5]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 5

work page 2009

[6] [6]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 1

work page internal anchor Pith review Pith/arXiv arXiv 2010

[7] [7]

Learning gener- ative visual models from few training examples: An incre- mental bayesian approach tested on 101 object categories

Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning gener- ative visual models from few training examples: An incre- mental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE, 2004. 5

work page 2004

[8] [8]

BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain

Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Bad- nets: Identifying vulnerabilities in the machine learning model supply chain.arXiv preprint arXiv:1708.06733, 2017. 1, 5

work page internal anchor Pith review Pith/arXiv arXiv 2017

[9] Cheng Han, Qifan Wang, Yiming Cui, Zhiwen Cao, Wenguan Wang, Siyuan Qi, and Dongfang Liu. E^2vpt: An effective and efficient approach for visual prompt tuning. arXiv preprint arXiv:2307.13770, 2023.

[10] Along He, Yanlin Wu, Zhihong Wang, Tao Li, and Huazhu Fu. Dvpt: Dynamic visual prompt tuning of large pre-trained models for medical image analysis. Neural Networks, 185:107168, 2025.

[11] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

[12] Hongyu Hu, Tiancheng Lin, Jie Wang, Zhenbang Sun, and Yi Xu. Context-aware prompt tuning for vision-language model with dual-alignment. arXiv preprint arXiv:2309.04158, 2023.

[13] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European conference on computer vision, pages 709–727. Springer, 2022.

[14] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.

[15] Yuezun Li, Yiming Li, Baoyuan Wu, Longkang Li, Ran He, and Siwei Lyu. Invisible backdoor attack with sample-specific triggers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16463–16472,

[16] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In 25th Annual Network And Distributed System Security Symposium (NDSS 2018). Internet Soc, 2018.

[17] Sachin Mehta and Mohammad Rastegari. Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178, 2021.

[18] Anh Nguyen and Anh Tran. Wanet–imperceptible warping-based backdoor attack. arXiv preprint arXiv:2102.10369,

[19] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012.

[20] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural information processing systems, 34:13937–13949,

[21] Li Ren, Chen Chen, Liqiang Wang, and Kien Hua. Da-vpt: Semantic-guided visual prompt tuning for vision transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4353–4363, 2025.

[22] Aniruddha Saha, Akshayvarun Subramanya, and Hamed Pirsiavash. Hidden trigger backdoor attacks. In Proceedings of the AAAI conference on artificial intelligence, pages 11957–11965, 2020.

[23] Chikai Shang, Mengke Li, Yiqun Zhang, Zhen Chen, Jinlin Wu, Fangqing Gu, Yang Lu, and Yiu-Ming Cheung. Pro-vpt: Distribution-adaptive visual prompt tuning via prompt relocation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1558–1568, 2025.

[24] Jiachen Shen, Wenxuan Wang, Chen Chen, Jianbo Jiao, Jing Liu, Yan Zhang, Shanshan Song, and Jiangyun Li. Med-tuning: A new parameter-efficient tuning framework for medical volumetric segmentation. arXiv preprint arXiv:2304.10880, 2023.

[25] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision, 2(11):1–7,

[26] Akshayvarun Subramanya, Soroush Abbasi Koohpayegani, Aniruddha Saha, Ajinkya Tejankar, and Hamed Pirsiavash. A closer look at robustness of vision transformers to backdoor attacks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3874–3883, 2024.

[27] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.

[28] Irem Ulku, O Ozgur Tanriover, and Erdem Akagündüz. Lora-nir: Low-rank adaptation of vision transformers for remote sensing with near-infrared imagery. IEEE Geoscience and Remote Sensing Letters, 2024.

[29] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE symposium on security and privacy (SP), pages 707–723. IEEE, 2019.

[30] Zhishen Wang, Rui Wang, and Lihua Jing. Attention-imperceptible backdoor attacks on vision transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8241–8249, 2025.

[31] Sheng Yang, Jiawang Bai, Kuofeng Gao, Yong Yang, Yiming Li, and Shu-Tao Xia. Not all prompts are secure: A switchable backdoor attack against pre-trained vision transfomers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24431–24441, 2024.

[32] Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, Fengwei Yu, and Wei Wu. Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 579–588, 2021.

[33] Zenghui Yuan, Pan Zhou, Kai Zou, and Yu Cheng. You are catching my attention: Are vision transformers bad learners under backdoor attacks? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24605–24615, 2023.

[34] Zheng Yuan, Jie Zhang, Shiguang Shan, and Xilin Chen. Fulllora: Efficiently boosting the robustness of pretrained vision transformers. IEEE Transactions on Image Processing,

[35] Yaohua Zha, Jinpeng Wang, Tao Dai, Bin Chen, Zhi Wang, and Shu-Tao Xia. Instance-aware dynamic prompt tuning for pre-trained point cloud models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14161–14170, 2023.

[36] Zhendong Zhao, Xiaojun Chen, Yuexin Xuan, Ye Dong, Dakui Wang, and Kaitai Liang. Defeat: Deep hidden feature backdoor attacks by imperceptible perturbation and latent representation constraints. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15213–15222, 2022.

[37] Mengxin Zheng, Qian Lou, and Lei Jiang. Trojvit: Trojan insertion in vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4025–4034, 2023.

[38] Zexuan Zhong, Dan Friedman, and Danqi Chen. Factual probing is [mask]: Learning vs. learning to recall. arXiv preprint arXiv:2104.05240, 2021.

[39] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348,

Supplementary Material

A. Results under Various Settings

Impact of VPG Injection Layers. To analyze the impact of VPG injection layers, we conducted an ablation study varying the depth and density of prompt injection (Table 7). Results demonstrate that attack ef...