Layer-Specific Prompt Fusion Discovery via Differentiable Search in Vision Foundation Models
Pith reviewed 2026-06-26 01:25 UTC · model grok-4.3
The pith
Differentiable search over fusion operations discovers layer-specific prompt combinations that improve visual prompt tuning for Vision Transformers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Formulating prompt fusion as a bi-level optimization problem and solving it with differentiable architecture search over a search space of four operations reveals that layer-specific hybrid fusion schemes outperform any single fixed scheme. With the same frozen ViT backbone, the searched schemes deliver higher accuracy across the VTAB-1k, FGVC, and HTA benchmarks.
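For orientation, a bi-level problem of this kind is usually written in the form popularized by DARTS. The symbols here are assumptions chosen for illustration, not notation taken from the paper: P denotes the learnable prompts, α the per-layer fusion (architecture) weights, and the two losses are computed on the training and validation splits respectively.

```latex
\min_{\alpha} \; \mathcal{L}_{\mathrm{val}}\bigl(P^{*}(\alpha),\, \alpha\bigr)
\quad \text{s.t.} \quad
P^{*}(\alpha) = \arg\min_{P} \; \mathcal{L}_{\mathrm{train}}(P,\, \alpha)
```

The inner problem fits the prompts for a given fusion configuration; the outer problem picks the configuration whose fitted prompts do best on validation data. The backbone weights appear in neither problem because they stay frozen.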
What carries the argument
Differentiable architecture search that jointly optimizes learnable prompts and their per-layer fusion operation chosen from concatenation, addition, affine transformation, or cross-attention.
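A minimal sketch of the continuous relaxation that differentiable architecture search typically relies on may make this concrete. It is not the authors' implementation: the names `OPS`, `softmax`, `mixed_fusion`, and `discretize` are hypothetical, and the sketch assumes every candidate operation has already been brought to a common output shape (plain concatenation would change the token count, and how the paper reconciles that is not stated here).

```python
import math

# Hypothetical labels for the four candidate fusion operations.
OPS = ("concat", "add", "affine", "cross_attn")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_fusion(alpha, candidates):
    """Relaxed fusion for one layer.

    alpha: one architecture logit per candidate operation.
    candidates: one output vector per operation, all of equal length
    (an assumption of this sketch).
    Returns the softmax-weighted sum of the candidate outputs.
    """
    weights = softmax(alpha)
    width = len(candidates[0])
    return [
        sum(w * out[i] for w, out in zip(weights, candidates))
        for i in range(width)
    ]


def discretize(alphas):
    """After search, keep the highest-scoring operation in each layer."""
    return [
        OPS[max(range(len(OPS)), key=row.__getitem__)]
        for row in alphas
    ]
```

During search the weighted sum keeps the choice of operation differentiable, so the logits can be updated by gradient descent alongside the prompts. Calling `discretize` on the learned logits then yields one operation per layer, which is the layer-specific scheme the summary refers to.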
If this is right
- Different transformer layers benefit from different prompt fusion operations rather than one uniform rule.
- Hybrid fusion schemes found by search produce higher accuracy than concatenation or addition alone on the tested benchmarks.
- The method maintains a favorable accuracy-latency-parameter trade-off relative to VPT-Deep while keeping the backbone frozen.
- Layer semantics of ViTs can be leveraged more effectively when fusion is allowed to vary by depth.
Where Pith is reading between the lines
- The same search procedure could be applied to other parameter-efficient modules such as adapters to discover layer-specific integration rules.
- The patterns discovered by the search may indicate which layers already encode more semantic or more low-level features.
- Once a fusion architecture is found on a representative validation set, it may transfer to new tasks without repeating the full search.
Load-bearing premise
The four fusion operations form a sufficient search space and the discovered layer-specific combinations generalize beyond the validation sets used during search.
What would settle it
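On a completely new downstream dataset, compare two settings: re-running the architecture search, and applying the fusion pattern discovered on the original validation sets without re-search. If the accuracy gains disappear in the transfer setting, the discovered patterns do not generalize beyond the data used for search.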
Original abstract
Visual prompt tuning has emerged as a parameter-efficient fine-tuning approach for adapting large-scale Vision Transformers (ViTs) to downstream tasks. As its learnable prompts are applied in input and feature spaces, prior to jointly going through attention in transformer layers, the most commonly used scheme for fusing image and prompt tokens is concatenation or addition. In this paper, we aim to study a fundamental yet essential problem in visual prompt tuning: whether a single fusion scheme tends to yield better results, and whether it would be beneficial to develop a hybrid fusion scheme. To this end, we formulate the task as a bi-level optimization problem, and solve it leveraging differentiable architecture search. In this context, the learnable prompts and their fusion schemes are jointly optimized. To enrich the search space in the architecture search, we propose two additional fusion schemes, namely, affine transformation and cross-attention, in addition to concatenation and addition. Extensive experiments on 34 datasets spanning VTAB-1k, FGVC, and HTA show consistent gains over prompt-tuning baselines. With a frozen ViT backbone, our method delivers a favorable accuracy-latency-parameter trade-off compared with VPT-Deep and recent variants. Our findings reveal that how prompts fuse with image tokens plays a significant role in visual prompt tuning, and a hybrid fusion fashion can more effectively leverage layer semantics of ViTs, contributing a novel perspective for visual prompt-tuning research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a differentiable architecture search approach to optimize layer-specific fusion operations between prompts and image tokens in visual prompt tuning for Vision Transformers. By expanding the search space to include concatenation, addition, affine transformation, and cross-attention, and solving a bi-level optimization problem, the method discovers hybrid fusion schemes. Experiments on 34 datasets from VTAB-1k, FGVC, and HTA demonstrate consistent improvements over baselines with favorable accuracy-latency trade-offs, suggesting that fusion schemes significantly impact performance and that hybrid approaches better utilize layer semantics in ViTs.
Significance. Should the empirical results and generalization hold, this work would be significant for parameter-efficient fine-tuning by providing a data-driven way to determine prompt fusion per layer rather than using fixed schemes. The expansion of the fusion search space and the focus on layer-specific discovery offer a new perspective on adapting foundation models, potentially leading to more effective and interpretable prompt tuning methods.
major comments (3)
- [Abstract] The abstract states 'consistent gains over prompt-tuning baselines' and 'favorable accuracy-latency-parameter trade-off' but supplies no specific quantitative results, baseline comparisons, or statistical measures. This omission makes it difficult to evaluate the practical significance of the claimed improvements without the full results section.
- [§3 (Method)] The bi-level optimization jointly optimizes prompts and fusion architecture weights on validation data during search. The manuscript does not explicitly confirm that the search validation sets are disjoint from all 34 evaluation datasets or provide ablations showing that the discovered per-layer fusions generalize to unseen tasks without re-optimizing the architecture. This is load-bearing for the claim that hybrid fusion 'more effectively leverage layer semantics' rather than resulting from overfitting the search split.
- [§4 (Experiments)] To substantiate the interpretation that the method leverages layer semantics, the paper should report the specific fusion operations selected per layer (e.g., in a table or figure) and analyze patterns (such as early layers favoring one operation and later layers another). Without this, the 'layer-specific' aspect remains a black-box outcome rather than evidence of semantic leveraging.
minor comments (2)
- [Abstract] The acronym 'HTA' is not expanded; assuming it refers to a specific benchmark, it should be defined on first use.
- The paper mentions 'VPT-Deep' as a baseline but does not define it in the abstract; a brief parenthetical would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and outline revisions to strengthen the manuscript.
-
Referee: [Abstract] The abstract states 'consistent gains over prompt-tuning baselines' and 'favorable accuracy-latency-parameter trade-off' but supplies no specific quantitative results, baseline comparisons, or statistical measures. This omission makes it difficult to evaluate the practical significance of the claimed improvements without the full results section.
Authors: We agree that the abstract would benefit from quantitative details. In the revision we will incorporate specific results, including average accuracy gains over VPT-Deep and other baselines across the 34 datasets as well as a brief mention of the accuracy-latency trade-off. revision: yes
-
Referee: [§3 (Method)] The bi-level optimization jointly optimizes prompts and fusion architecture weights on validation data during search. The manuscript does not explicitly confirm that the search validation sets are disjoint from all 34 evaluation datasets or provide ablations showing that the discovered per-layer fusions generalize to unseen tasks without re-optimizing the architecture. This is load-bearing for the claim that hybrid fusion 'more effectively leverage layer semantics' rather than resulting from overfitting the search split.
Authors: The validation splits used during architecture search follow the standard dataset partitions and are disjoint from the test sets of all 34 evaluation datasets. To directly address generalization concerns we will add an ablation that transfers the discovered per-layer fusion architecture to held-out tasks without re-optimizing the architecture weights. revision: partial
-
Referee: [§4 (Experiments)] To substantiate the interpretation that the method leverages layer semantics, the paper should report the specific fusion operations selected per layer (e.g., in a table or figure) and analyze patterns (such as early layers favoring one operation and later layers another). Without this, the 'layer-specific' aspect remains a black-box outcome rather than evidence of semantic leveraging.
Authors: We agree that explicit reporting of the discovered operations would improve interpretability. The revised manuscript will include a table (or figure) listing the selected fusion scheme per layer for representative datasets together with a short analysis of observed layer-wise patterns. revision: yes
Circularity Check
No significant circularity; empirical search results are independent of inputs
full rationale
The paper frames the contribution as a bi-level optimization solved via differentiable architecture search over an explicitly enumerated space of four fusion operations (concatenation, addition, affine, cross-attention). Reported gains on 34 external datasets follow from the optimization outcomes rather than any self-definitional equivalence, fitted parameter renamed as prediction, or load-bearing self-citation. The derivation chain remains self-contained against the held-out benchmarks; no equation or premise reduces to its own inputs by construction.