Layer-Specific Prompt Fusion Discovery via Differentiable Search in Vision Foundation Models

Cheng Han; Jihun Hamm; Min Xu; Runmin Jiang; Tianming Liu; Tianyang Wang; Xiao Wang; Xingjian Li; Xi Xiao; Yunbei Zhang

arxiv: 2606.26379 · v1 · pith:IFGCFZY7new · submitted 2026-06-24 · 💻 cs.CV

Pith reviewed 2026-06-26 01:25 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual prompt tuning · differentiable architecture search · vision transformers · prompt fusion · layer-specific adaptation · parameter-efficient fine-tuning

The pith

Differentiable search over fusion operations discovers layer-specific prompt combinations that improve visual prompt tuning for Vision Transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether prompts should always fuse with image tokens in the same way inside each layer of a Vision Transformer, or whether different layers benefit from different combinations. It turns the choice of fusion method into a searchable architecture decision solved jointly with the prompts themselves through differentiable architecture search. The search space includes the usual concatenation and addition plus two new options: affine transformation and cross-attention. On 34 datasets the resulting layer-specific hybrids produce consistent accuracy gains while the backbone stays frozen, showing that fusion choice matters and that a single fixed scheme is often suboptimal. The work therefore supplies both a practical method and the observation that hybrid, layer-aware fusion better exploits the semantics already present at different depths.
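As a rough illustration of the four candidate operations named above, the sketch below implements one plausible version of each on plain Python lists. The function names, the mean-prompt broadcast in the additive variant, the FiLM-style scale and shift in the affine variant, and the projection-free residual attention are all assumptions made for illustration; the abstract does not say how the paper defines them.

```python
import math

# x: image tokens, p: prompt tokens; both are lists of d-dimensional vectors.

def concat_fusion(x, p):
    """Prepend prompt tokens to the image tokens, so the token count grows."""
    return p + x

def add_fusion(x, p):
    """Add the mean prompt vector to every image token (assumed broadcast rule)."""
    mean_p = [sum(col) / len(p) for col in zip(*p)]
    return [[a + b for a, b in zip(tok, mean_p)] for tok in x]

def affine_fusion(x, scale, shift):
    """Per-channel scale and shift of each image token (assumed FiLM-like form)."""
    return [[s * a + b for a, s, b in zip(tok, scale, shift)] for tok in x]

def cross_attention_fusion(x, p):
    """Image tokens attend to prompts; no learned projections, residual output."""
    d = len(p[0])
    out = []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in p]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        ctx = [sum((wi / z) * v[j] for wi, v in zip(w, p)) for j in range(d)]
        out.append([a + c for a, c in zip(q, ctx)])
    return out

# Hypothetical toy inputs: two image tokens and one prompt token, d = 2.
x = [[1.0, 2.0], [3.0, 4.0]]
p = [[0.5, -0.5]]

print(len(concat_fusion(x, p)))            # token count after concatenation
print(add_fusion(x, p))
print(affine_fusion(x, [2.0, 2.0], [1.0, 1.0]))
print(cross_attention_fusion(x, p))
```

Note that only concatenation changes the sequence length. A real search would have to reconcile that shape difference across candidates, and the abstract does not describe how the paper does so.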

Core claim

Formulating prompt fusion as a bi-level optimization problem and solving it with differentiable architecture search over an enriched space of four operations reveals that layer-specific hybrid fusion schemes outperform any single fixed scheme, delivering higher accuracy with the same frozen ViT backbone across VTAB-1k, FGVC and HTA benchmarks.
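For readers who want the objective spelled out, the bi-level problem can be written in the standard DARTS form. The notation is assumed rather than taken from the paper: P denotes the learnable prompts, alpha the per-layer fusion weights, and the two losses are computed on the training and validation splits respectively.

```latex
\min_{\alpha} \; \mathcal{L}_{\mathrm{val}}\bigl(P^{*}(\alpha),\, \alpha\bigr)
\quad \text{s.t.} \quad
P^{*}(\alpha) = \arg\min_{P} \; \mathcal{L}_{\mathrm{train}}(P,\, \alpha)
```

The frozen backbone weights do not appear as optimization variables in either level.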

What carries the argument

Differentiable architecture search that jointly optimizes learnable prompts and their per-layer fusion operation chosen from concatenation, addition, affine transformation, or cross-attention.
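A minimal sketch of the continuous relaxation that makes such a per-layer choice differentiable, in the usual DARTS style: each layer holds one logit per candidate operation, and the layer output is a softmax-weighted sum of the candidate outputs. The names `OPS`, `softmax`, and `mixed_output` are hypothetical, and the candidate outputs are made-up vectors assumed to share one shape.

```python
import math

OPS = ["concat", "add", "affine", "cross_attn"]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_output(candidates, alpha):
    """Softmax-weighted sum of same-shaped candidate outputs, one per operation."""
    weights = softmax(alpha)
    dim = len(candidates[0])
    return [sum(w * c[i] for w, c in zip(weights, candidates)) for i in range(dim)]

# Hypothetical outputs of the four operations for a single token (d = 3).
candidates = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
alpha = [0.0, 0.0, 0.0, 0.0]  # equal logits give every operation weight 0.25
mixed = mixed_output(candidates, alpha)
print(dict(zip(OPS, softmax(alpha))))
print(mixed)
```

As the logits for one operation grow, the mixture approaches that operation's output, which is what allows a discrete choice to be read off after the search.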

If this is right

Different transformer layers benefit from different prompt fusion operations rather than one uniform rule.
Hybrid fusion schemes found by search produce higher accuracy than concatenation or addition alone on the tested benchmarks.
The method maintains a favorable accuracy-latency-parameter trade-off relative to VPT-Deep while keeping the backbone frozen.
Layer semantics of ViTs can be leveraged more effectively when fusion is allowed to vary by depth.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same search procedure could be applied to other parameter-efficient modules such as adapters to discover layer-specific integration rules.
The patterns discovered by the search may indicate which layers already encode more semantic or more low-level features.
Once a fusion architecture is found on a representative validation set, it may transfer to new tasks without repeating the full search.

Load-bearing premise

The four fusion operations form a sufficient search space and the discovered layer-specific combinations generalize beyond the validation sets used during search.

What would settle it

Take a completely new downstream dataset, apply the fusion pattern discovered on the original validation sets without re-running the search, and measure whether the accuracy gains disappear.

Original abstract

Visual prompt tuning has emerged as a parameter-efficient fine-tuning approach for adapting large-scale Vision Transformers (ViTs) to downstream tasks. As its learnable prompts are applied in input and feature spaces, prior to jointly going through attention in transformer layers, the most commonly used scheme for fusing image and prompt tokens is concatenation or addition. In this paper, we aim to study a fundamental yet essential problem in visual prompt tuning: whether a single fusion scheme tends to yield better results, and whether that would be beneficial to develop a hybrid fusion scheme. To this end, we formulate the task as a bi-level optimization problem, and solve it leveraging differentiable architecture search. In this context, the learnable prompts and their fusion schemes are jointly optimized. To enrich the search space in the architecture search, we propose two additional fusion schemes, namely, affine transformation and cross-attention, in addition to concatenation and addition. Extensive experiments on 34 datasets spanning VTAB-1k, FGVC, and HTA show consistent gains over prompt-tuning baselines. With a frozen ViT backbone, our method delivers a favorable accuracy--latency--parameter trade-off compared with VPT-Deep and recent variants. Our findings reveal that how prompts fuse with image tokens plays a significant role in visual prompt tuning, and a hybrid fusion fashion can more effectively leverage layer semantics of ViTs, contributing a novel perspective for visual prompt-tuning research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

They run DARTS-style search over four fusion ops per ViT layer for visual prompts, adding affine and cross-attention, but the abstract gives no numbers, so the gains are impossible to judge.

The main point is that this paper treats the choice of how to fuse prompts with image tokens as a searchable per-layer decision and uses bi-level optimization to pick among concatenation, addition, affine transform, and cross-attention.

What is new is the addition of the two extra operations and the decision to let the search pick different fusions at different depths instead of using one scheme everywhere. The experiments cover 34 datasets from VTAB-1k, FGVC, and HTA, and they report better accuracy-latency trade-offs than VPT-Deep while keeping the backbone frozen.

The work is a straightforward extension of existing prompt-tuning methods. The search space is small and the bi-level setup follows standard DARTS practice, so the implementation should be reproducible if the code is released.
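For concreteness, standard DARTS practice can be sketched as an alternating loop: one gradient step on the ordinary weights using the training split, then one step on the architecture logits using the validation split. The toy below uses the first-order approximation and finite-difference gradients on a scalar problem with two stand-in operations. Every name and number in it is hypothetical, and whether the paper uses the first- or second-order variant is not stated in the abstract.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(x, w, alpha):
    """Mix two toy candidate operations that share one learnable scalar w."""
    a = softmax(alpha)
    candidates = [x + w, x * w]  # stand-ins for two fusion operations
    return sum(ai * c for ai, c in zip(a, candidates))

def mse(data, w, alpha):
    return sum((predict(x, w, alpha) - y) ** 2 for x, y in data) / len(data)

def num_grad(f, params, eps=1e-5):
    """Central finite-difference gradient of f at params (a list of floats)."""
    grads = []
    for i in range(len(params)):
        up = list(params); up[i] += eps
        dn = list(params); dn[i] -= eps
        grads.append((f(up) - f(dn)) / (2 * eps))
    return grads

# Toy target y = 2x, with disjoint training and validation points.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
val = [(1.5, 3.0), (2.5, 5.0)]

w, alpha = [0.5], [0.0, 0.0]
initial_val = mse(val, w[0], alpha)
for step in range(200):
    # Lower level: update the weight on the training split, alpha held fixed.
    gw = num_grad(lambda p: mse(train, p[0], alpha), w)
    w = [w[0] - 0.05 * gw[0]]
    # Upper level: update alpha on the validation split, w held fixed.
    ga = num_grad(lambda a: mse(val, w[0], a), alpha)
    alpha = [a - 0.1 * g for a, g in zip(alpha, ga)]
final_val = mse(val, w[0], alpha)
print(initial_val, final_val, softmax(alpha))
```

The point of the split is that alpha is never updated on data the weights were fit to, which is the control a referee would want confirmed for the real experiments.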

The soft spots are in the evidence. The abstract asserts consistent gains and a hybrid-fusion advantage but supplies no quantitative results, error bars, or direct comparisons, which makes it impossible to tell whether the improvements are meaningful or just extra capacity from the search. The stress-test concern about the selected fusions overfitting the search validation splits is plausible here; without seeing the full ablations it is hard to know if the layer-specific choices reflect real semantic differences or just fit the particular splits used during search.

This is for readers already working on parameter-efficient tuning of vision transformers. It is incremental rather than foundational, but the idea is clean enough that a serious referee could check the numbers and controls in a reasonable amount of time.

I would send it to peer review if the full results section has proper tables and the gains survive basic controls for the search procedure.

Referee Report

3 major / 2 minor

Summary. The paper introduces a differentiable architecture search approach to optimize layer-specific fusion operations between prompts and image tokens in visual prompt tuning for Vision Transformers. By expanding the search space to include concatenation, addition, affine transformation, and cross-attention, and solving a bi-level optimization problem, the method discovers hybrid fusion schemes. Experiments on 34 datasets from VTAB-1k, FGVC, and HTA demonstrate consistent improvements over baselines with favorable accuracy-latency trade-offs, suggesting that fusion schemes significantly impact performance and that hybrid approaches better utilize layer semantics in ViTs.

Significance. Should the empirical results and generalization hold, this work would be significant for parameter-efficient fine-tuning by providing a data-driven way to determine prompt fusion per layer rather than using fixed schemes. The expansion of the fusion search space and the focus on layer-specific discovery offer a new perspective on adapting foundation models, potentially leading to more effective and interpretable prompt tuning methods.

major comments (3)

[Abstract] The abstract states 'consistent gains over prompt-tuning baselines' and 'favorable accuracy--latency--parameter trade-off' but supplies no specific quantitative results, baseline comparisons, or statistical measures. This omission makes it difficult to evaluate the practical significance of the claimed improvements without the full results section.
[§3 (Method)] The bi-level optimization jointly optimizes prompts and fusion architecture weights on validation data during search. The manuscript does not explicitly confirm that the search validation sets are disjoint from all 34 evaluation datasets or provide ablations showing that the discovered per-layer fusions generalize to unseen tasks without re-optimizing the architecture. This is load-bearing for the claim that hybrid fusion 'more effectively leverage layer semantics' rather than resulting from overfitting the search split.
[§4 (Experiments)] To substantiate the interpretation that the method leverages layer semantics, the paper should report the specific fusion operations selected per layer (e.g., in a table or figure) and analyze patterns (such as early layers favoring one operation and later layers another). Without this, the 'layer-specific' aspect remains a black-box outcome rather than evidence of semantic leveraging.
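The per-layer report requested in the last major comment could be produced along these lines: take the argmax of each layer's architecture logits and tally the selections by depth. The logits below are invented purely to exercise the code and are not results from the paper; `select_ops` and `depth_summary` are hypothetical helper names.

```python
from collections import Counter

OPS = ("concat", "add", "affine", "cross_attn")

def select_ops(alphas):
    """Pick the highest-logit operation for each layer (argmax discretization)."""
    return [OPS[max(range(len(OPS)), key=lambda k: row[k])] for row in alphas]

def depth_summary(selected):
    """Count selected operations in the first and second half of the network."""
    half = len(selected) // 2
    return {"early": Counter(selected[:half]), "late": Counter(selected[half:])}

# Made-up logits for a 4-layer toy model: one row per layer, one column per op.
alphas = [
    [0.1, 1.2, 0.3, 0.0],
    [0.9, 0.2, 0.1, 0.4],
    [0.0, 0.1, 1.5, 0.7],
    [0.2, 0.0, 0.3, 2.1],
]

selected = select_ops(alphas)
summary = depth_summary(selected)
for layer, op in enumerate(selected):
    print(f"layer {layer}: {op}")
print(summary)
```

A table of this kind, computed per dataset, would let a reader see whether the same depth pattern recurs across tasks or varies with the search split.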

minor comments (2)

[Abstract] The acronym 'HTA' is not expanded; assuming it refers to a specific benchmark, it should be defined on first use.
[Abstract] The paper mentions 'VPT-Deep' as a baseline but does not define it in the abstract; a brief parenthetical would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline revisions to strengthen the manuscript.

Referee: [Abstract] The abstract states 'consistent gains over prompt-tuning baselines' and 'favorable accuracy--latency--parameter trade-off' but supplies no specific quantitative results, baseline comparisons, or statistical measures. This omission makes it difficult to evaluate the practical significance of the claimed improvements without the full results section.

Authors: We agree that the abstract would benefit from quantitative details. In the revision we will incorporate specific results, including average accuracy gains over VPT-Deep and other baselines across the 34 datasets as well as a brief mention of the accuracy-latency trade-off. revision: yes
Referee: [§3 (Method)] The bi-level optimization jointly optimizes prompts and fusion architecture weights on validation data during search. The manuscript does not explicitly confirm that the search validation sets are disjoint from all 34 evaluation datasets or provide ablations showing that the discovered per-layer fusions generalize to unseen tasks without re-optimizing the architecture. This is load-bearing for the claim that hybrid fusion 'more effectively leverage layer semantics' rather than resulting from overfitting the search split.

Authors: The validation splits used during architecture search follow the standard dataset partitions and are disjoint from the test sets of all 34 evaluation datasets. To directly address generalization concerns we will add an ablation that transfers the discovered per-layer fusion architecture to held-out tasks without re-optimizing the architecture weights. revision: partial
Referee: [§4 (Experiments)] To substantiate the interpretation that the method leverages layer semantics, the paper should report the specific fusion operations selected per layer (e.g., in a table or figure) and analyze patterns (such as early layers favoring one operation and later layers another). Without this, the 'layer-specific' aspect remains a black-box outcome rather than evidence of semantic leveraging.

Authors: We agree that explicit reporting of the discovered operations would improve interpretability. The revised manuscript will include a table (or figure) listing the selected fusion scheme per layer for representative datasets together with a short analysis of observed layer-wise patterns. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical search results are independent of inputs

The paper frames the contribution as a bi-level optimization solved via differentiable architecture search over an explicitly enumerated space of four fusion operations (concatenation, addition, affine, cross-attention). Reported gains on 34 external datasets follow from the optimization outcomes rather than any self-definitional equivalence, fitted parameter renamed as prediction, or load-bearing self-citation. The derivation chain remains self-contained against the held-out benchmarks; no equation or premise reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone supplies insufficient detail to enumerate concrete free parameters, axioms, or invented entities; the bi-level optimization and four fusion operations are described at a high level only.

pith-pipeline@v0.9.1-grok · 5808 in / 1211 out tokens · 20768 ms · 2026-06-26T01:25:24.336261+00:00 · methodology

Reference graph

Works this paper leans on

90 extracted references · 6 linked inside Pith

[1]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations
[2]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 12

2021
[3]

Visual recognition with deep nearest centroids

Wenguan Wang, Cheng Han, Tianfei Zhou, and Dongfang Liu. Visual recognition with deep nearest centroids. In The Eleventh International Conference on Learning Representations,
[4]

Visual prompt tuning

Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European conference on computer vision, pages 709–727. Springer, 2022

2022
[5]

All you need is one: Capsule prompt tuning with a single vector

Yiyang Liu, James Liang, Heng Fan, Wenhao Yang, Yiming Cui, Xiaotian Han, Lifu Huangg, Dongfang Liu, Qifan Wang, and Cheng Han. All you need is one: Capsule prompt tuning with a single vector. Advances in Neural Information Processing Systems, 38:88139–88166, 2026

2026
[6]

York, Tianyang Wang, and Xiao Wang

Xi Xiao, Aristeidis Tsaris, Anika Tabassum, John Lagergren, Larry M. York, Tianyang Wang, and Xiao Wang. Focus: Fused observation of channels for unveiling spectra, 2026. URLhttps://arxiv.org/abs/2507.14787

Pith/arXiv arXiv 2026
[7]

Self-supervised visual prompting for cross-domain road damage detection

Xi Xiao, Zhuxuanzi Wang, Mingqiao Mo, Chen Liu, Chenrui Ma, Yanshu Li, Smita Krishnaswamy, Xiao Wang, and Tianyang Wang. Self-supervised visual prompting for cross-domain road damage detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3514–3524, 2026

2026
[8]

Do vision transformers see like convolutional neural networks? Advances in neural information processing systems, 34: 12116–12128, 2021

Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? Advances in neural information processing systems, 34: 12116–12128, 2021

2021
[9]

Layer by layer: Uncovering hidden representations in language models

Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Nikul Patel, Jalal Naghiyev, Yann Lecun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models. In International Conference on Machine Learning, pages 55854–55875. PMLR, 2025

2025
[10]

Dispersion loss counteracts embedding condensation and improves generalization in small language models

Chen Liu, Xingzhi Sun, Xi Xiao, Alexandre Van Tassel, Ke Xu, Kristof Reimann, Danqi Liao, Mark Gerstein, Tianyang Wang, Xiao Wang, and Smita Krishnaswamy. Dispersion loss counteracts embedding condensation and improves generalization in small language models. In International Conference on Machine Learning. PMLR, 2026

2026
[11]

The information bottleneck method

Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000

Pith/arXiv arXiv 2000
[12]

A large-scale study of representation learning with the visual task adaptation benchmark

Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2019

Pith/arXiv arXiv 1910
[13]

The caltech-ucsd birds-200-2011 dataset

Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011

2011
[14]

Diversity-aware meta visual prompting

Qidong Huang, Xiaoyi Dong, Dongdong Chen, Weiming Zhang, Feifei Wang, Gang Hua, and Nenghai Yu. Diversity-aware meta visual prompting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10878–10887, 2023

2023
[15]

Darts: Differentiable architecture search

Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. In International Conference on Learning Representations
[16]

Convex optimization

Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004

2004
[17]

Shannon entropy, renyi entropy, and information.Statistics and Inf

PA Bromiley, NA Thacker, and E Bouhova-Thacker. Shannon entropy, renyi entropy, and information.Statistics and Inf. Series (2004-004), 9(2004):2–8, 2004

2004
[18]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

2016
[19]

Film: Visual reasoning with a general conditioning layer

Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

2018
[20]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

2017
[21]

Automated flower classification over a large number of classes

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008

2008
[22]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022. 13

2022
[23]

An empirical study of training self-supervised vision transformers

Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9640–9649, 2021

2021
[24]

How well do sparse imagenet models transfer? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12266–12276, 2022

Eugenia Iofinova, Alexandra Peste, Mark Kurtz, and Dan Alistarh. How well do sparse imagenet models transfer? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12266–12276, 2022

2022
[25]

How transferable are features in deep neural networks? Advances in neural information processing systems, 27, 2014

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? Advances in neural information processing systems, 27, 2014

2014
[26]

Improved baselines with momentum contrastive learning

Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020

Pith/arXiv arXiv 2003
[27]

Side-tuning: a baseline for network adaptation via additive side networks

Jeffrey O Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, and Jitendra Malik. Side-tuning: a baseline for network adaptation via additive side networks. In European conference on computer vision, pages 698–714. Springer, 2020

2020
[28]

Learning multiple visual domains with residual adapters

Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. Advances in neural information processing systems, 30, 2017

2017
[29]

Tinytl: Reduce memory, not parameters for efficient on-device learning

Han Cai, Chuang Gan, Ligeng Zhu, and Song Han. Tinytl: Reduce memory, not parameters for efficient on-device learning. Advances in Neural Information Processing Systems, 33:11285–11297, 2020

2020
[30]

Lora: Low-rank adaptation of large language models

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations
[31]

Adaptformer: Adapting vision transformers for scalable visual recognition

Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. Adaptformer: Adapting vision transformers for scalable visual recognition. Advances in Neural Information Processing Systems, 35:16664–16678, 2022

2022
[32]

Efficient adaptation of large vision transformer via adapter re-composing

Wei Dong, Dawei Yan, Zhijun Lin, and Peng Wang. Efficient adaptation of large vision transformer via adapter re-composing. Advances in Neural Information Processing Systems, 36:52548–52567, 2023

2023
[33]

E 2 vpt: An effective and efficient approach for visual prompt tuning

Cheng Han, Qifan Wang, Yiming Cui, Zhiwen Cao, Wenguan Wang, Siyuan Qi, and Dongfang Liu. E 2 vpt: An effective and efficient approach for visual prompt tuning. In2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 17445–17456. IEEE, 2023

2023
[34]

Learning expressive prompting with residuals for vision transformers

Rajshekhar Das, Yonatan Dukler, Avinash Ravichandran, and Ashwin Swaminathan. Learning expressive prompting with residuals for vision transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3366–3377, 2023

2023
[35]

Sa 2vp: Spatially aligned- and-adapted visual prompt

Wenjie Pei, Tongqi Xia, Fanglin Chen, Jinsong Li, Jiandong Tian, and Guangming Lu. Sa 2vp: Spatially aligned- and-adapted visual prompt. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 4450–4458, 2024

2024
[36]

Revisiting the power of prompt for visual tuning

Yuzhu Wang, Lechao Cheng, Chaowei Fang, Dingwen Zhang, Manni Duan, and Meng Wang. Revisiting the power of prompt for visual tuning. In Forty-first International Conference on Machine Learning,
[37]

Visual fourier prompt tuning

Runjia Zeng, Cheng Han, Qifan Wang, Chunshu Wu, Tong Geng, Lifu Huang, Ying N Wu, and Dongfang Liu. Visual fourier prompt tuning. Advances in Neural Information Processing Systems, 37:5552–5585, 2024

2024
[38]

Lor-vp: Low-rank visual prompting for efficient vision model adaptation

Can Jin, Ying Li, Mingyu Zhao, Shiyu Zhao, Zhenting Wang, Xiaoxiao He, Ligong Han, Tong Che, and Dimitris Metaxas. Lor-vp: Low-rank visual prompting for efficient vision model adaptation. In International Conference on Learning Representations, volume 2025, pages 73004–73021, 2025

2025
[39]

Da-vpt: Semantic-guided visual prompt tuning for vision transformers

Li Ren, Chen Chen, Liqiang Wang, and Kien Hua. Da-vpt: Semantic-guided visual prompt tuning for vision transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4353–4363, 2025

2025
[40]

Improving visual prompt tuning for self-supervised vision transformers

Seungryong Yoo, Eunji Kim, Dahuin Jung, Jungbeom Lee, and Sungroh Yoon. Improving visual prompt tuning for self-supervised vision transformers. In International Conference on Machine Learning, pages 40075–40092. PMLR, 2023

2023
[41]

Facing the elephant in the room: Visual prompt tuning or full finetuning? InThe TwelfthInternational Conference on Learning Representations

Cheng Han, Qifan Wang, Yiming Cui, Wenguan Wang, Lifu Huang, Siyuan Qi, and Dongfang Liu. Facing the elephant in the room: Visual prompt tuning or full finetuning? InThe TwelfthInternational Conference on Learning Representations. 14
[42]

Rethinking spatial dimensions of vision transformers

Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, and Seong Joon Oh. Rethinking spatial dimensions of vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11936–11945, 2021

2021
[43]

Parameter-efficient transfer learning for nlp

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pages 2790–2799. PMLR, 2019

2019
[44]

Adapterhub: A framework for adapting transformers

Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vuli´ c, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. Adapterhub: A framework for adapting transformers. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pages 46–54, 2020

2020
[45]

Compacter: Efficient low-rank hypercomplex adapter layers

Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. Compacter: Efficient low-rank hypercomplex adapter layers. Advances in neural information processing systems, 34:1022–1035, 2021

2021
[46]

Prefix-tuning: Optimizing continuous prompts for generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, 2021

2021
[47]

The power of scale for parameter-efficient prompt tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 3045–3059, 2021

2021
[48]

Autoprompt: Eliciting knowledge from language models with automatically generated prompts

Taylor Shin, Yasaman Razeghi, Robert L Logan Iv, Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 4222–4235, 2020

2020
[49]

Universal adversarial triggers for attacking and analyzing nlp

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing nlp. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2153–2162, 2019

2019
[50]

Qlora: Efficient finetuning of quantized llms

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36:10088–10115, 2023

2023
[51]

Bitfit: Simple parameter-efficient fine-tuning for transformer- based masked language-models

Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer- based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–9, 2022

2022
[52]

Prompt learns prompt: Exploring knowledge- aware generative prompt collaboration for video captioning

Liqi Yan, Cheng Han, Zenglin Xu, Dongfang Liu, and Qifan Wang. Prompt learns prompt: Exploring knowledge- aware generative prompt collaboration for video captioning. In IJCAI, pages 1622–1630, 2023

2023
[53]

Visual instance-aware prompt tuning

Xi Xiao, Yunbei Zhang, Xingjian Li, Tianyang Wang, Xiao Wang, Yuxiang Wei, Jihun Hamm, and Min Xu. Visual instance-aware prompt tuning. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 2880–2889, 2025

2025
[54]

Visual variational autoencoder prompt tuning

Xi Xiao, Yunbei Zhang, Yanshuh Li, Xingjian Li, Tianyang Wang, Jihun Hamm, Xiao Wang, and Min Xu. Visual variational autoencoder prompt tuning. arXiv preprint arXiv:2503.17650, 2025

arXiv 2025
[55]

Pro-vpt: Distribution-adaptive visual prompt tuning via prompt relocation

Chikai Shang, Mengke Li, Yiqun Zhang, Zhen Chen, Jinlin Wu, Fangqing Gu, Yang Lu, and Yiu-ming Cheung. Pro-vpt: Distribution-adaptive visual prompt tuning via prompt relocation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1558–1568, 2025

2025
[56]

Cvpt: Cross visual prompt tuning

Lingyun Huang, Jianxu Mao, Junfei Yi, Ziming Tao, and Yaonan Wang. Cvpt: Cross visual prompt tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 848–858, 2025

2025
[57]

Neural architecture search: A survey

Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1–21, 2019

2019
[58]

Autoformer: Searching transformers for visual recognition

Minghao Chen, Houwen Peng, Jianlong Fu, and Haibin Ling. Autoformer: Searching transformers for visual recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12270–12280, 2021

2021
[59]

Vitas: Vision transformer architecture search

Xiu Su, Shan You, Jiyang Xie, Mingkai Zheng, Fei Wang, Chen Qian, Changshui Zhang, Xiaogang Wang, and Chang Xu. Vitas: Vision transformer architecture search. In European Conference on Computer Vision, pages 139–157. Springer, 2022. 15

2022
[60]

Nasvit: Neural architecture search for efficient vision transformers with gradient conflict aware supernet training

Chengyue Gong, Dilin Wang, Meng Li, Xinlei Chen, Zhicheng Yan, Yuandong Tian, Vikas Chandra, et al. Nasvit: Neural architecture search for efficient vision transformers with gradient conflict aware supernet training. In International Conference on Learning Representations, 2021

2021
[61]

Dynamicvit: Efficient vision transformers with dynamic token sparsification

Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural information processing systems, 34:13937–13949, 2021

2021
[62]

Evit: Expediting vision transformers via token reorganizations

Youwei Liang, GE Chongjian, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Evit: Expediting vision transformers via token reorganizations. In International Conference on Learning Representations
[63]

Advancing textual prompt learning with anchored attributes

Zheng Li, Yibing Song, Ming-Ming Cheng, Xiang Li, and Jian Yang. Advancing textual prompt learning with anchored attributes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3618–3627, 2025

2025
[64]

Differentiable prompt learning for vision language models

Zhenhan Huang, Tejaswini Pedapati, Pin-Yu Chen, and Jianxi Gao. Differentiable prompt learning for vision language models. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pages 5444–5452, 2025

2025
[65]

Causation, prediction, and search, volume 81

Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, prediction, and search, volume 81. Springer Science & Business Media, 2012

2012
[66]

Prompt-based adaptation in large-scale vision models: A survey

Xi Xiao, Yunbei Zhang, Lin Zhao, Yiyang Liu, Xiaoying Liao, Zheda Mai, Xingjian Li, Xiao Wang, Hao Xu, Jihun Hamm, Xue Lin, Min Xu, Qifan Wang, Tianyang Wang, and Cheng Han. Prompt-based adaptation in large-scale vision models: A survey. Transactions on Machine Learning Research, 2026. ISSN 2835-8856. URL https://openreview.net/forum?id=UwtXDttgsE

2026
[67]

Not all directions matter: Towards structured and task-aware low-rank model adaptation

Xi Xiao, Chenrui Ma, Yunbei Zhang, Chen Liu, Zhuxuanzi Wang, Yanshu Li, Lin Zhao, Guosheng Hu, Tianyang Wang, and Hao Xu. Not all directions matter: Towards structured and task-aware low-rank model adaptation. arXiv preprint arXiv:2603.14228, 2026

arXiv 2026
[68]

Hyperadalora: Accelerating lora rank allocation during training via hypernetworks without sacrificing performance

Hao Zhang, Zhenjia Li, Runfeng Bao, Yifan Gao, Xi Xiao, Heng Zhang, Shuyang Zhang, Bo Huang, Yuhang Wu, Tianyang Wang, et al. Hyperadalora: Accelerating lora rank allocation during training via hypernetworks without sacrificing performance. arXiv preprint arXiv:2510.02630, 2025

arXiv 2025
[69]

Ctr-lora: Curvature-aware and trust-region guided low-rank adaptation for large language models

Zhuxuanzi Wang, Mingqiao Mo, Xi Xiao, Chen Liu, Chenrui Ma, Yunbei Zhang, Xiao Wang, Smita Krishnaswamy, and Tianyang Wang. Ctr-lora: Curvature-aware and trust-region guided low-rank adaptation for large language models. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1396–1400. IEEE, 2026

2026
[70]

Sensitivity-lora: Low-load sensitivity-based fine-tuning for large language models

Hao Zhang, Bo Huang, Zhenjia Li, Xi Xiao, Hui Yi Leong, Zumeng Zhang, Xinwei Long, Tianyang Wang, and Hao Xu. Sensitivity-lora: Low-load sensitivity-based fine-tuning for large language models. arXiv preprint arXiv:2509.09119, 2025

arXiv 2025
[71]

Prime once, then reprogram locally: An efficient alternative to black-box service model adaptation

Yunbei Zhang, Chengyi Cai, Feng Liu, and Jihun Hamm. Prime once, then reprogram locally: An efficient alternative to black-box service model adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2026
[72]

Dpcore: Dynamic prompt coreset for continual test-time adaptation

Yunbei Zhang, Akshay Mehra, Shuaicheng Niu, and Jihun Hamm. Dpcore: Dynamic prompt coreset for continual test-time adaptation. In International Conference on Machine Learning, pages 75757–75778. PMLR, 2025

2025
[73]

Flasheval: Towards fast and accurate evaluation of text-to-image diffusion generative models

Lin Zhao, Tianchen Zhao, Zinan Lin, Xuefei Ning, Guohao Dai, Huazhong Yang, and Yu Wang. Flasheval: Towards fast and accurate evaluation of text-to-image diffusion generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16122–16131, 2024

2024
[74]

Thinimg: Cross-modal steganography for presenting talking heads in images

Lin Zhao, Hongxuan Li, Xuefei Ning, and Xinru Jiang. Thinimg: Cross-modal steganography for presenting talking heads in images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 5553–5562, 2024

2024
[75]

Streamvlo: Streaming visual-lidar odometry with cumulative drift compensation

Mengmeng Liu, Jiuming Liu, Michael Ying Yang, Chaokang Jiang, Jiangtao Li, Yunpeng Zhang, Hesheng Wang, Francesco Nex, and Hao Cheng. Streamvlo: Streaming visual-lidar odometry with cumulative drift compensation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 39086–39097, 2026

2026
[76]

Driveva: Video action models are zero-shot drivers

Mengmeng Liu, Diankun Zhang, Jiuming Liu, Jianfeng Cui, Hongwei Xie, Guang Chen, Hangjun Ye, Michael Ying Yang, Francesco Nex, and Hao Cheng. Driveva: Video action models are zero-shot drivers. arXiv preprint arXiv:2604.04198, 2026

arXiv 2026
[77]

Unsupervised hyperspectral image super-resolution via self-supervised modality decoupling

Songcheng Du, Yang Zou, Zixu Wang, Xingyuan Li, Ying Li, Changjing Shang, and Qiang Shen. Unsupervised hyperspectral image super-resolution via self-supervised modality decoupling. International Journal of Computer Vision, 134(4):152, 2026

2026
[78]

Pansharpening for thin-cloud contaminated remote sensing images: a unified framework and benchmark dataset

Songcheng Du, Yang Zou, Jiaxin Li, Mingxuan Liu, Ying Li, Changjing Shang, and Qiang Shen. Pansharpening for thin-cloud contaminated remote sensing images: a unified framework and benchmark dataset. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 3696–3704, 2026

2026
[79]

Frequency-decoupled learning for joint thin-cloud removal and pansharpening

Songcheng Du, Yunpeng Bai, Jiaman Ma, Mingxuan Liu, and Ying Li. Frequency-decoupled learning for joint thin-cloud removal and pansharpening. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 9282–9286. IEEE, 2026

2026
[80]

Semi-supervised semantic segmentation with multi-constraint consistency learning

Jianjian Yin, Tao Chen, Gensheng Pei, Huafeng Liu, Yazhou Yao, Liqiang Nie, and Xiansheng Hua. Semi-supervised semantic segmentation with multi-constraint consistency learning. IEEE Transactions on Multimedia, 27:6449–6461, 2025

2025
