Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers
Pith reviewed 2026-05-18 03:00 UTC · model grok-4.3
The pith
Global class prototypes enable adaptive per-sample prompt mixing for personalized federated tuning of Vision Transformers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PEP-FedPT defines the Class-Contextualized Mixed Prompt (CCMP) that keeps class-specific prompts alongside a globally shared prompt. Mixing weights for any given input are produced from global class prototypes and client class priors, allowing per-sample personalization. All prompts are optimized jointly via federated averaging, yielding both generalization and personalization without storing client-dependent parameters.
What carries the argument
Class-Contextualized Mixed Prompt (CCMP), which adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors to deliver per-sample personalization.
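As a rough illustration of how such weights might be formed, the sketch below scores one input feature against each global prototype by cosine similarity and folds in the client's class prior through a softmax. The cosine score follows the description in the simulated rebuttal further down this page. The softmax, the `temperature` parameter, and the names `ccmp_mixing_weights` and `mixed_prompt` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def ccmp_mixing_weights(feature, prototypes, class_prior, temperature=0.1):
    """Per-sample mixing weights over C class-specific prompts (illustrative).

    feature:     (d,)   feature of one input, e.g. a cls-token embedding
    prototypes:  (C, d) global class prototypes
    class_prior: (C,)   client's empirical class frequencies
    temperature: assumed softmax temperature, not specified by the source
    """
    # Cosine similarity between the input feature and each prototype.
    f = feature / (np.linalg.norm(feature) + 1e-12)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-12)
    sims = p @ f  # shape (C,)
    # Prior-modulated softmax, computed in log space for numerical stability.
    logits = sims / temperature + np.log(class_prior + 1e-12)
    logits = logits - logits.max()
    w = np.exp(logits)
    return w / w.sum()

def mixed_prompt(weights, class_prompts):
    """Convex combination of class prompts: (C,) @ (C, d_prompt) -> (d_prompt,)."""
    return weights @ class_prompts
```

Because the prior enters as an additive log term, a class the client never observes receives almost no weight even when the input sits close to that class's prototype. This is one possible reading of "client-informed" weighting.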
Load-bearing premise
Global class prototypes computed across clients together with client class priors can generate reliable adaptive mixing weights that deliver effective per-sample personalization without causing overfitting or loss of generalization.
What would settle it
An experiment on a dataset with extreme non-IID class distributions where removing the prototype-based mixing and using only the global prompt yields equal or better accuracy would show the mixing mechanism adds no benefit.
Original abstract
Visual Prompt Tuning (VPT) of pre-trained Vision Transformers (ViTs) has proven highly effective as a parameter-efficient fine-tuning technique for adapting large models to downstream tasks with limited data. Its parameter efficiency makes it particularly suitable for Federated Learning (FL), where both communication and computation budgets are often constrained. However, global prompt tuning struggles to generalize across heterogeneous clients, while personalized tuning overfits to local data and lacks generalization. We propose PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt Tuning), a unified framework designed to achieve both generalization and personalization in federated prompt tuning of ViTs. Within this framework, we introduce the novel Class-Contextualized Mixed Prompt (CCMP) - based on class-specific prompts maintained alongside a globally shared prompt. For each input, CCMP adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors. This approach enables per-sample prompt personalization without storing client-dependent trainable parameters. The prompts are collaboratively optimized via traditional federated averaging technique on the same. Comprehensive evaluations on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist datasets demonstrate that PEP-FedPT consistently surpasses the state-of-the-art baselines under diverse data heterogeneity scenarios, establishing a strong foundation for efficient and generalizable federated prompt tuning of Vision Transformers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PEP-FedPT, a framework for federated prompt tuning of Vision Transformers. It defines the Class-Contextualized Mixed Prompt (CCMP) that maintains class-specific prompts alongside a globally shared prompt and computes per-sample mixing weights from global class prototypes and client class priors. The prompts are optimized with standard federated averaging. Experiments on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist report consistent gains over baselines under heterogeneous data partitions.
Significance. If the central claims hold, the work supplies a parameter-efficient route to joint generalization and per-sample personalization in federated ViT adaptation. By deriving adaptive weights from prototypes rather than storing client-specific parameters, the method could lower communication and memory overhead while addressing non-IID distributions, offering a practical advance for FL deployments of large vision models.
major comments (2)
- [Method] Method section: the procedure for computing and federated-aggregating global class prototypes (feature averaging, centroid, etc.) is not explicitly stated. This detail is load-bearing for the personalization claim; if majority clients dominate the prototypes, the derived mixing weights become effectively global and the per-sample benefit collapses to standard FedAvg prompt tuning.
- [Experiments] Experiments section: the reported outperformance across four datasets and heterogeneity regimes is not accompanied by statistical significance tests, error bars, ablation studies on prototype quality versus heterogeneity level, or precise rules for data partitioning and client sampling. These omissions prevent full assessment of whether the gains are robust or merely consistent with the weakest assumption that prototypes remain representative.
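The concern in the first major comment, that majority clients could dominate the global prototypes, can be made concrete with a small sketch. It assumes each client reports per-class feature sums and counts and that the server forms count-weighted class means. This aggregation rule and the helper names `local_prototypes` and `aggregate_prototypes` are hypothetical, since the page does not state the paper's actual procedure.

```python
import numpy as np

def local_prototypes(features, labels, num_classes):
    """Per-class feature sums and sample counts for one client."""
    d = features.shape[1]
    sums = np.zeros((num_classes, d))
    counts = np.zeros(num_classes)
    np.add.at(sums, labels, features)  # accumulate each feature into its class row
    np.add.at(counts, labels, 1)       # count samples per class
    return sums, counts

def aggregate_prototypes(client_stats):
    """Count-weighted global prototypes from a list of (sums, counts) pairs."""
    total_sums = sum(s for s, _ in client_stats)
    total_counts = sum(c for _, c in client_stats)
    # Classes with no samples anywhere keep a zero prototype.
    safe_counts = np.maximum(total_counts, 1)[:, None]
    return total_sums / safe_counts

# Client A holds 99 samples of class 0, client B holds a single one.
feats_a, labels_a = np.tile([1.0, 0.0], (99, 1)), np.zeros(99, dtype=int)
feats_b, labels_b = np.array([[0.0, 1.0]]), np.zeros(1, dtype=int)
global_protos = aggregate_prototypes([
    local_prototypes(feats_a, labels_a, num_classes=2),
    local_prototypes(feats_b, labels_b, num_classes=2),
])
```

Under this assumed rule the class-0 prototype lands almost exactly on client A's feature, so client B's view of the class barely registers. That is the collapse toward a majority-driven prototype the referee describes.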
minor comments (1)
- A single equation or pseudocode block clarifying the exact mixing-weight formula (prototype similarity combined with class prior) would improve readability and reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the potential impact of PEP-FedPT. We address each major comment below and will incorporate revisions to strengthen the manuscript.
-
Referee: [Method] Method section: the procedure for computing and federated-aggregating global class prototypes (feature averaging, centroid, etc.) is not explicitly stated. This detail is load-bearing for the personalization claim; if majority clients dominate the prototypes, the derived mixing weights become effectively global and the per-sample benefit collapses to standard FedAvg prompt tuning.
Authors: We agree that an explicit description is necessary for reproducibility and to substantiate the personalization mechanism. In the revised manuscript, we will add a dedicated subsection detailing that each client computes local class centroids as the mean of ViT-extracted features for samples of each class, followed by server-side aggregation via standard FedAvg to obtain global prototypes. The mixing weights for CCMP are then computed per-sample using cosine similarity between the input's feature and these global prototypes, modulated by the client's local class prior. This ensures the weights remain input-adaptive and client-informed rather than collapsing to a uniform global prompt, even under non-IID conditions. revision: yes
-
Referee: [Experiments] Experiments section: the reported outperformance across four datasets and heterogeneity regimes is not accompanied by statistical significance tests, error bars, ablation studies on prototype quality versus heterogeneity level, or precise rules for data partitioning and client sampling. These omissions prevent full assessment of whether the gains are robust or merely consistent with the weakest assumption that prototypes remain representative.
Authors: We acknowledge these omissions limit the strength of the empirical claims. In the revision, we will report mean and standard deviation over 5 independent runs with error bars, include paired t-tests for significance against baselines, add an ablation varying prototype aggregation quality (e.g., by subsampling clients) against heterogeneity levels (Dirichlet alpha from 0.05 to 1.0), and explicitly state the data partitioning protocol (Dirichlet distribution with specified alpha, 100 clients, 10% participation per round) along with client sampling details. revision: yes
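The partitioning protocol named in the response above (Dirichlet label skew, 100 clients, 10% participation per round) is commonly implemented along the following lines. This is a minimal sketch under assumptions: the per-class Dirichlet split, the default `alpha`, and the function names `dirichlet_partition` and `sample_clients` are illustrative choices, not details confirmed by the paper.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=100, alpha=0.3, seed=0):
    """Split sample indices across clients with Dirichlet label skew."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        # Fraction of class c assigned to each client; smaller alpha means more skew.
        shares = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

def sample_clients(num_clients=100, participation=0.1, rng=None):
    """Draw the subset of clients that participates in one round."""
    if rng is None:
        rng = np.random.default_rng()
    k = max(1, int(round(participation * num_clients)))
    return rng.choice(num_clients, size=k, replace=False)
```

Sweeping `alpha` over the range mentioned in the response (0.05 to 1.0) would move the split from nearly single-class clients toward a more balanced one, which is the axis the proposed ablation varies.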
Circularity Check
No significant circularity; CCMP mixing weights derived explicitly from prototypes and priors, then optimized via standard FedAvg.
full rationale
The paper introduces CCMP as an explicit construction that computes per-sample mixing weights from global class prototypes and client class priors, then optimizes the resulting prompts with conventional federated averaging. These inputs are described directly rather than being fitted to or defined in terms of the final performance metric. No equation reduces a claimed prediction back to the input quantities by construction, and the generalization-plus-personalization claim rests on the stated mechanism rather than a self-citation chain or uniqueness theorem imported from prior work by the same authors. This matches the reader's assessment that the framework applies standard techniques to newly defined quantities without statistical forcing.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Pre-trained Vision Transformers remain effective when only prompts are updated in a federated setting.
- ad hoc to paper: Global class prototypes and client class priors suffice to derive mixing weights that achieve useful per-sample personalization.
invented entities (1)
- Class-Contextualized Mixed Prompt (CCMP): no independent evidence
Reference graph
Works this paper leans on
-
[1]
On the Opportunities and Risks of Foundation Models
Durmus Alp Emre Acar, Yue Zhao, Ramon Matas, Matthew Mattina, Paul Whatmough, and Venkatesh Saligrama. Federated learning based on dynamic regularization. In International Conference on Learning Representations. Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emm...
-
[2]
Improving generalization in federated learning by seeking flat minima
Debora Caldarola, Barbara Caputo, and Marco Ciccone. Improving generalization in federated learning by seeking flat minima. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIII, pp. 654–672. Springer,
-
[3]
doi: 10.1109/CVPR52733.2024.00582. URL https://doi.ieeecomputersociety.org/10.1109/CVPR52733.2024.00582. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers ...
-
[4]
Federated Learning for Mobile Keyboard Prediction
URL https://arxiv.org/abs/1811.03604. Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Federated visual classification with real-world data distribution. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, pp. 76–92. Springer,
-
[5]
doi: 10.18653/v1/2021.acl-long.353
Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.353. URL https://aclanthology.org/2021.acl-long.353. Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou. FedBN: Federated learning on non-IID features via local batch normalization. In International Conference on Learning Representations,
-
[6]
Masked feature prediction for self-supervised visual pre-training
doi: 10.1109/CVPR52688.2022.00985. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. PMLR,
-
[7]
Transformers for image recognition at scale
Houlsby Neil and Weissenborn Dirk. Transformers for image recognition at scale. Online: https://ai.googleblog.com/2020/12/transformers-for-image-recognition-at.html,
-
[8]
Federated Learning for Emoji Prediction in a Mobile Keyboard
Swaroop Ramaswamy, Rajiv Mathews, Kanishka Rao, and Françoise Beaufays. Federated learning for emoji prediction in a mobile keyboard. arXiv preprint arXiv:1906.04329,
-
[9]
DeLong, Ramon Fernandez Mir, and Jacques D
doi: 10.1109/TNNLS.2022.3160699. Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8769–8778,
-
[10]
Motley: Benchmarking heterogeneity and personalization in federated learning
Shanshan Wu, Tian Li, Zachary Charles, Yu Xiao, Ken Liu, Zheng Xu, and Virginia Smith. Motley: Benchmarking heterogeneity and personalization in federated learning. In Workshop on Federated Learning: Recent Advances and New Challenges (in Conjunction with NeurIPS 2022). Jian Xu, Xinyi Tong, and Shao-Lun Huang. Personalized federated learning with feature a...
-
[11]
What do we mean by generalization in federated learning?
Honglin Yuan, Warren Morningstar, Lin Ning, and Karan Singhal. What do we mean by generalization in federated learning? arXiv preprint arXiv:2110.14216,
-
[12]
In the Figure 5b the split shows the mix of feature and the label imbalance
The DomainNet dataset can be viewed as analogous to the one described in Figure 5a. In Figure 5b the split shows a mix of feature and label imbalance. A.3.2 Hyperparameter Details. We set the number of communication rounds to 100 for the CIFAR-100 and Tiny-ImageNet datasets. For DomainNet we set the number of rounds to 50. We set the number of epochs to 5 for all th...
-
[13]
as the default optimizer with learning rate 0.1 with exponential decay and momentum 0.9. For the various datasets used in our experiments, we adapt the number of training rounds accordingly: 100 rounds for CIFAR-100 and Tiny-ImageNet, 50 rounds for DomainNet, and 500 rounds for iNaturalist. For all the experiments we consider number of shared prompts (nS) to ...
-
[14]
We perform the analysis of our method PEP-FedPT. It can be seen that adding the CCMP prompts too early in the ViT is not beneficial, as the cls token representations at the very early layers are not yet good representations. Adding the prompts at later layers is also not beneficial, even though the cls tokens have better representations, since the prompts ...
-
[15]
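It follows by using iterated expectation, as shown below.
$$\mathbb{E}\,\langle p - \mathbb{E}[p \mid \mathrm{cls}_{l-1}],\ \mathbb{E}[p \mid \mathrm{cls}_{l-1}] - \hat{p}\rangle = \mathbb{E}\big[\,\mathbb{E}[\langle p - \mathbb{E}[p \mid \mathrm{cls}_{l-1}],\ \mathbb{E}[p \mid \mathrm{cls}_{l-1}] - \hat{p}\rangle \mid \mathrm{cls}_{l-1}]\,\big] \tag{44}$$
$$= \mathbb{E}\big[\,\langle \mathbb{E}[\,p - \mathbb{E}[p \mid \mathrm{cls}_{l-1}] \mid \mathrm{cls}_{l-1}\,],\ \mathbb{E}[p \mid \mathrm{cls}_{l-1}] - \hat{p}\rangle\,\big] \tag{45}$$
$$= 0 \tag{46}$$
We now have
$$J(\hat{p}) = \mathbb{E}\,\lVert p - \mathbb{E}[p \mid \mathrm{cls}_{l-1}]\rVert^2 + \mathbb{E}\,\lVert \mathbb{E}[p \mid \mathrm{cls}_{l-1}] - \hat{p}\rVert^2 \tag{47}$$
From Eq. 47 it can be readily seen that $J(\hat{p})$ is minimized by setting the value ...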
-
[16]
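$\frac{1}{n}\sum_{k\in[n]} \lVert\nabla f_k(\theta)\rVert^2 \le G^2 + B^2\,\lVert\nabla f(\theta)\rVert^2$, where $f(\theta) = \frac{1}{n}\sum_{k\in[n]} f_k(\theta)$. This is referred to as the bounded gradient dissimilarity assumption. A 6. Let $\mathbb{E}\,\lVert\nabla l(\theta,(x,y)) - \nabla f_k(\theta)\rVert \le \sigma^2$ for all $k$ and $\theta$. Here $l(\theta,(x,y))$ is the loss evaluated on the sample $(x,y)$ and $f_k(\theta)$ is its expectation across the samples drawn from $D_k$. This is a bounded variance assumption. In the above assumptions, the...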