
arxiv: 2510.25372 · v2 · submitted 2025-10-29 · 💻 cs.CV · cs.LG

Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers

Pith reviewed 2026-05-18 03:00 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords federated learning · prompt tuning · vision transformers · class prototypes · personalization · heterogeneous data · parameter-efficient fine-tuning

The pith

Global class prototypes enable adaptive per-sample prompt mixing for personalized federated tuning of Vision Transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PEP-FedPT as a framework that maintains both a shared global prompt and class-specific prompts for federated learning with Vision Transformers. For each input sample it computes mixing weights from global class prototypes aggregated across clients and the local client class priors, then applies these weights to create a customized prompt on the fly. The prompts are updated collaboratively through standard federated averaging, avoiding any client-specific trainable parameters. A reader would care because this directly tackles the conflict between needing global generalization across heterogeneous data and delivering local personalization while respecting strict communication and storage limits.

Core claim

PEP-FedPT defines the Class-Contextualized Mixed Prompt (CCMP) that keeps class-specific prompts alongside a globally shared prompt. Mixing weights for any given input are produced from global class prototypes and client class priors, allowing per-sample personalization. All prompts are optimized jointly via federated averaging, yielding both generalization and personalization without storing client-dependent parameters.

What carries the argument

Class-Contextualized Mixed Prompt (CCMP), which adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors to deliver per-sample personalization.
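
The mixing step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's formula: the cosine similarity, the temperature `tau`, the softmax, and the choice to add the log of the client class prior to the similarity logits are all assumptions made here for concreteness, and the function name `ccmp_mix` is hypothetical. How the mixed prompt is joined with the shared global prompt inside the ViT is not covered.

```python
import numpy as np

def ccmp_mix(cls_feat, prototypes, class_prior, class_prompts, tau=1.0, eps=1e-8):
    """Return (weights, mixed_prompt) for a single input sample.

    cls_feat:      (d,)   CLS feature of the input at the chosen layer
    prototypes:    (C, d) global class prototypes
    class_prior:   (C,)   client's empirical class frequencies
    class_prompts: (C, p) one prompt vector per class
    """
    # Cosine similarity between the sample feature and every prototype.
    f = cls_feat / (np.linalg.norm(cls_feat) + eps)
    protos = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + eps)
    sim = protos @ f                                   # shape (C,)

    # ASSUMPTION: the prior enters as an additive log-prior on the logits.
    logits = sim / tau + np.log(class_prior + eps)

    # Numerically stable softmax over classes.
    logits = logits - logits.max()
    weights = np.exp(logits)
    weights = weights / weights.sum()

    # Per-sample prompt: convex combination of the class-specific prompts.
    mixed_prompt = weights @ class_prompts             # shape (p,)
    return weights, mixed_prompt
```

Under this sketch, a class the client never sees (prior of zero) receives a near-zero weight regardless of similarity, which is one way the client prior can personalize the mix without any client-specific trainable parameter.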

Load-bearing premise

Global class prototypes computed across clients together with client class priors can generate reliable adaptive mixing weights that deliver effective per-sample personalization without causing overfitting or loss of generalization.
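
To make the premise concrete, here is a hedged sketch of how prototypes and priors might be produced. It follows the description given in the simulated rebuttal further down this page (local class means of ViT features, averaged at the server); the count-weighted server rule and the helper names `local_prototypes`, `aggregate_prototypes`, and `class_prior` are assumptions for illustration, not the paper's Eq. 11.

```python
import numpy as np

def local_prototypes(features, labels, num_classes):
    """Per-class mean feature and per-class sample count on one client."""
    sums = np.zeros((num_classes, features.shape[1]))
    counts = np.zeros(num_classes)
    np.add.at(sums, labels, features)    # accumulate features by class
    np.add.at(counts, labels, 1)         # count samples by class
    has_data = counts[:, None] > 0
    protos = np.divide(sums, counts[:, None],
                       out=np.zeros_like(sums), where=has_data)
    return protos, counts

def aggregate_prototypes(client_protos, client_counts):
    """ASSUMED server rule: count-weighted average of client prototypes."""
    protos = np.stack(client_protos)     # (K, C, d)
    counts = np.stack(client_counts)     # (K, C)
    total = counts.sum(axis=0)           # (C,)
    weighted = (protos * counts[:, :, None]).sum(axis=0)
    return np.divide(weighted, total[:, None],
                     out=np.zeros_like(weighted), where=total[:, None] > 0)

def class_prior(counts):
    """Empirical class frequencies on one client."""
    return counts / counts.sum()
```

With count weighting, the global prototype equals the mean over the pooled samples of that class, so clients holding many samples of a class dominate its prototype. That is exactly the failure mode the referee report raises; an unweighted average over clients would trade this for higher variance from small clients.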

What would settle it

An experiment on a dataset with extreme non-IID class distributions where removing the prototype-based mixing and using only the global prompt yields equal or better accuracy would show the mixing mechanism adds no benefit.

Figures

Figures reproduced from arXiv: 2510.25372 by Aditay Tripathi, Anirban Chakraborty, M Yashwanth, Sharannya Ghosh.

Figure 1. The left panel illustrates server-client communication during federated training. In each com…
Figure 2. The Top-5 accuracy computed based on the minimum distance between the cls token corresponding to the input and the cls prototypes. This shows that the cls representations in the middle layers have coarse information of the task. After every fixed update period R, the server aggregates the prototypes from the clients to compute the aggregated prototype µ̂^c_{l−1,r} at the r-th period as Eq. 11. Let the set o…
Figure 3. Comparison of the convergence of different methods across the Communication rounds on the…
Figure 4. For pathological settings we select few classes of data points for each client and allocate the data among those labels. For Dirichlet we allocate the data by drawing a sample from the Dirichlet distribution. We consider these settings using the CIFAR-100 and Tiny-ImageNet Datasets by distributing the data among the 100 and 200 clients respectively and sampling only 5 clients in each communication round. F…
Figure 4. Comparison of Non-IID Label Shift due to Pathological setting and the Dirichlet setting
Figure 5. Comparison of Non-IID Feature Shift is set to 2. The total communication rounds is set to 500 for iNaturalist. We follow stochastic Gradient Descent with momentum (Deng et al., 2024) as the default optimizer with learning rate 0.1 with exponential decay and the momentum 0.9. For the various datasets used in our experiments, we adapt the number of training rounds accordingly: 100 rounds for CIFAR-100 and Ti…
Figure 6. Comparison of the convergence of different methods across the Communication rounds on the…
Figure 7. t-SNE visualization of the learned class prompts, it can be seen that each prompt learns its own…
Figure 8. soft weights Averaged over all the data points that belong to class…
Figure 9. Comparison of t-SNE representations across layers for DomainNet and CIFAR-100 datasets.
Figure 10. Comparison of training loss of various algorithms on CIFAR-100 dataset
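
The two label-shift settings named in the Figure 4 text (pathological and Dirichlet) can be sketched as follows. This is one common recipe, not necessarily the paper's: drawing a Dirichlet vector per class over clients, the concentration parameter `alpha`, the number of classes per client, and both function names are assumptions made here.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, rng):
    """Split sample indices across clients, class by class.

    ASSUMPTION: for every class, client shares are drawn from
    Dirichlet(alpha, ..., alpha); smaller alpha means stronger label skew.
    """
    client_idx = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        shares = rng.dirichlet(alpha * np.ones(num_clients))
        # Turn cumulative shares into split points inside this class.
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_idx[k].extend(part.tolist())
    return client_idx

def pathological_partition(labels, num_clients, classes_per_client, rng):
    """Give every client samples from only a few classes."""
    classes = np.unique(labels)
    assigned = [rng.choice(classes, size=classes_per_client, replace=False)
                for _ in range(num_clients)]
    client_idx = [[] for _ in range(num_clients)]
    for c in classes:
        owners = [k for k in range(num_clients) if c in assigned[k]]
        if not owners:
            continue  # in this sketch, a class nobody drew is left unassigned
        idx = rng.permutation(np.flatnonzero(labels == c))
        for k, part in zip(owners, np.array_split(idx, len(owners))):
            client_idx[k].extend(part.tolist())
    return client_idx
```

A call such as `dirichlet_partition(labels, 100, 0.3, np.random.default_rng(0))` returns one index list per client; the value 0.3 is only a placeholder, since the concentration used in the paper is not stated on this page.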
Original abstract

Visual Prompt Tuning (VPT) of pre-trained Vision Transformers (ViTs) has proven highly effective as a parameter-efficient fine-tuning technique for adapting large models to downstream tasks with limited data. Its parameter efficiency makes it particularly suitable for Federated Learning (FL), where both communication and computation budgets are often constrained. However, global prompt tuning struggles to generalize across heterogeneous clients, while personalized tuning overfits to local data and lacks generalization. We propose PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt Tuning), a unified framework designed to achieve both generalization and personalization in federated prompt tuning of ViTs. Within this framework, we introduce the novel Class-Contextualized Mixed Prompt (CCMP) - based on class-specific prompts maintained alongside a globally shared prompt. For each input, CCMP adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors. This approach enables per-sample prompt personalization without storing client-dependent trainable parameters. The prompts are collaboratively optimized via traditional federated averaging technique on the same. Comprehensive evaluations on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist datasets demonstrate that PEP-FedPT consistently surpasses the state-of-the-art baselines under diverse data heterogeneity scenarios, establishing a strong foundation for efficient and generalizable federated prompt tuning of Vision Transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces PEP-FedPT, a framework for federated prompt tuning of Vision Transformers. It defines the Class-Contextualized Mixed Prompt (CCMP) that maintains class-specific prompts alongside a globally shared prompt and computes per-sample mixing weights from global class prototypes and client class priors. The prompts are optimized with standard federated averaging. Experiments on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist report consistent gains over baselines under heterogeneous data partitions.

Significance. If the central claims hold, the work supplies a parameter-efficient route to joint generalization and per-sample personalization in federated ViT adaptation. By deriving adaptive weights from prototypes rather than storing client-specific parameters, the method could lower communication and memory overhead while addressing non-IID distributions, offering a practical advance for FL deployments of large vision models.

major comments (2)
  1. [Method] Method section: the procedure for computing and federated-aggregating global class prototypes (feature averaging, centroid, etc.) is not explicitly stated. This detail is load-bearing for the personalization claim; if majority clients dominate the prototypes, the derived mixing weights become effectively global and the per-sample benefit collapses to standard FedAvg prompt tuning.
  2. [Experiments] Experiments section: the reported outperformance across four datasets and heterogeneity regimes is not accompanied by statistical significance tests, error bars, ablation studies on prototype quality versus heterogeneity level, or precise rules for data partitioning and client sampling. These omissions prevent full assessment of whether the gains are robust or merely consistent with the weakest assumption that prototypes remain representative.
minor comments (1)
  1. A single equation or pseudocode block clarifying the exact mixing-weight formula (prototype similarity combined with class prior) would improve readability and reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the potential impact of PEP-FedPT. We address each major comment below and will incorporate revisions to strengthen the manuscript.

  1. Referee: [Method] Method section: the procedure for computing and federated-aggregating global class prototypes (feature averaging, centroid, etc.) is not explicitly stated. This detail is load-bearing for the personalization claim; if majority clients dominate the prototypes, the derived mixing weights become effectively global and the per-sample benefit collapses to standard FedAvg prompt tuning.

    Authors: We agree that an explicit description is necessary for reproducibility and to substantiate the personalization mechanism. In the revised manuscript, we will add a dedicated subsection detailing that each client computes local class centroids as the mean of ViT-extracted features for samples of each class, followed by server-side aggregation via standard FedAvg to obtain global prototypes. The mixing weights for CCMP are then computed per-sample using cosine similarity between the input's feature and these global prototypes, modulated by the client's local class prior. This ensures the weights remain input-adaptive and client-informed rather than collapsing to a uniform global prompt, even under non-IID conditions. revision: yes

  2. Referee: [Experiments] Experiments section: the reported outperformance across four datasets and heterogeneity regimes is not accompanied by statistical significance tests, error bars, ablation studies on prototype quality versus heterogeneity level, or precise rules for data partitioning and client sampling. These omissions prevent full assessment of whether the gains are robust or merely consistent with the weakest assumption that prototypes remain representative.

    Authors: We acknowledge these omissions limit the strength of the empirical claims. In the revision, we will report mean and standard deviation over 5 independent runs with error bars, include paired t-tests for significance against baselines, add an ablation varying prototype aggregation quality (e.g., by subsampling clients) against heterogeneity levels (Dirichlet alpha from 0.05 to 1.0), and explicitly state the data partitioning protocol (Dirichlet distribution with specified alpha, 100 clients, 10% participation per round) along with client sampling details. revision: yes
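
The reporting promised in the second response (mean and standard deviation over repeated runs, plus a paired t-test against a baseline) amounts to a small computation, sketched here with the Python standard library only. The helper names are hypothetical; turning the statistic into a p-value would additionally need the Student t distribution with the returned degrees of freedom, which this sketch leaves out.

```python
import math
import statistics

def mean_std(values):
    """Sample mean and sample standard deviation (n - 1 denominator)."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(method_scores, baseline_scores):
    """t statistic and degrees of freedom for a paired t-test.

    Both lists must hold one score per run, with run i of the method
    paired with run i of the baseline (same seed, same data partition).
    """
    if len(method_scores) != len(baseline_scores):
        raise ValueError("paired test needs equally many runs per method")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

Pairing by seed and partition matters here: with only 5 runs, an unpaired comparison would fold partition-to-partition variance into the noise and could hide a consistent per-run gain.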

Circularity Check

0 steps flagged

No significant circularity; CCMP mixing weights derived explicitly from prototypes and priors, then optimized via standard FedAvg.


The paper introduces CCMP as an explicit construction that computes per-sample mixing weights from global class prototypes and client class priors, then optimizes the resulting prompts with conventional federated averaging. These inputs are described directly rather than being fitted to or defined in terms of the final performance metric. No equation reduces a claimed prediction back to the input quantities by construction, and the generalization-plus-personalization claim rests on the stated mechanism rather than a self-citation chain or uniqueness theorem imported from prior work by the same authors. This matches the reader's assessment that the framework applies standard techniques to newly defined quantities without statistical forcing.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the utility of global prototypes for weight computation and on the effectiveness of standard federated averaging for the new prompt components; no explicit free parameters are introduced beyond those inherited from ViT pre-training and prompt tuning.

axioms (2)
  • domain assumption: Pre-trained Vision Transformers remain effective when only prompts are updated in a federated setting.
    Invoked as the foundation for applying VPT inside FL.
  • ad hoc to paper: Global class prototypes and client class priors suffice to derive mixing weights that achieve useful per-sample personalization.
    Core design choice for the CCMP component.
invented entities (1)
  • Class-Contextualized Mixed Prompt (CCMP) · no independent evidence
    purpose: Enable per-sample prompt personalization in federated ViT tuning without storing client-specific trainable parameters.
    Newly postulated mixing mechanism that combines class-specific and global prompts via prototype-derived weights.

