FedHPro: Federated Hyper-Prototype Learning via Gradient Matching
Pith reviewed 2026-05-21 08:19 UTC · model grok-4.3
The pith
Hyper-prototypes optimized by gradient matching from client samples create a semantically consistent global signal in federated learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FedHPro defines hyper-prototypes as a set of learnable global class-wise prototypes that are optimized via gradient matching to align with class-relevant characteristics distilled directly from clients' real samples rather than through prototype-level averaging. The framework then uses these hyper-prototypes to promote inter-class separability via mutual-contrastive learning with a client-specific margin while encouraging intra-class uniformity through a consistency penalty, producing improved generalization under heterogeneous federated scenarios.
What carries the argument
Hyper-prototypes: learnable global class-wise prototypes optimized via gradient matching to class-relevant characteristics from clients' real samples.
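The paper passage quoted later on this page gives the matching objective as a cosine-style distance, LGM = 1 − g·g_HP / (‖g‖₂ ‖g_HP‖₂). A minimal plain-Python sketch of that distance follows. The function name, the flat-list representation of gradients, and the `eps` guard are illustrative assumptions, not the authors' code.

```python
import math
from typing import Sequence


def gradient_matching_loss(
    g_real: Sequence[float], g_proto: Sequence[float], eps: float = 1e-12
) -> float:
    """Cosine distance 1 - <a, b> / (||a||_2 ||b||_2) between two flattened gradients."""
    if len(g_real) != len(g_proto):
        raise ValueError("gradients must have the same length")
    dot = sum(a * b for a, b in zip(g_real, g_proto))
    norm_real = math.sqrt(sum(a * a for a in g_real))
    norm_proto = math.sqrt(sum(b * b for b in g_proto))
    # eps avoids division by zero for an all-zero gradient (assumption, not from the paper).
    return 1.0 - dot / max(norm_real * norm_proto, eps)
```

Gradients that point the same way give a loss of 0, orthogonal gradients give 1, and opposite gradients give 2, so the loss depends only on direction and not on gradient magnitude.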
If this is right
- The global signal avoids the semantic drift that arises when global prototypes are updated by averaging local ones.
- Mutual-contrastive learning with client-specific margins increases inter-class separability while the consistency penalty maintains intra-class uniformity.
- The resulting models achieve state-of-the-art accuracy on standard benchmarks under diverse heterogeneous data partitions.
- Gradient matching transfers semantic information from real samples without exchanging raw data or full local models.
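The summary does not spell out the form of the mutual-contrastive loss or how each client's margin is chosen. As a rough illustration of the general idea only, the sketch below subtracts an additive margin from the true-class similarity inside a softmax over prototype similarities. The name `margin_prototype_loss`, the `temperature` default, and the additive-margin form are all assumptions and should not be read as the paper's objective.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, 1e-12)


def margin_prototype_loss(
    feature: Sequence[float],
    prototypes: Sequence[Sequence[float]],
    label: int,
    margin: float,
    temperature: float = 0.5,
) -> float:
    """Cross-entropy over prototype similarities with a margin on the true class."""
    sims = [cosine(feature, p) for p in prototypes]
    # The margin makes the true class harder to satisfy, pushing classes apart.
    logits = [
        (s - margin if c == label else s) / temperature
        for c, s in enumerate(sims)
    ]
    # Numerically stable log-sum-exp.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[label]
```

In this form a larger `margin` raises the loss for the same feature, so a client passing a larger margin would demand a wider gap between its features and the other classes' prototypes.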
Where Pith is reading between the lines
- Gradient matching on prototypes could be tested in other distributed settings where averaging causes misalignment, such as continual learning across tasks.
- The client-specific margin mechanism suggests a way to adapt contrastive objectives locally without extra communication rounds.
- If hyper-prototypes reduce the need for many local epochs, the approach may lower total communication cost in bandwidth-limited federated deployments.
Load-bearing premise
Optimizing hyper-prototypes via gradient matching on clients' real samples preserves the underlying semantic knowledge and aligns the prototypes with class-relevant characteristics, without introducing new biases or misalignment across heterogeneous client distributions.
What would settle it
Running the method on highly non-IID client partitions and observing persistent semantic drift in the global prototypes or no performance gain over simple averaging of local prototypes would falsify the central claim.
Original abstract
Federated Learning (FL) enables collaborative training of distributed clients while protecting privacy. To enhance generalization capability in FL, prototype-based FL is in the spotlight, since shared global prototypes offer semantic anchors for aligning client-specific local prototypes. However, existing methods update global prototypes at the prototype-level via averaging local prototypes or refining global anchors, which often leads to semantic drift across clients and subsequently yields a misaligned global signal. To alleviate this issue, we introduce hyper-prototypes, defined by a set of learnable global class-wise prototypes to preserve underlying semantic knowledge across clients. The hyper-prototypes are optimized via gradient matching to align with class-relevant characteristics distilled directly from clients' real samples, rather than prototype-level descriptors. We further propose FedHPro, a Federated Hyper-Prototype Learning framework, to leverage hyper-prototypes to promote inter-class separability via mutual-contrastive learning with client-specific margin, while encouraging intra-class uniformity through a consistency penalty. Comprehensive experiments under diverse heterogeneous scenarios confirm that 1) hyper-prototypes produce a more semantically consistent global signal, and 2) FedHPro achieves state-of-the-art performance on several benchmark datasets. Code is available at https://github.com/mala-lab/FedHPro.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FedHPro, a federated learning method that introduces hyper-prototypes as learnable global class-wise prototypes. These are optimized using gradient matching to align with class-relevant features from clients' real samples, aiming to avoid semantic drift common in averaging-based prototype methods. The framework uses mutual-contrastive learning with client-specific margins for better inter-class separability and a consistency penalty for intra-class uniformity. Experiments on benchmark datasets under heterogeneous conditions are claimed to show improved semantic consistency and state-of-the-art performance.
Significance. Should the empirical results and the effectiveness of gradient matching hold up, this work offers a promising direction for enhancing semantic alignment in non-IID federated learning for computer vision. It moves beyond simple prototype averaging to a more direct distillation via gradients, which could lead to better generalization in distributed training scenarios while preserving privacy. The public code release is a strength for verification and extension.
major comments (2)
- [Methods, gradient matching formulation] The optimization of hyper-prototypes via gradient matching from heterogeneous client gradients risks incorporating local biases if not properly normalized; the paper should clarify if the matching objective includes mechanisms to prevent dominance by clients with larger sample sizes or stronger feature signals, as this is central to the claim of semantic consistency.
- [Experiments section] The SOTA performance claims require detailed quantitative results, including specific accuracy numbers, standard deviations, ablation studies on the hyper-prototype component versus baselines, and descriptions of the heterogeneity levels in the datasets used.
minor comments (2)
- [Abstract] The abstract mentions 'comprehensive experiments' but provides no specific performance metrics or dataset names, which would help readers quickly assess the claims.
- [Notation] The client-specific margin in the mutual-contrastive learning could benefit from a clearer mathematical definition or pseudocode for implementation.
Simulated Author's Rebuttal
We thank the referee for the constructive and positive review. The comments highlight important aspects of our gradient matching approach and experimental reporting. We address each major comment point by point below, with revisions to the manuscript where needed to improve clarity and completeness.
Point-by-point responses
- Referee: [Methods, gradient matching formulation] The optimization of hyper-prototypes via gradient matching from heterogeneous client gradients risks incorporating local biases if not properly normalized; the paper should clarify if the matching objective includes mechanisms to prevent dominance by clients with larger sample sizes or stronger feature signals, as this is central to the claim of semantic consistency.
Authors: We thank the referee for raising this key point on potential client dominance in gradient matching. In the current formulation, hyper-prototypes are optimized by matching against an aggregated gradient computed as the mean of per-client class gradients; this averaging step already reduces the influence of any single client. To further address heterogeneity, we will add an explicit normalization step (dividing each client's gradient contribution by its L2 norm) and a sample-size-based weighting factor in the revised Section 3. The updated text will also include a short analysis showing how these steps support semantic consistency across non-IID clients. revision: yes
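The two safeguards named in this response, per-client L2 normalization and a sample-size-based weight, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, and weighting each client in proportion to its sample count is only one possible reading of "sample-size-based weighting factor"; the rebuttal does not give the exact formula.

```python
import math
from typing import List, Sequence


def aggregate_class_gradients(
    client_grads: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
    eps: float = 1e-12,
) -> List[float]:
    """Weighted mean of L2-normalized per-client gradients for one class."""
    if not client_grads or len(client_grads) != len(client_sizes):
        raise ValueError("need one gradient and one sample count per client")
    total = float(sum(client_sizes))
    aggregated = [0.0] * len(client_grads[0])
    for grad, size in zip(client_grads, client_sizes):
        # Normalizing removes magnitude, so a strong feature signal cannot dominate.
        norm = math.sqrt(sum(x * x for x in grad))
        # Assumed weighting: proportional to the client's sample count.
        weight = size / total
        for i, x in enumerate(grad):
            aggregated[i] += weight * x / max(norm, eps)
    return aggregated
```

Since the referee's concern is dominance by large clients, a tempered or inverse weight may be closer to what the authors intend; swapping it in only changes the `weight` line.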
- Referee: [Experiments section] The SOTA performance claims require detailed quantitative results, including specific accuracy numbers, standard deviations, ablation studies on the hyper-prototype component versus baselines, and descriptions of the heterogeneity levels in the datasets used.
Authors: We agree that expanded experimental details strengthen the presentation. The original manuscript reports mean accuracies in Tables 1-3 under multiple heterogeneity settings, but we have now augmented these tables with standard deviations computed over five independent runs. A new ablation subsection (4.3) isolates the contribution of the hyper-prototype and gradient-matching components against the listed baselines. Heterogeneity is generated via Dirichlet partitioning with concentration parameters explicitly stated as alpha in {0.05, 0.1, 0.5} (Section 4.1); we have added a brief paragraph describing these levels and the resulting label distributions. revision: yes
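Dirichlet partitioning, as mentioned in this response, is commonly implemented by drawing, for each class, a vector of client proportions from a Dirichlet distribution with concentration alpha. The sketch below shows one such implementation in plain Python; the function name, the seed handling, and the rounding of split points are assumptions and may differ from the authors' released code.

```python
import random
from typing import Dict, List, Sequence


def dirichlet_partition(
    labels: Sequence[int], num_clients: int, alpha: float, seed: int = 0
) -> List[List[int]]:
    """Split sample indices across clients using per-class Dirichlet(alpha) proportions."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    clients: List[List[int]] = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A symmetric Dirichlet draw equals Gamma(alpha, 1) draws divided by their sum.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(draws)
        if total <= 0.0:  # underflow guard: fall back to an even split
            draws, total = [1.0] * num_clients, float(num_clients)
        # Turn cumulative proportions into non-decreasing split points.
        bounds = [0]
        acc = 0.0
        for d in draws[:-1]:
            acc += d / total
            bounds.append(min(len(idxs), max(bounds[-1], round(acc * len(idxs)))))
        bounds.append(len(idxs))
        for k in range(num_clients):
            clients[k].extend(idxs[bounds[k]:bounds[k + 1]])
    return clients


if __name__ == "__main__":
    toy_labels = [i % 10 for i in range(1000)]  # hypothetical balanced 10-class labels
    for a in (0.05, 0.1, 0.5):
        parts = dirichlet_partition(toy_labels, num_clients=10, alpha=a)
        print(a, [len(p) for p in parts])
```

Smaller alpha concentrates each class on fewer clients, so alpha = 0.05 gives the most skewed label distributions of the three stated settings and alpha = 0.5 the least.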
Circularity Check
No significant circularity; derivation remains self-contained
The paper defines hyper-prototypes as learnable global class-wise prototypes and describes their optimization through gradient matching on client samples to mitigate semantic drift. No equations, derivations, or self-citations are visible in the provided text that reduce any central claim to a fitted input or prior self-result by construction. The approach is presented as an independent mechanism for alignment rather than a renaming or tautological redefinition of inputs, making the overall chain non-circular under the specified criteria.
Axiom & Free-Parameter Ledger
free parameters (1)
- client-specific margin
axioms (1)
- domain assumption: Gradient matching from real client samples aligns hyper-prototypes with class-relevant characteristics
invented entities (1)
- hyper-prototypes: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: hyper-prototypes ... optimized via gradient matching to align with class-relevant characteristics distilled directly from clients’ real samples ... LGM(gc, gc_HP) = 1 − gc·gc_HP / (∥gc∥2 ∥gc_HP∥2)
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: HPCL ... mutual-contrastive learning with client-specific margin ... HPAL ... consistency penalty
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Gong, B., Shi, Y., Sha, F., and Grauman, K. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pp. 2066–2073,
- [2] Wang, C., Liu, Y., Chen, Y., Liu, F., Tian, Y., McCarthy, D., Frazer, H., and Carneiro, G. Learning support and trivial prototypes for interpretable image classification. In ICCV, pp. 2062–2072, 2023a. Wang, H., Yurochkin, M., Sun, Y., Papailiopoulos, D., and Yasaman, K. Federated learning with matched averaging. In ICLR,
- [3] A. Algorithm Pseudo-code Flow. In this section, we describe the pseudo-code of our FedHPro in Algorithm 1: 1) Server-Side: we optimize the simulated hyper-prototypes via gradient matching; 2) Client-Side: we leverage the hyper-prototypes to promote FL local training. Algorithm 1 FedHPro: Fe...
- [4] for the TinyImageNet dataset. The label skew heterogeneity level of clients is controlled by the standard deviation α of the Dirichlet distribution, and the quantity skew heterogeneity level is controlled by the index ratio ρ between the sample sizes of the most frequent and the least frequent class. The lower α and higher ρ these are, the more heterogeneous ...
- [5] is used for skin lesion classification and contains 8,912 training samples and 1,103 testing samples with 7 categories, and each sample’s size is scaled to 224×224. Then, based on (Kaidi et al., 2019), we further build CIFAR10-LT, CIFAR100-LT, and TinyImageNet-LT datasets, as unbalanced datasets with long-tailed level ρ; ρ = 1 denotes globally balanced...
- [6] (α as the non-IID level); NID2 is a more extreme setting consisting of 6 biased clients (each with a single class) and 1 client with all classes. 2) Quantity Skew: we shape the original dataset into a long-tailed distribution following (Kaidi et al., 2019), and ρ denotes the ratio between the sample sizes of the most frequent and the least frequent class. 3) Domain Skew: we se...
- [7] Vanilla FL with data heterogeneity: FedAvg (McMahan et al., 2017), FedProx (Li et al., 2020b)
- [8] Contrastive-based FL baselines: MOON (Li et al., 2021), FedRCL (Seo et al., 2024)
- [9] Prototype-based FL baselines: FedSA (Zhou et al., 2025), FedProto (Tan et al., 2022a), FedTGP (Zhang et al., 2024a), and FedGMKD (Zhang et al., 2024b). Prototype-based FL Baselines: Among these FL methods, FedProto (Tan et al., 2022a), FedTGP (Zhang et al., 2024a), FedGMKD (Zhang et al., 2024b), and FedSA (Zhou et al.,
- [10] Built upon the HPCL and HPAL modules, FedHPro enhances inter-class separability and enforces intra-class uniformity in local training under heterogeneous FL, yielding improvements over FL baselines. Table A1. Results on CIFAR10, CIFAR100, HAM10000, TinyImageNet with Label Skew. † denotes the results obtained by exchanging both the prototypes and model param...
- [11] and Office-Caltech (Caltech: 3, Webcam: 1, Amazon: 2, DSLR: 4). Each client’s dataset is randomly selected from the total samples of the corresponding domain, as 1% for Digits and 20% for ... Table A5. Ablation study of the number of Clients K and Local Epochs E. Left: Clients K on TinyImageNet ...
- [12] are popular prototype-based FL methods. However, as shown in Table A7, implementing FedProto, FedTGP, and FedSA following their original prototype-only communication setting consistently leads to poor performance. To ensure a fair comparison with mainstream FL baselines that exchange model parameters (e.g., FedRCL), in our experiments, we report enhanced ...
- [13] dataset.

  | AG News settings | FedAvg | FedProx | FedSA | FedSA† | FedHPro |
  |---|---|---|---|---|---|
  | AG-NIDK10 | 82.09 | 80.43 | 75.13 | 84.03 | 86.78 |
  | AG-NIDK50 | 52.71 | 55.14 | 47.53 | 56.92 | 63.06 |

  | SUN397 settings | FedAvg | FedRCL | FedSA | FedSA† | FedHPro |
  |---|---|---|---|---|---|
  | NID1_0.2 | 68.92 | 70.23 | 61.58 | 70.86 | 73.41 |
  | NID1_0.5 | 70.61 | 73.26 | 65.30 | 72.18 | 75.22 |

  FedHPro outperforms FedProto by 16.67%, FedTGP by 7.04%, and FedSA by 5.57%, indicating that th...
- [14] with a hidden dimension of 32 for AG News, and train the model via the Adam optimizer with a learning rate of 0.01. Following (Wang et al., 2024), we consider two non-IID scenarios on AG News: