
arxiv: 2605.13475 · v2 · pith:JC3BMPH4new · submitted 2026-05-13 · 💻 cs.CV

FedHPro: Federated Hyper-Prototype Learning via Gradient Matching

Pith reviewed 2026-05-21 08:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords federated learning · hyper-prototypes · gradient matching · contrastive learning · semantic consistency · heterogeneous data · prototype-based methods

The pith

Hyper-prototypes optimized by gradient matching from client samples create a semantically consistent global signal in federated learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Federated learning lets devices train a shared model while keeping their data private. Existing prototype methods create global anchors by averaging or refining local prototypes, but this produces semantic drift when client data distributions differ. The paper defines hyper-prototypes as learnable global class-wise prototypes and tunes them by matching gradients to features extracted from clients' actual samples. These anchors then drive mutual-contrastive learning with per-client margins to push classes apart and a consistency penalty to pull same-class samples together. A sympathetic reader would care because the result could yield stronger models for privacy-sensitive tasks where data is naturally uneven across participants.

Core claim

FedHPro defines hyper-prototypes as a set of learnable global class-wise prototypes that are optimized via gradient matching to align with class-relevant characteristics distilled directly from clients' real samples rather than through prototype-level averaging. The framework then uses these hyper-prototypes to promote inter-class separability via mutual-contrastive learning with a client-specific margin while encouraging intra-class uniformity through a consistency penalty, producing improved generalization under heterogeneous federated scenarios.

What carries the argument

Hyper-prototypes: learnable global class-wise prototypes optimized via gradient matching to class-relevant characteristics from clients' real samples.
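The gradient-matching idea can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation. The assumptions are: the classifier is a single linear layer, the target gradient is the mean cross-entropy gradient over one class's real features, the matching distance is cosine distance, and the prototype is updated by finite-difference gradient descent with step halving. All function names (`classifier_gradient`, `matching_loss`, `fit_hyper_prototype`) are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def classifier_gradient(weights, feature, label):
    """Flattened cross-entropy gradient w.r.t. a linear classifier (C x D).
    Assumption: gradients are taken w.r.t. the classifier weights only."""
    logits = [sum(w * f for w, f in zip(row, feature)) for row in weights]
    probs = softmax(logits)
    return [(probs[c] - (1.0 if c == label else 0.0)) * f
            for c in range(len(weights)) for f in feature]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-12)


def matching_loss(weights, prototype, label, target_grad):
    """Distance between the prototype's 'virtual' gradient and the target.
    Assumption: cosine distance; the paper's exact objective may differ."""
    virtual_grad = classifier_gradient(weights, prototype, label)
    return cosine_distance(virtual_grad, target_grad)


def fit_hyper_prototype(weights, prototype, label, target_grad,
                        steps=100, lr=0.5, eps=1e-5):
    """Finite-difference descent; a step is kept only if it lowers the loss."""
    prototype = list(prototype)
    loss = matching_loss(weights, prototype, label, target_grad)
    for _ in range(steps):
        grad = []
        for i in range(len(prototype)):
            up, down = prototype[:], prototype[:]
            up[i] += eps
            down[i] -= eps
            grad.append((matching_loss(weights, up, label, target_grad)
                         - matching_loss(weights, down, label, target_grad))
                        / (2 * eps))
        candidate = [p - lr * g for p, g in zip(prototype, grad)]
        candidate_loss = matching_loss(weights, candidate, label, target_grad)
        if candidate_loss < loss:
            prototype, loss = candidate, candidate_loss
        else:
            lr *= 0.5  # shrink the step when no improvement is found
    return prototype, loss


# Toy usage: 3 classes, 4-dimensional features, class label 1.
rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
real_feats = [[rng.gauss(1.0, 0.3) for _ in range(4)] for _ in range(8)]
grads = [classifier_gradient(W, f, 1) for f in real_feats]
target = [sum(col) / len(grads) for col in zip(*grads)]
init = [rng.uniform(-1, 1) for _ in range(4)]
start_loss = matching_loss(W, init, 1, target)
proto, end_loss = fit_hyper_prototype(W, init, 1, target)
print(start_loss, end_loss)
```

In this toy setting only the gradient vector `target` crosses from the "client" side to the "server" side; the raw features in `real_feats` are used solely to compute it.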

If this is right

  • The global signal avoids the semantic drift that arises when global prototypes are updated by averaging local ones.
  • Mutual-contrastive learning with client-specific margins increases inter-class separability while the consistency penalty maintains intra-class uniformity.
  • The resulting models achieve state-of-the-art accuracy on standard benchmarks under diverse heterogeneous data partitions.
  • Gradient matching transfers semantic information from real samples without exchanging raw data or full local models.
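The second bullet can be made concrete with a hedged sketch of a prototype-based contrastive loss plus a consistency term. The exact form of the paper's objective is not given on this page, so everything below is an assumption: cosine similarity to each hyper-prototype, a temperature `tau`, an additive margin subtracted from the positive similarity, and a squared-distance consistency penalty. The sketch is one-directional and does not attempt to reproduce the "mutual" part of the method. All names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def margin_contrastive_loss(feature, label, prototypes, tau=0.5, margin=0.0):
    """Softmax cross-entropy over prototype similarities.
    Assumption: the client-specific margin is subtracted from the positive
    similarity, which makes the positive pair harder to satisfy."""
    logits = []
    for c, proto in enumerate(prototypes):
        sim = cosine(feature, proto)
        if c == label:
            sim -= margin
        logits.append(sim / tau)
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[label]


def consistency_penalty(feature, label, prototypes):
    """Assumption: squared Euclidean distance to the same-class prototype."""
    proto = prototypes[label]
    return sum((f - p) ** 2 for f, p in zip(feature, proto))


def local_loss(feature, label, prototypes, tau=0.5, margin=0.1, lam=1.0):
    """Weighted sum of the two terms; lam is a hypothetical trade-off weight."""
    return (margin_contrastive_loss(feature, label, prototypes, tau, margin)
            + lam * consistency_penalty(feature, label, prototypes))


# Toy usage with three 2-D prototypes.
protos = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(local_loss([0.9, 0.1], 0, protos))
```

Under these assumptions a larger margin strictly increases the contrastive term for a fixed feature, so a client with a larger margin has to separate classes more aggressively to reach the same loss.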

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Gradient matching on prototypes could be tested in other distributed settings where averaging causes misalignment, such as continual learning across tasks.
  • The client-specific margin mechanism suggests a way to adapt contrastive objectives locally without extra communication rounds.
  • If hyper-prototypes reduce the need for many local epochs, the approach may lower total communication cost in bandwidth-limited federated deployments.

Load-bearing premise

That optimizing hyper-prototypes via gradient matching from clients' real samples will preserve underlying semantic knowledge and align with class-relevant characteristics without introducing new biases or misalignment across heterogeneous client distributions.

What would settle it

Running the method on highly non-IID client partitions and observing persistent semantic drift in the global prototypes or no performance gain over simple averaging of local prototypes would falsify the central claim.
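A common way to construct highly non-IID client partitions for such a test is Dirichlet label partitioning, where a smaller concentration parameter gives more skewed clients. The sketch below is a standard-library illustration under stated assumptions, not the paper's data pipeline: the function names and the rounding scheme are hypothetical.

```python
import random
from collections import defaultdict


def sample_dirichlet(alpha, k, rng):
    """Symmetric Dirichlet sample via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total <= 0.0:  # guard against numerical underflow for tiny alpha
        return [1.0 / k] * k
    return [d / total for d in draws]


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients, class by class.
    Lower alpha concentrates each class on fewer clients."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        props = sample_dirichlet(alpha, num_clients, rng)
        start, cum = 0, 0.0
        for k, p in enumerate(props):
            cum += p
            if k == num_clients - 1:
                end = len(idxs)  # last client takes the remainder
            else:
                end = min(len(idxs), round(cum * len(idxs)))
            end = max(end, start)
            clients[k].extend(idxs[start:end])
            start = end
    return clients


# Toy usage: 1000 samples, 10 classes, 5 clients, strong label skew.
labels = [i % 10 for i in range(1000)]
parts = dirichlet_partition(labels, num_clients=5, alpha=0.1, seed=1)
print([len(p) for p in parts])
```

Every index is assigned to exactly one client, so the partition can be reused across methods to compare hyper-prototypes against simple prototype averaging on identical splits.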

Figures

Figures reproduced from arXiv: 2605.13475 by Di Wu, Guansong Pang, Haoran Li, Huan Wang, Jun Shen, Jun Yan, Ousman Manjang, Yanlong Zhai, Zhenyu Yang.

Figure 1: Illustration of heterogeneous FL with domain skew. The Vanilla column visualizes the feature distribution of a standard prototype-based FedProto (Tan et al., 2022a), showing failures in hard domains like SYN. In contrast, the Proposed column shows our approach achieves a larger inter-class distance and a smaller intra-class distance in such domains. Due to its simple aggregation, the prototypes in Vanilla …
Figure 2: Top: The L2 distance of centralized prototypes calculated by centralized training using all clients’ samples to global prototypes from FedAvg (McMahan et al., 2017) and FedProto (Tan et al., 2022a), and from our hyper-prototypes. Bottom: The corresponding accuracy on the Digits (Peng et al., 2019) dataset, ‘Using HP’ means replacing the global prototypes with our hyper-prototypes in the FedProto’s loss fu…
Figure 3: Visualization for the representation space of different prototypes on Digits (Peng et al., 2019). Each color indicates one class, and each shape denotes one domain. Centralized is trained on all clients’ samples, as an upper-bound reference for prototype quality. The global prototypes of FedAvg and FedProto fail to describe diverse domain information, while our hyper-prototypes promote better sema…
Figure 4: Loss trend of L_GM (Equation (8)) under different FL scenarios: CIFAR10 (Krizhevsky et al., 2009) with NID1_0.5 (label skew), CIFAR10-LT (Krizhevsky et al., 2009) with ρ = 50 (quantity skew), Digits (Peng et al., 2019) (domain skew).
Adjacent body text captured with the caption: where we set L_vir as the cross-entropy loss and y_vir as the corresponding class label c, and h* denotes the classifier of the global model w*. Based on g^c ∈ G from real sam…
Figure 5: Framework illustration of Federated Hyper-Prototype Learning (FedHPro). The clients upload the gradients {g_1, ..., g_k} (Equation (5)) to the server. Based on these gradients from local clients, we leverage a set of learnable units to simulate hyper-prototypes, capturing class-relevant semantic properties from real samples via gradient matching (L_GM in Equation (8)) to enhance generalizability. Then, …
Figure 6: T-SNE visualization on Digits (Top) and Office-Caltech (Bottom). Each color represents one class, each shape represents one domain, and the stars represent semantic centers.
Figure 8: Analysis of hyper-prototypes lengths |I| of Equation (7) and rounds M of Equation (9) for Digits under domain skew (Left) and CIFAR10-LT under quantity skew (Right).
Figure 7: Settings: label skew (CIFAR10 with NID1_0.5), quantity skew (CIFAR10-LT with NID1_0.5, ρ = 50), domain skew (Digits). Left: different τ of Equation (12), dotted lines denote the FedAvg. Right: with or without (w/o) the margin d_k of Equation (12).
Adjacent body text captured with the caption: FedHPro yields a larger inter-class distance (i.e., different colors are more separated) and a reduced intra-class distance (i.e., different shapes of the same colo…
Figure 9: Comparison of convergence in the Average Accuracy Trend on the Digits (Left) and Office-Caltech (Right). We also provide quantitative results for the convergence rates in Table A4.
Original abstract

Federated Learning (FL) enables collaborative training of distributed clients while protecting privacy. To enhance generalization capability in FL, prototype-based FL is in the spotlight, since shared global prototypes offer semantic anchors for aligning client-specific local prototypes. However, existing methods update global prototypes at the prototype-level via averaging local prototypes or refining global anchors, which often leads to semantic drift across clients and subsequently yields a misaligned global signal. To alleviate this issue, we introduce hyper-prototypes, defined by a set of learnable global class-wise prototypes to preserve underlying semantic knowledge across clients. The hyper-prototypes are optimized via gradient matching to align with class-relevant characteristics distilled directly from clients' real samples, rather than prototype-level descriptors. We further propose FedHPro, a Federated Hyper-Prototype Learning framework, to leverage hyper-prototypes to promote inter-class separability via mutual-contrastive learning with client-specific margin, while encouraging intra-class uniformity through a consistency penalty. Comprehensive experiments under diverse heterogeneous scenarios confirm that 1) hyper-prototypes produce a more semantically consistent global signal, and 2) FedHPro achieves state-of-the-art performance on several benchmark datasets. Code is available at https://github.com/mala-lab/FedHPro.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes FedHPro, a federated learning method that introduces hyper-prototypes as learnable global class-wise prototypes. These are optimized using gradient matching to align with class-relevant features from clients' real samples, aiming to avoid semantic drift common in averaging-based prototype methods. The framework uses mutual-contrastive learning with client-specific margins for better inter-class separability and a consistency penalty for intra-class uniformity. Experiments on benchmark datasets under heterogeneous conditions are claimed to show improved semantic consistency and state-of-the-art performance.

Significance. Should the empirical results and the effectiveness of gradient matching hold up, this work offers a promising direction for enhancing semantic alignment in non-IID federated learning for computer vision. It moves beyond simple prototype averaging to a more direct distillation via gradients, which could lead to better generalization in distributed training scenarios while preserving privacy. The public code release is a strength for verification and extension.

major comments (2)
  1. [Methods, gradient matching formulation] The optimization of hyper-prototypes via gradient matching from heterogeneous client gradients risks incorporating local biases if not properly normalized; the paper should clarify if the matching objective includes mechanisms to prevent dominance by clients with larger sample sizes or stronger feature signals, as this is central to the claim of semantic consistency.
  2. [Experiments section] The SOTA performance claims require detailed quantitative results, including specific accuracy numbers, standard deviations, ablation studies on the hyper-prototype component versus baselines, and descriptions of the heterogeneity levels in the datasets used.
minor comments (2)
  1. [Abstract] The abstract mentions 'comprehensive experiments' but provides no specific performance metrics or dataset names, which would help readers quickly assess the claims.
  2. [Notation] The client-specific margin in the mutual-contrastive learning could benefit from a clearer mathematical definition or pseudocode for implementation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and positive review. The comments highlight important aspects of our gradient matching approach and experimental reporting. We address each major comment point by point below, with revisions to the manuscript where needed to improve clarity and completeness.

Point-by-point responses
  1. Referee: [Methods, gradient matching formulation] The optimization of hyper-prototypes via gradient matching from heterogeneous client gradients risks incorporating local biases if not properly normalized; the paper should clarify if the matching objective includes mechanisms to prevent dominance by clients with larger sample sizes or stronger feature signals, as this is central to the claim of semantic consistency.

    Authors: We thank the referee for raising this key point on potential client dominance in gradient matching. In the current formulation, hyper-prototypes are optimized by matching against an aggregated gradient computed as the mean of per-client class gradients; this averaging step already reduces the influence of any single client. To further address heterogeneity, we will add an explicit normalization step (dividing each client's gradient contribution by its L2 norm) and a sample-size-based weighting factor in the revised Section 3. The updated text will also include a short analysis showing how these steps support semantic consistency across non-IID clients. revision: yes

  2. Referee: [Experiments section] The SOTA performance claims require detailed quantitative results, including specific accuracy numbers, standard deviations, ablation studies on the hyper-prototype component versus baselines, and descriptions of the heterogeneity levels in the datasets used.

    Authors: We agree that expanded experimental details strengthen the presentation. The original manuscript reports mean accuracies in Tables 1-3 under multiple heterogeneity settings, but we have now augmented these tables with standard deviations computed over five independent runs. A new ablation subsection (4.3) isolates the contribution of the hyper-prototype and gradient-matching components against the listed baselines. Heterogeneity is generated via Dirichlet partitioning with concentration parameters explicitly stated as alpha in {0.05, 0.1, 0.5} (Section 4.1); we have added a brief paragraph describing these levels and the resulting label distributions. revision: yes
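The aggregation described in the first response (a mean of per-client class gradients, with a proposed per-client L2 normalization and a sample-size weighting) can be sketched as follows. This is only an illustration of that description: the function names are hypothetical, and weighting proportionally to sample size is an assumption, since the response does not say how the weighting factor is defined.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a gradient vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def aggregate_class_gradients(client_grads, client_sizes=None, normalize=True):
    """Combine one class's flattened gradients from several clients.

    client_grads: list of equal-length gradient vectors, one per client.
    client_sizes: optional per-client sample counts for this class.
                  Assumption: weights are proportional to these counts.
    normalize:    if True, each client's gradient is L2-normalized first,
                  so a large-magnitude gradient cannot dominate the mean.
    """
    if client_sizes is None:
        weights = [1.0 / len(client_grads)] * len(client_grads)
    else:
        total = float(sum(client_sizes))
        weights = [n / total for n in client_sizes]
    if normalize:
        client_grads = [l2_normalize(g) for g in client_grads]
    dim = len(client_grads[0])
    return [sum(w * g[i] for w, g in zip(weights, client_grads))
            for i in range(dim)]


# Toy usage: one client has a gradient 100x larger than the other.
grads = [[100.0, 0.0], [0.0, 1.0]]
print(aggregate_class_gradients(grads, normalize=False))  # plain mean
print(aggregate_class_gradients(grads, normalize=True))   # balanced directions
```

Note that proportional weighting pulls the aggregate toward larger clients, which is the opposite of what the referee's concern about dominance would suggest; an inverse or capped weighting would be an equally plausible reading of the response.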

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

Full rationale

The paper defines hyper-prototypes as learnable global class-wise prototypes and describes their optimization through gradient matching on client samples to mitigate semantic drift. No equations, derivations, or self-citations are visible in the provided text that reduce any central claim to a fitted input or prior self-result by construction. The approach is presented as an independent mechanism for alignment rather than a renaming or tautological redefinition of inputs, making the overall chain non-circular under the specified criteria.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the unproven effectiveness of gradient matching for semantic alignment and the utility of client-specific margins. Details on any fitted parameters or background assumptions are absent from the abstract.

free parameters (1)
  • client-specific margin
    Abstract describes its use in mutual-contrastive learning; value is likely chosen or tuned per client or scenario.
axioms (1)
  • domain assumption: Gradient matching from real client samples aligns hyper-prototypes with class-relevant characteristics
    This is the core optimization premise stated in the abstract.
invented entities (1)
  • hyper-prototypes (no independent evidence)
    purpose: To preserve underlying semantic knowledge across clients as learnable global class-wise prototypes
    New concept introduced to replace direct prototype averaging and reduce semantic drift.

pith-pipeline@v0.9.0 · 5789 in / 1375 out tokens · 54692 ms · 2026-05-21T08:19:52.636108+00:00 · methodology



Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

  1. [1]

    Geodesic flow kernel for unsupervised domain adaptation

    Gong, B., Shi, Y., Sha, F., and Grauman, K. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pp. 2066–2073,

  2. [2]

    Learning support and trivial prototypes for interpretable image classification

    Wang, C., Liu, Y., Chen, Y., Liu, F., Tian, Y., McCarthy, D., Frazer, H., and Carneiro, G. Learning support and trivial prototypes for interpretable image classification. In ICCV, pp. 2062–2072, 2023a.
    Wang, H., Yurochkin, M., Sun, Y., Papailiopoulos, D., and Yasaman, K. Federated learning with matched averaging. In ICLR,

  3. [3]

    A. Algorithm Pseudo-code Flow. In this section, we describe the pseudo-code of our FedHPro in Algorithm 1: 1) Server-Side: we optimize the simulated hyper-prototypes via gradient matching; 2) Client-Side: we leverage the hyper-prototypes to promote FL local training. Algorithm 1 FedHPro: Fe...

  4. [4]

    for the TinyImageNet dataset. The label skew heterogeneity level of clients is controlled by the standard deviation α of the Dirichlet distribution, and the quantity skew heterogeneity level is controlled by the index ratio ρ between the sample sizes of the most frequent and the least frequent class. The lower α and higher ρ these are, the more heterogeneous ...

  5. [5]

    is used for skin lesion classification and contains 8,912 training samples and 1,103 testing samples with 7 categories, and each sample’s size is scaled to 224×224. Then, based on (Kaidi et al., 2019), we further build CIFAR10-LT, CIFAR100-LT, and TinyImageNet-LT datasets, as the unbalanced datasets with long-tailed level ρ; ρ = 1 denotes globally balanced...

  6. [6]

    (α as the non-IID level); NID2 is a more extreme setting consisting of 6 biased clients (each has a single class) and 1 client that has all classes. 2) Quantity Skew: we shape the original dataset into a long-tailed distribution following (Kaidi et al., 2019), and ρ means the ratio between the sample sizes of the most frequent and the least frequent class. 3) Domain Skew: we se...

  7. [7]

    Vanilla FL with data heterogeneity: FedAvg (McMahan et al., 2017), FedProx (Li et al., 2020b)

  8. [8]

    Contrastive-based FL baselines: MOON (Li et al., 2021), FedRCL (Seo et al., 2024)

  9. [9]

    Prototype-based FL Baselines: Among these FL methods, FedProto (Tan et al., 2022a), FedTGP (Zhang et al., 2024a), FedGMKD (Zhang et al., 2024b), and FedSA (Zhou et al.,

    Prototype-based FL baselines: FedSA (Zhou et al., 2025), FedProto (Tan et al., 2022a), FedTGP (Zhang et al., 2024a), and FedGMKD (Zhang et al., 2024b). Among these FL methods, FedProto (Tan et al., 2022a), FedTGP (Zhang et al., 2024a), FedGMKD (Zhang et al., 2024b), and FedSA (Zhou et al.,

  10. [10]

    Table A1. Results on CIFAR10, CIFAR100, HAM10000, TinyImageNet with Label Skew

    Built upon the HPCL and HPAL modules, FedHPro enhances inter-class separability and enforces intra-class uniformity in local training under heterogeneous FL, yielding improvements over FL baselines. Table A1. Results on CIFAR10, CIFAR100, HAM10000, TinyImageNet with Label Skew. † denotes the results obtained by exchanging both the prototypes and model param...

  11. [11]

    and Office-Caltech (Caltech: 3, Webcam: 1, Amazon: 2, DSLR: 4). Each client’s dataset is randomly selected from the total samples of the corresponding domain, as 1% for Digits and 20% for
    Table A5. Ablation study of the number of Clients K and Local Epochs E. Left: Clients K on TinyImageNet ...

  12. [12]

    However, as shown in Table A7, implementing FedProto, FedTGP, and FedSA following their original prototype-only communication setting consistently leads to poor performance

    are popular prototype-based FL methods. However, as shown in Table A7, implementing FedProto, FedTGP, and FedSA following their original prototype-only communication setting consistently leads to poor performance. To ensure a fair comparison with mainstream FL baselines that exchange model parameters (e.g., FedRCL), in our experiments, we report enhanced ...

  13. [13]

    dataset.

    AG News
    Settings     FedAvg   FedProx   FedSA   FedSA†   FedHPro
    AG-NIDK10    82.09    80.43     75.13   84.03    86.78
    AG-NIDK50    52.71    55.14     47.53   56.92    63.06

    SUN397
    Settings     FedAvg   FedRCL    FedSA   FedSA†   FedHPro
    NID1_0.2     68.92    70.23     61.58   70.86    73.41
    NID1_0.5     70.61    73.26     65.30   72.18    75.22

    FedHPro outperforms FedProto by 16.67%, FedTGP by 7.04%, and FedSA by 5.57%, indicating that th...

  14. [14]

    Referring to (Wang et al., 2024), we consider two non-iid scenarios on the AG News:

    with a hidden dimension of 32 for AG News, and train the model via Adam optimizer with a learning rate of 0.01. Referring to (Wang et al., 2024), we consider two non-iid scenarios on the AG News: