pith. machine review for the scientific record.

arxiv: 2605.13475 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: 2 Lean theorem links

FedHPro: Federated Hyper-Prototype Learning via Gradient Matching

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords Federated Learning · Prototype Learning · Gradient Matching · Hyper-prototypes · Semantic Consistency · Data Heterogeneity · Contrastive Learning

The pith

Hyper-prototypes aligned by gradient matching from client samples reduce semantic drift in federated prototype learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In federated learning, existing prototype methods update global anchors by averaging local prototypes, which produces semantic drift when client data distributions differ. The paper replaces this with hyper-prototypes, a set of learnable global class-wise prototypes that are adjusted by matching gradients computed on each client's real samples. FedHPro then uses these aligned hyper-prototypes inside a mutual-contrastive objective with client-specific margins and an added consistency penalty to increase inter-class separation while keeping intra-class features uniform. If the alignment step succeeds, clients receive a global signal that stays semantically stable across rounds and improves final model accuracy on heterogeneous benchmarks.

Core claim

Hyper-prototypes are learnable global class-wise prototypes optimized via gradient matching to class-relevant characteristics distilled from clients' real samples rather than through prototype-level averaging, and FedHPro leverages them to promote inter-class separability via mutual-contrastive learning with client-specific margins while encouraging intra-class uniformity through a consistency penalty.

What carries the argument

Hyper-prototypes updated by gradient matching on real client samples, which supplies the global signal for contrastive alignment.
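The review does not reproduce the gradient-matching objective (L_GM, Equation (8)), so the following is only an illustrative numpy sketch under assumed shapes: a linear classifier W, per-class gradients g_c computed from a client's real samples, and a server-side update that nudges each hyper-prototype until the gradient it induces on W approaches g_c. The function names and the finite-difference choice are ours, not the paper's.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_grad_W(W, X, y):
    """Gradient of mean cross-entropy w.r.t. a linear classifier W (C x D),
    evaluated on samples X with labels y (cf. the per-class gradients g_c
    the clients upload, Equation (5))."""
    P = softmax(X @ W.T)                  # (N, C) class probabilities
    P[np.arange(len(y)), y] -= 1.0        # softmax minus one-hot
    return P.T @ X / len(y)               # (C, D)

def gradient_matching_step(protos, W, real_grads, lr=0.01, eps=1e-4):
    """One illustrative server-side update: move each class's hyper-prototype
    so the gradient it induces on W approaches the real-sample gradient g_c.
    Finite differences stand in for autodiff; this is a reconstruction of
    the idea, not the paper's exact L_GM."""
    protos = protos.copy()
    for c, g_real in real_grads.items():
        def match_loss(p):
            g_vir = ce_grad_W(W, p[None, :], np.array([c]))
            return float(((g_vir - g_real) ** 2).sum())
        base = match_loss(protos[c])
        grad = np.zeros_like(protos[c])
        for d in range(len(grad)):        # forward-difference gradient
            p = protos[c].copy()
            p[d] += eps
            grad[d] = (match_loss(p) - base) / eps
        protos[c] -= lr * grad
    return protos
```

Each step shrinks the gap between the virtual gradient induced by the hyper-prototype and the gradient observed on real data, which is the alignment mechanism the paper credits for avoiding drift.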

If this is right

  • Hyper-prototypes produce a more semantically consistent global signal across clients than averaged prototypes.
  • FedHPro reaches state-of-the-art accuracy on several benchmark datasets under diverse heterogeneous scenarios.
  • Mutual-contrastive learning with client-specific margins increases inter-class separability.
  • The consistency penalty improves intra-class uniformity in the learned representations.
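The margin and consistency terms can be sketched as follows. The Figure 7 caption mentions a temperature τ and a margin d_k in Equation (12), but the equations themselves are not reproduced here, so both functional forms below are assumed reconstructions in the spirit of the description, not the paper's losses.

```python
import numpy as np

def contrastive_with_margin(z, y, protos, tau=0.5, margin=0.1):
    """Pull each feature toward its class's hyper-prototype and push it
    away from the others; the (client-specific) margin is subtracted from
    the positive logit before the softmax. Assumed form."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    sims = z @ p.T / tau                   # (N, C) scaled cosine similarities
    idx = np.arange(len(y))
    sims[idx, y] -= margin / tau           # margin on the positive pair
    pos = sims[idx, y]
    return float(np.mean(np.log(np.exp(sims).sum(axis=1)) - pos))

def consistency_penalty(z, y, protos):
    """Intra-class uniformity sketch: mean squared distance of each
    feature to its class's hyper-prototype. Assumed form."""
    return float(np.mean(np.sum((z - protos[y]) ** 2, axis=1)))
```

A larger margin makes the positive pair look harder, which forces more inter-class separation; the penalty term keeps features of one class clustered around the shared anchor.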

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Gradient matching may reduce the number of communication rounds needed for convergence by supplying stronger global guidance early in training.
  • The approach could transfer to other modalities if the gradient signals continue to encode class semantics reliably.
  • Any added local computation for gradient matching must be weighed against the observed reduction in semantic drift.

Load-bearing premise

Matching gradients computed on real client samples will align hyper-prototypes more reliably than averaging local prototypes without introducing new privacy leakage or optimization instability.

What would settle it

An experiment in which FedHPro's reported accuracy gains disappear when the gradient-matching update is replaced by direct averaging of local prototypes under the same heterogeneous data partitions and communication budget.
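For concreteness, the averaging baseline such an ablation would swap in can be written in a few lines; the two-client toy below (all numbers invented, not from the paper) shows why prototype-level averaging can drift under domain skew.

```python
import numpy as np

def averaged_prototype(client_feats):
    """FedProto-style baseline: the global prototype of a class is the
    mean of per-client class means, i.e. the prototype-level averaging
    that FedHPro's gradient-matching update replaces."""
    return np.mean([f.mean(axis=0) for f in client_feats], axis=0)

# Toy domain skew (invented): two clients observe the same class in
# feature regions centered at +2 and -2.
rng = np.random.default_rng(0)
client_a = rng.normal(loc=+2.0, size=(50, 2))
client_b = rng.normal(loc=-2.0, size=(50, 2))
proto = averaged_prototype([client_a, client_b])
# The averaged anchor lands near the origin, far from both domains'
# modes: a concrete instance of the misaligned global signal.
```

Running the full ablation would pit this update against the gradient-matched one under identical partitions and communication budgets and compare final accuracy.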

Figures

Figures reproduced from arXiv: 2605.13475 by Di Wu, Guansong Pang, Haoran Li, Huan Wang, Jun Shen, Jun Yan, Ousman Manjang, Yanlong Zhai, Zhenyu Yang.

Figure 1
Figure 1: Illustration of heterogeneous FL with domain skew. The Vanilla column visualizes the feature distribution of a standard prototype-based FedProto (Tan et al., 2022a), showing failures in hard domains like SYN. In contrast, the Proposed column shows our approach achieves a larger inter-class distance and a smaller intra-class distance in such domains. Due to its simple aggregation, the prototypes in Vanilla … view at source ↗
Figure 2
Figure 2: Top: The L2 distance of centralized prototypes calculated by centralized training using all clients' samples to global prototypes from FedAvg (McMahan et al., 2017) and FedProto (Tan et al., 2022a), and from our hyper-prototypes. Bottom: The corresponding accuracy on the Digits (Peng et al., 2019) dataset; 'Using HP' means replacing the global prototypes with our hyper-prototypes in the FedProto's loss fu… view at source ↗
Figure 3
Figure 3: Visualization for the representation space of different prototypes on Digits (Peng et al., 2019). Each color indicates one class, and each shape denotes one domain. Centralized is trained on all clients' samples, as an upper-bound reference for prototype quality. The global prototypes of FedAvg and FedProto fail to describe diverse domain information, while our hyper-prototypes promote better sema… view at source ↗
Figure 4
Figure 4: Loss trend of L_GM (Equation (8)) under different FL scenarios: CIFAR10 (Krizhevsky et al., 2009) with NID10.5 (label skew), CIFAR10-LT (Krizhevsky et al., 2009) with ρ = 50 (quantity skew), and Digits (Peng et al., 2019) (domain skew), where we set L_vir as the cross-entropy loss and y_vir as the corresponding class label c, and h∗ denotes the classifier of the global model w∗. Based on g_c ∈ G from real sam… view at source ↗
Figure 5
Figure 5: Framework illustration of Federated Hyper-Prototype Learning (FedHPro). The clients upload the gradients {g1, ..., gk} (Equation (5)) to the server. Based on these gradients from local clients, we leverage a set of learnable units to simulate hyper-prototypes, capturing class-relevant semantic properties from real samples via gradient matching (L_GM in Equation (8)) to enhance generalizability. Then, … view at source ↗
Figure 6
Figure 6: T-SNE visualization on Digits (Top) and Office-Caltech (Bottom). Each color represents one class, each shape represents one domain, and the stars represent semantic centers. view at source ↗
Figure 8
Figure 8: Analysis of hyper-prototype lengths |I| of Equation (7) and rounds M of Equation (9) for Digits under domain skew (Left) and CIFAR10-LT under quantity skew (Right). view at source ↗
Figure 7
Figure 7: Settings: label skew (CIFAR10 with NID10.5), quantity skew (CIFAR10-LT with NID10.5, ρ = 50), domain skew (Digits). Left: different τ of Equation (12); dotted lines denote FedAvg. Right: with or without (w/o) the margin dk of Equation (12). FedHPro yields a larger inter-class distance (i.e., different colors are more separated) and a reduced intra-class distance (i.e., different shapes of the same colo… view at source ↗
Figure 9
Figure 9: Comparison of convergence in the average accuracy trend on Digits (Left) and Office-Caltech (Right). We also provide quantitative results for the convergence rates in Table A4. view at source ↗
read the original abstract

Federated Learning (FL) enables collaborative training of distributed clients while protecting privacy. To enhance generalization capability in FL, prototype-based FL is in the spotlight, since shared global prototypes offer semantic anchors for aligning client-specific local prototypes. However, existing methods update global prototypes at the prototype-level via averaging local prototypes or refining global anchors, which often leads to semantic drift across clients and subsequently yields a misaligned global signal. To alleviate this issue, we introduce hyper-prototypes, defined by a set of learnable global class-wise prototypes to preserve underlying semantic knowledge across clients. The hyper-prototypes are optimized via gradient matching to align with class-relevant characteristics distilled directly from clients' real samples, rather than prototype-level descriptors. We further propose FedHPro, a Federated Hyper-Prototype Learning framework, to leverage hyper-prototypes to promote inter-class separability via mutual-contrastive learning with client-specific margin, while encouraging intra-class uniformity through a consistency penalty. Comprehensive experiments under diverse heterogeneous scenarios confirm that 1) hyper-prototypes produce a more semantically consistent global signal, and 2) FedHPro achieves state-of-the-art performance on several benchmark datasets. Code is available at \href{https://github.com/mala-lab/FedHPro}{https://github.com/mala-lab/FedHPro}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes FedHPro, a federated learning method that introduces hyper-prototypes—learnable global class-wise prototypes optimized via gradient matching on real client samples rather than averaging local prototypes—to reduce semantic drift. It combines this with mutual-contrastive learning using a client-specific margin for inter-class separability and a consistency penalty for intra-class uniformity, claiming improved global signal consistency and state-of-the-art performance on benchmark datasets under heterogeneous (non-IID) scenarios.

Significance. If the central claims hold, the work could meaningfully advance prototype-based federated learning by replacing prototype-level averaging with gradient-based alignment, offering a route to more stable semantic anchors across clients while preserving privacy. The code release at the provided GitHub link is a clear strength for reproducibility. However, the significance is limited by the absence of detailed quantitative validation, convergence analysis, or privacy bounds in the abstract, leaving the practical impact dependent on unverified experimental details.

major comments (3)
  1. [Abstract] Abstract: The central SOTA performance claim and the assertion that 'hyper-prototypes produce a more semantically consistent global signal' rest on experiments whose quantitative details (ablation studies, statistical significance, exact data splits, and run variance) are not reported, rendering the load-bearing empirical support unverifiable from the provided text.
  2. [Method] Method (gradient matching objective): No derivation, convergence bound, or stability analysis is supplied showing that matching gradients on real client samples reliably aligns hyper-prototypes under non-IID partitions; without this, the claim that gradient matching outperforms prototype averaging remains an unproven assumption that could be undermined by client-specific bias or optimization instability.
  3. [Experiments] Experiments section: The manuscript does not report statistical significance tests, confidence intervals, or ablation results isolating the contribution of gradient matching versus the contrastive and consistency terms, which is required to substantiate the SOTA claim across diverse heterogeneous scenarios.
minor comments (1)
  1. [Abstract] Abstract: The sentence listing the two confirmations ('1) hyper-prototypes produce... and 2) FedHPro achieves...') would benefit from explicit reference to the specific tables or figures that support each point.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, indicating the revisions we will incorporate to improve the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central SOTA performance claim and the assertion that 'hyper-prototypes produce a more semantically consistent global signal' rest on experiments whose quantitative details (ablation studies, statistical significance, exact data splits, and run variance) are not reported, rendering the load-bearing empirical support unverifiable from the provided text.

    Authors: We agree that the abstract is concise and could better summarize the supporting evidence. The full manuscript reports results on multiple heterogeneous benchmarks (CIFAR-10/100, Tiny-ImageNet) with standard non-IID partitions (Dirichlet and pathological), ablations isolating each component, and averages over multiple independent runs. We will revise the abstract to include key quantitative gains (e.g., accuracy improvements) and explicitly reference the experimental protocol and variance reporting in Sections 4–5. revision: yes

  2. Referee: [Method] Method (gradient matching objective): No derivation, convergence bound, or stability analysis is supplied showing that matching gradients on real client samples reliably aligns hyper-prototypes under non-IID partitions; without this, the claim that gradient matching outperforms prototype averaging remains an unproven assumption that could be undermined by client-specific bias or optimization instability.

    Authors: The gradient-matching objective is motivated by directly aligning hyper-prototypes to class-relevant gradient signals extracted from real client data, avoiding the semantic drift inherent in prototype averaging. We provide extensive empirical validation across diverse non-IID settings demonstrating improved consistency and accuracy. We will expand the method section with additional motivation, pseudocode, and empirical stability observations (e.g., gradient norm behavior across rounds). A rigorous convergence proof under arbitrary non-IID distributions, however, lies outside the scope of this primarily empirical work. revision: partial

  3. Referee: [Experiments] Experiments section: The manuscript does not report statistical significance tests, confidence intervals, or ablation results isolating the contribution of gradient matching versus the contrastive and consistency terms, which is required to substantiate the SOTA claim across diverse heterogeneous scenarios.

    Authors: The experiments section already contains component-wise ablations (gradient matching, mutual-contrastive loss, consistency penalty) and reports mean performance over five random seeds. To address the request, we will add paired statistical significance tests (t-tests) with p-values, 95% confidence intervals, and an explicit table isolating the contribution of gradient matching. Exact data splits and partition parameters will also be tabulated for full reproducibility. revision: yes

standing simulated objections not resolved
  • A formal convergence bound or stability guarantee for the gradient-matching objective under arbitrary non-IID client distributions.

Circularity Check

0 steps flagged

No circularity: claims rest on external benchmarks and explicit optimization objective

full rationale

The paper defines hyper-prototypes as learnable global class-wise prototypes and states they are optimized via gradient matching on real client samples to align with class-relevant characteristics. The semantic-consistency claim and SOTA performance are presented as outcomes confirmed by experiments on public benchmark datasets under heterogeneous scenarios, not as identities or fitted quantities derived from the same inputs. No equations, self-citations, or uniqueness theorems appear in the provided text that would reduce the central claim to a tautology or self-referential fit. The derivation chain therefore remains self-contained against external validation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The approach introduces hyper-prototypes as a new entity and relies on the unproven assumption that gradient matching from real samples will preserve semantics better than prototype averaging; several weighting hyperparameters for contrastive and consistency terms are expected but not enumerated in the abstract.

free parameters (2)
  • client-specific margin
    Used in mutual-contrastive learning to control inter-class separation; value chosen per client or tuned globally.
  • consistency penalty weight
    Scalar balancing intra-class uniformity term against other losses.
axioms (1)
  • domain assumption: Clients are honest and return accurate gradient information derived from their private data.
    Standard federated learning premise invoked to justify gradient matching.
invented entities (1)
  • hyper-prototypes (no independent evidence)
    purpose: Learnable global class-wise prototypes that preserve semantic knowledge across clients.
    New construct introduced to replace averaged local prototypes and reduce semantic drift.

pith-pipeline@v0.9.0 · 5558 in / 1368 out tokens · 43452 ms · 2026-05-14T20:21:59.453163+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

  1. [1]

    Geodesic flow kernel for unsupervised domain adaptation

    Gong, B., Shi, Y., Sha, F., and Grauman, K. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pp. 2066–2073,

  2. [2]

    Learning support and trivial prototypes for interpretable image classification

    Wang, C., Liu, Y., Chen, Y., Liu, F., Tian, Y., McCarthy, D., Frazer, H., and Carneiro, G. Learning support and trivial prototypes for interpretable image classification. In ICCV, pp. 2062–2072, 2023a. Wang, H., Yurochkin, M., Sun, Y., Papailiopoulos, D., and Khazaeni, Y. Federated learning with matched averaging. In ICLR,

  3. [3]

    A. Algorithm Pseudo-code Flow: In this section, we describe the pseudo-code of our FedHPro in Algorithm 1: 1) Server-Side: we optimize the simulated hyper-prototypes via gradient matching; 2) Client-Side: we leverage the hyper-prototypes to promote FL local training. Algorithm 1 FedHPro: Fe...

  4. [4]

    for the TinyImageNet dataset. The label skew heterogeneity level of clients is controlled by the standard deviation α of the Dirichlet distribution, and the quantity skew heterogeneity level is controlled by the index ratio ρ between the sample sizes of the most frequent and the least frequent class. The lower α and higher ρ these are, the more heterogeneous ...

  5. [5]

    is used for skin lesion classification and contains 8,912 training samples and 1,103 testing samples with 7 categories, and each sample's size is scaled to 224 × 224. Then, based on (Kaidi et al., 2019), we further build CIFAR10-LT, CIFAR100-LT, and TinyImageNet-LT datasets, as the unbalanced datasets with long-tailed level ρ; ρ = 1 denotes globally balanced...

  6. [6]

    (α as the non-IID level); NID2 is a more extreme setting consisting of 6 biased clients (each has a single class) and 1 client that has all classes. 2) Quantity Skew: we shape the original dataset into a long-tailed distribution by (Kaidi et al., 2019), and ρ means the ratio between sample sizes of the most frequent and least frequent class. 3) Domain Skew: we se...

  7. [7]

    Vanilla FL with data heterogeneity: FedAvg (McMahan et al., 2017), FedProx (Li et al., 2020b)

  8. [8]

    Contrastive-based FL baselines: MOON (Li et al., 2021), FedRCL (Seo et al., 2024)

  9. [9]

    Prototype-based FL baselines: FedProto (Tan et al., 2022a), FedTGP (Zhang et al., 2024a), FedGMKD (Zhang et al., 2024b), and FedSA (Zhou et al., 2025).

  10. [10]

    Table A1. Results on CIFAR10, CIFAR100, HAM10000, TinyImageNet with Label Skew

    Built upon the HPCL and HPAL modules, FedHPro enhances inter-class separability and enforces intra-class uniformity in local training under heterogeneous FL, yielding improvements over FL baselines. † denotes the results obtained by exchanging both the prototypes and model param...

  11. [11]

    and Office-Caltech (Caltech: 3, Webcam: 1, Amazon: 2, DSLR: 4). Each client's dataset is randomly selected from the total samples of the corresponding domain, as 1% for Digits and 20% for Office-Caltech. Table A5. Ablation study of the number of Clients K and Local Epochs E. Left: Clients K on TinyImageNet ...

  12. [12]

    … are popular prototype-based FL methods. However, as shown in Table A7, implementing FedProto, FedTGP, and FedSA following their original prototype-only communication setting consistently leads to poor performance. To ensure a fair comparison with mainstream FL baselines that exchange model parameters (e.g., FedRCL), in our experiments, we report enhanced ...

  13. [13]

    dataset.
    AG News settings (FedAvg / FedProx / FedSA / FedSA† / FedHPro):
      AG-NIDK10: 82.09 / 80.43 / 75.13 / 84.03 / 86.78
      AG-NIDK50: 52.71 / 55.14 / 47.53 / 56.92 / 63.06
    SUN397 settings (FedAvg / FedRCL / FedSA / FedSA† / FedHPro):
      NID10.2: 68.92 / 70.23 / 61.58 / 70.86 / 73.41
      NID10.5: 70.61 / 73.26 / 65.30 / 72.18 / 75.22
    FedHPro outperforms FedProto by 16.67%, FedTGP by 7.04%, and FedSA by 5.57%, indicating that th...

  14. [14]

    with a hidden dimension of 32 for AG News, and train the model via the Adam optimizer with a learning rate of 0.01. Referring to (Wang et al., 2024), we consider two non-IID scenarios on AG News: