pith. machine review for the scientific record.

arxiv: 2605.11165 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: 2 Lean theorem links

COSMOS: Model-Agnostic Personalized Federated Learning with Clustered Server Models and Pseudo-Label-Only Communication

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:57 UTC · model grok-4.3

classification 💻 cs.LG
keywords federated learning · personalized federated learning · model-agnostic learning · client clustering · knowledge distillation · pseudo-labels · heterogeneous data

The pith

COSMOS clusters clients by their pseudo-label predictions on public data so the server can train tailored models and distill them back without sharing architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents COSMOS as a way to personalize federated learning when clients have different model architectures and different data distributions. Clients send only their predictions on shared public data; the server groups them into clusters based on how similar those predictions are, trains one model per cluster on the server, and returns pseudo-labels from those models for clients to distill into their local training. The central theoretical result is that this distillation produces an exponential contraction in each client's personalization risk, which goes beyond the usual stationarity guarantees in model-agnostic federated learning. Experiments show the method beats other architecture-agnostic baselines and stays competitive with methods that require more communication or model sharing.
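To make that loop concrete, here is a minimal sketch of one communication round under the workflow described above. Everything in it is illustrative rather than the paper's implementation: the toy linear "client models", the hard-label disagreement distance, the agreement threshold, and the least-squares stand-in for server-side training are all assumptions.

```python
# Minimal sketch of one COSMOS-style communication round (all specifics are assumptions).
import numpy as np

rng = np.random.default_rng(0)

def client_predict(client_model, public_x):
    """Stand-in for a client's forward pass on the shared public set.

    Returns an (n_public, n_classes) matrix of class probabilities; any
    architecture would do here, since only predictions cross the network.
    """
    logits = public_x @ client_model                      # client_model: (d, n_classes) toy "network"
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def disagreement(p, q):
    """Fraction of public samples on which two clients' hard labels differ."""
    return np.mean(p.argmax(axis=1) != q.argmax(axis=1))

def cluster_by_prediction_similarity(preds, threshold):
    """Illustrative grouping: a client joins a cluster if it agrees closely with its anchor."""
    clusters, anchors = [], []
    for i, p in enumerate(preds):
        for c, a in enumerate(anchors):
            if disagreement(p, preds[a]) <= threshold:
                clusters[c].append(i)
                break
        else:
            anchors.append(i)
            clusters.append([i])
    return clusters

def train_cluster_model(member_preds, public_x):
    """Server-side stand-in: fit one model per cluster to the averaged member predictions."""
    target = np.mean(member_preds, axis=0)                 # soft consensus labels for the cluster
    w, *_ = np.linalg.lstsq(public_x, target, rcond=None)  # toy least-squares "training"
    return w

# --- one round ---
n_clients, n_public, d, n_classes = 8, 200, 16, 5
public_x = rng.normal(size=(n_public, d))
client_models = [rng.normal(size=(d, n_classes)) for _ in range(n_clients)]

preds = [client_predict(m, public_x) for m in client_models]            # 1. clients predict on public data
clusters = cluster_by_prediction_similarity(preds, threshold=0.6)       # 2. server clusters by similarity
cluster_models = [train_cluster_model([preds[i] for i in c], public_x)  # 3. one server model per cluster
                  for c in clusters]
pseudo_labels = {i: client_predict(cluster_models[c_idx], public_x)     # 4. pseudo-labels returned
                 for c_idx, c in enumerate(clusters) for i in c}
# Each client would now distill pseudo_labels[i] into its own architecture locally.
print([len(c) for c in clusters])
```

The only artifacts that cross the client-server boundary in this sketch are prediction matrices on the public set and the returned pseudo-labels, which is the property the paper's communication claim rests on.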

Core claim

COSMOS clusters clients according to the similarity of their predictions on a shared public dataset, trains a distinct server-side model for each resulting cluster, and communicates only the pseudo-labels generated by these cluster models back to the clients for local distillation. The analysis shows that this process produces an exponential reduction in the personalization risk for each client.

What carries the argument

Clustering clients by prediction similarity on public data, server-side training of one model per cluster, and return of pseudo-labels only for client-side distillation.
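The implementation notes extracted as reference 41 below mention a neighborhood threshold B used in a greedy-elimination clustering step. The sketch below is one plausible reading of such a step, assuming pairwise hard-label disagreement on the public set as the distance; the metric, the most-connected-first elimination order, and the toy data are assumptions, not the paper's specification.

```python
# One plausible greedy-elimination clustering on prediction similarity (assumed details).
import numpy as np

def greedy_elimination_clusters(pred_labels, B):
    """Cluster clients whose hard pseudo-labels on the public set largely agree.

    pred_labels: (n_clients, n_public) array of argmax predictions on public data.
    B: neighborhood threshold on the pairwise disagreement rate (assumed metric).
    Repeatedly takes the client with the most close neighbors, forms a cluster
    from it and those neighbors, removes them, and continues.
    """
    n = len(pred_labels)
    dist = np.array([[np.mean(pred_labels[i] != pred_labels[j]) for j in range(n)]
                     for i in range(n)])                     # pairwise disagreement rates
    remaining, clusters = set(range(n)), []
    while remaining:
        neigh = {i: {j for j in remaining if dist[i, j] <= B} for i in remaining}
        anchor = max(remaining, key=lambda i: len(neigh[i]))  # most-connected client first
        cluster = neigh[anchor] | {anchor}
        clusters.append(sorted(cluster))
        remaining -= cluster
    return clusters

# Toy usage: 6 clients, 100 public samples, two latent groups with 10% label noise.
rng = np.random.default_rng(1)
base_a, base_b = rng.integers(0, 5, 100), rng.integers(0, 5, 100)
labels = np.array([np.where(rng.random(100) < 0.1, rng.integers(0, 5, 100), base)
                   for base in [base_a, base_a, base_a, base_b, base_b, base_b]])
print(greedy_elimination_clusters(labels, B=0.3))   # expect roughly [[0, 1, 2], [3, 4, 5]]
```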

If this is right

  • Communication cost drops because only pseudo-labels travel between server and clients.
  • Client models can differ arbitrarily in architecture since no model parameters are exchanged.
  • Each client receives guidance from a server model trained on statistically related data rather than a single global model.
  • The exponential risk contraction implies that personalization quality improves rapidly once the correct cluster model is identified.
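The theorem itself is not visible from the abstract, so the display below only sketches the generic shape such a contraction usually takes; the notation (personalization risk R_i, contraction factor γ, cluster-quality floor ε) is assumed rather than the paper's.

```latex
% Assumed notation, not the paper's theorem. Personalization risk of client i's model f:
%   R_i(f) := E_{(x,y) \sim D_i}[\ell(f(x), y)], with R_i^{(t)} the risk after round t.
% A per-round contraction of the form
\[
  R_i^{(t+1)} \;\le\; \gamma\, R_i^{(t)} + \varepsilon_{k(i)}, \qquad \gamma \in (0,1),
\]
% unrolls to geometric decay toward a floor set by the quality of client i's cluster k(i):
\[
  R_i^{(t)} \;\le\; \gamma^{t}\, R_i^{(0)} + \frac{\varepsilon_{k(i)}}{1-\gamma}.
\]
```

Read this way, "exponential contraction" means the initial personalization risk decays geometrically with the number of distillation rounds, down to a residual governed by how well the cluster matches the client's distribution.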

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same clustering idea might be tested on streaming public data to see whether clusters remain stable over time.
  • If public data is scarce, one could explore whether synthetic data generated by an initial global model can substitute without breaking the contraction guarantee.
  • The approach naturally suggests a hybrid where clients occasionally send a small number of real labels to refine cluster boundaries.

Load-bearing premise

Grouping clients by how similar their predictions are on public data will create clusters whose data distributions are similar enough that the server models improve personalization.
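One way to state this premise formally, in assumed notation (the paper's exact condition is not visible here): within a cluster C_k, the members' conditional label distributions should be uniformly close, so the cluster model's target is a usable proxy for each member's optimum.

```latex
% Assumed formalization of the load-bearing premise (not the paper's stated condition):
% clients grouped into cluster C_k by prediction similarity should also satisfy
\[
  \max_{i,j \in C_k}\;
  \mathbb{E}_{x \sim P_{\mathrm{pub}}}
  \big\| P_i(\,\cdot \mid x) - P_j(\,\cdot \mid x) \big\|_{\mathrm{TV}} \;\le\; \delta_k ,
\]
% with small \delta_k. The referee's first major comment below is that agreement on
% finitely many public samples does not by itself bound \delta_k under arbitrary
% architectures or feature/label shift.
```

The "What would settle it" experiment below is effectively a test of whether clusters with a large δ_k still show the claimed contraction.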

What would settle it

An experiment in which the prediction-similarity clusters contain clients whose private data distributions differ sharply and where the measured personalization risk fails to contract exponentially compared with a non-clustered baseline.

Figures

Figures reproduced from arXiv: 2605.11165 by Ben Rachmut, Luise Ge, Ning Zhang, William Yeoh, Yevgeniy Vorobeychik.

Figure 1: Overview of the four steps in the COSMOS workflow.
Figure 2: Comparison of COSMOS and model-agnostic baselines on four benchmarks. Client models …
Figure 3: Performance of COSMOS and heterogeneous-model baselines on CIFAR-100 under different …
Figure 4: Comparison of COSMOS and heterogeneous-model baselines across four benchmarks under …
Figure 5: Comparison of COSMOS and heterogeneous-model baselines across four benchmarks …
Figure 6: Top-1 accuracy over communication rounds for homogeneous client architectures (AlexNet, …)
Figure 7: Top-1 accuracy over communication rounds for COSMOS on CIFAR-100 with Dirichlet …
Figure 8: Top-1 accuracy over communication rounds for COSMOS showing the performance on …
Figure 9: Top-1 accuracy over communication rounds for COSMOS. Subfigure (a) shows performance …
original abstract

Federated learning (FL) in heterogeneous environments remains challenging because client models often differ in both architecture and data distribution. While recent approaches attempt to address this challenge through client clustering and knowledge distillation, simultaneously handling architectural and statistical heterogeneity remains difficult. We introduce COSMOS, a model-agnostic framework that enables server-side personalization using only pseudo-label communication. Clients train local models and predict on the public data; the server clusters clients by prediction similarity, trains a cluster-specific model for each group using its own compute, and distills the resulting models back to clients. We provide the first theoretical analysis showing that distillation from the learned cluster models can yield exponential personalization risk contraction, going beyond the convergence-to-stationarity guarantees typically provided in model-agnostic FL. Experiments across benchmarks demonstrate that COSMOS consistently outperforms all model-agnostic FL baselines while remaining competitive with state-of-the-art personalized FL methods. More broadly, our results highlight personalized server-side learning with pseudo-labels as a promising paradigm for scalable and model-agnostic federated learning in highly heterogeneous environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces COSMOS, a model-agnostic personalized federated learning framework. Clients train local models and send predictions on public data to the server, which clusters clients by prediction similarity, trains cluster-specific models server-side, and distills them back to clients using only pseudo-label communication. The central claim is the first theoretical analysis demonstrating that distillation from these cluster models yields exponential personalization risk contraction (beyond standard convergence-to-stationarity results in model-agnostic FL), supported by experiments showing consistent outperformance over model-agnostic baselines and competitiveness with state-of-the-art personalized FL methods across benchmarks.

Significance. If the exponential contraction result holds under verifiable assumptions, this would be a meaningful advance in model-agnostic FL by providing stronger personalization guarantees while handling architectural heterogeneity with minimal communication. The pseudo-label-only paradigm is practically attractive for scalability. Experimental outperformance, if statistically substantiated, would further support its utility in heterogeneous environments.

major comments (2)
  1. [§4] §4 (Theoretical Analysis): The exponential personalization risk contraction claim presupposes that clustering clients by prediction similarity on public data produces groups with sufficiently low intra-cluster divergence in P(y|x) so that the shared cluster model serves as a good proxy for each client's optimum. No explicit bound is derived showing that prediction agreement on public samples implies the required distributional closeness for arbitrary architectures and label/feature shifts; without this, the exponential rate reduces to standard convergence-to-stationarity and the 'first theoretical analysis' claim is undermined.
  2. [§5] §5 (Experiments): The abstract asserts consistent outperformance over all model-agnostic FL baselines, but the manuscript provides no error bars, statistical significance tests, or detailed descriptions of data splits and baseline implementations. These omissions make it impossible to assess whether the reported gains are robust or merely due to favorable hyperparameter choices or unrepresentative public data.
minor comments (2)
  1. [Abstract] Abstract: The statement that the method 'remains competitive with state-of-the-art personalized FL methods' should be qualified by noting that those methods typically require model homogeneity or additional communication, which COSMOS avoids.
  2. [§4] Notation: The definition of personalization risk and the contraction rate should be stated explicitly with all assumptions (e.g., on public data representativeness) before the main theorem to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing our response and indicating planned revisions to the manuscript.

point-by-point responses
  1. Referee: [§4] §4 (Theoretical Analysis): The exponential personalization risk contraction claim presupposes that clustering clients by prediction similarity on public data produces groups with sufficiently low intra-cluster divergence in P(y|x) so that the shared cluster model serves as a good proxy for each client's optimum. No explicit bound is derived showing that prediction agreement on public samples implies the required distributional closeness for arbitrary architectures and label/feature shifts; without this, the exponential rate reduces to standard convergence-to-stationarity and the 'first theoretical analysis' claim is undermined.

    Authors: We appreciate the referee highlighting this key aspect of the analysis. In §4, the exponential personalization risk contraction is proven conditionally on the clustering producing groups with low intra-cluster divergence in P(y|x), which enables the shared cluster model to serve as a proxy for client optima. Prediction similarity on public data is used as a practical surrogate for this, based on the intuition that aligned predictions reflect similar underlying decision boundaries. We agree that no explicit finite-sample bound is derived linking the observed prediction agreement to a quantitative guarantee on distributional closeness (in total variation or similar metrics) that holds for arbitrary architectures and general feature/label shifts. Such a bound would necessitate further assumptions on the public data or model properties, which we did not impose to preserve generality. In the revision we will add a clarifying remark in §4 stating this assumption explicitly, discussing when the proxy is expected to hold, and noting that the contraction rate is with respect to the achieved cluster quality. This still distinguishes our result from standard model-agnostic FL convergence-to-stationarity guarantees, as the rate explicitly incorporates the benefit of server-side personalization via clustering and distillation. revision: partial

  2. Referee: [§5] §5 (Experiments): The abstract asserts consistent outperformance over all model-agnostic FL baselines, but the manuscript provides no error bars, statistical significance tests, or detailed descriptions of data splits and baseline implementations. These omissions make it impossible to assess whether the reported gains are robust or merely due to favorable hyperparameter choices or unrepresentative public data.

    Authors: We agree that the experimental evaluation would be strengthened by additional statistical rigor and transparency. In the revised manuscript we will: (i) report error bars as standard deviations computed over at least five independent random seeds for all metrics and datasets; (ii) include statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values) comparing COSMOS to each model-agnostic baseline; and (iii) expand the experimental details section with precise descriptions of data partitioning (including client splits and public-data selection criteria), baseline re-implementations, and all hyperparameter choices. These changes will allow readers to better judge the robustness of the observed improvements. revision: yes
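As a sketch of the seed-paired comparison the authors commit to here, the snippet below computes mean ± standard deviation over seeds and runs both promised tests for one baseline; the accuracy numbers are placeholders, not results from the paper.

```python
# Sketch of the promised seed-paired evaluation: mean ± std over seeds plus paired tests.
# The accuracy values are hypothetical placeholders, not reported results.
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

acc = {                                   # top-1 accuracy per seed (hypothetical numbers)
    "COSMOS": np.array([71.2, 70.8, 71.5, 70.9, 71.1]),
    "FedMD":  np.array([69.7, 70.1, 69.9, 70.3, 69.5]),
}

cosmos = acc["COSMOS"]
for name, baseline in acc.items():
    if name == "COSMOS":
        continue
    diff = cosmos - baseline
    t_p = ttest_rel(cosmos, baseline).pvalue   # paired t-test over seeds
    w_p = wilcoxon(diff).pvalue                # Wilcoxon signed-rank on the differences
    print(f"{name}: COSMOS {cosmos.mean():.2f}±{cosmos.std(ddof=1):.2f} "
          f"vs {baseline.mean():.2f}±{baseline.std(ddof=1):.2f}; "
          f"paired t p={t_p:.3f}, Wilcoxon p={w_p:.3f}")
```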

Circularity Check

0 steps flagged

No significant circularity; theoretical analysis of exponential contraction is independent of inputs

full rationale

The paper's derivation chain introduces COSMOS with client clustering by prediction similarity on public data, followed by server-side cluster model training and distillation. The claimed theoretical result of exponential personalization risk contraction is positioned as a novel analysis extending beyond standard convergence-to-stationarity bounds in model-agnostic FL. No equations or steps reduce a derived quantity to a fitted parameter by construction, nor does the central claim rely on load-bearing self-citation or ansatz smuggling. The clustering definition and distillation step are specified independently, with the contraction bound presented as following from the framework's structure rather than tautologically from its inputs. This is the common case of a self-contained contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review yields limited visibility into exact assumptions. The method implicitly relies on the existence of suitable public data, the validity of prediction similarity as a proxy for statistical similarity, and standard properties of knowledge distillation. No explicit free parameters, invented entities, or ad-hoc axioms are stated.

axioms (2)
  • domain assumption Clients have access to a common public dataset on which they can generate predictions.
    Required for the pseudo-label communication step described in the abstract.
  • domain assumption Prediction similarity on public data induces clusters that share useful statistical structure for personalization.
    Central to the clustering mechanism and the claimed risk contraction.

pith-pipeline@v0.9.0 · 5504 in / 1438 out tokens · 50644 ms · 2026-05-13T05:57:42.554880+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages

1. Abourayya, A., Kleesiek, J., Rao, K., Ayday, E., Rao, B., Webb, G.I., Kamp, M.: Little Is Enough: Boosting Privacy by Sharing Only Hard Labels in Federated Semi-Supervised Learning. In: Proceedings of the AAAI Conference on Artificial Intelligence (2025)
2. Afonin, A., Karimireddy, S.P.: Towards Model Agnostic Federated Learning Using Knowledge Distillation. In: International Conference on Learning Representations (2022)
3. Cho, Y.J., Wang, J., Chirvolu, T., Joshi, G.: Communication-Efficient and Model-Heterogeneous Personalized Federated Learning via Clustered Knowledge Transfer. IEEE Journal of Selected Topics in Signal Processing (2023)
4. Cohen, G., Afshar, S., Tapson, J., Van Schaik, A.: EMNIST: Extending MNIST to Handwritten Letters. In: International Joint Conference on Neural Networks (IJCNN) (2017)
5. Deng, Y., Kamani, M.M., Mahdavi, M.: Adaptive Personalized Federated Learning. arXiv preprint arXiv:2003.13461 (2020)
6. Dinh, C.T., Tran, N., Nguyen, J.: Personalized Federated Learning With Moreau Envelopes. Advances in Neural Information Processing Systems (2020)
7. Duan, M., Liu, D., Ji, X., Liu, R., Liang, L., Chen, X., Tan, Y.: FedGroup: Efficient Federated Learning via Decomposed Similarity-Based Clustering. In: IEEE International Conference on Parallel & Distributed Processing with Applications (ISPA/BDCloud/SocialCom/SustainCom) (2021)
8. Fallah, A., Mokhtari, A., Ozdaglar, A.: Personalized Federated Learning With Theoretical Guarantees: A Model-Agnostic Meta-Learning Approach. In: Advances in Neural Information Processing Systems (2020)
9. Ge, L., Lanier, M., Sarkar, A., Guresti, B., Vorobeychik, Y., Zhang, C.: Learning Policy Committees for Effective Personalization in MDPs With Diverse Tasks. Proceedings of the 42nd International Conference on Machine Learning (2025)
10. Ghosh, A., Chung, J., Yin, D., Ramchandran, K.: An Efficient Framework for Clustered Federated Learning. Advances in Neural Information Processing Systems (2020)
11. Gong, B., Xing, T., Liu, Z., Xi, W., Chen, X.: Adaptive Client Clustering for Efficient Federated Learning Over Non-IID and Imbalanced Data. IEEE Transactions on Big Data (2022)
12. Karimireddy, S.P., Kale, S., Mohri, M., Reddi, S., Stich, S., Suresh, A.T.: SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. In: International Conference on Machine Learning (2020)
13. Krizhevsky, A., Hinton, G., et al.: Learning Multiple Layers of Features From Tiny Images. Toronto, ON, Canada (2009)
14. Lang, H., Sontag, D., Vijayaraghavan, A.: Theoretical Analysis of Weak-to-Strong Generalization. Advances in Neural Information Processing Systems (2024)
15. Le, Y., Yang, X.: Tiny ImageNet Visual Recognition Challenge. Stanford CS231N: Convolutional Neural Networks for Visual Recognition (2015)
16. Li, D., Wang, J.: FedMD: Heterogenous Federated Learning via Model Distillation. arXiv preprint arXiv:1910.03581 (2019)
17. Li, T., Hu, S., Beirami, A., Smith, V.: Ditto: Fair and Robust Federated Learning Through Personalization. In: International Conference on Machine Learning (2021)
18. Li, T., Sahu, A.K., Zaheer, M., Sanjabi, M., Talwalkar, A., Smith, V.: Federated Optimization in Heterogeneous Networks. Proceedings of Machine Learning and Systems (2020)
19. Lin, T., Kong, L., Stich, S.U., Jaggi, M.: Ensemble Distillation for Robust Model Fusion in Federated Learning. Advances in Neural Information Processing Systems (2020)
20. Liu, J., Liu, X., Wang, S., Wan, X., Li, D., Lu, K., He, K.: Communication-Efficient Federated Multi-View Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 48, 17–32 (2025), https://api.semanticscholar.org/CorpusID:280767120
21. Makhija, D., Han, X., Ho, N., Ghosh, J.: Architecture Agnostic Federated Learning for Neural Networks. In: International Conference on Machine Learning (2022)
22. McMahan, B., Moore, E., Ramage, D., Hampson, S., y Arcas, B.A.: Communication-Efficient Learning of Deep Networks from Decentralized Data. In: International Conference on Artificial Intelligence and Statistics (2017)
23. Mora, A., Tenison, I., Bellavista, P., Rish, I.: Knowledge Distillation in Federated Learning: A Practical Guide. In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (2024)
24. Oh, J., Kim, S., Yun, S.Y.: FedBABU: Toward Enhanced Representation for Federated Image Classification. In: International Conference on Learning Representations (2022)
25. Sattler, F., Marban, A., Rischke, R., Samek, W.: CFD: Communication-Efficient Federated Distillation via Soft-Label Quantization and Delta Coding. IEEE Transactions on Network Science and Engineering (2022)
26. Shahid, O., Pouriyeh, S., Parizi, R.M., Sheng, Q.Z., Srivastava, G., Zhao, L.: Communication Efficiency in Federated Learning: Achievements and Challenges. arXiv preprint arXiv:2107.10996 (2021)
27. Shamsian, A., Navon, A., Fetaya, E., Chechik, G.: Personalized Federated Learning Using Hypernetworks. In: International Conference on Machine Learning (2021)
28. Tamirisa, R., Xie, C., Bao, W., Zhou, A., Arel, R., Shamsian, A.: FedSelect: Personalized Federated Learning With Customized Selection of Parameters for Fine-Tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
29. Tan, A.Z., Yu, H., Cui, L., Yang, Q.: Towards Personalized Federated Learning. IEEE Transactions on Neural Networks and Learning Systems (2022)
30. Tan, Y., Long, G., Liu, L., Zhou, T., Lu, Q., Jiang, J., Zhang, C.: FedProto: Federated Prototype Learning Across Heterogeneous Clients. In: Proceedings of the AAAI Conference on Artificial Intelligence (2022)
31. Wei, C., Shen, K., Chen, Y., Ma, T.: Theoretical Analysis of Self-Training With Deep Networks on Unlabeled Data. In: International Conference on Learning Representations (2020)
32. Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., Hoang, N., Khazaeni, Y.: Bayesian Nonparametric Federated Learning of Neural Networks. In: International Conference on Machine Learning (2019)
33. Zhang, J., Shi, Y.: A Personalized Federated Learning Method Based on Clustering and Knowledge Distillation. Electronics (2024)
34. Zhu, Z., Hong, J., Zhou, J.: Data-Free Knowledge Distillation for Heterogeneous Federated Learning. In: International Conference on Machine Learning (2021)
35. Tiny ImageNet: The Tiny ImageNet dataset is downloaded from the official Stanford CS231n website. It is provided as a ZIP file containing 200 categories of images, used for classification tasks. The dataset is extracted and reorganized into appropriate folders for training and validation. The dataset is available at: http://cs231n.stanford.edu/tiny-imagenet...
36. CIFAR-10 & CIFAR-100: These datasets consist of 60,000 32x32 color images in 10 and 100 classes, respectively. They are commonly used for training machine learning models for image classification. The datasets are available for download directly via the torchvision library. More information on these datasets can be found at: https://www.cs.toronto.edu/~kriz...
37. Extended MNIST (EMNIST): The EMNIST dataset extends the original MNIST dataset to include handwritten letters. We specifically use the Balanced version, which includes 131,600 characters across 47 balanced classes. The dataset is available for download via the torchvision library. Additional details can be found on the homepage: https://www.nist.gov/itl/prod...
38. Loss Functions: During training using pseudo-labels, we use the Kullback-Leibler Divergence (nn.KLDivLoss) to align model outputs with mean pseudo-labels, with reduction='batchmean'. During client fine-tuning (using their local data true labels), we use Cross-Entropy Loss (nn.CrossEntropyLoss).
39. Optimizers and Learning Rate: We use the Adam optimizer for both training stages. After performing hyperparameter optimization, we use a higher learning rate of 0.001 when training on client-side data with true labels. This enables the model to adapt quickly and effectively to reliable supervision. In contrast, for pseudo-labeled data, we reduce the learning...
40. Weights Initialization: For consistency across experiments, we initialize weights using Kaiming He initialization for both convolutional and linear layers. This method, which is well-suited for ReLU activations, helps prevent vanishing or exploding gradients. The biases are initialized to zeros. The seed is updated and applied to ensure reproducibility across runs.
41. COSMOS Clustering Hyperparameter (B): COSMOS uses a neighborhood threshold hyperparameter B to determine client similarity during the greedy-elimination clustering step. We tune B such that the resulting number of clusters is approximately K = 5, matching the 20% class-grouping structure used to induce heterogeneous label distributions across clients.