Recognition: 2 theorem links · Lean Theorem
COSMOS: Model-Agnostic Personalized Federated Learning with Clustered Server Models and Pseudo-Label-Only Communication
Pith reviewed 2026-05-13 05:57 UTC · model grok-4.3
The pith
COSMOS clusters clients by their pseudo-label predictions on shared public data so the server can train a tailored model per cluster and distill it back to clients via pseudo-labels, without exchanging model architectures or parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
COSMOS clusters clients according to the similarity of their predictions on a shared public dataset, trains a distinct server-side model for each resulting cluster, and communicates only the pseudo-labels generated by these cluster models back to the clients for local distillation. The accompanying analysis shows that, under the paper's stated assumptions, this process yields an exponential contraction of each client's personalization risk.
What carries the argument
Clustering clients by prediction similarity on public data, server-side training of one model per cluster, and return of pseudo-labels only for client-side distillation.
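To make that pipeline concrete, here is a minimal sketch of one COSMOS-style round, pieced together from the abstract and the clustering details quoted later on this page (greedy-elimination clustering with an L1 threshold B on public-data predictions). The helper names (`train_server_model`, `distill_to_client`) and the use of averaged member pseudo-labels as the server's training targets are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of one COSMOS-style round (illustrative names, not the authors' code).
# Each client exposes predict(X) -> (|U|, M) array of class probabilities on public data U.
import numpy as np

def greedy_cluster(preds, B):
    """Greedy-elimination clustering: clients whose pseudo-label matrices are
    within L1 distance B of a seed client join that seed's cluster."""
    remaining = list(preds.keys())
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        members = [seed]
        for cid in remaining[:]:
            if np.abs(preds[seed] - preds[cid]).sum() <= B:
                members.append(cid)
                remaining.remove(cid)
        clusters.append(members)
    return clusters

def cosmos_round(clients, public_x, B, train_server_model, distill_to_client):
    # 1) Clients predict on the shared public set; only these predictions are uploaded.
    preds = {cid: clients[cid].predict(public_x) for cid in clients}
    # 2) Server clusters clients by pseudo-label similarity on the public data.
    clusters = greedy_cluster(preds, B)
    for members in clusters:
        # 3) Server trains one model per cluster using its own compute, here assumed
        #    to fit public_x against the members' averaged pseudo-labels.
        targets = np.mean([preds[cid] for cid in members], axis=0)
        cluster_model = train_server_model(public_x, targets)
        # 4) Only the cluster model's pseudo-labels travel back; clients distill locally.
        pseudo_labels = cluster_model.predict(public_x)
        for cid in members:
            distill_to_client(clients[cid], public_x, pseudo_labels)
    return clusters
```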
If this is right
- Communication cost drops because only pseudo-labels travel between server and clients (a rough size comparison follows this list).
- Client models can differ arbitrarily in architecture since no model parameters are exchanged.
- Each client receives guidance from a server model trained on statistically related data rather than a single global model.
- The exponential risk contraction implies that personalization quality improves rapidly once the correct cluster model is identified.
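A back-of-the-envelope comparison illustrates the first point. The sizes below are assumptions chosen for illustration (a public set of 5,000 samples, 100 classes, a roughly ResNet-18-scale model), not figures from the paper.

```python
# Back-of-the-envelope payload comparison (illustrative sizes, not from the paper).
PUBLIC_SAMPLES  = 5_000       # |U|, assumed size of the shared public set
NUM_CLASSES     = 100         # M, e.g. CIFAR-100
MODEL_PARAMS    = 11_000_000  # roughly ResNet-18 scale, for comparison only
BYTES_PER_FLOAT = 4

soft_labels_mb = PUBLIC_SAMPLES * NUM_CLASSES * BYTES_PER_FLOAT / 1e6  # full soft pseudo-labels
hard_labels_mb = PUBLIC_SAMPLES * BYTES_PER_FLOAT / 1e6                # argmax-only labels
parameters_mb  = MODEL_PARAMS * BYTES_PER_FLOAT / 1e6                  # one round of weight exchange

print(f"soft pseudo-labels: {soft_labels_mb:.1f} MB")   # 2.0 MB
print(f"hard pseudo-labels: {hard_labels_mb:.2f} MB")   # 0.02 MB
print(f"model parameters:   {parameters_mb:.1f} MB")    # 44.0 MB
```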
Where Pith is reading between the lines
- The same clustering idea might be tested on streaming public data to see whether clusters remain stable over time.
- If public data is scarce, one could explore whether synthetic data generated by an initial global model can substitute without breaking the contraction guarantee.
- The approach naturally suggests a hybrid where clients occasionally send a small number of real labels to refine cluster boundaries.
Load-bearing premise
Grouping clients by how similar their predictions are on public data will create clusters whose data distributions are similar enough that the server models improve personalization.
What would settle it
An experiment in which the prediction-similarity clusters contain clients whose private data distributions differ sharply and where the measured personalization risk fails to contract exponentially compared with a non-clustered baseline.
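A minimal harness for that experiment might look like the sketch below. The functions `run_cosmos`, `run_unclustered`, and the construction of adversarial clients are placeholders, and fitting a geometric rate to the measured per-round risks is just one way to operationalize "fails to contract exponentially".

```python
# Sketch of the proposed falsification test (all function names are placeholders).
import numpy as np

def contraction_rate(risks):
    """Fit log(risk_t) ~ a + t*log(gamma); gamma < 1 indicates geometric (exponential) decay."""
    t = np.arange(len(risks))
    slope, _ = np.polyfit(t, np.log(np.maximum(risks, 1e-12)), 1)
    return float(np.exp(slope))

def settle_it(run_cosmos, run_unclustered, adversarial_clients, rounds=20):
    # adversarial_clients: clients built so prediction-similar groups hide sharply
    # different private distributions (e.g. similar public predictions, shifted features).
    clustered   = run_cosmos(adversarial_clients, rounds)        # per-round mean personalization risk
    unclustered = run_unclustered(adversarial_clients, rounds)   # single global teacher baseline
    g_c, g_u = contraction_rate(clustered), contraction_rate(unclustered)
    # The premise fails if clustering no longer buys a faster-than-baseline geometric rate.
    return {"clustered_rate": g_c, "unclustered_rate": g_u, "premise_holds": g_c < min(g_u, 1.0)}
```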
read the original abstract
Federated learning (FL) in heterogeneous environments remains challenging because client models often differ in both architecture and data distribution. While recent approaches attempt to address this challenge through client clustering and knowledge distillation, simultaneously handling architectural and statistical heterogeneity remains difficult. We introduce COSMOS, a model-agnostic framework that enables server-side personalization using only pseudo-label communication. Clients train local models and predict on the public data; the server clusters clients by prediction similarity, trains a cluster-specific model for each group using its own compute, and distills the resulting models back to clients. We provide the first theoretical analysis showing that distillation from the learned cluster models can yield exponential personalization risk contraction, going beyond the convergence-to-stationarity guarantees typically provided in model-agnostic FL. Experiments across benchmarks demonstrate that COSMOS consistently outperforms all model-agnostic FL baselines while remaining competitive with state-of-the-art personalized FL methods. More broadly, our results highlight personalized server-side learning with pseudo-labels as a promising paradigm for scalable and model-agnostic federated learning in highly heterogeneous environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces COSMOS, a model-agnostic personalized federated learning framework. Clients train local models and send predictions on public data to the server, which clusters clients by prediction similarity, trains cluster-specific models server-side, and distills them back to clients using only pseudo-label communication. The central claim is a first theoretical analysis showing that distillation from these cluster models yields exponential personalization risk contraction, going beyond the convergence-to-stationarity guarantees typical of model-agnostic FL; this is supported by experiments showing consistent outperformance over model-agnostic baselines and competitiveness with state-of-the-art personalized FL methods across benchmarks.
Significance. If the exponential contraction result holds under verifiable assumptions, this would be a meaningful advance in model-agnostic FL by providing stronger personalization guarantees while handling architectural heterogeneity with minimal communication. The pseudo-label-only paradigm is practically attractive for scalability. Experimental outperformance, if statistically substantiated, would further support its utility in heterogeneous environments.
major comments (2)
- [§4] §4 (Theoretical Analysis): The exponential personalization risk contraction claim presupposes that clustering clients by prediction similarity on public data produces groups with sufficiently low intra-cluster divergence in P(y|x) so that the shared cluster model serves as a good proxy for each client's optimum. No explicit bound is derived showing that prediction agreement on public samples implies the required distributional closeness for arbitrary architectures and label/feature shifts; without this, the exponential rate reduces to standard convergence-to-stationarity and the 'first theoretical analysis' claim is undermined.
- [§5] §5 (Experiments): The abstract asserts consistent outperformance over all model-agnostic FL baselines, but the manuscript provides no error bars, statistical significance tests, or detailed descriptions of data splits and baseline implementations. These omissions make it impossible to assess whether the reported gains are robust or merely due to favorable hyperparameter choices or unrepresentative public data.
minor comments (2)
- [Abstract] Abstract: The statement that the method 'remains competitive with state-of-the-art personalized FL methods' should be qualified by noting that those methods typically require model homogeneity or additional communication, which COSMOS avoids.
- [§4] Notation: The definition of personalization risk and the contraction rate should be stated explicitly with all assumptions (e.g., on public data representativeness) before the main theorem to improve readability.
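For illustration, the kind of explicit statement requested here might take the following schematic form. The symbols, the risk definition, and the inequality are illustrative placeholders, not the paper's actual definitions or theorem.

```latex
% Schematic form only; not the paper's actual definitions or constants.
\[
  R_i(f) \;=\; \mathbb{E}_{(x,y)\sim \mathcal{D}_i}\,\ell\!\bigl(f(x),\,y\bigr),
  \qquad
  R_i\!\bigl(f_i^{(t+1)}\bigr) - R_i^\star
  \;\le\;
  \gamma\,\Bigl(R_i\!\bigl(f_i^{(t)}\bigr) - R_i^\star\Bigr) + \varepsilon_{\mathrm{cluster}},
  \quad 0 < \gamma < 1,
\]
where $\mathcal{D}_i$ is client $i$'s private distribution, $R_i^\star$ its best achievable risk,
$\gamma$ the per-round contraction factor, and $\varepsilon_{\mathrm{cluster}}$ an additive term
controlled by intra-cluster divergence and public-data representativeness.
```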
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, providing our response and indicating planned revisions to the manuscript.
read point-by-point responses
Referee: [§4] §4 (Theoretical Analysis): The exponential personalization risk contraction claim presupposes that clustering clients by prediction similarity on public data produces groups with sufficiently low intra-cluster divergence in P(y|x) so that the shared cluster model serves as a good proxy for each client's optimum. No explicit bound is derived showing that prediction agreement on public samples implies the required distributional closeness for arbitrary architectures and label/feature shifts; without this, the exponential rate reduces to standard convergence-to-stationarity and the 'first theoretical analysis' claim is undermined.
Authors: We appreciate the referee highlighting this key aspect of the analysis. In §4, the exponential personalization risk contraction is proven conditionally on the clustering producing groups with low intra-cluster divergence in P(y|x), which enables the shared cluster model to serve as a proxy for client optima. Prediction similarity on public data is used as a practical surrogate for this, based on the intuition that aligned predictions reflect similar underlying decision boundaries. We agree that no explicit finite-sample bound is derived linking the observed prediction agreement to a quantitative guarantee on distributional closeness (in total variation or similar metrics) that holds for arbitrary architectures and general feature/label shifts. Such a bound would necessitate further assumptions on the public data or model properties, which we did not impose to preserve generality. In the revision we will add a clarifying remark in §4 stating this assumption explicitly, discussing when the proxy is expected to hold, and noting that the contraction rate is with respect to the achieved cluster quality. This still distinguishes our result from standard model-agnostic FL convergence-to-stationarity guarantees, as the rate explicitly incorporates the benefit of server-side personalization via clustering and distillation. revision: partial
Referee: [§5] §5 (Experiments): The abstract asserts consistent outperformance over all model-agnostic FL baselines, but the manuscript provides no error bars, statistical significance tests, or detailed descriptions of data splits and baseline implementations. These omissions make it impossible to assess whether the reported gains are robust or merely due to favorable hyperparameter choices or unrepresentative public data.
Authors: We agree that the experimental evaluation would be strengthened by additional statistical rigor and transparency. In the revised manuscript we will: (i) report error bars as standard deviations computed over at least five independent random seeds for all metrics and datasets; (ii) include statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values) comparing COSMOS to each model-agnostic baseline; and (iii) expand the experimental details section with precise descriptions of data partitioning (including client splits and public-data selection criteria), baseline re-implementations, and all hyperparameter choices. These changes will allow readers to better judge the robustness of the observed improvements. revision: yes
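As an illustration of items (i) and (ii) in this response, a minimal reporting helper could look like the sketch below. The accuracy values are invented placeholders, one entry per random seed; the helper simply computes mean and standard deviation across seeds and the two paired tests named above.

```python
# Sketch of the promised reporting: mean +/- std over seeds plus paired significance tests.
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

def compare(cosmos_acc, baseline_acc):
    cosmos_acc, baseline_acc = np.asarray(cosmos_acc), np.asarray(baseline_acc)
    return {
        "cosmos":     f"{cosmos_acc.mean():.2f} ± {cosmos_acc.std(ddof=1):.2f}",
        "baseline":   f"{baseline_acc.mean():.2f} ± {baseline_acc.std(ddof=1):.2f}",
        "paired_t_p": ttest_rel(cosmos_acc, baseline_acc).pvalue,
        "wilcoxon_p": wilcoxon(cosmos_acc, baseline_acc).pvalue,
    }

# Example with placeholder accuracies over five seeds:
print(compare([71.2, 70.8, 71.5, 70.9, 71.1], [69.4, 69.9, 69.1, 70.0, 69.6]))
```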
Circularity Check
No significant circularity; the exponential-contraction analysis is derived from the framework's structure rather than assumed from its inputs.
full rationale
The paper's derivation chain introduces COSMOS with client clustering by prediction similarity on public data, followed by server-side cluster model training and distillation. The claimed theoretical result of exponential personalization risk contraction is positioned as a novel analysis extending beyond standard convergence-to-stationarity bounds in model-agnostic FL. No equations or steps reduce a derived quantity to a fitted parameter by construction, nor does the central claim rely on load-bearing self-citation or ansatz smuggling. The clustering definition and distillation step are specified independently, and the contraction bound is presented as following from the framework's structure rather than tautologically from its inputs. This is the common case of a self-contained contribution.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Clients have access to a common public dataset on which they can generate predictions.
- domain assumption Prediction similarity on public data induces clusters that share useful statistical structure for personalization.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "We provide the first theoretical analysis showing that distillation from the learned cluster models can yield exponential personalization risk contraction... under sufficient conditions... (b,c)-expansion... robustness loss... Assumption 5.8 (Effective pseudo-labelers) b ≤ 1/3."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "greedy clustering... d^(t)(i,j) = ||f_i(U) − f_j(U)||_1... within-cluster pseudo-label distance bounded by B"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Abourayya, A., Kleesiek, J., Rao, K., Ayday, E., Rao, B., Webb, G.I., Kamp, M.: Little Is Enough: Boosting Privacy by Sharing Only Hard Labels in Federated Semi-Supervised Learning. In: Proceedings of the AAAI Conference on Artificial Intelligence (2025)
- [2] Afonin, A., Karimireddy, S.P.: Towards Model Agnostic Federated Learning Using Knowledge Distillation. In: International Conference on Learning Representations (2022)
- [3] Cho, Y.J., Wang, J., Chirvolu, T., Joshi, G.: Communication-Efficient and Model-Heterogeneous Personalized Federated Learning via Clustered Knowledge Transfer. IEEE Journal of Selected Topics in Signal Processing (2023)
- [4] Cohen, G., Afshar, S., Tapson, J., Van Schaik, A.: EMNIST: Extending MNIST to Handwritten Letters. In: International Joint Conference on Neural Networks (IJCNN) (2017)
- [5] Deng, Y., Kamani, M.M., Mahdavi, M.: Adaptive Personalized Federated Learning. arXiv preprint arXiv:2003.13461 (2020)
- [6] Dinh, C.T., Tran, N., Nguyen, J.: Personalized Federated Learning With Moreau Envelopes. Advances in Neural Information Processing Systems (2020)
- [7] Duan, M., Liu, D., Ji, X., Liu, R., Liang, L., Chen, X., Tan, Y.: FedGroup: Efficient Federated Learning via Decomposed Similarity-Based Clustering. In: IEEE International Conference on Parallel & Distributed Processing with Applications (ISPA/BDCloud/SocialCom/SustainCom) (2021)
- [8] Fallah, A., Mokhtari, A., Ozdaglar, A.: Personalized Federated Learning with Theoretical Guarantees: A Model-Agnostic Meta-Learning Approach. In: Advances in Neural Information Processing Systems (2020)
- [9] Ge, L., Lanier, M., Sarkar, A., Guresti, B., Vorobeychik, Y., Zhang, C.: Learning Policy Committees for Effective Personalization in MDPs With Diverse Tasks. Proceedings of the 42nd International Conference on Machine Learning (2025)
- [10] Ghosh, A., Chung, J., Yin, D., Ramchandran, K.: An Efficient Framework for Clustered Federated Learning. Advances in Neural Information Processing Systems (2020)
- [11] Gong, B., Xing, T., Liu, Z., Xi, W., Chen, X.: Adaptive Client Clustering for Efficient Federated Learning Over Non-IID and Imbalanced Data. IEEE Transactions on Big Data (2022)
- [12] Karimireddy, S.P., Kale, S., Mohri, M., Reddi, S., Stich, S., Suresh, A.T.: SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. In: International Conference on Machine Learning (2020)
- [13] Krizhevsky, A., Hinton, G., et al.: Learning Multiple Layers of Features From Tiny Images. Toronto, ON, Canada (2009)
- [14] Lang, H., Sontag, D., Vijayaraghavan, A.: Theoretical Analysis of Weak-to-Strong Generalization. Advances in Neural Information Processing Systems (2024)
- [15] Le, Y., Yang, X.: Tiny ImageNet Visual Recognition Challenge. Stanford CS231N: Convolutional Neural Networks for Visual Recognition (2015)
- [16] Li, D., Wang, J.: FedMD: Heterogenous Federated Learning via Model Distillation. arXiv preprint arXiv:1910.03581 (2019)
- [17] Li, T., Hu, S., Beirami, A., Smith, V.: Ditto: Fair and Robust Federated Learning Through Personalization. In: International Conference on Machine Learning (2021)
- [18] Li, T., Sahu, A.K., Zaheer, M., Sanjabi, M., Talwalkar, A., Smith, V.: Federated Optimization in Heterogeneous Networks. Proceedings of Machine Learning and Systems (2020)
- [19] Lin, T., Kong, L., Stich, S.U., Jaggi, M.: Ensemble Distillation for Robust Model Fusion in Federated Learning. Advances in Neural Information Processing Systems (2020)
- [20] Liu, J., Liu, X., Wang, S., Wan, X., Li, D., Lu, K., He, K.: Communication-Efficient Federated Multi-View Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 48, 17–32 (2025), https://api.semanticscholar.org/CorpusID:280767120
- [21] Makhija, D., Han, X., Ho, N., Ghosh, J.: Architecture Agnostic Federated Learning for Neural Networks. In: International Conference on Machine Learning (2022)
- [22] McMahan, B., Moore, E., Ramage, D., Hampson, S., y Arcas, B.A.: Communication-Efficient Learning of Deep Networks from Decentralized Data. In: International Conference on Artificial Intelligence and Statistics (2017)
- [23] Mora, A., Tenison, I., Bellavista, P., Rish, I.: Knowledge Distillation in Federated Learning: A Practical Guide. In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (2024)
- [24] Oh, J., Kim, S., Yun, S.Y.: FedBABU: Toward Enhanced Representation for Federated Image Classification. In: International Conference on Learning Representations (2022)
- [25] Sattler, F., Marban, A., Rischke, R., Samek, W.: CFD: Communication-Efficient Federated Distillation via Soft-Label Quantization and Delta Coding. IEEE Transactions on Network Science and Engineering (2022)
- [26] Shahid, O., Pouriyeh, S., Parizi, R.M., Sheng, Q.Z., Srivastava, G., Zhao, L.: Communication Efficiency in Federated Learning: Achievements and Challenges. arXiv preprint arXiv:2107.10996 (2021)
- [27] Shamsian, A., Navon, A., Fetaya, E., Chechik, G.: Personalized Federated Learning Using Hypernetworks. In: International Conference on Machine Learning (2021)
- [28] Tamirisa, R., Xie, C., Bao, W., Zhou, A., Arel, R., Shamsian, A.: FedSelect: Personalized Federated Learning With Customized Selection of Parameters for Fine-Tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
- [29] Tan, A.Z., Yu, H., Cui, L., Yang, Q.: Towards Personalized Federated Learning. IEEE Transactions on Neural Networks and Learning Systems (2022)
- [30] Tan, Y., Long, G., Liu, L., Zhou, T., Lu, Q., Jiang, J., Zhang, C.: FedProto: Federated Prototype Learning Across Heterogeneous Clients. In: Proceedings of the AAAI Conference on Artificial Intelligence (2022)
- [31] Wei, C., Shen, K., Chen, Y., Ma, T.: Theoretical Analysis of Self-Training With Deep Networks on Unlabeled Data. In: International Conference on Learning Representations (2020)
- [32] Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., Hoang, N., Khazaeni, Y.: Bayesian Nonparametric Federated Learning of Neural Networks. In: International Conference on Machine Learning (2019)
- [33] Zhang, J., Shi, Y.: A Personalized Federated Learning Method Based on Clustering and Knowledge Distillation. Electronics (2024)
- [34] Zhu, Z., Hong, J., Zhou, J.: Data-Free Knowledge Distillation for Heterogeneous Federated Learning. In: International Conference on Machine Learning (2021)
Dataset and implementation notes:
- [35] Tiny ImageNet: downloaded from the official Stanford CS231n website as a ZIP file containing 200 categories of images used for classification tasks; the dataset is extracted and reorganized into training and validation folders. Available at: http://cs231n.stanford.edu/tiny-imagenet...
- [36] CIFAR-10 & CIFAR-100: 60,000 32x32 color images in 10 and 100 classes, respectively, commonly used for training image-classification models; available for download directly via the torchvision library. More information: https://www.cs.toronto.edu/~kriz...
- [37] Extended MNIST (EMNIST): extends the original MNIST dataset to include handwritten letters; the Balanced split, with 131,600 characters across 47 balanced classes, is used. Available via the torchvision library; additional details: https://www.nist.gov/itl/prod...
- [38] Loss functions: training on pseudo-labels uses Kullback-Leibler divergence (nn.KLDivLoss with reduction='batchmean') to align model outputs with mean pseudo-labels; client fine-tuning on local true labels uses cross-entropy loss (nn.CrossEntropyLoss). A sketch of these two losses follows this list.
- [39] Optimizer and learning rate: the Adam optimizer is used for both training stages; after hyperparameter optimization, a higher learning rate of 0.001 is used when training on client-side data with true labels, enabling quick adaptation to reliable supervision, while for pseudo-labeled data the learning rate is reduced...
- [40] Weight initialization: Kaiming He initialization is used for convolutional and linear layers (well suited to ReLU activations), with biases initialized to zero; the seed is updated and applied to ensure reproducibility across runs.
- [41] COSMOS clustering hyperparameter (B): a neighborhood threshold B determines client similarity during the greedy-elimination clustering step; B is tuned so that the resulting number of clusters is approximately K = 5, matching the 20% class-grouping structure used to induce heterogeneous label distributions across clients.
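Pieced together from the loss and optimizer notes above ([38], [39]), a client-side training step might look like the following sketch. The model, optimizer, and batch objects are placeholders, and this is not the authors' code.

```python
# Sketch of the client-side losses described in the implementation notes above
# (model, optimizer, loaders, and the pseudo_labels tensor are placeholders).
import torch
import torch.nn as nn

kd_loss = nn.KLDivLoss(reduction="batchmean")  # align outputs with mean pseudo-labels
ce_loss = nn.CrossEntropyLoss()                # fine-tuning on local true labels

def distill_step(model, optimizer, public_batch, pseudo_labels):
    """One distillation step on public data using the cluster model's soft pseudo-labels."""
    optimizer.zero_grad()
    log_probs = torch.log_softmax(model(public_batch), dim=1)
    loss = kd_loss(log_probs, pseudo_labels)   # pseudo_labels: rows of class probabilities
    loss.backward()
    optimizer.step()
    return loss.item()

def finetune_step(model, optimizer, local_batch, local_targets):
    """One fine-tuning step on the client's own labeled data."""
    optimizer.zero_grad()
    loss = ce_loss(model(local_batch), local_targets)
    loss.backward()
    optimizer.step()
    return loss.item()

# Per the notes, Adam is used for both stages, with lr 0.001 for true-label fine-tuning
# and a lower (truncated in the notes above) rate for pseudo-label distillation.
```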
discussion (0)