
arxiv: 2605.21322 · v1 · pith:NGYWZDMOnew · submitted 2026-05-20 · 💻 cs.LG

Optimized Federated Knowledge Distillation with Distributed Neural Architecture Search

Pith reviewed 2026-05-21 05:38 UTC · model grok-4.3

classification 💻 cs.LG
keywords: federated learning · knowledge distillation · neural architecture search · non-IID data · system heterogeneity · communication efficiency · Pareto efficiency

The pith

Clients select their own models in federated learning to raise accuracy while slashing computation and communication.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to show that federated learning can work better when clients are free to choose lightweight neural architectures that fit their local data and hardware. Each client trains its chosen model on private data while also learning from aggregated predictions produced on a public reference dataset. The server smooths these predictions to create reliable targets that clients use for distillation in the following round. This removes the usual requirement that all clients run identical model structures, which often leads to poor performance when data or devices differ. A reader would care because the reported results indicate substantial gains in accuracy alongside major cuts in local processing time and data transmission volume.

Core claim

FedKDNAS lets each client autonomously pick a lightweight architecture under accuracy and resource constraints. The client trains this model locally using supervised learning combined with knowledge distillation from server-provided targets. Only the model's predictions on a public reference set are shared with the server. The server aggregates and smooths these predictions, sometimes incorporating a teacher model, to generate stable distillation targets for the next training round. Tests on six datasets against six baselines confirm improved Pareto efficiency.
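The server-side step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes uniform averaging of client class probabilities and an exponential moving average for smoothing (the appendix excerpt refers to an EMA recurrence with coefficient γ). The function names, `gamma=0.9`, and the random logits are hypothetical, and the optional teacher model is omitted.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_predictions(client_logits, weights=None):
    """Average client class probabilities on the public reference set.

    client_logits has shape (num_clients, num_ref, num_classes).
    Uniform averaging is an assumption; weights are optional.
    """
    probs = softmax(np.asarray(client_logits, dtype=float))
    return np.average(probs, axis=0, weights=weights)

def smooth_targets(prev_targets, new_targets, gamma=0.9):
    """EMA smoothing: gamma * previous + (1 - gamma) * new."""
    if prev_targets is None:  # first round: nothing to smooth against
        return new_targets
    return gamma * prev_targets + (1.0 - gamma) * new_targets

# Three rounds with 4 clients, 5 reference examples, 3 classes (placeholder data).
rng = np.random.default_rng(0)
targets = None
for round_idx in range(3):
    logits = rng.normal(size=(4, 5, 3))
    targets = smooth_targets(targets, aggregate_predictions(logits))
```

Because both the average and the EMA are convex combinations of probability vectors, each row of `targets` remains a valid distribution that clients could distill from in the next round.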

What carries the argument

Client-driven neural architecture selection with server-side aggregation of predictions on a public reference set for generating distillation targets

Load-bearing premise

That clients can correctly and autonomously choose lightweight architectures matching their accuracy needs and device limits, and that a public reference set can be used without causing bias or privacy problems.

What would settle it

If a fixed client architecture in a standard federated setup achieves the same accuracy with similar or lower CPU and communication costs on the evaluated datasets, the advantage of the proposed method would be called into question.
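The comparison this test implies can be made concrete with a Pareto-dominance check over accuracy, CPU cost, and communication cost. The sketch below is an illustration only: the `Result` fields, the method names, and all numbers are hypothetical placeholders, not values from the paper.

```python
from typing import NamedTuple

class Result(NamedTuple):
    name: str
    accuracy: float  # higher is better
    cpu: float       # lower is better
    comm_mb: float   # lower is better

def dominates(a, b):
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = (a.accuracy >= b.accuracy and a.cpu <= b.cpu
                and a.comm_mb <= b.comm_mb)
    strictly_better = (a.accuracy > b.accuracy or a.cpu < b.cpu
                       or a.comm_mb < b.comm_mb)
    return no_worse and strictly_better

def pareto_front(results):
    """Keep the results that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]

# Placeholder numbers for illustration only.
results = [
    Result("method_a", accuracy=0.80, cpu=10.0, comm_mb=5.0),
    Result("method_b", accuracy=0.75, cpu=12.0, comm_mb=50.0),
    Result("method_c", accuracy=0.82, cpu=20.0, comm_mb=60.0),
]
front_names = [r.name for r in pareto_front(results)]
```

Under this framing, the claimed advantage would be called into question if a fixed-architecture baseline appeared on the front at an accuracy and cost level that dominates, or matches, the proposed method.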

Figures

Figures reproduced from arXiv: 2605.21322 by Chaimaa Medjadji, Feras M. Awaysheh, Guilain Leduc, Sadi Alawadi, Sylvain Kubler, Yves Le Traon.

Figure 1: Overview of the proposed FedKD-NAS architecture. At each communication round, each client independently selects a …
Figure 2: Accuracy over 100 communication rounds on CIFAR10 under IID, Dirichlet (…
Figure 3: Accuracy drop from IID to non-IID settings on CIFAR10.
Figure 4: Communication cost per round on CIFAR-10. Since …
Figure 5: RES ↓ and UES ↑ on CIFAR10 across IID, Dirichlet (α=0.1), and Shards. FedKD-NAS achieves the lowest RES and highest UES across distributions and architectures, with UES increasing under heterogeneity.
Table VI: HAR results at the final communication round. We therefore report UES⋆ = PQS · CES as a resource-free unified efficiency indicator. Columns: Algorithm, Acc ↑, Loss ↓, Comm (MB) ↓, PQS ↑, CES ↑, UES⋆ ↑. First row (truncated in source): FedAvg 0.731…
Figure 6: CES vs. PQS trade-off on CIFAR10 (bubble area …
Figure 7: CES vs. PQS trade-off on CIFAR100 (bubble area …
Figure 8: CES vs. PQS trade-off on EMNIST (bubble area …
Figure 9: CES vs. PQS trade-off on FMNIST (bubble area …
Figure 10: CES vs. PQS trade-off on MNIST (bubble area …
Figure 11: Communication cost per round on MNIST. For both …
Figure 12: Communication cost per round on EMNIST. For …
Figure 13: Communication cost per round on CIFAR100 under …
Figure 14: Accuracy drop from IID to non-IID on MNIST. FedKD …
Figure 15: Accuracy drop from IID to non-IID on FMNIST.
Figure 17: Accuracy drop from IID to non-IID on CIFAR100. All …
Figure 18: RES (↓) and UES (↑) on CIFAR100. FedKD-NAS achieves the lowest RES on both architectures under all three data distributions. For UES, FedKD-NAS attains the highest values on MobileNetV2 across all three data distributions. On ShuffleNetV2, FedAvg has the highest overall UES, while FedKD-NAS remains the strongest method within the low-communication logit-based group, achieving 1.0432, 1.2543, and 1.1971 un…
Figure 19: RES (↓) and UES (↑) on MNIST. On LeNet5, FedDistill consistently achieves the lowest RES across all data distributions, while FedKD-NAS attains the highest UES in IID, Dirichlet, and Shards, driven by its superior PQS and high CES. On ResNet18, FedAvg achieves the lowest RES under IID and Dirichlet, while FedDistill attains the lowest RES under Shards. For UES, FedKD-NAS consistently achieves the highest …
Figure 20: RES (↓) and UES (↑) on FMNIST. On LeNet5, FedMD achieves the lowest RES under IID, while FedDistill attains the lowest RES under Dirichlet and Shards; FedKD-NAS achieves the highest UES under IID, whereas FedMD and FedDistill lead under Dirichlet and Shards, respectively. On ResNet18, the lowest RES is achieved by FedAvg under IID, FedKD-NAS under Dirichlet, and Ditto under Shards. For UES on ResNet18, Fe…
Figure 21: RES (↓) and UES (↑) on EMNIST. On LeNet5, FedDistill consistently achieves the lowest RES across all data distributions, while FedAvg attains the highest UES because of its strong accuracy and higher CES. On ResNet18, Ditto achieves the lowest RES under IID, FedAvg under Dirichlet, and FedKD-NAS under Shards. For UES, FedAvg consistently achieves the highest values across all distributions and both archit…
Figure 22: Accuracy curves over 100 rounds on MNIST under IID, Dirichlet (…
Figure 23: Accuracy curves over 100 rounds on FMNIST. FedKD-NAS leads under all distributions with LeNet5. On ResNet18, …
Figure 24: Accuracy curves over 100 rounds on EMNIST. EMNIST is a 47-class benchmark in which heterogeneity creates a …
Figure 25: Accuracy curves over 100 rounds on CIFAR100. CIFAR100 is the most challenging benchmark because it contains …
Original abstract

Federated Learning (FL) enables collaborative model training without centralizing data. However, real-world deployments must simultaneously address statistical heterogeneity across client data (non-IID), system heterogeneity in device capabilities, and communication efficiency. Existing FL approaches mitigate these challenges through improved aggregation, personalization, or knowledge distillation, but they almost universally assume a fixed client architecture, limiting adaptability to heterogeneous data complexity and hardware constraints. This architectural constraint often leads to suboptimal trade-offs between accuracy and efficiency in real-world FL systems. This work introduces FedKDNAS, a distillation-driven FL framework that combines client-side neural architecture selection with distillation of server-coordinated knowledge. Each client autonomously selects a lightweight model under accuracy-resource constraints. It then trains it locally using a hybrid objective combining supervised learning and knowledge distillation and shares only predictions on a public reference set. The server then aggregates and smooths these predictions, optionally combining them with a teacher model, to produce stable distillation targets for the next round. Extensive evaluation on six datasets against six representative FL baselines (FedAvg, Ditto, FedMD, FedDF, FedDistill, Local-KD) demonstrates that FedKDNAS consistently achieves superior Pareto efficiency, improving accuracy by up to 15% under non-IID conditions, reducing client CPU usage by approximately 28%, and decreasing communication overhead by up to 44 times while maintaining lightweight logit-based communication.
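The hybrid client objective mentioned in the abstract can be illustrated with a short sketch. The abstract does not give the exact form, so this assumes a weighted sum of cross-entropy on a private batch and a KL-divergence term toward the server targets on a reference batch. The weight `alpha` and the function names are hypothetical, and no temperature scaling is applied.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def hybrid_loss(private_logits, labels, ref_logits, ref_targets, alpha=0.5):
    """(1 - alpha) * cross-entropy + alpha * KL(server targets || client).

    private_logits: (batch, classes) client outputs on private data
    labels:         (batch,) integer class labels
    ref_logits:     (num_ref, classes) client outputs on the reference set
    ref_targets:    (num_ref, classes) server-provided probabilities
    alpha is an assumed mixing weight, not a value from the paper.
    """
    log_p = log_softmax(private_logits)
    ce = -log_p[np.arange(len(labels)), labels].mean()

    log_q = log_softmax(ref_logits)
    eps = 1e-12  # guards log(0) when a target probability is exactly zero
    kl = (ref_targets * (np.log(ref_targets + eps) - log_q)).sum(axis=-1).mean()
    return (1.0 - alpha) * ce + alpha * kl

# Placeholder inputs: the client already matches the server targets,
# so the distillation term contributes (almost) nothing.
server_targets = np.array([[0.7, 0.2, 0.1]])
loss = hybrid_loss(np.zeros((2, 3)), np.array([0, 1]),
                   np.log(server_targets), server_targets, alpha=0.5)
```

Setting `alpha=0` recovers purely supervised local training, while `alpha=1` trains only against the server targets; the choice between them is a design decision the abstract leaves unspecified.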

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes FedKDNAS, a federated learning framework combining client-side neural architecture search for lightweight models with server-coordinated knowledge distillation. Clients train locally using a hybrid supervised-plus-distillation objective and communicate only logits on a shared public reference set; the server aggregates and smooths these predictions (optionally with a teacher) to form distillation targets for subsequent rounds. Empirical results on six datasets versus six baselines (FedAvg, Ditto, FedMD, FedDF, FedDistill, Local-KD) report up to 15% accuracy gains under non-IID conditions, ~28% client CPU reduction, and up to 44× lower communication overhead.

Significance. If the performance claims hold under rigorous controls, the work would contribute a practical approach to jointly addressing statistical heterogeneity, system heterogeneity, and communication constraints in FL via adaptive client architectures and logit-only communication. The multi-dataset, multi-baseline evaluation is a positive feature for empirical breadth.
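To see why logit-only communication can be light, a back-of-the-envelope payload calculation helps. The sketch below assumes uncompressed float32 values. The 2,000-example reference set is the figure given in the simulated rebuttal, 10 classes corresponds to CIFAR10, and the 1,000,000-parameter model is a hypothetical stand-in, so the resulting ratio is illustrative and not a reproduction of the reported 44×.

```python
def logit_payload_bytes(num_ref, num_classes, bytes_per_value=4):
    """Bytes a client uploads when it shares one prediction vector per reference example."""
    return num_ref * num_classes * bytes_per_value

def weight_payload_bytes(num_params, bytes_per_value=4):
    """Bytes a client uploads when it shares all model parameters."""
    return num_params * bytes_per_value

logit_bytes = logit_payload_bytes(num_ref=2000, num_classes=10)  # 2000 * 10 * 4
weight_bytes = weight_payload_bytes(num_params=1_000_000)        # hypothetical model size
ratio = weight_bytes / logit_bytes
```

The logit payload scales with the reference-set size and the number of classes, not with model size, which is what lets clients run different architectures without changing the message format.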

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): the central claims of up to 15% accuracy improvement, 28% CPU reduction, and 44× communication savings under non-IID conditions are presented without reported statistical tests, confidence intervals, or variance across random seeds and non-IID partitions. This leaves the superiority over FedDF and FedDistill weakly supported.
  2. [§3] §3 (Proposed Method): the distillation targets are formed by aggregating client logits on a public reference set. No description is given of how the reference set is constructed to remain representative across heterogeneous client distributions or to avoid selection bias; if the set skews toward any subpopulation, the smoothed targets become mis-calibrated, directly undermining both the accuracy and efficiency gains relative to baselines that also rely on distillation.
minor comments (2)
  1. [§3] The neural architecture search space and the exact accuracy-resource constraint used for client-side selection are not specified, hindering reproducibility.
  2. [§4] Hyperparameter tuning details and the precise non-IID partitioning procedure (e.g., Dirichlet concentration or label skew ratios) are omitted from the experimental setup.
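For context on the partitioning detail requested in the second minor comment, a commonly used Dirichlet label-skew procedure is sketched below. This is a generic recipe, not the paper's documented procedure: the function name, the seed handling, and the toy label array are assumptions. The value α=0.1 is the concentration quoted in the figure captions.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split example indices across clients with Dirichlet label skew.

    For each class, client shares are drawn from Dirichlet(alpha, ..., alpha);
    smaller alpha concentrates each class on fewer clients.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Convert cumulative shares into split positions within this class.
        cut_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cut_points)):
            client_indices[client].extend(part.tolist())
    return client_indices

# Toy data: 10 classes with 100 examples each, split over 5 clients.
toy_labels = np.repeat(np.arange(10), 100)
parts = dirichlet_partition(toy_labels, num_clients=5, alpha=0.1)
sizes = [len(p) for p in parts]
```

Reporting the seed and α alongside per-client class histograms would let readers regenerate the exact partitions behind each non-IID result.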

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of empirical rigor and methodological clarity. We address each point below and have revised the manuscript to incorporate the suggestions where appropriate.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the central claims of up to 15% accuracy improvement, 28% CPU reduction, and 44× communication savings under non-IID conditions are presented without reported statistical tests, confidence intervals, or variance across random seeds and non-IID partitions. This leaves the superiority over FedDF and FedDistill weakly supported.

    Authors: We agree that the absence of statistical tests and variance reporting weakens the strength of the empirical claims. In the revised version, we have rerun the experiments using 5 independent random seeds per non-IID partition setting. We now report mean accuracy, CPU usage, and communication cost together with standard deviations. We have also added paired t-test p-values comparing FedKDNAS against FedDF and FedDistill, showing that the reported gains remain statistically significant (p < 0.05) in the majority of evaluated settings. These changes appear in the abstract and Section 4. revision: yes

  2. Referee: [§3] §3 (Proposed Method): the distillation targets are formed by aggregating client logits on a public reference set. No description is given of how the reference set is constructed to remain representative across heterogeneous client distributions or to avoid selection bias; if the set skews toward any subpopulation, the smoothed targets become mis-calibrated, directly undermining both the accuracy and efficiency gains relative to baselines that also rely on distillation.

    Authors: This is a valid concern. The original manuscript did not provide sufficient detail on reference-set construction. In the revision we have added an explicit description in Section 3: the reference set is a fixed, randomly sampled collection of 2,000 examples drawn from a publicly available held-out dataset that is completely disjoint from all client training data. We further include a short sensitivity study demonstrating that performance is stable across different random draws of the reference set, thereby reducing the risk of subpopulation skew and mis-calibration. revision: yes
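The paired test described in the first response can be sketched with the standard library alone. The per-seed accuracies below are made-up placeholders, not results from the paper, and the function name is hypothetical. The critical value 2.776 is the two-sided 5% threshold of Student's t distribution with 4 degrees of freedom, which matches 5 paired seeds.

```python
import math
import statistics

# Two-sided critical value at the 5% level for df = 4 (five paired seeds).
T_CRIT_DF4 = 2.776

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-seed differences between two methods."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))

# Placeholder per-seed accuracies for illustration only.
method_a = [0.71, 0.73, 0.70, 0.74, 0.72]
method_b = [0.68, 0.69, 0.69, 0.70, 0.68]

t_stat = paired_t_statistic(method_a, method_b)
significant = abs(t_stat) > T_CRIT_DF4
```

Pairing by seed matters here: both methods must be run on the same seed and the same partition so that the test compares differences within matched runs, not two independent samples.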

Circularity Check

0 steps flagged

Empirical framework with no circular derivation or self-referential claims


The paper presents FedKDNAS as an empirical FL framework that combines client-side NAS for lightweight models with server-side aggregation of logits on a public reference set for distillation. All reported gains (accuracy, CPU, communication) are obtained from experiments across six datasets and six baselines; no equations, closed-form derivations, or fitted parameters are shown that reduce these outcomes to quantities defined within the same paper. No self-citations are invoked as load-bearing uniqueness theorems, and the central mechanism is externally falsifiable by reproducing the described protocol. The analysis is therefore self-contained with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework rests on standard federated learning assumptions plus the availability of a public reference dataset and the feasibility of local architecture search; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5798 in / 1104 out tokens · 32651 ms · 2026-05-21T05:38:33.128792+00:00 · methodology



APPENDIX

Subtracting consecutive iterates of the EMA recurrence (13) gives

\begin{align}
\tilde{Z}^{(r)} - \tilde{Z}^{(r-1)} &= \gamma \tilde{Z}^{(r-1)} + (1-\gamma) Z^{(r)} - \tilde{Z}^{(r-1)} \tag{30} \\
&= (1-\gamma)\bigl(Z^{(r)} - \tilde{Z}^{(r-1)}\bigr), \tag{31}
\end{align}

establishing the first equality. To obta...