Optimized Federated Knowledge Distillation with Distributed Neural Architecture Search

Chaimaa Medjadji; Feras M. Awaysheh; Guilain Leduc; Sadi Alawadi; Sylvain Kubler; Yves Le Traon

arxiv: 2605.21322 · v1 · pith:NGYWZDMOnew · submitted 2026-05-20 · 💻 cs.LG

Pith reviewed 2026-05-21 05:38 UTC · model grok-4.3

classification 💻 cs.LG

keywords federated learning · knowledge distillation · neural architecture search · non-IID data · system heterogeneity · communication efficiency · Pareto efficiency

The pith

Clients select their own models in federated learning to raise accuracy while slashing computation and communication.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to show that federated learning can work better when clients are free to choose lightweight neural architectures that fit their local data and hardware. Each client trains its chosen model on private data while also learning from aggregated predictions produced on a public reference dataset. The server smooths these predictions to create reliable targets that clients use for distillation in the following round. This removes the usual requirement that all clients run identical model structures, which often leads to poor performance when data or devices differ. A reader would care because the reported results indicate substantial gains in accuracy alongside major cuts in local processing time and data transmission volume.

Core claim

FedKDNAS lets each client autonomously pick a lightweight architecture under accuracy and resource constraints. The client trains this model locally using supervised learning combined with knowledge distillation from server-provided targets. Only the model's predictions on a public reference set are shared with the server. The server aggregates and smooths these predictions, sometimes incorporating a teacher model, to generate stable distillation targets for the next training round. Tests on six datasets against six baselines confirm improved Pareto efficiency.
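
The hybrid objective described above can be sketched as a weighted sum of a supervised term on private data and a distillation term on the public reference set. This is a minimal sketch under stated assumptions, not the paper's formulation: the mixing weight `lam`, the `temperature`, and the choice of KL divergence for the distillation term are all hypothetical, since the abstract does not specify them.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Supervised loss for one labelled private example."""
    return -math.log(softmax(logits)[label])

def kd_loss(student_logits, target_probs, temperature=2.0):
    """KL(target || student) for one public reference example.

    Assumption: the server target is already a probability vector and the
    student is softened with the same temperature.
    """
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(target_probs, student) if t > 0)

def hybrid_loss(private_logits, label, public_logits, target_probs,
                lam=0.5, temperature=2.0):
    """Weighted sum of the supervised and distillation terms (lam is hypothetical)."""
    supervised = cross_entropy(private_logits, label)
    distill = kd_loss(public_logits, target_probs, temperature)
    return (1.0 - lam) * supervised + lam * distill
```

With `lam=0.0` the sketch reduces to plain local supervised training, and with `lam=1.0` the client learns only from the server targets.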

What carries the argument

Client-driven neural architecture selection with server-side aggregation of predictions on a public reference set for generating distillation targets
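
One way to picture the server-side step is the sketch below. The source only says that the server aggregates and smooths predictions and sometimes incorporates a teacher; the simple mean over clients, the linear teacher mix (`teacher_weight`), and the exponential moving average against the previous round (`momentum`) are assumptions chosen for illustration.

```python
def aggregate_targets(client_probs, previous=None, teacher=None,
                      momentum=0.5, teacher_weight=0.3):
    """Build distillation targets from client predictions on the reference set.

    client_probs: one entry per client; each entry is a list (one item per
    reference example) of class-probability lists.
    previous, teacher: optional targets with the same per-example layout.
    """
    n_clients = len(client_probs)
    n_examples = len(client_probs[0])
    n_classes = len(client_probs[0][0])
    targets = []
    for i in range(n_examples):
        # Assumption: unweighted mean over clients.
        avg = [sum(c[i][k] for c in client_probs) / n_clients
               for k in range(n_classes)]
        if teacher is not None:
            # Assumption: linear blend with the teacher's prediction.
            avg = [(1.0 - teacher_weight) * a + teacher_weight * t
                   for a, t in zip(avg, teacher[i])]
        if previous is not None:
            # Assumption: smoothing as a moving average over rounds.
            avg = [momentum * p + (1.0 - momentum) * a
                   for p, a in zip(previous[i], avg)]
        targets.append(avg)
    return targets
```

Because every step is a convex combination of probability vectors, each target still sums to one, so it can be used directly as a soft label in the next round.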

Load-bearing premise

That clients can correctly and autonomously choose lightweight architectures matching their accuracy needs and device limits, and that a public reference set can be used without causing bias or privacy problems.
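
To make the premise concrete, the following sketch shows one possible form of constrained selection on a client. It assumes each candidate has already been scored on a local validation split and profiled on the device; the `Candidate` fields, the thresholds, and the fallback rule are hypothetical and are not taken from the paper, which does not state its search space or constraint in the abstract.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str            # hypothetical architecture identifier
    val_accuracy: float  # accuracy on the client's local validation split
    flops_m: float       # estimated compute cost in MFLOPs
    size_mb: float       # model size in megabytes

def select_architecture(candidates, min_accuracy, max_flops_m, max_size_mb):
    """Pick the cheapest candidate that meets the accuracy floor and fits the device."""
    fits = [c for c in candidates
            if c.flops_m <= max_flops_m and c.size_mb <= max_size_mb]
    feasible = [c for c in fits if c.val_accuracy >= min_accuracy]
    if feasible:
        # Cheapest first; ties broken in favour of higher accuracy.
        return min(feasible, key=lambda c: (c.flops_m, -c.val_accuracy))
    # Assumed fallback: the most accurate model the device can run, if any.
    return max(fits, key=lambda c: c.val_accuracy) if fits else None
```

The premise fails in exactly the cases this sketch hides: when local validation accuracy is a poor proxy for global performance, or when the device profile is wrong.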

What would settle it

If a fixed client architecture in a standard federated setup achieves the same accuracy with similar or lower CPU and communication costs on the evaluated datasets, the advantage of the proposed method would be called into question.

Figures

Figures reproduced from arXiv: 2605.21322 by Chaimaa Medjadji, Feras M. Awaysheh, Guilain Leduc, Sadi Alawadi, Sylvain Kubler, Yves Le Traon.

**Figure 1.** Overview of the proposed FedKD-NAS architecture. At each communication round, each client independently selects a…

**Figure 2.** Accuracy over 100 communication rounds on CIFAR10 under IID, Dirichlet (…

**Figure 3.** Accuracy drop from IID to non-IID settings on CIFAR10.

**Figure 4.** Communication cost per round on CIFAR-10. Since…

**Figure 5.** RES ↓ and UES ↑ on CIFAR10 across IID, Dirichlet (α=0.1), and Shards. FedKD-NAS achieves the lowest RES and highest UES across distributions and architectures, with UES increasing under heterogeneity.

**Table VI.** HAR results at the final communication round. We therefore report UES⋆ = PQS · CES as a resource-free unified efficiency indicator. Columns: Algorithm, Acc ↑, Loss ↓, Comm (MB) ↓, PQS ↑, CES ↑, UES⋆ ↑. First row: FedAvg 0.731…

**Figure 6.** CES vs. PQS trade-off on CIFAR10 (bubble area…

**Figure 7.** CES vs. PQS trade-off on CIFAR100 (bubble area…

**Figure 8.** CES vs. PQS trade-off on EMNIST (bubble area…

**Figure 9.** CES vs. PQS trade-off on FMNIST (bubble area…

**Figure 10.** CES vs. PQS trade-off on MNIST (bubble area…

**Figure 11.** Communication cost per round on MNIST. For both…

**Figure 12.** Communication cost per round on EMNIST. For…

**Figure 13.** Communication cost per round on CIFAR100 under…

**Figure 14.** Accuracy drop from IID to non-IID on MNIST. FedKD…

**Figure 15.** Accuracy drop from IID to non-IID on FMNIST.

**Figure 17.** Accuracy drop from IID to non-IID on CIFAR100. All…

**Figure 18.** RES (↓) and UES (↑) on CIFAR100. FedKD-NAS achieves the lowest RES on both architectures under all three data distributions. For UES, FedKD-NAS attains the highest values on MobileNetV2 across all three data distributions. On ShuffleNetV2, FedAvg has the highest overall UES, while FedKD-NAS remains the strongest method within the low-communication logit-based group, achieving 1.0432, 1.2543, and 1.1971 un…

**Figure 19.** RES (↓) and UES (↑) on MNIST. On LeNet5, FedDistill consistently achieves the lowest RES across all data distributions, while FedKD-NAS attains the highest UES in IID, Dirichlet, and Shards, driven by its superior PQS and high CES. On ResNet18, FedAvg achieves the lowest RES under IID and Dirichlet, while FedDistill attains the lowest RES under Shards. For UES, FedKD-NAS consistently achieves the highest…

**Figure 20.** RES (↓) and UES (↑) on FMNIST. On LeNet5, FedMD achieves the lowest RES under IID, while FedDistill attains the lowest RES under Dirichlet and Shards; FedKD-NAS achieves the highest UES under IID, whereas FedMD and FedDistill lead under Dirichlet and Shards, respectively. On ResNet18, the lowest RES is achieved by FedAvg under IID, FedKD-NAS under Dirichlet, and Ditto under Shards. For UES on ResNet18, Fe…

**Figure 21.** RES (↓) and UES (↑) on EMNIST. On LeNet5, FedDistill consistently achieves the lowest RES across all data distributions, while FedAvg attains the highest UES because of its strong accuracy and higher CES. On ResNet18, Ditto achieves the lowest RES under IID, FedAvg under Dirichlet, and FedKD-NAS under Shards. For UES, FedAvg consistently achieves the highest values across all distributions and both archit…

**Figure 22.** Accuracy curves over 100 rounds on MNIST under IID, Dirichlet (…

**Figure 23.** Accuracy curves over 100 rounds on FMNIST. FedKD-NAS leads under all distributions with LeNet5. On ResNet18,…

**Figure 24.** Accuracy curves over 100 rounds on EMNIST. EMNIST is a 47-class benchmark in which heterogeneity creates a…

**Figure 25.** Accuracy curves over 100 rounds on CIFAR100. CIFAR100 is the most challenging benchmark because it contains…

Original abstract

Federated Learning (FL) enables collaborative model training without centralizing data. However, real-world deployments must simultaneously address statistical heterogeneity across client data (non-IID), system heterogeneity in device capabilities, and communication efficiency. Existing FL approaches mitigate these challenges through improved aggregation, personalization, or knowledge distillation, but they almost universally assume a fixed client architecture, limiting adaptability to heterogeneous data complexity and hardware constraints. This architectural constraint often leads to suboptimal trade-offs between accuracy and efficiency in real-world FL systems. This work introduces FedKDNAS, a distillation-driven FL framework that combines client-side neural architecture selection with distillation of server-coordinated knowledge. Each client autonomously selects a lightweight model under accuracy-resource constraints. It then trains it locally using a hybrid objective combining supervised learning and knowledge distillation and shares only predictions on a public reference set. The server then aggregates and smooths these predictions, optionally combining them with a teacher model, to produce stable distillation targets for the next round. Extensive evaluation on six datasets against six representative FL baselines (FedAvg, Ditto, FedMD, FedDF, FedDistill, Local-KD) demonstrates that FedKDNAS consistently achieves superior Pareto efficiency, improving accuracy by up to 15% under non-IID conditions, reducing client CPU usage by approximately 28%, and decreasing communication overhead by up to 44 times while maintaining lightweight logit-based communication.
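
The communication argument comes down to what scales the upload: a logit-based client sends one value per reference example per class, while a weight-sharing client sends one value per model parameter. The sketch below works through that arithmetic for a single upload direction. The sizes in the usage note are hypothetical placeholders, not the paper's settings, so the resulting ratio is not expected to reproduce the reported 44 times.

```python
def logit_payload_bytes(n_reference, n_classes, bytes_per_value=4):
    """Upload size when a client shares predictions on the reference set."""
    return n_reference * n_classes * bytes_per_value

def weight_payload_bytes(n_parameters, bytes_per_value=4):
    """Upload size when a client shares full model weights."""
    return n_parameters * bytes_per_value

def payload_ratio(n_parameters, n_reference, n_classes, bytes_per_value=4):
    """How many times larger the weight upload is than the logit upload."""
    return (weight_payload_bytes(n_parameters, bytes_per_value)
            / logit_payload_bytes(n_reference, n_classes, bytes_per_value))
```

For example, with a hypothetical 1,000,000-parameter model, 1,000 reference examples, and 10 classes, `payload_ratio(1_000_000, 1_000, 10)` compares 4,000,000 bytes against 40,000 bytes. The ratio shrinks as the reference set or the number of classes grows, which is why the saving depends on the dataset and architecture.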

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

FedKDNAS pairs client-side NAS with logit-only distillation on a public reference set to cut comms and adapt to heterogeneity, but the reported gains rest on details the abstract leaves out.

The main thing to know is that this paper lets each client pick its own small architecture via local NAS, trains with a mix of supervised loss and distillation from server-smoothed logits on a shared public set, and claims big wins on accuracy under non-IID conditions plus major drops in CPU and communication. The logit-only pattern keeps everything lightweight compared with sending full models or gradients. That combination of per-client selection and prediction-only exchange is the concrete step beyond the six baselines they cite.

They evaluate across six datasets and show consistent Pareto improvements, which is a reasonable way to demonstrate the approach handles both statistical and system heterogeneity at once. The numbers they report (up to 15% accuracy lift, 28% CPU reduction, and 44× less communication) look practically useful if they hold. The evaluation breadth is a strength here; covering multiple real-world-style datasets gives more weight than a single benchmark would.

The soft spots are mostly in the experimental reporting. The abstract gives no information on statistical tests, the precise non-IID partitioning method, the size or constraints of the architecture search space, or how hyperparameters were tuned. Those gaps make it hard to judge whether the gains are robust or sensitive to choices that were not fully described. The stress-test concern about the public reference set also lands. If that set is not representative of every client distribution, the aggregated targets could skew the distillation and make the efficiency claims look better than they would against a truly fair comparison. The paper needs to show the set introduces no selection bias or extra privacy exposure.

This work is aimed at people who actually deploy federated systems on heterogeneous edge devices and need to balance accuracy with resource use. A reader working on practical FL would find the client-side selection and logit aggregation ideas worth trying, even if they end up modifying the reference-set step. I would send it for peer review. The core mechanism addresses real deployment barriers and the empirical scope is broad enough to justify referee time, though the authors will have to supply the missing experimental specifics before it can be evaluated properly.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes FedKDNAS, a federated learning framework combining client-side neural architecture search for lightweight models with server-coordinated knowledge distillation. Clients train locally using a hybrid supervised-plus-distillation objective and communicate only logits on a shared public reference set; the server aggregates and smooths these predictions (optionally with a teacher) to form distillation targets for subsequent rounds. Empirical results on six datasets versus six baselines (FedAvg, Ditto, FedMD, FedDF, FedDistill, Local-KD) report up to 15% accuracy gains under non-IID conditions, ~28% client CPU reduction, and up to 44× lower communication overhead.

Significance. If the performance claims hold under rigorous controls, the work would contribute a practical approach to jointly addressing statistical heterogeneity, system heterogeneity, and communication constraints in FL via adaptive client architectures and logit-only communication. The multi-dataset, multi-baseline evaluation is a positive feature for empirical breadth.

major comments (2)

[Abstract and §4 (Experiments)] The central claims (up to 15% accuracy improvement under non-IID conditions, approximately 28% client CPU reduction, and up to 44× communication savings) are presented without reported statistical tests, confidence intervals, or variance across random seeds and non-IID partitions. This leaves the superiority over FedDF and FedDistill weakly supported.
[§3 (Proposed Method)] The distillation targets are formed by aggregating client logits on a public reference set. No description is given of how the reference set is constructed to remain representative across heterogeneous client distributions or to avoid selection bias; if the set skews toward any subpopulation, the smoothed targets become miscalibrated, directly undermining both the accuracy and efficiency gains relative to baselines that also rely on distillation.
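
The first major comment asks for variance across seeds and a significance test. A minimal sketch of one such check is a paired t-statistic over per-seed scores of two methods run on the same seeds and partitions. The function name and the choice of a paired t-test are assumptions for illustration; the paper may warrant a different test.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Paired t-statistic and degrees of freedom for matched per-seed scores.

    scores_a[i] and scores_b[i] must come from the same seed and partition.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need at least two matched score pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

With five seeds there are four degrees of freedom, for which the two-sided 5% critical value of Student's t is about 2.776; a statistic below that in absolute value would not support a claim of superiority at that level.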

minor comments (2)

[§3] The neural architecture search space and the exact accuracy-resource constraint used for client-side selection are not specified, hindering reproducibility.
[§4] Hyperparameter tuning details and the precise non-IID partitioning procedure (e.g., Dirichlet concentration or label skew ratios) are omitted from the experimental setup.
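
For context on the last comment, a common way to build label-skewed partitions is to split each class across clients with proportions drawn from a Dirichlet distribution. The sketch below follows that common recipe and is not confirmed to be the paper's procedure; the Figure 5 caption mentions α=0.1, but the function name, the seed handling, and the rounding rule here are assumptions.

```python
import random

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Assign example indices to clients with per-class Dirichlet(alpha) shares."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    clients = [[] for _ in range(n_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A Dirichlet sample is a vector of Gamma(alpha, 1) draws, normalised.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_clients)]
        total = sum(draws)
        shares = ([d / total for d in draws] if total > 0
                  else [1.0 / n_clients] * n_clients)
        start, cumulative = 0, 0.0
        for c, share in enumerate(shares):
            cumulative += share
            if c == n_clients - 1:
                end = len(idxs)  # last client takes whatever remains
            else:
                end = max(start, min(len(idxs), round(cumulative * len(idxs))))
            clients[c].extend(idxs[start:end])
            start = end
    return clients
```

Smaller `alpha` concentrates each class on fewer clients, which is why the concentration value has to be reported for results to be reproducible.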

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of empirical rigor and methodological clarity. We address each point below and have revised the manuscript to incorporate the suggestions where appropriate.

Referee: [Abstract and §4 (Experiments)] The central claims (up to 15% accuracy improvement under non-IID conditions, approximately 28% client CPU reduction, and up to 44× communication savings) are presented without reported statistical tests, confidence intervals, or variance across random seeds and non-IID partitions. This leaves the superiority over FedDF and FedDistill weakly supported.

Authors: We agree that the absence of statistical tests and variance reporting weakens the strength of the empirical claims. In the revised version, we have rerun the experiments using 5 independent random seeds per non-IID partition setting. We now report mean accuracy, CPU usage, and communication cost together with standard deviations. We have also added paired t-test p-values comparing FedKDNAS against FedDF and FedDistill, showing that the reported gains remain statistically significant (p < 0.05) in the majority of evaluated settings. These changes appear in the abstract and Section 4.

Revision: yes

Referee: [§3 (Proposed Method)] The distillation targets are formed by aggregating client logits on a public reference set. No description is given of how the reference set is constructed to remain representative across heterogeneous client distributions or to avoid selection bias; if the set skews toward any subpopulation, the smoothed targets become miscalibrated, directly undermining both the accuracy and efficiency gains relative to baselines that also rely on distillation.

Authors: This is a valid concern. The original manuscript did not provide sufficient detail on reference-set construction. In the revision we have added an explicit description in Section 3: the reference set is a fixed, randomly sampled collection of 2,000 examples drawn from a publicly available held-out dataset that is completely disjoint from all client training data. We further include a short sensitivity study demonstrating that performance is stable across different random draws of the reference set, thereby reducing the risk of subpopulation skew and miscalibration.

Revision: yes
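
The reference-set construction described in the second response can be sketched as a seeded draw from the public pool after excluding every example held by a client. This follows the simulated rebuttal's description only; the function name, the use of integer example identifiers, and the default seed are assumptions.

```python
import random

def sample_reference_set(public_ids, client_id_lists, size=2000, seed=0):
    """Draw a fixed reference set that shares no example with any client."""
    used = set()
    for ids in client_id_lists:
        used.update(ids)
    # Sort so that the draw depends only on the seed, not on set ordering.
    pool = sorted(set(public_ids) - used)
    if len(pool) < size:
        raise ValueError("not enough public examples disjoint from client data")
    return random.Random(seed).sample(pool, size)
```

Repeating the draw with different seeds and rerunning training on each is the shape of the sensitivity study the response refers to. Disjointness alone does not show that the set is representative of every client's distribution.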

Circularity Check

0 steps flagged

Empirical framework with no circular derivation or self-referential claims

The paper presents FedKDNAS as an empirical FL framework that combines client-side NAS for lightweight models with server-side aggregation of logits on a public reference set for distillation. All reported gains (accuracy, CPU, communication) are obtained from experiments across six datasets and six baselines; no equations, closed-form derivations, or fitted parameters are shown that reduce these outcomes to quantities defined within the same paper. No self-citations are invoked as load-bearing uniqueness theorems, and the central mechanism is externally falsifiable by reproducing the described protocol. The analysis is therefore self-contained with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework rests on standard federated learning assumptions plus the availability of a public reference dataset and the feasibility of local architecture search; no explicit free parameters, axioms, or invented entities are stated in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: Each client autonomously selects a lightweight model under accuracy-resource constraints. It then trains it locally using a hybrid objective combining supervised learning and knowledge distillation and shares only predictions on a public reference set. The server then aggregates and smooths these predictions...

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · relation: unclear

Paper passage: The server aggregates these predictions, fuses them with teacher guidance, and broadcasts a smoothed distillation target to all clients for the next round.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 4 internal anchors

[1]

Communication-efficient learning of deep networks from decentralized data,

B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” inProc. Int. Conf. Artificial Intelligence and Statistics (AISTATS), 2017, pp. 1273-1282

work page 2017
[2]

Brendan McMahan

Kairouz, Peter, and H. Brendan McMahan. ”Advances and open problems in federated learning.” Foundations and trends in machine learning 14.1-2 (2021): 1-210

work page 2021
[3]

Compressing deep neural networks: A survey,

Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “Compressing deep neural networks: A survey,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 10, pp. 2434-2453, 2018

work page 2018
[4]

Z. Liu, B. Wu, W. Luo, X. Yang, and W. Liu, ‘”Zero-shot quantization of deep neural networks,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021

work page 2021
[5]

Quantization and training of neural networks for efficient integer-arithmetic-only inference,

B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” inProc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2704-2713

work page 2018
[6]

Heuristic structured pruning for deep neural networks: A survey,

Y . Tian, K. Zhang, and X. Li, “Heuristic structured pruning for deep neural networks: A survey,”ACM Computing Surveys, 2024

work page 2024
[7]

Model compression and acceleration for deep neural networks: The principles, progress, and challenges,

L. Deng, G. Li, and S. Han, “Model compression and acceleration for deep neural networks: The principles, progress, and challenges,”IEEE Signal Processing Magazine, vol. 37, no. 4, pp. 126-136, 2020

work page 2020
[8]

A comprehensive survey on model compression for deep learning,

T.-H. Le, M.-T. Nguyen, and Q.-H. Pham, “A comprehensive survey on model compression for deep learning,”IEEE Access, 2024

work page 2024
[9]

Distilling the Knowledge in a Neural Network

G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,”arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[10]

Knowledge distillation: A survey,

J. Gou, B. Yu, S. Maybank, and D. Tao, “Knowledge distillation: A survey,”International Journal of Computer Vision, vol. 129, no. 6, pp. 1789-1819, 2021. 20

work page 2021
[11]

Neural architecture search: A survey,

S. Smithson and A. Jones, “Neural architecture search: A survey,”ACM Computing Surveys, 2016

work page 2016
[12]

DARTS: Differentiable architecture search,

H. Liu, K. Simonyan, and Y . Yang, “DARTS: Differentiable architecture search,” inProc. Int. Conf. Learning Representations (ICLR), 2019

work page 2019
[13]

Sattler, S

F. Sattler, S. Wiedemann, K.-R. M ¨uller, and W. Samek, Robust and communication-efficient federated learning from non-IID data,IEEE Transactions on Neural Networks and Learning Systems, 31(9), 2019

work page 2019
[14]

Y . Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V . Chandra, Federated learning with non-IID data, InNeurIPS Workshop on Machine Learning on the Phone and other Consumer Devices, 2018

work page 2018
[15]

Federated Optimization: Distributed Machine Learning for On-Device Intelligence

J. Koneˇcn´y, H. B. McMahan, D. Ramage, and P. Richt ´arik, Federated optimization: Distributed machine learning for on-device intelligence, arXiv preprint arXiv:1610.02527, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[16]

Federated optimization in heterogeneous networks,

T. Li, A. K. Sahu, A. Talwalkar, and V . Smith, “Federated optimization in heterogeneous networks,” inProc. MLSys, 2020

work page 2020
[17]

Ditto: Fair and robust federated learning through personalization,

T. Li, S. Hu, A. Beirami, and V . Smith, “Ditto: Fair and robust federated learning through personalization,” inProc. Int. Conf. Machine Learning (ICML), 2021

work page 2021
[18]

Ensemble distillation for robust model fusion in federated learning,

T. Lin, L. Kong, S. Stich, and M. Jaggi, “Ensemble distillation for robust model fusion in federated learning,” inAdvances in Neural Information Processing Systems (NeurIPS), 2020

work page 2020
[19]

Communication- efficient on-device machine learning: Federated distillation and augmentation under non-iid private data

E. Jeong, S. Oh, J. Kim, M. Park, and M. Bennis, “Communication- efficient on-device machine learning: Federated distillation and augmen- tation,”arXiv preprint arXiv:1811.11479, 2018

work page arXiv 2018
[20]

M. Tan, B. Chen, R. Pang, V . Vasudevan, M. Sandler, A. Howard, and Q. Le, MnasNet: Platform-aware neural architecture search for mobile, InProceedings of CVPR, 2019

work page 2019
[21]

B. Wu, X. Dai, P. Zhang, Y . Wang, F. Sun, Y . Wu, Y . Tian, P. Vajda, Y . Jia, and K. Keutzer, FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search, InProceedings of CVPR, 2019

work page 2019
[22]

Tan and Q

M. Tan and Q. Le, EfficientNet: Rethinking model scaling for convolu- tional neural networks, InProceedings of ICML, 2019

work page 2019
[23]

”Federated learning: Challenges, methods, and future directions.” IEEE signal processing magazine 37.3 (2020): 50-60

Li, Tian, et al. ”Federated learning: Challenges, methods, and future directions.” IEEE signal processing magazine 37.3 (2020): 50-60

work page 2020
[24]

Federated optimization in heterogeneous networks,

T. Li, A. K. Sahu, M. Zaheer, A. Talwalkar, and V . Smith, “Federated optimization in heterogeneous networks,” inProc. MLSys, 2020

work page 2020
[25]

SCAFFOLD: Stochastic controlled averaging for federated learning,

S. P. Karimireddy, S. Kale, M. Mohan, S. K. R. Sanjabi, and P. Jain, “SCAFFOLD: Stochastic controlled averaging for federated learning,” in Proc. Int. Conf. Machine Learning (ICML), 2020

work page 2020
[26]

C. T. Dinh, N. Tran, and T. D. Nguyen, Personalized federated learning with Moreau envelopes, InProceedings of NeurIPS, 2020

work page 2020
[27]

Tackling the objective inconsistency problem in heterogeneous federated optimization,

J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V . Poor, “Tackling the objective inconsistency problem in heterogeneous federated optimization,” inAdvances in Neural Information Processing Systems (NeurIPS), 2020

work page 2020
[28]

Adaptive federated optimization,

S. J. Reddi, Z. Charles, M. Zamir, and S. Sra, “Adaptive federated optimization,” inProc. Int. Conf. Learning Representations (ICLR), 2021

work page 2021
[29]

FedMD: Heterogeneous federated learning via model distillation,

D. Li and J. Wang, “FedMD: Heterogeneous federated learning via model distillation,” inAdvances in Neural Information Processing Systems (NeurIPS), 2019

work page 2019
[30]

FedAKD: Federated adaptive knowledge distillation,

M. Shahrezaei, M. S. Kouchaki, and H. R. Tizhoosh, “FedAKD: Federated adaptive knowledge distillation,” inProc. IEEE Int. Conf. Big Data, 2022

work page 2022
[31]

Federated learning with knowledge distillation: A survey,

Q. Li, Z. Wen, and B. He, “Federated learning with knowledge distillation: A survey,”ACM Computing Surveys, vol. 55, no. 5, pp. 1-36, 2023

work page 2023
[32]

Knowledge Distillation: A Good Teacher Is Patient and Consistent,

M. Beyer, S. Oudah, M. Zhmoginov, A. Oliver, and A. Kolesnikov, “Knowledge Distillation: A Good Teacher Is Patient and Consistent,” in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2022

work page 2022
[33]

The State of Knowledge Distillation for Classification,

F. Ruffy and C. Chollet, “The State of Knowledge Distillation for Classification,”arXiv preprint arXiv:1912.11381, 2019

work page arXiv 1912
[34]

What Knowledge Gets Distilled in Knowledge Distillation?

U. Ojha, Y . Li, A. Hodjat, M. Brown, and Y . Li, “What Knowledge Gets Distilled in Knowledge Distillation?” inAdvances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023
[35]

Large Scale Distributed Neural Network Training through Online Distillation,

R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, and G. E. Hinton, “Large Scale Distributed Neural Network Training through Online Distillation,”arXiv preprint arXiv:1804.03235, 2018

work page arXiv 2018
[36]

Cronus: Robust and Heterogeneous Collaborative Learning with Black-Box Knowledge Transfer,

H. Chang, V . Shejwalkar, R. Shokri, and A. Houmansadr, “Cronus: Robust and Heterogeneous Collaborative Learning with Black-Box Knowledge Transfer,”arXiv preprint arXiv:1912.11279, 2019

work page arXiv 1912
[37]

Distillation-Based Semi-Supervised Federated Learning for Communication-Efficient Collaborative Training with Non-IID Private Data,

S. Itahara, T. Nishio, Y . Koda, M. Morikura, and K. Ya- mamoto, “Distillation-Based Semi-Supervised Federated Learning for Communication-Efficient Collaborative Training with Non-IID Private Data,”IEEE Transactions on Mobile Computing, 2021

work page 2021
[38]

FedAUX: Leveraging Unlabeled Auxiliary Data in Federated Learning,

F. Sattler, T. Korjakow, R. Rischke, and W. Samek, “FedAUX: Leveraging Unlabeled Auxiliary Data in Federated Learning,”IEEE Transactions on Neural Networks and Learning Systems, 2021

work page 2021
[39]

CFD: Communication- Efficient Federated Distillation via Soft-Label Quantization and Delta Coding,

F. Sattler, A. Marban, R. Rischke, and W. Samek, “CFD: Communication- Efficient Federated Distillation via Soft-Label Quantization and Delta Coding,”IEEE Transactions on Network Science and Engineering, 2021

work page 2021
[40]

Data-Free Knowledge Distillation for Heterogeneous Federated Learning,

Z. Zhu, J. Hong, and J. Zhou, “Data-Free Knowledge Distillation for Heterogeneous Federated Learning,” inProc. International Conference on Machine Learning (ICML), PMLR, 2021

work page 2021
[41]

Fine-tuning Global Model via Data-Free Knowledge Distillation for Non-IID Federated Learning,

L. Zhang, L. Shen, L. Ding, D. Tao, and L.-Y . Duan, “Fine-tuning Global Model via Data-Free Knowledge Distillation for Non-IID Federated Learning,” inProc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2022

work page 2022
[42]

FedZKT: Zero-Shot Knowledge Transfer towards Resource-Constrained Federated Learning with Heterogeneous On-Device Models,

L. Zhang, D. Wu, and X. Yuan, “FedZKT: Zero-Shot Knowledge Transfer towards Resource-Constrained Federated Learning with Heterogeneous On-Device Models,”arXiv preprint arXiv:2109.03775, 2021

work page arXiv 2021
[43]

DaFKD: Domain- Aware Federated Knowledge Distillation,

H. Wang, Y . Li, W. Xu, R. Li, Y . Zhan, and Z. Zeng, “DaFKD: Domain- Aware Federated Knowledge Distillation,” inProc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023
[44]

FedNAS: Federated deep learning via neural architecture search,

C. He, M. Annavaram, and S. Avestimehr, “FedNAS: Federated deep learning via neural architecture search,”arXiv preprint(2020)

work page 2020
[45]

E. Mushtaq, C. He, J. Ding, and S. Avestimehr, “SPIDER: Searching personalized neural architecture for federated learning,” in Proc. AAAI Workshop on Federated Learning, 2022.

[46]

S. Yu, T. Nguyen, et al., “Resource-aware heterogeneous federated learning using neural architecture search (RaFL),” arXiv preprint arXiv:2211.05716, 2022.

[47]

Y. Zhang, H. Xia, S. Xu, X. Wang, and L. Xu, “AdaptFL: Adaptive federated learning framework for heterogeneous devices,” Future Generation Computer Systems, vol. 165, Art. 107610, 2025.

[48]

S. Cheng, J. Wu, Y. Xiao, Y. Liu, and Y. Liu, “FedGEMS: Federated Learning of Larger Server Models via Selective Knowledge Fusion,” in Proc. International Conference on Learning Representations (ICLR), 2022.

[49]

C. Song et al., “Feddistill: Global model distillation for local model de-biasing in non-iid federated learning,” arXiv preprint arXiv:2404.09210, 2024.

[50]

C. Liu et al., “Fedet: a communication-efficient federated class-incremental learning framework based on enhanced transformer,” arXiv preprint arXiv:2306.15347, 2023.

[51]

R. Yan, “One-Shot Neural Architecture Search with Network Similarity Directed Initialization for Pathological Image Classification,” arXiv preprint arXiv:2506.14176, 2025.

[52]

L. Hu et al., “MHAT: An efficient model-heterogenous aggregation training scheme for federated learning,” Information Sciences, vol. 560, pp. 493–503, 2021.

[53]

X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of FedAvg on non-IID data,” in International Conference on Learning Representations (ICLR), 2020.

[54]

S. Ghadimi and G. Lan, “Stochastic first- and zeroth-order methods for nonconvex stochastic programming,” SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.

[55]

E. Strubell, A. Ganesh, and A. McCallum, “Energy and Policy Considerations for Deep Learning in NLP,” in Proc. 57th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 3645–3650, 2019.

[56]

D. Patterson, J. Gonzalez, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean, “Carbon Emissions and Large Neural Network Training,” arXiv preprint arXiv:2104.10350, 2021.

[57]

R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni, “Green AI,” Communications of the ACM, vol. 63, no. 12, pp. 54–63, 2020.

[58]

G. P. Perrucci, F. H. P. Fitzek, and J. Widmer, “Survey on Energy Consumption Entities on the Smartphone Platform,” in Proc. IEEE 73rd Vehicular Technology Conference (VTC Spring), pp. 1–6, 2011.

[59]

A. Carroll and G. Heiser, “An Analysis of Power Consumption in a Smartphone,” in Proc. USENIX Annual Technical Conference (ATC), p. 21, 2010.

[60]

V. Lannelongue, J. Grealey, and M. Inouye, “CodeCarbon: Estimate and Track Carbon Emissions from Machine Learning Computing,” arXiv preprint arXiv:2002.05651, 2023.

[61]

A. Noureddine and R. Rouvoy, “PowerJoular and JoularJX: Multi-Platform Software Power Monitoring Tools,” in Proc. 36th Int. Conf. on Advanced Information Networking and Applications (AINA), pp. 97–109, 2022.

[62]

R. Shokri et al., “Membership inference attacks against machine learning models,” in 2017 IEEE Symposium on Security and Privacy (SP), IEEE, 2017.

[63]

C. Medjadji et al., “FedSparQ: Adaptive Sparse Quantization with Error Feedback for Robust & Efficient Federated Learning,” in 2025 3rd International Conference on Federated Learning Technologies and Applications (FLTA), IEEE, 2025.

[64]

Cook and B. Thomas, “Human Activity Recognition from Continuous Ambient Sensor Data,” UCI Machine Learning Repository, 2012. DOI: https://doi.org/10.24432/C5D60P

[65]

G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, “EMNIST: Extending MNIST to handwritten letters,” in 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926, 2017.

[66]

A. Krizhevsky, G. Hinton, et al., “Learning multiple layers of features from tiny images,” Toronto, ON, Canada, 2009.

[67]

L. Deng, “The MNIST database of handwritten digit images for machine learning research [Best of the Web],” IEEE Signal Processing Magazine, vol. 29, pp. 141–142, 2012.

[68]

H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.

APPENDIX

Subtracting consecutive iterates of the EMA recurrence (13) gives

\begin{align}
\tilde{Z}^{(r)} - \tilde{Z}^{(r-1)} &= \gamma \tilde{Z}^{(r-1)} + (1-\gamma) Z^{(r)} - \tilde{Z}^{(r-1)} \tag{30} \\
&= (1-\gamma)\bigl(Z^{(r)} - \tilde{Z}^{(r-1)}\bigr), \tag{31}
\end{align}

establishing the first equality. To obta...
