Communication-Efficient Federated Fine-Tuning

Antonios Deligiannakis; Michael Theologitis; Vasilis Samoladas

arxiv: 2505.04535 · v3 · pith:DSFXIEGDnew · submitted 2025-05-07 · 💻 cs.LG · cs.DC

Communication-Efficient Federated Fine-Tuning

Michael Theologitis , Vasilis Samoladas , Antonios Deligiannakis This is my paper

Pith reviewed 2026-05-22 16:05 UTC · model grok-4.3

classification 💻 cs.LG cs.DC

keywords federated learningfine-tuninglanguage modelscommunication efficiencyadaptive algorithmsNLP tasksFedOpt

0 comments

The pith

FDA-Opt generalizes dynamic and fixed communication to improve federated LM fine-tuning without extra tuning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the FDA-Opt family of algorithms as a unified generalization that combines dynamic monitoring of training progress with the FedOpt approach. This replaces rigid fixed-interval communication with adaptive decisions while keeping compatibility with standard optimizers. A sympathetic reader would care because fine-tuning large language models in federated settings is limited by the high cost of exchanging parameters across private data sources. If the central claim holds, then existing federated systems could switch to this method for better performance on NLP tasks with no added configuration or calibration steps.

Core claim

FDA-Opt is a unified generalization of both FDA and FedOpt that eliminates the need for hard-to-calibrate parameters and rigid synchronization schemes, and experiments demonstrate that it outperforms FedOpt even when the latter uses hyper-parameters specifically optimized for it on downstream NLP tasks.

What carries the argument

The FDA-Opt family of algorithms, which dynamically monitors training progress to set communication intervals while integrating with FedOpt-style optimizers.

Load-bearing premise

The assumption that dynamic monitoring of training progress can be performed reliably and with negligible overhead across diverse downstream NLP tasks and model sizes without introducing new synchronization issues or calibration needs.

What would settle it

Experiments on a new set of NLP tasks or larger models where FDA-Opt no longer outperforms hyper-parameter-optimized FedOpt or where the monitoring step adds measurable overhead.

Figures

Figures reproduced from arXiv: 2505.04535 by Antonios Deligiannakis, Michael Theologitis, Vasilis Samoladas.

**Figure 1.** Figure 1: A Federated Learning Round The widely adopted answer to this question is the FEDOPT family of algorithms [11], which prescribe ending rounds after a fixed number of local steps or epochs. In this approach, each client processes its dataset one or more times before sending model updates back to the server. However, this method has two significant limitations. First, it assumes the training process is stati… view at source ↗

**Figure 2.** Figure 2: Mechanics of local training, in round t, under FDA to be configured using well-established settings from prior work. 5) Alleviate Synchronization Bottleneck: We relax the original rigid local-state synchronization requirement by allowing larger intervals between variance approximations. A. Optimizers In the original FDA algorithm, the server-side aggregation relied on a simple averaging scheme. However, … view at source ↗

**Figure 3.** Figure 3: Variance progression during the first 100 rounds of training with FEDOPT using RoBERTa on the MRPC dataset variance magnitudes (please note that the y-axis is logarithmic) and behaviors. Hence, variance trends should only be interpreted within the context of the same training run. Additionally, the magnitude of the variance may evolve even for the same optimizer by 1-3 orders of magnitude, depending on wh… view at source ↗

**Figure 4.** Figure 4: Training Loss progression of training RoBERTa with [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 6.** Figure 6: The highest testing accuracy of RoBERTa on the [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

read the original abstract

Federated Learning (FL) enables the utilization of vast, previously inaccessible data sources. At the same time, pre-trained Language Models (LMs) have taken the world by storm and for good reason. They exhibit remarkable emergent abilities and are readily adapted to downstream tasks. This opens one of the most exciting frontiers in FL: fine-tuning LMs. Yet, a persistent challenge in FL is the frequent, rigid communication of parameters -- a problem magnified by the sheer size of these contemporary models. The FedOpt family of algorithms has become the go-to approach for FL, relying on fixed but arbitrary intervals for model exchanges. Recently, the FDA algorithm prescribed a dynamic approach by monitoring the training progress. However, it introduced a hard-to-calibrate parameter and imposed a rigid synchronization scheme. In this work, we address these limitations by proposing the FDA-Opt family of algorithms -- a unified generalization of both FDA and FedOpt. Our experimental evaluation focuses on fine-tuning LMs on downstream NLP tasks and demonstrates that FDA-Opt outperforms FedOpt even when it is configured with hyper-parameters specifically optimized for the latter. In other words, we show that FDA-Opt is a practical, drop-in replacement for FedOpt in modern FL libraries and systems: it requires no additional configuration and delivers superior performance out of the box.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes the FDA-Opt family of algorithms as a unified generalization of the dynamic FDA method and the FedOpt family for communication-efficient federated fine-tuning of pre-trained language models. The central empirical claim is that FDA-Opt outperforms FedOpt on downstream NLP tasks even when FedOpt hyperparameters (primarily the fixed communication interval) are specifically optimized for each task and model, positioning FDA-Opt as a practical drop-in replacement requiring no additional user configuration.

Significance. If the reported gains prove robust under thorough baseline tuning and statistical controls, the work could offer a pragmatic improvement for federated fine-tuning of large models by combining dynamic progress monitoring with standard optimizers while avoiding FDA's calibration issues. The emphasis on practical deployment in modern FL libraries is a positive aspect.

major comments (1)

[Experimental Evaluation] Experimental Evaluation section: The hyperparameter optimization procedure for the FedOpt baselines is under-specified. No details are given on the search space or grid for the communication interval, the number of trials, the validation procedure, or whether the reported numbers reflect the best attainable FedOpt performance per task/model. This information is necessary to support the headline claim that FDA-Opt outperforms an optimized FedOpt and serves as a no-configuration replacement.

minor comments (2)

[Abstract] Abstract: The phrase 'dynamic monitoring of training progress' could be expanded with a brief parenthetical on how the new approach avoids the original FDA's hard-to-calibrate parameter.
[Algorithms and Notation] Notation and algorithms: Ensure all parameters (e.g., any thresholds or monitoring intervals in FDA-Opt) are explicitly defined before use in pseudocode or equations.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment point by point below and outline the revisions we will make to improve clarity.

read point-by-point responses

Referee: [Experimental Evaluation] Experimental Evaluation section: The hyperparameter optimization procedure for the FedOpt baselines is under-specified. No details are given on the search space or grid for the communication interval, the number of trials, the validation procedure, or whether the reported numbers reflect the best attainable FedOpt performance per task/model. This information is necessary to support the headline claim that FDA-Opt outperforms an optimized FedOpt and serves as a no-configuration replacement.

Authors: We agree that the hyperparameter optimization procedure for the FedOpt baselines was described with insufficient detail in the submitted manuscript, which weakens the support for our central claim. In the revised version, we will expand the Experimental Evaluation section to fully specify the search space and grid used for the communication interval, the number of trials performed, the validation procedure for selecting configurations, and confirmation that the reported FedOpt results represent the best attainable performance per task and model under this tuning. These additions will directly address the concern and more rigorously substantiate that FDA-Opt outperforms even optimized FedOpt while requiring no additional configuration. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical outperformance claim is independent of inputs

full rationale

The paper proposes FDA-Opt as a generalization of FDA and FedOpt and supports its central claim through experimental comparisons on downstream NLP tasks. The reported superiority is presented as an empirical outcome from training runs rather than any derivation, equation, or fitted parameter that reduces to the inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the abstract or described structure; the evaluation relies on external benchmark results that remain falsifiable outside the paper's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical observation that dynamic communication can be made practical without extra tuning; no explicit free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5765 in / 1130 out tokens · 26169 ms · 2026-05-22T16:05:42.061884+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 1 internal anchor

[1]

Fedrola: Robust federated learning against model poisoning via layer-based aggregation,

G. Yan, H. Wang, X. Yuan, and J. Li, “Fedrola: Robust federated learning against model poisoning via layer-based aggregation,” inProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ser. KDD ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 3667–3678

work page 2024
[3]

A survey for federated learning evaluations: Goals and measures,

D. Chai, L. Wang, L. Yang, J. Zhang, K. Chen, and Q. Yang, “A survey for federated learning evaluations: Goals and measures,”IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 10, pp. 5007–5024, 2024

work page 2024
[4]

A survey on federated learning systems: Vision, hype and reality for data privacy and protection,

Q. Li, Z. Wen, Z. Wu, S. Hu, N. Wang, Y . Li, X. Liu, and B. He, “A survey on federated learning systems: Vision, hype and reality for data privacy and protection,”IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 4, pp. 3347–3366, 2023

work page 2023
[5]

Communication-Efficient Learning of Deep Networks from Decen- tralized Data,

B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication-Efficient Learning of Deep Networks from Decen- tralized Data,” inProceedings of the 20th International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, A. Singh and J. Zhu, Eds., vol. 54. PMLR, 20–22 Apr 2017, pp. 1273–1282

work page 2017
[6]

Are large lan- guage models really good logical reasoners? a comprehensive evaluation and beyond,

F. Xu, Q. Lin, J. Han, T. Zhao, J. Liu, and E. Cambria, “Are large lan- guage models really good logical reasoners? a comprehensive evaluation and beyond,”IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 4, pp. 1620–1634, 2025

work page 2025
[7]

A survey of knowledge enhanced pre-trained language models,

L. Hu, Z. Liu, Z. Zhao, L. Hou, L. Nie, and J. Li, “A survey of knowledge enhanced pre-trained language models,”IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 4, pp. 1413–1430, 2024

work page 2024
[8]

Federated fine-tuning of llms on the very edge: The good, the bad, the ugly,

H. Woisetschl ¨ager, A. Erben, S. Wang, R. Mayer, and H.-A. Jacobsen, “Federated fine-tuning of llms on the very edge: The good, the bad, the ugly,” inProceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning, ser. DEEM ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 39–50

work page 2024
[9]

Fedbiot: Llm local fine- tuning in federated learning without full model,

F. Wu, Z. Li, Y . Li, B. Ding, and J. Gao, “Fedbiot: Llm local fine- tuning in federated learning without full model,” inProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ser. KDD ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 3345–3355

work page 2024
[10]

A survey on efficient federated learning methods for foundation model training,

H. Woisetschl ¨ager, A. Erben, S. Wang, R. Mayer, and H.-A. Jacobsen, “A survey on efficient federated learning methods for foundation model training,” inProceedings of the Thirty-Third International Joint Confer- ence on Artificial Intelligence, IJCAI-24, K. Larson, Ed. International Joint Conferences on Artificial Intelligence Organization, 8 2024, pp. ...

work page 2024
[11]

Adaptive federated optimization,

S. J. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Kone ˇcn´y, S. Kumar, and H. B. McMahan, “Adaptive federated optimization,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021

work page 2021
[12]

Parallel restarted sgd with faster convergence and less communication: demystifying why model aver- aging works for deep learning,

H. Yu, S. Yang, and S. Zhu, “Parallel restarted sgd with faster convergence and less communication: demystifying why model aver- aging works for deep learning,” inProceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innova- tive Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educationa...

work page 2019
[13]

Communication-efficient distributed deep learning via federated dynamic averaging,

M. Theologitis, G. Frangias, G. Anestis, V . Samoladas, and A. Deligian- nakis, “Communication-efficient distributed deep learning via federated dynamic averaging,” inProceedings 28th International Conference on Extending Database Technology, EDBT 2025, Barcelona, Spain, March 25 - March 28. OpenProceedings.org, 2025, p. 411–424

work page 2025
[14]

Adam: A method for stochastic optimization,

D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” inInternational Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015

work page 2015
[15]

Measuring the effects of non-identical data distribution for federated visual classification,

T. H. Hsu, H. Qi, and M. Brown, “Measuring the effects of non-identical data distribution for federated visual classification,”CoRR, 2019

work page 2019
[16]

On the importance of initialization and momentum in deep learning,

I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” inProceedings of the 30th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, S. Dasgupta and D. McAllester, Eds., vol. 28, no. 3. Atlanta, Georgia, USA: PMLR, 17–19 Jun 2013, pp. 1139–1147

work page 2013
[17]

Decoupled weight decay regularization,

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” inInternational Conference on Learning Representations, 2019

work page 2019
[18]

Adaptive subgradient methods for online learning and stochastic optimization,

J. Duchi, E. Hazan, and Y . Singer, “Adaptive subgradient methods for online learning and stochastic optimization,”Journal of Machine Learning Research, vol. 12, no. 61, pp. 2121–2159, 2011

work page 2011
[19]

Sketching streams through the net: Distributed approximate query tracking,

G. Cormode and M. Garofalakis, “Sketching streams through the net: Distributed approximate query tracking,” inProceedings of the 31st International Conference on Very Large Data Bases, ser. VLDB ’05. VLDB Endowment, 2005, p. 13–24

work page 2005
[20]

Transformers: State-of-the-art natural language processing,

T. Wolf, L. Debut, V . Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y . Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush, “Transformers: State-of-the-art natural language processing,” inProceedings of the 2020 Conference on Empirical Method...

work page 2020
[21]

Paszke, S

A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. K ¨opf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala,PyTorch: an imperative style, high- performance deep learning library. Red Hook, NY , USA: Curran Associates Inc., 2019

work page 2019
[22]

GLUE: A multi-task benchmark and analysis platform for natural language understanding,

A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” inProceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, T. Linzen, G. Chrupała, and A. Alishahi, Eds. Brussels, Belgium: Association for Computational Lin...

work page 2018
[23]

Automatically constructing a corpus of sentential paraphrases,

W. B. Dolan and C. Brockett, “Automatically constructing a corpus of sentential paraphrases,” inProceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005

work page 2005
[24]

Recursive deep models for semantic compositionality over a sentiment treebank,

R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” inProceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, D. Yarowsky, T. Baldwin, A. Korhonen, K. Livescu, and S. Bethard, Eds. Seattle, Washington, USA: Associatio...

work page 2013
[25]

A broad-coverage challenge corpus for sentence understanding through inference,

A. Williams, N. Nangia, and S. Bowman, “A broad-coverage challenge corpus for sentence understanding through inference,” inProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, 2018, pp. 1112–1122. [O...

work page 2018
[26]

Fed- PETuning: When federated learning meets the parameter-efficient tuning methods of pre-trained language models,

Z. Zhang, Y . Yang, Y . Dai, Q. Wang, Y . Yu, L. Qu, and Z. Xu, “Fed- PETuning: When federated learning meets the parameter-efficient tuning methods of pre-trained language models,” inFindings of the Association for Computational Linguistics: ACL 2023, A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada: Association for Computational Linguisti...

work page 2023
[27]

On data distribution leakage in cross-silo federated learning,

Y . Jiang, X. Luo, Y . Wu, X. Zhu, X. Xiao, and B. C. Ooi, “On data distribution leakage in cross-silo federated learning,”IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 7, pp. 3312–3328, 2024

work page 2024
[28]

Federated learning on non-iid data silos: An experimental study,

Q. Li, Y . Diao, Q. Chen, and B. He, “Federated learning on non-iid data silos: An experimental study,” in38th IEEE International Conference on Data Engineering, ICDE 2022, Kuala Lumpur, Malaysia, May 9-12,

work page 2022
[29]

IEEE, 2022, pp. 965–978

work page 2022
[30]

FedNLP: Benchmarking federated learning methods for natural language processing tasks,

B. Y . Lin, C. He, Z. Ze, H. Wang, Y . Hua, C. Dupuy, R. Gupta, M. Soltanolkotabi, X. Ren, and S. Avestimehr, “FedNLP: Benchmarking federated learning methods for natural language processing tasks,” in Findings of the Association for Computational Linguistics: NAACL 2022, M. Carpuat, M.-C. de Marneffe, and I. V . Meza Ruiz, Eds. Seattle, United States: As...

work page 2022
[31]

Fedapen: Personalized cross- silo federated learning with adaptability to statistical heterogeneity,

Z. Qin, S. Deng, M. Zhao, and X. Yan, “Fedapen: Personalized cross- silo federated learning with adaptability to statistical heterogeneity,” inProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ser. KDD ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 1954–1964

work page 2023
[32]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Y . Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V . Stoyanov, “Roberta: A robustly optimized BERT pretraining approach,”CoRR, vol. abs/1907.11692, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907
[33]

Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding shar- ing,

P. He, J. Gao, and W. Chen, “Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding shar- ing,” 2023. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 12

work page 2023
[34]

A field guide to federated optimization,

J. Wang, Z. Charles, Z. Xu, G. Joshi, H. B. McMahan, B. A. y Arcas, M. Al-Shedivat, G. Andrew, S. Avestimehr, K. Daly, D. Data, S. N. Diggavi, H. Eichner, A. Gadhikar, Z. Garrett, A. M. Girgis, F. Hanzely, A. Hard, C. He, S. Horv ´ath, Z. Huo, A. Ingerman, M. Jaggi, T. Javidi, P. Kairouz, S. Kale, S. P. Karimireddy, J. Kone ˇcn´y, S. Koyejo, T. Li, L. Liu...

work page arXiv 2021
[35]

Adaptive federated learning in resource constrained edge com- puting systems,

S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge com- puting systems,”IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1205–1221, 2019

work page 2019
[36]

Faster federated learning with decaying number of local sgd steps,

J. Mills, J. Hu, and G. Min, “Faster federated learning with decaying number of local sgd steps,”IEEE Trans. Parallel Distrib. Syst., vol. 34, no. 7, p. 2198–2207, jul 2023

work page 2023
[37]

Adaptive communication strategies to achieve the best error-runtime trade-off in local-update sgd,

J. Wang and G. Joshi, “Adaptive communication strategies to achieve the best error-runtime trade-off in local-update sgd,”Systems and Machine Learning (SysML) Conference, 2018

work page 2018
[38]

Local sgd with periodic averaging: Tighter analysis and adaptive synchronization,

F. Haddadpour, M. M. Kamani, M. Mahdavi, and V . R. Cadambe, “Local sgd with periodic averaging: Tighter analysis and adaptive synchronization,” inNeural Information Processing Systems, 2019

work page 2019
[39]

SCAFFOLD: Stochastic controlled averaging for federated learning,

S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “SCAFFOLD: Stochastic controlled averaging for federated learning,” inProceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119. PMLR, 13–18 Jul 2020, pp. 5132–5143

work page 2020
[40]

Datagossip: A data exchange extension for distributed machine learning algorithms,

P. Wenig and T. Papenbrock, “Datagossip: A data exchange extension for distributed machine learning algorithms,” inProceedings of the 25th International Conference on Extending Database Technology, EDBT 2022, Edinburgh, UK, March 29 - April 1, 2022, J. Stoyanovich, J. Teubner, P. Guagliardo, M. Nikolic, A. Pieris, J. M ¨uhlig, F. ¨Ozcan, S. Schelter, H. V...

work page 2022
[41]

Mime: Mimicking centralized stochastic algorithms in federated learning,

S. P. Karimireddy, M. Jaggi, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, “Mime: Mimicking centralized stochastic algorithms in federated learning,” 2021

work page 2021
[42]

Federated optimization in heterogeneous networks,

T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V . Smith, “Federated optimization in heterogeneous networks,” inProceedings of the Third Conference on Machine Learning and Systems, MLSys 2020, Austin, TX, USA, March 2-4, 2020, I. S. Dhillon, D. S. Papailiopoulos, and V . Sze, Eds. mlsys.org, 2020

work page 2020
[43]

Federated learning based on dynamic regularization,

D. A. E. Acar, Y . Zhao, R. Matas, M. Mattina, P. Whatmough, and V . Saligrama, “Federated learning based on dynamic regularization,” in International Conference on Learning Representations, 2021

work page 2021
[44]

Prefix-tuning: Optimizing continuous prompts for generation,

X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” inProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li, and R. Navigli, Eds. Online: Association for Computationa...

work page 2021
[45]

The power of scale for parameter-efficient prompt tuning,

B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” inProceedings of the 2021 Con- ference on Empirical Methods in Natural Language Processing, M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, Eds. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 3045–3059

work page 2021
[46]

LoRA: Low-rank adaptation of large language models,

E. J. Hu, yelong shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” inInternational Conference on Learning Representations, 2022

work page 2022
[47]

BitFit: Simple parameter- efficient fine-tuning for transformer-based masked language-models,

E. Ben Zaken, Y . Goldberg, and S. Ravfogel, “BitFit: Simple parameter- efficient fine-tuning for transformer-based masked language-models,” in Proceedings of the 60th Annual Meeting of the Association for Compu- tational Linguistics (Volume 2: Short Papers), S. Muresan, P. Nakov, and A. Villavicencio, Eds. Dublin, Ireland: Association for Computational L...

work page 2022
[48]

Improving loRA in privacy-preserving federated learning,

Y . Sun, Z. Li, Y . Li, and B. Ding, “Improving loRA in privacy-preserving federated learning,” inThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[49]

SLoRA: Federated parameter efficient fine- tuning of language models,

S. Babakniya, A. Elkordy, Y . Ezzeldin, Q. Liu, K.-B. Song, M. EL- Khamy, and S. Avestimehr, “SLoRA: Federated parameter efficient fine- tuning of language models,” inInternational Workshop on Federated Learning in the Age of Foundation Models in Conjunction with NeurIPS 2023, 2023

work page 2023
[50]

Composable sparse fine- tuning for cross-lingual transfer,

A. Ansell, E. Ponti, A. Korhonen, and I. Vuli ´c, “Composable sparse fine- tuning for cross-lingual transfer,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, and A. Villavicencio, Eds. Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 1778–1796

work page 2022
[51]

Uveqfed: Universal vector quantization for federated learning,

N. Shlezinger, M. Chen, Y . Eldar, H. V . Poor, and S. Cui, “Uveqfed: Universal vector quantization for federated learning,”IEEE Transactions on Signal Processing, vol. 69, pp. 500–514, 01 2021

work page 2021
[52]

Sparse communication for distributed gradient descent,

A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” inProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, M. Palmer, R. Hwa, and S. Riedel, Eds. Copenhagen, Denmark: Association for Computational Linguistics, Sep. 2017, pp. 440–445

work page 2017
[53]

D. Basu, D. Data, C. Karakus, and S. Diggavi,Qsparse-local-SGD: dis- tributed SGD with quantization, sparsification, and local computations. Red Hook, NY , USA: Curran Associates Inc., 2019

work page 2019
[54]

Communication compression techniques in distributed deep learning: A survey,

Z. Wang, M. Wen, Y . Xu, Y . Zhou, J. H. Wang, and L. Zhang, “Communication compression techniques in distributed deep learning: A survey,”Journal of Systems Architecture, vol. 142, p. 102927, 2023

work page 2023

[1] [1]

Fedrola: Robust federated learning against model poisoning via layer-based aggregation,

G. Yan, H. Wang, X. Yuan, and J. Li, “Fedrola: Robust federated learning against model poisoning via layer-based aggregation,” inProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ser. KDD ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 3667–3678

work page 2024

[2] [3]

A survey for federated learning evaluations: Goals and measures,

D. Chai, L. Wang, L. Yang, J. Zhang, K. Chen, and Q. Yang, “A survey for federated learning evaluations: Goals and measures,”IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 10, pp. 5007–5024, 2024

work page 2024

[3] [4]

A survey on federated learning systems: Vision, hype and reality for data privacy and protection,

Q. Li, Z. Wen, Z. Wu, S. Hu, N. Wang, Y . Li, X. Liu, and B. He, “A survey on federated learning systems: Vision, hype and reality for data privacy and protection,”IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 4, pp. 3347–3366, 2023

work page 2023

[4] [5]

Communication-Efficient Learning of Deep Networks from Decen- tralized Data,

B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication-Efficient Learning of Deep Networks from Decen- tralized Data,” inProceedings of the 20th International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, A. Singh and J. Zhu, Eds., vol. 54. PMLR, 20–22 Apr 2017, pp. 1273–1282

work page 2017

[5] [6]

Are large lan- guage models really good logical reasoners? a comprehensive evaluation and beyond,

F. Xu, Q. Lin, J. Han, T. Zhao, J. Liu, and E. Cambria, “Are large lan- guage models really good logical reasoners? a comprehensive evaluation and beyond,”IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 4, pp. 1620–1634, 2025

work page 2025

[6] [7]

A survey of knowledge enhanced pre-trained language models,

L. Hu, Z. Liu, Z. Zhao, L. Hou, L. Nie, and J. Li, “A survey of knowledge enhanced pre-trained language models,”IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 4, pp. 1413–1430, 2024

work page 2024

[7] [8]

Federated fine-tuning of llms on the very edge: The good, the bad, the ugly,

H. Woisetschl ¨ager, A. Erben, S. Wang, R. Mayer, and H.-A. Jacobsen, “Federated fine-tuning of llms on the very edge: The good, the bad, the ugly,” inProceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning, ser. DEEM ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 39–50

work page 2024

[8] [9]

Fedbiot: Llm local fine- tuning in federated learning without full model,

F. Wu, Z. Li, Y . Li, B. Ding, and J. Gao, “Fedbiot: Llm local fine- tuning in federated learning without full model,” inProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ser. KDD ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 3345–3355

work page 2024

[9] [10]

A survey on efficient federated learning methods for foundation model training,

H. Woisetschl ¨ager, A. Erben, S. Wang, R. Mayer, and H.-A. Jacobsen, “A survey on efficient federated learning methods for foundation model training,” inProceedings of the Thirty-Third International Joint Confer- ence on Artificial Intelligence, IJCAI-24, K. Larson, Ed. International Joint Conferences on Artificial Intelligence Organization, 8 2024, pp. ...

work page 2024

[10] [11]

Adaptive federated optimization,

S. J. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Kone ˇcn´y, S. Kumar, and H. B. McMahan, “Adaptive federated optimization,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021

work page 2021

[11] [12]

Parallel restarted sgd with faster convergence and less communication: demystifying why model aver- aging works for deep learning,

H. Yu, S. Yang, and S. Zhu, “Parallel restarted sgd with faster convergence and less communication: demystifying why model aver- aging works for deep learning,” inProceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innova- tive Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educationa...

work page 2019

[12] [13]

Communication-efficient distributed deep learning via federated dynamic averaging,

M. Theologitis, G. Frangias, G. Anestis, V . Samoladas, and A. Deligian- nakis, “Communication-efficient distributed deep learning via federated dynamic averaging,” inProceedings 28th International Conference on Extending Database Technology, EDBT 2025, Barcelona, Spain, March 25 - March 28. OpenProceedings.org, 2025, p. 411–424

work page 2025

[13] [14]

Adam: A method for stochastic optimization,

D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” inInternational Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015

work page 2015

[14] [15]

Measuring the effects of non-identical data distribution for federated visual classification,

T. H. Hsu, H. Qi, and M. Brown, “Measuring the effects of non-identical data distribution for federated visual classification,”CoRR, 2019

work page 2019

[15] [16]

On the importance of initialization and momentum in deep learning,

I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” inProceedings of the 30th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, S. Dasgupta and D. McAllester, Eds., vol. 28, no. 3. Atlanta, Georgia, USA: PMLR, 17–19 Jun 2013, pp. 1139–1147

work page 2013

[16] [17]

Decoupled weight decay regularization,

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” inInternational Conference on Learning Representations, 2019

work page 2019

[17] [18]

Adaptive subgradient methods for online learning and stochastic optimization,

J. Duchi, E. Hazan, and Y . Singer, “Adaptive subgradient methods for online learning and stochastic optimization,”Journal of Machine Learning Research, vol. 12, no. 61, pp. 2121–2159, 2011

work page 2011

[18] [19]

Sketching streams through the net: Distributed approximate query tracking,

G. Cormode and M. Garofalakis, “Sketching streams through the net: Distributed approximate query tracking,” inProceedings of the 31st International Conference on Very Large Data Bases, ser. VLDB ’05. VLDB Endowment, 2005, p. 13–24

work page 2005

[19] [20]

Transformers: State-of-the-art natural language processing,

T. Wolf, L. Debut, V . Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y . Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush, “Transformers: State-of-the-art natural language processing,” inProceedings of the 2020 Conference on Empirical Method...

work page 2020

[20] [21]

Paszke, S

A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. K ¨opf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala,PyTorch: an imperative style, high- performance deep learning library. Red Hook, NY , USA: Curran Associates Inc., 2019

work page 2019

[21] [22]

GLUE: A multi-task benchmark and analysis platform for natural language understanding,

A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” inProceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, T. Linzen, G. Chrupała, and A. Alishahi, Eds. Brussels, Belgium: Association for Computational Lin...

work page 2018

[22] [23]

Automatically constructing a corpus of sentential paraphrases,

W. B. Dolan and C. Brockett, “Automatically constructing a corpus of sentential paraphrases,” inProceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005

work page 2005

[23] [24]

Recursive deep models for semantic compositionality over a sentiment treebank,

R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” inProceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, D. Yarowsky, T. Baldwin, A. Korhonen, K. Livescu, and S. Bethard, Eds. Seattle, Washington, USA: Associatio...

work page 2013

[24] [25]

A broad-coverage challenge corpus for sentence understanding through inference,

A. Williams, N. Nangia, and S. Bowman, “A broad-coverage challenge corpus for sentence understanding through inference,” inProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, 2018, pp. 1112–1122. [O...

work page 2018

[25] [26]

Fed- PETuning: When federated learning meets the parameter-efficient tuning methods of pre-trained language models,

Z. Zhang, Y . Yang, Y . Dai, Q. Wang, Y . Yu, L. Qu, and Z. Xu, “Fed- PETuning: When federated learning meets the parameter-efficient tuning methods of pre-trained language models,” inFindings of the Association for Computational Linguistics: ACL 2023, A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada: Association for Computational Linguisti...

work page 2023

[26] [27]

On data distribution leakage in cross-silo federated learning,

Y . Jiang, X. Luo, Y . Wu, X. Zhu, X. Xiao, and B. C. Ooi, “On data distribution leakage in cross-silo federated learning,”IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 7, pp. 3312–3328, 2024

work page 2024

[27] [28]

Federated learning on non-iid data silos: An experimental study,

Q. Li, Y . Diao, Q. Chen, and B. He, “Federated learning on non-iid data silos: An experimental study,” in38th IEEE International Conference on Data Engineering, ICDE 2022, Kuala Lumpur, Malaysia, May 9-12,

work page 2022

[28] [29]

IEEE, 2022, pp. 965–978

work page 2022

[29] [30]

FedNLP: Benchmarking federated learning methods for natural language processing tasks,

B. Y . Lin, C. He, Z. Ze, H. Wang, Y . Hua, C. Dupuy, R. Gupta, M. Soltanolkotabi, X. Ren, and S. Avestimehr, “FedNLP: Benchmarking federated learning methods for natural language processing tasks,” in Findings of the Association for Computational Linguistics: NAACL 2022, M. Carpuat, M.-C. de Marneffe, and I. V . Meza Ruiz, Eds. Seattle, United States: As...

work page 2022

[30] [31]

Fedapen: Personalized cross- silo federated learning with adaptability to statistical heterogeneity,

Z. Qin, S. Deng, M. Zhao, and X. Yan, “Fedapen: Personalized cross- silo federated learning with adaptability to statistical heterogeneity,” inProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ser. KDD ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 1954–1964

work page 2023

[31] [32]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Y . Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V . Stoyanov, “Roberta: A robustly optimized BERT pretraining approach,”CoRR, vol. abs/1907.11692, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907

[32] [33]

Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding shar- ing,

P. He, J. Gao, and W. Chen, “Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding shar- ing,” 2023. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 12

work page 2023

[33] [34]

A field guide to federated optimization,

J. Wang, Z. Charles, Z. Xu, G. Joshi, H. B. McMahan, B. A. y Arcas, M. Al-Shedivat, G. Andrew, S. Avestimehr, K. Daly, D. Data, S. N. Diggavi, H. Eichner, A. Gadhikar, Z. Garrett, A. M. Girgis, F. Hanzely, A. Hard, C. He, S. Horv ´ath, Z. Huo, A. Ingerman, M. Jaggi, T. Javidi, P. Kairouz, S. Kale, S. P. Karimireddy, J. Kone ˇcn´y, S. Koyejo, T. Li, L. Liu...

work page arXiv 2021

[34] [35]

Adaptive federated learning in resource constrained edge com- puting systems,

S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge com- puting systems,”IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1205–1221, 2019

work page 2019

[35] [36]

Faster federated learning with decaying number of local sgd steps,

J. Mills, J. Hu, and G. Min, “Faster federated learning with decaying number of local sgd steps,”IEEE Trans. Parallel Distrib. Syst., vol. 34, no. 7, p. 2198–2207, jul 2023

work page 2023

[36] [37]

Adaptive communication strategies to achieve the best error-runtime trade-off in local-update sgd,

J. Wang and G. Joshi, “Adaptive communication strategies to achieve the best error-runtime trade-off in local-update sgd,”Systems and Machine Learning (SysML) Conference, 2018

work page 2018

[37] [38]

Local sgd with periodic averaging: Tighter analysis and adaptive synchronization,

F. Haddadpour, M. M. Kamani, M. Mahdavi, and V . R. Cadambe, “Local sgd with periodic averaging: Tighter analysis and adaptive synchronization,” inNeural Information Processing Systems, 2019

work page 2019

[38] [39]

SCAFFOLD: Stochastic controlled averaging for federated learning,

S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “SCAFFOLD: Stochastic controlled averaging for federated learning,” inProceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119. PMLR, 13–18 Jul 2020, pp. 5132–5143

work page 2020

[39] [40]

Datagossip: A data exchange extension for distributed machine learning algorithms,

P. Wenig and T. Papenbrock, “Datagossip: A data exchange extension for distributed machine learning algorithms,” inProceedings of the 25th International Conference on Extending Database Technology, EDBT 2022, Edinburgh, UK, March 29 - April 1, 2022, J. Stoyanovich, J. Teubner, P. Guagliardo, M. Nikolic, A. Pieris, J. M ¨uhlig, F. ¨Ozcan, S. Schelter, H. V...

work page 2022

[40] [41]

Mime: Mimicking centralized stochastic algorithms in federated learning,

S. P. Karimireddy, M. Jaggi, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, “Mime: Mimicking centralized stochastic algorithms in federated learning,” 2021

work page 2021

[41] [42]

Federated optimization in heterogeneous networks,

T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V . Smith, “Federated optimization in heterogeneous networks,” inProceedings of the Third Conference on Machine Learning and Systems, MLSys 2020, Austin, TX, USA, March 2-4, 2020, I. S. Dhillon, D. S. Papailiopoulos, and V . Sze, Eds. mlsys.org, 2020

work page 2020

[42] [43]

Federated learning based on dynamic regularization,

D. A. E. Acar, Y . Zhao, R. Matas, M. Mattina, P. Whatmough, and V . Saligrama, “Federated learning based on dynamic regularization,” in International Conference on Learning Representations, 2021

work page 2021

[43] [44]

Prefix-tuning: Optimizing continuous prompts for generation,

X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” inProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li, and R. Navigli, Eds. Online: Association for Computationa...

work page 2021

[44] [45]

The power of scale for parameter-efficient prompt tuning,

B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” inProceedings of the 2021 Con- ference on Empirical Methods in Natural Language Processing, M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, Eds. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 3045–3059

work page 2021

[45] [46]

LoRA: Low-rank adaptation of large language models,

E. J. Hu, yelong shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” inInternational Conference on Learning Representations, 2022

work page 2022

[46] [47]

BitFit: Simple parameter- efficient fine-tuning for transformer-based masked language-models,

E. Ben Zaken, Y . Goldberg, and S. Ravfogel, “BitFit: Simple parameter- efficient fine-tuning for transformer-based masked language-models,” in Proceedings of the 60th Annual Meeting of the Association for Compu- tational Linguistics (Volume 2: Short Papers), S. Muresan, P. Nakov, and A. Villavicencio, Eds. Dublin, Ireland: Association for Computational L...

work page 2022

[47] [48]

Improving loRA in privacy-preserving federated learning,

Y . Sun, Z. Li, Y . Li, and B. Ding, “Improving loRA in privacy-preserving federated learning,” inThe Twelfth International Conference on Learning Representations, 2024

work page 2024

[48] [49]

SLoRA: Federated parameter efficient fine- tuning of language models,

S. Babakniya, A. Elkordy, Y . Ezzeldin, Q. Liu, K.-B. Song, M. EL- Khamy, and S. Avestimehr, “SLoRA: Federated parameter efficient fine- tuning of language models,” inInternational Workshop on Federated Learning in the Age of Foundation Models in Conjunction with NeurIPS 2023, 2023

work page 2023

[49] [50]

Composable sparse fine- tuning for cross-lingual transfer,

A. Ansell, E. Ponti, A. Korhonen, and I. Vuli ´c, “Composable sparse fine- tuning for cross-lingual transfer,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, and A. Villavicencio, Eds. Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 1778–1796

work page 2022

[50] [51]

Uveqfed: Universal vector quantization for federated learning,

N. Shlezinger, M. Chen, Y . Eldar, H. V . Poor, and S. Cui, “Uveqfed: Universal vector quantization for federated learning,”IEEE Transactions on Signal Processing, vol. 69, pp. 500–514, 01 2021

work page 2021

[51] [52]

Sparse communication for distributed gradient descent,

A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” inProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, M. Palmer, R. Hwa, and S. Riedel, Eds. Copenhagen, Denmark: Association for Computational Linguistics, Sep. 2017, pp. 440–445

work page 2017

[52] [53]

D. Basu, D. Data, C. Karakus, and S. Diggavi,Qsparse-local-SGD: dis- tributed SGD with quantization, sparsification, and local computations. Red Hook, NY , USA: Curran Associates Inc., 2019

work page 2019

[53] [54]

Communication compression techniques in distributed deep learning: A survey,

Z. Wang, M. Wen, Y . Xu, Y . Zhou, J. H. Wang, and L. Zhang, “Communication compression techniques in distributed deep learning: A survey,”Journal of Systems Architecture, vol. 142, p. 102927, 2023

work page 2023