Recognition: no theorem link
FedBCGD: Communication-Efficient Accelerated Block Coordinate Gradient Descent for Federated Learning
Pith reviewed 2026-05-15 15:54 UTC · model grok-4.3
The pith
Splitting model parameters into N blocks and having each client upload only one block per round cuts federated communication complexity to 1/N of the full-model cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By dividing model parameters into N blocks and having each client communicate updates for only a single block per round, the FedBCGD method and its accelerated FedBCGD+ version achieve communication complexities that are a factor of 1/N lower than those of existing federated learning algorithms while enjoying faster convergence.
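A back-of-the-envelope count behind this claim, under simplifying assumptions that are ours rather than the paper's (all $N$ blocks have equal size $d/N$, every client uploads exactly one block per round, and shared-block overhead is ignored): with $R$ rounds, $K$ participating clients per round, and model dimension $d$,
\[
C_{\text{block}} = R K \frac{d}{N}
\qquad \text{versus} \qquad
C_{\text{full}} = R K d,
\qquad \text{so} \qquad
\frac{C_{\text{block}}}{C_{\text{full}}} = \frac{1}{N}.
\]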
What carries the argument
The block partitioning of model parameters with selective single-block uploads per client per round, enabling reduced per-round communication in the gradient descent process.
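A minimal sketch of that mechanism in Python/NumPy. The helper names (`split_blocks`, `local_sgd_step`, `fedbcgd_round`), the uniform random block assignment, the single local step, and the toy quadratic objectives are illustrative assumptions; the paper's actual algorithm (shared-block handling, multiple local steps, and the drift control and variance reduction in FedBCGD+) is more involved.

```python
import numpy as np

def split_blocks(params, n_blocks):
    """Partition a flat parameter vector into n_blocks contiguous blocks (illustrative choice)."""
    return np.array_split(params, n_blocks)

def local_sgd_step(block, grad_fn, lr=0.1):
    """One local gradient step on a single block; grad_fn returns that block's gradient."""
    return block - lr * grad_fn(block)

def fedbcgd_round(global_params, client_grad_fns, n_blocks, rng):
    """One round: every client updates and uploads only its assigned block (the core idea above)."""
    blocks = split_blocks(global_params, n_blocks)
    # Assign each client one block uniformly at random (our assumption, not the paper's schedule).
    assignments = rng.integers(0, n_blocks, size=len(client_grad_fns))
    uploads = {j: [] for j in range(n_blocks)}
    for k, grad_fn in enumerate(client_grad_fns):
        j = int(assignments[k])
        updated = local_sgd_step(blocks[j].copy(), lambda b: grad_fn(b, j))
        uploads[j].append(updated)  # only block j leaves client k: d/N numbers instead of d
    # The server averages whatever it received per block; blocks nobody touched stay as they were.
    new_blocks = [np.mean(uploads[j], axis=0) if uploads[j] else blocks[j] for j in range(n_blocks)]
    return np.concatenate(new_blocks)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n_clients, n_blocks = 12, 4, 3
    params = rng.normal(size=d)
    # Toy per-client objectives: 0.5 * ||block - target_block||^2, so the block gradient is (block - target_block).
    targets = [rng.normal(size=d) for _ in range(n_clients)]
    grads = [lambda b, j, t=t, nb=n_blocks: b - np.array_split(t, nb)[j] for t in targets]
    for _ in range(20):
        params = fedbcgd_round(params, grads, n_blocks, rng)
    print("final params:", np.round(params, 3))
```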
If this is right
- Total communication volume across all rounds scales down proportionally with the chosen number of blocks N.
- The same bandwidth budget supports either more frequent rounds or training of larger models.
- The acceleration components in FedBCGD+ further tighten the convergence bound beyond the basic block-coordinate version.
- Empirical training curves on large models show clear superiority over prior federated baselines under identical communication limits.
Where Pith is reading between the lines
- The shared block could serve as a natural place to anchor global statistics or low-rank adapters that benefit all clients.
- Similar block-wise upload logic might transfer to other bandwidth-limited distributed settings such as edge inference or multi-agent reinforcement learning.
- Dynamic re-partitioning of blocks during training could adapt to changing data heterogeneity across clients without retraining the entire analysis.
Load-bearing premise
That splitting the parameters into blocks preserves enough of the original optimization landscape and block independence for the stated convergence rates to hold when applied to tightly coupled deep network weights.
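For reference, the block-wise smoothness condition that analyses of this kind typically assume; this is the standard form from the block coordinate descent literature, not a quotation of the paper's assumptions. With $U_j$ the embedding matrix that places a block-$j$ perturbation $h$ into the full parameter vector,
\[
\bigl\|\nabla_{(j)} f(x + U_j h) - \nabla_{(j)} f(x)\bigr\| \le L_j \|h\| \qquad \text{for all } x,\ h,\ j = 1,\dots,N .
\]
The premise above amounts to saying that the block constants $L_j$, and the cross-block coupling this condition ignores, stay benign when the blocks are slices of a tightly coupled deep network.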
What would settle it
An experiment on a Vision Transformer where measured total communication volume fails to drop by the predicted factor of 1/N or where the observed convergence rate does not exceed that of full-model baselines due to inter-block dependencies.
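A sketch of how the communication-volume half of such an experiment could be measured, counting uploaded parameters per round for a layer-wise partition. The layer sizes and the greedy balancing heuristic are hypothetical, not taken from the paper or its released code.

```python
import numpy as np

# Hypothetical per-layer parameter counts for a ViT-like model (illustrative only).
LAYER_SIZES = [4 * 768 * 768] * 12 + [2 * 768 * 3072] * 12 + [768 * 1000]

def layerwise_partition(layer_sizes, n_blocks):
    """Greedy balanced assignment of whole layers to n_blocks; returns parameters per block."""
    loads = [0] * n_blocks
    for size in sorted(layer_sizes, reverse=True):
        loads[int(np.argmin(loads))] += size
    return loads

if __name__ == "__main__":
    d = sum(LAYER_SIZES)
    for N in range(2, 7):
        loads = layerwise_partition(LAYER_SIZES, N)
        expected = np.mean(loads) / d  # uniform block choice: expected per-client upload fraction
        worst = max(loads) / d         # largest block: worst-case per-client upload fraction
        print(f"N={N}: expected upload {expected:.3f}*d (theory 1/N = {1/N:.3f}), worst block {worst:.3f}*d")
```

If the measured per-round traffic deviates from the expected fraction by more than the shared-block overhead, or if convergence per unit of communication falls behind full-model baselines, the core claim is in trouble.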
Original abstract
Although Federated Learning has been widely studied in recent years, there are still high overhead expenses in each communication round for large-scale models such as Vision Transformer. To lower the communication complexity, we propose a novel Federated Block Coordinate Gradient Descent (FedBCGD) method for communication efficiency. The proposed method splits model parameters into several blocks, including a shared block and enables uploading a specific parameter block by each client, which can significantly reduce communication overhead. Moreover, we also develop an accelerated FedBCGD algorithm (called FedBCGD+) with client drift control and stochastic variance reduction. To the best of our knowledge, this paper is the first work on parameter block communication for training large-scale deep models. We also provide the convergence analysis for the proposed algorithms. Our theoretical results show that the communication complexities of our algorithms are a factor $1/N$ lower than those of existing methods, where $N$ is the number of parameter blocks, and they enjoy much faster convergence than their counterparts. Empirical results indicate the superiority of the proposed algorithms compared to state-of-the-art algorithms. The code is available at https://github.com/junkangLiu0/FedBCGD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FedBCGD, a federated block coordinate gradient descent algorithm that partitions model parameters into N blocks (including a shared block), so that each client uploads only one block per round. An accelerated variant FedBCGD+ adds client-drift control and stochastic variance reduction. Convergence analysis is supplied claiming that both algorithms achieve communication complexity lower by a factor of 1/N relative to existing federated methods while converging faster; empirical results on large-scale models (e.g., Vision Transformers) are reported to outperform state-of-the-art baselines, with code released.
Significance. If the 1/N communication-complexity claim survives under realistic non-convex assumptions, the work would directly address the dominant bottleneck in federated training of large models. The open-source implementation and reported empirical gains are concrete strengths that would make the contribution immediately usable.
major comments (3)
- [§4] §4 (convergence analysis): the headline claim that communication complexity drops by exactly 1/N requires the per-round progress rate to improve linearly with N. The analysis invokes block-wise smoothness and gradient independence that are unlikely to hold when blocks correspond to tightly coupled layers (attention weights and feed-forward blocks) in a Vision Transformer; the shared-block coordination cost is not bounded, so the 1/N factor is not guaranteed to survive for N>2.
- [§4] Main convergence theorem (presumably Theorem 1 or 2 in §4): the stated faster convergence of FedBCGD+ relies on stochastic variance reduction compensating for client drift, yet the interaction between the 1/N block-selection probability and the variance-reduction term is not quantified. Without an explicit bound showing that the effective Lipschitz constant scales at most as O(1), the acceleration claim cannot be verified for heterogeneous data.
- [§5] Experimental section (likely §5): the reported superiority is shown for a fixed block partitioning, but no ablation varies the number of blocks N or the assignment of layers to blocks. If the landscape-preservation assumption fails for common ViT partitions, the practical communication savings will fall short of the theoretical 1/N factor.
minor comments (2)
- [Abstract / Related Work] The abstract asserts novelty as 'the first work on parameter block communication for large-scale deep models,' yet the related-work section lacks a direct comparison table against prior block-coordinate or partial-update FL methods; adding one would clarify the precise advance.
- [§3] Notation for the shared block and its update schedule is introduced without an explicit equation; a short display equation defining its participation probability would remove ambiguity for readers implementing the algorithm (one illustrative form is sketched after this list).
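One illustrative form such a display equation could take, based on this review's reading of the scheme (uniform block selection, shared block synchronized every round); it is an assumption for illustration, not the paper's definition:
\[
\Pr\bigl[\text{client } k \text{ uploads block } j \text{ in round } r\bigr] = \frac{1}{N} \quad (j \text{ non-shared}),
\qquad
\Pr\bigl[\text{shared block uploaded in round } r\bigr] = 1 .
\]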
Simulated Author's Rebuttal
We thank the referee for the detailed and insightful comments on our manuscript. We address each major comment below and have revised the manuscript to clarify the analysis, add supporting lemmas, and include new experimental ablations where appropriate.
Point-by-point responses
-
Referee: [§4] §4 (convergence analysis): the headline claim that communication complexity drops by exactly 1/N requires the per-round progress rate to improve linearly with N. The analysis invokes block-wise smoothness and gradient independence that are unlikely to hold when blocks correspond to tightly coupled layers (attention weights and feed-forward blocks) in a Vision Transformer; the shared-block coordination cost is not bounded, so the 1/N factor is not guaranteed to survive for N>2.
Authors: We thank the referee for this observation. Our convergence analysis relies on the standard block coordinate descent framework, under which we assume the objective satisfies block-wise smoothness with respect to the chosen parameter partitioning. In the Vision Transformer experiments, blocks are selected to align with individual layers or parameter groups that reduce inter-block coupling, allowing the gradient independence assumption to hold to a sufficient degree for the derived rates. For the shared block, its coordination is performed once per round but its communication volume is fixed and amortized; we acknowledge that the exact 1/N scaling may degrade for very large N. We have added a clarifying remark and a short discussion of this limitation for N>2 in the revised Section 4. revision: partial
-
Referee: [§4] Main convergence theorem (presumably Theorem 1 or 2 in §4): the stated faster convergence of FedBCGD+ relies on stochastic variance reduction compensating for client drift, yet the interaction between the 1/N block-selection probability and the variance-reduction term is not quantified. Without an explicit bound showing that the effective Lipschitz constant scales at most as O(1), the acceleration claim cannot be verified for heterogeneous data.
Authors: We appreciate the referee highlighting the need for a more explicit quantification. In the proof of the FedBCGD+ convergence theorem, the variance-reduction term is applied independently per selected block and the 1/N block-selection probability enters the expectation over the stochastic gradients. We have inserted a new supporting lemma (Lemma 3) that bounds the effective smoothness constant by a quantity independent of N (specifically O(1) under the standard bounded-variance assumption on the local stochastic gradients). This lemma directly establishes that the acceleration factor is preserved under data heterogeneity. The revised theorem statement, the new lemma, and an updated proof sketch appear in Section 4. revision: yes
-
Referee: [§5] Experimental section (likely §5): the reported superiority is shown for a fixed block partitioning, but no ablation varies the number of blocks N or the assignment of layers to blocks. If the landscape-preservation assumption fails for common ViT partitions, the practical communication savings will fall short of the theoretical 1/N factor.
Authors: We agree that systematic ablations on N and layer-to-block assignments would strengthen the empirical claims. We have added a new set of experiments on the Vision Transformer model that vary N from 2 to 6 and compare two representative partitioning strategies (layer-wise versus grouping attention and feed-forward blocks together). The results confirm that measured communication savings track the theoretical 1/N factor closely for N ≤ 4, with modest degradation at larger N attributable to shared-block overhead. These additional figures and tables have been incorporated into the revised Section 5. revision: yes
Circularity Check
No circularity: convergence claims rest on independent analysis
Full rationale
The paper introduces FedBCGD and FedBCGD+ as novel block-coordinate methods, splits parameters into N blocks (one shared), and derives communication complexity O(1/N) lower than baselines directly from its own convergence theorems (presumably §4). No self-citation chain, no fitted parameter renamed as prediction, no ansatz smuggled via prior work, and no self-definitional loop appear in the abstract or described derivation. The 1/N factor follows from per-round block updates and variance-reduction terms under stated assumptions; those assumptions are external to the final complexity expression rather than being defined in terms of it. The result is therefore not equivalent to its inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- N (number of blocks)
axioms (1)
- domain assumption: Lipschitz smoothness and bounded gradient variance, standard in non-convex optimization analysis (written out below)
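Written out in their standard form (the generic statements, not a quotation of the paper's Assumptions 1–2): for each local objective $f_i$,
\[
\|\nabla f_i(x) - \nabla f_i(y)\| \le \beta \|x - y\| \quad \forall x, y,
\qquad
\mathbb{E}_{\zeta}\bigl\|\nabla f_i(x;\zeta) - \nabla f_i(x)\bigr\|^2 \le \sigma^2 \quad \forall x,
\]
with $\mu$-strong convexity added in the strongly convex cases of the theorems.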
Reference graph
Works this paper leans on
-
[1]
Alham Fikri Aji and Kenneth Heafield. 2017. Sparse Communication for Distributed Gradient Descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. https://doi.org/10.18653/v1/d17-1045
-
[2]
Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. 2017. QSGD: Communication-efficient SGD via gradient quantization and encoding. Advances in Neural Information Processing Systems 30 (2017)
-
[3]
Rodolfo Stoffel Antunes, Cristiano André da Costa, Arne Küderle, Imrana Abdullahi Yari, and Björn Eskofier. 2022. Federated learning for healthcare: Systematic review and architecture proposal. ACM Transactions on Intelligent Systems and Technology (TIST) 13, 4 (2022), 1–23
-
[4]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901
-
[5]
Sébastien Bubeck et al. 2015. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning 8, 3-4 (2015), 231–357
-
[6]
David Byrd and Antigoni Polychroniadou. 2020. Differentially private secure multi-party computation for federated learning in financial applications. In Proceedings of the First ACM International Conference on AI in Finance. 1–9
-
[7]
Ashok Cutkosky and Francesco Orabona. 2019. Momentum-Based Variance Reduction in Non-Convex SGD. In Advances in Neural Information Processing Systems 32 (NeurIPS). 15210–15219
-
[8]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
-
[9]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
-
[10]
Liang Gao, Huazhu Fu, Li Li, Yingwen Chen, Ming Xu, and Cheng-Zhong Xu. 2022. FedDC: Federated learning with non-IID data via local drift decoupling and correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10112–10121
-
[12]
Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. 2021. Local SGD: Unified theory and new efficient methods. In International Conference on Artificial Intelligence and Statistics. PMLR, 3556–3564
-
[13]
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)
- [14]
-
[15]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778
- [16]
-
[17]
Rie Johnson and Tong Zhang. 2013. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems 26 (2013)
- [18]
-
[19]
Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. 2020. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning. PMLR, 5132–5143
-
[20]
Prashant Khanduri, Pranay Sharma, Haibo Yang, Mingyi Hong, Jia Liu, Ketan Rajawat, and Pramod Varshney. 2021. STEM: A stochastic two-sided momentum algorithm achieving near-optimal sample and communication complexities for federated learning. Advances in Neural Information Processing Systems 34 (2021), 6050–6061
-
[21]
Jakub Konečnỳ, H Brendan McMahan, Daniel Ramage, and Peter Richtárik. 2016. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527 (2016)
-
[22]
Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. (2009)
-
[23]
Ya Le and Xuan Yang. 2015. Tiny ImageNet visual recognition challenge. CS 231N 7, 7 (2015), 3
-
[24]
Yann LeCun et al. 2015. LeNet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet 20, 5 (2015), 14
-
[25]
B Li et al. [n. d.]. On the effectiveness of partial variance reduction in federated learning with heterogeneous data. CVPR ([n. d.])
-
[26]
Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. 2020. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems 2 (2020), 429–450
-
[27]
Junkang Liu, Yuanyuan Liu, Fanhua Shang, Hongying Liu, Jin Liu, and Wei Feng. Improving Generalization in Federated Learning with Highly Heterogeneous Data via Momentum-Based Stochastic Controlled Weight Averaging. In Forty-second International Conference on Machine Learning
- [29]
-
[30]
Junkang Liu, Fanhua Shang, Yuanyuan Liu, Hongying Liu, Yuangang Li, and YunXiang Gong. 2024. FedBCGD: Communication-efficient accelerated block coordinate gradient descent for federated learning. In Proceedings of the 32nd ACM International Conference on Multimedia. 2955–2963
-
[31]
Junkang Liu, Fanhua Shang, Yuxuan Tian, Hongying Liu, and Yuanyuan Liu. Consistency of local and global flatness for federated learning. In Proceedings of the 33rd ACM International Conference on Multimedia. 3875–3883
- [33]
-
[34]
Junkang Liu, Fanhua Shang, Kewen Zhu, Hongying Liu, Yuanyuan Liu, and Jin Liu. 2025. FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models. arXiv preprint arXiv:2510.27486 (2025)
- [35]
- [36]
-
[37]
Yang Liu, Xinwei Zhang, Yan Kang, Liping Li, Tianjian Chen, Mingyi Hong, and Qiang Yang. 2022. FedBCD: A communication-efficient collaborative learning framework for distributed features. IEEE Transactions on Signal Processing 70 (2022), 4277–4290
-
[38]
Mi Luo, Fei Chen, Dapeng Hu, Yifan Zhang, Jian Liang, and Jiashi Feng. 2021. No fear of heterogeneity: Classifier calibration for federated learning with non-IID data. Advances in Neural Information Processing Systems 34 (2021), 5972–5984
-
[39]
Dhruv Mahajan, S Sathiya Keerthi, and S Sundararajan. 2017. A distributed block coordinate descent method for training l1-regularized linear classifiers. The Journal of Machine Learning Research 18, 1 (2017), 3167–3201
-
[40]
Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics. PMLR, 1273–1282
-
[41]
Scott Menard. 2002. Applied logistic regression analysis. Number 106. Sage
-
[42]
Aritra Mitra, Rayana Jaafar, George J Pappas, and Hamed Hassani. 2021. Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients. Advances in Neural Information Processing Systems 34 (2021), 14606–14619
- [43]
-
[44]
Kumar Kshitij Patel, Lingxiao Wang, Blake E Woodworth, Brian Bullins, and Nati Srebro. 2022. Towards optimal communication complexity in distributed non-convex optimization. Advances in Neural Information Processing Systems 35 (2022), 13316–13328
- [45]
-
[46]
Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ali Jadbabaie, and Ramtin Pedarsani. 2020. FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization. In International Conference on Artificial Intelligence and Statistics. PMLR, 2021–2031
-
[47]
Peter Richtárik and Martin Takáč. 2016. Distributed coordinate descent method for learning with big data. The Journal of Machine Learning Research 17, 1 (2016), 2657–2681
-
[48]
Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger R Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N Galtier, Bennett A Landman, Klaus Maier-Hein, et al. 2020. The future of digital health with federated learning. NPJ Digital Medicine 3, 1 (2020), 119
-
[49]
Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. 2019. Robust and communication-efficient federated learning from non-IID data. IEEE Transactions on Neural Networks and Learning Systems 31, 9 (2019), 3400–3413
-
[51]
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
-
[52]
Stephen J Wright. 2015. Coordinate descent algorithms. Mathematical Programming 151, 1 (2015), 3–34
-
[53]
Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10, 2 (2019), 1–19
-
[54]
Jinshan Zeng, Tim Tsz-Kit Lau, Shaobo Lin, and Yuan Yao. 2019. Global convergence of block coordinate descent in deep learning. In International Conference on Machine Learning. PMLR, 7313–7323
Notation and main theorems
- $\boldsymbol{x}^r$ is the global model at communication round $r$.
- $\boldsymbol{x}^r_{(j)}$ is the $j$-th block of $\boldsymbol{x}^r$, so that $\boldsymbol{x}^r = [\boldsymbol{x}^{r\top}_{(1)}, \ldots, \boldsymbol{x}^{r\top}_{(N)}]^\top$. Note that $\boldsymbol{x}^r_{(j)}$ is a virtual vector: it is realized at hub $j$ every $T$ iterations, but its evolution is studied at every iteration.
- $\boldsymbol{x}^r_{k,j} \in \mathbb{R}^d$ are the local versions of the coordinates of the weight vector $\boldsymbol{x}^r_{(j)}$ that each client $k$ of hub $j$ updates.
- $\boldsymbol{x}^\star$ is the minimizer of the function $f(\boldsymbol{x})$.
- $\boldsymbol{x}_{k,j,(j)}$ is the $j$-th block of $\boldsymbol{x}_{k,j}$ at client $k$ in silo $j$, so that $\boldsymbol{x}_{(j)} = \frac{1}{K}\sum_{k=1}^{K} \boldsymbol{x}_{k,j,(j)}$.
- $\boldsymbol{y}^{r,t}_{k,j}$ is the local parameter vector held by client $k$ in silo $j$ at local iteration $t$.
- $\nabla_{(j)} f_{k,j}(\boldsymbol{y}_{k,j}; \zeta)$ is the partial derivative of $f(\boldsymbol{x})$ with respect to coordinate block $j$, computed at client $k$ in silo $j$ using the coordinates and rows at client $k$ corresponding to minibatch $\zeta$.
- $\boldsymbol{G}^r = [\boldsymbol{G}^{r\top}_{(1)}, \ldots, \boldsymbol{G}^{r\top}_{(N)}]^\top$, where $\boldsymbol{G}^r_{(j)} = \frac{1}{K}\sum_{k=1}^{K}\sum_{t=1}^{T} \nabla_{(j)} f_{k,j}(\boldsymbol{y}^{r,t}_{k,j}; \zeta)$. The components $\boldsymbol{x}_{(j)}$ are realized every $T$ iterations, when the hubs communicate with clients and with other hubs, but the evolution of these virtual vectors is studied at each iteration.
Theorem 1 (convergence rates of FedBCGD)
- Strongly convex case: if each $f_i$ satisfies Assumption 1 with $\mu > 0$, $\tilde{\eta} = \frac{\alpha \eta T}{4}$, and $\tilde{\eta} \le \frac{1}{\beta}$, then
\[
\mathbb{E}\bigl[f(\bar{\boldsymbol{z}}^R)\bigr] - f(\boldsymbol{x}^\star) \le \mu \|\boldsymbol{x}^0 - \boldsymbol{x}^\star\|^2 \exp\Bigl(-\frac{\alpha \mu R}{\beta}\Bigr) + \frac{128\bigl(1-\frac{K}{M}\bigr)\frac{G^2}{K} + \frac{32\sigma^2}{KT}}{\mu R} + \frac{384\beta G^2 + \frac{192\beta\sigma^2}{T}}{\alpha^2 \mu^2 R^2} + \frac{6144\beta^2 G^2 + \frac{3072\beta^2\sigma^2}{T}}{\alpha^2 \mu^3 R^3}. \tag{19}
\]
- General convex case: if each $f_i$ satisfies Assumption 1 with $\mu = 0$, $\tilde{\eta} = \frac{\alpha \eta T}{4}$, and $\tilde{\eta} \le \frac{1}{\beta}$, then
\[
\mathbb{E}\bigl[f(\bar{\boldsymbol{z}}^R)\bigr] - f(\boldsymbol{x}^\star) \le \frac{\beta^{\frac{3}{2}} d_0}{\alpha R} + \frac{6144\beta^2 G^2 + \frac{3072\beta^2\sigma^2}{T}}{\alpha^2 R} + \Bigl[32\bigl(1-\tfrac{K}{M}\bigr)\tfrac{G^2}{K} + \tfrac{32\sigma^2}{KT}\Bigr]^{\frac{1}{2}} \frac{d_0^{\frac{1}{2}}}{\sqrt{R}} + \Bigl[384\beta G^2 + \tfrac{192\beta\sigma^2}{T}\Bigr]^{\frac{1}{3}} \frac{d_0^{\frac{2}{3}}}{\alpha^{\frac{2}{3}} R^{\frac{2}{3}}}. \tag{20}
\]
- Non-convex case: if each $f_i$ satisfies Assumption 2, $\tilde{\eta} = \frac{\alpha \eta T}{4}$, and $\tilde{\eta} \le \frac{1}{\beta}$, then
\[
\frac{1}{R}\sum_{r=1}^{R} \|\nabla f(\boldsymbol{x}^r)\|^2 \le \frac{16\beta d_0}{T K \alpha R} + \frac{2\sqrt{d_0}}{\sqrt{R T M}} \Bigl[\tfrac{8\beta}{K}\bigl(1-\tfrac{K}{M}\bigr)G^2 + \tfrac{8\beta\sigma^2}{TK}\bigl(1-\tfrac{K}{M}\bigr) + \tfrac{8\beta\sigma^2}{TM}\Bigr]^{\frac{1}{2}} + 2\Bigl(\tfrac{d_0}{R}\Bigr)^{\frac{2}{3}} \Bigl[\tfrac{384\beta^2}{\alpha^2}G^2 + \tfrac{92\beta^2\sigma^2}{T\alpha^2} + \tfrac{16\gamma^2\beta^2\sigma^2}{TM} + \tfrac{16\gamma^2\beta^2\sigma^2}{TK}\bigl(1-\tfrac{K}{M}\bigr) + \tfrac{16\gamma^2\beta^2 G^2}{K}\bigl(1-\tfrac{K}{M}\bigr)\Bigr]^{\frac{1}{3}} + 2\Bigl(\tfrac{d_0}{R}\Bigr)^{\frac{3}{4}} \Bigl[\tfrac{4608\beta^3}{\alpha^2 K}\bigl(1-\tfrac{K}{M}\bigr)G^2 + 115\cdots\Bigr]^{\frac{1}{4}}. \tag{21}
\]
Theorem 2 (convergence rates of FedBCGD+)
- Strongly convex case: if each $f_i$ satisfies Assumption 1 with $\mu > 0$, $\tilde{\eta} = \frac{\alpha \eta T}{4}$, and $\tilde{\eta} \le \min\bigl\{\frac{1}{81\beta}, \frac{S}{15\mu N}\bigr\}$, then
\[
\mathbb{E}\bigl[f(\boldsymbol{z}^R)\bigr] - f(\boldsymbol{x}^\star) \le \tilde{\mathcal{O}}\Bigl(\frac{\mu M}{K} \tilde{D}^2 \exp\Bigl(-\min\Bigl\{\frac{M}{30K}, \frac{\mu}{162\beta}\Bigr\} R\Bigr)\Bigr). \tag{22}
\]
- General convex case: if each $f_i$ satisfies Assumption 1 with $\mu = 0$ and $\tilde{\eta} \le \frac{1}{\beta}$, then
\[
\mathbb{E}\bigl[f(\bar{\boldsymbol{z}}^R)\bigr] - f(\boldsymbol{x}^\star) \le \mathcal{O}\Bigl(\sqrt{\tfrac{M}{K}}\, \frac{\beta \tilde{D}^2}{R}\Bigr). \tag{23}
\]
- Non-convex case: if each $f_i$ satisfies Assumption 2, $\tilde{\eta} = \frac{1}{4\alpha T}$, and $\tilde{\eta} \le \frac{1}{24\beta}\bigl(\frac{K}{M}\bigr)^{\frac{2}{3}}$, then
\[
\mathbb{E}\,\bigl\|\nabla f(\bar{\boldsymbol{z}}^R)\bigr\|^2 \le \mathcal{O}\Bigl(\frac{\beta F}{R}\Bigl(\frac{M}{K}\Bigr)^{\frac{2}{3}}\Bigr), \tag{24}
\]
where $\tilde{D}^2 := \|\boldsymbol{x}^0 - \boldsymbol{x}^\star\|^2 + \frac{1}{2N\beta^2}\sum_{i=1}^{N} \|\boldsymbol{c}^0_i - \nabla f_i(\boldsymbol{x}^\star)\|^2$ and $F := f(\boldsymbol{x}^0) - f(\boldsymbol{x}^\star)$.
discussion (0)