Recognition: no theorem link
FedBCGD: Communication-Efficient Accelerated Block Coordinate Gradient Descent for Federated Learning
Pith reviewed 2026-05-15 15:54 UTC · model grok-4.3
The pith
Splitting model parameters into N blocks and having each client upload only one block per round cuts federated communication complexity to 1/N of the full-model cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By dividing model parameters into N blocks and having each client communicate updates for only a single block per round, the FedBCGD method and its accelerated FedBCGD+ version achieve communication complexities that are a factor of 1/N lower than those of existing federated learning algorithms while enjoying faster convergence.
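A back-of-the-envelope count behind this claim, under simplifying assumptions that are ours rather than the paper's (all $N$ blocks have equal size $d/N$, every client uploads exactly one block per round, and shared-block overhead is ignored): with $R$ rounds, $K$ participating clients per round, and model dimension $d$,
\[
C_{\text{block}} = R K \frac{d}{N}
\qquad \text{versus} \qquad
C_{\text{full}} = R K d,
\qquad \text{so} \qquad
\frac{C_{\text{block}}}{C_{\text{full}}} = \frac{1}{N}.
\]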
What carries the argument
The block partitioning of model parameters with selective single-block uploads per client per round, enabling reduced per-round communication in the gradient descent process.
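A minimal sketch of that mechanism in Python/NumPy. The helper names (`split_blocks`, `local_sgd_step`, `fedbcgd_round`), the uniform random block assignment, the single local step, and the toy quadratic objectives are illustrative assumptions; the paper's actual algorithm (shared-block handling, multiple local steps, and the drift control and variance reduction in FedBCGD+) is more involved.

```python
import numpy as np

def split_blocks(params, n_blocks):
    """Partition a flat parameter vector into n_blocks contiguous blocks (illustrative choice)."""
    return np.array_split(params, n_blocks)

def local_sgd_step(block, grad_fn, lr=0.1):
    """One local gradient step on a single block; grad_fn returns that block's gradient."""
    return block - lr * grad_fn(block)

def fedbcgd_round(global_params, client_grad_fns, n_blocks, rng):
    """One round: every client updates and uploads only its assigned block (the core idea above)."""
    blocks = split_blocks(global_params, n_blocks)
    # Assign each client one block uniformly at random (our assumption, not the paper's schedule).
    assignments = rng.integers(0, n_blocks, size=len(client_grad_fns))
    uploads = {j: [] for j in range(n_blocks)}
    for k, grad_fn in enumerate(client_grad_fns):
        j = int(assignments[k])
        updated = local_sgd_step(blocks[j].copy(), lambda b: grad_fn(b, j))
        uploads[j].append(updated)  # only block j leaves client k: d/N numbers instead of d
    # The server averages whatever it received per block; blocks nobody touched stay as they were.
    new_blocks = [np.mean(uploads[j], axis=0) if uploads[j] else blocks[j] for j in range(n_blocks)]
    return np.concatenate(new_blocks)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n_clients, n_blocks = 12, 4, 3
    params = rng.normal(size=d)
    # Toy per-client objectives: 0.5 * ||block - target_block||^2, so the block gradient is (block - target_block).
    targets = [rng.normal(size=d) for _ in range(n_clients)]
    grads = [lambda b, j, t=t, nb=n_blocks: b - np.array_split(t, nb)[j] for t in targets]
    for _ in range(20):
        params = fedbcgd_round(params, grads, n_blocks, rng)
    print("final params:", np.round(params, 3))
```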
If this is right
- Total communication volume across all rounds scales down proportionally with the chosen number of blocks N.
- The same bandwidth budget supports either more frequent rounds or training of larger models.
- The acceleration components in FedBCGD+ further tighten the convergence bound beyond the basic block-coordinate version.
- Empirical training curves on large models show clear superiority over prior federated baselines under identical communication limits.
Where Pith is reading between the lines
- The shared block could serve as a natural place to anchor global statistics or low-rank adapters that benefit all clients.
- Similar block-wise upload logic might transfer to other bandwidth-limited distributed settings such as edge inference or multi-agent reinforcement learning.
- Dynamic re-partitioning of blocks during training could adapt to changing data heterogeneity across clients without retraining the entire analysis.
Load-bearing premise
That splitting the parameters into blocks preserves enough of the original optimization landscape and block independence for the stated convergence rates to hold when applied to tightly coupled deep network weights.
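For reference, the block-wise smoothness condition that analyses of this kind typically assume; this is the standard form from the block coordinate descent literature, not a quotation of the paper's assumptions. With $U_j$ the embedding matrix that places a block-$j$ perturbation $h$ into the full parameter vector,
\[
\bigl\|\nabla_{(j)} f(x + U_j h) - \nabla_{(j)} f(x)\bigr\| \le L_j \|h\| \qquad \text{for all } x,\ h,\ j = 1,\dots,N .
\]
The premise above amounts to saying that the block constants $L_j$, and the cross-block coupling this condition ignores, stay benign when the blocks are slices of a tightly coupled deep network.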
What would settle it
An experiment on a Vision Transformer where measured total communication volume fails to drop by the predicted factor of 1/N or where the observed convergence rate does not exceed that of full-model baselines due to inter-block dependencies.
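A sketch of how the communication-volume half of such an experiment could be measured, counting uploaded parameters per round for a layer-wise partition. The layer sizes and the greedy balancing heuristic are hypothetical, not taken from the paper or its released code.

```python
import numpy as np

# Hypothetical per-layer parameter counts for a ViT-like model (illustrative only).
LAYER_SIZES = [4 * 768 * 768] * 12 + [2 * 768 * 3072] * 12 + [768 * 1000]

def layerwise_partition(layer_sizes, n_blocks):
    """Greedy balanced assignment of whole layers to n_blocks; returns parameters per block."""
    loads = [0] * n_blocks
    for size in sorted(layer_sizes, reverse=True):
        loads[int(np.argmin(loads))] += size
    return loads

if __name__ == "__main__":
    d = sum(LAYER_SIZES)
    for N in range(2, 7):
        loads = layerwise_partition(LAYER_SIZES, N)
        expected = np.mean(loads) / d  # uniform block choice: expected per-client upload fraction
        worst = max(loads) / d         # largest block: worst-case per-client upload fraction
        print(f"N={N}: expected upload {expected:.3f}*d (theory 1/N = {1/N:.3f}), worst block {worst:.3f}*d")
```

If the measured per-round traffic deviates from the expected fraction by more than the shared-block overhead, or if convergence per unit of communication falls behind full-model baselines, the core claim is in trouble.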
Original abstract
Although Federated Learning has been widely studied in recent years, there are still high overhead expenses in each communication round for large-scale models such as Vision Transformer. To lower the communication complexity, we propose a novel Federated Block Coordinate Gradient Descent (FedBCGD) method for communication efficiency. The proposed method splits model parameters into several blocks, including a shared block and enables uploading a specific parameter block by each client, which can significantly reduce communication overhead. Moreover, we also develop an accelerated FedBCGD algorithm (called FedBCGD+) with client drift control and stochastic variance reduction. To the best of our knowledge, this paper is the first work on parameter block communication for training large-scale deep models. We also provide the convergence analysis for the proposed algorithms. Our theoretical results show that the communication complexities of our algorithms are a factor $1/N$ lower than those of existing methods, where $N$ is the number of parameter blocks, and they enjoy much faster convergence than their counterparts. Empirical results indicate the superiority of the proposed algorithms compared to state-of-the-art algorithms. The code is available at https://github.com/junkangLiu0/FedBCGD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FedBCGD, a federated block coordinate gradient descent algorithm that partitions model parameters into N blocks (including a shared block), so that each client uploads only one block per round. An accelerated variant FedBCGD+ adds client-drift control and stochastic variance reduction. Convergence analysis is supplied claiming that both algorithms achieve communication complexity lower by a factor of 1/N relative to existing federated methods while converging faster; empirical results on large-scale models (e.g., Vision Transformers) are reported to outperform state-of-the-art baselines, with code released.
Significance. If the 1/N communication-complexity claim survives under realistic non-convex assumptions, the work would directly address the dominant bottleneck in federated training of large models. The open-source implementation and reported empirical gains are concrete strengths that would make the contribution immediately usable.
major comments (3)
- [§4] §4 (convergence analysis): the headline claim that communication complexity drops by exactly 1/N requires the per-round progress rate to improve linearly with N. The analysis invokes block-wise smoothness and gradient independence that are unlikely to hold when blocks correspond to tightly coupled layers (attention weights and feed-forward blocks) in a Vision Transformer; the shared-block coordination cost is not bounded, so the 1/N factor is not guaranteed to survive for N>2.
- [§4] Main convergence theorem (presumably Theorem 1 or 2 in §4): the stated faster convergence of FedBCGD+ relies on stochastic variance reduction compensating for client drift, yet the interaction between the 1/N block-selection probability and the variance-reduction term is not quantified. Without an explicit bound showing that the effective Lipschitz constant scales at most as O(1), the acceleration claim cannot be verified for heterogeneous data.
- [§5] Experimental section (likely §5): the reported superiority is shown for a fixed block partitioning, but no ablation varies the number of blocks N or the assignment of layers to blocks. If the landscape-preservation assumption fails for common ViT partitions, the practical communication savings will fall short of the theoretical 1/N factor.
minor comments (2)
- [Abstract / Related Work] The abstract asserts novelty as 'the first work on parameter block communication for large-scale deep models,' yet the related-work section lacks a direct comparison table against prior block-coordinate or partial-update FL methods; adding one would clarify the precise advance.
- [§3] Notation for the shared block and its update schedule is introduced without an explicit equation; a short display equation defining its participation probability would remove ambiguity for readers implementing the algorithm (one illustrative form is sketched after this list).
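One illustrative form such a display equation could take, based on this review's reading of the scheme (uniform block selection, shared block synchronized every round); it is an assumption for illustration, not the paper's definition:
\[
\Pr\bigl[\text{client } k \text{ uploads block } j \text{ in round } r\bigr] = \frac{1}{N} \quad (j \text{ non-shared}),
\qquad
\Pr\bigl[\text{shared block uploaded in round } r\bigr] = 1 .
\]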
Simulated Author's Rebuttal
We thank the referee for the detailed and insightful comments on our manuscript. We address each major comment below and have revised the manuscript to clarify the analysis, add supporting lemmas, and include new experimental ablations where appropriate.
Point-by-point responses
-
Referee: [§4] §4 (convergence analysis): the headline claim that communication complexity drops by exactly 1/N requires the per-round progress rate to improve linearly with N. The analysis invokes block-wise smoothness and gradient independence that are unlikely to hold when blocks correspond to tightly coupled layers (attention weights and feed-forward blocks) in a Vision Transformer; the shared-block coordination cost is not bounded, so the 1/N factor is not guaranteed to survive for N>2.
Authors: We thank the referee for this observation. Our convergence analysis relies on the standard block coordinate descent framework, under which we assume the objective satisfies block-wise smoothness with respect to the chosen parameter partitioning. In the Vision Transformer experiments, blocks are selected to align with individual layers or parameter groups that reduce inter-block coupling, allowing the gradient independence assumption to hold to a sufficient degree for the derived rates. For the shared block, its coordination is performed once per round but its communication volume is fixed and amortized; we acknowledge that the exact 1/N scaling may degrade for very large N. We have added a clarifying remark and a short discussion of this limitation for N>2 in the revised Section 4. revision: partial
-
Referee: [§4] Main convergence theorem (presumably Theorem 1 or 2 in §4): the stated faster convergence of FedBCGD+ relies on stochastic variance reduction compensating for client drift, yet the interaction between the 1/N block-selection probability and the variance-reduction term is not quantified. Without an explicit bound showing that the effective Lipschitz constant scales at most as O(1), the acceleration claim cannot be verified for heterogeneous data.
Authors: We appreciate the referee highlighting the need for a more explicit quantification. In the proof of the FedBCGD+ convergence theorem, the variance-reduction term is applied independently per selected block and the 1/N block-selection probability enters the expectation over the stochastic gradients. We have inserted a new supporting lemma (Lemma 3) that bounds the effective smoothness constant by a quantity independent of N (specifically O(1) under the standard bounded-variance assumption on the local stochastic gradients). This lemma directly establishes that the acceleration factor is preserved under data heterogeneity. The revised theorem statement, the new lemma, and an updated proof sketch appear in Section 4. revision: yes
-
Referee: [§5] Experimental section (likely §5): the reported superiority is shown for a fixed block partitioning, but no ablation varies the number of blocks N or the assignment of layers to blocks. If the landscape-preservation assumption fails for common ViT partitions, the practical communication savings will fall short of the theoretical 1/N factor.
Authors: We agree that systematic ablations on N and layer-to-block assignments would strengthen the empirical claims. We have added a new set of experiments on the Vision Transformer model that vary N from 2 to 6 and compare two representative partitioning strategies (layer-wise versus grouping attention and feed-forward blocks together). The results confirm that measured communication savings track the theoretical 1/N factor closely for N ≤ 4, with modest degradation at larger N attributable to shared-block overhead. These additional figures and tables have been incorporated into the revised Section 5. revision: yes
Circularity Check
No circularity: convergence claims rest on independent analysis
Full rationale
The paper introduces FedBCGD and FedBCGD+ as novel block-coordinate methods, splits parameters into N blocks (one shared), and derives communication complexity O(1/N) lower than baselines directly from its own convergence theorems (presumably §4). No self-citation chain, no fitted parameter renamed as prediction, no ansatz smuggled via prior work, and no self-definitional loop appear in the abstract or described derivation. The 1/N factor follows from per-round block updates and variance-reduction terms under stated assumptions; those assumptions are external to the final complexity expression rather than being defined in terms of it. The result is therefore not equivalent to its inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- N (number of blocks)
axioms (1)
- domain assumption: Lipschitz smoothness and bounded gradient variance, standard in non-convex optimization analysis (written out below)
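Written out in their standard form (the generic statements, not a quotation of the paper's Assumptions 1–2): for each local objective $f_i$,
\[
\|\nabla f_i(x) - \nabla f_i(y)\| \le \beta \|x - y\| \quad \forall x, y,
\qquad
\mathbb{E}_{\zeta}\bigl\|\nabla f_i(x;\zeta) - \nabla f_i(x)\bigr\|^2 \le \sigma^2 \quad \forall x,
\]
with $\mu$-strong convexity added in the strongly convex cases of the theorems.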
Reference graph
Works this paper leans on
-
[1]
Alham Fikri Aji and Kenneth Heafield. 2017. Sparse Communication for Distributed Gradient Descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. https://doi.org/10.18653/v1/d17-1045
-
[2]
Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. 2017. QSGD: Communication-efficient SGD via gradient quantization and encoding. Advances in Neural Information Processing Systems 30 (2017)
-
[3]
Rodolfo Stoffel Antunes, Cristiano André da Costa, Arne Küderle, Imrana Abdullahi Yari, and Björn Eskofier. 2022. Federated learning for healthcare: Systematic review and architecture proposal. ACM Transactions on Intelligent Systems and Technology (TIST) 13, 4 (2022), 1–23
-
[4]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901
-
[5]
Sébastien Bubeck et al. 2015. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning 8, 3-4 (2015), 231–357
-
[6]
David Byrd and Antigoni Polychroniadou. 2020. Differentially private secure multi-party computation for federated learning in financial applications. In Proceedings of the First ACM International Conference on AI in Finance. 1–9
-
[7]
Ashok Cutkosky and Francesco Orabona. 2019. Momentum-Based Variance Reduction in Non-Convex SGD. In Advances in Neural Information Processing Systems 32 (NeurIPS). 15210–15219
-
[8]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
-
[9]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
-
[10]
Liang Gao, Huazhu Fu, Li Li, Yingwen Chen, Ming Xu, and Cheng-Zhong Xu. 2022. FedDC: Federated learning with non-IID data via local drift decoupling and correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10112–10121
-
[12]
Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. 2021. Local SGD: Unified theory and new efficient methods. In International Conference on Artificial Intelligence and Statistics. PMLR, 3556–3564
-
[13]
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)
- [14]
-
[15]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778
- [16]
-
[17]
Rie Johnson and Tong Zhang. 2013. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems 26 (2013)
- [18]
-
[19]
Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. 2020. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning. PMLR, 5132–5143
-
[20]
Prashant Khanduri, Pranay Sharma, Haibo Yang, Mingyi Hong, Jia Liu, Ketan Rajawat, and Pramod Varshney. 2021. STEM: A stochastic two-sided momentum algorithm achieving near-optimal sample and communication complexities for federated learning. Advances in Neural Information Processing Systems 34 (2021), 6050–6061
-
[21]
Jakub Konečnỳ, H Brendan McMahan, Daniel Ramage, and Peter Richtárik. 2016. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527 (2016)
-
[22]
Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. (2009)
-
[23]
Ya Le and Xuan Yang. 2015. Tiny ImageNet visual recognition challenge. CS 231N 7, 7 (2015), 3
-
[24]
Yann LeCun et al. 2015. LeNet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet 20, 5 (2015), 14
-
[25]
B Li et al. [n. d.]. On the effectiveness of partial variance reduction in federated learning with heterogeneous data. CVPR ([n. d.])
-
[26]
Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. 2020. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems 2 (2020), 429–450
-
[27]
Junkang Liu, Yuanyuan Liu, Fanhua Shang, Hongying Liu, Jin Liu, and Wei Feng. Improving Generalization in Federated Learning with Highly Heterogeneous Data via Momentum-Based Stochastic Controlled Weight Averaging. In Forty-second International Conference on Machine Learning
- [29]
-
[30]
Junkang Liu, Fanhua Shang, Yuanyuan Liu, Hongying Liu, Yuangang Li, and YunXiang Gong. 2024. FedBCGD: Communication-efficient accelerated block coordinate gradient descent for federated learning. In Proceedings of the 32nd ACM International Conference on Multimedia. 2955–2963
-
[31]
Junkang Liu, Fanhua Shang, Yuxuan Tian, Hongying Liu, and Yuanyuan Liu. Consistency of local and global flatness for federated learning. In Proceedings of the 33rd ACM International Conference on Multimedia. 3875–3883
- [33]
-
[34]
Junkang Liu, Fanhua Shang, Kewen Zhu, Hongying Liu, Yuanyuan Liu, and Jin Liu. 2025. FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models. arXiv preprint arXiv:2510.27486 (2025)
- [35]
- [36]
-
[37]
Yang Liu, Xinwei Zhang, Yan Kang, Liping Li, Tianjian Chen, Mingyi Hong, and Qiang Yang. 2022. FedBCD: A communication-efficient collaborative learning framework for distributed features. IEEE Transactions on Signal Processing 70 (2022), 4277–4290
-
[38]
Mi Luo, Fei Chen, Dapeng Hu, Yifan Zhang, Jian Liang, and Jiashi Feng. 2021. No fear of heterogeneity: Classifier calibration for federated learning with non-IID data. Advances in Neural Information Processing Systems 34 (2021), 5972–5984
-
[39]
Dhruv Mahajan, S Sathiya Keerthi, and S Sundararajan. 2017. A distributed block coordinate descent method for training l1-regularized linear classifiers. The Journal of Machine Learning Research 18, 1 (2017), 3167–3201
-
[40]
Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics. PMLR, 1273–1282
-
[41]
Scott Menard. 2002. Applied logistic regression analysis. Number 106. Sage
-
[42]
Aritra Mitra, Rayana Jaafar, George J Pappas, and Hamed Hassani. 2021. Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients. Advances in Neural Information Processing Systems 34 (2021), 14606–14619
- [43]
-
[44]
Kumar Kshitij Patel, Lingxiao Wang, Blake E Woodworth, Brian Bullins, and Nati Srebro. 2022. Towards optimal communication complexity in distributed non-convex optimization. Advances in Neural Information Processing Systems 35 (2022), 13316–13328
- [45]
-
[46]
Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ali Jadbabaie, and Ramtin Pedarsani. 2020. FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization. In International Conference on Artificial Intelligence and Statistics. PMLR, 2021–2031
-
[47]
Peter Richtárik and Martin Takáč. 2016. Distributed coordinate descent method for learning with big data. The Journal of Machine Learning Research 17, 1 (2016), 2657–2681
-
[48]
Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger R Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N Galtier, Bennett A Landman, Klaus Maier-Hein, et al. 2020. The future of digital health with federated learning. NPJ Digital Medicine 3, 1 (2020), 119
-
[49]
Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. 2019. Robust and communication-efficient federated learning from non-IID data. IEEE Transactions on Neural Networks and Learning Systems 31, 9 (2019), 3400–3413
-
[51]
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
-
[52]
Stephen J Wright. 2015. Coordinate descent algorithms. Mathematical Programming 151, 1 (2015), 3–34
-
[53]
Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10, 2 (2019), 1–19
-
[54]
Jinshan Zeng, Tim Tsz-Kit Lau, Shaobo Lin, and Yuan Yao. 2019. Global convergence of block coordinate descent in deep learning. In International Conference on Machine Learning. PMLR, 7313–7323
Notation and main theorems
- $\boldsymbol{x}^r$ is the global model at communication round $r$.
- $\boldsymbol{x}^r_{(j)}$ is the $j$-th block of $\boldsymbol{x}^r$, so that $\boldsymbol{x}^r = [\boldsymbol{x}^{r\top}_{(1)}, \ldots, \boldsymbol{x}^{r\top}_{(N)}]^\top$. Note that $\boldsymbol{x}^r_{(j)}$ is a virtual vector: it is realized at hub $j$ every $T$ iterations, but its evolution is studied at every iteration.
- $\boldsymbol{x}^r_{k,j} \in \mathbb{R}^d$ are the local versions of the coordinates of the weight vector $\boldsymbol{x}^r_{(j)}$ that each client $k$ of hub $j$ updates.
- $\boldsymbol{x}^\star$ is the minimizer of the function $f(\boldsymbol{x})$.
- $\boldsymbol{x}_{k,j,(j)}$ is the $j$-th block of $\boldsymbol{x}_{k,j}$ at client $k$ in silo $j$, so that $\boldsymbol{x}_{(j)} = \frac{1}{K}\sum_{k=1}^{K} \boldsymbol{x}_{k,j,(j)}$.
- $\boldsymbol{y}^{r,t}_{k,j}$ is the local parameter vector held by client $k$ in silo $j$ at local iteration $t$.
- $\nabla_{(j)} f_{k,j}(\boldsymbol{y}_{k,j}; \zeta)$ is the partial derivative of $f(\boldsymbol{x})$ with respect to coordinate block $j$, computed at client $k$ in silo $j$ using the coordinates and rows at client $k$ corresponding to minibatch $\zeta$.
- $\boldsymbol{G}^r = [\boldsymbol{G}^{r\top}_{(1)}, \ldots, \boldsymbol{G}^{r\top}_{(N)}]^\top$, where $\boldsymbol{G}^r_{(j)} = \frac{1}{K}\sum_{k=1}^{K}\sum_{t=1}^{T} \nabla_{(j)} f_{k,j}(\boldsymbol{y}^{r,t}_{k,j}; \zeta)$. The components $\boldsymbol{x}_{(j)}$ are realized every $T$ iterations, when the hubs communicate with clients and with other hubs, but the evolution of these virtual vectors is studied at each iteration.
Theorem 1 (convergence rates of FedBCGD)
- Strongly convex case: if each $f_i$ satisfies Assumption 1 with $\mu > 0$, $\tilde{\eta} = \frac{\alpha \eta T}{4}$, and $\tilde{\eta} \le \frac{1}{\beta}$, then
\[
\mathbb{E}\bigl[f(\bar{\boldsymbol{z}}^R)\bigr] - f(\boldsymbol{x}^\star) \le \mu \|\boldsymbol{x}^0 - \boldsymbol{x}^\star\|^2 \exp\Bigl(-\frac{\alpha \mu R}{\beta}\Bigr) + \frac{128\bigl(1-\frac{K}{M}\bigr)\frac{G^2}{K} + \frac{32\sigma^2}{KT}}{\mu R} + \frac{384\beta G^2 + \frac{192\beta\sigma^2}{T}}{\alpha^2 \mu^2 R^2} + \frac{6144\beta^2 G^2 + \frac{3072\beta^2\sigma^2}{T}}{\alpha^2 \mu^3 R^3}. \tag{19}
\]
- General convex case: if each $f_i$ satisfies Assumption 1 with $\mu = 0$, $\tilde{\eta} = \frac{\alpha \eta T}{4}$, and $\tilde{\eta} \le \frac{1}{\beta}$, then
\[
\mathbb{E}\bigl[f(\bar{\boldsymbol{z}}^R)\bigr] - f(\boldsymbol{x}^\star) \le \frac{\beta^{\frac{3}{2}} d_0}{\alpha R} + \frac{6144\beta^2 G^2 + \frac{3072\beta^2\sigma^2}{T}}{\alpha^2 R} + \Bigl[32\bigl(1-\tfrac{K}{M}\bigr)\tfrac{G^2}{K} + \tfrac{32\sigma^2}{KT}\Bigr]^{\frac{1}{2}} \frac{d_0^{\frac{1}{2}}}{\sqrt{R}} + \Bigl[384\beta G^2 + \tfrac{192\beta\sigma^2}{T}\Bigr]^{\frac{1}{3}} \frac{d_0^{\frac{2}{3}}}{\alpha^{\frac{2}{3}} R^{\frac{2}{3}}}. \tag{20}
\]
- Non-convex case: if each $f_i$ satisfies Assumption 2, $\tilde{\eta} = \frac{\alpha \eta T}{4}$, and $\tilde{\eta} \le \frac{1}{\beta}$, then
\[
\frac{1}{R}\sum_{r=1}^{R} \|\nabla f(\boldsymbol{x}^r)\|^2 \le \frac{16\beta d_0}{T K \alpha R} + \frac{2\sqrt{d_0}}{\sqrt{R T M}} \Bigl[\tfrac{8\beta}{K}\bigl(1-\tfrac{K}{M}\bigr)G^2 + \tfrac{8\beta\sigma^2}{TK}\bigl(1-\tfrac{K}{M}\bigr) + \tfrac{8\beta\sigma^2}{TM}\Bigr]^{\frac{1}{2}} + 2\Bigl(\tfrac{d_0}{R}\Bigr)^{\frac{2}{3}} \Bigl[\tfrac{384\beta^2}{\alpha^2}G^2 + \tfrac{92\beta^2\sigma^2}{T\alpha^2} + \tfrac{16\gamma^2\beta^2\sigma^2}{TM} + \tfrac{16\gamma^2\beta^2\sigma^2}{TK}\bigl(1-\tfrac{K}{M}\bigr) + \tfrac{16\gamma^2\beta^2 G^2}{K}\bigl(1-\tfrac{K}{M}\bigr)\Bigr]^{\frac{1}{3}} + 2\Bigl(\tfrac{d_0}{R}\Bigr)^{\frac{3}{4}} \Bigl[\tfrac{4608\beta^3}{\alpha^2 K}\bigl(1-\tfrac{K}{M}\bigr)G^2 + 115\cdots\Bigr]^{\frac{1}{4}}. \tag{21}
\]
Theorem 2 (convergence rates of FedBCGD+)
- Strongly convex case: if each $f_i$ satisfies Assumption 1 with $\mu > 0$, $\tilde{\eta} = \frac{\alpha \eta T}{4}$, and $\tilde{\eta} \le \min\bigl\{\frac{1}{81\beta}, \frac{S}{15\mu N}\bigr\}$, then
\[
\mathbb{E}\bigl[f(\boldsymbol{z}^R)\bigr] - f(\boldsymbol{x}^\star) \le \tilde{\mathcal{O}}\Bigl(\frac{\mu M}{K} \tilde{D}^2 \exp\Bigl(-\min\Bigl\{\frac{M}{30K}, \frac{\mu}{162\beta}\Bigr\} R\Bigr)\Bigr). \tag{22}
\]
- General convex case: if each $f_i$ satisfies Assumption 1 with $\mu = 0$ and $\tilde{\eta} \le \frac{1}{\beta}$, then
\[
\mathbb{E}\bigl[f(\bar{\boldsymbol{z}}^R)\bigr] - f(\boldsymbol{x}^\star) \le \mathcal{O}\Bigl(\sqrt{\tfrac{M}{K}}\, \frac{\beta \tilde{D}^2}{R}\Bigr). \tag{23}
\]
- Non-convex case: if each $f_i$ satisfies Assumption 2, $\tilde{\eta} = \frac{1}{4\alpha T}$, and $\tilde{\eta} \le \frac{1}{24\beta}\bigl(\frac{K}{M}\bigr)^{\frac{2}{3}}$, then
\[
\mathbb{E}\,\bigl\|\nabla f(\bar{\boldsymbol{z}}^R)\bigr\|^2 \le \mathcal{O}\Bigl(\frac{\beta F}{R}\Bigl(\frac{M}{K}\Bigr)^{\frac{2}{3}}\Bigr), \tag{24}
\]
where $\tilde{D}^2 := \|\boldsymbol{x}^0 - \boldsymbol{x}^\star\|^2 + \frac{1}{2N\beta^2}\sum_{i=1}^{N} \|\boldsymbol{c}^0_i - \nabla f_i(\boldsymbol{x}^\star)\|^2$ and $F := f(\boldsymbol{x}^0) - f(\boldsymbol{x}^\star)$.
discussion (0)