pith. sign in

arxiv: 2507.06542 · v4 · submitted 2025-07-09 · 💻 cs.LG · cs.DC· cs.MA· stat.ML

On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning

Pith reviewed 2026-05-19 05:34 UTC · model grok-4.3

classification 💻 cs.LG cs.DCcs.MAstat.ML
keywords decentralized learningmodel mergingstochastic gradient descentconvergence ratedata heterogeneitycommunication efficiencydistributed optimization
0
0 comments X p. Extension

The pith

Decentralized SGD with one final global merge achieves the convergence rate of parallel SGD.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies when devices should synchronize in decentralized training and finds that saving communication for a single global merge at the very end markedly improves final test accuracy, especially under high data heterogeneity. It proves that this late merge produces a model whose convergence rate equals that of fully parallel SGD. The analysis does so by treating part of the variation across local models as a helpful signal instead of pure noise. A reader would care because the result suggests decentralized systems can reach strong performance with far less total communication than previously thought necessary. If the claim holds, it reframes limited peer-to-peer links from a bottleneck into a manageable constraint.

Core claim

Performing a single global merge of all local models at the final iteration of decentralized SGD yields an output model that attains the same convergence rate as parallel SGD. The proof obtains this rate by reinterpreting a portion of the discrepancies among the local models, previously regarded as detrimental noise, as constructive components that contribute to the overall convergence bound.

What carries the argument

The single global merging step that aggregates every local model only at the final training iteration.

If this is right

  • Decentralized training can reach comparable generalization to parallel training even when data partitions are highly non-uniform.
  • Communication budgets can be shifted almost entirely to the end of training without sacrificing the theoretical rate.
  • Standard decentralized SGD becomes practical under stricter limits on total peer-to-peer exchanges.
  • Model merging at the close of training can be viewed as a lightweight way to recover parallel-like guarantees.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same late-merge tactic might be tested on non-SGD optimizers to check whether the rate-matching benefit generalizes.
  • Dynamic schedules that trigger the global merge once local drift exceeds a threshold could be compared against the fixed final-step rule.
  • The constructive-discrepancy view may connect decentralized optimization to ensemble methods that deliberately preserve local diversity until the end.

Load-bearing premise

The proof requires reinterpreting discrepancies among local models as constructive components rather than detrimental noise.

What would settle it

A calculation or experiment in which the convergence rate of the globally merged model falls below the parallel-SGD rate on high-heterogeneity data would falsify the central claim.

Figures

Figures reproduced from arXiv: 2507.06542 by Can Wang, Mingze Wang, Tianyu Zhang, Tongtian Zhu, Zhanpeng Zhou.

Figure 1
Figure 1. Figure 1: (a, b): Comparisons of global test accuracy (see Definition 2) in training of CLIP ViT￾B/32 (a) and ResNet-18 (b) on the Tiny ImageNet dataset, distributed across 16 agents with high heterogeneity (Dirichlet α = 0.1; see details in Appendix C.1). Decentralized training here involves each agent syncing with a random peer per round. A single global model merging is performed at the final round. (c): Loss lan… view at source ↗
Figure 2
Figure 2. Figure 2: A comparative illustration of server-based, decentralized, and local training. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: (a, b): Comparisons of global test accuracy (see Definition 2) in decentralized training of ResNet-18 on the CIFAR-100 dataset, distributed across 16 agents with high heterogeneity (Dirichlet α = 0.1; see details in Appendix C.1). Fully-connected communication (synchronous AllReduce) is activated only in specific windows, while low communication with one random peer with a probability of 0.2 is used elsewh… view at source ↗
read the original abstract

Decentralized learning provides a scalable alternative to parameter-server-based training, yet its performance is often hindered by limited peer-to-peer communication. In this paper, we study how communication should be scheduled over time, including determining when and how frequently devices synchronize. Counterintuitive empirical results show that concentrating communication budgets in the later stages of decentralized training remarkably improves global test performance. Surprisingly, we uncover that fully connected communication at the final step, implemented by a single global merging, can significantly improve the performance of decentralized learning under high data heterogeneity. Our theoretical contributions, which explain these phenomena, are the first to establish that the globally merged model of decentralized SGD can match the convergence rate of parallel SGD. Technically, we reinterpret part of the discrepancy among local models, which were previously considered as detrimental noise, as constructive components essential for matching this rate. This work provides evidence that decentralized learning is able to generalize under high data heterogeneity and limited communication, while offering broad new avenues for model merging research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies communication scheduling in decentralized SGD, presenting empirical results that a single global merge at the final training step substantially improves test performance under high data heterogeneity. Theoretically, it claims to be the first to prove that the globally merged model achieves the same O(1/sqrt(T)) convergence rate as parallel SGD, by reinterpreting local-model discrepancy vectors (previously viewed as noise) as constructive components whose inner products aid the bound.

Significance. If the rate-matching result holds, the work shows decentralized learning can match centralized rates with minimal (late-stage) communication, offering concrete evidence that heterogeneity need not preclude good generalization when merging is timed appropriately. This supplies a new lens for model-merging research and credits the constructive-component reinterpretation as the technical step that closes the analysis.

major comments (2)
  1. [§4] §4 (Convergence Analysis), around the expansion of ||(1/n)∑w_i − w*||² after the single late merge: the proof reinterprets the cross terms 2⟨avg(w_i − w_avg), ∇f⟩ as non-detrimental or canceling, yet supplies no explicit bound or sign control on these terms that would guarantee they remain controlled when the merge occurs after hundreds of local steps in the high-heterogeneity regime used in the experiments. This step is load-bearing for equating the merged rate to that of synchronous parallel SGD.
  2. [Theorem 1] Theorem 1 (rate-matching statement): the derivation assumes local models remain in a regime where the reinterpreted discrepancy terms do not introduce extra bias beyond the constants already used for parallel SGD; no separate lemma verifies this regime holds for the single-merge schedule and heterogeneity levels reported in §5.
minor comments (2)
  1. [Abstract] Abstract: quantitative details (e.g., number of local steps before merge, dataset sizes, or observed accuracy deltas) are omitted, making the “surprising effectiveness” claim harder to evaluate at a glance.
  2. [Experiments] Experimental section: tables and figures lack error bars or mention of the number of random seeds; adding these would strengthen the empirical support without altering the central claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which help clarify the requirements for rigorously establishing the rate-matching result. We address each major comment below and indicate the corresponding revisions.

read point-by-point responses
  1. Referee: [§4] §4 (Convergence Analysis), around the expansion of ||(1/n)∑w_i − w*||² after the single late merge: the proof reinterprets the cross terms 2⟨avg(w_i − w_avg), ∇f⟩ as non-detrimental or canceling, yet supplies no explicit bound or sign control on these terms that would guarantee they remain controlled when the merge occurs after hundreds of local steps in the high-heterogeneity regime used in the experiments. This step is load-bearing for equating the merged rate to that of synchronous parallel SGD.

    Authors: We thank the referee for highlighting this point. The original analysis absorbs the cross terms into the existing constants via the reinterpretation of discrepancies as constructive, but we agree an explicit bound improves clarity. In the revised manuscript we expand the derivation in §4 to bound |2⟨avg(w_i − w_avg), ∇f⟩| using L-smoothness and the fact that the average discrepancy norm grows at most linearly with the number of local steps before the final merge; the resulting additive term remains O(1/sqrt(T)) and does not alter the leading rate, matching the parallel-SGD analysis under the same assumptions. revision: yes

  2. Referee: [Theorem 1] Theorem 1 (rate-matching statement): the derivation assumes local models remain in a regime where the reinterpreted discrepancy terms do not introduce extra bias beyond the constants already used for parallel SGD; no separate lemma verifies this regime holds for the single-merge schedule and heterogeneity levels reported in §5.

    Authors: We agree that an explicit verification of the regime is desirable. We have added a supporting lemma (now Lemma 3) in the appendix that bounds the discrepancy growth over the local phases preceding the single global merge. The lemma shows that, for the heterogeneity parameter and local-step counts used in the §5 experiments, the extra bias introduced by the reinterpreted terms stays within the constants already present in the parallel-SGD bound, thereby justifying the assumptions of Theorem 1 for the single-merge schedule. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation extends standard SGD analysis independently

full rationale

The paper claims a single late global merge in decentralized SGD matches the O(1/sqrt(T)) rate of parallel SGD by reinterpreting local discrepancies as constructive components rather than noise. No quoted equations or steps in the provided abstract reduce the final bound to a fitted parameter, self-citation chain, or input by construction. The reinterpretation is presented as an original technical step in the convergence proof, without evidence that cross-term cancellations are forced by prior self-referenced results or ansatzes. The analysis therefore remains self-contained against external SGD benchmarks and does not trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard optimization assumptions (bounded gradients, appropriate learning-rate schedules) plus the novel reinterpretation of local-model discrepancies as constructive; no free parameters or new invented entities are introduced in the abstract.

axioms (1)
  • standard math Standard SGD convergence assumptions such as bounded gradients and suitable learning-rate schedules hold for both decentralized and parallel settings.
    Invoked to establish that the merged model matches parallel-SGD rate.

pith-pipeline@v0.9.0 · 5717 in / 1269 out tokens · 43747 ms · 2026-05-19T05:34:06.252283+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

114 extracted references · 114 canonical work pages · 3 internal anchors

  1. [1]

    Ainsworth, S., Hayase, J., and Srinivasa, S. (2023). Git re-basin: Merging models modulo permutation symmetries. In The Eleventh International Conference on Learning Representations

  2. [2]

    E., Jaggi, M., and Guerraoui, R

    Allouah, Y ., Koloskova, A., Firdoussi, A. E., Jaggi, M., and Guerraoui, R. (2024). The privacy power of correlated noise in decentralized learning. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pages 1115–1143

  3. [3]

    Bonabeau, E., Dorigo, M., and Theraulaz, G. (1999). Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press

  4. [4]

    Z., Bedi, A., and Huang, F

    Bornstein, M., Rabbani, T., Wang, E. Z., Bedi, A., and Huang, F. (2023). SWIFT: Rapid decentralized federated learning via wait-free model communication. InThe Eleventh International Conference on Learning Representations

  5. [5]

    Borzunov, A., Baranchuk, D., Dettmers, T., Riabinin, M., Belkada, Y ., Chumachenko, A., Samygin, P., and Raffel, C. (2023a). Petals: Collaborative inference and fine-tuning of large models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 558–568. Association for Computatio...

  6. [6]

    Borzunov, A., Ryabinin, M., Chumachenko, A., Baranchuk, D., Dettmers, T., Belkada, Y ., Samygin, P., and Raffel, C. A. (2023b). Distributed inference and fine-tuning of large language models over the internet. In Advances in Neural Information Processing Systems

  7. [7]

    Cao, Y ., Wu, Z., Yuan, K., and Sayed, A. H. (2024). On the trade-off between flatness and optimization in distributed learning. arXiv preprint arXiv:2406.20006

  8. [8]

    Cambridge bitcoin electricity consumption index (CBECI)

    CCAF (2023). Cambridge bitcoin electricity consumption index (CBECI). https://ccaf.io/ cbnsi/cbeci

  9. [9]

    Chen, L., Ye, H., and Luo, L. (2024). An efficient stochastic algorithm for decentralized nonconvex-strongly-concave minimax optimization. International Conference on Artificial Intelli- gence and Statistics

  10. [10]

    Chen, X., Huang, M., Ma, S., and Balasubramanian, K. (2023). Decentralized stochastic bilevel optimization with improved per-iteration complexity. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 4641–4671. PMLR

  11. [11]

    Chen, Y ., Yuan, K., Zhang, Y ., Pan, P., Xu, Y ., and Yin, W. (2021). Accelerating gossip sgd with periodic global averaging. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 1791–1802. PMLR

  12. [12]

    M., Damian, A., Talwalkar, A., Kolter, Z., and Lee, J

    Cohen, J. M., Damian, A., Talwalkar, A., Kolter, Z., and Lee, J. D. (2025). Understanding optimization in deep learning with central flows. In The Thirteenth International Conference on Learning Representations

  13. [13]

    Cyffers, E., Bellet, A., and Upadhyay, J. (2024). Differentially private decentralized learning with random walks. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pages 9762–9783

  14. [14]

    Damian, A., Nichani, E., and Lee, J. D. (2023). Self-stabilization: The implicit bias of gradient descent at the edge of stability. In the Eleventh International Conference on Learning Representations

  15. [15]

    Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. (2012). Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(6):165–202

  16. [16]

    A., Chhaparia, R., Donchev, Y ., Kuncoro, A., Ranzato, M., Szlam, A., and Shen, J

    Douillard, A., Feng, Q., Rusu, A. A., Chhaparia, R., Donchev, Y ., Kuncoro, A., Ranzato, M., Szlam, A., and Shen, J. (2023). Diloco: Distributed low-communication training of language models. arXiv preprint arXiv:2311.08105

  17. [17]

    Draxler, F., Veschgini, K., Salmhofer, M., and Hamprecht, F. (2018). Essentially no barriers in neural network energy landscape. In International conference on machine learning , pages 1309–1318. PMLR. 10

  18. [18]

    Entezari, R., Sedghi, H., Saukh, O., and Neyshabur, B. (2022). The role of permutation invariance in linear mode connectivity of neural networks. In International Conference on Learning Representations

  19. [19]

    Even, M., Koloskova, A., and Massoulie, L. (2024). Asynchronous SGD on graphs: a unified framework for asynchronous decentralized and federated optimization. In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics

  20. [20]

    K., Paul, M., Kharaghani, S., Roy, D

    Fort, S., Dziugaite, G. K., Paul, M., Kharaghani, S., Roy, D. M., and Ganguli, S. (2020). Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. Advances in Neural Information Processing Systems , 33:5850–5861

  21. [21]

    K., Roy, D., and Carbin, M

    Frankle, J., Dziugaite, G. K., Roy, D., and Carbin, M. (2020). Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269. PMLR

  22. [22]

    Freeman, C. D. and Bruna, J. (2017). Topology and geometry of half-rectified network opti- mization. In International Conference on Learning Representations

  23. [23]

    Gao, H., Gu, B., and Thai, M. T. (2023). On the convergence of distributed stochastic bilevel optimization algorithms over a network. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206, pages 9238–9281. PMLR

  24. [24]

    and Huang, H

    Gao, H. and Huang, H. (2021). Fast training method for stochastic compositional optimization problems. Advances in Neural Information Processing Systems, 34:25334–25345

  25. [25]

    and Efrati, A

    Gardizy, A. and Efrati, A. (2024). Microsoft and OpenAI plot $100 billion stargate AI super- computer. The Information

  26. [26]

    P., and Wilson, A

    Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., and Wilson, A. G. (2018). Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in neural information processing systems, 31

  27. [27]

    Ai infrastructure market size, share & growth report, 2030

    Grand View Research (2024). Ai infrastructure market size, share & growth report, 2030

  28. [28]

    Gu, X., Lyu, K., Arora, S., Zhang, J., and Huang, L. (2024). A quadratic synchronization rule for distributed deep learning. In The Twelfth International Conference on Learning Representations

  29. [29]

    Gu, X., Lyu, K., Huang, L., and Arora, S. (2023a). Why (and when) does local SGD generalize better than SGD? In International Conference on Learning Representations

  30. [30]

    Gu, X., Lyu, K., Huang, L., and Arora, S. (2023b). Why (and when) does local SGD generalize better than SGD? In The Eleventh International Conference on Learning Representations

  31. [31]

    Gurbuzbalaban, M., Hu, Y ., Simsekli, U., Yuan, K., and Zhu, L. (2022). Heavy-tail phenomenon in decentralized sgd. arXiv preprint arXiv:2205.06689

  32. [32]

    He, F., Nan, L., and Zhu, T. (2025). Imagining a democratic, affordable future of foundation models: A decentralised avenue. In Handbook of Blockchain Analytics. Springer

  33. [33]

    He, K., Zhang, X., Ren, S., and Sun, J. (2016). Identity mappings in deep residual networks. In European conference on computer vision

  34. [34]

    P., and Jaggi, M

    He, L., Karimireddy, S. P., and Jaggi, M. (2022). Byzantine-robust decentralized learning via clippedgossip. arXiv preprint arXiv:2202.01545

  35. [35]

    Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification

    Hsu, T.-M. H., Qi, H., and Brown, M. (2019). Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335

  36. [36]

    T., Wortsman, M., Schmidt, L., Hajishirzi, H., and Farhadi, A

    Ilharco, G., Ribeiro, M. T., Wortsman, M., Schmidt, L., Hajishirzi, H., and Farhadi, A. (2023). Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations. 11

  37. [37]

    Y ., Song, S., Hajishirzi, H., Kornblith, S., Farhadi, A., and Schmidt, L

    Ilharco, G., Wortsman, M., Gadre, S. Y ., Song, S., Hajishirzi, H., Kornblith, S., Farhadi, A., and Schmidt, L. (2022). Patching open-vocabulary models by interpolating weights. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K., editors, Advances in Neural Information Processing Systems

  38. [38]

    P., and Wilson, A

    Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D. P., and Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. In Globerson, A. and Silva, R., editors, Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI 2018, Monterey, California, USA, August 6-10, 2018, pages 876–885. AUAI Press

  39. [39]

    M., Basra, M., Obeid, F., Straube, J., Keiblinger, M., Bakouch, E., Atkins, L., Panahi, M., Goddard, C., et al

    Jaghouar, S., Ong, J. M., Basra, M., Obeid, F., Straube, J., Keiblinger, M., Bakouch, E., Atkins, L., Panahi, M., Goddard, C., et al. (2024). Intellect-1 technical report. arXiv preprint arXiv:2412.01152

  40. [40]

    Kharrat, S., Canini, M., and Horvath, S. (2024). Decentralized personalized federated learning. arXiv preprint arXiv:2406.06520

  41. [41]

    Kolehmainen, J., Blagoev, N., Donaghy, J., Ersoy, O., and Nies, C. (2025). Noloco: No-all- reduce low communication training method for large models. arXiv preprint arXiv:2506.10911

  42. [42]

    Koloskova, A., Loizou, N., Boreiri, S., Jaggi, M., and Stich, S. (2020). A unified theory of decentralized SGD with changing topology and local updates. In International Conference on Machine Learning

  43. [43]

    Kong, L., Lin, T., Koloskova, A., Jaggi, M., and Stich, S. (2021a). Consensus control for decentralized deep learning. In International Conference on Machine Learning. PMLR

  44. [44]

    Kong, L., Lin, T., Koloskova, A., Jaggi, M., and Stich, S. (2021b). Consensus control for decentralized deep learning. In Proceedings of the 38th International Conference on Machine Learning

  45. [45]

    Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images (tech. rep.). University of Toronto

  46. [46]

    and Yang, X

    Le, Y . and Yang, X. (2015). Tiny imagenet visual recognition challenge. CS 231N

  47. [47]

    Le Bars, B., Bellet, A., Tommasi, M., Lavoie, E., and Kermarrec, A.-M. (2023). Refined convergence and topology learning for decentralized sgd with heterogeneous data. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics

  48. [48]

    Le Bars, B., Bellet, A., Tommasi, M., Scaman, K., and Neglia, G. (2024). Improved stability and generalization guarantees of the decentralized SGD algorithm. In Proceedings of the 41st International Conference on Machine Learning

  49. [49]

    G., Smola, A

    Li, M., Andersen, D. G., Smola, A. J., and Yu, K. (2014). Communication efficient distributed machine learning with the parameter server. Advances in Neural Information Processing Systems

  50. [50]

    A., and Zettlemoyer, L

    Li, M., Gururangan, S., Dettmers, T., Lewis, M., Althoff, T., Smith, N. A., and Zettlemoyer, L. (2022a). Branch-train-merge: Embarrassingly parallel training of expert language models. In First Workshop on Interpolation Regularizers and Beyond at NeurIPS 2022

  51. [51]

    Li, S., Zhou, T., Tian, X., and Tao, D. (2022b). Learning to collaborate in decentralized learning of personalized models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9766–9775

  52. [52]

    Li, Z., Wang, T., and Arora, S. (2022c). What happens after SGD reaches zero loss? –a mathematical framework. In International Conference on Learning Representations

  53. [53]

    Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. (2017). Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems

  54. [54]

    Lian, X., Zhang, W., Zhang, C., and Liu, J. (2018). Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning. 12

  55. [55]

    P., Stich, S., and Jaggi, M

    Lin, T., Karimireddy, S. P., Stich, S., and Jaggi, M. (2021). Quasi-global momentum: Accelerat- ing decentralized deep learning on heterogeneous data. In Proceedings of the 38th International Conference on Machine Learning

  56. [56]

    and De Sa, C

    Lu, Y . and De Sa, C. (2021). Optimal complexity in decentralized training. InProceedings of the 38th International Conference on Machine Learning

  57. [57]

    Lyu, K. (2024). Implicit Bias of Deep Learning Optimization: A Mathematical Examination. PhD thesis, Princeton University

  58. [58]

    T., Pérez, M

    Martínez Beltrán, E. T., Pérez, M. Q., Sánchez, P. M. S., Bernal, S. L., Bovet, G., Pérez, M. G., Pérez, G. M., and Celdrán, A. H. (2023). Decentralized federated learning: Fundamentals, state of the art, frameworks, trends, and challenges. IEEE Communications Surveys & Tutorials, 25(4):2983–3013

  59. [59]

    Matena, M. S. and Raffel, C. (2022). Merging models with fisher-weighted averaging. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K., editors,Advances in Neural Information Processing Systems

  60. [60]

    Mavrovouniotis, M., Li, C., and Yang, S. (2017). A survey of swarm intelligence for dynamic optimization: Algorithms and applications. Swarm and Evolutionary Computation, 33:1–17

  61. [61]

    E., Cyffers, E., and Bellet, A

    Mrini, A. E., Cyffers, E., and Bellet, A. (2024). Privacy attacks in decentralized learning. In Proceedings of the 41st International Conference on Machine Learning

  62. [62]

    Nadiradze, G., Sabour, A., Davies, P., Li, S., and Alistarh, D. (2021). Asynchronous de- centralized sgd with quantized and local updates. Advances in Neural Information Processing Systems

  63. [63]

    and Kolter, J

    Nagarajan, V . and Kolter, J. Z. (2019). Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems, 32

  64. [64]

    and Olshevsky, A

    Nedi’c, A. and Olshevsky, A. (2014). Distributed optimization over time-varying directed graphs. volume 60, pages 601–615. IEEE

  65. [65]

    and Ozdaglar, A

    Nedic, A. and Ozdaglar, A. (2009). Distributed subgradient methods for multi-agent optimiza- tion. IEEE Transactions on Automatic Control, 54(1):48–61

  66. [66]

    Announcing the stargate project

    OpenAI (2025). Announcing the stargate project. https://openai.com/index/ announcing-the-stargate-project/

  67. [67]

    Ortiz-Jimenez, G., Favero, A., and Frossard, P. (2023). Task arithmetic in the tangent space: Improved editing of pre-trained models. In Thirty-seventh Conference on Neural Information Processing Systems

  68. [68]

    F., Sanders, J., Rahman, R., and Heim, L

    Pilz, K. F., Sanders, J., Rahman, R., and Heim, L. (2025). Trends in ai supercomputers. arXiv preprint arXiv:2504.16026

  69. [69]

    W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I

    Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Procee...

  70. [70]

    Ramasinghe, S., Ajanthan, T., Avraham, G., Zuo, Y ., and Long, A. (2025). Protocol models: Scaling decentralized training with communication-efficient model parallelism. arXiv preprint arXiv:2506.01260

  71. [71]

    Rame, A., Ahuja, K., Zhang, J., Cord, M., Bottou, L., and Lopez-Paz, D. (2023). Model ratatouille: Recycling diverse models for out-of-distribution generalization. In Krause, A., Brun- skill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J., editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings o...

  72. [72]

    Rame, A., Kirchmeyer, M., Rahier, T., Rakotomamonjy, A., patrick gallinari, and Cord, M. (2022). Diverse weight averaging for out-of-distribution generalization. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K., editors, Advances in Neural Information Processing Systems

  73. [73]

    Richards, D. et al. (2020). Graph-dependent implicit regularisation for distributed stochastic subgradient descent. Journal of Machine Learning Research

  74. [74]

    Ryabinin, M., Dettmers, T., Diskin, M., and Borzunov, A. (2023). SWARM parallelism: Training large models can be surprisingly communication-efficient. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 29416–29440. PMLR

  75. [75]

    Sayed, A. H. (2014). Adaptation, Learning, and Optimization over Networks. Now Publishers

  76. [76]

    Horovod: fast and easy distributed deep learning in TensorFlow

    Sergeev, A. and Del Balso, M. (2018). Horovod: fast and easy distributed deep learning in tensorflow. arXiv preprint arXiv:1802.05799

  77. [77]

    Shen, L., Sun, Y ., Yu, Z., Ding, L., Tian, X., and Tao, D. (2024). On efficient training of large-scale deep learning models. ACM Computing Surveys, 57(3)

  78. [78]

    Shen, T., Zhu, D., Zhao, Z., Wu, C., and Wu, F. (2025). Will llms scaling hit the wall? breaking barriers via distributed resources on massive edge devices. arXiv preprint arXiv:2503.08223

  79. [79]

    Singha, A., Lua, C., Guptaa, G., Chopraa, A., Blanca, J., Klinghoffera, T., Tiwarya, K., and Raskara, R. (2024). A perspective on decentralizing ai

  80. [80]

    Sonthalia, A., Rubinstein, A., Abbasnejad, E., and Oh, S. J. (2025). Do deep neural net- work solutions form a star domain? In The Thirteenth International Conference on Learning Representations

Showing first 80 references.