On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning

arXiv: 2507.06542 · v4 · submitted 2025-07-09 · 💻 cs.LG · cs.DC · cs.MA · stat.ML

Tongtian Zhu, Tianyu Zhang, Mingze Wang, Zhanpeng Zhou, Can Wang

Pith reviewed 2026-05-19 05:34 UTC · model grok-4.3

classification 💻 cs.LG · cs.DC · cs.MA · stat.ML

keywords decentralized learning · model merging · stochastic gradient descent · convergence rate · data heterogeneity · communication efficiency · distributed optimization

The pith

Decentralized SGD with one final global merge achieves the convergence rate of parallel SGD.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies when devices should synchronize in decentralized training and finds that saving communication for a single global merge at the very end markedly improves final test accuracy, especially under high data heterogeneity. It proves that this late merge produces a model whose convergence rate equals that of fully parallel SGD. The analysis does so by treating part of the variation across local models as a helpful signal instead of pure noise. A reader would care because the result suggests decentralized systems can reach strong performance with far less total communication than previously thought necessary. If the claim holds, it reframes limited peer-to-peer links from a bottleneck into a manageable constraint.

Core claim

Performing a single global merge of all local models at the final iteration of decentralized SGD yields an output model that attains the same convergence rate as parallel SGD. The proof obtains this rate by reinterpreting a portion of the discrepancies among the local models, previously regarded as detrimental noise, as constructive components that contribute to the overall convergence bound.

What carries the argument

The single global merging step that aggregates every local model only at the final training iteration.
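The mechanism can be illustrated with a minimal sketch, assuming a toy problem that is not from the paper: each agent holds a quadratic objective with its own target (a stand-in for heterogeneous data), takes noisy local gradient steps, averages with one random peer per round, and the models are averaged globally only once at the end. The function name `decentralized_sgd_final_merge` and all hyperparameters are hypothetical.

```python
import numpy as np


def decentralized_sgd_final_merge(targets, rounds=200, lr=0.1, noise=0.1, seed=0):
    """Toy decentralized SGD where agent i minimizes 0.5 * ||w - targets[i]||^2.

    Hypothetical setup for illustration only; not the paper's experiments.
    """
    rng = np.random.default_rng(seed)
    n, d = targets.shape
    assert n % 2 == 0, "this sketch pairs agents, so it needs an even count"
    w = np.zeros((n, d))  # row i is agent i's local model

    for _ in range(rounds):
        # Local step: noisy gradient of each agent's own (heterogeneous) objective.
        grads = (w - targets) + noise * rng.standard_normal((n, d))
        w = w - lr * grads
        # Sparse communication: each agent averages with one random peer.
        for a, b in rng.permutation(n).reshape(-1, 2):
            w[a] = w[b] = 0.5 * (w[a] + w[b])

    # Single global merge: uniform average of all local models at the end.
    merged = w.mean(axis=0)
    return w, merged
```

In this toy setting the minimizer of the average objective is `targets.mean(axis=0)`, so the merged model can be compared against it directly, alongside the individual rows of `w`.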

If this is right

Decentralized training can reach comparable generalization to parallel training even when data partitions are highly non-uniform.
Communication budgets can be shifted almost entirely to the end of training without sacrificing the theoretical rate.
Standard decentralized SGD becomes practical under stricter limits on total peer-to-peer exchanges.
Model merging at the close of training can be viewed as a lightweight way to recover parallel-like guarantees.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same late-merge tactic might be tested on non-SGD optimizers to check whether the rate-matching benefit generalizes.
Dynamic schedules that trigger the global merge once local drift exceeds a threshold could be compared against the fixed final-step rule.
The constructive-discrepancy view may connect decentralized optimization to ensemble methods that deliberately preserve local diversity until the end.
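The drift-triggered schedule suggested above could be prototyped roughly as follows. This is a sketch of the editorial idea only, not something the paper proposes: "local drift" is assumed here to mean the mean squared distance of local models from their average, and `consensus_distance`, `maybe_global_merge`, and `threshold` are hypothetical names.

```python
import numpy as np


def consensus_distance(local_models):
    """Mean squared distance of each local model from the average model."""
    w = np.asarray(local_models, dtype=float)
    return float(np.mean(np.sum((w - w.mean(axis=0)) ** 2, axis=1)))


def maybe_global_merge(local_models, threshold):
    """Replace every local model by the average if drift exceeds threshold.

    Returns the (possibly merged) models and a flag saying whether a merge fired.
    """
    w = np.asarray(local_models, dtype=float)
    if consensus_distance(w) > threshold:
        return np.tile(w.mean(axis=0), (w.shape[0], 1)), True
    return w, False
```

Comparing such a rule against the fixed final-step merge would require counting how many global merges it triggers, since each one costs a full all-to-all exchange.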

Load-bearing premise

The proof requires reinterpreting discrepancies among local models as constructive components rather than detrimental noise.

What would settle it

A calculation or experiment in which the globally merged model converges more slowly than parallel SGD on high-heterogeneity data would falsify the central claim.

Figures

Figures reproduced from arXiv: 2507.06542 by Can Wang, Mingze Wang, Tianyu Zhang, Tongtian Zhu, Zhanpeng Zhou.

**Figure 1.** (a, b): Comparisons of global test accuracy (see Definition 2) in training of CLIP ViT-B/32 (a) and ResNet-18 (b) on the Tiny ImageNet dataset, distributed across 16 agents with high heterogeneity (Dirichlet α = 0.1; see details in Appendix C.1). Decentralized training here involves each agent syncing with a random peer per round. A single global model merging is performed at the final round. (c): Loss lan…

**Figure 2.** A comparative illustration of server-based, decentralized, and local training.

**Figure 3.** (a, b): Comparisons of global test accuracy (see Definition 2) in decentralized training of ResNet-18 on the CIFAR-100 dataset, distributed across 16 agents with high heterogeneity (Dirichlet α = 0.1; see details in Appendix C.1). Fully-connected communication (synchronous AllReduce) is activated only in specific windows, while low communication with one random peer with a probability of 0.2 is used elsewh…
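The captions describe heterogeneity through a Dirichlet parameter α = 0.1 across 16 agents. The paper's exact procedure is in its Appendix C.1, which is not reproduced here, so the following is only a sketch of one common label-based Dirichlet partitioning scheme. The function name `dirichlet_partition` and its arguments are assumptions made for illustration.

```python
import numpy as np


def dirichlet_partition(labels, num_agents=16, alpha=0.1, seed=0):
    """Split sample indices across agents with Dirichlet-distributed class shares.

    For each class, a proportion vector is drawn from Dirichlet(alpha, ..., alpha)
    and that class's samples are divided among agents accordingly.
    Small alpha concentrates each class on a few agents (high heterogeneity).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    agent_indices = [[] for _ in range(num_agents)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        proportions = rng.dirichlet(alpha * np.ones(num_agents))
        # Convert cumulative proportions into split points within this class.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for agent, part in enumerate(np.split(idx, cuts)):
            agent_indices[agent].extend(part.tolist())
    return [np.array(ix, dtype=int) for ix in agent_indices]
```

With α as small as 0.1, some agents may receive few or no samples of a given class, which is the regime the captions call high heterogeneity.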

Original abstract

Decentralized learning provides a scalable alternative to parameter-server-based training, yet its performance is often hindered by limited peer-to-peer communication. In this paper, we study how communication should be scheduled over time, including determining when and how frequently devices synchronize. Counterintuitive empirical results show that concentrating communication budgets in the later stages of decentralized training remarkably improves global test performance. Surprisingly, we uncover that fully connected communication at the final step, implemented by a single global merging, can significantly improve the performance of decentralized learning under high data heterogeneity. Our theoretical contributions, which explain these phenomena, are the first to establish that the globally merged model of decentralized SGD can match the convergence rate of parallel SGD. Technically, we reinterpret part of the discrepancy among local models, which were previously considered as detrimental noise, as constructive components essential for matching this rate. This work provides evidence that decentralized learning is able to generalize under high data heterogeneity and limited communication, while offering broad new avenues for model merging research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies communication scheduling in decentralized SGD, presenting empirical results that a single global merge at the final training step substantially improves test performance under high data heterogeneity. Theoretically, it claims to be the first to prove that the globally merged model achieves the same O(1/sqrt(T)) convergence rate as parallel SGD, by reinterpreting local-model discrepancy vectors (previously viewed as noise) as constructive components whose inner products aid the bound.

Significance. If the rate-matching result holds, the work shows decentralized learning can match centralized rates with minimal (late-stage) communication, offering concrete evidence that heterogeneity need not preclude good generalization when merging is timed appropriately. This supplies a new lens for model-merging research and credits the constructive-component reinterpretation as the technical step that closes the analysis.

major comments (2)

[§4] §4 (Convergence Analysis), around the expansion of ||(1/n)∑w_i − w*||² after the single late merge: the proof reinterprets the cross terms 2⟨avg(w_i − w_avg), ∇f⟩ as non-detrimental or canceling, yet supplies no explicit bound or sign control on these terms that would guarantee they remain controlled when the merge occurs after hundreds of local steps in the high-heterogeneity regime used in the experiments. This step is load-bearing for equating the merged rate to that of synchronous parallel SGD.
[Theorem 1] Theorem 1 (rate-matching statement): the derivation assumes local models remain in a regime where the reinterpreted discrepancy terms do not introduce extra bias beyond the constants already used for parallel SGD; no separate lemma verifies this regime holds for the single-merge schedule and heterogeneity levels reported in §5.
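
For orientation, a standard averaging identity (not taken from the paper) relates the merged model to the local models. It is written here with \bar{w} for the average that the comments above call w_avg, and with w^\ast for the reference point w*:

```latex
\frac{1}{n}\sum_{i=1}^{n}\lVert w_i - w^\ast\rVert^2
  = \lVert \bar{w} - w^\ast\rVert^2
  + \frac{1}{n}\sum_{i=1}^{n}\lVert w_i - \bar{w}\rVert^2,
\qquad \bar{w} = \frac{1}{n}\sum_{i=1}^{n} w_i .
```

The identity follows by expanding each term around \bar{w} and noting that the deviations w_i − \bar{w} sum to zero. Since the last term is nonnegative, the merged model is never farther from w* in squared distance than the local models are on average. This alone does not yield a convergence rate, and it does not address the gradient cross terms that the referee asks the paper to bound.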

minor comments (2)

[Abstract] Abstract: quantitative details (e.g., number of local steps before merge, dataset sizes, or observed accuracy deltas) are omitted, making the “surprising effectiveness” claim harder to evaluate at a glance.
[Experiments] Experimental section: tables and figures lack error bars or mention of the number of random seeds; adding these would strengthen the empirical support without altering the central claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which help clarify the requirements for rigorously establishing the rate-matching result. We address each major comment below and indicate the corresponding revisions.

Point-by-point responses

Referee: [§4] §4 (Convergence Analysis), around the expansion of ||(1/n)∑w_i − w*||² after the single late merge: the proof reinterprets the cross terms 2⟨avg(w_i − w_avg), ∇f⟩ as non-detrimental or canceling, yet supplies no explicit bound or sign control on these terms that would guarantee they remain controlled when the merge occurs after hundreds of local steps in the high-heterogeneity regime used in the experiments. This step is load-bearing for equating the merged rate to that of synchronous parallel SGD.

Authors: We thank the referee for highlighting this point. The original analysis absorbs the cross terms into the existing constants via the reinterpretation of discrepancies as constructive, but we agree an explicit bound improves clarity. In the revised manuscript we expand the derivation in §4 to bound |2⟨avg(w_i − w_avg), ∇f⟩| using L-smoothness and the fact that the average discrepancy norm grows at most linearly with the number of local steps before the final merge; the resulting additive term remains O(1/sqrt(T)) and does not alter the leading rate, matching the parallel-SGD analysis under the same assumptions. revision: yes
Referee: [Theorem 1] Theorem 1 (rate-matching statement): the derivation assumes local models remain in a regime where the reinterpreted discrepancy terms do not introduce extra bias beyond the constants already used for parallel SGD; no separate lemma verifies this regime holds for the single-merge schedule and heterogeneity levels reported in §5.

Authors: We agree that an explicit verification of the regime is desirable. We have added a supporting lemma (now Lemma 3) in the appendix that bounds the discrepancy growth over the local phases preceding the single global merge. The lemma shows that, for the heterogeneity parameter and local-step counts used in the §5 experiments, the extra bias introduced by the reinterpreted terms stays within the constants already present in the parallel-SGD bound, thereby justifying the assumptions of Theorem 1 for the single-merge schedule. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation extends standard SGD analysis independently

The paper claims a single late global merge in decentralized SGD matches the O(1/sqrt(T)) rate of parallel SGD by reinterpreting local discrepancies as constructive components rather than noise. No quoted equations or steps in the provided abstract reduce the final bound to a fitted parameter, self-citation chain, or input by construction. The reinterpretation is presented as an original technical step in the convergence proof, without evidence that cross-term cancellations are forced by prior self-referenced results or ansatzes. The analysis therefore remains self-contained against external SGD benchmarks and does not trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard optimization assumptions (bounded gradients, appropriate learning-rate schedules) plus the novel reinterpretation of local-model discrepancies as constructive; no free parameters or new invented entities are introduced in the abstract.

axioms (1)

[standard math] Standard SGD convergence assumptions, such as bounded gradients and suitable learning-rate schedules, hold for both decentralized and parallel settings.
Invoked to establish that the merged model matches parallel-SGD rate.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "we novelly reinterpret part of the discrepancy among local models, which were previously considered as detrimental noise, as constructive components that accelerate convergence"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear
Paper passage: A(t) ≜ ηL (2T₂ + ... ) with T₂ = (∇²L(θ̄(t)) Γ(t))⊤ ∇Tr(∇²L(θ̄(t)) Γ(t))

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

114 extracted references · 114 canonical work pages · 3 internal anchors

[1]

Ainsworth, S., Hayase, J., and Srinivasa, S. (2023). Git re-basin: Merging models modulo permutation symmetries. In The Eleventh International Conference on Learning Representations

work page 2023
[2]

E., Jaggi, M., and Guerraoui, R

Allouah, Y ., Koloskova, A., Firdoussi, A. E., Jaggi, M., and Guerraoui, R. (2024). The privacy power of correlated noise in decentralized learning. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pages 1115–1143

work page 2024
[3]

Bonabeau, E., Dorigo, M., and Theraulaz, G. (1999). Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press

work page 1999
[4]

Z., Bedi, A., and Huang, F

Bornstein, M., Rabbani, T., Wang, E. Z., Bedi, A., and Huang, F. (2023). SWIFT: Rapid decentralized federated learning via wait-free model communication. InThe Eleventh International Conference on Learning Representations

work page 2023
[5]

Borzunov, A., Baranchuk, D., Dettmers, T., Riabinin, M., Belkada, Y ., Chumachenko, A., Samygin, P., and Raffel, C. (2023a). Petals: Collaborative inference and fine-tuning of large models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 558–568. Association for Computatio...

work page
[6]

Borzunov, A., Ryabinin, M., Chumachenko, A., Baranchuk, D., Dettmers, T., Belkada, Y ., Samygin, P., and Raffel, C. A. (2023b). Distributed inference and fine-tuning of large language models over the internet. In Advances in Neural Information Processing Systems

work page
[7]

Cao, Y ., Wu, Z., Yuan, K., and Sayed, A. H. (2024). On the trade-off between flatness and optimization in distributed learning. arXiv preprint arXiv:2406.20006

work page arXiv 2024
[8]

Cambridge bitcoin electricity consumption index (CBECI)

CCAF (2023). Cambridge bitcoin electricity consumption index (CBECI). https://ccaf.io/ cbnsi/cbeci

work page 2023
[9]

Chen, L., Ye, H., and Luo, L. (2024). An efficient stochastic algorithm for decentralized nonconvex-strongly-concave minimax optimization. International Conference on Artificial Intelli- gence and Statistics

work page 2024
[10]

Chen, X., Huang, M., Ma, S., and Balasubramanian, K. (2023). Decentralized stochastic bilevel optimization with improved per-iteration complexity. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 4641–4671. PMLR

work page 2023
[11]

Chen, Y ., Yuan, K., Zhang, Y ., Pan, P., Xu, Y ., and Yin, W. (2021). Accelerating gossip sgd with periodic global averaging. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 1791–1802. PMLR

work page 2021
[12]

M., Damian, A., Talwalkar, A., Kolter, Z., and Lee, J

Cohen, J. M., Damian, A., Talwalkar, A., Kolter, Z., and Lee, J. D. (2025). Understanding optimization in deep learning with central flows. In The Thirteenth International Conference on Learning Representations

work page 2025
[13]

Cyffers, E., Bellet, A., and Upadhyay, J. (2024). Differentially private decentralized learning with random walks. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pages 9762–9783

work page 2024
[14]

Damian, A., Nichani, E., and Lee, J. D. (2023). Self-stabilization: The implicit bias of gradient descent at the edge of stability. In the Eleventh International Conference on Learning Representations

work page 2023
[15]

Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. (2012). Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(6):165–202

work page 2012
[16]

A., Chhaparia, R., Donchev, Y ., Kuncoro, A., Ranzato, M., Szlam, A., and Shen, J

Douillard, A., Feng, Q., Rusu, A. A., Chhaparia, R., Donchev, Y ., Kuncoro, A., Ranzato, M., Szlam, A., and Shen, J. (2023). Diloco: Distributed low-communication training of language models. arXiv preprint arXiv:2311.08105

work page arXiv 2023
[17]

Draxler, F., Veschgini, K., Salmhofer, M., and Hamprecht, F. (2018). Essentially no barriers in neural network energy landscape. In International conference on machine learning , pages 1309–1318. PMLR. 10

work page 2018
[18]

Entezari, R., Sedghi, H., Saukh, O., and Neyshabur, B. (2022). The role of permutation invariance in linear mode connectivity of neural networks. In International Conference on Learning Representations

work page 2022
[19]

Even, M., Koloskova, A., and Massoulie, L. (2024). Asynchronous SGD on graphs: a unified framework for asynchronous decentralized and federated optimization. In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics

work page 2024
[20]

K., Paul, M., Kharaghani, S., Roy, D

Fort, S., Dziugaite, G. K., Paul, M., Kharaghani, S., Roy, D. M., and Ganguli, S. (2020). Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. Advances in Neural Information Processing Systems , 33:5850–5861

work page 2020
[21]

K., Roy, D., and Carbin, M

Frankle, J., Dziugaite, G. K., Roy, D., and Carbin, M. (2020). Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269. PMLR

work page 2020
[22]

Freeman, C. D. and Bruna, J. (2017). Topology and geometry of half-rectified network opti- mization. In International Conference on Learning Representations

work page 2017
[23]

Gao, H., Gu, B., and Thai, M. T. (2023). On the convergence of distributed stochastic bilevel optimization algorithms over a network. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206, pages 9238–9281. PMLR

work page 2023
[24]

and Huang, H

Gao, H. and Huang, H. (2021). Fast training method for stochastic compositional optimization problems. Advances in Neural Information Processing Systems, 34:25334–25345

work page 2021
[25]

and Efrati, A

Gardizy, A. and Efrati, A. (2024). Microsoft and OpenAI plot $100 billion stargate AI super- computer. The Information

work page 2024
[26]

P., and Wilson, A

Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., and Wilson, A. G. (2018). Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in neural information processing systems, 31

work page 2018
[27]

Ai infrastructure market size, share & growth report, 2030

Grand View Research (2024). Ai infrastructure market size, share & growth report, 2030

work page 2024
[28]

Gu, X., Lyu, K., Arora, S., Zhang, J., and Huang, L. (2024). A quadratic synchronization rule for distributed deep learning. In The Twelfth International Conference on Learning Representations

work page 2024
[29]

Gu, X., Lyu, K., Huang, L., and Arora, S. (2023a). Why (and when) does local SGD generalize better than SGD? In International Conference on Learning Representations

work page
[30]

Gu, X., Lyu, K., Huang, L., and Arora, S. (2023b). Why (and when) does local SGD generalize better than SGD? In The Eleventh International Conference on Learning Representations

work page
[31]

Gurbuzbalaban, M., Hu, Y ., Simsekli, U., Yuan, K., and Zhu, L. (2022). Heavy-tail phenomenon in decentralized sgd. arXiv preprint arXiv:2205.06689

work page arXiv 2022
[32]

He, F., Nan, L., and Zhu, T. (2025). Imagining a democratic, affordable future of foundation models: A decentralised avenue. In Handbook of Blockchain Analytics. Springer

work page 2025
[33]

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Identity mappings in deep residual networks. In European conference on computer vision

work page 2016
[34]

P., and Jaggi, M

He, L., Karimireddy, S. P., and Jaggi, M. (2022). Byzantine-robust decentralized learning via clippedgossip. arXiv preprint arXiv:2202.01545

work page arXiv 2022
[35]

Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification

Hsu, T.-M. H., Qi, H., and Brown, M. (2019). Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335

work page internal anchor Pith review Pith/arXiv arXiv 2019
[36]

T., Wortsman, M., Schmidt, L., Hajishirzi, H., and Farhadi, A

Ilharco, G., Ribeiro, M. T., Wortsman, M., Schmidt, L., Hajishirzi, H., and Farhadi, A. (2023). Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations. 11

work page 2023
[37]

Y ., Song, S., Hajishirzi, H., Kornblith, S., Farhadi, A., and Schmidt, L

Ilharco, G., Wortsman, M., Gadre, S. Y ., Song, S., Hajishirzi, H., Kornblith, S., Farhadi, A., and Schmidt, L. (2022). Patching open-vocabulary models by interpolating weights. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K., editors, Advances in Neural Information Processing Systems

work page 2022
[38]

P., and Wilson, A

Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D. P., and Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. In Globerson, A. and Silva, R., editors, Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI 2018, Monterey, California, USA, August 6-10, 2018, pages 876–885. AUAI Press

work page 2018
[39]

M., Basra, M., Obeid, F., Straube, J., Keiblinger, M., Bakouch, E., Atkins, L., Panahi, M., Goddard, C., et al

Jaghouar, S., Ong, J. M., Basra, M., Obeid, F., Straube, J., Keiblinger, M., Bakouch, E., Atkins, L., Panahi, M., Goddard, C., et al. (2024). Intellect-1 technical report. arXiv preprint arXiv:2412.01152

work page arXiv 2024
[40]

Kharrat, S., Canini, M., and Horvath, S. (2024). Decentralized personalized federated learning. arXiv preprint arXiv:2406.06520

work page arXiv 2024
[41]

Kolehmainen, J., Blagoev, N., Donaghy, J., Ersoy, O., and Nies, C. (2025). Noloco: No-all- reduce low communication training method for large models. arXiv preprint arXiv:2506.10911

work page arXiv 2025
[42]

Koloskova, A., Loizou, N., Boreiri, S., Jaggi, M., and Stich, S. (2020). A unified theory of decentralized SGD with changing topology and local updates. In International Conference on Machine Learning

work page 2020
[43]

Kong, L., Lin, T., Koloskova, A., Jaggi, M., and Stich, S. (2021a). Consensus control for decentralized deep learning. In International Conference on Machine Learning. PMLR

work page
[44]

Kong, L., Lin, T., Koloskova, A., Jaggi, M., and Stich, S. (2021b). Consensus control for decentralized deep learning. In Proceedings of the 38th International Conference on Machine Learning

work page
[45]

Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images (tech. rep.). University of Toronto

work page 2009
[46]

and Yang, X

Le, Y . and Yang, X. (2015). Tiny imagenet visual recognition challenge. CS 231N

work page 2015
[47]

Le Bars, B., Bellet, A., Tommasi, M., Lavoie, E., and Kermarrec, A.-M. (2023). Refined convergence and topology learning for decentralized sgd with heterogeneous data. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics

work page 2023
[48]

Le Bars, B., Bellet, A., Tommasi, M., Scaman, K., and Neglia, G. (2024). Improved stability and generalization guarantees of the decentralized SGD algorithm. In Proceedings of the 41st International Conference on Machine Learning

work page 2024
[49]

G., Smola, A

Li, M., Andersen, D. G., Smola, A. J., and Yu, K. (2014). Communication efficient distributed machine learning with the parameter server. Advances in Neural Information Processing Systems

work page 2014
[50]

A., and Zettlemoyer, L

Li, M., Gururangan, S., Dettmers, T., Lewis, M., Althoff, T., Smith, N. A., and Zettlemoyer, L. (2022a). Branch-train-merge: Embarrassingly parallel training of expert language models. In First Workshop on Interpolation Regularizers and Beyond at NeurIPS 2022

work page 2022
[51]

Li, S., Zhou, T., Tian, X., and Tao, D. (2022b). Learning to collaborate in decentralized learning of personalized models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9766–9775

work page
[52]

Li, Z., Wang, T., and Arora, S. (2022c). What happens after SGD reaches zero loss? –a mathematical framework. In International Conference on Learning Representations

work page
[53]

Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. (2017). Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems

work page 2017
[54]

Lian, X., Zhang, W., Zhang, C., and Liu, J. (2018). Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning. 12

work page 2018
[55]

P., Stich, S., and Jaggi, M

Lin, T., Karimireddy, S. P., Stich, S., and Jaggi, M. (2021). Quasi-global momentum: Accelerat- ing decentralized deep learning on heterogeneous data. In Proceedings of the 38th International Conference on Machine Learning

work page 2021
[56]

and De Sa, C

Lu, Y . and De Sa, C. (2021). Optimal complexity in decentralized training. InProceedings of the 38th International Conference on Machine Learning

work page 2021
[57]

Lyu, K. (2024). Implicit Bias of Deep Learning Optimization: A Mathematical Examination. PhD thesis, Princeton University

work page 2024
[58]

T., Pérez, M

Martínez Beltrán, E. T., Pérez, M. Q., Sánchez, P. M. S., Bernal, S. L., Bovet, G., Pérez, M. G., Pérez, G. M., and Celdrán, A. H. (2023). Decentralized federated learning: Fundamentals, state of the art, frameworks, trends, and challenges. IEEE Communications Surveys & Tutorials, 25(4):2983–3013

work page 2023
[59]

Matena, M. S. and Raffel, C. (2022). Merging models with fisher-weighted averaging. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K., editors,Advances in Neural Information Processing Systems

work page 2022
[60]

Mavrovouniotis, M., Li, C., and Yang, S. (2017). A survey of swarm intelligence for dynamic optimization: Algorithms and applications. Swarm and Evolutionary Computation, 33:1–17

work page 2017
[61]

E., Cyffers, E., and Bellet, A

Mrini, A. E., Cyffers, E., and Bellet, A. (2024). Privacy attacks in decentralized learning. In Proceedings of the 41st International Conference on Machine Learning

work page 2024
[62]

Nadiradze, G., Sabour, A., Davies, P., Li, S., and Alistarh, D. (2021). Asynchronous de- centralized sgd with quantized and local updates. Advances in Neural Information Processing Systems

work page 2021
[63]

and Kolter, J

Nagarajan, V . and Kolter, J. Z. (2019). Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems, 32

work page 2019
[64]

and Olshevsky, A

Nedi’c, A. and Olshevsky, A. (2014). Distributed optimization over time-varying directed graphs. volume 60, pages 601–615. IEEE

work page 2014
[65]

and Ozdaglar, A

Nedic, A. and Ozdaglar, A. (2009). Distributed subgradient methods for multi-agent optimiza- tion. IEEE Transactions on Automatic Control, 54(1):48–61

work page 2009
[66]

Announcing the stargate project

OpenAI (2025). Announcing the stargate project. https://openai.com/index/ announcing-the-stargate-project/

work page 2025
[67]

Ortiz-Jimenez, G., Favero, A., and Frossard, P. (2023). Task arithmetic in the tangent space: Improved editing of pre-trained models. In Thirty-seventh Conference on Neural Information Processing Systems

work page 2023
[68]

F., Sanders, J., Rahman, R., and Heim, L

Pilz, K. F., Sanders, J., Rahman, R., and Heim, L. (2025). Trends in ai supercomputers. arXiv preprint arXiv:2504.16026

work page arXiv 2025
[69]

W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Procee...

work page 2021
[70]

Ramasinghe, S., Ajanthan, T., Avraham, G., Zuo, Y ., and Long, A. (2025). Protocol models: Scaling decentralized training with communication-efficient model parallelism. arXiv preprint arXiv:2506.01260

work page arXiv 2025
[71]

Rame, A., Ahuja, K., Zhang, J., Cord, M., Bottou, L., and Lopez-Paz, D. (2023). Model ratatouille: Recycling diverse models for out-of-distribution generalization. In Krause, A., Brun- skill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J., editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings o...

work page 2023
[72]

Rame, A., Kirchmeyer, M., Rahier, T., Rakotomamonjy, A., patrick gallinari, and Cord, M. (2022). Diverse weight averaging for out-of-distribution generalization. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K., editors, Advances in Neural Information Processing Systems

work page 2022
[73]

Richards, D. et al. (2020). Graph-dependent implicit regularisation for distributed stochastic subgradient descent. Journal of Machine Learning Research

work page 2020
[74]

Ryabinin, M., Dettmers, T., Diskin, M., and Borzunov, A. (2023). SWARM parallelism: Training large models can be surprisingly communication-efficient. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 29416–29440. PMLR

work page 2023
[75]

Sayed, A. H. (2014). Adaptation, Learning, and Optimization over Networks. Now Publishers

work page 2014
[76]

Horovod: fast and easy distributed deep learning in TensorFlow

Sergeev, A. and Del Balso, M. (2018). Horovod: fast and easy distributed deep learning in tensorflow. arXiv preprint arXiv:1802.05799

work page internal anchor Pith review Pith/arXiv arXiv 2018
[77]

Shen, L., Sun, Y., Yu, Z., Ding, L., Tian, X., and Tao, D. (2024). On efficient training of large-scale deep learning models. ACM Computing Surveys, 57(3)

[78]

Shen, T., Zhu, D., Zhao, Z., Wu, C., and Wu, F. (2025). Will LLMs scaling hit the wall? Breaking barriers via distributed resources on massive edge devices. arXiv preprint arXiv:2503.08223

[79]

Singh, A., Lu, C., Gupta, G., Chopra, A., Blanc, J., Klinghoffer, T., Tiwary, K., and Raskar, R. (2024). A perspective on decentralizing AI

[80]

Sonthalia, A., Rubinstein, A., Abbasnejad, E., and Oh, S. J. (2025). Do deep neural network solutions form a star domain? In The Thirteenth International Conference on Learning Representations


Showing first 80 references.

[1] [1]

Ainsworth, S., Hayase, J., and Srinivasa, S. (2023). Git re-basin: Merging models modulo permutation symmetries. In The Eleventh International Conference on Learning Representations

work page 2023

[2] [2]

E., Jaggi, M., and Guerraoui, R

Allouah, Y ., Koloskova, A., Firdoussi, A. E., Jaggi, M., and Guerraoui, R. (2024). The privacy power of correlated noise in decentralized learning. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pages 1115–1143

work page 2024

[3] [3]

Bonabeau, E., Dorigo, M., and Theraulaz, G. (1999). Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press

work page 1999

[4] [4]

Z., Bedi, A., and Huang, F

Bornstein, M., Rabbani, T., Wang, E. Z., Bedi, A., and Huang, F. (2023). SWIFT: Rapid decentralized federated learning via wait-free model communication. InThe Eleventh International Conference on Learning Representations

work page 2023

[5] [5]

Borzunov, A., Baranchuk, D., Dettmers, T., Riabinin, M., Belkada, Y ., Chumachenko, A., Samygin, P., and Raffel, C. (2023a). Petals: Collaborative inference and fine-tuning of large models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 558–568. Association for Computatio...

work page

[6] [6]

Borzunov, A., Ryabinin, M., Chumachenko, A., Baranchuk, D., Dettmers, T., Belkada, Y ., Samygin, P., and Raffel, C. A. (2023b). Distributed inference and fine-tuning of large language models over the internet. In Advances in Neural Information Processing Systems

work page

[7] [7]

Cao, Y ., Wu, Z., Yuan, K., and Sayed, A. H. (2024). On the trade-off between flatness and optimization in distributed learning. arXiv preprint arXiv:2406.20006

work page arXiv 2024

[8] [8]

Cambridge bitcoin electricity consumption index (CBECI)

CCAF (2023). Cambridge bitcoin electricity consumption index (CBECI). https://ccaf.io/ cbnsi/cbeci

work page 2023

[9] [9]

Chen, L., Ye, H., and Luo, L. (2024). An efficient stochastic algorithm for decentralized nonconvex-strongly-concave minimax optimization. International Conference on Artificial Intelli- gence and Statistics

work page 2024

[10] [10]

Chen, X., Huang, M., Ma, S., and Balasubramanian, K. (2023). Decentralized stochastic bilevel optimization with improved per-iteration complexity. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 4641–4671. PMLR

work page 2023

[11] [11]

Chen, Y ., Yuan, K., Zhang, Y ., Pan, P., Xu, Y ., and Yin, W. (2021). Accelerating gossip sgd with periodic global averaging. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 1791–1802. PMLR

work page 2021

[12] [12]

M., Damian, A., Talwalkar, A., Kolter, Z., and Lee, J

Cohen, J. M., Damian, A., Talwalkar, A., Kolter, Z., and Lee, J. D. (2025). Understanding optimization in deep learning with central flows. In The Thirteenth International Conference on Learning Representations

work page 2025

[13] [13]

Cyffers, E., Bellet, A., and Upadhyay, J. (2024). Differentially private decentralized learning with random walks. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pages 9762–9783

work page 2024

[14] [14]

Damian, A., Nichani, E., and Lee, J. D. (2023). Self-stabilization: The implicit bias of gradient descent at the edge of stability. In the Eleventh International Conference on Learning Representations

work page 2023

[15] [15]

Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. (2012). Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(6):165–202

work page 2012

[16] [16]

A., Chhaparia, R., Donchev, Y ., Kuncoro, A., Ranzato, M., Szlam, A., and Shen, J

Douillard, A., Feng, Q., Rusu, A. A., Chhaparia, R., Donchev, Y ., Kuncoro, A., Ranzato, M., Szlam, A., and Shen, J. (2023). Diloco: Distributed low-communication training of language models. arXiv preprint arXiv:2311.08105

work page arXiv 2023

[17] [17]

Draxler, F., Veschgini, K., Salmhofer, M., and Hamprecht, F. (2018). Essentially no barriers in neural network energy landscape. In International conference on machine learning , pages 1309–1318. PMLR. 10

work page 2018

[18] [18]

Entezari, R., Sedghi, H., Saukh, O., and Neyshabur, B. (2022). The role of permutation invariance in linear mode connectivity of neural networks. In International Conference on Learning Representations

work page 2022

[19] [19]

Even, M., Koloskova, A., and Massoulie, L. (2024). Asynchronous SGD on graphs: a unified framework for asynchronous decentralized and federated optimization. In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics

work page 2024

[20] [20]

K., Paul, M., Kharaghani, S., Roy, D

Fort, S., Dziugaite, G. K., Paul, M., Kharaghani, S., Roy, D. M., and Ganguli, S. (2020). Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. Advances in Neural Information Processing Systems , 33:5850–5861

work page 2020

[21] [21]

K., Roy, D., and Carbin, M

Frankle, J., Dziugaite, G. K., Roy, D., and Carbin, M. (2020). Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269. PMLR

work page 2020

[22] [22]

Freeman, C. D. and Bruna, J. (2017). Topology and geometry of half-rectified network opti- mization. In International Conference on Learning Representations

work page 2017

[23] [23]

Gao, H., Gu, B., and Thai, M. T. (2023). On the convergence of distributed stochastic bilevel optimization algorithms over a network. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206, pages 9238–9281. PMLR

work page 2023

[24] [24]

and Huang, H

Gao, H. and Huang, H. (2021). Fast training method for stochastic compositional optimization problems. Advances in Neural Information Processing Systems, 34:25334–25345

work page 2021

[25] [25]

and Efrati, A

Gardizy, A. and Efrati, A. (2024). Microsoft and OpenAI plot $100 billion stargate AI super- computer. The Information

work page 2024

[26] [26]

P., and Wilson, A

Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., and Wilson, A. G. (2018). Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in neural information processing systems, 31

work page 2018

[27] [27]

Ai infrastructure market size, share & growth report, 2030

Grand View Research (2024). Ai infrastructure market size, share & growth report, 2030

work page 2024

[28] [28]

Gu, X., Lyu, K., Arora, S., Zhang, J., and Huang, L. (2024). A quadratic synchronization rule for distributed deep learning. In The Twelfth International Conference on Learning Representations

work page 2024

[29] [29]

Gu, X., Lyu, K., Huang, L., and Arora, S. (2023a). Why (and when) does local SGD generalize better than SGD? In International Conference on Learning Representations

work page

[30] [30]

Gu, X., Lyu, K., Huang, L., and Arora, S. (2023b). Why (and when) does local SGD generalize better than SGD? In The Eleventh International Conference on Learning Representations

work page

[31] [31]

Gurbuzbalaban, M., Hu, Y ., Simsekli, U., Yuan, K., and Zhu, L. (2022). Heavy-tail phenomenon in decentralized sgd. arXiv preprint arXiv:2205.06689

work page arXiv 2022

[32] [32]

He, F., Nan, L., and Zhu, T. (2025). Imagining a democratic, affordable future of foundation models: A decentralised avenue. In Handbook of Blockchain Analytics. Springer

work page 2025

[33] [33]

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Identity mappings in deep residual networks. In European conference on computer vision

work page 2016

[34] [34]

P., and Jaggi, M

He, L., Karimireddy, S. P., and Jaggi, M. (2022). Byzantine-robust decentralized learning via clippedgossip. arXiv preprint arXiv:2202.01545

work page arXiv 2022

[35] [35]

Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification

Hsu, T.-M. H., Qi, H., and Brown, M. (2019). Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335

work page internal anchor Pith review Pith/arXiv arXiv 2019

[36] [36]

T., Wortsman, M., Schmidt, L., Hajishirzi, H., and Farhadi, A

Ilharco, G., Ribeiro, M. T., Wortsman, M., Schmidt, L., Hajishirzi, H., and Farhadi, A. (2023). Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations. 11

work page 2023

[37] [37]

Y ., Song, S., Hajishirzi, H., Kornblith, S., Farhadi, A., and Schmidt, L

Ilharco, G., Wortsman, M., Gadre, S. Y ., Song, S., Hajishirzi, H., Kornblith, S., Farhadi, A., and Schmidt, L. (2022). Patching open-vocabulary models by interpolating weights. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K., editors, Advances in Neural Information Processing Systems

work page 2022

[38] [38]

P., and Wilson, A

Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D. P., and Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. In Globerson, A. and Silva, R., editors, Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI 2018, Monterey, California, USA, August 6-10, 2018, pages 876–885. AUAI Press

work page 2018

[39] [39]

M., Basra, M., Obeid, F., Straube, J., Keiblinger, M., Bakouch, E., Atkins, L., Panahi, M., Goddard, C., et al

Jaghouar, S., Ong, J. M., Basra, M., Obeid, F., Straube, J., Keiblinger, M., Bakouch, E., Atkins, L., Panahi, M., Goddard, C., et al. (2024). Intellect-1 technical report. arXiv preprint arXiv:2412.01152

work page arXiv 2024

[40] [40]

Kharrat, S., Canini, M., and Horvath, S. (2024). Decentralized personalized federated learning. arXiv preprint arXiv:2406.06520

work page arXiv 2024

[41] [41]

Kolehmainen, J., Blagoev, N., Donaghy, J., Ersoy, O., and Nies, C. (2025). Noloco: No-all- reduce low communication training method for large models. arXiv preprint arXiv:2506.10911

work page arXiv 2025

[42] [42]

Koloskova, A., Loizou, N., Boreiri, S., Jaggi, M., and Stich, S. (2020). A unified theory of decentralized SGD with changing topology and local updates. In International Conference on Machine Learning

work page 2020

[43] [43]

Kong, L., Lin, T., Koloskova, A., Jaggi, M., and Stich, S. (2021a). Consensus control for decentralized deep learning. In International Conference on Machine Learning. PMLR

work page

[44] [44]

Kong, L., Lin, T., Koloskova, A., Jaggi, M., and Stich, S. (2021b). Consensus control for decentralized deep learning. In Proceedings of the 38th International Conference on Machine Learning

work page

[45] [45]

Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images (tech. rep.). University of Toronto

work page 2009

[46] [46]

and Yang, X

Le, Y . and Yang, X. (2015). Tiny imagenet visual recognition challenge. CS 231N

work page 2015

[47] [47]

Le Bars, B., Bellet, A., Tommasi, M., Lavoie, E., and Kermarrec, A.-M. (2023). Refined convergence and topology learning for decentralized sgd with heterogeneous data. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics

work page 2023

[48] [48]

Le Bars, B., Bellet, A., Tommasi, M., Scaman, K., and Neglia, G. (2024). Improved stability and generalization guarantees of the decentralized SGD algorithm. In Proceedings of the 41st International Conference on Machine Learning

work page 2024

[49] [49]

G., Smola, A

Li, M., Andersen, D. G., Smola, A. J., and Yu, K. (2014). Communication efficient distributed machine learning with the parameter server. Advances in Neural Information Processing Systems

work page 2014

[50] [50]

A., and Zettlemoyer, L

Li, M., Gururangan, S., Dettmers, T., Lewis, M., Althoff, T., Smith, N. A., and Zettlemoyer, L. (2022a). Branch-train-merge: Embarrassingly parallel training of expert language models. In First Workshop on Interpolation Regularizers and Beyond at NeurIPS 2022

work page 2022

[51] [51]

Li, S., Zhou, T., Tian, X., and Tao, D. (2022b). Learning to collaborate in decentralized learning of personalized models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9766–9775

work page

[52] [52]

Li, Z., Wang, T., and Arora, S. (2022c). What happens after SGD reaches zero loss? –a mathematical framework. In International Conference on Learning Representations

work page

[53] [53]

Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. (2017). Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems

work page 2017

[54] [54]

Lian, X., Zhang, W., Zhang, C., and Liu, J. (2018). Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning. 12

work page 2018

[55] [55]

P., Stich, S., and Jaggi, M

Lin, T., Karimireddy, S. P., Stich, S., and Jaggi, M. (2021). Quasi-global momentum: Accelerat- ing decentralized deep learning on heterogeneous data. In Proceedings of the 38th International Conference on Machine Learning

work page 2021

[56] [56]

and De Sa, C

Lu, Y . and De Sa, C. (2021). Optimal complexity in decentralized training. InProceedings of the 38th International Conference on Machine Learning

work page 2021

[57] [57]

Lyu, K. (2024). Implicit Bias of Deep Learning Optimization: A Mathematical Examination. PhD thesis, Princeton University

work page 2024

[58] [58]

T., Pérez, M

Martínez Beltrán, E. T., Pérez, M. Q., Sánchez, P. M. S., Bernal, S. L., Bovet, G., Pérez, M. G., Pérez, G. M., and Celdrán, A. H. (2023). Decentralized federated learning: Fundamentals, state of the art, frameworks, trends, and challenges. IEEE Communications Surveys & Tutorials, 25(4):2983–3013

work page 2023

[59] [59]

Matena, M. S. and Raffel, C. (2022). Merging models with fisher-weighted averaging. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K., editors,Advances in Neural Information Processing Systems

work page 2022

[60] [60]

Mavrovouniotis, M., Li, C., and Yang, S. (2017). A survey of swarm intelligence for dynamic optimization: Algorithms and applications. Swarm and Evolutionary Computation, 33:1–17

work page 2017

[61] [61]

E., Cyffers, E., and Bellet, A

Mrini, A. E., Cyffers, E., and Bellet, A. (2024). Privacy attacks in decentralized learning. In Proceedings of the 41st International Conference on Machine Learning

work page 2024

[62] [62]

Nadiradze, G., Sabour, A., Davies, P., Li, S., and Alistarh, D. (2021). Asynchronous de- centralized sgd with quantized and local updates. Advances in Neural Information Processing Systems

work page 2021

[63] [63]

and Kolter, J

Nagarajan, V . and Kolter, J. Z. (2019). Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems, 32

work page 2019

[64] [64]

and Olshevsky, A

Nedi’c, A. and Olshevsky, A. (2014). Distributed optimization over time-varying directed graphs. volume 60, pages 601–615. IEEE

work page 2014

[65] [65]

and Ozdaglar, A

Nedic, A. and Ozdaglar, A. (2009). Distributed subgradient methods for multi-agent optimiza- tion. IEEE Transactions on Automatic Control, 54(1):48–61

work page 2009

[66] [66]

Announcing the stargate project

OpenAI (2025). Announcing the stargate project. https://openai.com/index/ announcing-the-stargate-project/

work page 2025

[67] [67]

Ortiz-Jimenez, G., Favero, A., and Frossard, P. (2023). Task arithmetic in the tangent space: Improved editing of pre-trained models. In Thirty-seventh Conference on Neural Information Processing Systems

work page 2023

[68] [68]

F., Sanders, J., Rahman, R., and Heim, L

Pilz, K. F., Sanders, J., Rahman, R., and Heim, L. (2025). Trends in ai supercomputers. arXiv preprint arXiv:2504.16026

work page arXiv 2025
