On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning
Pith reviewed 2026-05-19 05:34 UTC · model grok-4.3
The pith
Decentralized SGD with one final global merge achieves the convergence rate of parallel SGD.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Performing a single global merge of all local models at the final iteration of decentralized SGD yields an output model that attains the same convergence rate as parallel SGD. The proof obtains this rate by reinterpreting a portion of the discrepancies among the local models, previously regarded as detrimental noise, as constructive components that contribute to the overall convergence bound.
What carries the argument
The single global merging step that aggregates every local model only at the final training iteration.
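The mechanism can be illustrated with a minimal sketch. It is a toy simulation under stated assumptions, not the paper's experimental setup: each worker holds a hypothetical quadratic objective f_i(w) = 0.5·||w − c_i||², workers gossip on a ring only occasionally, and the single global merge is a plain average of all local models after the last step. The names `ring_gossip_matrix`, `decentralized_sgd`, `global_merge`, and the parameter `gossip_every` are illustrative and do not come from the paper.

```python
import numpy as np

def ring_gossip_matrix(n):
    """Doubly stochastic mixing matrix for a ring (assumed topology):
    each worker averages itself with its left and right neighbours."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] += 1.0 / 3.0
        W[i, (i - 1) % n] += 1.0 / 3.0
        W[i, (i + 1) % n] += 1.0 / 3.0
    return W

def decentralized_sgd(centers, steps, lr, noise_std, gossip_every, seed=0):
    """Run local noisy gradient steps on f_i(w) = 0.5 * ||w - c_i||^2,
    with sparse ring gossip every `gossip_every` steps. Returns all local models."""
    rng = np.random.default_rng(seed)
    n, d = centers.shape
    W = ring_gossip_matrix(n)
    w = np.zeros((n, d))  # one row per worker
    for t in range(1, steps + 1):
        # Stochastic gradient of each worker's own (heterogeneous) objective.
        grads = (w - centers) + noise_std * rng.standard_normal((n, d))
        w = w - lr * grads
        if t % gossip_every == 0:
            w = W @ w  # limited peer-to-peer communication
    return w

def global_merge(local_models):
    """Single global merge: uniform average of every local model."""
    return local_models.mean(axis=0)

# Toy heterogeneous problem: the global optimum is the mean of the centers.
centers = np.array([[4.0, 0.0], [-4.0, 0.0], [0.0, 4.0], [0.0, -4.0]])
w_star = centers.mean(axis=0)

local = decentralized_sgd(centers, steps=400, lr=0.05, noise_std=0.1, gossip_every=50)
merged = global_merge(local)

err_local = np.mean(np.sum((local - w_star) ** 2, axis=1))
err_merged = np.sum((merged - w_star) ** 2)
print("mean squared error of local models:", err_local)
print("squared error of merged model:    ", err_merged)
```

In this toy setting every worker has the same curvature, so the average of the local models follows the same recursion as parallel SGD on the averaged objective, and the merged model ends up far closer to the optimum than a typical local model. Real networks do not have this property, which is why the general result needs the paper's analysis rather than this shortcut.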
If this is right
- Decentralized training can reach comparable generalization to parallel training even when data partitions are highly non-uniform.
- Communication budgets can be shifted almost entirely to the end of training without sacrificing the theoretical rate.
- Standard decentralized SGD becomes practical under stricter limits on total peer-to-peer exchanges.
- Model merging at the close of training can be viewed as a lightweight way to recover parallel-like guarantees.
Where Pith is reading between the lines
- The same late-merge tactic might be tested on non-SGD optimizers to check whether the rate-matching benefit generalizes.
- Dynamic schedules that trigger the global merge once local drift exceeds a threshold could be compared against the fixed final-step rule.
- The constructive-discrepancy view may connect decentralized optimization to ensemble methods that deliberately preserve local diversity until the end.
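The drift-triggered schedule mentioned in this list could be prototyped along the following lines. This is a hedged sketch of one possible rule, not something the paper proposes: drift is measured here as the mean squared distance of the local models from their average, and `consensus_distance`, `merge_if_drifted`, and `threshold` are hypothetical names chosen for illustration.

```python
import numpy as np

def consensus_distance(local_models):
    """Mean squared distance of the local models (one per row) from their average."""
    mean = local_models.mean(axis=0)
    return float(np.mean(np.sum((local_models - mean) ** 2, axis=1)))

def merge_if_drifted(local_models, threshold):
    """Replace every local model by the global average when drift exceeds `threshold`.

    Returns the (possibly merged) models and a flag saying whether a merge happened.
    """
    if consensus_distance(local_models) > threshold:
        merged = local_models.mean(axis=0)
        return np.tile(merged, (local_models.shape[0], 1)), True
    return local_models, False

models = np.array([[0.0, 0.0], [2.0, 0.0]])
print(consensus_distance(models))
print(merge_if_drifted(models, threshold=0.5))
```

Comparing such a rule against the fixed final-step merge would require counting how many global merges the threshold triggers, since each one consumes communication budget.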
Load-bearing premise
The proof requires reinterpreting discrepancies among local models as constructive components rather than detrimental noise.
What would settle it
A calculation or experiment in which the globally merged model converges at a rate strictly slower than the parallel-SGD rate on high-heterogeneity data would falsify the central claim.
Original abstract
Decentralized learning provides a scalable alternative to parameter-server-based training, yet its performance is often hindered by limited peer-to-peer communication. In this paper, we study how communication should be scheduled over time, including determining when and how frequently devices synchronize. Counterintuitive empirical results show that concentrating communication budgets in the later stages of decentralized training remarkably improves global test performance. Surprisingly, we uncover that fully connected communication at the final step, implemented by a single global merging, can significantly improve the performance of decentralized learning under high data heterogeneity. Our theoretical contributions, which explain these phenomena, are the first to establish that the globally merged model of decentralized SGD can match the convergence rate of parallel SGD. Technically, we reinterpret part of the discrepancy among local models, which were previously considered as detrimental noise, as constructive components essential for matching this rate. This work provides evidence that decentralized learning is able to generalize under high data heterogeneity and limited communication, while offering broad new avenues for model merging research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies communication scheduling in decentralized SGD, presenting empirical results that a single global merge at the final training step substantially improves test performance under high data heterogeneity. Theoretically, it claims to be the first to prove that the globally merged model achieves the same O(1/sqrt(T)) convergence rate as parallel SGD, by reinterpreting local-model discrepancy vectors (previously viewed as noise) as constructive components whose inner products aid the bound.
Significance. If the rate-matching result holds, the work shows decentralized learning can match centralized rates with minimal (late-stage) communication, offering concrete evidence that heterogeneity need not preclude good generalization when merging is timed appropriately. This supplies a new lens for model-merging research and credits the constructive-component reinterpretation as the technical step that closes the analysis.
major comments (2)
- [§4] §4 (Convergence Analysis), around the expansion of ||(1/n)∑w_i − w*||² after the single late merge: the proof reinterprets the cross terms 2⟨avg(w_i − w_avg), ∇f⟩ as non-detrimental or canceling, yet supplies no explicit bound or sign control on these terms that would guarantee they remain controlled when the merge occurs after hundreds of local steps in the high-heterogeneity regime used in the experiments. This step is load-bearing for equating the merged rate to that of synchronous parallel SGD.
- [Theorem 1] Theorem 1 (rate-matching statement): the derivation assumes local models remain in a regime where the reinterpreted discrepancy terms do not introduce extra bias beyond the constants already used for parallel SGD; no separate lemma verifies this regime holds for the single-merge schedule and heterogeneity levels reported in §5.
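For orientation on the expansion discussed in the first major comment, the following standard identity may help. It is a textbook decomposition written in assumed notation (w̄ for the merged model, w* for the optimum), not a step quoted from the paper's proof, and it only covers the deviation-from-average cross term, not the gradient-dependent terms the referee asks about.

```latex
% Assumed notation: \bar{w} = \frac{1}{n}\sum_{i=1}^{n} w_i is the globally merged model.
\begin{align*}
\frac{1}{n}\sum_{i=1}^{n}\|w_i - w^*\|^2
  &= \frac{1}{n}\sum_{i=1}^{n}\|(w_i - \bar{w}) + (\bar{w} - w^*)\|^2 \\
  &= \frac{1}{n}\sum_{i=1}^{n}\|w_i - \bar{w}\|^2
     + 2\Big\langle \frac{1}{n}\sum_{i=1}^{n}(w_i - \bar{w}),\; \bar{w} - w^* \Big\rangle
     + \|\bar{w} - w^*\|^2 \\
  &= \frac{1}{n}\sum_{i=1}^{n}\|w_i - \bar{w}\|^2 + \|\bar{w} - w^*\|^2 ,
\end{align*}
% The cross term vanishes because \sum_{i=1}^{n}(w_i - \bar{w}) = 0 by definition of \bar{w}.
```

Read this way, the merged model's squared error equals the average local error minus the consensus distance, so discrepancy among local models is subtracted rather than added at the merge. Whether the paper's constructive-component argument reduces to this identity or goes beyond it cannot be determined from the abstract alone.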
minor comments (2)
- [Abstract] Abstract: quantitative details (e.g., number of local steps before merge, dataset sizes, or observed accuracy deltas) are omitted, making the “surprising effectiveness” claim harder to evaluate at a glance.
- [Experiments] Experimental section: tables and figures lack error bars or mention of the number of random seeds; adding these would strengthen the empirical support without altering the central claim.
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which help clarify the requirements for rigorously establishing the rate-matching result. We address each major comment below and indicate the corresponding revisions.
- Referee: [§4] §4 (Convergence Analysis), around the expansion of ||(1/n)∑w_i − w*||² after the single late merge: the proof reinterprets the cross terms 2⟨avg(w_i − w_avg), ∇f⟩ as non-detrimental or canceling, yet supplies no explicit bound or sign control on these terms that would guarantee they remain controlled when the merge occurs after hundreds of local steps in the high-heterogeneity regime used in the experiments. This step is load-bearing for equating the merged rate to that of synchronous parallel SGD.
  Authors: We thank the referee for highlighting this point. The original analysis absorbs the cross terms into the existing constants via the reinterpretation of discrepancies as constructive, but we agree an explicit bound improves clarity. In the revised manuscript we expand the derivation in §4 to bound |2⟨avg(w_i − w_avg), ∇f⟩| using L-smoothness and the fact that the average discrepancy norm grows at most linearly with the number of local steps before the final merge; the resulting additive term remains O(1/sqrt(T)) and does not alter the leading rate, matching the parallel-SGD analysis under the same assumptions.
  Revision: yes
- Referee: [Theorem 1] Theorem 1 (rate-matching statement): the derivation assumes local models remain in a regime where the reinterpreted discrepancy terms do not introduce extra bias beyond the constants already used for parallel SGD; no separate lemma verifies this regime holds for the single-merge schedule and heterogeneity levels reported in §5.
  Authors: We agree that an explicit verification of the regime is desirable. We have added a supporting lemma (now Lemma 3) in the appendix that bounds the discrepancy growth over the local phases preceding the single global merge. The lemma shows that, for the heterogeneity parameter and local-step counts used in the §5 experiments, the extra bias introduced by the reinterpreted terms stays within the constants already present in the parallel-SGD bound, thereby justifying the assumptions of Theorem 1 for the single-merge schedule.
  Revision: yes
Circularity Check
No significant circularity; derivation extends standard SGD analysis independently
The paper claims a single late global merge in decentralized SGD matches the O(1/sqrt(T)) rate of parallel SGD by reinterpreting local discrepancies as constructive components rather than noise. No quoted equations or steps in the provided abstract reduce the final bound to a fitted parameter, self-citation chain, or input by construction. The reinterpretation is presented as an original technical step in the convergence proof, without evidence that cross-term cancellations are forced by prior self-referenced results or ansatzes. The analysis therefore remains self-contained against external SGD benchmarks and does not trigger any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard SGD convergence assumptions such as bounded gradients and suitable learning-rate schedules hold for both decentralized and parallel settings.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: we novelly reinterpret part of the discrepancy among local models, which were previously considered as detrimental noise, as constructive components that accelerate convergence
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, alpha_pin_under_high_calibration (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: A(t) ≜ ηL (2T₂ + ... ) with T₂ = (∇²L(θ̄(t)) Γ(t))⊤ ∇ Tr(∇²L(θ̄(t)) Γ(t))
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.