arXiv: 2605.20866 · v1 · pith:GLH3PFKXnew · submitted 2026-05-20 · cs.LG · cs.DC · math.OC · stat.ML

LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging

Pith reviewed 2026-05-21 05:46 UTC · model grok-4.3

keywords: local SGD, sparse model averaging, communication-computation overlap, delay correction, non-convex optimization, distributed learning, heterogeneous workers, convergence analysis

The pith

LOSCAR-SGD combines sparse local updates with computation-communication overlap and a delay-corrected merge to converge on smooth non-convex objectives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LOSCAR-SGD, a Local SGD variant for distributed settings in which workers have heterogeneous compute speeds. It communicates only a sparse subset of model coordinates and lets local optimization continue while communication is in flight, using a delay-corrected merge to integrate the delayed information. Convergence guarantees are derived for smooth non-convex objectives, with rates that depend explicitly on the sparsity level, the amount of overlap, and the degree of worker heterogeneity. According to the authors, this is the first theoretical analysis of the combination of local training, sparsity, and overlap.

Core claim

LOSCAR-SGD is a Local SGD method that communicates only a sparse subset of model coordinates and continues optimizing while communication is in flight. A key ingredient is a delay-corrected merge rule that incorporates delayed synchronized information without discarding the progress made during the overlap phase. The authors give convergence guarantees for smooth non-convex objectives and show how sparsity, overlap, and worker heterogeneity affect the rate. To the best of the authors' knowledge, this is the first theory for this combination of ingredients.

What carries the argument

The delay-corrected merge rule, which folds delayed sparse updates from heterogeneous workers back into the local models without erasing progress accumulated during the overlap interval.
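One plausible reading of such a rule can be sketched as follows. This is an illustration under stated assumptions, not the paper's definition: a worker snapshots its model when communication starts, keeps taking local steps, and when the averaged values for the masked coordinates arrive it sets those coordinates to the average plus the local displacement accumulated since the snapshot. Naive overwriting would instead set them to the stale average. The function names and the exact form of the correction are hypothetical.

```python
from typing import Dict, List


def overwrite_merge(current: List[float], averaged: Dict[int, float]) -> List[float]:
    """Naive merge: replace communicated coordinates with the stale average."""
    merged = list(current)
    for i, avg in averaged.items():
        merged[i] = avg
    return merged


def delay_corrected_merge(
    current: List[float], snapshot: List[float], averaged: Dict[int, float]
) -> List[float]:
    """Assumed correction: stale average plus the progress made during overlap.

    Only coordinates in the communicated mask (the keys of `averaged`) change;
    all other coordinates keep their current local values.
    """
    merged = list(current)
    for i, avg in averaged.items():
        merged[i] = avg + (current[i] - snapshot[i])
    return merged


snapshot = [1.0, 2.0, 3.0]   # local model when communication started
current = [0.5, 2.0, 2.0]    # local model after the overlap-phase steps
averaged = {0: 2.0, 2: 4.0}  # delayed averages for the sparse mask {0, 2}

print(overwrite_merge(current, averaged))                  # [2.0, 2.0, 4.0]
print(delay_corrected_merge(current, snapshot, averaged))  # [1.5, 2.0, 3.0]
```

In this toy case overwriting resets coordinates 0 and 2 to the stale averages, while the assumed correction shifts those averages by the displacement each coordinate made during the overlap.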

If this is right

  • Sparsity level directly modulates the communication volume and appears in the convergence bound.
  • Communication-computation overlap shortens wall-clock training time without harming the asymptotic rate.
  • Worker heterogeneity increases the effective delay term and slows the rate in a quantifiable way.
  • The delay-corrected merge outperforms naive overwriting both in the theory and in the reported experiments.
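The first bullet can be made concrete with a rough count of communicated coordinates per round. The sketch below uses the sparsity levels from the Figure 8 ablation, p ∈ {0.001, 0.01, 0.1, 1.0}. The model size `d` and the helper name are hypothetical, and the count ignores the cost of sending coordinate indices.

```python
def coords_per_round(d: int, p: float) -> int:
    """Number of coordinates sent per round at sparsity level p (at least one)."""
    return max(1, round(p * d))


d = 1_000_000  # hypothetical model size, not taken from the paper
for p in (0.001, 0.01, 0.1, 1.0):
    k = coords_per_round(d, p)
    print(f"p={p}: {k} coordinates per round, {d // k}x fewer than dense")
```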

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same overlap-plus-correction idea could be applied to other first-order methods such as Adam or momentum variants.
  • Pairing the sparse merge with coordinate-wise quantization might yield multiplicative communication savings.
  • In federated settings the optimal overlap length could be tuned from measured round-trip times and compute variance.
  • The analysis suggests that very high sparsity may require compensatory increases in local steps to keep the rate acceptable.

Load-bearing premise

The delay-corrected merge rule correctly incorporates delayed synchronized information without discarding the progress made during the overlap phase.

What would settle it

Replace the delay-corrected merge with naive overwriting in a heterogeneous testbed with measurable overlap periods and observe whether convergence slows or fails relative to the predicted rate.
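A toy version of this test can be sketched in a few lines. It is not the paper's experiment: the objective is a noiseless quadratic, the mask is a hypothetical round-robin choice of one coordinate per round, all local steps happen during the overlap phase, and the delay correction is the assumed "stale average plus overlap displacement" form. Two workers take different numbers of overlap steps to mimic heterogeneous compute.

```python
def simulate(merge: str, rounds: int = 4, lr: float = 0.5,
             overlap_steps=(2, 1), dim: int = 2) -> float:
    """Toy run on f(x) = 0.5 * ||x||^2; returns the loss at the worker-averaged model."""
    n = len(overlap_steps)
    models = [[1.0] * dim for _ in range(n)]
    for r in range(rounds):
        mask = [r % dim]  # hypothetical round-robin sparse mask (one coordinate)
        snapshots = [list(x) for x in models]
        # Average of the masked coordinates, computed from the snapshots.
        averaged = {j: sum(s[j] for s in snapshots) / n for j in mask}
        # Overlap phase: worker i takes overlap_steps[i] gradient steps (grad f = x).
        for i, steps in enumerate(overlap_steps):
            for _ in range(steps):
                models[i] = [v - lr * v for v in models[i]]
        # Merge the delayed averages into each local model.
        for i in range(n):
            for j, avg in averaged.items():
                if merge == "overwrite":
                    models[i][j] = avg
                else:  # assumed delay correction: keep the overlap-phase displacement
                    models[i][j] = avg + (models[i][j] - snapshots[i][j])
    mean_model = [sum(x[j] for x in models) / n for j in range(dim)]
    return 0.5 * sum(v * v for v in mean_model)


loss_overwrite = simulate("overwrite")
loss_corrected = simulate("corrected")
print(f"overwrite: {loss_overwrite:.6f}  delay-corrected: {loss_corrected:.6f}")
```

In this setup overwriting throws away every overlap-phase step on the masked coordinate, so the corrected run ends at a lower loss. A real test would add stochastic gradients, realistic delays, and the paper's actual merge rule.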

Figures

Figures reproduced from arXiv: 2605.20866 by Ammar Mahran, Artavazd Maranjyan, Peter Richtárik, Yassine Maziane.

Figure 1. Main experimental summary on a9a. (a) Overlap improves over blocking sparse averaging, and delay correction is best. (b) Smaller sparsity levels reduce the logical communication cost by orders of magnitude for the delay-corrected method. (c) Under a long communication delay, delay correction substantially outperforms overwrite.
Figure 2. Accuracy comparison for the controlled sparse-overlap experiment on …
Figure 3. Training gradient norm for the controlled sparse-overlap experiment on …
Figure 4. Loss curves under equivalent resource parametrizations for the controlled sparse-overlap …
Figure 5. Long-compute, short-delay regime (M = 8, ζ = 6, worker times (1, 2, 3, 6)). Delay correction gives a small but persistent improvement over overwrite on both training and validation loss. (a) Training loss. (b) Validation loss.
Figure 6. Short-compute, long-delay regime (M = 2, ζ = 24, worker times (1, 2, 3, 6)). This is the regime with the largest gap: overwrite discards a substantial amount of overlap-phase progress, while delay correction preserves it.
Figure 7. Very heterogeneous worker-speed regime ( …
Figure 8. Sparsity-level ablation on a9a. We vary p ∈ {0.001, 0.01, 0.1, 1.0} with M = 4, ζ = 6, worker times (1, 2, 3, 6), and otherwise identical training settings. Smaller p dramatically reduces the cumulative number of communicated bits while preserving a similar loss trajectory, showing that sparse parameter averaging gives a strong communication-accuracy tradeoff in this homogeneous-data regime.
Figure 9. Ablation on the local computation budget. The blocking local-sparse baseline is sensitive …
Figure 10. Training loss versus cumulative logical time for the communication-delay ablation. The …
Figure 11. Negative stress test with strongly heterogeneous data and very slow communication. The …
Figure 12. CIFAR-10 convolutional-network experiment in the normal communication regime, with …
Figure 13. CIFAR-10 convolutional-network experiment in the communication-stress regime, with …
Figure 14. Tiny ImageNet experiment in the normal communication regime, with …
Figure 15. Tiny ImageNet experiment in the communication-stress regime, with …

Original abstract

Communication is a major bottleneck in distributed learning, especially in large-scale settings and in federated learning environments with slow links. Three standard ways to reduce this cost are communication compression, local training, and communication-computation overlap. Methods that combine these ingredients are used in practice and have been found to be effective for large-scale training, but there is little theory for methods that combine all three. We study a heterogeneous-compute setting in which different workers may take different numbers of local steps, and we propose LOSCAR-SGD, a Local SGD method that communicates only a sparse subset of model coordinates and continues optimizing while communication is in flight. A key ingredient is a delay-corrected merge rule that incorporates delayed synchronized information without discarding the progress made during the overlap phase. We give convergence guarantees for smooth non-convex objectives and show how sparsity, overlap, and worker heterogeneity affect the rate. To the best of our knowledge, this is the first theory for this combination of ingredients. Experiments further show that communication-computation overlap reduces training time and that the delay-corrected merge outperforms naive overwriting.
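The heterogeneous-compute setting can be illustrated with a small sketch. It assumes, hypothetically, that the "worker times" quoted in the figure captions, such as (1, 2, 3, 6), are per-step compute times, and that each worker completes as many whole local steps as fit in a fixed time window. The function name and the window length are illustrative and not taken from the paper.

```python
from typing import List, Sequence


def local_steps(window: float, step_times: Sequence[float]) -> List[int]:
    """Whole local steps each worker finishes within a window of the given length."""
    return [int(window // t) for t in step_times]


# Assumed per-step times (1, 2, 3, 6) and a window of 6 time units.
print(local_steps(6, (1, 2, 3, 6)))  # [6, 3, 2, 1]
```

Under these assumptions the fastest worker takes six times as many local steps as the slowest one between synchronizations.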

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes LOSCAR-SGD, a Local SGD algorithm for heterogeneous distributed settings that combines sparse coordinate communication, communication-computation overlap, and a delay-corrected merge rule to incorporate delayed updates without discarding local progress during overlap. It claims convergence guarantees for smooth non-convex objectives, with explicit dependence of the rate on sparsity level, overlap duration, and worker heterogeneity, and reports experiments showing reduced wall-clock time and better performance than naive overwriting.

Significance. If the convergence analysis is correct, the work would be significant as the first explicit theory for the joint combination of local steps, sparsity, overlap, and heterogeneity; the parameter dependence could directly inform practical tuning in large-scale training. The experiments provide supporting evidence for the overlap benefit, though verification is limited by the absence of full proof details and statistical error bars.

major comments (3)
  1. §3 (delay-corrected merge rule): the claim that the rule produces an unbiased estimator of the averaged model while preserving overlap-phase progress is load-bearing for all rate statements, yet the description does not specify whether the correction is applied coordinate-wise only to the sparse mask that was actually sent at the delayed time or uniformly; under heterogeneous delays and per-worker sparsity this risks introducing a bias term proportional to sparsity level times delay variance, which would invalidate the claimed rate.
  2. Theorem 1 (convergence bound): the rate is stated to depend explicitly on sparsity, overlap, and heterogeneity, but the proof sketch relies on the merge rule remaining unbiased without additional assumptions on consistent sparse masks across send/receive times; if the analysis applies a uniform correction, the variance term from coordinate-wise heterogeneity could grow and contradict the stated bound.
  3. §5 (experiments): the reported improvements in training time lack error bars or multiple independent runs, so it is impossible to assess whether the observed gains over naive overwriting are statistically reliable or sensitive to random seeds.
minor comments (2)
  1. Notation for the sparse mask and delay variables is introduced without a consolidated table; a single reference table would improve readability of the rate expressions.
  2. The abstract states this is the first theory for the combination, but the introduction omits explicit comparison to prior overlap analyses in Local SGD (e.g., those handling fixed delays without sparsity).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript. We address each major comment in turn below and have revised the paper to improve clarity and completeness where needed.

Point-by-point responses
  1. Referee: §3 (delay-corrected merge rule): the claim that the rule produces an unbiased estimator of the averaged model while preserving overlap-phase progress is load-bearing for all rate statements, yet the description does not specify whether the correction is applied coordinate-wise only to the sparse mask that was actually sent at the delayed time or uniformly; under heterogeneous delays and per-worker sparsity this risks introducing a bias term proportional to sparsity level times delay variance, which would invalidate the claimed rate.

    Authors: The delay-corrected merge rule is defined to apply the correction coordinate-wise and exclusively to the coordinates present in the sparse mask that was transmitted at the delayed communication round. This is stated in Section 3 immediately after the algorithm pseudocode and is used in the subsequent analysis. Because only the communicated coordinates receive the delay adjustment, the estimator for the averaged model remains unbiased; local progress on non-communicated coordinates is retained without introducing an extra bias term. The dependence on sparsity level and delay already appears in the convergence bound of Theorem 1. We have added a short clarifying paragraph and a supporting lemma in the revised §3 to make the coordinate-wise application explicit. revision: yes

  2. Referee: Theorem 1 (convergence bound): the rate is stated to depend explicitly on sparsity, overlap, and heterogeneity, but the proof sketch relies on the merge rule remaining unbiased without additional assumptions on consistent sparse masks across send/receive times; if the analysis applies a uniform correction, the variance term from coordinate-wise heterogeneity could grow and contradict the stated bound.

    Authors: The full proof in the appendix explicitly assumes that the sparse masks are those chosen at the sending time and that the correction is applied only to those coordinates; a uniform correction is never used. Under this construction the unbiasedness holds and the variance contribution from coordinate-wise heterogeneity is controlled by the sparsity factor already present in the rate. We have expanded the proof sketch in the main text of the revised manuscript with a one-paragraph outline of the unbiasedness argument and a pointer to the relevant appendix lemma. revision: yes

  3. Referee: §5 (experiments): the reported improvements in training time lack error bars or multiple independent runs, so it is impossible to assess whether the observed gains over naive overwriting are statistically reliable or sensitive to random seeds.

    Authors: We agree that the experimental presentation would be strengthened by statistical reporting. In the revised manuscript we have repeated the wall-clock time experiments over five independent random seeds and added error bars (mean ± one standard deviation) to the relevant plots in §5. The observed gains of LOSCAR-SGD over naive overwriting remain consistent across seeds. revision: yes
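The reporting convention described in the third response, mean ± one standard deviation over five seeds, can be sketched as below. The wall-clock values are placeholders invented for illustration and are not results from the paper. The use of the sample standard deviation is also an assumption, since the response does not say which estimator is used.

```python
import statistics

# Placeholder wall-clock times (seconds) for five seeds; NOT data from the paper.
wall_clock = [10.0, 12.0, 11.0, 13.0, 9.0]

mean = statistics.mean(wall_clock)
std = statistics.stdev(wall_clock)  # sample standard deviation (n - 1 denominator)

print(f"{mean:.2f} ± {std:.2f} s over {len(wall_clock)} seeds")
```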

Circularity Check

0 steps flagged

Convergence analysis derives from standard smoothness assumptions and proposed merge rule without tautological reduction

full rationale

The paper proposes LOSCAR-SGD with a delay-corrected sparse merge and derives convergence rates for smooth non-convex objectives directly from the algorithm's update rules, sparsity masks, overlap phases, and heterogeneity parameters. The rate expressions follow from standard bounded-variance and smoothness assumptions applied to the new merge operator; no step equates a claimed prediction or theorem to a fitted input or prior self-citation by construction. The 'first theory' claim for the combination of ingredients further indicates the derivation chain is self-contained rather than relying on load-bearing self-citations or ansatzes imported from the authors' prior work.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The analysis rests on standard smoothness and bounded-variance assumptions common to non-convex SGD theory plus the novel delay-correction mechanism; no new entities are introduced.

free parameters (2)
  • sparsity level
    Fraction of coordinates communicated; chosen to trade communication cost against convergence speed.
  • overlap duration
    Time window during which local computation continues while communication is in flight; affects the delay term in the analysis.
axioms (2)
  • (domain assumption) Objective function is L-smooth
    Standard assumption invoked for non-convex convergence analysis of SGD variants.
  • (domain assumption) Workers may perform different numbers of local steps
    Heterogeneous compute setting explicitly stated as the operating regime.
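For reference, the smoothness and bounded-variance assumptions named in this ledger usually take the following textbook form. The symbols g, σ, and d are notation assumed here, and the paper's exact statements may differ.

```latex
% L-smoothness of the objective f
\|\nabla f(x) - \nabla f(y)\| \le L \,\|x - y\| \quad \text{for all } x, y \in \mathbb{R}^d

% Bounded variance of an unbiased stochastic gradient g(x), with \mathbb{E}[g(x)] = \nabla f(x)
\mathbb{E}\,\|g(x) - \nabla f(x)\|^2 \le \sigma^2
```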

pith-pipeline@v0.9.0 · 5745 in / 1353 out tokens · 25674 ms · 2026-05-21T05:46:16.523542+00:00 · methodology

