LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging

Ammar Mahran; Artavazd Maranjyan; Peter Richtárik; Yassine Maziane

arXiv: 2605.20866 · v1 · pith:GLH3PFKXnew · submitted 2026-05-20 · 💻 cs.LG · cs.DC · math.OC · stat.ML

Pith reviewed 2026-05-21 05:46 UTC · model grok-4.3

classification 💻 cs.LG · cs.DC · math.OC · stat.ML

keywords local SGD · sparse model averaging · communication-computation overlap · delay correction · non-convex optimization · distributed learning · heterogeneous workers · convergence analysis

The pith

LOSCAR-SGD combines sparse local updates with computation-communication overlap and a delay-corrected merge to converge on smooth non-convex objectives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LOSCAR-SGD as a local SGD variant for distributed settings where workers have heterogeneous compute speeds. It communicates only sparse model coordinates while allowing local optimization to continue during communication, using a delay-corrected merge to integrate the delayed information. Convergence guarantees are derived for smooth non-convex objectives, with rates that explicitly depend on the sparsity level, the amount of overlap, and the degree of worker heterogeneity. This supplies the first theoretical analysis for the practical combination of local training, sparsity, and overlap.

Core claim

LOSCAR-SGD is a Local SGD method that communicates only a sparse subset of model coordinates and continues optimizing while communication is in flight. A key ingredient is a delay-corrected merge rule that incorporates delayed synchronized information without discarding the progress made during the overlap phase. We give convergence guarantees for smooth non-convex objectives and show how sparsity, overlap, and worker heterogeneity affect the rate. This is the first theory for this combination of ingredients.

What carries the argument

The delay-corrected merge rule, which folds delayed sparse updates from heterogeneous workers back into the local models without erasing progress accumulated during the overlap interval.
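
One way to picture the difference between naive overwriting and a delay-corrected merge is the sketch below. It is an illustration under assumptions, not the paper's exact rule: the names `overwrite_merge`, `delay_corrected_merge`, `local_at_send`, `local_now`, and `averaged` are hypothetical, and the correction is assumed to add back, on the communicated coordinates only, the change the worker made locally while the message was in flight.

```python
from typing import List, Sequence


def overwrite_merge(local_now: Sequence[float],
                    averaged: Sequence[float],
                    mask: Sequence[bool]) -> List[float]:
    """Naive baseline: sent coordinates are replaced by the stale average."""
    return [avg if sent else cur
            for cur, avg, sent in zip(local_now, averaged, mask)]


def delay_corrected_merge(local_at_send: Sequence[float],
                          local_now: Sequence[float],
                          averaged: Sequence[float],
                          mask: Sequence[bool]) -> List[float]:
    """Assumed correction: stale average plus the progress made during overlap."""
    merged = []
    for snap, cur, avg, sent in zip(local_at_send, local_now, averaged, mask):
        if sent:
            # Keep the overlap-phase change (cur - snap) on top of the average.
            merged.append(avg + (cur - snap))
        else:
            # Coordinates that were never sent stay purely local.
            merged.append(cur)
    return merged


# Hypothetical 4-coordinate model; coordinates 0 and 2 were communicated.
mask = [True, False, True, False]
local_at_send = [4.0, 4.0, 4.0, 4.0]   # snapshot taken when the message left
local_now = [2.0, 2.0, 2.0, 2.0]       # after further local steps
averaged = [3.0, 0.0, 3.0, 0.0]        # entries at unsent coordinates are unused

print(overwrite_merge(local_now, averaged, mask))
print(delay_corrected_merge(local_at_send, local_now, averaged, mask))
```

In this toy case the overwrite baseline resets the sent coordinates to the average, while the corrected variant moves them further by the amount the worker progressed during the overlap.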

If this is right

Sparsity level directly modulates the communication volume and appears in the convergence bound.
Communication-computation overlap shortens wall-clock training time without harming the asymptotic rate.
Worker heterogeneity increases the effective delay term and slows the rate in a quantifiable way.
The delay-corrected merge outperforms naive overwriting in both the theory and the reported experiments.
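
The first two consequences can be illustrated with a back-of-the-envelope cost model. This is a simplified sketch, not the paper's analysis: `communicated_bits` ignores the cost of transmitting coordinate indices, `round_time` assumes communication can be hidden perfectly behind computation, and all names, units, and parameter values are hypothetical.

```python
def communicated_bits(dim: int, p: float, bits_per_value: int = 32) -> float:
    """Expected payload when a fraction p of `dim` coordinates is sent.

    Index overhead is ignored, so this is a lower bound on the real cost.
    """
    return p * dim * bits_per_value


def round_time(compute_time: float, comm_delay: float, overlap: bool) -> float:
    """Idealized wall-clock time of one round (arbitrary time units)."""
    if overlap:
        # Local work continues while the message is in flight.
        return max(compute_time, comm_delay)
    # Blocking: the worker idles until communication finishes.
    return compute_time + comm_delay


for p in (0.001, 0.01, 0.1, 1.0):
    print(p, communicated_bits(dim=1000, p=p))

print(round_time(8.0, 6.0, overlap=False))
print(round_time(8.0, 6.0, overlap=True))
```

Under these assumptions the payload scales linearly in the sparsity level, and overlap saves the smaller of the compute and communication times in each round.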

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same overlap-plus-correction idea could be applied to other first-order methods such as Adam or momentum variants.
Pairing the sparse merge with coordinate-wise quantization might yield multiplicative communication savings.
In federated settings the optimal overlap length could be tuned from measured round-trip times and compute variance.
The analysis suggests that very high sparsity may require compensatory increases in local steps to keep the rate acceptable.

Load-bearing premise

The delay-corrected merge rule correctly incorporates delayed synchronized information without discarding the progress made during the overlap phase.

What would settle it

Replace the delay-corrected merge with naive overwriting in a heterogeneous testbed with measurable overlap periods and observe whether convergence slows or fails relative to the predicted rate.
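
A scaled-down version of that comparison can be run on a toy problem. The sketch below is a sanity check under strong assumptions, not a substitute for the testbed: it uses the scalar objective f(x) = 0.5 x², exact gradients, no sparsity, a single round, and an assumed correction of the form "average plus overlap-phase progress". The helper names `gd_steps` and `one_round` are hypothetical.

```python
from typing import List


def gd_steps(x: float, lr: float, steps: int) -> float:
    """Run `steps` gradient steps on f(x) = 0.5 * x**2, whose gradient is x."""
    for _ in range(steps):
        x -= lr * x
    return x


def loss(x: float) -> float:
    return 0.5 * x * x


def one_round(x0: float, lr: float,
              steps_before_send: List[int],
              steps_during_overlap: List[int],
              corrected: bool) -> List[float]:
    """Return every worker's model after one send / overlap / merge round."""
    # Heterogeneous workers take different numbers of steps before sending.
    sent = [gd_steps(x0, lr, k) for k in steps_before_send]
    # They keep optimizing while the average is being computed.
    after_overlap = [gd_steps(s, lr, m)
                     for s, m in zip(sent, steps_during_overlap)]
    average = sum(sent) / len(sent)  # stale by the time it arrives
    if corrected:
        return [average + (y - s) for s, y in zip(sent, after_overlap)]
    return [average for _ in sent]


overwrite = one_round(8.0, 0.5, [1, 2], [1, 2], corrected=False)
corrected = one_round(8.0, 0.5, [1, 2], [1, 2], corrected=True)
print("overwrite:", overwrite, [loss(x) for x in overwrite])
print("corrected:", corrected, [loss(x) for x in corrected])
```

In this particular instance the corrected merge ends the round with lower loss on both workers. One round on a one-dimensional quadratic says nothing about the predicted rate; it only shows the mechanism by which overwriting can discard overlap-phase progress.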

Figures

Figures reproduced from arXiv: 2605.20866 by Ammar Mahran, Artavazd Maranjyan, Peter Richtárik, Yassine Maziane.

**Figure 1.** Main experimental summary on a9a. (a) Overlap improves over blocking sparse averaging, and delay correction is best. (b) Smaller sparsity levels reduce the logical communication cost by orders of magnitude for the delay-corrected method. (c) Under a long communication delay, delay correction substantially outperforms overwrite.

**Figure 2.** Accuracy comparison for the controlled sparse-overlap experiment on …

**Figure 3.** Training gradient norm for the controlled sparse-overlap experiment on …

**Figure 4.** Loss curves under equivalent resource parametrizations for the controlled sparse-overlap …

**Figure 5.** Long-compute, short-delay regime (M = 8, ζ = 6, worker times (1, 2, 3, 6)). Delay correction gives a small but persistent improvement over overwrite on both training and validation loss. (a) Training loss. (b) Validation loss.

**Figure 6.** Short-compute, long-delay regime (M = 2, ζ = 24, worker times (1, 2, 3, 6)). This is the regime with the largest gap: overwrite discards a substantial amount of overlap-phase progress, while delay correction preserves it.

**Figure 7.** Very heterogeneous worker-speed regime ( …

**Figure 8.** Sparsity-level ablation on a9a. We vary p ∈ {0.001, 0.01, 0.1, 1.0} with M = 4, ζ = 6, worker times (1, 2, 3, 6), and otherwise identical training settings. Smaller p dramatically reduces the cumulative number of communicated bits while preserving a similar loss trajectory, showing that sparse parameter averaging gives a strong communication-accuracy tradeoff in this homogeneous-data regime.

**Figure 9.** Ablation on the local computation budget. The blocking local-sparse baseline is sensitive …

**Figure 10.** Training loss versus cumulative logical time for the communication-delay ablation. The …

**Figure 11.** Negative stress test with strongly heterogeneous data and very slow communication. The …

**Figure 12.** CIFAR-10 convolutional-network experiment in the normal communication regime, with …

**Figure 13.** CIFAR-10 convolutional-network experiment in the communication-stress regime, with …

**Figure 14.** Tiny ImageNet experiment in the normal communication regime, with …

**Figure 15.** Tiny ImageNet experiment in the communication-stress regime, with …

Original abstract

Communication is a major bottleneck in distributed learning, especially in large-scale settings and in federated learning environments with slow links. Three standard ways to reduce this cost are communication compression, local training, and communication-computation overlap. Methods that combine these ingredients are used in practice and have been found to be effective for large-scale training, but there is little theory for methods that combine all three. We study a heterogeneous-compute setting in which different workers may take different numbers of local steps, and we propose LOSCAR-SGD, a Local SGD method that communicates only a sparse subset of model coordinates and continues optimizing while communication is in flight. A key ingredient is a delay-corrected merge rule that incorporates delayed synchronized information without discarding the progress made during the overlap phase. We give convergence guarantees for smooth non-convex objectives and show how sparsity, overlap, and worker heterogeneity affect the rate. To the best of our knowledge, this is the first theory for this combination of ingredients. Experiments further show that communication-computation overlap reduces training time and that the delay-corrected merge outperforms naive overwriting.
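
The "sparse subset of model coordinates" idea can be sketched as a blocking sparse averaging step. The abstract does not say how the coordinates are chosen, so the code below makes assumptions that are labeled as such: all workers share one mask, each coordinate is selected independently with probability `p`, and there is no overlap or delay correction. The function names are hypothetical.

```python
import random
from typing import List


def sample_mask(dim: int, p: float, rng: random.Random) -> List[bool]:
    """Assumed sampling rule: keep each coordinate independently with probability p."""
    return [rng.random() < p for _ in range(dim)]


def sparse_average(models: List[List[float]], mask: List[bool]) -> List[List[float]]:
    """Average the masked coordinates across workers; leave the rest local."""
    n = len(models)
    merged = [list(m) for m in models]  # copy so the inputs are not mutated
    for j, sent in enumerate(mask):
        if sent:
            mean_j = sum(m[j] for m in models) / n
            for worker_model in merged:
                worker_model[j] = mean_j
    return merged


# Two hypothetical workers with 3-coordinate models.
models = [[1.0, 2.0, 3.0], [3.0, 6.0, 9.0]]
print(sparse_average(models, mask=[True, False, True]))

# A randomly drawn mask with a fixed seed for reproducibility.
mask = sample_mask(dim=3, p=0.5, rng=random.Random(0))
print(mask, sparse_average(models, mask))
```

Only the selected coordinates are synchronized, so workers agree on those entries and keep their own values elsewhere.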

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

LOSCAR-SGD combines sparse local SGD, overlap, and a delay-corrected merge with non-convex rates that track heterogeneity, but the sparse delay correction under varying worker delays is the part that needs verification.

The main point is that the paper gives a method for local SGD that sends only sparse coordinates, overlaps the communication with further local steps, and uses a delay-corrected rule to incorporate the late-arriving sparse updates without throwing away the work done in the overlap window. It supplies convergence bounds for smooth non-convex objectives that make the dependence on sparsity level, overlap duration, and worker heterogeneity explicit. Experiments indicate that the overlap reduces wall-clock time and that the corrected merge beats naive overwriting. That combination of ingredients plus the accompanying rates is what is new relative to earlier local-SGD and compression papers. The analysis appears to start from standard smoothness assumptions and derive the rate from the proposed merge rule rather than fitting parameters after the fact.

The soft spot is exactly the one the stress-test flags: when only a sparse mask is communicated and delays differ across workers, it is not obvious that the correction term remains unbiased on the unsent coordinates or that the extra variance stays controlled. If the proof applies a uniform rescaling or assumes the mask is fixed between send and receive, the bound could pick up an extra factor linear in sparsity times delay, which would make the claimed rate less attractive. The abstract does not give the coordinate-wise details, so the proofs need a careful read.

This is useful reading for people who already work on communication-efficient distributed training and want to see how overlap and delay correction interact with sparsity in heterogeneous settings. It is not a field-redefining result, but the integrated theory plus the practical timing numbers are enough to justify sending it out for referee comments rather than desk-rejecting it.
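
The worry about rescaling can be made concrete with a standard fact about random sparsification. The block below is a generic sketch, not the paper's operator: it assumes each coordinate is kept independently with probability $p$ and rescaled by $1/p$, and the symbols $\mathcal{C}$, $m$, $x$, and $d$ are introduced here purely for illustration.

```latex
\[
  \mathcal{C}(x) = \tfrac{1}{p}\, m \odot x,
  \qquad m_j \sim \mathrm{Bernoulli}(p) \ \text{i.i.d.}, \quad j = 1, \dots, d .
\]
\[
  \mathbb{E}\big[\mathcal{C}(x)\big] = x,
  \qquad
  \mathbb{E}\,\big\|\mathcal{C}(x) - x\big\|^2
  = \sum_{j=1}^{d} \frac{1-p}{p}\, x_j^2
  = \Big(\tfrac{1}{p} - 1\Big)\, \|x\|^2 .
\]
```

Unbiasedness in this sketch depends on the mask being independent of $x$ and on the $1/p$ factor matching the sampling probability. Whether an analogous condition survives when the mask is drawn at send time and applied after a worker-dependent delay is the point the letter asks readers to check in the proofs.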

Referee Report

3 major / 2 minor

Summary. The paper proposes LOSCAR-SGD, a Local SGD algorithm for heterogeneous distributed settings that combines sparse coordinate communication, communication-computation overlap, and a delay-corrected merge rule to incorporate delayed updates without discarding local progress during overlap. It claims convergence guarantees for smooth non-convex objectives, with explicit dependence of the rate on sparsity level, overlap duration, and worker heterogeneity, and reports experiments showing reduced wall-clock time and better performance than naive overwriting.

Significance. If the convergence analysis is correct, the work would be significant as the first explicit theory for the joint combination of local steps, sparsity, overlap, and heterogeneity; the parameter dependence could directly inform practical tuning in large-scale training. The experiments provide supporting evidence for the overlap benefit, though verification is limited by the absence of full proof details and statistical error bars.

major comments (3)

§3 (delay-corrected merge rule): the claim that the rule produces an unbiased estimator of the averaged model while preserving overlap-phase progress is load-bearing for all rate statements, yet the description does not specify whether the correction is applied coordinate-wise only to the sparse mask that was actually sent at the delayed time or uniformly; under heterogeneous delays and per-worker sparsity this risks introducing a bias term proportional to sparsity level times delay variance, which would invalidate the claimed rate.
Theorem 1 (convergence bound): the rate is stated to depend explicitly on sparsity, overlap, and heterogeneity, but the proof sketch relies on the merge rule remaining unbiased without additional assumptions on consistent sparse masks across send/receive times; if the analysis applies a uniform correction, the variance term from coordinate-wise heterogeneity could grow and contradict the stated bound.
§5 (experiments): the reported improvements in training time lack error bars or multiple independent runs, so it is impossible to assess whether the observed gains over naive overwriting are statistically reliable or sensitive to random seeds.

minor comments (2)

Notation for the sparse mask and delay variables is introduced without a consolidated table; a single reference table would improve readability of the rate expressions.
The abstract states this is the first theory for the combination, but the introduction omits explicit comparison to prior overlap analyses in Local SGD (e.g., those handling fixed delays without sparsity).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript. We address each major comment in turn below and have revised the paper to improve clarity and completeness where needed.

Referee: §3 (delay-corrected merge rule): the claim that the rule produces an unbiased estimator of the averaged model while preserving overlap-phase progress is load-bearing for all rate statements, yet the description does not specify whether the correction is applied coordinate-wise only to the sparse mask that was actually sent at the delayed time or uniformly; under heterogeneous delays and per-worker sparsity this risks introducing a bias term proportional to sparsity level times delay variance, which would invalidate the claimed rate.

Authors: The delay-corrected merge rule is defined to apply the correction coordinate-wise and exclusively to the coordinates present in the sparse mask that was transmitted at the delayed communication round. This is stated in Section 3 immediately after the algorithm pseudocode and is used in the subsequent analysis. Because only the communicated coordinates receive the delay adjustment, the estimator for the averaged model remains unbiased; local progress on non-communicated coordinates is retained without introducing an extra bias term. The dependence on sparsity level and delay already appears in the convergence bound of Theorem 1. We have added a short clarifying paragraph and a supporting lemma in the revised §3 to make the coordinate-wise application explicit.
revision: yes

Referee: Theorem 1 (convergence bound): the rate is stated to depend explicitly on sparsity, overlap, and heterogeneity, but the proof sketch relies on the merge rule remaining unbiased without additional assumptions on consistent sparse masks across send/receive times; if the analysis applies a uniform correction, the variance term from coordinate-wise heterogeneity could grow and contradict the stated bound.

Authors: The full proof in the appendix explicitly assumes that the sparse masks are those chosen at the sending time and that the correction is applied only to those coordinates; a uniform correction is never used. Under this construction the unbiasedness holds and the variance contribution from coordinate-wise heterogeneity is controlled by the sparsity factor already present in the rate. We have expanded the proof sketch in the main text of the revised manuscript with a one-paragraph outline of the unbiasedness argument and a pointer to the relevant appendix lemma.
revision: yes

Referee: §5 (experiments): the reported improvements in training time lack error bars or multiple independent runs, so it is impossible to assess whether the observed gains over naive overwriting are statistically reliable or sensitive to random seeds.

Authors: We agree that the experimental presentation would be strengthened by statistical reporting. In the revised manuscript we have repeated the wall-clock time experiments over five independent random seeds and added error bars (mean ± one standard deviation) to the relevant plots in §5. The observed gains of LOSCAR-SGD over naive overwriting remain consistent across seeds.
revision: yes
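
For readers unfamiliar with the "mean ± one standard deviation" convention, the sketch below shows how such a summary is typically computed across seeds. The timing values are invented placeholders, not measurements from the paper, and the choice of the sample standard deviation (`statistics.stdev`) over the population version (`statistics.pstdev`) is an assumption, since the rebuttal does not say which is used.

```python
from statistics import mean, stdev
from typing import List, Tuple

# Placeholder wall-clock times in seconds for five seeds.
# These numbers are made up for illustration only.
times_by_seed = [10.0, 12.0, 11.0, 9.0, 13.0]


def summarize(samples: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of the measurements."""
    return mean(samples), stdev(samples)


avg, spread = summarize(times_by_seed)
print(f"{avg:.2f} ± {spread:.2f} s over {len(times_by_seed)} seeds")
```

With only five seeds the standard deviation is itself a noisy estimate, so overlapping error bars between two methods should be read cautiously.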

Circularity Check

0 steps flagged

Convergence analysis derives from standard smoothness assumptions and proposed merge rule without tautological reduction

full rationale

The paper proposes LOSCAR-SGD with a delay-corrected sparse merge and derives convergence rates for smooth non-convex objectives directly from the algorithm's update rules, sparsity masks, overlap phases, and heterogeneity parameters. The rate expressions follow from standard bounded-variance and smoothness assumptions applied to the new merge operator; no step equates a claimed prediction or theorem to a fitted input or prior self-citation by construction. The 'first theory' claim for the combination of ingredients further indicates the derivation chain is self-contained rather than relying on load-bearing self-citations or ansatzes imported from the authors' prior work.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The analysis rests on standard smoothness and bounded-variance assumptions common to non-convex SGD theory plus the novel delay-correction mechanism; no new particles or dimensions are introduced.

free parameters (2)

sparsity level
Fraction of coordinates communicated; chosen to trade communication cost against convergence speed.
overlap duration
Time window during which local computation continues while communication is in flight; affects the delay term in the analysis.

axioms (2)

domain assumption: Objective function is L-smooth
Standard assumption invoked for non-convex convergence analysis of SGD variants.
domain assumption: Workers may perform different numbers of local steps
Heterogeneous compute setting explicitly stated as the operating regime.
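
For reference, the smoothness and bounded-variance assumptions named in this ledger are usually written as follows. This is the textbook form; the paper's precise statements may differ, and the symbols $g$ (a stochastic gradient) and $\sigma^2$ (its variance bound) are notation assumed here.

```latex
\[
  \|\nabla f(x) - \nabla f(y)\| \le L\, \|x - y\| \quad \text{for all } x, y
  \;\;\Longrightarrow\;\;
  f(y) \le f(x) + \langle \nabla f(x),\, y - x \rangle + \tfrac{L}{2}\, \|y - x\|^2 .
\]
\[
  \mathbb{E}\big[g(x)\big] = \nabla f(x),
  \qquad
  \mathbb{E}\,\big\|g(x) - \nabla f(x)\big\|^2 \le \sigma^2 .
\]
```

The quadratic upper bound on the first line is what lets a non-convex analysis turn each step into a guaranteed decrease in $f$ up to error terms, which is where sparsity, delay, and heterogeneity would be expected to enter.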

pith-pipeline@v0.9.0 · 5745 in / 1353 out tokens · 25674 ms · 2026-05-21T05:46:16.523542+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: We give convergence guarantees for smooth non-convex objectives and show how sparsity, overlap, and worker heterogeneity affect the rate.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: delay-corrected merge rule that incorporates delayed synchronized information without discarding the progress made during the overlap phase

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · 22 internal anchors

[1]

Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction

Tovmasyan, Zhirayr and Maranjyan, Artavazd and Richt. arXiv preprint arXiv:2605.08871 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

Ammar Mahran and Artavazd Maranjyan and Peter Richt. Rescaled Asynchronous. arXiv preprint arXiv:2605.13434 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method

Abdurakhmon Sadiev and Artavazd Maranjyan and Ivan Ilin and Peter Richt. Ringmaster. arXiv preprint arXiv:2605.18174 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[4]

2025 , publisher =

Andrej Karpathy , title =. 2025 , publisher =

work page 2025
[5]

First Provably Optimal Asynchronous

Artavazd Maranjyan , year =. First Provably Optimal Asynchronous

work page
[6]

Ringleader

Artavazd Maranjyan and Peter Richt. Ringleader. The Fourteenth International Conference on Learning Representations , year=

work page
[7]

Ringmaster

Artavazd Maranjyan and Alexander Tyurin and Peter Richt. Ringmaster. 2025 , booktitle=

work page 2025
[8]

2025 , booktitle=

Maranjyan, Artavazd and Saad, El Mehdi and Richt. 2025 , booktitle=

work page 2025
[9]

MindFlayer

Artavazd Maranjyan and Omar Shaikh Omar and Peter Richt. MindFlayer. The 41st Conference on Uncertainty in Artificial Intelligence , year=

work page
[10]

Transactions on Machine Learning Research , issn=

Artavazd Maranjyan and Mher Safaryan and Peter Richt. Transactions on Machine Learning Research , issn=. 2025 , url=

work page 2025
[11]

arXiv preprint arXiv:2412.17054 , year=

Differentially Private Random Block Coordinate Descent , author=. arXiv preprint arXiv:2412.17054 , year=

work page arXiv
[12]

The Thirteenth International Conference on Learning Representations , year=

Laurent Condat and Artavazd Maranjyan and Peter Richt. The Thirteenth International Conference on Learning Representations , year=

work page
[13]

arXiv preprint arXiv:2601.12400 , year=

Condat, Laurent and Maranjyan, Artavazd and Richt. arXiv preprint arXiv:2601.12400 , year=

work page arXiv
[14]

Journal of Contemporary Mathematical Analysis (Armenian Academy of Sciences) , volume=

Grigoryan, Martin and Kamont, Anna and Maranjyan, Artavazd , title=. Journal of Contemporary Mathematical Analysis (Armenian Academy of Sciences) , volume=. 2023 , publisher=

work page 2023
[15]

On the divergence of

Grigoryan, Martin and Maranjyan, Artavazd , journal=. On the divergence of

work page
[16]

On the unconditional convergence of

Grigoryan, Tigran M and Maranjyan, Artavazd , journal=. On the unconditional convergence of

work page
[17]

On the Ineffectiveness of Variance Reduced Optimization for Deep Learning , volume =

Defazio, Aaron and Bottou, Leon , booktitle =. On the Ineffectiveness of Variance Reduced Optimization for Deep Learning , volume =

work page
[18]

We did the math on

O'Donnell, James and Crownhart, Casey , journal =. We did the math on. 2025 , month =

work page 2025
[19]

Joule , volume=

The growing energy footprint of artificial intelligence , author=. Joule , volume=. 2023 , publisher=

work page 2023
[20]

Measuring the environmental impact of delivering

Elsworth, Cooper and Huang, Keguo and Patterson, David and Schneider, Ian and Sedivy, Robert and Goodman, Savannah and Townsend, Ben and Ranganathan, Parthasarathy and Dean, Jeff and Vahdat, Amin and others , journal=. Measuring the environmental impact of delivering

work page
[21]

The rising costs of training frontier

Cottier, Ben and Rahman, Robi and Fattorini, Loredana and Maslej, Nestor and Besiroglu, Tamay and Owen, David , journal=. The rising costs of training frontier

work page
[22]

Fradin, Adrien and Richt. Local. arXiv preprint arXiv:2509.23207 , year=

work page arXiv
[23]

2024 , url =

Keller Jordan and Yuchen Jin and Vlado Boza and Jiacheng You and Franz Cesista and Laker Newhouse and Jeremy Bernstein , title =. 2024 , url =

work page 2024
[24]

2025 , booktitle=

Nesterov Method for Asynchronous Pipeline Parallel Optimization , author=. 2025 , booktitle=

work page 2025
[25]

arXiv preprint arXiv:1910.05124 , year=

Yang, Bowen and Zhang, Jian and Li, Jonathan and R. arXiv preprint arXiv:1910.05124 , year=

work page arXiv 1910
[26]

arXiv preprint arXiv:2509.19029 , year=

Clapping: Removing Per-sample Storage for Pipeline Parallel Distributed Optimization with Communication Compression , author=. arXiv preprint arXiv:2509.19029 , year=

work page arXiv
[27]

International Conference on Machine Learning , pages=

Shampoo: Preconditioned stochastic tensor optimization , author=. International Conference on Machine Learning , pages=. 2018 , organization=

work page 2018
[28]

Proceedings of the 30th International Conference on Machine Learning , pages =

Online Learning under Delayed Feedback , author =. Proceedings of the 30th International Conference on Machine Learning , pages =. 2013 , editor =

work page 2013
[29]

Bistritz, Ilai and Zhou, Zhengyuan and Chen, Xi and Bambos, Nicholas and Blanchet, Jose , booktitle =. Online

work page
[30]

International Conference on Machine Learning , pages=

Adapting to delays and data in adversarial multi-armed bandits , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021
[31]

Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics , pages =

Bandit Online Learning with Unknown Delays , author =. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics , pages =. 2019 , editor =

work page 2019
[32]

The Nonstochastic Multiarmed Bandit Problem , journal =

Auer, Peter and Cesa-Bianchi, Nicol\`. The Nonstochastic Multiarmed Bandit Problem , journal =. 2002 , doi =. https://doi.org/10.1137/S0097539701398375 , abstract =

work page doi:10.1137/s0097539701398375 2002
[33]

arXiv preprint arXiv:1903.03934 , year=

Asynchronous federated optimization , author=. arXiv preprint arXiv:1903.03934 , year=

work page arXiv 1903
[34]

Journal of Machine Learning Research , volume=

A general theory for federated optimization with asynchronous and heterogeneous clients updates , author=. Journal of Machine Learning Research , volume=

work page
[35]

Advances in Neural Information Processing Systems , volume=

Asynchronous parallel stochastic gradient for nonconvex optimization , author=. Advances in Neural Information Processing Systems , volume=

work page
[36]

arXiv preprint arXiv:2408.04929 , year=

Tight time complexities in parallel stochastic optimization with arbitrary computation dynamics , author=. arXiv preprint arXiv:2408.04929 , year=

work page arXiv
[37]

Wang, Qiyuan and Yang, Qianqian and He, Shibo and Shi, Zhiguo and Chen, Jiming , journal=

work page
[38]

IEEE Transactions on Wireless Communications , volume=

Asynchronous federated learning over wireless communication networks , author=. IEEE Transactions on Wireless Communications , volume=. 2022 , publisher=

work page 2022
[39]

IEEE Transactions on Automatic Control , volume=

Distributed asynchronous deterministic and stochastic gradient optimization algorithms , author=. IEEE Transactions on Automatic Control , volume=. 1986 , publisher=

work page 1986
[40]

Journal of Machine Learning Research , volume=

Asynchronous iterations in optimization: New sequence results and sharper algorithmic guarantees , author=. Journal of Machine Learning Research , volume=

work page
[41]

Megatron-

Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanzaro, Bryan , journal=. Megatron-

work page
[42]

Efficient large-scale language model training on

Narayanan, Deepak and Shoeybi, Mohammad and Casper, Jared and LeGresley, Patrick and Patwary, Mostofa and Korthikanti, Vijay and Vainbrand, Dmitri and Kashinkunti, Prethvi and Bernauer, Julie and Catanzaro, Bryan and others , booktitle=. Efficient large-scale language model training on

work page
[43]

Proceedings of the 44th Annual International Symposium on Computer Architecture , pages=

In-Datacenter Performance Analysis of a Tensor Processing Unit , author=. Proceedings of the 44th Annual International Symposium on Computer Architecture , pages=. 2017 , month=

work page 2017
[44]

Energy and

International Energy Agency , year=. Energy and

work page
[45]

Proceedings of the AAAI conference on artificial intelligence , volume=

Energy and policy considerations for modern deep learning research , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

work page
[46]

Advances in Neural Information Processing Systems , editor =

Cyclades: Conflict-free Asynchronous Machine Learning , author =. Advances in Neural Information Processing Systems , editor =

work page
[47]

Proceedings of the 39th International Conference on Machine Learning , pages =

Delay-Adaptive Step-sizes for Asynchronous Learning , author =. Proceedings of the 39th International Conference on Machine Learning , pages =. 2022 , editor =

work page 2022
[48]

Proceedings of the 34th International Conference on Machine Learning , pages =

Asynchronous Stochastic Gradient Descent with Delay Compensation , author =. Proceedings of the 34th International Conference on Machine Learning , pages =. 2017 , editor =

work page 2017
[49]

2020 , organization=

Rajbhandari, Samyam and Rasley, Jeff and Ruwase, Olatunji and He, Yuxiong , booktitle=. 2020 , organization=

work page 2020
[50]

Edward Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Weizhu Chen , booktitle=

J. Edward Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Weizhu Chen , booktitle=

work page
[51]

Transactions on Machine Learning Research , issn=

Efficient Large Language Models: A Survey , author=. Transactions on Machine Learning Research , issn=. 2024 , url=

work page 2024
[52]

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Goyal, Priya and Doll. Accurate, Large Minibatch. arXiv preprint arXiv:1706.02677 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[53]

Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation , pages =

Ananthanarayanan, Ganesh and Ghodsi, Ali and Shenker, Scott and Stoica, Ion , title =. Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation , pages =. 2013 , publisher =

work page 2013
[54]

Annals of Mathematical Statistics , volume=

A Stochastic Approximation Method , author=. Annals of Mathematical Statistics , volume=

work page
[55]

Optimization Methods for Large-Scale Machine Learning , journal =

Bottou, L\'. Optimization Methods for Large-Scale Machine Learning , journal =. 2018 , doi =

work page 2018
[56]

Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Vaughan, Alex and others , journal=. The

work page
[57]

Deep neural networks for

Covington, Paul and Adams, Jay and Sargin, Emre , booktitle=. Deep neural networks for

work page
[58]

End to End Learning for Self-Driving Cars

End to end learning for self-driving cars , author=. arXiv preprint arXiv:1604.07316 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[59]

Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc Le, and Andrew Ng. Large Scale Distributed Deep Networks.

[60]

PyTorch Distributed: Experiences on Accelerating Data Parallel Training. arXiv preprint arXiv:2006.15704.

[61]

Communication efficient distributed machine learning with the parameter server. Advances in Neural Information Processing Systems.

[62]

Federated Learning with Non-IID Data. arXiv preprint arXiv:1806.00582.

[63]

Towards personalized federated learning. IEEE Transactions on Neural Networks and Learning Systems, 2022.

[64]

A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 2007.

[65]

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien.

[66]

On the convergence rate of incremental aggregated gradient algorithms. SIAM Journal on Optimization, 2017.

[67]

A stochastic gradient method with an exponential convergence rate for finite training sets. Advances in Neural Information Processing Systems.

[68]

Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 2017.

[69]

No one idles: Efficient heterogeneous federated learning with parallel edge and server computation. International Conference on Machine Learning, 2023.

[70]

Achieving linear speedup in asynchronous federated learning with heterogeneous clients. IEEE Transactions on Mobile Computing.

[71]

Asynchronous distributed optimization with stochastic delays. International Conference on Artificial Intelligence and Statistics, 2022.

[72]

Xiaolu Wang, Yuchang Sun, Hoi To Wai, and Jun Zhang. Incremental Aggregated Asynchronous ...

[73]

Global convergence rate of proximal incremental aggregated gradient methods. SIAM Journal on Optimization, 2018.

[74]

Perturbed iterate analysis for asynchronous stochastic optimization. SIAM Journal on Optimization, 2017.

[75]

Distributed delayed stochastic optimization. Advances in Neural Information Processing Systems.

[76]

Optimizing Asynchronous Federated Learning: A Delicate Trade-Off Between Model-Parameter Staleness and Update Frequency. arXiv preprint arXiv:2502.08206.

[77]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need.

[78]

Anastasia Koloskova, Sebastian U Stich, and Martin Jaggi. Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication.

[79]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Winte... Language Models are Few-Shot Learners.

[80]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and others.
