Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method

Abdurakhmon Sadiev; Artavazd Maranjyan; Ivan Ilin; Peter Richtárik

arXiv: 2605.18174 · v1 · pith:7SLGFUZRnew · submitted 2026-05-18 · 💻 cs.LG · cs.DC · math.OC · stat.ML


Pith reviewed 2026-05-20 13:04 UTC · model grok-4.3

classification 💻 cs.LG · cs.DC · math.OC · stat.ML

keywords: asynchronous optimization, linear minimization oracle, momentum method, delay thresholding, nonconvex optimization, stochastic optimization, distributed training, time complexity


The pith

Ringmaster LMO extends delay-thresholding to LMO momentum updates for asynchronous training and recovers optimal time complexity in smooth settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Ringmaster LMO as an asynchronous momentum method based on linear minimization oracles for stochastic nonconvex problems. It adapts the delay-thresholding rule that skips stale gradients from earlier Ringmaster ASGD work so that it applies to general LMO steps. Convergence is shown under generalized (L0, L1)-smoothness, and these iteration bounds are turned into time-complexity results for systems where workers finish computations at different rates. In the standard Euclidean smooth case the time complexity matches the best known for Ringmaster ASGD. A parameter-agnostic version with decreasing steps and adaptive thresholds is also given, and tests on quadratics and language-model pretraining indicate larger gains as heterogeneity increases.
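To make the term "LMO step" concrete, here is a minimal sketch of linear minimization oracles over three unit balls. The function names are hypothetical and the choice of norms is an illustration, not taken from the paper; the spectral-norm case is the one usually associated with Muon-style matrix updates.

```python
import numpy as np

def lmo_euclidean(m):
    """Minimizer of <m, d> over the Euclidean unit ball: -m / ||m||_2."""
    norm = np.linalg.norm(m)
    return -m / norm if norm > 0 else np.zeros_like(m)

def lmo_linf(m):
    """Minimizer of <m, d> over the l-infinity unit ball: -sign(m)."""
    return -np.sign(m)

def lmo_spectral(M):
    """Minimizer of <M, D> over the spectral-norm unit ball: -U V^T (reduced SVD)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return -U @ Vt
```

In each case the attained minimum equals the negative dual norm of the input: the Euclidean norm, the l1 norm, and the nuclear norm respectively.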

Core claim

Ringmaster LMO is an asynchronous LMO-based momentum method that applies a delay-thresholding rule to discard overly stale LMO updates, yielding convergence guarantees under generalized (L0, L1)-smoothness and time-complexity bounds that recover the optimal performance of Ringmaster ASGD in the classical Euclidean smooth setting.

What carries the argument

Delay-thresholding rule extended to discard stale LMO updates while preserving convergence behavior.

If this is right

Convergence guarantees are established for unconstrained stochastic nonconvex optimization.
Time complexity bounds are obtained for heterogeneous worker computation times.
A parameter-agnostic variant uses decreasing stepsizes and adaptive delay thresholds.
Empirical advantages increase with greater system heterogeneity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same thresholding idea could shorten wall-clock time for large-model training on real clusters with uneven hardware speeds.
The approach may transfer to other structured oracles or momentum variants in distributed nonconvex settings.
Performance under convex or strongly convex assumptions remains open for separate analysis.

Load-bearing premise

The delay-thresholding rule extends without modification to LMO momentum updates while preserving the same convergence behavior under generalized (L0, L1)-smoothness.

What would settle it

A controlled experiment showing that Ringmaster LMO fails to match the time complexity of Ringmaster ASGD under heterogeneous worker delays in the Euclidean smooth case would disprove the recovery claim.

Figures

Figures reproduced from arXiv: 2605.18174 by Abdurakhmon Sadiev, Artavazd Maranjyan, Ivan Ilin, Peter Richtárik.

**Figure 2.** NanoChat training loss versus simulated runtime for a ... [PITH_FULL_IMAGE:figures/full_fig_p012_2.png]

**Figure 3.** NanoChat training loss versus simulated runtime for the same model with ... [PITH_FULL_IMAGE:figures/full_fig_p012_3.png]

Original abstract

Muon has recently emerged as a strong alternative to AdamW for training neural networks, with encouraging large-scale pretraining results and growing evidence that matrix-structured updates can be faster in practice. Yet Muon, and more generally Linear Minimization Oracle (LMO) based methods, are typically used synchronously. This is problematic in heterogeneous distributed systems, where workers complete gradient computations at different speeds and synchronous training must repeatedly wait for slower workers. In this work, we introduce Ringmaster LMO, an asynchronous LMO-based momentum method for unconstrained stochastic nonconvex optimization. Our method builds on the delay-thresholding idea of Ringmaster ASGD. For SGD-type methods, Ringmaster ASGD achieves optimal time complexity by discarding overly stale gradients. Ringmaster LMO extends this mechanism to general LMO-based updates. We establish convergence guarantees under generalized $(L_0, L_1)$-smoothness and further develop a parameter-agnostic variant with decreasing stepsizes and adaptive delay thresholds. Finally, we translate our iteration guarantees into time complexity bounds under heterogeneous worker computation times. In the classical Euclidean smooth setting, these bounds recover the optimal time complexity of Ringmaster ASGD. Experiments on stochastic quadratic problems and NanoChat language-model pretraining show that the advantages of Ringmaster LMO grow with system heterogeneity and that the method outperforms strong synchronous and asynchronous baselines.
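The cost of waiting for slow workers can be illustrated with simple arithmetic. The sketch below assumes fixed, deterministic per-gradient compute times (a simplification that is not the paper's model) and compares raw gradient throughput only; it ignores staleness, which is the issue delay thresholding is meant to address. The function names and the example timings are hypothetical.

```python
def sync_rate(times):
    """Gradients per unit time when every round waits for the slowest worker."""
    return len(times) / max(times)

def async_rate(times):
    """Gradients per unit time when each worker reports as soon as it finishes."""
    return sum(1.0 / t for t in times)

homogeneous = [1.0, 1.0, 1.0, 1.0]     # all workers equally fast
heterogeneous = [1.0, 1.0, 1.0, 10.0]  # one worker is ten times slower

for name, times in [("homogeneous", homogeneous), ("heterogeneous", heterogeneous)]:
    print(name, sync_rate(times), async_rate(times))
```

With equal speeds the two rates coincide at 4 gradients per unit time. With one slow worker the synchronous rate drops to 0.4 while the asynchronous rate stays at about 3.1, which is the gap the abstract refers to.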

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Ringmaster LMO adapts delay-thresholding to LMO momentum, but the momentum buffer may add an unaccounted error term when stale updates are discarded.


The core contribution is extending Ringmaster ASGD's delay-thresholding to LMO-based momentum methods for asynchronous non-convex optimization. It claims the same optimal time complexity in the Euclidean smooth case while handling heterogeneous worker speeds through adaptive thresholds and a parameter-agnostic variant under (L0, L1)-smoothness. Experiments on quadratics and NanoChat pretraining indicate gains that scale with heterogeneity and beat listed baselines.

Referee Report

2 major / 2 minor

Summary. The paper introduces Ringmaster LMO, an asynchronous LMO-based momentum method for unconstrained stochastic nonconvex optimization. It extends the delay-thresholding mechanism of Ringmaster ASGD to discard overly stale gradients when performing LMO updates, establishes convergence guarantees under generalized (L0, L1)-smoothness, develops a parameter-agnostic variant with decreasing stepsizes and adaptive thresholds, and translates iteration bounds into time-complexity results under heterogeneous worker speeds. In the Euclidean smooth case the bounds recover the optimal time complexity of Ringmaster ASGD. Experiments on stochastic quadratics and NanoChat pretraining illustrate improved robustness to system heterogeneity relative to synchronous and asynchronous baselines.

Significance. If the central extension of delay-thresholding to LMO momentum is rigorously justified, the work would be significant: it supplies the first theoretical support for asynchronous training of modern LMO-based optimizers such as Muon, together with explicit time-complexity translation and empirical evidence on language-model pretraining. The parameter-agnostic variant and recovery of known optimal rates are additional strengths.

major comments (2)

[Convergence analysis (section on guarantees under generalized smoothness)] The central claim that the delay-thresholding rule extends without modification to LMO momentum rests on the assumption that discarding a stale gradient leaves the momentum buffer with only a controllable bias. Because the momentum update couples past gradients (typically m_t = β m_{t-1} + (1-β) g_t) before the (possibly nonlinear) LMO is applied, a discarded stale g_t leaves a stale m_t whose error is not a simple linear combination of recent gradients. The convergence analysis must therefore derive an additional error term in the descent lemma that is absent from the ASGD case; without an explicit bound on this term under (L0, L1)-smoothness the iteration-to-time-complexity translation is not yet secured.
[Definition of Ringmaster LMO and the delay-thresholding rule] The abstract states that the same delay-thresholding rule preserves identical convergence behavior for LMO-based updates. The manuscript should supply the precise statement of the rule (which gradients are discarded and how the momentum buffer is updated when a gradient is discarded) together with the corresponding lemma that controls the extra bias introduced by the stale momentum vector.
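For orientation, the update structure the referee has in mind can be written out as below. The momentum recursion is quoted from the comment above; the form of the iterate step, the stepsize $\gamma$, and the choice of norm ball are assumptions made for illustration and are not taken from the paper.

```latex
\begin{aligned}
m_t &= \beta\, m_{t-1} + (1-\beta)\, g_t, \\
\operatorname{lmo}(m_t) &\in \arg\min_{\|d\| \le 1} \langle m_t, d \rangle, \\
x_{t+1} &= x_t + \gamma\, \operatorname{lmo}(m_t).
\end{aligned}
```

In the Euclidean norm, $\operatorname{lmo}(m_t) = -m_t / \|m_t\|_2$, so the step reduces to normalized momentum descent. For other norms the oracle is nonlinear in $m_t$, which is why the referee asks for an explicit bound on the error carried by the buffer.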

minor comments (2)

[Parameter-agnostic variant] The description of the parameter-agnostic variant would benefit from an explicit statement of how the adaptive delay threshold is computed from observed worker times, including any constants that must be set in practice.
[Experiments] In the NanoChat experiments, report the precise model dimension, number of training tokens, and the implementation details of the synchronous and asynchronous baselines so that the claimed advantage with increasing heterogeneity can be reproduced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of the convergence analysis and algorithmic definition that we will clarify in the revision. We address each major comment below.


Referee: [Convergence analysis (section on guarantees under generalized smoothness)] The central claim that the delay-thresholding rule extends without modification to LMO momentum rests on the assumption that discarding a stale gradient leaves the momentum buffer with only a controllable bias. Because the momentum update couples past gradients (typically m_t = β m_{t-1} + (1-β) g_t) before the (possibly nonlinear) LMO is applied, a discarded stale g_t leaves a stale m_t whose error is not a simple linear combination of recent gradients. The convergence analysis must therefore derive an additional error term in the descent lemma that is absent from the ASGD case; without an explicit bound on this term under (L0, L1)-smoothness the iteration-to-time-complexity translation is not yet secured.

Authors: We agree that the momentum coupling requires an explicit treatment of the additional bias. Under (L0, L1)-smoothness the difference between the stale momentum vector and its fresh counterpart can be bounded by a term proportional to the delay and the smoothness constants; this term is absorbed into the existing descent inequality without altering the leading-order iteration complexity. We will insert a dedicated lemma that isolates this extra error term, shows how it is controlled by the delay threshold, and confirms that the subsequent translation to time complexity remains valid.
Revision: yes

Referee: [Definition of Ringmaster LMO and the delay-thresholding rule] The abstract states that the same delay-thresholding rule preserves identical convergence behavior for LMO-based updates. The manuscript should supply the precise statement of the rule (which gradients are discarded and how the momentum buffer is updated when a gradient is discarded) together with the corresponding lemma that controls the extra bias introduced by the stale momentum vector.

Authors: We will add an explicit algorithmic description of Ringmaster LMO that states: a gradient g_t computed by worker i is discarded if its delay d_t exceeds the current threshold τ; when discarded, the momentum update is skipped and the buffer retains its previous value m_t = m_{t-1}. We will also include the supporting lemma that bounds the bias ||m_t - m_t^*|| (where m_t^* uses only fresh gradients) under (L0, L1)-smoothness, showing that the bias remains O(τ) and is therefore compatible with the existing convergence guarantees.
Revision: yes
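The rule described in this simulated response can be sketched as a small server-side class. This is an illustration under stated assumptions, not the paper's algorithm: the class name, the Euclidean LMO, the definition of delay as the number of accepted updates since the gradient's model version, and the choice to leave the iterate unchanged on a discard are all hypothetical.

```python
import numpy as np

def lmo_euclidean(m):
    """Minimizer of <m, d> over the Euclidean unit ball (assumed norm choice)."""
    norm = np.linalg.norm(m)
    return -m / norm if norm > 0 else np.zeros_like(m)

class ThresholdedLMOServer:
    """Hypothetical server state: iterate x, momentum buffer m, update counter."""

    def __init__(self, x0, beta=0.9, stepsize=0.1, threshold=2, lmo=lmo_euclidean):
        self.x = np.asarray(x0, dtype=float)
        self.m = np.zeros_like(self.x)
        self.beta = beta
        self.stepsize = stepsize
        self.threshold = threshold
        self.lmo = lmo
        self.version = 0  # number of accepted updates so far

    def receive(self, grad, computed_at_version):
        """Apply a gradient, or discard it if its delay exceeds the threshold."""
        delay = self.version - computed_at_version  # assumed definition of delay
        if delay > self.threshold:
            return False  # too stale: m and x keep their previous values
        grad = np.asarray(grad, dtype=float)
        self.m = self.beta * self.m + (1.0 - self.beta) * grad
        self.x = self.x + self.stepsize * self.lmo(self.m)
        self.version += 1
        return True
```

A worker would record `server.version` when it reads the model and pass that value back with its gradient; `receive` returns `False` when the gradient is dropped, so the caller can hand the worker the current iterate and let it restart.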

Circularity Check

0 steps flagged

No significant circularity detected; derivation provides independent extension and proofs


The paper introduces Ringmaster LMO as a novel asynchronous extension of delay-thresholding to LMO-based momentum updates, explicitly stating that it establishes new convergence guarantees under generalized (L0, L1)-smoothness and translates iteration bounds to time complexity that recover the Ringmaster ASGD optimum only in the classical Euclidean case. No quoted step reduces a central claim to a self-definitional fit, a renamed input, or a load-bearing self-citation chain; the analysis is presented as self-contained with parameter-agnostic variants and heterogeneous-worker bounds derived from the new LMO-specific lemmas rather than by construction from prior fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the extension of delay-thresholding to LMO updates and on generalized (L0, L1)-smoothness; no explicit free parameters are named in the abstract, but the adaptive delay thresholds in the parameter-agnostic variant are likely chosen or adapted from data.

axioms (1)

Domain assumption: generalized (L0, L1)-smoothness condition for stochastic nonconvex objectives
Invoked to establish convergence guarantees for the asynchronous LMO method.
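The ledger names this assumption without stating it. One formulation that is common in the literature, for a twice-differentiable objective $f$ in the Euclidean setting, is given below. The paper's generalized version (for example, one stated with a non-Euclidean norm and its dual) may differ, so this is offered as orientation only.

```latex
\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\| \quad \text{for all } x.
```

Setting $L_1 = 0$ gives a uniformly bounded Hessian, that is, classical $L_0$-smoothness, which corresponds to the "classical Euclidean smooth setting" in which the time-complexity bounds are said to recover those of Ringmaster ASGD.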



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Ringmaster LMO extends this mechanism to general LMO-based updates... convergence guarantees under generalized (L0, L1)-smoothness"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "delay-thresholding rule that discards stale gradients"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging
cs.LG · 2026-05 · unverdicted · novelty 7.0

LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 1 Pith paper · 22 internal anchors

[1]

Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction

Tovmasyan, Zhirayr and Maranjyan, Artavazd and Richt. arXiv preprint arXiv:2605.08871 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

Ammar Mahran and Artavazd Maranjyan and Peter Richt. Rescaled Asynchronous. arXiv preprint arXiv:2605.13434 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[3]

2025 , publisher =

Andrej Karpathy , title =. 2025 , publisher =

work page 2025
[4]

First Provably Optimal Asynchronous

Artavazd Maranjyan , year =. First Provably Optimal Asynchronous

work page
[5]

Ringleader

Artavazd Maranjyan and Peter Richt. Ringleader. The Fourteenth International Conference on Learning Representations , year=

work page
[6]

Ringmaster

Artavazd Maranjyan and Alexander Tyurin and Peter Richt. Ringmaster. 2025 , booktitle=

work page 2025
[7]

2025 , booktitle=

Maranjyan, Artavazd and Saad, El Mehdi and Richt. 2025 , booktitle=

work page 2025
[8]

MindFlayer

Artavazd Maranjyan and Omar Shaikh Omar and Peter Richt. MindFlayer. The 41st Conference on Uncertainty in Artificial Intelligence , year=

work page
[9]

Transactions on Machine Learning Research , issn=

Artavazd Maranjyan and Mher Safaryan and Peter Richt. Transactions on Machine Learning Research , issn=. 2025 , url=

work page 2025
[10]

arXiv preprint arXiv:2412.17054 , year=

Differentially Private Random Block Coordinate Descent , author=. arXiv preprint arXiv:2412.17054 , year=

work page arXiv
[11]

Condat, Laurent and Maranjyan, Artavazd and Richt. Proc. of International Conference on Learning Representations (ICLR) , year=

work page
[12]

Journal of Contemporary Mathematical Analysis (Armenian Academy of Sciences) , volume=

Grigoryan, Martin and Kamont, Anna and Maranjyan, Artavazd , title=. Journal of Contemporary Mathematical Analysis (Armenian Academy of Sciences) , volume=. 2023 , publisher=

work page 2023
[13]

On the divergence of

Grigoryan, Martin and Maranjyan, Artavazd , journal=. On the divergence of

work page
[14]

On the unconditional convergence of

Grigoryan, Tigran M and Maranjyan, Artavazd , journal=. On the unconditional convergence of

work page
[15]

On the Ineffectiveness of Variance Reduced Optimization for Deep Learning , volume =

Defazio, Aaron and Bottou, Leon , booktitle =. On the Ineffectiveness of Variance Reduced Optimization for Deep Learning , volume =

work page
[16]

We did the math on

O'Donnell, James and Crownhart, Casey , journal =. We did the math on. 2025 , month =

work page 2025
[17]

Joule , volume=

The growing energy footprint of artificial intelligence , author=. Joule , volume=. 2023 , publisher=

work page 2023
[18]

Measuring the environmental impact of delivering

Elsworth, Cooper and Huang, Keguo and Patterson, David and Schneider, Ian and Sedivy, Robert and Goodman, Savannah and Townsend, Ben and Ranganathan, Parthasarathy and Dean, Jeff and Vahdat, Amin and others , journal=. Measuring the environmental impact of delivering

work page
[19]

The rising costs of training frontier

Cottier, Ben and Rahman, Robi and Fattorini, Loredana and Maslej, Nestor and Besiroglu, Tamay and Owen, David , journal=. The rising costs of training frontier

work page
[20]

Fradin, Adrien and Richt. Local. arXiv preprint arXiv:2509.23207 , year=

work page arXiv
[21]

2024 , url =

Keller Jordan and Yuchen Jin and Vlado Boza and Jiacheng You and Franz Cesista and Laker Newhouse and Jeremy Bernstein , title =. 2024 , url =

work page 2024
[22]

2025 , booktitle=

Nesterov Method for Asynchronous Pipeline Parallel Optimization , author=. 2025 , booktitle=

work page 2025
[23]

arXiv preprint arXiv:1910.05124 , year=

Pipemare: Asynchronous pipeline parallel dnn training , author=. arXiv preprint arXiv:1910.05124 , year=

work page arXiv 1910
[24]

arXiv preprint arXiv:2509.19029 , year=

Clapping: Removing Per-sample Storage for Pipeline Parallel Distributed Optimization with Communication Compression , author=. arXiv preprint arXiv:2509.19029 , year=

work page arXiv
[25]

International Conference on Machine Learning , pages=

Shampoo: Preconditioned stochastic tensor optimization , author=. International Conference on Machine Learning , pages=. 2018 , organization=

work page 2018
[26]

Proceedings of the 30th International Conference on Machine Learning , pages =

Online Learning under Delayed Feedback , author =. Proceedings of the 30th International Conference on Machine Learning , pages =. 2013 , editor =

work page 2013
[27]

Bistritz, Ilai and Zhou, Zhengyuan and Chen, Xi and Bambos, Nicholas and Blanchet, Jose , booktitle =. Online

work page
[28]

International Conference on Machine Learning , pages=

Adapting to delays and data in adversarial multi-armed bandits , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021
[29]

Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics , pages =

Bandit Online Learning with Unknown Delays , author =. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics , pages =. 2019 , editor =

work page 2019
[30]

The Nonstochastic Multiarmed Bandit Problem , journal =

Auer, Peter and Cesa-Bianchi, Nicol\`. The Nonstochastic Multiarmed Bandit Problem , journal =. 2002 , doi =. https://doi.org/10.1137/S0097539701398375 , abstract =

work page doi:10.1137/s0097539701398375 2002
[31]

arXiv preprint arXiv:1903.03934 , year=

Asynchronous federated optimization , author=. arXiv preprint arXiv:1903.03934 , year=

work page arXiv 1903
[32]

Journal of Machine Learning Research , volume=

A general theory for federated optimization with asynchronous and heterogeneous clients updates , author=. Journal of Machine Learning Research , volume=

work page
[33]

Advances in Neural Information Processing Systems , volume=

Asynchronous parallel stochastic gradient for nonconvex optimization , author=. Advances in Neural Information Processing Systems , volume=

work page
[34]

arXiv preprint arXiv:2408.04929 , year=

Tight time complexities in parallel stochastic optimization with arbitrary computation dynamics , author=. arXiv preprint arXiv:2408.04929 , year=

work page arXiv
[35]

Wang, Qiyuan and Yang, Qianqian and He, Shibo and Shi, Zhiguo and Chen, Jiming , journal=

work page
[36]

IEEE Transactions on Wireless Communications , volume=

Asynchronous federated learning over wireless communication networks , author=. IEEE Transactions on Wireless Communications , volume=. 2022 , publisher=

work page 2022
[37]

IEEE Transactions on Automatic Control , volume=

Distributed asynchronous deterministic and stochastic gradient optimization algorithms , author=. IEEE Transactions on Automatic Control , volume=. 1986 , publisher=

work page 1986
[38]

Journal of Machine Learning Research , volume=

Asynchronous iterations in optimization: New sequence results and sharper algorithmic guarantees , author=. Journal of Machine Learning Research , volume=

work page
[39]

Megatron-

Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanzaro, Bryan , journal=. Megatron-

work page
[40]

Efficient large-scale language model training on

Narayanan, Deepak and Shoeybi, Mohammad and Casper, Jared and LeGresley, Patrick and Patwary, Mostofa and Korthikanti, Vijay and Vainbrand, Dmitri and Kashinkunti, Prethvi and Bernauer, Julie and Catanzaro, Bryan and others , booktitle=. Efficient large-scale language model training on

work page
[41]

Proceedings of the 44th Annual International Symposium on Computer Architecture , pages=

In-Datacenter Performance Analysis of a Tensor Processing Unit , author=. Proceedings of the 44th Annual International Symposium on Computer Architecture , pages=. 2017 , month=

work page 2017
[42]

Energy and

International Energy Agency , year=. Energy and

work page
[43]

Proceedings of the AAAI conference on artificial intelligence , volume=

Energy and policy considerations for modern deep learning research , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

work page
[44]

Advances in Neural Information Processing Systems , editor =

Cyclades: Conflict-free Asynchronous Machine Learning , author =. Advances in Neural Information Processing Systems , editor =

work page
[45]

Proceedings of the 39th International Conference on Machine Learning , pages =

Delay-Adaptive Step-sizes for Asynchronous Learning , author =. Proceedings of the 39th International Conference on Machine Learning , pages =. 2022 , editor =

work page 2022
[46]

Proceedings of the 34th International Conference on Machine Learning , pages =

Asynchronous Stochastic Gradient Descent with Delay Compensation , author =. Proceedings of the 34th International Conference on Machine Learning , pages =. 2017 , editor =

work page 2017
[47]

2020 , organization=

Rajbhandari, Samyam and Rasley, Jeff and Ruwase, Olatunji and He, Yuxiong , booktitle=. 2020 , organization=

work page 2020
[48]

Edward Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Weizhu Chen , booktitle=

J. Edward Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Weizhu Chen , booktitle=

work page
[49]

Transactions on Machine Learning Research , issn=

Efficient Large Language Models: A Survey , author=. Transactions on Machine Learning Research , issn=. 2024 , url=

work page 2024
[50]

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Goyal, Priya and Doll. Accurate, Large Minibatch. arXiv preprint arXiv:1706.02677 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[51]

Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation , pages =

Ananthanarayanan, Ganesh and Ghodsi, Ali and Shenker, Scott and Stoica, Ion , title =. Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation , pages =. 2013 , publisher =

work page 2013
[52]

Annals of Mathematical Statistics , volume=

A Stochastic Approximation Method , author=. Annals of Mathematical Statistics , volume=

work page
[53]

Optimization Methods for Large-Scale Machine Learning , journal =

Bottou, L\'. Optimization Methods for Large-Scale Machine Learning , journal =. 2018 , doi =

work page 2018
[54]

Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Vaughan, Alex and others , journal=. The

work page
[55]

Deep neural networks for

Covington, Paul and Adams, Jay and Sargin, Emre , booktitle=. Deep neural networks for

work page
[56]

End to End Learning for Self-Driving Cars

End to end learning for self-driving cars , author=. arXiv preprint arXiv:1604.07316 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[57]

Large Scale Distributed Deep Networks , url =

Dean, Jeffrey and Corrado, Greg and Monga, Rajat and Chen, Kai and Devin, Matthieu and Mao, Mark and Ranzato, Marc aurelio and Senior, Andrew and Tucker, Paul and Yang, Ke and Le, Quoc and Ng, Andrew , booktitle =. Large Scale Distributed Deep Networks , url =

work page
[58]

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

Pytorch distributed: Experiences on accelerating data parallel training , author=. arXiv preprint arXiv:2006.15704 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2006
[59]

Advances in Neural Information Processing Systems , volume=

Communication efficient distributed machine learning with the parameter server , author=. Advances in Neural Information Processing Systems , volume=

work page
[60]

Federated Learning with Non-IID Data

Federated learning with non-iid data , author=. arXiv preprint arXiv:1806.00582 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[61]

IEEE Transactions on Neural Networks and Learning Systems , volume=

Towards personalized federated learning , author=. IEEE Transactions on Neural Networks and Learning Systems , volume=. 2022 , publisher=

work page 2022
[62]

SIAM Journal on Optimization , volume=

A convergent incremental gradient method with a constant step size , author=. SIAM Journal on Optimization , volume=. 2007 , publisher=

work page 2007
[63]

Defazio, Aaron and Bach, Francis and Lacoste-Julien, Simon , journal=

work page
[64]

SIAM Journal on Optimization , volume=

On the convergence rate of incremental aggregated gradient algorithms , author=. SIAM Journal on Optimization , volume=. 2017 , publisher=

work page 2017
[65]

Advances in Neural Information Processing Systems , volume=

A stochastic gradient method with an exponential convergence rate for finite training sets , author=. Advances in Neural Information Processing Systems , volume=

work page
[66]

Mathematical Programming , volume=

Minimizing finite sums with the stochastic average gradient , author=. Mathematical Programming , volume=. 2017 , publisher=

work page 2017
[67]

International Conference on Machine Learning , pages=

No one idles: Efficient heterogeneous federated learning with parallel edge and server computation , author=. International Conference on Machine Learning , pages=. 2023 , organization=

work page 2023
[68]

IEEE Transactions on Mobile Computing , year=

Achieving linear speedup in asynchronous federated learning with heterogeneous clients , author=. IEEE Transactions on Mobile Computing , year=

work page
[69]

International Conference on Artificial Intelligence and Statistics , pages=

Asynchronous distributed optimization with stochastic delays , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2022 , organization=

work page 2022
[70]

Incremental Aggregated Asynchronous

Xiaolu Wang and Yuchang Sun and Hoi To Wai and Jun Zhang , year=. Incremental Aggregated Asynchronous

work page
[71]

SIAM Journal on Optimization , volume=

Global convergence rate of proximal incremental aggregated gradient methods , author=. SIAM Journal on Optimization , volume=. 2018 , publisher=

work page 2018
[72]

SIAM Journal on Optimization , volume=

Perturbed iterate analysis for asynchronous stochastic optimization , author=. SIAM Journal on Optimization , volume=. 2017 , publisher=

work page 2017
[73]

Advances in Neural Information Processing Systems , volume=

Distributed delayed stochastic optimization , author=. Advances in Neural Information Processing Systems , volume=

work page
[74]

arXiv preprint arXiv:2502.08206 , year=

Optimizing Asynchronous Federated Learning: A Delicate Trade-Off Between Model-Parameter Staleness and Update Frequency , author=. arXiv preprint arXiv:2502.08206 , year=

work page arXiv
[75]

Attention is All you Need , url =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =

work page
[76]

Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication , year =

Anastasia Koloskova and Sebastian U Stich and Martin Jaggi , booktitle =. Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication , year =

work page
[77]

Language Models are Few-Shot Learners , url =

Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winte...

work page
[78]

Achiam, Josh and Adler, Steven and Agarwal, Sandhini and Ahmad, Lama and Akkaya, Ilge and Aleman, Florencia Leoni and Almeida, Diogo and Altenschmidt, Janko and Altman, Sam and Anadkat, Shyamal and others , journal=

work page
[79]

Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805.
[80]

Cong Fang and Chris Junchi Li and Zhouchen Lin and Tong Zhang.

Showing first 80 references.
