arxiv: 2605.18174 · v1 · pith:7SLGFUZRnew · submitted 2026-05-18 · 💻 cs.LG · cs.DC · math.OC · stat.ML

Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method

Pith reviewed 2026-05-20 13:04 UTC · model grok-4.3

keywords: asynchronous optimization · linear minimization oracle · momentum method · delay thresholding · nonconvex optimization · stochastic optimization · distributed training · time complexity

The pith

Ringmaster LMO extends delay-thresholding to LMO momentum updates for asynchronous training and recovers optimal time complexity in smooth settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Ringmaster LMO as an asynchronous momentum method based on linear minimization oracles for stochastic nonconvex problems. It adapts the delay-thresholding rule that skips stale gradients from earlier Ringmaster ASGD work so that it applies to general LMO steps. Convergence is shown under generalized (L0, L1)-smoothness, and these iteration bounds are turned into time-complexity results for systems where workers finish computations at different rates. In the standard Euclidean smooth case the time complexity matches the best known for Ringmaster ASGD. A parameter-agnostic version with decreasing steps and adaptive thresholds is also given, and tests on quadratics and language-model pretraining indicate larger gains as heterogeneity increases.

Core claim

Ringmaster LMO is an asynchronous LMO-based momentum method that applies a delay-thresholding rule to discard overly stale LMO updates, yielding convergence guarantees under generalized (L0, L1)-smoothness and time-complexity bounds that recover the optimal performance of Ringmaster ASGD in the classical Euclidean smooth setting.

What carries the argument

Delay-thresholding rule extended to discard stale LMO updates while preserving convergence behavior.
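
For readers unfamiliar with the term, the standard definition of a linear minimization oracle over a unit norm ball is recalled here. The summary above does not state which norm or radius the paper uses, so the notation $\mathrm{lmo}$, the unit radius, and the three examples are illustrative assumptions rather than the paper's own setup.

$$\mathrm{lmo}(g) \in \arg\min_{\|d\| \le 1} \langle g, d \rangle, \qquad \langle g, \mathrm{lmo}(g) \rangle = -\|g\|_{*},$$

where $\|\cdot\|_{*}$ is the dual norm. Three familiar instances:

$$\text{Euclidean ball: } \mathrm{lmo}(g) = -\frac{g}{\|g\|_2} \ (g \neq 0), \qquad \ell_\infty \text{ ball: } \mathrm{lmo}(g) = -\mathrm{sign}(g), \qquad \text{spectral ball: } \mathrm{lmo}(G) = -U V^{\top} \text{ for } G = U \Sigma V^{\top}.$$

The spectral-norm case is the orthogonalized direction commonly associated with Muon-style updates. In every case the output depends only on the direction of the input, which is why an LMO step is nonlinear in the momentum vector it receives.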

If this is right

  • Convergence guarantees are established for unconstrained stochastic nonconvex optimization.
  • Time complexity bounds are obtained for heterogeneous worker computation times.
  • A parameter-agnostic variant uses decreasing stepsizes and adaptive delay thresholds.
  • Empirical advantages increase with greater system heterogeneity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same thresholding idea could shorten wall-clock time for large-model training on real clusters with uneven hardware speeds.
  • The approach may transfer to other structured oracles or momentum variants in distributed nonconvex settings.
  • Performance under convex or strongly convex assumptions remains open for separate analysis.

Load-bearing premise

The delay-thresholding rule extends without modification to LMO momentum updates while preserving the same convergence behavior under generalized (L0, L1)-smoothness.

What would settle it

A controlled experiment showing that Ringmaster LMO fails to match the time complexity of Ringmaster ASGD under heterogeneous worker delays in the Euclidean smooth case would disprove the recovery claim.

Figures

Figures reproduced from arXiv: 2605.18174 by Abdurakhmon Sadiev, Artavazd Maranjyan, Ivan Ilin, Peter Richtárik.

Figure 1: Comparison on the stochastic tridiagonal quadratic objective under similar, sublinear, and …
Figure 2: NanoChat training loss versus simulated runtime for a …
Figure 3: NanoChat training loss versus simulated runtime for the same model with …
Original abstract

Muon has recently emerged as a strong alternative to AdamW for training neural networks, with encouraging large-scale pretraining results and growing evidence that matrix-structured updates can be faster in practice. Yet Muon, and more generally Linear Minimization Oracle (LMO) based methods, are typically used synchronously. This is problematic in heterogeneous distributed systems, where workers complete gradient computations at different speeds and synchronous training must repeatedly wait for slower workers. In this work, we introduce Ringmaster LMO, an asynchronous LMO-based momentum method for unconstrained stochastic nonconvex optimization. Our method builds on the delay-thresholding idea of Ringmaster ASGD. For SGD-type methods, Ringmaster ASGD achieves optimal time complexity by discarding overly stale gradients. Ringmaster LMO extends this mechanism to general LMO-based updates. We establish convergence guarantees under generalized $(L_0, L_1)$-smoothness and further develop a parameter-agnostic variant with decreasing stepsizes and adaptive delay thresholds. Finally, we translate our iteration guarantees into time complexity bounds under heterogeneous worker computation times. In the classical Euclidean smooth setting, these bounds recover the optimal time complexity of Ringmaster ASGD. Experiments on stochastic quadratic problems and NanoChat language-model pretraining show that the advantages of Ringmaster LMO grow with system heterogeneity and that the method outperforms strong synchronous and asynchronous baselines.
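
The abstract's point that synchronous training must repeatedly wait for slower workers can be illustrated with a toy throughput count. The sketch below is not from the paper: the worker times, the horizon, and both function names are hypothetical, communication cost is ignored, and each worker is assumed to restart immediately after finishing a gradient.

```python
import heapq


def sync_gradients(times, horizon):
    """Gradients collected by synchronous rounds; each round lasts max(times)."""
    rounds = int(horizon // max(times))
    return rounds * len(times)


def async_gradients(times, horizon):
    """Gradients collected when every worker restarts as soon as it finishes."""
    events = [(t, i) for i, t in enumerate(times)]  # (finish time, worker id)
    heapq.heapify(events)
    count = 0
    while True:
        finish, worker = heapq.heappop(events)
        if finish > horizon:
            break  # the earliest pending finish is already past the horizon
        count += 1
        heapq.heappush(events, (finish + times[worker], worker))
    return count


# Hypothetical per-gradient compute times (seconds) for three workers.
times = [1.0, 2.0, 10.0]
horizon = 100.0

sync_total = sync_gradients(times, horizon)    # 10 rounds of 3 gradients: 30
async_total = async_gradients(times, horizon)  # 100 + 50 + 10 gradients: 160
print(sync_total, async_total)
```

Raw gradient counts are not the same as progress: many of the asynchronous gradients are computed at outdated iterates, which is the problem the delay-thresholding rule is meant to address.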

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Ringmaster LMO, an asynchronous LMO-based momentum method for unconstrained stochastic nonconvex optimization. It extends the delay-thresholding mechanism of Ringmaster ASGD to discard overly stale gradients when performing LMO updates, establishes convergence guarantees under generalized (L0, L1)-smoothness, develops a parameter-agnostic variant with decreasing stepsizes and adaptive thresholds, and translates iteration bounds into time-complexity results under heterogeneous worker speeds. In the Euclidean smooth case the bounds recover the optimal time complexity of Ringmaster ASGD. Experiments on stochastic quadratics and NanoChat pretraining illustrate improved robustness to system heterogeneity relative to synchronous and asynchronous baselines.

Significance. If the central extension of delay-thresholding to LMO momentum is rigorously justified, the work would be significant: it supplies the first theoretical support for asynchronous training of modern LMO-based optimizers such as Muon, together with explicit time-complexity translation and empirical evidence on language-model pretraining. The parameter-agnostic variant and recovery of known optimal rates are additional strengths.

major comments (2)
  1. [Convergence analysis (section on guarantees under generalized smoothness)] The central claim that the delay-thresholding rule extends without modification to LMO momentum rests on the assumption that discarding a stale gradient leaves the momentum buffer with only a controllable bias. Because the momentum update couples past gradients (typically m_t = β m_{t-1} + (1-β) g_t) before the (possibly nonlinear) LMO is applied, a discarded stale g_t leaves a stale m_t whose error is not a simple linear combination of recent gradients. The convergence analysis must therefore derive an additional error term in the descent lemma that is absent from the ASGD case; without an explicit bound on this term under (L0, L1)-smoothness the iteration-to-time-complexity translation is not yet secured.
  2. [Definition of Ringmaster LMO and the delay-thresholding rule] The abstract states that the same delay-thresholding rule preserves identical convergence behavior for LMO-based updates. The manuscript should supply the precise statement of the rule (which gradients are discarded and how the momentum buffer is updated when a gradient is discarded) together with the corresponding lemma that controls the extra bias introduced by the stale momentum vector.
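
The coupling the referee describes can be made explicit by unrolling the momentum recursion quoted in comment 1. Assuming the buffer starts at $m_0 = 0$ (an assumption, since the summary does not give the initialization), the recursion $m_t = \beta m_{t-1} + (1-\beta) g_t$ yields

$$m_t = (1-\beta) \sum_{s=1}^{t} \beta^{\,t-s} g_s .$$

Every accepted gradient $g_s$ therefore keeps contributing to later buffers with weight $(1-\beta)\beta^{\,t-s}$, so the staleness of an accepted gradient decays geometrically rather than disappearing after one step. In plain SGD, by contrast, each gradient enters exactly one update.
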
minor comments (2)
  1. [Parameter-agnostic variant] The description of the parameter-agnostic variant would benefit from an explicit statement of how the adaptive delay threshold is computed from observed worker times, including any constants that must be set in practice.
  2. [Experiments] In the NanoChat experiments, report the precise model dimension, number of training tokens, and the implementation details of the synchronous and asynchronous baselines so that the claimed advantage with increasing heterogeneity can be reproduced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of the convergence analysis and algorithmic definition that we will clarify in the revision. We address each major comment below.

Point-by-point responses
  1. Referee: [Convergence analysis (section on guarantees under generalized smoothness)] The central claim that the delay-thresholding rule extends without modification to LMO momentum rests on the assumption that discarding a stale gradient leaves the momentum buffer with only a controllable bias. Because the momentum update couples past gradients (typically m_t = β m_{t-1} + (1-β) g_t) before the (possibly nonlinear) LMO is applied, a discarded stale g_t leaves a stale m_t whose error is not a simple linear combination of recent gradients. The convergence analysis must therefore derive an additional error term in the descent lemma that is absent from the ASGD case; without an explicit bound on this term under (L0, L1)-smoothness the iteration-to-time-complexity translation is not yet secured.

    Authors: We agree that the momentum coupling requires an explicit treatment of the additional bias. Under (L0, L1)-smoothness the difference between the stale momentum vector and its fresh counterpart can be bounded by a term proportional to the delay and the smoothness constants; this term is absorbed into the existing descent inequality without altering the leading-order iteration complexity. We will insert a dedicated lemma that isolates this extra error term, shows how it is controlled by the delay threshold, and confirms that the subsequent translation to time complexity remains valid. revision: yes

  2. Referee: [Definition of Ringmaster LMO and the delay-thresholding rule] The abstract states that the same delay-thresholding rule preserves identical convergence behavior for LMO-based updates. The manuscript should supply the precise statement of the rule (which gradients are discarded and how the momentum buffer is updated when a gradient is discarded) together with the corresponding lemma that controls the extra bias introduced by the stale momentum vector.

    Authors: We will add an explicit algorithmic description of Ringmaster LMO that states: a gradient g_t computed by worker i is discarded if its delay d_t exceeds the current threshold τ; when discarded, the momentum update is skipped and the buffer retains its previous value m_t = m_{t-1}. We will also include the supporting lemma that bounds the bias ||m_t - m_t^*|| (where m_t^* uses only fresh gradients) under (L0, L1)-smoothness, showing that the bias remains O(τ) and is therefore compatible with the existing convergence guarantees. revision: yes
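
The rule stated in the second response can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' algorithm: the class and method names are hypothetical, the LMO is taken over the unit Euclidean ball, the delay is measured in accepted updates, and the iterate is assumed to stay unchanged when a gradient is discarded (the response only specifies that the momentum buffer is kept).

```python
import math


def lmo_euclidean(m):
    """LMO over the unit Euclidean ball: the unit vector opposite to m."""
    norm = math.sqrt(sum(v * v for v in m))
    return [0.0 for _ in m] if norm == 0.0 else [-v / norm for v in m]


class ThresholdedLMOMomentum:
    """Hypothetical server-side state; names are illustrative, not from the paper."""

    def __init__(self, x0, stepsize, beta, threshold):
        self.x = list(x0)
        self.m = [0.0] * len(x0)
        self.stepsize = stepsize
        self.beta = beta
        self.threshold = threshold
        self.iteration = 0  # number of accepted updates so far

    def receive(self, grad, computed_at):
        """Process a gradient that was computed at iterate index `computed_at`."""
        delay = self.iteration - computed_at
        if delay > self.threshold:
            # Discard: momentum buffer keeps its value (m_t = m_{t-1}).
            # Leaving the iterate untouched as well is an assumption.
            return False
        self.m = [self.beta * mi + (1.0 - self.beta) * gi
                  for mi, gi in zip(self.m, grad)]
        direction = lmo_euclidean(self.m)
        self.x = [xi + self.stepsize * di for xi, di in zip(self.x, direction)]
        self.iteration += 1
        return True


def grad_quadratic(x):
    """Gradient of the toy objective 0.5 * ||x||^2."""
    return list(x)


opt = ThresholdedLMOMomentum(x0=[1.0, -2.0], stepsize=0.1, beta=0.9, threshold=2)
stale_grad = grad_quadratic(opt.x)  # computed at iterate 0, delivered late
for _ in range(3):
    opt.receive(grad_quadratic(opt.x), computed_at=opt.iteration)  # fresh gradients
x_before, m_before = list(opt.x), list(opt.m)
accepted = opt.receive(stale_grad, computed_at=0)  # delay 3 exceeds threshold 2
print(accepted, opt.iteration)
```

Because the LMO normalizes the buffer, each accepted step moves the iterate by exactly `stepsize` in Euclidean norm, regardless of the gradient's magnitude.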

Circularity Check

0 steps flagged

No significant circularity detected; derivation provides independent extension and proofs

Full rationale

The paper introduces Ringmaster LMO as a novel asynchronous extension of delay-thresholding to LMO-based momentum updates, explicitly stating that it establishes new convergence guarantees under generalized (L0, L1)-smoothness and translates iteration bounds to time complexity that recover the Ringmaster ASGD optimum only in the classical Euclidean case. No quoted step reduces a central claim to a self-definitional fit, a renamed input, or a load-bearing self-citation chain; the analysis is presented as self-contained with parameter-agnostic variants and heterogeneous-worker bounds derived from the new LMO-specific lemmas rather than by construction from prior fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the extension of delay-thresholding to LMO updates and on generalized (L0, L1)-smoothness; no explicit free parameters are named in the abstract, but the adaptive delay thresholds in the parameter-agnostic variant are likely chosen or adapted from data.

axioms (1)
  • domain assumption: Generalized (L0, L1)-smoothness condition for stochastic nonconvex objectives.
    Invoked to establish convergence guarantees for the asynchronous LMO method.
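
The page does not reproduce the paper's definition of this axiom. One commonly used formulation of generalized $(L_0, L_1)$-smoothness, given here as an assumption about the general shape of the condition rather than the paper's exact statement, is

$$\|\nabla f(x) - \nabla f(y)\|_{*} \le \bigl(L_0 + L_1 \|\nabla f(x)\|_{*}\bigr)\, \|x - y\|,$$

often required only for points $x, y$ that are sufficiently close. Setting $L_1 = 0$ recovers the classical $L_0$-smooth case, which is consistent with the summary's statement that the bounds recover the Ringmaster ASGD rate in the Euclidean smooth setting.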

pith-pipeline@v0.9.0 · 5789 in / 1225 out tokens · 74236 ms · 2026-05-20T13:04:22.976689+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.
