Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method
Pith reviewed 2026-05-20 13:04 UTC · model grok-4.3
The pith
Ringmaster LMO extends delay-thresholding to LMO momentum updates for asynchronous training and recovers optimal time complexity in smooth settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Ringmaster LMO is an asynchronous LMO-based momentum method that applies a delay-thresholding rule to discard overly stale gradients in LMO updates, yielding convergence guarantees under generalized (L0, L1)-smoothness and time-complexity bounds that recover the optimal time complexity of Ringmaster ASGD in the classical Euclidean smooth setting.
What carries the argument
Delay-thresholding rule extended to discard stale LMO updates while preserving convergence behavior.
If this is right
- Convergence guarantees are established for unconstrained stochastic nonconvex optimization.
- Time complexity bounds are obtained for heterogeneous worker computation times.
- A parameter-agnostic variant uses decreasing stepsizes and adaptive delay thresholds.
- Empirical advantages increase with greater system heterogeneity.
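The heterogeneity points above can be illustrated with a toy event-driven simulation. This is only a sketch under stated assumptions, not the paper's experimental setup: the function name `simulate`, fixed per-worker compute times, zero communication cost, and the convention that a gradient is discarded when its delay is strictly greater than the threshold are all hypothetical choices made for illustration.

```python
import heapq


def simulate(compute_times, threshold, horizon):
    """Toy asynchronous server loop (illustrative assumption, not the paper's code).

    compute_times: fixed seconds each worker needs per gradient.
    threshold:     a gradient is discarded when its delay exceeds this value.
    horizon:       simulated wall-clock time at which the run stops.
    Returns (applied, discarded) gradient counts.
    """
    version = 0  # number of updates the server has applied so far
    # Each event is (finish_time, worker_id, model_version_when_started).
    events = [(t, i, 0) for i, t in enumerate(compute_times)]
    heapq.heapify(events)
    applied = discarded = 0
    while events:
        finish, worker, start_version = heapq.heappop(events)
        if finish > horizon:
            break
        # Delay = how many updates were applied while this gradient was computed.
        delay = version - start_version
        if delay > threshold:
            discarded += 1
        else:
            applied += 1
            version += 1
        # The worker immediately restarts from the current model version.
        heapq.heappush(events, (finish + compute_times[worker], worker, version))
    return applied, discarded


if __name__ == "__main__":
    # Two fast workers and one worker that is ten times slower.
    print(simulate([1.0, 1.0, 10.0], threshold=2, horizon=20.0))
```

In this toy run the slow worker's gradients arrive with a delay of about 20 updates, so a small threshold discards them while every gradient from the fast workers is applied.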
Where Pith is reading between the lines
- The same thresholding idea could shorten wall-clock time for large-model training on real clusters with uneven hardware speeds.
- The approach may transfer to other structured oracles or momentum variants in distributed nonconvex settings.
- Performance under convex or strongly convex assumptions remains open for separate analysis.
Load-bearing premise
The delay-thresholding rule extends without modification to LMO momentum updates while preserving the same convergence behavior under generalized (L0, L1)-smoothness.
What would settle it
A controlled experiment showing that Ringmaster LMO fails to match the time complexity of Ringmaster ASGD under heterogeneous worker delays in the Euclidean smooth case would disprove the recovery claim.
Original abstract
Muon has recently emerged as a strong alternative to AdamW for training neural networks, with encouraging large-scale pretraining results and growing evidence that matrix-structured updates can be faster in practice. Yet Muon, and more generally Linear Minimization Oracle (LMO) based methods, are typically used synchronously. This is problematic in heterogeneous distributed systems, where workers complete gradient computations at different speeds and synchronous training must repeatedly wait for slower workers. In this work, we introduce Ringmaster LMO, an asynchronous LMO-based momentum method for unconstrained stochastic nonconvex optimization. Our method builds on the delay-thresholding idea of Ringmaster ASGD. For SGD-type methods, Ringmaster ASGD achieves optimal time complexity by discarding overly stale gradients. Ringmaster LMO extends this mechanism to general LMO-based updates. We establish convergence guarantees under generalized $(L_0, L_1)$-smoothness and further develop a parameter-agnostic variant with decreasing stepsizes and adaptive delay thresholds. Finally, we translate our iteration guarantees into time complexity bounds under heterogeneous worker computation times. In the classical Euclidean smooth setting, these bounds recover the optimal time complexity of Ringmaster ASGD. Experiments on stochastic quadratic problems and NanoChat language-model pretraining show that the advantages of Ringmaster LMO grow with system heterogeneity and that the method outperforms strong synchronous and asynchronous baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Ringmaster LMO, an asynchronous LMO-based momentum method for unconstrained stochastic nonconvex optimization. It extends the delay-thresholding mechanism of Ringmaster ASGD to discard overly stale gradients when performing LMO updates, establishes convergence guarantees under generalized (L0, L1)-smoothness, develops a parameter-agnostic variant with decreasing stepsizes and adaptive thresholds, and translates iteration bounds into time-complexity results under heterogeneous worker speeds. In the Euclidean smooth case, the bounds recover the optimal time complexity of Ringmaster ASGD. Experiments on stochastic quadratics and NanoChat pretraining illustrate improved robustness to system heterogeneity relative to synchronous and asynchronous baselines.
Significance. If the central extension of delay-thresholding to LMO momentum is rigorously justified, the work would be significant: it supplies the first theoretical support for asynchronous training of modern LMO-based optimizers such as Muon, together with explicit time-complexity translation and empirical evidence on language-model pretraining. The parameter-agnostic variant and recovery of known optimal rates are additional strengths.
major comments (2)
- [Convergence analysis (section on guarantees under generalized smoothness)] The central claim that the delay-thresholding rule extends without modification to LMO momentum rests on the assumption that discarding a stale gradient leaves the momentum buffer with only a controllable bias. Because the momentum update couples past gradients (typically m_t = β m_{t-1} + (1-β) g_t) before the (possibly nonlinear) LMO is applied, a discarded stale g_t leaves a stale m_t whose error is not a simple linear combination of recent gradients. The convergence analysis must therefore derive an additional error term in the descent lemma that is absent from the ASGD case; without an explicit bound on this term under (L0, L1)-smoothness the iteration-to-time-complexity translation is not yet secured.
- [Definition of Ringmaster LMO and the delay-thresholding rule] The abstract states that the same delay-thresholding rule preserves identical convergence behavior for LMO-based updates. The manuscript should supply the precise statement of the rule (which gradients are discarded and how the momentum buffer is updated when a gradient is discarded) together with the corresponding lemma that controls the extra bias introduced by the stale momentum vector.
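For orientation, the momentum-then-LMO update that the first major comment refers to can be written out as follows. This is a sketch based on the "typical" momentum form quoted by the referee; the stepsize symbol \gamma_t and the unit-ball normalization of the LMO are assumptions, and the paper's exact update may differ.

```latex
\begin{aligned}
m_t &= \beta\, m_{t-1} + (1-\beta)\, g_t, \\
x_{t+1} &= x_t + \gamma_t\, \mathrm{LMO}(m_t), \\
\mathrm{LMO}(m) &\in \operatorname*{arg\,min}_{\|d\| \le 1} \langle m, d \rangle .
\end{aligned}
```

With the Euclidean norm this gives LMO(m) = -m / ||m||_2, a normalized step. With the spectral norm on matrices and an SVD m = U Σ V^T it gives LMO(m) = -U V^T, the orthogonalized direction that Muon approximates. In both cases the map is nonlinear in m, which is why an error in the momentum buffer does not pass through to the iterate as a simple linear term.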
minor comments (2)
- [Parameter-agnostic variant] The description of the parameter-agnostic variant would benefit from an explicit statement of how the adaptive delay threshold is computed from observed worker times, including any constants that must be set in practice.
- [Experiments] In the NanoChat experiments, report the precise model dimension, number of training tokens, and the implementation details of the synchronous and asynchronous baselines so that the claimed advantage with increasing heterogeneity can be reproduced.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of the convergence analysis and algorithmic definition that we will clarify in the revision. We address each major comment below.
- Referee: [Convergence analysis (section on guarantees under generalized smoothness)] The central claim that the delay-thresholding rule extends without modification to LMO momentum rests on the assumption that discarding a stale gradient leaves the momentum buffer with only a controllable bias. Because the momentum update couples past gradients (typically m_t = β m_{t-1} + (1-β) g_t) before the (possibly nonlinear) LMO is applied, a discarded stale g_t leaves a stale m_t whose error is not a simple linear combination of recent gradients. The convergence analysis must therefore derive an additional error term in the descent lemma that is absent from the ASGD case; without an explicit bound on this term under (L0, L1)-smoothness the iteration-to-time-complexity translation is not yet secured.
  Authors: We agree that the momentum coupling requires an explicit treatment of the additional bias. Under (L0, L1)-smoothness the difference between the stale momentum vector and its fresh counterpart can be bounded by a term proportional to the delay and the smoothness constants; this term is absorbed into the existing descent inequality without altering the leading-order iteration complexity. We will insert a dedicated lemma that isolates this extra error term, shows how it is controlled by the delay threshold, and confirms that the subsequent translation to time complexity remains valid.
  Revision: yes
- Referee: [Definition of Ringmaster LMO and the delay-thresholding rule] The abstract states that the same delay-thresholding rule preserves identical convergence behavior for LMO-based updates. The manuscript should supply the precise statement of the rule (which gradients are discarded and how the momentum buffer is updated when a gradient is discarded) together with the corresponding lemma that controls the extra bias introduced by the stale momentum vector.
  Authors: We will add an explicit algorithmic description of Ringmaster LMO that states: a gradient g_t computed by worker i is discarded if its delay d_t exceeds the current threshold τ; when discarded, the momentum update is skipped and the buffer retains its previous value m_t = m_{t-1}. We will also include the supporting lemma that bounds the bias ||m_t - m_t^*|| (where m_t^* uses only fresh gradients) under (L0, L1)-smoothness, showing that the bias remains O(τ) and is therefore compatible with the existing convergence guarantees.
  Revision: yes
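The rule described in this response can be sketched as a small server-side routine. This is a minimal illustration, not the authors' algorithm: the class name `ToyRingmasterServer`, the use of a sign-based LMO (the unit l-infinity ball), the constant stepsize, and the choice to leave the iterate unchanged when a gradient is discarded are all assumptions. The response itself only specifies that the momentum buffer keeps its previous value.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


class ToyRingmasterServer:
    """Hypothetical server illustrating delay-thresholded momentum + LMO steps."""

    def __init__(self, x0, beta, stepsize, threshold):
        self.x = list(x0)             # current iterate
        self.m = [0.0] * len(x0)      # momentum buffer
        self.beta = beta
        self.stepsize = stepsize
        self.threshold = threshold    # tau in the response above
        self.version = 0              # number of applied updates

    def receive(self, grad, start_version):
        """Process a gradient computed at model version `start_version`.

        Returns True if the gradient was applied, False if it was discarded.
        """
        delay = self.version - start_version
        if delay > self.threshold:
            # Discarded: momentum update skipped, so m_t = m_{t-1}.
            # Assumption: the iterate is left unchanged as well.
            return False
        self.m = [self.beta * mi + (1 - self.beta) * gi
                  for mi, gi in zip(self.m, grad)]
        # LMO over the unit l-infinity ball is the negative sign vector.
        self.x = [xi - self.stepsize * sign(mi)
                  for xi, mi in zip(self.x, self.m)]
        self.version += 1
        return True


if __name__ == "__main__":
    server = ToyRingmasterServer([1.0, -1.0], beta=0.5, stepsize=0.25, threshold=1)
    print(server.receive([2.0, -2.0], start_version=0), server.x)
    print(server.receive([2.0, -2.0], start_version=0), server.x)
    print(server.receive([-100.0, 100.0], start_version=0), server.x)
```

The third gradient in the usage example was computed two versions ago, so with a threshold of 1 it is dropped and neither the buffer nor the iterate moves.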
Circularity Check
No significant circularity detected; derivation provides independent extension and proofs
The paper introduces Ringmaster LMO as a novel asynchronous extension of delay-thresholding to LMO-based momentum updates, explicitly stating that it establishes new convergence guarantees under generalized (L0, L1)-smoothness and translates iteration bounds to time complexity that recover the Ringmaster ASGD optimum only in the classical Euclidean case. No quoted step reduces a central claim to a self-definitional fit, a renamed input, or a load-bearing self-citation chain; the analysis is presented as self-contained with parameter-agnostic variants and heterogeneous-worker bounds derived from the new LMO-specific lemmas rather than by construction from prior fitted quantities.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Generalized (L0, L1)-smoothness condition for stochastic nonconvex objectives
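For readers unfamiliar with the assumption, one common formulation of (L0, L1)-smoothness for a twice-differentiable objective f is shown below. This is offered as background only: the paper may state a first-order or norm-generalized variant, and the exact form used there is not given in this review.

```latex
\|\nabla^2 f(x)\| \;\le\; L_0 + L_1 \,\|\nabla f(x)\| \qquad \text{for all } x .
```

Setting L_1 = 0 recovers the classical L_0-smooth case, in which the Hessian norm is uniformly bounded; a positive L_1 lets the local curvature grow with the gradient norm.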
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: Ringmaster LMO extends this mechanism to general LMO-based updates... convergence guarantees under generalized (L0, L1)-smoothness
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat recovery
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: delay-thresholding rule that discards stale gradients
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging
  LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.