LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging
Pith reviewed 2026-05-21 05:46 UTC · model grok-4.3
The pith
LOSCAR-SGD combines sparse local updates with computation-communication overlap and a delay-corrected merge to converge on smooth non-convex objectives.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LOSCAR-SGD is a Local SGD method that communicates only a sparse subset of model coordinates and continues optimizing while communication is in flight. A key ingredient is a delay-corrected merge rule that incorporates delayed synchronized information without discarding the progress made during the overlap phase. The authors give convergence guarantees for smooth non-convex objectives and show how sparsity, overlap, and worker heterogeneity affect the rate. To the best of the authors' knowledge, this is the first theory for this combination of ingredients.
What carries the argument
The delay-corrected merge rule, which folds delayed sparse updates from heterogeneous workers back into the local models without erasing progress accumulated during the overlap interval.
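To make the contrast with naive overwriting concrete, here is a minimal sketch of one plausible form of such a merge. The page does not reproduce the paper's update rule, so the formula below (delayed average plus the local displacement accumulated since the send) is an assumption chosen for illustration, and the names `naive_overwrite`, `delay_corrected_merge`, `x_sent`, `x_now`, and `avg_on_mask` are hypothetical.

```python
from typing import Dict, List


def naive_overwrite(x_now: List[float], avg_on_mask: Dict[int, float]) -> List[float]:
    """Baseline: replace each communicated coordinate with the delayed average."""
    merged = list(x_now)
    for j, avg_j in avg_on_mask.items():
        merged[j] = avg_j  # progress made on coordinate j during overlap is lost
    return merged


def delay_corrected_merge(
    x_now: List[float], x_sent: List[float], avg_on_mask: Dict[int, float]
) -> List[float]:
    """Assumed correction (not taken from the paper): on communicated
    coordinates, add the local displacement since the send to the average."""
    merged = list(x_now)
    for j, avg_j in avg_on_mask.items():
        merged[j] = avg_j + (x_now[j] - x_sent[j])
    return merged


# Toy values: every coordinate moved by -0.5 during the overlap phase.
x_sent = [1.0, 2.0, 3.0, 4.0]    # local model when the sparse message was sent
x_now = [0.5, 1.5, 2.5, 3.5]     # local model when the average arrives
avg_on_mask = {0: 2.0, 2: 1.0}   # only coordinates 0 and 2 were communicated

print(naive_overwrite(x_now, avg_on_mask))                # [2.0, 1.5, 1.0, 3.5]
print(delay_corrected_merge(x_now, x_sent, avg_on_mask))  # [1.5, 1.5, 0.5, 3.5]
```

In both variants the coordinates outside the mask keep their local values. The two differ only on the communicated coordinates, where the sketch's corrected version carries the -0.5 of overlap progress forward.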
If this is right
- Sparsity level directly modulates the communication volume and appears in the convergence bound.
- Communication-computation overlap shortens wall-clock training time without harming the asymptotic rate.
- Worker heterogeneity increases the effective delay term and slows the rate in a quantifiable way.
- The delay-corrected merge outperforms naive overwriting in both the analysis and the reported experiments.
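The link between sparsity level and communication volume in the first bullet can be illustrated with simple arithmetic. This is a rough sketch under assumptions not stated on this page: 4-byte values, 4-byte coordinate indices sent alongside each sparse value, and a hypothetical model size `d` and sparsity budget `k`.

```python
def payload_bytes(num_coords: int, value_bytes: int = 4,
                  index_bytes: int = 4, send_indices: bool = True) -> int:
    """Bytes needed to send num_coords coordinates (assumed encoding)."""
    per_coord = value_bytes + (index_bytes if send_indices else 0)
    return num_coords * per_coord


d = 1_000_000  # hypothetical model dimension
k = 10_000     # hypothetical number of coordinates sent per round

dense = payload_bytes(d, send_indices=False)  # full model, no indices needed
sparse = payload_bytes(k)                     # k values plus k indices
print(dense, sparse, sparse / dense)
```

Under these assumptions the sparse message is 2% of the dense one. If workers derive the mask from a shared random seed, the indices need not be sent at all, which would halve the sparse payload again.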
Where Pith is reading between the lines
- The same overlap-plus-correction idea could be applied to other first-order methods such as Adam or momentum variants.
- Pairing the sparse merge with coordinate-wise quantization might yield multiplicative communication savings.
- In federated settings the optimal overlap length could be tuned from measured round-trip times and compute variance.
- The analysis suggests that very high sparsity may require compensatory increases in local steps to keep the rate acceptable.
Load-bearing premise
The delay-corrected merge rule correctly incorporates delayed synchronized information without discarding the progress made during the overlap phase.
What would settle it
Replace the delay-corrected merge with naive overwriting in a heterogeneous testbed with measurable overlap periods and observe whether convergence slows or fails relative to the predicted rate.
Original abstract
Communication is a major bottleneck in distributed learning, especially in large-scale settings and in federated learning environments with slow links. Three standard ways to reduce this cost are communication compression, local training, and communication-computation overlap. Methods that combine these ingredients are used in practice and have been found to be effective for large-scale training, but there is little theory for methods that combine all three. We study a heterogeneous-compute setting in which different workers may take different numbers of local steps, and we propose LOSCAR-SGD, a Local SGD method that communicates only a sparse subset of model coordinates and continues optimizing while communication is in flight. A key ingredient is a delay-corrected merge rule that incorporates delayed synchronized information without discarding the progress made during the overlap phase. We give convergence guarantees for smooth non-convex objectives and show how sparsity, overlap, and worker heterogeneity affect the rate. To the best of our knowledge, this is the first theory for this combination of ingredients. Experiments further show that communication-computation overlap reduces training time and that the delay-corrected merge outperforms naive overwriting.
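The heterogeneous-compute setting described in the abstract can be pictured with a toy version of plain Local SGD in which workers take different numbers of local steps before averaging. This is not LOSCAR-SGD: there is no sparsity, no overlap, and no delay correction, and exact gradients of one-dimensional quadratics stand in for stochastic gradients. All names and constants are hypothetical.

```python
from typing import List


def local_sgd_round(x: float, optima: List[float],
                    local_steps: List[int], lr: float) -> float:
    """One round: worker i runs local_steps[i] gradient steps on
    f_i(x) = 0.5 * (x - a_i)^2 from the shared model, then models are averaged."""
    local_models = []
    for a_i, h_i in zip(optima, local_steps):
        x_i = x
        for _ in range(h_i):
            x_i -= lr * (x_i - a_i)  # exact gradient of the toy quadratic
        local_models.append(x_i)
    return sum(local_models) / len(local_models)


def run(rounds: int = 200) -> float:
    x = 10.0
    for _ in range(rounds):
        # Worker 0 is slow (1 step per round); worker 1 is fast (4 steps).
        x = local_sgd_round(x, optima=[0.0, 2.0], local_steps=[1, 4], lr=0.1)
    return x


x_final = run()
print(x_final)
```

The minimizer of the average objective in this toy is 1.0, but the averaged model settles above it, closer to the optimum of the worker that takes more steps. This is one simple way in which unequal local step counts show up in the behaviour of Local SGD.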
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LOSCAR-SGD, a Local SGD algorithm for heterogeneous distributed settings that combines sparse coordinate communication, communication-computation overlap, and a delay-corrected merge rule to incorporate delayed updates without discarding local progress during overlap. It claims convergence guarantees for smooth non-convex objectives, with explicit dependence of the rate on sparsity level, overlap duration, and worker heterogeneity, and reports experiments showing reduced wall-clock time and better performance than naive overwriting.
Significance. If the convergence analysis is correct, the work would be significant as the first explicit theory for the joint combination of local steps, sparsity, overlap, and heterogeneity; the parameter dependence could directly inform practical tuning in large-scale training. The experiments provide supporting evidence for the overlap benefit, though verification is limited by the absence of full proof details and statistical error bars.
major comments (3)
- §3 (delay-corrected merge rule): the claim that the rule produces an unbiased estimator of the averaged model while preserving overlap-phase progress is load-bearing for all rate statements, yet the description does not specify whether the correction is applied coordinate-wise only to the sparse mask that was actually sent at the delayed time or uniformly; under heterogeneous delays and per-worker sparsity this risks introducing a bias term proportional to sparsity level times delay variance, which would invalidate the claimed rate.
- Theorem 1 (convergence bound): the rate is stated to depend explicitly on sparsity, overlap, and heterogeneity, but the proof sketch relies on the merge rule remaining unbiased without additional assumptions on consistent sparse masks across send/receive times; if the analysis applies a uniform correction, the variance term from coordinate-wise heterogeneity could grow and contradict the stated bound.
- §5 (experiments): the reported improvements in training time lack error bars or multiple independent runs, so it is impossible to assess whether the observed gains over naive overwriting are statistically reliable or sensitive to random seeds.
minor comments (2)
- Notation for the sparse mask and delay variables is introduced without a consolidated table; a single reference table would improve readability of the rate expressions.
- The abstract states this is the first theory for the combination, but the introduction omits explicit comparison to prior overlap analyses in Local SGD (e.g., those handling fixed delays without sparsity).
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review of our manuscript. We address each major comment in turn below and have revised the paper to improve clarity and completeness where needed.
Referee: §3 (delay-corrected merge rule): the claim that the rule produces an unbiased estimator of the averaged model while preserving overlap-phase progress is load-bearing for all rate statements, yet the description does not specify whether the correction is applied coordinate-wise only to the sparse mask that was actually sent at the delayed time or uniformly; under heterogeneous delays and per-worker sparsity this risks introducing a bias term proportional to sparsity level times delay variance, which would invalidate the claimed rate.
Authors: The delay-corrected merge rule is defined to apply the correction coordinate-wise and exclusively to the coordinates present in the sparse mask that was transmitted at the delayed communication round. This is stated in Section 3 immediately after the algorithm pseudocode and is used in the subsequent analysis. Because only the communicated coordinates receive the delay adjustment, the estimator for the averaged model remains unbiased; local progress on non-communicated coordinates is retained without introducing an extra bias term. The dependence on sparsity level and delay already appears in the convergence bound of Theorem 1. We have added a short clarifying paragraph and a supporting lemma in the revised §3 to make the coordinate-wise application explicit. revision: yes
Referee: Theorem 1 (convergence bound): the rate is stated to depend explicitly on sparsity, overlap, and heterogeneity, but the proof sketch relies on the merge rule remaining unbiased without additional assumptions on consistent sparse masks across send/receive times; if the analysis applies a uniform correction, the variance term from coordinate-wise heterogeneity could grow and contradict the stated bound.
Authors: The full proof in the appendix explicitly assumes that the sparse masks are those chosen at the sending time and that the correction is applied only to those coordinates; a uniform correction is never used. Under this construction the unbiasedness holds and the variance contribution from coordinate-wise heterogeneity is controlled by the sparsity factor already present in the rate. We have expanded the proof sketch in the main text of the revised manuscript with a one-paragraph outline of the unbiasedness argument and a pointer to the relevant appendix lemma. revision: yes
Referee: §5 (experiments): the reported improvements in training time lack error bars or multiple independent runs, so it is impossible to assess whether the observed gains over naive overwriting are statistically reliable or sensitive to random seeds.
Authors: We agree that the experimental presentation would be strengthened by statistical reporting. In the revised manuscript we have repeated the wall-clock time experiments over five independent random seeds and added error bars (mean ± one standard deviation) to the relevant plots in §5. The observed gains of LOSCAR-SGD over naive overwriting remain consistent across seeds. revision: yes
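The reporting convention named in this response (mean ± one standard deviation over five seeds) can be computed as in the short sketch below. The timing values are hypothetical placeholders, not results from the paper, and the use of the sample standard deviation is an assumption, since the response does not say which estimator was used.

```python
import statistics
from typing import List


def summarize(times: List[float]) -> str:
    """Format per-seed wall-clock times as 'mean ± std' (sample std, n - 1)."""
    mean = statistics.mean(times)
    std = statistics.stdev(times)
    return f"{mean:.1f} ± {std:.1f}"


# Hypothetical wall-clock times in seconds, one per random seed.
times_per_seed = [100.0, 102.0, 98.0, 101.0, 99.0]
print(summarize(times_per_seed))  # 100.0 ± 1.6
```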
Circularity Check
Convergence analysis derives from standard smoothness assumptions and proposed merge rule without tautological reduction
The paper proposes LOSCAR-SGD with a delay-corrected sparse merge and derives convergence rates for smooth non-convex objectives directly from the algorithm's update rules, sparsity masks, overlap phases, and heterogeneity parameters. The rate expressions follow from standard bounded-variance and smoothness assumptions applied to the new merge operator; no step equates a claimed prediction or theorem to a fitted input or prior self-citation by construction. The 'first theory' claim for the combination of ingredients further indicates the derivation chain is self-contained rather than relying on load-bearing self-citations or ansatzes imported from the authors' prior work.
Axiom & Free-Parameter Ledger
free parameters (2)
- sparsity level
- overlap duration
axioms (2)
- domain assumption: Objective function is L-smooth
- domain assumption: Workers may perform different numbers of local steps
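For reference, the usual meaning of L-smoothness is recalled below. This is the standard textbook definition and its well-known consequence, not a statement copied from the paper, whose exact assumptions and constants are not reproduced on this page.

```latex
\|\nabla f(x) - \nabla f(y)\| \le L \,\|x - y\|
\quad \text{for all } x, y \in \mathbb{R}^d,
\qquad \text{which implies} \qquad
f(y) \le f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{L}{2}\,\|y - x\|^2 .
```

The second inequality is the descent-type bound that analyses of smooth non-convex objectives typically build on.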
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
We give convergence guarantees for smooth non-convex objectives and show how sparsity, overlap, and worker heterogeneity affect the rate.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
delay-corrected merge rule that incorporates delayed synchronized information without discarding the progress made during the overlap phase
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.