Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates

Athanasios Glentis; Chung-Yiu Yau; Dawei Li; Mingyi Hong

arxiv: 2605.17787 · v1 · pith:QKFYFXODnew · submitted 2026-05-18 · 💻 cs.LG


Pith reviewed 2026-05-20 13:28 UTC · model grok-4.3

classification 💻 cs.LG

keywords SGD · Adam · LLM pre-training · learning rate · gradient clipping · batch size · optimization dynamics · validation loss


The pith

Clipping mechanisms let plain SGD sustain large learning rates and close most of the gap to Adam in LLM pre-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to explain why stochastic gradient descent consistently trails adaptive optimizers such as Adam during large language model pre-training. It traces the difference to SGD's inability to keep learning rates as large as Adam's effective rates: the small gradient norms and high weight-to-gradient ratios seen in these runs call for large rates, while uneven output-layer gradients and sudden spikes keep plain SGD from using them. The authors then test whether targeted clipping can remove those restrictions without wrecking the training trajectory. When the clipping is applied, the validation-loss gap shrinks sharply, suggesting that the optimizer choice itself is less decisive than the ability to operate safely at high effective learning rates. A reader would care because the result suggests that simpler, non-adaptive methods could become practical for the largest training jobs once the right stabilizers are in place.

Core claim

In LLM pre-training, small gradient norms and large weight-to-gradient ratios require high effective learning rates, yet uneven output-layer gradients and frequent spikes prevent plain SGD from using them safely. Simple clipping mechanisms stabilize SGD at these large rates, allowing it to recover most of Adam's performance. In experiments pre-training a 1B-parameter LLaMA model with 1M-token batches, the validation loss gap falls from more than 50% to roughly 3.5%.
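To make the mechanism concrete, here is a minimal sketch of plain SGD with global-norm gradient clipping, in Python with NumPy. It is an illustration only: the paper's SGD-LL uses its own clipping rules (the Figure 5 caption mentions two forms), which this page does not spell out, so the function names, the global-norm rule, and the toy values of `lr` and `max_norm` are all assumptions.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    # Scale is 1 when the gradient is already small enough, below 1 otherwise.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

def sgd_step(params, grads, lr, max_norm):
    """One plain SGD update on clipped gradients (no momentum, no adaptivity)."""
    clipped, total_norm = clip_by_global_norm(grads, max_norm)
    new_params = [p - lr * g for p, g in zip(params, clipped)]
    return new_params, total_norm

# Toy usage (hypothetical values): a gradient spike is bounded before the step.
params = [np.ones(3), np.ones(2)]
spike = [np.array([30.0, 0.0, 40.0]), np.zeros(2)]  # joint norm is 50
new_params, norm_before = sgd_step(params, spike, lr=1.0, max_norm=1.0)
update_norm = np.sqrt(sum(np.sum((p - q) ** 2) for p, q in zip(params, new_params)))
```

Because the clipped gradient has norm at most `max_norm`, each step moves the weights by at most `lr * max_norm`, which is what keeps a large `lr` tolerable when a spike arrives.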

What carries the argument

clipping mechanisms that stabilize SGD at large learning rates

Load-bearing premise

The obstacles that keep SGD from the large learning rates it needs, namely uneven output-layer gradients and gradient spikes, can be controlled by clipping without creating fresh instabilities or shifting the optimization path in unintended directions.

What would settle it

A controlled run in which the same clipping rules are applied yet the validation loss gap remains above 30% or new instabilities appear would show that the clipping approach does not actually let SGD recover most of Adam's performance.

Figures

Figures reproduced from arXiv: 2605.17787 by Athanasios Glentis, Chung-Yiu Yau, Dawei Li, Mingyi Hong.

**Figure 2.** (a) Mean effective learning rate per layer for Adam and SGD. We use cosine scheduler with ...

**Figure 3.** (a) The dynamics of layer-wise weight to stochastic gradient (SG) norm ratio ...

**Figure 4.** (a) Layer-wise stochastic gradient norm ...

**Figure 5.** Illustration of the two forms of gradient clipping in SGD-LL while training LLaMA 130M, ...

**Figure 6.** Training loss for various model sizes. The SGD-LL trajectory closely follows that of Adam.

**Figure 7.** Measuring the token class gradient imbalance by the ratio between maximum token class ...

**Figure 8.** Training loss figure of 130M LLaMA on C4 pre-trained using SGD with either Layer-wise ...

**Figure 9.** The RMS norms of the weight matrices during training.

**Figure 10.** Average loss of the tokens with different frequency quantiles, with Quantile 0 representing ...

**Figure 11.** Token count in one batch on different frequency quantiles, with Quantile 0 representing ...
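The imbalance measure named in the Figure 7 caption, the ratio of the maximum to the average token-class gradient norm, can be sketched in a few lines of Python. The layout is an assumption: the output-layer gradient is taken to be a `(V, d)` array with one row per vocabulary token, and `token_class_imbalance` is a hypothetical helper name, not something from the paper.

```python
import numpy as np

def token_class_imbalance(output_grad):
    """Ratio of the largest to the mean per-token-class gradient L2 norm.

    output_grad: array of shape (V, d), one row per vocabulary token
    (assumed layout; the paper's exact tensor layout is not given here).
    """
    row_norms = np.linalg.norm(output_grad, axis=1)
    return float(row_norms.max() / row_norms.mean())

# Toy example: 4 token classes, one with a much larger gradient.
G = np.array([[3.0, 4.0],   # norm 5
              [0.0, 1.0],   # norm 1
              [1.0, 0.0],   # norm 1
              [0.0, 1.0]])  # norm 1
ratio = token_class_imbalance(G)  # 5 / mean(5, 1, 1, 1) = 2.5
```

A ratio of 1 means every token class receives a gradient of the same size; larger values indicate that a few classes dominate, which is the situation a single global learning rate handles poorly.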


It is widely believed that stochastic gradient descent (SGD) performs significantly worse than adaptive optimizers such as Adam in pre-training Large Language Models (LLMs). Yet the underlying reason for this gap remains unclear. In this work, we attribute a large part of the discrepancy to SGD's inability to sustain learning rates comparable to Adam's much larger effective learning rates. Through empirical and theoretical analysis of LLM pre-training dynamics, we identify that training is characterized by small gradient norms and large weight-to-gradient ratios, an effect that becomes more pronounced with larger batch sizes typical in pre-training, necessitating such large effective learning rates. However, we find that output-layer gradient magnitudes become highly uneven across token classes, and that large gradient spikes frequently occur during training. Together, these effects severely restrict the admissible learning rate of SGD. Guided by this understanding, we show that simple clipping mechanisms that stabilize SGD at large learning rates enable it to recover most of Adam's performance. In our large-scale experiments, the validation loss gap between large-learning-rate SGD and Adam shrinks from more than 50% to only about 3.5% when pre-training a 1B-parameter LLaMA model with a 1M-token batch size.
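One way to see why small gradient norms translate into large effective learning rates for Adam is the sketch below. It assumes a working definition that this page does not confirm: the effective learning rate is the norm of the applied update divided by the norm of the gradient. The helper names and toy values are hypothetical. The Adam update shown is the standard first step with bias correction, where the moment estimates reduce to the gradient and its square.

```python
import numpy as np

def effective_lr(update, grad):
    """Assumed definition: norm of the applied update over the gradient norm."""
    return float(np.linalg.norm(update) / np.linalg.norm(grad))

def sgd_update(grad, lr):
    """Plain SGD update: the gradient scaled by the nominal learning rate."""
    return lr * grad

def adam_first_update(grad, lr, eps=1e-8):
    """Adam's update at step 1 with bias correction (m_hat = g, v_hat = g**2)."""
    return lr * grad / (np.sqrt(grad ** 2) + eps)

grad = np.full(4, 1e-3)  # a small gradient (toy value)
lr = 1e-3                # same nominal learning rate for both optimizers

eta_sgd = effective_lr(sgd_update(grad, lr), grad)          # equals lr
eta_adam = effective_lr(adam_first_update(grad, lr), grad)  # roughly lr / 1e-3
```

Under this definition SGD's effective rate is just its nominal rate, while Adam's grows as the gradient shrinks, so matching Adam would require raising SGD's nominal rate by a comparable factor.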

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Clipping lets SGD run at higher effective learning rates and closes most of the gap to Adam in a 1B-model pre-training run, but the clipping step itself may be supplying some of the adaptivity.


The main thing to know is that the authors tie the SGD-Adam difference in LLM pre-training to SGD's inability to sustain large effective learning rates, given the small gradient norms, high weight-to-gradient ratios, uneven output-layer gradients, and spikes they observe. They then show that clipping largely removes that restriction and shrinks the validation loss gap from over 50% to roughly 3.5% on a 1B LLaMA model with 1M-token batches.

Referee Report

2 major / 2 minor

Summary. The manuscript revisits the Adam-SGD performance gap in LLM pre-training. It attributes the gap primarily to SGD's inability to sustain large effective learning rates comparable to Adam's, caused by small gradient norms, high weight-to-gradient ratios (worsened at large batch sizes), highly uneven output-layer gradient magnitudes across token classes, and frequent gradient spikes. The authors argue that these dynamics restrict SGD's admissible learning rate and show that simple clipping mechanisms stabilize SGD at large learning rates, reducing the validation loss gap from over 50% to about 3.5% in a 1B-parameter LLaMA pre-training run with 1M-token batch size.

Significance. If the central attribution holds after clarification, the result would be significant for the field: it supplies a mechanistic account of why adaptive methods outperform SGD in large-scale LLM training and demonstrates that a lightweight stabilization technique can nearly close the gap. The large-scale 1B-model experiment with realistic batch size is a concrete strength, as are the direct measurements of gradient norms, spikes, and output-layer unevenness. These elements could influence practical optimizer choices and future theoretical analyses of training dynamics.

major comments (2)

[Abstract] The central claim that clipping 'stabilize[s] SGD at large learning rates' and thereby recovers most of Adam's performance does not separate the contribution of the increased learning rate from the side-effect of clipping on the very phenomena identified earlier (uneven output-layer gradients and spikes). Because any form of clipping necessarily rescales or suppresses the largest components, it can act as a crude normalization that partially replicates Adam's adaptivity; without an ablation that holds the effective learning rate fixed while varying the clipping, the observed reduction of the gap to about 3.5% cannot be attributed solely to the larger admissible LR.
[Experimental section (1B-model results)] The reported validation-loss comparison between large-LR clipped SGD and Adam in the large-scale 1B LLaMA run lacks controls that isolate the clipping mechanism from the learning-rate increase. A direct comparison of (i) clipped SGD at the large LR, (ii) unclipped SGD at its maximal stable LR, and (iii) Adam would be required to confirm that the gap reduction is driven by the admissible LR rather than by the clipping-induced gradient modification.

minor comments (2)

[Methods / Analysis] The definition and measurement protocol for 'effective learning rate' and 'weight-to-gradient ratio' should be stated explicitly with equations in the methods or analysis section to allow replication.
[Figures] Figure captions for the gradient-norm and spike plots should include the exact batch size, model scale, and number of runs used to generate the statistics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The points raised about isolating the contributions of learning rate and clipping are well taken, and we outline revisions below to strengthen the experimental controls while preserving the core mechanistic analysis.


Referee: [Abstract] The central claim that clipping 'stabilize[s] SGD at large learning rates' and thereby recovers most of Adam's performance does not separate the contribution of the increased learning rate from the side-effect of clipping on the very phenomena identified earlier (uneven output-layer gradients and spikes). Because any form of clipping necessarily rescales or suppresses the largest components, it can act as a crude normalization that partially replicates Adam's adaptivity; without an ablation that holds the effective learning rate fixed while varying the clipping, the observed reduction of the gap to about 3.5% cannot be attributed solely to the larger admissible LR.

Authors: We agree that a clearer separation between the learning-rate increase and clipping-induced modifications is needed. In the manuscript we show that unclipped SGD is limited to much smaller learning rates by gradient spikes and output-layer unevenness, while clipping mitigates spikes to permit larger rates comparable to Adam's effective rates. To address the specific concern, we will add an ablation that applies clipping at the maximal stable learning rate of unclipped SGD (holding the rate fixed) and compare it to both unclipped SGD and to clipped SGD at the larger rate. This will demonstrate that clipping at the smaller rate yields only modest improvement, whereas enabling the larger rate accounts for most of the gap closure. revision: yes
Referee: [Experimental section (1B-model results)] The reported validation-loss comparison between large-LR clipped SGD and Adam in the large-scale 1B LLaMA run lacks controls that isolate the clipping mechanism from the learning-rate increase. A direct comparison of (i) clipped SGD at the large LR, (ii) unclipped SGD at its maximal stable LR, and (iii) Adam would be required to confirm that the gap reduction is driven by the admissible LR rather than by the clipping-induced gradient modification.

Authors: We thank the referee for this concrete suggestion. The current results already include unclipped SGD at its maximal stable learning rate (showing a large gap to Adam) and clipped SGD at a substantially larger learning rate (closing most of the gap). To further isolate the mechanisms, we will add the requested control of clipped SGD run at the same smaller maximal stable learning rate used for unclipped SGD. This will allow direct comparison of the three conditions and confirm that the primary benefit arises from the admissible larger learning rate rather than from clipping's gradient modification alone. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on direct empirical measurements and large-scale experiments


The paper attributes the Adam-SGD gap to observed training dynamics (small gradient norms, large weight-to-gradient ratios, uneven output gradients, and spikes) identified via direct measurements during LLM pre-training runs. It then demonstrates via experiments that clipping enables SGD to use larger learning rates and closes most of the validation loss gap. Neither the abstract nor the argument as described contains equations, fitted parameters renamed as predictions, or self-citation chains that would reduce the central result to its own inputs by construction. The analysis is checked against external evidence (actual training runs on 1B LLaMA models) and does not invoke uniqueness theorems or ansatzes from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on empirical observations of gradient statistics during LLM training rather than on a large set of free parameters or new theoretical axioms.

axioms (1)

domain assumption: Training dynamics are characterized by small gradient norms and large weight-to-gradient ratios that become more pronounced with larger batch sizes.
Invoked to explain why large effective learning rates are needed.
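A toy simulation can illustrate why the weight-to-gradient ratio would grow with batch size. It rests on a textbook assumption rather than the paper's own analysis: per-sample gradients are modeled as a small true gradient plus independent Gaussian noise, so averaging over a larger batch removes noise and leaves a smaller stochastic gradient. All names and values are hypothetical.

```python
import numpy as np

def minibatch_grad(true_grad, noise_std, batch_size, rng):
    """Average of batch_size noisy per-sample gradients (i.i.d. Gaussian noise assumed)."""
    noise = rng.normal(0.0, noise_std, size=(batch_size, true_grad.size))
    return true_grad + noise.mean(axis=0)

def weight_to_grad_ratio(weights, grad):
    """L2 norm of the weights divided by the L2 norm of the stochastic gradient."""
    return float(np.linalg.norm(weights) / np.linalg.norm(grad))

rng = np.random.default_rng(0)
weights = np.ones(8)
true_grad = np.full(8, 0.01)  # small full-batch gradient (toy value)

ratios = {}
for B in (64, 65536):
    g = minibatch_grad(true_grad, noise_std=1.0, batch_size=B, rng=rng)
    ratios[B] = weight_to_grad_ratio(weights, g)
```

In this model the noise contribution to the squared gradient norm falls as 1/B, so `ratios[65536]` comes out well above `ratios[64]`: a fixed learning rate moves the weights proportionally less per step at the larger batch size.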


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear


unclear
Relation between the paper passage and the cited Recognition theorem.

We attribute a large part of the discrepancy to SGD’s inability to sustain learning rates comparable to Adam’s much larger effective learning rates... small gradient norms and large weight-to-gradient ratios

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 4 internal anchors

[1] Guangsheng Bao, Zhiyang Teng, and Yue Zhang. Token-level fitting issues of seq2seq models. arXiv preprint arXiv:2305.04493.

[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

[3] Woojin Chung and Jeonghoon Kim. Exploiting vocabulary frequency imbalance in language model pre-training. arXiv preprint arXiv:2508.15390.

[4] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

[5] Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628.

[6] Martin Marek, Sanae Lotfi, Aditya Somasundaram, Andrew Gordon Wilson, and Micah Goldblum. Small batch size training for language models: When vanilla SGD works, and why gradient accumulation is wasteful. arXiv preprint arXiv:2507.07101.

[7] Yan Pan and Yuanzhi Li. Toward understanding why Adam converges faster than SGD for transformers. arXiv preprint arXiv:2306.00204.

[8] On Information and Sufficiency. doi: 10.1214/aoms/1177729586. URL https://doi.org/10.1214/aoms/1177729586.
Andrei Semenov, Matteo Pagliardini, and Martin Jaggi. Benchmarking optimizers for large language model pretraining. arXiv preprint arXiv:2509.01440.

[9] Teodora Srećković, Jonas Geiping, and Antonio Orvieto. Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling. arXiv preprint arXiv:2506.12543.

[10] Sho Takase, Shun Kiyono, Sosuke Kobayashi, and Jun Suzuki. Spike no more: Stabilizing the pre-training of large language models. arXiv preprint arXiv:2312.16903.

[11] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay B...

[12] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888.
[13] A Additional Experimental Figures. Figure 7 (panel title: Output Layer Token-wise Variation (Adam); x-axis: Tokens Accessed; curves: B = 64 and B = 65536): Measuring the token class gradient imbalance by the ratio between maximum token class gradient norm and average token-class gradient norm: (max_{i=1,...,V} ∥g_i∥_2) / (V^{-1} Σ_{i=1}^{V} ∥... B Experimental Settings. For our main analysis we pre-train the LLaMA Touvron et al

[14] ... with a vocabulary of 32,000 tokens. We use a data-parallel setup with batch size 2048 (unless stated otherwise in the text) and sequence length 256, giving a token batch size of 524,288 tokens. We train for a total of 5,000 steps, such that the model sees roughly the Chinchilla compute-optimal Hoffmann et al

[15] ... number of tokens (2.6B, i.e., 20 tokens per model parameter). We train using PyTorch's Automatic Mixed Precision (AMP, BF16/FP32) and use the typical cosine learning rate scheduler with linear warm-up for the first 10% of the iterations. For our large scale experiments, we scale up to the 1B LLaMA model, doubling the batch size to 4096 tokens (resulting i...

[16] The large weight-SG norm ratio implies that a small choice of learning rate would limit the optimization algorithm's solution space of w_t to a small neighborhood around the initialization state w_0. It is evident that the usual learning rate choice of SGD in the large batch setting will leave a significant ... [Figure: x-axis Iterations, 0 to 5000]
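The experimental-settings excerpts above quote a token batch size, a token budget, and a cosine schedule with linear warm-up. The sketch below rechecks that arithmetic and shows one common form of such a schedule. The exact warm-up formula, the `min_lr` floor, and the function names are assumptions; the excerpts only state "cosine learning rate scheduler with linear warm-up for the first 10% of the iterations".

```python
import math

def tokens_per_batch(batch_size, seq_len):
    """Number of tokens processed in one optimizer step."""
    return batch_size * seq_len

def cosine_with_warmup(step, total_steps, peak_lr, warmup_frac=0.1, min_lr=0.0):
    """Linear warm-up over the first warmup_frac of steps, then cosine decay to min_lr."""
    warmup_steps = round(warmup_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

batch_tokens = tokens_per_batch(2048, 256)  # 524,288 tokens per step
total_tokens = batch_tokens * 5000          # 2,621,440,000 tokens, about 2.6B
params_at_20_tokens = total_tokens / 20     # about 131M parameters
```

At 20 tokens per parameter, the 2.6B-token budget corresponds to roughly 131M parameters, which is consistent with the LLaMA 130M model named in the figure captions.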
