Outer-Momentum Restarting in High-Dimensional Two-Phase Optimization

Allan Ma; Anna Choromanska; Kristi Topollai; Sui Jiet Tay; Tolga Dimlioglu

arxiv: 2605.28585 · v1 · pith:5CN7UH6Rnew · submitted 2026-05-27 · 💻 cs.LG

Kristi Topollai, Allan Ma, Tolga Dimlioglu, Sui Jiet Tay, Anna Choromanska

Pith reviewed 2026-06-29 13:28 UTC · model grok-4.3

classification 💻 cs.LG

keywords outer momentum restarting · distributed optimization · DiLoCo · empirical NTK · two-phase optimization · communication-efficient training · language model pretraining

The pith

Periodic restarts of outer momentum in two-phase distributed optimization discard stale memory via phase cancellation while preserving inner-loop progress.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates periodic restarting of the outer momentum in communication-efficient setups like DiLoCo, where workers run many local steps before syncing. Using a linearized squared-loss model with residuals evolving under the empirical NTK, it derives that restarts trigger a mode-wise contraction which clears outdated momentum terms through phase cancellation but keeps the gains from the inner loop. Toy experiments match the predicted contraction, and language-model pretraining runs demonstrate that the restarts expand the range of stable outer learning rates and momentum values over varying communication periods.
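The two-phase loop described above can be sketched on a small least-squares problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `diloco_toy`, the heavy-ball form of the outer update, the plain minibatch SGD inner loop, and every default value are hypothetical choices made for the example.

```python
import numpy as np

def diloco_toy(A, b, w0, rounds=30, workers=2, inner_steps=8, inner_lr=0.05,
               outer_lr=0.7, beta_out=0.9, restart_every=None, seed=0):
    """Toy two-phase optimizer: local SGD inner loop, heavy-ball outer loop.

    restart_every=K zeroes the outer momentum buffer every K communication
    rounds (a hard restart); None disables restarts. All names are illustrative.
    """
    rng = np.random.default_rng(seed)
    w = w0.copy()
    m = np.zeros_like(w)                      # outer momentum buffer
    losses = []
    for t in range(rounds):
        # Hard restart: drop the accumulated outer momentum, keep the weights.
        if restart_every is not None and t > 0 and t % restart_every == 0:
            m = np.zeros_like(w)
        deltas = []
        for _ in range(workers):
            w_local = w.copy()
            for _ in range(inner_steps):      # inner phase: local minibatch SGD
                idx = rng.choice(len(b), size=len(b) // 2, replace=False)
                grad = A[idx].T @ (A[idx] @ w_local - b[idx]) / len(idx)
                w_local = w_local - inner_lr * grad
            deltas.append(w - w_local)        # this worker's pseudo-gradient
        g_bar = np.mean(deltas, axis=0)       # aggregate across workers
        m = beta_out * m + g_bar              # outer phase: heavy-ball momentum
        w = w - outer_lr * m
        losses.append(0.5 * np.mean((A @ w - b) ** 2))
    return w, np.array(losses)

# Small noiseless regression problem for the demo.
rng = np.random.default_rng(1)
A = rng.standard_normal((64, 8))
b = A @ rng.standard_normal(8)
w0 = np.zeros(8)

_, loss_plain = diloco_toy(A, b, w0)
_, loss_restart = diloco_toy(A, b, w0, restart_every=3)
print("final loss, no restart:", loss_plain[-1])
print("final loss, restart every 3 rounds:", loss_restart[-1])
```

In this sketch a restart touches only the buffer `m`; the weights, and therefore whatever the inner loop achieved, carry over unchanged. Restarting every round (`restart_every=1`) reduces the outer update to momentum-free averaging.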

Core claim

In the linearized squared-loss model where prediction-space residuals evolve under the empirical NTK, periodic restarts of the outer momentum produce a mode-wise restart contraction showing that resets exploit phase cancellation by discarding stale momentum while preserving inner-loop progress.

What carries the argument

mode-wise restart contraction under empirical NTK dynamics
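To make the "mode-wise" language concrete, here is a short worked sketch rather than the paper's derivation. It assumes a heavy-ball outer update acting on a single mode with effective eigenvalue σ, meaning one round of local updates turns a residual e into the pseudo-gradient σe, with outer learning rate ν and outer momentum β_out. The symbols M and c_K are illustrative names; the paper's own quantities χ_K(σ) and r_K are not defined on this page and may differ.

```latex
% Assumed per-mode heavy-ball outer recursion (illustrative, not taken from the paper):
%   m_{t+1} = \beta_{\mathrm{out}} m_t + \sigma e_t, \qquad e_{t+1} = e_t - \nu m_{t+1}.
\[
\begin{pmatrix} e_{t+1} \\ m_{t+1} \end{pmatrix}
= M(\sigma) \begin{pmatrix} e_t \\ m_t \end{pmatrix},
\qquad
M(\sigma) =
\begin{pmatrix}
1 - \nu\sigma & -\nu\beta_{\mathrm{out}} \\
\sigma & \beta_{\mathrm{out}}
\end{pmatrix},
\qquad
\det M(\sigma) = \beta_{\mathrm{out}} .
\]
% A restart every K rounds starts each cycle at m = 0 and discards m at the end of the cycle:
\[
e_K = c_K(\sigma)\, e_0,
\qquad
c_K(\sigma) = \bigl[ M(\sigma)^K \bigr]_{11},
\qquad
c_1 = 1 - \nu\sigma,
\quad
c_2 = (1 - \nu\sigma)^2 - \nu\beta_{\mathrm{out}}\sigma .
\]
% When M has complex eigenvalues \rho e^{\pm i\theta}, with \rho = \sqrt{\beta_{\mathrm{out}}}
% and 2\rho\cos\theta = 1 - \nu\sigma + \beta_{\mathrm{out}}, Cayley--Hamilton gives:
\[
c_K(\sigma)
= \rho^{K-1} \frac{\sin(K\theta)}{\sin\theta}\,(1 - \nu\sigma)
- \rho^{K} \frac{\sin\bigl((K-1)\theta\bigr)}{\sin\theta} .
\]
```

Under these assumptions the sine terms give one way to read "phase cancellation": for a given mode, some restart periods K make c_K small, and zeroing the momentum at that point discards the part of the state that would otherwise carry the oscillation forward.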

If this is right

Periodic outer-momentum restarts widen the stable range of outer learning rates and momentum values across communication periods.
The outer optimizer controls how local-update progress accumulates across rounds, and restarts provide a complementary control on outer memory.
Toy experiments confirm the contraction behavior predicted by the linearized model.
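A contraction check of the kind mentioned in the last item can be mimicked on a scalar toy. The sketch below is an assumption-laden illustration, not the paper's experiment: it models one mode with effective eigenvalue `sigma` under a heavy-ball outer update (`m = beta*m + sigma*e`, then `e = e - nu*m`), compares a direct simulation with restarts against the matrix-power prediction, and scans restart periods. The function names and parameter values are hypothetical.

```python
import numpy as np

def transition(sigma, nu, beta):
    """One outer round on the state (residual e, momentum m) of a single mode."""
    return np.array([[1.0 - nu * sigma, -nu * beta],
                     [sigma, beta]])

def cycle_factor(sigma, nu, beta, K):
    """Residual multiplier over one restart cycle of K rounds, starting at m = 0."""
    return np.linalg.matrix_power(transition(sigma, nu, beta), K)[0, 0]

def simulate(sigma, nu, beta, K, cycles, e0=1.0):
    """Run the scalar recursion directly, zeroing m at the start of each cycle."""
    e = e0
    for _ in range(cycles):
        m = 0.0                       # hard restart
        for _ in range(K):
            m = beta * m + sigma * e
            e = e - nu * m
    return e

def best_period(sigma, nu, beta, K_range):
    """Pick the K with the smallest per-round contraction |c_K|**(1/K)."""
    rates = {K: abs(cycle_factor(sigma, nu, beta, K)) ** (1.0 / K) for K in K_range}
    return min(rates, key=rates.get), rates

sigma, nu, beta = 0.3, 1.0, 0.9       # illustrative values only
predicted = cycle_factor(sigma, nu, beta, K=3) ** 4
observed = simulate(sigma, nu, beta, K=3, cycles=4)
print("predicted vs simulated residual:", predicted, observed)

K_star, rates = best_period(sigma, nu, beta, range(1, 11))
print("best restart period for this mode:", K_star)
```

With a full spectrum rather than one mode, a natural extension would take the worst case of the per-round rate over all σ before choosing K, which is one reason a single period need not be best everywhere.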

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The restart mechanism might transfer to other two-phase or federated optimizers that rely on outer momentum.
If the contraction holds only in the linear regime, non-linear effects in very deep models could require adjusted restart frequencies.
Pairing restarts with inner-loop length tuning could further reduce communication while maintaining convergence.

Load-bearing premise

The linearized squared-loss model with empirical NTK dynamics accurately predicts behavior in the actual high-dimensional non-linear optimization used for language-model pretraining.

What would settle it

Language-model pretraining runs that achieve the same or wider stable range of outer learning rates and momentum values without any restarts would falsify the claimed benefit of the mode-wise contraction.

Figures

Figures reproduced from arXiv: 2605.28585 by Allan Ma, Anna Choromanska, Kristi Topollai, Sui Jiet Tay, Tolga Dimlioglu.

**Figure 2.** Robustness over outer hyperparameters. Each heatmap shows clipped log10 final loss over a grid of β_out and ν. The best periodic restart enlarges the good-hyperparameter region for HB and NAG. In experiments, we select K from an admissible integer range by minimizing |χ_K(σ)|, equivalently maximizing the restarted rate r_K. The phase interpretation clarifies why no single period is uniformly optimal as large…

**Figure 3.** Validation perplexity over the outer hyperparameter grid at S = 512 for HB (left) and NAG (right), comparing standard DiLoCo against DiLoCo with momentum restart period K = 3. Periodic restarts reduce the high-β_out failure region while preserving peak performance. Restarts reduce retuning sensitivity across communication periods. We next test how much retuning is needed as the communication period changes. W…

**Figure 4.** Final validation perplexity across communication periods S ∈ {64, 128, 512}, for NAG (left) and HB (right). No-restart curves fix ν = ν* and sweep β_out; restart curves fix (ν, β_out) = (ν*, β*_out) and sweep K. … obtaining (ν*, β*_out) = (0.9, 0.7) for NAG and (1.1, 0.5) for HB. For no-restart runs, we keep ν = ν* and sweep β_out, while for runs with restart, we keep (ν, β_out) = (ν*, β*_out) and sweep …

**Figure 5.** Robustness over outer hyperparameters. Each heatmap shows clipped log10 final loss over a grid of β_out and ν. The best periodic restart enlarges the good-hyperparameter region for HB and NAG. B.2. Hyperparameter Sweeps for Llama-150M. We evaluate DiLoCo by pretraining Llama-150M on the C4 dataset from scratch using a 2-replica DiLoCo configuration with two H200 GPUs (128GB memory). We first tuned the inner …

**Figure 6.** Soft restart sweeps. Each cell shows final validation perplexity under the boundary update m ← αm + βḡ, where α controls momentum retention and β controls pseudo-gradient injection at the restart boundary. B.4. Additional Robustness Results for DiLoCo. We use the term robustness to mean low sensitivity, or equivalently low variation, of the validation metric across the explored hyperparameter settings. Int…

**Figure 7.** Final validation perplexity for NAG as outer optimizer, shown as a function of restart period K (bottom axis, red) and outer momentum coefficient β_out (top axis, blue) for S ∈ {64, 128, 512, 1024, 2048}. The blue curve sweeps β_out with no momentum restarts; the red curve fixes β_out and sweeps K. Dashed lines with annotated values indicate runs that diverged and exceeded the plot range. Without restarts, pe…

**Figure 8.** Same as …

**Figure 9.** Validation perplexity over the outer hyperparameter grid at S = 128 for HB (left) and NAG (right), comparing standard DiLoCo against DiLoCo with momentum restart period K = 3. Periodic restarts reduce the high-β_out failure region while preserving peak performance.
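The soft-restart boundary update quoted in the Figure 6 caption, m ← αm + βḡ, can be sketched as a single outer step. This is a hedged illustration: the function `outer_step`, the choice to apply the boundary rule in place of the regular momentum update on boundary rounds, and the name `beta_inj` for the caption's β (renamed to avoid confusion with β_out) are assumptions, not details confirmed by the source.

```python
import numpy as np

def outer_step(w, m, g_bar, t, outer_lr, beta_out, K=None, alpha=0.0, beta_inj=1.0):
    """One heavy-ball outer step with an optional soft restart every K rounds.

    On a boundary round the buffer follows m <- alpha * m + beta_inj * g_bar;
    on every other round it follows m <- beta_out * m + g_bar.
    alpha controls momentum retention, beta_inj controls pseudo-gradient injection.
    """
    at_boundary = K is not None and t > 0 and t % K == 0
    if at_boundary:
        m = alpha * m + beta_inj * g_bar      # soft restart at the boundary
    else:
        m = beta_out * m + g_bar              # regular outer momentum update
    w = w - outer_lr * m
    return w, m

# Illustrative state and aggregated pseudo-gradient.
w = np.ones(3)
m = np.full(3, 2.0)
g_bar = np.array([1.0, -1.0, 0.5])

# alpha = 0, beta_inj = 1 discards the old buffer and keeps only the fresh pseudo-gradient.
w_hard, m_hard = outer_step(w, m, g_bar, t=3, outer_lr=0.5, beta_out=0.9, K=3)
# alpha = beta_out, beta_inj = 1 makes the boundary round identical to a regular round.
w_none, m_none = outer_step(w, m, g_bar, t=3, outer_lr=0.5, beta_out=0.9, K=3,
                            alpha=0.9, beta_inj=1.0)
print(m_hard, m_none)
```

Under this parameterization the two knobs interpolate between a hard restart and no restart at all, which is one plausible reading of what a sweep over α and β explores.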

Original abstract

Communication-efficient distributed optimizers such as DiLoCo reduce synchronization costs by letting workers perform many local updates before aggregating their progress with an outer momentum optimizer. Recent theory suggests that the outer optimizer acts on an effective spectrum induced by the inner optimization loop, and that the choice of outer momentum controls how progress from local updates is accumulated across communication rounds. We study periodic restarting of the outer momentum as a simple complementary mechanism for controlling this outer memory. In a linearized squared-loss model where prediction-space residuals evolve under the empirical NTK, we derive a mode-wise restart contraction showing that resets exploit phase cancellation by discarding stale momentum while preserving inner-loop progress. Toy experiments verify the predicted contraction behavior, and language-model pretraining experiments show that periodic restarts widen the stable range of outer learning rates and momentum values across communication periods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper adds periodic restarts to outer momentum in DiLoCo-style training, derives a contraction via phase cancellation in a linearized NTK model, and shows wider stable outer-LR ranges in LM runs, but the model-to-practice gap is the main limit.

The core takeaway is that restarting the outer momentum every few communication rounds gives a simple way to limit stale accumulation in two-phase distributed optimizers without hurting local inner-loop progress.

What is new is the explicit focus on outer-momentum restarts plus the mode-wise contraction derivation inside the linearized squared-loss model under empirical NTK dynamics. The paper shows that resets exploit phase cancellation to discard old momentum while the inner updates continue to accumulate. Toy experiments match the predicted contraction, and the LM pretraining runs indicate that restarts expand the range of outer learning rates and momentum values that remain stable across communication periods.

The derivation is the strongest part: it stays inside the stated linear model and produces a clear, falsifiable prediction rather than post-hoc fitting. The experiments are at least directionally consistent with that prediction.

The main soft spot is the modeling assumption. The contraction is shown only for prediction-space residuals under a linearized squared-loss NTK; how well this carries over to the non-linear, high-dimensional loss surfaces in actual language-model pretraining is not demonstrated beyond the reported runs. The abstract also gives no error bars, exact communication schedules, or ablation details, so the size of the practical gain is hard to judge from the summary alone.

This paper is for researchers tuning communication-efficient optimizers for large-scale training. Someone already working with DiLoCo or similar outer-momentum methods would find the restart idea and the linear analysis useful to try.

It deserves peer review. The combination of a clean derivation and relevant experiments is enough to warrant referee time, even if the theory-practice bridge needs more work.

Referee Report

0 major / 2 minor

Summary. The manuscript studies periodic restarting of the outer momentum in communication-efficient two-phase optimizers such as DiLoCo. In a linearized squared-loss model where prediction-space residuals evolve under the empirical NTK, it derives a mode-wise restart contraction that exploits phase cancellation to discard stale momentum while preserving inner-loop progress. Toy experiments are reported to verify the predicted contraction, and language-model pretraining runs indicate that periodic restarts widen the stable range of outer learning rates and momentum values across communication periods.

Significance. If the derivation holds, the work supplies a simple, theoretically motivated mechanism for controlling outer memory in distributed optimization of large models. The parameter-free derivation of the mode-wise contraction inside the linearized model, together with its verification in toy settings and the reported expansion of the stable hyperparameter region in LM pretraining, constitute concrete strengths.

minor comments (2)

Abstract: the summary supplies neither the key equations of the linearized model nor any quantitative statements (e.g., contraction factors or error bars), which reduces immediate verifiability of the central claim.
LM pretraining section: exclusion criteria for the reported runs and the precise definition of “stable range” are not stated, making it difficult to assess how the observed widening was quantified.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the supportive summary and recommendation of minor revision. No major comments were provided in the report, so we have no specific points requiring rebuttal or revision at this stage.

Circularity Check

0 steps flagged

Derivation self-contained in linearized NTK model

The central derivation of mode-wise restart contraction occurs analytically inside the stated linearized squared-loss model with empirical NTK residual dynamics, using phase cancellation to show discarding of stale momentum while preserving inner-loop progress. This is a direct consequence of the model's equations rather than any fitted parameter renamed as prediction, self-definitional loop, or load-bearing self-citation. Toy experiments verify the derived behavior and LM pretraining reports widened stable ranges, but neither feeds back into the derivation itself. No steps reduce by construction to the paper's inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the linearized NTK model captures the relevant outer-momentum dynamics; no free parameters or new entities are introduced in the abstract.

axioms (1)

Domain assumption: the linearized squared-loss model with empirical NTK accurately represents the outer-momentum dynamics of the two-phase optimizer.
Invoked to derive the mode-wise restart contraction.

pith-pipeline@v0.9.1-grok · 5676 in / 1055 out tokens · 37322 ms · 2026-06-29T13:28:53.268724+00:00 · methodology

