FOAM: Frequency and Operator Error-Based Adaptive Damping Method for Reducing Staleness-Oriented Error for Shampoo

Kyunghun Nam; Sumyeong Ahn

arxiv: 2606.02365 · v1 · pith:2FWF6442new · submitted 2026-06-01 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-28 15:47 UTC · model grok-4.3

keywords Shampoo · adaptive damping · staleness error · preconditioners · second-order optimization · eigendecomposition · machine learning

The pith

FOAM reduces wall-clock time for Shampoo by adaptively controlling damping and eigendecomposition frequency based on staleness error approximation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes how stale preconditioner updates in Shampoo degrade both convergence and numerical stability, while showing that damping can counteract those effects. From this analysis the authors derive FOAM, an algorithm that approximates the staleness-oriented error to decide when to adjust the damping factor and how often to recompute the eigendecomposition. Experiments indicate that the resulting adaptive schedule shortens wall-clock training time relative to standard Shampoo without harming final convergence. A reader would care because the main practical obstacle to using Shampoo at scale is the cost of frequent matrix operations, and FOAM directly targets that cost through error-driven adaptation.

Core claim

By modeling staleness effects on convergence and stability, the work establishes that an approximation of the staleness-oriented error can be used to dynamically set both the damping factor and the eigendecomposition frequency, thereby allowing Shampoo to operate with stale preconditioners while reducing wall-clock time and preserving robust convergence.

What carries the argument

FOAM, the adaptive algorithm that approximates the staleness-oriented error to guide changes in the damping factor and eigendecomposition frequency.
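
A minimal sketch of what such a control loop could look like, under stated assumptions. This is not the paper's algorithm: the class name `AdaptiveDampingController`, the multiplicative `growth` factor, and the decision order (raise damping first, recompute the eigendecomposition only once damping is saturated) are hypothetical. The parameter names `f`, `tau`, `eps0`, and `eps_max` borrow the symbols (f, τ, ϵ0, ϵmax) from the Figure 7 caption, but their roles here are assumed, and the error proxy is passed in as a plain number because the page does not give its formula.

```python
from dataclasses import dataclass, field


@dataclass
class AdaptiveDampingController:
    """Hypothetical error-driven controller; an illustration, not FOAM itself."""

    f: int = 50            # assumed role: maximum steps between eigendecompositions
    tau: float = 0.4       # assumed role: tolerated value of the staleness-error proxy
    eps0: float = 1e-9     # assumed role: base damping after a refresh
    eps_max: float = 3e-7  # assumed role: ceiling on the damping factor
    growth: float = 10.0   # hypothetical multiplicative damping increase
    eps: float = field(init=False)
    steps_since_eig: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        # Start from the base damping.
        self.eps = self.eps0

    def step(self, error_proxy: float) -> bool:
        """Update the damping and return True if the eigendecomposition should be recomputed."""
        self.steps_since_eig += 1
        if self.steps_since_eig >= self.f:
            return self._refresh()      # periodic refresh, as in stale Shampoo
        if error_proxy <= self.tau:
            return False                # stale factors are still acceptable
        if self.eps < self.eps_max:
            # Cheap response first: increase damping, capped at eps_max.
            self.eps = min(self.eps * self.growth, self.eps_max)
            return False
        return self._refresh()          # damping is saturated: pay for a refresh

    def _refresh(self) -> bool:
        # Reset damping and the step counter after a recomputation.
        self.eps = self.eps0
        self.steps_since_eig = 0
        return True
```

In a training loop, `step` would be called once per optimizer step with the current proxy value, and a `True` return would trigger the expensive eigendecomposition. The design choice illustrated is only the general shape of the idea described above: spend cheap damping before spending an eigendecomposition.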

Load-bearing premise

An approximation of the staleness-oriented error can reliably guide dynamic control of the damping factor and eigendecomposition frequency without introducing new instabilities or degrading optimization performance.

What would settle it

Running the reported large-scale benchmarks and observing that FOAM either increases wall-clock time or produces unstable training or worse final performance than standard Shampoo.

Figures

Figures reproduced from arXiv: 2606.02365 by Kyunghun Nam, Sumyeong Ahn.

**Figure 1.** Wall-clock efficiency comparison between Shampoo with stale update and FOAM update. Figure 1a-1b presents the best training loss and validation accuracy for the ViT (ImageNet-1K) task, while Figure 1c-1d shows the training loss and WER for the Conformer (LibriSpeech) task. In all plots, the blue rectangular areas represent the region of superior performance, where FOAM achieves better convergence and final…

**Figure 4.** Ablation study of FOAM.

**Figure 5.** Learning curve (wall-clock time vs. loss).

**Figure 6.** ViT: best training loss and validation accuracy.

**Figure 7.** Conformer: best training loss and validation WER.

| Optimizer | (f, τ, ϵ₀, ϵₘₐₓ) | Train loss | Validation WER | Wall-clock time (min) |
|---|---|---|---|---|
| AdamW | (N/A, N/A, 10⁻⁹, N/A) | 0.42 | 0.099 | 205 |
| stale Shampoo | (50, N/A, 10⁻⁹, N/A) | 0.17 | 0.069 | 264 |
| DR-Shampoo | (50, 0.4, 10⁻⁹, N/A) | 0.18 | 0.065 | 294 |
| FOAM (Ours) | (50, 0.4, 10⁻⁹, 3 × 10⁻⁷) | 0.14 | 0.065 | 257 |
| SOAP | (50, N/A, 10⁻⁹, N/A) | 0.17 | 0.067 | 265 |

**Figure 8.** Synthetic validation of the proxy. Panel (a) reports configuration-wise decision quality, showing that h(ϵ) provides an almost perfect trigger signal for eigendecomposition and substantially outperforms the diagonalization-residual baseline d(ϵ). Panels (b) and (c) pool all samples across the sweep: the calibration plot shows that ∆(ϵ)/h(ϵ) ≤ 1 throughout the evaluated range, indicating conservative behavi…
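
The Conformer numbers listed under Figure 7 can be turned into relative wall-clock differences with a few lines of arithmetic. The sketch below only restates the figures printed in that caption; the dictionary name `FIGURE_7` and the helper names are hypothetical, and "FOAM (Ours)" is shortened to "FOAM".

```python
# (train loss, validation WER, wall-clock minutes) copied from the Figure 7 caption.
FIGURE_7 = {
    "AdamW": (0.42, 0.099, 205),
    "stale Shampoo": (0.17, 0.069, 264),
    "DR-Shampoo": (0.18, 0.065, 294),
    "FOAM": (0.14, 0.065, 257),
    "SOAP": (0.17, 0.067, 265),
}


def relative_change(new: float, base: float) -> float:
    """Signed relative change of `new` versus `base`; negative means a reduction."""
    return (new - base) / base


def time_change_vs(baseline: str, method: str = "FOAM") -> float:
    """Relative wall-clock change of `method` against `baseline`."""
    return relative_change(FIGURE_7[method][2], FIGURE_7[baseline][2])


if __name__ == "__main__":
    for name in FIGURE_7:
        if name != "FOAM":
            print(f"FOAM vs {name}: {time_change_vs(name):+.1%} wall-clock")
```

On these numbers FOAM's 257 minutes is roughly 2.7% below stale Shampoo's 264, 3.0% below SOAP's 265, and 12.6% below DR-Shampoo's 294, while remaining about 25% above AdamW's 205. The caption as reproduced here carries no error bars, so these differences should be read as single-run comparisons.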

Original abstract

Shampoo is attracting considerable attention for its superior performance on large-scale optimization benchmarks; yet it faces a significant practical bottleneck: the prohibitive computational overhead of matrix inversion. To mitigate this, practitioners typically rely on stale preconditioner updates, creating a fundamental trade-off between computational efficiency and optimization fidelity. In this work, we provide a theoretical study of staleness through the complementary lenses of convergence and stability. While staleness improves computational efficiency, it inherently degrades performance and introduces numerical instability. Crucially, we identify that damping, acting as a numerical stabilizer, can effectively suppress these negative effects. Guided by this analysis, we propose FOAM, an adaptive algorithm that stabilizes training by dynamically controlling both the damping factor and the eigendecomposition frequency based on an approximation of the staleness-oriented error. Experimental results demonstrate that FOAM reduces wall-clock time compared to standard Shampoo while maintaining robust convergence.
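
For orientation, the damped, periodically refreshed Shampoo update that the abstract alludes to can be written out as follows. This is a sketch based on the standard Shampoo formulation for a matrix parameter, not a transcription of the paper's own equations. The symbols $W_t$, $G_t$, $L_t$, $R_t$, $\eta$, and the refresh index $s(t)$ are notation assumed here; $\epsilon$ (damping) and $f$ (refresh interval) are assumed to correspond to the hyperparameters of the same name in the Figure 7 caption.

```latex
% Standard Shampoo statistics for a matrix parameter W_t with gradient G_t
L_t = L_{t-1} + G_t G_t^{\top}, \qquad R_t = R_{t-1} + G_t^{\top} G_t

% Damped update with stale factors: inverse roots are recomputed only at step s(t)
W_{t+1} = W_t - \eta \,(L_{s(t)} + \epsilon I)^{-1/4}\, G_t \,(R_{s(t)} + \epsilon I)^{-1/4},
\qquad s(t) = f \lfloor t / f \rfloor

% Each inverse root comes from an eigendecomposition L_s = Q \Lambda Q^{\top} with \Lambda \succeq 0
(L_s + \epsilon I)^{-1/4} = Q \,(\Lambda + \epsilon I)^{-1/4}\, Q^{\top},
\qquad \bigl\| (L_s + \epsilon I)^{-1/4} \bigr\|_2 \le \epsilon^{-1/4}
```

The final bound is one way to see why damping can act as a stabilizer: however small a stale eigenvalue is, each preconditioner factor cannot amplify a gradient direction by more than $\epsilon^{-1/4}$.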

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FOAM adds adaptive control of damping and eigendecomposition frequency to Shampoo using a staleness-error approximation, but the abstract leaves the approximation and experimental controls too thin to judge reliability.

FOAM targets the staleness that comes from infrequent preconditioner updates in Shampoo. The authors argue that staleness hurts both convergence and numerical stability, then show that damping can offset those effects. They turn that observation into an adaptive rule that changes the damping factor and the frequency of eigendecompositions according to an approximation of the staleness-induced error.

The analysis of how staleness interacts with convergence and stability is the clearest part of the work; it gives a concrete reason to treat damping as more than a fixed hyperparameter. The claim that this adaptive scheme cuts wall-clock time while keeping convergence intact is the main experimental result.
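
The point about damping can be illustrated with a scalar toy problem. The sketch below is not the paper's analysis: it assumes a single curvature eigenvalue, a Shampoo-style inverse fourth root, and hypothetical names and values (`inv_fourth_root`, `staleness_gap`, `LAM_STALE`, `LAM_FRESH`). It measures how far a stale preconditioner drifts from the fresh one as the damping grows.

```python
def inv_fourth_root(lam: float, eps: float) -> float:
    """Damped inverse fourth root of a single eigenvalue, (lam + eps)^(-1/4)."""
    return (lam + eps) ** -0.25


def staleness_gap(lam_stale: float, lam_fresh: float, eps: float) -> float:
    """Relative mismatch between the stale and the fresh scalar preconditioner."""
    fresh = inv_fourth_root(lam_fresh, eps)
    stale = inv_fourth_root(lam_stale, eps)
    return abs(stale - fresh) / fresh


# Hypothetical eigenvalues: the statistic grew 100x since the last refresh.
LAM_STALE, LAM_FRESH = 1e-8, 1e-6
DAMPINGS = [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]

# One gap per damping value, in the same order as DAMPINGS.
gaps = [staleness_gap(LAM_STALE, LAM_FRESH, eps) for eps in DAMPINGS]

if __name__ == "__main__":
    for eps, gap in zip(DAMPINGS, gaps):
        print(f"eps={eps:.0e}  relative gap={gap:.3f}")
```

In this toy setting the gap shrinks monotonically as the damping increases, which matches the qualitative claim that damping suppresses staleness effects. The cost is that a very large damping pushes the preconditioner toward a constant scaling and discards curvature information, which is presumably why an adaptive scheme would also cap the damping.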

The soft spots are the missing details. The abstract does not spell out how the staleness-oriented error is approximated, or whether that approximation is derived independently or fitted to the same runs. It summarizes the experiments at a high level, mentioning no datasets, model sizes, baseline comparisons, or error bars, so it is impossible to tell from the abstract alone whether the reported speedups are robust or sensitive to particular choices. The scope stays inside the Shampoo family, which is fine for an engineering paper but limits broader impact.

This is for people already running or extending Shampoo on large models who need to trade off update cost against stability. It is worth sending to peer review because the underlying bottleneck is real and the proposed control loop is testable; a referee can ask for the missing derivation and the full experimental controls.

Referee Report

2 major / 0 minor

Summary. The paper provides a theoretical analysis of staleness effects on convergence and stability in the Shampoo optimizer, noting that stale preconditioner updates improve efficiency but degrade performance and introduce instability, with damping acting as a stabilizer. It proposes FOAM, an adaptive method that dynamically modulates the damping factor and eigendecomposition frequency using an approximation of the staleness-oriented error. Experiments claim that FOAM reduces wall-clock time relative to standard Shampoo while preserving robust convergence.

Significance. If the approximation of staleness-oriented error is independently derived and the adaptive controls prove stable, the work could meaningfully improve the practicality of second-order methods like Shampoo on large-scale problems by addressing a key computational bottleneck. The dual theoretical and experimental framing is a positive feature, though the absence of explicit derivations or controls in the provided text limits evaluation of its broader impact.

major comments (2)

[Abstract] The abstract asserts a theoretical study of staleness plus supporting experiments, yet supplies no equations, proof sketches, dataset details, or error-bar information, so it is not possible to verify whether the data or derivations support the stated claim.

[Abstract] Without details on the derivation, it is unclear whether the 'approximation of the staleness-oriented error' is derived independently or reduces to a quantity defined in terms of parameters fitted to the same training runs, which is load-bearing for the central adaptive-control claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments. We address the two major comments on the abstract point by point below. The full manuscript contains the requested theoretical and experimental details.

Referee: [Abstract] The abstract asserts a theoretical study of staleness plus supporting experiments, yet supplies no equations, proof sketches, dataset details, or error-bar information, so it is not possible to verify whether the data or derivations support the stated claim.

Authors: Abstracts are concise summaries and standardly omit detailed equations, proofs, or experimental specifics to meet length limits. The full manuscript provides the theoretical analysis of staleness effects on convergence and stability (including equations and proof sketches) in the dedicated theory sections, along with dataset details and error bars in the experiments section. Readers can verify the claims from the main text.

Revision: no

Referee: [Abstract] Without details on the derivation, it is unclear whether the 'approximation of the staleness-oriented error' is derived independently or reduces to a quantity defined in terms of parameters fitted to the same training runs, which is load-bearing for the central adaptive-control claim.

Authors: The approximation of the staleness-oriented error is independently derived from the theoretical analysis of how staleness degrades preconditioner quality and introduces instability. It follows directly from the mathematical modeling of staleness effects rather than from parameters fitted to training runs. The explicit derivation appears in the theoretical section of the manuscript.

Revision: no

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

The abstract presents a theoretical analysis of staleness effects on convergence and stability, followed by an adaptive method (FOAM) that uses an approximation of staleness-oriented error to control damping and eigendecomposition frequency. No equations, self-citations, or fitted parameters are quoted that reduce the central claim (reduced wall-clock time with maintained convergence) to a definition or input by construction. The approximation is described as guided by analysis rather than fitted to the target metric. This matches the default expectation of no circularity when the derivation chain does not exhibit the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient technical detail to enumerate free parameters, axioms, or invented entities; the central claim appears to rest on the unelaborated assertion that damping suppresses staleness effects and that the error approximation is accurate enough to drive adaptation.


[33] [33]

The Eleventh International Conference on Learning Representations , year=

Fisher-Legendre (FishLeg) optimization of deep neural networks , author=. The Eleventh International Conference on Learning Representations , year=

[34] [34]

Advances in Neural Information Processing Systems , volume=

Exact natural gradient in deep linear networks and its application to the nonlinear case , author=. Advances in Neural Information Processing Systems , volume=

[35] [35]

Advances in Neural Information Processing Systems , volume=

Practical quasi-newton methods for training deep neural networks , author=. Advances in Neural Information Processing Systems , volume=

[36] [36]

Advances in Neural Information Processing Systems , volume=

Exact, tractable gauss-newton optimization in deep reversible architectures reveal poor generalization , author=. Advances in Neural Information Processing Systems , volume=

[37] [37]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Language Models are Few-Shot Learners , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[38] [38]

SIAM review , volume=

Optimization methods for large-scale machine learning , author=. SIAM review , volume=. 2018 , publisher=

2018

[39] [39]

Scaling Laws for Neural Language Models

Scaling Laws for Neural Language Models , author =. arXiv preprint arXiv:2001.08361 , year =

work page internal anchor Pith review Pith/arXiv arXiv 2001

[40] [40]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Training Compute-Optimal Large Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[41] [41]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Limitations of the Empirical Fisher Approximation for Natural Gradient Descent , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[42] [42]

The Twelfth International Conference on Learning Representations , year=

On the Parameterization of Second-Order Optimization Effective towards the Infinite Width , author=. The Twelfth International Conference on Learning Representations , year=

[43] [43]

The Thirteenth International Conference on Learning Representations , year=

Accelerating neural network training: An analysis of the AlgoPerf competition , author=. The Thirteenth International Conference on Learning Representations , year=

[44] [44]

Numerical analysis: proceedings of the biennial Conference held at Dundee, June 28--July 1, 1977 , pages=

The Levenberg-Marquardt algorithm: implementation and theory , author=. Numerical analysis: proceedings of the biennial Conference held at Dundee, June 28--July 1, 1977 , pages=. 2006 , organization=

1977

[45] [45]

Adam: A Method for Stochastic Optimization

Adam: A method for stochastic optimization , author=. arXiv preprint arXiv:1412.6980 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[46] [46]

2018 , publisher=

Lectures on convex optimization , author=. 2018 , publisher=

2018

[47] [47]

arXiv preprint arXiv:2309.06497 , year=

A distributed data-parallel pytorch implementation of the distributed shampoo optimizer for training neural networks at-scale , author=. arXiv preprint arXiv:2309.06497 , year=

work page arXiv

[48] [48]

Ussr computational mathematics and mathematical physics , volume=

Some methods of speeding up the convergence of iteration methods , author=. Ussr computational mathematics and mathematical physics , volume=. 1964 , publisher=

1964

[49] [49]

The Thirteenth International Conference on Learning Representations , year=

A New Perspective on Shampoo's Preconditioner , author=. The Thirteenth International Conference on Learning Representations , year=

[50] [50]

Online Learning: A Modern Introduction Using Convex Optimization

A modern introduction to online learning , author=. arXiv preprint arXiv:1912.13213 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1912

[51] [51]

, author=

Adaptive subgradient methods for online learning and stochastic optimization. , author=. Journal of machine learning research , volume=

[52] [52]

Biometrika , volume=

A useful variant of the Davis--Kahan theorem for statisticians , author=. Biometrika , volume=. 2015 , publisher=

2015

[53] [53]

arXiv preprint arXiv:2211.15596 , year=

A survey of deep learning optimizers--first and second order methods , author=. arXiv preprint arXiv:2211.15596 , year=

work page arXiv

[54] [54]

2016 , publisher=

Information geometry and its applications , author=. 2016 , publisher=

2016

[55] [55]

CoRR , year=

Discounted adaptive online prediction , author=. CoRR , year=

[56] [56]

1998 , publisher=

The symmetric eigenvalue problem , author=. 1998 , publisher=

1998

[57] [57]

Forty-first International Conference on Machine Learning , year=

Online Linear Regression in Dynamic Environments via Discounting , author=. Forty-first International Conference on Machine Learning , year=

[58] [58]

OpenAI blog , volume=

Language models are unsupervised multitask learners , author=. OpenAI blog , volume=

[59] [59]

Pointer Sentinel Mixture Models

Pointer sentinel mixture models , author=. arXiv preprint arXiv:1609.07843 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[60] [60]

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

SmolLM2: When Smol Goes Big--Data-Centric Training of a Small Language Model , author=. arXiv preprint arXiv:2502.02737 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[61] [61]

arXiv preprint arXiv:2405.09742 , year=

Random scaling and momentum for non-smooth non-convex optimization , author=. arXiv preprint arXiv:2405.09742 , year=

work page arXiv

[62] [62]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

A Stable Whitening Optimizer for Efficient Neural Network Training , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

[63] [63]

The Twelfth International Conference on Learning Representations , year=

Combining axes preconditioners through kronecker approximation for deep learning , author=. The Twelfth International Conference on Learning Representations , year=

[64] [64]

2024 , url =

Keller Jordan and Yuchen Jin and Vlado Boza and Jiacheng You and Franz Cesista and Laker Newhouse and Jeremy Bernstein , title =. 2024 , url =

2024

[65] [65]

International conference on machine learning , pages=

Optimizing neural networks with kronecker-factored approximate curvature , author=. International conference on machine learning , pages=. 2015 , organization=

2015

[66] [66]

SIAM Journal on Matrix Analysis and Applications , volume=

Backward stability of iterations for computing the polar decomposition , author=. SIAM Journal on Matrix Analysis and Applications , volume=. 2012 , publisher=

2012

[67] [67]

OPT 2024: Optimization for Machine Learning , year=

Old Optimizer, New Norm: An Anthology , author=. OPT 2024: Optimization for Machine Learning , year=

2024

[68] [68]

arXiv preprint arXiv:2405.16002 , year=

Does SGD really happen in tiny subspaces? , author=. arXiv preprint arXiv:2405.16002 , year=

work page arXiv

[69] [69]

Springer Science , volume=

Numerical optimization , author=. Springer Science , volume=

[70] [70]

SIAM review , volume=

Quasi-Newton methods, motivation and theory , author=. SIAM review , volume=. 1977 , publisher=

1977

[71] [71]

Mathematical programming , volume=

Cubic regularization of Newton method and its global performance , author=. Mathematical programming , volume=. 2006 , publisher=

2006

[72] [72]

International Conference on Learning Representations , year=

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes , author=. International Conference on Learning Representations , year=

[73] [73]

Forty-second International Conference on Machine Learning , year=

Structured Preconditioners in Adaptive Optimization: A Unified Analysis , author=. Forty-second International Conference on Machine Learning , year=

[74] [74]

, author=

Deep learning via hessian-free optimization. , author=. Icml , volume=

[75] [75]

General framework for online-to-nonconvex conversion: Schedule-free

Kwangjun Ahn and Gagik Magakyan and Ashok Cutkosky , booktitle=. General framework for online-to-nonconvex conversion: Schedule-free. 2025 , url=

2025

[76] [76]

International Conference on Machine Learning , pages=

Optimal stochastic non-smooth non-convex optimization through online-to-non-convex conversion , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023

[77] [77]

arXiv preprint arXiv:2002.09018 , year=

Scalable second order optimization for deep learning , author=. arXiv preprint arXiv:2002.09018 , year=

work page arXiv 2002

[78] [78]

Advances in neural information processing systems , volume=

Why transformers need adam: A hessian perspective , author=. Advances in neural information processing systems , volume=

[79] [79]

The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton , author=. arXiv preprint arXiv:2510.09378 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[80] [80]

, Wang, M

The sharpness disparity principle in transformers for accelerating language model pre-training , author=. arXiv preprint arXiv:2502.19002 , year=

work page arXiv