pith. machine review for the scientific record.

arxiv: 2409.11321 · v2 · submitted 2024-09-17 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links


SOAP: Improving and Stabilizing Shampoo using Adam

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:57 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords SOAP optimizer · Shampoo · Adam · preconditioning · eigenbasis · language model pretraining · large batch training · second moment update

The pith

SOAP runs Adam inside Shampoo's eigenbasis to cut large-batch iterations by over 40 percent versus AdamW.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a formal equivalence showing that half-power Shampoo is identical to running Adafactor inside the eigenbasis of its preconditioner. This equivalence motivates SOAP, which keeps the slowly changing eigenbasis from infrequent Shampoo steps but performs continuous Adam-style first- and second-moment updates inside that basis. The change removes most of the performance drop seen when eigendecompositions are simply spaced farther apart. On 360-million and 660-million parameter language models in the large-batch regime, SOAP requires more than 40 percent fewer iterations and more than 35 percent less wall-clock time than AdamW, and improves on standard Shampoo by roughly 20 percent in both metrics.
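
To make the equivalence concrete, here is a compressed one-step rendering of the identity in standard Shampoo notation. The accumulation and scalar-normalization details of the paper's derivation are omitted, so read this as a sketch of the claim rather than its full statement.

```latex
% Sketch: 1/2-power Shampoo rewritten in the preconditioner's eigenbasis.
% L and R are the usual Kronecker-factor statistics of the matrix gradient G.
\[
L=\textstyle\sum_t G_t G_t^{\top}=Q_L\Lambda_L Q_L^{\top},
\qquad
R=\textstyle\sum_t G_t^{\top}G_t=Q_R\Lambda_R Q_R^{\top}.
\]
\[
\underbrace{L^{-1/2}\,G\,R^{-1/2}}_{\text{Shampoo with the }1/2\text{ power}}
\;=\;
Q_L\bigl(\Lambda_L^{-1/2}\,\widetilde{G}\,\Lambda_R^{-1/2}\bigr)Q_R^{\top},
\qquad
\widetilde{G}:=Q_L^{\top}G\,Q_R.
\]
% In the rotated frame, entry (i,j) of \widetilde{G} is divided by
% \sqrt{\lambda_{L,i}}\,\sqrt{\lambda_{R,j}}: a row-times-column factored
% second-moment normalization of the Adafactor form, since the row and
% column second-moment sums of \widetilde{G} are exactly the eigenvalues
% of L and R. SOAP keeps the rotation but replaces this factored estimate
% with a full Adam-style running second moment in the same basis.
```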

Core claim

Shampoo implemented with the one-half power is equivalent to Adafactor executed in the eigenbasis of the Shampoo preconditioner. SOAP therefore maintains the current eigenbasis from infrequent full Shampoo steps and applies standard Adam moment averaging directly inside that coordinate frame, which prevents the degradation that occurs when the basis is held fixed without continued moment updates.
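
As an illustration of what that means operationally, here is a minimal single-matrix sketch of one such step, assuming exponential-moving-average preconditioner statistics and a bare update rule. The variable names (`precondition_frequency`, `beta3`) and the EMA form are our own simplifications, not the released implementation at https://github.com/nikhilvyas/SOAP.

```python
import numpy as np

def soap_step(W, grad, state, lr=3e-3, betas=(0.95, 0.95), beta3=0.95,
              eps=1e-8, precondition_frequency=10):
    """One SOAP-style update for a single 2-D parameter (illustrative sketch only).

    Adam's moments live in the eigenbasis (QL, QR) of the Shampoo factors
    L ~ E[G G^T] and R ~ E[G^T G]; the basis is recomputed only every
    `precondition_frequency` steps, which is the single extra hyperparameter.
    """
    b1, b2 = betas
    t = state["step"] = state.get("step", 0) + 1
    m_dim, n_dim = W.shape

    # Shampoo's Kronecker-factor statistics (kept here as simple EMAs).
    state["L"] = beta3 * state.get("L", np.zeros((m_dim, m_dim))) + (1 - beta3) * grad @ grad.T
    state["R"] = beta3 * state.get("R", np.zeros((n_dim, n_dim))) + (1 - beta3) * grad.T @ grad

    # Refresh the eigenbasis only occasionally; between refreshes it is frozen.
    if t == 1 or t % precondition_frequency == 0:
        _, state["QL"] = np.linalg.eigh(state["L"])
        _, state["QR"] = np.linalg.eigh(state["R"])
    QL, QR = state["QL"], state["QR"]

    # Rotate the gradient into the (slowly changing) eigenbasis.
    g_rot = QL.T @ grad @ QR

    # Standard Adam first- and second-moment updates on the rotated gradient.
    state["m"] = b1 * state.get("m", np.zeros_like(g_rot)) + (1 - b1) * g_rot
    state["v"] = b2 * state.get("v", np.zeros_like(g_rot)) + (1 - b2) * g_rot ** 2
    m_hat = state["m"] / (1 - b1 ** t)
    v_hat = state["v"] / (1 - b2 ** t)

    # Rotate the Adam update back to the original coordinates and apply it.
    update = QL @ (m_hat / (np.sqrt(v_hat) + eps)) @ QR.T
    return W - lr * update
```

Calling `W = soap_step(W, grad, state)` each iteration with a persistent `state` dict is enough to exercise the sketch; only the eigendecompositions are gated by the frequency knob, while the rotated Adam moments are updated at every step.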

What carries the argument

The eigenbasis of the Shampoo preconditioner, inside which Adam performs its first- and second-moment updates.

If this is right

  • SOAP adds only one new hyperparameter beyond Adam, the preconditioning frequency.
  • It delivers over 40 percent fewer iterations than AdamW and about 20 percent fewer than Shampoo in large-batch language-model pretraining.
  • Wall-clock time falls more than 35 percent versus AdamW and roughly 20 percent versus Shampoo on the tested 360M and 660M models.
  • The method inherits Adam's moment-update machinery, so it requires no new memory beyond what Shampoo already stores for its basis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same basis-rotation idea could be applied to other second-order or adaptive methods that maintain an approximate eigenframe.
  • Further reductions in memory traffic may be possible by updating only selected blocks of the basis rather than recomputing the full decomposition.
  • The approach invites direct comparison against other memory-efficient second-order methods on tasks beyond language modeling, such as vision transformers or reinforcement-learning policies.

Load-bearing premise

Holding the eigenbasis fixed for many steps does not materially degrade the quality of the preconditioning.

What would settle it

If recomputing the eigendecomposition at every step in SOAP produced a substantial further gain over the infrequent version, or if loss curves diverged sharply once the basis update interval exceeded the values tested, the efficiency and stability claims would be falsified.
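
A concrete way to run that test is sketched below; `train` is a hypothetical stand-in for a training harness returning a final validation loss, and the subspace-overlap measure is our own choice of drift diagnostic rather than anything specified in the paper.

```python
import numpy as np

def subspace_overlap(Q_prev, Q_new, k=32):
    """Overlap of the leading-k eigenvector subspaces of two successive bases.

    Columns are assumed sorted by decreasing eigenvalue. Returns a value in
    [0, 1]; 1.0 means the retained subspace has not moved, while values well
    below 1 indicate the frozen basis has drifted from the current preconditioner.
    """
    M = Q_prev[:, :k].T @ Q_new[:, :k]
    return float(np.linalg.norm(M, "fro") ** 2) / k

def frequency_ablation(train, frequencies=(1, 10, 50, 200, 1000)):
    """Sweep the preconditioning frequency with everything else held fixed.

    `train(precondition_frequency=f)` is a hypothetical callable that trains
    one model at frequency f and returns its final validation loss.
    """
    return {f: train(precondition_frequency=f) for f in frequencies}

# The claims would be in trouble if the frequency-1 run were markedly better
# than the infrequent settings, or if losses blew up just past the tested
# interval while the measured subspace overlap collapsed.
```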

read the original abstract

There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks include additional hyperparameters and computational overhead when compared to Adam, which only updates running averages of first- and second-moment quantities. This work establishes a formal connection between Shampoo (implemented with the 1/2 power) and Adafactor -- a memory-efficient approximation of Adam -- showing that Shampoo is equivalent to running Adafactor in the eigenbasis of Shampoo's preconditioner. This insight leads to the design of a simpler and computationally efficient algorithm: $\textbf{S}$hampo$\textbf{O}$ with $\textbf{A}$dam in the $\textbf{P}$reconditioner's eigenbasis (SOAP). With regards to improving Shampoo's computational efficiency, the most straightforward approach would be to simply compute Shampoo's eigendecomposition less frequently. Unfortunately, as our empirical results show, this leads to performance degradation that worsens with this frequency. SOAP mitigates this degradation by continually updating the running average of the second moment, just as Adam does, but in the current (slowly changing) coordinate basis. Furthermore, since SOAP is equivalent to running Adam in a rotated space, it introduces only one additional hyperparameter (the preconditioning frequency) compared to Adam. We empirically evaluate SOAP on language model pre-training with 360m and 660m sized models. In the large batch regime, SOAP reduces the number of iterations by over 40% and wall clock time by over 35% compared to AdamW, with approximately 20% improvements in both metrics compared to Shampoo. An implementation of SOAP is available at https://github.com/nikhilvyas/SOAP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript establishes a formal algebraic connection between 1/2-power Shampoo and Adafactor, showing that Shampoo is equivalent to running Adafactor in the eigenbasis of its preconditioner. It proposes SOAP, which applies Adam-style updates to the second-moment statistics while holding the eigenbasis fixed and updating it only at a tunable preconditioning frequency. Empirical evaluation on 360M and 660M language-model pre-training shows SOAP reducing iterations by over 40% and wall-clock time by over 35% versus AdamW (and ~20% versus Shampoo) in the large-batch regime.

Significance. If the results hold, the work is significant because it supplies a practical bridge between higher-order preconditioners and Adam-family methods, introducing only one new hyperparameter while retaining most of Shampoo's benefits and lowering memory/compute overhead. The algebraic identity, public code, and concrete large-batch gains constitute clear strengths for the optimization literature.

major comments (2)
  1. [Derivation of the Shampoo–Adafactor equivalence (likely §3)] The formal equivalence between 1/2-power Shampoo and Adafactor is exact only inside the instantaneous eigenbasis. No analytic bound is supplied on how quickly the basis may drift before the rotated Adam update loses its advantage when the preconditioner is held fixed for many steps (the central design choice controlled by the preconditioning-frequency hyperparameter). This assumption is load-bearing for the claim that SOAP stabilizes Shampoo without degradation.
  2. [Experimental evaluation (§4)] The reported performance gains (40% iteration reduction, 35% wall-clock reduction) rest on single runs for the 360M and 660M models. Absence of variance across random seeds and lack of ablations at larger scales make it difficult to assess whether the gains are robust or sensitive to the specific eigenbasis-drift regime encountered in these experiments.
minor comments (1)
  1. [Abstract] The abstract states “approximately 20% improvements” without naming the exact metrics or the preconditioning frequency used in the headline experiments; adding these numbers would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive recommendation and constructive comments. We address each major point below and describe the corresponding revisions.

read point-by-point responses
  1. Referee: The formal equivalence between 1/2-power Shampoo and Adafactor is exact only inside the instantaneous eigenbasis. No analytic bound is supplied on how quickly the basis may drift before the rotated Adam update loses its advantage when the preconditioner is held fixed for many steps (the central design choice controlled by the preconditioning-frequency hyperparameter). This assumption is load-bearing for the claim that SOAP stabilizes Shampoo without degradation.

    Authors: We agree that the algebraic equivalence holds exactly only for the instantaneous eigenbasis and that the manuscript provides no analytic bound on eigenbasis drift or its effect on the Adam-style update when the basis is held fixed. In the revised manuscript we will add a short clarifying paragraph in Section 3 that explicitly states the instantaneous character of the identity and notes that the preconditioning frequency is selected empirically. We will also reference the existing frequency ablations to show that the chosen values maintain performance without visible degradation in the reported regimes. revision: partial

  2. Referee: The reported performance gains (40% iteration reduction, 35% wall-clock reduction) rest on single runs for the 360M and 660M models. Absence of variance across random seeds and lack of ablations at larger scales make it difficult to assess whether the gains are robust or sensitive to the specific eigenbasis-drift regime encountered in these experiments.

    Authors: We acknowledge that the main results are shown from single runs and that this limits assessment of variance and robustness. In the revision we will rerun the primary 360M and 660M experiments with three random seeds, report means and standard deviations, and add a brief discussion of sensitivity to the preconditioning frequency. We will also expand the existing frequency ablations to include more points that probe different drift regimes. Experiments at substantially larger scales were outside our current compute allocation; we will note this limitation explicitly while emphasizing that the code release enables such follow-up work. revision: yes

Circularity Check

0 steps flagged

Algebraic equivalence is a mathematical identity; no circular reduction in derivation

full rationale

The paper's central step is deriving that 1/2-power Shampoo equals Adafactor run in the preconditioner's eigenbasis, presented as an algebraic identity between two prior algorithms. SOAP is then defined directly from this identity by updating the second-moment average in the (slowly changing) basis while holding the eigenbasis fixed for a chosen frequency. This frequency is introduced as a single practical hyperparameter, not fitted to any target metric or performance outcome. All reported gains (iteration and wall-clock reductions vs. AdamW and Shampoo) are empirical results on 360M/660M models, not outputs of the derivation itself. No self-citation is load-bearing, no parameter is renamed as a prediction, and no ansatz or uniqueness claim reduces to prior author work. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper rests on one new practical hyperparameter (preconditioning frequency) and the algebraic identity that 1/2-power Shampoo equals Adafactor inside the eigenbasis; no new physical entities or ad-hoc constants are introduced.

free parameters (1)
  • preconditioning frequency
    How often the eigenbasis of the Shampoo preconditioner is recomputed; introduced as the sole additional hyperparameter relative to Adam.
axioms (1)
  • domain assumption: Shampoo implemented with the matrix 1/2-power is algebraically equivalent to running Adafactor inside the eigenbasis of that preconditioner.
    This identity is the load-bearing derivation that justifies running Adam updates in the slowly changing coordinate system.

pith-pipeline@v0.9.0 · 5646 in / 1318 out tokens · 26024 ms · 2026-05-15T00:57:19.167991+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A meshfree exterior calculus for generalizable and data-efficient learning of physics from point clouds

    cs.LG 2026-05 unverdicted novelty 8.0

    MEEC equips point clouds with a discrete exterior calculus that satisfies exact conservation and is differentiable in point positions, allowing a single trained kernel to produce compatible physics on unseen geometrie...

  2. Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

    cs.LG 2026-05 unverdicted novelty 7.0

    Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).

  3. Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration

    cs.LG 2026-05 unverdicted novelty 7.0

    Partial orthogonalization from power iteration accelerates zeroth-order Muon by 1.5x-4x on LLM fine-tuning tasks while maintaining competitive accuracy.

  4. Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms

    cs.LG 2026-05 unverdicted novelty 7.0

    Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.

  5. When Descent Is Too Stable: Event-Triggered Hamiltonian Learning to Optimize

    cs.LG 2026-05 unverdicted novelty 7.0

    SHAPE lifts gradient descent to an augmented phase space with a learned Hamiltonian vector field and event-triggered port updates to balance descent, exploitation, and exploration, improving best-so-far performance ov...

  6. A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muon

    cs.LG 2026-04 unverdicted novelty 7.0

    A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.

  7. Hard-constrained Physics-informed Neural Networks for Interface Problems

    math.NA 2026-04 conditional novelty 7.0

    Hard-constrained PINN formulations via windowing and buffer approaches enforce interface conditions by design and outperform soft-constrained baselines on 1D and 2D elliptic interface problems.

  8. Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory

    cs.LG 2026-03 unverdicted novelty 7.0

    Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing f...

  9. Toward AI-Driven Digital Twins for Metropolitan Floods: A Conditional Latent Dynamics Network Surrogate of the Shallow Water Equations

    cs.LG 2026-05 unverdicted novelty 6.0

    CLDNet is a conditional latent dynamics network surrogate for the shallow water equations that delivers 115x faster 96-hour flood forecasts on irregular metropolitan basins while maintaining usable accuracy against ga...

  10. GRAFT-ATHENA: Self-Improving Agentic Teams for Autonomous Discovery and Evolutionary Numerical Algorithms

    cs.LG 2026-05 unverdicted novelty 6.0

    GRAFT-ATHENA projects combinatorial method choices into factored trees that embed as fingerprints in a metric space, enabling an agentic system to accumulate experience across domains and autonomously discover new num...

  11. OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

    cs.LG 2026-05 unverdicted novelty 6.0

    OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training lo...

  12. Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

    cs.LG 2026-05 unverdicted novelty 6.0

    Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.

  13. Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization

    cs.LG 2026-05 unverdicted novelty 6.0

    Pro-KLShampoo projects KL-Shampoo preconditioners to a spike-and-flat parametric form on an r-dimensional subspace and recovers the full algebraic preconditioner via orthogonalization, outperforming KL-Shampoo on GPT-...

  14. Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio

    cs.LG 2026-05 unverdicted novelty 6.0

    MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.

  15. Large-eddy simulation nets (LESnets) based on physics-informed neural operator for wall-bounded turbulence

    physics.flu-dyn 2026-04 unverdicted novelty 6.0

    LESnets integrates LES equations and the law of the wall into F-FNO to enable data-free, stable long-term predictions of wall-bounded turbulence at Re_tau up to 1000 on coarse grids, matching traditional LES accuracy ...

  16. When PINNs Go Wrong: Pseudo-Time Stepping Against Spurious Solutions

    cs.LG 2026-04 conditional novelty 6.0

    PINNs fail on spurious solutions admitted by the residual loss; adaptive pseudo-time stepping with Jacobian-based step selection improves accuracy and robustness on PDE benchmarks.

  17. $\phi-$DeepONet: A Discontinuity Capturing Neural Operator

    cs.CE 2026-04 unverdicted novelty 6.0

    φ-DeepONet learns mappings with discontinuities in inputs and outputs by combining multiple branch networks with a nonlinear interface embedding in the trunk, trained via physics- and interface-informed loss, and show...

  18. MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

    cs.LG 2026-03 unverdicted novelty 6.0

    MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

  19. Spectral Condition for $\mu$P under Width-Depth Scaling

    cs.LG 2026-02 unverdicted novelty 6.0

    A unified spectral condition for μP under width-depth scaling reveals a transition at k=1 vs k≥2 transformations per residual block and enables stable feature learning for practical architectures like Transformers.

  20. Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

    cs.LG 2026-05 unverdicted novelty 5.0

    Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.

  21. Curvature-Aware Optimization for High-Accuracy Physics-Informed Neural Networks

    cs.LG 2026-04 unverdicted novelty 4.0

    Curvature-aware optimizers such as natural gradient and self-scaling BFGS/Broyden accelerate PINN convergence and accuracy on PDEs including Helmholtz, Stokes, Burgers, and Euler equations plus stiff ODEs, with new mo...

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · cited by 21 Pith papers · 1 internal anchor
