The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton
Pith reviewed 2026-05-18 08:16 UTC · model grok-4.3
The pith
Full Gauss-Newton preconditioning cuts LLM training iterations by a factor of 5.4 relative to strong baselines such as SOAP and Muon.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying full Gauss-Newton preconditioning to transformers up to 150M parameters produces a 5.4x reduction in the number of training iterations relative to strong first-order and approximate second-order optimizers. A precise layerwise GN method nearly matches the performance of the full cross-layer version. These experiments establish a practical upper bound on iteration complexity and show that current methods sit well below an idealized layerwise oracle.
What carries the argument
Full Gauss-Newton preconditioner, which uses the exact Gauss-Newton approximation to the Hessian to rescale gradient steps during training.
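To make the idea of a Gauss-Newton preconditioned step concrete, here is a minimal sketch on a toy problem. It is not the paper's implementation: the tiny tanh model, the squared loss, the damping term, and all function names are assumptions chosen for illustration. For squared loss the Gauss-Newton matrix is J^T J, where J is the Jacobian of the model outputs with respect to the parameters, and the update solves a damped linear system against the gradient.

```python
import numpy as np

def model(theta, X):
    # Hypothetical toy model: one tanh unit applied to each example.
    return np.tanh(X @ theta)

def jacobian(theta, X):
    # Derivative of the model outputs w.r.t. theta, shape (n_examples, n_params).
    z = X @ theta
    return (1.0 - np.tanh(z) ** 2)[:, None] * X

def loss(theta, X, y):
    r = model(theta, X) - y
    return 0.5 * np.mean(r ** 2)

def gn_step(theta, X, y, damping=1e-3, lr=1.0):
    # One damped Gauss-Newton step: theta <- theta - lr * (G + damping*I)^{-1} grad.
    n = X.shape[0]
    r = model(theta, X) - y
    J = jacobian(theta, X)
    grad = J.T @ r / n          # gradient of the mean squared loss
    G = J.T @ J / n             # Gauss-Newton matrix for squared loss
    step = np.linalg.solve(G + damping * np.eye(theta.size), grad)
    return theta - lr * step

# Synthetic data (assumption: values chosen only for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
theta_true = np.array([0.5, -0.3, 0.2, 0.1])
y = model(theta_true, X)

theta = np.zeros(4)
initial_loss = loss(theta, X, y)
for _ in range(10):
    theta = gn_step(theta, X, y)
final_loss = loss(theta, X, y)
print(initial_loss, final_loss)
```

For a transformer trained with cross-entropy, the analogous matrix would also involve the Hessian of the loss with respect to the logits; the squared-loss case is used here only because it keeps the sketch short.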
If this is right
- The Gauss-Newton approximation is highly effective, so higher-order loss terms may not be critical for convergence speed.
- Layerwise Hessian structure contains sufficient information to capture most of the observed gains.
- A sizable performance gap remains between existing approximate methods and an idealized layerwise oracle.
Where Pith is reading between the lines
- Efficient layerwise GN approximations could be scaled to models far larger than 150M parameters where full computation is impossible.
- If the iteration savings persist at scale, pretraining compute budgets could shrink substantially.
- Direct comparisons of full GN against other second-order families such as KFAC on identical model sizes would clarify relative strengths.
Load-bearing premise
Iteration-complexity improvements seen on models up to 150M parameters will continue to hold at larger scales and across different data distributions.
What would settle it
Training a model with more than one billion parameters using full GN and measuring whether the iteration reduction factor stays near 5x against the same baselines.
Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle.
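The distinction between the full and the layerwise preconditioner can be sketched as follows. This is an illustrative assumption about the structure, not the authors' code: the layerwise variant is modeled as keeping only the diagonal blocks of the curvature matrix (one block per layer) and solving each block independently. The matrix, gradient, layer sizes, and function names are all hypothetical.

```python
import numpy as np

def full_gn_direction(G, grad, damping=1e-3):
    # Solve (G + damping*I) d = grad using the whole curvature matrix.
    return np.linalg.solve(G + damping * np.eye(G.shape[0]), grad)

def layerwise_gn_direction(G, grad, layer_sizes, damping=1e-3):
    # Solve one damped system per layer, using only that layer's diagonal block.
    direction = np.zeros_like(grad)
    start = 0
    for size in layer_sizes:
        block = slice(start, start + size)
        G_layer = G[block, block] + damping * np.eye(size)
        direction[block] = np.linalg.solve(G_layer, grad[block])
        start += size
    return direction

def zero_cross_layer_blocks(G, layer_sizes):
    # Copy of G with every cross-layer (off-diagonal) block set to zero.
    G_block = np.zeros_like(G)
    start = 0
    for size in layer_sizes:
        block = slice(start, start + size)
        G_block[block, block] = G[block, block]
        start += size
    return G_block

# Synthetic curvature and gradient for two "layers" of 3 and 2 parameters.
rng = np.random.default_rng(0)
layer_sizes = [3, 2]
J = rng.normal(size=(20, 5))
G = J.T @ J / 20
grad = rng.normal(size=5)

d_full = full_gn_direction(G, grad)
d_layer = layerwise_gn_direction(G, grad, layer_sizes)
d_check = full_gn_direction(zero_cross_layer_blocks(G, layer_sizes), grad)
print(np.allclose(d_layer, d_check))
```

The per-layer solves give the same direction as a full solve on the matrix with its cross-layer blocks zeroed, which is one way to read "ignores cross-layer information". The two directions `d_full` and `d_layer` generally differ; the paper's finding is that this difference costs little in practice.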
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper explores the use of full Gauss-Newton preconditioning as a way to establish an upper bound on the iteration complexity achievable by second-order optimization methods for training large language models. Through experiments on transformer models with up to 150 million parameters, it demonstrates that full GN updates achieve a 5.4 times reduction in training iterations relative to strong baselines including SOAP and Muon. The study also shows that a layerwise version of GN, which does not account for cross-layer curvature, performs comparably to the full method. The authors suggest that these findings indicate the effectiveness of the GN approximation and the sufficiency of layerwise Hessian information for most performance gains.
Significance. Should the empirical results prove robust and generalizable, this manuscript would offer a significant contribution by quantifying the potential benefits of second-order methods in LLM training and identifying that much of the gain can be retained with layerwise approximations. This could guide the development of more efficient optimizers. The direct measurement of iteration reductions, without reliance on circular definitions, strengthens the empirical foundation. Nevertheless, the restriction to models no larger than 150M parameters limits the immediate applicability to current LLM scales, where different dynamics may prevail.
major comments (2)
- Abstract, experimental results paragraph: The 5.4x reduction in training iterations is reported for models up to 150M parameters, but the manuscript lacks any scaling analysis or experiments at larger scales (e.g., 1B+ parameters). This is a load-bearing issue for the claim that the results provide a practical upper bound relevant to LLM pretraining, as curvature and data effects may change with model size and depth.
- Abstract: The experimental results are presented without details on the implementation of baselines (SOAP, Muon), hyperparameter optimization procedures, or error bars from multiple runs. These omissions hinder a full assessment of the reliability and reproducibility of the reported quantitative gains.
minor comments (1)
- Abstract: Consider adding a brief mention of the model sizes and training setup in the abstract to provide immediate context for the scale of the experiments.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to the manuscript.
- Referee: Abstract, experimental results paragraph: The 5.4x reduction in training iterations is reported for models up to 150M parameters, but the manuscript lacks any scaling analysis or experiments at larger scales (e.g., 1B+ parameters). This is a load-bearing issue for the claim that the results provide a practical upper bound relevant to LLM pretraining, as curvature and data effects may change with model size and depth.
Authors: We acknowledge that the absence of scaling experiments beyond 150M parameters limits direct extrapolation to current LLM scales. Full Gauss-Newton preconditioning is computationally intractable at 1B+ parameters due to the prohibitive memory and compute costs associated with forming and inverting the curvature matrix. Our study is explicitly designed to measure an upper bound in the largest regime where exact full GN is feasible. We will revise the manuscript to add an explicit limitations paragraph discussing this constraint, potential changes in curvature with scale, and related work on optimization dynamics at larger sizes. revision: partial
- Referee: Abstract: The experimental results are presented without details on the implementation of baselines (SOAP, Muon), hyperparameter optimization procedures, or error bars from multiple runs. These omissions hinder a full assessment of the reliability and reproducibility of the reported quantitative gains.
Authors: We will expand the manuscript to include additional details on the precise implementation of the SOAP and Muon baselines, the hyperparameter optimization procedures employed, and error bars or standard deviations computed over multiple independent runs. These elements will be incorporated into the experimental section and referenced in the abstract to improve clarity and reproducibility. revision: yes
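The memory argument in the first response can be put in rough numbers. The sketch below is a back-of-the-envelope calculation under stated assumptions: the curvature matrix is stored densely, one entry per parameter pair, at 4 bytes per entry (fp32). The text here does not say how the authors actually compute the preconditioner, so this only bounds the naive explicit-matrix route; the function name is hypothetical.

```python
def dense_matrix_bytes(num_params, bytes_per_entry=4):
    # Storage for an explicit num_params x num_params matrix (assumed fp32).
    return num_params * num_params * bytes_per_entry

PETABYTE = 10 ** 15  # decimal petabyte

for num_params in (150_000_000, 1_000_000_000):
    size_pb = dense_matrix_bytes(num_params) / PETABYTE
    print(f"{num_params:>13,d} params -> {size_pb:,.0f} PB")
```

Under these assumptions the dense matrix already needs about 90 PB at 150M parameters and about 4,000 PB at 1B, so an explicitly materialized matrix is out of reach at either scale and any feasible method has to avoid forming it.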
Circularity Check
No circularity: results are direct empirical measurements of iteration counts
The paper reports experimental outcomes from running full Gauss-Newton preconditioning on transformer models up to 150M parameters and measuring the number of iterations needed to reach target losses, then comparing those counts against SOAP and Muon baselines. These iteration counts are obtained from actual training runs rather than derived from any equation or parameter fit that presupposes the reported speedup. No self-definitional step, fitted-input-as-prediction, or self-citation chain appears in the abstract or described results; the layerwise-vs-full comparison is likewise an empirical observation. The suggestions about LLM-scale potential are presented as extrapolations from the measured upper bound, not as mathematically forced conclusions.
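The measurement described above, counting iterations until a target loss is reached and comparing those counts, can be sketched as follows. The loss curves are made-up numbers for illustration, not results from the paper, and the function names are hypothetical.

```python
def iterations_to_target(losses, target):
    # First 1-indexed iteration whose loss is at or below the target, else None.
    for step, value in enumerate(losses, start=1):
        if value <= target:
            return step
    return None

def reduction_factor(baseline_losses, candidate_losses, target):
    # Ratio of baseline iterations to candidate iterations at the same target loss.
    baseline_steps = iterations_to_target(baseline_losses, target)
    candidate_steps = iterations_to_target(candidate_losses, target)
    if baseline_steps is None or candidate_steps is None:
        return None  # at least one run never reached the target
    return baseline_steps / candidate_steps

# Synthetic loss curves (assumption: illustrative values only).
baseline = [5.0, 4.0, 3.5, 3.2, 3.1, 3.0, 2.9, 2.8]
candidate = [5.0, 3.0, 2.8]

print(reduction_factor(baseline, candidate, target=3.0))  # 6 / 2 -> 3.0
```

Because the factor depends on the chosen target loss, a report of this kind is easiest to interpret when the target and the number of runs behind each curve are stated alongside it.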
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters... 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
- Non-Convex Self-Concordant Functions: Practical Algorithms and Complexity Analysis
Non-convex self-concordant functions enable regularized Newton and adaptive algorithms to achieve epsilon-approximate first-order stationary points in O(epsilon^{-2}) iterations with global convergence guarantees.
- Error whitening: Why Gauss-Newton outperforms Newton
Gauss-Newton descent whitens errors by projecting Newton directions or gradients onto the tangent space, replacing JJ^T with the identity and removing parameterization distortions that affect Newton descent.
- Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models
Introduces ML-FOP-SOAP optimizer using Fisher-Orthogonal Projection and hierarchical folding to mitigate modality competition in multimodal autoregressive training, reporting gains over AdamW on Janus and Emu3.
- RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization
RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.