The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

Depen Morwani; Natalie Abreu; Nikhil Vyas; Sham Kakade

arxiv: 2510.09378 · v2 · submitted 2025-10-10 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-18 08:16 UTC · model grok-4.3

keywords: Gauss-Newton · second-order optimization · LLM pretraining · transformer optimization · preconditioning · Hessian approximation · layerwise methods

The pith

Full Gauss-Newton preconditioning cuts training iterations by a factor of 5.4 relative to strong baselines such as SOAP and Muon, on transformers of up to 150M parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies full Gauss-Newton preconditioning to transformer models up to 150 million parameters to quantify the performance left on the table by current approximations to second-order optimization. Full GN updates achieve a 5.4x reduction in training iterations compared to baselines such as SOAP and Muon. A layerwise GN preconditioner that drops all cross-layer information performs nearly as well as the full version. The results indicate that the Gauss-Newton approximation itself is highly effective and that most of the benefit comes from within-layer curvature information.

Core claim

Applying full Gauss-Newton preconditioning to transformers up to 150M parameters produces a 5.4x reduction in the number of training iterations relative to strong first-order and approximate second-order optimizers. A precise layerwise GN method nearly matches the performance of the full cross-layer version. These experiments establish a practical upper bound on iteration complexity and show that current methods sit well below an idealized layerwise oracle.

What carries the argument

The argument rests on the full Gauss-Newton preconditioner, which rescales gradient steps during training using the Gauss-Newton approximation to the Hessian, computed exactly rather than through a cheaper surrogate.
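For reference, the textbook form of a damped Gauss-Newton step is sketched below. This is the generic definition, not a transcription from the paper. The notation is assumed here: z = f(θ) for the model outputs, J for the Jacobian of z with respect to θ, H_L for the Hessian of the loss with respect to z, λ for a damping constant, and η for the step size.

```latex
\begin{aligned}
\nabla^2_\theta \mathcal{L}(\theta)
  &= J^\top H_{\mathcal{L}}\, J
   + \sum_k \frac{\partial \mathcal{L}}{\partial z_k}\, \nabla^2_\theta z_k, \\
G &= J^\top H_{\mathcal{L}}\, J, \\
\theta_{t+1} &= \theta_t - \eta\, (G + \lambda I)^{-1}\, \nabla_\theta \mathcal{L}(\theta_t).
\end{aligned}
```

The Gauss-Newton matrix G keeps the first term of the exact Hessian and drops the second, which carries the curvature of the model outputs themselves. G is positive semidefinite whenever H_L is, which is the case for convex output losses such as cross-entropy on logits.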

If this is right

The Gauss-Newton approximation is highly effective, so higher-order loss terms may not be critical for convergence speed.
Layerwise Hessian structure contains sufficient information to capture most of the observed gains.
A sizable performance gap remains between existing approximate methods and an idealized layerwise oracle.
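The difference between full and layerwise preconditioning in the points above can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a squared-error loss (so the Gauss-Newton matrix is simply JᵀJ), uses a made-up Jacobian and residual, and treats parameters 0-1 and 2-3 as two hypothetical "layers". The helper names `solve`, `gauss_newton_matrix`, and `gn_direction` are invented for this illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def gauss_newton_matrix(jac):
    """G = J^T J, valid for a squared-error loss (output Hessian is the identity)."""
    n = len(jac[0])
    return [[sum(row[i] * row[j] for row in jac) for j in range(n)] for i in range(n)]


def gn_direction(g_mat, grad, damping, blocks=None):
    """Solve (G + damping * I) d = grad.

    With blocks=None the full matrix is used. Otherwise each index block is
    solved on its own, which discards all cross-block (cross-layer) entries.
    """
    n = len(grad)
    if blocks is None:
        blocks = [list(range(n))]
    d = [0.0] * n
    for idx in blocks:
        sub = [[g_mat[i][j] + (damping if i == j else 0.0) for j in idx] for i in idx]
        sol = solve(sub, [grad[i] for i in idx])
        for k, i in enumerate(idx):
            d[i] = sol[k]
    return d


# Hypothetical toy data: 3 residuals, 4 parameters split into two "layers".
jac = [[1.0, 0.5, 0.2, 0.0],
       [0.0, 1.0, 0.1, 0.3],
       [0.4, 0.0, 1.0, 0.5]]
resid = [0.5, -1.0, 0.25]
grad = [sum(jac[k][i] * resid[k] for k in range(3)) for i in range(4)]  # J^T r
G = gauss_newton_matrix(jac)

full = gn_direction(G, grad, damping=0.1)
layerwise = gn_direction(G, grad, damping=0.1, blocks=[[0, 1], [2, 3]])
```

The two directions differ here because the toy G has nonzero cross-layer entries. The layerwise variant only ever solves systems the size of one block, which is what makes it the more plausible target for scalable approximations.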

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Efficient layerwise GN approximations could be scaled to models far larger than 150M parameters, where full GN computation is infeasible.
If the iteration savings persist at scale, pretraining compute budgets could shrink substantially.
Direct comparisons of full GN against other second-order families such as KFAC on identical model sizes would clarify relative strengths.

Load-bearing premise

Iteration-complexity improvements seen on models up to 150M parameters will continue to hold at larger scales and across different data distributions.

What would settle it

Training a model with more than one billion parameters using full GN and measuring whether the iteration reduction factor stays near 5x against the same baselines.

Original abstract

Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. This paper uses full Gauss-Newton preconditioning to establish an upper bound on the iteration complexity achievable by second-order optimization methods for training large language models. Through experiments on transformer models with up to 150 million parameters, it demonstrates that full GN updates achieve a 5.4x reduction in training iterations relative to strong baselines including SOAP and Muon. The study also shows that a layerwise version of GN, which does not account for cross-layer curvature, performs comparably to the full method. The authors take these findings as evidence that the GN approximation is effective and that layerwise Hessian information is sufficient for most of the performance gains.

Significance. Should the empirical results prove robust and generalizable, this manuscript would offer a significant contribution by quantifying the potential benefits of second-order methods in LLM training and identifying that much of the gain can be retained with layerwise approximations. This could guide the development of more efficient optimizers. The direct measurement of iteration reductions, without reliance on circular definitions, strengthens the empirical foundation. Nevertheless, the restriction to models no larger than 150M parameters limits the immediate applicability to current LLM scales, where different dynamics may prevail.

major comments (2)

Abstract, experimental results paragraph: The 5.4x reduction in training iterations is reported for models up to 150M parameters, but the manuscript lacks any scaling analysis or experiments at larger scales (e.g., 1B+ parameters). This is a load-bearing issue for the claim that the results provide a practical upper bound relevant to LLM pretraining, as curvature and data effects may change with model size and depth.
Abstract: The experimental results are presented without details on the implementation of baselines (SOAP, Muon), hyperparameter optimization procedures, or error bars from multiple runs. These omissions hinder a full assessment of the reliability and reproducibility of the reported quantitative gains.

minor comments (1)

Abstract: Consider adding a brief mention of the model sizes and training setup in the abstract to provide immediate context for the scale of the experiments.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to the manuscript.

Referee: Abstract, experimental results paragraph: The 5.4x reduction in training iterations is reported for models up to 150M parameters, but the manuscript lacks any scaling analysis or experiments at larger scales (e.g., 1B+ parameters). This is a load-bearing issue for the claim that the results provide a practical upper bound relevant to LLM pretraining, as curvature and data effects may change with model size and depth.

Authors: We acknowledge that the absence of scaling experiments beyond 150M parameters limits direct extrapolation to current LLM scales. Full Gauss-Newton preconditioning is computationally intractable at 1B+ parameters due to the prohibitive memory and compute costs associated with forming and inverting the curvature matrix. Our study is explicitly designed to measure an upper bound in the largest regime where exact full GN is feasible. We will revise the manuscript to add an explicit limitations paragraph discussing this constraint, potential changes in curvature with scale, and related work on optimization dynamics at larger sizes.
Revision: partial

Referee: Abstract: The experimental results are presented without details on the implementation of baselines (SOAP, Muon), hyperparameter optimization procedures, or error bars from multiple runs. These omissions hinder a full assessment of the reliability and reproducibility of the reported quantitative gains.

Authors: We will expand the manuscript to include additional details on the precise implementation of the SOAP and Muon baselines, the hyperparameter optimization procedures employed, and error bars or standard deviations computed over multiple independent runs. These elements will be incorporated into the experimental section and referenced in the abstract to improve clarity and reproducibility.
Revision: yes

Circularity Check

0 steps flagged

No circularity: results are direct empirical measurements of iteration counts

The paper reports experimental outcomes from running full Gauss-Newton preconditioning on transformer models up to 150M parameters and measuring the number of iterations needed to reach target losses, then comparing those counts against SOAP and Muon baselines. These iteration counts are obtained from actual training runs rather than derived from any equation or parameter fit that presupposes the reported speedup. No self-definitional step, fitted-input-as-prediction, or self-citation chain appears in the abstract or described results; the layerwise-vs-full comparison is likewise an empirical observation. The suggestions about LLM-scale potential are presented as extrapolations from the measured upper bound, not as mathematically forced conclusions.
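A minimal sketch of the kind of measurement described above, assuming the reduction factor is the ratio of iterations each optimizer needs to first reach a shared target loss. The paper's exact protocol (choice of target, how steps are counted) is not given on this page, so the function names and the loss curves below are illustrative assumptions, not values from the paper.

```python
def iterations_to_target(losses, target):
    """Return the 1-based iteration at which loss first reaches target, or None."""
    for step, loss in enumerate(losses, start=1):
        if loss <= target:
            return step
    return None


def reduction_factor(baseline_losses, candidate_losses, target):
    """Ratio of baseline to candidate iterations needed to reach the same target."""
    base = iterations_to_target(baseline_losses, target)
    cand = iterations_to_target(candidate_losses, target)
    if base is None or cand is None:
        return None  # at least one run never reached the target loss
    return base / cand


# Made-up loss curves for illustration only.
baseline = [4.0, 3.5, 3.2, 3.0, 2.9, 2.8]
candidate = [4.0, 3.1, 2.8]
print(reduction_factor(baseline, candidate, target=2.8))  # prints 2.0
```

Because the ratio depends on the chosen target loss, a single headline factor is best read alongside the target it was measured at.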

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is empirical and does not introduce new theoretical axioms, free parameters fitted to the target metric, or invented entities.

pith-pipeline@v0.9.0 · 5721 in / 960 out tokens · 46821 ms · 2026-05-18T08:16:36.584816+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters... 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Non-Convex Self-Concordant Functions: Practical Algorithms and Complexity Analysis
math.OC · 2025-11 · unverdicted · novelty 7.0

Non-convex self-concordant functions enable regularized Newton and adaptive algorithms to achieve epsilon-approximate first-order stationary points in O(epsilon^{-2}) iterations with global convergence guarantees.

Error whitening: Why Gauss-Newton outperforms Newton
cs.LG · 2026-05 · conditional · novelty 6.0

Gauss-Newton descent whitens errors by projecting Newton directions or gradients onto the tangent space, replacing JJ^T with the identity and removing parameterization distortions that affect Newton descent.

Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models
cs.CV · 2026-05 · unverdicted · novelty 5.0

Introduces ML-FOP-SOAP optimizer using Fisher-Orthogonal Projection and hierarchical folding to mitigate modality competition in multimodal autoregressive training, reporting gains over AdamW on Janus and Emu3.

RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization
cs.LG · 2026-03 · conditional · novelty 5.0

RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.