The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

· 2025 · cs.LG · arXiv 2510.09378

4 Pith papers cite this work. Polarity classification is still indexing.



abstract

Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle.
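For readers unfamiliar with the term, the following is a minimal sketch of what a Gauss-Newton preconditioned update looks like on a toy two-parameter least-squares problem. It is not the paper's setup: the exponential model, the data, the damping schedule, and the accept/reject rule are all illustrative assumptions, and every function name here is hypothetical. With a squared loss the GN matrix is J^T J, where J is the Jacobian of the model outputs with respect to the parameters; the update solves (J^T J + damping * I) delta = gradient instead of stepping along the raw gradient.

```python
import math

def model(theta, x):
    # Hypothetical toy model: f(x) = a * exp(b * x)
    a, b = theta
    return a * math.exp(b * x)

def loss(theta, xs, ys):
    # Squared-error loss: 0.5 * sum of squared residuals
    return 0.5 * sum((model(theta, x) - y) ** 2 for x, y in zip(xs, ys))

def gn_step(theta, xs, ys, damping):
    # One damped Gauss-Newton step: theta - (J^T J + damping*I)^{-1} J^T r
    a, b = theta
    g0 = g1 = 0.0          # gradient J^T r
    G00 = G01 = G11 = 0.0  # Gauss-Newton matrix J^T J
    for x, y in zip(xs, ys):
        e = math.exp(b * x)
        r = a * e - y              # residual
        j0, j1 = e, a * x * e      # d f / d a, d f / d b
        g0 += j0 * r
        g1 += j1 * r
        G00 += j0 * j0
        G01 += j0 * j1
        G11 += j1 * j1
    G00 += damping
    G11 += damping
    det = G00 * G11 - G01 * G01    # positive because damping > 0
    d0 = (G11 * g0 - G01 * g1) / det
    d1 = (G00 * g1 - G01 * g0) / det
    return (a - d0, b - d1)

def fit(theta, xs, ys, steps=100, damping=1e-3):
    # Assumed stabilizer: accept a step only if it lowers the loss,
    # otherwise raise the damping (Levenberg-Marquardt style).
    current = loss(theta, xs, ys)
    for _ in range(steps):
        cand = gn_step(theta, xs, ys, damping)
        cand_loss = loss(cand, xs, ys)
        if cand_loss < current:
            theta, current = cand, cand_loss
            damping = max(damping / 10.0, 1e-12)
        else:
            damping *= 10.0
    return theta, current

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]  # noise-free synthetic data
theta, final_loss = fit((1.0, 0.0), xs, ys)
print(theta, final_loss)
```

At transformer scale the matrix J^T J cannot be formed or inverted explicitly as it is here; the sketch is only meant to show the shape of the update that the abstract refers to as GN preconditioning.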

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Non-Convex Self-Concordant Functions: Practical Algorithms and Complexity Analysis

math.OC · 2025-11-19 · unverdicted · novelty 7.0

Non-convex self-concordant functions enable regularized Newton and adaptive algorithms to achieve epsilon-approximate first-order stationary points in O(epsilon^{-2}) iterations with global convergence guarantees.

Error whitening: Why Gauss-Newton outperforms Newton

cs.LG · 2026-05-11 · conditional · novelty 6.0

Gauss-Newton descent whitens errors by projecting Newton directions or gradients onto the tangent space, replacing JJ^T with the identity and removing parameterization distortions that affect Newton descent.

Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models

cs.CV · 2026-05-15 · unverdicted · novelty 5.0

Introduces ML-FOP-SOAP optimizer using Fisher-Orthogonal Projection and hierarchical folding to mitigate modality competition in multimodal autoregressive training, reporting gains over AdamW on Janus and Emu3.

RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization

cs.LG · 2026-03-20 · conditional · novelty 5.0

RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.

citing papers explorer

Showing 4 of 4 citing papers.

Non-Convex Self-Concordant Functions: Practical Algorithms and Complexity Analysis
math.OC · 2025-11-19 · unverdicted · none · ref 1 · internal anchor

Error whitening: Why Gauss-Newton outperforms Newton
cs.LG · 2026-05-11 · conditional · none · ref 1 · internal anchor

Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models
cs.CV · 2026-05-15 · unverdicted · none · ref 1 · internal anchor

RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization
cs.LG · 2026-03-20 · conditional · none · ref 5 · internal anchor
