Adaptive Preconditioners Trigger Loss Spikes in Adam

Feiyu Xiong; Hongkang Yang; Jiajie Zhao; Xiaolong Li; Yaoyu Zhang; Zhangchen Zhou; Zhi-Qin John Xu; Zhiwei Bai; Zhiyu Li

arxiv: 2506.04805 · v2 · pith:WJXR7AP6new · submitted 2025-06-05 · 💻 cs.LG

classification 💻 cs.LG

keywords loss · mechanism · spikes · adam · adaptive · decoupling · gradients · neural

Loss spikes commonly emerge during neural network training with the Adam optimizer across diverse architectures and scales, yet their underlying mechanism remains elusive. While previous explanations attribute these phenomena to sharper loss landscapes at lower loss, we show that landscape geometry alone is insufficient to explain the phenomenon. In this work, we pinpoint the root cause in the internal dynamics of Adam's second moment estimator. We identify a critical "decoupling" mechanism where the adaptive preconditioner $v_t$ fails to track the instantaneous squared gradients $g_t^2$, causing the adaptive mechanism to effectively fail. This decoupling allows the preconditioner to decay autonomously despite rising gradients, which pushes the maximum eigenvalue of the preconditioned Hessian beyond the stability threshold $2/\eta$ for sustained periods, manifesting as dramatic loss spikes. Through a quadratic approximation analysis, we theoretically and experimentally characterize five distinct stages of spike evolution and propose a predictor for anticipating spikes based on gradient-directional curvature. We empirically find that the proposed loss spike mechanism, although derived from simplified models, generalizes well to practical scenarios ranging from small neural networks to large-scale Transformers.
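The decay-then-instability idea in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' code or experiment: it assumes a one-dimensional quadratic with fixed curvature `h`, omits Adam's momentum and bias correction, and uses hypothetical values for `beta2`, `lr`, `v0`, and the gradient sequence. It relies only on the standard second-moment update $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$: when $g_t^2$ is much smaller than $v_{t-1}$, the update is approximately $v_t \approx \beta_2 v_{t-1}$, so $v_t$ shrinks geometrically and the preconditioned curvature $h/(\sqrt{v_t}+\epsilon)$ can eventually exceed $2/\eta$.

```python
import math

def second_moment_trace(grads, beta2=0.999, v0=0.0):
    """Return the EMA of squared gradients, v_t = beta2*v_{t-1} + (1-beta2)*g_t^2."""
    v, trace = v0, []
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
        trace.append(v)
    return trace

def preconditioned_curvature(h, v, eps=1e-8):
    """Curvature of a 1-D quadratic (Hessian h) after dividing by sqrt(v) + eps."""
    return h / (math.sqrt(v) + eps)

def is_unstable(h, v, lr, eps=1e-8):
    """True when the preconditioned curvature exceeds the threshold 2 / lr."""
    return preconditioned_curvature(h, v, eps) > 2.0 / lr

# Hypothetical setting: v starts large, gradients stay tiny, so v decays on its own.
grads = [1e-4] * 2000
trace = second_moment_trace(grads, beta2=0.99, v0=1.0)

# First step (1-indexed) at which the toy stability condition is violated.
first_unstable = next(
    t for t, v in enumerate(trace, start=1) if is_unstable(1.0, v, lr=1e-2)
)
print(first_unstable)
```

In this toy setting the curvature `h` never changes, yet the stability condition is eventually violated purely because `v` has decayed, which is consistent with the abstract's claim that landscape geometry alone does not explain the spikes.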

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes
cs.LG 2026-05 unverdicted novelty 7.0

Slingshot loss spikes result from floating-point precision limits that round correct-class gradients to zero, triggering Numerical Feature Inflation and breaking gradient zero-sum constraints.
Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes
cs.LG 2026-05 unverdicted novelty 7.0

Slingshot loss spikes arise from floating-point precision limits that round correct-class gradients to zero, breaking zero-sum constraints and driving exponential parameter growth through numerical feature inflation.
COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training
cs.DC 2026-04 unverdicted novelty 6.0

COPUS co-adapts batch size and parallelism during LLM training via goodput to deliver 3.9-8% average faster convergence than fixing one while tuning the other.
One Algorithm, Two Goals: Dual Scoring for Parameter and Data Selection in LLM Fine-Tuning
cs.LG 2026-05 unverdicted novelty 5.0

DualSFT derives parameter masks and data subsets as row- and column-wise aggregations of one gradient interaction matrix under first- and second-order validation-improvement approximations.