Non-normal spectral signatures of instability in neural network training dynamics
Pith reviewed 2026-05-25 05:15 UTC · model grok-4.3
The pith
Non-normal update operators in common optimizers produce transient amplification that signals training instability even when eigenvalues suggest stability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that the linearized update operators for practically used optimizers are generically non-normal: for Adam, non-normality is controlled by the commutator [H, M] between the Hessian and the diagonal adaptive preconditioner, while for SGD with momentum it arises from the augmented state-space structure of the update map. Applying non-normal stability theory to these operators, we derive a conservative pseudospectral precursor bound in which κ(V) serves as an early-warning indicator of transient amplification even when the spectral radius remains below one, and we establish that exceptional points of the update operator appear as the κ(V) → ∞ limiting case of this framework. Numerical 2D
What carries the argument
the condition number κ(V) of the eigenvector matrix V of the linearized update operator, which quantifies non-normality and bounds transient amplification via pseudospectral theory
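As a minimal sketch of how these quantities might be computed (this is not the paper's code), the snippet below uses NumPy on a hypothetical 2×2 matrix `J_toy` chosen purely for illustration. For a diagonalizable operator J = V Λ V⁻¹, the standard estimate ‖J^k‖ ≤ κ(V) ρ(J)^k holds, so a large κ(V) leaves room for transient growth even when ρ(J) < 1. The helper names are assumptions. So is the eigenvector scaling: κ(V) depends on how the columns of V are normalized, and the unit-norm columns returned by `numpy.linalg.eig` are used here.

```python
import numpy as np

def spectral_radius(J):
    """Largest eigenvalue modulus of J."""
    return float(np.max(np.abs(np.linalg.eigvals(J))))

def kappa_V(J):
    """2-norm condition number of the eigenvector matrix of J.

    Uses the unit-norm eigenvectors returned by numpy.linalg.eig;
    other scalings of V would give a different value.
    """
    _, V = np.linalg.eig(J)
    return float(np.linalg.cond(V))

def transient_norms(J, steps):
    """Spectral norms of J^1, ..., J^steps."""
    norms = []
    P = np.eye(J.shape[0])
    for _ in range(steps):
        P = J @ P
        norms.append(float(np.linalg.norm(P, 2)))
    return np.array(norms)

# Hypothetical toy operator: both eigenvalues lie inside the unit disk,
# but the large off-diagonal entry makes it strongly non-normal.
J_toy = np.array([[0.9, 5.0],
                  [0.0, 0.8]])

rho = spectral_radius(J_toy)
kappa = kappa_V(J_toy)
norms = transient_norms(J_toy, 60)

print(f"rho(J) = {rho:.3f}, kappa(V) = {kappa:.1f}")
print(f"peak ||J^k|| = {norms.max():.2f} at k = {norms.argmax() + 1}")
```

In this toy case the spectral radius alone suggests monotone decay, while the powers of `J_toy` first grow well above one before decaying; κ(V) bounds how large that transient can be.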
If this is right
- The spectral radius ρ(J) alone supplies no separation between stable and unstable training phases.
- κ(V) separates the same phases by approximately one order of magnitude.
- Exceptional points of the update operator arise precisely as the κ(V) → ∞ limit of the pseudospectral framework.
- The non-normal measure complements the classical sharpness criterion by supplying a continuous severity indicator of transient amplification.
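The exceptional-point claim above can be illustrated on a toy problem. The sketch below assumes the textbook heavy-ball recursion θ_{t+1} = (1 + β − ηh) θ_t − β θ_{t−1} on a one-dimensional quadratic with curvature h; this parameterization and the names `heavy_ball_operator` and `eta_h` are illustrative assumptions, not taken from the paper. For this 2×2 companion matrix the two eigenvalues coalesce when ηh = (1 − √β)², and κ(V) grows without bound as that point is approached, while the spectral radius stays fixed at √β on the oscillatory side.

```python
import numpy as np

def heavy_ball_operator(eta_h, beta):
    """Augmented-state update matrix acting on (theta_t, theta_{t-1})."""
    return np.array([[1.0 + beta - eta_h, -beta],
                     [1.0, 0.0]])

def kappa_V(J):
    """Condition number of the unit-norm eigenvector matrix of J."""
    _, V = np.linalg.eig(J)
    return float(np.linalg.cond(V))

def spectral_radius(J):
    return float(np.max(np.abs(np.linalg.eigvals(J))))

beta = 0.81
# Eigenvalues coalesce (exceptional point) at eta*h = (1 - sqrt(beta))^2.
eta_h_star = (1.0 - np.sqrt(beta)) ** 2

# Approach the exceptional point from the complex-eigenvalue side.
offsets = [1e-2, 1e-4, 1e-6]
kappas = []
radii = []
for d in offsets:
    J = heavy_ball_operator(eta_h_star + d, beta)
    kappas.append(kappa_V(J))
    radii.append(spectral_radius(J))

for d, k, r in zip(offsets, kappas, radii):
    print(f"offset = {d:.0e}  rho(J) = {r:.4f}  kappa(V) = {k:.1f}")
```

Here ρ(J) does not change at all along the sweep, whereas κ(V) increases by roughly a factor of ten for each hundredfold reduction of the offset, consistent with a square-root singularity at the coalescence point.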
Where Pith is reading between the lines
- Design of new adaptive optimizers could explicitly penalize large commutators [H, M] to reduce non-normality.
- Monitoring κ(V) during training on larger models might allow preemptive step-size or momentum adjustments before visible instability occurs.
- The same pseudospectral analysis could be applied to other state-augmented methods such as Nesterov or heavy-ball variants.
Load-bearing premise
The linearized update operators for practically used optimizers are generically non-normal, with non-normality for Adam controlled by the commutator [H, M] between the Hessian and the diagonal adaptive preconditioner and for momentum SGD arising from the augmented state-space structure of the update map.
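To make the Adam part of this premise concrete, the sketch below assumes a simplified linearization J = I − η M⁻¹ H with the diagonal preconditioner M frozen and first-moment averaging ignored. That form, and every matrix and helper name here, is an assumption for illustration; the review does not reproduce the paper's exact operator. Under this assumption J is symmetric (hence normal) whenever H and M commute, and loses normality once the commutator [H, M] = HM − MH is nonzero.

```python
import numpy as np

def commutator_norm(H, M):
    """Frobenius norm of the commutator [H, M] = HM - MH."""
    return float(np.linalg.norm(H @ M - M @ H, "fro"))

def normality_defect(J):
    """Frobenius norm of J J^T - J^T J; zero exactly when J is normal."""
    return float(np.linalg.norm(J @ J.T - J.T @ J, "fro"))

def preconditioned_update(H, M, eta):
    """Assumed frozen-preconditioner linearization J = I - eta * M^{-1} H."""
    return np.eye(H.shape[0]) - eta * np.linalg.solve(M, H)

eta = 0.1
M = np.diag([1.0, 100.0])          # hypothetical adaptive preconditioner

H_diag = np.diag([2.0, 3.0])       # commutes with any diagonal M
H_coupled = np.array([[2.0, 1.0],
                      [1.0, 3.0]]) # off-diagonal curvature: [H, M] != 0

J_diag = preconditioned_update(H_diag, M, eta)
J_coupled = preconditioned_update(H_coupled, M, eta)

for name, H, J in [("diagonal H", H_diag, J_diag),
                   ("coupled H", H_coupled, J_coupled)]:
    print(f"{name}: ||[H, M]|| = {commutator_norm(H, M):.3f}, "
          f"normality defect = {normality_defect(J):.4f}")
```

The design choice here is to measure non-normality directly through J Jᵀ − Jᵀ J rather than through κ(V), since the former vanishes exactly in the commuting case and makes the dependence on [H, M] easy to see.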
What would settle it
A set of training runs across multiple optimizers and network depths in which κ(V) shows no consistent separation between phases that exhibit loss spikes or oscillatory behavior and phases that converge smoothly.
Original abstract
Training instabilities in deep networks - loss spikes, oscillatory convergence, and gradient pathologies - are empirically prevalent but lack a rigorous operator-theoretic explanation. We show that the linearized update operators for practically used optimizers are generically non-normal: for Adam, non-normality is controlled by the commutator [H, M] between the Hessian and the diagonal adaptive preconditioner, while for SGD with momentum it arises from the augmented state-space structure of the update map. Applying non-normal stability theory to these operators, we derive a conservative pseudospectral precursor bound in which κ(V) serves as an early-warning indicator of transient amplification even when the spectral radius remains below one, and we establish that exceptional points of the update operator appear as the κ(V) → ∞ limiting case of this framework. Numerical experiments on two-layer networks confirm that the spectral radius ρ(J) provides no separation between stable and unstable training phases while κ(V) separates them by approximately one order of magnitude, complementing the classical sharpness criterion with a continuous severity measure of non-normal amplification. These results establish non-Hermitian operator theory as a useful and underexplored framework for neural network optimization stability, offering a diagnostic language and proof-of-concept benchmark for understanding adaptive optimization stability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript applies non-normal stability theory to the linearized update operators of Adam and momentum SGD. It claims these operators are generically non-normal, with non-normality for Adam controlled by the commutator [H, M] and for momentum SGD arising from the augmented state-space structure. A conservative pseudospectral precursor bound is derived in which κ(V) serves as an early-warning indicator of transient amplification even when the spectral radius ρ(J) < 1, with exceptional points appearing as the κ(V) → ∞ limit. Numerical experiments on two-layer networks show that ρ(J) provides no separation between stable and unstable phases while κ(V) separates them by approximately one order of magnitude, complementing the classical sharpness criterion.
Significance. If the claims hold after addressing the time-variation issue, the work would establish non-Hermitian operator theory as a useful framework for neural network optimization stability, offering κ(V) as a continuous severity measure of non-normal amplification and a diagnostic complement to sharpness. The two-layer numerical separation provides a concrete proof-of-concept benchmark for the approach.
major comments (2)
- [Derivation of the pseudospectral bound (around the statement of the main theorem)] The pseudospectral precursor bound and κ(V) indicator are derived under the assumption of autonomous (constant-coefficient) linear systems, but the update operator J is explicitly time-dependent because both the Hessian H and the adaptive preconditioner M (or momentum buffer) evolve along the training trajectory. No additional justification or extension (e.g., via frozen-coefficient analysis or slow-variation bounds) is provided to establish that the snapshot κ(V) remains a reliable early-warning quantity under realistic rates of variation.
- [Numerical experiments section] The numerical claim that κ(V) separates stable and unstable phases by approximately one order of magnitude while ρ(J) does not is demonstrated only on two-layer networks. It is unclear whether this separation persists in deeper architectures, where the structure of the commutator [H, M] and the augmented state may produce qualitatively different non-normality effects.
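The concern in the first major comment can be illustrated with a standard toy example: a time-varying product of matrices can grow even when every individual snapshot has spectral radius below one. The matrices `A1` and `A2` below are hypothetical and unrelated to any optimizer; the sketch only shows why a bound derived for a constant operator does not automatically transfer to a switching or drifting one.

```python
import numpy as np

def spectral_radius(J):
    return float(np.max(np.abs(np.linalg.eigvals(J))))

# Two hypothetical snapshots, each with spectral radius 0.5.
A1 = np.array([[0.5, 2.0],
               [0.0, 0.4]])
A2 = A1.T

steps = 20

# Frozen-coefficient dynamics: apply the same snapshot at every step.
frozen = np.linalg.matrix_power(A1, steps)
frozen_norm = float(np.linalg.norm(frozen, 2))

# Time-varying dynamics: alternate between the two snapshots.
P = np.eye(2)
for t in range(steps):
    P = (A1 if t % 2 == 0 else A2) @ P
switched_norm = float(np.linalg.norm(P, 2))

print(f"rho(A1) = {spectral_radius(A1):.2f}, rho(A2) = {spectral_radius(A2):.2f}")
print(f"||A1^{steps}||          = {frozen_norm:.2e}")
print(f"||alternating product|| = {switched_norm:.2e}")
```

The alternating product equals powers of A1ᵀA1, whose largest eigenvalue is ‖A1‖² > 1, so it diverges although each frozen system decays. Both snapshots are strongly non-normal, which is what makes this possible; whether training trajectories vary fast enough for this effect to matter is exactly what the referee asks the authors to address.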
minor comments (2)
- [Theory section] The precise definition of the pseudospectral bound, the matrix V, and the quantity κ(V) should be stated explicitly with equation numbers rather than referenced only descriptively.
- [Abstract and numerical results] The abstract states separation 'by approximately one order of magnitude'; the main text should report the exact observed ratio, number of runs, and any statistical measures of separation.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential of non-Hermitian operator theory in this setting. We address each major comment below with proposed revisions where appropriate.
- Referee: [Derivation of the pseudospectral bound (around the statement of the main theorem)] The pseudospectral precursor bound and κ(V) indicator are derived under the assumption of autonomous (constant-coefficient) linear systems, but the update operator J is explicitly time-dependent because both the Hessian H and the adaptive preconditioner M (or momentum buffer) evolve along the training trajectory. No additional justification or extension (e.g., via frozen-coefficient analysis or slow-variation bounds) is provided to establish that the snapshot κ(V) remains a reliable early-warning quantity under realistic rates of variation.
  Authors: We agree that the core pseudospectral bound is stated for autonomous linear systems. The manuscript applies the bound to instantaneous linearizations (snapshots) of the time-varying operator along the trajectory, which is standard practice in non-normal stability analysis of slowly varying systems. However, we did not include an explicit discussion of the frozen-coefficient approximation or slow-variation error bounds. We will revise the manuscript to add a dedicated paragraph in the theory section justifying the snapshot approach under the assumption that H and M vary slowly relative to the transient amplification timescale, with supporting references from the non-autonomous stability literature.
  Revision: yes
- Referee: [Numerical experiments section] The numerical claim that κ(V) separates stable and unstable phases by approximately one order of magnitude while ρ(J) does not is demonstrated only on two-layer networks. It is unclear whether this separation persists in deeper architectures, where the structure of the commutator [H, M] and the augmented state may produce qualitatively different non-normality effects.
  Authors: The experiments are restricted to two-layer networks to isolate the non-normality mechanisms (commutator [H, M] and augmented state) in a minimal setting where exact Hessian and preconditioner computations are feasible. The theoretical derivations for both Adam and momentum SGD are architecture-agnostic and depend only on the algebraic structure of the update operator. We will revise the discussion section to explicitly state that the two-layer results constitute a controlled proof-of-concept benchmark, note that deeper networks may exhibit additional non-normality sources, and flag extension to deeper architectures as an important direction for follow-up work.
  Revision: partial
Circularity Check
No significant circularity; derivation applies external non-normal theory to independently defined operators
The paper first constructs the linearized update operators J explicitly from the standard forms of Adam (via commutator [H,M]) and momentum SGD (via augmented state-space), then invokes established non-normal stability theory to obtain the pseudospectral precursor bound with κ(V) as indicator. No equation shows the bound or κ(V) reducing to a fitted parameter, self-definition, or input quantity by construction. Numerical separation of stable/unstable phases by κ(V) is an empirical observation, not a definitional tautology. No load-bearing self-citation chain or ansatz smuggling is present in the provided derivation steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The training dynamics near fixed points can be accurately captured by linearizing the update operators of the optimizers.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "the linearized update operators for practically used optimizers are generically non-normal: for Adam, non-normality is controlled by the commutator [H, M]"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "κ(V) serves as an early-warning indicator of transient amplification even when the spectral radius remains below one"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.