Non-normal spectral signatures of instability in neural network training dynamics

Souvik Ghosh

arXiv: 2605.23476 · v1 · pith:3BU6Q6UFnew · submitted 2026-05-22 · 💻 cs.LG · cond-mat.dis-nn · cond-mat.mtrl-sci · math.OC

Pith reviewed 2026-05-25 05:15 UTC · model grok-4.3

keywords: non-normal operators · neural network training · optimization stability · pseudospectra · Adam optimizer · momentum SGD · transient amplification · exceptional points

The pith

Non-normal update operators in common optimizers produce transient amplification that signals training instability even when eigenvalues suggest stability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the linearized update maps for Adam and momentum SGD are generically non-normal. Non-normality in Adam is governed by the commutator between the Hessian and the adaptive preconditioner, while in momentum SGD it comes from the augmented state-space form of the update. Non-normal stability theory then supplies a pseudospectral bound in which the eigenvector-matrix condition number κ(V) acts as an early-warning indicator of transient growth before the spectral radius exceeds one. Experiments on two-layer networks confirm that κ(V) cleanly separates stable and unstable phases by roughly an order of magnitude, whereas the spectral radius alone does not. This supplies a continuous, operator-theoretic severity measure that augments the classical sharpness criterion.
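The gap between eigenvalue stability and transient growth can be seen on a toy operator. The sketch below is illustrative only: the 2x2 matrix `J` is a hypothetical stand-in, not an operator taken from the paper, and the helper names are assumptions introduced here. It computes ρ(J), κ(V), and the norms of the powers of J.

```python
import numpy as np

def kappa_V(J):
    """2-norm condition number of the eigenvector matrix returned by eig."""
    _, V = np.linalg.eig(J)
    return np.linalg.cond(V)

def spectral_radius(J):
    """Largest eigenvalue modulus of J."""
    return np.max(np.abs(np.linalg.eigvals(J)))

def transient_norms(J, steps):
    """Spectral norms of J^0, J^1, ..., J^steps."""
    norms, P = [], np.eye(J.shape[0])
    for _ in range(steps + 1):
        norms.append(np.linalg.norm(P, 2))
        P = J @ P
    return np.array(norms)

# Hypothetical toy operator (an assumption for illustration): both
# eigenvalues lie inside the unit disk, but the strong off-diagonal
# coupling makes the eigenvectors nearly parallel.
J = np.array([[0.9, 5.0],
              [0.0, 0.8]])

rho = spectral_radius(J)          # eigenvalue-based stability indicator
kappa = kappa_V(J)                # non-normality indicator
norms = transient_norms(J, 60)    # actual growth of perturbations
peak = norms.max()

print(f"rho(J) = {rho:.2f}, kappa(V) = {kappa:.1f}, peak ||J^k|| = {peak:.1f}")
```

Here ρ(J) is below one, so perturbations decay eventually, yet the norm of J^k first rises well above one. Note that `np.linalg.eig` returns unit-norm eigenvectors and κ(V) depends on that column scaling, so reported values are tied to this normalization.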

Core claim

We show that the linearized update operators for practically used optimizers are generically non-normal: for Adam, non-normality is controlled by the commutator [H, M] between the Hessian and the diagonal adaptive preconditioner, while for SGD with momentum it arises from the augmented state-space structure of the update map. Applying non-normal stability theory to these operators, we derive a conservative pseudospectral precursor bound in which κ(V) serves as an early-warning indicator of transient amplification even when the spectral radius remains below one, and we establish that exceptional points of the update operator appear as the κ(V) → ∞ limiting case of this framework. Numerical 2D

What carries the argument

the condition number κ(V) of the eigenvector matrix V of the linearized update operator, which quantifies non-normality and bounds transient amplification via pseudospectral theory
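For orientation, the standard textbook bound for a diagonalizable operator is written out below. This is the classical eigenvector-conditioning estimate, stated here as an assumption about the kind of inequality involved; the paper's own precursor bound may be stated differently.

```latex
\[
J = V \Lambda V^{-1}, \qquad
\kappa(V) = \|V\|_2 \, \|V^{-1}\|_2, \qquad
\|J^k\|_2 = \|V \Lambda^k V^{-1}\|_2
\le \|V\|_2 \, \|\Lambda^k\|_2 \, \|V^{-1}\|_2
= \kappa(V)\, \rho(J)^k , \quad k \ge 0 .
\]
```

When J is normal, V can be chosen unitary, so κ(V) = 1 and the bound reduces to pure spectral-radius decay. A large κ(V) leaves room for transient growth of up to a factor κ(V) before that decay takes over.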

If this is right

The spectral radius ρ(J) alone supplies no separation between stable and unstable training phases.
κ(V) separates the same phases by approximately one order of magnitude.
Exceptional points of the update operator arise precisely as the κ(V) → ∞ limit of the pseudospectral framework.
The non-normal measure complements the classical sharpness criterion by supplying a continuous severity indicator of transient amplification.
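The exceptional-point limit in the list above can be illustrated with heavy-ball momentum on a one-dimensional quadratic. This is a minimal sketch under stated assumptions: the update x_{k+1} = x_k - η h x_k + β (x_k - x_{k-1}) with scalar curvature h is a textbook form chosen for illustration, and the function names are hypothetical rather than the paper's.

```python
import numpy as np

def heavy_ball_operator(eta_h, beta):
    """Augmented update map acting on the state (x_k, x_{k-1})."""
    return np.array([[1.0 + beta - eta_h, -beta],
                     [1.0, 0.0]])

def kappa_V(J):
    """2-norm condition number of the eigenvector matrix returned by eig."""
    _, V = np.linalg.eig(J)
    return np.linalg.cond(V)

beta = 0.81
# The two eigenvalues coalesce where the discriminant
# (1 + beta - eta*h)^2 - 4*beta vanishes, i.e. eta*h = (1 - sqrt(beta))^2.
eta_h_ep = (1.0 - np.sqrt(beta)) ** 2

rhos, kappas = [], []
for delta in (1e-2, 1e-4, 1e-6):
    # Approach the exceptional point from the complex-eigenvalue side.
    J = heavy_ball_operator(eta_h_ep + delta, beta)
    rhos.append(np.max(np.abs(np.linalg.eigvals(J))))
    kappas.append(kappa_V(J))
    print(f"delta = {delta:.0e}: rho = {rhos[-1]:.4f}, kappa(V) = {kappas[-1]:.1f}")
```

On this side of the exceptional point the eigenvalues form a complex pair with modulus sqrt(β), so ρ(J) stays fixed at 0.9 while κ(V) grows without bound as delta shrinks. The spectral radius therefore carries no information about the approach to the defective operator in this toy setting.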

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Design of new adaptive optimizers could explicitly penalize large commutators [H, M] to reduce non-normality.
Monitoring κ(V) during training on larger models might allow preemptive step-size or momentum adjustments before visible instability occurs.
The same pseudospectral analysis could be applied to other state-augmented methods such as Nesterov or heavy-ball variants.

Load-bearing premise

The linearized update operators for practically used optimizers are generically non-normal, with non-normality for Adam controlled by the commutator [H, M] between the Hessian and the diagonal adaptive preconditioner and for momentum SGD arising from the augmented state-space structure of the update map.
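The role of the commutator can be checked numerically on a simplified model. The sketch below assumes a frozen diagonal preconditioner and the linearized map J = I - η M H; this simplified form, the random stand-in Hessian, and the helper names are assumptions made here for illustration, not the paper's construction.

```python
import numpy as np

def adam_like_operator(H, m_diag, eta):
    """Simplified linearized map J = I - eta * M * H with M = diag(m_diag)."""
    return np.eye(H.shape[0]) - eta * np.diag(m_diag) @ H

def commutator_norm(H, m_diag):
    """Frobenius norm of [H, M] = H M - M H."""
    M = np.diag(m_diag)
    return np.linalg.norm(H @ M - M @ H, "fro")

def normality_defect(J):
    """Frobenius norm of J J^T - J^T J, which is zero iff J is normal."""
    return np.linalg.norm(J @ J.T - J.T @ J, "fro")

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
H = A @ A.T                      # symmetric positive semidefinite stand-in Hessian
eta = 0.01

m_iso = np.full(4, 2.0)                        # M proportional to I commutes with H
m_aniso = np.array([0.1, 1.0, 10.0, 100.0])    # strongly anisotropic preconditioner

comm_iso = commutator_norm(H, m_iso)
comm_aniso = commutator_norm(H, m_aniso)
defect_iso = normality_defect(adam_like_operator(H, m_iso, eta))
defect_aniso = normality_defect(adam_like_operator(H, m_aniso, eta))

print(f"isotropic:   ||[H,M]|| = {comm_iso:.2e}, defect = {defect_iso:.2e}")
print(f"anisotropic: ||[H,M]|| = {comm_aniso:.2e}, defect = {defect_aniso:.2e}")
```

In this model J J^T - J^T J equals η² (M H² M - H M² H), which vanishes whenever H and M commute. The isotropic case therefore gives a normal operator up to rounding, while the anisotropic case does not.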

What would settle it

A set of training runs across multiple optimizers and network depths in which κ(V) shows no consistent separation between phases that exhibit loss spikes or oscillatory behavior and phases that converge smoothly.

Figures

Figures reproduced from arXiv: 2605.23476 by Souvik Ghosh.

**Figure 2.** Reproducibility of the spectral radius result across …

Original abstract

Training instabilities in deep networks (loss spikes, oscillatory convergence, and gradient pathologies) are empirically prevalent but lack a rigorous operator-theoretic explanation. We show that the linearized update operators for practically used optimizers are generically non-normal: for Adam, non-normality is controlled by the commutator [H, M] between the Hessian and the diagonal adaptive preconditioner, while for SGD with momentum it arises from the augmented state-space structure of the update map. Applying non-normal stability theory to these operators, we derive a conservative pseudospectral precursor bound in which κ(V) serves as an early-warning indicator of transient amplification even when the spectral radius remains below one, and we establish that exceptional points of the update operator appear as the κ(V) → ∞ limiting case of this framework. Numerical experiments on two-layer networks confirm that the spectral radius ρ(J) provides no separation between stable and unstable training phases while κ(V) separates them by approximately one order of magnitude, complementing the classical sharpness criterion with a continuous severity measure of non-normal amplification. These results establish non-Hermitian operator theory as a useful and underexplored framework for neural network optimization stability, offering a diagnostic language and proof-of-concept benchmark for understanding adaptive optimization stability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Applies non-normal theory to Adam and momentum update operators with a κ(V) early-warning bound, but time-varying Hessians and preconditioners raise questions about the autonomous-system bounds.

The paper's core move is to treat the linearized update maps for Adam and momentum SGD as non-normal operators. Non-normality for Adam traces to the commutator between the Hessian and the diagonal preconditioner; for momentum it comes from the augmented state. From there they pull a pseudospectral bound in which κ(V) flags possible transient growth even when the spectral radius stays below one, and they note that exceptional points sit at the κ(V) → ∞ limit.

On two-layer networks the experiments show κ(V) separating stable and unstable phases by roughly an order of magnitude while the spectral radius gives no separation. That numerical contrast is the concrete result they deliver. It is a clean way to import non-Hermitian ideas into optimization stability and to supply a continuous severity measure that sits alongside sharpness criteria.

The main limitation is the time dependence. Both the Hessian and the adaptive terms evolve along the trajectory, so the operator J is not constant. Standard non-normal bounds assume autonomous linear systems; the abstract gives no argument that snapshot values or slow-variation approximations remain reliable under realistic rates of change. The experiments are also confined to two-layer nets, which limits how far the separation result can be read.

This is for people already working on dynamical-systems accounts of training. A reader who follows operator theory or optimization stability will find a fresh angle and a specific diagnostic to test. It is coherent enough on its own terms to merit referee time, mainly to pressure the time-variation step and to see whether the bound scales.

Referee Report

2 major / 2 minor

Summary. The manuscript applies non-normal stability theory to the linearized update operators of Adam and momentum SGD. It claims these operators are generically non-normal, with non-normality for Adam controlled by the commutator [H, M] and for momentum SGD arising from the augmented state-space structure. A conservative pseudospectral precursor bound is derived in which κ(V) serves as an early-warning indicator of transient amplification even when the spectral radius ρ(J) < 1, with exceptional points appearing as the κ(V) → ∞ limit. Numerical experiments on two-layer networks show that ρ(J) provides no separation between stable and unstable phases while κ(V) separates them by approximately one order of magnitude, complementing the classical sharpness criterion.

Significance. If the claims hold after addressing the time-variation issue, the work would establish non-Hermitian operator theory as a useful framework for neural network optimization stability, offering κ(V) as a continuous severity measure of non-normal amplification and a diagnostic complement to sharpness. The two-layer numerical separation provides a concrete proof-of-concept benchmark for the approach.

major comments (2)

[Derivation of the pseudospectral bound (around the statement of the main theorem)] The pseudospectral precursor bound and κ(V) indicator are derived under the assumption of autonomous (constant-coefficient) linear systems, but the update operator J is explicitly time-dependent because both the Hessian H and the adaptive preconditioner M (or momentum buffer) evolve along the training trajectory. No additional justification or extension (e.g., via frozen-coefficient analysis or slow-variation bounds) is provided to establish that the snapshot κ(V) remains a reliable early-warning quantity under realistic rates of variation.
[Numerical experiments section] The numerical claim that κ(V) separates stable and unstable phases by approximately one order of magnitude while ρ(J) does not is demonstrated only on two-layer networks. It is unclear whether this separation persists in deeper architectures, where the structure of the commutator [H, M] and the augmented state may produce qualitatively different non-normality effects.
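The concern in the first major comment can be made concrete with a constructed example in which every snapshot operator is stable by its spectral radius, yet the time-varying product grows. The two matrices below are hypothetical and chosen only to make the effect visible; they are not drawn from the manuscript.

```python
import numpy as np

def spectral_radius(A):
    """Largest eigenvalue modulus of A."""
    return np.max(np.abs(np.linalg.eigvals(A)))

# Two hypothetical snapshot operators that alternate along the trajectory.
# Each one has both eigenvalues equal to 0.5.
A_even = np.array([[0.5, 2.0],
                   [0.0, 0.5]])
A_odd = A_even.T

# One full period of the time-varying system.
one_period = A_odd @ A_even

rho_even = spectral_radius(A_even)
rho_odd = spectral_radius(A_odd)
rho_period = spectral_radius(one_period)

# Propagate a perturbation through 20 alternating steps.
x = np.array([1.0, 1.0])
x0_norm = np.linalg.norm(x)
for k in range(20):
    x = (A_even if k % 2 == 0 else A_odd) @ x
growth = np.linalg.norm(x) / x0_norm

print(f"snapshot radii: {rho_even:.2f}, {rho_odd:.2f}")
print(f"radius over one period: {rho_period:.3f}, growth after 20 steps: {growth:.2e}")
```

Both snapshots are strongly non-normal, which is what allows the product to expand even though each factor contracts asymptotically. A frozen-coefficient argument would need to rule out this kind of rapid switching, for instance by bounding how fast H and M change relative to the transient timescale.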

minor comments (2)

[Theory section] The precise definition of the pseudospectral bound, the matrix V, and the quantity κ(V) should be stated explicitly with equation numbers rather than referenced only descriptively.
[Abstract and numerical results] The abstract states separation 'by approximately one order of magnitude'; the main text should report the exact observed ratio, number of runs, and any statistical measures of separation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of non-Hermitian operator theory in this setting. We address each major comment below with proposed revisions where appropriate.

Referee: [Derivation of the pseudospectral bound (around the statement of the main theorem)] The pseudospectral precursor bound and κ(V) indicator are derived under the assumption of autonomous (constant-coefficient) linear systems, but the update operator J is explicitly time-dependent because both the Hessian H and the adaptive preconditioner M (or momentum buffer) evolve along the training trajectory. No additional justification or extension (e.g., via frozen-coefficient analysis or slow-variation bounds) is provided to establish that the snapshot κ(V) remains a reliable early-warning quantity under realistic rates of variation.

Authors: We agree that the core pseudospectral bound is stated for autonomous linear systems. The manuscript applies the bound to instantaneous linearizations (snapshots) of the time-varying operator along the trajectory, which is a standard practice in non-normal stability analysis of slowly varying systems. However, we did not include an explicit discussion of the frozen-coefficient approximation or slow-variation error bounds. We will revise the manuscript to add a dedicated paragraph in the theory section justifying the snapshot approach under the assumption that H and M vary slowly relative to the transient amplification timescale, with supporting references from the non-autonomous stability literature. revision: yes
Referee: [Numerical experiments section] The numerical claim that κ(V) separates stable and unstable phases by approximately one order of magnitude while ρ(J) does not is demonstrated only on two-layer networks. It is unclear whether this separation persists in deeper architectures, where the structure of the commutator [H, M] and the augmented state may produce qualitatively different non-normality effects.

Authors: The experiments are restricted to two-layer networks to isolate the non-normality mechanisms (commutator [H, M] and augmented state) in a minimal setting where exact Hessian and preconditioner computations are feasible. The theoretical derivations for both Adam and momentum SGD are architecture-agnostic and depend only on the algebraic structure of the update operator. We will revise the discussion section to explicitly state that the two-layer results constitute a controlled proof-of-concept benchmark, note that deeper networks may exhibit additional non-normality sources, and flag extension to deeper architectures as an important direction for follow-up work. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation applies external non-normal theory to independently defined operators

The paper first constructs the linearized update operators J explicitly from the standard forms of Adam (via commutator [H,M]) and momentum SGD (via augmented state-space), then invokes established non-normal stability theory to obtain the pseudospectral precursor bound with κ(V) as indicator. No equation shows the bound or κ(V) reducing to a fitted parameter, self-definition, or input quantity by construction. Numerical separation of stable/unstable phases by κ(V) is an empirical observation, not a definitional tautology. No load-bearing self-citation chain or ansatz smuggling is present in the provided derivation steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the applicability of non-normal operator theory to the linearized forms of Adam and momentum SGD updates; no free parameters or invented entities are mentioned.

axioms (1)

domain assumption: The training dynamics near fixed points can be accurately captured by linearizing the update operators of the optimizers.
This linearization is the starting point for applying non-normal stability theory as described in the abstract.

pith-pipeline@v0.9.0 · 5759 in / 1277 out tokens · 26275 ms · 2026-05-25T05:15:45.731913+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "the linearized update operators for practically used optimizers are generically non-normal: for Adam, non-normality is controlled by the commutator [H, M]"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "κ(V) serves as an early-warning indicator of transient amplification even when the spectral radius remains below one"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 3 internal anchors

[1] Kreiss, H. O. BIT Numerical Mathematics, 1962.
[2] Trefethen, Lloyd N. and Embree, Mark.
[3] Trefethen, Lloyd N. and Trefethen, Anne E. and Reddy, Satish C. and Driscoll, Tobin A. Science, 1993.
[4] Ashida, Yuto and Gong, Zongping and Ueda, Masahito. Advances in Physics, 2020.
[5] Cohen, Jeremy M. and Kaur, Simran and Li, Yuanzhi and Kolter, J. Zico and Talwalkar, Ameet. International Conference on Learning Representations (ICLR).
[6] Damian, Alexandru and Nichani, Eshaan and Ge, Rong and Lee, Jason D. arXiv preprint arXiv:2209.15594.
[7] Ghorbani, Behrooz and Krishnan, Shankar and Xiao, Ying. An Investigation into Neural Net Optimization via Hessian Eigenvalue Density. arXiv preprint arXiv:1901.10159.
[8] Sagun, Levent and Evci, Utku and G. … Empirical Analysis of the Hessian of Over-Parametrized Neural Networks. arXiv preprint arXiv:1706.04454.
[9] Sagun, Levent and Evci, Utku and Dauphin, Yann and Bottou, L. … Appearance of Random Matrix Theory in Deep Learning. 2021.
[10] Zhang, Zhihao and others. arXiv preprint arXiv:2305.12133.
[11] Yao, Zhewei and Gholami, Amir and Lei, Qi and Keutzer, Kurt and Mahoney, Michael W. arXiv preprint arXiv:1802.08241.
[12] Foret, Pierre and Kleiner, Ariel and Mobahi, Hossein and Neyshabur, Behnam. International Conference on Learning Representations (ICLR).
[13] Jacot, Arthur and Gabriel, Franck and Hongler, Cl. … Neural Tangent Kernel: Convergence and Generalization in Neural Networks. 2018.
[14] Bottou, L. … The Tradeoffs of Large Scale Learning. 2012.
[15] Kingma, Diederik P. and Ba, Jimmy. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.