Depth, Not Data: An Analysis of Hessian Spectral Bifurcation
Pith reviewed 2026-05-21 13:49 UTC · model grok-4.3
The pith
Even with perfectly balanced data, the Hessian in deep linear networks develops a dominant eigenvalue cluster and a bulk cluster whose ratio scales linearly with depth.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a deep linear network with perfectly balanced data covariance, the Hessian still exhibits a bifurcated eigenvalue structure consisting of a dominant cluster and a bulk cluster. The ratio between the dominant and bulk eigenvalues scales linearly with network depth. This establishes that the spectral gap is shaped by the network architecture, independently of the data distribution.
What carries the argument
The exact closed-form expression for the Hessian in deep linear networks, which isolates the depth-dependent scaling of the dominant-to-bulk eigenvalue ratio.
Load-bearing premise
The analysis uses a deep linear network whose weights and data permit an exact closed-form Hessian.
What would settle it
Compute the Hessian eigenvalues for deep linear networks with identity covariance at successive depths and check whether the observed dominant-to-bulk ratio increases linearly.
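The check described above can be sketched numerically. What follows is a minimal sketch under assumptions that are not taken from the paper: every layer is set to `sigma ** (1 / depth)` times the identity (so the end-to-end product is `sigma * I`), the target map is the identity, the input covariance is the identity, and the Hessian is formed by central finite differences rather than from the paper's closed form. The function names, and the use of the largest eigenvalue over the median absolute eigenvalue as a stand-in for the dominant-to-bulk ratio, are illustrative choices.

```python
import numpy as np

def loss(theta, depth, d, target):
    """Quadratic loss 0.5 * ||W_L ... W_1 - target||_F^2 (identity input covariance)."""
    product = np.eye(d)
    for W in theta.reshape(depth, d, d):
        product = W @ product  # left-multiply so the last layer ends up outermost
    return 0.5 * np.sum((product - target) ** 2)

def fd_hessian(f, theta, h=1e-3):
    """Symmetric central finite-difference Hessian of f at theta."""
    n = theta.size
    steps = h * np.eye(n)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            ei, ej = steps[i], steps[j]
            H[i, j] = H[j, i] = (
                f(theta + ei + ej) - f(theta + ei - ej)
                - f(theta - ei + ej) + f(theta - ei - ej)
            ) / (4.0 * h * h)
    return H

def dominant_to_bulk_ratio(depth, d=2, sigma=0.8):
    """Largest Hessian eigenvalue divided by the median absolute eigenvalue.

    Assumed toy configuration: all layers equal sigma**(1/depth) * I, target = I.
    """
    scale = sigma ** (1.0 / depth)
    theta = np.tile((scale * np.eye(d)).ravel(), depth)
    target = np.eye(d)
    H = fd_hessian(lambda t: loss(t, depth, d, target), theta)
    eigs = np.linalg.eigvalsh(H)  # ascending order
    return eigs[-1] / np.median(np.abs(eigs))

if __name__ == "__main__":
    for depth in (3, 6, 12):
        print(depth, round(dominant_to_bulk_ratio(depth), 3))
```

The median is used as the bulk proxy because, in this toy configuration with depth of at least 3, most eigenvalues sit in the small-magnitude cluster. Plotting the printed ratio against depth is one way to see whether the growth looks linear; the sketch says nothing about the constants in the paper's theorem.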
Original abstract
The eigenvalue distribution of the Hessian matrix plays a crucial role in understanding the optimization landscape of deep neural networks. Prior work has attributed the well-documented "bulk-and-spike" spectral structure, where a few dominant eigenvalues are separated from a bulk of smaller ones, to the imbalance in the data covariance matrix. In this work, we challenge this view by demonstrating that such spectral Bifurcation can arise purely from the network architecture, independent of data imbalance. Specifically, we analyze a deep linear network setup and prove that, even when the data covariance is perfectly balanced, the Hessian still exhibits a Bifurcation eigenvalue structure: a dominant cluster and a bulk cluster. Crucially, we establish that the ratio between dominant and bulk eigenvalues scales linearly with the network depth. This reveals that the spectral gap is strongly affected by the network architecture rather than solely by data distribution. Our results suggest that both model architecture and data characteristics should be considered when designing optimization algorithms for deep networks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes the eigenvalue spectrum of the Hessian in deep linear networks. It claims to prove that a bifurcation into a dominant eigenvalue cluster and a bulk cluster occurs even when the data covariance is the identity matrix, and that the ratio between the dominant and bulk eigenvalues scales linearly with network depth. This is used to argue that the spectral structure arises from the network architecture independently of data imbalance.
Significance. If the derivation holds, the result is significant because it supplies an exact, closed-form counterexample to the data-covariance explanation for Hessian bulk-and-spike structure, isolating depth as a sufficient cause in the linear setting. The parameter-free linear scaling with depth is a clear strength that could inform architecture-aware optimization methods. The restriction to linear networks remains a modeling choice whose implications for nonlinear networks are left for future work.
major comments (2)
- [§4] §4, main theorem and Eq. (12): the closed-form Hessian obtained via chain-rule expansion of the quadratic loss must be shown to contain no auxiliary assumptions on the singular values of the individual layer weights or on the evaluation point; otherwise the observed depth-dependent conditioning of the end-to-end map could produce the eigenvalue separation even under balanced covariance, undermining the claim that the bifurcation is attributable purely to depth.
- [Theorem 2] Theorem 2: the linear scaling of the dominant-to-bulk ratio is derived under the assumption that the data covariance is exactly the identity; an explicit statement is needed on whether the same scaling persists when the covariance is only approximately balanced or when finite-width effects are re-introduced.
minor comments (2)
- [Abstract] The abstract and introduction capitalize 'Bifurcation' inconsistently; standardize to lowercase except at sentence start.
- [Figure 2] Figure 2 caption should explicitly state the network depth values used for the plotted spectra.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. These have prompted us to clarify the assumptions underlying our derivations and to better delineate the scope of our results. We respond to each major comment below, indicating the changes made to the manuscript.
Point-by-point responses
- Referee: [§4] §4, main theorem and Eq. (12): the closed-form Hessian obtained via chain-rule expansion of the quadratic loss must be shown to contain no auxiliary assumptions on the singular values of the individual layer weights or on the evaluation point; otherwise the observed depth-dependent conditioning of the end-to-end map could produce the eigenvalue separation even under balanced covariance, undermining the claim that the bifurcation is attributable purely to depth.
  Authors: We agree that the assumptions must be stated explicitly. The chain-rule expansion begins from the quadratic loss and differentiates the composition of linear layers without imposing extra constraints on the singular values of individual weight matrices; the only structural assumption is the balanced initialization in which the product of the weight matrices equals the identity when the data covariance is the identity. The evaluation point is the critical point at which the gradient vanishes. To address the concern, we have revised §4 to include an expanded derivation that isolates the depth-dependent multiplicative factors arising from the chain rule. We have also added a short lemma in the appendix confirming that the eigenvalue separation is preserved under small perturbations of the layer singular values, thereby showing that the bifurcation is driven by depth rather than by any auxiliary conditioning of the end-to-end map.
  Revision: yes
- Referee: [Theorem 2] Theorem 2: the linear scaling of the dominant-to-bulk ratio is derived under the assumption that the data covariance is exactly the identity; an explicit statement is needed on whether the same scaling persists when the covariance is only approximately balanced or when finite-width effects are re-introduced.
  Authors: Theorem 2 is proved for the exact identity covariance, which supplies the cleanest counter-example to data-imbalance explanations. For covariances that are only approximately balanced we expect the linear depth scaling to persist to leading order, because the perturbation introduced by a small deviation from the identity enters as a lower-order correction that does not cancel the dominant depth factor. We have added a remark after Theorem 2 stating this expectation and sketching the first-order perturbation argument. Finite-width effects lie outside the present analysis, which is conducted in the exact linear, infinite-width setting; incorporating them would require random-matrix techniques that are beyond the scope of the current manuscript. We have inserted a brief discussion paragraph noting this limitation and identifying it as a direction for future work.
  Revision: partial
Circularity Check
No significant circularity detected; derivation relies on direct closed-form spectral analysis rather than self-referential definitions or fitted inputs.
full rationale
The paper derives the Hessian eigenvalue bifurcation and its linear scaling with depth via explicit mathematical analysis of a deep linear network under the assumption of balanced (identity) data covariance. This proceeds from the chain-rule expansion of the quadratic loss to obtain a closed-form Hessian expression, followed by direct computation of its spectrum, without re-expressing a fitted quantity as a prediction or importing uniqueness via self-citation chains. The central claim is therefore self-contained against the stated modeling assumptions and does not reduce to its inputs by construction.
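The chain-rule step that the rationale refers to can be illustrated with the standard gradient of the quadratic loss of a deep linear network. The sketch below is an assumption-laden illustration, not the paper's code: it uses square layers of a common size `d`, random weights, and an arbitrary positive-definite covariance, and it compares the closed-form layer gradient against central finite differences. All function and variable names are hypothetical.

```python
import numpy as np

def chain(mats, d):
    """Return M_n ... M_1 for mats = [M_1, ..., M_n] (identity if empty)."""
    out = np.eye(d)
    for M in mats:
        out = M @ out
    return out

def loss(layers, sxx, syx):
    """0.5 * E||P x - y||^2 up to an additive constant, with P = W_L ... W_1."""
    P = chain(layers, sxx.shape[0])
    return 0.5 * np.trace(P @ sxx @ P.T) - np.trace(P @ syx.T)

def grad_closed_form(layers, k, sxx, syx):
    """Chain-rule gradient with respect to layers[k] (0-indexed)."""
    d = sxx.shape[0]
    above = chain(layers[k + 1:], d)   # W_L ... W_{k+1}
    below = chain(layers[:k], d)       # W_{k-1} ... W_1
    residual = chain(layers, d) @ sxx - syx
    return above.T @ residual @ below.T

def grad_finite_diff(layers, k, sxx, syx, h=1e-5):
    """Entry-wise central finite-difference gradient with respect to layers[k]."""
    G = np.zeros_like(layers[k])
    for idx in np.ndindex(*G.shape):
        plus = [W.copy() for W in layers]
        minus = [W.copy() for W in layers]
        plus[k][idx] += h
        minus[k][idx] -= h
        G[idx] = (loss(plus, sxx, syx) - loss(minus, sxx, syx)) / (2 * h)
    return G

rng = np.random.default_rng(0)
d, depth = 3, 4
layers = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(depth)]
A = rng.normal(size=(d, d))
sxx = A @ A.T / d + np.eye(d)          # arbitrary positive-definite covariance
syx = rng.normal(size=(d, d))
max_err = max(
    np.max(np.abs(grad_closed_form(layers, k, sxx, syx)
                  - grad_finite_diff(layers, k, sxx, syx)))
    for k in range(depth)
)
print("max abs error:", max_err)
```

Because the loss is quadratic in each individual weight entry, the central difference is exact up to rounding, so any sizeable discrepancy would point to an error in the closed-form expression rather than in the step size.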
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A deep linear network admits an exact closed-form Hessian whose eigenvalues can be analyzed as a function of depth.
Forward citations
Cited by 1 Pith paper
- RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization
  RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.
Reference graph
Works this paper leans on
- [1] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724–1732. PMLR, 2017.
- [2] Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, and Michael Mahoney. Adahessian: An adaptive second order optimizer for machine learning. In AAAI Conference on Artificial Intelligence, 2021.
- [3] Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. arXiv preprint arXiv:2305.14342, 2023.
- [4] Adam-mini: Use fewer learning rates to gain more. arXiv preprint arXiv:2406.16793.
  Yuxin Zhang, Congliang Chen, Zingshi Li, Tian Ding, Zhurong Wu, Yuxin Ye, Zhichi He, Le Sun, and Zhou Yu. Adam-mini: Use 1/2 gpu memory of adam with better performance. arXiv preprint arXiv:2406.16793, 2024.
- [5] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.
- [6] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, volume 31, 2018.
- [7] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P. Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. In Advances in Neural Information Processing Systems, volume 31, 2018.
- [8] Levent Sagun, Utku Evci, V. U. Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.
- [9] Guy Gur-Ari, Daniel A Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018.
- [10] Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density. In International Conference on Machine Learning, pages 2232–2241. PMLR, 2019.
- [11] Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael W Mahoney. Pyhessian: Neural networks through the lens of the hessian. In 2020 IEEE International Conference on Big Data (Big Data), pages 581–590. IEEE, 2020.
- [12] Jeremy Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. In International Conference on Learning Representations (ICLR), 2021.
- [13] Minhak Song, Kwangjun Ahn, and Chulhee Yun. Does sgd really happen in tiny subspaces?, 2024.
- [14] Charles H Martin and Michael W Mahoney. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. Journal of Machine Learning Research (JMLR), 22(165):1–73, 2021.
- [15] Liam Hodgkinson, Zhichao Wang, and Michael W Mahoney. Models of heavy-tailed mechanistic universality. arXiv preprint arXiv:2506.03470, 2025.
- [16] Jeffrey Pennington and Yasaman Bahri. Geometry of neural network loss surfaces via random matrix theory. In International Conference on Machine Learning (ICML), pages 2798–2806, 2017.
- [17] Cosme Louart, Zhenyu Liao, and Romain Couillet. A random matrix approach to neural networks. The Annals of Applied Probability, 28(2):1190–1248, 2018.
- [18] Jeffrey Pennington and Pratik Worah. The spectrum of the fisher information matrix of a single-hidden-layer neural network. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017.
- [19] Vardan Papyan. Traces of class/cross-class structure in deep network spectra. Journal of Machine Learning Research, 21(252):1–64, 2020.
- [20] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. In International Conference on Learning Representations, 2019.
- [21] Sidak Pal Singh, Gregor Bachmann, and Thomas Hofmann. Analytic insights into structure and rank of neural network hessian maps. Advances in Neural Information Processing Systems, 34:23914–23927, 2021.
- [22] Blake Bordelon and Cengiz Pehlevan. Deep linear network training dynamics from random initialization: Data, width, depth, and hyperparameter transfer. In International Conference on Machine Learning, 2025.
- [23] Jim Zhao, Sidak Pal Singh, and Aurelien Lucchi. Theoretical characterisation of the gauss-newton conditioning in neural networks. In NeurIPS, 2024.
- [24] Shenyang Deng, Boyao Liao, Zhuoli Ouyang, Tianyu Pang, and Yaoqing Yang. Depth, not data: An analysis of hessian spectral bifurcation. Available on arXiv, 2026.
  A Proof of Main Theorem
  A.1 Notation and Problem Setup
  In this section, we recall the notation, problem setup, and key definitions from the main text that are essential for the subsequent theoreti...
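- [25] For $1 < k < L$, considering the dynamic behavior between step $t = 0$ and step $t = 1$:
  $$\nabla_{W^k} \mathcal{L}(W_0^k) = W_0^{k+1:L} \left( W_0^{L:1} \Sigma_{xx} - \Sigma_{yx} \right) W_0^{1:k-1} \simeq \Sigma^{(L-k)/L} U^\top \left( U \Sigma V^\top \Sigma_{xx} - \Sigma_{yx} \right) V \Sigma^{(k-1)/L} \tag{31}$$
  Under Assumption 3.1, $\Sigma_{xx} \simeq I_r$. Since the balanced initialization chooses $V$ such that its column space lies within the support of $\Sigma_{xx}$, we have $V^\top \Sigma_{xx} V = I_{d^*}$. Therefore, $V^\top \Sigma_{xx} = V^\top$ (3...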
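- [26] For $k = 1$, considering the dynamic behavior between step $t = 0$ and step $t = 1$:
  $$\nabla_{W^1} \mathcal{L}(W_0^1) = W_0^{2:L} \left( W_0^{L:1} \Sigma_{xx} - \Sigma_{yx} \right) \simeq \Sigma^{(L-1)/L} U^\top \left( U \Sigma V^\top - U V^\top \right) = \Sigma^{(L-1)/L} U^\top U \left( \Sigma - I_{d^*} \right) V^\top = \Sigma^{(L-1)/L} \left( \Sigma - I_{d^*} \right) V^\top = \left( \Sigma^{(2L-1)/L} - \Sigma^{(L-1)/L} \right) V^\top$$
  For the update rule $W_{t+1}^1 = W_t^1 - \eta \nabla_{W^1} \mathcal{L}(W_t^1)$, since $W_0^1 \simeq \Sigma^{1/L} V^\top$...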
- [27] For $k = L$, considering the dynamic behavior between step $t = 0$ and step $t = 1$:
  $$\nabla_{W^L} \mathcal{L}(W_0^L) = \left( W_0^{L:1} \Sigma_{xx} - \Sigma_{yx} \right) W_0^{1:L-1} \simeq \left( U \Sigma V^\top - U V^\top \right) V \Sigma^{(L-1)/L} = U \left( \Sigma - I_{d^*} \right) V^\top V \Sigma^{(L-1)/L} = U \left( \Sigma - I_{d^*} \right) \Sigma^{(L-1)/L} = U \left( \Sigma^{(2L-1)/L} - \Sigma^{(L-1)/L} \right)$$
  Now we analyze when the eigenvectors of $W^L$ are invariant. Note that if $d_L = d^* = d_0$, then $U \in \mathbb{R}^{d_L \times d_L}$ is a square orthogonal matrix. Since $W_0^L \simeq U\Sigma$ ...
- [28] For $1 < k < L$: $W_t^k \simeq \Sigma_t^{1/L}$
- [29] For $k = 1$: $W_t^1 \simeq \Sigma_t^{1/L} V^\top$
- [30] For $k = L$: $W_t^L \simeq U \Sigma_t^{1/L}$,
  where $U \in \mathbb{R}^{d_L \times d^*}$ and $V \in \mathbb{R}^{d_0 \times d^*}$ are the left and right singular vector matrices from the balanced initialization, which remain constant throughout training. Furthermore, the eigenvalues $\lambda_{i,t}$ for $i = 1, \dots, r$ evolve according to:
  $$\lambda_{i,t+1} = \lambda_{i,t} - \eta \lambda_{i,t}^{2L-1} + \eta \lambda_{i,t}^{L-1} \tag{37}$$
  while the eigenvalues $\lambda_{i,t}$ for $i = r+1, \dots, d^*$ remain uncha...
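The scalar recursion labeled (37) can be iterated directly to see how depth enters the dynamics. The sketch below is illustrative only: the function name, the step size `eta = 0.05`, the starting values, and the step count are assumptions, not values from the paper. The map has a fixed point at 1, and linearizing it there gives a contraction factor of `1 - eta * L`, so the step size is assumed small enough that `eta * L < 2`.

```python
def evolve(lam0, depth, eta=0.05, steps=2000):
    """Iterate lam <- lam - eta * lam**(2L-1) + eta * lam**(L-1), following (37).

    Returns the full trajectory, starting with lam0.
    """
    lam = lam0
    trajectory = [lam]
    for _ in range(steps):
        lam = lam - eta * lam ** (2 * depth - 1) + eta * lam ** (depth - 1)
        trajectory.append(lam)
    return trajectory

if __name__ == "__main__":
    for depth in (2, 4, 8):
        final = evolve(0.5, depth)[-1]
        print(depth, final)
```

Inspecting the early part of each trajectory shows how the time spent away from the fixed point changes with depth for a given starting value, which is the kind of depth dependence the surrounding derivation tracks.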