Depth, Not Data: An Analysis of Hessian Spectral Bifurcation

Boyao Liao; Shenyang Deng; Tianyu Pang; Yaoqing Yang; Zhuoli Ouyang

arxiv: 2602.00545 · v2 · pith:YFNK3DNPnew · submitted 2026-01-31 · 💻 cs.LG

Pith reviewed 2026-05-21 13:49 UTC · model grok-4.3

classification 💻 cs.LG

keywords Hessian eigenvalues · deep linear networks · spectral bifurcation · optimization landscape · network depth · bulk-and-spike structure · loss surface

The pith

Even with perfectly balanced data, the Hessian in deep linear networks develops a dominant eigenvalue cluster and a bulk cluster whose ratio scales linearly with depth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Prior work attributed the bulk-and-spike eigenvalue pattern in neural network Hessians mainly to imbalances in the data covariance matrix. This paper shows the pattern can appear from the network architecture itself. Using deep linear networks that admit exact closed-form Hessians, the authors prove that balanced data covariance still produces a clear separation between a few large eigenvalues and a cluster of smaller ones. They further show that the ratio of these two clusters grows in direct proportion to the number of layers. The result indicates that optimization methods need to account for depth-driven effects on the loss landscape in addition to data properties.

Core claim

In a deep linear network with perfectly balanced data covariance, the Hessian matrix still exhibits a bifurcation eigenvalue structure consisting of a dominant cluster and a bulk cluster. The ratio between the dominant and bulk eigenvalues scales linearly with the network depth. This establishes that the spectral gap is shaped by the network architecture independently of data distribution.

What carries the argument

The exact closed-form expression for the Hessian in deep linear networks, which isolates the depth-dependent scaling of the dominant-to-bulk eigenvalue ratio.
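
To make concrete how depth can enter such a closed form, here is a worked toy calculation. It is an illustrative sketch, not the paper's own derivation or its Eq. (12). The assumptions are chosen here for tractability: square layers, whitened inputs, a squared-error loss on the end-to-end map, and a zero-residual point at which every layer equals the same rank-r orthogonal projector P.

```latex
% Toy instance (an assumption for illustration, not the paper's setup):
% square layers W_1, ..., W_L in R^{d x d}, whitened inputs, target T.
\ell(W_1,\dots,W_L) = \tfrac{1}{2}\,\lVert W_L \cdots W_1 - T \rVert_F^2,
\qquad
\mathrm{d}(W_L \cdots W_1) = \sum_{k=1}^{L} W_{L:k+1}\,\mathrm{d}W_k\,W_{k-1:1}.

% At a zero-residual point the Hessian is H = J^\top J, where J is the
% Jacobian of vec(W_L ... W_1). Its nonzero eigenvalues are those of
J J^\top = \sum_{k=1}^{L}
  \bigl(W_{k-1:1}^\top W_{k-1:1}\bigr) \otimes \bigl(W_{L:k+1} W_{L:k+1}^\top\bigr).

% Choosing W_k = T = P (rank-r orthogonal projector) and L >= 2:
J J^\top = I \otimes P + P \otimes I + (L-2)\, P \otimes P,
\qquad
\operatorname{spec}(J J^\top) =
  \{\, L \ (\text{mult. } r^2),\;\; 1 \ (\text{mult. } 2r(d-r)),\;\; 0 \ (\text{mult. } (d-r)^2) \,\}.
```

In this toy case the top cluster has dimension r² and sits at exactly L times the next cluster, so the gap comes from summing one curvature contribution per layer. The multiplicities of the lower cluster are specific to this construction and need not match the paper's setting.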

Load-bearing premise

The analysis uses a deep linear network whose weights and data permit an exact closed-form Hessian.

What would settle it

Compute the Hessian eigenvalues for deep linear networks with identity covariance at successive depths and check whether the observed dominant-to-bulk ratio increases linearly.
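
A minimal numerical version of this check can be sketched as follows. It is a toy under stated assumptions, not the authors' experiment: all layers are square and set to the same rank-r projector, inputs are whitened, and the loss is squared error on the end-to-end map, so the point has zero residual and the Hessian equals the Gauss-Newton matrix JᵀJ. The function names `end_to_end_jacobian` and `spectrum_clusters` are hypothetical.

```python
import numpy as np


def end_to_end_jacobian(weights):
    """Jacobian of vec(W_L ... W_1) with respect to all layer entries.

    weights[0] is W_1 and weights[-1] is W_L. Uses
    d(W_L ... W_1) = sum_k A_k dW_k B_k and the column-major identity
    vec(A X B) = (B^T kron A) vec(X).
    """
    d = weights[0].shape[0]
    blocks = []
    for k in range(len(weights)):
        A = np.eye(d)
        for W in weights[k + 1:]:
            A = W @ A  # accumulates the layers applied after layer k
        B = np.eye(d)
        for W in weights[:k]:
            B = W @ B  # accumulates the layers applied before layer k
        blocks.append(np.kron(B.T, A))
    return np.hstack(blocks)


def spectrum_clusters(depth, d=4, r=2, tol=1e-8):
    """Largest and smallest nonzero Hessian eigenvalues for the toy network."""
    # Assumption: every layer equals the rank-r projector P, and the target is P,
    # so the residual is zero and the Hessian is exactly J^T J.
    P = np.diag([1.0] * r + [0.0] * (d - r))
    J = end_to_end_jacobian([P.copy() for _ in range(depth)])
    eig = np.linalg.eigvalsh(J.T @ J)
    nonzero = eig[eig > tol]
    return nonzero.max(), nonzero.min(), nonzero


if __name__ == "__main__":
    for depth in range(2, 7):
        top, bulk, _ = spectrum_clusters(depth)
        print(depth, top / bulk)
```

In this construction the printed ratio tracks the depth. Whether the same holds for the paper's balanced initialization and trained weights is what the experiment proposed above would have to establish.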

Figures

Figures reproduced from arXiv: 2602.00545 by Boyao Liao, Shenyang Deng, Tianyu Pang, Yaoqing Yang, Zhuoli Ouyang.

**Figure 1.** Eigenvalue distribution of the Hessian for a deep linear network training on whitened data. Despite the data…

**Figure 2.** Evolution of Hessian eigenvalues and training loss for deep linear networks with increasing depth

**Figures 3–16.** Eigenvalue evolution (the same caption is shared by all fourteen figures). The curves are color-coded by subspace: purple for the dominant space, orange for the bulk space, and green for the near-zero space. The dominant space has a dimension of rank², while the combined dimension of the dominant and bulk spaces equals the product of the input and output dimensions. The final eigenvalues of the dominant space converge to L times those of the bulk space. The…

Original abstract

The eigenvalue distribution of the Hessian matrix plays a crucial role in understanding the optimization landscape of deep neural networks. Prior work has attributed the well-documented "bulk-and-spike" spectral structure, where a few dominant eigenvalues are separated from a bulk of smaller ones, to the imbalance in the data covariance matrix. In this work, we challenge this view by demonstrating that such spectral Bifurcation can arise purely from the network architecture, independent of data imbalance. Specifically, we analyze a deep linear network setup and prove that, even when the data covariance is perfectly balanced, the Hessian still exhibits a Bifurcation eigenvalue structure: a dominant cluster and a bulk cluster. Crucially, we establish that the ratio between dominant and bulk eigenvalues scales linearly with the network depth. This reveals that the spectral gap is strongly affected by the network architecture rather than solely by data distribution. Our results suggest that both model architecture and data characteristics should be considered when designing optimization algorithms for deep networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes the eigenvalue spectrum of the Hessian in deep linear networks. It claims to prove that a bifurcation into a dominant eigenvalue cluster and a bulk cluster occurs even when the data covariance is the identity matrix, and that the ratio between the dominant and bulk eigenvalues scales linearly with network depth. This is used to argue that the spectral structure arises from the network architecture independently of data imbalance.

Significance. If the derivation holds, the result is significant because it supplies an exact, closed-form counterexample to the data-covariance explanation for Hessian bulk-and-spike structure, isolating depth as a sufficient cause in the linear setting. The parameter-free linear scaling with depth is a clear strength that could inform architecture-aware optimization methods. The restriction to linear networks remains a modeling choice whose implications for nonlinear networks are left for future work.

major comments (2)

[§4] Main theorem and Eq. (12): the closed-form Hessian obtained via chain-rule expansion of the quadratic loss must be shown to contain no auxiliary assumptions on the singular values of the individual layer weights or on the evaluation point; otherwise the observed depth-dependent conditioning of the end-to-end map could produce the eigenvalue separation even under balanced covariance, undermining the claim that the bifurcation is attributable purely to depth.
[Theorem 2] The linear scaling of the dominant-to-bulk ratio is derived under the assumption that the data covariance is exactly the identity; an explicit statement is needed on whether the same scaling persists when the covariance is only approximately balanced or when finite-width effects are re-introduced.

minor comments (2)

[Abstract] The abstract and introduction capitalize 'Bifurcation' inconsistently; standardize to lowercase except at sentence start.
[Figure 2] The caption should explicitly state the network depth values used for the plotted spectra.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. These have prompted us to clarify the assumptions underlying our derivations and to better delineate the scope of our results. We respond to each major comment below, indicating the changes made to the manuscript.

Referee: [§4] Main theorem and Eq. (12): the closed-form Hessian obtained via chain-rule expansion of the quadratic loss must be shown to contain no auxiliary assumptions on the singular values of the individual layer weights or on the evaluation point; otherwise the observed depth-dependent conditioning of the end-to-end map could produce the eigenvalue separation even under balanced covariance, undermining the claim that the bifurcation is attributable purely to depth.

Authors: We agree that the assumptions must be stated explicitly. The chain-rule expansion begins from the quadratic loss and differentiates the composition of linear layers without imposing extra constraints on the singular values of individual weight matrices; the only structural assumption is the balanced initialization in which the product of the weight matrices equals the identity when the data covariance is the identity. The evaluation point is the critical point at which the gradient vanishes. To address the concern, we have revised §4 to include an expanded derivation that isolates the depth-dependent multiplicative factors arising from the chain rule. We have also added a short lemma in the appendix confirming that the eigenvalue separation is preserved under small perturbations of the layer singular values, thereby showing that the bifurcation is driven by depth rather than by any auxiliary conditioning of the end-to-end map.

Revision: yes

Referee: [Theorem 2] The linear scaling of the dominant-to-bulk ratio is derived under the assumption that the data covariance is exactly the identity; an explicit statement is needed on whether the same scaling persists when the covariance is only approximately balanced or when finite-width effects are re-introduced.

Authors: Theorem 2 is proved for the exact identity covariance, which supplies the cleanest counter-example to data-imbalance explanations. For covariances that are only approximately balanced we expect the linear depth scaling to persist to leading order, because the perturbation introduced by a small deviation from the identity enters as a lower-order correction that does not cancel the dominant depth factor. We have added a remark after Theorem 2 stating this expectation and sketching the first-order perturbation argument. Finite-width effects lie outside the present analysis, which is conducted in the exact linear, infinite-width setting; incorporating them would require random-matrix techniques that are beyond the scope of the current manuscript. We have inserted a brief discussion paragraph noting this limitation and identifying it as a direction for future work.

Revision: partial
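
The expectation about approximately balanced covariance can be probed numerically in a toy case. The sketch below is not the authors' remark or proof. It assumes square layers all equal to a rank-r projector, a zero-residual point, and an input covariance of the form I + eps·S with S symmetric and of unit spectral norm. All names (`jacobian`, `perturbed_clusters`) are hypothetical. For the loss ½·tr((W − T)·Σ_xx·(W − T)ᵀ), the Hessian at zero residual is Jᵀ(Σ_xx ⊗ I)J.

```python
import numpy as np


def jacobian(weights):
    """Jacobian of vec(W_L ... W_1) w.r.t. all layers (column-major vec)."""
    d = weights[0].shape[0]
    blocks = []
    for k in range(len(weights)):
        A = np.eye(d)
        for W in weights[k + 1:]:
            A = W @ A  # layers applied after layer k
        B = np.eye(d)
        for W in weights[:k]:
            B = W @ B  # layers applied before layer k
        blocks.append(np.kron(B.T, A))
    return np.hstack(blocks)


def perturbed_clusters(depth, eps, d=4, r=2, seed=0):
    """Dominant and bulk eigenvalues under covariance I + eps * S (toy case)."""
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((d, d))
    S = (S + S.T) / 2
    S /= np.linalg.norm(S, 2)        # scale to unit spectral norm
    cov = np.eye(d) + eps * S        # approximately balanced covariance
    P = np.diag([1.0] * r + [0.0] * (d - r))
    J = jacobian([P] * depth)
    M = np.kron(cov, np.eye(d))      # weighting induced by the covariance
    eig = np.sort(np.linalg.eigvalsh(J.T @ M @ J))[::-1]
    n_dom = r * r                    # size of the top cluster in this toy
    n_bulk = 2 * r * (d - r)         # size of the second cluster in this toy
    return eig[:n_dom], eig[n_dom:n_dom + n_bulk]


if __name__ == "__main__":
    dom, bulk = perturbed_clusters(depth=4, eps=0.05)
    print(dom.mean() / bulk.mean())
```

In this toy case each nonzero eigenvalue is rescaled by a factor between 1 − eps and 1 + eps (Ostrowski's theorem applied to the congruence by the covariance factor), so the two clusters stay separated and their ratio stays within O(eps) of the depth. This says nothing about finite-width effects.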

Circularity Check

0 steps flagged

No significant circularity detected; derivation relies on direct closed-form spectral analysis rather than self-referential definitions or fitted inputs.

The paper derives the Hessian eigenvalue bifurcation and its linear scaling with depth via explicit mathematical analysis of a deep linear network under the assumption of balanced (identity) data covariance. This proceeds from the chain-rule expansion of the quadratic loss to obtain a closed-form Hessian expression, followed by direct computation of its spectrum, without re-expressing a fitted quantity as a prediction or importing uniqueness via self-citation chains. The central claim is therefore self-contained against the stated modeling assumptions and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the mathematical tractability of the deep linear network Hessian under the balanced-covariance assumption; no free parameters or new entities are introduced in the abstract.

axioms (1)

Domain assumption: A deep linear network admits an exact closed-form Hessian whose eigenvalues can be analyzed by depth.
Invoked to separate architectural effects from data imbalance.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization
cs.LG 2026-03 conditional novelty 5.0

RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724–1732. PMLR, 2017.
[2] Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, and Michael W. Mahoney. Adahessian: An adaptive second order optimizer for machine learning. In AAAI Conference on Artificial Intelligence, 2021.
[3] Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. arXiv preprint arXiv:2305.14342, 2023.
[4] Yuxin Zhang, Congliang Chen, Zingshi Li, Tian Ding, Zhurong Wu, Yuxin Ye, Zhichi He, Le Sun, and Zhou Yu. Adam-mini: Use 1/2 gpu memory of adam with better performance. arXiv preprint arXiv:2406.16793, 2024.
[5] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.
[6] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, volume 31, 2018.
[7] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P. Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. In Advances in Neural Information Processing Systems, volume 31, 2018.
[8] Levent Sagun, Utku Evci, V. U. Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.
[9] Guy Gur-Ari, Daniel A Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018.
[10] Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density. In International Conference on Machine Learning, pages 2232–2241. PMLR, 2019.
[11] Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael W Mahoney. Pyhessian: Neural networks through the lens of the hessian. In 2020 IEEE International Conference on Big Data (Big Data), pages 581–590. IEEE, 2020.
[12] Jeremy Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. In International Conference on Learning Representations (ICLR), 2021.
[13] Minhak Song, Kwangjun Ahn, and Chulhee Yun. Does sgd really happen in tiny subspaces?, 2024.
[14] Charles H Martin and Michael W Mahoney. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. Journal of Machine Learning Research (JMLR), 22(165):1–73, 2021.
[15] Liam Hodgkinson, Zhichao Wang, and Michael W Mahoney. Models of heavy-tailed mechanistic universality. arXiv preprint arXiv:2506.03470, 2025.
[16] Jeffrey Pennington and Yasaman Bahri. Geometry of neural network loss surfaces via random matrix theory. In International Conference on Machine Learning (ICML), pages 2798–2806, 2017.
[17] Cosme Louart, Zhenyu Liao, and Romain Couillet. A random matrix approach to neural networks. The Annals of Applied Probability, 28(2):1190–1248, 2018.
[18] Jeffrey Pennington and Pratik Worah. The spectrum of the fisher information matrix of a single-hidden-layer neural network. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017.
[19] Vardan Papyan. Traces of class/cross-class structure in deep network spectra. Journal of Machine Learning Research, 21(252):1–64, 2020.
[20] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. In International Conference on Learning Representations, 2019.
[21] Sidak Pal Singh, Gregor Bachmann, and Thomas Hofmann. Analytic insights into structure and rank of neural network hessian maps. Advances in Neural Information Processing Systems, 34:23914–23927, 2021.
[22] Blake Bordelon and Cengiz Pehlevan. Deep linear network training dynamics from random initialization: Data, width, depth, and hyperparameter transfer. In International Conference on Machine Learning, 2025.
[23] Jim Zhao, Sidak Pal Singh, and Aurelien Lucchi. Theoretical characterisation of the gauss-newton conditioning in neural networks. In NeurIPS, 2024.
[24] Shenyang Deng, Boyao Liao, Zhuoli Ouyang, Tianyu Pang, and Yaoqing Yang. Depth, not data: An analysis of hessian spectral bifurcation. available on arxiv, 2026.

A Proof of Main Theorem. A.1 Notation and Problem Setup. In this section, we recall the notation, problem setup, and key definitions from the main text that are essential for the subsequent theoreti...
[25] For $1 < k < L$, considering the dynamic behavior between step $t = 0$ and step $t = 1$:

$$\nabla_{W^k} \mathcal{L}(W_0^k) = W_0^{k+1:L}\,(W_0^{L:1}\Sigma_{xx} - \Sigma_{yx})\,W_0^{1:k-1} \simeq \Sigma^{(L-k)/L}\, U^\top (U \Sigma V^\top \Sigma_{xx} - \Sigma_{yx})\, V\, \Sigma^{(k-1)/L} \tag{31}$$

Under Assumption 3.1, $\Sigma_{xx} \simeq I_r$. Since the balanced initialization chooses $V$ such that its column space lies within the support of $\Sigma_{xx}$, we have $V^\top \Sigma_{xx} V = I_{d^*}$. Therefore, $V^\top \Sigma_{xx} = V^\top$ (3...

[26] For $k = 1$, considering the dynamic behavior between step $t = 0$ and step $t = 1$:

$$\nabla_{W^1} \mathcal{L}(W_0^1) = W_0^{2:L}\,(W_0^{L:1}\Sigma_{xx} - \Sigma_{yx}) \simeq \Sigma^{(L-1)/L}\, U^\top (U \Sigma V^\top - U V^\top) = \Sigma^{(L-1)/L}\, U^\top U (\Sigma - I_{d^*}) V^\top = \Sigma^{(L-1)/L} (\Sigma - I_{d^*}) V^\top = \bigl(\Sigma^{(2L-1)/L} - \Sigma^{(L-1)/L}\bigr) V^\top$$

For the update rule $W_{t+1}^1 = W_t^1 - \eta \nabla_{W^1} \mathcal{L}(W_t^1)$, since $W_0^1 \simeq \Sigma^{1/L} V^\top$...

[27] For $k = L$, considering the dynamic behavior between step $t = 0$ and step $t = 1$:

$$\nabla_{W^L} \mathcal{L}(W_0^L) = (W_0^{L:1}\Sigma_{xx} - \Sigma_{yx})\,W_0^{1:L-1} \simeq (U \Sigma V^\top - U V^\top)\, V \Sigma^{(L-1)/L} = U (\Sigma - I_{d^*}) V^\top V \Sigma^{(L-1)/L} = U (\Sigma - I_{d^*}) \Sigma^{(L-1)/L} = U \bigl(\Sigma^{(2L-1)/L} - \Sigma^{(L-1)/L}\bigr)$$

Now we analyze when the eigenvectors of $W^L$ are invariant. Note that if $d_L = d^* = d_0$, then $U \in \mathbb{R}^{d_L \times d_L}$ is a square orthogonal matrix. Since $W_0^L \simeq U \Sigma$ ...

[28] For $1 < k < L$: $W_t^k \simeq \Sigma_t^{1/L}$

[29] For $k = 1$: $W_t^1 \simeq \Sigma_t^{1/L} V^\top$

[30] For $k = L$: $W_t^L \simeq U \Sigma_t^{1/L}$

where $U \in \mathbb{R}^{d_L \times d^*}$ and $V \in \mathbb{R}^{d_0 \times d^*}$ are the left and right singular vector matrices from the balanced initialization, which remain constant throughout training. Furthermore, the eigenvalues $\lambda_{i,t}$ for $i = 1, \dots, r$ evolve according to:

$$\lambda_{i,t+1} = \lambda_{i,t} - \eta \lambda_{i,t}^{2L-1} + \eta \lambda_{i,t}^{L-1} \tag{37}$$

while the eigenvalues $\lambda_{i,t}$ for $i = r+1, \dots, d^*$ remain uncha...
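
The scalar recursion quoted in Eq. (37) of the excerpt above can be iterated directly. The sketch below is a minimal illustration with hypothetical names and arbitrarily chosen values for the step size, depth, and starting point; it is not code from the paper. It covers only the update for the eigenvalues that Eq. (37) governs.

```python
def evolve_eigenvalue(lam0, depth, lr, steps):
    """Iterate lam <- lam - lr * lam**(2L-1) + lr * lam**(L-1), as in Eq. (37)."""
    lam = lam0
    for _ in range(steps):
        lam = lam - lr * lam ** (2 * depth - 1) + lr * lam ** (depth - 1)
    return lam


if __name__ == "__main__":
    # Illustrative values only (assumptions, not taken from the paper).
    for depth in (2, 3, 4):
        print(depth, evolve_eigenvalue(0.5, depth, lr=0.1, steps=500))
```

The map has a fixed point at λ = 1, and its derivative there is 1 − ηL, so small deviations from the fixed point shrink by a factor that depends linearly on depth. This is consistent with a curvature that grows in proportion to L, though this scalar check alone does not establish the Hessian result.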
