Depth, Not Data: An Analysis of Hessian Spectral Bifurcation

Boyao Liao; Shenyang Deng; Tianyu Pang; Yaoqing Yang; Zhuoli Ouyang

arxiv: 2602.00545 · v3 · pith:YFNK3DNPnew · submitted 2026-01-31 · 💻 cs.LG

Pith reviewed 2026-05-21 13:49 UTC · model grok-4.3

classification 💻 cs.LG

keywords Hessian eigenvalues · deep linear networks · spectral bifurcation · optimization landscape · network depth · bulk-and-spike structure · loss surface

The pith

Even with perfectly balanced data, the Hessian in deep linear networks develops a dominant eigenvalue cluster and a bulk cluster whose ratio scales linearly with depth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Prior work attributed the bulk-and-spike eigenvalue pattern in neural network Hessians mainly to imbalances in the data covariance matrix. This paper shows the pattern can appear from the network architecture itself. Using deep linear networks that admit exact closed-form Hessians, the authors prove that balanced data covariance still produces a clear separation between a few large eigenvalues and a cluster of smaller ones. They further show that the ratio of these two clusters grows in direct proportion to the number of layers. The result indicates that optimization methods need to account for depth-driven effects on the loss landscape in addition to data properties.

Core claim

In a deep linear network with perfectly balanced data covariance, the Hessian still exhibits a bifurcated eigenvalue structure consisting of a dominant cluster and a bulk cluster. The ratio between the dominant and bulk eigenvalues scales linearly with the network depth. This establishes that the spectral gap is shaped by the network architecture independently of the data distribution.

What carries the argument

The exact closed-form expression for the Hessian in deep linear networks, which isolates the depth-dependent scaling of the dominant-to-bulk eigenvalue ratio.
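To make the mechanism concrete, here is a hedged worked example. It is not the paper's Eq. (12). It assumes a toy configuration chosen for this illustration: square layers of width $d$, whitened inputs (identity covariance), a target $M = \Pi$ that is a rank-$r$ orthogonal projector, and every layer set to $W_l = \Pi$, so the product equals the target and the residual is zero. The symbols $P_l$, $Q_l$, $J$, and $\Pi$ are introduced here and may not match the paper's notation.

```latex
% Assumed toy setting (not taken from the paper): square layers of width d,
% whitened inputs, target M = \Pi (rank-r orthogonal projector), W_l = \Pi, L >= 2.
\begin{align*}
\mathcal{L}(W_1,\dots,W_L) &= \tfrac12 \,\bigl\| W_L \cdots W_1 - M \bigr\|_F^2, \\
d(W_L \cdots W_1) &= \sum_{l=1}^{L} P_l \, dW_l \, Q_l,
  \qquad P_l = W_L \cdots W_{l+1}, \quad Q_l = W_{l-1} \cdots W_1, \\
H &= J^\top J \quad \text{(zero residual)}, \qquad
JJ^\top = \sum_{l=1}^{L} \bigl(Q_l^\top Q_l\bigr) \otimes \bigl(P_l P_l^\top\bigr), \\
JJ^\top &= I \otimes \Pi \;+\; (L-2)\, \Pi \otimes \Pi \;+\; \Pi \otimes I
  \quad \text{when } W_l = \Pi \text{ for all } l .
\end{align*}
% Eigenvalues of JJ^T (shared with the nonzero spectrum of H = J^T J):
%   L  on range(\Pi) \otimes range(\Pi), multiplicity r^2;
%   1  on the 2r(d-r) mixed directions;  0 elsewhere.
```

In this toy case the dominant-to-bulk ratio is exactly $L$, and the dominant multiplicity is $r^2$. The bulk multiplicity $2r(d-r)$ is specific to this configuration and need not match the setting of the paper's theorems.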

Load-bearing premise

The analysis uses a deep linear network whose weights and data permit an exact closed-form Hessian.

What would settle it

Compute the Hessian eigenvalues for deep linear networks with identity covariance at successive depths and check whether the observed dominant-to-bulk ratio increases linearly.
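The check described above can be sketched numerically. The following is a minimal sketch under assumptions that are not taken from the paper: square layers of width `d`, whitened inputs so that the loss is `0.5 * ||W_L ... W_1 - M||_F^2`, a rank-`r` projector as target, and every layer equal to that projector (a zero-residual point, where the Hessian reduces to the Gauss-Newton term `J^T J`). The function names `product_jacobian`, `spectrum_at_minimum`, and `dominant_to_bulk_ratio` are hypothetical helpers for this illustration.

```python
import numpy as np


def product_jacobian(weights):
    """Jacobian of vec(W_L ... W_1) w.r.t. all layer weights (row-major vec).

    `weights` is [W_1, ..., W_L]. Layer l contributes P_l dW_l Q_l, where P_l
    is the product of the later layers and Q_l the product of the earlier ones.
    """
    blocks = []
    for l, W_l in enumerate(weights):
        P = np.eye(W_l.shape[0])
        for W in weights[l + 1:]:
            P = W @ P
        Q = np.eye(W_l.shape[1])
        for W in weights[:l]:
            Q = W @ Q
        # Row-major identity: vec(P X Q) = kron(P, Q^T) vec(X).
        blocks.append(np.kron(P, Q.T))
    return np.hstack(blocks)


def spectrum_at_minimum(depth, d=4, r=2):
    """Return (dominant, bulk) Hessian eigenvalues for the assumed toy setup."""
    # Assumption: rank-r projector target, every layer equal to the projector.
    proj = np.diag([1.0] * r + [0.0] * (d - r))
    weights = [proj.copy() for _ in range(depth)]
    J = product_jacobian(weights)
    # Zero residual at this point, so the Hessian equals J^T J.
    eig = np.linalg.eigvalsh(J.T @ J)
    nonzero = eig[eig > 1e-8]
    # Split the nonzero spectrum at the midpoint between its extremes.
    cut = 0.5 * (nonzero.max() + nonzero.min())
    return nonzero[nonzero > cut], nonzero[nonzero <= cut]


def dominant_to_bulk_ratio(depth, d=4, r=2):
    dominant, bulk = spectrum_at_minimum(depth, d, r)
    return dominant.mean() / bulk.mean()


for depth in (2, 3, 4, 6, 8):
    print(depth, float(dominant_to_bulk_ratio(depth)))
```

A fuller test of the claim would repeat this at points reached by actual training and with the layer shapes used in the paper; the sketch only confirms that a linear-in-depth ratio appears in one hand-picked configuration.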

Figures

Figures reproduced from arXiv: 2602.00545 by Boyao Liao, Shenyang Deng, Tianyu Pang, Yaoqing Yang, Zhuoli Ouyang.

**Figure 1.** Eigenvalue distribution of the Hessian for a deep linear network training on whitened data. Despite the data…

**Figure 2.** Evolution of Hessian eigenvalues and training loss for deep linear networks with increasing depth

**Figures 3–16.** Eigenvalue evolution. The curves are color-coded by subspace: purple for the dominant space, orange for the bulk space, and green for the near-zero space. The dominant space has a dimension of rank², while the combined dimension of the dominant and bulk spaces equals the product of the input and output dimensions. The final eigenvalues of the dominant space converge to L times those of the bulk space. The…

Original abstract

The eigenvalue distribution of the Hessian matrix plays a crucial role in understanding the optimization landscape of deep neural networks. Prior work has attributed the well-documented "bulk-and-spike" spectral structure, where a few dominant eigenvalues are separated from a bulk of smaller ones, to the imbalance in the data covariance matrix. In this work, we challenge this view by demonstrating that such spectral Bifurcation can arise purely from the network architecture, independent of data imbalance. Specifically, we analyze a deep linear network setup and prove that, even when the data covariance is perfectly balanced, the Hessian still exhibits a Bifurcation eigenvalue structure: a dominant cluster and a bulk cluster. Crucially, we establish that the ratio between dominant and bulk eigenvalues scales linearly with the network depth. This reveals that the spectral gap is strongly affected by the network architecture rather than solely by data distribution. Our results suggest that both model architecture and data characteristics should be considered when designing optimization algorithms for deep networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes the eigenvalue spectrum of the Hessian in deep linear networks. It claims to prove that a bifurcation into a dominant eigenvalue cluster and a bulk cluster occurs even when the data covariance is the identity matrix, and that the ratio between the dominant and bulk eigenvalues scales linearly with network depth. This is used to argue that the spectral structure arises from the network architecture independently of data imbalance.

Significance. If the derivation holds, the result is significant because it supplies an exact, closed-form counterexample to the data-covariance explanation for Hessian bulk-and-spike structure, isolating depth as a sufficient cause in the linear setting. The parameter-free linear scaling with depth is a clear strength that could inform architecture-aware optimization methods. The restriction to linear networks remains a modeling choice whose implications for nonlinear networks are left for future work.

major comments (2)

[§4] §4, main theorem and Eq. (12): the closed-form Hessian obtained via chain-rule expansion of the quadratic loss must be shown to contain no auxiliary assumptions on the singular values of the individual layer weights or on the evaluation point; otherwise the observed depth-dependent conditioning of the end-to-end map could produce the eigenvalue separation even under balanced covariance, undermining the claim that the bifurcation is attributable purely to depth.
[Theorem 2] Theorem 2: the linear scaling of the dominant-to-bulk ratio is derived under the assumption that the data covariance is exactly the identity; an explicit statement is needed on whether the same scaling persists when the covariance is only approximately balanced or when finite-width effects are re-introduced.

minor comments (2)

[Abstract] The abstract and introduction capitalize 'Bifurcation' inconsistently; standardize to lowercase except at sentence start.
[Figure 2] Figure 2 caption should explicitly state the network depth values used for the plotted spectra.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. These have prompted us to clarify the assumptions underlying our derivations and to better delineate the scope of our results. We respond to each major comment below, indicating the changes made to the manuscript.

Referee: [§4] §4, main theorem and Eq. (12): the closed-form Hessian obtained via chain-rule expansion of the quadratic loss must be shown to contain no auxiliary assumptions on the singular values of the individual layer weights or on the evaluation point; otherwise the observed depth-dependent conditioning of the end-to-end map could produce the eigenvalue separation even under balanced covariance, undermining the claim that the bifurcation is attributable purely to depth.

Authors: We agree that the assumptions must be stated explicitly. The chain-rule expansion begins from the quadratic loss and differentiates the composition of linear layers without imposing extra constraints on the singular values of individual weight matrices; the only structural assumption is the balanced initialization in which the product of the weight matrices equals the identity when the data covariance is the identity. The evaluation point is the critical point at which the gradient vanishes. To address the concern, we have revised §4 to include an expanded derivation that isolates the depth-dependent multiplicative factors arising from the chain rule. We have also added a short lemma in the appendix confirming that the eigenvalue separation is preserved under small perturbations of the layer singular values, thereby showing that the bifurcation is driven by depth rather than by any auxiliary conditioning of the end-to-end map.
revision: yes

Referee: [Theorem 2] Theorem 2: the linear scaling of the dominant-to-bulk ratio is derived under the assumption that the data covariance is exactly the identity; an explicit statement is needed on whether the same scaling persists when the covariance is only approximately balanced or when finite-width effects are re-introduced.

Authors: Theorem 2 is proved for the exact identity covariance, which supplies the cleanest counterexample to data-imbalance explanations. For covariances that are only approximately balanced, we expect the linear depth scaling to persist to leading order, because a small deviation from the identity enters as a lower-order correction that does not cancel the dominant depth factor. We have added a remark after Theorem 2 stating this expectation and sketching the first-order perturbation argument. Finite-width effects lie outside the present analysis, which is conducted in the exact linear, infinite-width setting; incorporating them would require random-matrix techniques that are beyond the scope of the current manuscript. We have inserted a brief discussion paragraph noting this limitation and identifying it as a direction for future work.
revision: partial
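The expectation about approximately balanced covariance can be probed numerically. The sketch below is not the authors' first-order perturbation argument. It assumes a toy setup chosen for this illustration: square layers of width `d`, a rank-`r` projector as target, every layer equal to that projector, and a covariance `sigma = I + eps * S` with `S` a random symmetric matrix of spectral norm 1. The helper names `product_jacobian` and `ratio_under_covariance` are hypothetical.

```python
import numpy as np


def product_jacobian(weights):
    """Jacobian of vec(W_L ... W_1) w.r.t. all layer weights (row-major vec)."""
    blocks = []
    for l, W_l in enumerate(weights):
        P = np.eye(W_l.shape[0])
        for W in weights[l + 1:]:
            P = W @ P  # product of the layers after layer l
        Q = np.eye(W_l.shape[1])
        for W in weights[:l]:
            Q = W @ Q  # product of the layers before layer l
        blocks.append(np.kron(P, Q.T))  # vec(P X Q) = kron(P, Q^T) vec(X)
    return np.hstack(blocks)


def ratio_under_covariance(depth, eps, d=4, r=2, seed=0):
    """Dominant-to-bulk ratio when the input covariance is I + eps * S."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d, d))
    S = 0.5 * (A + A.T)
    S /= np.linalg.norm(S, 2)    # scale to spectral norm 1
    sigma = np.eye(d) + eps * S  # eigenvalues lie in [1 - eps, 1 + eps]

    # Assumption: rank-r projector target, every layer equal to the projector.
    proj = np.diag([1.0] * r + [0.0] * (d - r))
    J = product_jacobian([proj.copy() for _ in range(depth)])

    # Loss 0.5 * tr((W - M) sigma (W - M)^T): in row-major vec the curvature
    # in output space is kron(I, sigma). Zero residual gives H = J^T C J.
    C = np.kron(np.eye(d), sigma)
    eig = np.linalg.eigvalsh(J.T @ C @ J)
    nonzero = eig[eig > 1e-8]
    cut = 0.5 * (nonzero.max() + nonzero.min())
    return nonzero[nonzero > cut].mean() / nonzero[nonzero <= cut].mean()


for eps in (0.0, 0.01, 0.05):
    print(eps, [float(ratio_under_covariance(L, eps)) for L in (2, 4, 8)])
```

In this configuration each nonzero eigenvalue is rescaled by a factor between `1 - eps` and `1 + eps`, so the ratio of cluster means stays within roughly `2 * eps` (relative) of the depth. Whether the same holds in the paper's general setting is a question for the added remark, not something this sketch establishes.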

Circularity Check

0 steps flagged

No significant circularity detected; derivation relies on direct closed-form spectral analysis rather than self-referential definitions or fitted inputs.

The paper derives the Hessian eigenvalue bifurcation and its linear scaling with depth via explicit mathematical analysis of a deep linear network under the assumption of balanced (identity) data covariance. This proceeds from the chain-rule expansion of the quadratic loss to obtain a closed-form Hessian expression, followed by direct computation of its spectrum, without re-expressing a fitted quantity as a prediction or importing uniqueness via self-citation chains. The central claim is therefore self-contained against the stated modeling assumptions and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the mathematical tractability of the deep linear network Hessian under the balanced-covariance assumption; no free parameters or new entities are introduced in the abstract.

axioms (1)

Domain assumption: a deep linear network admits an exact closed-form Hessian whose eigenvalues can be analyzed as a function of depth.
Invoked to separate architectural effects from data imbalance.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization
cs.LG 2026-03 conditional novelty 5.0

RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.